AutoML for Text, Image, and Tabular Data
github.com/awslabs/autogluon
Jonas Mueller (jonasmue@amazon.com)
Amazon Web Services
AutoGluon Overview
• Enables easy-to-use & highly accurate supervised learning (AutoML) for tabular, image, and text data
• Open source & easy to extend (add custom models to the set AutoGluon trains/tunes/ensembles)
• AutoML that is highly customizable/controllable. You can specify: which metric AutoGluon optimizes for, how long training should take, the trade-off between inference latency and accuracy, how many CPU/GPU compute resources to use, which model types / hyperparameter values to consider, ... (see the sketch below)
• Leverages automatic hyperparameter tuning, model selection/ensembling, neural architecture search, and data preprocessing (the best techniques for your dataset are applied automatically)
http://auto.gluon.ai
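For example, a customized fit() call might look like the following sketch. It uses the legacy task API shown later in this talk; argument names such as time_limits changed in subsequent AutoGluon releases, so treat them as illustrative:

from autogluon import TabularPrediction as task

predictor = task.fit(
    train_data="train.csv",
    label="class",
    eval_metric="roc_auc",    # which metric AutoGluon optimizes for
    time_limits=3600,         # total training-time budget, in seconds
    presets="best_quality",   # vs. e.g. presets favoring small/fast models
)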
What you can achieve with AutoGluon
• AutoGluon produces better models than those built by someone lacking ML expertise or the time to dive deep into modeling
• Quickly see what accuracy is achievable on your raw data with just a couple lines of Python code
• Painlessly deploy trained models (and higher-accuracy model ensembles)
• Flexibility: provided presets let you specify how much you care about optimizing for accuracy at all costs, reducing inference time, reducing model size, ...
• Easy-to-use interpretability methods (see the sketch below) tell you:
  - how predictive each variable is (permutation feature importance)
  - why AutoGluon made a particular prediction (Shapley values)
• Easily improve any existing bespoke modeling pipeline with advanced hyperparameter optimization (HPO)
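A minimal interpretability sketch: permutation feature importance is exposed directly on the trained predictor (method availability and signatures vary across AutoGluon versions, so treat this as illustrative; Shapley values are typically computed with the external shap package rather than an AutoGluon API):

# Assumes `predictor` was produced by the 3-line task.fit() example on the
# next slide; passing a file path here mirrors that example's usage.
importance = predictor.feature_importance("test.csv")  # permutation importance
print(importance)  # drop in the eval metric when each column is shuffled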
AutoGluon is easy to use
• Only 3 lines of code to train and use a model. For a structured dataset of raw values stored in a file train.csv, with the label values to predict stored in a column named class, an AutoGluon model can be trained and tested as follows:

from autogluon import TabularPrediction as task
predictor = task.fit("train.csv", label="class")
predictions = predictor.predict("test.csv")

fit() automatically does the following:
1. preprocesses the raw data (identifies the type of each feature)
2. identifies what type of prediction problem this is (binary/multi-class classification or regression)
3. splits the data appropriately (e.g. training/validation sets, k-fold split)
4. individually trains/tunes various models (RandomForest/ExtraTrees, KNN, NeuralNet, GBDT)
5. assembles an ensemble that outperforms all the individual models
Built-in Prediction Tasks
Tabular prediction, suitable for:
• classification & regression
• raw CSV/Parquet files
• tables with text/string fields, date-times, missing values
Image classification, suitable for:
• raw image files (JPG, PNG, etc.)
• images of various sizes
• smaller labeled datasets
Text prediction, suitable for:
• classification & regression + many NLP tasks (GLUE/SuperGLUE)
• raw datasets with multiple text fields
• smaller labeled datasets
Tabular Data Benchmarks
1. OpenML AutoML Benchmark (39 datasets, 10 train/test folds each)
   • https://openml.github.io/automlbenchmark/
   • Curated binary and multiclass classification tasks for assessing AutoML
2. Kaggle Benchmark (datasets from 11 prediction competitions)
   • Represents realistic modern-day applications of supervised learning
   • Regression, binary, and multiclass classification tasks
   • Many different evaluation metrics (tailored to the business use-case of each application)
Reference: AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data. arXiv, 2020.
Comparing AutoML tools: Kaggle Competitions
[Figure: percentile rank on each competition's leaderboard (0.0–1.0) achieved by auto-sklearn, TPOT, Auto-WEKA, H2O AutoML, GCP-Tables, and AutoGluon across the 11 Kaggle competitions: bnp-paribas, santander-trans..., santander-satis..., porto-seguro, ieee-fraud, walmart-recruit..., otto-group, house-prices, allstate-claims, mercedes-benz, santander-value]
AutoGluon outperformed 99% of data science teams after training for 4h on raw data from the otto and bnp-paribas competitions.
Reference: AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data. arXiv, 2020.
Otto-Group Kaggle Competition
• In 4 hours (on an m5.2xlarge CPU instance), AutoGluon achieved a score that took the best human team over 1 week to beat on Kaggle.
• This competition (Otto) had over 3,500 competing teams.
[Figure: leaderboard scores over time, with AutoGluon's 4h score marked]
Comparing AutoML tools: AutoML Benchmark
[Figure: loss on test data relative to AutoGluon (log scale, 0.1–20) for auto-sklearn, TPOT, Auto-WEKA, H2O AutoML, GCP-Tables, and AutoGluon on each of the 39 AutoML Benchmark datasets]
Reference: AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data. arXiv, 2020.
Comparing AutoML tools vs AutoGluon
Table S8. Comparison of the AutoML frameworks on the AutoML Benchmark, evaluating the accuracy metric (with a 1h training-time limit). Rescaled misclassification rate is calculated in the same manner as rescaled loss, but applied specifically to the accuracy metric. Here, we additionally compare with the commercial SageMaker AutoPilot framework. All frameworks were optimized on AUC except for AutoPilot, which was optimized on accuracy. This demonstrates that while AutoGluon performs best on the primary evaluation metric, it also performs favourably on secondary metrics such as accuracy, even compared to AutoPilot, which optimized directly for accuracy.

Framework      Wins  Losses  Failures  Champion  Avg. Rank  Avg. Rescaled Misclass. Rate  Avg. Time (min)
AutoGluon         0       0         0        17     1.8421                        0.0509               56
GCP-Tables        6      15        13         5     2.9211                        0.1973               83
auto-sklearn      7      22         5         3     3.7105                        0.2506               60
H2O AutoML        5      27         1         1     4.0263                        0.3198               58
TPOT              5      26         4         3     4.6053                        0.4102               67
AutoPilot         2      23        12         0     5.1842                        0.4937               58
Auto-WEKA         6      27         4         4     5.7105                        0.7212               60

Results for AutoML Benchmark: 39 datasets total, 1-hour time limit imposed on the AutoML systems.
More benchmarks: AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data. arXiv, 2020.
AutoGluon vs other AutoML for Tabular Data
• Many existing AutoML frameworks:
  Open-source: Auto-WEKA, auto-sklearn, TPOT, H2O, Auto-Keras, auto-xgboost, hyperopt-sklearn, NNI
  Cloud: GCP AutoML Tables, SageMaker AutoPilot, Azure ML, DataRobot, Darwin AutoML
• Prior work treats AutoML as model + hyperparameter selection
• AutoGluon instead relies on strategies that win prediction contests:
  1. model ensembling via multi-layer stacking
  2. thoughtful data preprocessing
  3. repeated data splitting (bagging)
  4. greater emphasis on modern deep learning techniques
AutoGluon and AutoML
To obtain these results:
• AutoGluon did not do any hyperparameter tuning (results improve further with AutoGluon's built-in HPO)
• AutoGluon used the same hyperparameters for every dataset (factory-default hyperparameters for each model)
• AutoGluon was trained using the same 3 lines of code for every dataset (the fit/predict snippet shown earlier)
Multi-Layer Stack Ensembling
• A stacker model uses the predictions of every base model as extra features
• A layer L+1 stacker model uses the layer L predictions as extra features
• For simplicity: stacker model types = base model types
NOTE: Stackers must be trained with held-out predictions of the lower-layer models (see the sketch below)
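A minimal sketch of one stacking layer, using scikit-learn for illustration (the model choices and helper name here are assumptions for the sketch, not AutoGluon's internal code):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def fit_stack_layer(base_models, X, y):
    # Held-out (out-of-fold) predictions: each base model predicts on data it
    # never saw during training, so the stacker does not learn from leakage.
    oof_preds = [cross_val_predict(m, X, y, cv=5, method="predict_proba")
                 for m in base_models]
    # Stacker features = original features + every base model's predictions.
    X_stack = np.hstack([X] + oof_preds)
    for m in base_models:
        m.fit(X, y)  # refit each base model on all data for inference time
    return LogisticRegression(max_iter=1000).fit(X_stack, y)

At prediction time, the base models' outputs on new data are appended to its features before calling the stacker; layer L+1 repeats the same recipe on layer L's outputs.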
Bagging (Bootstrap Aggregation)
Train k different copies of the model, with a different chunk of the data held out from each (see the sketch below).
Reference: Van der Laan et al. Super Learner. Statistical Applications in Genetics and Molecular Biology, 2007.
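A minimal k-fold bagging sketch (illustrative scikit-learn code, not AutoGluon's implementation): each copy is trained with a different fold held out, and the k copies' predictions are averaged at inference time.

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def fit_bagged(model, X, y, k=5):
    # Train k copies, each missing one fold; the held-out folds also supply
    # the out-of-fold predictions needed to train stacker models safely.
    folds = KFold(n_splits=k, shuffle=True, random_state=0).split(X)
    return [clone(model).fit(X[tr], y[tr]) for tr, _ in folds]

def predict_bagged(fold_models, X_new):
    # Average the probabilistic predictions of all k fold-models.
    return np.mean([m.predict_proba(X_new) for m in fold_models], axis=0)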
Hyperparameter Optimization (HPO)
AutoGluon offers advanced HPO for your models
• No change to your code: just add a Python decorator above your train_evaluate() function (see the sketch below)
• Random/grid search, or superior proposals for the next hyperparameters to try via Bayesian optimization
• Early stopping: multi-fidelity HPO to skip non-promising hyperparameters (asynchronous Hyperband, BOHB)
• Easily distribute HPO jobs across multiple machines (just provide IP addresses)
Reference: Klein et al. Model-based Asynchronous Hyperparameter and Neural Architecture Search. arXiv, 2020.
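A sketch of the decorator-based HPO workflow (API names as in the legacy autogluon package of this era; they moved and changed in later releases, so treat this as illustrative):

import autogluon as ag

@ag.args(
    lr=ag.space.Real(1e-4, 1e-1, log=True),  # searchable hyperparameters
    wd=ag.space.Real(1e-6, 1e-2, log=True),
    epochs=10,                                # fixed (non-searched) argument
)
def train_evaluate(args, reporter):
    for epoch in range(1, args.epochs + 1):
        # ... train one epoch with args.lr, args.wd, then validate ...
        val_acc = 0.0  # placeholder: compute validation accuracy here
        reporter(epoch=epoch, accuracy=val_acc)  # enables early stopping

# Asynchronous Hyperband scheduler: stops non-promising trials early.
scheduler = ag.scheduler.HyperbandScheduler(
    train_evaluate,
    resource={"num_cpus": 4, "num_gpus": 0},
    num_trials=20,
    reward_attr="accuracy",
    time_attr="epoch",
)
scheduler.run()
scheduler.join_jobs()
print(scheduler.get_best_config())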
Hyperparameter Tuning Drawbacks
• Requires many repeated model-training runs
• Inefficient: even state-of-the-art Bayesian optimization methods are not much better than random search
• Most of the models being trained are thrown away and don't contribute to the final result
• The more hyperparameter tuning is done, the more the final model overfits the validation data
• Strategies like cross-validation help, but make tuning take even longer
• Less helpful when the model being tuned is part of a model ensemble
AutoGluon performance reliably improves with longer training times
Accelerate Inference via Distillation
• Ensemble = slow for prediction, with substantial memory/disk requirements
• Single model = faster, cheaper, easier to maintain; hardware accelerators exist
• Distillation strategy: student = single model, teacher = ensemble. Train the student on a dataset where the labels = predictions from the teacher.
• We consider four types of students: Neural Network, Random Forest, LightGBM, CatBoost
• Augment the training set with synthetic data so the student better learns to closely mimic the teacher (see the sketch below)
Reference: Fast, Accurate, and Simple Models for Tabular Data via Augmented Distillation. NeurIPS, 2020.
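The generic recipe looks roughly like this sketch (toy scikit-learn stand-ins for the teacher and student; the Gaussian-noise augmentation here is only a placeholder for the Gibbs-sampling augmentation described next):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingRegressor, RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Teacher: stand-in for AutoGluon's large stack ensemble.
teacher = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

# Augment the training set so the student sees more of the input space.
X_aug = np.vstack([X, X + 0.05 * np.random.randn(*X.shape)])  # toy augmentation
soft_labels = teacher.predict_proba(X_aug)[:, 1]  # teacher's soft predictions

# Student: one fast model trained to regress the teacher's probabilities.
student = GradientBoostingRegressor().fit(X_aug, soft_labels)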
Augmenting Dataset with Synthetic Features
• Multivariate generative modeling is hard
• Estimating p(x_i | x_-i) is easier (a univariate conditional distribution)
• Use Gibbs sampling to generate new feature values for augmentation: iterate the MCMC chain, resampling one feature at a time from its conditional
• Initialize each Gibbs sampling chain at a real training datapoint
• The number of Gibbs sampling rounds controls the distance between the augmented-data distribution and the training-data distribution
(a toy sketch follows below)
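A toy sketch of the augmentation loop. It assumes continuous features and uses a Gaussian-around-a-regression-fit stand-in for each conditional p(x_i | x_-i); the actual method models these conditionals with the Transformer described next:

import numpy as np
from sklearn.linear_model import LinearRegression

def gibbs_augment(X, n_rounds=1, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # One conditional model per feature: predict x_i from all other features.
    models, resid_sd = [], []
    for i in range(d):
        mask = np.arange(d) != i
        m = LinearRegression().fit(X[:, mask], X[:, i])
        models.append(m)
        resid_sd.append(np.std(X[:, i] - m.predict(X[:, mask])))
    X_new = X.copy()  # initialize each chain at a real training datapoint
    for _ in range(n_rounds):  # more rounds => samples drift further from data
        for i in range(d):     # resample one feature at a time (Gibbs step)
            mask = np.arange(d) != i
            mean = models[i].predict(X_new[:, mask])
            X_new[:, i] = mean + resid_sd[i] * rng.standard_normal(n)
    return X_new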
Transformer as Generative Model
• Use one Transformer model to simultaneously model all conditionals p(x_i | x_-i)
• Want the model to be agnostic to the ordering of features
• Train the Transformer via maximum pseudo-likelihood: maximize sum_i log p(x_i | x_-i) over the training data
• For continuous features, output layer = mixture of Gaussians
• Use masking to omit information about feature i (like BERT)
• Use positional encoding so the Transformer remembers which value corresponds to which feature
(a toy sketch of the masked objective follows below)
References: Vaswani et al. Attention Is All You Need. NeurIPS, 2017. Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL, 2019.
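A toy sketch of the masked pseudo-likelihood objective (a small MLP with a squared-error head stands in for the Transformer with a mixture-of-Gaussians output layer; the BERT-style masking pattern is the point being illustrated):

import torch
import torch.nn as nn

d = 8  # number of features (illustrative)
net = nn.Sequential(nn.Linear(2 * d, 64), nn.ReLU(), nn.Linear(64, d))

def pseudo_likelihood_loss(x):  # x: (batch, d) continuous features
    loss = 0.0
    for i in range(d):
        keep = torch.ones_like(x)
        keep[:, i] = 0.0                          # mask out feature i (like BERT)
        inp = torch.cat([x * keep, keep], dim=1)  # mask flags mark which slot is hidden
        pred = net(inp)[:, i]                     # predict the masked feature
        loss = loss + ((pred - x[:, i]) ** 2).mean()  # Gaussian log-lik, up to constants
    return loss  # = -sum_i log p(x_i | x_-i), up to constants

loss = pseudo_likelihood_loss(torch.randn(32, d))  # usage on a random batch
loss.backward()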
Gibbs Sampling Examples
Left to right:
1. Original training data
2. Samples after 1 round of Gibbs sampling with random initialization
3. Samples after many Gibbs rounds with random initialization
4. Samples after 1 round of Gibbs sampling with data initialization
Gibbs sampling outperforms other augmentation
Distillation: Accuracy vs Inference Latency
[Figure panels: (A) Regression (9 datasets), (B) Binary classification (11 datasets), (C) Multiclass classification (9 datasets)]
TEACHER = AutoGluon with presets='best_quality' (large stack ensemble)
GIB-1 = individual models distilled using AutoGluon
Red/green dots = best of {NeuralNet, LightGBM, CatBoost, RandomForest}, selected by validation score
Image Classification with AutoGluon
Dataset format: each folder's name = a class, and the files in that folder = images from that class (see the sketch below).
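A sketch with the legacy image task API (mirroring the tabular example earlier; the module path and fit arguments changed in later AutoGluon versions, so treat them as illustrative):

from autogluon import ImageClassification as task

dataset = task.Dataset("train_images/")          # one sub-folder per class
classifier = task.fit(dataset, time_limits=600)  # 10-minute budget
test_acc = classifier.evaluate(task.Dataset("test_images/"))
print(test_acc)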
AutoGluon fit() for Image/Text Data
1. Preprocess raw data & split into training/validation sets
2. (Optional) Apply appropriate data augmentation
3. (Optional) Load an appropriate pretrained network from a model zoo (GluonCV/GluonNLP: ResNet for classifying images, YOLO for object detection, BERT/ELECTRA for text)
4. Adapt the network architecture to the user's prediction task
5. Train the neural net on the data (optionally on multiple GPUs)
6. Repeatedly train under different hyperparameter configurations to find the best hyperparameters (optionally over many machines)
7. (Optional) Retrain models on training+validation data and construct a more accurate model ensemble
Reference: Dive into Deep Learning Textbook (http://d2l.ai/)
AutoGluon Image Classification Performance Performance in 4 Kaggle competitions (rank on leaderboard) • Under 10 lines of code required to produce each result. Reference: Image Classification on Kaggle using AutoGluon. Medium, 2020
AutoGluon Text Classification Performance
• The University of Stuttgart compared 4 AutoML tools (AutoGluon, H2O, auto-sklearn, TPOT) on 13 text datasets from Kaggle
• On average, AutoGluon is 2% better than the best of the other AutoML tools
• AutoGluon TextPrediction is 3.1% better than AutoGluon TabularPrediction (with text encoded as n-grams)
• AutoML slightly beats/matches the best human performance on 4 of the 13 datasets
Reference: Blohm, Hanussek, Kinz. Leveraging Automated Machine Learning for Text Classification: Evaluation of AutoML Tools and Comparison with Human Performance. arXiv, 2020.
Multimodal Data (Numeric & Categorical & Text)
Modeling Multimodal Data with Text Fields
• Fit a text neural network (Transformer) on the text features
• Fit classical tabular models (GBDT, random forest, etc.) on the numeric+categorical features
• Option 1: Ensemble the tabular & text models
• Option 2: Fit tabular models after featurizing the text into vector form (n-grams or Transformer embeddings) — see the sketch below
• Option 3: Adapt the text network to additionally operate on the numeric+categorical features
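A minimal sketch of Option 2 (illustrative scikit-learn code, not AutoGluon internals): featurize the text column into TF-IDF n-grams, then fit one tabular model on the combined features.

import numpy as np
import scipy.sparse as sp
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

def fit_tabular_plus_ngrams(X_tabular, texts, y):
    # N-gram featurization of the raw text field.
    vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_features=5000)
    X_text = vectorizer.fit_transform(texts)
    # Combined features: numeric/categorical columns alongside text vectors.
    X_num = sp.csr_matrix(np.asarray(X_tabular, dtype=float))
    X_all = sp.hstack([X_num, X_text]).tocsr()
    model = RandomForestClassifier(n_estimators=300).fit(X_all, y)
    return model, vectorizer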
Aggregating Text & Tabular Models
Multimodal Benchmark Datasets
AutoGluon Multimodal Results
So, how do 3 lines of code fare on Kaggle?
Just using the Multimodal Transformer Network alone would score 2nd place (out of 2,380 teams).
Product Sentiment Classification Competition: 1st place in 3 lines of code
(Fits a stack ensemble using both tabular models & the Multimodal Transformer)
What else can AutoGluon do?
• Efficient neural architecture search (ENAS / ProxylessNAS)
• Global interpretability (permutation feature importance)
• Local interpretability (Shapley values)
[Slide shows paper thumbnails, including: ResNeSt: Split-Attention Networks; Going Deeper with Convolutions; Squeeze-and-Excitation Networks; ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware]
AutoGluon Usage across Amazon
• AWS ML Solutions Lab: rapidly builds ML proofs-of-concept for enterprise customers
• Machine Learning University: taught as an easy way to solve practical ML problems
• Used in the winning solutions of 3+ internal prediction competitions
• Used in production for many applications, including:
  • Demand forecasting
  • Fraud detection
  • Classifying customer-service requests
• Used by many external organizations, such as: Mayo Clinic, Boston Consulting Group, and the Universities of Stuttgart and British Columbia
• Check out auto.gluon.ai & produce accurate models in 5 minutes!
• Train/deploy AutoGluon in the cloud:
  • AWS Marketplace: aws.amazon.com/marketplace/pp/prodview-n4zf5pmjt7ism
  • AWS SageMaker: github.com/awslabs/amazon-sagemaker-examples/tree/master/advanced_functionality/autogluon-tabular
• Contributors welcome! github.com/awslabs/autogluon
  Email us for an invite to the public Slack channel for contributors: autogluon.slack.com
• Ask questions: github.com/awslabs/autogluon/issues or email autogluon-dev@amazon.com