AutoML for Text, Image, and Tabular Data
github.com/awslabs/autogluon
Jonas Mueller (jonasmue@amazon.com)
Amazon Web Services
AutoGluon Overview
• Enables easy-to-use & highly accurate supervised learning (AutoML) for tabular, image, and text data
• Open source & easy to extend (add custom models to the set AutoGluon trains/tunes/ensembles)
• AutoML that is highly customizable/controllable. You can specify: which metric AutoGluon optimizes for, how long training should take, the trade-off between inference latency and accuracy, how many CPU/GPU compute resources to use, which model types / hyperparameter values to consider, ... (see the sketch below)
• Leverages automatic hyperparameter tuning, model selection/ensembling, neural architecture search, and data preprocessing (the best techniques for your dataset are applied automatically)
http://auto.gluon.ai
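For example, a customized fit() call might look like the following sketch. It uses the legacy task API shown later in this talk; argument names such as time_limits changed in subsequent AutoGluon releases, so treat them as illustrative:

from autogluon import TabularPrediction as task

predictor = task.fit(
    train_data="train.csv",
    label="class",
    eval_metric="roc_auc",    # which metric AutoGluon optimizes for
    time_limits=3600,         # total training-time budget, in seconds
    presets="best_quality",   # vs. e.g. presets favoring small/fast models
)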
What you can achieve with AutoGluon
• AutoGluon produces better models than those built by someone lacking ML expertise or the time to dive deep into modeling
• Quickly see what accuracy is achievable on your raw data with just a couple lines of Python code
• Painlessly deploy trained models (and higher-accuracy model ensembles)
• Flexibility: provided presets let you specify how much you care about optimizing for accuracy at all costs, reducing inference time, reducing model size, ...
• Easy-to-use interpretability methods (see the sketch below) tell you:
  - how predictive each variable is (permutation feature importance)
  - why AutoGluon made a particular prediction (Shapley values)
• Easily improve any existing bespoke modeling pipeline with advanced hyperparameter optimization (HPO)
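A minimal interpretability sketch: permutation feature importance is exposed directly on the trained predictor (method availability and signatures vary across AutoGluon versions, so treat this as illustrative; Shapley values are typically computed with the external shap package rather than an AutoGluon API):

# Assumes `predictor` was produced by the 3-line task.fit() example on the
# next slide; passing a file path here mirrors that example's usage.
importance = predictor.feature_importance("test.csv")  # permutation importance
print(importance)  # drop in the eval metric when each column is shuffled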
AutoGluon is easy to use
• Only 3 lines of code to train and use a model. For a structured dataset of raw values stored in a file train.csv, with the label values to predict stored in a column named class, an AutoGluon model can be trained and tested as follows:

from autogluon import TabularPrediction as task
predictor = task.fit("train.csv", label="class")
predictions = predictor.predict("test.csv")

fit() automatically does the following:
1. preprocesses the raw data (identifies the type of each feature)
2. identifies what type of prediction problem this is (binary/multi-class classification or regression)
3. splits the data appropriately (e.g. training/validation sets, k-fold split)
4. individually trains/tunes various models (RandomForest/ExtraTrees, KNN, NeuralNet, GBDT)
5. assembles an ensemble that outperforms all the individual models
Built-in Prediction Tasks
Tabular prediction, suitable for:
• classification & regression
• raw CSV/Parquet files
• tables with text/string fields, date-times, missing values
Image classification, suitable for:
• raw image files (JPG, PNG, etc.)
• images of various sizes
• smaller labeled datasets
Text prediction, suitable for:
• classification & regression + many NLP tasks (GLUE/SuperGLUE)
• raw datasets with multiple text fields
• smaller labeled datasets
Tabular Data Benchmarks
1. OpenML AutoML Benchmark (39 datasets, 10 train/test folds each)
   • https://openml.github.io/automlbenchmark/
   • Curated binary and multiclass classification tasks for assessing AutoML
2. Kaggle Benchmark (datasets from 11 prediction competitions)
   • Represents realistic modern-day applications of supervised learning
   • Regression, binary, and multiclass classification tasks
   • Many different evaluation metrics (tailored to the business use-case of each application)
Reference: AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data. arXiv, 2020.
Comparing AutoML tools: Kaggle Competitions
[Figure: percentile rank on each competition's leaderboard (0.0–1.0) achieved by auto-sklearn, TPOT, Auto-WEKA, H2O AutoML, GCP-Tables, and AutoGluon across the 11 Kaggle competitions: bnp-paribas, santander-trans..., santander-satis..., porto-seguro, ieee-fraud, walmart-recruit..., otto-group, house-prices, allstate-claims, mercedes-benz, santander-value]
AutoGluon outperformed 99% of data science teams after training for 4h on raw data from the otto and bnp-paribas competitions.
Reference: AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data. arXiv, 2020.
Otto-Group Kaggle Competition
• In 4 hours (on an m5.2xlarge CPU instance), AutoGluon achieved a score that took the best human team over 1 week to beat on Kaggle.
• This competition (Otto) had over 3,500 competing teams.
[Figure: leaderboard scores over time, with AutoGluon's 4h score marked]
Comparing AutoML tools: AutoML Benchmark
[Figure: loss on test data relative to AutoGluon (log scale, 0.1–20) for auto-sklearn, TPOT, Auto-WEKA, H2O AutoML, GCP-Tables, and AutoGluon on each of the 39 AutoML Benchmark datasets]
Reference: AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data. arXiv, 2020.
Comparing AutoML tools vs AutoGluon
Table S8. Comparison of the AutoML frameworks on the AutoML Benchmark, evaluating the accuracy metric (with a 1h training-time limit). Rescaled misclassification rate is calculated in the same manner as rescaled loss, but applied specifically to the accuracy metric. Here, we additionally compare with the commercial SageMaker AutoPilot framework. All frameworks were optimized on AUC except for AutoPilot, which was optimized on accuracy. This demonstrates that while AutoGluon performs best on the primary evaluation metric, it also performs favourably on secondary metrics such as accuracy, even compared to AutoPilot, which optimized directly for accuracy.

Framework      Wins  Losses  Failures  Champion  Avg. Rank  Avg. Rescaled Misclass. Rate  Avg. Time (min)
AutoGluon         0       0         0        17     1.8421                        0.0509               56
GCP-Tables        6      15        13         5     2.9211                        0.1973               83
auto-sklearn      7      22         5         3     3.7105                        0.2506               60
H2O AutoML        5      27         1         1     4.0263                        0.3198               58
TPOT              5      26         4         3     4.6053                        0.4102               67
AutoPilot         2      23        12         0     5.1842                        0.4937               58
Auto-WEKA         6      27         4         4     5.7105                        0.7212               60

Results for AutoML Benchmark: 39 datasets total, 1-hour time limit imposed on the AutoML systems.
More benchmarks: AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data. arXiv, 2020.
AutoGluon vs other AutoML for Tabular Data
• Many existing AutoML frameworks:
  Open-source: Auto-WEKA, auto-sklearn, TPOT, H2O, Auto-Keras, auto-xgboost, hyperopt-sklearn, NNI
  Cloud: GCP AutoML Tables, SageMaker AutoPilot, Azure ML, DataRobot, Darwin AutoML
• Prior work treats AutoML as model + hyperparameter selection
• AutoGluon instead relies on strategies that win prediction contests:
  1. model ensembling via multi-layer stacking
  2. thoughtful data preprocessing
  3. repeated data splitting (bagging)
  4. greater emphasis on modern deep learning techniques
AutoGluon and AutoML
To obtain these results:
• AutoGluon did not do any hyperparameter tuning (results improve further with AutoGluon's built-in HPO)
• AutoGluon used the same hyperparameters for every dataset (factory-default hyperparameters for each model)
• AutoGluon was trained using the same 3 lines of code for every dataset (the fit/predict snippet shown earlier)
Multi-Layer Stack Ensembling
• A stacker model uses the predictions of every base model as extra features
• A layer L+1 stacker model uses the layer L predictions as extra features
• For simplicity: stacker model types = base model types
NOTE: Stackers must be trained with held-out predictions of the lower-layer models (see the sketch below)
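A minimal sketch of one stacking layer, using scikit-learn for illustration (the model choices and helper name here are assumptions for the sketch, not AutoGluon's internal code):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def fit_stack_layer(base_models, X, y):
    # Held-out (out-of-fold) predictions: each base model predicts on data it
    # never saw during training, so the stacker does not learn from leakage.
    oof_preds = [cross_val_predict(m, X, y, cv=5, method="predict_proba")
                 for m in base_models]
    # Stacker features = original features + every base model's predictions.
    X_stack = np.hstack([X] + oof_preds)
    for m in base_models:
        m.fit(X, y)  # refit each base model on all data for inference time
    return LogisticRegression(max_iter=1000).fit(X_stack, y)

At prediction time, the base models' outputs on new data are appended to its features before calling the stacker; layer L+1 repeats the same recipe on layer L's outputs.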
Bagging (Bootstrap Aggregation)
Train k different copies of the model, with a different chunk of the data held out from each (see the sketch below).
Reference: Van der Laan et al. Super Learner. Statistical Applications in Genetics and Molecular Biology, 2007.
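A minimal k-fold bagging sketch (illustrative scikit-learn code, not AutoGluon's implementation): each copy is trained with a different fold held out, and the k copies' predictions are averaged at inference time.

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def fit_bagged(model, X, y, k=5):
    # Train k copies, each missing one fold; the held-out folds also supply
    # the out-of-fold predictions needed to train stacker models safely.
    folds = KFold(n_splits=k, shuffle=True, random_state=0).split(X)
    return [clone(model).fit(X[tr], y[tr]) for tr, _ in folds]

def predict_bagged(fold_models, X_new):
    # Average the probabilistic predictions of all k fold-models.
    return np.mean([m.predict_proba(X_new) for m in fold_models], axis=0)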
Hyperparameter Optimization (HPO)
AutoGluon offers advanced HPO for your models
• No change to your code: just add a Python decorator above your train_evaluate() function (see the sketch below)
• Random/grid search, or superior proposals for the next hyperparameters to try via Bayesian optimization
• Early stopping: multi-fidelity HPO to skip non-promising hyperparameters (asynchronous Hyperband, BOHB)
• Easily distribute HPO jobs across multiple machines (just provide IP addresses)
Reference: Klein et al. Model-based Asynchronous Hyperparameter and Neural Architecture Search. arXiv, 2020.
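A sketch of the decorator-based HPO workflow (API names as in the legacy autogluon package of this era; they moved and changed in later releases, so treat this as illustrative):

import autogluon as ag

@ag.args(
    lr=ag.space.Real(1e-4, 1e-1, log=True),  # searchable hyperparameters
    wd=ag.space.Real(1e-6, 1e-2, log=True),
    epochs=10,                                # fixed (non-searched) argument
)
def train_evaluate(args, reporter):
    for epoch in range(1, args.epochs + 1):
        # ... train one epoch with args.lr, args.wd, then validate ...
        val_acc = 0.0  # placeholder: compute validation accuracy here
        reporter(epoch=epoch, accuracy=val_acc)  # enables early stopping

# Asynchronous Hyperband scheduler: stops non-promising trials early.
scheduler = ag.scheduler.HyperbandScheduler(
    train_evaluate,
    resource={"num_cpus": 4, "num_gpus": 0},
    num_trials=20,
    reward_attr="accuracy",
    time_attr="epoch",
)
scheduler.run()
scheduler.join_jobs()
print(scheduler.get_best_config())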
Hyperparameter Tuning Drawbacks
• Requires many repeated model-training runs
• Inefficient: even state-of-the-art Bayesian optimization methods are not much better than random search
• Most of the models being trained are thrown away and don't contribute to the final result
• The more hyperparameter tuning is done, the more the final model overfits the validation data
• Strategies like cross-validation help, but make tuning take even longer
• Less helpful when the model being tuned is part of a model ensemble
AutoGluon performance reliably improves with longer training times
Accelerate Inference via Distillation
• Ensemble = slow for prediction, with substantial memory/disk requirements
• Single model = faster, cheaper, easier to maintain; hardware accelerators exist
• Distillation strategy: student = single model, teacher = ensemble. Train the student on a dataset where the labels = predictions from the teacher.
• We consider four types of students: Neural Network, Random Forest, LightGBM, CatBoost
• Augment the training set with synthetic data so the student better learns to closely mimic the teacher (see the sketch below)
Reference: Fast, Accurate, and Simple Models for Tabular Data via Augmented Distillation. NeurIPS, 2020.
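The generic recipe looks roughly like this sketch (toy scikit-learn stand-ins for the teacher and student; the Gaussian-noise augmentation here is only a placeholder for the Gibbs-sampling augmentation described next):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingRegressor, RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Teacher: stand-in for AutoGluon's large stack ensemble.
teacher = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

# Augment the training set so the student sees more of the input space.
X_aug = np.vstack([X, X + 0.05 * np.random.randn(*X.shape)])  # toy augmentation
soft_labels = teacher.predict_proba(X_aug)[:, 1]  # teacher's soft predictions

# Student: one fast model trained to regress the teacher's probabilities.
student = GradientBoostingRegressor().fit(X_aug, soft_labels)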
Augmenting Dataset with Synthetic Features
• Multivariate generative modeling is hard
• Estimating p(x_i | x_-i) is easier (a univariate conditional distribution)
• Use Gibbs sampling to generate new feature values for augmentation: iterate the MCMC chain, resampling one feature at a time from its conditional
• Initialize each Gibbs sampling chain at a real training datapoint
• The number of Gibbs sampling rounds controls the distance between the augmented-data distribution and the training-data distribution
(a toy sketch follows below)
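A toy sketch of the augmentation loop. It assumes continuous features and uses a Gaussian-around-a-regression-fit stand-in for each conditional p(x_i | x_-i); the actual method models these conditionals with the Transformer described next:

import numpy as np
from sklearn.linear_model import LinearRegression

def gibbs_augment(X, n_rounds=1, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # One conditional model per feature: predict x_i from all other features.
    models, resid_sd = [], []
    for i in range(d):
        mask = np.arange(d) != i
        m = LinearRegression().fit(X[:, mask], X[:, i])
        models.append(m)
        resid_sd.append(np.std(X[:, i] - m.predict(X[:, mask])))
    X_new = X.copy()  # initialize each chain at a real training datapoint
    for _ in range(n_rounds):  # more rounds => samples drift further from data
        for i in range(d):     # resample one feature at a time (Gibbs step)
            mask = np.arange(d) != i
            mean = models[i].predict(X_new[:, mask])
            X_new[:, i] = mean + resid_sd[i] * rng.standard_normal(n)
    return X_new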
Transformer as Generative Model
• Use one Transformer model to simultaneously model all conditionals p(x_i | x_-i)
• Want the model to be agnostic to the ordering of features
• Train the Transformer via maximum pseudo-likelihood: maximize sum_i log p(x_i | x_-i) over the training data
• For continuous features, output layer = mixture of Gaussians
• Use masking to omit information about feature i (like BERT)
• Use positional encoding so the Transformer remembers which value corresponds to which feature
(a toy sketch of the masked objective follows below)
References: Vaswani et al. Attention Is All You Need. NeurIPS, 2017. Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL, 2019.
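A toy sketch of the masked pseudo-likelihood objective (a small MLP with a squared-error head stands in for the Transformer with a mixture-of-Gaussians output layer; the BERT-style masking pattern is the point being illustrated):

import torch
import torch.nn as nn

d = 8  # number of features (illustrative)
net = nn.Sequential(nn.Linear(2 * d, 64), nn.ReLU(), nn.Linear(64, d))

def pseudo_likelihood_loss(x):  # x: (batch, d) continuous features
    loss = 0.0
    for i in range(d):
        keep = torch.ones_like(x)
        keep[:, i] = 0.0                          # mask out feature i (like BERT)
        inp = torch.cat([x * keep, keep], dim=1)  # mask flags mark which slot is hidden
        pred = net(inp)[:, i]                     # predict the masked feature
        loss = loss + ((pred - x[:, i]) ** 2).mean()  # Gaussian log-lik, up to constants
    return loss  # = -sum_i log p(x_i | x_-i), up to constants

loss = pseudo_likelihood_loss(torch.randn(32, d))  # usage on a random batch
loss.backward()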
Gibbs Sampling Examples
Left to right:
1. Original training data
2. Samples after 1 round of Gibbs sampling with random initialization
3. Samples after many Gibbs rounds with random initialization
4. Samples after 1 round of Gibbs sampling with data initialization
Gibbs sampling outperforms other augmentation
Distillation: Accuracy vs Inference Latency
[Figure panels: (A) Regression (9 datasets), (B) Binary classification (11 datasets), (C) Multiclass classification (9 datasets)]
TEACHER = AutoGluon with presets='best_quality' (large stack ensemble)
GIB-1 = individual models distilled using AutoGluon
Red/green dots = best of {NeuralNet, LightGBM, CatBoost, RandomForest}, selected by validation score
Image Classification with AutoGluon
Dataset format: each folder's name = a class, and the files in that folder = images from that class (see the sketch below).
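A sketch with the legacy image task API (mirroring the tabular example earlier; the module path and fit arguments changed in later AutoGluon versions, so treat them as illustrative):

from autogluon import ImageClassification as task

dataset = task.Dataset("train_images/")          # one sub-folder per class
classifier = task.fit(dataset, time_limits=600)  # 10-minute budget
test_acc = classifier.evaluate(task.Dataset("test_images/"))
print(test_acc)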
AutoGluon fit() for Image/Text Data
1. Preprocess raw data & split into training/validation sets
2. (Optional) Apply appropriate data augmentation
3. (Optional) Load an appropriate pretrained network from a model zoo (GluonCV/GluonNLP: ResNet for classifying images, YOLO for object detection, BERT/ELECTRA for text)
4. Adapt the network architecture to the user's prediction task
5. Train the neural net on the data (optionally on multiple GPUs)
6. Repeatedly train under different hyperparameter configurations to find the best hyperparameters (optionally over many machines)
7. (Optional) Retrain models on training+validation data and construct a more accurate model ensemble
Reference: Dive into Deep Learning Textbook (http://d2l.ai/)
AutoGluon Image Classification Performance Performance in 4 Kaggle competitions (rank on leaderboard) • Under 10 lines of code required to produce each result. Reference: Image Classification on Kaggle using AutoGluon. Medium, 2020
AutoGluon Text Classification Performance
• The University of Stuttgart compared 4 AutoML tools (AutoGluon, H2O, auto-sklearn, TPOT) on 13 text datasets from Kaggle
• On average, AutoGluon is 2% better than the best of the other AutoML tools
• AutoGluon TextPrediction is 3.1% better than AutoGluon TabularPrediction (with text encoded as n-grams)
• AutoML slightly beats/matches the best human performance on 4 of the 13 datasets
Reference: Blohm, Hanussek, Kinz. Leveraging Automated Machine Learning for Text Classification: Evaluation of AutoML Tools and Comparison with Human Performance. arXiv, 2020.
Multimodal Data (Numeric & Categorical & Text)
Modeling Multimodal Data with Text Fields
• Fit a text neural network (Transformer) on the text features
• Fit classical tabular models (GBDT, random forest, etc.) on the numeric+categorical features
• Option 1: Ensemble the tabular & text models
• Option 2: Fit tabular models after featurizing the text into vector form (n-grams or Transformer embeddings) — see the sketch below
• Option 3: Adapt the text network to additionally operate on the numeric+categorical features
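A minimal sketch of Option 2 (illustrative scikit-learn code, not AutoGluon internals): featurize the text column into TF-IDF n-grams, then fit one tabular model on the combined features.

import numpy as np
import scipy.sparse as sp
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

def fit_tabular_plus_ngrams(X_tabular, texts, y):
    # N-gram featurization of the raw text field.
    vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_features=5000)
    X_text = vectorizer.fit_transform(texts)
    # Combined features: numeric/categorical columns alongside text vectors.
    X_num = sp.csr_matrix(np.asarray(X_tabular, dtype=float))
    X_all = sp.hstack([X_num, X_text]).tocsr()
    model = RandomForestClassifier(n_estimators=300).fit(X_all, y)
    return model, vectorizer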
Aggregating Text & Tabular Models
Multimodal Benchmark Datasets
AutoGluon Multimodal Results
So, how do 3 lines of code fare on Kaggle?
Just using the Multimodal Transformer Network alone would score 2nd place (out of 2,380 teams).
Product Sentiment Classification Competition: 1st place in 3 lines of code
(Fits a stack ensemble using both tabular models & the Multimodal Transformer)
What else can AutoGluon do?
• Efficient neural architecture search (ENAS / ProxylessNAS)
• Global interpretability (permutation feature importance)
• Local interpretability (Shapley values)
[Slide shows paper thumbnails, including: ResNeSt: Split-Attention Networks; Going Deeper with Convolutions; Squeeze-and-Excitation Networks; ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware]
AutoGluon Usage across Amazon
• AWS ML Solutions Lab: rapidly builds ML proofs-of-concept for enterprise customers
• Machine Learning University: taught as an easy way to solve practical ML problems
• Used in the winning solutions of 3+ internal prediction competitions
• Used in production for many applications, including:
  • Demand forecasting
  • Fraud detection
  • Classifying customer-service requests
• Used by many external organizations, such as: Mayo Clinic, Boston Consulting Group, and the Universities of Stuttgart and British Columbia
• Check out auto.gluon.ai & produce accurate models in 5 minutes!
• Train/deploy AutoGluon in the cloud:
  • AWS Marketplace: aws.amazon.com/marketplace/pp/prodview-n4zf5pmjt7ism
  • AWS SageMaker: github.com/awslabs/amazon-sagemaker-examples/tree/master/advanced_functionality/autogluon-tabular
• Contributors welcome! github.com/awslabs/autogluon
  Email us for an invite to the public Slack channel for contributors: autogluon.slack.com
• Ask questions: github.com/awslabs/autogluon/issues or email autogluon-dev@amazon.com