Intel Acceleration for Classical Machine Learning - Laurent Duhem - HPC/AI Solutions Architect () Shailen Sobhee - AI ...
Intel® AI Workshop 2021 Intel® Acceleration for Classical Machine Learning Laurent Duhem – HPC/AI Solutions Architect (Laurent.duhem@intel.com) Shailen Sobhee - AI Software Technical Consultant (shailen.sobhee@intel.com)
Notices and Disclaimers
▪ Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration.
▪ No product or component can be absolutely secure.
▪ Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. For more complete information about performance and benchmark results, visit http://www.intel.com/benchmarks.
▪ Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/benchmarks.
▪ Intel® Advanced Vector Extensions (Intel® AVX) provides higher throughput to certain processor operations. Due to varying processor power characteristics, utilizing AVX instructions may cause a) some parts to operate at less than the rated frequency and b) some parts with Intel® Turbo Boost Technology 2.0 to not achieve any or maximum turbo frequencies. Performance varies depending on hardware, software, and system configuration and you can learn more at http://www.intel.com/go/turbo.
▪ Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
▪ Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.
▪ Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.
▪ © Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
Executive Summary
▪ Intel® Distribution for Python covers major usages in HPC and Data Science
▪ Achieve faster Python application performance, right out of the box, with minimal or no changes to your code
▪ Accelerate NumPy*, SciPy*, and scikit-learn* with integrated Intel® Performance Libraries such as Intel® oneMKL (Math Kernel Library) and Intel® oneDAL (Data Analytics Library)
▪ Access the latest vectorization and multithreading instructions, Numba* and Cython*, composable parallelism with Threading Building Blocks, and more
Intended users:
▪ Analysts
▪ Data Scientists
▪ Machine Learning Developers
Intel® Distribution for Python Architecture
[Diagram: layered architecture.
 Interface: command line (> python script.py), scientific environments, developer environments.
 Language: CPython (GIL).
 Python packages (Intel® Distribution for Python): numerical packages; parallelism packages tbb4py, smp, mpi4py; daal4py.
 Native technologies: oneMKL, DPC++, oneDAL, TBB, iomp, impi.
 Legend: community technology, Intel technology.]
Accelerated NumPy and SciPy
• Optimizations include use of oneMKL, which provides optimized BLAS/LAPACK operations and FFT computations
• Optimizations also include use of Intel® C and Fortran compilers to enable better use of vectorization
• Interface directly works with single and double precision NumPy arrays
• Natively supports multidimensional transforms
oneAPI Data Analytics Library (oneDAL)
Optimized building blocks for all stages of data analytics on Intel Architecture
GitHub: https://github.com/oneapi-src/oneDAL
Intel® oneAPI Data Analytics Library (oneDAL) Algorithms: Machine Learning
[Diagram; grouping reconstructed from the flattened slide.]
▪ Supervised learning
  • Regression: Linear Regression, Ridge Regression, LASSO
  • Classification: Decision Tree, Random Forest, Gradient Boosting, Boosting (AdaBoost, Brown/Logit), Naïve Bayes, Logistic Regression, kNN, SVM
▪ Unsupervised learning
  • Clustering: K-Means, DBSCAN, EM for GMM
▪ Collaborative filtering: Alternating Least Squares
▪ Apriori
Legend: algorithms supporting batch processing; algorithms supporting batch and distributed processing; algorithms supporting Intel GPU (Gen 9 & Gen12) & dGPU
Intel® oneAPI Data Analytics Library (oneDAL) Algorithms: Data Transformation and Analysis
[Diagram; grouping reconstructed from the flattened slide.]
▪ Basic statistics for datasets: Low order moments, Quantiles, Order statistics
▪ Correlation and dependence: Cosine distance, Correlation distance, Variance-Covariance matrix
▪ Matrix factorizations: SVD, QR, Cholesky
▪ Dimensionality reduction: PCA, tSVD
▪ Association rule mining (Apriori)
▪ Outlier detection: Univariate, Multivariate
▪ Optimization solvers (SGD, AdaGrad, lBFGS, CD)
▪ Math functions (exp, log,…)
Legend: algorithms supporting batch processing; algorithms supporting batch, online and/or distributed processing; algorithms supporting Intel GPU (Gen 9 & Gen12) & dGPU
K-Means Using Scikit-learn and daal4py

▪ Scikit-learn

    from sklearn.cluster import KMeans
    import pandas as pd

    data = pd.read_csv("./kmeans.csv")              # Load the data

    algo = KMeans(n_clusters=20,                    # Configure K-means
                  init='k-means++', max_iter=5)     # main object

    result = algo.fit(data)                         # Compute the clusters and labels

    result.labels_                                  # Print the results
    result.cluster_centers_

▪ daal4py

    from daal4py import kmeans_init, kmeans
    import pandas as pd

    data = pd.read_csv("./kmeans.csv")              # Load the data

    init = kmeans_init(nClusters=20,                # Compute initial
                       method="plusPlusDense").compute(data)  # centroids

    algo = kmeans(nClusters=20,                     # Configure K-means
                  maxIterations=5, assignFlag=True) # main object

    result = algo.compute(data,                     # Compute the
                          init.centroids)           # clusters and labels

    result.assignments                              # Print the results
    result.centroids
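Both snippets follow the same two steps: pick initial centroids with k-means++, then run a fixed number of assign-and-update iterations. The sketch below spells those steps out in plain Python so that the roles of `nClusters`/`n_clusters`, `maxIterations`/`max_iter`, `assignments`/`labels_` and `centroids`/`cluster_centers_` are concrete. It is an illustration only, not how scikit-learn or oneDAL implement K-means; every function name and the toy data are hypothetical.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_plus_plus_init(points, n_clusters, rng):
    """k-means++ seeding: each new centroid is drawn with probability
    proportional to its squared distance from the nearest centroid chosen
    so far. Assumes at least n_clusters distinct points."""
    centroids = [rng.choice(points)]
    while len(centroids) < n_clusters:
        weights = [min(squared_distance(p, c) for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
        for p in points
    ]


def kmeans(points, initial_centroids, max_iterations):
    """Lloyd's iterations: assign points, then move each centroid to the
    mean of its members. An empty cluster keeps its previous centroid."""
    centroids = list(initial_centroids)
    for _ in range(max_iterations):
        labels = assign(points, centroids)
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(old)
        centroids = updated
    # Final assignments are computed against the final centroids.
    return centroids, assign(points, centroids)


# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
rng = random.Random(0)
init = kmeans_plus_plus_init(points, n_clusters=2, rng=rng)
centroids, assignments = kmeans(points, init, max_iterations=5)
print(centroids)
print(assignments)
```

Separating the seeding step from the iteration step mirrors the daal4py version, where `kmeans_init` and `kmeans` are distinct objects; scikit-learn folds both into `KMeans`.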
scikit-learn
The most popular ML package for Python*
Intel Distribution for Python (IDP) Scikit-learn

Same Code, Same Behavior
• Scikit-learn, not scikit-learn-like
• Scikit-learn conformance (mathematical equivalence) defined by Scikit-learn Consortium, continuously vetted by public CI

▪ Common Scikit-learn (Scikit-learn mainline)

    from sklearn.svm import SVC
    X, y = get_dataset()
    clf = SVC().fit(X, y)
    res = clf.predict(X)

▪ Scikit-learn with Intel CPU opts (available through Intel conda: conda install daal4py -c intel)

    import daal4py as d4p
    d4p.patch_sklearn()

    from sklearn.svm import SVC
    X, y = get_dataset()
    clf = SVC().fit(X, y)
    res = clf.predict(X)
Intel optimized Scikit-Learn
Speedup of Intel® oneDAL powered Scikit-Learn over the original Scikit-Learn

    Benchmark                               Speedup
    K-means fit, 1M x 20, k=1000               44.0
    K-means predict, 1M x 20, k=1000            3.6
    PCA fit, 1M x 50                            4.0
    PCA transform, 1M x 50                     27.2
    Random Forest fit, higgs1m                 38.3
    Random Forest predict, higgs1m             55.4
    Ridge Reg fit, 10M x 20                    53.4
    Linear Reg fit, 2M x 100                   91.8
    LASSO fit, 9M x 45                         50.9
    SVC fit, ijcnn                             29.0
    SVC predict, ijcnn                         95.3
    SVC fit, mnist                             82.4
    SVC predict, mnist                        221.0
    DBSCAN fit, 500K x 50                      17.3
    train_test_split, 5M x 20                   9.4
    kNN predict, 100K x 20, class=2, k=5      131.4
    kNN predict, 20K x 50, class=2, k=5       113.8

Same Code, Same Behavior
• Scikit-learn, not scikit-learn-like
• Scikit-learn conformance (mathematical equivalence) defined by Scikit-learn Consortium, continuously vetted by public CI

HW: Intel Xeon Platinum 8276L CPU @ 2.20GHz, 2 sockets, 28 cores per socket; Details: https://medium.com/intel-analytics-software/accelerate-your-scikit-learn-applications-a06cacf44912
Available algorithms
▪ Accelerated IDP Scikit-learn algorithms:
• Linear/Ridge Regression
• Logistic Regression
• ElasticNet/LASSO
• PCA
• K-means
• DBSCAN
• SVC
• train_test_split(), assume_all_finite()
• Random Forest Regression/Classification - DAAL 2020.3
• kNN (kd-tree and brute force) - DAAL 2020.3
Demo
XGBoost
Gradient Boosting - overview
• Gradient Boosting:
  • Boosting algorithm (Decision Trees as base learners)
  • Solves many types of ML problems (classification, regression, learning to rank)
  • Highly accurate, widely used by Data Scientists
  • Compute intensive workload
  • Known implementations: XGBoost*, LightGBM*, CatBoost*, Intel® DAAL, …
DMLC XGBoost* ACCELERATION
▪ Intel® contributed 3 pull requests to the XGBoost* project on GitHub* during the year
Goal: performance optimizations of ‘hist’ mode for Intel® CPUs
XGBoost training improvements
Dataset: Airline-OHE, 4.69M

    Metric          Library version    Value
    Train time, s   XGBoost 0.81       4481
    Train time, s   XGBoost 1.2.0      243
    Accuracy        XGBoost 0.81       0.841544
    Accuracy        XGBoost 1.2.0      0.842981

Speedup: 18.4

Workload description: the Airline dataset was preprocessed with OHE; after a random permutation, the first 7M rows were selected and split into train and test parts (70%-30%). HW: 2 x Intel® Xeon Gold 6230R @ 26 cores, 193 GB RAM. OS: CentOS Linux 8 (Core). SW: XGBoost 1.2 and 0.81 from the xgboost pip channel; compiler: G++ 7.4; Intel DAAL 2020.3, downloaded from conda. Python env: Python 3.7, NumPy 1.18.5, Pandas 0.25.3, Scikit-learn 0.23.2.
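The reported speedup is simply the ratio of the two training times. A short check of that arithmetic, using only the numbers quoted on this slide (the helper name `speedup` is an illustrative choice):

```python
def speedup(baseline_seconds, optimized_seconds):
    """How many times faster the optimized run is than the baseline."""
    return baseline_seconds / optimized_seconds


# Values copied from the slide (Airline-OHE dataset).
train_time_s = {"XGBoost 0.81": 4481, "XGBoost 1.2.0": 243}
accuracy = {"XGBoost 0.81": 0.841544, "XGBoost 1.2.0": 0.842981}

train_speedup = speedup(train_time_s["XGBoost 0.81"], train_time_s["XGBoost 1.2.0"])
accuracy_change = accuracy["XGBoost 1.2.0"] - accuracy["XGBoost 0.81"]

print(f"speedup: {train_speedup:.1f}x, accuracy change: {accuracy_change:+.6f}")
```

The ratio rounds to the 18.4 shown on the slide, and the accuracy difference is small and positive, so the faster version does not trade accuracy for speed on this workload.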
XGB and LGBM prediction acceleration
daal4py Gradient Boosting Model Converters

XGBoost:

    xgb_model = xgb.train(params, X_train)                      # Train common XGBoost model as usual
    import daal4py as d4p
    daal_model = d4p.get_gbt_model_from_xgboost(xgb_model)      # XGBoost model to DAAL model
    daal_prediction = d4p.gbt_classification_prediction(…).compute(X_test, daal_model)  # make fast prediction with DAAL

LGBM:

    lgb_model = lgb.train(params, X_train)                      # Train common LGBM model as usual
    import daal4py as d4p
    daal_model = d4p.get_gbt_model_from_lightgbm(lgb_model)     # LGBM model to DAAL model
    daal_prediction = d4p.gbt_classification_prediction(…).compute(X_test, daal_model)  # make fast prediction with DAAL

Convert an already trained XGB/LGBM model to speed up prediction without loss of accuracy.

Prediction time, s:

    Dataset    LGBM    LGBM + daal4py   Speedup   XGB     XGB + daal4py   Speedup
    Higgs      9.156   0.728            12.6      5.514   0.7             7.9
    Mortgage   9.156   0.728            12.6      5.514   0.7             7.9
    MSRank     0.857   0.111            7.7       0.934   0.121           7.7

Accuracy/MSE:

    Dataset    LGBM      LGBM + daal4py   XGB       XGB + daal4py
    Higgs      0.75626   0.75626          0.75828   0.75828
    Mortgage   0.49061   0.49061          0.4879    0.4879
    MSRank     0.57101   0.57101          0.57177   0.57177
Demo
Intel® Distribution for Python Architecture
[Diagram: layered architecture, extended with language extensions.
 Interface: command line (> python script.py), scientific environments, developer environments.
 Language: CPython (GIL); language extensions that release the GIL: Numba (LLVM IR) and C++.
 Python packages (Intel® Distribution for Python): numerical packages; parallelism packages tbb4py, smp, mpi4py; dataframe package SDC; daal4py.
 Native technologies: oneMKL, DPC++, oneDAL, TBB, iomp, impi.
 Legend: community technology, Intel technology.]
Intel® Scalable DataFrame Compiler
Just import Numba and use decorator
▪ Extension for Numba* to accelerate AI workflows
▪ Supports more data types (Series, Dataframes, ASCII/Unicode strings)
▪ Compiler, not a library
▪ Scales from laptops to multi-core servers
▪ Open-source project
  GitHub page: https://github.com/IntelPython/sdc
  Documentation: https://intelpython.github.io/sdc-doc/latest/index.html
▪ Available as conda package and pip wheels
Intel® SDC SPEEDUP: SDC vs. Pandas
[Bar chart, run_etl benchmark. Speedup by thread count: 1 thread 1.6991; 4 threads 3.3001; 20 threads 10.9496; 40 threads 14.5491.]
Intel® Xeon™ Gold 6248 CPU @ 2.50GHz, 2x20 cores; Numba* 0.51.2, Pandas* 1.0.5, SDC 0.37.0
Demo
Modin
▪ Usable and Scalable
▪ To use Modin, replace the pandas import
[Diagram: with a Pandas DataFrame most CPU cores stay idle; with a Modin DataFrame all CPU cores are fully utilized.]
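In its simplest form the swap is a one-line change: `import modin.pandas as pd` in place of `import pandas as pd`. The sketch below wraps that choice in a small helper that falls back to plain pandas when Modin is not installed. The helper name `load_dataframe_module` and the file name in the usage comment are hypothetical, introduced only for illustration.

```python
import importlib


def load_dataframe_module(candidates=("modin.pandas", "pandas")):
    """Return the first module in `candidates` that can be imported.

    With the default order, Modin is preferred and pandas is the fallback,
    so the rest of the script can keep using the familiar `pd` name.
    """
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError(f"none of {candidates} could be imported")


# Typical use (requires Modin or pandas to be installed):
#     pd = load_dataframe_module()
#     df = pd.read_csv("data.csv")   # hypothetical input file
```

Because Modin mirrors the pandas API, code written against `pd` should not need to know which backend was loaded; operations Modin does not yet accelerate may still run at pandas speed.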
Modin
Execution time: Pandas vs. Modin[ray]
[Bar chart, time in seconds: Pandas 340.0729; Modin 31.2453; 10.8 speedup.]
Intel® Xeon™ Gold 6248 CPU @ 2.50GHz, 2x20 cores
▪ Dataset size: 2.4GB
End-to-End Data Pipeline Acceleration
▪ Workload: Train a model using 50 years of Census data from IPUMS.org to predict income based on education
▪ Solution: Intel Modin for data ingestion and ETL, daal4py and Intel scikit-learn for model training and prediction
▪ Perf Gains:
  • Read_CSV (read from disk and store as a dataframe): 6x
  • ETL operations: 38x
  • Train Test Split: 4x
  • ML training (fit & predict) with Ridge Regression: 21x
For more complete information about performance and benchmark results, visit www.intel.com/benchmarks. See backup for configuration details.
Envision a GPU-enabled Python Library Ecosystem
Data Parallel Python: Unified Python Offload Programming Model
Extending PyData ecosystem for XPU

    with device_context("gpu"):
        a_dparray = dpnp.random.random(1024, 3)
        X_dparray = numba.njit(compute_embedding)(a_dparray)
        res_dparray = daal4py.kmeans().compute(X_dparray)

▪ Optimized Packages for Intel CPUs & GPUs
▪ Jit Compilation
▪ Unified Data & Execution Infrastructure
  • numpy → dpnp
  • ndarray → dparray
  • host memory → unified shared mem
  • CPU → XPU
  • zero-copy USM array interface
  • common device execution queues
▪ DPC++ RUNTIME: OpenCL, Level 0, CUDA
New Additions to Numba’s Language Design

▪ @njit: NumPy-based array programming, auto-offload, high-productivity

    from numba import njit
    import numpy as np
    import dpctl

    @njit
    def f1(a, b):
        c = a + b
        return c

    a = np.ones(1024, dtype=np.float32)
    b = np.ones(1024, dtype=np.float32)

    with dpctl.device_context("gpu"):
        c = f1(a, b)

▪ @dppy.kernel: explicit kernels, low-level kernel programming for expert ninjas

    import dpctl
    import numba_dppy as dppy
    import numpy as np

    @dppy.kernel
    def sum(a, b, c):
        i = dppy.get_global_id(0)
        c[i] = a[i] + b[i]

    a = np.ones(1024, dtype=np.float32)
    b = np.ones(1024, dtype=np.float32)
    c = np.zeros_like(a)

    with dpctl.device_context("gpu"):
        sum[1024, dppy.DEFAULT_LOCAL_SIZE](a, b, c)
Seamless interoperability and sharing of resources
• Different packages share same execution context
• Data can be exchanged without extra copies and kept on the device

    import dpctl, numba, dpnp, daal4py

    @numba.njit
    def compute(a):                                           # Numba function
        ...

    with dpctl.device_context("gpu"):
        a_dparray = dpnp.random.random(1024, 3)
        X_dparray = compute(a_dparray)
        res_dparray = daal4py.kmeans().compute(X_dparray)     # daal4py function
Portability Across Architectures

    import numba
    import numpy as np
    import math
    import dpctl

    @numba.vectorize(nopython=True)
    def cndf2(inp):
        # Cumulative normal distribution function
        out = 0.5 + 0.5 * math.erf((math.sqrt(2.0) / 2.0) * inp)
        return out

    @numba.njit(parallel={"offload": True}, fastmath=True)
    def blackscholes(sptprice, strike, rate, volatility, timev):
        logterm = np.log(sptprice / strike)
        powterm = 0.5 * volatility * volatility
        den = volatility * np.sqrt(timev)
        d1 = (((rate + powterm) * timev) + logterm) / den
        d2 = d1 - den
        NofXd1 = cndf2(d1)
        NofXd2 = cndf2(d2)
        futureValue = strike * np.exp(-rate * timev)
        c1 = futureValue * NofXd2
        call = sptprice * NofXd1 - c1
        put = call - sptprice + futureValue   # put-call parity
        return put

    # Runs on CPU by default
    blackscholes(...)

    # Runs on GPU
    with dpctl.device_context("gpu"):
        blackscholes(...)

    # In future
    with dpctl.device_context("cuda:gpu"):
        blackscholes(...)
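The formulas inside `blackscholes` can be checked on a single option without Numba or a GPU. The sketch below is a scalar restatement using only the standard library; the function names and the sample inputs (spot 100, strike 100, 5% rate, 20% volatility, one year) are illustrative assumptions, not part of the slide. The put price is derived from the call through put-call parity, P = C - S + K·exp(-rT).

```python
import math


def cndf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 + 0.5 * math.erf(x / math.sqrt(2.0))


def black_scholes(spot, strike, rate, volatility, years):
    """European call and put prices under the Black-Scholes model."""
    log_term = math.log(spot / strike)
    pow_term = 0.5 * volatility * volatility
    den = volatility * math.sqrt(years)
    d1 = ((rate + pow_term) * years + log_term) / den
    d2 = d1 - den
    future_value = strike * math.exp(-rate * years)  # discounted strike
    call = spot * cndf(d1) - future_value * cndf(d2)
    put = call - spot + future_value  # put-call parity
    return call, put


call, put = black_scholes(spot=100.0, strike=100.0, rate=0.05, volatility=0.2, years=1.0)
print(f"call={call:.4f} put={put:.4f}")
```

A scalar reference like this is a convenient way to validate an array kernel before offloading it: run both on the same inputs and compare the results within a floating-point tolerance.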
Scikit-Learn on XPU
SAME NUMERIC BEHAVIOR as defined by Scikit-learn Consortium & continuously validated by CI

▪ Stock on Host:

    from sklearn.svm import SVC
    X, y = get_dataset()

    clf = SVC().fit(X, y)
    res = clf.predict(X)

▪ Optimized on Host:

    import daal4py as d4p
    d4p.patch_sklearn()

    from sklearn.svm import SVC
    X, y = get_dataset()

    clf = SVC().fit(X, y)
    res = clf.predict(X)

▪ Offload to XPU:

    import daal4py as d4p
    d4p.patch_sklearn()
    import dpctl

    from sklearn.svm import SVC
    X, y = get_dataset()

    with dpctl.device_context("gpu"):
        clf = SVC().fit(X, y)
        res = clf.predict(X)
Installing Intel® Distribution for Python* 2021

▪ Anaconda.org (https://anaconda.org/intel/packages)

    > conda create -n idp -c intel intelpython3_core python=3.x
    > conda activate idp
    > conda install intel::numpy

▪ YUM/APT

    https://software.intel.com/content/www/us/en/develop/articles/installing-intel-free-libs-and-python-apt-repo.html
    https://software.intel.com/content/www/us/en/develop/articles/installing-intel-free-libs-and-python-yum-repo.html

▪ Docker Hub

    docker pull intelpython/intelpython3_full

▪ oneAPI

    https://software.intel.com/content/www/us/en/develop/tools/oneapi/ai-analytics-toolkit.html

▪ Standalone Installer

    https://software.intel.com/content/www/us/en/develop/articles/oneapi-standalone-components.html#python

▪ PyPI (+ Intel library runtime packages, + Intel development packages)

    > pip install intel-numpy
    > pip install intel-scipy
    > pip install mkl_fft
    > pip install mkl_random
Get the Most from Your Code Today with Intel® Tech.Decoded
Visit TechDecoded.intel.io to learn how to put key optimization strategies into practice with Intel development tools.
▪ Big Picture Videos: Discover Intel’s vision for key development areas.
▪ Essential Webinars: Gain strategies, practices and tools to optimize application and solution performance.
▪ Quick Hit How-To Videos: Learn how to do specific programming tasks using Intel® tools.
TOPICS: Visual Computing, Code Modernization, Systems & IoT, Data Science, Data Center & Cloud
More Resources
Intel® Distribution for Python
• Product page: overview, features, FAQs…
• Training materials: movies, tech briefs, documentation, evaluation guides…
• Support: forums, secure support…
• Machine Learning Benchmarks
  • https://github.com/oneapi-src/oneAPI-samples/tree/master/AI-and-Analytics
  • https://github.com/IntelPython/scikit-learn_bench
Thank you