Intel Acceleration for Classical Machine Learning - Laurent Duhem - HPC/AI Solutions Architect () Shailen Sobhee - AI ...
Intel® AI Workshop 2021 Intel® Acceleration for Classical Machine Learning Laurent Duhem – HPC/AI Solutions Architect (Laurent.duhem@intel.com) Shailen Sobhee - AI Software Technical Consultant (shailen.sobhee@intel.com)
Notices and Disclaimers
▪ Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system
configuration.
▪ No product or component can be absolutely secure.
▪ Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. For more complete
information about performance and benchmark results, visit http://www.intel.com/benchmarks .
▪ Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are
measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other
information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more
complete information visit http://www.intel.com/benchmarks .
▪ Intel® Advanced Vector Extensions (Intel® AVX) provides higher throughput to certain processor operations. Due to varying processor power characteristics, utilizing AVX instructions may cause
a) some parts to operate at less than the rated frequency and b) some parts with Intel® Turbo Boost Technology 2.0 to not achieve any or maximum turbo frequencies. Performance varies
depending on hardware, software, and system configuration and you can learn more at http://www.intel.com/go/turbo.
▪ Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2,
SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by
Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel
microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
▪ Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost
savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.
▪ Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are
accurate.
▪ © Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
Executive Summary
▪ Intel® Distribution for Python covers major usages in HPC and Data Science
▪ Achieve faster Python application performance, right out of the box, with minimal or no changes to your code
▪ Accelerate NumPy*, SciPy*, and scikit-learn* with integrated Intel® Performance Libraries such as Intel® oneMKL (Math Kernel Library) and Intel® oneDAL (Data Analytics Library)
▪ Access the latest vectorization and multithreading instructions, Numba* and Cython*, composable parallelism with Threading Building Blocks, and more
Audience:
▪ Analysts
▪ Data Scientists
▪ Machine Learning Developers
Intel® Distribution for Python Architecture
[Figure: architecture stack. Interface: command line (> python script.py), scientific environments, developer environments. Language: CPython (GIL). Python packages (numerical, parallelism): tbb4py, smp, mpi4py, daal4py. Native technologies: oneMKL, oneDAL, DPC++, TBB, iomp, impi. Legend: Intel technology, community technology.]
Accelerated NumPy and SciPy
• Optimizations include the use of oneMKL, which provides optimized BLAS/LAPACK operations and FFT computations
• Optimizations also include the use of Intel® C and Fortran compilers to enable better use of vectorization
• The interface works directly with single- and double-precision NumPy arrays
• Natively supports multidimensional transforms
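As a rough illustration of the kind of code these optimizations target, the sketch below uses only standard NumPy calls: a matrix product, a linear solve, and a multidimensional FFT on single- and double-precision arrays. Nothing in it is Intel-specific. The array sizes and the seed are arbitrary choices for the example, and whether the calls reach oneMKL depends on how the installed NumPy was built.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# BLAS-backed matrix product on double-precision arrays
a = rng.random((256, 256))
b = rng.random((256, 256))
c = a @ b

# LAPACK-backed linear solve: find x such that a @ x == b[:, 0]
x = np.linalg.solve(a, b[:, 0])

# Multidimensional FFT on a single-precision array, then its inverse
img = rng.random((64, 64, 8), dtype=np.float32)
spectrum = np.fft.fftn(img)
roundtrip = np.fft.ifftn(spectrum).real

print(c.shape, x.shape, spectrum.shape)
print("max round-trip error:", float(np.abs(roundtrip - img).max()))
```

The same script should run unchanged on stock NumPy and on an MKL-backed build; only the timings are expected to differ.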
oneAPI Data Analytics Library (oneDAL)
Optimized building blocks for all stages of data analytics on Intel Architecture
GitHub: https://github.com/oneapi-src/oneDAL
Intel® oneAPI Data Analytics Library (oneDAL) Algorithms
Machine Learning
▪ Supervised learning
• Regression: Linear Regression, Ridge Regression, LASSO
• Decision Tree, Random Forest, Gradient Boosting
• Classification: AdaBoost, Brown/Logit Boosting, Naïve Bayes, Logistic Regression, kNN, SVM
▪ Unsupervised learning: K-Means Clustering, DBSCAN, EM for GMM
▪ Collaborative filtering: Alternating Least Squares
▪ Apriori
Legend (colour-coded on the slide): algorithms supporting Intel GPU (Gen 9 & Gen12) & dGPU; algorithms supporting batch processing; algorithms supporting batch and distributed processing
Intel® oneAPI Data Analytics Library (oneDAL) Algorithms
Data Transformation and Analysis
▪ Basic statistics for datasets: Low order moments, Quantiles, Order statistics
▪ Correlation and dependence: Cosine distance, Correlation distance, Variance-Covariance matrix
▪ Dimensionality reduction: PCA
▪ Matrix factorizations: SVD, QR, Cholesky
▪ Outlier detection: Univariate, Multivariate
▪ Association rule mining (Apriori)
▪ tSVD
▪ Optimization solvers (SGD, AdaGrad, lBFGS, CD)
▪ Math functions (exp, log, …)
Legend (colour-coded on the slide): algorithms supporting batch processing on Intel GPU (Gen 9 & Gen12) & dGPU; algorithms supporting batch processing; algorithms supporting batch, online and/or distributed processing
K-Means Using Scikit-learn and daal4py
▪ Scikit-learn
    from sklearn.cluster import KMeans
    import pandas as pd
    data = pd.read_csv("./kmeans.csv")                           # Load the data
    algo = KMeans(n_clusters=20, init='k-means++', max_iter=5)   # Configure K-means main object
    result = algo.fit(data)                                      # Compute the clusters and labels
    result.labels_                                               # Print the results
    result.cluster_centers_
▪ daal4py
    from daal4py import kmeans_init, kmeans
    import pandas as pd
    data = pd.read_csv("./kmeans.csv")                                      # Load the data
    init = kmeans_init(nClusters=20, method="plusPlusDense").compute(data)  # Compute initial centroids
    algo = kmeans(nClusters=20, maxIterations=5, assignFlag=True)           # Configure K-means main object
    result = algo.compute(data, init.centroids)                             # Compute the clusters and labels
    result.assignments                                                      # Print the results
    result.centroids
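The daal4py column splits K-means into two explicit steps: compute initial centroids, then iterate. The sketch below mirrors that split in plain NumPy so each step is visible. It is not the oneDAL implementation; the function names (`squared_distances`, `kmeans_pp_init`, `lloyd_iterations`, `inertia`) and the synthetic data are assumptions made for this example.

```python
import numpy as np


def squared_distances(data, centroids):
    """Squared Euclidean distance from every point to every centroid."""
    return ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)


def kmeans_pp_init(data, n_clusters, rng):
    """Step 1: k-means++ style seeding of the initial centroids."""
    centroids = [data[rng.integers(len(data))]]
    for _ in range(n_clusters - 1):
        # Distance of each point to its nearest already-chosen centroid
        d2 = squared_distances(data, np.asarray(centroids)).min(axis=1)
        # Sample the next centroid with probability proportional to d2
        centroids.append(data[rng.choice(len(data), p=d2 / d2.sum())])
    return np.asarray(centroids)


def lloyd_iterations(data, centroids, max_iterations):
    """Step 2: Lloyd's iterations; returns (centroids, assignments)."""
    centroids = centroids.copy()
    for _ in range(max_iterations):
        assignments = squared_distances(data, centroids).argmin(axis=1)
        for k in range(len(centroids)):
            members = data[assignments == k]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    assignments = squared_distances(data, centroids).argmin(axis=1)
    return centroids, assignments


def inertia(data, centroids):
    """Sum of squared distances to the nearest centroid."""
    return float(squared_distances(data, centroids).min(axis=1).sum())


# Synthetic stand-in for kmeans.csv: three Gaussian blobs in 2-D
rng = np.random.default_rng(0)
true_centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
data = np.concatenate([c + rng.normal(size=(100, 2)) for c in true_centers])

init_centroids = kmeans_pp_init(data, n_clusters=3, rng=rng)
centroids, assignments = lloyd_iterations(data, init_centroids, max_iterations=5)
print("inertia before/after:", inertia(data, init_centroids), inertia(data, centroids))
```

In terms of the slide, `kmeans_pp_init` plays the role of `kmeans_init(...).compute(data)` and `lloyd_iterations` the role of `kmeans(...).compute(data, init.centroids)`.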
scikit-learn
The most popular ML package for Python*
Intel Distribution for Python (IDP) Scikit-learn
Same Code, Same Behavior
▪ Common Scikit-learn (Scikit-learn mainline)
    from sklearn.svm import SVC
    X, y = get_dataset()
    clf = SVC().fit(X, y)
    res = clf.predict(X)
▪ Scikit-learn with Intel CPU opts
    import daal4py as d4p
    d4p.patch_sklearn()
    from sklearn.svm import SVC
    X, y = get_dataset()
    clf = SVC().fit(X, y)
    res = clf.predict(X)
• Scikit-learn, not scikit-learn-like
• Scikit-learn conformance (mathematical equivalence) defined by Scikit-learn Consortium, continuously vetted by public CI
Available through Intel conda (conda install daal4py -c intel)
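A runnable variant of the patching pattern might look like the sketch below. It assumes scikit-learn is installed, tries the `sklearnex` package (scikit-learn-intelex, which exposes `patch_sklearn` in current releases) and falls back to stock scikit-learn if that package is missing. The slide's `get_dataset()` is replaced by `make_classification` as a stand-in, so the numbers say nothing about the benchmarks in this deck.

```python
try:
    # Assumption: scikit-learn-intelex is installed and provides patch_sklearn
    from sklearnex import patch_sklearn
    patch_sklearn()
    patched = True
except ImportError:
    patched = False  # stock scikit-learn is used unchanged

# Import estimators only after patching, as on the slide
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for get_dataset()
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

clf = SVC().fit(X, y)
res = clf.predict(X)
accuracy = float((res == y).mean())
print(f"patched={patched}, training accuracy={accuracy:.3f}")
```

The estimator code after the import is identical in both branches, which is the point of the "Same Code, Same Behavior" claim.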
Intel optimized Scikit-Learn
Speedup of Intel® oneDAL powered Scikit-Learn over the original Scikit-Learn
Same Code, Same Behavior
• Scikit-learn, not scikit-learn-like
• Scikit-learn conformance (mathematical equivalence) defined by Scikit-learn Consortium, continuously vetted by public CI
Workload | Speedup
K-means fit, 1M x 20, k=1000 | 44.0
K-means predict, 1M x 20, k=1000 | 3.6
PCA fit, 1M x 50 | 4.0
PCA transform, 1M x 50 | 27.2
Random Forest fit, higgs1m | 38.3
Random Forest predict, higgs1m | 55.4
Ridge Reg fit, 10M x 20 | 53.4
Linear Reg fit, 2M x 100 | 91.8
LASSO fit, 9M x 45 | 50.9
SVC fit, ijcnn | 29.0
SVC predict, ijcnn | 95.3
SVC fit, mnist | 82.4
SVC predict, mnist | 221.0
DBSCAN fit, 500K x 50 | 17.3
train_test_split, 5M x 20 | 9.4
kNN predict, 100K x 20, class=2, k=5 | 131.4
kNN predict, 20K x 50, class=2, k=5 | 113.8
HW: Intel Xeon Platinum 8276L CPU @ 2.20GHz, 2 sockets, 28 cores per socket
Details: https://medium.com/intel-analytics-software/accelerate-your-scikit-learn-applications-a06cacf44912
Available algorithms
▪ Accelerated IDP Scikit-learn algorithms:
• Linear/Ridge Regression
• Logistic Regression
• ElasticNet/LASSO
• PCA
• K-means
• DBSCAN
• SVC
• train_test_split(), assume_all_finite()
• Random Forest Regression/Classification - DAAL 2020.3
• kNN (kd-tree and brute force) - DAAL 2020.3
Demo
XGBoost
Gradient Boosting - overview
• Gradient Boosting:
  • Boosting algorithm (Decision Trees as base learners)
  • Solves many types of ML problems (classification, regression, learning to rank)
  • Highly accurate, widely used by Data Scientists
  • Compute-intensive workload
  • Known implementations: XGBoost*, LightGBM*, CatBoost*, Intel® DAAL, …
[Figure: boosting diagram with three stages, each labeled "Error"]
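To make the idea of boosting with tree base learners concrete, here is a minimal sketch for squared-error regression: each round fits a one-split decision stump to the current error (the residual) and adds a damped copy of it to the model. It is a toy illustration, not how XGBoost*, LightGBM* or oneDAL implement boosting; the data, the learning rate and the helper names `fit_stump` and `predict_stump` are assumptions.

```python
import numpy as np


def fit_stump(x, residual):
    """Least-squares regression stump (single split) on a 1-D feature."""
    best = None
    for threshold in np.unique(x)[:-1]:  # excluding the max keeps both sides non-empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)


# Synthetic regression problem: noisy sine wave
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 6.0, size=200))
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

learning_rate = 0.5
prediction = np.full_like(y, y.mean())       # round 0: constant model
errors = [float(np.mean((y - prediction) ** 2))]
for _ in range(20):
    residual = y - prediction                # negative gradient of squared error
    stump = fit_stump(x, residual)           # base learner fitted to the error
    prediction += learning_rate * predict_stump(stump, x)
    errors.append(float(np.mean((y - prediction) ** 2)))

print("MSE after 0, 5 and 20 rounds:", errors[0], errors[5], errors[20])
```

The training error never increases from one round to the next here, because each stump is a least-squares fit to the residual and the learning rate is below 2.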
DMLC XGBoost* Acceleration
▪ Intel® contributed 3 pull requests to the XGBoost* project on GitHub* during the year
▪ Goal: performance optimizations of 'hist' mode for Intel® CPUs
XGBoost training improvements:
Metric | Library version | Airline-OHE, 4.69M
Train time, s | XGBoost 0.81 | 4481
Train time, s | XGBoost 1.2.0 | 243
Accuracy | XGBoost 0.81 | 0.841544
Accuracy | XGBoost 1.2.0 | 0.842981
Speedup: 18.4
Workload description: the Airline dataset was preprocessed with one-hot encoding (OHE); after a random permutation, the first 7M rows were selected and split into train and test parts (70%-30%).
HW: 2 x Intel® Xeon Gold 6230R @ 26 cores, OS: CentOS Linux 8 (Core), 193 GB RAM.
SW: XGBoost 1.2 and 0.81 from the xgboost pip channel; compiler: G++ 7.4; Intel DAAL 2020.3, downloaded from conda. Python env: Python 3.7, NumPy 1.18.5, Pandas 0.25.3, scikit-learn 0.23.2.
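The reported speedup is the ratio of the two training times. The short check below recomputes it, along with the accuracy difference, from the figures in the table; the variable names are illustrative.

```python
# Figures copied from the table above
train_time_s = {"XGBoost 0.81": 4481, "XGBoost 1.2.0": 243}
accuracy = {"XGBoost 0.81": 0.841544, "XGBoost 1.2.0": 0.842981}

speedup = train_time_s["XGBoost 0.81"] / train_time_s["XGBoost 1.2.0"]
accuracy_delta = accuracy["XGBoost 1.2.0"] - accuracy["XGBoost 0.81"]

print(f"speedup: {speedup:.1f}x")                 # speedup: 18.4x
print(f"accuracy change: {accuracy_delta:+.6f}")  # accuracy change: +0.001437
```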
XGB and LGBM prediction acceleration
daal4py Gradient Boosting Model Converters
XGBoost:
    import daal4py as d4p
    xgb_model = xgb.train(params, X_train)                   # Train an XGBoost model as usual
    daal_model = d4p.get_gbt_model_from_xgboost(xgb_model)   # Convert the XGBoost model to a DAAL model
    daal_prediction = d4p.gbt_classification_prediction(…).compute(X_test, daal_model)  # Make fast predictions with DAAL
LGBM:
    import daal4py as d4p
    lgb_model = lgb.train(params, X_train)                   # Train an LGBM model as usual
    daal_model = d4p.get_gbt_model_from_lightgbm(lgb_model)  # Convert the LGBM model to a DAAL model
    daal_prediction = d4p.gbt_classification_prediction(…).compute(X_test, daal_model)  # Make fast predictions with DAAL
Convert an already trained XGB/LGBM model to speed up prediction without losing accuracy.
Prediction time, s:
Dataset | LGBM | LGBM + daal4py | Speedup | XGB | XGB + daal4py | Speedup
Higgs | 9.156 | 0.728 | 12.6 | 5.514 | 0.7 | 7.9
Mortgage | 9.156 | 0.728 | 12.6 | 5.514 | 0.7 | 7.9
MSRank | 0.857 | 0.111 | 7.7 | 0.934 | 0.121 | 7.7
Accuracy/MSE:
Dataset | LGBM | LGBM + daal4py | XGB | XGB + daal4py
Higgs | 0.75626 | 0.75626 | 0.75828 | 0.75828
Mortgage | 0.49061 | 0.49061 | 0.4879 | 0.4879
MSRank | 0.57101 | 0.57101 | 0.57177 | 0.57177
Demo
Intel® Distribution for Python Architecture
[Figure: architecture stack, extended with Numba and SDC. Interface: command line (> python script.py), scientific environments, developer environments. Language: CPython (GIL), with Numba (LLVM IR) and C++ extensions that release the GIL. Python packages (numerical, parallelism, dataframe): SDC, daal4py, tbb4py, smp, mpi4py. Native technologies: oneMKL, DPC++, oneDAL, TBB, iomp, impi. Legend: Intel technology, community technology.]
Intel® Scalable DataFrame Compiler
Just import Numba and use decorator
▪ Extension for Numba* to accelerate AI workflows
▪ Supports more data types (Series, DataFrames, ASCII/Unicode strings)
▪ Compiler, not a library
▪ Scales from laptops to multi-core servers
▪ Open-source project
GitHub page: https://github.com/IntelPython/sdc
Documentation: https://intelpython.github.io/sdc-doc/latest/index.html
▪ Available as conda package and pip wheels
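The "import Numba and use a decorator" pattern can be sketched with plain Numba on a NumPy array, as below. This does not exercise SDC itself (its Series/DataFrame support is not used here); the function `rolling_mean` is a made-up example, and the no-op fallback decorator is an assumption that only keeps the snippet runnable when Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    def njit(func):
        """Fallback (assumption for this sketch): run as plain Python."""
        return func


@njit
def rolling_mean(values, window):
    """Mean over a sliding window, written as an explicit loop."""
    out = np.empty(values.size - window + 1)
    acc = 0.0
    for i in range(values.size):
        acc += values[i]
        if i >= window:
            acc -= values[i - window]   # drop the element leaving the window
        if i >= window - 1:
            out[i - window + 1] = acc / window
    return out


prices = np.arange(10, dtype=np.float64)
print(rolling_mean(prices, 3))  # window means 1.0 through 8.0
```

The function body is ordinary Python; the decorator is the only change needed to have Numba compile it.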
Intel® SDC
Speedup, SDC vs. Pandas (run_etl)
Threads | Speedup
1 | 1.6991
4 | 3.3001
20 | 10.9496
40 | 14.5491
Intel® Xeon™ Gold 6248 CPU @ 2.50GHz, 2x20 cores
Numba* 0.51.2, Pandas* 1.0.5, SDC 0.37.0
Demo
Modin
▪ Usable and Scalable
▪ To use Modin, replace the pandas import
[Figure: a Pandas DataFrame in memory with four CPUs, labeled "Idle cores", next to a Modin DataFrame in memory with four CPUs, labeled "Full utilization"]
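Replacing the import can be sketched as follows. The snippet assumes pandas is installed, uses Modin only if it is available, and works on a tiny made-up table, so it shows the API swap rather than any speedup.

```python
try:
    import modin.pandas as pd   # drop-in replacement, if Modin is installed
    backend = "modin"
except ImportError:
    import pandas as pd         # otherwise fall back to stock pandas
    backend = "pandas"

# Hypothetical data; the column names are invented for this example
df = pd.DataFrame({
    "education_years": [10, 12, 16, 12, 18],
    "income": [20.0, 28.0, 52.0, 30.0, 75.0],
})
summary = df.groupby("education_years")["income"].mean()

print("backend:", backend)
print(summary)
```

Everything after the import is written against the pandas API, so the rest of the script does not need to know which backend was picked.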
Modin
Execution time, Pandas vs. Modin[ray]
Library | Time, s
Pandas | 340.0729
Modin | 31.2453
Speedup: 10.8
Intel® Xeon™ Gold 6248 CPU @ 2.50GHz, 2x20 cores
▪ Dataset size: 2.4GB
End-to-End Data Pipeline Acceleration
▪ Workload: Train a model using 50 years of Census data from IPUMS.org to predict income based on education
▪ Solution: Intel Modin for data ingestion and ETL, daal4py and Intel scikit-learn for model training and prediction
▪ Perf Gains:
• Read_CSV (read from disk and store as a dataframe): 6x
• ETL operations: 38x
• Train Test Split: 4x
• ML training (fit & predict) with Ridge Regression: 21x
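The four measured stages can be sketched end to end as below, using stock pandas and scikit-learn (both assumed to be installed). The data is synthetic and the column names `EDUC` and `INCTOT` are placeholders chosen for this example, so the script illustrates the shape of the pipeline only, not the census workload or its timings.

```python
import io

import numpy as np
import pandas as pd  # the slide's solution swaps this import for Modin
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the census CSV (synthetic values)
rng = np.random.default_rng(0)
education = rng.integers(8, 21, size=400)
income = 2.5 * education + rng.normal(scale=3.0, size=400)
csv_text = pd.DataFrame({"EDUC": education, "INCTOT": income}).to_csv(index=False)

# 1. Read_CSV: read the file and store it as a dataframe
df = pd.read_csv(io.StringIO(csv_text))

# 2. ETL: drop missing rows and non-positive incomes
df = df.dropna()
df = df[df["INCTOT"] > 0]

# 3. Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    df[["EDUC"]], df["INCTOT"], test_size=0.25, random_state=0
)

# 4. ML training (fit & predict) with Ridge Regression
model = Ridge().fit(X_train, y_train)
predictions = model.predict(X_test)
r2 = model.score(X_test, y_test)
print(f"rows={len(df)}, test R^2={r2:.3f}")
```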
For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
See backup for configuration details.
Envision a GPU-enabled Python Library Ecosystem
Data Parallel Python: Unified Python Offload Programming Model
Extending PyData ecosystem for XPU
    with device_context("gpu"):
        a_dparray = dpnp.random.random(1024, 3)
        X_dparray = numba.njit(compute_embedding)(a_dparray)
        res_dparray = daal4py.kmeans().compute(X_dparray)
▪ Optimized Packages for Intel CPUs & GPUs: numpy → dpnp
▪ Jit Compilation
▪ Unified Data & Execution Infrastructure: ndarray → dparray, host memory → unified shared mem, CPU → XPU; zero-copy USM array interface; common device execution queues
▪ DPC++ RUNTIME: OpenCL, Level 0, CUDA
New Additions to Numba's Language Design
▪ @dppy.kernel: explicit kernels, low-level kernel programming for expert ninjas
    import dpctl
    import numba_dppy as dppy
    import numpy as np

    @dppy.kernel
    def sum(a, b, c):
        i = dppy.get_global_id(0)
        c[i] = a[i] + b[i]

    a = np.ones(1024, dtype=np.float32)
    b = np.ones(1024, dtype=np.float32)
    c = np.zeros_like(a)
    with dpctl.device_context("gpu"):
        sum[1024, dppy.DEFAULT_LOCAL_SIZE](a, b, c)
▪ @njit: NumPy-based array programming, auto-offload, high-productivity
    from numba import njit
    import numpy as np
    import dpctl

    @njit
    def f1(a, b):
        c = a + b
        return c

    a = np.ones(1024, dtype=np.float32)
    b = np.ones(1024, dtype=np.float32)
    with dpctl.device_context("gpu"):
        c = f1(a, b)
Seamless interoperability and sharing of resources
• Different packages share same execution context
• Data can be exchanged without extra copies and kept on the device
    import dpctl, numba, dpnp, daal4py

    @numba.njit
    def compute(a):  # Numba function
        ...

    with dpctl.device_context("gpu"):
        a_dparray = dpnp.random.random(1024, 3)
        X_dparray = compute(a_dparray)
        res_dparray = daal4py.kmeans().compute(X_dparray)  # daal4py function
Portability Across Architectures
    import numba
    import numpy as np
    import math

    @numba.vectorize(nopython=True)
    def cndf2(inp):
        out = 0.5 + 0.5 * math.erf((math.sqrt(2.0) / 2.0) * inp)
        return out

    @numba.njit(parallel={"offload": True}, fastmath=True)
    def blackscholes(sptprice, strike, rate, volatility, timev):
        logterm = np.log(sptprice / strike)
        powterm = 0.5 * volatility * volatility
        den = volatility * np.sqrt(timev)
        d1 = (((rate + powterm) * timev) + logterm) / den
        d2 = d1 - den
        NofXd1 = cndf2(d1)
        NofXd2 = cndf2(d2)
        futureValue = strike * np.exp(-rate * timev)
        c1 = futureValue * NofXd2
        call = sptprice * NofXd1 - c1
        put = call - futureValue + sptprice
        return put

    # Runs on CPU by default
    blackscholes(...)

    # Runs on GPU
    with dpctl.device_context("gpu"):
        blackscholes(...)

    # In future
    with dpctl.device_context("cuda:gpu"):
        blackscholes(...)
Scikit-Learn on XPU
SAME NUMERIC BEHAVIOR as defined by Scikit-learn Consortium & continuously validated by CI
▪ Stock on Host:
    from sklearn.svm import SVC
    X, y = get_dataset()
    clf = SVC().fit(X, y)
    res = clf.predict(X)
▪ Optimized on Host:
    import daal4py as d4p
    d4p.patch_sklearn()
    from sklearn.svm import SVC
    X, y = get_dataset()
    clf = SVC().fit(X, y)
    res = clf.predict(X)
▪ Offload to XPU:
    import daal4py as d4p
    d4p.patch_sklearn()
    import dpctl
    from sklearn.svm import SVC
    X, y = get_dataset()
    with dpctl.device_context("gpu"):
        clf = SVC().fit(X, y)
        res = clf.predict(X)
Installing Intel® Distribution for Python* 2021
▪ Anaconda.org (https://anaconda.org/intel/packages)
    > conda create -n idp -c intel intelpython3_core python=3.x
    > conda activate idp
    > conda install intel::numpy
▪ YUM/APT
    https://software.intel.com/content/www/us/en/develop/articles/installing-intel-free-libs-and-python-apt-repo.html
    https://software.intel.com/content/www/us/en/develop/articles/installing-intel-free-libs-and-python-yum-repo.html
▪ Docker Hub
    docker pull intelpython/intelpython3_full
▪ oneAPI
    https://software.intel.com/content/www/us/en/develop/tools/oneapi/ai-analytics-toolkit.html
▪ Standalone Installer
    https://software.intel.com/content/www/us/en/develop/articles/oneapi-standalone-components.html#python
▪ PyPI (+ Intel library Runtime packages, + Intel development packages)
    > pip install intel-numpy
    > pip install intel-scipy
    > pip install mkl_fft
    > pip install mkl_random
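After installing, one quick way to see which numerical backend NumPy was built against is to inspect its build configuration. The sketch below captures the text printed by `np.show_config()` and looks for the string "mkl". This is a heuristic, not an official check, and the helper name `numpy_build_mentions_mkl` is made up for this example.

```python
import contextlib
import io

import numpy as np


def numpy_build_mentions_mkl():
    """Return True if NumPy's printed build configuration mentions MKL."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        np.show_config()
    return "mkl" in buffer.getvalue().lower()


print("NumPy version:", np.__version__)
print("Build configuration mentions MKL:", numpy_build_mentions_mkl())
```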
Get the Most from Your Code Today with Intel® Tech.Decoded
Visit TechDecoded.intel.io to learn how to put key optimization strategies into practice with Intel development tools.
▪ Big Picture Videos: Discover Intel's vision for key development areas.
▪ Essential Webinars: Gain strategies, practices and tools to optimize application and solution performance.
▪ Quick Hit How-To Videos: Learn how to do specific programming tasks using Intel® tools.
TOPICS: Visual Computing, Code Modernization, Systems & IoT, Data Science, Data Center & Cloud
More Resources
Intel® Distribution for Python
• Product page – overview, features, FAQs…
• Training materials – movies, tech briefs, documentation, evaluation guides…
• Support – forums, secure support…
• Machine Learning Benchmarks
• https://github.com/oneapi-src/oneAPI-samples/tree/master/AI-and-Analytics
• https://github.com/IntelPython/scikit-learn_bench
Thank you