Vector Engine and AI: The NEC SX-Aurora TSUBASA
Dr. Erich Focht, NEC Deutschland GmbH, April 2019
NEC Vector Supercomputers: High Sustained Performance

[Timeline figure: generations of NEC SX vector processors, with memory bandwidth per FLOP shrinking from 8.23 to 1 Bytes/FLOP. Annotations:]
• First system >1 GFLOPS (Tadashi Watanabe); 8.23 Bytes/FLOP
• Multi-lane pipelines; CMOS, air cooled; 8 Bytes/FLOP
• Single-chip vector processor (Earth Simulator); 4 Bytes/FLOP
• Vector caches (ADB); 2 Bytes/FLOP
• Multi-core vector SoC; 1 Byte/FLOP; best HPCG efficiency (10%)
Frovedis: FRamework Of VEctorized and DIStributed data analytics

▌C++ framework similar to Spark
• Supports a Spark/Python interface
▌MPI is used for high-performance communication
▌Optimized for SX-Aurora TSUBASA (also works on x86)
▌Open source: github.com/frovedis

[Stack diagram: Spark / Python Interface on top; Matrix Library, Machine Learning, and DataFrame components; Frovedis Core at the bottom]
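To give a flavor of the Spark-like core API, here is a minimal sketch in the style of the Frovedis tutorial's distributed-vector example; the umbrella header and exact function names are assumptions reproduced from the tutorial and should be checked against github.com/frovedis.

    #include <vector>
    #include <frovedis.hpp>   // assumed umbrella header, as in the tutorial

    // Function applied element-wise to the distributed vector.
    int two_times(int i) { return i * 2; }

    int main(int argc, char* argv[]) {
      frovedis::use_frovedis use(argc, argv);        // starts the MPI workers
      std::vector<int> v = {1, 2, 3, 4, 5, 6, 7, 8};
      auto dv = frovedis::make_dvector_scatter(v);   // distribute across ranks
      auto doubled = dv.map(two_times);              // Spark-like map
      auto r = doubled.gather();                     // collect back on rank 0
      return 0;
    }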
Machine Learning Library

▌Implemented with Frovedis Core and Matrix Library
▌Supports both dense and sparse data
• Sparse data support is important in large-scale machine learning
▌Supported algorithms:
• Linear model: Logistic Regression, Multinomial Logistic Regression, Linear Regression, Linear SVM
• ALS
• K-means
• Preprocessing: SVD, PCA
• Word2vec
• Factorization Machines
• Decision Tree
• Naïve Bayes
• Graph algorithms: Shortest Path, PageRank, Connected Components
▌Under development:
• Frequent Pattern Mining
• Spectral Clustering
• Hierarchical Clustering
• Latent Dirichlet Allocation
• Deep Learning (MLP, CNN)
• Random Forest
• Gradient Boosting Decision Tree
▌We will support more!
DataFrame

▌Supports an interface similar to Spark DataFrame
• Select, Filter, Sort, Join, Group by/Aggregate (SQL interface is not supported yet)
▌Implemented as a distributed column store
• Each column is represented as a distributed vector
• Each operation only scans its argument columns; other columns are created when necessary (late materialization)
• Reduces the size of data to access

[Diagram: a table with columns A, B, C, D stored column-wise, each column partitioned across rank #0, rank #1, rank #2]
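To make the late-materialization idea concrete, the sketch below is a hypothetical mini column store, not the Frovedis dftable API: the filter scans only its predicate column and returns row indices, and any other column is touched only when a later step actually needs it.

    #include <cstddef>
    #include <iostream>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct ColumnStore {
      std::unordered_map<std::string, std::vector<int>> cols;

      // Filter reads ONLY the predicate column and returns row indices;
      // no other column is scanned or copied here (late materialization).
      std::vector<size_t> filter_gt(const std::string& col, int threshold) const {
        std::vector<size_t> idx;
        const auto& c = cols.at(col);
        for (size_t i = 0; i < c.size(); ++i)
          if (c[i] > threshold) idx.push_back(i);
        return idx;
      }

      // A column is materialized only when an operation actually needs it.
      std::vector<int> materialize(const std::string& col,
                                   const std::vector<size_t>& idx) const {
        const auto& c = cols.at(col);
        std::vector<int> out;
        out.reserve(idx.size());
        for (size_t i : idx) out.push_back(c[i]);
        return out;
      }
    };

    int main() {
      ColumnStore t;
      t.cols["A"] = {1, 5, 3, 7};
      t.cols["B"] = {10, 20, 30, 40};
      auto idx = t.filter_gt("A", 2);        // scans column A only
      auto b = t.materialize("B", idx);      // column B is read only now
      for (int v : b) std::cout << v << ' '; // prints: 20 30 40
      std::cout << '\n';
    }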
Spark / Python Interface

▌Writing C++ programs is sometimes tedious, so we created a wrapper interface to Spark
• Call the framework through the same Spark API
• Users do not have to be aware of the vector hardware
▌Implementation: created a server with the framework's functionality
• Receives RPC requests from Spark and executes the ML algorithm, etc.
• Only pre-built algorithms can be used from Spark
▌Other languages can also be supported by this architecture
• Currently Python is supported (scikit-learn API)
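Purely as an illustration of this server architecture (the request format and all names below are invented for the sketch, not the actual Frovedis RPC protocol), the dispatch pattern looks roughly like this:

    #include <iostream>
    #include <string>

    // Pretend RPC request: which pre-built algorithm to run, on which data.
    struct Request { std::string algorithm; std::string data_handle; };

    // Only pre-built, hand-vectorized algorithms can be dispatched.
    std::string run_prebuilt(const Request& req) {
      if (req.algorithm == "logistic_regression") return "model-handle-1";
      if (req.algorithm == "kmeans") return "model-handle-2";
      return "error: unknown algorithm";
    }

    int main() {
      // The real server would block on a socket; here we process one request.
      Request req{"kmeans", "dvector:docterm"};
      std::cout << run_prebuilt(req) << '\n';  // client gets a handle back
    }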
Performance Evaluation: Machine Learning

▌Xeon (Gold 6126) 1 socket vs 1 VE, with sparse data (w/o I/O)

[Bar chart: speedup relative to Spark/x86 (= 1), values as labeled on the bars]

    Workload   Spark/x86   Frovedis/x86   Frovedis/VE
    LR         1           10.6           113.2
    K-means    1           8.8            56.8
    SVD        1           5.3            42.8

▌LR uses CTR data provided by Criteo (1/4 of the original, 6 GB)
▌K-means and SVD use the Wikipedia doc-term matrix (10 GB)
▌Spark version: 2.2.1

Workloads:
▌Web ads optimization (Logistic Regression)
▌Document clustering (K-means)
▌Recommendation (Singular Value Decomposition)
Performance Evaluation: DataFrame

▌Evaluated with TPC-H SF-20
• Q1: group by/aggregate
• Q3: filter, join, group by/aggregate
• Q5: filter, join, group by/aggregate (larger join)
• Q6: filter, group by/aggregate

[Bar chart: speedup relative to Spark/x86 (= 1), values as labeled on the bars]

    Query   Spark/x86   Frovedis/x86   Frovedis/VE
    Q01     1           3.2            47.3
    Q03     1           10.1           34.8
    Q05     1           8.8            33.8
    Q06     1           5.8            10.6
TensorFlow for Aurora (Beta)

▌Hand-optimized Vector Engine DNN library: veDNN
▌Built with LLVM-VE
• Supports scalar code + vector intrinsics
• Region Vectorizer (RV) in guided mode (AKA "needs directives")
• https://sx-aurora.github.io/posts/Testing-LLVM-VE-RV-update/

[Side-by-side code example, truncated in transcription: for(int64_t i=0; …]
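Since the slide's loop example is cut off, the following is only a sketch of the kind of scalar loop such a library hand-optimizes: a plain element-wise ReLU whose loop a vectorizing compiler (or hand-written VE intrinsics) can map onto the VE's long vector registers. The function name and data are illustrative, not taken from veDNN.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Scalar reference loop; on the VE this maps naturally onto long vector
    // operations, via the Region Vectorizer or hand-written intrinsics.
    void relu(const float* in, float* out, int64_t n) {
      for (int64_t i = 0; i < n; ++i)
        out[i] = std::max(0.0f, in[i]);
    }

    int main() {
      std::vector<float> x = {-1.0f, 2.0f, -3.0f, 4.0f};
      std::vector<float> y(x.size());
      relu(x.data(), y.data(), static_cast<int64_t>(x.size()));
      return 0;  // y == {0, 2, 0, 4}
    }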
Why is this so difficult to optimize?

What data scientists see:

    x = Conv(x, kernel=1x1, bias=True)
    x = ReLU(x)
    x = AvgPooling(x, kernel=13x13)

What HPC people see:

    function(Conv):
        for(Batch, OutChannel, Y, X):
            for(InChannel, KernelY, KernelX):
                output[…] += input[…] * weight[…]
            output[…] += bias[…]

    function(ReLU):
        for(Batch, OutChannel, Y, X):
            output[…] = max(0, input[…])

    function(AvgPooling):
        for(Batch, OutChannel, Y, X):
            for(KernelY, KernelX):
                output[…] += input[…] / (13*13)
Why is this so difficult to optimize? (continued)

What we actually want:

    function(FusedNetwork):
        for(Batch, OutChannel):
            float N[…]
            for(Y, X):
                for(InChannel, KernelY, KernelX):
                    N[…] += input[…] * weight[…]
                N[…] += bias[…]
                N[…] = max(0, N[…])
            for(Y, X):
                for(KernelY, KernelX):
                    output[…] += N[…] / (13*13)
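To make the fused pseudocode concrete, here is a compilable, single-threaded C++ sketch of the fused 1x1 convolution + bias + ReLU + 13x13 average pooling. The batch loop is omitted, the memory layout is chosen for brevity, and H and W are assumed to be multiples of 13; none of this is taken from the actual TensorFlow-for-Aurora code.

    #include <algorithm>
    #include <vector>

    void fused(const std::vector<float>& input,   // [C_in][H][W]
               const std::vector<float>& weight,  // [C_out][C_in]
               const std::vector<float>& bias,    // [C_out]
               std::vector<float>& output,        // [C_out][H/13][W/13]
               int c_in, int c_out, int h, int w) {
      const int ph = h / 13, pw = w / 13;
      std::vector<float> n(h * w);                // per-channel scratch, stays hot
      for (int oc = 0; oc < c_out; ++oc) {
        // 1x1 conv + bias + ReLU, kept in the scratch buffer N
        for (int y = 0; y < h; ++y)
          for (int x = 0; x < w; ++x) {
            float acc = bias[oc];
            for (int ic = 0; ic < c_in; ++ic)
              acc += input[(ic * h + y) * w + x] * weight[oc * c_in + ic];
            n[y * w + x] = std::max(0.0f, acc);   // ReLU fused in
          }
        // 13x13 average pooling straight out of the scratch buffer
        for (int py = 0; py < ph; ++py)
          for (int px = 0; px < pw; ++px) {
            float acc = 0.0f;
            for (int ky = 0; ky < 13; ++ky)
              for (int kx = 0; kx < 13; ++kx)
                acc += n[(py * 13 + ky) * w + px * 13 + kx];
            output[(oc * ph + py) * pw + px] = acc / (13.0f * 13.0f);
          }
      }
    }

    int main() {
      const int c_in = 2, c_out = 1, h = 13, w = 13;
      std::vector<float> in(c_in * h * w, 1.0f), wgt(c_out * c_in, 0.5f);
      std::vector<float> b(c_out, 0.25f), out(c_out, 0.0f);
      fused(in, wgt, b, out, c_in, c_out, h, w);
      // Each pixel: ReLU(2 * 0.5 * 1 + 0.25) = 1.25; pooled average = 1.25
      return 0;
    }

The intermediate N never leaves the per-channel scratch buffer, which is exactly the point of the fusion: the three separately-written kernels each stream the whole tensor through memory, while the fused version does it once.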
Inference (128x Batched): Sol vs PyTorch v1.0.1

[Bar chart: execution time in ms (0-1000), comparing PyTorch 1.0.1 and Sol]