This blog post was originally published at Intel's website. It is reprinted here with the permission of Intel.
Yann LeCun (@ylecun) tweeted on April 9, 2019: "PyTorch acceleration baked into the latest generation of Intel Xeons. That will help speed up the 200 trillion predictions and 6 billion translations Facebook does every day." (https://t.co/2gM75pFvrC)
Intel® Deep Learning Boost (Intel® DL Boost) is a group of acceleration features introduced in our 2nd Generation Intel® Xeon® Scalable processors. It provides significant performance increases to inference applications built using leading deep learning frameworks such as PyTorch*, TensorFlow*, MXNet*, PaddlePaddle*, and Caffe*. [1]
Understanding Intel® Deep Learning Boost
Intel DL Boost follows a long history of Intel adding acceleration features to its hardware to increase the performance of targeted workloads. The initial Intel Xeon Scalable processors included a 512-bit-wide Fused Multiply Add (FMA) instruction as part of the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instruction set they introduced. This FMA instruction increased data parallelism and helped Intel Xeon Scalable processors deliver up to 5.7x higher performance for AI and deep learning applications. [2]
With Intel DL Boost, we build upon this foundation to further accelerate AI on Intel® architecture. The first of several innovations planned for Intel DL Boost are the Vector Neural Network Instructions (VNNI), which have two main benefits to deep learning applications:
- VNNI use a single instruction for deep learning computations that formerly required three separate instructions. As you would expect, using one instruction in place of three yields significant performance benefits; the sketch following this list shows what that fused operation computes.
- VNNI enable INT8 deep learning inference. INT8’s lower precision increases power efficiency by decreasing compute and memory bandwidth requirements. INT8 inference has produced significant performance benefits with little loss of accuracy. [3] Intel DL Boost and related tools we provide significantly ease the use of INT8 inference and are accelerating its adoption by the broader industry.
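To make the first bullet concrete, here is a minimal NumPy sketch of what a single VNNI instruction (VPDPBUSD) computes for one output lane: four unsigned 8-bit activations are multiplied element-wise by four signed 8-bit weights, the four products are summed, and the sum is accumulated into a 32-bit integer. This models only the arithmetic, not the vectorized hardware execution, and the values are arbitrary.

```python
import numpy as np

def vnni_dot_accumulate(acc, a_u8, w_s8):
    """Model one VNNI output lane: sum of four u8*s8 products added to an int32 accumulator."""
    products = a_u8.astype(np.int32) * w_s8.astype(np.int32)  # widen before multiplying
    return acc + products.sum(dtype=np.int32)

# Toy example: one lane of an INT8 dot product.
acc = np.int32(0)
activations = np.array([12, 200, 7, 255], dtype=np.uint8)   # unsigned 8-bit inputs
weights = np.array([3, -5, 19, -128], dtype=np.int8)        # signed 8-bit weights
acc = vnni_dot_accumulate(acc, activations, weights)
print(acc)  # 36 - 1000 + 133 - 32640 = -33471
```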
Intel DL Boost contributes to a theoretical peak speedup of 4x for INT8 inference on 2nd Gen Intel Xeon Scalable processors in comparison to FP32 inference. [4] The arithmetic is straightforward: a 512-bit register holds 64 INT8 values versus 16 FP32 values, so each VNNI instruction can perform four times as many multiply-accumulate operations as an FP32 FMA. Further, with 56 cores per socket, we are predicting that Intel Xeon Platinum 9200 processors will deliver up to twice the performance of Intel Xeon Platinum 8200 processors. [5] In fact, Intel Xeon Platinum 9282 processors recently demonstrated the ability to exceed the performance of NVIDIA* Tesla* V100 on ResNet-50 inference.
You can learn more about the technical details of VNNI from our recent Intel.ai blog on the topic.
Why Intel DL Boost?
The rapid proliferation of AI inference services, the need for these services to render results quickly, and the tendency for increasingly complex deep learning applications to be processor-intensive are helping drive unprecedented demand for high-performance, low-latency compute. It is often easiest and most efficient to meet this demand with IT infrastructure already in place or already familiar – the Intel Xeon Scalable processor-based systems trusted for so many other workloads. Fortunately, as customers and researchers have shown time and again, Intel architecture makes a highly performant platform for AI inference.
With Intel DL Boost, Intel architecture and 2nd Gen Intel Xeon Scalable processors are a more capable AI inference platform than ever before. Even better, Intel DL Boost’s innovation will continue in the next generation of Intel Xeon Scalable processors, in which we will introduce support for the bfloat16 floating-point format. Look for more information on this soon.
Getting Started with Intel DL Boost
We have been working with the AI community to optimize the most popular open source deep learning frameworks for Intel DL Boost to help developers benefit from the performance and efficiency gains it provides.
Developers can use tools Intel offers to convert an FP32 trained model to an INT8 quantized model. This new INT8 model benefits from Intel DL Boost acceleration when it is used for inference in place of the earlier FP32 model and run on 2nd Gen Intel Xeon Scalable processors. As additional support, Intel also provides a Model Zoo, which includes INT8 quantized versions of many pre-trained models, such as ResNet101, Faster-RCNN, and Wide&Deep. We hope these models and tools get you up and running with Intel DL Boost more quickly.
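For intuition about what that conversion involves, here is a minimal NumPy sketch of one common post-training scheme: symmetric per-tensor quantization with an INT8 matmul and INT32 accumulation, followed by dequantization. This is a simplified illustration of the arithmetic the conversion tools automate, not the specific algorithm Intel's tools implement, and the layer shapes are arbitrary.

```python
import numpy as np

def quantize_symmetric_int8(x):
    """Map an FP32 tensor to INT8 with a single symmetric scale factor."""
    max_abs = np.abs(x).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

# FP32 weights and activations for a hypothetical fully connected layer.
rng = np.random.default_rng(0)
weights_fp32 = rng.standard_normal((64, 128)).astype(np.float32)
acts_fp32 = rng.standard_normal(128).astype(np.float32)

w_q, w_scale = quantize_symmetric_int8(weights_fp32)
a_q, a_scale = quantize_symmetric_int8(acts_fp32)

# INT8 matmul with INT32 accumulation (the pattern VNNI accelerates), then dequantize.
out_int32 = w_q.astype(np.int32) @ a_q.astype(np.int32)
out_fp32 = out_int32 * (w_scale * a_scale)

# The quantized result should closely track the original FP32 computation.
print(np.abs(out_fp32 - weights_fp32 @ acts_fp32).max())
```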
Intel DL Boost in Action
Several earlier intel.ai blogs provide more detail on Intel DL Boost integrations into popular deep learning frameworks and on the results customers are seeing from early use of these optimizations.
TensorFlow: Our blog on accelerating TensorFlow inference explains how to use the Intel AI Quantization Tools for TensorFlow to convert a pre-trained FP32 model to a quantized INT8 model. Several pre-trained INT8 quantized models for TensorFlow are included in the Intel Model Zoo in categories like image recognition, object detection, and recommendation systems. Dell EMC has reported a greater than 3x performance improvement over the initial Intel Xeon Scalable processors using our pre-trained INT8 ResNet-50 model and 2nd Gen Intel Xeon Scalable processors with Intel DL Boost. [6]
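Once you have an INT8 frozen graph, whether produced by the quantization tools or downloaded from the Model Zoo, running it looks like ordinary TensorFlow 1.x graph execution. A minimal sketch follows; the graph file name and tensor names are placeholders, and the documentation shipped with each Model Zoo model lists the actual values.

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x API, as used by the Model Zoo at the time

# Placeholders -- substitute the values documented with the model you downloaded.
GRAPH_FILE = "resnet50_int8_graph.pb"
INPUT_TENSOR = "input:0"
OUTPUT_TENSOR = "predict:0"

# Load the frozen INT8 graph and import it into a fresh Graph.
graph_def = tf.GraphDef()
with tf.gfile.GFile(GRAPH_FILE, "rb") as f:
    graph_def.ParseFromString(f.read())

graph = tf.Graph()
with graph.as_default():
    tf.import_graph_def(graph_def, name="")

# Run inference on a dummy batch.
with tf.Session(graph=graph) as sess:
    images = np.random.rand(1, 224, 224, 3).astype(np.float32)
    preds = sess.run(OUTPUT_TENSOR, feed_dict={INPUT_TENSOR: images})
    print(preds.shape)
```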
PyTorch: Intel and Facebook have partnered to increase PyTorch performance with Intel DL Boost and other optimizations. With Intel DL Boost and 2nd Gen Intel Xeon Scalable processors, we have found up to 7.7x performance gains for an FP32 model and up to 19.5x gains for an INT8 model when running ResNet-50 inference. [7] As a result of this collaboration, Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) optimizations are integrated directly into the PyTorch framework, enabling optimization of PyTorch models with minimal code changes. We’ve provided a blog that should be a good way to get started with PyTorch on 2nd Gen Intel Xeon Scalable processors.
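Because the Intel MKL-DNN kernels sit behind PyTorch's standard CPU operators, an ordinary inference script picks them up automatically; nothing Intel-specific appears in user code. A minimal sketch with a torchvision ResNet-50 and a dummy batch:

```python
import torch
import torchvision

# Standard FP32 ResNet-50 inference on CPU. With the MKL-DNN integration described
# above, supported operators (convolutions, etc.) dispatch to the optimized kernels
# without any changes to the model definition.
model = torchvision.models.resnet50(pretrained=True)
model.eval()

images = torch.randn(1, 3, 224, 224)  # dummy input batch

with torch.no_grad():
    logits = model(images)

print(logits.argmax(dim=1))
```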
Apache MXNet: The Apache MXNet community has delivered quantization approaches to enable INT8 inference and use of VNNI. We have found 3.0x performance for ResNet-50 with an INT8 model, Intel DL Boost, and 2nd Gen Intel Xeon Scalable processors in comparison to an FP32 model on Intel Xeon Scalable processors. [8] iFLYTEK, which is leveraging 2nd Gen Intel Xeon Scalable processors and Intel® Optane™ SSDs for its AI applications, has reported that Intel DL Boost has delivered similar or better performance in comparison to inference on alternative architectures.
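As a hedged sketch of the MXNet contrib quantization path (MXNet 1.x), the snippet below loads an FP32 checkpoint and emits an INT8-quantized symbol and parameters. The checkpoint prefix is a placeholder, and for best accuracy the MXNet documentation recommends supplying representative calibration data with calib_mode="entropy" rather than skipping calibration as done here.

```python
import mxnet as mx
from mxnet.contrib.quantization import quantize_model

# Placeholder checkpoint prefix -- substitute your own trained FP32 model.
sym, arg_params, aux_params = mx.model.load_checkpoint("resnet50_v1", 0)

# Quantize to INT8. calib_mode="none" skips the calibration pass for brevity;
# a real deployment would pass a DataIter of representative images and use
# calib_mode="entropy" as described in the MXNet quantization docs.
qsym, qarg_params, qaux_params = quantize_model(
    sym=sym,
    arg_params=arg_params,
    aux_params=aux_params,
    ctx=mx.cpu(),
    calib_mode="none",
    quantized_dtype="int8",
)

mx.model.save_checkpoint("resnet50_v1_int8", 0, qsym, qarg_params, qaux_params)
```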
PaddlePaddle: Intel and Baidu have collaborated since 2016 to optimize PaddlePaddle performance for Intel architecture. We’ve provided an in-depth overview of INT8 support in PaddlePaddle. In Intel’s testing, INT8 inference delivered 2.8x higher throughput for ResNet-50 v1.5 with just a 0.4% accuracy loss in comparison to the earlier FP32 model. [9]
Intel Caffe: Our Intel Caffe GitHub Wiki explains how to use our Calibrator accuracy tool and low-precision inference to improve performance without losing accuracy. JD.com collaborated with Intel engineers to use Intel DL Boost to increase the performance of a text detection application by 2.4x with no accuracy degradation in comparison to an earlier FP32 model. [10]
Enabling Cutting-Edge AI on Intel® Architecture
With all-new software libraries and optimizations, coupled with hardware innovation, CPUs have never been more performant for AI than they are today. My team and I look forward to continuing to deliver these AI breakthroughs on Intel architecture. To follow our work, please stay tuned to intel.ai and follow Intel AI on Twitter at @IntelAI and @IntelAIResearch.
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available security updates. See configuration disclosures for details. No product or component can be absolutely secure.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.
Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice Revision #20110804
Intel® Advanced Vector Extensions (Intel® AVX)* provides higher throughput to certain processor operations. Due to varying processor power characteristics, utilizing AVX instructions may cause a) some parts to operate at less than the rated frequency and b) some parts with Intel® Turbo Boost Technology 2.0 to not achieve any or maximum turbo frequencies. Performance varies depending on hardware, software, and system configuration and you can learn more at http://www.intel.com/go/turbo.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.
Huma Abidi
Director of Machine Learning/Deep Learning Software Engineering, Intel