This blog post was originally published at Geisel Software’s website. It is reprinted here with the permission of Geisel Software.
How to optimize and accelerate GPUs: tools, techniques, and real-world scenarios
Machine learning (ML) systems analyze tremendous amounts of data to identify hidden patterns and make predictions based on those patterns. This requires a very high level of parallel processing. And if these systems are meant to be used in real-time scenarios, the speed of that processing is critical to the feasibility of the application. Modern advancements in GPU computing technology are opening up new realms of possibility for computationally intensive, real-time applications. Emerging applications of AI and ML, such as machine vision and three-dimensional graphical processing, are seeing some of the greatest benefits of accelerated computing.
This article will cover how to optimize and accelerate GPUs:
- Parallel processing on CPUs, including SIMD instructions on multiple cores.
- High-performance parallel processing on Graphics Processing Units (GPUs) with native software kernels.
- A typical machine learning product design scenario with GPU acceleration.
- Tips for writing high-performance GPU algorithms.
- Conclusions and takeaways.
Parallelizing CPUs
When talking about parallel computing, we typically think of GPUs and their vast array of independent cores. Historically, CPUs have not had strong, innate parallel computing abilities, but CPU features like SIMD instruction sets make it possible to perform parallel computing on CPUs. While SIMD-equipped CPUs might not come close to the parallel capacity of modern GPUs, they can still offer great performance improvements over a CPU that is processing serially. CPU parallelization can also work hand-in-hand with GPU optimizations to fully utilize available computing resources.
Intel’s AVX-512 SIMD instruction set, for example, operates on 512-bit registers, allowing a single instruction to process up to 64 eight-bit values (or 16 single-precision floats) at once instead of one at a time. SIMD is also relatively lightweight to adopt: it is simply a set of CPU instructions and does not require extra libraries or additional memory to operate.
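As a concrete illustration, here is a minimal sketch of element-wise addition using AVX-512 intrinsics. It assumes a CPU with AVX-512F support, an element count that is a multiple of 16, and a compiler flag such as -mavx512f; the function name is illustrative.

```cpp
#include <immintrin.h>
#include <cstddef>

// Adds two float arrays element-wise, 16 floats (512 bits) per instruction.
void add_avx512(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 16) {
        __m512 va = _mm512_loadu_ps(a + i);   // load 16 floats from a
        __m512 vb = _mm512_loadu_ps(b + i);   // load 16 floats from b
        __m512 vc = _mm512_add_ps(va, vb);    // 16 additions in one instruction
        _mm512_storeu_ps(out + i, vc);        // store 16 results
    }
}
```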
The parallelism of CPUs can be taken one step further by using multithreading and SIMD together to scale up parallel operations on multicore CPUs. One library that can be used to add this kind of parallelism on the CPU is oneTBB, Intel’s oneAPI Threading Building Blocks library. Its API provides a powerful, easy-to-use way to spread processing iterations over multiple threads and fully utilize the available CPU cores.
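For example, the hedged sketch below spreads the same element-wise work across all available cores with oneTBB's parallel_for. It assumes oneTBB is installed and linked (e.g. -ltbb), and the function name is illustrative; each worker thread's inner loop can then be vectorized with SIMD as shown above.

```cpp
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <cstddef>

void add_parallel(const float* a, const float* b, float* out, std::size_t n) {
    // oneTBB splits the index range into chunks and runs them on worker threads.
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, n),
        [=](const tbb::blocked_range<std::size_t>& r) {
            // Each thread handles its own chunk; this inner loop is a natural
            // candidate for SIMD, either auto-vectorized or via intrinsics.
            for (std::size_t i = r.begin(); i != r.end(); ++i) {
                out[i] = a[i] + b[i];
            }
        });
}
```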
Using CPU parallelization, it is often possible to accelerate algorithm performance by a factor of 2x to 5x, which can make a big difference for many applications. CPU parallelization can even be used to accelerate machine learning tasks for small IoT edge computing devices. For example, Intel has recently introduced the AVX-512 Vector Neural Network Instruction set (VNNI), which is designed to accelerate inner convolutional neural network loops. This allows the neural network to run in parallel using only CPU resources. Although neural nets are typically trained on powerful computers equipped with GPUs, these trained networks can be deployed to small IoT devices which can perform machine learning in the field.
What is a GPU?
While CPUs can be adapted to serve some parallel computing needs, GPUs, originally designed to process images and graphics, take parallelism to another level. Modern GPUs are now used as powerful, general-purpose computing machines, capable of processing enormous amounts of data in parallel. Each GPU core is capable of running independently, resulting in massive parallelism. While a modern multi-core SIMD CPU may be able to perform hundreds of operations in parallel, the power of a GPU is limited only by how many computing units it contains. Today’s GPUs contain thousands of independent computing units; for example, NVIDIA’s new RTX 4090 GPUs provide 16,384 cores.
GPUs provide programming environments that open up this power for use in general-purpose algorithms. NVIDIA GPUs use the CUDA (Compute Unified Device Architecture) kernel API, which acts as the interface to the GPU. The CUDA environment is optimized for NVIDIA’s native GPU hardware, resulting in high throughput.
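As a small taste of what this looks like, here is a minimal sketch of a CUDA kernel and its launch: each GPU thread handles one array element, so thousands of elements are processed at once. Error checking is omitted for brevity and the names are illustrative; NVIDIA's introductory tutorial linked at the end of this article walks through the same pattern in detail.

```cuda
#include <cuda_runtime.h>

// Each GPU thread processes exactly one element of the input arrays.
__global__ void vectorAdd(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {
        out[i] = a[i] + b[i];
    }
}

// Host-side launch: enough 256-thread blocks to cover all n elements.
void launchVectorAdd(const float* d_a, const float* d_b, float* d_out, int n) {
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vectorAdd<<<blocks, threads>>>(d_a, d_b, d_out, n);
    cudaDeviceSynchronize();  // wait for the GPU to finish
}
```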
GPUs are being used in many industries to perform an array of accelerated computing tasks like machine vision processing, neural networking algorithms, graphical processing, autonomous navigation, swarm robotics, 3D point cloud processing, and more.
GPU Acceleration in a Machine Learning Application
A useful application of GPU acceleration is in machine learning and vision problems. For this example, let’s say we’re trying to create the perfect, autonomous grocery store checkout kiosk that can scan and bag groceries for customers.
A hypothetical solution to this problem uses an overhead camera to create a 2D color image of everything that passes along the conveyor belt, in conjunction with depth-sensing cameras or a 3D LiDAR sensor to create a full 3D representation of the grocery items.
This system uses a neural network machine learning component that identifies items and compares them to its library of store inventory items. Once an item has been identified using the 2D camera, this data must then be correlated to the 3D representation. This allows us to confirm the identification and determine the orientation of the item so it can be bagged by the robotic arm. Other complex operations may be required, such as tracking items as they move along the conveyor belt.
Both the machine learning and 3D data manipulation must be performed in real time and are computationally intensive. This is where GPU computing comes into play, accelerating operations such as generating 3D point clouds, matrix transformations, vector math, data filtering, boundary checks, and image processing. Without this level of performance, this product would not be feasible since it would be unable to keep up with the outside world.
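To make one of those operations concrete, the hedged sketch below applies a 4x4 rigid-body transform to every point in a point cloud with a single CUDA kernel launch; with one thread per point, millions of points can be transformed in a fraction of a frame time. The Point3 struct and kernel name are hypothetical.

```cuda
// Illustrative only: transform a 3D point cloud by a 4x4 row-major matrix.
struct Point3 { float x, y, z; };

__global__ void transformPoints(const Point3* in, Point3* out,
                                const float* m /* 4x4 row-major matrix */,
                                int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per point
    if (i >= n) return;
    Point3 p = in[i];
    out[i].x = m[0]*p.x + m[1]*p.y + m[2]*p.z  + m[3];
    out[i].y = m[4]*p.x + m[5]*p.y + m[6]*p.z  + m[7];
    out[i].z = m[8]*p.x + m[9]*p.y + m[10]*p.z + m[11];
}
```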
Maximizing GPU Performance: Overcoming Bottlenecks
Two simple ways to speed up a GPU and avoid bottlenecks are to keep the GPU fed with data as fast as possible and to keep as many of its compute units busy as possible at every moment. Whenever the GPU is waiting for additional input data to be processed, there is an immediate loss of potential processing speed. This type of bottleneck is often the result of a memory issue, but it can also be caused by higher-level issues with operation sequencing.
Two of the most common GPU bottlenecks are memory-related. The first occurs when transferring large blocks of data between the CPU and the GPU. The second, known as memory contention, occurs when multiple GPU cores must wait their turn to access the same area of memory.
How can these GPU bottlenecks be avoided?
- Reducing time spent on loading and storing data in memory is a key place to start.
- Programming APIs like CUDA allow users to combine multiple operations to eliminate redundant store and load cycles.
- Combining multiple GPU operations into a single kernel (kernel fusion) can significantly reduce memory accesses and improve throughput; see the sketch after this list.
- Programs should also use conventional multithreading to feed data to the GPU as fast as possible and keep all computing units busy.
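As an illustration of that kind of fusion, the hedged sketch below replaces two elementwise kernels with one. The specific operations (a scale and an offset) are arbitrary, but the fused version performs one load and one store per element instead of two of each, roughly halving global-memory traffic.

```cuda
// Unfused: two kernels, two full passes over global memory.
__global__ void scale(float* data, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= s;
}
__global__ void offset(float* data, float o, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += o;
}

// Fused: one kernel, one load and one store per element for the same result.
__global__ void scaleAndOffset(float* data, float s, float o, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * s + o;
}
```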
Parallelizing Algorithms
GPUs, with their astounding computational abilities, have the potential to greatly increase processing speeds in algorithmic operations. However, simply moving an algorithm onto a GPU won’t necessarily return great results. A chain is only as strong as its weakest link, so analyzing and eliminating bottlenecks is critical to achieving maximum performance. Algorithms must also be tuned to ensure maximum usage of all available GPU cores, which requires breaking complex operations down into many smaller pieces.
It is not uncommon to see performance improvements of 10x to 100x with a well-tuned GPU algorithm. Using this approach, the Geisel Software team has achieved up to 700x performance improvements for our GPU acceleration clients.
Avoid Sequential Processing
One technique to optimize and accelerate GPUs is to use thread-safe blocking queues. These queues can be preloaded with data by the CPU, allowing the GPU to run at full capacity while the CPU goes on to perform other tasks.
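A minimal sketch of such a queue, built from a mutex and condition variable, is shown below. The Batch type and the way the consumer hands batches to the GPU are hypothetical placeholders.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <vector>

using Batch = std::vector<float>;  // placeholder for whatever unit of work you feed the GPU

class BlockingQueue {
public:
    void push(Batch b) {
        { std::lock_guard<std::mutex> lock(m_); q_.push(std::move(b)); }
        cv_.notify_one();
    }
    Batch pop() {  // blocks until a batch is available
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        Batch b = std::move(q_.front());
        q_.pop();
        return b;
    }
private:
    std::queue<Batch> q_;
    std::mutex m_;
    std::condition_variable cv_;
};

// CPU producer threads call push() as data arrives; a consumer thread loops on
// pop() and immediately launches GPU work on each batch (e.g. an async copy
// plus a kernel launch), so the GPU is never left idle waiting for the CPU.
```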
Other features like shared memory on NVIDIA GPUs can also be used to reduce memory overhead for some algorithms. Each streaming multiprocessor on the GPU contains its own small region of shared memory, which the threads of a block can access up to 100x faster than the main GPU memory. Where applicable, this feature can greatly increase the speed of a GPU algorithm.
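As a standard example of the technique, the hedged sketch below sums blocks of 256 elements by staging them in shared memory, so the repeated accesses during the reduction stay on-chip. It assumes the kernel is launched with 256 threads per block, and the kernel name is illustrative.

```cuda
// Each block sums 256 input elements using fast on-chip shared memory.
__global__ void blockSum(const float* in, float* blockSums, int n) {
    __shared__ float tile[256];                     // on-chip staging area
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    tile[tid] = (i < n) ? in[i] : 0.0f;             // one global load per thread
    __syncthreads();

    // Tree reduction performed entirely within shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) tile[tid] += tile[tid + stride];
        __syncthreads();
    }
    if (tid == 0) blockSums[blockIdx.x] = tile[0];  // one global store per block
}
```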
Takeaways and Conclusions
You should now have a good understanding of how to optimize and accelerate GPUs. Here are the key takeaways:
- An optimized algorithm is only as good as its weakest link. If your algorithm has five steps that each take about the same amount of time and you optimize four of them, you can be at most 5x faster, because the single unoptimized step limits overall performance.
- Careful analysis of performance bottlenecks is critical.
- Take care of the low-hanging fruit first. Optimize the slowest component of the algorithm, retest to identify the next bottleneck, and repeat this process until you’ve eliminated as many bottlenecks as possible. However, don’t waste time optimizing operations which are already fast, and always confirm your assumptions about performance with actual timing measurements.
- Take advantage of tools like shared memory, blocking queues, multithreading, and SIMD. They can dramatically improve performance when applied properly.
- Always feed work to the GPU as fast as possible and make sure that you’re using every compute unit at any given time.
- Cut down on the number of memory accesses in your algorithm by combining tasks to eliminate memory load/store cycles.
- Simply implementing an algorithm on a GPU won’t necessarily yield great improvements. You need to architect the algorithm such that it takes full advantage of the parallel benefits offered by GPUs (doing all of the operations listed in the points above is a good place to start).
For a basic CUDA tutorial, you can visit: https://developer.nvidia.com/blog/even-easier-introduction-cuda/
Stephen Phillips
Senior Software Engineer, Geisel Software