This blog post was originally published at Cadence’s website. It is reprinted here with the permission of Cadence.
The Goal: Long Shelf Life
Chips being designed today for the automotive, mobile handset, AI-IoT (artificial intelligence – Internet of things), and other AI applications will be fabricated in a year or two, designed into end products that will hit the market in three or more years, and then have a product lifecycle of at least five years. These chips will be used in systems with a large number of sensors of various types. Therefore, the AI inference solution must be tailored to the end application for maximum efficiency and the longest possible shelf life.
The challenge for the SoC architect is to design a chip with an AI inference solution flexible enough to handle neural networks that have not yet been invented. This white paper explains how an optimized DSP (digital signal processor) is the key to a long shelf life for AI applications whose algorithms are still to come.
Figure 1. AI architectures are rapidly changing
AI Architecture Evolution
The world of AI is moving fast. Every major company is investing in new ways to bring AI capabilities to the market efficiently. Just look at the computer vision market. Since the 2012 ImageNet competition, the use of AI has expanded dramatically. In the past, various classical computer vision algorithms were used for face, people, and image detection. Over time, those algorithms have been replaced by neural networks performing functions such as classification, segmentation, and object detection. From 2012 to 2018, Convolutional Neural Networks (CNNs) gained popularity for vision, while Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks were used for audio and voice applications.
This changed rapidly with the publication of the “Attention Is All You Need” paper in 2017, which began the evolution of transformers. In 2018, an explosion of Natural Language Processing (NLP) papers was followed by the use of transformers in vision applications, with vision transformers (ViT) and SWIN transformers.
Here are some key dates:
- 1990 to 2014: Seq2Seq, RNN, LSTM, GRU, simple attention mechanisms
- 2017: Beginning of transformers; “Attention is all you need” paper
- 2018: Explosion of transformers in NLP
- 2018 to 2020: Explosion into other fields: ViTs (vision), AlphaFold 2 (biology)
- 2021: Start of generative AI at scale: GPT-X, DALL-E
- 2022: ChatGPT, Whisper, Robotics Transformers, Stable Diffusion
Significant investment is being made in generative AI based on transformer network architectures in the cloud. These transformer networks are also being used at the end point (on-device) and at the edge in automotive, augmented reality/virtual reality (AR/VR), and AI-IoT SoCs to make various vision-based inferences on-device.
While the first vision transformers were introduced in 2018, there have been several improvements since, particularly changes in the “attention” architecture. With the rapid evolution of these vision transformer networks, developing a dedicated hardware AI accelerator is a challenging task because future transformers are unknown. If the product reaches production in two to three years and stays in production for at least five years, the architecture must be future-proof. Recent history suggests that this rapid evolution of neural network architectures will continue, and the challenge of keeping up with the latest algorithms will only get harder.
Typical AI Computation Block
AI processing involves much more than just the AI hardware accelerator. To develop an on-device AI inference solution that is future-proof, scalable, and designed to provide full flexibility to the software developer and application, the AI computation block must have a scalar processor, a vector DSP, and the AI hardware accelerator.
Figure 2. The structure of a typical AI computation block
The AI hardware accelerator will provide the best speed and energy efficiency, as the implementation is all done in the hardware. However, because it is implemented in fixed-function RTL (register transfer level) hardware, the functionality and architecture cannot be changed once designed into the chip and will not provide any future-proofing.
In most cases, this AI hardware accelerator will do an excellent job servicing today’s neural network architectures, but it will not be able to address future neural network requirements. It is also typically limited in data-type support: most AI hardware accelerators are designed for 8-bit integer math, which may be good enough to achieve the desired accuracy, but 32-bit, 16-bit, or 8-bit floating-point support may be necessary, depending on the requirements of future implementations.
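To make the data-type question concrete, here is a minimal C sketch (not Cadence code) of the affine int8 quantization scheme that many integer-only accelerators assume; the scale and zero-point values are hypothetical calibration results. Tensors whose dynamic range cannot be captured this way are exactly the cases that call for 16- or 32-bit floating point on a programmable engine.

```c
/* Illustrative sketch (not Cadence code): affine int8 quantization of the
 * kind many fixed-function accelerators assume.  The scale and zero-point
 * values are hypothetical calibration results. */
#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Map a real value to int8 using a per-tensor scale and zero point. */
static int8_t quantize_int8(float x, float scale, int32_t zero_point)
{
    int32_t q = (int32_t)lrintf(x / scale) + zero_point;
    if (q > 127)  q = 127;      /* saturate to the int8 range */
    if (q < -128) q = -128;
    return (int8_t)q;
}

/* Recover an approximation of the original real value. */
static float dequantize_int8(int8_t q, float scale, int32_t zero_point)
{
    return scale * (float)(q - zero_point);
}

int main(void)
{
    const float   scale = 0.05f;   /* hypothetical calibration result */
    const int32_t zero_point = 0;
    float x = 1.2345f;

    int8_t q = quantize_int8(x, scale, zero_point);
    printf("x=%f  q=%d  x'=%f\n", x, q, dequantize_int8(q, scale, zero_point));
    return 0;
}
```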
The included scalar CPU is usually required for conventional control processing. It is often used for security and scalar processing in 8-, 16-, and 32-bit applications. This processing power is not a good match for the AI algorithms.
The crucial part of this AI triad is the vector DSP. As AI networks continue to change rapidly, new operators such as softmax and layer normalization are invented and must be implemented as the attention mechanism continues to evolve. The vision transformers developed over the last few years have introduced a significant number of such new operators.
Another significant change with the advent of transformers is their extensive use of matrix multiplication compared to CNNs, which are convolution-heavy. A vector DSP is needed to implement any layers not implemented within the AI hardware accelerator.
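For reference, here is a minimal scalar C sketch of two such operators, softmax and layer normalization. On a vector DSP these loops would be mapped onto wide SIMD lanes (in practice via the DSP’s intrinsics); the point of the sketch is simply that these are data-parallel loops suited to a programmable vector engine rather than fixed-function hardware.

```c
/* Scalar C reference for two transformer operators that typically fall
 * outside a fixed-function CNN accelerator.  On a vector DSP these loops
 * map naturally onto wide SIMD lanes.  Illustrative only. */
#include <math.h>
#include <stddef.h>

/* Numerically stable softmax over one row of logits. */
void softmax(const float *in, float *out, size_t n)
{
    float max = in[0];
    for (size_t i = 1; i < n; i++)          /* vectorizable max-reduction */
        if (in[i] > max) max = in[i];

    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {        /* vectorizable exp + sum */
        out[i] = expf(in[i] - max);
        sum += out[i];
    }
    float inv = 1.0f / sum;
    for (size_t i = 0; i < n; i++)          /* vectorizable scale */
        out[i] *= inv;
}

/* Layer normalization of one token's feature vector. */
void layer_norm(const float *in, float *out, size_t n,
                const float *gamma, const float *beta, float eps)
{
    float mean = 0.0f, var = 0.0f;
    for (size_t i = 0; i < n; i++) mean += in[i];
    mean /= (float)n;
    for (size_t i = 0; i < n; i++) {
        float d = in[i] - mean;
        var += d * d;
    }
    var /= (float)n;
    float inv_std = 1.0f / sqrtf(var + eps);
    for (size_t i = 0; i < n; i++)
        out[i] = gamma[i] * (in[i] - mean) * inv_std + beta[i];
}
```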
One could argue that these functions could be implemented on the scalar CPU, or on a GPU if one is present on the SoC. However, the scalar CPU and GPU have other tasks to do and cannot dedicate the necessary computational bandwidth, so these new layers would run far too slowly for real-time processing. The CPU and GPU will also not be as power-efficient as the vector DSP.
Example of a Transformer Network on a Typical AI Computational Block
A SWIN (Shifted WINdow) transformer is one of the newer network types in the rapidly evolving transformer space. It is a hierarchical transformer whose representation is computed with shifted windows, bringing greater efficiency: self-attention is restricted to local windows that shift between layers, so the computation grows roughly linearly with image size, whereas the original vision transformer computes global self-attention across all patches, which scales quadratically.
When a SWIN network is executed on an AI computational block that includes an AI hardware accelerator designed for an older CNN architecture, only ~23% of the workload might run on the AI hardware accelerator’s fixed architecture. In this example, the remaining ~77% of the workload must be executed on a programmable engine, and the vector DSP is the right place to run it (the sketch after Figure 3 illustrates this kind of partitioning).
Figure 3. SWIN transformer example
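The split shown in Figure 3 boils down to an operator-coverage question: which layers the fixed accelerator can execute, and which must fall back to the DSP. The following C sketch is a purely hypothetical illustration of that partitioning step; the operator names and MAC counts are invented to mirror the ~23%/77% split and are not measurements of any Cadence product.

```c
/* Hypothetical illustration of how a compiler or runtime partitions a
 * network between a fixed-function accelerator and a programmable DSP.
 * Operator names and MAC counts are invented for the example. */
#include <stdio.h>
#include <string.h>

typedef struct {
    const char *op;        /* operator type of this layer            */
    long long   macs;      /* work in multiply-accumulate operations */
} Layer;

/* Operators the (hypothetical) fixed accelerator supports. */
static int accelerator_supports(const char *op)
{
    static const char *supported[] = { "conv2d", "relu", "maxpool" };
    for (size_t i = 0; i < sizeof(supported) / sizeof(supported[0]); i++)
        if (strcmp(op, supported[i]) == 0) return 1;
    return 0;
}

int main(void)
{
    /* Invented mix of CNN-style and transformer-style layers. */
    Layer net[] = {
        { "conv2d",     400 }, { "relu",        30 },
        { "matmul",     700 }, { "softmax",     80 },
        { "layer_norm",  60 }, { "matmul",     600 },
        { "maxpool",     30 }, { "gelu",       100 },
    };
    long long on_accel = 0, on_dsp = 0;

    for (size_t i = 0; i < sizeof(net) / sizeof(net[0]); i++) {
        if (accelerator_supports(net[i].op)) on_accel += net[i].macs;
        else                                 on_dsp   += net[i].macs;
    }
    long long total = on_accel + on_dsp;
    printf("accelerator: %lld%%  dsp: %lld%%\n",
           100 * on_accel / total, 100 * on_dsp / total);
    return 0;
}
```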
On-Device Processing Pipeline
In a typical automotive, AR/VR, or AI-IoT SoC vision subsystem that uses many types of sensors, such as cameras (image sensors), radar, or even lidar, data pre-processing is always necessary after the data leaves the analog front-end of each sensor system and before it can be fed into an AI inference engine. This pre-processing is computationally intensive and requires parallel processing of a large amount of data from various high-resolution sensors.
Figure 4. The on-device pipeline processing chain
For cameras, the trend is toward 4K or 8K resolution at 60 frames per second; for radar, the trend is toward very large data cubes. A CPU or GPU cannot process these volumes of data within the required performance and power budgets, and a dedicated fixed-function hardware block is inappropriate for the SoC designer because of the need for flexibility and future-proofing as the algorithms advance.
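As a concrete (and hypothetical) example of such pre-processing, the C sketch below converts an interleaved 8-bit RGB frame into planar, normalized float32 data ready for an inference engine; the mean and scale values are placeholders. On a vector DSP the inner loop would be vectorized and fed by DMA; this scalar reference only shows the shape of the work.

```c
/* Scalar reference for a typical camera pre-processing step: convert an
 * interleaved 8-bit RGB frame into planar, normalized float32 tensors
 * ready for an inference engine.  Mean/scale values are hypothetical. */
#include <stdint.h>
#include <stddef.h>

void preprocess_rgb(const uint8_t *src,   /* interleaved RGBRGB... */
                    float *dst,           /* planar, channel-major  */
                    size_t width, size_t height,
                    const float mean[3], const float scale[3])
{
    size_t pixels = width * height;
    for (size_t c = 0; c < 3; c++) {
        for (size_t i = 0; i < pixels; i++) {   /* vectorizable inner loop */
            dst[c * pixels + i] =
                ((float)src[i * 3 + c] - mean[c]) * scale[c];
        }
    }
}
```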
Once the AI hardware engine has done the necessary inferencing task, the output data (or inference frame) requires further post-processing. This post-processing could include ROI (region of interest) alignment, NMS (non-maximum suppression), data reformatting, and more. Again, the data rates are high, and a CPU or GPU cannot perform the necessary tasks in real time.
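As one concrete example of post-processing, here is a minimal greedy NMS sketch in C; the box layout and IoU threshold are illustrative, not taken from any particular detector. The inner IoU loop is the part a vector DSP would accelerate.

```c
/* Minimal greedy non-maximum suppression (NMS) sketch.  Boxes are assumed
 * sorted by descending score; layout and threshold are illustrative. */
#include <stddef.h>

typedef struct { float x0, y0, x1, y1, score; } Box;

/* Intersection-over-union of two axis-aligned boxes. */
static float iou(const Box *a, const Box *b)
{
    float ix0 = a->x0 > b->x0 ? a->x0 : b->x0;
    float iy0 = a->y0 > b->y0 ? a->y0 : b->y0;
    float ix1 = a->x1 < b->x1 ? a->x1 : b->x1;
    float iy1 = a->y1 < b->y1 ? a->y1 : b->y1;
    float iw = ix1 - ix0 > 0.0f ? ix1 - ix0 : 0.0f;
    float ih = iy1 - iy0 > 0.0f ? iy1 - iy0 : 0.0f;
    float inter  = iw * ih;
    float area_a = (a->x1 - a->x0) * (a->y1 - a->y0);
    float area_b = (b->x1 - b->x0) * (b->y1 - b->y0);
    float uni    = area_a + area_b - inter;
    return uni > 0.0f ? inter / uni : 0.0f;
}

/* Marks keep[i] = 1 for boxes that survive suppression.
 * Boxes must already be sorted by descending score. */
void nms(const Box *boxes, int *keep, size_t n, float iou_threshold)
{
    for (size_t i = 0; i < n; i++) keep[i] = 1;
    for (size_t i = 0; i < n; i++) {
        if (!keep[i]) continue;
        for (size_t j = i + 1; j < n; j++) {     /* vectorizable IoU loop */
            if (keep[j] && iou(&boxes[i], &boxes[j]) > iou_threshold)
                keep[j] = 0;
        }
    }
}
```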
The Cadence Solution: Vision DSPs
Cadence® Tensilica® Vision DSPs solve this design issue for demanding embedded vision and AI applications in the mobile, automotive, surveillance, augmented reality (AR)/virtual reality (VR), drone, and wearable markets. The four DSPs in the Vision family offer 0.4 TOPS (tera operations per second) to 3.84 TOPS maximum performance. All Vision DSPs are built on the same VLIW (very-long instruction word) SIMD (single instruction/multiple data) architecture and offer an N-way programming model, allowing easy software migration to and from each DSP. All of the DSPs offer leading DSP performance as well as the MAC (multiply-accumulate) performance necessary for today’s AI workloads.
The four DSPs in this family range from a low-end 128-bit SIMD DSP to a high-end 1024-bit SIMD DSP. They are all ISO 26262-ready for the automotive market. For sufficient pixel processing throughput, the Vision DSP family architecture incorporates advanced VLIW/SIMD support for the industry’s highest number of ALU and MAC operations per cycle, as well as the industry’s widest and most flexible memory bus. All DSPs in this family offer:
- Integrated DMA engine
- Interface for instruction memory
- Instruction cache
- Two AXI interfaces
- Optional vector floating-point unit with support for double-, single-, and half-precision, complex floating-point, and bfloat operations
- Base-level AI acceleration through 8-bit and 16-bit MAC support
Specialized instructions allow the Vision DSP family members to speed up pixel processing efficiently. Various architecture enhancements boost performance while keeping energy consumption low. The Vision DSP family provides unprecedented flexibility in system implementations at low power-consumption levels, significantly reducing the need for hardware accelerators or efficiently augmenting fixed-function AI accelerators.
Based on the proven Tensilica Xtensa® architecture, the Vision DSP family shares the same development environment. Because the DSPs are configurable and easily modified, designers employing the Vision DSP development platform can create unique, differentiated hardware tailored to specific application requirements for optimal performance and energy efficiency. The Tensilica Instruction Extension (TIE) language provides a powerful way to optimize the Vision DSPs and extend their functionality by defining custom execution units, register files, I/O interfaces, load/store instructions, and multi-issue instructions without having to worry about pipelining and control/interface logic, as the instruction extensions are integrated directly into the DSP pipeline.
Vision 110 and Vision 130 DSPs
The Vision 110 and Vision 130 DSPs are optimized for area, performance, and low energy consumption. The Vision 110 DSP, with 128-bit SIMD, is the smallest Vision DSP and is targeted at lower-power applications, requiring roughly a third of the power and area of the Vision 130 DSP while offering an easy software transition via the N-way programming model. The Vision 110 DSP can also be configured without various features, and its data RAM can be reduced, to provide the smallest area for on-sensor integration.
Features of both the Vision 110 DSP and the Vision 130 DSP include the following improvements compared to previous-generation Vision P1 and Vision P6 DSPs:
- 2X floating-point performance
- Up to 5X improvements on certain AI and vision kernels
- Floating-point FFT performance improvements
- Lower code size
Vision 230 DSP
The Vision 230 DSP addresses the increasing computational requirements for embedded vision and AI applications. It is specifically optimized for simultaneous localization and mapping (SLAM), a technique commonly used in the robotics, drone, mobile, and automotive markets to automatically construct or update a map of an unknown environment. It is also designed for inside-out tracking in the AR/VR market. It offers more than 1 TOPS of AI performance for object detection, image classification, and image segmentation.
Features of the Vision 230 DSP include:
- 1024/512-bit load/store capabilities
- 512 8-bit MACs
- 8/16/32-bit fixed-point processing
- Double-precision (FP64), single-precision (FP32) and half-precision (FP16) floating-point processing
- 4-channel iDMA
Vision 240 DSP
The Vision 240 DSP is built with a 1024-bit SIMD architecture, offering up to 3.84 TOPS of performance. To address the requirements of embedded vision and AI applications, the Vision 240 DSP provides up to twice the AI and floating-point performance of the Vision 230 DSP.
Vision 240 DSP features include:
- 2048/1024-bit load/store capabilities
- 1024 8-bit MACs
- 8/16/32-bit fixed-point processing
- Double-precision (FP64), single-precision (FP32) and half-precision (FP16) floating-point processing
- 4-channel iDMA
- Acceleration for FFT and SLAM
Conclusion
AI algorithms are changing so rapidly that fixed AI hardware cannot keep up. That’s why any AI hardware accelerator must be teamed with an efficient DSP. Cadence’s Tensilica family of DSPs has evolved over almost 20 years to be the most efficient partner for most AI hardware accelerators. Depending on processing requirements, one of the four members of the Vision DSP family should be a great match for just about any AI application.
Pulin Desai
Product Management Group Director, Tensilica IP, Cadence