This blog post was originally published at Arm’s website. It is reprinted here with the permission of Arm.
New Arm Compute Subsystems for Client deliver a step-change in performance, efficiency, and scalability, with production-ready physical implementations on the 3nm process.
AI is transforming consumer devices and revolutionizing productivity, creativity and entertainment experiences. This is leading to greater automation, immersion and personalization, which open up a wide range of opportunities for developers and end users. As AI continues to advance, on-device generative AI is driving the evolution of the mobile system-on-chip (SoC).
Building on the success of Arm’s Total Compute solutions, we are announcing brand-new compute subsystems for consumer devices, known as Arm Compute Subsystems (CSS) for Client. This is the compute foundation for AI-powered experiences, delivering a step-change in performance, efficiency and scalability across the broadest category of consumer devices.
CSS for Client includes the latest Armv9.2 Cortex CPU cluster and Arm Immortalis and Mali GPUs, CoreLink Interconnect system IP, and production-ready physical implementations for the CPUs and GPUs on the 3nm process on leading foundries. The platform provides the fastest path to production silicon for our partners. The physical implementations allow Arm’s partners to unlock all the benefits of the leading-edge 3nm process, while enabling highly flexible, customizable silicon designs.
Looking inside Arm CSS for Client
CSS for Client features the latest Armv9.2 CPU cluster, which integrates our highest performance Arm Cortex-X925 CPU, most efficient Arm Cortex-A725 CPU, and refreshed Arm Cortex-A520 CPU. This delivers unprecedented performance and efficiency for AI and other real-world compute workloads.
What’s included in CSS for Client
System integration and expansion for CSS for Client are achieved through the latest CoreLink Interconnect. The integrated system-level cache (SLC) improves system power efficiency by reducing DRAM bandwidth and accesses. The System Memory Management Unit (SMMU) provides enhanced security through stage-2 translation to support virtualized security frameworks, such as the Android Virtualization Framework (AVF).
CSS for Client achieves stunning graphics and console-level gaming performance through the Immortalis-G925 for flagship smartphone devices, which is built on the 5th Gen GPU architecture. With its enhanced performance and power efficiency, users can enjoy longer, more immersive gaming sessions on mobile.
CSS for Client will be part of the first generation of Android SoCs on 3nm process nodes, enabling best-in-class PPA (power, performance and area) in silicon. As part of CSS for Client, Arm’s physical implementations unlock the full potential of the 3nm technology, maximizing the PPA benefits for premium platforms and creating the fastest path to silicon for our partners.
Working with leading foundry partners, we are co-designing and delivering CPU and GPU physical implementations, which include tape-out ready Cortex-X925 CPU and Immortalis-G925 physical implementations for 3nm. This helps our partners access the full PPA benefits of the 3nm process, while shortening silicon development and deployment timelines through production-ready silicon solutions. It also gives our partners the flexibility to build market-specific, differentiated CPU clusters and GPUs using CSS for Client.
Pushing the boundaries of compute and AI performance
CSS for Client is Arm’s fastest platform for Android to date, with significant improvements across key benchmarks and general compute use cases compared to the TCS23 platform. These include:
- 36 percent improvement in peak performance, measured by Geekbench 6 single core score, thanks to the new Cortex-X925;
- 33 percent faster application launch times on average across five of the top 10 applications to boost productivity and provide a fluid user experience on mobile devices;
- 60 percent faster web browsing, measured using the Speedometer 2.1 browser benchmark; and
- 30 percent peak graphics performance improvements on average across seven graphics benchmarks, including ray tracing and variable rate shading (VRS) benchmarks.
Some of the performance benefits from CSS for Client
CSS for Client is the platform for AI-powered consumer device experiences. Earlier this year, we showed how large language models (LLMs) can run locally on Arm CPUs on mobile devices. With CSS for Client, LLMs will run even better on Arm CPUs, with faster response times. The platform delivers a 42 percent faster time-to-first-token when running the Llama 3 LLM and a 46 percent faster time-to-first-token when running the Phi-3 LLM.
Running LLMs on the Arm CPUs with CSS for Client
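Time-to-first-token is the delay between submitting a prompt and receiving the first generated token. The snippet below is a minimal sketch of how that delay could be measured around any streaming generator. It is not the methodology behind the figures above; `fake_token_stream` and its timing parameters are hypothetical stand-ins for a real LLM runtime.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_token_stream(prompt: str, prefill_s: float = 0.05,
                      per_token_s: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM.

    Sleeps once to mimic prompt processing (prefill), then yields
    one "token" per word of the prompt.
    """
    time.sleep(prefill_s)
    for word in prompt.split():
        yield word
        time.sleep(per_token_s)


def time_to_first_token(stream: Iterable[str]) -> Tuple[float, str]:
    """Return (seconds until the first token arrives, the first token)."""
    start = time.perf_counter()
    first = next(iter(stream))  # blocks until the first token is produced
    return time.perf_counter() - start, first


if __name__ == "__main__":
    ttft, token = time_to_first_token(fake_token_stream("hello from the CPU"))
    print(f"first token {token!r} after {ttft * 1000:.1f} ms")
```

Because the timer starts before the first `next()` call, the measurement covers prompt processing as well as generation of the first token, which is the part of the response a user actually waits for.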
Moreover, CSS for Client achieves a significant performance leap for AI inference across a broad range of general AI networks due to advances in the new Arm CPUs and GPUs. This includes 59 percent faster AI inference on Cortex-X925 and 36 percent faster AI inference on Immortalis-G925. In addition, by leveraging an additional Cortex-X925 CPU in the CSS for Client CPU cluster configuration, we observe a staggering 2.7x performance uplift in AI inference across 17 popular networks for int8 and fp16 data types. These improvements in AI inference enable seamless user experiences across a range of AI use cases.
AI inference improvements with CSS for Client
One AI use case where CSS for Client particularly shines is computational photography and AI camera. Capturing stunning photos and videos with realistic bokeh effects, which blur the background and keep a chosen subject in focus, is complex. The AI camera bokeh pipeline consists of multiple stages, such as depth estimation, segmentation, matting, and blending, to produce high-quality results. Compared to TCS23, CSS for Client achieves a 24 percent increase in performance for the bokeh workload when the AI processing runs on the CPU. This means users can enjoy faster and smoother bokeh effects on their photos and videos without compromising battery life.
AI camera improvements through CSS for Client
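To make the final stage of the bokeh pipeline concrete, here is a minimal sketch of blending on a single row of pixels. It assumes the earlier stages (depth estimation, segmentation and matting, which are typically neural networks) have already produced a per-pixel matte `alpha`, where 1 means subject and 0 means background. The function names and the simple box blur are illustrative assumptions, not the pipeline used in the measurements above.

```python
from typing import List


def box_blur_row(row: List[float], radius: int = 1) -> List[float]:
    """Blur one row of pixels by averaging each pixel with its neighbours.

    The window is truncated at the edges of the row.
    """
    blurred = []
    for i in range(len(row)):
        window = row[max(0, i - radius): i + radius + 1]
        blurred.append(sum(window) / len(window))
    return blurred


def blend(sharp: List[float], blurred: List[float],
          alpha: List[float]) -> List[float]:
    """Composite the sharp subject over the blurred background.

    out = alpha * sharp + (1 - alpha) * blurred, per pixel.
    """
    return [a * s + (1 - a) * b for s, b, a in zip(sharp, blurred, alpha)]


if __name__ == "__main__":
    row = [0, 0, 9, 0, 0]      # one bright subject pixel in the middle
    alpha = [0, 0, 1, 0, 0]    # hand-written matte: only the middle pixel is subject
    print(blend(row, box_blur_row(row), alpha))
```

The subject pixel keeps its original value, while its neighbours take the blurred value. A real pipeline applies the same compositing step in two dimensions with a soft matte and a depth-dependent blur radius.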
Further performance and power optimizations are then possible on CSS for Client through a mix of software and tools. The introduction of Arm’s new Kleidi libraries, which feature KleidiAI (a collection of highly optimized machine learning (ML) kernels), enables developers to unlock the full potential of Arm CPUs when running AI workloads via highly optimized generative AI frameworks. This means that developers can build their AI-based applications quickly, at the highest possible performance and across the broadest range of devices.
For more immersive and longer gaming sessions, CSS for Client delivers double-digit performance and power efficiency improvements. This includes a 37 percent average performance uplift at the same power and 30 percent GPU power reduction, playing at 120 frames per second (fps) on average across a range of popular mobile games.
Scalable performance across all consumer device markets
Arm is committed to enabling AI for everybody. The relentless push for performance and efficiency through CSS for Client scales across a broad range of consumer devices and form factors.
CSS for Client scales up to target the most performant consumer devices entering the market, including the next generation of AI PCs, where Cortex-X925 delivers 50 percent more TOPS compared to the Arm Cortex-X4 CPU. CSS for Client provides a purpose-built scalable platform for the PC market. This features Cortex-X925 for best-in-class single-threaded performance and the best performance scalability through the newly updated DSU-120, which supports up to 14 CPU cores within a single CPU cluster. Alongside SVE2, yet more Armv9 architecture features are coming to the PC market, including Pointer Authentication (PAC), Branch Target Identification (BTI) and Memory Tagging Extension (MTE), which are already proven security technologies in the mobile ecosystem.
Through CSS for Client, Arm delivers accessible AI across all performance and cost points in consumer device markets. Cortex-A725 is the primary processor for power efficient AI throughput, acting as the main workhorse and developer target for AI processing for more cost-sensitive mass-market consumer technology segments. For example, this virtual assistant demo shows the performance of running the Llama2-7B and Phi-3 3.8B LLMs on existing Android smartphones using 3x Cortex-A700 series CPU cores. Finally, the area optimized Cortex-A725 allows area efficient deployments of generative AI workloads across a wide range of consumer technology segments.
The consumer technology AI foundation
CSS for Client is the purpose-built platform for the next generation of AI experiences across a broad spectrum of consumer devices. On mobile, users will experience Android like never before, with CSS for Client being Arm’s fastest compute platform for Android. The PPA benefits of the platform are realized through physical implementations that deliver a faster time-to-market and frictionless deployment opportunities for our silicon partners. The scalable performance capabilities of CSS for Client deliver “AI for everybody”, helping to unleash AI performance for all cost points across a variety of different device and form factor types.
Essentially, CSS for Client allows our ecosystem to do more. Whether it is unleashing more performance, more AI, more application experiences, or more advanced silicon, we cover all the bases. Through the platform, Arm is building the future of consumer computing for the AI-based experiences of today and tomorrow.
Kinjal Dave
Senior Director, Product Management, Client Line of Business, Arm