Just over a month after refreshing its Neoverse infrastructure line of CPU cores, Arm released a flurry of new products aimed at PC and mobile applications. In its largest IP product release ever, Arm launched complete families of CPU, GPU, DSU, and interconnect IP as it continues to synchronize releases across product lines into what it calls “Total Compute solutions.” In fact, the only major computing core family that did not receive an update with this release is the newest family of products, the Ethos NPUs for machine learning acceleration.
New Arm Cortex CPUs
The entire CPU lineup is being refreshed with new Armv9-based products with significant improvements in performance and efficiency. The most important feature that Armv9 adds is the second generation of the Scalable Vector Extension (SVE2), which is an upgrade to Arm’s original Neon SIMD extensions. Updates to Neon and the addition of SVE2 can deliver much faster performance on machine learning workloads and other parallel workloads. Arm also rolled up a series of improved math and security features introduced in later Armv8 instruction set releases. All future Arm Cortex CPUs will implement SVE2.
For the highest performance, Arm announced the Cortex-X2 with a new CPU cluster, up to 16MB of L3 cache, and up to 32MB of unified system level cache. Under simulated benchmark tests using the same process node as its predecessor, the Cortex-X1, the Cortex-X2 provides a 16% increase in instructions per clock (IPC) performance and 100% higher machine learning (ML) performance. According to Arm, a PC powered by the Cortex-X2 could provide performance higher than many current laptops. Two of Arm’s biggest licensees, Apple and Qualcomm, are both vying to break open the Arm-based PC market with new high-performance SoCs, and another licensee, MediaTek, is in some recent Chromebooks. It will be interesting to see if or how the performance of the Cortex-X2 might help licensees achieve their goals faster.
The other two members of the CPU product family are the more traditional Cortex-A products aimed at smartphones and a wide range of other applications from DTVs to automotive. The first noticeable difference is that Arm moved from the two-number model designation (Cortex-A5x and Cortex-A7x) used previously, to a three-number product designation (Cortex-A5xx and Cortex-A7xx). The first big core in the product family is the Cortex-A710 and the first little core is the Cortex-A510.
In addition to the enhancements to branch prediction and data prefetch, there were microarchitectural changes to the Cortex-A710 over the previous generation Cortex-A78. These included narrowing the core's instruction width to 5-wide and shortening the pipeline to 10 cycles for a more power- and area-efficient core. The result was a 30% increase in efficiency combined with a 100% improvement in machine learning (ML) performance and an overall performance boost of about 10%.
It’s hard to imagine Arm making any significant changes to its Cortex-A55, which has been extremely popular in embedded and consumer electronics applications, but with the Cortex-A510 Arm managed not only to increase efficiency by 20% but also to increase ML performance by 200% and overall performance by 35%. The Cortex-A510 is designed with a dual-core complex with a shared L2 cache, Neon SIMD engine, and SVE2 vector processing unit. The shared complex saves die area and power with only a minor performance penalty. The Cortex-A510 also features a 128-bit data prefetch per cycle, a three instruction per cycle decode, and a multi-stage branch prediction. With the dual-core complex, four small cores can be achieved using two Cortex-A510 complexes rather than four individual Cortex-A55s.
Arm Mali GPUs
As if the new complement of CPU cores wasn’t enough, Arm launched an entire family of new GPU cores that also share the new three-digit naming convention. The new family includes the Mali-G710, Mali-G610, Mali-G510, and Mali-G310. As with the CPU cores, the new GPU cores scale with performance and efficiency requirements. For the first time, however, the entire product family features Arm’s Valhalla GPU architecture. The Mali-G710 represents the third generation of premium mobile GPUs with the Valhalla architecture, while the Mali-G310 will be the first generation for the entry-level offering.
Being the premium product, the Mali-G710 reflects all the enhancements to the Valhalla architecture. The most notable feature of the Mali-G710 is the scalability of the GPU. The Mali-G78 featured a single processing unit while the Mali-G710 has four processing elements with dedicated resources. It also introduces a new Command Stream Frontend (CSF) that combines software and hardware enhancements and better aligns with Vulkan for higher performance and efficiency. The Mali-G710 can also be configured with seven to sixteen shader cores, one to four L2 cache slices of 128b or 256b, and 128b or 256b configurable ACE interfaces. In all, the Mali-G710 delivers approximately 20% improvements in efficiency with a 35% improvement in ML processing and a 20% improvement in gaming.
The slightly scaled down Mali-G610 shares the same features as the Mali-G710 but is scalable with one to six shader cores. For the mainstream and volume segments, the Mali-G510 and Mali-G310 provide some of the scalability with all the latest features, plus added support for Arm Fixed Rate Compression (FRC) and additional HDR support. The Mali-G510 provides a 22% improvement in efficiency with a 100% improvement in both machine learning and overall performance over its predecessor. The power-efficient volume segment, served by the Mali-G310, will receive the biggest boost from the transition to the Valhalla architecture with a 350% boost in efficiency combined with a 100% increase in ML performance and a 500% improvement in texturing performance.
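Percentage gains of this size are easy to misread, so it can help to restate them as multipliers. The short sketch below assumes each quoted figure is an improvement relative to the predecessor, so that the new value equals the old value times (1 + gain/100). That reading, and the helper name gain_to_multiplier, are assumptions made for illustration and do not come from Arm.

```python
def gain_to_multiplier(percent_gain: float) -> float:
    """Convert an 'X% improvement' claim into a multiplier over the baseline.

    Assumes the gain is relative to the predecessor: new = old * (1 + X/100).
    """
    return 1.0 + percent_gain / 100.0


# Figures quoted in the article for the mainstream and volume GPUs.
quoted_gains = {
    "Mali-G510 efficiency": 22,
    "Mali-G510 ML performance": 100,
    "Volume segment efficiency": 350,
    "Volume segment texturing": 500,
}

for label, percent in quoted_gains.items():
    print(f"{label}: +{percent}% is {gain_to_multiplier(percent):.2f}x the predecessor")
```

Under that assumption, a 100% improvement is a doubling, a 350% boost is 4.5 times the baseline, and a 500% improvement is 6 times the baseline.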
Supporting SoC IP
Arm’s DynamIQ Shared Unit, or DSU, received some of the most significant changes. Designed to be a bridge between compute cores, a shared L3, and a shared I/O interface, the DSU is critical to the design of SoCs. The DSU-110 has been enhanced to support a higher number of cores, more L3 cache, and higher performance and/or lower latency. The DSU-110 can support up to 8 Cortex-X2 cores, double the number supported by the previous generation, and up to 16MB of L3 cache. The most significant change comes in the microarchitecture, with the L3 cache, snoop filter, and control logic separated into slices (up to 8) connected through a bi-directional dual-ring network interface. The architecture allows for simultaneous processing of multiple requests, with data paths optimized to reduce latency and increase overall bandwidth.
Last but certainly not least are the new CoreLink CI-700 coherent interconnect and the CoreLink NI-700 network interconnect, both of which received the latest and greatest advances from Arm. The CoreLink CI-700 features enterprise-grade AMBA CHI mesh technology for a fully coherent system level cache and snoop filter. The new mesh architecture allows for a combination of cross-point (XP) to cross-point and IP connections, or up to six IP core connections using a single cross-point. As a result, the CI-700 can scale from a single cross-point to a 4×3 mesh. The result is an increase in bandwidth and reduction in power that scales with the configuration. The CoreLink NI-700 is a scalable network-on-chip that also adheres to the latest AMBA standards and has increased power management and security features.
As with previous generations, the various CPU, GPU, and even the Ethos NPU cores can be combined with DSU and CoreLink into a wide variety of configurations. As an example, while the premium smartphones have almost all gone to octa-core CPU configurations, the selections of Arm or Arm-compatible cores can vary widely. Configurations could range from one Cortex-X2 plus three Cortex-A710s and two Cortex-A510 complexes (each Cortex-A510 complex counts as two CPU cores) to two Cortex-A710s plus three Cortex-A510 complexes. And many CE and embedded/IoT applications will opt for using one to four Cortex-A510s with other Arm M-class microcontrollers (MCUs) and/or R-class real-time CPUs.
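The core-count arithmetic behind those octa-core examples can be sketched in a few lines. The class and field names below are hypothetical and exist only for illustration. The sketch also assumes every Cortex-A510 complex is built in its dual-core form, as the article describes, and it checks the total against the eight-core limit quoted for the DSU-110.

```python
from dataclasses import dataclass

CORES_PER_A510_COMPLEX = 2  # assumption: every A510 complex is dual-core
DSU_110_MAX_CORES = 8       # per-cluster core limit quoted in the article


@dataclass(frozen=True)
class ClusterConfig:
    """Hypothetical description of one CPU cluster (names are illustrative)."""
    x2: int = 0              # number of Cortex-X2 cores
    a710: int = 0            # number of Cortex-A710 cores
    a510_complexes: int = 0  # number of dual-core Cortex-A510 complexes

    def total_cores(self) -> int:
        """Count CPU cores, with each A510 complex contributing two."""
        return self.x2 + self.a710 + self.a510_complexes * CORES_PER_A510_COMPLEX

    def fits_dsu_110(self) -> bool:
        """True if the cluster stays within the quoted eight-core limit."""
        return 0 < self.total_cores() <= DSU_110_MAX_CORES


premium = ClusterConfig(x2=1, a710=3, a510_complexes=2)
efficiency_heavy = ClusterConfig(a710=2, a510_complexes=3)

for name, cfg in (("premium", premium), ("efficiency-heavy", efficiency_heavy)):
    print(f"{name}: {cfg.total_cores()} cores, fits DSU-110: {cfg.fits_dsu_110()}")
```

Both example configurations from the text come out to eight cores, which is why either one qualifies as an octa-core design.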
While each of the Arm cores provides a value proposition, each generation is bringing the different IP solutions closer together. Not only do the different IP cores now share a similar numbering scheme, but they also share a key security enhancement to the Arm architecture: the Memory Tagging Extension (MTE). First introduced in the Arm architecture with the Armv8.5 instruction set release, MTE mitigates memory security vulnerabilities by tagging each memory allocation. Subsequent memory accesses must carry the matching tag to access the memory. With memory being a key point of access for attacks, the integration of MTE throughout the Arm architecture is intended to let different cores access the same memory while locking out potential attacks from malicious software. Combined with shared IP like the DSU and CoreLink, this lets Arm provide a complete solution for any application.
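To make the tagging idea concrete, here is a toy software model of it. It is only a sketch: real MTE performs the check in hardware, with a 4-bit tag carried in the upper bits of the pointer and a matching tag stored for each 16-byte granule of memory. The class ToyTaggedMemory, its methods, and the choice to give neighboring allocations different tags are all assumptions made for illustration and do not represent Arm's implementation or any real allocator.

```python
import random

GRANULE_BYTES = 16  # MTE tags memory in 16-byte granules
TAG_BITS = 4        # MTE tags are 4 bits wide


class TagCheckFault(Exception):
    """Raised when a pointer's tag does not match the memory's tag."""


class ToyTaggedMemory:
    """Toy model: a bump allocator over a byte array with one tag per granule."""

    def __init__(self, size_bytes: int, seed: int = 0):
        self.data = bytearray(size_bytes)
        self.tags = [0] * (size_bytes // GRANULE_BYTES)  # 0 = never allocated
        self._rng = random.Random(seed)
        self._next_granule = 0
        self._last_tag = 0

    def malloc(self, nbytes: int):
        """Allocate whole granules and return a (tag, address) 'pointer'."""
        granules = -(-nbytes // GRANULE_BYTES)  # round up to whole granules
        start = self._next_granule
        if start + granules > len(self.tags):
            raise MemoryError("toy heap exhausted")
        # Assumption: pick a nonzero tag that differs from the previous
        # allocation's tag so adjacent blocks never share a tag.
        candidates = [t for t in range(1, 1 << TAG_BITS) if t != self._last_tag]
        tag = self._rng.choice(candidates)
        for granule in range(start, start + granules):
            self.tags[granule] = tag
        self._next_granule = start + granules
        self._last_tag = tag
        return (tag, start * GRANULE_BYTES)

    def load(self, pointer, offset: int = 0) -> int:
        """Read one byte, faulting if the pointer tag and memory tag differ."""
        tag, base = pointer
        address = base + offset
        if not 0 <= address < len(self.data):
            raise IndexError("address outside toy memory")
        if self.tags[address // GRANULE_BYTES] != tag:
            raise TagCheckFault(f"tag mismatch at address {address}")
        return self.data[address]


heap = ToyTaggedMemory(64)
a = heap.malloc(16)
b = heap.malloc(16)
heap.load(a, 0)  # in bounds: tags match
try:
    heap.load(a, 16)  # one byte past a's block, inside b's block
except TagCheckFault as err:
    print("blocked:", err)
```

In this model, a read that runs off the end of one allocation lands in memory carrying a different tag and is rejected. In real systems where tags are chosen at random, two neighboring allocations can occasionally share a tag, so detection is probabilistic rather than guaranteed.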
A common theme between the CPU and GPU announcements was the performance enhancements for ML. In addition to adding SVE2 as part of the Armv9 architecture, Arm is adding support for matrix math instructions and the BFloat16 floating-point format. Because ML can be performed on any compute architecture, Arm is working to enhance the performance of ML workloads on all its compute cores, not just the Ethos NPUs. Use of the different cores for ML tasks, however, is still up to the software.
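For readers unfamiliar with BFloat16, the format keeps the sign bit and all 8 exponent bits of a 32-bit float but only the top 7 mantissa bits, so it covers the same numeric range at lower precision. The sketch below illustrates that layout by simple truncation. The function names are hypothetical, and truncation is an assumption chosen for brevity; hardware and ML libraries typically round to nearest instead.

```python
import math
import struct


def float32_to_bfloat16_bits(value: float) -> int:
    """Return the 16-bit BFloat16 pattern for a value by truncating float32.

    Assumption: the low 16 bits are simply dropped (round toward zero).
    """
    (bits32,) = struct.unpack("<I", struct.pack("<f", value))
    return bits32 >> 16


def bfloat16_bits_to_float(bits16: int) -> float:
    """Expand a 16-bit BFloat16 pattern back into a Python float."""
    (value,) = struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))
    return value


pi_bits = float32_to_bfloat16_bits(math.pi)
print(f"pi as BFloat16 bits: 0x{pi_bits:04X}")
print(f"pi after round trip: {bfloat16_bits_to_float(pi_bits)}")
```

Run on pi, the round trip gives 3.140625, which shows the trade-off: roughly two to three decimal digits of precision in exchange for half the storage and memory bandwidth of a 32-bit float.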
In the details
It is important to note that Arm does not provide transistor counts or estimated die sizes for the new products. As with any semiconductor product, new features mean more die area. However, Arm compares its products from one generation to the next as if they are on the same semiconductor manufacturing process node. Given this, continuously achieving double-digit gains in performance and efficiency is a major accomplishment. But, given that most of the products using this new IP will be on enhanced process nodes, either 5nm sub-nodes or 3nm, the overall die area is likely to shrink, and additional performance or efficiency gains may be achieved.
Also, although the Arm Cortex-X2 CPU core is aimed at higher performance applications like PCs and Chromebooks, the Mali GPU cores are aimed squarely at mobile applications. They will provide a quality experience for casual PC gaming, but they do not support more advanced features like ray tracing in the latest game titles. At least not yet.
With this huge announcement of new IP, Arm set a new bar for not only itself but for other IP and semiconductor companies. Not only does the company continue to introduce new products on an annual cadence, but Arm also continues to demonstrate significant gains in performance and efficiency as a result of architectural enhancements. In addition, the company continues to demonstrate how using Arm IP throughout an SoC design is more valuable than just the sum of the individual cores.
Although there are still questions about the potential impact of the proposed acquisition of Arm by Nvidia, it’s clear the rate of product introductions and innovation continues to accelerate for the world’s predominant compute architecture. If Nvidia licensed its industry-leading GPU and AI architectures in conjunction with the Arm portfolio, almost anyone would be able to build class-leading SoCs for all levels of compute.