SoC design – the never-ending challenge

In System-on-Chip (SoC) development, architects and designers juggle numerous design parameters as they push beyond the performance achieved by the previous generation of devices. And now, the increasing variety of use cases and software technologies is fundamentally influencing processing architectures.

They are striving to keep one foot in the present, in terms of usability, manufacturability and testability, carrying both customers and the ecosystem with them. And they are firmly planting their other foot in an era of domain-specific architectures, encompassing new approaches to processing, memory and I/O.

This will in turn influence later, more standardised architectures.

What has not changed is the demand for ever-greater performance. And improving overall system performance requires improvements to processors, I/O and memory.

Perspective

By some estimates, memory once accounted for 50% of silicon area. It is now well beyond that; in some applications employing AI, it exceeds 80%. The demand to fit algorithms, models and data in memory – all in order to meet performance and cost targets – is pushing silicon area towards potentially uneconomic levels.

And as SoC silicon area is increasingly dominated by memory, its power consumption – especially leakage at nodes below 90nm – is a matter of critical importance.

Static RAM (SRAM) scaling has stalled below 7nm. Use of embedded DRAM (eDRAM) largely fell by the wayside due to process complexity, yield and cost.

Chiplets have emerged as another option. But disaggregating caches and bulk memories into multiple chiplets is not yet viable for companies working at mature nodes under severe cost pressure. Over time, as application demands evolve and encourage new designs, chiplets will converge with many of these SoC sockets.

Dynamic RAM (DRAM), the industry's workhorse external main memory, is also hitting limits. And while DRAM access bandwidth and capacity have improved, latency has improved at a far slower rate.

DRAM is now limited by the inability of its storage capacitors to keep shrinking while maintaining yield. It will increasingly face reliability problems and more demanding refresh schedules, which eat into usable bandwidth.
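
To see how refresh eats into usable bandwidth, here is a back-of-envelope sketch in Python. The timing parameters are illustrative assumptions, not figures for any specific device.

    T_REFI_NS = 7_800  # assumed average interval between refresh commands (tREFI)
    T_RFC_NS = 350     # assumed time one refresh command blocks a bank (tRFC)

    # Fraction of time a bank is unavailable because it is refreshing.
    print(f"Bandwidth lost to refresh: {T_RFC_NS / T_REFI_NS:.1%}")

    # Hotter or less reliable parts must refresh twice as often
    # (tREFI halves), doubling the overhead.
    print(f"At 2x refresh rate:        {2 * T_RFC_NS / T_REFI_NS:.1%}")

With these assumed values, roughly 4.5% of bandwidth is lost at the nominal refresh rate and 9% at the doubled rate.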

So, when it comes to memory, the SoC design playbook needs updating.

HPC & Cloud Computing

CPUs and DRAM dominate data centre power usage, together accounting for around three-quarters of the peak power of all hardware subsystems. Using KFBM for integrated L3 caches, and Stacked DFM as a very high-capacity supplement to external DRAM, allows system architectures in which memory density increases while leakage power is kept in check – crucial at a time when generative AI training is causing an explosion in processing and memory requirements. Both KFBM and Stacked DFM can also be implemented on a chiplet, widening the range of implementation options.

Automotive

Automotive applications such as digital cockpit, infotainment and Advanced Driver Assistance Systems (ADAS) are both memory-hungry and safety-critical. And with the emergence of AI-based Large Language Models (LLMs) to support various in-car use cases through voice commands, memory capacity must increase while latency must remain low.

KFBM provides a 5X density increase over embedded SRAM for bulk on-chip RAM and can facilitate the integration of large L3 caches on-chip, reducing the latency and energy associated with off-chip memory accesses. KFBM is also ideal for implementing the large frame buffers that support the integrated GPUs used for the UI and navigation.

Stacked DFM also provides a high-density external memory, allowing on-board memory capacity to expand greatly without impacting BOM cost.

Consumer

Consumer devices – smartphones, tablets, digital TVs, game consoles, AR/VR headsets – require high bandwidth and low latency to live up to the promises made to customers about user experience.

With consumers relatively price-sensitive and OEMs keen to encourage short replacement cycles, there is a tension between bill-of-materials cost and adding significant amounts of memory to improve overall system performance.

KFBM, with its 5X density advantage over embedded SRAM, allows the integrated L3 caches in application processors to grow 5X larger with no silicon area impact, improving overall system performance. Alternatively, the original L3 capacity can be kept, and the precious silicon area saved devoted to other functions. KFBM is also ideal for implementing large frame buffers to support integrated GPUs.
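
A minimal sketch of that trade-off: the 5X density ratio comes from the text, while the baseline cache capacity and area are invented placeholder values.

    DENSITY_RATIO = 5.0      # KFBM density vs embedded SRAM (from the text)
    SRAM_L3_MB = 8           # assumed baseline L3 capacity
    SRAM_L3_AREA_MM2 = 16.0  # assumed silicon area of that SRAM L3

    # Option 1: keep the area, grow the capacity 5X.
    grown_mb = SRAM_L3_MB * DENSITY_RATIO
    # Option 2: keep the capacity, shrink the area 5X.
    shrunk_mm2 = SRAM_L3_AREA_MM2 / DENSITY_RATIO

    print(f"Same area:     {grown_mb:.0f} MB of L3")
    print(f"Same capacity: {shrunk_mm2:.1f} mm2, freeing "
          f"{SRAM_L3_AREA_MM2 - shrunk_mm2:.1f} mm2 for other functions")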

With the move to on-device generative AI to improve latency and enhance privacy, Stacked DFM can support models with billions of parameters on-device more cost-effectively than a purely DRAM-based external memory.

With many such devices being handheld, leakage power (and the associated heat) must be kept low. KFBM and Stacked DFM offer significantly lower leakage power than embedded SRAM/DRAM and external DRAM, respectively.

IoT

In markets such as industrial, healthcare, transportation, energy, smart cities and retail, microcontrollers have proliferated and connectivity has multiplied. Shipping in very high volumes, cost-constrained and carrying low average selling prices, many IoT chips are fabricated on mature planar processes such as bulk silicon and SOI.

With TinyML now deployed on microcontrollers across IoT applications – people counting and thermal imaging for energy management, YOLO for object and person detection – microcontroller vendors are integrating neural network acceleration and more embedded memory, and experimenting with in-memory compute schemes.

But being cost-constrained, IoT chip vendors cannot make their chips too big, as costs rise and yields worsen. Nor can they make them too small, or overheads such as saw lines dominate. Their finely tuned cost models are based on planar silicon processes, with FinFET presenting an unacceptable cost/performance trade-off, as sketched below.
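
The shape of such a cost model can be sketched with the textbook dies-per-wafer approximation and a Poisson yield model. The wafer cost, defect density and saw-line width below are assumed values, not any vendor's figures.

    import math

    WAFER_DIAMETER_MM = 300
    WAFER_COST_USD = 3_000  # assumed cost of a processed wafer
    D0_PER_MM2 = 0.001      # assumed defect density
    SAW_LINE_MM = 0.1       # assumed scribe/saw-line width

    def cost_per_good_die(die_side_mm: float) -> float:
        """Approximate cost of one yielding die with the given edge length."""
        pitch = die_side_mm + SAW_LINE_MM  # each die site includes its saw line
        wafer_area = math.pi * (WAFER_DIAMETER_MM / 2) ** 2
        # Gross dies per wafer, minus an edge-loss correction term.
        dies = (wafer_area / pitch**2
                - math.pi * WAFER_DIAMETER_MM / (math.sqrt(2) * pitch))
        yield_frac = math.exp(-D0_PER_MM2 * die_side_mm**2)  # Poisson yield
        return WAFER_COST_USD / (dies * yield_frac)

    for side_mm in (2.0, 5.0, 10.0):
        print(f"{side_mm:4.1f} mm die: ${cost_per_good_die(side_mm):.3f} per good die")

At the small end, the saw line consumes a growing fraction of each die site; at the large end, yield loss dominates – the "not too big, not too small" tension described above.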

By embedding KFBM, with its 5X density advantage over SRAM, vendors can significantly increase the amount of embedded memory for models and data while staying within existing silicon area constraints. Alternatively, they can replace the existing embedded SRAM with 5X denser KFBM, reducing silicon area; the saved area could then be used for more complex logic.

KFBM also delivers substantial savings in leakage per MB – a key metric in AI/ML-enabled microcontrollers.

Stacked DFM can also be used as a dense external memory. Implemented as a specialty external memory, it allows the SoC vendor to move away from commodity DRAM devices, with their inflexible interfaces and product lifecycle issues.

Peripherals and Communications

Mass storage devices, printers and routers use significant amounts of volatile RAM as buffers to enhance overall system performance – the DRAM in an SSD, for example. Stacked DFM, with its increased density over conventional DRAM, allows significantly larger buffers within the same BOM cost, whether as a DRAM replacement or a DRAM supplement.

Artificial Intelligence and Machine Learning

AI and ML have proliferated across the entire spectrum of processing devices, from simple microcontrollers and complex application processors at the edge, to servers and HPC systems in the cloud.

Wherever these devices sit, latency, bandwidth and power remain critical.

In edge computing, much of the time is spent moving weights from memory to the processor. This requires the memory subsystem to support low-latency, high-rate data transfer. KFBM – 5X denser than embedded SRAM – allows that low-latency, high-bandwidth memory to sit on-chip.
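
A back-of-envelope sketch makes the imbalance concrete; the model size, memory bandwidth and compute throughput are assumed values, not measurements.

    WEIGHTS_MB = 50      # assumed int8 model, ~50M parameters
    BANDWIDTH_GB_S = 8   # assumed external-DRAM bandwidth
    COMPUTE_TOPS = 4     # assumed accelerator throughput (int8)

    # Time just to stream the weights in once per inference pass.
    transfer_ms = WEIGHTS_MB / (BANDWIDTH_GB_S * 1_000) * 1_000
    # Time for the multiply-accumulates (~2 ops per weight).
    compute_ms = 2 * WEIGHTS_MB * 1e6 / (COMPUTE_TOPS * 1e12) * 1_000

    print(f"Weight transfer: {transfer_ms:.2f} ms")   # ~6.25 ms
    print(f"Compute:         {compute_ms:.3f} ms")    # ~0.025 ms

With these assumptions, moving the weights takes more than two orders of magnitude longer than computing on them: the workload is memory-bound, which is exactly where on-chip, high-bandwidth memory helps.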

With on-device AI inferencing at the edge and ever-expanding generative AI models being trained in the cloud, Stacked DFM offers a much denser external memory, either as a replacement or supplement for external DRAM.

Both KFBM and Stacked DFM lend themselves to new approaches to energy-efficient computing, such as in-memory compute, improving performance-per-watt for sustained AI inferencing and training.

Ultra-Low Power Memory

As chips have scaled down, leakage current has grown into a significant portion of overall power consumption, elevating die temperature and contributing to the power wall. The scaling wall means there is also a limit to how far SRAM and DRAM can scale before stability and reliability issues arise.

While the 1T0C structure of both KFBM and Stacked DFM is inherently ultra-low power, the Vertical SGT also allows the development of ultra-low-power SRAM memories. Vertical SGT-based SRAM has advantages in area, read/write stability and reduced leakage, and also supports vertical stacking, increasing density further without any area impact.