Energy-Efficient, Near-Memory CMOS+X Architecture for Hardware Acceleration of DNNs with Application to NextG Wireless Systems

Team Members

  • Dr. Sarma Vrudhula (PI, Arizona State University)

  • Dr. Marwan Krunz (Co-PI, Univ. of Arizona)

  • Dr. Sanmukh Kuppannagari (Co-PI, Case Western Reserve University)

  • Dr. Gian Singh (post-doc, Arizona State University)

  • Ayushi Dube (Ph.D. Student, Arizona State University)

  • Abhinav Kumar (Master's Student, Arizona State University)

  • Sampad Chowdhury (Ph.D. Student, Arizona State University)

  • Arush Sharma (Ph.D. Student, Univ. of Arizona)

  • Rajan Shrestha (Ph.D. Student, Univ. of Arizona)

  • Changxin Li (Ph.D. Student, Case Reserve Western University)

Goals and Objectives

Deep neural networks (DNNs) have been successfully applied in many domains, including image classification, language modeling, speech analysis, autonomous vehicles, wireless communications, and bioinformatics. Their success stems from their ability to handle vast amounts of data and infer patterns without making assumptions about the underlying dynamics that produced the data. Cloud providers operate large data centers whose high-speed computers continuously perform DNN computations, with energy consumption that rivals that of some industries and nations. Beyond large-scale problems, DNNs are increasingly used for recognition tasks in battery-operated systems such as smartphones and embedded devices. There is therefore a critical need to improve the energy efficiency of DNNs. The main goal of this project is twofold: (1) design and evaluate a radically innovative, energy-efficient hardware/software framework for on-chip implementation of DNNs, and (2) customize this framework for new DNNs that enable real-time signal classification in next-generation wireless systems. By integrating processing elements within memory chips, the energy consumption of a DNN can be significantly reduced and more computations can be completed faster. The hardware-accelerated DNN designs produced by this project will facilitate rapid identification of wireless transmissions (e.g., radar, 5G, LTE, Wi-Fi, microwave, and satellite) in shared-spectrum scenarios, enabling better use of the spectrum and accurate detection of adversarial and rogue signals.
To achieve a 10x-100x reduction in DNN energy consumption, a holistic approach is being pursued, which encompasses: (1) new circuit designs that leverage emerging 'CMOS+X' technologies; (2) a novel near-memory architecture in which processing elements are seamlessly integrated with traditional Dynamic RAM (DRAM); (3) novel 3D-matrix-based per-layer DNN computations and data-layout optimizations for kernel weights; and (4) algorithms and hardware/software co-design tailored for near-real-time DNN-based signal classification in next-generation wireless systems. In addition to its research goals, the project has a comprehensive educational and outreach agenda.

Research Thrusts

Thrust 1: Hardware Innovations for AI Pipelines

The Memory Wall Problem

DNNs are both compute- and memory-bound. The energy consumed by the large volume of data transfers across the processor-memory bottleneck makes executing DNNs on traditional von Neumann computing systems (i.e., pure software implementations) environmentally unsustainable. Current hardware accelerators for DNNs that supplement a von Neumann CPU with ASICs, FPGAs, or GPUs have not overcome this bottleneck. Moreover, each is customized to accelerate a specific type of DNN.
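The imbalance above can be made concrete with a back-of-the-envelope sketch. The per-operation energy figures below are assumptions for illustration (in the spirit of published technology surveys); exact values vary with technology node.

```python
# Illustrative per-operation energy figures in picojoules. These numbers are
# assumptions for illustration only; exact values vary with technology node.
E_DRAM_ACCESS_PJ = 640.0  # fetch one 32-bit word from off-chip DRAM
E_MAC_PJ = 3.1            # one 32-bit multiply-accumulate on chip

def layer_energy_pj(num_macs, words_from_dram):
    """Total energy of a DNN layer: on-chip compute plus off-chip data movement."""
    return num_macs * E_MAC_PJ + words_from_dram * E_DRAM_ACCESS_PJ

# If every MAC operand had to come from off-chip DRAM, data movement would
# dwarf compute by roughly two orders of magnitude per word:
movement_ratio = E_DRAM_ACCESS_PJ / E_MAC_PJ
```

Under these assumed figures, a single DRAM word fetch costs about 200x one on-chip MAC, which is why moving computation next to the memory arrays pays off.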

[Figure: the von Neumann processor-memory bottleneck]

Processing in Memory

Processing-in-Memory (PIM), or near-memory, architectures have long been viewed as a way to eliminate the processor-memory bottleneck, and they are a promising approach for achieving the levels of energy efficiency needed for sustainability. We propose placing the processing elements next to the DRAM memory arrays. DRAM is the dominant main-memory technology in computing systems and, thanks to decades of continuous optimization of its devices, circuits, and architecture, offers significant advantages over emerging memory types for PIM systems: large capacity, high internal parallelism, high speed, and energy efficiency.

[Figure: proposed near-memory processing solution]

New Compute Elements (NPE and CMOS+X Neuron)

This thrust of the project focuses on the co-design of devices, circuits, and the near-memory processor architecture. It involves designing and optimizing a novel circuit architecture for a threshold logic gate (TLG), or binary neuron, using CMOS and non-volatile programmable-conductance devices (X), as shown in the figure. The key challenges center on developing algorithms for optimizing the CMOS and X parameters to achieve a robust TLG design in the presence of multiple sources of variation.

We propose a new Neuron Processing Element (NPE), as shown in the figure on the right, comprising a network of TLGs, or configurable neurons (CNs), designed to execute the essential neural-network functions (inner products, activation functions, etc.). The NPE has several design parameters, including the number of clusters, the number and types of TLGs in each cluster, and the topology, which must be determined while optimizing for area, power, latency, throughput, and precision.
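Functionally (abstracting away the CMOS+X circuit), a TLG fires when a weighted sum of its binary inputs meets a threshold. The sketch below is a behavioral model under that assumption; the function names and example weights are illustrative, not the project's circuit parameters.

```python
def tlg(inputs, weights, threshold):
    """Behavioral model of a threshold logic gate (binary neuron):
    output 1 when the weighted input sum meets or exceeds the threshold."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1 if s >= threshold else 0

# A 3-input majority gate is a TLG with unit weights and threshold 2.
def majority(a, b, c):
    return tlg([a, b, c], [1, 1, 1], 2)
```

Because a single TLG realizes functions like majority that would take several conventional gates, networks of TLGs can compute inner products and activations compactly, which is what the NPE exploits.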

[Figure: Neuron Processing Element (NPE) near the DRAM arrays]

NPE-DRAM Integration

The NPEs are integrated with the memory arrays (see the figure on the right) without interfering with the timing constraints or access protocols of the memory. This forms a single processing-in-memory system that can scale from a low-power edge device to a high-performance data-center accelerator.

[Figure: NPE-DRAM chip integration]

System Integration

The overall near-memory processor design integrates the DRAM, the NPEs, and the host CPU. The host CPU compiles the DNN application, schedules the operations, and controls the data flow between the DRAM and the CPU. Such integration requires a DRAM interface in the form of an instruction set architecture (ISA) and a controller. To readily integrate HBMs with a host CPU, we propose an extension to the host CPU ISA (x86, ARM, or RISC-V), called the PIM+ ISA, to control the execution of the NPEs and the DRAM. The PIM+ ISA and the corresponding CPU modifications are shown in the figure.
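To make the idea of an ISA extension concrete, the sketch below encodes and decodes a hypothetical PIM instruction word. The opcodes, field widths, and 32-bit layout are illustrative assumptions, not the actual PIM+ encoding.

```python
from enum import IntEnum

class PimOp(IntEnum):
    # Hypothetical PIM+ opcodes; the real encoding is part of the design.
    ACT = 0   # activate a DRAM row into the row buffer
    MAC = 1   # NPE multiply-accumulate on the row-buffer contents
    RD  = 2   # read a result out of the bank
    WR  = 3   # write a result row back

def encode(op, bank, row):
    """Pack (op, bank, row) into one word: 4-bit opcode, 4-bit bank, 16-bit row."""
    assert 0 <= bank < 16 and 0 <= row < (1 << 16)
    return (int(op) << 20) | (bank << 16) | row

def decode(word):
    """Unpack an instruction word back into (opcode, bank, row)."""
    return PimOp((word >> 20) & 0xF), (word >> 16) & 0xF, word & 0xFFFF
```

A host-side driver would emit sequences of such words to the DRAM controller, which dispatches them to the banks and their attached NPEs.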

[Figure: PIM+ ISA and host-CPU modifications]

Sample Results:

The table and bar chart below show the results of executing the Transformer models BERT (small and large), BART (small and large), and GPT-J on the proposed PIM architecture, named CIDAN-3D. They show the throughput improvements of CIDAN-3D over three baselines: TransPIM (2022), HAIMA (2023), and an NVIDIA V100 GPU. On average, CIDAN-3D achieved a 21.5x throughput improvement over an NVIDIA V100 GPU with 8 GB of memory. This advantage arises from CIDAN-3D's efficient use of small NPEs interfaced with the row-buffer outputs of DRAM banks, maximizing DRAM parallelism while minimizing high-latency DRAM operations such as activate (ACT), read (RD), and write-back (WR). By performing all computations within DRAM, the proposed architecture avoids CPU-DRAM data transfers, significantly reducing latency and energy use.

[Figure: Sample results, Thrust 1]

Next Steps:

  • Demonstrate real-time inference on wireless signals in shared spectrum
  • Fabricate compute elements in CMOS+FeFET
  • New processor core for variable (INT, FP, POSIT) precision arithmetic
  • Logic Neural Networks for low-power, real-time inference

Thrust 2: Software Acceleration (Optimized Mapping Tools)

[Figure: Thrust 2 overview]

The objective of this thrust is to build a compilation and runtime toolkit to enable optimal mapping and execution of Wireless CNN applications (Thrust 3) on PIM AI accelerators (Thrust 1).

Our toolkit takes a CNN model and carefully introduces structured sparsity (channel pruning) to reduce computation with minimal loss of accuracy. The structured-sparse CNN is further optimized for mapping: a connectivity matrix represents the active channels, and matrix-reordering techniques pack the active channels more densely, since dense packing leads to more efficient hardware utilization. Using a performance model that estimates the performance of a convolution algorithm mapped to a hardware platform, the toolkit selects the convolution algorithm expected to perform best for the model, and an integer-program-based runtime scheduling algorithm partitions the active channels into buckets and schedules the buckets on the hardware.
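The reordering and bucketing steps can be sketched as follows. This is a minimal stand-in: the column-sort heuristic replaces the toolkit's matrix-reordering techniques, and the greedy partition replaces the integer-program scheduler; all names are illustrative.

```python
def reorder_channels(mask_rows):
    """Reorder channel columns of a layer-by-channel 0/1 connectivity matrix
    so the most-active channels pack toward the front (a simple heuristic
    stand-in for the toolkit's matrix-reordering step)."""
    n = len(mask_rows[0])
    activity = [sum(row[c] for row in mask_rows) for c in range(n)]
    order = sorted(range(n), key=lambda c: -activity[c])
    return order, [[row[c] for c in order] for row in mask_rows]

def bucket_schedule(active_channels, capacity):
    """Greedy stand-in for the integer-program scheduler: partition active
    channel ids into fixed-capacity buckets for dispatch to the hardware."""
    return [active_channels[i:i + capacity]
            for i in range(0, len(active_channels), capacity)]
```

Densely packed columns mean each bucket keeps the compute units busy, which is the utilization argument the paragraph above makes.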

For an example wireless application, jamming detection using CNNs, our toolkit achieved a 1.2-1.6x reduction in inference latency with an accuracy loss of less than 2% on a GPU platform, compared to a PyTorch implementation. Our next steps include performance modeling and optimization on the PIM AI accelerator (Thrust 1), development of 3D convolution algorithms for multi-stage protocol classification, and an end-to-end PyTorch-to-PIM compilation framework to seamlessly support wireless application deployment on PIM.

Sample Results:

[Figure: Sample results, Thrust 2]

Next Steps:

  • Performance modeling and optimization on PIM architecture
  • End-to-end PyTorch-to-PIM compilation framework
  • 3D Conv algorithms for multi-stage protocol classifiers

Thrust 3: Real-time Classifiers for Wireless Communications

Applications in Multi-stage Wireless Signal Inference
[Figure: Thrust 3 overview]

Sample Results:

[Figure: Sample results, Thrust 3]

Representative Publications

  • Ayushi Dube, Gian Singh, and Sarma Vrudhula, "A compact, low power transprecision ALU for smart edge devices," accepted to appear in Proc. of the IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), University of Iceland, Iceland, August 2025.
  • Gian Singh and Sarma Vrudhula, "A scalable and energy-efficient Processing-in-Memory architecture for Gen-AI," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 15, no. 2, pp. 285-298, June 2025, doi: 10.1109/JETCAS.2025.3566929.
  • Gian Singh, Ayushi Dube, and Sarma Vrudhula, "A high throughput, energy-efficient architecture for variable precision computing in DRAM," Proc. of the IFIP/IEEE 32nd International Conference on Very Large Scale Integration (VLSI-SoC), Tanger, Morocco, 2024, pp. 1-6, doi: 10.1109/VLSI-SoC62099.2024.10767834.
  • Md Rabiul Hossain and Marwan Krunz, "PCI classification in 5G NR: Deep learning unravels synchronization signal blocks," Proc. of the IEEE SECON Conference, Dec. 2024.
  • Wenhan Zhang, Meiyu Zhong, Ravi Tandon, and Marwan Krunz, "Filtered randomized smoothing: A new defense for robust modulation classification," Proc. of the IEEE MILCOM Conference, Oct. 2024.
  • Zhiwu Guo, Chicheng Zhang, Ming Li, and Marwan Krunz, "Fair probabilistic multi-armed bandit with applications to network optimization," IEEE Transactions on Machine Learning in Communications and Networking (TMLCN), vol. 2, pp. 994-1016, June 2024, doi: 10.1109/TMLCN.2024.3421170.
  • Arush S. Sharma and Marwan Krunz, "Enhanced RFI detection in imbalanced astronomical observations using weakly supervised GANs," Proc. of the IEEE ICC Conference - Workshop on Catalyzing Spectrum Sharing via Active-Passive Coexistence, Denver, June 2024.
  • Changxin Li and Sanmukh Kuppannagari, "Exploring algorithmic design choices for low latency CNN deployment," Proc. of the IEEE 31st International Conference on High Performance Computing, Data, and Analytics (HiPC), 2024.
  • Timothy L. Cronin and Sanmukh Kuppannagari, "A framework to enable algorithmic design choice exploration in DNNs," Proc. of the IEEE High Performance Extreme Computing Conference (HPEC), 2024.