In this lab, we will implement a software driver to enable integration and operation of the previously designed hardware accelerator with the host system. Through mechanisms such as memory-mapped I/O or interrupts, we establish control and data communication between the CPU and the accelerator. The driver acts as a critical interface layer between the runtime system and the hardware accelerator, exposing low-level register or memory access as high-level software APIs. The runtime, which orchestrates the overall inference flow, leverages the driver to configure the accelerator, transfer data, and retrieve results. This separation of concerns allows the runtime to focus on model-level control logic, while the driver handles hardware-specific interactions. This architecture enables us to evaluate the performance gains provided by the accelerator during real model inference. Additionally, we investigate the feasibility and effectiveness of optimization techniques for accelerating inference in the absence of hardware acceleration. Through these experiments and analyses, we aim to gain a comprehensive understanding of how hardware acceleration and optimization affect model inference performance under different scenarios, thereby providing practical insights for future system design and deployment.
Lab 4.1 - Device Driver
A driver is a special type of software responsible for enabling the operating system to control hardware devices. It acts as a translator or bridge between the operating system and the hardware. The operating system itself does not directly manipulate hardware; instead, it uses drivers to instruct the hardware what to do.
Why Do We Need Drivers?
The OS Doesn’t Understand Hardware Languages
Each hardware device (such as printers, keyboards, network cards) has its own specific control mechanisms. The operating system cannot directly handle these differences. Drivers are responsible for translating OS commands into signals that the hardware can understand, and then returning responses from the hardware back to the OS.
Supporting Various Hardware with One OS
With drivers, no matter which manufacturer produces the hardware, as long as a compatible driver exists, the OS can control the device. This means the OS does not need to be rewritten for each different hardware.
Software Perspective
A driver is a piece of low-level system software, usually running within the kernel space of the OS.
It provides a standard interface (API) for applications or the OS to call (e.g., to read or send data).
It also handles hardware interrupts, such as notifying the OS when "data is ready" or "operation completed."
Hardware Perspective
The driver reads from and writes to control registers of the hardware (e.g., through memory-mapped I/O).
It controls hardware behavior based on configurations (e.g., starting data transfer, resetting the device).
It also facilitates data movement, such as transferring data from the hardware into main memory (RAM).
Communication between CPU and Peripheral Device
In modern computer systems, the CPU needs to exchange data with various peripheral devices, such as keyboards, network interfaces, graphics cards, and storage devices. These devices are not directly hardwired to the CPU for control. Instead, they are accessed through well-designed interface mechanisms, including:
Memory-Mapped I/O (MMIO)
Port-Mapped I/O (also known as I/O-mapped I/O)
Interrupt
MMIO (Memory-Mapped I/O)
MMIO is a design method where hardware device registers are mapped into the system’s memory address space. We use this method to control our DLA.
Device registers are accessed using normal memory instructions (load, store).
uint32_t value = *(volatile uint32_t *)0xFFFF0000; // read MMIO device
Port-Mapped I/O (Isolated I/O)
Devices are controlled via a separate I/O address space using specific CPU instructions (e.g., in, out on x86).
Registers are not part of memory space.
Smaller address space (e.g., 64KB).
Interrupt
Interrupts allow devices to asynchronously notify the CPU of events. The CPU pauses execution, runs an Interrupt Service Routine (ISR), then resumes. Benefits:
Avoids polling
Enables timely response to I/O events
Typical Use Cases:
Timer ticks
I/O completion (e.g., DMA, keyboard)
Network packet arrival
Software-Hardware Co-design Framework
In real embedded systems, software (runtime) typically interacts with hardware through Memory-Mapped I/O (MMIO) or device drivers, enabling indirect access to registers or peripherals. However, in this lab, we do not have access to actual hardware. Instead, we use Verilator to translate Verilog RTL into cycle-accurate C++ models, generating a Verilated model for simulation.
The issue arises because Verilated models expose only low-level signal interfaces (e.g., .clk, .rst, .data_in, .data_out), unlike real hardware that provides register-based MMIO access. As a result, in order to control the Verilated hardware, the runtime must directly manipulate the internal C++ objects generated by Verilator. This tight coupling requires the runtime to conform to the structure and naming conventions of the Verilator testbench, which is not modular and hinders portability.
To address this problem, we introduce the Hardware Abstraction Layer (HAL) in this lab.
Purpose and Benefits of HAL
Hardware Interface Abstraction
HAL wraps the Verilated model and exposes an interface that resembles real hardware, such as read_reg(addr) and write_reg(addr, value). This allows the runtime to remain agnostic to whether it is interacting with an RTL simulation or actual FPGA hardware.
Unified Runtime Codebase
With HAL in place, a single runtime implementation can be reused across both simulation and real hardware environments without any modification.
Modularity via Runtime Library
On top of HAL, a Runtime Library can be built to provide high-level operations (e.g., launching computation, setting parameters, fetching results). This allows the main application to focus purely on control logic, improving software modularity, maintainability, and portability.
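To make this layering concrete, here is a minimal runnable sketch. The class names (`IHal`, `MockHal`) and the `launch` helper are inventions for this illustration, not the lab's actual hal.hpp; only the `read_reg`/`write_reg` interface shape comes from the text above.

```cpp
#include <cstdint>
#include <map>

// Hypothetical HAL interface: the runtime sees only register reads/writes.
class IHal {
public:
    virtual ~IHal() = default;
    virtual bool write_reg(uint32_t addr, uint32_t value) = 0;
    virtual bool read_reg(uint32_t addr, uint32_t &value) = 0;
};

// One backend could wrap the Verilated model; this mock backend just uses a
// map, which is enough to show that the runtime never touches Verilator
// objects directly.
class MockHal : public IHal {
    std::map<uint32_t, uint32_t> regs;
public:
    bool write_reg(uint32_t addr, uint32_t value) override {
        regs[addr] = value;
        return true;
    }
    bool read_reg(uint32_t addr, uint32_t &value) override {
        auto it = regs.find(addr);
        if (it == regs.end()) return false;
        value = it->second;
        return true;
    }
};

// A runtime-level helper built on top of the HAL: configure, then enable.
bool launch(IHal &hal, uint32_t param, uint32_t enable_cfg) {
    return hal.write_reg(0x4, param) && hal.write_reg(0x0, enable_cfg);
}
```

The same `launch` code would run unmodified against a Verilator-backed `IHal` implementation or an FPGA-backed one — which is exactly the portability benefit described above.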
Can we record hardware runtime information, such as cycles and memory access time?
Yes, we provide an info counter in the HAL.
HAL info counter
The runtime_info structure is designed to record execution-related metrics during hardware simulation or emulation. It provides useful insights into the system's behavior, performance, and memory access patterns. Below is a detailed explanation of each field:
include/hal/hal.hpp
/**
* @struct runtime_info
* @brief Stores runtime performance metrics for a hardware simulation or
* execution.
*
* This structure holds information about execution cycles, elapsed time, and
* memory operations during a hardware simulation or a specific computation.
 */
struct runtime_info {
    uint32_t elapsed_cycle; ///< Number of cycles elapsed during execution.
    uint32_t elapsed_time;  ///< Total elapsed time (e.g., in nanoseconds).
    uint32_t memory_read;   ///< Total number of memory read operations.
    uint32_t memory_write;  ///< Total number of memory write operations.
};
Note
The HAL counters do not simulate software-level counters; instead, they function more like debug counters that are typically embedded in hardware during design for debugging and analysis purposes. However, in our hardware implementation, we did not allocate dedicated registers for such counters. Instead, we supplement these counters within the HAL, allowing the driver to access them during simulation or emulation.
Address Mapping Problem
Q: Our simulated CPU uses 64-bit addresses, but our memory-mapped peripherals (MMIO) only occupy a small 32-bit region.
How do we bridge the address space between the CPU and simulated devices?
Why can't we directly access ARADDR_M in a 64-bit simulation environment?
HAL Simple MMU: Mapping 32-bit AXI Addresses in a 64-bit Simulation
In our simulation environment, the host system uses a 64-bit virtual memory space, while the AXI bus operates with 32-bit addresses. This mismatch can cause address mapping issues, as AXI requests may not directly correspond to valid host memory addresses.
To resolve this, we implement a simple MMU mechanism within the HAL. By capturing the high 32 bits of the HAL instance’s address (vm_addr_h), we can reconstruct valid 64-bit addresses by combining them with 32-bit AXI addresses:
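The reconstruction amounts to a bitwise OR of the saved high half with the 32-bit AXI address. The helper below is only an illustration (the actual HAL does this inline, e.g. `vm_addr_h | device->ARADDR_M` in the DMA handlers shown later):

```cpp
#include <cstdint>

// Rebuild a 64-bit host address from the HAL's saved high bits and a 32-bit
// AXI address. vm_addr_h is assumed to already occupy bits [63:32].
uint64_t to_host_addr(uint64_t vm_addr_h, uint32_t axi_addr) {
    return vm_addr_h | (uint64_t)axi_addr;
}
```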
If the mapping address crosses the 32-bit peripheral space boundary, invalid access may occur. This happens because the host's 64-bit address space cannot safely simulate memory outside the defined 32-bit region.
If needed, the MMU can be optimized in the future, but for this lab, a 4 GB address space is enough for simulation.
Support MMIO write in HAL with AXI interface
Please read the descriptions, then copy and paste the code into hal.cpp.
bool HardwareAbstractionLayer::memory_set(uint32_t addr, uint32_t data) {
    if (device == NULL) {
        fprintf(stderr, "[HAL] device is not init yet.\n");
    }
When a write request is issued via HAL, the first step is to verify whether the provided address falls within the valid MMIO (Memory-Mapped I/O) region. This region starts at baseaddr and spans a range defined by mmio_size, meaning a valid address must lie within the interval [baseaddr, baseaddr + mmio_size]. If the address fails this check (i.e., it is out of bounds), HAL will immediately return false, indicating that the operation is invalid and halting any further processing.
#ifdef DEBUG
    fprintf(stderr, "[HAL memory_set] (0x%08x) 0x%08x \n", addr, data);
#endif
    if (addr < baseaddr || addr > baseaddr + mmio_size) {
#ifdef DEBUG
        fprintf(stderr, "[HAL ERROR] address 0x%08x is not in device MMIO range.\n",
                addr);
#endif
        return false;
    }
Following a successful address check, HAL proceeds with the AXI4 protocol, using three separate channels to complete a full write transaction. The first is the Address Write (AW) channel. HAL sets the target address into AWADDR_S and asserts AWVALID_S to signal that a valid address is being presented. The system then waits for the counterpart (typically an interconnect or slave device) to assert AWREADY_S, indicating readiness to accept the address. Once AWREADY_S is high, HAL advances one clock cycle and de-asserts AWVALID_S, completing the address phase.
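The AW-phase code itself ships with the lab snippets; as a standalone illustration of the valid/ready handshake pattern it uses, here is a runnable sketch against a tiny mock slave. `MockDevice`, its 2-cycle acceptance delay, and this simplified `clock_step` are inventions for the example — the real counterpart is the Verilated model.

```cpp
#include <cstdint>

// Mock of the slave-side AW signals; the real object is generated by Verilator.
struct MockDevice {
    uint32_t AWADDR_S = 0;
    int AWVALID_S = 0, AWREADY_S = 0;
    int delay = 2;       // mock slave accepts the address after 2 cycles
    void eval() {}       // combinational-evaluation stub
};

// One clock step: the mock slave raises AWREADY_S once its delay has elapsed.
void clock_step(MockDevice *d, uint32_t &cycle) {
    ++cycle;
    if (d->AWVALID_S && d->delay > 0) --d->delay;
    d->AWREADY_S = (d->AWVALID_S && d->delay == 0);
}

// AW-phase handshake as described in the text: present the address, wait for
// ready, advance one more cycle, then de-assert valid.
uint32_t aw_phase(MockDevice *d, uint32_t addr) {
    uint32_t cycle = 0;
    d->AWADDR_S = addr;
    d->AWVALID_S = 1;
    d->eval();
    while (!d->AWREADY_S) {
        clock_step(d, cycle);
    }
    clock_step(d, cycle);
    d->AWVALID_S = 0;
    return cycle;  // cycles spent in the address phase
}
```

The W and B phases below follow exactly the same wait-for-handshake shape, only with different signal pairs.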
After sending the address, HAL transitions to the data write phase using the Write Data (W) channel. The data is placed in WDATA_S, and WVALID_S is asserted to indicate the data is valid. Similar to the address phase, HAL waits for WREADY_S to be asserted by the receiver, signaling that it is ready to accept the data. At that point, HAL advances one clock cycle and de-asserts WVALID_S, marking the completion of the data transmission.
    // send write data
    device->WDATA_S = data;
    device->WSTRB_S = 0;  // unused
    device->WLAST_S = 1;  // single shot, always the last one
    device->WVALID_S = 1; // valid
    device->eval();
    // wait for ready (data)
    while (!device->WREADY_S) {
        clock_step(device, ACLK, info.elapsed_cycle, info.elapsed_time);
    }
    clock_step(device, ACLK, info.elapsed_cycle, info.elapsed_time);
    device->WVALID_S = 0;
Once the data is successfully sent, the final step is to receive the write response through the B (Write Response) channel. HAL first asserts BREADY_S to indicate that it is ready to receive a response, then waits for BVALID_S to be asserted by the receiver. Once BVALID_S goes high, HAL reads the value of BRESP_S to determine whether the write was successful. If the response is AXI_RESP_OKAY, the operation is considered successful and HAL returns true; otherwise, it returns false.
Please read the descriptions, then copy and paste the code into hal.cpp.
The flow is the same as in the write method.
bool HardwareAbstractionLayer::memory_get(uint32_t addr, uint32_t &data) {
    if (device == NULL) {
        fprintf(stderr, "[HAL] device is not init yet.\n");
    }
When a read request is issued through the HAL, the first step is to verify whether the target address falls within the valid MMIO (Memory-Mapped I/O) region. If the address is out of bounds, the HAL immediately returns false, indicating that the operation is invalid and that no further steps will be taken.
#ifdef DEBUG
    fprintf(stderr, "[HAL memory_get] (0x%08x) \n", addr);
#endif
    if (addr < baseaddr || addr > baseaddr + mmio_size) {
#ifdef DEBUG
        fprintf(stderr, "[HAL ERROR] address 0x%08x is not in device MMIO range.\n",
                addr);
#endif
        return false;
    }
If the address passes the check, the HAL proceeds to carry out the read transaction following the AXI4 protocol, which involves a three-phase operation. The first phase uses the AR (Address Read) channel. The HAL sets the target read address to ARADDR_S and asserts ARVALID_S, signaling that a valid read request is being issued. At this point, the system waits for the receiving end (such as an interconnect or slave device) to assert ARREADY_S, indicating readiness to accept the address. Once ARREADY_S is high, the HAL advances one clock cycle and de-asserts ARVALID_S, completing the address transmission.
After the address phase is completed, the HAL transitions to the data reception phase via the R (Read Data) channel. It first asserts RREADY_S to indicate that it is ready to receive data. When the slave asserts RVALID_S, signaling that valid data is available, the HAL immediately reads the value on RDATA_S and de-asserts RREADY_S, completing the data transfer.
Finally, the HAL examines the contents of RRESP_S to determine the status of the read operation. If the response is AXI_RESP_OKAY, the read is considered successful, and the retrieved data is stored in the specified variable. The function then returns true. Otherwise, if the response indicates an error, the HAL returns false, signaling that the read operation failed.
    // get read data
    data = device->RDATA_S;
    int resp = device->RRESP_S;
    clock_step(device, ACLK, info.elapsed_cycle, info.elapsed_time);
    return resp == AXI_RESP_OKAY;
}
We’ve now covered the HAL info counter and the simple MMU. With this knowledge, we’re ready to integrate them into the DMA handler.
Support DMA read in HAL with AXI interface
Please read the descriptions, then copy and paste the code into hal.cpp.
In the DMA read handling routine provided by our HAL, the process begins by retrieving the read address and burst length from the master interface. Specifically, the address is obtained from ARADDR_M and the burst length is retrieved from ARLEN_M, which determines the number of data beats to be transferred in this burst operation (for a total of len + 1 beats, as the AXI burst length is zero-based).
To initiate the transaction, the HAL asserts ARREADY_M to complete the address handshake on the AR channel. After simulating a clock cycle to reflect the timing behavior of the interface, ARREADY_M is de-asserted, signaling that the address phase is complete.
void HardwareAbstractionLayer::handle_dma_read() {
    // get read address
    uint32_t *addr;
    addr = (uint32_t *)(vm_addr_h | device->ARADDR_M);
    uint32_t len = device->ARLEN_M;

    device->ARREADY_M = 1;
    clock_step(device, ACLK, info.elapsed_cycle, info.elapsed_time);
    device->ARREADY_M = 0;
    clock_step(device, ACLK, info.elapsed_cycle, info.elapsed_time);
#ifdef DEBUG
    fprintf(stderr, "[HAL handle_dma_read] addr = %p, len = %d \n", addr,
            len + 1);
#endif
Once the read address is acknowledged, the HAL proceeds to send data through the R channel in a burst manner. For each data beat, the corresponding memory content is fetched from the emulated memory (*(addr + i)) and sent through RDATA_M. The data response is always set to AXI_RESP_OKAY, and RID_M is set to 0 by default. Before sending each beat, the HAL simulates a memory access delay by incrementing info.elapsed_cycle and info.elapsed_time accordingly.
    // send read data (increase mode, burst_size 32bits)
    device->RID_M = 0;  // default
    device->RRESP_M = AXI_RESP_OKAY;
    for (int i = 0; i <= len; i++) {
        device->RDATA_M = *(addr + i);           // send read data
        info.elapsed_cycle += MEM_ACCESS_CYCLE;  // simulate memory access delay
        info.elapsed_time += MEM_ACCESS_CYCLE * CYCLE_TIME;
#ifdef DEBUG
        fprintf(stdout, "[HAL handle_dma_read] addr = %p, data = %08x \n",
                addr + i, *(addr + i));
#endif
        device->RLAST_M = i == len;  // the last one
        device->RVALID_M = 1;
        device->eval();
During the burst, RVALID_M is asserted to indicate that data is valid, and the system waits until the DMA master sets RREADY_M, signaling it is ready to accept the data. Only then is the clock advanced and the next beat prepared. On the final beat of the burst, RLAST_M is asserted to indicate the end of the transfer.
        // wait DMA ready for next data
        while (!device->RREADY_M) {
            clock_step(device, ACLK, info.elapsed_cycle, info.elapsed_time);
        }
        clock_step(device, ACLK, info.elapsed_cycle, info.elapsed_time);
        device->RVALID_M = 0;
        clock_step(device, ACLK, info.elapsed_cycle, info.elapsed_time);
    }
    device->eval();
    info.memory_read += (len + 1) * sizeof(uint32_t);  // total bytes read
}
At the end of the transaction, the HAL updates the internal counter info.memory_read to reflect the total number of bytes read. This is computed as (len + 1) * sizeof(uint32_t), and is used for performance monitoring and profiling purposes, such as bandwidth analysis or simulation statistics.
Please read the descriptions, then copy and paste the code into hal.cpp.
In the DMA write handling routine provided by our HAL, the process begins by retrieving the write address and burst length from the master interface. Specifically, the address is obtained from AWADDR_M and the burst length is retrieved from AWLEN_M, which determines the number of data beats to be transferred in this burst operation (for a total of len + 1 beats, as the AXI burst length is zero-based).
To initiate the transaction, the HAL asserts AWREADY_M to complete the address handshake on the AW channel. After simulating a clock cycle to reflect the timing behavior of the interface, AWREADY_M is de-asserted, signaling that the address phase is complete.
void HardwareAbstractionLayer::handle_dma_write() {
    // get address
    uint32_t *addr;
    addr = (uint32_t *)(vm_addr_h | device->AWADDR_M);
    uint32_t len = device->AWLEN_M;

    device->AWREADY_M = 1;
    clock_step(device, ACLK, info.elapsed_cycle, info.elapsed_time);
    device->AWREADY_M = 0;
    clock_step(device, ACLK, info.elapsed_cycle, info.elapsed_time);
#ifdef DEBUG
    fprintf(stderr, "[HAL handle_dma_write] addr = %p, len = %d \n", addr,
            len + 1);
#endif
Once the write address is acknowledged, the HAL proceeds to receive data through the W channel in a burst manner. For each data beat, the corresponding data is obtained from WDATA_M and written into the emulated memory (*(addr + i)). Before processing each beat, the HAL simulates a memory access delay by incrementing info.elapsed_cycle and info.elapsed_time accordingly.
    // recv write data (increase mode, burst_size 32bits)
    device->RID_M = 0;  // default
    for (int i = 0; i <= len; i++) {
        *(addr + i) = (uint32_t)device->WDATA_M;  // recv write data
        info.elapsed_cycle += MEM_ACCESS_CYCLE;   // simulate memory access delay
        info.elapsed_time += MEM_ACCESS_CYCLE * CYCLE_TIME;
#ifdef DEBUG
        fprintf(stdout, "[HAL handle_dma_write] addr = %p, data = %08x \n",
                addr + i, *(addr + i));
#endif
        device->WREADY_M = 1;
        device->eval();
During the burst, WREADY_M is asserted to indicate that the HAL is ready to accept data. The system then waits until the DMA master sets WVALID_M, signaling the data is valid and ready to be written. Once valid data is detected, the clock is advanced and WREADY_M is de-asserted to complete the handshake for that beat.
        // wait DMA valid for next data
        while (!device->WVALID_M) {
            clock_step(device, ACLK, info.elapsed_cycle, info.elapsed_time);
        }
        clock_step(device, ACLK, info.elapsed_cycle, info.elapsed_time);
        device->WREADY_M = 0;
        clock_step(device, ACLK, info.elapsed_cycle, info.elapsed_time);
    }
    device->eval();
At the end of the data transfer, the HAL sends a write response by asserting BVALID_M and setting BRESP_M to AXI_RESP_OKAY, indicating a successful write operation. BID_M is set to 0 by default. The system then waits until the DMA master asserts BREADY_M to acknowledge the response, after which the response phase is completed and BVALID_M is de-asserted.
Finally, the HAL updates the internal counter info.memory_write to reflect the total number of bytes written. This is computed as (len + 1) * sizeof(uint32_t), and is used for performance monitoring and profiling purposes, such as bandwidth analysis or simulation statistics.
The above describes the basic design concept and implementation of the HAL. For more detailed design information and parameters, refer to the files include/hal/hal.hpp and src/hal/hal.cpp.
Lab 4.2 - Performance Profiling and Optimization
Why Do We Need CPU Performance Profiling?
In many embedded systems, edge devices, or environments lacking dedicated accelerators such as GPUs or NPUs, developers are often faced with the challenge of running AI models using only CPUs. Given the computationally intensive nature and frequent memory access patterns of AI models, performance can degrade significantly without proper analysis and optimization. Therefore, uncovering potential performance bottlenecks and applying compiler-level optimizations are crucial steps to ensure efficient AI inference on CPU-only platforms. This lecture introduces the use of performance analysis tools such as Valgrind and Cachegrind, along with common CPU-level optimization techniques, to help developers effectively deploy AI applications even in resource-constrained environments without hardware accelerators.
Valgrind
Valgrind is an open-source debugging and performance analysis tool for Linux, licensed under the GPL. Its suite of tools enables automated detection of memory management and threading issues, significantly reducing the time spent on debugging and enhancing program stability. Additionally, Valgrind provides in-depth profiling capabilities to optimize application performance.
Cachegrind
Cachegrind is a cache profiling tool that simulates the behavior of the I1 (instruction cache), D1 (data cache), and L2 caches in a CPU, providing precise identification of cache misses in code. It records the number of cache misses, memory accesses, and executed instructions at the source code level, offering detailed analysis at the function, module, and program-wide levels. Supporting programs written in any language, Cachegrind provides comprehensive profiling insights, though it operates at a runtime overhead of approximately 20 to 100 times slower than native execution.
Cachegrind Usage Guide
To start profiling a program using Cachegrind, run:
valgrind --tool=cachegrind ./your_program
The output file cachegrind.out.<pid> contains detailed cache statistics. To generate a human-readable report, use cg_annotate:
cg_annotate cachegrind.out.<pid>
Compiler Optimization Levels
-O0
No optimization. Suitable for debugging; preserves source code structure.
-O1
Enables basic optimizations. Safe and stable.
-O2
Moderate optimizations. Suitable for general use cases.
-O3
Aggressive optimizations including loop unrolling and vectorization.
Common Optimization Techniques
Technique
Description
Loop Unrolling
Expands loop bodies to reduce control overhead and improve instruction-level parallelism.
SIMD Vectorization
Uses SIMD instructions to process multiple data elements in parallel.
Common Subexpression Elimination
Eliminates redundant computations by reusing previously computed expressions.
Constant Folding / Propagation
Computes constant expressions at compile time and substitutes their values.
Loop-Invariant Code Motion
Moves calculations that do not change within a loop to outside the loop.
Memory Allocation & Alignment
Using aligned memory allocation (e.g., aligned_alloc or posix_memalign) can improve performance by ensuring better cache locality and memory alignment, especially for large or frequently accessed data.
Examples
Loop Unrolling Before
for (int i = 0; i < 4; i++) {
    sum += a[i];
}
This loop adds one element per iteration and includes overhead from loop control (increment, comparison).
After
sum += a[0];
sum += a[1];
sum += a[2];
sum += a[3];
By unrolling the loop, we reduce control instructions and increase instruction-level parallelism (ILP), improving CPU pipeline efficiency.
SIMD Vectorization
When compiling with -O3, the compiler can automatically convert such a loop into SIMD instructions. Alternatively, SIMD intrinsics can be used explicitly:
#include <immintrin.h>

__m128 vb = _mm_loadu_ps(b);    // load 4 float values
__m128 vc = _mm_loadu_ps(c);
__m128 va = _mm_add_ps(vb, vc); // SIMD add
_mm_storeu_ps(a, va);           // store
SIMD processes multiple elements in parallel, reducing the number of arithmetic and memory access instructions.
Common Subexpression Elimination
Before

int y = (x + 2) * (x + 2);

The expression (x + 2) is calculated twice — a redundant computation.
After

int t = x + 2;
int y = t * t;

The optimized version computes the common expression only once, reducing ALU usage.
Constant Folding / Propagation
Before

int x = 3 * 4;    // Constant Folding
const int a = 5;
int y = a + 2;    // Constant Propagation

The compiler can compute these values at compile time and replace them with constants.
After

int x = 12;
int y = 7;

Reduces runtime computation and generates simpler machine code.
Loop-Invariant Code Motion
Before
for (int i = 0; i < n; i++) {
    y[i] = x[i] * a * b;
}
The expression a * b is invariant across iterations and gets recalculated unnecessarily. After
int t = a * b;
for (int i = 0; i < n; i++) {
    y[i] = x[i] * t;
}
Moves the loop-invariant computation outside the loop, reducing the number of multiplications.
Memory Allocation & Alignment
int *data = malloc(1024 * sizeof(int));
Default malloc alignment may be suboptimal for cache usage and SIMD.
Aligning to 64 bytes improves cache line utilization and enables efficient SIMD access — especially beneficial for large data structures like tensors and matrices. Most modern CPUs use a 64-byte cache line size, so aligning memory to 64-byte boundaries ensures that data fits neatly within a single cache line, avoiding splits across multiple lines and reducing cache misses.
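For example, C11/C++17 `aligned_alloc` (or POSIX `posix_memalign`) can request a 64-byte boundary directly. The helper name below is an invention for illustration; note that `aligned_alloc` requires the size to be a multiple of the alignment, hence the round-up.

```cpp
#include <cstdint>
#include <cstdlib>

// Allocate n ints starting on a 64-byte boundary (i.e., on a cache line).
int *alloc_aligned_ints(std::size_t n) {
    std::size_t bytes = n * sizeof(int);
    // aligned_alloc requires size to be a multiple of the alignment.
    std::size_t rounded = (bytes + 63) & ~static_cast<std::size_t>(63);
    return static_cast<int *>(std::aligned_alloc(64, rounded));
}
```

Memory obtained this way must still be released with `free`.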
Notes and Caveats
Optimizations may make debugging more difficult (e.g., variables may be inlined or eliminated).
Programs with undefined behavior may yield unexpected results under optimization.
DLA info Record for Performance Profiling
From the above introduction to the HAL, we can see that the info struct contains four counters. We have therefore exposed functionality to read and reset these counters at the DLA driver level.
Additionally, we provide an API that exports DLA computation information along with the counters recorded by the HAL into a CSV file, facilitating further analysis.
src/eyeriss/dla/runtime_dla.cpp
void create_dla_info_to_csv(const char *filename) {
    fprintf(stdout, "Creating dla info file: %s\n", filename);
    FILE *file = fopen(filename, "w");
    if (!file) {
        fprintf(stderr, "Create DLA info file failed.\n");
        return;
    }
    fprintf(file, "Operation,Cycles,Time(ns),Memory read,Memory "
                  "write,m,e,p,q,r,t,PAD,U,R,S,C,M,W,H\n");
    fclose(file);
}

void dump_dla_info_to_csv(const char *filename, const char *operation_name,
                          // mapping parameters
                          uint32_t m, uint32_t e, uint32_t p, uint32_t q,
                          uint32_t r, uint32_t t,
                          // shape parameters
                          uint32_t PAD, uint32_t U, uint32_t R, uint32_t S,
                          uint32_t C, uint32_t M, uint32_t W, uint32_t H) {
    FILE *file = fopen(filename, "a");
    struct runtime_info info = get_runtime_info();
    fprintf(file, "%s,", operation_name);        // Operation
    fprintf(file, "%10d,", info.elapsed_cycle);  // Cycles
    fprintf(file, "%10d,", info.elapsed_time);   // Time (ns)
    fprintf(file, "%10d,", info.memory_read);    // Memory read
    fprintf(file, "%10d,", info.memory_write);   // Memory write
    fprintf(file, "%d,%d,%d,%d,%d,%d,", m, e, p, q, r, t);
    fprintf(file, "%d,%d,%d,%d,%d,%d,%d,%d\n", PAD, U, R, S, C, M, W, H);
    fclose(file);
}
Tip
If the simulation program is compiled with the DLA_INFO macro defined, this profiling feature will be enabled.
src/eyeriss/dla/runtime_dla.cpp
void dla_init() {
#ifdef DLA_INFO
    fprintf(stdout, "DLA runtime info logging enabled.\n");
    dla_reset_runtime_info();
    create_dla_info_to_csv(DLA_INFO_CSV);
#endif
    hal_init();
}
The compilation usage will be mentioned in the homework section.
Homework Requirements
In this lab, you learned about the HAL and the basic concepts of device drivers. Now it is your turn to implement the driver so the CPU can control the DLA's operations.
In addition, you will implement several APIs to support operations commonly used in CNNs, such as convolution, max pooling, ReLU, and matmul.
Prerequisites
Download the sample code and report template from Moodle and then decompress it.
unzip aoc2025-lab4.zip
Check the Verilator version
We are using Verilator 5.030 in this lab. You can run the following command to verify it:
verilator --version
It should show:
verilator 5.030
Directory Structure
After unzipping the file downloaded from Moodle, the directory structure will look like the following:
In Lab 3, you have already implemented the complete PE-array architecture and the PPU. Now, in this lab, the TAs will provide you with the entire accelerator IP. The complete architecture is shown in the diagram below, which includes sub-modules such as the controller, global buffer, DMA, and MMIO AXI interface. We have already used Verilator to convert it into a C++ library and connected it to the HAL. Your task for this lab is to implement the DLA driver on top of the HAL.
MMIO register configuration
The following is the MMIO configuration of the accelerator when mapped into memory space. It includes the memory information it needs to operate on, as well as the computation parameters. It is important to note that the enable register should only be set after all parameters have been properly configured. This ensures the accelerator correctly reads the parameters before starting. Therefore, when implementing the driver, you must take the order of register writes into account.
We mount the DLA on 0x10040000 ~ 0x10041000 of the system address space (in src/eyeriss/dla/hardware_dla.h)
/* ========================= DLA Register Base Address & Size ========================= */
#define DLA_MMIO_BASE_ADDR 0x10040000 ///< Base address of the DLA MMIO registers.
#define DLA_MMIO_SIZE      0x1000     ///< Size of the DLA register memory map.
The following MMIO registers are all 32-bit (4 bytes) wide. Each address represents the starting location of the corresponding data.
Address Offset   Name              Description
0x0              enable            DLA enable with operation config
0x4              mapping_param     Mapping Parameter
0x8              shape_param1      Shape Parameter
0xc              shape_param2      Shape Parameter 2
0x10             ifmap_addr        Input feature map address (starting address in DRAM)
0x14             filter_addr       Filter address (starting address in DRAM)
0x18             bias_addr         Bias address (starting address in DRAM)
0x1c             ofmap_addr        Output feature map address (starting address in DRAM)
0x20             GLB_filter_addr   Global buffer filter address (starting address in GLB)
0x24             GLB_opsum_addr    Global buffer output sum address
0x28             GLB_bias_addr     Global buffer bias address
0x2c             ifmap_len         Input feature map length
0x30             ofmap_len         Output feature map length
You can see the C macro definitions in src/eyeriss/dla/hardware_dla.h
/* ========================= DLA Register Offsets ========================= */
#define DLA_ENABLE_OFFSET          0x0  ///< Offset for enabling/disabling the DLA.
#define DLA_MAPPING_PARAM_OFFSET   0x4  ///< Offset for setting mapping parameters.
#define DLA_SHAPE_PARAM1_OFFSET    0x8  ///< Offset for shape parameters (filter, channel).
#define DLA_SHAPE_PARAM2_OFFSET    0xc  ///< Offset for shape parameters (input size, padding).
#define DLA_IFMAP_ADDR_OFFSET      0x10 ///< Offset for input feature map address.
#define DLA_FILTER_ADDR_OFFSET     0x14 ///< Offset for filter weights address.
#define DLA_BIAS_ADDR_OFFSET       0x18 ///< Offset for bias values address.
#define DLA_OPSUM_ADDR_OFFSET      0x1c ///< Offset for output sum buffer address.
#define DLA_GLB_FILTER_ADDR_OFFSET 0x20 ///< Offset for global filter weights address.
#define DLA_GLB_OFMAP_ADDR_OFFSET  0x24 ///< Offset for global output feature map address.
#define DLA_GLB_BIAS_ADDR_OFFSET   0x28 ///< Offset for global bias values address.
#define DLA_IFMAP_LEN_OFFSET       0x2c ///< Offset for input activation length.
#define DLA_OFMAP_LEN_OFFSET       0x30 ///< Offset for output activation length.
#define DLA_UNDEFINED                   ///< Placeholder for undefined registers.
The details of the bitwise configuration for the first four MMIO registers are as follows. Then, you have to implement them in src/eyeriss/dla/hardware_dla.c
1. DLA enable with operation config

Important

The `enable` register should be the last one written when setting the MMIO registers.

- Operation field: `0` for CONV; `1` is reserved (this operation is not implemented in the ASIC; you can extend it in the future, e.g., for FCN).
- Enable field: `0` to disable; `1` to enable the DLA.

2. Mapping Parameter (`mapping_param`) - Please refer to the paper for the definition.

3. Shape Parameter (`shape_param1`) - Please refer to the paper for the definition.

`PAD = 1`: only padding of size 1 is supported. Other padding sizes will not be implemented in this lab.

4. Shape Parameter 2 (`shape_param2`) - Please refer to the paper for the definition.

Note: Ensure to account for padding when calculating width (W) and height (H) before writing the value to the register. Add `2 * padding` to both W and H, then apply the necessary bitwise operations.
Note
Memory Write Operation
Using the reg_write Function
In this assignment, you are required to use the function reg_write(uint32_t offset, uint32_t value); provided in hardware_dla.h to write values to a specific memory location.
Function Prototype
```c
void reg_write(uint32_t offset, uint32_t value);
```
Parameters
offset: The offset of the target memory location (relative to the DLA base address).
value: The value to be written.
Functionality
Writes value to the memory location corresponding to DLA_MMIO_BASE_ADDR + offset.
A function call-based runtime API for common DNN operations
After completing the low-level MMIO configuration driver, we need to implement appropriate computation APIs for the operations supported by the DLA.
Configuration for GLB memory allocation
"The size of the GLB is configured to 64 KB."
Students can implement the design based on the illustration provided in this diagram. The size of each block can be computed based on the shape parameters and mapping parameters, following the methodology used in the previous lab.
In the file `src/eyeriss/dla/runtime_dla.cpp`, you need to implement the `qconv2d_relu_maxpool` and `qconv2d_relu` functions to correctly activate the DLA. Before configuring the MMIO registers, you must determine the optimal `m` parameter in the runtime to maximize GLB utilization. As you learned in Lab 2, the value of `m` affects the size of the output feature map (ofmap) and bias stored in the GLB. Therefore, you need to calculate the largest power-of-two value for `m` that fits within the available GLB space.
The buffer allocation for the input feature map (ifmap) within the GLB does not need to consider padding, as its required space can be fully determined by the mapping and shape parameters. The DLA controller is aware of the padding size and includes corresponding logic to handle it, so this aspect does not need to be considered explicitly. While the previous analysis provides valuable design insights, there may be slight discrepancies during implementation due to various trade-offs.
src/eyeriss/dla/runtime_dla.cpp
```cpp
int qconv2d_relu_maxpool(uint8_t *input_in_DRAM, int8_t *filter_in_DRAM,
                         uint8_t *opsum_in_DRAM, int32_t *bias,
                         uint32_t ofmap_len, uint32_t ifmap_len, uint32_t filter_len,
                         // mapping parameter
                         uint32_t m, uint32_t e, uint32_t p, uint32_t q,
                         uint32_t r, uint32_t t,
                         // shape parameter
                         uint32_t PAD, uint32_t U, uint32_t R, uint32_t S,
                         uint32_t C, uint32_t M, uint32_t W, uint32_t H,
                         uint32_t scale)
{
    // int32_t scale_factor: merge ifmap and weight and ofmap scale bit-shift
#ifdef DLA_INFO
    dla_reset_runtime_info();
#endif

    // Calculate m for GLB memory allocation
    /*! <<<========= Implement here =========>>> */

    // call lower setting functions
    /*! <<<========= Implement here =========>>> */

    wait_for_interrupt();
    dla_stop();

#ifdef DLA_INFO
    dump_dla_info_to_csv(DLA_INFO_CSV, "qconv2d_relu_maxpool", m, e, p, q, r, t,
                         PAD, U, R, S, C, M, W, H);
#endif
    return 0;
};
```
A different GLB configuration from the previous lab

One key difference in this assignment compared to the previous one is the number of bias values stored in the GLB. In this case, the GLB holds `m` bias values; storing `m` biases rather than `p × t` reduces the number of DRAM accesses, saving handshake time. This change does not affect the number of opsums. Another difference is that the space occupied by the ifmap does not need to account for padding, as it can be determined solely by the mapping and shape parameters.
To better understand the source file, you may start by reading the corresponding header file.
DLA Testbench user guide
Note: The implementation of `hal.cpp` is also required. There are four testcases (dla0 ~ dla3).
| Testbench | Test API | Note |
| --- | --- | --- |
| dla0 | `qconv2d_relu` | |
| dla1 | `qconv2d_relu_maxpool` | |
| dla2 | `qconv2d_relu_maxpool` | |
| dla3 | `qconv2d_relu_maxpool`, `qconv2d_relu_maxpool_cpu` | You need to implement the original CPU version first in HW 4.2 |
Run the testbench separately:

```shell
cd test/dla/dla<testcase>
make test
```
For more Makefile usage:

```shell
cd test/dla/dla<testcase>
make <usage>
```
Tip
We provide a counter in the HAL to record DLA information, which will be dumped into a `.csv` file when the user enables it by running:

```shell
make all DLA_INFO=1
```
Run all testbenches without any configuration:

```shell
cd test/dla
make test
```
Note
Please be patient, as the simulation could take some time to complete. Make sure to take a screenshot of the simulation output if your test passes.
HW 4.2 - CPU Runtime API (60%)
Since there are still some operations that are not supported by the accelerator, it is necessary to implement such operations purely running on the CPU, namely CPU fallback support. Additionally, in certain embedded systems where accelerators are absent, AI model inference must be performed using only the CPU. This motivates us to develop such a library.
We’ve learned about memory hierarchy in computer organization, which helps us optimize for memory cache when performing CPU-only computations. Therefore, we need to implement basic algorithms as well as cache-optimized versions, and use tools like Valgrind to measure the effectiveness of these optimizations.
Implementation requirements
Complete the blank sections in the following files:
/src/eyeriss/cpu/improve/hardware_cpu.c
/src/eyeriss/cpu/original/hardware_cpu.c
Makefiles are also provided under the corresponding directories to compile the library, execute the program, and analyze the performance.
Original Version Implementation Requirements (30%) :
In /src/eyeriss/cpu/original/hardware_cpu.c, the conv and conv_maxpooling functions must be implemented using 6 nested for loops to fully cover the computation across output channel, output height, output width, input channel, filter height, and filter width.
The linear and linear_relu functions must be implemented using 2 nested for loops, corresponding to matrix-vector multiplication (output dimension × input dimension).
Each operation contributes 7.5 points to the overall score.
Improved Version Implementation Requirements (30%): You can use loop tiling or any method you want to reduce the cache miss rate and cycle count, improving the program without compiler optimization. We will use Cachegrind to simulate a limited cache size.
Scoring criteria for every improved operation:

| Cycles reduced ratio | Score | Note |
| --- | --- | --- |
| < 20% | 2.5% | Basic implementation score |
| ≥ 20% | 3.5% | |
| ≥ 30% | 5% | |
| ≥ 40% | 6.5% | |
| ≥ 50% | 7.5% | |
Cycle Reduction Ratio, defined as: \[
\frac{\text{Cycle}_{\text{original}} - \text{Cycle}_{\text{improved}}}{\text{Cycle}_{\text{original}}}
\]
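As a worked example with hypothetical numbers: if the original implementation takes \(4.0 \times 10^8\) cycles and the improved one takes \(2.8 \times 10^8\) cycles, the ratio is

\[
\frac{4.0 \times 10^8 - 2.8 \times 10^8}{4.0 \times 10^8} = 0.30 = 30\%,
\]

which would earn 5% for that operation under the scoring criteria.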
Important
Since the number of D refs and D1 misses does not vary with the server's state, these values will be used as a reference during grading. The cycle count, however, will be evaluated with a more lenient standard.
Homework user guide
Makefile Usage
To view the Makefile description, run make usage in the test/cpu directory.
The following is an example demonstration. When you type:

```shell
make i_conv
```

the Makefile automatically executes the compilation process and, upon successfully generating the `CONV_improve.elf` executable, immediately runs the program and records the result in `CONV_improve.elf.log`. Please record the cycle count from `CONV_improve.elf.log` in your report. The following is the information displayed in the terminal:
```shell
make test VER=improve OP=CONV
make[1]: Entering directory .......
mkdir -p build
  CC   runtime_cpu
  CC   hardware_cpu
  CXX  main
  LD   CONV_improve.elf
mkdir -p log
Running test with VER=improve, OP=CONV
make[1]: Leaving directory ........
....
```
If the test passes, it will show:

```shell
Function: CONV
CONV: PASS
CYCLE: 376908943
```
Then, use Valgrind to simulate the cache and obtain the statistics:

```shell
make v_i_conv
```
It will record the analysis result in `CONV_improve.elf_cachegrind.log`. Please record the D refs and D1 misses from `CONV_improve.elf_cachegrind.log` in your report. The following is the information displayed in the terminal.
Important
Do not record the cycle count from here into the report.
One drawback of this hardware design is its limited user-friendliness or intuitiveness.
Submission Guidelines
Deadline
April 28, 2025 (Monday) at 23:59:59
Late submissions will not be accepted
Submission Format
Submissions must follow the specified structure:
Caution
Do not submit `hardware/`, `include/`, and `test/`; only `src/` and `report.md` are needed.
Before zipping your files into StudentID_lab4.zip, make sure to place everything inside a folder first. This helps keep things organized when the TA unzips it for grading.
```
StudentID_lab4 ----> create this folder, and then compress it as StudentID_lab4.zip
├── src
│   ├── eyeriss
│   │   ├── cpu
│   │   │   ├── improve
│   │   │   │   ├── hardware_cpu.c
│   │   │   │   ├── hardware_cpu.h
│   │   │   │   └── runtime_cpu.c
│   │   │   └── original
│   │   │       ├── hardware_cpu.c
│   │   │       ├── hardware_cpu.h
│   │   │       └── runtime_cpu.c
│   │   └── dla
│   │       ├── hardware_dla.cpp
│   │       ├── hardware_dla.h
│   │       └── runtime_dla.cpp
│   └── hal
│       └── hal.cpp
└── report.md
```
Evaluation Environment
The assignment will be graded in an Ubuntu 20.04 environment.
Please ensure your code runs correctly on Verilator v5.030.
This ensures that the evaluator can correctly set up and execute your code.
Important Notes
Submission without conforming to the guidelines may result in score deductions:
Incomplete submissions that prevent reproduction of results will be treated as incorrect answers, and no points will be awarded for those sections.
Incorrect filenames that do not impact grading will incur a 5% deduction.
Missing files will result in an additional 10% deduction.
HW 4.1 - DLA driver (30%)
In Lab 3, you have already implemented the complete PE-array architecture and the PPU. Now, in this lab, the TAs will provide you with the entire accelerator IP. The complete architecture is shown in the diagram below, which includes sub-modules such as the controller, global buffer, DMA, and MMIO AXI interface. We have already used Verilator to convert it into a C++ library and connected it to the HAL. Your task for this lab is to implement the DLA driver on top of the HAL.
MMIO register configuration
The following is the MMIO configuration of the accelerator when mapped into memory space. It includes the memory information it needs to operate on, as well as the computation parameters. It is important to note that the enable register should only be set after all parameters have been properly configured. This ensures the accelerator correctly reads the parameters before starting. Therefore, when implementing the driver, you must take the order of register writes into account.
We mount the DLA on 0x10040000 ~ 0x10041000 of the system address space (in
src/eyeriss/dla/hardware_dla.h)The following MMIO registers are all 32-bit (4 bytes) wide. Each address represents the starting location of the corresponding data.
0x0enable0x4mapping_param0x8shape_param10xcshape_param20x10ifmap_addr0x14filter_addr0x18bias_addr0x1cofmap_addr0x20GLB_filter_addr0x24GLB_opsum_addr0x28GLB_bias_addr0x2cifmap_len0x30ofmap_lenYou can see the C MACRO define in
src/eyeriss/dla/hardware_dla.hThe details of the bitwise configuration for the first four MMIO registers are as follows. Then, you have to implement them in
src/eyeriss/dla/hardware_dla.c1. DLA enable with operation config
Important
The
enableregister should be the last one when setting MMIO registers.0for CONV.1(Reserved, this operation did not implemented in the ASIC, you can extend the operation like FCN in the future).0for disable.1for enable.2. Mapping Parameter (mapping_param) - Please refer to the paper for the definition.
3. Shape Parameter (shape_param1) - Please refer to the paper for the definition.
PAD = 1: Only padding of size 1 is supported; other padding sizes will not be implemented in this lab.

4. Shape Parameter 2 (shape_param2) - Please refer to the paper for the definition.
Note: ensure you account for padding when calculating the width (W) and height (H) before writing the value to the register. Add `2 * padding` to both `W` and `H`, then apply the necessary bitwise operations.

Note

Memory Write Operation

In this assignment, you are required to use the function `reg_write(uint32_t offset, uint32_t value);` provided in `hardware_dla.h` to write values to a specific memory location. It writes `value` to the memory location corresponding to `DLA_MMIO_BASE_ADDR + offset`.

A function call-based runtime API for common DNN operations
After completing the low-level MMIO configuration driver, we need to implement appropriate computation APIs for the operations supported by the DLA.
Configuration for GLB memory allocation
"The size of the GLB is configured to 64 KB."

Students can implement the design based on the illustration provided in this diagram. The size of each block can be computed based on the shape parameters and mapping parameters, following the methodology used in the previous lab.
In `src/eyeriss/dla/runtime/dla.cpp`, you need to implement the `qconv2d_relu_maxpool` and `qconv2d_relu` functions to correctly activate the DLA. Before configuring the MMIO registers, you must determine the optimal `m` parameter in the runtime to maximize GLB utilization. As you learned in Lab 2, the value of `m` affects the size of the output feature map (ofmap) and bias stored in the GLB. Therefore, you need to calculate the largest power-of-two value for `m` that fits within the available GLB space (see `src/eyeriss/dla/runtime_dla.cpp`).

A different GLB configuration from the previous lab

One key difference in this assignment compared to the previous one is the number of bias values stored in the GLB. In this case, the number of bias values in the GLB is `m`; storing `m` biases in the GLB rather than `p × t` reduces the number of DRAM accesses, saving handshake time. This change does not affect the number of opsums. Another difference is that the space occupied by the ifmap does not need to account for padding, as it can be determined solely by the mapping parameters and shape parameters.

To better understand the source file, you may start by reading the corresponding header file.
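As a starting point for the `m` search described above, here is a hedged sketch in C. The footprint model (`fixed_bytes + m * bytes_per_m`) is an assumption for illustration only; the real per-`m` byte counts must follow the Lab 2 methodology and your GLB layout.

```c
#include <assert.h>
#include <stdint.h>

#define GLB_BYTES (64u * 1024u) /* the GLB is configured to 64 KB */

/* Hypothetical footprint model: assume the ofmap and bias regions grow
 * linearly with m while the remaining regions are fixed by the layer
 * shape. Replace this with the actual Lab 2 size computation. */
static uint32_t glb_footprint(uint32_t m, uint32_t fixed_bytes,
                              uint32_t bytes_per_m) {
    return fixed_bytes + m * bytes_per_m;
}

/* Largest power-of-two m whose footprint still fits in the GLB. */
static uint32_t pick_m(uint32_t m_max, uint32_t fixed_bytes,
                       uint32_t bytes_per_m) {
    uint32_t best = 1;
    for (uint32_t m = 1; m <= m_max; m <<= 1) {
        if (glb_footprint(m, fixed_bytes, bytes_per_m) <= GLB_BYTES)
            best = m; /* keep doubling while it still fits */
    }
    return best;
}
```

For example, with a fixed 16 KB region and 2 KB per unit of `m`, `m = 16` fits (48 KB total) but `m = 32` does not (80 KB), so the search would settle on 16.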
DLA Testbench user guide
Note: the implementation of `hal.cpp` is also required.

There are four testcases (dla0 ~ dla3).
Tip
We provide a counter in the HAL to record DLA information, which will be dumped into a `.csv` file when the user enables it by running:

Note
Please be patient, as the simulation could take some time to complete.
Make sure to take a screenshot of the simulation output if your test passes.
HW 4.2 - CPU Runtime API (60%)
Since some operations are still not supported by the accelerator, it is necessary to implement them to run purely on the CPU, namely CPU fallback support. Additionally, in certain embedded systems where accelerators are absent, AI model inference must be performed using only the CPU. This motivates us to develop such a library.
We have learned about the memory hierarchy in computer organization, which helps us optimize for the cache when performing CPU-only computations. Therefore, we need to implement both basic algorithms and cache-optimized versions, and use tools like Valgrind to measure the effectiveness of these optimizations.
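As a small illustration of why the memory hierarchy matters, the two functions below compute the same sum but traverse a 2-D array in different orders; under cachegrind, the column-major version shows far more D1 misses for the identical result. The array name and size here are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

#define N 256
static double a[N][N];

/* Row-major traversal: consecutive accesses stay within one cache line,
 * so most of them are D1 hits. */
static double sum_row_major(void) {
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-major traversal: each access jumps N * sizeof(double) bytes,
 * so cachegrind reports far more D1 misses even though the result
 * is exactly the same. */
static double sum_col_major(void) {
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```

Running both versions under `valgrind --tool=cachegrind` is a quick way to see the D1 miss gap before attempting the graded optimizations.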
Implementation requirements
Complete the blank sections in the following files:
- `/src/eyeriss/cpu/improve/hardware_cpu.c`
- `/src/eyeriss/cpu/original/hardware_cpu.c`

Makefiles are also provided under the corresponding directories to compile the library, execute the program, and analyze the performance.
Original Version Implementation Requirements (30%) :
In `/src/eyeriss/cpu/original/hardware_cpu.c`:

- The `conv` and `conv_maxpooling` functions must be implemented using 6 nested for loops to fully cover the computation across output channel, output height, output width, input channel, filter height, and filter width.
- The `linear` and `linear_relu` functions must be implemented using 2 nested for loops, corresponding to matrix-vector multiplication (output dimension × input dimension).

Improved Version Implementation Requirements (30%) :
You can use loop tiling or any method you want to reduce the cache miss rate and cycle count, improving the program without compiler optimizations. We will use cachegrind to simulate the limited cache size.
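As one possible illustration of loop tiling (not the required implementation), the sketch below contrasts a naive matrix-vector product with a tiled variant that walks the input vector in chunks, so each chunk stays resident in the D1 cache while it is reused across every output row. The function names and the tile size are assumptions for demonstration; tune the tile against the simulated cache size.

```c
#include <assert.h>

#define TILE 64 /* assumed tile size; tune against the simulated D1 size */

/* Naive matrix-vector product y = W * x, with W stored row-major
 * as out_dim rows of in_dim floats. */
static void linear_naive(const float *W, const float *x, float *y,
                         int out_dim, int in_dim) {
    for (int i = 0; i < out_dim; i++) {
        float acc = 0.0f;
        for (int j = 0; j < in_dim; j++)
            acc += W[i * in_dim + j] * x[j];
        y[i] = acc;
    }
}

/* Tiled variant: walk x in TILE-sized chunks so a chunk stays resident
 * in the D1 cache while it is reused across all output rows. */
static void linear_tiled(const float *W, const float *x, float *y,
                         int out_dim, int in_dim) {
    for (int i = 0; i < out_dim; i++)
        y[i] = 0.0f;
    for (int jj = 0; jj < in_dim; jj += TILE) {
        int jend = (jj + TILE < in_dim) ? jj + TILE : in_dim;
        for (int i = 0; i < out_dim; i++) {
            float acc = y[i];
            for (int j = jj; j < jend; j++)
                acc += W[i * in_dim + j] * x[j];
            y[i] = acc;
        }
    }
}
```

Both versions produce the same result; the tiled one simply reorders the memory traffic so that each chunk of `x` is loaded once per tile instead of being evicted between rows when `in_dim` exceeds the cache capacity.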
Scoring Criteria for every improved operation:
Cycle Reduction Ratio, defined as:
\[ \frac{\text{Cycle}_{\text{original}} - \text{Cycle}_{\text{improved}}}{\text{Cycle}_{\text{original}}} \]
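As a quick worked example of the ratio: if the original run takes 1,000,000 cycles and the improved run takes 600,000 cycles, then

\[ \frac{1{,}000{,}000 - 600{,}000}{1{,}000{,}000} = 0.4, \]

i.e. a 40% cycle reduction.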
Important
Since the numbers of D refs and D1 misses do not vary with the server's state, these values will be used as a reference during grading. The cycle count, however, will be evaluated with a more lenient standard.
Homework user guide
Makefile Usage
To view the Makefile description, run `make usage` in the `test/cpu` directory. The following is an example demonstration.
When you type in:
The Makefile automatically executes the compilation process and, upon successfully generating the `CONV_original` executable, immediately runs the program and records the result in `CONV_improve.elf.log`. Please record the cycle count from `CONV_improve.elf.log` into the report. The following is the information displayed in the terminal.
E.g.
If the test passes, it will show the following.
Then, use valgrind to simulate the cache and obtain the statistics.
It will record the analysis result in `CONV_improve.elf_cachegrind.log`. Please record the `D1 refs` and `D1 misses` from `CONV_improve.elf_cachegrind.log` into the report. The following is the information displayed in the terminal.
Important
Do not record the cycle count from here into the report.
Supported Commands in Makefile
By analogy:

| Command           | Output                  |
| ----------------- | ----------------------- |
| make o_conv       | CONV_original.elf       |
| make i_conv       | CONV_improve.elf        |
| make o_conv_max   | CONV_MAX_original.elf   |
| make i_conv_max   | CONV_MAX_improve.elf    |
| make o_linear     | LINEAR_original.elf     |
| make i_linear     | LINEAR_improve.elf      |
| make o_linear_relu | LINEAR_RELU_original.elf |
| make i_linear_relu | LINEAR_RELU_improve.elf |

There is also a target that removes all files except for source code files.
Cachegrind check
E.g. 1
It should display information similar to the following.
E.g. 2
It should display information similar to the following.
⚠️ If the cache size does not match the one shown in the image, please enter the following command to correct it.
And then check again
The reason for zero-padding the filter for DLA
One drawback of this hardware design is its limited user-friendliness or intuitiveness.
Submission Guidelines
Deadline
Submission Format
Submissions must follow the specified structure:
Caution
You do not need to submit `hardware/`, `include/`, and `test/`; only `src/` and `report.md` are needed. When creating `StudentID_lab4.zip`, make sure to place everything inside a folder first. This helps keep things organized when the TA unzips it for grading.

Evaluation Environment
Important Notes
Submissions that do not conform to the guidelines may result in score deductions: