In this lab, we will implement an AI compiler flow that bridges high-level machine learning models with low-level C code deployment, targeting CPU execution. By integrating TVM, a machine learning compiler framework, with a code generator and pattern fusion engine, we create an end-to-end system capable of compiling models from frameworks like PyTorch into optimized C code. The compiler serves as a crucial translation and optimization layer, converting models from intermediate representations (Relay) into fused, low-level operator sequences suitable for embedded or resource-constrained environments.
The code generation runtime, which controls this compilation pipeline, leverages modular passes such as operator fusion, pattern recognition, and C function call generation to produce executable code that mimics real hardware accelerator deployment. This abstraction allows us to study inference performance and correctness in a software-only context, simulating hardware-constrained environments.
Moreover, the lab guides us through implementing key components such as Relay operator fusion, model traversal, and shape-aware codegen logic. Through these implementations, we investigate how compiler-level optimizations—like operator fusion and quantization-aware generation—impact inference performance and portability. Ultimately, this lab equips us with practical insights into how AI compilers serve as a critical infrastructure in deploying deep learning models across heterogeneous platforms.
Lab 5.0 - Environment Setup
In the upcoming assignment, TVM will be required. Please follow the instructions below to set up the environment.
Set up the basic environment and download the required configurations.
If you encounter an issue where Conda is not found, it means Conda has not been installed. The installation guide for Miniconda can be found in Lab 0.
We need the CPU version of torchvision, and the TVM version is already installed on the server. Please use the following command to install it to your local conda environment.
After completing the above steps, extract the files from Moodle and place them on the server.
Lab 5.1 - Introduction to AI Compiler
AI compilers enable the deployment of models from high-level frameworks like TensorFlow and PyTorch onto various hardware platforms such as CPUs, GPUs, and AI accelerators by transforming high-level code into low-level executable code.
TVM
One such compiler is TVM, an open-source machine learning compiler framework designed to optimize and deploy deep learning models efficiently across diverse hardware targets. TVM automates the process of translating models into optimized code tailored to specific hardware architectures.
Bring Your Own Codegen (BYOC)
The compilation process begins by converting models from TensorFlow or PyTorch into an Intermediate Representation (IR). In the high-level IR, computations are structured as a computation graph, where each node represents an operation (e.g., matrix multiplication, convolution). This graph is then progressively optimized through multiple stages. Finally, TVM’s code generation (codegen) module translates the optimized IR into low-level C code or other backend-specific code for execution on the target hardware.
The Relay IR precisely records each execution step along with detailed information such as input shape, data type, and more. Once the Relay representation is obtained, optimizations can begin.
In this lab, our objective is to take the input Relay, apply fusion techniques to combine specific operators, and generate a fused Relay. Subsequently, we will perform code generation on the fused Relay to produce the output files marked in green, located along the designated path on the right.
In typical scenarios, TVM's C code generator is implemented as a C++ class that must be registered within TVM's function registry. After registration, TVM needs to be rebuilt in order to trigger the customized code generator through the Python API using relay.build(). However, TVM also offers an alternative design that allows implementing the code generator directly in Python. In this case, the function can be registered using a decorator. It is important to note that such functions must take a Relay model as input and return a string containing the generated code in either C++ or C.
According to the BYOC (Bring Your Own Codegen) framework, in order to produce an executable as part of the standard TVM compilation flow, the custom code generator must conform to the DLPack specification, and data transmission must utilize DLTensor. However, since our approach focuses on an end-to-end code generation flow, we bypass TVM’s generated output files. Instead, we directly invoke our code generator to produce both the model’s C source code and the corresponding binary weight data.
In Lab 4, we implemented the runtime API for the CPU and the driver for the DLA. It's important to note that the operations of these APIs are not purely single operations. Instead, they function more like fused operators within a single function, such as conv2d_relu, conv2d_relu_maxpool, and so on. To handle this, we use TVM to automatically detect patterns from the Relay model graph and fuse these patterns into a single representative node, called a Composite. Next, we annotate these nodes for the specific target (or compiler). Finally, we merge these compiler regions to obtain the Fused Relay model, which is then used by our customized code generator.
Fusing multiple operators helps reduce memory accesses, thereby minimizing data movement and improving performance.
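As a toy, framework-free illustration (not TVM code), fusing two element-wise steps into one loop removes the intermediate buffer between them:

```python
# Toy illustration of operator fusion: the unfused path materializes an
# intermediate buffer between bias_add and relu; the fused path does not.
def bias_add(x, b):
    return [xi + b for xi in x]        # writes one intermediate buffer

def relu(x):
    return [max(0.0, xi) for xi in x]  # reads it back, writes the output

def fused_bias_add_relu(x, b):
    return [max(0.0, xi + b) for xi in x]  # single pass, no intermediate

x = [-2.0, 0.5, 3.0]
print(relu(bias_add(x, 1.0)))        # [0.0, 1.5, 4.0]
print(fused_bias_add_relu(x, 1.0))   # same result, one fewer buffer
```

The same principle applies to conv2d_relu and conv2d_relu_maxpool in the DLA runtime: one fused kernel avoids writing intermediate feature maps back to memory.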
Homework Note
The first task in this assignment is to implement the fusion of different operators in Relay.
Based on the merge_composite_pass function in fuse.py, we need to create a pattern_table to identify subgraphs that match specific patterns. Therefore, our goal here is to properly construct the pattern_table.
To complete the pattern_table, we need to implement several fusion functions defined within it.
Here, let's use fuse_conv2d_bias_add_relu as an example to explain.
aoc2025-lab5/StudentID_lab5/Python/utils/fuse.py
def fuse_conv2d_bias_add_relu():
    # Define the pattern for the operations to be fused
    i = dfp.wildcard()  # Input
    w = dfp.wildcard()  # Weight
    b = dfp.wildcard()  # Bias
    dequantized_i = dfp.is_op("qnn.dequantize")(i, dfp.wildcard(), dfp.wildcard())
    dequantized_w = dfp.is_op("qnn.dequantize")(w, dfp.wildcard(), dfp.wildcard())
    dequantized_b = dfp.is_op("qnn.dequantize")(b, dfp.wildcard(), dfp.wildcard())
    conv2d_op = dfp.is_op("nn.conv2d")(dequantized_i, dequantized_w)
    bias_add_op = dfp.is_op("nn.bias_add")(conv2d_op, dequantized_b)
    relu_op = dfp.is_op("nn.relu")(bias_add_op)
    quantize_op = dfp.is_op("qnn.quantize")(relu_op, dfp.wildcard(), dfp.wildcard())
    cast_op = dfp.is_op("cast")(quantize_op)  # Assuming requantize is a cast operation
    return cast_op
Using fuse_conv2d_bias_add_relu() as an example, note that wildcards in TVM represent patterns that can match any Relay expression.
First, we use wildcards to represent the input, weight, and bias tensors.
Next, we sequence the operations in the following order: conv2d, followed by bias_add, and then relu.
After applying these operations, we quantize the result back to the original data type. Finally, we cast the data to meet the hardware requirements before returning the final output.
The following diagram illustrates the structure of the pattern:
Following the fusion and annotation of the model subgraph, the subsequent step involves generating customized C code aligned with the target ASIC driver and its corresponding API.
Homework Note
The TAs have already marked the sections where your work is needed. Please complete the indicated parts.
Lab 5.3 - Integration
To aid understanding, the diagram below depicts the full function call path for the code generation and data generation process as implemented in this lab.
Overview of C codegen in Python version
In this lab, we implement a lightweight C code generator using Python, allowing for faster prototyping without the need to recompile TVM. This Python-based flow simplifies the integration of customized code generation logic while maintaining compatibility with the Relay model structure.
The codegen process is organized into three core components:
Python script path
aoc2025-lab5/StudentID_lab5/Python/utils/*.py
codegen.py: Responsible for generating the full C source code required for model inference. This includes emitting function declarations and layer-wise computations, as well as embedding model weights.
datagen.py: Handles the transformation of sample input datasets into a runtime-friendly binary format. A lightweight header is added to assist with input parsing during execution.
note.py: Serves as a configuration and pattern-matching module. It defines wildcard variable names for pattern recognition and maps fused composite functions to their corresponding C code templates.
This modular design not only increases code readability and reusability but also separates concerns clearly, making it easier to maintain and extend the system for different target hardware or model structures.
digraph G {
node [shape=box, style=filled, fillcolor=lightgray];
subgraph cluster_outer1{
label = "build_model.py";
style=filled;
color=black; // Border color
penwidth=1; // Makes the border more visible
fillcolor=lightyellow;
node [style=filled, fillcolor=white];
subgraph cluster_inner1{
label = "build_model()";
style=filled;
color=black; // Border color
penwidth=1; // Makes the border more visible
fillcolor=white;
node [style=filled, fillcolor=white];
relay_build [label="relay.build()"];
}
dataset_gen [label="dataset_gen"];
}
subgraph cluster_outer2{
label = "datagen.py";
style=filled;
color=black; // Border color
penwidth=1; // Makes the border more visible
fillcolor=lightyellow;
node [style=filled, fillcolor=white];
subgraph cluster_inner2{
label = "Dataset_Generator";
style=filled;
color=black; // Border color
penwidth=1; // Makes the border more visible
fillcolor="#E6E6FA";
node [style=filled, fillcolor=white];
gen_bin [label="gen_input_h()"];
gen_float_array [label="gen_float_array()"];
fetch_data [label="fetch_data()"];
}
}
Compiler_space [label="Compiler Space"];
subgraph cluster_outer3{
label = "codegen.py";
style=filled;
color=black; // Border color
penwidth=1; // Makes the border more visible
fillcolor=lightyellow;
node [style=filled, fillcolor=white];
DLA_compiler [label="DLA_compiler()", fillcolor=lightblue];
subgraph cluster_inner1{
label = "CsourceCodegen";
style=filled;
color=black; // Border color
penwidth=1; // Makes the border more visible
fillcolor="#E6E6FA";
node [style=filled, fillcolor=white];
create_c_source_module [label = "create_c_source_module()"];
get_c_func [label = "get_c_func()"];
}
subgraph cluster_inner2{
label = "Ccodegen";
style=filled;
color=black; // Border color
penwidth=1; // Makes the border more visible
fillcolor="#E6E6FA";
node [style=filled, fillcolor=white];
jit [label = "jit()"];
visit_expr [label = "visit_expr()"];
}
}
subgraph cluster_outer4{
label = "note.py";
style=filled;
color=black; // Border color
penwidth=1; // Makes the border more visible
fillcolor=lightyellow;
node [style=filled, fillcolor=white];
tvm_auto_args_NOTES [label="tvm_auto_args_NOTES {}", fillcolor=lightblue];
tvm_c_func_call_gen [label="tvm_c_func_call_gen {}", fillcolor=lightblue];
}
get_c_func -> jit;
get_c_func -> visit_expr;
visit_expr -> visit_expr;
DLA_compiler -> get_c_func;
DLA_compiler -> create_c_source_module;
dataset_gen -> gen_bin;
DLA_compiler -> Compiler_space [label=" (1) Register", style=dashed];
relay_build -> Compiler_space [label=" (2) Find", style=dashed];
Compiler_space -> relay_build [label=" (3)",style=dashed];
relay_build -> DLA_compiler [label=" (4) Call", style=dashed];
tvm_auto_args_NOTES -> visit_expr [label=" Get tensor info", style=dashed];
tvm_c_func_call_gen -> visit_expr [label=" Get codegen function", style=dashed];
}
TVM Relay External Codegen: C Backend Walkthrough - codegen.py
This part provides an explanation of the implementation for an external code generator in TVM targeting C. It demonstrates how to lower a Relay composite function into C code and generate a runtime-loadable module.
Module Overview and Imports
import tvm
import os
import numpy as np
from .fuse import COMPILER_NAME
from .fuse import pattern_table
from .note import *
Here, standard TVM libraries are imported, along with local modules for pattern matching and annotations. The COMPILER_NAME defines the custom compiler target name used by TVM.
Data Structures: Output and Data
Output
class Output(dict):
...
Output is used to store metadata for generated buffers or variables, such as name, data type, copy requirements, and size.
Data
class Data(dict):
...
Data stores information about constants, including their data content and structural metadata.
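The exact fields are defined in codegen.py; the following is only a minimal sketch of the idea, assuming a dict subclass with attribute access (the field names are taken from how Output is constructed and read later in this walkthrough):

```python
# Sketch only: a dict subclass whose keys are also attributes, matching
# how the walkthrough later writes Output(name=..., dtype=...,
# need_copy=..., size=...) and reads p.name / p.size / p.dtype.
class Output(dict):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.__dict__ = self  # expose keys as attributes

out = Output(name="buf_0", dtype="float", need_copy=True, size=1024)
print(out.name, out["size"])  # buf_0 1024
```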
Abstract Codegen Base Class
class CodegenCBase(ABC):
...
This base class provides shared logic for all C-based code generators:
Code string construction with indentation
Scope management for nested code
Wrapper function generation (generate_backend_c_func)
Runtime entry point generation (jit_impl)
The output consists of both an internal kernel function and a wrapper conforming to TVM's external runtime interface.
CodegenC: Core Relay to C Lowering
class CodegenC(CodegenCBase):
...
This class handles the traversal of the Relay IR and emits C code. It supports common Relay expressions including Call, Var, Tuple, Constant, and TupleGetItem.
First, obtain which Composite function this Call belongs to. Next, replace "." with "_" in composite_name to convert it into a C-compatible function name. Finally, retrieve the input shape of this Call, and check whether the composite exists in our pattern_table to avoid encountering unsupported functions.
Next, iterate through all parameters of the Call, checking whether each parameter is a constant (Constant) or a variable (Variable). Store these parameters for later use in generating C code.
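For instance, the composite-name conversion mentioned above is a plain string replace (the composite name here is illustrative, not from the lab):

```python
# Convert a Relay composite name into a C-compatible function name
# (the composite name below is an example).
composite_name = "dla.qconv2d_relu"
func_name = composite_name.replace(".", "_")
print(func_name)  # dla_qconv2d_relu
```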
Please complete the implementation of the trace parts, ensuring that operator information is correctly extracted and the corresponding C code is generated.
# ------------------------------------------------------------
# TODO 1: Trace parameters
# ------------------------------------------------------------
# For each argument in call.args, determine:
#   - its mapped name in tvm_auto_args_NOTES[func_name]
#   - whether it's a Constant or not
#   - if not a constant, use `self.visit_expr(arg)` to visit it
# Then fill the `parameters` dict like:
#   parameters["input"] = (value, is_const)
#
# Hint:
#   - Use zip(call.args, tvm_auto_args_NOTES[func_name])
#   - Use isinstance(arg, Constant)
parameters = dict()
raise NotImplementedError("Please implement argument tracing into the 'parameters' dictionary.")
# fetch function generator
func_gen = tvm_c_func_call_gen.get(func_name, None)
if not func_gen:
    return parameters["input"]  # if the function generator does not exist, just bypass the input
Next, set up the output buffer and store the parameters and values into the config for easy access later. Note that buf_idx should be incremented by 1 each time to ensure there are no duplicate buffers. Once the buffers are set, use get_conv_info(call) to retrieve the configuration and store it in the config.
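The buffer-naming idea can be sketched like this (BUF_PREFIX is assumed to be something like "buf_"; the authoritative logic is what you implement in codegen.py):

```python
# Sketch of unique output-buffer naming: incrementing buf_idx after each
# allocation guarantees no duplicate buffer names. BUF_PREFIX is assumed.
BUF_PREFIX = "buf_"

class NameGen:
    def __init__(self):
        self.buf_idx = 0

    def new_buffer(self):
        name = f"{BUF_PREFIX}{self.buf_idx}"
        self.buf_idx += 1  # never reuse an index
        return name

g = NameGen()
print(g.new_buffer(), g.new_buffer())  # buf_0 buf_1
```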
# output buffer
# ------------------------------------------------------------
# TODO 2: Create output buffer
# ------------------------------------------------------------
# You need to:
#   - Generate a new buffer name using `BUF_PREFIX` and self.buf_idx
#   - Get the output buffer size: self.get_size(call)
#   - Get the output buffer dtype: self.get_dtype_string(call.checked_type)
#
# You should generate a line like:
#   float* out_0 = (float*)malloc(size * 4);
#
# Output:
#   - out      -> output buffer name
#   - out_size -> total number of elements
#   - dtype    -> C-style data type
raise NotImplementedError("Please implement output buffer allocation logic.")
### gather the parameters that we need
# conv2d Op info
if "conv2d" in func_name:
    config = self.get_conv_info(call)
    config["C"] = in_shape[1]
    config["H"] = in_shape[2]
    config["W"] = in_shape[3]
else:
    config = dict()
# wildcard info
for k in ["input", "weight", "bias"]:
    # default
    config[k] = None
    config[f"{k}_len"] = None
    config[f"{k}_dtype"] = None
    # get parameter
    param = parameters.get(k, None)
    if param is None:
        continue
    # unpack
    p, is_const = param
    if p is None:
        continue
    # if it is a constant, we can visit it now
    if is_const:
        p = self.visit_constant(p)[0]
    config[k] = p.name
    config[f"{k}_len"] = p.size
    config[f"{k}_dtype"] = p.dtype
config["output"] = out
config["output_len"] = out_size
# convert quantize scale
for k, (v, is_const) in parameters.items():
    if "scale" in k and is_const:
        n = v.data.numpy()
        config[k] = n[0] if n.ndim == 1 else n
# malloc output buffer
buf_create = f"{dtype}* {out} = ({dtype}*)malloc({out_size * 4});"
self.ext_func_body.append(buf_create)
# generate c function
self.ext_func_body.append("".join(func_gen(config)))
# free input buffer
p, _ = parameters["input"]
if BUF_PREFIX in p.name:
    buf_free = f"free({p.name});"
    self.ext_func_body.append(buf_free)
output = Output(name=out, dtype=dtype, need_copy=True, size=out_size)
return [output]
Homework Note
The TAs have already marked the sections where your work is needed. Please complete the indicated parts.
Extracting Convolution Information
To extract convolution-related parameters (e.g., padding, kernel size, and number of output channels), the get_conv_info function performs a Breadth-First Search (BFS) over the body of the composite function. It starts by initializing a conv2d_info dictionary with default values and adds the top-level operation (call.op.body) to the op_list. As it iterates through this list, it checks if each node is a Call and whether its operator name is "nn.conv2d". If a convolution is found, it extracts and stores the relevant attributes—such as padding, channels, and kernel_size—into the conv2d_info dictionary. This approach ensures that deeply nested operators inside fused composite patterns are correctly identified and their parameters recorded for downstream code generation.
def get_conv_info(self, call):
    op_list = [call.op.body]
    conv2d_info = {
        "m": "DEFAULT_m",
        "e": "DEFAULT_e",
        "p": "DEFAULT_p",
        "q": "DEFAULT_q",
        "r": "DEFAULT_r",
        "t": "DEFAULT_t",
        "U": 1
    }
    # BFS
    while len(op_list) > 0:
        op = op_list.pop(0)
        # ------------------------------------------------------------
        # TODO: Extract conv2d attributes from the op node
        # ------------------------------------------------------------
        # If the op is nn.conv2d:
        #   - Extract and store:
        #     - padding               -> conv2d_info["PAD"]
        #     - channels              -> conv2d_info["M"]
        #     - kernel size (0 and 1) -> conv2d_info["R"], conv2d_info["S"]
        #   - Also set conv2d_info["m"] = conv2d_info["M"]
        #
        # Hint:
        #   - Use op.op.name to check for "nn.conv2d"
        #   - Use op.attrs["padding"], op.attrs["channels"], etc.
        #   - You can assume padding and kernel_size are lists/tuples.
        #
        # Example:
        #   conv2d_info["PAD"] = op.attrs["padding"][0]
        # When done, remove the following line
        raise NotImplementedError("You need to extract attributes for nn.conv2d node.")
        for node in op.args:
            if isinstance(node, Call):
                op_list.append(node)
    return conv2d_info
This is essential for generating hardware-aware kernel configurations.
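To make the traversal concrete, here is a TVM-free sketch of the same BFS idea over a mock call tree (Node is a stand-in for relay.Call; the attribute names follow the TODO above):

```python
# TVM-free sketch: walk a nested call tree breadth-first and record the
# attributes of the first "nn.conv2d" node found.
from collections import namedtuple

Node = namedtuple("Node", ["name", "attrs", "args"])

def find_conv_attrs(root):
    queue = [root]
    while queue:
        op = queue.pop(0)
        if op.name == "nn.conv2d":
            return {"PAD": op.attrs["padding"][0],
                    "M": op.attrs["channels"],
                    "R": op.attrs["kernel_size"][0],
                    "S": op.attrs["kernel_size"][1]}
        queue.extend(op.args)   # descend into nested calls
    return {}

conv = Node("nn.conv2d",
            {"padding": [1, 1], "channels": 64, "kernel_size": [3, 3]},
            [])
relu = Node("nn.relu", {}, [conv])
print(find_conv_attrs(relu))  # {'PAD': 1, 'M': 64, 'R': 3, 'S': 3}
```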
Homework Note
The TAs have already marked the sections where your work is needed. Please complete the indicated parts.
This function is registered under the name relay.ext.<compiler> and invoked automatically when TVM detects a composite pattern assigned to the external compiler.
Output Files
Upon execution, the following files are written to ./output/:
code.c: Core runtime logic
weight.c: Constant data arrays
weight.h: Header declarations for constants
weight.bin: Serialized constant tensors for loading at runtime
Dataset Preprocessing for C Deployment - datagen.py
In this section, we walk through the implementation of datagen.py, which prepares dataset samples for use in embedded C environments. Its primary function is to retrieve images from a test dataset and convert them into .h (C header) or .bin (binary) files, making them portable for inference testing on microcontrollers or other low-level platforms.
Note
Please copy and paste the provided code into your own datagen.py file. You do not need to write it from scratch.
Data Processing Class Overview
We start by defining a class called Dataset_Generator, which handles the core data preprocessing logic. Its responsibilities include:
Loading the test dataset
Extracting a specified number of test samples per category
The fetch_data() method extracts a fixed number of samples (num_data_per_class) from each category in the test dataset and organizes them into a dictionary for later processing.
def fetch_data(self, num_data_per_class):
    self.classes = self.test_dataset.classes
    data_dict = dict()
    for idx, c in enumerate(self.classes):
        data_dict[c] = []
        for img, y in self.testloader:
            if idx == y:
                data_dict[c].append(img)
            if len(data_dict[c]) >= num_data_per_class:
                break
    return data_dict
Generating .bin Files
The gen_bin() method saves the extracted image data to a .bin file in a format suitable for embedded inference. The binary structure includes:
Total number of classes
Each class name and its length (encoded in UTF-8)
Number of samples per class
Flattened image size
The actual image data in float32 format
def gen_bin(self, output_path, num_data_per_class=10):
    data_dict = self.fetch_data(num_data_per_class=num_data_per_class)
    with open(output_path, "wb") as f:
        num_classes = len(self.classes)
        f.write(struct.pack("I", num_classes))  # Total number of classes
        for class_name in self.classes:
            encoded_name = class_name.encode('utf-8')
            name_length = len(encoded_name)
            f.write(struct.pack("I", name_length))
            f.write(encoded_name)
        first_data_shape = list(data_dict.values())[0][0].shape
        flattened_size = np.prod(first_data_shape)
        f.write(struct.pack("I", num_data_per_class))
        f.write(struct.pack("I", flattened_size))
        for values in data_dict.values():
            for value in values:
                np_array = value.numpy().astype('float32')
                f.write(np_array.tobytes())
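To see the header layout in isolation, here is a self-contained round-trip of the format described above, using an in-memory buffer and toy class names (all counts are uint32 via struct.pack("I", ...)):

```python
# Write then re-parse the gen_bin()-style header: class count, then
# (name length, UTF-8 name) pairs, then samples per class and image size.
import struct, io

buf = io.BytesIO()
buf.write(struct.pack("I", 2))            # number of classes
for name in ["cat", "dog"]:
    enc = name.encode("utf-8")
    buf.write(struct.pack("I", len(enc))) # name length
    buf.write(enc)                        # name bytes
buf.write(struct.pack("I", 10))           # samples per class
buf.write(struct.pack("I", 3 * 32 * 32))  # flattened image size

buf.seek(0)
(num_classes,) = struct.unpack("I", buf.read(4))
classes = []
for _ in range(num_classes):
    (n,) = struct.unpack("I", buf.read(4))
    classes.append(buf.read(n).decode("utf-8"))
print(num_classes, classes)  # 2 ['cat', 'dog']
```

The C runtime reads the same fields in the same order when loading input.bin.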
TVM Function Call Generation for Quantized Inference - note.py
This module automates the generation of C-style function call code for the TVM (Tensor Virtual Machine) compiler. It targets quantized deep learning models and supports operations such as:
Quantized Convolution (QConv2D)
Quantized Linear Layer (QLinear)
Quantization / Dequantization
Variants like ReLU and MaxPooling fused with other ops
tvm_auto_args_NOTES: Parameter Schemas for TVM Operations
This dictionary defines the expected argument names for each supported TVM function. These names will be used for:
Identifying input/output tensors
Associating quantization parameters (scales, zero points)
Mapping configurations for code generation
For example, the qconv2d_relu_maxpool function requires:
Input, Weights, Biases (quantized)
Quantization/Dequantization scales and zero points
Dequantize/Quantize metadata for rescaling output
convert_log: Helper for Quantization Scaling
def convert_log(x):
    return -int(math.log2(x))
This helper function calculates -log2(x) as an integer. It’s commonly used to transform floating-point scaling factors into integer log scale shifts, which are more efficient for fixed-point inference on hardware accelerators or microcontrollers.
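For example, a power-of-two scale of 1/8 maps to a right-shift of 3 bits:

```python
import math

def convert_log(x):
    return -int(math.log2(x))

# 0.125 = 2^-3, so the equivalent fixed-point operation is a shift by 3
print(convert_log(0.125))  # 3
print(convert_log(0.5))    # 1
```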
tvm_c_func_call_gen: C Code Generators per Operator
This dictionary maps each TVM operation to a corresponding C-language function call template, dynamically filled using the config dictionary.
Each lambda returns a formatted multi-line C function call string, ready to be inserted into generated code. These templates can target:
CPU-only versions (*_cpu)
Normal hardware-accelerated (DLA) versions
Example Output:
qconv2d_relu_maxpool(
    input, weight, output,
    bias, output_len, input_len, weight_len,
    // mapping parameter
    m, e, p, q, r, t,
    // shape parameter
    PAD, U, R, S,
    C, M, W, H,
    // quantize scale
    convert_log(input_scale * weight_scale / dequantize_scale)
);
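The dictionary entries behave like small string templates. A simplified, hypothetical sketch of one entry (the real keys and signatures are defined in note.py):

```python
# Hypothetical sketch: each entry maps an operator name to a lambda that
# formats a C call from the config dict. Names and keys are illustrative.
tvm_c_func_call_gen = {
    "dla_qconv2d_relu": lambda cfg: (
        f"qconv2d_relu({cfg['input']}, {cfg['weight']}, {cfg['output']}, "
        f"{cfg['shift']});"
    ),
}

cfg = {"input": "buf_0", "weight": "w_1", "output": "buf_1", "shift": 3}
print(tvm_c_func_call_gen["dla_qconv2d_relu"](cfg))
# qconv2d_relu(buf_0, w_1, buf_1, 3);
```

The codegen visitor looks up the fused function name in this dictionary and appends the returned string to the generated C function body.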
Once all Python scripts are completed, the build_model logic will automatically invoke the appropriate code generation logic from tvm_c_func_call_gen, completing the full compilation and code-emission process for quantized models.
Lab 5.4 Performance Analysis
In this lab, we will explore tools that help analyze and visualize memory usage during program execution. We will use Valgrind’s Massif toolset to profile a simple C program that recursively prints Fibonacci numbers.
massif: Heap Memory Profiler
Massif is a heap profiler provided by Valgrind. It helps track memory allocations, identify memory peaks, and analyze memory usage over time.
To use Massif, run the following command in the lab directory:
make massif
This will execute the program massif_test, which prints a list of Fibonacci numbers using a recursive function.
Massif will trace memory usage at runtime; the make target wraps the underlying Valgrind invocation for you.
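The underlying invocation is typically of this general form (the exact flags are an assumption; check the lab Makefile for the authoritative command):

```shell
# Generic Massif invocation (flags are an assumption, not from the lab)
valgrind --tool=massif --massif-out-file=massif.out.massif_test ./massif_test
# Summarize the resulting output file on the command line
ms_print massif.out.massif_test
```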
massif-visualizer is a graphical interface for visualizing Massif output, making it easier to analyze memory usage trends interactively.
To use massif-visualizer:
Ensure X11 forwarding is enabled if you're using SSH.
Launch the tool:
massif-visualizer
Open the output file:
Click Open…
Choose massif.out.massif_test
View the memory usage graph:
This visualization helps you easily pinpoint memory-intensive regions and track memory growth over time.
Clean Up
After completing the lab, clean the generated output files by running:
make clean
This removes all intermediate files and prepares the environment for a fresh start.
Homework Requirements
In this lab, you learned how to build a complete AI compiler flow that translates high-level machine learning models into low-level C code suitable for embedded and CPU-only environments. You explored key compiler components such as Relay operator fusion, external code generation, and performance analysis using memory profiling tools like valgrind, massif, and ms_print.
In addition, you will implement and complete several essential modules of the compiler pipeline, including:
Writing pattern-matching functions for operator fusion in fuse.py.
Completing code generation logic in note.py for different quantized operators.
Generating appropriate function calls based on Relay transformations.
Performing memory profiling on recursive functions using Massif, and interpreting the results through both CLI (ms_print) and GUI (massif-visualizer).
These tasks are designed to help you understand the end-to-end compilation and deployment process of deep learning models, and how software optimization maps to hardware-aware execution models.
By the end of this assignment, you will have hands-on experience with:
Translating Relay models into fused representations
Implementing target-specific code generators
Profiling and analyzing memory usage at runtime
Make sure to follow the implementation notes and fill in the TODOs as marked by the TAs in each respective file.
Prerequisites
Before starting the lab, make sure the following setup steps are completed:
Download and extract the lab materials Get the sample code and report template from Moodle, then decompress the archive:
unzip aoc2025-lab5.zip
Activate the lab environment Ensure that you are working in the correct Conda environment for this lab:
conda activate tvm-lab
Directory Structure
After unzipping the file downloaded from Moodle, the directory structure will look like the one below:
datagen.py: Please copy the contents of datagen.py from section Lab 5.3 into your own datagen.py file.
note.py: tvm_c_func_call_gen dictionary
tvm_c_func_call_gen = {
...
}
Once you have completed the marked sections above, you can proceed to execute the following steps in order.
Build the model
make build_model
This command will:
Load the model from ONNX format and convert it to a Relay model.
Parse and fuse the Relay model, and dump the Relay model expression to a text file (used in visuTVM).
Build the model in the Python script, including traversing the composited model, C code generation, and weight generation.
Parse the CIFAR-10 dataset and convert it to a custom input binary file.
Extract the tar generated by TVM and categorize its contents into different folders. The process traverses the whole model in DFS order; note that the trace flow runs from the output layer back to the input layer.
visuTVM: Relay Graph Visualizer
visuTVM is a tool used to visualize the structure of a TVM Relay model graph, helping you better understand model transformations during compilation.
To generate visualizations of the Relay graph:
make visuTVM
This command produces two SVG images representing the Relay graph:
./output/visu_VGG8_relay_ir.svg: The original Relay IR (before the MergeComposite pass)
./output/visu_VGG8_relay_ir_pass.svg: The Relay IR after pattern fusion and annotation passes
Submission
Include both .svg images in your report.md to illustrate the changes before and after fusion.
HW 5.2 Simulation and Performance Analysis
In this task, you will analyze the memory usage and runtime performance of the CPU-only version of your model using Massif, a heap memory profiler from Valgrind. Additionally, you will utilize DLA info counters—provided in the Lab 4 runtime library—to evaluate the behavior and efficiency of the simulated accelerator.
This dual analysis allows you to compare software-based and hardware-like execution, providing deeper insights into memory bottlenecks and inference performance.
Inference model with CPU-only
For a quick demo and test of the CPU version:
make test_cpu
You will get a single-shot inference of the full model using the CPU-only runtime API.
CC weight.o
CC input.o
CC utils.o
CC runtime_cpu.o
CC hardware_cpu.o
CXX main.o
CXX model.o
LD main
make[1]: Leaving directory '/home/aoc2025/n26130605/work/lab5/testbench/cpu'
/home/aoc2025/n26130605/work/lab5
Running program...
make[1]: Entering directory '/home/aoc2025/n26130605/work/lab5/testbench/cpu'
mkdir -p log
Run test
===============[ single test ]===============
Input file: ../../output/bin/input.bin
Weight file: ../../output/bin/weight.bin
Class index: 4
Image index: 9
=============================================
Image Test: 9/10 image class deer
=============================================
[ airplane] 5.203%
[ automobile] 0.058%
[ bird] 0.621%
[ cat] 0.333%
[ deer] 20.578%
[ dog] 0.484%
[ frog] 0.058%
[ horse] 71.826%
[ ship] 0.090%
[ truck] 0.750%
=============================================
make[1]: Leaving directory '/home/aoc2025/n26130605/work/lab5/testbench/cpu'
For more configuration options when compiling the CPU-only runtime, move into testbench/cpu, then run make usage for details.
cd testbench/cpu
make usage
Usage: make [target]
Available targets:
all - Build the project (default target)
test [CLASS][INDEX] - Run the compiled executable with test input
valgrind [CLASS][INDEX] - Run Valgrind Massif to analyze memory usage
test_full - Run with 100 test inputs
valgrind_full - Run Valgrind Massif with 100 test inputs
clean - Remove all generated files
Environment Variables:
CLASS=<num> - Set class index for testing (default: 4)
INDEX=<num> - Set test index (default: 9)
Note that you must run make clean before applying any new configuration.
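For example, to rerun a test on a different image after changing the configuration (targets and variables as listed in the usage output above):

```shell
make clean                 # required before any new configuration
make test CLASS=4 INDEX=9  # class and image index from the usage defaults
```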
This model is pre-trained by the TA. After quantization, it achieves 88.xx% accuracy on the CIFAR-10 dataset, so the results seen in this simulation are reasonable.
Memory Usage Analysis with Massif
make valgrind and make valgrind_full execute the same tests as make test, but with enhanced memory tracking during runtime.
These commands utilize the Valgrind Massif tool to monitor and trace memory usage, saving the output in the massif_out/ directory.
To visualize and analyze memory usage over time, use massif-visualizer to open the generated output files.
Submission
Screenshot the test_full result matrix, including Accuracy.
Screenshot the massif-visualizer memory graph result, and record the peak memory usage in the report.
Inference model with DLA
For the DLA version test and demo, running make test_dla in the top directory performs a single-shot simulation on the Eyeriss ASIC.
This model was pre-trained by the TA, so occasional misclassifications can occur. For example, in the case above, the model mistakenly identified a deer as a horse.
For more information and configuration options for DLA compilation and inference, move into testbench/dla and run make usage for details.
cd testbench/dla
make usage
Usage: make [target]
Available targets:
all [DEBUG=?][DLA_INFO=?][USE_VCD=?] - Build the project (default target)
test [CLASS=<num>][INDEX=<num>] - Run the compiled executable with test input
clean - Remove all generated files
nWave - Launch nWave with logs
Environment Variables:
DEBUG=1 - Enable debug mode
DLA_INFO=1 - Enable DLA info logs
USE_VCD=1 - Enable VCD dumping
CLASS=<num> - Set class index for testing (default: 4)
INDEX=<num> - Set test index (default: 9)
Important
Note that the DLA version does not support test_full, because simulating 100 images takes more than an hour, even on a fast machine.
DLA runtime analysis
To enable this feature, set the environment variable DLA_INFO=1 before running make test. After the test completes, a CSV file will be generated containing statistics and parameters for each task assigned to the DLA.
Submission
Run inference on a single image using DLA simulation, and export the statistics as a CSV file. Use the data to fill in the provided sheet and generate a bar chart of per-layer statistics in report.md. The statistics should include cycle count, time and memory read/write count for each layer.
Compare the results with the predicted values from Lab 2. Do the statistics align with the predictions? If there are discrepancies, where might estimation and actual experimental results diverge? (This is an open-ended question—any reasonable explanation is acceptable.)
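To turn the generated CSV into the per-layer chart, a short script can aggregate the rows. This is only a sketch: the actual column names produced with DLA_INFO=1 may differ, so layer, cycles, reads, and writes below are hypothetical placeholders to adapt to the real file.

```python
import csv
import io

# Hypothetical CSV resembling the DLA_INFO output; real column names may differ.
raw = """layer,cycles,reads,writes
conv1,12000,3400,1200
conv2,45000,9800,4100
fc1,8000,2100,900
"""

rows = list(csv.DictReader(io.StringIO(raw)))
per_layer = {row["layer"]: int(row["cycles"]) for row in rows}
total_cycles = sum(per_layer.values())
print(per_layer)
print(total_cycles)
```

The same dictionary can then be fed to any plotting tool to produce the bar chart for report.md.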
VS Code Extension - Markdown Preview Mermaid
This extension allows you to preview Mermaid diagrams directly in Markdown files.
Clean Up
To remove output logs and executables in a specific module, run make clean inside testbench/cpu or testbench/dla.
To clean up everything, including generated code and testbench executables, run make clean in the root directory of the lab project.
Submission Guidelines
Deadline
May 19, 2025 at 23:59:59
Late submissions will not be accepted
Submission Format
Submissions must follow the specified structure:
Caution
Do not submit other files; only the Python/ directory and report.md are needed.
Before zipping your files into StudentID_lab5.zip, make sure to place everything inside a folder first. This helps keep things organized when the TA unzips it for grading.
StudentID_lab5 ----> create this folder, and then compress it as StudentID_lab5.zip
├── Python
│ └── utils
│ ├── __init__.py
│ ├── codegen.py
│ ├── datagen.py
│ ├── fuse.py
│ ├── note.py
│ └── utils.py
├── images ----> create to store images in report.md as you need
└── report.md
Important Notes
Submissions that do not conform to the guidelines may incur score deductions:
Incomplete submissions that prevent reproduction of results will be treated as incorrect answers, and no points will be awarded for those sections.
Incorrect filenames that do not impact grading will incur a 5% deduction.
Missing files will result in an additional 10% deduction.
After completing the above steps, extract the files from Moodle and place them on the server.
Lab 5.1 - Introduction to AI Compiler
AI compilers enable the deployment of models from high-level frameworks like TensorFlow and PyTorch onto various hardware platforms such as CPUs, GPUs, and AI accelerators by transforming high-level code into low-level executable code.
TVM
One such compiler is TVM, an open-source machine learning compiler framework designed to optimize and deploy deep learning models efficiently across diverse hardware targets. TVM automates the process of translating models into optimized code tailored to specific hardware architectures.
Bring Your Own Codegen (BYOC)
The compilation process begins by converting models from TensorFlow or PyTorch into an Intermediate Representation (IR). In the high-level IR, computations are structured as a computation graph, where each node represents an operation (e.g., matrix multiplication, convolution). This graph is then progressively optimized through multiple stages. Finally, TVM’s code generation (codegen) module translates the optimized IR into low-level C code or other backend-specific code for execution on the target hardware.
For more information about BYOC, see How to Bring Your Own Codegen to TVM
Question: What will Relay look like?
After being converted by TVM, the high-level Relay representation will look like this.
It will precisely record each execution step along with detailed information, such as input shape, data type, and more. Once the Relay representation is obtained, optimizations can begin.
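As a rough illustration (not the exact dump for this lab's model), a printed Relay module looks something like the fragment below; the variable names and attributes are made up for this example, and the lab's quantized VGG8 dump additionally contains qnn.quantize/qnn.dequantize nodes:

```
def @main(%data: Tensor[(1, 3, 32, 32), float32]) {
  %0 = nn.conv2d(%data, %conv1_weight, padding=[1, 1, 1, 1], channels=64, kernel_size=[3, 3]);
  %1 = nn.bias_add(%0, %conv1_bias);
  %2 = nn.relu(%1);
  nn.max_pool2d(%2, pool_size=[2, 2], strides=[2, 2])
}
```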
In this lab, our objective is to take the input Relay, apply fusion techniques to combine specific operators, and generate a fused Relay. Subsequently, we will perform code generation on the fused Relay to produce the output files marked in green, located along the designated path on the right.
In typical scenarios, TVM's C code generator is implemented as a C++ class that must be registered within TVM's function registry. After registration, TVM needs to be rebuilt in order to trigger the customized code generator through the Python API using relay.build(). However, TVM also offers an alternative design that allows implementing the code generator directly in Python. In this case, the function can be registered using a decorator. It is important to note that such functions must take a Relay model as input and return a string containing the generated code in either C++ or C.

According to the BYOC (Bring Your Own Codegen) framework, in order to produce an executable as part of the standard TVM compilation flow, the custom code generator must conform to the DLPack specification, and data transmission must utilize DLTensor. However, since our approach focuses on an end-to-end code generation flow, we bypass TVM's generated output files. Instead, we directly invoke our code generator to produce both the model's C source code and the corresponding binary weight data.
Lab 5.2 - Optimization
Operator Fusion
In Lab 4, we implemented the runtime API for the CPU and the driver for the DLA. It's important to note that these APIs are not purely single operations. Instead, they behave like fused operators within a single function, such as conv2d_relu, conv2d_relu_maxpool, and so on. To handle this, we use TVM to automatically detect patterns in the Relay model graph and fuse each pattern into a single representative node, called a Composite. Next, we annotate these nodes for the specific target (or compiler). Finally, we merge these compiler regions to obtain the Fused Relay model, which is then used by our customized code generator.

Fusing multiple operators helps reduce memory accesses, thereby minimizing data movement and improving performance.
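The memory-traffic argument can be seen in a toy NumPy sketch (illustrative only, not the lab's kernels): the unfused version materializes an intermediate tensor that the ReLU pass must re-read, while the fused version produces the same result in a single pass.

```python
import numpy as np

x = np.arange(-8, 8, dtype=np.float32).reshape(4, 4)
scale = np.float32(2.0)  # stand-in for a conv/linear kernel

# Unfused: the intermediate `pre` is written to memory, then re-read by ReLU.
pre = x * scale
unfused = np.maximum(pre, 0.0)

# "Fused": one traversal, no intermediate tensor kept around.
fused = np.maximum(x * scale, 0.0)

assert np.allclose(unfused, fused)  # same math, less data movement
```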
Homework Note
The first task in this assignment is to implement the fusion of different operators in Relay.
Based on the merge_composite_pass function in fuse.py, we need to create a pattern_table to identify subgraphs that match specific patterns. Therefore, our goal here is to properly construct the pattern_table.

aoc2025-lab5/StudentID_lab5/Python/utils/fuse.py

To complete the pattern_table, we need to implement several fusion functions defined within it.
Here, let's use fuse_conv2d_bias_add_relu as an example to explain.

aoc2025-lab5/StudentID_lab5/Python/utils/fuse.py

```python
def fuse_conv2d_bias_add_relu():
    # Define the pattern for the operations to be fused
    i = dfp.wildcard()  # Input
    w = dfp.wildcard()  # Weight
    b = dfp.wildcard()  # Bias
    dequantized_i = dfp.is_op("qnn.dequantize")(i, dfp.wildcard(), dfp.wildcard())
    dequantized_w = dfp.is_op("qnn.dequantize")(w, dfp.wildcard(), dfp.wildcard())
    dequantized_b = dfp.is_op("qnn.dequantize")(b, dfp.wildcard(), dfp.wildcard())
    conv2d_op = dfp.is_op("nn.conv2d")(dequantized_i, dequantized_w)
    bias_add_op = dfp.is_op("nn.bias_add")(conv2d_op, dequantized_b)
    relu_op = dfp.is_op("nn.relu")(bias_add_op)
    quantize_op = dfp.is_op("qnn.quantize")(relu_op, dfp.wildcard(), dfp.wildcard())
    cast_op = dfp.is_op("cast")(quantize_op)  # Assuming requantize is a cast operation
    return cast_op
```

Taking fuse_conv2d_bias_add_relu() as the example, note that wildcards in TVM represent patterns that can match any Relay expression. First, we use wildcards to represent the input, weight, and bias tensors.
Next, we sequence the operations in the following order: conv2d, followed by bias_add, and then relu. After applying these operations, we quantize the result back to the original data type. Finally, we cast the data to meet the hardware requirements before returning the final output.
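The dense patterns to be implemented can likely mirror the same structure. The sketch below is an untested guess that assumes the quantized dense block uses nn.dense followed by nn.bias_add and that dfp is tvm.relay.dataflow_pattern as imported in fuse.py; your model may use a plain add instead, so check the Relay dump before relying on it:

```python
def fuse_dense_add_relu():
    # Sketch only: mirrors fuse_conv2d_bias_add_relu, assuming the same
    # dequantize -> op -> bias_add -> relu -> quantize -> cast structure.
    i = dfp.wildcard()
    w = dfp.wildcard()
    b = dfp.wildcard()
    dequantized_i = dfp.is_op("qnn.dequantize")(i, dfp.wildcard(), dfp.wildcard())
    dequantized_w = dfp.is_op("qnn.dequantize")(w, dfp.wildcard(), dfp.wildcard())
    dequantized_b = dfp.is_op("qnn.dequantize")(b, dfp.wildcard(), dfp.wildcard())
    dense_op = dfp.is_op("nn.dense")(dequantized_i, dequantized_w)
    add_op = dfp.is_op("nn.bias_add")(dense_op, dequantized_b)
    relu_op = dfp.is_op("nn.relu")(add_op)
    quantize_op = dfp.is_op("qnn.quantize")(relu_op, dfp.wildcard(), dfp.wildcard())
    return dfp.is_op("cast")(quantize_op)
```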
Following the fusion and annotation of the model subgraph, the subsequent step involves generating customized C code aligned with the target ASIC driver and its corresponding API.
Homework Note
The TAs have already marked the sections where your work is needed. Please complete the parts indicated by the TAs.
Lab 5.3 - Integration
To aid understanding, the diagram below depicts the full function call path for the code generation and data generation process as implemented in this lab.
Overview of C codegen in Python version
In this lab, we implement a lightweight C code generator using Python, allowing for faster prototyping without the need to recompile TVM. This Python-based flow simplifies the integration of customized code generation logic while maintaining compatibility with the Relay model structure.
The codegen process is organized into three core components:
Python script path
aoc2025-lab5/StudentID_lab5/Python/utils/*.py

- codegen.py: Responsible for generating the full C source code required for model inference. This includes emitting function declarations and layer-wise computations, as well as embedding model weights.
- datagen.py: Handles the transformation of sample input datasets into a runtime-friendly binary format. A lightweight header is added to assist with input parsing during execution.
- note.py: Serves as a configuration and pattern-matching module. It defines wildcard variable names for pattern recognition and maps fused composite functions to their corresponding C code templates.

This modular design not only increases code readability and reusability but also separates concerns clearly, making it easier to maintain and extend the system for different target hardware or model structures.
TVM Relay External Codegen: C Backend Walkthrough - codegen.py

This part explains the implementation of an external code generator in TVM targeting C. It demonstrates how to lower a Relay composite function into C code and generate a runtime-loadable module.
Module Overview and Imports
```python
import tvm
import os
import numpy as np
from .fuse import COMPILER_NAME
from .fuse import pattern_table
from .note import *
```

Here, standard TVM libraries are imported, along with local modules for pattern matching and annotations. COMPILER_NAME defines the custom compiler target name used by TVM.

Data Structures: Output and Data
Output
```python
class Output(dict): ...
```

Output is used to store metadata for generated buffers or variables, such as name, data type, copy requirements, and size.

Data
```python
class Data(dict): ...
```

Data stores information about constants, including their data content and structural metadata.

Abstract Codegen Base Class
```python
class CodegenCBase(ABC): ...
```

This base class provides shared logic for all C-based code generators:

- Backend C function generation (generate_backend_c_func)
- JIT wrapper emission (jit_impl)

The output consists of both an internal kernel function and a wrapper conforming to TVM's external runtime interface.
CodegenC: Core Relay to C Lowering
```python
class CodegenC(CodegenCBase): ...
```

This class handles the traversal of the Relay IR and emits C code. It supports common Relay expressions including Call, Var, Tuple, Constant, and TupleGetItem.

Visiting Expressions
```python
def visit_expr(self, node):
    if isinstance(node, Var):
        return self.visit_var(node)
    elif isinstance(node, Tuple):
        return self.visit_tuple(node)
    elif isinstance(node, TupleGetItem):
        return self.visit_tuple_get_item(node)
    elif isinstance(node, Constant):
        return self.visit_constant(node)
    elif isinstance(node, Call):
        return self.visit_call(node)
    else:
        return self.visit_expr_default(node)
```

Because Python does not support method overloading, Relay expressions are dispatched to the appropriate handlers based on their type. For instance:

- visit_var: Registers variable inputs.
- visit_constant: Processes embedded tensor data.
- visit_call: Handles composite patterns such as convolutions.

Visiting Call Nodes (Composite Operators)
```python
def visit_call(self, call): ...
```

This is the main entry point for lowering composite functions into backend-specific C calls. Key steps include:

```python
def visit_call(self, call):
    composite_name = call.op.attrs["Composite"]
    func_name = composite_name.replace(".", "_")
    in_shape = self.get_shape(call.args[0].checked_type)
    if composite_name in PATTERN_TABLE:
        print("[composite trace]", composite_name, in_shape)
    else:
        raise RuntimeError("Unrecognized composite")
```

First, obtain which Composite function this Call belongs to.
Next, replace "." with "_" in composite_name to convert it into a C-compatible function name.
Finally, retrieve the input shape of this Call, and check whether it exists in our pattern_table to prevent encountering unsupported functions.
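The name mangling in the second step is plain string manipulation and can be sanity-checked on its own; the composite name below is a made-up example following the <compiler>.<pattern> naming convention:

```python
composite_name = "dla.qconv2d_relu_maxpool"  # hypothetical Composite attribute value
func_name = composite_name.replace(".", "_")
print(func_name)  # dla_qconv2d_relu_maxpool
```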
Next, iterate through all parameters of the Call, checking whether each parameter is a constant (Constant) or a variable (Variable). Store these parameters for later use in generating C code.
Please complete the implementation of the trace parts, ensuring that operator information is correctly extracted and the corresponding C code is generated.
Next, set up the output buffer and store the parameters and values into the config for easy access later. Note that buf_idx should be incremented by 1 each time to ensure there are no duplicate buffers. Once the buffers are set, use get_conv_info(call) to retrieve the configuration and store it in the config.
Homework Note
The TAs have already marked the sections where your work is needed. Please complete the parts indicated by the TAs.
Extracting Convolution Information
To extract convolution-related parameters (e.g., padding, kernel size, and number of output channels), the get_conv_info function performs a Breadth-First Search (BFS) over the body of the composite function. It starts by initializing a conv2d_info dictionary with default values and adds the top-level operation (call.op.body) to the op_list. As it iterates through this list, it checks whether each node is a Call and whether its operator name is "nn.conv2d". If a convolution is found, it extracts and stores the relevant attributes—such as padding, channels, and kernel_size—into the conv2d_info dictionary. This approach ensures that deeply nested operators inside fused composite patterns are correctly identified and their parameters recorded for downstream code generation. This is essential for generating hardware-aware kernel configurations.
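The BFS itself does not depend on TVM specifics. The sketch below imitates it with stand-in node classes (MockOp and MockCall are hypothetical, used only to show the traversal order and attribute capture, not the real Relay API):

```python
from collections import deque

class MockOp:
    def __init__(self, name):
        self.name = name

class MockCall:
    def __init__(self, op_name, args, attrs=None):
        self.op = MockOp(op_name)
        self.args = args          # child expressions, queued for later visits
        self.attrs = attrs or {}

def get_conv_info_sketch(body):
    conv2d_info = {"padding": None, "channels": None, "kernel_size": None}
    op_list = deque([body])
    while op_list:
        node = op_list.popleft()
        if isinstance(node, MockCall):
            if node.op.name == "nn.conv2d":
                conv2d_info.update(node.attrs)
            op_list.extend(node.args)   # keep searching deeper in the pattern
    return conv2d_info

# relu(bias_add(conv2d(...))) with the conv attributes buried two levels deep
conv = MockCall("nn.conv2d", [], {"padding": (1, 1), "channels": 64, "kernel_size": (3, 3)})
body = MockCall("nn.relu", [MockCall("nn.bias_add", [conv])])
print(get_conv_info_sketch(body))
```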
Homework Note
The TAs have already marked the sections where your work is needed. Please complete the parts indicated by the TAs.
Code Emission and Constant Handling
```python
def create_data_reference(self, symbol, const_id, cn): ...
```

Each constant is given a unique name and stored for emission into weight.c and weight.h. The visit_constant method also returns an Output object representing the constant as a pointer for runtime use.

CSourceCodegen: Final Module Construction
```python
class CSourceCodegen(CSourceModuleCodegenBase): ...
```

This class wraps everything and produces:

- code.c: Generated logic for the Relay function.
- weight.c / weight.h: Definitions and initializations for constant data.
- weight.bin: Serialized tensor weights.

Generating the Function
```python
def gen_c_func(self, func): ...
```

Calls into CodegenC to perform expression traversal and lower the function. The resulting C source code is appended to the code_stream.

Generating Weight Files
```python
def gen_weight(self, const_vars): ...
```

Iterates through all constants and exports their data to:

- C source definitions (weight.c)
- Header declarations (weight.h)
- Binary weights (weight.bin)

Special handling is included for DLA platforms that require channel padding to multiples of 4.
TVM Registration
```python
@registry.register_func(f"relay.ext.{COMPILER_NAME}")
def DLA_compiler(ref): ...
```

This function is registered under the name relay.ext.<compiler> and is invoked automatically when TVM detects a composite pattern assigned to the external compiler.

Output Files
Upon execution, the following files are written to ./output/:

- code.c: Core runtime logic
- weight.c: Constant data arrays
- weight.h: Header declarations for constants
- weight.bin: Serialized constant tensors for loading at runtime

```python
self.dump_code(self.weight_c_stream.getvalue(), "weight", "c")
```

Dataset Preprocessing for C Deployment - datagen.py

In this section, we walk through the implementation of datagen.py, which prepares dataset samples for use in embedded C environments. Its primary function is to retrieve images from a test dataset and convert them into .h (C header) or .bin (binary) files, making them portable for inference testing on microcontrollers or other low-level platforms.

Note
Please copy and paste the provided code into your own datagen.py file. You do not need to write it from scratch.

Data Processing Class Overview
We start by defining a class called Dataset_Generator, which handles the core data preprocessing logic. Its responsibilities include fetching test samples per class and exporting them as .h files or .bin files.

```python
class Dataset_Generator(object):
    def __init__(self, source, root="data", eval_transform=None):
        self.root = root
        self.eval_transform = eval_transform
        self.test_dataset = source(
            root=self.root, train=False, download=True, transform=self.eval_transform
        )
        self.testloader = DataLoader(self.test_dataset, batch_size=1, num_workers=1, shuffle=False)
        self.classes = []
```

Fetching Data by Class
The fetch_data() method extracts a fixed number of samples (num_data_per_class) from each category in the test dataset and organizes them into a dictionary for later processing.

```python
def fetch_data(self, num_data_per_class):
    self.classes = self.test_dataset.classes
    data_dict = dict()
    for idx, c in enumerate(self.classes):
        data_dict[c] = []
        for img, y in self.testloader:
            if idx == y:
                data_dict[c].append(img)
            if len(data_dict[c]) >= num_data_per_class:
                break
    return data_dict
```

Generating .bin Files
The gen_bin() method saves the extracted image data to a .bin file in a format suitable for embedded inference. The binary structure includes a small header (class count, class names, samples per class, and flattened sample size) followed by the raw image data in float32 format.

```python
def gen_bin(self, output_path, num_data_per_class=10):
    data_dict = self.fetch_data(num_data_per_class=num_data_per_class)
    with open(output_path, "wb") as f:
        num_classes = len(self.classes)
        f.write(struct.pack("I", num_classes))  # Total number of classes
        for class_name in self.classes:
            encoded_name = class_name.encode('utf-8')
            name_length = len(encoded_name)
            f.write(struct.pack("I", name_length))
            f.write(encoded_name)
        first_data_shape = list(data_dict.values())[0][0].shape
        flattened_size = np.prod(first_data_shape)
        f.write(struct.pack("I", num_data_per_class))
        f.write(struct.pack("I", flattened_size))
        for values in data_dict.values():
            for value in values:
                np_array = value.numpy().astype('float32')
                f.write(np_array.tobytes())
```

TVM Function Call Generation for Quantized Inference - note.py

This module automates the generation of C-style function call code for the TVM (Tensor Virtual Machine) compiler. It targets quantized deep learning models and supports operations such as:
- Quantized convolution (QConv2D)
- Quantized linear (QLinear)

tvm_auto_args_NOTES: Parameter Schemas for TVM Operations

```python
tvm_auto_args_NOTES = {
    f"{COMPILER_NAME}_qconv2d_relu_maxpool": [
        "input", "input_scale", "input_zero_point",
        "weight", "weight_scale", "weight_zero_point",
        "bias", "bias_scale", "bias_zero_point",
        "dequantize_scale", "dequantize_zero_point",
        "quantize_zero_point",
    ],
    # Additional entries go here
}
```

This dictionary defines the expected argument names for each supported TVM function. These names will be used for:
For example, the qconv2d_relu_maxpool function requires the argument names listed above.

convert_log: Helper for Quantization Scaling

```python
def convert_log(x):
    return -int(math.log2(x))
```

This helper function calculates -log2(x) as an integer. It's commonly used to transform floating-point scaling factors into integer log-scale shifts, which are more efficient for fixed-point inference on hardware accelerators or microcontrollers.
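Assuming the combined quantization scale is a power of two (which this flow appears to rely on), the shift amount recovers exactly; for instance a scale of 0.125 = 2^-3 becomes a right shift of 3:

```python
import math

def convert_log(x):
    return -int(math.log2(x))

shift = convert_log(0.125)      # 2**-3 -> shift by 3
print(shift)
assert (80 >> shift) == int(80 * 0.125)  # shifting right by 3 == multiplying by 0.125
```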
tvm_c_func_call_gen: C Code Generators per Operator

```python
tvm_c_func_call_gen = {
    f"{COMPILER_NAME}_qconv2d_relu_maxpool": lambda config: f"""
    #ifndef CPU_ONLY
        qconv2d_relu_maxpool(
    #else
        qconv2d_relu_maxpool_cpu(
    #endif
            {config["input"]}, {config["weight"]}, {config["output"]}, {config["bias"]},
            {config["output_len"]}, {config["input_len"]}, {config["weight_len"]},
    #ifndef CPU_ONLY
            // Mapping parameters
            {config["m"]}, {config["e"]}, {config["p"]}, {config["q"]}, {config["r"]}, {config["t"]},
    #endif
            // Shape parameters
            {config["PAD"]}, {config["U"]}, {config["R"]}, {config["S"]},
            {config["C"]}, {config["M"]}, {config["W"]}, {config["H"]},
            // Quantization scale
            {convert_log(config["input_scale"] * config["weight_scale"] / config["dequantize_scale"])}
        );
    """,
    # Additional entries go here
}
```

This dictionary maps each TVM operation to a corresponding C-language function call template, dynamically filled using the config dictionary. Each lambda returns a formatted multi-line C function call string, ready to be inserted into generated code. These templates can target either the DLA runtime or the CPU-only implementations (*_cpu).

Example Output:
```c
qconv2d_relu_maxpool(
    input, weight, output, bias,
    output_len, input_len, weight_len,
    // mapping parameter
    m, e, p, q, r, t,
    // shape parameter
    PAD, U, R, S, C, M, W, H,
    // quantize scale
    convert_log(input_scale * weight_scale / dequantize_scale)
);
```

Once all Python scripts are completed, the build_model logic will automatically invoke the appropriate code generation logic from tvm_c_func_call_gen, completing the full compilation and code-emission process for quantized models.

Lab 5.4 Performance Analysis
In this lab, we will explore tools that help analyze and visualize memory usage during program execution. We will use Valgrind’s Massif toolset to profile a simple C program that recursively prints Fibonacci numbers.
massif: Heap Memory Profiler

Massif is a heap profiler provided by Valgrind. It helps track memory allocations, identify memory peaks, and analyze memory usage over time.
To use Massif, run the following command in the lab directory:

This will execute the program massif_test, which prints a list of Fibonacci numbers using a recursive function. Massif will trace memory usage at runtime. Internally, it uses the following command:
```shell
valgrind --tool=massif \
    --heap=yes \
    --stacks=yes \
    --time-unit=i \
    --detailed-freq=1 \
    --max-snapshots=1000 \
    --massif-out-file=massif.out.massif_test \
    ./massif_test
```

Example output:
Note
The prefix ==841727== refers to the process ID (PID) of the running program.

ms_print: Text-Based Memory Usage Graph

ms_print is a command-line tool that reads the Massif output and displays a detailed memory usage graph.
To generate a visual report:
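A standard ms_print invocation matching the filenames used in this lab would be (sketch; adjust paths if your output lands elsewhere):

```shell
ms_print massif.out.massif_test > massif_output.txt
```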
This will use ms_print to parse the output file massif.out.massif_test and dump the results into massif_output.txt.

Here's a sample snippet of the memory usage chart:
It includes:
massif-visualizer: GUI for Massif Output

massif-visualizer is a graphical interface for visualizing Massif output, making it easier to analyze memory usage trends interactively.
To use massif-visualizer:

1. Ensure X11 forwarding is enabled if you're using SSH.
2. Launch the massif-visualizer tool.
3. Open the output file: massif.out.massif_test
4. View the memory usage graph.
This visualization helps you easily pinpoint memory-intensive regions and track memory growth over time.
Clean Up
After completing the lab, clean the generated output files by running:
This removes all intermediate files and prepares the environment for a fresh start.
Homework Requirements
In this lab, you learned how to build a complete AI compiler flow that translates high-level machine learning models into low-level C code suitable for embedded and CPU-only environments. You explored key compiler components such as Relay operator fusion, external code generation, and performance analysis using memory profiling tools like valgrind, massif, and ms_print.

In addition, you will implement and complete several essential modules of the compiler pipeline, including:

- Operator fusion patterns in fuse.py.
- Function call templates in note.py for different quantized operators.
- Memory profiling with the text-based tool (ms_print) and GUI (massif-visualizer).

These tasks are designed to help you understand the end-to-end compilation and deployment process of deep learning models, and how software optimization maps to hardware-aware execution models.
By the end of this assignment, you will have hands-on experience with each of these components.
Make sure to follow the implementation notes and fill in the TODOs as marked by the TAs in each respective file.
Prerequisites
Before starting the lab, make sure the following setup steps are completed:
Download and extract the lab materials
Get the sample code and report template from Moodle, then decompress the archive:
Activate the lab environment
Ensure that you are working in the correct Conda environment for this lab:
Directory Structure
After unzipping the file downloaded from Moodle, the directory structure will look like the one below:
HW 5.1 Codegen with TVM compiler
Implement the marked sections in the following Python files:

fuse.py:

```python
def fuse_conv2d_bias_add_relu(): ...
def fuse_dense_add_relu(): ...
def fuse_dense_add(): ...
```

codegen.py:

```python
def visit_call(self, call): ...
def get_conv_info(self, call): ...
```

datagen.py: Please copy the contents of datagen.py in section Lab 5.3 into your own datagen.py file.

note.py: the tvm_c_func_call_gen dictionary.

Once you have completed the marked sections above, you can proceed to execute the following steps in order.
Build the model
The process will traverse the whole model in DFS order; notice that the trace flow runs from the output layer back to the input layer.
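The output-to-input trace order can be illustrated with a tiny stand-in graph (the layer names are hypothetical; the real traversal walks Relay Call nodes):

```python
# Each node lists the layers that feed it, so walking from the output
# naturally visits layers in reverse (output -> input) order.
feeds = {
    "fc": ["conv2"],
    "conv2": ["conv1"],
    "conv1": ["input"],
    "input": [],
}

def trace(node, order):
    order.append(node)               # visit first (pre-order DFS)
    for producer in feeds[node]:
        trace(producer, order)
    return order

print(trace("fc", []))
```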
visuTVM: Relay Graph Visualizer
visuTVM is a tool used to visualize the structure of a TVM Relay model graph, helping you better understand model transformations during compilation.

To generate visualizations of the Relay graph:
This command produces two SVG images representing the Relay graph:
- ./output/visu_VGG8_relay_ir.svg: The original Relay IR (before the MergeComposite pass)
- ./output/visu_VGG8_relay_ir_pass.svg: The Relay IR after the pattern fusion and annotation passes

Submission
Include both .svg images in your report.md to illustrate the changes before and after fusion.

HW 5.2 Simulation and Performance Analysis
In this task, you will analyze the memory usage and runtime performance of the CPU-only version of your model using Massif, a heap memory profiler from Valgrind. Additionally, you will utilize DLA info counters—provided in the Lab 4 runtime library—to evaluate the behavior and efficiency of the simulated accelerator.
This dual analysis allows you to compare software-based and hardware-like execution, providing deeper insights into memory bottlenecks and inference performance.
Inference model with CPU-only
For a quick demo and test of the CPU version:

You will get a single-shot inference of the full model through the CPU-only runtime API.