Setup instructions¶
Website @ HPCA'26: https://tutorial.xiangshan.cc/hpca26/hands_on/setup/
- Open https://t.xiangshan.cc in your browser
- Enter password: TO BE DISCLOSED ON SITE
- By default you should see a terminal
- If not, click Menu (triple dashes) on the top-left corner, then "Terminal > New Terminal"
- Run ./start.sh in the terminal
- This will create a unique random id for you, and create your workspace at /data/${random_id}
- When your workspace is created, it should be opened automatically
- If not, click "Menu > File > Open Folder", enter /data/${random_id}, then click "OK"
- Feel free to ask the instructors for help any time if you have any problem
Welcome to XiangShan Tutorial¶
In this section, we will introduce the basic usage of Jupyter Notebook.
A Jupyter Notebook consists of multiple cells, each of which can contain python/bash code or text. We use text cells to write these instructions you are reading, and code cells to place the code you can run.
- Cells that start with %%bash are Bash scripts; the rest are Python code.
- Lines that start with # are comments.
You can click ▶ in the top-left corner of a code cell to run that single cell; the output will be displayed below the cell.
When you open a Notebook for the first time, you may encounter the problem that the environment is not detected correctly. If a pop-up window appears at the top of the page prompting "Select Kernel" when you click run, please select "Python Environments" > "Python 3.x.y /bin/python3 (Global Env)".
%%bash
echo "[bash] Welcome to the XiangShan Bootcamp!"
print("[python] Welcome to the XiangShan Bootcamp!")
Each cell runs with its own working directory and environment variables, so changes made in one cell (such as cd or export) do not carry over to the next, and some commands, e.g. source env.sh, have to be repeated at the top of every cell.
This is different from executing commands directly in a shell: there, the working directory and environment persist between commands, so you only need to run source env.sh once, but you do need to keep track of the current working directory yourself.
%%bash
# For example, in this cell, we change the working directory,
pwd
cd ..
pwd
# ... and set some environment variables,
export HELLO_WORLD="XiangShan Bootcamp!"
echo Hello $HELLO_WORLD
%%bash
# ... we can see that the changes made in the previous cell do not take effect in this cell.
pwd
echo Hello $HELLO_WORLD
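The isolation between cells can also be reproduced outside the notebook. The following Python sketch assumes that each %%bash cell behaves like a separate shell process (emulated here with sh -c). The helper name run_cell is made up for this illustration.

```python
import subprocess

def run_cell(script: str) -> str:
    """Run a script in a fresh shell process, like a separate %%bash cell."""
    result = subprocess.run(
        ["sh", "-c", script], capture_output=True, text=True, check=True
    )
    return result.stdout

# "Cell 1": change the working directory and export a variable.
first = run_cell(
    'cd / && export HELLO_WORLD="XiangShan Bootcamp!" && echo "Hello $HELLO_WORLD"'
)

# "Cell 2": a new process, so neither change is visible any more.
second = run_cell('echo "Hello ${HELLO_WORLD:-<unset>}"')

print(first.strip())
print(second.strip())
```

The second call prints a placeholder instead of the exported value, which mirrors what the two cells above show.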
About the get_asset function¶
You may see the use of the get_asset function in the following chapters.
# For example, using the pre-compiled simulation program emu
$(get_asset emu-precompile/emu)
# This will return the absolute path of assets/emu-precompile/emu
# Or, prefer to use locally compiled files (if they exist)
$(get_asset emu-precompile/emu ${NOOP_HOME}/build/emu)
# If you have done local compilation, ${NOOP_HOME}/build/emu will exist and be returned preferentially. Otherwise, it falls back to assets/emu-precompile/emu
Compiling and running XiangShan requires massive computing resources.
In order to enable more people to experience XiangShan with weaker devices, bootcamp provides many precompiled assets. We will use them through the get_asset function.
In this tutorial, our demo server has limited resources, so for most compilation commands and some execution commands we will only show them without actually running them, to save computing resources and time.
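The fallback behaviour of get_asset described above can be sketched in a few lines of Python. This is only an illustration of the lookup order, not the bootcamp's actual shell implementation; the assets_dir parameter and the temporary directory layout are assumptions made for the example.

```python
import os
import tempfile

def get_asset(name, local_path=None, assets_dir="assets"):
    """Return local_path if it exists, otherwise the absolute path of assets_dir/name."""
    if local_path and os.path.exists(local_path):
        return os.path.abspath(local_path)
    return os.path.abspath(os.path.join(assets_dir, name))

# A throwaway directory stands in for the real workspace.
with tempfile.TemporaryDirectory() as root:
    assets = os.path.join(root, "assets")
    local_emu = os.path.join(root, "build", "emu")

    # No local build yet: fall back to the precompiled asset.
    before = get_asset("emu-precompile/emu", local_emu, assets_dir=assets)

    # After a (pretend) local build, the local file is preferred.
    os.makedirs(os.path.dirname(local_emu))
    open(local_emu, "w").close()
    after = get_asset("emu-precompile/emu", local_emu, assets_dir=assets)

print(before)
print(after)
```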
Configure and Build¶
In this section, we present the workflow for configuring and building XiangShan.
Configure¶
Benefiting from the flexibility provided by Chisel, an OOP HDL, XiangShan offers a wide range of configurable parameters for researchers and industry customization.
// xiangshan/frontend/FrontendParameters.scala
case class FrontendParameters(
FetchBlockSize: Int = 64, // bytes
bpuParameters: BpuParameters = BpuParameters(),
icacheParameters: ICacheParameters = ICacheParameters(),
// ...
) { ... }
// xiangshan/frontend/icache/Parameters.scala
case class ICacheParameters(
NumSets: Int = 256,
NumWays: Int = 4,
Replacer: String = "setplru", // "random", "setlru", "setplru"
EnableEcc: Boolean = true, // whether to enable ECC or parity check
// ...
) { ... }
Taking the frontend design of Kunminghu-V3 as an example, we employ a hierarchical parameter system where designers first define parameter classes and default parameters for each module, as shown in this page.
Subsequently, for researchers and industry users, these parameter classes can be instantiated in top/Configs.scala to quickly generate different configurations:
// Default configuration
class DefaultConfig extends Config(
// ...
frontendParameters = FrontendParameters()
// ...
)
// Minimal configuration for quick functional verification, with smaller resources to speed up compilation and simulation
class MinimalConfig extends Config(
// ...
frontendParameters = FrontendParameters(
icacheParameters = ICacheParameters(
NumSets = 64,
NumWays = 2,
Replacer = "random",
EnableEcc = false
)
)
// ...
)
Finally, different configurations can be compiled at build time using make CONFIG=XXXConfig.
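For readers less familiar with Scala case classes, the same "defaults plus overrides" pattern can be mimicked with Python dataclasses. The sketch below is only an analogy to the Scala snippets above: the snake_case field names are translations, and line_bytes = 64 is an assumption introduced here to compute a cache capacity, not a value taken from XiangShan.

```python
from dataclasses import dataclass, field, replace

@dataclass(frozen=True)
class ICacheParameters:
    num_sets: int = 256
    num_ways: int = 4
    replacer: str = "setplru"
    enable_ecc: bool = True
    line_bytes: int = 64  # assumption for this sketch only

    @property
    def capacity_bytes(self) -> int:
        return self.num_sets * self.num_ways * self.line_bytes

@dataclass(frozen=True)
class FrontendParameters:
    fetch_block_size: int = 64  # bytes
    icache: ICacheParameters = field(default_factory=ICacheParameters)

# Default configuration: every module keeps its own defaults.
default_cfg = FrontendParameters()

# Minimal configuration: override only the fields that differ.
minimal_cfg = replace(
    default_cfg,
    icache=replace(
        default_cfg.icache,
        num_sets=64, num_ways=2, replacer="random", enable_ecc=False,
    ),
)

print(default_cfg.icache.capacity_bytes // 1024, "KiB")
print(minimal_cfg.icache.capacity_bytes // 1024, "KiB")
```

Under the 64-byte line assumption, the two configurations come out at 64 KiB and 8 KiB respectively, which illustrates why the minimal configuration is faster to compile and simulate.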
Wait a minute!¶
Due to some historical reasons, the build workflow of XiangShan depends on certain environment variables, such as NOOP_HOME. We provide an env.sh script to help you set up these environment variables.
You can run the following cell to view these environment variables:
%%bash
source ../env.sh
env | grep _HOME | outputBuffer
Build¶
After the environment setup, go to $NOOP_HOME to build XiangShan.
%%bash
source ../env.sh
cd ${NOOP_HOME}
#make emu -j16 CONFIG=MinimalConfig
# Additional make options
# CONFIG=MinimalConfig XiangShan configuration
# EMU_THREADS=4 Simulation thread count
# EMU_TRACE=1 Enable waveforms
# WITH_DRAMSIM=1 Simulate DRAM with DRAMSim3
# WITH_CHISELDB=1 Enable ChiselDB
# WITH_CONSTANTIN=1 Enable Constantin
The commands above will generate outputs like build/emu and build/rtl.
- build/rtl/*.sv are the Verilog files generated by Chisel.
- build/emu is a simulation executable further compiled with Verilator.
Build the workload using Nexus-AM¶
In the previous section, we built the XiangShan emulator emu. However, we cannot directly run an arbitrary ELF file on it. This is because XiangShan is a bare-metal device that does not provide support such as operating systems or runtime libraries.
Linux? Too heavyweight.
Nexus-AM: a lightweight framework for bare-metal machines.
The am/ directory contains the Nexus-AM framework sources; apps/ and tests/ hold common workload sources, and you can create your own apps and tests.
%%bash
source ../env.sh
cd ${AM_HOME}
tree -d -L 1 | outputBuffer
echo "- Apps: $(ls -m ./apps)"
echo "- Tests: $(ls -m ./tests)"
We start with a simple "Hello, XiangShan" example, whose source code is located in the ${AM_HOME}/apps/hello directory. We can compile it using make:
%%bash
source ../env.sh
cd $AM_HOME/apps/hello
# compiling
# ARCH=riscv64-xs for XiangShan
# LINUX_GNU_TOOLCHAIN=1 to use the GNU toolchain from Ubuntu apt repo, instead of riscv64-unknown-elf
make ARCH=riscv64-xs LINUX_GNU_TOOLCHAIN=1 | outputBuffer
# check output
ls build
This will generate the following three files:
- hello-riscv64-xs.bin: the program's binary image for emu (the ELF header and other metadata are removed).
- hello-riscv64-xs.elf: the program's ELF file.
- hello-riscv64-xs.txt: the program's disassembly, for debugging.
Nexus-AM supports multiple ISAs and configurations. You can compile for different targets by passing different ARCH values. For example, riscv64-xs used above is for XiangShan's RISC-V 64-bit architecture.
If the ARCH you pass isn’t supported, make will print all supported ARCH values.
%%bash
source ../env.sh
cd ${AM_HOME}
# || true to prevent notebook erroring out on non-zero exit code
make ARCH= || true
Run RTL simulation¶
Now, we can finally run XiangShan's RTL simulation using the emu and hello programs we just compiled!
XiangShan's emu supports many options; run emu --help to see its usage.
%%bash
source ../env.sh
cd ${NOOP_HOME}
$(get_asset emu-precompile/emu ${NOOP_HOME}/build/emu) --help | outputBuffer
Run Hello XiangShan:
%%bash
source ../env.sh
cd ${NOOP_HOME}
$(get_asset emu-precompile/emu ${NOOP_HOME}/build/emu) \
-i $(get_asset workload/hello-riscv64-xs.bin ${AM_HOME}/apps/hello/build/hello-riscv64-xs.bin) \
--no-diff \
2>/dev/null | outputBuffer
Note: XiangShan prints performance counters to stderr at the end of a run, and the output is very large. We recommend always redirecting stderr to a file. In the example above we do not need the counters, so we redirect stderr to /dev/null; adjust as needed.

We use the MinJie platform to construct the workflow; it is shown as the functional verification toolchain in the picture.
We summarize functional verification as a loop consisting of the following four parts, with tool support for each part:
- Test generation: nexus-am for generating bare-metal tests;
- Bug detection: nemu for providing golden result, difftest for result comparison;
- Preserving bug context: lightsss for automatic context capturing;
- Troubleshooting and bug fixing: waveform and ChiselDB for locating and fixing bugs.
Why we need MinJie¶
We manually injected a bug into the ALU module and compiled emu for you to test:
%%bash
source ../env.sh
cd ${NOOP_HOME}
$(get_asset emu-precompile/emu-alu-err) \
-i $(get_asset workload/hello-riscv64-xs.bin ${AM_HOME}/apps/hello/build/hello-riscv64-xs.bin) \
--no-diff \
-C 20000 \
2>/dev/null || true
We can see that it does not correctly print "Hello XiangShan" as the previous section did. And, when it reaches the 20,000-cycle limit, pc is 0xE. That's clearly not the behavior of a normal program.
However, the simulation program cannot determine whether it is running correctly or incorrectly. As a result, it may continue running indefinitely after the faulty behaviour, merely wasting compute resources.
"How does the simulation program know it has already encountered an error?" and "How to locate where the error occurred?" are important questions in processor functional verification.
The MinJie toolchain is designed to answer these questions.
NEMU: ISA Reference¶
"How does the simulation program know it has already encountered an error?"
-> A reference model
NEMU: a Spike-like ISA simulator
- QEMU-like performance;
- Exposes APIs to compare and verify XiangShan's architectural state.
Recall the questions we raised before: we first need to define what is "correct". In other words, we need a golden reference. QEMU and Spike could serve as one, but they are somewhat heavyweight and not easy to integrate into our workflow.
-> NEMU
NEMU provides two default configurations:
- xxx_defconfig: the default configuration of xxx for standalone mode
- xxx-ref_defconfig: the default configuration of xxx as the reference model in DiffTest co-simulation mode
DiffTest and reference mode will be introduced later. Let's first look at the build steps and usage of standalone mode.
%%bash
source ../env.sh
cd ${NEMU_HOME}
make clean
# compile default config as standalone mode
make riscv64-xs_defconfig | outputBuffer
make -j | outputBuffer
make clean-softfloat | outputBuffer
# compile default config as reference mode
make riscv64-xs-ref_defconfig | outputBuffer
make -j | outputBuffer
Next, we run Hello XiangShan on NEMU in standalone mode.
%%bash
source ../env.sh
cd ${NEMU_HOME}
# Use the -b option to start NEMU in batch mode and avoid manually entering commands to run the workload.
./build/riscv64-nemu-interpreter \
-b \
$(get_asset workload/hello-riscv64-xs.bin) | outputBuffer
Difftest: ISA Co-simulation framework¶
- How does the simulation program know it has already encountered an error
- How to locate where the error occurred
-> Co-sim RTL (DUT) and reference model (REF)

Now that we have NEMU as the reference model, the first problem is partially solved: we know what is "correct". To fully answer these questions, we still need a mechanism that notifies the emu process or the developers when an error happens.
We can run the RTL simulation and the reference model simultaneously, comparing their architectural states in real time to identify errors in the simulation program.
When running emu, enable the DiffTest feature by specifying the path to the reference model's dynamic link library with the --diff <path/to/ref.so> parameter.
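The co-simulation idea can be illustrated with a toy model in Python. This is a minimal sketch, not DiffTest's real interface: the three-instruction "ISA", the register names, and the injected bug (shifts always produce 0) are all invented for the example.

```python
def ref_step(regs, inst):
    """Reference model: execute one instruction on a copy of the register file."""
    op, rd, src, imm = inst
    regs = dict(regs)
    if op == "addi":
        regs[rd] = regs[src] + imm
    elif op == "slli":
        regs[rd] = regs[src] << imm
    return regs

def buggy_dut_step(regs, inst):
    """Stand-in for the DUT, with an invented ALU bug: shifts always produce 0."""
    op, rd, _src, _imm = inst
    if op == "slli":
        regs = dict(regs)
        regs[rd] = 0
        return regs
    return ref_step(regs, inst)

def difftest(program, dut_step, ref_step):
    """Step DUT and REF together; report the first commit where their registers differ."""
    dut = ref = {"zero": 0, "a0": 0, "a1": 0}
    for index, inst in enumerate(program):
        dut, ref = dut_step(dut, inst), ref_step(ref, inst)
        diff = {r: (ref[r], dut[r]) for r in ref if ref[r] != dut[r]}
        if diff:
            return index, diff  # (instruction index, {reg: (REF value, DUT value)})
    return None

program = [
    ("addi", "a0", "zero", 2),
    ("slli", "a0", "a0", 12),
    ("addi", "a1", "a0", 1),
]
print(difftest(program, buggy_dut_step, ref_step))  # (1, {'a0': (8192, 0)})
```

Because the comparison happens after every committed instruction, the mismatch is reported at the first faulty instruction rather than many cycles later, when the program has already gone astray.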
We rerun the faulty emu with diff enabled:
%%bash
source ../env.sh
cd ${NOOP_HOME}
$(get_asset emu-precompile/emu-alu-err) \
-i $(get_asset workload/hello-riscv64-xs.bin) \
--diff $(get_asset emu-precompile/riscv64-nemu-interpreter-so) \
2>/dev/null | tee emu_err.log > /dev/null || true # tutorial:add "|| true" to avoid notebook errors; It's not needed in real usage.
tail -n 7 emu_err.log
At PC 0x0080000078, the REF and DUT do not match: a0 is 0x2000 in the REF, but 0 in the DUT.
Difftest also prints useful information when an error is encountered, such as the values of architectural registers (integer/floating-point/vector/CSR).
%%bash
source ../env.sh
cd ${NOOP_HOME}
tail -n 95 emu_err.log | head -n 19 | outputBuffer
It also prints the order of instruction commits and other information.
%%bash
source ../env.sh
cd ${NOOP_HOME}
tail -n 124 emu_err.log | head -n 11 | outputBuffer
After Difftest detects an error, we can rerun the simulation and enable waveform output around the failing cycle reported by Difftest.
%%bash
source ../env.sh
cd ${NOOP_HOME}
mkdir -p build
rm -f ./build/*.vcd
$(get_asset emu-precompile/emu-alu-err) \
-i $(get_asset workload/hello-riscv64-xs.bin) \
--diff $(get_asset emu-precompile/riscv64-nemu-interpreter-so) \
-b 8000 \
-e 10000 \
--dump-wave \
2>/dev/null >/dev/null || true
echo -n "Dump wave: "
realpath ./build/*.vcd
LightSSS: Light-weight Simulation Snapshot System¶
- How to locate where the error occurred -> Difftest
- How to efficiently obtain waveform?
- Enable waveform from the very beginning -> Disk 💥
- Manually re-run -> Not efficient
Thanks to DiffTest, we have already addressed the issue of "How to locate where the error occurred".
However, after finding the failure point, we still need to manually enable the waveform output and rerun the simulation to obtain the waveform files needed for debugging. This process is very time-consuming, especially for long-running workloads like SPEC CPU.
To further improve efficiency, we need a method that can automatically restore to the state before the error occurred and generate waveform files from that state when an error is detected. In other words, we need a snapshot mechanism.

LightSSS is designed to solve this problem. It utilizes Linux's fork() system call and the copy-on-write mechanism to achieve low-cost simulation snapshots.
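The fork-based snapshot idea can be sketched in Python on Linux. This is a toy illustration, not LightSSS's actual implementation: the "simulation" is a simple arithmetic recurrence, the pipe protocol is invented, only one snapshot is kept at a time, and the error at fail_cycle is simply assumed to be reported by something like DiffTest.

```python
import json
import os

def step(state, cycle):
    """One cycle of a toy deterministic simulation."""
    return (state * 31 + cycle) % 1_000_003

def drop(snapshot):
    """Tell a sleeping snapshot process to exit, then reap it."""
    pid, wake_fd, result_fd = snapshot
    os.write(wake_fd, b"-1")
    os.close(wake_fd)
    os.close(result_fd)
    os.waitpid(pid, 0)

def run(fail_cycle, max_cycles=1000, interval=100):
    """Simulate; on a pretend error at fail_cycle, replay from the last snapshot."""
    state, snapshot = 1, None
    for cycle in range(max_cycles):
        if cycle == fail_cycle and snapshot is not None:
            pid, wake_fd, result_fd = snapshot
            os.write(wake_fd, str(cycle).encode())  # wake the snapshot
            os.close(wake_fd)
            with os.fdopen(result_fd) as f:
                trace = json.load(f)                # trace recorded by the replay
            os.waitpid(pid, 0)
            return trace
        if cycle % interval == 0:
            if snapshot is not None:
                drop(snapshot)                      # keep only the newest snapshot
            wake_r, wake_w = os.pipe()
            res_r, res_w = os.pipe()
            pid = os.fork()
            if pid == 0:
                # Child: a copy-on-write copy of the simulator state at `cycle`.
                os.close(wake_w)
                os.close(res_r)
                end = int(os.read(wake_r, 32))      # blocks until woken or dropped
                trace = []
                for c in range(cycle, end):         # replay with tracing enabled
                    state = step(state, c)
                    trace.append([c, state])
                if end >= 0:
                    os.write(res_w, json.dumps(trace).encode())
                os._exit(0)
            os.close(wake_r)
            os.close(res_w)
            snapshot = (pid, wake_w, res_r)
        state = step(state, cycle)
    if snapshot is not None:
        drop(snapshot)
    return None

trace = run(fail_cycle=250)
print(trace[0][0], trace[-1][0], len(trace))  # 200 249 50
```

The parent never pays for tracing during normal execution; only the woken child, which resumes from cycle 200, records the 50 cycles leading up to the failure.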
Using LightSSS is straightforward. Simply add the --enable-fork parameter when running emu to enable the feature:
%%bash
source ../env.sh
cd ${NOOP_HOME}
mkdir -p build
rm -f ./build/*.vcd
$(get_asset emu-precompile/emu-alu-err) \
-i $(get_asset workload/hello-riscv64-xs.bin) \
--diff $(get_asset emu-precompile/riscv64-nemu-interpreter-so) \
--enable-fork \
2> /dev/null | outputBuffer || true
echo -n "Dump wave: "
realpath ./build/*.vcd
If you see "the oldest checkpoint start to dump wave and dump nemu log...", LightSSS is active. The simulation will then restart from the latest snapshot and record waveforms.
ChiselDB: Debug-friendly structured database¶
Waveforms are not the best choice for certain signals (e.g., bus transactions):
- Storage waste
- Not structured
-> ChiselDB
LightSSS is powerful, but the waveforms are still large and hard to analyze further. We also want to analyze structured data such as memory transaction traces. So we present ChiselDB, a debug-friendly structured database. It inserts probes between module interfaces in hardware, and uses DPI-C directly from Chisel code to transfer bundle information and data. For bug analysis, SQL queries are supported, which makes it much easier to use than waveforms.
We provide a prebuilt simulator emu-cdb-err with an injected bug that forces all data released from L2 Cache to L3 Cache to a constant value.
Enable ChiselDB with --dump-db and turn on DiffTest; after running, DiffTest reports an error and a .db file is generated under ./build.
%%bash
source ../env.sh
cd ${NOOP_HOME}
rm -f ./build/*.db # clean old files
mkdir -p build
$(get_asset emu-precompile/emu-cdb-err) \
-i $(get_asset workload/stream_100000.bin) \
--diff $(get_asset emu-precompile/riscv64-nemu-interpreter-so) \
--dump-db \
2>linux.err || true
echo -n "Dump DB: "
realpath ./build/*.db
Then use SQLite to read the .db for analysis: query all TileLink transactions at address 0x80048f00, and format the output with ./scripts/cache/convert_tllog.sh.
%%bash
source ../env.sh
DB=$(ls -t ${NOOP_HOME}/build/*db | head -n 1)
sqlite3 ${DB} "select * from TLLog where ADDRESS=0x80048f00" | sh ${NOOP_HOME}/scripts/cache/convert_tllog.sh | outputBuffer
Columns: timestamp to_from channel opcode permission address data
Here we can see that:
- L1D writes data back to L2
16171 L2_L1D_0 C ProbeAckData Shrink TtoN 0 5 80048f00 0000000080048f50 0000000080014328 0000000000000000 0000000000000000 user: 0 echo: 0
16172 L2_L1D_0 C ProbeAckData Shrink TtoN 0 5 80048f00 0000000000000000 0000000000000000 000000008001e000 0000000080042060 user: 0 echo: 0
- And to L3
16179 L3_L2_0 C ProbeAckData Shrink TtoN 0 2 80048f00 0000000080048f50 0000000080014328 0000000000000000 0000000000000000 user: 0 echo: 1
16180 L3_L2_0 C ProbeAckData Shrink TtoN 0 2 80048f00 0000000000000000 0000000000000000 000000008001e000 0000000080042060 user: 0 echo: 1
- The next time L1D requests the data, L3->L2 GrantData does not match the original data written by L1D, indicating the bug.
16457 L2_L1D_0 A AcquireBlock Grow NtoT 0 0 80048f00 0000000000000000 0000000000000000 0000000000000000 0000000000000000 user: 80048f07 echo: 0
16463 L3_L2_0 A AcquireBlock Grow NtoT 0 0 80048f00 0000000000000000 0000000000000000 0000000000000000 0000000000000000 user: 0 echo: 1
16486 L3_L2_0 D GrantData Cap toT 1 0 80048f00 0000000000abcdef 0000000000000000 0000000000000000 0000000000000000 user: 0 echo: 1
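The manual inspection above can in principle be automated with a SQL query. The sketch below uses Python's sqlite3 module on an in-memory table; apart from the table name TLLog and the ADDRESS column, the schema (STAMP, SITE, OPCODE, a single DATA column holding only the first beat) is a simplified assumption, so the query would need adapting to the real database.

```python
import sqlite3

# Hypothetical, simplified stand-in for the TLLog table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TLLog (STAMP INTEGER, SITE TEXT, OPCODE TEXT, ADDRESS INTEGER, DATA TEXT)"
)
conn.executemany(
    "INSERT INTO TLLog VALUES (?, ?, ?, ?, ?)",
    [
        (16171, "L2_L1D_0", "ProbeAckData", 0x80048F00, "0000000080048f50"),
        (16179, "L3_L2_0", "ProbeAckData", 0x80048F00, "0000000080048f50"),
        (16486, "L3_L2_0", "GrantData", 0x80048F00, "0000000000abcdef"),
    ],
)

def find_stale_grants(conn, site):
    """Return GrantData rows whose data differs from the most recent
    earlier write-back on the same site and address."""
    query = """
        SELECT g.STAMP, g.ADDRESS, w.DATA, g.DATA
        FROM TLLog AS g
        JOIN TLLog AS w
          ON w.SITE = g.SITE AND w.ADDRESS = g.ADDRESS
        WHERE g.SITE = ? AND g.OPCODE = 'GrantData'
          AND w.OPCODE IN ('ProbeAckData', 'ReleaseData')
          AND w.STAMP = (SELECT MAX(STAMP) FROM TLLog
                         WHERE SITE = g.SITE AND ADDRESS = g.ADDRESS
                           AND OPCODE IN ('ProbeAckData', 'ReleaseData')
                           AND STAMP < g.STAMP)
          AND w.DATA <> g.DATA
    """
    return conn.execute(query, (site,)).fetchall()

for stamp, address, written, granted in find_stale_grants(conn, "L3_L2_0"):
    print(f"{stamp}: addr {address:#x} wrote {written} but was granted {granted}")
```

This check is deliberately naive: it assumes no other agent legitimately modifies the line between the write-back and the grant, which holds for the excerpt above but not for coherent traffic in general.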

Co-verification of the Cache system with upstream modules is complex and prevents rapid iteration.
To address this issue, we developed TL-Test: a unit-level cache-system verification framework that supports the TileLink protocol, cache-coherence checking, and randomized test-case generation.
Here is another example: detecting a cache coherence violation with TL-Test. We inject a bug that wrongly shifts the grant data.
TL-Test generates randomized tests and pinpoints a transfer problem at a specific address in our cache design. It logs all bus transactions; we then use grep to extract the relevant log entries for analysis.
%%bash
source ../env.sh
cat $(get_asset tltest-precompile/tlt_err.patch) | outputBuffer
%%bash
source ../env.sh
# cd $TLT_HOME && make coupledL2-test-l2l3-v3 run THREADS_BUILD=16 CXX_COMPILER=clang++-17
# cd $TLT_HOME/run && ./tltest_v3lt 2>&1 | tee tltest_v3lt.log
get_asset tltest-precompile/tltest_err
mkdir -p ${WORK_DIR}/02-functional/05-tltest
cd ${WORK_DIR}/02-functional/05-tltest
cp -r $(get_asset tltest-precompile) ./ && cd ./tltest-precompile
./tltest_err 2>&1 | tee tltest_v3lt.log > /dev/null
tail -n 50 tltest_v3lt.log | head -n 15
Error Addr: 0x80
%%bash
source ../env.sh
# grep "addr: 0x80" $TLT_HOME/run/tltest_v3lt.log
cd ${WORK_DIR}/02-functional/05-tltest/tltest-precompile && grep "addr: 0x80," tltest_v3lt.log | head -n 10
Columns: [time] [INFO-level] #nodeIdx core [channel] [opcode] source, address, alias, data
- L1D acquires Eaddr
[236] [tl-test-new-INFO] #0 L2[0].C[0] [fire A] [AcquirePerm NtoT] source: 0x3, addr: 0x80, alias: 0
- L1D releases Eaddr, and the data is successfully transferred from L1D to L2
[806] [tl-test-new-INFO] #0 L2[0].C[0] [fire C] [ReleaseData TtoN] source: 0x3, addr: 0x80, alias: 0, data: [ c7 a5 ... ]
[808] [tl-test-new-INFO] #0 L2[0].C[0] [fire C] [ReleaseData TtoN] source: 0x3, addr: 0x80, alias: 0, data: [ fe 14 ... ]
- Next time L1D acquires Eaddr, L2 returns wrong data
[2036] [tl-test-new-INFO] #0 L2[0].C[0] [fire D] [GrantData toT] source: 0xf, addr: 0x80, alias: 0x1, data: [ 00 c7 ... ]
[2038] [tl-test-new-INFO] #0 L2[0].C[0] [fire D] [GrantData toT] source: 0xf, addr: 0x80, alias: 0x1, data: [ 00 fe ... ]

Apart from the functional verification introduced before, performance verification and optimization are also crucial parts of processor development.
So in this chapter, we will introduce the MinJie (or agile) performance verification approaches used by our team and provide some demonstrations.
Similar to functional verification, we also summarize the performance verification process as an iterative cycle, as shown in this picture:
We implement the RTL, run tests, evaluate performance, and analyze performance.
Of these steps, RTL implementation and running tests are straightforward, so we won't detail them in this tutorial.
Next, we will introduce the powerful tools we use in performance evaluation and performance analysis, and how we can speed up the optimization process: SimPoint, XSPerf, Top-Down and Constantin.

Let's start with checkpoint. Here's the story:
To evaluate performance, we usually run benchmark suites via simulation (e.g. software simulation using Verilator, or hardware-accelerated simulation using an FPGA or emulator).
However, existing approaches each have their own challenges:
- Software simulation is too slow. For a complex design like XiangShan, it can only run at a few kHz, so it takes too long to run a benchmark;
- FPGA has limited on-chip resources, making it difficult to use for complex designs like XiangShan;
- Emulators are too expensive for us, and probably for most of academia.
We have seen some works trying to accelerate software simulation or improve FPGA usability.
This is great work, but we think there's a much simpler way: Checkpointing.

Checkpointing simply means selecting some segments of a program's execution and saving the architectural state (i.e. registers and memory) at the beginning of each segment. Later, when we want to do performance evaluation, we can simply load the saved state and start simulation from there.
This brings 2 main benefits:
- This reduces the number of instructions that need to be simulated. We're not running the entire program from the start, but only some segments of it.
- Different segments from the same program can be simulated in parallel, thus increasing simulation parallelism.
By taking a weighted average of the performance data collected from each segment, we can estimate the overall performance.
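As a small worked example of the weighted average, the sketch below estimates the overall CPI from per-segment measurements. The weights and cycle counts are invented numbers. Since every segment here covers the same number of instructions, the weights are applied to CPI (cycles per instruction) and the IPC is derived from the result, rather than averaging IPC directly.

```python
def weighted_cpi(segments):
    """segments: list of (weight, instructions, cycles), one entry per checkpoint."""
    total_weight = sum(weight for weight, _, _ in segments)
    weighted = sum(weight * cycles / insts for weight, insts, cycles in segments)
    return weighted / total_weight

# Invented measurements for three checkpoints of one workload.
segments = [
    (0.5, 1_000_000, 1_000_000),  # CPI 1.0
    (0.3, 1_000_000, 2_000_000),  # CPI 2.0
    (0.2, 1_000_000, 500_000),    # CPI 0.5
]

cpi = weighted_cpi(segments)
print(f"estimated CPI = {cpi:.3f}, IPC = {1 / cpi:.3f}")  # estimated CPI = 1.200, IPC = 0.833
```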
This slide shows 2 common methods for selecting segments:
- Uniform sampling, i.e., selecting a segment every fixed number of instructions;
- SimPoint sampling, i.e., selecting segments that can represent the overall behavior of the program by profiling.
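To make SimPoint sampling more concrete, the following sketch clusters per-interval basic block vectors (BBVs) with a small k-means and picks, for each cluster, the interval closest to the centroid as its representative, weighted by cluster size. It is a simplified illustration in pure Python: the BBV numbers are invented, the initialization is a deterministic farthest-first choice, and the random projection and automatic selection of k performed by the real SimPoint tool are omitted.

```python
def normalize(vector):
    total = sum(vector)
    return [x / total for x in vector]

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=100):
    # Deterministic farthest-first initialization (a simplification).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist2(p, c) for c in centroids))
        )
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda j: dist2(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def pick_simpoints(bbvs, k):
    """Return [(representative interval index, weight)], one entry per cluster."""
    points = [normalize(v) for v in bbvs]
    labels, centroids = kmeans(points, k)
    result = []
    for j in range(k):
        members = [i for i, label in enumerate(labels) if label == j]
        if members:
            rep = min(members, key=lambda i: dist2(points[i], centroids[j]))
            result.append((rep, len(members) / len(points)))
    return sorted(result)

# Invented BBVs: execution counts of 3 basic blocks in 7 intervals.
bbvs = [
    [90, 10, 0], [80, 20, 0], [85, 15, 0],
    [0, 10, 90], [0, 30, 70], [86, 14, 0], [0, 20, 80],
]
simpoints = pick_simpoints(bbvs, k=2)
print(simpoints)
```

Here intervals 2 and 6 are selected, with weights 4/7 and 3/7: only two intervals need to be simulated in detail to estimate the behaviour of all seven.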
Next, we will demonstrate how SimPoint profiles a program, generates checkpoints, and runs simulations using checkpoints.
This section will use some paths and constants different from ../env.sh. For convenience, we have created a 01-env.sh. In this section, we will use this script to set environment variables. You can run the following cell to view these environment variables.
%%bash
source ../03-performance/01-env.sh
env | grep WORKLOAD= # workload to be simulated / profiled / checkpointed
env | grep CHECKPOINT_INTERVAL=
env | grep NEMU=
env | grep _HOME | tail
env | grep _PATH | tail
The first step of checkpointing is to compile the SimPoint tool and NEMU (in checkpoint mode), and generate a checkpoint restorer.
%%bash
source ../03-performance/01-env.sh
cd ${NEMU_HOME}
git submodule update --init
# Compile simpoint generator
cd ${NEMU_HOME}/resource/simpoint/simpoint_repo
make clean
make
# Compile NEMU in checkpoint mode
cd ${NEMU_HOME}
make clean
make riscv64-xs-cpt_defconfig
make -j8
# Generate checkpoint restorer for ${WORKLOAD}
cd ${NEMU_HOME}/resource/gcpt_restore
rm -rf ${GCPT_PATH}
make -C ${NEMU_HOME}/resource/gcpt_restore/ \
O=${GCPT_PATH} \
GCPT_PAYLOAD_PATH=$(get_asset workload/${WORKLOAD}.bin) \
CROSS_COMPILE=riscv64-linux-gnu-
SimPoint is used to select representative segments.
NEMU is used to profile the program and generate checkpoints.
The restorer acts like a bootloader, which loads the saved memory from simulated flash to main memory, and recovers registers.
Next, we need to run the program to be checkpointed using NEMU to collect program behavior for profiling.
%%bash
source ../03-performance/01-env.sh
rm -rf ${RESULT_PATH}
_LOG_PATH=${LOG_PATH}/profiling
mkdir -p ${_LOG_PATH}
${NEMU} ${GCPT} \
-w ${WORKLOAD} \
-D ${RESULT_PATH} \
-C profiling \
-b \
--simpoint-profile \
--cpt-interval ${CHECKPOINT_INTERVAL} \
> >(tee ${_LOG_PATH}/${WORKLOAD}-out.txt) 2> >(tee ${_LOG_PATH}/${WORKLOAD}-err.txt)
Then, use SimPoint to perform clustering analysis on the collected program behavior, selecting segments.
%%bash
source ../03-performance/01-env.sh
CLUSTER=${RESULT_PATH}/cluster/${WORKLOAD}
mkdir -p ${CLUSTER}
random1=`head -20 /dev/urandom | cksum | cut -c 1-6`
random2=`head -20 /dev/urandom | cksum | cut -c 1-6`
_LOG_PATH=${LOG_PATH}/cluster
mkdir -p ${_LOG_PATH}
${SIMPOINT} \
-loadFVFile ${PROFILING_RESULT_PATH}/${WORKLOAD}/simpoint_bbv.gz \
-saveSimpoints ${CLUSTER}/simpoints0 \
-saveSimpointWeights ${CLUSTER}/weights0 \
-inputVectorsGzipped \
-maxK 3 \
-numInitSeeds 2 \
-iters 1000 \
-seedkm ${random1} \
-seedproj ${random2} \
> >(tee ${_LOG_PATH}/${WORKLOAD}-out.txt) 2> >(tee ${_LOG_PATH}/${WORKLOAD}-err.txt)
Finally, use NEMU to rerun the program that needs to be checkpointed to generate checkpoint files.
%%bash
source ../03-performance/01-env.sh
CLUSTER=${RESULT_PATH}/cluster
_LOG_PATH=${LOG_PATH}/checkpoint
mkdir -p ${_LOG_PATH}
${NEMU} ${GCPT} \
-w ${WORKLOAD} \
-D ${RESULT_PATH} \
-C checkpoint \
-b \
-S ${CLUSTER} \
--cpt-interval ${CHECKPOINT_INTERVAL} \
> >(tee ${_LOG_PATH}/${WORKLOAD}-out.txt) 2> >(tee ${_LOG_PATH}/${WORKLOAD}-err.txt)
In the directory ${RESULT_PATH}/checkpoint/${WORKLOAD}, you can see the generated checkpoint files: one .gz file per cluster, with the weight of the checkpoint indicated in the file name.
%%bash
source ../03-performance/01-env.sh
find "${RESULT_PATH}/checkpoint/${WORKLOAD}" -type f -name "*_.gz" | tail
We can use emu to run one of the generated checkpoints and see the effect.
%%bash
source ../03-performance/01-env.sh
CHECKPOINT=$(find ${RESULT_PATH}/checkpoint/${WORKLOAD} -type f -name "*_.gz" | tail -1)
$(get_asset emu-precompile/emu) \
-i ${CHECKPOINT} \
--diff $(get_asset emu-precompile/riscv64-nemu-interpreter-so) \
--max-cycles=50000 \
2>/dev/null
When emu detects that the file is a gzip-compressed checkpoint, it will automatically decompress it and restore the memory state and architectural state from the checkpoint.
Performance counter in XiangShan¶
Purpose: Collect performance events for analysis and tuning
XSPerf:
- Accumulate:
if (valid) counter += diff;
- Histogram:
if (valid) distribution[value / step] += 1;
- Rolling:
if (valid) counters[segment] += diff; if (cycles++ == segment_size) { cycles = 0; segment++; }
While running benchmarks, we need to collect and record hardware behavior (performance events) for analysis and tuning.
In XiangShan RTL, we have implemented three types of performance counters (pseudocode shown above):
- Accumulate: Basic counter that accumulates whenever a performance event occurs;
- Histogram: Records the distribution of values when performance events occur;
- Rolling: Works like a segmented Accumulate-type counter, it tracks the changes in the number of performance events in each segment throughout the entire run.
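The three counter types can be modelled in software to see how they behave. The classes below are a behavioural sketch following the pseudocode above, not the Chisel implementation; the class and method names are chosen for this example, and each Rolling segment is taken to be exactly segment_size cycles long.

```python
from collections import defaultdict

class Accumulate:
    """Adds diff whenever the event is valid."""
    def __init__(self):
        self.counter = 0

    def update(self, valid, diff=1):
        if valid:
            self.counter += diff

class Histogram:
    """Counts how often a value falls into each bucket of width step."""
    def __init__(self, step):
        self.step = step
        self.distribution = defaultdict(int)

    def update(self, valid, value):
        if valid:
            self.distribution[value // self.step] += 1

class Rolling:
    """An Accumulate counter that starts a new segment every segment_size cycles."""
    def __init__(self, segment_size):
        self.segment_size = segment_size
        self.cycles = 0
        self.counters = [0]

    def update(self, valid, diff=1):
        # Call exactly once per simulated cycle.
        if valid:
            self.counters[-1] += diff
        self.cycles += 1
        if self.cycles == self.segment_size:
            self.cycles = 0
            self.counters.append(0)

# Invented per-cycle commit counts and a few invented latencies.
commits_per_cycle = [1, 2, 0, 1, 3, 3, 3, 3, 1]
total, rolling = Accumulate(), Rolling(segment_size=4)
for n in commits_per_cycle:
    total.update(n > 0, n)
    rolling.update(n > 0, n)

latency = Histogram(step=10)
for value in [3, 7, 12, 14, 31]:
    latency.update(True, value)

print(total.counter, rolling.counters, dict(latency.distribution))
```

The Accumulate counter only reports the total of 17 commits, while the Rolling counter shows that the second segment committed three times as many instructions as the first.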
Accumulate & Histogram¶
These two types of performance counters are printed to stderr when the simulation ends.
Example (Accumulate): the total number of instructions committed
def ifCommitReg(counter: UInt): UInt = Mux(isCommitReg, counter, 0.U)
XSPerfAccumulate("commitInstr", ifCommitReg(trueCommitCnt), XSPerfLevel.CRITICAL)
Example (Histogram): the distribution of L2 Cache acquire latency
XSPerfHistogram("acquire_period", acquire_period, acquire_period_en, 0, 30, 1, true, true)
XSPerfHistogram("acquire_period", acquire_period, acquire_period_en, 30, 100, 5, true, true)
XSPerfHistogram("acquire_period", acquire_period, acquire_period_en, 100, 200, 10, true, true)
%%bash
source ../env.sh
cd ${NOOP_HOME}
$(get_asset emu-precompile/emu) \
-i $(get_asset workload/hello-riscv64-xs.bin) \
--no-diff 2>stderr.log | tail
echo "=== Last 10 lines:"
tail -n 10 stderr.log
echo "=== Example of XSPerfAccumulate: rob commitInstr"
grep -n "rob: commitInstr," stderr.log | tail
echo "=== Example of XSPerfHistogram: l2cache acquire period"
grep -n "l2cache.slices_0.mshrCtl: acquire_period" stderr.log | tail
You can run this cell to see examples of XSPerf.
Rolling¶

This type of performance counter utilizes the ChiselDB framework introduced in 02-functional/04-chiseldb to store the collected data into a SQLite3 database file.
To enable RollingDB, you need to specify WITH_ROLLINGDB=1 during compilation and use the --dump-db parameter at runtime.
The previous two types of counters cannot reflect the characteristic differences of different segments during the execution of a program, so we may miss the impact of a certain microarchitecture modification on a specific segment (critical region). Therefore, we need rolling analysis.
⚠️Note: If you are reading this notebook on the tutorial demo server, please do not recompile XiangShan, as it will take a long time and consume a lot of computing resources.
%%bash
source ../env.sh
cd ${NOOP_HOME}
mkdir -p ${WORK_DIR}/03-performance/02-xsperf
# for tutorial: copy a pre-generated rolling db to tutorial dir
cp $(get_asset emu-perf-result/xs-perf-rolling.db) \
${WORK_DIR}/03-performance/02-xsperf/xs-perf-rolling.db
Here we just copy the pre-generated rolling db file to our workspace. In the code-server instance you are using now, you can see the commands used to generate it.
After obtaining the database file, we use a python script to analyze it.
In the following example, we use the rollingplot.py script to plot ipc data.
The IPC data is collected in XiangShan's RTL code as follows:
// every 1000 cycles
XSPerfRolling("ipc", ifCommitReg(trueCommitCnt), 1000, clock, reset)
%%bash
source ../env.sh
cd ${WORK_DIR}/03-performance/02-xsperf
# Use python scripts to analyze the rolling db, for example, plot ipc
python3 ${NOOP_HOME}/scripts/rolling/rollingplot.py \
./xs-perf-rolling.db \
ipc
ls -lh ${WORK_DIR}/03-performance/02-xsperf/results/perf.png
The script outputs the following image, showing the IPC changes of XiangShan over time while running this program:

(at ../work/03-performance/02-xsperf/results/perf.png)
If the image does not load correctly, you can try closing the notebook and reopening it.
Top-Down¶
Purpose: Organize perf events in hierarchical form
[1]: Yasin A. A top-down method for performance analysis and counters architecture[C]//2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 2014: 35-44.

Top-Down is a common performance analysis method that organizes fragmented performance events into a hierarchical form to more accurately analyze the impact of individual performance events on overall processor performance.
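As an illustration of the hierarchy's first level, the sketch below computes the four top-level categories using the generic formulas from Yasin's paper [1]. The counter names and sample values are placeholders; XiangShan's own Top-Down counters are defined differently and tailored to its microarchitecture.

```python
def topdown_level1(cycles, width, fetch_bubbles, issued, retired, recovery_cycles):
    """Level-1 Top-Down breakdown in pipeline slots (generic form from [1])."""
    total_slots = cycles * width
    frontend_bound = fetch_bubbles / total_slots
    bad_speculation = (issued - retired + width * recovery_cycles) / total_slots
    retiring = retired / total_slots
    backend_bound = 1.0 - frontend_bound - bad_speculation - retiring
    return {
        "frontend_bound": frontend_bound,
        "bad_speculation": bad_speculation,
        "retiring": retiring,
        "backend_bound": backend_bound,
    }

# Invented counter values for a 4-wide machine running for 1000 cycles.
breakdown = topdown_level1(
    cycles=1000, width=4, fetch_bubbles=800,
    issued=2200, retired=2000, recovery_cycles=50,
)
for name, share in breakdown.items():
    print(f"{name:16s} {share:6.1%}")
```

With these numbers, half of the slots retire useful work, and the remaining slots are attributed to the frontend (20%), bad speculation (10%) and the backend (20%).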
Based on the XSPerfAccumulate introduced in the 02-xsperf section, we have implemented a set of Top-Down counters optimized for the XiangShan microarchitecture and RISC-V instruction set in RTL to help us better model the XiangShan microarchitecture and align it with XS-GEM5.
In the ${NOOP_HOME}/scripts/top-down directory, we have also implemented some analysis scripts that you can run to extract Top-Down results, plot graphs, etc.
⚠️Note: If you are reading this notebook on the tutorial demo server, please do not run the analysis scripts as they will perform a large amount of disk access.
%%bash
source ../env.sh
mkdir -p ${WORK_DIR}/03-performance/03-topdown
cd ${WORK_DIR}/03-performance/03-topdown
# for tutorial: copy analysis results
cp -r $(get_asset emu-spec-topdown-result/results) ./
echo === results ===
ls ./results
echo === first 10 lines of results.csv ===
head -n 10 ./results/results.csv
echo === first 10 lines of results-weighted.csv ===
head -n 10 ./results/results-weighted.csv
Again, we just use the pre-generated results here.
The script outputs the following image:

(at ../work/03-performance/03-topdown/results/result.png)
If the image does not load correctly, you can try closing the notebook and reopening it.
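To illustrate what "hierarchical form" means in this section, the following Python sketch arranges a few counters into a tree and reports each node's share of all pipeline slots. The category names follow the general Top-Down method in [1]; the slot counts are made up, and neither the names nor the layout are taken from XiangShan's RTL counters or analysis scripts.

```python
# Hypothetical hierarchy and slot counts; not XiangShan's actual counters or data.
tree = {
    "Retiring": 400,
    "BadSpeculation": 100,
    "FrontendBound": {"FetchLatency": 120, "FetchBandwidth": 80},
    "BackendBound": {"MemoryBound": 200, "CoreBound": 100},
}


def total(node):
    """Sum the slots under a node (a leaf is an int, an inner node is a dict)."""
    if isinstance(node, dict):
        return sum(total(child) for child in node.values())
    return node


def breakdown(node, all_slots, prefix=""):
    """Flatten the tree into {path: fraction of all slots}."""
    shares = {}
    for name, child in node.items():
        path = prefix + name
        shares[path] = total(child) / all_slots
        if isinstance(child, dict):
            shares.update(breakdown(child, all_slots, path + "/"))
    return shares


shares = breakdown(tree, total(tree))
for path, share in shares.items():
    print(f"{path:32s} {share:6.1%}")
```

Because every child's share is part of its parent's share, a large top-level category can be drilled into level by level instead of comparing unrelated counters directly.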
Sometimes we want to test performance under different parameters.
We could use a cycle-accurate simulator, but since it may not be 100% accurate, we sometimes want to test directly on RTL.
However, recompiling every time we adjust only one or two parameters is very time-consuming, even though most of the RTL is unchanged. Is there any way to change parameters without recompiling?
We present Constantin, which is based on the DPI-C interface and uses C++ functions and Chisel's BlackBox mechanism to configure parameters during runtime initialization.
Replacing a Scala parameter with Constantin looks roughly like this:
/* *** w/o Constantin *** */
val enableSomeModule = WireInit(false.B) // change to true.B, re-compile and re-run
/* *** w/ Constantin *** */
// in RTL
val enableSomeModule = WireInit(Constantin.createRecord("enableSomeModule", initValue = false))
// in constantin.txt
enableSomeModule 0 // change to 1, re-run
To enable Constantin, you need to use the WITH_CONSTANTIN=1 option when compiling emu.
Currently, the Constantin configuration file cannot be passed as an emu argument; it must be located at ${NOOP_HOME}/build/constantin.txt.
The following example uses Constantin to switch the branch predictor on and off. You can run this cell to compare the results of the two runs.
%%bash
source ../env.sh
mkdir -p ${NOOP_HOME}/build
# run with default parameter
rm -f ${NOOP_HOME}/build/constantin.txt || true
$(get_asset emu-precompile/emu-constantin) \
-i $(get_asset workload/coremark-2-iteration.bin) \
-C 10000 \
--no-diff \
2>/dev/null
# run with Bpu turned off (falls back to static not-taken prediction)
echo "enableUbtb 0" > ${NOOP_HOME}/build/constantin.txt
$(get_asset emu-precompile/emu-constantin) \
-i $(get_asset workload/coremark-2-iteration.bin) \
-C 10000 \
--no-diff \
2>/dev/null
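The two runs above differ only in the contents of constantin.txt. As a minimal sketch, a small Python helper could generate that file from a dictionary. The helper name `write_constantin` is hypothetical, and writing several parameters as one "name value" pair per line is an assumption extrapolated from the single-line examples in this section.

```python
import os
import tempfile


def write_constantin(path, params):
    """Write one "name value" pair per line (assumed format, see lead-in)."""
    with open(path, "w") as f:
        for name, value in params.items():
            f.write(f"{name} {int(value)}\n")


# Written to a temporary directory here so the sketch is side-effect free;
# the tutorial expects ${NOOP_HOME}/build/constantin.txt instead.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "constantin.txt")
write_constantin(demo_path, {"enableUbtb": 0})

with open(demo_path) as f:
    content = f.read()
print(content, end="")
```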
Autosolving¶
Purpose: automatically find the best parameter configuration under the current microarchitecture.
Steps:
- Enable in Constantin configuration (refer to ./04-autosolving.patch);
- Compile emu with WITH_CONSTANTIN=1;
- Provide config file.
  - Parameter name, bit width, initial value.
  - Performance counter name, optimization strategy.
  - Workload, etc.
We also implemented Autosolving for Constantin, which can automatically find the best parameter configuration under the current microarchitecture.
To use Autosolving, you need to enable it in the Constantin configuration (refer to ./04-autosolving.patch), and compile emu with WITH_CONSTANTIN=1.
After enabling Autosolving, emu will read the Constantin configuration from stdin instead of a txt file, allowing us to use our python script to automatically run emu with specific configurations and try to find the optimal configuration.
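Reading the configuration from stdin is what lets a script drive many runs without touching files. The sketch below shows the general pattern with Python's `subprocess.run`. It is not taken from constantHelper.py: the function name `run_with_config` is hypothetical, the "name value" line format on stdin is an assumption, and a small child Python process stands in for emu so the example runs anywhere.

```python
import subprocess
import sys


def run_with_config(cmd, params):
    """Run `cmd`, send "name value" lines on its stdin, and return its stdout."""
    text = "".join(f"{name} {value}\n" for name, value in params.items())
    result = subprocess.run(
        cmd, input=text, capture_output=True, text=True, check=True
    )
    return result.stdout


# Stand-in for emu (assumption): a child process that echoes what it reads from stdin.
fake_emu = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"]
out = run_with_config(fake_emu, {"enableUbtb": 1})
print(out, end="")
```

In a real search loop, the returned output would be parsed for the performance counter of interest after each run.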
You need to provide a configuration file for the script (refer to 04-autosolving-config.json), including:
- Descriptions of configurable parameters
  - Parameter name
  - Bit width
  - Initial value
- Optimization goals
  - Performance counter name
  - Strategy (minimize, maximize)
  - Baseline
- Genetic algorithm parameters
- emu running parameters
  - Workload
  - Maximum number of instructions
  - Number of threads
Our script uses a genetic algorithm for parameter exploration, and it is also easy to implement other algorithms (such as ant colony or particle swarm optimization).
%%bash
source ../env.sh
echo === Patch ===
cat ./04-autosolving.patch
echo === Config ===
cat ./04-autosolving-config.json
echo === Run ===
mkdir -p ${NOOP_HOME}/build
cp $(get_asset emu-precompile/emu-autosolving) ${NOOP_HOME}/build/emu
python3 ${NOOP_HOME}/scripts/constantHelper.py ./04-autosolving-config.json
You can run this cell to see how Autosolving works.
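For readers unfamiliar with genetic algorithms, the following Python sketch shows the general shape of such a search over integer parameters bounded by their bit widths. It is a generic illustration, not the implementation in constantHelper.py: the parameter names `threshold` and `ways`, the toy scoring function (which stands in for "run emu and read a performance counter"), and all tuning values are assumptions.

```python
import random


def evolve(widths, fitness, population=20, generations=30, mutation_rate=0.2, seed=0):
    """Maximize `fitness` over integer parameters bounded by their bit widths."""
    rng = random.Random(seed)
    names = list(widths)

    def random_individual():
        return {n: rng.randrange(1 << widths[n]) for n in names}

    def crossover(a, b):
        # Uniform crossover: each parameter is taken from one of the two parents.
        return {n: (a if rng.random() < 0.5 else b)[n] for n in names}

    def mutate(ind):
        # Re-draw each parameter with probability `mutation_rate`.
        return {
            n: rng.randrange(1 << widths[n]) if rng.random() < mutation_rate else v
            for n, v in ind.items()
        }

    pop = [random_individual() for _ in range(population)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: population // 2]  # keep the better half unchanged
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents)))
            for _ in range(population - len(parents))
        ]
        pop = parents + children
    return max(pop, key=fitness)


# Toy stand-in for a performance counter: the score peaks at threshold=11, ways=3.
def toy_score(cfg):
    return -((cfg["threshold"] - 11) ** 2) - (cfg["ways"] - 3) ** 2


widths = {"threshold": 7, "ways": 2}  # hypothetical parameters: name -> bit width
best = evolve(widths, toy_score)
print(best, toy_score(best))
```

Swapping in another strategy mostly means replacing how `children` are produced, which is why other population-based methods are easy to plug into the same loop.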
