mla-rt User Guide
mla-rt is the MLA (Machine Learning Accelerator) runtime CLI. It loads a compiled model
(*.elf or *.lm), runs it on the MLA hardware, and reports performance and correctness
statistics.
Overview and deployment
mla-rt is an aarch64 binary that runs on an MLA-equipped board. Every supported board
ships mla-rt pre-installed at /usr/bin/mla-rt, so after connecting via SSH
(ssh root@<board>), you can run it directly with no setup. This is the right choice for
routine model bring-up, performance measurements, and CI-style runs.
A typical session uses an SSH alias or hostname for the board, for example
ssh root@modalix.
Models live at /mnt/mla_test_runner_models/..., for example
/mnt/mla_test_runner_models/ma/2025-05-05-regression-tests/image_classification/resnet_50/elffiles/resnet_50_stage1_mla.elf.
A model directory typically contains:
| File | Purpose |
|---|---|
| *.elf | Compiled MLA binary (the model) |
| *.ifm.mlc | Reference IFM (input feature map) |
| *.ofm_chk.mlc | OFM checkfile (golden output for validation) |
| | The model in MLC text format (not frame data) |
| *_stats.yaml | Reference stats |
Note
The recipes on this page were captured on a Modalix (MA) board, so DaVinci-only flags
(-X F:0xN:0xN) and Zebu/VDK modes do not apply there.
Quick start
modalix:~$ ./mla-rt -vss /mnt/.../resnet_50_stage1_mla.elf
[ 0/ 1]setup_model /mnt/.../resnet_50_stage1_mla.elf
loadDramSegments:dt: 0.009s dramB=25342080 mbps=2884.039
Total Elapsed: time 82.31ms
Model stats: /mnt/.../resnet_50_stage1_mla.elf
output: dt:usec precision:0
tile.lat| api.lat| model.lat| relo.lat| run.lat| runmsg.lat| init.lat| dram.MB/s|
----------------------------------------------------------------------------------------------------
676| 709| 698| 10| 693| 8| 7240| 2884|
----------------------------------------------------------------------------------------------------
-vss is -v (verbose) with the sub-argument ss (stats level 2). Without a stats
sub-argument, the table is suppressed and only the setup_model line is printed.
-vs enables the table; -vss adds the runmsg.lat, init.lat, and dram.MB/s
columns.
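For scripted runs it can help to turn the printed table into a dictionary. The following is a minimal sketch, assuming the single-run layout shown above (one header row, one numeric row, cells separated by `|`); `parse_stats_table` is a hypothetical helper, not part of mla-rt.

```python
def parse_stats_table(text):
    """Return {column: value} for a single-run stats table (hypothetical helper)."""
    header = None
    for line in text.splitlines():
        cells = [c.strip() for c in line.split("|")]
        cells = [c for c in cells if c]
        if header is None:
            # The header row is the first '|'-separated line that names api.lat.
            if "api.lat" in cells:
                header = cells
            continue
        if len(cells) == len(header):
            try:
                return dict(zip(header, [float(c) for c in cells]))
            except ValueError:
                continue  # not a numeric row
    raise ValueError("no single-run stats table found")


SAMPLE = """\
tile.lat| api.lat| model.lat| relo.lat| run.lat| runmsg.lat| init.lat| dram.MB/s|
----------------------------------------------------------------------------------------------------
676| 709| 698| 10| 693| 8| 7240| 2884|
"""

stats = parse_stats_table(SAMPLE)
# Overhead of the API call on top of the model latency, in the table's time unit.
print(stats["api.lat"] - stats["model.lat"])
```

Multi-run tables (rows prefixed with avg, min, max, std) are deliberately not handled by this sketch.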
The performance table
Column meanings with their respective units:
| Column | Unit | Meaning |
|---|---|---|
| tile.lat | nsec | Aggregate tile execution time reported by the MLA hardware |
| api.lat | µsec | End-to-end runtime of the user-facing API call |
| model.lat | µsec | Time spent in |
| gap.lat | µsec | Multi-run only: gap between successive model invocations |
| relo.lat | µsec | Relocation processing |
| run.lat | µsec | Pure model execution (M4-side) |
| runmsg.lat | µsec | Time to send the run message to M4. Shown when stats >= 2 |
| init.lat | µsec | One-time MLA initialisation. Shown when stats >= 2 |
| dram.MB/s | MB/sec | DRAM bandwidth measured during model load |
For multi-run (-m N > 1), each column gets four rows: avg, min, max, std
(see the R5 recipe below).
The default time scale is microseconds with precision 0 (output: dt:usec precision:0).
Override with the :nsec|:usec|:msec|:sec and :N (digits) suffixes on -v, for example
-vs:msec:3 shows stats in milliseconds with 3 decimals.
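As a worked example of the unit and precision suffixes, the sketch below rescales a value recorded in microseconds. It assumes plain decimal scaling and fixed-point rounding; `scale_latency` is a hypothetical helper and the exact rounding mla-rt applies is not confirmed here.

```python
# Multipliers that convert a value in microseconds to each supported unit.
_PER_USEC = {"nsec": 1000.0, "usec": 1.0, "msec": 1e-3, "sec": 1e-6}


def scale_latency(value_usec, unit="usec", precision=0):
    """Format a microsecond value in the requested unit (hypothetical helper)."""
    return f"{value_usec * _PER_USEC[unit]:.{precision}f}"


# api.lat from the quick-start table, as -vs:msec:3 would present it.
print(scale_latency(709, "msec", 3))
```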
Connection modes (-a / --connect)
Default: scans /etc/buildinfo on the host to pick the build target. On aarch64 boards, this
normally selects a65 (the on-board ARM CPU).
Comma-separated tokens; usage menu:
-a [a65|dv|davinci|ma|michelangelo|mod|test|zebu1|zebu2|sw_mbox|dram|sdram|block|enable|disable|mip|vdk]
Architecture options
| Option | Meaning | Where it works |
|---|---|---|
| dv, davinci | DaVinci silicon | DV boards (e.g., older mla-hl* generation) |
| ma, michelangelo, mod | Modalix silicon | MA boards |
| test | Loopback / no hardware | Anywhere |
Connection / transport options
| Option | Meaning |
|---|---|
| a65 | On-board A65 ARM (default for aarch64 hosts) |
| sw_mbox | Software mailbox transport |
| zebu1, zebu2 | Zebu emulator (network access required) |
| vdk | Virtual development kit |
| mip | MIP transport mode |
Memory allocator options
Or use -B / --mem-lib.
| Option | Meaning |
|---|---|
| dram | DRAM allocator |
| sdram | SDRAM allocator (default) |
| block | Block allocator |
Cache options
| Option | Meaning |
|---|---|
| enable | Enable cache |
| disable | Disable cache |
Examples:
-a dv,a65,sw_mbox # Arch=DaVinci, Connect A65, use software mailbox
-a mod,zebu1 # Arch=Modalix, Connect to <zebu-emulator-server>
-a mod,block,disable # Arch=Modalix, use block allocator, disable cache
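When -a strings are generated by a script, a quick pre-check against the usage menu catches typos before the board is involved. This is a minimal sketch; `split_connect` is a hypothetical helper, and it only checks token spelling, not whether a combination is meaningful.

```python
# Tokens copied from the -a usage menu shown above.
CONNECT_TOKENS = frozenset(
    "a65|dv|davinci|ma|michelangelo|mod|test|zebu1|zebu2|sw_mbox|"
    "dram|sdram|block|enable|disable|mip|vdk".split("|")
)


def split_connect(arg):
    """Split a comma-separated -a value and reject unknown tokens (hypothetical helper)."""
    tokens = [t.strip() for t in arg.split(",") if t.strip()]
    unknown = [t for t in tokens if t not in CONNECT_TOKENS]
    if unknown:
        raise ValueError(f"unknown -a token(s): {', '.join(unknown)}")
    return tokens


print(split_connect("mod,block,disable"))
```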
Complete flag reference
Sorted by short flag (uppercase before lowercase).
Arg column: none = no argument, optional = optional argument, req = required
argument.
| Short | Long | Arg | Default | Description |
|---|---|---|---|---|
| | | req | (none) | Compute output labels via softmax against |
| -a | --connect | req | (auto) | Architecture / transport / allocator (see Connection modes) |
| -B | --mem-lib | req | | Choose memory allocator (alternative to the -a allocator tokens) |
| -b | --ctrl-c | req | | Ctrl-C handling. See -hc appendix |
| -C | --cycle-count | req | | Set N bytes of m4.sram to zero (cycle-count instrumentation) |
| -c | --chk | none | off | Validate OFM rows against the checkfile. See Checkfile validation |
| | | none | off | Enter |
| | | req | | M4 driver binary (required for Q0 / Q0123 quad configs) |
| -e | | req | (none) | Add performance counter |
| -f | | none | off | Section checksum validation (slow: roughly 10× longer total run, see R25) |
| | | none | DTB check ON | Skip |
| | | req | | DMA timeout threshold and factor |
| | | req | (none) | |
| -h | | optional | - | Help. |
| | | req | (off) | Software interleaving. |
| | | none | off | Continue past errors |
| | | req | | Per-run model timeout in milliseconds (Zebu uses this) |
| | | req | (none) | OCM placement ( |
| | | req | (none) | Add a relocation: create DramSection |
| | | req | (none) | Use |
| -L | | none | off | Connect, dump M4 log to stdout, exit (no model load) |
| -l | | none | off | Enable M4 logging during the run. Note: writes dump.log in the current directory |
| -M | --stats | req | (none) | Append YAML stats to FILE |
| -m | | req | 1 | Run model N times |
| | | none | M4 reset ON | Skip M4 reset before run |
| | | none | sharing ON | Disable sharing of duplicate instructions across models (used for LLMs) |
| | --ifm | req | (none) | Construct an IFM segment named |
| -o | --ofm | req | (none) | Construct an OFM segment. |
| -P | | req | (none) | Process MLAPolicies args. |
| -p | | req | (none) | Per-layer latency YAML (input). Output written to <input-base>_output.yaml |
| -Q | | none | off | Use DDR performance counters (R22: model.lat increased from about 700 to 914 µsec due to counter overhead) |
| -q | (none) | none | off | Quiet output (alias of -vq) |
| -r | | req | (none) | Read DRAM/L1/L2/SRAM/OCM. See Memory access |
| | | req | | Run only segment |
| | | none | off | Dump stats. Use |
| | | req | (none) | Write address/value pairs to YAML file |
| | | none | off | Skip default IFM/OFM allocation (used with GST APIs). Auto-disabled if |
| -V | | none | - | Print git revision and exit |
| -v | | optional | (off) | Logger options. See Verbosity |
| | | none | rproc ON | Disable remoteproc subsystem for M4 control |
| -w | | req | (none) | Write to DRAM/L1/L2/SRAM/OCM. See Memory access |
| -X | | req | (none) | Diagnostic arg ( |
| | | req | | |
| -Y | | none | off | Parse and allocate, do not run |
| -y | --yaml-config | req | (none) | M4 debug config (logLevel, fifoLevel, gpio/uart/profile enables). See M4 debug config |
| | | req | | DMA burst size (Modalix-only) |
| -z | | none | off | Use ev74 (target generic) memory allocator. Note: the parser tags this as a required argument but uses no value; pass any token, e.g., -z 1 |
Long-only flag
| Long | Description |
|---|---|
| | DEPRECATED, replaced by |
Positional arguments
After the option list, any remaining tokens are interpreted by extension:
- *.elf or *.lm: executable model files (multiple allowed)
- everything else: frame data files
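The extension rule can be sketched as a small classifier, which is handy when a wrapper script assembles the positional list. `classify_positionals` is a hypothetical helper that only mirrors the rule stated above; it does not reproduce mla-rt's own parsing.

```python
from pathlib import Path

# Extensions treated as executable model files, per the rule above.
MODEL_SUFFIXES = {".elf", ".lm"}


def classify_positionals(args):
    """Split positional tokens into (models, frame_files) by extension (hypothetical helper)."""
    models = [a for a in args if Path(a).suffix in MODEL_SUFFIXES]
    frames = [a for a in args if Path(a).suffix not in MODEL_SUFFIXES]
    return models, frames


models, frames = classify_positionals(["a.elf", "b.lm", "frame0.bin"])
print(models, frames)
```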
Verbosity and logging (-v, -l, -L, -q)
-v takes an optional argument. The argument is parsed by the logger. Common forms:
| Form | Effect |
|---|---|
| -v | Bare verbose; turns on default channels |
| -vs | Stats level 1 (basic table) |
| -vss | Stats level 2 (extra columns: runmsg.lat, init.lat, dram.MB/s) |
| -vc | Check-file logging |
| -vq | Quiet (same as -q) |
| -vsr | Stats + relocation processing |
| -vs:msec:3 | Time scale = ms, precision = 3 digits |
| -vsw=0,m=0x7 | Stats on, warnings off, m4 driver mask = 0x7 |
| -vP | Policy log output |
| -vz | Set ALL flags |
The full grammar is -vXYZ where each letter is a one-char channel. See the
-hl appendix for the channel list.
-l enables the M4 driver log (collected during run). -L connects, dumps the log to
stdout, and exits (no load). -q is shorthand for -vq.
Memory access (-r / -w)
General forms (see -hm appendix):
-r EXPR[:N][:FILE][:FMT] # read
-w EXPR:N[:DATA|FILE] # write
EXPR may be:
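| Form | Meaning |
|---|---|
| addr | DRAM address |
| sym | A model symbol (ofm, ifm, lora, ...) |
| sram(N) | M4 SRAM bank N |
| l1(x,y,addr) | L1 at tile (x,y) |
| l2(unit,addr) | L2 (e.g., l2(n0,0x100)) |
| m4(sym) | M4 driver symbol |
| m4.list | Print all M4 driver symbols |
| * | All symbols in the symbol table |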
Suffixes for N and addresses: k = 1024, m = 1048576, r = 16 (one row).
FMT is %<FILL><TYPE><TSIZE><MODS>:
- TYPE: a (dma), b (binary), c (char), d (decimal), f (float), o (octal), s (string), t (tile-instr), u (unsigned), x (hex)
- TSIZE: 1, 2, 4, 8 byte width
- MODS: n (no address prefix), r (reverse 16-byte row), m (m4_sym), @N (N bytes per row)
- Special: FILE=sha1 produces a SHA-1 hash
Examples:
-r ifm.b0:32:%d1@16 # 32 B of IFM as decimal, 16 per row
-r ofm.b0::%b:/tmp/ofm.bin # all OFM as binary to file
-w 0x30:8:0xdeadbeef # (write form) 8 B of 0xdeadbeef at 0x30
-w sram(0)::FILE # write FILE into sram(0)
-w 0x1d0000:32k:-1 # fill 32 KB at 0x1d0000 with -1
-w l2(n0,0x100):10r::0,1,2,3,4,5 # write 10 rows of literal data
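The size suffixes used in these examples (k = 1024, m = 1048576, r = 16) reduce to simple arithmetic. A minimal sketch, assuming decimal or 0x-prefixed numbers; `parse_size` is a hypothetical helper and does not cover every form the mla-rt parser may accept.

```python
# Suffix multipliers as documented: k = 1024, m = 1048576, r = 16 (one row).
_SUFFIX = {"k": 1024, "m": 1048576, "r": 16}


def parse_size(token):
    """Convert a size or address token such as '32k' or '10r' to bytes (hypothetical helper)."""
    token = token.strip()
    factor = 1
    if token[-1:] in _SUFFIX:
        factor = _SUFFIX[token[-1]]
        token = token[:-1]
    # Base 0 accepts decimal and 0x-prefixed hexadecimal.
    return int(token, 0) * factor


print(parse_size("32k"), parse_size("10r"), parse_size("0x1d0000"))
```

So the `-w 0x1d0000:32k:-1` example fills 32768 bytes, and `10r` in the l2 example covers 160 bytes.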
Performance counters (-e)
Up to 4 counters per run (exits if exceeded). Counter names:
| ID | Name |
|---|---|
| 0 | tiled_lat |
| 1 | tile_lat |
| 2 | tile_active |
| 3 | |
| 4 | |
| 5 | sync_frz |
| 6 | iq_sync_frz |
| 7 | d_sync_frz |
All perf counters report in nanoseconds. Each counter adds a column perf.id N with the
counter's name. The values in the table are scaled to the active time unit (default µsec).
Common combination: -e2 -e5 -e6 -e7
The most useful set for understanding where MLA cycles are spent: tile activity
(tile_active) plus the three freeze counters (sync_frz, iq_sync_frz,
d_sync_frz). With those, tile_active + freeze ≈ tile.lat.
modalix:~$ ./mla-rt -vss -m 3 -e2 -e5 -e6 -e7 $MODEL
Perf counter 2 source "tile_active" added.
Perf counter 5 source "sync_frz" added.
Perf counter 6 source "iq_sync_frz" added.
Perf counter 7 source "d_sync_frz" added.
...
perf.id 2| perf.id 5| perf.id 6| perf.id 7|
tile.lat| api.lat| model.lat| gap.lat| relo.lat| run.lat| runmsg.lat| init.lat| dram.MB/s|tile_active| sync_frz|iq_sync_frz| d_sync_frz|
----------------------------------------------------------------------------------------------------------------------------------------------------------------
avg 676| 701| 693| 6| 7| 689| 7| 6094| 2890| 494| 182| 3| 182|
min 676| 686| 683| 4| 1| 682| 4| 4040| 2890| 494| 182| 3| 182|
max 677| 708| 696| 8| 9| 691| 8| 7121| 2890| 494| 183| 3| 183|
std nan| 9| 6| 3| 3| 4| 2| 1266| 0| 0| nan| 0| nan|
Reading the avg row above: tile_active=494µs + sync_frz=182µs ≈ tile.lat=676µs.
The tile spent ~73 % executing and ~27 % stalled on sync.
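The percentages follow directly from the avg row. A short sketch of the arithmetic, using the values printed above; `tile_breakdown` is a hypothetical helper.

```python
def tile_breakdown(tile_active, sync_frz, tile_lat):
    """Share of tile.lat spent executing vs. stalled on sync (hypothetical helper)."""
    return {
        "active_pct": 100.0 * tile_active / tile_lat,
        "stalled_pct": 100.0 * sync_frz / tile_lat,
        # Whatever tile.lat is not explained by the two counters.
        "unaccounted": tile_lat - tile_active - sync_frz,
    }


# avg row: tile_active=494, sync_frz=182, tile.lat=676 (all in µsec)
breakdown = tile_breakdown(494, 182, 676)
print(breakdown)
```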
Counter id 0 / 1 example
modalix:~$ ./mla-rt -vss -m 3 -e0 -e1 $MODEL
tile.lat| api.lat| model.lat| gap.lat| relo.lat| run.lat| runmsg.lat| init.lat| dram.MB/s| tiled_lat| tile_lat|
avg 676| 703| 694| 6| 7| 690| 7| 6614| 2916| 676| 676|
For DDR-side counters use -Q.
Checkfile validation (-c / -vc)
-c (or --chk) asks mla-rt to compare the model's OFM against a golden checkfile and
report PASSED/FAILED. For the comparison to be meaningful, you must:
- Pass the matching IFM with --ifm <SYM>:<FILE> so the model runs on the input the checkfile was produced for.
- Tell mla-rt where the OFM lives (--ofm <SYM>::<CHK>); the empty FILE field means "no init data".
- Use the symbol names declared in the model. For most regression-test models these are ifm.b0 and ofm.b0.
If you specify --checkfile FILE instead of --ofm <SYM>::<CHK>, -c is auto-enabled and
the checkfile is bound to the default OFM symbol, but you still need --ifm to feed the right
input; otherwise the model runs against a zero-filled IFM and reports FAILED (this is what R6
in Recipes shows).
-vc (or -vssc) turns on check-file logging, which prints the address range checked and
an "N rows compared correctly, M rows were bad" summary in addition to the PASSED/FAILED line.
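For CI wrappers, the three requirements above can be captured in one place. The sketch below only assembles an argument list from the flags described in this section; `build_check_cmd` and its parameter names are hypothetical, and the default symbols are the ifm.b0 / ofm.b0 names used by most regression-test models.

```python
import shlex


def build_check_cmd(model, ifm, chk, ifm_sym="ifm.b0", ofm_sym="ofm.b0",
                    binary="./mla-rt"):
    """Assemble an argv list for a checkfile run (hypothetical helper)."""
    return [
        binary, "-c", "-vssc",
        "--ifm", f"{ifm_sym}:{ifm}",    # matching input for the checkfile
        "--ofm", f"{ofm_sym}::{chk}",   # empty FILE field: no init data
        model,
    ]


cmd = build_check_cmd("model.elf", "model.ifm.mlc", "model.ofm_chk.mlc")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` on the board; the file names here are placeholders.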
Recipe: PASS with -c -vss
modalix:~$ DIR=/mnt/.../resnet_50/elffiles
modalix:~$ ./mla-rt -c -vss --ifm ifm.b0:$DIR/resnet_50_stage1_mla.ifm.mlc \
--ofm ofm.b0::$DIR/resnet_50_stage1_mla.ofm_chk.mlc \
$DIR/resnet_50_stage1_mla.elf
[ 0/ 1]setup_model /mnt/.../resnet_50_stage1_mla.elf
loadDramSegments:dt: 0.009s dramB=25342080 mbps=2872.757
Test:: ofm.b0::PASSED
Total Elapsed: time 205.43ms
tile.lat| api.lat| model.lat| relo.lat| run.lat| runmsg.lat| init.lat| dram.MB/s|
675| 1250| 710| 22| 692| 9| 7360| 2873|
api.lat jumps from ~700 µsec (without -c) to ~1250 µsec (with -c): the comparison
takes time and is included in the API latency.
Recipe: PASS with -c -vssc (verbose check)
modalix:~$ ./mla-rt -c -vssc --ifm ifm.b0:$DIR/resnet_50_stage1_mla.ifm.mlc \
--ofm ofm.b0::$DIR/resnet_50_stage1_mla.ofm_chk.mlc \
$DIR/resnet_50_stage1_mla.elf
...
checkDramSection: DramSection: ofm.b0:address:0x1182040000..0x11820403f0:size: 0x3f0:zone:0x5: mlaZoneOFM:model:/mnt/.../resnet_50_stage1_mla.elf:refcnt:0x1
63 rows compared correctly, 0 rows were bad.
Test:: ofm.b0::PASSED
The two extra lines (one per checked DramSection, plus the row count) make -vssc the right
verbosity when debugging a FAIL.
Per-layer stats (-p)
-p FILE runs the model layer by layer and emits per-layer latency and perf-counter data.
The input file is a YAML list of layers with name, start_cycle, end_cycle. Most
regression-test models ship one alongside the elf, for example
resnet_50_stage1_mla_stats.yaml.
Output is written to <input-base>_output.yaml (the input is not overwritten). For
example, passing /tmp/layers.yaml produces /tmp/layers_output.yaml. Pre-existing
_output.yaml is overwritten.
-p is mutually exclusive with -C / --cycle-count; combining them is fatal at
parse time.
Input file format
0:
name: MLA_0/placeholder_0_0
start_cycle: 0
end_cycle: 0
1:
name: MLA_0/conv2d_add_relu_0
start_cycle: 0
end_cycle: 28435
2:
name: MLA_0/max_pool2d_1
start_cycle: 28436
end_cycle: 31265
...
Index 0 is a placeholder and is skipped. Layers are run in order; cycle ranges drive how long
each layer is allowed to execute.
Recipe
Note
Always pass a copy of the stats yaml. Even though mla-rt writes to a separate
_output.yaml, copying first protects against accidents (e.g., invoking with
-p $MODEL_DIR/foo_stats.yaml from a writable mount).
modalix:~$ cp $DIR/resnet_50_stage1_mla_stats.yaml /tmp/layers_in.yaml
modalix:~$ ./mla-rt -vss -p /tmp/layers_in.yaml \
--ifm ifm.b0:$DIR/resnet_50_stage1_mla.ifm.mlc \
$DIR/resnet_50_stage1_mla.elf
[ 0/ 1]setup_model /mnt/.../resnet_50_stage1_mla.elf
loadDramSegments:dt: 0.009s dramB=25342080 mbps=2887.514
Per layer stats output file name:
/tmp/layers_in_output.yaml
Sum perf ctr run_time is: 728.98us
Sum m4 systick latency is: 729.63us
Total Elapsed: time 353.83ms
Output file format (/tmp/layers_in_output.yaml)
Each layer gets the cycle range it ran with plus five timing fields:
# Date created: 2026-04-29 22:08:16
1:
name: MLA_0/conv2d_add_relu_0
start_cycle: 0
end_cycle: 28435
layer_latency: 44.84us
run_time: 43.60us
active_time: 28.43us
l2_freeze: 14.88us
iq_freeze: 3.50us
2:
name: MLA_0/max_pool2d_1
...
| Field | Source | Meaning |
|---|---|---|
| layer_latency | M4 systick measurement | Wall-clock time spent in the layer |
| run_time | Perf counter | Layer execution time as reported by the perf counter |
| active_time | Perf counter | Time tiles were actively computing |
| l2_freeze | Perf counter | Time stalled waiting for L2 |
| iq_freeze | Perf counter | Time stalled on instruction-queue sync |
The summary lines printed to stdout (Sum perf ctr run_time / Sum m4 systick latency) are
the totals across all layers in the file.
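The same totals can be recomputed offline from the output file. This is a minimal sketch that scans for the five timing fields with a regular expression rather than a YAML parser; it assumes every value carries the `us` suffix as in the excerpt above, and `sum_layer_fields` is a hypothetical helper.

```python
import re

# Matches lines such as "layer_latency: 44.84us" (leading indentation optional).
_FIELD = re.compile(
    r"^\s*(layer_latency|run_time|active_time|l2_freeze|iq_freeze):"
    r"\s*([0-9.]+)us\s*$"
)


def sum_layer_fields(text):
    """Sum each timing field across all layers, in microseconds (hypothetical helper)."""
    totals = {}
    for line in text.splitlines():
        match = _FIELD.match(line)
        if match:
            name, value = match.group(1), float(match.group(2))
            totals[name] = totals.get(name, 0.0) + value
    return totals


LAYER_1 = """\
1:
  name: MLA_0/conv2d_add_relu_0
  start_cycle: 0
  end_cycle: 28435
  layer_latency: 44.84us
  run_time: 43.60us
  active_time: 28.43us
  l2_freeze: 14.88us
  iq_freeze: 3.50us
"""

totals = sum_layer_fields(LAYER_1)
print(totals)
```

Fed the full output file, `totals["run_time"]` and `totals["layer_latency"]` should correspond to the two summary lines printed on stdout.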
M4 debug config (-y)
-y FILE (--yaml-config) loads a YAML file that configures the M4 debug context: log
level, FIFO mode, GPIO/UART/profile enables, deadlock checking, and the Zebu flag. Without
-y, mla-rt uses hard-coded defaults.
Use the sample below as a template:
debugConfig: # this node is mandatory
# log level: disabled | info | verbose | debug | fifo-debug
logLevel : disabled
# fifo level: no-fifo | write | read-write
fifoLevel : write
# boolean flags
gpioEn : false
deadlockEn : true
uartEn : false
profileEn : false
zebu : false
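| Key | Type | Allowed values | Effect |
|---|---|---|---|
| logLevel | enum | disabled, info, verbose, debug, fifo-debug | M4 driver log verbosity |
| fifoLevel | enum | no-fifo, write, read-write | Which fifo channels are enabled |
| gpioEn | bool | true, false | Enable GPIO debug |
| deadlockEn | bool | true, false | Enable deadlock detection (default true) |
| uartEn | bool | true, false | Enable UART output |
| profileEn | bool | true, false | Enable profiling hooks |
| zebu | bool | true, false | Set when running against Zebu emulator |
| | uint32 | (numeric) | Optional: M4 semaphore operation override |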
Recipe
modalix:~$ ./mla-rt -vss -y /path/to/mlart-default-cfg.yaml \
$DIR/resnet_50_stage1_mla.elf
Reading m4 debug configuration from: /path/to/mlart-default-cfg.yaml
m4_context_config.yaml content:
logLevel : disabled
fifoLevel : write
gpioEn : false
deadlockEn : true
uartEn : false
profileEn : false
zebu : false
[ 0/ 1]setup_model /mnt/.../resnet_50_stage1_mla.elf
loadDramSegments:dt: 0.009s dramB=25342080 mbps=2898.003
Total Elapsed: time 83.72ms
tile.lat| api.lat| model.lat| relo.lat| run.lat| runmsg.lat| init.lat| dram.MB/s|
676| 707| 697| 10| 692| 8| 7121| 2898|
mla-rt echoes the parsed config back so you can confirm what was applied. If a key is missing
from your file, the corresponding default is kept.
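A config file can also be generated rather than hand-edited. The sketch below renders a debugConfig document from a set of overrides. It assumes the values in the sample template are the defaults (the guide does not list the hard-coded defaults explicitly); `render_debug_config` is a hypothetical helper.

```python
# Assumption: these mirror the sample template, not a confirmed list of built-in defaults.
DEFAULTS = {
    "logLevel": "disabled", "fifoLevel": "write", "gpioEn": False,
    "deadlockEn": True, "uartEn": False, "profileEn": False, "zebu": False,
}
ALLOWED = {
    "logLevel": {"disabled", "info", "verbose", "debug", "fifo-debug"},
    "fifoLevel": {"no-fifo", "write", "read-write"},
}


def render_debug_config(**overrides):
    """Return YAML text for -y with the given keys overridden (hypothetical helper)."""
    cfg = dict(DEFAULTS)
    for key, value in overrides.items():
        if key not in DEFAULTS:
            raise KeyError(key)
        if key in ALLOWED and value not in ALLOWED[key]:
            raise ValueError(f"{key}: {value!r} is not an allowed value")
        if isinstance(DEFAULTS[key], bool) and not isinstance(value, bool):
            raise TypeError(f"{key} expects a bool")
        cfg[key] = value
    lines = ["debugConfig:"]  # mandatory top-level node
    for key, value in cfg.items():
        text = str(value).lower() if isinstance(value, bool) else value
        lines.append(f"  {key}: {text}")
    return "\n".join(lines) + "\n"


print(render_debug_config(logLevel="debug", uartEn=True))
```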
Diagnostic arguments (-X)
Format: -X <opt>:... Currently supported options are F (frequency) and H (HWIL register).
-X F:... - Set MLA PLL frequency
DaVinci syntax: -X F:P:M (PLL P and M registers).
-X F:0x3:0x3b # 500 MHz (DV)
-X F:0x2:0x35 # 600 MHz (DV)
-X F:0x2:0x3e # 700 MHz (DV)
-X F:0x1:0x2f # 800 MHz (DV)
-X F:0x1:0x35 # 900 MHz (DV)
-X F:0x1:0x3b # 1000 MHz (DV)
Modalix syntax: -X F:FREQ_MHZ (range 500-1500):
-X F:933 # 933 MHz (MA)
Warning
Setting the frequency mutates board state until the next reset/reboot. After benchmarking at a non-default frequency, run again with the original frequency to restore.
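Because the two architectures use different F: syntaxes, a small lookup avoids passing the wrong form. A sketch under the assumption that only the DaVinci frequencies tabulated above are valid; `freq_arg` is a hypothetical helper.

```python
# DaVinci PLL (P, M) register pairs, copied from the examples above.
DV_PLL = {
    500: (0x3, 0x3B), 600: (0x2, 0x35), 700: (0x2, 0x3E),
    800: (0x1, 0x2F), 900: (0x1, 0x35), 1000: (0x1, 0x3B),
}


def freq_arg(arch, mhz):
    """Return the value to pass after -X for a target frequency (hypothetical helper)."""
    if arch == "dv":
        p, m = DV_PLL[mhz]  # KeyError for frequencies not tabulated above
        return f"F:{p:#x}:{m:#x}"
    if arch == "ma":
        if not 500 <= mhz <= 1500:
            raise ValueError("Modalix frequency must be within 500-1500 MHz")
        return f"F:{mhz}"
    raise ValueError(f"unknown arch: {arch}")


print(freq_arg("dv", 800), freq_arg("ma", 933))
```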
-X H:0xVALUE - Set NoC HWIL register
-X H:0x70 # set MA hwil (noc_cfg) to 0x70
(Modalix-only; ignored on DV.)
Multi-run and stats output
-m N (default 1) runs the model N times without reloading. The reported table
changes:
- N = 1: single row of values
- N > 1: 4 rows (avg, min, max, std), plus an extra gap.lat column (inter-run gap)
-M FILE (or --stats FILE) appends a YAML summary. With trailing + (-M FILE+) it
appends to an existing file. R7 output:
# Date created: 2026-04-29 21:35:08
# precision:dt:usec precision:0
/mnt/.../resnet_50_stage1_mla.elf:
filesize: 27777336
tile.lat:
avg: 675
api.lat:
avg: 707
model.lat:
avg: 697
relo.lat:
avg: 10
run.lat:
avg: 691
runmsg.lat:
avg: 8
init.lat:
avg: 6921
dram.MB/s:
avg: 2885
For -m N > 1, each metric also gets min, max, std keys.
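The stats file is simple enough to read without a YAML library. The sketch below assumes the three-level layout shown above (model path, metric, statistic) and ignores indentation; `parse_stats_yaml` is a hypothetical helper, and a real YAML parser is the safer choice if one is available.

```python
STAT_KEYS = {"avg", "min", "max", "std"}


def parse_stats_yaml(text):
    """Parse -M output into {model: {metric: {stat: value}}} (hypothetical helper)."""
    models = {}
    model = metric = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.rpartition(":")
        key, value = key.strip(), value.strip()
        if not value:
            if key.endswith((".elf", ".lm")):
                model = models.setdefault(key, {})  # new model section
                metric = None
            else:
                metric = model.setdefault(key, {})  # new metric section
        elif key in STAT_KEYS and metric is not None:
            metric[key] = float(value)
        else:
            model[key] = float(value)  # model-level scalar such as filesize
    return models


SAMPLE = """\
# Date created: 2026-04-29 21:35:08
/mnt/.../resnet_50_stage1_mla.elf:
  filesize: 27777336
  tile.lat:
    avg: 675
  api.lat:
    avg: 707
"""

parsed = parse_stats_yaml(SAMPLE)
print(parsed)
```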
Argument constraints and interactions
| Constraint | Behavior |
|---|---|
| -p together with -C / --cycle-count | Fatal: mutually exclusive |
| | Auto-enables |
| | Auto-enables |
| | |
| More than four -e counters | Fatal: max 4 perf counters |
| | Warning: |
| | Notice: |
Unrecognised flags (e.g., --csv, --check-lm, --legacy, --sram,
--dram-channel, -F) cause the full usage menu to be printed and exit.
Recipes (with verified board output)
All recipes were run on a Modalix board against the following model and companions:
modalix:~$ MODEL=/mnt/.../resnet_50/elffiles/resnet_50_stage1_mla.elf
modalix:~$ IFM=$(dirname $MODEL)/resnet_50_stage1_mla.ifm.mlc
modalix:~$ CHK=$(dirname $MODEL)/resnet_50_stage1_mla.ofm_chk.mlc
R1 - Default (no flags) - minimum output
modalix:~$ ./mla-rt $MODEL
[ 0/ 1]setup_model /mnt/.../resnet_50_stage1_mla.elf
The default verbosity prints only the setup line. Stats are gated behind -v.
R2 - Single run with full stats (-vss)
modalix:~$ ./mla-rt -vss $MODEL
[ 0/ 1]setup_model /mnt/.../resnet_50_stage1_mla.elf
loadDramSegments:dt: 0.009s dramB=25342080 mbps=2884.039
Total Elapsed: time 82.31ms
Model stats: /mnt/.../resnet_50_stage1_mla.elf
output: dt:usec precision:0
tile.lat| api.lat| model.lat| relo.lat| run.lat| runmsg.lat| init.lat| dram.MB/s|
----------------------------------------------------------------------------------------------------
676| 709| 698| 10| 693| 8| 7240| 2884|
----------------------------------------------------------------------------------------------------
R3 - Quiet (-q)
modalix:~$ ./mla-rt -q $MODEL
...shush...
[ 0/ 1]setup_model /mnt/.../resnet_50_stage1_mla.elf
R4 - Dry run (-Y) - parse + allocate, don't run
modalix:~$ ./mla-rt -Y $MODEL
[ 0/ 1]setup_model /mnt/.../resnet_50_stage1_mla.elf
R5 - Multi-run (-m 4) - see jitter
modalix:~$ ./mla-rt -vss -m 4 $MODEL
tile.lat| api.lat| model.lat| gap.lat| relo.lat| run.lat| runmsg.lat| init.lat| dram.MB/s|
----------------------------------------------------------------------------------------------------------------
avg 676| 696| 690| 6| 5| 688| 6| 5315| 2888|
min 676| 686| 683| 3| 1| 682| 4| 4040| 2887|
max 677| 710| 699| 7| 10| 694| 8| 7361| 2887|
std 0| 13| 9| 2| 5| 6| 2| 1889| 0|
R6 - --checkfile without --ifm - expected FAIL
modalix:~$ ./mla-rt -vss --checkfile $CHK $MODEL
...
Test:: ofm.b0::FAILED
Total Elapsed: time 93.21ms
api.lat| model.lat| relo.lat| run.lat| runmsg.lat| dram.MB/s|
----------------------------------------------------------------------------
11600| 699| 10| 693| 9| 2901|
--checkfile auto-enables -c, but the model runs against a zero-filled IFM and the output
doesn't match. For the PASS path, see Checkfile validation.
R7 - YAML stats (-M)
See Multi-run and stats output above for the YAML format.
R8b - Custom IFM and OFM
modalix:~$ ./mla-rt -vss --ifm ifm.b0:$IFM --ofm ofm.b0::$CHK $MODEL
tile.lat| api.lat| model.lat| relo.lat| run.lat| runmsg.lat| init.lat| dram.MB/s|
----------------------------------------------------------------------------------------------------
675| 720| 709| 22| 691| 8| 7241| 2882|
The empty middle field in ofm.b0::$CHK means "no init data", and $CHK is the validation
checkfile.
R9 - Performance counters
See Performance counters for the recommended
-e2 -e5 -e6 -e7 combo and the -e0 -e1 variant.
R20 - Read 32 B of IFM as decimals, 16 per row
modalix:~$ ./mla-rt -vss -r "ifm.b0:32:%d1@16" $MODEL
mlaZoneIFM:ifm.b0[0x1182000000]@32
0x1182000000: 86 -62 -101 -52 -23 80 -13 -105 -43 25 80 58 126 -80 97 -114
0x1182000010: 14 -58 -77 40 -50 15 -62 -90 63 -18 45 43 113 -20 -72 34
R21 - Dump OFM to a binary file
modalix:~$ ./mla-rt -vss --ifm ifm.b0:$IFM -r "ofm.b0::%b:/tmp/ofm.bin" $MODEL
mlaZoneOFM:ofm.b0[0x1182040000]@1008 => /tmp/ofm.bin
modalix:~$ ls -la /tmp/ofm.bin
-rw-r--r-- 1 root root 1008 Apr 29 21:38 /tmp/ofm.bin
R22 - DDR performance counters (-Q)
modalix:~$ ./mla-rt -vss -Q $MODEL
tile.lat| api.lat| model.lat| relo.lat| run.lat| runmsg.lat| init.lat| dram.MB/s|
677| 925| 914| 10| 909| 8| 6681| 2885|
(Note model.lat rises from about 700 to 914 µsec: DDR counter overhead.)
R25 - Section checksums (-f) - slow
modalix:~$ ./mla-rt -vss -f $MODEL
loadDramSegments:dt: 0.818s dramB=25342080 mbps= 30.976
...
Total Elapsed: time 889.66ms
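(Total elapsed time is ~10× longer, 82 ms vs. 890 ms. The load step itself goes from
0.009 s to 0.818 s, and dram.MB/s collapses from ~2900 to 31 because the checksum
computation is included in the measured load time.)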
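The reported bandwidth can be reproduced from the loadDramSegments line as dramB / dt, expressed in units of 10^6 bytes per second. A short worked sketch using the R25 numbers; `load_bandwidth_mbps` is a hypothetical helper, and the printed dt is rounded, so the result only approximates the logged mbps value.

```python
def load_bandwidth_mbps(dram_bytes, dt_seconds):
    """DRAM load bandwidth in MB/s, with 1 MB taken as 10**6 bytes (hypothetical helper)."""
    return dram_bytes / dt_seconds / 1e6


# R25: dt=0.818s, dramB=25342080
with_checksums = load_bandwidth_mbps(25342080, 0.818)
# Ratio of the total elapsed times reported with and without -f
elapsed_ratio = 889.66 / 82.31
print(round(with_checksums, 2), round(elapsed_ratio, 1))
```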
R26 - Memory allocator (--mem-lib sdram)
modalix:~$ ./mla-rt -vss --mem-lib sdram $MODEL
arch.memlib: SDramAllocator
[ 0/ 1]setup_model ...
R27 - Version
modalix:~$ ./mla-rt -V
mla_rt git revision: <revision>
Stale --help entries
The usage() menu lists several flags that the current parser does not accept:
| Listed in --help | Actual status |
|---|---|
| | Not parsed: passing it dumps the usage menu and exits |
| | Not parsed |
| | Not parsed |
| | Not parsed (named entry exists in menu but no long option) |
| | Not parsed |
| | Not a top-level flag; |
| | Documentation note, not a flag |
| | The actual flag is |
| | DEPRECATED, use |
Troubleshooting
- Error opening file dump.log : Permission denied (with -l)
  Logging tries to write dump.log in cwd. If you cd into a read-only mount, the open fails. Workarounds: cd /tmp before running, or use -L (dumps the log to stdout, no file).
- parse_file:ERROR:cannot open <FILE> from --ofm
  -o NAME:FILE:CHK requires FILE to exist as init data. To skip init, leave the field empty: -o NAME::CHK.
- Test:: ofm.b0::FAILED with default (no --ifm)
  The model runs against zero-filled (or default) IFM, so the OFM doesn't match the golden checkfile. Pass the matching --ifm.
- Setting -X F:... on the wrong arch
  DV uses F:P:M; MA uses F:FREQ_MHZ (500-1500). Mismatched syntax fails with a parsing error.
- -z doesn't behave like a no-arg flag
  The parser declares -z as a required argument but the case body uses no value. Pass any token (e.g., -z 1) until this is fixed.
Appendix: verbatim sub-help output
Captured from the board.
-hc - Ctrl-C handling
MLASignal allows control of SIGHUP(1) SIGINT(2) SIGQUIT(3) SIGTERM(15) :
options set as follows: [--ctrl-c|-b] 01vsycre :
c : immediately do gst.close
e : exit(1)
no|off : turn off sighandler
r : schedule gst to halt further processing
s : after interrupt sleep 1 sec then return
y : ask Y/N to continue interrupt
examples: :
--ctrl-c ys : ask y/n ; sleep 1s
--ctrl-c r : request to halt further processing
-hl - Logger
Logger options set as follows:
-vXYZ where XYZ are:
a : dma process logging
A : allocation process
b : turn on progress bar
c : check file logging: -c must be on
d : generic debug diagnostics
F : frames processing
g : gstapi and remote proc diagnostics
h : help menu i.e. usage()
l : loader logging
m : m4 driver log (1:wm 2:m 4:dm)
n : show notes: -vn- turn off
o=FILE : output to FILE
o=syslog : output to syslog facility
p : payload transactions
P : enable policies
q : quiet = (all masks zeroed)
r : relocation processing
s : show all statistics
S : show .lm/.elf dma and tile sizes
u : verify/fix lm program overrun
v : set verbosity: v=1(-vcxwt) v=2(v=1 + lstu) ...
x : log run progress
y : dump symbol table(s)
z : set all flags
:N : set FP precision 42.nnnn
:nsec|usec|msec|sec : set time unit
Generic diagnostics :
e : ERROR diagnostics: -ve- turn off
f : FATAL diagnostics: -vf- turn off
t : TODO diagnostics
w : WARN diagnostics
examples : -vv=2 or -v -v :set verbose = 2 ; verbose is a compound qualifier
: -vxr=2 :enable run=1 and relocations=2
: -vsw=0,m=0x7 :enable stats, disable warnings, set m4_driver=wm|m|dm
: -vs:msec:3 :enable stats at dt=millisec fp.precision=3
-hm - Memory dump (full reference)
For reads to DRAM/L1/L2/SRAM/OCM the general form is : -r EXPR:SIZE:FILE:FMT
For writes we have : -w EXPR:SIZE:FILE
: EXPR can be a constant, symbol or l1(.) l2(.) addresses described below
: SIZE defaults to 4B and can be inferred
: For FILE=sha1 we generate a SHA1 key
: For FMT: %<FILL>abcdfostux<TSIZE><MODIFERS>
: where FILL: field size in bytes: ex %04 fill with 4 zeros
: a:dma b:binary c:character d:decimal f:float o:octal s:string t:tile-instr u:unsigned x:hex
TSIZE : x1: one hex byte d2: two byte decimal uint16 u8: 8byte unsigned
MODIFIERS : n: NO ADDRESS r: reverse 16B row m: m4_sym @rowsize: bytes/row
-r addr[:n][:FILE][:FMT] : read dram[addr] for len n
-r sym[:n][:FILE][:FMT] : read sym (ofm|ifm|lora|...)
-r sym::sha1 : SHA1 of sym (len inferred)
-r l1(x,y,addr)[:n][:FILE][:FMT] : read L1 at tile (x,y)
-r l2(unit,addr)[:n][:FILE|FMT] : read L2 at unit=n0,e2 row addr
-r m4.list : print M4 driver symbols
-r m4(sym)[:n][:FILE][:FMT] : read M4 driver symbol
-r * : report all symbols in symbol table
-w addr:n[:data|FILE] : write data or FILE to dram[addr]@n
format examples:
-r 0:2r:%d1 → 0x000000: 10 0 0 20 31 32 3 -43 ...
-r 0:8:%x1 → 0x000000: a 0 0 14 1f 20 3 d5
-r 0:16:%02x1 → 0x000000: 0a 00 00 14 1f 20 03 d5 00 00 00 00 00 00 00 00
-r 0:16:%02x1r → 0x000000: 00 00 00 00 00 00 00 00 d5 03 20 1f 14 00 00 0a
-r 0:16:%u4n → 335544330 3573751839 0 0
-hP - Policies
Policies respond to correct driver or mla error or timeouts :
options set as follows: -policy or -P x,y:n,z=1.0 :
-P off : disable policy options
a : enable all policy options
r : enable retry on failure
d : enable driver restart on failure
l : enable driver reload
Policies have a self test mode m|m4|mla described below :
mla,m4,f=0.7 : enable random mla/m4 failure at rate 0.7
mla:N : generate mla failure N = { 1..5 }
m4:N : generate m4 error 6..18 (not all implemented)
examples : -vP or -vP+ for verbose output
-P r+ : retry 2x
-P rd+l : retry, driver restart 2x, then reload
-P a,m4,f=1 : all policies + random m4 failure rate 1.0
-P a,m4:7:20:2 : generate m4:7=ERR_CODE_FATAL_DMA:unt:err
-P a,mla : random mla errors at 0.5
-P a,m4:18:128 : m4 interrupt sig=128
mla error codes:
mla:1 : dmaDramToM4: write payload failed
mla:2 : MLA file not found
mla:3 : M4 done timeout
mla:4 : generic sendCommand failure
mla:5 : checkfile failed to match
m4 error codes:
m4:7:unit:err : DMA Fatal interrupt
m4:8:err : MIM Fatal interrupt
m4:9:row:col : TLC Fatal interrupt
m4:10:err : uC Fatal interrupt
: m4:6,11-17 not implemented (will be generated by -P a,m4)