mla-rt User Guide

mla-rt is the MLA (Machine Learning Accelerator) runtime CLI. It loads a compiled model (*.elf or *.lm), runs it on the MLA hardware, and reports performance and correctness statistics.

Overview and deployment 

mla-rt is an aarch64 binary that runs on an MLA-equipped board. Every supported board ships mla-rt pre-installed at /usr/bin/mla-rt, so after connecting via SSH (ssh root@<board>), you can run it directly with no setup. This is the right choice for routine model bring-up, performance measurements, and CI-style runs.

A typical session uses an SSH alias or hostname for the board, for example ssh root@modalix.

Models live at /mnt/mla_test_runner_models/... — for example /mnt/mla_test_runner_models/ma/2025-05-05-regression-tests/image_classification/resnet_50/elffiles/resnet_50_stage1_mla.elf.

A model directory typically contains:

File	Purpose
`*_stage1_mla.elf`	Compiled MLA binary (the model)
`*_stage1_mla.ifm.mlc`	Reference IFM (input feature map)
`*_stage1_mla.ofm_chk.mlc`	OFM checkfile (golden output for validation)
`*_stage1_mla.mlc`	The model in MLC text format (not frame data)
`*_stage1_mla_stats.yaml`	Reference stats

Note

The recipes on this page were captured on a Modalix (MA) board, so DaVinci-only flags (-X F:0xN:0xN) and Zebu/VDK modes do not apply there.

Quick start 

modalix:~$ ./mla-rt -vss /mnt/.../resnet_50_stage1_mla.elf
[   0/   1]setup_model /mnt/.../resnet_50_stage1_mla.elf
loadDramSegments:dt:   0.009s  dramB=25342080  mbps=2884.039
Total Elapsed: time   82.31ms
Model stats: /mnt/.../resnet_50_stage1_mla.elf
output: dt:usec precision:0
       tile.lat|    api.lat|  model.lat|   relo.lat|    run.lat| runmsg.lat|   init.lat|  dram.MB/s|
----------------------------------------------------------------------------------------------------
            676|        709|        698|         10|        693|          8|       7240|       2884|
----------------------------------------------------------------------------------------------------

-vss is -v (verbose) with the sub-argument ss (stats level 2). Without a stats sub-argument (-vs or -vss), the table is suppressed and only the setup_model line is printed.

-vs enables the table; -vss adds the runmsg.lat, init.lat, and dram.MB/s columns.

The performance table 

Column meanings with their respective units:

Column	Unit	Meaning
`tile.lat`	µsec	Aggregate tile execution time reported by the MLA hardware
`api.lat`	µsec	End-to-end runtime of the user-facing API call
`model.lat`	µsec	Time spent in `mla_run_model` (host side)
`gap.lat`	µsec	Multi-run only — gap between successive model invocations
`relo.lat`	µsec	Relocation processing
`run.lat`	µsec	Pure model execution (M4-side)
`runmsg.lat`	µsec	Time to send the run message to M4. Shown when stats >= 2
`init.lat`	µsec	One-time MLA initialisation. Shown when stats >= 2
`dram.MB/s`	MB/sec	DRAM bandwidth measured during model load

For multi-run (-m N > 1), each column gets four rows: avg, min, max, std (see the R5 recipe below).

The default time scale is microseconds with precision 0 (output: dt:usec precision:0). Override with the :nsec|:usec|:msec|:sec and :N (digits) suffixes on -v, for example -vs:msec:3 shows stats in milliseconds with 3 decimals.
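For scripted runs, the table can be scraped from captured stdout. The following is a minimal Python sketch, assuming the layout shown in the samples on this page (a `|`-separated header row, then either one unlabelled row of values or rows labelled avg/min/max/std). `parse_stats_table` is a hypothetical helper for illustration and is not part of mla-rt.

```python
def parse_stats_table(text):
    """Parse a stats table into {row_label: {column: value}}.

    Assumes the layout shown in this guide: a '|'-separated header row
    whose first column ends in '.lat', then one row of values (stored
    under "value") or rows labelled avg/min/max/std.
    """
    header = None
    rows = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # banners, dashed separators, blank lines
        cells = [c.strip() for c in line.rstrip().rstrip("|").split("|")]
        if header is None:
            if cells[0].endswith(".lat"):
                header = cells
            continue  # skip anything before the header, e.g. perf.id lines
        first = cells[0].split()  # "676" or "avg 676"
        if not first:
            continue
        label = first[0] if len(first) == 2 else "value"
        values = [first[-1]] + cells[1:]
        # float() also accepts the "nan" entries that appear in std rows
        rows[label] = {name: float(v) for name, v in zip(header, values)}
    return rows


sample = """\
       tile.lat|    api.lat|  model.lat|
------------------------------------------
            676|        709|        698|
"""
print(parse_stats_table(sample)["value"]["api.lat"])
```

Values are returned in whatever time unit the run was configured with, so a caller that changes the -v unit suffix needs to track the unit separately.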

Connection modes (-a / --connect)

Default: scans /etc/buildinfo on the host to pick the build target. On aarch64 boards, this normally selects a65 (the on-board ARM CPU).

The argument is a comma-separated list of tokens. The usage menu lists them as:

-a [a65|dv|davinci|ma|michelangelo|mod|test|zebu1|zebu2|sw_mbox|dram|sdram|block|enable|disable|mip|vdk]

Architecture options 

Option	Meaning	Where it works
`dv` / `davinci`	DaVinci silicon	DV boards (e.g., older mla-hl* generation)
`ma` / `michelangelo` / `mod`	Modalix silicon	MA boards
`test`	Loopback / no hardware	Anywhere

Connection / transport options 

Option	Meaning
`a65`	On-board A65 ARM (default for aarch64 hosts)
`sw_mbox`	Software mailbox transport
`zebu1` / `zebu2`	Zebu emulator (network access required)
`vdk`	Virtual development kit
`mip`	MIP transport mode

Memory allocator options 

The allocator can also be selected with -B / --mem-lib.

Option	Meaning
`dram`	DRAM allocator
`sdram`	SDRAM allocator (default)
`block`	Block allocator

Cache options 

Option	Meaning
`enable`	Enable cache
`disable`	Disable cache

Examples:

-a dv,a65,sw_mbox     # Arch=DaVinci, Connect A65, use software mailbox
-a mod,zebu1          # Arch=Modalix, Connect to <zebu-emulator-server>
-a mod,block,disable  # Arch=Modalix, use block allocator, disable cache
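To make the grouping of -a tokens concrete, here is a small Python sketch that sorts a comma-separated value into the four groups from the tables above. The token sets are copied from those tables; `classify_connect` and the group names are hypothetical, and the logic only illustrates the grouping. It is not how mla-rt parses -a internally.

```python
# Token groups copied from the option tables in this section.
GROUPS = {
    "arch": {"dv", "davinci", "ma", "michelangelo", "mod", "test"},
    "transport": {"a65", "sw_mbox", "zebu1", "zebu2", "vdk", "mip"},
    "allocator": {"dram", "sdram", "block"},
    "cache": {"enable", "disable"},
}


def classify_connect(arg):
    """Split a comma-separated -a value into {group: [tokens]}."""
    result = {}
    for token in arg.split(","):
        for group, names in GROUPS.items():
            if token in names:
                result.setdefault(group, []).append(token)
                break
        else:
            # No group matched: the token is not in the usage menu.
            raise ValueError(f"unknown -a token: {token!r}")
    return result


print(classify_connect("mod,block,disable"))
```

Running the three examples above through this helper is a quick way to see which group each token belongs to before trying a new combination on a board.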

Complete flag reference 

Sorted by short flag (uppercase before lowercase).

Arg column: — = no argument, optional = optional argument, req = required argument.

Short	Long	Arg	Default	Description
`-A`	`--labels`	req `FILE`	(none)	Compute output labels via softmax against `FILE`
`-a`	`--connect`	req	(auto)	Architecture / transport / allocator (see Connection modes)
`-B`	`--mem-lib`	req `sdram\|dram\|block`	`sdram`	Choose memory allocator (alternative to `-a sdram\|dram\|block`)
`-b`	`--ctrl-c`	req	`r`	Ctrl-C handling. See -hc appendix
`-C`	`--cycle-count`	req `N`	`0`	Set N bytes of m4.sram to zero (cycle-count instrumentation)
`-c`	`--chk`	—	off	Validate OFM rows against the checkfile. See Checkfile validation
`-D`	`--debug`	—	off	Enter `MLAMemDebug` mode
`-d`	`--driver`	req `*.bin`	`mla_driver.elf`	M4 driver binary (required for Q0 / Q0123 quad configs)
`-e`	`--perf-ctr`	req `N`	(none)	Add performance counter `N` (max 4). See Performance counters
`-f`	`--do-check`	—	off	Section checksum validation (slow: load takes ~90× longer and total elapsed time ~10× longer, see R25)
`-G`	`--no-check-dtb`	—	DTB check ON	Skip `/proc/device-tree/reserved-memory/` check for DRAM allocators
`-g`	`--tmo-thr-fact`	req `THR:FACT`	`0:0`	DMA timeout threshold and factor
`-H`	`--m4-dasm`	req `*.dasm`	(none)	`mla_driver.dasm` for M4 symbol resolution
`-h`	`--help`	optional `aclmP`	—	Help. `-h`=main; `-ha`=all sub-helps; `-hc`=ctrl-c; `-hl`=logger; `-hm`=mem dump; `-hP`=policies
`-I`	`--interleaving`	req	(off)	Software interleaving. `2`/`3`/`4` = AXI ctrls; `d` = optimize DMA; `m0123` = MIM placement; combinable e.g. `-I4dm`
`-i`	`--ignore`	—	off	Continue past errors
`-J`	`--model-timeout`	req `MS`	`3000.0`	Per-run model timeout in milliseconds (Zebu uses this)
`-j`	`--ocm`	req `STRAT:LIMIT`	(none)	OCM placement (`all\|max\|min`); historically Zebu-only
`-K`	`--reloc`	req `SYM:INIT`	(none)	Add a relocation: create DramSection `SYM` initialised from `INIT` file. Repeatable
`-k`	`--checkfile`	req `FILE`	(none)	Use `FILE` as the OFM checkfile. Auto-enables `-c`
`-L`	`--last-log`	—	off	Connect, dump M4 log to stdout, exit (no model load)
`-l`	`--log`	—	off	Enable M4 logging during the run. Note: writes `dump.log` in cwd; from a read-only NFS mount this errors with `Error opening file dump.log : Permission denied` (see Troubleshooting)
`-M`	`--stats`	req `FILE[+]`	(none)	Write YAML stats to `FILE`; a trailing `+` appends to an existing file. See R7 for format
`-m`	`--max-run-count`	req `N`	`1`	Run model `N` times without reloading. See Multi-run
`-N`	`--no-reset`	—	M4 reset ON	Skip M4 reset before run
`-n`	`--no-shr-sect`	—	sharing ON	Disable sharing of duplicate instructions across models (used for LLMs)
`-O`	`--ifm`	req `NAME:FILE`	(none)	Construct an IFM segment named `NAME` initialised from `FILE`. Repeatable
`-o`	`--ofm`	req `NAME:FILE:CHK`	(none)	Construct an OFM segment. `FILE` may be empty (`NAME::CHK`) for no init data; `CHK` is optional checkfile. Repeatable
`-P`	`--policy`	req	(none)	Process MLAPolicies args. `-Poff` to disable. See -hP appendix
`-p`	`--per-layer-stats`	req `FILE`	(none)	Per-layer latency YAML (input). Output written to `<base>_output.yaml`. Mutually exclusive with `-C`. See Per-layer stats
`-Q`	`--ddr-perf`	—	off	Use DDR performance counters (R22 — increased model.lat ≈700→914 µsec from counter overhead)
`-q`	(none)	—	off	Quiet output (alias of `-vq`). Prints `...shush...` and only the setup line
`-r`	`--read`	req	(none)	Read DRAM/L1/L2/SRAM/OCM. See Memory access
`-S`	`--seg-num`	req `N`	`-1`	Run only segment `N`
`-s`	`--report`	—	off	Dump stats. Use `--stats FILE` for YAML output
`-T`	`--write-addr`	req `*.yaml`	(none)	Write address/value pairs to YAML file
`-t`	`--gst-api`	—	off	Skip default IFM/OFM allocation (used with GST APIs). Auto-disabled if `--ifm`/`--ofm` is also passed
`-V`	`--version`	—	—	Print git revision and exit
`-v`	`--verbose`	optional	(off)	Logger options. See Verbosity
`-W`	`--no-rproc`	—	rproc ON	Disable remoteproc subsystem for M4 control
`-w`	`--write`	req	(none)	Write to DRAM/L1/L2/SRAM/OCM. See Memory access
`-X`	`--diag-arg`	req	(none)	Diagnostic arg (`F:` = freq, `H:` = HWIL). See Diagnostic arguments
`-x`	`--max-threads`	req `N`	`8`	`MAX_MULT_LOAD_THREADS` — model load parallelism
`-Y`	`--dryrun`	—	off	Parse and allocate, do not run
`-y`	`--yaml-config`	req `FILE`	(none)	M4 debug config (logLevel, fifoLevel, gpio/uart/profile enables). See M4 debug config
`-Z`	`--dma-burst-sz`	req `SIZE`	`0`	DMA burst size (Modalix-only)
`-z`	`--en-alloc-gen`	—	off	Use ev74 (target generic) memory allocator. Note: the parser tags this as a required argument but uses no value; pass any token, e.g., `-z 1`

Long-only flag 

Long	Description
`--arch`	DEPRECATED — replaced by `-a`/`--connect`. Do not use.

Positional arguments 

After the option list, any remaining tokens are interpreted by extension:

*.elf or *.lm → executable model files (multiple allowed)
everything else → frame data files
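The extension rule can be sketched in a few lines of Python. This is an illustration only: `split_positionals` is a hypothetical helper, and it assumes the match is on the final extension and is case-sensitive, which the text above does not state.

```python
from pathlib import PurePosixPath

MODEL_SUFFIXES = {".elf", ".lm"}  # executable model files


def split_positionals(tokens):
    """Return (models, frames) using the extension rule described above."""
    models, frames = [], []
    for tok in tokens:
        if PurePosixPath(tok).suffix in MODEL_SUFFIXES:
            models.append(tok)
        else:
            frames.append(tok)  # everything else is treated as frame data
    return models, frames


print(split_positionals(["resnet_50_stage1_mla.elf", "resnet_50_stage1_mla.ifm.mlc"]))
```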

Verbosity and logging (-v, -l, -L, -q)

-v takes an optional argument. The argument is parsed by the logger. Common forms:

Form	Effect
`-v`	Bare verbose; turns on default channels
`-vs`	Stats level 1 (basic table)
`-vss`	Stats level 2 (extra columns: `runmsg.lat`, `init.lat`, `dram.MB/s`)
`-vc`	Check-file logging
`-vq`	Quiet (same as `-q`)
`-vsr`	Stats + relocation processing
`-v:msec:3`	Time scale = ms, precision = 3 digits
`-vsw=0,m=0x7`	Stats on, warnings off, m4 driver mask = `wm\|m\|dm`
`-vP` / `-vP+`	Policy log output
`-vz`	Set ALL flags

The full grammar is -vXYZ where each letter is a one-char channel. See the -hl appendix for the channel list.

-l enables the M4 driver log (collected during run). -L connects, dumps the log to stdout, and exits (no load). -q is shorthand for -vq.

Memory access (-r / -w)

General forms (see -hm appendix):

-r EXPR[:N][:FILE][:FMT]   # read
-w EXPR:N[:DATA|FILE]      # write

EXPR may be:

Form	Meaning
`0xADDR`	DRAM address
`sym`	A model symbol — `ifm`, `ifm.b0`, `ofm`, `ofm.b0`, `lora`, …
`sram(N)`	M4 SRAM bank N
`l1(x,y,addr)`	L1 at tile (x,y)
`l2(unit,addr)`	L2 (e.g., `l2(n0,0x100)`)
`m4(sym)`	M4 driver symbol
`m4.list`	Print all M4 driver symbols
`*`	All symbols (with `-r` only)

Suffixes for N and addresses: k = 1024, m = 1048576, r = 16 (one row).
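As a worked example of the suffixes, the Python sketch below converts a size token to a byte count. `parse_count` is a hypothetical helper. It assumes lowercase suffixes only (as documented) and relies on Python's base-0 integer parsing for decimal and 0x-prefixed values, which may differ from mla-rt's own parser in edge cases.

```python
SUFFIX = {"k": 1024, "m": 1048576, "r": 16}  # r = one 16-byte row


def parse_count(text):
    """Convert tokens such as '32k', '10r', '0x100' or '8' to a byte count."""
    text = text.strip()
    mult = 1
    # k, m and r are not hex digits, so a trailing suffix is unambiguous.
    if text and text[-1] in SUFFIX:
        mult = SUFFIX[text[-1]]
        text = text[:-1]
    return int(text, 0) * mult  # base 0 accepts decimal and 0x-prefixed hex


print(parse_count("32k"), parse_count("10r"), parse_count("0x100"))
```

So `32k` is 32768 bytes and `10r` is 160 bytes, that is, ten 16-byte rows.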

FMT is %<FILL><TYPE><TSIZE><MODS>:

TYPE: a (dma), b (binary), c (char), d (decimal), f (float), o (octal), s (string), t (tile-instr), u (unsigned), x (hex)
TSIZE: 1, 2, 4, 8 byte width
MODS: n (no address prefix), r (reverse 16-byte row), m (m4_sym), @N (N bytes per row)
Special: FILE=sha1 produces a SHA-1 hash

Examples:

-r ifm.b0:32:%d1@16              # 32 B of IFM as decimal, 16 per row
-r ofm.b0::%b:/tmp/ofm.bin       # all OFM as binary to file
-w 0x30:8:0xdeadbeef             # write 8 B of 0xdeadbeef at 0x30
-w sram(0)::FILE                 # write FILE into sram(0)
-w 0x1d0000:32k:-1               # fill 32 KB at 0x1d0000 with -1
-w l2(n0,0x100):10r::0,1,2,3,4,5 # write 10 rows of literal data

Performance counters (-e)

Up to 4 counters per run (exits if exceeded). Counter names:

ID	Name
0	`tiled_lat`
1	`tile_lat`
2	`tile_active`
3	`sync_frz/pwr`
4	`tile_pwr-c`
5	`sync_frz`
6	`iq_sync_frz`
7	`d_sync_frz`

All perf counters report in nanoseconds. Each counter adds a column perf.id N with the counter’s name. The values in the table are scaled to the active time-unit (default µsec).

Common combination: -e2 -e5 -e6 -e7 

The most useful set for understanding where MLA cycles are spent: tile activity (tile_active) plus the three freeze counters (sync_frz, iq_sync_frz, d_sync_frz). With those, tile_active + sync_frz ≈ tile.lat.

modalix:~$ ./mla-rt -vss -m 3 -e2 -e5 -e6 -e7 $MODEL
Perf counter 2 source "tile_active" added.
Perf counter 5 source "sync_frz" added.
Perf counter 6 source "iq_sync_frz" added.
Perf counter 7 source "d_sync_frz" added.
...
                                                                                                                  perf.id 2|  perf.id 5|  perf.id 6|  perf.id 7|
       tile.lat|    api.lat|  model.lat|    gap.lat|   relo.lat|    run.lat| runmsg.lat|   init.lat|  dram.MB/s|tile_active|   sync_frz|iq_sync_frz| d_sync_frz|
----------------------------------------------------------------------------------------------------------------------------------------------------------------
avg         676|        701|        693|          6|          7|        689|          7|       6094|       2890|        494|        182|          3|        182|
min         676|        686|        683|          4|          1|        682|          4|       4040|       2890|        494|        182|          3|        182|
max         677|        708|        696|          8|          9|        691|          8|       7121|       2890|        494|        183|          3|        183|
std         nan|          9|          6|          3|          3|          4|          2|       1266|          0|          0|        nan|          0|        nan|

Reading the avg row above: tile_active=494µs + sync_frz=182µs ≈ tile.lat=676µs. The tile spent ~73 % executing and ~27 % stalled on sync.
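The percentages come from dividing each counter by tile.lat. A minimal Python sketch of that arithmetic, using the avg-row values from the table above (`breakdown` is a hypothetical helper):

```python
def breakdown(tile_lat, tile_active, sync_frz):
    """Return (active %, stalled %, unaccounted time) for one table row."""
    active_pct = 100.0 * tile_active / tile_lat
    stall_pct = 100.0 * sync_frz / tile_lat
    residual = tile_lat - (tile_active + sync_frz)
    return round(active_pct, 1), round(stall_pct, 1), residual


# avg row: tile.lat=676, tile_active=494, sync_frz=182 (all in µsec)
print(breakdown(676, 494, 182))  # (73.1, 26.9, 0)
```

A residual near zero indicates that the two counters account for the whole of tile.lat in this run.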

Counter id 0 / 1 example 

modalix:~$ ./mla-rt -vss -m 3 -e0 -e1 $MODEL
       tile.lat|    api.lat|  model.lat|    gap.lat|   relo.lat|    run.lat| runmsg.lat|   init.lat|  dram.MB/s|  tiled_lat|   tile_lat|
avg         676|        703|        694|          6|          7|        690|          7|       6614|       2916|        676|        676|

For DDR-side counters use -Q.

Checkfile validation (-c / -vc)

-c (or --chk) asks mla-rt to compare the model’s OFM against a golden checkfile and report PASSED/FAILED. For the comparison to be meaningful, you must:

Pass the matching IFM with --ifm <SYM>:<FILE> so the model runs on the input the checkfile was produced for.
Tell mla-rt where the OFM lives (--ofm <SYM>::<CHK>) — the empty FILE field means “no init data”.
Use the symbol names declared in the model. For most regression-test models that’s ifm.b0 and ofm.b0.

If you specify --checkfile FILE instead of --ofm <SYM>::<CHK>, -c is auto-enabled and the checkfile is bound to the default OFM symbol — but you still need --ifm to feed the right input, otherwise the model runs against a zero-filled IFM and reports FAILED (this is what R6 in Recipes shows).

-vc (or -vssc) turns on check-file logging, which prints the address range checked and an “N rows compared correctly, M rows were bad” summary in addition to the PASSED/FAILED line.

Recipe: PASS with -c -vss 

modalix:~$ DIR=/mnt/.../resnet_50/elffiles
modalix:~$ ./mla-rt -c -vss --ifm ifm.b0:$DIR/resnet_50_stage1_mla.ifm.mlc \
                    --ofm ofm.b0::$DIR/resnet_50_stage1_mla.ofm_chk.mlc \
                    $DIR/resnet_50_stage1_mla.elf
[   0/   1]setup_model /mnt/.../resnet_50_stage1_mla.elf
loadDramSegments:dt:   0.009s  dramB=25342080  mbps=2872.757
Test::              ofm.b0::PASSED
Total Elapsed: time  205.43ms
       tile.lat|    api.lat|  model.lat|   relo.lat|    run.lat| runmsg.lat|   init.lat|  dram.MB/s|
            675|       1250|        710|         22|        692|          9|       7360|       2873|

api.lat jumps from ~700 µsec (without -c) to ~1250 µsec (with -c) — the comparison takes time and is included in the API latency.

Recipe: PASS with -c -vssc (verbose check)

modalix:~$ ./mla-rt -c -vssc --ifm ifm.b0:$DIR/resnet_50_stage1_mla.ifm.mlc \
                    --ofm ofm.b0::$DIR/resnet_50_stage1_mla.ofm_chk.mlc \
                    $DIR/resnet_50_stage1_mla.elf
...
checkDramSection: DramSection:      ofm.b0:address:0x1182040000..0x11820403f0:size:   0x3f0:zone:0x5:  mlaZoneOFM:model:/mnt/.../resnet_50_stage1_mla.elf:refcnt:0x1
63 rows compared correctly, 0 rows were bad.
Test::              ofm.b0::PASSED

The two extra lines (one per checked DramSection, plus the row count) make -vssc the right verbosity when debugging a FAIL.
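The row count in the summary can be cross-checked against the address range in the checkDramSection line. The Python sketch below assumes one row is 16 bytes (the `r` suffix in Memory access) and that the end address is exclusive, which matches the logged size of 0x3f0. `rows_in_range` is a hypothetical helper.

```python
ROW_BYTES = 16  # one row, per the "r = 16" suffix


def rows_in_range(start, end):
    """Return (size in bytes, number of 16-byte rows) for [start, end)."""
    size = end - start
    return size, size // ROW_BYTES


size, rows = rows_in_range(0x1182040000, 0x11820403F0)
print(hex(size), size, rows)  # 0x3f0 1008 63
```

The 1008-byte size also matches the OFM dump in recipe R21.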

Per-layer stats (-p)

-p FILE runs the model layer by layer and emits per-layer latency and perf-counter data. The input file is a YAML mapping from layer index to name, start_cycle, and end_cycle. Most regression-test models ship one alongside the elf, for example resnet_50_stage1_mla_stats.yaml.

Output is written to <input-base>_output.yaml (the input is not overwritten). For example, passing /tmp/layers.yaml produces /tmp/layers_output.yaml. Pre-existing _output.yaml is overwritten.
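A script that needs to locate the output file can derive the name the same way. The Python sketch below assumes the rule is "insert _output before the final extension", which is consistent with the example above. `per_layer_output_path` is a hypothetical helper, not an mla-rt API.

```python
from pathlib import PurePosixPath


def per_layer_output_path(input_path):
    """Derive <input-base>_output.<ext> from the -p input path."""
    p = PurePosixPath(input_path)
    return str(p.with_name(p.stem + "_output" + p.suffix))


print(per_layer_output_path("/tmp/layers.yaml"))  # /tmp/layers_output.yaml
```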

-p is mutually exclusive with -C / --cycle-count; combining them is fatal at parse time.

Input file format 

0:
  name: MLA_0/placeholder_0_0
  start_cycle: 0
  end_cycle: 0
1:
  name: MLA_0/conv2d_add_relu_0
  start_cycle: 0
  end_cycle: 28435
2:
  name: MLA_0/max_pool2d_1
  start_cycle: 28436
  end_cycle: 31265
...

Index 0 is a placeholder and is skipped. Layers are run in order; cycle ranges drive how long each layer is allowed to execute.

Recipe 

Note

Always pass a copy of the stats yaml — even though mla-rt writes to a separate _output.yaml, copying first protects against accidents (e.g., invoking with -p $MODEL_DIR/foo_stats.yaml from a writable mount).

modalix:~$ cp $DIR/resnet_50_stage1_mla_stats.yaml /tmp/layers_in.yaml
modalix:~$ ./mla-rt -vss -p /tmp/layers_in.yaml \
                    --ifm ifm.b0:$DIR/resnet_50_stage1_mla.ifm.mlc \
                    $DIR/resnet_50_stage1_mla.elf
[   0/   1]setup_model /mnt/.../resnet_50_stage1_mla.elf
loadDramSegments:dt:   0.009s  dramB=25342080  mbps=2887.514
Per layer stats output file name:
/tmp/layers_in_output.yaml

Sum perf ctr run_time is: 728.98us
Sum m4 systick latency is: 729.63us

Total Elapsed: time  353.83ms

Output file format (/tmp/layers_in_output.yaml)

Each layer gets the cycle range it ran with plus five timing fields:

# Date created: 2026-04-29 22:08:16

1:
  name: MLA_0/conv2d_add_relu_0
  start_cycle: 0
  end_cycle: 28435
  layer_latency: 44.84us
  run_time: 43.60us
  active_time: 28.43us
  l2_freeze: 14.88us
  iq_freeze: 3.50us
2:
  name: MLA_0/max_pool2d_1
  ...

Field	Source	Meaning
`layer_latency`	M4 systick measurement	Wall-clock time spent in the layer
`run_time`	Perf counter `run_time`	Layer execution time as reported by the perf counter
`active_time`	Perf counter `tile_active`	Time tiles were actively computing
`l2_freeze`	Perf counter	Time stalled waiting for L2
`iq_freeze`	Perf counter	Time stalled on instruction-queue sync

The summary lines printed to stdout (Sum perf ctr run_time / Sum m4 systick latency) are the totals across all layers in the file.
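To recompute such totals offline, the timing fields can be summed straight from the output file. The Python sketch below uses a line-based regular expression instead of a YAML library, assuming every timing value is written as `<number>us` exactly as in the sample above. `sum_timing_fields` is a hypothetical helper.

```python
import re

# Matches indented lines such as "  layer_latency: 44.84us"
FIELD = re.compile(
    r"^\s+(layer_latency|run_time|active_time|l2_freeze|iq_freeze):"
    r"\s*([0-9.]+)us\s*$"
)


def sum_timing_fields(text):
    """Sum each timing field (in µsec) across all layers in the file text."""
    totals = {}
    for line in text.splitlines():
        m = FIELD.match(line)
        if m:
            name, value = m.group(1), float(m.group(2))
            totals[name] = totals.get(name, 0.0) + value
    return totals


one_layer = """\
1:
  name: MLA_0/conv2d_add_relu_0
  start_cycle: 0
  end_cycle: 28435
  layer_latency: 44.84us
  run_time: 43.60us
"""
print(sum_timing_fields(one_layer))
```

Fed the full output file, the layer_latency and run_time totals should be comparable to the two summary lines on stdout.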

M4 debug config (-y)

-y FILE (--yaml-config) loads a YAML file that configures the M4 debug context — log level, FIFO mode, GPIO/UART/profile enables, deadlock checking, and the Zebu flag. Without -y, mla-rt uses hard-coded defaults.

Use the sample below as a template:

debugConfig: # this node is mandatory
    # log level: disabled | info | verbose | debug | fifo-debug
    logLevel : disabled
    # fifo level: no-fifo | write | read-write
    fifoLevel : write
    # boolean flags
    gpioEn : false
    deadlockEn : true
    uartEn : false
    profileEn : false
    zebu : false

Key	Type	Allowed values	Effect
`logLevel`	enum	`disabled`, `info`, `verbose`, `debug`, `fifo-debug`	M4 driver log verbosity
`fifoLevel`	enum	`no-fifo`, `write`, `read-write`	Which fifo channels are enabled
`gpioEn`	bool	`true`/`false`	Enable GPIO debug
`deadlockEn`	bool	`true`/`false`	Enable deadlock detection (default `true`)
`uartEn`	bool	`true`/`false`	Enable UART output
`profileEn`	bool	`true`/`false`	Enable profiling hooks
`zebu`	bool	`true`/`false`	Set when running against Zebu emulator
`m4SemaphoreOp`	uint32	(numeric)	Optional — M4 semaphore operation override

Recipe 

modalix:~$ ./mla-rt -vss -y /path/to/mlart-default-cfg.yaml \
                    $DIR/resnet_50_stage1_mla.elf
Reading m4 debug configuration from: /path/to/mlart-default-cfg.yaml

m4_context_config.yaml content:
  logLevel   : disabled
  fifoLevel  : write
  gpioEn     : false
  deadlockEn : true
  uartEn     : false
  profileEn  : false
  zebu       : false
[   0/   1]setup_model /mnt/.../resnet_50_stage1_mla.elf
loadDramSegments:dt:   0.009s  dramB=25342080  mbps=2898.003
Total Elapsed: time   83.72ms
       tile.lat|    api.lat|  model.lat|   relo.lat|    run.lat| runmsg.lat|   init.lat|  dram.MB/s|
            676|        707|        697|         10|        692|          8|       7121|       2898|

mla-rt echoes the parsed config back so you can confirm what was applied. If a key is missing from your file, the corresponding default is kept.
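Before copying a config to the board, it can be checked against the allowed values in the table above. The Python sketch below validates an already-parsed `debugConfig` mapping (for example, one loaded with a YAML library of your choice). `validate_debug_config` is a hypothetical helper, and flagging unknown keys is this sketch's own policy; how mla-rt treats them is not documented here.

```python
ENUMS = {
    "logLevel": {"disabled", "info", "verbose", "debug", "fifo-debug"},
    "fifoLevel": {"no-fifo", "write", "read-write"},
}
BOOLS = {"gpioEn", "deadlockEn", "uartEn", "profileEn", "zebu"}


def validate_debug_config(cfg):
    """Return a list of problems found in a parsed debugConfig mapping."""
    problems = []
    for key, value in cfg.items():
        if key in ENUMS:
            if value not in ENUMS[key]:
                problems.append(f"{key}: {value!r} not in {sorted(ENUMS[key])}")
        elif key in BOOLS:
            if not isinstance(value, bool):
                problems.append(f"{key}: expected true/false, got {value!r}")
        elif key == "m4SemaphoreOp":
            # uint32; bool is excluded because it is an int subclass in Python
            ok = isinstance(value, int) and not isinstance(value, bool)
            if not ok or not 0 <= value < 2**32:
                problems.append(f"{key}: expected uint32, got {value!r}")
        else:
            problems.append(f"{key}: unknown key")
    return problems


print(validate_debug_config({"logLevel": "disabled", "deadlockEn": True}))
```

An empty list means every key present is within the documented range; missing keys are not reported because the defaults apply to them.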

Diagnostic arguments (-X)

Format: -X <opt>:<args>. The supported options are F (frequency) and H (HWIL register).

-X F:… — Set MLA PLL frequency 

DaVinci syntax: -X F:P:M (PLL P and M registers).

-X F:0x3:0x3b   # 500 MHz (DV)
-X F:0x2:0x35   # 600 MHz (DV)
-X F:0x2:0x3e   # 700 MHz (DV)
-X F:0x1:0x2f   # 800 MHz (DV)
-X F:0x1:0x35   # 900 MHz (DV)
-X F:0x1:0x3b   # 1000 MHz (DV)

Modalix syntax: -X F:FREQ_MHZ (range 500–1500):

-X F:933        # 933 MHz (MA)
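Because the two architectures take different syntaxes, a wrapper script can build the -X value from a target frequency. The Python sketch below uses only the DV register pairs listed above and the documented MA range. `freq_arg` and its `arch` strings are hypothetical, and DV frequencies outside the listed table are rejected because no P:M values are given for them here.

```python
# DaVinci PLL (P, M) pairs copied from the list above, keyed by MHz.
DV_PLL = {
    500: (0x3, 0x3B), 600: (0x2, 0x35), 700: (0x2, 0x3E),
    800: (0x1, 0x2F), 900: (0x1, 0x35), 1000: (0x1, 0x3B),
}


def freq_arg(arch, mhz):
    """Return the value to pass after -X for the given arch and frequency."""
    if arch == "ma":
        if not 500 <= mhz <= 1500:
            raise ValueError(f"MA frequency out of range 500-1500: {mhz}")
        return f"F:{mhz}"
    if arch == "dv":
        if mhz not in DV_PLL:
            raise ValueError(f"no documented P:M pair for {mhz} MHz")
        p, m = DV_PLL[mhz]
        return f"F:{p:#x}:{m:#x}"
    raise ValueError(f"unknown arch: {arch!r}")


print(freq_arg("ma", 933), freq_arg("dv", 500))  # F:933 F:0x3:0x3b
```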

Warning

Setting the frequency mutates board state until the next reset/reboot. After benchmarking at a non-default frequency, run again with the original frequency to restore.

-X H:0xVALUE — Set NoC HWIL register 

-X H:0x70   # set MA hwil (noc_cfg) to 0x70

(Modalix-only; ignored on DV.)

Multi-run and stats output 

-m N (default 1) runs the model N times without reloading. The reported table changes:

N = 1 → single row of values
N > 1 → 4 rows: avg, min, max, std, plus an extra gap.lat column (inter-run gap)

-M FILE (or --stats FILE) writes a YAML summary to FILE. With a trailing + (-M FILE+), it appends to an existing file instead. R7 output:

# Date created: 2026-04-29 21:35:08
# precision:dt:usec precision:0
/mnt/.../resnet_50_stage1_mla.elf:
    filesize: 27777336
    tile.lat:
        avg: 675
    api.lat:
        avg: 707
    model.lat:
        avg: 697
    relo.lat:
        avg: 10
    run.lat:
        avg: 691
    runmsg.lat:
        avg: 8
    init.lat:
        avg: 6921
    dram.MB/s:
        avg: 2885

For -m N > 1, each metric also gets min, max, std keys.

Argument constraints and interactions 

Constraint	Behavior
`-p`/`--per-layer-stats` + `-C`/`--cycle-count`	Fatal — mutually exclusive
`--checkfile FILE` without `-c`	Auto-enables `-c` (notice: `--checkfile specified ... adding -c`)
`-vs` set but no `-s`	Auto-enables `-s` (`-vs stats specified ... adding -s`)
`-t` + `--ifm`/`--ofm`	`-t` is auto-disabled (`--ifm --ofm turns off`)
`-e` more than 4 times	Fatal — max 4 perf counters
`--stats` without `-vs`	Warning: `--stats specified without log options -vs`
`-vP` without `-P`	Notice: `-vP policy specified but no -P policy set`

Unrecognised flags (e.g., --csv, --check-lm, --legacy, --sram, --dram-channel, -F) cause the full usage menu to be printed and exit.
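The constraints in the table can be checked before a command is sent to a board, which is handy in CI wrappers. The Python sketch below is a simplified pre-flight check over an argument list. `preflight` is a hypothetical helper, and it assumes each flag is its own token (for example `-e2` or `--stats FILE`); it does not handle clustered short flags or `--opt=value` forms.

```python
def _has(argv, short, long_name):
    """True if the short flag (possibly with an attached value) or long flag appears."""
    return any(
        tok == long_name or (tok.startswith(short) and not tok.startswith("--"))
        for tok in argv
    )


def _stats_verbosity(argv):
    """True if a -v token carries the 's' channel, e.g. -vs, -vss, -vs:msec:3."""
    return any(
        tok.startswith("-v") and not tok.startswith("--")
        and "s" in tok[2:].split(":")[0]
        for tok in argv
    )


def preflight(argv):
    """Return (fatal, notes) message lists for a proposed argument list."""
    fatal, notes = [], []
    if _has(argv, "-p", "--per-layer-stats") and _has(argv, "-C", "--cycle-count"):
        fatal.append("-p and -C are mutually exclusive")
    if sum(1 for t in argv if _has([t], "-e", "--perf-ctr")) > 4:
        fatal.append("more than 4 perf counters requested")
    if _has(argv, "-M", "--stats") and not _stats_verbosity(argv):
        notes.append("--stats without -vs: expect a warning")
    if _has(argv, "-k", "--checkfile") and not _has(argv, "-c", "--chk"):
        notes.append("--checkfile auto-enables -c")
    if _has(argv, "-t", "--gst-api") and (
        _has(argv, "-O", "--ifm") or _has(argv, "-o", "--ofm")
    ):
        notes.append("-t is auto-disabled by --ifm/--ofm")
    return fatal, notes


print(preflight(["-p", "layers.yaml", "-C", "64", "model.elf"]))
```

The fatal list mirrors the cases where mla-rt exits at parse time; the notes list mirrors the notices and warnings it prints while continuing.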

Recipes (with verified board output)

All recipes were run on a Modalix board against the following model and companions:

modalix:~$ MODEL=/mnt/.../resnet_50/elffiles/resnet_50_stage1_mla.elf
modalix:~$ IFM=$(dirname $MODEL)/resnet_50_stage1_mla.ifm.mlc
modalix:~$ CHK=$(dirname $MODEL)/resnet_50_stage1_mla.ofm_chk.mlc

R1 — Default (no flags) — minimum output 

modalix:~$ ./mla-rt $MODEL
[   0/   1]setup_model /mnt/.../resnet_50_stage1_mla.elf

The default verbosity prints only the setup line. Stats are gated behind -vs / -vss.

R2 — Single run with full stats (-vss)

modalix:~$ ./mla-rt -vss $MODEL
[   0/   1]setup_model /mnt/.../resnet_50_stage1_mla.elf
loadDramSegments:dt:   0.009s  dramB=25342080  mbps=2884.039
Total Elapsed: time   82.31ms
Model stats: /mnt/.../resnet_50_stage1_mla.elf
output: dt:usec precision:0
       tile.lat|    api.lat|  model.lat|   relo.lat|    run.lat| runmsg.lat|   init.lat|  dram.MB/s|
----------------------------------------------------------------------------------------------------
            676|        709|        698|         10|        693|          8|       7240|       2884|
----------------------------------------------------------------------------------------------------

R3 — Quiet (-q)

modalix:~$ ./mla-rt -q $MODEL
...shush...
[   0/   1]setup_model /mnt/.../resnet_50_stage1_mla.elf

R4 — Dry run (-Y) — parse + allocate, don’t run 

modalix:~$ ./mla-rt -Y $MODEL
[   0/   1]setup_model /mnt/.../resnet_50_stage1_mla.elf

R5 — Multi-run (-m 4) — see jitter 

modalix:~$ ./mla-rt -vss -m 4 $MODEL
       tile.lat|    api.lat|  model.lat|    gap.lat|   relo.lat|    run.lat| runmsg.lat|   init.lat|  dram.MB/s|
----------------------------------------------------------------------------------------------------------------
avg         676|        696|        690|          6|          5|        688|          6|       5315|       2888|
min         676|        686|        683|          3|          1|        682|          4|       4040|       2887|
max         677|        710|        699|          7|         10|        694|          8|       7361|       2887|
std           0|         13|          9|          2|          5|          6|          2|       1889|          0|

R6 — --checkfile without --ifm → expected FAIL 

modalix:~$ ./mla-rt -vss --checkfile $CHK $MODEL
...
Test::              ofm.b0::FAILED
Total Elapsed: time   93.21ms
        api.lat|  model.lat|   relo.lat|    run.lat| runmsg.lat|  dram.MB/s|
----------------------------------------------------------------------------
          11600|        699|         10|        693|          9|       2901|

--checkfile auto-enables -c, but the model runs against a zero-filled IFM and the output doesn’t match. For the PASS path, see Checkfile validation.

R7 — YAML stats (-M)

See Multi-run and stats output above for the YAML format.

R8b — Custom IFM and OFM 

modalix:~$ ./mla-rt -vss --ifm ifm.b0:$IFM --ofm ofm.b0::$CHK $MODEL
       tile.lat|    api.lat|  model.lat|   relo.lat|    run.lat| runmsg.lat|   init.lat|  dram.MB/s|
----------------------------------------------------------------------------------------------------
            675|        720|        709|         22|        691|          8|       7241|       2882|

The empty middle field in ofm.b0::$CHK means “no init data”, and $CHK is the validation checkfile.

R9 — Performance counters 

See Performance counters for the recommended -e2 -e5 -e6 -e7 combo and the -e0 -e1 variant.

R20 — Read 32 B of IFM as decimals, 16 per row 

modalix:~$ ./mla-rt -vss -r "ifm.b0:32:%d1@16" $MODEL
mlaZoneIFM:ifm.b0[0x1182000000]@32
0x1182000000:  86 -62 -101 -52 -23  80 -13 -105 -43  25  80  58 126 -80  97 -114
0x1182000010:  14 -58 -77  40 -50  15 -62 -90  63 -18  45  43 113 -20 -72  34

R21 — Dump OFM to a binary file 

modalix:~$ ./mla-rt -vss --ifm ifm.b0:$IFM -r "ofm.b0::%b:/tmp/ofm.bin" $MODEL
mlaZoneOFM:ofm.b0[0x1182040000]@1008 => /tmp/ofm.bin
modalix:~$ ls -la /tmp/ofm.bin
-rw-r--r-- 1 root root 1008 Apr 29 21:38 /tmp/ofm.bin

R22 — DDR performance counters (-Q)

modalix:~$ ./mla-rt -vss -Q $MODEL
       tile.lat|    api.lat|  model.lat|   relo.lat|    run.lat| runmsg.lat|   init.lat|  dram.MB/s|
            677|        925|        914|         10|        909|          8|       6681|       2885|

(Note model.lat rises ≈700→914 µsec — DDR counter overhead.)

R25 — Section checksums (-f) — slow 

modalix:~$ ./mla-rt -vss -f $MODEL
loadDramSegments:dt:   0.818s  dramB=25342080  mbps=  30.976
...
Total Elapsed: time  889.66ms

(Load time rises from 0.009 s to 0.818 s, roughly 90×, and total elapsed time rises about 10×, from ~82 ms to ~890 ms. dram.MB/s collapses from ~2900 to 31 because the checksum computation is mixed in.)

R26 — Memory allocator (--mem-lib sdram)

modalix:~$ ./mla-rt -vss --mem-lib sdram $MODEL
arch.memlib: SDramAllocator
[   0/   1]setup_model ...

R27 — Version 

modalix:~$ ./mla-rt -V
mla_rt git revision: <revision>

Stale --help entries 

The usage() menu lists several flags that the current parser does not accept:

Listed in --help	Actual status
`--csv`	Not parsed — passing it dumps the usage menu and exits
`--check-lm`	Not parsed
`--legacy`	Not parsed
`--sram N`	Not parsed (named entry exists in menu but no long option)
`--dram-channel N`	Not parsed
`-F err_t,...`	Not a top-level flag; `F` only exists as a `-X` sub-token (frequency)
`mla-poke`	Documentation note, not a flag
`mem-lib` (without `--`)	The actual flag is `-B` / `--mem-lib`
`--arch`	DEPRECATED — use `-a`/`--connect`

Troubleshooting 

Error opening file dump.log : Permission denied (with -l): Logging tries to write dump.log in cwd. If you cd into a read-only mount, the open fails. Workarounds: cd /tmp before running, or use -L (dumps the log to stdout, no file).
parse_file:ERROR:cannot open <FILE> from --ofm: -o NAME:FILE:CHK requires FILE to exist as init data. To skip init, leave the field empty: -o NAME::CHK.
Test:: ofm.b0::FAILED with default (no --ifm): The model runs against zero-filled (or default) IFM, so the OFM doesn’t match the golden checkfile. Pass the matching --ifm.
Setting -X F:... on the wrong arch: DV uses F:P:M; MA uses F:FREQ_MHZ (500–1500). Mismatched syntax fails with a parsing error.
-z doesn’t behave like a no-arg flag: The parser declares -z as a required argument but the case body uses no value. Pass any token (e.g., -z 1) until this is fixed.

Appendix: verbatim sub-help output 

Captured from the board.

-hc — Ctrl-C handling 

MLASignal allows control of SIGHUP(1) SIGINT(2) SIGQUIT(3) SIGTERM(15) :
options set as follows:  [--ctrl-c|-b]  01vsycre :
c                           : immediately do gst.close
e                           : exit(1)
no|off                      : turn off sighandler
r                           : schedule gst to halt further processing
s                           : after interrupt sleep 1 sec then return
y                           : ask Y/N to continue interrupt
examples:                   :
--ctrl-c ys                 : ask y/n ; sleep 1s
--ctrl-c r                  : request to halt further processing

-hl — Logger 

Logger options set as follows:
-vXYZ where XYZ are:
  a                           : dma process logging
  A                           : allocation process
  b                           : turn on progress bar
  c                           : check file logging: -c must be on
  d                           : generic debug diagnostics
  F                           : frames processing
  g                           : gstapi and remote proc diagnostics
  h                           : help menu i.e. usage()
  l                           : loader logging
  m                           : m4 driver log (1:wm 2:m 4:dm)
  n                           : show notes: -vn-   turn off
  o=FILE                      : output to FILE
  o=syslog                    : output to syslog facility
  p                           : payload transactions
  P                           : enable policies
  q                           : quiet = (all masks zeroed)
  r                           : relocation processing
  s                           : show all statistics
  S                           : show .lm/.elf dma and tile sizes
  u                           : verify/fix lm program overrun
  v                           : set verbosity: v=1(-vcxwt) v=2(v=1 + lstu) ...
  x                           : log run progress
  y                           : dump symbol table(s)
  z                           : set all flags
  :N                          : set FP precision 42.nnnn
  :nsec|usec|msec|sec         : set time unit
  Generic diagnostics         :
  e                           : ERROR diagnostics: -ve- turn off
  f                           : FATAL diagnostics: -vf- turn off
  t                           : TODO diagnostics
  w                           : WARN diagnostics
  examples                    : -vv=2 or -v -v :set verbose = 2 ; verbose is a compound qualifier
                              : -vxr=2         :enable run=1 and relocations=2
                              : -vsw=0,m=0x7   :enable stats, disable warnings, set m4_driver=wm|m|dm
                              : -vs:msec:3     :enable stats at dt=millisec fp.precision=3
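
The `m` logger flag takes a bitmask, and the `-vsw=0,m=0x7` example above sets all three bits. As a rough illustration only (this is not part of mla-rt, and the helper name `decode_m4_mask` is hypothetical), the mapping `1:wm 2:m 4:dm` from the help text can be decoded like this:

```python
# Bit values copied from the help text: "m4 driver log (1:wm 2:m 4:dm)".
# The helper below is illustrative only; it does not call mla-rt.
M4_LOG_BITS = {1: "wm", 2: "m", 4: "dm"}


def decode_m4_mask(mask: int) -> str:
    """Return the m4 driver log names selected by a bitmask, joined with '|'."""
    names = [name for bit, name in M4_LOG_BITS.items() if mask & bit]
    return "|".join(names) if names else "none"


if __name__ == "__main__":
    print(decode_m4_mask(0x7))  # all three bits set
    print(decode_m4_mask(0x5))  # bits 1 and 4 only
```

Under this reading, `m=0x7` selects `wm|m|dm`, which matches the description in the example line.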

-hm — Memory dump (full reference)

For reads from DRAM/L1/L2/SRAM/OCM the general form is : -r EXPR:SIZE:FILE:FMT
For writes the general form is      : -w EXPR:SIZE:FILE
                                    : EXPR can be a constant, symbol or l1(.) l2(.) addresses described below
                                    : SIZE defaults to 4B and can be inferred
                                    : For FILE=sha1 we generate a SHA1 key
                                    : For FMT: %<FILL>abcdfostux<TSIZE><MODIFIERS>
                                    : where FILL: zero-filled field width: ex %04 pads each value with zeros to width 4
                                    : a:dma  b:binary  c:character  d:decimal  f:float  o:octal  s:string  t:tile-instr  u:unsigned  x:hex
TSIZE                               : x1: one hex byte  d2: two byte decimal uint16  u8: 8byte unsigned
MODIFIERS                           : n: NO ADDRESS  r: reverse 16B row  m: m4_sym  @rowsize: bytes/row

  -r addr[:n][:FILE][:FMT]            : read dram[addr] for len n
  -r sym[:n][:FILE][:FMT]             : read sym (ofm|ifm|lora|...)
  -r sym::sha1                        : SHA1 of sym (len inferred)
  -r l1(x,y,addr)[:n][:FILE][:FMT]    : read L1 at tile (x,y)
  -r l2(unit,addr)[:n][:FILE][:FMT]   : read L2 at unit=n0,e2 row addr
  -r m4.list                          : print M4 driver symbols
  -r m4(sym)[:n][:FILE][:FMT]         : read M4 driver symbol
  -r *                                : report all symbols in symbol table
  -w addr:n[:data|FILE]               : write data or FILE to dram[addr]@n

format examples:
  -r 0:2r:%d1   → 0x000000:  10   0   0  20  31  32   3 -43 ...
  -r 0:8:%x1    → 0x000000:  a  0  0 14 1f 20  3 d5
  -r 0:16:%02x1 → 0x000000: 0a 00 00 14 1f 20 03 d5 00 00 00 00 00 00 00 00
  -r 0:16:%02x1r → 0x000000: 00 00 00 00 00 00 00 00 d5 03 20 1f 14 00 00 0a
  -r 0:16:%u4n  → 335544330 3573751839        0        0
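
To make the format examples easier to read, here is a small Python sketch that decodes the same 16 bytes shown in the `-r 0:16:%02x1` line. It does not call mla-rt. The function names are hypothetical, and the little-endian byte order for `u4` is an inference from the example values rather than something the help text states.

```python
import struct

# The 16 bytes at address 0, copied from the "-r 0:16:%02x1" example above.
DATA = bytes.fromhex("0a0000141f2003d5") + bytes(8)


def as_signed_bytes(buf: bytes) -> list:
    """d1: each byte read as a signed 8-bit decimal."""
    return list(struct.unpack(f"{len(buf)}b", buf))


def as_hex_bytes(buf: bytes) -> str:
    """02x1: each byte as two zero-filled hex digits."""
    return " ".join(f"{b:02x}" for b in buf)


def as_reversed_row(buf: bytes) -> str:
    """02x1 with the r modifier: the 16-byte row printed in reverse order."""
    return as_hex_bytes(buf[::-1])


def as_u32(buf: bytes) -> list:
    """u4: unsigned 32-bit words (little-endian assumed from the example)."""
    return list(struct.unpack(f"<{len(buf) // 4}I", buf))


if __name__ == "__main__":
    print(as_signed_bytes(DATA[:8]))
    print(as_hex_bytes(DATA))
    print(as_reversed_row(DATA))
    print(as_u32(DATA))
```

The decoded values agree with the examples: the byte `d5` reads as -43 when signed, and the first two 32-bit words are 335544330 and 3573751839.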

-hP — Policies 

Policies respond to driver or MLA errors and timeouts and attempt to correct them :
options set as follows:  -policy  or -P x,y:n,z=1.0 :
  -P off                      : disable policy options
  a                           : enable all policy options
  r                           : enable retry on failure
  d                           : enable driver restart on failure
  l                           : enable driver reload

  Policies have a self test mode m|m4|mla described below :
  mla,m4,f=0.7                : enable random mla/m4 failure at rate 0.7
  mla:N                       : generate mla failure N = { 1..5 }
  m4:N                        : generate m4 error 6..18 (not all implemented)

  examples                    : -vP or -vP+ for verbose output
  -P r+                       : retry 2x
  -P rd+l                     : retry, driver restart 2x, then reload
  -P a,m4,f=1                 : all policies + random m4 failure rate 1.0
  -P a,m4:7:20:2              : generate m4:7=ERR_CODE_FATAL_DMA:unit:err
  -P a,mla                    : random mla errors at 0.5
  -P a,m4:18:128              : m4 interrupt sig=128

  mla error codes:
  mla:1                       : dmaDramToM4: write payload failed
  mla:2                       : MLA file not found
  mla:3                       : M4 done timeout
  mla:4                       : generic sendCommand failure
  mla:5                       : checkfile failed to match
  m4 error codes:
  m4:7:unit:err               : DMA Fatal interrupt
  m4:8:err                    : MIM Fatal interrupt
  m4:9:row:col                : TLC Fatal interrupt
  m4:10:err                   : uC Fatal interrupt
                              : m4:6,11-17 not implemented (will be generated by -P a,m4)
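
The positional fields after an m4 code differ per code, which is easy to misread in a spec such as `m4:7:20:2`. The sketch below is a hypothetical helper, not mla-rt code. It maps the four documented codes (7 to 10) to the field names in the list above, and it assumes the fields are decimal integers, which the help text does not confirm.

```python
# Field names per code, taken from the "m4 error codes" list above.
# Only codes 7..10 are covered; the help text marks 6 and 11-17 as not implemented.
M4_CODES = {
    7: ("DMA Fatal interrupt", ("unit", "err")),
    8: ("MIM Fatal interrupt", ("err",)),
    9: ("TLC Fatal interrupt", ("row", "col")),
    10: ("uC Fatal interrupt", ("err",)),
}


def describe_m4(spec: str) -> dict:
    """Split a self-test spec such as 'm4:7:20:2' into named fields."""
    parts = spec.split(":")
    if len(parts) < 2 or parts[0] != "m4":
        raise ValueError(f"not an m4 spec: {spec!r}")
    code = int(parts[1])
    if code not in M4_CODES:
        raise ValueError(f"m4:{code} is not covered by this sketch")
    name, fields = M4_CODES[code]
    values = [int(p) for p in parts[2:]]  # assumption: decimal fields
    if len(values) != len(fields):
        raise ValueError(f"m4:{code} expects fields {fields}")
    return {"code": code, "name": name, **dict(zip(fields, values))}


if __name__ == "__main__":
    print(describe_m4("m4:7:20:2"))
```

Read this way, `-P a,m4:7:20:2` requests a DMA Fatal interrupt with unit 20 and err 2.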