mla-rt User Guide

mla-rt is the MLA (Machine Learning Accelerator) runtime CLI. It loads a compiled model (*.elf or *.lm), runs it on the MLA hardware, and reports performance and correctness statistics.

Overview and deployment

mla-rt is an aarch64 binary that runs on an MLA-equipped board. Every supported board ships mla-rt pre-installed at /usr/bin/mla-rt, so after connecting via SSH (ssh root@<board>), you can run it directly with no setup. This is the right choice for routine model bring-up, performance measurements, and CI-style runs.

A typical session uses an SSH alias or hostname for the board, for example ssh root@modalix.

Models live under /mnt/mla_test_runner_models/..., for example /mnt/mla_test_runner_models/ma/2025-05-05-regression-tests/image_classification/resnet_50/elffiles/resnet_50_stage1_mla.elf.

A model directory typically contains:

File                      Purpose
------------------------  -------
*_stage1_mla.elf          Compiled MLA binary (the model)
*_stage1_mla.ifm.mlc      Reference IFM (input feature map)
*_stage1_mla.ofm_chk.mlc  OFM checkfile (golden output for validation)
*_stage1_mla.mlc          The model in MLC text format (not frame data)
*_stage1_mla_stats.yaml   Reference stats

Note

The recipes on this page were captured on a Modalix (MA) board, so DaVinci-only flags (-X F:0xN:0xN) and Zebu/VDK modes do not apply there.

Quick start

modalix:~$ ./mla-rt -vss /mnt/.../resnet_50_stage1_mla.elf
[   0/   1]setup_model /mnt/.../resnet_50_stage1_mla.elf
loadDramSegments:dt:   0.009s  dramB=25342080  mbps=2884.039
Total Elapsed: time   82.31ms
Model stats: /mnt/.../resnet_50_stage1_mla.elf
output: dt:usec precision:0
       tile.lat|    api.lat|  model.lat|   relo.lat|    run.lat| runmsg.lat|   init.lat|  dram.MB/s|
----------------------------------------------------------------------------------------------------
            676|        709|        698|         10|        693|          8|       7240|       2884|
----------------------------------------------------------------------------------------------------

-vss = -v (verbose) with sub-arg ss (stats level 2). Without a stats option (-vs or -vss), the table is suppressed and only the setup_model line is printed.

-vs enables the table; -vss adds the runmsg.lat, init.lat, and dram.MB/s columns.

The performance table

Column meanings with their respective units:

Column      Unit    Meaning
----------  ------  -------
tile.lat    µsec    Aggregate tile execution time reported by the MLA hardware
api.lat     µsec    End-to-end runtime of the user-facing API call
model.lat   µsec    Time spent in mla_run_model (host side)
gap.lat     µsec    Multi-run only: gap between successive model invocations
relo.lat    µsec    Relocation processing
run.lat     µsec    Pure model execution (M4-side)
runmsg.lat  µsec    Time to send the run message to M4. Shown when stats >= 2
init.lat    µsec    One-time MLA initialisation. Shown when stats >= 2
dram.MB/s   MB/sec  DRAM bandwidth measured during model load. Shown when stats >= 2

For multi-run (-m N > 1), each column gets four rows: avg, min, max, std (see the R5 recipe below).

The default time scale is microseconds with precision 0 (output: dt:usec precision:0). Override with the :nsec|:usec|:msec|:sec and :N (digits) suffixes on -v, for example -vs:msec:3 shows stats in milliseconds with 3 decimals.
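When results are collected by a script, the single-run table can be turned into a dictionary. The Python sketch below is not part of mla-rt: `parse_stats_table` is a hypothetical helper, and it assumes the header row and the value row are the only lines containing `|`, as in the Quick start output. Multi-run tables (rows labelled avg/min/max/std) are not handled.

```python
def parse_stats_table(text):
    """Return {column: value} for a single-run mla-rt stats table."""
    # Only the header row and the value row contain '|'; the dashed rules do not.
    rows = [line for line in text.splitlines() if "|" in line]
    if len(rows) < 2:
        raise ValueError("no stats table found")
    header = [cell.strip() for cell in rows[0].split("|") if cell.strip()]
    values = [cell.strip() for cell in rows[1].split("|") if cell.strip()]
    if len(header) != len(values):
        raise ValueError("header/value column mismatch")
    return dict(zip(header, (float(v) for v in values)))


# Sample copied from the Quick start output on this page.
SAMPLE = """\
       tile.lat|    api.lat|  model.lat|   relo.lat|    run.lat| runmsg.lat|   init.lat|  dram.MB/s|
----------------------------------------------------------------------------------------------------
            676|        709|        698|         10|        693|          8|       7240|       2884|
----------------------------------------------------------------------------------------------------
"""

stats = parse_stats_table(SAMPLE)
# Host-side overhead on top of pure model execution, in the active time unit.
overhead = stats["api.lat"] - stats["run.lat"]
print(stats["tile.lat"], overhead)
```

Values are returned in whatever time unit the run used (µsec by default), so record the `output: dt:...` line alongside them if the unit may vary.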

Connection modes (-a / --connect)

By default, mla-rt scans /etc/buildinfo on the host to pick the build target. On aarch64 boards, this normally selects a65 (the on-board ARM CPU).

The argument is a comma-separated list of tokens. The usage menu lists them as:

-a [a65|dv|davinci|ma|michelangelo|mod|test|zebu1|zebu2|sw_mbox|dram|sdram|block|enable|disable|mip|vdk]

Architecture options

Option                   Meaning                 Where it works
-----------------------  ----------------------  --------------
dv / davinci             DaVinci silicon         DV boards (e.g., older mla-hl* generation)
ma / michelangelo / mod  Modalix silicon         MA boards
test                     Loopback / no hardware  Anywhere

Connection / transport options

Option         Meaning
-------------  -------
a65            On-board A65 ARM (default for aarch64 hosts)
sw_mbox        Software mailbox transport
zebu1 / zebu2  Zebu emulator (network access required)
vdk            Virtual development kit
mip            MIP transport mode

Memory allocator options

The allocator can also be selected with -B / --mem-lib.

Option  Meaning
------  -------
dram    DRAM allocator
sdram   SDRAM allocator (default)
block   Block allocator

Cache options

Option   Meaning
-------  -------
enable   Enable cache
disable  Disable cache

Examples:

-a dv,a65,sw_mbox     # Arch=DaVinci, Connect A65, use software mailbox
-a mod,zebu1          # Arch=Modalix, Connect to <zebu-emulator-server>
-a mod,block,disable  # Arch=Modalix, use block allocator, disable cache
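To sanity-check a -a string before sending it to a board, the tokens can be grouped by the four tables above. This Python sketch is illustrative only: `classify_connect` and `GROUPS` are hypothetical names, and the grouping simply mirrors this page rather than the parser inside mla-rt.

```python
# Token groups copied from the tables in this section.
GROUPS = {
    "arch": {"dv", "davinci", "ma", "michelangelo", "mod", "test"},
    "transport": {"a65", "sw_mbox", "zebu1", "zebu2", "vdk", "mip"},
    "allocator": {"dram", "sdram", "block"},
    "cache": {"enable", "disable"},
}


def classify_connect(arg):
    """Split a comma-separated -a argument into {group: [tokens]}."""
    result = {}
    for token in arg.split(","):
        token = token.strip()
        for group, members in GROUPS.items():
            if token in members:
                result.setdefault(group, []).append(token)
                break
        else:
            # No group matched: the token is not in the usage menu.
            raise ValueError(f"unknown -a token: {token!r}")
    return result


print(classify_connect("mod,block,disable"))
print(classify_connect("dv,a65,sw_mbox"))
```

Rejecting unknown tokens up front is a design choice for the sketch; how mla-rt itself reacts to an unknown token is not described here.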

Complete flag reference

Sorted by short flag (uppercase before lowercase).

Arg column: none = no argument, optional = optional argument, req = required argument. In the Default column, n/a means the flag has no default state.

Short  Long               Arg                   Default         Description
-----  -----------------  --------------------  --------------  -----------
-A     --labels           req FILE              (none)          Compute output labels via softmax against FILE
-a     --connect          req                   (auto)          Architecture / transport / allocator (see Connection modes)
-B     --mem-lib          req sdram|dram|block  sdram           Choose memory allocator (alternative to -a sdram|dram|block)
-b     --ctrl-c           req                   r               Ctrl-C handling. See -hc appendix
-C     --cycle-count      req N                 0               Set N bytes of m4.sram to zero (cycle-count instrumentation)
-c     --chk              none                  off             Validate OFM rows against the checkfile. See Checkfile validation
-D     --debug            none                  off             Enter MLAMemDebug mode
-d     --driver           req *.bin             mla_driver.elf  M4 driver binary (required for Q0 / Q0123 quad configs)
-e     --perf-ctr         req N                 (none)          Add performance counter N (max 4). See Performance counters
-f     --do-check         none                  off             Section checksum validation (slow: total elapsed time grows roughly 10×, see R25)
-G     --no-check-dtb     none                  DTB check ON    Skip /proc/device-tree/reserved-memory/ check for DRAM allocators
-g     --tmo-thr-fact     req THR:FACT          0:0             DMA timeout threshold and factor
-H     --m4-dasm          req *.dasm            (none)          mla_driver.dasm for M4 symbol resolution
-h     --help             optional aclmP        n/a             Help. -h=main; -ha=all sub-helps; -hc=ctrl-c; -hl=logger; -hm=mem dump; -hP=policies
-I     --interleaving     req                   (off)           Software interleaving. 2/3/4 = AXI ctrls; d = optimize DMA; m0123 = MIM placement; combinable e.g. -I4dm
-i     --ignore           none                  off             Continue past errors
-J     --model-timeout    req MS                3000.0          Per-run model timeout in milliseconds (Zebu uses this)
-j     --ocm              req STRAT:LIMIT       (none)          OCM placement (all|max|min); historically Zebu-only
-K     --reloc            req SYM:INIT          (none)          Add a relocation: create DramSection SYM initialised from INIT file. Repeatable
-k     --checkfile        req FILE              (none)          Use FILE as the OFM checkfile. Auto-enables -c
-L     --last-log         none                  off             Connect, dump M4 log to stdout, exit (no model load)
-l     --log              none                  off             Enable M4 logging during the run. Note: writes dump.log in cwd; from a read-only NFS mount this errors with Error opening file dump.log : Permission denied (see Troubleshooting)
-M     --stats            req FILE[+]           (none)          Write YAML stats to FILE (FILE+ appends to an existing file). See R7 for format
-m     --max-run-count    req N                 1               Run model N times without reloading. See Multi-run
-N     --no-reset         none                  M4 reset ON     Skip M4 reset before run
-n     --no-shr-sect      none                  sharing ON      Disable sharing of duplicate instructions across models (used for LLMs)
-O     --ifm              req NAME:FILE         (none)          Construct an IFM segment named NAME initialised from FILE. Repeatable
-o     --ofm              req NAME:FILE:CHK     (none)          Construct an OFM segment. FILE may be empty (NAME::CHK) for no init data; CHK is optional checkfile. Repeatable
-P     --policy           req                   (none)          Process MLAPolicies args. -Poff to disable. See -hP appendix
-p     --per-layer-stats  req FILE              (none)          Per-layer latency YAML (input). Output written to <base>_output.yaml. Mutually exclusive with -C. See Per-layer stats
-Q     --ddr-perf         none                  off             Use DDR performance counters (see R22: model.lat rises from ≈700 to 914 µsec because of counter overhead)
-q     (none)             none                  off             Quiet output (alias of -vq). Prints ...shush... and only the setup line
-r     --read             req                   (none)          Read DRAM/L1/L2/SRAM/OCM. See Memory access
-S     --seg-num          req N                 -1              Run only segment N
-s     --report           none                  off             Dump stats. Use --stats FILE for YAML output
-T     --write-addr       req *.yaml            (none)          Write address/value pairs to YAML file
-t     --gst-api          none                  off             Skip default IFM/OFM allocation (used with GST APIs). Auto-disabled if --ifm/--ofm is also passed
-V     --version          none                  n/a             Print git revision and exit
-v     --verbose          optional              (off)           Logger options. See Verbosity
-W     --no-rproc         none                  rproc ON        Disable remoteproc subsystem for M4 control
-w     --write            req                   (none)          Write to DRAM/L1/L2/SRAM/OCM. See Memory access
-X     --diag-arg         req                   (none)          Diagnostic arg (F: = freq, H: = HWIL). See Diagnostic arguments
-x     --max-threads      req N                 8               MAX_MULT_LOAD_THREADS: model load parallelism
-Y     --dryrun           none                  off             Parse and allocate, do not run
-y     --yaml-config      req FILE              (none)          M4 debug config (logLevel, fifoLevel, gpio/uart/profile enables). See M4 debug config
-Z     --dma-burst-sz     req SIZE              0               DMA burst size (Modalix-only)
-z     --en-alloc-gen     none                  off             Use ev74 (target generic) memory allocator. Note: the parser tags this as a required argument but uses no value; pass any token, e.g., -z 1

Long-only flag

Long    Description
------  -----------
--arch  DEPRECATED: replaced by -a/--connect. Do not use.

Positional arguments

After the option list, any remaining tokens are interpreted by extension:

  • *.elf or *.lm → executable model files (multiple allowed)

  • everything else → frame data files

Verbosity and logging (-v, -l, -L, -q)

-v takes an optional argument. The argument is parsed by the logger. Common forms:

Form          Effect
------------  ------
-v            Bare verbose; turns on default channels
-vs           Stats level 1 (basic table)
-vss          Stats level 2 (extra columns: runmsg.lat, init.lat, dram.MB/s)
-vc           Check-file logging
-vq           Quiet (same as -q)
-vsr          Stats + relocation processing
-v:msec:3     Time scale = ms, precision = 3 digits
-vsw=0,m=0x7  Stats on, warnings off, m4 driver mask = wm|m|dm
-vP / -vP+    Policy log output
-vz           Set ALL flags

The full grammar is -vXYZ where each letter is a one-char channel. See the -hl appendix for the channel list.

-l enables the M4 driver log (collected during run). -L connects, dumps the log to stdout, and exits (no load). -q is shorthand for -vq.
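For wrapper scripts, the common -v forms above can be assembled from parts. The Python sketch below is an illustration: `verbose_flag` is a hypothetical helper that covers only the stats level, extra channel letters, time unit, and precision shown on this page, not the full logger grammar (for example, `w=0` style assignments are not modelled).

```python
TIME_UNITS = ("nsec", "usec", "msec", "sec")


def verbose_flag(stats_level=0, channels="", unit=None, precision=None):
    """Build a -v argument such as -vss, -vssc or -vs:msec:3."""
    if stats_level not in (0, 1, 2):
        raise ValueError("stats_level must be 0, 1 or 2")
    if unit is not None and unit not in TIME_UNITS:
        raise ValueError(f"unsupported time unit: {unit!r}")
    # One 's' per stats level, followed by any other one-letter channels.
    flag = "-v" + "s" * stats_level + channels
    if unit is not None:
        flag += f":{unit}"
    if precision is not None:
        flag += f":{int(precision)}"
    return flag


print(verbose_flag(2))                             # full stats
print(verbose_flag(2, channels="c"))               # stats + check-file logging
print(verbose_flag(1, unit="msec", precision=3))   # stats in ms, 3 decimals
```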

Memory access (-r / -w)

General forms (see -hm appendix):

-r EXPR[:N][:FILE][:FMT]   # read
-w EXPR:N[:DATA|FILE]      # write

EXPR may be:

Form           Meaning
-------------  -------
0xADDR         DRAM address
sym            A model symbol: ifm, ifm.b0, ofm, ofm.b0, lora, …
sram(N)        M4 SRAM bank N
l1(x,y,addr)   L1 at tile (x,y)
l2(unit,addr)  L2 (e.g., l2(n0,0x100))
m4(sym)        M4 driver symbol
m4.list        Print all M4 driver symbols
*              All symbols (with -r only)

Suffixes for N and addresses: k = 1024, m = 1048576, r = 16 (one row).

FMT is %<FILL><TYPE><TSIZE><MODS>:

  • TYPE: a (dma), b (binary), c (char), d (decimal), f (float), o (octal), s (string), t (tile-instr), u (unsigned), x (hex)

  • TSIZE: 1, 2, 4, 8 byte width

  • MODS: n (no address prefix), r (reverse 16-byte row), m (m4_sym), @N (N bytes per row)

  • Special: FILE=sha1 produces a SHA-1 hash

Examples:

-r ifm.b0:32:%d1@16              # 32 B of IFM as decimal, 16 per row
-r ofm.b0::%b:/tmp/ofm.bin       # all OFM as binary to file
-w 0x30:8:0xdeadbeef             # write 8 B of 0xdeadbeef at 0x30
-w sram(0)::FILE                 # write FILE into sram(0)
-w 0x1d0000:32k:-1               # fill 32 KB at 0x1d0000 with -1
-w l2(n0,0x100):10r::0,1,2,3,4,5 # write 10 rows of literal data
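The k, m, and r suffixes can be expanded to byte counts when checking how much data a -r or -w expression will touch. This Python sketch is not taken from mla-rt: `parse_size` is a hypothetical helper, and it assumes a suffix may follow either a decimal or a 0x-prefixed value (the hex digits a-f never collide with k, m, or r).

```python
# Multipliers as documented above: k = 1024, m = 1048576, r = 16 (one row).
SUFFIXES = {"k": 1024, "m": 1048576, "r": 16}


def parse_size(token):
    """Convert a size such as '32k', '10r' or '0x100' to a byte count."""
    token = token.strip().lower()
    multiplier = 1
    if token and token[-1] in SUFFIXES:
        multiplier = SUFFIXES[token[-1]]
        token = token[:-1]
    # Base 0 lets int() accept both decimal and 0x-prefixed input.
    return int(token, 0) * multiplier


print(parse_size("32k"))    # size used in the 0x1d0000 fill example
print(parse_size("10r"))    # ten 16-byte rows
print(parse_size("0x100"))
```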

Performance counters (-e)

Up to 4 counters can be enabled per run; mla-rt exits if more are requested. Counter IDs and names:

ID  Name
--  ----
0   tiled_lat
1   tile_lat
2   tile_active
3   sync_frz/pwr
4   tile_pwr-c
5   sync_frz
6   iq_sync_frz
7   d_sync_frz

The hardware perf counters report in nanoseconds; the values shown in the table are scaled to the active time unit (default µsec). Each counter adds a column headed perf.id N with the counter’s name.

Common combination: -e2 -e5 -e6 -e7

This is the most useful set for understanding where MLA cycles are spent: tile activity (tile_active) plus the three freeze counters (sync_frz, iq_sync_frz, d_sync_frz). With those, tile_active + freeze ≈ tile.lat.

modalix:~$ ./mla-rt -vss -m 3 -e2 -e5 -e6 -e7 $MODEL
Perf counter 2 source "tile_active" added.
Perf counter 5 source "sync_frz" added.
Perf counter 6 source "iq_sync_frz" added.
Perf counter 7 source "d_sync_frz" added.
...
                                                                                                                  perf.id 2|  perf.id 5|  perf.id 6|  perf.id 7|
       tile.lat|    api.lat|  model.lat|    gap.lat|   relo.lat|    run.lat| runmsg.lat|   init.lat|  dram.MB/s|tile_active|   sync_frz|iq_sync_frz| d_sync_frz|
----------------------------------------------------------------------------------------------------------------------------------------------------------------
avg         676|        701|        693|          6|          7|        689|          7|       6094|       2890|        494|        182|          3|        182|
min         676|        686|        683|          4|          1|        682|          4|       4040|       2890|        494|        182|          3|        182|
max         677|        708|        696|          8|          9|        691|          8|       7121|       2890|        494|        183|          3|        183|
std         nan|          9|          6|          3|          3|          4|          2|       1266|          0|          0|        nan|          0|        nan|

Reading the avg row above: tile_active = 494 µs plus sync_frz = 182 µs ≈ tile.lat = 676 µs. The tile spent ~73% of the time executing and ~27% stalled on sync.
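The percentages quoted for the avg row can be reproduced with a few lines of arithmetic. In this Python sketch, `tile_breakdown` is a hypothetical helper; it treats sync_frz as the only stall source, which matches this particular run but is an assumption in general.

```python
def tile_breakdown(tile_active, sync_frz, tile_lat):
    """Split tile.lat into active and stalled shares (same unit for all inputs)."""
    return {
        "active_pct": 100.0 * tile_active / tile_lat,
        "stalled_pct": 100.0 * sync_frz / tile_lat,
        # Time not explained by the two counters; near zero when they add up.
        "unaccounted": tile_lat - (tile_active + sync_frz),
    }


# Values from the avg row of the -e2 -e5 -e6 -e7 run above (in µsec).
b = tile_breakdown(tile_active=494, sync_frz=182, tile_lat=676)
print(f"{b['active_pct']:.0f}% active, {b['stalled_pct']:.0f}% stalled, "
      f"{b['unaccounted']} unaccounted")
```

A large unaccounted remainder would suggest that another freeze counter is needed to explain the run.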

Counter id 0 / 1 example

modalix:~$ ./mla-rt -vss -m 3 -e0 -e1 $MODEL
       tile.lat|    api.lat|  model.lat|    gap.lat|   relo.lat|    run.lat| runmsg.lat|   init.lat|  dram.MB/s|  tiled_lat|   tile_lat|
avg         676|        703|        694|          6|          7|        690|          7|       6614|       2916|        676|        676|

For DDR-side counters use -Q.

Checkfile validation (-c / -vc)

-c (or --chk) asks mla-rt to compare the model’s OFM against a golden checkfile and report PASSED/FAILED. For the comparison to be meaningful, you must:

  1. Pass the matching IFM with --ifm <SYM>:<FILE> so the model runs on the input the checkfile was produced for.

  2. Tell mla-rt where the OFM lives (--ofm <SYM>::<CHK>); the empty FILE field means “no init data”.

  3. Use the symbol names declared in the model. For most regression-test models that’s ifm.b0 and ofm.b0.

If you specify --checkfile FILE instead of --ofm <SYM>::<CHK>, -c is auto-enabled and the checkfile is bound to the default OFM symbol. You still need --ifm to feed the right input; otherwise the model runs against a zero-filled IFM and reports FAILED (this is what R6 in Recipes shows).

-vc (or -vssc) turns on check-file logging, which prints the address range checked and an “N rows compared correctly, M rows were bad” summary in addition to the PASSED/FAILED line.

Recipe: PASS with -c -vss

modalix:~$ DIR=/mnt/.../resnet_50/elffiles
modalix:~$ ./mla-rt -c -vss --ifm ifm.b0:$DIR/resnet_50_stage1_mla.ifm.mlc \
                    --ofm ofm.b0::$DIR/resnet_50_stage1_mla.ofm_chk.mlc \
                    $DIR/resnet_50_stage1_mla.elf
[   0/   1]setup_model /mnt/.../resnet_50_stage1_mla.elf
loadDramSegments:dt:   0.009s  dramB=25342080  mbps=2872.757
Test::              ofm.b0::PASSED
Total Elapsed: time  205.43ms
       tile.lat|    api.lat|  model.lat|   relo.lat|    run.lat| runmsg.lat|   init.lat|  dram.MB/s|
            675|       1250|        710|         22|        692|          9|       7360|       2873|

api.lat jumps from ~700 µsec (without -c) to ~1250 µsec (with -c): the comparison takes time and is included in the API latency.

Recipe: PASS with -c -vssc (verbose check)

modalix:~$ ./mla-rt -c -vssc --ifm ifm.b0:$DIR/resnet_50_stage1_mla.ifm.mlc \
                    --ofm ofm.b0::$DIR/resnet_50_stage1_mla.ofm_chk.mlc \
                    $DIR/resnet_50_stage1_mla.elf
...
checkDramSection: DramSection:      ofm.b0:address:0x1182040000..0x11820403f0:size:   0x3f0:zone:0x5:  mlaZoneOFM:model:/mnt/.../resnet_50_stage1_mla.elf:refcnt:0x1
63 rows compared correctly, 0 rows were bad.
Test::              ofm.b0::PASSED

The two extra lines (one per checked DramSection, plus the row count) make -vssc the right verbosity when debugging a FAIL.
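In CI-style runs it is convenient to reduce the check output to a pass/fail verdict. The Python sketch below is an illustration: `parse_check_output` is a hypothetical helper, and its regular expressions are written against the two sample lines shown above, so other mla-rt versions may need different patterns.

```python
import re

# Patterns modelled on the sample output in this section.
TEST_RE = re.compile(r"^Test::\s+(\S+?)::(PASSED|FAILED)\s*$")
ROWS_RE = re.compile(r"^(\d+) rows compared correctly, (\d+) rows were bad\.")


def parse_check_output(text):
    """Return ({symbol: verdict}, (good_rows, bad_rows) or None)."""
    results, rows = {}, None
    for line in text.splitlines():
        match = TEST_RE.match(line)
        if match:
            results[match.group(1)] = match.group(2)
            continue
        match = ROWS_RE.match(line)
        if match:
            rows = (int(match.group(1)), int(match.group(2)))
    return results, rows


SAMPLE = """\
63 rows compared correctly, 0 rows were bad.
Test::              ofm.b0::PASSED
"""

results, rows = parse_check_output(SAMPLE)
all_passed = bool(results) and all(v == "PASSED" for v in results.values())
print(results, rows, all_passed)
```

The row counts are only present with -vc or -vssc; without check-file logging `rows` stays `None`.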

Per-layer stats (-p)

-p FILE runs the model layer by layer and emits per-layer latency and perf-counter data. The input file is a YAML mapping, keyed by layer index, in which each entry has name, start_cycle, and end_cycle. Most regression-test models ship one alongside the elf, for example resnet_50_stage1_mla_stats.yaml.

Output is written to <input-base>_output.yaml (the input is not overwritten). For example, passing /tmp/layers.yaml produces /tmp/layers_output.yaml. Pre-existing _output.yaml is overwritten.

-p is mutually exclusive with -C / --cycle-count; combining them is fatal at parse time.
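A script that post-processes the results needs to know where the output lands. This Python sketch derives the name from the rule stated above; `per_layer_output_path` is a hypothetical helper, and it assumes the input name has a normal extension such as .yaml (the behaviour for other names is not described on this page).

```python
from pathlib import PurePosixPath


def per_layer_output_path(input_path):
    """Map <dir>/<base>.yaml to <dir>/<base>_output.yaml."""
    path = PurePosixPath(input_path)
    return str(path.with_name(path.stem + "_output" + path.suffix))


print(per_layer_output_path("/tmp/layers.yaml"))
print(per_layer_output_path("/tmp/layers_in.yaml"))
```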

Input file format

0:
  name: MLA_0/placeholder_0_0
  start_cycle: 0
  end_cycle: 0
1:
  name: MLA_0/conv2d_add_relu_0
  start_cycle: 0
  end_cycle: 28435
2:
  name: MLA_0/max_pool2d_1
  start_cycle: 28436
  end_cycle: 31265
...

Index 0 is a placeholder and is skipped. Layers are run in order; cycle ranges drive how long each layer is allowed to execute.

Recipe

Note

Always pass a copy of the stats yaml. Even though mla-rt writes to a separate _output.yaml, copying first protects against accidents (e.g., invoking with -p $MODEL_DIR/foo_stats.yaml from a writable mount).

modalix:~$ cp $DIR/resnet_50_stage1_mla_stats.yaml /tmp/layers_in.yaml
modalix:~$ ./mla-rt -vss -p /tmp/layers_in.yaml \
                    --ifm ifm.b0:$DIR/resnet_50_stage1_mla.ifm.mlc \
                    $DIR/resnet_50_stage1_mla.elf
[   0/   1]setup_model /mnt/.../resnet_50_stage1_mla.elf
loadDramSegments:dt:   0.009s  dramB=25342080  mbps=2887.514
Per layer stats output file name:
/tmp/layers_in_output.yaml

Sum perf ctr run_time is: 728.98us
Sum m4 systick latency is: 729.63us

Total Elapsed: time  353.83ms

Output file format (/tmp/layers_in_output.yaml)

Each layer gets the cycle range it ran with plus five timing fields:

# Date created: 2026-04-29 22:08:16

1:
  name: MLA_0/conv2d_add_relu_0
  start_cycle: 0
  end_cycle: 28435
  layer_latency: 44.84us
  run_time: 43.60us
  active_time: 28.43us
  l2_freeze: 14.88us
  iq_freeze: 3.50us
2:
  name: MLA_0/max_pool2d_1
  ...

Field          Source                    Meaning
-------------  ------------------------  -------
layer_latency  M4 systick measurement    Wall-clock time spent in the layer
run_time       Perf counter run_time     Layer execution time as reported by the perf counter
active_time    Perf counter tile_active  Time tiles were actively computing
l2_freeze      Perf counter              Time stalled waiting for L2
iq_freeze      Perf counter              Time stalled on instruction-queue sync

The summary lines printed to stdout (Sum perf ctr run_time / Sum m4 systick latency) are the totals across all layers in the file.
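The same totals can be recomputed from the output file, which is handy when only the YAML is archived. The Python sketch below avoids a YAML dependency by matching the `field: <number>us` lines directly; `sum_timing_fields` is a hypothetical helper, and the parsing assumes the exact layout shown in the output example above. The sample contains only the one layer reproduced on this page.

```python
import re

# Matches indented lines such as "  layer_latency: 44.84us".
FIELD_RE = re.compile(r"^\s+(\w+):\s+([0-9.]+)us\s*$")


def sum_timing_fields(text):
    """Sum every '<field>: <value>us' entry across all layers, in µsec."""
    totals = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            name, value = match.group(1), float(match.group(2))
            totals[name] = totals.get(name, 0.0) + value
    return totals


SAMPLE = """\
1:
  name: MLA_0/conv2d_add_relu_0
  start_cycle: 0
  end_cycle: 28435
  layer_latency: 44.84us
  run_time: 43.60us
  active_time: 28.43us
  l2_freeze: 14.88us
  iq_freeze: 3.50us
"""

totals = sum_timing_fields(SAMPLE)
print(totals)
```

Cycle fields are skipped because they carry no `us` suffix; on a full file, `totals["run_time"]` and `totals["layer_latency"]` should be close to the two summary lines.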

M4 debug config (-y)

-y FILE (--yaml-config) loads a YAML file that configures the M4 debug context: log level, FIFO mode, GPIO/UART/profile enables, deadlock checking, and the Zebu flag. Without -y, mla-rt uses hard-coded defaults.

A sample configuration is shown below; use it as a template:

debugConfig: # this node is mandatory
    # log level: disabled | info | verbose | debug | fifo-debug
    logLevel : disabled
    # fifo level: no-fifo | write | read-write
    fifoLevel : write
    # boolean flags
    gpioEn : false
    deadlockEn : true
    uartEn : false
    profileEn : false
    zebu : false

Key            Type    Allowed values                              Effect
-------------  ------  ------------------------------------------  ------
logLevel       enum    disabled, info, verbose, debug, fifo-debug  M4 driver log verbosity
fifoLevel      enum    no-fifo, write, read-write                  Which fifo channels are enabled
gpioEn         bool    true/false                                  Enable GPIO debug
deadlockEn     bool    true/false                                  Enable deadlock detection (default true)
uartEn         bool    true/false                                  Enable UART output
profileEn      bool    true/false                                  Enable profiling hooks
zebu           bool    true/false                                  Set when running against Zebu emulator
m4SemaphoreOp  uint32  (numeric)                                   Optional: M4 semaphore operation override

Recipe

modalix:~$ ./mla-rt -vss -y /path/to/mlart-default-cfg.yaml \
                    $DIR/resnet_50_stage1_mla.elf
Reading m4 debug configuration from: /path/to/mlart-default-cfg.yaml

m4_context_config.yaml content:
  logLevel   : disabled
  fifoLevel  : write
  gpioEn     : false
  deadlockEn : true
  uartEn     : false
  profileEn  : false
  zebu       : false
[   0/   1]setup_model /mnt/.../resnet_50_stage1_mla.elf
loadDramSegments:dt:   0.009s  dramB=25342080  mbps=2898.003
Total Elapsed: time   83.72ms
       tile.lat|    api.lat|  model.lat|   relo.lat|    run.lat| runmsg.lat|   init.lat|  dram.MB/s|
            676|        707|        697|         10|        692|          8|       7121|       2898|

mla-rt echoes the parsed config back so you can confirm what was applied. If a key is missing from your file, the corresponding default is kept.
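When several debug configurations are needed, generating the file avoids typos in the enum values. The Python sketch below is illustrative: `render_debug_config` is a hypothetical helper, the allowed values are copied from the key table above, and the boolean defaults mirror the sample template rather than mla-rt's internal defaults. The optional m4SemaphoreOp key is left out.

```python
LOG_LEVELS = ("disabled", "info", "verbose", "debug", "fifo-debug")
FIFO_LEVELS = ("no-fifo", "write", "read-write")
# Boolean keys with the values used in the sample template.
BOOL_DEFAULTS = {"gpioEn": False, "deadlockEn": True, "uartEn": False,
                 "profileEn": False, "zebu": False}


def render_debug_config(log_level="disabled", fifo_level="write", **flags):
    """Return the text of a debugConfig YAML file for -y."""
    if log_level not in LOG_LEVELS:
        raise ValueError(f"logLevel must be one of {LOG_LEVELS}")
    if fifo_level not in FIFO_LEVELS:
        raise ValueError(f"fifoLevel must be one of {FIFO_LEVELS}")
    unknown = set(flags) - set(BOOL_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown keys: {sorted(unknown)}")
    values = {**BOOL_DEFAULTS, **flags}
    lines = ["debugConfig: # this node is mandatory",
             f"    logLevel : {log_level}",
             f"    fifoLevel : {fifo_level}"]
    for key in BOOL_DEFAULTS:
        lines.append(f"    {key} : {'true' if values[key] else 'false'}")
    return "\n".join(lines) + "\n"


config_text = render_debug_config(log_level="verbose", uartEn=True)
print(config_text)
```

Write the returned text to a file and pass that file with -y; the echo that mla-rt prints at startup shows whether the values were applied.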

Diagnostic arguments (-X)

Format: -X <opt>:... The currently supported options are F (frequency) and H (HWIL register).

-X F:… (set MLA PLL frequency)

DaVinci syntax: -X F:P:M (PLL P and M registers).

-X F:0x3:0x3b   # 500 MHz (DV)
-X F:0x2:0x35   # 600 MHz (DV)
-X F:0x2:0x3e   # 700 MHz (DV)
-X F:0x1:0x2f   # 800 MHz (DV)
-X F:0x1:0x35   # 900 MHz (DV)
-X F:0x1:0x3b   # 1000 MHz (DV)

Modalix syntax: -X F:FREQ_MHZ (range 500–1500):

-X F:933        # 933 MHz (MA)

Warning

Setting the frequency mutates board state until the next reset/reboot. After benchmarking at a non-default frequency, run again with the original frequency to restore.
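The benchmark-then-restore pattern from the warning can be scripted so the restore step is not forgotten. This Python sketch only builds the two command lines for a Modalix board; `modalix_freq_args` and `benchmark_then_restore` are hypothetical helpers, and the original frequency must be supplied by the caller because this page does not state the board default.

```python
def modalix_freq_args(freq_mhz):
    """Return the -X arguments for a Modalix frequency in MHz (500 to 1500)."""
    if not 500 <= freq_mhz <= 1500:
        raise ValueError("Modalix frequency must be within 500..1500 MHz")
    return ["-X", f"F:{int(freq_mhz)}"]


def benchmark_then_restore(model, bench_mhz, original_mhz, binary="mla-rt"):
    """Build two argv lists: the measured run, then a run that restores the clock."""
    return [
        [binary, "-vss", *modalix_freq_args(bench_mhz), model],
        [binary, *modalix_freq_args(original_mhz), model],
    ]


# 'model.elf' and the 1000 MHz original value are placeholders for illustration.
for argv in benchmark_then_restore("model.elf", bench_mhz=933, original_mhz=1000):
    print(" ".join(argv))
```

Each list can be handed to `subprocess.run` on the board; running the second command even when the first one fails keeps the board in a known state.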

-X H:0xVALUE (set NoC HWIL register)

-X H:0x70   # set MA hwil (noc_cfg) to 0x70

(Modalix-only; ignored on DV.)

Multi-run and stats output

-m N (default 1) runs the model N times without reloading. The reported table changes:

  • N = 1 → single row of values

  • N > 1 → 4 rows: avg, min, max, std, plus an extra gap.lat column (inter-run gap)

-M FILE (or --stats FILE) writes a YAML summary of the run. With a trailing + (-M FILE+), it appends to an existing file instead. R7 output:

# Date created: 2026-04-29 21:35:08
# precision:dt:usec precision:0
/mnt/.../resnet_50_stage1_mla.elf:
    filesize: 27777336
    tile.lat:
        avg: 675
    api.lat:
        avg: 707
    model.lat:
        avg: 697
    relo.lat:
        avg: 10
    run.lat:
        avg: 691
    runmsg.lat:
        avg: 8
    init.lat:
        avg: 6921
    dram.MB/s:
        avg: 2885

For -m N > 1, each metric also gets min, max, std keys.
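The stats file is simple enough to read without a YAML library. The Python sketch below is a minimal indentation-based reader written for the layout shown above, not a general YAML parser; `parse_stats_yaml` is a hypothetical helper, and the sample is an excerpt of the R7 output.

```python
def parse_stats_yaml(text):
    """Parse nested 'key:' / 'key: value' lines into dictionaries of floats."""
    root = {}
    stack = [(-1, root)]  # (indent of the key that opened the mapping, mapping)
    for raw in text.splitlines():
        stripped = raw.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and the comment header
        indent = len(raw) - len(raw.lstrip())
        # Close mappings that are at the same or a deeper indent level.
        while stack[-1][0] >= indent:
            stack.pop()
        parent = stack[-1][1]
        if stripped.endswith(":"):
            child = {}
            parent[stripped[:-1]] = child
            stack.append((indent, child))
        else:
            key, value = stripped.split(": ", 1)
            parent[key] = float(value)
    return root


SAMPLE = """\
# Date created: 2026-04-29 21:35:08
# precision:dt:usec precision:0
/mnt/.../resnet_50_stage1_mla.elf:
    filesize: 27777336
    tile.lat:
        avg: 675
    api.lat:
        avg: 707
"""

parsed = parse_stats_yaml(SAMPLE)
model, metrics = next(iter(parsed.items()))
print(model, metrics["tile.lat"]["avg"], metrics["api.lat"]["avg"])
```

With -m N > 1 the same reader picks up the min, max, and std keys under each metric, since they sit at the same indent as avg.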

Argument constraints and interactions

Constraint                               Behavior
---------------------------------------  --------
-p/--per-layer-stats + -C/--cycle-count  Fatal: mutually exclusive
--checkfile FILE without -c              Auto-enables -c (notice: --checkfile specified ... adding -c)
-vs set but no -s                        Auto-enables -s (-vs stats specified ... adding -s)
-t + --ifm/--ofm                         -t is auto-disabled (--ifm --ofm turns off)
-e more than 4 times                     Fatal: max 4 perf counters
--stats without -vs                      Warning: --stats specified without log options -vs
-vP without -P                           Notice: -vP policy specified but no -P policy set

Unrecognised flags (e.g., --csv, --check-lm, --legacy, --sram, --dram-channel, -F) cause the full usage menu to be printed and exit.
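A launcher script can check the interactions above before it contacts a board. The Python sketch below is a simplified pre-flight check, not a reimplementation of the mla-rt parser: `preflight` is a hypothetical helper that looks only at whole argv tokens, so bundled short options and `--flag=value` forms are not recognised.

```python
def preflight(argv):
    """Return (fatal, notes) for the documented flag interactions."""
    def has(*names):
        return any(arg in names for arg in argv)

    fatal, notes = [], []
    if has("-p", "--per-layer-stats") and has("-C", "--cycle-count"):
        fatal.append("-p and -C are mutually exclusive")
    # Count both "-e 2" and "-e2" spellings, plus the long form.
    perf = [a for a in argv
            if a in ("-e", "--perf-ctr") or (a.startswith("-e") and a[2:].isdigit())]
    if len(perf) > 4:
        fatal.append("at most 4 perf counters (-e) are allowed")
    if has("-k", "--checkfile") and not has("-c", "--chk"):
        notes.append("-c will be auto-enabled by --checkfile")
    if has("-t", "--gst-api") and has("-O", "--ifm", "-o", "--ofm"):
        notes.append("-t will be auto-disabled by --ifm/--ofm")
    if has("-M", "--stats") and not any(a.startswith("-vs") for a in argv):
        notes.append("--stats without -vs triggers a warning")
    return fatal, notes


print(preflight(["-p", "in.yaml", "-C", "16", "model.elf"]))
print(preflight(["--checkfile", "chk.mlc", "model.elf"]))
```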

Recipes (with verified board output)

All recipes were run on a Modalix board against the following model and companions:

modalix:~$ MODEL=/mnt/.../resnet_50/elffiles/resnet_50_stage1_mla.elf
modalix:~$ IFM=$(dirname $MODEL)/resnet_50_stage1_mla.ifm.mlc
modalix:~$ CHK=$(dirname $MODEL)/resnet_50_stage1_mla.ofm_chk.mlc

R1: Default (no flags), minimum output

modalix:~$ ./mla-rt $MODEL
[   0/   1]setup_model /mnt/.../resnet_50_stage1_mla.elf

The default verbosity prints only the setup line. Stats are gated behind -v.

R2: Single run with full stats (-vss)

modalix:~$ ./mla-rt -vss $MODEL
[   0/   1]setup_model /mnt/.../resnet_50_stage1_mla.elf
loadDramSegments:dt:   0.009s  dramB=25342080  mbps=2884.039
Total Elapsed: time   82.31ms
Model stats: /mnt/.../resnet_50_stage1_mla.elf
output: dt:usec precision:0
       tile.lat|    api.lat|  model.lat|   relo.lat|    run.lat| runmsg.lat|   init.lat|  dram.MB/s|
----------------------------------------------------------------------------------------------------
            676|        709|        698|         10|        693|          8|       7240|       2884|
----------------------------------------------------------------------------------------------------

R3: Quiet (-q)

modalix:~$ ./mla-rt -q $MODEL
...shush...
[   0/   1]setup_model /mnt/.../resnet_50_stage1_mla.elf

R4: Dry run (-Y), parse and allocate without running

modalix:~$ ./mla-rt -Y $MODEL
[   0/   1]setup_model /mnt/.../resnet_50_stage1_mla.elf

R5: Multi-run (-m 4) to see jitter

modalix:~$ ./mla-rt -vss -m 4 $MODEL
       tile.lat|    api.lat|  model.lat|    gap.lat|   relo.lat|    run.lat| runmsg.lat|   init.lat|  dram.MB/s|
----------------------------------------------------------------------------------------------------------------
avg         676|        696|        690|          6|          5|        688|          6|       5315|       2888|
min         676|        686|        683|          3|          1|        682|          4|       4040|       2887|
max         677|        710|        699|          7|         10|        694|          8|       7361|       2887|
std           0|         13|          9|          2|          5|          6|          2|       1889|          0|

R6: --checkfile without --ifm → expected FAIL

modalix:~$ ./mla-rt -vss --checkfile $CHK $MODEL
...
Test::              ofm.b0::FAILED
Total Elapsed: time   93.21ms
        api.lat|  model.lat|   relo.lat|    run.lat| runmsg.lat|  dram.MB/s|
----------------------------------------------------------------------------
          11600|        699|         10|        693|          9|       2901|

--checkfile auto-enables -c, but the model runs against a zero-filled IFM and the output doesn’t match. For the PASS path, see Checkfile validation.

R7: YAML stats (-M)

See Multi-run and stats output above for the YAML format.

R8b: Custom IFM and OFM

modalix:~$ ./mla-rt -vss --ifm ifm.b0:$IFM --ofm ofm.b0::$CHK $MODEL
       tile.lat|    api.lat|  model.lat|   relo.lat|    run.lat| runmsg.lat|   init.lat|  dram.MB/s|
----------------------------------------------------------------------------------------------------
            675|        720|        709|         22|        691|          8|       7241|       2882|

The empty middle field in ofm.b0::$CHK means “no init data”, and $CHK is the validation checkfile.

R9: Performance counters

See Performance counters for the recommended -e2 -e5 -e6 -e7 combo and the -e0 -e1 variant.

R20: Read 32 B of IFM as decimals, 16 per row

modalix:~$ ./mla-rt -vss -r "ifm.b0:32:%d1@16" $MODEL
mlaZoneIFM:ifm.b0[0x1182000000]@32
0x1182000000:  86 -62 -101 -52 -23  80 -13 -105 -43  25  80  58 126 -80  97 -114
0x1182000010:  14 -58 -77  40 -50  15 -62 -90  63 -18  45  43 113 -20 -72  34

R21: Dump OFM to a binary file

modalix:~$ ./mla-rt -vss --ifm ifm.b0:$IFM -r "ofm.b0::%b:/tmp/ofm.bin" $MODEL
mlaZoneOFM:ofm.b0[0x1182040000]@1008 => /tmp/ofm.bin
modalix:~$ ls -la /tmp/ofm.bin
-rw-r--r-- 1 root root 1008 Apr 29 21:38 /tmp/ofm.bin

R22: DDR performance counters (-Q)

modalix:~$ ./mla-rt -vss -Q $MODEL
       tile.lat|    api.lat|  model.lat|   relo.lat|    run.lat| runmsg.lat|   init.lat|  dram.MB/s|
            677|        925|        914|         10|        909|          8|       6681|       2885|

(Note that model.lat rises from ≈700 to 914 µsec because of DDR counter overhead.)

R25: Section checksums (-f), slow

modalix:~$ ./mla-rt -vss -f $MODEL
loadDramSegments:dt:   0.818s  dramB=25342080  mbps=  30.976
...
Total Elapsed: time  889.66ms

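(Total elapsed time is ~10× longer, about 890 ms versus 82 ms in R2. The load phase itself slows from 0.009 s to 0.818 s, and dram.MB/s collapses from ~2900 to 31, because the checksum computation is mixed in.)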

R26: Memory allocator (--mem-lib sdram)

modalix:~$ ./mla-rt -vss --mem-lib sdram $MODEL
arch.memlib: SDramAllocator
[   0/   1]setup_model ...

R27: Version

modalix:~$ ./mla-rt -V
mla_rt git revision: <revision>

Stale --help entries

The usage() menu lists several flags that the current parser does not accept:

Listed in --help      Actual status
--------------------  -------------
--csv                 Not parsed: passing it dumps the usage menu and exits
--check-lm            Not parsed
--legacy              Not parsed
--sram N              Not parsed (named entry exists in menu but no long option)
--dram-channel N      Not parsed
-F err_t,...          Not a top-level flag; F only exists as a -X sub-token (frequency)
mla-poke              Documentation note, not a flag
mem-lib (without --)  The actual flag is -B / --mem-lib
--arch                DEPRECATED: use -a/--connect

Troubleshooting

Error opening file dump.log : Permission denied (with -l)

Logging tries to write dump.log in cwd. If you cd into a read-only mount, the open fails. Workarounds: cd /tmp before running, or use -L (dumps the log to stdout, no file).

parse_file:ERROR:cannot open <FILE> from --ofm

-o NAME:FILE:CHK requires FILE to exist as init data. To skip init, leave the field empty: -o NAME::CHK.

Test:: ofm.b0::FAILED with default (no --ifm)

The model runs against zero-filled (or default) IFM, so the OFM doesn’t match the golden checkfile. Pass the matching --ifm.

Setting -X F:... on the wrong arch

DV uses F:P:M; MA uses F:FREQ_MHZ (500–1500). Mismatched syntax fails with a parsing error.

-z doesn’t behave like a no-arg flag

The parser declares -z as a required argument but the case body uses no value. Pass any token (e.g., -z 1) until this is fixed.

Appendix: verbatim sub-help output

Captured from the board.

-hc: Ctrl-C handling

MLASignal allows control of SIGHUP(1) SIGINT(2) SIGQUIT(3) SIGTERM(15) :
options set as follows:  [--ctrl-c|-b]  01vsycre :
c                           : immediately do gst.close
e                           : exit(1)
no|off                      : turn off sighandler
r                           : schedule gst to halt further processing
s                           : after interrupt sleep 1 sec then return
y                           : ask Y/N to continue interrupt
examples:                   :
--ctrl-c ys                 : ask y/n ; sleep 1s
--ctrl-c r                  : request to halt further processing

-hl: Logger

Logger options set as follows:
-vXYZ where XYZ are:
  a                           : dma process logging
  A                           : allocation process
  b                           : turn on progress bar
  c                           : check file logging: -c must be on
  d                           : generic debug diagnostics
  F                           : frames processing
  g                           : gstapi and remote proc diagnostics
  h                           : help menu i.e. usage()
  l                           : loader logging
  m                           : m4 driver log (1:wm 2:m 4:dm)
  n                           : show notes: -vn-   turn off
  o=FILE                      : output to FILE
  o=syslog                    : output to syslog facility
  p                           : payload transactions
  P                           : enable policies
  q                           : quiet = (all masks zeroed)
  r                           : relocation processing
  s                           : show all statistics
  S                           : show .lm/.elf dma and tile sizes
  u                           : verify/fix lm program overrun
  v                           : set verbosity: v=1(-vcxwt) v=2(v=1 + lstu) ...
  x                           : log run progress
  y                           : dump symbol table(s)
  z                           : set all flags
  :N                          : set FP precision 42.nnnn
  :nsec|usec|msec|sec         : set time unit
  Generic diagnostics         :
  e                           : ERROR diagnostics: -ve- turn off
  f                           : FATAL diagnostics: -vf- turn off
  t                           : TODO diagnostics
  w                           : WARN diagnostics
  examples                    : -vv=2 or -v -v :set verbose = 2 ; verbose is a compound qualifier
                              : -vxr=2         :enable run=1 and relocations=2
                              : -vsw=0,m=0x7   :enable stats, disable warnings, set m4_driver=wm|m|dm
                              : -vs:msec:3     :enable stats at dt=millisec fp.precision=3
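
The `:nsec|usec|msec|sec` and `:N` modifiers only change how statistics are printed. As a rough illustration, the sketch below converts a latency between the four listed units and formats it with a given precision. The helper name `format_latency` is hypothetical, and the rounding behaviour is an assumption; mla-rt itself may round or truncate differently.

```python
# Minimal sketch: what "-vs:msec:3" style output formatting amounts to.
# format_latency is a hypothetical helper, not part of mla-rt.

# Number of each unit in one second, for the units the logger accepts.
UNITS_PER_SEC = {"nsec": 1e9, "usec": 1e6, "msec": 1e3, "sec": 1.0}


def format_latency(value, from_unit, to_unit, precision):
    """Convert value between time units and format with fixed precision."""
    scale = UNITS_PER_SEC[to_unit] / UNITS_PER_SEC[from_unit]
    return f"{value * scale:.{precision}f}"


if __name__ == "__main__":
    # 676 usec is the tile.lat value from the quick-start run (dt:usec precision:0).
    print(format_latency(676, "usec", "usec", 0))
    print(format_latency(676, "usec", "msec", 3))
```

For the quick-start tile latency of 676 usec, the second call prints 0.676, which is the kind of value to expect in the stats table when the run uses `-vs:msec:3`.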

-hm: Memory dump (full reference)

For reads to DRAM/L1/L2/SRAM/OCM the general form is : -r EXPR:SIZE:FILE:FMT
For writes we have                  : -w EXPR:SIZE:FILE
                                    : EXPR can be a constant, symbol or l1(.) l2(.) addresses described below
                                    : SIZE defaults to 4B and can be inferred
                                    : For FILE=sha1 we generate a SHA1 key
                                    : For FMT: %<FILL>abcdfostux<TSIZE><MODIFIERS>
                                    : where FILL: zero-padded field width, e.g. %04 pads each field to 4 characters
                                    : a:dma  b:binary  c:character  d:decimal  f:float  o:octal  s:string  t:tile-instr  u:unsigned  x:hex
TSIZE                               : x1: one hex byte  d2: two-byte decimal (int16)  u8: 8-byte unsigned
MODIFIERS                           : n: NO ADDRESS  r: reverse 16B row  m: m4_sym  @rowsize: bytes/row

  -r addr[:n][:FILE][:FMT]            : read dram[addr] for len n
  -r sym[:n][:FILE][:FMT]             : read sym (ofm|ifm|lora|...)
  -r sym::sha1                        : SHA1 of sym (len inferred)
  -r l1(x,y,addr)[:n][:FILE][:FMT]    : read L1 at tile (x,y)
  -r l2(unit,addr)[:n][:FILE][:FMT]   : read L2 at unit=n0,e2 row addr
  -r m4.list                          : print M4 driver symbols
  -r m4(sym)[:n][:FILE][:FMT]         : read M4 driver symbol
  -r *                                : report all symbols in symbol table
  -w addr:n[:data|FILE]               : write data or FILE to dram[addr]@n
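
A digest from `-r sym::sha1` can be cross-checked on the host against a dumped copy of the same symbol. The sketch below assumes, without confirmation from the help text, that mla-rt computes a standard SHA-1 over the raw bytes and prints it as hex. The function names and the file name `ofm.bin` are hypothetical.

```python
import hashlib


def sha1_of_bytes(data):
    """Hex SHA-1 digest of an in-memory buffer."""
    return hashlib.sha1(data).hexdigest()


def sha1_of_file(path, chunk_size=1 << 20):
    """Hex SHA-1 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Hypothetical workflow: dump a symbol to ofm.bin on the board, copy it
    # to the host, then compare this value with the digest mla-rt reported.
    # print(sha1_of_file("ofm.bin"))
    print(sha1_of_bytes(b"abc"))
```

If the two digests differ, first check that the dumped length matches the length mla-rt inferred for the symbol.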

format examples:
  -r 0:2r:%d1   → 0x000000:  10   0   0  20  31  32   3 -43 ...
  -r 0:8:%x1    → 0x000000:  a  0  0 14 1f 20  3 d5
  -r 0:16:%02x1 → 0x000000: 0a 00 00 14 1f 20 03 d5 00 00 00 00 00 00 00 00
  -r 0:16:%02x1r → 0x000000: 00 00 00 00 00 00 00 00 d5 03 20 1f 14 00 00 0a
  -r 0:16:%u4n  → 335544330 3573751839        0        0
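
The format examples above all describe the same 16 bytes at address 0. The sketch below reproduces each view on the host, which can help when post-processing a raw dump. The function names are hypothetical. Little-endian word order for `%u4` is inferred from the example values and is not stated by the help text.

```python
import struct

# The 16 bytes shown by the %02x1 example above.
DATA = bytes.fromhex("0a0000141f2003d5") + bytes(8)


def hex_bytes(buf, zero_fill=False, reverse_row=False):
    """One hex byte per field, like %x1, %02x1 and %02x1r."""
    row = buf[::-1] if reverse_row else buf
    spec = "{:02x}" if zero_fill else "{:2x}"
    return " ".join(spec.format(b) for b in row)


def signed_bytes(buf):
    """Each byte as a signed 8-bit decimal, like %d1."""
    return list(struct.unpack(f"<{len(buf)}b", buf))


def unsigned_words(buf):
    """Little-endian unsigned 32-bit words, like %u4 (byte order inferred)."""
    return list(struct.unpack(f"<{len(buf) // 4}I", buf))


if __name__ == "__main__":
    print(signed_bytes(DATA[:8]))
    print(hex_bytes(DATA[:8]))
    print(hex_bytes(DATA, zero_fill=True))
    print(hex_bytes(DATA, zero_fill=True, reverse_row=True))
    print(unsigned_words(DATA))
```

The last byte of the first word group, 0xd5, prints as -43 under `%d1` because that format treats each byte as signed.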

-hP: Policies

Policies respond to driver errors, MLA errors, or timeouts and try to recover from them.
Options are set as follows:  -policy  or -P x,y:n,z=1.0 :
  -P off                      : disable policy options
  a                           : enable all policy options
  r                           : enable retry on failure
  d                           : enable driver restart on failure
  l                           : enable driver reload

  Policies have a self test mode m|m4|mla described below :
  mla,m4,f=0.7                : enable random mla/m4 failure at rate 0.7
  mla:N                       : generate mla failure N = { 1..5 }
  m4:N                        : generate m4 error 6..18 (not all implemented)

  examples                    : -vP or -vP+ for verbose output
  -P r+                       : retry 2x
  -P rd+l                     : retry, driver restart 2x, then reload
  -P a,m4,f=1                 : all policies + random m4 failure rate 1.0
  -P a,m4:7:20:2              : generate m4:7=ERR_CODE_FATAL_DMA:unit:err
  -P a,mla                    : random mla errors at 0.5
  -P a,m4:18:128              : m4 interrupt sig=128

  mla error codes:
  mla:1                       : dmaDramToM4: write payload failed
  mla:2                       : MLA file not found
  mla:3                       : M4 done timeout
  mla:4                       : generic sendCommand failure
  mla:5                       : checkfile failed to match
  m4 error codes:
  m4:7:unit:err               : DMA Fatal interrupt
  m4:8:err                    : MIM Fatal interrupt
  m4:9:row:col                : TLC Fatal interrupt
  m4:10:err                   : uC Fatal interrupt
                              : m4:6,11-17 not implemented (will be generated by -P a,m4)
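
When a test script injects faults with `-P`, it can be convenient to turn a token such as `m4:7:20:2` back into readable text for log messages. The sketch below transcribes the tables above into dictionaries. The helper `describe_fault` is hypothetical, it only handles tokens with an explicit code, and the entry for `m4:18` is taken from the examples rather than from the error-code table.

```python
# Tables transcribed from the -hP help text; describe_fault is a
# hypothetical helper, not part of mla-rt.
MLA_ERRORS = {
    1: "dmaDramToM4: write payload failed",
    2: "MLA file not found",
    3: "M4 done timeout",
    4: "generic sendCommand failure",
    5: "checkfile failed to match",
}

# m4 code -> (description, names of the trailing fields)
M4_ERRORS = {
    7: ("DMA Fatal interrupt", ("unit", "err")),
    8: ("MIM Fatal interrupt", ("err",)),
    9: ("TLC Fatal interrupt", ("row", "col")),
    10: ("uC Fatal interrupt", ("err",)),
    18: ("m4 interrupt", ("sig",)),  # from the "-P a,m4:18:128" example
}


def describe_fault(spec):
    """Explain a self-test token such as 'mla:3' or 'm4:7:20:2'."""
    kind, code_text, *args = spec.split(":")
    code = int(code_text)
    if kind == "mla":
        text = MLA_ERRORS.get(code)
        return f"mla:{code} {text}" if text else f"mla:{code} (not listed)"
    if kind == "m4":
        if code not in M4_ERRORS:
            return f"m4:{code} (not listed)"
        text, fields = M4_ERRORS[code]
        detail = ", ".join(f"{n}={v}" for n, v in zip(fields, args))
        return f"m4:{code} {text}" + (f" ({detail})" if detail else "")
    raise ValueError(f"unknown fault kind: {kind!r}")


if __name__ == "__main__":
    for token in ("mla:3", "m4:7:20:2", "m4:18:128", "m4:11"):
        print(describe_fault(token))
```

Codes that the help text marks as not implemented, such as `m4:11`, are reported as not listed instead of raising an error, so a log formatter built on this does not fail on them.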