.. _mla_rt:

mla-rt User Guide
#################

``mla-rt`` is the MLA (Machine Learning Accelerator) runtime CLI. It loads a compiled model (``*.elf`` or ``*.lm``), runs it on the MLA hardware, and reports performance and correctness statistics.

.. contents:: On this page
   :local:
   :depth: 2

.. _mla_rt_overview:

Overview and deployment
-----------------------

``mla-rt`` is an ``aarch64`` binary that runs on an MLA-equipped board. Every supported board ships ``mla-rt`` pre-installed at ``/usr/bin/mla-rt``, so after connecting via SSH (``ssh root@<board>``), you can run it directly with no setup. This is the right choice for routine model bring-up, performance measurements, and CI-style runs.

A typical session uses an SSH alias or hostname for the board, for example ``ssh root@modalix``. Models live at ``/mnt/mla_test_runner_models/...``, for example ``/mnt/mla_test_runner_models/ma/2025-05-05-regression-tests/image_classification/resnet_50/elffiles/resnet_50_stage1_mla.elf``.

A model directory typically contains:

.. list-table::
   :widths: 35 65
   :header-rows: 1

   * - **File**
     - **Purpose**
   * - ``*_stage1_mla.elf``
     - Compiled MLA binary (the model)
   * - ``*_stage1_mla.ifm.mlc``
     - Reference IFM (input feature map)
   * - ``*_stage1_mla.ofm_chk.mlc``
     - OFM checkfile (golden output for validation)
   * - ``*_stage1_mla.mlc``
     - The model in MLC text format (not frame data)
   * - ``*_stage1_mla_stats.yaml``
     - Reference stats

.. note::

   The recipes on this page were captured on a **Modalix (MA)** board, so DaVinci-only flags (``-X F:0xN:0xN``) and Zebu/VDK modes do not apply there.

.. _mla_rt_quick_start:

Quick start
-----------

.. code-block:: console

   modalix:~$ ./mla-rt -vss /mnt/.../resnet_50_stage1_mla.elf
   [ 0/ 1]setup_model /mnt/.../resnet_50_stage1_mla.elf
   loadDramSegments:dt: 0.009s dramB=25342080 mbps=2884.039
   Total Elapsed: time 82.31ms
   Model stats: /mnt/.../resnet_50_stage1_mla.elf
   output: dt:usec precision:0
          tile.lat|    api.lat|  model.lat|   relo.lat|    run.lat| runmsg.lat|   init.lat|  dram.MB/s|
   ----------------------------------------------------------------------------------------------------
               676|        709|        698|         10|        693|          8|       7240|       2884|
   ----------------------------------------------------------------------------------------------------

``-vss`` = ``-v`` (verbose) with sub-arg ``ss`` (stats level 2). Without a stats option (``-vs`` or ``-vss``), the table is suppressed and only the ``setup_model`` line is printed. ``-vs`` enables the table; ``-vss`` adds the ``runmsg.lat``, ``init.lat``, and ``dram.MB/s`` columns.

.. _mla_rt_performance_table:

The performance table
---------------------

Column meanings with their respective units:

.. list-table::
   :widths: 15 12 73
   :header-rows: 1

   * - **Column**
     - **Unit**
     - **Meaning**
   * - ``tile.lat``
     - nsec
     - Aggregate tile execution time reported by the MLA hardware
   * - ``api.lat``
     - µsec
     - End-to-end runtime of the user-facing API call
   * - ``model.lat``
     - µsec
     - Time spent in ``mla_run_model`` (host side)
   * - ``gap.lat``
     - µsec
     - Multi-run only: gap between successive model invocations
   * - ``relo.lat``
     - µsec
     - Relocation processing
   * - ``run.lat``
     - µsec
     - Pure model execution (M4-side)
   * - ``runmsg.lat``
     - µsec
     - Time to send the run message to M4. Shown when stats >= 2
   * - ``init.lat``
     - µsec
     - One-time MLA initialisation. Shown when stats >= 2
   * - ``dram.MB/s``
     - MB/sec
     - DRAM bandwidth measured during model load

For multi-run (``-m N`` with ``N > 1``), each column gets four rows: ``avg``, ``min``, ``max``, ``std`` (see the R5 recipe below).

The default time scale is **microseconds** with precision 0 (``output: dt:usec precision:0``).
Override with the ``:nsec|:usec|:msec|:sec`` and ``:N`` (digits) suffixes on ``-v``; for example, ``-vs:msec:3`` shows stats in milliseconds with 3 decimals.

.. _mla_rt_connection_modes:

Connection modes (-a / --connect)
---------------------------------

By default, ``mla-rt`` scans ``/etc/buildinfo`` on the host to pick the build target. On aarch64 boards, this normally selects ``a65`` (the on-board ARM CPU).

The argument is a comma-separated list of tokens. The usage menu lists:

.. code-block:: text

   -a [a65|dv|davinci|ma|michelangelo|mod|test|zebu1|zebu2|sw_mbox|dram|sdram|block|enable|disable|mip|vdk]

Architecture options
~~~~~~~~~~~~~~~~~~~~~~

.. list-table::
   :widths: 22 38 40
   :header-rows: 1

   * - **Option**
     - **Meaning**
     - **Where it works**
   * - ``dv`` / ``davinci``
     - DaVinci silicon
     - DV boards (e.g., older mla-hl* generation)
   * - ``ma`` / ``michelangelo`` / ``mod``
     - Modalix silicon
     - MA boards
   * - ``test``
     - Loopback / no hardware
     - Anywhere

Connection / transport options
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. list-table::
   :widths: 25 75
   :header-rows: 1

   * - **Option**
     - **Meaning**
   * - ``a65``
     - On-board A65 ARM (default for aarch64 hosts)
   * - ``sw_mbox``
     - Software mailbox transport
   * - ``zebu1`` / ``zebu2``
     - Zebu emulator (network access required)
   * - ``vdk``
     - Virtual development kit
   * - ``mip``
     - MIP transport mode

Memory allocator options
~~~~~~~~~~~~~~~~~~~~~~~~~~

The allocator can also be selected with ``-B`` / ``--mem-lib``.

.. list-table::
   :widths: 25 75
   :header-rows: 1

   * - **Option**
     - **Meaning**
   * - ``dram``
     - DRAM allocator
   * - ``sdram``
     - SDRAM allocator (default)
   * - ``block``
     - Block allocator

Cache options
~~~~~~~~~~~~~~

.. list-table::
   :widths: 25 75
   :header-rows: 1

   * - **Option**
     - **Meaning**
   * - ``enable``
     - Enable cache
   * - ``disable``
     - Disable cache

Examples:

.. code-block:: text

   -a dv,a65,sw_mbox     # Arch=DaVinci, Connect A65, use software mailbox
   -a mod,zebu1          # Arch=Modalix, Connect to
   -a mod,block,disable  # Arch=Modalix, use block allocator, disable cache

.. _mla_rt_flag_reference:

Complete flag reference
-----------------------

Sorted by short flag (uppercase before lowercase). ``Arg`` column: ``—`` = no argument, ``optional`` = optional argument, ``req`` = required argument.

.. list-table::
   :widths: 6 16 14 12 52
   :header-rows: 1

   * - **Short**
     - **Long**
     - **Arg**
     - **Default**
     - **Description**
   * - ``-A``
     - ``--labels``
     - req ``FILE``
     - (none)
     - Compute output labels via softmax against ``FILE``
   * - ``-a``
     - ``--connect``
     - req
     - (auto)
     - Architecture / transport / allocator (see :ref:`Connection modes <mla_rt_connection_modes>`)
   * - ``-B``
     - ``--mem-lib``
     - req ``sdram|dram|block``
     - ``sdram``
     - Choose memory allocator (alternative to ``-a sdram|dram|block``)
   * - ``-b``
     - ``--ctrl-c``
     - req
     - ``r``
     - Ctrl-C handling. See :ref:`-hc appendix <mla_rt_appendix_hc>`
   * - ``-C``
     - ``--cycle-count``
     - req ``N``
     - ``0``
     - Set N bytes of m4.sram to zero (cycle-count instrumentation)
   * - ``-c``
     - ``--chk``
     - —
     - off
     - Validate OFM rows against the checkfile. See :ref:`Checkfile validation <mla_rt_checkfile_validation>`
   * - ``-D``
     - ``--debug``
     - —
     - off
     - Enter ``MLAMemDebug`` mode
   * - ``-d``
     - ``--driver``
     - req ``*.bin``
     - ``mla_driver.elf``
     - M4 driver binary (required for Q0 / Q0123 quad configs)
   * - ``-e``
     - ``--perf-ctr``
     - req ``N``
     - (none)
     - Add performance counter ``N`` (max 4). See :ref:`Performance counters <mla_rt_performance_counters>`
   * - ``-f``
     - ``--do-check``
     - —
     - off
     - Section checksum validation (slow: ~10× load time, see R25)
   * - ``-G``
     - ``--no-check-dtb``
     - —
     - DTB check ON
     - Skip ``/proc/device-tree/reserved-memory/`` check for DRAM allocators
   * - ``-g``
     - ``--tmo-thr-fact``
     - req ``THR:FACT``
     - ``0:0``
     - DMA timeout threshold and factor
   * - ``-H``
     - ``--m4-dasm``
     - req ``*.dasm``
     - (none)
     - ``mla_driver.dasm`` for M4 symbol resolution
   * - ``-h``
     - ``--help``
     - optional ``aclmP``
     - —
     - Help. ``-h``\ =main; ``-ha``\ =all sub-helps; ``-hc``\ =ctrl-c; ``-hl``\ =logger; ``-hm``\ =mem dump; ``-hP``\ =policies
   * - ``-I``
     - ``--interleaving``
     - req
     - (off)
     - Software interleaving. ``2``/``3``/``4`` = AXI ctrls; ``d`` = optimize DMA; ``m0123`` = MIM placement; combinable, e.g. ``-I4dm``
   * - ``-i``
     - ``--ignore``
     - —
     - off
     - Continue past errors
   * - ``-J``
     - ``--model-timeout``
     - req ``MS``
     - ``3000.0``
     - Per-run model timeout in milliseconds (Zebu uses this)
   * - ``-j``
     - ``--ocm``
     - req ``STRAT:LIMIT``
     - (none)
     - OCM placement (``all|max|min``); historically Zebu-only
   * - ``-K``
     - ``--reloc``
     - req ``SYM:INIT``
     - (none)
     - Add a relocation: create DramSection ``SYM`` initialised from ``INIT`` file. Repeatable
   * - ``-k``
     - ``--checkfile``
     - req ``FILE``
     - (none)
     - Use ``FILE`` as the OFM checkfile. Auto-enables ``-c``
   * - ``-L``
     - ``--last-log``
     - —
     - off
     - Connect, dump M4 log to stdout, exit (no model load)
   * - ``-l``
     - ``--log``
     - —
     - off
     - Enable M4 logging during the run. **Note:** writes ``dump.log`` in cwd; from a read-only NFS mount this errors with ``Error opening file dump.log : Permission denied`` (see :ref:`Troubleshooting <mla_rt_troubleshooting>`)
   * - ``-M``
     - ``--stats``
     - req ``FILE[+]``
     - (none)
     - Append YAML stats to ``FILE``. See R7 for format
   * - ``-m``
     - ``--max-run-count``
     - req ``N``
     - ``1``
     - Run model ``N`` times without reloading. See :ref:`Multi-run <mla_rt_multi_run>`
   * - ``-N``
     - ``--no-reset``
     - —
     - M4 reset ON
     - Skip M4 reset before run
   * - ``-n``
     - ``--no-shr-sect``
     - —
     - sharing ON
     - Disable sharing of duplicate instructions across models (used for LLMs)
   * - ``-O``
     - ``--ifm``
     - req ``NAME:FILE``
     - (none)
     - Construct an IFM segment named ``NAME`` initialised from ``FILE``. Repeatable
   * - ``-o``
     - ``--ofm``
     - req ``NAME:FILE:CHK``
     - (none)
     - Construct an OFM segment. ``FILE`` may be empty (``NAME::CHK``) for no init data; ``CHK`` is an optional checkfile. Repeatable
   * - ``-P``
     - ``--policy``
     - req
     - (none)
     - Process MLAPolicies args. ``-Poff`` to disable. See :ref:`-hP appendix <mla_rt_appendix_hp>`
   * - ``-p``
     - ``--per-layer-stats``
     - req ``FILE``
     - (none)
     - Per-layer latency YAML (input). Output is written to a sibling file whose name ends in ``_output.yaml``. **Mutually exclusive with** ``-C``. See :ref:`Per-layer stats <mla_rt_per_layer_stats>`
   * - ``-Q``
     - ``--ddr-perf``
     - —
     - off
     - Use DDR performance counters (in R22, counter overhead raised ``model.lat`` from ≈700 to 914 µsec)
   * - ``-q``
     - (none)
     - —
     - off
     - Quiet output (alias of ``-vq``). Prints ``...shush...`` and only the setup line
   * - ``-r``
     - ``--read``
     - req
     - (none)
     - Read DRAM/L1/L2/SRAM/OCM. See :ref:`Memory access <mla_rt_memory_access>`
   * - ``-S``
     - ``--seg-num``
     - req ``N``
     - ``-1``
     - Run only segment ``N``
   * - ``-s``
     - ``--report``
     - —
     - off
     - Dump stats. Use ``--stats FILE`` for YAML output
   * - ``-T``
     - ``--write-addr``
     - req ``*.yaml``
     - (none)
     - Write address/value pairs to YAML file
   * - ``-t``
     - ``--gst-api``
     - —
     - off
     - Skip default IFM/OFM allocation (used with GST APIs). Auto-disabled if ``--ifm``/``--ofm`` is also passed
   * - ``-V``
     - ``--version``
     - —
     - —
     - Print git revision and exit
   * - ``-v``
     - ``--verbose``
     - optional
     - (off)
     - Logger options. See :ref:`Verbosity <mla_rt_verbosity_logging>`
   * - ``-W``
     - ``--no-rproc``
     - —
     - rproc ON
     - Disable remoteproc subsystem for M4 control
   * - ``-w``
     - ``--write``
     - req
     - (none)
     - Write to DRAM/L1/L2/SRAM/OCM. See :ref:`Memory access <mla_rt_memory_access>`
   * - ``-X``
     - ``--diag-arg``
     - req
     - (none)
     - Diagnostic arg (``F:`` = freq, ``H:`` = HWIL). See :ref:`Diagnostic arguments <mla_rt_diagnostic_arguments>`
   * - ``-x``
     - ``--max-threads``
     - req ``N``
     - ``8``
     - ``MAX_MULT_LOAD_THREADS``: model load parallelism
   * - ``-Y``
     - ``--dryrun``
     - —
     - off
     - Parse and allocate, do not run
   * - ``-y``
     - ``--yaml-config``
     - req ``FILE``
     - (none)
     - M4 debug config (logLevel, fifoLevel, gpio/uart/profile enables). See :ref:`M4 debug config <mla_rt_m4_debug_config>`
   * - ``-Z``
     - ``--dma-burst-sz``
     - req ``SIZE``
     - ``0``
     - DMA burst size (Modalix-only)
   * - ``-z``
     - ``--en-alloc-gen``
     - —
     - off
     - Use ev74 (target generic) memory allocator. **Note:** the parser tags this as a required argument but uses no value; pass any token, e.g., ``-z 1``

Long-only flag
~~~~~~~~~~~~~~~

.. list-table::
   :widths: 20 80
   :header-rows: 1

   * - **Long**
     - **Description**
   * - ``--arch``
     - **DEPRECATED**: replaced by ``-a``/``--connect``. Do not use.

Positional arguments
~~~~~~~~~~~~~~~~~~~~~~

After the option list, any remaining tokens are interpreted by extension:

* ``*.elf`` or ``*.lm`` → executable model files (multiple allowed)
* everything else → frame data files

.. _mla_rt_verbosity_logging:

Verbosity and logging (-v, -l, -L, -q)
--------------------------------------

``-v`` takes an **optional** argument, which is parsed by the logger. Common forms:

.. list-table::
   :widths: 25 75
   :header-rows: 1

   * - **Form**
     - **Effect**
   * - ``-v``
     - Bare verbose; turns on default channels
   * - ``-vs``
     - Stats level 1 (basic table)
   * - ``-vss``
     - Stats level 2 (extra columns: ``runmsg.lat``, ``init.lat``, ``dram.MB/s``)
   * - ``-vc``
     - Check-file logging
   * - ``-vq``
     - Quiet (same as ``-q``)
   * - ``-vsr``
     - Stats + relocation processing
   * - ``-v:msec:3``
     - Time scale = ms, precision = 3 digits
   * - ``-vsw=0,m=0x7``
     - Stats on, warnings off, m4 driver mask = ``wm|m|dm``
   * - ``-vP`` / ``-vP+``
     - Policy log output
   * - ``-vz``
     - Set ALL flags

The full grammar is ``-vXYZ`` where each letter is a one-char channel. See the :ref:`-hl appendix <mla_rt_appendix_hl>` for the channel list.

``-l`` enables the M4 driver log (collected during the run). ``-L`` connects, dumps the log to stdout, and exits (no load). ``-q`` is shorthand for ``-vq``.

.. _mla_rt_memory_access:

Memory access (-r / -w)
-----------------------

General forms (see :ref:`-hm appendix <mla_rt_appendix_hm>`):

.. code-block:: text

   -r EXPR[:N][:FILE][:FMT]   # read
   -w EXPR:N[:DATA|FILE]      # write

``EXPR`` may be:

.. list-table::
   :widths: 25 75
   :header-rows: 1

   * - **Form**
     - **Meaning**
   * - ``0xADDR``
     - DRAM address
   * - ``sym``
     - A model symbol: ``ifm``, ``ifm.b0``, ``ofm``, ``ofm.b0``, ``lora``, ...
   * - ``sram(N)``
     - M4 SRAM bank N
   * - ``l1(x,y,addr)``
     - L1 at tile (x,y)
   * - ``l2(unit,addr)``
     - L2 (e.g., ``l2(n0,0x100)``)
   * - ``m4(sym)``
     - M4 driver symbol
   * - ``m4.list``
     - Print all M4 driver symbols
   * - ``*``
     - All symbols (with ``-r`` only)

Suffixes for ``N`` and addresses: ``k`` = 1024, ``m`` = 1048576, ``r`` = 16 (one row).

``FMT`` starts with ``%`` and is built from the following parts, as in ``%d1@16`` or ``%02x1r`` (the ``02`` in the latter is a zero-fill width; see the ``-hm`` appendix):

* **TYPE:** ``a`` (dma), ``b`` (binary), ``c`` (char), ``d`` (decimal), ``f`` (float), ``o`` (octal), ``s`` (string), ``t`` (tile-instr), ``u`` (unsigned), ``x`` (hex)
* **TSIZE:** ``1``, ``2``, ``4``, ``8`` byte width
* **MODS:** ``n`` (no address prefix), ``r`` (reverse 16-byte row), ``m`` (m4_sym), ``@N`` (N bytes per row)
* **Special:** ``FILE=sha1`` produces a SHA-1 hash

Examples:

.. code-block:: text

   -r ifm.b0:32:%d1@16               # 32 B of IFM as decimal, 16 per row
   -r ofm.b0::%b:/tmp/ofm.bin        # all OFM as binary to file
   -w 0x30:8:0xdeadbeef              # (write form) 8 B of 0xdeadbeef at 0x30
   -w sram(0)::FILE                  # write FILE into sram(0)
   -w 0x1d0000:32k:-1                # fill 32 KB at 0x1d0000 with -1
   -w l2(n0,0x100):10r::0,1,2,3,4,5  # write 10 rows of literal data

.. _mla_rt_performance_counters:

Performance counters (-e)
-------------------------

Up to 4 counters can be added per run; ``mla-rt`` exits if more are requested. Counter names:

.. list-table::
   :widths: 15 85
   :header-rows: 1

   * - **ID**
     - **Name**
   * - 0
     - ``tiled_lat``
   * - 1
     - ``tile_lat``
   * - 2
     - ``tile_active``
   * - 3
     - ``sync_frz/pwr``
   * - 4
     - ``tile_pwr-c``
   * - 5
     - ``sync_frz``
   * - 6
     - ``iq_sync_frz``
   * - 7
     - ``d_sync_frz``

All perf counters report in nanoseconds. Each counter adds a column ``perf.id N`` with the counter's name. The values in the table are scaled to the active time unit (default µsec).

Common combination: -e2 -e5 -e6 -e7
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The most useful set for understanding where MLA cycles are spent: tile activity (``tile_active``) plus the three freeze counters (``sync_frz``, ``iq_sync_frz``, ``d_sync_frz``). With those, ``tile_active + freeze ≈ tile.lat``.

.. code-block:: console

   modalix:~$ ./mla-rt -vss -m 3 -e2 -e5 -e6 -e7 $MODEL
   Perf counter 2 source "tile_active" added.
   Perf counter 5 source "sync_frz" added.
   Perf counter 6 source "iq_sync_frz" added.
   Perf counter 7 source "d_sync_frz" added.
   ...
                                                                                                                     perf.id 2|  perf.id 5|  perf.id 6|  perf.id 7|
          tile.lat|    api.lat|  model.lat|    gap.lat|   relo.lat|    run.lat| runmsg.lat|   init.lat|  dram.MB/s|tile_active|   sync_frz|iq_sync_frz| d_sync_frz|
   ----------------------------------------------------------------------------------------------------------------------------------------------------------------
   avg         676|        701|        693|          6|          7|        689|          7|       6094|       2890|        494|        182|          3|        182|
   min         676|        686|        683|          4|          1|        682|          4|       4040|       2890|        494|        182|          3|        182|
   max         677|        708|        696|          8|          9|        691|          8|       7121|       2890|        494|        183|          3|        183|
   std         nan|          9|          6|          3|          3|          4|          2|       1266|          0|          0|        nan|          0|        nan|

Reading the avg row above: ``tile_active=494µs`` + ``sync_frz=182µs`` ≈ ``tile.lat=676µs``. The tile spent ~73 % executing and ~27 % stalled on sync.

Counter id 0 / 1 example
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: console

   modalix:~$ ./mla-rt -vss -m 3 -e0 -e1 $MODEL
          tile.lat|    api.lat|  model.lat|    gap.lat|   relo.lat|    run.lat| runmsg.lat|   init.lat|  dram.MB/s|  tiled_lat|   tile_lat|
   avg         676|        703|        694|          6|          7|        690|          7|       6614|       2916|        676|        676|

For DDR-side counters use ``-Q``.

.. _mla_rt_checkfile_validation:

Checkfile validation (-c / -vc)
-------------------------------

``-c`` (or ``--chk``) asks ``mla-rt`` to compare the model's OFM against a golden checkfile and report PASSED/FAILED. For the comparison to be meaningful, you must:

#. Pass the **matching IFM** with ``--ifm NAME:FILE`` so the model runs on the input the checkfile was produced for.
#. Tell ``mla-rt`` where the OFM lives (``--ofm NAME::CHK``); the empty ``FILE`` field means "no init data".
#. Use the symbol names declared in the model. For most regression-test models that's ``ifm.b0`` and ``ofm.b0``.
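
The three requirements above can be sketched as a small POSIX shell helper. It only prints the resulting command line, so the line can be reviewed before it is run on a board. The function name ``mla_chk_cmd`` and the ``/tmp/example_stage1_mla`` stem are hypothetical; the file suffixes follow the model-directory naming described earlier, and the default symbol names are the ``ifm.b0`` / ``ofm.b0`` pair that most regression-test models are said to use.

```shell
#!/bin/sh
# Sketch (hypothetical helper, not part of mla-rt): print the command line
# for a checkfile run. Assumes <stem>.elf, <stem>.ifm.mlc and
# <stem>.ofm_chk.mlc sit side by side, as in the model directory table.
mla_chk_cmd() {
    stem=$1                 # model path without extension
    ifm_sym=${2:-ifm.b0}    # IFM symbol declared in the model (assumed default)
    ofm_sym=${3:-ofm.b0}    # OFM symbol declared in the model (assumed default)
    # --ifm NAME:FILE feeds the matching input; --ofm NAME::CHK leaves the
    # init-data field empty and binds the checkfile to the OFM symbol.
    echo "mla-rt -c -vssc" \
         "--ifm ${ifm_sym}:${stem}.ifm.mlc" \
         "--ofm ${ofm_sym}::${stem}.ofm_chk.mlc" \
         "${stem}.elf"
}

mla_chk_cmd /tmp/example_stage1_mla
```

Printing rather than executing keeps the sketch usable off-board; on a board, the printed line can be pasted into the shell once the paths have been checked.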

If you specify ``--checkfile FILE`` instead of ``--ofm NAME::CHK``, ``-c`` is auto-enabled and the checkfile is bound to the default OFM symbol. You still need ``--ifm`` to feed the right input; otherwise the model runs against a zero-filled IFM and reports ``FAILED`` (this is what R6 in :ref:`Recipes <mla_rt_recipes>` shows).

``-vc`` (or ``-vssc``) turns on **check-file logging**, which prints the address range checked and an "N rows compared correctly, M rows were bad" summary in addition to the PASSED/FAILED line.

Recipe: PASS with -c -vss
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: console

   modalix:~$ DIR=/mnt/.../resnet_50/elffiles
   modalix:~$ ./mla-rt -c -vss --ifm ifm.b0:$DIR/resnet_50_stage1_mla.ifm.mlc \
                  --ofm ofm.b0::$DIR/resnet_50_stage1_mla.ofm_chk.mlc \
                  $DIR/resnet_50_stage1_mla.elf
   [ 0/ 1]setup_model /mnt/.../resnet_50_stage1_mla.elf
   loadDramSegments:dt: 0.009s dramB=25342080 mbps=2872.757
   Test:: ofm.b0::PASSED
   Total Elapsed: time 205.43ms
          tile.lat|    api.lat|  model.lat|   relo.lat|    run.lat| runmsg.lat|   init.lat|  dram.MB/s|
               675|       1250|        710|         22|        692|          9|       7360|       2873|

``api.lat`` jumps from ~700 µsec (without ``-c``) to ~1250 µsec (with ``-c``) because the comparison takes time and is included in the API latency.

Recipe: PASS with -c -vssc (verbose check)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: console

   modalix:~$ ./mla-rt -c -vssc --ifm ifm.b0:$DIR/resnet_50_stage1_mla.ifm.mlc \
                  --ofm ofm.b0::$DIR/resnet_50_stage1_mla.ofm_chk.mlc \
                  $DIR/resnet_50_stage1_mla.elf
   ...
   checkDramSection: DramSection: ofm.b0:address:0x1182040000..0x11820403f0:size: 0x3f0:zone:0x5: mlaZoneOFM:model:/mnt/.../resnet_50_stage1_mla.elf:refcnt:0x1
   63 rows compared correctly, 0 rows were bad.
   Test:: ofm.b0::PASSED

The two extra lines (one per checked DramSection, plus the row count) make ``-vssc`` the right verbosity when debugging a FAIL. The numbers are self-consistent: ``0x3f0`` is 1008 bytes, which is 63 rows of 16 bytes.

.. _mla_rt_per_layer_stats:

Per-layer stats (-p)
--------------------

``-p FILE`` runs the model **layer by layer** and emits per-layer latency and perf-counter data. The input file is a YAML mapping from layer index to ``name``, ``start_cycle``, and ``end_cycle``. Most regression-test models ship one alongside the elf, for example ``resnet_50_stage1_mla_stats.yaml``.

Output is written next to the input, to a file whose name replaces the ``.yaml`` extension with ``_output.yaml`` (the input is **not** overwritten). For example, passing ``/tmp/layers.yaml`` produces ``/tmp/layers_output.yaml``. A pre-existing ``_output.yaml`` file is overwritten.

``-p`` is **mutually exclusive** with ``-C`` / ``--cycle-count``; combining them is fatal at parse time.

Input file format
~~~~~~~~~~~~~~~~~~

.. code-block:: yaml

   0:
     name: MLA_0/placeholder_0_0
     start_cycle: 0
     end_cycle: 0
   1:
     name: MLA_0/conv2d_add_relu_0
     start_cycle: 0
     end_cycle: 28435
   2:
     name: MLA_0/max_pool2d_1
     start_cycle: 28436
     end_cycle: 31265
   ...

Index ``0`` is a placeholder and is skipped. Layers are run in order; cycle ranges drive how long each layer is allowed to execute.

Recipe
~~~~~~

.. note::

   Always pass a copy of the stats yaml. Even though ``mla-rt`` writes to a separate ``_output.yaml``, copying first protects against accidents (e.g., invoking with ``-p $MODEL_DIR/foo_stats.yaml`` from a writable mount).

.. code-block:: console

   modalix:~$ cp $DIR/resnet_50_stage1_mla_stats.yaml /tmp/layers_in.yaml
   modalix:~$ ./mla-rt -vss -p /tmp/layers_in.yaml \
                  --ifm ifm.b0:$DIR/resnet_50_stage1_mla.ifm.mlc \
                  $DIR/resnet_50_stage1_mla.elf
   [ 0/ 1]setup_model /mnt/.../resnet_50_stage1_mla.elf
   loadDramSegments:dt: 0.009s dramB=25342080 mbps=2887.514
   Per layer stats output file name: /tmp/layers_in_output.yaml
   Sum perf ctr run_time is: 728.98us
   Sum m4 systick latency is: 729.63us
   Total Elapsed: time 353.83ms

Output file format (/tmp/layers_in_output.yaml)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Each layer gets the cycle range it ran with plus five timing fields:

.. code-block:: yaml

   # Date created: 2026-04-29 22:08:16
   1:
     name: MLA_0/conv2d_add_relu_0
     start_cycle: 0
     end_cycle: 28435
     layer_latency: 44.84us
     run_time: 43.60us
     active_time: 28.43us
     l2_freeze: 14.88us
     iq_freeze: 3.50us
   2:
     name: MLA_0/max_pool2d_1
     ...

.. list-table::
   :widths: 18 27 55
   :header-rows: 1

   * - **Field**
     - **Source**
     - **Meaning**
   * - ``layer_latency``
     - M4 systick measurement
     - Wall-clock time spent in the layer
   * - ``run_time``
     - Perf counter ``run_time``
     - Layer execution time as reported by the perf counter
   * - ``active_time``
     - Perf counter ``tile_active``
     - Time tiles were actively computing
   * - ``l2_freeze``
     - Perf counter
     - Time stalled waiting for L2
   * - ``iq_freeze``
     - Perf counter
     - Time stalled on instruction-queue sync

The summary lines printed to stdout (``Sum perf ctr run_time`` / ``Sum m4 systick latency``) are the totals across all layers in the file.

.. _mla_rt_m4_debug_config:

M4 debug config (-y)
--------------------

``-y FILE`` (``--yaml-config``) loads a YAML file that configures the M4 debug context: log level, FIFO mode, GPIO/UART/profile enables, deadlock checking, and the Zebu flag. Without ``-y``, ``mla-rt`` uses hard-coded defaults.

A sample file is shown below; use it as a template:

.. code-block:: yaml

   debugConfig:   # this node is mandatory
     # log level: disabled | info | verbose | debug | fifo-debug
     logLevel : disabled
     # fifo level: no-fifo | write | read-write
     fifoLevel : write
     # boolean flags
     gpioEn : false
     deadlockEn : true
     uartEn : false
     profileEn : false
     zebu : false

.. list-table::
   :widths: 18 12 35 35
   :header-rows: 1

   * - **Key**
     - **Type**
     - **Allowed values**
     - **Effect**
   * - ``logLevel``
     - enum
     - ``disabled``, ``info``, ``verbose``, ``debug``, ``fifo-debug``
     - M4 driver log verbosity
   * - ``fifoLevel``
     - enum
     - ``no-fifo``, ``write``, ``read-write``
     - Which fifo channels are enabled
   * - ``gpioEn``
     - bool
     - ``true``/``false``
     - Enable GPIO debug
   * - ``deadlockEn``
     - bool
     - ``true``/``false``
     - Enable deadlock detection (default ``true``)
   * - ``uartEn``
     - bool
     - ``true``/``false``
     - Enable UART output
   * - ``profileEn``
     - bool
     - ``true``/``false``
     - Enable profiling hooks
   * - ``zebu``
     - bool
     - ``true``/``false``
     - Set when running against Zebu emulator
   * - ``m4SemaphoreOp``
     - uint32
     - (numeric)
     - Optional M4 semaphore operation override

Recipe
~~~~~~

.. code-block:: console

   modalix:~$ ./mla-rt -vss -y /path/to/mlart-default-cfg.yaml \
                  $DIR/resnet_50_stage1_mla.elf
   Reading m4 debug configuration from: /path/to/mlart-default-cfg.yaml
   m4_context_config.yaml content:
   logLevel : disabled
   fifoLevel : write
   gpioEn : false
   deadlockEn : true
   uartEn : false
   profileEn : false
   zebu : false
   [ 0/ 1]setup_model /mnt/.../resnet_50_stage1_mla.elf
   loadDramSegments:dt: 0.009s dramB=25342080 mbps=2898.003
   Total Elapsed: time 83.72ms
          tile.lat|    api.lat|  model.lat|   relo.lat|    run.lat| runmsg.lat|   init.lat|  dram.MB/s|
               676|        707|        697|         10|        692|          8|       7121|       2898|

``mla-rt`` echoes the parsed config back so you can confirm what was applied. If a key is missing from your file, the corresponding default is kept.

.. _mla_rt_diagnostic_arguments:

Diagnostic arguments (-X)
-------------------------

Format: ``-X <type>:...``, where the type letter is currently ``F`` (frequency) or ``H`` (HWIL register).

-X F:... — Set MLA PLL frequency
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

DaVinci syntax: ``-X F:P:M`` (PLL ``P`` and ``M`` registers).

.. code-block:: text

   -X F:0x3:0x3b   # 500 MHz (DV)
   -X F:0x2:0x35   # 600 MHz (DV)
   -X F:0x2:0x3e   # 700 MHz (DV)
   -X F:0x1:0x2f   # 800 MHz (DV)
   -X F:0x1:0x35   # 900 MHz (DV)
   -X F:0x1:0x3b   # 1000 MHz (DV)

Modalix syntax: ``-X F:FREQ_MHZ`` (range 500–1500):

.. code-block:: text

   -X F:933        # 933 MHz (MA)

.. warning::

   Setting the frequency mutates board state until the next reset/reboot. After benchmarking at a non-default frequency, run again with the original frequency to restore it.

-X H:0xVALUE — Set NoC HWIL register
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: text

   -X H:0x70       # set MA hwil (noc_cfg) to 0x70

This option is Modalix-only and is ignored on DV.

.. _mla_rt_multi_run:

Multi-run and stats output
--------------------------

``-m N`` (default ``1``) runs the model ``N`` times **without reloading**. The reported table changes:

* N = 1 → single row of values
* N > 1 → 4 rows: ``avg``, ``min``, ``max``, ``std``, plus an extra ``gap.lat`` column (inter-run gap)

``-M FILE`` (or ``--stats FILE``) emits a YAML summary to ``FILE``. With a trailing ``+`` (``-M FILE+``), the summary is appended to an existing file. Output from recipe R7:

.. code-block:: yaml

   # Date created: 2026-04-29 21:35:08
   # precision:dt:usec precision:0
   /mnt/.../resnet_50_stage1_mla.elf:
     filesize: 27777336
     tile.lat:
       avg: 675
     api.lat:
       avg: 707
     model.lat:
       avg: 697
     relo.lat:
       avg: 10
     run.lat:
       avg: 691
     runmsg.lat:
       avg: 8
     init.lat:
       avg: 6921
     dram.MB/s:
       avg: 2885

For ``-m N`` with ``N > 1``, each metric also gets ``min``, ``max``, and ``std`` keys.

.. _mla_rt_argument_constraints:

Argument constraints and interactions
-------------------------------------

.. list-table::
   :widths: 45 55
   :header-rows: 1

   * - **Constraint**
     - **Behavior**
   * - ``-p``/``--per-layer-stats`` + ``-C``/``--cycle-count``
     - **Fatal**: mutually exclusive
   * - ``--checkfile FILE`` without ``-c``
     - Auto-enables ``-c`` (notice: ``--checkfile specified ... adding -c``)
   * - ``-vs`` set but no ``-s``
     - Auto-enables ``-s`` (``-vs stats specified ... adding -s``)
   * - ``-t`` + ``--ifm``/``--ofm``
     - ``-t`` is auto-disabled (``--ifm --ofm turns off``)
   * - ``-e`` more than 4 times
     - **Fatal**: max 4 perf counters
   * - ``--stats`` without ``-vs``
     - Warning: ``--stats specified without log options -vs``
   * - ``-vP`` without ``-P``
     - Notice: ``-vP policy specified but no -P policy set``

Unrecognised flags (e.g., ``--csv``, ``--check-lm``, ``--legacy``, ``--sram``, ``--dram-channel``, ``-F``) cause ``mla-rt`` to print the full usage menu and exit.

.. _mla_rt_recipes:

Recipes (with verified board output)
------------------------------------

All recipes were run on a Modalix board against the following model and companions:

.. code-block:: console

   modalix:~$ MODEL=/mnt/.../resnet_50/elffiles/resnet_50_stage1_mla.elf
   modalix:~$ IFM=$(dirname $MODEL)/resnet_50_stage1_mla.ifm.mlc
   modalix:~$ CHK=$(dirname $MODEL)/resnet_50_stage1_mla.ofm_chk.mlc

R1 — Default (no flags) — minimum output
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: console

   modalix:~$ ./mla-rt $MODEL
   [ 0/ 1]setup_model /mnt/.../resnet_50_stage1_mla.elf

The default verbosity prints only the setup line. Stats are gated behind ``-v``.

R2 — Single run with full stats (-vss)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: console

   modalix:~$ ./mla-rt -vss $MODEL
   [ 0/ 1]setup_model /mnt/.../resnet_50_stage1_mla.elf
   loadDramSegments:dt: 0.009s dramB=25342080 mbps=2884.039
   Total Elapsed: time 82.31ms
   Model stats: /mnt/.../resnet_50_stage1_mla.elf
   output: dt:usec precision:0
          tile.lat|    api.lat|  model.lat|   relo.lat|    run.lat| runmsg.lat|   init.lat|  dram.MB/s|
   ----------------------------------------------------------------------------------------------------
               676|        709|        698|         10|        693|          8|       7240|       2884|
   ----------------------------------------------------------------------------------------------------

R3 — Quiet (-q)
~~~~~~~~~~~~~~~

.. code-block:: console

   modalix:~$ ./mla-rt -q $MODEL
   ...shush...
   [ 0/ 1]setup_model /mnt/.../resnet_50_stage1_mla.elf

R4 — Dry run (-Y) — parse + allocate, don't run
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: console

   modalix:~$ ./mla-rt -Y $MODEL
   [ 0/ 1]setup_model /mnt/.../resnet_50_stage1_mla.elf

R5 — Multi-run (-m 4) — see jitter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: console

   modalix:~$ ./mla-rt -vss -m 4 $MODEL
          tile.lat|    api.lat|  model.lat|    gap.lat|   relo.lat|    run.lat| runmsg.lat|   init.lat|  dram.MB/s|
   ----------------------------------------------------------------------------------------------------------------
   avg         676|        696|        690|          6|          5|        688|          6|       5315|       2888|
   min         676|        686|        683|          3|          1|        682|          4|       4040|       2887|
   max         677|        710|        699|          7|         10|        694|          8|       7361|       2887|
   std           0|         13|          9|          2|          5|          6|          2|       1889|          0|

R6 — --checkfile without --ifm → expected FAIL
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: console

   modalix:~$ ./mla-rt -vss --checkfile $CHK $MODEL
   ...
   Test:: ofm.b0::FAILED
   Total Elapsed: time 93.21ms
           api.lat|  model.lat|   relo.lat|    run.lat| runmsg.lat|  dram.MB/s|
   ----------------------------------------------------------------------------
             11600|        699|         10|        693|          9|       2901|

``--checkfile`` auto-enables ``-c``, but the model runs against a zero-filled IFM and the output doesn't match. For the PASS path, see :ref:`Checkfile validation <mla_rt_checkfile_validation>`.

R7 — YAML stats (-M)
~~~~~~~~~~~~~~~~~~~~~

See :ref:`Multi-run and stats output <mla_rt_multi_run>` above for the YAML format.

R8b — Custom IFM and OFM
~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: console

   modalix:~$ ./mla-rt -vss --ifm ifm.b0:$IFM --ofm ofm.b0::$CHK $MODEL
          tile.lat|    api.lat|  model.lat|   relo.lat|    run.lat| runmsg.lat|   init.lat|  dram.MB/s|
   ----------------------------------------------------------------------------------------------------
               675|        720|        709|         22|        691|          8|       7241|       2882|

The empty middle field in ``ofm.b0::$CHK`` means "no init data", and ``$CHK`` is the validation checkfile.
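
To make the ``NAME:FILE:CHK`` field layout concrete, here is a minimal POSIX shell sketch that splits such a spec into its three fields. It only mirrors the syntax documented on this page and is not ``mla-rt``'s actual parser; the function name ``split_ofm_spec`` and the example paths are hypothetical.

```shell
#!/bin/sh
# Illustration only (hypothetical helper): split an --ofm style spec
# "NAME:FILE:CHK" into fields. Assumes NAME itself contains no ':'.
split_ofm_spec() {
    spec=$1
    name=${spec%%:*}    # text before the first ':'
    rest=${spec#*:}     # text after the first ':'
    case $rest in
        *:*) file=${rest%%:*}; chk=${rest#*:} ;;  # FILE (possibly empty) and CHK
        *)   file=$rest; chk= ;;                  # no CHK field given
    esac
    echo "name=$name file=${file:-(none)} chk=${chk:-(none)}"
}

# Empty middle field: no init data, checkfile only.
split_ofm_spec 'ofm.b0::/tmp/example.ofm_chk.mlc'
# All three fields present.
split_ofm_spec 'ofm.b0:/tmp/init.bin:/tmp/example.ofm_chk.mlc'
```

The first call reports the file field as ``(none)``, which corresponds to the "no init data" case used in R8b.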

R9 — Performance counters
~~~~~~~~~~~~~~~~~~~~~~~~~~~

See :ref:`Performance counters <mla_rt_performance_counters>` for the recommended ``-e2 -e5 -e6 -e7`` combo and the ``-e0 -e1`` variant.

R20 — Read 32 B of IFM as decimals, 16 per row
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: console

   modalix:~$ ./mla-rt -vss -r "ifm.b0:32:%d1@16" $MODEL
   mlaZoneIFM:ifm.b0[0x1182000000]@32
   0x1182000000: 86 -62 -101 -52 -23 80 -13 -105 -43 25 80 58 126 -80 97 -114
   0x1182000010: 14 -58 -77 40 -50 15 -62 -90 63 -18 45 43 113 -20 -72 34

R21 — Dump OFM to a binary file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: console

   modalix:~$ ./mla-rt -vss --ifm ifm.b0:$IFM -r "ofm.b0::%b:/tmp/ofm.bin" $MODEL
   mlaZoneOFM:ofm.b0[0x1182040000]@1008 => /tmp/ofm.bin
   modalix:~$ ls -la /tmp/ofm.bin
   -rw-r--r-- 1 root root 1008 Apr 29 21:38 /tmp/ofm.bin

R22 — DDR performance counters (-Q)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: console

   modalix:~$ ./mla-rt -vss -Q $MODEL
          tile.lat|    api.lat|  model.lat|   relo.lat|    run.lat| runmsg.lat|   init.lat|  dram.MB/s|
               677|        925|        914|         10|        909|          8|       6681|       2885|

Note that ``model.lat`` rises from ≈700 to 914 µsec because of DDR counter overhead.

R25 — Section checksums (-f) — slow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: console

   modalix:~$ ./mla-rt -vss -f $MODEL
   loadDramSegments:dt: 0.818s dramB=25342080 mbps= 30.976
   ...
   Total Elapsed: time 889.66ms

Load is ~10× slower; ``dram.MB/s`` collapses from ~2900 to 31 because the checksum computation is mixed in.

R26 — Memory allocator (--mem-lib sdram)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: console

   modalix:~$ ./mla-rt -vss --mem-lib sdram $MODEL
   arch.memlib: SDramAllocator
   [ 0/ 1]setup_model ...

R27 — Version
~~~~~~~~~~~~~

.. code-block:: console

   modalix:~$ ./mla-rt -V
   mla_rt git revision:

.. _mla_rt_stale_help:

Stale --help entries
--------------------

The ``usage()`` menu lists several flags that the current parser **does not accept**:

.. list-table::
   :widths: 30 70
   :header-rows: 1

   * - **Listed in --help**
     - **Actual status**
   * - ``--csv``
     - Not parsed; passing it dumps the usage menu and exits
   * - ``--check-lm``
     - Not parsed
   * - ``--legacy``
     - Not parsed
   * - ``--sram N``
     - Not parsed (named entry exists in menu but no long option)
   * - ``--dram-channel N``
     - Not parsed
   * - ``-F err_t,...``
     - Not a top-level flag; ``F`` only exists as a ``-X`` sub-token (frequency)
   * - ``mla-poke``
     - Documentation note, not a flag
   * - ``mem-lib`` (without ``--``)
     - The actual flag is ``-B`` / ``--mem-lib``
   * - ``--arch``
     - DEPRECATED; use ``-a``/``--connect``

.. _mla_rt_troubleshooting:

Troubleshooting
---------------

``Error opening file dump.log : Permission denied`` (with ``-l``)
   Logging tries to write ``dump.log`` in cwd. If you ``cd`` into a read-only mount, the open fails. Workarounds: ``cd /tmp`` before running, or use ``-L`` (dumps the log to stdout, no file).

``parse_file:ERROR:cannot open`` from ``--ofm``
   ``-o NAME:FILE:CHK`` requires ``FILE`` to exist as init data. To skip init, leave the field empty: ``-o NAME::CHK``.

``Test:: ofm.b0::FAILED`` with default (no ``--ifm``)
   The model runs against zero-filled (or default) IFM, so the OFM doesn't match the golden checkfile. Pass the matching ``--ifm``.

Setting ``-X F:...`` on the wrong arch
   DV uses ``F:P:M``; MA uses ``F:FREQ_MHZ`` (500–1500). Mismatched syntax fails with a parsing error.

``-z`` doesn't behave like a no-arg flag
   The parser declares ``-z`` as a required argument but the case body uses no value. Pass any token (e.g., ``-z 1``) until this is fixed.

.. _mla_rt_appendix:

Appendix: verbatim sub-help output
----------------------------------

Captured from the board.

.. _mla_rt_appendix_hc:

-hc — Ctrl-C handling
~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: text

   MLASignal allows control of SIGHUP(1) SIGINT(2) SIGQUIT(3) SIGTERM(15)
   : options set as follows: [--ctrl-c|-b] 01vsycre
   : c : immediately do gst.close
   e : exit(1)
   no|off : turn off sighandler
   r : schedule gst to halt further processing
   s : after interrupt sleep 1 sec then return
   y : ask Y/N to continue interrupt
   examples:
   : --ctrl-c ys : ask y/n ; sleep 1s
   --ctrl-c r : request to halt further processing

.. _mla_rt_appendix_hl:

-hl — Logger
~~~~~~~~~~~~

.. code-block:: text

   Logger options set as follows: -vXYZ where XYZ are:
   a : dma process logging
   A : allocation process
   b : turn on progress bar
   c : check file logging: -c must be on
   d : generic debug diagnostics
   F : frames processing
   g : gstapi and remote proc diagnostics
   h : help menu i.e. usage()
   l : loader logging
   m : m4 driver log (1:wm 2:m 4:dm)
   n : show notes: -vn- turn off
   o=FILE : output to FILE
   o=syslog : output to syslog facility
   p : payload transactions
   P : enable policies
   q : quiet = (all masks zeroed)
   r : relocation processing
   s : show all statistics
   S : show .lm/.elf dma and tile sizes
   u : verify/fix lm program overrun
   v : set verbosity: v=1(-vcxwt) v=2(v=1 + lstu) ...
   x : log run progress
   y : dump symbol table(s)
   z : set all flags
   :N : set FP precision 42.nnnn
   :nsec|usec|msec|sec : set time unit
   Generic diagnostics :
   e : ERROR diagnostics: -ve- turn off
   f : FATAL diagnostics: -vf- turn off
   t : TODO diagnostics
   w : WARN diagnostics
   examples : -vv=2 or -v -v :set verbose = 2 ; verbose is a compound qualifier
   : -vxr=2 :enable run=1 and relocations=2
   : -vsw=0,m=0x7 :enable stats, disable warnings, set m4_driver=wm|m|dm
   : -vs:msec:3 :enable stats at dt=millisec fp.precision=3

.. _mla_rt_appendix_hm:

-hm — Memory dump (full reference)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: text

   For reads to DRAM/L1/L2/SRAM/OCM the general form is : -r EXPR:SIZE:FILE:FMT
   For writes we have : -w EXPR:SIZE:FILE
   : EXPR can be a constant, symbol or l1(.) l2(.) addresses described below
   : SIZE defaults to 4B and can be inferred
   : For FILE=sha1 we generate a SHA1 key
   : For FMT: %abcdfostux
   : where FILL: field size in bytes: ex %04 fill with 4 zeros
   : a:dma b:binary c:character d:decimal f:float o:octal s:string t:tile-instr u:unsigned x:hex
   TSIZE : x1: one hex byte d2: two byte decimal uint16 u8: 8byte unsigned
   MODIFIERS : n: NO ADDRESS r: reverse 16B row m: m4_sym @rowsize: bytes/row
   -r addr[:n][:FILE][:FMT] : read dram[addr] for len n
   -r sym[:n][:FILE][:FMT] : read sym (ofm|ifm|lora|...)
   -r sym::sha1 : SHA1 of sym (len inferred)
   -r l1(x,y,addr)[:n][:FILE][:FMT] : read L1 at tile (x,y)
   -r l2(unit,addr)[:n][:FILE|FMT] : read L2 at unit=n0,e2 row addr
   -r m4.list : print M4 driver symbols
   -r m4(sym)[:n][:FILE][:FMT] : read M4 driver symbol
   -r * : report all symbols in symbol table
   -w addr:n[:data|FILE] : write data or FILE to dram[addr]@n
   format examples:
   -r 0:2r:%d1 → 0x000000: 10 0 0 20 31 32 3 -43 ...
   -r 0:8:%x1 → 0x000000: a 0 0 14 1f 20 3 d5
   -r 0:16:%02x1 → 0x000000: 0a 00 00 14 1f 20 03 d5 00 00 00 00 00 00 00 00
   -r 0:16:%02x1r → 0x000000: 00 00 00 00 00 00 00 00 d5 03 20 1f 14 00 00 0a
   -r 0:16:%u4n → 335544330 3573751839 0 0

.. _mla_rt_appendix_hp:

-hP — Policies
~~~~~~~~~~~~~~

.. 
code-block:: text Policies respond to correct driver or mla error or timeouts : options set as follows: -policy or -P x,y:n,z=1.0 : -P off : disable policy options a : enable all policy options r : enable retry on failure d : enable driver restart on failure l : enable driver reload Policies have a self test mode m|m4|mla described below : mla,m4,f=0.7 : enable random mla/m4 failure at rate 0.7 mla:N : generate mla failure N = { 1..5 } m4:N : generate m4 error 6..18 (not all implemented) examples : -vP or -vP+ for verbose output -P r+ : retry 2x -P rd+l : retry, driver restart 2x, then reload -P a,m4,f=1 : all policies + random m4 failure rate 1.0 -P a,m4:7:20:2 : generate m4:7=ERR_CODE_FATAL_DMA:unt:err -P a,mla : random mla errors at 0.5 -P a,m4:18:128 : m4 interrupt sig=128 mla error codes: mla:1 : dmaDramToM4: write payload failed mla:2 : MLA file not found mla:3 : M4 done timeout mla:4 : generic sendCommand failure mla:5 : checkfile failed to match m4 error codes: m4:7:unit:err : DMA Fatal interrupt m4:8:err : MIM Fatal interrupt m4:9:row:col : TLC Fatal interrupt m4:10:err : uC Fatal interrupt : m4:6,11-17 not implemented (will be generated by -P a,m4)
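
When scripting self-test runs, it can help to turn a token such as ``m4:7:20:2`` back into the description listed in the help text above. The following is a minimal sketch and is not part of ``mla-rt``: the ``explain_token`` function and both tables are hypothetical names, and the descriptions are copied from the ``-hP`` output rather than from the runtime's source.

.. code-block:: python

   # Hypothetical helper, not part of mla-rt: explain one -P self-test token.

   # Descriptions copied from the "mla error codes" list in the -hP output.
   MLA_ERRORS = {
       1: "dmaDramToM4: write payload failed",
       2: "MLA file not found",
       3: "M4 done timeout",
       4: "generic sendCommand failure",
       5: "checkfile failed to match",
   }

   # (description, trailing field names) from the "m4 error codes" list.
   M4_ERRORS = {
       7: ("DMA Fatal interrupt", ("unit", "err")),
       8: ("MIM Fatal interrupt", ("err",)),
       9: ("TLC Fatal interrupt", ("row", "col")),
       10: ("uC Fatal interrupt", ("err",)),
   }

   UNKNOWN = "not described in the help text"


   def explain_token(token: str) -> str:
       """Describe a coded token such as 'mla:3' or 'm4:7:20:2'."""
       kind, *rest = token.split(":")
       if kind not in ("mla", "m4") or not rest:
           raise ValueError(f"not a coded self-test token: {token!r}")
       code, args = int(rest[0]), rest[1:]
       if kind == "mla":
           return f"mla:{code}: {MLA_ERRORS.get(code, UNKNOWN)}"
       if code not in M4_ERRORS:
           return f"m4:{code}: {UNKNOWN}"
       description, field_names = M4_ERRORS[code]
       # Pair each trailing value with the field name shown in the help text.
       detail = ", ".join(f"{n}={v}" for n, v in zip(field_names, args))
       return f"m4:{code}: {description}" + (f" ({detail})" if detail else "")


   if __name__ == "__main__":
       for tok in ("mla:3", "m4:7:20:2", "m4:18:128"):
           print(explain_token(tok))

The helper only covers the codes that the help text describes; anything else, including the ``m4:18`` interrupt form, is reported as undescribed instead of being guessed.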