Minimize usage of CPUs through alternatives like State Machines or other technology #37

peers8862 · 2026-02-11T02:31:37Z

peers8862
Feb 11, 2026
Maintainer

List options.

peers8862 · 2026-02-11T02:32:04Z

peers8862
Feb 11, 2026
Maintainer Author

T-PicoC3 (RP2040): 2 PIO blocks with 4 state machines each

3 replies

peers8862 Feb 11, 2026
Maintainer Author

RP2040 PIO State Machine Registers - Detailed Reference

SMx_CLKDIV (Clock Divider Register)

This register controls the execution speed of your state machine by dividing the system clock. Critical for timing-sensitive protocols and power management.

Structure: 32-bit register split into two fields:

Bits 31:16 - Integer divider (16 bits)
Bits 15:8 - Fractional divider (8 bits)
Bits 7:0 - Unused

Clock calculation: SM_clock = sys_clock / (INT + FRAC/256)

The fractional divider gives you fine-grained control, which is essential when you need precise timing for protocols like WS2812 LED control or custom UART rates. For maximum speed, set it to 1.0 (0x00010000). For slower protocols you can reduce the clock drastically, which lowers the state machine's dynamic (switching) power roughly in proportion to its clock rate.

Power consideration: Running a state machine at 1/256th speed (divider = 256.0) when full speed isn't needed cuts its switching activity accordingly, which can matter on battery-operated devices.
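As a minimal sketch of the divider arithmetic above, the helper below packs INT and FRAC into the register layout described in this section. The function names are hypothetical (they are not SDK calls), and the frequency helper simply rounds the divider down to a 1/256 step without range checking.

```c
#include <stdint.h>

/* Pack the integer and fractional parts into the SMx_CLKDIV layout:
 * INT in bits 31:16, FRAC in bits 15:8, bits 7:0 left at zero. */
static uint32_t clkdiv_encode(uint16_t int_part, uint8_t frac_part)
{
    return ((uint32_t)int_part << 16) | ((uint32_t)frac_part << 8);
}

/* Hypothetical helper: derive INT/FRAC for a target state machine clock.
 * The divider is rounded down to a 1/256 step; no range checking is done. */
static uint32_t clkdiv_from_hz(uint32_t sys_hz, uint32_t sm_hz)
{
    /* Divider scaled by 256, so the low 8 bits are the fraction. */
    uint64_t div_x256 = ((uint64_t)sys_hz * 256u) / sm_hz;
    return clkdiv_encode((uint16_t)(div_x256 >> 8), (uint8_t)(div_x256 & 0xFFu));
}
```

For example, a 125 MHz system clock and an 8 MHz state machine clock give a divider of 15.625, which is INT = 15 and FRAC = 160.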

SMx_EXECCTRL (Execution Control Register)

This configures how your state machine executes instructions and handles control flow.

Bit fields:

EXEC_STALLED (bit 31, read-only): Set while an instruction written to SMx_INSTR is stalled (for example, waiting on a FIFO); it clears once that instruction completes. Useful for debugging and monitoring.

SIDE_EN (bit 30): Makes side-set optional. When set, the most significant bit of the delay/side-set field acts as a per-instruction side-set enable instead of a data bit, so instructions without a side value leave the side-set pins unchanged.

SIDE_PINDIR (bit 29): When set, side-set operations affect pin direction (input/output) rather than pin levels. Powerful for protocols that need to dynamically change pin direction.

JMP_PIN (bits 28:24): Selects which GPIO pin the conditional JMP instruction examines. This is your external condition input for branching logic.

OUT_EN_SEL (bits 23:19): Selects which bit of the OUT data acts as the write enable when INLINE_OUT_EN is set.

INLINE_OUT_EN (bit 18): Uses one bit of the OUT data as a write enable for that OUT, so a state machine can skip asserting pins for some words. Useful when several state machines share pins.

OUT_STICKY (bit 17): Continuously re-asserts the most recent OUT or SET value on the pins every cycle, rather than only on the cycle the instruction executes.

WRAP_TOP (bits 16:12): Upper boundary for instruction wrapping. Your program counter wraps from this address back to WRAP_BOTTOM, creating an automatic execution loop.

WRAP_BOTTOM (bits 11:7): Lower boundary for wrapping. Together with WRAP_TOP, this defines your main program loop without needing explicit JMP instructions, saving cycles.

STATUS_SEL (bit 4): Selects what the STATUS source compares (bits 6:5 are reserved):

0x0 = TX FIFO level: all-ones if TX FIFO level < STATUS_N, otherwise all-zeros
0x1 = RX FIFO level: all-ones if RX FIFO level < STATUS_N, otherwise all-zeros
The result is read with MOV x, STATUS and can then drive a conditional JMP; the WAIT instruction does not use it.

STATUS_N (bits 3:0): Threshold value for the comparison. Works with STATUS_SEL to create conditions like "TX FIFO holds fewer than N entries."

The wrap mechanism is particularly useful for efficiency: your program loops automatically without spending a cycle on a branch instruction, which keeps loops tight within the 32-word instruction memory.
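To make the wrap behaviour concrete, here is a small sketch that writes WRAP_BOTTOM and WRAP_TOP into an EXECCTRL value and models how the program counter advances. The macro and function names are hypothetical; only the field positions come from the list above.

```c
#include <stdint.h>

/* Field positions as listed above: WRAP_TOP 16:12, WRAP_BOTTOM 11:7. */
#define EXECCTRL_WRAP_TOP_LSB    12u
#define EXECCTRL_WRAP_BOTTOM_LSB  7u
#define FIELD5_MASK            0x1Fu

/* Replace the wrap fields in an existing EXECCTRL value, keeping other bits. */
static uint32_t execctrl_set_wrap(uint32_t execctrl, uint8_t wrap_bottom, uint8_t wrap_top)
{
    execctrl &= ~((FIELD5_MASK << EXECCTRL_WRAP_TOP_LSB) |
                  (FIELD5_MASK << EXECCTRL_WRAP_BOTTOM_LSB));
    execctrl |= (uint32_t)(wrap_top & FIELD5_MASK) << EXECCTRL_WRAP_TOP_LSB;
    execctrl |= (uint32_t)(wrap_bottom & FIELD5_MASK) << EXECCTRL_WRAP_BOTTOM_LSB;
    return execctrl;
}

/* Model of the next program counter when no jump is taken:
 * reaching wrap_top sends execution back to wrap_bottom at no cycle cost. */
static uint8_t next_pc(uint8_t pc, uint8_t wrap_bottom, uint8_t wrap_top)
{
    return (pc == wrap_top) ? wrap_bottom : (uint8_t)((pc + 1u) & 0x1Fu);
}
```

With wrap_bottom = 0 and wrap_top = 31 the whole 32-word memory is one loop, which matches the register's reset value of 0x0001F000.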

SMx_SHIFTCTRL (Shift Control Register)

Controls the Input and Output Shift Registers - the heart of your data manipulation.

FJOIN_RX (bit 31): When set, the RX FIFO increases from 4 to 8 entries deep by stealing the TX FIFO. Useful when you're only receiving data.

FJOIN_TX (bit 30): Opposite - increases TX FIFO to 8 entries by stealing RX FIFO. Good for transmit-only operations.

PULL_THRESH (bits 29:25): Autopull threshold (1-32, with 32 encoded as 0 in the 5-bit field). When the OSR has shifted out this many bits, it automatically pulls new data from TX FIFO. Set to 32 for full word operations, or less for packed data.

PUSH_THRESH (bits 24:20): Autopush threshold (1-32, with 32 encoded as 0). When the ISR has shifted in this many bits, it automatically pushes to RX FIFO.

OUT_SHIFTDIR (bit 19): Shift direction for OUT operations:

1 = shift right (LSB first)
0 = shift left (MSB first)

IN_SHIFTDIR (bit 18): Shift direction for IN operations, same encoding.

AUTOPULL (bit 17): Enables automatic pulling from TX FIFO when OSR reaches threshold. Critical for continuous data streaming without CPU intervention.

AUTOPUSH (bit 16): Enables automatic pushing to RX FIFO when ISR reaches threshold.

The autopush/autopull mechanism is transformative for power efficiency. Your state machine can stream data continuously while the ARM cores sleep. For your industrial applications, this means a state machine could monitor a sensor protocol, buffer data, and only wake the CPU when significant events occur or buffers fill.
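A minimal sketch of how the autopush/autopull fields could be packed is shown below. The function names are hypothetical. One detail worth noting: the thresholds are 5-bit fields, so a threshold of 32 is stored as 0.

```c
#include <stdint.h>
#include <stdbool.h>

/* Encode a threshold of 1..32 bits into a 5-bit field (32 is stored as 0). */
static uint32_t thresh_field(unsigned bits)
{
    return (uint32_t)(bits & 0x1Fu);
}

/* Build the autopush/autopull part of SMx_SHIFTCTRL; other fields stay zero. */
static uint32_t shiftctrl_autopush_pull(bool autopush, unsigned push_thresh,
                                        bool autopull, unsigned pull_thresh)
{
    uint32_t v = 0;
    v |= thresh_field(pull_thresh) << 25;  /* PULL_THRESH, bits 29:25 */
    v |= thresh_field(push_thresh) << 20;  /* PUSH_THRESH, bits 24:20 */
    v |= (uint32_t)autopull << 17;         /* AUTOPULL, bit 17 */
    v |= (uint32_t)autopush << 16;         /* AUTOPUSH, bit 16 */
    return v;
}
```

Calling shiftctrl_autopush_pull(true, 8, false, 32) describes a byte-oriented receiver: autopush every 8 bits, no autopull.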

SMx_ADDR (Address Register)

Bits 4:0: Current program counter value (0-31, since PIO instruction memory is 32 words deep)

This is read-only during normal operation. When you examine it, you're seeing exactly which instruction the state machine will execute next. Useful for debugging synchronization issues.

The state machine increments this automatically, or it changes on JMP instructions or wrap conditions. You can force execution from a specific address by writing to SMx_INSTR with a JMP instruction.

SMx_INSTR (Instruction Register)

Bits 15:0: Direct instruction execution

Writing a 16-bit PIO instruction to this register forces immediate execution, overriding the current program flow for one cycle. The state machine then resumes normal execution from SMx_ADDR.

Use cases:

Emergency GPIO manipulation without stopping the state machine
Dynamically inserting single instructions (like SET to change pin states)
Debugging by forcing PULL or PUSH operations
Implementing "software interrupts" for state machines

This is incredibly powerful for hybrid control - your state machine runs autonomously, but your CPU can inject commands when needed without stopping operation.

Power advantage: You can build simple state machine programs and use SMx_INSTR for complex occasional operations, keeping the PIO program small and the state machine mostly autonomous.

SMx_PINCTRL (Pin Control Register)

Maps the abstract PIO pin operations (OUT, SET, IN, side-set) to physical GPIO pins.

SIDESET_COUNT (bits 31:29): Number of pins used for side-set operations (0-5). This determines how many bits of your instruction word are reserved for side-set data.

SET_COUNT (bits 28:26): Number of pins affected by SET instructions (0-5).

OUT_COUNT (bits 25:20): Number of pins affected by OUT instructions (0-32).

IN_BASE (bits 19:15): Base GPIO pin number for IN operations. IN reads starting from this pin.

SIDESET_BASE (bits 14:10): Base GPIO pin number for side-set operations.

SET_BASE (bits 9:5): Base GPIO pin number for SET operations.

OUT_BASE (bits 4:0): Base GPIO pin number for OUT operations.

Critical understanding: The PIO doesn't directly specify GPIO pin numbers in instructions. Instead, instructions reference pin index 0, 1, 2... and PINCTRL maps these to actual GPIO pins. This makes PIO programs relocatable - the same program can control different GPIO pins just by changing PINCTRL.

Example: If OUT_BASE = 10 and OUT_COUNT = 4, then OUT PINS, 4 will write to GPIO10-GPIO13.

This indirection is powerful for reusable protocols. Your UART program doesn't care which pins are TX/RX - that's configured in PINCTRL at runtime.
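The mapping can be sketched in C as follows. The struct and function names are hypothetical and exist only for this example; the bit positions are the ones listed above. The modulo-32 wrap of pin indices is stated here as an assumption about the hardware mapping.

```c
#include <stdint.h>

/* Hypothetical container for this example; names mirror the register fields. */
struct pinctrl_fields {
    uint8_t sideset_count, set_count, out_count;
    uint8_t in_base, sideset_base, set_base, out_base;
};

/* Pack the fields into the SMx_PINCTRL layout. */
static uint32_t pinctrl_pack(const struct pinctrl_fields *f)
{
    return ((uint32_t)(f->sideset_count & 0x07u) << 29) |
           ((uint32_t)(f->set_count     & 0x07u) << 26) |
           ((uint32_t)(f->out_count     & 0x3Fu) << 20) |
           ((uint32_t)(f->in_base       & 0x1Fu) << 15) |
           ((uint32_t)(f->sideset_base  & 0x1Fu) << 10) |
           ((uint32_t)(f->set_base      & 0x1Fu) <<  5) |
           ((uint32_t)(f->out_base      & 0x1Fu));
}

/* GPIO reached by relative pin `index` of a group starting at `base`.
 * Assumption: the mapping wraps around from GPIO31 to GPIO0. */
static unsigned pin_index_to_gpio(uint8_t base, unsigned index)
{
    return (base + index) % 32u;
}
```

With OUT_BASE = 10 and OUT_COUNT = 4, indices 0 to 3 resolve to GPIO10 to GPIO13, as in the example above.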

FIFO Registers (TXFx and RXFx)

TXFx (Transmit FIFO)

Each state machine has a 32-bit wide, 4-entry deep TX FIFO (8-deep if you use FJOIN_TX).

Write operation: CPU writes 32-bit words here. The state machine pulls from this FIFO with PULL instructions (manual) or autopull (automatic when OSR empties).

Status flags:

FSTAT register tells you if FIFO is full or empty; FLEVEL reports the current fill level
Can generate DMA requests when not full
Can generate IRQ when not full (TXNFULL); there is no programmable threshold

Power optimization: Fill the TX FIFO with several words, let the state machine stream them out while CPU sleeps. DMA can refill autonomously, creating zero-CPU data streaming for sensor logging or communication protocols.

RXFx (Receive FIFO)

32-bit wide, 4-entry deep (8-deep with FJOIN_RX).

Read operation: CPU reads 32-bit words that the state machine has pushed via PUSH instructions or autopush.

Status flags:

Same FSTAT monitoring as TX
DMA request generation when not empty
IRQ generation when not empty (RXNEMPTY)

Critical for industrial monitoring: State machine can sample digital inputs at precise intervals, pack multiple samples into 32-bit words, push to FIFO. With DMA draining the FIFO into a RAM buffer, the CPU wakes only when that buffer fills, dramatically reducing power consumption compared to polling.

Internal State Machine Registers

These exist inside the state machine itself and are manipulated by PIO instructions, not directly by the CPU.

X and Y Scratch Registers

Two 32-bit general-purpose registers accessible only to PIO instructions.

Common uses:

Loop counters (JMP X-- for countdown loops)
Bit pattern storage
Temporary values during protocol processing
Delay timers

Instructions that use X/Y:

MOV - Transfer between X, Y, ISR, OSR and pins (optionally inverted or bit-reversed)
JMP X-- / JMP Y-- - Decrement and conditionally jump (perfect for counted loops)
OUT X, count / IN Y, count - Use as data staging

Power-efficient pattern: Use X/Y for local state instead of constantly pulling from FIFO. For example, in a pulse-counting application, count down in X locally (PIO has no increment instruction, only the JMP X-- decrement) and only PUSH the count when a threshold is reached.

ISR (Input Shift Register)

32-bit register that accumulates input data before pushing to RX FIFO.

Operation flow:

IN instruction shifts data from source (pins, X, Y, NULL, etc.) into ISR
Data accumulates until threshold reached
Autopush (if enabled) automatically moves ISR contents to RX FIFO
ISR clears and ready for next accumulation

Shift direction (controlled by IN_SHIFTDIR):

Right shift: New data enters at MSB, shifts toward LSB. Good for LSB-first protocols such as UART; after 32 bits the first bit received sits in bit 0.
Left shift: New data enters at LSB, shifts toward MSB. Good for MSB-first protocols such as SPI.

Example: Reading 8-bit serial data - configure PUSH_THRESH=8, each IN PINS, 1 shifts one bit, after 8 bits the byte autopushes to FIFO.

Power benefit: Accumulate many small samples (individual bits or small fields) into packed 32-bit words before waking CPU. Reduces interrupt overhead massively.
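The effect of the shift direction is easiest to see in a small software model. The sketch below is not hardware code; the function names are hypothetical and it only imitates how IN moves bits into a 32-bit ISR.

```c
#include <stdint.h>
#include <stdbool.h>

/* Model of one IN: shift `count` bits (1..32) of `data` into the ISR. */
static uint32_t isr_shift_in(uint32_t isr, uint32_t data, unsigned count, bool shift_right)
{
    if (count >= 32u)
        return data;                       /* a full word replaces the ISR */
    data &= (1u << count) - 1u;            /* keep only the bits being shifted */
    if (shift_right)
        return (isr >> count) | (data << (32u - count));  /* enters at MSB end */
    return (isr << count) | data;                          /* enters at LSB end */
}

/* Receive one byte LSB-first, one bit per IN, as a UART delivers it. */
static uint32_t receive_byte_lsb_first(uint8_t byte, bool shift_right)
{
    uint32_t isr = 0;
    for (unsigned i = 0; i < 8u; i++)
        isr = isr_shift_in(isr, (uint32_t)(byte >> i) & 1u, 1u, shift_right);
    return isr;
}
```

With a right shift and only 8 bits received, the byte ends up in bits 31:24 of the word, so the reader has to take the top byte. With a left shift the same LSB-first stream arrives bit-reversed in bits 7:0.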

OSR (Output Shift Register)

32-bit register that holds output data pulled from TX FIFO before shifting out.

Operation flow:

PULL instruction (or autopull) loads 32-bit word from TX FIFO into OSR
OUT instructions shift data from OSR to destination (pins, X, Y, PC, etc.)
When empty (or threshold reached), autopull loads next word

Shift direction (controlled by OUT_SHIFTDIR):

Right shift: Data exits from LSB, remaining bits shift right
Left shift: Data exits from MSB, remaining bits shift left

Example: Transmitting 8-bit characters - load 32-bit word containing 4 characters, set PULL_THRESH=32, each OUT PINS, 8 sends one character.

OUT_STICKY feature: When enabled, the most recent OUT or SET value is re-asserted on the pins every cycle instead of only on the cycle the instruction executes. It is mainly useful together with INLINE_OUT_EN, when several state machines share pins.
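The OUT side can be modelled the same way. This is a hypothetical software model for illustration, not a description of a hardware register; it shows which bits leave the OSR first for each shift direction.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical software model of the OSR; not a hardware register layout. */
struct osr_model {
    uint32_t value;     /* bits still waiting to be shifted out */
    unsigned shifted;   /* bits shifted out since the last pull */
};

/* Model of one OUT: remove `count` bits (1..32) and return them right-aligned. */
static uint32_t osr_shift_out(struct osr_model *osr, unsigned count, bool shift_right)
{
    uint32_t out;
    if (count >= 32u) {
        out = osr->value;
        osr->value = 0;
    } else if (shift_right) {
        out = osr->value & ((1u << count) - 1u);   /* take from the LSB end */
        osr->value >>= count;
    } else {
        out = osr->value >> (32u - count);         /* take from the MSB end */
        osr->value <<= count;
    }
    osr->shifted += count;
    return out;
}
```

For the word 0x44434241 (the characters "ABCD" packed with 'A' in the low byte), a right shift yields 'A' first, while a left shift yields 'D' first.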

Practical Integration Example

For your industrial hardware, here's how these registers work together in a typical UART receiver:

; Configuration via registers:
; CLKDIV: Set to achieve 8x oversampling of baud rate
; SHIFTCTRL: Set AUTOPUSH=1, PUSH_THRESH=8, IN_SHIFTDIR=right
; PINCTRL: Set IN_BASE to your RX GPIO pin (WAIT PIN and IN PINS both use it)
; With a right shift, each byte arrives in bits 31:24 of the FIFO word

.program uart_rx
start:
    wait 0 pin 0        ; Wait for start bit (low)
    set x, 7 [10]       ; Load bit counter, delay 1.5 bit periods to center of first data bit
bitloop:
    in pins, 1          ; Sample one bit into ISR; autopush fires on the 8th bit
    jmp x-- bitloop [6] ; Loop for 8 bits, 8 cycles per bit
    jmp start           ; Ready for next byte

Power efficiency achieved:

State machine runs at minimal clock (CLKDIV) needed for baud rate
CPU sleeps until the RX FIFO not-empty interrupt fires (or until DMA has collected many bytes)
No bit-banging overhead on ARM cores
DMA could empty FIFO to RAM buffer, further reducing CPU wake events

This architecture lets your portable industrial device monitor multiple serial sensors continuously while the main CPU remains in deep sleep most of the time, waking only when significant data has accumulated.

The register design philosophy is minimal overhead and maximum autonomy - exactly what you need for power-constrained handheld tools.

peers8862 Feb 11, 2026
Maintainer Author

Deep Dive: SMx_PINCTRL and Side-Set Architecture

Fundamental Concept: Pin Mapping Abstraction

The PIO state machine doesn't work with absolute GPIO pin numbers. Instead, it uses relative pin indexing that gets mapped to physical GPIOs through PINCTRL. This is fundamentally different from traditional microcontroller peripherals.

Why this matters for your industrial hardware:

Same PIO program works on any GPIO pins
Reconfigure pin assignments at runtime without recompiling
Multiple state machines can run identical code on different pin sets
Protocol libraries become truly portable

SMx_PINCTRL Register Breakdown (32 bits)

Bits 31:29 [3 bits]  - SIDESET_COUNT
Bits 28:26 [3 bits]  - SET_COUNT  
Bits 25:20 [6 bits]  - OUT_COUNT
Bits 19:15 [5 bits]  - IN_BASE
Bits 14:10 [5 bits]  - SIDESET_BASE
Bits 9:5   [5 bits]  - SET_BASE
Bits 4:0   [5 bits]  - OUT_BASE

The Four Pin Operation Types

Each PIO instruction can manipulate pins in four distinct ways. PINCTRL defines which physical GPIOs each operation touches.

1. OUT Pins (Data Output)

OUT_BASE (bits 4:0): Starting GPIO number (0-31)
OUT_COUNT (bits 25:20): How many consecutive pins (0-32)

When your PIO program executes OUT PINS, n, it writes to GPIOs [OUT_BASE ... OUT_BASE+OUT_COUNT-1], taking the rightmost n bits.

Example:

OUT_BASE = 12
OUT_COUNT = 8
Instruction: OUT PINS, 8

Writes to GPIO12-GPIO19. If you OUT PINS, 4, the data lands on GPIO12-GPIO15 and GPIO16-GPIO19 are written with zeros, because every OUT PINS updates all OUT_COUNT pins.

Power-efficient pattern for parallel interfaces:

; 8-bit parallel bus on GPIO12-19
; OUT_BASE = 12, OUT_COUNT = 8
.program parallel_write
    pull            ; Get 32-bit word from FIFO
    out pins, 8     ; Write byte 0 to GPIO12-19
    out pins, 8     ; Write byte 1 to GPIO12-19  
    out pins, 8     ; Write byte 2 to GPIO12-19
    out pins, 8     ; Write byte 3 to GPIO12-19

Four bytes transmitted with minimal CPU involvement. State machine clocks at exact timing needed - CPU sleeps.

2. IN Pins (Data Input)

IN_BASE (bits 19:15): Starting GPIO number (0-31)

IN PINS, n reads from GPIOs starting at IN_BASE, reading n consecutive pins.

Critical detail: There's no IN_COUNT. The instruction itself specifies how many bits to read (1-32).

Example:

IN_BASE = 20
Instruction: IN PINS, 8

Reads GPIO20-GPIO27 into the ISR.

Industrial sensor application:

; Read 4 sensors on GPIO20-23 continuously
; IN_BASE = 20, AUTOPUSH = 1, PUSH_THRESH = 32
.program sensor_sample
.wrap_target
    in pins, 4 [31] ; Read 4 sensor pins, then delay (32 cycles per sample)
.wrap               ; Autopush sends a packed word to the FIFO every 8 samples

This packs 8 samples into each 32-bit word, so 250 samples occupy about 32 words. The RX FIFO holds at most 8 words, so let DMA drain it into RAM; the CPU then wakes once to read the entire batch. Massive power savings vs. polling.
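On the CPU side, the packed words have to be split back into samples. The sketch below assumes IN_SHIFTDIR is set to shift right and that each word holds eight 4-bit samples; under that assumption the oldest sample ends up in bits 3:0. The function name is hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* Unpack eight 4-bit samples from one FIFO word.
 * Assumes IN_SHIFTDIR = shift right, so the oldest sample sits in bits 3:0
 * and the newest in bits 31:28. */
static void unpack_samples(uint32_t word, uint8_t out[8])
{
    for (size_t i = 0; i < 8; i++)
        out[i] = (uint8_t)((word >> (4u * i)) & 0xFu);
}
```

If the state machine shifts left instead, the order within the word is reversed and the loop would have to read from the top nibble down.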

3. SET Pins (Direct Pin Control)

SET_BASE (bits 9:5): Starting GPIO number (0-31)
SET_COUNT (bits 28:26): How many consecutive pins (0-5, max 5!)

SET PINS, value directly writes a literal value (0-31, 5 bits) to the pins.

Key limitation: SET can only control up to 5 pins. This is because the literal value is encoded in the instruction itself (5 bits).

Example:

SET_BASE = 10
SET_COUNT = 3
Instruction: SET PINS, 0b101

Sets GPIO10=1, GPIO11=0, GPIO12=1

Common use - chip select and control signals:

; SPI-like protocol with CS on GPIO8
; SET_BASE = 8, SET_COUNT = 1
.program spi_transfer
    set pins, 0     ; CS low (activate)
    out pins, 8     ; Send data on OUT pins
    set pins, 1     ; CS high (deactivate)

SET is for compile-time known values - control signals, initialization, clock manipulation.

4. Side-Set Pins (The Power Feature)

SIDESET_BASE (bits 14:10): Starting GPIO number (0-31)
SIDESET_COUNT (bits 31:29): How many pins (0-5)

This is where PIO becomes truly powerful for timing-critical protocols.

Side-Set: Deep Conceptual Understanding

The fundamental problem side-set solves:

In traditional microcontrollers, generating a clock signal while transmitting data requires alternating instructions:

set_clock_high()
write_data_bit()
set_clock_low()
write_data_bit()

Each operation consumes a cycle. For protocols like SPI, I2S, or WS2812 where clock and data must change on precise timing, this doubles your instruction count and makes timing complex.

Side-set solution:

Side-set pins are modified by every instruction using bits embedded in the instruction word itself. You can toggle clock/control pins "for free" alongside your main operation.

Side-Set Instruction Encoding

Every PIO instruction is 16 bits:

Bits 15:13 - Major opcode
Bits 12:8  - Delay/side-set field
Bits 7:0   - Instruction-specific operands

Bits 12:8 (5 bits total) are shared between delay cycles and side-set data. Side-set takes the most significant bits and delay keeps the rest. The split is determined by SIDESET_COUNT:

If SIDESET_COUNT = 0:

All 5 bits are delay (0-31 cycles)
No side-set capability

If SIDESET_COUNT = 1:

Bit 12: Side-set data (1 bit)
Bits 11:8: Delay (0-15 cycles)

If SIDESET_COUNT = 2:

Bits 12:11: Side-set data (2 bits)
Bits 10:8: Delay (0-7 cycles)

If SIDESET_COUNT = 3:

Bits 12:10: Side-set data (3 bits)
Bits 9:8: Delay (0-3 cycles)

If SIDESET_COUNT = 4:

Bits 12:9: Side-set data (4 bits)
Bit 8: Delay (0-1 cycles)

If SIDESET_COUNT = 5:

Bits 12:8: All side-set data (5 bits)
No delay capability!

The operand bits (7:0) are never affected. Critical tradeoff: More side-set bits = fewer delay bits. Choose based on protocol needs.
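The tradeoff can be expressed as a short calculation. This sketch assumes the RP2040 datasheet layout, in which delay and side-set share the 5-bit field in instruction bits 12:8 with side-set in the upper bits. The function names are hypothetical.

```c
#include <stdint.h>

/* Largest delay available for a given SIDESET_COUNT (0..5). */
static unsigned max_delay(unsigned sideset_count)
{
    return (1u << (5u - sideset_count)) - 1u;
}

/* Build the 5-bit delay/side-set field (instruction bits 12:8).
 * Returns -1 if the side-set value or the delay does not fit. */
static int delay_sideset_field(unsigned sideset_count, unsigned side, unsigned delay)
{
    if (sideset_count > 5u || delay > max_delay(sideset_count))
        return -1;
    if (side >= (1u << sideset_count))
        return -1;
    return (int)((side << (5u - sideset_count)) | delay);
}
```

For instance, `nop side 1 [1]` under `.side_set 1` corresponds to delay_sideset_field(1, 1, 1), which gives 0x11: the side-set bit in the top position and a delay of 1 below it.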

Side-Set in Practice: SPI Master

SPI requires coordinated clock (SCK) and data (MOSI) changes. Perfect for side-set.

; GPIO16 = MOSI (OUT pin)
; GPIO17 = SCK (side-set pin)

; OUT_BASE = 16, OUT_COUNT = 1
; SIDESET_BASE = 17, SIDESET_COUNT = 1

.program spi_master
.side_set 1        ; Declare we're using 1 side-set bit

    pull side 0    ; Load data from FIFO, SCK=0
    set x, 7 side 0 ; Bit counter, SCK=0
bitloop:
    out pins, 1 side 0 ; Write bit to MOSI, SCK=0 (setup time)
    nop side 1 [1]     ; SCK=1 (clock high for 2 cycles)
    jmp x-- bitloop side 0 ; SCK=0, next bit

Analyze the power and timing efficiency:

out pins, 1 side 0 - Single instruction does two things:
- Shifts one bit to MOSI
- Ensures SCK is low (setup time for data)
nop side 1 [1] - Creates clock pulse:
- SCK goes high
- Holds for 2 cycles (the nop cycle + delay)
jmp x-- bitloop side 0 - SCK returns low during loop jump

Without side-set, this would require:

set pins, 0        ; SCK low
out pins, 1        ; Data bit
set pins, 1        ; SCK high
nop
set pins, 0        ; SCK low

That's 5 instructions instead of 3. For 8 bits, you've added 16 extra cycles.

For battery-powered industrial tools, fewer cycles = less energy per transaction. Over millions of SPI transfers (sensor readings, display updates), this compounds significantly.

Side-Set Pin Direction Control

SIDE_PINDIR bit (bit 29 of EXECCTRL) changes what side-set controls:

SIDE_PINDIR = 0 (default): Side-set controls pin output values
SIDE_PINDIR = 1: Side-set controls pin direction (input vs output)

This is exotic but incredibly powerful for bidirectional protocols.

Bidirectional Protocol Example: 1-Wire / DHT Sensors

; Single data pin on GPIO18 (must be input sometimes, output others)
; Use side-set to control direction
; Bit timing delays are omitted for clarity

; SIDESET_BASE = 18, SIDESET_COUNT = 1
; SET_BASE = 18, SET_COUNT = 1, IN_BASE = 18
; SIDE_PINDIR = 1  (side-set controls direction, not value)

.program onewire_read
.side_set 1 pindirs

    set pins, 0  side 1    ; Output mode, drive low (start pulse)
    nop          side 0    ; Input mode (release line)
    wait 1 pin 0 side 0    ; Wait for the line to go high
    in pins, 1   side 0    ; Read bit

What's happening:

side 1 makes GPIO18 an output
side 0 makes GPIO18 an input
Single instruction changes bus direction
No race conditions or glitches

Without this feature, you'd need:

set pindirs, 1     ; Set direction
set pins, 0        ; Set value
set pindirs, 0     ; Clear direction

The side-set version is atomic and faster.

Power advantage for industrial sensors: Many low-power sensors (DHT22, DS18B20, 1-Wire devices) use bidirectional single-wire protocols. Side-set with PINDIR mode makes these protocols efficient and reliable on PIO.

Advanced Side-Set: Multi-Pin Control

You can side-set up to 5 pins simultaneously. Each bit in the side-set field controls one consecutive pin starting from SIDESET_BASE.

Example: RGB LED with separate clock

; GPIO20 = R, GPIO21 = G, GPIO22 = B (side-set)
; GPIO23 = Clock (also side-set)

; SIDESET_BASE = 20, SIDESET_COUNT = 4

.program rgb_clock
.side_set 4

    pull side 0b0000           ; All low, wait for data
    out x, 24 side 0b0000      ; Move 24 bits into X, leaving 8 bits in the OSR
loop:
    out null, 1 side 0b1001    ; Drop one bit; R=1, Clock=1
    jmp !osre loop side 0b0000 ; All low, continue while data remains

Each side-set value encodes: [Clock][B][G][R]

Side-set value 0b1001 = R=1, G=0, B=0, Clock=1. Side-set values are literals fixed at assembly time; they cannot follow data in the OSR. Data-dependent pin levels need OUT or MOV instead.

This is complex but shows the power: four GPIO pins changing state in a single instruction based on literal values.

Side-Set Optional Mode

In your PIO assembler, you can declare .side_set N opt:

.side_set 1 opt

This makes side-set optional - not every instruction needs to specify it. Costs one extra bit from delay field.

Encoding with optional side-set:

Extra bit indicates "is side-set used?"
If used, the following bits are side-set data
If not used, the side-set pins keep their previous state; those bits are ignored and do not become extra delay

When to use optional:
Some instructions need side-set (data transfer), others don't (waiting, counting). Optional mode gives flexibility.

Example:

.program flexible_spi
.side_set 1 opt

    pull              ; No side-set needed
    set x, 7          ; No side-set needed
bitloop:
    out pins, 1 side 0 ; Side-set required: clock low
    nop side 1         ; Side-set required: clock high
    jmp x-- bitloop    ; No side-set needed

Tradeoff: Optional mode reduces maximum delay per instruction. For timing-critical code, you might prefer mandatory side-set and use nop instructions for delays.

Combining All Pin Operations

You can use all four pin types in one program:

; SPI-like interface with CS, CLK, MOSI, MISO
; GPIO10 = MISO (IN pin)
; GPIO11 = MOSI (OUT pin)  
; GPIO12 = CS (SET pin)
; GPIO13 = CLK (side-set pin)

; IN_BASE = 10
; OUT_BASE = 11, OUT_COUNT = 1
; SET_BASE = 12, SET_COUNT = 1
; SIDESET_BASE = 13, SIDESET_COUNT = 1

.program spi_full_duplex
.side_set 1

    set pins, 0 side 0    ; CS low (select), CLK low
    pull side 0           ; Load TX data
    set x, 7 side 0       ; Bit counter
bitloop:
    out pins, 1 side 0    ; Write MOSI bit, CLK low (setup)
    in pins, 1 side 1     ; Read MISO bit, CLK high (sample)
    jmp x-- bitloop side 0 ; CLK low, next bit
    push side 0           ; Push RX data to FIFO
    set pins, 1 side 0    ; CS high (deselect)

Single state machine, full-duplex SPI, four pin operations:

SET controls chip select (slow control signal)
OUT writes transmit data
IN reads receive data
Side-set generates clock

Power efficiency: This runs continuously at exact SPI clock rate. CPU only pulls/pushes FIFO data. Could interface with DMA for zero-CPU SPI transactions while processor sleeps.

Real-World Industrial Application: WS2812B LEDs

WS2812B (NeoPixel) protocol is notoriously timing-sensitive. Perfect demonstration of side-set power.

Protocol requirements:

Bit 1: 800ns high, 450ns low
Bit 0: 400ns high, 850ns low
Timing tolerance: ±150ns

Traditional implementation: Bit-banging with careful cycle counting, blocks CPU, fragile timing.

PIO with side-set:

; GPIO14 = Data line (side-set)
; SIDESET_BASE = 14, SIDESET_COUNT = 1
; AUTOPULL = 1, PULL_THRESH = 24, OUT shifts left (MSB first, data left-justified)
; Clock divider set for 8MHz (125ns per cycle, 10 cycles = 1.25µs per bit)

.program ws2812
.side_set 1

.wrap_target
bitloop:
    out x, 1       side 0 [2] ; Shift next bit into X, line low for 3 cycles
    jmp !x do_zero side 1 [2] ; Line high for 3 cycles, branch on the bit
do_one:
    jmp bitloop    side 1 [3] ; Bit 1: stay high for 4 more cycles (7 total)
do_zero:
    nop            side 0 [3] ; Bit 0: go low for 4 cycles (7 low in total)
.wrap

Timing analysis at 8MHz (125ns per cycle):

Bit 1 timing:

High: 3 (jmp !x) + 4 (jmp bitloop) = 7 cycles = 875ns ✓ (target 800ns)
Low: 3 (out) = 3 cycles = 375ns ✓ (target 450ns)

Bit 0 timing:

High: 3 (jmp !x) = 3 cycles = 375ns ✓ (target 400ns)
Low: 4 (nop) + 3 (out) = 7 cycles = 875ns ✓ (target 850ns)

All four durations fall within the ±150ns tolerance, and every bit takes exactly 10 cycles.

The point: side-set eliminates separate GPIO toggle instructions. The data line changes state as part of flow control, not as separate operations.
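Cycle budgets like this are easy to check with a few lines of C. The sketch below assumes an 8 MHz state machine clock (125 ns per cycle) and the ±150 ns tolerance quoted above; the function names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

/* Duration in nanoseconds of `cycles` state machine cycles at `sm_hz`. */
static uint32_t cycles_to_ns(uint32_t cycles, uint32_t sm_hz)
{
    return (uint32_t)(((uint64_t)cycles * 1000000000u) / sm_hz);
}

/* True if `actual_ns` is within `tol_ns` of `target_ns`. */
static bool within_tolerance(uint32_t actual_ns, uint32_t target_ns, uint32_t tol_ns)
{
    uint32_t diff = actual_ns > target_ns ? actual_ns - target_ns
                                          : target_ns - actual_ns;
    return diff <= tol_ns;
}
```

At 8 MHz, 7 cycles give 875 ns and 3 cycles give 375 ns, so a 10-cycle bit split 7/3 for a one and 3/7 for a zero stays inside the stated tolerance, whereas a 5-cycle (625 ns) high time for a zero bit would not.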

Power Consumption Analysis: Side-Set vs Traditional

Traditional bit-banging (ARM Cortex-M0+):

// gpio_set(), gpio_clear() and delay_ns() are placeholder helpers, not SDK calls
for (int i = 23; i >= 0; i--) {     // WS2812 expects the MSB first
    if (data & (1u << i)) {
        gpio_set(PIN);      // ~3 cycles
        delay_ns(800);      // ~100 cycles @ 125MHz
        gpio_clear(PIN);    // ~3 cycles
        delay_ns(450);      // ~56 cycles
    } else {
        gpio_set(PIN);      // ~3 cycles
        delay_ns(400);      // ~50 cycles
        gpio_clear(PIN);    // ~3 cycles
        delay_ns(850);      // ~106 cycles
    }
}

Total per bit: ~160 cycles at 125MHz = ~1.3µs per bit

Power draw: CPU running at 125MHz continuously

PIO with side-set:

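State machine runs at 8MHz (divided from 125MHz)
CPU starts a DMA transfer (or tops up the FIFO from an interrupt), then sleeps
10 cycles per bit at 8MHz = 1.25µs per bit (correct timing)
CPU is needed only to start the transfer and handle its completion

Energy savings: For driving 100 LEDs (2400 bits):

Traditional: 125MHz for 3.0ms (2400 × 1.25µs) = 375,000 CPU cycles
PIO: 125MHz for DMA setup (~100 cycles, as an estimate) + state machine at 8MHz for the same 3.0ms

Rough estimate: CPU active time for this task drops by more than 99%. The actual energy saving depends on the sleep mode used and should be measured on the target board.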

For your battery-powered industrial hardware controlling status LEDs, this is transformative.
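The CPU-cycle figure for the bit-banged case can be reproduced with simple arithmetic. The sketch assumes a nominal 1.25 µs per WS2812 bit and that the CPU busy-waits for the whole frame; the function name is hypothetical.

```c
#include <stdint.h>

/* CPU cycles spent busy-waiting through a bit-banged frame. */
static uint64_t bitbang_cpu_cycles(uint32_t bits, uint32_t bit_time_ns, uint32_t cpu_hz)
{
    return ((uint64_t)bits * bit_time_ns * cpu_hz) / 1000000000u;
}
```

For 100 LEDs (2400 bits) at 125 MHz this gives 375,000 cycles, or 3.0 ms of CPU time per frame that a PIO-driven transfer leaves free.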

Design Guidelines for Side-Set

When to use side-set:

Clock generation - Any protocol with a separate clock signal (SPI, I2C, I2S)
Precise timing - Protocols where signal edges must align exactly (WS2812, UART)
Chip select / control signals - Signals that change predictably with data
Bidirectional buses - Use SIDE_PINDIR for direction control

When NOT to use side-set:

Irregular control signals - If timing is data-dependent in complex ways
Many control signals - Limited to 5 pins (as is SET), might need OUT instead
Simple protocols - If you don't need cycle-accurate timing, SET is simpler

Optimal SIDESET_COUNT selection:

Count = 1: Most common, single clock/control signal, keeps 4 delay bits (0-15 cycles)
Count = 2: Dual signals (clock + enable), keeps 3 delay bits (0-7 cycles)
Count = 3: Keeps 2 delay bits (0-3 cycles)
Count = 4-5: Rare, parallel control buses; 1 delay bit at 4, none at 5

Memory and Power Footprint

PIO instruction memory: 32 instructions × 16 bits = 64 bytes per PIO block, shared by that block's four state machines

Side-set doesn't increase memory usage - it's encoded in existing instruction bits.

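Power consumption hierarchy (lowest to highest; the figures are rough orders of magnitude rather than datasheet values, so measure on your own board):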

State machine clock-gated (stopped): ~0µA
State machine running at minimum CLKDIV: ~tens of µA
State machine at full speed (no divider): ~hundreds of µA
ARM core bit-banging: ~mA range

For your industrial tools, keeping ARM cores asleep and letting PIO handle I/O is critical for battery life.

Practical Example: Industrial Sensor Bus

Imagine monitoring 8 digital sensors with timestamps, minimal power:

; 8 sensors on GPIO0-7 (IN pins)
; Sampling clock on GPIO8 (side-set, for external ADCs)
; Trigger output on GPIO9 (SET pin, for external devices)

; IN_BASE = 0
; SET_BASE = 9, SET_COUNT = 1
; SIDESET_BASE = 8, SIDESET_COUNT = 1

.program sensor_monitor
.side_set 1

init:
    set pins, 0 side 0      ; Trigger low, clock low
    
sample_loop:
    set pins, 1 side 0      ; Trigger high (start external conversion)
    set x, 24 side 0        ; SET takes 0-31; 25 passes x 4 cycles = ~100 cycles for conversion
wait_conv:
    jmp x-- wait_conv side 0 [3]
    
    in pins, 8 side 1       ; Sample all 8 sensors, clock high
    set pins, 0 side 0      ; Trigger low, clock low
    
    ; Could repeat 4x to pack 32 bits, then autopush
    
    jmp sample_loop side 0

Result: Continuous sensor monitoring at precise intervals, data buffered in FIFO, CPU wakes only when buffer fills or on timer interrupt to timestamp batches.

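Power advantage: If sampling at 1kHz with DMA draining the FIFO into a RAM buffer, the CPU can sleep for 100ms and then handle 100 samples in one wake-up instead of waking for every sample. Without DMA, an 8-entry joined RX FIFO holding four 8-bit samples per word fills in 32ms, so the CPU would have to wake at least that often.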
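How long the CPU can sleep without DMA follows from the FIFO depth. The sketch below is a back-of-the-envelope helper with a hypothetical name; it assumes samples are packed into 32-bit words with no padding and that nothing drains the FIFO in the meantime.

```c
#include <stdint.h>

/* Milliseconds until the RX FIFO is full if nothing drains it. */
static uint32_t fifo_fill_time_ms(uint32_t fifo_words, uint32_t bits_per_sample,
                                  uint32_t sample_hz)
{
    uint32_t samples_per_word = 32u / bits_per_sample;
    return (fifo_words * samples_per_word * 1000u) / sample_hz;
}
```

With 8-bit samples at 1 kHz, a 4-entry FIFO fills in 16 ms and a joined 8-entry FIFO in 32 ms. Longer sleep intervals need DMA into a RAM buffer.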

This is the essence of efficient embedded design for battery-powered tools - offload repetitive timing-critical tasks to dedicated hardware (PIO), let CPU handle only decision-making and high-level logic.

peers8862 Feb 11, 2026
Maintainer Author

RP2040 ARM Cortex-M0+ Core Registers - Complete Architecture

The RP2040 has two ARM Cortex-M0+ cores. Each core has its own complete register set. Understanding these is fundamental for assembly programming and low-level optimization.

Core Register Types

The ARM Cortex-M0+ has three categories of registers:

General Purpose Registers (R0-R12)
Special Purpose Registers (R13-R15)
Special Registers (PSR, PRIMASK, CONTROL)

General Purpose Registers (R0-R12)

Thirteen 32-bit registers for general computation and data manipulation.

R0-R7: Low Registers

These are the primary working registers and can be accessed by all 16-bit Thumb instructions (compact encoding).

R0 (Register 0)

Primary accumulator
First function argument (ARM calling convention)
Function return value
Can be used for any purpose within functions

R1 (Register 1)

Second function argument
Secondary accumulator
General purpose within functions

R2 (Register 2)

Third function argument
General purpose

R3 (Register 3)

Fourth function argument
General purpose
Often used as scratch register (caller doesn't expect preservation)

R4 (Register 4)

General purpose
Callee-saved: If a function uses R4, it must preserve and restore it
Good for loop counters in nested functions

R5 (Register 5)

General purpose
Callee-saved
Often used for local variables that span function calls

R6 (Register 6)

General purpose
Callee-saved

R7 (Register 7)

General purpose
Callee-saved
Sometimes used as frame pointer in debug builds
On some systems used for syscalls (not typical in bare-metal RP2040)

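Critical for code size: The Cortex-M0+ implements ARMv6-M, where almost every instruction is a 16-bit Thumb encoding (2 bytes) and most of them can only address the low registers (R0-R7). High registers (R8-R12) are reachable only through a few instructions (MOV, ADD, CMP, BX/BLX), so using them usually costs extra MOV instructions. Smaller code = less flash access = lower power.

For your battery-constrained tools, prioritize R0-R7 in hand-written assembly.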

R8-R12: High Registers

These are reachable only by a handful of 16-bit instructions (MOV, ADD, CMP, BX/BLX); ARMv6-M has no 32-bit Thumb-2 data-processing encodings.

R8 (Register 8)

General purpose
Callee-saved
Must be moved into a low register for most operations

R9 (Register 9)

General purpose
Callee-saved
On some platforms reserved for OS/RTOS (not typical on bare-metal RP2040)

R10 (Register 10)

General purpose
Callee-saved

R11 (Register 11)

General purpose
Callee-saved
Sometimes used as frame pointer (FP) by compilers

R12 (Register 12) - IP (Intra-Procedure-call scratch register)

Caller-saved (unlike R8-R11)
Used by linker veneers for long branches
Can be corrupted by function calls
Generally avoid using explicitly

Power consideration: Working with R8-R12 typically costs an extra MOV in each direction. In tight loops executing millions of times, those extra instructions increase flash bandwidth and power consumption.

Special Purpose Registers (R13-R15)

These have dedicated hardware functions and special access requirements.

R13 - SP (Stack Pointer)

Two stack pointers exist:

MSP - Main Stack Pointer

Used in handler mode (interrupt handlers)
Default stack pointer on reset
Typically used for kernel/system operations

PSP - Process Stack Pointer

Used in thread mode (user code) if CONTROL register configures it
For RTOS task stacks

SP behavior:

Auto-decrements on PUSH
Auto-increments on POP
Must stay 4-byte aligned at all times, and 8-byte (double-word) aligned at public function boundaries per the ARM ABI (AAPCS)
Points to last used stack location (full descending stack)

Critical stack operations:

PUSH {R4-R7, LR}     ; Save registers, SP -= 20 bytes
POP  {R4-R7, PC}     ; Restore registers, SP += 20 bytes, return

Power-critical consideration for industrial tools:

Stack size directly impacts RAM usage. The RP2040 has only 264KB SRAM total. Deep call chains and large stack frames waste precious RAM and increase memory access energy.

Efficient stack discipline:

Minimize local variables in deeply nested functions
Use static/global for large buffers instead of stack allocation
Monitor stack high-water mark in production code
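
One common way to monitor the high-water mark is "stack painting": fill the stack region with a known pattern at boot, then count how much of the pattern survives. The sketch below models this on a plain array so it can run anywhere; the names `stack_paint`, `stack_unused_words`, and `stack_peak_bytes` and the fill value are illustrative assumptions, not SDK functions. On real hardware the array would be replaced by the stack region defined in your linker script.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xDEADBEEFu   /* arbitrary pattern, assumed unlikely in real data */

/* Fill the whole stack region with the pattern (do this before the stack is in use). */
static void stack_paint(uint32_t *base, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        base[i] = STACK_FILL;
    }
}

/* The stack is full descending, so it grows from the top of the region down
 * toward base. Untouched words are therefore the run of pattern at the low end. */
static size_t stack_unused_words(const uint32_t *base, size_t words)
{
    size_t n = 0;
    while (n < words && base[n] == STACK_FILL) {
        n++;
    }
    return n;
}

/* Peak stack usage in bytes since the region was painted. */
static size_t stack_peak_bytes(const uint32_t *base, size_t words)
{
    assert(base != NULL);
    return (words - stack_unused_words(base, words)) * sizeof(uint32_t);
}
```

Calling `stack_peak_bytes` periodically (or just before sleep) gives an estimate of how close the deepest call chain has come to the end of the region. It is only an estimate: a stacked word that happens to equal the fill pattern at the boundary would be counted as unused.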

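Stack corruption is catastrophic - the Cortex-M0+ has no stack-limit registers, so an overflow silently overwrites adjacent RAM unless an MPU guard region is configured. Your industrial devices must carefully manage stack, especially when running near memory limits.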

R14 - LR (Link Register)

Stores the return address when a function is called.

Function call mechanism:

BL function_name     ; Branch with Link: LR = address of the next instruction (bit 0 set for Thumb), PC = function_address

Function return mechanism:

BX LR                ; Branch to address in LR
; or more commonly:
POP {PC}             ; Pop return address directly to PC

LR special values:

When an interrupt occurs, LR is loaded with special EXC_RETURN values:

0xFFFFFFF1 - Return to handler mode with MSP
0xFFFFFFF9 - Return to thread mode with MSP
0xFFFFFFFD - Return to thread mode with PSP

These magic values tell the processor how to return from exception handlers. Bit [2] determines which stack pointer to use on return.
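
To make the bit meanings concrete, here is a small sketch that decodes the three EXC_RETURN values listed above. The struct and function names are hypothetical helpers for illustration. Bit [3] distinguishes thread from handler mode and bit [2] selects the stack pointer, as described in the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper type, not part of any SDK. */
typedef struct {
    bool valid;           /* true only for the three values listed above */
    bool to_thread_mode;  /* bit 3: 1 = return to thread mode, 0 = handler mode */
    bool uses_psp;        /* bit 2: 1 = unstack from PSP, 0 = unstack from MSP */
} exc_return_info;

static exc_return_info decode_exc_return(uint32_t lr)
{
    exc_return_info info = { false, false, false };
    if (lr == 0xFFFFFFF1u || lr == 0xFFFFFFF9u || lr == 0xFFFFFFFDu) {
        info.valid = true;
        info.to_thread_mode = (lr & (1u << 3)) != 0;
        info.uses_psp = (lr & (1u << 2)) != 0;
    }
    return info;
}

/* Usage sketch: what an RTOS task running on PSP leaves in LR on exception entry. */
static void exc_return_example(void)
{
    exc_return_info info = decode_exc_return(0xFFFFFFFDu);
    assert(info.valid && info.to_thread_mode && info.uses_psp);
}
```

A fault handler can use the same test on bit [2] to decide whether the stacked exception frame should be read from MSP or PSP.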

Leaf function optimization:

Functions that don't call other functions (leaf functions) don't need to save LR:

; Leaf function - no PUSH/POP of LR needed
add_two:
    ADD R0, R0, R1       ; Result in R0
    BX LR                ; Return directly

Non-leaf function - must preserve LR:

outer_function:
    PUSH {R4, LR}        ; Save R4 and return address
    BL inner_function    ; LR gets overwritten
    ; do more work
    POP {R4, PC}         ; Restore R4, return (PC = old LR)

Power efficiency: Minimizing stack operations (PUSH/POP) reduces memory access. In deeply recursive code or long call chains, this matters.

Critical for interrupt handlers: LR contains the EXC_RETURN value. Corrupting it crashes your system. Always PUSH/POP LR properly in interrupt handlers that call functions.

R15 - PC (Program Counter)

Points to the currently executing instruction + 4 bytes (pipelining artifact).

Special behaviors:

1. Reading PC:

MOV R0, PC           ; R0 = current PC + 4

Returns address of current instruction + 4. Rarely useful except for position-independent code.

2. Writing PC (branching):

MOV PC, R0           ; Branch to address in R0 (bit 0 ignored, target is halfword-aligned)
POP {PC}             ; Return from function
ADD PC, PC, R0       ; Computed branch (switch statements)

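3. PC is always halfword-aligned (Thumb state):
The Cortex-M0+ only executes Thumb instructions, so instruction addresses are multiples of 2 and PC bit [0] reads as 0. Addresses loaded into PC by BX, BLX, or POP must still have bit [0] set to 1 to indicate Thumb state; a target with bit [0] clear causes a HardFault.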

4. PC-relative addressing:

Many loads use PC-relative addressing for constants:

LDR R0, =0x12345678  ; Assembler generates PC-relative load
; Expands to:
LDR R0, [PC, #offset] ; Loads from literal pool near code

Power impact: PC-relative addressing eliminates need for absolute addressing, reducing instruction size and flash reads.

Security consideration: PC corruption causes unpredictable jumps. In industrial environments (vibration, electrical noise), watchdog timers are critical to recover from PC corruption.

Program Status Register (PSR)

Actually three registers combined into one 32-bit register:

APSR - Application Program Status Register (bits 31-28 on Cortex-M0+)
IPSR - Interrupt Program Status Register (bits 5-0 on Cortex-M0+)
EPSR - Execution Program Status Register (bit 24 on Cortex-M0+)

Combined they form xPSR - the full Program Status Register.

APSR - Application Program Status Register

Contains condition flags set by arithmetic and logic operations:

N - Negative flag (bit 31)

Set if result is negative (bit 31 of result = 1)
Signed arithmetic interpretation

Z - Zero flag (bit 30)

Set if result is zero
Used by conditional branches

C - Carry flag (bit 29)

Set if unsigned overflow on addition
Set if no borrow on subtraction
Set to last bit shifted out in shift operations

V - Overflow flag (bit 28)

Set if signed overflow occurred
Two positives produced negative, or two negatives produced positive

Q - Sticky saturation flag (bit 27)

Used by DSP instructions (not present in M0+)
Ignored on RP2040

Example - Addition:

LDR R0, =0xFFFFFFFF     ; R0 = -1 (or 4294967295 unsigned); MOVS only accepts an 8-bit immediate
ADDS R0, R0, #1         ; R0 = 0
; Flags: Z=1 (result zero), C=1 (carry out), V=0 (no signed overflow)

Example - Subtraction:

MOVS R0, #5
SUBS R0, R0, #10        ; R0 = -5 (0xFFFFFFFB)
; Flags: N=1 (negative), Z=0, C=0 (borrow occurred), V=0
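
The flag results in the two examples above can be checked with a small software model. This is a sketch for reasoning about N, Z, C, and V on a host machine, not code for the target; the type and function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { bool n, z, c, v; } apsr_flags;   /* hypothetical model type */

/* Model of ADDS: result = a + b. */
static apsr_flags flags_after_adds(uint32_t a, uint32_t b, uint32_t *result)
{
    uint32_t r = a + b;
    apsr_flags f;
    f.n = (r >> 31) != 0;
    f.z = (r == 0);
    f.c = (r < a);                               /* unsigned carry out */
    f.v = (((a ^ r) & (b ^ r)) >> 31) != 0;      /* operands share a sign, result differs */
    *result = r;
    return f;
}

/* Model of SUBS and CMP: result = a - b. C is set when NO borrow occurs. */
static apsr_flags flags_after_subs(uint32_t a, uint32_t b, uint32_t *result)
{
    uint32_t r = a - b;
    apsr_flags f;
    f.n = (r >> 31) != 0;
    f.z = (r == 0);
    f.c = (a >= b);
    f.v = (((a ^ b) & (a ^ r)) >> 31) != 0;      /* operands differ in sign, result flips */
    *result = r;
    return f;
}

/* Usage sketch: the addition example from the text. */
static void flags_example(void)
{
    uint32_t r;
    apsr_flags f = flags_after_adds(0xFFFFFFFFu, 1u, &r);
    assert(r == 0 && f.z && f.c && !f.v && !f.n);
}
```

The inverted sense of C on subtraction is the detail that most often trips people up: BHI and BLS test C as "no borrow", which is why `CMP` followed by an unsigned branch works without any extra instructions.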

Conditional execution based on flags:

CMP R0, R1              ; Compare (subtract without storing result)
BEQ equal_label         ; Branch if equal (Z=1)
BNE not_equal_label     ; Branch if not equal (Z=0)
BGT greater_label       ; Branch if greater (signed, uses N,Z,V)
BHI higher_label        ; Branch if higher (unsigned, uses C,Z)

Power-critical pattern:

An explicit compare is unnecessary when the previous instruction already set the flags:

; Slower (3 instructions):
SUBS R0, R0, #1         ; Decrement counter
CMP R0, #0              ; Redundant: SUBS already updated Z
BEQ zero_label

; Faster (2 instructions):
SUBS R0, R0, #1         ; Decrement and set flags
BEQ zero_label

; Most 16-bit data-processing instructions on the M0+ set flags ('S' suffix):
MOVS R0, R1             ; Move and set N and Z

Fewer instructions = less power. Critical in loops executing millions of times.

IPSR - Interrupt Program Status Register

Bits 5-0 - Exception number (a 6-bit field on Cortex-M0+)

Contains the current exception/interrupt number:

0 = Thread mode (not in handler)
1 = Reset
2 = NMI (Non-Maskable Interrupt)
3 = HardFault
4-10 = Reserved
11 = SVCall
14 = PendSV
15 = SysTick
16+ = External interrupts (IRQ0, IRQ1, ...)
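
The numbering above maps directly to IRQ numbers: external interrupt N is exception 16 + N. A minimal sketch of that mapping follows; the helper names are illustrative, and the mask simply covers the exception-number field (the upper bits of that range read as zero on the M0+).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IPSR_EXCEPTION_MASK 0x1FFu   /* exception-number field of IPSR */
#define FIRST_EXTERNAL_EXCEPTION 16u /* exception number of IRQ0 */

/* True when the value read from IPSR indicates thread mode (no active exception). */
static bool ipsr_is_thread_mode(uint32_t ipsr)
{
    return (ipsr & IPSR_EXCEPTION_MASK) == 0;
}

/* Returns the IRQ number for an external interrupt, or -1 for thread mode
 * and for system exceptions (Reset, NMI, HardFault, SVCall, PendSV, SysTick). */
static int ipsr_to_irq(uint32_t ipsr)
{
    uint32_t n = ipsr & IPSR_EXCEPTION_MASK;
    return (n >= FIRST_EXTERNAL_EXCEPTION) ? (int)(n - FIRST_EXTERNAL_EXCEPTION) : -1;
}

/* Usage sketch: exception 15 is SysTick, so it has no IRQ number. */
static void ipsr_example(void)
{
    assert(ipsr_to_irq(15u) == -1);
    assert(ipsr_to_irq(16u) == 0);
}
```

On the target, the argument would come from `MRS R0, IPSR` (or the CMSIS `__get_IPSR()` intrinsic if you are using CMSIS headers).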

Reading IPSR:

MRS R0, IPSR            ; Move special register to general register
; R0 now contains exception number

Use case: Determine if code is running in interrupt context:

MRS R0, IPSR
CMP R0, #0
BNE in_interrupt        ; Non-zero = in interrupt handler
; ... thread mode code ...

Power consideration: This check is useful for deciding whether to sleep. Sleep is normally entered from thread mode; a WFI executed inside a handler only wakes for interrupts of higher priority than the one being serviced.

EPSR - Execution Program Status Register

Bit 24 - T (Thumb state bit)

Always 1 on Cortex-M0+ (Thumb-only processor)
Clearing it (for example by branching to an address with bit 0 clear) causes a HardFault

Bits 26-25, 15-10 - ICI/IT (interruptible-instruction and If-Then state bits)

Not implemented on Cortex-M0+ (no IT blocks; an interrupted multi-cycle load or store is restarted rather than resumed)
Present on M3/M4, where they are hardware managed and transparent to software

EPSR is mostly hardware-managed and rarely accessed directly.

PRIMASK - Interrupt Mask Register

Single-bit register controlling interrupt enable/disable.

Bit 0:

0 = Interrupts enabled
1 = All interrupts disabled except NMI and HardFault

Critical operations:

CPSID I                 ; Set PRIMASK, disable interrupts
; ... critical section ...
CPSIE I                 ; Clear PRIMASK, enable interrupts

; Or via MRS/MSR:
MRS R0, PRIMASK         ; Read current state
; ... modify R0 ...
MSR PRIMASK, R0         ; Write new state

Use cases:

1. Protecting critical sections:

CPSID I                 ; Disable interrupts
LDR R0, [R1]           ; Read-modify-write sequence
ADDS R0, R0, #1        ; that must be atomic (immediate add is ADDS on M0+)
STR R0, [R1]
CPSIE I                 ; Re-enable interrupts

2. Ultra-low-power sleep:

CPSID I                 ; Disable interrupts
; ... configure wake sources ...
WFI                     ; Wait For Interrupt (a pending interrupt still wakes the core)
CPSIE I                 ; Re-enable on wake; the pending handler runs now

Power trade-off: Disabling interrupts can increase latency, causing interrupts to be serviced late, potentially increasing power if peripherals are waiting. Use minimally.

For industrial hardware: Critical sections protecting shared sensor data between main loop and interrupt handlers need PRIMASK protection, but keep duration minimal (< 10 microseconds).

CONTROL - Control Register

Configures processor operating mode.

Bit 1 - SPSEL (Stack Pointer Select)

0 = Use MSP (Main Stack Pointer)
1 = Use PSP (Process Stack Pointer)
Only writable in privileged thread mode

Bit 0 - nPRIV (Not Privileged)

0 = Privileged mode (can access all resources)
1 = Unprivileged mode (restricted access)

Typical bare-metal configuration:

CONTROL = 0 (privileged, using MSP)
Simple applications never change this

RTOS configuration:

Kernel runs privileged with MSP
Tasks run unprivileged with PSP (each task gets own stack)
Provides memory protection and task isolation

; Switch to PSP (typical RTOS task switch)
MRS R0, CONTROL
MOVS R1, #2             ; M0+ has no ORR-immediate, so build the mask in a register
ORRS R0, R1             ; Set bit 1 (SPSEL)
MSR CONTROL, R0
ISB                     ; Instruction Synchronization Barrier

ISB required after CONTROL writes to ensure pipeline consistency.

Power consideration: Separate stack pointers allow task switching without copying stacks, reducing memory operations. RTOS overhead is minimal on M0+ if properly configured.

Special Register Access

General registers (R0-R12) use normal instructions. Special registers require MRS/MSR instructions:

MRS R0, APSR            ; Read APSR into R0
MRS R1, IPSR            ; Read IPSR into R1
MRS R2, PRIMASK         ; Read PRIMASK into R2
MRS R3, CONTROL         ; Read CONTROL into R3

; Modify and write back
MSR PRIMASK, R0         ; Write R0 to PRIMASK
MSR CONTROL, R1         ; Write R1 to CONTROL

MRS and MSR only transfer between a special register and a general register. There is no form that takes two general registers, so MSR R0, R1 is invalid.

Memory-Mapped Registers (Peripherals and Core)

Beyond CPU registers, RP2040 has memory-mapped registers for peripherals and core configuration. These appear as memory addresses.

System Control Block (SCB)

Located at 0xE000ED00 - controls core configuration.

Key SCB registers:

CPUID (0xE000ED00) - CPU identification

Read-only
Contains implementer, variant, architecture info
Value: 0x410CC601 (ARM Cortex-M0+, revision 1)

ICSR (0xE000ED04) - Interrupt Control and State

Bit 31 - NMI pending
Bit 28 - PendSV pending
Bit 26 - SysTick pending
Bits 8-0 - Active exception number

VTOR (0xE000ED08) - Vector Table Offset Register

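Optional on Cortex-M0+, and implemented on RP2040
Bits 31-8 hold the table offset, so the vector table must be 256-byte aligned
Resets to 0x00000000 (boot ROM); firmware normally repoints it to its own table in flash (XIP, from 0x10000000) or SRAM (from 0x20000000)
To use RAM vectors, copy the table to an aligned RAM address and write that address to VTOR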

AIRCR (0xE000ED0C) - Application Interrupt and Reset Control

Bit 2 - SYSRESETREQ: Write 1 to request system reset
Bits 31-16 - VECTKEY: Must write 0x05FA to modify register

System reset from software:

#define AIRCR (*((volatile uint32_t*)0xE000ED0C))
AIRCR = 0x05FA0004;     // Request system reset
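
The constant 0x05FA0004 is simply the VECTKEY in the upper half-word combined with the SYSRESETREQ bit. A small sketch that builds it from the fields described above (the macro and function names are illustrative, not taken from a vendor header):

```c
#include <assert.h>
#include <stdint.h>

#define AIRCR_VECTKEY       0x05FAu      /* must be written to bits 31-16 */
#define AIRCR_VECTKEY_SHIFT 16
#define AIRCR_SYSRESETREQ   (1u << 2)    /* bit 2: request system reset */

/* Compose the value to write to AIRCR for the given request bits. */
static uint32_t aircr_write_value(uint32_t request_bits)
{
    return ((uint32_t)AIRCR_VECTKEY << AIRCR_VECTKEY_SHIFT) | (request_bits & 0xFFFFu);
}

/* Usage sketch: reproduces the literal used in the snippet above. */
static void aircr_example(void)
{
    assert(aircr_write_value(AIRCR_SYSRESETREQ) == 0x05FA0004u);
}
```

Writes without the key in the upper half-word are ignored by the hardware, which is what protects against an accidental reset from a stray store.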

SCR (0xE000ED10) - System Control Register

Bit 2 - SLEEPDEEP: Use deep sleep vs light sleep
Bit 1 - SLEEPONEXIT: Sleep when returning from ISR

Power management:

LDR R1, =0xE000ED10     ; SCR address
LDR R0, [R1]
MOVS R2, #0x4           ; M0+ has no ORR-immediate, so build the mask in a register
ORRS R0, R2             ; Set SLEEPDEEP
STR R0, [R1]
WFI                     ; Enter deep sleep

SHPR2, SHPR3 - System Handler Priority Registers

Configure priority for SVCall, PendSV, SysTick
2-bit priority on Cortex-M0+ (4 levels: 0-3)

SysTick Timer

Located at 0xE000E010 - 24-bit countdown timer integrated in core.

SYST_CSR (0xE000E010) - Control and Status

Bit 16 - COUNTFLAG: Read=1 if counted to 0 since last read
Bit 2 - CLKSOURCE: 0=external, 1=processor clock
Bit 1 - TICKINT: Enable interrupt on count to 0
Bit 0 - ENABLE: Enable counter

SYST_RVR (0xE000E014) - Reload Value

Bits 23-0 - Reload value (max 16,777,215)
Counter reloads from this when reaching 0

SYST_CVR (0xE000E018) - Current Value

Bits 23-0 - Current counter value
Write any value to clear counter to 0 and clear COUNTFLAG

Example - 1ms tick at 125MHz:

#define SYST_CSR (*((volatile uint32_t*)0xE000E010))
#define SYST_RVR (*((volatile uint32_t*)0xE000E014))
#define SYST_CVR (*((volatile uint32_t*)0xE000E018))

SYST_RVR = 125000 - 1;  // 125MHz / 125000 = 1kHz = 1ms
SYST_CVR = 0;           // Clear current value
SYST_CSR = 0x7;         // Enable, interrupt, processor clock

Power usage: SysTick at 1kHz wakes processor 1000 times/second. For ultra-low-power industrial devices, consider disabling SysTick and using RTC or GPIO interrupts for wake.

NVIC - Nested Vectored Interrupt Controller

Controls external interrupts (IRQs).

ISER (0xE000E100) - Interrupt Set-Enable Register

Bits 31-0 - Enable IRQ0-31
Write 1 to bit N to enable IRQN
Writing 0 has no effect (use ICER to disable)

ICER (0xE000E180) - Interrupt Clear-Enable Register

Write 1 to bit N to disable IRQN

ISPR (0xE000E200) - Interrupt Set-Pending Register

Write 1 to bit N to manually trigger IRQN

ICPR (0xE000E280) - Interrupt Clear-Pending Register

Write 1 to bit N to clear pending IRQN

IPR0-IPR7 (0xE000E400-0xE000E41C) - Interrupt Priority Registers

8 bits per interrupt, but only top 2 bits used (4 priority levels)
Priorities: 0x00, 0x40, 0x80, 0xC0
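
Each IPR register packs four interrupts, one byte each, so setting a priority means locating the right register and the right byte. The sketch below shows that arithmetic; the function names are illustrative. It returns the new register value rather than writing hardware, partly because (as far as I know) the M0+ only supports word-sized access to these registers, so a read-modify-write of the whole word is needed anyway.

```c
#include <assert.h>
#include <stdint.h>

#define NVIC_IPR_BASE 0xE000E400u

/* Address of the IPR register that holds the priority byte for this IRQ. */
static uint32_t nvic_ipr_address(unsigned irq)
{
    assert(irq < 32u);
    return NVIC_IPR_BASE + 4u * (irq / 4u);
}

/* Returns old_value with the priority of irq replaced by level (0 = highest, 3 = lowest).
 * Only the top two bits of each priority byte are implemented. */
static uint32_t nvic_ipr_with_priority(uint32_t old_value, unsigned irq, unsigned level)
{
    assert(irq < 32u && level < 4u);
    unsigned shift = 8u * (irq % 4u) + 6u;
    return (old_value & ~(0x3u << shift)) | ((uint32_t)level << shift);
}
```

For IRQ 20 this selects the register at 0xE000E414 and the lowest byte, so priority level 2 appears as 0x80 in that byte, matching the 0x00/0x40/0x80/0xC0 list above.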

Example - Enable UART0 interrupt:

#define NVIC_ISER (*((volatile uint32_t*)0xE000E100))
#define UART0_IRQ 20

NVIC_ISER = (1 << UART0_IRQ);  // Enable UART0 interrupt

Power-critical interrupt configuration:

Configure interrupts as wake sources before sleep:

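#define NVIC_ICER (*((volatile uint32_t*)0xE000E180))
#define GPIO_IRQ 13     // IO_IRQ_BANK0 on RP2040
#define RTC_IRQ  25

// Disable everything except the wake interrupts
NVIC_ICER = ~((1u << GPIO_IRQ) | (1u << RTC_IRQ));
// Enable only the wake interrupts
NVIC_ISER = (1u << GPIO_IRQ) | (1u << RTC_IRQ);

// Enter sleep
__WFI();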

Fewer enabled interrupts = lower power (fewer wake events).

RP2040-Specific Registers

Beyond ARM standard registers, RP2040 adds custom peripherals.

SIO - Single-Cycle IO

Direct GPIO access at 0xD0000000:

GPIO_OUT (0xD0000010) - Direct GPIO output
GPIO_OUT_SET (0xD0000014) - Set bits (write 1 to set)
GPIO_OUT_CLR (0xD0000018) - Clear bits (write 1 to clear)
GPIO_OUT_XOR (0xD000001C) - Toggle bits (write 1 to toggle)

Single-cycle GPIO toggle:

LDR R0, =0xD000001C     ; GPIO_OUT_XOR
MOVS R1, #1             ; MOVS only accepts an 8-bit immediate
LSLS R1, R1, #25        ; Bit 25 (onboard LED)
STR R1, [R0]            ; Toggle - single cycle!

Power advantage: Direct SIO access avoids read-modify-write, saving 2 cycles and a memory read.

CPUID (0xD0000000) - Core ID

Returns 0 or 1 (which core is executing)
Used for multicore synchronization

FIFO - Inter-core Communication

FIFO_ST (0xD0000050) - FIFO Status
FIFO_WR (0xD0000054) - Write to other core's FIFO
FIFO_RD (0xD0000058) - Read from this core's FIFO

Power-efficient multicore: Core 1 can sleep until data arrives. A write from Core 0 into the FIFO raises Core 1's SIO FIFO interrupt and wakes it.

Atomic Register Operations

RP2040 peripherals on the APB and AHB-Lite buses support atomic set/clear/XOR via address aliasing (SIO does not; it provides its own SET/CLR/XOR registers instead):

For any register at address BASE:

BASE + 0x0000 - Normal access
BASE + 0x1000 - Atomic bitmask XOR
BASE + 0x2000 - Atomic bitmask SET
BASE + 0x3000 - Atomic bitmask CLEAR

Example - Atomic GPIO bit set:

#define GPIO25_CTRL 0x400140CC
#define GPIO25_CTRL_SET (GPIO25_CTRL + 0x2000)

*(volatile uint32_t*)GPIO25_CTRL_SET = (1u << 8);  // Atomic set bit 8 (OUTOVER[0]); bit 5 is reserved in this register

No read-modify-write needed, no interrupt disable needed, single bus cycle.
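
The alias scheme is easy to get wrong by one hex digit, so a small model can help. The sketch below computes alias addresses from the offsets listed above and models what a write through each alias does to the register contents. It runs on a host and touches no hardware; all names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Alias offsets as listed above. */
#define ALIAS_XOR 0x1000u
#define ALIAS_SET 0x2000u
#define ALIAS_CLR 0x3000u

/* Address to write for an atomic operation on the register at reg_addr. */
static uint32_t alias_address(uint32_t reg_addr, uint32_t alias)
{
    return reg_addr + alias;
}

/* Host-side model: the register value after writing mask through the given alias. */
static uint32_t model_alias_write(uint32_t current, uint32_t alias, uint32_t mask)
{
    switch (alias) {
    case ALIAS_XOR: return current ^ mask;
    case ALIAS_SET: return current | mask;
    case ALIAS_CLR: return current & ~mask;
    default:        return mask;   /* normal access replaces the whole register */
    }
}

/* Usage sketch: the SET alias of the GPIO25_CTRL address used above. */
static void alias_example(void)
{
    assert(alias_address(0x400140CCu, ALIAS_SET) == 0x400160CCu);
}
```

Bits that are zero in the mask are left untouched by all three aliases, which is what makes them safe to use from both thread code and interrupt handlers without masking interrupts.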

Critical for power: Atomic operations eliminate interrupt masking in shared resource access, reducing interrupt latency and allowing more time in sleep.

Register Usage Conventions (AAPCS)

ARM Architecture Procedure Call Standard defines register usage:

R0-R3 - Argument passing, scratch registers (caller-saved)

R0 = first arg, return value
R1 = second arg
R2 = third arg
R3 = fourth arg
Additional args on stack

R4-R11 - Local variables (callee-saved)

Must be preserved across function calls
PUSH on entry, POP on exit

R12 (IP) - Intra-procedure scratch (caller-saved)

Linker may use for veneers

R13 (SP) - Stack pointer (special)

R14 (LR) - Link register (special)

R15 (PC) - Program counter (special)

Violating conventions causes hard-to-debug issues when linking C and assembly.

Power-Optimized Register Usage Patterns

Minimize Stack Operations

Bad (high power):

function:
    PUSH {R4-R7}        ; 4 memory writes
    ; ... use R4-R7 ...
    POP {R4-R7}         ; 4 memory reads
    BX LR

Better (if possible):

function:
    ; Use only R0-R3 (no PUSH/POP needed)
    ; ... computation in R0-R3 ...
    BX LR

Each stack operation is a RAM access (~1-2 cycles, ~pJ energy). Eliminating 100 stack ops per millisecond saves measurable power.

Register Allocation in Loops

Bad:

loop:
    LDR R0, =peripheral_addr  ; Reload every iteration
    LDR R1, [R0]
    ; ... process ...
    B loop

Good:

    LDR R4, =peripheral_addr  ; Load once
loop:
    LDR R1, [R4]              ; Use cached address
    ; ... process ...
    B loop

Best (when the register sits at a fixed offset from a base address):

    LDR R4, =base_addr
loop:
    LDR R1, [R4, #0x10]       ; Immediate offset, no extra register needed
    ; ... process ...
    B loop

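Thumb Instruction Selection

Prefer low registers (R0-R7) over high registers (R8-R12) when possible. The Cortex-M0+ has no 32-bit Thumb-2 data-processing instructions, so arithmetic on high registers takes extra 16-bit instructions.

Low registers (one 2-byte instruction):

ADDS R0, R1, R2             ; 16-bit encoding

High registers (two 2-byte instructions):

MOV R8, R9                  ; No three-operand ADD exists for high registers
ADD R8, R10                 ; R8 = R9 + R10

For code executing in tight loops, fewer instructions reduce flash bandwidth and power.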

Practical Example: Ultra-Low-Power Sensor Reading

Combining efficient register usage for minimal power:

@ Read sensor on GPIO15, accumulate 1000 samples, return the count in R0
@ Uses SIO for single-cycle GPIO reads and no stack access inside the loop
@ GNU assembler syntax: '@' starts a comment

.syntax unified
.cpu cortex-m0plus
.thumb
.global sensor_task
.thumb_func

sensor_task:
    @ R4 = sample counter
    @ R5 = accumulator
    @ R6 = SIO GPIO_IN address
    @ R4-R6 are callee-saved, so they are pushed once on entry

    PUSH {R4-R6, LR}

    LDR R6, =0xD0000004         @ SIO GPIO_IN address
    MOVS R5, #0                 @ Clear accumulator
    LDR R4, =1000               @ Sample count

sample_loop:
    LDR R0, [R6]                @ Read all GPIOs (single-cycle SIO access)
    LSLS R0, R0, #16            @ Move GPIO15 up to bit 31, discarding higher bits
    LSRS R0, R0, #31            @ Move it down to bit 0 (M0+ has no AND-immediate)
    ADDS R5, R5, R0             @ Accumulate

    @ Delay ~1ms (125k cycles at 125MHz, 3 cycles per iteration)
    LDR R1, =41666              @ Delay iterations
delay_loop:
    SUBS R1, #1
    BNE delay_loop

    SUBS R4, #1
    BNE sample_loop

    @ R5 now contains count of "1" samples (0-1000)
    MOVS R0, R5                 @ Result to R0 (return value)
    POP {R4-R6, PC}             @ Restore and return

Power analysis:

No stack operations inside loop (PUSH/POP outside)
Single-cycle GPIO read via SIO
Minimal register usage (only R4-R6 preserved)
Could be improved by sleeping during delay instead of busy-wait
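
The delay constant in the listing above (41666) follows from the loop cost: on the Cortex-M0+ a SUBS takes 1 cycle and a taken BNE takes 2, so each iteration costs 3 cycles. A short sketch of that arithmetic, with a hypothetical helper name and the clock rate passed in as a parameter:

```c
#include <assert.h>
#include <stdint.h>

/* Number of loop iterations needed to burn delay_us microseconds at clk_hz,
 * given the cycle cost of one iteration. The result is rounded down. */
static uint32_t delay_iterations(uint32_t clk_hz, uint32_t delay_us, uint32_t cycles_per_iter)
{
    assert(cycles_per_iter > 0u);
    uint64_t cycles = (uint64_t)clk_hz * delay_us / 1000000u;
    return (uint32_t)(cycles / cycles_per_iter);
}

/* Usage sketch: 1 ms at 125 MHz with a 3-cycle SUBS/BNE loop. */
static void delay_example(void)
{
    assert(delay_iterations(125000000u, 1000u, 3u) == 41666u);
}
```

If the system clock is changed to save power, the constant has to be recomputed, which is one more argument for replacing the busy-wait with a timer-driven sleep.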

This is the level of optimization needed for industrial battery-powered devices measuring sensors continuously.

Understanding these registers deeply allows you to write assembly that's not just correct, but optimal for your power-constrained industrial hardware. Every register choice, every instruction selection, compounds across billions of cycles into measurable battery life differences.
