chore: upgrade CFLAGS from -O2 to -O3 -ffast-math #23

Open
siddiquifaras wants to merge 1 commit into RightNow-AI:main from siddiquifaras:chore/compiler-flags

Conversation

@siddiquifaras siddiquifaras commented Mar 13, 2026

What does this PR do?

Upgrades Makefile CFLAGS from -O2 to -O3 -ffast-math for all Makefile-based builds (Linux, macOS, Pi, RISC-V). The Windows MSVC build is unaffected since it uses hardcoded flags in build.bat / CI.
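For reference, the change amounts to something like the following (a hedged sketch; the real Makefile's variable layout and remaining flags may differ):

```makefile
# Hypothetical excerpt -- picoLM's actual Makefile may structure this differently.
# Before:
#   CFLAGS = -O2 -Wall
# After:
CFLAGS = -O3 -ffast-math -Wall
```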

Type of change

  • Performance improvement

Why these flags are safe

-O3 enables more aggressive inlining and loop optimization with a modest binary size impact.

-ffast-math allows float reordering and assumes no NaN/Inf in float operations. Safe here because:

  • The software FP16 conversion (fp16_to_fp32 / fp32_to_fp16) uses integer bit manipulation, completely unaffected by -ffast-math
  • The online softmax exponents are always <= 0 by construction, so no overflow risk
  • Model weights are 4-bit quantized, so ULP-level rounding differences are irrelevant

What I deliberately left out

  • -funroll-loops: would bloat the binary from ~80KB toward 200-400KB, breaking the project's advertised binary size
  • -flto: requires holding full program IR at link time, could OOM on Pi Zero / LicheeRV Nano (512MB / 256MB RAM) during on-device compilation

Testing

  • Tested on ARM64 (Apple M4 Pro, macOS)
  • Tested with TinyLlama 1.1B Q4_K_M

Test command:

make clean && make native
./picolm model.gguf -p "The capital of France is" -n 20 -t 0

Output:

Paris.

2. B.C. The capital of ancient Rome was Rome.

3

Output is character-identical to the baseline -O2 build at all context lengths tested (-n 20, -n 100, -n 256).

Results

Metric               Before (-O2)    After (-O3 -ffast-math)
Binary size          87,784 bytes    87,736 bytes (-48 bytes)
Generation (-n 20)   23.9 tok/s      26.6 tok/s (+11%)
Generation (-n 100)  20.9 tok/s      22.2 tok/s (+6%)

Checklist

  • Code compiles without warnings (make native)
  • No new dependencies added
  • Memory usage not increased (45.17 MB, unchanged)
  • Works with --json mode
  • Works with --cache round-trip

Closes #16
