From day one
Chronological log. What got built, what got measured, what's still open. Where something is broken, in flight, or out of scope — that's said directly.
A C99 inference engine, looking for a model
The engine came first — a tiny program that runs ternary AI on bare-metal microcontrollers. No OS, no internet, no allocator. Pure C99, zero heap, zero floats in the matmul, zero external dependencies. Bit-exact Python ↔ C parity at FP32 epsilon. Static atome_block_t fixed at three pathways — that structural constraint drives every architecture decision on the model side.
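As a rough illustration of what "zero floats in the matmul" can mean for ternary weights, the Python sketch below computes a matrix-vector product using only integer adds and subtracts. The function name and the list-of-lists layout are hypothetical; the engine's actual C kernel and weight layout are not shown in this log.

```python
def ternary_matvec(weights, x):
    """Multiply a ternary matrix by an integer vector.

    weights: list of rows, each entry is -1, 0, or +1 (hypothetical layout).
    x: list of integer activations.
    No multiplications and no floats: each weight either adds the
    activation, subtracts it, or skips it.
    """
    out = []
    for row in weights:
        acc = 0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi
            elif w == -1:
                acc -= xi
            # w == 0 contributes nothing
        out.append(acc)
    return out
```

Because every weight is -1, 0, or +1, the inner loop needs no multiplier at all, which matters on cores without fast multiply or floating-point hardware.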
Atome lm is born — three pathways, by force
The original 4-pathway sketch was retired in favour of strict alignment with atome_block_t: local depthwise causal conv + diagonal SSM + top-k sparse attention. First commit: 42 tests, byte tokenizer, ATOME01 exporter. ~60K params packing to ~15.1 KB on disk.
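The size figure is consistent with storing each ternary weight in 2 bits: 60,000 × 2 / 8 = 15,000 bytes, with the small remainder left for everything else in the file. The sketch below packs four weights per byte under that assumption. The 2-bit codes and function names are illustrative only; the real ATOME01 layout is not described in this log.

```python
# Hypothetical 2-bit codes for the three weight values.
CODES = {0: 0b00, 1: 0b01, -1: 0b10}
DECODE = {code: value for value, code in CODES.items()}


def pack_ternary(weights):
    """Pack a list of -1/0/+1 weights, four per byte, lowest bits first."""
    out = bytearray((len(weights) + 3) // 4)
    for i, w in enumerate(weights):
        out[i // 4] |= CODES[w] << (2 * (i % 4))
    return bytes(out)


def unpack_ternary(data, count):
    """Recover the first `count` weights from packed bytes."""
    return [DECODE[(data[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(count)]
```

Packing 60,000 weights this way yields exactly 15,000 bytes, which is the same order as the on-disk size reported above.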
Cortex-M3 firmware boots in QEMU — parity holds
Cross-compile sweep across Cortex-M0 / M3 / M4 / M4F / M7 at -Os. Engine code: 2.6–2.8 KB on all five. Full QEMU MPS2-AN385 firmware boots and runs a forward pass under semihosting. End-to-end Python ↔ Cortex-M3 parity: max |Δ| = 3.7×10⁻⁷.
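The max |Δ| figure is the largest element-wise absolute difference between two output vectors. A minimal sketch of such a check follows; the function names and the 1e-6 tolerance are assumptions for illustration, not the project's actual test harness.

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two equal-length vectors."""
    if len(reference) != len(candidate):
        raise ValueError("vectors differ in length")
    return max(abs(r - c) for r, c in zip(reference, candidate))


def parity_holds(reference, candidate, tol=1e-6):
    """True when every element agrees within `tol` (an assumed tolerance)."""
    return max_abs_diff(reference, candidate) <= tol
```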
Sampling, REPL, first trained checkpoint
Added temperature / top_p / top_k / seeded generator. Default temperature=0 preserves bit-exact parity with the C engine's argmax. REPL with per-layer router-entropy bars, CPU benchmark, held-out bpb evaluator. First checkpoint: 800 steps on TinyStories, bpb 3.48, ppl 11.16.
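A minimal sketch of how these sampling knobs usually interact is shown below. It is a generic implementation, not the project's sampler: the function name, argument defaults, and the first-index tie-break for argmax are all assumptions. The key property is that temperature=0 bypasses randomness entirely and returns the argmax.

```python
import math
import random


def sample_token(logits, temperature=0.0, top_k=0, top_p=1.0, rng=None):
    """Pick a token index from raw logits (hypothetical helper)."""
    if temperature == 0.0:
        # Greedy path: deterministic, first index wins on ties (assumed).
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random(0)
    # Candidates sorted from most to least likely.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    # Temperature-scaled softmax over the surviving candidates.
    peak = logits[order[0]]
    weights = [math.exp((logits[i] - peak) / temperature) for i in order]
    total = sum(weights)
    # top_p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i, w in zip(order, weights):
        p = w / total
        kept.append((i, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Draw from the kept candidates, renormalised.
    r = rng.random() * sum(p for _, p in kept)
    for i, p in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][0]
```

Passing a seeded `random.Random` instance makes the stochastic path reproducible, while the default temperature keeps the output comparable to an argmax-only engine.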
Frontier finding — A/B against vanilla GPT at 60K
Vanilla decoder-only transformer at 60.8K params (param-fair) and 6K params (flash-fair). Three-seed median at 3,000 steps: Atome 6.31 ppl vs vanilla-60K 8.12 ppl vs vanilla-6K 13.10 ppl. +22% param-fair · +52% flash-fair.
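The percentages appear to be relative perplexity reductions measured against each baseline. That formula is an assumption (the log does not state it), but it reproduces both figures from the medians quoted above:

```python
def relative_gain(baseline_ppl, model_ppl):
    """Fraction by which the model's perplexity is lower than the baseline's."""
    return (baseline_ppl - model_ppl) / baseline_ppl


atome = 6.31
param_fair = relative_gain(8.12, atome)    # vs vanilla at 60.8K params
flash_fair = relative_gain(13.10, atome)   # vs vanilla at 6K params
print(f"param-fair: {param_fair:.1%}, flash-fair: {flash_fair:.1%}")
```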
Per-pathway ablation — the conv carries the win
Dropping local-conv costs +20% ppl (largest hit). Dropping SSM costs +6%. Dropping sparse-attn costs +4% (smallest). At 60K params on TinyStories, the conv pathway is doing most of the work.
SSM-state bug fixed — 48/48 bit-exact
atome_predict_next reprocessed the full token list each call but never reset state->ssm_h, so SSM state from the previous call leaked into the next one. Fix is four lines (memset at the top of atome_predict_next). Multi-token QEMU parity jumped 23/48 → 48/48 bit-exact.
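The failure mode is easy to reproduce with a toy recurrence. The Python sketch below is not the engine's code: the class, the integer update rule, and the `reset` flag are hypothetical stand-ins, with the `reset` branch playing the role of the memset.

```python
class ToySSM:
    """Toy stateful model that re-reads the whole token list on every call."""

    def __init__(self):
        self.h = 0  # stand-in for state->ssm_h

    def predict_next(self, tokens, reset=True):
        if reset:
            self.h = 0  # the fix: clear recurrent state before reprocessing
        for t in tokens:
            # Arbitrary integer recurrence: decay by half, then add the input.
            self.h = (self.h >> 1) + t
        return self.h


fixed = ToySSM()
print(fixed.predict_next([3, 5]), fixed.predict_next([3, 5]))    # same value twice

buggy = ToySSM()
print(buggy.predict_next([3, 5], reset=False),
      buggy.predict_next([3, 5], reset=False))                   # values diverge
```

Without the reset, the second call starts from the first call's final state, so identical inputs give different outputs. That is the kind of drift that shows up only in multi-token generation.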
944K-param model trained — coherent TinyStories prose
Trained 944,640 params (d=256, 8 layers) on the full TinyStories corpus. 30,000 steps, effective batch 256, BF16, cosine LR. Best val loss 1.0545, ppl 2.87. ~3 h 20 min wall-clock, ~$2 cloud. 16/16 QEMU ↔ Python bit-exact on the new checkpoint.
944K vanilla A/B — the headline narrows at scale
Same recipe, same val slice, vanilla GPT FP32 at 950K params: val loss 0.9337, ppl 2.54 — beats Atome ternary 944K by 11.4% loss / 11.5% ppl. Implication: the architecture's bet is the sub-1M regime; above ~1M the inductive bias becomes a constraint. Multi-seed run pending.
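The loss and perplexity figures are linked by ppl = exp(loss), assuming the loss is a mean cross-entropy in nats. The same identity in base 2 links bits-per-byte to the per-byte perplexity of the earlier 800-step checkpoint. A quick check, with hypothetical helper names:

```python
import math


def ppl_from_loss(loss_nats):
    """Perplexity from a mean cross-entropy loss measured in nats."""
    return math.exp(loss_nats)


def ppl_from_bpb(bits_per_byte):
    """Per-byte perplexity from bits-per-byte."""
    return 2.0 ** bits_per_byte


print(round(ppl_from_loss(1.0545), 2))  # Atome ternary 944K
print(round(ppl_from_loss(0.9337), 2))  # vanilla GPT 950K
print(round(ppl_from_bpb(3.48), 2))     # 800-step checkpoint
```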
Pre-launch verification · 944K QEMU re-verified
Re-trained and re-evaluated all three internal task classifiers on their synthetic distributions: command-intent 100% · bad-reading 91.7% · intent-bucket 100%, with C-engine argmax matching Python. Each binary fits in 20.2 KB, total state RAM 52 KB. Re-ran the 944K model through the Cortex-M3 emulator: 4/4 bit-exact. 146 pytest tests green.
Atome LM v2 (SuperESP) — 12 on-device AI apps, on real silicon
The ternary engine grew an applied layer: 12 on-device AI apps (11 applied heads covering agriculture, voice, motion, anomaly, air-quality, energy/NILM, occupancy, wearable, water-leak, predictive failure, and sound, plus an on-device OS dispatcher), a universal installer for any ESP32 (S2/S3/C3/C6/H2), and signed, audited models. Verified on a physical ESP32-WROOM-32: all 12 ran on-chip, bit-exact to the host.
What's next
More silicon. ESP32-WROOM-32 shipped (v2); next: Nucleo-F411RE / RP2040 with published tokens/sec + Joules/token.
Multi-seed at 944K. Three more training runs to pin down the perplexity range.
Q15 inference path. Expected to halve BSS and speed up M0/M3 inference by 5–10×.
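Q15 is a signed 16-bit fixed-point format in which an integer q represents the value q / 32768. A minimal sketch of conversion and a rounding, saturating multiply follows. The helper names are hypothetical and nothing here is taken from the planned implementation; storing state in 16 bits rather than 32 is presumably where the BSS saving would come from.

```python
Q15_ONE = 1 << 15          # 32768, the scale factor
Q15_MAX = Q15_ONE - 1      # 32767
Q15_MIN = -Q15_ONE         # -32768


def saturate(value):
    """Clamp an integer into the signed 16-bit range."""
    return max(Q15_MIN, min(Q15_MAX, value))


def to_q15(x):
    """Convert a real number in [-1, 1) to Q15, saturating at the edges."""
    return saturate(round(x * Q15_ONE))


def q15_mul(a, b):
    """Multiply two Q15 values: widen, round to nearest, shift back, saturate."""
    return saturate((a * b + (1 << 14)) >> 15)


def from_q15(q):
    """Convert a Q15 integer back to a float, for inspection only."""
    return q / Q15_ONE
```

For example, `q15_mul(to_q15(0.5), to_q15(0.5))` gives the Q15 encoding of 0.25, using only integer operations.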
Narrow-domain distillation. Use a strong teacher to generate curated training data for a target deployment domain.