An ongoing research project
Pipeline-parallel inference for pretrained dense LLMs with empirically-placed cuts, plus a reducibility map of every MLP projection inside.
We slice frozen pretrained dense LLMs (Qwen2.5-3B and 14B in our experiments) at functional layer boundaries, found by spotting cliffs in centered kernel alignment (CKA) along the residual stream, and place the resulting tiles across multiple GPUs. This is pipeline parallelism with measured cut placement, and it is lossless on N=200 prompts.
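A minimal sketch of the cut-placement idea: compute linear CKA between residual-stream snapshots of adjacent layers and flag the dips ("cliffs") as candidate boundaries. The drop threshold and array shapes here are illustrative assumptions, not the project's actual settings.

```python
import numpy as np

def linear_cka(X, Y):
    # X, Y: (n_samples, d) activation matrices; center per feature first.
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

def cut_candidates(layer_acts, drop=0.15):
    # layer_acts: per-layer residual-stream snapshots, each (n_samples, d).
    sims = [linear_cka(a, b) for a, b in zip(layer_acts, layer_acts[1:])]
    # A "cliff" is an adjacent-layer similarity dip; cut after that layer.
    return [i + 1 for i, s in enumerate(sims) if s < 1 - drop], sims
```

In practice the snapshots would come from hooking the residual stream over a probe set of prompts; here they are just arrays.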
Inside each tile, we measure which MLP down_proj matrices can be compressed to a k-NN lookup table without breaking generation. Boundary layers can (at k=1024: 100% same_top1, KL 0.002); middle layers cannot. The irreducible middle is the cocoon, the computational kernel; the reducible boundary is silk. Coordinating the assembled tiles uses small classifier models (~19K params each, which we call notes): a post-hoc, structural mixture-of-experts gate that runs on frozen weights with no end-to-end training.
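The lookup-table replacement can be sketched as follows. How the k table entries are chosen is not specified in this README, so the centroids below are taken as given; the point is only that the frozen down_proj matmul becomes a nearest-neighbor read of precomputed outputs.

```python
import numpy as np

class KNNDownProj:
    """Replace a frozen down_proj matmul with a nearest-centroid table lookup.

    Sketch under assumptions: centroids are provided externally (their
    construction is not described here); k would be 1024 per the README.
    """

    def __init__(self, W, centroids):
        # W: (d_ff, d_model) frozen down_proj weight; centroids: (k, d_ff).
        self.centroids = centroids
        self.table = centroids @ W  # precomputed outputs, (k, d_model)

    def __call__(self, x):
        # x: (batch, d_ff) MLP activations -> nearest centroid's cached output.
        d2 = ((x[:, None, :] - self.centroids[None]) ** 2).sum(-1)
        return self.table[d2.argmin(axis=1)]
```

A query that lands exactly on a table entry reproduces the dense matmul output for that entry; the measured same_top1/KL numbers above are what certify the approximation for boundary layers.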
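For scale, a note-sized gate can be sketched as a tiny two-layer classifier over frozen tile activations. The layer sizes here are hypothetical, chosen only so the parameter count lands near the quoted ~19K; the real notes' architecture and features are not specified in this README.

```python
import numpy as np

class NoteGate:
    """Tiny post-hoc router ('note') over frozen activations.

    Hypothetical sizes: 144 -> 128 -> 4 gives 19,076 params, i.e. ~19K.
    No end-to-end training: the base model stays frozen; only these
    weights would be fit.
    """

    def __init__(self, d_in=144, d_hidden=128, n_tiles=4, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.02, (d_in, d_hidden))
        self.b1 = np.zeros(d_hidden)
        self.W2 = rng.normal(0, 0.02, (d_hidden, n_tiles))
        self.b2 = np.zeros(n_tiles)

    def n_params(self):
        return sum(a.size for a in (self.W1, self.b1, self.W2, self.b2))

    def route(self, h):
        # h: (batch, d_in) frozen activations -> tile id per row.
        z = np.maximum(h @ self.W1 + self.b1, 0) @ self.W2 + self.b2
        return z.argmax(-1)
```

The "structural mixture-of-experts" framing amounts to this: the experts (tiles) already exist inside the pretrained model, and the gate is bolted on afterward rather than co-trained.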