Engineering Deep Dive: Context Management At 2M Tokens

How Grok 2M and MiniMax 256K force different pruning, action-ledger, and endgame strategies in the test generator.

Large context windows change failure modes. With a 2M-token model like Grok, the dominant risk is not immediate truncation. It is wandering: the model can keep too much conversation around and drift into side quests because nothing forces compression early. With a 256K-class model like MiniMax highspeed, the dominant risk is the opposite. You can lose the exact locator that made login work on iteration 4 before you reach final code synthesis on iteration 42.

The generator in discover-generator.ts is the most memory-sensitive stage in the whole discovery pipeline because it needs both early exploratory detail and late-stage code composition fidelity. That combination is exactly why we do not use one global context policy. The explorer, the healer, and the generator each have different memory shapes, so the generator overrides the global model defaults instead of pretending one pruning schedule fits every agent workload.

The Generator Has Its Own Context Policy

The global provider config in ai-provider.ts sets the baseline: Grok-tier models keep the whole thread by default, with no scheduled prune and an emergency threshold at 200K estimated tokens. MiniMax-tier models prune every 10 iterations, keep a 12-message tail, and force an emergency prune at 100K. That is a reasonable explorer default. It is not a good generator default.
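As a rough illustration, those baseline numbers could be captured in a small policy table. This is a minimal sketch, not the actual contents of ai-provider.ts: the names ContextPolicy, GLOBAL_CONTEXT_POLICY, and shouldPrune are hypothetical, and only the numeric values come from the description above.

```typescript
// Hypothetical shape; the real types in ai-provider.ts may differ.
type ModelTier = "grok" | "minimax";

interface ContextPolicy {
  pruneEveryIterations: number | null; // null = no scheduled prune
  keepTailMessages: number | null;     // null = keep the whole thread
  emergencyPruneTokens: number;        // estimated-token threshold for a forced prune
}

// Values taken from the prose above; the structure is an assumption.
const GLOBAL_CONTEXT_POLICY: Record<ModelTier, ContextPolicy> = {
  grok: { pruneEveryIterations: null, keepTailMessages: null, emergencyPruneTokens: 200000 },
  minimax: { pruneEveryIterations: 10, keepTailMessages: 12, emergencyPruneTokens: 100000 },
};

// Decide whether to prune before the next model call.
function shouldPrune(policy: ContextPolicy, iteration: number, estimatedTokens: number): boolean {
  // The emergency path applies to every tier, including the 2M-token one.
  if (estimatedTokens >= policy.emergencyPruneTokens) return true;
  // Tiers without a schedule only ever prune through the emergency path.
  if (policy.pruneEveryIterations === null) return false;
  return iteration > 0 && iteration % policy.pruneEveryIterations === 0;
}
```

Under this sketch, a MiniMax-tier run prunes at iterations 10, 20, 30, and so on, while a Grok-tier run prunes only if the estimate reaches 200K.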

So the generator overrides MiniMax behavior to prune later and retain more raw conversation. In current source, MiniMax in generation mode prunes every 15 iterations and keeps a 16-message tail. That sounds like a tiny change until you remember that each iteration adds roughly two or three messages. At that rate, a 12-message tail covers roughly the last four to six iterations, while a 16-message tail covers roughly five to eight. That extra headroom is often the difference between losing the login flow and still having its exact locator strings when finish_generate gets called.
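The override and the tail arithmetic can be sketched as follows. The names resolvePolicy, tailCoverage, and both constants are hypothetical, and the sketch assumes the generator leaves the 100K emergency threshold untouched, which the text above does not confirm.

```typescript
// Hypothetical policy shape for a tier that prunes on a schedule.
interface ContextPolicy {
  pruneEveryIterations: number;
  keepTailMessages: number;
  emergencyPruneTokens: number;
}

// Global MiniMax-tier default, as described for ai-provider.ts.
const MINIMAX_GLOBAL: ContextPolicy = {
  pruneEveryIterations: 10,
  keepTailMessages: 12,
  emergencyPruneTokens: 100000,
};

// Generation-mode override: prune later, keep a longer tail.
// Assumption: the emergency threshold is inherited unchanged.
const MINIMAX_GENERATOR_OVERRIDE: Partial<ContextPolicy> = {
  pruneEveryIterations: 15,
  keepTailMessages: 16,
};

// Fields present in the override win; everything else falls back to the base.
function resolvePolicy(base: ContextPolicy, override: Partial<ContextPolicy>): ContextPolicy {
  return { ...base, ...override };
}

// How many whole iterations fit in a tail, given two to three messages per iteration.
function tailCoverage(
  tailMessages: number,
  minMessagesPerIteration: number = 2,
  maxMessagesPerIteration: number = 3,
): { minIterations: number; maxIterations: number } {
  return {
    minIterations: Math.floor(tailMessages / maxMessagesPerIteration),
    maxIterations: Math.floor(tailMessages / minMessagesPerIteration),
  };
}

const generatorPolicy = resolvePolicy(MINIMAX_GLOBAL, MINIMAX_GENERATOR_OVERRIDE);
```

With these inputs, tailCoverage(12) gives 4 to 6 iterations and tailCoverage(16) gives 5 to 8.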

Why A 2M Window Still Needs Discipline

A common mistake with frontier context windows is assuming that "can hold more" means "should hold everything." It does not. The generator still estimates token load, still has an emergency prune path, and still injects endgame instructions that force the model out of exploration mode. Even when Grok can technically retain the whole transcript, the model does better when there is a clear shift from wandering to composing.
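The three mechanisms named above (token estimation, an emergency prune path, and an endgame instruction) could fit together roughly like this. Everything here is a sketch under stated assumptions: the four-characters-per-token heuristic, the three-iteration endgame window, the instruction wording, and the names estimateTokens, emergencyPrune, and prepareTurn are all hypothetical rather than taken from discover-generator.ts. Only finish_generate appears in the source.

```typescript
interface Message {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

// Assumed heuristic: roughly four characters per token.
function estimateTokens(messages: Message[]): number {
  let chars = 0;
  for (const m of messages) chars += m.content.length;
  return Math.ceil(chars / 4);
}

// Keep every system message plus the most recent keepTail non-system messages.
function emergencyPrune(messages: Message[], keepTail: number): Message[] {
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");
  // slice(-0) would return the whole array, so handle a zero tail explicitly.
  const tail = keepTail > 0 ? rest.slice(-keepTail) : [];
  return [...system, ...tail];
}

// Hypothetical wording; the real endgame prompt is not shown in the article.
const ENDGAME_INSTRUCTION =
  "Stop exploring. Compose the final test code from the locators you have verified, then call finish_generate.";

// Build the message list for the next model call.
function prepareTurn(
  messages: Message[],
  iteration: number,
  maxIterations: number,
  emergencyPruneTokens: number,
  keepTail: number,
  endgameWindow: number = 3, // assumed: how many iterations before the cap count as endgame
): Message[] {
  let next = messages;
  if (estimateTokens(next) >= emergencyPruneTokens) {
    next = emergencyPrune(next, keepTail);
  }
  if (maxIterations - iteration <= endgameWindow) {
    next = [...next, { role: "user", content: ENDGAME_INSTRUCTION }];
  }
  return next;
}
```

The point of the sketch is that the endgame check is independent of the token check: a Grok-tier run that never approaches its emergency threshold still gets pushed from exploring to composing as the iteration cap approaches.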

