

Token compression reduces the number of tokens sent to and received from the LLM, without losing information from the model’s perspective. Compression is the surgical removal of redundancy. Not summarization.

Two layers

The two-layer taxonomy is the non-negotiable foundation: every strategy on this page belongs to exactly one of the two layers.
  • Input compression: ~99% of total token volume, ~90% of the cost. What enters the context window: system prompts, tool results, codebase context, conversation history, MCP tool definitions.
  • Output compression: ~1% of total volume but ~10% of the cost. What the model generates: filler, repetitive scaffolding, polite preambles, over-explanation, markdown overhead.
Agentic workloads consume 5–30× more tokens per task than chatbot workloads, and approximately 40% of those tokens are redundant. Compression targets that redundant share.

The three compression strategies

Edgee ships three named compression strategies, toggleable independently.
Compression            Layer    Cost reduction
Tool Result            Input    −19%
Tool Surface (alpha)   Input    ~−25% projected
Output                 Output   −6.5% when enabled

Tool Result Trim

Filters tool_result messages before they reach the model. Strips:
  • Boilerplate framing
  • Pagination markers
  • ANSI escape sequences
  • Repeated headers
  • Verbose JSON wrappers
What it targets in a typical coding-agent session:
  • File contents — output from Read tool and file system operations.
  • Grep and search outputs — code search, ripgrep, similar tools.
  • Shell command output — stdout/stderr from Bash and terminal commands.
  • API responses — large JSON or text payloads returned by tool calls.
  • Database query results — rows and records returned from tool-executed queries.
User messages and assistant turns are not modified.
Lossiness: lossless on tool_result payloads — the model receives the same technical content, with redundant framing removed.
Customer traffic: tool_result_trim reduces token costs by 19% on average.
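To make the "lossless framing removal" idea concrete, here is a minimal sketch of the kind of cleanup tool_result_trim performs. This is illustrative only, not Edgee's actual implementation: it strips ANSI escape sequences and collapses consecutive duplicate lines (repeated headers, pagination markers) while leaving the technical content untouched.

```typescript
// Illustrative sketch — not Edgee's implementation.
function trimToolResult(raw: string): string {
  // Strip ANSI escape sequences (terminal colors and styling).
  const noAnsi = raw.replace(/\x1b\[[0-9;]*m/g, "");

  // Collapse consecutive duplicate lines (repeated headers, pagination markers).
  const deduped: string[] = [];
  for (const line of noAnsi.split("\n")) {
    if (deduped.length === 0 || deduped[deduped.length - 1] !== line) {
      deduped.push(line);
    }
  }
  return deduped.join("\n");
}
```

Note that both transformations are reversible in spirit: no file content, match, or stdout payload is dropped, only presentation-layer redundancy.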

Tool Surface Reduction

Coding agents send a flat tools list to the model on every request: every MCP server connected, every skill registered, every tool definition, always on, regardless of the task. The model receives the union of everything it could call, even when 95% of those tools are irrelevant to the request at hand. This bloats context, drives up cost, and forces engineers to manually toggle MCP servers on and off like a mixing desk between tasks.

How it works

Edgee already sits in the request path with two things it needs: the user prompt and the full tools list. From there, a small classifier model:
  1. Classifies the user’s task — “bugfix”, “db migration”, “billing issue”, “docs cleanup”, “code review”, and so on.
  2. Scores each tool by relevance — using its name, description, and JSON schema against the classified task.
  3. Strips out unrelated tools and skills — before the request ever hits the primary LLM. Borderline tools are down-scoped (kept but marked low-priority) rather than removed outright.
The result is a tool-aware gateway:
  • The IDE still exposes all MCP servers — nothing changes for the developer’s setup.
  • The agent still discovers tools through the standard MCP protocol — nothing changes for the agent’s behavior.
  • The model only ever sees a curated, task-relevant subset. The rest is removed or down-scoped by Edgee at the edge, dynamically, on every request.
Status: alpha. Internal benchmarks project a ~25% reduction in total token costs on top of tool_result_trim. Reach out if you want to opt in early.
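The classify–score–strip loop above can be sketched as follows. Edgee's actual classifier is a small model; here a keyword-overlap heuristic stands in for it, and the tool names, task label, and score thresholds are invented for illustration.

```typescript
// Hypothetical sketch of tool-relevance scoring; thresholds are illustrative.
interface Tool {
  name: string;
  description: string;
}

type Scored = { tool: Tool; score: number; priority: "keep" | "low" | "drop" };

function scoreTools(task: string, tools: Tool[]): Scored[] {
  const taskWords = new Set(task.toLowerCase().split(/\W+/).filter(Boolean));
  return tools.map((tool) => {
    const words = `${tool.name} ${tool.description}`.toLowerCase().split(/\W+/);
    // Fraction of the tool's own words that overlap the classified task.
    const score = words.filter((w) => taskWords.has(w)).length / words.length;
    // High overlap: keep; borderline: down-scope (low priority); unrelated: drop.
    const priority = score > 0.2 ? "keep" : score > 0.05 ? "low" : "drop";
    return { tool, score, priority };
  });
}
```

A real classifier would also use each tool's JSON schema, as described above; the shape of the decision (keep / down-scope / drop per tool, per request) is the point here.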

Output Brevity

Reduces verbosity in model responses without losing technical content. Same answer, fewer tokens.

Available levels

  • light — asks the model to skip pleasantries, articles, and filler while keeping standard sentence structure. Trade-off: lowest output reduction, most readable.
  • medium — forces the model to drop articles and use fragments, trading conventional grammar for dense technical content. Trade-off: dense, less natural prose.
  • hard — an aggressive variant that pushes output brevity further with stricter instructions. Trade-off: highest output reduction; least readable for humans, still parseable by downstream tools.
For coding-agent sessions, output is a small share of total token volume (~1%), so output_brevity is opt-in and disabled by default. For chat-style or RAG workloads where the model produces long-form answers, output is the dominant cost and output_brevity becomes the lever.
Customer traffic: where enabled, output_brevity reduces total token costs by 6.5% on average.
Academic note: recent work supports the broader claim — Brevity Constraints Reverse Performance Hierarchies in Language Models (Hakim, arXiv:2604.00025, March 2026) found that constraining models to brief responses can improve accuracy on certain benchmarks. The study is on open-weight models, not Claude/GPT directly.
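The three levels can be pictured as instruction presets of increasing severity. The strings below are hypothetical approximations written from the level descriptions above — Edgee's real instructions are not published:

```typescript
// Hypothetical brevity instructions per level — illustrative only,
// reconstructed from the level descriptions, not Edgee's actual prompts.
const brevityInstructions: Record<"light" | "medium" | "hard", string> = {
  light:
    "Skip pleasantries, preambles, and filler. Keep standard sentence structure.",
  medium:
    "Drop articles and use fragments. Prefer dense technical content over natural prose.",
  hard:
    "Maximum brevity. Telegraphic fragments only. No restatement, no markdown overhead, nothing beyond the answer.",
};
```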

Reading the compression block

Every response that runs through any compression strategy carries a compression block on the response body. Use it to track savings per request.
const response = await edgee.send({
  model: 'gpt-5.2',
  input: 'Long prompt with lots of context...',
});

if (response.compression) {
  console.log(response.compression.saved_tokens); // e.g. 450
  console.log(response.compression.cost_savings); // micro-units (1_000_000 = $1.00)
  console.log(response.compression.reduction);    // percentage, e.g. 48 → 48%
  console.log(response.compression.time_ms);      // ms spent on compression
}
Field reference:
Field          Type     Meaning
saved_tokens   integer  Input tokens removed (original count minus compressed count).
cost_savings   integer  Estimated cost savings in micro-units. Divide by 1_000_000 for USD.
reduction      number   Percentage reduction in input tokens. 48 → 48%.
time_ms        integer  Wall-clock time spent on compression.
The usage.prompt_tokens field on the same response reflects the compressed count actually billed by the provider, not the original input.
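As a usage example, a small helper can turn the compression block into a log line, including the micro-unit → USD conversion from the field reference. The `CompressionBlock` interface name is an assumption for this sketch (the field names match the table above), not an official SDK export:

```typescript
// Field names follow the compression block documented above;
// the interface name itself is assumed for this sketch.
interface CompressionBlock {
  saved_tokens: number;
  cost_savings: number; // micro-units: 1_000_000 = $1.00
  reduction: number;    // percentage, e.g. 48 → 48%
  time_ms: number;
}

function formatSavings(c: CompressionBlock): string {
  // Convert micro-units to USD per the field reference.
  const usd = (c.cost_savings / 1_000_000).toFixed(2);
  return `${c.saved_tokens} tokens (-${c.reduction}%) saved ≈ $${usd} in ${c.time_ms}ms`;
}
```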

Enabling and disabling

Two surfaces, in order of how most users will use them.

CLI (default-on for coding agents)

When you launch a coding agent through the Edgee CLI, tool_result_trim is enabled automatically — no console step required.
edgee launch claude
edgee launch codex
edgee launch opencode
tool_surface_reduction is alpha and opt-in. output_brevity is opt-in for coding-agent sessions because output is a small share of their volume.

Console (per-key toggle)

In the Edgee Console, open Dashboard and manage your agent’s settings right from the UI. For team-managed keys, the same toggles are available per-member from Team management → agent settings. See Team management.

Next


Claude Token Compression

tool_result_trim applied to Claude API traffic.

Codex Token Compression

tool_result_trim applied to the OpenAI Responses wire format.