Context Efficiency: Using Code Execution within MCP to Reduce Token Bloat

Introduction: The Problem of Token Bloat

In the architecture of AI agent systems, two fundamental constraints govern all design decisions: the context window limit of the Large Language Model and the financial cost of every token processed. A naive approach to providing an agent with information—for instance, asking it to analyze a 100-page financial report—is to attempt to "stuff" the entire document into the model's context.

This strategy fails spectacularly. It results in "token bloat," a condition with three crippling consequences:

  1. Exorbitant Costs: LLM API calls are priced per token. Sending a 100,000-token document to a model is orders of magnitude more expensive than sending a 2,000-token summary.
  2. Impossible Context Limits: In many cases, the raw data simply will not fit. A 1-million-token document cannot be processed by a model with a 128k-token context window.
  3. High Latency: The more tokens an LLM has to process, the longer it takes to generate a response, leading to a poor user experience.

The core engineering problem is this: how do we provide an agent with the necessary context to perform a task without sending massive, unprocessed, and expensive raw data payloads?

The Engineering Solution: Server-Side Context Compression via MCP

The Model Context Protocol (MCP) provides an elegant solution by enabling server-side code execution. Instead of being a passive data adapter, the MCP Server becomes an active, intelligent pre-processor that can compute, transform, and compress data before it ever reaches the AI agent.

The workflow is inverted. Rather than the agent pulling large amounts of data to process itself, it delegates the processing to the tool's MCP Server.

  1. High-Level Request: The AI agent makes a simple, high-level call, such as invoke('document_analyzer.get_key_insights', {doc_id: 'Q3-Report.pdf'}).
  2. Server-Side Execution: The MCP Server receives this request. It uses the doc_id to fetch the 100-page PDF from a data store. It then executes its own internal code to perform a complex data transformation—for example, a map-reduce summarization. It splits the document into chunks, uses a smaller model to summarize each chunk, and then performs a final summarization of the summaries.
  3. Compressed Context Response: This process might reduce a 100,000-token document into a dense, 2,000-token summary of its key findings.
  4. Efficient LLM Call: The MCP Server returns only this small, token-efficient summary to the AI agent. The agent can now use this highly relevant context to perform its final reasoning task at a fraction of the cost and time.

    +-------+  1. invoke('summarize', {doc})     +-----------------+
    | Agent |----------------------------------->|   MCP Server    |
    +-------+                                    | [Code Execution]|
        ^                                        | 1. Fetch 100k   |
        |                                        |    tokens       |
        |  2. return {summary: 2k tokens}        | 2. Summarize    |
        `----------------------------------------| 3. Compress     |
                                                 +-----------------+

Implementation Details

This powerful pattern is implemented within the tool's MCP Server, keeping the agent's logic clean and simple.

Snippet 1: An MCP Tool with a Server-Side Code Execution Step (Python)

The method decorated with @tool on the MCP Server contains the complex data processing logic. This logic is completely hidden from the AI agent.

```python
# mcp_document_server.py

from mcp_server_py import McpServer, tool

import document_store
from local_summarizer import map_reduce_summarize


class DocumentAnalyzerServer(McpServer):
    @tool(
        description="Analyzes a large document and returns a concise summary.",
        input_schema={"type": "object", "properties": {"doc_id": {"type": "string"}}}
    )
    def get_key_insights(self, doc_id: str) -> dict:
        """
        This tool executes code server-side to avoid sending a large
        document to the agent.
        """
        # 1. Server fetches the large, raw data object
        raw_text = document_store.get_text(doc_id)  # Could be 1,000,000+ tokens

        # 2. Server runs complex, token-intensive pre-processing code.
        #    This logic is the server's responsibility, not the agent's.
        final_summary = map_reduce_summarize(raw_text, max_tokens=2000)

        # 3. Server returns ONLY the small, compressed, high-value context.
        #    The whitespace word count is a rough proxy for the original token count.
        return {"summary": final_summary, "token_count_original": len(raw_text.split())}
```
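The map_reduce_summarize helper imported from local_summarizer is not shown above. A minimal sketch of the map-reduce summarization step described in the workflow, assuming a hypothetical summarize_chunk() wrapper around a small, inexpensive model, might look like this:

```python
# local_summarizer.py -- illustrative sketch only; summarize_chunk() and
# small_model_client are assumed helpers, not part of the original snippet.

from small_model_client import summarize_chunk  # assumed wrapper around a small, cheap model

CHUNK_SIZE_WORDS = 3000  # arbitrary chunk size; real code would chunk by tokens


def map_reduce_summarize(raw_text: str, max_tokens: int = 2000) -> str:
    """Summarize a large document by summarizing chunks, then summarizing the summaries."""
    words = raw_text.split()

    # Map step: split the document into chunks and summarize each one independently.
    chunks = [
        " ".join(words[i:i + CHUNK_SIZE_WORDS])
        for i in range(0, len(words), CHUNK_SIZE_WORDS)
    ]
    partial_summaries = [summarize_chunk(chunk, max_tokens=300) for chunk in chunks]

    # Reduce step: combine the partial summaries and condense them into one final summary.
    combined = "\n".join(partial_summaries)
    return summarize_chunk(combined, max_tokens=max_tokens)
```

The chunk size and per-chunk summary budget are illustration values; production code would chunk by tokens rather than words and run the map step concurrently.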

Snippet 2: The Agent's Simple and Efficient Perspective (Python)

The agent's code remains clean and high-level. It is completely unaware of the summarization logic happening on the server.

```python
# report_writing_agent.py

import mcp_client_py

# Connect to the MCP tool server
doc_analyzer = mcp_client_py.connect("http://mcp-doc-server:8001")

# The agent makes a simple, high-level request, delegating the heavy lifting.
result = doc_analyzer.invoke(
    "get_key_insights",
    {"doc_id": "q3-performance-review.pdf"}
)

# The agent receives a tiny, token-efficient payload to work with.
llm_prompt = f"""
Based on the following executive summary, write a three-bullet-point
takeaway for the CEO.

Summary: {result['summary']}
"""

# ... now send this small, cost-effective prompt to a powerful reasoning LLM ...
```

Performance, Cost, and Security Considerations

Performance: While the server-side code execution on the MCP Server adds a few seconds of initial processing latency, this is massively offset by the dramatic reduction in data transfer time and, more importantly, the LLM's own inference time. An LLM can process a 2,000-token context far faster than a 100,000-token one, leading to a significant decrease in end-to-end latency for complex tasks.

Cost (The Primary Driver): This is the pattern's single biggest source of ROI. LLM APIs are priced per input and output token. By using the MCP Server to compress a large document into a small summary, the cost of the subsequent call to the powerful reasoning model can be reduced by 98% or more. The cost of running the summarization code on the MCP Server's commodity hardware is trivial in comparison.
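As a back-of-the-envelope illustration (the per-token price below is a placeholder, not any provider's actual rate), the saving from sending a 2,000-token summary instead of a 100,000-token document works out as follows:

```python
# Back-of-the-envelope cost comparison. The price is an assumed placeholder;
# substitute your provider's actual input-token rate.
PRICE_PER_MILLION_INPUT_TOKENS = 10.00  # USD, for illustration only

def input_cost(tokens: int) -> float:
    return tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS

raw_cost = input_cost(100_000)    # $1.00 per call with the full document
summary_cost = input_cost(2_000)  # $0.02 per call with the compressed summary

savings = 1 - summary_cost / raw_cost
print(f"raw: ${raw_cost:.2f}, summary: ${summary_cost:.2f}, savings: {savings:.0%}")
# -> raw: $1.00, summary: $0.02, savings: 98%
```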

Security: This pattern provides a powerful security benefit. Raw, sensitive data from a document never needs to enter the primary agent's context window. The MCP Server can be designed to sanitize or redact sensitive information (like names, addresses, or financial figures) as part of its pre-processing step, ensuring that only clean, anonymized data ever reaches the LLM. This minimizes the risk of sensitive data being inadvertently exposed in logs or model outputs.
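A sanitization pass of this kind could run inside get_key_insights before summarization. The sketch below uses simple regular expressions as a stand-in for a proper PII-detection library, so the patterns and the redact() helper are illustrative assumptions rather than a complete solution:

```python
import re

# Illustrative redaction pass. A real deployment would use a dedicated
# PII-detection library; these regexes are only a stand-in.
REDACTION_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "DOLLAR_AMOUNT": re.compile(r"\$\d[\d,]*(?:\.\d{2})?"),
}

def redact(text: str) -> str:
    """Replace sensitive spans with typed placeholders before the text leaves the server."""
    for label, pattern in REDACTION_PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label}]", text)
    return text

# Inside get_key_insights, this would run before summarization:
#   raw_text = redact(document_store.get_text(doc_id))
```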

Conclusion: The ROI of Active Context Preparation

Executing code within the MCP layer transforms the MCP Server from a simple, passive data adapter into an intelligent, active participant in the AI workflow. This architectural pattern is a critical optimization strategy for building any serious, production-grade agentic system.

The return on investment is immediate and substantial:

  * Drastic Cost Reduction: It slashes API token costs by pre-processing data at the source.
  * Superior Performance: It improves end-to-end latency by providing the LLM with smaller, more relevant, and easier-to-process context.
  * Unlocks New Capabilities: It enables agents to "reason about" massive documents and data sources that are far too large to fit into any model's context window.
  * Enhanced Security: It creates a "data sanitization" layer, preventing raw, sensitive information from ever reaching the LLM.

This approach—delegating data transformation to the edge—is an indispensable part of a mature, efficient, and secure agentic architecture.