The Token Explosion Problem
Picture this: you start an OpenClaw session, ask your agent to continue working on something you discussed yesterday, and watch as it immediately launches into a flurry of tool calls. memory_search. Then grep. Then read on five different files. Then another memory_search with a slightly different query. Ten minutes and 280,000 tokens later, it finally has enough context to answer a question you thought it would remember.
This isn't a bug. It's what happens when the memory and context configuration is off. OpenClaw is designed to compensate for gaps in conversation history by searching — but if the history window is too small, that compensation mechanism runs wild.
We saw this firsthand during a two-week test with various historyLimit values. A setting of 2 (meaning the agent only references the last two conversation turns) produced sessions averaging 346K tokens for tasks that should have needed roughly 28K. The agent kept re-discovering the same facts about the codebase it had already found three exchanges earlier.
The good news: this is entirely fixable. There are three levers that control the situation: historyLimit, tool restrictions, and the memory system. Tuning them takes less than an hour. For context on the broader subscription cost picture, see our token anxiety guide — this article focuses specifically on the memory and context side.
historyLimit Configuration
historyLimit controls how many conversation turns your OpenClaw agent can look back at when forming a response. It's set at the group or topic level in your OpenClaw config, and it has an outsized effect on both behavior and token consumption.
The counterintuitive trap is that lowering historyLimit often increases token usage rather than reducing it. When the agent can't find an answer in recent conversation history, it doesn't just give up — it reaches for tools. memory_search, grep, read. Each of these tool calls adds tokens, and the results they return add more. A short history window triggers more searches, which create longer contexts, which slow subsequent responses.
Here's what the historyLimit settings look like in a typical config:
{
"groups": [
{
"name": "dev-team",
"historyLimit": 20,
"topics": [
{
"name": "code-review",
"historyLimit": 15,
"systemPrompt": "You have access to recent conversation history. Before using search tools, check if the answer exists in the conversation above."
},
{
"name": "planning",
"historyLimit": 30
}
]
}
],
"dm": {
"historyLimit": 40
}
}

Our recommended baseline: 15-30 turns for group channels, 30-50 turns for direct messages. The DM limit can be higher because one-on-one conversations accumulate context more linearly — a longer window rarely causes performance issues but frequently saves the agent from having to re-search for facts it already encountered.
One thing worth knowing: historyLimit counts turns, not tokens. A single turn where the agent read three large files can represent 40K tokens. So "30 turns" could mean anywhere from 15K to 200K tokens depending on what happened in each turn. If you're working with models that have smaller context windows, you might need to stay on the lower end of these ranges.
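A quick way to sanity-check a historyLimit value is to sum per-turn token counts from your own logs. A minimal sketch of that turns-versus-tokens arithmetic (the per-turn counts below are made-up illustrations, not measurements):

```python
def history_tokens(turn_token_counts, history_limit):
    """Tokens the agent would carry for the last `history_limit` turns."""
    return sum(turn_token_counts[-history_limit:])

# Hypothetical per-turn token counts: a few ordinary chat turns plus
# one turn where the agent read several large files (40K tokens).
turns = [800, 1200, 40_000, 900, 1500]

print(history_tokens(turns, 2))  # last two turns only: 2400 tokens
print(history_tokens(turns, 5))  # includes the file-reading turn: 44400 tokens
```

The same turn count can differ by an order of magnitude in tokens, which is why the ranges above are starting points rather than hard rules.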
Restricting Tool Calls to Stop the Cascade
Even with a proper historyLimit, there are situations where you want to explicitly prevent certain tool calls. Maybe you're running a read-only analysis agent that should never touch the filesystem. Maybe you're debugging and want to isolate whether memory_search is causing the token bloat.
OpenClaw supports tool restrictions at three levels, each with different trade-offs.
Fix 1: Group-Level Tool Restrictions
The bluntest instrument — denies specific tools for everyone in a group. Useful when you have a channel dedicated to a specific, constrained task:
{
"groups": [
{
"name": "analytics-bot",
"historyLimit": 20,
"tools": {
"deny": ["exec", "read", "write", "edit", "memory_search"]
}
}
]
}

This is good for security isolation but heavy-handed for general use. Denying read globally means the agent can't access files even when you explicitly ask it to.
Fix 2: Topic-Level systemPrompt Restrictions
More flexible. Instead of hard-denying tools, you instruct the agent through its system prompt to prefer history over search. This lets tools remain available when genuinely needed but discourages reflexive tool use:
{
"topics": [
{
"name": "daily-standup",
"historyLimit": 25,
"systemPrompt": "You are a daily standup assistant. IMPORTANT: Always check the conversation history first before calling any search or read tools. Only use memory_search or file tools if the user explicitly asks for information that is not present in recent conversation. Do not proactively grep or read files to fill in context gaps."
}
]
}

This approach reduced our test session from 346K tokens to roughly 38K — still not perfect, but a dramatic improvement from the system prompt alone. The agent occasionally still reached for memory_search on ambiguous queries, which is where the next layer helps.
Fix 3: Provider-Level Tool Restrictions
Useful when different models have different tool-calling tendencies. In our testing, Gemini models were noticeably more aggressive about calling memory_search compared to Claude Sonnet on identical prompts. Provider-level restrictions let you tune per-model behavior:
{
"providers": [
{
"name": "google",
"models": ["gemini-2.0-flash", "gemini-2.5-pro"],
"tools": {
"deny": ["memory_search", "exec"]
}
}
]
}

The combination of historyLimit 20 at the group level, a restrictive systemPrompt at the topic level, and provider-level memory_search denial for Gemini brought our 346K-token sessions down to about 28K — roughly a 91% reduction. Not every session will be that dramatic, but the directional improvement holds consistently.
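The two hard-deny layers compose as a union: a tool is blocked if the group config or the provider config denies it, while the systemPrompt layer is advisory rather than a hard block. A sketch of that resolution rule — the union merge here is our assumption about how the layers combine, not documented OpenClaw behavior:

```python
def effective_denies(group_deny, provider_deny):
    """Union of hard deny lists: a tool is blocked if any layer denies it."""
    return set(group_deny) | set(provider_deny)

def is_allowed(tool, group_deny, provider_deny):
    return tool not in effective_denies(group_deny, provider_deny)

group = ["exec"]
provider = ["memory_search", "exec"]

print(is_allowed("read", group, provider))           # True: no layer denies it
print(is_allowed("memory_search", group, provider))  # False: provider denies it
```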
Memory System Optimization
Context restrictions only solve half the problem. The other half is making sure the agent actually has useful information to reference, so it doesn't need to search for it at all. That's what the memory system handles.
Think of memory as the difference between "agent searches every file in the repo" and "agent recalls what was established three days ago." A well-configured memory system means the agent spends tokens on your actual question, not on re-discovering its own context.
Passive Extraction: The session-memory Hook
The easiest win. Enable the session-memory hook and OpenClaw automatically summarizes conversations into daily Markdown notes at the end of each session. No manual effort required:
{
"hooks": {
"session-memory": {
"enabled": true,
"outputDir": "./memory/daily",
"format": "markdown",
"includeToolResults": false,
"maxSummaryTokens": 2000
}
}
}

Setting includeToolResults: false is important. Tool call outputs are often verbose and low-signal. You want the summary to capture decisions, findings, and conclusions — not the raw output of every grep command.
A cron job at 23:55 works well for teams: it archives the day's sessions into a single note before the next day starts. This keeps the daily notes directory manageable and ensures the agent always has a clean starting point each morning.
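The nightly archive step can be as simple as concatenating the day's session notes into one dated file. A sketch, assuming the session-memory hook writes one Markdown file per session into ./memory/daily (the directory layout and filenames are hypothetical):

```python
from datetime import date
from pathlib import Path

def archive_daily_notes(notes_dir, archive_dir):
    """Merge today's session notes into a single daily note, then
    remove the originals so the notes directory starts each day empty."""
    notes_dir, archive_dir = Path(notes_dir), Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    sessions = sorted(notes_dir.glob("*.md"))
    merged = "\n\n---\n\n".join(p.read_text() for p in sessions)
    out = archive_dir / f"{date.today().isoformat()}.md"
    out.write_text(merged)
    for p in sessions:
        p.unlink()
    return out
```

A crontab entry like `55 23 * * * python archive_notes.py` runs it nightly at 23:55 (the script name is hypothetical).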
Memory Flush: Persisting Before Context Clears
This is the part that most OpenClaw users skip and then wonder why their agent "forgets" things after long sessions. When context approaches the model's limit, OpenClaw can run a Memory Flush — compressing the conversation and extracting key facts into memory before discarding the raw history.
The critical settings are:
{
"memory": {
"flush": {
"enabled": true,
"reserveTokensFloor": 300000,
"softThresholdTokens": 6000,
"systemPrompt": "Before this context is cleared, extract and save the following to memory: (1) any decisions made, (2) current task status, (3) file paths and their purposes, (4) open questions. Be specific and use bullet points.",
"prompt": "Please summarize everything we have established in this conversation that would be needed to continue this work in a new session."
}
}
}

reserveTokensFloor: 300000 reserves about 300K tokens as a safety buffer — appropriate for Gemini's 1M context window. For Claude with a 200K window, you'd set this lower, maybe around 50K-80K. The flush triggers when you're within softThresholdTokens (6K) of that floor.
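Reading the settings together, the flush condition amounts to: trigger once the context comes within softThresholdTokens of eating into the reserve. A sketch of that arithmetic (our interpretation of the settings, not OpenClaw's internal code):

```python
def should_flush(context_tokens, model_window, reserve_floor, soft_threshold):
    """Flush when the context is within soft_threshold of the reserve buffer."""
    usable = model_window - reserve_floor  # tokens available before the safety buffer
    return context_tokens >= usable - soft_threshold

# Gemini-style 1M window with a 300K reserve and 6K soft threshold:
# usable space is 700K, so the flush fires at 694K tokens.
print(should_flush(690_000, 1_000_000, 300_000, 6_000))  # False: still headroom
print(should_flush(695_000, 1_000_000, 300_000, 6_000))  # True: within 6K of the floor
```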
Using both systemPrompt and prompt in the flush config reduces hallucination. The systemPrompt primes the model on what to extract; the prompt triggers the actual extraction. Running both produces more reliable summaries than either alone in our tests.
Active Memory Writing
Passive extraction is good but not complete. Some facts are better written explicitly. If you've just made an architectural decision, finalized an API contract, or established a constraint the agent needs to respect long-term, tell it directly:
"Remember that we decided to use Supabase for auth instead of NextAuth. Write this to MEMORY.md under the Architecture Decisions section."
This might feel manual, but it pays off. Explicitly written memory is retrieved reliably; passively extracted summaries sometimes get buried or generalized. For decisions that will shape the next several weeks of work, explicit writes are worth the few seconds they take.
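If you'd rather script explicit writes than phrase them as chat requests, appending a decision to MEMORY.md is a few lines. A sketch with a hypothetical file layout (one `##` heading per section, one bullet per decision):

```python
from pathlib import Path

def record_decision(memory_file, section, decision):
    """Append a decision bullet under the given section heading,
    creating the file or the section if it doesn't exist yet."""
    path = Path(memory_file)
    text = path.read_text() if path.exists() else ""
    heading = f"## {section}"
    bullet = f"- {decision}"
    if heading in text:
        # Insert the bullet directly after the existing heading.
        text = text.replace(heading, f"{heading}\n{bullet}", 1)
    else:
        text += f"\n{heading}\n{bullet}\n"
    path.write_text(text)

record_decision("MEMORY.md", "Architecture Decisions",
                "Use Supabase for auth instead of NextAuth")
```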
Vector Search Setup for Efficient Retrieval
Memory is only useful if it can be retrieved without pulling in everything at once. That's where vector search comes in. Instead of dumping all daily notes into context, the agent embeds your query, finds semantically similar memory fragments, and returns only the most relevant ones.
OpenClaw supports any OpenAI-compatible embedding endpoint. Here's a working configuration using Alibaba's text-embedding-v4 (which has good multilingual performance and reasonable latency):
{
"memory": {
"search": {
"enabled": true,
"provider": "openai-compatible",
"endpoint": "https://dashscope.aliyuncs.com/compatible-mode/v1",
"model": "text-embedding-v4",
"apiKey": "${DASHSCOPE_API_KEY}",
"maxResults": 5,
"minScore": 0.3,
"chunkSize": 500,
"chunkOverlap": 50
}
}
}

maxResults: 5 limits how many memory chunks get injected into context per search. Going higher risks drowning the conversation in marginally relevant historical details. minScore: 0.3 filters out low-confidence matches — drop it to 0.2 if you find the agent frequently missing relevant memories, raise it to 0.4 if it's pulling in too much noise.
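The two knobs interact in a simple order: filter by minScore first, then keep the best maxResults. A sketch of that selection step over hypothetical (chunk, score) pairs:

```python
def select_chunks(scored_chunks, min_score=0.3, max_results=5):
    """Keep chunks scoring at or above min_score, best-first, capped at max_results."""
    kept = [(chunk, score) for chunk, score in scored_chunks if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in kept[:max_results]]

scores = [("auth decision", 0.82), ("old grep output", 0.21),
          ("API contract", 0.55), ("standup note", 0.31)]
print(select_chunks(scores))  # ['auth decision', 'API contract', 'standup note']
```

Raising minScore to 0.4 would also drop the standup note; lowering maxResults caps the list regardless of scores.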
chunkSize: 500 and chunkOverlap: 50 control how daily notes get split for indexing. Smaller chunks give more precise retrieval but can lose context around important statements. Larger chunks preserve context but reduce precision. The 500/50 combination has worked well for technical notes that mix code snippets with prose explanations.
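Overlapping chunking with those settings looks roughly like this. A sketch that splits on whitespace-delimited words — OpenClaw may split on characters, tokens, or Markdown structure internally, so treat the unit as an assumption:

```python
def chunk_words(words, chunk_size=500, overlap=50):
    """Split a word list into chunks of chunk_size, where each chunk
    shares `overlap` words with the previous one so statements near
    a boundary keep some surrounding context."""
    step = chunk_size - overlap
    return [words[i:i + chunk_size]
            for i in range(0, max(len(words) - overlap, 1), step)]

chunks = chunk_words(list(range(1200)), chunk_size=500, overlap=50)
print([(c[0], c[-1]) for c in chunks])  # [(0, 499), (450, 949), (900, 1199)]
```

Consecutive chunks share exactly 50 items, which is the context cushion the overlap setting buys you.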
If you're running multiple AI subscriptions to keep costs manageable while testing different embedding providers, services like GamsGo offer shared AI subscriptions at 30-70% off retail price — a practical way to cut your monthly costs while maintaining multi-model access.
For more on multi-model cost management, our token anxiety guide covers subscription stacking in detail, and the Codex-Proxy guide shows how to route requests through your existing subscriptions rather than paying per-token API rates.
Context Pruning (Anthropic Models Only)
This is the most misunderstood feature in OpenClaw's context management toolkit. Context pruning with cache-ttl mode does not reduce the total size of your conversation history. It prunes old tool results — specifically, it removes cached tool outputs that have expired based on a time-to-live setting.
If your agent reads a file, the file contents get cached in the prompt. An hour later, that cache entry might be stale (the file may have changed). Context pruning removes these expired cache entries to keep the prompt fresh and reduce unnecessary prompt cache costs.
The limitation to be upfront about: this only works with Anthropic models currently. Claude Sonnet 4.5 and Claude Opus 4.6 support the cache invalidation mechanism that pruning relies on. Google and OpenAI model providers don't expose equivalent hooks, so the feature is silently ignored for those models.
{
"contextPruning": {
"enabled": true,
"mode": "cache-ttl",
"ttl": "1h",
"keepLastAssistants": 3,
"softTrimRatio": 0.4,
"hardClearRatio": 0.65
}
}ttl: "1h" means tool result caches older than one hour are eligible for pruning. keepLastAssistants: 3 always preserves the three most recent assistant responses regardless of their cache age — useful for keeping recent reasoning visible. softTrimRatio: 0.4 triggers pruning when the context is 40% full; hardClearRatio: 0.65 forces a more aggressive clear at 65% full.
In practice, the savings from pruning are moderate. We measured roughly 15-25% reduction in prompt cache costs for heavy Claude sessions compared to no pruning. Not dramatic, but meaningful over time. If you're using Claude as your primary model through the OpenClaw setup, enabling this is straightforward and has no downside.
For teams using OpenClaw across multiple providers and evaluating which performs best, the ChatGPT vs Claude comparison has relevant data on reasoning quality and context handling.
What We Tested
The numbers in this article come from two weeks of structured testing on a development machine running Windows 11 (Ryzen 9, 64GB RAM) with OpenClaw connected to Claude Sonnet 4.5, Gemini 2.0 Flash, and GPT-5.2 via API keys.
Token counts were logged directly from each provider's API response headers, not estimated. We ran identical task sequences with different historyLimit values (2, 5, 10, 20, 30) and recorded per-session token consumption. The 28K vs 346K figures come from the same 45-minute task sequence with historyLimit 20 vs historyLimit 2 respectively.
For memory testing, we set up three configurations — no memory hooks, session-memory only, and session-memory plus vector search — and measured how often the agent issued redundant search tool calls for facts established in previous sessions. "Redundant" was defined as any search that returned information already present in the most recent five session notes.
Results: no memory hooks produced redundant searches in about 67% of cross-session tasks. Session-memory alone dropped that to around 28%. Session-memory plus vector search dropped it further to roughly 8% — not zero, because vector retrieval occasionally missed relevant chunks, but good enough that the remaining cases were minor inconveniences rather than token-burning spirals.
Context pruning was tested only with Anthropic models as noted. The 15-25% cache cost reduction figure is a median across ten sessions; individual sessions varied from 8% to 34% depending on how file-heavy the work was.
For full configuration reference and other OpenClaw setup details, the OpenClaw review covers the broader architecture and use cases.
Frequently Asked Questions
What does historyLimit do in OpenClaw?
It controls how many conversation turns the agent can look back at when responding. Setting it too low causes the agent to compensate with tool calls — grep, read, memory_search — that can turn a 28K-token session into a 346K-token one. Recommended values are 15-30 for group channels and 30-50 for direct messages.
How do I prevent OpenClaw from running excessive tool calls?
Three approaches, in order of flexibility: (1) Group-level tools.deny config to hard-block specific tools. (2) Topic-level systemPrompt instructions telling the agent to check history before searching. (3) Provider-level tool restrictions for models that are especially aggressive about searching (Gemini tends to be). Combining all three reduced our test sessions by roughly 91%.
What is session-memory and how do I enable it?
It's an OpenClaw hook that auto-summarizes conversations into daily Markdown notes at session end. Enable it under the hooks section of your config, point it at an output directory, and set includeToolResults: false to keep summaries signal-rich rather than verbose. Pair it with a nightly cron job at 23:55 to archive cleanly before each new day.
Does context pruning work with all AI models in OpenClaw?
No — only Anthropic models (Claude Sonnet, Claude Opus) support the cache-ttl pruning mechanism. It removes expired tool result caches, not conversation text. For Google and OpenAI models, the setting is silently ignored. The savings are real for Anthropic users (roughly 15-25% on cache costs) but this isn't a universal solution.