MemTree API

Context Memory

With Context Memory, LLMs are no longer limited by their context window. Coding sessions, agents, and conversations can now continue indefinitely as context is compressed via a combination of hierarchical GraphRAG and summarization.

Coding

Models will now get better, rather than worse, as your coding sessions stretch into the millions of tokens.

Agents

Agents can now try many approaches, learning from their past mistakes and remembering important lessons, no matter how long ago they happened.

Chat

Keep your storytelling, role-play, therapy, medical, or other long-running threads going without worrying about the model forgetting important details or missing out on overarching themes.

Motivation

Humans can recall arbitrarily long events at arbitrary levels of detail. For example, people can give a two-minute or a one-hour version of stories that took place over anywhere from a few hours to eons. Cognitive scientists refer to this ability as scale-invariant temporal memory, a key function of episodic memory. Episodic memory and semantic memory together form the two main pillars of long-term memory in people.

Many current LLM memories, like ChatGPT's memory and Cursor Memories, are examples of semantic memory. But without episodic memory, LLMs are forced to ingest the entirety of their experience at full resolution for every token they produce. In practice, this leads to context windows ballooning with too much detail and models producing slop. Then, when threads are restarted or compacted, models must be re-informed of key details from previous threads. Many approaches use RAG to recall details, but they often recall poorly sized chunks and/or lose the benefits of summarization for understanding how those chunks fit into the larger context of the thread.

How It Works

Context Memory can be thought of as a hierarchical episodic memory that performs lossless compression on arbitrarily long lists of messages. In our implementation, Context Memory is structured as a B-tree where the top of the tree contains a high level summary of the current message history and the bottom of the tree contains verbatim excerpts from that history that are relevant to recent messages. In between are summaries that get more detailed as you go towards the leaves. Rather than simply return matching chunks of text without context, as is typical with RAG, we retrieve relevant details contextualized within a tree of summaries. It's a specialized form of GraphRAG where relationships between nodes are encoded in the tree structure. This tree structure allows us to efficiently expand and collapse nodes based on relevance at query-time.

Example memory response when asking about a divide by zero error

  • Feature 1 summary
    • Task 1 summary
    • Task 2 summary
      • Step 1 summary
      • Step 2 summary
        • Divide by zero error details
    • Task 3 summary

Imagine that you were working to resolve a divide by zero error in a codebase, and that the only verbatim source relevant to the error is:

Divide by zero error details
By unfolding the tree and returning the path of summaries above the retrieved excerpts, we can provide a concise overview of what led to this divide by zero error.
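
To make this concrete, here is a minimal Python sketch of the tree and the path unfolding for the example above. The Node class and unfold_path helper are illustrative names only, not the actual implementation.

# Illustrative sketch: the tree from the divide by zero example and how a
# root-to-leaf path of summaries is "unfolded". Names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Node:
    summary: str                                  # summary text (verbatim excerpt at leaves)
    children: list["Node"] = field(default_factory=list)

def unfold_path(root: Node, target: Node) -> list[str] | None:
    """Return the chain of summaries from the root down to the target leaf."""
    if root is target:
        return [root.summary]
    for child in root.children:
        path = unfold_path(child, target)
        if path is not None:
            return [root.summary] + path
    return None

leaf = Node("Divide by zero error details")
root = Node("Feature 1 summary", [
    Node("Task 1 summary"),
    Node("Task 2 summary", [
        Node("Step 1 summary"),
        Node("Step 2 summary", [leaf]),
    ]),
    Node("Task 3 summary"),
])

print(unfold_path(root, leaf))
# ['Feature 1 summary', 'Task 2 summary', 'Step 2 summary', 'Divide by zero error details']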

⚡️ Speed

Since querying memory does not involve generating tokens, response time is incredibly fast at around 100ms per memory request. Most of this time is spent creating a single embedding from the most recent message. From there we use a vector similarity search to find the most relevant excerpts and summaries. Finally, we unfold the paths above the retrieved excerpts.
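
As a rough illustration (not the production code), the retrieval step might look like the Python below, where embed stands in for whatever embedding model is used and index pairs each leaf's precomputed embedding with its root-to-leaf summary path.

# Illustrative query path: embed the latest message, rank indexed excerpts by
# cosine similarity, return their unfolded summary paths. Names are assumptions.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def query_memory(latest_message: str,
                 index: list[tuple[np.ndarray, list[str]]],
                 embed, top_k: int = 3) -> list[list[str]]:
    """index pairs a leaf excerpt's embedding with its root-to-leaf path of summaries."""
    query_vec = embed(latest_message)                 # one embedding per memory request
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [path for _, path in ranked[:top_k]]       # in practice, paths merge into one tree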

API

We build on the OpenAI compatible API format where the message history is passed in its entirety for every request. The main input is a list of OpenAI compatible messages. The output messages are a compressed version of the input messages.

The majority of older input messages get condensed into a single user message containing the context memory. The recent messages are then returned after the system message and the memory message.

Semantic Addressing

There is no memory ID; the messages themselves serve to identify the memory. This allows branching, reverting, and incrementally updating memories while critically avoiding the indexing of messages that are no longer in the message history, for example when a coding session goes down a wrong path and the messages are reverted to an earlier state, or when a chat edits a message, branches into multiple conversations, or clones a previous message history. This means that if previous messages are modified (except for system messages: see below), a new memory will be built at the point of the modification.
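
Purely as an illustration of how "the messages are the ID" can behave, the sketch below derives a key from every message prefix; the hashing scheme and function names are hypothetical and say nothing about how the service is actually implemented.

# Hypothetical illustration only: a rolling hash over message prefixes.
# Editing or reverting message k changes keys[k:] only, so memory built on the
# unchanged prefix remains addressable while later messages get a new memory.
import hashlib, json

def prefix_keys(messages: list[dict]) -> list[str]:
    h = hashlib.sha256()
    keys = []
    for m in messages:
        if m["role"] == "system":
            continue                     # system prompts are not reflected in the index
        h.update(json.dumps(m, sort_keys=True).encode())
        keys.append(h.hexdigest())
    return keys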

Indexing

Indexing happens in the background and currently takes around 1 minute for 10k tokens. We do not block API requests to wait for indexing unless we are unable to fit recent unindexed messages inside the desired max output size of 20k tokens. To avoid blocking, we may truncate unindexed messages using an exponential decay algorithm that prioritizes recent messages. However, we will never truncate the most recent message, as it's critical for the model to see it in its entirety. If we cannot fit the most recent message into the model_context_limit, we will block and wait up to 5 minutes for indexing to complete. If indexing is not complete after 5 minutes, we will return an HTTP 413 error.
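
A rough sketch of what exponential decay truncation could look like is below; the decay rate, budgets, and characters-per-token cut are assumptions made for illustration, not documented behavior.

# Illustrative only: keep the newest message whole and give each older unindexed
# message an exponentially smaller share of the remaining token budget.
def truncate_unindexed(messages, budget_tokens, count_tokens, decay=0.5):
    newest = messages[-1]
    remaining = budget_tokens - count_tokens(newest)   # if negative, the real API blocks instead
    kept = [newest]
    for k, msg in enumerate(reversed(messages[:-1]), start=1):
        allowance = max(0, int(remaining * decay ** k))   # budget decays with distance from the end
        tokens = count_tokens(msg)
        kept.append(msg if tokens <= allowance else msg[: allowance * 4])  # crude ~4 chars/token
        remaining -= min(tokens, allowance)
    return list(reversed(kept))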

Chunking

Inputs are chunked semantically at index time. This ensures that chunks are semantically cohesive and that boundaries are clean.
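
As an illustration, one common way to chunk semantically is to split where the embedding similarity between neighboring sentences drops; this is an assumed approach for the sketch below, not necessarily the exact method used here.

# Illustrative semantic chunking: start a new chunk at similarity drops.
import numpy as np

def semantic_chunks(sentences: list[str], embed, threshold: float = 0.6) -> list[list[str]]:
    chunks, current = [], [sentences[0]]
    prev = embed(sentences[0])
    for sentence in sentences[1:]:
        vec = embed(sentence)
        sim = float(vec @ prev / (np.linalg.norm(vec) * np.linalg.norm(prev)))
        if sim < threshold:              # similarity drop = likely topic boundary
            chunks.append(current)
            current = []
        current.append(sentence)
        prev = vec
    chunks.append(current)
    return chunks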

Models

Gemini 2.5 Flash is used for indexing and Voyage 3.5 is used to generate embeddings.

Usage

API Keys

Generate and copy your PolyChat API key from within the polychat.co app under Settings > Account > API keys, where it can also be further managed.

Setup Context Memory with Kilo Code, Roo Code, or Cline

  1. Set the Base URL to https://polychat.co/api
  2. Get your API key from within the polychat.co app under Settings > Account > API keys
  3. Add the Custom Header:
    • name: x-polychat-memory
    • value: on

/context_memory

This API takes in OpenAI compatible messages and outputs compressed OpenAI compatible messages. It allows you to easily add memory to any model by replacing the original messages with the compressed messages this API returns.

Send a POST request with your messages, API key, and the model context limit to compress the messages:

curl -X POST https://polychat.co/api/context_memory \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Hello"},
      {"role": "assistant", "content": "Hi there!"}
    ],
    "model_context_limit": 128000
  }'

This outputs compressed messages and usage:

{
  "messages": [
    {
      "role": "user",
      "content": "[Memory message will be here*]"  // Present after 10k tokens indexed
    },
    {
      "role": "assistant",
      "content": "Hi there!"
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "completion_tokens": 29,
    "total_tokens": 34
  }
}
The first message will only contain memory after a threshold of 10k tokens has been indexed.
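
A minimal end-to-end sketch in Python using the requests library: compress the history with /context_memory, then hand the compressed messages to whichever model you normally call. The conversation history and the downstream model choice below are just examples.

# Sketch: compress a history, then reuse the compressed messages downstream.
import os, requests

API_KEY = os.environ["API_KEY"]
history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"},
    {"role": "user", "content": "Remind me how we fixed the divide by zero error."},
]

# 1. Compress the full history through Context Memory.
compressed = requests.post(
    "https://polychat.co/api/context_memory",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"messages": history, "model_context_limit": 128000},
).json()["messages"]

# 2. Send the compressed messages wherever you would have sent the originals,
#    here to PolyChat's own chat completions endpoint as one example.
reply = requests.post(
    "https://polychat.co/api/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"model": "GPT-4o-mini", "messages": compressed},
)
print(reply.json()["choices"][0]["message"]["content"])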

Params

  • messages: the OpenAI compatible message history to compress
  • model_context_limit: the context window (in tokens) of the model you plan to send the compressed messages to

/chat/completions

This is a fully OpenAI compatible API which uses models available on polychat.co to generate responses. It uses Context Memory under the hood to compress the messages for you. Send it uncompressed messages, and we will compress them and send them to the model.

Send a POST request with your messages, API key, and the x-polychat-memory header set to "on":

curl -X POST https://polychat.co/api/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -H "x-polychat-memory: on" \
  -d '{
    "model": "GPT-4o-mini",
    "messages": [
      {
        "role": "user",
        "content": "hi"
      }
    ]
  }'
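
The same request via the official openai Python SDK (a minimal sketch; any OpenAI compatible client that supports custom headers should work the same way):

from openai import OpenAI

client = OpenAI(
    base_url="https://polychat.co/api",              # PolyChat's OpenAI compatible API
    api_key="YOUR_POLYCHAT_API_KEY",
    default_headers={"x-polychat-memory": "on"},     # enable Context Memory
)

completion = client.chat.completions.create(
    model="GPT-4o-mini",
    messages=[{"role": "user", "content": "hi"}],
)
print(completion.choices[0].message.content)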

System Prompt

We don't alter the system prompt. It always passes through without modification and is not reflected in the memory index. We do this because system prompts are typically important for every message turn (agent instructions, available tools, provider info, etc.), and because they can contain today's date/time, which would invalidate the memory.

Pricing

Note: We are currently not charging for the indexing process. Planned future pricing is below.

Token Type       Cost per million tokens
Input            $4
Cached input     $2
Output           $10

For long conversations, most tokens will be cached input tokens, since the majority of messages are already indexed. Output tokens are limited to around 8k to 20k once the index has caught up with recent messages, which is about 10% of Claude's context window. This target is a parameter we plan to expose in the future; let us know at [email protected] if that's something you'd like to see. Note: the system message counts as cached input and does not count towards the output token limit.
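
For example, under the planned pricing, a single long-conversation request with hypothetical token counts would cost roughly:

# Hypothetical cost for one request under the planned pricing
# (token counts are made up for illustration):
cached_input = 100_000   # previously indexed history, billed as cached input
fresh_input  = 5_000     # recent, not-yet-indexed messages
output       = 15_000    # compressed messages returned (within the ~8k-20k target)

cost = cached_input / 1e6 * 2 + fresh_input / 1e6 * 4 + output / 1e6 * 10
print(f"${cost:.2f}")    # $0.37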

We are also looking at ways to reduce the price by at least 2x in the near future, so stay tuned for updates there.