[Tag:] RAG

  • AI Token Diet: What Headroom Teaches About Cutting LLM Agent Costs

    This English version is a fuller translation and adaptation of the original Korean article, “넷플릭스 개발자의 토큰 다이어트: Headroom이 보여준 AI 비용 절감법” (roughly, “A Netflix Developer’s Token Diet: How Headroom Cuts AI Costs”), for global readers. It explains why token costs matter when running AI agents and how the open-source project Headroom can help reduce them. As agents spread across industries, token charges add up quickly and can become a significant expense. The sections below walk through the main arguments and findings of the original article.

    AI token diet with Headroom.

    Original Korean article: 넷플릭스 개발자의 토큰 다이어트: Headroom이 보여준 AI 비용 절감법

    What is Headroom?

    Headroom is a context compression layer that shrinks the input AI agents send to LLMs (large language models). According to its GitHub repository description, it reduces tool outputs, logs, files, and RAG (retrieval-augmented generation) chunks before they reach the LLM. It is more than a prompt-compression tip: it is a developer tool that can run as a library, a proxy, an MCP (Model Context Protocol) server, or an agent wrapper. It can sit in front of coding agents such as Claude Code, Codex, Cursor, and Aider to cut token waste.

    LLM agent cost optimization.

    Why do AI agent costs increase?

    With a chatbot, a user types a question and receives an answer. AI agents work differently: they read files, search, check logs, call tools, and feed the results back into the LLM. This loop creates a lot of duplication. The same error logs are sent multiple times, unneeded file contents are included, and RAG search results are too broad. Input that a human would dismiss as noise is still billed as tokens. According to The Register, Tejas Chopra, the creator of Headroom, became interested in token reduction after receiving a $287 bill while using Claude Sonnet. He found that much of the input was not needed for actual reasoning and consisted of repetition, boilerplate, and duplicate data.
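
    To make the duplication problem concrete, here is a minimal Python sketch that collapses exact repeated log lines before they are sent to a model. It is an illustration only, not Headroom's implementation; the function name `collapse_repeated_lines` and the sample log are hypothetical.

```python
from collections import Counter


def collapse_repeated_lines(log_text: str) -> str:
    """Collapse exact duplicate lines, keeping first-seen order and a repeat count."""
    counts = Counter()
    first_seen = []
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line not in counts:
            first_seen.append(line)
        counts[line] += 1
    return "\n".join(
        f"{line} (x{counts[line]})" if counts[line] > 1 else line
        for line in first_seen
    )


# Hypothetical log excerpt: the same error appears three times.
log = "ERROR db timeout\nERROR db timeout\nINFO retrying\nERROR db timeout\n"
compact = collapse_repeated_lines(log)
print(compact)
```

    Real logs usually carry timestamps or request IDs, so exact-match deduplication like this would need a normalization step first. The sketch only shows why repeated input is a natural first target.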

    Headroom’s Core Structure

    The Headroom README describes the structure as a set of components: CacheAligner, ContentRouter, CCR (Compress-Cache-Retrieve), SmartCrusher, CodeCompressor, and Kompress-base. The names may look complex, but the flow is practical.

    • First, ContentRouter identifies the type of input. Code, JSON, logs, and plain text cannot be safely reduced in the same way, so the nature of the content has to be determined first.
    • Second, CodeCompressor and SmartCrusher reduce structured data such as code and JSON with care. Careless reduction can damage identifiers or syntax, which costs more than it saves.
    • Third, CCR stores the original content locally. Only the compressed version is sent, but the model can retrieve the original when it needs it.
    • Fourth, CacheAligner stabilizes the input prefix so that the provider’s prompt cache is not invalidated. Naive compression can lower the cache hit rate and end up increasing costs.

    This is where Headroom differs from simple prompt summarization tools.
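
    The first and third ideas above, routing by content type and keeping the original retrievable, can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Headroom's API: `detect_content_type`, `OriginalStore`, and the heuristics inside them are hypothetical names and rules chosen for the example.

```python
import hashlib
import json
import re


def detect_content_type(text: str) -> str:
    """Classify input as 'json', 'code', 'log', or 'text' using simple heuristics."""
    stripped = text.strip()
    if stripped.startswith(("{", "[")):
        try:
            json.loads(stripped)
            return "json"
        except ValueError:
            pass  # looked like JSON but did not parse; fall through
    if re.search(r"^\s*(def|class|import|from|function|const|let|var)\s+\w",
                 text, re.MULTILINE):
        return "code"
    if re.search(r"\b(ERROR|WARN|WARNING|INFO|DEBUG)\b", text):
        return "log"
    return "text"


class OriginalStore:
    """Keep originals in memory so a compressed view can point back to them."""

    def __init__(self) -> None:
        self._items: dict[str, str] = {}

    def put(self, text: str) -> str:
        # A short content hash serves as the retrieval handle.
        handle = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
        self._items[handle] = text
        return handle

    def get(self, handle: str) -> str:
        return self._items[handle]


store = OriginalStore()
raw = '{"status": "ok", "items": [1, 2, 3]}'
kind = detect_content_type(raw)
handle = store.put(raw)
print(kind, handle, store.get(handle) == raw)
```

    A production router would need far more robust detection than these regular expressions. The point of the sketch is the order of operations: classify first, store the original, and only then compress.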

    context compression for logs and files.

    What do the numbers mean?

    The Headroom README claims token reductions of 60-95% on real agent workloads, citing code search, SRE incident debugging, GitHub issue triage, and codebase exploration as examples with large reduction rates. These numbers do not guarantee the same results for every organization. Workloads heavy in logs and search results have more to cut, while short questions or already well-organized inputs leave few tokens to remove. The practical test is therefore not how much reduction is promised but what you measure in your own agent workflow: input tokens, output tokens, latency, cache hit rate, and failure rate.
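
    The five measurements listed above can be tracked with a very small harness. The following Python sketch assumes you already log per-run token counts and outcomes; `RunMetrics`, `summarize`, `reduction_rate`, and all sample numbers are hypothetical and are not results from Headroom.

```python
from dataclasses import dataclass


@dataclass
class RunMetrics:
    input_tokens: int
    output_tokens: int
    latency_s: float
    cache_hit: bool
    succeeded: bool


def reduction_rate(tokens_before: float, tokens_after: float) -> float:
    """Fraction of tokens removed, e.g. 0.75 means a 75% reduction."""
    if tokens_before <= 0:
        raise ValueError("tokens_before must be positive")
    return 1 - tokens_after / tokens_before


def summarize(runs: list[RunMetrics]) -> dict[str, float]:
    """Average the metrics needed to compare a baseline with a compressed setup."""
    n = len(runs)
    if n == 0:
        raise ValueError("runs must not be empty")
    return {
        "avg_input_tokens": sum(r.input_tokens for r in runs) / n,
        "avg_output_tokens": sum(r.output_tokens for r in runs) / n,
        "avg_latency_s": sum(r.latency_s for r in runs) / n,
        "cache_hit_rate": sum(r.cache_hit for r in runs) / n,
        "failure_rate": sum(not r.succeeded for r in runs) / n,
    }


# Made-up numbers for illustration only.
baseline = summarize([
    RunMetrics(12000, 400, 4.0, True, True),
    RunMetrics(8000, 300, 3.0, True, True),
])
compressed = summarize([
    RunMetrics(3000, 400, 3.5, True, True),
    RunMetrics(2000, 300, 2.5, False, False),
])
saved = reduction_rate(baseline["avg_input_tokens"], compressed["avg_input_tokens"])
print(saved, compressed["cache_hit_rate"], compressed["failure_rate"])
```

    In this invented example the input shrinks by 75%, but the cache hit rate and failure rate both get worse, which is exactly the kind of trade-off a single reduction percentage hides.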

    Signals that a team needs a token diet

    Teams that show the following signals should consider introducing Headroom or a similar tool:

    • Coding agents repeatedly read large repositories.
    • Logs and test results are attached to every request.
    • RAG search results are overly broad.
    • System prompts and policy documents are resent continuously.
    • AI tool use stalls because of usage limits or monthly costs.

    In these situations, examine the context structure before changing the model. The problem may not be the expensive model itself but a structure that keeps sending it unnecessary input.

    5 Lessons for Organizations

    • First, AI cost optimization is an engineering problem, not only a financial one. Costs are determined by token structure, tool calls, cache design, and RAG quality.
    • Second, prompt compression is the last step. Narrow the search results, remove duplicates, and read only the necessary files before compressing sentences. Waste that is not cut at the source is hard to fix through sentence compression alone.
    • Third, compression must come with quality verification. If the answer is wrong, fewer tokens still means failure. This is why Headroom provides benchmarks and reproduction commands.
    • Fourth, cache-preserving design is crucial. Provider prompt caches can miss when the input changes even slightly, so a reduction tool that breaks the cache may raise the total cost.
    • Fifth, preserving the original content is essential. An AI that sees only compressed information may miss important context, so it is safer to have a structure that can retrieve the original when needed.
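
    The fourth lesson is easier to see with a small example. The Python sketch below shows one cache-friendly prompt layout: stable material goes first in a deterministic order, and the per-request part goes last. `build_prompt` and `shared_prefix_len` are hypothetical helpers. Real provider caches work on tokens under provider-specific rules, so the character-level prefix here is only a rough proxy.

```python
import os


def build_prompt(system_prompt: str, tool_specs: dict[str, str], request: str) -> str:
    """Put stable content first and per-request content last."""
    # Sort tools by name so dict ordering never changes the prefix.
    tools = "\n".join(f"{name}: {tool_specs[name]}" for name in sorted(tool_specs))
    return f"{system_prompt}\n\n[tools]\n{tools}\n\n[request]\n{request}"


def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common leading substring of two prompts."""
    return len(os.path.commonprefix([a, b]))


system = "You are a careful SRE assistant."
p1 = build_prompt(system, {"search": "Search logs", "grep": "Search files"},
                  "Why did deploy 41 fail?")
# Same tools supplied in a different order, plus a different request.
p2 = build_prompt(system, {"grep": "Search files", "search": "Search logs"},
                  "Summarize incident 7.")

stable_len = p1.index("[request]\n") + len("[request]\n")
print(shared_prefix_len(p1, p2), stable_len)
```

    Because the tool list is sorted and the request comes last, the two prompts share everything up to the request text. Putting a timestamp or a reordered tool list near the top would shorten that shared prefix and reduce the chance of a cache hit.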

    Pre-Introduction Checklist

    When reviewing Headroom or similar tools, check the following items first:

    • Are you measuring input and output tokens for each agent task?
    • Do you have top-K and deduplication criteria for RAG search results?
    • Are you sending logs, files, and test results in their entirety?
    • Can you compare answer accuracy and task success rate before and after compression?
    • Are code, JSON, security policies, URLs, and identifiers protected from lossy compression?
    • Is the cache hit rate maintained after compression?
    • Do you have a fallback that turns compression off and re-runs the task when it fails?
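
    As one example of the top-K and deduplication item, the Python sketch below filters retrieved chunks before they reach the model. It assumes each chunk arrives as a `(score, text)` pair from your retriever; `select_chunks` and the sample data are hypothetical, and the duplicate check is a simple normalized-text comparison rather than semantic matching.

```python
def select_chunks(chunks: list[tuple[float, str]], top_k: int = 3) -> list[tuple[float, str]]:
    """Keep the top_k highest-scoring chunks, skipping near-verbatim duplicates."""
    if top_k <= 0:
        return []
    seen = set()
    selected = []
    for score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        # Normalize case and whitespace so trivial variants count as duplicates.
        key = " ".join(text.lower().split())
        if key in seen:
            continue
        seen.add(key)
        selected.append((score, text))
        if len(selected) == top_k:
            break
    return selected


# Hypothetical retrieval results.
retrieved = [
    (0.9, "Reset the cache."),
    (0.8, "reset  the cache."),
    (0.7, "Restart the pod."),
    (0.2, "Unrelated release note."),
]
kept = select_chunks(retrieved, top_k=2)
print(kept)
```

    Filtering at the retrieval step keeps low-value text out of the prompt entirely, which is cheaper and safer than compressing it afterwards.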

    Conclusion: AI costs are a design problem, not a usage problem

    The insight Headroom offers is not only about reducing tokens but about how AI agents fit into an organization’s workflow. Once agents are part of that workflow, the key capability is how context is collected, reduced, preserved, and reused. Good AI systems will need more than good models: they will send only the necessary information, reduce duplication, make use of caches, and fall back to the original content when something fails. A token diet is not just a cost-reduction technique but the starting point for AI operations design.

  • SGLang for Local LLM Serving: Is It the Next Step After Ollama and vLLM?

    This fuller English adaptation follows the Korean source on SGLang as a local LLM serving engine. The article’s question is practical: after trying Ollama for easy local use and vLLM for high-throughput serving, when should teams consider SGLang?

    SGLang is a local LLM serving engine built for high-throughput inference.

    Original Korean article: SGLang 로컬 LLM 서빙 엔진, Ollama·vLLM 다음 선택지가 될까?

    Why SGLang Is Getting Attention as a Local LLM Serving Engine

    Closer to a service engine than a simple runner

    Ollama made local model testing convenient. But production-like serving has different requirements: concurrency, latency, throughput, batching, caching, observability, and API stability. SGLang belongs to this service-oriented conversation. It is designed for structured generation workflows and efficient serving rather than only one-person experimentation.

    Ecosystem signals are hard to ignore

    The source article notes that ecosystem momentum matters. GitHub activity, benchmark discussions, model support, developer adoption, and integration examples all influence whether a serving engine becomes a serious option. SGLang is drawing attention because it addresses real bottlenecks in repeated LLM requests.

    Core Principle: What RadixAttention Reduces

    Common prompts do not need to be recalculated

    RadixAttention is the key concept highlighted in the Korean article. Many LLM services repeatedly send prompts that share the same prefix: system instructions, policy text, examples, retrieved documents, tool descriptions, or conversation history. If the engine can reuse shared computation, it can reduce waste.
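
    The idea of reusing shared prefixes can be illustrated with a toy prefix tree over token sequences. The Python sketch below is not SGLang's implementation, which manages KV-cache entries in a radix tree. It only counts how many leading tokens of a new request were already seen. `PrefixCache` and the word-level "tokens" are hypothetical simplifications.

```python
class PrefixCache:
    """Toy prefix tree that reports how many leading tokens were already cached."""

    def __init__(self) -> None:
        self._root: dict = {}

    def insert(self, tokens: list[str]) -> int:
        """Cache the sequence and return the number of leading tokens reused."""
        node = self._root
        reused = 0
        for tok in tokens:
            child = node.get(tok)
            if child is None:
                # First time this continuation is seen: create a new branch.
                child = node[tok] = {}
            else:
                reused += 1
            node = child
        return reused


cache = PrefixCache()
system = ["You", "are", "a", "support", "assistant", "."]
first = cache.insert(system + ["How", "do", "I", "reset", "?"])
second = cache.insert(system + ["Where", "is", "my", "invoice", "?"])
unrelated = cache.insert(["Translate", "this", "text"])
print(first, second, unrelated)
```

    The second request reuses all six system-prompt tokens because they form a shared prefix, while the unrelated request reuses none. In a real engine, the reused portion corresponds to attention state that does not have to be recomputed.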

    Why this matters for RAG and agent services

    In RAG systems and agent workflows, repeated context is common. Many users may ask different questions over the same documents, or an agent may run multiple steps with the same tool instructions. Prefix reuse can improve throughput and latency when the workload matches the pattern.
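
    A rough back-of-the-envelope calculation shows why the workload pattern matters. The Python sketch below compares prefill tokens with and without prefix reuse under an idealized assumption: every request shares exactly the same prefix and the cached prefix is never evicted. The function name and the sample numbers are hypothetical, and real savings depend on the engine, memory pressure, and traffic.

```python
def prefill_tokens(num_requests: int, prefix_tokens: int,
                   suffix_tokens: int, reuse_prefix: bool) -> int:
    """Total prompt tokens the engine must process for a batch of requests."""
    if reuse_prefix:
        # The shared prefix is computed once; each request adds only its suffix.
        return prefix_tokens + num_requests * suffix_tokens
    return num_requests * (prefix_tokens + suffix_tokens)


# Made-up workload: 100 questions over the same 2,000-token document context.
without_reuse = prefill_tokens(100, 2000, 50, reuse_prefix=False)
with_reuse = prefill_tokens(100, 2000, 50, reuse_prefix=True)
print(without_reuse, with_reuse)
```

    With a long shared prefix and short questions, the idealized gap is large. If each request carried a different document set, the two numbers would converge and prefix reuse would help far less.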

    How to Read Ollama, vLLM, and SGLang Comparisons

    Benchmarks are strong, but conditions matter

    The source article warns against reading benchmark numbers blindly. Performance depends on model size, GPU type, batch size, context length, request pattern, quantization, and serving configuration. A chart that favors one engine under one workload may not apply to another team’s service.

    vLLM’s strengths remain important

    vLLM remains a powerful and widely adopted serving option. Its ecosystem, PagedAttention, OpenAI-compatible APIs, and production experience make it a default candidate for many teams. SGLang should be evaluated against vLLM using the team’s own traffic pattern, not only public claims.

    Decision Criteria by Situation

    Ollama, vLLM, and SGLang fit different local LLM serving needs.

    For personal tests, Ollama is still convenient

    If the goal is to download a model and test prompts locally, Ollama remains the easiest starting point. It is simple, friendly, and good for learning. A developer experimenting on a laptop may not need a full serving engine.

    For general service serving, start by reviewing vLLM

    If the goal is a service API with multiple users, vLLM is often the first serious option to evaluate because of its maturity and ecosystem. Teams should measure throughput, latency, memory use, and operational complexity.

    For repeated-context high-volume requests, evaluate SGLang

    SGLang becomes especially interesting when requests share long prefixes or when agent/RAG workflows repeatedly reuse context. In those cases, RadixAttention and structured generation features may provide meaningful advantages.

    Pre-Adoption Checklist

    Look at tail latency, not only averages

    Average latency can hide user pain. Teams should measure p95 and p99 latency, cold starts, long-context behavior, concurrency, error recovery, logging, deployment complexity, and compatibility with existing clients.

    • Test with your own prompts, documents, and traffic shape.
    • Compare GPU memory use under realistic concurrency.
    • Check model support and OpenAI-compatible API behavior.
    • Monitor tail latency and failed generations.
    • Plan rollback to a known engine if production behavior differs from tests.
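
    To show why averages can mislead, the Python sketch below computes p95 and p99 with the nearest-rank method on a made-up latency sample. The `percentile` helper and the numbers are illustrative assumptions, not measurements of any engine.

```python
import math


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical sample: 98 fast requests and 2 very slow ones (milliseconds).
latencies_ms = [100] * 98 + [2000, 5000]
mean_ms = sum(latencies_ms) / len(latencies_ms)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)
print(mean_ms, p95_ms, p99_ms)
```

    In this sample the mean and p95 both look acceptable, yet p99 reveals that some users wait twenty times longer than the typical request. Running the same calculation on your own traffic for each candidate engine gives a fairer comparison than a single average.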

    Conclusion: SGLang Is a Candidate for Service-Style Local LLMs

    RadixAttention can reduce repeated computation for shared prompt prefixes.

    The article’s conclusion is balanced. SGLang is not automatically the replacement for Ollama or vLLM. It is a strong candidate when local LLM work moves from simple testing to repeated, service-style generation where caching and structured workflows matter.

    For many teams, the best decision is staged. Use Ollama to learn the model, test vLLM when service traffic appears, and benchmark SGLang when repeated context, RAG, or agent chains become a real cost. The right engine is the one that fits the workload you can measure.
