LLMs Can Think While Idle: Researchers from Letta and UC Berkeley Introduce ‘Sleep-Time Compute’ to Slash Inference Costs and Boost Accuracy Without Sacrificing Latency

Large language models (LLMs) have gained prominence for their ability to handle complex reasoning tasks, transforming applications from chatbots to code-generation tools. These models are known to benefit significantly from scaling their computation during inference, often producing higher accuracy by dedicating more resources to hard problems. However, this approach brings along considerable drawbacks. Longer processing times and higher computing costs make it challenging to scale such solutions in real-world settings, where responsiveness and affordability are crucial. As technology advances toward more intelligent systems, there is a growing need to explore how LLMs can become not only smarter but also more efficient, especially when operating within repetitive or familiar contexts.

One of the biggest inefficiencies in current LLM deployment occurs during query resolution. Typically, when a user poses a question, the model processes it together with the necessary background context. This standard test-time compute setup assumes that the context and question always arrive together. But in real scenarios, such as document Q&A or debugging code, the context is usually persistent and available well before a specific question is asked. Yet the model processes everything from scratch for each query, even if it has seen the context before. This redundancy increases computational cost and response delay, particularly when multiple queries share a single context.

Several methods scale computation at test time, chiefly sequential and parallel strategies. Sequential approaches extend the model’s reasoning path, allowing it to consider more possibilities, while parallel approaches sample multiple outputs at once and are commonly evaluated with pass@k, which counts a problem as solved if any of the k samples is correct. Techniques like speculative decoding aim to cut latency by making early guesses, but their usefulness is limited when the model still has to think from scratch. While helpful, these methods don’t eliminate the need to reprocess the context alongside every new question. They also typically require test-time conditions that aren’t always feasible, such as access to an oracle or an ideal verifier.

Researchers from Letta and the University of California, Berkeley, introduced a novel solution they call sleep-time compute. The method uses the idle time between user interactions to do useful reasoning in advance. Instead of waiting for a user question, the model begins analyzing the context beforehand. It anticipates possible future queries and builds a new version of the context enriched with relevant inferences. When a user finally asks a question, the model can simply refer to this pre-processed context. Since much of the thinking is already done, it requires less computational effort at query time to produce accurate answers. This approach becomes even more effective when multiple questions relate to the same context, because the inferences are shared and their computational cost is spread across all of those questions.

The implementation of sleep-time compute relies on decomposing the traditional prompt into two parts: a static context and a dynamic query. During the sleep-time window, only the context is used to generate a pre-processed version. This enhanced context, called c′, is built using test-time compute techniques like reasoning chains or summarization. Once this enriched version is stored, it replaces the raw context during real-time queries. The final answers are then generated using far fewer resources. This system not only minimizes redundant reasoning but also paves the way for more proactive LLMs that can think ahead and be better prepared.
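The context/query split described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the class name `SleepTimeCache`, the prompt wording, and the stand-in `fake_llm` function are all hypothetical, and the way c′ is assembled here (raw context followed by the model's inferences) is just one possible choice.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Hypothetical type: any function that maps a prompt string to a completion.
LLM = Callable[[str], str]


@dataclass
class SleepTimeCache:
    """Stores one enriched context c' per context id (illustrative only)."""
    llm: LLM
    enriched: Dict[str, str] = field(default_factory=dict)

    def sleep(self, context_id: str, context: str) -> None:
        """Idle-time step: build c' from the raw context c alone (no query yet)."""
        prompt = (
            "Read the context and write down inferences that are likely "
            "to help answer future questions.\n\nContext:\n" + context
        )
        inferences = self.llm(prompt)
        # Assumption: c' is the raw context plus the generated inferences.
        self.enriched[context_id] = context + "\n\nInferences:\n" + inferences

    def answer(self, context_id: str, query: str) -> str:
        """Test-time step: answer the query q against the stored c'."""
        c_prime = self.enriched[context_id]
        return self.llm(c_prime + "\n\nQuestion: " + query)


def fake_llm(prompt: str) -> str:
    # Stand-in model so the sketch runs offline; it only reports prompt size.
    return f"<completion for {len(prompt)} chars>"


cache = SleepTimeCache(llm=fake_llm)
cache.sleep("doc-1", "Alice has 3 apples and buys 2 more.")
print(cache.answer("doc-1", "How many apples does Alice have?"))
```

In a real deployment, `fake_llm` would be replaced by a call to an actual model, with a larger reasoning budget for `sleep` than for `answer`.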

To evaluate the effectiveness of sleep-time compute, the research team tested it using two specially designed benchmarks: Stateful GSM-Symbolic and Stateful AIME. Both datasets are derived by splitting existing problem sets into separate contexts and questions. In experiments using models like GPT-4o and GPT-4o-mini, researchers observed a 5× reduction in test-time compute for similar accuracy levels. Notably, accuracy improved by up to 13% for the GSM-Symbolic P2 dataset and by 18% on Stateful AIME when sleep-time compute was scaled. Multi-Query GSM-Symbolic, a new dataset introduced for this evaluation, helped demonstrate that the cost per query could be reduced by 2.5× when 10 queries shared the same context.

When pitted against popular strategies like pass@k, sleep-time compute consistently outperformed them. Unlike pass@k, which assumes access to a perfect evaluator, sleep-time compute works under more realistic conditions. Results show that even at low test-time compute budgets, sleep-time compute produced comparable or better accuracy while consuming fewer tokens. For instance, the GPT-4o-mini model achieved higher accuracy with fewer than 200 test-time tokens using sleep-time compute, compared to the more than 500 tokens needed by the baseline. Similar improvements were observed when models like Claude 3.7 Sonnet and DeepSeek R1 were evaluated.
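To see why pass@k presupposes a reliable checker, it helps to look at how the metric is usually computed. The sketch below uses the widely used unbiased estimator from the code-generation literature, not anything specific to this paper; the function name and arguments are illustrative. Note that the count of correct samples `c` can only be obtained by checking each sample against a ground-truth answer or verifier.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled answers, of which c were verified correct.

    pass@k is the probability that at least one of k samples drawn
    (without replacement) from the n generated answers is correct.
    """
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Too few incorrect samples to fill k slots: a correct one is certain.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=10, c=5, k=1))  # 0.5
```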

Scaling the amount of compute dedicated to sleep-time further improved outcomes. By running five parallel generations during sleep-time on complex tasks, researchers pushed the Pareto curve further. However, they noted diminishing returns beyond this point. Importantly, results showed that stronger models handling more difficult tasks benefited more from additional sleep-time compute. Amortizing sleep-time computation also became highly cost-effective when contexts served multiple related queries. By weighting test-time tokens as ten times more expensive than sleep-time tokens, in line with industry latency-cost ratios, the researchers confirmed a reduction of up to 2.5× in the average cost per query.
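The amortization argument can be made concrete with a small cost model. This is a sketch under stated assumptions: the function name is hypothetical, the 10× weight on test-time tokens comes from the article, and the token counts in the example are invented purely for illustration and are not results from the paper.

```python
def avg_cost_per_query(
    sleep_tokens: int,
    test_tokens_per_query: int,
    num_queries: int,
    test_weight: float = 10.0,
) -> float:
    """Average weighted token cost per query for one shared context.

    Sleep-time tokens are paid once per context; test-time tokens are paid
    on every query and weighted `test_weight` times more heavily.
    """
    if num_queries < 1:
        raise ValueError("num_queries must be at least 1")
    total = sleep_tokens + test_weight * test_tokens_per_query * num_queries
    return total / num_queries


# Illustrative numbers only (not taken from the paper).
baseline = avg_cost_per_query(sleep_tokens=0, test_tokens_per_query=500, num_queries=10)
with_sleep = avg_cost_per_query(sleep_tokens=2000, test_tokens_per_query=150, num_queries=10)
print(baseline, with_sleep)  # 5000.0 1700.0
```

Because the sleep-time term is divided by the number of queries, its share of the per-query cost shrinks as more questions are asked about the same context.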

Another interesting finding was that sleep-time compute worked best when user queries were predictable. Using Llama2-70B, researchers scored the predictability of each query given its context and found a strong correlation: the more predictable the query, the greater the benefit. In examples where the question logically followed from the given context, sleep-time computation yielded higher gains. Conversely, less predictable or abstract queries experienced reduced effectiveness, although they still showed benefits compared to traditional test-time-only methods.
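One way to picture the predictability score is as the log-probability a language model assigns to the query tokens given the context. The sketch below assumes those per-token log-probabilities have already been obtained from some model; the function names, the length normalization (taking the mean rather than the sum), and the example values are assumptions for illustration, not details confirmed by the article.

```python
from typing import Dict, List, Sequence


def query_predictability(token_logprobs: Sequence[float]) -> float:
    """Mean log-probability of the query tokens given the context.

    Values closer to 0 mean the query was easier to predict from the context.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return sum(token_logprobs) / len(token_logprobs)


def rank_by_predictability(scored: Dict[str, Sequence[float]]) -> List[str]:
    """Return query ids ordered from most to least predictable."""
    return sorted(scored, key=lambda q: query_predictability(scored[q]), reverse=True)


# Hypothetical log-probabilities for two queries about the same context.
examples = {
    "follows_from_context": [-0.2, -0.4, -0.3],
    "unrelated_question": [-2.5, -3.1, -1.9],
}
print(rank_by_predictability(examples))
```

Queries could then be grouped by score to compare how much sleep-time compute helps in each group.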

Altogether, this research presents a scalable technique to enhance the efficiency of LLMs without compromising accuracy. By leveraging otherwise idle time, sleep-time compute reduces the burden on real-time systems, lowers operational costs, and shortens response times. The reported quantitative improvements, such as a 5× reduction in test-time compute, accuracy gains of up to 13–18%, and a drop of up to 2.5× in cost per query, suggest that forward-thinking approaches like this could shape the next generation of intelligent, context-aware assistants.

Key takeaways from the research:

Sleep-time compute allows models to anticipate queries by reasoning on context before the query arrives.

Accuracy improved by up to 13% on Stateful GSM-Symbolic and 18% on Stateful AIME when sleep-time compute was scaled.

Test-time compute requirements decreased by approximately 5 times for similar performance levels.

When sharing context across 10 related queries, the average query cost decreased by a factor of 2.5.

Sleep-time compute outperformed the pass@k strategy in parallel compute settings at equivalent budgets.

More effective on predictable queries, identified via log-probability scoring.

Diminishing returns noted beyond five parallel generations for sleep-time computation.

Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.


