The current popular method for test-time scaling in LLMs is to train the model through reinforcement learning to generate longer responses with chain-of-thought (CoT) traces. This approach is used in ...
A new framework called METASCALE enables large language models (LLMs) to dynamically adapt their reasoning mode at inference time. This framework addresses one of LLMs’ shortcomings, which is using ...
DAPO is a scalable reinforcement learning algorithm that helps a large language model achieve better complex reasoning ...
Results that may be inaccessible to you are currently showing.
Hide inaccessible results