Cogito v2 Preview

From inference-time search to self-improvement

31st July, 2025

Takeaways

  • We are releasing 4 hybrid reasoning models under an open license.
    • 2 mid-sized models - 70B dense, 109B MoE
    • 2 large models - 405B dense, 671B MoE
  • The largest 671B MoE model is amongst the strongest open models in the world. It matches or exceeds the performance of both the latest DeepSeek v3 and DeepSeek R1 models, and approaches closed frontier models like o3 and Claude 4 Opus.
  • We extend our work on building superintelligence using Iterated Distillation and Amplification (IDA) by scaling the model's intelligence prior. This is done by having the model internalize the reasoning process through iterative policy improvement, rather than simply searching longer at inference time.
  • This is a novel scaling paradigm where the models develop more “intuition”, and serves as a strong proof of concept for self-improvement (AI systems improving themselves).
    • Since the Cogito models develop a better intuition of the trajectory to take while searching, they have 60% shorter reasoning chains than DeepSeek R1.
    • Contrary to the accepted belief that such technical innovations require capital-intensive infrastructure, this approach is also significantly more efficient (the Cogito models have been trained for a combined cost of <$3.5M).
  • Next, we plan to hill climb on the gains of iterative self-improvement to build superintelligence. All models we create will be open sourced.

You can download the models on Hugging Face, or use them directly through the API on Together AI, Baseten or RunPod, or run them locally with Unsloth.

Our approach

Building superintelligence is fundamentally a tractable machine learning problem.

Our approach to building superintelligence is as follows:

  • Step 1 - Develop a scalable training recipe for unbounded iterative intelligence improvements
  • Step 2 - Use more compute to scale our efforts and iteratively improve intelligence to go beyond human performance

While developing these systems, we also need to ensure that each step of the pipeline can work for intelligence smarter than humans (e.g., we need techniques that can reliably supervise AI systems much smarter than humans).

Earlier this year, we released Cogito v1 models, where we talked about using Iterated Distillation & Amplification (IDA) as a promising research direction for general superintelligence. Our main focus was to provide a training signal which is not upper bounded by overseer intelligence.

Today's Cogito v2 release extends our work on IDA to larger systems, with a stronger focus on the remaining part: self-improvement via distillation.

The Recipe for Building Superintelligence

Superhuman performance has been achieved in multiple narrow domains (chess, Go and poker) via the same two-step loop:

  • Inference-time reasoning - which spends compute to search for a solution
  • Iterative policy improvement - which distills the discoveries of that search back into the model's parameters. As a result, the next search starts closer to the goal

AlphaGo exemplifies the pattern: Monte-Carlo Tree Search (MCTS) generates an improved policy from each game, and the policy-and-value network is retrained on those visit counts.
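
As a rough illustration of this loop (not actual training code; the interfaces below are hypothetical placeholders), the pattern looks like the following sketch:

```python
# A minimal sketch of the search-then-distill loop. `search` and `distill`
# are hypothetical stand-ins for an inference-time search procedure and a
# parameter-update step; this illustrates the pattern, not a specific system.

def improve(policy, tasks, search, distill, num_rounds=10):
    """Alternate inference-time search with iterative policy improvement."""
    for _ in range(num_rounds):
        # Step 1: inference-time reasoning - spend extra compute to find
        # solutions better than what the current policy produces directly.
        amplified = [search(policy, task) for task in tasks]

        # Step 2: iterative policy improvement - distill those search
        # results back into the policy's parameters, so that the next
        # round of search starts closer to the goal.
        policy = distill(policy, tasks, amplified)
    return policy
```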

LLMs can be thought of as a similar system, where the inference-time reasoning is done in a less structured way (compared to game-playing systems), as a thinking process before the answer. However, to be able to complete the loop and drive iterative intelligence improvements, we also need the second important step: iterative policy improvement.

That is, we need to distill the reasoning process back into the model parameters so that the model has a stronger prior. What this means is that we should be able to use inference-time reasoning in a way that makes the model inherently more 'capable' or 'intelligent'. The model should be able to directly anticipate the outcome of its own reasoning process, without actually running it.
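
One way to picture this property (an illustrative sketch, not part of our pipeline; `answer` is a hypothetical helper that queries the model in the chosen mode and returns only its final answer) is to measure how often the model's direct answer already agrees with the answer it reaches after explicit reasoning:

```python
# Hedged illustration: a proxy for how much of the search has been
# internalized. `answer(prompt, thinking=...)` is a hypothetical helper.

def internalization_rate(prompts, answer):
    """Fraction of prompts where the direct (no-thinking) answer matches
    the answer produced after inference-time reasoning."""
    agree = sum(
        answer(p, thinking=False) == answer(p, thinking=True) for p in prompts
    )
    return agree / len(prompts)
```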

Although recent LLMs have made progress on reasoning, most improvements have been driven by scaling reasoning length without upgrading the model's intelligence prior. As a result, gains in LLM performance have come primarily from “searching more” for a solution with larger thinking budgets, without an improvement in the model's core "intuition" of which search trajectory is more suitable. Similarly, improvements in the “non-thinking” mode for LLMs have been driven by incorporating heuristics like backtracking, which function in much the same way.

Improving the model's intelligence is fundamentally a much harder problem to solve, particularly for unstructured reasoning paths like the ones used for LLM reasoning. To do so, we need technical breakthroughs in iterative policy improvement. Cogito v2 is our next step in this direction.

We believe that hill-climbing on iterative policy improvement will pave the way for significantly improved model capabilities beyond what added search (via reasoning tokens) alone can unlock.

Cogito v2 Release

Today, we are releasing four Cogito v2 models in preview.

  • 2 mid-sized models: 70B dense, 109B MoE
  • 2 large models: 405B dense, 671B MoE

Each model can either answer directly (like a standard LLM) or self-reflect before answering (like a reasoning model).
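
As a minimal usage sketch (the repository id is illustrative and the enable_thinking flag follows the chat-template convention used in the example later in this post; check the model cards for the exact interface), switching between the two modes might look like this:

```python
# Hypothetical usage sketch with Hugging Face transformers; the repo id and
# the enable_thinking flag are assumptions - see the model cards for details.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepcogito/cogito-v2-preview-llama-70B"  # illustrative repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "How many primes are below 50?"}]

# enable_thinking=True  -> the model self-reflects before answering
# enable_thinking=False -> the model answers directly, like a standard LLM
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```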

Our main goal is to distill the reasoning step back into the model's parameters, translating inference-time search into intuition.

The first 3 models (70B, 109B MoE, 405B) have been trained primarily to improve non-reasoning-mode intuition using reasoning. Technically, this means distilling reasoning back into the model's original parameters.1

As a result, these models are a significant upgrade over their individual Llama counterparts. The models have a better “intuition” of the right trajectory to take while decoding the response and deliver strong performance without relying on heuristics like backtracking. All models should feel closer to frontier intelligence at their sizes, with the largest 405B dense model close to the best frontier models.

The largest model (671B MoE) has been trained on a different recipe: here, the goal is to improve the prior policy (i.e., 'intuition') in both the non-reasoning and reasoning modes. Unlike most techniques used for training reasoning models, which provide a signal based on the final output alone (e.g., reinforcement learning from verifiable rewards), we also provide a signal for the thinking process itself during distillation. As a result, we disincentivize the model from “meandering more” to arrive at the answer, and instead push it to develop a stronger intuition for the right search trajectory during reasoning.
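
As a loose illustration of what a trajectory-level signal could look like (this is an assumption for exposition only; the post does not specify the actual objective), one could score each thinking trace by answer correctness minus a penalty on how much of the thinking budget it consumes:

```python
# Illustrative toy score, not the actual Cogito training signal: reward
# correct answers while discounting long, meandering thinking traces.

def trajectory_score(answer_correct: bool,
                     num_thinking_tokens: int,
                     budget: int = 4096,
                     length_weight: float = 0.3) -> float:
    """Correctness reward minus a penalty proportional to budget used."""
    correctness = 1.0 if answer_correct else 0.0
    length_penalty = length_weight * min(num_thinking_tokens / budget, 1.0)
    return correctness - length_penalty
```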

The 671B MoE preview model performs on par with the latest DeepSeek reasoning and non-reasoning models, by improving its intuition rather than relying on more searching:

  • In reasoning mode, the 671B MoE outperforms DeepSeek R1, and matches the latest DeepSeek R1 0528.
    • It does so by using 60% shorter reasoning chains than DeepSeek R1 0528.
  • In non-reasoning mode, it outperforms DeepSeek v3, and matches the performance of the latest DeepSeek v3 0324.

In our testing, Cogito 671B MoE is amongst the strongest models in the world. It performs roughly on par with the latest open-source models from DeepSeek and Qwen, but falls short of closed models like o3 and Claude 4 Opus (albeit not by much).

Note that these models have so far been trained primarily to establish a scalable training recipe and serve as a strong proof of concept for superintelligence. They have not yet been scaled up with the amount of compute usually devoted to model post-training.

(As an example, we have spent <$3.5M training all 8 Cogito models - from 3B to 671B. This includes all costs, from synthetic and human data generation to running over a thousand training experiments.)

We plan to scale up our recipe next, and release significantly stronger variants of these models. As before, all models will be open source.

Evaluation Details

We include evaluations on a few standard benchmarks below.

(Note - while these benchmarks provide a useful signal, they do at times differ from our internal evaluations.

For instance, in our internal evaluations, the Cogito models consistently deliver strong results relative to most open models. We are confident that the models will hold up in such real-world evaluations and deliver strong results in practice.

Secondly, the newest closed models such as o3 and Claude 4 Opus are stronger than the benchmarks here alone may reflect.)

70B Dense

Performance of Cogito 70B

109B MoE

Performance of Cogito 109B MoE

Thinking with images

An interesting consequence of our training pipeline is that although the models are trained on text inputs and outputs, they learn to reason over the visual domain through pure transfer learning, since the base model is multimodal.

Here is an example where the model is given two images and enable_thinking is turned on:

Model Input

Compare these two images. What are the similarities and differences?

Thinking

<think>
Okay, so I need to compare these two images. Let me start by looking at them carefully.

The first image is of a mallard duck swimming in water. The duck has a green head, a yellow bill, and a brown and white body. The water around it is blue with some ripples. The duck's reflection is visible in the water, which adds a nice symmetry to the image.

The second image is of a male lion sitting in tall, dry grass. The lion has a large, fluffy mane that's a mix of brown and black, and he's looking directly at the camera. The background is a field of golden grass that stretches out.

Now, thinking about similarities. Both images are of animals in their natural habitats. The duck is in water, which is its natural environment, and the lion is in a savannah or grassland area. Both animals are male, as indicated by their features—the duck's green head and the lion's mane. Both images have a strong focus on the animal, with the backgrounds somewhat blurred or serving as a complementary setting rather than the main focus.

As for differences, the environments are completely different. One is aquatic, the other terrestrial. The duck is small and the lion is much larger. The duck's colors are more vibrant and varied (green, yellow, brown, white), while the lion's colors are more earthy tones (browns, blacks, golds). The duck is in motion, swimming, while the lion is stationary, sitting. The lion's expression is more intense and regal, whereas the duck looks calm and serene. Also, the duck's reflection adds a unique element not present in the lion's image.

I should also consider the mood of each image. The duck's image feels peaceful and tranquil, with the calm water and the duck's relaxed posture. The lion's image conveys strength and power, with the majestic mane and the direct gaze. The lighting in both images is different too—bright sunlight on the lion, possibly softer light on the duck.

Another point is the composition. The duck is centered in the frame with its reflection creating symmetry. The lion is also centered but in a more open space with the grass surrounding him. The textures are different as well—the water's ripples versus the tall grass.

I think that covers the main points. Let me make sure I didn't miss anything. Maybe the purpose of the images? The duck might be for a nature documentary or a wildlife photography showcase, while the lion could be for a similar purpose but emphasizing the majesty of big cats. Both are examples of wildlife photography highlighting the beauty and characteristics of their subjects.
</think>

Response

Let me compare these two wildlife photographs:

Similarities:

  • Both feature male animals in their natural habitats
  • Both are high-quality wildlife photographs with the animals as the main focus
  • Both subjects are captured in natural lighting conditions
  • Both animals are shown in a calm, serene state
  • Both images use the background to complement the subject without overwhelming it

Differences:

  • Environment: The duck is in an aquatic setting with blue water and ripples, while the lion is in a terrestrial setting with tall, golden grass
  • Size and Scale: The duck is small and compact, while the lion is large and majestic
  • Color Palette: The duck has vibrant colors (green head, yellow bill, brown and white body), while the lion has earthy tones (browns, blacks, and golds)
  • Movement: The duck is captured in motion (swimming), while the lion is stationary
  • Composition: The duck's image includes a reflection in the water, creating symmetry, while the lion is set against an open grassland backdrop
  • Mood: The duck's image feels peaceful and tranquil, while the lion's image conveys strength and power
  • Texture: The water's ripples contrast with the tall, dry grass surrounding the lion

Each image effectively captures the unique characteristics and beauty of its subject, showcasing the diversity of wildlife photography.

Although we have not explicitly trained on images (in fact, there are no image examples in our training pipeline), these capabilities emerge natively, without relying on any multimodal data. Such emergent properties are interesting from a research perspective.

We expect this to be a good way to bootstrap training data for teaching models to reason over images, and a useful starting point for reinforcement learning on reasoning over images.

(Note that we have not evaluated this capability over visual benchmarks.)

405B Dense

Performance of Cogito 405B

671B MoE

Performance of Cogito 671B MoE in Non-Reasoning

Performance of Cogito 671B MoE in Reasoning

Acknowledgements

We are deeply grateful to our investors at Benchmark, especially Eric Vishria, who led our seed round, and to Aditya Agarwal and South Park Commons for their continued support, backing us early with conviction in our vision to build open-source general superintelligence.

This work was also supported by multiple excellent teams and projects that we'd like to thank: Llama and DeepSeek teams, Hugging Face, Anthromind, RunPod, Together AI, Baseten, Ollama, Nebius, LM Studio and Unsloth.

Join Us

We believe that the most transformative AI research comes from small, committed teams working together towards a shared goal. We are hiring people who like working at the intersection of research and engineering and have an interest in building large-scale AI systems.

If this resonates with you, please apply here.

1 This requires a different approach than regular distillation from a teacher model, since, as one might expect, distilling without the chains of thought is not useful.