Reflecting on OpenAI DevDay 2025

Author: Douglas Adams & Rishabh Sagar

What a difference 12 months makes.

Last year we wrote about the emergence of multi-agent systems and Sam Altman’s promise of collaborative AI architectures. Attending OpenAI’s DevDay 2025 in San Francisco, we saw those promises begin to materialise into production-ready tools and platforms.

A Platform Coming of Age

The most striking shift from last year wasn't in the models themselves, but in everything around them. OpenAI's announcements showed a platform evolving into a complete ecosystem for building production-ready AI applications.

The headline releases included:

  • Apps in ChatGPT (Apps SDK)
  • AgentKit
  • Improved evaluations suite
  • GPT-5-Pro, Sora 2, and new mini voice and image models in the API
  • Codex SDK and general improvements

With these, OpenAI moves from ‘a great model with some add-ons’ to a unified platform supporting the full lifecycle of building with AI. Apps in ChatGPT brings distribution and user context to the foreground; AgentKit provides a convenient and powerful scaffold for building agentic systems; and the expanded evaluations offering closes the loop to help people push the most performance out of their builds.

By packaging the models alongside distribution, tooling, and evaluations, OpenAI meaningfully raises the practical and perceived value of the models themselves. Together, this supporting ecosystem makes OpenAI's models the obvious choice for more teams.

Realising the Promise of AI-Assisted Coding

A core focus of the day was Codex. Code generation has evolved from promising demos into a productive and dependable part of programmers’ daily workflows.

The trajectory has been striking. Following the release of GPT-5 earlier this year and its coding-specialised variant GPT-5-Codex, adoption has surged—OpenAI reported that token usage has increased 10x since August. According to internal benchmarks shared on stage, developers using Codex complete 70% more pull requests than those without it, and about 92% of OpenAI's developers use it regularly.

What makes GPT-5-Codex particularly effective is its ability to tackle complex problems. Unlike earlier iterations that would give up on challenging tasks, it can work persistently on difficult problems (while also completing simple tasks far more quickly). In one DevDay session, the Codex team shared an example where the tool worked continuously for 7 hours to correctly implement a complex feature.

This productivity gain is visible in OpenAI's own shipping velocity. AgentBuilder, announced at DevDay, was said to be built in just six weeks with heavy Codex assistance, a timeline that would have been hard to hit even a year ago. At Tomoro, we're seeing similar acceleration. Coding assistants are enabling us to deliver more complex client projects faster than ever before.

The spotlight on Codex also highlights the intensifying competition for the lion’s share of the code generation market. We expect that competition to keep delivering gains in developer productivity through 2025 and into 2026.

Evals, Evals, Evals

With more companies running successful AI pilots and then looking to bring these into production, evals come sharply into focus. In our projects, we treat evaluations as a first-class concern from day one, and we think that these platforms will make it easier for others to do the same.

OpenAI announced a significantly upgraded evaluations suite. The new capabilities include dataset creation and management, trace-level grading, automated prompt optimisation, and crucially, support for evaluating third-party models.

As builders, we welcome this release. A robust evaluation setup is foundational to creating AI solutions that actually work. Without it, you’re flying blind.
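To make that concrete, here is a minimal, hypothetical sketch of the kind of eval harness we mean. This is not OpenAI's evals API; the `EvalCase` structure, `toy_model` stub, and exact-match grading are all illustrative assumptions standing in for a real model call and richer graders.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One labelled example: a prompt and its expected answer."""
    prompt: str
    expected: str


def run_eval(model: Callable[[str], str], dataset: list[EvalCase]) -> float:
    """Run each case through the model and return the pass rate."""
    passed = 0
    for case in dataset:
        output = model(case.prompt)
        # Exact-match grading for simplicity; production suites layer on
        # model-based scoring, rubric grading, or trace-level checks.
        if output.strip() == case.expected:
            passed += 1
    return passed / len(dataset)


# Stub standing in for a real model API call (hypothetical).
def toy_model(prompt: str) -> str:
    return "4" if prompt == "What is 2 + 2?" else "unknown"


dataset = [
    EvalCase("What is 2 + 2?", "4"),
    EvalCase("Capital of France?", "Paris"),
]

print(f"pass rate: {run_eval(toy_model, dataset):.0%}")
```

Even this toy version captures the core loop: a versioned dataset, a grader, and a single score you can track across prompt and model changes, which is what lets you iterate deliberately instead of flying blind.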

We also enjoyed Greg Kamradt's talk on "Measuring Agents with Interactive Evaluations." As President of the ARC Prize Foundation, Kamradt discussed ARC-AGI 3's game-based evaluation approach, which measures AI systems through dynamic, interactive challenges rather than static benchmarks.

This raises an interesting question: how would we evaluate and train enterprise-focused agentic systems in a similar way, given that adaptability and contextual reasoning matter more than performance on a fixed test set?

Opening New Frontiers

Sora 2 represents a step change in video generation quality and steerability. Sora 2 produces notably more coherent, longer-form content than its predecessor, and offers users the ability to insert themselves or their desired characters into videos. With OpenAI bringing Sora 2 to their API, enterprises can now harness this powerful capability at scale, unlocking creative applications that were previously confined to demo environments.

Alongside Sora 2, OpenAI released new mini models for voice and image generation that dramatically lower the cost barrier for creative applications. We've already seen compelling use cases emerge. At Tomoro, we've been pushing the boundaries of voice AI with projects like Virgin Atlantic's AI voice concierge. Now, with more affordable voice and image models, we can bring this calibre of experience to many more applications and industries.

The Road Ahead

Looking back from DevDay 2025, so much has changed in the last 12 months.

But the core lesson remains. Altman’s 2024 advice still holds: build at the edge of what models can do today, because capability and cost curves move faster than you expect. This year, a line from the fireside chat with Sam Altman and Jony Ive captured the moment:

“The pace of AI growth is the great equaliser. Everyone is a novice again; we all need curiosity and a deep desire to learn and adapt.”

That mindset is becoming a competitive advantage. The teams who ship, measure, and adapt fastest will win. The frontier keeps moving. Research scales to production, innovation drives adoption, and ambitious organisations turn new capabilities into real business value.

And we’re excited to keep building there.