Insights
By now, I’m sure most people have seen Advanced Voice Mode in ChatGPT.
A world where you can have a seamless, real-time conversation with an AI that sounds just like a human used to be pure science fiction, but it looks like that world has just become our reality.
However, the real power of this technology will be unlocked when it moves beyond a general knowledge assistant on your phone and is customised and integrated into other real-world systems. This is where OpenAI's new Realtime API comes in. It lets developers stream audio to and from the model under their own system prompts, and supports capabilities like function calling.
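To make that concrete, here is a minimal sketch of the kind of JSON events a client sends over the Realtime API's WebSocket connection. Event and field names reflect the public API documentation at the time of writing and should be checked against the current reference; the `lookup_account_balance` tool is a hypothetical example of ours, not part of the API.

```python
import base64
import json

def build_session_update(instructions: str, tools: list) -> str:
    """Configure the session with a custom system prompt and callable tools
    via a 'session.update' event."""
    return json.dumps({
        "type": "session.update",
        "session": {
            "instructions": instructions,
            "voice": "alloy",
            "tools": tools,
        },
    })

def build_audio_append(pcm16_chunk: bytes) -> str:
    """Stream a chunk of audio to the model as base64 in an
    'input_audio_buffer.append' event."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm16_chunk).decode("ascii"),
    })

# Hypothetical function the model can call instead of inventing a figure.
lookup_balance_tool = {
    "type": "function",
    "name": "lookup_account_balance",
    "description": "Return the verified balance for an account.",
    "parameters": {
        "type": "object",
        "properties": {"account_id": {"type": "string"}},
        "required": ["account_id"],
    },
}

event = build_session_update(
    "You are a polite phone agent. Only quote figures returned by tools.",
    [lookup_balance_tool],
)
```

In a real integration these strings would be sent over the WebSocket session, with the model's audio and tool-call events streamed back the same way.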
At Tomoro AI, we've had early access to this API and have been putting it through its paces across various client scenarios. We're excited to share some of the standout features that genuinely blew us away, along with some challenges to watch out for if you're thinking of integrating this technology into your operations.
The model has exceeded our expectations, representing a significant leap forward from what was previously possible in AI voice interactions. With the right implementation approach and controls, it has the potential to revolutionise custom voice-based applications in many organisations. Despite some technical and human challenges to overcome, significant value is now clearly within reach.
To test this new capability, we built a test harness (Aurora) that lets users configure the model, converse with it, and provide feedback on each interaction.
This was stood up in a matter of hours, thanks in no small part to the contemporaneous launch of o1 preview and mini!
Screenshots of Tomoro test harness "Aurora" showing configuration options and featuring feedback collection.
While we're excited about the possibilities, it's important to be aware of some challenges that come with implementing this technology. Here's a rundown of what we found, and some high-level suggestions on how best to tackle them:
Substitution errors and misquotations
Challenge:
The model occasionally misquotes numbers or mispronounces terms, which can lead to confusion, a particularly critical risk in contexts like financial transactions.
Mitigation Strategy:
Optimise prompts to be shorter and clearer. Where possible, rely on external capabilities such as function calling and structured outputs rather than complex prompts. Limit the model's responses to verified data sources, and add moderation and validation checks before critical actions.
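One way to apply the last point is a validation gate: before a critical action proceeds, check that every figure the agent actually spoke matches a verified source. This is a minimal sketch under the assumption that the application holds a set of verified amounts (e.g. from a database) and has a text transcript of the agent's turn; the function names are our own illustration.

```python
import re

def extract_amounts(transcript: str) -> list:
    """Pull numeric amounts like '1,250.00' out of a spoken transcript."""
    return [float(m.replace(",", ""))
            for m in re.findall(r"\d[\d,]*\.?\d*", transcript)]

def quoted_amounts_are_verified(transcript: str, verified: set) -> bool:
    """Return False if the agent quoted any number not in the verified set,
    so the calling code can block the step and re-confirm."""
    return all(a in verified for a in extract_amounts(transcript))

# Usage: refuse to confirm the transfer if the model misquoted the amount.
ok = quoted_amounts_are_verified(
    "Your transfer of 1,250.00 pounds is ready to confirm.", {1250.00})
bad = quoted_amounts_are_verified(
    "Your transfer of 1,520.00 pounds is ready to confirm.", {1250.00})
```

A production version would handle currencies, spelled-out numbers and units, but the shape of the control is the same: never let the model's spoken figure be the source of truth.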
Audio quality and artifacts
Challenge:
Users sometimes experienced inconsistent audio quality, including artifacts such as unintended sound effects that affected call clarity.
Mitigation Strategy:
Consider post-processing audio output to smooth out artifacts, although this will increase latency. Custom voice tuning (if it becomes available in future) may also enhance clarity and consistency.
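As one illustration of light-touch post-processing, a short moving average over the output samples acts as a crude low-pass filter that can soften clicks. This is a sketch under the assumption that the model's output is 16-bit PCM; it trades some brightness and adds latency, and a real pipeline would use a proper filter from a DSP library.

```python
import array

def smooth_pcm16(raw: bytes, window: int = 4) -> bytes:
    """Apply a trailing moving average to 16-bit little-endian PCM samples."""
    samples = array.array("h")
    samples.frombytes(raw)
    out = array.array("h", samples)
    for i in range(len(samples)):
        lo = max(0, i - window + 1)
        # Average the current sample with up to `window - 1` predecessors.
        out[i] = sum(samples[lo:i + 1]) // (i + 1 - lo)
    return out.tobytes()
```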
Accuracy of information capture
Challenge:
Occasionally, the model struggled to capture key details such as names, affecting the reliability of information gathered from users.
Mitigation Strategy:
Double-check any key information captured with users. Depending on the context, this could be done via audio read-back or visual text output.
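The audio read-back variant can be as simple as echoing the captured detail, spelled out letter by letter, and only accepting it after an explicit "yes". A minimal sketch, with function names of our own invention:

```python
def spell_out(name: str) -> str:
    """Spell a captured name letter by letter for audio read-back."""
    return " - ".join(name.upper())

def confirmation_prompt(field: str, value: str) -> str:
    """Build the read-back question the agent should speak before
    committing the captured value."""
    return (f"I have your {field} as {value}, spelled {spell_out(value)}. "
            "Is that correct?")

prompt = confirmation_prompt("surname", "Keyes")
# "I have your surname as Keyes, spelled K - E - Y - E - S. Is that correct?"
```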
As AI models become more human-like in conversation, it's not just about the technology; it's also about how people feel using it. Will customers or employees be comfortable talking to an AI? How should the AI handle nuances like humour, empathy, or frustration?
It felt surprisingly good when the model complimented us on our negotiation skills during one interaction, which highlighted how this technology opens up interesting new ways to enhance user experience and build customer relationships.
Example transcript of part of a conversation with the voice agent
These are important questions because the etiquette for interacting with AI isn't fully established yet. Companies need to consider user experience and set clear expectations. It's also crucial to design AI interactions that respect privacy and build trust.
Integrating advanced AI models into your business isn't a plug-and-play situation. The challenges we've highlighted show that while the technology is powerful, it requires careful implementation. That's where experience makes a big difference.
We’re all on a journey learning how best to design systems that incorporate this powerful new paradigm into our interactions and workflows. The prizes are clear for those who achieve it successfully.
Thanks to our colleagues at OpenAI for the opportunity to collaborate on the alpha for this exciting new capability!
Tomoro works with the most ambitious business and engineering leaders to realise the AI-native future of their organisation. We deliver agent-based solutions which fit seamlessly into businesses' workforces, from design to build to scaled deployment.
Founded by experts with global experience in delivering applied AI solutions for tier 1 financial services, telecommunications and professional services firms, Tomoro’s mission is to help pioneer the reinvention of business through deeply embedded AI agents.
Powered by our world-class applied AI R&D team, working in close alliance with OpenAI, we are a team of proven leaders in turning generative AI into market-leading competitive advantage for our clients.
We’re looking for a small number of the most ambitious clients to work with in this phase, if you think your organisation could be the right fit please get in touch.