TL;DR
Microsoft released three in-house AI models today (MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2), available immediately through Microsoft Foundry. They undercut OpenAI and Google on price across speech, voice, and image generation. After investing $13 billion in OpenAI, Microsoft is now building its own competing AI stack.
What Microsoft Actually Released
Three models, three modalities, all built by Mustafa Suleyman’s superintelligence team:
MAI-Transcribe-1 — a speech-to-text model that Microsoft claims beats both OpenAI’s Whisper large-v3 and Google’s Gemini 3.1 Flash on accuracy. It averages 3.8% word error rate across 25 languages on the FLEURS benchmark, ranking first in 11 of those languages. It runs 2.5x faster than Microsoft’s previous Azure Fast transcription offering. Price: $0.36 per hour of transcribed audio.
MAI-Voice-1 — a voice generation model that produces 60 seconds of expressive audio in under one second on a single GPU. It can clone a voice from just a few seconds of sample audio. Price: $22 per million characters. That directly undercuts ElevenLabs and puts pressure on OpenAI’s voice API.
MAI-Image-2 — a text-to-image model that debuted at #3 on the Arena.ai leaderboard for image model families. Price: $5 per million input tokens, $33 per million output tokens.
All three are available now through Foundry and a new MAI Playground where you can test them interactively.
Why This Matters: The OpenAI Relationship Is Fraying
Let’s be direct about what’s happening here. Microsoft poured $13 billion into OpenAI. They got exclusive cloud hosting rights and access to every model OpenAI builds through 2032. It looked like the tech deal of the century.
Then things got complicated.
OpenAI started losing $14 billion a year. They raised $122 billion at an $852 billion valuation, bringing in Amazon and other cloud providers — Microsoft’s direct competitors. The ChatGPT creator started building partnerships with AWS. Microsoft investors grew visibly uncomfortable with the exposure.
Microsoft did what any company with 200,000 engineers would do: they started building their own stuff.
Last September, Microsoft renegotiated the OpenAI contract. The new terms freed them to pursue frontier AI independently, including AGI and superintelligence, while keeping license rights to everything OpenAI builds through 2032. Suleyman formed a dedicated superintelligence team. Six months later, here we are: three production models, competitive benchmarks, aggressive pricing.
Suleyman says nothing is changing with the OpenAI partnership. “We’ll be partners at least until 2032 and hopefully longer.” But the models tell a different story. You don’t build a competing speech-to-text model that explicitly benchmarks against Whisper if you’re not hedging your bets.
The Pricing Game
Microsoft’s pricing tells you everything about their intentions:
| Model | Microsoft Price | Competing With |
|---|---|---|
| MAI-Transcribe-1 | $0.36/hour | OpenAI Whisper API, Google Speech-to-Text |
| MAI-Voice-1 | $22/M characters | ElevenLabs, OpenAI Voice API |
| MAI-Image-2 | $5/M input, $33/M output tokens | DALL-E, Midjourney API |
These aren’t research previews. They’re production-ready, priced to win enterprise contracts, and available through the same Foundry platform where Microsoft already hosts OpenAI’s models. An Azure customer evaluating transcription costs can now compare Whisper and MAI-Transcribe side by side in the same console.
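That side-by-side evaluation is mostly arithmetic. Here's a minimal sketch, assuming OpenAI's long-standing $0.006-per-minute Whisper API rate (the MAI figure is Microsoft's quoted price from the announcement above):

```python
# Back-of-envelope transcription cost comparison.
# Assumes OpenAI's $0.006/min Whisper API rate; the MAI rate is
# Microsoft's quoted $0.36 per hour of transcribed audio.

WHISPER_PER_MIN = 0.006   # USD per audio minute (assumed Whisper API rate)
MAI_PER_HOUR = 0.36       # USD per audio hour (Microsoft's quoted price)

def whisper_cost(hours: float) -> float:
    """Whisper API cost for a given number of audio hours."""
    return hours * 60 * WHISPER_PER_MIN

def mai_cost(hours: float) -> float:
    """MAI-Transcribe-1 cost for a given number of audio hours."""
    return hours * MAI_PER_HOUR

for hours in (100, 1_000, 10_000):
    print(f"{hours:>6} hrs   Whisper ${whisper_cost(hours):>9,.2f}   "
          f"MAI ${mai_cost(hours):>9,.2f}")
```

At these assumed rates the two land at the same $0.36 per hour, which is why the FAQ below calls the transcription pricing "competitive" rather than cheaper: for speech-to-text, the decision hinges on accuracy and tooling, not raw price.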
That’s not partnership. That’s competition under the same roof.
What the Benchmarks Actually Show
MAI-Transcribe-1’s benchmarks deserve scrutiny. Microsoft claims it beats Whisper large-v3 on all 25 tested languages and Gemini 3.1 Flash on 11 of 14 remaining languages. The 3.8% average WER on FLEURS is genuinely strong.
But there are caveats. FLEURS is a read-speech benchmark — clean audio, scripted text. Real-world transcription involves crosstalk, background noise, accents, and domain-specific jargon. Whisper’s dominance in the open-source community comes partly from its reliability in messy conditions, not just clean benchmarks.
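For context, word error rate is just word-level edit distance (substitutions, insertions, and deletions) divided by the number of reference words. A minimal sketch of the metric behind that 3.8% figure:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over word tokens.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word in a ten-word reference -> 10% WER.
ref = "the quick brown fox jumps over the lazy dog today"
hyp = "the quick brown fox jumped over the lazy dog today"
print(f"{wer(ref, hyp):.1%}")   # 10.0%
```

A 3.8% WER means roughly one errored word per 26, averaged over read, scripted FLEURS audio; real-world rates on noisy or jargon-heavy audio will generally be higher for any model.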
MAI-Voice-1’s 60x real-time speed claim is impressive if it holds up in production. Generating a minute of audio in under a second makes real-time voice applications viable: customer service bots, accessibility tools, live translation. The voice cloning from seconds of audio puts it in ElevenLabs territory.
MAI-Image-2 landing at #3 on Arena.ai is solid but not dominant. It trails behind models from companies whose entire identity is image generation. For Microsoft, though, it doesn’t need to be best-in-class. It needs to be good enough and cheaper.
The Bigger Picture: AI Self-Sufficiency
These three models aren’t the endgame. Microsoft’s internal target, according to reporting from CTOL Digital Solutions, is to have frontier-class language models by 2027. The MAI series started with MAI-1 in 2024 (a 500-billion parameter model that never shipped publicly) and has been building toward a full-stack alternative to OpenAI’s offerings.
The math is simple. Microsoft spends roughly $120 billion on AI infrastructure. They can’t keep sending a significant chunk of their AI revenue back to OpenAI as licensing fees while also funding their own research. Every model they build in-house improves their margins on Azure AI services.
This isn’t unique to Microsoft. Google has Gemini and still partners with Anthropic. Amazon has Nova and still hosts Claude on Bedrock. Big cloud providers all do the same thing: partner for coverage, build for margins.
But Microsoft’s situation is different because the scale of the OpenAI bet was unprecedented. $13 billion. Exclusive hosting. A relationship where OpenAI’s models were basically Microsoft’s product. Now Microsoft is building the escape hatch while insisting the door is still open.
OpenAI has been diversifying too, with the AWS partnership, the $122 billion raise that brought in non-Microsoft investors, and the reported pivot toward a super-app strategy. Both companies are slowly untangling from each other while maintaining the legal fiction that everything’s fine. The contract says 2032. The models say “we’ll see.”
What This Means for Developers
Azure developers now have more options at lower prices. That’s unambiguously good. Competition between MAI models and OpenAI models on the same platform means pricing pressure and better tooling for everyone.
For teams building on OpenAI’s API directly, this doesn’t change much today. But it signals where things are heading. Microsoft controlling both the platform and competing models creates an awkward dynamic. Azure could eventually steer developers toward MAI models through pricing incentives, default selections, or tighter integration. Watch which models Azure highlights in documentation and quickstart guides over the next few months. That’ll tell you more than any press release.
Startups building voice or image products should take note: the pricing floor just dropped. MAI-Voice-1 at $22 per million characters and MAI-Image-2 at $5 per million input tokens set new baselines for what enterprise multimodal AI should cost. If you were budgeting around ElevenLabs or OpenAI voice pricing, it’s worth running the numbers again.
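Running the numbers is straightforward. Here's a back-of-envelope sketch at the quoted MAI rates; the monthly usage volumes are illustrative assumptions, not figures from the announcement:

```python
# Rough monthly budget check at the quoted MAI prices. The usage
# volumes below are illustrative assumptions, not from the article.

VOICE_PER_M_CHARS = 22.0   # MAI-Voice-1, USD per million characters
IMAGE_PER_M_IN = 5.0       # MAI-Image-2, USD per million input tokens
IMAGE_PER_M_OUT = 33.0     # MAI-Image-2, USD per million output tokens

def voice_cost(chars: int) -> float:
    """MAI-Voice-1 cost for a given number of synthesized characters."""
    return chars / 1_000_000 * VOICE_PER_M_CHARS

def image_cost(in_tokens: int, out_tokens: int) -> float:
    """MAI-Image-2 cost for given input and output token counts."""
    return (in_tokens / 1_000_000 * IMAGE_PER_M_IN
            + out_tokens / 1_000_000 * IMAGE_PER_M_OUT)

# Hypothetical month: 50M characters of TTS, 2M in / 10M out image tokens.
print(f"voice: ${voice_cost(50_000_000):,.2f}")              # $1,100.00
print(f"image: ${image_cost(2_000_000, 10_000_000):,.2f}")   # $340.00
```

Swap in your own volumes and your current vendor's rates; at high TTS volume, even a few dollars per million characters of difference compounds quickly.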
There’s also a practical consideration for teams that have standardized on Whisper for transcription. MAI-Transcribe-1 is a proprietary API with no open-source version you can self-host. Whisper’s open weights mean you can run it on your own hardware, fine-tune it for your domain, and avoid vendor lock-in entirely. The benchmark numbers favor MAI, but the deployment flexibility favors Whisper. Pick based on what matters more for your use case.
FAQ
Are Microsoft’s MAI models better than OpenAI’s?
On transcription, Microsoft claims MAI-Transcribe-1 beats Whisper large-v3 on all 25 benchmarked languages. For voice and image generation, they’re competitive but not category-leading. The real selling point is price and Azure integration.
Is Microsoft breaking up with OpenAI?
Not officially. The partnership runs through 2032 and Microsoft still hosts and resells OpenAI’s models. But you don’t benchmark your new speech model against your partner’s Whisper and price it to compete head-to-head unless you’re planning for a future without them.
Can I use MAI models today?
Yes. All three are available through Microsoft Foundry and the MAI Playground. You’ll need an Azure account.
How do MAI model prices compare to alternatives?
MAI-Transcribe-1 costs $0.36/hour (competitive with Whisper API). MAI-Voice-1 runs $22/million characters (below ElevenLabs pricing). MAI-Image-2 is $5/million input tokens and $33/million output tokens.
Will Microsoft replace OpenAI models on Azure?
Not anytime soon. Microsoft’s stated strategy is to offer both MAI and OpenAI models through Foundry, letting customers choose. But the pricing incentives will increasingly favor MAI over time — that’s the whole point of building them.
Bottom Line
Microsoft spent $13 billion to make OpenAI’s models the backbone of Azure AI. Now they’re spending billions more to build replacements. Both things can be true: the OpenAI partnership still has value, and Microsoft is actively reducing its dependence on it.
The three MAI models launched today aren’t frontier language models. They cover speech, voice, and image. The language models are coming, reportedly by 2027. When they arrive, the Microsoft-OpenAI relationship will face its real test.
Microsoft is done being just a distributor. They want to be a builder. And they’re pricing their models to make sure you notice.
