Microsoft AI on April 2 released three in-house foundation models: MAI-Transcribe-1 (speech-to-text across 25 languages, 2.5× faster than Azure Fast, at $0.36/hour), MAI-Voice-1 (text-to-speech with custom voice cloning from 10-second samples, generating 60 seconds of audio in under one second, at $22/1M characters), and MAI-Image-2 (text-to-image, ranked #3 on Arena.ai, with generation at least 2× faster than prior MAI versions). All three are in public preview on Microsoft Foundry and already power Copilot, Bing, PowerPoint, and Azure Speech. Together they form the first significant in-house multimodal stack from a team Microsoft assembled largely independently of its OpenAI partnership.
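To put the quoted prices in concrete terms, here is a minimal Python sketch of the arithmetic. It assumes billing is strictly linear, that $0.36/hour means per hour of audio processed, and that there are no tiers, minimums, or free allowances; none of that is confirmed by the announcement. The function names and the sample workload are hypothetical.

```python
# Preview prices as quoted in the summary above.
TRANSCRIBE_USD_PER_AUDIO_HOUR = 0.36   # MAI-Transcribe-1
VOICE_USD_PER_MILLION_CHARS = 22.0     # MAI-Voice-1


def transcribe_cost(audio_hours: float) -> float:
    """Estimated cost in USD, assuming linear per-audio-hour billing."""
    return audio_hours * TRANSCRIBE_USD_PER_AUDIO_HOUR


def voice_cost(characters: int) -> float:
    """Estimated cost in USD, assuming linear per-character billing."""
    return characters / 1_000_000 * VOICE_USD_PER_MILLION_CHARS


# Hypothetical workload: 100 hours of audio to transcribe,
# 500,000 characters of text to synthesize.
print(f"Transcription: ${transcribe_cost(100):.2f}")
print(f"Speech synthesis: ${voice_cost(500_000):.2f}")
```

Under those assumptions, 100 hours of transcription comes to about $36 and 500,000 characters of synthesized speech to about $11.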

Microsoft releases MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 via Foundry
