The gap nobody was filling
When Dylan Fox was working on a machine learning team at Cisco, his team needed speech recognition. They tried to license it from the established players, including Nuance and others who had dominated the space for years. The experience was, in his words, unexpectedly painful. Documentation was thin. Support was slow. The underlying technology had fallen behind what academic research had made possible.
Google had released its first public speech-to-text API around the same time. The technology was actually quite good, but there was no meaningful support structure around it. No relationship. No feedback loop. A developer couldn't use it as a genuine differentiator in a product because the dependency was too fragile.
Fox saw a specific pattern emerging: deep learning research was rapidly advancing the state of what was possible in speech recognition, but none of that progress was reaching developers in a usable form. The companies with the most capable models had no incentive to make them easy to use. The companies with developer-friendly products were running on older technology.
The Twilio model applied to audio
The insight that became AssemblyAI was straightforward in retrospect: take the Twilio approach (complex infrastructure made accessible through a clean API and a genuine commitment to developer experience) and apply it to the latest deep learning research in speech recognition.
Fox left Cisco in 2017 to build it. He applied to Y Combinator more as a learning exercise than with any real expectation of acceptance, submitted about a month after the deadline, and was accepted anyway. From zero to funded in a week. The thesis was sound: developers needed this, no one was building it right, and the moment was exactly right.
What AssemblyAI built was not just a transcription API. It was a platform where developers could get audio intelligence (transcription, entity recognition, sentiment analysis, topic detection, summarization) with a single API parameter and the confidence that a team was on the other end who cared whether the integration worked.
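The "single API parameter" model can be sketched as a transcription request where each intelligence feature is toggled by one boolean flag in the job payload. This is a minimal illustration; the parameter names (`sentiment_analysis`, `entity_detection`, `iab_categories`, `summarization`) and the endpoint shape are assumptions modeled on AssemblyAI-style APIs, not confirmed by the text, so check current documentation before relying on them.

```python
# Sketch: enabling audio-intelligence features on one transcription
# request via boolean flags. Feature names are illustrative assumptions.

def build_transcript_request(audio_url: str, **features: bool) -> dict:
    """Assemble the JSON payload for a transcription job, folding in
    any audio-intelligence feature flags the caller opts into."""
    payload = {"audio_url": audio_url}
    payload.update({name: bool(on) for name, on in features.items()})
    return payload

payload = build_transcript_request(
    "https://example.com/call-recording.mp3",
    sentiment_analysis=True,   # per-sentence sentiment
    entity_detection=True,     # names, places, organizations
    iab_categories=True,       # topic detection
    summarization=True,        # auto-generated summary
)
# The payload would then be POSTed to the transcript endpoint with an
# API-key header, e.g. via requests.post(url, json=payload, headers=...).
```

The point of the design is that adding a capability costs the developer one line in the request body rather than a new integration.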
The real problem with relying on Big Tech infrastructure
Fox identified five specific failure modes that come with building on infrastructure that a major tech company treats as a side project: you can't make it a genuine differentiator; the services aren't adequately supported; they can be deprecated or changed without notice; they're updated infrequently; and you have no real relationship with the people building them.
He made the point directly: Google Assistant and Alexa exist primarily to harvest user data. That's the product. Making the voice interface actually understand people better isn't the goal. Capturing what people ask for is. The incentives are misaligned with what a developer building a serious product actually needs.
This is a pattern that extends well beyond audio AI. Any time an organization builds a critical workflow on top of infrastructure a vendor maintains as an afterthought, they're taking on risk that doesn't show up in the procurement process. The vendor doesn't lose sleep over your dependency. You do.
How you actually compete with incumbents
At the time of our conversation, AssemblyAI was training models on somewhere between 100 and 150 GPUs, with training cycles running six weeks or more. That's significant investment, but it's targeted investment, focused entirely on a specific problem domain rather than a general-purpose system.
That focus is the mechanism by which a smaller organization competes with incumbents who have more resources. You don't try to out-general them. You go deeper on a specific problem, build a better product within that scope, and provide the relationship and support quality that large organizations structurally can't replicate at scale.
AssemblyAI's transcription models eventually reached a level where the team was confident they were among the largest and most accurate in the industry, not because they had Google's compute budget, but because every engineering decision was in service of the same specific outcome.
- Left Cisco in 2017 to start AssemblyAI; applied to Y Combinator a month after the deadline and was accepted.
- Built on PyTorch with end-to-end deep learning models; training cycles up to six weeks on 100–150 dedicated GPUs.
- Identified five structural failure modes of building on Big Tech AI infrastructure as a core competitive argument.
- Developer experience treated as a first-class product concern, not an afterthought to the core technology.
- Audio intelligence beyond transcription: entity recognition, sentiment analysis, topic detection, and summarization, all via a single API parameter.

