Welcome to an exciting exploration of the world of AI with Dylan Fox, the innovative Founder of AssemblyAI. This tech maverick has created a vibrant space where audio intelligence isn’t just a buzzword, but the backbone of his thriving AI startup.
From his early days at Cisco to pioneering a developer-friendly platform built on the latest deep learning research, Fox’s journey is a strong example of the growth of AI and the very real benefits it can deliver.
Let’s delve into his inspiring narrative, showcasing the revolutionary impact of audio intelligence in the tech industry today!
I began my career with Cisco, where I joined a Machine Learning team and worked on a lot of NLP projects. We had a project that required speech recognition. But, even at Cisco, we were not going to build it ourselves. We attempted to obtain API access from several legacy companies, such as Nuance. However, it was unexpectedly painful.
Google had released its first public speech-to-text API at the time. And we wanted to play around with it, and it was actually rather excellent, but there was no support.
I’d become interested in deep learning research and noticed that there was a lot of work being done on speech recognition and how it was improving.
So, it was a combination of factors that inspired me to think, “What if you could build a Twilio-style company on the latest deep learning research, a brand-new speech recognition API that was much easier for developers to use, with a much better developer experience and better tech because it was built on the very latest research?” And it was from there that the idea grew.
In 2017, I left my position at Cisco, and I was doing freelance work to supplement my income while I was working on getting the company off the ground.
Even though I didn’t feel ready, I applied to Y Combinator because I thought it would be a good exercise. Despite applying about a month after the deadline, I got an interview and was accepted. I was honestly really surprised. From one week to the next, it went from zero to sixty. That made my head spin. But that’s how it all began.
Our customers can now do a lot more than merely transcribe. We can give them access to plenty of extremely cool use cases, applications, and features. They get audio intelligence that they can easily enable on top of the transcription with just a single API parameter.
So it all starts with those basic use cases. People quickly realized, however, that once all of these phone calls are transcribed, we can, for example, pull out the keywords in them or summarize them. Alternatively, if you’re doing sentiment analysis, we can do that too. Many of our customers ask us to transcribe content for them, such as their Zoom meetings.
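To make the “single API parameter” idea concrete, here is a minimal sketch of what such a request might look like against AssemblyAI’s transcript endpoint. The endpoint URL, parameter names (sentiment_analysis, entity_detection), and response keys reflect the public API as commonly documented, but treat them as illustrative assumptions rather than an authoritative reference.

```python
import time
import requests

API_KEY = "your-api-key"  # assumption: authenticate with an account API key
BASE_URL = "https://api.assemblyai.com/v2"
headers = {"authorization": API_KEY}

# Submit an audio file for transcription; extra "audio intelligence" features
# are switched on with single boolean parameters (names are illustrative).
payload = {
    "audio_url": "https://example.com/call-recording.mp3",
    "sentiment_analysis": True,   # sentiment for each spoken sentence
    "entity_detection": True,     # named entities mentioned in the call
}
response = requests.post(f"{BASE_URL}/transcript", json=payload, headers=headers)
transcript_id = response.json()["id"]

# Poll until the transcript (and the extra analyses) are ready.
while True:
    result = requests.get(f"{BASE_URL}/transcript/{transcript_id}", headers=headers).json()
    if result["status"] in ("completed", "error"):
        break
    time.sleep(3)

print(result.get("sentiment_analysis_results"))
print(result.get("entities"))
```

The point of the sketch is the shape of the workflow: the transcription request and the higher-level analyses ride on the same call, so adding a feature is a one-line change for the developer rather than a new integration.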
We focus on the developer experience. The legacy firms lose that sense of urgency and anxiety that smaller organizations have. Because they stop thinking about the need to continually upgrade their technology, the legacy companies have truly fallen behind.
Customers select us because we can offer more than what you can get from a major tech provider like Google or Amazon. It’s also about giving them the assurance that they will receive the best support and that we will have a team to assist them.
The problem with the legacy providers is that you can’t make their speech services a fundamental component of your solution. It’s not like everyone else building on AWS or GCP; we’re talking about services that aren’t adequately supported.
We spend a lot of money on Amazon Web Services. Even so, it seems like we get a new account manager every two months. They don’t know who we are or what we’re up to. As a result, there is no connection.
That is why people seem to prefer us over big tech. We make it more personal to you. As a result, it feels more like a collaboration and an extension of your product.
We use PyTorch to create all of our models and audio intelligence features. For speech recognition, we’re developing end-to-end deep learning models. Many of the other models we’ve built, such as entity recognition, topic detection, and automated punctuation restoration, are deep learning-based as well. Convolutional neural networks are still really powerful, and Transformers, attention-based models, are really powerful too. We train really big models on dedicated hardware with somewhere between 100 and 150 GPUs at this time. I believe our transcription models are currently the largest; training one on 32 GPUs takes around six weeks.
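For readers who want a concrete picture of the kind of architecture described here, below is a minimal PyTorch sketch: a convolutional front-end that downsamples the spectrogram, feeding a Transformer (self-attention) encoder that emits per-frame token logits, the common shape for end-to-end speech recognition models trained with a CTC-style loss. The class name, layer sizes, and dimensions are all illustrative assumptions, not AssemblyAI’s actual model.

```python
import torch
import torch.nn as nn

class ToySpeechModel(nn.Module):
    """Convolutional front-end + Transformer encoder, a common shape for
    end-to-end speech recognition models (illustrative, not production)."""

    def __init__(self, n_mels=80, d_model=256, n_heads=4, n_layers=6, vocab_size=32):
        super().__init__()
        # Convolutions extract local acoustic features and downsample in time (4x).
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
        )
        # Self-attention layers model long-range context across the utterance.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.out = nn.Linear(d_model, vocab_size)  # per-frame character logits

    def forward(self, mel):            # mel: (batch, n_mels, time)
        x = self.conv(mel)             # (batch, d_model, time / 4)
        x = x.transpose(1, 2)          # (batch, time / 4, d_model)
        x = self.encoder(x)
        return self.out(x)             # (batch, time / 4, vocab_size)

model = ToySpeechModel()
dummy_batch = torch.randn(8, 80, 400)  # 8 utterances of 80-dim log-mel features
logits = model(dummy_batch)
print(logits.shape)                    # torch.Size([8, 100, 32])
```

The same convolution-plus-attention pattern scales from this toy example up to the very large models the interview mentions; the difference is mostly model width, depth, data volume, and the number of GPUs used for training.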