Can you tell the difference between Text-to-Speech and real voices?

20 August 2021

Text-to-speech versus real voices?

The adoption of virtual assistants (Google Home, Amazon Alexa, Siri, Cortana) has been staggering over the last two years. Alongside the AI used to understand the questions asked of them, there has been a vast improvement in the text-to-speech (TTS) used to deliver the answers. TTS has moved from being ‘intelligible but robotic’ to sounding human. But is that good enough, or is there a higher level of humanness that your customers are looking for?

PromptVoice has been a leading provider of voice artist recordings for 25 years, and we’ve long been monitoring TTS. Our feelings towards it have moved from laughter at obvious mispronunciations, to keeping a keen eye on it, to partnering with one of the world’s leading TTS providers, to bringing TTS development in-house and embedding it in PromptVoice Portal. It’s good now, very good!

Does TTS signal the end for voice artists?

No way! Thanks to advanced integration and Speech Synthesis Markup Language (SSML), our best-of-breed TTS characters are optimised for telephony and now sound very human, but most callers can still tell them apart from real voices. The voices of successful professional voice artists and radio presenters stand out massively from those of non-professionals. They have an indefinable quality to their voices that draws the listener’s attention no matter how dull the topic they are talking about.

We are approached by 10-20 people each week who say they are voice artists and want to work with us, but they are not. They are just people with a microphone who think that because they can speak, they can be a voice artist, but they just don’t have the ‘je ne sais quoi’.

We worked with our university partner to identify the essence of what makes a successful voice artist stand out - and failed. They speak 10-30% faster than the rest of us without losing clarity, but there is something else too. It became apparent that the ‘something else’ varied between each voice artist in the same way that handsome men might often be tall and dark, but they still look very different in all manner of ways!

Many end users get overly hung up on choosing a voice artist, without understanding that the way the voice is ‘produced’ is far more important. As professionals, voice artists are accustomed to being produced. Meryl Streep was equally convincing as Margaret Thatcher in The Iron Lady, Karen Blixen in Out of Africa, and Donna in Mamma Mia!

Prompts and messages for phone systems are usually quite short, and it is impossible for a TTS engine to deduce, from the text alone, where the emphasis needs to be. Nuances of emphasis and tone can completely change a listener’s impression, which is why TTS will never threaten voice artists.

Thank you for calling. Thank you for calling. Thank you for calling.
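For readers who like to see the mechanics, here is a minimal Python sketch of how emphasis can be marked up by hand in SSML, one word at a time. The function name emphasise is our own illustration, not part of any TTS product, and the sketch assumes the target engine and voice honour the standard SSML emphasis element, which is not the case for every engine.

```python
from xml.sax.saxutils import escape


def emphasise(sentence: str, index: int, level: str = "strong") -> str:
    """Wrap the word at position `index` in an SSML <emphasis> element.

    `emphasise` is a hypothetical helper for illustration only. The
    <emphasis> element and its `level` values come from the W3C SSML
    specification; whether a given engine or voice honours them varies.
    """
    allowed_levels = {"strong", "moderate", "none", "reduced"}
    if level not in allowed_levels:
        raise ValueError(f"unsupported emphasis level: {level}")

    words = sentence.split()
    if not 0 <= index < len(words):
        raise IndexError("word index out of range")

    # Escape each word so characters such as '&' stay valid inside XML.
    parts = [escape(word) for word in words]
    parts[index] = f'<emphasis level="{level}">{parts[index]}</emphasis>'
    return "<speak>" + " ".join(parts) + "</speak>"


if __name__ == "__main__":
    # Same text, three different readings: stress "Thank", "you", "calling."
    for position in (0, 1, 3):
        print(emphasise("Thank you for calling.", position))
```

The text is identical in all three outputs; only the markup changes, and a person still has to decide where it goes.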

SSML (Speech Synthesis Markup Language) and some TTS tuning tools offer the ability to change the output to a certain extent, but they are quite crude and broadly restricted to pitch, tone and volume. Some engines support a change in style from the ‘information delivery’ tone you get from virtual assistants or satellite navigation systems to a newscaster, chat or empathetic style. To date, we’ve not found a promotional style from any of the main TTS providers that delivers the style most often needed for in-queue messaging or IVR prompts.
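To give a feel for how coarse these controls are, here is a rough Python sketch that wraps a prompt in the standard SSML prosody element. The helper name with_prosody and the sample prompt are our own assumptions for illustration; the attribute names and preset values follow the W3C SSML specification, and individual engines may support only some of them.

```python
from xml.sax.saxutils import escape

# Preset labels defined by the W3C SSML specification for <prosody>.
PITCH = {"x-low", "low", "medium", "high", "x-high", "default"}
RATE = {"x-slow", "slow", "medium", "fast", "x-fast", "default"}
VOLUME = {"silent", "x-soft", "soft", "medium", "loud", "x-loud", "default"}


def with_prosody(text: str, pitch: str = "medium", rate: str = "medium",
                 volume: str = "medium") -> str:
    """Wrap `text` in an SSML <prosody> element (hypothetical helper)."""
    checks = (("pitch", pitch, PITCH), ("rate", rate, RATE),
              ("volume", volume, VOLUME))
    for name, value, allowed in checks:
        if value not in allowed:
            raise ValueError(f"unsupported {name} value: {value}")

    # The setting applies to the whole phrase: a blunt instrument next to
    # the word-by-word direction a producer gives a voice artist.
    return (f'<speak><prosody pitch="{pitch}" rate="{rate}" '
            f'volume="{volume}">{escape(text)}</prosody></speak>')


if __name__ == "__main__":
    print(with_prosody("Your call is important to us.",
                       pitch="high", rate="slow", volume="loud"))
```

Settings like these shift the overall delivery of a phrase, but they do not add the promotional style described above.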

These examples illustrate the points above for EAW Fixings, a fictitious company:

Raw text to speech using one of the best British text to speech characters. 
Human sounding, but some mispronunciations and rather flat read.

Edited text to speech using Speech Synthesis Markup Language (SSML) and studio editing.
Mispronunciations corrected and more emotion / expression added. 

On-brand real voice, studio produced.  
Choice of voice aligned with company audio brand guidelines.

What does TTS mean for the channel?

The major advantage of TTS is that it is instantaneous, so it is ideal for instant implementation and for the times when your customers need to adapt their in-queue and on-hold audio fast. But we still think an on-brand recording by a real human voice gives a significantly better caller experience, and should replace the TTS recording as soon as it's available, a process we've automated with PromptVoice Portal’s streaming option.

We are often asked, “Which TTS engine is best: Amazon Polly, Google, IBM or Microsoft?” The answer is that it's different for different languages and dialects. The quality of the TTS depends on both the TTS engine and the input recordings used to create the TTS character. The best male Irish TTS character may be produced by a different manufacturer from the best female Irish character. Ultimately there is no global ‘best’, even for a particular gender or language, as it depends so heavily on what persona your customers are looking for.

This is a fast-moving technology, and we will be blogging about it again soon, so please subscribe to stay up to date.

Follow us on LinkedIn and Twitter for more updates on when further resources become available.

