Mar 15, 2021 9 min read Why Audio?

5 Questions About Text-to-Speech Each Publisher Should Ask

In our talks with publishers, we at BotTalk often face some common misconceptions about text-to-speech.

It seems that many publishers don't realize how big of a leap text-to-speech technology has made in recent years.

Moreover, publishers often misunderstand text-to-speech as an accessibility add-on - important for some, but not required for most users.

This article focuses on the top 5 questions about text-to-speech each publisher should have an answer to. We argue that text-to-speech is an essential engagement tool. Audio demand is growing so fast that users expect to have an audio version of every written article.

How natural does Text-to-Speech sound in 2021?
Can Text-to-Speech turn fly-bys into loyal users?
How high is the engagement of Text-to-Speech articles?
How big is the demand for audio content?
What is Text-to-Speech's place in a typical news reader's customer journey?

How natural does Text-to-Speech sound in 2021?

Text-to-speech quality is evolving rapidly. According to BotTalk quality assurance metrics, it rose from 3.5 to 4.5 stars in the past year.

Text-to-speech technology received its first wide public attention after an iconic keynote by Steve Jobs back in 1984. Showing the original Macintosh, Jobs let the machine introduce itself and its creator.

Synthesized voice has developed dramatically since then, hitting its peak with the introduction of smart assistants such as Amazon Alexa and Google Assistant. Now people and machines can truly communicate with each other.

But can they, though?

Are you generally satisfied with how helpful your Siri or Alexa is? Can you entrust it with all the things you used to do on the computer yourself? Have you ever tried something more complicated than just setting the alarm or playing a playlist? Ordering tickets? Ordering pizza to go? Asking how to navigate to a new location - and when you need to leave so that you don't get stuck in traffic?

Siri, Alexa, and Google Assistant promised those use cases but never delivered. The technology is not there yet. Going from scripted responses to genuine assistance is the leap that's still ahead of us.

And what about the quality of the synthesized voice? Do you regularly let your Mac read lengthy texts to you? Remember - Macintosh has had this feature for the last 37 years! What about news articles? Books?

Why not?

Why is it that you spend hours listening to the radio, podcasts, and Audible, but when it comes to synthesized voice, you tend to avoid lengthy interactions with it? You let Google Maps navigate you - but never listen to a whole audiobook synthesized using "machine voice."

It's about the speech quality.

That synthesized voice sounds like... well... like a robot. And we humans generally don't like it too much.

Well, forget everything you thought you knew about text-to-speech. And welcome to 2021.

We're about to blow your mind with how great a synthesized voice can sound.

Here is a quick sample from The New York Times - read out loud by Siri:

And now the same sample read using BotTalk text-to-speech technology:

The difference is even more striking in non-English texts. Historically, text-to-speech synthesis was best in English since most linguistic data models are based on this language.

We took a short German text and converted it to speech, using the major text-to-speech platforms - our direct competitors.

Below are the visual representations of errors alongside the audio samples that were generated from this short piece of text.

Errors made by the leading text-to-speech platforms, highlighted in the German text.

Natural Reader:

Linguatec:

Trinity Audio:

Now listen to how that exact same text sounds when generated using BotTalk text-to-speech technology:

🤖 BotTalk TTS:

You can find the full competition comparison in this LinkedIn post.

Our take-away from this section is simple. You have probably interacted with text-to-speech in the past, either using Siri or Alexa or trying the built-in TTS tools on your Mac. Well, forget it.

Good text-to-speech sounds a lot better than you probably think.

Can Text-to-Speech turn fly-bys into loyal users?

10% of the fly-bys click on the "listen to this article" button and convert into loyal users.

What news do you read? How often do you switch apps? How thoroughly do you read an article?

For most of us - the answer is simple: "Well, I jump around. But I also have a couple of news sites I regularly go to for my daily portion of information."

This answer illustrates the typical behavior of the content consumer. And it has a name: the fly-by. In this age of information overload, we're triggered by social media to quickly click the catchy title and leave (bounce) in a couple of seconds.

Converting the fly-bys into loyal users - and then into subscribers - is one of the significant challenges publishers face today.

Main revenue focus in 2019 - publishing industry

According to the 2019 report from the Reuters Institute for the Study of Journalism, publishers' main revenue focus lies in growing their subscription base. Subscription-based revenue leads with 52%, followed by display advertising with 27%.

More and more publishers realize that audience development is the key to their subscription business. Digiday data shows that 70% of publishers either already personalize their content for readers or plan to do so shortly.

"Right content, right user, right time" is the motto for publishers that try to win their readers' loyalty.

The user experience plays a big part. It's literally about what the users see on the page: featured content boxes, online search, non-intrusive ads. Those elements will attract and retain the user to some extent.

But only when the time is right.

As we all know, we never have enough time. Between you and that interesting article you've meant to read for a while - there's always that email, push notification, Slack message.

BotTalk player integrated in every article on the Augsburger Allgemeine webpage.

BotTalk's Web Player solves that problem for the reader by offering a great user experience. You can now listen to the article - and carry on with that email, check your push notifications, or continue browsing the web. All this - without leaving (bouncing) the page.

According to our data, 10% of the readers find this UX improvement attractive! The text-to-speech feature in the website interface catches their attention: they click that play button and become listeners.

Introducing text-to-speech to the articles as a user experience improvement gives the publishers a way to reach the right user with the right content at the right time.

And yes, it does translate into great engagement numbers. We'll talk about them in the next section.

How high is the engagement of Text-to-Speech articles?

75% of users hear more than half of an article. They also stay on the page for 2.5 minutes.

Publishers used to measure their success in page views and unique users. Nowadays, an increasing number of publishers are shifting their metrics towards the time readers spend on individual articles.

The focus on user engagement as the primary tool to grow traffic, subscription, and ad revenue leads to incredible results. According to industry data, The Post and Courier of Charleston, South Carolina, went from concentrating on pageviews to measuring dwell time and engaged minutes, which led to 250% growth in their digital subscriptions between 2017 and 2019.

Industry leaders are establishing engagement as their "North Star metric". The Wall Street Journal calls this metric "active days" - the number of days a reader engages with content. The Financial Times uses an engagement formula based on the cumulative score of volume, frequency, and recency.

Many publishers don't realize that text-to-speech is a great engagement tool.

BotTalk measured that over 75% of users who clicked on the "hear this article" button hear more than half of an article. An average news article is around 2300 characters long. That translates to 2 minutes and 30 seconds when converted into audio.
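The character-to-audio conversion can be turned into a rough estimator for articles of other lengths. The snippet below is a minimal sketch, assuming a constant speaking rate equal to the one implied by the figures in this article (2300 characters in 150 seconds, roughly 15 characters per second). The function names and the constant are illustrative only and are not part of any BotTalk API; real synthesized speech varies with voice, language, and punctuation.

```python
# Rate implied by the article's figures: 2300 characters ~ 2 min 30 s.
# This is an assumption for illustration, not a measured BotTalk constant.
CHARS_PER_SECOND = 2300 / 150


def estimate_audio_seconds(char_count: int,
                           chars_per_second: float = CHARS_PER_SECOND) -> float:
    """Estimate audio duration in seconds for a text of `char_count` characters."""
    if char_count < 0:
        raise ValueError("char_count must be non-negative")
    if chars_per_second <= 0:
        raise ValueError("chars_per_second must be positive")
    return char_count / chars_per_second


def format_duration(seconds: float) -> str:
    """Format a duration as M:SS, rounded to the nearest second."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}:{secs:02d}"


if __name__ == "__main__":
    # Average article from the text, and a hypothetical article twice as long.
    print(format_duration(estimate_audio_seconds(2300)))
    print(format_duration(estimate_audio_seconds(4600)))
```

Under this assumption, the estimate scales linearly with length, so an article twice as long takes twice as long to listen to.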

Let's go through the numbers again. When presented with the opportunity to listen to the article, 10% of the readers decide to become listeners. And 75% of them stay on a single article page for 2.5 minutes.
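To make these percentages concrete, the two rates can be applied to a sample audience. This is a minimal sketch: the 10% and 75% rates come from this article, while the audience of 1000 readers and the function name are hypothetical and chosen only for illustration.

```python
def listening_funnel(readers: int,
                     listen_rate: float = 0.10,
                     half_rate: float = 0.75) -> dict:
    """Apply the article's two rates to a number of readers.

    listen_rate: share of readers who click the play button (10% in the article).
    half_rate:   share of listeners who hear more than half (75% in the article).
    """
    listeners = readers * listen_rate
    engaged = listeners * half_rate
    return {
        "readers": readers,
        "listeners": round(listeners),
        "hear_more_than_half": round(engaged),
        # Engaged listeners as a fraction of all readers.
        "share_of_all_readers": listen_rate * half_rate,
    }


if __name__ == "__main__":
    # Hypothetical audience size, for illustration only.
    print(listening_funnel(1000))
```

With these rates, the engaged listeners make up 7.5% of all readers who were shown the player.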

But that's not all.

Even more impressive engagement numbers appear when publishers combine content personalization with text-to-speech.

BotTalk Playlist Player with a personalized selection of news to listen to on the go.

That's where BotTalk's Playlist Player shines. BotTalk automatically analyses the most engaging stories on the publisher's website and presents them in the form of a playlist. For instance - the top 5 articles of the day. Since the player is based on real-time usage data, these playlists are highly dynamic.

BotTalk Playlist Player allows the user to listen to the stories sequentially - prolonging the dwell time on the website even further.

How big is the demand for audio content?

Users spend over 40 minutes per day listening to podcasts. Audio is the number one activity on mobile.

In 2019, Americans spent more time on their mobile devices than watching television.

Audio became the number one activity on mobile in 2020, beating social media, gaming, and video consumption.

Podcasts play an integral role in the audio revolution. Today podcasts reach over 100 million Americans for over 40 minutes per day.

Demand for podcasts has been so huge in recent years that the leading music streaming service Spotify announced several acquisitions in the space - to grow its dominance as a podcasting provider.

Spotify CEO Daniel Ek said the company had 2.2 million podcasts on its platform at year-end, up from 1.9 million in the prior quarter. In an investor letter, Ek pointed out:

"increasing conviction in the causal relationship between the growth in podcast consumption driving higher LTV and retention among our user base."

The consumption of podcasts on Spotify doubled in 2020. The growth was so dramatic that it pushed Spotify to invest even further in collaborations with content creators. For instance, Michelle Obama's podcast was launched exclusively on Spotify earlier that year.

Podcasting is on the rise. Podcasts are great. But very costly to produce.

Text-to-speech gives publishers a unique opportunity to produce from 7 to 30 hours of audio from their daily articles. And as we've shown in the previous section, when the speech synthesis quality is high - the users are more than happy to listen to those articles.
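To put the 7 to 30 hours in perspective, the range can be converted into an approximate number of daily articles. This is a rough sketch, assuming every article has the 2.5-minute average audio length given earlier in this article; the function name is illustrative, and a real publisher's mix of short and long pieces would shift the result.

```python
# Average audio length per article, taken from the figure stated earlier
# in this article (about 2300 characters ~ 2.5 minutes). Treating every
# article as average-length is a simplifying assumption.
MINUTES_PER_ARTICLE = 2.5


def articles_for_audio_hours(hours: float,
                             minutes_per_article: float = MINUTES_PER_ARTICLE) -> int:
    """Return how many average-length articles add up to `hours` of audio."""
    if minutes_per_article <= 0:
        raise ValueError("minutes_per_article must be positive")
    return round(hours * 60 / minutes_per_article)


if __name__ == "__main__":
    for hours in (7, 30):
        print(hours, "hours ~", articles_for_audio_hours(hours), "articles")
```

Under this assumption, the range corresponds to roughly 170 to 720 articles published per day.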

What is Text-to-Speech's place in a typical news reader's customer journey?

Text-to-speech articles take a place in the reader's day that was previously taken by audiobooks, podcasts, and radio.

Nurturing the user's experience starts with understanding the reader's customer journey.

It is vital to analyze and perfect the critical experience points discussed earlier in the article: user loyalty, engagement numbers, and dwell time. Nevertheless, publishers need to understand that those are just (short) touchpoints with their content.

Here is the customer journey map BotTalk created to visualize a day in the life of a typical news reader.

Customer journey map of a typical news reader.

As you can see, there are two critical touchpoints with the publisher's written content. People usually check the news in the morning and on their lunch break.

But what happens during the day - in between? On their commute to work? In the gym? While jogging in the park? While cooking dinner?

Those are the long periods that users usually fill by listening to podcasts, audiobooks, or radio.

Text-to-speech offers publishers a perfect opportunity to provide their content in a context that was previously unavailable to them, entering the touchpoints that used to be occupied by radio, audiobooks, or podcasts.