High-Quality Audio Blog Posts With Google's Text-to-Speech

David Miranda

Mar 1, 2022 • 4 min read

In this post, I'll cover how to use Google Cloud's Text-to-Speech API to generate audio (MP3) versions of your blog posts.

I like having audio versions of my posts because they're a lot easier for people to consume. Someone who's out for a walk or doing a repetitive task can't read, but they can listen and still engage with the content.

You can use your generated audio:

To make a podcast.
To send to a friend.
To post to YouTube.

I'll cover the basics of generating the audio, as well as some issues and improvements you'll want to address to make your audio the best it can be.

Initial Setup

Here's the full tutorial, outlined by Google:

Quickstart: Create audio from text by using the command line.

👆 This covers the high-level steps, but doesn't go into detail about what you need to do to get started. For those early steps, you'll need to finish this guide first:

Quickstart: Before You Begin.

Check these off as you go through:

Create a new Google Cloud project
Enable billing for that project
Enable text-to-speech for that project
Create a service account (I made mine an "owner")
Generate keys for that service account
Download the credentials as a JSON file
Store the credentials JSON file somewhere on your computer
Set the GOOGLE_APPLICATION_CREDENTIALS environment variable, so the gcloud command line tool knows how to access your credentials

Last of all, install the gcloud command line tool if you haven't already, and connect your Google account by running the command gcloud init.
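Before moving on, it can be worth a quick sanity check that the environment variable really points at a service account key. Here's a minimal Python sketch of that check. The function name check_credentials is my own invention, and it only looks at two fields that a service account key file normally contains ("type" and "client_email"). It never contacts Google.

```python
import json
import os


def check_credentials(env=None):
    """Return the service account email from the key file, or raise ValueError."""
    # Allow a custom mapping for testing; default to the real environment.
    env = os.environ if env is None else env
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise ValueError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    if not os.path.isfile(path):
        raise ValueError(f"No file found at {path}")
    with open(path, encoding="utf-8") as f:
        key = json.load(f)
    # Service account key files identify themselves with this "type" value.
    if key.get("type") != "service_account":
        raise ValueError("File is not a service account key")
    return key.get("client_email", "")
```

Calling check_credentials() with no arguments reads your real environment. If it returns the email of the service account you created, the setup steps above probably worked.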

Now you're done with the initial steps! We're ready to dig in!

Generating Audio

Here's where the fun starts.

Continue following the steps from: Quickstart: Create audio from text by using the command line.

Create a request.json file with the example content Google provides.
Run the curl command to call their API.

The request.json file looks something like this:

{
  "input":{
    "text":"Hello world!"
  },
  "voice":{
    "languageCode":"en-us",
    "name":"en-US-Wavenet-J",
    "ssmlGender":"MALE"
  },
  "audioConfig":{
    "audioEncoding":"MP3"
  }
}
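If you'd rather send the same request from Python than from curl, here's a rough sketch using only the standard library. It assumes the v1 text:synthesize endpoint from Google's quickstart, and it assumes you supply an access token yourself (for example, the output of gcloud auth application-default print-access-token). The names build_request and synthesize are mine, not Google's.

```python
import json
import urllib.request

# Endpoint from Google's quickstart for the v1 API (an assumption if yours differs).
TTS_URL = "https://texttospeech.googleapis.com/v1/text:synthesize"


def build_request(body, access_token):
    """Build (but don't send) the POST request for a synthesize call."""
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        TTS_URL,
        data=data,
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )


def synthesize(body, access_token):
    """Send the request and return the parsed JSON response."""
    with urllib.request.urlopen(build_request(body, access_token)) as resp:
        return json.load(resp)
```

Here, body is the same structure as request.json, loaded into a Python dict. I split building and sending into two functions so the request can be inspected without making a billable API call.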

You should get back a JSON object with a base64 string in its audioContent field. I usually save this response to a file called audio-data.json.

You can decode the base64 string pretty easily into an MP3 — and voila! — then you'll have an MP3 you can use for your blog!
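Here's one way the decoding step could look in Python. It's a small sketch that assumes you saved the whole API response to a file and that the base64 string sits in the response's audioContent field. The function name save_mp3 is just something I made up.

```python
import base64
import json


def save_mp3(response_json_path, mp3_path):
    """Decode the audioContent field of a saved API response into an MP3 file."""
    with open(response_json_path, encoding="utf-8") as f:
        response = json.load(f)
    # The audio arrives as base64 text; decode it back into raw MP3 bytes.
    audio_bytes = base64.b64decode(response["audioContent"])
    with open(mp3_path, "wb") as f:
        f.write(audio_bytes)
    return len(audio_bytes)
```

For example, save_mp3("audio-data.json", "blog-post-1.mp3") would write the MP3 and return the number of bytes written.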

Running Into Trouble Using Custom Text

So far, we can:

Send some text to Google.
Get back a base64-encoded string.
Decode that string into an MP3.

⛔️ But when you try to use your own text, you'll run into a few issues:

The Text-to-Speech API won't convert text longer than 5,000 characters (strictly, 5,000 bytes, so accented letters and emoji count for more than one). 😱
You can't send text with double quotes in it (e.g. 👉"👈) unless you format your text correctly first.
You'll probably want to use a different voice than the one they provide in the example.

Fixing Problem #1: The 5,000 Character Limit

Fixing this is relatively easy: we'll just need to split our text into parts that are shorter than 5,000 characters, generate audio for each part, and then stitch the audio files back together.

So, select roughly the first 4,900 characters of your text (stopping at the end of a sentence, so the voice doesn't get cut off mid-thought) and put that into your request.json file.

Then, run the API command and you can get the audio for just that part.

Continue with the next 4,900 characters of your text to generate another MP3.

You can name them like this: blog-post-1.mp3, blog-post-2.mp3, blog-post-3.mp3
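If selecting chunks by hand gets tedious, the splitting can be scripted. Below is a minimal sketch, not a polished tool: split_text is a name I made up, the 4,900 default mirrors the number above, and I measure chunks in UTF-8 bytes as a conservative choice. It breaks only after sentence-ending punctuation and joins sentences with a single space, so line breaks between sentences are lost.

```python
import re


def split_text(text, limit=4900):
    """Split text into chunks of at most `limit` UTF-8 bytes, breaking after sentences."""
    # Split on whitespace that follows ., ! or ?; the punctuation stays with its sentence.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        if len(sentence.encode("utf-8")) > limit:
            raise ValueError("A single sentence is longer than the limit")
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate.encode("utf-8")) <= limit:
            current = candidate
        else:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each item in the returned list can go into its own request, in order, producing blog-post-1.mp3, blog-post-2.mp3, and so on.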

Then, install ffmpeg (available from ffmpeg.org). It's an amazing command-line tool that lets you convert, edit, and remix all kinds of media files, including MP3s, MP4s, GIFs, and more.

Run this command to stitch all your MP3s into one (make sure you replace the MP3 names with however you named your files):

ffmpeg -i "concat:blog-post-1.mp3|blog-post-2.mp3|blog-post-3.mp3" -acodec copy final-blog-post.mp3

This command will output a file called final-blog-post.mp3 that will be a combination of all the MP3s you've generated!

The final audio will be seamless — you won't even be able to notice they were separate files that were stitched together! 🤯
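If you end up with a different number of parts for every post, building that command by hand gets fiddly. Here's a small Python sketch that assembles the same ffmpeg concat command from a list of filenames. The function names are hypothetical, and it assumes ffmpeg is installed and on your PATH.

```python
import subprocess


def build_concat_command(parts, output):
    """Build the ffmpeg argument list that joins MP3 parts via the concat protocol."""
    return ["ffmpeg", "-i", "concat:" + "|".join(parts), "-acodec", "copy", output]


def concat_mp3s(parts, output):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(build_concat_command(parts, output), check=True)
```

Passing the arguments as a list means the filenames don't need shell quoting. A call like concat_mp3s(["blog-post-1.mp3", "blog-post-2.mp3"], "final-blog-post.mp3") runs the equivalent of the command shown earlier.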

Fixing Problem #2: Using Text With Double Quotes

This problem is much easier to fix than problem #1:

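Get a text editor that can Find and Replace text
If your text contains any backslashes, replace each one with two backslashes first
Find all double quotes (i.e. 👉"👈) and replace them with a double quote preceded by a backslash (i.e. 👉\"👈)
Put the text on a single line, because raw line breaks aren't allowed inside a JSON string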

After you do that, you'll be able to paste your text into the right field in the request.json file and it will work!
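Another option is to skip the manual escaping and let a JSON library do it. The sketch below builds the request.json contents in Python, reusing the voice settings from the example earlier. The function name make_request_body is my own, and json.dumps takes care of escaping double quotes, backslashes, and line breaks.

```python
import json


def make_request_body(text):
    """Return request.json contents as a string, with the text safely escaped."""
    body = {
        "input": {"text": text},
        "voice": {
            "languageCode": "en-us",
            "name": "en-US-Wavenet-J",
            "ssmlGender": "MALE",
        },
        "audioConfig": {"audioEncoding": "MP3"},
    }
    # json.dumps escapes double quotes, backslashes, and newlines for us.
    return json.dumps(body, indent=2)
```

Writing the returned string to request.json gives you a valid file no matter which quotes the post contains.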

Fixing Problem #3: Using a Different Voice

You can browse all of the Text-to-Speech voices here: Supported voices.

I highly recommend choosing a voice marked as "Wavenet" because those voices tend to be richer in quality and speak with more nuance.

After you preview a few and find one you like, just replace the options in your request.json file with the voice options you chose.
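The API can also return the voice list as JSON (a GET request to the v1/voices endpoint), which is handy if you want to filter it rather than scroll through the page. This sketch assumes you've already fetched and parsed that response, and that each entry has "name" and "languageCodes" fields. The function name wavenet_voices is made up for illustration.

```python
def wavenet_voices(voices_response, language_code="en-US"):
    """Return the sorted names of WaveNet voices for one language."""
    names = []
    for voice in voices_response.get("voices", []):
        speaks_language = language_code in voice.get("languageCodes", [])
        # WaveNet voices carry "Wavenet" in their name, e.g. en-US-Wavenet-J.
        if speaks_language and "Wavenet" in voice.get("name", ""):
            names.append(voice["name"])
    return sorted(names)
```

You'd still want to preview the shortlisted voices by ear before picking one.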

I personally chose these options, which go inside the "voice" object:

"languageCode":"en-us",
"name":"en-US-Wavenet-J",
"ssmlGender":"MALE"

Some Final Suggestions

Remove the headlines from your text before you generate your MP3s — headlines work fine for written text, but they usually don't read very well.
Add periods after each line of your text (even if it's a line from a list of items) — this will make the voice pause briefly before reading the next line.
Listen to your generated audio all the way through before sharing it, so you can catch obvious mistakes.
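The period tip is easy to automate. Here's a minimal sketch that adds a period to any non-empty line that doesn't already end in punctuation. The function name add_periods and the exact set of punctuation marks it checks are my own choices, so adjust them to taste.

```python
def add_periods(text):
    """Append a period to every non-empty line that lacks closing punctuation."""
    fixed = []
    for line in text.splitlines():
        stripped = line.rstrip()
        # Leave blank lines and lines that already end in punctuation alone.
        if stripped and stripped[-1] not in ".!?:;,":
            stripped += "."
        fixed.append(stripped)
    return "\n".join(fixed)
```

Run it on your text before splitting it into chunks, so list items and short lines get their pause.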

Conclusion

Once you've done the initial setup steps and gotten Google Cloud configured, it's actually pretty easy to do the other steps for every new blog post.

Splitting a long blog post into smaller chunks takes 30 seconds.
Replacing all the double quotes takes 5 seconds.
Running the API calls and decoding the base64 strings takes 30 seconds.
Combining the separate MP3s into a final MP3 takes 10 seconds.

So, for a little over a minute of work, you get an audio version of your blog post that you can upload to your podcast host of choice, and your readers get an easy way to engage with your content on the go! 🥳