As part of taking stock of AI tools, I started to play with Amazon Q Developer. It is accessed as an extension in Visual Studio Code and is seemingly free. It provides both a chat window and an inline edit command.
I began with a simple experiment: generating hero images for articles. Although I was a bit skeptical that AI-generated images might feel overly “AI-like,” I adopted the mindset of “stop doing things manually, use AI.” With that in mind, I asked Amazon Q to create a script that takes an article slug and produces a hero image. It nailed it on the first try. I copied the generated code, ran the script, and—success! The script called out to OpenAI’s DALL·E 3, generated an image based on the article content, and saved it to the file system. I was fairly impressed—and further, I found myself without a task to complete!

So I think, “what about having a ‘listen to article’ button at the top of articles, using text-to-speech (TTS)?” I type in my prompt.
Just like with the `auto-hero.ts` script, Amazon Q generated a working `auto-tts.ts` script on the first shot, modelled on the first one. Well, the second shot, because initially it reached for Amazon’s own AWS products, so I asked it to stick with OpenAI TTS for now.
The raw audio was, given the five minutes I’d spent getting to that point, all in all quite good.
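The heart of the generated script is only a few lines. Roughly this shape (a sketch, not the exact code Amazon Q produced, assuming the official `openai` Node package and a made-up output path):

```ts
// auto-tts.ts (sketch) - roughly the shape of the generated script
import fs from "node:fs/promises";
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

export async function generateAudio(slug: string, text: string) {
  // Ask OpenAI's TTS endpoint to read the article text aloud
  // (a real script would need to chunk long articles to fit the input limit)
  const speech = await openai.audio.speech.create({
    model: "tts-1",
    voice: "alloy",
    input: text,
  });

  // The SDK returns a fetch-style response; write the MP3 bytes to disk
  const buffer = Buffer.from(await speech.arrayBuffer());
  await fs.writeFile(`public/audio/${slug}.mp3`, buffer);
}
```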
The first problem is that using the `.mdx` file as the source means that code like `import X from package Y` was being read out. I ask Amazon Q to instead scrape the text from the article page served by the Astro dev server. That results in an 80% good solution, but a few items in the remaining 20% do niggle at me: it would be nice if my italicised and bold words had some voice emphasis, and lists were a little hard to follow.
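For reference, the scraping step boils down to something like this (a sketch assuming `cheerio` and Astro’s default dev port of 4321; the URL path and selector are made up):

```ts
// Sketch: pull the readable text of an article from the running dev server
import * as cheerio from "cheerio";

export async function getArticleText(slug: string): Promise<string> {
  const res = await fetch(`http://localhost:4321/articles/${slug}/`);
  const html = await res.text();

  const $ = cheerio.load(html);
  // Drop elements that make no sense read aloud
  $("article pre, article nav, article script, article style").remove();

  // Collapse the remaining article body to plain text
  return $("article").text().replace(/\s+/g, " ").trim();
}
```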
Asking ChatGPT, it explains that there is a thing called Speech Synthesis Markup Language (SSML). The syntax is XML-like:
```xml
<speak>
  <emphasis level="strong">This part is emphasized.</emphasis>
  <break time="500ms"/> <!-- Adds a 500-millisecond pause -->
  You can <prosody pitch="high" rate="slow">change pitch and speaking rate</prosody> for dramatic effect.
  Or <prosody volume="x-loud">increase volume</prosody> to stress a point.
</speak>
```
Unfortunately ChatGPT also hallucinates that OpenAI “of course” supports SSML, and this leads us down a false path. To produce SSML, we need to preserve some HTML tags when scraping the rendered article and convert them to their SSML equivalents. I get a little lost in the weeds here: Amazon Q seemed determined to stick with full AST traversal mechanisms, and I had to tell it to chill out and just use an HTML sanitiser, but eventually I got a handle on things and was able to output some good-looking SSML. I fired off the script and we have some audio.
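For reference, the tag-to-SSML mapping behind that audio is roughly this shape (a sketch assuming `cheerio`, naive about nested tags, and not the exact script):

```ts
// Sketch: convert a handful of HTML tags into their SSML equivalents
import * as cheerio from "cheerio";

export function htmlToSsml(pageHtml: string): string {
  const $ = cheerio.load(pageHtml);

  // Bold and italics become spoken emphasis
  $("strong, b").each((_, el) => {
    $(el).replaceWith(`<emphasis level="strong">${$(el).text()}</emphasis>`);
  });
  $("em, i").each((_, el) => {
    $(el).replaceWith(`<emphasis level="moderate">${$(el).text()}</emphasis>`);
  });

  // End each paragraph and list item with a short pause
  $("p, li").each((_, el) => {
    $(el).append('<break time="300ms"/>');
  });

  // Serialise, then strip any remaining HTML tags while keeping the SSML ones
  const ssmlish = $("article")
    .html()!
    .replace(/<(?!\/?(emphasis|break|prosody)\b)[^>]+>/g, " ");

  return `<speak>${ssmlish}</speak>`;
}
```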
This audio is a little odd. There are blips, which, as I listen, I realise are the TTS making somewhat random decisions: sometimes it pays attention to the SSML tags, and other times it just reads them out verbatim. This isn’t supporting SSML at all! Sure enough, the OpenAI text-to-speech docs, read by me, a human, clearly say it accepts text only. Darn.
Perhaps I won’t give up just yet. I’ll try Amazon Q’s suggestion after all: AWS’s TTS service, Polly, which does support SSML. Amazon Q makes short work of generating a new `auto-tts-aws.ts` file targeting Polly, and it turns out I do have some AWS API keys lying around, so we’re all systems go…
…except that Polly, using the standard voice, sounds like it’s from the dark ages compared to OpenAI. The good news is that it does pay attention to the SSML tags. The bad news is that even then it’s way worse than OpenAI’s alloy voice, sounding, ironically, more metallic and machine-like. I decide to go back to OpenAI for now, in the full knowledge that plain text is all we can provide to it.
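(For the record, the Polly call itself is simple. A sketch assuming the `@aws-sdk/client-polly` package, with placeholder region and voice:)

```ts
// Sketch: synthesize SSML with Amazon Polly
import fs from "node:fs/promises";
import { PollyClient, SynthesizeSpeechCommand } from "@aws-sdk/client-polly";

const polly = new PollyClient({ region: "us-east-1" });

export async function synthesizeSsml(ssml: string, outPath: string) {
  const result = await polly.send(
    new SynthesizeSpeechCommand({
      Text: ssml,
      TextType: "ssml",    // tell Polly the input is SSML, not plain text
      OutputFormat: "mp3",
      VoiceId: "Joanna",   // a standard voice
    })
  );

  // Collect the returned audio stream into a Buffer and write it to disk
  const bytes = await result.AudioStream!.transformToByteArray();
  await fs.writeFile(outPath, Buffer.from(bytes));
}
```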
Plain text might not be completely expressionless though. I posit that if something is in ALL CAPS then perhaps OpenAI’s TTS will add some emphasis. Remember, the ethos behind these models is that, if the model were perfect, it would do what any human would do reading the same text. That’s why, even though it didn’t officially understand those earlier SSML tags, it would sometimes pause when it saw a break tag. I proceed to convert both `em` and `strong` tags to all caps. The result: yes indeed, the output audio responds quite well, at least on the first listen! One risk to consider: in some cases it could think it’s reading an abbreviation. Time will tell.
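The conversion itself is only a couple of lines in the scraping step. A sketch (cheerio again; the function name is mine):

```ts
// Sketch: replace emphasis tags with their text in ALL CAPS before flattening to plain text
import * as cheerio from "cheerio";

export function emphasiseForTts(pageHtml: string): string {
  const $ = cheerio.load(pageHtml);

  // Swap each em/strong element for its upper-cased text content
  $("em, strong").each((_, el) => {
    $(el).replaceWith($(el).text().toUpperCase());
  });

  return $("article").text().replace(/\s+/g, " ").trim();
}
```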
We have some audio files, but we’re going to need a button at the top of articles that says “listen to this article”. I asked Amazon Q to deliver such an audio playback widget, and once again it did a pretty decent job. The visuals feel a bit lacking, though: I have in mind the audio playback buttons I’ve seen around with the audio waveform behind them, a little like voice snippets in Apple Messages or WhatsApp.
This turns into a sticking point. I’m not yet well trained in using AI to generate good UI, and it also seems that what I had in mind isn’t common or easily searched for, and I can’t remember the sites where I saw it.
I take a step back and reflect: as a developer I’ve often felt justified in taking time to perfect things, but in the age of AI, and being more holistically mission-focused, perhaps sometimes I’m just going to have to accept things as “good enough”. We can always make it better later, but speed is important. With this fresh perspective, Amazon Q basically nailed it, if I’m honest. We have a perfectly sufficient audio play widget.
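At its core the widget isn’t much more than a styled wrapper around an `<audio>` element. A bare-bones sketch in TSX (not the actual component, which has more styling and the waveform ambitions mentioned above):

```tsx
// Sketch: a minimal "listen to this article" widget
import { useRef, useState } from "react";

export function ListenToArticle({ src }: { src: string }) {
  const audioRef = useRef<HTMLAudioElement>(null);
  const [playing, setPlaying] = useState(false);

  const toggle = () => {
    const audio = audioRef.current;
    if (!audio) return;
    if (playing) {
      audio.pause();
    } else {
      void audio.play();
    }
    setPlaying(!playing);
  };

  return (
    <div>
      <button onClick={toggle}>
        {playing ? "Pause" : "Listen to this article"}
      </button>
      <audio ref={audioRef} src={src} onEnded={() => setPlaying(false)} />
    </div>
  );
}
```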

We’re making good progress, but one obvious thing left unattended is captioned images and code blocks. On the images front, it seems fine to just ignore their existence for now, but code blocks simply don’t translate naturally from visual to audio. It would be ideal to have some kind of `alt` text for them for the audio version, specified in the MDX file.
I get a bit lost in the weeds here again. Because we’re generating the TTS text from the Astro-rendered page, we need some way to add the alternative text to the HTML document such that it won’t affect the article visually, but can still be picked up by the `auto-tts.ts` script. How would you solve this?
Initially I think HTML comments are the way to go, but rendering actual HTML comments from MDX (which is effectively JSX) or dynamically from Astro components is convoluted. Finally I settle on a custom `tts` HTML element with an `alt` property. The browser will ignore the tag when rendering the document, but it’s still there for our script. Perfect.
I additionally come up with a syntax to easily add alt text to multiline code blocks: any triple backtick followed by a colon and then a text string will automatically be turned into a `tts` element. Further, any `tts` element directly preceding a code block will cause the removal of that code block.
{/* code block syntax */}
```tsx : this is alternate for audio
<div>
code never to be heard
</div>
```
{/* inline syntax */}
<tts alt="audio alternate" />`code never to be heard`
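On the script side, picking these up during scraping is a small pass. A sketch (cheerio; exactly how the `tts` elements and their neighbouring code end up placed in the rendered HTML needs care, so the selectors here are illustrative):

```ts
// Sketch: swap code for its tts alt text while flattening the page for TTS
import * as cheerio from "cheerio";

export function applyTtsAlternates(pageHtml: string): string {
  const $ = cheerio.load(pageHtml);

  $("tts").each((_, el) => {
    const alt = $(el).attr("alt") ?? "";

    // A tts element directly preceding code causes that code to be removed
    const code = $(el).next("pre, code");
    if (code.length) code.remove();

    // The tts element itself is replaced by its spoken alternative
    $(el).replaceWith(alt);
  });

  return $("article").text().replace(/\s+/g, " ").trim();
}
```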
I set about adding these audio annotations to the In Search of A True Hero article. It certainly goes a long way, taking the audio version from gobbledygook to a more natural flow. Even with these improvements, though, converting from one medium to another seems inherently challenging, especially when the first medium, in this case the written article, was not written with audio in mind.
The “listen to article” widget is not the only target audio format either. Since early in this project I’ve had the idea of converting these articles to a podcast format. This seems an even higher bar than a “listen to article” button, as podcasts normally have a well-produced, conversational feel that sets the right mood for the underlying content. I ask ChatGPT to try converting one article into a podcast tone, but this is quite an ask, and it falls flat. Probably something to tackle when looking into podcasts directly.
One takeaway is that keeping all of these intended mediums in mind when writing content could help improve the outcome. The holy grail would be to write once, deliver everywhere; it will be a continual exploration of how feasible that is.
That’s a wrap for this initial exploration of TTS. We finish with a new “listen to article” widget at the top of article pages. Perhaps you’re listening to this article via the widget right now!
Editor’s note: I later encountered this X post about the Radio Hacker News project, and was able to further improve the audio widget based on that project’s source code. Thanks Enes!