frederik's blog

Text-to-speech using the Web Speech API

I’ve always been fascinated by text-to-speech programs.

It seems I’m not the only one since the technology has been around for decades and dedicated people around the world are working relentlessly on many different solutions.

One of those solutions is the Web Speech API, which works inside the browser (Firefox, Chrome) and doesn’t need any plugins or libraries.

It’s convenient to use jQuery to handle the events but it’s not necessary.

Using the Web Speech API

I thought it might be nice to have somebody read my articles when a user clicks a button. Professionals in this area are quite expensive and I don’t want to pay money for something a machine does for free. That “somebody” is of course a synthetic voice, but for articles not related to technology it works just fine, and it only takes a few lines of code.

I stumbled upon this Pen by Steve Robertson and copied/simplified his code to fit my needs.

First, I added jQuery for convenience.

<script src="//cdnjs.cloudflare.com/ajax/libs/jquery/2.1.3/jquery.min.js"></script>

The second and last bit is this:

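$(function(){
  // Only wire up the buttons if the browser supports speech synthesis
  if ('speechSynthesis' in window) {
    $('#speak').click(function(){
      // Read the text content of the article body
      var text = $('.entry').text();
      var msg = new SpeechSynthesisUtterance();
      msg.rate = 1;     // speaking speed, 1 is the default
      msg.pitch = 1;    // 1 is the default pitch
      msg.text = text;
      msg.volume = 0.2; // between 0 (silent) and 1 (loudest)
      speechSynthesis.speak(msg);
    });

    // Stop speaking and clear the queue of pending utterances
    $('#cancel-speech').click(function() {
      speechSynthesis.cancel();
    });
  } else {
    console.log('The Web Speech API seems to be missing. Please try another browser.');
  }
});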

Whenever someone clicks on the button in the top right corner of every article (#speak), the speech synthesis starts. The other button (#cancel-speech) of course cancels the text-to-speech output.
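Since jQuery is only there for convenience, here is a rough sketch of the same wiring with plain DOM methods. The function name setupSpeech and its parameters are made up for this example. Passing in the document, the synthesizer and the utterance constructor is just a way to keep the function easy to try out; it is not something the API requires.

```javascript
// Sketch: the same behavior as the jQuery snippet, using plain DOM APIs.
// setupSpeech is a hypothetical helper name; the IDs (#speak, #cancel-speech)
// and the .entry class come from the snippet above.
function setupSpeech(doc, synth, Utterance) {
  var speakButton = doc.getElementById('speak');
  var cancelButton = doc.getElementById('cancel-speech');

  speakButton.addEventListener('click', function () {
    // Note: querySelector only returns the first .entry element,
    // whereas jQuery's .text() joins the text of all matches.
    var entry = doc.querySelector('.entry');
    var msg = new Utterance(entry ? entry.textContent : '');
    msg.rate = 1;
    msg.pitch = 1;
    msg.volume = 0.2;
    synth.speak(msg);
  });

  cancelButton.addEventListener('click', function () {
    synth.cancel();
  });
}

// Only wire things up in a browser that exposes the API.
if (typeof window !== 'undefined' && 'speechSynthesis' in window) {
  document.addEventListener('DOMContentLoaded', function () {
    setupSpeech(document, window.speechSynthesis, window.SpeechSynthesisUtterance);
  });
}
```

The jQuery version is shorter, so this is mostly useful if jQuery isn’t on the page already.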

The voice defaults to the language set in the lang attribute of the <html> element of your page. In my case that’s en-US.

Choosing a different voice

The following snippet logs all available voices to the console. Different browsers might yield different results. I’m happy with the defaults.

console.log(window.speechSynthesis.getVoices())
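One thing to be aware of: in some browsers (Chrome in particular) the list of voices is loaded asynchronously, so getVoices() can return an empty array when called right after page load. The sketch below waits for the voiceschanged event in that case. The helper name whenVoicesReady is made up for this example, and browsers that never fire the event are not covered by it.

```javascript
// Sketch: call back with the voice list once it is actually available.
// whenVoicesReady is a hypothetical helper name, not part of the API.
function whenVoicesReady(synth, callback) {
  var voices = synth.getVoices();
  if (voices.length > 0) {
    // The list is already populated, no need to wait.
    callback(voices);
    return;
  }
  // Otherwise wait for the browser to announce that voices have loaded.
  if (typeof synth.addEventListener === 'function') {
    synth.addEventListener('voiceschanged', function onChange() {
      synth.removeEventListener('voiceschanged', onChange);
      callback(synth.getVoices());
    });
  }
}

// Browser usage: log the voices as soon as they are known.
if (typeof window !== 'undefined' && 'speechSynthesis' in window) {
  whenVoicesReady(window.speechSynthesis, function (voices) {
    console.log(voices);
  });
}
```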

Choosing a different voice with the Web Speech API is relatively easy.

A useful example can be found here.
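For a rough idea of what that looks like, here is a minimal sketch that picks a voice by language tag and assigns it to the utterance. The function names pickVoice and speakWithVoice are made up for this example, and matching on the lang property is only one possible strategy (matching on the voice name would work as well).

```javascript
// Sketch: pick the first voice whose language tag matches, or null.
// pickVoice and speakWithVoice are hypothetical helper names.
function pickVoice(voices, lang) {
  var wanted = lang.toLowerCase();
  for (var i = 0; i < voices.length; i++) {
    // Normalize tags such as "en_US" to "en-us" before comparing.
    var voiceLang = (voices[i].lang || '').replace('_', '-').toLowerCase();
    if (voiceLang === wanted) {
      return voices[i];
    }
  }
  return null;
}

// Browser-only: speak a text with the chosen voice, if one was found.
function speakWithVoice(text, lang) {
  var synth = window.speechSynthesis;
  var msg = new SpeechSynthesisUtterance(text);
  var voice = pickVoice(synth.getVoices(), lang);
  if (voice) {
    msg.voice = voice;
  }
  // Setting lang as well gives the browser a hint when no voice matched.
  msg.lang = lang;
  synth.speak(msg);
}
```

In a browser console, speakWithVoice('Guten Tag', 'de-DE') would then use a German voice if one is installed.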

Web Speech API Oddities

The Web Speech API isn’t finished yet. It hasn’t crashed my browser so far, but I wouldn’t call it stable either. It’s not a W3C Standard, and the current specification (2014) is a draft and subject to change. It works for my purpose but I wouldn’t want to debug a commercial-grade application built around it.

Some things are strange:

  • The default voice in Firefox is male*, in Chrome it’s female
  • Speech synthesis stops partway through longer texts in Chrome
  • I can’t get the volume parameter to work
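A workaround that is often suggested for the cut-off on longer texts is to split the text into smaller pieces and queue one utterance per piece, since speak() appends to a queue. The sketch below does that at sentence boundaries. I haven’t verified that it fixes the problem in every Chrome version, the limit of 200 characters is an arbitrary assumption, and the function names are made up for this example.

```javascript
// Sketch: group sentences into chunks of at most maxLength characters.
// A single sentence longer than maxLength is kept as one chunk.
function splitIntoChunks(text, maxLength) {
  var normalized = text.replace(/\s+/g, ' ').trim();
  // Match runs of text up to and including sentence-ending punctuation.
  var sentences = normalized.match(/[^.!?]+[.!?]*\s*/g) || [];
  var chunks = [];
  var current = '';
  sentences.forEach(function (sentence) {
    if (current && (current + sentence).length > maxLength) {
      chunks.push(current.trim());
      current = '';
    }
    current += sentence;
  });
  if (current.trim()) {
    chunks.push(current.trim());
  }
  return chunks;
}

// Queue one utterance per chunk; speak() plays them in order.
function speakInChunks(text, synth, Utterance) {
  splitIntoChunks(text, 200).forEach(function (chunk) {
    synth.speak(new Utterance(chunk));
  });
}
```

In the browser this would be called as speakInChunks(text, window.speechSynthesis, window.SpeechSynthesisUtterance).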

Browser compatibility is coming along slowly but surely. Firefox can do it, Chrome can, Safari does it, even Edge is able to convert text to speech.

* It sounds very synthetic and arguably not male at all. It’s the voice of a very stereotypical robot, similar to the text-to-speech output from Apple’s 1984 keynote (3m17s). Playing around with the pitch parameter doesn’t make the output sound more human.

To do

  • I should exclude code snippets before adding the text to the SpeechSynthesis object
  • Headers should be spoken as headers, with a sufficient pause before continuing with the text body
  • Maybe images or image captions should be read in context as well
  • Instead of canceling the text-to-speech output I should pause it (so it can be resumed)
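Two of these items can be sketched already. The first helper strips code snippets before reading, assuming they are wrapped in pre or code elements (I haven’t checked this against every article). The second one turns the speak button into a pause/resume toggle using the pause() and resume() methods of speechSynthesis. Both function names are made up for this example.

```javascript
// Sketch: return the text of an element without its code snippets.
// Assumes snippets are marked up with <pre> or <code>.
function readableText(element) {
  var copy = element.cloneNode(true); // work on a copy, leave the page alone
  var blocks = copy.querySelectorAll('pre, code');
  for (var i = 0; i < blocks.length; i++) {
    if (blocks[i].parentNode) {
      blocks[i].parentNode.removeChild(blocks[i]);
    }
  }
  return copy.textContent;
}

// Sketch: pause, resume or start, depending on the synthesizer state.
// startSpeaking is a callback that queues the utterance(s).
function toggleSpeech(synth, startSpeaking) {
  if (synth.paused) {
    synth.resume();
    return 'resumed';
  }
  if (synth.speaking) {
    synth.pause();
    return 'paused';
  }
  startSpeaking();
  return 'started';
}
```

The paused check comes first because speaking stays true while an utterance is paused.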

Web Speech API – Conclusion

It’s a fun technology and it has become increasingly useful over the last couple of years. I find it amazing how human the default voice sounds in Chrome. Fast forward a few years and we’ll have hardly any need for professional voice actors anymore. The announcements on German trains (DB) already sound very generic and I can imagine them being replaced by machines very soon. Service hotlines will be using them as well. Tech support, dominated by non-native speakers already, will be based on synthetic voices, too.

It’s inevitable.

But for now it’s just a nice toy.