Text-to-speech technology: Look who's talking

02/02/2022 Know-How

Talking and listening are the most natural ways for humans to communicate with one another—writing did not come until much, much later. With human–machine communication, the trend is heading very much back to the roots. These days, high-quality audio files can be created in many languages with absolute ease.

A device or a machine that can talk provides massive benefits in a great many applications. It provides accessibility for people with impaired vision. It is no longer necessary to have the device in question in sight, which is a massive safety boost when driving a car, for example. And it can also be very helpful when people are in a different room, for example when nurses in a hospital are audibly alerted to a dangerous situation, even if they are currently not actually with the patient. Similar warnings can be valuable in production facilities as well. Speech output can also make the operation of ever more complicated equipment much simpler.

Applications with bidirectional communication, i.e. those that can not only “speak” but also “listen,” such as Siri and Cortana, take this a step further. In many cases, however, speech output alone is wholly adequate. Limiting a design to speech output keeps hardware and software requirements much lower and eliminates the need for complex infrastructure with Internet connectivity.

Find more information about text-to-speech on our landing page.

Generating Speech with Ease from Text Files

Previously, text had to be recorded in each desired language to support speech output. This meant hiring a recording studio and a professional voice actor or setting up your own studio, an expensive and time-consuming solution. To cut development time and costs drastically, Epson has developed the ESPER2 Voice Data Creation PC Tool. This PC-based development environment can be used to create high-quality audio files, currently in up to 12 languages.

To do this, pre-worded sentences can be imported into the tool as a CSV file or entered directly into an editor form. The tool is used to generate a language file. ESPER2 also analyzes the text’s sentence structure to achieve a proper and natural pronunciation and emphasis, and it has an extensive dictionary, too. The pronunciation of product names, proper names, and invented words that are not in the dictionary can be defined as desired using the edit function. This makes it possible to have audio files generated of such quality that it is difficult to tell them apart from the natural spoken word of a human being.
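As a rough illustration of preparing sentences for import, the following Python sketch writes a small sentence list as CSV text. The column names (id, language, text), the language codes, and the sample sentences are assumptions made for this example only; the column layout that ESPER2 actually expects is not described here and should be taken from the tool's documentation.

```python
import csv
import io

# Hypothetical sentence list. The column names are assumptions for
# illustration, not the layout documented for ESPER2.
SENTENCES = [
    {"id": 1, "language": "en-US", "text": "Door open."},
    {"id": 1, "language": "de-DE", "text": "Tür offen."},
    {"id": 2, "language": "en-US", "text": "Battery low."},
]


def sentences_to_csv(rows):
    """Serialize the sentence rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["id", "language", "text"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = sentences_to_csv(SENTENCES)
print(csv_text)
```

Keeping the sentences in a script or spreadsheet like this makes it easier to review wording in each language before the audio is generated.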

If voice and audio files are already available in WAV format, these can also be used with ESPER2. The WAV files can be easily imported into the development environment, where they are joined with the files generated by ESPER2. For further editing, sentences can be exported from the tool in CSV format for use in Excel.

Be Understood Everywhere in the World

ESPER2 currently supports 12 languages: US and UK English, French and Canadian French, German, Italian, Russian, European Spanish and Latin American Spanish, Chinese, Japanese, and Korean. To accommodate language-specific features, it is possible to adjust the tone and speed of the voice.

However, the tool does not have a translation function, which means that the text needs to be entered into ESPER2 in each desired language.

Epson has already announced a library containing audio files with common units such as currencies, weights, and similar values, as well as basic noises that can be used to add flourishes to human speech.

Minimal Storage Space and High Voice Quality

To enable efficient transfer and storage, ESPER2 uses Epson’s proprietary EOV codec format (Epson Own Voice). Compared to the standard compression format ADPCM (adaptive differential pulse-code modulation), EOV shrinks file sizes by up to 66%—all while preserving superb speech quality at bitrates of 16kbit/s to 40kbit/s.
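To get a feel for what these bitrates mean for flash memory, the following Python sketch estimates the audio payload of a clip at a constant bitrate. It assumes 1 kbit = 1000 bit and ignores any container or lookup-table overhead, which is not quantified here. The second helper simply applies the "up to 66%" figure as a best case to a hypothetical ADPCM file size; the 3-second clip and the 30,000-byte file are example values, not measurements.

```python
def eov_size_bytes(bitrate_kbit_s, duration_s):
    """Audio payload in bytes at a constant bitrate (1 kbit = 1000 bit)."""
    return bitrate_kbit_s * 1000 * duration_s // 8


def best_case_size_after_reduction(adpcm_bytes, reduction_percent=66):
    """Smallest size implied by an 'up to N%' reduction relative to ADPCM."""
    return adpcm_bytes * (100 - reduction_percent) // 100


# Example: a 3-second sentence at the lowest and highest bitrate mentioned.
low = eov_size_bytes(16, 3)
high = eov_size_bytes(40, 3)
print(low, high)  # 6000 15000

# Example: best case for a hypothetical 30,000-byte ADPCM file.
print(best_case_size_after_reduction(30000))  # 10200
```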

The .eov file consists of a lookup table combined with the audio files. To keep the countless sentences in multiple languages manageable, developers can assign the same ID in the lookup table to a sentence in several languages. They then only need to reference one ID, and the sentence is played in whichever language is selected.
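The idea of one ID shared across languages can be modeled with a small Python sketch. This is only a conceptual model: the dictionary layout, the language codes, and the sample sentences are assumptions for illustration and say nothing about the internal structure of the .eov file.

```python
# Hypothetical in-memory model of a phrase lookup table:
# one ID maps to the same sentence in several languages.
LOOKUP = {
    1: {"en-US": "Door open.", "de-DE": "Tür offen."},
    2: {"en-US": "Battery low.", "de-DE": "Batterie schwach."},
}


def resolve(phrase_id, language):
    """Return the sentence stored under phrase_id for the given language."""
    return LOOKUP[phrase_id][language]


# Application code references only the ID; the language is a setting.
print(resolve(1, "en-US"))  # Door open.
print(resolve(1, "de-DE"))  # Tür offen.
```

The benefit of this arrangement is that the application logic stays identical for every market; only the language setting changes.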

To save even more storage space, it is possible to combine expressions that are frequently used with other expressions by joining them with a slash (/). For example, the days of the week would look like this:

ID number 1: “Today is/Monday.”

ID number 2: “Today is/Tuesday.”

ID number 3: “Today is/Wednesday.”

The generated voice units here are: “Today is,” and “Monday,” “Tuesday,” “Wednesday.”
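The saving from the slash notation can be illustrated with a short Python sketch that splits each sentence at the slash and keeps every distinct unit only once. This is a conceptual model of the behavior described above, not the tool's actual implementation; the function and variable names are made up for this example.

```python
def split_units(sentences):
    """Split each sentence at '/' and collect distinct units in first-seen order."""
    units = []
    per_sentence = {}
    for phrase_id, text in sentences.items():
        parts = [part.strip() for part in text.split("/")]
        per_sentence[phrase_id] = parts
        for part in parts:
            if part not in units:
                units.append(part)
    return units, per_sentence


SENTENCES = {
    1: "Today is/Monday.",
    2: "Today is/Tuesday.",
    3: "Today is/Wednesday.",
}

units, per_sentence = split_units(SENTENCES)
print(units)         # ['Today is', 'Monday.', 'Tuesday.', 'Wednesday.']
print(per_sentence)  # each ID now refers to a sequence of shared units
```

Three sentences yield only four stored units, because “Today is” is stored once and reused.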

Storage and Speech Output with Integrated ...

To store and output the generated voice files, Epson offers an integrated and a discrete solution. The integrated solution comprises a 32-bit ARM Cortex-M0+ microcontroller with an integrated voice and audio hardware processor (SoC) that enables the audio to be output simultaneously over two channels with a sample rate of 15.625kHz each.

This is currently the only integrated solution on the market that can output speech and audio at the same time. The unique feature here is that the volume of each channel can be adjusted independently of the other. This can be used, for example, to reduce the volume of music as soon as the speech output begins. The pitch and voice speed are managed at a hardware level, and speed can be adjusted in 5% increments between 75% and 125%.
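The speed range mentioned above allows eleven settings. The following Python sketch shows how an arbitrary requested speed could be mapped to the nearest valid step. It only models the range and step size stated in the text; how the value is actually programmed into the hardware is not described here, and the function name is hypothetical.

```python
MIN_SPEED = 75   # percent
MAX_SPEED = 125  # percent
STEP = 5         # percent

VALID_SPEEDS = list(range(MIN_SPEED, MAX_SPEED + 1, STEP))


def snap_speed(requested_percent):
    """Clamp to 75..125 and round to the nearest 5% step (ties round up)."""
    clamped = max(MIN_SPEED, min(MAX_SPEED, requested_percent))
    return STEP * int((clamped + STEP / 2) // STEP)


print(VALID_SPEEDS)     # [75, 80, 85, ..., 120, 125]
print(snap_speed(93))   # 95
print(snap_speed(60))   # 75
print(snap_speed(200))  # 125
```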

The IDs of the generated voice and audio files are written to a register in the processor, which then plays the relevant audio files. This eliminates the need for special program code to link the audio files. Once the audio output starts, no additional CPU resources are required, freeing up the CPU entirely to handle other tasks or go into sleep mode.

... or Discrete Solution

The discrete solution combines a module from the S1V30xxx speech output IC range from Epson with an external host microcontroller. This is ideal for existing designs where the microcontroller cannot or should not be replaced. Any microcontroller with an integrated serial interface is suitable for this method.

The first module in this series, the S1V3G340, has just one audio channel, which means that it can output either speech or music. According to the manufacturer, all new speech output ICs should be fitted with two discrete channels, like the integrated solutions. Currently, the S1C31D50 microcontroller is available with two channels; the mix-play function allows these to be mixed together, for example as voice output with discreet background music. The S1C31D51 model also offers a sound generator to achieve speech output via a piezoelectric or electromagnetic buzzer. Special applications based on keywords to be recognized can be supported by the S1C31D50 or S1C31D51 microcontrollers in conjunction with a microphone connected to an A/D converter input.

Developers can use various evaluation tools from Epson to test the quality of the speech output: via a loudspeaker using the S5U1C31D50T1200 and S5U1C31D51T1100 evaluation boards, or via a piezoelectric or electromagnetic buzzer using the S5U1C31D51T2100 buzzer board in conjunction with the S5U1C31D51T1100 evaluation board. All of the evaluation tools offer extensive testing software available in different languages, and the desired language can be selected using a DIP switch. Once the free ESPER2 software is installed and licensed, it is also possible to create your own sentences, modify them as you wish, and export them to the evaluation board.

Short Development Times Thanks to Rutronik Adapter Board

If you need to cut development times for high-quality speech output even further, you are best served by the Arduino-compatible RutAdaptBoard-TextToSpeech adapter board (Arduino Shield) from Rutronik, which can be plugged into any standard microcontroller evaluation kit with an Arduino interface. However, it is at its most convenient when combined with the RutDevKit Development Kit, because the appropriate software drivers are already available free of charge. As an alternative to the STM32L5 software driver, Rutronik has developed a driver for the Infineon/Cypress PSoC microcontroller.

The S1V3G340 sound IC from Epson is the heart of the RutAdaptBoard-TextToSpeech. It is controlled by the host microcontroller and can play back previously defined speech stored in the external NOR flash memory as binary data. The USB to SPI bridge converts the data from the USB protocol to a serial protocol during the flash process.

The speech is first generated as a ROM file using the ESPER2 Voice Data Creation PC Tool and then exported to the external NOR flash memory of the adapter board. Rutronik has developed a dedicated PC software tool for this flash process. It also allows all previously generated speech data to be tested by playing it back on the PC before it is actually written to the flash memory.

The speech is output to any external loudspeaker via an audio amplifier and 3.5mm jack. Optimum audio output can be achieved with a loudspeaker with an impedance of 8Ω or more.

RutAdaptBoard-TextToSpeech and RutDevKit are available at www.rutronik24.com. Rutronik’s specialists are on hand to provide implementation support and advice, including for questions regarding additional components for the application such as audio amplifiers, NOR flash memory, or loudspeakers.

Rutronik’s Arduino-compatible adapter board allows for fast creation of high-quality speech output.
