Text-to-Speech
SVOX text-to-speech technology generates speech from written text in two steps: Text Analysis and Voice Synthesis.

Text Analysis
First, abbreviations, date and time expressions, and other special character sequences (e.g. emoticons) are converted into readable text. After this text normalization, each word of the input text is analyzed and decomposed into smaller units (stems, suffixes, etc.). Because phonetic lexica are applied during word analysis, this step also yields the phonetic representation of each word. Using the results of the word analysis, the structure of each sentence (subject, objects, adverbials, etc.) is then determined based on a sentence grammar. Accentuation patterns and phrasing information are derived from the sentence structure.
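
The text normalization step can be illustrated with a minimal Python sketch. It is not SVOX's implementation: the lookup tables, the function names (normalize, expand_time, number_to_words) and the rule for reading times aloud are all assumptions made for illustration, and a production system would rely on much larger, language-specific resources.

```python
import re

# Hypothetical lookup tables, chosen only for this illustration.
ABBREVIATIONS = {"Dr.": "Doctor", "e.g.": "for example"}
EMOTICONS = {":-)": "smiley", ":-(": "sad face"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty"]

# Matches 24-hour clock times such as 9:30 or 14:05 (hours 0-23, minutes 00-59).
TIME_PATTERN = re.compile(r"\b([01]?\d|2[0-3]):([0-5]\d)\b")


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..59."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def expand_time(match: re.Match) -> str:
    """Turn a matched clock time into words (assumed reading style)."""
    hour, minute = int(match.group(1)), int(match.group(2))
    if minute == 0:
        return f"{number_to_words(hour)} o'clock"
    if minute < 10:
        return f"{number_to_words(hour)} oh {number_to_words(minute)}"
    return f"{number_to_words(hour)} {number_to_words(minute)}"


def normalize(text: str) -> str:
    """Replace abbreviations, emoticons and clock times with readable words."""
    for token, spoken in {**ABBREVIATIONS, **EMOTICONS}.items():
        text = text.replace(token, spoken)
    return TIME_PATTERN.sub(expand_time, text)


print(normalize("Dr. Smith arrives at 9:30 :-)"))
```

The later stages (word decomposition, sentence grammar, accentuation) operate on the normalized text and are not covered by this sketch.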

Voice Synthesis
The speech signal is generated by concatenating small units extracted from natural human speech. These units are of variable length: the shortest span from the middle of one sound to the middle of the next, while the longest may include complete phrases or sentences. The units to be concatenated for a specific utterance are selected so as to minimize a cost function that measures the distance between the output of text analysis and the annotations of a prerecorded database of natural speech (the unit selection corpus).
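
The selection principle described above can be sketched in Python under strong simplifying assumptions. The actual SVOX cost function and annotation scheme are not specified here, so the Spec fields, the WEIGHTS values and the function names are hypothetical. The sketch uses only fixed-length units (one per sound-to-sound transition) and scores each candidate independently against the target. Practical unit selection systems commonly also score how well adjacent units join and search over whole sequences, which this sketch omits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spec:
    """Annotation of a unit: hypothetical features for illustration."""
    diphone: str        # transition from the middle of one sound to the next
    stressed: bool      # whether the unit carries an accent
    phrase_final: bool  # whether the unit ends a phrase


@dataclass(frozen=True)
class Unit:
    """One recorded unit in the corpus, with its annotation."""
    unit_id: int
    spec: Spec


# Assumed mismatch penalties; real weights would be tuned on data.
WEIGHTS = {"stressed": 1.0, "phrase_final": 0.5}


def target_cost(target: Spec, candidate: Unit) -> float:
    """Distance between the text-analysis output and a unit's annotation."""
    cost = 0.0
    if target.stressed != candidate.spec.stressed:
        cost += WEIGHTS["stressed"]
    if target.phrase_final != candidate.spec.phrase_final:
        cost += WEIGHTS["phrase_final"]
    return cost


def select_units(targets: list[Spec], corpus: list[Unit]) -> list[Unit]:
    """For each target, pick the matching corpus unit with the lowest cost."""
    selected = []
    for target in targets:
        candidates = [u for u in corpus if u.spec.diphone == target.diphone]
        if not candidates:
            raise LookupError(f"no unit for diphone {target.diphone!r}")
        selected.append(min(candidates, key=lambda u: target_cost(target, u)))
    return selected


corpus = [
    Unit(0, Spec("h-e", stressed=False, phrase_final=False)),
    Unit(1, Spec("h-e", stressed=True, phrase_final=False)),
    Unit(2, Spec("e-l", stressed=True, phrase_final=True)),
    Unit(3, Spec("e-l", stressed=False, phrase_final=False)),
]
targets = [
    Spec("h-e", stressed=True, phrase_final=False),
    Spec("e-l", stressed=False, phrase_final=True),
]
print([u.unit_id for u in select_units(targets, corpus)])
```

For the second target no unit matches exactly, so the unit with the cheaper mismatch under the assumed weights is chosen.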

Languages
SVOX has developed over 30 voices covering the foremost languages spoken in Europe, the Americas, and Asia. Our proven language development process and tools are constantly improved and extended as we continue to implement and support new languages.

Compression
To maintain the quality of our most sophisticated text analysis modules and unit selection corpora while reducing memory requirements on the target platform, we have developed several highly efficient, proprietary compression algorithms for our text-to-speech system.