HARNESSING THE POWER OF AI IN TEXT TO SPEECH - USING TTS IN 2021 AND BEYOND - GOAP

Page created by Dolores Bell

Shopping

English

Like
Share
Embed
Fullscreen
Slides
Download HTML
Download PDF
Abuse

←

→

Page content transcription

If your browser does not render page correctly, please read the page content below

HARNESSING THE POWER OF AI IN TEXT TO SPEECH - USING TTS IN 2021 AND BEYOND - GOAP

Harnessing the power of AI
    in Text to Speech
    Using TTS in 2021 and Beyond

Overview
●   What is Text To Speech [TTS]?
●   Why is spoken language difficult?
●   What to consider with TTS
●   Let’s make some TTS
●   Providers

A word about AI
Artificial Intelligence is
a large area of study,
and primarily has seen
Language Processing
as a Data and
Statistical problem.
NLP is the branch that
overlaps with what we
see today in Machine
Translation and
Speech.

A historical progression

               1950                              1980                             2010                            2020

                                                                        The age of the Neural
          Starting out                        Statistics                      Network                            Today

  Originally from the 30’s Speech        Calculating statistical       Neural Networks, TacoTron,     1-shot from hearing to speaking
 synthesis. It works but not always   probabilities leads to better   WaveNet, Machine Learning and        Robust cloud models
                                                 results.                more Neural Networks            Voice Agents everywhere

Microsoft
                                                                                                                   2019

https://azure.microsoft.com/en-us/blog/microsoft-s-new-neural-text-to-speech-service-helps-machines-speak-like-people/

2008-2016: many changes in the AI space
We moved from
procedural
statistical
processing to
massively
parallel
processing.

Text To Speech is difficult
Speech is made up of sounds, and those sounds are complicated:

                                  https://www.deepmind.com/blog/article/wavenet-generative-model-raw-audio

Voice is different...
1. watt ewe right iz knot watt u saye.
2. Even if it’s written correctly you can’t tell Bass from Bass...
3. The tools are theoretical
4. No existing “spell-check” or make-language-sound-right-check--human
   validation required
5. “Standards” vary by language/grammar/transcription method, and more!
6. Everyone has a slightly different implementation and acceptance criteria
7. Not all customers accept

                                                                              8

Voice is faster and cheaper
1)   Never gets sick
2)   Sounds the same months later--Infinite retakes!
3)   Doesn’t get tired
4)   Doesn’t care what it is saying
5)   Engineering costs + QA costs vs. Raw Talent

Prosody/Suprasegmentals [Phonetics 101]
Speech is more than just sounds put together, it’s about how they are put
together.

Suprasegmental: A set of qualities superimposed on a set of phonetic segments

 ●   Pitch / Tone
 ●   Juncture [e.g. punctuation]
 ●   Stress [loud/soft]
 ●   Intonation/Rhythm/Melody/pauses
 ●   Duration

     AKA: Prosody
                                                                                10
     "I never said that she stole my money"

Sounds versus Words
Localization has traditionally been text-based.

         When we do voice--it’s a human actor, reading a written script.

How many CAT tools exist vs. how many “language translation speech tools”
exist.

Text has many advantages, until recently we didn’t have similar methods for
spoken language and our tools have been limited.

Modern speech tools are a product of the AI/NLP Evolution of the past 5 years

                                                                                11

IPA and SSML

               12

SSML
It’s like HTML, but for speech!

Allows you to control timing, volume, pitch, stress… Prosody

Each vendor has a slightly different implementation but there is an official
standard from the W3 consortium.

                                                     https://www.w3.org/TR/speech-synthesis11/

                                                                                                 14

Voxabot TTS Editor

A TTS SSML Editor

                                      Custom SSML
                                      Tag
      SSML tag selection
                Provider selection
                Language and Voice
                Selection
                   General controls

SSML Applied

Voxabot Demo

Working with TTS in many languages
TTS is synthesized and generated based on abstract language rules--abstract
because sometimes it’s a guess.

Adding tags and punctuation can make unknown changes--and also fix defects.

Since most people don’t speak 120+ languages, you send it to a linguist who can
tell you where it’s wrong.

TTS feedback: How To
Just like any feedback:

 1) Be precise with replace [this] with [that]
 2) If the error is phonetic as in Bass vs. Bass [fish/instrument] then spell it out
    like:
      a) BASSE or BASE
      b) If you have pinyin or other phonetic guides use them:
              
                   你说 薄。
                   我说 薄。

Main TTS providers
Amazon: AWS Polly

Microsoft: Azure Cognitive Speech Services

Google: Google Compute Cloud: Speech

Each provides a set of languages and are gradually expanding their Neural TTS
offerings.

Other players exist in Canada, US and Australia which are building their own
models.

The Future
Voice cloning [actors go virtual]

Customized voices [brand voice]

No Translation stage [voice to voice]

Conclusion
TTS is easy, language is hard

You can almost replace voiceover.

You can also read

A Survey on Neural Speech Synthesis - arXiv

Healthy child, healthy future - Speech and language therapy for children Information and referral guidance - HSC Public Health Agency

Listeners' Linguistic Experience Affects the Degree of Perceived Nativeness of First Language Pronunciation - Frontiers

Speak Up for Communication Rights - International Communication ...

SPEECH AND LANGUAGE THERAPY ASSESSMENT IN VIETNAM - USAID

2021 Medicare Fee Schedule for Speech-Language Pathologists - ASHA

Bercow: Ten Years On - www.bercow10yearson.com

Best start in speech, language and communication: Guidance to support local commissioners and service leads - Gov.uk

Stimulability Character Cards provided by Adele W. Miccio 1952-2009

Determining Special Education Eligibility - Speech Language Impairment - Speech Language ...

Kentucky Eligibility Guidelines For Students with Speech or Language Impairment

Training Brochure 2019 2020 - www.leedscommunityhealthcare.nhs.uk/cslt www.eventbrite.com/o/childrens-speech-and-langu.. ...

You Talk Too Much: Limiting Privacy Exposure via Voice Input

2015 Medicare Fee Schedule for Speech-Language Pathologists - American Speech-Language-Hearing Association

SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding

CLINICAL GUIDELINES Speech Language Pathology Services Version 1.0 - Effective May 15, 2021 - eviCore

An Analysis of Figurative Language on Joe Biden's Victory Speech

Long COVID and speech and language therapy: Understanding the mid- to long-term speech and language therapy needs and the impact on services ...

A Review Paper Implementation of Indonesian Text-to-Speech using Java

SLAM: A UNIFIED ENCODER FOR SPEECH AND LANGUAGE MODELING VIA SPEECH-TEXT JOINT PRE-TRAINING

Neural envelope tracking as a measure of speech understanding in cochlear implant users - bioRxiv

Swordfish IV User Guide - Copyright 2007 - 2021 Maxprograms

Speech-to-Text System for Phonebook Automation

Open Voice Data in Indian Languages - A Study on - Toolkit Digitalisierung

Speech Emotion Recognition algorithm based on deep learning algorithm fusion of temporal and spatial features - IOPscience

Manual Four Way Test Speech Contest 2018-2019 - Revised JULY 16, 2018 - Rotary District 6110

Automated Speech Act Classification For Online Chat

President's speech portrait - E3S Web of Conferences

Testing the First Amendment Validity of Laws Banning Sexual Orientation Change Efforts on Minors: What Level of Scrutiny Applies After Becerra and ...

Tackling Disinformation and Online Hate Speech: Case studies of 27 EU member states, so far - Digital ...

4-H Presentations Training & Contest Guide - "TO MAKE THE BEST BETTER!" South Carolina - Clemson ...

Fusing Visual Attention CNN and Bag of Visual Words for Cross-Corpus Speech Emotion Recognition - MDPI

Parallel and distributed encoding of speech across human auditory cortex

2019 AWARDS PACKET SENIOR LAWYERS DIVISION STATEWIDE LAW DAY ESSAY CONTEST - 2019 LAW DAY THEME - South Carolina Bar

Free Expression on Campus: A Survey of U.S. College Students and U.S. Adults

HARMING WOMEN WITH WORDS: THE FAILURE OF AUSTRALIAN LAW TO PROHIBIT GENDERED HATE SPEECH - UNSW Law Journal

FASTSPEECH 2: FAST AND HIGH-QUALITY END-TO- END TEXT TO SPEECH - OpenReview

Spectrogram-based Efficient Perceptual Hashing Scheme for Speech Identification

PERCEPTION OF NIGERIAN DÙNDÚN TALKING DRUM PERFORMANCES AS SPEECH-LIKE VS. MUSIC-LIKE: THE ROLE OF FAMILIARITY AND ACOUSTIC CUES - MPG.PURE

RESTART-DCM Method Marie-Christine Franken and Ellen Laroes - REVISED EDITION 2021

Depressed fathers' speech to their 3-month-old infants: a study of cognitive and mentalizing features in paternal speech

The Airbus Air Traffic Control speech recognition 2018 challenge: towards ATC automatic transcription and call sign detection

PUBLIC DISCOURSES OF HATE SPEECH IN CYPRUS

An oscillating computational model can track pseudo-rhythmic speech by using linguistic predictions

Learning-based Practical Smartphone Eavesdropping with Built-in Accelerometer

2020 Medicare Fee Schedule for Speech-Language Pathologists - ASHA

STUDENTS SPEAK UP Perspectives of Free Speech Among Student Leaders in the University of California System - National Center for Free ...

AUDIO-VISUAL SPEECH INPAINTING WITH DEEP LEARNING

Motor neuron disease: the impact of decreased speech intelligibility on marital communication

ADASPEECH: ADAPTIVE TEXT TO SPEECH FOR CUSTOM VOICE