{"id":15786,"date":"2020-04-13T22:07:22","date_gmt":"2020-04-13T17:07:22","guid":{"rendered":"https:\/\/umang.pk\/2020\/04\/man-vs-machine-for-voice-over-in-elearning-part-2-how-tts-technology-can-enhance-elearning\/"},"modified":"2020-04-13T22:07:22","modified_gmt":"2020-04-13T17:07:22","slug":"man-vs-machine-for-voice-over-in-elearning-part-2-how-tts-technology-can-enhance-elearning","status":"publish","type":"post","link":"https:\/\/umang.pk\/ur\/2020\/04\/13\/man-vs-machine-for-voice-over-in-elearning-part-2-how-tts-technology-can-enhance-elearning\/","title":{"rendered":"Man Vs. Machine For Voice Over In eLearning &#8211; Part 2: How TTS Technology Can Enhance eLearning"},"content":{"rendered":"<p><img decoding=\"async\" src=\"http:\/\/umang.pk\/wp-content\/uploads\/2020\/04\/Man-Vs.-Machine-For-Voice-Over-In-eLearning-Part.jpg\" alt=\"\" title=\"\"><\/p>\n<div id=\"\">\n<h2>Improving eLearning with TTS technology<\/h2>\n<p>In Man vs. Machine for Voice Over in eLearning, Part 1, we explored the need for voice actors and why they perform better when given instructions and scripts that offer real guidance. In Part 2 of this series, we will explore how TTS technology can improve eLearning initiatives.<\/p>\n<p>Let&#39;s go back and look at the history of voice technology.<\/p>\n<h3>The evolution of voice technology<\/h3>\n<p>TTS is a relatively ubiquitous technology in the world of telecommunications. It was first introduced by Bell Laboratories at the 1939 World&#39;s Fair, where The New York Times, describing the machine in operation, marveled that it could speak. Talking machines have evolved since then. In 1962, John L. Kelly used a &quot;vocoder&quot; voice synthesizer to recreate the song &quot;A Bicycle Built for Two.&quot; Arthur C. 
Clarke was visiting a friend at the lab and caught the demonstration; it found its way into his novel and the subsequent screenplay for &quot;2001: A Space Odyssey,&quot; where the iconic supercomputer, the HAL 9000, sings it as it is shut down. Machine voices have sometimes fascinated us and, as they have gotten better at imitating our own, sometimes terrified us.<\/p>\n<p>Talking machines are no longer science fiction. Some of us interact daily with smart agents like Siri and Alexa, and Google&#39;s driving directions aren&#39;t just for getting around Silicon Valley. They are part of our lives. Interactive Voice Response [IVR] systems have really been the foundation of machine voice. They replaced operators in call centers; now they can listen, talk, read back bank statements, accept payments over the phone, and do almost anything a human employee can. For eLearning, we really need to ask ourselves, &quot;Are we ready to replace voice actors with machines?&quot;<\/p>\n<p>These systems are not perfect, they have sometimes been deeply flawed, and in the past they seemed primitive. We also tend to forget how quickly technology advances on its own scale. We still treat things like machine translation and text-to-speech [TTS] as if we had just landed on the moon, forgetting that this technology is almost 80 years old. A public pay phone is a rare thing these days; the phones are in our pockets. In short, it is a good time to reevaluate the state of voice technology. Talking machines were upgraded through artificial intelligence programs in telecommunications. TTS had a &quot;normal&quot; development cycle until 2015, when it converged with machine learning and Big Data, and the old problem of generating speech was revisited by AI practitioners. Natural language processing and large amounts of data made TTS smarter in 2016. 
More has changed in TTS in the last 3 years than in the previous 75.<\/p>\n<p>Focusing on the phone for a moment: both Android and iOS have full language settings to understand and respond to you. Unfortunately, you have probably also received unsolicited calls where the entire operation, including the amazing new offer, was run by a machine, and you stopped listening the moment you realized it was a recording. Some of these pause and say &quot;Can you hear me?&quot; or wait for your answer the way a human would. That kind of automated\/scripted interaction is a mix of AI and TTS. But is it good enough for eLearning?<\/p>\n<h3>Why the voice matters<\/h3>\n<p>Let&#39;s put the AI logic aside [it makes an interesting article on its own] and focus on the delivery vehicle: the voice.<\/p>\n<p>If we return to the main premise of communicating on at least two fronts, voice and text, then yes, you can check the box once the words have been spoken. But there are many components to a voice:<\/p>\n<ul>\n<li>Should it be male or female? Should you record both, or use a voice that cannot be distinguished as either?<\/li>\n<li>What kind of tone? Should it be excited, relaxed, or flat?<\/li>\n<li>What breathing pattern and rhythm? Fast, slow, or rhythmic?<\/li>\n<li>What type of pronunciation or accent? Southern, Canadian, etc.?<\/li>\n<\/ul>\n<p>Think of the early days of driving with Google as your navigator. Do you remember when the voice butchered street names or utterly mispronounced cities? Or how, when the navigator says &quot;Recalculating,&quot; you feel the application is angry with you for not making that left turn? 
It is often perceived as personal precisely because the TTS system is so impersonal.<\/p>\n<h3>Speech Synthesis Markup Language [SSML]<\/h3>\n<p>TTS has a solution for that. It&#39;s called Speech Synthesis Markup Language [SSML], and it allows emphasis, substitutions, and the use of phonemes and other tricks.<\/p>\n<p>With modern TTS systems, telling the machine how to pronounce &quot;<em>tomato<\/em>&quot; is easy. Just say &quot;toe-may-toe&quot; or &quot;toh-mah-toh.&quot; In the southern United States, the pronunciation of pecan can be taught like this:<\/p>\n<p style=\"padding-left: 30px;\"><speak><\/p>\n<p style=\"padding-left: 30px;\">You say, <phoneme alphabet=\"ipa\" ph=\"p\u026a\u02c8k\u0251\u02d0n\">pecan<\/phoneme>.<\/p>\n<p style=\"padding-left: 30px;\">I say, <phoneme alphabet=\"ipa\" ph=\"\u02c8pi.k\u00e6n\">pecan<\/phoneme>.<\/p>\n<p style=\"padding-left: 30px;\"><\/speak><\/p>\n<p>The tool here is the phoneme, best defined as a building block for the sounds people make. The funny alphabet is the International Phonetic Alphabet; it captures the sounds human voices [mouths, lips, etc.] produce. You can encode almost any human-made sound and play it back.<\/p>\n<p>If the word is the name of a company, a brand, or a person, it is important to have a pronunciation guide for what it &quot;should&quot; be called. Sometimes the TTS system will guess the pronunciation of a word based on its training, and that can be bad if it is a well-known sound. Also, some words are pronounced differently depending on how they are used: &quot;bass&quot; is a fish or a type of musical instrument. You can now specify to a very fine degree how things should sound.<\/p>\n<p>These systems are fully customizable in several ways: language models, voices and generated sounds, and modeling around other speakers. 
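<\/p>\n<p>Homographs like &quot;bass&quot; can be handled the same way as the pecan example; here is a minimal sketch using IPA phonemes [the sentence is hypothetical, and exact tag support depends on your TTS engine]:<\/p>\n<p style=\"padding-left: 30px;\"><speak><\/p>\n<p style=\"padding-left: 30px;\">He plays <phoneme alphabet=\"ipa\" ph=\"be\u026as\">bass<\/phoneme> while fishing for <phoneme alphabet=\"ipa\" ph=\"b\u00e6s\">bass<\/phoneme>.<\/p>\n<p style=\"padding-left: 30px;\"><\/speak><\/p>\n<p>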
SSML allows various customizations around:<\/p>\n<ul>\n<li>Reading breaks<\/li>\n<li>Speech rate<\/li>\n<li>Voice pitch<\/li>\n<li>Vocal tract length [for a deeper voice]<\/li>\n<li>Language used [useful when reading English or foreign names]<\/li>\n<li>Pronunciation &quot;fixes&quot; with phonemes, using phonetics<\/li>\n<li>Visual synchronization with lip movements [visemes]<\/li>\n<li>Parts of the sentence [\u201cWill you read the book?\u201d vs. \u201cI already read the book\u201d]<\/li>\n<\/ul>\n<p>Clearly, there are options. But how do you choose between the two, human talent or TTS?<\/p>\n<p>The factors that typically push one toward TTS are simple: demand is greater than capacity. In other words, the amount of voiceover exceeds the capacity to hire human actors. This does not mean every job splits this way; some jobs can only be served by TTS.<\/p>\n<p>TTS systems tend to be customized with a vocabulary [dictionary], and a few hours of engineering time are spent correcting &quot;errors&quot; for every 30 minutes of audio. Still, this rate is substantially cheaper than traditional talent.<\/p>\n<p>The flip side is that humans have prosody, the term for naturally varying speech patterns and differences in intonation, pitch, speed, volume, and so on. The things that give a voice its richness. This is 100% available with a studio session. It is not as available with TTS unless you dedicate hours of work to minutes of audio.<\/p>\n<p>The recommendation is to ask an eLearning expert and also validate the cost\/benefit of being in more languages. Most learners will probably forgive the TTS if it means they can listen to the lesson instead of reading a transcript.<\/p>\n<p>In other cases, a professional voiceover gives the lesson a level of polish that is difficult to replace; but this comes at a cost. 
An important observation to share is that these things cost less at scale, up to a point.<\/p>\n<p>Book ten minutes of studio time and the talent will be there for an hour; so why not 25 minutes? Or 30? These additional minutes are &quot;pooled&quot; into the base &quot;booking&quot; rate, so the cost per minute decreases as you add them. It&#39;s like buying an extra-large pizza and sharing it with everyone: you end up with big savings. For individual instructional designers, this could mean grouping 2-3 courses at once; for organizations, coordinating language launches is common practice. If you record all the Japanese courses at the same time, you pay less overall than if you had done them one by one.<\/p>\n<p>Unfortunately, getting the stars to align across multiple projects doesn&#39;t always happen, but it&#39;s still a valid cost optimization strategy. As for TTS? It doesn&#39;t really work the same way. It&#39;s almost a flat rate: the more minutes, the more engineering. Some optimization may happen, but you never have to pay a booking fee, and adding bits and pieces doesn&#39;t carry the same upfront costs.<\/p>\n<h3>TTS&#39;s future is now<\/h3>\n<p>For the past few years, Google and Microsoft have been experimenting with custom voice variants, where you can provide a voice model and have it grafted onto a TTS. Imagine being able to re-shoot and redo scenes in movies after the actor has left the production, or correct flubs in takes that would otherwise be perfect. In November 2016, Adobe introduced a technology called &quot;VoCo&quot; at an event with a guest actor. At this event they took the voice of the guest, the actor Jordan Peele, and showed him &quot;Photoshop for voice&quot;: the technology could imitate the actor saying anything. It faced a huge backlash from people concerned about its potential for misuse. 
Mark Randall, vice president of creativity at Adobe, responded by saying:<\/p>\n<blockquote>\n<p>&quot;That is because, in essence, technology is an extension of human capacity and intention. Technology is no more idealistic than our vision of the possible and no more destructive than our misguided actions. It&#39;s always been that way.&quot;<\/p>\n<\/blockquote>\n<p>Nothing else has been released about the project since then.<\/p>\n<p>Also, in September 2016, Google&#39;s DeepMind released WaveNet, which, unlike the traditional &quot;ransom note&quot; style of TTS output [audio snippets stitched together into words], was actually modeled on real speech and sounded like it. This neural-network approach to voice generation is what today&#39;s most modern TTS systems use. But cloning voices and altering recorded speech simply by typing different words are still to come. There is also work on the lip-sync and dubbing side: adding computer vision [lip reading] to transcribe speech, or taking the cloned &quot;fake&quot; voice and cloning &quot;lip movements&quot; to match, further eroding humans&#39; standing as the gold standard for voice-over.<\/p>\n<p>Recently, we have been able to &quot;patch&quot; audio using TTS to correct small errors in a voice-over with an edit. This is nothing new for audio editing, but it is new in that we no longer have to bring the talent back to re-record a line in an eLearning course. Standalone words like &quot;Next&quot; or &quot;Question 2&quot; are also safe enough in an eLearning test environment that TTS is perfectly suited to deliver in 1 hour what would take a studio 2 hours, plus the time to find the talent [days]. These patches are limited: if it&#39;s a long speech, a voice actor still beats TTS.<\/p>\n<p>The big picture is also changing for voice artists. A startup in Montreal has been developing a &quot;voice bank&quot; tool. Imagine if your entire eLearning catalog was voiced by your charismatic training director. 
How could she record more voice-overs than her schedule allows? What about long after she has left, when you still want to use her voice? It is now possible to create a model of a real person&#39;s voice and then use it in TTS. As with the Adobe example, this is open to ethical questions that we are just beginning to ask. Does the compensation model become a royalty model? Does the artist&#39;s voice become the intellectual property of the company that created it, with all rights?<\/p>\n<p>Today, voice bank solutions involve preserving the voices of people facing cancers that would take away their ability to speak. Famously, the film critic Roger Ebert lost his voice, but through an earlier version of this technology it was rebuilt from hours of audio he had produced. These projects used to be monumental efforts involving months of recording and engineering. With the advances of the past 2 years, it now takes just 2 hours of voice recordings and a few hours of processing.<\/p>\n<h3>Conclusion<\/h3>\n<p>For eLearning voiceovers, the status quo will hold for years to come, until TTS technology becomes ubiquitous and &quot;voice repair&quot; options become mainstream &quot;pickups.&quot; As with other automatable tasks, calling the voice actor back for a &quot;redo&quot; will become less likely. Also, premium voice banks will be sold or built for niche markets and will sound like real people. Those actors will continue to have a profession and the ability to license their voices.<\/p>\n<p>Some TTS systems today operate on a licensing model [think automated announcements like elevators] where the same recording is used a million times. For eLearning, these external factors won&#39;t make much of a difference except to lower the cost of entry to certain markets and make maintaining annual mandatory training less costly, since you can edit the same voice and add new 
details in minutes for all languages.<\/p>\n<p>Many courses today are perfectly happy to include TTS, not just as an aid [think screen readers for the blind], but as the standard voice. Eventually, it will be of better quality, good enough that narration becomes as ubiquitous as &quot;color graphics&quot; or &quot;air conditioning&quot; or anything else that was once the high-tech showpiece.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Improving eLearning with TTS technology In Man vs. Machine for Voice Over in eLearning, Part 1, we explored the need for voice actors and why they perform better when given instructions and scripts that offer real guidance. In Part 2 of this series, we will explore how TTS technology can improve eLearning initiatives. Let&#39;s go back [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":15787,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"default","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","ast-disable-related-posts":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[1318,4647,9639,15950],"tags":[],"class_list":["post-15786","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-audio-in-elearning","category-elearning-voice-over","category-localization","category-text-to-speech"],"_links":{"self":[{"href":"https:\/\/umang.pk\/ur\/wp-json\/wp\/v2\/posts\/15786","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/umang.pk\/ur\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/umang.pk\/ur\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/umang.pk\/ur\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/umang.pk\/ur\/wp-json\/wp\/v2\/comments?post=15786"}],"version-history":[{"count":0,"href":"https:\/\/umang.pk\/ur\/wp-json\/wp\/v2\/posts\/15786\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/umang.pk\/ur\/wp-json\/"}],"wp:attachment":[{"href":"https:\/\/umang.pk\/ur\/wp-json\/wp\/v2\/media?parent=15786"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/umang.pk\/ur\/wp-json\/wp\/v2\/categories?post=15786"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/umang.pk\/ur\/wp-json\/wp\/v2\/tags?post=15786"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}