VOICE QUALITIES IN AUDIO SUBTITLES: OPPORTUNITIES AND CHALLENGES IN VOICE DESIGN FOR ACCESSIBILITY AND BEYOND - DIVA PORTAL
DEGREE PROJECT IN MEDIA TECHNOLOGY, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2021 Voice Qualities in Audio Subtitles: Opportunities and Challenges in Voice Design for accessibility and beyond ANNE-CHARLOT SCHOLZ KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Abstract
This paper explores novel experiential qualities of the voice in Audio Subtitles through research through design. Audio subtitles is an accessibility service for users who have trouble comprehending subtitles in audiovisual content and has been newly developed for video on demand platforms such as SVT Play. In order to explore possibilities in its voice design, short video clips of films and TV series with different types of audio subtitles were produced, presented to and discussed with a small number of potential users of audio subtitles that included people with dyslexia, cognitive difficulties and autism. Results indicated that applied voices that did not support the user's expectations, low and high pitches as well as low-quality speech synthesis, made for uncomfortable experiences, which could prove to be useful for provoking reflection and challenging norms. The paper also discusses how voice design for this service has the potential to match the filmmakers' intentions by translating more than semantic information, as well as how audio subtitles could potentially be produced by professional sound designers and filmmakers instead of video on demand services. Finally, challenges such as misgendering and insensitive choices of voice in voice design for audio subtitles are considered, underscoring how ethics can't be avoided when working with the voice modality.
Sammanfattning (Summary)
This thesis explores new qualities of the voice in audio subtitles through research through design, a method in which knowledge is created through the design process and reactions to designs. Audio subtitles is an accessibility service for users who have trouble reading and following subtitles in audiovisual content, and it has recently been developed for video on demand platforms such as SVT Play. To explore possibilities in its voice design, short video clips of films and TV series with different types of audio subtitles were produced. They were presented to and discussed with a small number of potential users of the service, among them people with dyslexia, cognitive difficulties and autism. The results indicated that voices that did not support the user's expectations, low and high pitches, and low-quality speech synthesis made for uncomfortable experiences, which may prove useful for provoking reflection and challenging norms. The thesis also discusses how voice design for audio subtitles has the potential to match the filmmakers' intentions by translating more than semantic information, and how audio subtitles could be produced by professional sound designers and filmmakers instead of video on demand services. Finally, challenges such as misgendering and insensitive choices of voice in voice design for audio subtitles are considered, which underscores that ethics cannot be avoided when working with the voice modality.
Voice Qualities in Audio Subtitles for Film and TV: Opportunities and Challenges in Voice Design for accessibility and beyond
Anne-Charlot Scholz
KTH Royal Institute of Technology
Stockholm, Sweden
acscholz@kth.se

INTRODUCTION
The subject of voice design has in recent years become increasingly relevant and discussed in the research field of Human Computer Interaction (HCI). This is in part due to the development of voice assistants and other forms of voice-user-interfaces, which has led to an interest in examining the voice as a design material and exploring its many qualities [8, 50, 39, 28]. In this paper, voice qualities are defined as all conceivable factors that contribute to the perception and interpretation of the voice such as perceived age, gender, dialects and expression, which makes for an infinite number of qualities. The voice is an especially important factor in the context of accessibility services such as Audio Subtitles (AST), which is intended to ensure access to audiovisual content for users that have difficulties reading and comprehending the subtitles [26]. With the continuous development of an online media environment and video on demand streaming as well as newly introduced legislation regarding accessibility, the demand for audio subtitles in Sweden has been shown to be growing [1, 36, 44]. Improved quality of speech synthesis has simultaneously opened up possibilities to implement text-to-speech AST in a cost-effective manner [51], which has encouraged video on demand platforms to build such a tool for their service. A team at the video on demand service of Swedish Television (SVT Play) [2] has recently built a text-to-speech AST service and offered this paper an inside look into the choices made for the feature. Here they implemented a so called voice-over effect in audio subtitles, which infers a voice speaking over the original performers seen on screen, which could make for an especially interesting voice-user-interaction to explore [7, 51].

The aural dimension of films - i.e. the sound and voice quality - has been said to be immensely relevant for the user experience, interpretation and emotional reception of films [43, 20]. The voice is arguably the most prominent factor in audio subtitles and is itself an intimate and intricate modality for humans, making it all the more relevant to explore [28]. Some efforts have previously been made to examine the user experience of voice qualities in speech synthesis, but there is space for more research in the context of voice design in audio subtitles [26, 16, 14].

This paper therefore aims to explore novel qualities of the voice modality in audio subtitles for fictional film and TV. A qualitative research through design methodology was employed, gaining insights on the user experience of the voice by designing prototypes and examining reactions to them [53]. For this, 28 final prototypes of AST with different applied voice qualities were designed, created and presented to a small group of potential users of the service during individual interviews. Participants included people with dyslexia, cognitive difficulties and autism but it is believed that results will be relevant for all users in need of AST. Results were analyzed and discussed in order to identify possible implications for voice design in audio subtitles and beyond, which could prove to be beneficial for AST- as well as HCI developers, users and researchers.

RELATED WORK
This section presents previous research on audio subtitles, voice qualities and sociophonetics, wherein authors discussed concerns within voice design.
Voice qualities in Film and HCI
The objective of audio subtitles has been said to fully immerse the viewer in the story told on screen and provide access to enjoyable high-quality entertainment through a single cohesive experience [15, 52]. The researchers stated that this should be achieved by ensuring that the service does not disrupt or minimize the emotional journey the source text of the film has to offer. Many factors are to be considered to achieve a high quality user experience, but in the case of AST one of the most important factors to the experience and perceived quality is said to be the aural medium and more specifically, the voice [43, 20]. Previous research has highlighted the relevance of sound and voice quality in relation to engagement with the content and emotional reception of a film, which is why aural and prosodic properties in the voice such as speech style, intensity, delivery, intonation and accents should be examined more closely [20, 52, 43, 28]. These properties have also been shown to be crucial for the user's quality assessment and interpretation of emotion and intent of the source text [28, 15]. Other characteristics of the voice and how they are perceived such as gender and age have also been suggested to play a role in the immersion of the content.

Vocal expression studies have shown the complicated nature in which a voice is perceived and analyzed by humans [28], making the development of speech synthesis a challenging ordeal. Many countries have instead resorted to dubbing international films and TV-series, allowing national performers to perform vocally to the facial performance of the original actor in the film [10]. In audio subtitles however, the common practice is to instead employ speech synthesis and the voice-over effect [7, 51]. The user experience of aural and voice qualities in synthesized speech has previously been examined, but in the context of AST and its voice design, there is more left to explore [26, 16, 14].

What follows are some categories of voice qualities that have previously been examined and are of interest in relation to audio subtitles and this thesis.

Gender of the voice
Studies that have examined the user experience of synthetic voices have shown that a synthetic female voice was preferred over a masculine one in that it was perceived to be more natural and require less effort by the user to follow [16]. In a study by Remael [45] the actors performing in the film also provided their voices for the audio subtitles. This meant that the user heard several different voices in AST of the same gender (and in this case, exactly the same voice) as the actor portraying the character, which was well received.

Emotional Speech Synthesis
This also allowed for the actors to infuse the audio subtitles with emotion and a similar performance as seen in the film [45]. Emotional text-to-speech AST is not commonly used, but it has been shown to be preferred over expressionless synthetic speech [47]. A mood-congruent vocal quality in films was preferred due to it fitting into the emotional landscape better than with mood-incongruent qualities, allowing for greater immersion in the content [15].

Rendering
In Remael's study [45], the rendering was additionally implemented to emulate how the original audio could and was intended to be heard, such as through a telephone or a radio. Results suggested that the rendering was an important factor for the narrative cohesion of the product.

Dialects
Pucher and colleagues [42] examined text-to-speech voices with regional dialects, which users perceived as fun and personal, but not appropriate for formal settings. It was further claimed that dialect and sociolect speech synthesis will evoke users' emotions because of their association of the dialect to a specific social group.

Whispering
In a recent research through design study, Parviainen and colleagues [39] examined interaction with voice-assistants through whispering. The authors criticized that voice modalities were usually discarded in the interactions, though they are vital for human-to-human interaction. Elimination of them would ignore humans' understanding, interpretation and emotional impact of vocalised voice. If implementations of audio subtitles similarly discard those factors - as they usually focus on the interpretation of semantic information - intended emotional information and experience has been said to potentially get lost to the viewer [15]. Similarly, other voice qualities contain information that could be consequential for the perception of the film and alter the quality of the user experience.

Sociophonetics in voice user interfaces
The topic of voice as a design material has previously been researched and discussed in HCI in relation to voice assistants, since it's been concluded that voices have the potential to shape user perceptions and experiences and should be well considered [8, 50]. Studies had previously usually focused on aspects of intelligibility and naturalness of the voice as well as the technological development of voice- and speech interactions with voice-user-interfaces [50]. Research on sociophonetics - which is the study of social factors that influence production and perception of speech - has found that speech could indicate many factors such as gender, age, social class, nationality and sexuality, with one project even creating synthetic voices that were perceived to be introverted and extroverted [35, 50]. Sociophonetics have previously been identified to often be omitted from research concerning voice design for smart devices, in spite of the potential social impact the choice of voice could have for voice-user-interfaces [8, 50]. Voice-user-interface designers usually employ "one voice fits all" for their devices and decide which voice qualities are deemed standard or what a neutral, non-dialect and appropriate voice in the given national context is, which is grounded in the designer's existing speech biases. Designing the voice for voice-user-interfaces thus comes with responsibility and social consequences.

It has been argued that choosing a standard voice reinforces speech ideologies, stereotypes and prejudice against certain ways of speaking and an ignorance of the wide variety and diversity of voices and dialects that exist in each language [50, 8]. Users have previously shown an apparent preference for voices in voice-user-interfaces that speak with their own native accent, a phenomenon the authors have called the similarity-attraction effect. This in addition to the findings that different generations of users speak and interpret language differently are some of the arguments for the concept of individualization that Sutton and colleagues [50] propose could be employed in voice design, allowing the user to choose a voice of their preference for voice-user-interfaces. To address the issue of filter bubbles apparent in voice design today, the authors simultaneously propose reverse individualisation, where the user is instead faced with voice qualities unfamiliar to them in order to normalize diversity in voice and challenge what is considered a "standard" voice for voice-user-interfaces. Context awareness is another concept proposed for voice design, in that multiple voices can be heard depending on the use and context of the voice-user-interface [50].

Overall, researchers call for more expression in voice-user-interfaces to embody the social and cultural identity in speech and more of all speech variations in the world to be available for those interfaces [50, 8]. This research also showed that the user experience is going to be influenced by each user's individual sociocultural knowledge and experience of voices [50].

METHOD
This section gives a brief explanation on the methodology used in the paper, research through design, and describes how the study was conducted in the following steps: (i) design process, how and which voice- and video material were chosen for the design artefacts, (ii) how the artefacts were produced and (iii) the recruitment- and data collection process in the form of interviews.

Research through design is a research methodology wherein knowledge is garnered through the process of designing and exploring reactions to designs [53]. It is described to create artifacts that provide concrete embodiments of theory and technical opportunities [53, 17]. It is further stated that the intent should be to produce knowledge for the research and practice communities. Research through design has been previously agreed upon to be about research on the future [54], akin to design futuring, a term which has been used to describe approaches for exploring futures with design, often in an effort to change the present [33]. Design allows the researchers to envision the futures in more experiential detail through presenting and putting artifacts or pieces of fiction up for discussion.

This paper aims to follow the principles of research through design and design futuring by designing prototypes of voices for audio subtitles and engaging with the user group's reactions. The paper will further analyze the resulting experiential qualities of the voice and identify implications for voice design in HCI. Audio subtitles is first and foremost an accessibility service and since the designer is not part of the target group, emphasis will be put on data and knowledge created by the participants instead of the design process of the prototypes.

Choosing voice qualities
Previous research on phonetic qualities of synthetic voices was studied to gain inspiration for what should be a preferred state for AST on video on demand. Possible voice and sound qualities were consequently laid out and compared to the current state of text-to-speech audio subtitles. The topic was additionally reviewed in informal discussion with people working with the voice in different ways such as a professor in voice acoustics, an opera singer and developers of AST. In hindsight the chosen qualities were identified to fit into three categories, namely assumptions about the speaker, expression of the speaker and context of the speaker.

Assumptions about the speaker
Here the sociophonetics of the voice such as dialects, pitch, age, sex and possible speech impediments were considered. Inspiration was drawn from the synthetic voices kindly made available by Acapela Group, a company that develops text-to-speech software and services [21]. The voices included an Indian English dialect, Swedish regional dialects, Swedish children's voices and an elderly English voice. The team at SVT Play sourced the voices used for AST from Acapela Group as well, which was further reason to include the attributes of dialects and age that the company had available.

Expression of the speaker
This category grouped the voice qualities that expressed emotion and to an extent circumstance, such as whispering and singing. In choosing voice qualities for audio subtitles, the topic was approached by first reviewing previous research on voice modalities, Audio Description and AST. Something that was often examined was the experience of an emotional tone and expression in Audio Description as opposed to a neutral one, especially with text-to-speech [47, 15]. This led to an interest to design some prototypes with emotional AST, which were intended to go with the overall emotional vocal performance by the original performer in the clip. There was a limited number of options available for emotional synthetic speech, a sad and a glad male American English voice, both of which were chosen to be included. After further thought of how emotion can be expressed, the attributes of shouting, whispering and singing were considered, which are often ways to express circumstance and feeling. Whispering could among other things indicate intimacy or fear. Shouting could suggest the speaker to be happy, angry, agitated or distressed. Singing could also be an expression of happiness, a performative act or even be used in a therapeutic manner (as seen in "The King's Speech" (2010) [25]).

Context of the speaker
This category represented the way the voice would be heard differently depending on the context seen on screen, for instance if the voice was heard in a large hall, a small room or through a telephone. It was reflected upon how voices could be heard in movies and TV, where audio technicians put careful thought to how the voice should be perceived in a certain setting and scene. This called for an inclusion of a particular rendering and distortion of voices when they were for example heard through a telephone or a radio. Additional thought was put on the proximity attribute of the voice, which could say a lot about the setting of a scene. Is the speaker far away or near, left or right from the audience and what are the acoustics of the setting? Designs with different reverb sizes and filter effects were thus included to emulate different settings, as well as playing around with voices being heard from different mono outputs, i.e. sounds coming from only one side of the speakers.

Omitted voice qualities
An effort was made to include a variety of voice qualities that were relevant in the context of AST in film and TV. Some ideas that came up were not included in the end, due to the topic being broad enough as it was or due to a lack of resources. These ideas included opera song in AST, nasal voices signaling a cold or sadness and speech impediments. The latter topic in particular would have been interesting to examine, challenging the norms in voice design. Unfortunately, no appropriate voice performer was available to produce AST with a speech impediment.

Choosing video material
A lot of the video material was chosen from prior personal experience and mapping out what kind of scenes would fit with certain voice qualities, although this requirement was later discarded. TV and film as mediums of scripted fiction were deemed to be similar enough to both be included in the study. Documentaries, operas or news programs were not included, as they were assumed to have a different relation to the use and design of audio subtitles.

Several international TV series and movies were reviewed in order to find appropriate scenes for the voice modalities. The final prototypes consisted of video material sourced from several different streaming services when a certain TV series or film was held in mind. "The King's Speech" (2010) [25] was for example identified to be suitable potential source material, since it had a lot of situations useful to the voice qualities that had been laid out. This included several scenes with a voice being heard over a radio, with an echo or in a cathedral and even speak-singing, which was otherwise thought hard to find. Singing scenes could oftentimes be argued to not need translation in movies and TV, since the important take-away from those scenes could be said to be the atmosphere, melody and vocal performance instead of the text. In "The King's Speech" however, the singing is part of the therapy the protagonist has to overcome his stammer. In an arguably pivotal scene for the relationship between two characters the protagonist has trouble speaking about his childhood and instead sings the words out, which would be important to translate and include in audio subtitles.

Supporting VS Contradicting Expectations
Part of the design process was deciding whether the prototypes should support expectations of the vocal performance of people seen on screen or if they instead should be provocations, meant to contradict and challenge expectations. For a while voice qualities were matched to appropriate scenes and thus supported expectations, but soon it was deemed interesting and engaging to discuss provocations that might not be accepted by the target audience and why that is, a concept akin to what Sutton and colleagues discussed in their study [50]. This approach was assumed to result in more fodder for discussions and reflections. Prototypes thus began to include scenes where a child's voice was used for adult speakers, singing AST was used for non-singing scenes, whispering AST was likewise used for non-whispering scenes, the proximity of the voice did not match the screen and dialects, gender and age did not correspond to the original speaker.

Design conditions and tools
Available free online text-to-speech resources were reviewed to find suitable material and software for prototyping. Acapela Group kindly granted access to their online Editor Acapela Cloud with a wide range of synthetic voices - including Swedish ones - which were thus used for text-to-speech material [21]. Qualities that were not available such as singing and to an extent whispering AST were instead created with a natural human voice from an amateur female voice actor. For the rendering of the voice, the audio editing software Audacity and the available audio effects in iMovie were used, for instance to emulate a radio station.

Video material was chosen from video on demand services SVT Play [2], ComhemPlay [3] and Netflix [13]. Short sections of the video material were chosen and the translated subtitles were transcribed. Audio files of the synthetic and human voice reading the subtitles aloud were created. SVT Play had four synthetic voices available and provided a short film ("Index" (2020) [32]) with separate audio files for the AST. All video and audio files were combined in video editing software iMovie in a way to mimic audio subtitles.

Final Prototypes
A total of 28 final prototypes were created, each about a minute long. They are referred to as V1-V28 (see Table 1).

Voice qualities
The varying qualities implemented were as follows: Gender (male, female); Age (old, middle-aged, child); Dialect (Finland Swedish, Scanian Swedish, Indian English); Pitch (low pitch, high pitch); Expression (happy, sad, whispering, shouting, singing); Output effect (radio, telephone, echo, muffled, mono); Reverb size (small room, large room, cathedral).

Video material
TV series and films used were: "Call my Agent!" (French production) [24, 23, 22]; "Index" (Swedish production) [32]; "Kuch Kuch Hota Hai" (Indian production) [27]; "Narcos" (American production) [38]; "Perfume" (German production) [29]; "The Bureau" (French production) [46, 34, 9]; "The King's Speech" (British production) [25]; "Weissensee" (German production) [18].

Text-to-speech material
The following synthetic voices were provided by Acapela Group and used for the prototypes: Deepa (f, Indian English); Elin (f, Swedish); Emil (m, Swedish); Emma (f, Swedish); Erik (m, Swedish); Filip (m, Swedish child); Freja (f, Swedish child); Mia (f, Scanian Swedish); Samuel (m, Finland Swedish); Will (m, American English); Will from Afar (shouting); Will Happy; Will Old Man (elderly); Will Sad; Will Up Close (whispering) [21].
Table 1. Prototypes V1 - V28 of audio subtitles (AST)

V | Video Material | Description
1 | The King's Speech | A voice speaks over the radio, the AST (Samuel) has a Finland Swedish dialect.
2 | The King's Speech | A voice speaks over the radio, the AST (Mia) has a Scanian Swedish dialect.
3 | The King's Speech | A voice speaks over the radio, the AST (Erik) has a radio effect.
4 | The King's Speech | A conversation between two men with a high pitched AST (Erik).
5 | The King's Speech | A conversation between two men with a low pitched AST (Erik).
6 | The King's Speech | A conversation in a cathedral with a large reverb size on the AST (Erik), having it sound like it is heard in a cathedral.
7 | The King's Speech | A conversation in a room with a small reverb size on the AST (Erik), having it sound like it is heard in a small room.
8 | The King's Speech | A conversation in a room with a larger reverb size on the AST (Erik), having it sound like it is heard in a larger room.
9 | The King's Speech | The speaker holds a speech in an arena, the AST (Erik) has an echo effect on it.
10 | The Bureau | A conversation between a man and a woman in a car. The AST alternates between female (Elin) and male (Erik), one character having a left or right mono output.
11 | The Bureau | A conversation heard over telephone recordings, the AST (Elin) has a telephone effect.
12 | The Bureau | A conversation between two men, the AST is a human voice whispering.
13 | The Bureau | A conversation between two men, the AST is a human voice singing.
14 | Call my Agent! | Two women are whispering to each other, the AST is a natural voice also whispering.
15 | Call my Agent! | Two women are whispering to each other, the AST (Will Up Close) is also whispering.
16 | Call my Agent! | Two men are shouting at each other, the AST (Will From Afar) is also shouting.
17 | Call my Agent! | Two men are shouting at each other, the AST (Erik) is muffled.
18 | Call my Agent! | Two men are shouting at each other, the AST (Filip) is a child.
19 | Call my Agent! | An elderly woman holds a speech, the AST (Will Old Man) is an elderly man.
20 | Perfume | A woman speaks with a child, the AST alternates between a female adult voice (Elin) and a child's voice (Freja).
21 | Weissensee | A woman sings a song, the AST is a human voice singing.
22 | Kuch Kuch Hota Hai | Two women have a conversation, the AST (Deepa) has an Indian English dialect.
23 | Narcos | A man holds a speech, the AST (Will Happy) has a glad expression.
24 | Narcos | A man holds a speech, the AST (Will Sad) has a sad expression.
25 | Index | AST (Elin) speaks over several people.
26 | Index | AST (Erik) speaks over several people.
27 | Index | AST (Emil) speaks over several people.
28 | Index | AST (Emma) speaks over several people.

Recruitment and Interview Process

Participants
With the help from two organizations working with people with dyslexia and visual or cognitive impairments, Dyslexiförbundet [12] and Begripsam [5], 10 participants were recruited for virtual interviews via video conference system Zoom Video Communications [55]. Participants are referred to as P1-P10. All of them were potential users of synthetic speech and accessibility services such as audio subtitles or worked with accessibility questions in some form or other. Several of the participants were for instance members of the board or consultants at one of the organizations and had a lot of experience with this and similar subjects. Others simply had themselves or a relative with a need for an accessibility service such as AST. Participants were told they would get to experience and react to some conceptual audio subtitles. They were asked to interact with the prototypes and perform a "think-aloud" evaluation to discuss their experience in individual semi-structured interviews.

Interviews
The interview consisted of two parts. During the first, participants watched V25-V28 in a previously determined order and answered questions about their spontaneous reaction directly after each viewing. This made for a gentle introduction to the topic of audio subtitles, since it showcased a real-life scenario of how AST are implemented today, namely at SVT Play. This part was skipped over with two participants due to time constraints (P6, P9). During the second part, participants viewed a selection of five prototypes one at a time and subsequently answered questions on the experience of them. The interview concluded with synoptic and broad questions on their expectations and experience with AST and the previously viewed prototypes.

Interviews were recorded and later transcribed with the permission of the participants and reactions to experiential qualities of the voice for audio subtitles were analyzed.

RESULTS
Results showed how differently, as well as similarly the prototypes could be experienced. The reactions were grouped after the voice qualities examined in the prototypes, accompanied by relevant quotes from participants.

Gender of the voice
The question of what gender the voice had was prominent throughout the study and commented on no matter if it was the intended focal point of the prototypes seen by the participants. It was stated that it was common to use only one voice for accessibility services such as audio books or screen readers due to there often being no option to combine several voices. Participants seemed to therefore be more accepting of the gender-coded AST not fitting the original speaker and it was seldom argued to be the most important factor in a clip, but a general wish for the gender to match the speaker was expressed more than once and described to improve the experience. This became apparent with prototype V10, where the voice of a male- and a female-coded AST was used for a conversation between a man and a woman. Two of three
participants (P1, P2) had positive reactions to the fact that the voices matched the speakers. All but one participant suggested that there should always be an option to at least be able to choose between a male and female voice for each film and TV series. Seven of the participants (P1, P2, P5, P6, P7, P9, P10) described usually choosing which voice to use - specifically the gender of the voice - based on a certain category and genre or simply by how they feel that day. This indicates that the gender of the voice used for AST is indeed a matter of interest to the user, who would like to control that choice for each instance of use.

Age of the voice

One of the examined voice qualities was the question of age in a voice used for audio subtitles. The majority of prototypes used voices that were explicitly not child-like or elderly, but rather middle-aged. Still it seemed possible to discern differences in ages among them, as was expressed by P1 when comparing the voices Erik and Emil (V26, V27): "I feel like Erik is a man in his 50’s and this (Emil) was maybe a man in his 30’s". This also impacted the perceived appropriateness of the AST voice used for a certain speaker, as P1 later commented on in a scene in prototype V1.

Here I would have liked to have a (...) neutral Swedish voice like Erik or maybe Emil. Possibly Erik, since this was an older man (speaking). - P1 on V1

The prototypes where the AST voice included that of a child garnered strong positive reactions to the child’s voice, especially in V20, where participants thought the voices matched with the speakers. As for V18 where a child’s voice was matched to adults, participants pointed out that the voice was well done but was not acceptable in this kind of situation. The clip instead garnered confused reactions over who was speaking, followed with strong opinions on when a child’s voice would be appropriate to use in audio subtitles. All participants who discussed this topic (P2, P4, P5, P6) expressed that it would fit either a children’s movie, a movie from the viewpoint of a child or when used solely when a child was speaking. Prototype V19 provoked a similar initial confusion over who was speaking in the clip, as the voice was an elderly man and the original speaker an elderly woman. Both participants (P3, P7) who saw this clip felt that this voice was not appropriate at all to use in this case, although the perceived age-appropriateness was mentioned to be a nice touch. They felt that the mismatched gender and the particularly deep and croaky nature of the audio subtitles voice made for too big a contrast to the speaker appearing on the screen in this case, but that the voice would fit well for elderly men in movies. P7 even likened the experience to watching the character cross-dress, which was not the case.

Dialogue

Another quality most would comment on unprovoked was the use and combination of several voices in one clip. This was often already discussed in the first part of the interviews during clips V25-V28, where a woman and several men were speaking with one male or female audio subtitles voice covering them all. Prototypes V10 and V20 used different voices for different characters seen on screen and three out of four participants (P1, P2, P4) that viewed these prototypes had strong, positive, almost yearning reactions to it. They said it made it easier for them to follow the contents of the clip as well as improving the experience overall. P4 described it as bringing the events closer to them as a viewer.

In some other prototypes a dialogue option was discussed as a must-have, because of the otherwise unintelligible nature of the clip (for instance V16-V18). It was speculated that different voices for different characters would be extremely helpful in following and keeping up with the conversation, especially when the exchange was quick and heated such as in an argument. The exception was P5, who said they would prefer one voice for everyone in the clips in order to be able to focus. This participant (who had epilepsy and autism syndrome) found the dialogue prototype V10 especially hard to bear because of the mono effect that was implemented, which became extremely uncomfortable in their ears. This led to the participant not realizing or identifying that there was both a male- and female-coded AST voice used in the clip in question.

Mono effect

This mono effect in V10 was generally barely noticed in practice and only somewhat approved of in theory. P1 stated that the mono output made the voices sound more "flat", although they further said that the effect made it clearer who was speaking and where the voices were coming from, as well as it being a fun idea, which was seconded by P2. In practice it seemed that the dialogue quality was enough in itself and that a mono effect did not improve the experience any further.

Dialects

The use of dialects garnered mixed reactions from participants and was often an engaging topic of discussion. Six participants (P1, P4, P5, P6, P9, P19) agreed that it would be a nice feature to have in theory, but often had mellow reactions in practice. Everyone agreed that it should in any case be optional to have dialects, albeit for different reasons. P5 stated that for people with autism, dialects can provide a sort of "safe place", the sentiment of which was echoed by P9, a participant who viewed a prototype with the dialect from their home-region (V2). They described that "something special happens when you hear your own dialect" and that it in a way brings the contents closer to them. Several participants (P3, P4, P10) expressed that they would not want certain extreme dialects in their audio subtitles, worrying about the intelligibility of the dialect. They suggested a mild version of the dialect instead. At the same time, participants pointed out that they did not want "dialect neutral" AST to imply a Stockholm specific dialect, which they felt is often mistaken for dialect neutral Standard Swedish. In prototypes V1 and V2 two participants (P1, P5) commented on how mismatched and "weird" the choice of AST voice was, because they identified the context to be a speech made by a British person and expected a Standard Swedish AST voice to go with it. Three participants (P1, P3, P10) posed the requirement of the dialect having to fit the context of the film. For Swedish regional dialects, participants expressed that the character on screen would have to originate
from a given region in order for the dialect to be appropriate in audio subtitles. This would entail the product to already be produced in Swedish and would not be applicable in the case of SVT Play, who implement AST exclusively for translated subtitles. P9 was inclined to use an AST voice with dialect anyhow because of the familiarity of the dialect. This could imply that dialects would indeed make a viable and tempting option for users from regions in Sweden with a striking dialect that is not often heard or used for accessibility services such as audio subtitles or even in dubbing otherwise. The choice for Swedish regional dialects seems therefore only relevant for users who want to hear their own or another comfortable dialect of their choice.

The subject was perceived and discussed a little differently in regard to international dialects. P4, who experienced prototype V22, reacted positively to the Indian English dialect, stating that it gave a context and was "culture specific". P9, who discussed the topic theoretically, also agreed that international dialects would "add something to the experience" and be an entertaining feature, but P3 instead found it unnecessary. This participant reacted negatively to the Indian English prototype V22 and found that it "adds a dimension to their conversation which doesn’t exist". They argued that there probably would not exist a dialect in the mother tongue of the speakers and that it was inappropriate to add a dialect in the AST. P5 discussed that some people with autism could easily discern if a dialect was authentic or not and that it could therefore be risky to have artificial dialects that may offend or irritate certain viewers. Both P3 and P5 demanded this assurance of authenticity, as they found it wrong for dialects to be imitated, at least by human speakers for accessibility services.

Pitch

The prototypes where the pitch of the voice was altered garnered arguably the strongest negative reactions. P8 called the experience "horrible" (V5) and said they did not understand the purpose of having an awful and uncomfortable voice that was additionally hard to hear and decipher for audio subtitles. P7 and P10 gave suggestions for scenarios where the low-pitched voice could fit, for instance a ghost movie or for a villain’s voice. Still they seemed apprehensive as to whether they would actually like to hear this type of AST in those scenarios.

Emotional text-to-speech in audio subtitles

Only two participants (P4, P7) viewed the prototypes with coded emotions (V23, V24), but many others (P2, P3, P5, P6, P10) still discussed the topic of tone and emphasis in audio subtitles and the importance of it fitting the scene. It was referred to as the "dramatic part" by one participant, stating which voice had a better fit for a dramatic scene such as the one in V25-V28. This was also linked to how natural a text-to-speech AST sounded in a given scene, how well the voice adapted to the flow and tone of a sentence and would frequently be commented on if it was not done well enough. This was common in V25-V28, a dramatic scene with refugees shouting, where many participants (P1, P4, P5, P6, P7) commented that some of the AST performed flat, "technical" or "robot-like". At other times, the speech synthesis was commended for well-done emphasis. Emotion, tone and emphasis seem to therefore always play a very large role in the perception of the AST. P6 and P10 even said they had turned off a film or TV series because of the disengaging audio subtitles used.

P7 felt that the "glad" AST (V23) fit well with the scene and stated that there is a perceived difference with emotional voices and the right tone, the experience otherwise becoming disengaging and boring. P4 thought that the "sad" AST (V24) did not fit the scene, since the original speaker had a strong almost aggressive tone and the AST was too soft-spoken and calm. They stated they would have preferred the voice to adapt more to the tone of the scene and again underlined the importance of tone and emotions in audio subtitles.

Whispering and shouting

Two participants (P2, P10) found the whispering prototype V14 to be the best, although this coincided with the fact that the prototype in question used a human voice. P2 expressed that it felt natural to adapt the tone this way and that it "intensified the mood", P10 stating something similar: "It fits very well. And it felt more alive actually". Others (P4, P5, P6, P8) did not like the whispering prototypes (V12, V14, V15) at all, either not liking the sharp "S"-sounds made by the human speaker, the strong nature of the whispering, the low sound of it or the degree of intelligibility. They theoretically preferred a less extreme version of a whispering (and shouting) AST, and would instead opt for the service to simply lower the voice or adapt the tone. P6 and P8, who viewed prototype V12 where the AST whispered over non-whispering speakers, argued that they did not need this sort of dramatisation, since they would understand the kind of situation the original speakers were in or when they were whispering regardless. Both participants seemed however confused about whether the original speakers in the clip were actually whispering or not. P5 had trouble dealing with the sharp sounds made by the human AST speaker in V12 and stated that they did not realize it was meant to represent whispering at all.

Participants who theoretically discussed corresponding shouting audio subtitles (P2, P5, P10) saw many problems with it. They speculated that for shouting AST to work, the background sound would have to be lowered (P2) or that shouting AST could be too shocking or distracting to some users and that it would be inappropriate (P10). The participant who viewed the shouting AST prototype V15 (P3) seemed reluctant to state that the voice fit the original scene in question. They described how it on the other hand could destroy the context of the scene if the audio subtitles would not shout. P3 thus seemed to appreciate the effort of the service to honor the original tone as much as possible, but also found the scene hard to follow in general.

Singing

The singing prototype V21 was generally well-received (P1, P5, P9), in part due to the audio subtitles having a human performer. It was said that a singing AST increases the experience, although opinions on whether they would want to use this feature differed. P1 for instance liked the singing, but stated that
they would still prefer to have the service read out the subtitles instead of singing them simultaneously, especially given that the AST would have a synthetic voice. Several people (P1, P5, P10) stated that it would be of even greater importance that the voice appear very natural or even said that it would work exclusively with human performers. The voice would also have to fit the tone of the movie and performers. P5 and P9 speculated that they would not have understood that the original performer sang (V21) and that only with the singing AST it became clear that that was the case.

P7 and P10 viewed the prototype with a singing AST for a non-singing scene (V13) and felt it was very inappropriate. Both said they would find it appropriate for musical films and P7 speculated that a children’s film such as Disney’s "Frozen" (2013) [11] might make use of such an implementation.

Dramatising effects

There were almost exclusively negative reactions to the prototypes with different effects put on the AST voice, such as effects for reverb sizes and distortions. It was only when further discussed in theory that participants were much more inclined to find the effects a suitable feature for AST, contradicting their initial reaction. Several participants (P1, P2, P3, P6, P7, P8) found effects to be unnecessary for the experience of the AST and that it could even disturb the experience. This was especially apparent with some of the reverb size effects, where participants stated they had a harder time hearing what was being said. Here some stated their priorities for AST, which should first and foremost be to be able to follow the conversations and plot of the medium with help of audio subtitles.

You have to put a lot of energy into trying to differentiate, who says what here now. - P9 on V17

If you can not see the text or can read then it is most important that you can hear what they say, maybe not that you describe the space in the actual sound of the text. - P8

P8 argued that you would not emulate effects in your head when you read the text. Some others (P4, P9, P10) were accepting of this sort of dramatising effect, finding the idea entertaining, although the effects often initially went unnoticed or unidentified. This may have been affected by participants seldom finding the effects to be appropriate to the scene, such as seen in V7 and V8 where large and small room effects were applied. Participants seemed to have an easier time identifying the echo effect in V9, which was arguably more fitting and appropriate for the scene, where the original voice also had an echo. Still, participants (P1, P2, P8, P9, P10) thought many of the effects originated not from the AST but from the original audio, and were confused as a result. P9 even mistook the radio effect to simply be an older, technologically inferior, low-quality AST and not a deliberate effect applied to emulate the radio station. Similarly, the muffled effect was also thought to be a technological issue. The topic of dramatisation and dramatising audio subtitles proved to be quite divisive, seemingly grounded in personal experience and preference.

DISCUSSION

The discussion focuses on overarching themes found in the results, such as the fact that some prototypes elicited comfortable and other uncomfortable experiences. The section consequently discusses possible implications for voice design in audio subtitles and beyond, as well as ethical considerations in AST.

Design implications for audio subtitles

Results indicated that users want the voice design of audio subtitles to add to the experience of the medium and it became clear that a gender- and age-appropriate AST voice would intensify and improve the experience of a movie or TV series greatly. Especially the gender of the voice was frequently discussed, although there seemed to be no overarching preference in what gender a voice should have in a given scene. Participants rather demanded that the gender should fit the original speaker in the scene, the AST voice being customized to each character seen and heard. This is something which is commonly implemented in dubbed movies and was shown to be well received in audio subtitles before, albeit with the original actors performing the AST [45, 10]. This kind of dialogue implementation was suggested to be very helpful for users in order to better follow the content of a given scene, as well as improving the experience overall. The quality of experience seemed to be a priority for everyone, although opinions on how far the design of AST should go differed quite a lot.

Participants had strong opinions on the dramatisation of audio subtitles one way or the other. Some participants felt that dramatised AST (i.e. with different effects or an adapted expression such as whispering) were absolutely unnecessary and that AST should first and foremost translate and deliver semantic information. Others however expressed a wish for a high-quality emotional experience and accurate translation of the filmmakers’ cinematic experiential intentions, as described by P9.

It’s about enhancing the experience or ensuring that the experience that the filmmakers want to convey actually reaches me as a consumer. I do not want a custom variant, I want a variant that conveys the original in a way that I can assimilate. - P9

This feeling of AST as an accessibility service representing a "custom variant" was echoed previously by P5, who described her disappointment in the quality of product an accessibility service usually implies. It is thus worth discussing how users who are in need of such services can feel disregarded and that they are not a priority in the entertainment and UX industry.

Opportunities in the production of accessibility services

In order to ensure inclusion and representation of users in need of accessibility services such as AST, they could be treated as an ongoing consideration in the production of audiovisual content. If audio subtitles were considered and produced early on during entertainment production, creators could ensure that their vision of the product is correctly portrayed and translated for people in need of accessibility services. This could lead to the emergence of an industry similar to the dubbing industry. Human performers have previously been deemed too costly
for accessibility services [16, 51], but with text-to-speech AST, this cost could be eliminated and instead be invested in for instance sound designers or professional voice designers. This way, video on demand companies - who may have other priorities - would not have to be made responsible for producing accessibility services, as is the case today [44]. Future research could thus investigate how sound designers, film-makers and voice designers would approach producing audio subtitles to give as similar an experience to the original as possible. This would also contribute to the discussion on the importance of inclusion, representation and accessibility in HCI [50, 8, 4, 40, 31, 49, 30].

Many participants expressed a wish for autonomy in suggesting multiple choice options for audio subtitles, which indicates that the user group recognizes they have different requirements and opinions among themselves. P9 suggested a choice between an AST with one voice without effects or any additions and one "elevated experience"-AST that would include reverb size- and context effects, adapted dialects and expressions and several voices for speakers. Even though several of the participants stated they found the translation and interpretation of emotional information to be unnecessary in contrast to semantic information, previous research has described how the two of them combined are highly important to the subsequent quality of the user experience [15]. This study additionally showed that further voice qualities in AST do have the potential to present many benefits to users. Simple effects such as distortions of the voice to emulate context like a radio, telephone or other could - when done properly - make it easier to identify speakers and follow a story and become an important feature to some users, as previously found in [45]. Dialects could similarly clarify the cultural and geographical context of a scene, but also exist to make users of a certain region more comfortable with the AST voice, arguably due to their association to a specific or familiar social group and the similarity-attraction effect discussed by Sutton and colleagues [50, 8, 42]. Expressions such as seen in highly emotional or singing scenes can generally be supported by emulating them to at least a small degree, since some thought that it intensified the experience. This could possibly be due to these expressions containing emotional information that could otherwise get lost to the viewer, as discussed in [39, 15].

The difference in opinion on whether accessibility services should translate emotional in addition to semantic information and whether an "elevated experience" of audio subtitles is desired could be further examined and discussed as the importance of accessibility and inclusion gains more attention in the industry [31, 43, 52, 37, 41].

Creating discomfort with voice design

What became interesting to observe during the interviews was at times the level of discomfort and confusion experienced by participants, as seen in the especially strong disapproving reactions to the provocative prototypes (V1, V2, V12, V13, V18, V19). It became very clear which prototypes did not "work" in the sense that the participants did not enjoy the experience or would not have liked to use that particular version of AST. This was the case due to different reasons, often either related to the voice itself, the soundscape of the clip or the mismatch between choice of voice and original speaker. What follows are insights garnered on what made for an uncomfortable experience.

A common theme in the uncomfortable experiences with the prototypes was the negative reaction when the voice did not support participants’ expectations and at times instead actively contradicted them. This discomfort was expressed several times by participants, experiencing that the dialect did not fit the context (V1, V2), the gender or age was inappropriate (V18, V19) or that the expressed emotion and tone did not match. When expressions such as singing and whispering were applied to visuals and a context that originally did not include those expressions, participants appeared confused. Contextual or dramatising effects on the voice such as reverb size and distortions seemed to have a similar effect, especially when the intention was not clear to the participant. Another aspect that made participants uncomfortable was the altered pitches (V4, V5). This seemed to inspire more of a physical pain, since the voices were described as "unpleasant" and "horrible". The additional connotation to ghosts and villains, and P7 describing the experience as "scary", suggest that altering the pitch could be an effective way of creating discomfort.

There was even a slight general disregard for text-to-speech voices palpable during the interviews. Some participants had stronger reactions to speech synthesis than others, but most expressed that they would still prefer a human speaker AST. It was sure to be commented upon if the text-to-speech was deemed inadequate or "flat", which could lead to frustration on the user’s part. For one participant (P5) it could even become physically straining due to their medical preconditions. Choosing particularly low-quality speech synthesis could thus be another way to create uncomfortable experiences.

These instances suggest that they would make an inappropriate choice for conventional audio subtitles, but they simultaneously raise the question of whether this suggests any implications for how to design with the voice beyond AST. Researchers have previously discussed the concept of uncomfortable interactions and ambiguity in HCI, highlighting the benefits of what would conventionally be considered a shift away from traditional UX values [19, 6]. This concept often referred to examples of physical interactions, art installations or performances, but could possibly be applied to voice design in voice-user-interfaces as well.

The voice has previously been described to have a big impact on experience and contain and express a lot of information between humans, who therefore are very fine-tuned to analyzing and interpreting voice [28]. This intimate relationship between the voice and humans leads to them having preconceived notions on how and what the voice should be in different contexts, which makes the voice a very intimate and efficient modality to create uncomfortable experiences with. It also explains why participants had such strong reactions to voices that were contradicting their expectations, due to them not being able to intuitively match the voice to the face seen on screen. Contradicting expectations thus elicited discomfort but could simultaneously inspire reflection on the user’s part.