Investigating the Utility of Multimodal Conversational Technology and Audiovisual Analytic Measures for the Assessment and Monitoring of Amyotrophic Lateral Sclerosis at Scale

Michael Neumann†, Oliver Roesler†, Jackson Liscombe†, Hardik Kothare†‡, David Suendermann-Oeft†, David Pautler†, Indu Navar^a, Aria Anvar^a, Jochen Kumm^b, Raquel Norel^c, Ernest Fraenkel^d, Alexander V. Sherman^ef, James D. Berry^e, Gary L. Pattee^g, Jun Wang^h, Jordan R. Green^ef and Vikram Ramanarayanan†‡

vikram.ramanarayanan@modality.ai

arXiv:2104.07310v1 [eess.AS] 15 Apr 2021

† Modality.ai, Inc., San Francisco, United States
‡ University of California, San Francisco
^a EverythingALS, Peter Cohen Foundation
^b Pr3vent, Inc.
^c IBM Thomas J. Watson Research Center
^d Massachusetts Institute of Technology
^e MGH Institute of Health Professions
^f Harvard University
^g University of Nebraska
^h University of Texas at Austin

Abstract

We propose a cloud-based multimodal dialog platform for the remote assessment and monitoring of Amyotrophic Lateral Sclerosis (ALS) at scale. This paper presents our vision, technology setup, and an initial investigation of the efficacy of the various acoustic and visual speech metrics automatically extracted by the platform. 82 healthy controls and 54 people with ALS (pALS) were instructed to interact with the platform and completed a battery of speaking tasks designed to probe the acoustic, articulatory, phonatory, and respiratory aspects of their speech. We find that multiple acoustic (rate, duration, voicing) and visual (higher order statistics of the jaw and lip) speech metrics show statistically significant differences between controls, bulbar symptomatic and bulbar pre-symptomatic patients. We report on the sensitivity and specificity of these metrics using five-fold cross-validation. We further conducted a LASSO-LARS regression analysis to uncover the relative contributions of various acoustic and visual features in predicting the severity of patients' ALS (as measured by their self-reported ALSFRS-R scores). Our results provide encouraging evidence of the utility of automatically extracted audiovisual analytics for scalable remote patient assessment and monitoring in ALS.

Index Terms: conversational agent, amyotrophic lateral sclerosis, computer vision, dialog systems.

1. Multimodal Conversational Agents for Health Monitoring

The development of technologies that rapidly diagnose medical conditions, recognize pathological behaviors, continuously monitor patient status, and deliver just-in-time interventions using the user's native technology environment remains a critical need today [1]. The COVID-19 pandemic has further highlighted the need to make telemedicine and remote monitoring more readily available to patients with chronic neurological disorders [2]. However, early detection or progress monitoring of neurological or mental health conditions, such as clinical depression, ALS, Alzheimer's disease, and dementia, is often challenging for patients for various reasons, including, but not limited to: (i) no access to neurologists or psychiatrists; (ii) lack of awareness of a given condition and the need to see a specialist; (iii) lack of an effective standardized diagnostic or endpoint; (iv) substantial cost and transportation involved in conventional or traditional solutions; and, in some cases, (v) lack of medical specialists in these fields [3].

The NEurological and Mental health Screening Instrument (NEMSI) [4] has been developed to bridge this gap. NEMSI is a cloud-based multimodal dialog system that can be used to elicit the evidence required for detection or progress monitoring of neurological or mental health conditions through automated screening interviews conducted over the phone or via web browser. While intelligent virtual agents have been proposed in previous work for such diagnosis and monitoring purposes [5, 6], NEMSI offers three significant innovations. First, NEMSI uses readily available devices (web browser or mobile app), in contrast to dedicated, locally administered hardware such as cameras, servers, and audio devices. Second, NEMSI's backend is deployed in an automatically scalable cloud environment, allowing it to serve an arbitrary number of end users at a small cost per interaction. Third, the NEMSI system is natively equipped with real-time analytics modules that extract a variety of speech and video features of direct relevance to clinicians in the neurological space, such as speech and pause duration for the assessment of ALS, or geometric features derived from facial landmarks for the automated detection of orofacial impairment in stroke.

This paper investigates the utility of audio and video metrics collected via NEMSI for early diagnosis and monitoring of ALS. We specifically investigate two research questions. First, which metrics demonstrate statistically significant differences between (a) healthy controls and bulbar pre-symptomatic people with ALS (pALS), thereby assisting in early diagnosis, as well as (b) bulbar pre-symptomatic and bulbar symptomatic patients, thereby assisting in progress monitoring? Second, for pALS cohorts, which metrics are most predictive of their self-reported ALS Functional Rating Scale-Revised (ALSFRS-R [7]) score? Before addressing these questions, however, we briefly summarize the current state of ALS research and describe our data collection and metrics extraction process.

2. Current State of ALS Diagnosis and Monitoring

ALS is a neurodegenerative disease that affects roughly 4 to 6 people per 100,000 of the general population [8, 9].
Group   Sex (Female / Male)   Age (years)         ALSFRS-R Total     ALSFRS-R Bulbar
CON     68 / 14               43; 41.62 (19.00)   48; 47.89 (0.94)   12; 11.94 (0.36)
BUL     17 / 15               63; 59.56 (10.29)   36; 33.09 (7.46)    9;  8.75 (1.57)
PRE     12 / 10               61; 57.18 (11.31)   40; 36.45 (8.58)   12; 12.00 (0.00)

Table 1: Participant characteristics for the three groups – controls (CON), bulbar symptomatic (BUL), and bulbar pre-symptomatic (PRE). Age and ALSFRS-R scores are presented as: median; mean (standard deviation).
Early detection and continuous monitoring of ALS symptoms are crucial to provide optimal patient care [10]. For instance, decline in speech intelligibility is said to have a negative impact on patients' quality of life [11, 12], and continuous monitoring of speech intelligibility could prove valuable in terms of patient care. Traditionally, subjective measures, such as patient reports or ratings by clinicians, are used to detect and monitor speech impairment. However, recent studies show that objective measures allow for earlier detection of ALS symptoms [13, 14, 15, 16, 17, 18, 19] and stratification and classification of patients [20], and can provide markers for disease onset, progression and severity [21, 22, 23, 24, 25]. These objective measures can be automatically extracted, thereby allowing for more frequent monitoring and potentially improving treatment. The success of the Beiwe Research Platform^1 [26] in tracking ALS disease progression demonstrates the viability of such remote monitoring solutions.

3. Methods

3.1. Collection Setup

NEMSI end users are provided with a website link to the secure screening portal and login credentials by their caregiver or study liaison (physician, clinic, a referring website, or patient portal). After completing microphone and camera checks, subjects participate in a conversation with "Nina", a virtual dialog agent. Nina's virtual image appears in a web window, and subjects are able to see their own video. During the conversation, Nina engages subjects in a mixture of structured speaking tasks and open-ended questions to elicit speech and facial behaviors relevant to the type of condition being screened for.

Analytics modules automatically extract speech features (e.g., speaking rate, duration measures, fundamental frequency (F0)) and video features (e.g., range and speed of movement of various facial landmarks) in real time and store them in a database, along with metadata about the interaction, such as call duration and completion status. All this information can be accessed by the study liaison through an easy-to-use dashboard, which provides a summary of the interaction (including access to a video recording and the analytic measures computed), as well as a detailed breakdown of the metrics by individual interaction turns.
3.2. Data

Data from 136 participants (see Table 1) were collected between September 2020 and March 2021 in cooperation with EverythingALS and the Peter Cohen Foundation^2. For this cross-sectional study we included one dialog session per subject.^3

The conversational protocol elicits five different types of speech samples from participants, inspired by prior work [27, 28, 29, 30]: (a) sustained vowel phonation, (b) read speech, (c) measure of diadochokinetic rate (rapidly repeating the syllables /pAtAkA/), and (d) free speech (picture description task). For (b) read speech, the dialog contains six speech intelligibility test (SIT) sentences of increasing length (5 to 15 words), and one passage reading task (Bamboo Passage; 99 words). After dialog completion, participants filled out the ALS Functional Rating Scale-Revised (ALSFRS-R), a standard instrument for monitoring the progression of ALS [7]. The questionnaire consists of 12 questions about physical functions in activities of daily living. Each question provides five answer options, ranging from normal function (score 4) to severe disability (score 0). The total ALSFRS-R score is the sum of all sub-scores (therefore ranging from 0 to 48). The ALSFRS-R comprises four scales for different domains affected by the disease: bulbar system, fine and gross motor skills, and respiratory function.

We stratified subjects into three groups for statistical analysis: (a) healthy controls (CON); (b) pALS with a bulbar sub-score < 12 (first three ALSFRS-R questions), labeled bulbar symptomatic (BUL); and (c) pALS with a bulbar sub-score of 12, labeled bulbar pre-symptomatic (PRE). Similar to [14], we aim to identify acoustic and visual speech measures that show significant differences between these groups.

4. Signal Processing and Metrics Extraction

4.1. Acoustic Metrics

We use measures commonly established for clinical speech analysis with regard to ALS [14], including timing measures, frequency domain measures, and measures specific to the diadochokinesia (DDK) task, such as syllable rate and cycle-to-cycle temporal variation [31]. Table 2 shows the metrics and the speech task types from which they are extracted. Additionally, speech intensity (mean energy in dB SPL excluding pauses) was extracted for all utterances. The picture description task (free speech) was not used for this analysis.

All acoustic measures were automatically extracted with the speech analysis software Praat [32]. Speaking and articulation rates are computed based on the expected number of words because forced alignment is error-prone for dysarthric speech [33]. For that reason, these measures can be noisy, for example if a patient did not finish the reading passage. Hence, we automatically remove outliers based on thresholds for the Bamboo task: speaking rates > 250 words/min, articulation rates > 350 words/min, and PPT > 80% are excluded.^4
4.2. Visual Metrics

Facial metrics were calculated for each utterance in three steps: (i) face detection using the Dlib^5 face detector, which uses five histograms of oriented gradients to determine the (x, y)-coordinates of one or more faces for every input frame [34]; (ii) facial landmark extraction using the Dlib facial landmark detector, which uses an ensemble of regression trees proposed in [35] to extract 68 facial landmarks according to Multi-PIE [36]; and (iii) facial metrics calculation, which uses 20 facial landmarks to compute the metrics shown in Table 3 (cf. [37] for details). Finally, all facial metrics in pixels were normalized within every subject by dividing the values by the inter-lachrymal distance in pixels (measured as the distance between the right corner of the left eye and the left corner of the right eye).
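As a rough illustration of this three-step pipeline, the following sketch uses Dlib's HOG face detector and 68-point landmark predictor to track lower-lip and jaw-center points across frames, normalizes by inter-lachrymal distance, and derives velocity, acceleration and jerk style statistics by frame-to-frame differencing. The landmark indices, the specific points chosen for the lower lip and jaw center, and the file paths are assumptions made for this example, not the paper's exact configuration.

```python
import dlib
import cv2
import numpy as np

detector = dlib.get_frontal_face_detector()          # HOG-based face detector
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def landmarks_per_frame(video_path):
    """Yield the 68 (x, y) landmarks of the first detected face in each frame."""
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector(gray)
        if faces:
            shape = predictor(gray, faces[0])
            yield np.array([(p.x, p.y) for p in shape.parts()], dtype=float)
    cap.release()

pts = np.stack(list(landmarks_per_frame("session_utterance.mp4")))  # (frames, 68, 2), hypothetical file

# Assumed landmark choices: 39/42 = inner eye corners, 57 = lower lip, 8 = jaw center (chin).
interlachrymal = np.linalg.norm(pts[:, 42] - pts[:, 39], axis=1).mean()
lower_lip = pts[:, 57] / interlachrymal
jaw_center = pts[:, 8] / interlachrymal

def higher_order_stats(track):
    """Velocity, acceleration and jerk magnitudes via frame-to-frame differences."""
    vel = np.diff(track, axis=0)
    acc = np.diff(vel, axis=0)
    jerk = np.diff(acc, axis=0)
    mag = lambda x: np.linalg.norm(x, axis=1)
    return {"v_mean": mag(vel).mean(), "v_max": mag(vel).max(),
            "a_mean": mag(acc).mean(), "a_max": mag(acc).max(),
            "j_mean": mag(jerk).mean(), "j_max": mag(jerk).max()}

ll_stats = higher_order_stats(lower_lip)
jc_stats = higher_order_stats(jaw_center)
```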
^1 https://www.beiwe.org/
^2 https://www.everythingals.org/research
^3 If a subject participated in multiple dialog sessions, we took the first successful one, i.e., the first complete call for which valid metrics were extracted and all ALSFRS-R questions were answered.
^4 The outlier thresholds and the tasks for which they are applied were determined by manual inspection of the data.
^5 http://dlib.net/
Speech type       Collected metrics
Held vowel        Mean F0 (Hz), jitter (%), shimmer (%), HNR (dB), CPP (dB)
SIT and Bamboo    Speaking and articulation duration (sec), speaking and articulation rate (words/min), PPT
DDK               Speaking and articulation duration (sec), syllable rate (syllables/sec), number of produced syllables, cTV (sec)

Table 2: Extracted acoustic metrics. PPT = percent pause time, DDK = diadochokinesia, CPP = cepstral peak prominence, HNR = harmonics-to-noise ratio, cTV = cycle-to-cycle temporal variation.
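For the DDK measures in Table 2, one simple way to approximate syllable rate and cycle-to-cycle temporal variation from a /pAtAkA/ recording is to pick peaks in a smoothed short-time energy envelope and treat consecutive peak intervals as cycles. This is only an illustrative approximation; the paper's DDK analysis follows [31], and the energy threshold, smoothing window, and the cTV formula used here are assumptions.

```python
import numpy as np
import soundfile as sf
from scipy.signal import find_peaks

# Hypothetical DDK recording of rapidly repeated /pataka/ syllables.
audio, sr = sf.read("ddk_pataka.wav")
if audio.ndim > 1:                       # mono-ize if stereo
    audio = audio.mean(axis=1)

# Short-time energy envelope (25 ms frames, 10 ms hop) with light smoothing.
frame, hop = int(0.025 * sr), int(0.010 * sr)
frames = np.lib.stride_tricks.sliding_window_view(audio, frame)[::hop]
energy = (frames ** 2).mean(axis=1)
energy = np.convolve(energy, np.ones(5) / 5, mode="same")

# Each syllable nucleus appears as an energy peak (assumed height/spacing constraints).
peaks, _ = find_peaks(energy, height=0.1 * energy.max(), distance=8)
peak_times = peaks * hop / sr

duration = len(audio) / sr
syllable_count = len(peaks)
syllable_rate = syllable_count / duration                    # syllables/sec

# Cycle durations between consecutive syllable peaks; cTV approximated here as
# the mean absolute difference between consecutive cycle durations.
cycles = np.diff(peak_times)
ctv = np.abs(np.diff(cycles)).mean() if len(cycles) > 1 else float("nan")
```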

Category       Metrics                          Description
Movement       open, width                      Lip opening and mouth width
               LL_path, JC_path                 Displacement of lower lip and jaw center
               eye_open                         Eye opening
               eyebrow_vpos                     Vertical eyebrow displacement
Velocity       vLL, vJC, vLL_abs, vJC_abs       Velocity and speed of lower lip and jaw center (px/frame)
Acceleration   aLL, aJC, aLL_abs, aJC_abs       Acceleration of lower lip and jaw center (px/frame^2)
Jerk           jLL, jJC, jLL_abs, jJC_abs       Jerk of lower lip and jaw center (px/frame^3)
Surface        S (S_R, S_L)                     Area of the mouth (right and left half, px^2)
               S_ratio_avg                      Mean symmetry ratio
Eye blinks     eye_blinks                       Eye blinks per second

Table 3: Extracted facial metrics. Maximum (suffix _max) and average (_avg) were extracted for movement and surface metrics, and maximum, average and minimum (_min) were extracted for velocity, acceleration and jerk metrics.

5. Analyses and Observations

To normalize for sex-specific differences in metrics (such as F0), we z-scored all metrics by sex group. Additionally, all metrics reported below (except speaking and articulation duration) were averaged across speech task type. An important caveat to all the analyses presented here is the imbalance of sample sizes between the cohorts; future extensions of this work will also need larger BUL and PRE cohorts to make robust and generalizable statistical claims.
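A minimal sketch of this per-sex normalization step, assuming a pandas DataFrame with one row per subject and task-averaged metric columns (the column names are hypothetical):

```python
import pandas as pd

df = pd.read_csv("subject_metrics.csv")          # hypothetical: one row per subject
metric_cols = [c for c in df.columns if c not in ("subject_id", "sex", "group")]

# z-score every metric within each sex group to remove sex-specific offsets (e.g., F0).
df[metric_cols] = (df.groupby("sex")[metric_cols]
                     .transform(lambda x: (x - x.mean()) / x.std(ddof=0)))
```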
5.1. RQ1: Which metrics demonstrated statistically significant pairwise differences between controls, bulbar pre-symptomatic and bulbar symptomatic pALS cohorts?

We conducted a non-parametric Kruskal-Wallis test for every acoustic and facial metric to identify the metrics that showed a statistically significant difference between the cohorts. For all metrics with p < 0.05, a post-hoc analysis was done (again Kruskal-Wallis) between every combination of two cohorts to find out which groups can be distinguished. Figure 1 shows effect size, measured as Glass' Δ [38], for all metrics that show a statistically significant difference (p < 0.05) between different subject groups.

Figure 1: Effect sizes of acoustic and visual metrics that show statistically significant differences at p < 0.05. Effect sizes are shown with a 95% confidence interval and are ranked by the BUL–CON group pair.
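A sketch of the omnibus test, pairwise follow-up, and Glass' Δ effect size for a single metric, using SciPy; the DataFrame layout and column names are assumed, and Glass' Δ is computed here with the reference group's standard deviation as the denominator:

```python
from itertools import combinations
import numpy as np
from scipy.stats import kruskal

def glass_delta(treatment, control):
    """Glass' delta: mean difference scaled by the control (reference) group's SD."""
    return (np.mean(treatment) - np.mean(control)) / np.std(control, ddof=1)

def test_metric(df, metric, groups=("CON", "PRE", "BUL")):
    samples = {g: df.loc[df["group"] == g, metric].dropna().to_numpy() for g in groups}

    # Omnibus Kruskal-Wallis test across the three cohorts.
    _, p_omnibus = kruskal(*samples.values())
    out = {"omnibus_p": p_omnibus, "pairwise": {}}

    # Post-hoc pairwise tests, run only when the omnibus test is significant.
    if p_omnibus < 0.05:
        for g1, g2 in combinations(groups, 2):
            _, p_pair = kruskal(samples[g1], samples[g2])
            out["pairwise"][(g1, g2)] = {
                "p": p_pair,
                # Effect size of g2 relative to g1 (g1 treated as the reference group).
                "glass_delta": glass_delta(samples[g2], samples[g1]),
            }
    return out
```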
In addition to the statistical tests, we conducted 5-fold cross-validation with logistic regression to investigate binary classification performance, and in turn the sensitivity and specificity, of our aforementioned metrics in distinguishing the CON vs PRE (with applications to early diagnosis) and PRE vs BUL (progress monitoring) groups. We investigated using both the full feature set and feature selection with recursive feature elimination, and found that the latter performed better, as the former tends to overfit the data given our small sample size. Receiver operating characteristic (ROC) curves for these classification experiments, encapsulating sensitivities and specificities, are presented in Figure 2. For the CON vs PRE case, we observed that the mean unweighted average recall (UAR) across the 5 cross-validation folds was 0.63 ± 0.08, significantly above chance, suggesting promising applications for early diagnosis. For the PRE vs BUL case (and therefore progress monitoring), this number was even better, at 0.77 ± 0.05. The BUL vs CON case, unsurprisingly, produced the best results.

Figure 2: ROC curves displaying the performance of binary classification with 5-fold cross-validation for all group pairs: (a) BUL vs CON, (b) BUL vs PRE, (c) PRE vs CON.
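One way to set up such an evaluation with scikit-learn is sketched below: recursive feature elimination wrapped around a logistic-regression classifier, scored with unweighted average recall (balanced accuracy) over stratified 5-fold cross-validation. The number of selected features and the preprocessing choices are assumptions, not the paper's exact settings.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

def evaluate_pair(X, y, n_features=10, seed=0):
    """UAR of logistic regression with RFE feature selection for one group pair.

    X: (n_subjects, n_metrics) feature matrix; y: binary labels (e.g., CON vs PRE).
    """
    clf = Pipeline([
        ("scale", StandardScaler()),
        ("select", RFE(LogisticRegression(max_iter=1000), n_features_to_select=n_features)),
        ("logreg", LogisticRegression(max_iter=1000)),
    ])
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    # balanced_accuracy equals the unweighted average recall (UAR) over the two classes.
    scores = cross_val_score(clf, X, y, cv=cv, scoring="balanced_accuracy")
    return scores.mean(), scores.std()
```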

Looking at acoustic features, we found that timing measures (speaking and articulation duration and rate; PPT; syllable rate; cTV) exhibit strong differences between groups and that the effect sizes of these metrics are highest between the BUL group and the CON group. Mean F0 also showed a significant difference, with small effect sizes. For visual metrics, the results indicate that velocity, acceleration and jerk measures are generally the best indicators for ALS. Additionally, while the jaw center (JC) seems to be more important than the lower lip (LL) for detecting ALS, further investigations are necessary to ensure that the difference between the JC and LL metrics is not just due to a difference in facial landmark detection accuracy.

5.2. RQ2: Which metrics contribute the most toward predicting the ALSFRS-R score?

In a regression analysis, we investigated the predictive power of the extracted metrics with regard to both the BUL and PRE pALS cohorts. To do so, we employed a LASSO (least absolute shrinkage and selection operator) regression with the objective of predicting the total ALSFRS-R score, implemented using the least-angle regression (LARS) algorithm [39]. The algorithm is similar to forward stepwise regression, but instead of including features at each step, the estimated coefficients are increased in a direction equiangular to each one's correlations with the residual.
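A minimal sketch of this kind of analysis with scikit-learn's LARS-based LASSO is given below: lars_path exposes the order in which features enter the model along the regularization path, and a cumulative R^2 can be tracked by refitting on the first k selected features. The standardization step, column handling, and refitting strategy are assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.linear_model import lars_path, LinearRegression
from sklearn.preprocessing import StandardScaler

def lasso_lars_feature_order(X, y, feature_names):
    """Order in which features enter the LASSO-LARS path, plus cumulative R^2."""
    Xs = StandardScaler().fit_transform(X)
    # LASSO solved with the LARS algorithm; coefs has shape (n_features, n_alphas).
    _, _, coefs = lars_path(Xs, y, method="lasso")

    # A feature's entry step is the first point on the path where its coefficient is nonzero.
    entry_step = np.array([np.argmax(np.abs(coefs[j]) > 0) if np.any(coefs[j]) else np.inf
                           for j in range(coefs.shape[0])])
    order = [int(j) for j in np.argsort(entry_step) if np.isfinite(entry_step[j])]

    cumulative_r2 = []
    for k in range(1, len(order) + 1):
        model = LinearRegression().fit(Xs[:, order[:k]], y)
        cumulative_r2.append(model.score(Xs[:, order[:k]], y))
    return [feature_names[j] for j in order], cumulative_r2
```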
Figure 3: Acoustic and visual features from the LASSO-LARS regression path. (a) Final 17 features obtained for the PRE group; (b) first 20 features (out of 23) for the BUL group.

Figure 3a shows the final 17 features, in the order they were selected by the LASSO-LARS regression, on data from 19 PRE samples^6, along with the cumulative model R^2 at each step. We observe that both facial metrics (mouth opening and symmetry ratio, higher order statistics of jaw and lips) and acoustic metrics (particularly voice quality metrics such as jitter, shimmer and mean F0) added useful predictive power to the model, suggesting that these might be useful in modeling severity in bulbar pre-symptomatic pALS.

On the other hand, for the BUL cohort, Figure 3b shows a slightly different set of 20 features obtained using LASSO-LARS (based on 26 participants). We observe that facial metrics (eye blinks and brow positions, in addition to higher order statistics of jaw and lips) add more predictive power to the model than acoustic metrics (such as cTV, CPP and mean F0), suggesting that these might find utility in modeling severity in bulbar symptomatic pALS.

^6 The number of samples in the classification and regression analyses differs from Table 1 because not all metrics were present for all subjects, either because the task was not performed correctly or because of system errors.

6. Conclusions

Our findings demonstrate the utility of multimodal dialog technology for assisting early diagnosis and monitoring of pALS. Multiple automatically extracted acoustic (rate, duration, voicing) and visual (higher order statistics of the jaw and lip) speech metrics show significant promise in assisting with both early diagnosis of bulbar pre-symptomatic ALS vs healthy controls and progress monitoring in pALS. Moreover, using LASSO-LARS to model the relative contribution of these features in predicting the ALSFRS-R score highlights the utility of incorporating different speech and facial metrics for modeling severity in bulbar pre-symptomatic and bulbar symptomatic pALS. While higher order statistics of the jaw and lower lip facial features and timing, pausing and rate-based speech features were useful across the board for both cases, voice quality and mouth opening and area metrics seem to be more useful for the bulbar pre-symptomatic group, while spectral and eye-related metrics are relevant for the bulbar symptomatic group. Future work will expand these analyses to more speakers to ensure the statistical robustness and generalizability of these trends.
7. References

[1] S. Kumar, W. Nilsen et al., "Mobile health: Revolutionizing healthcare through transdisciplinary research," Computer, vol. 46, no. 1, pp. 28–35, 2012.
[2] A. Bombaci, G. Abbadessa et al., "Telemedicine for management of patients with amyotrophic lateral sclerosis through COVID-19 tail," Neurological Sciences, pp. 1–5, 2020.
[3] R. Steven and M. Steinhubl, "Can mobile health technologies transform health care," JAMA, vol. 92037, no. 1, pp. 1–2, 2013.
[4] D. Suendermann-Oeft, A. Robinson et al., "NEMSI: A multimodal dialog system for screening of neurological or mental conditions," in Proceedings of the 19th ACM International Conference on Intelligent Virtual Agents, 2019, pp. 245–247.
[5] D. DeVault, R. Artstein et al., "SimSensei Kiosk: A virtual human interviewer for healthcare decision support," in Proceedings of the International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), Paris, France, May 2014.
[6] C. Lisetti, R. Amini, and U. Yasavu, "Now all together: Overview of virtual health assistants emulating face-to-face health interview experience," KI-Künstliche Intelligenz, vol. 29, pp. 161–172, March 2015.
[7] J. M. Cedarbaum, N. Stambler et al., "The ALSFRS-R: a revised ALS functional rating scale that incorporates assessments of respiratory function," Journal of the Neurological Sciences, vol. 169, no. 1-2, pp. 13–21, 1999.
[8] D. Majoor-Krakauer, P. Willems, and A. Hofman, "Genetic epidemiology of amyotrophic lateral sclerosis," Clinical Genetics, vol. 63, no. 2, pp. 83–101, 2003.
[9] A. Al-Chalabi and O. Hardiman, "The epidemiology of ALS: a conspiracy of genes, environment and time," Nature Reviews Neurology, vol. 9, no. 11, p. 617, 2013.
[10] S. Paganoni, E. A. Macklin et al., "Diagnostic timelines and delays in diagnosing amyotrophic lateral sclerosis (ALS)," Amyotrophic Lateral Sclerosis and Frontotemporal Degeneration, vol. 15, no. 5-6, pp. 453–456, 2014.
[11] A. Chiò, A. Gauthier et al., "A cross sectional study on determinants of quality of life in ALS," Journal of Neurology, Neurosurgery & Psychiatry, vol. 75, no. 11, pp. 1597–1601, 2004.
[12] K. Joubert, J. Bornman, and E. Alant, "Speech intelligibility and marital communication in amyotrophic lateral sclerosis: an exploratory study," Communication Disorders Quarterly, vol. 33, no. 1, pp. 34–41, 2011.
[13] Y. Yunusova, J. R. Green et al., "Speech in ALS: longitudinal changes in lips and jaw movements and vowel acoustics," Journal of Medical Speech-Language Pathology, vol. 21, no. 1, p. 1, 2013.
[14] K. M. Allison, Y. Yunusova et al., "The diagnostic utility of patient-report and speech-language pathologists' ratings for detecting the early onset of bulbar symptoms due to ALS," Amyotrophic Lateral Sclerosis and Frontotemporal Degeneration, vol. 18, no. 5-6, pp. 358–366, August 2017.
[15] P. Gomez, D. Palacios et al., "Articulation acoustic kinematics in ALS speech," in 2017 International Conference and Workshop on Bioinspired Intelligence (IWOBI). IEEE, 2017, pp. 1–6.
[16] R. Norel, M. Pietrowicz et al., "Detection of amyotrophic lateral sclerosis (ALS) via acoustic analysis," Proc. Interspeech 2018, pp. 377–381, 2018.
[17] A. Bandini, J. R. Green et al., "Kinematic features of jaw and lips distinguish symptomatic from presymptomatic stages of bulbar decline in amyotrophic lateral sclerosis," Journal of Speech, Language, and Hearing Research, vol. 61, no. 5, pp. 1118–1129, 2018.
[18] B. J. Perry, R. Martino et al., "Lingual and jaw kinematic abnormalities precede speech and swallowing impairments in ALS," Dysphagia, vol. 33, no. 6, pp. 840–847, 2018.
[19] A. Wisler, K. Teplansky et al., "The effects of symptom onset location on automatic amyotrophic lateral sclerosis detection using the correlation structure of articulatory movements," Journal of Speech, Language, and Hearing Research, 2021. [Online]. Available: https://pubs.asha.org/doi/abs/10.1044/2020_JSLHR-20-00288
[20] P. Rong, Y. Yunusova et al., "A speech measure for early stratification of fast and slow progressors of bulbar amyotrophic lateral sclerosis: lip movement jitter," Amyotrophic Lateral Sclerosis and Frontotemporal Degeneration, vol. 21, no. 1-2, pp. 34–41, 2020.
[21] ——, "Predicting speech intelligibility decline in amyotrophic lateral sclerosis based on the deterioration of individual speech subsystems," PLoS ONE, vol. 11, no. 5, p. e0154971, 2016.
[22] J. Wang, P. V. Kothalkar et al., "Automatic prediction of intelligible speaking rate for individuals with ALS from speech acoustic and articulatory samples," International Journal of Speech-Language Pathology, vol. 20, no. 6, pp. 669–679, 2018.
[23] T. Makkonen, H. Ruottinen et al., "Speech deterioration in amyotrophic lateral sclerosis (ALS) after manifestation of bulbar symptoms," International Journal of Language & Communication Disorders, vol. 53, no. 2, pp. 385–392, 2018.
[24] E. M. Wilson, M. Kulkarni et al., "Detecting bulbar motor involvement in ALS: Comparing speech and chewing tasks," International Journal of Speech-Language Pathology, vol. 21, no. 6, pp. 564–571, 2019.
[25] C. Barnett, J. R. Green et al., "Reliability and validity of speech & pause measures during passage reading in ALS," Amyotrophic Lateral Sclerosis and Frontotemporal Degeneration, vol. 21, no. 1-2, pp. 42–50, 2020.
[26] J. D. Berry, S. Paganoni et al., "Design and results of a smartphone-based digital phenotyping study to quantify ALS progression," Annals of Clinical and Translational Neurology, vol. 6, no. 5, pp. 873–881, February 2019.
[27] A. K. Silbergleit, A. F. Johnson, and B. H. Jacobson, "Acoustic analysis of voice in individuals with amyotrophic lateral sclerosis and perceptually normal vocal quality," Journal of Voice, vol. 11, no. 2, pp. 222–231, 1997.
[28] B. Tomik and R. J. Guiloff, "Dysarthria in amyotrophic lateral sclerosis: A review," Amyotrophic Lateral Sclerosis, vol. 11, no. 1-2, pp. 4–15, 2010.
[29] M. Novotny, J. Melechovsky et al., "Comparison of automated acoustic methods for oral diadochokinesis assessment in amyotrophic lateral sclerosis," Journal of Speech, Language, and Hearing Research, vol. 63, no. 10, pp. 3453–3460, 2020.
[30] C. Agurto, M. Pietrowicz et al., "Analyzing progression of motor and speech impairment in ALS," in 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 2019, pp. 6097–6102.
[31] P. Rong, "Automated acoustic analysis of oral diadochokinesis to assess bulbar motor involvement in amyotrophic lateral sclerosis," Journal of Speech, Language, and Hearing Research, pp. 1–15, 2020.
[32] P. Boersma and V. Van Heuven, "Speak and unSpeak with Praat," Glot International, vol. 5, no. 9/10, pp. 341–347, 2001.
[33] Y. T. Yeung, K. H. Wong, and H. Meng, "Improving automatic forced alignment for dysarthric speech transcription," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[34] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), San Diego, USA, June 2005.
[35] V. Kazemi and J. Sullivan, "One millisecond face alignment with an ensemble of regression trees," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, USA, June 2014.
[36] R. Gross, I. Matthews et al., "Multi-PIE," in Proceedings of the International Conference on Automatic Face and Gesture Recognition (FG), vol. 28, no. 5, 2010, pp. 807–813.
[37] M. Neumann, O. Roessler et al., "On the utility of audiovisual dialog technologies and signal analytics for real-time remote monitoring of depression biomarkers," in Proceedings of the First Workshop on Natural Language Processing for Medical Conversations, 2020, pp. 47–52.
[38] G. V. Glass, B. McGaw, and M. L. Smith, Meta-Analysis in Social Research. Sage Publications, Incorporated, 1981.
[39] B. Efron, T. Hastie et al., "Least angle regression," Annals of Statistics, vol. 32, no. 2, pp. 407–499, 2004.