Can a Support Vector Machine identify poor performance of dyslectic children playing a serious game? - Diva-Portal.org

Page created by Jerry Gardner

Health & Fitness

English

Like
Share
Embed
Fullscreen
Slides
Download HTML
Download PDF
Abuse

←

→

Page content transcription

If your browser does not render page correctly, please read the page content below

Can a Support Vector Machine identify poor performance of dyslectic children playing a serious game? - Diva-Portal.org

EXAMENSARBETE INOM TEKNIKOMRÅDET
MEDIETEKNIK
OCH HUVUDOMRÅDET
DATALOGI OCH DATATEKNIK,
AVANCERAD NIVÅ, 30 HP
STOCKHOLM, SVERIGE 2021

Can a Support Vector Machine
identify poor performance of
dyslectic children playing a
serious game?

VIKTOR LEMON

KTH
SKOLAN FÖR ELEKTROTEKNIK OCH DATAVETENSKAP

SAMMANFATTNING
Den här uppsatsen har varit en del i utvecklingen av det seriösa spelet Kunna, ett webbaserat spel för barn
diagnostiserade med dyslexi. Spelet består av fem olika övningar som syftar till att öva och utveckla barnens läs- och
skrivförmåga. Då Kunna kan användas var som helst behövs verktyg för att förstå varje individs kapaciteter och
svårigheter. Därför syftar den här uppsatsen till att presentera hur ett seriöst spel och stödvektormaskiner (eng.
support vector machine) kan användas för att identifiera de användare som inte uppnådde prestationskraven. På grund
av den uppblossande coronapandemin kunde dock Kunna enbart testas på barn som inte var diagnostiserade med
dyslexi och därför bör den här uppsatsen ses som en pilotstudie. Inledningsvis identifierades flera variabler för att mäta
prestandan hos barn med dyslexi. Därefter implementerades variablerna i Kunna och testades på 16 spansktalande
barn där resultaten analyserades i syfte att identifiera samband kopplade till svaga prestationer. Slutligen delades
deltagarnas data upp i två grupper, varav en grupp innehöll deltagare med klart högre värden i tid och antal fel.
Uppdelningen gjordes för att träna och utvärdera om en stödvektormaskin kan separera de två grupperna och därav
identifiera de deltagare som inte uppnådde prestationskraven. De slutliga resultaten indikerar dock att en
stödvektormaskin inte är det effektivaste valet för detta ändamål. Istället föreslås att framtida arbeten bör överväga
multiklassificeringsalgoritmer.

Can a Support Vector Machine identify poor performance of
dyslectic children playing a serious game?
Viktor Lemón
KTH Royal Institute of Technology
Stockholm, Sweden
vlemon@kth.se

ABSTRACT word recognition problems, word pronouncing, and reading
fluently [2]. Early detection combined with effective preven-
This paper has been a part of developing the serious game tive practices can minimize the difficulties associated with
Kunna, a web-based game with exercises targeting children dyslexia [3,10,22]. Beyond traditional treatment-tools, seri-
diagnosed with dyslexia. This game currently consists of five ous games are commonly recognized as games used for pur-
different exercises aiming to practice reading and writing poses other than pure entertainment and have the potential
without a therapist or neuropsychologist present. As Kunna to reach anyone in need. These games exist in several fields
can be used anywhere, tools are needed to understand each such as education, health, manufacturing, military, and medi-
individual's capacities and difficulties. Hence, this paper aims cine [6]. Furthermore, serious games have been used to esti-
to present how a serious game and a support vector machine mate the risk of dyslexia, and several studies [4,5,7,10,12,14]
were used to identify children that performed poorly in have combined web-based games and machine learning to of-
Kunna’s exercises. Though, due to the current corona pan- fer screening tests online.
demic, Kunna could only be tested on children not diagnosed
with dyslexia. Therefore, this paper should be seen as a proof The challenges of dyslexia are highly individual [26], and dig-
of concept. As an initial step, several variables were identified ital interventions should therefore be personally adapted.
to measure the performance of dyslectic children. Secondly, Software has the potential to support neuropsychologists and
the variables were implemented into Kunna and tested on 16 therapist’s overall comprehension of a patient, as serious
Spanish-speaking children. The results were analyzed to games can be used without their presence. As a first step to-
identify how poor performance could be recognized using the wards this ambition, this study presents how a web-tool was
identified variables. As a final step, the data was divided into constructed and evaluated in order to identify children strug-
two groups for each exercise, of which one group contained gling with exercises in a serious game. This by training a sup-
participants who appear to perform poorly. These were par- port vector machine (SVM), a classifying algorithm that has
ticipants with clearly outlying values in the number of errors shown promising results in categorizing and differentiating
and duration. Thus, to train and evaluate if a Support Vector dyslexic from non-dyslexic populations [4,7,10,12,20]. Hence,
Machine (SVM) can separate the two groups and thereby iden- the hypothesizes is that SVM can identify poor performance
tify the participants who performed poorly. From the discus- by being trained with two groups of participants. One group
sion followed that the SVM is not the most efficient choice for who exposes difficulties and one group that performs well
this aim. Instead, it is suggested that future work should con- within an exercise.
sider multiclassification algorithms.
The study was done in collaboration with the Swedish web
CCS CONCEPTS agency Prototyp and the Spanish mental health clinic Neu-
• Human-centered computing → Field studies; User studies; rotalentum. These companies have developed the applica-
Empirical studies in accessibility.
tion Kunna which exercises aim to practice and measure skills
needed to handle cognitive challenges of dyslexia. However,
KEYWORDS the current circumstances of covid-19 made it only possible
Serious games, Dyslexia, Identifying, Performance, Support to test the game Kunna on 16 participants without dyslexia.
Vector Machines
Reading guide
INTRODUCTION This paper consists of three steps: 1. Find relevant variables
Dyslexia is a reading disability disorder diagnosed in child- to measure the performance of children diagnosed with dys-
hood, with a prevalence of 5 to 10 percent globally lexia. 2. Evaluate how these variables can indicate poor per-
[1,4,7,10,12]. It is characterized by poor spelling abilities, formance in the context of Kunna’s exercises. 3. Evaluate if the

SVM can identify participants that appear to be struggling         Find the syllable
with an exercise. Thus, three methods (Finding variables,          Like the rhyming exercise, the user is instructed to drag a
Characteristics of poor performance and Testing the support        word containing the target syllable. However, each card only
vector machine) are presented with corresponding results.          showed an image, and the user must listen to each card to
                                                                   identify the syllable. The user could here listen to the sentence
Research Question                                                  “Which word contains the syllable …?”. The target syllable is
How can a Support Vector Machine identify poor performance         presented in the green box in the upper right corner. See im-
of dyslectic children playing a serious game?                      age 3 in figure 1.

PRESENTATION OF THE EXERCISES                                      Game Design
                                                                   Each exercise began with a tutorial and instructions were
                                                                   given in text read by an auto-played voice. The participants
                                                                   had to complete a simplified version of the game to enter the
                                                                   exercise to ensure they understood the instructions. Each ex-
                                                                   ercise has several levels, and the content was varying depend-
                                                                   ing on the exercise and the difficulty level. A level was com-
                                                                   pleted when the participant provided a correct answer, indi-
                                                                   cated by a cheerful message.

                                                                    Exercise      Capacities          Interaction   Audio   Levels
                                                                    Name          tested              type
                                                                    Syllable      Syllabic and pho-   Drag-and-     Yes     6
                                                                                  nological aware-    drop
                                                                                  ness
Figure 1. Showing the five different exercises in the web-
                                                                    Sort          Visual discrimi-    Drag-and-     Yes     6
application Kunna. The CPT exercise (1) is not covered in
                                                                                  nation and cate-    drop
this paper.                                                                       gorization

Exercises                                                           Rhyme         Phonological        Drag-and-     Yes     8
                                                                                  awareness           drop
Kunna currently has five exercises: one Continuous perfor-
mance test (CPT) targeting ADHD (not covered in this paper)         Labyrinth     Visual search       Click         No      4
and four exercises targeting dyslexia. The four dyslexia tar-      Table 1. The exercises studied in Kunna
geting exercises were invented and designed by Neurotal-
entum and are presented below and summarized in table 1.           THEORY
                                                                   Dyslexia
Labyrinth
                                                                   In the Diagnostic and statistical manual of mental disorders
In this task, the user must find a path from the starting letter
                                                                   (DSM -5), dyslexia is defined as “an alternative term used to
or word to reach the flag, by clicking on cells. These cells ei-
                                                                   refer to a pattern of learning difficulties characterized by prob-
ther contained the target word/letter or a non-target. See im-
                                                                   lems with accurate or fluent word recognition, poor decoding
age 2 in figure 1.
                                                                   and poor spelling abilities.” [2]. Several studies agree that dys-
                                                                   lexia results in an impaired phonological processing capabil-
Sort long words from short words
                                                                   ity [1, 3, 16, 22, 23], which causes difficulties in processing
The user is given two or three words and instructed to “drag”
                                                                   sounds in a language. For example, distinguish phonemes of a
short and long words to two different fields. Additionally, the
                                                                   word and judge whether two sounds rhyme [16]. Further-
user was able to listen to the words by clicking on the card. A
                                                                   more, Dyslectic children have shown to perform slower dur-
level was completed when both targets were correctly sorted.
                                                                   ing visual search tasks [11, 18] and maintain attention for a
See image 4 in figure 1.
                                                                   shorter period [18, 19]. Children training their phonological
                                                                   awareness are more likely to improve their reading and
Find the rhyme
                                                                   spelling [9,19].
This exercise is about dragging a card containing a word that
rhymes with a target word presented in a green box in the left
                                                                   Support Vector Machines
corner. The user could either read or listen to the sentence
                                                                   Support Vector Machine is a machine learning classifying al-
“Which word rhymes with …?” See image 5 in figure 1.
                                                                   gorithm used by several screening studies [4,7,10,12]. The bi-
                                                                   nary version aims to select a hyperplane that maximizes the
                                                                   distance between the classes. A test sample is then classified

depending on which side of the hyperplane the sample is. SVM game aimed to differentiate children with and without dys-
supports high dimensional spaces and thus several features. lexia and found five significant indicators. These were: total
According to Mark Hall [14], features should not be depend- clicks, time to the first click, hits (number of correct answers),
ent on other features. They should further be correlated with accuracy, efficiency (time of the last click/ hits). This study
the class yet uncorrelated with each other. Cross-validation is was later followed up by implementing machine learning (us-
a common technique that splits the data into several sets. One ing SVM) to detect dyslexic children [10]. The SVM was
set is used for testing and the rest for training. This is used to trained using the number of clicks per item, number of correct
avoid overfitting – a situation occurring when the boundaries answers, number of incorrect answers, scores (the sum of
between the classes are closely fitted to the training set, which correct answers for each stage's problem type), accuracy (the
will not perform well when classifying new samples [13]. number of hits divided by the number of clicks) and miss rate.
267 children and adults participated, and the results showed
an accuracy of 84.62 percent [10]. Recently, Luz Rello et al. [5]
conducted a more extensive study of the independent game
where 313 children participated. The features: total clicks,
first click, hits, efficiency, duration, fourth click, and average
click time was significantly different for Spanish speaking
children. This time was Random Forest used which resulted
in an accuracy of 69 percent for Spanish speaking children
and 75 percent for German.
Figure 2. Showing a plane differentiating the white and
black samples with maximized margin. Variable Definition Study
Number of clicks 4, 5*, 6, 7
Several kernels can create a hyperplane that differentiates the Correct answers 4, 5*, 6, 7*, 10
samples. Depending on the dataset, the efficiency of achieving Incorrect answers 4, 6, 8*, 10, 11
a high classification accuracy varies among the kernels. The Accuracy Correct answers/ 3, 4, 7, 10
kernels used within this study are some of the kernels pro- number of clicks
Miss rate The number of 4, 8*
vided by the Python package Scipy [28].
misses divided by
the number of clicks
RELATED WORK Time to the first 5*, 7*
A few studies [3, 4, 5, 7, 10, 12] have measured the perfor- click
mance in serious games using machine-data interaction Efficiency Time of the last 5*, 7*
methods. In this paper, these are defined as tools available by click/ correct an-
swers
software technologies to measure interactions. A study in It-
Number of clicks 7*, 10
aly [3] developed six games to predict the risk of dyslexia in per item
preschoolers. The dependent variables were response times, Score The sum of correct 10
scores, and accuracy. However, due to a small population and answers all levels
minor application errors, no significant difference between Duration The time to complete 5*, 7*, 8*, 11, 18
the variables could be reported. Furthermore, data from an an exercise
earlier phonological testing study was used to detect dyslexia N:th click The time to a specific 5*, 7*
using machine learning [20]. The test had four tasks with 56 click
Average click time 5*
children participating and reached an accuracy of 89 percent
using decision trees. However, the study does not mention Table 2. All identified features measured in similar
any of the used features. Rello Luz et al. [4,5,7,10] conducted studies. Variables reported with a significant difference
several experiments intending to predict the risk of dyslexia are noted with a star*.
using machine learning. They first developed a linguistic web-
game to detect dyslexia using SVM. The game consisted of 17 OVERALL METHOD
exercises and 32 levels and was tested on 243 children and Since there are no standards for measuring performance
adults (95 diagnosed with dyslexia). Performance was meas- within these exercises, a combination of research, interviews,
ured by counting the number of clicks, correct answers, incor- testing, and statistical analysis were used to address the re-
rect answers, accuracy (hits/ number of clicks), and miss rate search question. As no similar studies were found using these
(the number of misses divided by the number of clicks). How- exercises, these following questions should first be answered:
ever, no statistical analysis of the variables was presented.
Later on, a language-independent web game (no reading was - How can performance be measured in Kunna?
required) was built with visual and audial-based tasks [7]. To- - How are these variables related to poor performance?
tally 178 children with and without dyslexia participated. The

Question one was answered by studying related work to iden- INTERVIEW RESULTS
tify the state of art of web-based measurements for dyslexia. Looking at the variables in table 2, are these relevant to esti-
Furthermore, one interview was held to include a neuropsy- mate cognitive capacities of children with dyslexia?
chologist’s perspectives and validate the relevance of the var-
iables in the context of these exercises. To answer the second - “During this concept testing phase are all these vari-
question, the identified variables were first implemented and ables relevant. Especially the number of mistakes. But
tested on a few participants. The results were analyzed to the completion time and “Time to the first response”
evaluate the variables' applicability and effectivity as a foun- are interesting since it tells the time it took for them
dation for identifying participants with difficulties. Since to finish and understand a task. The number of audio
these exercises are designed to be challenging for children plays may not tell you anything about dyslexia, but it
with dyslexia, poor performance is defined as a high number may tell you something about the comprehension ca-
of mistakes in relation to a protruding time duration. This pacity which is a variable not too far away from dys-
since slow time response and errors are related to difficulties lexia. […]. The benefit of having computers is that they
in search tasks, reading, and writing [11,16,18,21]. However, can measure variables that are hard for a human to
this doesn’t exclude that poor performance can be explained keep track of. But they must be compared with the test
by other variables. Therefore, correlation to time and errors group and the control group and compared within
were calculated. To finally answer the research question, the age groups”.
participants were divided into two group based on the test re-
sults. This to train and evaluate the SVM. Are there any variables missing?
- “It would be interesting to see if a child is guessing
To summarize the overall method, the following steps were while doing these exercises. Some participants could
applied: e.g., be slow in completing the exercise but have a
- Find dependent variables within similar exercises higher number of mistakes. But can guesses be seen in
that are feasible to implement. any other way?”
- Interview with a therapist of Neurotalentum.
- Develop and implement the variables. This question generated the variables Response Frequency
- Evaluate the results of each variable to find charac- and Fast Responses. Response Frequency measure the time be-
teristics of poor performance. tween each response and Fast Responses counts the number
- Train and evaluate the SVM. of times a response was given in less than one second.
With respect to these variables. How do you expect a child with
As this overall approach contains several steps, these are di- or without dyslexia to perform during these exercises?
vided into three methods with a corresponding result sec- - “Theoretically, these tools should put the children into
tion. test, and we believe that these exercises should be able
to differentiate these two groups. But the outcome
FINDING VARIABLES can be spread. Some children with a diagnose might
The first question was addressed by researching related stud- perform similar to the control group. This can and
ies that has used machine learning and serious games to will probably happen. To interpret the performance,
measure the performance of dyslexia. Furthermore, a semi- we need to look at all the variables being measured to
structured interview was held with one of the neurophysiol- estimate the capacities of the child. “
ogists at Neurotalentum. Thus, to validate and identify
measures that they were interested in and further understand How can these variables help to identify individuals having dif-
how these variables can identify children with difficulties. The ficulties with an exercise?
interviews were audio-recorded in Google Meet and the fol-
lowing questions were asked: - First off, every kid is different. Because dyslexia is a
word for making a summary of all the problems a kid
- Looking at the variables in table 2, are these relevant has with reading and writing comprehension. Inside
to estimate cognitive capacities of children with dys- reading and writing comprehension, there are several
lexia? parameters we need to measure, and we need to focus
- Are there any variables missing? on the parameters that a kid has a problem with. […]
- With respect to these variables. How do you expect a The challenge is to tell this program if a value is okey.
child with or without dyslexia to perform during these One way is to measure the number of mistakes and
exercises? then see if they are above or below average compared
- How can these variables help to identify individuals to their age group.
having difficulties with an exercise?

Type Dependent measures Description Syllable Sort Rhyme Labyrinth
Time Duration [seconds] The total time to complete a level X X X X
Time to first response [sec- X X X X
onds]
Response frequency [seconds] The time between responses X X X X
Longest pause The maximum value of the response X X X X
frequency
Scores Number of incorrect answers X X X X
Events Number of responses The number of clicks or drag and X X X X
drops
Number of regrets The user has regretted a response X X
Number of audio plays The user clicked to hear an audio file. X X X
Number of fast responses A response given in 1 second or less X X X X
Quotients Response speed [re- Number of responses divided by the X X X X
sponses/second] duration
Error rate Number of wrong answers divided by X X X X
the minimum number of required re-
sponses
Table 3. All the measurements implemented
Incorrect answers
- Secondly, we can look at the time and the other mea-
Within the labyrinth, the user can only select a neighbor to the
surements. The next step is to combine the result of all last selected cell. E.g., a user cannot click on the final cell di-
exercises in order to estimate the capacities of the kid. rectly. Therefore, the coordinates of a clicked cell were con-
Such as reading comprehension and attention capac-
trolled in X and Y -direction. This was accomplished by stor-
ity.
ing the previously selected cells' x and y coordinates. Since a
The interview indicates the importance of comparing the val- user can either move in a horizontal or vertical direction, the
ues between the two groups with the corresponding age
direction must be validated. For a click to be valid in the ver-
group. It also highlights that outcomes of dyslexia are unique,
tical or horizontal direction, the absolute difference between
which underlines the importance of forming a system that can the previous and current coordinates must be equal to one. A
interpret individual differences. wrong answer was returned if these conditions were false,
and the cell showed a red flashing animation. The cell changed
How can performance be measured in Kunna?
to a green background if the answer was correct. To validate
Together with the two requested variables from the interview a response, every cell was configured with the JavaScript at-
and the research results, the variables presented in Table 3 tribute onclick. Similarly, the JavaScript attribute on-
were implemented into the exercises. These variables were
change was used within the "drag and drop" exercises to com-
selected to fit the given design. The number of correct answers pare the value-attribute of a given card with the correct an-
was not included since all exercises require a correct answer swer. Within the syllable and rhyming exercise, an incorrect
to pass. Similarly, accuracy and efficiency were not qualified
card was automatically removed to the initial area.
using the definition in Table 2. Furthermore, the variables
were adapted to fit the exercises. E.g., did several of the re- Time
lated studies use clicks as a primary interaction type [4, 5, 7, Currently, JavaScript provides the two inbuilt methods per-
10]. As noted in Table 1, either clicks or drag-and-drops are formance.now and Date.now to measure a time interval. Per-
used in Kunna. A more detailed explanation of each variable formance.now was selected since it runs independent of the
is described in the implementation section. code and is not dependent on JavaScript event loops [15]. The
method's accuracy is estimated to five microseconds [15].
IMPLEMENTATION Therefore, performance.now appears to be accurate enough.
This section will cover how the exercises were built and how A time interval is measured by calling the method twice. A
the variables were implemented and measured. variable t0 stores the time when an event is initialized, and a
variable t1 stores the current time when the event is finished.
Application Setup The time interval is the difference between t1 and t0 but when
The application was built using the Model View Controller should these two variables be initialized? Vue.js includes the
(MVC) framework Sails.js [29] and was connected to a Post- function this.$nextTick, which is called when the entire view
gres database. The front-end was built using the JavaScript is rendered. Hence, the time variable t0 is initialized to ensure
framework Vue.js [30]. The website's images, sounds and the time clock starts when the content is fully displayed. The
texts were stored using the Content Service Manager (CSM) duration was calculated immediately after the last cell was
Contentful.

clicked or when a card was dropped by calculating the differ- strength of the relationship between two variables by return-
ence between t0 and t1. ing one value within the range of -1 to 1. To interpret the co-
efficient is r = 0.1 told to indicate a small correlation, 0.30 me-
Response frequency dium and 0.5 a strong correlation [25]. As duration and errors
Response frequency is defined as the time between each re- are considered as the most vital variable, these two variables
sponse. This was calculated by taking the time difference be- are compared to the others in table 3. To support the correla-
tween the previous and the current response. The value was tions and interpret the spread of the data, box plots are pre-
stored in a list. The purpose of this variable is to see whether
sented - A recommended approach to visualize non-paramet-
the user generally responded slowly or fast during a level. The
final value is the mean of the stored values. ric data [25]. The box of a boxplot shows 50 percent of the
data fall; the horizontal line indicates the median and the X
Time to the first response the mean value. The whisker marks the maximum and mini-
The time to each response from the initial time was calculated mum value and outliers (values greater than three standard
and stored in a list. Time to first response is the first value in deviations of the mean) are marked with a ring [25]. The box-
this list. plots were created in Excel.

Longest pause Ethics and Recruitment
The longest pause is the maximum value in response fre- The participants were recruited by Prototyp. The parents re-
quency list. If this value is very high, e.g., a minute or more, it ceived a consent form to confirm their permission. The con-
may indicate that the user is not paying attention to the exer- sent form specified the scientific use of the data collection and
cise. It could also indicate that the user is struggling or being made clear that the children were allowed to quit at any time.
more thoughtful about the next move. The parents received an email with instructions specifically
telling the parents not to help solve the exercises. The partic-
Fast responses ipants were anonymized by a randomized email address and
A fast response is defined as a value in the response frequency a password to access the website. 16 Spanish-speaking chil-
list that is equal to or less than one second. The purpose of dren (8 boys and 8 girls) without a diagnose participated in
this variable is to estimate the sureness of the user. Compared testing Kunna’s exercises. The participants were between 7-8
with the other variables could, e.g., a high number of re- years old and recruited by Prototyp. They were further re-
sponses and a high number of errors, indicate a guessing ap- quested to use the web browser Chrome (version 80 or later).
proach. Likewise, a combination of a high number of fast re- The estimated time to complete all four exercises (excluding
sponse and a low number of errors could indicate certainty. tutorials) was approximately five minutes.
Quotients
Miss rate [errors/response] is defined as the number of incor-
TEST RESULTS
rect answers divided by the total number of responses. Speed TIME
[responses/second] is calculated as the number of responses 300
divided by the total time.
250 Labyrinth
Database 200
Syllable
seconds

The measures from each exercise are saved in separate Post- 150
gres tables. The results are sent directly after a level is fin- Rhyme
100
ished to make sure every completed level is stored. Besides
the other measurement, all the user’s answers are stored in a 50 Sort
list together with the time until each response. For example, 0
this might tell which target is most often mixed up.
Figure 3. The total time to complete an exercise
Automated Tests The mean total time to complete Labyrinth was 133 seconds
To verify that all the measures were implemented correctly, (sd = 61), the corresponding for Syllable was 69 seconds (sd
automated tests were executed using Mocha.js. The automatic = 32), Rhyming 73 seconds (sd = 49) and Sort 55 seconds (sd
test consisted of two parts: validate the database storage (by = 27). It appears as the labyrinth was more spread and gener-
returning a 200 response) and test if the validation functions ally took the longest time. Compared to the other exercises,
returned the correct values. were no images or sounds implemented. Thus, can the laby-
rinth be assumed to be trickier as these children recently have
CHARACTERISTICS OF POOR PERFORMANCE learned to read. Looking at the correlations in table 4, errors
To find characteristics of poor performance, Pearson's corre- and duration has a medium to strong correlation for all the
lation coefficients were calculated using Excel version 16.45. exercises.
The aim was to find correlations between errors and time in
relation to the other variables. The coefficient measures the

Exercise                       Duration          TTFR       Errors            RF          FR           AP              LP

           Labyrinth     Duration             -                0,712   0,693            0,844      -0,659          -          0,808
                          Errors            0,693              0,504       -            0,288      -0,173          -          0,359
           Syllable      Duration             -                0,981   0,411            0,377         -          0,764        0,351
                          Errors            0,411              0,252       -            0,851         -          0,275        0,902
            Rhyme        Duration             -                0,990   0,623            0,699         -          0,763        0,428
                          Errors            0,623              0,550       -            0,774         -          0,457        0,725
               Sort      Duration             -                0,947   0,381            0,661      0,319         0,759        0,034
                          Errors            0,381              0,252       -            0,042      0,942        -0,054        0,005
Table 4. Pearson correlation coefficient for duration, errors and the other variables.

MEAN TIME TO FIRST RESPONSE (TTFR)                                     LONGEST PAUSE (LP)
           30                                                                      18
           25                                       Labyrinth                      16
                                                                                   14                                     Labyrinth
           20                                                                      12
 seconds

                                                    Syllable
                                                                                                                          Syllable
                                                                         seconds
           15                                                                      10
                                                    Rhyme                           8
           10                                                                                                             Rhyme
                                                                                    6
               5                                    Sort                            4                                     Sort
               0                                                                    2
                                                                                    0
Figure 4. The mean time to the first response for all levels
Syllable resulted in the slowest Time to first response with a         Figure 6. The mean longest pause
mean time of 10.92 seconds (sd = 4.98). The time for the lab-          The mean longest pause within the labyrinth was 5.9 seconds
yrinth was 9.16 (sd = 4.74) and 8.37 seconds for Rhyme (sd =           (sd = 4.59). The corresponding for Syllable was 0.66 (sd
8.38) and 5.83 for Sort (sd = 3.36). Looking at the correlations       =2.15), Rhyme 0.73 (sd = 2.70) and Sort 2.98 seconds (sd =
in table 4, errors and TTFR has a medium to strong correla-            2.31). Similar to the response frequency, the longest pause is
tion for all the exercises.                                            non-zero for the phonological testing exercises if a mistake
                                                                       was made. This is indicated by the strong correlation between
MEAN RESPONSE FREQUENCY (RF)                                           the longest pause and the number of errors. This variable in-
                                                                       dicates that no user was absent for a longer time as the high-
           6
                                                                       est value is approximately 25 seconds.
           5                                        Labyrinth
           4                                                           Scores
 seconds

                                                    Syllable
           3                                                           TOTAL INCORRECT ANSWERS
                                                    Rhyme
           2                                                            30
           1                                        Sort                25                                                 Labyrinth
           0                                                            20
                                                                                                                           Syllable
Figure 5 The mean response frequency                                    15
                                                                                                                           Rhyme
The labyrinth and Sort resulted in the highest RF with 2.6 sec-         10
onds (sd = 1,2) respectively 2.64 seconds (sd = 1,2). Syllable              5                                              Sort
resulted in 0.58 (sd = 0.95) and Rhyme 0.75 (sd = 1.08) sec-                0
onds. The RF was generally higher within the labyrinth and
Sort. However, RF is calculated after the first given response.        Figure 7. The total number of incorrect answers
Since Syllable and Rhyming require a minimum of one answer             The participants made in average 5.6 incorrect answers (sd =
to complete, RF is zero if a correct answer is given. This indi-       6.5) in Labyrinth, Syllable resulted in 0.625 errors (sd = 0.88),
cates why error and RF are highly correlated for these two ex-         Rhyme 1.0 (sd = 1.67) and Sort 0.937 (sd = 1.06). The laby-
ercises and the strong correlation between LP and the num-             rinth resulted in a higher number of errors, which was ex-
ber of errors in Rhyme and Syllable. Thus, RF is mainly work-          pected as the design allows for more mistakes (e.g., by acci-
ing within the Labyrinth.                                              dently pressing a cell). Furthermore, the number of errors

were slightly higher within Sort, which is expected as the par- were not required to listen in Rhyme, audio plays could indi-
ticipants were required to give two answers per level com- cate poor performance.
pared to Rhyme and Syllable. But the correlation between the
errors and duration are less strong in Sort and Syllable com- How are these variables related to poor performance?
pared to the other exercises. First, we can see obvious correlations between duration,
TTFR and longest pause. Regarding the central findings:
FAST RESPONSES (FR)
• In Labyrinth: errors were strongly correlated with du-
25 ration and TTFR. The duration was strongly nega-
20 Labyrinth tively correlated with fast responses and response fre-
quency.
15 Syllable • In Syllable: errors were strongly correlated with re-
10 Rhyme sponse frequency and longest pause. However, these
are invalid due to the dependency with errors. Dura-
5 Sort tion was correlated with response frequency and au-
0 dio plays.
• In Rhyme: errors were strongly correlated with dura-
Figure 8. The total number of fast responses tion, TTFR, response frequency and longest pause.
The labyrinth's mean fast response was 8.06 (sd = 7.1) but Similarly, to Syllable, RF and LP are also dependent on
gave no results for the syllable and the rhyme. Sort resulted the errors. The duration was correlated with audio
in 0.5 (sd = 0.63). The purpose of this variable was to count plays.
spontaneous answers. However, the results indicate that this • In Sort: errors were strongly correlated with fast re-
measure can be interpreted differently for click-based and sponses. The duration was strongly correlated with
drag-based exercises. There is a strong negative correlation audio plays.
with time within the labyrinth but a small correlation be-
tween errors, indicating a rapid but accurate response. This is Duration and Errors were strongly correlated with TTFR in
the opposite of Sort, which shows that a fast response is Labyrinth and Rhyme and with Audio plays in Rhyme. How-
strongly correlated with errors and not with response fre- ever, the audio boxplot shows an outlier which could be ar-
quency. This indicates that the users corrected a wrong an- gued to affect the correlation. Furthermore, TTFR and Dura-
swer quite fast. However, the number of total fast response tion appear to be the same value in the drag-and-drop-based
and errors were quite a few within Sort and thus is such a con- exercise, as the correlation between these two is approxi-
clusion too weak to be seen as a general rule. mately one. Duration and Audio plays were strongly corre-
lated for all of the drag-and-drop-based exercises but were
TOTAL AUDIO PLAYS (AP) medium to non-correlated with Errors. Audio plays could be
30 assumed to indicate poor performance in Syllable and Rhyme.
Though, the correlation with Duration could partly be ex-
25 Labyrinth
plained by the time it takes to listen to the audio file. Further-
20 more, the longest pause has a medium correlation with errors.
Syllable
15 But the boxplots indicate an outlier in incorrect answers. To
Rhyme summarize, these results could be argued as too weak to es-
10
5 Sort tablish a general conclusion. Therefore, the conclusion is that
these results cannot prove strong relationships that indicate
0
poor performance.
Figure 9. The total number of audio plays
The mean number of audio plays was in Syllable totally 6.18
TESTING THE SUPPORT VECTOR MACHINE
(sd = 4.95) plays. Corresponding for Rhyme was 5.06 (sd = Dividing participants into two groups
7.41) Sort and 1.68 (sd = 2.67). (The labyrinth had no audio) As a proof of concept, two groups (G1 and G2) were hand-
Rhyme and Syllable had a higher number of audio plays, picked to see whether the system could identify them. The
which was expected as the design required the users to listen. correlation analysis could not indicate any strong relation-
Time was strongly correlated with the number of audio plays ships for poor performance characteristics. Hence, the selec-
for all three exercises. However, the correlation between au- tion was accomplished by looking at the individual perfor-
dio and errors was stronger within Rhyme but medium cor- mances. Participants with clearly outlying values in duration
related in Syllable and non-correlated in Sort. Since the users and errors were placed in group 2 (G2). As Mark Hall [14] sug-
gests, good features for machine learning should preferably
not be dependent on other features, which is the case for

Group         N        Duration        TTFR     Errors          RF         FR               AP         Miss rate       Speed
                                            110,4          8,1      1,7            9,5        9,1                               0,1        1,7
                    G1           9                                                                              -
                                            (58,6)        (5,1)    (2,5)          (5,7)      (8.6)                             (0.2)      (0.8)
 Labyrinth
                                            162,2         10,5     10,7           11,3        6,7                               0,8        1,3
                    G2           7                                                                              -
                                            (56,1)        (4,5)    (6,7)          (3,1)      (5,2)                             (0,4)      (0,3)
                                             60,4          9,9      0,1            0,7        0,0               5,1             0,0        0,9
 Syllable           G1           11
                                            (28,9)        (4,9)    (0,3)          (2,3)       (0)              (4,7)           (0.2)      (0.4)
                                             89,6         13,1      1,8            9,7        0,0               8,6             0,8        0,7
                    G2           5
                                            (32,1)        (5,0)    (0,4)          (6,2)       (0)              (5,2)           (0,2)      (0,2)
                                             55,2          6,7      0,3            2,0        0,0               2,5             0,2        1,6
 Rhyme              G1           12
                                            (25,3)        (2,9)    (0,5)          (4,0)       (0)              (3,4)           (0,2)      (0,6)
                                            126,4         13,5      3,0           18,0        0,0              12,8             1,5        0,9
                    G2           4
                                            (70,6)        (8,5)    (2,4)          (8,6)       (0)             (10,4)           (1,2)      (0,4)
                                             48,0          5,0      0,6           15,4        0,3               1,3             0,1        2,0
 Sort               G1           13
                                            (17,0)        (1,7)    (0,9)          (7,8)      (0,5)             (1,6)           (0,2)      (0,7)
                                             85,3          9,4      2,3           18,0        1,3               3,3             0,5        1,5
                    G2           3
                                            (47,3)        (7,8)    (0,6)          (5,1)      (0,6)             (5,8)           (0,1)      (0,5)
Table 5. The table is showing how the SVM categorized the dataset into two groups using the linear kernel. The mean values
and the standard deviations (in parenthesis) are presented.
                                                                                                     Linear             Rbf            Polynomial
response frequency and the longest pause within the drag-                   Mean accuracy             80%              74%                79%
and-drop-based exercises. Therefore, these features were not                N agreements               52               49                 45
used. The classification was done using Python and the SciPy                N disagreements            12               15                 19
                                                                            p-value                   0.629            0.010             0.0524
SVM-package. The linear, polynomial and the Rbf kernel were
                                                                            Statistic                  7.0              5.0                6.0
tested. Leave-one-out cross-validation was used due to the
small data set, meaning that 15 participants were used for                 Table 7. The mean accuracy, disagreements and p-value
training and one for classification. This was repeated for                 for all exercises and kernel.
every participant and exercise.
                                                                           Table 5 shows how the SVM classified the dataset into two
Evaluating the classification performance                                  groups. The mean errors and duration are slightly higher in
A McNemar test was used to test if the system significantly                group 2 and more audio plays can be seen in Rhyme and Sort.
disagrees between the predicated and the pre-selected partic-              Thus, in this minimized version, the system gives a small indi-
ipants for all exercises. This was accomplished by counting                cation to be working as intended. Though, the standard devi-
the number of true positives (TP), false positives (FP), true              ation indicates an overlapping spread between several varia-
negatives (TN), and false negatives (FN) in all exercises for              bles. According to the classification results in table 6 and 7,
each kernel. Accuracy and F1-scores indicate the efficiency of             the linear kernel was most effective in identifying the selected
the SVM. Accuracy is defined as the number of correct classi-              participants with a mean accuracy of 80 percent. The F1-
fications divided by the total number of classifications.                  scores signals that the Rbf and the polynomial kernel under-
(TP+TN/TP+TN+FP+FN) F1 score is a harmonic mean of pre-                    performed in identifying true positives. i.e., the participants in
cision (TP/TP+TN) and recall (TP/TP+FN) ranging from zero                  group 2. The mean accuracy for the Rbf was 74 percent, but
to one. A higher value indicates a higher classification perfor-           the kernel underperformed in classifying the correct targets
mance [27].                                                                as the F1 score was zero in three of the exercises. The polyno-
                                                                           mial kernel performed slightly better but also showed prob-
CLASSIFICATION RESULTS                                                     lems with classifying the true positives in Syllable. The p-val-
                    N*         Linear     Rbf        Polynomial            ues for the polynomial and the Rbf also indicate a general dis-
 Labyrinth          6           81%       69%           69%                agreement between the selections and the predictions. I.e.,
                               (0.77)    (0.54)        (0.28)              the true positives and false negatives were not the same.
 Syllable            4          69%       62%           75%                However, the linear kernel failed to reject the null hypothesis,
                               (0.28)      (0)           (0)               I.e., that the disagreements are not significant. Still, 20 percent
 Rhyme               4          88%       75%           87%
                                                                           of the participants were misclassified, which could be argued
                               (0.66)      (0)         (0.66)
 Sort                3          81%       81%           87%                as insufficient.
                                (0.4)      (0)          (0.5)
Table 6. Accuracy and F1-score for the exercises and the                   DISCUSSION
kernels *The number of participants selected in group 2                    THE VARIABLES APPLICABILITY TO THE GAME DESIGN
who had a higher number of errors and duration.                            This study aimed to see if a support vector machine could
                                                                           identify dyslectic children who performed poorly within a se-
                                                                           rious game. Several variables were implemented to get a

comprehensive picture of how the participants interacted machine learning to define clusters, which could support in
with the game and to find poor performance characteristics. distinguish groups with different behaviors. A second ap-
The variables appeared to be diverse and dependent on the proach to solve the misclassifications (as mentioned in the in-
exercise, making it complicated to define general rules for dif- terview) would be to decide thresholds for the variables. This
ficulty characteristics. E.g., Syllable requires the user to listen would not result in any misclassifications of true positives
to a sentence and thereby TTFR cannot be directly compared and allows neurophysiologists total control of the system.
to the other exercises. Syllable and Rhyme (which required a This approach could also be argued as less time-consuming as
minimum of one answer per level) did not result in longest no selections are required. However, a higher level of com-
pause or response frequency if the user's first answer was cor- plexity will be added if several variables are being used.
rect. The Labyrinth required several solutions, which resulted
in more responses per level which benefit the use of fast re- FUTURE WORK
sponses, response frequency and longest pause. Thus, in terms As this was a simplified case, the game should be tested on
of maximizing these variables' applicability, a game design re- children within the targeting group. Thus, to compare the
quiring more than one answer is recommended. measurements by testing for significant differences and vali-
date that the exercises are challenging. Conducting the tests
METHOD CRITIQUE in a controlled environment would benefit the reliability and
The linear kernel resulted in an accuracy of 80 percent, which the applicability of these variables to estimate poor perfor-
could be argued as insufficient and vague since a considerable mance. Furthermore, the variables that correspond to each
portion is misclassified. The selection was based on the defi- classification group should be statistically compared. As the
nition that outlying values in errors and time mainly charac- parameters can be assumed to be non-parametric, a Mann-
terize poor performance. This assumption was based on stud- Whitney U test would be a reasonable choice [25]. From the
ies reporting that these two variables appear to differentiate discussion followed that this approach of identifying strug-
people with dyslexia from non-dyslectics. Though, the aim gling is not covering the spectra of different kinds of problem-
was to broaden this definition in the context of Kunna's exer- solving approaches. However, this paper only used Support
cises using correlation analysis. Of course, such an approach Vector Machines, and other machine learning techniques like
is limited by not comparing and analyzing the target group. Random Forest, K-neighbors, or Neural Networks could be
However, the correlation analysis may not be sufficient to find tested to receive a greater accuracy. Also, these multiclassifi-
poor performance characteristics as the cause and effect are cation algorithms may also be used to divide the spectra of
not certain. Thus, two variables can be correlated but still be different problem-solving behaviors.
unrelated. A comparison with the target group and a con-
trolled observing test could complement the correlation anal- CONCLUSION
ysis to understand better how the users interact with these This paper aimed to see if a support vector machine could
games. identify dyslectic children who performed poorly within a se-
rious game. Eleven variables were used to measure the per-
INTERPRETING POOR PERFORMANCE formance of 16 Spanish-speaking children. Four measures not
The definition of poor and well performance is floating, and a found in similar studies were: response frequency, longest
line must be drawn somewhere, and the purpose of Support pause, fast responses, and audio plays. However, the game de-
Vectors Machines is to find this optimal line. The benefit and sign limited the applicability of these variables. Therefore, a
perhaps a disadvantage of this approach is that no strict game design requiring more than one answer that allows sev-
thresholds need to be decided for each variable. However, this eral independent variables is suggested. However, there was
binary classification causes a diverse mix of problem-solving not enough evidence to prove poor performance characteris-
behaviors separated into two groups. Of which one group tics between these variables. Though, the analysis was done
contains participants with outlying values. This causes a bi- on children not diagnosed with dyslexia, and the game should
ased selection and could result in a non-desirable outcome, therefore be tested and statistically compared with the target
which emphasizes well-founded selections. But as mentioned group. Even though the game could never be thoroughly eval-
in the interview, every child is different. E.g., one user may uated, the highest accuracy received was 80 percent using a
have many errors but a short completion time, another may linear kernel. However, from the discussion followed that the
have a long duration but a few errors, and a third may have Support Vector Machine may not be the most efficient choice
many errors and a long duration etc. Thus, the binary SVM re- for identifying poor performance as the binary SVM merges
quires a selection process that merges the spectra of various different problem-solving behaviors. Therefore, this paper
problem-solving behaviors. This is going against Mark Hall's suggests that future work should consider multiclassification
[14] definition of good features as various results reduces the algorithms.
correlations to the class, which could reduce the classification
performance. Another approach would be unsupervised

ACKNOWLEDGMENTS                                                                    [13] William S. Noble. 2006. What is a support vector machine?. Nature biotech-
                                                                                   nology 24.12 (2006): 1565-1567.
I want to thank Pontus Österberg for his support and the em-                       [14] Mark Andrew Hall. 1999. Correlation-based feature selection for machine
ployees at Prototyp for helping with technical issues. I would                     learning.
also like to thank the people at Neurotalentum and my super-                       [15] Michael Schwarz, Clémentine Maurice, Daniel Gruss, Stefan Mangard.
                                                                                   2017. Fantastic timers and where to find them: High-resolution microarchitec-
visor Patric Dalhqvist.                                                            tural attacks in javascript. International Conference on Financial Cryptography
                                                                                   and Data Security. Springer, Cham.
REFERENCES                                                                         [16] Anthony Jason and David Francis. 2005. Development of phonological
[1] Eva Germanò, Antonella Gagliano and Paolo Curatolo. 2010 Comorbidity of        awareness. Current directions in psychological Science 14.5 (2005): 255-259.
ADHD and dyslexia. Developmental neuropsychology 35, no. 5 (2010): 475-            DOI:https://doi.org/10.1111/j.0963-7214.2005.00376.x
493.                                                                               [17] Claudia Maehler and Kirsten Schuchardt. 2016. Working memory in chil-
[2] American Psychiatric Association. 2013. Diagnostic and statistical manual      dren with specific learning disorders and/or attention deficits. Learning and
of mental disorders (DSM-5®). American Psychiatric Pub.                            Individual Differences 49 (2016): 341-347.
[3] Ombretta Gaggi, Claudio Enrico Palazzi, Matteo Ciman, Giorgia Galiazzo,        [18] Andrea Facoetti, Pierluigi Paganoni, Massimo Turatto. Valentina Marzola
Sandro Franceschini, Milena Ruffino, Simone Gori, and Andrea Facoetti. 2017.       and Gian Gastone Mascetti. 2000. Visual-spatial attention in developmental dys-
Serious Games for Early Identification of Developmental Dyslexia. Comput. En-      lexia." Cortex 36.1 (2000): 109-123. DOI:https://doi.org/10.1016/S0010-
tertain.    15,     2,    Article    4    (Summer       2017),     24     pages.   9452(08)70840-2
DOI:https://doi.org/10.1145/2629558                                                [19] Lynette Bradley and Peter E. Bryant. 1983. Categorizing sounds and learn-
[4] Luz Rello, Miguel Ballesteros, Abdullah Ali, Miquel Serra, Daniela Alarcón     ing to read—a causal connection. Nature 301.5899 (1983): 419-421.
Sánchez, Jeffrey P. Bigham. 2016. Dytective: diagnosing risk of dyslexia with a    [20] Alexandra Poole, Farhana Zulkernine, and Catherine Aylward. 2017. Lexa:
game. In PervasiveHealth (pp. 89-96).                                              A tool for detecting dyslexia through auditory processing. IEEE Symposium Se-
[5] Maria Rauschenberger, Ricardo Baeza-Yates, and Luz Rello. 2020. Screening      ries on Computational Intelligence (SSCI). IEEE,
risk of dyslexia through a web-game using language-independent content and         [21] Christopher Sterling, Marion Farmer, Barbara Riddick, Steven Morgan,
machine learning. In Proceedings of the 17th International Web for All Confer-     Catherine Matthews.1998. Adult dyslexic writing. Dyslexia, 4(1), 1-15.
ence (W4A '20). Association for Computing Machinery, New York, NY, USA, Ar-        [22]. Heikki Lyytinen, Miia Ronimus, Anne Alanko, Anna-Maija Poikkeus & Ma-
ticle 13, 1–12. DOI:https://doi.org/10.1145/3371300.3383342                        ria Taanila. 2007. Early identification of dyslexia and the use of computer game-
[6] Voravika Wattanasoontorn, Imma Boada, Rubén García, Mateu Sbert. 2013.         based practice to support reading acquisition, Nordic Psychology, 59:2, 109-
Serious games for health. Entertainment Computing 4.4 231-247.                     126, DOI: 10.1027/1901-2276.59.2.109
[7] Maria Rauschenberger, Luz Rello, Ricardo Baeza-Yates, and Jeffrey P.           [23] Swan, Denise, and Usha Goswami. 1997. Phonological awareness deficits
Bigham. 2018. Towards Language Independent Detection of Dyslexia with a            in developmental dyslexia and the phonological representations hypothe-
Web-based Game. In Proceedings of the Internet of Accessible Things (W4A           sis." Journal of experimental child psychology 66.1 (1997): 18-41.
'18). Association for Computing Machinery, New York, NY, USA, Article 17, 1–       [24] Richard Boada, Erik G. Willcutt, and Bruce F. Pennington. 2012. Under-
10. DOI:https://doi.org/10.1145/3192714.3192816                                    standing the comorbidity between dyslexia and attention-deficit/hyperactivity
[8] Serrano, Francisca and Sylvia Defior. 2008. Dyslexia speed problems in a       disorder. Topics in Language Disorders 32.3 (2012): 264-284.
transparent orthography. Annals of dyslexia 58.1 (2008): 81.                       [25] Andy Field and Graham Hole. 2003. How to design and report experiments.
[9] Kast, Monika, et al. Computer-based learning of spelling skills in children    SAGE Publications, London, 2003.
with and without dyslexia. Annals of dyslexia 61.2 (2011): 177-200.                [26] Johannes C. Ziegler, Caroline Castel, Catherine Pech-Georgel, Florence
[10] Luz Rello, Enrique Romero, Maria Rauschenberger, Abdullah Ali, Kristin        George, F-Xavier Alario and Conrad Perry. 2008. Developmental dyslexia and
Williams, Jeffrey P. Bigham, and Nancy Cushen White. 2018. Screening Dyslexia      the dual route model of reading: Simulating individual differences and sub-
for English Using HCI Measures and Machine Learning. In Proceedings of the         types. Cognition 107.1 (2008): 151-178.
2018 International Conference on Digital Health (DH '18). Association for Com-     [27] Zachary C Lipton, Charles Elkan, and Balakrishnan Naryanaswamy. 2014.
puting        Machinery,       New       York,       NY,       USA,       80–84.   Optimal thresholding of classifiers to maximize F1 measure. Joint European
DOI:https://doi.org/10.1145/3194658.3194675                                        Conference on Machine Learning and Knowledge Discovery in Databases.
[11] Iles Jo, Vincent Walsh, and Alex Richardson. 2000. Visual search perfor-      Springer, Berlin, Heidelberg.
mance in dyslexia. Dyslexia 6.3 (2000): 163-177.                                   [28] Scikit, 2021, Sklearn.svm.SVC. Retrieved from https://scikit-
                                                                                   learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
[12] Aleksandar Tenev, Silvana Markovska-Simoska, Ljupco Kocarev, Jordan
                                                                                   [29] Sails.js, 2021, Retrieved from: https://sailsjs.com/
Pop-Jordanov, Andreas MüllerGian and Candrian Tenev. 2014. Machine learn-
                                                                                   [30] Vue.js, 2021, Retrieved from: https://vuejs.org/
ing approach for classification of ADHD adults. International Journal of Psycho-
physiology 93.1. 162-166.