Longitudinal cardio-respiratory fitness prediction through free-living wearable sensors

Page created by Nelson Mccarthy

Buildings

English

Like
Share
Embed
Fullscreen
Slides
Download HTML
Download PDF
Abuse

←

→

Page content transcription

If your browser does not render page correctly, please read the page content below

Longitudinal cardio-respiratory fitness prediction through free-living wearable sensors

Longitudinal cardio-respiratory fitness prediction
                                        through free-living wearable sensors
                                        Dimitris Spathis1,*+ , Ignacio Perez-Pozuelo2+ , Tomas I. Gonzales2 , Soren Brage2 ,
                                        Nicholas Wareham2 , and Cecilia Mascolo1
                                        1 Department  of Computer Science and Technology, University of Cambridge, UK
                                        2 MRC   Epidemiology Unit, School of Clinical Medicine, University of Cambridge, UK
                                        * corresponding author: ds806@cam.ac.uk
                                        + these authors contributed equally to this work

                                        SUMMARY
arXiv:2205.03116v1 [cs.LG] 6 May 2022

                                        Background Cardiorespiratory fitness is an established predictor of metabolic disease and mortality. Fitness is directly
                                        measured as maximal oxygen consumption (VO2 max), or indirectly assessed using heart rate response to a standard exercise
                                        test. However, such testing is costly and burdensome, limiting its utility and scalability. Fitness can also be approximated using
                                        resting heart rate and self-reported exercise habits but with lower accuracy. Modern wearables capture dynamic heart rate data
                                        which, in combination with machine learning models that find latent patterns in high-dimensional sensor signals, could improve
                                        fitness prediction.
                                        Methods In this work, we analyze movement and heart rate signals from wearable sensors in free-living conditions from
                                        11,059 participants who also underwent a standard exercise test, along with a longitudinal repeat cohort of 2,675 participants.
                                        We design algorithms and models that convert raw sensor data into cardio-respiratory fitness estimates, and validate these
                                        estimates’ ability to capture fitness profiles in a longitudinal cohort over time while subjects engaged in real-world (non-exercise)
                                        behavior. Additionally, we validate our methods with a third external cohort of 181 participants who underwent maximal VO2 max
                                        testing, which is considered the gold standard measurement because it requires reaching one’s maximum heart rate and
                                        exhaustion level.
                                        Findings Our results show that the developed models yield high correlation (r = 0.82, 95CI 0.80-0.83), when compared to
                                        the ground truth in a holdout sample. These models outperform conventional non-exercise fitness models and traditional
                                        bio-markers using measurements of normal daily living without the need of a specific exercise test. Additionally, we show the
                                        adaptability and applicability of this approach for detecting fitness change over time in the longitudinal subsample who repeated
                                        measurements after 7 years.
                                        Interpretation These results demonstrate the value of wearables for the longitudinal assessment of fitness that today can be
                                        measured only with laboratory tests.
                                        Funding Partly funded by the University of Cambridge, Jesus College Cambridge, EPSRC, and GSK i-CASE scholarships.

                                        Introduction                                                        predicted maximal HR, and surpassing a peak respiratory
                                                                                                            exchange ratio threshold2 . The costs of VO2 measurement
                                        Cardiorespiratory fitness (CRF) is one of the strongest known       and risks of exhaustive exercise not only limit direct CRF as-
                                        predictors of cardiovascular disease (CVD) risk and is in-          sessment in clinical settings, but also restrict research of CRF
                                        versely associated with many other health outcomes 1 . CRF is       at the population level. Thus, our understanding of differences
                                        also a potentially stronger predictor of CVD outcomes when          in CRF within populations, across geographic regions, and
                                        compared to other risk factors like hypertension, type 2 dia-       over time is lacking.
                                        betes, high cholesterol, and smoking. Despite its prognostic           Non-exercise prediction models of VO2 max are an alter-
                                        value, routine CRF assessment remains uncommon in clinical          native to exercise testing in clinical settings. These models
                                        settings because maximal oxygen consumption (VO2 max),              are usually regression based and incorporate variables like
                                        the gold-standard measure of CRF, is challenging to directly        sex, age, body mass index (BMI), resting heart rate (RHR),
                                        measure. A computerised gas analysis system is needed to            and self-reported physical activity 3 . We have recently shown
                                        monitor ventilation and expired gas fractions during exhaus-        that RHR alone can be used to estimate VO2 max4 , however
                                        tive exercise on a treadmill or cycle ergometer. Additional         the validity of estimates from this approach are considerably
                                        equipment may be needed to monitor other biosignals, such           lower than those achieved with exercise testing5, 6 . Wearable
                                        as heart rate (HR). These equipment require trained research        devices such as activity trackers and smartwatches can moni-
                                        personnel to operate, and an attending physician is a requisite     tor not only RHR and physical activity but other biosignals in
                                        for exercise testing in some scenarios. Several criteria must       free-living conditions, potentially enabling more precise esti-
                                        also be achieved to verify that exhaustion has been reached,        mation of VO2 max without exercise testing Recent attempts
                                        including leveling-off of VO2 , achieving a percentage of age-      to use wearable devices to estimate VO2 max are difficult to

Research in Context
Evidence before this study
Large-scale assessment of physical activity and heart rate data could enable training robust models that anticipate and predict
fine-grained changes of fitness over time. Current fitness estimation models, however, have limitations. We searched the
PubMed database for publications up to April 27 2022, for studies about VO2max estimation through wearables, using the
search criteria "Vo2max AND wearable" and "Vo2max AND machine learning". We should note that the intersection of the
three terms "Vo2max AND machine learning AND wearable" yielded only one result. We systematically screened all 34
results (21, 12, 1 from each respective search) by reviewing abstracts, and identified five original research studies which were
relevant to the study aims. These studies were either small-scale assessments of VO2max or required data collected from non
free-living environments. We discuss these studies in further detail in the respective sections of this paper.
Added value of this study
In this study, deep learning models were developed for remote and scalable estimation of cardio-respiratory fitness through
wearable devices. Our study expands on previous studies by being the first large-scale study integrating a large cohort, a
longitudinal validation cohort, and a third gold standard cohort. The deep learning model takes statistical features extracted
from the raw sensor timeseries covering movement and heart rate attributes, and returns fine-grained predictions of VO2max,
either in the present or the future. We independently validated the models on several holdout test sets (different participants
from the set used when training the AI model). Our results show that using passive sensor data from everyday life can
improve fitness prediction significantly, opening the door to population-scale assessment of fitness, moving beyond traditional
predictors of individuals’ health such as weight or BMI.
Implications of all the available evidence
The performance of our deep learning-based sensor analysis model suggests that it can potentially be used for fitness
estimation with similar accuracy to a laboratory-measured test. In the future, wearables could include additional sensors
which could further improve fitness prediction, such as blood pressure, skin conductance, and oximetry. Future prospective
studies will help to identify the impact of personalized fitness predictions by suggesting interventions on individuals through
lifestyle, nutrition, and exercise recommendations.

externally evaluate, however, because their estimation meth- subset of the same population who were retested seven years
ods tend to be non-transparent 7 and lack scientific validation later. This has implications for the estimation of population
7, 8 . Although certain wearable devices show promise, they fitness levels and lifestyle trends, including the identification
tend to rely on detailed physical activity intensity measure- of sub-populations or areas in particular need of interven-
ments, GPS-based speed monitoring, and require users to tion. Such models can improve both population-health and
reach near-maximal HR values, which limits their use to fit- personalized medicine applications. For instance, the ability
ter individuals9 . Some studies attempt to estimate VO2 max of the patient to undergo certain treatments like surgery11 or
from data collected during free-living conditions, but these chemotherapy12 can be assessed through wearables before
are typically from small-scale cohorts and use contextual data the procedure takes place and therefore reduce postoperative
from treadmill activity, which restricts their application in complications.
population settings10 .

Here, we use data from the largest study of its kind, by over
Results
two orders of magnitude, and use purely free-living data to Baseline measurements were collected from 12,435 healthy
predict VO2 max, with no requirement for context-awareness. adults from the Fenland study in the United Kingdom13 , where
This work substantially advances previous non-exercise mod- all required data for the present analysis were available in
els for predicting CRF by introducing an adaptive representa- 11,059 participants (Fenland I, baseline timepoint referred to
tion learning approach to physiological signals derived from as "current" in our evaluation). A subset of 2,675 participants
wearable sensors in a large-scale population with free-living were assessed again after a median (interquantile range) of 7
condition data. We employ a deep neural network model that (5-8) years (Fenland II, referred to as "future" in our evalua-
utilizes densely-connected non-linear layers to learn personal- tions). Descriptive characteristics of the two analysis samples
ized fitness representations. We demonstrate that these models are presented in Table 1. We present the characteristics of the
yield better performance than traditional and state-of-the-art longitudinal cohort in both temporal snapshots in in Table 1
non-exercise models. We illustrate how these models can ("present" and "future"). Mean and standard deviations for
be used to predict the magnitude and direction of change in each characteristic are presented in this table. An overview of
CRF. Moreover, we show that they can adapt in time given the study design and the three experimental tasks is provided
behavioural changes by showcasing strong performance in a in Figure 1.

2/16

STUDY DESIGN

                                                                         Δ ~ 7 years
                                                      Fenland I                                           Fenland II
                                                     (N = 11,059)                                         (N = 2,675)

     BASELINE VISIT (DAY 1)                                                         FREE-LIVING PHASE (DAY 1-7)

                                                    Anthropometrics                           ~ 6 days
                 VO2max
                 Test
                                                    Questionnaires                 Movement and heart
                                                                                   data from wearables

     MODELING FITNESS                                                                            Fine-grained cardiorespiratory
                                                                                          1      ﬁtness prediction in the present
                   FEATURES
                                                                                                 Inferring ﬁtness direction and
                                                                                          2
                                       USERS

                                                                                                 magnitude 7 years in the future

                                                                                                 Validating adaptability and
                                                                                          3      change with new sensor data
                                                      DEEP NEURAL NETWORK

Figure 1. Study and experimental design. Fenland is a cohort study, including 11,059 participants with treadmill and
wearable sensor data at baseline (Fenland I, 2005-2015) and a longitudinal subsample of 2,675 participants who were retested
approximately 7 years later (Fenland II). During both phases participants underwent a variety of tests during the baseline clinic
visit comprising anthropometric measurements, questionnaires, and a submaximal VO2 max test. Following this baseline clinic
visit, participants were fitted with a combined heart rate and movement sensing device which they wore during free-living
conditions for approximately 6 days. In this work we develop deep learning models utilising wearable data and common
biomarkers as input to predict VO2 max and achieve strong performance, both in the present and when replicated at the second
timepoint.

Table 1. Characteristics for the study analytical sample: The Fenland I and II studies. Data is in mean (std). Values
with asterisk(*) indicate that this variable comes from Fenland II sensor data which is a smaller cohort (N=2071) due to data
filtering (see Figure 6–Panel 3). The values in FII (future) cohort correspond to the second assessment (7 years later). In Task 1,
the training set is FI (present) and the testing set is FII (present) so as to make sure that they come from similar distributions.

                                               Fenland present                      Fenland II f uture                      Fenland II present
                               Men (n= 5229)          Women (n= 5830)     Men (n=1303)           Women (n=1372)     Men (n=1303)     Women (n=1372)
                               mean        std        mean       std      mean      std          mean      std      mean    std      mean        std
 Demographics
     Age (years)               47.70       7.57       47.66      7.36     54.11     7.08         54.76     6.81     47.19   7.18     47.83       6.96
 Anthropometrics
     Height (m)                1.78        0.07       1.64       0.06     1.77      0.06         1.64      0.06     1.77    0.06     1.64        0.06
     Body mass (kg)            85.85       13.83      70.54      13.92    85.31     13.59        69.58     13.77    84.85   13.14    69.04       13.26
     BMI (kg/m2 )              27.16       3.97       26.17      4.97     27.03     4.01         25.85     4.94     27.00   4.03     25.84       4.98
 Physical activity
     MVPA (min/day)            35.87       22.35      34.40      22.59    34.92*    22.18*       35.35*    23.26*   34.41   22.23    32.81       21.45
     VPA (min/day)             3.27        8.57       3.31       15.67    3.57*     8.78*        3.30*     7.52*    3.38    9.30     3.86        27.80
 Resting Heart Rate
     RHR (bpm)                 61.48       8.68       64.46      8.28     59.63*    8.28*        62.21*    8.10*    61.06   8.44     63.81       8.20
 Cardiorespiratory fitness
     VO2 max (ml O2 /min/kg)   41.95       4.61       37.44      4.73     42.32     4.68         37.93     4.72     42.21   4.42     37.84       4.69

                                                                                                                                                         3/16

Fine-grained fitness prediction from wearable sen- fine-grained delta of VO2 max, we formulated this problem
sors as a classification task. A visual representation of this task
We first developed and externally validated several non- can be found in Figure 3a. By inspecting the distribution
exercise VO2 max estimation models as a regression task using of the difference (delta) of current-future VO2 max on the
features commonly measured by wearable devices (anthro- training set, we split it to 2 halves (50% quantiles) and set
pometry, resting heart rate (RHR), physical activity (PA); see these as prediction outcomes. The purpose of this task is
Table 2). Here our goal was to explore how conventional non- to assess the direction of individual change of fitness. We
exercise approaches to VO2 max estimation could be enhanced report an area under the curve (AUC) of 0.61 in predicting the
by features from free-living PA data. We split participant direction of change (N = 2675). We also focused on the tails
data into independent training and test sets. The training set of the change distribution which indicates participants who
(n=8384, participants with baseline data only) was used for had substantial and dramatic change in fitness over the period
model development. The test set (n=2675, participants with of time between Fenland I and Fenland II (≈ 7 years). In this
baseline and followup data) was used to externally validate case, the distribution was split into 80%/20% (substantial)
each model. Models using anthropometry or RHR alone had and 90%/10% (dramatic) quantiles. The results from these
poor external validity, but validity improved when combined experiments show that the models can distinguish between
in the same model. The best performance (R2 of 0.67) was substantial fitness change with an AUC of 0.72 (N = 1068)
attained using a deep neural network model combining wear- and between dramatic fitness change with an AUC of 0.74
able sensors, RHR, and anthropometric data (Figure 2). For (N = 535). All AUC curves can be found in Figure 3b.
reference, we compare these results to traditional non-model
equations, which rely on Body Mass, RHR, and Age. Using Enabling adaptive cardiorespiratory fitness infer-
a popular equation (as proposed in14, 15 ) shows poor validity ences
(R2 of -3.2 and Correlation of 0.389), a performance lower For the final task, we assessed whether the trained models
than using anthropometrics only in our setup (see Methods can pick up change using new sensor data from Fenland II,
for details). This motivates the use of machine learning which Considering that getting new wearable data is relatively easy
captures better covariate interactions. since these devices are becoming increasingly pervasive. The
Deep neural networks can learn feature representations that intuition behind this task is to evaluate the generalizability of
are suitable for clustering tasks, such as population strati- the models over time. We first matched the populations that
fication by implicit health status, but are difficult to reveal provided sensor data for both cohorts (N = 2, 042) and applied
using linear dimension-reduction techniques16 . We used t- the trained model from Task 1 in order to produce VO2 max
distributed stochastic neighbor embedding (tSNE), a nonlin- inferences. We then compared the predictions with the re-
ear dimension-reduction technique, to visualise learned fea- spective ground truth (current and future VO2 max). The true
ture representations from our model and their relationship and predictive distributions are shown in Figures 4c and 4d.
to participant VO2 max and HR levels (Figure 7). Clustering Through this procedure, we found that the model achieves an
and coloring by VO2 max and HR levels was shown to be r = 0.84 for VO2 max future prediction and an r = 0.82 for
inversely related and more apparent in the learned latent space VO2 max current prediction (validating our Task 1 results). In
compared to the original observation space. For example, other words, if we have access to wearable sensor data and
participants with higher VO2 max were clustered similarly to other information from the future time, we can reuse the al-
those with lower HR levels, and vice versa. ready trained model from Fenland I to accurately infer fitness
with minimal loss of accuracy over time, even though this is
new sensor data from a completely separate (future) week.
Predicting magnitude and direction of fitness
Last, we calculated the delta of the predictions and com-
change in the future
pared it to the actual delta of fitness over the years. This
The second group of tasks evaluated our model on the sub- task showed that the models tend to focus mostly on positive
set of participants who returned for Fenland II ≈ 7 years change and under-predict when participants’ fitness deterio-
later (referred to as future in our evaluations). For these ex- rates over the years (Figures 4a, 4b). The overall correlation
periments we carried out three evaluations. Following the between the delta of the predictions with the ground truth is
process described earlier, we re-trained a model to predict significant (r=0.57, p

Predicted
             0.08                                                     True                      60
Kernel Density

             0.06                                                                               50

                                                                                             True
             0.04                                                                               40
             0.02                                                                               30
             0.00                                                                               20
                    20       25      30   35     40       45      50        55                           30          40         50
                                            VO2max                                                            Predicted
                                          (a)
                                                                                                                    (b)

   Figure 2. Fine-grained fitness prediction. Comparing the predicted and true VO2max coming from the best performing
   comprehensive model (Sensors + RHR + Anthro.) trained with Fenland I. (a) Distribution of predicted and true VO2max. The
   plot combines a kernel density estimate and histogram. (b) Correlation of predicted and true VO2max (r = 0.82, p < 0.005,
   see Table 2). The gray line denotes a linear regression fit. Transparency has been applied to the datapoints to combat crowding.

   Table 2. Evaluation of predicting fine-grained VO2max with the Fenland I cohort. Comparison between traditional
   antrhopometrics, common biomarkers (RHR), and passively collected data over a week (wearable sensors). Best performance
   in bold. The units of VO2max are measured in mlO2 /min/kg. Results reported from the testing set.

                                                                 Evaluation Metrics [95% CI]                          N (train+val / test set)
                         Data modality
                                                        R2                    Corr                   RMSE
        Anthropometrics
         Age/Sex/Weight/BMI/Height              0.362 [0.332-0.391]    0.604 [0.579-0.627]    4.043 [3.924-4.172]
        Resting Heart Rate
         RHR (Sensor-derived)                   0.374 [0.344-0.403]    0.615 [0.589-0.639]    4.007 [3.891-4.117]
        Anthropometrics + RHR                                                                                                 11059
         Age/Sex/Weight/BMI/Height/RHR          0.616 [0.588-0.641]    0.785 [0.767-0.802]    3.138 [3.031-3.237]          (8384/2675)
        Wearable Sensors + RHR + Anthro.
         Acceleration/HR/HRV/MVPA               0.671 [0.649-0.692]    0.822 [0.808-0.835]    2.903 [2.801-3.003]
         Age/Sex/Weight/BMI/Height/RHR

                                                                                                                                                 5/16

1.0
        250                                 50% Quantile
                                                                                         0.8
        200
Frequency

                                                                                         0.6

                                                                               Sensitivity
        150           20% Quantile                             80% Quantile
                                                                                         0.4
        100
            50        10% Quantile                             90% Quantile              0.2                   AUC50/50 = 0.61 [CI 0.56-0.66]
                                                                                                               AUC20/80 = 0.72 [CI 0.64-0.78]
                                                                                                               AUC10/90 = 0.74 [CI 0.65-0.85]
             0                                                                           0.0                   chance
                 20   15       10       5       0          5        10        15               0.0   0.2     0.4     0.6       0.8        1.0
                             (VO2maxcurrent VO2maxfuture)                                                  1 - Specificity
                                (a)                                                                          (b)

  Figure 3. Evaluation in predicting the magnitude and direction of the VO2max change between the present and the
  future. (a) Distribution of the ∆ of VO2max in the present and the future. The shaded areas represent different binary bins that
  are used as outcomes, increasingly focusing on the extremes of this distribution. (b) ROC AUC performance in predicting the
  three ∆ outcomes as shown on the left hand side. Brackets represent 95% CIs.

 a deep learning framework for predicting CRF and changes                (Fenland II) VO2 max for those participants that came back
 in CRF over time. Our framework estimates VO2 max by                    approximately 7 years later. For this last task, the model pro-
 combining learned features from heart rate and accelerom-               duced outcomes that translated to a 0.57 correlation between
 eter free-living data extracted from wearable sensors with              the delta of predicted and delta of true VO2 max.
 anthropometric measures. To evaluate our framework’s per-                  The application of our work to other cohort and longitudinal
 formance, VO2 max estimates were compared with VO2 max                  studies is of particular importance as serial measurement of
 values derived from a submaximal exercise test5 . Free-living           cardiorespiratory fitness has significant prognostic value in
 and exercise test data were collected at a baseline investigation       clinical practice. Small increases in fitness are associated
 in 11,059 participants (Fenland I). A subset of those partici-          with reduced cardiovascular disease mortality risk and better
 pants (n=2,675) completed another exercise test at a follow-            clinical outcomes in patients with heart failure and type 2
 up investigation approximately seven years later (Fenland II).          diabetes20 . Nevertheless, routine measurement of fitness in
 This study design allowed us to address three questions: 1)             clinical practice is rare due to the costs and risks of exercise
 Do baseline estimates of VO2 max from the deep learning                 testing. Non-exercise based regression models can be used
 framework agree with VO2 max values measured from exer-                 to estimate changes in fitness in lieu of serial exercise testing.
 cise testing at baseline?, 2) Can the framework learn features          It is unclear, however, the extent to which changes in fitness
 from heart rate and accelerometer free-living data collected            detected with such models reflect true changes in exercise
 at baseline that predict VO2 max measured at follow-up?, and            capacity. Here, we relied on the relationship between CRF
 3) Can the framework be used to predict the magnitude of                and heart rate responses to different levels of physical activity
 change in VO2 max from baseline to follow-up?                           at submaximal, real-life conditions captured through wearable
    In the VO2 max estimation tasks, our model demonstrated              sensors. Using deep learning techniques, we have developed
 strong agreement with VO2 max measured from the submaxi-                a non-exercise based fitness estimation approach that can be
 mal exercise test at baseline (Pearson’s correlation coefficient        used not only to accurately infer current VO2 max, but also
 (PCC): 0.82) as well as for the longitudinal, follow-up visit           to do so in the settings of a future cohort, where the model
 (PCC: 0.72). We were also able to distinguish between sub-              did not require any retraining, just influx of new data. Further,
 stantial and dramatic changes in CRF (AUCs 0.72 and 0.74,               we show that the model can also be used to infer the changes
 respectively). Finally, we further evaluated the initial model          in CRF that occurred during the ≈ 7 year time span between
 on new input data by feeding Fenland II free-living data along          Fenland I and II.
 with updated heart rate and anthropometrics to the model,                  Our proposed deep learning approach outperforms tradi-
 showing that it is able to adapt and monitor change over time.          tional non-exercise models, which are the state-of-the art in
 We evaluated the inference capabilities of the model in the             the field and rely on simple variables inputted to a linear
 difference (delta) between the current (Fenland I) and future           model. Importantly, our model is able to take week-level in-

                                                                                                                                            6/16

formation from each participant and combine it with various In this paper, we developed deep learning models utilising
anthropometrics and bio-markers such as the RHR, providing wearable data and other bio-markers to predict the gold stan-
a truly personalized approach for CRF inference generation. dard of fitness (VO2 max) and achieved strong performance
The approach we present here outperforms traditional non- compared to other traditional approaches. Cardio-respiratory
exercise models, which are considered state-of-the-art meth- fitness is a well-established predictor of metabolic disease and
ods for longitudinal monitoring and highlights the potential of mortality and our premise is that modern wearables capture
wearable sensing technologies for digital health monitoring. non-standardised dynamic data which could improve fitness
An additional application of our work is the potential routine prediction. Our findings on a population of 11,059 partici-
estimation of VO2 max in clinical settings, given the strong pants showed that the combination of all modalities reached
association between estimated CRF levels and CVD health an r = 0.82, when compared to the ground truth in a holdout
outcomes21 . sample. Additionally, we show the adaptability and applicabil-
This study has several limitations worthy of recognition. ity of this approach for detecting fitness change over time in
First, the validity of the deep learning framework was assessed a longitudinal subsample (n = 2,675) who repeated measure-
by comparing estimated VO2 max values with VO2 max values ments after 7 years. Last, the latent representations that arise
derived from a submaximal exercise test. Ideally, one would from this model pave the way for fitness-aware monitoring
use VO2 max values directly measured during a maximal exer- and interventions at scale. It is often said that "If you cannot
cise test to establish the ground truth for cardiorespiratory fit- measure it, you cannot improve it". Cardio-fitness is such
ness comparisons. Maximal exercise tests, however, are prob- an important health marker, but until now we did not have
lematic when used in large population-based studies because the means to measure it at scale. Our findings could have
they may be unsafe for some participants and, consequently, significant implications for population health policies, finally
induce selection bias. The submaximal exercise test used in moving beyond weaker health proxies such as the BMI.
the Fenland Study was well tolerated by study participants
and demonstrated acceptable validity against direct VO2max Methods
measurements5 . Submaximal tests are also utilized to validate Study description
popular wearable devices such as the Apple Watch22 . We are The Fenland study is a population-based cohort study de-
therefore confident that VO2 max values estimated from the signed to investigate the independent and interacting effects
deep learning framework reflect true cardiorespiratory fitness of environmental, lifestyle and genetic influences on the de-
levels. velopment of obesity, type 2 diabetes and related metabolic
To investigate this limitation, we validate our models with disorders. Exclusion criteria included prevalent diabetes, preg-
181 participants from the UK Biobank Validation Study nancy or lactation, inability to walk unaided, psychosis or
(BBVS) who were recruited from the Fenland study. These terminal illness (life expectancy of ≤ 1 year at the time of
participants completed an independent maximal exercise test, recruitment) .
where VO2 max was directly measured. Taking into account The Fenland study has two distinct phases. Phase I, during
that the BBVS cohort is less fit (VO2 max=32.9±7) com- which baseline data was collected from 12,435 participants,
pared to the cohort the model was trained on (Fenland I, took place between 2005 and 2015. Phase II was launched
VO2 max=39.5±5), we observe that the model over-predicts in 2014 and involved repeating the measurements collected
with a mean prediction of VO2 max=39.9, RMSE=8.998. This during Phase I, alongside the collection of new measures. All
is expected because the range of VO2 max seen during training participants who had consented to being re-contacted after
did not include participants with VO2 max below 25. Even their Phase I assessment were invited to participate in Phase
when looking at women in isolation -who perform lower than II. At least 4 years must have elapsed between visits. As a
men in these tests-, they had a VO2 max=37.4±4.7 in Fenland result of this stipulation, recruitment to Phase II is ongoing.
I (see Table 1), which is still significantly higher than the aver- A flowchart of the analytical sample by each one of the study
age participant of BBVS. This is a common distribution/label tasks is provided in Figure 6.
shift issue where the model encounters an outcome which is After a baseline clinic visit, participants were asked to wear
out of its training data range and is still an open problem in a combined heart rate and movement chest sensor Actiheart,
statistical modelling23 . To partially mitigate this issue, we CamNtech, Cambridgeshire, UK) for 6 complete days. For
match BBVS’s fitness to have similar statistics to the training this study, data from 11,059 participants was included after
set of original model (Fenland I: VO2 max=39±5, matched excluding participants with insufficient or corrupt data or miss-
BBVS: VO2 max=39±4) and observe an RMSE=5.19 (see ing covariates as shown in Figure 6. A subset of 2,675 of the
Figure 8). This result is still not on par with our main results study participants returned for the second phase of the study,
in Fenland but shows the impact of including very low-fitness after a median (interquartile range) of 7 (5-8) years, and un-
participants in this validation study. To improve future CRF derwent a similar set of tests and protocols, including wearing
models, we believe that population-scale studies should focus the combined heart rate and movement sensing for 6 days.
on including low-fitness participants along with the general All participants provided written informed consent and the
population. study was approved by the University of Cambridge, NRES

7/16

Committee - East of England Cambridge Central committee. non-wear periods were excluded from the analyses. This al-
All experiments and data collected were done in accordance gorithm detected extended periods of non-physiological heart
with the declaration of Helsinki. rate concomitantly with extended (> 90 minutes) periods that
also registered no movement through the device’s accelerom-
Study procedure eter. We converted movement these intensities into standard
Participants wore the Actiheart wearable ECG which mea- metabolic equivalent units (METs), through the conversion 1
sured heart rate and movement recording at 60-second inter- MET = 71 J/min/kg ( 3.5 ml O2 · min−1 · kg−1 ). These con-
vals24 . The Actiheart device was attached to the chest at the versions where then used to determine intensity levels with
base of the sternum by two standard ECG electrodes. Par- ≤ 1.5METs classified as sedentary behaviour, activities be-
ticipants were told to wear the monitor continuously for 6 tween 3 and 6 METs were classified as moderate to vigorous
complete days and were advised that these were waterproof physical activity (MVPA) and those > 6METs were classified
and could be worn during showering, sleeping or exercis- as vigorous physical activity (VPA).
ing. During a lab visit, all participants performed a treadmill Since the season can have a big impact on physical activity
test that was used to establish their individual response to considering on how it affects workouts, sleeping patterns, and
a submaximal test, informing their VO2 max (maximum rate commuting patterns, we encoded the sensor timestamps using
of oxygen consumption measured during incremental exer- cyclical temporal features T f . Here we encoded the month of
cise)25 . RHR was measured with the participant in a supine the year as (x, y) coordinates on a circle:
position using the Actiheart device. HR was recorded for 15 2∗π ∗t 2∗π ∗t
minutes and RHR was calculated as the mean heart rate mea- T f1 = sin (1) T f2 = cos (2)
max(t) max(t)
sured during the last 3 minutes. Our RHR is a combination where t is the relevant temporal feature (month). The
of the Sleeping HR measured by the ECG over the free-living intuition behind this encoding is that the model will "see"
phase and the RHR as described above. that e.g. December (12th) and January (1st) are 1 month
apart (not 11). Considering that the month might change
Cardiorespiratory fitness assessment over the course of the week, we use the month of the first
VO2 max was predicted in study participants using a previously time-step only. Additionally, we extracted summary statistics
validated submaximal treadmill test 5 . Participants exercised from the following sensor time-series: raw acceleration, HR,
while treadmill grade and speed were progressively increased HRV, Aceleration-derived Euclidean Norm Minus One, and
across several stages of level walking, inclined walking, and Acceleration-derived Metabolic Equivalents of Task. Then,
level running. The test was terminated if one of the following for every time-series we extracted the following variables
criteria were met: 1) the participant wanted to stop, 2) the which cover a diverse set of attributes of their distributions:
participant reached 90% of age-predicted maximal heart rate mean, minimum, maximum, standard deviation, percentiles
(208-0.7*age) 15 , 3) the participants exercised at or above 80% (25%, 50%, 75%), and the slope of a linear regression fit. The
of age-predicted maximal heart rate for 2 minutes. Details rest of variables (anthropometrics and RHR) are used as a
about the fitness characteristics of the cohort and the validation single measurement.
of the submaximal test are provided elsewhere 26 . In total, we derived a comprehensive set of 68 features
To further validate the models trained on submaximal using the Python libraries Pandas and Numpy. A detailed
VO2 max, we employ the external cohort UK Biobank Val- view of the variables is provided in Table 4.
idation Study (BBVS)26 . We recruited 105 female (mean age:
54.3y±7.3) and 86 male (mean age: 55.0y±6.5) validation Deep learning models
study participants and VO2 max was directly measured during We developed deep neural network models that are able to
an independent maximal exercise test, which was completed capture non-linear relationships between the input data and
to exhaustion. Some maximal exercise test data were excluded the respective outcomes. Considering the high-sampling rate
because certain direct measurements were anomalous due to of the sensors (1 sample/minute) after aligning HR and Ac-
testing conditions (N=10). BBVS participants completed the celeration modalities, it is impossible to learn patterns with
same free-living protocol as in Fenland and we collected sim- such long dependencies (a week of sensor data includes more
ilar sensor and antrhopometrics data which were processed than 10,000 timesteps). Even the most well-tuned recurrent
with the same way as in Fenland (see next section). neural networks cannot cope with such sequences and given
the size of the training set (7,545 samples), the best option
Free-living wearable sensor data processing was to extract statistical features from the sensors and repre-
Participants were excluded from this analysis if they had less sent every participant-week as a row in a feature vector (see
than 72 hours of concurrent wear data (three full days of Fig. 1). This feature vector was fed to fully connected neural
recording) or insufficient individual calibration data (tread- network layers which were trained with backpropagation.
mill test-based data). All heart rate data collected during free- Data preparation. For Task 1 (see Figure 6), we matched
living conditions underwent pre-processing for noise filtering. the sensor data with the participants who had eligible lab tests.
Non-wear detection procedures were applied and any of those Then we split into disjoint train and test sets, making sure that

8/16

0.20            of predictions
                               (FIsensors FIIsensors)
Kernel Density

                                                                                               (VO2maxcurrent VO2maxfuture)
                 0.15          True                                                                                           10

                 0.10                                                                                                          0

                 0.05                                                                                                         10
                 0.00                                                                                                                10           0             10
                         15        10          5     0           5                    10      15                                             of predictions
                                         (VO2maxcurrent VO2maxfuture)                                                                    (FIsensors FIIsensors)
                                         (a)                                                                                                    (b)

                                                     Predicted                    0.08                                                                     Predicted
             0.08                                    True                                                                                                  True
Kernel Density

                                                                     Kernel Density

             0.06                                                                 0.06
             0.04                                                                 0.04
             0.02                                                                 0.02
             0.00                                                                 0.00
                    20        30         40         50                                   20                  30                          40           50          60
                                   VO2maxcurrent                                                                                   VO2maxfuture
                                   (c)                                                                                             (d)

   Figure 4. Assessing the robustness of the model to pick up change using new sensor data from Fenland II repeats. By
   matching the populations who provided sensor data for both cohorts (N=2,042) we passed them through our trained model
   from Task 1 to predict VO2max. (a-b) We then calculated the difference (∆) of the predictions juxtaposed with the true
   difference of fitness over the years. Distribution of ∆ of predicted and true VO2max. Correlation of ∆ of predicted and true
   VO2max (r = 0.57, p < 0.005). The gray line denotes a linear regression fit. Transparency has been applied to the datapoints
   to combat crowding. (c-d) Comparison of predicted and true VO2max using FI and FII covariates (sensors, RHR, anthro.),
   respectively. The distribution plots combine a kernel density estimate and histogram with bin size determined automatically
   with a reference rule.

                                                                                                                                                                       9/16

participants from Fenland I go to the train set, while those models; the model which was trained in Task 1 is used in
from Fenland II go to the test set (see Figure 5). This would inference mode (prediction).
allow to re-use the trained model from Task 1, with different
sensor data from Fenland II participants. Intuitively, we train Prediction equations
a model on the big population, and we evaluate it with two For reference, we compare our models results to traditional
snapshots of another longitudinal population over time (Task non-model equations, which rely on Body Mass, RHR, and
1 & 3). After splitting, we normalize the training data by Age. We incorporate the popular equation proposed by Uth
applying standard scaling (removing the mean and scaling and colleagues14 , which corresponds to VO2max = 15.0
to unit variance) and then denoise it by applying Principal (ml ∗ minˆ − 1) * Body Mass (kg) * (HRmax / HRrest),
Components Analysis (PCA), retaining the components that in combination with Tanaka’s equation15 where HRmax =
explain 99.99% of the variance. In practice, the original 68 208 − 0.7 ∗ age. Other approaches rely on measurements such
features are reduced to 48. We save the fitted PCA projection as the waist circumference, which however was not recorded
and scaler and we apply them individually to the test-set, to in our cohorts.
avoid information leakage across the sets. The same projec-
tion and scaler are applied to all downstream models (Task 2 Evaluation
and 3) to leverage the knowledge of the big cohort (Fenland To evaluate the performance of the deep learning models
I). which predict continuous values, we computed the root mean
q
Model architecture and training. The main neural network 1 N 2
squared error (RMSE) = |Ntest | ∑y∈Dtest ∑t=1 (yt − ŷt ) , the
(used in Task 1) receives a 2D vector of [users, features] and N (y −ŷ )2
∑t=1 t t
predicts a real value. For this work, we assume N users and coefficient of determination (R2 ) = 1 − N (y −ȳ )2 , and the
∑t=1 t t
F features of an input vector X = (x1 ,...,xN ) ∈ RN×F and a Pearson correlation coefficient. Here y and ŷ are the measured
target VO2 max y = (y1 ,...,yN ) ∈ RN . The network consists and predicted VO2 max and ȳ is the mean. For the binary
of two densely-connected feed forward layers with 128 units models, we used the Area under the Receiver Operator Char-
each. A dense layer works as follows: out put = activation acteristic (AUROC or AUC) which evaluates the probability
(input · kernel + bias), where activation is the element-wise of a randomly selected positive sample to be ranked higher
activation function (the exponential linear unit in our case) than a randomly selected negative sample.
, kernel is a learned weights matrix with a Glorot uniform
initialization, and bias is a learned bias vector. Each layer is Visualizing the latent space
followed by a batch normalization operation, which maintains The activations of the trained model allow us to understand
the mean output close to 0 and the output standard deviation the inner workings of the network and explore its latent space.
close to 1. Also, dropout of 0.3 probability is applied to every We first pass the test-set of Task 1 through the trained model
layer, which randomly sets input units to 0 and helps prevent and retrieve the activations of the penultimate layer27 . This is
overfitting. Last, the final layer is a single-unit dense layer and a 2D vector of [2675, 128] size, considering that the layer size
the network is trained with the Adam optimizer to minimize is 128 and the participants of the test-set are 2675. Intuitively,
the Mean Squared Error (MSE) loss, which is appropriate for every participant corresponds to a 128-dimensional point. In
continuous outcomes. We use a random 10% subset of the order to visualize this embedding, we apply tSNE28 , an algo-
train-set as a validation set. To combat overfitting, we train rithm for dimensionality reduction. For its optimization, we
for 300 epochs with a batch size of 32 and we perform early use a perplexity of 50, as it was suggested recently29 .
stopping when the validation loss stops improving after 15
epochs and the learning rate is reduced by 0.1 every 5 epochs. Statistical analyses
All hyperparameters (# layers, # units, dropout rates, batch We performed a number of sensitivity analyses to investi-
size, activations, and early stopping) were found after tuning gate potential sources of bias in our results. Full results of
on the validation set. these sensitivity analyses are shown in the main text and cor-
Model differences across tasks. Task 1 trains the main neu- responding Tables. In particular, we use bootstraping with
ral network of our study (see previous subsection). Task 2 replacement (1000 samples) to calculate 95% Confidence
re-trains an identical model to predict VO2 max in the future Intervals when we report the performance of the models in
(and the delta present-future). However, when we re-frame the hold-out sets. Wherever we report p-values, we use the
this problem as a classification task (see Figure 3), we use recently proposed strict threshold of p < 0.005.
significantly fewer participants when we focus on the tails
of the change distribution. Therefore, to combat overfitting,
we train a smaller network with one Dense layer of 128 units
Data availability
and a sigmoid output unit, which is appropriate for binary All Fenland data used in our analyses is available from the
problems. Instead of optimizing the MSE, we now minimize MRC Epidemiology Unit at the University of Cambridge
the binary cross-entropy. In all other cases –such as in Task upon reasonable request (http://www.mrc-epid.cam.
3 or when visualizing the latent space–, we do not train new ac.uk/research/studies/fenland/)

10/16

Code availability                                                   8. Passler, S., Bohrer, J., Blöchinger, L. & Senner, V. Valid-
                                                                       ity of wrist-worn activity trackers for estimating vo2max
The source code of this work will be available on Github upon
acceptance.                                                            and energy expenditure. Int. journal environmental re-
                                                                       search public health 16, 3037 (2019).
Acknowledgements                                                    9. Cooper, K. D. & Shafer, A. B. Validity and reliability
                                                                       of the polar a300’s fitness test feature to predict vo2max.
DS is supported by the Embiricos Trust Scholarship of Jesus            Int. journal exercise science 12, 393 (2019).
College Cambridge and the Engineering and Physical Sci-
ences Research Council through grant DTP (EP/N509620/1).           10. Altini, M., Casale, P., Penders, J. & Amft, O. Cardiores-
IP-P is supported by GlaxoSmithKline and Engineering and               piratory fitness estimation in free-living using wearable
Physical Sciences Research Council through an iCase fellow-            sensors. Artif. intelligence medicine 68, 37–46 (2016).
ship (17100053).                                                   11. Richardson, K., Levett, D., Jack, S. & Grocott, M. Fit
                                                                       for surgery? perspectives on preoperative exercise testing
Author contributions statement                                         and training. BJA: Br. J. Anaesth. 119, i34–i43 (2017).
DS, IP, SB, NW, CM designed the study. SB, NW collected            12. Leon-Ferre, R., Ruddy, K. J., Staff, N. P. & Loprinzi,
the data on behalf of the Fenland Study. DS, IP sourced, se-           C. L. Fit for chemo: Nerves may thank you. JNCI: J.
lected, and pre-processed the data. DS developed the models            Natl. Cancer Inst. 109 (2017).
and produced the figures. DS, IP performed the statistical         13. Lindsay, T. et al. Descriptive epidemiology of physical ac-
analysis. IP, DS wrote the first draft of the manuscript. All          tivity energy expenditure in uk adults (the fenland study).
authors (DS, IP, TG, SB, NW, CM) critically reviewed, con-             Int. J. Behav. Nutr. Phys. Activity 16, 1–13 (2019).
tributed to the preparation of the manuscript, and approved        14. Uth, N., Sørensen, H., Overgaard, K. & Pedersen, P. K.
the final version. All authors vouch for the data, analyses, and       Estimation of vo2max from the ratio between hrmax and
interpretations.                                                       hrrest–the heart rate ratio method. Eur. journal applied
                                                                       physiology 91, 111–115 (2004).
Competing interests
                                                                   15. Tanaka, H., Monahan, K. D. & Seals, D. R. Age-predicted
The authors declare that they have no competing financial              maximal heart rate revisited. J. american college cardiol-
interests.                                                             ogy 37, 153–156 (2001).
                                                                   16. Gaspar, H. A. & Breen, G. Probabilistic ancestry maps:
References                                                             a method to assess and visualize population substructures
 1. Mandsager, K. et al. Association of cardiorespiratory              in genetics. BMC bioinformatics 20, 1–11 (2019).
    fitness with long-term mortality among adults undergo-         17. Schmid, D. & Leitzmann, M. Cardiorespiratory fitness
    ing exercise treadmill testing. JAMA network open 1,               as predictor of cancer mortality: a systematic review and
    e183605–e183605 (2018).                                            meta-analysis. Annals oncology 26, 272–278 (2015).
 2. Swain, D. P., Brawner, C. A., of Sports Medicine,              18. Schuch, F. B. et al. Are lower levels of cardiorespiratory
    A. C. et al. ACSM’s resource manual for guidelines                 fitness associated with incident depression? a systematic
    for exercise testing and prescription (Wolters Kluwer              review of prospective cohort studies. Prev. Medicine 93,
    Health/Lippincott Williams & Wilkins, 2014).                       159–165 (2016).
 3. Cao, Z.-B. et al. Predicting v˙ o2max with an objectively      19. Laukkanen, J. A., Kurl, S., Salonen, R., Rauramaa, R. &
    measured physical activity in japanese women. Medicine             Salonen, J. T. The predictive value of cardiorespiratory
    & Sci. Sports & Exerc. 42, 179–186 (2010).                         fitness for cardiovascular events in men with various risk
 4. Gonzales, T. I. et al. Resting heart rate as a biomarker for       profiles: a prospective population-based cohort study.
    tracking change in cardiorespiratory fitness of uk adults:         Eur. heart journal 25, 1428–1437 (2004).
    the fenland study. medRxiv (2020).                             20. Jakicic, J. M. et al. Four-year change in cardiorespiratory
 5. Gonzales, T. I. et al. Estimating maximal oxygen con-              fitness and influence on glycemic control in adults with
    sumption from heart rate response to submaximal ramped             type 2 diabetes in a randomized trial: the look ahead trial.
    treadmill test. medRxiv (2020).                                    Diabetes care 36, 1297–1303 (2013).
 6. Nes, B. M. et al. Estimating v˙ o2peak from a nonexercise      21. Qui, S., Cai, X., Sun, Z., Wu, T. & Schumann, U. Is esti-
    prediction model: the hunt study, norway. Medicine &               mated cardiorespiratory fitness an effective predictor for
    Sci. Sports & Exerc. 43, 2024–2030 (2011).                         cardiovascular and all-cause mortality? a meta-analysis.
 7. Shcherbina, A. et al. Accuracy in wrist-worn, sensor-              Atherosclerosis 330, 22–28 (2021).
    based measurements of heart rate and energy expenditure        22. Apple. Using apple watch to estimate cardio fitness
    in a diverse cohort. J. personalized medicine 7, 3 (2017).         with vo2 max (2021). [Online; posted 2021, note =

                                                                                                                            11/16

https://www.apple.com/healthcare/docs/site/Using_
    Apple_Watch_to_Estimate_Cardio_Fitness_with_VO2_
    max.pdf].
23. Lipton, Z., Wang, Y.-X. & Smola, A. Detecting and
    correcting for label shift with black box predictors. In In-
    ternational conference on machine learning, 3122–3130
    (PMLR, 2018).
24. Brage, S., Brage, N., Franks, P. W., Ekelund, U. & Ware-
    ham, N. J. Reliability and validity of the combined heart
    rate and movement sensor actiheart. Eur. journal clinical
    nutrition 59, 561–570 (2005).
25. Brage, S. et al. Hierarchy of individual calibration lev-
    els for heart rate and accelerometry to measure physical
    activity. J. Appl. Physiol. 103, 682–692 (2007).
26. Gonzales, T. I. et al. Descriptive epidemiology of car-
    diorespiratory fitness in uk adults: The fenland study.
    medRxiv (2022).
27. Yosinski, J., Clune, J., Nguyen, A., Fuchs, T. & Lipson,
    H. Understanding neural networks through deep visual-
    ization. arXiv preprint arXiv:1506.06579 (2015).
28. Van der Maaten, L. & Hinton, G. Visualizing data using
    t-sne. J. machine learning research 9 (2008).
29. Wattenberg, M., Viégas, F. & Johnson, I. How to use
    t-sne effectively. Distill 1, e2 (2016).
30. Faurholt-Jepsen, M., Brage, S., Kessing, L. V. &
    Munkholm, K. State-related differences in heart rate
    variability in bipolar disorder. J. psychiatric research 84,
    169–173 (2017).
31. White, T., Westgate, K., Wareham, N. J. & Brage, S.
    Estimation of physical activity energy expenditure during
    free-living from wrist accelerometry in uk adults. PLoS
    One 11, e0167472 (2016).

Supplementary appendix

                                                                   12/16

400              Training set
                                           Test set
                  Frequency

                          300
                          200
                          100
                               0
                                   20    25          30       35       40           45       50           55
                                                               VO2max
Figure 5. Distribution of VO2max in the training and test sets in Fenland I cohort. Both sets display similar ranges of
values, making sure that inferences based on the test set are robust. This plot refers to Task’s 1 train and test sets.

Table 3. Evaluation of predicting fine-grained VO2max in the present and the future with the Fenland II repeats
cohort using covariates of Fenland I. Neural network results. (*the Delta outcome is in a different unit and hence a direct
comparison with raw VO2max results might not apply)

                                                          Evaluation Metrics [95% CI]                      N (train+val / test set)
               Outcomes
                                                R2                   Corr                  RMSE
 Wearable Sensors + RHR + Anthro.
  Current VO2max                        0.652 [0.606-0.695]   0.815 [0.783-0.846]    2.959 [2.742-3.201]           2675
                                                                                                                (2140/535)
   Future VO2max                        0.499 [0.431-0.55]    0.721 [0.67-0.759]     3.673 [3.421-3.916]

   Delta (Current - Future)*            0.081 [0.02-0.078]    0.233 [0.159-0.307]    3.175 [2.923-3.41]

                                                                                                                                 13/16

Table 4. Description of the features/variables used in our analysis as inputs to the models. The features with asterisks(*)
are time-series and therefore we have extracted the following statistical variables: mean, minimum, maximum, standard
deviation, percentiles (25%, 50%, 75%), and the slope of a linear regression fit. The final set of features is 68.

 Features/Variables                                    Description
 Sensors
  Acceleration*                                        Acceleration measured in mg

   Heart rate (HR)*                                    Mean HR resampled in 15sec intervals, measured in BPM

                                                       HRV calculated by differencing
   Heart Rate Variability (HRV)*                       the second-shortest and the second-longest inter-beat interval
                                                       (as seen in30 ), measured in ms

   Acceleration-derived
                                                       ENMO-like variable (Acceleration/0.0060321) + 0.057) (as seen in31 )
   Euclidean Norm Minus One (ENMO)*

  Acceleration-derived
  Metabolic Equivalents of Task (METs)*
                            Sedentary*                 If Accelerometer = 1, take daily count and average
                            Vigorous*                  If Accelerometer >= 4.15, take daily count and average
 Anthropometrics
  Age                                                  Age, measured in years

   Sex                                                 Sex is binary (female/male)

   Weight                                              Weight, measured in kilograms

   Height                                              Height, measured in meters.centimeters

  Body Mass Index (BMI)                                BMI is calculated by Weight/(Heightˆ2), measured in kg/mˆ2
 Resting Heart Rate
                                                       RHR is calculated by averaging the 4th, 5th, and 6th minute
   Wearable-derived RHR                                of the baseline visit and adding to that the Sleeping Heart Rate
                                                       that has been inferred by the wearable device.4
 Seasonality
                                                       The month number is used along with a coordinate encoding that
   Month of year
                                                       allows the models to make sense of their cyclical sequence.

                                                                                                                          14/16

Figure 6. Flowchart of the analytical sample and the training/testing splits across the three tasks. The first task trains a
model to predict fitness using the large cohort (Fenland I). The second task is using the smaller cohort of repeats in Fenland I
(called Fenland II) and trains further models to predict fitness now and in the future (and their delta). The third task evaluates
the original model trained in Task 1 by feeding new sensor data (Fenland II sensors and anthropometrics) to assess the
adaptability of the model to pick up change. (*Training set is 90% of the 80% remaining dataset after splitting to testing set.
Validation set is 10% of the training set)

                                                                                                                            15/16

Original data                                                                            Learned latent space
                                                                  55                                                                                            55
 tSNE Dimension 2

                                                                                                 tSNE Dimension 2
                                                                  50                                                                                            50

                                                                     VO2maxcurrent

                                                                                                                                                                   VO2maxcurrent
                                                                  45                                                                                            45
                                                                  40                                                                                            40
                                                                  35                                                                                            35
                                                                  30                                                                                            30
                                                                  25                                                                                            25
                                 tSNE Dimension 1                                                                            tSNE Dimension 1
                                            (a)                                                                                         (b)

                                   Original data                                                                            Learned latent space
                                                                  100                                                                                           100

                                                                         weekHeart Rate (bpm)

                                                                                                                                                                      weekHeart Rate (bpm)
tSNE Dimension 2

                                                                                                tSNE Dimension 2
                                                                  90                                                                                            90
                                                                  80                                                                                            80
                                                                  70                                                                                            70
                                                                  60                                                                                            60
                                                                  50                                                                                            50
                                 tSNE Dimension 1                                                                            tSNE Dimension 1
                                            (c)                                                                                         (d)

   Figure 7. t-distributed stochastic neighbor embedding (tSNE) projection of the original feature vector (Fenland I
   testing set, Sensors + RHR + Anthro.) compared to the model’s latent space after training. (a-b) The original data
   presents some clusters but the outcome is not clearly linearly separable. The model activations on the penultimate layer of the
   neural network capture the continuum of low-high VO2 max both locally and globally. (c-d) A similar assessment to VO2 max is
   presented by coloring with the mean HR of the week of each participant. In the learned space, participants with low HR (high
   fitness) are placed in the same clusters as in VO2 max, unlike the original space. In all plots a 50% transparency has been
   applied to combat crowding and the colorbar is centered on the median value to illustrate extreme cases. The VO2 max label is
   used only for color-coding purposes (the projection is label-agnostic). Every participant is a dot.

                                                              Predicted                                                                                   Predicted
                    0.100                                     True                                                  0.100                                 True
     Kernel Density

                                                                                                     Kernel Density

                    0.075                                                                                           0.075
                    0.050                                                                                           0.050
                    0.025                                                                                           0.025
                    0.000                                                                                           0.000
                            10      20      30      40   50         60                                                        30   35    40     45   50    55        60
                                              VO2max                                                                                      VO2max
                                            (a)                                                                                         (b)

   Figure 8. External validation of Fenland I model with maximal (peak exercise) test data using the BBVS cohort. (a)
   Distribution of the predicted vs the true VO2max (RMSE=8.998) using all participants (N=181). (b) Distribution of the
   predicted vs the true VO2max (RMSE=5.19) by matching BBVS to have similar VO2max (mean±std) to the training set of
   Fenland I, using a subset of participants (N=82). Please see the main text for interpretation of these results.

                                                                                                                                                                16/16

You can also read