Longitudinal cardio-respiratory fitness prediction through free-living wearable sensors
Dimitris Spathis (1,*,+), Ignacio Perez-Pozuelo (2,+), Tomas I. Gonzales (2), Soren Brage (2), Nicholas Wareham (2), and Cecilia Mascolo (1)

(1) Department of Computer Science and Technology, University of Cambridge, UK
(2) MRC Epidemiology Unit, School of Clinical Medicine, University of Cambridge, UK
(*) Corresponding author: ds806@cam.ac.uk
(+) These authors contributed equally to this work

arXiv:2205.03116v1 [cs.LG] 6 May 2022

SUMMARY

Background
Cardiorespiratory fitness is an established predictor of metabolic disease and mortality. Fitness is directly measured as maximal oxygen consumption (VO2max), or indirectly assessed using heart rate response to a standard exercise test. However, such testing is costly and burdensome, limiting its utility and scalability. Fitness can also be approximated using resting heart rate and self-reported exercise habits but with lower accuracy. Modern wearables capture dynamic heart rate data which, in combination with machine learning models that find latent patterns in high-dimensional sensor signals, could improve fitness prediction.

Methods
In this work, we analyze movement and heart rate signals from wearable sensors in free-living conditions from 11,059 participants who also underwent a standard exercise test, along with a longitudinal repeat cohort of 2,675 participants. We design algorithms and models that convert raw sensor data into cardio-respiratory fitness estimates, and validate these estimates' ability to capture fitness profiles in a longitudinal cohort over time while subjects engaged in real-world (non-exercise) behavior. Additionally, we validate our methods with a third external cohort of 181 participants who underwent maximal VO2max testing, which is considered the gold standard measurement because it requires reaching one's maximum heart rate and exhaustion level.
Findings
Our results show that the developed models yield high correlation (r = 0.82, 95% CI 0.80-0.83) when compared to the ground truth in a holdout sample. These models outperform conventional non-exercise fitness models and traditional bio-markers using measurements of normal daily living, without the need for a specific exercise test. Additionally, we show the adaptability and applicability of this approach for detecting fitness change over time in the longitudinal subsample who repeated measurements after 7 years.

Interpretation
These results demonstrate the value of wearables for the longitudinal assessment of fitness that today can be measured only with laboratory tests.

Funding
Partly funded by the University of Cambridge, Jesus College Cambridge, EPSRC, and GSK i-CASE scholarships.

Introduction

Cardiorespiratory fitness (CRF) is one of the strongest known predictors of cardiovascular disease (CVD) risk and is inversely associated with many other health outcomes [1]. CRF is also a potentially stronger predictor of CVD outcomes when compared to other risk factors like hypertension, type 2 diabetes, high cholesterol, and smoking. Despite its prognostic value, routine CRF assessment remains uncommon in clinical settings because maximal oxygen consumption (VO2max), the gold-standard measure of CRF, is challenging to measure directly. A computerised gas analysis system is needed to monitor ventilation and expired gas fractions during exhaustive exercise on a treadmill or cycle ergometer. Additional equipment may be needed to monitor other biosignals, such as heart rate (HR). This equipment requires trained research personnel to operate, and an attending physician is a requisite for exercise testing in some scenarios. Several criteria must also be achieved to verify that exhaustion has been reached, including leveling-off of VO2, achieving a percentage of age-predicted maximal HR, and surpassing a peak respiratory exchange ratio threshold [2]. The costs of VO2 measurement and the risks of exhaustive exercise not only limit direct CRF assessment in clinical settings, but also restrict research on CRF at the population level. Thus, our understanding of differences in CRF within populations, across geographic regions, and over time is lacking.

Non-exercise prediction models of VO2max are an alternative to exercise testing in clinical settings. These models are usually regression based and incorporate variables like sex, age, body mass index (BMI), resting heart rate (RHR), and self-reported physical activity [3]. We have recently shown that RHR alone can be used to estimate VO2max [4]; however, the validity of estimates from this approach is considerably lower than that achieved with exercise testing [5, 6]. Wearable devices such as activity trackers and smartwatches can monitor not only RHR and physical activity but other biosignals in free-living conditions, potentially enabling more precise estimation of VO2max without exercise testing. Recent attempts to use wearable devices to estimate VO2max are difficult to
Research in Context

Evidence before this study
Large-scale assessment of physical activity and heart rate data could enable training robust models that anticipate and predict fine-grained changes of fitness over time. Current fitness estimation models, however, have limitations. We searched the PubMed database for publications up to April 27, 2022, for studies about VO2max estimation through wearables, using the search criteria "Vo2max AND wearable" and "Vo2max AND machine learning". We should note that the intersection of the three terms "Vo2max AND machine learning AND wearable" yielded only one result. We systematically screened all 34 results (21, 12, 1 from each respective search) by reviewing abstracts, and identified five original research studies which were relevant to the study aims. These studies were either small-scale assessments of VO2max or required data collected from non-free-living environments. We discuss these studies in further detail in the respective sections of this paper.

Added value of this study
In this study, deep learning models were developed for remote and scalable estimation of cardio-respiratory fitness through wearable devices. Our study expands on previous studies by being the first large-scale study integrating a large cohort, a longitudinal validation cohort, and a third gold standard cohort. The deep learning model takes statistical features extracted from the raw sensor timeseries covering movement and heart rate attributes, and returns fine-grained predictions of VO2max, either in the present or the future. We independently validated the models on several holdout test sets (different participants from the set used when training the AI model). Our results show that using passive sensor data from everyday life can improve fitness prediction significantly, opening the door to population-scale assessment of fitness, moving beyond traditional predictors of individuals' health such as weight or BMI.
Implications of all the available evidence
The performance of our deep learning-based sensor analysis model suggests that it can potentially be used for fitness estimation with similar accuracy to a laboratory-measured test. In the future, wearables could include additional sensors which could further improve fitness prediction, such as blood pressure, skin conductance, and oximetry. Future prospective studies will help to identify the impact of personalized fitness predictions by suggesting interventions on individuals through lifestyle, nutrition, and exercise recommendations.

externally evaluate, however, because their estimation methods tend to be non-transparent [7] and lack scientific validation [7, 8]. Although certain wearable devices show promise, they tend to rely on detailed physical activity intensity measurements and GPS-based speed monitoring, and require users to reach near-maximal HR values, which limits their use to fitter individuals [9]. Some studies attempt to estimate VO2max from data collected during free-living conditions, but these are typically from small-scale cohorts and use contextual data from treadmill activity, which restricts their application in population settings [10].

Here, we use data from the largest study of its kind, by over two orders of magnitude, and use purely free-living data to predict VO2max, with no requirement for context-awareness. This work substantially advances previous non-exercise models for predicting CRF by introducing an adaptive representation learning approach to physiological signals derived from wearable sensors in a large-scale population with free-living condition data. We employ a deep neural network model that utilizes densely-connected non-linear layers to learn personalized fitness representations. We demonstrate that these models yield better performance than traditional and state-of-the-art non-exercise models. We illustrate how these models can be used to predict the magnitude and direction of change in CRF. Moreover, we show that they can adapt in time given behavioural changes by showcasing strong performance in a subset of the same population who were retested seven years later. This has implications for the estimation of population fitness levels and lifestyle trends, including the identification of sub-populations or areas in particular need of intervention. Such models can improve both population-health and personalized medicine applications. For instance, the ability of the patient to undergo certain treatments like surgery [11] or chemotherapy [12] can be assessed through wearables before the procedure takes place and therefore reduce postoperative complications.

Results

Baseline measurements were collected from 12,435 healthy adults from the Fenland study in the United Kingdom [13], where all required data for the present analysis were available in 11,059 participants (Fenland I, baseline timepoint referred to as "current" in our evaluation). A subset of 2,675 participants were assessed again after a median (interquartile range) of 7 (5-8) years (Fenland II, referred to as "future" in our evaluations). Descriptive characteristics of the two analysis samples are presented in Table 1. We present the characteristics of the longitudinal cohort in both temporal snapshots in Table 1 ("present" and "future"). Means and standard deviations for each characteristic are presented in this table. An overview of the study design and the three experimental tasks is provided in Figure 1.
[Figure 1 diagram. Study design: Fenland I (N = 11,059) and Fenland II (N = 2,675), about 7 years apart; a baseline visit on day 1 (anthropometrics, VO2max test, questionnaires) followed by a free-living phase on days 1-7 (about 6 days of movement and heart data from wearables). Modeling fitness: per-user features feed a deep neural network for three tasks: (1) fine-grained cardiorespiratory fitness prediction in the present, (2) inferring fitness direction and magnitude 7 years in the future, (3) validating adaptability and change with new sensor data.]

Figure 1. Study and experimental design. Fenland is a cohort study, including 11,059 participants with treadmill and wearable sensor data at baseline (Fenland I, 2005-2015) and a longitudinal subsample of 2,675 participants who were retested approximately 7 years later (Fenland II). During both phases participants underwent a variety of tests during the baseline clinic visit comprising anthropometric measurements, questionnaires, and a submaximal VO2max test. Following this baseline clinic visit, participants were fitted with a combined heart rate and movement sensing device which they wore during free-living conditions for approximately 6 days. In this work we develop deep learning models utilising wearable data and common biomarkers as input to predict VO2max and achieve strong performance, both in the present and when replicated at the second timepoint.

Table 1. Characteristics for the study analytical sample: The Fenland I and II studies. Data is in mean (std). Values with an asterisk (*) indicate that this variable comes from Fenland II sensor data, which is a smaller cohort (N=2071) due to data filtering (see Figure 6, Panel 3). The values in the FII (future) cohort correspond to the second assessment (7 years later). In Task 1, the training set is FI (present) and the testing set is FII (present) so as to make sure that they come from similar distributions.
Characteristic | Fenland I present, Men (n=5229) | Fenland I present, Women (n=5830) | Fenland II future, Men (n=1303) | Fenland II future, Women (n=1372) | Fenland II present, Men (n=1303) | Fenland II present, Women (n=1372)
Demographics
Age (years) | 47.70 (7.57) | 47.66 (7.36) | 54.11 (7.08) | 54.76 (6.81) | 47.19 (7.18) | 47.83 (6.96)
Anthropometrics
Height (m) | 1.78 (0.07) | 1.64 (0.06) | 1.77 (0.06) | 1.64 (0.06) | 1.77 (0.06) | 1.64 (0.06)
Body mass (kg) | 85.85 (13.83) | 70.54 (13.92) | 85.31 (13.59) | 69.58 (13.77) | 84.85 (13.14) | 69.04 (13.26)
BMI (kg/m2) | 27.16 (3.97) | 26.17 (4.97) | 27.03 (4.01) | 25.85 (4.94) | 27.00 (4.03) | 25.84 (4.98)
Physical activity
MVPA (min/day) | 35.87 (22.35) | 34.40 (22.59) | 34.92* (22.18*) | 35.35* (23.26*) | 34.41 (22.23) | 32.81 (21.45)
VPA (min/day) | 3.27 (8.57) | 3.31 (15.67) | 3.57* (8.78*) | 3.30* (7.52*) | 3.38 (9.30) | 3.86 (27.80)
Resting heart rate
RHR (bpm) | 61.48 (8.68) | 64.46 (8.28) | 59.63* (8.28*) | 62.21* (8.10*) | 61.06 (8.44) | 63.81 (8.20)
Cardiorespiratory fitness
VO2max (ml O2/min/kg) | 41.95 (4.61) | 37.44 (4.73) | 42.32 (4.68) | 37.93 (4.72) | 42.21 (4.42) | 37.84 (4.69)
Fine-grained fitness prediction from wearable sensors

We first developed and externally validated several non-exercise VO2max estimation models as a regression task using features commonly measured by wearable devices (anthropometry, resting heart rate (RHR), physical activity (PA); see Table 2). Here our goal was to explore how conventional non-exercise approaches to VO2max estimation could be enhanced by features from free-living PA data. We split participant data into independent training and test sets. The training set (n=8384, participants with baseline data only) was used for model development. The test set (n=2675, participants with baseline and follow-up data) was used to externally validate each model. Models using anthropometry or RHR alone had poor external validity, but validity improved when the two were combined in the same model. The best performance (R2 of 0.67) was attained using a deep neural network model combining wearable sensors, RHR, and anthropometric data (Figure 2). For reference, we compare these results to traditional non-model equations, which rely on body mass, RHR, and age. A popular equation (as proposed in [14, 15]) shows poor validity (R2 of -3.2 and correlation of 0.389), lower than using anthropometrics only in our setup (see Methods for details). This motivates the use of machine learning, which better captures covariate interactions.

Deep neural networks can learn feature representations that are suitable for clustering tasks, such as population stratification by implicit health status, but are difficult to reveal using linear dimension-reduction techniques [16]. We used t-distributed stochastic neighbor embedding (tSNE), a nonlinear dimension-reduction technique, to visualise learned feature representations from our model and their relationship to participant VO2max and HR levels (Figure 7). Clusters colored by VO2max and by HR levels were inversely related, and this structure was more apparent in the learned latent space than in the original observation space. For example, participants with higher VO2max were clustered similarly to those with lower HR levels, and vice versa.

Predicting magnitude and direction of fitness change in the future

The second group of tasks evaluated our model on the subset of participants who returned for Fenland II ≈ 7 years later (referred to as future in our evaluations). For these experiments we carried out three evaluations. Following the process described earlier, we re-trained a model to predict [...] fine-grained delta of VO2max, we formulated this problem as a classification task. A visual representation of this task can be found in Figure 3a. By inspecting the distribution of the difference (delta) of current-future VO2max on the training set, we split it into two halves at the 50% quantile and set these as prediction outcomes. The purpose of this task is to assess the direction of individual change of fitness. We report an area under the curve (AUC) of 0.61 in predicting the direction of change (N = 2675). We also focused on the tails of the change distribution, which indicate participants who had substantial and dramatic change in fitness over the period of time between Fenland I and Fenland II (≈ 7 years). In this case, the distribution was split into 80%/20% (substantial) and 90%/10% (dramatic) quantiles. The results from these experiments show that the models can distinguish between substantial fitness change with an AUC of 0.72 (N = 1068) and between dramatic fitness change with an AUC of 0.74 (N = 535). All AUC curves can be found in Figure 3b.

Enabling adaptive cardiorespiratory fitness inferences

For the final task, we assessed whether the trained models can pick up change using new sensor data from Fenland II, considering that getting new wearable data is relatively easy since these devices are becoming increasingly pervasive. The intuition behind this task is to evaluate the generalizability of the models over time. We first matched the populations that provided sensor data for both cohorts (N = 2,042) and applied the trained model from Task 1 in order to produce VO2max inferences. We then compared the predictions with the respective ground truth (current and future VO2max). The true and predictive distributions are shown in Figures 4c and 4d. Through this procedure, we found that the model achieves an r = 0.84 for VO2max future prediction and an r = 0.82 for VO2max current prediction (validating our Task 1 results). In other words, if we have access to wearable sensor data and other information from the future time, we can reuse the already trained model from Fenland I to accurately infer fitness with minimal loss of accuracy over time, even though this is new sensor data from a completely separate (future) week.

Last, we calculated the delta of the predictions and compared it to the actual delta of fitness over the years. This task showed that the models tend to focus mostly on positive change and under-predict when participants' fitness deteriorates over the years (Figures 4a, 4b). The overall correlation between the delta of the predictions with the ground truth is significant (r=0.57, p
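The quantile-based outcomes of the second task (the 50/50, 80/20 and 90/10 splits of the VO2max delta, scored with ROC AUC) can be illustrated with a minimal sketch. This is not the authors' code: the function names are hypothetical, the delta is taken as current minus future following the axis of Figure 3a, and dropping the participants between the two quantiles is an assumption, although it is consistent with the reported sample sizes (N = 1068 and N = 535 out of 2675). Which tail is treated as the positive class is also an assumption.

```python
def quantile(values, q):
    """Linear-interpolation quantile of a list, with q in [0, 1]."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def binarize_deltas(deltas, lower_q, upper_q):
    """Turn VO2max deltas into binary outcomes (hypothetical helper).

    Deltas at or below the lower quantile get label 0, deltas at or above the
    upper quantile get label 1, and anything in between is dropped (assumed).
    Returns (kept_indices, labels).
    """
    low = quantile(deltas, lower_q)
    high = quantile(deltas, upper_q)
    kept, labels = [], []
    for index, delta in enumerate(deltas):
        if delta <= low:
            kept.append(index)
            labels.append(0)
        elif delta >= high:
            kept.append(index)
            labels.append(1)
    return kept, labels


def roc_auc(labels, scores):
    """ROC AUC as the probability that a random positive outranks a random
    negative (ties count one half). Quadratic time, fine for a sketch."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

With `lower_q = upper_q = 0.5` every participant is kept and the task reduces to predicting the direction of change; with `(0.2, 0.8)` or `(0.1, 0.9)` only the tails are scored, which is why the evaluated sample shrinks for the substantial and dramatic outcomes.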
Figure 2. Fine-grained fitness prediction. Comparing the predicted and true VO2max coming from the best performing comprehensive model (Sensors + RHR + Anthro.) trained with Fenland I. (a) Distribution of predicted and true VO2max. The plot combines a kernel density estimate and histogram. (b) Correlation of predicted and true VO2max (r = 0.82, p < 0.005, see Table 2). The gray line denotes a linear regression fit. Transparency has been applied to the datapoints to combat crowding.

Table 2. Evaluation of predicting fine-grained VO2max with the Fenland I cohort. Comparison between traditional anthropometrics, common biomarkers (RHR), and passively collected data over a week (wearable sensors). The best performance is in the last row. VO2max is measured in ml O2/min/kg. Results are reported from the testing set. Evaluation metrics are given with [95% CI]. N (train+val / test set) = 11059 (8384/2675) for all rows.

Data modality | Features | R2 | Corr | RMSE
Anthropometrics | Age/Sex/Weight/BMI/Height | 0.362 [0.332-0.391] | 0.604 [0.579-0.627] | 4.043 [3.924-4.172]
Resting Heart Rate | RHR (Sensor-derived) | 0.374 [0.344-0.403] | 0.615 [0.589-0.639] | 4.007 [3.891-4.117]
Anthropometrics + RHR | Age/Sex/Weight/BMI/Height/RHR | 0.616 [0.588-0.641] | 0.785 [0.767-0.802] | 3.138 [3.031-3.237]
Wearable Sensors + RHR + Anthro. | Acceleration/HR/HRV/MVPA, Age/Sex/Weight/BMI/Height/RHR | 0.671 [0.649-0.692] | 0.822 [0.808-0.835] | 2.903 [2.801-3.003]
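For readers who want to reproduce the three metrics reported in Table 2, the following is a minimal sketch of R2, Pearson correlation and RMSE, together with a percentile bootstrap for an interval estimate. The text does not state how the 95% CIs were obtained, so the bootstrap here is an assumption, and all function names are illustrative. The sketch also shows why an R2 such as the -3.2 reported for the traditional equation is possible: R2 becomes negative whenever predictions are worse than always predicting the mean of the true values.

```python
import math
import random


def r_squared(y_true, y_pred):
    """Coefficient of determination; negative when worse than the mean."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the outcome."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def bootstrap_ci(metric, y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a metric (assumed CI method).

    Participants are resampled with replacement; intended for samples large
    enough that a resample never has constant y_true.
    """
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int(alpha / 2 * n_resamples)]
    upper = scores[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

A call such as `bootstrap_ci(rmse, y_true, y_pred)` returns a `(lower, upper)` pair that can be reported next to the point estimate in the same bracketed form used in Table 2.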
[Figure 3, two panels. (a) Histogram of Δ = VO2max current - VO2max future (x-axis) against frequency (y-axis), with the 10%, 20%, 50%, 80% and 90% quantiles marked. (b) ROC curves, sensitivity against 1 - specificity, with a chance line: AUC 50/50 = 0.61 [CI 0.56-0.66], AUC 20/80 = 0.72 [CI 0.64-0.78], AUC 10/90 = 0.74 [CI 0.65-0.85].]

Figure 3. Evaluation in predicting the magnitude and direction of the VO2max change between the present and the future. (a) Distribution of the ∆ of VO2max in the present and the future. The shaded areas represent different binary bins that are used as outcomes, increasingly focusing on the extremes of this distribution. (b) ROC AUC performance in predicting the three ∆ outcomes as shown on the left hand side. Brackets represent 95% CIs.

a deep learning framework for predicting CRF and changes in CRF over time. Our framework estimates VO2max by combining learned features from heart rate and accelerometer free-living data extracted from wearable sensors with anthropometric measures. To evaluate our framework's performance, VO2max estimates were compared with VO2max values derived from a submaximal exercise test [5]. Free-living and exercise test data were collected at a baseline investigation in 11,059 participants (Fenland I). A subset of those participants (n=2,675) completed another exercise test at a follow-up investigation approximately seven years later (Fenland II). This study design allowed us to address three questions: 1) Do baseline estimates of VO2max from the deep learning framework agree with VO2max values measured from exercise testing at baseline? 2) Can the framework learn features from heart rate and accelerometer free-living data collected at baseline that predict VO2max measured at follow-up? 3) Can the framework be used to predict the magnitude of change in VO2max from baseline to follow-up?

In the VO2max estimation tasks, our model demonstrated strong agreement with VO2max measured from the submaximal exercise test at baseline (Pearson's correlation coefficient (PCC): 0.82) as well as for the longitudinal, follow-up visit (PCC: 0.72). We were also able to distinguish between substantial and dramatic changes in CRF (AUCs 0.72 and 0.74, respectively). Finally, we further evaluated the initial model on new input data by feeding Fenland II free-living data along with updated heart rate and anthropometrics to the model, showing that it is able to adapt and monitor change over time. We evaluated the inference capabilities of the model in the difference (delta) between the current (Fenland I) and future (Fenland II) VO2max for those participants that came back approximately 7 years later. For this last task, the model produced outcomes that translated to a 0.57 correlation between the delta of predicted and delta of true VO2max.

The application of our work to other cohort and longitudinal studies is of particular importance as serial measurement of cardiorespiratory fitness has significant prognostic value in clinical practice. Small increases in fitness are associated with reduced cardiovascular disease mortality risk and better clinical outcomes in patients with heart failure and type 2 diabetes [20]. Nevertheless, routine measurement of fitness in clinical practice is rare due to the costs and risks of exercise testing. Non-exercise based regression models can be used to estimate changes in fitness in lieu of serial exercise testing. The extent to which changes in fitness detected with such models reflect true changes in exercise capacity is unclear, however. Here, we relied on the relationship between CRF and heart rate responses to different levels of physical activity at submaximal, real-life conditions captured through wearable sensors. Using deep learning techniques, we have developed a non-exercise based fitness estimation approach that can be used not only to accurately infer current VO2max, but also to do so in the settings of a future cohort, where the model did not require any retraining, just an influx of new data. Further, we show that the model can also be used to infer the changes in CRF that occurred during the ≈ 7 year time span between Fenland I and II.

Our proposed deep learning approach outperforms traditional non-exercise models, which are the state of the art in the field and rely on simple variables fed to a linear model. Importantly, our model is able to take week-level information from each participant and combine it with various anthropometrics and bio-markers such as the RHR, providing a truly personalized approach for CRF inference generation. The approach we present here outperforms traditional non-exercise models, which are considered state-of-the-art methods for longitudinal monitoring, and highlights the potential of wearable sensing technologies for digital health monitoring. An additional application of our work is the potential routine estimation of VO2max in clinical settings, given the strong association between estimated CRF levels and CVD health outcomes [21].

This study has several limitations worthy of recognition. First, the validity of the deep learning framework was assessed by comparing estimated VO2max values with VO2max values derived from a submaximal exercise test. Ideally, one would use VO2max values directly measured during a maximal exercise test to establish the ground truth for cardiorespiratory fitness comparisons. Maximal exercise tests, however, are problematic when used in large population-based studies because they may be unsafe for some participants and, consequently, induce selection bias. The submaximal exercise test used in the Fenland Study was well tolerated by study participants and demonstrated acceptable validity against direct VO2max measurements [5]. Submaximal tests are also utilized to validate popular wearable devices such as the Apple Watch [22]. We are therefore confident that VO2max values estimated from the deep learning framework reflect true cardiorespiratory fitness levels.

To investigate this limitation, we validate our models with 181 participants from the UK Biobank Validation Study (BBVS) who were recruited from the Fenland study. These participants completed an independent maximal exercise test, where VO2max was directly measured. Taking into account that the BBVS cohort is less fit (VO2max=32.9±7) compared to the cohort the model was trained on (Fenland I, VO2max=39.5±5), we observe that the model over-predicts, with a mean prediction of VO2max=39.9 and RMSE=8.998. This is expected because the range of VO2max seen during training did not include participants with VO2max below 25. Even when looking at women in isolation (who perform lower than men in these tests), they had a VO2max=37.4±4.7 in Fenland I (see Table 1), which is still significantly higher than the average participant of BBVS. This is a common distribution/label shift issue, where the model encounters an outcome which is out of its training data range, and is still an open problem in statistical modelling [23]. To partially mitigate this issue, we match BBVS's fitness to have similar statistics to the training set of the original model (Fenland I: VO2max=39±5, matched BBVS: VO2max=39±4) and observe an RMSE=5.19 (see Figure 8). This result is still not on par with our main results in Fenland but shows the impact of including very low-fitness participants in this validation study. To improve future CRF models, we believe that population-scale studies should focus on including low-fitness participants along with the general population.

In this paper, we developed deep learning models utilising wearable data and other bio-markers to predict the gold standard of fitness (VO2max) and achieved strong performance compared to other traditional approaches. Cardio-respiratory fitness is a well-established predictor of metabolic disease and mortality, and our premise is that modern wearables capture non-standardised dynamic data which could improve fitness prediction. Our findings on a population of 11,059 participants showed that the combination of all modalities reached an r = 0.82 when compared to the ground truth in a holdout sample. Additionally, we show the adaptability and applicability of this approach for detecting fitness change over time in a longitudinal subsample (n = 2,675) who repeated measurements after 7 years. Last, the latent representations that arise from this model pave the way for fitness-aware monitoring and interventions at scale. It is often said that "If you cannot measure it, you cannot improve it". Cardio-fitness is such an important health marker, but until now we did not have the means to measure it at scale. Our findings could have significant implications for population health policies, finally moving beyond weaker health proxies such as the BMI.

Methods

Study description

The Fenland study is a population-based cohort study designed to investigate the independent and interacting effects of environmental, lifestyle and genetic influences on the development of obesity, type 2 diabetes and related metabolic disorders. Exclusion criteria included prevalent diabetes, pregnancy or lactation, inability to walk unaided, psychosis or terminal illness (life expectancy of ≤ 1 year at the time of recruitment).

The Fenland study has two distinct phases. Phase I, during which baseline data was collected from 12,435 participants, took place between 2005 and 2015. Phase II was launched in 2014 and involved repeating the measurements collected during Phase I, alongside the collection of new measures. All participants who had consented to being re-contacted after their Phase I assessment were invited to participate in Phase II. At least 4 years must have elapsed between visits. As a result of this stipulation, recruitment to Phase II is ongoing. A flowchart of the analytical sample by each one of the study tasks is provided in Figure 6.

After a baseline clinic visit, participants were asked to wear a combined heart rate and movement chest sensor (Actiheart, CamNtech, Cambridgeshire, UK) for 6 complete days. For this study, data from 11,059 participants was included after excluding participants with insufficient or corrupt data or missing covariates, as shown in Figure 6. A subset of 2,675 of the study participants returned for the second phase of the study, after a median (interquartile range) of 7 (5-8) years, and underwent a similar set of tests and protocols, including wearing the combined heart rate and movement sensing device for 6 days. All participants provided written informed consent and the study was approved by the University of Cambridge, NRES
Committee - East of England Cambridge Central committee. All experiments and data collected were done in accordance with the declaration of Helsinki.

Study procedure

Participants wore the Actiheart wearable ECG, which measured heart rate and movement, recording at 60-second intervals [24]. The Actiheart device was attached to the chest at the base of the sternum by two standard ECG electrodes. Participants were told to wear the monitor continuously for 6 complete days and were advised that it was waterproof and could be worn during showering, sleeping or exercising. During a lab visit, all participants performed a treadmill test that was used to establish their individual response to a submaximal test, informing their VO2max (maximum rate of oxygen consumption measured during incremental exercise) [25]. RHR was measured with the participant in a supine position using the Actiheart device. HR was recorded for 15 minutes and RHR was calculated as the mean heart rate measured during the last 3 minutes. Our RHR is a combination of the sleeping HR measured by the ECG over the free-living phase and the RHR as described above.

Cardiorespiratory fitness assessment

VO2max was predicted in study participants using a previously validated submaximal treadmill test [5]. Participants exercised while treadmill grade and speed were progressively increased across several stages of level walking, inclined walking, and level running. The test was terminated if one of the following criteria was met: 1) the participant wanted to stop, 2) the participant reached 90% of age-predicted maximal heart rate (208-0.7*age) [15], 3) the participant exercised at or above 80% of age-predicted maximal heart rate for 2 minutes. Details about the fitness characteristics of the cohort and the validation of the submaximal test are provided elsewhere [26].

To further validate the models trained on submaximal VO2max, we employ the external cohort UK Biobank Validation Study (BBVS) [26]. We recruited 105 female (mean age: 54.3y±7.3) and 86 male (mean age: 55.0y±6.5) validation study participants, and VO2max was directly measured during an independent maximal exercise test, which was completed to exhaustion. Some maximal exercise test data were excluded because certain direct measurements were anomalous due to testing conditions (N=10). BBVS participants completed the same free-living protocol as in Fenland and we collected similar sensor and anthropometrics data, which were processed in the same way as in Fenland (see next section).

Free-living wearable sensor data processing

Participants were excluded from this analysis if they had less than 72 hours of concurrent wear data (three full days of recording) or insufficient individual calibration data (treadmill test-based data). All heart rate data collected during free-[...] non-wear periods were excluded from the analyses. This algorithm detected extended periods of non-physiological heart rate concomitantly with extended (> 90 minutes) periods that also registered no movement through the device's accelerometer. We converted movement intensities into standard metabolic equivalent units (METs), through the conversion 1 MET = 71 J/min/kg (3.5 ml O2 · min−1 · kg−1). These conversions were then used to determine intensity levels: ≤ 1.5 METs was classified as sedentary behaviour, activities between 3 and 6 METs were classified as moderate to vigorous physical activity (MVPA), and those > 6 METs were classified as vigorous physical activity (VPA).

Since the season can have a big impact on physical activity, considering how it affects workouts, sleeping patterns, and commuting patterns, we encoded the sensor timestamps using cyclical temporal features $T_f$. Here we encoded the month of the year as (x, y) coordinates on a circle:

$$T_{f_1} = \sin\left(\frac{2\pi t}{\max(t)}\right) \qquad (1)$$

$$T_{f_2} = \cos\left(\frac{2\pi t}{\max(t)}\right) \qquad (2)$$

where $t$ is the relevant temporal feature (month). The intuition behind this encoding is that the model will "see" that, e.g., December (12th) and January (1st) are 1 month apart (not 11). Considering that the month might change over the course of the week, we use the month of the first time-step only. Additionally, we extracted summary statistics from the following sensor time-series: raw acceleration, HR, HRV, Acceleration-derived Euclidean Norm Minus One, and Acceleration-derived Metabolic Equivalents of Task. Then, for every time-series we extracted the following variables, which cover a diverse set of attributes of their distributions: mean, minimum, maximum, standard deviation, percentiles (25%, 50%, 75%), and the slope of a linear regression fit. The rest of the variables (anthropometrics and RHR) are used as a single measurement.

In total, we derived a comprehensive set of 68 features using the Python libraries Pandas and Numpy. A detailed view of the variables is provided in Table 4.

Deep learning models

We developed deep neural network models that are able to capture non-linear relationships between the input data and the respective outcomes. Considering the high sampling rate of the sensors (1 sample/minute) after aligning HR and Acceleration modalities, it is impossible to learn patterns with such long dependencies (a week of sensor data includes more than 10,000 timesteps). Even the most well-tuned recurrent neural networks cannot cope with such sequences and, given the size of the training set (7,545 samples), the best option was to extract statistical features from the sensors and represent every participant-week as a row in a feature vector (see Fig. 1). This feature vector was fed to fully connected neural network layers which were trained with backpropagation.

Data preparation.
For Task 1 (see Figure 6), we matched living conditions underwent pre-processing for noise filtering. the sensor data with the participants who had eligible lab tests. Non-wear detection procedures were applied and any of those Then we split into disjoint train and test sets, making sure that 8/16
[Figure 4 panels: (a) kernel density of the ∆ of predictions (FI sensors vs. FII sensors) and of the true ∆ (VO2max current vs. VO2max future); (b) ∆ of predictions plotted against the true ∆; (c) kernel density of predicted and true current VO2max; (d) kernel density of predicted and true future VO2max.]

Figure 4. Assessing the robustness of the model to pick up change using new sensor data from Fenland II repeats. By matching the populations who provided sensor data for both cohorts (N=2,042) we passed them through our trained model from Task 1 to predict VO2max. (a-b) We then calculated the difference (∆) of the predictions juxtaposed with the true difference of fitness over the years. Distribution of ∆ of predicted and true VO2max. Correlation of ∆ of predicted and true VO2max (r = 0.57, p < 0.005). The gray line denotes a linear regression fit. Transparency has been applied to the datapoints to combat crowding. (c-d) Comparison of predicted and true VO2max using FI and FII covariates (sensors, RHR, anthro.), respectively. The distribution plots combine a kernel density estimate and histogram with bin size determined automatically with a reference rule.
participants from Fenland I go to the train set, while those from Fenland II go to the test set (see Figure 5). This allows us to reuse the trained model from Task 1 with different sensor data from Fenland II participants. Intuitively, we train a model on the big population, and we evaluate it with two snapshots of another longitudinal population over time (Task 1 & 3). After splitting, we normalize the training data by applying standard scaling (removing the mean and scaling to unit variance) and then denoise it by applying Principal Components Analysis (PCA), retaining the components that explain 99.99% of the variance. In practice, the original 68 features are reduced to 48. We save the fitted PCA projection and scaler and we apply them individually to the test-set, to avoid information leakage across the sets. The same projection and scaler are applied to all downstream models (Task 2 and 3) to leverage the knowledge of the big cohort (Fenland I).

Model architecture and training. The main neural network (used in Task 1) receives a 2D vector of [users, features] and predicts a real value. For this work, we assume N users and F features of an input vector $X = (x_1, \ldots, x_N) \in \mathbb{R}^{N \times F}$ and a target VO2max $y = (y_1, \ldots, y_N) \in \mathbb{R}^{N}$. The network consists of two densely-connected feed forward layers with 128 units each. A dense layer works as follows: output = activation(input · kernel + bias), where activation is the element-wise activation function (the exponential linear unit in our case), kernel is a learned weights matrix with a Glorot uniform initialization, and bias is a learned bias vector. Each layer is followed by a batch normalization operation, which maintains the mean output close to 0 and the output standard deviation close to 1. Also, dropout of 0.3 probability is applied to every layer, which randomly sets input units to 0 and helps prevent overfitting. Last, the final layer is a single-unit dense layer and the network is trained with the Adam optimizer to minimize the Mean Squared Error (MSE) loss, which is appropriate for continuous outcomes. We use a random 10% subset of the train-set as a validation set. To combat overfitting, we train for 300 epochs with a batch size of 32 and we perform early stopping when the validation loss stops improving after 15 epochs and the learning rate is reduced by 0.1 every 5 epochs. All hyperparameters (# layers, # units, dropout rates, batch size, activations, and early stopping) were found after tuning on the validation set.

Model differences across tasks. Task 1 trains the main neural network of our study (see previous subsection). Task 2 re-trains an identical model to predict VO2max in the future (and the delta present-future). However, when we re-frame this problem as a classification task (see Figure 3), we use considerably fewer participants because we focus on the tails of the change distribution. Therefore, to combat overfitting, we train a smaller network with one Dense layer of 128 units and a sigmoid output unit, which is appropriate for binary problems. Instead of optimizing the MSE, we now minimize the binary cross-entropy. In all other cases, such as in Task 3 or when visualizing the latent space, we do not train new models; the model which was trained in Task 1 is used in inference mode (prediction).

Prediction equations
For reference, we compare our models' results to traditional non-model equations, which rely on Body Mass, RHR, and Age. We incorporate the popular equation proposed by Uth and colleagues [14], which corresponds to

$$\mathrm{VO_2max} = 15.0\ (\mathrm{ml \cdot min^{-1}}) \times \text{Body Mass (kg)} \times \frac{HR_{max}}{HR_{rest}},$$

in combination with Tanaka's equation [15], where $HR_{max} = 208 - 0.7 \times \text{age}$. Other approaches rely on measurements such as the waist circumference, which however was not recorded in our cohorts.

Evaluation
To evaluate the performance of the deep learning models which predict continuous values, we computed the root mean squared error

$$\mathrm{RMSE} = \sqrt{\frac{1}{|N_{test}|} \sum_{y \in D_{test}} \sum_{t=1}^{N} (y_t - \hat{y}_t)^2},$$

the coefficient of determination

$$R^2 = 1 - \frac{\sum_{t=1}^{N} (y_t - \hat{y}_t)^2}{\sum_{t=1}^{N} (y_t - \bar{y})^2},$$

and the Pearson correlation coefficient. Here $y$ and $\hat{y}$ are the measured and predicted VO2max and $\bar{y}$ is the mean. For the binary models, we used the Area under the Receiver Operator Characteristic (AUROC or AUC), which evaluates the probability of a randomly selected positive sample to be ranked higher than a randomly selected negative sample.

Visualizing the latent space
The activations of the trained model allow us to understand the inner workings of the network and explore its latent space. We first pass the test-set of Task 1 through the trained model and retrieve the activations of the penultimate layer [27]. This is a 2D vector of [2675, 128] size, considering that the layer size is 128 and the participants of the test-set are 2675. Intuitively, every participant corresponds to a 128-dimensional point. In order to visualize this embedding, we apply tSNE [28], an algorithm for dimensionality reduction. For its optimization, we use a perplexity of 50, as was suggested recently [29].

Statistical analyses
We performed a number of sensitivity analyses to investigate potential sources of bias in our results. Full results of these sensitivity analyses are shown in the main text and corresponding Tables. In particular, we use bootstrapping with replacement (1000 samples) to calculate 95% Confidence Intervals when we report the performance of the models in the hold-out sets. Wherever we report p-values, we use the recently proposed strict threshold of p < 0.005.

Data availability
All Fenland data used in our analyses are available from the MRC Epidemiology Unit at the University of Cambridge upon reasonable request (http://www.mrc-epid.cam.ac.uk/research/studies/fenland/).
Code availability
The source code of this work will be available on Github upon acceptance.

Acknowledgements
DS is supported by the Embiricos Trust Scholarship of Jesus College Cambridge and the Engineering and Physical Sciences Research Council through grant DTP (EP/N509620/1). IP-P is supported by GlaxoSmithKline and Engineering and Physical Sciences Research Council through an iCase fellowship (17100053).

Author contributions statement
DS, IP, SB, NW, CM designed the study. SB, NW collected the data on behalf of the Fenland Study. DS, IP sourced, selected, and pre-processed the data. DS developed the models and produced the figures. DS, IP performed the statistical analysis. IP, DS wrote the first draft of the manuscript. All authors (DS, IP, TG, SB, NW, CM) critically reviewed, contributed to the preparation of the manuscript, and approved the final version. All authors vouch for the data, analyses, and interpretations.

Competing interests
The authors declare that they have no competing financial interests.

References
1. Mandsager, K. et al. Association of cardiorespiratory fitness with long-term mortality among adults undergoing exercise treadmill testing. JAMA network open 1, e183605–e183605 (2018).
2. Swain, D. P., Brawner, C. A., American College of Sports Medicine et al. ACSM's resource manual for guidelines for exercise testing and prescription (Wolters Kluwer Health/Lippincott Williams & Wilkins, 2014).
3. Cao, Z.-B. et al. Predicting V̇O2max with an objectively measured physical activity in japanese women. Medicine & Sci. Sports & Exerc. 42, 179–186 (2010).
4. Gonzales, T. I. et al. Resting heart rate as a biomarker for tracking change in cardiorespiratory fitness of uk adults: the fenland study. medRxiv (2020).
5. Gonzales, T. I. et al. Estimating maximal oxygen consumption from heart rate response to submaximal ramped treadmill test. medRxiv (2020).
6. Nes, B. M. et al. Estimating V̇O2peak from a nonexercise prediction model: the hunt study, norway. Medicine & Sci. Sports & Exerc. 43, 2024–2030 (2011).
7. Shcherbina, A. et al. Accuracy in wrist-worn, sensor-based measurements of heart rate and energy expenditure in a diverse cohort. J. personalized medicine 7, 3 (2017).
8. Passler, S., Bohrer, J., Blöchinger, L. & Senner, V. Validity of wrist-worn activity trackers for estimating vo2max and energy expenditure. Int. journal environmental research public health 16, 3037 (2019).
9. Cooper, K. D. & Shafer, A. B. Validity and reliability of the polar a300's fitness test feature to predict vo2max. Int. journal exercise science 12, 393 (2019).
10. Altini, M., Casale, P., Penders, J. & Amft, O. Cardiorespiratory fitness estimation in free-living using wearable sensors. Artif. intelligence medicine 68, 37–46 (2016).
11. Richardson, K., Levett, D., Jack, S. & Grocott, M. Fit for surgery? perspectives on preoperative exercise testing and training. BJA: Br. J. Anaesth. 119, i34–i43 (2017).
12. Leon-Ferre, R., Ruddy, K. J., Staff, N. P. & Loprinzi, C. L. Fit for chemo: Nerves may thank you. JNCI: J. Natl. Cancer Inst. 109 (2017).
13. Lindsay, T. et al. Descriptive epidemiology of physical activity energy expenditure in uk adults (the fenland study). Int. J. Behav. Nutr. Phys. Activity 16, 1–13 (2019).
14. Uth, N., Sørensen, H., Overgaard, K. & Pedersen, P. K. Estimation of vo2max from the ratio between hrmax and hrrest–the heart rate ratio method. Eur. journal applied physiology 91, 111–115 (2004).
15. Tanaka, H., Monahan, K. D. & Seals, D. R. Age-predicted maximal heart rate revisited. J. american college cardiology 37, 153–156 (2001).
16. Gaspar, H. A. & Breen, G. Probabilistic ancestry maps: a method to assess and visualize population substructures in genetics. BMC bioinformatics 20, 1–11 (2019).
17. Schmid, D. & Leitzmann, M. Cardiorespiratory fitness as predictor of cancer mortality: a systematic review and meta-analysis. Annals oncology 26, 272–278 (2015).
18. Schuch, F. B. et al. Are lower levels of cardiorespiratory fitness associated with incident depression? a systematic review of prospective cohort studies. Prev. Medicine 93, 159–165 (2016).
19. Laukkanen, J. A., Kurl, S., Salonen, R., Rauramaa, R. & Salonen, J. T. The predictive value of cardiorespiratory fitness for cardiovascular events in men with various risk profiles: a prospective population-based cohort study. Eur. heart journal 25, 1428–1437 (2004).
20. Jakicic, J. M. et al. Four-year change in cardiorespiratory fitness and influence on glycemic control in adults with type 2 diabetes in a randomized trial: the look ahead trial. Diabetes care 36, 1297–1303 (2013).
21. Qui, S., Cai, X., Sun, Z., Wu, T. & Schumann, U. Is estimated cardiorespiratory fitness an effective predictor for cardiovascular and all-cause mortality? a meta-analysis. Atherosclerosis 330, 22–28 (2021).
22. Apple. Using apple watch to estimate cardio fitness with vo2 max (2021). [Online; posted 2021; https://www.apple.com/healthcare/docs/site/Using_Apple_Watch_to_Estimate_Cardio_Fitness_with_VO2_max.pdf].
23. Lipton, Z., Wang, Y.-X. & Smola, A. Detecting and correcting for label shift with black box predictors. In International conference on machine learning, 3122–3130 (PMLR, 2018).
24. Brage, S., Brage, N., Franks, P. W., Ekelund, U. & Wareham, N. J. Reliability and validity of the combined heart rate and movement sensor actiheart. Eur. journal clinical nutrition 59, 561–570 (2005).
25. Brage, S. et al. Hierarchy of individual calibration levels for heart rate and accelerometry to measure physical activity. J. Appl. Physiol. 103, 682–692 (2007).
26. Gonzales, T. I. et al. Descriptive epidemiology of cardiorespiratory fitness in uk adults: The fenland study. medRxiv (2022).
27. Yosinski, J., Clune, J., Nguyen, A., Fuchs, T. & Lipson, H. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579 (2015).
28. Van der Maaten, L. & Hinton, G. Visualizing data using t-sne. J. machine learning research 9 (2008).
29. Wattenberg, M., Viégas, F. & Johnson, I. How to use t-sne effectively. Distill 1, e2 (2016).
30. Faurholt-Jepsen, M., Brage, S., Kessing, L. V. & Munkholm, K. State-related differences in heart rate variability in bipolar disorder. J. psychiatric research 84, 169–173 (2017).
31. White, T., Westgate, K., Wareham, N. J. & Brage, S. Estimation of physical activity energy expenditure during free-living from wrist accelerometry in uk adults. PLoS One 11, e0167472 (2016).

Supplementary appendix
[Figure 5: histogram of VO2max (x-axis) against frequency (y-axis) for the training set and the test set.]

Figure 5. Distribution of VO2max in the training and test sets in the Fenland I cohort. Both sets display similar ranges of values, which supports the robustness of inferences based on the test set. This plot refers to the train and test sets of Task 1.

Table 3. Evaluation of predicting fine-grained VO2max in the present and the future with the Fenland II repeats cohort using covariates of Fenland I. Neural network results. (*the Delta outcome is in a different unit and hence a direct comparison with raw VO2max results might not apply)

Inputs: Wearable Sensors + RHR + Anthro.
N (train+val / test set): 2675 (2140/535)
Evaluation metrics are reported with [95% CI].

Outcome                      R2                     Corr                   RMSE
Current VO2max               0.652 [0.606-0.695]    0.815 [0.783-0.846]    2.959 [2.742-3.201]
Future VO2max                0.499 [0.431-0.55]     0.721 [0.67-0.759]     3.673 [3.421-3.916]
Delta (Current - Future)*    0.081 [0.02-0.078]     0.233 [0.159-0.307]    3.175 [2.923-3.41]
Table 4. Description of the features/variables used in our analysis as inputs to the models. The features with asterisks (*) are time-series and therefore we have extracted the following statistical variables: mean, minimum, maximum, standard deviation, percentiles (25%, 50%, 75%), and the slope of a linear regression fit. The final set of features is 68.

Features/Variables: Description

Sensors
Acceleration*: Acceleration measured in mg
Heart rate (HR)*: Mean HR resampled in 15sec intervals, measured in BPM
Heart Rate Variability (HRV)*: HRV calculated by differencing the second-shortest and the second-longest inter-beat interval (as seen in [30]), measured in ms
Acceleration-derived Euclidean Norm Minus One (ENMO)*: Acceleration-derived ENMO-like variable (Acceleration/0.0060321) + 0.057 (as seen in [31])
Acceleration-derived Metabolic Equivalents of Task (METs)*:
  Sedentary*: If Accelerometer = 1, take daily count and average
  Vigorous*: If Accelerometer >= 4.15, take daily count and average

Anthropometrics
Age: Age, measured in years
Sex: Sex is binary (female/male)
Weight: Weight, measured in kilograms
Height: Height, measured in meters.centimeters
Body Mass Index (BMI): BMI is calculated by Weight/(Height^2), measured in kg/m^2

Resting Heart Rate
Wearable-derived RHR: RHR is calculated by averaging the 4th, 5th, and 6th minute of the baseline visit and adding to that the Sleeping Heart Rate that has been inferred by the wearable device [4].

Seasonality
Month of year: The month number is used along with a coordinate encoding that allows the models to make sense of their cyclical sequence.
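The feature construction summarized in Table 4 can be sketched as follows. This is a minimal illustration under stated assumptions rather than the study's implementation: the function names are hypothetical, the month encoding takes max(t) = 12, the standard deviation is the population form (NumPy's default, which may differ from the convention used in the study), and the slope is fitted against the time-step index.

```python
import numpy as np


def encode_month(month, n_months=12):
    """Map a month number (1-12) to (sin, cos) coordinates on a circle."""
    angle = 2.0 * np.pi * month / n_months
    return float(np.sin(angle)), float(np.cos(angle))


def summarize_series(values):
    """Eight summary statistics for one sensor time-series."""
    values = np.asarray(values, dtype=float)
    steps = np.arange(values.size)
    slope = np.polyfit(steps, values, deg=1)[0]  # slope of a linear fit over time-steps
    p25, p50, p75 = np.percentile(values, [25, 50, 75])
    return {
        "mean": float(values.mean()),
        "min": float(values.min()),
        "max": float(values.max()),
        "std": float(values.std()),  # population std (ddof=0), an assumption
        "p25": float(p25),
        "p50": float(p50),
        "p75": float(p75),
        "slope": float(slope),
    }


def circle_distance(a, b):
    """Euclidean distance between two encoded months."""
    return float(np.hypot(a[0] - b[0], a[1] - b[1]))


# December and January end up as close together as January and February.
dec, jan, feb = encode_month(12), encode_month(1), encode_month(2)
print(circle_distance(dec, jan), circle_distance(jan, feb))

# Toy heart-rate series (made-up values) summarized into one feature row.
print(summarize_series([62, 64, 70, 85, 90, 76, 68]))
```

Each time-series marked with an asterisk would contribute one such dictionary, and the resulting values would be concatenated with the single-measurement variables into one row per participant-week.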
Figure 6. Flowchart of the analytical sample and the training/testing splits across the three tasks. The first task trains a model to predict fitness using the large cohort (Fenland I). The second task uses the smaller cohort of repeats in Fenland I (called Fenland II) and trains further models to predict fitness now and in the future (and their delta). The third task evaluates the original model trained in Task 1 by feeding it new sensor data (Fenland II sensors and anthropometrics) to assess the adaptability of the model to pick up change. (*Training set is 90% of the 80% remaining dataset after splitting to testing set. Validation set is 10% of the training set)
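The split proportions in the Figure 6 footnote can be illustrated with a short sketch. The helper name, the random shuffling, and the seed are assumptions made for illustration; the source states only the proportions. With 2,675 participants, the sizes reproduce the 2140/535 train+val/test counts reported in Table 3.

```python
import numpy as np


def split_indices(n, test_frac=0.2, val_frac=0.1, seed=0):
    """Split n participant indices into disjoint train, validation and test sets.

    The test set takes test_frac of all participants; the validation set takes
    val_frac of what remains; the rest is used for training.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)  # hypothetical choice: a simple random shuffle
    n_test = int(round(n * test_frac))
    test, rest = order[:n_test], order[n_test:]
    n_val = int(round(len(rest) * val_frac))
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test


train, val, test = split_indices(2675)
print(len(train), len(val), len(test))
```

Splitting by participant index, rather than by individual samples, keeps each person in exactly one set.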
[Figure 7 panels: tSNE Dimension 1 against tSNE Dimension 2 for (a) the original data and (b) the learned latent space, colored by current VO2max; and for (c) the original data and (d) the learned latent space, colored by the week's mean heart rate (bpm).]

Figure 7. t-distributed stochastic neighbor embedding (tSNE) projection of the original feature vector (Fenland I testing set, Sensors + RHR + Anthro.) compared to the model's latent space after training. (a-b) The original data presents some clusters but the outcome is not clearly linearly separable. The model activations on the penultimate layer of the neural network capture the continuum of low-high VO2max both locally and globally. (c-d) A similar assessment to VO2max is presented by coloring with the mean HR of the week of each participant. In the learned space, participants with low HR (high fitness) are placed in the same clusters as in VO2max, unlike the original space. In all plots a 50% transparency has been applied to combat crowding and the colorbar is centered on the median value to illustrate extreme cases. The VO2max label is used only for color-coding purposes (the projection is label-agnostic). Every participant is a dot.

[Figure 8 panels: kernel density of predicted and true VO2max for (a) all participants and (b) the matched subset.]

Figure 8. External validation of Fenland I model with maximal (peak exercise) test data using the BBVS cohort. (a) Distribution of the predicted vs the true VO2max (RMSE=8.998) using all participants (N=181). (b) Distribution of the predicted vs the true VO2max (RMSE=5.19) by matching BBVS to have similar VO2max (mean±std) to the training set of Fenland I, using a subset of participants (N=82). Please see the main text for interpretation of these results.