An introduction to biomedical applications of big data, machine learning, and artificial intelligence - RYAN PETERSON, ASSISTANT PROFESSOR ...
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
An introduction to biomedical applications of big data, machine learning, and artificial intelligence RYAN PETERSO N , ASSISTANT PRO F ESSO R, B IO STATISTI C S & IN FO RMATI C S C C TSI B IG DATA SEMINAR SERIES NOV EMB ER 9, 2022 1
Outline Introduction to big data ◦ What are they? ◦ Why are they important? ◦ Important considerations – context, goals, bias Machine learning and artificial intelligence A survey of big data contexts Big data ethics 2
Big data – what are they? The term “big data” typically refers to data sets so large and/or complex that they are difficult to process using on- hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization. Big data sets often include a variety of data types. Sometimes data are structured, but often they are unstructured or semi-structured. Source: Wikipedia 4
Big data – The 4 (or 5) V’s IBM: The 4 V’s of big data vary from context to context: volume, veracity, variety, and velocity The fifth “V” is “Value”. Big data + “work to yield insights” = Value Such “work” may include analysis, machine learning, and/or artificial intelligence, which we’ll touch on soon Source 5
Important considerations Big data are not an end in themselves. Context matters for how to frame necessary work! For many reasons, most big data cannot be processed on a desktop. Big data are often utilized to solve causal questions, but may provide limited causal insights Identifying sources of variation and bias are key; traditional methods used for big data tend to fall victim to bias. ◦ Landon/Roosevelt election in 1936, Lancet poll ◦ Google Flu Trends 7
Value propositions for big data In industry settings, value is most often measured in dollars. What is the value proposition for big data in research (not-for-profit) settings? What is the value proposition in biomedical/healthcare settings? Personalized medicine! 8
High-sample or high-dimensional? “High sample size” big data refer to “tall” data sets with an extremely large number of samples (n). ◦ If data collection is cheap relative to value proposition, massive p data sets are common (e.g. advertising, insurance) ◦ For, randomized controlled clinical trials, the gold-standard causal ID X1 … Xp study, we do not see massive data sets. S1 High-dimensional data refer to “wide” data sets with an extremely large number of features (p). n … ◦ If measuring many features of a single independent observation is cheap, we will see high-dimensional data. Sn ◦ ’Omics data are typically high-dimensional, rarely have a n>>p Additional computational and analytical complexity arises when big data have both high-n and high-dimensional components (wearables, Google trends, images) 9
High-sample or high-dimensional? HIGH SAMPLE SIZE (N>P) HIGH DIMENSIONS (P>N) Produces “stable” insights (e.g. precise Tends to produce “unstable” insights estimates of effects of interest) Issues relating to multiple Issues relating to bias of primary concern comparisons/multiplicity of primary concern Possibility for diminishing returns relative to Causal effects can be difficult to disentangle value proposition from correlated features ◦ “Rashomon Effect” Black-box algorithms tend to perform well What about when you have both a very high sample size and very high dimensions? • Not always clear; depends on goals/context. • An interdisciplinary approach is best; important to include clinician, informatician, data scientist, (bio)statistician, and more. Same goes for complex data structures. 10
Streaming data Another type of “big data” that doesn’t follow neatly into these categories is “Streaming” data. Streaming data are continuously generated by different sources, and refer to small amounts of data which arrive at a very fast rate. Streaming data can be conceptualized as “data streams” which lead to a “data lake”. Examples: Sensors in transportation vehicles, stock market data, website analytics, twitter data Processing and analysis performed on the data lake are referred to as “batch processing”, those performed on the data streams are “stream processing” More information available at AWS Related to time-series data analysis 11
Machine learning, artificial intelligence 12
Machine learning and artificial intelligence Screenshot from openAI playground: 13
Machine learning and artificial intelligence Hotpot.ai image generator: “machine learning and artificial intelligence” “Machine learning” 14
Machine learning Also called “Data mining”, represents “3rd phase” of knowledge discovery for big-data. Supervised learning: using information to predict an outcome Unsupervised learning: using information to identify subgroups Recommendation systems Anomaly detections: identification of observations that deviate from the norm 15
Supervised learning Goal: Given a training data set with X’s and Y’s, build a model to predict Y given X. Then, take a new X that was not in the data set and predict Y for that new observation Many public challenges exist or have been completed for this purpose (e.g. Kaggle, Netflix) Simple supervised learning techniques can be performed that are transparent and interpretable: ◦ Regression! (lasso for high-dimensional case) ◦ Decision trees “Fancier” ML tools: random forests, ensembles/Super learners, K-nearest neighbors, support vector/kernel machines, [deep] neural networks ◦ Predictions and models are “explainable,” but not necessarily interpretable ◦ Commonly referred to as “black box” 16
Unsupervised Machine Learning No outcome of interest Dimension reduction (going from a high number of variables to a low number of variables) ◦ Without the consideration of the prediction of an outcome Clustering data points together (top-right) Examples: ◦ Eliminating high VIF variables is a form of dimension reduction ◦ Weight loss trajectory clustering (bottom plot on right) 17
Supervised ML “algorithms” CLASSIFICATION ALGORITHMS REGRESSION ALGORITHMS Outcome is categorical (possibly binary) Outcome is continuous Logistic regression is a classification algorithm Confusingly, logistic regression is not a “regression” algorithm in the ML world since A coin flip can be a classification algorithm the outcome must be continuous. Any set of rules or equations which output a Examples include: class prediction or probability is a classification ◦ Linear regression algorithm. ◦ Partial least squares Performance measures: misclassification rate, ◦ Regression trees area under the ROC curve (AUC), sensitivity, ◦ many others specificity, more. Performance measures: Root-mean-squared error, R-squared 18
Interpretability in Machine Learning SIMPLE/INTERPRETABLE “BLACK BOX” MODELS Tend to have parameters which can be If they have parameters, they cannot be well- interpreted in sentences/words understood (think: high order interactions) The process by which predictions are made is The process by which predictions are made mathematically tractable and understandable. need not be mathematically tractable (by humans). Examples: ◦ the older a child, the larger our prediction for Input -> ?? -> predictions. The question marks their FEV. (linear regression) are “left blank” by the researchers, who only ◦ Patients in 3rd class (or crew) on the titanic had care about the predictions themselves. the worst odds of survival (logistic regression) 19
Transparent vs black box learning AN INTERPRETABLE STATISTICAL MODEL MAY A BLACK-BOX ML MODEL MAY BE THE BETTER BE THE BETTER CHOICE IF… CHOICE IF… Uncertainty is inherent in the outcome The outcome doesn’t have a strong component of randomness The signal-to-noise ratio is low The signal-to-noise ratio is large One wants to isolate effects of a small number of variables Overall prediction is the goal, without being able to Uncertainty in an overall prediction or the effect of a succinctly describe the impact of any one variable (e.g., predictor is sought treatment) Additivity is the dominant way that predictors affect the One is not very interested in estimating uncertainty in outcome (interactions are relatively rare and can be pre- forecasts or in effects of selected predictors specified) Non-additivity is expected to be strong and can’t be The sample size isn’t huge isolated to a few pre-specified variables One wants to isolate the effects of “special” variables such The sample size is huge as treatment or a risk factor One does not care that the model is a “black box” One wants the entire model to be interpretable and ethically tractable 20 Source: F. Harrell Regression Blog
Machine learning – additional resources Highly recommend the following textbook if you want to learn more: ◦ Applied Predictive Modeling by Max Kuhn and Kjell Johnson Others: ◦ An Introduction to Statistical Learning by James, Witten, Hastie, and Tibshirani ◦ The Elements of Statistical Learning by Freidman, Tibshirani, and Hastie Other good readings: ◦ Leo Breiman. "Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author).” Statistical Science, 16(3) 199-231 August 2001. https://doi.org/10.1214/ss/1009213726 ◦ https://www.fharrell.com/post/stat-ml/ 21
A survey of big data contexts 22
Big data from “wearables” Wearables are sensors worn on the body (e.g. on the wrist) Measure accelerometry at the sub-second level, generating massive amounts of data Dimension reduction, ML, and AI used to aggregate this data into useful (meaningful) quantities, such as activity counts, sedentary behavior, steps, gaits, sleep quality, and more. Scientific Questions ◦ Physical activity and health ◦ Example: 6-minute walk distance. More informative from clinic, or from wearable? ◦ Epidemiology of aging ◦ Circadian rhythms ◦ Sleep duration, quality Seth Creasy, Julia Wrobel, and Andrew LeRoux will be doing a CCTSI big data seminar for wearables data this Spring. 23
Big data and mHealth mHealth is a similar but slightly more general context than wearables data, encompassing the use of all mobile telecommunication and multimedia technologies in health care delivery. Such technologies often allow for granular (e.g. household-level) surveillance related to disease transmission, outbreak detection, and identifying household or genetic risk factors. Commonly available variables: location/movement, time (frequency), health readings (e.g. from wearables, but also things like user-recorded images, sound, and symptoms), distance/proximity to other devices Scientific questions of interest (examples): ◦ Healthcare worker + patient movement in a healthcare setting ◦ Covid-exposure detection (alerts) ◦ Diabetic foot surveillance (at-home detection of foot ulcers) (Anthony et al. 2020) 24
Big data and mHealth – Kinsa smart thermometer example Kinsa developed a thermometer (& smartphone app) which they made commercially available across the US ◦ Temperatures stored on device and uploaded to cloud anonymously ◦ Users have profiles with basic health information, and can report other symptoms (cough, sore throat, etc) We wanted to see if these measures helped “nowcast” flu incidence in the US, compared to the CDC’s influenza-like illness measure which is commonly treated as the gold standard. Highest correlation Lowest correlation Miller, Peterson et al. (2019) 25
Big data and mHealth – Kinsa smart thermometer example During the flu season, squared forecast errors decreased by 43% (95% CI: 35-50) The baseline model missed the peak of influenza season by an average of 2.5 weeks, whereas models incorporating Kinsa data missed the peak by about 1.5 weeks. There were only 3 states where model performance did not improve with the thermometer readings: Montana, Oklahoma, and Utah. In 12 states, the optimal model’s SE declined by more than 60% when thermometer data incorporated. Miller, Peterson et al. (2019) 26
Research insights from Wearables, mHealth The Datenspende study; Radin et al 2020 ◦ German study acquired anonymous wearable data at a large scale (over 500K participants) ◦ Researchers used wearable data to calculate the regional probability of COVID-19 outbreaks incorporating individual data on pulse, physical activity (PA), and sleep. The Apple Heart Study; Perez et al. 2019: ◦ 400K+ participants, Apple smartwatch ◦ Showed that wearable devices may detect atrial fibrillation (AF) ◦ Generated electrocardiograms with wearable ECG patch Other examples: ◦ electrolyte monitoring for cystic fibrosis (Choi et al. 2018) or training management (Relf et al. 2019) ◦ measurement of continuous noninvasive blood glucose (Caduff et al. 2018) ◦ Wearable smart inhalers and activity trackers for asthma monitoring can predict uncontrolled asthma (van der Kamp et al. 2020) Huhn et al 2022 scoping review 27
Big data – administrative data Lots of data are collected routinely by organizations: financial data, claims, billing, in addition to EHR data. Such “administrative” data are not typically collected for the purposes of research, but still can be utilized to answer interesting questions related to research. Allows for extremely large sample sizes (high-n), typically easy and/or cheap to acquire Tend to be well-documented, facilitating data management (still hard given massiveness of data). Important to consider many possible sources of bias! Examples: ◦ Medicare/Medicaid data ◦ Premier data ◦ Census data ◦ National Health and Nutrition Examination Survey (NHANES) ◦ Healthcare Cost and Utilization Project (HCUP) data contains rich information on national and state-level healthcare utilization, access, quality, charges, and outcomes 28
Big administrative data – SSI seasonality Surgical site infections (SSIs) after total knee (TK) and total hip arthroplasty (THA) are devastating to patients and costly to healthcare systems. Are SSIs seasonal? Let’s find out with HCUP’s National Readmission Database. >760K procedures over 2 years Adjusted SSI incidence 24% higher in June vs December Nadir Peak, Odds Estimated Estimated Ratio (95% CI) Nadir Month Peak Month TKA 1.305 (1.201-1.418) December June THA 1.19 (1.091-1.298) January July Pooled 1.237 (1.164-1.314) December June Anthony, Peterson et al. 2018 29
Big administrative data – cellulitis HCUP’s National Inpatient Sample is bigger than the NRD (more years available), and is meant to be nationally representative. We used data from 1998-2013 to show that hospitalizations for cellulitis, a skin infection, have increased dramatically in incidence and associated costs. We also showed that incidence was seasonal, which had previously been unknown. Methods: extract counts of cases by month, then model as a time series ◦ Efficient for > 5 million records Peterson et al, 2017 30
Big data context: Electronic Health Records The EHR consists of clinic notes, medications, labs, demographics, registries, vitals, billing codes, outpatient encounters, and more. Databases live as either structured (e.g. ICD-codes), semi-structured (e.g. medications/dosage), or unstructured text (clinical notes). Our campus has an EMR repository and many resources available for working with this data: ◦ Health Data Compass: www.healthdatacompass.org ◦ Colorado Center for Personalized Medicine Biobank: www.cobiobank.org ◦ CIDA: coloradosph.cuanschutz.edu/research-and-practice/centers-programs/cida ◦ D2V: https://medschool.cuanschutz.edu/accords/about/d2v Laura Wiley will be presenting on this topic next month at a big data seminar! 31
EHR data example – Simon et al. 2021 “Harmonized” EHR data can be used to build machine learning models that can be directly used during clinical care. N ≈ 35K patients at risk for drug-induced QT prolongation using the UCHealth EHR, with p ≈ 6500 features derived from the harmonized EHR data ◦ Included medications, procedure codes, diagnosis codes, labs, and demographic data Analyses utilized 96 CPUs and 620 GB of RAM (Google Cloud platform). Deep neural networks were found to best predict the outcome, achieving an out-of-sample AUC of 0.71 (FPR = 28%, FNR = 30%). Source: Cleveland Clinic 32
Additional big data contexts in research Big data in the COVID-19 pandemic response ◦ See Tell Bennett’s big data seminar from last year (https://youtu.be/5-PbrKyCZRE) ◦ See also Debashis Ghosh’s big data seminar in 2020 (https://youtu.be/QSpjhpiiwsk), where he also describes biases and ethics of big data Omics ◦ Stay tuned for Katerina Kechris will discuss this context in March 2023 in a big data seminar! Imaging ◦ Fuyong Xing and Antonio Porras will discuss this context in April 2023 for a big data seminar Evidence synthesis ◦ Tianjing Li and Lisa Bero gave a big data seminar in 2020. Here is a link to their slides. Wearables, and EHR data will also be covered in more depth this year. 33
Big data and registries: new frontiers? Increasingly, patient registries and large cohort studies are being augmented with data from other big data contexts. Examples: ◦ SEER-Medicare: a cancer registry (35% of all cancer patients) linked to Medicare data ◦ All of Us: a disease agnostic cohort study gathering data from wearables (heart rate, physical activity, sleep, and more) as well as social determinants of health, family health history, health care access, etc. ◦ PROGRESS: combines wearable data with electronic health records, biosamples, and patient generated data to identify what influences how your body responds to food. ◦ NCDB: 70% newly diagnosed cancer patients linked with >34 million historical records ◦ The UK Biobank study (>80K subjects with granular wearable data, genetic data, and outcomes) 34
Personalized medicine Personalized medicine is a type of healthcare in which treatments of patients are tailored to individuals based on the person’s genetic makeup (biology), lifestyle, and medical history. What big data contexts are involved with these individual factors? ◦ Many of them! EHR, wearables, ‘omics, mHealth, imaging, more? How can these data sources be used and streamlined for personalized medicine? Data harmonization and integration are critical to this endeavor. With a vast amount of data available on patients, ML/AI tools can be utilized to predict how a person will respond to different treatments. Who is ultimately accountable for personalized medical treatment decisions informed by ML/AI? 35
Ethical Considerations MORE QUESTIONS THAN ANSWERS 36
Hotpot.ai: “privacy” Ethics for Big Data – Privacy Is privacy of individuals maintained with big data? The trade-off between privacy and variety of data ◦ Is it possible to de-identify high-dimensional data? Without losing valuable information? ◦ Is it possible to de-identify location data? Who can access the stored big data? Further reading: ◦ https://www.theperspective.com/debates/businessandtechnology/the-perspective-on-big-data/ ◦ https://hbr.org/2012/08/dont-build-a-database-of-ruin ◦ https://journalofbigdata.springeropen.com/articles/10.1186/s40537-016-0059-y ◦ Risk mitigation for synthetic data sets: https://unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.46/2017/3_risk_mitigation.pdf 37
Ethics for big data – algorithmic bias According to this article, Proctorio (a popular service which has a contract with the University of Colorado) watched its business grow 900% after campuses began closing due to the COVID-19 pandemic. These programs use supervised ML in a variety of ways to proctor exams by various means, including: ◦ Using facial recognition to match the person taking the exam on Zoom to the ID of the student enrolled in the class, ◦ Creating flags for review if cheating is suspected based on eye-tracking ◦ Following web-traffic to monitor sites utilized during the exam, ◦ Etc… After learning how these ML methods are trained, what to you think about the idea of institutionally implementing this “auto-proctor” system? 38
Ethics for big data – additional concerns Selection and Sampling Bias ◦ Researchers may very precisely answer the wrong question ◦ Biased data sets will have limited generalizability, even with massive sample size ◦ Example: 2013 Boston pothole detection ◦ Used smartphone app to detect potholes using GPS and accelerometers ◦ People in lower-income groups less likely to have smartphone ◦ Which potholes will be patched? Transparency: How can we build trust in (black-box) ML/AI systems when we don’t understand how they work? Who are the stakeholders of big data? Who is accountable for ML/AI systems (ultimately)? 39
Summary A large value proposition for big data in biomedical settings is in the prospect of personalized medicine. Health and its determinants are extremely multifaceted. They require an extremely high dimensional set of variables to measure accurately New technologies are producing massive amounts of data from large, diverse populations. New insights are occurring at a rapid rate from imaging, wearables, ‘omics, administrative data, EHR, mHealth, and more big data contexts Research increasingly requires interdisciplinary teams with both data scientists, clinicians, etc. ML/AI systems can be beneficial, but transparency, bias, and trust are important considerations in their application, since those ultimately accountable for decisions made in personalized medicine are patients themselves. 40
You can also read