The Role of Time, Weather and Google Trends in Understanding and Predicting Web Survey Response
Survey Research Methods (2021), Vol. 15, No. 1, pp. 1-25
doi:10.18148/srm/2021.v15i1.7633
European Survey Research Association
© 2021 Author(s), CC BY-NC 4.0

The Role of Time, Weather and Google Trends in Understanding and Predicting Web Survey Response

Qixiang Fang (Utrecht University), Joep Burger (Statistics Netherlands), Ralph Meijers (Statistics Netherlands), Kees van Berkel (Statistics Netherlands)

Contact information: Qixiang Fang, Department of Methodology and Statistics, Faculty of Social and Behavioural Sciences, Sjoerd Groenmangebouw, Padualaan 14, 3584 CH Utrecht, The Netherlands (E-mail: q.fang@uu.nl)

In the literature about web survey methodology, significant efforts have been made to understand the role of time-invariant factors (e.g. gender, education and marital status) in (non-)response mechanisms. Time-invariant factors alone, however, cannot account for most variation in (non-)response, especially fluctuations of response rates over time. This observation inspires us to investigate the counterpart of time-invariant factors, namely time-varying factors, and the potential role they play in web survey (non-)response. Specifically, we study the effects of time, weather and societal trends (derived from Google Trends data) on the daily (non-)response patterns of the 2016 and 2017 Dutch Health Surveys. Using discrete-time survival analysis, we find, among others, that weekends, holidays, pleasant weather, disease outbreaks and terrorism salience are associated with fewer responses. Furthermore, we show that using these variables alone achieves satisfactory prediction accuracy of both daily and cumulative response rates when the trained model is applied to future unseen data. This approach has the further benefit of requiring only non-personal contextual information and thus involving no privacy issues. We discuss the implications of the study for survey research and data collection.

Keywords: online survey; response rates; weather; Google Trends; survival analysis

1 Introduction

Web surveys have become increasingly popular over the past decades. This growing popularity of web surveys, in comparison with their traditional counterparts (e.g. telephone and face-to-face interviews), can be attributed to unique advantages of web surveys such as shorter transmission time, lower costs and more flexible designs (Bethlehem, 2009). However, web surveys also suffer from various issues, the most prominent of which is lower response rates, which lead to compromised quality of the resulting survey data (i.e. sometimes more bias and always less precision) (e.g. Fan & Yan, 2010; Fowler, Cosenza, Cripps, Edgman-Levitan, & Cleary, 2019; Manfreda, Berzelak, Vehovar, Bosnjak, & Haas, 2008; Taylor & Scott, 2019). A potential solution is a mixed-mode design, which follows up on the web mode with other modes like telephone and face-to-face interviews (Wolf, Joye, Smith, & Fu, 2016). In this way, the total response rate becomes higher, the collected responses more representative and the resulting estimates less biased. Nevertheless, low response rates of the web mode in mixed-design surveys are still undesirable because of the associated increase in total expenses and planning effort. Therefore, in many ways, low response rates are a challenge for web surveys, necessitating research to understand the likely mechanisms underlying unit (non-)response decisions in web surveys (Fang & Wen, 2012).

The past decades have seen a plethora of research on influencing factors of web survey (non-)response. Most of these factors belong to the following three categories: respondent-related (e.g. age, income, education), region-related (e.g. degree of urbanisation, population density) and design-related (e.g. survey length, contact mode). For a quick (non-exhaustive) overview of existing findings in the web survey literature, see Table A1 in Appendix A.
These discoveries have greatly improved our understanding of the underlying mechanisms of response decisions in web surveys. However, they do not fully account for the variation in observed response behaviour. For instance, Erdman and Bates (2017) used the best twenty-five out of over three hundred such predictors in their study to predict block-level response rates, and yet the resulting model explains only about 56% of the variation in the response rates. It remains, therefore, necessary to investigate additional influencing factors.

One prominent characteristic of the three categories of factors described above is that they are time-invariant,
meaning that they tend to stay constant during a survey's data collection period. Even when they do vary in value, the change is unlikely to be substantial. The time-invariance of these well-studied factors (partially) explains why their use alone is insufficient to substantially explain variation in (non-)response patterns: they simply cannot properly account for temporal fluctuations of survey response rates, such as across days of the week (e.g. Faught, Whitten, & Green, 2004; Sauermann & Roach, 2013), months of the year (e.g. Losch et al., 2002; Svensson, Svensson, Hansen, & Lagerros, 2012) and different years (Sheehan, 2006). This leads us to look into the counterpart of time-invariant factors, namely time-varying factors such as day of the week, weather, presence of public holidays, disease outbreaks and salience of societal issues like privacy concerns, to name a few.

The lack of research on the roles of such time-varying factors not only limits our understanding of the underlying process of response decisions in web surveys, but also hinders survey design efforts aimed at increasing response rates. For instance, people may be unlikely to respond to surveys in holiday periods because they are not at home and/or do not want to spend time on surveys. Therefore, survey response rates might be higher or lower depending on time-related factors. This is also relevant to surveys that accept responses over a longer period of time. In such surveys, daily response rates usually peak on the first day(s) and quickly subside (Minato, 2015). This means that the number of responses during the first day(s) significantly influences the final response rates of the surveys, suggesting that effects of time-varying factors (if any) during the first few days might be crucial for a proper understanding of the final response rates.

Therefore, in this paper, we investigate whether and how various time-varying factors (including time, weather and societal trends) influence the daily (non-)response patterns of the web mode of the 2016 and 2017 Dutch Health Surveys, and whether these factors can be useful in predicting survey response rates.

We structure the remainder of the article as follows. We begin with a detailed account of time-varying (contextual) factors and our research proposal, followed by the research aims. Then, we describe the data, research methods and results, respectively. We conclude with a discussion of the theoretical and practical implications of the findings and recommendations for future research.

2 Time-Varying Contextual Factors

2.1 Definitions

In contrast to time-invariant factors, time-varying factors are variables whose values may differ over time (J. D. Singer & Willett, 2003). They record an observation's potentially differing status on each associated measurement occasion. Some have values that change naturally; others have values that change by design. For survey research, which has a typical fieldwork period ranging from a few days to several months, it is likely that some underlying factors of (non-)response behaviour vary substantially in value during the fieldwork period. Examples are: personal availability, individual emotional status, day of the week, weather, holidays, and societal sentiments about a topic. Of course, respondent- and region-related factors can also vary (e.g. gender, age, household composition, geographical features of the city of residence), but normally to a much lesser extent (especially during a typical survey project period), and such changes are unlikely to apply to the majority of the sample units. Therefore, we argue that time-varying factors can help to understand and predict survey responses, especially with regard to fluctuations in response rates over time, which time-invariant factors by definition cannot properly account for.

In this paper, we focus on time-varying factors that are also contextual factors. We define contextual factors as variables that usually cannot be influenced by study participants because they are determined by stochastic processes largely or totally external to them. Typical contextual factors assess potentially changing characteristics of the physical or social environment in which study participants live. Some of the aforementioned examples of time-varying factors, like weather, time and societal sentiments, fit this definition. The term, like others, is defined and used differently across disciplines. In survival analysis research, such variables are often called "ancillary factors" (Kalbfleisch & Prentice, 2002).
In this paper, we prefer to use the more intuitive term "contextual factors".

The focus on contextual factors only has two practical purposes. First, the data for contextual factors are usually non-personal, meaning that they do not involve privacy issues. This is important because recent increases in general privacy concerns among the public and the introduction of more stringent privacy regulations, such as the General Data Protection Regulation in the EU, have led to greater difficulties in obtaining, accessing and using personal data. Our attempt to model web survey response with only non-personal predictors, if proven successful, can be a useful alternative for the survey research community. Second, because contextual factors are (largely) free from the influence of study subjects, there is little or no concern about reverse causality and hence more internal validity for the study.

To sum up, in this paper, we focus on predicting and understanding web survey responses from time-varying contextual factors.

2.2 Literature Review

A significant amount of research in fields like cross-cultural psychology, epidemiology and family sociology has investigated the effects of time-varying contextual factors on individual outcomes. In contrast, far fewer studies in the field of survey research have attempted to do so, and even fewer
focus on web surveys. Therefore, for a more thorough understanding of the likely effects of time-varying contextual factors, we conducted a literature review on this topic covering not only web survey studies but also non-web survey ones.

The studies on time-varying contextual factors that we identified can be categorised into two types. The first type focuses on the effect of time, such as year, season, month and day of the week. For instance, Sheehan (2006) analysed 31 web surveys and concluded that the year in which a survey was published was the most important predictor of response rates. Losch et al. (2002) found that completing a survey interview during summer in Iowa (US) required more contact attempts than in other seasons. Similarly, Göritz (2014) documented for a German online panel that the panel members were more likely to start and finish studies in winter than during any other season. Contrasting these two findings, Svensson et al. (2012) found in a Swedish longitudinal online survey that the highest response rate was in September. Faught et al. (2004) noted in their experimental study on US manufacturers that survey response rates were highest when the email invitation was sent on either Wednesday morning or Tuesday afternoon. Contrary to this, Sauermann and Roach (2013), in their experiments conducted among US researchers, did not find the timing of the e-mail invitation (in terms of the day of the week) to result in significantly different response rates in a web survey. However, they did find that people were less likely to respond at the weekend and would postpone the response until the next week.

The second type of study concerns the influence of weather on survey participation. Potoski, Urbatsch, and Yu (2015) analysed eight surveys from 2001 to 2007 and showed that on unusually cold and warm days, wealthier people are more likely to participate in surveys than the less wealthy. Cunningham (1979) found more pleasant weather (e.g. more sunshine, higher temperature, lower humidity) to significantly improve a person's willingness to assist an interviewer.

The effect of weather on survey participation, however, likely goes beyond these two findings. Simonsohn (2010) showed that on cloudier days people are more likely to engage in academic activities, which share some common characteristics with survey participation (e.g. high cognitive load and low immediate returns). Therefore, it is likely that people on cloudy days may become more inclined to participate in surveys. Keller et al. (2005) showed that higher air pressure has a positive influence on mood. In turn, positive mood states may lead to increased helping behaviour (e.g. fulfilling a survey request) (Weyant, 1978). Therefore, air pressure may also impact survey response decisions.

2.3 Research Proposal

The studies above do confirm, to some degree, the influence of time-varying contextual factors on response decisions in surveys. However, some contradictory findings suggest a need for further research, for instance, on the influence of day of the week. In addition, it is also likely that the results of the existing studies are confined to local applications and that findings based on non-web surveys may not apply to web surveys. New time-varying contextual factors should also be researched. We propose the following time-varying contextual factors for our study.

Time. The first category of time-varying contextual factors is related to time, including day of the week and public holidays. The former is chosen because, according to the literature, its effect on survey response rates is still unclear, seemingly varying across surveys and thereby needing further research. The latter is chosen because we hypothesise that during holidays, people may travel, spend more time with family or want to rest, and are consequently less likely to participate in survey research.

Weather. The effect of weather on survey participation also requires more research. In this study, we include different types of daily weather measures (e.g. maximum, minimum and average) of temperature, sunshine, precipitation, wind, cloud, visibility, humidity and air pressure.

Societal Trends. In addition to time and weather, factors that relate to real-time societal trends, such as disease outbreaks, privacy concerns, public outdoor engagement (e.g. in festivals and on the road) and terrorism salience, may also play a role in survey participation decisions. These we term "societal trends" in this paper. We explain each of these societal trend factors next.

Disease Outbreaks. It is common knowledge that being sick (physically or psychologically) can alter individual behaviour. For instance, individuals who are sick may stay at home more and reduce outdoor or professional activities. They may lack the cognitive resources to engage in cognitively demanding activities (like survey participation). They may develop more negative emotions, which in turn may influence their pro-social behaviour. In particular, medical conditions such as the common cold, the flu, hay fever and depression are likely to affect a large number of individuals, especially during certain times of the year (e.g. cold, flu and depression in the winter; hay fever in the spring). Therefore, we hypothesise that disease outbreaks can have physical, behavioural and psychological consequences that in turn impact survey participation to a varying degree, depending on the type, severity and prevalence of the illness. We did not find any previous research on this topic, leading us to believe that our hypothesis is novel and worth studying.

Privacy Concerns. Research has shown that privacy concerns can deter individuals from survey participation. For instance, two studies on the 1990 and 2000 U.S. censuses find that an increase in concern about privacy and confidentiality issues is consistently associated with a decrease in the probability of census participation, especially among certain ethnic groups (e.g. E. Singer, Mathiowetz, & Couper,
1993; E. Singer, Van Hoewyk, & Neugebauer, 2003). Using paradata, two other studies report that greater privacy concerns are linked to higher unit or item non-response (Bates, Dahlhamer, & Singer, 2008; Dahlhamer, Simile, & Taylor, 2008). Given these findings, we hypothesise that the level of general societal concern about data privacy issues likely predicts survey response rates. Specifically, the higher the level of privacy concerns, the lower the response rates.

Public Outdoor Engagement. Considering the nature of some types of surveys (e.g. mailed surveys, digital surveys that require a desktop and are not smartphone-friendly), it is conceivable that people are unlikely (if not unable) to participate in surveys when they are engaged in outdoor activities (e.g. public events, holidays, travel), even if they have received the survey requests. This is evidenced by the previous finding that completing a survey interview during summer required more contact attempts than in other seasons because people are more likely to be travelling (Losch et al., 2002). Therefore, we hypothesise that the level of outdoor engagement during certain time periods may also affect survey response rates during that time. This can be especially true for the surveys that we study in this research, where survey invitations are mailed to the sample units' home addresses and the surveys are not suitable for smartphones (i.e. they require the use of laptops or tablets).

Terrorism Salience. A large body of literature has shown that terrorist events have impactful individual and societal consequences.
For instance, higher levels of terrorism salience or fears are linked to more negative emotions for non-religious people (Fischer, Greitemeyer, Kastenmüller, Jonas, & Frey, 2006), worse mental health (Fischer & Ai, 2008), more media consumption (Boyle et al., 2004; Lachlan, Spence, & Seeger, 2009), increased contact with family and friends (Goodwin, Willson, & Stanley, 2005), (irrational) travel behaviour (e.g. Baumert, de Obesso, & Valbuena, 2019; Gigerenzer, 2006), temporarily lower social trust (Geys & Qari, 2017) but more institutional trust (Dinesen & Jäger, 2013; S. J. Sinclair & LoCicero, 2010), fewer occupational networking activities (Kastenmüller et al., 2011), and the cancellation of sport events as well as a higher number of no-shows at sport events (Frevel & Schreyer, 2020). Furthermore, mortality salience (which can be induced by reports of deaths in terrorist events) has been shown to increase pro-social attitudes and behaviour (Jonas, Schimel, Greenberg, & Pyszczynski, 2002). These studies were conducted across various countries, related to different terrorist events, and with study participants either directly or indirectly impacted by terrorist events. These consistent findings lead us to reason that terrorism salience or fears can have a substantial impact on most members of a society, including potential survey respondents. In the specific context of survey research, the findings that terrorism salience or fears can induce emotional and behavioural changes in, for example, health status, media use, travel behaviour and interpersonal relationships suggest that survey participation (and consequently, survey response rates) can also be indirectly affected. Given all the potential consequences of higher terrorism salience (which likely take greater priority in one's life than survey participation), we tentatively hypothesise a negative effect of terrorism salience on survey response rates. Furthermore, note that the Netherlands (where the current study is based) and its nearby countries (such as Germany and France) have suffered from terrorist threats, attacks or related events and have issued terrorism warnings during the past years (e.g., see United States Department of State, 2017, 2018). This fact makes our inclusion of terrorism salience as a potential factor of survey response all the more relevant.

Summary. Some of the existing studies investigated effects of time-varying factors on a monthly or yearly scale. These findings are certainly interesting and informative; nevertheless, studying the effects of time-varying factors on a finer scale (e.g. weekly, daily) might provide survey researchers with even more helpful insights. In this project, we focus on the effects of daily time-varying factors on daily survey response. We do not consider the effects of months or seasons here, partly because we expect the daily time-varying variables we use to be able to capture any monthly and seasonal trends, and partly because there is not sufficient variation in our data to allow for reliable estimation of month and season effects that would generalise well to future unseen data.

To sum up, we propose the following daily time-varying factors for investigation in the current study: day of the week, public holidays, weather (i.e. temperature, sunshine, precipitation, wind, cloud, visibility, humidity and air pressure) and societal trends (i.e. disease outbreaks, data privacy concerns, public outdoor engagement and terrorism salience).

3 Research Aims

The current study is mainly concerned with whether and how daily time-varying contextual factors such as day of the week, holidays, weather and societal trends influence web survey response behaviour on a daily basis. To approach this question, we use discrete-time survival analysis to model the effects of predictors on the daily conditional odds of a person responding to a web survey (given that he/she has not responded yet). In this way, we obtain insight into how a specific factor influences the daily response decision of a person.

Furthermore, believing that a model (or a predictor) is generally more useful when it not only explains the current data but also generalises to future observations, we evaluate the trained models and the related predictors with regard to their predictive performance on an independent data set. This approach helps to answer, for instance, whether temperature is a better predictor than day of the week.
Table 1
Comparison of the 2016 and 2017 Dutch Health Survey (Web Mode)

                                       2016                                  2017
Expected delivery day of invitation    Fri. (Jan.-Jun.), Sat. (Jul.-Dec.)    Thur.
Expected delivery day of reminder      Fri. (Jan.-Jun.), Sat. (Jul.-Dec.)    Sat.
Sample size                            15,007                                16,972
Response rate                          34.8%                                 34.2%

[Figure 1: line chart of cumulative response rates (y-axis: cumulative response rate, roughly 0.1-0.4; x-axis: date, Feb 01 to Apr 01) for the first three web-mode DCPs, with the survey phases Invitation, Reminder 1 and Reminder 2 marked on each curve.]
Figure 1. Cumulative Response Rates of Three Web DCPs in the 2016 Dutch Health Survey

4 Data

4.1 The Dutch Health Surveys

In this study, we analyse the response decisions of individuals who were invited to participate in either the 2016 or the 2017 Dutch Health Survey. The Dutch Health Survey is a yearly survey administered by Statistics Netherlands. It aims to provide an overview of developments in the health, medical contacts, lifestyle and preventive behaviour of the Dutch population. The sampling frame comprises persons of all ages residing in private households. The survey uses a mixed-mode design, consisting of an initial web mode and follow-up telephone or face-to-face interviews in case of non-response in the web mode. Only the design and the data of the web mode are relevant to this study.

A yearly sample is divided into 12 cohorts, each corresponding to a data collection period (DCP). Each DCP starts with a web-mode survey, which lasts about a month. A web-mode DCP begins with the by-post delivery of an invitation letter that requests the sample unit to respond to the survey online using a desktop, laptop or tablet (but not a smartphone). In case of no response from the individual after about one week, up to two mailed reminder letters follow (at an interval of one week). The invitation and reminder letters contain a web link to the survey and a unique personalised password required for logging in to the survey. Each web-mode DCP ends roughly one month after the invitation letters.

Figure 1 illustrates the data collection process with the cumulative response rates of the first three web-mode DCPs of the 2016 Dutch Health Survey. The starting point of a new curve indicates receipt of a corresponding invitation letter. Each web-mode DCP ends at the end of the curve. We can see that response rates grow fastest in the first few days after an invitation or reminder letter and flatten quickly afterwards, suggesting that the first few days of data collection are crucial for ensuring a high response rate.

The Dutch Health Survey has a relatively consistent survey design over the years (up to and including 2017), thereby making comparison and integration of data from different years valid and simple. Table 1 summarises information about the expected delivery days of the letters, sample sizes and response rates of the 2016 and 2017 surveys (web mode). Note that in the first half of 2016, the letters were scheduled to arrive on Friday, and in the second half on Saturday. In contrast, the invitation and reminder letters in 2017 were expected to arrive on Thursday and Saturday, respectively. This variation in the expected arrival days of the letters (i.e. variation in the designs of the surveys) can increase the robustness of the study results, especially with regard to the effects of the day-of-the-week predictor.

While we acknowledge that there are different categories of non-response behaviour such as non-contact and refusal (e.g. Lynn and Clarke, 2002), and that there is research value in differentiating these sub-types of non-response behaviour, we treat all non-response sub-categories as one single "non-response" category in this study. There are two reasons. First, the focus of our study is on response versus non-response; in this sense, we are not interested in the sub-types of non-response behaviour.
Second, the number of non-contacts and refusals in our data is too low (non-contact rates < 0.4% and refusal rates < 2.2%) for the use of discrete-time survival models in this study, because the denominator of the hazard rates becomes so low that the models would have trouble with reliable estimation.
4.2 Weather Data

The Royal Netherlands Meteorological Institute (KNMI) records daily weather information on temperature, sunshine, precipitation, wind, cloud, visibility, humidity and air pressure across 47 stations in the Netherlands. We retrieved the 2016 and 2017 daily weather information (i.e. 20 variables in total) from the KNMI website (KNMI, 2019). The exact variables and the associated measures are summarised in Table A2 in Appendix A.

We averaged the obtained weather records across all stations, instead of assigning every sample case to the nearest weather station, for two reasons. First, considering the small size and geographical homogeneity of the Netherlands and the only small variation of the weather data across the stations, we did not see a strong benefit of assigning the closest weather station's records to the sample cases over simply assigning the average scores, especially when we also factor in the additional effort and potential matching mistakes associated with the matching approach. Second, averaging the weather measures across stations means that all sample cases, on a given day, are associated with exactly the same scores for any weather variable. This has the advantage that we can collapse the data set from the "Person-Period" data format into the "Period-Level" data format (see Section 5.2 for more information), which downsizes the data matrix and thus reduces model computation time (while yielding exactly the same model estimates). Given both considerations, we therefore decided to average the weather records across stations.
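To illustrate the averaging step, a minimal sketch in R (the language used for all analyses in this study) is given below. The data frame knmi and its column names are hypothetical stand-ins for the records retrieved from the KNMI website:

# knmi: one row per station per day; all columns other than
# 'date' and 'station' are numeric weather measures
measure_cols <- setdiff(names(knmi), c("date", "station"))
weather_daily <- aggregate(
  knmi[, measure_cols],
  by  = list(date = knmi$date),
  FUN = mean, na.rm = TRUE
)
# every sample case active on a given day is then assigned the
# same national average for each of the 20 weather variables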
4.3 Societal/Google Trends

The four societal trends of interest are disease outbreaks, privacy concerns, public outdoor engagement and terrorism salience. To our knowledge, there are currently no publicly available administrative or survey data on any of these four societal trends on a daily basis in the Netherlands. Measuring these trends, therefore, requires innovative solutions. Our solution of choice is to use Google Trends (GT) data to capture signs of these societal trends.

GT offers periodical summaries of user search data from 2004 onwards for many regions and for any possible search term. These summaries, available as indices, represent the number of Google searches that include a given search term in a specified period (e.g. day, week or month). The data are scaled for the requested period between 0 and 100, with 0 indicating no searches at all and 100 the highest search volume in that period. These indices, which represent the popularity of specific search terms, may offer relevant insights into various human activities in (almost) real time. Indeed, GT indices have been used for various purposes, such as real-time surveillance of disease outbreaks (Carneiro & Mylonakis, 2009), economic indicators (Choi & Varian, 2012) and the salience of immigration and terrorism (Mellon, 2014).

These successful applications of GT indices suggest the possibility of using GT to capture our four societal trends of interest. Table A2 in Appendix A lists the search terms we used to measure each trend. For disease outbreaks, we used the relevant commonly used Dutch terms concerning diseases like "flu", "cold" and "depression". For privacy concerns, we used terms such as "data leaks" and "hacking" as proxies. For outdoor engagement, we used terms indicating whether people are in a traffic jam or participating in festivals. Lastly, for terrorism salience, we used the term "terrorist". Note that we hypothesised these search terms prior to any analysis, rather than cherry-picking them from a long list sorted by correspondence with response rates, to reduce the problem of spurious correlations.

Like any other data, GT data also need to be checked with regard to their validity and reliability before use. A potential validity issue in GT concerns the fact that the search volume of a specific search term does not necessarily measure the intended phenomenon. A key reason is that a search term can bear multiple meanings. For instance, the Dutch term "AVG" can be short for both "Algemene verordening gegevensbescherming" (General Data Protection Regulation) and "AVG Technologies" (a security software company). According to Google Trends, the second meaning was used more often than the first in both 2016 and 2017 in the Netherlands. Therefore, GT indices may capture noise rather than the intended trends of interest. To mitigate this validity issue, we followed the advice of Zhu, Wu, Wang, and Qin (2012). The authors noted that GT offers a function to check the most correlated queries (i.e. search terms) and topics for any specific term entered. The validity of the terms can thus be manually assessed by checking whether the most correlated queries and topics correspond to the intended meaning of the search term. For instance, GT shows that the most correlated topic with the Dutch term "files" (meaning "traffic jam") was "traffic congestion" in both 2016 and 2017, indicating the relatively high construct validity of the term "files". Table A4 in Appendix A summarises all the correlated queries and topics of the GT search terms used in this study.

In addition, the resulting index scores from GT can vary substantially across different requested periods and inquiry attempts. This points to measurement reliability issues, which likely stem from two causes. First, Google Trends uses a simple random sample from the total search volume to calculate the index scores. Therefore, the resulting index scores are subject to high variability if the (unknown) sample size is small. Second, Google does not publish details about the underlying algorithms that calculate the scores, nor does Google make any changes in its algorithms publicly available. Therefore, the obtained index scores may change over time because of updates to the algorithms, even when the exact same search strategy is used. To overcome this issue, we used repeated sampling to enhance the measurement reliability of the GT data. By taking as many samples as possible
per date in the period of interest and averaging all the scores for each date over all repeated samples, one can obtain much more precise estimates. However, in this procedure a complication arises from the fact that a different sample can only be taken when a different period is requested. For instance, to obtain two different repeated samples for the date "2017-01-05", one needs to request two different periods such as "2017-01-04 to 2017-01-05" and "2017-01-05 to 2017-01-06", both covering the date of interest ("2017-01-05"). However, as GT scales the scores within the specified period to be between 0 and 100, when two different periods are requested, the reference values in the two requests can be different, leading to differently scaled scores and invalid comparisons of scores between the two periods. We overcame this issue by maximising the number of overlapping dates between every two consecutive periods and then calibrating the latter period to the previous one. In this way, the comparability between different samples is maximised. See Appendix E for a detailed description of the algorithm we created and used to calibrate the GT data.
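Appendix E describes the exact calibration algorithm; the sketch below only illustrates the core idea of rescaling one request window to its predecessor via their shared dates. All object and column names are hypothetical, and gt_request() stands in for whatever client is used to retrieve a daily GT series for one period:

# s1, s2: data frames with columns 'date' and 'score', from two
# consecutive request windows sharing as many dates as possible
calibrate_to_previous <- function(s1, s2) {
  # ratio of mean scores on the shared dates puts the later window
  # on the same (arbitrary) 0-100 scale as the earlier one
  k <- mean(s1$score[s1$date %in% s2$date]) /
       mean(s2$score[s2$date %in% s1$date])
  s2$score <- s2$score * k
  s2
}

# chained over consecutive windows, e.g.:
# s1 <- gt_request("2017-01-04", "2017-01-05")
# s2 <- gt_request("2017-01-05", "2017-01-06")
# s2 <- calibrate_to_previous(s1, s2)

After calibration, the repeated (now comparable) samples per date can simply be averaged to yield the final index.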
4.4 Overview of Variables

Appendix B summarises all the variables used in the study. Unless specified as categorical or dummy-coded, the variables are treated as continuous.

5 Methods

In this section we detail the analytic approaches used in the study. First, we introduce discrete-time survival analysis and, within this analytical framework, demonstrate how we used logistic regression to model the effects of time-varying factors on survey response in the web mode of the 2016 and 2017 Dutch Health Surveys. We then explain how we applied (adaptive) Lasso regularisation to the logistic regression, with the goal of enhancing model interpretability and predictive performance.

Following the modelling approaches mentioned above, we trained three models with the time-varying contextual predictors on the training data set, which consists of the complete 2016 Dutch Health Survey data and the first half of the 2017 data (i.e. 18 web-mode DCPs in total). The data are split in this way because it is important to have a good trade-off between enough data variation in the training data and sufficient independence of the test data from the training data.

Then, we applied the trained models to the test data set, which is the remaining 2017 data (i.e. the last six web-mode DCPs), and evaluated and compared their predictive performances. The three models are: the "baseline model", which includes only the baseline predictors (the number of "days" since the previous invitation or reminder letter and "survey phase") that are necessary for the specification of the intercept term in a discrete-time survival model (see Section 5.3); the "full model", which includes all the time-varying contextual predictors; and, lastly, the "interaction model", in which we allow the effects of the predictors to vary with time.

Appendix B provides descriptive statistics for all the variables used in the training and the test data sets, separately.

5.1 Discrete-Time Survival Analysis

The research questions and the features of the current data require a modelling framework capable of handling the following issues: first, the method should model the transition from non-response to response; second, it should incorporate both time-varying and time-fixed predictors; third, it should take care of the right-censoring issue in the data, where right censoring means that for some individuals the time when a transition (i.e. response) takes place is not observed during the survey's web mode.

These specific issues call for survival analysis. Survival analysis is a body of methods commonly used to analyse time-to-event data (J. D. Singer & Willett, 2008). The focus is on the modelling of transitions and the time it takes for a specific event to occur. The current research interest lies in modelling the transition from non-response to response over a period of time, making survival analysis the right analysis tool. Many survival analysis techniques (e.g. Cox regression) assume continuous measurement of time. In practice, however, data are often collected in discrete intervals, for instance, days, weeks or months. In that case, a sub-type of survival analysis is needed, namely discrete-time survival analysis. Given that our data are measured in daily intervals, discrete-time survival analysis is the appropriate choice.

There are further advantages to using discrete-time analysis in comparison with its continuous-time counterpart (Tutz & Schmid, 2016). For example, discrete-time analysis has no problem with ties (i.e. multiple events occurring at the same time point), and it can be embedded into the generalised linear model framework, as shown next.
5.2 The General Modelling Approach

The fundamental quantity used to assess the risk of event occurrence in a discrete-time period is the hazard. Denoted by h_is, the discrete-time hazard is the conditional probability that individual i will experience the target event in time period s, given that he or she did not experience it prior to time period s. In the context of this paper, this translates into the probability of person i responding to the survey during day s, given that the individual did not respond earlier. The value of the discrete-time hazard in time period s can be estimated as the ratio of the number of individuals who experience the target event (i.e. answering the survey) in time s to the number of individuals at risk of the event in time s.

A general representation of the hazard function that connects the hazard h_is to a linear predictor η is

    η = g(h_is) = γ_0s + x_is γ    (1)

where g(.) is a link function. It links the hazard to the linear predictor η = γ_0s + x_is γ, which contains the effects of the predictors for individual i in time period s. The intercept γ_0s is assumed to vary over time, whereas the parameter γ is fixed. Since hazards are probabilities restricted to the interval [0, 1], a natural, popular candidate for the response function g(.) is, among others, the logit link. The corresponding hazard function becomes

    h_is = exp(η) / (1 + exp(η))    (2)

Under this logistic model, the exponentiated value of a parameter estimate quantifies the difference in the conditional odds (instead of hazards) per unit difference in the predictor. The total negative log-likelihood of the model, assuming random (right) censoring, is given by

    −l ∝ Σ_{i=1}^{n} Σ_{s=1}^{t_i} [ y_is log(h_is) + (1 − y_is) log(1 − h_is) ]    (3)

where y_is = 1 if the target event occurs for individual i during time period s, and y_is = 0 otherwise; n refers to the total number of individuals, and t_i to the observed censored time for individual i. This negative log-likelihood is equivalent to that of a binary response model. The analogy allows us to use software designed for binary response models (e.g. binary logistic regression) for model estimation, with only one modification, namely that the number of binary observations in the discrete survival model depends on the observed censoring and lifetimes. Thus, the number of binary observations is Σ_{i=1}^{n} t_i. This requires the so-called Person-Period data format, where there is a separate row for each individual i for each period s ("day" in our case) in which the person is observed. In each row a variable indicates whether an event occurs. The event occurs in the last observed period unless the observation has been censored. Table 2 shows an exemplary Person-Period data set.

Table 2
Example of Person-Period Data

Person i    Event y_is    Time s    Covariate x_is
1           0             1         x_1,1
1           1             2         x_1,2
2           0             1         x_2,1
2           0             2         x_2,2
...         ...           ...       ...
2           0             20        x_2,20
3           0             1         x_3,1
...         ...           ...       ...

One may wonder whether the analysis of the multiple records in a Person-Period data set yields appropriate parameter estimates, standard errors and goodness-of-fit statistics, given that the multiple records for each person do not appear to be independent of each other. This, fortunately, is not an issue in discrete-time survival analysis, because the hazard function describes the conditional probability of event occurrence, where the conditioning depends on the individual surviving until each specific time period s and on his or her values for the substantive predictors in each time period (J. D. Singer & Willett, 2008). Therefore, the records in the Person-Period data need only be assumed conditionally independent.

Note that when a data set contains only time-varying predictors, and these predictors vary only with time and not across individuals, we can collapse the data set from the "Person-Period" format into the "Period-Level" format, in which each row represents a given time point, the scores of the time-varying predictors associated with that time point, the number of individuals experiencing the target event and the number of individuals at risk of the event at that time. This approach allows us to significantly downsize the data matrix (from 775,890 rows to only 808 rows in this study), and thus to reduce model computation time (from tens of hours to only minutes in this study), while obtaining exactly the same model estimates as with the original Person-Period data format. The only difference is that, instead of a binary logistic regression, we need a binomial logistic regression, which models count and proportion outcomes. Our data qualify for this transformation and thus we adopt it.
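As a concrete illustration, the discrete-time survival model can be fitted to such Period-Level data with a standard binomial logistic regression. The sketch below is a minimal example; the data frame period_level and the predictor names are hypothetical stand-ins for the variables described above:

# period_level: one row per calendar day within a DCP, with
#   n_event - number of sample units responding on that day
#   n_risk  - number of units still at risk on that day
fit <- glm(
  cbind(n_event, n_risk - n_event) ~ days + phase +
    temp_max + sunshine + gt_flu,   # illustrative predictor subset
  family = binomial(link = "logit"),
  data   = period_level
)
exp(coef(fit))  # exponentiated estimates: conditional odds ratios

A binary logistic regression fitted to the corresponding Person-Period data would return the same estimates, only far more slowly.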
5.3 Model Specification

An important consideration concerns the specification of the intercept γ_0s shown in Equation 1. γ_0s can be interpreted as a baseline hazard, which is present for any given set of covariates. The specification of γ_0s is very flexible, varying from giving all discrete time points their own parameters to specifying a single linear term. In the 2016 and 2017 Dutch Health Surveys, an individual becomes "at risk" (i.e. of responding to the survey) when he/she receives an invitation letter. Thus, each time point s can be conceptualised as a linear combination of the number of days since the expected delivery of the previous invitation or reminder letter (i.e. Days, a continuous variable) and the specific survey phase this time point is in (Survey Phase, a categorical variable with levels "Invitation", "Reminder 1" and "Reminder 2").
Note that we measure the Days variable as the number of days since the last letter (invitation or reminder) instead of the number of days since the invitation letter, because this removes the dependency between the Days and Survey Phase variables. For instance, "Day 2" together with "survey phase: Invitation" refers to the second day since the expected arrival day of the invitation letter, while "Day 5" in combination with "survey phase: Reminder 1" indicates that this is the fifth day since receiving the first reminder letter. Together, these two variables specify the baseline hazard rates for all sample cases on a given day.

With "Invitation" treated as the reference level of Survey Phase, the specification of γ_0s becomes

    γ_0s = γ_00 + γ_01 Days + γ_02 Reminder1 + γ_03 Reminder2    (4)

where Reminder1 = 1 if time period s is in the "Reminder 1" phase and Reminder1 = 0 otherwise; likewise, Reminder2 = 1 if s is in the "Reminder 2" phase and Reminder2 = 0 otherwise. Days remains untransformed, because common transformations of this variable (e.g. log, clog-log, square, cube, square root) do not lead to better model fit.
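To make Equations 2 and 4 concrete, the baseline hazard for, say, day 5 of the "Reminder 1" phase can be computed as follows. The γ values here are arbitrary placeholders chosen for illustration, not our fitted estimates:

g00 <- -3.0;  g01 <- -0.25   # hypothetical intercept and Days slope
g02 <-  0.20; g03 <- -0.20   # hypothetical reminder contrasts
days <- 5; reminder1 <- 1; reminder2 <- 0
eta    <- g00 + g01 * days + g02 * reminder1 + g03 * reminder2  # Equation 4
hazard <- exp(eta) / (1 + exp(eta))                             # Equation 2
hazard  # conditional probability of responding on this particular day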
Furthermore, having many predictors may result in an rithm has the disadvantage that its selection of variables can overfit model, because some of the predictors may be captur- be inconsistent. To solve this problem, Zou (2006) pro- ing noises rather than actual signals. Lastly, multicollinearity posed the adaptive Lasso, whose penalty term has the form λ pj=1 w j |γ j |, where w j are weights. He showed that for P can lead to inflated parameter variances and model depen- dency on the relationship among the highly correlated pre- appropriately chosen data-dependent weights, the adaptive dictors. lasso provides consistent variable selection. Following the One solution is a popular machine learning technique author’s advice, we used Ridge regularisation estimates as called Lasso regularisation. Initially proposed by Tibshi- weights. Note that Ridge regularisation, similar to Lasso, rani (1996), this technique is capable of performing vari- shrinks parameter estimates. However, unlike Lasso, Ridge able selection (while achieving good prediction) and is also regularisation does not reduce any estimate to exactly zero compatible with the generalised linear modelling framework and therefore does not perform variable selection. Second, in discrete-time survival analysis (Tutz & Schmid, 2016). when using Lasso, one usually assigns a less-than-full-rank Generally speaking, Lasso regularisation works by adding a dummy-coding procedure to a categorical variable, such that penalty term λ (λ ≥ 0) to the negative log-likelihood function all levels of the variable enter the model as separate variables. −l (Equation 3), which has the effect of shrinking parameter This allows Lasso to select what it considers to be appropri- estimates towards zero. By doing so, Lasso retains only a ate reference levels (i.e. the ones with a zero coefficient). small number of important variables (i.e. the ones that have Nevertheless, sometimes Lasso retains all levels of a cate- non-zero parameter estimates) in a model and thus results in gorical variable in the model. Without a reference category, a more parsimonious and interpretable model. Because this the interpretation of the parameter estimates of categorical variable selection procedure is automatic, we can also conve- variables becomes impossible. To avoid this problem, we niently avoid the use of traditional p-values and confidence pre-assigned a reference category to all the categorical vari- intervals to judge the relevance of a variable. In addition, ables (see Table A2 in Appendix A) before they entered the
5.5 Model Evaluation

As the focus of the study is on the influence of predictors on daily response hazards, it is necessary to evaluate the predictive performance of the models with regard to their prediction of the hazard rates when the models are applied to the test data set. For this purpose, we used the root mean squared error (RMSE) as the evaluation criterion, which quantifies the distance between the observed and the predicted daily hazards.

Using RMSE, we compared the predictive performance of what we call the "full model" (which includes all the time-varying predictors) to that of the "baseline model" (which includes only the baseline intercept predictors: "days" and "survey phase"). In the full model, where we enter all of the predictors without any interaction effects, we assume that the effects of the predictors do not vary with time. This modelling approach has the advantage that the model is more parsimonious and easier to interpret. However, in reality, the effect of a predictor may depend on time. To account for this possibility, we also built an "interaction model", in which we include interaction terms between the baseline predictors ("days" and "survey phase") and the time-varying contextual predictors, thereby allowing the effects of the model predictors to vary over time. Note that we do not interpret this model in terms of parameter estimates, because the resulting model contains non-zero interaction terms whose corresponding main effects are shrunk to zero; this makes the model's parameter estimates difficult to interpret. However, we can still compare the predictive performance of this interaction model with that of the other two models.

Furthermore, we plotted the predicted cumulative response rates of the three models against the observed ones.

5.6 Variable Importance

In addition to knowing whether the model as a whole predicts well, it is also helpful to know whether a specific predictor predicts well (i.e. so-called "variable importance"). Specifically, one can evaluate the importance of a variable by calculating the increase in the model's prediction error after permuting the variable (Molnar, 2018). Permuting the variable breaks the relationship between the variable and the true outcome. Thus, a variable is "important" if shuffling its values increases the model's prediction error, because in this case the model relies on the variable for better prediction. A variable is "unimportant" if permuting its values leaves the model error unchanged or even lowers it.

The permutation algorithm for assessing variable importance that we used is based on the work of Fisher, Rudin, and Dominici (2018). For an accurate estimate, we used 20 permutations for each variable and averaged the resulting variable importance scores. We used RMSE as the error measure in calculating variable importance.
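A minimal sketch of this permutation procedure, reusing the hypothetical Period-Level fit from Section 5.2 (with observed daily event counts n_event and at-risk counts n_risk; these column names are again assumptions), could look as follows:

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

perm_importance <- function(fit, data, var, n_perm = 20) {
  obs  <- data$n_event / data$n_risk                 # observed daily hazards
  base <- rmse(obs, predict(fit, newdata = data, type = "response"))
  incr <- replicate(n_perm, {
    d <- data
    d[[var]] <- sample(d[[var]])                     # break link with outcome
    rmse(obs, predict(fit, newdata = d, type = "response")) - base
  })
  mean(incr)  # average increase in RMSE: the importance score
}

# e.g. perm_importance(fit, test_period_level, "temp_max")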
5.7 Software

We conducted all the analyses in R (version 3.5.0) and RStudio (version 1.1.383). We used the package "glmnet" (Friedman, Hastie, & Tibshirani, 2010) for the implementation of the adaptive Lasso logistic regression.

6 Results

6.1 Model Estimates and Interpretation

Figure 2 shows the exponentiated standardised parameter estimates of the predictors retained by the full model. That is to say, these predictors are considered by the model to have non-zero coefficients and are thus important. The size of an estimate quantifies how much a one-standard-deviation change in the predictor impacts the conditional odds of survey response on a given day, under the condition that the person has not responded earlier.

As the figure suggests, both of the baseline predictors that define the model intercept are strong predictors of survey response. Specifically, the number of days since the previous invitation or reminder letter seems to have the largest effect on survey response. An exponentiated standardised coefficient of about 0.28 suggests that, assuming everything else stays constant, for a one-standard-deviation increase in the number of days (about 5.34 days), the odds of a person responding to the survey at that given time point are reduced by about 72%. Survey phase also turns out to be an important predictor: the first reminder letter increases the conditional odds of response by about 22% relative to the invitation phase, while the second reminder letter lowers the conditional odds by about 17%, assuming everything else stays constant. To avoid repetition, in the rest of the paper we interpret model estimates without restating the assumption that everything else holds constant.

Turning to the time-varying contextual predictors of interest: day of the week appears to be a very relevant predictor. In comparison with Monday, all non-Mondays lower the conditional odds of response. That is to say, Monday has the most positive effect on conditional response odds compared with all other days. Saturday shows the strongest negative effect: with an estimate of approximately 0.66, the conditional odds of a survey response on Saturday are about 34% lower than on Monday. The effect of Sunday on response odds is also negative, with an exponentiated estimate of 0.81. Therefore, we can safely conclude that weekends have a negative influence on survey response, while Monday has a positive one.

Holiday also appears to have a negative effect on response. With an exponentiated coefficient of 0.82, holidays reduce
the conditional response odds by about 18% compared with non-holidays.

[Figure 2, a dot plot, rendered here as a table.]
Figure 2. Exponentiated Standardised Estimates of Predictors in the Full Model

Predictor                                 Estimate
Days                                      0.28
Survey Phase: Reminder 1                  1.22
Survey Phase: Reminder 2                  0.826
Day of a Week: Tuesday                    0.911
Day of a Week: Wednesday                  0.763
Day of a Week: Thursday                   0.837
Day of a Week: Friday                     0.828
Day of a Week: Saturday                   0.661
Day of a Week: Sunday                     0.813
Holiday                                   0.819
Temperature (max.)                        0.953
Sunshine Duration                         0.979
Precipitation Volume                      1.006
Precipitation Volume (max. hr.)           1.01
Air Pressure (avg.)                       0.998
Visibility (max.)                         1.045
Cloudiness (avg.)                         1.015
Disease Outbreaks: Depression             0.961
Disease Outbreaks: Cold                   0.964
Public Outdoor Engagement: Traffic Jam    0.974
Terrorist Attacks                         0.979

The weather variables show smaller effects on survey response than the previous variables. Nevertheless, those with non-zero coefficients show a clear pattern: when the weather is nicer (e.g. higher temperature, longer sunshine duration, less rain, higher air pressure and less cloud), conditional response odds are lower. Specifically, for a one-SD change in these weather variables, conditional response odds change by at most about 5%. The only exception to this observed rule is the variable maximum visibility, which shows a clear positive effect on response.

Similar to the weather variables, the GT variables tend to have, if not zero, small effects. Among the variables intended to measure signs of disease outbreaks, "depression" and "cold" show negative effects on survey response, while the other indicators of disease outbreaks ("flu", "hay fever" and "influenza") are not retained by the model. The two terms concerning data privacy concerns, namely "data leak" and "hacking", have also been left out by the model. Between the two variables related to public outdoor engagement, "traffic jam" negatively predicts survey response, while "festival" has a zero coefficient.
Finally, "terrorist" also has a small negative influence on survey response. Note that interpreting the GT variables in terms of the sizes of their effects is difficult and can be misleading, because these variables are measured on somewhat arbitrary scales.

6.2 Model Performance

The RMSE scores of the three models ("baseline model", "full model" and "interaction model") are 0.005528, 0.005274 and 0.004738, respectively. This means that the inclusion of the time-varying contextual predictors in the full model improves the baseline model's predictive performance by 4.6%. In addition, allowing time-varying effects in the interaction model further reduces the prediction error by about 10% compared with the full model.

As some may argue that the weather and/or GT variables largely capture monthly or seasonal trends and can therefore be substituted by indicator variables representing month or season, we conducted additional analyses to test this argument. These show that replacing the weather and GT variables with either month or season indicators leads to poorer RMSE performance than the models that include the weather and GT variables. Specifically, using month as a replacement variable results in an RMSE score of 0.006553, while using season leads to a score of 0.005830. Both are much higher than those of the three models tested previously.

Figure 3 presents the predicted cumulative response rates of the three tested models across all survey phases in the test data set, against the observed cumulative response rates (indicated by unconnected asterisks).

In the invitation survey phase, the interaction model achieves the best prediction of cumulative response rates among the three models. Especially during the later stage of the invitation phase, the interaction model predicts cumulative response rates almost perfectly. Both the full model and