Temporal Poisson Square Root Graphical Models - Proceedings of ...
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Temporal Poisson Square Root Graphical Models Sinong Geng* 1 Zhaobin Kuang* 1 Peggy Peissig 2 David Page 1 Abstract lions of web users are usually mapped into various topics We propose temporal Poisson square root graphi- (e.g. travel, education, weather), and search engine providers cal models (TPSQRs), a generalization of Poisson are interested in the interplay among these search topics for square root graphical models (PSQRs) specifically a better understanding of user preferences (Gunawardana designed for modeling longitudinal event data. By et al., 2011). In health analytics, electronic health records estimating the temporal relationships for all possi- (EHRs) contain clinical encounter events from millions of ble pairs of event types, TPSQRs can offer a holis- patients collected over decades, including drug prescriptions, tic perspective about whether the occurrences of biomarkers, and condition diagnoses, among others. Unrav- any given event type could excite or inhibit any eling the relationships between different drugs and different other type. A TPSQR is learned by estimating conditions is vital to answering some of the most pressing a collection of interrelated PSQRs that share the medical and scientific questions such as drug-drug interac- same template parameterization. These PSQRs tion detection (Tatonetti et al., 2012), comorbidity identifica- are estimated jointly in a pseudo-likelihood fash- tion, adverse drug reaction (ADR) discovery (Simpson et al., ion, where Poisson pseudo-likelihood is used to 2013; Bao et al., 2017; Kuang et al., 2017b), computational approximate the original more computationally- drug repositioning (Kuang et al., 2016a;b), and precision intensive pseudo-likelihood problem stemming medicine (Liu et al., 2013; 2014a). from PSQRs. Theoretically, we demonstrate All these analytics challenges beg the statistical modeling that under mild assumptions, the Poisson pseudo- question: can we offer a comprehensive perspective about likelihood approximation is sparsistent for recov- the relationships between the occurrences of all possible ering the underlying PSQR. Empirically, we learn pairs of event types in longitudinal event data? In this paper, TPSQRs from Marshfield Clinic electronic health we propose a solution via temporal Poisson square root records (EHRs) with millions of drug prescrip- graphical models (TPSQRs), a generalization of Poisson tion and condition diagnosis events, for adverse square root graphical models (PSQRs, Inouye et al. 2016) drug reaction (ADR) detection. Experimental re- made in order to represent multivariate distributions among sults demonstrate that the learned TPSQRs can count variables evolving temporally in LED. recover ADR signals from the EHR effectively and efficiently. The reason why conventional undirected graphical models (UGMs) are not readily applicable to LED is the lack of mechanisms to address the temporality and irregularity in 1. Introduction the data. Conventional UGMs (Liu and Page, 2013; Liu et al., 2014b; Yang et al., 2015a; Liu et al., 2016; Kuang Longitudinal event data (LED) and the analytics challenges et al., 2017a; Geng et al., 2018) focus on estimating the co- therein are ubiquitous now. In business analytics, purchas- occurrence relationships among various variables rather than ing events of different items from millions of customers their temporal relationships, that is, how the occurrence of are collected, and retailers are interested in how a distinct one type of event may affect the future occurrence of another market action or the sales of one particular type of item type. Furthermore, existing temporal variants of UGMs could boost or hinder the sales of another type (Han et al., (Kolar et al., 2010; Yang et al., 2015b) usually assume that 2011). In search analytics, web search keywords from bil- data are regularly sampled, and observations for all variables are available at each time point. Neither assumption is true, Sinong Geng and Zhaobin Kuang contribute equally. Their names are listed in alphabetical order. 1 The University of Wis- due to the irregularity of LED. consin, Madison 2 Marshfield Clinic Research Institute. Correspon- In contrast to these existing UGM models, a TPSQR models dence to: Sinong Geng . temporal relationships, by data aggregation; a TPSQR ex- Proceedings of the 35 th International Conference on Machine tracts a sequence of time-stamped summary count statistics Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 of distinct event types that preserves the relative temporal by the author(s).
Temporal Poisson Square Root Graphical Models order in the raw data for each subject. A PSQR is then used ables evolving temporally in LED. TPSQR can accom- to model the joint distribution among these summary count modate both positive and negative dependencies among statistics for each subject. Different PSQRs for different covariates, and can be learned efficiently via the pseudo- subjects are assumed to share the same template parame- likelihood problem for PSQR. terization and hence can be learned jointly by estimating the template in a pseudo-likelihood fashion. To address the • In terms of advancing the state-of-the-art of PSQR estima- challenge in temporal irregularity, we compute the exact tion, we propose Poisson pseudo-likelihood approxima- time difference between each pair of time-stamped sum- tion in lieu of the original more computationally-intensive mary statistics, and decide whether a difference falls into conditional distribution induced by the joint distribution a particular predefined time interval, hence transforming of a PSQR. We show that under mild assumptions, the the irregular time differences into regular timespans. We Poisson pseudo-likelihood approximation procedure is then incorporate the effects of various timespans into the sparsistent (Ravikumar et al., 2007) with respect to the template parameterization as well as PSQR constructions underlying PSQR. Our theoretical results not only justify from the template. the use of the more efficient Poisson pseudo-likelihood over the original conditional distribution for better esti- By addressing temporality and irregularity of LED in this mation efficiency of PSQR but also establish a formal fashion, TPSQR is also different from many point process correspondence between the more intuitive but less strin- models (Gunawardana et al., 2011; Weiss et al., 2012; Weiss gent local Poisson graphical models (Allen and Liu, 2013) and Page, 2013; Du et al., 2016), which usually strive to and the more rigorous but less convenient PSQRs. pinpoint the exact occurrence times of events, and offer gen- erative mechanisms to event trajectories. TPSQR, on the • We apply TPSQR to Marshfield Clinic EHRs to deter- other hand, adopts a coarse resolution approach to temporal mine the relationships between the occurrences of various modeling via the aforementioned data aggregation and time drugs and the occurrences of various conditions, and of- interval construction. As a result, TPSQR focuses on esti- fer more accurate estimations for adverse drug reaction mating stable relationships among occurrences of different (ADR) discovery, a challenging task in health analytics event types, and does not model the precise event occurrence due to the (thankfully) rare and weak ADR signals en- timing. This behavior is especially meaningful in applica- coded in the data, whose success is crucial to improving tion settings such as ADR discovery, where the importance healthcare both financially and clinically (Sultana et al., of identifying the occurrence of an adverse condition caused 2013). by the prescription of a drug usually outweighs knowing about the exact time point of the occurrence of the ADR, due 2. Background to the high variance of the onset time of ADRs (Schuemie We show how to deal with the challenges in temporality et al., 2016). and irregularity mentioned in Section 1 via the use of data Since TPSQR is a generalization of PSQR, many desirable aggregation and an influence function for LED. We then properties of PSQR are inherited by TPSQR. For example, define the template parameterization that is central to the TPSQR, like PSQR, is capable of modeling both positive modeling of TPSQRs. and negative dependencies between covariates. Such flexi- bility cannot usually be taken for granted when modeling a 2.1. Longitudinal Event Data multivariate distribution over count data due to the potential dispersion of the partition function of a graphical model Longitudinal event data are time-stamped events of finitely (Yang et al., 2015a). TPSQR can be learned by solving many types collected across various subjects over time. Fig- the pseudo-likelihood problem for PSQR. For efficiency ure 1 visualizes the LED for two subjects. As shown in and scalability, we use Poisson pseudo-likelihood to ap- Figure 1, the occurrences of different event types are rep- proximately solve the original pseudo-likelihood problem resented as arrows in different colors. No two events for induced by a PSQR, and we show that the Poisson pseudo- one subject occur at the exact same time. We are interested likelihood approximation can recover the structure of the in modeling the relationships among the occurrences of underlying PSQR under mild assumptions. Finally, we different types of events via TPSQR. demonstrate the utility of TPSQRs using Marshfield Clinic EHRs with millions of drug prescription and condition di- 2.2. Data Aggregation agnosis events for the task of adverse drug reaction (ADR) To enable PSQRs to cope with the temporality in LED, detection. Our contributions are three-fold: TPSQRs start from extracting relative-temporal-order- preserved summary count statistics from the raw LED via • TPSQR is a generalization of PSQR made in order to data aggregation, to cope with the high volume and frequent represent the multivariate distributions among count vari- consecutive replications of events of the same type that are
Temporal Poisson Square Root Graphical Models Subject 1 Time/Day Subject 2 Time/Day Event Type Type 1 Type 2 Type 3 Figure 1: Visualization of longitudinal event data from two subjects. Curly brackets denote the timespans during which events of only one type occur. xij ’s represent the number of subsequent occurrences after the first occurrence. oij ’s are the types of events in various timespans. commonly observed in LED. Take Subject 1 in Figure 1 as processing. In detail, let L + 1 user-specified time-threshold an illustrative example; we divide the raw data of Subject values be given, where 0 = τ0 < τ1 < τ2 < · · · < τL . φ(τ ) 1 into four timespans by the dashed lines. Each of the four is a L × 1 one hot vector whose lth component is defined timespans contains only events of the same type. We use as: ( three statistics to summarize each timespan: the time stamp 1, τl−1 ≤ τ < τl of the first occurrence of the event in each timespan: t11 = 1, [φ(τ )]l := , (1) 0, otherwise t12 = 121, t13 = 231, and t14 = 361; the event type in each timespan: o11 = 1, o12 = 2, o13 = 3, and o14 = 1; where l ∈ {1, 2, · · · , L}. In our case, we let τ := tij 0 − tij and the counts of subsequent occurrences in each timespan: to construct φ(τ ) according to (1). Widely used influence x11 = 1, x12 = 1, x13 = 2, and x14 = 0. Note that the functions in signal processing include the dyadic wavelet reason x14 = 0 is that there is only one occurrence of event function and the Haar wavelet function (Mallat, 2008); both type 1 in total during timespan 4 of subject 1. Therefore, are piecewise constant and hence share similar representa- the number of subsequent occurrence after the first and only tion to (1). occurrence is 0. Let there be N independent subjects and p types of events 2.4. Template Parameterization in a given LED X. We denote by ni the number of Template parameterization provides the capability of TP- timespans during which only one type of event occurs SQRs to represent the effects of all possible (ordered) to subject i, where i ∈ {1, 2, · · · , N }. The j th times- pairs of event types on all time scales. Specifically, let pan of the ith subject can be represented by the vector 2 an ordered pair (k, k 0 ) ∈ {1, 2, · · · , p} be given. Let > sij := tij oij xij , where j ∈ {1, 2, · · · , ni }, and 0 = τ0 < τ1 < τ2 < · · · < τK also be given. For the “:=” represents “defined as.” tij ∈ [0, +∞) is the time ease of presentation, we assume that k 6= k 0 , which can stamp at which the first event occurs during the timespan sij . be easily generalized to k = k 0 . Considering a particular Furthermore, t11 < t12 < · · · < t1ni . oij ∈ {1, 2, · · · , p} patient, we are interested in knowing the effect of an occur- represents the event type in sij . Furthermore, oij 6= oi(j+1) , rence of a type k event towards a subsequent occurrence of ∀i ∈ {1, 2, · · · , N } and ∀j < ni . xij ∈ N is the number of a type k 0 event, when the time between the two occurrences subsequent occurrences of events of the same type in sij . falls in the lth time window specified via (1). Enumerating all L time windows, we have: 2.3. Influence Function > wkk0 := wkk0 1 wkk0 2 ··· wkk0 L . (2) Let sij and sij 0 be given, where j < j 0 ≤ ni . To handle the irregularity of the data, we map the time difference Note that since (k, k 0 ) is ordered, wk0 k is different from tij 0 − tij to a one-hot vector that represents the activation of wkk0 . We further define W as a (p − 1)p × L matrix that a time interval using an influence function φ(·), a common > stacks up all wkk 0 ’s. In this way, W includes all possible mechanism widely used in point process models and signal pairwise temporally bidirectional relationships among the
Temporal Poisson Square Root Graphical Models p variables on different time scales, offering holistic repre- template for constructing Θ(i) ’s, and hence provide a “tem- sentation power. To represent the intrinsic prevalence effect plate parameterization.” Since there are N subjects in total of the occurrences of events of various types, we further in the dataset, and each Θ(i) offers a personalized PSQR for > one subject, TPSQR is capable of learning a collection of define ω := ω1 ω2 · · · ωp . We call ω and W the template parameterization, from which we will generate the interrelated PSQRs due to the use of the template parame- parameters of various PSQRs as shown in Section 3. terization. Recall the well-rounded representation power of a template shown in Section 2.4; learning the template pa- rameterization via TPSQR can hence offer a comprehensive 3. Modeling perspective about the relationships for all possible tempo- Let sij ’s be given where j ∈ {1, 2, · · · , ni }; we demonstrate rally ordered pairs of event types. the use of the influence function and template parameteriza- Furthermore, since TPSQR is a generalization of PSQR, tion to construct a PSQR for subject i. it inherits many desirable properties enjoyed by PSQR. A > > most prominent property is its capability of accommodating Let ti := ti1 ti2 · · · tini , oi := oi1 oi2 · · · oini , > both positive and negative dependencies between variables. and xi := xi1 xi2 · · · xini . Given ti and oi , a Such flexibility in general cannot be taken for granted when TPSQR aims at modeling the joint distribution of counts xi modeling multivariate count data. For example, a Poisson using a PSQR. Specifically, under the template parameteriza- graphical model (Yang et al., 2015a) can only represent neg- tion ω and W, we first define a symmetric parameterization ative dependencies due to the diffusion of its log-partition Θ(i) using ti and oi . The component of Θ(i) at the j th row function when positive dependencies are involved. Yet for th and the j 0 column is: example one drug (e.g., the blood thinner Warfarin) can have a positive influence on some conditions (e.g., bleeding) ωoij , j = j0 and a negative influence on others (e.g., stroke). We refer (i) θjj 0 := [Θ(i) ]jj 0 := wo>ij oij0 φ(|tij 0 − tij |), j < j 0 . (3) interested readers to Allen and Liu 2013; Yang et al. 2013; (i) [Θ ]j 0 j , j > j0 Inouye et al. 2015; Yang et al. 2015a; Inouye et al. 2016 for more details of PSQRs and other related Poisson graphical We then can use Θ(i) to parameterize a PSQR that gives a models. joint distribution over xi as: 4. Estimation i −1 X " n nX ni i (i) √ (i) √ X P xi ; Θ(i) := exp θjj xij + θjj 0 xij xij 0 In this section, we present the pseudo-likelihood estimation j=1 j=1 j 0 >j problem for TPSQR. We then point out that solving this problem can be inefficient, which leads to the proposed ni # X (i) − log(xij ! ) − Ani Θ . Poisson pseudo-likelihood approximation to the original j=1 pseudo-likelihood problem. (4) 4.1. Pseudo-Likelihood for TPSQR In (4), Ani (Θ(i) ) is a normalization constant called the log-partition function that ensures the legitimacy of the We now present our estimation approach for TPSQR based probability distribution in question: on pseudo-likelihood. We start from considering the pseudo- likelihood for a given ith subject. By (4), the log probability of xij conditioned on xi,−j , which is an (ni − 1) × 1 vector " n i (i) √ X X Ani Θ(i) := log exp θjj xj constructed by removing the j th component from xi , is x∈Nni j=1 given as: (5) i −1 X nX ni ni # (i) √ X + θjj 0 xj xj 0 − log(xj ! ) . (i) j=1 j 0 >j j=1 logP xij | xi,−j ; θj = − log(xij ! )+ (6) Note that in (5) we emphasize the dependency of the parti- (i)> √ √ (i) (i) θjj + θj,−j xi,−j xij − Ãni θj , tion function upon the dimensionof x using the subscript ni , and x := x1 x1 · · · xni . (i) To model the joint distribution of xi , TPSQR directly uses where θj is the j th column of Θ(i) and hence Θ(i) , which is extracted from ω and W via (3) depend- ing on the individual and temporal irregularity of the data h i> (i) (i) (i) (i) (i) (i) characterized by ti and oi . Therefore, ω and W serve as a θj := θ1j · · · θj−1,j θjj θj+1,j · · · θni ,j
Temporal Poisson Square Root Graphical Models h i> (i) := θ1j (i) · · · θj−1,j (i) θjj (i) (i) θj,j+1 · · · θj,ni . where λ ≥ 0 is the regularization parameter, and the penalty (p−1)p L (7) X X kWk1,1 := |[W]ij | In (7), by the symmetry of Θ(i) , we rearrange the index after i=1 j=1 (i) θjj to ensure that the row index is no larger than the column is used to encourage sparsity over the template parame- index so that the parameterization is consistent with that terization W that determines the interactions between the in (4). We will adhere to this convention in the subsequent occurrences of two distinct event types. As mentioned at the (i) presentation. Furthermore, θj,−j is an (ni − 1) × 1 vector end of Section 4.1, TPSQR learning is equivalent to learn- (i) constructed from θj by excluding its j th component, and ing a PSQR over the template parameterization. Therefore, √ the sparsity penalty induced here is helpful to recover the xi,−j is constructed by taking the square root of each component of xi,−j . Finally, structure of the underlying graphical model. Ãni θj (i) := The major advantage of approximating the original pseudo- likelihood problem with Poisson pseudo-likelihood is the X (i) (i)> √ √ gain in computational efficiency. Based on the construc- log exp θjj + θj,−j xi,−j xij − log(xij ! ) , tion in Geng et al. 2017, (11) can be formulated as an x∈Rni (8) l1 -regularized Poisson regression problem, which can be solved much more efficiently via many sophisticated algo- which is a quantity that involves summing up infinitely many rithms and their implementations (Friedman et al., 2010; terms, and in general cannot be further simplified, leading Tibshirani et al., 2012) compared to solving the original to potential intractability in computing (8). problem that involves the potentially challenging computa- tion for (8). Furthermore, in the subsequent section, we will PN the conditional distribution in (6) and letting M := With show that even though the Poisson pseudo-likelihood is an i=1 ni , the pseudo-likelihood problem for TPSQR is given as: approximation procedure to the pseudo-likelihood of PSQR, under mild assumptions Poisson pseudo-likelihood is still N ni 1 XX (i) capable of recovering the structure of the underlying PSQR. max log P xij | xi,−j ; θj . (9) ω,W M i=1 j=1 4.3. Sparsistency Guarantee (9) is the maximization over all the conditional distributions of all the count variables for all N personalized PSQRs For the ease of presentation, in this section we will reuse generated by the template. Therefore, it can be viewed as much of the notation that appears previously. The redefini- a pseudo-likelihood estimation problem directly for ω and tions introduced in this section only apply to the contents in W. However, solving the pseudo-likelihood problem in this section and the related proofs in the Appendix. Recall (9) involves the predicament of computing the potentially at the end of Section 4.1, the pseudo-likelihood problem of intractable (8), which motivates us to use Poisson pseudo- TPSQR can be viewed as learning a PSQR parameterized likelihood as an approximation to (9). by the template. Therefore, without loss of generality, we will consider a PSQR over p count variables X = x ∈ Np 4.2. Poisson Pseudo-Likelihood parameterized by a p × p symmetric matrix Θ∗ , where > (i) X := X1 X2 · · · Xp is the multivariate random Using the parameter vector θj , we define the conditional variable, and x is an assignment to X. We use k·k∞ to distribution of xij given by xi,−j via the Poisson distribu- represent the infinity norm of a vector or a matrix. Let tion as: X := {x1 , x2 , · · · , xn } be a dataset with n independent and (i) identically distributed (i.i.d.) samples generated from the P̂ xij | xi,−j ; θj ∝ h i PSQR. Then the joint probability distribution over x is: (i) (i)> (i) (i)> exp θjj + θ−j xi,−j xij − exp θjj + θ−j xi,−j . " p p−1 Xp X √ ∗ √ X ∗ ∗ (10) P(x; Θ ):= exp θjj xij + θjj 0 xij xij 0 j=1 j=1 j 0 >j Notice the similarity between (6) and (10). We can de- p # fine the sparse Poisson pseudo-likelihood problem simi- X ∗ − log(xij ! ) − A (Θ ) , lar to the original pseudo-likelihood problem by replacing (i) (i) j=1 log P xij | xi,−j ; θj with log P̂ xij | xi,−j ; θj : where A (Θ∗ ) is the log-partition function, and the corre- N ni 1 XX sponding Poisson pseudo-likelihood problem is: (i) max log P̂ xij | xi,−j ; θj −λkWk1,1 , (11) ω,W M i=1 j=1 Θ̂ := arg min F (Θ) + λkΘk1,off , (12) Θ
Temporal Poisson Square Root Graphical Models I := (j, j 0 ) | θjj ∗ 0 0 where 0 = 0, j 6= j , j, j ∈ {1, 2, · · · , p} . n p 1 XXh > F (Θ) := −(θjj + θj,−j xi,−j )xij Let H := ∇2 F (Θ∗ ). Then for some 0 < α < 1 and C5 > n i=1 j=1 (13) 0, we have HIS H−1 −1 SS ∞ ≤ 1 − α and HSS ∞ ≤ C5 , where we use the index sets as subscripts to represent the i > + exp(θjj + θj,−j xi,−j ) , corresponding components of a vector or a matrix. and kΘk1,off represents imposing l1 penalty over all but the diagonal components of Θ. The final assumption characterizes the second-order Taylor expansion of F (Θ∗ ) at a certain direction ∆. Sparsistency (Ravikumar et al., 2007) addresses whether Θ̂ can recover the structure of the underlying Θ∗ with high Assumption 4. Let R(∆) be the second-order Taylor ex- probability using n i.i.d. samples. In what follows, we will pansion remainder of ∇F (Θ) around Θ = Θ∗ at direction show that Θ̂ is indeed sparsistent under mild assumptions. ∆ := Θ−Θ∗ (i.e. ∇F (Θ) = ∇F (Θ∗ )+∇2 F (Θ∗ )(Θ− Θ∗ ) + R(∆)), where k∆k∞ ≤ r := 4C5 λ ≤ C51C6 with We use E[·] to denote the expectation of a random variable ∆I = 0, and for some C6 > 0. Then kR(∆)k∞ ≤ under P(x; Θ∗ ). The first assumption is about the bound- 2 C6 k∆k∞ . edness of E[X], and the boundedness of the partial second order derivatives of a quantity related to the log-partition With these mild assumptions, the sparsistency result is stated A (Θ∗ ). This assumption is standard in the analysis of in Theorem 1. pseudo-likelihood methods (Yang et al., 2015a). Assumption 1. kE[X]k∞ ≤ C1 for some C1 > 0. Let Theorem 1. Suppose that Assumption 1 - 4 are all satisfied. Then, with probability of at least 1 − (exp (C1 + C2 /2) + 8) p−2 + p−1/C2 , Θ̂ shares the " p X X √ B(Θ, b) := log exp θjj xj + b> x same structure with Θ∗ , if for some constant C7 > 0, x∈Np j=1 p−1 p p # r X X √ X 8 2 log p + θjj 0 xj xj 0 − log(xj ! ) . λ≥ C3 (3 log p + log n) + (3 log p + log n) α n j=1 j 0 >j j=1 ! s r 2 log p log5 p For some C2 > 0, and ∀k ∈ [0, 1], + 8C4 C1 + α, λ ≤ C7 , n n ∂ 2 B(Θ, 0 + kej ) 2 ∀j ∈ {1, 2, · · · , p} , ≤ C2 , r ≤ kΘ∗S k∞ , and n ≥ 64C7 C52 C6 /α log5 p. ∂ 2 bj where ej is the one-hot vector with the j th component as 1. We defer the proof of Theorem 1 to the Appendix. Note that log5 p in Theorem 1 represents a higher sample complexity The following assumption characterizes the boundedness of compared to similar results in the analysis of Ising models the conditional distributions given by the PSQR under Θ∗ (Ravikumar et al., 2010). Such a higher sample complexity and by the Poisson approximation using the same Θ∗ . intuitively makes sense since the multivariate count vari- Assumption 2. Let λ∗ij := exp θjj ∗ ∗> + θj,−j xi,−j be the ables that we deal with are unbounded and usually heavy- mean parameter of a Poisson distribution. Then ∀i ∈ tailed, and we are also considering the Poisson pseudo- {1, 2 · · · , n} and ∀j ∈ {1, 2, · · · , p}, for some C3 > 0 likelihood approximation to the original pseudo-likelihood and C4 > 0, we have that E [Xj | xi,−j ] ≤ C3 and problem induced by PSQRs. The fact that Poisson pseudo- λ∗ij − E [Xj | xi,−j ] ≤ C4 . likelihood is a sparsistent procedure for learning PSQRs not only provides an efficient approach to learn PSQRs with The third assumption is the mutual incoherence condition strong theoretical guarantees, but also establishes a formal vital to the sparsistency of sparse statistical learning with correspondence between local Poisson graphical models l1 -regularization. Also, with a slight abuse of notation, in (LPGMs, Allen and Liu 2013) and PSQRs. This is because the remaining of Section 4.3 as well as in the corresponding Poisson pseudo-likelihood is also a sparsistent procedure proofs, we should view Θ as a vector generated by stacking for LPGMs. Compared to PSQRs, LPGMs are more in- up θjj 0 ’s, where j ≤ j 0 , whenever it is clear from context. tuitive yet less stringent theoretically due to the lack of a Assumption 3. Let Θ∗ be given. Define the index sets joint distribution defined by the model. Fortunately, with the guarantees in Theorem 1, we are able to provide some A := (j, j 0 ) | θjj ∗ 0 0 0 6= 0, j 6= j , j, j ∈ {1, 2, · · · , p} , reassurance for the use of LPGMs in terms of structure D := {(j, j) | j ∈ {1, 2, · · · , p}} , S := A ∪ D, recovery.
Temporal Poisson Square Root Graphical Models Methods 0.9 jects (e.g. patients in poorer health might tend to be more MSCCS likely to have a heart attack compared to a healthy person, 90 TPSQR 0.8 ● ● ● ● ● which might confound the effects of various drugs when Counts 0.7 AUC 60 identifying drugs that could cause heart attacks as an ADR). ● ● ● 0.6 ● ● ● ● ● Therefore, when using TPSQR, we will also introduce fixed ● 30 ● 0.5 effects to equip TPSQRs with the capability of addressing subject heterogeneity. Specifically, we consider learning a 0 0.4 ● 0.4 0.5 0.6 0.7 0.8 0.9 ● MSCCS TPSQR variant of (11): AUC Methods N ni 1 XX (i) (a) Overlapping Histograms (b) Boxplot max αioij +log P̂ xij | xi,−j ; θj −λkWk1,1 , α,ω,W M i=1 j=1 Figure 2: Overall performance of TPSQR and MSCCS measured by AUC among 300 different experimental con- where α is the fixed effect parameter vector constructed by figurations for each of the two methods. αioij ’s that depicts the belief that different patients could have different baseline risks of experiencing different types of events. 5. Adverse Drug Reaction Discovery To demonstrate the capability of TPSQRs to capture tem- 6. Experiments poral relationships between different pairs of event types In what follows, we will compare the performances of TP- in LED, we use ADR discovery from EHR as an example. SQR, MSCCS, and Hawkes process (Bao et al., 2017) in the ADR discovery is the task of finding unexpected and nega- OMOP task. The experiments are conducted using Marsh- tive incidents caused by drug prescriptions. In EHR, time- field Clinic EHRs with millions of drug prescription and stamped drug prescriptions as well as condition diagnoses condition diagnosis events from 200,000 patients. are collected from millions of patients. These prescriptions of different drugs and diagnoses of different conditions can 6.1. Experimental Configuration hence be viewed as various event types in LED. Therefore, using TPSQR, we can model whether the occurrences of a Minimum Duration: clinical encounter sequences from particular drug k could elevate the possibility of the future different patients might span across different time lengths. occurrences of a condition k 0 on different time scales by Some have decades of observations in their records while estimating wkk0 defined in (2). If an elevation is observed, other might have records only last a few days. We therefore we can consider the drug k as a potential candidate to cause consider minimum duration of the clinic encounter sequence condition k 0 as an adverse drug reaction. as a threshold to determine whether we admit a patient to the study or not. In our experiments, we consider two minimum Postmarketing ADR surveillance from EHR is a multi- duration thresholds: 0.5 year and 1 year. decade research and practice effort that is of utmost im- portance to the pharmacovigilance community (Bate et al., Maximum Time Difference: for TPSQR, in (1), τL de- 2018), with substantial financial and clinical implication for termines the maximum time difference between the occur- health care delivery (Sultana et al., 2013). Various ADR dis- rences of two events within which the former event might covery methods have been proposed over the years (Harpaz have nonzero influence on the latter event. We call τL the et al., 2012), and a benchmark task is created by the Obser- maximum time difference to characterize how distant in vational Medical Outcome Partnership (OMOP, Simpson the past we would like to take previous occurrences into 2011) to evaluate the ADR signal detection performance of consideration when modeling future occurrences. In our these methods. The OMOP task is to identify the ADRs experiments, we consider three maximum time differences: in 50 drug-condition pairs, coming from a selective combi- 0.5 year, 1 year, and 1.5 years. L = 3 and the correspond- nation of ten different drugs and nine different conditions. ing influence functions are chosen according to Bao et al. Among the 50 pairs, 9 of them are confirmed ADRs, while 2017. In MSCCS, a configuration named risk window serves the remaining 41 of them are negative controls. a similar purpose to the maximum difference in TPSQR. We choose three risk windows according to Kuang et al. 2017b A most successful ADR discovery method using EHR is so as to ensure that the both TPSQR and MSCCS have sim- the multiple self-controlled case series (MSCCS, Simpson ilar capability in considering the event history on various et al. 2013), which has been deployed in real-world ADR time scales. discovery related projects (Hripcsak et al., 2015). A rea- son for the success of MSCCS is its introduction of fixed Regularization Parameter: we use l1 -regularization for effects to address the heterogeneity among different sub- TPSQR since it encourages sparsity, and the sparsity pat-
Temporal Poisson Square Root Graphical Models all the components of wkk0 . For MSCCS, AUC is computed % Better TPSQRs 80 according to Kuang et al. 2017b. Figure 2 presents the his- togram of these two sets of 300 AUCs. The contrast in the Max Time Diff 70 0.5 Year performances between TPSQR and MSCCS is obvious. The 1 Year 1.5 Years distribution of TPSQR shifts substantially towards higher 60 AUC values compared to the distribution of MSCCS. There- fore, the overall performance of TPSQR is superior to that 50 of MSCCS in the OMOP task under various experimental 0.5 Year 1 Year configurations in question. As a matter of fact, the top per- Minimum Duration forming TPSQR model reaches an AUC of 0.91, as opposed Figure 3: Percentage of better TPSQR models under various to 0.77 for MSCCS. Furthermore, the majority of TPSQRs minimum duration and maximum time difference designs have higher AUCs even compared to the MSCCS model compared to the best MSCCS model that has the best AUC. We also contrast the performance of TPSQR with the Hawkes process method in Bao et al. 2017, whose best AUC is 0.84 under the same experiment configurations. 0.8 Max Time Diff 6.3. Sensitivity Analysis and Model Selection AUC 0.7 0.5 Year 1 Year To see how sensitive the performance of TPSQR is for differ- 1.5 Years 0.6 ent choices of experimental configurations, we compute the percentage of TPSQRs with a given minimum duration and 0.5 a given maximum time difference design that are better than 0.5 Year 1 Year the best MSCCS model (with an AUC of 0.77). The results Minimum Duration are summarized in Figure 3. As can be seen, the percentage Figure 4: AUC of TPSQR models selected by AIC for given of better TPSQRs is consistently above 80% under various minimum duration and maximum time difference designs scenarios, suggesting the robustness of TPSQRs to various experimental configurations. Given a fixed minimum du- ration and a fixed maximum time difference, we conduct model selection for TPQSRs by the Akaike information cri- terns learned correspond to the structures of the graphi- terion (AIC) over the regularization parameters. The AUC cal models. We use L2 -regularization for MSCCS since of the selected models are summarized in Figure 4. Note it yields outstanding empirical performance in previous that under various fixed minimum duration and maximum studies (Simpson et al., 2013; Kuang et al., 2017b). 50 time difference designs, AIC is capable of selecting models regularization parameters are chosen for both TPSQR and with high AUCs. In fact, all the models selected by AIC MSCCS. have higher AUCs than the best performer of MSCCS. This To sum up, there are 2 × 3 × 50 = 300 experimental con- phenomenon demonstrates that the performance of TPSQR figurations respectively for TPSQR and MSCCS. is consistent and robust with respect to the various choices of experimental configurations. 6.2. Overall Performance For each of the 300 experimental configurations for TPSQR 7. Conclusion and MSCCS, we perform the OMOP task using our EHR We propose TPSQRs, a generalization of PSQRs for the tem- data. Both TPSQR and MSCCS can be implemented by poral relationships between different event types in LED. the R package glmnet (Friedman et al., 2010; Tibshirani We propose the use of Poisson pseudo-likelihood approxi- et al., 2012). We then use Area Under the Curve (AUC) mation to solve the pseudo-likelihood problem arising from for the receiver operating characteristic curve to evaluate PSQRs. The approximation procedure is extremely efficient how well TPSQR and MSCCS can distinguish actual ADRs to solve, and is sparsistent in recovering the structure of the from negative controls under this particular experimental underlying PSQR. The utility of TPSQR is demonstrated configuration. The result is 300 AUCs corresponding to using Marshfield Clinic EHRs for adverse drug reaction the total number of experimental configurations for each of discovery. the two methods. For TPSQR, since the effect of drug k on condition k 0 is estimated over different time scales via Acknowledgement: The authors would like to gratefully wkk0 , the score corresponding to this drug-condition pair acknowledge the NIH BD2K Initiative grant U54 AI117924 used to calculate the AUC is computed by the average over and the NIGMS grant 2RO1 GM097618.
Temporal Poisson Square Root Graphical Models References Rijnbeek, et al. Observational health data sciences and informatics (ohdsi): opportunities for observational re- Genevera I Allen and Zhandong Liu. A local poisson graph- searchers. Studies in Health Technology and Informatics, ical model for inferring networks from sequencing data. 216:574, 2015. IEEE Transactions on Nanobioscience, 12(3):189–198, 2013. David Inouye, Pradeep Ravikumar, and Inderjit Dhillon. Square root graphical models: Multivariate generaliza- Yujia Bao, Zhaobin Kuang, Peggy Peissig, David Page, and tions of univariate exponential families that permit posi- Rebecca Willett. Hawkes process modeling of adverse tive dependencies. In International Conference on Ma- drug reactions with longitudinal observational data. In chine Learning, pages 2445–2453, 2016. Machine Learning for Healthcare Conference, pages 177– 190, 2017. David I Inouye, Pradeep K Ravikumar, and Inderjit S Dhillon. Fixed-length poisson MRF: Adding dependen- Andrew Bate, Robert F Reynolds, and Patrick Caubel. The cies to the multinomial. In Advances in Neural Informa- hope, hype and reality of big data for pharmacovigilance, tion Processing Systems, pages 3213–3221, 2015. 2018. Mladen Kolar, Le Song, Amr Ahmed, and Eric P Xing. Nan Du, Hanjun Dai, Rakshit Trivedi, Utkarsh Upadhyay, Estimating time-varying networks. The Annals of Applied Manuel Gomez-Rodriguez, and Le Song. Recurrent Statistics, pages 94–123, 2010. marked temporal point processes: Embedding event his- tory to vector. In Proceedings of the 22nd ACM SIGKDD Zhaobin Kuang, James Thomson, Michael Caldwell, Peggy International Conference on Knowledge Discovery and Peissig, Ron Stewart, and David Page. Baseline reg- Data Mining, pages 1555–1564. ACM, 2016. ularization for computational drug repositioning with longitudinal observational data. In Proceedings of the Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Reg- Twenty-Fifth International Joint Conference on Artificial ularization paths for generalized linear models via coor- Intelligence, pages 2521–2528. AAAI Press, 2016a. dinate descent. Journal of Statistical Software, 33(1):1, 2010. Zhaobin Kuang, James Thomson, Michael Caldwell, Peggy Sinong Geng, Zhaobin Kuang, and David Page. An Peissig, Ron Stewart, and David Page. Computational efficient pseudo-likelihood method for sparse binary drug repositioning using continuous self-controlled case pairwise Markov network estimation. arXiv preprint series. In Proceedings of the 22nd ACM SIGKDD Inter- arXiv:1702.08320, 2017. national Conference on Knowledge Discovery and Data Mining, pages 491–500. ACM, 2016b. Sinong Geng, Zhaobin Kuang, Jie Liu, Stephen Wright, and David Page. Stochastic learning for sparse discrete Zhaobin Kuang, Sinong Geng, and David Page. A screening Markov random fields with controlled gradient approxi- rule for l1-regularized ising model estimation. In Ad- mation error. In Proceedings of the Thirty-Fourth Con- vances in Neural Information Processing Systems, pages ference on Uncertainty in Artificial Intelligence ( 2018 ), 720–731, 2017a. 2018. Zhaobin Kuang, Peggy Peissig, Vitor Santos Costa, Richard Asela Gunawardana, Christopher Meek, and Puyang Xu. Maclin, and David Page. Pharmacovigilance via baseline A model for temporal dependencies in event streams. regularization with large-scale longitudinal observational In Advances in Neural Information Processing Systems, data. In Proceedings of the 23rd ACM SIGKDD Inter- pages 1962–1970, 2011. national Conference on Knowledge Discovery and Data Mining, pages 1537–1546. ACM, 2017b. Jiawei Han, Jian Pei, and Micheline Kamber. Data mining: concepts and techniques. Elsevier, 2011. Jie Liu and David Page. Bayesian estimation of latently- grouped parameters in undirected graphical models. In Rave Harpaz, William DuMouchel, Nigam H Shah, David Advances in Neural Information Processing Systems, Madigan, Patrick Ryan, and Carol Friedman. Novel data- pages 1232–1240, 2013. mining methodologies for adverse drug event discovery and analysis. Clinical Pharmacology & Therapeutics, 91 Jie Liu, David Page, Houssam Nassif, Jude Shavlik, Peggy (6):1010–1021, 2012. Peissig, Catherine McCarty, Adedayo A Onitilo, and Eliz- abeth Burnside. Genetic variants improve breast cancer George Hripcsak, Jon D Duke, Nigam H Shah, Chris- risk prediction on mammograms. In AMIA Annual Sym- tian G Reich, Vojtech Huser, Martijn J Schuemie, Marc A posium Proceedings, volume 2013, page 876. American Suchard, Rae Woong Park, Ian Chi Kei Wong, Peter R Medical Informatics Association, 2013.
Temporal Poisson Square Root Graphical Models Jie Liu, David Page, Peggy Peissig, Catherine McCarty, Robert Tibshirani, Jacob Bien, Jerome Friedman, Trevor Adedayo A Onitilo, Amy Trentham-Dietz, and Elizabeth Hastie, Noah Simon, Jonathan Taylor, and Ryan J Tibshi- Burnside. New genetic variants improve personalized rani. Strong rules for discarding predictors in lasso-type breast cancer diagnosis. AMIA Summits on Translational problems. Journal of the Royal Statistical Society: Series Science Proceedings, 2014:83, 2014a. B (Statistical Methodology), 74(2):245–266, 2012. Jie Liu, Chunming Zhang, Elizabeth Burnside, and David Martin J Wainwright. Sharp thresholds for high- Page. Learning heterogeneous hidden Markov random dimensional and noisy sparsity recovery using l1- fields. In Artificial Intelligence and Statistics, pages 576– constrained quadratic programming (lasso). IEEE Trans- 584, 2014b. actions on Information Theory, 55(5):2183–2202, 2009. Jie Liu, Chunming Zhang, David Page, et al. Multiple test- Jeremy Weiss, Sriraam Natarajan, and David Page. Mul- ing under dependence via graphical models. The Annals tiplicative forests for continuous-time processes. In Ad- of Applied Statistics, 10(3):1699–1724, 2016. vances in Neural Information Processing Systems, pages 458–466, 2012. Stephane Mallat. A wavelet tour of signal processing: the sparse way. Academic Press, 2008. Jeremy C Weiss and David Page. Forest-based point pro- cess for event prediction from electronic health records. James M Ortega and Werner C Rheinboldt. Iterative solution In Joint European Conference on Machine Learning of nonlinear equations in several variables. SIAM, 2000. and Knowledge Discovery in Databases, pages 547–562. Pradeep Ravikumar, Han Liu, John Lafferty, and Larry Springer, 2013. Wasserman. Spam: Sparse additive models. In Pro- Eunho Yang and Pradeep Ravikumar. On the use of vari- ceedings of the 20th International Conference on Neural ational inference for learning discrete graphical model. Information Processing Systems, pages 1201–1208. Cur- In International Conference on Machine Learning, pages ran Associates Inc., 2007. 1009–1016, 2011. Pradeep Ravikumar, Martin J Wainwright, John D Lafferty, Eunho Yang, Pradeep K Ravikumar, Genevera I Allen, and et al. High-dimensional ising model selection using 1- Zhandong Liu. On poisson graphical models. In Advances regularized logistic regression. The Annals of Statistics, in Neural Information Processing Systems, pages 1718– 38(3):1287–1319, 2010. 1726, 2013. Martijn J Schuemie, Gianluca Trifirò, Preciosa M Coloma, Eunho Yang, Pradeep Ravikumar, Genevera I Allen, and Patrick B Ryan, and David Madigan. Detecting adverse Zhandong Liu. Graphical models via univariate exponen- drug reactions following long-term exposure in longi- tial family distributions. Journal of Machine Learning tudinal observational data: The exposure-adjusted self- Research, 16(1):3813–3847, 2015a. controlled case series. Statistical Methods in Medical Research, 25(6):2577–2592, 2016. Sen Yang, Zhaosong Lu, Xiaotong Shen, Peter Wonka, and Jieping Ye. Fused multiple graphical lasso. SIAM Journal Shawn E Simpson. Self-controlled methods for postmarket- on Optimization, 25(2):916–943, 2015b. ing drug safety surveillance in large-scale longitudinal data. Columbia University, 2011. Shawn E Simpson, David Madigan, Ivan Zorych, Martijn J Schuemie, Patrick B Ryan, and Marc A Suchard. Mul- tiple self-controlled case series for large-scale longitudi- nal observational databases. Biometrics, 69(4):893–902, 2013. Janet Sultana, Paola Cutroneo, and Gianluca Trifirò. Clinical and economic burden of adverse drug reactions. Journal of Pharmacology & Pharmacotherapeutics, 4(Suppl1): S73, 2013. Nicholas P Tatonetti, P Ye Patrick, Roxana Daneshjou, and Russ B Altman. Data-driven prediction of drug effects and interactions. Science Translational Medicine, 4(125): 125ra31–125ra31, 2012.
You can also read