Temporal Poisson Square Root Graphical Models - Proceedings of ...

Page created by Jennifer Avila

Science

English

Like
Share
Embed
Fullscreen
Slides
Download HTML
Download PDF
Abuse

←

→

Page content transcription

If your browser does not render page correctly, please read the page content below

Temporal Poisson Square Root Graphical Models

                             Sinong Geng* 1 Zhaobin Kuang* 1 Peggy Peissig 2 David Page 1

                          Abstract                                   lions of web users are usually mapped into various topics
     We propose temporal Poisson square root graphi-                 (e.g. travel, education, weather), and search engine providers
     cal models (TPSQRs), a generalization of Poisson                are interested in the interplay among these search topics for
     square root graphical models (PSQRs) specifically               a better understanding of user preferences (Gunawardana
     designed for modeling longitudinal event data. By               et al., 2011). In health analytics, electronic health records
     estimating the temporal relationships for all possi-            (EHRs) contain clinical encounter events from millions of
     ble pairs of event types, TPSQRs can offer a holis-             patients collected over decades, including drug prescriptions,
     tic perspective about whether the occurrences of                biomarkers, and condition diagnoses, among others. Unrav-
     any given event type could excite or inhibit any                eling the relationships between different drugs and different
     other type. A TPSQR is learned by estimating                    conditions is vital to answering some of the most pressing
     a collection of interrelated PSQRs that share the               medical and scientific questions such as drug-drug interac-
     same template parameterization. These PSQRs                     tion detection (Tatonetti et al., 2012), comorbidity identifica-
     are estimated jointly in a pseudo-likelihood fash-              tion, adverse drug reaction (ADR) discovery (Simpson et al.,
     ion, where Poisson pseudo-likelihood is used to                 2013; Bao et al., 2017; Kuang et al., 2017b), computational
     approximate the original more computationally-                  drug repositioning (Kuang et al., 2016a;b), and precision
     intensive pseudo-likelihood problem stemming                    medicine (Liu et al., 2013; 2014a).
     from PSQRs. Theoretically, we demonstrate                       All these analytics challenges beg the statistical modeling
     that under mild assumptions, the Poisson pseudo-                question: can we offer a comprehensive perspective about
     likelihood approximation is sparsistent for recov-              the relationships between the occurrences of all possible
     ering the underlying PSQR. Empirically, we learn                pairs of event types in longitudinal event data? In this paper,
     TPSQRs from Marshfield Clinic electronic health                 we propose a solution via temporal Poisson square root
     records (EHRs) with millions of drug prescrip-                  graphical models (TPSQRs), a generalization of Poisson
     tion and condition diagnosis events, for adverse                square root graphical models (PSQRs, Inouye et al. 2016)
     drug reaction (ADR) detection. Experimental re-                 made in order to represent multivariate distributions among
     sults demonstrate that the learned TPSQRs can                   count variables evolving temporally in LED.
     recover ADR signals from the EHR effectively
     and efficiently.                                                The reason why conventional undirected graphical models
                                                                     (UGMs) are not readily applicable to LED is the lack of
                                                                     mechanisms to address the temporality and irregularity in
1. Introduction                                                      the data. Conventional UGMs (Liu and Page, 2013; Liu
                                                                     et al., 2014b; Yang et al., 2015a; Liu et al., 2016; Kuang
Longitudinal event data (LED) and the analytics challenges           et al., 2017a; Geng et al., 2018) focus on estimating the co-
therein are ubiquitous now. In business analytics, purchas-          occurrence relationships among various variables rather than
ing events of different items from millions of customers             their temporal relationships, that is, how the occurrence of
are collected, and retailers are interested in how a distinct        one type of event may affect the future occurrence of another
market action or the sales of one particular type of item            type. Furthermore, existing temporal variants of UGMs
could boost or hinder the sales of another type (Han et al.,         (Kolar et al., 2010; Yang et al., 2015b) usually assume that
2011). In search analytics, web search keywords from bil-            data are regularly sampled, and observations for all variables
                                                                     are available at each time point. Neither assumption is true,
   Sinong Geng and Zhaobin Kuang contribute equally. Their
names are listed in alphabetical order. 1 The University of Wis-     due to the irregularity of LED.
consin, Madison 2 Marshfield Clinic Research Institute. Correspon-   In contrast to these existing UGM models, a TPSQR models
dence to: Sinong Geng .
                                                                     temporal relationships, by data aggregation; a TPSQR ex-
Proceedings of the 35 th International Conference on Machine         tracts a sequence of time-stamped summary count statistics
Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018           of distinct event types that preserves the relative temporal
by the author(s).

Temporal Poisson Square Root Graphical Models

order in the raw data for each subject. A PSQR is then used ables evolving temporally in LED. TPSQR can accom-
to model the joint distribution among these summary count modate both positive and negative dependencies among
statistics for each subject. Different PSQRs for different covariates, and can be learned efficiently via the pseudo-
subjects are assumed to share the same template parame- likelihood problem for PSQR.
terization and hence can be learned jointly by estimating
the template in a pseudo-likelihood fashion. To address the • In terms of advancing the state-of-the-art of PSQR estima-
challenge in temporal irregularity, we compute the exact tion, we propose Poisson pseudo-likelihood approxima-
time difference between each pair of time-stamped sum- tion in lieu of the original more computationally-intensive
mary statistics, and decide whether a difference falls into conditional distribution induced by the joint distribution
a particular predefined time interval, hence transforming of a PSQR. We show that under mild assumptions, the
the irregular time differences into regular timespans. We Poisson pseudo-likelihood approximation procedure is
then incorporate the effects of various timespans into the sparsistent (Ravikumar et al., 2007) with respect to the
template parameterization as well as PSQR constructions underlying PSQR. Our theoretical results not only justify
from the template. the use of the more efficient Poisson pseudo-likelihood
over the original conditional distribution for better esti-
By addressing temporality and irregularity of LED in this mation efficiency of PSQR but also establish a formal
fashion, TPSQR is also different from many point process correspondence between the more intuitive but less strin-
models (Gunawardana et al., 2011; Weiss et al., 2012; Weiss gent local Poisson graphical models (Allen and Liu, 2013)
and Page, 2013; Du et al., 2016), which usually strive to and the more rigorous but less convenient PSQRs.
pinpoint the exact occurrence times of events, and offer gen-
erative mechanisms to event trajectories. TPSQR, on the • We apply TPSQR to Marshfield Clinic EHRs to deter-
other hand, adopts a coarse resolution approach to temporal mine the relationships between the occurrences of various
modeling via the aforementioned data aggregation and time drugs and the occurrences of various conditions, and of-
interval construction. As a result, TPSQR focuses on esti- fer more accurate estimations for adverse drug reaction
mating stable relationships among occurrences of different (ADR) discovery, a challenging task in health analytics
event types, and does not model the precise event occurrence due to the (thankfully) rare and weak ADR signals en-
timing. This behavior is especially meaningful in applica- coded in the data, whose success is crucial to improving
tion settings such as ADR discovery, where the importance healthcare both financially and clinically (Sultana et al.,
of identifying the occurrence of an adverse condition caused 2013).
by the prescription of a drug usually outweighs knowing
about the exact time point of the occurrence of the ADR, due 2. Background
to the high variance of the onset time of ADRs (Schuemie
We show how to deal with the challenges in temporality
et al., 2016).
and irregularity mentioned in Section 1 via the use of data
Since TPSQR is a generalization of PSQR, many desirable aggregation and an influence function for LED. We then
properties of PSQR are inherited by TPSQR. For example, define the template parameterization that is central to the
TPSQR, like PSQR, is capable of modeling both positive modeling of TPSQRs.
and negative dependencies between covariates. Such flexi-
bility cannot usually be taken for granted when modeling a 2.1. Longitudinal Event Data
multivariate distribution over count data due to the potential
dispersion of the partition function of a graphical model Longitudinal event data are time-stamped events of finitely
(Yang et al., 2015a). TPSQR can be learned by solving many types collected across various subjects over time. Fig-
the pseudo-likelihood problem for PSQR. For efficiency ure 1 visualizes the LED for two subjects. As shown in
and scalability, we use Poisson pseudo-likelihood to ap- Figure 1, the occurrences of different event types are rep-
proximately solve the original pseudo-likelihood problem resented as arrows in different colors. No two events for
induced by a PSQR, and we show that the Poisson pseudo- one subject occur at the exact same time. We are interested
likelihood approximation can recover the structure of the in modeling the relationships among the occurrences of
underlying PSQR under mild assumptions. Finally, we different types of events via TPSQR.
demonstrate the utility of TPSQRs using Marshfield Clinic
EHRs with millions of drug prescription and condition di- 2.2. Data Aggregation
agnosis events for the task of adverse drug reaction (ADR) To enable PSQRs to cope with the temporality in LED,
detection. Our contributions are three-fold: TPSQRs start from extracting relative-temporal-order-
preserved summary count statistics from the raw LED via
• TPSQR is a generalization of PSQR made in order to data aggregation, to cope with the high volume and frequent
represent the multivariate distributions among count vari- consecutive replications of events of the same type that are

Temporal Poisson Square Root Graphical Models

                                                                                                                  Subject 1

                                                                                                                  Time/Day

                                                                                                                  Subject 2

                                                                                                                  Time/Day

                                  Event Type           Type 1        Type 2         Type 3

Figure 1: Visualization of longitudinal event data from two subjects. Curly brackets denote the timespans during which
events of only one type occur. xij ’s represent the number of subsequent occurrences after the first occurrence. oij ’s are the
types of events in various timespans.

commonly observed in LED. Take Subject 1 in Figure 1 as             processing. In detail, let L + 1 user-specified time-threshold
an illustrative example; we divide the raw data of Subject          values be given, where 0 = τ0 < τ1 < τ2 < · · · < τL . φ(τ )
1 into four timespans by the dashed lines. Each of the four         is a L × 1 one hot vector whose lth component is defined
timespans contains only events of the same type. We use             as:                        (
three statistics to summarize each timespan: the time stamp                                     1, τl−1 ≤ τ < τl
of the first occurrence of the event in each timespan: t11 = 1,                  [φ(τ )]l :=                         ,         (1)
                                                                                                0, otherwise
t12 = 121, t13 = 231, and t14 = 361; the event type in
each timespan: o11 = 1, o12 = 2, o13 = 3, and o14 = 1;              where l ∈ {1, 2, · · · , L}. In our case, we let τ := tij 0 − tij
and the counts of subsequent occurrences in each timespan:          to construct φ(τ ) according to (1). Widely used influence
x11 = 1, x12 = 1, x13 = 2, and x14 = 0. Note that the               functions in signal processing include the dyadic wavelet
reason x14 = 0 is that there is only one occurrence of event        function and the Haar wavelet function (Mallat, 2008); both
type 1 in total during timespan 4 of subject 1. Therefore,          are piecewise constant and hence share similar representa-
the number of subsequent occurrence after the first and only        tion to (1).
occurrence is 0.
Let there be N independent subjects and p types of events           2.4. Template Parameterization
in a given LED X. We denote by ni the number of                     Template parameterization provides the capability of TP-
timespans during which only one type of event occurs                SQRs to represent the effects of all possible (ordered)
to subject i, where i ∈ {1, 2, · · · , N }. The j th times-         pairs of event types on all time scales. Specifically, let
pan of the ith subject can be represented by the vector                                                          2
                                                                    an ordered pair (k, k 0 ) ∈ {1, 2, · · · , p} be given. Let
                         >
sij := tij oij xij , where j ∈ {1, 2, · · · , ni }, and             0 = τ0 < τ1 < τ2 < · · · < τK also be given. For the
“:=” represents “defined as.” tij ∈ [0, +∞) is the time             ease of presentation, we assume that k 6= k 0 , which can
stamp at which the first event occurs during the timespan sij .     be easily generalized to k = k 0 . Considering a particular
Furthermore, t11 < t12 < · · · < t1ni . oij ∈ {1, 2, · · · , p}     patient, we are interested in knowing the effect of an occur-
represents the event type in sij . Furthermore, oij 6= oi(j+1) ,    rence of a type k event towards a subsequent occurrence of
∀i ∈ {1, 2, · · · , N } and ∀j < ni . xij ∈ N is the number of      a type k 0 event, when the time between the two occurrences
subsequent occurrences of events of the same type in sij .          falls in the lth time window specified via (1). Enumerating
                                                                    all L time windows, we have:
2.3. Influence Function                                                                                                 >
                                                                              wkk0 := wkk0 1     wkk0 2   ···   wkk0 L        .   (2)
Let sij and sij 0 be given, where j < j 0 ≤ ni . To handle
the irregularity of the data, we map the time difference            Note that since (k, k 0 ) is ordered, wk0 k is different from
tij 0 − tij to a one-hot vector that represents the activation of   wkk0 . We further define W as a (p − 1)p × L matrix that
a time interval using an influence function φ(·), a common                         >
                                                                    stacks up all wkk 0 ’s. In this way, W includes all possible
mechanism widely used in point process models and signal            pairwise temporally bidirectional relationships among the

Temporal Poisson Square Root Graphical Models

p variables on different time scales, offering holistic repre-                    template for constructing Θ(i) ’s, and hence provide a “tem-
sentation power. To represent the intrinsic prevalence effect                     plate parameterization.” Since there are N subjects in total
of the occurrences of events of various types, we further                         in the dataset, and each Θ(i) offers a personalized PSQR for
                                 >
                                                                                  one subject, TPSQR is capable of learning a collection of
             
define ω := ω1 ω2 · · · ωp . We call ω and W the
template parameterization, from which we will generate the                        interrelated PSQRs due to the use of the template parame-
parameters of various PSQRs as shown in Section 3.                                terization. Recall the well-rounded representation power of
                                                                                  a template shown in Section 2.4; learning the template pa-
                                                                                  rameterization via TPSQR can hence offer a comprehensive
3. Modeling                                                                       perspective about the relationships for all possible tempo-
Let sij ’s be given where j ∈ {1, 2, · · · , ni }; we demonstrate                 rally ordered pairs of event types.
the use of the influence function and template parameteriza-                      Furthermore, since TPSQR is a generalization of PSQR,
tion to construct a PSQR for subject i.                                           it inherits many desirable properties enjoyed by PSQR. A
                              >                                >              most prominent property is its capability of accommodating
Let ti := ti1 ti2 · · · tini , oi := oi1 oi2 · · · oini ,
                                       >                                        both positive and negative dependencies between variables.
and xi := xi1 xi2 · · · xini . Given ti and oi , a                                Such flexibility in general cannot be taken for granted when
TPSQR aims at modeling the joint distribution of counts xi                        modeling multivariate count data. For example, a Poisson
using a PSQR. Specifically, under the template parameteriza-                      graphical model (Yang et al., 2015a) can only represent neg-
tion ω and W, we first define a symmetric parameterization                        ative dependencies due to the diffusion of its log-partition
Θ(i) using ti and oi . The component of Θ(i) at the j th row                      function when positive dependencies are involved. Yet for
            th
and the j 0 column is:                                                            example one drug (e.g., the blood thinner Warfarin) can
                                                                                 have a positive influence on some conditions (e.g., bleeding)
                         ωoij ,
                                                       j = j0                    and a negative influence on others (e.g., stroke). We refer
   (i)
  θjj 0 := [Θ(i) ]jj 0 := wo>ij oij0 φ(|tij 0 − tij |), j < j 0 . (3)             interested readers to Allen and Liu 2013; Yang et al. 2013;
                         
                          (i)
                           [Θ ]j 0 j ,                  j > j0                    Inouye et al. 2015; Yang et al. 2015a; Inouye et al. 2016 for
                                                                                  more details of PSQRs and other related Poisson graphical
We then can use Θ(i) to parameterize a PSQR that gives a                          models.
joint distribution over xi as:
                                                                                  4. Estimation
                                      i −1 X
                    " n              nX    ni
                        i
                            (i) √              (i) √
                    X
P xi ; Θ(i) := exp         θjj xij +          θjj 0 xij xij 0                     In this section, we present the pseudo-likelihood estimation
                           j=1                     j=1 j 0 >j                     problem for TPSQR. We then point out that solving this
                                                                                  problem can be inefficient, which leads to the proposed
                    ni
                                                                #
                    X                                      
                                                      (i)
                −         log(xij ! ) − Ani Θ                       .             Poisson pseudo-likelihood approximation to the original
                    j=1                                                           pseudo-likelihood problem.
                                                                            (4)
                                                                                  4.1. Pseudo-Likelihood for TPSQR
In (4), Ani (Θ(i) ) is a normalization constant called the
log-partition function that ensures the legitimacy of the                         We now present our estimation approach for TPSQR based
probability distribution in question:                                             on pseudo-likelihood. We start from considering the pseudo-
                                                                                  likelihood for a given ith subject. By (4), the log probability
                                                                                  of xij conditioned on xi,−j , which is an (ni − 1) × 1 vector
                                   " n
                                       i
                                          (i) √
                         X          X
      Ani Θ(i) := log          exp       θjj xj                                   constructed by removing the j th component from xi , is
                                 x∈Nni         j=1                                given as:
                                                                            (5)
                 i −1 X
                nX    ni                           ni
                                                                        #
                                (i) √
                                                   X
            +                  θjj 0 xj xj 0   −         log(xj ! ) .                        
                                                                                                             (i)
                                                                                                                 
                j=1   j 0 >j                       j=1                                   logP xij | xi,−j ; θj     = − log(xij ! )+
                                                                                                                                              (6)
Note that in (5) we emphasize the dependency of the parti-                                          (i)> √          √                 
                                                                                             (i)                                   (i)
                                                                                            θjj + θj,−j xi,−j         xij − Ãni θj ,
tion function upon the dimensionof x using the subscript
ni , and x := x1 x1 · · · xni .
                                                                                          (i)
To model the joint distribution of xi , TPSQR directly uses                       where θj is the j th column of Θ(i) and hence
Θ(i) , which is extracted from ω and W via (3) depend-
ing on the individual and temporal irregularity of the data                              h                                                     i>
                                                                                     (i)   (i)        (i)          (i)    (i)           (i)
characterized by ti and oi . Therefore, ω and W serve as a                          θj := θ1j  · · · θj−1,j      θjj     θj+1,j · · · θni ,j

Temporal Poisson Square Root Graphical Models
         h                                               i>
           (i)
       := θ1j         (i)
               · · · θj−1,j
                                 (i)
                               θjj
                                        (i)          (i)
                                       θj,j+1 · · · θj,ni .       where λ ≥ 0 is the regularization parameter, and the penalty
                                                                                                   (p−1)p L
                                                           (7)                                      X X
                                                                                 kWk1,1 :=                      |[W]ij |
In (7), by the symmetry of Θ(i) , we rearrange the index after                                      i=1   j=1
 (i)
θjj to ensure that the row index is no larger than the column     is used to encourage sparsity over the template parame-
index so that the parameterization is consistent with that        terization W that determines the interactions between the
in (4). We will adhere to this convention in the subsequent       occurrences of two distinct event types. As mentioned at the
                              (i)
presentation. Furthermore, θj,−j is an (ni − 1) × 1 vector        end of Section 4.1, TPSQR learning is equivalent to learn-
                    (i)
constructed from θj by excluding its j th component, and          ing a PSQR over the template parameterization. Therefore,
√                                                                 the sparsity penalty induced here is helpful to recover the
  xi,−j is constructed by taking the square root of each
component of xi,−j . Finally,                                     structure of the underlying graphical model.
           
Ãni θj
        (i)
              :=                                                  The major advantage of approximating the original pseudo-
                                                                  likelihood problem with Poisson pseudo-likelihood is the
      X  (i)          (i)> √
                                   
                                     √
                                                        
                                                                  gain in computational efficiency. Based on the construc-
  log      exp θjj + θj,−j xi,−j       xij − log(xij ! ) ,
                                                                  tion in Geng et al. 2017, (11) can be formulated as an
     x∈Rni
                                                           (8)    l1 -regularized Poisson regression problem, which can be
                                                                  solved much more efficiently via many sophisticated algo-
which is a quantity that involves summing up infinitely many      rithms and their implementations (Friedman et al., 2010;
terms, and in general cannot be further simplified, leading       Tibshirani et al., 2012) compared to solving the original
to potential intractability in computing (8).                     problem that involves the potentially challenging computa-
                                                                  tion for (8). Furthermore, in the subsequent section, we will
PN the conditional distribution in (6) and letting M :=
With
                                                                  show that even though the Poisson pseudo-likelihood is an
   i=1 ni , the pseudo-likelihood problem for TPSQR is
given as:                                                         approximation procedure to the pseudo-likelihood of PSQR,
                                                                  under mild assumptions Poisson pseudo-likelihood is still
                N ni
             1 XX           
                                            (i)
                                                                 capable of recovering the structure of the underlying PSQR.
         max           log P xij | xi,−j ; θj .            (9)
         ω,W M
               i=1 j=1
                                                                  4.3. Sparsistency Guarantee
(9) is the maximization over all the conditional distributions
of all the count variables for all N personalized PSQRs           For the ease of presentation, in this section we will reuse
generated by the template. Therefore, it can be viewed as         much of the notation that appears previously. The redefini-
a pseudo-likelihood estimation problem directly for ω and         tions introduced in this section only apply to the contents in
W. However, solving the pseudo-likelihood problem in              this section and the related proofs in the Appendix. Recall
(9) involves the predicament of computing the potentially         at the end of Section 4.1, the pseudo-likelihood problem of
intractable (8), which motivates us to use Poisson pseudo-        TPSQR can be viewed as learning a PSQR parameterized
likelihood as an approximation to (9).                            by the template. Therefore, without loss of generality, we
                                                                  will consider a PSQR over p count variables X = x ∈ Np
4.2. Poisson Pseudo-Likelihood                                    parameterized by a p × p symmetric matrix Θ∗ , where
                                                                                                  >
                               (i)                                X := X1 X2 · · · Xp is the multivariate random
Using the parameter vector θj , we define the conditional         variable, and x is an assignment to X. We use k·k∞ to
distribution of xij given by xi,−j via the Poisson distribu-      represent the infinity norm of a vector or a matrix. Let
tion as:                                                          X := {x1 , x2 , · · · , xn } be a dataset with n independent and
                        
                     (i)                                          identically distributed (i.i.d.) samples generated from the
  P̂ xij | xi,−j ; θj      ∝
        h                                          i          PSQR. Then the joint probability distribution over x is:
           (i)     (i)>                 (i)   (i)>
    exp θjj + θ−j xi,−j xij − exp θjj + θ−j xi,−j .                                     " p                 p−1 Xp
                                                                                           X √
                                                                                                                      ∗ √
                                                                                                            X
                                                                            ∗                   ∗
                                                          (10)      P(x; Θ ):= exp             θjj   xij +           θjj 0 xij xij 0
                                                                                         j=1                j=1 j 0 >j
Notice the similarity between (6) and (10). We can de-
                                                                                         p
                                                                                                                         #
fine the sparse Poisson pseudo-likelihood problem simi-                                  X
                                                                                                                     ∗
                                                                                     −         log(xij ! ) − A (Θ ) ,
lar to the original pseudo-likelihood   problem by replacing
                       (i)
                           
                                                        (i)
                                                                                        j=1
log P xij | xi,−j ; θj       with log P̂ xij | xi,−j ; θj     :
                                                                  where A (Θ∗ ) is the log-partition function, and the corre-
          N ni
       1 XX                                                     sponding Poisson pseudo-likelihood problem is:
                                       (i)
 max             log P̂ xij | xi,−j ; θj −λkWk1,1 , (11)
 ω,W   M i=1 j=1                                                               Θ̂ := arg min F (Θ) + λkΘk1,off ,              (12)
                                                                                               Θ

Temporal Poisson Square Root Graphical Models

                                                                               I := (j, j 0 ) | θjj
                                                                                                 ∗               0      0
                                                                                   
where                                                                                               0 = 0, j 6= j , j, j ∈ {1, 2, · · · , p} .
                   n      p
            1 XXh              >
   F (Θ) :=           −(θjj + θj,−j xi,−j )xij                               Let H := ∇2 F (Θ∗ ). Then for some 0 < α < 1 and C5 >
            n i=1 j=1
                                                                      (13)   0, we have HIS H−1                           −1
                                                                                                SS ∞ ≤ 1 − α and HSS ∞ ≤ C5 ,
                                                                             where we use the index sets as subscripts to represent the
                                                                i
                                                 >
                               + exp(θjj +      θj,−j xi,−j )    ,
                                                                             corresponding components of a vector or a matrix.
and kΘk1,off represents imposing l1 penalty over all but the
diagonal components of Θ.                                                    The final assumption characterizes the second-order Taylor
                                                                             expansion of F (Θ∗ ) at a certain direction ∆.
Sparsistency (Ravikumar et al., 2007) addresses whether Θ̂
can recover the structure of the underlying Θ∗ with high                     Assumption 4. Let R(∆) be the second-order Taylor ex-
probability using n i.i.d. samples. In what follows, we will                 pansion remainder of ∇F (Θ) around Θ = Θ∗ at direction
show that Θ̂ is indeed sparsistent under mild assumptions.                   ∆ := Θ−Θ∗ (i.e. ∇F (Θ) = ∇F (Θ∗ )+∇2 F (Θ∗ )(Θ−
                                                                             Θ∗ ) + R(∆)), where k∆k∞ ≤ r := 4C5 λ ≤ C51C6 with
We use E[·] to denote the expectation of a random variable                   ∆I = 0, and for some C6 > 0. Then kR(∆)k∞ ≤
under P(x; Θ∗ ). The first assumption is about the bound-                            2
                                                                             C6 k∆k∞ .
edness of E[X], and the boundedness of the partial second
order derivatives of a quantity related to the log-partition                 With these mild assumptions, the sparsistency result is stated
A (Θ∗ ). This assumption is standard in the analysis of                      in Theorem 1.
pseudo-likelihood methods (Yang et al., 2015a).
Assumption 1. kE[X]k∞ ≤ C1 for some C1 > 0. Let                              Theorem 1. Suppose that Assumption 1 - 4 are all
                                                                             satisfied.   Then, with probability of at least 1 −
                                                                              (exp (C1 + C2 /2) + 8) p−2 + p−1/C2 , Θ̂ shares the
                      " p
                 X     X √
  B(Θ, b) := log   exp    θjj xj + b> x                                      same structure with Θ∗ , if for some constant C7 > 0,
                       x∈Np          j=1
                       p−1    p                       p
                                                                     #                                                         r
                       X      X         √             X                           8                                        2
                                                                                                                               log p
                   +                θjj 0 xj xj 0 −         log(xj ! ) .     λ≥     C3 (3 log p + log n) + (3 log p + log n)
                                                                                 α                                                n
                       j=1 j 0 >j                     j=1                                               !              s
                                                                                             r
                                                                                                2 log p                  log5 p
For some C2 > 0, and ∀k ∈ [0, 1],                                              + 8C4 C1 +                 α, λ ≤ C7             ,
                                                                                                   n                       n
                                     ∂ 2 B(Θ, 0 + kej )                                                             2
      ∀j ∈ {1, 2, · · · , p} ,                          ≤ C2 ,               r ≤ kΘ∗S k∞ , and n ≥ 64C7 C52 C6 /α log5 p.
                                            ∂ 2 bj

where ej is the one-hot vector with the j th component as 1.                 We defer the proof of Theorem 1 to the Appendix. Note that
                                                                             log5 p in Theorem 1 represents a higher sample complexity
The following assumption characterizes the boundedness of
                                                                             compared to similar results in the analysis of Ising models
the conditional distributions given by the PSQR under Θ∗
                                                                             (Ravikumar et al., 2010). Such a higher sample complexity
and by the Poisson approximation using the same Θ∗ .
                                                                             intuitively makes sense since the multivariate count vari-
Assumption 2. Let λ∗ij := exp θjj      ∗       ∗>
                                                       
                                          + θj,−j xi,−j be the               ables that we deal with are unbounded and usually heavy-
mean parameter of a Poisson distribution. Then ∀i ∈                          tailed, and we are also considering the Poisson pseudo-
{1, 2 · · · , n} and ∀j ∈ {1, 2, · · · , p}, for some C3 > 0                 likelihood approximation to the original pseudo-likelihood
and C4 > 0, we have that E [Xj | xi,−j ] ≤ C3 and                            problem induced by PSQRs. The fact that Poisson pseudo-
 λ∗ij − E [Xj | xi,−j ] ≤ C4 .                                               likelihood is a sparsistent procedure for learning PSQRs not
                                                                             only provides an efficient approach to learn PSQRs with
The third assumption is the mutual incoherence condition
                                                                             strong theoretical guarantees, but also establishes a formal
vital to the sparsistency of sparse statistical learning with
                                                                             correspondence between local Poisson graphical models
l1 -regularization. Also, with a slight abuse of notation, in
                                                                             (LPGMs, Allen and Liu 2013) and PSQRs. This is because
the remaining of Section 4.3 as well as in the corresponding
                                                                             Poisson pseudo-likelihood is also a sparsistent procedure
proofs, we should view Θ as a vector generated by stacking
                                                                             for LPGMs. Compared to PSQRs, LPGMs are more in-
up θjj 0 ’s, where j ≤ j 0 , whenever it is clear from context.
                                                                             tuitive yet less stringent theoretically due to the lack of a
Assumption 3. Let Θ∗ be given. Define the index sets                         joint distribution defined by the model. Fortunately, with
                                                                             the guarantees in Theorem 1, we are able to provide some
 A := (j, j 0 ) | θjj
                   ∗                0      0
     
                      0 6= 0, j 6= j , j, j ∈ {1, 2, · · · , p} ,
                                                                             reassurance for the use of LPGMs in terms of structure
    D := {(j, j) | j ∈ {1, 2, · · · , p}} ,           S := A ∪ D,            recovery.

Temporal Poisson Square Root Graphical Models

                            Methods                        0.9                 jects (e.g. patients in poorer health might tend to be more
                                MSCCS                                          likely to have a heart attack compared to a healthy person,
      90                        TPSQR                      0.8
                                                                   ●
                                                                   ●
                                                                   ●
                                                                   ●
                                                                   ●
                                                                               which might confound the effects of various drugs when
 Counts

                                                           0.7

                                                        AUC
      60                                                                       identifying drugs that could cause heart attacks as an ADR).
                                                                         ●
                                                                         ●
                                                                         ●
                                                           0.6           ●
                                                                         ●
                                                                         ●
                                                                         ●
                                                                         ●     Therefore, when using TPSQR, we will also introduce fixed
                                                                         ●

      30
                                                                         ●

                                                           0.5
                                                                               effects to equip TPSQRs with the capability of addressing
                                                                               subject heterogeneity. Specifically, we consider learning a
          0                                                0.4     ●

              0.4   0.5   0.6   0.7     0.8   0.9
                                                                   ●

                                                                 MSCCS TPSQR
                                                                               variant of (11):
                           AUC                                   Methods                N ni
                                                                                     1 XX                  
                                                                                                                           (i)
                                                                                                                              
              (a) Overlapping Histograms                      (b) Boxplot       max           αioij +log P̂ xij | xi,−j ; θj −λkWk1,1 ,
                                                                               α,ω,W M
                                                                                       i=1 j=1
Figure 2: Overall performance of TPSQR and MSCCS
measured by AUC among 300 different experimental con-                          where α is the fixed effect parameter vector constructed by
figurations for each of the two methods.                                       αioij ’s that depicts the belief that different patients could
                                                                               have different baseline risks of experiencing different types
                                                                               of events.
5. Adverse Drug Reaction Discovery
To demonstrate the capability of TPSQRs to capture tem-                        6. Experiments
poral relationships between different pairs of event types                     In what follows, we will compare the performances of TP-
in LED, we use ADR discovery from EHR as an example.                           SQR, MSCCS, and Hawkes process (Bao et al., 2017) in the
ADR discovery is the task of finding unexpected and nega-                      OMOP task. The experiments are conducted using Marsh-
tive incidents caused by drug prescriptions. In EHR, time-                     field Clinic EHRs with millions of drug prescription and
stamped drug prescriptions as well as condition diagnoses                      condition diagnosis events from 200,000 patients.
are collected from millions of patients. These prescriptions
of different drugs and diagnoses of different conditions can
                                                                               6.1. Experimental Configuration
hence be viewed as various event types in LED. Therefore,
using TPSQR, we can model whether the occurrences of a                         Minimum Duration: clinical encounter sequences from
particular drug k could elevate the possibility of the future                  different patients might span across different time lengths.
occurrences of a condition k 0 on different time scales by                     Some have decades of observations in their records while
estimating wkk0 defined in (2). If an elevation is observed,                   other might have records only last a few days. We therefore
we can consider the drug k as a potential candidate to cause                   consider minimum duration of the clinic encounter sequence
condition k 0 as an adverse drug reaction.                                     as a threshold to determine whether we admit a patient to the
                                                                               study or not. In our experiments, we consider two minimum
Postmarketing ADR surveillance from EHR is a multi-
                                                                               duration thresholds: 0.5 year and 1 year.
decade research and practice effort that is of utmost im-
portance to the pharmacovigilance community (Bate et al.,                      Maximum Time Difference: for TPSQR, in (1), τL de-
2018), with substantial financial and clinical implication for                 termines the maximum time difference between the occur-
health care delivery (Sultana et al., 2013). Various ADR dis-                  rences of two events within which the former event might
covery methods have been proposed over the years (Harpaz                       have nonzero influence on the latter event. We call τL the
et al., 2012), and a benchmark task is created by the Obser-                   maximum time difference to characterize how distant in
vational Medical Outcome Partnership (OMOP, Simpson                            the past we would like to take previous occurrences into
2011) to evaluate the ADR signal detection performance of                      consideration when modeling future occurrences. In our
these methods. The OMOP task is to identify the ADRs                           experiments, we consider three maximum time differences:
in 50 drug-condition pairs, coming from a selective combi-                     0.5 year, 1 year, and 1.5 years. L = 3 and the correspond-
nation of ten different drugs and nine different conditions.                   ing influence functions are chosen according to Bao et al.
Among the 50 pairs, 9 of them are confirmed ADRs, while                        2017. In MSCCS, a configuration named risk window serves
the remaining 41 of them are negative controls.                                a similar purpose to the maximum difference in TPSQR. We
                                                                               choose three risk windows according to Kuang et al. 2017b
A most successful ADR discovery method using EHR is
                                                                               so as to ensure that the both TPSQR and MSCCS have sim-
the multiple self-controlled case series (MSCCS, Simpson
                                                                               ilar capability in considering the event history on various
et al. 2013), which has been deployed in real-world ADR
                                                                               time scales.
discovery related projects (Hripcsak et al., 2015). A rea-
son for the success of MSCCS is its introduction of fixed                      Regularization Parameter: we use l1 -regularization for
effects to address the heterogeneity among different sub-                      TPSQR since it encourages sparsity, and the sparsity pat-

Temporal Poisson Square Root Graphical Models

                                                                     all the components of wkk0 . For MSCCS, AUC is computed
 % Better TPSQRs

               80
                                                                     according to Kuang et al. 2017b. Figure 2 presents the his-
                                                                     togram of these two sets of 300 AUCs. The contrast in the
                                                    Max Time Diff
               70                                     0.5 Year       performances between TPSQR and MSCCS is obvious. The
                                                      1 Year
                                                      1.5 Years
                                                                     distribution of TPSQR shifts substantially towards higher
               60                                                    AUC values compared to the distribution of MSCCS. There-
                                                                     fore, the overall performance of TPSQR is superior to that
               50                                                    of MSCCS in the OMOP task under various experimental
                     0.5 Year       1 Year                           configurations in question. As a matter of fact, the top per-
                       Minimum Duration
                                                                     forming TPSQR model reaches an AUC of 0.91, as opposed
Figure 3: Percentage of better TPSQR models under various            to 0.77 for MSCCS. Furthermore, the majority of TPSQRs
minimum duration and maximum time difference designs                 have higher AUCs even compared to the MSCCS model
compared to the best MSCCS model                                     that has the best AUC. We also contrast the performance
                                                                     of TPSQR with the Hawkes process method in Bao et al.
                                                                     2017, whose best AUC is 0.84 under the same experiment
                                                                     configurations.
               0.8

                                                    Max Time Diff    6.3. Sensitivity Analysis and Model Selection
 AUC

               0.7                                    0.5 Year
                                                      1 Year         To see how sensitive the performance of TPSQR is for differ-
                                                      1.5 Years
               0.6
                                                                     ent choices of experimental configurations, we compute the
                                                                     percentage of TPSQRs with a given minimum duration and
               0.5                                                   a given maximum time difference design that are better than
                     0.5 Year       1 Year                           the best MSCCS model (with an AUC of 0.77). The results
                       Minimum Duration
                                                                     are summarized in Figure 3. As can be seen, the percentage
Figure 4: AUC of TPSQR models selected by AIC for given              of better TPSQRs is consistently above 80% under various
minimum duration and maximum time difference designs                 scenarios, suggesting the robustness of TPSQRs to various
                                                                     experimental configurations. Given a fixed minimum du-
                                                                     ration and a fixed maximum time difference, we conduct
                                                                     model selection for TPQSRs by the Akaike information cri-
terns learned correspond to the structures of the graphi-
                                                                     terion (AIC) over the regularization parameters. The AUC
cal models. We use L2 -regularization for MSCCS since
                                                                     of the selected models are summarized in Figure 4. Note
it yields outstanding empirical performance in previous
                                                                     that under various fixed minimum duration and maximum
studies (Simpson et al., 2013; Kuang et al., 2017b). 50
                                                                     time difference designs, AIC is capable of selecting models
regularization parameters are chosen for both TPSQR and
                                                                     with high AUCs. In fact, all the models selected by AIC
MSCCS.
                                                                     have higher AUCs than the best performer of MSCCS. This
To sum up, there are 2 × 3 × 50 = 300 experimental con-              phenomenon demonstrates that the performance of TPSQR
figurations respectively for TPSQR and MSCCS.                        is consistent and robust with respect to the various choices
                                                                     of experimental configurations.
6.2. Overall Performance
For each of the 300 experimental configurations for TPSQR            7. Conclusion
and MSCCS, we perform the OMOP task using our EHR                    We propose TPSQRs, a generalization of PSQRs for the tem-
data. Both TPSQR and MSCCS can be implemented by                     poral relationships between different event types in LED.
the R package glmnet (Friedman et al., 2010; Tibshirani              We propose the use of Poisson pseudo-likelihood approxi-
et al., 2012). We then use Area Under the Curve (AUC)                mation to solve the pseudo-likelihood problem arising from
for the receiver operating characteristic curve to evaluate          PSQRs. The approximation procedure is extremely efficient
how well TPSQR and MSCCS can distinguish actual ADRs                 to solve, and is sparsistent in recovering the structure of the
from negative controls under this particular experimental            underlying PSQR. The utility of TPSQR is demonstrated
configuration. The result is 300 AUCs corresponding to               using Marshfield Clinic EHRs for adverse drug reaction
the total number of experimental configurations for each of          discovery.
the two methods. For TPSQR, since the effect of drug k
on condition k 0 is estimated over different time scales via         Acknowledgement: The authors would like to gratefully
wkk0 , the score corresponding to this drug-condition pair           acknowledge the NIH BD2K Initiative grant U54 AI117924
used to calculate the AUC is computed by the average over            and the NIGMS grant 2RO1 GM097618.

Temporal Poisson Square Root Graphical Models

References Rijnbeek, et al. Observational health data sciences and
informatics (ohdsi): opportunities for observational re-
Genevera I Allen and Zhandong Liu. A local poisson graph-
searchers. Studies in Health Technology and Informatics,
ical model for inferring networks from sequencing data.
216:574, 2015.
IEEE Transactions on Nanobioscience, 12(3):189–198,
2013. David Inouye, Pradeep Ravikumar, and Inderjit Dhillon.
Square root graphical models: Multivariate generaliza-
Yujia Bao, Zhaobin Kuang, Peggy Peissig, David Page, and
tions of univariate exponential families that permit posi-
Rebecca Willett. Hawkes process modeling of adverse
tive dependencies. In International Conference on Ma-
drug reactions with longitudinal observational data. In
chine Learning, pages 2445–2453, 2016.
Machine Learning for Healthcare Conference, pages 177–
190, 2017. David I Inouye, Pradeep K Ravikumar, and Inderjit S
Dhillon. Fixed-length poisson MRF: Adding dependen-
Andrew Bate, Robert F Reynolds, and Patrick Caubel. The
cies to the multinomial. In Advances in Neural Informa-
hope, hype and reality of big data for pharmacovigilance,
tion Processing Systems, pages 3213–3221, 2015.
2018.
Mladen Kolar, Le Song, Amr Ahmed, and Eric P Xing.
Nan Du, Hanjun Dai, Rakshit Trivedi, Utkarsh Upadhyay,
Estimating time-varying networks. The Annals of Applied
Manuel Gomez-Rodriguez, and Le Song. Recurrent
Statistics, pages 94–123, 2010.
marked temporal point processes: Embedding event his-
tory to vector. In Proceedings of the 22nd ACM SIGKDD Zhaobin Kuang, James Thomson, Michael Caldwell, Peggy
International Conference on Knowledge Discovery and Peissig, Ron Stewart, and David Page. Baseline reg-
Data Mining, pages 1555–1564. ACM, 2016. ularization for computational drug repositioning with
longitudinal observational data. In Proceedings of the
Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Reg-
Twenty-Fifth International Joint Conference on Artificial
ularization paths for generalized linear models via coor-
Intelligence, pages 2521–2528. AAAI Press, 2016a.
dinate descent. Journal of Statistical Software, 33(1):1,
2010. Zhaobin Kuang, James Thomson, Michael Caldwell, Peggy
Sinong Geng, Zhaobin Kuang, and David Page. An Peissig, Ron Stewart, and David Page. Computational
efficient pseudo-likelihood method for sparse binary drug repositioning using continuous self-controlled case
pairwise Markov network estimation. arXiv preprint series. In Proceedings of the 22nd ACM SIGKDD Inter-
arXiv:1702.08320, 2017. national Conference on Knowledge Discovery and Data
Mining, pages 491–500. ACM, 2016b.
Sinong Geng, Zhaobin Kuang, Jie Liu, Stephen Wright,
and David Page. Stochastic learning for sparse discrete Zhaobin Kuang, Sinong Geng, and David Page. A screening
Markov random fields with controlled gradient approxi- rule for l1-regularized ising model estimation. In Ad-
mation error. In Proceedings of the Thirty-Fourth Con- vances in Neural Information Processing Systems, pages
ference on Uncertainty in Artificial Intelligence ( 2018 ), 720–731, 2017a.
2018. Zhaobin Kuang, Peggy Peissig, Vitor Santos Costa, Richard
Asela Gunawardana, Christopher Meek, and Puyang Xu. Maclin, and David Page. Pharmacovigilance via baseline
A model for temporal dependencies in event streams. regularization with large-scale longitudinal observational
In Advances in Neural Information Processing Systems, data. In Proceedings of the 23rd ACM SIGKDD Inter-
pages 1962–1970, 2011. national Conference on Knowledge Discovery and Data
Mining, pages 1537–1546. ACM, 2017b.
Jiawei Han, Jian Pei, and Micheline Kamber. Data mining:
concepts and techniques. Elsevier, 2011. Jie Liu and David Page. Bayesian estimation of latently-
grouped parameters in undirected graphical models. In
Rave Harpaz, William DuMouchel, Nigam H Shah, David Advances in Neural Information Processing Systems,
Madigan, Patrick Ryan, and Carol Friedman. Novel data- pages 1232–1240, 2013.
mining methodologies for adverse drug event discovery
and analysis. Clinical Pharmacology & Therapeutics, 91 Jie Liu, David Page, Houssam Nassif, Jude Shavlik, Peggy
(6):1010–1021, 2012. Peissig, Catherine McCarty, Adedayo A Onitilo, and Eliz-
abeth Burnside. Genetic variants improve breast cancer
George Hripcsak, Jon D Duke, Nigam H Shah, Chris- risk prediction on mammograms. In AMIA Annual Sym-
tian G Reich, Vojtech Huser, Martijn J Schuemie, Marc A posium Proceedings, volume 2013, page 876. American
Suchard, Rae Woong Park, Ian Chi Kei Wong, Peter R Medical Informatics Association, 2013.

Temporal Poisson Square Root Graphical Models

Jie Liu, David Page, Peggy Peissig, Catherine McCarty, Robert Tibshirani, Jacob Bien, Jerome Friedman, Trevor
Adedayo A Onitilo, Amy Trentham-Dietz, and Elizabeth Hastie, Noah Simon, Jonathan Taylor, and Ryan J Tibshi-
Burnside. New genetic variants improve personalized rani. Strong rules for discarding predictors in lasso-type
breast cancer diagnosis. AMIA Summits on Translational problems. Journal of the Royal Statistical Society: Series
Science Proceedings, 2014:83, 2014a. B (Statistical Methodology), 74(2):245–266, 2012.

Jie Liu, Chunming Zhang, Elizabeth Burnside, and David Martin J Wainwright. Sharp thresholds for high-
Page. Learning heterogeneous hidden Markov random dimensional and noisy sparsity recovery using l1-
fields. In Artificial Intelligence and Statistics, pages 576– constrained quadratic programming (lasso). IEEE Trans-
584, 2014b. actions on Information Theory, 55(5):2183–2202, 2009.

Jie Liu, Chunming Zhang, David Page, et al. Multiple test- Jeremy Weiss, Sriraam Natarajan, and David Page. Mul-
ing under dependence via graphical models. The Annals tiplicative forests for continuous-time processes. In Ad-
of Applied Statistics, 10(3):1699–1724, 2016. vances in Neural Information Processing Systems, pages
458–466, 2012.
Stephane Mallat. A wavelet tour of signal processing: the
sparse way. Academic Press, 2008. Jeremy C Weiss and David Page. Forest-based point pro-
cess for event prediction from electronic health records.
James M Ortega and Werner C Rheinboldt. Iterative solution In Joint European Conference on Machine Learning
of nonlinear equations in several variables. SIAM, 2000. and Knowledge Discovery in Databases, pages 547–562.
Pradeep Ravikumar, Han Liu, John Lafferty, and Larry Springer, 2013.
Wasserman. Spam: Sparse additive models. In Pro- Eunho Yang and Pradeep Ravikumar. On the use of vari-
ceedings of the 20th International Conference on Neural ational inference for learning discrete graphical model.
Information Processing Systems, pages 1201–1208. Cur- In International Conference on Machine Learning, pages
ran Associates Inc., 2007. 1009–1016, 2011.
Pradeep Ravikumar, Martin J Wainwright, John D Lafferty, Eunho Yang, Pradeep K Ravikumar, Genevera I Allen, and
et al. High-dimensional ising model selection using 1- Zhandong Liu. On poisson graphical models. In Advances
regularized logistic regression. The Annals of Statistics, in Neural Information Processing Systems, pages 1718–
38(3):1287–1319, 2010. 1726, 2013.
Martijn J Schuemie, Gianluca Trifirò, Preciosa M Coloma, Eunho Yang, Pradeep Ravikumar, Genevera I Allen, and
Patrick B Ryan, and David Madigan. Detecting adverse Zhandong Liu. Graphical models via univariate exponen-
drug reactions following long-term exposure in longi- tial family distributions. Journal of Machine Learning
tudinal observational data: The exposure-adjusted self- Research, 16(1):3813–3847, 2015a.
controlled case series. Statistical Methods in Medical
Research, 25(6):2577–2592, 2016. Sen Yang, Zhaosong Lu, Xiaotong Shen, Peter Wonka, and
Jieping Ye. Fused multiple graphical lasso. SIAM Journal
Shawn E Simpson. Self-controlled methods for postmarket- on Optimization, 25(2):916–943, 2015b.
ing drug safety surveillance in large-scale longitudinal
data. Columbia University, 2011.

Shawn E Simpson, David Madigan, Ivan Zorych, Martijn J
Schuemie, Patrick B Ryan, and Marc A Suchard. Mul-
tiple self-controlled case series for large-scale longitudi-
nal observational databases. Biometrics, 69(4):893–902,
2013.

Janet Sultana, Paola Cutroneo, and Gianluca Trifirò. Clinical
and economic burden of adverse drug reactions. Journal
of Pharmacology & Pharmacotherapeutics, 4(Suppl1):
S73, 2013.

Nicholas P Tatonetti, P Ye Patrick, Roxana Daneshjou, and
Russ B Altman. Data-driven prediction of drug effects
and interactions. Science Translational Medicine, 4(125):
125ra31–125ra31, 2012.

You can also read