Adversarial Learning of Privacy-Preserving Text Representations for De-Identification of Medical Records
Max Friedrich1, Arne Köhn2, Gregor Wiedemann1, Chris Biemann1
1 Language Technology Group, Universität Hamburg
{mfriedr,gwiedemann,biemann}@informatik.uni-hamburg.de
2 Department of Language Science and Technology, Saarland University
koehn@coli.uni-saarland.de

Abstract

De-identification is the task of detecting protected health information (PHI) in medical text. It is a critical step in sanitizing electronic health records (EHRs) to be shared for research. Automatic de-identification classifiers can significantly speed up the sanitization process. However, obtaining a large and diverse dataset to train such a classifier that works well across many types of medical text poses a challenge as privacy laws prohibit the sharing of raw medical records. We introduce a method to create privacy-preserving shareable representations of medical text (i.e. they contain no PHI) that does not require expensive manual pseudonymization. These representations can be shared between organizations to create unified datasets for training de-identification models. Our representation allows training a simple LSTM-CRF de-identification model to an F1 score of 97.4%, which is comparable to a strong baseline that exposes private information in its representation. A robust, widely available de-identification classifier based on our representation could potentially enable studies for which de-identification would otherwise be too costly.

1 Introduction

Electronic health records (EHRs) are a valuable resource that could potentially be used in large-scale medical research (Botsis et al., 2010; Birkhead et al., 2015; Cowie et al., 2017). In addition to structured medical data, EHRs contain free-text patient notes that are a rich source of information (Jensen et al., 2012). However, due to privacy and data protection laws, medical records can only be shared and used for research if they are sanitized to not include information potentially identifying patients. The PHI that may not be shared includes potentially identifying information such as names, geographic identifiers, dates, and account numbers; the American Health Insurance Portability and Accountability Act (HIPAA, 1996; https://legislink.org/us/pl-104-191) defines 18 categories of PHI. De-identification is the task of finding and labeling PHI in medical text as a step toward sanitization. As the information to be removed is very sensitive, sanitization always requires final human verification. Automatic de-identification labeling can, however, significantly speed up the process, as shown for other annotation tasks in e.g. Yimam (2015).

Trying to create an automatic classifier for de-identification leads to a "chicken and egg problem" (Uzuner et al., 2007): without a comprehensive training set, an automatic de-identification classifier cannot be developed, but without access to automatic de-identification, it is difficult to share large corpora of medical text in a privacy-preserving way for research (including for training the classifier itself). The standard method of data protection compliant sharing of training data for a de-identification classifier requires humans to pseudonymize protected information with substitutes in a document-coherent way. This includes replacing e.g. every person or place name with a different name, offsetting dates by a random amount while retaining date intervals, and replacing misspellings with similar misspellings of the pseudonym (Uzuner et al., 2007).

In 2019, a pseudonymized dataset for de-identification from a single source, the i2b2 2014 dataset, is publicly available (Stubbs and Uzuner, 2015). However, de-identification classifiers trained on this dataset do not generalize well to data from other sources (Stubbs et al., 2017). To obtain a universal de-identification classifier, many medical institutions would have to pool their data. But preparing this data for sharing using the document-coherent pseudonymization approach requires large human effort (Dernoncourt et al., 2017).
To address this problem, we introduce an adversarially learned representation of medical text that allows privacy-preserving sharing of training data for a de-identification classifier by transforming text non-reversibly into a vector space and only sharing this representation. Our approach still requires humans to annotate PHI (as this is the training data for the actual de-identification task), but the pseudonymization step (replacing PHI with coherent substitutes) is replaced by the automatic transformation to the vector representation instead. A classifier then trained on our representation cannot contain any protected data, as it is never trained on raw text (as long as the representation does not allow for the reconstruction of sensitive information). The traditional approach to sharing training data is conceptually compared to our approach in Fig. 1.

Figure 1: Sharing training data for de-identification. PHI annotations are marked with [brackets]. Upper alternative: traditional process using manual pseudonymization. Lower alternative: our approach of sharing private vector representations. The people icon represents tasks done by humans; the gears icon represents tasks done by machines; the lock icon represents privacy-preserving artifacts. Manual pseudonymization is marked with a dollar icon to emphasize its high costs.

2 Related Work

Our work builds upon two lines of research: firstly de-identification, as the system has to provide good de-identification performance, and secondly adversarial representation learning, to remove all identifying information from the representations to be distributed.

2.1 Automatic De-Identification

Analogously to many natural language processing tasks, the state of the art in de-identification changed in recent years from rule-based systems and shallow machine learning approaches like conditional random fields (CRFs) (Uzuner et al., 2007; Meystre et al., 2010) to deep learning methods (Stubbs et al., 2017; Dernoncourt et al., 2017; Liu et al., 2017).

Three i2b2 shared tasks on de-identification were run in 2006 (Uzuner et al., 2007), 2014 (Stubbs et al., 2015), and 2016 (Stubbs et al., 2017). The organizers performed manual pseudonymization on clinical records from a single source to create the datasets for each of the tasks. An F1 score of 95% has been suggested as a target for reasonable de-identification systems (Stubbs et al., 2015).

Dernoncourt et al. (2017) first applied a long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) model with a CRF output component to de-identification. Transfer learning from a larger dataset slightly improves performance on the i2b2 2014 dataset (Lee et al., 2018). Liu et al. (2017) achieve state-of-the-art performance in de-identification by combining a deep learning ensemble with a rule component.

Up to and including the 2014 shared task, the organizers emphasized that it is unclear if a system trained on the provided datasets will generalize to medical records from other sources (Uzuner et al., 2007; Stubbs et al., 2015). The 2016 shared task featured a sight-unseen track in which de-identification systems were evaluated on records from a new data source. The best system achieved an F1 score of 79%, suggesting that systems at the time were not able to deliver sufficient performance on completely new data (Stubbs et al., 2017).
2.2 Adversarial Representation Learning

Fair representations (Zemel et al., 2013; Hamm, 2015) aim to encode features of raw data that allow it to be used in e.g. machine learning algorithms while obfuscating membership in a protected group or other sensitive attributes. The domain-adversarial neural network (DANN) architecture (Ganin et al., 2016) is a deep learning implementation of a three-party game between a representer, a classifier, and an adversary component. The classifier and the adversary are deep learning models with shared initial layers. A gradient reversal layer is used to worsen the representation for the adversary during back-propagation: when training the adversary, the adversary-specific part of the network is optimized for the adversarial task, but the shared part is updated against the gradient to make the shared representation less suitable for the adversary.

Although initially conceived for use in domain adaptation, DANNs and similar adversarial deep learning models have recently been used to obfuscate demographic attributes from text (Elazar and Goldberg, 2018; Li et al., 2018) and subject identity from images (Feutry et al., 2018). Elazar and Goldberg (2018) warn that when a representation is learned using gradient reversal methods, continued adversary training on the frozen representation may allow adversaries to break representation privacy. To test whether the unwanted information is no longer extractable from the generated representation, adversary training needs to continue on the frozen representation after the system has finished training. Only if the information cannot be recovered after this continued adversary training do we have evidence that it is really no longer contained in the representation.

3 Dataset and De-Identification Model

We evaluate our approaches using the i2b2 2014 dataset (Stubbs and Uzuner, 2015), which was released as part of the 2014 i2b2/UTHealth shared task track 1 and is the largest publicly available dataset for de-identification today. It contains 1304 free-text documents with PHI annotations. The i2b2 dataset uses the 18 categories of PHI defined by HIPAA as a starting point for its own set of PHI categories. In addition to the HIPAA set of categories, it includes (sub-)categories such as doctor names, professions, states, countries, and ages under 90.

We compare three different approaches: a non-private de-identification classifier and two privacy-enabled extensions, automatic pseudonymization (Section 4) and adversarially learned representations (Section 5).

Our non-private system as well as the privacy-enabled extensions are based on a bidirectional LSTM-CRF architecture that has been proven to work well in sequence tagging (Huang et al., 2015; Lample et al., 2016) and de-identification (Dernoncourt et al., 2017; Liu et al., 2017). We only use pre-trained FastText (Bojanowski et al., 2017) or GloVe (Pennington et al., 2014) word embeddings, not explicit character embeddings, as we suspect that these may allow easy re-identification of private information if used in shared representations. In place of learned character features, we provide the casing feature from Reimers and Gurevych (2017) as an additional input. The feature maps words to a one-hot representation of their casing (numeric, mainly numeric, all lower, all upper, initial upper, contains digit, or other). Table 1 shows our raw de-identification model's hyperparameter configuration, which was determined through a random hyperparameter search.

Table 1: Hyperparameter configuration of our de-identification model.

    Hyperparameter             Value
    Pre-trained embeddings     FastText, GloVe
    Casing feature             Yes
    Batch size                 32
    Number of LSTM layers      2
    LSTM units per layer/dir.  128
    Input embedding dropout    0.1
    Variational dropout        0.25
    Dropout after LSTM         0.5
    Optimizer                  Nadam
    Gradient norm clipping     1.0
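To make the casing feature concrete, the following Python sketch maps a token to a one-hot vector over the seven categories listed above (following Reimers and Gurevych, 2017). The function name and the exact decision rules for borderline cases are our own assumptions, not the authors' released code.

    # Hypothetical helper illustrating the casing feature; the category list follows
    # the paper, the tie-breaking rules are our own assumptions.
    CASING_CATEGORIES = ["numeric", "mainly_numeric", "all_lower", "all_upper",
                         "initial_upper", "contains_digit", "other"]

    def casing_one_hot(token):
        digits = sum(ch.isdigit() for ch in token)
        if token.isdigit():
            category = "numeric"
        elif digits > len(token) / 2:
            category = "mainly_numeric"
        elif token.islower():
            category = "all_lower"
        elif token.isupper():
            category = "all_upper"
        elif token[:1].isupper():
            category = "initial_upper"
        elif digits > 0:
            category = "contains_digit"
        else:
            category = "other"
        return [1 if c == category else 0 for c in CASING_CATEGORIES]

For example, under these assumed rules casing_one_hot("Hamburg") activates the initial-upper slot and casing_one_hot("2019") the numeric slot.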
4 Automatic Pseudonymization

To provide a baseline to compare our primary approach against, we introduce a naïve word-level automatic pseudonymization approach that exploits the fact that state-of-the-art de-identification models (Liu et al., 2017; Dernoncourt et al., 2017) as well as our non-private de-identification model work on the sentence level and do not rely on document coherency. Before training, we shuffle the training sentences and replace all PHI tokens with a random choice of a fixed number N of their closest neighbors in an embedding space (including the token itself), as determined by cosine distance in a pre-computed embedding matrix.

Using this approach, the sentence

    [James] was admitted to [St. Thomas]

may be replaced by

    [Henry] was admitted to [Croix Scott].

While the resulting sentences do not necessarily make sense to a reader (e.g. "Croix Scott" is not a realistic hospital name), their embedding representations are similar to the originals. We train our de-identification model on the transformed data and test it on the raw data. The number of neighbors N controls the privacy properties of the approach: N = 1 means no pseudonymization; setting N to the number of rows in a pre-computed embedding matrix delivers perfect anonymization, but the resulting data may be worthless for training a de-identification model.
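A minimal sketch of this word-level pseudonymization, assuming a pre-computed embedding matrix (embeddings) aligned with a vocabulary list (vocab) and per-token PHI flags from the gold annotations; the helper names are ours, not taken from the released implementation:

    import random
    import numpy as np

    def nearest_neighbors(embeddings, vocab, token, n):
        # n closest vocabulary entries by cosine similarity; the token itself is
        # part of its own neighborhood, so n = 1 means "no pseudonymization".
        idx = vocab.index(token)
        normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        sims = normed @ normed[idx]
        return [vocab[i] for i in np.argsort(-sims)[:n]]

    def pseudonymize_sentence(tokens, phi_flags, embeddings, vocab, n=100):
        # Replace every PHI token with a uniform random choice among its n
        # nearest embedding-space neighbors; non-PHI tokens are kept unchanged.
        out = []
        for token, is_phi in zip(tokens, phi_flags):
            if is_phi and token in vocab:
                out.append(random.choice(nearest_neighbors(embeddings, vocab, token, n)))
            else:
                out.append(token)
        return out

In practice one would precompute and cache the neighbor lists instead of re-normalizing the matrix for every token; the sketch favors brevity over efficiency.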
5 Adversarial Representation

We introduce a new data sharing approach that is based on an adversarially learned private representation and improves on the pseudonymization from Section 4. After training the representation on an initial publicly available dataset, e.g. the i2b2 2014 data, a central model provider shares the frozen representation model with participating medical institutions. They transform their PHI-labeled raw data into the pseudonymized representation, which is then pooled into a new public dataset for de-identification. Periodically, the pipeline consisting of the representation model and a trained de-identification model can be published to be used by medical institutions on their unlabeled data.

Since both the representation model and the resulting representations are shared in this scenario, our representation procedure is required to prevent two attacks:

A1. Learning an inverse representation model that transforms representations back to original sentences containing PHI.

A2. Building a lookup table of inputs and their exact representations that can be used in known plaintext attacks.

5.1 Architecture

Our approach uses a model that is composed of three components: a representation model, the de-identification model from Section 3, and an adversary. An overview of the architecture is shown in Fig. 2.

Figure 2: Simplified visualization of the adversarial model architecture. Sequences of squares denote real-valued vectors, dotted arrows represent possible additional real or fake inputs to the adversary. The casing feature that is provided as a second input to the de-identification model is omitted for legibility.

The representation model maps a sequence of word embeddings to an intermediate vector representation sequence. The de-identification model receives this representation sequence as an input instead of the original embedding sequence. It retains the casing feature as an auxiliary input. The adversary has two inputs, the representation sequence and an additional embedding or representation sequence, and a single output unit.

5.2 Representation

To protect against A1, our representation must be invariant to small input changes, like a single PHI token being replaced with a neighbor in the embedding space. Again, the number of neighbors N controls the privacy level of the representation. To protect against A2, we add a random element to the representation that makes repeated transformations of one sentence indistinguishable from representations of similar input sentences.
We use a bidirectional LSTM model to implement the representation. It applies Gaussian noise N with zero mean and trainable standard deviations to the input embeddings E and the output sequence. The model learns a standard deviation for each of the input and output dimensions.

    R = N_out + LSTM(E + N_in)    (1)

In a preliminary experiment, we confirmed that adding noise with a single, fixed standard deviation is not a viable approach for privacy-preserving representations. To change the cosine similarity neighborhoods of embeddings at all, we need to add high amounts of noise (more than double the respective embedding matrix's standard deviation), which in turn results in unrealistic embeddings that do not allow training a de-identification model of sufficient quality.

In contrast to the automatic pseudonymization approach from Section 4 that only perturbs PHI tokens, the representation model in this approach processes all tokens to represent them in a new embedding space. We evaluate the representation sizes d ∈ {50, 100, 300}.
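A sketch of such a representation model in Keras (the framework used in Section 6), under our own assumptions: the trainable-noise layer and its exp(log_std) parameterization are an illustration of Equation (1), not the authors' implementation, which may differ in detail.

    import tensorflow as tf
    from tensorflow import keras

    class TrainableGaussianNoise(keras.layers.Layer):
        # Adds zero-mean Gaussian noise with one trainable standard deviation per
        # feature dimension; exp(log_std) keeps the deviation positive (our choice).
        def build(self, input_shape):
            self.log_std = self.add_weight(name="log_std", shape=(input_shape[-1],),
                                           initializer="zeros", trainable=True)

        def call(self, inputs):
            noise = tf.random.normal(tf.shape(inputs)) * tf.exp(self.log_std)
            return inputs + noise

    def build_representation_model(embedding_dim=300, repr_dim=50):
        # R = N_out + BiLSTM(E + N_in), cf. Equation (1). The noise stays active
        # when transforming data to share, since it is part of the privacy mechanism.
        embeddings = keras.Input(shape=(None, embedding_dim))
        noisy_input = TrainableGaussianNoise()(embeddings)
        hidden = keras.layers.Bidirectional(
            keras.layers.LSTM(repr_dim // 2, return_sequences=True))(noisy_input)
        representation = TrainableGaussianNoise()(hidden)
        return keras.Model(embeddings, representation)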
5.3 Adversaries

We use two adversaries that are trained on tasks that directly follow from A1 and A2:

T1. Given a representation sequence and an embedding sequence, decide if they were obtained from the same sentence.

T2. Given two representation sequences (and their cosine similarities), decide if they were obtained from the same sentence.

We generate the representation sequences for the second adversary from a copy of the representation model with shared weights. We generate real and fake pairs for adversarial training using the automatic pseudonymization approach presented in Section 4, limiting the number of replaced PHI tokens to one per sentence.

The adversaries are implemented as bidirectional LSTM models with single output units. We confirmed that this type of model is able to learn the adversarial tasks on random data and raw word embeddings in preliminary experiments. To use the two adversaries in our architecture, we average their outputs.

5.4 Training

We evaluate two training procedures: DANN training (Ganin et al., 2016) and the three-part procedure from Feutry et al. (2018).

In DANN training, the three components are trained conjointly, optimizing the sum of losses. Training the de-identification model modifies the representation model weights to generate a more meaningful representation for de-identification. The adversary gradient is reversed with a gradient reversal layer between the adversary and the representation model in the backward pass, causing the representation to become less meaningful for the adversary.

The training procedure by Feutry et al. (2018) is shown in Fig. 3. It is composed of three phases:

P1. The de-identification and representation models are pre-trained together, optimizing the de-identification loss l_deid.

P2. The representation model is frozen and the adversary is pre-trained, optimizing the adversarial loss l_adv.

P3. In alternation, for one epoch each:

    (a) The representation is frozen and both the de-identification model and the adversary are trained, optimizing their respective losses l_deid and l_adv.

    (b) The de-identification model and adversary are frozen and the representation is trained, optimizing the combined loss

        l_repr = l_deid + λ |l_adv − l_random|    (2)

Figure 3: Visualization of Feutry et al.'s three-part training procedure. The adversarial model layout follows Fig. 2: the representation model is at the bottom, the left branch is the de-identification model and the right branch is the adversary. In each step, the thick components are trained while the thin components are frozen.

In each of the first two phases, the respective validation loss is monitored to decide at which point the training should move on to the next phase. The alternating steps in the third phase each last one training epoch; the early stopping time for the third phase is determined using only the combined validation loss from Phase P3b.

Gradient reversal is achieved by optimizing the combined representation loss while the adversary weights are frozen. The combined loss is motivated by the fact that the adversary performance should be the same as a random guessing model, which is a lower bound for anonymization (Feutry et al., 2018). The term |l_adv − l_random| approaches 0 when the adversary performance approaches random guessing (in the case of binary classification, l_random = −log ½). λ is a weighting factor for the two losses; we select λ = 1.
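The combined loss of Equation (2) is simple to state in code. A minimal sketch, assuming a binary cross-entropy adversary so that the random-guessing loss is −log ½ as stated above; the constant names are ours:

    import math

    LAMBDA = 1.0               # weighting factor between the two losses (λ = 1)
    L_RANDOM = -math.log(0.5)  # loss of a random-guessing binary adversary

    def representation_loss(deid_loss, adv_loss):
        # Phase P3b objective: keep the de-identification loss low while pushing
        # the adversary's loss toward the random-guessing level.
        return deid_loss + LAMBDA * abs(adv_loss - L_RANDOM)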
6 Experiments

To evaluate our approaches, we perform experiments using the i2b2 2014 dataset.

Preprocessing: We apply aggressive tokenization similarly to Liu et al. (2017), including splitting at all punctuation marks and mid-word, e.g. if a number is followed by a word ("25yo" is split into "25", "yo"), in order to minimize the amount of GloVe out-of-vocabulary tokens. We extend spaCy's (https://spacy.io) sentence splitting heuristics with additional rules for splitting at multiple blank lines as well as bulleted and numbered list items.

Deep Learning Models: We use the Keras framework (https://keras.io; Chollet et al., 2015) with the TensorFlow backend (Abadi et al., 2015) to implement our deep learning models.

Evaluation: In order to compare our results to the state of the art, we use the token-based binary HIPAA F1 score as our main metric for de-identification performance. Dernoncourt et al. (2017) deem it the most important metric: deciding if an entity is PHI or not is generally more important than assigning the correct category of PHI, and only HIPAA categories of PHI are required to be removed by American law. Non-PHI tokens are not incorporated in the F1 score. We perform the evaluation with the official i2b2 evaluation script (https://github.com/kotfic/i2b2_evaluation_scripts).
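The scores reported below come from the official i2b2 evaluation script. Purely as an illustration of the metric, a token-level binary PHI F1 under our own simplifying assumptions (one boolean PHI/non-PHI label per token) could be computed as follows; this is not the official scorer:

    def binary_phi_f1(gold, predicted):
        # gold/predicted: parallel lists of booleans, True = token is (HIPAA) PHI.
        tp = sum(1 for g, p in zip(gold, predicted) if g and p)
        fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
        fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return (2 * precision * recall / (precision + recall)
                if precision + recall else 0.0)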
7 Results

Table 2 shows de-identification performance results for the non-private de-identification classifier (upper part, in comparison to the state of the art) as well as the two privacy-enabled extensions (lower part). The results are average values out of five experiment runs.

Table 2: Binary HIPAA F1 scores of our non-private (top) and private (bottom) de-identification approaches on the i2b2 2014 test set in comparison to the non-private state of the art. Our private approaches use N = 100 neighbors as a privacy criterion.

    Model                            F1 (%)
    Our non-private FastText         97.67
    Our non-private GloVe            97.24
    Our non-private GloVe + casing   97.62
    Dernoncourt et al. (LSTM-CRF)    97.85
    Liu et al. (ensemble + rules)    98.27
    Our autom. pseudon. FastText     96.75
    Our autom. pseudon. GloVe        96.42
    Our adv. repr. FastText          97.40
    Our adv. repr. GloVe             96.89

7.1 Non-private De-Identification Model

When trained on the raw i2b2 2014 data, our models achieve F1 scores that are comparable to Dernoncourt et al.'s results. The casing feature improves GloVe by 0.4 percentage points.

7.2 Automatic Pseudonymization

For both FastText and GloVe, moving training PHI tokens to random tokens from up to their N = 200 closest neighbors does not significantly reduce de-identification performance (see Fig. 4). F1 scores for both models drop to around 95% when selecting from N = 500 neighbors and to around 90% when using N = 1 000 neighbors. With N = 100, the FastText model achieves an F1 score of 96.75% and the GloVe model achieves an F1 score of 96.42%.

Figure 4: F1 scores of our models when trained on automatically pseudonymized data where PHI tokens are moved to one of different numbers of neighbors N. The gray dashed line marks the 95% target F1 score.

7.3 Adversarial Representation

We do not achieve satisfactory results with the conjoint DANN training procedure: in all cases, our models learn representations that are not sufficiently resistant to the adversary. When training the adversary on the frozen representation for an additional 20 epochs, it is able to distinguish real from fake input pairs on a test set with accuracies above 80%. This confirms the difficulties of DANN training as described by Elazar and Goldberg (2018) (see Section 2.2).

In contrast, with the three-part training procedure, we are able to learn a representation that allows training a de-identification model while preventing an adversary from learning the adversarial tasks, even with continued training on a frozen representation.

Figure 5 (left) shows our de-identification results when using adversarially learned representations. A higher number of neighbors N means a stronger invariance requirement for the representation. For values of N up to 1 000, our FastText and GloVe models are able to learn representations that allow training de-identification models that reach or exceed the target F1 score of 95%. However, training becomes unstable for N > 500: at this point, the adversary is able to break the representation privacy when trained for an additional 50 epochs (Fig. 5 right).

Our choice of representation size d ∈ {50, 100, 300} does not influence de-identification or adversary performance, so we select d = 50 for further evaluation. For d = 50 and N = 100, the FastText model reaches an F1 score of 97.4% and the GloVe model reaches an F1 score of 96.89%.

Figure 5: Left: de-identification F1 scores of our models using an adversarially trained representation with different numbers of neighbors N for the representation invariance requirement. Right: mean adversary accuracy when trained on the frozen representation for an additional 50 epochs. The figure shows average results out of five experiment runs.

8 De-Identification Performance

In the following, we discuss the results of our models with regard to our goal of sharing sensitive training data for automatic de-identification. Overall, privacy-preserving representations come at a cost, as our best privacy-preserving model scores 0.27 F1 points lower than our best non-private model; we consider this relative increase of errors of less than 10% as tolerable.

Raw Text De-Identification: We find that the choice of GloVe or FastText embeddings does not meaningfully influence de-identification performance. FastText's approach to embedding unknown words (word embeddings are the sum of their subword embeddings) should intuitively prove useful on datasets with misspellings and ungrammatical text. However, when using the additional casing feature, FastText beats GloVe only by 0.05 percentage points on the i2b2 test set. In this task, the casing feature makes up for GloVe's inability to embed unknown words.

Liu et al. (2017) use a deep learning ensemble in combination with hand-crafted rules to achieve state-of-the-art results for de-identification. Our model's scores are similar to the previous state of the art, a bidirectional LSTM-CRF model with character features (Dernoncourt et al., 2017).

Automatically Pseudonymized Data: Our naïve automatic word-level pseudonymization approach allows training reasonable de-identification models when selecting from up to N = 500 neighbors. There is almost no decrease in F1 score for up to N = 20 neighbors for both the FastText and GloVe model.

Adversarially Learned Representation: Our adversarially trained vector representation allows training reasonable de-identification models (F1 scores above 95%) when using up to N = 1 000 neighbors as an invariance requirement. The adversarial representation results beat the automatic pseudonymization results because the representation model can act as a task-specific feature extractor. Additionally, the representations are more general as they are invariant to word changes.

9 Privacy Properties

In this section, we discuss our models with respect to their privacy-preserving properties.

Embeddings: When looking up embedding space neighbors for words, it is notable that many FastText neighbors include the original word or parts of it as a subword. For tokens that occur as PHI in the i2b2 training set, on average 7.37 of their N = 100 closest neighbors in the FastText embedding matrix contain the original token as a subword. When looking up neighbors using GloVe embeddings, the value is 0.44. This may indicate that FastText requires stronger perturbation (i.e. higher N) than GloVe to sufficiently obfuscate protected information.
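This subword-overlap statistic can be approximated in a few lines, reusing the hypothetical nearest_neighbors helper from the Section 4 sketch; the substring test is only a rough proxy for "contains the original token as a subword", and the function is our own illustration, not the authors' analysis code.

    def avg_neighbor_overlap(phi_tokens, embeddings, vocab, n=100):
        # Average number of the n closest neighbors that contain the original
        # token as a case-insensitive substring.
        counts = []
        for token in phi_tokens:
            if token not in vocab:
                continue
            neighbors = nearest_neighbors(embeddings, vocab, token, n)
            counts.append(sum(1 for nb in neighbors if token.lower() in nb.lower()))
        return sum(counts) / len(counts) if counts else 0.0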
Automatically Pseudonymized Data: The word-level pseudonymization does not guarantee a minimum perturbation for every word, e.g. in a set of pseudonymized sentences using N = 100 FastText neighbors, we found the phrase

    [Florida Hospital],

which was replaced with

    [Miami-Florida Hosp].

Additionally, the approach may allow an adversary to piece together documents from the shuffled sentences. If multiple sentences contain similar pseudonymized identifiers, they will likely come from the same original document, undoing the privacy gain from shuffling training sentences across documents. It may be possible to infer the original information using the overlapping neighbor spaces. To counter this, we can re-introduce document-level pseudonymization, i.e. moving all occurrences of a PHI token to the same neighbor. However, we would then also need to detect misspelled names as well as other hints to the actual tokens and transform them similarly to the original, which would add back much of the complexity of manual pseudonymization that we try to avoid.

Adversarially Learned Representation: Our adversarial representation empirically satisfies a strong privacy criterion: representations are invariant to any protected information token being replaced with any of its N neighbors in an embedding space. When freezing the representation model from an experiment run using up to N = 500 neighbors and training the adversary for an additional 50 epochs, it still does not achieve higher-than-chance accuracies on the training data. Due to the additive noise, the adversary does not overfit on its training set but rather fails to identify any structure in the data.

In the case of N = 1 000 neighbors, the representation never becomes stable in the alternating training phase. The adversary is always able to break the representation privacy.
10 Conclusions & Future Work

We introduced a new approach to sharing training data for de-identification that requires lower human effort than the existing approach of document-coherent pseudonymization. Our approach is based on adversarial learning, which yields representations that can be distributed since they do not contain private health information. The setup is motivated by the need for de-identification of medical text before sharing; our approach provides a lower-cost alternative to manual pseudonymization and gives rise to the pooling of de-identification datasets from heterogeneous sources in order to train more robust classifiers. Our implementation and experimental data are publicly available (https://github.com/maxfriedrich/deid-training-data).

As precursors to our adversarial representation approach, we developed a deep learning model for de-identification that does not rely on explicit character features as well as an automatic word-level pseudonymization approach. A model trained on our automatically pseudonymized data with N = 100 neighbors loses around one percentage point in F1 score when compared to the non-private system, scoring 96.75% on the i2b2 2014 test set.

Further, we presented an adversarial learning based private representation of medical text that is invariant to any PHI word being replaced with any of its embedding space neighbors and contains a random element. The representation allows training a de-identification model while being robust to adversaries trying to re-identify protected information or building a lookup table of representations. We extended existing adversarial representation learning approaches by using two adversaries that discriminate real from fake sequence pairs with an additional sequence input.

The representation acts as a task-specific feature extractor. For an invariance criterion of up to N = 500 neighbors, training is stable and adversaries cannot beat the random guessing accuracy of 50%. Using the adversarially learned representation, de-identification models reach an F1 score of 97.4%, which is close to the non-private system (97.67%). In contrast, the automatic pseudonymization approach only reaches an F1 score of 95.0% at N = 500.

Our adversarial representation approach enables cost-effective private sharing of training data for sequence labeling. Pooling of training data for de-identification from multiple institutions would lead to much more robust classifiers. Eventually, improved de-identification classifiers could help enable large-scale medical studies that improve public health.

Future Work: The automatic pseudonymization approach could serve as a data augmentation scheme to be used as a regularizer for de-identification models. Training a model on a combination of raw and pseudonymized data may result in better test scores on the i2b2 test set, possibly improving the state of the art.

Private character embeddings that are learned from a perturbed source could be an interesting extension to our models.

In adversarial learning with the three-part training procedure, it might be possible to tune the λ parameter and define a better stopping condition that avoids the unstable characteristics with high values for N in the invariance criterion. A further possible extension is a dynamic noise level in the representation model that depends on the LSTM output instead of being a trained weight. This might allow using lower amounts of noise for certain inputs while still being robust to the adversary.

When more training data from multiple sources become available in the future, it will be possible to evaluate our adversarially learned representation against unseen data.

Acknowledgments

This work was partially supported by BWFG Hamburg within the "Forum 4.0" project as part of the ahoi.digital funding line.

De-identified clinical records used in this research were provided by the i2b2 National Center for Biomedical Computing funded by U54LM008748 and were originally prepared for the Shared Tasks for Challenges in NLP for Clinical Data organized by Özlem Uzuner, i2b2 and SUNY.

References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Accessed May 31, 2019.

Guthrie S. Birkhead, Michael Klompas, and Nirav R. Shah. 2015. Uses of electronic health records for public health surveillance to advance public health. Annual Review of Public Health, 36(1):345–359. PMID: 25581157.
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Taxiarchis Botsis, Gunnar Hartvigsen, Fei Chen, and Chunhua Weng. 2010. Secondary use of EHR: Data quality issues and informatics opportunities. AMIA Summits on Translational Science Proceedings, 2010:1–5.
François Chollet et al. 2015. Keras. Accessed May 31, 2019.

Martin R. Cowie, Juuso I. Blomster, Lesley H. Curtis, Sylvie Duclaux, Ian Ford, Fleur Fritz, Samantha Goldman, Salim Janmohamed, Jörg Kreuzer, Mark Leenay, Alexander Michel, Seleen Ong, Jill P. Pell, Mary Ross Southworth, Wendy Gattis Stough, Martin Thoenes, Faiez Zannad, and Andrew Zalewski. 2017. Electronic health records to facilitate clinical research. Clinical Research in Cardiology, 106(1):1–9.

Franck Dernoncourt, Ji Young Lee, Özlem Uzuner, and Peter Szolovits. 2017. De-identification of patient notes with recurrent neural networks. Journal of the American Medical Informatics Association, 24(3):596–606.

Yanai Elazar and Yoav Goldberg. 2018. Adversarial removal of demographic attributes from text data. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 11–21, Brussels, Belgium. Association for Computational Linguistics.

Clément Feutry, Pablo Piantanida, Yoshua Bengio, and Pierre Duhamel. 2018. Learning anonymized representations with adversarial neural networks. arXiv preprint arXiv:1802.09386.

Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. 2016. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(1):2096–2030.

Jihun Hamm. 2015. Preserving privacy of continuous high-dimensional data with minimax filters. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pages 324–332, San Diego, CA, USA. PMLR.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.

Peter B. Jensen, Lars J. Jensen, and Søren Brunak. 2012. Mining electronic health records: towards better research applications and clinical care. Nature Reviews Genetics, 13:395–405.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 260–270, San Diego, CA, USA. Association for Computational Linguistics.

Ji Young Lee, Franck Dernoncourt, and Peter Szolovits. 2018. Transfer learning for named-entity recognition with neural networks. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation, pages 4470–4473, Miyazaki, Japan. European Language Resources Association.

Yitong Li, Timothy Baldwin, and Trevor Cohn. 2018. Towards robust and privacy-preserving text representations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 25–30, Melbourne, Australia. Association for Computational Linguistics.

Zengjian Liu, Buzhou Tang, Xiaolong Wang, and Qingcai Chen. 2017. De-identification of clinical notes via recurrent neural network and conditional random field. Journal of Biomedical Informatics, 75(S):S34–S42.

Stephane M. Meystre, F. Jeffrey Friedlin, Brett R. South, Shuying Shen, and Matthew H. Samore. 2010. Automatic de-identification of textual documents in the electronic health record: a review of recent research. BMC Medical Research Methodology, 10(1):70.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Nils Reimers and Iryna Gurevych. 2017. Optimal hyperparameters for deep LSTM-networks for sequence labeling tasks. arXiv preprint arXiv:1707.06799.

Amber Stubbs, Michele Filannino, and Özlem Uzuner. 2017. De-identification of psychiatric intake records: Overview of 2016 CEGS N-GRID shared tasks track 1. Journal of Biomedical Informatics, 75(S):S4–S18.
Amber Stubbs, Christopher Kotfila, and Özlem Uzuner. 2015. Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2/UTHealth shared task track 1. Journal of Biomedical Informatics, 58:11–19.

Amber Stubbs and Özlem Uzuner. 2015. Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus. Journal of Biomedical Informatics, 58:20–29.

The United States Congress. 1996. Health Insurance Portability and Accountability Act of 1996. Accessed May 31, 2019.

Özlem Uzuner, Yuan Luo, and Peter Szolovits. 2007. Evaluating the state-of-the-art in automatic de-identification. Journal of the American Medical Informatics Association, 14(5):550–563.
Seid Muhie Yimam. 2015. Narrowing the loop: Integration of resources and linguistic dataset development with interactive machine learning. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 88–95, Denver, CO, USA. Association for Computational Linguistics.

Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. 2013. Learning fair representations. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 325–333, Atlanta, GA, USA. PMLR.