Accelerating Genomic Data Generation and Facilitating Genomic Data Access Using Decentralization, Privacy-Preserving Technologies and Equitable Compensation

Authors: Dennis Grishin (1,2,3), Kamal Obbad (1), Preston Estep (1,3), Kevin Quinn (1), Sarah Wait Zaranek (3), Alexander Wait Zaranek (1,3), Ward Vandewege (1,3), Tom Clegg (3), Nico César (3), Mirza Cifric (1,3), George Church (1,2,3)

Affiliations: (1) Nebula Genomics, Inc., San Francisco, USA; (2) Department of Genetics, Harvard Medical School, Boston, USA; (3) Veritas Genetics, Inc, Danvers, USA.

Corresponding Author: Dennis Grishin, Nebula Genomics Inc., 73 Sumner Street, #401, San Francisco, CA 94103, USA; dgrishin@g.harvard.edu

Keywords: Arvados, Blockchain, Data Privacy, Data Sharing, DNA Sequencing, Genomics, Homomorphic Encryption, Nebula

Category: Use Cases/Pilots/Methodologies

Blockchain in Healthcare Today™ ISSN 2573-8240 online https://doi.org/10.30953/bhty.v1.34

In the years since the first human genome was sequenced at a cost of over $3 billion, technological advancements have driven the price below $1,000, making personal genome sequencing affordable to many people. Personal genome sequencing has the potential to enable better disease prevention, more accurate diagnoses, and personalized therapies. Furthermore, sharing genomic data with researchers promises identification of the causes of many diseases and the development of new therapies. However, sequencing costs, data privacy concerns, regulatory restrictions, and technical challenges impede the growth of genomic data and hinder data sharing.

In this article, we propose that these challenges can be addressed by combining decentralized system design, privacy-preserving technologies, and an equitable compensation model in a platform that vests control over data with individual owners; ensures transparency and privacy; facilitates regulatory compliance; minimizes expensive data transfers; and shifts the sequencing costs from consumers, patients, and biobanks to researchers in industry and
academia. We exemplify this by describing the implementation of Nebula, a distributed genomic data generation, sharing, and analysis platform.

The Human Genome Project has sequenced and assembled the first human reference genome at a cost of over $3 billion [1]. Since then, development of next-generation sequencing technology has resulted in exponentially decreasing sequencing cost (Figure 1) [2]. Today, the sequencing of a whole human genome costs less than $1,000. This price is projected to drop to $100 in the next few years [3]. The exponentially decreasing DNA sequencing costs have made personal genome sequencing affordable to patients as well as healthy individuals.

Figure 1. Human genome sequencing cost, 2001–2017.

Personal genome sequencing is becoming more common as prices decline, but most genetic tests to date have been performed using DNA hybridization microarrays. These tests are referred to as genotyping and they assess the presence or absence of genetic variants associated with certain traits. For a cost less than $100, genotyping typically reads out only ~0.02% of the human genome, at predefined positions, often missing health-relevant genetic variants that must be reported. In addition, variant identification at a small number of positions does not allow discovery of novel variants, including those that cause disease; the majority of these variants are distributed throughout the genome and remain undiscovered [4]. This limits the usefulness of genotyping data to researchers.

OPPORTUNITIES

As genomic sequencing becomes more affordable, it opens up opportunities for individuals as well as researchers in academia and industry.
Personal genome sequencing can support data-driven decision-making for health-related issues. Studies estimated that ~2% of people carry genetic variants that cause or predispose them to a wide variety of diseases at various levels of severity, the majority of which can be preventable or treatable [5]. In addition, every parent carries, on average, approximately five genetic variants that might cause diseases in offspring if the other parent carries the same variant [6]. The presence of certain genetic variants also has been associated with adverse effects for ~7% of Food and Drug Administration-approved drugs [5]. Personal genome sequencing can also help healthy individuals make better lifestyle choices. For example, genetic variants have been shown to cause sensitivities to certain nutrients [7–9] and to increase risks of sports-related injuries [10–12]. In the future, advancement in understanding human genetics will make personal genome sequencing more insightful, while correcting pathological genetic variants will become possible as more and more gene therapies enter clinical trials [13].

Researchers study genomic data sets to identify genetic variants that cause diseases. This enables the research and development of therapies targeting disease-associated genes with increasing specificity. Genomics-guided therapeutic discovery has been applied successfully to many types of cancers, rare genetic diseases, and, increasingly, common complex diseases [14]. Furthermore, genomics-guided patient cohort recruiting can reduce the failure rate of clinical trials by enriching for likely responders and reducing adverse reactions. This approach to clinical trials promises to reduce surging drug development costs and lead to more drugs reaching the market and benefiting patients [15].

These opportunities are recognized by the biopharma industry. For example, the leading personal genomics company 23andMe received $60 million from Genentech [16] and $300 million from GlaxoSmithKline [17] for access to genotyping data collected from its customers. Other biopharma companies have launched their own sequencing projects. AstraZeneca announced it would sequence 2 million human genomes [18], and Regeneron is leading a $100 million consortium to sequence approximately 500,000 samples collected by the UK Biobank [19].

CHALLENGES

Multiple obstacles hinder the realization of opportunities offered by personal genomics. Many people are deterred by the costs of personal genome sequencing, as well as concerns over genomic data privacy. Research is hampered by the resultant scarcity of genomic data, a problem further compounded by difficulties with data access.

In 2018, the number of genotyped people surpassed 10 million and is expected to grow to more than 100 million by 2021 [20]. This growth is driven by a combination of factors, notably consumer interest in ancestry analysis coupled with a decrease in genotyping costs below $100 [21]. In contrast, consumer interest in whole genome sequencing has grown slowly due to a significantly higher cost. A recent survey revealed that only ~3% of people are willing to pay >$1,000 for whole genome sequencing [22]. For the majority of consumers, whose primary interest in the area can best be described as nonmedical “infotainment,” the benefits of sequencing over genotyping do not justify the significantly higher cost.

At the same time, the surge in popularity of genetic testing, forensic utilization of genetic databases [23], and the purchase of genetic data by biopharma companies [24] have increased consumer and media attention to genetic
data privacy. Studies show that privacy concerns are legitimate, as data sharing policies of many personal genomics companies do not fulfill transparency guidelines with regard to the confidentiality or sharing of customer genetic data [25]. These developments are likely to exacerbate reported privacy concerns over genetic data [26,27] and deter personal genomic sequencing.

For researchers, low adoption of personal genome sequencing has resulted in low availability of genomic data. According to estimates, only ~500 thousand human genomes had been sequenced by 2017 [3]. This is detrimental for research because very large genomic data sets are necessary to find links between genetic variants and traits, such as disease predispositions. Finding such links is difficult because most traits are the product of complex interactions of many genetic variants, while the effects of individual genetic variants are, on average, very small [28]. Low diversity of genomic data sets further complicates the search for links between genetics and disease [29].

The scarcity of genomic data is exacerbated by difficulty in data access due to fragmentation of genomic data across proprietary data silos [30]. Data sharing is further hindered by the large size of genomic data, which impedes data transfer over networks [31]. In addition to logistic and technical challenges, data access is often complicated by restrictive government regulations that hinder data sharing [32]. Low availability of genomic data combined with data silos also results in high prices, making the data unaffordable to many researchers.

PREVIOUS WORK

Solutions to the challenges outlined above have been proposed previously. Federated data storage systems have been implemented to facilitate genomic data sharing, privacy-preserving computing has been utilized to protect genomic data privacy, and different compensation models have been explored to incentivize genomic data sharing.

Genomic Data Sharing

The GA4GH Beacon Project [33] and i2b2 SHRINE [34] are two of the most advanced systems for biomedical data sharing. Both are networks that enable participating institutions to connect their genomic (and clinical) databases and process queries about the presence of genetic variants and traits, including medical conditions. This federated model minimizes expensive data transfers and enables institutions to retain control of their data. This addresses privacy, regulatory, and technical challenges that are associated with centralized storage and transfers of genomic data.

However, there are limitations. First, functionality is currently limited to simple queries. Orchestrated, distributed computations required for data processing and analysis are currently not supported. Second, participation is limited to academic research institutions and hospitals. There are no patient- or consumer-focused portals that would enable individuals to easily contribute their personal genomic data. Third, decentralized governance and compensation mechanisms have not been implemented.

Genomic Data Protection

Distributed genomic data storage and computing can help protect genomic data privacy. However, data owners cannot always maintain in-house servers and therefore they often must outsource data storage and computing to third parties, such as cloud service providers. To protect the privacy of genomic data that are shared with untrusted third parties, encryption-based
privacy-preserving techniques have been adopted for genomics. These techniques enable third parties to execute computations and return results without having access to plaintext genomic data.

Privacy-preserving techniques have been applied previously to distributed medical and genomic databases. For example, MedCo integrates with the i2b2 SHRINE framework and uses a homomorphic data encryption scheme to enable outsourcing of genomic data storage and query execution to untrusted third parties [35]. Another example is the Secure Multi Party Query Language framework that implements similar functionality and privacy guarantees using secure multiparty computations [36]. Data can also be protected using trusted hardware. An example is the PRINCESS framework that executes computations on genomic data inside protected memory regions of Intel microprocessors [37].

Compensation Models

Over the past few years, personal genomics companies have explored different models to compensate individuals for contributing their personal genomic data to research studies. In 2016, Genos offered to help its customers sell their genomic data to researchers [38]. A similar model that uses a cryptocurrency instead of fiat money was adopted by EncrypGen in 2017 [39]. Most recently, LunaDNA announced that it would compensate genomic data contributors with company stock [40]. These models are similar in that individuals who want to participate must already own their personal genomic data, or choose to purchase genetic testing because of the prospect they will be rewarded later for sharing the data.

PERSONAL GENOMICS 2.0

The traditional model for genomic data generation and sharing that has been adopted by most personal genomics companies contributes to the challenges described in the previous sections. This model requires consumers to pay for genetic testing and result interpretation, while personal genomics companies often take ownership of the generated genomic data and sell it to biopharma companies (Figure 2). This model requires consumers to carry the costs and relinquish ownership and control of their genomic data, which discourages genetic testing. In addition, this model promotes genomic data fragmentation across private data silos, which hampers data access and increases data prices.

We propose to combine and extend previous work on genomic data sharing networks, privacy-preserving technologies, and compensation models to create a new model for personal genomics that may overcome these challenges (Figure 3).

First, the functionality of genomic data sharing networks must be extended beyond simple queries. This requires a network that can be integrated with a full-fledged bioinformatics platform that supports genomic data processing and analysis. Implementing this functionality would bundle fragmented genomic data and make it available for analysis on a single network, thereby facilitating data access for researchers.

Second, the data sharing network must expand beyond research institutions and must be accessible to individuals who want to share their personal genomic data. However, the resulting network decentralization will necessitate a more democratic governance model. This potentially can be achieved by integrating blockchain technology, which holds the promise of enabling decentralized, self-governing networks.

Third, the privacy of genomic data must be protected. Data access control on the blockchain
can ensure transparent consent management, while privacy-preserving technologies can help protect shared genomic data. Together with the distributed computing model that “brings algorithms to the data,” these technologies can enable network participants to retain ownership and control of their genomic data, thereby reducing privacy concerns and incentivizing data sharing.

Figure 2. The traditional model for genomic data generation and sharing.

Fourth, genome sequencing and data sharing also must be incentivized by implementing subsidy and compensation mechanisms. The decentralized data sharing model can facilitate this, as it enables researchers to connect directly with individuals with traits of interest, subsidize their genome sequencing costs, and compensate them for data sharing. Elimination of middlemen also may result in a reduction in genomic data prices and thus empower more researchers to access large genomic data sets.

DESIGN CONSIDERATIONS

To implement a system as outlined in the previous section, one must integrate a bioinformatics platform that supports distributed data storage and computing with a suitable blockchain framework, as well as with techniques for privacy-preserving computing. Here, we review and evaluate existing options.

Bioinformatics Platforms

Bioinformatics platforms have been developed to facilitate organization of genomic data; to enable parallelized, high-performance computing with support for complex dependencies; and to allow
a modular pipeline design that is flexible and ensures reproducible results [41]. Table 1 shows a comparison of popular bioinformatics platforms.

Figure 3. Alternative model for personal genomics that may overcome challenges.

The development of bioinformatics platforms has been driven by exponentially growing genomic data and marked by adoption of multiple computing trends. Storage and processing of genomic data has moved from local servers to remote clouds. This has enabled scalable data storage and computing and facilitated shared access to genomic data sets. To scale beyond single clouds, efforts are being made to create federated cloud environments that could enable distributed data storage and computing [48,49]. Furthermore, the growth of genomic data and development of new bioinformatics tools that must be integrated into workflows are driving the development of standardized workflow description languages, containerization of computing environments, and utilization of standardized application programming interfaces (APIs).

Based on these considerations, Arvados and DNAstack appear as suitable choices for the proposed genomic data sharing platform. Both platforms have an API-focused architecture and data sharing functionality. DNAstack integrates with the GA4GH Beacon Network, while Arvados supports platform-agnostic, federated cloud environments and has an open-source codebase.

Blockchain Frameworks

Blockchain technology has three use cases in the proposed system. First, the need to provide transparent consent management can be addressed by the ability of blockchains to store data access permissions on an
immutable public ledger. Second, blockchains can enable implementation of decentralized systems governed by network participants. Third, an immutable ledger can facilitate verification of the integrity of decentrally stored data.

Table 1. Comparison of bioinformatics platforms

Criteria | Arvados [42,43] | DNAstack [44] | Seven Bridges [45] | DNAnexus [46] | Galaxy [47]
Hardware | Federated clouds and servers | Google Cloud with Beacon Network integration | Clouds | Clouds | Local servers
Pipeline design | API-based; web GUI | API-based; web GUI | Web GUI | Web GUI | Web GUI
Containers | Yes | Yes | Yes | Yes | Yes
Workflow language | CWL | WDL | CWL | Custom | Custom
Open source | Yes | No | No | No | Yes
Platform launch year | 2013 | 2014 | 2012 | 2010 | 2005

API: application programming interface; CWL: Common Workflow Language; GUI: graphical (rather than textual) user interface; WDL: Workflow Description Language.

Based on these use cases, one can create a set of requirements that a suitable blockchain framework must fulfill. First, consent management requires that the identity of researchers who request to access data is known to data owners. To this end, network access must be limited to data buyers whose identity has been verified. Therefore, consent management requires a blockchain that supports permissioned access.

Second, a large, decentralized data marketplace requires smart contract functionality and high transaction throughput. Private blockchains can achieve higher transaction throughputs than public blockchains because the ability to write transactions to the blockchain is limited to a group of permissioned validator nodes. However, this makes private blockchains more centralized and less dependable.

Based on these requirements, permissioned blockchain frameworks such as Exonum and Hyperledger Fabric appear most suitable (Table 2). Hyperledger Fabric has been more widely adopted, but Exonum offers transparency and security that are comparable to public blockchains.

First, Exonum-based blockchains offer public read access but restrict write access to selected validator nodes. By making read access to the blockchain public, transaction audit does not rely on trusted parties. Exonum transactions are verified in real time by all nodes. Thus, all network participants are able to audit the blockchain state collectively.

Second, Exonum supports anchoring of transaction logs in the Bitcoin blockchain. Hashes of the Exonum blockchain state are periodically written to the Bitcoin blockchain, so even if all permissioned Exonum nodes collude, the transaction history cannot be falsified unless the attacker succeeds in compromising the Bitcoin blockchain as well.

Third, Exonum uses a byzantine fault-tolerant (BFT) consensus algorithm that protects against
malicious behavior of permissioned nodes. In contrast, Hyperledger and other private blockchains rely on less computationally intensive fault-tolerant (FT) consensus algorithms that protect against node breakdown but not malicious behavior. Exonum offers both BFT consensus and high transaction throughput because it is written in Rust, one of the fastest programming languages. Furthermore, Rust offers memory safety, which eliminates many vulnerabilities that are commonly exploited by hackers.

Table 2. Comparison of blockchain frameworks

Criteria | Exonum [50] | Hyperledger Fabric [51] | Ethereum [52]
Read access | Public | Private | Public
Write access | Private | Private | Public
Consensus | Byzantine fault-tolerant (BFT) | Fault-tolerant (FT) | Proof of work (PoW)
Transactions per second (TPS) | ~3,000 | ~3,000 | ~15
Smart contracts | Yes (Rust, Java) | Yes (Go, Java) | Yes (Solidity)
Light clients | Yes | No | Yes
Public blockchain anchoring | Yes | No | NA
Open source | Yes | Yes | Yes

NA: not applicable.

Privacy-Preserving Technologies

Table 3 shows a comparison of privacy-preserving technologies that all have been applied to secure genomic data [53]. Fully homomorphic encryption and secure multiparty computations enable computations on encrypted data that generate encrypted results. These encrypted results, when decrypted, correspond to the results of the same computation on plaintext data. However, fully homomorphic encryption is very slow and typically suffers from very large ciphertext expansion. The limitation of secure multiparty computation protocols is that they require transfers of very large data amounts during the computation. It is possible, however, to improve the performance of fully homomorphic encryption and secure multiparty computations significantly if they are optimized for specific use cases. Practical performance levels have been demonstrated for queries on genomic data [54,55] and genome-wide association studies (GWAS) [56].

Alternative technologies have drawbacks of their own. Intel Software Guard Extensions technology is a hardware-assisted approach that protects data privacy by executing computations inside private memory regions. It offers good performance but has been affected by vulnerabilities that can compromise data privacy [57]. Differential privacy methods protect data privacy by introducing randomness. However, obfuscation of computation results can complicate interpretation of studies [53].

NEBULA

In this section, concepts and design considerations outlined in the previous sections are illustrated by describing the technical implementation of Nebula, a decentralized genomic data generation, sharing, and analysis
platform. Nebula integrates the Arvados [42,43] bioinformatics platform (github.com/curoverse/arvados) with the Exonum [50] blockchain framework (github.com/exonum) and a fully homomorphic data encryption scheme (Figure 4).

Table 3. Comparison of privacy-preserving technologies

Criteria | Fully Homomorphic Encryption | Secure Multiparty Computations | Intel Software Guard Extensions | Differential Privacy
Principle | Computations (additions AND multiplications) on ciphertexts | Distributed computations on ciphertexts | Computations inside private memory regions | Introduction of randomness to data/results of computations
Computation time | Very slow | Slow | Fast | Fast
Memory usage | Very high | High | Low | Very low
Communication cost | High | Very high | Low | Low
Specific limitations | None | None | Vulnerabilities have been discovered; requires Intel CPUs | Noise makes interpretation of results more difficult

CPU: central processing unit.

Arvados has two core services: Keep and Crunch. Keep is a distributed content-addressable storage system that enables scalable storage of genomic big data, high throughput data access, and efficient data management. Crunch is a workflow management engine that enables flexible creation and parallelized execution of data analysis pipelines and generation of reproducible results. Arvados implements a distributed data storage and computing model that minimizes required data transfers. This helps address big data challenges, regulatory restrictions, and data privacy risks.

Utilization of a homomorphic data encryption scheme enables implementation of privacy-preserving queries on genomic data. The intention is to preserve data privacy by enabling investigators to query the whole database and discover their data of interest, without compromising the privacy of the queried data. In the future, it should be possible to extend the application of privacy-preserving technologies to GWAS and other computations.

The Nebula blockchain is an Exonum-based blockchain through which the Nebula network will be governed, consent will be documented, and the data will be secured. Exonum-based blockchains have three types of nodes: auditors, light clients, and validators. Auditors are full nodes that maintain a copy of the entire blockchain content and can generate transactions. Light clients also can generate transactions, but they replicate only information that is relevant to them instead of the whole blockchain content. Validators are permissioned nodes that verify transactions received from auditors and light clients and write new blocks to the blockchain. While the current implementation of Nebula
uses the Exonum framework, other permissioned blockchains, in particular Hyperledger Fabric, can be used as well.

Figure 4. Overview of the Nebula platform.

The Nebula network has four types of participants: data owners, network maintainers, data buyers, and storage and compute providers.

• Data owners can be private individuals or institutions. They will store encrypted genomic data in public or private clouds that are part of the Keep storage system. They will be able to control access to their data and receive payments to their wallets by operating light clients on the Nebula blockchain.

• Network maintainers are organizations that operate validator nodes on the Nebula blockchain. Validator nodes will collectively control data access by managing encrypted key shares, verifying transactions, and keeping track of data stored in Keep and computations executed by Crunch.

• Data buyers are researchers who wish to obtain access to genomic data. They will be operating auditor nodes to keep a local copy of
the metadata, which they will use to locate data stored in Keep, verify data integrity, and keep track of access permissions. Data buyers will be able to query homomorphically encrypted data, utilize smart contracts to acquire data access permissions from data owners, and use Crunch to run analysis pipelines.

• Storage and compute providers are data owners that operate private clouds, or third parties that offer storage and computing services (e.g., Google, Amazon, and Microsoft). They will form a federated cloud environment that hosts the Keep storage system and Crunch-managed containers within which computations are executed.

The development of Nebula is ongoing. Some parts of the platform, in particular, Arvados, have been fully implemented over the past few years and are already being deployed by various organizations. Other parts of Nebula, in particular, the homomorphic encryption schemes, are a relatively recent addition and are not yet fully integrated. A report on the progress of our work was published in a white paper [58]. Here we describe the implementation of Nebula in greater detail but also revise some previously made design choices.

Data Generation

Genomic data

Personal genome sequencing cost is a significant factor in preventing more widespread consumer adoption. Therefore, a key consideration in the design of the Nebula platform was to provide a mechanism to shift sequencing costs from data owners (e.g., consumers and biobanks) to data buyers (e.g., pharma and biotech companies). This is being implemented by enabling data buyers to query the Nebula database, identify data sets of interest, and pay the sequencing costs to generate and access genomic data (Figure 5).

To this end, the Nebula platform enables a data buyer to create a smart contract that specifies the blockchain addresses of data owners previously identified in a query and send cryptocurrency tokens to that smart contract. The data owners are notified that a buyer has offered to pay their sequencing costs. If a data owner accepts the offer by executing the smart contract, the deposited tokens are sent to a sequencing provider. Next, the data owner receives a saliva collection kit and submits a saliva sample to the sequencing facility. The sample is sequenced, and the genomic data are deposited on a Keep server specified by the data owner. Data hashes, along with blockchain addresses of all data owners and buyers, are written to the blockchain. The data buyer who paid the sequencing costs is permitted to access and analyze the data. The data owner receives interpretations of their genomic data and is able to share data access with additional data buyers.

Phenotypic data

Information about medical conditions and other traits is referred to as phenotypic data. These data are generated primarily through survey questions. The platform utilizes a phenotyping toolkit that maps plain-language survey responses to clinical descriptions in Human Phenotype Ontology (HPO) [59] format. Survey data can be verified using two approaches. First, comparing the incidence of medical conditions in the general population to the incidence observed in the platform’s data set will enable identification of survey results that deviate from the expected results. Second, survey data can be verified by referencing Electronic Health Records (EHRs) imported through the Fast Healthcare Interoperability Resources (FHIR) API.
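The first verification approach, comparing observed survey incidence with general-population incidence, can be illustrated with a minimal sketch. The article does not specify a statistical method, so the z-score test (normal approximation to the binomial), the function names, the condition labels, the prevalence figures, and the threshold below are all illustrative assumptions rather than part of the Nebula platform.

```python
import math
from typing import Dict, List


def incidence_z_score(reported_cases: int, respondents: int,
                      expected_prevalence: float) -> float:
    """Z-score of the observed incidence against an expected prevalence.

    Uses the normal approximation to the binomial distribution, which is
    an assumed method chosen for illustration only.
    """
    if respondents <= 0:
        raise ValueError("respondents must be positive")
    if not 0.0 < expected_prevalence < 1.0:
        raise ValueError("expected_prevalence must be strictly between 0 and 1")
    observed = reported_cases / respondents
    standard_error = math.sqrt(
        expected_prevalence * (1.0 - expected_prevalence) / respondents
    )
    return (observed - expected_prevalence) / standard_error


def flag_deviating_conditions(survey_counts: Dict[str, int], respondents: int,
                              prevalence: Dict[str, float],
                              threshold: float = 3.0) -> List[str]:
    """Return conditions whose reported incidence deviates strongly from expectation."""
    flagged = []
    for condition, cases in survey_counts.items():
        z = incidence_z_score(cases, respondents, prevalence[condition])
        if abs(z) > threshold:
            flagged.append(condition)
    return flagged


# Hypothetical numbers: 10,000 respondents, two made-up conditions.
population_prevalence = {"condition_a": 0.10, "condition_b": 0.01}
reported = {"condition_a": 1010, "condition_b": 300}
suspicious = flag_deviating_conditions(reported, 10_000, population_prevalence)
print(suspicious)
```

A flagged condition only signals that the survey responses merit a closer look, for example by cross-checking against imported health records; a platform's user base can legitimately differ from the general population.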
Figure 5. Genome sequencing subsidy payment on the blockchain.

Data Encryption

Privacy of genomic and phenotypic data is protected through client-side encryption by data owners and encryption key management by blockchain validator nodes. To enable data buyers to discover data prior to purchasing data access, the platform implements a lattice-based fully homomorphic encryption scheme. To this end, blockchain validator nodes generate public–private key pairs and construct a single collective public key (Figure 4). Data owners encrypt their survey responses and genetic variant lists with the collective public key and upload them to a Keep server. The homomorphic encryption scheme protects data privacy by enabling data buyers to execute Structured Query Language (SQL)-like queries on the homomorphically encrypted data. Files that contain raw sequencing data and are not used for queries are Advanced Encryption Standard (AES) encrypted. The AES keys are encrypted with validator public keys and bundled with the encrypted data.

Data Storage

Data

Genomic data are stored in Keep, a distributed content-addressable storage system that retrieves files based on their content. Addresses of files are generated through cryptographic digest of their content. Keep combines content-addressing with the distributed storage architecture of the Google File System [60]. Keep splits encrypted files into 64-megabyte blocks and stores them in an underlying object store or file system (Figure 6). The content addresses of the blocks are stored on the blockchain and are used to find data locations and check data integrity.
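The content-addressing idea described above can be sketched in a few lines. This is a toy in-memory illustration, not the Keep implementation: the `BlockStore` class, the helper names, and the choice of SHA-256 as the digest are assumptions made for the example, and only the 64-megabyte block size comes from the text.

```python
import hashlib
from typing import Dict, List

# Block size stated in the text (64 megabytes, taken here as 64 MiB).
BLOCK_SIZE = 64 * 1024 * 1024


def content_address(block: bytes) -> str:
    """Derive a block's address from its content (SHA-256 is an assumption)."""
    return hashlib.sha256(block).hexdigest()


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Split data into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class BlockStore:
    """Hypothetical in-memory stand-in for an object store or file system."""

    def __init__(self) -> None:
        self._blocks: Dict[str, bytes] = {}

    def put(self, block: bytes) -> str:
        address = content_address(block)
        self._blocks[address] = block
        return address

    def get(self, address: str) -> bytes:
        block = self._blocks[address]
        # Integrity check: the stored content must still hash to its address.
        if content_address(block) != address:
            raise ValueError("block content does not match its address")
        return block


def store_file(store: BlockStore, data: bytes,
               block_size: int = BLOCK_SIZE) -> List[str]:
    """Store a file block by block and return the ordered list of addresses."""
    return [store.put(block) for block in split_into_blocks(data, block_size)]


def load_file(store: BlockStore, manifest: List[str]) -> bytes:
    """Reassemble a file from its ordered block addresses, verifying each block."""
    return b"".join(store.get(address) for address in manifest)


# Small demonstration with a 1,000-byte block size so it runs instantly.
store = BlockStore()
payload = bytes(range(256)) * 10  # 2,560 bytes of stand-in "encrypted" data
manifest = store_file(store, payload, block_size=1000)
restored = load_file(store, manifest)
```

In this sketch the ordered address list (`manifest`) plays the role of the block hashes that the article says are recorded on the blockchain: anyone holding it can locate the blocks and detect tampering, because a modified block no longer hashes to its address.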
Figure 6. Data blocks are stored in Keep. Block hashes are stored on the blockchain.

Keep is designed for storing genomic and other types of biomedical big data. First, its content-addressing offers high-speed storage and retrieval by eliminating an indexing service, a potential bottleneck and point of failure, and enabling direct connections between the storage and compute subsystems. Second, content-addressing works well for data that are written to disk once and read many times. This is characteristic of genomic data, which do not change over time but are accessed frequently. Third, fixed-size data blocks allow scalable distributed storage of big data, and content-addressing enables easy file verification, which is particularly important for distributed databases.

Keep is designed to be a distributed, hybrid storage system. Data owners can choose to store their data in clouds such as Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure, or on private bare-metal servers. Decentralized file storage solutions such as InterPlanetary File System (IPFS), Sia, and Storj can potentially be supported if computing on stored data becomes possible. Data owners can register new, personal cloud instances or store their encrypted data in shared clouds. Based on phenotypic information, data sets that are likely to be analyzed together are stored in physical proximity, which minimizes slow and expensive data transfers.

As sequencing data are processed, different file formats are generated and stored in Keep. Typically, Keep stores FASTQ files that contain raw sequencing data (~200 gigabytes/genome), Binary Alignment Map files that store aligned sequencing reads (~100 gigabytes/genome), and Variant Call Format files that store genetic variants (~200 megabytes/genome). Additionally, Nebula uses the Compact Genome Format (CGF) to generate compact genomic data summaries. Genomes in the CGF format are represented by pointers referencing sequences in a tile library (Figure 7). CGF offers a consistent, standardized representation of genomic data that makes
different types of sequencing and genotyping data interoperable. The CGF representation is also very space efficient (~30 megabytes/genome), which facilitates file transfers and enables fast queries and efficient analysis.60

Figure 7. Simplified representation of a tile library and a Compact Genome Format (CGF) file. The rectangles represent tile variants at different positions, and the dotted line illustrates the tile composition of a specific genome.

Tabular phenotypic data generated through surveys and imports of EHRs are stored in physical proximity to the associated genomic data. In contrast to static genomic data, phenotypic information is much more dynamic and much smaller. This makes the Google File System architecture and content addressability unsuitable for phenotypic data. Therefore, phenotypic data files are stored as NoSQL (Not only SQL) documents.

Metadata
To organize data stored in Keep, Nebula stores metadata on the blockchain in a key-value store. When new data are added to Keep or existing data are modified, blockchain transactions are generated. Validator nodes verify these transactions, add new blocks to the blockchain, and update the key-value store. Storage of metadata on an immutable ledger helps secure the integrity of the decentralized Nebula database. To this end, multiple column families are implemented:

• Data ownership is registered by assigning each block content address the blockchain address of the data owner who added the block to Keep.
• Data locations are described by assigning each block content address the Uniform Resource Locator (URL) of a Keep server.
• Data integrity is verified by re-hashing data blocks and comparing their hashes with content addresses that are stored on the blockchain.
• Data buyer identities, including names and institutional affiliations, are verified, linked to blockchain addresses, and stored on the blockchain.
• Access permissions to the Nebula platform and data stored in Keep also are managed on the blockchain.
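The column families listed above can be pictured as a set of key-value mappings keyed by block content address or by blockchain address. The following sketch is purely illustrative: the class, its attribute names, the SHA-256 digest, and the example URL are assumptions, and an in-memory dictionary stands in for the blockchain-backed store.

```python
import hashlib


class MetadataStore:
    """In-memory stand-in for the blockchain key-value store (illustrative only)."""

    def __init__(self) -> None:
        # One dictionary per "column family"; the names are hypothetical.
        self.owner: dict[str, str] = {}         # content address -> owner's blockchain address
        self.location: dict[str, str] = {}      # content address -> Keep server URL
        self.buyers: dict[str, dict] = {}       # buyer blockchain address -> verified identity
        self.permissions: set[tuple[str, str]] = set()  # (buyer address, content address)

    def register_block(self, block: bytes, owner_address: str, keep_url: str) -> str:
        """Record ownership and location of a block under its content address."""
        address = hashlib.sha256(block).hexdigest()
        self.owner[address] = owner_address
        self.location[address] = keep_url
        return address

    def verify_block(self, address: str, block: bytes) -> bool:
        """Re-hash a block and compare it with the registered content address."""
        return address in self.owner and hashlib.sha256(block).hexdigest() == address

    def grant_access(self, buyer_address: str, address: str) -> None:
        """Record that a verified buyer may access a block."""
        if buyer_address not in self.buyers:
            raise PermissionError("buyer identity has not been verified")
        self.permissions.add((buyer_address, address))


# Example usage with placeholder values.
metadata = MetadataStore()
block_address = metadata.register_block(b"encrypted-block", "owner-1", "https://keep.example.org")
```

Because each mapping is keyed by the content address, a single re-hash of a retrieved block is enough to confirm both that the block is intact and that the ownership and location records refer to it.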
Data Discovery
Utilization of fully homomorphic encryption is intended to address the privacy barrier to data sharing. It enables data owners to make their data available for discovery without privacy risks, while at the same time allowing data buyers to explore the database before purchasing data access to perform analyses.

To this end, data buyers will construct a SQL-like query and encrypt it with the collective public key that has been constructed by validator nodes and used to encrypt phenotypic information and genetic variant lists. The encrypted query is executed on homomorphically encrypted data and an encrypted result is generated. The query result is re-encrypted by the validator nodes under a public key provided by the data buyer and shared with the data buyer, who can then decrypt it with the corresponding private key. A query can return the number of data owners that matched the specified criteria, as well as their blockchain addresses. This enables data buyers to connect with data owners to pay sequencing costs or to purchase access to existing genomic data (Figure 8).

Figure 8. Secure data discovery through queries on homomorphically encrypted data.

Data Analysis
The Arvados container and pipeline management engine, Crunch, executes computations on data stored in Keep. Crunch implements a distributed computing model whereby workflows, and not the genomic data, are moved between cloud instances whenever possible. Highly distributed genomic data processing is possible because many intensive bioinformatics computations, such as alignment and variant calling, are performed on single genomes and are easily parallelizable. To this end, Crunch executes tasks inside Docker containers that are created physically close to the data locations in Keep and distributes computations between many processing units.
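The observation that per-genome tasks are easily parallelizable can be illustrated with a toy example. The sketch below does not use Crunch or Docker: a thread pool stands in for the containers, and the per-genome task (counting variant records in VCF-like text) and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def count_variants(variant_lines: list[str]) -> int:
    """Toy per-genome task: count the non-header lines of a VCF-like text."""
    return sum(1 for line in variant_lines if line and not line.startswith("#"))


def process_genomes(genomes: dict[str, list[str]], workers: int = 4) -> dict[str, int]:
    """Run the same task independently on every genome and collect the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each genome is processed in isolation, so no task waits on another.
        counts = pool.map(count_variants, genomes.values())
        return dict(zip(genomes.keys(), counts))


# Example usage with two tiny placeholder genomes.
example_genomes = {
    "genome-1": ["#CHROM\tPOS\tREF\tALT", "chr1\t100\tA\tG", "chr2\t5\tC\tT"],
    "genome-2": ["#CHROM\tPOS\tREF\tALT"],
}
variant_counts = process_genomes(example_genomes)
```

Because no task depends on another genome's data, the same pattern extends to workers placed next to the storage that holds each genome, which is the property the article relies on when moving workflows instead of data.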
Crunch ensures result reproducibility through standardization of computing workflows using Common Workflow Language (CWL),61 which enables connection of open-source and proprietary bioinformatics software into workflow pipelines that are flexible, portable, and scalable. Crunch can access CWL pipelines stored in public or private Git repositories such as GitHub.

CWL can be used to implement end-to-end bioinformatics analysis pipelines. Typically, CWL pipelines include common, computationally intensive "secondary analysis" tasks, such as alignment of sequencing reads to a reference genome and variant calling. However, "tertiary analysis" tasks, which often involve computing on genomic data sets rather than single genomes and are less standardized, can also be incorporated into CWL pipelines. Typical examples are statistical tests that are used in GWAS to identify correlations between genetic variants and phenotypes. For such tertiary analysis tasks, Nebula uses Lightning,62 a system for high-performance, in-memory computations on genomic data in the CGF. Lightning integrates into CWL pipelines and enables fast queries and execution of complex machine learning tasks on large genomic data sets.

CWL pipelines can also be used to analyze and interpret personal genomic (and phenotypic) data. First, users can build their own custom pipelines to analyze their personal data and share pipelines with each other using public Git repositories. Second, developers can build and monetize genomic apps. To this end, CWL pipelines can be stored in private repositories, and access by Crunch may require a smart contract-mediated token payment to the pipeline developer. The approach of bringing apps to the data facilitates protection of personal information.

Security
Homomorphic encryption can enable privacy-preserving queries for data discovery. However, most computations that are necessary for typical genomic data analysis workflows do not achieve practical runtimes when executed on homomorphically encrypted data. Therefore, other security mechanisms must be utilized.

Platform access control
To protect data owners and their data, data buyers are required to go through a partially decentralized, three-step permission process. The first step is data buyer authentication. Here, a blockchain validator node will verify a data buyer's personal and institutional identity. Blockchain addresses of verified data buyers will be added to the blockchain metadata store. Data buyers will then be able to connect to Nebula REST API servers and use Crunch to execute pipelines on data stored in Keep. Data buyer authentication will enable data owners to verify data buyer identity before agreeing to share data access. Furthermore, immutable storage of data buyer identities on the blockchain enables identification of data buyers who have violated consent agreements or have bypassed pipeline execution control.

Pipeline execution control
To protect data privacy, the platform design incorporates the ability to define approved bioinformatics tools and CWL workflows. The intent is to prevent data buyers from downloading genomic data or executing any computations that attempt to extract a large amount of information about individual data owners. This approach was chosen because it can provide a sufficient level of data privacy protection without significantly restricting data buyers.
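One simple way to picture pipeline execution control is an allow-list keyed by a digest of the workflow definition, so that only unmodified, approved workflows can run. The sketch below is an assumption about how such a check could look, not a description of the platform's mechanism: the class, the function names, and the use of SHA-256 are all hypothetical.

```python
import hashlib
from typing import Callable


def workflow_digest(workflow_text: str) -> str:
    """Identify a workflow by the digest of its exact text (SHA-256 is an assumption)."""
    return hashlib.sha256(workflow_text.encode("utf-8")).hexdigest()


class PipelineRegistry:
    """Hypothetical allow-list of approved workflow definitions."""

    def __init__(self) -> None:
        self._approved: set[str] = set()

    def approve(self, workflow_text: str) -> str:
        """Add a workflow to the allow-list and return its digest."""
        digest = workflow_digest(workflow_text)
        self._approved.add(digest)
        return digest

    def is_approved(self, workflow_text: str) -> bool:
        return workflow_digest(workflow_text) in self._approved


def run_if_approved(registry: PipelineRegistry, workflow_text: str,
                    runner: Callable[[str], object]) -> object:
    """Execute a workflow only if its exact text has been approved."""
    if not registry.is_approved(workflow_text):
        raise PermissionError("workflow is not on the approved list")
    return runner(workflow_text)


# Example usage: approve one workflow definition, then run it with a stand-in runner.
registry = PipelineRegistry()
approved_workflow = "cwlVersion: v1.0\nclass: Workflow\n"
registry.approve(approved_workflow)
result = run_if_approved(registry, approved_workflow, lambda text: "executed")
```

Keying the list by digest means that any edit to an approved workflow, however small, produces a different digest and is rejected until it has been reviewed again.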
Data access control
The first task in every CWL pipeline is to get access to the input data (Figure 9). Here, a data buyer executes a smart contract on the blockchain. The inputs are the data buyer's blockchain address and the content addresses of all data blocks of the input files. The data buyer also deposits tokens in the smart contract and defines a token payout for data access. When a data owner's light client synchronizes with the blockchain, the data owner is notified of the data access request. The data owner can decide whether to share data based on the offered payment and the identity of the data buyer. The data owner grants data access by executing the smart contract. The blockchain validator nodes then verify the integrity of the requested data stored in Keep by comparing data hashes with the content addresses stored on the blockchain and collectively re-encrypt the data under the data buyer's public key. Finally, the data buyer's access permission is registered on the blockchain and tokens are sent from the smart contract to the data owner's wallet. Crunch can now load decrypted data into a Docker container and begin pipeline execution.

Figure 9. Data access control and data purchases.

Governance
The Nebula blockchain can be used to enable Nebula network participants to collectively govern the network, in particular, to help maintain data protection. To this end, for example, Token-Curated Registries (TCRs)63 can be used to conduct elections that determine validator nodes or whitelist data analysis pipelines. TCRs are lists that are curated decentrally by token holders. Importantly, economic incentives drive the token holders to curate the list's contents judiciously. In brief,
network participants can cast votes whereby the weights of votes scale with their token holdings. Since token holders are invested in the network, they are incentivized to maintain its proper operation, which ensures data protection.

DISCUSSION
The obstacles that hinder personal genome sequencing and genomic data sharing have a significant impact on the progress of research into disease prevention, drug development, and other crucial aspects of human health. We described one approach to overcoming these obstacles, using a combination of multiple technologies. A number of challenges remain to be addressed regarding data privacy, data validation, data curation, and data economics.

Data privacy protection requires decentralization of data generation and further development of privacy-preserving technologies. Today, genomic data generation is limited to laboratories that own expensive sequencing machines operated by experienced technicians. Centralized genomic data generation leads to data privacy risks that may be averted if sequencing is decentralized. We anticipate that this will become possible soon as new technologies are being developed that would enable compact, affordable, and easy-to-operate sequencing machines.64 Data privacy protection is also impaired by current limitations of privacy-preserving technologies that do not allow complex computations on large data sets and require extensive optimization for every use case, which hinders effective data analysis. However, the practicality of privacy-preserving technologies has steadily increased over the past few years, and we anticipate continuing progress in the future.

Data validation and curation are another area of challenge. Validation of genomic data requires the assistance of the sequencing facilities that have produced the data. If the source of the genomic data is unknown, or the sequencing facility does not cooperate, genomic data cannot be validated. A possible solution to this problem is a model that compensates personal genomics companies and other genomic data producers for validating data authenticity. Data collected from different sources also must be made interoperable. It is particularly challenging to curate health records and other types of phenotypic data. However, standards such as FHIR are being developed very actively and have already enabled applications that can collect EHRs across different health systems.65

The idea of a personal data marketplace is very new and has not yet been implemented at scale. A personal data marketplace is likely to differ from traditional marketplaces in important ways. For example, data supply can be regarded as being unlimited because an individual can share data access with an unlimited number of data buyers. Personal data marketplaces also would be asymmetric, since individuals are likely to be unaware of the value of their personal data and are thus at risk of not being compensated fairly. The novelty of personal genomic data further compounds these challenges and makes market dynamics difficult to predict. We anticipate that future research into economics of data marketplaces will help answer these and other open questions.

Contributions: Dennis Grishin, Kamal Obbad, and Kevin Quinn wrote the article. Dennis Grishin, Kamal Obbad, and George Church developed the ideas described in the article. Dennis Grishin, Kamal Obbad, and Kevin Quinn are leading the development of the Nebula platform. Alexander Wait Zaranek, Ward Vandewege, Tom Clegg, Nico César, Preston Estep, and Mirza Cifric contributed to the development of Arvados. Alexander Wait Zaranek
and Sarah Wait Zaranek developed the Compact Genome Format and Lightning. All authors edited and/or reviewed the final article.

Acknowledgments
The authors would like to thank Armon Rahim and Nathaniel Tucker for their help in reviewing this article.

Conflict of Interest
All authors are employees, advisors, or collaborators of Nebula Genomics.

Funding Statement
Development of the Nebula platform is funded by Nebula Foundation.

REFERENCES
1. International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature. 2004 Oct 21;431(7011):931–45.
2. Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP) [Internet]. [cited 2018 Jan 11]. Available from: https://www.genome.gov/sequencingcostsdata/
3. Illumina Promises to Sequence Human Genome For $100—But Not Quite Yet. 2017. [cited 2018 Oct 10]. Available from: https://www.forbes.com/sites/matthewherper/2017/01/09/illumina-promises-to-sequence-human-genome-for-100-but-not-quite-yet/#672a5d72386d
4. Maurano MT, Humbert R, Rynes E, et al. Systematic localization of common disease-associated variation in regulatory DNA. Science. 2012 Sep 7;337(6099):1190–5.
5. Lindor NM, Thibodeau SN, Burke W. Whole-genome sequencing in healthy people. Mayo Clin Proc. 2017 Jan;92(1):159–72.
6. Berg JS, Adams M, Nassar N, et al. An informatics approach to analyzing the incidentalome. Genet Med. 2013 Jan;15(1):36–44.
7. Yang A, Palmer AA, de Wit H. Genetics of caffeine consumption and responses to caffeine. Psychopharmacology. 2010 Aug;211(3):245–57.
8. Mattar R, de Campos Mazo DF, Carrilho FJ. Lactose intolerance: Diagnosis, genetic, and clinical factors. Clin Exp Gastroenterol. 2012 Jul 5;5:113–21.
9. Freeman HJ. Risk factors in familial forms of celiac disease. World J Gastroenterol. 2010 Apr 21;16(15):1828–31.
10. O’Connell K, Knight H, Ficek K, et al. Interactions between collagen gene variants and risk of anterior cruciate ligament rupture. EJSS. 2015;15(4):341–50.
11. Tiziano FD, Palmieri V, Genuardi M, Zeppilli P. The role of genetic testing in the identification of young athletes with inherited primitive cardiac disorders at risk of exercise sudden death. Front Cardiovasc Med. 2016 Aug 26;3:28.
12. Bennett ER, Reuter-Rice K, Laskowitz DT. Genetic Influences in Traumatic Brain Injury: Chapter XI. In: Laskowitz D, Grant G, editors. Translational Research in Traumatic Brain Injury. Boca Raton, FL: CRC Press/Taylor and Francis Group; 2015.
13. Ginn SL, Amaya AK, Alexander IE, Edelstein M, Abedi MR. Gene therapy clinical trials worldwide to 2017: An update. J Gene Med. 2018 May;20(5):e3015.
14. Cardon LR, Harris T. Precision medicine, genomics and drug discovery. Hum Mol Genet. 2016 Oct 1;25(R2):R166–72.
15. Rojahn SY. Genomics could blow up the clinical trial. MIT Technology Review [Internet]. 2013 Nov 12 [cited 2018 Aug 25]; Available from: https://www.technologyreview.com/s/521496/genomics-could-blow-up-the-clinical-trial/
16. Herper M. Surprise! With $60 Million Genentech Deal, 23andMe Has A Business Plan [Internet]. Forbes. 2015 [cited 2017 Oct 1]. Available from: https://www.forbes.com/sites/matthewherper/2015/01/06/surprise-with-60-million-genentech-deal-23andme-has-a-business-plan/
17. Bloomberg. GlaxoSmithKline Is Acquiring a $300 Million Stake in 23andMe [Internet]. Fortune. [cited 2018 Aug 25]. Available from: http://fortune.com/2018/07/25/glaxosmithkline-23andme-gsk/
18. Ledford H. AstraZeneca launches project to sequence 2 million genomes. Nature. 2016 Apr 28;532(7600):427.
19. Herper M. Drug company consortium to sequence the genes of 500,000 Britons over next two years. Forbes Magazine [Internet]. 2018 Jan 8 [cited 2018 May 27]; Available from: https://www.forbes.com/sites/matthewherper/2018/01/08/drug-company-consortium-to-sequence-the-genes-of-500000-britons-over-next-two-years/
20. Khan R, Mittelman D. Consumer genomics will change your life, whether you get tested or not. Genome Biol. 2018 Aug 20;19(1):120.
21. Allyse MA, Robinson DH, Ferber MJ, Sharp RR. Direct-to-Consumer Testing 2.0: Emerging models of direct-to-consumer genetic testing. Mayo Clin Proc. 2018 Jan;93(1):113–20.
22. Marshall DA, Gonzalez JM, Johnson FR, et al. What are people willing to pay for whole-genome sequencing information, and who decides what they receive? Genet Med. 2016 Dec;18(12):1295–302.
23. Kolata G, Murphy H. The golden state killer is tracked through a thicket of DNA, and experts shudder. The New York Times [Internet]. 2018 Apr 27 [cited 2018 Aug 21]; Available from: https://www.nytimes.com/2018/04/27/health/dna-privacy-golden-state-killer-genealogy.html
24. Ducharme J. A major drug company now has access to 23andMe’s genetic data. Should you be concerned? Time [Internet]. 2018 Jul 26 [cited 2018 Aug 21]; Available from: http://time.com/5349896/23andme-glaxo-smith-kline/
25. Laestadius LI, Rich JR, Auer PL. All your data (effectively) belong to us: Data practices among direct-to-consumer genetic testing firms. Genet Med. 2017 May;19(5):513–20.
26. Bloss CS, Ornowski L, Silver E, et al. Consumer perceptions of direct-to-consumer personalized genomic risk assessments. Genet Med. 2010 Sep;12(9):556–66.
27. Sanderson SC, Brothers KB, Mercaldo ND, Clayton EW, Antommaria AHM, Aufox SA, et al. Public attitudes toward consent and data sharing in Biobank Research: A large multi-site experimental survey in the US. Am J Hum Genet. 2017 Mar 2;100(3):414–27.
28. Visscher PM, Wray NR, Zhang Q, et al. 10 years of GWAS discovery: Biology, function, and translation. Am J Hum Genet. 2017 Jul 6;101(1):5–22.
29. Popejoy AB, Fullerton SM. Genomics is failing on diversity. Nature. 2016 Oct 13;538(7624):161–4.
30. Lawler M, Maughan T. From Rosalind Franklin to Barack Obama: Data sharing challenges and solutions in genomics and personalised medicine. New Bioeth. 2017 Apr;23(1):64–73.
31. Feltus FA, Breen JR 3rd, Deng J, et al. The widening gulf between genomics data generation and consumption: A practical guide to big data transfer technology. Bioinform Biol Insights. 2015 Sep 23;9(Suppl 1):9–19.
32. Majumder MA, Cook-Deegan R, McGuire AL. Beyond our borders? Public resistance to global genomic data sharing. PLoS Biol. 2016 Nov;14(11):e2000206.
33. Global Alliance for Genomics and Health. GENOMICS. A federated ecosystem for sharing genomic, clinical data. Science. 2016 Jun 10;352(6291):1278–80.
34. Weber GM, Murphy SN, McMurry AJ, et al. The Shared Health Research Information Network (SHRINE): A prototype federated query tool for clinical data repositories. J Am Med Inform Assoc. 2009 Sep;16(5):624–30.
35. Raisaro JL, Troncoso-Pastoriza J, Misbach M, et al. MedCo: Enabling secure and privacy-preserving exploration of distributed clinical and genomic
data. IEEE/ACM Trans Comput Biol Bioinform [Internet]. 2018 Jul 13; Available from: http://dx.doi.org/10.1109/TCBB.2018.2854776
36. Bater J, Elliott G, Eggen C, Goel S, Kho A, Rogers J. SMCQL: Secure querying for federated databases. Proceed VLDB Endowment. 2017 Feb;10(6):673–84.
37. Chen F, Wang S, Jiang X, et al. PRINCESS: Privacy-protecting rare disease International Network Collaboration via encryption through software guard extensions. Bioinformatics. 2017 Mar 15;33(6):871–8.
38. Molteni M, Allain R, Chen S, Thompson A, Simon M, Gonzalez R. Genos will sequence your genes—And help you sell them to science. Wired [Internet]. 2016 Dec 15 [cited 2018 Oct 6]; Available from: https://www.wired.com/2016/12/genos-will-sequence-genes-help-sell-science/
39. Lin P. Blockchain: The missing link between genomics and privacy? Forbes [Internet]. 2017 May 8 [cited 2018 Oct 6]; Available from: https://www.forbes.com/sites/patricklin/2017/05/08/blockchain-the-missing-link-between-genomics-and-privacy/
40. Brown KV. Share your DNA, get shares: Startup files an unusual offering. Bloomberg News [Internet]. 2018 Oct 5 [cited 2018 Oct 6]; Available from: https://www.bloomberg.com/news/articles/2018-10-05/illumina-backed-startup-asks-sec-to-let-it-pay-people-for-dna
41. Leipzig J. A review of bioinformatic pipeline frameworks. Brief Bioinform. 2017 May 1;18(3):530–6.
42. Arvados Documentation [Internet]. [cited 2018 Oct 10]. Available from: doc.arvados.org
43. Zaranek AW, Clegg T, Vandewege W, Church GM. Free factories: Unified infrastructure for data intensive web services. Proc USENIX Annu Tech Conf. 2008 May 1;2008:391–404.
44. DNAstack Documentation [Internet]. [cited 2018 Oct 9]. Available from: https://docs.dnastack.com/java-sdk/
45. Seven Bridges Documentation [Internet]. [cited 2018 Oct 10]. Available from: docs.sevenbridges.com/docs
46. DNAnexus Documentation [Internet]. [cited 2018 Oct 10]. Available from: wiki.dnanexus.com
47. Goecks J, Nekrutenko A, Taylor J, Galaxy Team. Galaxy: A comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010 Aug 25;11(8):R86.
48. Chaterji S, Koo J, Li N, Meyer F, Grama A, Bagchi S. Federation in genomics pipelines: Techniques and challenges. Brief Bioinform [Internet]. 2017 Aug 29; [cited 2018 Oct 10]. Available from: http://dx.doi.org/10.1093/bib/bbx102
49. Workflow Execution Service (WES) API [Internet]. Github; [cited 2018 Oct 11]. Available from: https://github.com/ga4gh/workflow-execution-service-schemas
50. Exonum Documentation [Internet]. [cited 2018 Oct 10]. Available from: exonum.com/doc
51. Androulaki E, Barger A, Bortnikov V, et al. Hyperledger fabric: A distributed operating system for permissioned blockchains. In: Proceedings of the Thirteenth EuroSys Conference. New York: ACM; 2018. pp. 30:1–30:15. (EuroSys ’18).
52. Wood G. Ethereum: A secure decentralised generalised transaction ledger. 2014. [Internet]. [cited 2018 Oct 10]. Available from: https://ethereum.github.io/yellowpaper/paper.pdf
53. Aziz MMA, Sadat MN, Alhadidi D, et al. Privacy-preserving techniques of genomic data-a survey. Brief Bioinform [Internet]. 2017 Nov 7; [cited 2018 Oct 10]. Available from: http://dx.doi.org/10.1093/bib/bbx139
54. Çetin GS, Chen H, Laine K, et al. Private queries on encrypted genomic data. BMC Med Genomics. 2017 Jul 26;10(Suppl 2):45.
55. Sousa JS, Lefebvre C, Huang Z, et al. Efficient and secure outsourcing of genomic data storage. BMC Med Genomics. 2017 Jul 26;10(Suppl 2):46.
56. Cho H, Wu DJ, Berger B. Secure genome-wide association analysis using multiparty