Accelerating Genomic Data Generation and Facilitating
Genomic Data Access Using Decentralization, Privacy-Preserving
Technologies and Equitable Compensation
Dennis Grishin,1,2,3 Kamal Obbad,1 Preston Estep,1,3 Kevin Quinn,1
Sarah Wait Zaranek,3 Alexander Wait Zaranek,1,3 Ward Vandewege,1,3
Tom Clegg,3 Nico César,3 Mirza Cifric,1,3 George Church1,2,3

Authors
1 Nebula Genomics, Inc., San Francisco, USA; 2 Department of Genetics, Harvard Medical School, Boston, USA;
3 Veritas Genetics, Inc., Danvers, USA.

Corresponding Author
Dennis Grishin, Nebula Genomics Inc., 73 Sumner Street, #401, San Francisco, CA 94103, USA;
dgrishin@g.harvard.edu

Keywords: Arvados, Blockchain, Data Privacy, Data Sharing, DNA Sequencing, Genomics, Homomorphic
Encryption, Nebula

Category: Use Cases/Pilots/Methodologies

In the years since the first human genome was sequenced at a cost of over $3 billion,
technological advancements have driven the price below $1,000, making personal genome
sequencing affordable to many people. Personal genome sequencing has the potential to enable
better disease prevention, more accurate diagnoses, and personalized therapies. Furthermore,
sharing genomic data with researchers promises identification of the causes of many diseases
and the development of new therapies. However, sequencing costs, data privacy concerns,
regulatory restrictions, and technical challenges impede the growth of genomic data and hinder
data sharing.

In this article, we propose that these challenges can be addressed by combining decentralized
system design, privacy-preserving technologies, and an equitable compensation model in a
platform that vests control over data with individual owners; ensures transparency and
privacy; facilitates regulatory compliance; minimizes expensive data transfers; and shifts
the sequencing costs from consumers, patients, and biobanks to researchers in industry and

Blockchain in Healthcare Today™   ISSN 2573-8240 online   https://doi.org/10.30953/bhty.v1.34

Figure 1—Human genome sequencing cost, 2001–2017.
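As a rough illustration of the trend in Figure 1, the sketch below estimates how often the cost
would have to halve to fall from about $3 billion to about $1,000. It assumes a perfectly smooth
exponential decline over the 16 years that Figure 1 spans (2001–2017). Both the smooth-decline
model and the pairing of those two prices with that time span are simplifying assumptions made
for this example, and the function names are hypothetical.

```python
import math

def halving_time_years(cost_start, cost_end, years):
    """Years per halving under a constant-rate exponential decline."""
    halvings = math.log2(cost_start / cost_end)
    return years / halvings

def years_to_reach(cost_now, cost_target, halving_time):
    """Years until cost_target is reached if the same halving time persists."""
    return math.log2(cost_now / cost_target) * halving_time

# Assumed endpoints: ~$3 billion at the start and ~$1,000 at the end of a 16-year span.
t_half = halving_time_years(3e9, 1e3, 16)
print(f"halving time: {t_half:.2f} years")
print(f"years from $1,000 to $100: {years_to_reach(1e3, 1e2, t_half):.1f}")
```

Under these assumptions the cost halves roughly every nine months, and a further drop from $1,000
to $100 would take about two and a half years. The real decline was not this regular, so the
numbers indicate scale only.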

academia. We exemplify this by describing the implementation of Nebula, a distributed genomic
data generation, sharing, and analysis platform.

The Human Genome Project sequenced and assembled the first human reference genome at a cost of
over $3 billion.1 Since then, development of next-generation sequencing technology has resulted
in exponentially decreasing sequencing cost (Figure 1).2 Today, the sequencing of a whole human
genome costs less than $1,000. This price is projected to drop to $100 in the next few years.3
The exponentially decreasing DNA sequencing costs have made personal genome sequencing
affordable to patients as well as healthy individuals.

Personal genome sequencing is becoming more common as prices decline, but most genetic tests to
date have been performed using DNA hybridization microarrays. These tests are referred to as
genotyping and they assess the presence or absence of genetic variants associated with certain
traits. At a cost of less than $100, genotyping typically reads out only ~0.02% of the human
genome, at predefined positions, often missing health-relevant genetic variants that must be
reported. In addition, variant identification at a small number of positions does not allow
discovery of novel variants, including those that cause disease; the majority of these variants
are distributed throughout the genome and remain undiscovered.4 This limits the usefulness of
genotyping data to researchers.

OPPORTUNITIES
As genomic sequencing becomes more affordable, it opens up opportunities for individuals as well
as researchers in academia and industry.


Personal genome sequencing can support data-driven decision-making for health-related issues.
Studies have estimated that ~2% of people carry genetic variants that cause or predispose them
to a wide variety of diseases at various levels of severity, the majority of which can be
prevented or treated.5 In addition, every parent carries, on average, approximately five genetic
variants that might cause diseases in offspring if the other parent carries the same variant.6
The presence of certain genetic variants also has been associated with adverse effects for ~7%
of Food and Drug Administration-approved drugs.5 Personal genome sequencing can also help
healthy individuals make better lifestyle choices. For example, genetic variants have been shown
to cause sensitivities to certain nutrients7–9 and to increase risks of sports-related
injuries.10–12 In the future, advancement in understanding human genetics will make personal
genome sequencing more insightful, while correcting pathological genetic variants will become
possible as more and more gene therapies enter clinical trials.13

Researchers study genomic data sets to identify genetic variants that cause diseases. This
enables the research and development of therapies targeting disease-associated genes with
increasing specificity. Genomics-guided therapeutic discovery has been applied successfully to
many types of cancers, rare genetic diseases, and, increasingly, common complex diseases.14
Furthermore, genomics-guided patient cohort recruiting can reduce the failure rate of clinical
trials by enriching for likely responders and reducing adverse reactions. This approach to
clinical trials promises to reduce surging drug development costs and lead to more drugs
reaching the market and benefiting patients.15

These opportunities are recognized by the biopharma industry. For example, the leading personal
genomics company 23andMe received $60 million from Genentech16 and $300 million from
GlaxoSmithKline17 for access to genotyping data collected from its customers. Other biopharma
companies have launched their own sequencing projects. AstraZeneca announced it would sequence
2 million human genomes,18 and Regeneron is leading a $100 million consortium to sequence
approximately 500,000 samples collected by the UK Biobank.19

CHALLENGES
Multiple obstacles hinder the realization of opportunities offered by personal genomics. Many
people are deterred by the costs of personal genome sequencing, as well as concerns over genomic
data privacy. Research is hampered by the resultant scarcity of genomic data and is further
hindered by difficulties with data access.

In 2018, the number of genotyped people surpassed 10 million and is expected to grow to more
than 100 million by 2021.20 This growth is driven by a combination of factors, notably consumer
interest in ancestry analysis coupled with a decrease in genotyping costs below $100.21 In
contrast, consumer interest in whole genome sequencing has grown slowly due to a significantly
higher cost. A recent survey revealed that only ~3% of people are willing to pay >$1,000 for
whole genome sequencing.22 For the majority of consumers, whose primary interest in the area can
best be described as nonmedical “infotainment,” the benefits of sequencing over genotyping do
not justify the significantly higher cost.

At the same time, the surge in popularity of genetic testing, forensic utilization of genetic
databases,23 and the purchase of genetic data by biopharma companies24 have increased consumer
and media attention to genetic


data privacy. Studies show that privacy concerns are legitimate, as data sharing policies of
many personal genomics companies do not fulfill transparency guidelines with regard to the
confidentiality or sharing of customer genetic data.25 These developments are likely to
exacerbate reported privacy concerns over genetic data26,27 and deter personal genome
sequencing.

For researchers, low adoption of personal genome sequencing has resulted in low availability of
genomic data. According to estimates, only ~500,000 human genomes had been sequenced by 2017.3
This is detrimental for research because very large genomic data sets are necessary to find
links between genetic variants and traits, such as disease predispositions. Finding such links
is difficult because most traits are the product of complex interactions of many genetic
variants, while the effects of individual genetic variants are, on average, very small.28 Low
diversity of genomic data sets further complicates the search for links between genetics and
disease.29

The scarcity of genomic data is exacerbated by difficulty in data access due to fragmentation of
genomic data across proprietary data silos.30 Data sharing is further hindered by the large size
of genomic data, which impedes data transfer over networks.31 In addition to logistic and
technical challenges, data access is often complicated by restrictive government regulations
that hinder data sharing.32 Low availability of genomic data combined with data silos also
results in high prices, making genomic data unaffordable to many researchers.

PREVIOUS WORK
Solutions to the challenges outlined above have been proposed previously. Federated data storage
systems have been implemented to facilitate genomic data sharing, privacy-preserving computing
has been utilized to protect genomic data privacy, and different compensation models have been
explored to incentivize genomic data sharing.

Genomic Data Sharing
The GA4GH Beacon Project33 and i2b2 SHRINE34 are two of the most advanced systems for biomedical
data sharing. Both are networks that enable participating institutions to connect their genomic
(and clinical) databases and process queries about the presence of genetic variants and traits,
including medical conditions. This federated model minimizes expensive data transfers and
enables institutions to retain control of their data. This addresses privacy, regulatory, and
technical challenges that are associated with centralized storage and transfers of genomic data.

However, there are limitations. First, functionality is currently limited to simple queries.
Orchestrated, distributed computations required for data processing and analysis are currently
not supported. Second, participation is limited to academic research institutions and hospitals.
There are no patient- or consumer-focused portals that would enable individuals to easily
contribute their personal genomic data. Third, decentralized governance and compensation
mechanisms have not been implemented.

Genomic Data Protection
Distributed genomic data storage and computing can help protect genomic data privacy. However,
data owners cannot always maintain in-house servers and therefore they often must outsource data
storage and computing to third parties, such as cloud service providers. To protect the privacy
of genomic data that are shared with untrusted third parties, encryption-based


privacy-preserving techniques have been adopted for genomics. These techniques enable third
parties to execute computations and return results without having access to plaintext genomic
data.

Privacy-preserving techniques have been applied previously to distributed medical and genomic
databases. For example, MedCo integrates with the i2b2 SHRINE framework and uses a homomorphic
data encryption scheme to enable outsourcing of genomic data storage and query execution to
untrusted third parties.35 Another example is the Secure Multi Party Query Language framework
that implements similar functionality and privacy guarantees using secure multiparty
computations.36 Data can also be protected using trusted hardware. An example is the PRINCESS
framework that executes computations on genomic data inside protected memory regions of Intel
microprocessors.37

Compensation Models
Over the past few years, personal genomics companies have explored different models to
compensate individuals for contributing their personal genomic data to research studies. In
2016, Genos offered to help its customers sell their genomic data to researchers.38 A similar
model that uses a cryptocurrency instead of fiat money was adopted by EncrypGen in 2017.39 Most
recently, LunaDNA announced that it would compensate genomic data contributors with company
stock.40 These models are similar in that individuals who want to participate must already own
their personal genomic data, or choose to purchase genetic testing in the expectation that they
will be rewarded later for sharing the data.

PERSONAL GENOMICS 2.0
The traditional model for genomic data generation and sharing that has been adopted by most
personal genomics companies contributes to the challenges described in the previous sections.
This model requires consumers to pay for genetic testing and result interpretation, while
personal genomics companies often take ownership of the generated genomic data and sell it to
biopharma companies (Figure 2). Consumers must therefore bear the costs and relinquish ownership
and control of their genomic data, which discourages genetic testing. In addition, this model
promotes genomic data fragmentation across private data silos, which hampers data access and
increases data prices.

We propose to combine and extend previous work on genomic data sharing networks,
privacy-preserving technologies, and compensation models to create a new model for personal
genomics that may overcome these challenges (Figure 3).

First, the functionality of genomic data sharing networks must be extended beyond simple
queries. This requires a network that can be integrated with a full-fledged bioinformatics
platform that supports genomic data processing and analysis. Implementing this functionality
would bundle fragmented genomic data and make it available for analysis on a single network,
thereby facilitating data access for researchers.

Second, the data sharing network must expand beyond research institutions and must be accessible
to individuals who want to share their personal genomic data. However, the resulting network
decentralization will necessitate a more democratic governance model. This potentially can be
achieved by integrating blockchain technology, which holds the promise of enabling
decentralized, self-governing networks.

Third, the privacy of genomic data must be protected. Data access control on the blockchain


Figure 2. The traditional model for genomic data generation and sharing.

can ensure transparent consent management, while privacy-preserving technologies can help
protect shared genomic data. Together with the distributed computing model that “brings
algorithms to the data,” these technologies can enable network participants to retain ownership
and control of their genomic data, thereby reducing privacy concerns and incentivizing data
sharing.

Fourth, genome sequencing and data sharing also must be incentivized by implementing subsidy and
compensation mechanisms. The decentralized data sharing model can facilitate this, as it enables
researchers to connect directly with individuals with traits of interest, subsidize their genome
sequencing costs, and compensate them for data sharing. Elimination of middlemen also may result
in a reduction in genomic data prices and thus empower more researchers to access large genomic
data sets.

DESIGN CONSIDERATIONS
To implement a system as outlined in the previous section, one must integrate a bioinformatics
platform that supports distributed data storage and computing with a suitable blockchain
framework, as well as with techniques for privacy-preserving computing. Here, we review and
evaluate existing options.

Bioinformatics Platforms
Bioinformatics platforms have been developed to facilitate organization of genomic data; to
enable parallelized, high-performance computing with support for complex dependencies; and to
allow


Figure 3—Alternative model for personal genomics that may overcome challenges.

a modular pipeline design that is flexible and ensures reproducible results.41 Table 1 shows a
comparison of popular bioinformatics platforms.

The development of bioinformatics platforms has been driven by exponentially growing genomic
data and marked by adoption of multiple computing trends. Storage and processing of genomic data
have moved from local servers to remote clouds. This has enabled scalable data storage and
computing and facilitated shared access to genomic data sets. To scale beyond single clouds,
efforts are being made to create federated cloud environments that could enable distributed data
storage and computing.48,49 Furthermore, the growth of genomic data and development of new
bioinformatics tools that must be integrated into workflows are driving the development of
standardized workflow description languages, containerization of computing environments, and
utilization of standardized application programming interfaces (APIs).

Based on these considerations, Arvados and DNAstack appear to be suitable choices for the
proposed genomic data sharing platform. Both platforms have an API-focused architecture and data
sharing functionality. DNAstack integrates with the GA4GH Beacon Network, while Arvados supports
platform-agnostic, federated cloud environments and has an open-source codebase.

Blockchain Frameworks
Blockchain technology has three use cases in the proposed system. First, the need to provide
transparent consent management can be addressed by the ability of blockchains to store data
access permissions on an


Table 1. Comparison of bioinformatics platforms
Criteria              Arvados42,43        DNAstack44          Seven Bridges45  DNAnexus46  Galaxy47
Hardware              Federated clouds    Google Cloud with   Clouds           Clouds      Local servers
                      and servers         Beacon Network
                                          integration
Pipeline design       API-based; web GUI  API-based; web GUI  Web GUI          Web GUI     Web GUI
Containers            Yes                 Yes                 Yes              Yes         Yes
Workflow language     CWL                 WDL                 CWL              Custom      Custom
Open source           Yes                 No                  No               No          Yes
Platform launch year  2013                2014                2012             2010        2005
API: application programming interface; CWL: Common Workflow Language; GUI: graphical (rather than textual) user interface;
WDL: Workflow Description Language.

immutable public ledger. Second, blockchains can enable implementation of decentralized systems
governed by network participants. Third, an immutable ledger can facilitate verification of the
integrity of decentrally stored data.

Based on these use cases, one can create a set of requirements that a suitable blockchain
framework must fulfill. First, consent management requires that the identities of researchers
who request access to data are known to data owners. To this end, network access must be limited
to data buyers whose identity has been verified. Therefore, consent management requires a
blockchain that supports permissioned access.

Second, a large, decentralized data marketplace requires smart contract functionality and high
transaction throughput. Private blockchains can achieve higher transaction throughputs than
public blockchains because the ability to write transactions to the blockchain is limited to a
group of permissioned validator nodes. However, this makes private blockchains more centralized
and less dependable.

Based on these requirements, permissioned blockchain frameworks such as Exonum and Hyperledger
Fabric appear most suitable (Table 2). Hyperledger Fabric has been more widely adopted, but
Exonum offers transparency and security that are comparable to public blockchains.

First, Exonum-based blockchains offer public read access but restrict write access to selected
validator nodes. Because read access to the blockchain is public, transaction audits do not rely
on trusted parties. Exonum transactions are verified in real time by all nodes. Thus, all
network participants are able to audit the blockchain state collectively.

Second, Exonum supports anchoring of transaction logs in the Bitcoin blockchain. Hashes of the
Exonum blockchain state are periodically written to the Bitcoin blockchain, so even if all
permissioned Exonum nodes collude, the transaction history cannot be falsified unless the
attacker succeeds in compromising the Bitcoin blockchain as well.

Third, Exonum uses a Byzantine fault-tolerant (BFT) consensus algorithm that protects against


Table 2. Comparison of blockchain frameworks
Criteria                       Exonum50                        Hyperledger Fabric51  Ethereum52
Read access                    Public                          Private               Public
Write access                   Private                         Private               Public
Consensus                      Byzantine fault-tolerant (BFT)  Fault-tolerant (FT)   Proof of work (PoW)
Transactions per second (TPS)  ~3,000                          ~3,000                ~15
Smart contracts                Yes (Rust, Java)                Yes (Go, Java)        Yes (Solidity)
Light clients                  Yes                             No                    Yes
Public blockchain anchoring    Yes                             No                    NA
Open source                    Yes                             Yes                   Yes
NA: not applicable.
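The consent-management and anchoring ideas described for permissioned blockchains can be
sketched in a few lines. The example below is a minimal illustration, not Exonum's data model or
API: the class ConsentLedger, the record fields, and the function matches_anchor are all
hypothetical, and a single SHA-256 hash chain stands in for a real blockchain state. It shows
why a state hash stored on an external ledger exposes later changes to the permission history.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first entry

def entry_hash(prev_hash, record):
    """Hash a consent record together with the hash of the previous entry."""
    payload = json.dumps({"prev": prev_hash, "record": record}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class ConsentLedger:
    """Append-only, hash-chained list of data access permissions (illustrative only)."""

    def __init__(self):
        self.records = []

    def append(self, owner, researcher, action):
        """Add a grant or revoke record and return the new state hash."""
        self.records.append({"owner": owner, "researcher": researcher, "action": action})
        return self.state_hash()

    def state_hash(self):
        """Recompute the chained hash over every record, oldest first."""
        h = GENESIS
        for record in self.records:
            h = entry_hash(h, record)
        return h

def matches_anchor(ledger, anchored_hash):
    """True if the ledger's current history reproduces the externally stored hash."""
    return ledger.state_hash() == anchored_hash

ledger = ConsentLedger()
ledger.append("owner-1", "researcher-A", "grant")
anchor = ledger.append("owner-1", "researcher-A", "revoke")  # hash that would be published externally
print(matches_anchor(ledger, anchor))   # history unchanged
ledger.records[1]["action"] = "grant"   # simulate colluding nodes rewriting history
print(matches_anchor(ledger, anchor))   # rewritten history no longer reproduces the anchor
```

The first check succeeds and the second fails. Rewriting the history undetected would therefore
also require replacing the externally stored hash.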

malicious behavior of permissioned nodes. In contrast, Hyperledger Fabric and other private
blockchains rely on less computationally intensive fault-tolerant (FT) consensus algorithms that
protect against node breakdown but not malicious behavior. Exonum offers both BFT consensus and
high transaction throughput because it is written in Rust, one of the fastest programming
languages. Furthermore, Rust offers memory safety, which eliminates many vulnerabilities that
are commonly exploited by hackers.

Privacy-Preserving Technologies
Table 3 shows a comparison of privacy-preserving technologies, all of which have been applied to
secure genomic data.53 Fully homomorphic encryption and secure multiparty computations enable
computations on encrypted data that generate encrypted results. These encrypted results, when
decrypted, correspond to the results of the same computation on plaintext data. However, fully
homomorphic encryption is very slow and typically suffers from very large ciphertext expansion.
The limitation of secure multiparty computation protocols is that they require transfers of very
large amounts of data during the computation. It is possible, however, to improve the
performance of fully homomorphic encryption and secure multiparty computations significantly if
they are optimized for specific use cases. Practical performance levels have been demonstrated
for queries on genomic data54,55 and genome-wide association studies (GWAS).56

Alternative technologies have drawbacks of their own. Intel Software Guard Extensions technology
is a hardware-assisted approach that protects data privacy by executing computations inside
private memory regions. It offers good performance but has been affected by vulnerabilities that
can compromise data privacy.57 Differential privacy methods protect data privacy by introducing
randomness. However, obfuscation of computation results can complicate interpretation of
studies.53

NEBULA
In this section, concepts and design considerations outlined in the previous sections are
illustrated by describing the technical implementation of Nebula, a decentralized genomic data
generation, sharing, and analysis


Table 3. Comparison of privacy-preserving technologies
Criteria              Fully Homomorphic   Secure Multiparty       Intel Software Guard      Differential Privacy
                      Encryption          Computations            Extensions
Principle             Computations        Distributed             Computations inside       Introduction of
                      (additions AND      computations on         private memory            randomness to
                      multiplications)    ciphertexts             regions                   data/results of
                      on ciphertexts                                                        computations
Computation time      Very slow           Slow                    Fast                      Fast
Memory usage          Very high           High                    Low                       Very low
Communication cost    High                Very high               Low                       Low
Specific limitations  None                None                    Vulnerabilities have      Noise makes
                                                                  been discovered;          interpretation
                                                                  requires Intel CPUs       of results more
                                                                                            difficult
CPU: central processing unit.
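The property that decrypted results match the same computation on plaintext data can be shown
with a toy example. The sketch below uses the Paillier cryptosystem, which is only additively
homomorphic. It is swapped in here because it fits in a few lines; it is not the fully
homomorphic scheme compared in Table 3 (which also supports multiplications), and the text does
not state which scheme any of the cited systems use. The primes are far too small for real use,
and all function names are hypothetical.

```python
import math
import secrets

def keygen(p, q):
    """Build a toy Paillier key pair from two primes (far too small for real use)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, (lam, mu)

def encrypt(n, m):
    """Encrypt an integer m (0 <= m < n) under the public key n."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # random blinding factor in [1, n - 1]
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, priv, c):
    """Recover the plaintext from ciphertext c with the private key (lam, mu)."""
    lam, mu = priv
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n

def add_encrypted(n, c1, c2):
    """Multiplying two ciphertexts yields an encryption of the sum of the plaintexts."""
    return (c1 * c2) % (n * n)

# Hypothetical use: an untrusted server counts carriers of a variant without seeing any flag.
n, priv = keygen(1009, 1013)
carrier_flags = [1, 0, 1, 1, 0]               # plaintext values known only to data owners
ciphertexts = [encrypt(n, f) for f in carrier_flags]
total = encrypt(n, 0)
for c in ciphertexts:                          # server-side aggregation on ciphertexts
    total = add_encrypted(n, total, c)
print(decrypt(n, priv, total))                 # decrypted by the key holder
```

The decrypted total equals the plaintext sum of the flags (3 here), although the aggregating
party only ever handled ciphertexts.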

platform. Nebula integrates the Arvados42,43 bioinformatics platform
(github.com/curoverse/arvados) with the Exonum50 blockchain framework (github.com/exonum) and a
fully homomorphic data encryption scheme (Figure 4).

Arvados has two core services: Keep and Crunch. Keep is a distributed content-addressable
storage system that enables scalable storage of genomic big data, high-throughput data access,
and efficient data management. Crunch is a workflow management engine that enables flexible
creation and parallelized execution of data analysis pipelines and generation of reproducible
results. Arvados implements a distributed data storage and computing model that minimizes
required data transfers. This helps address big data challenges, regulatory restrictions, and
data privacy risks.

Utilization of a homomorphic data encryption scheme enables implementation of privacy-preserving
queries on genomic data. The intention is to preserve data privacy by enabling investigators to
query the whole database and discover their data of interest, without compromising the privacy
of the queried data. In the future, it should be possible to extend the application of
privacy-preserving technologies to GWAS and other computations.

The Nebula blockchain is an Exonum-based blockchain through which the Nebula network will be
governed, consent will be documented, and the data will be secured. Exonum-based blockchains
have three types of nodes: auditors, light clients, and validators. Auditors are full nodes that
maintain a copy of the entire blockchain content and can generate transactions. Light clients
also can generate transactions, but they replicate only information that is relevant to them
instead of the whole blockchain content. Validators are permissioned nodes that verify
transactions received from auditors and light clients and write new blocks to the blockchain.
While the current implementation of Nebula

Blockchain in Healthcare TodayTM         ISSN 2573-8240 online        https://doi.org/10.30953/bhty.v1.34
Page 11 of 23

Figure 4—Overview of the Nebula platform.

uses the Exonum framework, other permissioned blockchains, in particular Hyperledger Fabric, can be used as well.

The Nebula network has four types of participants: data owners, network maintainers, data buyers, and storage and compute providers.

• Data owners can be private individuals or institutions. They will store encrypted genomic data in public or private clouds that are part of the Keep storage system. They will be able to control access to their data and receive payments to their wallets by operating light clients on the Nebula blockchain.
• Network maintainers are organizations that operate validator nodes on the Nebula blockchain. Validator nodes will collectively control data access by managing encrypted key shares, verifying transactions, and keeping track of data stored in Keep and computations executed by Crunch.
• Data buyers are researchers who wish to obtain access to genomic data. They will be operating auditor nodes to keep a local copy of the metadata, which they will use to locate data stored in Keep, verify data integrity, and keep track of access permissions. Data buyers will be able to query homomorphically encrypted data, utilize smart contracts to acquire data access permissions from data owners, and use Crunch to run analysis pipelines.
• Storage and compute providers are data owners that operate private clouds, or third parties that offer storage and computing services (e.g., Google, Amazon, and Microsoft). They will form a federated cloud environment that hosts the Keep storage system and Crunch-managed containers within which computations are executed.

The development of Nebula is ongoing. Some parts of the platform, in particular, Arvados, have been fully implemented over the past few years and are already being deployed by various organizations. Other parts of Nebula, in particular, the homomorphic encryption schemes, are a relatively recent addition and are not yet fully integrated. A report on the progress of our work was published in a white paper.58 Here we describe the implementation of Nebula in greater detail but also revise some previously made design choices.

Data Generation

Genomic data
Personal genome sequencing cost is a significant factor in preventing more widespread consumer adoption. Therefore, a key consideration in the design of the Nebula platform was to provide a mechanism to shift sequencing costs from data owners (e.g., consumers and biobanks) to data buyers (e.g., pharma and biotech companies). This is being implemented by enabling data buyers to query the Nebula database, identify data sets of interest, and pay the sequencing costs to generate and access genomic data (Figure 5).

To this end, the Nebula platform enables a data buyer to create a smart contract that specifies the blockchain addresses of data owners previously identified in a query and send cryptocurrency tokens to that smart contract. The data owners are notified that a buyer has offered to pay their sequencing costs. If a data owner accepts the offer by executing the smart contract, the deposited tokens are sent to a sequencing provider. Next, the data owner receives a saliva collection kit and submits a saliva sample to the sequencing facility. The sample is sequenced, and the genomic data are deposited on a Keep server specified by the data owner. Data hashes, along with blockchain addresses of all data owners and buyers, are written to the blockchain. The data buyer who paid the sequencing costs is permitted to access and analyze the data. The data owner receives interpretations of their genomic data and is able to share data access with additional data buyers.

Phenotypic data
Information about medical conditions and other traits is referred to as phenotypic data. These data are generated primarily through survey questions. The platform utilizes a phenotyping toolkit that maps plain-language survey responses to clinical descriptions in Human Phenotype Ontology (HPO)59 format. Survey data can be verified using two approaches. First, comparing the incidence of medical conditions in the general population to the incidence observed in the platform’s data set will enable identification of survey results that deviate from the expected results. Second, survey data can be verified by referencing Electronic Health Records (EHRs) imported through the Fast Healthcare Interoperability Resources (FHIR) API.
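The first verification approach for survey data, comparing observed incidence against the general population, can be illustrated with a minimal sketch. The article does not name a statistical test, so the one-proportion z-score below (normal approximation to the binomial) and the default threshold of 3 are assumptions chosen for illustration. The function names are hypothetical and are not part of the Nebula platform.

```python
import math
from typing import Dict, List


def prevalence_z_score(cases: int, respondents: int, expected_rate: float) -> float:
    """z-score of the observed prevalence against an expected population rate.

    Uses the normal approximation to the binomial (an assumption of this sketch).
    """
    if respondents <= 0 or not 0.0 < expected_rate < 1.0:
        raise ValueError("need respondents > 0 and 0 < expected_rate < 1")
    observed_rate = cases / respondents
    standard_error = math.sqrt(expected_rate * (1.0 - expected_rate) / respondents)
    return (observed_rate - expected_rate) / standard_error


def flag_deviating_conditions(
    survey_counts: Dict[str, int],
    respondents: int,
    population_rates: Dict[str, float],
    threshold: float = 3.0,
) -> List[str]:
    """Return the conditions whose survey prevalence deviates from the
    population rate by more than `threshold` standard errors."""
    flagged = [
        condition
        for condition, cases in survey_counts.items()
        if abs(prevalence_z_score(cases, respondents, population_rates[condition])) > threshold
    ]
    return sorted(flagged)
```

A flagged condition only indicates that the survey responses merit a closer look, for example through the EHR cross-check described above. It does not by itself show that individual responses are wrong.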


Figure 5—Genome sequencing subsidy payment on the blockchain.

Data Encryption
Privacy of genomic and phenotypic data is protected through client-side encryption by data owners and encryption key management by blockchain validator nodes. To enable data buyers to discover data prior to purchasing data access, the platform implements a lattice-based fully homomorphic encryption scheme. To this end, blockchain validator nodes generate public–private key pairs and construct a single collective public key (Figure 4). Data owners encrypt their survey responses and genetic variant lists with the collective public key and upload them to a Keep server. The homomorphic encryption scheme protects data privacy by enabling data buyers to execute Structured Query Language (SQL)-like queries on the homomorphically encrypted data. Files that contain raw sequencing data and are not used for queries are Advanced Encryption Standard (AES) encrypted. The AES keys are encrypted with validator public keys and bundled with the encrypted data.

Data Storage

Data
Genomic data are stored in Keep, a distributed content-addressable storage system that retrieves files based on their content. Addresses of files are generated through cryptographic digest of their content. Keep combines content-addressing with the distributed storage architecture of the Google File System.60 Keep splits encrypted files into 64-megabyte blocks and stores them in an underlying object store or file system (Figure 6). The content addresses of the blocks are stored on the blockchain and are used to find data locations and check data integrity.
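The content-addressing scheme described above can be sketched in a few lines. This is an illustration under stated assumptions, not Keep's actual implementation: SHA-256 stands in for the unspecified "cryptographic digest", a plain dictionary stands in for the underlying object store, and the function names are hypothetical.

```python
import hashlib
from typing import Dict, List

BLOCK_SIZE = 64 * 1024 * 1024  # 64-megabyte blocks, as stated in the text


def content_address(block: bytes) -> str:
    """Address of a block is the digest of its content (SHA-256 is an assumption)."""
    return hashlib.sha256(block).hexdigest()


def put_file(data: bytes, store: Dict[str, bytes], block_size: int = BLOCK_SIZE) -> List[str]:
    """Split data into fixed-size blocks, store each block under its content
    address, and return the ordered list of addresses for the file."""
    addresses = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        address = content_address(block)
        store[address] = block  # identical blocks map to the same key
        addresses.append(address)
    return addresses


def get_file(addresses: List[str], store: Dict[str, bytes]) -> bytes:
    """Reassemble a file, re-hashing every block to check its integrity."""
    parts = []
    for address in addresses:
        block = store[address]
        if content_address(block) != address:
            raise ValueError(f"block {address} failed the integrity check")
        parts.append(block)
    return b"".join(parts)
```

Because the address is derived from the content, no separate index is needed to locate a block. Anyone holding the list of addresses, such as the copy recorded on the blockchain, can detect a modified block by re-hashing it.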


Figure 6—Data blocks are stored in Keep. Block hashes are stored on the blockchain.

Keep is designed for storing genomic and other types of biomedical big data. First, its content-addressing offers high-speed storage and retrieval by eliminating an indexing service, a potential bottleneck and point of failure, and enabling direct connections between the storage and compute subsystems. Second, content-addressing works well for data written to disk once and read many times, a characteristic of genomic data, which do not change over time but are accessed frequently. Third, fixed-size data blocks allow scalable distributed storage of big data, and content-addressing enables easy file verification, which is particularly important for distributed databases.

Keep is designed to be a distributed, hybrid storage system. Data owners can choose to store their data in clouds such as Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure, or on private bare-metal servers. Decentralized file storage solutions such as InterPlanetary File System (IPFS), Sia, and Storj can potentially be supported if computing on stored data becomes possible. Data owners can register new, personal cloud instances or store their encrypted data in shared clouds. Based on phenotypic information, data sets that are likely to be analyzed together are stored in physical proximity, which minimizes slow and expensive data transfers.

As sequencing data are processed, different file formats are generated and stored in Keep. Typically, Keep stores FASTQ files that contain raw sequencing data (~200 gigabytes/genome), Binary Alignment Map files that store aligned sequencing reads (~100 gigabytes/genome), and Variant Call Format files that store genetic variants (~200 megabytes/genome). Additionally, Nebula uses the Compact Genome Format (CGF) to generate compact genomic data summaries. Genomes in CGF are represented by pointers referencing sequences in a tile library (Figure 7). CGF offers a consistent, standardized representation of genomic data that makes different types of sequencing and genotyping data interoperable. The CGF representation is also very space efficient (~30 megabytes/genome), which facilitates file transfers and enables fast queries and efficient analysis.60

Figure 7—Simplified representation of a tile library and a Compact Genome Format (CGF) file. The rectangles represent tile variants at different positions and the dotted line illustrates the tile composition of a specific genome.

Tabular phenotypic data generated through surveys and imports of EHRs are stored in physical proximity with associated genomic data. In contrast to static genomic data, phenotypic information is much more dynamic and smaller than genomic data. This makes utilization of the Google File System and content addressability unsuitable. Therefore, phenotypic data files are stored as Not only SQL (NoSQL) documents.

Metadata
To organize data stored in Keep, Nebula stores metadata on the blockchain in a key-value store. When new data are added to Keep or existing data are modified, blockchain transactions are generated. Validator nodes verify these transactions, add new blocks to the blockchain, and update the key-value store. Storage of metadata on an immutable ledger helps secure the integrity of the decentralized Nebula database. To this end, multiple column families are implemented:

• Data ownership is registered by assigning each block content address the blockchain address of the data owner who added the block to Keep.
• Data locations are described by assigning each block content address the Uniform Resource Locator (URL) of a Keep server.
• Data integrity is verified by re-hashing data blocks and comparing their hashes with content addresses that are stored on the blockchain.
• Data buyer identities, including names and institutional affiliations, are verified, linked to blockchain addresses, and stored on the blockchain.
• Access permissions to the Nebula platform and data stored in Keep also are managed on the blockchain.
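The column families listed above can be pictured as separate key-value maps keyed by block content address or by blockchain address. The sketch below is an in-memory stand-in written for illustration only. The class and method names are hypothetical, SHA-256 is assumed as the digest, and the validator consensus that would gate every update in the real system is omitted.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


def content_address(block: bytes) -> str:
    """Digest used as the block's content address (SHA-256 is an assumption)."""
    return hashlib.sha256(block).hexdigest()


@dataclass
class MetadataStore:
    """One dictionary or set per column family described in the text."""
    owners: Dict[str, str] = field(default_factory=dict)             # block address -> owner blockchain address
    locations: Dict[str, str] = field(default_factory=dict)          # block address -> Keep server URL
    buyers: Dict[str, Tuple[str, str]] = field(default_factory=dict) # buyer address -> (name, affiliation)
    permissions: Set[Tuple[str, str]] = field(default_factory=set)   # (buyer address, block address)

    def register_block(self, block: bytes, owner_address: str, keep_url: str) -> str:
        """Record ownership and location of a block that was added to Keep."""
        address = content_address(block)
        self.owners[address] = owner_address
        self.locations[address] = keep_url
        return address

    def verify_block(self, address: str, block: bytes) -> bool:
        """Integrity check: re-hash the block and compare with the stored address."""
        return address in self.owners and content_address(block) == address

    def register_buyer(self, buyer_address: str, name: str, affiliation: str) -> None:
        """Link a verified buyer identity to a blockchain address."""
        self.buyers[buyer_address] = (name, affiliation)

    def grant_access(self, owner_address: str, buyer_address: str, address: str) -> None:
        """Only the registered owner may grant access, and only to a verified buyer."""
        if self.owners.get(address) != owner_address:
            raise PermissionError("only the registered data owner can grant access")
        if buyer_address not in self.buyers:
            raise PermissionError("buyer identity has not been verified")
        self.permissions.add((buyer_address, address))

    def has_access(self, buyer_address: str, address: str) -> bool:
        return (buyer_address, address) in self.permissions
```

Keeping each concern in its own map mirrors the column-family layout: ownership, location, identity, and permission lookups stay independent, and each can be answered from the block's content address or the participant's blockchain address alone.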


Data Discovery
Utilization of fully homomorphic encryption is intended to address the privacy barrier to data sharing. It enables data owners to make their data available for discovery without privacy risks, while at the same time allowing data buyers to explore the database before purchasing data access to perform analyses.

To this end, data buyers will construct a SQL-like query and encrypt it with the collective public key that has been constructed by validator nodes and used to encrypt phenotypic information and genetic variant lists. The encrypted query is executed on homomorphically encrypted data and an encrypted result is generated. The query result is re-encrypted by the validator nodes under a public key provided by the data buyer and shared with the data buyer, who can then decrypt it with the corresponding private key. A query can return the number of data owners that matched the specified criteria, as well as their blockchain addresses. This enables data buyers to connect with data owners to pay sequencing costs or to purchase access to existing genomic data (Figure 8).

Data Analysis
The Arvados container and pipeline management engine, Crunch, executes computations on data stored in Keep. Crunch implements a distributed computing model whereby workflows, and not the genomic data, are moved between cloud instances whenever possible. Highly distributed genomic data processing is possible because many intensive bioinformatics computations, such as alignment and variant calling, are performed on single genomes and are easily parallelizable. To this end, Crunch executes tasks inside Docker containers that are created physically close to the data locations in Keep and distributes computations between many processing units.
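The idea of moving workflows to the data, rather than the data to the workflows, can be sketched as follows. This is not Crunch's scheduler. It is a minimal illustration in which a hypothetical `task` callable stands in for a containerized per-genome job, the location strings stand in for Keep servers, and a thread pool stands in for the pool of processing units.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def group_by_location(genome_locations: Dict[str, str]) -> Dict[str, List[str]]:
    """Group genome identifiers by the storage location that holds them."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for genome_id, location in genome_locations.items():
        groups[location].append(genome_id)
    return dict(groups)


def run_near_data(
    genome_locations: Dict[str, str],
    task: Callable[[str, str], object],
    max_workers: int = 4,
) -> Dict[str, object]:
    """Run task(genome_id, location) once per genome, in parallel.

    Each call receives the location that already stores the genome, so the
    (hypothetical) task can execute there instead of transferring the data.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            genome_id: pool.submit(task, genome_id, location)
            for location, genome_ids in group_by_location(genome_locations).items()
            for genome_id in genome_ids
        }
        # Collect results after all tasks have been submitted.
        return {genome_id: future.result() for genome_id, future in futures.items()}
```

The per-genome granularity reflects the observation above that alignment and variant calling operate on single genomes, which is what makes this kind of independent, parallel dispatch possible.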

Figure 8—Secure data discovery through queries on homomorphically encrypted data.


Crunch ensures result reproducibility through standardization of computing workflows using Common Workflow Language (CWL),61 which enables connection of open-source and proprietary bioinformatics software into workflow pipelines that are flexible, portable, and scalable. Crunch can access CWL pipelines stored in public or private Git repositories such as GitHub.

CWL can be used to implement end-to-end bioinformatics analysis pipelines. Typically, CWL pipelines include common, computationally intensive “secondary analysis” tasks, such as alignment of sequencing reads to a reference genome and variant calling. However, “tertiary analysis” tasks, which often involve computing on genomic data sets rather than single genomes and are less standardized, also can be incorporated into CWL pipelines. Typical examples are statistical tests that are used in GWAS to identify correlations between genetic variants and phenotypes. For such tertiary analysis tasks, Nebula uses Lightning,62 a system for high-performance, in-memory computations on genomic data in the CGF. Lightning integrates into CWL pipelines and enables fast queries and execution of complex machine learning tasks on large genomic data sets.

CWL pipelines can also be used to analyze and interpret personal genomic (and phenotypic) data. First, users can build their own custom pipelines to analyze their personal data and also share pipelines with one another using public Git repositories. Second, developers can build and monetize genomic apps. To this end, CWL pipelines can be stored in private repositories, and access by Crunch may require a smart contract-mediated token payment to the pipeline developer. The approach of bringing apps to the data facilitates protection of personal information.

Security
Homomorphic encryption can enable privacy-preserving queries for data discovery. However, most computations that are necessary for typical genomic data analysis workflows do not achieve practical runtimes when executed on homomorphically encrypted data. Therefore, other security mechanisms must be utilized.

Platform access control
To protect data owners and their data, data buyers are required to go through a partially decentralized, three-step permission process. The first step is data buyer authentication. Here, a blockchain validator node will verify a data buyer’s personal and institutional identity. Blockchain addresses of verified data buyers will be added to the blockchain metadata store. Data buyers will then be able to connect to Nebula REST API servers and use Crunch to execute pipelines on data stored in Keep. Data buyer authentication will enable data owners to verify data buyer identity before agreeing to share data access. Furthermore, immutable storage of data buyer identities on the blockchain enables identification of data buyers who have violated consent agreements or have bypassed pipeline execution control.

Pipeline execution control
To protect data privacy, the platform design incorporates the ability to define approved bioinformatics tools and CWL workflows. The intent is to prevent data buyers from downloading genomic data or executing any computations that attempt to extract a large amount of information about individual data owners. This approach was chosen because it can provide a sufficient level of data privacy protection without significantly restricting data buyers.
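Pipeline execution control as described above amounts to checking a requested workflow against a set of approved ones before it runs. The article does not say how approved workflows are identified, so the sketch below assumes they are recognized by a SHA-256 digest of their text. The class name, method names, and whitespace normalization are likewise assumptions made for illustration.

```python
import hashlib
from typing import Set


def workflow_digest(workflow_text: str) -> str:
    """Digest of a workflow definition, ignoring line-ending and
    trailing-whitespace differences (normalization is an assumption)."""
    lines = workflow_text.replace("\r\n", "\n").split("\n")
    normalised = "\n".join(line.rstrip() for line in lines).strip()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


class PipelineWhitelist:
    """Set of approved workflow digests plus a run-time authorization check."""

    def __init__(self) -> None:
        self._approved: Set[str] = set()

    def approve(self, workflow_text: str) -> str:
        """Add a workflow to the approved set and return its digest."""
        digest = workflow_digest(workflow_text)
        self._approved.add(digest)
        return digest

    def is_approved(self, workflow_text: str) -> bool:
        return workflow_digest(workflow_text) in self._approved

    def authorize_run(self, buyer_address: str, verified_buyers: Set[str], workflow_text: str) -> None:
        """Raise PermissionError unless the buyer is verified and the workflow is approved."""
        if buyer_address not in verified_buyers:
            raise PermissionError("data buyer has not been authenticated")
        if not self.is_approved(workflow_text):
            raise PermissionError("workflow is not on the approved list")
```

Identifying a workflow by digest means that any edit to an approved pipeline, however small, produces a different digest and must be approved again. That property is what makes such a whitelist useful against pipelines altered to extract data.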


Data access control
The first task in every CWL pipeline is to get access to the input data (Figure 9). Here, a data buyer executes a smart contract on the blockchain. The inputs are the data buyer’s blockchain address and the content addresses of all data blocks of the input files. The data buyer also deposits tokens inside the smart contract and defines a token payout for data access. When a data owner’s light client synchronizes with the blockchain, the data owner is notified of the data access request. The data owner can decide about data sharing based on the offered payment and the identity of the data buyer. The data owner grants data access by executing the smart contract. The blockchain validator nodes then verify the integrity of the requested data stored in Keep by comparing data hashes with the content addresses stored on the blockchain and collectively re-encrypt the data under the data buyer’s public key. Finally, the data buyer’s access permission is registered on the blockchain and tokens are sent from the smart contract to the data owner’s wallet. Crunch can now load decrypted data into a Docker container and begin pipeline execution.

Figure 9—Data access control and data purchases.

Governance
The Nebula blockchain can be used to enable Nebula network participants to collectively govern the network, in particular, to help maintain data protection. To this end, for example, Token-Curated Registries (TCRs)63 can be used to conduct elections that determine validator nodes or whitelist data analysis pipelines. TCRs are lists that are curated decentrally by token holders. Importantly, economic incentives drive the token holders to curate the list’s contents judiciously. In brief, network participants can cast votes whereby the weights of votes scale with their token holdings. Since token holders are invested in the network, they are incentivized to maintain its proper operation, which ensures data protection.

DISCUSSION
The obstacles that hinder personal genome sequencing and genomic data sharing have a significant impact on the progress of research into disease prevention, drug development, and other crucial aspects of human health. We described one approach to overcoming these obstacles, using a combination of multiple technologies. A number of challenges remain to be addressed regarding data privacy, data validation, data curation, and data economics.

Data privacy protection requires decentralization of data generation and further development of privacy-preserving technologies. Today, genomic data generation is limited to laboratories that own expensive sequencing machines operated by experienced technicians. Centralized genomic data generation leads to data privacy risks that may be averted if sequencing is decentralized. We anticipate that this will become possible soon as new technologies are being developed that would enable compact, affordable, and easy-to-operate sequencing machines.64 Data privacy protection is also impaired by current limitations of privacy-preserving technologies that do not allow complex computations on large data sets and require extensive optimization for every use case, which hinders effective data analysis. However, the practicality of privacy-preserving technologies has steadily increased over the past few years, and we anticipate continuing progress in the future.

Data validation and curation are another challenge. Validation of genomic data requires assistance of the sequencing facilities that have produced the data. If the source of the genomic data is unknown, or the sequencing facility does not cooperate, genomic data cannot be validated. A possible solution to this problem can be a model that compensates personal genomics companies and other genomic data producers for validating data authenticity. Data collected from different sources also must be made interoperable. It is particularly challenging to curate health records and other types of phenotypic data. However, standards such as FHIR are being developed very actively and have already enabled applications that can collect EHRs across different health systems.65

The idea of a personal data marketplace is very new and has not yet been implemented at scale. A personal data marketplace is likely to differ from traditional marketplaces in important ways. For example, data supply can be regarded as being unlimited because an individual can share data access with an unlimited number of data buyers. Personal data marketplaces also would be asymmetric, since individuals are likely to be unaware of the value of their personal data and are thus at risk of not being compensated fairly. The novelty of personal genomic data further compounds these challenges and makes market dynamics difficult to predict. We anticipate that future research into economics of data marketplaces will help answer these and other open questions.

Contributions: Dennis Grishin, Kamal Obbad, and Kevin Quinn wrote the article. Dennis Grishin, Kamal Obbad, and George Church developed the ideas described in the article. Dennis Grishin, Kamal Obbad, and Kevin Quinn are leading the development of the Nebula platform. Alexander Wait Zaranek, Ward Vandewege, Tom Clegg, Nico César, Preston Estep, and Mirza Cifric contributed to the development of Arvados. Alexander Wait Zaranek


and Sarah Wait Zaranek developed the Compact Genome Format and Lightning. All authors edited and/or reviewed the final article.

Acknowledgments
The authors would like to thank Armon Rahim and Nathaniel Tucker for their help in reviewing this article.

Conflict of Interest
All authors are employees, advisors, or collaborators of Nebula Genomics.

Funding Statement
Development of the Nebula platform is funded by Nebula Foundation.

REFERENCES
1. International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature. 2004 Oct 21;431(7011):931–45.
2. Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP) [Internet]. [cited 2018 Jan 11]. Available from: https://www.genome.gov/sequencingcostsdata/
3. Illumina Promises to Sequence Human Genome For $100—But Not Quite Yet. 2017. [cited 2018 Oct 10]. Available from: https://www.forbes.com/sites/matthewherper/2017/01/09/illumina-promises-to-sequence-human-genome-for-100-but-not-quite-yet/#672a5d72386d
4. Maurano MT, Humbert R, Rynes E, et al. Systematic localization of common disease-associated variation in regulatory DNA. Science. 2012 Sep 7;337(6099):1190–5.
5. Lindor NM, Thibodeau SN, Burke W. Whole-genome sequencing in healthy people. Mayo Clin Proc. 2017 Jan;92(1):159–72.
6. Berg JS, Adams M, Nassar N, et al. An informatics approach to analyzing the incidentalome. Genet Med. 2013 Jan;15(1):36–44.
7. Yang A, Palmer AA, de Wit H. Genetics of caffeine consumption and responses to caffeine. Psychopharmacology. 2010 Aug;211(3):245–57.
8. Mattar R, de Campos Mazo DF, Carrilho FJ. Lactose intolerance: Diagnosis, genetic, and clinical factors. Clin Exp Gastroenterol. 2012 Jul 5;5:113–21.
9. Freeman HJ. Risk factors in familial forms of celiac disease. World J Gastroenterol. 2010 Apr 21;16(15):1828–31.
10. O’Connell K, Knight H, Ficek K, et al. Interactions between collagen gene variants and risk of anterior cruciate ligament rupture. EJSS. 2015;15(4):341–50.
11. Tiziano FD, Palmieri V, Genuardi M, Zeppilli P. The role of genetic testing in the identification of young athletes with inherited primitive cardiac disorders at risk of exercise sudden death. Front Cardiovasc Med. 2016 Aug 26;3:28.
12. Bennett ER, Reuter-Rice K, Laskowitz DT. Genetic Influences in Traumatic Brain Injury: Chapter XI. In: Laskowitz D, Grant G, editors. Translational Research in Traumatic Brain Injury. Boca Raton, FL: CRC Press/Taylor and Francis Group; 2015.
13. Ginn SL, Amaya AK, Alexander IE, Edelstein M, Abedi MR. Gene therapy clinical trials worldwide to 2017: An update. J Gene Med. 2018 May;20(5):e3015.
14. Cardon LR, Harris T. Precision medicine, genomics and drug discovery. Hum Mol Genet. 2016 Oct 1;25(R2):R166–72.
15. Rojahn SY. Genomics could blow up the clinical trial. MIT Technology Review [Internet]. 2013 Nov 12 [cited 2018 Aug 25]; Available from: https://www.technologyreview.com/s/521496/genomics-could-blow-up-the-clinical-trial/
16. Herper M. Surprise! With $60 Million Genentech Deal, 23andMe Has A Business Plan [Internet]. Forbes. 2015 [cited 2017 Oct 1]. Available from: https://www.forbes.com/sites/matthewherper/2015/01/06/surprise-with-60-million-genentech-deal-23andme-has-a-business-plan/


