Examining Data Sharing and Data Reuse in the DataONE

Page created by Ray Guzman
 
CONTINUE READING
Examining Data Sharing and Data Reuse in the DataONE
                      Environment
                                                           Angela P. Murillo
                                              University of North Carolina at Chapel Hill
                                              School of Information and Library Sciences
                                                    216 Lenoir Drive • CB #3360
                                                           100 Manning Hall
                                                     Chapel Hill, NC 27599-3360
                                                       amurillo@email.unc.edu

ABSTRACT
                                                                             INTRODUCTION
The Data Observation Network for Earth (DataONE), a
                                                                             This poster-paper presents preliminary analysis of a mixed-
U.S. NSF DataNet Partner, seeks to provide
                                                                             method study to gain an understanding of data sharing and
cyberinfrastructure for “open, persistent, robust, and secure
                                                                             reuse within a new cyberinfrastructure program, the
access to…earth science observational data”. Scientists
                                                                             DataONE.
participating in DataONE are able to deposit, search, and
reuse data available through various DataONE tools. The                      Data sharing and reuse in the sciences has been a topic of
research presented in this poster-paper reports on two                       growing attention in recent years. This attention stems from
studies examining data sharing and reuse in the DataONE                      changes occurring in scientific practices driven by the data
environment. The two studies include 1) a profiling data                     deluge (Bell, Hey, & Szalay, 2009; Hey & Trefethen, 2003)
assessment that examines the data and metadata being                         and fourth paradigm data-intensive science (Hey, Tansley,
deposited into the DataONE system for data sharing, and 2)                   & Tolle, 2009); and changes in journal and grant agency
a pilot think-aloud study that examines what factors                         policies (National Institutes of Health, 2007; National
influence decisions regarding data reuse. From the profiling                 Science Foundation, 2010). The sharing of data provides
data assessment, preliminary results indicate that data being                the ability to extract additional value from existing data,
deposited into the DataONE for sharing have three specific                   avoid reproducing research, ask new questions of existing
types of metadata available including a) dataset, b) access,                 data, and advance the state of science in general (Borgman,
and c) additional metadata. Results also indicated that there                2012; Lord & Macdonald, 2003). These potential
is variation regarding the robustness and completeness of                    opportunities of data sharing and have placed pressure on
information. Additionally, through the think-aloud study                     the scientific community and funding agencies to provide
results indicated that particular aspects the metadata                       infrastructure solutions for the changes occurring in
information was useful for decision-making regarding reuse                   scientific practice.
of data for scientists, while other metadata aspects were
                                                                             To address the above, in 2007 the U.S. National Science
described as not useful. The results section provide specific
                                                                             Foundation announced a request for proposals for
details of these findings and demonstrate how these two
                                                                             Sustainable Digital Data Preservation and Access Network
studies examine both data sharing and reuse within the
                                                                             Partners (DataNet). The DataNet Partners were created to
DataONE environment.
                                                                             develop long-term sustainable data infrastructures,
Keywords
                                                                             interoperable data preservation and access, and
Data sharing and reuse, scientific data, infrastructure,                     cyberinfrastructure     capabilities  (National    Science
                                                                             Foundation, 2006). The Data Observation Network for
                                                                             Earth (DataONE), one of the initial DataNet Partners,
{This is the space reserved for copyright notices.]
                                                                             provides cyberinfrastructure for “open, persistent, robust,
                                                                             and secure access to well-described and easily discovered
ASIST 2014,November 1-4, 2014, Seattle, WA, USA.                             earth science observational data” (DataONE, 2013).
                                                                             Scientists participating in DataONE are able to deposit,
[Author Retains Copyright.      Insert personal or institutional copyright
notice here.]
                                                                             search, and reuse data available through the various
                                                                             DataONE tools.
                                                                             The majority of studies specific to DataONE have
                                                                             addressed: the organization and the infrastructure, specific
DataONE, NFS – DataNets
                                                                             tools that the DataONE has created, and the DataONE
                                                                             community. Additionally, the majority of data sharing
literature have addressed: general reasons why scientists        Sproull, 1994; Tenopir et al., 2011). Reasons for not
should share data, journal and grant policies that influence     sharing data include financial concerns, lack of time, lack
data sharing, behavioral aspects that influence data sharing,    of organizational support, lack of documentation and
and have been conducted in the biological and biomedical         complexity of metadata standards, as well as the difficulty
sciences.                                                        to anticipate intended users (Birnholtz & Bietz, 2003;
                                                                 Tenopir et al., 2011; Zimmerman, 2003).
As the majority of the DataONE activities have focused on
development, it is timely to evaluate the cyberinfrastructure    Researchers have also examined journal policies and data
progress. Furthermore, as the DataONE focuses on earth           deposition. Since the early 1980s many journals have added
science data, this provides an environment for studying data     policy statements to motivate scientists to share data,
sharing and reuse within the earth sciences. This research       recently these policies have become stricter. Some
recognizes specific gaps in the literature: (1) the need to      publishers have indicated they will refuse to publish
evaluate the DataONE cyberinfrastructure progress, (2) the       without evidence of data deposition (Brown, 2003; McCain,
need for studying how this infrastructure impacts data           1995, 2000). Additionally, studies have indicated that while
sharing, and (3) the need for studying sharing in the earth      no journal has complete compliance, much research data is
sciences. Additionally, considering the amount of time,          deposited along with the article (Noor, Zimmerman, &
energy, effort, and financial support the community has          Teeter, 2006; Ochsner, Steffen, Stoeckert, & McKenna,
invested in DataONE, it is incredibly important to evaluate      2008). Lastly, several studies have investigated specific
its usefulness in regards to data sharing and reuse.             factors associated with data deposition. These studies have
                                                                 shown that author experience and publications associated
This research will investigate data sharing and reuse within
                                                                 with high-impact factor journals were more likely to have
the DataONE cyberinfrastructure.
                                                                 associated data deposited alongside the authors’ journal
                                                                 articles (Piwowar & Chapman, 2010; Piwowar, 2011).
BACKGROUND LITERATURE
Data Sharing and Reuse
                                                                 The studies above demonstrate that scientists do believe
The below addresses provides a summary of the extensive          that data sharing and reuse is important to drive science
research that has been conducted regarding data sharing and      forward. As described, the motivations for data sharing
reuse.                                                           include value to the scientific community, pressures from
                                                                 granting agencies and journal policies, and scientific
Multiple studies have indicated that data sharing and reuse      reputation. Inhibitors include financial concerns, lack of
is will advance scientific development through factors such      time, lack of organizational support, lack of reward and
as avoiding duplication of work, allowing new questions to       inability to anticipate the intended user. However, the
be asked to existing data, and encouraging diversity in          current research does not describe the motivations and
analysis (Borgman, 2012; Lord & Macdonald, 2003).                inhibitors beyond general terms and therefore needs to be
Along with data sharing for the greater good of science,         further examined to understand the intricate details of data
funding agencies and journals have put pressure on               sharing and reuse. Furthermore, the current research does
scientists to make their data available. Funding agencies are    not address cyberinfrastructure, interoperability, technical
now requiring data sharing plans and publishers have             or system aspects but focuses mainly on human behavior.
threatened to not publish articles if scientists do not make     Moreover, much of the current research has been conducted
their data publically available (Blumenthal et al., 2006;        in the biological sciences; therefore studies within the earth
McCain, 1995; Sayogo & Pardo, 2013). Scientific                  sciences would provide a different perspective, as scientific
reputation is another factor that influences scientists          practices differ within these areas of study (Murillo, 2014).
willingness to share data. Scientists unwilling to share their
                                                                 DataONE Literature
data could seem possibly fraudulent; therefore data sharing
                                                                 DataONE is a four-year initiative that began in 2009, and
is considered a part of seeming reputable by fellow
                                                                 has been extended to 2014, and has been further extended
scientists (Birnholtz & Bietz, 2003; Ceci, 1988).
                                                                 until Phase Two of the initiative. With the emphasis on
Many researchers have analyzed data sharing practices            infrastructure design and the new status of the DataONE,
through multiple methods including direct report from            there has been limited time for assessment or investigation
scientists, journal policy studies, and bibliometric studies.    specific to the project itself, and few studies conducted have
Several direct report studies have investigated scientists’      investigated data sharing and reuse within the DataONE
attitude toward data sharing to gain an understanding of         community. The literature reviewed below provides an
motivations. These studies suggest that scientists want to       overview of the research that has been conducted in relation
have data sharing as a norm in science (Borgman, 2012;           to DataONE.
Ceci, 1988; Lord & Macdonald, 2003). Data ownership,
                                                                 Studies have been conducted to explore the collaborative
previous assistance from coworkers, journal policies, and
                                                                 relationship between information professionals and
grant agency requirements were some of the motivations for
                                                                 scientists and how this relates to the DataONE as a
data sharing (Blumenthal et al., 2006; Constant, Kiesler, &
                                                                 transdisciplinary organization (Allard & Allard, 2009).
Reichman, Jones, & Schildhauer (2011) demonstrated that            within the DataONE by analyzing the metadata that is being
DataONE provides access to well-curated, federated data            made available.
repositories that can lead to improvements in sharing and
                                                                   Secondly, a pilot think-aloud study explores: a) what types
reuse of data; however this study also indicated that the
                                                                   of data scientists need, b) what types of data scientists deem
most effective means for data sharing would be to alter the
                                                                   reusable, and c) what information do scientists need about
reward system.. Research has also investigated specific
                                                                   that data to determine if it is reusable. This provides an
technical aspects of the DataONE, for example integration
                                                                   understanding of data reuse within the DataONE by
of large-scale computational runs with DataONE data,
                                                                   analyzing what is needed for scientists to deem data
metadata, and workflow tools (Dexter, Cobb, Vieglais,
                                                                   reusable.
Jones, & Lowe, 2011). Lastly, and perhaps most closely
related to this research; there has been research conducted        Preliminary data gathering and analysis has been conducted
to address critical challenges facing researchers involved in      and these results are described below. This data analysis
data sharing and how these challenges influence researchers        will continue throughout the fall. It is expected to have
to share their data openly (Sayogo & Pardo, 2011). The data        more substantial results prior to the annual conference to
collected in this study was from 2009-2010, therefore a new        present at the poster session.
study of the current DataONE users is particularly
important to inform the DataONE. Furthermore, this study           PRELIMINARY RESULTS
investigated scientists making their data available for
                                                                   Profiling Data Assessments
sharing through deposition, but did not address scientists         The profiling data assessment is being conducted through
reusing data (Murillo, 2014).                                      an analysis of metadata records extracted from DataONE. A
These studies have begun preliminary research related to           random sample of metadata records have been extracted
the DataONE. In order to ensure that the DataONE is being          from the DataONE. This corpus of metadata records
used to its fullest potential, additional studies regarding data   includes 650 XML records, which represents the complete
sharing and reuse within the DataONE needs to be                   corpus of 105,121 metadata records that were available
conducted (Murillo, 2014). This research provides the              from the DataONE at the time of extraction. To date, two
preliminary examination of data sharing and reuse within           representative samples of 650 records have been extracted;
the DataONE environment, and will serve as preliminary             one is serving as a training and teaching set for content
analysis for future study that will examine factors that           analysis and coding. This analysis provides an
facilitate or interfere with data sharing and reuse within the     understanding of the data that is being shared into the
DataONE user community.                                            DataONE environment, as well as the metadata that is being
                                                                   provided alongside the data. Preliminary analysis has begun
RESEARCH QUESTIONS AND RESEARCH METHODS                            using the training set on 20 of the XML records. This
The research questions for this study are:                         analysis will continue throughout the fall.
Research Questions 1:                                              Of the 20 XML records analyzed all 20 records contained
                                                                   metadata regarding the dataset deposited into the DataONE.
• What data is being deposited into the DataONE?
                                                                   Additionally, 8 records contained metadata regarding
• What information is being provided regarding that data?          access and 12 contained additional metadata. Only 5 of the
Research Questions 2:                                              XML records contained all three aspects (dataset, access,
                                                                   and additional metadata).
• What data do scientists need when they search the
  DataONE?                                                         The additional metadata field contained clarifications
                                                                   regarding access; units, unit lists and definitions of units;
• What data are deemed reusable and what information do
                                                                   and related datasets. The majority of this additional
  scientists need about this data to deem it reusable?
                                                                   metadata was specifically related to units and unit lists. The
The study uses a mixed-method approach: 1) a profiling             access metadata provided information regarding access
data assessment and 2) a pilot think-aloud. Each method            rights, although only 8 of the 20 records contained this
used in this research addresses specific aspects of the            information, of these 8 all data was public and several of
research questions, which are described below in more              these records contained guidance for citing the data.
detail.
                                                                   The metadata for the datasets varied in regards to
First, a profiling data assessment is being conducted              robustness and completeness. For example, while many
through a quantitative content analysis of data deposited          records included a pubDate and language, not all records
into the DataONE. This explores research question 1                included this information. Furthermore, there were multiple
including: a) what types of data are being deposited, b)           formats for these metadata fields (i.e. “en” versus “English”
what agencies are depositing data, 3) what disciplines are         for language). Additionally, 18 of the 20 records contained
depositing data, and 4) what types of metadata are being           full abstracts, which scientists described as important to
provided. This provides an understanding of data sharing           determining if a dataset was reusable during the pilot think-
aloud study discussed in the next section. Furthermore,           discussed above, not all of the records have the same
some entries included information such as methods                 information.
information, intellectual rights, and coverage while others
                                                                  Data analysis continues on this data and will be included in
did not.
                                                                  the poster to address the research questions described
Data analysis continues on these records and will be              above. This pre-pilot has served as a preliminary study for
included in the poster to address the research questions          the creation of a second study, which will include a quasi-
described above. Additionally, in the pilot think-aloud           experimental aspect to control for specific variables.
study described below participants were asked how this
information affected their ability to reuse the data available    SIGNIFICANCE OF WORK
to them through the DataONE system.                               This work has great significance for our field for many
                                                                  reasons. The need to create systems for scientific data
Pilot Think-Aloud Study                                           sharing and reuse is ongoing and will continue due to
A pilot think-aloud study was conducted during the                changes in science. Therefore we need to understand if
November 2014, DataONE All-Hands Meeting in                       these systems are enabling sharing and reuse or not, and
Albuquerque, NM and was approved by the UNC                       why. This work contributes a method and an approach for
Institutional Review Board IRB number - 13-3499.                  analyzing data sharing and reuse. It takes advantage of the
                                                                  opportunity to research on a rich environment supporting
During the All-Hands meeting, I asked scientists to search
                                                                  both the sharing side (making data available) and reuse side
for data in the DataONE system while I observed them. As
                                                                  (reusing available data). Furthermore, this research provides
they were searching the system, I asked them to think-aloud
                                                                  insight into data sharing and reuse in the earth science
regarding the decisions they were making on if the data was
                                                                  community; to date data sharing research has been
reusable or not reusable. Additionally, I conducted semi-
                                                                  primarily focused in the biomedical, biological, and health
structured interviews where participants were asked their
                                                                  sciences area.
subject expertise, if they had previously searched for data
within the DataONE, and how often they searched.
                                                                  ACKNOWLEDGMENTS
Participants were asked to describe their typical searches,       The author would like to thank Dr. Jane Greenberg for her
what kind of data they were looking for, and what                 guidance and support. The author would also like to thank
information they needed to determine if the data was              the DataONE for their support.
relevant and reusable.
Participants indicated that they used the DataONE system          REFERENCES
fairly often, perhaps one or twice a month. They would            Allard, S., & Allard, G. (2009). Transdisciplinarity and
search using general key terms, but then would narrow the           information science in earth and environmental science
results using the facets available through the system.              research. Proceedings of the American Society for
Additionally, some participants indicated that they were            Information Science and Technology, 46, 1–9.
also responsible for depositing data into the DataONE               doi:10.1002/meet.2009.1450460346
system, and therefore searched the system to make sure            Bell, G., Hey, T., & Szalay, A. (2009). Beyond the data
their data was being represented correctly.                         deluge.      Science,      323(5919),     1297–1298.
Results indicated that scientists made decisions based on           doi:10.1126/science.1170411
the metadata snippet they received from the system. These         Birnholtz, J. P., & Bietz, M. J. (2003). Data at work:
decisions were made through information including: if they          Supporting and sharing in science and engineering. In
had previous knowledge of the PI or author of the data, the         Proceedings of the 2003 international ACM SIGGROUP
actual data type, and the additional information provided           conference on Supporting group work - GROUP ’03 (pp.
regarding the data. In some instances, the topic or subject         339–348). New York, New York, USA: ACM Press.
matter of the data was of interest to the participant, but they     doi:10.1145/958160.958215
were unfamiliar with the data type and therefore the data         Blumenthal, D., Campbell, E. G., Gokhale, M., Yucel, R.,
was not reusable for their purposes. More often, the                Clarridge, B., & Hilgartner, S. (2006). Data withholding
robustness of the metadata and the provenance information           in genetics and the other life sciences: Prevalences and
was key for the scientists choosing if this data was reusable.      predictors. Academic Medicine, 81(2), 137–145.
Scientists stated that information such as coverage and
research methods helped them determine if the data was            Borgman, C. L. (2012). The conundrum of sharing research
suitable for reuse.                                                data. Journal of the American Society for Information
                                                                   Science    and     Technology,    63(6),    1059–1078.
Additionally, participants indicated that the key terms were       doi:10.1002/asi.22634
really only helpful during the search and that they would
                                                                  Brown, C. (2003). The changing face of scientific
prefer to see all of the PIs and not just the first author.
                                                                    discourse: Analysis of genomic and proteomic database
Furthermore, they discussed how it would be helpful if all
                                                                    usage and acceptance. Journal of the American Society
of the metadata snippets had the same stylesheet, as
for Information Science and Technology, 54(10), 926–           National Science Foundation. (2006, November 7).
  938.                                                            Sustainable Digital Data Preservation and Access
Ceci, S. J. (1988). Scientists’ attitudes toward data sharing.    Network Partners (DataNet). Program Solicitation NSF
  Science, Technology, and Human Values, 13(No. 1 & 2),           07-601.    Retrieved   February    4,   2014,    from
  45–52.                                                          http://www.nsf.gov/pubs/2007/nsf07601/nsf07601.htm
Constant, D., Kiesler, S., & Sproull, L. (1994). What’s mine     National Science Foundation. (2010, November 10).
 is ours, or is it? A study of attitudes about information        Dissemination and Sharing of Research Results.
 sharing. Information Systems Research, 5(4), 400–421.            Retrieved                                   from
                                                                  http://www.nsf.gov/bfa/dias/policy/dmp.jsp
DataONE. (2013). What is DataONE? | DataONE. What is
 DataONE? Retrieved January 2, 2014, from                        Noor, M. A. F., Zimmerman, K. J., & Teeter, K. C. (2006).
 http://www.dataone.org/what-dataone                              Data sharing: How much doesn’t get submitted to
                                                                  GenBank?         PLoS      Biology,     4(7),      e228.
Dexter, N. C., Cobb, J. W., Vieglais, D., Jones, M. B., &         doi:10.1371/journal.pbio.0040228
 Lowe, M. (2011). DataONE member node pilot
 integration with TeraGrid? In Proceedings of the                Ochsner, S. A., Steffen, D. L., Stoeckert, C. J., &
 TeraGrid 2011 Conference Extreme Digital Discovery               McKenna, N. J. (2008). Much room for improvement in
 TG11.                   Retrieved                   from         deposition rates of expression microarray datasets.
 http://www.scopus.com/inward/record.url?eid=2-s2.0-              Nature Methods, 5(12), 991. doi:10.1038/nmeth1208-991
 80052325721&partnerID=40&md5=c0434f78901fe27b0                  Piwowar, H. A. (2011). Who shares? Who doesn’t? Factors
 25a58855be2b862                                                   associated with openly archiving raw research data. PLoS
Hey, T., Tansley, S., & Tolle, K. M. (2009). The fourth            ONE,             6(7,          e18657),             1–13.
 paradigm: Data-intensive scientific discovery. Redmond,           doi:10.1371/journal.pone.0018657
 Washington: Microsoft Research.                                 Piwowar, H. A., & Chapman, W. W. (2010). Public sharing
Hey, T., & Trefethen, A. E. (2003). The data deluge: An e-         of research datasets: A pilot study of associations.
 science perspective. In F. Berman, A. J. G. Hey, & G. C.          Journal       of      Informetrics, 4,      148–156.
 Fox (Eds.), Grid Computing: Making the Global                     doi:10.1016/j.joi.2009.11.010
 Infrastructure a Reality (pp. 809–824). Wiley and Sons.         Reichman, O. J., Jones, M. B., & Schildhauer, M. P. (2011).
 Retrieved from http://en.scientificcommons.org/2325382            Challenges and opportunities of open data in ecology.
Lord, P., & Macdonald, A. (2003). e-Science Curation               Science, 331(6018), 703–705.
  Report Data curation for e-Science in the UK  : An audit       Sayogo, D. S., & Pardo, T. a. (2013). Exploring the
  to establish requirements for future curation and                determinants of scientific data sharing: Understanding the
  provision (pp. 1–84). The JISC Committee for the                 motivation to publish research data. Government
  Support      of       Research.     Retrieved      from          Information        Quarterly,          30,       S19–S31.
  http://www.jisc.ac.uk/uploaded_documents/e-                      doi:10.1016/j.giq.2012.06.011
  ScienceReportFinal.pdf                                         Sayogo, D. S., & Pardo, T. A. (2011). Understanding the
McCain, K. W. (1995). Mandating sharing: Journal policies          capabilities and critical success factors in collaborative
 in the natural sciences. Science Communication, 16, 403–          data sharing network: The case of DataONE. In
 431. doi:10.1177/1075547095016004003                              Proceedings of the 12th Annual International Digital
McCain, K. W. (2000). Sharing digitized research-related           Government Research Conference dgo 2011 (pp. 74–83).
 information on the World Wide Web. Journal of the                 ACM. doi:10.1145/2037556.2037568
 American Society for Information Science, 51(14), 1321–         Sonnenwald, D. H. (2007). Scientific collaboration. Annual
 1327.                                                             Review of Information Science and Technology, 41(1),
Murillo, A. P. (September 2014). An investigation of               643–681. doi:10.1002/aris.2007.1440410121
 Selected    cyberinfrastructure   and     interoperability      Tenopir, C., Allard, S., Douglass, K., Aydinoglu, A. U.,
 elements: Data Sharing and reuse in the sciences. Bulletin        Wu, L., Read, E., … Frame, M. (2011). Data sharing by
 of IEEE Technical Committee on Digital Libraries,                 scientists: Practices and perceptions. PLoS ONE, 6(6,
 Special Issue.                                                    e21101), 1–21.
National Institutes of Health. (2007). NIH Data Sharing          Zimmerman, A. S. (2003). Data sharing and secondary use
 Policy. NIH Data Sharing Policy. Retrieved from                   of scientific data: Experiences of ecologists.
 http://grants.nih.gov/grants/policy/data_sharing/

            The columns on the last page should be of approximately equal length.
You can also read