Examining Data Sharing and Data Reuse in the DataONE
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Examining Data Sharing and Data Reuse in the DataONE Environment Angela P. Murillo University of North Carolina at Chapel Hill School of Information and Library Sciences 216 Lenoir Drive • CB #3360 100 Manning Hall Chapel Hill, NC 27599-3360 amurillo@email.unc.edu ABSTRACT INTRODUCTION The Data Observation Network for Earth (DataONE), a This poster-paper presents preliminary analysis of a mixed- U.S. NSF DataNet Partner, seeks to provide method study to gain an understanding of data sharing and cyberinfrastructure for “open, persistent, robust, and secure reuse within a new cyberinfrastructure program, the access to…earth science observational data”. Scientists DataONE. participating in DataONE are able to deposit, search, and reuse data available through various DataONE tools. The Data sharing and reuse in the sciences has been a topic of research presented in this poster-paper reports on two growing attention in recent years. This attention stems from studies examining data sharing and reuse in the DataONE changes occurring in scientific practices driven by the data environment. The two studies include 1) a profiling data deluge (Bell, Hey, & Szalay, 2009; Hey & Trefethen, 2003) assessment that examines the data and metadata being and fourth paradigm data-intensive science (Hey, Tansley, deposited into the DataONE system for data sharing, and 2) & Tolle, 2009); and changes in journal and grant agency a pilot think-aloud study that examines what factors policies (National Institutes of Health, 2007; National influence decisions regarding data reuse. From the profiling Science Foundation, 2010). The sharing of data provides data assessment, preliminary results indicate that data being the ability to extract additional value from existing data, deposited into the DataONE for sharing have three specific avoid reproducing research, ask new questions of existing types of metadata available including a) dataset, b) access, data, and advance the state of science in general (Borgman, and c) additional metadata. Results also indicated that there 2012; Lord & Macdonald, 2003). These potential is variation regarding the robustness and completeness of opportunities of data sharing and have placed pressure on information. Additionally, through the think-aloud study the scientific community and funding agencies to provide results indicated that particular aspects the metadata infrastructure solutions for the changes occurring in information was useful for decision-making regarding reuse scientific practice. of data for scientists, while other metadata aspects were To address the above, in 2007 the U.S. National Science described as not useful. The results section provide specific Foundation announced a request for proposals for details of these findings and demonstrate how these two Sustainable Digital Data Preservation and Access Network studies examine both data sharing and reuse within the Partners (DataNet). The DataNet Partners were created to DataONE environment. develop long-term sustainable data infrastructures, Keywords interoperable data preservation and access, and Data sharing and reuse, scientific data, infrastructure, cyberinfrastructure capabilities (National Science Foundation, 2006). The Data Observation Network for Earth (DataONE), one of the initial DataNet Partners, {This is the space reserved for copyright notices.] provides cyberinfrastructure for “open, persistent, robust, and secure access to well-described and easily discovered ASIST 2014,November 1-4, 2014, Seattle, WA, USA. earth science observational data” (DataONE, 2013). Scientists participating in DataONE are able to deposit, [Author Retains Copyright. Insert personal or institutional copyright notice here.] search, and reuse data available through the various DataONE tools. The majority of studies specific to DataONE have addressed: the organization and the infrastructure, specific DataONE, NFS – DataNets tools that the DataONE has created, and the DataONE community. Additionally, the majority of data sharing
literature have addressed: general reasons why scientists Sproull, 1994; Tenopir et al., 2011). Reasons for not should share data, journal and grant policies that influence sharing data include financial concerns, lack of time, lack data sharing, behavioral aspects that influence data sharing, of organizational support, lack of documentation and and have been conducted in the biological and biomedical complexity of metadata standards, as well as the difficulty sciences. to anticipate intended users (Birnholtz & Bietz, 2003; Tenopir et al., 2011; Zimmerman, 2003). As the majority of the DataONE activities have focused on development, it is timely to evaluate the cyberinfrastructure Researchers have also examined journal policies and data progress. Furthermore, as the DataONE focuses on earth deposition. Since the early 1980s many journals have added science data, this provides an environment for studying data policy statements to motivate scientists to share data, sharing and reuse within the earth sciences. This research recently these policies have become stricter. Some recognizes specific gaps in the literature: (1) the need to publishers have indicated they will refuse to publish evaluate the DataONE cyberinfrastructure progress, (2) the without evidence of data deposition (Brown, 2003; McCain, need for studying how this infrastructure impacts data 1995, 2000). Additionally, studies have indicated that while sharing, and (3) the need for studying sharing in the earth no journal has complete compliance, much research data is sciences. Additionally, considering the amount of time, deposited along with the article (Noor, Zimmerman, & energy, effort, and financial support the community has Teeter, 2006; Ochsner, Steffen, Stoeckert, & McKenna, invested in DataONE, it is incredibly important to evaluate 2008). Lastly, several studies have investigated specific its usefulness in regards to data sharing and reuse. factors associated with data deposition. These studies have shown that author experience and publications associated This research will investigate data sharing and reuse within with high-impact factor journals were more likely to have the DataONE cyberinfrastructure. associated data deposited alongside the authors’ journal articles (Piwowar & Chapman, 2010; Piwowar, 2011). BACKGROUND LITERATURE Data Sharing and Reuse The studies above demonstrate that scientists do believe The below addresses provides a summary of the extensive that data sharing and reuse is important to drive science research that has been conducted regarding data sharing and forward. As described, the motivations for data sharing reuse. include value to the scientific community, pressures from granting agencies and journal policies, and scientific Multiple studies have indicated that data sharing and reuse reputation. Inhibitors include financial concerns, lack of is will advance scientific development through factors such time, lack of organizational support, lack of reward and as avoiding duplication of work, allowing new questions to inability to anticipate the intended user. However, the be asked to existing data, and encouraging diversity in current research does not describe the motivations and analysis (Borgman, 2012; Lord & Macdonald, 2003). inhibitors beyond general terms and therefore needs to be Along with data sharing for the greater good of science, further examined to understand the intricate details of data funding agencies and journals have put pressure on sharing and reuse. Furthermore, the current research does scientists to make their data available. Funding agencies are not address cyberinfrastructure, interoperability, technical now requiring data sharing plans and publishers have or system aspects but focuses mainly on human behavior. threatened to not publish articles if scientists do not make Moreover, much of the current research has been conducted their data publically available (Blumenthal et al., 2006; in the biological sciences; therefore studies within the earth McCain, 1995; Sayogo & Pardo, 2013). Scientific sciences would provide a different perspective, as scientific reputation is another factor that influences scientists practices differ within these areas of study (Murillo, 2014). willingness to share data. Scientists unwilling to share their DataONE Literature data could seem possibly fraudulent; therefore data sharing DataONE is a four-year initiative that began in 2009, and is considered a part of seeming reputable by fellow has been extended to 2014, and has been further extended scientists (Birnholtz & Bietz, 2003; Ceci, 1988). until Phase Two of the initiative. With the emphasis on Many researchers have analyzed data sharing practices infrastructure design and the new status of the DataONE, through multiple methods including direct report from there has been limited time for assessment or investigation scientists, journal policy studies, and bibliometric studies. specific to the project itself, and few studies conducted have Several direct report studies have investigated scientists’ investigated data sharing and reuse within the DataONE attitude toward data sharing to gain an understanding of community. The literature reviewed below provides an motivations. These studies suggest that scientists want to overview of the research that has been conducted in relation have data sharing as a norm in science (Borgman, 2012; to DataONE. Ceci, 1988; Lord & Macdonald, 2003). Data ownership, Studies have been conducted to explore the collaborative previous assistance from coworkers, journal policies, and relationship between information professionals and grant agency requirements were some of the motivations for scientists and how this relates to the DataONE as a data sharing (Blumenthal et al., 2006; Constant, Kiesler, & transdisciplinary organization (Allard & Allard, 2009).
Reichman, Jones, & Schildhauer (2011) demonstrated that within the DataONE by analyzing the metadata that is being DataONE provides access to well-curated, federated data made available. repositories that can lead to improvements in sharing and Secondly, a pilot think-aloud study explores: a) what types reuse of data; however this study also indicated that the of data scientists need, b) what types of data scientists deem most effective means for data sharing would be to alter the reusable, and c) what information do scientists need about reward system.. Research has also investigated specific that data to determine if it is reusable. This provides an technical aspects of the DataONE, for example integration understanding of data reuse within the DataONE by of large-scale computational runs with DataONE data, analyzing what is needed for scientists to deem data metadata, and workflow tools (Dexter, Cobb, Vieglais, reusable. Jones, & Lowe, 2011). Lastly, and perhaps most closely related to this research; there has been research conducted Preliminary data gathering and analysis has been conducted to address critical challenges facing researchers involved in and these results are described below. This data analysis data sharing and how these challenges influence researchers will continue throughout the fall. It is expected to have to share their data openly (Sayogo & Pardo, 2011). The data more substantial results prior to the annual conference to collected in this study was from 2009-2010, therefore a new present at the poster session. study of the current DataONE users is particularly important to inform the DataONE. Furthermore, this study PRELIMINARY RESULTS investigated scientists making their data available for Profiling Data Assessments sharing through deposition, but did not address scientists The profiling data assessment is being conducted through reusing data (Murillo, 2014). an analysis of metadata records extracted from DataONE. A These studies have begun preliminary research related to random sample of metadata records have been extracted the DataONE. In order to ensure that the DataONE is being from the DataONE. This corpus of metadata records used to its fullest potential, additional studies regarding data includes 650 XML records, which represents the complete sharing and reuse within the DataONE needs to be corpus of 105,121 metadata records that were available conducted (Murillo, 2014). This research provides the from the DataONE at the time of extraction. To date, two preliminary examination of data sharing and reuse within representative samples of 650 records have been extracted; the DataONE environment, and will serve as preliminary one is serving as a training and teaching set for content analysis for future study that will examine factors that analysis and coding. This analysis provides an facilitate or interfere with data sharing and reuse within the understanding of the data that is being shared into the DataONE user community. DataONE environment, as well as the metadata that is being provided alongside the data. Preliminary analysis has begun RESEARCH QUESTIONS AND RESEARCH METHODS using the training set on 20 of the XML records. This The research questions for this study are: analysis will continue throughout the fall. Research Questions 1: Of the 20 XML records analyzed all 20 records contained metadata regarding the dataset deposited into the DataONE. • What data is being deposited into the DataONE? Additionally, 8 records contained metadata regarding • What information is being provided regarding that data? access and 12 contained additional metadata. Only 5 of the Research Questions 2: XML records contained all three aspects (dataset, access, and additional metadata). • What data do scientists need when they search the DataONE? The additional metadata field contained clarifications regarding access; units, unit lists and definitions of units; • What data are deemed reusable and what information do and related datasets. The majority of this additional scientists need about this data to deem it reusable? metadata was specifically related to units and unit lists. The The study uses a mixed-method approach: 1) a profiling access metadata provided information regarding access data assessment and 2) a pilot think-aloud. Each method rights, although only 8 of the 20 records contained this used in this research addresses specific aspects of the information, of these 8 all data was public and several of research questions, which are described below in more these records contained guidance for citing the data. detail. The metadata for the datasets varied in regards to First, a profiling data assessment is being conducted robustness and completeness. For example, while many through a quantitative content analysis of data deposited records included a pubDate and language, not all records into the DataONE. This explores research question 1 included this information. Furthermore, there were multiple including: a) what types of data are being deposited, b) formats for these metadata fields (i.e. “en” versus “English” what agencies are depositing data, 3) what disciplines are for language). Additionally, 18 of the 20 records contained depositing data, and 4) what types of metadata are being full abstracts, which scientists described as important to provided. This provides an understanding of data sharing determining if a dataset was reusable during the pilot think-
aloud study discussed in the next section. Furthermore, discussed above, not all of the records have the same some entries included information such as methods information. information, intellectual rights, and coverage while others Data analysis continues on this data and will be included in did not. the poster to address the research questions described Data analysis continues on these records and will be above. This pre-pilot has served as a preliminary study for included in the poster to address the research questions the creation of a second study, which will include a quasi- described above. Additionally, in the pilot think-aloud experimental aspect to control for specific variables. study described below participants were asked how this information affected their ability to reuse the data available SIGNIFICANCE OF WORK to them through the DataONE system. This work has great significance for our field for many reasons. The need to create systems for scientific data Pilot Think-Aloud Study sharing and reuse is ongoing and will continue due to A pilot think-aloud study was conducted during the changes in science. Therefore we need to understand if November 2014, DataONE All-Hands Meeting in these systems are enabling sharing and reuse or not, and Albuquerque, NM and was approved by the UNC why. This work contributes a method and an approach for Institutional Review Board IRB number - 13-3499. analyzing data sharing and reuse. It takes advantage of the opportunity to research on a rich environment supporting During the All-Hands meeting, I asked scientists to search both the sharing side (making data available) and reuse side for data in the DataONE system while I observed them. As (reusing available data). Furthermore, this research provides they were searching the system, I asked them to think-aloud insight into data sharing and reuse in the earth science regarding the decisions they were making on if the data was community; to date data sharing research has been reusable or not reusable. Additionally, I conducted semi- primarily focused in the biomedical, biological, and health structured interviews where participants were asked their sciences area. subject expertise, if they had previously searched for data within the DataONE, and how often they searched. ACKNOWLEDGMENTS Participants were asked to describe their typical searches, The author would like to thank Dr. Jane Greenberg for her what kind of data they were looking for, and what guidance and support. The author would also like to thank information they needed to determine if the data was the DataONE for their support. relevant and reusable. Participants indicated that they used the DataONE system REFERENCES fairly often, perhaps one or twice a month. They would Allard, S., & Allard, G. (2009). Transdisciplinarity and search using general key terms, but then would narrow the information science in earth and environmental science results using the facets available through the system. research. Proceedings of the American Society for Additionally, some participants indicated that they were Information Science and Technology, 46, 1–9. also responsible for depositing data into the DataONE doi:10.1002/meet.2009.1450460346 system, and therefore searched the system to make sure Bell, G., Hey, T., & Szalay, A. (2009). Beyond the data their data was being represented correctly. deluge. Science, 323(5919), 1297–1298. Results indicated that scientists made decisions based on doi:10.1126/science.1170411 the metadata snippet they received from the system. These Birnholtz, J. P., & Bietz, M. J. (2003). Data at work: decisions were made through information including: if they Supporting and sharing in science and engineering. In had previous knowledge of the PI or author of the data, the Proceedings of the 2003 international ACM SIGGROUP actual data type, and the additional information provided conference on Supporting group work - GROUP ’03 (pp. regarding the data. In some instances, the topic or subject 339–348). New York, New York, USA: ACM Press. matter of the data was of interest to the participant, but they doi:10.1145/958160.958215 were unfamiliar with the data type and therefore the data Blumenthal, D., Campbell, E. G., Gokhale, M., Yucel, R., was not reusable for their purposes. More often, the Clarridge, B., & Hilgartner, S. (2006). Data withholding robustness of the metadata and the provenance information in genetics and the other life sciences: Prevalences and was key for the scientists choosing if this data was reusable. predictors. Academic Medicine, 81(2), 137–145. Scientists stated that information such as coverage and research methods helped them determine if the data was Borgman, C. L. (2012). The conundrum of sharing research suitable for reuse. data. Journal of the American Society for Information Science and Technology, 63(6), 1059–1078. Additionally, participants indicated that the key terms were doi:10.1002/asi.22634 really only helpful during the search and that they would Brown, C. (2003). The changing face of scientific prefer to see all of the PIs and not just the first author. discourse: Analysis of genomic and proteomic database Furthermore, they discussed how it would be helpful if all usage and acceptance. Journal of the American Society of the metadata snippets had the same stylesheet, as
for Information Science and Technology, 54(10), 926– National Science Foundation. (2006, November 7). 938. Sustainable Digital Data Preservation and Access Ceci, S. J. (1988). Scientists’ attitudes toward data sharing. Network Partners (DataNet). Program Solicitation NSF Science, Technology, and Human Values, 13(No. 1 & 2), 07-601. Retrieved February 4, 2014, from 45–52. http://www.nsf.gov/pubs/2007/nsf07601/nsf07601.htm Constant, D., Kiesler, S., & Sproull, L. (1994). What’s mine National Science Foundation. (2010, November 10). is ours, or is it? A study of attitudes about information Dissemination and Sharing of Research Results. sharing. Information Systems Research, 5(4), 400–421. Retrieved from http://www.nsf.gov/bfa/dias/policy/dmp.jsp DataONE. (2013). What is DataONE? | DataONE. What is DataONE? Retrieved January 2, 2014, from Noor, M. A. F., Zimmerman, K. J., & Teeter, K. C. (2006). http://www.dataone.org/what-dataone Data sharing: How much doesn’t get submitted to GenBank? PLoS Biology, 4(7), e228. Dexter, N. C., Cobb, J. W., Vieglais, D., Jones, M. B., & doi:10.1371/journal.pbio.0040228 Lowe, M. (2011). DataONE member node pilot integration with TeraGrid? In Proceedings of the Ochsner, S. A., Steffen, D. L., Stoeckert, C. J., & TeraGrid 2011 Conference Extreme Digital Discovery McKenna, N. J. (2008). Much room for improvement in TG11. Retrieved from deposition rates of expression microarray datasets. http://www.scopus.com/inward/record.url?eid=2-s2.0- Nature Methods, 5(12), 991. doi:10.1038/nmeth1208-991 80052325721&partnerID=40&md5=c0434f78901fe27b0 Piwowar, H. A. (2011). Who shares? Who doesn’t? Factors 25a58855be2b862 associated with openly archiving raw research data. PLoS Hey, T., Tansley, S., & Tolle, K. M. (2009). The fourth ONE, 6(7, e18657), 1–13. paradigm: Data-intensive scientific discovery. Redmond, doi:10.1371/journal.pone.0018657 Washington: Microsoft Research. Piwowar, H. A., & Chapman, W. W. (2010). Public sharing Hey, T., & Trefethen, A. E. (2003). The data deluge: An e- of research datasets: A pilot study of associations. science perspective. In F. Berman, A. J. G. Hey, & G. C. Journal of Informetrics, 4, 148–156. Fox (Eds.), Grid Computing: Making the Global doi:10.1016/j.joi.2009.11.010 Infrastructure a Reality (pp. 809–824). Wiley and Sons. Reichman, O. J., Jones, M. B., & Schildhauer, M. P. (2011). Retrieved from http://en.scientificcommons.org/2325382 Challenges and opportunities of open data in ecology. Lord, P., & Macdonald, A. (2003). e-Science Curation Science, 331(6018), 703–705. Report Data curation for e-Science in the UK : An audit Sayogo, D. S., & Pardo, T. a. (2013). Exploring the to establish requirements for future curation and determinants of scientific data sharing: Understanding the provision (pp. 1–84). The JISC Committee for the motivation to publish research data. Government Support of Research. Retrieved from Information Quarterly, 30, S19–S31. http://www.jisc.ac.uk/uploaded_documents/e- doi:10.1016/j.giq.2012.06.011 ScienceReportFinal.pdf Sayogo, D. S., & Pardo, T. A. (2011). Understanding the McCain, K. W. (1995). Mandating sharing: Journal policies capabilities and critical success factors in collaborative in the natural sciences. Science Communication, 16, 403– data sharing network: The case of DataONE. In 431. doi:10.1177/1075547095016004003 Proceedings of the 12th Annual International Digital McCain, K. W. (2000). Sharing digitized research-related Government Research Conference dgo 2011 (pp. 74–83). information on the World Wide Web. Journal of the ACM. doi:10.1145/2037556.2037568 American Society for Information Science, 51(14), 1321– Sonnenwald, D. H. (2007). Scientific collaboration. Annual 1327. Review of Information Science and Technology, 41(1), Murillo, A. P. (September 2014). An investigation of 643–681. doi:10.1002/aris.2007.1440410121 Selected cyberinfrastructure and interoperability Tenopir, C., Allard, S., Douglass, K., Aydinoglu, A. U., elements: Data Sharing and reuse in the sciences. Bulletin Wu, L., Read, E., … Frame, M. (2011). Data sharing by of IEEE Technical Committee on Digital Libraries, scientists: Practices and perceptions. PLoS ONE, 6(6, Special Issue. e21101), 1–21. National Institutes of Health. (2007). NIH Data Sharing Zimmerman, A. S. (2003). Data sharing and secondary use Policy. NIH Data Sharing Policy. Retrieved from of scientific data: Experiences of ecologists. http://grants.nih.gov/grants/policy/data_sharing/ The columns on the last page should be of approximately equal length.
You can also read