Working together with Fedora Commons: sustainable digital repository solutions
Chris Awre
Head of Information Management, University of Hull, UK
Co-chair of UK & Ireland Fedora User Group
Fedora Leadership Group member

Introduction

Digital repositories have emerged as important technical components in managing digital content collections. Their development has been spurred on by the need both to curate content that is frequently available only in digital form and to provide access to that content, taking advantage of the network opportunities now available.

But what are digital repositories? Many systems have been used to manage digital material over time, including various databases, archives and digital vaults. Many of these provide much of the same functionality that repositories do. So what makes repositories different? Key to the success of repositories has been their ability to combine roles that other systems focused on individually: storing digital content as well as providing easy, open access to it; managing collections whilst also enabling preservation actions against them.

[Figure: digital repositories, combining access, preservation, and management and maintenance]

It is also notable that digital repository systems, as used within academic institutions at least, have been predominantly open source software systems. They have come out of a need identified within academia that the commercial sector has either ignored (until recently) or addressed only for other markets (e.g., digital asset management systems). The open source nature of digital repository systems has been a real strength in encouraging wide participation in their ongoing development and evolution within academia. It has also brought a need to focus on the sustainability of the solutions that the different open source digital repository software communities have created. Good seeds have been sown in digital repository development so far.
The debate about what digital repositories are, or need to be, nonetheless continues. Can a repository provide all preservation functions? Can a repository manage data as well as documents? Can a digital repository provide access to multiple types of content? Such questions will continue to be asked, and the challenge to repository user communities is how they respond to them. Notwithstanding this, digital repositories have undoubtedly been successful in supporting digital curation developments over the past two decades. This paper focuses in particular on how Fedora has developed over this period and made its own contribution to digital curation, now and for the future.

Fedora

Fedora Commons [1] (most often shortened to just ‘Fedora’) is open source digital repository software that is maintained and used by an active community of institutional contributors from around the world. It should be noted that there is no link to the Fedora Linux distribution, which is entirely separate! The software is used for a variety of different purposes, covering all types of digital content: one of Fedora’s main strengths is its flexibility, and it can thus be applied extensively across multiple use cases.

Fedora 4.0 [2] was released in December 2014, the result of 2.5 years’ effort from across the community of users. This release is a major rewrite of the code to take advantage of up-to-date technologies and knowledge about software development, whilst very much keeping to the principles and designs that have proved valuable so far.

The ongoing development of Fedora is coordinated through the non-profit DuraSpace Foundation [3], a body set up in 2009 to oversee the sustainable, community-based development of two open source digital repository systems: DSpace [4] as well as Fedora.
DuraSpace seeks sustainability through a combined model of operation: institutions join DuraSpace as members, contributing a fee that provides core staffing and strategic planning for Fedora and DSpace; specific repository-related services are provided for a fee, e.g., DuraCloud [5], which provides cloud-based storage for Fedora and DSpace; and project grants are used to kickstart new initiatives for each of the software systems. DuraSpace is establishing itself as an umbrella body for other related open source systems as well: the VIVO project [6] recently joined and is working through an incubation process to place it on a firm footing for the future. By combining these means of fostering ongoing activity, DuraSpace provides a safe home for Fedora.

In carrying out its role, DuraSpace works closely with the community-led Fedora Leadership Group [7], which combines members drawn from senior Fedora users, by virtue of their ongoing financial support for Fedora, with elected members from the wider body of users. DuraSpace works with the Fedora Leadership Group to set out future strategy, and also with the user community to promote Fedora through its website, the Fedora mailing lists [8], and events such as the annual Open Repositories conference [9] (which met, of course, in Helsinki in 2014 [10]).

[1] Fedora, http://fedoracommons.org/
[2] Fedora 4.0, https://wiki.duraspace.org/display/FF/Fedora+4.0.0+Release+Notes
[3] DuraSpace Foundation, http://duraspace.org/
[4] DSpace, http://www.dspace.org/
[5] DuraCloud, http://www.duracloud.org/
[6] VIVO Project, http://www.vivoweb.org/
[7] Fedora Leadership Group, http://fedorarepository.org/leadership-group

How did Fedora get to this point? It started as a computer science project at Cornell University in 1996 [11], which sought to investigate what a system for managing any type of digital content would look like if you started from scratch. Fedora development continued on this project basis until 2003, when, together with the University of Virginia, the software was released as a stable production version of a digital repository platform [12]. Fedora has, since then, steadily attracted interest for a broad range of digital content management use cases around the world. Why is this? Each site will have had its own reasons, but I offer those used by the University of Hull in making its own selection in 2005:

• It was designed to scale up
  o The amount of digital content is only going to grow. Thus, we needed a system that could cope with increasing amounts of content without this being a concern (something that some database systems in the past struggled with). Fedora enables this by allowing the repository to link to content stored in multiple locations: the limits are thus the available storage behind Fedora, not Fedora itself.
• It was designed to be content agnostic
  o We don’t know what content types will need managing in the future. A key advantage often described by commercial digital asset management systems is the range of file formats that they can support. But do such lists actually describe limitations to the system? New file formats are being created regularly, and we needed a system that could cope with these. That is not to say that Fedora provides access to all such formats natively (specific software may be required to read the files), but we can safely curate the files regardless of the file format.
    Fedora enables this by abstracting the file itself from the way it is held and managed within the repository.
• It was designed to be based on open standards
  o Facilitating interoperability between systems. No software system should operate in isolation, especially not today. Use of open standards ensures that we can get content into Fedora, and also out again: we are not tied into the software. Standards also enable us to add functionality to the repository as we need to, for example the early addition of OAI-PMH functionality to facilitate harvesting.
• It was designed to support the management of related items and describe the connection between them
  o Like the system itself, very little content lives in isolation. As we move more into the world of linked data this is ever more the case. Fedora has supported RDF [13] ever since it was created, and Fedora 4 now uses RDF natively as the basis for holding and describing digital content, although XML-based content can also be managed if preferred.
• It was designed to support the durability and preservation of digital content
  o To help digital content be usable into the future. Why do we keep this content? We wish to provide access to it, of course, but over what period of time? The longer we keep content, the more we need to be aware of what is needed to ensure it can continue to be accessed. Fedora has this durability at the centre of how it is designed.

In essence, Fedora does much to remove the concerns and limits about how a digital content management system operates, allowing the focus to be on curation. The University of Hull’s vision for its digital repository is to provide a safe place to manage any digital content that the University needs managing over time, or needs to provide access to, as part of its research, teaching and administration. It aims to be the digital institutional memory of the University.

[8] Fedora mailing lists, http://fedorarepository.org/community/mailinglists
[9] Open Repositories Conference, http://sites.tdl.org/openrepositories/
[10] OR2014, http://or2014.helsinki.fi/
[11] Payette, Sandra and Carl Lagoze, "Flexible and Extensible Digital Object and Repository Architecture," Second European Conference on Research and Advanced Technology for Digital Libraries, Heraklion, Crete, Greece, September 21-23, 1998, Springer, 1998 (Lecture Notes in Computer Science; Vol. 1513). http://arxiv.org/abs/1312.1258
[12] Staples, Thornton, Ross Wayland and Sandra Payette, "The Fedora Project: An Open-source Digital Object Repository System," D-Lib Magazine, April 2003. http://www.dlib.org/dlib/april03/staples/04staples.html

Applying Fedora

The advantage to the University of Hull was, as described in the Introduction, that whilst other systems might provide some of this capability, only Fedora provided it all. This was as much the case when comparing Fedora with other digital repository systems available at the time, primarily EPrints and DSpace. There have been a number of comparisons made between these systems over the years, and each has reached its own conclusion.
Many of these have highlighted the dilemma of comparing Fedora with other repository systems: EPrints and DSpace come as packages that can be installed, by and large, off the shelf, and an institution can get a repository up and running reasonably quickly. This contrasts with Fedora, where getting a repository going benefits from planning to take advantage of Fedora’s flexibility: the system asks you what type of repository you would like to build rather than delivering a package based around a pre-defined functional set. The University of Hull took this on board and agreed that we wished to build a repository that suited our broad needs, and that we didn’t want to be constrained by what other packages offered. Looking back, we are still benefitting from this decision. The principles that informed the decision have been maintained through the recent development of Fedora 4, ensuring we have a clear direction of travel to follow in further developing our repository. Alongside the recent release of Fedora 4.0, which is aimed at new adopters, there is a roadmap to Fedora 4.1, aimed at those migrating from Fedora 3.x: this next release will be available during 2015.

[13] RDF, http://www.w3.org/RDF/

Fedora has been applied to a wide range of purposes: for collections of texts, collections of images, datasets, audio and video collections, as well as collections made up of combinations of these. Key to making use of Fedora for these different purposes has been deciding how to organise and model the content, which is itself informed by what you want to do with the content. One of two primary routes can be followed:

• A compound route: files associated with each other in some way are grouped together as a single digital object so they can be easily referenced and delivered together. This route is more straightforward, but loses some of the flexibility in managing individual files.
• A complex route: files are maintained as separate digital objects, and brought together through other means (e.g., through a search interface). This approach is the more complex of the two (hence the name), but provides the ability to reference and deliver individual files either in context or out of it depending on need.

Fedora asks you to think about this, and forces serious consideration of how the content will be managed, both now and in the future. This can be hard, but the effort is worthwhile and increases the likelihood of sustainability for the collection. No one would build a physical library just to dump books within it; care would be taken over their organisation and presentation. Why would we not do this for our digital libraries?

In Fedora 4, content is held natively as RDF rather than the XML used as the basis of Fedora 3 and earlier versions. This shift reflects a broader adoption of linked data as a medium for storing and managing digital content.
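The two modelling routes, and the idea of describing relationships as links, can be made concrete with a small sketch. This is an illustration only, not Fedora's API or data model: the class `DigitalObject`, the functions `compound_model`, `complex_model` and `add_link`, and the predicate names `isPartOf` and `isSupplementTo` are all hypothetical, and RDF statements are approximated here as plain (subject, predicate, object) tuples.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# A simplified stand-in for an RDF statement: (subject, predicate, object).
Triple = Tuple[str, str, str]


@dataclass
class DigitalObject:
    """Hypothetical repository object: an identifier plus the files it holds."""
    identifier: str
    files: List[str] = field(default_factory=list)


def compound_model(identifier: str, files: List[str]) -> List[DigitalObject]:
    """Compound route: related files are grouped inside one digital object."""
    return [DigitalObject(identifier, list(files))]


def complex_model(identifier: str,
                  files: List[str]) -> Tuple[List[DigitalObject], List[Triple]]:
    """Complex route: each file is its own object, tied to a parent by links."""
    parent = DigitalObject(identifier)
    children = [
        DigitalObject(f"{identifier}/part{i}", [name])
        for i, name in enumerate(files, start=1)
    ]
    # "isPartOf" is an illustrative predicate name, not a Fedora term.
    triples = [(child.identifier, "isPartOf", parent.identifier)
               for child in children]
    return [parent] + children, triples


def add_link(triples: List[Triple], subject: str, predicate: str,
             obj: str) -> List[Triple]:
    """Return a new list with one extra link; existing links are left as-is."""
    return triples + [(subject, predicate, obj)]


if __name__ == "__main__":
    files = ["thesis.pdf", "data.csv"]
    print(compound_model("thesis-1", files))
    objects, links = complex_model("thesis-1", files)
    links = add_link(links, "thesis-1/part2", "isSupplementTo", "thesis-1/part1")
    for link in links:
        print(link)
```

In the compound case the two files can only be addressed through their shared object; in the complex case each file has its own identifier, and a new use for the content is accommodated by appending a link rather than restructuring the objects.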
Use of RDF potentially adds to the complexity of how content files should best be managed, but it also provides a way of starting simply and adding other options over time as additional links are added. In this way, RDF provides a degree of future-proofing in how content is managed: if an alternative use for the content is identified at some point, the way it is managed can be altered to meet this need through construction of additional links that meet the new use case, without wholesale change.

RDF use can influence all aspects of Fedora use. This includes the following areas, which also form a checklist for attention when applying Fedora:

• Access/rights management: One of the great advantages of Fedora is that collections can be managed with variable access control. This allows those items that can be shared openly (e.g., open access research articles) to be fully accessible, whilst controlling access to those files that are aimed at specific audiences (e.g., at the University of Hull, past exam papers for students).
• Content delivery: Fedora’s flexibility presents the dilemma that any default end-user interface could limit what Fedora can offer, since what it can offer depends on how the content is modelled. As such, Fedora implementations need to include the design of an end-user interface (an admin interface is provided). A number of generic solutions have been created over time, and two major initiatives (Hydra and Islandora) have sought to address this. These will be described more fully later.
• Storage: When Fedora was first developed the concept of cloud storage was almost non-existent. Now it is everywhere. DuraSpace themselves offer a cloud storage solution in DuraCloud, but this is one amongst many. The choice of whether to use local or cloud storage, or a combination of both for different purposes, extends beyond how Fedora manages digital content. Most important is to ensure that Fedora can link to whatever storage is selected.
• Collection management: Within a Fedora repository there is the ability to group objects together within collections. As such, an important component in modelling content is deciding how to manage collections and sub-collections as part of that model.

Fedora and preservation

Fedora has often been described as a system that can support digital preservation, the durability that was described earlier. Fedora does indeed offer many preservation capabilities by default, for example creating checksums for ingested objects. But it is the focus on durability that is key. Fedora is not a repository system that merely has preservation functionality; it is a repository system that has preservation built into how it is designed and structured. Every part of Fedora assumes that the content will need to be managed for a long time, and is designed accordingly. Fedora 4 has the following embedded:

• Auditing and fixity services: to enable anything that happens to objects to be recorded and issues addressed.
• Advanced storage capabilities: the ability to plug in back-end stores to meet local storage and preservation requirements, whether locally or in the cloud; the flexibility of being able to define policies as to where material gets stored, helping to address preservation policy; and the reassurance of self-healing copies if content is corrupted.
• Projection: the ability to apply repository management across remote systems without specific deposit into the repository.
• RDF native: This has already been mentioned, but is of particular relevance to preservation, as all data is stored in a way that enables its re-use in the future.

Being RDF native, and standards compliant generally, demonstrates one of the reasons why Hull originally saw Fedora as a valuable long-term development platform for our repository. The other reasons also encapsulate this: the ability to scale up, to be content agnostic, and to understand the connection between related items to preserve meaning.

Fedora can, of course, also make use of web-based preservation services through its APIs, either remotely (e.g., the PRONOM format registry at The National Archives) or locally (e.g., a local installation of JHOVE or equivalent for format profiling). Recognising this, Fedora does not itself claim to have all the answers to providing preservation capability, but is designed so that a digital repository can be one component of a wider architecture, particularly for large bodies of data, and integrate with other processes and systems as required. How this is done depends on local focus and requirements. One approach that has attracted interest is separating out access functionality from preservation within an overall system architecture, with the access repository saving a separate copy to a preservation repository that acts as a dark archive, although both can apply the preservation capabilities listed above.

Hydra and Islandora

Fedora is a rich and flexible system, providing many options for the management of digital collections. This flexibility is empowering, as it allows individual sites to tailor a repository solution to meet local needs. It can, though, also lead to a lot of effort being required to implement that solution. In recent years two major open source initiatives have sought to address this by making adoption of Fedora more straightforward: Hydra [14] and Islandora [15]. Both have created frameworks that make generating interfaces for creating, reading, updating and deleting content much easier, using tools based on Ruby on Rails [16] and Drupal [17], respectively. Both are seeing considerable interest and take-up globally, and both are seeking to build communities of their own to sustain the developments.

The initiatives have sought to provide a way to take advantage of the richness of Fedora’s functionality through more standard ways of implementing it. Those that have adopted one or other of these solutions have found that there are big advantages to working together on developments, and this has been at the centre of sustainability plans.
Functionality developed by community members can be more easily shared for local use; the skillsets required are being better defined and developed to facilitate work with the frameworks; and the ability to expand a local solution to meet other needs is more straightforward.

The development of these Fedora-related initiatives does pose an interesting challenge: one open source initiative is relying on the existence of another to deliver its capability. This is not a case of one initiative using another piece of open source software to provide a component part of its functionality: these are initiatives looking to provide equivalent functionality through different means. There is thus a need to maintain close links between the two, to maintain compatibility and align developments. Both initiatives have benefitted from having developers closely involved in the creation of Fedora 4, and so have been able to ensure that all the software works well together. This is also where DuraSpace plays a key role, coordinating community activities and acting as a common advocate for all three initiatives: Fedora, Hydra and Islandora.

[14] Hydra, http://projecthydra.org/
[15] Islandora, http://islandora.ca/
[16] Ruby on Rails, http://rubyonrails.org/
[17] Drupal, https://www.drupal.org/
Developing the Hydra and Islandora frameworks has also generated useful ongoing debate about where specific functionality sits: should it be within Fedora, or should it be within the overlying framework? This debate has contributed healthily to the development of Fedora 4, helping the development team to define properly what is core to a digital repository and what is optional, to be implemented through some other means. Local factors may determine some of the answers, and all the initiatives have sought to enable such local decisions to be taken without excessive restraint.

Practical considerations

So where to start when looking to adopt Fedora? The Fedora website [18] and associated wiki [19] and GitHub [20] sites are clearly a good place to start. Starting with the equivalent resources for Hydra [21] or Islandora [22] would also prove useful if this is a preferred route. All of these software options require technical skills and knowledge: this should not be underestimated, but neither should it be considered too burdensome. The benefit of having software developed through community effort is that there is a lot of mutual interest in enabling you to work with that software.

Putting aside work with the software, though, it is vital that you give serious consideration to what type of repository you are building and plan its design carefully. It is inevitable that an initial design may not encompass everything that needs attention: however, this initial planning will provide a valuable basis upon which the flexibility of the chosen software can be used to build and extend the repository.

Commercial partners [23] providing services based on the software can help support adoption. These exist for Fedora and for the Hydra and Islandora frameworks, and can provide valuable knowledge towards creating your solution. When making use of a commercial partner it is important to bear in mind what service is required.
Is it a defined repository solution, or development effort to create your own repository (even if based on an existing framework)? One of the dilemmas of delivering a defined repository solution is that a number of decisions will have had to be taken by the service provider on the functionality that can be offered: this either delivers a clear solution to a need, or reduces the flexibility of what the repository can offer, depending on how you view the approach. A balance of need and flexibility is required.

[18] Fedora, http://fedoracommons.org/
[19] Fedora wiki, https://wiki.duraspace.org/display/FF/Fedora+Repository+Home
[20] Fedora github, https://github.com/fcrepo4/fcrepo4
[21] Hydra resources, https://wiki.duraspace.org/display/hydra/The+Hydra+Project
[22] Islandora resources, http://islandora.ca/resources
[23] Some commercial partner organisations, http://fedoracommons.org/service-providers

Looking to the future

What next for Fedora? The community development model that has served Fedora well has been refreshed and stimulated by the development of Fedora 4. The creation of the Fedora Leadership Group is now taking this to the next stage, empowering the community to continue the effort through a more formal framework. There is a lot of investment in sustainability through collaboration, and a track record to back this up. DuraSpace provides a stable home for the software, and dedicated support staff to facilitate ongoing community activity.

Does this endeavour reach all parts of the world? There has, on occasion, been a perception that Fedora and DuraSpace are US-oriented: are they of relevance to European developments? Yes, there is a US orientation to the activities; this is inevitable given the origins of both Fedora and DuraSpace. However, Fedora has attracted interest from around the world ever since it first became available, and this is very likely to continue. A quick review of the Fedora User Registry [24] demonstrates the international nature of the community. DuraSpace is also committed to expanding the international user base and increasing the support for Fedora’s use in international contexts. Fedora adoption in Finland would be very welcome and well supported through the community, and would contribute to the range of existing European Fedora-based initiatives.

Fedora 4 software development is also ongoing. It has reached a major milestone with the release of Fedora 4.0 and will reach further maturity during 2015 with the release of Fedora 4.1, which will provide specific support for existing Fedora users to migrate to the new system. There is much continuing interest in, and a great deal of emphasis on, the use of RDF as the basis for storing digital content objects within the system. This adoption of linked data as the basis for storing digital collections promises to be valuable for the sustainability of the collections and for allowing them to be used in new ways. As designed, then, Fedora will itself continue to be a valuable asset for working with digital content for some considerable time.

January 2015

[24] Fedora User Registry, http://registry.duraspace.org/registry/fedora