A Proposal for Space Exploration at The National Archives

Page created by Fred Perkins
 

Amy Warner
The National Archives
Kew Gardens, London
Amy.Warner@nationalarchives.gsi.gov.uk

Paul Clough
University of Sheffield
Sheffield (UK) S1 4DP
p.d.clough@sheffield.ac.uk

Abstract

The National Archives (TNA) is the UK government's official archive. It stores and maintains records spanning over a thousand years, more recently in digital form. Much of the information held by TNA makes reference to place, and users’ queries to TNA’s online catalogue frequently involve searches for location. To enhance access to this unique collection, we are planning to utilise existing technologies from Geographical Information Retrieval (GIR). In this paper we describe TNA, initial analysis of the geographical nature of data held by TNA, geo-referencing as a starting point for GIR and our proposed plan of research.

1   Introduction

Geo-referencing - relating information to geographic location - is an important consideration for information access systems (Hill, 2006; Chapman & Wieczorek, 2006). This is due to the ubiquity of location (particularly place names) within much of the information we encounter on a regular basis. For example, many Web documents contain geographical identifiers including place names (or toponyms), addresses (and address fragments), postal codes, and hyperlinks (Ding et al., 2000; McCurley, 2001; Himmelstein, 2005). In addition, Petras (2004) showed that over half of a set of five million library catalogue records of the University of California contained one or more place-related subject headings or codes. However, not only do documents contain geo-references, but users of search systems also query based upon location. For example, a study by Zhang et al. (2006) found that 12.7% of queries submitted to a Web search engine contained a place name; Sanderson & Han (2007) also showed that queries for city names were repeated more frequently than for other names (e.g. country and state).

This information can be exploited and used to provide spatial awareness to information systems (see, e.g. (Buckland et al., 2007; Purves et al., 2007)). Zong et al. (2005) discuss how the ability to perform query by location can be an important and useful addition to a digital library and Buckland et al. (2007:376) concur with this in saying “libraries have a broad need to support geographic search”.

In this paper we consider how geo-referencing information from The National Archives1 (TNA), the UK government's official archive, could provide a means of synthesising access across different archives and provide novel ways of searching and browsing historically and culturally significant material. Section 2 discusses past work in exploiting location for enhancing access to libraries and archives; Section 3 introduces TNA, the data and motivations for the proposed project; Section 4 describes areas involved in geo-referencing information; Section 5 outlines our planned project work and Section 6 concludes the paper.

1 http://www.nationalarchives.gov.uk/

2   Related Work

Geography provides an important facet for information seeking in many contexts, including historical, cultural and library archives. Several studies have examined the nature of search queries in the cultural heritage field, both for general information (Cunningham et al., 2004) and for images (Pask, 2005; Choi & Rasmussen, 2003; Collins, 1998; Chen, 2001). These all had similar findings: namely that people, places, time periods, and subjects were popular topics of search in this domain. This is also true of library users (Buckland et al., 2007; Zong et al., 2005) and utilizing place (and time) for information access has been investigated in a range of past projects.

Probably one of the most widely cited research projects that made use of geo-referencing was
the Alexandria Digital Library (ADL) project (Frew et al., 1998; Goodchild, 2004), which aimed “to develop a user-friendly digital library system that provides a comprehensive range of services to collections of maps, images and spatially-referenced information.”

Moritz (1999) describes the geo-referencing of narrative descriptions of specimens in natural history museums written by collectors in the field. Reid et al. (2004) discuss some of the projects funded by EDINA, a National Data Centre in Edinburgh, including Go-Geo! and Geo-X-Walk. These projects facilitate the discovery of geo-referenced information from distributed collections in the UK.

Buckland & Lancaster (2004) describe combining place, time and topic in the Electronic Cultural Atlas Initiative2, designed to foster a research community interested in geo-temporally encoded data and create a catalogue of geo-referenced online resources. As a part of this initiative, Buckland et al. (2007) give an overview of the Going Places in the Catalog project, an initiative to offer seamless searching of educational and scholarly numeric and textual resources. This includes extracting geo-spatial information from library catalog records to enhance the user’s search experience.

Johnson (2004; 2005) describes the indexing and delivery of historical maps online from the National Library of Australia using the TimeMap3 tool (a mapping Java applet which generates complete interactive maps with a few simple lines of HTML code). With a similar goal, Martins et al. (2007) describe the EU co-funded DIGMAP project aimed at making collections of digitised historical maps spatially aware, through the use of text mining techniques and Geographic Information Retrieval (GIR). This project deals specifically with some of the issues involved in using digitised versions of historical documents (e.g. old font styles, incorrectly OCR’d words etc.).

Mostern & Johnson (2008) describe the construction of a historical event gazetteer (these are records of historical events that describe the existence of named places) and the use of spatio-temporal visualisation to view events and relationships from the Heurist4 collaborative database. Clough et al. (2008) describe a study in which semantic access to cultural heritage information from Tate Online, a large online art collection, is investigated. This includes the development of a prototype system to demonstrate advanced search and browse functionalities, including faceted browsing, timelines and maps, based on semantic enrichment of data from Tate Online.

Finally, the Vision of Britain5 website brings together census data, election results, historical maps, gazetteers and travel writing between 1901 and 2001 into a single point of access. This is part of the Great Britain Historical Geographical Information System (GBHGIS) project managed by the University of Portsmouth (Gregory & Southall, 2003).

2 http://www.ecai.org/
3 http://www.timemap.net
4 http://heuristscholar.org/heurist/
5 http://www.visionofbritain.org.uk

3   The National Archives (TNA)

3.1   The Role of TNA

The National Archives (TNA) is the UK government's official archive. The documents that it holds reflect almost 1,000 years of history, with records ranging from parchment and paper scrolls through to digital files and archived websites6. The National Archives also provides advice and guidance to others throughout the public and private sectors about caring for archives, and records the location of archival collections throughout the UK. It publishes all UK legislation and advises upon and encourages the re-use of public sector information.

The National Archives brings together the Public Record Office (founded in 1838), Historical Manuscripts Commission (founded 1869), the Office for Public Sector Information, and Her Majesty's Stationery Office (founded 1786).

6 http://www.nationalarchives.gov.uk/about/whowhathow.htm?source=ddmenu_about1

3.2   Data held by TNA

The National Archives’ history, and its central role in information management, mean that the data that it holds is extensive, varied and rich. This information falls into four groups:

1. Original documents, in both paper and electronic format (found in Electronic Records Online – ERO) and digitised copies of original documents made available online (through Documents Online).
2. Catalogues which describe the content of The National Archives collections and record the location of archival records
held throughout the UK. These range from the general, in particular TNA’s Catalogue of its collections and the National Register of Archives, to the specific, with the E179 Taxation database.
3. Published information, in particular the London Gazette, official newspaper of record and the Statute Law Database. This data is currently maintained and hosted separately7.
4. A wide variety of other datasets, including the ARCHON Directory of archive contact details, and Your Archives, a wiki which allows users to contribute content about archival records.

7 http://www.gazettes-online.co.uk/, http://www.statutelaw.gov.uk/

In addition to these groups, data can also be divided into structured and unstructured; the latter providing a greater challenge for computational text analysis. Geographic information can be found throughout TNA’s databases, varying from the postal addresses of every archive in the UK (found in the ARCHON directory); a comprehensive directory of medieval places, with related OS grid references and alternative names (drawn from the E179 tax database); and a list of parishes which relate to the National Farm Survey carried out during World War Two (found in the Catalogue, reference MAF 32).

3.3   Motivations for Space Exploration

Allowing users to search TNA data by place is a high priority because of the popularity of place names as search queries. A recent survey of a sample of search queries carried out in January and July 2009 found that about 25% of search queries related to a place (substantially higher than figures reported in previous studies for general-purpose Web search).

Enabling TNA’s data to be navigated and explored by place could provide an effective means of helping users to find a way through the mass of information held by The National Archives – currently about 32 million records in the online databases and documents. In particular this would allow:

• Visualisation of TNA data through geographical interfaces.
• Division between place names and personal names (including personal titles, such as the Duke of Richmond). This would enable the creation of a distinct set of place data.
• Disambiguation between different places with the same name.

Exploiting place information in The National Archives databases could unlock their potential, allowing the information to be interrogated in a way which is not possible with 15 discrete data sources, currently existing in isolation. Complementary information in different databases could be highlighted, enhancing the richness of the data. For example, researchers interested in the history of Godalming parish could consult:

• Tithe records on The Catalogue: http://www.nationalarchives.gov.uk/catalogue/displaycataloguedetails.asp?CATID=4085656&CATLN=6&Highlight=,GODALMING,PARISH&FullDetails=False
• Details on a seventeenth century Vicar of Godalming, who failed to pay his tax, found on the E179 database: http://www.nationalarchives.gov.uk/e179/details.asp?piece_id=1030&doc_id=38513&doc_ref=E179/14/216A
• The records of Godalming parish, held at Surrey History Centre, recorded on the National Register of Archives: http://www.nationalarchives.gov.uk/nra/searches/subjectView.asp?ID=O12730
• The location of records relating to the manor of Godalming, found on the Manorial Documents Register: http://www.nationalarchives.gov.uk/mdr/searches/detail.asp?SubjectID=170716&CountyID=398&FirstDate=&LastDate=&ParishName=&MDKeyword=

Users’ resource discovery experience could potentially be enriched further by links to related external resources, for example:

• The page on Godalming on the Exploring Surrey's Past website: http://www.exploringsurreyspast.org.uk/themes/places/surrey/waverley/godalming
• The entry for Godalming in the Victoria County History, found on the British History Online website: http://www.british-history.ac.uk/report.aspx?compid=42924&strquery=godalming
• The Wikipedia page on Godalming: http://en.wikipedia.org/wiki/Godalming

The fragmented nature of TNA’s databases can make resource discovery challenging for users. The vision of a single catalogue interface is being actively pursued, and the identification and exploitation of place data in TNA’s resources could contribute to this.

Because of the nature of data in TNA, a critical challenge that needs to be addressed is the issue of change of name over time and the use of alternative names. The fact that TNA’s data includes, in digital form, information from the medieval period to the present day means that it provides a unique environment to test the type of issues which could arise. Examples of changes of name over time include:

• Hampshire, sometimes known by the abbreviation Hants, also (formerly frequently) known as Southampton.
• Richmond, Surrey, formerly known as West Sheen.
• Stepney, formerly also known as Stebunheath.

4   Geo-Referencing

Relating information to geographic location involves linking between informal means of referring to locations using place names (or toponyms) and formal representations based on spatial referencing systems (e.g. longitude and latitude or the British National Grid system). Much of the geographical information contained in libraries and archives is based on place names; however, these must be geospatially referenced (i.e. grounded) if they are to be used to their greatest potential for information access and knowledge discovery. Locations are also identifiable in other forms than place names, including addresses, postcodes, address fragments, phone numbers, URLs and IP addresses.

Gazetteers are typically used to provide the link between informal (place names) and formal (corresponding spatial references) representations (section 4.1 presents some example gazetteers). A key task for providing geospatial information access is the recognition (or extraction) of potential geo-references and their geospatial referencing (discussed further in section 4.2). Geographic Information Retrieval (GIR) deals with the indexing of information to enable searching over it to find information that is potentially relevant to some information need (discussed in section 4.3).

4.1   Sources of Geographical Knowledge

Explicitly geo-referencing information relies on having access to existing geographical knowledge. The most commonly used resource is the gazetteer: at minimum a list of place names, feature types and spatial positions (Hill, 2006; Mostern & Johnson, 2008). Sources of gazetteers commonly include (Goodchild & Hill, 2008:1039):

• Gazetteers of ‘official’ toponyms (e.g. the Ordnance Survey 1:50,000 Scale Gazetteer);
• Indexes accompanying published atlases (e.g. Multimap.com);
• Place identifier tables accompanying GIS datasets;
• Place authority files (e.g. Vision of Britain) and rules (e.g. NCA Rules for the Construction of Personal, Place and Corporate Names8) used for cataloguing and indexing;
• Historical printed gazetteers and encyclopedias (e.g. Gazetteer of Great British Place Names);
• General online resources (e.g. Wikipedia).

8 http://www.nca.org.uk/materials/namingrules.pdf

Researchers have investigated how to combine multiple sources to create more comprehensive resources (see, e.g. (Manguinhas et al., 2008)). Example sources which could be used within the project include the following9 (these are predominantly UK-biased):

• Getty Thesaurus of Geographic Names10 (TGN) contains more than 1 million names and other information about places across all continents (including political and historical places). The locations are organized in a hierarchical structure (e.g. World>Europe>United Kingdom>Scotland).
• Ordnance Survey 1:50,000 Scale Gazetteer (OS50k) contains 260,000 names in Great Britain from the current and previous years’ 1:50,000 Scale gazetteer. Locations are represented by points (National Grid squares) with additional feature type.
• Ordnance Survey Code-Point provides a National Grid reference for each unit postcode in Great Britain. The postcode data is sourced from, among other resources, Address Point, which contains 26 million addresses recorded in the Royal Mail Postcode Address File (PAF).
• GeoNames11 is a database of over 8 million place names (and postcodes) from all countries. Locations are represented by points and contain a range of feature types (e.g. country, region, city, body of water etc.). The database can be downloaded and used free of charge. The data is based on several sources including the GEOnet Names Server (GNS), the US Geological Survey Geographic Names Information System (GNIS) and Wikipedia.
• Gazetteer of British Place Names12 (GBPN) contains over 50,000 entries (points - National Grid references) referring to places in Great Britain. Each place name is related to its historic county (administrative areas) and variants of place names are also included.
• UK Placename Finder13 is a database of more than 31,000 UK place names (available for purchase on CD-ROM). Places are referenced to the UK National Grid.
• Seamless Administrative Boundaries of Europe (SABE) is a pan-European dataset of administrative units compiled from national contributions (containing the geometry and semantics of the administrative hierarchies of 30 European countries). For the UK, the dataset consists of around 22,000 place names.
• Alexandria Gazetteer contains around 5.9 million place names, with feature type information for all entries. The gazetteer is mainly US-based (U.S. place names from the U.S. Geological Survey's GNIS database are used), but also contains around 4 million non-US place names from the GEOnet Names Server (GNS).

9 A comprehensive survey can be found from EDINA: http://edina.ac.uk/projects/crosswalk/
10 http://www.getty.edu
11 http://www.geonames.org/
12 http://www.gazetteer.co.uk/
13 http://www.digital-documents.co.uk/archi/placename.htm

In addition to these resources, geographic data can be mined from the following resources and services:

• Wikipedia contains a wealth of place names which can be automatically extracted and compiled into useful gazetteer resources (see, e.g. (Overell & Rüger, 2006)).
• Geo-X-Walk14 is available for use through EDINA and provides APIs for UK name, postcode and identifier lookup.
• Geograph15 is a project which aims to collect photographs and information for every square km of Great Britain. The project provides an Application Programming Interface (API) which can be used to interact with the data. Currently there are 1,359,305 images for Great Britain which include place name metadata and an OS grid reference.
• Yahoo Geocoding API16 is a freely-available web service (limited to 5,000 requests per day) that provides coordinates for a given address or place name. (The GeoNames service uses this as a backup resource if the place name cannot be found in its own database.)

14 http://www.geoxwalk.ac.uk
15 http://www.geograph.org.uk/
16 http://developer.yahoo.com/maps/rest/V1/geocode.html

A more sophisticated form of geographic knowledge representation is an ontology which typically organizes locations into hierarchical structures and provides topological relations between locations (see, e.g. Fu et al., 2004). Recent work has focused on areas such as spatio-temporal modeling (Mostern & Johnson, 2008), interoperability (Janowicz and Kessler, 2008) and automatic gazetteer construction (Nadeau et al., 2006; Popescu et al., 2008).

4.2   Identifying/Resolving Geo-references

A basic task in geo-referencing information is to identify candidate geo-references. For place names (or toponyms), this is commonly referred to as geo-parsing (Larson, 1996) or toponym recognition (Leidner, 2008) and can be likened to the Information Extraction task of Named Entity Recognition (or NER): the process of assigning every word or group of words to a set of predefined categories (including “not an entity”).
These categories commonly include entities such       geographical ontologies, spatial indexing and
as location, person and organization.                 storage of documents, geographical relevance
   Most NER systems consist of at least three ba-     ranking, the extraction and resolution of geo-
sic components: (1) a tokeniser, (2) gazetteer        graphical references, determining the geographi-
lists, and (3) a Named Entity (NE) grammar            cal scope (or focus) of web documents, user in-
(syntactic rules that take into account features      terfaces and visualization, and methods for the
that describe the surrounding context). Rules for     formulation of spatial queries and interaction
NER can be generated entirely by hand (knowl-         with the results of geographic search (see, e.g.
edge-based) or automatically, using machine           (Hill, 2006:185-214; Purves et al., 2007)).
learning (or statistical) techniques on previously
classified texts (see, e.g. (Zhou & Su, 2002;         5     Discussion
Curran and Clark, 2003)). The most common
                                                      5.1    Example Applications
approach is to use gazetteers together with syn-
tactic rules. For example, Curran & Clark (2003)      By geo-referencing data from TNA, a number of
use a Maximum Entropy model to perform NER using features such as
Part-Of-Speech tags, a history of preceding and following NEs,
orthographic information and gazetteers. Their approach was used on
historical texts in the Geo-X-Walk project (Nissim et al., 2004).
   Once recognised, a candidate place name must be resolved, i.e. given a
spatial reference (Larson (2006) calls this geo-coding; Leidner (2008)
refers to this as toponym resolution). This may involve disambiguating
the place. There are two main ambiguities in geo-references: (1) referent
ambiguity and (2) reference ambiguity. The former occurs when the same
place name is used to refer to more than one location (e.g. “Chapeltown”
refers to a location in South Yorkshire (UK), Lancashire (UK), Kent
County (USA) and Panola County (USA)). This also includes a geographical
reference being used for other entities (e.g. the name of a person or
company), which is called referent class ambiguity. (Amitay et al. (2004)
separate ambiguity into geo/non-geo and geo/geo.)
   Reference ambiguity occurs when the same location can have more than
one name, e.g. due to the historical variation of a location name over
time (Smith & Mann, 2003), transliteration (Kwok & Deng, 2003) and
implicit formats and character encoding (Axelrod, 2003). An extensive
summary can be found in (Leidner, 2008).

4.3   Geographical IR (GIR)
Geographic Information Retrieval (GIR) is concerned with finding
documents relevant to queries which include some geographical context
(Larson, 1996). Information is assigned one or more footprints and
documents are retrieved and ranked according to thematic and spatial
relevance (results can also be visualised using maps).
   Research in this field has addressed a wide range of relevant areas
such as the building of

applications could be developed to enhance access and support knowledge
discovery. These include:

   Geographical search/browse: an obvious application is providing users
with functionalities to search and navigate the TNA archives through
place, thereby making the current search spatially-aware. Browsing across
archives and search results would be enhanced through visualising records
on a map. Users could also initiate a search by entering a place name,
selecting names from a list (or ontology) or selecting points (or a
region) from a map. A gazetteer or geographical ontology could also
provide search assistance by suggesting locations spatially related to
the current search position (Purves et al., 2007).

   Collection overviews: to support more exploratory search paradigms
(i.e. users with no specific search goal in mind), a summary of locations
represented in different archives could be created, for example by
mapping all documents to grid cells and shading these to indicate
document density. This would indicate the coverage of the content from
the archives (Hill, 2006:211).

   Co-occurring concepts: through statistical analysis of term
co-occurrences, terms could be identified that help to characterise a
location (e.g. it is common to find the terms “Sheffield” and “steel”
co-occurring). This could be analysed with respect to time to see changes
in concepts related to specific places (or between archives). Map
visualisations overlaid with co-occurring concepts (and timeline
controls) could help to highlight concepts or themes which relate to
specific regions (see, e.g. (Purves and Edwardes, 2008)).
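
   As a rough illustration of the collection overview idea above, the
following Python sketch assigns geo-referenced records to fixed-size
latitude/longitude grid cells and derives a shading level for each cell
from its document count. The record identifiers, coordinates, cell size
and number of shading levels are illustrative assumptions only; they are
not taken from TNA data, and other gridding schemes would work equally
well.

```python
import math
from collections import Counter


def grid_cell(lat, lon, cell_deg=0.5):
    """Return the (row, col) index of the grid cell containing a point.

    cell_deg is the cell size in degrees (an assumed value).
    """
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def document_density(doc_points, cell_deg=0.5):
    """Count geo-referenced documents per grid cell.

    doc_points maps a record identifier to a (lat, lon) pair.
    """
    counts = Counter()
    for lat, lon in doc_points.values():
        counts[grid_cell(lat, lon, cell_deg)] += 1
    return counts


def shade(count, max_count, levels=5):
    """Map a document count to a shading level 1..levels (0 = empty cell)."""
    if count == 0:
        return 0
    return max(1, math.ceil(levels * count / max_count))


# Hypothetical records with approximate coordinates (assumptions).
records = {
    "rec-a": (53.38, -1.47),  # near Sheffield
    "rec-b": (53.40, -1.50),  # near Sheffield
    "rec-c": (51.48, -0.29),  # near Kew
}

density = document_density(records)
max_count = max(density.values())
shading = {cell: shade(n, max_count) for cell, n in density.items()}
```

   Cells with more records receive a higher shading level, so a map
rendered from `shading` would give a coarse picture of where an archive's
content is concentrated.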
5.2   Research Plan

The goal of our proposed work is to investigate geo-referencing within
TNA (a set of distributed archives) to enable geo-spatial information
access to the archive collections. This will include the following key
stages:

   Stage 1: establishing user (and system) requirements for geo-enabling
TNA resources. This will involve understanding the current use of place
in the search for archived material (e.g. log analysis and user
interviews) and analysing users’ information needs.

   Stage 2: developing suitable geographical resources that can be used
for geo-referencing the collection. Resources such as the OS50k gazetteer
and GeoNames will be used as a starting point and initial analysis will
be undertaken to compare and understand these available resources (and
others). Techniques to combine and extend existing geographical resources
to deal with historical variations and vernacular expressions and add
further information (e.g. feature types) will be explored based on mining
existing data from the TNA (e.g. from the E179 database) and online
sources (Uryupina, 2003; Jones et al., 2008; Popescu et al., 2008;
Manguinhas et al., 2008). A key task in this activity will be deciding
how to create spatial coordinates for places which do not appear in
existing resources (e.g. historical names, (Crane, 2004)). In addition,
this work will explore a suitable document representation and annotation
scheme for the TNA data.

   Stage 3: developing suitable methods for toponym recognition and
resolution (addresses will also be considered for the ARCHON database).
Simple approaches for NER based on gazetteer lookup and minimal syntactic
rules (Clough, 2005) will be tested, together with more sophisticated
approaches based on machine learning (Curran & Clark, 2003). An existing
method for resolving ambiguous toponyms using default senses and
gazetteer resources (Clough, 2005) will be tested, together with
approaches which utilize spatial proximity (Pasley et al., 2007). We plan
to investigate the suitability of different approaches.

   Stage 4: developing “light-weight” prototypes based on using
publicly-accessible resources (i.e. mashups (Erle et al., 2005)) and a
suitable architecture for providing GIR functionality. We plan to use
tools such as GATE [17] (Cunningham, 2002) and OpenCalais [18] from
Reuters for NER, OpenSpace [19] from Ordnance Survey and Google Maps [20]
for mapping, Solr [21] for interaction functionality including faceted
browsing and PostGIS [22] and LocalLucene [23] for spatial/textual
indexing and retrieval.

   Stage 5: evaluating prototype applications with users and TNA. This
will build on previous work on relevance assessments in GIR (Purves &
Clough, 2006).

[17] http://gate.ac.uk/
[18] http://www.opencalais.com/
[19] http://openspace.ordnancesurvey.co.uk/openspace/
[20] http://maps.google.co.uk/
[21] http://lucene.apache.org/solr/
[22] http://postgis.refractions.net/
[23] http://locallucene.sourceforge.net/

5.3   Potential Challenges

By geo-referencing material from The National Archives, we will be able
to provide additional functionality to explore the content with respect
to place (as discussed in section 3.3). However, the particular context
of TNA presents several challenges, particularly with respect to applying
automated text processing techniques. Challenges include the following:

   Multiple archives: each TNA database exhibits different geographical
characteristics (e.g. ARCHON includes address information; E179 includes
historical variants; Documents Online contains unstructured text).
Decisions regarding how documents will be indexed in this distributed
setting must also be considered (e.g. centralized or distributed indexes;
single search or metasearch).

   Understanding references to place: geo-references are used in a number
of different contexts (e.g. they could refer to the location of the
document’s author or publisher; the geographical scope of its content; or
the location of a physical entity, e.g. a library). Simply identifying
geo-references and attaching these to documents or records may not be
sufficient.

   Extracting and resolving toponyms: ambiguities in natural language
always present challenges to natural language processing. In particular,
extracting and referencing toponyms which do not exist within the current
resources (e.g. historical variants) will require attention. Although
approaches in NER can successfully
identify place names, the main problem is how to spatially ground the
references (i.e. can we find spatial information for known places which
are likely to co-occur with historical variants?).

   Integrating space and time: in many contexts space and time are linked
and should therefore be modeled together (e.g. Buckland & Lancaster,
2004; Mostern & Johnson, 2008). Current spatial indexing methods may not
be suitable for supporting representation of space and time without major
re-design.

   Implementation: one goal of this work is to inform future designs of
TNA’s search engine. Geo-enabling TNA’s content should be achieved in
such a way that it causes minimal interruption/change to the current
system and cataloguing processes.

   Scalability: the volume of data held in the TNA databases also
presents an implementation challenge, particularly for processes such as
geo-parsing and geo-coding. This may affect the kind of solutions
developed in the project (e.g. gazetteer lookup with minimal natural
language processing has been shown to work efficiently for large
collections in past work (McCurley, 2001; Clough, 2005)).

6   Conclusions

In this paper we have described a proposal for space exploration at The
National Archives, the UK government's official archive. The diversity
and vastness of content at TNA currently present a real challenge for the
archival staff and end users of the archive, such as scholars and the
general public. By exploiting geo-references from the content, material
can be linked and aggregated to provide enhanced forms of information
access and knowledge discovery across discrete data sources. This would
also highlight complementary information in TNA’s databases, and in doing
so enhance the richness of the data. Place data is also one of the most
popular topics for searching the archives from TNA’s website and
Geographical Information Retrieval (GIR) technologies could help improve
the current search paradigms. By applying (and exploring) existing
technologies from areas such as GIR and text mining, we can geo-enable
the archive and prepare it for future access.

References

Amitay, E., Har'El, N., Sivan, R. and Soffer, A. (2004) Web-a-Where:
  Geotagging Web Content, In: Proceedings of the 27th annual
  international conference on Research and development in information
  retrieval (SIGIR04), Sheffield UK, 2004, pp. 273-280.
Axelrod, A. E. (2003) On building a high performance gazetteer database.
  In: Kornai, A. and Sundheim, B. (eds.) Proceedings of the HLT-NAACL
  2003 Workshop on Analysis of Geographic References, Alberta, Canada:
  ACL, 63-68.
Buckland, M. and Lancaster, M. (2004) Combining Place, Time and Topic:
  The Electronic Cultural Atlas Initiative, D-Lib Magazine, Vol. 10(5).
Buckland, M., Chen, A., Gey, F.C., Larson, R.R., Mostern, R. and Petras,
  V. (2007) Geographic Search: Catalogs, Gazetteers, and Maps. College &
  Research Libraries. Vol 68, no. 5 (Sept 2007): 376-387.
Chapman, A.D. and J. Wieczorek (eds) (2006) Guide to Best Practices for
  Georeferencing. Copenhagen: Global Biodiversity Information Facility.
Chen, H. (2001) An analysis of image queries in the field of art history.
  JASIST, 52(3), 260-273.
Choi, Y. and Rasmussen, E.M. (2003) Searching for Images: The Analysis of
  Users' Queries for Image Retrieval in American History. JASIST, 54(6),
  498-511.
Clough, P. (2005), Extracting Metadata for Spatially-Aware Information
  Retrieval on the Internet, In: Proceedings of Workshop on Geographic
  Information Retrieval (GIR'05), held in conjunction with CIKM2005,
  Bremen, Germany, pp. 25-30.
Clough, P., Marlow, J. and Ireson, N. (2008) Enabling Semantic Access to
  Cultural Heritage: A Case Study of Tate Online, In: Larson, M., K.
  Fernie, J. Oomen and J. Cigarran (eds.) Proceedings of the ECDL 2008
  Workshop on Information Access to Cultural Heritage, Aarhus, Denmark,
  September 18, 2008. ISBN 978-90-813489-1-1.
Collins, K. (1998) Providing subject access to images: A study of user
  queries. The American Archivist, 61, 36-55.
Crane, G. (2004) Georeferencing in Historical Collections, D-Lib
  Magazine, Vol. 10(5).
Cunningham, H. (2002) GATE, a General Architecture for Text Engineering.
  Computers and the Humanities 36, 223-254.
Cunningham, S.J., Bainbridge, D. and Masoodian, M. (2004) How people
  describe their image information needs: A grounded theory analysis of
  visual arts queries, In: Proceedings of JCDL’04, 47-48.
Curran, J.R. and Clark, S. (2003) Language Independent NER using a
  Maximum Entropy Tagger.
  In Proceedings of CoNLL-2003, Edmonton, Canada, 2003, 164-167.
Ding, J., Gravano, L. and Shivakumar, N. (2000) Computing geographical
  scopes of web resources. In: Proceedings of 26th International
  Conference on Very Large Databases, VLDB 2000, Cairo, Egypt, September
  2000, 10-14.
Frew, J., Freeston, M., Freitas, N., Hill, L.L., Janee, G., Lovette, K.,
  Nideffer, R., Smith, T.R. and Zheng, Q. (1998) The Alexandria Digital
  Library Architecture. ECDL 1998, 61-73.
Fu, F., Jones, C.B. and Abdelmoty, A.I. (2005) Building a Geographical
  Ontology for Intelligent Spatial Search on the Web, In: Proceedings of
  IASTED International Conference on Databases and Applications
  (DBA2005).
Goodchild, M. (2004) The Alexandria Digital Library Project: Review,
  Assessments and Prospects, D-Lib Magazine, Vol. 10(5).
Gregory, I. and Southall, H. (2003) The Great Britain Historical GIS,
  University of Portsmouth, Available online:
  http://www.port.ac.uk/research/gbhgis/aboutthegbhistoricalgis/documentation/
Goodchild, M. F. and Hill, L. L. (2008) Introduction to digital gazetteer
  research. Int. J. Geogr. Inf. Sci. 22, 10 (Oct. 2008), 1039-1044.
Hill, L.L. (2006) Georeferencing: The Geographic Associations of
  Information, The MIT Press, Cambridge, Massachusetts, September 2006,
  0-262-08354-X.
Himmelstein, M. (2005) Local Search: The Internet Is the Yellow Pages,
  IEEE Computer Society Journal, 0018-9162/05, 26-34.
Janowicz, K. and Kessler, C. (2008) The role of ontology in improving
  gazetteer interaction, International Journal of Geographical
  Information Science, Vol. 22(10), 1129-1157.
Johnson, I. (2004) Putting time on the map: Using TimeMap for map
  animation and web delivery, GeoInformatics, 26-29.
Johnson, I. (2005) Indexing and Delivery of Historical Maps Online Using
  TimeMap™, National Library of Australia News, January 2005 Volume XV
  Number 4.
Jones, C., Purves, R.S., Clough, P., and Joho, H. (2008) Modelling Vague
  Regions with Knowledge from the Web, International Journal Geographic
  Information Systems (IJGIS) - Special Edition on Digital Gazetteer
  Research, Volume 22(10), 1045-1065.
Kwok, K. L. and Deng, Q. (2003) GeoName: a system for back-transliterating
  pinyin place names. In: Kornai, A. and Sundheim, B. (eds.) Proceedings
  of the HLT-NAACL 2003 Workshop on Analysis of Geographic References,
  Alberta, Canada: ACL, 26-30.
Larson, R.R. (1996) Geographic Information Retrieval and Spatial
  Browsing. In: GIS and Libraries: Patrons, Maps and Spatial Information,
  Linda Smith and Myke Gluck, Eds., University of Illinois.
Leidner, J.L. (2008) Toponym Resolution in Text. Boca Raton, FL, USA:
  Universal Publishers (ISBN-10 1581123841, ISBN-13: 9781581123845).
Manguinhas, H., Martins, B. and Borbinha, J. (2008) A Geo-Temporal Web
  Gazetteer Integrating Data From Multiple Sources. In Proceedings of the
  2008 IEEE international Conference on Digital Information Management
  (November 13 - 16, 2008).
Martins, B., Borbinha, J., Pedrosa, G., Gil, J., and Freire, N. (2007)
  Geographically-aware information retrieval for collections of digitized
  historical maps. In: Proceedings of the 4th ACM Workshop on
  Geographical information Retrieval (Lisbon, Portugal, November 9,
  2007), ACM, New York, NY, 39-42.
McCurley, S.K. (2001) Geospatial mapping and navigation of the web. In:
  Proceedings of the Tenth International WWW Conference, Hong Kong, 1-5
  May, 221-229.
Mostern, R. and Johnson, I. (2008) From named place to naming event:
  creating gazetteers for history. Int. J. Geogr. Inf. Sci. 22, 10 (Oct.
  2008), 1091-1108.
Moritz, T. (1999) Geo-referencing the Natural and Cultural World, Past
  and Present: Towards Building a Distributed, Peer-Reviewed Gazetteer
  System. Presented at the Digital Gazetteer Information Exchange
  Workshop, Oct 13-14, 1999.
Nadeau, D., Turney, P.D. and Matwin, S. (2006) Unsupervised named-entity
  recognition: Generating gazetteers and resolving ambiguity. In:
  Proceedings of 19th Canadian Conference on Artificial Intelligence.
Nissim, M., Matheson, C. and Reid, J. (2004) Recognising Geographical
  Entities in Scottish Historical Documents, In: Proceedings of the
  Workshop on Geographic Information Retrieval, SIGIR 2004.
Overell, S.E. and Rüger, S. (2006) Identifying and Grounding Descriptions
  of Places, In: Proceedings of Workshop on Geographic Information
  Retrieval (GIR’06).
Pask, A. (2005) Art Historians' Use of Digital Images: A Usability test
  of ARTstor. Dissertation at University of North Carolina, Chapel Hill.
Pasley, R., Clough, P. and Sanderson, M. (2007), Geo-Tagging for
  Imprecise Regions of Different
  Sizes, In: Proceedings of Workshop on Geographic Information Retrieval
  GIR'07.
Petras, V. (2004) Statistical Analysis of Geographic and Language Clues
  in the MARC Record, Technical Report, University of California at
  Berkeley.
Popescu, A., Grefenstette, G. and Moëllic, P. A. (2008) Gazetiki:
  automatic creation of a geographical gazetteer. In: Proceedings of the
  8th ACM/IEEE-CS joint conference on Digital libraries, Pittsburgh, PA,
  USA.
Purves, R.S., Clough, P., Jones, C.B., Arampatzis, A., Bucher, B., Finch,
  D., Fu, G., Joho, H., Khirini, A.S., Vaid, S., and Yang, B. (2007), The
  Design and Implementation of SPIRIT: a Spatially-Aware Search Engine
  for Information Retrieval on the Internet, International Journal
  Geographic Information Systems (IJGIS), Volume 21(7), January 2007, pp.
  717-745.
Purves, R.S. and Clough, P. (2006) Judging spatial relevance and document
  location for Geographic Information Retrieval, extended abstract, In
  Proceedings of 4th International Conference on Geographic Information
  Science (GIScience 2006), Münster, Germany, September 2006, 159-164.
Purves, R.S. and Edwardes, A.J. (2008) Exploiting Volunteered Geographic
  Information to Describe Place. Proceedings of the GIS Research UK 16th
  Annual Conference, Lambrick, D. (ed.), 252-255.
Reid, J.S., Higgins, C., Medyckyj-Scott, D. and Robson, A. (2004) Spatial
  Data Infrastructures and Digital Libraries: Paths to Convergence, D-Lib
  Magazine, Vol. 10(5).
Sanderson, M. and Han, Y. (2007) Search Words and Geography, In:
  Proceedings of the Geographic Information Retrieval workshop at the ACM
  CIKM conference.
Erle, S., Gibson, R. and Walsh, J. (2005) Mapping Hacks: Tips and Tools
  for Electronic Cartography, O’Reilly.
Smith, D. A. and Mann, G. S. (2003) Bootstrapping toponym classifiers.
  In: Kornai, A. and Sundheim, B. (eds.) Proceedings of the HLT-NAACL
  2003 Workshop on Analysis of Geographic References, Alberta, Canada:
  ACL, 45-49.
Uryupina, O. (2003) Semi-supervised learning of geographical gazetteers
  from the internet. In: Kornai, A. and Sundheim, B. (eds.) Proceedings
  of the HLT-NAACL 2003 Workshop on Analysis of Geographic References,
  Alberta, Canada: ACL, 18-25.
Zhang, V.W., Rey, B., Stipp, E. and Jones, R. (2006) Geomodification in
  Query Rewriting. In: Proceedings of the 2006 Workshop on Geographic
  Information Retrieval, Seattle, USA, 23-27.
Zhou, G. and Su, J. (2002) Named Entity Recognition Using a HMM-based
  Chunk Tagger. In: Proceedings of the 40th Annual Meeting of the
  Association for Computational Linguistics (ACL2002), Philadelphia, July
  2002, 473-480.
Zong, W., Wu, D., Sun, A., Lim, E., and Goh, D. H. (2005) On assigning
  place names to geography related web pages. In: Proceedings of the 5th
  ACM/IEEE-CS Joint Conference on Digital Libraries (Denver, CO, USA,
  June 07 - 11, 2005). JCDL '05, ACM, New York, NY, 354-362.