WP3: entity-fishing service
Presented by Tanti Kristanti (INRIA – Paris)
For the HIRMEOS Final Workshop, 2 June 2019, Marseille, France
entity-fishing (1)
• An open source tool composed of services that automate entity recognition and disambiguation against Wikidata 1
• Not restricted to particular domains, classes of entities, or usages 2
• Initially developed within the FP7 CENDARI (Collaborative European Digital Archive Infrastructure) project 3
• Development continued within the H2020 HIRMEOS (High Integration of Research Monographs in the European Open Science Infrastructure) project to enrich open access digital monographs published on five digital platforms 4
• Deployed as part of the Huma-Num national infrastructure in France
• A stable online service within DARIAH-EU, the European digital research infrastructure for the arts and humanities
• Distributed under the Apache 2.0 license

1 Science-Miner, Entity disambiguation, http://science-miner.com/entity-disambiguation/ (accessed 6 May 2019)
2 Patrice Lopez, Overview: Motivation, 2019, https://nerd.readthedocs.io/en/latest/overview.html (accessed 6 May 2019)
3 Patrice Lopez, Alexander Meyer, Laurent Romary. CENDARI Virtual Research Environment & Named Entity Recognition techniques. Grenzen überschreiten – Digitale Geisteswissenschaft heute und morgen, Feb 2014, Berlin, Germany, https://hal.inria.fr/hal-01577975 (accessed 6 May 2019)
4 OAPEN, End user services: Named Entity Recognition and Disambiguation, http://www.oapen.org/content/services-end-user-services (accessed 6 May 2019)
entity-fishing (2)
• The current version (0.0.3) supports English, French, German, Italian and Spanish
• Based on machine learning techniques (Gradient Tree Boosting, CRF, word and entity embeddings)
• For English and French, CRF-based Named Entity Recognition with Grobid-NER is used in combination with the disambiguation
• Machine learning relies on the SMILE ML library
• The knowledge base contains:
  • 37 million entities and 154 million statements from Wikidata
  • 15 million word and entity embeddings
• Project repository: https://github.com/kermitt2/entity-fishing
• Demo: http://nerd.huma-num.fr/nerd/
• Documentation: https://nerd.readthedocs.io/en/latest/
How to use the entity-fishing services?
• Through a REST API: query parameters are sent to the service and the service returns a response (a minimal call example follows below)
• The service can be applied to 4 types of input 1:
  • text
  • search query
  • weighted vector of terms
  • PDF document
• REST queries:
  • POST /disambiguate
  • POST /language
  • POST /segmentation
  • POST /customisations
  • GET /kb/concept/{id}
  • GET /kb/term/{term}
  • GET /language?text={text}
  • GET /segmentation?text={text}
  • GET /customisations
  • GET /customisation/{name}
  • PUT /customisation/{profile}
  • DELETE /customisation/{profile}

1 Patrice Lopez, entity-fishing REST API, 2019, https://nerd.readthedocs.io/en/latest/restAPI.html (accessed 13 May 2019)
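As an illustration, here is a minimal Python sketch of a text disambiguation call. The exact endpoint path on the demo server, the "query" form field and the response field names (entities, rawName, wikidataId) are assumptions to be checked against the REST API documentation cited above; the sample text is purely illustrative.

```python
# Minimal sketch of a text disambiguation request to the entity-fishing demo service.
import json
import requests

API_URL = "http://nerd.huma-num.fr/nerd/service/disambiguate"  # assumed path on the demo instance

query = {
    "text": "Austria invaded and fought the Serbian army at the Battle of Cer.",
    "language": {"lang": "en"},  # optional: the service can also detect the language
}

# The query is sent as a JSON string in a multipart form field named "query".
response = requests.post(API_URL, files={"query": (None, json.dumps(query))})
response.raise_for_status()

for entity in response.json().get("entities", []):
    print(entity.get("rawName"), entity.get("wikidataId"))
```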
WP3 Works
• Deployment and integration of the entity-fishing services in the partners' open access platforms
• The approach: reusability and code sharing
• Data processed:
  • 4,000 books in English and French from OpenEdition
  • 2,000 titles in English and German from OAPEN
  • 162 books in English from Ubiquity Press
  • 765 books (606 in German, 159 in English) from UGOE
• Results (entity-fishing clients in Java, Python and PHP) released under the Apache 2.0 license:
  • entity-fishing-client-python: Python client for the entity-fishing service
  • entity-fishing-client-php-oe: PHP client for the entity-fishing service by OpenEdition
  • entity-fishing-client-php: PHP client for the entity-fishing service by EKT
  • entity-fishing-client-oapen: integration scripts with the OAPEN infrastructure by OAPEN
• For validation measures:
  • a CC-BY gold standard HIRMEOS corpus is used
  • it contains a set of thousands of manually corrected Named Entity Recognition and Disambiguation entities with Wikidata identifiers, not present in any of the already existing corpora (e.g. IITB, AQUAINT)

1 High Integration of Research Monographs in the European Open Science Infrastructure (HIRMEOS), WP3 NERD Work Package Validation (accessed 6 May 2019)
2 Hirmeos Github, https://github.com/Hirmeos
The OpenEdition Books publishing platform
• An entity-fishing PHP client was created and integrated into the core that processes data for enrichments
• Entities are fetched by requesting the entity-fishing API services for each chapter
• Entities are classified as PERSON and LOCATION
• The entity results are aggregated at book level (see the sketch after this list)
• Location and Person entities at book and chapter level are stored in the Solr index
• Two facets, for Persons and Locations, are added to the front-end interface
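OpenEdition's actual client is written in PHP; the following Python sketch only illustrates the book-level aggregation step, assuming each entity returned by the service carries "type", "rawName" and "wikidataId" fields.

```python
# Sketch: aggregate chapter-level PERSON/LOCATION entities to book level before indexing.
from collections import defaultdict

def aggregate_book_entities(chapter_results):
    """chapter_results: one list of entity dicts per chapter, as returned by the service."""
    book_level = defaultdict(set)
    for entities in chapter_results:
        for entity in entities:
            if entity.get("type") in ("PERSON", "LOCATION"):
                book_level[entity["type"]].add((entity["rawName"], entity.get("wikidataId")))
    return book_level
```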
UGOE-SUB
• entity-fishing is integrated into the publishing workflow of Göttingen University Press (GUP) to enable the semi-automatic indexing of its monographs
• Titles, abstracts and metadata of the monographs are processed by the entity-fishing API to identify and categorize named entities
• The named entities are classified into different classes: PERSON, LOCATION and ORGANIZATION
• The number of occurrences of each individual entity is recorded (see the sketch after this list)
• The indexed data are displayed as facets made available to users as « Keywords », allowing users to quickly find monographs by the entities that appear in them
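A small, hypothetical Python sketch of the occurrence counting behind the keyword facets; the class labels and entity fields are assumptions based on the slide above, not UGOE's actual code.

```python
# Sketch: count how often each entity occurs, grouped by entity class.
from collections import Counter

def count_entities_by_class(entities, classes=("PERSON", "LOCATION", "ORGANIZATION")):
    counts = {cls: Counter() for cls in classes}
    for entity in entities:
        cls = entity.get("type")
        if cls in counts:
            counts[cls][entity["rawName"]] += 1
    return counts
```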
EKT / National Documentation Center
• The current release of OMP does not support any annotation service, so EKT has extended OMP with entity-fishing support
• The entity-fishing API service is integrated into the Open Monograph Press (OMP) monograph landing page to annotate the abstract
• Two phases of implementation:
  • Create a PHP client that acts as a wrapper above the entity-fishing service, hiding its complexity from the user:
    • the complexity of the HTTP protocol is hidden
    • the JSON result of the entity-fishing service is wrapped into high-level class objects (see the sketch after this list)
  • Integrate the client into the OMP software
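EKT's wrapper is implemented in PHP inside OMP; the sketch below, in Python for consistency with the other examples, only illustrates the design idea of hiding the HTTP layer and mapping the JSON response onto plain objects. The endpoint path, form field and response field names are assumptions.

```python
# Sketch of a thin client wrapper: callers deal with Entity objects, not HTTP or raw JSON.
import json
import requests

class Entity:
    def __init__(self, raw):
        self.raw_name = raw.get("rawName")
        self.wikidata_id = raw.get("wikidataId")
        self.offset = (raw.get("offsetStart"), raw.get("offsetEnd"))

class EntityFishingClient:
    def __init__(self, base_url="http://nerd.huma-num.fr/nerd/service"):
        self.base_url = base_url

    def disambiguate_text(self, text, lang="en"):
        query = {"text": text, "language": {"lang": lang}}
        resp = requests.post(f"{self.base_url}/disambiguate",
                             files={"query": (None, json.dumps(query))})
        resp.raise_for_status()
        return [Entity(e) for e in resp.json().get("entities", [])]
```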
Ubiquity Press (UB)
• Developed an internal service that receives notifications from the existing company platform when a new article has been published, POSTs its content to the entity-fishing API to retrieve all the entities, and stores them locally
• The entities are shown to the reader as clickable links referring to the corresponding Wikipedia entry (see the sketch after this list)
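A hypothetical sketch of how a stored entity could be turned into a clickable link; whether the stored record keeps a Wikipedia page id ("wikipediaExternalRef") or only a Wikidata id is an assumption, so both paths are shown. This is not Ubiquity Press's actual code.

```python
# Sketch: build a link for an entity stored locally after disambiguation.
def entity_link(entity):
    page_id = entity.get("wikipediaExternalRef")  # assumed Wikipedia page id, if present
    if page_id is not None:
        return f"https://en.wikipedia.org/?curid={page_id}"
    if entity.get("wikidataId"):
        return f"https://www.wikidata.org/wiki/{entity['wikidataId']}"
    return None
```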
OAPEN
• Scripts were created to (see the sketch after this list):
  • call the entity-fishing service with 1) the path to the PDF and 2) the API URL as arguments
  • store the entity-fishing response locally
  • combine the entity-fishing results with the unique identifier of the book or chapter in the OAPEN Library
  • export the database to CSV
• OAPEN plans to make the data available as a CC0 licensed file, which will be published on the OAPEN Library metadata page
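A minimal Python sketch of such a script, under the assumption that the PDF is sent as a multipart "file" part alongside the "query" JSON and that the book identifier is passed on the command line; argument order, field names and CSV columns are illustrative, not OAPEN's actual code.

```python
# Sketch: disambiguate a PDF, keep the raw response next to the book identifier, export to CSV.
import csv
import json
import sys
import requests

def disambiguate_pdf(pdf_path, api_url):
    query = {"language": {"lang": "en"}}  # assumption: language known in advance
    with open(pdf_path, "rb") as pdf:
        resp = requests.post(api_url,
                             files={"query": (None, json.dumps(query)), "file": pdf})
    resp.raise_for_status()
    return resp.json()

def export_csv(book_id, result, out_path):
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["book_id", "rawName", "wikidataId"])
        for entity in result.get("entities", []):
            writer.writerow([book_id, entity.get("rawName"), entity.get("wikidataId")])

if __name__ == "__main__":
    pdf_path, api_url, book_id = sys.argv[1], sys.argv[2], sys.argv[3]
    result = disambiguate_pdf(pdf_path, api_url)
    export_csv(book_id, result, f"{book_id}.csv")
```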