Publication Disambiguation at the University of Florida: the institutional view
Nicholas Rejack – nrejack@ufl.edu
University of Florida Clinical and Translational Science Informatics and Technology
Problem: How do we find all the papers for our UF authors?
• Publication identifiers (the Thomson Reuters UT identifier, for example) aren’t widely used at UF.
• No single university-wide source exists for tracking publications.
• We can’t rely on researchers to enter all their publications: it is too time-consuming.
Solution: Take the institutional view.
• Limit the papers to those with UF authors.
• Then we can disambiguate across the university based on name. This is much easier than trying to disambiguate the entire universe of publishing.
• Take a two-pronged approach: combine automatic matching on name parts with manual disambiguation.
Technical information
• All UF papers published since 2008 have been harvested from Thomson Reuters.
• We update this weekly as new papers are added.
• Ingest takes place using custom Python software written by Mike Conlon.
• We download a BibTeX file containing all the papers for the week using a tuned query.
Query: target the institutional affiliation
• AD=(University Florida OR Univ Florida OR UFL OR UF)
• (AD stands for address.)
• Run the query for the preceding week. Download a BibTeX file.
• BibTeX is a standardized cross-platform format for lists of references.
• Cons of this approach:
  • We’re relying on the accuracy of TR’s affiliation data.
  • Typos will affect matching.
  • Historical data may be incomplete or out of date.
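As a rough illustration of the weekly query step, the sketch below assembles the address query from a list of affiliation spellings and computes a date window for "the preceding week". The function names are hypothetical, and the window definition (the seven days ending yesterday) is an assumption. The slides do not say how the date range is passed to Thomson Reuters, so that part is left out.

```python
from datetime import date, timedelta

# Affiliation spellings taken from the query shown on the slide.
UF_VARIANTS = ["University Florida", "Univ Florida", "UFL", "UF"]


def build_address_query(variants):
    """Join affiliation spellings into an AD=(... OR ...) address query."""
    return "AD=(" + " OR ".join(variants) + ")"


def preceding_week(today):
    """Return (start, end) for the seven days ending yesterday.

    Assumption: this is one plausible reading of "the preceding week";
    the actual ingest software may define the window differently.
    """
    end = today - timedelta(days=1)
    start = end - timedelta(days=6)
    return start, end


if __name__ == "__main__":
    print(build_address_query(UF_VARIANTS))
    print(preceding_week(date.today()))
```

Keeping the spellings in a list makes it easy to add a variant when a missed affiliation string turns up.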
@article{ ISI:000319311500019,
  Author = {Schroeder, Ashley and Pennington-Gray, Lori and Kaplanidou, Kiki and Zhan, Fangzi},
  Title = {Destination risk perceptions among U.S. residents for London as the host city of the 2012 Summer Olympic Games},
  Journal = {TOURISM MANAGEMENT},
  Year = {2013},
  Volume = {38},
  Pages = {107-119},
  Month = {OCT},
  Abstract = {Risks associated with the Olympic Games have been studied; however, there is lack of research that examines prospective tourists' perceptions…},
  Publisher = {ELSEVIER SCI LTD},
  Address = {THE BOULEVARD, LANGFORD LANE, KIDLINGTON, OXFORD OX5 1GB, OXON, ENGLAND},
  Type = {Article},
  Language = {English},
  Affiliation = {Schroeder, A (Reprint Author), Univ Florida, Tourism Crisis Management Inst, Dept Tourism Recreat and Sport Management, POB 118208, Gainesville, FL 32611 USA.
    Schroeder, Ashley; Pennington-Gray, Lori, Univ Florida, Tourism Crisis Management Inst, Dept Tourism Recreat and Sport Management, Gainesville, FL 32611 USA.
    Kaplanidou, Kiki; Zhan, Fangzi, Univ Florida, Dept Tourism Recreat and Sport Management, Gainesville, FL 32611 USA.},
  DOI = {10.1016/j.tourman.2013.03.001},
  ISSN = {0261-5177},
  Keywords = {Olympic Games; Sports tourism; Mega-events; Destination risk perceptions},
  Keywords-Plus = {POLITICAL INSTABILITY; TOURISM DECISIONS; CAPE-TOWN; TERRORISM; CHOICE; SAFETY; TRAVEL; ROLES; CRIME; FEAR},
  Research-Areas = {Environmental Sciences and Ecology; Social Sciences - Other Topics; Business and Economics},
  Web-of-Science-Categories = {Environmental Studies; Hospitality, Leisure, Sport and Tourism; Management},
  Author-Email = {alouise@hhp.ufl.edu penngray@hhp.ufl.edu kkaplanidou@hhp.ufl.edu zfz0123@ufl.edu},
  Number-of-Cited-References = {69},
  Times-Cited = {0},
  Journal-ISO = {Tourism Manage.},
  Doc-Delivery-Number = {149IF},
  Unique-ID = {ISI:000319311500019},
}
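A minimal sketch of pulling fields out of one record like the one above, assuming every field is written as Name = {value} with balanced braces. The helper parse_fields is hypothetical and is not the UF ingest code; a full BibTeX parser would also handle quoted values, string macros, and escaped braces.

```python
import re

# Matches the start of a field such as "Author = {" or "Keywords-Plus = {".
FIELD_START = re.compile(r"([A-Za-z][\w-]*)\s*=\s*\{")


def parse_fields(entry):
    """Return a dict of lower-cased field names to values for one entry.

    Assumption: values are brace-delimited and braces are balanced.
    """
    fields = {}
    pos = 0
    while True:
        match = FIELD_START.search(entry, pos)
        if match is None:
            break
        depth = 1
        i = match.end()
        # Walk forward until the brace opened by this field is closed.
        while i < len(entry) and depth > 0:
            if entry[i] == "{":
                depth += 1
            elif entry[i] == "}":
                depth -= 1
            i += 1
        value = entry[match.end():i - 1]
        # Collapse line breaks and runs of spaces inside the value.
        fields[match.group(1).lower()] = " ".join(value.split())
        pos = i
    return fields


if __name__ == "__main__":
    sample = "@article{ ISI:1, Author = {Schroeder, Ashley and Zhan, Fangzi}, Year = {2013}, }"
    record = parse_fields(sample)
    print(record["author"].split(" and "))
```

Splitting the Author value on " and " gives one "Last, First" string per author, which is the form the name matching works from.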
Target the affiliation
Affiliation = {{Dindar, S (Reprint Author), Univ Florida, Dept Comp \& Informat Sci \& Engn, CSE Bldg, Gainesville, FL 32611 USA.
  Dindar, Saleh; Yeo, Young In; Gao, Jianwei; Peters, Jorg, Univ Florida, Dept Comp \& Informat Sci \& Engn, Gainesville, FL 32611 USA.
  Ford, Eric B.; Boley, Aaron C.; Nelson, Benjamin, Univ Florida, Dept Astron, Gainesville, FL 32611 USA.
  Juric, Mario, LSST Corp, Tucson, AZ 85721 USA.
  Juric, Mario, Univ Arizona, Steward Observ, Tucson, AZ 85721 USA.}}
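One way to read UF authors out of an Affiliation field like this is sketched below. It rests on assumptions drawn from the two sample records only: address groups are separated by a period followed by whitespace, authors within a group are separated by semicolons, and each author is written as "Last, First" with no extra comma (a suffix such as "Jr." would break it). The function name and the marker list are hypothetical.

```python
import re


def uf_authors(affiliation, markers=("Univ Florida",)):
    """Return author names whose address group mentions a UF marker."""
    found = []
    text = affiliation.strip("{} \n")
    for segment in re.split(r"\.\s+", text):
        segment = segment.rstrip(".").strip()
        if not segment:
            continue
        # All pieces but the last are bare author names; the last piece
        # holds the final author followed by the shared address.
        *leading, last_piece = [p.strip() for p in segment.split(";")]
        parts = [p.strip() for p in last_piece.split(",")]
        if len(parts) < 3:
            continue  # no address after the author name
        final_author = ", ".join(parts[:2])
        address = ", ".join(parts[2:])
        if not any(marker in address for marker in markers):
            continue
        for name in leading + [final_author]:
            name = name.replace("(Reprint Author)", "").strip()
            if name not in found:
                found.append(name)
    return found


if __name__ == "__main__":
    sample = (
        "{{Dindar, S (Reprint Author), Univ Florida, Dept Astron, "
        "Gainesville, FL 32611 USA. "
        "Ford, Eric B.; Nelson, Benjamin, Univ Florida, Dept Astron, "
        "Gainesville, FL 32611 USA. "
        "Juric, Mario, Univ Arizona, Steward Observ, Tucson, AZ 85721 USA.}}"
    )
    print(uf_authors(sample))
```

The reprint-author line often carries only an initial ("Dindar, S") while a later group has the fuller form ("Dindar, Saleh"), so the same person can appear twice in the result and still needs reconciling.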
Next steps
• Find the authors by name parts. Leverage the hard work done to create UF authors.
• UF VIVO contains 157,901 Person objects. We need to limit our domain of discourse, so we seed our people dictionary with a SPARQL query that:
  • Limits to foaf:Person.
  • Limits to people with foaf:lastName. We need at least a last name to draw a conclusion.
  • Limits to people defined as ufVivo:UFEntity (an institution-internal class, updated by the people information harvest).
• Result: 48,195 people (a reduction of about 70%). Many of these are stubs without complete name parts. Well-curated VIVO profiles will have more complete name parts and as a result will attract articles.
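The three limits above could be expressed roughly as the query below, held as a Python string. This is a sketch, not the query UF runs: the ufVivo and core namespace URIs, the use of foaf:firstName and core:middleName for the optional name parts, and the helper people_from_results are all assumptions. The helper reads the standard SPARQL JSON results layout (results, then bindings, then value) and does not contact any endpoint.

```python
# Assumed prefixes; only foaf:Person, foaf:lastName and ufVivo:UFEntity
# are named on the slide.
SEED_QUERY = """
PREFIX rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf:   <http://xmlns.com/foaf/0.1/>
PREFIX core:   <http://vivoweb.org/ontology/core#>
PREFIX ufVivo: <http://vivo.ufl.edu/ontology/vivo-ufl/>

SELECT ?person ?last ?first ?middle
WHERE {
  ?person rdf:type foaf:Person .
  ?person rdf:type ufVivo:UFEntity .
  ?person foaf:lastName ?last .
  OPTIONAL { ?person foaf:firstName ?first . }
  OPTIONAL { ?person core:middleName ?middle . }
}
"""


def binding_value(row, name):
    """Return the value bound to a variable, or '' if it is unbound."""
    return row.get(name, {}).get("value", "")


def people_from_results(results):
    """Map person URI -> (last, first, middle) from SPARQL JSON results."""
    people = {}
    for row in results["results"]["bindings"]:
        uri = binding_value(row, "person")
        people[uri] = (
            binding_value(row, "last"),
            binding_value(row, "first"),
            binding_value(row, "middle"),
        )
    return people
```

Stubs show up here as entries whose first and middle parts are empty strings, which is why they can only ever match on last name.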
Matching on names
• Seven cases:
  • Case 0: last name only
  • Case 1: last name, first initial
  • Case 2: last name, first name
  • Case 3: last name, first initial, middle initial
  • Case 4: last name, first initial, middle name
  • Case 5: last name, first name, middle initial
  • Case 6: last name, first name, middle name
• Make dictionaries for each case for people in VIVO. Given the appropriate input case (e.g., “J Johnson” is case 1), match the input from the BibTeX file.
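The case scheme can be sketched as follows. This is one possible reading, not the actual ingest code: it assumes a name part of length one (after stripping periods) is an initial, that keys are lower-cased, and that each VIVO person is indexed under every case their name parts can support. All function names are hypothetical.

```python
from collections import defaultdict


def name_case(last, first="", middle=""):
    """Return the case number (0-6) for the given name parts."""
    first, middle = first.strip(". "), middle.strip(". ")
    if not first:
        return 0
    if not middle:
        return 1 if len(first) == 1 else 2
    if len(first) == 1:
        return 3 if len(middle) == 1 else 4
    return 5 if len(middle) == 1 else 6


def case_keys(last, first="", middle=""):
    """Return {case: key} for every case these name parts can support."""
    last = last.strip().lower()
    first = first.strip(". ").lower()
    middle = middle.strip(". ").lower()
    keys = {0: (last,)}
    if first:
        keys[1] = (last, first[0])
        if len(first) > 1:
            keys[2] = (last, first)
    if first and middle:
        keys[3] = (last, first[0], middle[0])
        if len(middle) > 1:
            keys[4] = (last, first[0], middle)
        if len(first) > 1:
            keys[5] = (last, first, middle[0])
        if len(first) > 1 and len(middle) > 1:
            keys[6] = (last, first, middle)
    return keys


def build_dictionaries(people):
    """people maps URI -> (last, first, middle); returns one dict per case."""
    dictionaries = {case: defaultdict(list) for case in range(7)}
    for uri, (last, first, middle) in people.items():
        for case, key in case_keys(last, first, middle).items():
            dictionaries[case][key].append(uri)
    return dictionaries


def lookup(dictionaries, last, first="", middle=""):
    """Return URIs of VIVO people matching the input at its own case."""
    case = name_case(last, first, middle)
    key = case_keys(last, first, middle)[case]
    return dictionaries[case].get(key, [])
```

Under these assumptions an input of "J Johnson" is looked up only in the case 1 dictionary, so it matches every Johnson whose first name starts with J, while a stub holding only a last name is reachable only through case 0.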
Matching procedure
• For each author with a UF affiliation: how many people at UF have this name?
  • 0: Add the author, and add them to a notify list for the final report.
  • 1: Get the author’s URI and attach the paper.
  • 2 or more: Move to the manual disambiguation list.
• Other cases:
  • Corporate author: create new.
  • Non-UF author:
    • Create a stub (a simple Person object with name parts and no affiliation).
    • We can’t conclude that we know more about them than this.
    • Don’t try to match to other stubs: there is no reason to conclude that the J Smith who wrote paper X is the same J Smith who wrote paper Y. (Open world assumption)
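The branching above can be written as a small routing function. The action labels and the function name are made up for illustration, and checking the corporate-author case before the UF test is an assumption about ordering that the slide does not specify.

```python
def route_author(is_uf, is_corporate, matching_uris):
    """Return (action, uri) for one author on a paper.

    matching_uris is the list of UF people in VIVO whose name parts
    match this author; it is ignored for corporate and non-UF authors.
    """
    if is_corporate:
        return ("create_corporate_author", None)
    if not is_uf:
        # Open world assumption: always a fresh stub, never merged
        # with an existing stub of the same name.
        return ("create_stub", None)
    if len(matching_uris) == 0:
        return ("create_uf_person_and_notify", None)
    if len(matching_uris) == 1:
        return ("attach_paper", matching_uris[0])
    return ("manual_disambiguation", None)
```

Only the single-match branch returns a URI, since it is the only case where the software can attach the paper with no human review.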
The numbers: for a recent average week of publications…
117 papers:
• 3 were already in VIVO (2.6%)
• 2 had no UF authors (incorrectly identified as UF papers by the query) (1.7%)
• 112 were created as new papers in UF VIVO (95.7%)
Conclusion: We can identify UF papers with a high degree of accuracy (> 98%: 115 of the 117 papers returned had UF authors).
Unknown: How many papers are we missing? We’d need to know the total output for the week, not just what Thomson Reuters indexes.
The numbers: for a recent average week of publications…
Author entries (622 in total, the sum of the groups below):
• 189 were found in VIVO as UF authors (30.4%)
• 29 were made as new UF people (4.7%)
• 36 needed to be disambiguated (5.8%)
• 368 were non-UF authors created as stubs (59.2%)
Manual disambiguation process
• When a paper matches more than one UF author based on name parts, randomly assign it to one of the authors.
• Then, in VIVO, examine the paper and the authors and determine which it belongs to based on content.
• The correction process takes place directly in VIVO: no need to use external tools.
• Sometimes the random guess is right, which means less work for us.
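The random provisional assignment might look like the sketch below. The function name and the shape of the returned record are assumptions; the point is that the full candidate list is kept alongside the guess so the reviewer can see the alternatives.

```python
import random


def provisional_assignment(paper_uri, author_name, candidate_uris, rng=random):
    """Pick one candidate at random and record the rest for review."""
    chosen = rng.choice(candidate_uris)
    return {
        "paper": paper_uri,
        "author": author_name,
        "assigned_to": chosen,
        "candidates": list(candidate_uris),
    }
```

Passing a seeded random.Random instance as rng makes a run repeatable, which can help when re-checking a weekly report.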
1 of 3 cases requiring disambiguation: Limited name information: author matches several UF people
The publication at http://vivo.ufl.edu/individual/n4503510310 has one or more authors in question
Tang Y:
http://vivo.ufl.edu/individual/n11618
http://vivo.ufl.edu/individual/n26022
http://vivo.ufl.edu/individual/n497554421
http://vivo.ufl.edu/individual/n2060082549
http://vivo.ufl.edu/individual/n9627415712
http://vivo.ufl.edu/individual/n1780
1 of 3 cases requiring disambiguation: Limited name information: author matches several UF people
• Fix: examine the content of the paper.
• Determine which author it belongs to based on the paper’s subject matter (e.g., if the paper is about molecular genetics and one of the candidate authors is in astronomy, it’s not their paper).
• Pitfalls: Cross-disciplinary papers can be tricky. Also, people in statistics are often listed as authors on papers outside their field for contributing stats work.
2 of 3 cases requiring disambiguation: The same person, multiple profiles
• Fix: Merge the profiles. Eventually we’ll clean up the duplicates and converge on a coherent set of UF authors.
• Pitfalls: There are a lot of UF stubs, so this is going to take a while.
3 of 3 cases requiring disambiguation: Different person, same name.
• Fix: Disambiguate based on content. Luckily, there are not many examples of this at UF (J Johnson is one). Again, it comes down to how complete our naming information is.
Work load
• It takes one student assistant 2 hours a week to run the publication ingest software and update the manual disambiguations. This process keeps us up to date going forward.
• Future work: go backwards in time. We expect messier and messier affiliation data the further back we go.