Publication Disambiguation at the University of Florida: the institutional view

Page created by Randall Strickland
 
Nicholas Rejack (nrejack@ufl.edu)
University of Florida Clinical and Translational Science
Informatics and Technology
  
Problem: How do we find all the papers for our UF authors?

• Publication identifiers (the Thomson Reuters UT identifier, for example) aren’t widely used at UF.
• No single university-wide source exists for tracking publications.
• We can’t rely on researchers to enter all their publications: it is too time-consuming.
  
   	
  
Solution: Take the institutional view.

• Limit the papers just to UF authors.
• Then we can disambiguate across the university based on name. This is much easier than trying to disambiguate the entire universe of publishing.
• Take a two-pronged approach: combine automatic matching on name parts with manual disambiguation.
  
	
  
	
  
Technical information

• All UF papers published since 2008 have been harvested from Thomson Reuters.
• We update this weekly as new papers are added.
• Ingest takes place using custom Python software written by Mike Conlon.
• We download a BibTeX file containing all the papers for the week using a tuned query.
  
Query: target the institutional affiliation

•   AD=(University Florida OR Univ Florida OR UFL OR UF)
•   (AD meaning address)
•   Run the query for the preceding week. Download a BibTeX file.
•   BibTeX is a standardized cross-platform format for lists of references.
•   Cons with this approach:
      •   We’re relying on the accuracy of TR’s affiliation data.
      •   Typos will affect matching.
      •   Historical data may be incomplete or out of date.
  
@article{ISI:000319311500019,
Author = {Schroeder, Ashley and Pennington-Gray, Lori and Kaplanidou, Kiki and Zhan, Fangzi},
Title = {Destination risk perceptions among U.S. residents for London as the host city of the 2012 Summer Olympic Games},
Journal = {TOURISM MANAGEMENT},
Year = {2013},
Volume = {38},
Pages = {107-119},
Month = {OCT},
Abstract = {Risks associated with the Olympic Games have been studied; however, there is lack of research that examines prospective tourists’ perceptions…},
Publisher = {ELSEVIER SCI LTD},
Address = {THE BOULEVARD, LANGFORD LANE, KIDLINGTON, OXFORD OX5 1GB, OXON, ENGLAND},
Type = {Article},
Language = {English},
Affiliation = {Schroeder, A (Reprint Author), Univ Florida, Tourism Crisis Management Inst, Dept Tourism Recreat and Sport Management, POB 118208, Gainesville, FL 32611 USA.
   Schroeder, Ashley; Pennington-Gray, Lori, Univ Florida, Tourism Crisis Management Inst, Dept Tourism Recreat and Sport Management, Gainesville, FL 32611 USA.
   Kaplanidou, Kiki; Zhan, Fangzi, Univ Florida, Dept Tourism Recreat and Sport Management, Gainesville, FL 32611 USA.},
DOI = {10.1016/j.tourman.2013.03.001},
ISSN = {0261-5177},
Keywords = {Olympic Games; Sports tourism; Mega-events; Destination risk perceptions},
Keywords-Plus = {POLITICAL INSTABILITY; TOURISM DECISIONS; CAPE-TOWN; TERRORISM; CHOICE; SAFETY; TRAVEL; ROLES; CRIME; FEAR},
Research-Areas = {Environmental Sciences and Ecology; Social Sciences - Other Topics; Business and Economics},
Web-of-Science-Categories = {Environmental Studies; Hospitality, Leisure, Sport and Tourism; Management},
Author-Email = {alouise@hhp.ufl.edu penngray@hhp.ufl.edu kkaplanidou@hhp.ufl.edu zfz0123@ufl.edu},
Number-of-Cited-References = {69},
Times-Cited = {0},
Journal-ISO = {Tourism Manage.},
Doc-Delivery-Number = {149IF},
Unique-ID = {ISI:000319311500019},
}
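A record in this shape can be read with a small brace-matching parser. The sketch below is an illustration only and not the ingest software described in this deck; the function name `parse_bibtex_fields` is hypothetical, and it assumes every field is written as `Name = {value}` with balanced, unescaped braces.

```python
import re

# A field starts with "Name = {" and fields are separated by commas.
FIELD_START = re.compile(r"\s*([A-Za-z][\w-]*)\s*=\s*\{")
FIELD_SEP = re.compile(r"\s*,")


def parse_bibtex_fields(entry):
    """Return {field name: value} for one entry (hypothetical helper)."""
    fields = {}
    pos = entry.index(",") + 1  # skip the "@article{citekey," header
    while True:
        start = FIELD_START.match(entry, pos)
        if start is None:
            break  # no further "Name = {" found: end of the entry
        depth = 1
        i = start.end()
        # Walk forward until the brace opened by this field is closed.
        while i < len(entry) and depth > 0:
            if entry[i] == "{":
                depth += 1
            elif entry[i] == "}":
                depth -= 1
            i += 1
        if depth != 0:
            raise ValueError("unbalanced braces in field " + start.group(1))
        raw = entry[start.end():i - 1]
        fields[start.group(1)] = " ".join(raw.split())  # collapse line wraps
        sep = FIELD_SEP.match(entry, i)
        pos = sep.end() if sep else i
    return fields


SAMPLE = """@article{ ISI:000319311500019,
Author = {Schroeder, Ashley and Pennington-Gray, Lori and Kaplanidou, Kiki and Zhan, Fangzi},
Journal = {TOURISM MANAGEMENT},
Year = {2013},
Unique-ID = {ISI:000319311500019},
}"""

record = parse_bibtex_fields(SAMPLE)
authors = record["Author"].split(" and ")
print(authors)
```

Splitting the Author field on " and " gives one "Last, First" string per author, which is the input the name-matching step needs.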
  
Target the affiliation
  

Affiliation = {{Dindar, S (Reprint Author), Univ Florida, Dept Comp
\& Informat Sci \& Engn, CSE Bldg, Gainesville, FL 32611 USA.
   Dindar, Saleh; Yeo, Young In; Gao, Jianwei; Peters, Jorg, Univ
Florida, Dept Comp \& Informat Sci \& Engn, Gainesville, FL 32611
USA.
   Ford, Eric B.; Boley, Aaron C.; Nelson, Benjamin, Univ Florida,
Dept Astron, Gainesville, FL 32611 USA.
   Juric, Mario, LSST Corp, Tucson, AZ 85721 USA.
   Juric, Mario, Univ Arizona, Steward Observ, Tucson, AZ 85721
USA.}}
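One way to pick out the UF authors from a field like this is sketched below. It is a minimal illustration, not the deck's actual code: the function name `uf_authors` is hypothetical, and it assumes that each address block ends with a period at the end of a line, that names inside a block are separated by semicolons, and that the "(Reprint Author)" entry repeats an author who is also listed in full.

```python
import re


def uf_authors(affiliation, marker="Univ Florida"):
    """List authors whose address block names `marker` as the organisation."""
    text = affiliation.strip().strip("{}")
    found = []
    # Assumption: an address block ends with a period at end of line.
    for block in re.split(r"\.\s*\n", text):
        block = " ".join(block.split()).rstrip(".")
        if not block:
            continue
        people = block.split(";")
        # The last segment reads "Last, First, Organisation, Dept, ...".
        last_parts = [p.strip() for p in people[-1].split(",")]
        if len(last_parts) < 3:
            continue  # no organisation follows the name
        names = [p.strip() for p in people[:-1]]
        names.append(", ".join(last_parts[:2]))
        if last_parts[2] != marker:
            continue
        for name in names:
            # Assumption: the reprint author is also listed in full elsewhere.
            if "(Reprint Author)" in name or name in found:
                continue
            found.append(name)
    return found


SAMPLE = r"""{{Dindar, S (Reprint Author), Univ Florida, Dept Comp
\& Informat Sci \& Engn, CSE Bldg, Gainesville, FL 32611 USA.
   Dindar, Saleh; Yeo, Young In; Gao, Jianwei; Peters, Jorg, Univ
Florida, Dept Comp \& Informat Sci \& Engn, Gainesville, FL 32611
USA.
   Ford, Eric B.; Boley, Aaron C.; Nelson, Benjamin, Univ Florida,
Dept Astron, Gainesville, FL 32611 USA.
   Juric, Mario, LSST Corp, Tucson, AZ 85721 USA.
   Juric, Mario, Univ Arizona, Steward Observ, Tucson, AZ 85721
USA.}}"""

uf_names = uf_authors(SAMPLE)
print(uf_names)
```

In this sample, Mario Juric is left out because neither of his address blocks names Univ Florida as the organisation.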
Next steps

•   Find the authors by name parts. Leverage the hard work already done to create UF authors.
•   UF VIVO contains 157,901 Person objects. We need to limit our domain of discourse, so we seed our people dictionary with a SPARQL query that:
      •      Limits to foaf:Person.
      •      Limits to people with foaf:lastName. We need at least a last name to draw a conclusion.
      •      Limits to people defined as ufVivo:UFEntity (an institution-internal class, updated by the people information harvest).
      Result: 48,195 people (a reduction of 70%). Many of these are stubs without complete name parts. Well-curated VIVO profiles will have more complete name parts and as a result will attract articles.
      	
  
Matching on names

•   Seven cases:
       •    Case 0: last name only
       •    Case 1: last name, first initial
       •    Case 2: last name, first name
       •    Case 3: last name, first initial, middle initial
       •    Case 4: last name, first initial, middle name
       •    Case 5: last name, first name, middle initial
       •    Case 6: last name, first name, middle name
•   Make dictionaries for each case for people in VIVO. Given the appropriate input case (e.g., “J Johnson” is case 1), match the input from the BibTeX file.
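The case scheme above can be sketched as follows. This is one reading of "dictionaries for each case", not the actual implementation: each VIVO person is indexed under every case their name parts can support, so an input such as "J Johnson" is looked up in the case 1 dictionary. All function names and the example.org URIs are hypothetical.

```python
from collections import defaultdict


def norm(part):
    """Lower-case a name part and drop surrounding space and a trailing period."""
    return part.strip().rstrip(".").lower()


def name_case(last, first="", middle=""):
    """Return the case number (0-6) for the name parts that are present."""
    first, middle = norm(first), norm(middle)
    if not first:
        return 0
    if not middle:
        return 1 if len(first) == 1 else 2
    if len(first) == 1:
        return 3 if len(middle) == 1 else 4
    return 5 if len(middle) == 1 else 6


def case_keys(last, first="", middle=""):
    """Map each case this name can serve to its dictionary key."""
    last, first, middle = norm(last), norm(first), norm(middle)
    keys = {0: (last,)}
    if first:
        keys[1] = (last, first[0])
        if len(first) > 1:
            keys[2] = (last, first)
        if middle:
            keys[3] = (last, first[0], middle[0])
            if len(middle) > 1:
                keys[4] = (last, first[0], middle)
            if len(first) > 1:
                keys[5] = (last, first, middle[0])
                if len(middle) > 1:
                    keys[6] = (last, first, middle)
    return keys


def build_dictionaries(people):
    """people: iterable of (uri, last, first, middle) tuples."""
    dicts = {case: defaultdict(set) for case in range(7)}
    for uri, last, first, middle in people:
        for case, key in case_keys(last, first, middle).items():
            dicts[case][key].add(uri)
    return dicts


def lookup(dicts, last, first="", middle=""):
    """Return (case, sorted candidate URIs) for an author name from a paper."""
    case = name_case(last, first, middle)
    key = case_keys(last, first, middle)[case]
    return case, sorted(dicts[case].get(key, ()))


PEOPLE = [
    ("http://example.org/individual/n1", "Johnson", "James", "Robert"),
    ("http://example.org/individual/n2", "Johnson", "Jane", ""),
    ("http://example.org/individual/n3", "Smith", "Ann", ""),
]
dicts = build_dictionaries(PEOPLE)
print(lookup(dicts, "Johnson", "J"))           # two candidates
print(lookup(dicts, "Johnson", "James", "R"))  # one candidate
```

The more name parts a profile carries, the more dictionaries it appears in, which is why well-curated profiles attract more unambiguous matches.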
  
Matching procedure

• For each author with a UF affiliation: how many people at UF have this name?
      0: Add the author, and add them to a notify list for the final report.
      1: Get the author’s URI and attach the paper.
      2 or more: Move to the manual disambiguation list.
Other cases:
• Corporate author: create new.
• Non-UF author:
      •   Create a stub (simple Person object with name parts and no affiliation).
      •   We can’t conclude that we know more about them than this.
      •   Don’t try to match to other stubs: there is no reason to conclude that the J Smith who wrote paper X is the same J Smith who wrote paper Y. (Open world assumption)
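The branching above can be sketched as a single function. This is an illustrative outline under assumptions, not the ingest software itself: `IngestResult`, `new_uri`, `process_author`, and the example.org URIs are all hypothetical, and the candidate list is assumed to come from a prior name lookup.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class IngestResult:
    attached: list = field(default_factory=list)      # (paper, person URI)
    notify: list = field(default_factory=list)        # newly created UF people
    disambiguate: list = field(default_factory=list)  # (paper, name, candidates)
    stubs: list = field(default_factory=list)         # (paper, name, stub URI)


_ids = count(1)


def new_uri(kind):
    """Mint a placeholder URI (hypothetical scheme)."""
    return f"http://example.org/individual/{kind}{next(_ids)}"


def process_author(paper, name, is_uf, candidates, result):
    """Apply the 0 / 1 / 2-or-more rule to one author of one paper."""
    if not is_uf:
        # Open world assumption: never reuse an earlier stub of the same name.
        stub = new_uri("stub")
        result.stubs.append((paper, name, stub))
        return stub
    if len(candidates) == 0:
        uri = new_uri("person")
        result.notify.append((name, uri))
        result.attached.append((paper, uri))
        return uri
    if len(candidates) == 1:
        result.attached.append((paper, candidates[0]))
        return candidates[0]
    # Two or more UF people share these name parts: leave it for a human.
    result.disambiguate.append((paper, name, list(candidates)))
    return None


result = IngestResult()
stub_a = process_author("paper-x", "Smith J", False, [], result)
stub_b = process_author("paper-y", "Smith J", False, [], result)
matched = process_author("paper-x", "Rejack N", True, ["uri-1"], result)
unclear = process_author("paper-x", "Tang Y", True, ["uri-2", "uri-3"], result)
created = process_author("paper-x", "Newcomer A", True, [], result)
```

The two "Smith J" calls produce two different stubs, which is the open world assumption in practice.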
  
The numbers: for a recent average week of publications…

117 papers
3 were already in VIVO (2.6%)
2 had no UF authors (incorrectly identified as UF papers by the query) (1.7%)
112 were created as new papers in UF VIVO (95.7%)

Conclusion: We can identify UF papers with a high degree of accuracy (> 98%).
Unknown: How many papers are we missing? We’d need to know the total output for the week, not just what Thomson Reuters indexes.
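The percentages above can be re-derived with a few lines of arithmetic. Reading the "> 98%" figure as the share of retrieved papers that really had a UF author (115 of 117) is an interpretation, not something the slide states explicitly.

```python
def share(part, total):
    """Percentage of total, rounded to one decimal place."""
    return round(100 * part / total, 1)


total_papers = 117
already_in_vivo = 3
no_uf_authors = 2
created_new = 112

# The three groups account for every retrieved paper.
assert already_in_vivo + no_uf_authors + created_new == total_papers

print(share(already_in_vivo, total_papers))  # 2.6
print(share(no_uf_authors, total_papers))    # 1.7
print(share(created_new, total_papers))      # 95.7

# Papers correctly identified as UF papers: everything except the 2 false hits.
accuracy = share(total_papers - no_uf_authors, total_papers)
print(accuracy)  # 98.3
```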
  
	
  
	
  
The numbers: for a recent average week of publications…

189 were found in VIVO as UF authors (30.3%)
29 were made as new UF people (4.7%)
36 needed to be disambiguated (5.8%)
368 were non-UF authors created as stubs (59.2%)
  
Manual disambiguation process

• When a paper matches more than one UF author based on name parts, randomly assign it to one of the authors.
• Then, in VIVO, examine the paper and the authors and determine which it belongs to based on content.
• The correction process takes place directly in VIVO; no need to use external tools.
• Sometimes the random guess is right, which means less work for us.
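The provisional assignment and the review note for the person doing the manual check could be sketched like this. The function name `provisional_assignment`, the report wording, and the example.org URIs are assumptions for illustration; only the random choice among candidates comes from the process described above.

```python
import random


def provisional_assignment(paper_uri, author, candidates, rng=random):
    """Pick one candidate at random and build a note listing all of them."""
    chosen = rng.choice(candidates)
    lines = [
        f"The publication at {paper_uri} has one or more authors in question",
        f"     {author}   :",
    ]
    lines += [f"         {uri}" for uri in candidates]
    return chosen, "\n".join(lines)


CANDIDATES = [
    "http://example.org/individual/n1",
    "http://example.org/individual/n2",
    "http://example.org/individual/n3",
]
chosen, report = provisional_assignment(
    "http://example.org/individual/paper1",
    "Tang Y",
    CANDIDATES,
    rng=random.Random(0),  # seeded only to make the example repeatable
)
print(report)
```

Listing every candidate in the note, not just the one chosen, gives the reviewer the full set to compare against the paper's content.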
  
1 of 3 cases requiring disambiguation:
Limited name information: author matches several UF people
  
                                         	
  
The publication at http://vivo.ufl.edu/individual/n4503510310 has one or
more authors in question
     Tang Y   :
         http://vivo.ufl.edu/individual/n11618
         http://vivo.ufl.edu/individual/n26022
         http://vivo.ufl.edu/individual/n497554421
         http://vivo.ufl.edu/individual/n2060082549
         http://vivo.ufl.edu/individual/n9627415712
         http://vivo.ufl.edu/individual/n1780
  
                                         	
  
• Fix: examine the content of the paper.
     •   Determine which author it belongs to based on the paper’s subject matter (e.g., if the paper is about molecular genetics and one of the candidate authors is in astronomy, it’s not their paper).
• Pitfalls: Cross-disciplinary papers can be tricky. Also, people in statistics are often credited on papers outside their field for contributing stats work.
  
2 of 3 cases requiring disambiguation:
The same person, multiple profiles

• Fix: Merge the profiles. Eventually we’ll clean up the duplicates and converge on a coherent set of UF authors.
• Pitfalls: There are a lot of UF stubs, so this is going to take a while.
  	
  
3 of 3 cases requiring disambiguation:
Different person, same name

• Fix: Disambiguate based on content. Luckily there are not many examples of this at UF (e.g., J Johnson). Again, it comes down to how complete our naming information is.
  
Workload

• It takes one student assistant 2 hours a week to run the publication ingest software and update the manual disambiguations. This process keeps us up to date going forward.
• Future work: go backwards in time. We expect messier and messier affiliation data the further back we go.
  