Tutorial: Experimenting IR/NLP with Terrier
Tutorial: Experimenting IR/NLP with Terrier
Parth Gupta (pgupta@dsic.upv.es)
Technical University of Valencia, Spain
References
• References used extensively in this tutorial:
◦ “Tutorial: Large-scale Information Retrieval Experimentation with Terrier” at CIKM, 2011.
◦ Documentation for Terrier 3.5: http://terrier.org/docs/v3.5/
Terrier IR Platform
• Efficient - MapReduce support, very fast indexing and retrieval, compressed data structures
• Effective - many IR models such as TF-IDF, BM25, LM and DFR, with many field-based weighting schemes and proximity options
• Flexible - runs across platforms: Windows, Linux, MacOS
• Multilingual - supports many languages
Other Search Engine Options
• Non-Academic
◦ Lucene/Nutch/Solr (Apache)
• Java
• Basic models
◦ Xapian (Cambridge)
• C++ (many bindings available)
• Very fast
• Basic models
◦ Sphinx (Sphinx Inc.)
• C++
• Tightly coupled with DBs
• Very basic models
• No relevance feedback
• Academic
◦ Terrier (Glasgow)
• Java
• Advanced models including DFR, LM etc.
• Advanced pseudo-RF modules
◦ Lemur/Indri (CMU/UMass)
• C++
• Advanced models except the DFR family
Content of this Tutorial
• Covered
◦ Designing and executing IR/NLP experiments with Terrier
◦ Using parts of Terrier in your application, such as
• Tokeniser, Stemmer
• Similarity scores
• Relevance feedback etc.
◦ Analysis with Terrier
• Not Covered
◦ MapReduce support
◦ Web support (JSP)
Installation
• Get Terrier
◦ Download the latest version, v3.5, freely from http://terrier.org/
• Requirements
◦ Java JDK 1.6 or greater
◦ Eclipse (just for this tutorial!)
• Setup
◦ Extract it and it's ready to use.
IR Basics
Basic IR Concepts
• Crawling
◦ Crawl the necessary part of the Web and prepare a static collection of documents
• Indexing
◦ Preprocess the documents into raw text (ASCII or UTF-8)
◦ Stop-word removal [term pipeline]
◦ Stemming [term pipeline]
◦ Store the relevant statistics of terms and documents, such as term frequency (TF) [per document and per collection] and document length, in the direct and inverted indices.
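The statistics above live in the inverted index (term → postings) and document index (document lengths). As a self-contained illustration only - Terrier's real structures are compressed and file-backed, and the class name here is invented for the sketch - a minimal in-memory inverted index might look like this:

```java
import java.util.*;

public class MiniInvertedIndex {
    // term -> (docId -> term frequency) postings
    final Map<String, Map<Integer, Integer>> postings = new HashMap<>();
    // docId -> document length, as a stand-in for the document index
    final Map<Integer, Integer> docLengths = new HashMap<>();

    void addDocument(int docId, List<String> terms) {
        docLengths.put(docId, terms.size());
        for (String t : terms) {
            // increment the tf of term t in this document
            postings.computeIfAbsent(t, k -> new TreeMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    // tf of a term in one document
    int termFrequency(String term, int docId) {
        return postings.getOrDefault(term, Map.of()).getOrDefault(docId, 0);
    }

    // number of documents containing the term (Nt in Terrier's output)
    int documentFrequency(String term) {
        return postings.getOrDefault(term, Map.of()).size();
    }
}
```

Everything a ranking model needs at query time (tf, document frequency, document length) can then be read off these maps.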
• Query Normalisation
◦ Pass the query through the same term pipeline
• Ranking
◦ The simplest yet powerful model is TF-IDF:
Score(Q, D) = Σ_{i=1..n} tf(q_i, D) × idf(q_i)
◦ tf(q_i, D) = frequency of term q_i in D
◦ idf(q_i) = log(N / # of docs containing q_i), where N is the total number of documents in the collection
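The formula above can be sketched directly in Java. This is a minimal, self-contained version for illustration (not Terrier's implementation; documents are plain token lists, and the class name is invented):

```java
import java.util.*;

public class TfIdf {
    // tf(q, D): raw frequency of term q in document D (a token list)
    static double tf(String term, List<String> doc) {
        return Collections.frequency(doc, term);
    }

    // idf(q) = log(N / df(q)); 0 if the term never occurs in the collection
    static double idf(String term, List<List<String>> collection) {
        long df = collection.stream().filter(d -> d.contains(term)).count();
        return df == 0 ? 0.0 : Math.log((double) collection.size() / df);
    }

    // Score(Q, D) = sum over query terms of tf(q_i, D) * idf(q_i)
    static double score(List<String> query, List<String> doc, List<List<String>> collection) {
        double s = 0.0;
        for (String q : query)
            s += tf(q, doc) * idf(q, collection);
        return s;
    }
}
```

Run against the three-document example on the next slide, this scorer produces the same ranking (Doc2 > Doc1 > Doc3), though the absolute scores depend on the log base chosen.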
• TF-IDF Scoring Example
◦ Doc1 = I2R is in Singapore
◦ Doc2 = I2R is in SG
◦ Doc3 = UPV is in Valencia
◦ Q = i2r sg
• Ranking
◦ Score(Q, Doc1) = (1+0)*(0.64) = 0.64 → Rank 2
◦ Score(Q, Doc2) = (1+1)*(0.64) = 1.28 → Rank 1
◦ Score(Q, Doc3) = (0+0)*(0.64) = 0.0 → Rank 3
Other Unsupervised Ranking Models
• BM25 - probabilistic model
• Language Model for IR
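For a flavour of BM25, here is the per-term score of one common variant (the idf uses a +1 inside the log to keep it non-negative; parameter defaults k1=1.2, b=0.75 are conventional, and the class name is invented for the sketch - Terrier ships its own BM25 implementation):

```java
public class Bm25 {
    static final double K1 = 1.2, B = 0.75;   // common default parameters

    // idf(q) = ln(1 + (N - df + 0.5) / (df + 0.5))
    static double idf(long df, long numDocs) {
        return Math.log(1.0 + (numDocs - df + 0.5) / (df + 0.5));
    }

    // BM25 contribution of a single query term:
    // idf * tf*(k1+1) / (tf + k1*(1 - b + b*docLen/avgDocLen))
    static double termScore(double tf, double docLen, double avgDocLen,
                            long df, long numDocs) {
        double lengthNorm = tf + K1 * (1 - B + B * docLen / avgDocLen);
        return idf(df, numDocs) * tf * (K1 + 1) / lengthNorm;
    }
}
```

Unlike plain TF-IDF, the tf component saturates and is normalised by document length, so long documents are not rewarded merely for being long.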
Terrier: Indexing
Indexing
Collection
Document
• UTFTokeniser
TermPipeline
• Stopword removal
• Stemmer
◦ PorterStemmer, WeakPorterStemmer
◦ SnowballStemmer (http://snowball.tartarus.org/)
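Conceptually, each token flows through the pipeline stages in order: tokenise → drop stopwords → stem. The following self-contained sketch illustrates that chaining; the tiny suffix-stripping "stemmer" and the five-word stopword list are placeholders, not Porter's algorithm or Terrier's actual stopword list:

```java
import java.util.*;
import java.util.stream.*;

public class TermPipeline {
    // Toy stopword list; Terrier reads its real one from stopword-list.txt
    static final Set<String> STOPWORDS = Set.of("is", "in", "the", "a", "of");

    // Very rough stand-in for a stemmer: strips a couple of common suffixes.
    // A real pipeline would use PorterStemmer or a Snowball stemmer.
    static String stem(String t) {
        if (t.endsWith("ing") && t.length() > 5) return t.substring(0, t.length() - 3);
        if (t.endsWith("s") && t.length() > 3) return t.substring(0, t.length() - 1);
        return t;
    }

    // tokenise -> lowercase -> remove stopwords -> stem
    static List<String> process(String text) {
        return Arrays.stream(text.toLowerCase().split("\\W+"))
                .filter(t -> !t.isEmpty() && !STOPWORDS.contains(t))
                .map(TermPipeline::stem)
                .collect(Collectors.toList());
    }
}
```

The same pipeline must be applied to both documents (at indexing time) and queries (at retrieval time), or the terms will not match.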
Indexers
• Indexing
◦ Single-pass indexing - inverted index only
◦ Two-pass indexing - inverted index + direct index
• Indexing structures
◦ InvertedIndex
◦ DirectIndex
◦ Lexicon
◦ DocumentIndex
Single-pass and Two-pass Indexing
Field-based Indexing
Indexing: Hands-on
Installation of Java and Eclipse
Set up
• Download [Java Platform (JDK) 7u17] from http://www.oracle.com/technetwork/java/javase/downloads/index.html
◦ Linux - select your distro
◦ Windows - .exe
◦ MacOS - .dmg
• Download Eclipse [Eclipse IDE for Java EE Developers] from http://www.eclipse.org/downloads/index-developer.php
• Installation of Eclipse: just extract it and it's ready to use.
Terrier Directory Structure
bin - Scripts to run Terrier
doc - Documentation
etc - Configuration files
lib - Required Java libraries (.jar files)
share - Utility files such as the stopword list
src - Source code
var - Index and results directory
TREC-style Experiments
• IR evaluation forums such as TREC, CLEF, NTCIR and FIRE release the document collection, a query list and the corresponding relevance judgments (qrels).
• The task is to submit runs, which the organisers then evaluate.
• This is the most conventional kind of IR experiment, usually called the ad-hoc track; it can be monolingual or cross-lingual.
• Terrier has built-in support to carry these out painlessly.
• The advantage: most standard weighting models (TF-IDF, BM25, DFR, LM) are already implemented and ready to serve as baselines.
• You just need to implement your improvement and compare against these baselines.
Indexing with Terrier
# Create the list of files to be indexed (collection.spec)
> ./bin/trec_setup.sh
# Modify the indexing properties (see the next two slides)
# Index the documents listed in collection.spec; the index is written to var/index/
> ./bin/trec_terrier.sh -i
Default terrier.properties file
#default controls for query expansion
querying.postprocesses.order=QueryExpansion
querying.postprocesses.controls=qe:QueryExpansion
#default and allowed controls
querying.default.controls=
querying.allowed.controls=qe,start,end,qemodel
#document tags specification
#for processing the contents of
#the documents, ignoring DOCHDR
TrecDocTags.doctag=DOC
TrecDocTags.idtag=DOCNO
TrecDocTags.skip=DOCHDR
#query tags specification
TrecQueryTags.doctag=TOP
TrecQueryTags.idtag=NUM
TrecQueryTags.process=TOP,NUM,TITLE
TrecQueryTags.skip=DESC,NARR
#stop-words file
stopwords.filename=stopword-list.txt
#the processing stages a term goes through
termpipelines=Stopwords,PorterStemmer
Properties
• Terrier offers many configuration options without your ever needing to touch the source code.
• Walk through the terrier.properties.sample file located at $terrier_home/etc
• Walk through the properties page: http://terrier.org/docs/v3.5/properties.html
Printing the Index
> ./bin/trec_terrier.sh --printstats
> ./bin/trec_terrier.sh --printlexicon
america,term631 Nt=2 TF=2 @0 55 5
(format: term,termid Nt=document frequency TF=term frequency @file_number start_offset_in_inv_index start_bit_offset_in_inv_index)
> ./bin/trec_terrier.sh --printinverted
901 (0,2) (3,2) (4,3) (6,5) (7,1) (8,3)
902 (4,1) (8,2)
> ./bin/trec_terrier.sh --printdirect
8 (1,3) (5,11) (13,1) (15,1) (20,1) (26,7) (28,1) (30,1) (33,1) (35,1) (38,1) (43,1)...
> ./bin/trec_terrier.sh --printdocid
1: 175 136@0,20,1
(format: id: doc_length entries@pointer info)
Terrier API with Eclipse
Eclipse Welcome Screen
Eclipse Home
Starting Point - Hello World!
• Extract terrier-tut-code.zip
• File → New → Project
• Select “Java Project from an Existing Ant Buildfile” → Next
• Select “Browse” → select the “build.xml” file from the just-extracted “terrier-tut-code” directory
• Finish

package i2r.hlt;

public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello World!");
    }
}
Code Walk-through
• HelloWorld.java and HelloWorldAdvanced.java
• The Eclipse error suggestion system
• How to run and debug with Eclipse
• Basic Java details
◦ Java objects
◦ Javadoc
Indexing with Eclipse
• Indexing.java
• IndexAnalysis.java
Using the API
• Most of the time we are not running ad-hoc experiments; instead we need individual components of the search engine API.
• For example:
◦ I need the term frequency of term X in document Y.
◦ I need the top 100 documents similar to “my” document using TF-IDF/BM25/LM.
◦ I need a tokenised, stopword-removed and stemmed version of “my” text.
◦ I need the top 10 words of document X by TF / IDF / TF-IDF.
◦ I need the TF of term X and the IDF of term Y.
◦ I need to compute the term-document matrix for this collection.
◦ .... and many more.
How to use Terrier in “your” code?
• It's very easy, and that is the main goal of this tutorial.
• You just need to add $terrier_home/lib/terrier-3.5-core.jar to your Java program's classpath, and that's it.
• We will see how everything above can be done without hassle.
• Outline
◦ Write a simple program to index our simple text files and customise indexing.
◦ How to retrieve documents from this index and customise retrieval.
◦ How to use Terrier for cross-lingual or multilingual applications.
◦ How to extract term and document statistics from the index.
◦ How to create a term-document matrix of a collection.
◦ How to use query expansion modules such as Rocchio in your applications.
◦ A case study: a chat system - IRIS.
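One of the outline items, the term-document matrix, is simple enough to sketch directly. This self-contained version builds the matrix from token lists; in practice the counts would come from Terrier's direct or inverted index rather than from raw tokens, and the class name is invented for the sketch:

```java
import java.util.*;

public class TermDocMatrix {
    // Build a term-document count matrix: each term maps to a row of
    // per-document frequencies, with one column per document.
    static Map<String, int[]> build(List<List<String>> docs) {
        Map<String, int[]> matrix = new TreeMap<>();   // sorted by term
        for (int d = 0; d < docs.size(); d++) {
            for (String term : docs.get(d)) {
                int[] row = matrix.computeIfAbsent(term, k -> new int[docs.size()]);
                row[d]++;   // count one occurrence of term in document d
            }
        }
        return matrix;
    }
}
```

Iterating over the inverted index term by term yields exactly these rows, which is why building the matrix from an existing index is cheap.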
Terrier: Retrieval
Retrieval with Terrier
• Retrieve documents from the index using relevance models such as TF_IDF, BM25, etc.
◦ Retrieval.java
• Create a term-document matrix for the collection using the Terrier index.
◦ TDMatrix.java
• Fetch term- and document-related statistics of the indexed documents.
◦ IndexAnalysis.java
• Get expanded terms using pseudo relevance feedback.
◦ PseudoRelevanceFeedback.java
• Multilingual IR
◦ ? :)
Case Study: You Have a New Weighting Scheme (like TF-IDF)
• Create a Java class implementing your formula and put it in the package org.terrier.matching.models
• Repeat the same procedure as above with your weighting scheme instead of PL2
• Submit the runs!
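In Terrier that class would extend org.terrier.matching.models.WeightingModel and put the formula in its score(...) method; here is only a self-contained sketch of the kind of per-term formula such a class implements (a toy log-tf × idf scheme with an invented class name, not an actual Terrier model):

```java
public class MyWeightingModel {
    // Toy formula: (1 + ln(tf)) * ln(N / df)
    // In Terrier, this logic goes into the score(...) method of a class
    // extending org.terrier.matching.models.WeightingModel, placed in the
    // org.terrier.matching.models package so it can be selected by name.
    static double score(double tf, double docFrequency, double numDocs) {
        if (tf == 0) return 0.0;                      // term absent from the document
        double idf = Math.log(numDocs / docFrequency);
        return (1 + Math.log(tf)) * idf;
    }
}
```

Once the class is on the classpath, the same trec_terrier.sh runs can select it via -Dtrec.model, exactly as with the built-in models.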
TREC-style Experiments with Terrier
# Create the list of files to be indexed (collection.spec)
> ./bin/trec_setup.sh
# Index the documents listed in collection.spec; the index is written to var/index/
> ./bin/trec_terrier.sh -i
# Retrieve the indexed documents for the queries in the query file and generate .res files in var/results/
> ./bin/trec_terrier.sh -r -Dtrec.model=PL2 -c 10.99 -Dtrec.topics=
# Evaluate the retrieval runs in the .res files and write the results to .eval files
> ./bin/trec_terrier.sh -e -Dtrec.qrels=
Summary
• We have learnt how to use Terrier for “our needs” in IR and NLP.
Thank You! :)