Tutorial: Experimenting IR/NLP with Terrier
Tutorial: Experimenting IR/NLP with Terrier
Parth Gupta (pgupta@dsic.upv.es)
Technical University of Valencia, Spain
References
• References used extensively in this tutorial:
◦ “Tutorial: Large-scale Information Retrieval Experimentation with Terrier” at CIKM, 2011.
◦ Documentation for Terrier 3.5: http://terrier.org/docs/v3.5/
Terrier IR Platform
• Efficient - MapReduce support, very fast indexing and retrieval, compressed data structures
• Effective - many IR models such as TF-IDF, BM25, LM and DFR, with many field-based weighting schemes and proximity options
• Flexible - runs across platforms: Windows, Linux, MacOS
• Multilingual - supports many languages
Other Search Engine Options
• Non-Academic
◦ Lucene/Nutch/Solr (Apache)
• Java
• Basic models
◦ Xapian (Cambridge)
• C++ (many bindings available)
• Very fast
• Basic models
◦ Sphinx (Sphinx Inc.)
• C++
• Tightly coupled with DBs
• Very basic models
• No relevance feedback
• Academic
◦ Terrier (Glasgow)
• Java
• Advanced models including DFR, LM etc.
• Advanced pseudo-RF modules
◦ Lemur/Indri (CMU/UMass)
• C++
• Advanced models except the DFR family
Content of this Tutorial
• Covered
◦ Designing and executing IR/NLP experiments with Terrier
◦ Using parts of Terrier in your application, such as
• Tokeniser, Stemmer
• Similarity scores
• Relevance feedback etc.
◦ Analysis with Terrier
• Not Covered
◦ MapReduce support
◦ Web support (JSP)
Installation
• Get Terrier
◦ Download the latest version, v3.5, freely from http://terrier.org/
• Requirements
◦ Java JDK 1.6 or greater
◦ Eclipse (just for this tutorial!)
• Setup
◦ Extract it and it's ready to use.
IR Basics
Basic IR Concepts
• Crawling
◦ Crawl the necessary part of the Web and prepare a static collection of documents
• Indexing
◦ Preprocess the documents into raw text (ASCII or UTF-8)
◦ Stop-word removal [term pipeline]
◦ Stemming [term pipeline]
◦ Store the relevant statistics of terms and documents, such as term frequency (TF) [per document and per collection] and document length, in the direct and inverted indices.
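The statistics above live in the inverted index (term → postings) and document index (document lengths). As a self-contained illustration only - Terrier's real structures are compressed and file-backed, and the class name here is invented for the sketch - a minimal in-memory inverted index might look like this:

```java
import java.util.*;

public class MiniInvertedIndex {
    // term -> (docId -> term frequency) postings
    final Map<String, Map<Integer, Integer>> postings = new HashMap<>();
    // docId -> document length, as a stand-in for the document index
    final Map<Integer, Integer> docLengths = new HashMap<>();

    void addDocument(int docId, List<String> terms) {
        docLengths.put(docId, terms.size());
        for (String t : terms) {
            // increment the tf of term t in this document
            postings.computeIfAbsent(t, k -> new TreeMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    // tf of a term in one document
    int termFrequency(String term, int docId) {
        return postings.getOrDefault(term, Map.of()).getOrDefault(docId, 0);
    }

    // number of documents containing the term (Nt in Terrier's output)
    int documentFrequency(String term) {
        return postings.getOrDefault(term, Map.of()).size();
    }
}
```

Everything a ranking model needs at query time (tf, document frequency, document length) can then be read off these maps.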
• Query Normalisation
◦ Pass the query through the same term pipeline
• Ranking
◦ The simplest yet powerful model is TF-IDF:
Score(Q, D) = Σ_{i=1..n} tf(q_i, D) × idf(q_i)
◦ tf(q_i, D) = frequency of term q_i in D
◦ idf(q_i) = log(N / # of docs containing q_i), where N is the total number of documents in the collection
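The formula above can be sketched directly in Java. This is a minimal, self-contained version for illustration (not Terrier's implementation; documents are plain token lists, and the class name is invented):

```java
import java.util.*;

public class TfIdf {
    // tf(q, D): raw frequency of term q in document D (a token list)
    static double tf(String term, List<String> doc) {
        return Collections.frequency(doc, term);
    }

    // idf(q) = log(N / df(q)); 0 if the term never occurs in the collection
    static double idf(String term, List<List<String>> collection) {
        long df = collection.stream().filter(d -> d.contains(term)).count();
        return df == 0 ? 0.0 : Math.log((double) collection.size() / df);
    }

    // Score(Q, D) = sum over query terms of tf(q_i, D) * idf(q_i)
    static double score(List<String> query, List<String> doc, List<List<String>> collection) {
        double s = 0.0;
        for (String q : query)
            s += tf(q, doc) * idf(q, collection);
        return s;
    }
}
```

Run against the three-document example on the next slide, this scorer produces the same ranking (Doc2 > Doc1 > Doc3), though the absolute scores depend on the log base chosen.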
• TF-IDF Scoring Example
◦ Doc1 = I2R is in Singapore
◦ Doc2 = I2R is in SG
◦ Doc3 = UPV is in Valencia
◦ Q = i2r sg
• Ranking
◦ Score(Q, Doc1) = (1+0)*(0.64) = 0.64 → Rank 2
◦ Score(Q, Doc2) = (1+1)*(0.64) = 1.28 → Rank 1
◦ Score(Q, Doc3) = (0+0)*(0.64) = 0.0 → Rank 3
Other Unsupervised Ranking Models
• BM25 - probabilistic model
• Language Model for IR
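For a flavour of BM25, here is the per-term score of one common variant (the idf uses a +1 inside the log to keep it non-negative; parameter defaults k1=1.2, b=0.75 are conventional, and the class name is invented for the sketch - Terrier ships its own BM25 implementation):

```java
public class Bm25 {
    static final double K1 = 1.2, B = 0.75;   // common default parameters

    // idf(q) = ln(1 + (N - df + 0.5) / (df + 0.5))
    static double idf(long df, long numDocs) {
        return Math.log(1.0 + (numDocs - df + 0.5) / (df + 0.5));
    }

    // BM25 contribution of a single query term:
    // idf * tf*(k1+1) / (tf + k1*(1 - b + b*docLen/avgDocLen))
    static double termScore(double tf, double docLen, double avgDocLen,
                            long df, long numDocs) {
        double lengthNorm = tf + K1 * (1 - B + B * docLen / avgDocLen);
        return idf(df, numDocs) * tf * (K1 + 1) / lengthNorm;
    }
}
```

Unlike plain TF-IDF, the tf component saturates and is normalised by document length, so long documents are not rewarded merely for being long.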
Terrier: Indexing
Indexing
Collection
Document
• UTFTokeniser
TermPipeline
• Stopword removal
• Stemmer
◦ PorterStemmer, WeakPorterStemmer
◦ SnowballStemmer (http://snowball.tartarus.org/)
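Conceptually, each token flows through the pipeline stages in order: tokenise → drop stopwords → stem. The following self-contained sketch illustrates that chaining; the tiny suffix-stripping "stemmer" and the five-word stopword list are placeholders, not Porter's algorithm or Terrier's actual stopword list:

```java
import java.util.*;
import java.util.stream.*;

public class TermPipeline {
    // Toy stopword list; Terrier reads its real one from stopword-list.txt
    static final Set<String> STOPWORDS = Set.of("is", "in", "the", "a", "of");

    // Very rough stand-in for a stemmer: strips a couple of common suffixes.
    // A real pipeline would use PorterStemmer or a Snowball stemmer.
    static String stem(String t) {
        if (t.endsWith("ing") && t.length() > 5) return t.substring(0, t.length() - 3);
        if (t.endsWith("s") && t.length() > 3) return t.substring(0, t.length() - 1);
        return t;
    }

    // tokenise -> lowercase -> remove stopwords -> stem
    static List<String> process(String text) {
        return Arrays.stream(text.toLowerCase().split("\\W+"))
                .filter(t -> !t.isEmpty() && !STOPWORDS.contains(t))
                .map(TermPipeline::stem)
                .collect(Collectors.toList());
    }
}
```

The same pipeline must be applied to both documents (at indexing time) and queries (at retrieval time), or the terms will not match.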
Indexers
• Indexing
◦ Single-pass indexing - inverted index only
◦ Two-pass indexing - inverted index + direct index
• Indexing structures
◦ InvertedIndex
◦ DirectIndex
◦ Lexicon
◦ DocumentIndex
Single-pass and Two-pass Indexing
Field-based Indexing
Indexing: Hands-on
Installation of Java and Eclipse
Set up
• Download [Java Platform (JDK) 7u17] from http://www.oracle.com/technetwork/java/javase/downloads/index.html
◦ Linux - select your distro
◦ Windows - .exe
◦ MacOS - .dmg
• Download Eclipse [Eclipse IDE for Java EE Developers] from http://www.eclipse.org/downloads/index-developer.php
• Installation of Eclipse: just extract it and it's ready to use.
Terrier Directory Structure
bin - Scripts to run Terrier
doc - Documentation
etc - Configuration files
lib - Required Java libraries (.jar files)
share - Utility files such as the stopword list
src - Source code
var - Index and results directory
TREC-style Experiments
• IR evaluation forums such as TREC, CLEF, NTCIR and FIRE release the document collection, a query list and the corresponding relevance judgments (qrels).
• The task is to submit runs, which the organisers then evaluate.
• This is the most conventional kind of IR experiment, usually called the ad-hoc track; it can be monolingual or cross-lingual.
• Terrier has built-in support to carry these out painlessly.
• The advantage: most standard weighting models (TF-IDF, BM25, DFR, LM) are already implemented and ready to serve as baselines.
• You just need to implement your improvement and compare against these baselines.
Indexing with Terrier
# Create the list of files to be indexed (collection.spec)
> ./bin/trec_setup.sh
# Modify the indexing properties (see the next two slides)
# Index the documents listed in collection.spec; the index is written to var/index/
> ./bin/trec_terrier.sh -i
Default terrier.properties file
#default controls for query expansion
querying.postprocesses.order=QueryExpansion
querying.postprocesses.controls=qe:QueryExpansion
#default and allowed controls
querying.default.controls=
querying.allowed.controls=qe,start,end,qemodel
#document tags specification
#for processing the contents of
#the documents, ignoring DOCHDR
TrecDocTags.doctag=DOC
TrecDocTags.idtag=DOCNO
TrecDocTags.skip=DOCHDR
#query tags specification
TrecQueryTags.doctag=TOP
TrecQueryTags.idtag=NUM
TrecQueryTags.process=TOP,NUM,TITLE
TrecQueryTags.skip=DESC,NARR
#stop-words file
stopwords.filename=stopword-list.txt
#the processing stages a term goes through
termpipelines=Stopwords,PorterStemmer
Properties
• Terrier offers many configuration options without your ever needing to touch the source code.
• Walk through the terrier.properties.sample file located at $terrier_home/etc
• Walk through the properties page: http://terrier.org/docs/v3.5/properties.html
Printing the Index
> ./bin/trec_terrier.sh --printstats
> ./bin/trec_terrier.sh --printlexicon
america,term631 Nt=2 TF=2 @0 55 5
(format: term,termid Nt=document frequency TF=term frequency @file_number start_offset_in_inv_index start_bit_offset_in_inv_index)
> ./bin/trec_terrier.sh --printinverted
901 (0,2) (3,2) (4,3) (6,5) (7,1) (8,3)
902 (4,1) (8,2)
> ./bin/trec_terrier.sh --printdirect
8 (1,3) (5,11) (13,1) (15,1) (20,1) (26,7) (28,1) (30,1) (33,1) (35,1) (38,1) (43,1)...
> ./bin/trec_terrier.sh --printdocid
1: 175 136@0,20,1
(format: id: doc_length entries@pointer info)
Terrier API with Eclipse
Eclipse Welcome Screen
Eclipse Home
Starting Point - Hello World!
• Extract terrier-tut-code.zip
• File → New → Project
• Select “Java Project from an Existing Ant Buildfile” → Next
• Select “Browse” → select the “build.xml” file from the just-extracted “terrier-tut-code” directory
• Finish

package i2r.hlt;

public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello World!");
    }
}
Code Walk-through
• HelloWorld.java and HelloWorldAdvanced.java
• The Eclipse error suggestion system
• How to run and debug with Eclipse
• Basic Java details
◦ Java objects
◦ Javadoc
Indexing with Eclipse
• Indexing.java
• IndexAnalysis.java
Using the API
• Most of the time we are not running ad-hoc experiments; instead we need individual components of the search engine API.
• For example:
◦ I need the term frequency of term X in document Y.
◦ I need the top 100 documents similar to “my” document using TF-IDF/BM25/LM.
◦ I need a tokenised, stopword-removed and stemmed version of “my” text.
◦ I need the top 10 words of document X by TF / IDF / TF-IDF.
◦ I need the TF of term X and the IDF of term Y.
◦ I need to compute the term-document matrix for this collection.
◦ .... and many more.
How to use Terrier in “your” code?
• It's very easy, and that is the main goal of this tutorial.
• You just need to add $terrier_home/lib/terrier-3.5-core.jar to your Java program's classpath, and that's it.
• We will see how everything above can be done without hassle.
• Outline
◦ Write a simple program to index our simple text files and customise indexing.
◦ How to retrieve documents from this index and customise retrieval.
◦ How to use Terrier for cross-lingual or multilingual applications.
◦ How to extract term and document statistics from the index.
◦ How to create a term-document matrix of a collection.
◦ How to use query expansion modules such as Rocchio in your applications.
◦ A case study: a chat system - IRIS.
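One of the outline items, the term-document matrix, is simple enough to sketch directly. This self-contained version builds the matrix from token lists; in practice the counts would come from Terrier's direct or inverted index rather than from raw tokens, and the class name is invented for the sketch:

```java
import java.util.*;

public class TermDocMatrix {
    // Build a term-document count matrix: each term maps to a row of
    // per-document frequencies, with one column per document.
    static Map<String, int[]> build(List<List<String>> docs) {
        Map<String, int[]> matrix = new TreeMap<>();   // sorted by term
        for (int d = 0; d < docs.size(); d++) {
            for (String term : docs.get(d)) {
                int[] row = matrix.computeIfAbsent(term, k -> new int[docs.size()]);
                row[d]++;   // count one occurrence of term in document d
            }
        }
        return matrix;
    }
}
```

Iterating over the inverted index term by term yields exactly these rows, which is why building the matrix from an existing index is cheap.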
Terrier: Retrieval
Retrieval with Terrier
• Retrieve documents from the index using relevance models such as TF_IDF, BM25, etc.
◦ Retrieval.java
• Create a term-document matrix for the collection using the Terrier index.
◦ TDMatrix.java
• Fetch term- and document-related statistics of the indexed documents.
◦ IndexAnalysis.java
• Get expanded terms using pseudo relevance feedback.
◦ PseudoRelevanceFeedback.java
• Multilingual IR
◦ ? :)
Case Study: You Have a New Weighting Scheme (like TF-IDF)
• Create a Java class implementing your formula and put it in the package org.terrier.matching.models
• Repeat the same procedure as above with your weighting scheme instead of PL2
• Submit the runs!
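In Terrier that class would extend org.terrier.matching.models.WeightingModel and put the formula in its score(...) method; here is only a self-contained sketch of the kind of per-term formula such a class implements (a toy log-tf × idf scheme with an invented class name, not an actual Terrier model):

```java
public class MyWeightingModel {
    // Toy formula: (1 + ln(tf)) * ln(N / df)
    // In Terrier, this logic goes into the score(...) method of a class
    // extending org.terrier.matching.models.WeightingModel, placed in the
    // org.terrier.matching.models package so it can be selected by name.
    static double score(double tf, double docFrequency, double numDocs) {
        if (tf == 0) return 0.0;                      // term absent from the document
        double idf = Math.log(numDocs / docFrequency);
        return (1 + Math.log(tf)) * idf;
    }
}
```

Once the class is on the classpath, the same trec_terrier.sh runs can select it via -Dtrec.model, exactly as with the built-in models.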
TREC-style Experiments with Terrier
# Create the list of files to be indexed (collection.spec)
> ./bin/trec_setup.sh
# Index the documents listed in collection.spec; the index is written to var/index/
> ./bin/trec_terrier.sh -i
# Retrieve the indexed documents for the queries in the query file and generate .res files in var/results/
> ./bin/trec_terrier.sh -r -Dtrec.model=PL2 -c 10.99 -Dtrec.topics=
# Evaluate the retrieval runs in the .res files and write the results to .eval files
> ./bin/trec_terrier.sh -e -Dtrec.qrels=
Summary
• We have learnt how to use Terrier for “our needs” in IR and NLP.
Thank You! :)