Information Retrieval and Web Mining: The Boolean Model - José M. Castaño
Basic Concepts: Document, Collection, Information Need, Query. A document is described by a set of representative keywords (index terms). Terms have (binary) weights calculated from statistics of their frequency in the text. Terms vs. words/tokens. Retrieval is the matching process between document terms and the terms in queries.
Models: A model is an embodiment of the theory in which we define a set of objects about which assertions can be made and restrict the ways in which classes of objects can interact. A retrieval model specifies the representations used for documents and information needs, and how they are compared (Turtle and Croft, 1992).
Formal Characterization of an IR Model: An information retrieval model is a quadruple $[D, Q, F, R(q_i, d_j)]$ where $D$ is a set of representations for the documents in the collection; $Q$ is a set of representations for the user information needs (queries); $F$ is a framework for modelling document representations, queries, and their relationships; and $R(q_i, d_j)$ is a ranking function which associates a real number with a query representation $q_i \in Q$ and a document representation $d_j \in D$ (Baeza-Yates and Ribeiro-Neto, 1999).
Implementation vs. Model: An IR model is a formalization of a way of thinking about information retrieval. Compare this with an implementation: how to operationalize the model in a given environment (e.g., file structures).
IR Using the Boolean Model: Queries are Boolean expressions, e.g., Caesar AND Brutus. The IR system returns all documents that satisfy the Boolean expression. The model is based on set theory and Boolean algebra. It is used by commercial IR systems (Dialog, Lexis/Nexis). It has precise semantics.
Boolean Model: Used by the first online systems in the 1960s and 1970s. The most widely used model in commercial IR. An advanced feature in most other systems. Operators: AND, OR, NOT and parentheses, with precedence rules. Terms are usually supplemented with proximity operators. Requires an exact match. Based on an inverted file.
Drawbacks of the Boolean System: Retrieval is based on a binary decision criterion with no notion of partial matching. No ranking of the documents is provided (absence of a grading scale). The information need has to be translated into a Boolean expression, which most users find awkward. The Boolean queries formulated by users are most often too simplistic. As a consequence, the Boolean model frequently returns either too few (conjunctive) or too many (disjunctive) documents in response to a user query.
Extensions of the Boolean System: How to extend the Boolean model (a past research focus) with partial matching and ranking. Two extensions of the Boolean model: the Fuzzy Set Model and the Extended Boolean Model.
Incidence Matrix
Term Vectors: So we have a 0/1 vector for each term. To answer the query Brutus AND Caesar AND NOT Calpurnia, take the vectors for Brutus, Caesar and Calpurnia (complemented) and combine them with bitwise AND: 110100 AND 110111 AND 101111 = 100100.
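The bitwise computation on this slide can be checked with a short sketch. The three 6-bit vectors are the ones quoted on the slide; the uncomplemented Calpurnia vector (010000) is derived from them, and the helper name `to_bits` is an assumption made for illustration.

```python
# Term incidence vectors over six documents, one bit per document.
WIDTH = 6                      # number of documents (taken from the vector length)
MASK = (1 << WIDTH) - 1        # 111111, keeps the complement within six bits

brutus = int("110100", 2)
caesar = int("110111", 2)
calpurnia = int("010000", 2)   # derived: complement of the slide's 101111


def to_bits(value, width=WIDTH):
    """Format an integer as a fixed-width string of 0s and 1s."""
    return format(value, "0{}b".format(width))


# NOT Calpurnia: flip every bit, then mask back to six documents.
not_calpurnia = ~calpurnia & MASK

# Brutus AND Caesar AND NOT Calpurnia
answer = brutus & caesar & not_calpurnia

print(to_bits(not_calpurnia))  # 101111
print(to_bits(answer))         # 100100
```

Each 1 in the answer marks a document that satisfies the query, here the first and fourth documents.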
Larger Corpora? Consider N = 1M documents, each with about 1K terms. At an average of 6 bytes/term (including spaces and punctuation), that is 6 GB of data in the documents. Say there are m = 500K distinct terms among these.
Incidence Matrix: A 500K x 1M matrix has half a trillion 0s and 1s, but no more than one billion 1s, so the matrix is extremely sparse. Why? Each of the 1M documents has about 1K terms, so at most 1M x 1K = 1 billion cells can hold a 1. What's a better representation? We record only the positions of the 1s.
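The sizes quoted on these two slides follow from simple arithmetic. A minimal sketch, using only the figures given on the slides (the variable names are chosen here for illustration):

```python
# Figures from the slides.
N_DOCS = 1_000_000            # N = 1M documents
TERMS_PER_DOC = 1_000         # about 1K terms per document
BYTES_PER_TERM = 6            # average, including spaces/punctuation
N_DISTINCT_TERMS = 500_000    # m = 500K distinct terms

corpus_bytes = N_DOCS * TERMS_PER_DOC * BYTES_PER_TERM   # size of the raw text
matrix_cells = N_DISTINCT_TERMS * N_DOCS                 # cells in the incidence matrix
max_ones = N_DOCS * TERMS_PER_DOC                        # upper bound on the number of 1s
density = max_ones / matrix_cells                        # fraction of cells that can be 1

print(corpus_bytes)   # 6000000000  (6 GB)
print(matrix_cells)   # 500000000000  (half a trillion)
print(max_ones)       # 1000000000  (one billion)
print(density)        # 0.002
```

At most 0.2% of the cells can be 1, which is why storing only the 1 positions is attractive.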
Inverted Index
Construction (Inverted Index)
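The slide content for the construction step did not survive transcription, so the following is only a minimal sketch of one common way to build an inverted index: tokenize each document, collect the docIDs per term, and sort both the dictionary and each postings list. The function name `build_inverted_index`, the whitespace tokenizer and the three toy documents are assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a sorted list of the docIDs that contain it.

    `docs` is a dict {doc_id: text}. Tokenization is a plain lowercase
    whitespace split, a simplifying assumption.
    """
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            postings[token].add(doc_id)   # a set removes duplicate occurrences
    # Sort terms (the dictionary) and docIDs (each postings list).
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


# Hypothetical toy collection.
docs = {
    1: "Brutus killed Caesar",
    2: "Caesar praised Brutus",
    3: "Calpurnia warned Caesar",
}
index = build_inverted_index(docs)
print(index["caesar"])     # [1, 2, 3]
print(index["brutus"])     # [1, 2]
print(index["calpurnia"])  # [3]
```

Keeping each postings list sorted by docID is what makes the linear-time merge possible.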
Query Processing: Consider processing the query Brutus AND Caesar. Locate Brutus in the dictionary and retrieve its postings. Locate Caesar in the dictionary and retrieve its postings. Merge the two postings lists.
Merge: Walk through the two postings lists simultaneously, in time linear in the total number of postings entries. If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings are sorted by docID.
Intersecting (merging) two postings lists:
MERGE(p1, p2)
  answer <- <>
  while p1 != NIL and p2 != NIL
  do if docID(p1) = docID(p2)
       then ADD(answer, docID(p1)); p1 <- next(p1); p2 <- next(p2)
     else if docID(p1) < docID(p2)
       then p1 <- next(p1)
     else p2 <- next(p2)
  return answer
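The merge can be sketched in Python as follows. Postings lists are modelled as sorted Python lists of docIDs and the two pointers as list indices; the function name `intersect` and the example postings are assumptions made for illustration.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(len(p1) + len(p2))."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID is in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the pointer at the smaller docID
        else:
            j += 1
    return answer


# Hypothetical postings lists.
brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```

Every iteration advances at least one pointer, so the loop runs at most x + y times. This bound depends on both lists being sorted by docID.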
More General Queries: How does the algorithm adapt to queries such as Brutus AND NOT Caesar or Brutus OR NOT Caesar? Is the merge still O(x+y), or what is the complexity? What happens with a Boolean formula such as (Brutus OR Caesar) AND NOT (Antony OR Cleopatra)? Is it always linear?
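As one possible answer to part of this exercise, here is a sketch of a merge for a query of the form x AND NOT y (for example Brutus AND NOT Caesar, which is an assumption about the kind of query the slide has in mind). It still runs in O(x+y) because each step advances one of the two pointers. The function name `and_not` and the sample lists are hypothetical.

```python
def and_not(p1, p2):
    """Return the docIDs in p1 that are absent from p2 (both sorted by docID)."""
    answer = []
    i = j = 0
    while i < len(p1):
        if j == len(p2) or p1[i] < p2[j]:
            answer.append(p1[i])   # p1[i] cannot appear later in p2
            i += 1
        elif p1[i] == p2[j]:
            i += 1                 # excluded: present in both lists
            j += 1
        else:
            j += 1                 # catch p2 up to p1
    return answer


print(and_not([1, 2, 4, 8], [2, 3, 8, 9]))  # [1, 4]
```

A query of the form x OR NOT y is different: its result can contain almost every document in the collection, so its cost depends on the collection size N rather than on x + y alone.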
Exact Match: The Boolean retrieval model lets the user pose a query that is a Boolean expression. Boolean queries use AND, OR and NOT to join query terms. The model views each document as a set of words. It is precise: a document either matches the condition or it does not. It was the primary commercial retrieval tool for three decades. Professional searchers (e.g., lawyers) still like Boolean queries: you know exactly what you're getting.
Example: Westlaw. The largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992). Tens of terabytes of data; 700,000 users. The majority of users still use Boolean queries. Example information need: What is the statute of limitations in cases involving the federal tort claims act? Query: LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM. Here /3 = within 3 words and /S = in the same sentence.
Another example information need: Requirements for disabled people to be able to access a workplace. Query: disabl! /p access! /s work-site work-place (employment /3 place). Note that SPACE is disjunction, not conjunction! Such queries are long and precise, use proximity operators, and are developed incrementally, unlike web search. Professionals prefer Boolean search for its precision, transparency and control, but that doesn't mean it actually works better.
Optimization: What is the best order for query processing? Consider a query that is an AND of t terms. For each of the t terms, get its postings, then AND them together.
Optimization: Process in order of increasing frequency: start with the smallest set, then keep cutting further. Execute as (Caesar AND Brutus) AND Calpurnia.
Optimization: intersecting a set of terms:
MERGE(<t_1, ..., t_n>)
  terms <- SORTBYFREQ(<t_1, ..., t_n>)
  result <- postings(first(terms))
  terms <- rest(terms)
  while terms != NIL and result != NIL
  do result <- MERGE(result, postings(first(terms)))
     terms <- rest(terms)
  return result
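A sketch of this strategy in Python, under stated assumptions: the index is a dict from term to a sorted list of docIDs, a term's frequency is taken to be the length of its postings list, and the names `intersect`, `intersect_terms` and the sample postings are hypothetical.

```python
def intersect(p1, p2):
    """Linear-time intersection of two postings lists sorted by docID."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def intersect_terms(terms, index):
    """AND together the postings of all terms, rarest term first."""
    # Sort by document frequency (length of the postings list).
    ordered = sorted(terms, key=lambda t: len(index.get(t, [])))
    if not ordered:
        return []
    result = list(index.get(ordered[0], []))
    for term in ordered[1:]:
        if not result:             # nothing left to intersect: stop early
            break
        result = intersect(result, index.get(term, []))
    return result


# Hypothetical postings lists.
index = {
    "brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "calpurnia": [2, 31, 54, 101],
}
print(intersect_terms(["brutus", "caesar", "calpurnia"], index))  # [2]
```

Starting with the rarest term keeps every intermediate result no larger than the shortest postings list, so later merges have less to scan.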
Generalizing the Optimization: Example: (Caesar OR Brutus) AND (Hamlet OR Cordelia). Get the frequencies for all terms. Estimate the size of each OR by the sum of its frequencies (a conservative estimate). Process in increasing order of OR sizes.
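The ordering heuristic can be sketched as below. The document frequencies are invented purely for illustration, and the function names `estimate_or_size` and `order_or_groups` are assumptions, not part of the slides.

```python
def estimate_or_size(group, doc_freq):
    """Upper bound on the size of an OR group: the sum of its term frequencies."""
    return sum(doc_freq.get(term, 0) for term in group)


def order_or_groups(groups, doc_freq):
    """Return the OR groups sorted by increasing estimated size."""
    return sorted(groups, key=lambda g: estimate_or_size(g, doc_freq))


# Hypothetical document frequencies.
doc_freq = {"caesar": 120, "brutus": 80, "hamlet": 30, "cordelia": 10}
groups = [("caesar", "brutus"), ("hamlet", "cordelia")]

for group in order_or_groups(groups, doc_freq):
    print(group, estimate_or_size(group, doc_freq))
# ('hamlet', 'cordelia') 40
# ('caesar', 'brutus') 200
```

The sum is conservative because a document containing both terms of a group is counted twice, so the true size of the OR can only be smaller.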
Example: What is the best order for processing (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)?
Beyond Terms: What about phrases, e.g., "Stanford University"? Proximity: find Gates NEAR Microsoft. This is a constraint on AND. The index needs to capture position information in documents. Zones in documents: find documents with (author = Ullman) AND (text contains automata).
Within-Document Frequency: 1 vs. 0 occurrences of a search term; 2 vs. 1 occurrences; 3 vs. 2 occurrences, etc. Usually more seems better. The index needs term frequency information for documents.
Ranking: Boolean queries give only inclusion or exclusion of documents. Often we want to rank or group results. In practice, results are often ordered chronologically. We need to measure the proximity of the query to each document. We need to decide whether the documents presented to the user are singletons or a group of documents covering various aspects of the query.
Extended Boolean: The Boolean model is simple and elegant, but makes no provision for ranking. The fuzzy model ranks by relaxing the condition on set membership (no evaluation on standard test sets). The extended Boolean model adds the notions of partial matching and term weighting, combining characteristics of the vector model with properties of Boolean algebra. The p-norm model is the most famous; it is usually impractical to implement and usually hard for users to understand.
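To make the p-norm idea concrete, here is a sketch of the similarity functions in their usual textbook form for unweighted query terms. Assuming this is the variant the slide refers to, each document term weight w lies in [0, 1], an OR query scores the distance from the all-zeros point, and an AND query scores one minus the distance from the all-ones point. The function names are hypothetical.

```python
def pnorm_or(weights, p):
    """p-norm score of (t1 OR ... OR tm) for document term weights in [0, 1]."""
    m = len(weights)
    return (sum(w ** p for w in weights) / m) ** (1 / p)


def pnorm_and(weights, p):
    """p-norm score of (t1 AND ... AND tm) for document term weights in [0, 1]."""
    m = len(weights)
    return 1 - (sum((1 - w) ** p for w in weights) / m) ** (1 / p)


# A document that contains the first query term but not the second.
weights = [1.0, 0.0]
for p in (1, 2, 10):
    print(p, round(pnorm_or(weights, p), 4), round(pnorm_and(weights, p), 4))
```

With p = 1 both functions reduce to the average of the weights, so AND and OR are indistinguishable, as in a vector-style score. As p grows, OR approaches the maximum weight and AND approaches the minimum, which recovers strict Boolean behaviour.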
Pseudo-Boolean Queries: A new notation, from web search: +cat dog +collar leash. These are prefix operators and do not mean the same thing as AND/OR! + means mandatory: the term must be in the document. - means the term cannot be in the document. Phrases: "stray cat" AND "frayed collar" is equivalent to +"stray cat" +"frayed collar".
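A minimal sketch of how the prefix operators could be interpreted as a filter. It handles single-word terms only (no quoted phrases), and it assumes that unprefixed terms are optional and so do not affect whether a document matches. The function names `parse_query` and `matches` are hypothetical.

```python
def parse_query(query):
    """Split a pseudo-Boolean query into required, optional and excluded terms."""
    required, optional, excluded = set(), set(), set()
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.add(token[1:])    # + : must be in the document
        elif token.startswith("-") and len(token) > 1:
            excluded.add(token[1:])    # - : cannot be in the document
        else:
            optional.add(token)        # no prefix: optional (assumption)
    return required, optional, excluded


def matches(doc_text, query):
    """True if the document has every + term and none of the - terms."""
    words = set(doc_text.lower().split())
    required, _optional, excluded = parse_query(query)
    return required <= words and not (excluded & words)


query = "+cat dog +collar leash"
print(matches("the cat lost its collar", query))   # True
print(matches("the dog pulled on the leash", query))  # False
```

The first document matches even though it contains neither dog nor leash, which shows why this notation is not the same as joining all four terms with AND or with OR.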
Result Sets: Run a query, get a result set. Two choices: reformulate the query and run it on the entire collection, or reformulate the query and run it on the result set. Example Dialog query: (Redford AND Newman) -> S1, 1450 documents; (S1 AND Sundance) -> S2, 898 documents.
Faceted Queries: Strategy: break the query into facets. The query is a conjunction of disjunctions, and each facet expresses a topic.
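A faceted query can be evaluated as an AND over facets, where each facet is an OR over its terms. The sketch below uses Python sets for brevity rather than a postings merge; the function name `faceted_query`, the index contents and the facet terms are all invented for illustration.

```python
def faceted_query(facets, index):
    """Evaluate a conjunction of disjunctions over an inverted index.

    `facets` is a list of term lists; `index` maps term -> list of docIDs.
    """
    result = None
    for facet in facets:
        facet_docs = set()
        for term in facet:                       # OR within a facet
            facet_docs |= set(index.get(term, []))
        # AND across facets
        result = facet_docs if result is None else result & facet_docs
    return sorted(result) if result is not None else []


# Hypothetical index and facets: one facet per topic.
index = {
    "cat": [1, 2],
    "kitten": [3],
    "collar": [2, 4],
    "leash": [3, 5],
}
facets = [["cat", "kitten"], ["collar", "leash"]]
print(faceted_query(facets, index))  # [2, 3]
```

Document 1 mentions a cat but no collar or leash, so it satisfies only the first facet and is excluded.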