Issues in Pre- and Post-translation Document Expansion: Untranslatable Cognates and Missegmented Words
Gina-Anne Levow
University of Chicago
1100 E. 58th St., Chicago, IL 60637, USA
levow@cs.uchicago.edu

Abstract

Query expansion by pseudo-relevance feedback is a well-established technique in both mono- and cross-lingual information retrieval, enriching and disambiguating the typically terse queries provided by searchers. Comparable document-side expansion is a relatively more recent development motivated by error-prone transcription and translation processes in spoken document and cross-language retrieval. In the cross-language case, one can perform expansion before translation, after translation, and at both points. We investigate the relative impact of pre- and post-translation document expansion for cross-language spoken document retrieval in Mandarin Chinese. We find that post-translation expansion yields a highly significant improvement in retrieval effectiveness, while improvements due to pre-translation expansion alone or in combination do not reach significance. We identify two key factors of segmentation and translation in Chinese orthography that limit the effectiveness of pre-translation expansion in the Chinese-English case, while post-translation expansion yields its full benefit.

1 Introduction

Information retrieval aims to match the information need expressed by the searcher in the query with concepts expressed in documents. This matching process is complicated by the variety of different ways (different terms) available to express these concepts and information needs. In addition, this matching process is dramatically complicated in cross-language and spoken document retrieval by the need to match expressions across languages, typically using error-prone processes such as translation and automatic speech recognition transcription. To compensate for this variation in expression of underlying concepts, researchers have developed the technique of pseudo-relevance feedback, whereby the information representation (query or document) is enriched with highly selective, topically related terms from a large collection of comparable documents. Such expansion techniques have proved useful across the range of information retrieval applications from mono-lingual to multi-lingual, from text to speech, and from queries to documents.

Expansion in the context of cross-language information retrieval (CLIR) is particularly interesting as it presents multiple opportunities for improving retrieval effectiveness. The pseudo-relevance feedback process can be applied, depending on the retrieval architecture, before translating the query, after translating the query, before translating the document, after translating the document, or at some subset of these points, though not all combinations are reasonable. While pre- and post-translation expansion have been well-studied for a query translation architecture in European languages, as we describe in more detail below, these effects are less well-understood on the document side, especially
for Asian languages.

In this paper, we compare the effects of pre-translation, post-translation, and combined pre- and post-translation document expansion for cross-language retrieval using English queries to retrieve spoken documents in Mandarin Chinese. We identify not only significant enhancements to retrieval effectiveness for post-translation document expansion, but also key contrasts with prior work on query translation and expansion, caused by certain characteristics of Mandarin Chinese, shared by many Asian languages, including issues of segmentation and orthography.

2 Related Work

This work draws on prior research in pseudo-relevance feedback for both queries and documents.

2.1 Pre- and Post-translation Query Expansion

In pre-translation query expansion, the goal is twofold. The first aim is that of monolingual query expansion: providing additional terms to refine the query and to enhance the probability of matching the terminology chosen by the authors of the document. The second is to provide additional terms that limit the possibility of failing to translate a concept in the query simply because the particular term is not present in the translation lexicon. (Ballesteros and Croft, 1997) evaluated pre- and post-translation query expansion in a Spanish-English cross-language information retrieval task and found that combining pre- and post-translation query expansion improved both precision and recall, with pre-translation expansion improving both precision and recall, and post-translation expansion enhancing precision. The dictionary ablation experiments of (McNamee and Mayfield, 2002) on the effect of translation resource size and pre- and post-translation query expansion effectiveness demonstrated the key and dominant role of pre-translation expansion in providing translatable terms. If too few terms are translated, post-translation expansion can provide little improvement.

2.2 Document Expansion

The document expansion approach was first proposed by (Singhal et al., 1999) in the context of spoken document retrieval. Since spoken document retrieval involves search of error-prone automatic speech recognition transcriptions, Singhal et al. introduced document expansion as a way of recovering those words that might have been in the original broadcast but that had been misrecognized. They speculated that correctly recognized terms would yield a topically coherent transcript, while the sporadic errors would be from a random distribution. Enriching the documents with highly selective terms drawn from highly ranked documents retrieved by using the document itself as a query yielded retrieval effectiveness that improved not only over the original errorful transcription but also over a perfect manual transcription. (Levow and Oard, 2000) applied post-translation document expansion to both spoken documents and newswire text in Mandarin-English multi-lingual retrieval and found some improvements in retrieval effectiveness. (Levow, 2003) evaluated multi-scale units (words and bigrams) for post-transcription expansion of Mandarin spoken documents, finding significant improvements for expansion with word units using bigram-based indexing.

3 Experimental Configuration

Here we describe the basic experimental configuration under which contrastive document expansion experiments were carried out.

3.1 Experimental Collection

We used the Topic Detection and Tracking (TDT) Collection for this work. TDT is an evaluation program where participating sites tackle tasks such as identifying the first time a story is reported on a given topic or grouping similar topics from audio and textual streams of newswire data. In recent years, TDT has focused on performing such tasks in both English and Mandarin Chinese (see footnote 1). The task that we have performed is not a strict part of TDT because we are performing retrospective retrieval, which permits knowledge of the statistics for the entire collection. Nevertheless, the TDT collection serves as a valuable resource for our work.

Footnote 1: This year Arabic was added to the languages of interest.

The TDT multilingual collection includes English and Mandarin newswire text as well as (audio) broadcast news. For most of the Mandarin audio data, word-level transcriptions produced by the Dragon automatic speech recognition system are provided. All news stories are exhaustively tagged with event-based topic labels, which serve as the relevance judgments for performance evaluation of our cross-language spoken document retrieval work. We used a subset of the TDT-2 corpus for the experiments reported here.

3.2 Query Formulation

TDT frames the retrieval task as query-by-example, designating 4 exemplar documents to specify the information need. For query formulation, we constructed a vector of the 180 terms that best distinguish the query exemplars from other contemporaneous (and hopefully not relevant) stories. We used a $\chi^2$ test in a manner similar to that used by Schütze et al. (Schütze et al., 1995) to select these terms. The pure $\chi^2$ statistic is symmetric, assigning equal value to terms that help to recognize known relevant stories and those that help to reject the other contemporaneous stories. We limited our choice to terms that were positively associated with the known relevant training stories. For the $\chi^2$ computation, we constructed a set of 996 contemporaneous documents for each topic by removing the four query exemplars from a topic-dependent set of up to 1000 stories working backwards chronologically from the last English query example. Additional details may be found in (Levow and Oard, 2000).

3.3 Document Translation

Our translation strategy implemented a word-for-word translation approach. For our original spoken documents, we used the word boundaries provided in the baseline recognizer transcripts. We next performed dictionary-based word-for-word translation, using a bilingual term list produced by merging the entries from the second release of the LDC Chinese-English term list (http://www.ldc.upenn.edu, (Huang, 1999)) and entries from the CETA file, a large human-readable Chinese-English dictionary. The resulting term list contains 195,078 unique Mandarin terms, with an average of 1.9 known English translations per Mandarin term. We selected the translation with the highest target language unigram frequency, based on a side collection in the target language.

3.4 Document Expansion

We implemented document expansion for the VOA Mandarin broadcast news stories in an effort to partially recover terms that may have been mistranscribed. Singhal et al. used document expansion for monolingual speech retrieval (Singhal and Pereira, 1999).

The automatic transcriptions of the VOA Mandarin broadcast news stories and their word-for-word translations are an often noisy representation of the underlying stories. For expansion, the text of these documents was treated as a query to a comparable collection (in Mandarin before translation and English after translation), by simply combining all the terms with uniform weighting. This query was presented to the InQuery retrieval system version 3.1pl developed at the University of Massachusetts (Callan et al., 1992).

Figure 1 depicts the document expansion process. The use of pre- and post-translation document expansion components was varied as part of the experimental suite described below. We selected the five highest ranked documents from the ranked retrieval list. From those five documents, we extracted the most selective terms and used them to enrich the original translations of the stories. For this expansion process we first created a list of terms from the documents where each document contributed one instance of a term to the list. We then sorted the terms by inverse document frequency (IDF). We next augmented the original documents with these terms until the document had approximately doubled in length. Doubling was computed in terms of number of whitespace delimited units. For Chinese audio documents, words were identified by the Dragon automatic speech recognizer as part of the transcription process. For the Chinese newswire text, segmentation was performed by the NMSU segmenter (Jin, 1998). The expansion factor chosen here followed Singhal et al.'s original proposal. A proportional expansion factor is more desirable than some constant additive number of words or some selectivity threshold, as it provides a more consistent effect on documents of varying lengths; an IDF-based threshold, for example, adds disproportionately more new terms to short original documents than long ones, outweighing the original content. Prior experiments
indicate little sensitivity to the exact expansion factor chosen, as long as it is proportional.

This process thus relatively increased the weight of terms that occurred rarely in the document collection as a whole but frequently in related documents. The resulting augmented documents were then indexed by InQuery in the usual way. This expanded document collection formed the basis for retrieval using the translated exemplar queries.

The intuition behind document expansion is that terms that are correctly transcribed will tend to be topically coherent, while mistranscription will introduce spurious terms that lack topical coherence. In other words, although some "noise" terms are randomly introduced, some "signal" terms will survive. The introduction of spurious terms degrades ranked retrieval somewhat, but the adverse effect is limited by the design of ranking algorithms that give high scores to documents that contain many query terms. Because topically related terms are far more likely to appear together in documents than are spurious terms, the correctly transcribed terms will have a disproportionately large impact on the ranking process. The highest ranked documents are thus likely to be related to the correctly transcribed terms, and to contain additional related terms. For example, a system might fail to accurately transcribe the name "Yeltsin" in the context of the (former) "Russian Prime Minister". However, in a large contemporaneous text corpus, the correct form of the name will appear in such document contexts, and relatively rarely outside of such contexts. Thus, it will be a highly correlated and highly selective term to be added in the course of document expansion.

[Figure 1: Document Expansion Process. Mandarin broadcast news passes through ASR transcription to give transcribed documents. In pre-translation expansion, these are run through InQuery against a comparable Chinese newswire corpus, and terms are selected from the top 5 documents. After translation, the translated documents undergo post-translation expansion in the same way against a comparable English newswire corpus (InQuery, top 5, term selection). The result is indexed by InQuery, and the query vector is run against it to produce results.]

4 Document Expansion Experiments

Our goal is to evaluate the effectiveness of pseudo-relevance feedback expansion applied at different stages of document processing and determine what factors contribute to any differences in final retrieval effectiveness. We consider expansion before translation, after translation, and at both points. The expansion process aims to (re)introduce terminology that could have been used by the author to express the concepts in the documents. Expansion at different stages of processing addresses different causes of loss or absence of terms. At all points, it can address terminological choice by the author.

Since we are working with automatic transcriptions of spoken documents, pre-translation (post-transcription) expansion directly addresses term loss due to substitution or deletion errors in automatic recognition. In addition, as emphasized by (McNamee and Mayfield, 2002), pre-translation expansion can be crucial to providing translatable terms so that there is some material for post-translation indexing and matching to operate on. In other words, by including a wider range of expressions of the document concepts, pre-translation expansion can avoid translation gaps by enhancing the possibility that some term representing a concept that appears in the original document will have a translation in the bilingual term list. Addition of terms can also have a disambiguating effect, as identified by (Ballesteros and Croft, 1997).

Post-translation expansion provides an opportunity to address translation gaps even more strongly. Pre-translation expansion requires that there be some representation of the document language concept in the term list, whereas post-translation expansion can acquire related terms with no representation in the translation resources from the query language side collection. This capability is particularly desirable given both the important role of named entities (e.g. person and organization names) in many retrieval activities and their poor coverage in most translation resources. Finally, it provides the opportunity to introduce additional conceptually related terminology in the query language, even if the document language form of the term was not introduced by the original author to enhance the representation.

We evaluate four document processing configurations:

1. No Expansion
   Documents are translated directly as described above, based on the provided automatic speech recognition transcriptions.

2. Pre-translation Expansion
   Documents are expanded as described above, using a contemporaneous Mandarin newswire text collection from Xinhua and Zaobao news agencies. These collections are segmented into words using the NMSU segmenter. The resulting documents are translated as usual. Note that translation requires that the expansion units be words.

3. Post-translation Expansion
   The English document forms produced by item 1 are expanded using a contemporaneous collection of English newswire text from the New York Times and Associated Press (also part of the TDT-2 corpus).

4. Pre- and Post-translation Expansion
   The document forms produced by item 2 are translated in the usual word-for-word process. The resulting English text is expanded as in item 3.

After the above processing, the resulting English documents are indexed.

4.1 Results

The results of these different expansion configurations appear in Figure 2. We observe that both post-translation expansion and combined pre- and post-translation document expansion yield highly significant improvements (Wilcoxon signed rank test, two-tailed) in retrieval effectiveness over the unexpanded case. In contrast, although pre-translation expansion yields an 18% relative increase in mean average precision, this improvement does not reach significance. The combination of pre- and post-translation expansion increases effectiveness by only 3% relative over post-translation expansion, but 33% relative over pre-translation expansion alone. This combination of pre- and post-translation expansion significantly improves over pre-translation document expansion alone.

5 Discussion

These results clearly demonstrate the significant utility of post-translation document expansion for English-Mandarin CLIR with Mandarin spoken documents, in contrast to pre-translation expansion. Not only do these results extend our understanding of the interactions of translation and expansion, but they contrast dramatically with prior work on translation
and query expansion, in particular with the (McNamee and Mayfield, 2002) work emphasizing the primary importance of pre-translation expansion.

Document Expansion        None   Pre    Post   Pre+Post
Mean average precision    0.39   0.46   0.59   0.61

Figure 2: Retrieval effectiveness of document expansion

Two main factors contribute to this contrast: first, differences between languages, and second, differences between documents and queries.

The characteristics of the document and query languages play a crucial role in determining the effectiveness of pre- and post-translation document expansion. In particular, the orthography of Mandarin Chinese and the difference in writing systems between the English queries and Mandarin documents affect the expansion process. If one examines the terms contributed by post-translation expansion, one can quickly observe the utility of the enriching terms. For instance, in a document about the Iraqi oil embargo, one finds the names of Tariq Aziz and Saddam; in an article about the former Soviet republic of Georgia, one finds the name of former president Zviad Gamsakhurdia. These and many of the other useful expansion terms do not appear anywhere in the translation resource. Even if these terms were proposed by pre-translation expansion or existed in the original document, they would not be available in the translated result. These named entities are highly useful in many information retrieval activities but are notoriously absent from translation resources. For languages with different orthographies, these terms cannot match as cognates but must be explicitly translated or transliterated. Thus, these terms are only useful for enrichment when the translation barrier has already been passed. In contrast, the majority of the query translation experiments that demonstrate the utility of pre-translation expansion have been performed on European language pairs that share a common alphabet, making names found at any stage of expansion available for matching as cognates in retrieval even when no explicit translation is available. Recent side experiments on pre- and post-translation query expansion on the English-Chinese pair show a similar pattern of effectiveness for post-translation expansion over pre-translation expansion (Levow et al., Under Review).

A further complication is caused by the fact that Mandarin Chinese is written without white space separating words. As a result, some segmentation process must be performed to identify words for translation, even though indexing and retrieval can be performed effectively on n-gram units (Meng et al., 2001). This segmentation process typically relies on a list of terms that may appear in legal segmentations. Just as in the case of translation, these term lists often lack good coverage of proper names. Thus, these terms may not be identified for translation, expansion, or even transcription by an automatic speech recognition system that also depends on word lists as models. These constraints limit the effectiveness of pre-translation expansion. In post-translation expansion, however, these problems are much less significant. In English, white-space delimited terms are available and largely sufficient for retrieval (especially after stemming). Even with multi-word concepts as in the name examples above, the cooccurrence of these terms in expansion documents makes it likely that they will cooccur in the list of enriching terms as well, though perhaps not in the same order. In Chinese or other typically unsegmented languages, overlapping n-grams can be used as indexing or expansion units, to bypass segmentation issues, once translation has been completed.

Finally, (McNamee and Mayfield, 2002) observe that pre-translation query expansion plays a crucial role in ensuring that some terms are translatable, and post-translation expansion would have nothing to operate on if no query terms translated. This is certainly true, but this problem is much more likely to arise in the case of short queries, where only a single term may represent a topic and there are few terms in the query. As documents are typically much longer, there is often more redundancy of representation.
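The redundancy argument can be made concrete with a toy calculation. The sketch below is illustrative only and is not part of the reported experiments: it assumes, hypothetically, that each surface form of a concept is covered by the translation resource independently with the same probability, and computes the chance that at least one form is translatable. The function name and the coverage value of 0.5 are assumptions chosen for illustration.

```python
def prob_concept_translatable(coverage: float, n_forms: int) -> float:
    """Chance that at least one of n_forms surface forms is in the term list.

    Toy model (an assumption, not from the paper): each form is covered
    independently with probability `coverage`.
    """
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0 and 1")
    if n_forms < 0:
        raise ValueError("n_forms must be non-negative")
    # Complement of "every one of the n_forms forms is missing".
    return 1.0 - (1.0 - coverage) ** n_forms


if __name__ == "__main__":
    # A short query may express a concept once; a long document several times.
    for n_forms in (1, 2, 4, 8):
        print(n_forms, prob_concept_translatable(0.5, n_forms))
```

Under this simplified model the probability rises quickly with the number of distinct forms, which is the sense in which a long document depends less on pre-translation expansion than a short query does.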
This redundancy is analogous to the observation (Krovetz, 1993) that stemming has less of an impact as documents become longer because a wider variety of surface forms are likely to appear. Thus some translatable form of a concept is more likely to appear in a long document, even without expansion and even with a poor translation resource. As a result, pre-translation expansion may be less crucial for long documents.

6 Conclusion

These factors together explain both the significant improvement for post-translation document expansion that our experiments illustrate, in contrast to the much weaker effects of pre-translation expansion, and also the difference observed between the experimental results reported here and prior work on pre- and post-translation query expansion that has emphasized European language pairs. We have identified a key role for post-translation expansion in CLIR language pairs where trivial cognate matching is not possible, but explicit translation or transliteration is required. We have also identified limitations on pre-translation expansion due to corresponding gaps in segmentation, translation, and transcription resources. We believe that these findings will extend to other CLIR language combinations with comparable characteristics, including many other Asian languages.

References

Lisa Ballesteros and W. Bruce Croft. 1997. Phrasal translation and query expansion techniques for cross-language information retrieval. In Proceedings of the 20th International ACM SIGIR Conference on Research and Development in Information Retrieval, July.

James P. Callan, W. Bruce Croft, and Stephen M. Harding. 1992. The INQUERY retrieval system. In Proceedings of the Third International Conference on Database and Expert Systems Applications, pages 78–83. Springer-Verlag.

Shudong Huang. 1999. Evaluation of LDC's bilingual dictionaries. Unpublished manuscript.

Wanying Jin. 1998. NMSU Chinese segmenter. In First Chinese Language Processing Workshop, Philadelphia.

Robert Krovetz. 1993. Viewing morphology as an inference process. In SIGIR-93, pages 191–202.

Gina-Anne Levow and Douglas W. Oard. 2000. Translingual topic tracking with PRISE. In Working Notes of the Third Topic Detection and Tracking Workshop, February.

Gina-Anne Levow, Douglas W. Oard, and Philip Resnik. Under Review. Dictionary-based techniques for cross-language information retrieval.

Gina-Anne Levow. 2003. Multi-scale document expansion for Mandarin Chinese. In Proceedings of the ISCA Workshop on Multi-lingual Spoken Document Retrieval.

Paul McNamee and James Mayfield. 2002. Comparing cross-language query expansion techniques by degrading translation resources. In Proceedings of the 25th Annual International Conference on Research and Development in Information Retrieval (SIGIR-2002).

Helen Meng, Berlin Chen, Erika Grams, Wai-Kit Lo, Gina-Anne Levow, Douglas Oard, Patrick Schone, Karen Tang, and Jian Qiang Wang. 2001. Mandarin-English Information (MEI): Investigating translingual speech retrieval. In Human Language Technology Conference.

Hinrich Schütze, David A. Hull, and Jan O. Pedersen. 1995. A comparison of classifiers and document representations for the routing problem. In Edward A. Fox, Peter Ingwersen, and Raya Fidel, editors, Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 229–237, July. ftp://parcftp.xerox.com/pub/qca/schuetze.html.

Amit Singhal and Fernando Pereira. 1999. Document expansion for speech retrieval. In Proceedings of the 22nd International Conference on Research and Development in Information Retrieval, pages 34–41, August.

Amit Singhal, John Choi, Donald Hindle, Julia Hirschberg, Fernando Pereira, and Steve Whittaker. 1999. AT&T at TREC-7 SDR Track. In Proceedings of the DARPA Broadcast News Workshop.
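As a supplementary illustration of the term selection step described in Section 3.4, the following minimal sketch shows one way the enrichment could be coded. It is not the implementation used in the experiments. The retrieval step is left out: the token lists of the five top-ranked documents are assumed to be supplied by whatever retrieval system is used. The function name `expand_document`, the IDF formula log(N/df), and the tie-breaking by alphabetical order are assumptions made for this example.

```python
import math
from typing import Dict, List, Sequence


def expand_document(
    doc_tokens: Sequence[str],
    top_docs: Sequence[Sequence[str]],
    doc_freq: Dict[str, int],
    n_docs: int,
    factor: float = 2.0,
) -> List[str]:
    """Append high-IDF terms from top-ranked documents to a document.

    doc_tokens: whitespace-delimited units of the original document.
    top_docs:   token lists of the top-ranked comparable documents
                (obtained elsewhere; retrieval is not modeled here).
    doc_freq:   document frequency of each term in the comparable collection.
    n_docs:     size of the comparable collection.
    factor:     target length relative to the original (2.0 = doubling).
    """
    # Each retrieved document contributes one instance of each of its terms,
    # so a term found in several top documents appears several times.
    candidates: List[str] = []
    for tokens in top_docs:
        candidates.extend(set(tokens))

    def idf(term: str) -> float:
        # Assumed IDF definition; the paper does not state the exact formula.
        return math.log(n_docs / max(doc_freq.get(term, 1), 1))

    # Most selective terms first; ties broken alphabetically for determinism.
    candidates.sort(key=lambda term: (-idf(term), term))

    # Add terms until the document reaches the target length.
    budget = int(round(len(doc_tokens) * (factor - 1.0)))
    return list(doc_tokens) + candidates[:budget]
```

A short usage example with made-up document frequencies: expanding `["russia", "prime", "minister", "visit"]` against three retrieved documents that all mention "yeltsin" adds that term once per document containing it, which is how the procedure raises the weight of terms that are rare in the collection but frequent among related documents.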