A Conditional Random Field for Discriminatively-trained Finite-state String Edit Distance
Andrew McCallum and Kedar Bellare
Department of Computer Science
University of Massachusetts Amherst
Amherst, MA 01003, USA
{mccallum,kedarb}@cs.umass.edu

Fernando Pereira
Department of Computer and Information Science
University of Pennsylvania
Philadelphia, PA 19104, USA
pereira@cis.upenn.edu

Abstract

The need to measure sequence similarity arises in information extraction, object identity, data mining, biological sequence analysis, and other domains. This paper presents discriminative string-edit CRFs, a finite-state conditional random field model for edit sequences between strings. Conditional random fields have advantages over generative approaches to this problem, such as pair HMMs or the work of Ristad and Yianilos, because as conditionally-trained methods, they enable the use of complex, arbitrary actions and features of the input strings. As in generative models, the training data does not have to specify the edit sequences between the given string pairs. Unlike generative models, however, our model is trained on both positive and negative instances of string pairs. We present positive experimental results on several data sets.

1 Introduction

Parameterized string similarity models based on string edits have a long history (Levenshtein, 1966; Needleman & Wunsch, 1970; Sankoff & Kruskal, 1999). However, there are few methods for learning model parameters from training data, even though, as in other tasks, learning may lead to greater accuracy on real-world problems.

Ristad and Yianilos (1998) proposed an expectation-maximization-based method for learning string edit distance with a generative finite-state model. In their approach, training data consists of pairs of strings that should be considered similar, and the parameters are probabilities of certain edit operations. In the E-step, the highest-probability edit sequence is found using the current parameters; in the M-step the probabilities are re-estimated using the expectations determined in the E-step so as to reduce the cost of the edit sequences expected to have caused the match. A useful attribute of this method is that the edit operations and parameters can be associated with states of a finite-state machine (with probabilities of edit operations depending on previous edit operations, as determined by the finite-state structure). However, as a generative model, this model cannot tractably incorporate arbitrary features of the input strings, and it cannot benefit from negative evidence from pairs of strings that (while partially overlapping) should be considered dissimilar.

Bilenko and Mooney (2003) extend Ristad's model to include affine gaps, and also present a learned string similarity measure based on unordered bags of words, with training performed by an SVM. Cohen and Richman (2002) use a conditional maximum entropy classifier to learn weights on several sequence distance features. A survey of string edit distance measures is provided by Cohen et al. (2003). However, none of these methods combine the expressive power of a Markov model of edit operations with discriminative training.

This paper presents an undirected graphical model for string edit distance, and a conditional-probability parameter estimation method that exploits both matching and non-matching sequence pairs. Based on conditional random fields (CRFs), the approach not only provides powerful capabilities long sought in many application domains, but also demonstrates an interesting example of discriminative learning of a probabilistic model involving structured latent variables.

The training data consists of input string pairs, each associated with a binary label indicating whether the pair should be considered a "match" or a "mismatch." Model parameters are estimated from both positive and negative examples, unlike in previous generative models (Ristad & Yianilos, 1998; Bilenko & Mooney, 2003). As in those models, however, it is not necessary to provide the desired edit operations or alignments; the alignments that enable the most accurate discrimination will be discovered automatically through an EM procedure. Thus this model is an example of an interesting class of graphical models that are trained conditionally, but have latent variables, and find the latent-variable parameters that maximize discriminative performance. Another recent example is work on CRFs for object recognition from images (Quattoni et al., 2005).
The model is structured as a finite-state machine (FSM) with a single initial state and two disjoint sets of non-initial states with no transitions between them. State transitions are labeled by edit operations. One of the disjoint sets represents the match condition, the other the mismatch condition. Any non-empty transition path starting at the initial state defines an edit sequence that is wholly contained in either the match or the mismatch subset of the machine. By marginalizing out all the edit sequences in a subset, we obtain the probability of match or mismatch.

The cost of a transition is a function of its edit operation, the previous state, the new state, the two input strings, and the starting and ending position (the position of the match-so-far before and after performing this edit operation) for each of the two input strings. In applications, we take full advantage of this flexibility. For example, the cost function can examine portions of the input strings both before and after the current match position, it can examine domain knowledge, such as lexicons, or it can depend on rich conjunctions of more primitive features.

The flexibility of edit operations is possibly even more valuable. Edits can make arbitrarily-sized forward jumps in both input strings, and the size of the jumps can be conditioned on the input strings, the current match points in each, and the previous state of the finite-state process. For example, a single edit operation could match a three-letter acronym against its expansion in the other string by consuming three capitalized characters in the first string and consuming three matching words in the second string. The cost of such an operation could be conditioned on the previous state of the finite-state process, as well as the appearance of the consumed strings in various lexicons and the words following the acronym.

Inference and training in the model depend on a complex dynamic program in three dimensions. We employ various optimizations to speed learning.

We present experimental results on five standard text data sets, including short strings such as names and addresses, as well as longer, more complex strings, such as bibliographic citations. We show significant error reductions in all but one of the data sets.

2 Discriminatively Trained String Edit Distance

Let x = x1 · · · xm and y = y1 · · · yn be two strings or symbol sequences. This pair of input strings is associated with an output label z ∈ {0, 1} indicating whether or not the strings should be considered a match (1) or a mismatch (0). (One could also straightforwardly imagine a different, regression-based scenario in which z is real-valued, or a ranking-based criterion, in which two pairs are provided and z indicates which pair of strings should be considered closer.) As we now explain, our model scores alignments between x and y as to whether they are a match or a mismatch. An alignment a is a four-tuple consisting of a sequence of edit operations, two sequences of string positions, and a sequence of FSM states.

Let a.e = e1 · · · ek indicate the sequence of edit operations, such as delete-one-character-in-x, substitute-one-character-in-x-for-one-character-in-y, or delete-all-characters-in-x-up-to-its-next-nonalphabetic. Each edit operation ep in the sequence consumes either some of x (deletion), some of y (insertion), or some of both (substitution), up to positions ip in x and jp in y. We therefore have corresponding non-decreasing sequences a.ix = i1, . . . , ik and a.iy = j1, . . . , jk of edit-operation positions for x and y.

To classify alignments into matches or mismatches, we take edits as transition labels for a non-deterministic FSM with state set S = {q0} ∪ S0 ∪ S1. There are transitions from the initial state q0 to states in the disjoint sets S0 and S1, but no transitions between those two sets. In addition to the edit sequence and string position sequences, we associate the alignment a with a sequence of consecutive destination states a.q = q1 · · · qk, where ep labels an allowed transition from qp−1 to qp. By construction, either a.q ⊆ S0 or a.q ⊆ S1. Alignments with states in S1 are supposed to represent matches, while alignments with states in S0 are supposed to represent mismatches.

In summary, an alignment is specified by the four-tuple a = ⟨a.e = e1 · · · ek, a.ix = i1 · · · ik, a.iy = j1 · · · jk, a.q = q1 · · · qk⟩. For convenience, we also write a = a0, a1 · · · ak with ap = ⟨ep, ip, jp, qp⟩, 1 ≤ p ≤ k, and a0 = ⟨−, 0, 0, q0⟩, where − is a dummy initial edit.
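To make the four-tuple representation concrete, the following sketch (in Python; the class and function names are ours, not taken from the paper or from Mallet) shows one way to store an alignment, including the dummy initial step ⟨−, 0, 0, q0⟩ and the non-decreasing position sequences a.ix and a.iy.

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass(frozen=True)
    class AlignmentStep:
        """One element a_p = <e_p, i_p, j_p, q_p> of an alignment."""
        edit: str    # edit-operation label, e.g. "subst" or "skip-if-word-in-lexicon"
        i: int       # position in x consumed up to after this edit (i_p)
        j: int       # position in y consumed up to after this edit (j_p)
        state: str   # destination FSM state q_p, drawn from S0 or S1

    def make_alignment(steps: List[Tuple[str, int, int, str]]) -> List[AlignmentStep]:
        """Prepend the dummy initial step <-, 0, 0, q0> and check that the
        consumed positions are non-decreasing, as required of a.ix and a.iy."""
        alignment = [AlignmentStep("-", 0, 0, "q0")]
        for edit, i, j, state in steps:
            prev = alignment[-1]
            assert i >= prev.i and j >= prev.j, "positions must be non-decreasing"
            alignment.append(AlignmentStep(edit, i, j, state))
        return alignment

    # A toy alignment of x = "Bob" with y = "Bobby": three substitutions of
    # matching characters followed by two insertions into y, all within the
    # match subset S1 (the state name "s1" is illustrative).
    example = make_alignment([
        ("subst", 1, 1, "s1"), ("subst", 2, 2, "s1"), ("subst", 3, 3, "s1"),
        ("insert-y", 3, 4, "s1"), ("insert-y", 3, 5, "s1"),
    ])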
Given two strings x and y, our discriminative string edit CRF defines the probability of an alignment between x and y as

p(a | x, y) = (1 / Z_{x,y}) ∏_{i=1}^{|a|} Φ(a_{i−1}, a_i, x, y),

where the potential function Φ(·) is a non-negative function of its arguments, and Z_{x,y} is the normalizer (partition function). In our experiments we parameterize these potential functions as an exponential of a linear scoring function,

Φ(a_{i−1}, a_i, x, y) = exp(Λ · f(a_{i−1}, a_i, x, y)),

where f is a vector of feature functions, each taking as arguments two consecutive states in the alignment sequence, the corresponding edits, and their string positions, which allow the feature functions to depend on the context of a_i in x and y. A typical feature function combines some predicate on the input, or input feature, with a predicate over the alignment itself (edit operation, states, positions).

To obtain the probability of match given simply the input strings, we marginalize over all alignments in the corresponding state set:

p(z | x, y) = (1 / Z_{x,y}) Σ_{a : a.q ⊆ S_z} ∏_{i=1}^{|a|} Φ(a_{i−1}, a_i, x, y).

Fortunately, this sum can be calculated efficiently by dynamic programming. Typically, for any given edit operation, starting positions and input strings, there are a small number of possible resulting ending positions. Max-product (Viterbi-like) inference can also be performed efficiently.
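The marginalization can be sketched with a small forward (sum-product) recursion. The sketch below is ours, not the paper's implementation: it assumes only the three single-character edits, takes the potential Φ as a user-supplied callable, and omits beam search and the richer edit operations.

    import itertools

    # Single-character edits and how many positions of x and y each consumes.
    EDITS = {"delete-x": (1, 0), "insert-y": (0, 1), "subst": (1, 1)}

    def match_probability(x, y, state_subset, transitions, potential):
        """Forward computation of p(z = 1 | x, y).

        state_subset: dict mapping each non-initial state to its subset, 0 or 1
        transitions:  iterable of allowed (prev_state, edit, next_state) triples,
                      where prev_state may be the initial state "q0"
        potential:    callable Phi(prev_state, next_state, edit, i, j, x, y) >= 0
        """
        m, n = len(x), len(y)
        # alpha[(i, j, q)]: total weight of alignment prefixes that have consumed
        # x[:i] and y[:j] and currently sit in FSM state q.
        alpha = {(0, 0, "q0"): 1.0}
        # Visit cells in order of increasing i + j, so every predecessor cell is
        # complete before its weight is pushed forward.
        for i, j in sorted(itertools.product(range(m + 1), range(n + 1)),
                           key=lambda ij: ij[0] + ij[1]):
            for prev, edit, nxt in transitions:
                w = alpha.get((i, j, prev), 0.0)
                di, dj = EDITS[edit]
                if w == 0.0 or i + di > m or j + dj > n:
                    continue
                key = (i + di, j + dj, nxt)
                alpha[key] = alpha.get(key, 0.0) + w * potential(prev, nxt, edit, i, j, x, y)
        # Complete alignments consume both strings entirely and lie wholly in
        # S1 (match) or S0 (mismatch); normalizing their totals gives p(z|x, y).
        totals = {0: 0.0, 1: 0.0}
        for q, z in state_subset.items():
            totals[z] += alpha.get((m, n, q), 0.0)
        denom = totals[0] + totals[1]
        return totals[1] / denom if denom > 0 else 0.5

Replacing the sum of incoming weights with a max yields the Viterbi (max-product) variant mentioned above.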
3 Parameter Estimation

Parameters are estimated by penalized maximum likelihood on a set of training data. Training data consists of a set of N string pairs ⟨x^(j), y^(j)⟩ with corresponding labels z^(j) ∈ {0, 1}, indicating whether or not the pair is a match. We use a zero-mean spherical Gaussian prior, Σ_k λ_k^2 / σ^2, for penalization.

The incomplete (non-penalized) log-likelihood is then

L_I = Σ_j log p(z^(j) | x^(j), y^(j))

and the complete log-likelihood is

L_C = Σ_j Σ_a log( p(z^(j) | a, x^(j), y^(j)) p(a | x^(j), y^(j)) ).

We maximize this likelihood with EM, estimating p(a | x^(j), y^(j)) given the current parameters Λ in the E-step, and maximizing the complete penalized log-likelihood in the M-step. For optimization in the M-step we use BFGS. Unlike CRFs without latent variables, the objective function has local maxima. To avoid getting stuck in poor local maxima, the parameters are initialized to yield a reasonable default edit distance.

Dynamic programming for this model fills a three-dimensional table (two dimensions for the two input strings, and one for the states in S). The table can be moderately large in practice (n = m = 100 and |S| = 12 results in 120,000 entries), and beam search may effectively be used to increase speed, just as in speech recognition, where even larger tables are common.

It is interesting to examine what alignments will be learned in S0, the non-match portion of the model. To attain high accuracy, these states should attract string pairs that are dissimilar. But even similar strings have bad alignments, for example the alignment that first deletes all of x and then inserts all of y. Fortunately, finding how dissimilar two strings are requires finding as good an alignment as is possible, and then deciding that this alignment is not very good. These as-good-as-possible alignments are exactly what our learning procedure discovers: driven by an objective function that aims to maximize the likelihood of the correct binary match/non-match labels, the model finds the latent alignment paths that enable it to maximize this likelihood.

This model thus falls into a family of interesting techniques involving discrimination among complex structured objects, in which the structure or relationship among the parts is unknown (latent), and the latent choice has high impact on the discrimination task. Similar considerations are at the core of discriminative non-probabilistic methods for structured problems such as handwriting recognition (LeCun et al., 1998) and speech recognition (Woodland & Povey, 2002), and, more recently, computer vision object recognition (Quattoni et al., 2005). We discuss related work further in Section 6.

4 Implementation

The model has been implemented as part of the finite-state transducer classes in Mallet (McCallum, 2002). We map three-dimensional dynamic programming problems over positions in x and y and states S to Mallet's existing finite-state forward-backward and Viterbi implementations by encoding the two position indices into a single index in a diagonal crossing pattern that starts at (0, 0). For example, a single-character delete operation, which would be a hop to an adjacent cell vertically or horizontally in the original table, becomes a longer, one-dimensional (but deterministically calculated) jump in the encoding.

In addition to the standard edit operations (insertion, deletion, substitution), we also have more powerful edits that fit naturally into this model, such as delete-until-end-of-word, delete-word-in-lexicon, and delete-word-appearing-in-other-string.
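Mallet's actual encoding is not spelled out in the paper, so the following sketch only illustrates the kind of diagonal crossing pattern described above: cells are enumerated along anti-diagonals of constant i + j starting at (0, 0), which guarantees that every edit, since it advances i + j, maps to a strictly larger one-dimensional index.

    def diagonal_order(m, n):
        """Enumerate all cells (i, j) of an (m+1) x (n+1) table along
        anti-diagonals of constant i + j, starting at (0, 0)."""
        for d in range(m + n + 1):
            for i in range(max(0, d - n), min(d, m) + 1):
                yield (i, d - i)

    def build_encoding(m, n):
        """Return the map from each (i, j) cell to its single linear index."""
        return {cell: k for k, cell in enumerate(diagonal_order(m, n))}

    # In a 5 x 5 table, every single-character edit out of cell (2, 2) lands at
    # a strictly larger linear index, so the two-dimensional recursion can be
    # run as one left-to-right pass over the encoded sequence.
    enc = build_encoding(4, 4)
    src = enc[(2, 2)]
    for di, dj, name in [(1, 0, "delete-x"), (0, 1, "insert-y"), (1, 1, "subst")]:
        dst = enc[(2 + di, 2 + dj)]
        assert dst > src
        print(f"{name}: (2, 2) -> ({2 + di}, {2 + dj}) is index {src} -> {dst}")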
5 Experimental Results

We show experimental results on one synthetic and six real-world data sets, all of which have been used in previous work evaluating string edit measures. The first two data sets are the name and address fields of the Restaurant database. Among its 864 records, 112 are matches. The last four data sets are citation strings from the standard Reasoning, Constraint, Reinforcement and Face sections of the CiteSeer data. The ratios of citations to unique papers for these are 514/196, 349/242, 406/148 and 295/199 respectively. Making the problem more challenging than certain other evaluations on these data sets, our strings are not segmented into fields such as title or author, but are each treated as a single unsegmented character sequence. We also present results on synthetic noise on person names, generated by the UIS Database generator. This program produces perturbed names according to modifiable noise parameters, including the probability of an error anywhere in a record, the probability of single-character insertion, deletion or swap, and the probability of a word swap.

5.1 Edit Operations and Features

One of the main advantages of our model is the ability to include non-independent input features and extremely flexible edit operations. The input features used in our experiments include subsets of the following, described as acting on cell i, j in the dynamic programming table and the two input strings x and y.

• same, different: xi and yj match (do not match);
• same-alphabetic, different-alphabetic: xi and yj are alphabetic and they match (do not match);
• same-numeric, different-numeric: xi and yj are numeric and they match (do not match);
• punctuation-x, punctuation-y: xi (respectively yj) is punctuation;
• alphabet-mismatch, number-mismatch: one of xi and yj is alphabetic (numeric), the other is not;
• end-of-x, end-of-y: i = |x| (j = |y|);
• same-next-character, different-next-character: xi+1 and yj+1 match (do not match).

Edit operations on FSM transitions include:

• Standard string edit operations: insert, delete and substitute.
• Two-character operations: swap-two-characters.
• Word skip operations: skip-if-word-in-lexicon, skip-word-if-present-in-other-string, skip-parenthesized-words and skip-any-word.
• Operations for handling acronyms and abbreviations by inserting, deleting, or substituting specific types of substrings.

Learned parameters are associated with the input features as well as with state transitions in the FSM. All transitions entering a state may share tied parameters (first order), or have different parameters (second order). Since the FSM can have more states than edit operations, it can remember the context of previous edit actions.
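As an illustration of how cell-level input predicates like those above can be conjoined with alignment predicates (this is our sketch, not Mallet's feature API), consider:

    def input_predicates(x, y, i, j):
        """A few of the cell-level input predicates listed above, evaluated at
        cell (i, j); positions are 1-based as in the text, so x_i is x[i-1]."""
        xi = x[i - 1] if 0 < i <= len(x) else ""
        yj = y[j - 1] if 0 < j <= len(y) else ""
        return {
            "same": bool(xi) and xi == yj,
            "different": bool(xi) and bool(yj) and xi != yj,
            "same-alphabetic": xi.isalpha() and xi == yj,
            "same-numeric": xi.isdigit() and xi == yj,
            "punctuation-x": bool(xi) and not xi.isalnum() and not xi.isspace(),
            "end-of-x": i == len(x),
            "end-of-y": j == len(y),
        }

    def fire_features(state, edit, i, j, x, y):
        """Conjoin each active input predicate with the edit operation and the
        destination state; training estimates one weight per such conjunction."""
        for name, active in input_predicates(x, y, i, j).items():
            if active:
                yield f"{name}&edit={edit}&state={state}"

    # Features firing for a substitution at cell (3, 3) of "Bob's" vs. "Bobby",
    # staying in an (illustrative) match state s1.
    print(sorted(fire_features("s1", "subst", 3, 3, "Bob's", "Bobby")))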
5.2 Experimental Methodology

Our model exploits both positive and negative examples during training. Positive training examples include all pairs of strings referring to the same object (the matching strings). However, the total number of negative examples is quadratic in the number of objects. Due to both time and memory constraints, as well as a desire to avoid overwhelming the positive training examples, we sample the negative (mismatch) string pairs so as to attain a 1:10 ratio of match to mismatch pairs. In order to preferentially sample "near misses" we filter negative examples in one of two ways (see the sketch below):

• Remove negative examples that are too dissimilar according to a suitable metric. For the CiteSeer datasets we use the cosine metric to measure the similarity of two citations; for the other datasets we use the metric of Jaro (1989).
• Select the best-matching negative pairs according to a CRF with parameters set by hand to reasonable values.

As in Bilenko and Mooney (2003), we use a 50/50 train/test split of the data, and repeat the process with the folds interchanged. With the restaurant name and restaurant address datasets, we run our algorithm with different choices of features and states, and 4 random splits of the data. With the CiteSeer datasets, we have results for two random splits of the data.

To give EM training a reasonable starting point, we hand-set the initial parameters to somewhat arbitrary, yet reasonable, values. (Of course, hand-setting of string edit parameters is the standard for all the non-learning approaches.) We examined a small held-out set of data to verify that these initial parameters were reasonable. We set the parameters on the match portion of the FSM to provide good alignments; we then copy these parameters to the mismatch portion of the model, offsetting them by bringing all values closer to zero by a small constant.
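The negative-pair sampling described above can be sketched as follows; the similarity function (cosine over citations or the Jaro metric, depending on the data set) is passed in, and the names, the 0.3 cutoff, and the toy token-overlap similarity are our illustrative choices rather than the authors' code.

    import itertools
    import random

    def sample_training_pairs(records, same_object, similarity,
                              neg_per_pos=10, min_similarity=0.3, seed=0):
        """Build training pairs: every matching pair as a positive, plus a
        sample of 'near miss' non-matching pairs at roughly neg_per_pos
        negatives per positive, keeping only negatives above a similarity
        cutoff so that very dissimilar pairs are filtered out."""
        positives, candidates = [], []
        for a, b in itertools.combinations(records, 2):
            if same_object(a, b):
                positives.append((a, b, 1))
            elif similarity(a, b) >= min_similarity:
                candidates.append((a, b, 0))
        rng = random.Random(seed)
        k = min(len(candidates), neg_per_pos * len(positives))
        return positives + rng.sample(candidates, k)

    # Toy similarity: fraction of shared lowercase tokens, standing in for the
    # cosine or Jaro metrics used in the experiments.
    def token_overlap(a, b):
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / max(1, len(ta | tb))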
Lexicons were populated automatically by gathering the most frequent words in the training set. (Alternatively, one could imagine lexicon feature values set to inverse-document-frequency values, or similar information retrieval metrics.) In some cases, before training, lexicons were edited to remove author surnames.

The equations in Section 3 are used to calculate p(z|x, y), with a first-order model. A threshold of 0.5 predicts whether the string pair is a match or a mismatch. (Note that alternative thresholds could easily be used to trade off precision and recall, and that CRFs are typically good at predicting the calibrated posterior probabilities needed for such tuning as well as for accuracy/coverage curves.) Bilenko and Mooney (2003) found transitive closure to improve F1, and use it for their results; we did not find it to help, and do not use it. Precision is calculated as the ratio of the number of correctly classified duplicates to the total number of duplicates identified. Recall is the ratio of correctly classified duplicates to the total number of duplicates in the dataset. We report the mean performance across multiple random splits.
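Concretely, the 0.5 decision threshold and the precision and recall definitions above amount to the following small computation (a sketch with our own variable names):

    def evaluate(scores, labels, threshold=0.5):
        """scores: p(z = 1 | x, y) for each test pair; labels: 1 for true
        duplicates.  Returns precision, recall and F1 at the fixed threshold."""
        predicted = [s >= threshold for s in scores]
        true_pos = sum(1 for p, l in zip(predicted, labels) if p and l == 1)
        identified = sum(predicted)                  # pairs predicted as duplicates
        actual = sum(1 for l in labels if l == 1)    # duplicates in the data set
        precision = true_pos / identified if identified else 0.0
        recall = true_pos / actual if actual else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    # Example: evaluate([0.9, 0.4, 0.7, 0.2], [1, 1, 0, 0]) returns (0.5, 0.5, 0.5).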
5.3 Results

In experiments on the six real-world data sets we compare our performance against results in a recent benchmark paper by Bilenko and Mooney (2003); Bilenko recently completed thesis work in this area. These results are summarized in Table 1, where the top four rows are duplicated from Bilenko and Mooney (2003), and the bottom row shows the results of our method. The entries are the average F1 measure across the folds. We observe large performance improvements on most datasets. The fact that the difference in performance across our trials is typically around 0.01 suggests strong statistical significance. Our average F1 on the Face dataset was 0.04 less than the previous best. The examples on which we made errors generally had a large venue, authors, or URL field in one string but not in the other.

Distance Metric        Restaurant name  Restaurant address  Reasoning  Face   Reinforcement  Constraint
Edit Distance          0.290            0.686               0.927      0.952  0.893          0.924
Learned Edit Distance  0.354            0.712               0.938      0.966  0.907          0.941
Vector-space           0.365            0.380               0.897      0.922  0.903          0.923
Learned Vector-space   0.433            0.532               0.924      0.875  0.808          0.913
CRF Edit Distance      0.448            0.783               0.964      0.918  0.917          0.976

Table 1: Averaged F-measure for detecting matching field values on several standard data sets (bold in the original indicates the highest F1 in each column). The top four rows are results duplicated from Bilenko and Mooney (2003); the bottom row is the performance of the CRF method introduced in this paper.

We also evaluate the effect on performance of using Viterbi (max-product) inference in training instead of forward-backward (sum-product) inference. Except for the restaurant address dataset, forward-backward performs significantly better than Viterbi on all datasets. The restaurant address data set contains positive examples with a large unmatched suffix in one of the strings, which may lead to an inappropriate dilution of probability amongst many alignments. Average F1 measures for the restaurant datasets using Viterbi and forward-backward are shown in Table 2. All results shown in Table 1 use forward-backward probabilities.

Dataset             Viterbi  Forward-Backward
Restaurant name     0.689    0.720
Restaurant address  0.708    0.651

Table 2: Averaged F-measures for Viterbi vs. forward-backward inference on the restaurant data sets (trained and evaluated on a subset of the data; the smaller test set yields higher accuracy).
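The comparison in Table 2 is between two ways of aggregating alignment weights during training; as a purely conceptual illustration of how sum-product and max-product aggregation can disagree (this toy decision rule is ours, not the training procedure itself), consider:

    def classify(alignment_weights, rule="sum"):
        """alignment_weights maps each label z in {0, 1} to the non-negative
        weights of the alignments lying in that subset of the FSM.  'sum'
        aggregates every alignment (forward-backward style); 'max' keeps only
        the single best alignment (Viterbi style)."""
        agg = sum if rule == "sum" else max
        score = {z: (agg(ws) if ws else 0.0) for z, ws in alignment_weights.items()}
        total = score[0] + score[1]
        return 1 if total > 0 and score[1] / total >= 0.5 else 0

    # Many moderately good match alignments outweigh one strong mismatch
    # alignment under the sum rule, but not under the max rule.
    weights = {1: [0.3, 0.3, 0.3], 0: [0.5]}
    print(classify(weights, "sum"), classify(weights, "max"))   # prints: 1 0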
In the other tables we present results showing the impact of various edit operations and features.

Table 3 shows F1 on the restaurant name data set as various edit operations are added to the model: i denotes insert, d denotes delete, s denotes substitute, paren denotes skip-parenthesized-word, lex denotes skip-if-word-in-lexicon, and pres denotes skip-word-if-present-in-other-string. All runs use the same-alphabetic and different-alphabetic input features. As can be seen from the results, adding "skip" edits improves performance. Although skip-parenthesized-words gives better results on the smaller data set used for the experiments in the table, skip-if-word-in-lexicon produces higher accuracy on larger data sets, because of peculiarities in how restaurants with the same name and different locations are named in the data set. We also see that a second-order model performs less well, presumably because of data sparseness.

Run                                   F1
i, d, s                               0.701
i, d, s, paren                        0.835
i, d, s, lex                          0.769
i, d, s, lex, 2nd order               0.742
i, d, s, paren, lex, pres             0.746
i, d, s, paren, lex, pres, 2nd order  0.699

Table 3: Averaged maximum F-measure for different state combinations on a subset of restaurant name (trained and evaluated on the same train/test split).

Table 4 shows the benefits of including various features for the restaurant address data set, while fixing the edit operations (insert, delete and substitute). In the table, s and d denote the same and different features, salp and dalp stand for the same-alphabetic and different-alphabetic features, and snum and dnum stand for the same-numeric and different-numeric features. The s and d features differ from the salp, dalp, snum, and dnum features in that the weights learned for the former depend only on whether the two characters are equal or not, with no separate weights learned for a number match or a letter match. We conjecture that a number mismatch in the address data needs to be penalized more than a letter mismatch. Separating the same and different features into features for letters and numbers reduces the error from about 6% to 3%.

Run                     F1
s, d                    0.944
salp, dalp, snum, dnum  0.973

Table 4: Averaged maximum F1-measure for different feature combinations on a subset of the restaurant address data set.

Finally, Table 5 demonstrates the power of CRFs to include extremely flexible edit operations that examine arbitrary pieces of the two input strings. In particular, we measure the impact of including the skip-word-if-present-in-other-string operation ("skip" for short). Here we train and test on the UIS synthetic name data, in which the error probability is 40%, the typo error probability is 40%, and the swap-first-and-last-name probability is 50% (the rest of the parameters were unchanged from their default values). The difference in performance is dramatic, bringing error down from about 14% to less than 2%. Of course, arbitrary substring swaps are not expressible in standard dynamic programs, but the skip operation gives an excellent approximation while preserving efficient finite-state inference.

Run           F1
Without skip  0.856
With skip     0.981

Table 5: Average maximum F-measure for the synthetic name dataset with and without the skip-if-present-in-other-string state.
Typical improved alignments with the new operation may skip over a matching swapped first name, and then proceed to correct individual typographic errors in the last name.

An example alignment found by our model on restaurant name is shown in Table 7. As discussed in Section 3, the mismatch portion of the model indeed learns the best possible latent alignments in order to measure distance with the most salient features. This example's alignment score from the match portion is higher. The entries in the dynamic programming table i, d, s, l, and p correspond to states reached by the operations insert, delete, substitute, skip-word-in-lexicon, and skip-parenthesized-word respectively. The symbol - denotes a null transition.

[Tables 6 and 7 (character-level alignment grids for a restaurant-name example involving "katzu") are not reproduced here.]
Table 6: Alignment in both the match and mismatch subsets of the model, with correct prediction. Operations causing edits are in bold.
Table 7: Alignment in both the match and mismatch subsets of the model, with correct prediction. Operations causing edits are in bold.

6 Related Work

String (dis)similarity metrics based on edit distance are widely used in applications ranging from approximate matching and duplicate removal in database records to identifying conserved regions in comparative genomics. Levenshtein (1966) introduced least-cost editing based on independent symbol insertion, deletion, and substitution costs, and Needleman and Wunsch (1970) extended the method to allow gaps. Editing between strings over the same alphabet can be generalized to transduction between strings in different alphabets, for instance in letter-to-sound mappings (Riley & Ljolje, 1996) and in speech recognition (Jelinek et al., 1975).

In most applications, the edit distance model is derived by heuristic means, possibly including some data-dependent tuning of parameters. For example, Monge and Elkan (1997) recognize duplicate corrupted records using an edit distance with tunable edit and gap costs.
Hernandez and Stolfo (1995) merge records in large databases using rules based on domain-specific edit distances for duplicate detection. Cohen (2000) uses a token-based TF-IDF string similarity score to compute ranked approximate joins on tables derived from Web pages. Koh et al. (2004) use association rule mining to check for duplicate records with per-field exact, Levenshtein, or BLAST 2 gapped alignment (Altschul et al., 1997) matching. Cohen et al. (2003) survey edit and common-substring similarity metrics for name and record matching, and their application in various duplicate detection tasks.

In bioinformatics, sequence alignment with edit costs based on evolutionary or biochemical estimates is common (Durbin et al., 1998). Position-independent costs are normally used for general sequence similarity search, but position-dependent costs are often used when searching for specific sequence motifs.

In basic edit distance, the cost of individual edit operations is independent of the string context. However, applications often require edit costs to change depending on context. For instance, the characters in an author's first name after the first character are more likely to be deleted than the first character. Instead of specialized representations and dynamic programming algorithms, we can represent context-dependent editing with weighted finite-state transducers (Eilenberg, 1974; Mohri et al., 2000) whose states represent different types of editing contexts. The same idea has also been expressed with pair hidden Markov models for pairwise biological sequence alignment (Durbin et al., 1998).

If edit costs are identified with −log probabilities (up to normalization), edit distance models and certain weighted transducers can be interpreted as generative models for pairs of sequences. Pair HMMs are such generative models by definition. Therefore, expectation-maximization using an appropriate version of the forward-backward algorithm can be used to learn parameters that maximize the likelihood of a given training set of pairs of strings according to the generative model (Ristad & Yianilos, 1998; Ristad & Yianilos, 1996; Durbin et al., 1998). Bilenko and Mooney (2003) use EM to train the probabilities in a simple edit transducer for one of the duplicate detection measures they evaluate. Eisner (2002) gives a general algorithm for learning weights for transducers, and notes that the approach applies to transducers with transition scores given by globally normalized log-linear models. These models are to CRFs as pair HMMs are to HMMs.

The foregoing methods for training edit transducers or pair HMMs use positive examples alone, but do not need to be given explicit alignments because they do EM with alignment as a latent (structured) variable. Joachims (2003) gives a generic maximum-margin method for learning to score alignments from positive and negative examples, but the training examples must include the actual alignments. In addition, he cannot solve the problem exactly because he does not exploit factorizations of the problem that yield a polynomial number of constraints and efficient dynamic programming search over alignments.

While the basic models and algorithms are expressed in terms of single-letter edits, in practice it is convenient to use a richer, application-specific set of edit operations, for example name abbreviation. Brill and Moore (2000) use edit operations designed for spelling correction in a model trained by EM, and Tejada et al. (2001) use edit operations such as abbreviation and acronym for record linkage.

7 Conclusions

We have presented a new discriminative model for learning finite-state edit distance from positive and negative examples consisting of matching and non-matching strings. It is not necessary to provide sequence alignments during training. Experimental results show the method to outperform previous approaches.

The model is an interesting member of a family of models that use a discriminative objective function to discover latent structure. The latent edit-operation sequences that are learned by EM are indeed the alignments that help discriminate matching from non-matching strings.

We have described in some detail the finite-state version of this model. A context-free grammar version of the model could, through edit operations defined on trees, handle swaps of arbitrarily-sized substrings.

Acknowledgments

We thank Charles Sutton and Xuerui Wang for useful conversations, and Mikhail Bilenko for helpful comments on a previous draft. This work was supported in part by the Center for Intelligent Information Retrieval, the National Science Foundation under NSF grants #IIS-0326249, #IIS-0427594, and #IIS-0428193, and by the Defense Advanced Research Projects Agency, through the Department of the Interior, NBC, Acquisition Services Division, under contract #NBCHD030010.
References

Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25.

Bilenko, M., & Mooney, R. J. (2003). Adaptive duplicate detection using learnable string similarity measures. Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) (pp. 39–48). Washington, DC.

Brill, E., & Moore, R. C. (2000). An improved error model for noisy channel spelling correction. Proceedings of the 38th Annual Meeting of the ACL.

Cohen, W. W. (2000). Data integration using similarity joins and a word-based information representation language. ACM Transactions on Information Systems, 18, 288–321.

Cohen, W. W., Ravikumar, P., & Fienberg, S. (2003). A comparison of string metrics for matching names and records. KDD Workshop on Data Cleaning and Object Consolidation.

Cohen, W. W., & Richman, J. (2002). Learning to match and cluster large high-dimensional data sets for data integration. KDD (pp. 475–480). ACM.

Durbin, R., Eddy, S., Krogh, A., & Mitchison, G. (1998). Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press.

Eilenberg, S. (1974). Automata, languages and machines, vol. A. Academic Press.

Eisner, J. (2002). Parameter estimation for probabilistic finite-state transducers. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.

Hernandez, M. A., & Stolfo, S. J. (1995). The merge/purge problem for large databases. Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data (SIGMOD-95) (pp. 127–138). San Jose, CA.

Jaro, M. A. (1989). Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84, 414–420.

Jelinek, F., Bahl, L. R., & Mercer, R. L. (1975). The design of a linguistic statistical decoder for the recognition of continuous speech. IEEE Transactions on Information Theory, 3, 250–256.

Joachims, T. (2003). Learning to align sequences: A maximum-margin approach (Technical Report). Department of Computer Science, Cornell University.

Koh, J. L. Y., Lee, M. L., Khan, A. M., Tan, P. T. J., & Brusic, V. (2004). Duplicate detection in biological data using association rule mining. Proceedings of the Second European Workshop on Data Mining and Text Mining in Bioinformatics.

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86, 2278–2324.

Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10, 707–710.

McCallum, A. K. (2002). Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu.

Mohri, M., Pereira, F., & Riley, M. (2000). The design principles of a weighted finite-state transducer library. Theoretical Computer Science, 231, 17–32.

Monge, A. E., & Elkan, C. (1997). An efficient domain-independent algorithm for detecting approximately duplicate database records. DMKD. Tucson, Arizona.

Needleman, S. B., & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48, 443–453.

Quattoni, A., Collins, M., & Darrell, T. (2005). Conditional random fields for object recognition. In L. K. Saul, Y. Weiss and L. Bottou (Eds.), Advances in Neural Information Processing Systems 17 (pp. 1097–1104). Cambridge, MA: MIT Press.

Riley, M. D., & Ljolje, A. (1996). Automatic generation of detailed pronunciation lexicons. In C. H. Lee, F. K. Soong and K. K. Paliwal (Eds.), Automatic speech and speaker recognition: Advanced topics, chapter 12. Boston: Kluwer Academic.

Ristad, E. S., & Yianilos, P. N. (1996). Finite growth models (Technical Report TR-533-96). Department of Computer Science, Princeton University.

Ristad, E. S., & Yianilos, P. N. (1998). Learning string edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 522–532.

Sankoff, D., & Kruskal, J. (Eds.). (1999). Time warps, string edits, and macromolecules. Stanford, California: CSLI Publications. Reissue edition; originally published by Addison-Wesley, 1983.

Tejada, S., Knoblock, C. A., & Minton, S. (2001). Learning object identification rules for information integration. Information Systems, 26, 607–633.

Woodland, P. C., & Povey, D. (2002). Large scale discriminative training of hidden Markov models for speech recognition. Computer Speech and Language, 16, 25–47.