1. fsts (which modify the target string produced) are quite feasible. augment the fsa ff to emit not just features but also new target words. composition or intersection is no problem (we can always bunch everything into a single FSA/FST lazily by wrapping; see the sketch after this list)
2. sparse fsas (where many transitions have -inf score) aren't efficiently supported at present (they are in earley_composer, where the fsa is a phrase table); the fsa ff interface doesn't even provide a way to query only the non-zero (finite-score) transitions (you have to emit a -inf feature). if sparse fsas were expected often and we wanted exact search, then an index of the tcfg similar to the one in earley_composer would make sense. however, with probabilistic beam search we prune out bad-scoring stuff anyway
3. binarization of the rhs isn't usually considered necessary in earley, but i liked the idea of optimal binarization making the most sharing possible. however, this means that what would have been just a scan is now a scan+complete.
4. prefix (forward) + inside cost. as phrased, the prefix cost includes the inside cost. but there's no reason not to think of it in exclusive terms (outside, inside), where prefix = outside*inside under viterbi. on the other hand, we usually use outside*inside as the beam score anyway. furthermore, it looks like when summing over all derivations there would be some difficulty calculating this, since the total inside wouldn't be known at first:
   predict: (a,i), rule r => (+= a*r, r) would become (o,i), rule r => (+= (o*i*r)/r, r) = (+= o*i, r)
   complete: (_,b''), (a,b) => (+= a*b'', += b*b'') would become (_,b''), (o,b) => (?????)
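a minimal sketch of item 1's "bunch into a single FSA/FST lazily by wrapping", assuming a made-up deterministic step interface (f.step(state, sym) -> (next, cost, emitted_words) for the fst, g.step(state, word) -> (next, cost) for the scoring fsa) - this is not the actual fsa ff api:

class LazyComposition:
    def __init__(self, f, g):
        self.f, self.g = f, g
        self.start = (f.start, g.start)

    def step(self, state, sym):
        sf, sg = state
        nf, cost, out = self.f.step(sf, sym)  # f consumes sym, may emit new target words
        for w in out:                         # run f's output through the scoring fsa g
            sg, c = self.g.step(sg, w)
            cost += c
        return (nf, sg), cost, out            # the pair behaves like a single machine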
====
the target CFG (tcfg) is finite - absolutely no cycles. conceptually we're intersecting it with a wfsa (whose weights are feature vectors), which is a lot like parsing a lattice, in that the intersection's states are (source state, dest state) pairs and you're covering some string(s) that go from source->dest in the wfsa.
Chris' paper http://www.ling.umd.edu/~redpony/forest-reordering.pdf apparently (figure 5) already contains the exact concept we're going for, albeit with only inside scores. http://www.speech.sri.com/cgi-bin/run-distill?ftp:papers/stolcke-cl95.ps.gz describes a nice way of computing sums over derivations given a string by keeping a tuple of ("forward","inner") scores while Earley parsing. I'm not sure yet if this is applicable, because we'll already have the exact outside scores from the -LM forest and plan on using cost pushing toward the top, so we don't have to consider them explicitly.
normally in earley, one word is consumed at a time, left to right. completions happen from shortest to longest, then (repeated) predictions, and finally scans. i'm sure this has the usual obvious extension to parsing lattices (proceed in some topological order).
but because the wfsa (ngram lm) has cycles and (at some point) forgets the length of the string, this is slightly more complicated than lattice-parsing the tcfg: there's no topological order over the wfsa states, so you can't finish all the items [x,j] for j from left->right. best-first with monotonic total scores (admissible heuristics) is an easy way to avoid generating the whole space; otherwise i can't think of a fixed order that would allow for alternative beaming. as we discussed, arbitrary predicates filtering candidate items can be added if exact best-first is too slow.
if the wfsa were just a single string t[0...n-1], then any item [i,j] X -> a.b would mean that there's some derivation in the tcfg of S =>* t[0...i-1] X c => t[0...i-1] a b c =>* t[0...j-1] b c. for a wfsa, the analog is S =>* W(0,i) X c => W(0,i) a b c =>* W(0,i) W(i,j) b c, where W(a,b) means any string in the wfsa's language with a as the start state and b as the final state.
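to make the analogy concrete, here's what an item's name looks like once string positions are replaced by wfsa states (illustrative python only, not the real data structure):

from dataclasses import dataclass

@dataclass(frozen=True)
class Item:
    lhs: str     # X
    rhs: tuple   # symbols of the rule X -> a b
    dot: int     # a = rhs[:dot] is covered, b = rhs[dot:] remains
    src: int     # wfsa state where X's yield starts (the index i above)
    dst: int     # wfsa state reached so far (the index j above)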
http://www.isi.edu/natural-language/teaching/cs544/cs544-huang-3-Earley.pdf
http://www.isi.edu/~lhuang/dp-final.pdf (variation on stolcke 1995 prefix cost)
http://acl.ldc.upenn.edu/P/P07/P07-1019.pdf - phrase based lazy priority queue "cube growing" descendants (p149)
http://www.speech.sri.com/cgi-bin/run-distill?ftp:papers/stolcke-cl95.ps.gz
http://www.icsi.berkeley.edu/~stolcke/papers/cl95/node10.html#SECTION00042000000000000000
a) An (unconstrained) Earley path, or simply path, is a sequence of Earley
states linked by prediction, scanning, or completion. For the purpose of
this definition, we allow scanning to operate in “generation mode,” i.e., all
states with terminals to the right of the dot can be scanned, not just those
matching the input. (For completed states, the predecessor state is defined
to be the complete state from the same state set contributing to the
completion.)
b) A path is said to be constrained by, or generate a string x if the terminals
immediately to the left of the dot in all scanned states, in sequence, form
the string x.
c) A path is complete if the last state on it matches the first, except that the
dot has moved to the end of the RHS.
d) We say that a path starts with nonterminal X if the first state on it is a
predicted state with X on the LHS.
e) The length of a path is defined as the number of scanned states on it.
Note that the definition of path length is somewhat counter-intuitive, but is motivated
by the fact that only scanned states correspond directly to input symbols. Thus,
the length of a path is always the same as the length of the input string it generates.
A constrained path starting with the initial state contains a sequence of states from
state set 0 derived by repeated prediction, followed by a single state from set 1 produced
by scanning the first symbol, followed by a sequence of states produced by completion,
followed by a sequence of predicted states, followed by a state scanning the second
symbol, and so on. The significance of Earley paths is that they are in a one-to-one
correspondence with left-most derivations.
=============
a) The forward probability alpha_i(X[k]->x.y) is the sum of the probabilities of all
constrained paths of length i that end in state X[k,i]->x.y
b) The inner probability beta_i(X[k]->x.y) is the sum of the probabilities of all
paths of length i-k that start in state X[k,k]->.xy and end in X[k,i]->x.y, and generate the input symbols x[k,...,i-1]
(forward,inner) [i.e. (outside,inside)?]
unchanged by scan (rule cost is paid up front when predicting)
if X[k,i] -> x.Yz (a,b) and rule Y -> r (p)
then Y[i,i] -> .r (a',p) with a' += a*p
if Y[j,i]->y. (a'',b'') and X[k,j]->r.Ys (a,b)
then X[k,i]->rY.s (a',b') with a' += a*b'', b' += b*b''
(this is summing over all derivations)
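the same three updates written out as literal chart operations, as a sanity check. chart maps an item (lhs, rhs, dot, k, i) to [forward, inner]; the scheduling described above (complete shortest-first, iterate predictions, then scan) and stolcke's correction for left-recursive/unit cycles are deliberately omitted:

from collections import defaultdict

def new_chart():
    # item (lhs, rhs, dot, k, i) -> [forward, inner]
    return defaultdict(lambda: [0.0, 0.0])

def predict(chart, item, rule_rhs, p):
    lhs, rhs, dot, k, i = item             # X[k,i] -> x.Yz with (a,b)
    a, b = chart[item]
    tgt = (rhs[dot], rule_rhs, 0, i, i)    # Y[i,i] -> .r
    chart[tgt][0] += a * p                 # a' += a*p
    chart[tgt][1] = p                      # inner = p: rule cost paid at predict

def scan(chart, item, word):
    lhs, rhs, dot, k, i = item
    if dot < len(rhs) and rhs[dot] == word:
        a, b = chart[item]
        tgt = (lhs, rhs, dot + 1, k, i + 1)
        chart[tgt][0] += a                 # both scores pass through unchanged
        chart[tgt][1] += b

def complete(chart, done, waiting):
    _, _, _, j, i = done                   # Y[j,i] -> y. with (a'',b'')
    lhs, rhs, dot, k, _j = waiting         # X[k,j] -> r.Ys with (a,b)
    a2, b2 = chart[done]
    a, b = chart[waiting]
    tgt = (lhs, rhs, dot + 1, k, i)        # X[k,i] -> rY.s
    chart[tgt][0] += a * b2                # a' += a*b''
    chart[tgt][1] += b * b2                # b' += b*b''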
==========
is viterbi forward cost fine? i.e. can i have items whose names ignore the lhs NT (looking up the predictions that i finish lazily / keeping it graph-structured)?
======
1) A -> x . * (trie)
this is somewhat nice. cost is pushed for best-first, of course (see the trie sketch after this list). similar benefit as left-branching binarization, without the explicit predict/complete steps?
vs. just
2) * -> x . y
here you potentially have to list out all A -> . x y as items * -> . x y immediately, and shared rhs sequences won't be shared except at the usual single-NT predict/complete. of course, the prediction of items * -> . x y can occur lazily, best-first.
vs.
3) * -> x . *
with 3, we predict all sorts of useless items - ones that won't give us our goal A and may not participate in any parse. this is not a good option at all.
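a toy sketch of option 1: rules for each lhs stored as a trie over rhs symbols, with the best total rule cost pushed onto every node so that a dotted item A -> x . * immediately carries an admissible best-first score (names are made up):

class TrieNode:
    def __init__(self):
        self.kids = {}             # next rhs symbol -> TrieNode
        self.stop = None           # cheapest rule cost ending exactly here, if any
        self.best = float('inf')   # pushed cost: min over all rules through/below

def build_tries(rules):
    # rules: iterable of (lhs, rhs_tuple, cost); one trie per lhs
    tries = {}
    for lhs, rhs, cost in rules:
        node = tries.setdefault(lhs, TrieNode())
        node.best = min(node.best, cost)
        for sym in rhs:
            node = node.kids.setdefault(sym, TrieNode())
            node.best = min(node.best, cost)
        node.stop = cost if node.stop is None else min(node.stop, cost)
    return tries

an item A -> x . * is then just (A, node); node.best is its admissible heuristic, and shared rhs prefixes are shared for free.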
======
the -LM forest may have many in-edges per vertex (many rules per NT lhs). so instead of generating all successors for scan/predict at once, i wanted to have them in sorted (admissible) -LM cost order and postpone expansion once the prefix+rule part is more expensive than something else on the agenda. question: how many such postponed successor entries pile up?
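a sketch of the postponement: the unexpanded remainder of a successor list lives on the agenda as a single 'cursor' entry scored by its next element (hypothetical names; real entries would be earley items pushed in the same (score, tiebreak, payload) format). note that in this sketch there is at most one live cursor per expanded predecessor, which bounds the count asked about above by the number of expansions performed:

import heapq, itertools

_tie = itertools.count()   # tiebreak so heapq never compares payloads

def push_cursor(agenda, prefix, rules, idx):
    # rules: one NT's rules as (cost, rule), sorted by ascending -LM cost
    if idx < len(rules):
        cost, rule = rules[idx]
        heapq.heappush(agenda, (prefix + cost, next(_tie),
                                ('cursor', prefix, rules, idx)))

def pop(agenda):
    score, _, entry = heapq.heappop(agenda)
    if entry[0] == 'cursor':
        _, prefix, rules, idx = entry
        push_cursor(agenda, prefix, rules, idx + 1)   # postpone the rest
        return score, rules[idx][1]                   # materialize one successor
    return score, entry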