author     graehl <graehl@ec762483-ff6d-05da-a07a-a48fb63a330f>  2010-08-07 02:24:51 +0000
committer  graehl <graehl@ec762483-ff6d-05da-a07a-a48fb63a330f>  2010-08-07 02:24:51 +0000
commit     aaac9f8ee73ba59b72609af9a78b167312a6dac7 (patch)
tree       1273c65f43c124eac220a298acbe7af951b0a83e /graehl
parent     e2ef62dc50b9795fc801f97aba7bac8c80bb7c54 (diff)
propagation of feature name+debug from factory, return correct features array for fsa ffs
git-svn-id: https://ws10smt.googlecode.com/svn/trunk@486 ec762483-ff6d-05da-a07a-a48fb63a330f
Diffstat (limited to 'graehl')
-rwxr-xr-x  graehl/NOTES.beam    9
-rwxr-xr-x  graehl/NOTES.earley  73
2 files changed, 82 insertions, 0 deletions
diff --git a/graehl/NOTES.beam b/graehl/NOTES.beam
index 59314439..a48d1ab7 100755
--- a/graehl/NOTES.beam
+++ b/graehl/NOTES.beam
@@ -18,3 +18,12 @@ randlm worth using? guess not.
actually get all 0-state models in 1st pass parse and prune passive edges per span.
allocate cube pruning budget per prev pass
+
+(has been tried in the ISI decoder) models with nonstandard state comparison:
+typically for (partially) greedy forest scoring, some part of the state is
+excluded from equality/hashing. within the virtual ff interface, this would
+just add equals and hash to the vtable (rather than the faster raw state
+compare). if this is done often, then instead add a nonvirtual flag to the
+interface saying whether to use the virtual methods at all. or: a simple flag
+set by the user of ApplyModels (per feature?) saying whether to be 100% greedy
+or 0% - no halfway, e.g. the item name uses bigram context, but scoring uses
+the full 5gram state. (see the sketch below.)
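
a minimal C++ sketch of the idea above - hypothetical names, not cdec's actual
feature-function interface: a feature opts in to custom equality/hashing so
item recombination can ignore part of its state, e.g. compare on bigram
context while still scoring with the full 5gram state.

#include <cstddef>
#include <cstring>
#include <functional>

struct FeatureFunction {
  virtual ~FeatureFunction() {}
  virtual int StateSize() const = 0;
  // fast default: raw byte compare/hash over the whole state
  virtual bool StateEquals(const void* a, const void* b) const {
    return std::memcmp(a, b, StateSize()) == 0;
  }
  virtual size_t StateHash(const void* s) const {
    const unsigned char* p = static_cast<const unsigned char*>(s);
    size_t h = 0xcbf29ce484222325ull;  // FNV-1a over all state bytes
    for (int i = 0; i < StateSize(); ++i) { h ^= p[i]; h *= 0x100000001b3ull; }
    return h;
  }
  // nonvirtual flag: lets ApplyModels skip the vtable calls entirely
  // unless some feature actually opted in to custom comparison
  bool custom_state_compare = false;
};

// "100% greedy": items recombine on the trailing bigram only, but scoring
// still reads the full 5gram state - the "no halfway" option from the note
struct GreedyNgramFF : FeatureFunction {
  GreedyNgramFF() { custom_state_compare = true; }
  int StateSize() const override { return 5 * static_cast<int>(sizeof(int)); }
  bool StateEquals(const void* a, const void* b) const override {
    const int* x = static_cast<const int*>(a);
    const int* y = static_cast<const int*>(b);
    return x[3] == y[3] && x[4] == y[4];  // bigram context only
  }
  size_t StateHash(const void* s) const override {
    const int* x = static_cast<const int*>(s);
    return std::hash<int>()(x[3]) * 31u + std::hash<int>()(x[4]);
  }
};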
diff --git a/graehl/NOTES.earley b/graehl/NOTES.earley
new file mode 100755
index 00000000..73727a54
--- /dev/null
+++ b/graehl/NOTES.earley
@@ -0,0 +1,73 @@
+
+the target CFG (tcfg) is finite - absolutely no cycles. conceptually we're intersecting it with a wfsa (weights are feature vectors), which is a lot like parsing a lattice, in that states are (source state, dest state) pairs and you're covering some string(s) that go from source->dest in the wfsa.
+
+Chris' paper http://www.ling.umd.edu/~redpony/forest-reordering.pdf - apparently (figure 5) already contains the exact concept we're going for, albeit with only inside scores. http://www.speech.sri.com/cgi-bin/run-distill?ftp:papers/stolcke-cl95.ps.gz describes a nice way of computing sums over derivations given a string by keeping a tuple of ("forward","inner") scores while Earley parsing. I'm not sure yet if this is applicable (because we'll already have the exact outside scores from the -LM forest, and plan on using cost pushing toward the top so we don't have to explicitly consider them).
+
+normally in earley, one word is consumed at a time, left to right. completions happen from shortest to longest, then (repeated) predictions, and finally scans. i'm sure this has the usual obvious extension to parsing lattices (proceed in some topological order).
+
+but because the wfsa (ngram lm) has cycles and forgets the length of the string (at some point), it's slightly more complicated than lattice parsing the tcfg - there's no topological order over the wfsa states and so you can't finish all the items [x,j] for j from left->right. best first with monotonic total scores (admissible heuristics) is an easy way to avoid generating the whole space; otherwise I can't think of a fixed order that would allow for alternative beaming. as we discussed, arbitrary predicates filtering candidate items can be added if exact best-first is too slow.
+
+if the wfsa were just a single string t[0...n-1], then any item [i,j]X->a.b means there's some derivation in the tcfg of S =>* t[0...i-1]Xc => t[0...i-1]abc =>* t[0...j-1]bc. for an FSA, the analog is S =>* W(0,i)Xc => W(0,i)abc =>* W(0,i)W(i,j)bc, where W(a,b) means any string in the wfsa language with a as the start state and b as the final state.
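
a rough sketch of the items and best-first agenda this implies - illustrative
C++ with hypothetical names, not the actual decoder:

#include <queue>
#include <vector>

// item [i,j] X -> a.b where i,j are wfsa states: some string in W(i,j)
// has been derived from the part of X's rule left of the dot
struct Item {
  int lhs, rule, dot;  // dotted rule position
  int i, j;            // wfsa source/dest states covered so far
  double inside;       // cost accumulated so far (-log prob)
  double outside_est;  // admissible estimate of remaining completion cost
};

// pop cheapest total first: with a monotone admissible heuristic the first
// pop of an item is already its best score, so exploration can stop early
// even though the cyclic wfsa admits no topological order over [x,j] items
struct ByTotal {
  bool operator()(const Item& a, const Item& b) const {
    return a.inside + a.outside_est > b.inside + b.outside_est;
  }
};

int main() {
  std::priority_queue<Item, std::vector<Item>, ByTotal> agenda;
  // seed with predictions for S at the wfsa start state, then loop:
  // pop, then scan/predict/complete into new items, until a completed
  // goal item spanning (start state, final state) pops
  return 0;
}

per the note above, the outside_est could come from the exact -LM forest
outside scores (with cost pushing toward the top, possibly zero).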
+
+
+http://www.isi.edu/natural-language/teaching/cs544/cs544-huang-3-Earley.pdf
+
+http://www.isi.edu/~lhuang/dp-final.pdf (variation on stolcke 1995 prefix cost)
+
+http://acl.ldc.upenn.edu/P/P07/P07-1019.pdf - phrase based lazy priority queue "cube growing" descendants (p149)
+
+
+
+
+
+http://www.speech.sri.com/cgi-bin/run-distill?ftp:papers/stolcke-cl95.ps.gz
+
+http://www.icsi.berkeley.edu/~stolcke/papers/cl95/node10.html#SECTION00042000000000000000
+
+a) An (unconstrained) Earley path, or simply path, is a sequence of Earley
+states linked by prediction, scanning, or completion. For the purpose of
+this definition, we allow scanning to operate in “generation mode,” i.e., all
+states with terminals to the right of the dot can be scanned, not just those
+matching the input. (For completed states, the predecessor state is defined
+to be the complete state from the same state set contributing to the
+completion.)
+b) A path is said to be constrained by, or generate a string x if the terminals
+immediately to the left of the dot in all scanned states, in sequence, form
+the string x.
+c) A path is complete if the last state on it matches the first, except that the
+dot has moved to the end of the RHS.
+d) We say that a path starts with nonterminal X if the first state on it is a
+predicted state with X on the LHS.
+e) The length of a path is defined as the number of scanned states on it.
+
+Note that the definition of path length is somewhat counter-intuitive, but is motivated
+by the fact that only scanned states correspond directly to input symbols. Thus,
+the length of a path is always the same as the length of the input string it generates.
+
+A constrained path starting with the initial state contains a sequence of states from
+state set 0 derived by repeated prediction, followed by a single state from set 1 produced
+by scanning the first symbol, followed by a sequence of states produced by completion,
+followed by a sequence of predicted states, followed by a state scanning the second
+symbol, and so on. The significance of Earley paths is that they are in a one-to-one
+correspondence with left-most derivations.
+
+
+=============
+
+a) The forward probability alpha_i(X[k]->x.y) is the sum of the probabilities of all
+constrained paths of length i that end in state X[k]->x.y
+
+b) The inner probability beta_i(X[k]->x.y) is the sum of the probabilities of all
+paths of length i-k that start in state X[k,k]->.xy and end in X[k,i]->x.y, and generate the input symbols x[k...i-1]
+
+(forward,inner) [i.e. (outside,inside)?]
+ unchanged by scan (rule cost is paid up front when predicting)
+
+if X[k,i] -> x.Yz (a,b) and rule Y -> r (p)
+then Y[i,i] -> .r (a',p) with a' += a*p
+
+if Y[j,i]->y. (a'',b'') and X[k,j]->r.Ys (a,b)
+then X[k,i]->rY.s (a',b') with a' += a*b'', b' += b*b''
+
+(this is summing over all derivations)
+
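
a toy C++ sketch of the three updates above, in probability space (names are
illustrative, not from the actual code):

// a = forward, b = inner, as defined above
struct Scores {
  double a = 0.0;
  double b = 0.0;
};

// predict: X[k,i] -> x.Yz (a,b) and rule Y -> r with prob p
// gives Y[i,i] -> .r with a' += a*p and b' = p
// (rule cost paid up front, which is why scan leaves both unchanged)
void Predict(const Scores& pred, double p, Scores* out) {
  out->a += pred.a * p;
  out->b = p;
}

// scan: both scores pass through untouched
void Scan(const Scores& in, Scores* out) { *out = in; }

// complete: Y[j,i] -> y. (a'',b'') combines with X[k,j] -> r.Ys (a,b)
// into X[k,i] -> rY.s with a' += a*b'', b' += b*b''
// the += accumulations are exactly the sum over all derivations
void Complete(const Scores& x, const Scores& y_done, Scores* out) {
  out->a += x.a * y_done.b;
  out->b += x.b * y_done.b;
}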