greedy binarization - needs testing, may have broke l2r

git-svn-id: https://ws10smt.googlecode.com/svn/trunk@560 ec762483-ff6d-05da-a07a-a48fb63a330f
author: graehl@gmail.com <graehl@gmail.com@ec762483-ff6d-05da-a07a-a48fb63a330f> 2010-08-16 09:11:03 +0000
committer: graehl@gmail.com <graehl@gmail.com@ec762483-ff6d-05da-a07a-a48fb63a330f> 2010-08-16 09:11:03 +0000
commit: 6cc769d102bfcf87822ceeb499cf45ff1e79e5f6 (patch)
tree: 5830cab16e60895832100206f962bfbc533fd860 /graehl/NOTES.partial.binarize
parent: 63531c9c25d1995e483c0a037518b5caa58fbb2c (diff)
1 files changed, 21 insertions, 0 deletions
diff --git a/graehl/NOTES.partial.binarize b/graehl/NOTES.partial.binarize
new file mode 100755
index 00000000..a9985891
--- /dev/null
+++ b/graehl/NOTES.partial.binarize
@@ -0,0 +1,21 @@
+Earley doesn't require binarized rules.
+
+But a (partially) binarized grammar may lead to smaller (exhaustive or heuristic) charts.  The tradeoff is mostly more reduce steps (the # of NTs should be similar to or less than the usual dotted-item binarization).
+
+Optionally collapse a rule rhs to unary as well (normal binarization would stop when an rhs is binary), if the rule to collapse it exists or is frequent enough.
+
+Greedy binarization schemes:
+
+1) (repeatedly) for the most frequent rhs bigram "X a" create a binary rule "V -> X a" and replace "X a" in all rules' rhs with V.  stop if the most frequent bigram has count lower than some threshold (e.g. 3), because each instance of it saves one symbol, but the new rule has 3 symbols.
+
+2) (repeatedly) for each rule, pick the most frequent bigram in its rhs and binarize it (2a for that rule only, 2b everywhere that bigram occurs).  again, some frequency threshold.  optionally allow collapsing an rhs to unary.  this fails to use some substitutions that are available "for free" based on actions taken at earlier rules w/ no frequent bigrams in common with this one.
+
+3) (DeNero) (for complete binarization only?) for each rule until binarized, pick a split point k of L->r[0..n) to make rules L->V1 V2, V1->r[0..k), V2->r[k..n), to minimize the number of new rules created.  If no prefix or suffix of r already exists as a virtual rule, then choose k=floor(n/2).  To amend this to consider frequency of rhs, use the frequency of rhs-prefix/suffixes to decide where to split?
+
+4?) Song, Chin-Yew Lin - seems to require collecting stats from a large parsed corpus - interesting idea: make rules that don't match fail early (that's 1 way you get a speedup), and pick V1 -> ... based on some kind of expected utility.
+
+5) l2r, r2l.  yawn.
+
+1) seems the most sensible.  don't just keep a count for each bigram, keep a set of left and right adjacent partially overlapping bigrams (i.e. the words left and right). for "a b" if "c" and "d" occur to the right, then "b c" and "b d" would be the right adjacent bigrams.  when replacing a bigram, follow the left and right adjacencies to decrement the count of those bigrams, and add a (bidirectional) link to the new bigram.
+
+Further, partial-1) can be followed by complete-3) or 5) - although i see no reason not to just continue 1) until the grammar is binary if you want a full binarization.
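
A minimal sketch of greedy scheme 1) from the notes above, in Python (the notes do not name a language, so that choice is an assumption). It recounts rhs bigrams from scratch on every pass instead of maintaining the left/right adjacency links the notes describe, so it illustrates the substitution loop only, not the incremental bookkeeping. The function names, the V1, V2, ... naming of virtual NTs, the tie-breaking order, and the choice to leave any rhs of length <= 2 alone (no collapsing to unary) are illustrative assumptions, not taken from the commit.

```python
from collections import Counter


def replace_bigram(rhs, bigram, new_sym):
    """Replace non-overlapping occurrences of bigram in rhs, left to right."""
    out = []
    i = 0
    while i < len(rhs):
        if i + 1 < len(rhs) and (rhs[i], rhs[i + 1]) == bigram:
            out.append(new_sym)
            i += 2
        else:
            out.append(rhs[i])
            i += 1
    return tuple(out)


def greedy_binarize(rules, threshold=3):
    """Repeatedly factor out the most frequent rhs bigram (scheme 1).

    rules: list of (lhs, rhs) pairs, rhs being a sequence of symbols.
    Returns (rewritten rules, list of new virtual rules (V, bigram)).
    Assumes no existing symbol is already named V1, V2, ...
    """
    rules = [(lhs, tuple(rhs)) for lhs, rhs in rules]
    virtual = []
    while True:
        counts = Counter()
        for _, rhs in rules:
            if len(rhs) > 2:  # binary/unary rhs are left untouched
                # Overlapping occurrences (e.g. "a a a") are each counted,
                # so the count can slightly overstate the achievable savings.
                for i in range(len(rhs) - 1):
                    counts[(rhs[i], rhs[i + 1])] += 1
        if not counts:
            break
        # Highest count wins; ties are broken by the bigram itself so the
        # result is deterministic.
        bigram, n = max(counts.items(), key=lambda kv: (kv[1], kv[0]))
        if n < threshold:
            break
        new_sym = "V%d" % (len(virtual) + 1)
        virtual.append((new_sym, bigram))
        rules = [
            (lhs, replace_bigram(rhs, bigram, new_sym) if len(rhs) > 2 else rhs)
            for lhs, rhs in rules
        ]
    return rules, virtual


if __name__ == "__main__":
    grammar = [
        ("S", ["X", "a", "b"]),
        ("S", ["X", "a", "c"]),
        ("T", ["d", "X", "a"]),
        ("T", ["e", "f"]),
    ]
    new_rules, new_virtual = greedy_binarize(grammar, threshold=3)
    for lhs, rhs in new_virtual + new_rules:
        print(lhs, "->", " ".join(rhs))
```

With the toy grammar above, "X a" occurs three times in long rhs, which meets the example threshold of 3, so one virtual rule is created and every rule ends up binary. Recounting on each pass is quadratic in the worst case; the adjacency-link idea in the notes is what would make each replacement cheap.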