From 708e59d0e908b24bac36ec36956ad013268253b0 Mon Sep 17 00:00:00 2001
From: "graehl@gmail.com"
 <graehl@gmail.com@ec762483-ff6d-05da-a07a-a48fb63a330f>
Date: Mon, 16 Aug 2010 09:11:03 +0000
Subject: greedy binarization - needs testing, may have broke l2r

git-svn-id: https://ws10smt.googlecode.com/svn/trunk@560 ec762483-ff6d-05da-a07a-a48fb63a330f
---
 graehl/NOTES.partial.binarize | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)
 create mode 100755 graehl/NOTES.partial.binarize

(limited to 'graehl/NOTES.partial.binarize')

diff --git a/graehl/NOTES.partial.binarize b/graehl/NOTES.partial.binarize
new file mode 100755
index 00000000..a9985891
--- /dev/null
+++ b/graehl/NOTES.partial.binarize
@@ -0,0 +1,21 @@
+Earley doesn't require binarized rules.
+
+But a (partially) binarized grammar may lead to smaller (exhaustive or heuristic) charts.  The tradeoff is mostly more reduce steps (the # of NTs should be similar to or less than with the usual dotted-item binarization).
+
+Optionally collapse a rule rhs to unary as well (normal binarization would stop when an rhs is binary), if the rule to collapse it exists or is frequent enough.
+
+Greedy binarization schemes:
+
+1) (repeatedly) for the most frequent rhs bigram "X a" create a binary rule "V -> X a" and replace "X a" in all rules' rhs with V.  stop if the most frequent bigram has count lower than some threshold (e.g. 3), because each instance of it saves one symbol, but the new rule has 3 symbols.
+
+2) (repeatedly) for each rule, pick the most frequent bigram in its rhs and binarize it (2a for that rule only, 2b everywhere that bigram occurs).  again, some frequency threshold.  optionally allow collapsing an rhs to unary.  this fails to use some substitutions that are available "for free" based on actions taken at earlier rules w/ no frequent bigrams in common with this one.
+
+3) (DeNero) (for complete binarization only?) for each rule until binarized, pick a split point k of L->r[0..n) to make rules L->V1 V2, V1->r[0..k), V2->r[k..n), to minimize the number of new rules created.  If no prefix or suffix of r already exists as a virtual rule, then choose k=floor(n/2).  To amend this to consider frequency of rhs, use the frequency of rhs-prefix/suffixes to decide where to split?
+
+4?) Song, Chin-Yew Lin - seems to require collecting stats from a large parsed corpus - interesting idea: make rules that don't match fail early (that's 1 way you get a speedup), and pick V1 -> ... based on some kind of expected utility.
+
+5) l2r, r2l.  yawn.
+
+1) seems the most sensible.  don't just keep a count for each bigram, keep a set of left and right adjacent partially overlapping bigrams (i.e. the words left and right). for "a b" if "c" and "d" occur to the right, then "b c" and "b d" would be the right adjacent bigrams.  when replacing a bigram, follow the left and right adjacencies to decrement the count of those bigrams, and add a (bidirectional) link to the new bigram.
+
+Further, partial-1) can be followed by complete-3) or 5) - although i see no reason not to just continue 1) until the grammar is binary if you want a full binarization.
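
A minimal Python sketch of scheme 1) from the notes above. It assumes rules are (lhs, rhs-tuple) pairs and that the names V1, V2, ... are not already used as symbols; those conventions and the function names are hypothetical, for illustration only. It recounts bigrams from scratch on every pass instead of maintaining the left/right adjacency links described in the notes, so it illustrates the greedy choice and the threshold test rather than the efficient bookkeeping. Rules whose rhs is already binary are left alone (no collapsing to unary).

```python
from collections import Counter


def count_bigrams(rules):
    """Count non-overlapping rhs bigram occurrences in rules longer than binary."""
    counts = Counter()
    for _, rhs in rules:
        if len(rhs) <= 2:
            continue  # already binary (or shorter): not a candidate
        last = {}  # bigram -> index of its last counted occurrence in this rhs
        for i in range(len(rhs) - 1):
            bigram = (rhs[i], rhs[i + 1])
            # In a run like "a a a", skip the occurrence overlapping the previous one.
            if last.get(bigram) != i - 1:
                counts[bigram] += 1
                last[bigram] = i
    return counts


def replace_bigram(rhs, bigram, new_sym):
    """Replace each left-to-right, non-overlapping occurrence of bigram by new_sym."""
    out = []
    i = 0
    while i < len(rhs):
        if i + 1 < len(rhs) and (rhs[i], rhs[i + 1]) == bigram:
            out.append(new_sym)
            i += 2
        else:
            out.append(rhs[i])
            i += 1
    return tuple(out)


def greedy_binarize(rules, threshold=3):
    """Repeatedly factor out the most frequent rhs bigram until it is rarer than threshold."""
    rules = [(lhs, tuple(rhs)) for lhs, rhs in rules]
    n_virtual = 0
    while True:
        counts = count_bigrams(rules)
        if not counts:
            break
        # Highest count wins; ties are broken by the smallest bigram, for determinism.
        bigram = min(counts, key=lambda bg: (-counts[bg], bg))
        if counts[bigram] < threshold:
            break
        n_virtual += 1
        virtual = f"V{n_virtual}"  # assumed not to clash with existing symbols
        rules = [
            (lhs, replace_bigram(rhs, bigram, virtual) if len(rhs) > 2 else rhs)
            for lhs, rhs in rules
        ]
        rules.append((virtual, bigram))
    return rules


if __name__ == "__main__":
    grammar = [
        ("S", ("X", "a", "b")),
        ("T", ("X", "a", "c")),
        ("U", ("d", "X", "a")),
        ("W", ("e", "f", "g")),
    ]
    for lhs, rhs in greedy_binarize(grammar, threshold=3):
        print(lhs, "->", " ".join(rhs))
```

With threshold=3 only "X a" is factored out in this toy grammar; with threshold=1 the loop keeps going until every rhs is binary, which matches the remark that 1) can simply be continued for a full binarization.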
-- 
cgit v1.2.3