cfg

git-svn-id: https://ws10smt.googlecode.com/svn/trunk@493 ec762483-ff6d-05da-a07a-a48fb63a330f
author: graehl@gmail.com <graehl@gmail.com@ec762483-ff6d-05da-a07a-a48fb63a330f> 2010-08-08 08:49:13 +0000
committer: graehl@gmail.com <graehl@gmail.com@ec762483-ff6d-05da-a07a-a48fb63a330f> 2010-08-08 08:49:13 +0000
commit: ffd4acfe5b84f413c66af6ec5a76bdbc0d3aa9e8 (patch)
tree: e5e6f9063e200a7bee863fa34557007d5271a22f /graehl
parent: b588990fe985be78c05a9497d7f95b440d454cd3 (diff)
1 files changed, 13 insertions, 0 deletions
diff --git a/graehl/NOTES.earley b/graehl/NOTES.earley
index 73727a54..9b8bf1fc 100755
--- a/graehl/NOTES.earley
+++ b/graehl/NOTES.earley
@@ -1,3 +1,16 @@
+1. fsts (modify target string produced) are quite feasible.  augment fsa ff to not just emit features, but also new target words.  composition or intersection is no problem (can always bunch into a single FSA/FST lazily by wrapping)
+
+2. sparse fsas (many transitions have -inf score) aren't efficiently supported presently (they are in early_composer where the fsa is a phrase table); the fsa ff interface doesn't even provide a way to query the non-0 transitions (you have to emit a -inf feature).  if sparse fsas were expected often and we wanted exact search, then a similar index of the tcfg as in earley_composer would make sense.  however, with prob. beam search, we prune out bad-scoring stuff anyway
+
+3. binarization of rhs isn't usually considered necessary in earley, but i liked the idea of optimal binarization making the most sharing possible.  however, this means what would have just been a scan is now a scan+complete.
+
+4. prefix (forward) + inside cost.  this is phrased in a way so that prefix includes inside.  but there's no reason not to think of it in exclusive terms (outside,inside) where prefix=outside*inside when using viterbi.  on the other hand, we usually use the outside*inside as the beam score.  and furthermore, it looks like when summing over all derivations, there would be some difficulty calculating, as the total inside wouldn't be known at first.
+
+(a,i) r => (+=a*r,r) would become (o,i) r => (+=[(o*i*r)/r],r) = (+=o*i,r)
+(_,b'') (a,b)  => (+=a*b'',+=b*b'') would become (_,b'') (o,b) => (?????)
+
+====
+
 
 the target CFG (tcfg) is finite - absolutely no cycles.  conceptually we're intersecting it with a wfsa (weights are feature vectors), which is a lot like parsing a lattice, in that states are (source state, dest state) pairs and you're covering some string(s) that go from source->dest in the wfsa.
author	graehl@gmail.com <graehl@gmail.com@ec762483-ff6d-05da-a07a-a48fb63a330f>	2010-08-08 08:49:13 +0000
committer	graehl@gmail.com <graehl@gmail.com@ec762483-ff6d-05da-a07a-a48fb63a330f>	2010-08-08 08:49:13 +0000
commit	ffd4acfe5b84f413c66af6ec5a76bdbc0d3aa9e8 (patch)
tree	e5e6f9063e200a7bee863fa34557007d5271a22f /graehl
parent	b588990fe985be78c05a9497d7f95b440d454cd3 (diff)