This is a simple (but parallelizable) tuning method for cdec, as used here:
  "Joint Feature Selection in Distributed Stochastic
   Learning for Large-Scale Discriminative Training in
   SMT" Simianer, Riezler, Dyer
   ACL 2012


Building
--------
dtrain is built as part of the cdec build; see ../BUILDING.

Running
-------
To run dtrain on a dev set locally, build it with DTRAIN_LOCAL defined:
```
    #define DTRAIN_LOCAL
```
Otherwise, remove that line or undefine DTRAIN_LOCAL. You need a single
grammar file or per-sentence grammars (psg), as you would use with cdec.
Additionally, you need to give dtrain a file with
references (--refs).

The input for use with hadoop streaming looks like this:
```
    <sid>\t<source>\t<ref>\t<grammar rules separated by \t>
```
To convert a psg to this format, replace every "\n" with "\t".
Make sure there are no tabs in your data, since tabs separate the fields.
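
The conversion described above could be sketched in Python roughly as follows.
This is only an illustration, not part of dtrain: the helper name
`to_streaming_line` and its arguments are hypothetical, and it assumes the
per-sentence grammar is available as a string with one rule per line. Skipping
blank lines (such as the one left by a trailing newline) is also an assumption
made here to avoid empty fields.
```python
def to_streaming_line(sid, source, ref, grammar_text):
    """Join id, source, reference and grammar rules into one tab-separated line.

    Hypothetical helper for illustration; not part of dtrain or cdec.
    """
    # One rule per line in the psg; blank lines are dropped (assumption).
    rules = [rule for rule in grammar_text.split("\n") if rule.strip()]
    fields = [str(sid), source, ref] + rules
    for field in fields:
        # Tabs separate fields and newlines separate records,
        # so neither may appear inside the data itself.
        if "\t" in field or "\n" in field:
            raise ValueError("tab or newline inside a field: %r" % field)
    return "\t".join(fields)
```
Raising an error on embedded tabs, rather than silently replacing them, keeps
a malformed record from shifting every field after it.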

For an example of local usage (with 'distributed' format),
see test/example/ . This expects dtrain to be built without
DTRAIN_LOCAL.

Legal stuff
-----------
Copyright (c) 2012 by Patrick Simianer <p@simianer.de>

See the file ../LICENSE.txt for the licensing terms that this software is
released under.