\documentclass[11pt]{report}
\usepackage{graphicx}
\usepackage{index}
\usepackage{varioref}
\usepackage{amsmath}
\usepackage{multirow}
\usepackage{theorem} % for examples
\usepackage{alltt}
\usepackage{ulem}
\usepackage{epic,eepic}
\usepackage{boxedminipage}
\usepackage{fancybox}
\usepackage{colortbl}
\usepackage[square]{natbib}
\usepackage{epsfig}
%\usepackage{subfig}
\usepackage{subfigure}
\usepackage{booktabs}

\usepackage[encapsulated]{CJK}
\usepackage{ucs}
\usepackage[utf8x]{inputenc}
% use one of bsmi(trad Chinese), gbsn(simp Chinese), min(Japanese), mj(Korean); see:
% /usr/share/texmf-dist/tex/latex/cjk/texinput/UTF8/*.fd
\newcommand{\cntext}[1]{\begin{CJK}{UTF8}{gbsn}#1\end{CJK}}


\oddsidemargin 0mm
\evensidemargin 5mm
\topmargin -20mm
\textheight 240mm
\textwidth 160mm


\newcommand{\bold}{\it}
\renewcommand{\emph}{\it}

\makeindex
\theoremstyle{plain}

\newcommand{\nt}[2]{\textrm{#1}_{\framebox[5pt]{\scriptsize #2}}}
\newcommand{\ind}[1]{{\fboxsep1pt\raisebox{-.5ex}{\fbox{{\tiny #1}}}}}

\begin{document}
\title{\vspace{-15mm}\LARGE {\bf Final Report}\\[2mm]
of the\\[2mm]
2010 Language Engineering Workshop\\[15mm]
{\huge \bf Models for\\
Synchronous Grammar Induction\\[2mm]
{\tt \Large http://www.clsp.jhu.edu/workshops/ws10/groups/msgismt/}\\[15mm]
Johns Hopkins University\\[2mm]
Center for Language and Speech Processing}}
\author{\large Phil Blunsom,
Chris Callison-Burch,
Trevor Cohn,
Chris Dyer,
Jonathan Graehl,
Adam Lopez,\\
\large
Jan Botha,
Vladimir Eidelman,
ThuyLinh Nguyen,
Ziyuan Wang, 
Jonathan Weese,
\\
\large Olivia Buzek, Desai Chen}
\normalsize

\maketitle

\section*{Abstract}
The last decade of research in Statistical Machine Translation (SMT) has seen rapid progress. The most successful methods have been based on synchronous context-free grammars (SCFGs), which encode translational equivalences and license reordering between tokens in the source and target languages. Yet while closely related language pairs can now be translated with a high degree of precision, the results for distant pairs are far from acceptable. In theory, however, the ``right'' SCFG is capable of handling most, if not all, structurally divergent language pairs. The 2010 Language Engineering Workshop {\emph Models of Synchronous Grammar Induction for SMT} focused on the crucial practical aspects of acquiring such SCFGs from bilingual text. We started with existing algorithms for inducing unlabeled SCFGs (e.g.\ the popular Hiero model) and then used state-of-the-art unsupervised learning methods to refine the syntactic constituents used in the translation rules of the grammar.

\phantom{.}


\newpage
\section*{Acknowledgments}
The participants at the workshop would like to thank everybody at Johns Hopkins University who made the summer workshop such a memorable --- and in our view very successful --- event. The JHU Summer Workshop is a great venue to bring together researchers from various backgrounds and focus their minds on a problem, leading to intense collaboration that would not have been possible otherwise.

We would especially like to thank Fred Jelinek for heading the Summer School effort and Desir\'ee Cleves for her superhuman ability to keep things running smoothly.

\phantom{.}

\newpage
\section*{Team Members}

\begin{itemize}
\item Phil Blunsom, Team Leader, University of Oxford
\item Chris Callison-Burch, Senior Researcher, Johns Hopkins University
\item Trevor Cohn, Senior Researcher, University of Sheffield
\item Chris Dyer, Senior Researcher, Carnegie Mellon University
\item Adam Lopez, Senior Researcher, University of Edinburgh
\item Jonathan Graehl, Senior Researcher, Information Sciences Institute, USC
\item Jan Botha, Graduate Student, University of Oxford
\item Vladimir Eidelman, Graduate Student, University of Maryland
\item ThuyLinh Nguyen, Graduate Student, Carnegie Mellon University
\item Jonathan Weese, Graduate Student, Johns Hopkins University
\item Ziyuan Wang, Graduate Student, Johns Hopkins University
\item Olivia Buzek, Undergraduate Student, University of Maryland
\item Desai Chen, Undergraduate Student, Carnegie Mellon University
\end{itemize}
\tableofcontents


\include{introduction}
\include{SCFGs}
\include{setup}
\include{np_clustering}
\include{morphology/morphology}
\include{pr-clustering/posterior}
\include{training}

\bibliographystyle{apalike}
\bibliography{biblio}

\printindex

\end{document}