author redpony <redpony@ec762483-ff6d-05da-a07a-a48fb63a330f> 2010-08-18 22:20:32 +0000
committer redpony <redpony@ec762483-ff6d-05da-a07a-a48fb63a330f> 2010-08-18 22:20:32 +0000
commit 5d1ce5c1a4cc1c9cc4091847c19552394155a92a
tree 92ea73a54ad5f128b0d37178321f93caecbcf0f5 /report
parent 8ec14f00d1078f0fa7ab3ba2a01954b1f6ca5260
add example grammar
git-svn-id: https://ws10smt.googlecode.com/svn/trunk@594 ec762483-ff6d-05da-a07a-a48fb63a330f
Diffstat (limited to 'report')
-rw-r--r-- report/np_clustering.tex 136
-rwxr-xr-x report/pyp_clustering/format-grammar-latex.pl 21
2 files changed, 149 insertions, 8 deletions
diff --git a/report/np_clustering.tex b/report/np_clustering.tex
index 17ff31a4..0d6eb854 100644
--- a/report/np_clustering.tex
+++ b/report/np_clustering.tex
@@ -3,14 +3,14 @@
\chapter{Nonparametric Models}
-In this chapter we describe a Bayesian nonparametric model for inducing categories in a synchronous context-free grammar. As discussed in Chapter~\ref{chapter:setup}, we hypothesize that each phrase pair, $\p$, can be clustered on the basis of the contexts it occurs in. Using this as our starting point, we define a generative model where contexts are generated by the (latent) category type of the phrases they occur in. In contrast to most prior work using Bayesian models for synchronous grammar induction \citep{blunsom:nips2008,blunsom:acl2009,zhang:2008}, we do not model parallel sentence pairs directly. Rather, we assume that our corpus is a {\emph collection of contexts} (grouped according to the phrases they occur in), where each context is conditionally independent of the others, given the type of the category it surrounds. The models used here are thus variations on the Latent Dirichlet Allocation (LDA) model of \cite{blei:2003}.
+In this chapter we describe a Bayesian nonparametric approach to inducing the categories in a synchronous context-free grammar. As discussed in Chapter~\ref{chapter:setup}, we hypothesize that each phrase pair, $\p$, can be clustered on the basis of the contexts it occurs in. Using this as our starting point, we define a generative model where contexts are generated by the (latent) category type of the phrases they occur in. In contrast to most prior work using Bayesian models for synchronous grammar induction \citep{blunsom:nips2008,blunsom:acl2009,zhang:2008}, we do not model parallel sentence pairs directly. Rather, we assume that our corpus is a \emph{collection of contexts}, grouped according to the phrases they occur in, where each context is conditionally independent of the others given the type of the category it surrounds. The models used here are thus variations on the Latent Dirichlet Allocation (LDA) model of \cite{blei:2003}.
In Section~\ref{sec:npmodel} we describe the basic structure of our nonparametric models as well as how inference was carried out.
\section{Model}
\label{sec:npmodel}
-This section describes the details of the phrase clustering model model. Each observed phrase (pair), $\p$, is characterized by a finite mixture of categories, $\theta_{\p}$. The collection of contexts for each phrase, $C_{\p}$, is generated as follows. A category type $z_i$ is drawn from $\theta_{\p}$, and this generates the observed context, $\textbf{c}_i$, according to a category-specific distribution over contexts types, $\phi_{z_i}$. Since we do not know the values of $\theta_{\p}$ and $\phi_z$, we place priors on the distributions, to reflect our prior beliefs about the shape these distributions should have and infer their values from the data we can observe. Specifically, our {\emph a priori} expectation is that both parameters will be relatively peaked, since each phrase, $\p$, should relatively unambiguous belong to particular category, and each category to generate a relatively small number of context strings, $\textbf{c}$. To encode these intuitions, we make use of Pitman-Yor processes \citep{pitman:1997}, which have already been demonstrated to be particularly effective models for language \citep{teh:2006,goldwater:2006}.
+This section describes the details of the phrase clustering model. Each observed phrase (pair), $\p$, is characterized by a finite mixture of categories, $\theta_{\p}$. The collection of contexts for each phrase, $C_{\p}$, is generated as follows. A category type $z_i$ is drawn from $\theta_{\p}$, and this generates the observed context, $\textbf{c}_i$, according to a category-specific distribution over context types, $\phi_{z_i}$. Since we do not know the values of $\theta_{\p}$ and $\phi_z$, we place priors on these distributions to reflect our prior beliefs about the shape they should have, and we infer their values from the data we can observe. Specifically, our \emph{a priori} expectation is that both parameters will be relatively peaked, since each phrase, $\p$, should belong relatively unambiguously to a particular category, and each category should generate a relatively small number of context strings, $\textbf{c}$. To encode these intuitions, we make use of Pitman-Yor processes \citep{pitman:1997}, which have already been demonstrated to be particularly effective models for language \citep{teh:2006,goldwater:2006}.
Our model assumes a fixed number of categories, $K$. The category type, $z \in \{ 1 , 2 , \ldots , K \}$, is generated from a PYP with a uniform base distribution:
\begin{align*}
@@ -66,11 +66,11 @@ The final sample drawn from the model was used to estimate $p(z|\textbf{c},\p)$,
\section{Experiments}
-This section reports a number of experiments carried out to test the quality of the grammars learned using our nonparametric cluster models.
+This section reports a number of experiments carried out to test the quality of the grammars learned using our nonparametric cluster models. We evaluate them primarily in terms of their performance on translation tasks. Translation quality is reported using case-insensitive \textsc{bleu} \citep{bleu}, with the number of references depending on the experimental condition (see the discussion of the corpora below).
\subsection{Corpora}
-The experiments reported in this section were carried out primarily on a small Chinese-English corpus from the travel and tourism domain \citep{btec} and a more general-domain Urdu-English corpus, made available by the US National Institute of Standards and Technology (NIST) for the Open MT Evaluation.\footnote{http://www.itl.nist.gov/iad/mig/tests/mt/} Table~\ref{tab:corpbtecur} provides statistics about the training and test data used in the experiments reported in this section. Translation quality evaluation is reported using case-insensitive \textsc{bleu} \citep{bleu} with the number of references given in Table~\ref{tab:corpbtecur}.
+The experiments reported in this section were carried out primarily on a small Chinese-English corpus from the travel and tourism domain \citep{btec} and a more general-domain Urdu-English corpus, made available by the US National Institute of Standards and Technology (NIST) for the Open MT Evaluation.\footnote{http://www.itl.nist.gov/iad/mig/tests/mt/} Table~\ref{tab:corpbtecur} provides statistics about the training and test data used in the experiments reported in this section. Additionally, this table gives the number of references used to compute the \textsc{bleu} score for a translated document.
\begin{table}[h]
\caption{Training corpus statistics for BTEC Chinese-English and the NIST Urdu-English data sets.}
@@ -84,7 +84,7 @@ English tokens & 364,297 & 968,013 \\
Foreign types & 13,664 & 33,757 \\
Foreign tokens & 333,438 & 1,052,260 \\
\hline
-Development sentences & 1,006 & 882 \\
+Dev. sentences & 1,006 & 882 \\
Test sentences & 506 & 883 \\
Num. references & 16 & 4
\end{tabular}
@@ -104,19 +104,139 @@ We provide two baseline systems: a single-category system constructed using the
\hline
Single category \citep{chiang:2007} & 57.0 & 21.1 \\
\hline
-Random ($K=10$) & 56.0 & \\
+Random ($K=10$) & 56.0 & 19.8 \\
Random ($K=25$) & 55.4 & 19.7 \\
-Random ($K=50$) & 55.3 & \\
+Random ($K=50$) & 55.3 & 19.6 \\
\hline
-Supervised \citep{samt} & 57.8 & 24.5
+Supervised \citep{samt} & 57.8 & 24.5 \\
+POS-only & TODO & 22.3 \\
\end{tabular}
\end{center}
\label{tab:npbaselines}
\end{table}%
+Because the margin of improvement from the 1-category baseline to the supervised condition is much more substantial in the Urdu-English condition than in the BTEC condition, some experiments were only carried out on Urdu.
\subsection{Number of categories}
+\begin{table}[h]
+\caption{Effect of varying $K$, single word left and right target language context, uniform $\phi_0$, hierarchical $\theta_0$.}
+\begin{center}
+\begin{tabular}{r|c|c}
+& BTEC & Urdu \\
+\hline
+Single category (baseline) & 57.0 & 21.1 \\
+\hline
+$K=10$ & 56.4 & \\
+$K=25$ & 57.5 & 22.0 \\
+$K=50$ & 56.2 & \\
+\end{tabular}
+\end{center}
+\label{tab:npnumcats}
+\end{table}%
+
+\subsection{Example grammar}
+
+\begin{table}[h]
+\caption{Fragment (part 1/2) of the 25-category Urdu-English grammar, with hierarchical $\theta_0$, uniform $\phi_0$, and one word of context on either side in the target language. Counts indicate the number of distinct rules that rewrite each category type.}
+\begin{center}
+\begin{tabular}{|c|l|c|l|}
+\hline
+22,386 & $ \textrm{X}^{0} \rightarrow \langle \textrm{EdAlt},\textrm{{\emph court}} \rangle $ &27,604 & $ \textrm{X}^{8} \rightarrow \langle \textrm{tHt},\textrm{{\emph under}} \rangle $ \\
+ & $ \textrm{X}^{0} \rightarrow \langle \textrm{bcwN},\textrm{{\emph children}} \rangle $ & & $ \textrm{X}^{8} \rightarrow \langle \textrm{myN},\textrm{{\emph into}} \rangle $ \\
+ & $ \textrm{X}^{0} \rightarrow \langle \textrm{lwg},\textrm{{\emph people}} \rangle $ & & $ \textrm{X}^{8} \rightarrow \langle \textrm{yA},\textrm{{\emph or}} \rangle $ \\
+ & $ \textrm{X}^{0} \rightarrow \langle \textrm{bcY},\textrm{{\emph children}} \rangle $ & & $ \textrm{X}^{8} \rightarrow \langle \textrm{ky},\textrm{{\emph for}} \rangle $ \\
+ & $ \textrm{X}^{0} \rightarrow \langle \textrm{ArkAn},\textrm{{\emph members}} \rangle $ & & $ \textrm{X}^{8} \rightarrow \langle \textrm{kY},\textrm{{\emph for}} \rangle $ \\
+\hline
+26,834 & $ \textrm{X}^{1} \rightarrow \langle \textrm{Drwrt},\textrm{{\emph need}} \rangle $ &40,283 & $ \textrm{X}^{9} \rightarrow \langle \textrm{3},\textrm{{\emph 3}} \rangle $ \\
+ & $ \textrm{X}^{1} \rightarrow \langle \textrm{SHAfywN},\textrm{{\emph journalists}} \rangle $ & & $ \textrm{X}^{9} \rightarrow \langle \textrm{Hsyn},\textrm{{\emph hussein}} \rangle $ \\
+ & $ \textrm{X}^{1} \rightarrow \langle \textrm{bAt},\textrm{{\emph speak}} \rangle $ & & $ \textrm{X}^{9} \rightarrow \langle \textrm{fArwq},\textrm{{\emph farooq}} \rangle $ \\
+ & $ \textrm{X}^{1} \rightarrow \langle \textrm{HSh},\textrm{{\emph participate}} \rangle $ & & $ \textrm{X}^{9} \rightarrow \langle \textrm{AqbAl},\textrm{{\emph iqbal}} \rangle $ \\
+ & $ \textrm{X}^{1} \rightarrow \langle \textrm{Apyl},\textrm{{\emph appeal}} \rangle $ & & $ \textrm{X}^{9} \rightarrow \langle \textrm{bn},\textrm{{\emph bin}} \rangle $ \\
+\hline
+61,592 & $ \textrm{X}^{2} \rightarrow \langle \textrm{kr},\textrm{{\emph the}} \rangle $ &182,196 & $ \textrm{X}^{10} \rightarrow \langle \textrm{yhAN},\textrm{{\emph here}} \rangle $ \\
+ & $ \textrm{X}^{2} \rightarrow \langle \textrm{hmArY},\textrm{{\emph our}} \rangle $ & & $ \textrm{X}^{10} \rightarrow \langle \textrm{AjlAs myN},\textrm{{\emph in the meeting}} \rangle $ \\
+ & $ \textrm{X}^{2} \rightarrow \langle \textrm{ErAqy},\textrm{{\emph iraqi}} \rangle $ & & $ \textrm{X}^{10} \rightarrow \langle \textrm{AysA},\textrm{{\emph so}} \rangle $ \\
+ & $ \textrm{X}^{2} \rightarrow \langle \textrm{dwsrY},\textrm{{\emph other}} \rangle $ & & $ \textrm{X}^{10} \rightarrow \langle \textrm{ElAj},\textrm{{\emph treatment}} \rangle $ \\
+ & $ \textrm{X}^{2} \rightarrow \langle \textrm{pr},\textrm{{\emph the}} \rangle $ & & $ \textrm{X}^{10} \rightarrow \langle \textrm{b@hrt},\textrm{{\emph india}} \rangle $ \\
+\hline
+98,970 & $ \textrm{X}^{3} \rightarrow \langle \textrm{zndgy},\textrm{{\emph life}} \rangle $ &7,648 & $ \textrm{X}^{11} \rightarrow \langle \textrm{sktA},\textrm{{\emph could}} \rangle $ \\
+ & $ \textrm{X}^{3} \rightarrow \langle \textrm{brTAnyh},\textrm{{\emph britain}} \rangle $ & & $ \textrm{X}^{11} \rightarrow \langle \textrm{skyN},\textrm{{\emph can}} \rangle $ \\
+ & $ \textrm{X}^{3} \rightarrow \langle \textrm{sEwdy Erb},\textrm{{\emph saudi arabia}} \rangle $ & & $ \textrm{X}^{11} \rightarrow \langle \textrm{tRym},\textrm{{\emph team}} \rangle $ \\
+ & $ \textrm{X}^{3} \rightarrow \langle \textrm{AslAm},\textrm{{\emph islam}} \rangle $ & & $ \textrm{X}^{11} \rightarrow \langle \textrm{kAm},\textrm{{\emph work}} \rangle $ \\
+ & $ \textrm{X}^{3} \rightarrow \langle \textrm{cyn},\textrm{{\emph china}} \rangle $ & & $ \textrm{X}^{11} \rightarrow \langle \textrm{\$mAly},\textrm{{\emph north}} \rangle $ \\
+\hline
+66,916 & $ \textrm{X}^{4} \rightarrow \langle \textrm{AnSAf},\textrm{{\emph justice}} \rangle $ &58,760 & $ \textrm{X}^{12} \rightarrow \langle \textrm{tryn},\textrm{{\emph most}} \rangle $ \\
+ & $ \textrm{X}^{4} \rightarrow \langle \textrm{bynk},\textrm{{\emph bank}} \rangle $ & & $ \textrm{X}^{12} \rightarrow \langle \textrm{wAlA},\textrm{{\emph man}} \rangle $ \\
+ & $ \textrm{X}^{4} \rightarrow \langle \textrm{nZAm},\textrm{{\emph system}} \rangle $ & & $ \textrm{X}^{12} \rightarrow \langle \textrm{ksy},\textrm{{\emph one}} \rangle $ \\
+ & $ \textrm{X}^{4} \rightarrow \langle \textrm{rws},\textrm{{\emph russia}} \rangle $ & & $ \textrm{X}^{12} \rightarrow \langle \textrm{sb},\textrm{{\emph most}} \rangle $ \\
+ & $ \textrm{X}^{4} \rightarrow \langle \textrm{srHd},\textrm{{\emph nwfp}} \rangle $ & & $ \textrm{X}^{12} \rightarrow \langle \textrm{bED},\textrm{{\emph some}} \rangle $ \\
+\hline
+29,526 & $ \textrm{X}^{5} \rightarrow \langle \textrm{qbwl},\textrm{{\emph accept}} \rangle $ &80,567 & $ \textrm{X}^{13} \rightarrow \langle \textrm{Amrykh kY},\textrm{{\emph the united}} \rangle $ \\
+ & $ \textrm{X}^{5} \rightarrow \langle \textrm{cAhtY},\textrm{{\emph want}} \rangle $ & & $ \textrm{X}^{13} \rightarrow \langle \textrm{"},\textrm{{\emph "}} \rangle $ \\
+ & $ \textrm{X}^{5} \rightarrow \langle \textrm{jAntY},\textrm{{\emph know}} \rangle $ & & $ \textrm{X}^{13} \rightarrow \langle \textrm{kl},\textrm{{\emph tomorrow}} \rangle $ \\
+ & $ \textrm{X}^{5} \rightarrow \langle \textrm{dY},\textrm{{\emph give}} \rangle $ & & $ \textrm{X}^{13} \rightarrow \langle \textrm{mjls},\textrm{{\emph majlis}} \rangle $ \\
+ & $ \textrm{X}^{5} \rightarrow \langle \textrm{bcAnY},\textrm{{\emph save}} \rangle $ & & $ \textrm{X}^{13} \rightarrow \langle \textrm{Awr pAkstAn},\textrm{{\emph and pakistan}} \rangle $ \\
+\hline
+12,625 & $ \textrm{X}^{6} \rightarrow \langle \textrm{dAxlh},\textrm{{\emph interior}} \rangle $ &111,291 & $ \textrm{X}^{14} \rightarrow \langle \textrm{pAkstAn myN},\textrm{{\emph in pakistan}} \rangle $ \\
+ & $ \textrm{X}^{6} \rightarrow \langle \textrm{myjr},\textrm{{\emph major}} \rangle $ & & $ \textrm{X}^{14} \rightarrow \langle \textrm{xw\$},\textrm{{\emph happy}} \rangle $ \\
+ & $ \textrm{X}^{6} \rightarrow \langle \textrm{Erb},\textrm{{\emph arab}} \rangle $ & & $ \textrm{X}^{14} \rightarrow \langle \textrm{\$rkt},\textrm{{\emph participated}} \rangle $ \\
+ & $ \textrm{X}^{6} \rightarrow \langle \textrm{nY btAyA},\textrm{{\emph told}} \rangle $ & & $ \textrm{X}^{14} \rightarrow \langle \textrm{kAmyAb},\textrm{{\emph succeeded}} \rangle $ \\
+ & $ \textrm{X}^{6} \rightarrow \langle \textrm{dAxlh},\textrm{{\emph home}} \rangle $ & & $ \textrm{X}^{14} \rightarrow \langle \textrm{nArAD},\textrm{{\emph angry}} \rangle $ \\
+\hline
+53,541 & $ \textrm{X}^{7} \rightarrow \langle \textrm{bcY},\textrm{{\emph child}} \rangle $ &8,547 & $ \textrm{X}^{15} \rightarrow \langle \textrm{hy},\textrm{{\emph is}} \rangle $ \\
+ & $ \textrm{X}^{7} \rightarrow \langle \textrm{Hq},\textrm{{\emph right}} \rangle $ & & $ \textrm{X}^{15} \rightarrow \langle \textrm{hw gy},\textrm{{\emph will be}} \rangle $ \\
+ & $ \textrm{X}^{7} \rightarrow \langle \textrm{\$hr},\textrm{{\emph city}} \rangle $ & & $ \textrm{X}^{15} \rightarrow \langle \textrm{hw},\textrm{{\emph have}} \rangle $ \\
+ & $ \textrm{X}^{7} \rightarrow \langle \textrm{SwrtHAl},\textrm{{\emph situation}} \rangle $ & & $ \textrm{X}^{15} \rightarrow \langle \textrm{rhy hY},\textrm{{\emph is}} \rangle $ \\
+ & $ \textrm{X}^{7} \rightarrow \langle \textrm{jng},\textrm{{\emph war}} \rangle $ & & $ \textrm{X}^{15} \rightarrow \langle \textrm{kA},\textrm{{\emph 's}} \rangle $ \\
+\hline
+\end{tabular}
+\end{center}
+\label{tab:npexample1}
+\end{table}%
+
+\begin{table}[h]
+\caption{Fragment (part 2/2) of the 25-category Urdu-English grammar.}
+\begin{center}
+\begin{tabular}{|c|l|c|l|}
+\hline
+40,738 & $ \textrm{X}^{16} \rightarrow \langle \textrm{gyA},\textrm{{\emph .}} \rangle $ &68,633 & $ \textrm{X}^{20} \rightarrow \langle \textrm{m\$yn},\textrm{{\emph machine}} \rangle $ \\
+ & $ \textrm{X}^{16} \rightarrow \langle \textrm{lyA},\textrm{{\emph .}} \rangle $ & & $ \textrm{X}^{20} \rightarrow \langle \textrm{myN mwjwd},\textrm{{\emph present in}} \rangle $ \\
+ & $ \textrm{X}^{16} \rightarrow \langle \textrm{dy},\textrm{{\emph .}} \rangle $ & & $ \textrm{X}^{20} \rightarrow \langle \textrm{AZhAr},\textrm{{\emph expressing}} \rangle $ \\
+ & $ \textrm{X}^{16} \rightarrow \langle \textrm{hy},\textrm{{\emph .}} \rangle $ & & $ \textrm{X}^{20} \rightarrow \langle \textrm{jyt},\textrm{{\emph winning}} \rangle $ \\
+ & $ \textrm{X}^{16} \rightarrow \langle \textrm{, pAkstAn},\textrm{{\emph , pakistan}} \rangle $ & & $ \textrm{X}^{20} \rightarrow \langle \textrm{nhyN},\textrm{{\emph not}} \rangle $ \\
+\hline
+16,270 & $ \textrm{X}^{17} \rightarrow \langle \textrm{pr},\textrm{{\emph to}} \rangle $ &40,443 & $ \textrm{X}^{21} \rightarrow \langle \textrm{AnkAr},\textrm{{\emph refused}} \rangle $ \\
+ & $ \textrm{X}^{17} \rightarrow \langle \textrm{sy},\textrm{{\emph to}} \rangle $ & & $ \textrm{X}^{21} \rightarrow \langle \textrm{khnA},\textrm{{\emph according}} \rangle $ \\
+ & $ \textrm{X}^{17} \rightarrow \langle \textrm{AnhwN nY},\textrm{{\emph he further}} \rangle $ & & $ \textrm{X}^{21} \rightarrow \langle \textrm{mlAqAt},\textrm{{\emph met}} \rangle $ \\
+ & $ \textrm{X}^{17} \rightarrow \langle \textrm{mstRr jstRs},\textrm{{\emph mr. justice}} \rangle $ & & $ \textrm{X}^{21} \rightarrow \langle \textrm{nY},\textrm{{\emph gave}} \rangle $ \\
+ & $ \textrm{X}^{17} \rightarrow \langle \textrm{nY},\textrm{{\emph he}} \rangle $ & & $ \textrm{X}^{21} \rightarrow \langle \textrm{sykwrtRy},\textrm{{\emph security}} \rangle $ \\
+\hline
+90,448 & $ \textrm{X}^{18} \rightarrow \langle \textrm{jhAN},\textrm{{\emph where}} \rangle $ &573,610 & $ \textrm{X}^{22} \rightarrow \langle \textrm{w},\textrm{{\emph and}} \rangle $ \\
+ & $ \textrm{X}^{18} \rightarrow \langle \textrm{kh},\textrm{{\emph "}} \rangle $ & & $ \textrm{X}^{22} \rightarrow \langle \textrm{)},\textrm{{\emph )}} \rangle $ \\
+ & $ \textrm{X}^{18} \rightarrow \langle \textrm{tAkh},\textrm{{\emph so}} \rangle $ & & $ \textrm{X}^{22} \rightarrow \langle \textrm{nY},\textrm{{\emph ,}} \rangle $ \\
+ & $ \textrm{X}^{18} \rightarrow \langle \textrm{dryN AvnA'},\textrm{{\emph meanwhile}} \rangle $ & & $ \textrm{X}^{22} \rightarrow \langle \textrm{bEd},\textrm{{\emph after}} \rangle $ \\
+ & $ \textrm{X}^{18} \rightarrow \langle \textrm{smyt},\textrm{{\emph including}} \rangle $ & & $ \textrm{X}^{22} \rightarrow \langle \textrm{(},\textrm{{\emph (}} \rangle $ \\
+\hline
+64,006 & $ \textrm{X}^{19} \rightarrow \langle \textrm{pwlys},\textrm{{\emph police}} \rangle $ &80,463 & $ \textrm{X}^{23} \rightarrow \langle \textrm{ElAqY},\textrm{{\emph area}} \rangle $ \\
+ & $ \textrm{X}^{19} \rightarrow \langle \textrm{whAN},\textrm{{\emph there}} \rangle $ & & $ \textrm{X}^{23} \rightarrow \langle \textrm{bynk},\textrm{{\emph bank}} \rangle $ \\
+ & $ \textrm{X}^{19} \rightarrow \langle \textrm{lwg},\textrm{{\emph people}} \rangle $ & & $ \textrm{X}^{23} \rightarrow \langle \textrm{brAdry},\textrm{{\emph community}} \rangle $ \\
+ & $ \textrm{X}^{19} \rightarrow \langle \textrm{As},\textrm{{\emph there}} \rangle $ & & $ \textrm{X}^{23} \rightarrow \langle \textrm{Erb},\textrm{{\emph arabia}} \rangle $ \\
+ & $ \textrm{X}^{19} \rightarrow \langle \textrm{myrA},\textrm{{\emph i}} \rangle $ & & $ \textrm{X}^{23} \rightarrow \langle \textrm{mslm lyg},\textrm{{\emph muslim league}} \rangle $ \\
+\hline
+ 22,525 & $ \textrm{X}^{24} \rightarrow \langle \textrm{Drwry},\textrm{{\emph necessary}} \rangle $ &&\\
+ & $ \textrm{X}^{24} \rightarrow \langle \textrm{m\$kl},\textrm{{\emph difficult}} \rangle $ &&\\
+ & $ \textrm{X}^{24} \rightarrow \langle \textrm{mkml},\textrm{{\emph completed}} \rangle $ && \\
+ & $ \textrm{X}^{24} \rightarrow \langle \textrm{jA},\textrm{{\emph being}} \rangle $ &&\\
+ & $ \textrm{X}^{24} \rightarrow \langle \textrm{AjAzt},\textrm{{\emph allowed}} \rangle $&& \\
+\hline
+
+\end{tabular}
+\end{center}
+\label{tab:npexample2}
+\end{table}%
+
+
\subsection{Context types}
diff --git a/report/pyp_clustering/format-grammar-latex.pl b/report/pyp_clustering/format-grammar-latex.pl
new file mode 100755
index 00000000..e1fe3e45
--- /dev/null
+++ b/report/pyp_clustering/format-grammar-latex.pl
@@ -0,0 +1,21 @@
+#!/usr/bin/perl -w
+use strict;
+my $x = '';
+while(<>){
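+ if (/^$/) { print "\\hline\n"; next; } # blank line: end of one category's block
+ if (/^(\d+)$/) { # digits-only line: rule count for the next category
+ $x=$1; # the substitutions below add thousands separators (4 to 9 digits)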
+ $x=~s/^(\d\d\d)(\d\d\d)(\d\d\d)$/$1,$2,$3/;
+ $x=~s/^(\d\d)(\d\d\d)(\d\d\d)$/$1,$2,$3/;
+ $x=~s/^(\d)(\d\d\d)(\d\d\d)$/$1,$2,$3/;
+ $x=~s/^(\d\d\d)(\d\d\d)$/$1,$2/;
+ $x=~s/^(\d\d)(\d\d\d)$/$1,$2/;
+ $x=~s/^(\d)(\d\d\d)$/$1,$2/;
+ next;
+ }
+ s/ \|\|\| LHSProb.*$//; s/ \|\|\| / \\rightarrow \\langle \\textrm{/; s/\[X(\d+)\]/\\textrm{X}^{$1}/;
+ s/ \|\|\| /},\\textrm{{\\emph /;
+ chomp;
+ print "$x & \$ $_}} \\rangle \$ \\\\\n";
+ $x="";
+}
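
The transformation performed by `format-grammar-latex.pl` can be sketched in Python for readers who want to try it on a small sample. This is an illustrative port, not part of the commit. The input layout is inferred only from the script's regular expressions: a digits-only line carries a rule count, a blank line closes a category block, and a rule line has the form `[X0] ||| source ||| target ||| LHSProb...`. The function name `format_grammar` and the feature value `LHSProb=0.1` in the sample are made up for the example.

```python
import re


def format_grammar(lines):
    """Convert grammar lines into LaTeX table rows (hypothetical port of the Perl script)."""
    rows = []
    rule_count = ""  # shown only on the first rule after a count line
    for raw in lines:
        line = raw.rstrip("\n")
        if line == "":
            # A blank line separates categories.
            rows.append(r"\hline")
            continue
        if re.fullmatch(r"\d+", line):
            # A digits-only line is the rule count; add thousands separators.
            # Unlike the Perl original, this handles numbers of any length.
            rule_count = f"{int(line):,}"
            continue
        # Drop the trailing feature field (assumed to start with "LHSProb").
        line = re.sub(r" \|\|\| LHSProb.*$", "", line)
        # The first separator becomes the rewrite arrow and opens the source side.
        line = line.replace(" ||| ", r" \rightarrow \langle \textrm{", 1)
        # [X12] becomes X with a superscript category index.
        line = re.sub(r"\[X(\d+)\]", r"\\textrm{X}^{\1}", line, count=1)
        # The second separator closes the source side and opens the target side.
        line = line.replace(" ||| ", r"},\textrm{{\emph ", 1)
        rows.append(rule_count + " & $ " + line + "}} \\rangle $ \\\\")
        rule_count = ""
    return rows


sample = [
    "22386\n",
    "[X0] ||| EdAlt ||| court ||| LHSProb=0.1\n",
    "[X0] ||| lwg ||| people ||| LHSProb=0.1\n",
    "\n",
]
rows = format_grammar(sample)
for row in rows:
    print(row)
```

On this sample the first two rows have the same shape as the left-hand column of the example grammar tables in the diff above; the two-column layout of those tables appears to have been assembled separately.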