\chapter{Posterior Regularization}
Posterior regularization is an alternative way of clustering the phrases.
Unlike a Bayesian approach, where intuitions about the data are expressed through the priors, posterior regularization imposes constraints on posterior distributions of the data.

In this chapter, we introduce a basic clustering model trained with EM
and examine its shortcomings. These shortcomings motivate the
more sophisticated posterior regularized models that follow.
\section{Phrase Clustering Model} 
As a brief recap, the clustering problem we are working with
is to label phrases with $K$ induced categories, where
$K$ is picked manually.
Phrases are obtained from bi-text data.
We also look at context
 words before and after
phrases as cues for clustering.
The relationship between phrases, contexts and categories is
represented with the generative model shown in
Figure \ref{fig:EM}: a phrase picks a
category, and that category then generates the context for the phrase.

\begin{figure}[h]
  \centering
  \includegraphics[width=3.0in]{pr-clustering/EMdigram}
  \caption{Basic Phrase Clustering Model}
  \label{fig:EM}
\end{figure}

The joint probability of a category $z$ and a context $\textbf{c}$ 
given a phrase $\textbf{p}$ is
\[
P(z,\textbf{c}|\textbf{p})=P(z|\textbf{p})P(\textbf{c}|z).
\]
$P(z|\textbf{p})$ is the distribution of categories given a phrase;
it is learned from the data.
$P(\textbf{c}|z)$ is the distribution of contexts given a category.
Since a context usually contains multiple slots for words, we further
decompose this distribution into independent distributions at
each slot. For example, suppose a context consists of two positions 
before and after the phrase. Denote these words as 
$c_{-2},c_{-1},c_1,c_2$.
Using $P_{-2},P_{-1},P_1,P_2$ to denote the distributions of words at each
position, $P(\textbf{c}|z)$ decomposes as
\[
P(\textbf{c}|z)=P_{-2}(c_{-2}|z)P_{-1}
(c_{-1}|z)P_1(c_1|z)P_2(c_2|z).
\]
The posterior probability of a category given a phrase
and a context can be computed by normalizing the joint probability:
\[
P(z|\textbf{p},\textbf{c})=\frac{P(z,\textbf{c}|\textbf{p})}
{\sum_{z'=1}^{K}P(z',\textbf{c}|\textbf{p})}.
\]
With the posterior probabilities computed, we can
apply EM to learn all of the model's parameters.
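To make the procedure concrete, here is a minimal sketch of this EM loop on toy data for a single phrase; the variable names and the randomly generated data are illustrative, not our actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

K, V = 3, 10                      # number of categories, context vocabulary size
# Toy data for one phrase: each row holds the context word ids
# (c_-2, c_-1, c_1, c_2) at one occurrence of the phrase.
contexts = rng.integers(0, V, size=(20, 4))

p_z = np.full(K, 1.0 / K)                        # P(z | p)
p_slot = rng.dirichlet(np.ones(V), size=(4, K))  # p_slot[s, z] = P_s(. | z)

for _ in range(50):
    # E-step: P(z | p, c_i) proportional to P(z | p) * prod_s P_s(c_s | z)
    like = np.ones((len(contexts), K))
    for s in range(4):
        like *= p_slot[s, :, contexts[:, s]]     # shape (occurrences, K)
    joint = like * p_z
    q = joint / joint.sum(axis=1, keepdims=True)

    # M-step: re-estimate P(z | p) and each slot distribution from q
    p_z = q.sum(axis=0) / q.sum()
    for s in range(4):
        onehot = np.eye(V)[contexts[:, s]]       # (occurrences, V)
        counts = q.T @ onehot + 1e-9             # expected word counts per category
        p_slot[s] = counts / counts.sum(axis=1, keepdims=True)
```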
\section{Sparsity Constraints}\label{sec:pr-sparse}
A common linguistic intuition we have about the phrase 
clustering problem is that a phrase should be put into very
few categories, e.g. a verb phrase is unlikely to be used as 
a noun phrase. In other words, the categorization of
a phrase should be sparse.
The generative model proposed above, trained with EM,
allows a phrase to be labelled with many categories. Indeed, we observed
in its output that EM uses more categories per phrase
than we would like.
Posterior regularization
provides a way to enforce sparsity \citep{ganchev:penn:2009}.
The general idea of posterior regularization is to modify the
E-step of EM so that, instead of using the posterior distribution
as the $q$ distribution directly, we find the $q$ nearest to the posterior
within a constrained space, as shown in Figure \ref{fig:EMPR}.

\begin{figure}[h]
  \centering
  \includegraphics[width=3.5in]{pr-clustering/EMPR}
  \caption{EM with posterior regularization}
  \label{fig:EMPR}
\end{figure}

The constraint we use here is called $l_1/ l_\infty$
regularization in Ganchev's technical report, whose notation
we largely follow.
In a more mathematical formulation, for each phrase $\textbf{p}$,
we want the quantity 
\[\sum_{z=1}^K \max_i P(z|\textbf{p},\textbf{c}_i) \]
to be small, where $\textbf{c}_i$ is the context
around the $i$th occurrence of phrase $\textbf{p}$
in the data. This quantity roughly counts
the number of categories phrase $\textbf{p}$ uses.
It is minimized to $1$ if and only if
the posterior distributions $P(z|\textbf{p},\textbf{c}_i)$
are the same
for all
occurrences of $\textbf{p}$. That is,
$\forall i,j$, $P(z|\textbf{p},\textbf{c}_i)=P(z|\textbf{p},\textbf{c}_j)$. 
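This quantity is straightforward to compute from a matrix of posteriors; the values below are made up for illustration.

```python
import numpy as np

# Rows: occurrences of one phrase; columns: K = 3 categories.
# Each row is a posterior P(z | p, c_i); the numbers are illustrative.
post = np.array([
    [0.9, 0.05, 0.05],
    [0.8, 0.10, 0.10],
    [0.1, 0.70, 0.20],
])

# Sum over categories of the maximum posterior across occurrences.
measure = post.max(axis=0).sum()   # 0.9 + 0.7 + 0.2 = 1.8
```

If all three rows were identical, the measure would attain its minimum of 1.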

Define feature functions for $i$th occurrence of phrase $\textbf{p}$
with category $j$,
as a function of category $z$:
\[
\phi_{\textbf{p}ij}(z)=
\begin{cases}
1 & \text{if } z=j\\
0 & \text{otherwise}
\end{cases}.
\]
For notational simplicity, for
each phrase-category pair,
define a variable $c_{\textbf{p}z}$ that at the optimum
equals $\max_i E_q[\phi_{\textbf{p}iz}]$.
The objective we want to optimize becomes:
\[
\arg\min_{q,c_{\textbf{p}z}} KL(q||P_{\theta}) + 
\sigma \sum_{\textbf{p},z}c_{\textbf{p}z}
\]
\[
\text{ s.t. }\forall \textbf{p},i,z,\quad
E_q[\phi_{\textbf{p}iz}]\leq c_{\textbf{p}z},
\]
where $\sigma$ is a constant to control
how strongly the soft constraint should
be enforced.
Using Lagrange Multipliers, this objective can
be optimized in its dual form,
for each phrase $\textbf{p}$:
\[
\arg\min_{\lambda\geq 0} \sum_i \log
\sum_{z} P_\theta(z|\textbf{p},\textbf{c}_i)
\exp (-\lambda_{\textbf{p}iz})
\]
\[
\text{ s.t. } \forall z,\quad
\sum_i \lambda_{\textbf{p}iz}\leq \sigma.
\]
This dual objective can be optimized with projected gradient
descent.
The $q$ distribution we are looking for is then
\[
q_i(z)\propto P_{\theta}(z|\textbf{p},\textbf{c}_i)
\exp(-\lambda_{\textbf{p}iz}).
\]
M-step can be performed as usual.
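The projected gradient loop for the dual can be sketched as follows. This is an illustrative implementation for a single phrase, using the standard sorting-based simplex projection for the constraint set; it is not our exact code.

```python
import numpy as np

def project_col(v, sigma):
    """Euclidean projection of v onto {u >= 0, sum(u) <= sigma}."""
    w = np.maximum(v, 0.0)
    if w.sum() <= sigma:
        return w
    # Sorting-based projection onto the simplex of radius sigma
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - sigma
    idx = np.arange(1, len(u) + 1)
    cond = u - css / idx > 0
    rho = idx[cond][-1]
    theta = css[cond][-1] / rho
    return np.maximum(v - theta, 0.0)

def pr_estep(post, sigma, lr=0.1, iters=200):
    """post[i, z] = P_theta(z | p, c_i) over occurrences i of one phrase.
    Returns q_i(z) proportional to post * exp(-lambda),
    with sum_i lambda_{iz} <= sigma for each category z."""
    lam = np.zeros_like(post)
    for _ in range(iters):
        w = post * np.exp(-lam)
        q = w / w.sum(axis=1, keepdims=True)
        # The gradient of sum_i log sum_z post * exp(-lambda)
        # with respect to lambda_{iz} is -q_{iz}, so a descent step
        # moves lambda in the +q direction, followed by projection.
        lam = lam + lr * q
        for z in range(post.shape[1]):
            lam[:, z] = project_col(lam[:, z], sigma)
    w = post * np.exp(-lam)
    return w / w.sum(axis=1, keepdims=True)
```

Running this on a small posterior matrix shrinks the $l_1/l_\infty$ measure of the resulting $q$ relative to the unregularized posteriors.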
\section{Agreement Models}\label{sec:pr-agree}
Another type of constraint we used is agreement between
different models. We constructed a similar generative
model in the reverse direction, shown in
Figure \ref{fig:EMreverse}, and required the two models to agree. We also
took advantage of the bi-text data we have and made models
learned from different languages agree with each other.

\begin{figure}[h]
  \centering
  \includegraphics[width=3.0in]{pr-clustering/EMreverse}
  \caption{The reversed generative model}
  \label{fig:EMreverse}
\end{figure}

In the reversed model,
the posterior probability of the labelling of
a context $\textbf{c}$ with
phrase $\textbf{p}$ is 
\[
P(z|\textbf{c},\textbf{p})\propto 
P(z|\textbf{c})P(\textbf{p}|z).
\]
Since a phrase contains a variable number of words,
we only look at the first and last word of
a phrase. That is, $P(\textbf{p}|z)=P_1(p_1|z)P_n(p_n|z)$,
where $n$ is the length of $\textbf{p}$, and $P_1$ and $P_n$
denote the distributions of words in the first and last positions
of a phrase given a category.

The implementation of the agreement models again amounts to
a small change to the E-step. The $q$ distribution for
a phrase $\textbf{p}$ and a context $\textbf{c}$ 
is given by
\[
q(z)=\sqrt{P_{\theta 1}
(z|\textbf{p},\textbf{c})P_{\theta 2}(z|\textbf{p},\textbf{c})},
\]
where $P_{\theta 1}$ and $P_{\theta 2}$ are
posterior distributions for two models.
In M-step, both models should update their parameters with the
same $q$ distribution computed as above.
This modified EM maximizes the objective:
\[
\mathcal{L}_1+
\mathcal{L}_2+
\sum_{\textbf{p},\textbf{c}}
\log\sum_z\sqrt{P_{\theta 1}(z|\textbf{p},\textbf{c})
P_{\theta 2}(z|\textbf{p},\textbf{c})},
\]
where $\mathcal{L}_1$ and $\mathcal{L}_2$
are the log-likelihoods of the
two models.
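The agreement E-step itself is a one-liner; a toy illustration follows, with made-up posterior values. We renormalize here so that $q$ is a proper distribution before it is used in the M-step.

```python
import numpy as np

# Posteriors over K = 3 categories from the two models for one
# (phrase, context) pair; the values are illustrative.
p1 = np.array([0.7, 0.2, 0.1])
p2 = np.array([0.5, 0.4, 0.1])

# q is the geometric mean of the two posteriors,
# renormalized so it sums to one.
q = np.sqrt(p1 * p2)
q /= q.sum()
```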
\section{Experiments}
As a sanity check, we looked at a few examples produced by
the basic model (EM) 
and the posterior regularization (PR) model
with sparsity constraints. Table \ref{tab:EMVSPR}
shows a few examples.

\begin{table}[h]
  \centering
  \includegraphics[width=3.5in]{pr-clustering/EMVSPR}
  \caption[A few examples comparing EM and PR]
  {A few examples comparing EM and PR. 
    The count of the most frequent category shows how
    many instances of a phrase are concentrated on
    the single most frequent tag.
    The number of categories shows how many categories
    a phrase is labelled with. As discussed above,
    we want each phrase to use few categories.
	These numbers are fair indicators of sparsity.
    }
  \label{tab:EMVSPR}
\end{table}

The models are formally evaluated with two kinds
of metrics. We feed the clustering output
through the whole translation pipeline 
to obtain BLEU scores. We also devised
an intrinsic evaluation of clustering quality
that compares the clusters against a supervised CFG parser trained on a
treebank.

We mainly work on the Urdu-English language pair.
Urdu has very
different word order from English,
which leaves room for improvement over
phrase-based systems.
Here in Table \ref{tab:results} 
we show BLEU scores as well as
conditional entropy for each of the models above
on the Urdu data. Conditional entropy is computed
as the entropy of the ``gold'' labelling given
the predicted clustering. The ``gold'' labelling
distribution is obtained from the Collins parser
trained on the Penn Treebank. Since not
all phrases are constituents, we ignored
phrases that do not correspond to any constituent.
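As an illustration of this metric, conditional entropy can be computed from (predicted cluster, gold label) pairs as follows; the pairs below are invented for the example.

```python
import math
from collections import Counter

# (predicted cluster, gold constituent label) for each phrase instance
# that aligns with a constituent; the pairs are invented for illustration.
pairs = [(0, "NP"), (0, "NP"), (0, "VP"),
         (1, "VP"), (1, "VP"), (1, "VP")]

joint = Counter(pairs)               # counts of (predicted, gold) pairs
pred = Counter(p for p, _ in pairs)  # counts of predicted clusters
n = len(pairs)

# H(Gold | Predicted) = -sum_{p,g} P(p,g) * log2 P(g | p)
h = 0.0
for (p, g), c in joint.items():
    h -= (c / n) * math.log2(c / pred[p])
```

Here cluster 1 maps purely to VP and contributes nothing, while cluster 0 mixes NP and VP and accounts for all of the entropy.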

\begin{table}[h]
  \centering
  \begin{tabular}{ |*{3}{c|} }
    \hline
    model & BLEU & H(Gold$|$Predicted)\\
    \hline
    hiero & 21.1 & 5.77\\
    hiero+POS & 22.3 & 1.00 \\
    SAMT & 24.5 & 0.00 \\
    \hline
    EM & 20.9 & 2.86 \\
    PR $\sigma=100$ & 21.7 & 2.36 \\
    agree language & 21.7 & 2.68 \\
    agree direction & 22.1 & 2.35\\
    non-parametric & 22.2 & ?\\
    \hline
  \end{tabular}
    \caption
  {Evaluation of the PR models.
	The left column shows BLEU scores
	obtained through the translation pipeline.
	The right column shows the conditional entropy
	of the gold labelling given the predicted clustering.
    }
  \label{tab:results}
\end{table}

In Table \ref{tab:results}, the first three rows
are baseline systems; the rest were developed in the workshop.
Hiero is a hierarchical phrase-based
model with a single category in all of its SCFG rules. Hiero+POS
is hiero with all words labelled with their POS tags.
SAMT is a syntax-based system with a supervised
parser trained on a treebank. EM is the first model introduced
at the beginning of this chapter. PR $\sigma=100$ is the
posterior regularization model with the sparsity constraint
explained in Section \ref{sec:pr-sparse},
where $\sigma$ is the constant that controls the strength of the constraint.
Agree language and agree direction are the models with agreement
constraints described in Section \ref{sec:pr-agree}. Non-parametric
is the non-parametric model introduced in the previous chapter.
\section{Conclusion}