\chapter{Posterior Regularization}
Posterior regularization is an alternative way of clustering phrases.
Unlike a Bayesian approach, where intuitions about the data are expressed through priors, posterior regularization imposes constraints on the posterior distributions of the data.

In this chapter, we introduce a basic clustering model trained with EM
and examine its shortcomings, which motivate the
more sophisticated posterior regularized models.
\section{Phrase Clustering Model} 
As a brief recap, the clustering problem we are working with
is to label phrases with $K$ induced categories, where
$K$ is picked manually.
Phrases are obtained from bi-text data.
We also look at context
 words before and after
phrases as cues for clustering.
The relationship between phrases, contexts, and categories is
represented by the generative model shown in
Figure \ref{fig:EM}: a phrase picks a
category, and that category then generates the context for the phrase.

\begin{figure}[h]
  \centering
  \includegraphics[width=3.0in]{pr-clustering/EMdigram}
  \caption{Basic Phrase Clustering Model}
  \label{fig:EM}
\end{figure}

The joint probability of a category $z$ and a context $\textbf{c}$ 
given a phrase $\textbf{p}$ is
\[
P(z,\textbf{c}|\textbf{p})=P(z|\textbf{p})P(\textbf{c}|z).
\]
$P(z|\textbf{p})$ is the distribution of categories given a phrase;
it is learned from the data.
$P(\textbf{c}|z)$ is the distribution of contexts given a category.
Since a context usually contains multiple word slots, we further
decompose this distribution into independent distributions, one per
slot. For example, suppose a context consists of two positions
before and after the phrase. Denote these words as
$c_{-2},c_{-1},c_1,c_2$, and let
$P_{-2},P_{-1},P_1,P_2$ denote the word distributions at each
position. Then $P(\textbf{c}|z)$ decomposes as
\[
P(\textbf{c}|z)=P_{-2}(c_{-2}|z)P_{-1}
(c_{-1}|z)P_1(c_1|z)P_2(c_2|z).
\]
The posterior probability of a category given a phrase
and a context can be computed by normalizing the joint probability:
\[
P(z|\textbf{p},\textbf{c})=\frac{P(z,\textbf{c}|\textbf{p})}
{\sum_{z'=1}^{K}P(z',\textbf{c}|\textbf{p})}.
\]
With these posterior probabilities, we can
apply EM to learn all of the distributions.
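For concreteness, the sketch below shows one way the E-step and M-step for
this model could be implemented. It is an illustration rather than the actual
workshop code: the data layout, variable names, and smoothing constant are our
own choices, and parameter initialization is omitted.
\begin{verbatim}
# Illustrative sketch of EM for the basic phrase clustering model.
# `data` is a list of (phrase, context) pairs, where a context is a
# tuple of slot words such as (c_-2, c_-1, c_1, c_2).  Parameters must
# be initialized (e.g., randomly) before the first E-step.
import numpy as np
from collections import defaultdict

def e_step(data, p_z_given_p, p_w_given_z, K):
    """Posteriors q[i][z] = P(z | phrase_i, context_i)."""
    posteriors = []
    for phrase, context in data:
        scores = p_z_given_p[phrase].copy()       # P(z | p), length-K array
        for slot, word in enumerate(context):
            scores *= p_w_given_z[slot][word]     # P_slot(word | z)
        posteriors.append(scores / scores.sum())  # normalize over categories
    return posteriors

def m_step(data, posteriors, K, n_slots, smooth=0.01):
    """Re-estimate P(z|p) and each slot's word distribution."""
    phrase_counts = defaultdict(lambda: np.full(K, smooth))
    slot_counts = [defaultdict(lambda: np.full(K, smooth))
                   for _ in range(n_slots)]
    for (phrase, context), q in zip(data, posteriors):
        phrase_counts[phrase] += q                # expected counts
        for slot, word in enumerate(context):
            slot_counts[slot][word] += q
    p_z_given_p = {p: c / c.sum() for p, c in phrase_counts.items()}
    p_w_given_z = []
    for slot in range(n_slots):
        total = sum(slot_counts[slot].values())   # length-K normalizer
        p_w_given_z.append({w: c / total
                            for w, c in slot_counts[slot].items()})
    return p_z_given_p, p_w_given_z
\end{verbatim}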
\section{Sparsity Constraints}\label{sec:pr-sparse}
A common linguistic intuition about the phrase
clustering problem is that a phrase should be assigned to very
few categories; e.g., a verb phrase is unlikely to also be used as
a noun phrase. In other words, the categorization of
a phrase should be sparse.
The generative model proposed above, trained with EM,
allows a phrase to be labelled with many categories, and indeed
we observed in its output that EM uses more categories per phrase
than we would like.
Posterior regularization
provides a way to enforce sparsity \citep{ganchev:penn:2009}.
The general idea of posterior regularization is to modify the
E-step of EM: instead of using the posterior distribution
directly as the $q$ distribution, we find the nearest $q$
in a constrained space, as shown in Figure \ref{fig:EMPR}.

\begin{figure}[h]
  \centering
  \includegraphics[width=3.5in]{pr-clustering/EMPR}
  \caption{EM with posterior regularization}
  \label{fig:EMPR}
\end{figure}

The constraint we use here is called $l_1/ l_\infty$
regularization in Ganchev's technical report, and our notation
largely follows Ganchev's.
In a more mathematical formulation, for each phrase $\textbf{p}$,
we want the quantity 
\[\sum_{z=1}^K \max_i P(z|\textbf{p},\textbf{c}_i) \]
to be small, where $\textbf{c}_i$ is the context
surrounding the $i$th occurrence of phrase $\textbf{p}$
in the data. This quantity roughly corresponds to
the number of categories phrase $\textbf{p}$ uses.
It attains its minimum value of $1$ if and only if
the posterior distributions $P(z|\textbf{p},\textbf{c}_i)$
are identical for all
occurrences of $\textbf{p}$, that is,
$\forall i,j$: $P(z|\textbf{p},\textbf{c}_i)=P(z|\textbf{p},\textbf{c}_j)$.
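This quantity is easy to compute once the posteriors of all occurrences of a
phrase are collected into a matrix with one row per occurrence; the short
sketch below (with invented names) does exactly that.
\begin{verbatim}
import numpy as np

def l1linf_penalty(posteriors_by_phrase):
    """posteriors_by_phrase maps each phrase to an (occurrences x K)
    array of posteriors P(z | p, c_i).  For each phrase, take the
    maximum over occurrences in every category and sum over categories;
    the total over phrases is the sparsity penalty."""
    return sum(q.max(axis=0).sum()
               for q in posteriors_by_phrase.values())
\end{verbatim}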

Define a feature function for the $i$th occurrence of phrase $\textbf{p}$
with category $j$,
as a function of the category $z$:
\[
\phi_{\textbf{p}ij}(z)=
\begin{cases}
1 & \text{if } z=j\\
0 & \text{otherwise}
\end{cases}.
\]
For notational simplicity, for
each phrase-category pair,
define a variable $c_{\textbf{p}z}$ that will,
at the optimum, equal $\max_i E_q[\phi_{\textbf{p}iz}]$.
The objective we want to optimize becomes:
\[
\arg\min_{q,c_{\textbf{p}z}} KL(q||P_{\theta}) + 
\sigma \sum_{\textbf{p},z}c_{\textbf{p}z}
\]
\[
\text{ s.t. }\forall \textbf{p},i,z,\quad
E_q[\phi_{\textbf{p}iz}]\leq c_{\textbf{p}z},
\]
where $\sigma$ is a constant controlling
how strongly the soft constraint is
enforced.
Using Lagrange multipliers, this objective can
be optimized in its dual form. For each phrase $\textbf{p}$:
\[
\arg\min_{\lambda\geq 0} \sum_i \log
\sum_{z} P_\theta(z|\textbf{p},\textbf{c}_i)
\exp (-\lambda_{\textbf{p}iz})
\]
\[
\text{ s.t. } \forall \textbf{p},z,
\sum_i \lambda_{\textbf{p}iz}\leq \sigma.
\]
This dual objective can be optimized with projected gradient
descent.
The $q$ distribution we are looking for is then
\[
q_i(z)\propto P_{\theta}(z|\textbf{p},\textbf{c}_i)
\exp(-\lambda_{\textbf{p}iz}).
\]
The M-step can then be performed as usual.
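To make the modified E-step concrete, the sketch below runs projected gradient
descent on the dual for a single phrase. It is only an illustration of the
technique: the step size, iteration count, and the projection helper are our
own choices, not the workshop implementation.
\begin{verbatim}
import numpy as np

def project_nonneg_l1_ball(v, sigma):
    """Euclidean projection of v onto {x : x >= 0, sum(x) <= sigma}."""
    x = np.maximum(v, 0.0)
    if x.sum() <= sigma:
        return x
    # otherwise project onto the scaled simplex {x >= 0, sum(x) = sigma}
    u = np.sort(x)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > css - sigma)[0][-1]
    tau = (css[rho] - sigma) / (rho + 1.0)
    return np.maximum(x - tau, 0.0)

def pr_e_step_for_phrase(posteriors, sigma, step=0.1, n_iters=100):
    """posteriors: (occurrences x K) matrix of P(z | p, c_i) for one
    phrase.  Returns the constrained q, one row per occurrence."""
    lam = np.zeros_like(posteriors)
    for _ in range(n_iters):
        q = posteriors * np.exp(-lam)       # q_i(z) before normalization
        q /= q.sum(axis=1, keepdims=True)
        lam += step * q                     # gradient of the dual is -q
        # project each category's column onto the feasible set
        for z in range(posteriors.shape[1]):
            lam[:, z] = project_nonneg_l1_ball(lam[:, z], sigma)
    q = posteriors * np.exp(-lam)
    return q / q.sum(axis=1, keepdims=True)
\end{verbatim}
Intuitively, a larger $\sigma$ allows larger dual variables, which push the
$q$ distributions of all occurrences of a phrase toward the same few
categories.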
\section{Agreement Models}\label{sec:pr-agree}
Another type of constraint we used is agreement between
different models. We constructed a similar generative
model in the reverse direction, shown in Figure \ref{fig:EMreverse},
for the original model to agree with. We also
took advantage of the bi-text data and made models
trained on different languages agree with each other.

\begin{figure}[h]
  \centering
  \includegraphics[width=3.0in]{pr-clustering/EMreverse}
  \caption{Reversed Phrase Clustering Model}
  \label{fig:EMreverse}
\end{figure}

In the reversed model,
the posterior probability of a category $z$ given
a context $\textbf{c}$ and
a phrase $\textbf{p}$ is
\[
P(z|\textbf{c},\textbf{p})\propto
P(z|\textbf{c})P(\textbf{p}|z).
\]
Since a phrase contains a variable number of words,
we only look at the first and last word of
the phrase. That is, $P(\textbf{p}|z)=P_1(p_1|z)P_n(p_n|z)$,
where $n$ is the length of $\textbf{p}$, and $P_1$ and $P_n$
denote the distributions of the first and last words
of a phrase given a category.

The implementation of the agreement models again amounts to
a small change to the E-step. The $q$ distribution for
a phrase $\textbf{p}$ and a context $\textbf{c}$
is given by
\[
q(z)\propto\sqrt{P_{\theta 1}
(z|\textbf{p},\textbf{c})P_{\theta 2}(z|\textbf{p},\textbf{c})},
\]
where $P_{\theta 1}$ and $P_{\theta 2}$ are the
posterior distributions of the two models.
In the M-step, both models update their parameters with the
same $q$ distribution computed as above.
This modified EM maximizes the objective:
\[
\mathcal{L}_1+
\mathcal{L}_2+
\sum_{\textbf{p},\textbf{c}}
\log\sum_z\sqrt{P_{\theta 1}(z|\textbf{p},\textbf{c})
P_{\theta 2}(z|\textbf{p},\textbf{c})},
\]
where $\mathcal{L}_1$ and $\mathcal{L}_2$
are the log-likelihoods of the
two models.
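A minimal sketch of this agreement E-step, assuming each model's posterior for
an instance is available as a length-$K$ vector (the function and variable
names are our own):
\begin{verbatim}
import numpy as np

def agreement_q(posterior_model1, posterior_model2):
    """q(z) proportional to the geometric mean of the two posteriors."""
    q = np.sqrt(posterior_model1 * posterior_model2)
    return q / q.sum()
\end{verbatim}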
\section{Experiments}
As a sanity check, we inspected the output of
the basic model (EM)
and of the posterior regularization (PR) model
with sparsity constraints. Table \ref{tab:EMVSPR}
shows a few examples.

\begin{table}[h]
  \centering
  \includegraphics[width=3.5in]{pr-clustering/EMVSPR}
  \caption[A few examples comparing EM and PR]
  {A few examples comparing EM and PR.
    The count of the most frequent category shows how
    many instances of a phrase are concentrated on
    the single most frequent tag.
    The number of categories shows how many categories
    a phrase is labelled with. As discussed above,
    we want a phrase to use fewer categories.
	These numbers are fair indicators of sparsity.
    }
  \label{tab:EMVSPR}
\end{table}

The models are formally evaluated with two kinds
of metrics. We feed the clustering output
through the whole translation pipeline
to obtain a BLEU score. We also devised
an intrinsic evaluation of clustering quality
that compares the clusters against a supervised CFG parser
trained on a treebank.

We work mainly on the Urdu-English language pair.
Urdu has very
different word ordering from English,
which leaves room for improvement over
phrase-based systems.
Table \ref{tab:results}
shows BLEU scores as well as
conditional entropy for each of the models above
on the Urdu data. Conditional entropy is computed
as the entropy of the ``gold'' labelling given
the predicted clustering. The ``gold'' labelling
distribution
is obtained from the Collins parser
trained on the Penn Treebank. Since not
all phrases are constituents, we ignore
phrases that do not correspond to any constituent.
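The conditional entropy can be computed from co-occurrence counts of gold
labels and predicted categories over the retained phrase instances, as in the
sketch below (names are our own; we use base-2 logarithms).
\begin{verbatim}
import math
from collections import Counter

def conditional_entropy(pairs):
    """pairs: iterable of (gold_label, predicted_category), one per
    phrase instance that corresponds to a constituent.  Returns
    H(Gold | Predicted) in bits."""
    joint = Counter(pairs)
    total = sum(joint.values())
    predicted_marginal = Counter()
    for (gold, pred), count in joint.items():
        predicted_marginal[pred] += count
    h = 0.0
    for (gold, pred), count in joint.items():
        p_joint = count / total
        p_gold_given_pred = count / predicted_marginal[pred]
        h -= p_joint * math.log2(p_gold_given_pred)
    return h
\end{verbatim}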

\begin{table}[h]
  \centering
  \begin{tabular}{ |*{3}{c|} }
    \hline
    model & BLEU & H(Gold$|$Predicted)\\
    \hline
    hiero & 21.1 & 5.77\\
    hiero+POS & 22.3 & 1.00 \\
    SAMT & 24.5 & 0.00 \\
    \hline
    EM & 20.9 & 2.86 \\
    PR $\sigma=100$ & 21.7 & 2.36 \\
    agree language & 21.7 & 2.68 \\
    agree direction & 22.1 & 2.35\\
    non-parametric & 22.2 & ?\\
    \hline
  \end{tabular}
    \caption
  {Evaluation of PR models.
	The middle column shows BLEU scores obtained
	through the translation pipeline.
	The right column shows the conditional entropy
	of the gold labelling given the predicted clustering.
    }
  \label{tab:results}
\end{table}

In Table \ref{tab:results}, the first three rows
are baseline systems; the rest were developed during the workshop.
Hiero is a hierarchical phrase-based
model with a single category in all of its SCFG rules. Hiero+POS
is Hiero with all words labelled with their POS tags.
SAMT is a syntax-based system with a supervised
parser trained on the Penn Treebank. EM is the basic model described
at the beginning of this chapter. PR $\sigma=100$ is the
posterior regularization model with the sparsity constraint
explained in Section \ref{sec:pr-sparse};
$\sigma$ is the constant that controls the strength of the constraint.
Agree language and agree direction are the models with agreement
constraints described in Section \ref{sec:pr-agree}. Non-parametric
is the non-parametric model introduced in the previous chapter.
\section{Conclusion}