\chapter{Posterior Regularization}
Posterior regularization is an alternative way of clustering the phrases. Unlike a Bayesian approach, where intuitions about the data are expressed through priors, posterior regularization imposes constraints on the posterior distributions of the data. In this chapter, we introduce a basic clustering model trained with EM and examine its shortcomings, which motivate the more sophisticated posterior-regularized models that follow.
\section{Phrase Clustering Model}
As a brief recap, the clustering problem we are working with is to label phrases with $K$ induced categories, where $K$ is picked manually. Phrases are obtained from bi-text data, and we also use the context words before and after each phrase as cues for clustering. The relationship between phrases, contexts and categories is represented with the generative model shown in Figure \ref{fig:EM}: a phrase picks a category, and that category then generates the context for the phrase.
\begin{figure}[h]
\centering
\includegraphics[width=3.0in]{pr-clustering/EMdigram}
\caption{Basic Phrase Clustering Model}
\label{fig:EM}
\end{figure}
The joint probability of a category $z$ and a context $\textbf{c}$ given a phrase $\textbf{p}$ is
\begin{equation}\label{eq:NBPost}
P(z,\textbf{c}|\textbf{p})=P(z|\textbf{p})P(\textbf{c}|z).
\end{equation}
$P(z|\textbf{p})$ is the distribution of categories given a phrase, which is learned from the data. $P(\textbf{c}|z)$ is the distribution of contexts given a category. Since a context usually contains several word slots, we further decompose this distribution into independent distributions, one per slot. For example, suppose a context consists of the two positions before and the two positions after the phrase. Denote these words as $c_{-2},c_{-1},c_1,c_2$, and let $P_{-2},P_{-1},P_1,P_2$ denote the word distributions at each position. Then $P(\textbf{c}|z)$ decomposes as
\[
P(\textbf{c}|z)=P_{-2}(c_{-2}|z)P_{-1}(c_{-1}|z)P_1(c_1|z)P_2(c_2|z).
\]
The posterior probability of a category given a phrase and a context is obtained by normalizing the joint probability:
\[
P(z|\textbf{p},\textbf{c})=\frac{P(z,\textbf{c}|\textbf{p})}
{\sum_{z'=1}^{K}P(z',\textbf{c}|\textbf{p})}.
\]
With these posterior probabilities in hand, we can apply the expectation-maximization (EM) algorithm to learn all the probabilities. EM maximizes the data log-likelihood
\[
\mathcal{L}=
\sum_{\textbf{p},\textbf{c}} \log \sum_{z=1}^K P(z,\textbf{c}|\textbf{p})
\]
by iterating between two steps: the E-step computes the posterior distributions by normalizing Equation \ref{eq:NBPost}, and the M-step updates the model parameters with maximum likelihood estimation as in Equation \ref{eq:MStep}:
\begin{equation}\label{eq:MStep}
\boldsymbol{\theta}=
\arg\max_{\boldsymbol{\theta}}
\sum_{\textbf{c},\textbf{p}}\sum_z
P(z|\textbf{p},\textbf{c},\boldsymbol\theta^{old})
\log P(z,\textbf{c}|\textbf{p},\boldsymbol{\theta}).
\end{equation}
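To make the training loop concrete, the following is a minimal sketch of this EM procedure in Python/NumPy. It is only an illustrative sketch: it assumes that phrases, context words and categories have already been mapped to integer ids, and the function and array names are illustrative rather than taken from the actual implementation.
\begin{verbatim}
import numpy as np

def em_cluster(phrases, contexts, n_phrase, n_vocab, K, n_iter=50, seed=0):
    """phrases: (N,) phrase ids; contexts: (N, S) context word ids, one column per slot."""
    rng = np.random.default_rng(seed)
    N, S = contexts.shape
    # P(z | p): a categorical over the K categories for every phrase type.
    p_z_given_p = rng.dirichlet(np.ones(K), size=n_phrase)
    # P_s(c | z): a categorical over the vocabulary for every slot s and category z.
    p_c_given_z = rng.dirichlet(np.ones(n_vocab), size=(S, K))

    for _ in range(n_iter):
        # E-step: posterior over categories for every occurrence,
        # obtained by normalizing P(z|p) * prod_s P_s(c_s|z).
        post = p_z_given_p[phrases]                        # (N, K)
        for s in range(S):
            post = post * p_c_given_z[s][:, contexts[:, s]].T  # multiply in slot s
        post /= post.sum(axis=1, keepdims=True)

        # M-step: maximum-likelihood re-estimation from expected counts.
        p_z_given_p = np.full((n_phrase, K), 1e-9)         # tiny smoothing avoids 0/0
        np.add.at(p_z_given_p, phrases, post)
        p_z_given_p /= p_z_given_p.sum(axis=1, keepdims=True)

        p_c_given_z = np.full((S, K, n_vocab), 1e-9)
        for s in range(S):
            np.add.at(p_c_given_z[s].T, contexts[:, s], post)
        p_c_given_z /= p_c_given_z.sum(axis=2, keepdims=True)
    return p_z_given_p, p_c_given_z, post
\end{verbatim}
Each iteration scores every phrase occurrence against all $K$ categories, so the cost per iteration is linear in the number of occurrences; a context of one word on each side of the phrase corresponds to $S=2$ in this sketch.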
\section{Sparsity Constraints}\label{sec:pr-sparse}
A common linguistic intuition about the phrase clustering problem is that a phrase should be assigned to very few categories; for example, a verb phrase is unlikely to also be used as a noun phrase. In other words, the categorization of a phrase should be sparse. The generative model above, trained with EM, allows a phrase to be labelled with many categories, and indeed we observed in its output that EM uses more categories per phrase than we would like. Posterior regularization provides a way to enforce sparsity \citep{ganchev:penn:2009}.

The general idea of posterior regularization is to modify the E-step of EM so that, instead of using the posterior distribution as the $q$ distribution directly, we use the nearest $q$ in a constrained space, as shown in Figure \ref{fig:EMPR}.
\begin{figure}[h]
\centering
\includegraphics[width=3.5in]{pr-clustering/EMPR}
\caption{EM with posterior regularization}
\label{fig:EMPR}
\end{figure}
The constraint we use here is called $L_1/L_\infty$ regularization in Ganchev's technical report, and our notation largely follows his. More formally, for each phrase $\textbf{p}$ we want the quantity
\begin{equation}\label{eq:sparsity}
\sum_{z=1}^K \max_i P(z|\textbf{p},\textbf{c}_i)
\end{equation}
to be small, where $\textbf{c}_i$ is the context observed around the $i$th occurrence of phrase $\textbf{p}$ in the data. This quantity roughly counts the number of categories that phrase $\textbf{p}$ uses. It attains its minimum value of $1$ if and only if the posterior distributions $P(z|\textbf{p},\textbf{c}_i)$ are the same for all occurrences of $\textbf{p}$, that is, $\forall i,j$, $P(z|\textbf{p},\textbf{c}_i)=P(z|\textbf{p},\textbf{c}_j)$. For example, Table \ref{tab:sparse} compares the $L_1/L_{\infty}$ norms of two phrases: the first prefers different categories under different contexts, while the second prefers the same category under all contexts and accordingly has a much smaller $L_1/L_{\infty}$ norm.
\begin{table}[h]
\centering
\includegraphics[width=3.5in]{pr-clustering/sparse0}
\caption{$L_1/L_{\infty}$ norm of a phrase that prefers multiple categories}
\label{tab:sparse}
\includegraphics[width=3.5in]{pr-clustering/sparse1}
\caption{$L_1/L_{\infty}$ norm of a phrase that prefers a single category}
\end{table}
Define feature functions for the $i$th occurrence of phrase $\textbf{p}$ with category $j$, as functions of the category $z$:
\[
\phi_{\textbf{p}ij}(z)=
\begin{cases}
1 & \text{if } z=j\\
0 & \text{otherwise}
\end{cases}.
\]
For notational simplicity, for each phrase--category pair define the variable $e_{\textbf{p}z}=\max_i E_q[\phi_{\textbf{p}iz}]$; this is just a different notation for the terms in Expression \ref{eq:sparsity}, with the expectation taken under $q$. The objective we want to optimize becomes:
\[
\arg\min_{q,e_{\textbf{p}z}} KL(q||P_{\theta})
+ \sigma \sum_{\textbf{p},z}e_{\textbf{p}z}
\]
\[
\text{ s.t. }\forall \textbf{p},i,z,\quad
E_q[\phi_{\textbf{p}iz}]\leq e_{\textbf{p}z},
\]
where $\sigma$ is a constant that controls how strongly the soft constraint is enforced. Using Lagrange multipliers, this objective can be optimized in its dual form, for each phrase $\textbf{p}$:
\[
\arg\min_{\lambda\geq 0}
\sum_i \log \sum_{z}
P_\theta(z|\textbf{p},\textbf{c}_i)
\exp (-\lambda_{\textbf{p}iz})
\]
\[
\text{ s.t. } \forall z,\quad
\sum_i \lambda_{\textbf{p}iz}\leq \sigma.
\]
This dual objective can be optimized with projected gradient descent. Notice that each phrase has its own objective and constraints, and the $\lambda$s are not shared across phrases. We can therefore optimize the objective for each phrase separately, which makes the algorithm easy to parallelize and each sub-problem easier to solve. The $q$ distribution we are looking for is then
\[
q_i(z)\propto P_{\theta}(z|\textbf{p},\textbf{c}_i)
\exp(-\lambda_{\textbf{p}iz}).
\]
The M-step is performed as usual, replacing $P(z|\textbf{p},\textbf{c},\boldsymbol\theta^{old})$ with $q(z)$ in Equation \ref{eq:MStep}.
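To illustrate how this constrained E-step can be carried out for a single phrase, the following sketch runs projected gradient descent on the dual above. It is only an illustrative sketch under simplifying assumptions (a fixed step size, a fixed iteration count, a standard Euclidean projection onto the capped simplex, and function names of our own choosing), not a description of the exact implementation used in the experiments.
\begin{verbatim}
import numpy as np

def project_capped(v, sigma):
    """Euclidean projection of v onto {x : x >= 0, sum(x) <= sigma}."""
    w = np.maximum(v, 0.0)
    if w.sum() <= sigma:
        return w
    # Otherwise project onto the scaled simplex {x >= 0, sum(x) = sigma}.
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - sigma
    rho = np.nonzero(u - css / (np.arange(len(v)) + 1.0) > 0)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def sparse_e_step(post, sigma, n_steps=100, step=1.0):
    """post: (I, K) posteriors P(z | p, c_i) for the I occurrences of one phrase.
    Returns q: (I, K), the constrained distributions used in place of post."""
    lam = np.zeros_like(post)
    for _ in range(n_steps):
        q = post * np.exp(-lam)
        q /= q.sum(axis=1, keepdims=True)
        # The gradient of the dual objective w.r.t. lambda_{p,i,z} is -q_i(z),
        # so a descent step moves lambda in the direction of q.
        lam = lam + step * q
        # Enforce the dual constraint: for each z, sum_i lambda_{p,i,z} <= sigma.
        for z in range(post.shape[1]):
            lam[:, z] = project_capped(lam[:, z], sigma)
    q = post * np.exp(-lam)
    return q / q.sum(axis=1, keepdims=True)
\end{verbatim}
Because the $\lambda$s of different phrases never interact, this routine runs independently (and can run in parallel) for every phrase, which is why the constrained E-step remains cheap relative to the rest of EM.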
\section{Agreement Models}\label{sec:pr-agree}
Another type of constraint we used is agreement between different models. The intuition is that if two models are run on the same dataset, it is desirable for them to produce the same output. Suppose two models give the posterior distributions over categories $P_{\theta_1}(z|\textbf{p},\textbf{c})$ and $P_{\theta_2}(z|\textbf{p},\textbf{c})$. The agreement constraint seeks posterior distributions $q_1(z),q_2(z)$ such that $q_1(z)=q_2(z)$ while minimizing the KL-divergence
$KL(q_1(z)q_2(z)\,||\,
P_{\theta_1}(z|\textbf{p},\textbf{c})
P_{\theta_2}(z|\textbf{p},\textbf{c}))$.
The solution for $q_1,q_2$ has a simple closed form, derived with Lagrange multipliers, shown in Equation \ref{eq:AgreePost}:
\begin{equation}\label{eq:AgreePost}
q_1(z)=q_2(z)\propto
\sqrt{P_{\theta_1}(z|\textbf{p},\textbf{c})
P_{\theta_2}(z|\textbf{p},\textbf{c})}.
\end{equation}
Since there is no strong reason to believe that a phrase generates a category which then generates a context, it is natural to also consider a model that generates in the reverse direction, as shown in Figure \ref{fig:EMreverse}; this reversed model and the original model should then agree. We also took advantage of the bi-text data by making models trained on the two languages agree with each other.
\begin{figure}[h]
\centering
\includegraphics[width=3.0in]{pr-clustering/EMreverse}
\caption{Generative Model in the Reverse Direction}
\label{fig:EMreverse}
\end{figure}
In the reversed model, the posterior probability of labelling a context $\textbf{c}$ with category $z$ given phrase $\textbf{p}$ is
\[
P(z|\textbf{c},\textbf{p})\propto P(z|\textbf{c})P(\textbf{p}|z).
\]
Since a phrase contains a variable number of words, we only look at the first and last word of the phrase. That is, $P(\textbf{p}|z)=P_l(p_l|z)P_r(p_r|z)$, where $P_l$ and $P_r$ denote the distributions of words in the first and last positions, and $p_l$ and $p_r$ are the words in those positions. The implementation of the agreement models again amounts to a small change to the E-step: the $q$ distribution for a phrase $\textbf{p}$ and a context $\textbf{c}$ is given by Equation \ref{eq:AgreePost} (a small code sketch of this step is given at the end of this section). In the M-step, both models update their parameters with the $q$ distribution computed as above. This modified EM can be shown to maximize the objective
\[
\mathcal{L}_1+
\mathcal{L}_2+
\sum_{\textbf{p},\textbf{c}}
\log\sum_z\sqrt{P_{\theta_1}(z|\textbf{p},\textbf{c})
P_{\theta_2}(z|\textbf{p},\textbf{c})},
\]
where $\mathcal{L}_1$ and $\mathcal{L}_2$ are the log-likelihoods of the two individual models.
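As noted above, the agreement E-step requires no numerical optimization at all. The following is a minimal sketch, assuming the two models' posteriors for the same occurrences are available as arrays (the names are illustrative):
\begin{verbatim}
import numpy as np

def agreement_e_step(post1, post2):
    """post1, post2: (N, K) posteriors over categories from the two models
    for the same N phrase occurrences. Returns the shared q distribution."""
    q = np.sqrt(post1 * post2)             # element-wise geometric mean
    return q / q.sum(axis=1, keepdims=True)
\end{verbatim}
Both models then use this shared $q$ in their usual M-step updates.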
\section{Experiments}
As a sanity check, we first looked at a few examples produced by the basic model (EM) and by the posterior regularization (PR) model with the sparsity constraint; they are shown in Table \ref{tab:EMVSPR}.
\begin{table}[h]
\centering
\includegraphics[width=3.5in]{pr-clustering/EMVSPR}
\caption[A few examples comparing EM and PR]
{A few examples comparing EM and PR. Count of most frequent category shows how many instances of a phrase are concentrated on the single most frequent category. Number of categories shows how many categories a phrase is labelled with. As discussed above, we want a phrase to use few categories, so these numbers are fair indicators of sparsity.
}
\label{tab:EMVSPR}
\end{table}
The models are evaluated more formally with two kinds of metrics. First, we feed the clustering output through the whole translation pipeline to obtain a BLEU score. Second, we use an intrinsic evaluation of clustering quality that compares against a supervised CFG parser trained on a treebank. We mainly work on the Urdu--English language pair. Urdu has very different word order from English, which leaves room for improvement over phrase-based systems. Table \ref{tab:results} shows BLEU scores as well as conditional entropies for each of the models above on the Urdu data. Conditional entropy is computed as the entropy of the ``gold'' labelling given the predicted clustering, where the ``gold'' labelling distribution is obtained from the Collins parser trained on the Penn Treebank. Since not all phrases are constituents, we ignore phrases that do not correspond to any constituent.

We conducted experiments with various data pre-processing choices and models. The number of phrase categories is fixed at $25$. We only look at the target-side language, and the context is set to one word to the left and one word to the right of the phrase. We chose this setting because it empirically works better in the pipeline than other variations; this is also the case for the non-parametric methods. The models, as discussed in the previous sections, are EM, EM with the sparsity constraint, agreement of two models in reverse directions, and agreement of two models trained on the two languages. We also tried our models with word classes: in the context, each word is replaced with a word class learned from the data in an unsupervised fashion. The results are shown in Table \ref{tab:results}.
\begin{table}[h]
\centering
\begin{tabular}{ |*{4}{c|} }
\hline
\multicolumn{2}{|c|}{model} & BLEU & H(Gold$|$Predicted)\\
\hline
\multicolumn{2}{|c|}{hiero} & 21.1 & 5.77\\
\multicolumn{2}{|c|}{hiero+POS} & 22.3 & 1.00 \\
\multicolumn{2}{|c|}{SAMT} & 24.5 & 0.00 \\
\hline
\multirow{2}{*}{EM} & words & 20.9 & 2.85 \\
& word classes & 21.54 & 2.86 \\
\hline
\multirow{2}{*}{PR $\sigma=100$}&words & 21.1 & 2.56 \\
&word classes & 21.7 & 2.36 \\
\hline
\multirow{2}{*}{agree language}&words & 21.7 & 2.80 \\
&word classes & 21.4 & 2.69\\
\hline
\multirow{2}{*}{agree direction}&words & 21.6 & 2.48\\
&word classes &22.1 &2.36 \\
\hline
\multirow{2}{*}{non-parametric}&words & 22.0 & 2.86\\
& word classes&22.3&2.27\\
\hline
\end{tabular}
\caption
{Evaluation of PR models. The BLEU column shows scores obtained through the translation pipeline. The H(Gold$|$Predicted) column shows the conditional entropy of the ``gold'' labelling given the predicted clustering.
}
\label{tab:results}
\end{table}
In Table \ref{tab:results}, the first three rows are baseline systems; the rest were developed in the workshop. Hiero is a hierarchical phrase-based model with a single category in all of its SCFG rules. Hiero+POS is Hiero with all words labelled with their POS tags. SAMT is a syntax-based system that uses a supervised parser trained on the Treebank. EM is the basic model introduced at the beginning of this chapter. PR $\sigma=100$ is the posterior regularization model with the sparsity constraint explained in Section \ref{sec:pr-sparse}; $\sigma$ is the constant that controls the strength of the constraint, and we picked it by trying values ranging from $1$ to $100$. Agree language and agree direction are the models with agreement constraints described in Section \ref{sec:pr-agree}. Non-parametric is the non-parametric model introduced in the previous chapter.
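For reference, the conditional entropy reported in Table \ref{tab:results} can be estimated from co-occurrence counts of predicted and gold labels roughly as follows. This is only a sketch of the standard plug-in estimate, assuming both labellings have been mapped to integer ids; the names are illustrative and the logarithm base is an assumption, so it does not reproduce the exact bookkeeping behind the table.
\begin{verbatim}
import numpy as np

def conditional_entropy(predicted, gold, n_pred, n_gold):
    """H(Gold | Predicted) from paired label sequences over the phrase
    instances that correspond to treebank constituents."""
    counts = np.zeros((n_pred, n_gold))
    for z, g in zip(predicted, gold):
        counts[z, g] += 1.0
    joint = counts / counts.sum()                 # P(predicted, gold)
    p_pred = joint.sum(axis=1, keepdims=True)     # P(predicted)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(joint > 0, joint * np.log2(joint / p_pred), 0.0)
    return -terms.sum()
\end{verbatim}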
\section{Conclusion}
The posterior regularization framework has a solid theoretical foundation: it can be shown mathematically to balance constraint satisfaction against data likelihood. In our experiments, we used it to enforce sparsity and agreement constraints and achieved results comparable to the non-parametric method that enforces sparsity through its priors. The algorithm is fairly fast when the constraint decomposes into smaller pieces that can be computed separately; in our case, the sparsity constraint decomposes into one small optimization problem per phrase. In practice, our algorithm is much faster than the non-parametric models with Gibbs sampling inference. The agreement models are faster still, because they perform almost the same amount of computation as the simple models trained with EM.