Diffstat (limited to 'report/pr-clustering/posterior.tex')
 report/pr-clustering/posterior.tex | 153
 1 file changed, 115 insertions, 38 deletions
diff --git a/report/pr-clustering/posterior.tex b/report/pr-clustering/posterior.tex
index 02eb0e95..eb53e915 100644
--- a/report/pr-clustering/posterior.tex
+++ b/report/pr-clustering/posterior.tex
@@ -1,6 +1,9 @@
\chapter{Posterior Regularization}
Posterior regularization is an alternative way of clustering the phrases.
-Unlike a Baysian approach where intuitions about the data are expressed through the priors, posterior regularization imposes constraints on posterior distributions of the data.
+Unlike a Bayesian approach where intuitions about the data are
+expressed through the priors,
+posterior regularization imposes constraints
+on posterior distributions of the data.
In this chapter, we will introduce a basic clustering model with EM
and look at the shortcomings of the basic model. This will motivate us for
@@ -27,9 +30,9 @@ category and then that category generates the context for the phrase.
The joint probability of a category $z$ and a context $\textbf{c}$
given a phrase $\textbf{p}$ is
-\[
+\begin{equation}\label{eq:NBPost}
P(z,\textbf{c}|\textbf{p})=P(z|\textbf{p})P(\textbf{c}|z).
-\]
+\end{equation}
$P(z|\textbf{p})$ is the distribution of categories given a phrase.
This can be learned from data.
$P(\textbf{c}|z)$ is the distribution of contexts given a category.
@@ -51,7 +54,29 @@ P(z|\textbf{p},\textbf{c})=\frac{P(z,\textbf{c}|\textbf{p})}
{\sum_{i=1}^{K}P(i,\textbf{c}|\textbf{p})}.
\]
With the mechanisms to compute the posterior probabilities, we can
-apply EM to learn all the probabilities.
+apply the expectation-maximization (EM) algorithm
+to learn all the probabilities.
+The EM algorithm maximizes the log-likelihood of the data,
+\[
+\mathcal{L}=
+\sum_{\textbf{p},\textbf{c}}
+\log \sum_{z=1}^K P(z,\textbf{c}|\textbf{p}).
+\]
+EM works by iterating between two steps:
+the E-step computes posterior
+distributions according to Equation \ref{eq:NBPost},
+and the M-step updates the model parameters with maximum-likelihood
+estimation as in Equation \ref{eq:MStep}.
+
+\begin{equation}\label{eq:MStep}
+\boldsymbol{\theta}=
+\arg\max_{\boldsymbol{\theta}}
+\sum_{\textbf{c},\textbf{p}}\sum_z
+P(z|\textbf{p},\textbf{c},
+\boldsymbol\theta^{old})\log
+P(z,\textbf{c}|\textbf{p},\boldsymbol{\theta}).
+\end{equation}
+
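+To make the procedure concrete, the following sketch shows one
+possible implementation of this EM loop in Python/NumPy. It assumes
+the data has been preprocessed into integer-indexed (phrase, context)
+pairs; all variable names are illustrative and not taken from our
+actual implementation.
+\begin{verbatim}
+import numpy as np
+
+def em(pairs, n_phrases, n_contexts, K, iters=20, seed=0):
+    rng = np.random.default_rng(seed)
+    # P(z|p): one categorical distribution per phrase
+    cat_given_phrase = rng.random((n_phrases, K))
+    cat_given_phrase /= cat_given_phrase.sum(axis=1, keepdims=True)
+    # P(c|z): one categorical distribution per category
+    ctx_given_cat = rng.random((K, n_contexts))
+    ctx_given_cat /= ctx_given_cat.sum(axis=1, keepdims=True)
+    for _ in range(iters):
+        # expected counts, lightly smoothed to avoid zero divisions
+        counts_pz = np.full((n_phrases, K), 1e-6)
+        counts_zc = np.full((K, n_contexts), 1e-6)
+        for p, c in pairs:
+            # E-step: posterior over categories, P(z|p,c) ~ P(z|p)P(c|z)
+            q = cat_given_phrase[p] * ctx_given_cat[:, c]
+            q /= q.sum()
+            counts_pz[p] += q
+            counts_zc[:, c] += q
+        # M-step: maximum-likelihood re-estimation from expected counts
+        cat_given_phrase = counts_pz / counts_pz.sum(axis=1, keepdims=True)
+        ctx_given_cat = counts_zc / counts_zc.sum(axis=1, keepdims=True)
+    return cat_given_phrase, ctx_given_cat
+\end{verbatim}
+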
\section{Sparsity Constraints}\label{sec:pr-sparse}
A common linguistic intuition we have about the phrase
clustering problem is that a phrase should be put into very
@@ -66,9 +91,10 @@ Posterior regularization
provides a way to enforce sparsity \citep{ganchev:penn:2009}.
The general idea of posterior regularization is to modify the
E-step of EM, so that instead of using the posterior distribution
-as the $q$ distribution directly, we want to find the nearest
+as the $q$ distribution directly, the nearest
$q$
-in a constrained space as shown in Figure \ref{fig:EMPR}.
+in a constrained space is used
+ as shown in Figure \ref{fig:EMPR}.
\begin{figure}[h]
\centering
@@ -77,12 +103,14 @@ in a constrained space as shown in Figure \ref{fig:EMPR}.
\label{fig:EMPR}
\end{figure}
-The constraint we use here is called $l_1/ l_\infty$
+The constraint we use here is called $L_1/L_\infty$
regularization in Ganchev's technical report. The notation
we use here largely follows Ganchev's.
In a more mathematical formulation, for each phrase $\textbf{p}$,
we want the quantity
-\[\sum_{z=1}^K \max_i P(z|\textbf{p},\textbf{c}_i) \]
+\begin{equation}\label{eq:sparsity}
+\sum_{z=1}^K \max_i P(z|\textbf{p},\textbf{c}_i)
+\end{equation}
to be small, where $\textbf{c}_i$ is the context
that appears around the $i$th occurrence of phrase $\textbf{p}$
throughout the data. This quantity roughly equals
@@ -92,7 +120,25 @@ the posterior distributions $P(z|\textbf{p},\textbf{c}_i)$
are the same
for all
occurrences of $\textbf{p}$. That is,
-$\forall i,j$, $P(z|\textbf{p},\textbf{c}_i)=P(z|\textbf{p},\textbf{c}_j)$.
+$\forall i,j$,
+$P(z|\textbf{p},\textbf{c}_i)=P(z|\textbf{p},\textbf{c}_j)$.
+For example, Table \ref{tab:sparse} compares the $L_1/L_{\infty}$
+norms of two phrases. The first phrase prefers multiple categories
+under different contexts, while the second phrase prefers
+the same category under all contexts. The second phrase
+has a much smaller $L_1/L_{\infty}$ norm.
+
+\begin{table}[h]
+  \centering
+  \includegraphics[width=3.5in]{pr-clustering/sparse0}
+
+  \includegraphics[width=3.5in]{pr-clustering/sparse1}
+  \caption{$L_1/L_{\infty}$ norm of a phrase
+  that prefers multiple categories (top) and of a phrase
+  that prefers a single category (bottom)}
+  \label{tab:sparse}
+\end{table}
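+The quantity in Expression \ref{eq:sparsity} is easy to compute once
+the posteriors of all occurrences of a phrase are collected into a
+matrix with one row per occurrence and one column per category. The
+sketch below uses made-up numbers (not the values behind Table
+\ref{tab:sparse}) purely to illustrate the computation.
+\begin{verbatim}
+import numpy as np
+
+def l1_linf(posteriors):
+    # rows: occurrences of a phrase, columns: categories,
+    # entries: P(z|p,c_i); take the max over occurrences for each
+    # category, then sum those maxima over categories
+    return posteriors.max(axis=0).sum()
+
+# a phrase that prefers a different category in each context
+spread = np.array([[0.90, 0.05, 0.05],
+                   [0.05, 0.90, 0.05],
+                   [0.05, 0.05, 0.90]])
+# a phrase that prefers the same category in every context
+concentrated = np.array([[0.90, 0.05, 0.05],
+                         [0.80, 0.10, 0.10],
+                         [0.85, 0.10, 0.05]])
+print(l1_linf(spread))        # 2.7, heavily penalized
+print(l1_linf(concentrated))  # 1.1, close to the minimum of 1
+\end{verbatim}
+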
Define feature functions for the $i$th occurrence of phrase $\textbf{p}$
with category $j$,
@@ -106,16 +152,20 @@ as a function of category $z$:
\]
For notational simplicity, for
each phrase-category pair,
-define variables $c_{\textbf{p}z}$ that will
-eventually be $\max_i E_q[\phi_{\textbf{p}iz}]$.
+define variables
+$e_{\textbf{p}z}$ to be
+$\max_i E_q[\phi_{\textbf{p}iz}]$. Note that
+$\sum_z e_{\textbf{p}z}$ is just the quantity in
+Expression \ref{eq:sparsity}, with the model posterior replaced by $q$.
+
The objective we want to optimize becomes:
\[
\arg\min_{q,e_{\textbf{p}z}} KL(q||P_{\theta}) +
-\sigma \sum_{\textbf{p},z}c_{\textbf{p}z}
+\sigma \sum_{\textbf{p},z}e_{\textbf{p}z}
\]
\[
\text{ s.t. }\forall \textbf{p},i,z,
-E_q[\phi_{\textbf{p}iz}]\leq c_{\textbf{p}z},
+E_q[\phi_{\textbf{p}iz}]\leq e_{\textbf{p}z},
\]
where $\sigma$ is a constant to control
how strongly the soft constraint should
@@ -139,19 +189,52 @@ The $q$ distribution we are looking for is then
q_i(z)\propto P_{\theta}(z|\textbf{p},\textbf{c}_i)
\exp(\lambda_{\textbf{p}iz}).
\]
-M-step can be performed as usual.
+The M-step can be performed as usual by replacing
+$P(z|\textbf{p},\textbf{c},
+\boldsymbol\theta^{old})$ with
+$q(z)$ in Equation \ref{eq:MStep}.
+
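+As a sketch of what this modified E-step looks like in code, the
+snippet below forms the projected $q$ distributions for one phrase
+from the model posteriors and a given table of dual variables
+$\lambda$; how the dual variables are obtained is omitted here, and
+all names are illustrative.
+\begin{verbatim}
+import numpy as np
+
+def projected_q(posteriors, lam):
+    # posteriors[i, z] = P_theta(z | p, c_i) for occurrence i of p
+    # lam[i, z]        = dual variable lambda_{p i z}
+    q = posteriors * np.exp(lam)       # q_i(z) up to normalization
+    q /= q.sum(axis=1, keepdims=True)  # renormalize each occurrence
+    return q
+\end{verbatim}
+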
\section{Agreement Models}\label{sec:pr-agree}
+
Another type of constraint we used is agreement between
-different models. We came up with a similar generative
-model in the reverse direction to agree with
-as shown in Figure \ref{fig:EMreverse}. We also
-took advantage of bi-text data we have and make models
-learning from different languages agree with each other.
+different models. The intuition is that if two models are run
+on the same dataset, it is desirable for them to give the same
+output. Suppose two models give two posterior
+distributions over categories,
+$P_{\theta_1}(z|\textbf{p},\textbf{c})$ and
+$P_{\theta_2}(z|\textbf{p},\textbf{c})$. The
+agreement constraint seeks posterior distributions
+$q_1(z),q_2(z)$
+such that $q_1(z)=q_2(z)$ while minimizing
+the KL-divergence
+$KL(q_1(z)q_2(z)||
+P_{\theta_1}(z|\textbf{p},\textbf{c})
+P_{\theta_2}(z|\textbf{p},\textbf{c}))$.
+The solution for $q_1,q_2$ has a simple closed
+form, derived using Lagrange multipliers, shown in
+Equation \ref{eq:AgreePost}.
+\begin{equation}\label{eq:AgreePost}
+q_1(z)=q_2(z)\propto
+\sqrt{P_{\theta_1}(z|\textbf{p},\textbf{c})
+P_{\theta_2}(z|\textbf{p},\textbf{c})}.
+\end{equation}
+
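+In code, the agreement E-step of Equation \ref{eq:AgreePost} amounts
+to taking an elementwise geometric mean of the two posteriors and
+renormalizing; a minimal sketch (with illustrative names) is:
+\begin{verbatim}
+import numpy as np
+
+def agreement_q(post1, post2):
+    # post1[z] = P_theta1(z|p,c), post2[z] = P_theta2(z|p,c)
+    q = np.sqrt(post1 * post2)
+    return q / q.sum()
+\end{verbatim}
+Both models then use this shared $q$ when updating their parameters.
+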
+Since there is no particular reason to believe that
+a phrase generates a category which then generates
+a context, rather than the other way around, it is natural
+to consider a model that generates in the reverse direction,
+as shown in Figure \ref{fig:EMreverse}. This model
+and the original model should then satisfy the agreement
+constraint.
+We also
+took advantage of the bi-text data and made models
+learned from different languages
+agree with each other.
\begin{figure}[h]
\centering
\includegraphics[width=3.0in]{pr-clustering/EMreverse}
- \caption{EM with posterior regularization}
+ \caption{Generative model in the reverse direction}
\label{fig:EMreverse}
\end{figure}
@@ -165,34 +248,28 @@ P(z|\textbf{c})P(\textbf{p}|z).
\]
Since a phrase contains a variable number of words,
we only look at the first and last word of
-a phrase. That is $P(\textbf{p}|z)=P_1(p_1|z)P_n(p_n|z)$,
-where $n$ is the length of $\textbf{p}$, $P_1$ and $P_n$
+a phrase. That is, $P(\textbf{p}|z)=P_l(p_l|z)P_r(p_r|z)$,
+where $P_l$ and $P_r$
denote distributions over words in the first and last positions
-of a phrase given a category.
+given a category, and $p_l$ and $p_r$ are the first
+and last words of $\textbf{p}$.
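+For concreteness, the reverse model's posterior over categories for a
+single (phrase, context) pair can be computed along the following
+lines (again with illustrative names):
+\begin{verbatim}
+import numpy as np
+
+def reverse_posterior(cat_given_ctx, first_given_cat, last_given_cat):
+    # each argument is a length-K vector indexed by category z:
+    # P(z|c), P_l(p_l|z) and P_r(p_r|z) for this phrase and context
+    q = cat_given_ctx * first_given_cat * last_given_cat
+    return q / q.sum()
+\end{verbatim}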
The implementation of agreement models again ends up making
a small change to the E-step. The $q$ distribution for
a phrase $\textbf{p}$ and a context $\textbf{c}$
-is given by
-\[
-q(z)=\sqrt{P_{\theta 1}
-(z|\textbf{p},\textbf{c})P_{\theta 2}(z|\textbf{p},\textbf{c})},
-\]
-where $P_{\theta 1}$ and $P_{\theta 2}$ are
-posterior distributions for two models.
-In M-step, both models should update their parameters with the
-same $q$ distribution computed as above.
-This modified EM maximizes the objective:
+is given by Equation \ref{eq:AgreePost}.
+In the M-step, both models update their parameters with the same
+$q$ distribution computed as above.
+This modified EM is proven to
+maximize the objective:
\[
\mathcal{L}_1+
\mathcal{L}_2+
\sum_{\textbf{p},\textbf{c}}
-\log\sum_z\sqrt{P_{\theta 1}(z|\textbf{p},\textbf{c})
-P_{\theta 2}(z|\textbf{p},\textbf{c})},
+\log\sum_z\sqrt{P_{\theta_1}(z|\textbf{p},\textbf{c})
+P_{\theta_2}(z|\textbf{p},\textbf{c})},
\]
where $\mathcal{L}_1$ and $\mathcal{L}_2$
-are log-likelihood of
-two models.
+are the log-likelihoods of
+the two individual models.
\section{Experiments}
As a sanity check, we looked at a few examples produced by
the basic model (EM)
@@ -206,7 +283,7 @@ shows a few examples.
\caption[A few examples comparing EM and PR]
{A few examples comparing EM and PR.
Count of most frequent category shows how
- many instances of a phrase are concetrated on
+ many instances of a phrase are concentrated on
the single most frequent tag.
Number of categories shows how many categories
a phrase is labelled with. By experience as mentioned before,