summaryrefslogtreecommitdiff
path: root/report/pr-clustering/posterior.tex
blob: 734e53ed2f23e6c753ba27acda2361d523e90f8e (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
\chapter{Posterior Regularization}
Posterior regularization is an alternative way of clustering the phrases.
Unlike a Baysian approach where intuitions about the data are expressed through the priors, posterior regularization imposes constraints on posterior distributions of the data.

In this chapter , we will introduce a basic clustering model with EM 
and look at shortcomings of the basic model. This will motivate us for
more complicated posterior regularized models.
\section{Phrase Clustering Model} 
As a brief recap, the clustering problem we are working with
is to label phrases with $K$ induced categories, where
$K$ is picked manually.
Phrases are obtained from bi-text data.
We also look at context
 words before and after
phrases as cues for clustering.
The relationship between phrases, contexts and categories are
represented with a generative model shown in 
Figure \ref{fig:EM}: a phrase picks a 
category and then that category generates the contex for the phrase.

\begin{figure}[h]
  \centering
  \includegraphics[width=3.5in]{pr-clustering/EMdigram}
  \caption{Basic Phrase Clustering Model}
  \label{fig:EM}
\end{figure}

The joint probability of a category $z$ and a context $\textbf{c}$ 
given a phrase $\textbf{p}$ is
\[
P(z,\textbf{c}|\textbf{p})=P(z|\textbf{p})P(\textbf{c}|z).
\]
$P(z|\textbf{p})$ is distribution of categories given a phrase.
This can be learned from data.
$P(\textbf{c}|z)$ is distribution of context given a category.
Since a context usually contains multiple slots for words, we further
decompose this distribution into independent distributions at
each slot. For example, suppose a context consists of two positions 
before and after the phrase. Denote these words as 
$c_{-2},c_{-1},c_1,c_2$.
Use $P_{-2},P_{-1},P_1,P_2$ to denote distributions of words at each 
position, $P(\textbf{c}|z)$ is decomposed as
\[
P(\textbf{c}|z)=P_{-2}(c_{-2}|z)P_{-1}
(c_{-1}|z)P_1(c_1|z)P_2(c_2|z).
\]
The posterior probability of a category given a phrase
and a context can be computed by normalizing the joint probability:
\[
P(z|\textbf{p},\textbf{c})=\frac{P(z,\textbf{c}|\textbf{p})}
{\sum_{i=1,K}P(i,\textbf{c}|\textbf{p})}.
\]
With the mechanisms to compute the posterior probabilities, we can 
apply EM to learn all the probabilities.
\section{Sparsity Constraints}
A common linguistic intuition we have about the phrase 
clustering problem is that a phrase should be put into very
few categories, e.g. a verb phrase is unlikely to be used as 
a noun phrase. In other words, the categorization of
a phrase should be sparse.
The generative model we proposed above with EM
allows a phrase to be labelled with many tags. As we observed
from the output, EM is using more categories than we wanted for
each phrase.
Posterior regularization
provides a way to enforce sparsity \citep{ganchev:penn:2009}.
The general idea of posterior regularization is to modify
E-step of EM, so that instead of using posterior distribution
as the $q$ distribution directly, we want to find the nearest
$q$
in a constrained space  as shown in Figure \ref{fig:EMPR}.

\begin{figure}[h]
  \centering
  \includegraphics[width=3.5in]{pr-clustering/EMPR}
  \caption{EM with posterior regularization}
  \label{fig:EMPR}
\end{figure}

The constraint we use here is called $l_1/ l_\infty$
regularization in Ganchev's technical report. The notations
we use here largely follows Ganchev's.
In a more mathematical formulation, for each phrase $\textbf{p}$,
we want the quantity 
\[\sum_{z=1}^K \max_i P(z|\textbf{p},\textbf{c}_i) \]
to be small, where $\textbf{c}_i$ is the context
appeared around the $i$th occurrence of phrase $\textbf{p}$
throughout the data. This quantity roughly equals 
the number of categories phrase $\textbf{p}$ uses.
It is minimized to $1$ if and only if
the posterior distributions $P(z|\textbf{p},\textbf{c}_i)$
are the same
for all
occurrences of $\textbf{p}$. That is ,
$\forall i,j$, $P(z|\textbf{p},\textbf{c}_i)=P(z|\textbf{p},\textbf{c}_j)$. 

Define feature functions for $i$th occurrence of phrase $\textbf{p}$
with category $j$,
as a function of category $z$:
\[
\phi_{\textbf{p}ij}(z)=
\begin{cases}
1\text{ if z=j}\\
0\text{ otherwise}
\end{cases}.
\]
For notation simplicity, for
each phrase category pair, 
define variables $c_{\textbf{p}z}$ that will
eventually be $\max_i E_q[\phi_{\textbf{p}iz}]$.
The objective we want to optimize becomes:
\[
\arg\min_{q,c_{\textbf{p}z}} KL(q||P_{\theta}) + 
\sigma \sum_{\textbf{p},z}c_{\textbf{p}z}
\]
\[
\text{ s.t. }\forall \textbf{p},z,
E_q[\phi_{\textbf{p}iz}]\leq c_{\textbf{p}z}.
\]
Using Lagrange Multipliers, this objective can
be optimized in its dual form,
for each phrase $\textbf{p}$:
\[
\arg\min_{\lambda\geq 0} \log 
(\sum_{z,i} P_\theta(z|\textbf{p},\textbf{c}_i)
\exp (-\lambda_{\textbf{p}iz}))
\]
\[
\text{ s.t. } \forall \textbf{p},z,
\sum_i \lambda_{\textbf{p}iz}\leq \sigma.
\]
This dual objective can be optimized with projected gradient
descent.
The $q$ distribution we are looking for is then
\[
q_i(z)\propto P_{\theta}(z|\textbf{p},\textbf{c}_i)
\exp(\lambda_{\textbf{p}iz}).
\]