summaryrefslogtreecommitdiff
path: root/report
diff options
context:
space:
mode:
Diffstat (limited to 'report')
-rw-r--r--report/pr-clustering/posterior.tex61
1 files changed, 59 insertions, 2 deletions
diff --git a/report/pr-clustering/posterior.tex b/report/pr-clustering/posterior.tex
index c66eaa4c..734e53ed 100644
--- a/report/pr-clustering/posterior.tex
+++ b/report/pr-clustering/posterior.tex
@@ -20,7 +20,7 @@ category and then that category generates the contex for the phrase.
\begin{figure}[h]
\centering
- \includegraphics[width=3.5in]{EMdigram}
+ \includegraphics[width=3.5in]{pr-clustering/EMdigram}
\caption{Basic Phrase Clustering Model}
\label{fig:EM}
\end{figure}
@@ -64,8 +64,22 @@ from the output, EM is using more categories than we wanted for
each phrase.
Posterior regularization
provides a way to enforce sparsity \citep{ganchev:penn:2009}.
+The general idea of posterior regularization is to modify
+E-step of EM, so that instead of using posterior distribution
+as the $q$ distribution directly, we want to find the nearest
+$q$
+in a constrained space as shown in Figure \ref{fig:EMPR}.
+
+\begin{figure}[h]
+ \centering
+ \includegraphics[width=3.5in]{pr-clustering/EMPR}
+ \caption{EM with posterior regularization}
+ \label{fig:EMPR}
+\end{figure}
+
The constraint we use here is called $l_1/ l_\infty$
-regularization.
+regularization in Ganchev's technical report. The notations
+we use here largely follows Ganchev's.
In a more mathematical formulation, for each phrase $\textbf{p}$,
we want the quantity
\[\sum_{z=1}^K \max_i P(z|\textbf{p},\textbf{c}_i) \]
@@ -79,3 +93,46 @@ are the same
for all
occurrences of $\textbf{p}$. That is ,
$\forall i,j$, $P(z|\textbf{p},\textbf{c}_i)=P(z|\textbf{p},\textbf{c}_j)$.
+
+Define feature functions for $i$th occurrence of phrase $\textbf{p}$
+with category $j$,
+as a function of category $z$:
+\[
+\phi_{\textbf{p}ij}(z)=
+\begin{cases}
+1\text{ if z=j}\\
+0\text{ otherwise}
+\end{cases}.
+\]
+For notation simplicity, for
+each phrase category pair,
+define variables $c_{\textbf{p}z}$ that will
+eventually be $\max_i E_q[\phi_{\textbf{p}iz}]$.
+The objective we want to optimize becomes:
+\[
+\arg\min_{q,c_{\textbf{p}z}} KL(q||P_{\theta}) +
+\sigma \sum_{\textbf{p},z}c_{\textbf{p}z}
+\]
+\[
+\text{ s.t. }\forall \textbf{p},z,
+E_q[\phi_{\textbf{p}iz}]\leq c_{\textbf{p}z}.
+\]
+Using Lagrange Multipliers, this objective can
+be optimized in its dual form,
+for each phrase $\textbf{p}$:
+\[
+\arg\min_{\lambda\geq 0} \log
+(\sum_{z,i} P_\theta(z|\textbf{p},\textbf{c}_i)
+\exp (-\lambda_{\textbf{p}iz}))
+\]
+\[
+\text{ s.t. } \forall \textbf{p},z,
+\sum_i \lambda_{\textbf{p}iz}\leq \sigma.
+\]
+This dual objective can be optimized with projected gradient
+descent.
+The $q$ distribution we are looking for is then
+\[
+q_i(z)\propto P_{\theta}(z|\textbf{p},\textbf{c}_i)
+\exp(\lambda_{\textbf{p}iz}).
+\]