diff options
Diffstat (limited to 'report')
-rw-r--r-- | report/pr-clustering/posterior.tex | 61 |
1 files changed, 59 insertions, 2 deletions
diff --git a/report/pr-clustering/posterior.tex b/report/pr-clustering/posterior.tex index c66eaa4c..734e53ed 100644 --- a/report/pr-clustering/posterior.tex +++ b/report/pr-clustering/posterior.tex @@ -20,7 +20,7 @@ category and then that category generates the contex for the phrase. \begin{figure}[h] \centering - \includegraphics[width=3.5in]{EMdigram} + \includegraphics[width=3.5in]{pr-clustering/EMdigram} \caption{Basic Phrase Clustering Model} \label{fig:EM} \end{figure} @@ -64,8 +64,22 @@ from the output, EM is using more categories than we wanted for each phrase. Posterior regularization provides a way to enforce sparsity \citep{ganchev:penn:2009}. +The general idea of posterior regularization is to modify +E-step of EM, so that instead of using posterior distribution +as the $q$ distribution directly, we want to find the nearest +$q$ +in a constrained space as shown in Figure \ref{fig:EMPR}. + +\begin{figure}[h] + \centering + \includegraphics[width=3.5in]{pr-clustering/EMPR} + \caption{EM with posterior regularization} + \label{fig:EMPR} +\end{figure} + The constraint we use here is called $l_1/ l_\infty$ -regularization. +regularization in Ganchev's technical report. The notations +we use here largely follows Ganchev's. In a more mathematical formulation, for each phrase $\textbf{p}$, we want the quantity \[\sum_{z=1}^K \max_i P(z|\textbf{p},\textbf{c}_i) \] @@ -79,3 +93,46 @@ are the same for all occurrences of $\textbf{p}$. That is , $\forall i,j$, $P(z|\textbf{p},\textbf{c}_i)=P(z|\textbf{p},\textbf{c}_j)$. + +Define feature functions for $i$th occurrence of phrase $\textbf{p}$ +with category $j$, +as a function of category $z$: +\[ +\phi_{\textbf{p}ij}(z)= +\begin{cases} +1\text{ if z=j}\\ +0\text{ otherwise} +\end{cases}. +\] +For notation simplicity, for +each phrase category pair, +define variables $c_{\textbf{p}z}$ that will +eventually be $\max_i E_q[\phi_{\textbf{p}iz}]$. +The objective we want to optimize becomes: +\[ +\arg\min_{q,c_{\textbf{p}z}} KL(q||P_{\theta}) + +\sigma \sum_{\textbf{p},z}c_{\textbf{p}z} +\] +\[ +\text{ s.t. }\forall \textbf{p},z, +E_q[\phi_{\textbf{p}iz}]\leq c_{\textbf{p}z}. +\] +Using Lagrange Multipliers, this objective can +be optimized in its dual form, +for each phrase $\textbf{p}$: +\[ +\arg\min_{\lambda\geq 0} \log +(\sum_{z,i} P_\theta(z|\textbf{p},\textbf{c}_i) +\exp (-\lambda_{\textbf{p}iz})) +\] +\[ +\text{ s.t. } \forall \textbf{p},z, +\sum_i \lambda_{\textbf{p}iz}\leq \sigma. +\] +This dual objective can be optimized with projected gradient +descent. +The $q$ distribution we are looking for is then +\[ +q_i(z)\propto P_{\theta}(z|\textbf{p},\textbf{c}_i) +\exp(\lambda_{\textbf{p}iz}). +\] |