Diffstat (limited to 'report/pr-clustering')
-rw-r--r--  report/pr-clustering/EMPR.pdf       bin  71179 -> 0 bytes
-rw-r--r--  report/pr-clustering/EMVSPR.pdf     bin  58054 -> 0 bytes
-rw-r--r--  report/pr-clustering/EMdigram.pdf   bin  10707 -> 0 bytes
-rw-r--r--  report/pr-clustering/EMreverse.pdf  bin  11021 -> 0 bytes
-rw-r--r--  report/pr-clustering/posterior.tex  402
-rw-r--r--  report/pr-clustering/pr5long.pdf    bin 645301 -> 0 bytes
-rwxr-xr-x  report/pr-clustering/sparse0.pdf    bin  33279 -> 0 bytes
-rwxr-xr-x  report/pr-clustering/sparse1.pdf    bin  33254 -> 0 bytes
8 files changed, 0 insertions, 402 deletions
diff --git a/report/pr-clustering/EMPR.pdf b/report/pr-clustering/EMPR.pdf
deleted file mode 100644
index fe7b2412..00000000
--- a/report/pr-clustering/EMPR.pdf
+++ /dev/null
Binary files differ
diff --git a/report/pr-clustering/EMVSPR.pdf b/report/pr-clustering/EMVSPR.pdf
deleted file mode 100644
index c03b41f2..00000000
--- a/report/pr-clustering/EMVSPR.pdf
+++ /dev/null
Binary files differ
diff --git a/report/pr-clustering/EMdigram.pdf b/report/pr-clustering/EMdigram.pdf
deleted file mode 100644
index aefd14bc..00000000
--- a/report/pr-clustering/EMdigram.pdf
+++ /dev/null
Binary files differ
diff --git a/report/pr-clustering/EMreverse.pdf b/report/pr-clustering/EMreverse.pdf
deleted file mode 100644
index 095592e7..00000000
--- a/report/pr-clustering/EMreverse.pdf
+++ /dev/null
Binary files differ
diff --git a/report/pr-clustering/posterior.tex b/report/pr-clustering/posterior.tex
deleted file mode 100644
index 8a199e63..00000000
--- a/report/pr-clustering/posterior.tex
+++ /dev/null
@@ -1,402 +0,0 @@
-\chapter{Posterior Regularization}
-Posterior regularization is an alternative way of clustering the phrases.
-Unlike a Bayesian approach where intuitions about the data are
-expressed through the priors,
-posterior regularization imposes constraints
-on posterior distributions of the data.
-
-In this chapter, we introduce a basic clustering model trained with EM
-and examine its shortcomings. These shortcomings motivate the
-more sophisticated posterior-regularized models that follow.
-\section{Phrase Clustering Model}
-As a brief recap, the clustering task is to label phrases with one of
-$K$ induced categories, where
-$K$ is chosen manually.
-The phrases are extracted from bi-text data,
-and the context words immediately before and after
-each phrase serve as cues for clustering.
-The relationship between phrases, contexts and categories is
-represented by the generative model shown in
-Figure \ref{fig:EM}: a phrase picks a
-category, and that category then generates the context for the phrase.
-
-\begin{figure}[h]
- \centering
- \includegraphics[width=3.0in]{pr-clustering/EMdigram}
- \caption{Basic Phrase Clustering Model}
- \label{fig:EM}
-\end{figure}
-
-The joint probability of a category $z$ and a context $\textbf{c}$
-given a phrase $\textbf{p}$ is
-\begin{equation}\label{eq:NBPost}
-P(z,\textbf{c}|\textbf{p})=P(z|\textbf{p})P(\textbf{c}|z).
-\end{equation}
-Here $P(z|\textbf{p})$ is the distribution of categories given a phrase,
-which is learned from the data, and
-$P(\textbf{c}|z)$ is the distribution of contexts given a category.
-Since a context usually contains multiple word slots, we further
-decompose the latter into independent distributions, one per slot.
-For example, suppose a context consists of the two positions
-before and after the phrase, with words
-$c_{-2},c_{-1},c_1,c_2$.
-Writing $P_{-2},P_{-1},P_1,P_2$ for the word distributions at each
-position, $P(\textbf{c}|z)$ decomposes as
-\[
-P(\textbf{c}|z)=P_{-2}(c_{-2}|z)P_{-1}
-(c_{-1}|z)P_1(c_1|z)P_2(c_2|z).
-\]
-The posterior probability of a category given a phrase
-and a context can be computed by normalizing the joint probability:
-\[
-P(z|\textbf{p},\textbf{c})=\frac{P(z,\textbf{c}|\textbf{p})}
-{\sum_{i=1,K}P(i,\textbf{c}|\textbf{p})}.
-\]
-With the posterior probabilities in hand, we can
-apply the expectation-maximization (EM) algorithm
-to learn all of the model's distributions.
-EM maximizes the log-likelihood of the data,
-\[
-\mathcal{L}=
-\sum_{\textbf{p},\textbf{c}}
-\log \sum_{z=1}^K P(z,\textbf{c}|\textbf{p}),
-\]
-by iterating between two steps:
-the E-step computes the posterior
-distributions by normalizing Equation \ref{eq:NBPost},
-and the M-step updates the model parameters with maximum-likelihood
-estimation as in Equation \ref{eq:MStep}.
-
-\begin{equation}\label{eq:MStep}
-\boldsymbol{\theta}=
-\arg\max_{\boldsymbol{\theta}}
-\sum_{\textbf{c},\textbf{p}}\sum_z
-P(z|\textbf{p},\textbf{c},
-\boldsymbol\theta^{old})\log
-P(\textbf{c},\textbf{p}|z,\boldsymbol{\theta}).
-\end{equation}
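-
-For concreteness, the two steps above can be written as the following
-Python sketch. This is a minimal illustration under our own assumptions
-about array layouts and names such as \texttt{P\_slot}; it is not the
-code used in the workshop.
-\begin{verbatim}
-import numpy as np
-
-# P_z_given_p[p, z] ~ P(z | phrase p)
-# P_slot[s][w, z]   ~ P_s(word w | z) for the four context slots
-#                     (-2, -1, +1, +2), indexed s = 0..3
-def em_iteration(phrase_ids, context_ids, P_z_given_p, P_slot):
-    N, K = len(phrase_ids), P_z_given_p.shape[1]
-    post = np.zeros((N, K))
-    # E-step: normalized posterior over categories per occurrence
-    for n in range(N):
-        joint = P_z_given_p[phrase_ids[n]].copy()
-        for s, w in enumerate(context_ids[n]):
-            joint *= P_slot[s][w]      # independent slot distributions
-        post[n] = joint / joint.sum()
-    # M-step: maximum-likelihood re-estimation from expected counts
-    new_P = np.zeros_like(P_z_given_p)
-    new_slot = [np.zeros_like(t) for t in P_slot]
-    for n in range(N):
-        new_P[phrase_ids[n]] += post[n]
-        for s, w in enumerate(context_ids[n]):
-            new_slot[s][w] += post[n]
-    new_P /= new_P.sum(axis=1, keepdims=True)
-    new_slot = [t / t.sum(axis=0, keepdims=True) for t in new_slot]
-    return new_P, new_slot
-\end{verbatim}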
-
-\section{Sparsity Constraints}\label{sec:pr-sparse}
-A common linguistic intuition about the phrase
-clustering problem is that a phrase should be assigned to very
-few categories; a verb phrase, for example, is unlikely to also be used as
-a noun phrase. In other words, the categorization of
-a phrase should be sparse.
-The generative model proposed above, trained with EM,
-allows a phrase to be labelled with many categories, and in its output
-EM indeed uses more categories per phrase than we would like.
-Posterior regularization
-provides a way to enforce sparsity \citep{ganchev:penn:2009}.
-The general idea is to modify
-the E-step of EM so that, instead of using the posterior distribution
-as the $q$ distribution directly, we use the nearest
-$q$ in a constrained space,
-as shown in Figure \ref{fig:EMPR}.
-
-\begin{figure}[h]
- \centering
- \includegraphics[width=3.5in]{pr-clustering/EMPR}
- \caption{EM with posterior regularization}
- \label{fig:EMPR}
-\end{figure}
-
-The constraint we use here is the $L_1/L_\infty$
-regularization of Ganchev's technical report, and our notation
-largely follows his.
-More formally, for each phrase $\textbf{p}$
-we want the quantity
-\begin{equation}\label{eq:sparsity}
-\sum_{z=1}^K \max_i P(z|\textbf{p},\textbf{c}_i)
-\end{equation}
-to be small, where $\textbf{c}_i$ is the context
-observed around the $i$th occurrence of phrase $\textbf{p}$
-in the data. This quantity roughly equals
-the number of categories phrase $\textbf{p}$ uses.
-It attains its minimum of $1$ if and only if
-the posterior distributions $P(z|\textbf{p},\textbf{c}_i)$
-are the same for all
-occurrences of $\textbf{p}$, that is,
-$\forall i,j$:
-$P(z|\textbf{p},\textbf{c}_i)=P(z|\textbf{p},\textbf{c}_j)$.
-For example, Table \ref{tab:sparse} compares the $L_1/L_{\infty}$
-norms of two phrases. The first phrase prefers different categories
-under different contexts, while the second prefers
-the same category under all contexts; the second phrase therefore
-has a much smaller $L_1/L_{\infty}$ norm.
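-
-This quantity is cheap to compute from the per-occurrence posteriors;
-a small Python sketch (assuming \texttt{post} holds one posterior row
-per occurrence of the phrase):
-\begin{verbatim}
-import numpy as np
-
-def l1_linf(post):
-    # post: shape (num_occurrences, K), row i = P(z | p, c_i).
-    # Returns sum_z max_i P(z | p, c_i).
-    return post.max(axis=0).sum()
-
-# A phrase labelled identically everywhere scores 1.0; a phrase that
-# switches categories across contexts scores close to the number of
-# distinct categories it uses.
-\end{verbatim}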
-
-\begin{table}[h]
-  \centering
-  \includegraphics[width=3.5in]{pr-clustering/sparse0}
-  \caption{$L_1/L_{\infty}$ norm of a phrase
-    that prefers multiple categories}
-  \label{tab:sparse}
-  \includegraphics[width=3.5in]{pr-clustering/sparse1}
-  \caption{$L_1/L_{\infty}$ norm of a phrase
-    that prefers a single category}
-\end{table}
-
-For the $i$th occurrence of phrase $\textbf{p}$ and each category $j$,
-define a feature function of the category $z$:
-\[
-\phi_{\textbf{p}ij}(z)=
-\begin{cases}
-1 & \text{if } z=j\\
-0 & \text{otherwise}
-\end{cases}.
-\]
-For notational simplicity, for
-each phrase--category pair,
-define the variable
-$e_{\textbf{p}z}$ to be
-$\max_i E_q[\phi_{\textbf{p}iz}]$. This is just
-the term summed in
-Expression \ref{eq:sparsity}, with the model posterior replaced by $q$.
-
-The objective we want to optimize becomes:
-\[
-\arg\min_{q,e_{\textbf{p}z}} KL(q||P_{\theta}) +
-\sigma \sum_{\textbf{p},z}e_{\textbf{p}z}
-\]
-\[
-\text{ s.t. }\forall \textbf{p},i,z,\quad
-E_q[\phi_{\textbf{p}iz}]\leq e_{\textbf{p}z},
-\]
-where $\sigma$ is a constant that controls
-how strongly the soft constraint is
-enforced.
-Using Lagrange multipliers, this objective can
-be optimized in its dual form,
-separately for each phrase $\textbf{p}$:
-\[
-\arg\min_{\lambda\geq 0} \log
-(\sum_{z,i} P_\theta(z|\textbf{p},\textbf{c}_i)
-\exp (-\lambda_{\textbf{p}iz}))
-\]
-\[
-\text{ s.t. } \forall z,\quad
-\sum_i \lambda_{\textbf{p}iz}\leq \sigma.
-\]
-This dual objective can be optimized with projected gradient
-descent. Since each phrase has its own objective and
-constraint, and the $\lambda$s are not shared across
-phrases, the objective can be optimized
-for each phrase separately. This makes the algorithm easy to
-parallelize and each sub-problem easier to optimize.
-
-The $q$ distribution we are looking for is then
-\[
-q_i(z)\propto P_{\theta}(z|\textbf{p},\textbf{c}_i)
-\exp(-\lambda_{\textbf{p}iz}).
-\]
-The M-step can then be performed as usual by replacing
-$P(z|\textbf{p},\textbf{c},
-\boldsymbol\theta^{old})$ with
-$q(z)$ in Equation \ref{eq:MStep}.
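-
-The following is a sketch of the per-phrase dual optimization in Python
-(our own simplified outline, not the workshop code): the gradient
-follows directly from the log-sum-exp form of the dual, and the
-projection step enforces $\lambda\geq 0$ and
-$\sum_i \lambda_{\textbf{p}iz}\leq\sigma$ for each category.
-\begin{verbatim}
-import numpy as np
-
-def project_capped_simplex(v, sigma):
-    # Euclidean projection of v onto {x >= 0, sum(x) <= sigma}.
-    x = np.maximum(v, 0.0)
-    if x.sum() <= sigma:
-        return x
-    u = np.sort(v)[::-1]               # project onto {x >= 0, sum = sigma}
-    css = np.cumsum(u) - sigma
-    rho = np.nonzero(u - css / (np.arange(len(u)) + 1) > 0)[0][-1]
-    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)
-
-def optimize_phrase(post, sigma, step=0.1, iters=100):
-    # post[i, z] = P_theta(z | p, c_i) for one phrase p.
-    I, K = post.shape
-    lam = np.zeros((I, K))
-    for _ in range(iters):
-        w = post * np.exp(-lam)
-        grad = -w / w.sum()            # gradient of log sum_{i,z} w[i,z]
-        lam = lam - step * grad
-        for z in range(K):             # project each category's column
-            lam[:, z] = project_capped_simplex(lam[:, z], sigma)
-    q = post * np.exp(-lam)            # q_i(z), before normalization
-    return q / q.sum(axis=1, keepdims=True)
-\end{verbatim}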
-
-\section{Agreement Models}\label{sec:pr-agree}
-
-Another type of constraint we used is agreement between
-different models. The intuition is that if two models are run
-on the same dataset, it is desirable for them to produce the same
-output. Suppose the two models give two posterior
-distributions over categories,
-$P_{\theta_1}(z|\textbf{p},\textbf{c})$ and
-$P_{\theta_2}(z|\textbf{p},\textbf{c})$. The
-agreement constraint seeks posterior distributions
-$q_1(z),q_2(z)$
-such that $q_1(z)=q_2(z)$ while minimizing
-the KL-divergence
-$KL(q_1(z)q_2(z)||
-P_{\theta_1}(z|\textbf{p},\textbf{c})
-P_{\theta_2}(z|\textbf{p},\textbf{c}))$.
-The solution for $q_1,q_2$ has a simple closed
-form, derived with Lagrange multipliers, shown in
-Equation \ref{eq:AgreePost}.
-\begin{equation}\label{eq:AgreePost}
-q_1(z)=q_2(z)\propto
-\sqrt{P_{\theta_1}(z|\textbf{p},\textbf{c})
-P_{\theta_2}(z|\textbf{p},\textbf{c})}.
-\end{equation}
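-
-In code this projection is essentially a one-liner; a sketch, assuming
-the two posteriors over the $K$ categories are given as NumPy arrays:
-\begin{verbatim}
-import numpy as np
-
-def agree(p1, p2):
-    # Renormalized geometric mean of the two models' posteriors
-    # (Equation AgreePost).
-    q = np.sqrt(p1 * p2)
-    return q / q.sum()
-\end{verbatim}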
-
-Since there is no particular reason to believe that
-a phrase generates a category which then generates
-a context, it is natural to also consider a model
-that generates in the reverse direction,
-as shown in Figure \ref{fig:EMreverse}. This model
-and the original model should then be made to agree.
-We also
-took advantage of the bi-text data and made models
-learned on different languages
-agree with each other.
-
-\begin{figure}[h]
- \centering
- \includegraphics[width=3.0in]{pr-clustering/EMreverse}
-  \caption{Generative Model in the Reverse Direction}
- \label{fig:EMreverse}
-\end{figure}
-
-In the reversed model,
-the posterior probability of the category of
-a context $\textbf{c}$ occurring with
-phrase $\textbf{p}$ is
-\[
-P(z|\textbf{c},\textbf{p})\propto
-P(z|\textbf{c})P(\textbf{p}|z).
-\]
-Since a phrase contains a variable number of words,
-we only look at the first and last word of
-the phrase. That is, $P(\textbf{p}|z)=P_l(p_l|z)P_r(p_r|z)$,
-where $P_l$ and $P_r$
-denote the distributions of words in the first and last positions,
-and $p_l$ and $p_r$ are the words in those positions.
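-
-A corresponding sketch of this posterior computation (the array names
-are our own assumptions, and we leave the parameterization of
-$P(z|\textbf{c})$ abstract):
-\begin{verbatim}
-import numpy as np
-
-def reverse_posterior(P_z_given_c, P_l, P_r, p_first, p_last):
-    # P_z_given_c: length-K vector P(z | c) for this context;
-    # P_l[w, z], P_r[w, z]: first/last-word distributions given z.
-    joint = P_z_given_c * P_l[p_first] * P_r[p_last]
-    return joint / joint.sum()
-\end{verbatim}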
-
-The implementation of the agreement models again amounts to
-a small change to the E-step: the $q$ distribution for
-a phrase $\textbf{p}$ and a context $\textbf{c}$
-is given by Equation \ref{eq:AgreePost}.
-In the M-step, both models update their parameters with the $q$
-distribution computed as above.
-This modified EM provably maximizes the objective:
-\[
-\mathcal{L}_1+
-\mathcal{L}_2+
-\sum_{\textbf{p},\textbf{c}}
-\log\sum_z\sqrt{P_{\theta_1}(z|\textbf{p},\textbf{c})
-P_{\theta_2}(z|\textbf{p},\textbf{c})},
-\]
-where $\mathcal{L}_1$ and $\mathcal{L}_2$
-are log-likelihoods of
-each individual model.
-\section{Experiments}
-As a sanity check, we first compared output produced by
-the basic model (EM)
-and the posterior regularization (PR) model
-with the sparsity constraint; Table \ref{tab:EMVSPR}
-shows a few examples.
-
-\begin{table}[h]
- \centering
- \includegraphics[width=3.5in]{pr-clustering/EMVSPR}
-  \caption[A few examples comparing EM and PR]
-  {A few examples comparing EM and PR.
-    ``Count of most frequent category'' shows how
-    many instances of a phrase are concentrated on
-    the single most frequent tag.
-    ``Number of categories'' shows how many categories
-    a phrase is labelled with. As discussed earlier,
-    we want a phrase to use few categories, so
-    these numbers are fair indicators of sparsity.
-  }
- \label{tab:EMVSPR}
-\end{table}
-
-The models are formally evaluated with two kinds
-of metrics. We feed the clustering output
-through the whole translation pipeline
-to obtain a BLEU score, and we also devised
-an intrinsic evaluation of clustering quality
-that compares against a supervised CFG parser trained on a
-treebank.
-
-We work mainly on the Urdu-English language pair.
-Urdu has very
-different word order from English,
-which leaves room for improvement over
-phrase-based systems.
-Table \ref{tab:results}
-shows BLEU scores as well as
-conditional entropy for each of the models above
-on the Urdu data. Conditional entropy is computed
-as the entropy of the ``gold'' labelling given
-the predicted clustering, where the ``gold'' labelling
-distribution
-is obtained from the Collins parser
-trained on the Penn Treebank. Since not
-all phrases are constituents, we ignore
-phrases that do not correspond to any constituent.
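-
-Concretely, the conditional entropy can be computed from a
-contingency table of gold labels versus predicted categories. The
-sketch below uses an assumed count array and a base-2 logarithm; the
-original evaluation script may differ.
-\begin{verbatim}
-import numpy as np
-
-def conditional_entropy(counts):
-    # counts[g, k]: occurrences with gold label g, predicted category k.
-    joint = counts / counts.sum()
-    p_pred = joint.sum(axis=0)
-    h = 0.0
-    for k in range(counts.shape[1]):
-        if p_pred[k] == 0:
-            continue
-        p_g = joint[:, k] / p_pred[k]      # P(gold | predicted = k)
-        nz = p_g > 0
-        h -= p_pred[k] * np.sum(p_g[nz] * np.log2(p_g[nz]))
-    return h                               # H(Gold | Predicted)
-\end{verbatim}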
-
-We conducted experiments with various data pre-processing
-choices and models. The number of phrase categories
-is fixed at $25$. We only look at
-the target-side language, and the context is set
-to $1$ word to the left and to the right of the phrase.
-We chose this setting because it empirically works better
-in the pipeline than other variations; this is also
-the case for the non-parametric methods. The models, as discussed
-in the previous sections, are EM, EM with the sparsity constraint,
-agreement of two models in reverse directions, and agreement
-of two models trained on the two languages. We also tried our models
-with word classes: each context word is
-replaced with a word class learned from the data in an unsupervised way.
-The results are shown in Table \ref{tab:results}.
-
-\begin{table}[h]
- \centering
- \begin{tabular}{ |*{4}{c|} }
- \hline
- \multicolumn{2}{|c|}{model} & BLEU & H(Gold$|$Predicted)\\
- \hline
- \multicolumn{2}{|c|}{hiero} & 21.1 & 5.77\\
- \multicolumn{2}{|c|}{hiero+POS} & 22.3 & 1.00 \\
- \multicolumn{2}{|c|}{SAMT} & 24.5 & 0.00 \\
- \hline
-    \multirow{2}{*}{EM} & words & 20.9 & 2.85 \\
-    & word classes & 21.54 & 2.86 \\ \hline
-    \multirow{2}{*}{PR $\sigma=100$} & words & 21.1 & 2.56 \\
-    & word classes & 21.7 & 2.36 \\ \hline
-    \multirow{2}{*}{agree language} & words & 21.7 & 2.80 \\
-    & word classes & 21.4 & 2.69 \\ \hline
-    \multirow{2}{*}{agree direction} & words & 21.6 & 2.48 \\
-    & word classes & 22.1 & 2.36 \\ \hline
-    \multirow{2}{*}{non-parametric} & words & 22.0 & 2.86 \\
-    & word classes & 22.3 & 2.27 \\
- \hline
- \end{tabular}
-  \caption
-  {Evaluation of PR models.
-    The BLEU column shows scores obtained
-    through the translation pipeline.
-    The H(Gold$|$Predicted) column shows the conditional entropy
-    of the gold labelling given the predicted clustering.
-  }
- \label{tab:results}
-\end{table}
-
-In Table \ref{tab:results}, the first three rows
-are baseline systems; the rest were developed in the workshop.
-Hiero is the hierarchical phrase-based
-model with a single category in all of its SCFG rules. Hiero+POS
-is Hiero with all words labelled with their POS tags.
-SAMT is a syntax-based system with a supervised
-parser trained on a treebank. EM is the first model described
-at the beginning of this chapter. PR $\sigma=100$ is the
-posterior regularization model with the sparsity constraint
-explained in Section \ref{sec:pr-sparse};
-$\sigma$ is the constant that controls the strength of the constraint,
-picked by trying different values ranging from
-$1$ to $100$.
-Agree language and agree direction are the models with agreement
-constraints described in Section \ref{sec:pr-agree}, and non-parametric
-is the non-parametric model introduced in the previous chapter.
-\section{Conclusion}
-The posterior regularization framework has a solid theoretical foundation:
-it is shown mathematically to balance the constraints against the likelihood.
-In our experiments,
-we used it to enforce sparsity and agreement constraints and
-achieved results comparable to the non-parametric method that enforces
-sparsity through priors. The algorithm is fairly fast when the constraint
-can be decomposed into smaller pieces and computed separately; in our case,
-the sparsity constraint decomposes into one small optimization
-problem per phrase. In practice, our algorithm is much
-faster than the non-parametric models with Gibbs sampling inference.
-The agreement
-models are faster still, since they perform almost the same amount
-of computation as the simple models trained with EM.
diff --git a/report/pr-clustering/pr5long.pdf b/report/pr-clustering/pr5long.pdf
deleted file mode 100644
index 7de9dae9..00000000
--- a/report/pr-clustering/pr5long.pdf
+++ /dev/null
Binary files differ
diff --git a/report/pr-clustering/sparse0.pdf b/report/pr-clustering/sparse0.pdf
deleted file mode 100755
index d103d4db..00000000
--- a/report/pr-clustering/sparse0.pdf
+++ /dev/null
Binary files differ
diff --git a/report/pr-clustering/sparse1.pdf b/report/pr-clustering/sparse1.pdf
deleted file mode 100755
index 0edad23a..00000000
--- a/report/pr-clustering/sparse1.pdf
+++ /dev/null
Binary files differ