CS228 Probabilistic Graphical Models Homework 5-Solved
1. (Bayesian inference) Let X ∈ {x1,...,xK} be a multinomial variable and let θ be a parametrization of the distribution of X, i.e. P(X = xk|θ) = θk. Let D = {x[1],...,x[M]} be a dataset consisting of M realizations of X. We would like to infer something interesting about θ based on D.
Our strategy so far has been to infer a true set of parameters θ∗ from which the data was generated; we found θ∗ using the principle of maximum likelihood; this was an example of the so-called frequentist approach to statistics. In Bayesian statistics, we instead construct a posterior distribution P(θ|D) that can be used to describe our uncertainty over the parameters given the evidence we observed. Recall that we will construct P(θ|D) using Bayes' theorem:
P(θ | D) = P(D | θ) P(θ) / P(D).
This approach has the advantage of better modeling the full uncertainty over the parameters. However, the probabilities no longer correspond to limiting frequencies within a data-generating process described by P(D|θ). Instead, they can only be interpreted as “beliefs”. Most crucially, these probabilities can arguably be called “subjective” because they depend on a set of arbitrary initial beliefs specified by P(θ). This may raise objections, since we might want our inferences about the world to be independent of any subjective choices by the statistician. It is also not always clear how to specify P(θ) and how to translate our prior beliefs into probabilities. Arguments like these are part of the great frequentist vs. Bayesian debate in statistics. Here, we will see a concrete example of how the Bayesian approach can be useful.
(a) Let’s use the posterior to make predictions on new samples. Suppose that the likelihood P(D|θ) is a product of categorical distributions (i.e. we assume that the observations are independent of each other given θ), and let’s choose a Dirichlet prior over θ, i.e. θ ∼ Dirichlet(α1,...,αK). Recall that
P(θ) = (Γ(α1 + ··· + αK) / (Γ(α1) ··· Γ(αK))) ∏_{k=1}^{K} θk^{αk − 1}.
Show that the Bayesian predictive probability using a Dirichlet prior is
P(X[M + 1] = xi | D) = (M[i] + αi) / (M + α),
where M[i] is the number of times x[m] = xi appears in the dataset and α = Σ_i αi. X[M + 1] is assumed conditionally independent of D given θ.
Note that this probability has a very neat interpretation: M[i]/M by itself is simply the frequency of class i in our dataset. By adding a Dirichlet prior, we effectively augment our dataset with αi “virtual” data points of class i: the predictive probability is the same as we would have obtained if there were αi extra points of class i in the actual dataset!
Hint: Recall from Lecture 15 that the posterior P(θ | x[1],...,x[M]) is given by Dirichlet(α′1,...,α′K), where
α′k = αk + Σ_{j=1}^{M} 1{x[j] = xk}.
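To make the formula concrete, here is a minimal numerical sketch (the counts and hyperparameters are made up purely for illustration):

```python
import numpy as np

# Made-up example: K = 3 classes, Dirichlet hyperparameters alpha_i,
# and per-class counts M[i] from a dataset of size M.
alpha = np.array([1.0, 2.0, 1.0])     # prior pseudo-counts alpha_i
counts = np.array([5, 3, 2])          # M[i]: observed count of class i
M = counts.sum()

# Bayesian predictive probability P(X[M+1] = x_i | D) = (M[i] + alpha_i) / (M + alpha)
predictive = (counts + alpha) / (M + alpha.sum())
print(predictive, predictive.sum())   # the probabilities sum to 1
```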
(b) Now we want to compute the Bayesian predictive probability over two samples. Show how to compute
P(X[M + 1] = xi,X[M + 2] = xj|D)
(c) Suppose we decide to use the approximation
P(X[M + 1] = xi,X[M + 2] = xj|D) ≈ P(X[M + 1] = xi|D) · P(X[M + 2] = xj|D)
That is, we ignore the dependencies between X[M + 1] and X[M + 2]. Analyze the error in this approximation (the ratio between the approximation and the correct probability). What is the quality of this approximation for small M? What is the asymptotic behavior of the approximation as M → ∞?
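For orientation, a sketch of the structure (using only the chain rule and the part (a) result; notation as in part (a)):

P(X[M + 1] = xi, X[M + 2] = xj | D) = P(X[M + 1] = xi | D) · P(X[M + 2] = xj | X[M + 1] = xi, D)
                                    = ((M[i] + αi) / (M + α)) · ((M[j] + αj + 1{i = j}) / (M + α + 1)).

The ratio of the approximation to this exact probability is therefore (M[j] + αj)(M + α + 1) / ((M[j] + αj + 1{i = j})(M + α)), which can differ noticeably from 1 for small M but tends to 1 as M → ∞.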
In general, Bayesian inference may not always be tractable, and often requires approximations such as the one in this question. Finding such approximations is the topic of a large subfield of machine learning which studies the problem of “approximate inference”.
2. Programming Assignment
This homework explores parameter learning in latent variable graphical models in the context of a hypothetical problem involving voter registration.
Suppose you are working as a volunteer on behalf of one of the two major presidential candidates in an evenly divided city in a swing state – let’s say, Cleveland, Ohio. Your goal is to register as many voters as possible who are likely to support your candidate. However, your party has a limited number of volunteers, and needs to be wise about how it spends scarce resources in canvassing for voters. Fortunately, a local university recently conducted an extensive survey in which the city was partitioned into fifty precincts, and twenty citizens per precinct were surveyed on their political views. A table has been made publicly available that lists for each respondent:
• The precinct i = 1,...,N. In our city, N = 50.
• The index j = 1,...,M of the respondent within its precinct; in our case, M = 20.
• Two summary statistics Xij, indicating respectively the overall social conservatism/liberalism and economic conservatism/liberalism of the j-th respondent in precinct i.
• In five of the fifty precincts, we have explicit respondent party preferences Zij ∈ {0,1}.
Your goal will be to use the summary statistics to obtain probabilistic information about party preference for the respondents that have not been explicitly surveyed (missing data).
Figure 1: Graphical representation of the models used in this problem.
We will use two models shown in Figure 1 and described below.
(A) Gaussian mixture model
We model the Xij as a mixture of two Gaussians P(X) = (1 − π)N(X|µ0,Σ0) + πN(X|µ1,Σ1).
This model tries to capture the following generative story: first, each person independently samples a party preference Zij from a Bernoulli distribution with parameter π; then, their summary statistics are sampled as Xij | zij ∼ N(µzij, Σzij), where µl is the class-conditional mean for party l, and Σl is the class-conditional variance/covariance matrix for party l. Note that precinct membership does not factor into this model, despite its use in indexing the variables.
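For concreteness, a minimal sketch of this generative story in code (the parameter values below are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up illustrative parameters for the two-component mixture.
pi = 0.5
mu = [np.array([-1.0, -1.0]), np.array([1.0, 1.0])]   # mu_0, mu_1
Sigma = [np.eye(2), np.eye(2)]                          # Sigma_0, Sigma_1

def sample_respondent():
    """Sample one respondent under model A: z ~ Bernoulli(pi), x | z ~ N(mu_z, Sigma_z)."""
    z = int(rng.random() < pi)
    x = rng.multivariate_normal(mu[z], Sigma[z])
    return z, x
```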
With regards to the above model, answer the following questions.
i Estimate the parameters π, µ0, µ1, Σ0, and Σ1 by maximum likelihood using just data from the five precincts in which respondents have reported their party preferences. The data is provided in 'survey-labeled.dat'. Provide an explicit estimator formula for each parameter.
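For a sense of what such estimators look like in code, here is a minimal sketch; it assumes each row of 'survey-labeled.dat' is whitespace-separated as precinct, respondent index, the two summary statistics, and the party preference (the exact file layout is an assumption, so adjust the parsing as needed):

```python
import numpy as np

# Assumption: each row of survey-labeled.dat is "precinct respondent x1 x2 z".
data = np.loadtxt('survey-labeled.dat')
x = data[:, 2:4]                 # summary statistics x_ij in R^2
z = data[:, 4].astype(int)       # observed party preference z_ij

# Maximum-likelihood estimates with fully observed z:
pi_hat = z.mean()                                             # fraction of respondents with z = 1
mu_hat = [x[z == l].mean(axis=0) for l in (0, 1)]             # class-conditional means
Sigma_hat = [np.cov(x[z == l].T, bias=True) for l in (0, 1)]  # class-conditional covariances (MLE, divisor N)
```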
ii In this part, we will use the unlabeled dataset 'survey-unlabeled.dat' to learn the model parameters. Implement a Gaussian-mixture EM algorithm for model A and use it to estimate θ = {π, µ0, µ1, Σ0, Σ1}.
• The E-step computes P(zij|xij, θ), i.e. the expected “counts” for every respondent given a fixed value of the model parameters.
• The M-step re-estimates the model parameters θ based on the expected values calculated in the E-step.
The algorithm should compute and output the log likelihood on every iteration, and should terminate when this quantity increases by less than a value of 0.01 between iterations. Run the algorithm with three different initializations: one equal to your estimates from part 2(A)i and two other (poorer) initializations of your choice. Plot the log likelihood as a function of algorithm iteration for all three cases. Comment on any differences in the local maxima that are found. Report your parameter estimates.
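One possible shape for such an implementation is sketched below (initialization, data loading, and plotting are omitted, and SciPy is used for the Gaussian densities; this is only one reasonable structure, not the required one):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(x, pi, mu, Sigma, tol=0.01, max_iter=500):
    """EM for the two-component Gaussian mixture of model A.

    x: (n, 2) array of summary statistics; pi, mu, Sigma: initial parameter guesses.
    """
    log_liks = []
    for _ in range(max_iter):
        # E-step: responsibilities gamma_n = P(z_n = 1 | x_n, theta)
        p0 = (1 - pi) * multivariate_normal.pdf(x, mu[0], Sigma[0])
        p1 = pi * multivariate_normal.pdf(x, mu[1], Sigma[1])
        gamma = p1 / (p0 + p1)

        # Observed-data log likelihood under the current parameters
        log_liks.append(np.sum(np.log(p0 + p1)))
        if len(log_liks) > 1 and log_liks[-1] - log_liks[-2] < tol:
            break

        # M-step: re-estimate the parameters from the expected counts
        pi = gamma.mean()
        w = [1 - gamma, gamma]
        mu = [np.average(x, axis=0, weights=w[l]) for l in (0, 1)]
        Sigma = [np.cov(x.T, aweights=w[l], bias=True) for l in (0, 1)]
    return pi, mu, Sigma, log_liks
```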
(B) Geography-aware mixture model
The second model attempts to capture the fact that respondents in the city tend to be geographically separated by party preference, with some precincts showing strong preferences for one party and others showing strong preferences for the other party. In this model, an additional variable Yi ∈ {0,1} is introduced for each precinct i, representing that precinct’s preferred party. Our new model has the following generative story: first, the Yi variables are drawn i.i.d. from a Bernoulli distribution with parameter φ. Then, the party preferences Zij are sampled according to
p(zij | yi) = λ if zij = yi, and p(zij | yi) = 1 − λ otherwise.
Here, λ is a new parameter that we introduce. Note also that the Zij variables are conditionally independent given the Yi variables. Finally, the Xij are sampled as in the previous model.
With regards to the above model, answer the following questions.
i. We will first estimate the parameters using data from the five precincts in which respondents have reported their party preferences, provided in 'survey-labeled.dat'. Specifically, estimate the parameters φ and λ by the approximate method of setting each yi to the consensus (majority) of the corresponding zij values (i.e. yi = 1{(1/M) Σj zij > 1/2}), then acting as if the yi were observed. Explicitly write out the log likelihood function in terms of φ, λ, µ0, µ1, Σ0, and Σ1. Derive maximum likelihood estimators for φ and λ in terms of the completely observed x, y, and z variables. Note that the estimates for µ0, µ1, Σ0, and Σ1 will remain unchanged from the ones estimated for part 2(A)i.
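As a rough sketch of where this lands (one way the algebra works out under this “fully observed” treatment; notation as above, with N precincts of M respondents each), the complete log likelihood is

ℓ(θ) = Σ_{i=1}^{N} [ log p(yi | φ) + Σ_{j=1}^{M} ( log p(zij | yi, λ) + log N(xij | µzij, Σzij) ) ],

and maximizing it in φ and λ yields the Bernoulli-style estimators

φ̂ = (1/N) Σ_{i=1}^{N} yi,   λ̂ = (1/(NM)) Σ_{i=1}^{N} Σ_{j=1}^{M} 1{zij = yi}.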
ii. Having estimated your parameters for model B by “supervised” training, use them to analyze the unlabeled data set ('survey-unlabeled.dat') by identifying precincts to be targeted by party 1. Specifically, compute p(yi = 1 | xi,1:M) for each precinct i, where xi,1:M = {xij : 1 ≤ j ≤ M}, and identify those precincts for which this probability exceeds 0.5. Summarize your results by presenting a table with one row per precinct, in ascending order by index, a column with the quantity p(yi = 1 | xi,1:M), and a mark indicating the precincts that exceed the 0.5 threshold. In addition, compute p(zij | xi,1:M) for every respondent (i, j), and summarize the results by plotting the data points on a two-dimensional plane, coloring them blue if p(zij = 1 | xi,1:M) > 0.5 and red otherwise. Also indicate the positions of the two means.
Hint: From the factorized joint distribution of the precinct,
p(yi, zi,1:M, xi,1:M) = p(yi) ∏_{j=1}^{M} p(zij | yi) p(xij | zij).
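One way to turn this hint into code is sketched below (the arguments are assumed to hold the supervised estimates from part 2(B)i, and working in log space for the precinct-level normalization is just one reasonable choice):

```python
import numpy as np
from scipy.stats import multivariate_normal

def precinct_posteriors(x_i, phi, lam, mu, Sigma):
    """Posteriors for one precinct. x_i: (M, 2) array of that precinct's summary statistics."""
    # Class-conditional densities N(x_ij | mu_l, Sigma_l) for l = 0, 1; shape (M, 2)
    dens = np.stack([multivariate_normal.pdf(x_i, mu[l], Sigma[l]) for l in (0, 1)], axis=1)

    # p(x_ij | y_i) = sum_z p(z_ij | y_i) p(x_ij | z_ij), with p(z = y) = lam, p(z != y) = 1 - lam
    px_given_y0 = lam * dens[:, 0] + (1 - lam) * dens[:, 1]
    px_given_y1 = lam * dens[:, 1] + (1 - lam) * dens[:, 0]

    # p(y_i | x_{i,1:M}) proportional to p(y_i) * prod_j p(x_ij | y_i); normalize in log space
    log_joint = np.array([np.log(1 - phi) + np.log(px_given_y0).sum(),
                          np.log(phi) + np.log(px_given_y1).sum()])
    log_joint -= log_joint.max()
    p_y = np.exp(log_joint) / np.exp(log_joint).sum()

    # p(z_ij = 1 | x_{i,1:M}) = sum_y p(y_i | x_{i,1:M}) p(z_ij = 1 | y_i, x_ij)
    pz1_given_y0 = (1 - lam) * dens[:, 1] / px_given_y0
    pz1_given_y1 = lam * dens[:, 1] / px_given_y1
    p_z1 = p_y[0] * pz1_given_y0 + p_y[1] * pz1_given_y1
    return p_y[1], p_z1
```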
iii. Now, we will estimate the parameters for model B using the unlabelled dataset 'survey-unlabeled.dat'. Derive E- and M-step updates for model B. Start by writing down the complete log likelihood from part 2(B)i. Let θ = {φ,λ,µ0,µ1,Σ0,Σ1} denote the model parameters.
• In the E-step, derive an expression for p(yi,zi,1:M|xi,1:M,θ), which is the expected value of the relevant “counts” for a fixed value of the parameters. Can this be represented compactly (in a factored form)?
• The M-step re-estimates the model parameters θ based on the expected values computed in the E-step. As a hint, note that the parameter update equations can be easily computed once we have p(yi, zij | xi,1:M, θ), p(zij | xi,1:M, θ), and p(yi | xi,1:M, θ).
Hint: Your calculations in the M-step should roughly mirror the ones in part 2(B)i. In addition, the E-step will resemble that for model A.
iv. Based on the above updates, implement an EM algorithm for model B. Run the algorithm with three different initializations, plot the log likelihood and report the parameter estimates.
v. Using your best set of parameter estimates (those yielding the highest log likelihood) again compute p(yi = 1|xi,1:M) for each precinct i, and identify those precincts for which this probability exceeds 0.5. Report a new version of the table from part 2(B)ii. In addition, again compute p(zij|xi,1:M) and prepare a new plot with red and blue data points.
Hints:
(a) It is important that you understand the EM algorithm well in order to do this programming assignment. One good note on the EM algorithm is: https://people.csail.mit.edu/rameshvs/content/gmm-em.pdf.
(b) One common source of mistakes in part B is how you calculate the log of a product of sums. Note that
log(∏_i Σ_j f(xij)) = Σ_i log(Σ_j f(xij)).
This is not equivalent to
Σ_i Σ_j log(f(xij)).
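Relatedly (an implementation detail, not something the assignment requires), the inner log-of-sum terms are often computed in log space with a log-sum-exp to avoid underflow; a minimal sketch with made-up values:

```python
import numpy as np
from scipy.special import logsumexp

# With log_f[i, j] = log f(x_ij), the correct quantity
#   log(prod_i sum_j f(x_ij)) = sum_i log(sum_j f(x_ij))
# can be computed stably even when f(x_ij) would underflow in linear space.
log_f = np.random.uniform(-800, -600, size=(50, 20))   # made-up log-densities; exp() would underflow
total = logsumexp(log_f, axis=1).sum()                  # sum_i log(sum_j f(x_ij)), no underflow
print(total)
```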