CS228 Probabilistic Graphical Models Homework 4-Solved


1.   We have a data association problem where there are K objects and we are given K observations. Each observation corresponds to a single object, and we are given one observation for each object. However, we don't know which observation corresponds to which object, and we would like to infer it using a probabilistic model relating observations to objects. Specifically, we have

•   K objects u1,...,uK

•   observations v1,...,vK, where Val(vi) = {a1,...,aL} (so each vi is a discrete random variable); each observation corresponds to the appearance of one object, and there is exactly one observation of each object

•   correspondence variables C1,...,CK, where Val(Ci) = {1,...,K}; Ci = k denotes that measurement vi is derived from object uk

•   a known appearance model for each object uk: Pk(vi = al | Ci = k).

Note that because of the mutex constraints, the correspondence variables C1,...,CK will be a permutation over 1,...,K. We also assume for simplicity that all permutations are equally likely a priori.

We wish to compute the marginals P(Ci|v1,...,vK), for i = 1,...,K, using Metropolis-Hastings (MH) to sample the correspondence variables. We will start with an arbitrary assignment to C1,...,CK and take MH steps. The proposal distribution we will use picks a pair of correspondence variables (Ci, Cj) uniformly at random over all pairs and swaps their assignments.
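To make the proposal concrete, here is a minimal sketch of one swap step in Python; the array names are illustrative, with P_app[k, l] standing for Pk(vi = al | Ci = k). Because the swap proposal is symmetric and all permutations are equally likely a priori, the acceptance ratio reduces to a likelihood ratio, which is essentially the quantity part (a) asks you to derive.

import numpy as np

def mh_swap_step(C, v, P_app, rng):
    # One MH step with the pairwise-swap proposal.
    # C: length-K permutation array; C[i] is the object currently assigned to v_i.
    # v: length-K array of observed appearance indices (values in 0..L-1).
    # P_app: K x L array with P_app[k, l] = P_k(v_i = a_l | C_i = k).
    K = len(C)
    i, j = rng.choice(K, size=2, replace=False)  # uniform over pairs, i != j
    # Symmetric proposal: the Hastings correction cancels, leaving the
    # likelihood ratio of the swapped assignment to the current one.
    ratio = (P_app[C[j], v[i]] * P_app[C[i], v[j]]) / \
            (P_app[C[i], v[i]] * P_app[C[j], v[j]])
    if rng.random() < min(1.0, ratio):
        C[i], C[j] = C[j], C[i]  # accept the swap; otherwise keep the state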

(a)   Compute the acceptance probability for each MH step.

(b)   Suppose we have run the MH sampler for a long time and collected M samples (C1[m],...,CK[m]) for m = 1,...,M after the chain has mixed. Give an explicit expression for estimating the marginal P(Ci|v1,...,vK).

(c)     Your friend Geoff Gibbs hears about your MH algorithm and suggests that you can also consider using Gibbs sampling to compute your marginals. Briefly explain why this will or will not work.


2.  Multi-conditional Parameter Learning, Markov Networks

In this problem, we consider learning the parameters of a Markov network using a specific objective function. In particular, assume that we have two sets of variables Y and X, and a dataset D = {(x[1], y[1]), ..., (x[M], y[M])}. We will estimate the model parameters θ = [θ1 ... θn] by maximizing the following objective function:

g(θ;D) = (1 − α) ℓY|X(θ;D) + α ℓX|Y(θ;D)

where ℓX|Y(θ;D) denotes the conditional log-likelihood of the dataset D under the distribution P(X | Y) defined by the Markov network with parameters θ (and similarly for ℓY|X). Thus, our objective is a mixture of two conditional log-likelihoods (0 < α < 1). As usual, we consider a log-linear parameterization of the Markov network, using a set of n features fi(Xi, Yi), where Xi and Yi are some (possibly empty) subsets of the variables X and Y, respectively.
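For reference, a sketch of the standard identity that part (a) builds on: under this log-linear parameterization, the conditional log-likelihood ℓY|X can be written

ℓY|X(θ;D) = Σ_{m=1}^{M} [ Σ_{i=1}^{n} θi fi(xi[m], yi[m]) − log Z(x[m]; θ) ],

where Z(x; θ) = Σ_{y′} exp( Σ_{i=1}^{n} θi fi(xi, y′i) ) normalizes over all assignments to Y with X = x held fixed; ℓX|Y is the symmetric expression with the roles of X and Y exchanged.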

(a)   Write down the full objective function g(θ;D) in terms of the features fi and weights θi.

(b)   Derive ∂g(θ;D)/∂θi, the derivative of the objective with respect to a weight θi. Write your final answer in terms of feature expectations E_Q[fi], where Q is either the empirical distribution P̂ of our dataset, or a conditional distribution of the form P(W | Z = z) (for some sets of variables W and Z, and assignment z).


3.   Expectation Maximization in a Naive Bayes Model

Consider the Naive Bayes model with class variable C and discrete evidence variables X1,...,Xn. The CPDs of the model are parameterized by P(C = c) = θc and P(Xi = xi | C = c) = θxi|c for i = 1,...,n, and for all assignments xi ∈ Val(Xi) and classes c ∈ Val(C).

Now given a data set D = {x[1],...,x[M]}, where each x[m] is a complete assignment to the evidence variables, X1,...,Xn, we can use EM to learn the parameters of our model. Note that the class variable, C, is never observed.

Show that if we initialize the parameters uniformly,

θc = 1/|Val(C)|    and    θxi|c = 1/|Val(Xi)|,

for all xi and c, then the EM algorithm converges in one iteration, and give a closed-form expression for the parameter values at this convergence point.
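A sketch of the key observation, using only quantities defined above: in the first E-step, the posterior over the hidden class is uniform for every data point, because every factor is a constant that cancels between numerator and denominator:

P(C = c | x[m]; θ) = θc Π_{i=1}^{n} θxi[m]|c / ( Σ_{c′} θc′ Π_{i=1}^{n} θxi[m]|c′ ) = 1/|Val(C)|    for all c and m.

The M-step applied to these uniform responsibilities yields parameters that induce exactly the same uniform posteriors, which is why the algorithm stops changing after a single iteration.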

Programming Assignment

In this homework, you will apply Gibbs sampling to a simple Markov random field model for image restoration. In an image restoration problem, you are given an image x corrupted by noise and you want to recover the original image y (see Figure 1 and Lecture 4).

[Figure 1: Image denoising with graphical models. (a) The image denoising task. (b) The graphical model structure, for N = M = 3.]

Let x = {xij} denote the observed image, with xij ∈ {−1,+1} representing the pixel at row i and column j. Assume a black-and-white image, with −1 corresponding to white and +1 to black. The image has dimensions N × M, so that 1 ≤ i ≤ N and 1 ≤ j ≤ M. Assume a set of (unobserved) variables y = {yij} representing the true (unknown) image, with yij ∈ {−1,+1} indicating the value of xij before noise was added. Each (internal) yij is linked with its four immediate neighbors, yi−1,j, yi+1,j, yi,j−1, and yi,j+1, which together are denoted yN(i,j). Pixels at the borders of the image (with i ∈ {1,N} or j ∈ {1,M}) also have neighbor sets denoted yN(i,j), but these sets are reduced in the obvious way. We denote by E the corresponding set of edges. For example, the pair ((1,1),(1,2)) ∈ E, but ((1,1),(2,2)) ∉ E. The joint probability of y and x can be written (with no prior preference for black or white):

p(y, x) = (1/Z) Π_{i,j} exp(η yij xij) Π_{((i,j),(i′,j′))∈E} exp(β yij yi′j′)        (1)

        = (1/Z) exp( η Σ_{i,j} yij xij + β Σ_{((i,j),(i′,j′))∈E} yij yi′j′ )         (2)

where

Z = Σ_{y,x} exp( η Σ_{i,j} yij xij + β Σ_{((i,j),(i′,j′))∈E} yij yi′j′ )             (3)

(Notice in particular that each pair of neighbors, yij and yi′j′, enters the formula only once, even though each variable is a neighbor of the other. Failing to account for this will lead to double counting of β values.) This is equivalent to a Boltzmann (sometimes called Gibbs) distribution with “energy”:

E(y, x) = −η Σ_{i,j} yij xij − β Σ_{((i,j),(i′,j′))∈E} yij yi′j′                     (4)

The system will have lower energy, and hence higher probability, in states in which neighboring yij variables tend to have the same value, and in which each yij tends to agree with its corresponding xij (assuming η and β are positive). This captures the fact that each noisy pixel xij is likely to be similar to the corresponding “true” pixel yij, and that images tend to be “smooth”.
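Equation (4) translates directly into code. A minimal sketch, assuming y and x are NumPy arrays with entries in {−1, +1} (function and variable names are illustrative):

import numpy as np

def energy(y, x, eta=1.0, beta=1.0):
    # E(y, x) from Eq. (4); lower energy corresponds to higher probability.
    data_term = np.sum(y * x)                    # sum over i,j of y_ij x_ij
    edge_term = (np.sum(y[:, :-1] * y[:, 1:])    # horizontal neighbor pairs
               + np.sum(y[:-1, :] * y[1:, :]))   # vertical neighbor pairs
    # Visiting only right and down neighbors counts each edge in E exactly
    # once, avoiding the double counting of beta values noted above.
    return -eta * data_term - beta * edge_term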

There are algorithms for deterministically estimating y given an image x, but here we will take an alternative approach: we will devise a Markov chain Monte Carlo (MCMC) algorithm to sample values of y conditional on x. This approach has several advantages over the deterministic algorithms:

i.       It is very general, and can easily be extended to more complex graphs.

ii.     It provides great flexibility for quantifying uncertainty in y (and, potentially, in the parameters η and β).

iii.   It is relatively straightforward in this setting to derive the exact conditional distributions for nodes given the Markov blanket, so Gibbs sampling is possible, and one need not worry about the acceptance rate for proposed samples.

You will apply your methods to two small, black-and-white images that have been made available with the problem set. These two noisy images, and the original, undistorted image from which they derive, are available both in PNG format and in a simple text format that lists each coordinate pair (i,j) and the corresponding value of xij. You may find it useful to convert between this text representation and a viewable image format.

(a)  Derive an expression for the conditional probability that pixel (i,j) is black given its Markov blanket, i.e., p(yij = 1 | yM(i,j)), where yM(i,j) denotes the variables in the Markov blanket of yij (you should be explicit about which variables are included). Your expression should take the form of a logistic function and should depend only on η, β, and yM(i,j).

(b)   Outline a Gibbs sampling algorithm (in pseudocode) that iterates over the pixels in the image and samples each yij given its Markov blanket. Use the simple approach of sweeping across the image in row-major fashion on every iteration of the algorithm. Thus, an “iteration” will generate a complete new sample of y. Allow for a burn-in of B iterations, followed by draws of S samples. You may assume η and β are fixed constants. How can we show in our case that the equilibrium distribution is in fact the posterior distribution p(y|x)?
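For concreteness, here is a minimal sketch of this procedure in Python; it assumes the logistic conditional you derive in part (a), and all names are illustrative rather than required:

import numpy as np

def gibbs_sweep(y, x, eta, beta, rng):
    # One row-major sweep: resample every y_ij given its Markov blanket.
    N, M = y.shape
    for i in range(N):
        for j in range(M):
            nbr = 0  # sum of the (up to four) y-neighbors; borders have fewer
            if i > 0:     nbr += y[i - 1, j]
            if i < N - 1: nbr += y[i + 1, j]
            if j > 0:     nbr += y[i, j - 1]
            if j < M - 1: nbr += y[i, j + 1]
            # Logistic conditional from part (a): p(y_ij = +1 | Markov blanket)
            p_black = 1.0 / (1.0 + np.exp(-2.0 * (eta * x[i, j] + beta * nbr)))
            y[i, j] = 1 if rng.random() < p_black else -1

def run_chain(x, eta=1.0, beta=1.0, B=100, S=1000, y0=None, seed=0):
    # Burn in for B sweeps, then retain S samples, counting how often y_ij = 1.
    rng = np.random.default_rng(seed)
    y = x.copy() if y0 is None else y0.copy()
    for _ in range(B):
        gibbs_sweep(y, x, eta, beta, rng)
    counts = np.zeros(x.shape)
    for _ in range(S):
        gibbs_sweep(y, x, eta, beta, rng)
        counts += (y == 1)
    return counts / S  # Monte Carlo estimate of p(y_ij = 1 | x); see Eq. (5)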

(c)  Implement your algorithm and apply it to the image with 20% noise (noisy 20.png / noisy 20.txt). Use values of η = 1, β = 1, B = 100, and S = 1000. On each iteration of your algorithm, compute the energy E(y,x) for the current sample of y and output it to a log file, keeping track of which values correspond to the burn-in. Run your algorithm with three different initializations: one in which each yij is initialized to xij, one in which each yij is initialized to −xij, and one in which the yij are set to −1 or +1 at random. Plot the energy of the model as a function of the iteration number for all three chains and visually inspect these traces for signs of convergence. Do all three seem to be converging to the same general region of the posterior, or are some obviously suboptimal? Does the burn-in seem to be adequate in length? Is there substantial fluctuation from iteration to iteration, indicating that the chain is mixing well, or does it become stuck at particular energies for several iterations at a time?

(d)  Have your program output a restored image after completing its sampling iterations, by thresholding the estimated posterior probabilities for the yij variables at 0.5 - i.e., by estimating the “true” color of each pixel (i,j) as:

ŷij =  +1   if p(yij = 1 | x) > 0.5
       −1   otherwise

To estimate the required posterior probabilities, store a running count cij of the number of (retained) samples for which each yij = 1, and then use the Monte Carlo estimate:

p(yij = 1 | x) ≈ cij/S = (1/S) Σ_{t=1}^{S} 1{yij^(t) = 1}                            (5)

where yij^(t) represents the t-th sample of yij. Restore both the 10%- and 20%-noise images in this way, using the same values of η, β, B, and S as above. Evaluate the quality of the restoration by computing the fraction of all pixels that differ between the restored image and the original image. Prepare a figure for each of the two images, showing the original, the noisy version, and the restoration side by side.
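Putting parts (c) and (d) together, the thresholding and error computation are short; a sketch, assuming probs is the array of Eq. (5) estimates and truth is the original image (illustrative names):

import numpy as np

def restore_and_score(probs, truth):
    # Threshold the posterior estimates at 0.5 and score against the original.
    restored = np.where(probs > 0.5, 1, -1)
    error = np.mean(restored != truth)  # fraction of pixels that differ
    return restored, error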

(e)    If you have implemented your algorithm correctly, your restored images should be quite close to the original. But is this because you have a clever algorithm, or just because the problem is easy? To examine this question, implement a trivial reconstruction algorithm that sets each yij equal to the consensus (majority) of its neighbors (including xij), iterating until convergence (use sequential rather than batch updates, as in Gibbs sampling; this algorithm need not converge in theory, but in practice it usually does so quickly, and to be safe you can force it to terminate after, say, 30 iterations). You should be able to get this program working quickly and easily by reusing code from your Gibbs sampler. Note, however, that in this case you should not average over samples (there are no samples here) but instead should use the final values of the yij variables for your restored image. Run this program on both images and compute its restoration error. Include figures for the images restored in this way. Does the Gibbs sampler do better than the trivial algorithm? Why or why not?
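A sketch of this baseline, reusing the sweep structure of the Gibbs sampler but replacing the sampling step with a deterministic majority vote (names illustrative):

import numpy as np

def majority_restore(x, max_iters=30):
    # Iterated sequential majority vote over each pixel's neighbors plus x_ij.
    N, M = x.shape
    y = x.copy()
    for _ in range(max_iters):
        changed = False
        for i in range(N):
            for j in range(M):
                vote = x[i, j]  # the observed pixel gets a vote too
                if i > 0:     vote += y[i - 1, j]
                if i < N - 1: vote += y[i + 1, j]
                if j > 0:     vote += y[i, j - 1]
                if j < M - 1: vote += y[i, j + 1]
                new = 1 if vote > 0 else (-1 if vote < 0 else y[i, j])  # ties keep value
                if new != y[i, j]:
                    y[i, j] = new
                    changed = True
        if not changed:  # no pixel flipped: the vote has converged
            break
    return y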

(f)   While the Gibbs sampler is useful for obtaining marginal posterior probabilities of interest, much of its appeal derives from its flexibility in estimating posterior distributions for more complex features of the model. To get a sense of this flexibility, use your Gibbs sampler to estimate the posterior distribution over the number of pixels in the “Z” in the image, which approximately falls in the rectangle from (i = 125, j = 143) to (i = 162, j = 174). Using the same parameters as above, simply count the cases of yij = +1 within this rectangle for each retained sample, output one count per iteration as your sampler runs, and then use the relative frequencies of these counts as an estimate of the posterior distribution of interest. Plot a histogram showing these relative frequencies for both images and comment on any differences between the two estimated posterior distributions.
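The per-sample statistic is a single count; a sketch, assuming the same array conventions as above (the rectangle indices are 1-indexed in the problem statement, hence the 0-indexed slices):

import numpy as np

def z_pixel_count(y):
    # Black (+1) pixels in the "Z" rectangle: rows 125..162, cols 143..174.
    return int(np.sum(y[124:162, 142:174] == 1))

Calling this once per retained sweep and histogramming the resulting counts gives the requested posterior estimate.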



