CS228 Probabilistic Graphical Models: Homework 1 (Solved)
Problem 1: Probability theory
The doctor has bad news and good news for X. The bad news is that X tested positive for a serious disease, and that the test is 99% accurate (i.e., the probability of testing positive given that you have the disease is 0.99, as is the probability of testing negative given that you don’t have the disease). The good news is that this is a rare disease, striking only one in 10,000 people. Why is it good news that the disease is rare? What are the chances that X actually has the disease?
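A quick numerical check of the posterior via Bayes' rule, plugging in the numbers from the problem statement:

```python
prior = 1 / 10_000            # P(disease): the disease is rare
sensitivity = 0.99            # P(positive | disease)
specificity = 0.99            # P(negative | no disease)

# P(positive) by the law of total probability
p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)

# P(disease | positive) by Bayes' rule
posterior = sensitivity * prior / p_positive
print(f"P(disease | positive) = {posterior:.4f}")  # about 0.0098, i.e., under 1%
```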
Problem 2: Review of dynamic programming
Suppose you have a probability distribution P over random variables X1,X2,...,Xn which all take values in the set S = {v1,...,vm}, where the vj are some distinct values (e.g., integers or letters). Suppose that P satisfies the Markov assumption: for all i ≥ 2 we have
P(xi|xi−1,...,x1) = P(xi|xi−1).
In other words, P factorizes as
P(x1,x2,...,xn) = P(x1)P(x2|x1)···P(xn|xn−1).
For each factor P(xi|xi−1) with i ≥ 2 you are given the probability P(Xi = u|Xi−1 = v) for each u,v ∈ S in the form of an m × m table. You are also given P(X1 = v) for each v ∈ S.
• (7 points) Give an O(m²n) algorithm for solving the problem
max_{x1,...,xn} P(x1,x2,...,xn).
Hint: think dynamic programming!
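If the objective is the maximization displayed above, a minimal Viterbi-style dynamic-programming sketch in NumPy, assuming each transition table is indexed as T[u, v] = P(Xi = u | Xi−1 = v):

```python
import numpy as np

def max_prob(p1, transitions):
    """Viterbi-style DP for max_{x1..xn} P(x1) P(x2|x1) ... P(xn|xn-1).

    p1:          length-m array, p1[v] = P(X1 = v)
    transitions: n-1 arrays of shape (m, m), T[u, v] = P(X_i = u | X_{i-1} = v)

    Runs in O(m^2 n): each of the n-1 steps does an O(m^2) table update.
    """
    d = np.log(p1)                            # d[v] = best log-prob of a prefix ending in v
    for T in transitions:
        # new d[u] = max_v ( log T[u, v] + d[v] )
        d = np.max(np.log(T) + d[None, :], axis=1)
    return np.exp(np.max(d))

# Tiny usage example with m = 2, n = 3:
# max_prob(np.array([0.5, 0.5]), [np.array([[0.9, 0.2], [0.1, 0.8]])] * 2)
```

Working in log space avoids numerical underflow when n is large; the same recurrence works directly on probabilities for small n.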
Problem 3: Bayesian networks
Let us try to relax the definition of Bayesian networks by removing the assumption that the directed graph is acyclic. Suppose we have a directed graph G = (V,E) and discrete random variables X1,··· ,Xn, and define
f(x1, ..., xn) = ∏_{v∈V} fv(xv | xpa(v))
where pa(v) refers to the parents of variable Xv in G and fv(xv|xpa(v)) specifies a distribution over Xv for every assignment to the parents of Xv, i.e., 0 ≤ fv(xv|xpa(v)) ≤ 1 for all xv ∈ Val(Xv) and ∑_{xv ∈ Val(Xv)} fv(xv|xpa(v)) = 1. Recall that this is precisely the definition of the joint probability distribution associated with the Bayesian network G, where the fv are the conditional probability distributions. Show that if G has a directed cycle, f may no longer define a valid probability distribution. In particular, give an example of a cyclic graph G and distributions fv that lead to improper probability distributions. Remember, a valid probability distribution must be non-negative and sum to one. This is why Bayesian networks must be defined on acyclic graphs.
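For intuition, here is one standard counterexample of the kind the problem asks for, checked numerically: two binary variables on the cycle X1 → X2 → X1, each deterministically copying the other. Each fv is a perfectly valid conditional distribution, yet f sums to 2:

```python
import itertools

# Edges X1 -> X2 and X2 -> X1 form a directed cycle.
# Each "CPD" deterministically copies its parent: for every parent value,
# the child distribution sums to 1, so each fv is a valid conditional.
f1 = lambda x1, x2: 1.0 if x1 == x2 else 0.0   # f1(x1 | x2)
f2 = lambda x2, x1: 1.0 if x2 == x1 else 0.0   # f2(x2 | x1)

total = sum(f1(x1, x2) * f2(x2, x1)
            for x1, x2 in itertools.product([0, 1], repeat=2))
print(total)  # 2.0, not 1.0 -- f is not a valid joint distribution
```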
Problem 4: Conditional Independence
This question investigates how conditional independence relationships affect the amount of information needed for probabilistic calculations. Let α, β, and γ be three random variables.
• (6 points) Suppose we wish to calculate Pr(α|β,γ) and we have no conditional independence information. Which of the following sets of numbers is sufficient for the calculation?
1. Pr(β,γ), Pr(α), Pr(β|α) and Pr(γ|α).
2. Pr(β,γ), Pr(α) and Pr(β,γ|α).
3. Pr(β|α), Pr(γ|α) and Pr(α).
For each case, justify your response either by showing how to calculate the desired answer or by explaining why this is not possible.
• (6 points) Suppose we know that β and γ are conditionally independent given α. Now which of the preceding three sets is sufficient? Justify your response as before.
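As a starting point, both parts turn on two identities. By Bayes' rule,

Pr(α|β,γ) = Pr(α,β,γ)/Pr(β,γ) = Pr(β,γ|α)Pr(α)/Pr(β,γ),

and, when β and γ are conditionally independent given α,

Pr(β,γ|α) = Pr(β|α)Pr(γ|α).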
Problem 5: Bayesian networks (AD Exercise 4.1)
Consider the Bayesian network B given above.
1. (2 points) Compute Pr(A = 0,B = 0) and Pr(E = 1|A = 1). Justify your answers.
2. (3 points) True or false? Why?
(a) d-sepB(A;E|{B,H})
(b) d-sepB(G;E|D)
(c) d-sepB({A,B};{G,H}|F)
Problem 6: Bayesian Networks and explaining away
You want to model the admission process of Farm University. Students are admitted based on their Creativity (C) and Intelligence (I). You decide to model them as continuous random variables, and your data suggests that both are uniformly distributed in [0,1], and are independent of each other. Formally I ∼ Uniform[0,1], C ∼ Uniform[0,1], C ⊥ I. Being very prestigious, the school only admits students such that C + I ≥ 1.5.
1. (1 point) What’s the expected Creativity score of a student?
2. (2 points) What’s the expected Creativity score of an admitted student?
3. (2 points) What’s the expected Creativity score of a student with I = 0.95 (a highly intelligent student)?
4. (2 points) What’s the expected Creativity score of an admitted student with I = 0.95? How does it compare to the expected Creativity score of an admitted student (computed in 2)? Hint: it might be helpful to think about the correlation between Creativity and Intelligence in the general student population and among admitted students.
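The problem wants closed-form answers, but a short Monte Carlo simulation is a useful sanity check for all four parts (a sketch; the tolerance used to condition on I ≈ 0.95 is an approximation):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
C = rng.uniform(0, 1, n)                 # Creativity ~ Uniform[0,1]
I = rng.uniform(0, 1, n)                 # Intelligence ~ Uniform[0,1], independent of C

admitted = C + I >= 1.5                  # the admission rule
smart = np.abs(I - 0.95) < 0.005         # crude conditioning on I = 0.95

print(C.mean())                          # part 1: E[C] = 0.5
print(C[admitted].mean())                # part 2: E[C | admitted]
print(C[smart].mean())                   # part 3: E[C | I = 0.95] (C, I independent)
print(C[smart & admitted].mean())        # part 4: E[C | admitted, I = 0.95]
```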
Problem 7: Bayesian networks (Exercise 3.11 from Koller-Friedman)
1. (8 points) Consider the Burglary Alarm network given above. Construct a Bayesian network over all the nodes except the Alarm that is a minimal I-map for the marginal distribution over the remaining variables (namely, over B, E, N, T, J, M). Be sure to capture all the dependencies from the original network.
2. (8 points) Generalize the procedure you used above to an arbitrary network. More precisely, assume we are given a network BN, an ordering X1, ..., Xn that is consistent with the ordering of the variables in BN, and a node Xi to be removed. Specify a network BN′ such that BN′ is consistent with this ordering, and such that BN′ is a minimal I-map of P_BN(X1, ..., Xi−1, Xi+1, ..., Xn). Your answer must be an explicit specification of the set of parents for each variable in BN′.
Problem 8: Towards inference in Bayesian networks
1. (4 points) Suppose you have a Bayes’ net over variables (X1,··· ,Xn) and all variables except Xi are observed. Using the chain rule and Bayes’ rule, find an efficient algorithm to compute P(xi|x1,··· ,xi−1,xi+1,··· ,xn). In particular, your algorithm should not require evaluation of the full joint distribution.
2. (4 points) Find an efficient algorithm to generate random samples from the probability distribution defined by a Bayesian network. You can assume access to a routine that generates random samples from any given multinomial distribution. Hint: Show that for any joint distribution P(X,Y ) you can sample by first drawing a sample x ∼ P(X) and then drawing a sample y ∼ P(Y |X = x).
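For part 2, the hint describes exactly ancestral (forward) sampling: sample each variable from its CPD once its parents have been sampled. A minimal sketch, with a hypothetical cpds interface mapping each variable to a function that returns its conditional distribution given the already-sampled parents:

```python
import numpy as np

def ancestral_sample(order, cpds, rng):
    """Draw one joint sample from a Bayesian network by ancestral sampling.

    order: the variables in a topological order of the DAG
    cpds:  cpds[v](sample) returns P(X_v | parents of X_v) as a probability
           vector, reading already-sampled parent values out of `sample`
    """
    sample = {}
    for v in order:                     # parents always precede children,
        probs = cpds[v](sample)         # so this lookup is well defined
        sample[v] = rng.choice(len(probs), p=probs)
    return sample

# Tiny example: X -> Y with binary variables.
cpds = {
    "X": lambda s: np.array([0.3, 0.7]),
    "Y": lambda s: np.array([0.9, 0.1]) if s["X"] == 0 else np.array([0.2, 0.8]),
}
print(ancestral_sample(["X", "Y"], cpds, np.random.default_rng(0)))
```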
Problem 9: Programming assignment
In this programming assignment, we will investigate the structure of the binarized MNIST dataset of handwritten digits using Bayesian networks. The dataset contains images of handwritten digits of dimension 28 × 28 (784 pixels). Consider the Bayesian network in Figure 1. The network contains two layers of variables. The variables in the bottom layer, X1:784, denote the pixel values of the flattened image and are referred to as manifest variables. The variables in the top layer, Z1 and Z2, are referred to as latent variables, because their values are not explicitly provided by the data and must be inferred.
Figure 1: Bayesian network for the MNIST dataset. X1:784 variables correspond to pixels in an image. Z1 and Z2 variables are latent.
The Bayesian network specifies a joint probability distribution p(Z1,Z2,X1:784) over binary images and latent variables. The model is trained so that the marginal probability of the manifest variables, p(x1:784) = ∑_{z1,z2} p(z1,z2,x1:784), is high on images that look like digits, and low for other images. We consider a model parameterized using neural networks, trained using stochastic gradient descent. Bayesian networks specified this way are popularly referred to as variational autoencoders, one of the most powerful classes of deep generative models in current use. We will return to the exact details of learning such models later in the course.
For this programming assignment, we provide a pretrained model, trained_mnist_model. The starter code pa1.py loads this model and provides functions to directly access the conditional probability tables. Further, we simplify the problem by discretizing the latent and manifest variables such that Val(Z1) = Val(Z2) = {−3,−2.75,...,2.75,3} and Val(Xj) = {0,1}, i.e., the image is binary.
1. (2 points) How many values can the random vector X1:784 take, i.e., how many different 28×28 binary images are there?
2. (2 points) How many parameters would you need to specify an arbitrary probability distribution over all possible 28 × 28 binary images?
3. (4 points) How many parameters do you need to specify the Bayesian network in Figure 1?
For parts 4-7 below, refer to pa1.py. The starter code contains some helper functions for solving these questions. Using them is not compulsory; you may use your own implementations instead. The helper functions are also not sufficient on their own, so feel free to introduce additional functions of your own where required.
4. (5 points) Produce 5 samples from the joint probability distribution (z1,z2,x1:784) ∼ p(Z1,Z2,X1:784), and plot the corresponding images (values of the pixel variables).
Hint: they should look like (binarized) handwritten digits. Imagine we could build such a model not for handwritten digits, but for Renaissance paintings. Each sample from the model would produce a new piece of art!
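A sketch of part 4 under the network structure in Figure 1. The p_z1, p_z2, and p_x_given_z placeholders below are stand-ins for the actual CPT accessors in pa1.py, whose interface is not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)
z_vals = np.arange(-3, 3.25, 0.25)            # Val(Z1) = Val(Z2): 25 values

# Placeholders -- replace with the CPTs exposed by pa1.py.
p_z1 = np.full(25, 1 / 25)                    # P(Z1)
p_z2 = np.full(25, 1 / 25)                    # P(Z2)
def p_x_given_z(z1, z2):                      # P(X_j = 1 | z1, z2) for all 784 pixels
    return np.full(784, 0.5)

for _ in range(5):
    z1 = rng.choice(z_vals, p=p_z1)           # ancestral sampling: roots first...
    z2 = rng.choice(z_vals, p=p_z2)
    x = (rng.uniform(size=784) < p_x_given_z(z1, z2)).astype(int)   # ...then children
    # x.reshape(28, 28) is the sampled binary image; view it with plt.imshow
```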
5. (5 points) For each possible value of
(z1,z2) ∈ {−3,−2.75,...,2.75,3} × {−3,−2.75,...,2.75,3},
compute the conditional expectation E[X1:784 | (Z1,Z2) = (z1,z2)]. This is the expected image corresponding to each possible value of the latent variables Z1,Z2. Plot the images on a 2D grid where the grid axes correspond to Z1 and Z2 respectively. What is the intuitive role of the Z1,Z2 variables in this model?
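Since each pixel is binary, E[Xj | (Z1,Z2) = (z1,z2)] = P(Xj = 1 | z1,z2), so the expected image is just the vector of conditional "on" probabilities. Reusing z_vals and the placeholder p_x_given_z from the sketch in part 4:

```python
# 25 x 25 grid of expected 28 x 28 images, one per (z1, z2) pair.
grid = [[p_x_given_z(z1, z2).reshape(28, 28) for z2 in z_vals] for z1 in z_vals]
```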
6. (10 points) In q6.mat, you are given a validation and a test dataset. In the test dataset, some images are “real” handwritten digits, and some are anomalous (corrupted images). We would like to use our Bayesian network to distinguish the real images from the anomalous ones. Intuitively, our Bayesian network should assign low probability to corrupted images and high probability to the real ones, and we can use this for classification. To do this, we first compute the average marginal log-likelihood over the validation dataset,
(1/|D_val|) ∑_{x ∈ D_val} log p(x1:784),
and its standard deviation (again, over the validation set). Consider a simple prediction rule: images whose marginal log-likelihood, log p(x1:784), falls outside three standard deviations of the average marginal log-likelihood are classified as corrupted. Classify the images in the test set as corrupted or real using this rule. Then plot a histogram of the marginal log-likelihood for the images classified as “real”. Plot a separate histogram of the marginal log-likelihood for the images classified as “corrupted”.
Hint: If you run into numerical underflow or overflow issues, search for the “log-sum-exp trick” online for help.
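The issue is that each p(z1,z2,x1:784) is tiny, so summing them in raw probability space underflows. The standard fix, sketched below, works in log space:

```python
import numpy as np

def log_sum_exp(a):
    """Numerically stable log(sum(exp(a))): shifting by the max makes the
    largest term exponentiate to exactly 1, so nothing overflows or
    underflows to 0."""
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

# log p(x1:784) = log_sum_exp over the 25 x 25 grid of log p(z1, z2, x1:784)
```

scipy.special.logsumexp implements the same idea if you prefer a library routine.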
7. (7 points) In q7.mat, you are given a labeled dataset of images of handwritten digits (the label corresponds to the digit identity). For each image Ik, compute the conditional probabilities p((Z1,Z2) = (z1,z2)|X1:784 = Ik). Use these probabilities to compute the conditional expectation
E[(Z1,Z2)|X1:784 = Ik].
Plot all the conditional expectations in a single plot, color-coding each point according to its label. What is the relationship with the figure you produced for part 5?
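A sketch of the posterior computation for a single image Ik, reusing z_vals and the log_sum_exp helper from the earlier sketches. Here log_joint is a placeholder for the 25 × 25 grid of log p(z1,z2,Ik) computed from the CPTs in pa1.py:

```python
import numpy as np

log_joint = np.zeros((25, 25))                   # placeholder: log p(z1, z2, I_k)
z_vals = np.arange(-3, 3.25, 0.25)

log_post = log_joint - log_sum_exp(log_joint)    # Bayes' rule: log p(z1, z2 | I_k)
post = np.exp(log_post)                          # normalized 25 x 25 posterior
e_z1 = post.sum(axis=1) @ z_vals                 # E[Z1 | X = I_k]
e_z2 = post.sum(axis=0) @ z_vals                 # E[Z2 | X = I_k]
# scatter-plot the (e_z1, e_z2) pairs, colored by digit label
```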