CS228 Probabilistic Graphical Models Homework 2-Solved

1.    (I-Maps and P-Maps) Consider a probability distribution P over the variables A, B, C and D that satisfies only the following independencies:

1. A ⊥ C | B.

2. A ⊥ C | B, D.

3. A ⊥ D | B.

4. A ⊥ D | B, C.

5. B ⊥ D.

6. B ⊥ D | A.

7. A ⊥ D.

(a)    Draw a Bayesian network that is a perfect map for P.

(b) Does this perfect map have any I-equivalent graphs? If so, draw them. If not, explain why not.

(c) Draw a Markov network that is a minimal I-map for P and explain why the Markov network is or is not also a perfect map.

2. (I-Maps and P-Maps) Consider a distribution P over four binary random variables (A, B, C, D) that assigns probability 1/8 to each of the assignments (0,0,0,0) and (1,1,0,0), and probability 1/4 to each of the assignments (1,1,1,0), (0,1,0,1), and (1,0,1,1). A skeleton for the Bayesian network G is also provided; the skeleton contains all the nodes and edges, but the directions of the edges are missing.

(a) Decide whether the following two independencies are in I(P): (C ⊥ D), (C ⊥ B).
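Since P assigns positive probability to only five assignments, both queries can be checked by direct enumeration. A quick sketch (purely illustrative; the helper names are made up):

```python
from itertools import product

# The five assignments (A, B, C, D) with nonzero probability.
P = {
    (0, 0, 0, 0): 1/8,
    (1, 1, 0, 0): 1/8,
    (1, 1, 1, 0): 1/4,
    (0, 1, 0, 1): 1/4,
    (1, 0, 1, 1): 1/4,
}

def marginal(indices, values):
    """Sum P over all assignments that match the given variable indices."""
    return sum(p for a, p in P.items()
               if all(a[i] == v for i, v in zip(indices, values)))

def independent(i, j):
    """Check marginal independence X_i ⊥ X_j via P(x_i, x_j) = P(x_i)P(x_j)."""
    return all(abs(marginal((i, j), (x, y))
                   - marginal((i,), (x,)) * marginal((j,), (y,))) < 1e-12
               for x, y in product((0, 1), repeat=2))

print(independent(2, 3))  # C ⊥ D: True for this P
print(independent(2, 1))  # C ⊥ B: False for this P
```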

(b) Give a direction to each individual edge in G, such that the resulting Bayesian network is a perfect map of P. Is your solution unique? Briefly state why.

(c)  Give the CPDs for each node in your Bayesian network specified in (b).


3. (Undirected Graphical Models, Inference) The Restricted Boltzmann Machine

Restricted Boltzmann machines (RBMs) have been widely used as generative models in many fields of machine learning: image processing, speech recognition, and collaborative filtering. Concretely, an RBM is an undirected graphical model defined on variables (v, h), v ∈ {0,1}^m and h ∈ {0,1}^n. The joint probability is given by

P(v, h) = (1/Z) exp(φ(v, h)),

where

φ(v, h) = −α^T v − β^T h − v^T W h

is a potential function. Here α ∈ R^m, β ∈ R^n, W ∈ R^{m×n}, and Z is the normalizing constant. You can interpret the model as a fully connected bipartite network with two layers: one for the visible variables v and one for the hidden variables h. In this problem, you will explore inference and learning for RBMs.

(a) What is the conditional distribution P(h_i | v) of a single hidden variable h_i? Can it be computed tractably?

(b)   What is the conditional distribution P(h | v)? Can it be expressed in a compact (factored) form?
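For intuition on parts (a) and (b): with the sign convention φ(v, h) = −α^T v − β^T h − v^T W h, the conditional P(h | v) factorizes over the hidden units, with P(h_i = 1 | v) = σ(−β_i − (v^T W)_i). A small numerical sketch (random toy parameters, not from the problem) checking this against brute-force enumeration:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
m, n = 4, 3
alpha, beta = rng.normal(size=m), rng.normal(size=n)
W = rng.normal(size=(m, n))

def phi(v, h):
    # Potential with the sign convention from the problem statement.
    return -alpha @ v - beta @ h - v @ W @ h

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v = np.array([1, 0, 1, 1])

# Factored conditional: P(h_i = 1 | v) = sigmoid(-beta_i - (v^T W)_i).
p_factored = sigmoid(-beta - v @ W)

# Brute-force check: enumerate all 2^n hidden configurations.
hs = [np.array(h) for h in product((0, 1), repeat=n)]
weights = np.array([np.exp(phi(v, h)) for h in hs])
p_brute = sum(w * h for w, h in zip(weights, hs)) / weights.sum()

assert np.allclose(p_factored, p_brute)
```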

(c) Suppose we are given samples of the visible units D = {v^1, ..., v^K} (e.g., the MNIST digits dataset), and we want to learn the parameters of the RBM to maximize the likelihood of these samples. Since we don't know the values of the hidden variables h^k, a natural approach is to look for parameters that maximize the marginal likelihood of each sample v^k, given by

P(v^k) = Σ_h P(v^k, h) = (1/Z) Σ_h exp(φ(v^k, h))

For a fixed v, can

Σ_h exp(φ(v, h))

be computed efficiently (in time not exponential in n)?
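For part (c), note that once v is fixed the sum over h factorizes across hidden units: Σ_h exp(φ(v, h)) = exp(−α^T v) Π_i (1 + exp(−β_i − (v^T W)_i)). A toy numerical check of this identity under the sign convention above (random parameters, for illustration only):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
m, n = 4, 3
alpha, beta = rng.normal(size=m), rng.normal(size=n)
W = rng.normal(size=(m, n))
v = np.array([1, 0, 1, 1])

def phi(v, h):
    return -alpha @ v - beta @ h - v @ W @ h

# Brute force: 2^n terms.
brute = sum(np.exp(phi(v, np.array(h))) for h in product((0, 1), repeat=n))

# Factored form: exp(-alpha^T v) * prod_i (1 + exp(-beta_i - (v^T W)_i)),
# which takes O(mn) time instead of O(2^n).
factored = np.exp(-alpha @ v) * np.prod(1.0 + np.exp(-beta - v @ W))

assert np.isclose(brute, factored)
```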

(d) For a fixed h, can

Σ_v exp(φ(v, h))

be computed efficiently (in time not exponential in m)?

(e) Can the normalization constant

Z = Σ_h Σ_v exp(φ(v, h))

be computed efficiently (in time not exponential in m or n)?


4. (Maximum Likelihood Learning of Bayes Nets) Let G be a valid Bayes net structure, and D be some training data. It can be shown that the log-likelihood of D for a given choice of the parameters θ (specifying the conditional probability tables in G) is

ℓ_G(θ; D) = Σ_i Σ_{x_i, u_i} M[x_i, u_i] log θ_{x_i | u_i}

We have seen in class that the maximum likelihood estimate θ^ML = argmax_θ ℓ_G(θ; D) of the parameters of G can be computed as follows:

θ^ML_{x_i | u_i} = M[x_i, u_i] / Σ_{x'_i} M[x'_i, u_i]

where M[x_i, u_i] is the number of times the tuple (X_i = x_i, Pa(X_i) = u_i) appears in the data D. Let G′ be a valid Bayes net structure obtained by adding an edge to G. Show that

max_{θ′} ℓ_{G′}(θ′; D) ≥ max_θ ℓ_G(θ; D)

Note that G′ is strictly more expressive than G (G′ makes fewer conditional independence assumptions). This is reflected in the fact that specifying a Bayes net with structure G′ requires more parameters, i.e., the vector θ′ has more components than θ. The goal of this exercise is to show directly that the more "complex" a Bayesian network is, the better we can fit it to the data.

You may assume that G and G′ are identical, except for an extra X_1 → X_n edge in G′.

Hint: There are several ways to show this. Jensen's inequality might come in handy.
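One route (a direct construction rather than the Jensen's-inequality route; the problem notes several approaches work) is to exhibit parameters for G′ that achieve the old maximum:

```latex
% In G', X_n gains the extra parent X_1: Pa'(X_n) = Pa(X_n) \cup \{X_1\}.
% Define \theta' from the maximizer \theta^{ML} of \ell_G by ignoring X_1:
\theta'_{x_n \mid u_n, x_1} \;=\; \theta^{ML}_{x_n \mid u_n}
  \quad \text{for every value of } x_1,
% with all other CPDs copied unchanged. Every factor of the likelihood is
% then identical on every data point, so
\ell_{G'}(\theta'; D) \;=\; \ell_G(\theta^{ML}; D)
  \;\le\; \max_{\theta'} \ell_{G'}(\theta'; D).
```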

5. (Programming Assignment) Tree Augmented Naive Bayes

Naive Bayes (NB) classifiers are often competitive even though their strong independence assumptions may be unrealistic. We will explore an extension of this classification framework, Tree Augmented Naive Bayes (TANB), which aims to improve on NB by modelling correlations between attributes while remaining computationally tractable. If C denotes the class variable and (A_1, ..., A_N) the attributes, then an NB model can be represented as a directed graph with these variables as nodes and edges {(C, A_i) : 1 ≤ i ≤ N}. A TANB extends this graph structure by allowing each child A_i to have one additional node A_j as a parent, with the restriction that these additional edges must form a tree. Figure 1 illustrates the difference in the graph structure.

Figure 1: A Naive Bayes model and Tree Augmented Naive Bayes model

In general, learning the structure of this graph can be hard, but we will cover this topic later in the course. For now, we have supplied the starter code necessary to learn the structure of the graph for the dataset.
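To make the scoring concrete, here is a minimal sketch of how a TANB could evaluate the log joint probability of a class label and an attribute vector. The data layout (`p_c`, `cpts`, `tree_parent`) is hypothetical, purely for illustration; the starter code's representation will differ.

```python
import numpy as np

def tanb_log_score(c, attrs, p_c, cpts, tree_parent):
    """Log joint score log P(C = c, A = attrs) for a TANB.

    p_c[c] is P(C = c); cpts[i][(c, parent_value)] is the distribution of
    A_i given C = c and its tree parent's value; tree_parent[i] is A_i's
    extra parent index, or None for the root of the attribute tree.
    (Hypothetical data layout, for illustration only.)
    """
    score = np.log(p_c[c])
    for i, a in enumerate(attrs):
        pa = tree_parent[i]
        pv = attrs[pa] if pa is not None else 0  # root: dummy parent value
        score += np.log(cpts[i][(c, pv)][a])
    return score

# Tiny demo: two attributes, A1 is the tree root and the parent of A2.
p_c = [0.5, 0.5]
cpts = {0: {(0, 0): [0.7, 0.3], (1, 0): [0.4, 0.6]},
        1: {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
            (1, 0): [0.5, 0.5], (1, 1): [0.3, 0.7]}}
s = tanb_log_score(0, (1, 1), p_c, cpts, [None, 0])  # log(0.5 * 0.3 * 0.8)
```

Dropping the tree parent (scoring each A_i given C alone) recovers the plain NB score, which is the trade-off the problem asks you to explore.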

In this problem, you will implement parameter learning for both NB and TANB classifiers and explore the trade-offs between them. You will apply these classifiers to predict the party affiliation, Democrat or Republican, of US Congressmen (the class variable) based on their votes on 16 different measures (the attribute variables) shown in Table 1. Not all Congressmen voted on all 16 measures, so some entries in this dataset have missing attributes; however, we will still be able to use our Bayes network to accurately classify these examples. To keep things simple, the class and attribute variables are all binary, with 0 and 1 corresponding to a no and yes vote, respectively.

When training the models, some of the parameters may not have enough examples for accurate estimation. To mitigate this, we will perform Laplace smoothing with a pseudocount of α = 0.1 for every value of an attribute given its parents when learning the parameters.
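As a small illustration of the smoothing rule above, each count gets the pseudocount α = 0.1 added before normalizing. A minimal sketch (the function name `smoothed_cpd` is hypothetical, not part of the starter code):

```python
import numpy as np

def smoothed_cpd(counts, alpha=0.1):
    """Laplace-smoothed estimate of P(X = x | parents) from raw counts.

    counts: count of each value of X under one setting of its parents.
    Adds pseudocount alpha to every value, then normalizes.
    """
    counts = np.asarray(counts, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha * len(counts))

# E.g. 1 "no" vote and 7 "yes" votes for an attribute under one class value:
print(smoothed_cpd([1, 7]))  # (1.1/8.2, 7.1/8.2)
```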

In order to evaluate the performance of our classifiers on the dataset, we will use 10-fold cross-validation. Under 10-fold cross-validation, the dataset is first partitioned into 10 equally sized partitions. One partition is held out as the test set while the rest of the data is used as the training set, and the test error is computed on the held-out partition. This process is repeated for the other nine partitions, and the resulting test errors are averaged to obtain the 10-fold CV test error. We have implemented this procedure for you in the function evaluate.
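The procedure can be sketched as follows. This is a generic illustration, not the starter code's evaluate function, and the `train_fn`/`error_fn` signatures are assumptions:

```python
import numpy as np

def kfold_indices(n_examples, k=10, seed=0):
    """Shuffle example indices and split them into k roughly equal folds."""
    idx = np.random.default_rng(seed).permutation(n_examples)
    return np.array_split(idx, k)

def cross_validate(train_fn, error_fn, X, y, k=10):
    """Average held-out error over k folds.

    train_fn(X, y) -> model and error_fn(model, X, y) -> error rate are
    hypothetical signatures; the starter code's evaluate may differ.
    """
    folds = kfold_indices(len(X), k)
    errors = []
    for i, test_idx in enumerate(folds):
        # Train on all folds except fold i, test on fold i.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = train_fn(X[train_idx], y[train_idx])
        errors.append(error_fn(model, X[test_idx], y[test_idx]))
    return float(np.mean(errors))
```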

(a) Implement an NB classifier that both learns the parameters from the training data and can use these parameters to score and classify examples. What is your test error rate using 10-fold cross-validation?

Note: you can use the evaluate function in the starter code, but leave the optional argument train_subset at its default value until part (d).

(b) We have provided starter code that implements the Chow-Liu algorithm (see Friedman et al. 1997) to learn the tree structure of a TANB from the training data. Using this tree, implement a TANB classifier that learns the parameters from the training data and uses these to score examples. What is your test error rate using 10-fold cross-validation?

Attribute   Name                                        Incomplete Entry 1
A1          handicapped infants                         1
A2          water project cost sharing                  1
A3          adoption of the budget resolution           0
A4          physician fee freeze                        0
A5          El Salvador aid                             0
A6          religious groups in schools                 0
A7          anti satellite test ban                     0
A8          aid to Nicaraguan Contras                   ?
A9          mx missile                                  ?
A10         immigration                                 0
A11         synfuels corporation cutback                0
A12         education spending                          ?
A13         superfund right to sue                      ?
A14         crime                                       ?
A15         duty free exports                           0
A16         export administration act south Africa      0

Table 1: Attribute names for Congressional Voting Records together with an incomplete example that has some voting records missing for a particular Congressman.


(c) In general, working with data where the values of attributes and labels are missing makes learning model parameters difficult. However, we can still use the generative model from a fully trained Bayes network to classify examples in which some of the attributes are unobserved. Suppose A_i is unobserved. We can still compute P(C | A_1, ..., A_{i−1}, A_{i+1}, ..., A_N) by computing P(C, A_i | A_1, ..., A_{i−1}, A_{i+1}, ..., A_N) and marginalizing over A_i. Update your TANB implementation to handle the case where some attributes have missing values, and use this new implementation to classify Incomplete Entry 1 in Table 1[1]. What is the marginal probability that this Congressman is a Democrat (C = 1), given the votes we did observe? Can you predict how this Congressman voted on education spending (A12)?
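The marginalization step in part (c) can be illustrated on a toy model with made-up CPTs (not the congressional model): with A2 unobserved and A1 observed, sum the joint over A2's values and renormalize.

```python
import numpy as np

# Toy model with made-up CPTs (NOT the congressional data): C is the class,
# A2 is A1's tree parent, A2 is unobserved, and A1 = 1 is observed.
p_c = np.array([0.4, 0.6])                       # P(C)
p_a2_c = np.array([[0.5, 0.5],                   # P(A2 | C = 0)
                   [0.1, 0.9]])                  # P(A2 | C = 1)
p_a1 = np.array([[[0.9, 0.1], [0.6, 0.4]],       # P(A1 | C = 0, A2 = 0/1)
                 [[0.2, 0.8], [0.3, 0.7]]])      # P(A1 | C = 1, A2 = 0/1)

a1 = 1  # observed vote; A2 is missing

# P(C | A1) ∝ Σ_{a2} P(C) P(a2 | C) P(a1 | C, a2): marginalize out A2.
joint = np.array([sum(p_c[c] * p_a2_c[c, a2] * p_a1[c, a2, a1]
                      for a2 in (0, 1)) for c in (0, 1)])
posterior = joint / joint.sum()  # P(C | A1 = 1)

# The same trick answers queries about the missing vote itself:
p_a2_post = np.array([sum(p_c[c] * p_a2_c[c, a2] * p_a1[c, a2, a1]
                          for c in (0, 1)) for a2 in (0, 1)])
p_a2_post = p_a2_post / p_a2_post.sum()  # P(A2 | A1 = 1)
```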

Note the power of a generative model: it can easily handle missing data and can be used to answer all sorts of probabilistic queries. In contrast, this would not be possible with a discriminative model, e.g., if you trained a logistic regression or a random forest classifier to directly predict the affiliation of a Congressman (C), because these models would only provide you with the conditional distribution P(C|A1,··· ,AN).

Note: You should train your classifier on the full dataset. You can use the function evaluate_incomplete_entry, which both trains on the full dataset and loads Incomplete Entry 1 for classification.


(d) What is the test error when you train both classifiers on a smaller subset of the training data? Explain why the test error of the TANB may not be strictly better than that of the NB.

Note: set the argument train_subset=True when calling evaluate so that the classifiers are trained on a much smaller subset of the data.


[1] Your implementation does not have to be fully general. It’s fine as long as it works on that particular example.


