Machine Learning Course lab 6-Solved
1 Support Vector Machines using SGD
Until now we have implemented linear and logistic regression to do classification. In this exercise we will use the Support Vector Machine (SVM) for classification instead. As we have seen in the lecture notes, the original (primal) optimization problem for the SVM is given by

min_w  (1/N) ∑_{n=1}^{N} ℓ(y_n x_n^⊤ w) + (λ/2) ‖w‖²    (1)
where ℓ : R → R, ℓ(z) := max{0, 1 − z} is the hinge loss function. Here, for any n with 1 ≤ n ≤ N, the vector x_n ∈ R^D is the n-th data example, and y_n ∈ {±1} is the corresponding label.
Problem 1 (SGD for SVM):
Implement stochastic gradient descent (SGD) for the original SVM formulation (1). That is, in every iteration, pick one data example n ∈ [N] uniformly at random, and perform an update on w based on the (sub-)gradient of the n-th summand of the objective (1). Then iterate by picking the next n.
1. Fill in the notebook functions calculate_accuracy(y, X, w), which computes the accuracy on the training/test dataset for any w, and calculate_primal_objective(y, X, w, lambda_), which computes the total primal objective (1).
2. Derive the SGD updates for the original SVM formulation and fill in the notebook function calculate_stochastic_gradient(), which should return the stochastic gradient of the total cost function (loss plus regularizer) with respect to w. Finally, use sgd_for_svm_demo(), provided in the template, for training (a sketch of these functions is given after this list).
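As a reference point, here is a minimal NumPy sketch of what these pieces could look like. The function names follow the ones mentioned above, but the exact signatures, the small sgd_for_svm driver (as opposed to the template's sgd_for_svm_demo()), and the constant step size are assumptions made for this sketch, not the template's actual interface.

```python
import numpy as np

def calculate_primal_objective(y, X, w, lambda_):
    """Total primal objective (1): average hinge loss plus the L2 regularizer."""
    hinge = np.clip(1.0 - y * (X @ w), 0.0, None)           # max{0, 1 - y_n x_n^T w}
    return np.mean(hinge) + lambda_ / 2.0 * np.sum(w ** 2)

def calculate_accuracy(y, X, w):
    """Fraction of examples whose predicted sign of x_n^T w matches the label y_n in {-1, +1}."""
    predictions = np.where(X @ w >= 0, 1, -1)
    return np.mean(predictions == y)

def calculate_stochastic_gradient(y, X, w, lambda_, n):
    """Sub-gradient of the objective (1) at w, using only the n-th data example."""
    x_n, y_n = X[n], y[n]
    grad_loss = -y_n * x_n if y_n * (x_n @ w) < 1 else np.zeros_like(w)
    return grad_loss + lambda_ * w

def sgd_for_svm(y, X, lambda_=0.01, gamma=1e-3, max_iter=100000):
    """Plain SGD on (1): sample one example per step and take a sub-gradient step."""
    num_examples, num_features = X.shape
    w = np.zeros(num_features)
    rng = np.random.default_rng(0)
    for _ in range(max_iter):
        n = rng.integers(num_examples)
        w = w - gamma * calculate_stochastic_gradient(y, X, w, lambda_, n)
    return w
```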
2 Support Vector Machines using Coordinate Descent
As seen in class, another approach to train SVMs is by considering the dual optimization problem given by
max_{α ∈ R^N}  (1/N) α^⊤ 1 − (1/(2λN²)) α^⊤ Y X X^⊤ Y α   such that 0 ≤ α_n ≤ 1 ∀n ∈ [N]    (2)

where 1 is the vector of size N filled with ones, Y := diag(y), and X ∈ R^{N×D} again collects all N data examples as its rows, as usual. In this approach we optimize over the dual variables α and map the solution back to the primal vector w via the mapping we have seen in class: w(α) = (1/(λN)) X^⊤ Y α.
Problem 2 (Coordinate Descent for SVM):
In this part we will derive the coordinate descent (or rather ascent, in our case) algorithm seen in class to solve the dual (2) of the SVM formulation. That is, at every iteration we pick a coordinate n ∈ [N] uniformly at random, and fully optimize the objective (2) with respect to that coordinate alone. Hence one step of coordinate ascent corresponds to solving, for a given coordinate n and our current vector α ∈ [0,1]^N, the one-dimensional problem

max_γ  f(α + γ e_n)   such that 0 ≤ α_n + γ ≤ 1    (3)

where f(α) = (1/N) α^⊤ 1 − (1/(2λN²)) α^⊤ Y X X^⊤ Y α and e_n = [0, ···, 1, ···, 0]^⊤ (the all-zero vector except for a 1 at the n-th position). We denote by γ* the value of γ which maximises problem (3). The coordinate update is then α_new = α + γ* e_n.
1. Solve problem (3) and give a closed-form solution for the update α_new = α + γ* e_n. This update should only involve α, λ, N, x_n, y_n and w(α). (Hint: notice that γ ↦ f(α + γ e_n) is a polynomial in γ, and don't forget the constraints!) Convince yourself that this update can be computed in O(D) time.
2. Find an efficient way of updating the corresponding w(α_new). This should also be computable in O(D) time.
3. Fill in the notebook functions calculate_coordinate_update(), which should compute the coordinate update for a single desired coordinate, and calculate_dual_objective(), which should return the objective (loss) for the dual problem (2). A sketch of both is given after this list.
4. Finally, train your model using coordinate descent (here, ascent) with the corresponding demo function given in the template. Compare to your SGD implementation. Which one is faster? (Compare the training objective values (1) for the w iterates you obtain from each method.) Is the gap going to 0?
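Here is a minimal sketch of the two notebook functions, assuming the closed-form solution of problem (3) from part 1 and the invariant w = (1/(λN)) X^⊤ Y α maintained throughout; the actual template signatures may differ.

```python
import numpy as np

def calculate_coordinate_update(y, X, lambda_, alpha, w, n):
    """One coordinate ascent step on the dual (2) for coordinate n.
    Assumes w == (1/(lambda_*N)) X^T Y alpha on entry, and restores that invariant on exit."""
    num_examples = X.shape[0]
    x_n, y_n = X[n], y[n]
    old_alpha_n = alpha[n]

    # Unconstrained maximiser of gamma -> f(alpha + gamma * e_n), a concave quadratic in gamma.
    gamma = lambda_ * num_examples * (1.0 - y_n * (x_n @ w)) / (x_n @ x_n)

    # Project so that the box constraint 0 <= alpha_n + gamma <= 1 holds.
    gamma = np.clip(gamma, -old_alpha_n, 1.0 - old_alpha_n)

    alpha[n] = old_alpha_n + gamma
    # O(D) update of w(alpha): only the n-th dual coordinate changed.
    w = w + gamma * y_n * x_n / (lambda_ * num_examples)
    return w, alpha

def calculate_dual_objective(y, X, w, alpha, lambda_):
    """Dual objective (2), using that the quadratic term equals (lambda/2) ||w(alpha)||^2."""
    return np.mean(alpha) - lambda_ / 2.0 * np.sum(w ** 2)
```

With these two pieces, a coordinate ascent loop simply samples a coordinate n uniformly at random and calls calculate_coordinate_update repeatedly; tracking both (1) and (2) along the iterations lets you monitor the gap between the two objectives.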
Additional Theory Exercises
Convexity
Recall that we say that a function f is convex if the domain of f is a convex set and
f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y), for all x,y in the domain of f, 0 ≤ θ ≤ 1.
and strictly convex if
f(θx + (1 − θ)y) < θf(x) + (1 − θ)f(y), for all x ≠ y in the domain of f, 0 < θ < 1.
Prove the following assertions.
1. The affine function f(x) = ax + b is convex, where a, b and x are scalars (a short worked sketch for this item is given after the list).
2. If multiple functions f_n(x) are convex over a fixed domain, then their sum g(x) = ∑_n f_n(x) is convex over the same domain.
3. Take f,g : R → R to be convex functions and g to be increasing. Then the function g ◦ f defined as (g ◦ f)(x) = g(f(x)) is also convex.
Note: A function g is increasing if a ≥ b ⇒ g(a) ≥ g(b). An example of a convex and increasing function is exp(x), x ∈ R.
4. If f : R → R is convex, then g : R^D → R, where g(x) := f(w^⊤ x + b), is also convex. Here, w is a constant vector in R^D, b is a constant in R, and x ∈ R^D.
5. Let f : R^D → R be strictly convex. Let x⋆ ∈ R^D be a global minimizer of f. Show that this global minimizer is unique. Hint: do a proof by contradiction.
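For illustration, here is a short worked sketch of item 1, spelling out how the definition above is applied; the remaining items follow the same pattern of plugging into the defining inequality.

```latex
% Item 1 (sketch): for f(x) = ax + b, any scalars x, y, and 0 <= \theta <= 1,
\begin{align*}
f(\theta x + (1-\theta) y) &= a\big(\theta x + (1-\theta) y\big) + b \\
                           &= \theta (a x + b) + (1-\theta)(a y + b) \\
                           &= \theta f(x) + (1-\theta) f(y),
\end{align*}
% so the defining inequality holds (with equality), and f is convex.
% The inequality is never strict, so an affine function is not strictly convex.
```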
Extension of Logistic Regression to Multi-Class Classification
Suppose we have a classification dataset with N data example pairs {x_n, y_n}, n ∈ [1, N], and y_n is a categorical variable over K categories, y_n ∈ {1, 2, ..., K}. We wish to fit a linear model in a similar spirit to logistic regression, and we will use the softmax function to link the linear inputs to the categorical output, instead of the logistic function.
We will have K sets of parameters w_k, and define the probability distribution of the output as follows:

P[y_n = k | x_n, w_1, ..., w_K] = exp(x_n^⊤ w_k) / ∑_{j=1}^{K} exp(x_n^⊤ w_j).

Note that we indeed have a probability distribution, as ∑_{k=1}^{K} P[y_n = k | x_n, w_1, ..., w_K] = 1. To make the model identifiable, we will fix w_K to 0, which means we have K − 1 sets of parameters to learn. As in logistic regression, we will assume that each y_n is i.i.d., i.e.,

P[y | X, w_1, ..., w_K] = ∏_{n=1}^{N} P[y_n | x_n, w_1, ..., w_K].
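To make the model concrete, here is a small NumPy sketch of the softmax probabilities and the resulting log-likelihood. The helper names (softmax_probabilities, log_likelihood) and the conventions (labels encoded as 0, ..., K−1; parameters stacked as a D × K matrix W whose last column is kept at zero) are illustrative assumptions, not part of the exercise template.

```python
import numpy as np

def softmax_probabilities(X, W):
    """Return the N x K matrix of P[y_n = k | x_n] under the softmax model.
    W is D x K; keeping its last column at zero fixes w_K = 0 for identifiability."""
    scores = X @ W                                   # entry (n, k) is x_n^T w_k
    scores -= scores.max(axis=1, keepdims=True)      # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def log_likelihood(y, X, W):
    """Log of the i.i.d. likelihood P[y | X, w_1, ..., w_K], with labels y in {0, ..., K-1}."""
    probs = softmax_probabilities(X, W)
    return np.sum(np.log(probs[np.arange(X.shape[0]), y]))
```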
1. Derive the log-likelihood for this model.
Hint: It might be helpful to use the indicator function 1_{y_n = k}, which is equal to 1 if y_n = k and 0 otherwise.
2. Derive the gradient with respect to each w_k.
3. Show that the negative of the log-likelihood is jointly convex in w_1, ..., w_K. Hint: you can use Hölder's inequality,

∑_{i=1}^{n} |a_i b_i| ≤ (∑_{i=1}^{n} |a_i|^p)^{1/p} (∑_{i=1}^{n} |b_i|^q)^{1/q},

where p, q > 1 and 1/p + 1/q = 1.
Mixture of Linear Regression
If you have a regression dataset with two or more distinct clusters, a mixture of linear regression models is preferred over just one linear regression model.
Consider a regression dataset with N pairs {y_n, x_n}. Similar to the Gaussian mixture model (GMM), let r_n ∈ {1, 2, ..., K} index the mixture component. The distribution of the output y_n under the k-th linear model is defined as follows:

p(y_n | x_n, r_n = k, w) := N(y_n | w_k^⊤ x̃_n, σ²)

Here, w_k is the regression parameter vector for the k-th model, with w being a vector containing all w_k. Also, denote x̃_n := [1, x_n^⊤]^⊤.
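As a small illustration of this density, here is a hypothetical NumPy helper that evaluates p(y_n | x_n, r_n = k, w); the name and signature are assumptions made for this sketch only.

```python
import numpy as np

def component_likelihood(y_n, x_n, w_k, sigma2):
    """p(y_n | x_n, r_n = k, w): Gaussian density N(y_n | w_k^T x_tilde_n, sigma^2),
    where x_tilde_n prepends a constant 1 to x_n to account for the intercept."""
    x_tilde_n = np.concatenate(([1.0], x_n))
    mean = w_k @ x_tilde_n
    return np.exp(-(y_n - mean) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)
```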
1. Define r_n to be a binary vector of length K such that all the entries are 0 except the k-th entry, i.e., r_{nk} = 1, implying that x_n is assigned to the k-th mixture. Rewrite the likelihood p(y_n | x_n, w, r_n) in terms of r_{nk}.
2. Write the expression for the joint distribution p(y | X, w, r), where r is the set of all r_1, r_2, ..., r_N.
3. Assume that r_n follows a multinomial distribution p(r_n = k | π) = π_k, with π = [π_1, π_2, ..., π_K]. Derive the marginal distribution p(y_n | x_n, w, π) obtained after marginalizing r_n out.
4. Write the expression for the cost minimized by the maximum likelihood estimator, L(w, π) := −log p(y | X, w, π), in terms of the data y and X and the parameters w and π.
5. (a) Is L jointly convex with respect to w and π?
(b) Is the model identifiable? That is, can the MLE return the true parameters w and π, assuming we have infinitely many samples? Prove your answers.