Q1
- What quantity is minimised by “linear regression”?
The sum of squared errors (equivalently, up to a factor of $1/n$, the mean squared error): the squared $L^2$ distance between the vectors of predicted and true response values.
$$
\sum_{i = 1}^n (y_i - \hat{y}_i)^2
$$
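A minimal numpy sketch of this (toy data and names of my own choosing), using `np.linalg.lstsq` to find the coefficients minimising the sum of squared errors:

```python
import numpy as np

# Toy data: n = 5 samples of one feature, with y roughly 2x + 1.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# Prepend a column of ones so an intercept is fitted too.
X1 = np.hstack([np.ones((X.shape[0], 1)), X])

# lstsq returns the w minimising sum_i (y_i - x_i . w)^2.
w, _, _, _ = np.linalg.lstsq(X1, y, rcond=None)

y_hat = X1 @ w
sse = np.sum((y - y_hat) ** 2)
print(w, sse)
```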
- How does the linear “support vector machine” make non-linear predictions?
By making linear predictions in an expanded feature space formed by a non-linear transform $\phi$ of the original feature vectors.
Since both training an SVM and making predictions with it depend on the input vectors only through inner products, the non-linear transformation need not be computed explicitly. Instead, the inner product between transformed vectors can be evaluated via a kernel function (the “kernel trick”).
$$
\mathbf{x} = (x_1, \dots, x_d) \mapsto \phi(\mathbf{x}) \mapsto \text{prediction}(\mathbf{x}) = \mathrm{SVM}(\phi(\mathbf{x}))
$$
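A minimal scikit-learn sketch of this (assuming scikit-learn is available; data and names are illustrative). The RBF kernel $k(\mathbf{x}, \mathbf{z}) = \exp(-\gamma \Vert \mathbf{x} - \mathbf{z} \Vert^2)$ stands in for the inner product in the transformed space, so $\phi$ is never computed explicitly:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Non-linearly separable toy data: label +1 inside the unit circle.
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where((X ** 2).sum(axis=1) < 1, 1, -1)

# SVC with an RBF kernel only ever evaluates k(x_i, x_j) between
# pairs of inputs; phi(x) itself is never materialised.
clf = SVC(kernel="rbf", gamma=1.0).fit(X, y)
print(clf.score(X, y))  # near 1.0 despite the non-linear boundary
```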
- Example of a dataset that a perceptron cannot classify.
Famously, a perceptron cannot compute the XOR function. So, it cannot classify the following dataset with binary labels: $\{((0, 0), +), ((1, 1), +), ((0, 1), -), ((1, 0), -)\}$
More generally, a perceptron can only form a linear decision boundary and hence cannot classify datasets that are not linearly separable. For example, a dataset with binary class labels consisting of two “rings”, with the positive data points forming an inner ring enclosed by a larger ring of negative data points. The sketch below shows the XOR failure concretely.
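A sketch of the XOR failure using scikit-learn's `Perceptron` (assumed available; the setup is illustrative):

```python
import numpy as np
from sklearn.linear_model import Perceptron

# The XOR dataset from above: no line separates + from -.
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])
y = np.array([1, 1, -1, -1])

clf = Perceptron(max_iter=1000).fit(X, y)
# Any linear boundary misclassifies at least one point, so
# accuracy can never exceed 0.75 on this dataset.
print(clf.score(X, y))
```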
- Similarity of Gaussian mixture models and K-means clustering?
Both are unsupervised learning methods for clustering data. Both can be “fitted” or trained using iterative methods, and indeed both can be trained by an instance of the Expectation-Maximisation (EM) method (K-means makes hard cluster assignments, a Gaussian mixture soft, probabilistic ones). Both require a hyperparameter $k$ specifying the number of clusters to output.
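A sketch of the parallel scikit-learn APIs (assumed available; the blob data is illustrative), both fitted iteratively with $k$ as a hyperparameter:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated blobs of points.
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

k = 2
km = KMeans(n_clusters=k, n_init=10).fit(X)   # Lloyd's algorithm (hard EM)
gmm = GaussianMixture(n_components=k).fit(X)  # full EM (soft assignments)

print(km.labels_[:5])            # hard cluster assignments
print(gmm.predict_proba(X[:5]))  # soft, probabilistic assignments
```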
- Why split on “information gain” in a decision tree?
Let’s focus on one node in the tree. Splitting the node based on the splitting criterion (a choice of categorical feature, or a choice of numerical feature together with a threshold) that has the highest information gain maximises the gain in purity of the child nodes. This maximises the expected improvement over the parent node in classification accuracy, the classification rule being to predict the most frequent label in the node, as made precise below.
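In symbols, for a split of a parent node into children $c$, each receiving $n_c$ of the parent's $n$ samples, the information gain is the drop in entropy $H$:
$$
\mathrm{IG} = H(\text{parent}) - \sum_{c} \frac{n_c}{n} H(c)
$$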
- Number of edges of a $k$-clique with $k = 4$.
There are 4 nodes, each connected to every node except itself. So the number of edges equals the number of distinct pairs of nodes: $\binom{4}{2} = 4 \times 3 / 2 = 6$ edges. In general, a $k$-clique has $\binom{k}{2} = k(k - 1)/2$ edges.
Q3
- Linear regression vs Logistic regression in terms of how they make predictions.
During training, both models find an optimal set of parameters $\hat{\mathbf{w}} = (\hat{w}_1, \dots, \hat{w}_d)$ according to some loss function. When making a prediction on a test instance $\mathbf{x} = (x_1, \dots, x_d)$, linear regression outputs $\hat{y} = \hat{w}_1 x_1 + \dots + \hat{w}_d x_d$ directly, whereas logistic regression outputs a probability $1 / (1 + e^{-\hat{y}})$, with the same expression for $\hat{y}$, that the instance is a positive instance.
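A small numpy sketch of the two prediction rules (toy parameter values of my own choosing), sharing the same linear score $\hat{y}$:

```python
import numpy as np

w_hat = np.array([0.5, -1.2, 2.0])  # toy fitted parameters (no intercept)
x = np.array([1.0, 0.3, 0.7])       # a test instance

y_hat = w_hat @ x                      # linear regression output: y_hat itself
p_pos = 1.0 / (1.0 + np.exp(-y_hat))  # logistic regression: P(positive | x)
print(y_hat, p_pos)
```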
- Why does logistic regression require gradient descent while linear regression does not?
The optimal parameters for linear regression can be expressed in closed form (solved analytically) in terms of the training samples, as shown below. That is not the case for logistic regression, hence the need for an iterative optimisation method like gradient descent.
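Concretely, with design matrix $X$ and response vector $\mathbf{y}$, the closed-form minimiser of the sum of squared errors is the normal-equation solution (assuming $X^\top X$ is invertible):
$$
\hat{\mathbf{w}} = (X^\top X)^{-1} X^\top \mathbf{y}
$$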
- How do the objective functions for training a linear SVM (soft-margin) and ridge regression differ? How are they alike?
Both objective functions consist of a sum of two terms:
- a prediction loss term $\sum_{i = 1}^n \ell(y_i, \hat{y}_i)$ with $\hat{y}_i = w_0 + w_1 x_{i,1} + \dots + w_d x_{i,d}$
- a regularisation term proportional to the squared Euclidean norm of the model parameters, $\Vert \mathbf{w} \Vert_2^2$; i.e. both models are $L^2$-regularised.
They differ in their choice of loss function: the SVM uses the hinge loss $\ell(y, \hat{y}) = \max(0, 1 - y \hat{y})$, whereas ridge regression uses the squared error $\ell(y, \hat{y}) = (y - \hat{y})^2$.
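Written out (with constants $C$ and $\lambda$ trading off loss against regularisation), the two objectives are:
$$
\begin{aligned}
\text{SVM (soft-margin):} \quad & \min_{\mathbf{w}} \; C \sum_{i = 1}^n \max(0,\, 1 - y_i \hat{y}_i) + \tfrac{1}{2} \Vert \mathbf{w} \Vert_2^2 \\
\text{Ridge regression:} \quad & \min_{\mathbf{w}} \; \sum_{i = 1}^n (y_i - \hat{y}_i)^2 + \lambda \Vert \mathbf{w} \Vert_2^2
\end{aligned}
$$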
Q4
- What do we mean by “cohesion” and “separation” in cluster evaluation?