Loss
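Binary cross-entropy: L = -(1/N)·sum[y·log(p) + (1-y)·log(1-p)], with p = σ(Xw) and the sum over the N examples. Convex in w → gradient descent with a suitable step size reaches the global optimum when one exists.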
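A minimal sketch of the loss in plain Python, assuming labels in {0, 1} and predicted probabilities already computed. The function name `bce_loss`, the `eps` clipping constant, and the averaging over examples (chosen to match the /N in the gradient) are illustrative assumptions, not part of the notes.

```python
import math

def bce_loss(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy over paired labels and probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip p away from 0 and 1 so log() stays finite (eps is an assumed value).
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

# Uninformative predictions (p = 0.5) cost log(2) per example;
# confident correct predictions cost less.
coin_flip = bce_loss([1, 0], [0.5, 0.5])
confident = bce_loss([1, 0], [0.9, 0.1])
```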
Gradient
∂L/∂w = X^T(σ(Xw) - y) / N. Simple form: two matrix-vector products per step, which keeps large-scale training cheap.
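The gradient formula can be sketched in plain Python as below, with X stored as a list of rows. The names `gradient` and `fit`, the learning rate, the step count, and the toy data are assumptions chosen for illustration; a real implementation would normally use a vectorized array library.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def gradient(X, y, w):
    """Computes X^T (sigmoid(Xw) - y) / N for row-major X."""
    n, d = len(X), len(w)
    grad = [0.0] * d
    for row, target in zip(X, y):
        residual = sigmoid(sum(wj * xj for wj, xj in zip(w, row))) - target
        for j in range(d):
            grad[j] += row[j] * residual
    return [g / n for g in grad]

def fit(X, y, lr=0.5, steps=2000):
    """Plain batch gradient descent from w = 0 (lr and steps are assumed values)."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        g = gradient(X, y, w)
        w = [wj - lr * gj for wj, gj in zip(w, g)]
    return w

# Toy data: first column is a constant bias feature; labels are not
# linearly separable, so a finite optimum exists.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0, 1, 0, 1]
w_hat = fit(X, y)
```

At the fitted weights the gradient is close to zero, which is the stationarity condition for the convex loss.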
Multiclass — softmax
K classes: p = softmax(Wx), one weight row per class. Cross-entropy generalizes to -sum_k y_k·log(p_k) with one-hot y. Still a convex optimization.
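A short sketch of the softmax step and its cross-entropy, assuming W is stored as K rows of d weights and the true class is given as an integer index. The function names and the max-subtraction trick for numerical stability are illustrative choices, not taken from the notes.

```python
import math

def softmax(scores):
    # Subtract the largest score before exponentiating to avoid overflow;
    # this leaves the resulting probabilities unchanged.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(W, x):
    """Class probabilities softmax(Wx) for a single input vector x."""
    return softmax([sum(wj * xj for wj, xj in zip(row, x)) for row in W])

def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class index."""
    return -math.log(probs[label])

# Three classes, two features (weights are made-up values).
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
probs = predict_proba(W, [2.0, 1.0])
loss = cross_entropy(probs, 0)
```

With K = 2 and the second score fixed at 0, `softmax([z, 0.0])[0]` equals σ(z), which is how the binary model falls out as a special case.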
Regularization
L2 typical. L1 for feature selection (drives some weights to exactly zero). Elastic net combines both. Reduces overfitting on high-dimensional data.
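The three penalties can be sketched as terms added to the loss. The scaling conventions here (a factor of 1/2 on the L2 term, and a mixing weight `alpha` between 0 and 1 for elastic net) are assumptions; libraries differ on both. The `soft_threshold` helper is an added illustration of the proximal step commonly paired with L1, included only to show how weights reach exactly zero.

```python
import math

def l2_penalty(w, lam):
    # (lam / 2) * ||w||_2^2 ; the 1/2 factor is an assumed convention.
    return 0.5 * lam * sum(wj * wj for wj in w)

def l1_penalty(w, lam):
    # lam * ||w||_1
    return lam * sum(abs(wj) for wj in w)

def elastic_net_penalty(w, lam, alpha):
    # alpha = 1 gives pure L1, alpha = 0 gives pure L2 (assumed parameterization).
    return alpha * l1_penalty(w, lam) + (1.0 - alpha) * l2_penalty(w, lam)

def soft_threshold(w, t):
    # Shrinks each weight toward zero by t and zeroes anything smaller
    # than t in magnitude, which is how L1 ends up selecting features.
    return [math.copysign(max(abs(wj) - t, 0.0), wj) for wj in w]

w = [3.0, -4.0]
penalties = (l2_penalty(w, 2.0), l1_penalty(w, 2.0), elastic_net_penalty(w, 2.0, 0.5))
sparse_w = soft_threshold([0.3, -2.0, 1.0], 0.5)
```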