Loss

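Binary cross-entropy: L(w) = -(1/N) · Σ_i [y_i·log(p_i) + (1-y_i)·log(1-p_i)], where p_i = σ(w·x_i). Convex in w → no local minima other than the global one, so gradient descent with a small enough step size reaches the global optimum.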
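
A minimal sketch of the loss in plain Python, averaged over the samples. The function name bce_loss and the clipping constant eps are illustrative choices, not part of the notes above; clipping is only there to keep log() finite when a prediction is exactly 0 or 1.

```python
import math

def bce_loss(y_true, p_pred, eps=1e-12):
    # Mean binary cross-entropy over all samples.
    # eps (an assumed safeguard) clips p away from 0 and 1 so log() stays finite.
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)
```

An uninformative prediction of p = 0.5 for every sample costs log(2) per sample; confident correct predictions push the loss toward 0.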

Gradient

∂L/∂w = X^T(σ(Xw) - y) / N, for the loss averaged over N samples. Simple form: each step needs only matrix-vector products with X, which enables large-scale training.
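
The gradient formula can be sketched with plain lists as below, together with a basic gradient descent loop. The names gradient and fit, the learning rate, the step count, and the toy data are all assumptions made for illustration; a real implementation would normally use a vectorized array library.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def gradient(X, y, w):
    # Computes X^T (sigmoid(Xw) - y) / N, one sample (row of X) at a time.
    n = len(X)
    grad = [0.0] * len(w)
    for xi, yi in zip(X, y):
        err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
        for j, xj in enumerate(xi):
            grad[j] += err * xj / n
    return grad

def fit(X, y, lr=0.5, steps=2000):
    # Plain gradient descent from w = 0; lr and steps are illustrative values.
    w = [0.0] * len(X[0])
    for _ in range(steps):
        g = gradient(X, y, w)
        w = [wj - lr * gj for wj, gj in zip(w, g)]
    return w

# Toy data: the first column is a constant 1 (bias term). The labels overlap,
# so the classes are not separable and a finite optimum exists.
X = [[1, -2], [1, -1], [1, -1], [1, 1], [1, 1], [1, 2]]
y = [0, 0, 1, 0, 1, 1]
w = fit(X, y)
```

At the returned w the gradient is close to zero, which is the stationarity condition for the convex loss.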

Multiclass — softmax

K classes: p = softmax(Wx), with one row of W per class. Cross-entropy generalizes to -Σ_k y_k·log(p_k) for one-hot y. Still a convex optimization in W.
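
A small sketch of the multiclass pieces in plain Python. The helper names (softmax, predict_proba, cross_entropy) and the list-of-rows layout for W are assumptions for illustration. Subtracting the maximum score before exponentiating is a standard trick that leaves the result unchanged but avoids overflow.

```python
import math

def softmax(scores):
    # Subtract the max score first so exp() cannot overflow.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(W, x):
    # W is a K x d list of rows; the class scores are the entries of Wx.
    return softmax([sum(wj * xj for wj, xj in zip(row, x)) for row in W])

def cross_entropy(probs, label):
    # With a one-hot target, the sum over classes keeps only the true class term.
    return -math.log(probs[label])
```

With K = 2 and scores [z, 0], the first softmax output equals σ(z), so the binary model is a special case.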

Regularization

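L2 (penalty λ‖w‖²) is the typical choice. L1 (penalty λ‖w‖₁) drives some weights to exactly zero, which gives feature selection. Elastic net combines the two. All of them curb overfitting on high-dimensional data.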
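
A sketch of the penalties under one common parametrization, where lam scales the whole penalty and l1_ratio mixes L1 and L2. Both parameter names, the 0.5 factor on the L2 term, and the function names are assumptions here, not something the notes specify. The soft-threshold step is one standard way to handle the L1 term (it is the proximal operator of the L1 norm) and shows why L1 produces exact zeros.

```python
import math

def elastic_net_penalty(w, lam=0.1, l1_ratio=0.5):
    # l1_ratio=1 gives pure L1, l1_ratio=0 gives pure L2 (with an assumed 0.5 factor).
    l1 = sum(abs(wj) for wj in w)
    l2 = sum(wj * wj for wj in w)
    return lam * (l1_ratio * l1 + (1.0 - l1_ratio) * 0.5 * l2)

def soft_threshold(w, t):
    # Shrinks every weight toward 0 by t; weights with |w_j| <= t become exactly 0.
    return [math.copysign(max(abs(wj) - t, 0.0), wj) for wj in w]
```

In a proximal gradient loop, soft_threshold would be applied to the weights after each gradient step on the unpenalized loss, with t set to the step size times the L1 strength.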