Stage 1: candidate generation

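Goal: from 1B+ videos, retrieve ~1000 candidates for THIS user, in <50ms. Two-tower model: one tower encodes the user (history, demographics), the other encodes the video. Both produce 256-dim embeddings in a shared space, so user-video relevance reduces to a similarity score (typically a dot product). Retrieval = approximate nearest-neighbor (ANN) search over the video embeddings, using a library such as FAISS or ScaNN.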
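
A minimal sketch of the retrieval step, under loud assumptions: the two towers are replaced by random unit vectors, the corpus is 1000 videos rather than 1B+, and an exact brute-force top-k stands in for the ANN index. The names `random_unit_vector`, `retrieve`, and `video_index` are hypothetical, chosen only for illustration.

```python
import heapq
import random

DIM = 256  # embedding size mentioned in the text

def random_unit_vector(rng, dim=DIM):
    """Stand-in for a tower's output: a random L2-normalized vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]

def dot(a, b):
    """Dot-product similarity between two embeddings."""
    return sum(x * y for x, y in zip(a, b))

def retrieve(user_vec, video_index, k):
    """Exact top-k video ids by dot product, best first.

    A production system would swap this scan for an ANN index,
    trading a little recall for a large speedup.
    """
    return heapq.nlargest(k, video_index, key=lambda vid: dot(user_vec, video_index[vid]))

rng = random.Random(0)
# Video embeddings are computed ahead of time; only the user tower runs per request.
video_index = {f"video_{i}": random_unit_vector(rng) for i in range(1000)}
user_vec = random_unit_vector(rng)
candidates = retrieve(user_vec, video_index, k=10)
```

The brute-force scan costs one dot product per video, which is exactly what an ANN index avoids at the 1B scale.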

Stage 2: ranking

Goal: score the ~1000 candidates, return the top 20. Use a much richer model with hundreds of features (user-video interactions, watch context, device, etc.). Multi-task learning: one model predicts click probability, watch time, satisfaction, and share probability simultaneously. The final score is a linear combination of these predictions, weighted by business goals.
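
The score combination can be sketched as follows. The weights, the `Predictions` fields, and the example numbers are all hypothetical; in a real system the predictions would come from the multi-task model's output heads, and the weights would be tuned against business metrics.

```python
from dataclasses import dataclass

@dataclass
class Predictions:
    """Per-candidate outputs of the multi-task ranker (illustrative fields)."""
    p_click: float
    watch_minutes: float  # expected watch time, in minutes
    p_satisfied: float
    p_share: float

# Hypothetical business weights; they also absorb the different scales
# of the heads (probabilities vs. minutes).
WEIGHTS = {"p_click": 1.0, "watch_minutes": 0.5, "p_satisfied": 2.0, "p_share": 3.0}

def final_score(pred, weights=WEIGHTS):
    """Linear combination of the task predictions."""
    return sum(w * getattr(pred, name) for name, w in weights.items())

def top_n(scored, n=20):
    """Return the n candidate ids with the highest final score, best first."""
    return sorted(scored, key=lambda vid: final_score(scored[vid]), reverse=True)[:n]

scored = {
    "a": Predictions(p_click=0.2, watch_minutes=4.0, p_satisfied=0.5, p_share=0.01),
    "b": Predictions(p_click=0.6, watch_minutes=1.0, p_satisfied=0.3, p_share=0.02),
}
ranking = top_n(scored, n=2)
```

With these made-up weights, candidate "a" outranks "b" despite its lower click probability, because watch time and satisfaction dominate the combined score.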

Why two stages

Single-stage ranking over 1B videos is infeasible: each query would need ~1B evaluations of the ranking model. Two-stage: retrieval is fast and approximate, ranking is slow and precise. The same pattern is used by most large recommender systems.
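
A back-of-envelope version of this argument, as a sketch. The per-evaluation cost of 1 microsecond is an assumption picked only to make the arithmetic concrete, and is probably optimistic for a model with hundreds of features; the corpus and candidate counts come from the text.

```python
CORPUS_SIZE = 1_000_000_000   # ~1B videos (from the text)
CANDIDATES = 1_000            # stage-1 output size (from the text)
COST_PER_EVAL_S = 1e-6        # ASSUMED cost of one ranking-model evaluation

# Single stage: run the ranker on every video for every query.
single_stage_s = CORPUS_SIZE * COST_PER_EVAL_S    # about 1000 seconds per query

# Two stage: run the ranker only on the retrieved candidates.
two_stage_rank_s = CANDIDATES * COST_PER_EVAL_S   # about 1 millisecond per query

# Factor by which retrieval shrinks the ranking workload.
reduction = CORPUS_SIZE // CANDIDATES
```

Whatever the true per-evaluation cost is, the reduction factor of one million is independent of it; only the absolute latencies change.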

Cold start

New user: no history → recommend popular content in their language/country. New video: no engagement signals → recommend to users whose embedding aligns with the video's content embedding (e.g., NLP features from title + transcript).
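
Both fallbacks can be sketched in a few lines. Everything here is illustrative: the `watch_log` event format, the function names, and the 2-dim vectors are hypothetical, and the sketch assumes the video's content embedding lives in the same space as the user embeddings.

```python
from collections import Counter

def popular_for_new_user(language, country, watch_log, k=3):
    """New user: most-viewed videos among users with the same language/country.

    watch_log is an iterable of (video_id, language, country) view events.
    """
    counts = Counter(
        vid for vid, lang, ctry in watch_log if lang == language and ctry == country
    )
    return [vid for vid, _ in counts.most_common(k)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def audience_for_new_video(content_vec, user_vecs, k=2):
    """New video: the k users whose embedding best aligns with the content embedding."""
    return sorted(user_vecs, key=lambda u: dot(user_vecs[u], content_vec), reverse=True)[:k]

watch_log = [
    ("v1", "en", "US"), ("v2", "en", "US"), ("v1", "en", "US"),
    ("v3", "de", "DE"), ("v2", "en", "US"), ("v1", "en", "US"),
]
new_user_recs = popular_for_new_user("en", "US", watch_log, k=2)

user_vecs = {"alice": [1.0, 0.0], "bob": [0.0, 1.0], "carol": [0.7, 0.7]}
content_vec = [0.9, 0.1]  # e.g. derived from title + transcript features
seed_audience = audience_for_new_video(content_vec, user_vecs, k=2)
```

Once a new user or video has accumulated enough engagement, the regular two-stage path can take over from these fallbacks.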

Online vs offline training

Offline: nightly batch retraining on billion-row logs. Online: real-time feature serving (last-clicked video, current session). Hybrid: stable embeddings learned offline, fast-changing features (recency) injected at serving time.
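
A minimal sketch of the hybrid pattern, assuming a simple additive recency boost. The embedding tables stand in for the output of the nightly batch job; the half-life decay, the `recency_weight` value, and all names are hypothetical choices for illustration, not a prescribed formula.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Offline: stable embeddings, refreshed by the nightly batch job (toy values).
USER_EMB = {"u1": [1.0, 0.0]}
VIDEO_EMB = {"v_old": [0.9, 0.1], "v_new": [0.2, 0.8]}

def recency_boost(upload_ts, now, half_life_s=86_400.0):
    """Online feature: decays from 1.0 and halves every half_life_s seconds."""
    age_s = max(0.0, now - upload_ts)
    return 0.5 ** (age_s / half_life_s)

def serve_score(user_id, video_id, upload_ts, now, recency_weight=0.3):
    """Combine the offline similarity with a feature computed at request time."""
    base = dot(USER_EMB[user_id], VIDEO_EMB[video_id])
    return base + recency_weight * recency_boost(upload_ts, now)

now = 1_000_000.0  # request timestamp in seconds (arbitrary for the example)
old_score = serve_score("u1", "v_old", now - 10 * 86_400.0, now)  # uploaded 10 days ago
new_score = serve_score("u1", "v_new", now, now)                  # uploaded just now
```

The embeddings never change between nightly runs, yet the score of a fresh upload still moves minute by minute because the recency term is recomputed on every request.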