Recommender Systems
What should YouTube show you next? What movie should Netflix put on the homepage? Recommendation is its own discipline of machine learning — with techniques that long predate (and still complement) modern LLMs.
The five-bullet version
- Recommendation is the problem of predicting what a user will want next, given partial information.
- Collaborative filtering: use other users’ ratings of overlapping items.
- Content-based filtering: use features of the items themselves.
- Matrix factorization: factor the sparse user×item matrix into two small “taste” matrices.
- Modern systems are hybrid two-tower neural nets: one tower per user, one per item, optimized for retrieval at scale.
§ 00 · WHY RECOMMENDATION IS ITS OWN PROBLEM · Sparse data, lots of choices
A recommendation system has a particular shape. You have users and items. Some users have rated some items (or watched, or purchased). Most have not. You want to predict, for the missing cells: if this user saw this item, how much would they like it?
The data is wildly sparse. Netflix has ~250 million users and ~17,000 titles. The average user has watched a few hundred, so roughly 98% of the user×item matrix is empty. Standard ML doesn't love this: you're asked to predict in a regime where nearly all entries are missing.
§ 01 · COLLABORATIVE FILTERING · Wisdom of similar tastes
The original recommendation idea is collaborative filtering: predict a user's rating for an item from how similar users rated that item, or from how the same user rated similar items. It needs no features of users or items, only the rating matrix itself. If Alice and Bob both loved 20 of the same movies, and Alice loves a 21st movie Bob hasn't seen, Bob will probably love it too.
Two flavors:
- User-based. Find users similar to the target user. Recommend what they liked.
- Item-based. Find items similar to ones the user liked. Recommend those.
Both work surprisingly well with simple similarity metrics (cosine, Pearson). The downsides: cold start — new users or items have no overlap with anything — and scalability — comparing every user to every other user gets expensive past a few hundred thousand users.
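The user-based flavor can be sketched in a few lines. The ratings below are invented toy data; the prediction is a cosine-similarity-weighted average of other users' ratings for the target item.

```python
import numpy as np

# Toy rating matrix: rows = users, cols = items, 0 = unrated (invented data).
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def cosine(a, b):
    """Cosine similarity between two rating vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def predict_user_based(R, user, item):
    """Predict R[user, item] as a similarity-weighted average of
    ratings from other users who did rate the item."""
    num, den = 0.0, 0.0
    for other in range(R.shape[0]):
        if other == user or R[other, item] == 0:
            continue                       # skip self and non-raters
        s = cosine(R[user], R[other])
        num += s * R[other, item]
        den += abs(s)
    return num / den if den else 0.0

print(round(predict_user_based(R, user=1, item=3), 2))  # → 1.55
```

User 1's tastes overlap heavily with user 0 (who rated this item 1) and barely with user 2 (who rated it 4), so the prediction lands near the low rating. Treating unrated cells as zeros inside the cosine is the naive baseline; real systems mean-center ratings first.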
§ 02 · CONTENT-BASED FILTERING · Recommending by features instead of overlap
The complementary idea: don’t look at other users at all. Just recommend things similar to the user’s past favorites, using features of the items themselves.
For movies: genre, director, year, runtime, embedded plot summary. Compute a user profile as the average of features for their liked movies. Recommend movies with high similarity to that profile.
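That recipe is short enough to sketch directly. The movies and their feature scores below are hypothetical stand-ins for real item features:

```python
import numpy as np

# Hypothetical item features: [action, art_house, comedy] per movie.
items = {
    "Heat":     np.array([0.9, 0.20, 0.0]),
    "Stalker":  np.array([0.1, 0.90, 0.0]),
    "Mirror":   np.array([0.0, 1.00, 0.1]),
    "Persona":  np.array([0.1, 0.95, 0.0]),
    "Hot Fuzz": np.array([0.7, 0.10, 0.8]),
}

liked = ["Stalker", "Mirror"]        # the user's past favorites

# User profile = average feature vector of liked items.
profile = np.mean([items[t] for t in liked], axis=0)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

# Rank unseen movies by similarity to the profile.
unseen = [t for t in items if t not in liked]
ranked = sorted(unseen, key=lambda t: cosine(profile, items[t]), reverse=True)
print(ranked)  # → ['Persona', 'Heat', 'Hot Fuzz']
```

The art-house fan gets the unseen art-house film first, with no other users consulted, which is exactly why a brand-new movie with features poses no cold-start problem.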
Content-based filtering handles cold start gracefully (a brand-new movie still has features), but it can’t suggest anything truly new — by construction it’s recommending things like what you’ve already seen. The famous Netflix filter-bubble critique is largely about content-based filtering’s tendency to narrow rather than broaden taste.
§ 03 · MATRIX FACTORIZATION · Compressing taste into a few hidden dimensions
The dominant technique from ~2009 (Netflix Prize era) onward: matrix factorization. Take the user×item rating matrix R (mostly empty) and find two small low-rank matrices U (users × k) and I (items × k) whose product U·Iᵀ approximates R on the observed entries.
Each user gets a k-dimensional latent vector — interpret it as their “taste profile” in some k-axis space the model discovers (one axis might roughly mean “likes action,” another “likes art-house”). Each item gets a matching latent vector. The dot product is the predicted rating.
Figure: two views of the rating matrix. Observed view: dashes mark items a user hasn’t rated. Predicted view: the factorization fills the gaps with two small “taste” matrices whose product approximates the observed cells.
Two beautiful things about this:
- The unobserved cells are filled in by the factorization. Trained on the cells you have, the model predicts the cells you don’t.
- The latent dimensions emerge from the data. Nobody told the model to think in terms of “action vs art-house”; that structure was implicit in the ratings.
§ 04 · MODERN HYBRIDS AND TWO-TOWER MODELS · How YouTube and Spotify do it now
Production recommendation systems combine multiple signals:
- Collaborative (who else liked this?).
- Content-based (what features does this share with what I liked?).
- Context (time of day, device, what came before).
- Real-time behavior (what did I click in the last 30 seconds?).
The dominant modern architecture is the two-tower model: two neural networks running in parallel. One takes the user (history, demographics, context) and produces an embedding; the other takes an item (features, metadata) and produces an embedding of the same shape. The dot product of the two vectors predicts engagement, and recommendation becomes nearest-neighbor search in the shared embedding space.
Why two-tower wins at scale: you can precompute the embedding for every item once, store them in a vector index, and at query time just compute the user’s embedding and look up the nearest items. This is the same nearest-neighbor pattern used in modern RAG — and it’s not a coincidence. Recommendation and retrieval converge at scale.
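The retrieval pattern can be sketched with random stand-in embeddings — no real towers here, only the shapes and the lookup matter:

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, d = 10_000, 64

# Offline: precompute and unit-normalize an embedding for every item.
# (Random stand-ins for the item tower's outputs.)
item_emb = rng.normal(size=(n_items, d)).astype(np.float32)
item_emb /= np.linalg.norm(item_emb, axis=1, keepdims=True)

# Online: run only the user tower, then a nearest-neighbor lookup.
user_emb = rng.normal(size=d).astype(np.float32)
user_emb /= np.linalg.norm(user_emb)

scores = item_emb @ user_emb                 # one matvec over all items
top10 = np.argpartition(-scores, 10)[:10]    # unordered top-10 candidates
top10 = top10[np.argsort(-scores[top10])]    # sorted best-first
```

In production the brute-force matvec is replaced by an approximate nearest-neighbor index, but the contract is identical: item embeddings are fixed at query time, and only the user embedding is computed fresh.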
§ 05 · TAKING THIS FORWARD · Where the field is moving
Three threads worth watching:
- Sequential / session-based — the next song depends on the last three. Transformers on user history work very well for this and are increasingly the production default.
- Multi-objective — engagement is one signal, but so are diversity, novelty, fairness, advertiser revenue. Modern recommenders optimize a multi-objective score, not pure click-through.
- LLM-augmented — using an LLM to summarize user intent, generate item descriptions, or directly recommend from natural-language queries. New surface, not a replacement for the underlying ranker.
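The multi-objective point above can be made concrete with a hand-weighted blend. The signal names and weights here are hypothetical; real systems tune or learn them per surface:

```python
# Hypothetical signal names and weights — illustrative only.
WEIGHTS = {"p_click": 0.5, "p_long_watch": 0.3, "novelty": 0.1, "diversity": 0.1}

def blended_score(signals: dict) -> float:
    """Weighted blend of per-item predictions instead of raw click-through."""
    return sum(WEIGHTS[k] * signals[k] for k in WEIGHTS)

candidates = [
    {"id": "a", "p_click": 0.9, "p_long_watch": 0.1,
     "novelty": 0.0, "diversity": 0.0},   # clickbaity: clicks, little else
    {"id": "b", "p_click": 0.6, "p_long_watch": 0.8,
     "novelty": 0.5, "diversity": 0.4},   # satisfying across objectives
]
best = max(candidates, key=blended_score)
print(best["id"])  # → b
```

Under pure click-through, item "a" wins; the blend lets long-watch, novelty, and diversity outvote the clickbait.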
§ · GOING DEEPER · From matrix factorization to two-tower neural recommenders
The Netflix Prize (2006–2009) made matrix factorization (Koren et al. 2009) the dominant recommender architecture for a decade — factor the user-item interaction matrix into low-rank latent representations, recommend by dot product. Simple, fast, interpretable. The neural era replaced the dot product with a learned similarity function (Neural Collaborative Filtering, He et al. 2017) but kept the factorization structure.
Modern industrial systems (YouTube, TikTok, Spotify) almost all use a two-tower architecture (Covington et al. 2016, Yi et al. 2019): one neural network encodes the user, a separate one encodes the item, and recommendation is the dot product of their embeddings. Pre-compute item embeddings, do nearest-neighbor search at query time. Sequential variants (SASRec, Kang & McAuley 2018) model the user as a sequence of past interactions — directly inheriting transformer architecture for what was once a matrix-factorization problem.
§ · FURTHER READING · References & deeper sources
- Koren et al. (2009). Matrix Factorization Techniques for Recommender Systems · IEEE Computer
- He et al. (2017). Neural Collaborative Filtering · WWW
- Covington et al. (2016). Deep Neural Networks for YouTube Recommendations · RecSys
- Kang & McAuley (2018). Self-Attentive Sequential Recommendation (SASRec) · ICDM
- Yi et al. (2019). Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations · RecSys
Original figures live in the linked sources — open the papers for the canonical visuals in their full context.