Core Concepts · Module 17 · 9 min read

Recommender Systems

What should YouTube show you next? What movie should Netflix put on the homepage? Recommendation is its own discipline of machine learning — with techniques that long predate (and still complement) modern LLMs.

The five-bullet version

  • Recommendation is the problem of predicting what a user will want next, given partial information.
  • Collaborative filtering: use other users’ ratings of overlapping items.
  • Content-based filtering: use features of the items themselves.
  • Matrix factorization: factor the sparse user×item matrix into two small “taste” matrices.
  • Modern systems are hybrid two-tower neural nets: one tower per user, one per item, optimized for retrieval at scale.

§ 00 · WHY RECOMMENDATION IS ITS OWN PROBLEM · Sparse data, lots of choices

A recommendation system has a particular shape. You have users and items. Some users have rated some items (or watched, or purchased). Most have not. You want to predict, for the missing cells: if this user saw this item, how much would they like it?

The data is wildly sparse. Netflix has ~250 million users and ~17,000 titles. The average user has watched a few hundred. So well over 98% of the user×item matrix is empty. Standard ML doesn’t love this — you’re asked to predict in a regime where nearly all entries are missing.
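A quick back-of-envelope makes the sparsity vivid (assuming ~300 watched titles per user, in the spirit of “a few hundred”):

```python
# Back-of-envelope sparsity using the approximate numbers above.
users, items = 250_000_000, 17_000
ratings_per_user = 300          # assumption: "a few hundred" watched titles

total_cells = users * items           # ~4.25 trillion possible (user, item) pairs
observed = users * ratings_per_user   # ~75 billion observed interactions
print(f"filled: {observed / total_cells:.2%}")  # ~1.8%: over 98% of the matrix is empty
```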

§ 01 · COLLABORATIVE FILTERING · Wisdom of similar tastes

The original recommendation idea is collaborative filtering: predict a user’s rating for an item from how similar users rated that item, or from how the same user rated similar items. It needs no features of users or items, only the rating matrix itself. If Alice and Bob both loved 20 of the same movies, and Alice loves a 21st movie Bob hasn’t seen, Bob will probably love it too.

Two flavors:

  • User-based: find users whose ratings overlap with yours, and weight their opinions of the candidate item by how similar they are to you.
  • Item-based: find items that were rated similarly to your favorites, and score the candidate by its similarity to what you already liked.

Both work surprisingly well with simple similarity metrics (cosine, Pearson). The downsides: cold start — new users or items have no overlap with anything — and scalability — comparing every user to every other user gets expensive past a few hundred thousand users.
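To make the mechanics concrete, here is a minimal user-based sketch with cosine similarity; the matrix and ratings are toy values invented for the example:

```python
import numpy as np

# Toy rating matrix: rows = users, cols = items, 0 = unrated.
R = np.array([
    [5, 4, 0, 0, 3],
    [4, 5, 2, 1, 0],
    [0, 5, 0, 5, 0],
    [0, 3, 4, 0, 5],
], dtype=float)

def cosine_sim(a, b):
    """Cosine similarity over the items both users have rated."""
    mask = (a > 0) & (b > 0)
    if not mask.any():
        return 0.0
    a, b = a[mask], b[mask]
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def predict(R, user, item):
    """User-based CF: similarity-weighted average of other users' ratings."""
    sims, rats = [], []
    for other in range(len(R)):
        if other != user and R[other, item] > 0:
            sims.append(cosine_sim(R[user], R[other]))
            rats.append(R[other, item])
    if not sims:
        return np.nan  # cold start: nobody has rated this item yet
    sims, rats = np.array(sims), np.array(rats)
    return sims @ rats / (sims.sum() + 1e-9)

print(predict(R, user=0, item=2))  # predicted rating for an unseen item
```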

§ 02 · CONTENT-BASED FILTERING · Recommending by features instead of overlap

The complementary idea: don’t look at other users at all. Just recommend things similar to the user’s past favorites, using features of the items themselves.

For movies: genre, director, year, runtime, embedded plot summary. Compute a user profile as the average of features for their liked movies. Recommend movies with high similarity to that profile.
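A minimal sketch of that profile-and-match loop; the feature axes (“action”, “art-house”, “animated”) and values are made up for the example:

```python
import numpy as np

# Illustrative item features (rows: movies; cols: action, art-house, animated).
item_features = np.array([
    [0.9, 0.1, 0.0],   # movie 0
    [0.2, 0.8, 0.0],   # movie 1
    [0.0, 0.2, 0.9],   # movie 2
    [0.1, 0.1, 0.9],   # movie 3
])

liked = [2, 3]  # indices of the user's liked movies

# User profile = mean feature vector of liked items.
profile = item_features[liked].mean(axis=0)

# Recommend by cosine similarity between the profile and every item.
norms = np.linalg.norm(item_features, axis=1) * np.linalg.norm(profile)
scores = item_features @ profile / (norms + 1e-9)
scores[liked] = -np.inf            # don't re-recommend what they've already seen
print(np.argsort(scores)[::-1])    # items ranked by similarity to the profile
```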

Content-based filtering handles cold start gracefully (a brand-new movie still has features), but it can’t suggest anything truly new — by construction it’s recommending things like what you’ve already seen. The famous Netflix filter-bubble critique is largely about content-based filtering’s tendency to narrow rather than broaden taste.

§ 03 · MATRIX FACTORIZATION · Compressing taste into a few hidden dimensions

The dominant technique from ~2009 (Netflix Prize era) onward: matrix factorization. Take the user×item rating matrix R (mostly empty). Find two small matrices U (users × k) and I (items × k) whose product U·Iᵀ approximates R on the observed entries.

Each user gets a k-dimensional latent vector — interpret it as their “taste profile” in some k-axis space the model discovers (one axis might roughly mean “likes action,” another “likes art-house”). Each item gets a matching latent vector. The dot product is the predicted rating.
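A minimal sketch of the training loop, plain SGD on the observed cells only; the toy triples, learning rate, and regularization strength are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed cells of the sparse rating matrix as (user, item, rating) triples.
ratings = [(0, 0, 5.0), (0, 1, 4.0), (1, 1, 5.0), (1, 2, 2.0), (2, 0, 5.0), (2, 2, 1.0)]
n_users, n_items, k = 3, 3, 2                 # k latent "taste" dimensions

U = 0.1 * rng.standard_normal((n_users, k))   # user latent vectors
I = 0.1 * rng.standard_normal((n_items, k))   # item latent vectors

lr, reg = 0.05, 0.01
for _ in range(500):                          # SGD on observed entries only
    for u, i, r in ratings:
        err = r - U[u] @ I[i]                 # error on this observed cell
        U_u = U[u].copy()                     # update both factors from the same point
        U[u] += lr * (err * I[i] - reg * U[u])
        I[i] += lr * (err * U_u - reg * I[i])

print(np.round(U @ I.T, 1))                   # predictions for every cell, gaps filled
```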

Lab · user×item matrix · Observed ratings; matrix factorization fills in the gaps
[Interactive widget: a sparse rating matrix of five users (Amit, Beth, Carl, Dana, Eva) against five movies (Inception, Arrival, Up, WALL·E, John Wick), with a handful of observed ratings and the rest blank.]

Observed view: dashes mark items a user hasn’t rated. Predicted view: the factorization fills in the gaps by finding two small “taste” matrices whose product approximates the observed cells.

Two beautiful things about this:

  • Nobody designs the k axes. The model discovers them from the ratings alone; an axis like “likes action” emerges because it helps explain the observed ratings, not because anyone labeled it.
  • It is massively compressive. Billions of matrix cells collapse into (users + items) × k parameters, and that compression is exactly what forces the model to generalize to the empty cells.

§ 04 · MODERN HYBRIDS AND TWO-TOWER MODELS · How YouTube and Spotify do it now

Production recommendation systems combine multiple signals:

  • Collaborative signals: who watched, clicked, or bought what.
  • Content signals: item features, metadata, embedded descriptions.
  • Context: device, time of day, what is happening in the current session.

The dominant modern architecture is the two-tower model: two neural networks running in parallel. One takes the user (history, demographics, context) and produces a vector. The other takes an item (features, metadata) and produces a matching-shape vector. The dot product of the two vectors predicts engagement, and recommendations come from nearest-neighbor search in the shared embedding space.

Why two-tower wins at scale: you can precompute the embedding for every item once, store them in a vector index, and at query time just compute the user’s embedding and look up the nearest items. This is the same nearest-neighbor pattern used in modern RAG — and it’s not a coincidence. Recommendation and retrieval converge at scale.
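A sketch of that offline/online split. The two towers here are stand-in random projections rather than trained networks; the point is the data flow, not the model:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32  # shared embedding dimension

# Stand-ins for the two trained towers (random projections for the sketch).
W_user = rng.standard_normal((100, d))
W_item = rng.standard_normal((50, d))

def user_tower(user_feats):   # user history + context -> d-dim vector
    return user_feats @ W_user

def item_tower(item_feats):   # item features -> d-dim vector
    return item_feats @ W_item

# OFFLINE: encode the whole catalog once and store it in a vector index.
catalog_feats = rng.standard_normal((10_000, 50))
item_index = item_tower(catalog_feats)      # (10_000, d), precomputed

# ONLINE: encode one user at query time, then nearest-neighbor lookup.
u = user_tower(rng.standard_normal(100))    # (d,)
scores = item_index @ u                     # dot product against every item
top_k = np.argsort(scores)[-10:][::-1]      # brute force here; ANN index at scale
print(top_k)
```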

[Diagram: user tower (user history + context → encoder → user vec) and item tower (item features → encoder → item vec); score = u · i. Items pre-encoded offline, users encoded at query time, ANN search on the item index.]
Fig 1 · Two-tower model. The architecture you'll find inside YouTube, Spotify, TikTok, and most large-scale recommenders. It looks exactly like a retrieval system.
Check · A music app wants to recommend new songs to a brand-new user who has played 5 songs. Which approach handles this cold start best? (Content-based filtering: those five plays already give item features to match against, while the overlap with other users is still far too thin for collaborative filtering.)

§ 05 · TAKING THIS FORWARD · Where the field is moving

Three threads worth watching:

  • Sequential recommenders: modeling the user as a sequence of interactions with transformer-style architectures (see SASRec below).
  • The convergence of recommendation and retrieval: two-tower embeddings plus nearest-neighbor search is the same machinery that powers modern RAG.
  • LLMs as a complement: as noted up top, these techniques predate modern language models and still sit alongside them rather than being replaced by them.

§ · GOING DEEPER · From matrix factorization to two-tower neural recommenders

The Netflix Prize (2006–2009) made matrix factorization (Koren et al. 2009) the dominant recommender architecture for a decade — factor the user-item interaction matrix into low-rank latent representations, recommend by dot product. Simple, fast, interpretable. The neural era replaced the dot product with a learned similarity function (Neural Collaborative Filtering, He et al. 2017) but kept the factorization structure.
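A minimal sketch of the NCF idea: keep the two latent vectors, but score with a small MLP instead of a dot product. The weights below are random placeholders; the real model learns them end to end:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 8

# Latent vectors, exactly as in matrix factorization...
u_vec = rng.standard_normal(k)
i_vec = rng.standard_normal(k)

# ...but the similarity function is a tiny MLP (placeholder weights).
W1 = rng.standard_normal((2 * k, 16))
w2 = rng.standard_normal(16)

def ncf_score(u, i):
    h = np.maximum(np.concatenate([u, i]) @ W1, 0.0)  # one ReLU hidden layer
    return h @ w2                                     # learned similarity

print(ncf_score(u_vec, i_vec))   # vs. the fixed dot product: u_vec @ i_vec
```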

Modern industrial systems (YouTube, TikTok, Spotify) almost all use a two-tower architecture (Covington et al. 2016, Yi et al. 2019): one neural network encodes the user, a separate one encodes the item, and recommendation is the dot product of their embeddings. Pre-compute item embeddings, do nearest-neighbor search at query time. Sequential variants (SASRec, Kang & McAuley 2018) model the user as a sequence of past interactions — directly inheriting transformer architecture for what was once a matrix-factorization problem.

§ · FURTHER READING · References & deeper sources

  1. Koren, Bell, Volinsky (2009). Matrix Factorization Techniques for Recommender Systems · IEEE Computer
  2. He, Liao, Zhang, Nie, Hu, Chua (2017). Neural Collaborative Filtering · WWW
  3. Covington, Adams, Sargin (2016). Deep Neural Networks for YouTube Recommendations · RecSys
  4. Kang, McAuley (2018). Self-Attentive Sequential Recommendation (SASRec) · ICDM
  5. Yi et al. (2019). Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations · RecSys

Original figures live in the linked sources — open the papers for the canonical visuals in their full context.