Searching for better scoring knobs

The 💖For You feed has a few parameters that determine what content gets to the top of the feed.

Among them is the num_paths exponent factor, which boosts posts that have more distinct paths leading to them. The current value is 0.3.

The other one is the popularity exponent factor, which demotes posts that are popular in general. The current value is 0.2.
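To make the two knobs concrete, here is a minimal sketch of how such exponents could enter a score. The multiplicative form and the names `adjusted_score`, `base_score`, `paths_exp` and `pop_exp` are assumptions for illustration only; the actual feed formula is not shown in this post.

```python
def adjusted_score(base_score, num_paths, popularity, paths_exp=0.3, pop_exp=0.2):
    """Hypothetical scoring form (an assumption, not the feed's real formula):
    boost by num_paths ** paths_exp, demote by popularity ** pop_exp."""
    popularity = max(popularity, 1)  # guard against dividing by zero
    return base_score * num_paths ** paths_exp / popularity ** pop_exp


# Two posts with the same base score and path count, different overall popularity
niche = adjusted_score(1.0, num_paths=8, popularity=50)
viral = adjusted_score(1.0, num_paths=8, popularity=5000)
print(niche > viral)  # the generally popular post is demoted
```

With an exponent of 0 a factor has no effect, and larger exponents make the boost or the demotion stronger.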

I manually picked those values because they looked good in the Playground and then they performed better in A/B tests.

Is there a better combination of these parameters? Most likely! How do we find it? I log the feed responses that were returned to users. These logs include the list of posts (usually 30 per response) and the "features" of each post: the score, the popularity, the number of paths, etc. I also record when a user likes a post or presses "show less like this" or "show more like this". With that information we can try different parameter values, see how the ranking of those 30 posts would change, and check whether the liked posts end up ranking higher and the "show less like this" posts end up lower.
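The replay idea can be sketched roughly as follows. Everything here is a simplified stand-in: the `LoggedPost` fields, the `rescore` formula and the "mean position" metric are assumptions for illustration, not the actual log schema or evaluation metric.

```python
from dataclasses import dataclass


@dataclass
class LoggedPost:
    # Hypothetical log record; the real logged features may differ
    post_id: str
    base_score: float
    num_paths: int
    popularity: int
    liked: bool = False
    show_less: bool = False


def rescore(post, paths_exp, pop_exp):
    # Assumed scoring form: boost by paths, demote by popularity
    return post.base_score * post.num_paths ** paths_exp / max(post.popularity, 1) ** pop_exp


def mean_position(responses, paths_exp, pop_exp, flag):
    """Average 0-based rank of flagged posts after re-ranking each logged response."""
    positions = []
    for posts in responses:
        ranked = sorted(posts, key=lambda p: rescore(p, paths_exp, pop_exp), reverse=True)
        positions += [i for i, p in enumerate(ranked) if getattr(p, flag)]
    return sum(positions) / len(positions) if positions else float("nan")


# One toy response with three posts instead of 30
response = [
    LoggedPost("a", 3.0, num_paths=2, popularity=9000, show_less=True),
    LoggedPost("b", 0.5, num_paths=6, popularity=40, liked=True),
    LoggedPost("c", 0.8, num_paths=3, popularity=300),
]
for paths_exp, pop_exp in [(0.3, 0.2), (0.5, 0.3)]:
    liked_pos = mean_position([response], paths_exp, pop_exp, "liked")
    less_pos = mean_position([response], paths_exp, pop_exp, "show_less")
    print(paths_exp, pop_exp, liked_pos, less_pos)
```

In this toy response the liked post moves from position 1 to position 0 under the second pair of exponents, and the "show less" post moves from 0 to 1. A lower mean position for liked posts and a higher one for "show less" posts is the direction we want.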

The simplest way to do the search is "grid search", where we run the evaluation for every pair of values on a fixed grid. Another option is a smarter algorithm that picks the next combination of parameters to try based on the previous evaluation results.
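A grid search over two parameters fits in a few lines. This is only a sketch: `grid_search` and `toy_objective` are hypothetical names, and the toy objective has its peak placed arbitrarily so the example has a known answer. A real objective would replay the logged responses instead.

```python
from itertools import product


def grid_search(evaluate, paths_grid, pop_grid):
    """Evaluate every (paths_exp, pop_exp) pair and return the best pair and its value."""
    best_pair, best_value = None, float("-inf")
    for paths_exp, pop_exp in product(paths_grid, pop_grid):
        value = evaluate(paths_exp, pop_exp)
        if value > best_value:
            best_pair, best_value = (paths_exp, pop_exp), value
    return best_pair, best_value


def toy_objective(paths_exp, pop_exp):
    # Stand-in objective (higher is better), peak placed at (0.5, 0.3) for illustration only
    return -((paths_exp - 0.5) ** 2 + (pop_exp - 0.3) ** 2)


grid = [round(0.1 * i, 1) for i in range(11)]  # 0.0, 0.1, ..., 1.0
best_pair, best_value = grid_search(toy_objective, grid, grid)
print(best_pair)
```

The cost grows with the product of the grid sizes (121 evaluations here), which is why a search that chooses the next point based on earlier results can be attractive when each evaluation is slow.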

The latter method resulted in the parameters:

num_paths exponent 0.3 -> 0.5 - it would prioritize posts that were liked by more users who have liked the same posts as you.

popularity exponent 0.2 -> 0.3 - it would more strongly demote posts with more likes in general.

Based on these new parameters, the liked posts would move up by 4.4% in the list and the "show less" posts would move down by 16%.

This retrospective analysis does not tell the whole story though, because it only applies to the results that were returned to the user by the old algorithm.

The real test is to run an A/B experiment where 50% of the users get recommendations with the old parameters and the other 50% get them with the new ones.
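One common way to get a stable 50/50 split is to hash the user ID, so the same user always lands in the same group. The sketch below assumes that approach; the function name `assign_arm`, the experiment label and the ID format are made up for illustration and are not necessarily how this feed assigns users.

```python
import hashlib


def assign_arm(user_id: str, experiment: str = "scoring-exponents") -> str:
    """Deterministically map a user to 'control' or 'treatment' (about 50% each)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return "treatment" if bucket < 50 else "control"


# Hypothetical user IDs, only to show that the split is roughly even
arms = [assign_arm(f"user-{i}") for i in range(10_000)]
print(arms.count("treatment"), arms.count("control"))
```

Including the experiment name in the hash keeps assignments independent between experiments, so a user in the treatment group of one test is not automatically in the treatment group of the next.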

I would typically run an A/B test for a week or two to get enough data to draw conclusions, and even then the metrics would not reach statistical significance.

I started this new experiment just yesterday but I'm already seeing statistically significant results, which is very surprising.

In the experiment, 2.5% more users liked something in their For You feed and 14.3% fewer users pressed "show less".

Both changes are statistically significant at the 95% confidence level (p < 0.05). This is a huge change. It is also interesting that the magnitude of the change is comparable to the offline ranking evaluation (+4.4%, -16%).
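For a metric like "share of users who liked something", one standard check is a two-proportion z-test. The sketch below is a generic version of that test, not necessarily the one used for this experiment, and the user counts are hypothetical because the per-group sizes are not given here.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the difference between two proportions (pooled variance)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Hypothetical counts: 40.0% vs 41.0% of users liked something (+2.5% relative)
_, p_small = two_proportion_z_test(4_000, 10_000, 4_100, 10_000)
_, p_large = two_proportion_z_test(40_000, 100_000, 41_000, 100_000)
print(p_small < 0.05, p_large < 0.05)
```

The same relative lift is not significant with 10,000 users per group but is with 100,000, which is why small changes usually need a long-running experiment.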

Other metrics:

number of likes: +9.7% (86,293 -> 94,675)

number of "show less": -36% (2,092 -> 1,334)
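The percentages above follow directly from the raw counts. A small helper (the name `relative_change` is just for this example) reproduces the arithmetic:

```python
def relative_change(old, new):
    """Percentage change from old to new."""
    return (new - old) / old * 100


likes_change = relative_change(86_293, 94_675)
show_less_change = relative_change(2_092, 1_334)
print(round(likes_change, 1), round(show_less_change, 1))  # 9.7 -36.2
```

Comparing raw counts like this only makes sense when both groups are about the same size, as they are in a 50/50 split.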

I will keep the experiment running for longer to make sure the improvements remain significant.
