Serving Findka's recommendations via libpython-clj

8 July 2020

I've written previously about the simple baseline algorithm that Findka's been using to generate content recommendations. My brother Marshall (who's into data science) has since then started helping out part-time with improving on it. As of last night, we have an off-the-shelf KNN algorithm (written in Python) running in an A/B test. This was my first time integrating Clojure with Python, so I thought I'd give an outline of what we did.

Background: dataset + algorithm

Findka's data set consists of about 75K ratings: tuples of user ID, item ID, rating. The ratings are -1 (for thumbs-down) or 1 (for thumbs-up or bookmark). We also keep track of when items are skipped, but currently that data isn't used for building the model (a lot of users mash the skip button). Marshall used Surprise (a Python recommendation lib) to run a bunch of experiments on our existing rating data and found that KNN Baseline performed the best. His words:

Establishing the best metric was crucial for proper testing. Surprise is designed to handle an arbitrary range of real-valued ratings while our data is binary (thumbs up/down), so the default RMSE used by Surprise isn’t the most accurate for our purposes. Surprise also offers a fraction of concordant pairs metric that is more accurate for our data, but using any built-in metric on the entire test set doesn’t give the best optimization because what we really care about are the top N ratings for each individual user. This way a single user with 500 accurate predictions in the test set won’t offset the overall accuracy when there are lots of users with only a few ratings in the test set, and predictions will change after a few items from the top are rated anyway.

Using accuracy of the top 3 items for each user as a metric, I tested SVD, Co-Clustering, all the KNN algorithms, and a couple other algorithms from Surprise. KNN with baseline was slightly better than any others. One of the biggest advantages of KNN is that we can determine a confidence along with the prediction rating by looking at the number of actual neighbors that were used in each prediction. The algorithm uses 40 neighbors at most, but many items don’t have that many. If an item has a predicted rating of 1 with 40 neighbors, we can have a fair amount of confidence in that prediction. To determine the best predictions, we simply sort with estimated rating as the primary key and actual neighbors as a secondary key. Based on our test set, KNN with baseline will give good recommendations to users more than 75% of the time. Interestingly, the prediction accuracy is actually 83%, but the algorithm correctly predicts that 8% of the top recommendations will be bad because some users dislike almost everything.
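The sort-by-rating-then-neighbors idea can be sketched in plain Python (a hypothetical sketch with tuples; the real code works with Surprise's prediction objects):

```python
# Rank predictions by estimated rating, breaking ties by how many real
# neighbors contributed to each estimate (a rough confidence signal).

def top_n(predictions, n=3):
    """predictions: list of (item_id, est_rating, actual_neighbors) tuples."""
    ranked = sorted(predictions, key=lambda p: (p[1], p[2]), reverse=True)
    return [item_id for item_id, _, _ in ranked[:n]]

preds = [
    ("a", 1.0, 40),   # high rating, full 40 neighbors: most confident
    ("b", 1.0, 5),    # same rating, few neighbors: less confident
    ("c", 0.2, 40),
    ("d", -1.0, 40),
]
print(top_n(preds))  # -> ['a', 'b', 'c']
```

The two-level sort means that among items predicted equally likeable, the ones backed by more neighbors float to the top.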

Integration via libpython-clj

Since recommendations are made online (i.e. recommendations are computed while you use the app, rather than e.g. once a week like with Spotify's Discover Weekly), we used libpython-clj to integrate Surprise into the Clojure web app.

The first piece is a python file which exposes an informal interface of three functions:

  • train(ratings_csv_filename) -> surprise.KNNBaseline.
  • top(knn_baseline, user_id, item_ids) -> list of the top 20 recommended items, taken from item_ids.
  • set_rating(knn_baseline, user_id, item_id, rating) -> None.

(View source)

This file is kept at resources/python/, so we can require it from Clojure like so (and it gets deployed with the rest of the app):

(ns findka.algo
  (:require
    [clojure.java.io :as io]
    [libpython-clj.python :as py]
    [libpython-clj.require :refer [require-python]]))

(require-python 'sys)
(py/py. sys/path append (.getPath (io/resource "python")))
(require-python '[knn_baseline :reload :as knn])

Thanks to :reload, any changes made to the Python code will come into effect whenever we evaluate the findka.algo namespace.

Then we call the three functions from different parts of Biff. During startup, we pull all the ratings from Crux and store them in a CSV, then pass it to knn/train:

(ns findka.core
  (:require
    [biff.system :refer [start-biff]]
    [findka.algo :as algo]))

(defn start-findka [sys]
  (-> sys
    (merge {:findka/model (atom {})})
    (start-biff 'findka)
    algo/update-knn-model!))
(ns findka.algo
  (:require
    [clojure.java.io :as io]
    [crux.api :as crux]
    [findka.trident.crux :as tcrux]
    [trident.util :as u]))

(defn knn-model-inputs [user+item+ratings]
  (let [f (io/file (u/tmp-dir) "knn-ratings.csv")]
    (with-open [w (io/writer f)]
      (doseq [[user item rating] user+item+ratings]
        (.write w (str
                    (hash user) ","
                    (hash item) ","
                    (if (= rating :dislike) -1 1) "\n"))))
    #:knn{:model (knn/train (.getPath f))}))

(defn update-knn-model! [{:keys [findka.biff/node findka/model] :as sys}]
  (swap! model merge
    ; tcrux/lazy-q wraps crux.api/open-q, passing the lazy result seq
    ; to the function given as its last argument.
    (tcrux/lazy-q (crux/db node)
      {:find '[user item rating]
       :where '[[user-item :rating rating]
                [(!= rating :skip)]
                [user-item :user user]
                [user-item :item item]]}
      knn-model-inputs))
  sys)

I should also set up a cron job to call update-knn-model! periodically... though for now I'm just calling it manually once a day (via a tunneled nrepl connection) since I have other manual tasks I run daily anyway.

Whenever a client needs some more recommendations, it submits a Sente event. The handler calls findka.algo/recommend, which checks the user's A/B test assignment and then calls knn/top if the user is on the B arm (and returns a Biff transaction, along with some other data for logging purposes).

(ns findka.algo
  (:require
    [crux.api :as crux]
    [trident.util :as u]))


(def hash-str (comp str hash))

(defn recommend-knn [{:keys [user-ref model db n]
                      :or {n 5}}]
  (let [; A list of Crux document IDs for items to possibly recommend. Currently we
        ; query Crux for all the items and then exclude ones which the user has
        ; already rated.
        candidates ...
        hash-str->candidate (u/map-from hash-str candidates)
        ; user-ref is a Crux document ID for the user.
        user-id (hash-str user-ref)
        exploit (for [m (knn/top model user-id (map hash-str candidates))]
                  (merge (u/map-keys keyword m)
                    {:item (hash-str->candidate (get m "item-id"))
                     :algo :knn/exploit}))
        recs (->> candidates
               (map #(hash-map :algo :knn/explore :item %))
               ; Like interleave, but probabilistically choose items from the
               ; first collection X% of the time.
               (interleave-random 0.666 exploit)
               (take n))]
    {:items (map :item recs)
     ; A Biff transaction
     :tx ...}))

(defn recommend [{:keys [user-ref db] :as env}]
  (let [algo-assignment (:ab/algo (crux/entity db {:ab/user user-ref}))
        unassigned (nil? algo-assignment)
        algo-assignment (or algo-assignment (first (shuffle [:cooc :knn])))
        f (case algo-assignment
            :cooc recommend-cooc
            :knn recommend-knn)]
    (cond-> (f env)
      true (assoc :algo algo-assignment)
      unassigned (update :tx conj
                   [[:ab {:ab/user user-ref}]
                    {:db/merge true :ab/algo algo-assignment}]))))

We're still using an epsilon-greedy strategy for exploration: 1/3 of the recommendations are purely random. That's probably a good area for future experiments.
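interleave-random isn't shown above, but the idea translates to Python roughly like this (a hypothetical helper, not Findka's actual code):

```python
import random

def interleave_random(p, xs, ys, rng=random):
    # At each step, take the next element of xs with probability p, otherwise
    # the next element of ys; once either runs out, append whatever remains.
    xs, ys = list(xs), list(ys)
    out = []
    while xs and ys:
        out.append(xs.pop(0) if rng.random() < p else ys.pop(0))
    return out + xs + ys

# p = 1.0 always prefers the first collection while it lasts:
print(interleave_random(1.0, [1, 2], ["a", "b"]))  # -> [1, 2, 'a', 'b']
```

With p = 0.666 and the exploit recommendations as the first collection, roughly two out of three slots come from KNN and the rest from the random explore pool.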

Interlude: performance

On my first pass at using libpython-clj, recommend-knn was really slow. Even when I limited the candidates value to 500 random items (out of ~4.5K), the function was taking anywhere between 8 and 40 seconds to run on the server. (Curiously, it was never that slow on my laptop, even though most experiments I've run go about 25% faster on the server... so I only found out after I deployed.)

Long story short, the culprit was passing too many objects between Python and Clojure. libpython-clj wraps Python objects in Clojure interfaces and vice-versa (or something like that), but evidently that shouldn't be relied on willy-nilly. In particular:

  1. Originally I was passing the rating data directly from Crux to Surprise without writing to a CSV in between. I switched to the approach shown above, having Surprise read the data from disk instead.

  2. Instead of using the knn/top function, I was calling a Surprise method from Clojure to get the rating prediction for each item, after which I sorted the items and took the top 20. Now knn/top does that all in Python, so it only has to pass 20 objects back to Clojure instead of hundreds (or thousands).

I'm also still limiting candidates (to 1,000) because KNN is relatively slow even when I run the Python code directly (i.e. not via Clojure). That should be unnecessary once we switch to matrix factorization (when we tested SVD, predictions were about 30x faster and almost as accurate).

With those changes, recommend-knn for a typical user takes about 1,200 ms to run on the server—roughly 500 ms for knn/top and 700 ms for fetching candidates from Crux. The latter could be optimized further as well. I haven't investigated raw index access with Crux, but we could keep candidates in memory for active clients if nothing else.

(Also: before using knn/top, I tried returning tuples instead of dictionaries (maps), but it was still slow. Passing in a large list of strings (map hash-str candidates) from Clojure wasn't an issue though.)

Some simple tests indicated that computation via libpython-clj ran about 25% slower than plain Python, even when there was no object passing involved. I don't know if that's inherent or if I'm doing something wrong. Right now it doesn't matter, but if future us cares about it, I'm thinking we could run Python in a separate process and communicate over IPC (assuming IPC isn't slow, another thing which I have not investigated). Maybe by that time we'll have switched to rolling our own models in Clojure.

Python integration continued

Finally, on to knn/set_rating. Again, Findka does recommendations online. If someone visits the site for the first time, we want the algorithm to adapt to their preferences as soon as they rate an item. So we must update the model incrementally (at least the portion of the model representing the user's tastes). This was mildly inconvenient because Surprise isn't written with that use case in mind (and I suspect most other recommendation libraries aren't either). After peering into the source, I came up with this:

@synchronized  # lock a mutex during execution
def set_rating(knn_baseline, user_id, item_id, rating):
    if not knn_baseline.trainset.knows_item(item_id):
        return
    inner_item_id = knn_baseline.trainset.to_inner_iid(item_id)
    try:
        inner_user_id = knn_baseline.trainset.to_inner_uid(user_id)
        new_user = False
    except ValueError:
        inner_user_id = len(knn_baseline.trainset._raw2inner_id_users)
        new_user = True
        knn_baseline.bu = np.append(knn_baseline.bu, 0)
        knn_baseline.yr[inner_user_id] = []

    new_ratings = [(i, r) for i, r in knn_baseline.yr[inner_user_id] if i != inner_item_id]
    if rating is not None:
        new_ratings += [(inner_item_id, rating)]
    knn_baseline.yr[inner_user_id] = new_ratings

    if new_user:
        # Do this last to make sure top doesn't get messed up if it's called
        # while this function is executing (a likely occurrence).
        knn_baseline.trainset._raw2inner_id_users[user_id] = inner_user_id

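The @synchronized decorator isn't part of Surprise; something like this stdlib sketch would do (an assumed implementation, not necessarily the actual one):

```python
import functools
import threading

_lock = threading.Lock()

def synchronized(f):
    # Serialize calls to f behind a single module-level mutex, so concurrent
    # callers can't interleave model updates mid-write.
    @functools.wraps(f)
    def wrapper(*args, **kwargs):
        with _lock:
            return f(*args, **kwargs)
    return wrapper
```

Since set_rating mutates several model fields in sequence, the mutex keeps a half-applied update from ever being visible to another rating write.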
This is called from a small wrapper function:

(ns findka.algo
  ...)

(defn set-rating-knn! [{:keys [findka/model]
                        {:keys [user item rating]} :user-item}]
  (knn/set_rating
    (:knn/model @model)
    (hash-str user)
    (hash-str item)
    (normalize-rating rating)))

Which is called from a Biff trigger whenever someone rates an item:

(ns findka.triggers
  (:require
    [findka.algo :as algo]))

(defn ratings-updated [{:keys [doc doc-before] :as env}]
  (when (some-changed? doc doc-before :rating)
    (let [{:keys [user item]} (merge doc-before doc)
          rating (:rating doc)]
      (algo/set-rating-knn! (assoc env :user-item
                              {:user user
                               :item item
                               :rating rating})))))

(def triggers
  {:user-items {:write ratings-updated}})

I'm quite happy with this now that it's all set up. I think it'll help us iterate quickly on the algorithm. I'm particularly looking forward to seeing the results of this A/B test. My next mini project is focused on analytics to that end, since we haven't done any A/B testing before now.