I first got into data science and machine learning because I wanted to help businesses make smarter decisions. My friend from high school, Chelsea, runs the e-commerce business Boxfox. Boxfox enables users to send gift boxes to their friends and family. Inside these gift boxes are products that are either chosen by Boxfox beforehand (pre-curated boxes) or chosen by the customer (custom boxes). Chelsea told me that she would like to know which features are important for explaining whether a first-time customer will come back to purchase from Boxfox again.
I saw this as a great opportunity to use binary classification for explanatory purposes, where our target is either a returning customer or a one-time customer. Boxfox is built on Shopify, and Shopify has an API with excellent documentation. I used this API to get data on Boxfox's customers, products, and orders. The orders here are the most granular level of data.
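As a rough sketch of what that data pull looks like — the shop domain, token, and API version below are placeholders, not Boxfox's real values:

```python
import requests

# Placeholder credentials -- not Boxfox's real shop or token.
SHOP = "example-shop.myshopify.com"
TOKEN = "shpat_placeholder"

def orders_endpoint(shop, version="2023-10"):
    """Build the Shopify Admin REST API URL for the orders resource."""
    return f"https://{shop}/admin/api/{version}/orders.json"

def fetch_orders(limit=250):
    """Fetch one page of orders (250 is the API's per-page maximum)."""
    resp = requests.get(
        orders_endpoint(SHOP),
        headers={"X-Shopify-Access-Token": TOKEN},
        params={"limit": limit, "status": "any"},
    )
    resp.raise_for_status()
    return resp.json()["orders"]
```

Each order in the response nests its customer, line items, and client details, which is why orders are the most granular table to work from.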
Defining the Target
Boxfox has completed over 10,000 orders, which means we have a little more than 10,000 rows of data for this project. First we need to define our target by labeling each order as belonging to either a customer with more than one order (a returning customer) or a customer with only one order (a one-time customer). Now we have labeled data to train on. But for our problem we want to look at only the first orders of the customers that do come back, so I filtered my training set down to the first order of each customer. Now I want to look at the distribution of my target to see how balanced it is.
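In pandas, the labeling and first-order filtering can be sketched like this — the column names and toy data are mine, not the actual Shopify schema:

```python
import pandas as pd

# Toy orders data standing in for the Shopify export (hypothetical columns).
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "customer_id": ["a", "b", "a", "c", "b"],
    "created_at": pd.to_datetime(
        ["2019-01-05", "2019-01-07", "2019-02-01", "2019-02-10", "2019-03-01"]
    ),
})

# Label each order: does its customer have more than one order?
order_counts = orders.groupby("customer_id")["order_id"].transform("count")
orders["returning_customer"] = (order_counts > 1).astype(int)

# Keep only each customer's first order -- that's what we train on.
first_orders = (
    orders.sort_values("created_at")
          .drop_duplicates("customer_id", keep="first")
)
```

The `returning_customer` column is the binary target, and `first_orders` is the training set of one row per customer.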
This bar graph tells us that around 1 in 4 orders belong to a customer that will return. So if my model at any point does better than the baseline of 0.25, I will consider that an improvement.
There's only so much information the Shopify API provides, so we really have to squeeze out as much signal as possible with some feature engineering. The Shopify API provides data on customers, including the client's user agent, and I used regex to extract features like what device the customer used to place their order. The bar chart shows the counts of each device across all orders. This is a good way to visualize the results of our feature engineering and get an idea of what the distribution of those features is. When we plug these features into a model, we'll get an idea of how predictive they are.
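A simplified version of that user-agent parsing might look like the following — the patterns are illustrative, not the exact ones I used:

```python
import re

# Classify an order's client user-agent string into a device type.
def device_from_user_agent(ua):
    if ua is None:
        return "unknown"
    ua = ua.lower()
    if re.search(r"ipad|tablet", ua):
        return "tablet"
    if re.search(r"iphone|android.*mobile|mobile", ua):
        return "mobile"
    return "desktop"
```

Applying this over the orders gives the device feature behind the bar chart.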
The Shopify API did not contain any relationship between what items were in a box, and I thought that this would be important information to include in my model. The only way I could get this information was by taking the product description, which listed which items were in each box. I vectorized these product descriptions with Term Frequency Inverse Document Frequency, and then used Dimensionality Reduction with Latent Semantic Analysis to reduce the vectorized product description down to 10 "topics". I now have 10 new features that contain the frequency of the topics in the product description for each order.
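With scikit-learn, that pipeline is a `TfidfVectorizer` followed by `TruncatedSVD`. The descriptions below are toy stand-ins, and I use 2 components here instead of the 10 I used on the real data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Stand-in product descriptions (the real ones list each box's contents).
descriptions = [
    "soy candle matches chocolate treats caramel",
    "notebook pen journal candle",
    "chocolate caramel popcorn treats candle matches",
    "pen pencil notebook book journal",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(descriptions)

# LSA: reduce the sparse TF-IDF matrix to a small number of "topics".
lsa = TruncatedSVD(n_components=2, random_state=42)
topics = lsa.fit_transform(X)  # shape: (n_orders, n_topics)
```

Each row of `topics` becomes a set of new features for that order's model row.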
I modeled these features with a random forest. I chose a random forest because it handles categorical features well. It also provides feature importances based on information gain, which will help us answer our question of which features are important when predicting whether a customer will return.
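A minimal sketch of that modeling step, with synthetic data standing in for the real engineered features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins for the engineered features (device, month, topics, ...).
n = 500
X = rng.random((n, 4))
# Make the target depend mostly on the first column so the forest has signal.
y = (X[:, 0] + 0.1 * rng.random(n) > 0.75).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Impurity-based importances: one score per feature, summing to 1.
importances = model.feature_importances_
```

Ranking `importances` against the feature names is what produces the "most important features" discussed below.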
I used my model to predict the probability of each order in my test set, then created the histogram below to see the distribution of likelihoods. When evaluating my model, I set my threshold at about 0.48 and received a precision score of 0.4. This means that when my model predicts that an order belongs to a returning customer, it will be correct 40% of the time. That's an improvement over the baseline of 0.25!
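The thresholding and precision calculation look like this — the probabilities and labels below are made up for illustration, so the precision here won't match my real 0.4:

```python
import numpy as np
from sklearn.metrics import precision_score

# Toy predicted probabilities (what predict_proba returns) and true labels.
proba = np.array([0.10, 0.30, 0.55, 0.60, 0.45, 0.80, 0.20, 0.50])
y_true = np.array([0,    0,    1,    0,    1,    1,    0,    1])

# Apply a threshold of ~0.48: anything above counts as "returning".
threshold = 0.48
y_pred = (proba >= threshold).astype(int)

# Precision: of the predicted returners, what fraction actually returned?
precision = precision_score(y_true, y_pred)
```

Moving the threshold trades precision against recall, which is why I picked it by looking at the probability histogram.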
Now I want to look at two groups in this distribution. First, the orders in the turquoise bars, whose likelihood of belonging to a returning customer is above my threshold. Second, the orders in the salmon-colored bars, which the model thought were highly unlikely to belong to a returning customer. I want to compare these two groups across my model's most important features.
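Splitting the test set into those two groups and comparing a feature's average is straightforward in pandas — the numbers and the `month_is_february` column here are toy examples:

```python
import pandas as pd

# Toy frame of test-set orders with predicted probabilities (illustrative).
test = pd.DataFrame({
    "proba": [0.05, 0.10, 0.55, 0.60, 0.08, 0.70],
    "month_is_february": [0, 0, 1, 1, 0, 1],
})

high = test[test["proba"] >= 0.48]  # "turquoise": above the threshold
low = test[test["proba"] <= 0.15]   # "salmon": highly unlikely

# Compare the groups' average values on a feature.
high_feb = high["month_is_february"].mean()
low_feb = low["month_is_february"].mean()
```

Repeating this for each important feature produces the group-comparison table.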
The table above shows how the two groups differ on average across my model's most important features. The two most popular months of purchase are February and December (popular times of the year to buy gifts). What's interesting is that orders that were more likely to belong to a returning customer had a higher share of February orders and a lower share of December orders than the less likely group.
The bottom two features in the table come from the results of the Latent Semantic Analysis, and they ended up being two of my model's most important features. Orders that were less likely to belong to a returning customer appeared more frequently in these two "topics". Let's take a look at which items appear in each of these topics:
Now I can go back to Chelsea and Boxfox and tell them that there seems to be some signal when people purchase items related to candles and treats, or books and pens. The question we have to ask now is: is there something about these products that customers aren't as happy with compared to other products? Or do these products simply not leave customers feeling like they need to purchase another Boxfox? But that's for another blog post...
Thanks for reading! You can view my source code for this project here.