1 Nov 2013

Neo4j: A first attempt at retail product substitution

One of the interesting problems in the world of online shopping from the perspective of the retailer is working out whether there is a suitable substitute product if an ordered item isn’t currently in stock.

Since this problem brings together three types of data - order history, stock levels and products - it seems like it should be a nice fit for Neo4j so I 'graphed up' a quick example.

I wrote the following cypher to create some products, a person, a few orders and then the availability of those products in an imaginary store.

CREATE (bakedBeans :Category {name: "Baked Beans"} )
CREATE (fruit :Category {name: "Fruit"} )

CREATE (hbb :Product {name: "Heinz Baked Beans", type: "brand"} )
CREATE (bbb :Product {name: "Branstone Baked Beans", type: "brand"} )
CREATE (sbb :Product {name: "Sainsbury's Baked Beans", type: "own"} )
CREATE (apple :Product {name: "Bag of Apples"} )

CREATE UNIQUE (hbb)-[:HAS_CATEGORY]->(bakedBeans)
CREATE UNIQUE (bbb)-[:HAS_CATEGORY]->(bakedBeans)
CREATE UNIQUE (sbb)-[:HAS_CATEGORY]->(bakedBeans)

CREATE (southwark :Store {name: "Southwark"})

CREATE UNIQUE (southwark)-[:HAS_IN_STOCK {availability: 0}]->(hbb)
CREATE UNIQUE (southwark)-[:HAS_IN_STOCK {availability: 2}]->(bbb)
CREATE UNIQUE (southwark)-[:HAS_IN_STOCK {availability: 10}]->(sbb)
CREATE UNIQUE (southwark)-[:HAS_IN_STOCK {availability: 10}]->(apple)

CREATE (mark :Person {name: "Mark"})

CREATE (order1 :Order {id: 1, date: 1380884632})

CREATE UNIQUE (order1)-[:CONTAINS {count: 1}]->(hbb)
CREATE UNIQUE (order1)-[:CONTAINS {count: 5}]->(apple)
CREATE UNIQUE (mark)-[:PLACED_ORDER]->(order1)

CREATE (order2 :Order {id: 2, date: 1380885051})

CREATE UNIQUE (order2)-[:CONTAINS {count: 1}]->bbb
CREATE UNIQUE (mark)-[:PLACED_ORDER]->(order2)

We may then have a new order which we’re trying to fulfil and we want to check whether the products are in stock.

First we’ll create the order:

// Create the order
CREATE (order3:Order {id: 3, date: 1380895051})

WITH order3

// Assign the order to Mark
MATCH (p:Person)
WHERE p.name = "Mark"
CREATE UNIQUE (p)-[:PLACED_ORDER]->(order3)

WITH order3

// Populate the order with some products
MATCH (p:Product)
WHERE p.name = "Heinz Baked Beans"
CREATE UNIQUE (order3)-[:CONTAINS {count: 2}]->(p)

WITH order3

MATCH (p:Product)
WHERE p.name = "Bag of Apples"
CREATE UNIQUE (order3)-[:CONTAINS {count: 2}]->(p)

Now let’s check on the availability of each item in the order in a particular store:

// Find the products in the order
MATCH (o:Order)-[c:CONTAINS]->(product)
WHERE o.id = 3
WITH product, c.count AS count

// Check which items are out of stock in our store
MATCH (s:Store)-[inStock:HAS_IN_STOCK]->(product)
WHERE s.name = "Southwark"
RETURN product, inStock

==> +--------------------------------------------------------------------------------------------+
==> | product                                            | inStock                               |
==> +--------------------------------------------------------------------------------------------+
==> | Node[11444]{name:"Heinz Baked Beans",type:"brand"} | :HAS_IN_STOCK[60053]{availability:0}  |
==> | Node[11447]{name:"Bag of Apples"}                  | :HAS_IN_STOCK[60056]{availability:10} |
==> +--------------------------------------------------------------------------------------------+

Now if we change that query to only return items which have less items in stock than we attempted to order we’ll see that Heinz Baked Beans aren’t available:

// Find the products in the order
MATCH (o:Order)-[c:CONTAINS]->(product)
WHERE o.id = 3
WITH product, c.count AS count

// Check which items are out of stock in our store
MATCH (s:Store)-[inStock:HAS_IN_STOCK]->(product)
WHERE s.name = "Southwark" AND count > inStock.availability
RETURN product, inStock

==> +-------------------------------------------------------------------------------------------+
==> | product                                            | inStock                              |
==> +-------------------------------------------------------------------------------------------+
==> | Node[15281]{name:"Heinz Baked Beans",type:"brand"} | :HAS_IN_STOCK[86079]{availability:0} |
==> +-------------------------------------------------------------------------------------------+

MATCH (p:Person)-[:PLACED_ORDER]->order-[c:CONTAINS]->product-[:HAS_CATEGORY]->category
WHERE p.name = "Mark" AND category.name = "Baked Beans" AND order.id <> 3
RETURN product.name, product.type, order.id

==> +---------------------------------------------------+
==> | product.name            | product.type | order.id |
==> +---------------------------------------------------+
==> | "Heinz Baked Beans"     | "brand"      | 1        |
==> | "Branstone Baked Beans" | "brand"      | 2        |
==> +---------------------------------------------------+

Having identified Mark as a connoisseur of branded baked beans we might then run a query to check if there are any other branded baked beans available in that store:

MATCH (s:Store)-[inStock:HAS_IN_STOCK]->p-[:HAS_CATEGORY]->c
WHERE s.name = "Southwark"
AND c.name = "Baked Beans"
AND inStock.availability > 0
AND p.type = "brand"
RETURN p.name, inStock.availability

==> +------------------------------------------------+
==> | p.name                  | inStock.availability |
==> +------------------------------------------------+
==> | "Branstone Baked Beans" | 2                    |
==> +------------------------------------------------+

This is obviously an extremely naive approach so I went to twitter in the hope that I could find a more sophisticated approach:

Are there any general algorithms for product similarity? e.g. based on category/description etc. Or is it very much domain specific? __

Nicole White is currently playing around with association rule mining which sounds interesting.

— Mark Needham (@markhneedham) https://twitter.com/markhneedham/statuses/394062759089803264[October 26
2013]

@markhneedham I'm sure it can be! I'm actually working on association rule mining in a #neo4j db with purchases transaction data. __

It is described on its Wikipedia page like so:

Based on the concept of strong rules, Rakesh Agrawal et al.[2] introduced association rules for discovering regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets.

Although it’s not exactly the same as what I want to do I need to look into it more to see if some of the ideas can be applied.

I also learnt that the terminology for what I’m looking for is a 'similar items' algorithm and I think what I’m looking to spike would be a hybrid recommender system which combines content similarity and user’s previous purchase history.

I’ve been looking around to see if there are any open or anonymised retail data sets to play around with but all I’ve come across is the 'Frequent Itemset Mining Dataset Repository'. Unfortunately when I tried to open the files they seem to just contain random numbers so I must be doing something wrong.

If anyone knows of a retail data set I can play around with please point me in the right direction.

About the author

I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.