Nick Lansley's Technology for Tesco.com Blog: R&D Christmas Experiment: Tesco Instant

OK so it's live - and achieved using no more than some C#/Java and SQL database skills! It helps me answer the question: Can I get to a particular product faster than via our standard web offering (of departments, aisles and shelves, or text searching)? Does it for you?

Go play at http://www.techfortesco.com/tescoinstant !!

The rest of this article is about the technical design and implementation, of interest to anyone wishing to look at similar solutions.

Step 1 - Design!

I created a database with the following tables and developed code that used the API to extract the products for a particular store (Greenford Dotcom Store which serves all homes in north London) and perform 'lexicon analysis' to make the data available efficiently.

My theory was that, if I broke all products down into their constituent words (called the 'Lexicon'), and indexed those words efficiently using all possible key combinations up to 5 characters, then I could make product searching really fast!

Pencil, paper and SQL Enterprise Manager brought me to this design (note that there are extra clustered indices applied to the final design that are not shown in this diagram):

Step 2 - Get The Data

Using the Tesco grocery API I parsed through all the departments, aisles and shelves and copied some essential product data - just ProductID, BaseProductID (known as TPNB for future use when I hook up the application to the Tesco API for 'add to basket') and the product's description name into a table called ProductsTable - totalling 21,638 rows.

Step 3 - Extract and store individual words to create the Lexicon.
I took each product description and extracted individual words thanks to C#'s String.Split(' ') method using space ' ' as a delimiter. Each word was inserted into the LexiconTable (unless it was already there, in which case I just incremented the ProductCount value at the rate of one per product - a value which would be useful later).

Since I was creating a many-to-many relationship between ProductTable and LexiconTable (that is, products have more than word and each word can be used by more than one product), I also created a lookup table - LexiconProductTable. That way I could get to list the products quickly using any particular word much because it would be indexed (the alternative would have been to search for all products containing a particular word using the 'LIKE' keyword in SQL, which is highly inefficient and slow).

LexiconTable ended up with 11,237 unique words that described all the Tesco products, and LexiconProductTable had inserted 124,413 lookup rows.

If I sort by descending product count, I can uncover the top words used in most Tesco products! Intrigued? Here is the top 20:

10	Tesco	6288
4	And	2337
14	Pack	1320
136	Chicken	873
238	75Cl	732
311	500G	714
275	Sauce	665
150	Finest	661
49	400G	660
29	In	638
199	200G	627
60	Chocolate	623
47	250Ml	568
21	Value	532
24	1	517
465	2	517
266	250G	508
37	4	498
120	Organic	486
25	Litre	475

The left column is the LexiconID unique identifier, and the right column is the count of the number of products in which the word is used.

Step 4 - Build Indexed Searching based in anticipated input
Now to get this application fast-acting when anyone typed in a search character. I needed a table that had a unique match for up to the first 5 letters typed in. That's ANY 1, 2, 3, 4, or 5 characters from 'aaaaa' to 'zzzzz' as long as there at is at least one lexicon word that satisfies the particular combination. Index this match and the system should react like lightning!

To do this I wrote some code to look at each word in the LexiconTable that had an 'a' in it and, if so, inserted an entry into a new table, LexiconSearchCharTable. I then searched 'b', 'c'' and so on. Indeed I had a 5-nested search loop that efficiently checked for up to 5-letter combinations from 'aaaaa' to 'zzzzz' like this:

Does the current Lexicon word I am studying have an 'a'?
If yes, does it have an 'aa'?
If yes, does it have an 'aaa'?

If yes, does it have an 'aaaa'?

If yes, does it have an 'aaaaa'?

In other words, I iterated through the alphabet from a-z, and if I got a 'hit' I then iterated again through the alphabet for a second character 'aa-az'. Any hit I got I iterated through the alphabet using a third character 'aaa'-'azz', then a fourth 'aaaa'-'azzz' and finally a fifth character 'aaaaa-azzzz'. This worked really well without going through every combination of 'aaaaa'-'zzzzz' since of course words that have no characters 'ab' could not possibly have characters 'aba' through 'abz' either!

LexiconSearchCharTable ended up with 211,561 rows. Here is what the first 20 rows look like:

1	E	8973
2	H	8973
3	HE	8973
4	T	8973
5	TH	8973
6	THE	8973
7	B	7136
8	BE	7136
9	BER	7136
10	BERR	7136
11	BERRY	7136
12	C	7136
13	E	7136
14	ER	7136
15	ERR	7136
16	ERRY	7136
17	R	7136
18	RR	7136
19	RRY	7136
20	RY	7136

The left column is the table's unique identifier for each row, the middle row contains the successfully tested characters, and the third column is the LexiconTable unique identifier for a word that matched. You can probably deduce that Lexicon ID 8973 is "THE" and 7136 has "BERRY" and a mysterious "C" - the word actually stored is "C/BERRY" (cranberry).

A user typing in 'B' will have row 7 returned. There will be other rows with just 'B' in them but they will be returned instantly since that column is indexed. Only rows with just 'B' in this column will be returned. Type in 'BE' and now row 8 is returned instead, really fast along with the other 'BE'-only rows,

Step 5 - Extract appropriate data on demand QUICKLY via a stored procedure
So now I have my data in a good place. I just needed a stored procedure to act swiftly on the tables to bring back the list of products quickly. Here we go:

CREATE PROCEDURE [dbo].[TescoGroceryInstantSearch]
(
@SearchCharacters varchar(5)
)
as
SELECT TOP 20
pt.ProductID,
pt.Description
FROM
ProductsTable pt
INNER JOIN LexiconProductTable lpt ON pt.ProductID = lpt.ProductID
INNER JOIN LexiconTable lt ON lt.LexiconID = lpt.LexiconID
INNER JOIN LexiconSearchCharTable lsct on lsct.LexiconID = lt.LexiconID
WHERE
SearchCharacters = @SearchCharacters
ORDER BY lt.ProductCount DESC

Nice - an easy to read and a quick way to grab the data! As you type each characters into the search box on the web page, the list of products results quickly from this stored procedure. This is confirmed by the SQL Execution Plan revealed by my SQL Server - all the heavy percentage CPU costs are borne by clustered index seeking.

I may as well make the maximum use of the data so I have a second stored procedure that will run in parallel with the first one in my web application, revealing word suggestions to help the user get to the product even quicker. Here it is:

CREATE PROCEDURE [dbo].[TescoGroceryInstantSearchSuggestions]
(
@SearchCharacters varchar(5)
)
as
SELECT TOP 15
lt.LexiconID,
lt.LexiconWord
FROM
LexiconTable lt
INNER JOIN LexiconSearchCharTable lsct on lsct.LexiconID = lt.LexiconID
WHERE
lsct.SearchCharacters = @SearchCharacters
ORDER BY lt.ProductCount DESC

Again its clustered index seeking all the way. Nice! No need for third party software - just some intelligent thinking around data storage, formatting and extraction. I always like to try stuff out myself before I recommend buying anything, as I prefer to use the tools I already have.

Step 6 - Build a simple web application to make it all happen!

Now back to the whole reason for doing this. Can you get to a chosen product quicker?
Go play at http://www.techfortesco.com/tescoinstant and let me know.

2 comments:

Anonymous29 December 2010 at 18:08
Nice attempt Nick!

I've been working on search algorithms over the last days (my problem is to match personal full names against watch lists) and have some humble conclusions so far..

1. The problem is easily underestimated
Like studying languages, you get enthusiastic with the 1st results, but what really makes the difference is the last 20% (fine tuning) which takes 80% of the effort.

2. We are reinventing the wheel
You just google for search algorithms at google to be flooded. There are so many solutions that the difficulty is to choose the appropriate one(s) to nail the last 20%

Specifically trying your demo:

1. looking for ILLY (coffe)
- too many results for COFFE
- ILL or ILLI don't do it
- the only way to find it was typing ILLY

Are you considering misspelling?
Are you giving the same weight to each single word?
"Illy Dark Roast Espresso Machine Coffee 250G"

2. looking for BERTOLLI (tomato sauce)
- BERTOLLI come in the lexic group but cant find it in the results (even clicking on the link)
- the only way to find it was typing PECORINO (couldn't find the BASILICO one I used to buy :-)

Think one way to get there is creating an open contest.
It doesn't need to be those 1M from the http://www.netflixprize.com/
Get 5 or 10K for 2 or 4 months and let the comp sci kids know about it (specially now with the fee's raise :) I'm sure they'll come with a far better solution for a fraction of the price of a contractor or off the shelf solution. Everybody will be happy.

Just write the rules/dataset/sandbox and spend the rest of your time spreading the word and thinking about TNBT

If you like competing/hands on, ask somebody else to manage the contest and take part on it :-)))

Happy new year mate!
Andre
Anonymous2 January 2011 at 12:46
Be better if you used AJAX for the request submissions!

As this blog grows in readership - and because it carries the Tesco brand - I have had to become more careful about the sort of comments that are acceptable. The good news is that I'm a champion of free speech so please be as praising or as critical as you wish! The only comments I DON'T allow through are:

1. Comments which criticise an individual other than myself, or are critical of an organisation other than Tesco. This is simply because they cannot defend themselves so is unfair and possibly libellous. Comments about some aspect of Tesco being better/worse than another equivalent organisation are allowed as long as you start by saying "in my personal opinion.." or "I think that...". ... followed by a "...because.." and some reasoned argument.

2. Comments which are totally unrelated to the context of the original article. If I have written about a mobile app and you start complaining about the price of potatoes then your comment isn't going stay for long!

3. Advertising / web links / spam.

4. Insulting / obscene messages.

Ok, rules done - now it's your go:

Wednesday, 29 December 2010

R&D Christmas Experiment: Tesco Instant - Write-up and Play

2 comments: