Wednesday, 29 December 2010

R&D Christmas Experiment: Tesco Instant - Write-up and Play

OK so it's live - and achieved using no more than some C#/Java and SQL database skills! It helps me answer the question:  Can I get to a particular product faster than via our standard web offering (of departments, aisles and shelves, or text searching)? Does it for you?

Go play at http://www.techfortesco.com/tescoinstant !!

The rest of this article is about the technical design and implementation, of interest to anyone wishing to look at similar solutions.

Step 1 - Design!

I created a database with the following tables and developed code that used the API to extract the products for a particular store (Greenford Dotcom Store which serves all homes in north London) and perform 'lexicon analysis' to make the data available efficiently. 

My theory was that, if I broke all products down into their constituent words (called the 'Lexicon'), and indexed those words efficiently using all possible key combinations up to 5 characters, then I could make product searching really fast!

Pencil, paper and SQL Enterprise Manager brought me to this design (note that there are extra clustered indices applied to the final design that are not shown in this diagram):






Step 2 - Get The Data
Using the Tesco grocery API I parsed through all the departments, aisles and shelves and copied some essential product data - just ProductID, BaseProductID (known as TPNB for future use when I hook up the application to the Tesco API for 'add to basket') and the product's description name into a table called ProductsTable - totalling 21,638 rows.



Step 3 - Extract and store individual words to create the Lexicon.
I took each product description and extracted individual words thanks to C#'s String.Split(' ') method using space ' ' as a delimiter. Each word was inserted into the LexiconTable (unless it was already there, in which case I just incremented the ProductCount value at the rate of one per product - a value which would be useful later).

Since I was creating a many-to-many relationship between ProductTable and LexiconTable (that is, products have more than word and each word can be used by more than one product), I also created a lookup table - LexiconProductTable. That way I could get to list the products quickly using any particular word much because it would be indexed (the alternative would have been to search for all products containing a particular word using the 'LIKE' keyword in SQL, which is highly inefficient and slow).

LexiconTable ended up with 11,237 unique words that described all the Tesco products, and LexiconProductTable had inserted 124,413 lookup rows.

If I sort by descending product count, I can uncover the top words used in most Tesco products! Intrigued? Here is the top 20:


10
Tesco
6288
4
And
2337
14
Pack
1320
136
Chicken
873
238
75Cl
732
311
500G
714
275
Sauce
665
150
Finest
661
49
400G
660
29
In
638
199
200G
627
60
Chocolate
623
47
250Ml
568
21
Value
532
24
1
517
465
2
517
266
250G
508
37
4
498
120
Organic
486
25
Litre
475


The left column is the LexiconID unique identifier, and the right column is the count of the number of products in which the word is used.


Step 4 - Build Indexed Searching based in anticipated input
Now to get this application fast-acting when anyone typed in a search character. I needed a table that had a unique match for up to the first 5 letters typed in. That's ANY 1, 2, 3, 4, or 5 characters from 'aaaaa' to 'zzzzz' as long as there at is at least one lexicon word that satisfies the particular combination. Index this match and the system should react like lightning!

To do this I wrote some code to look at each word in the LexiconTable that had an 'a' in it and, if so, inserted an entry into a new table, LexiconSearchCharTable. I then searched 'b', 'c'' and so on. Indeed I had a 5-nested search loop that efficiently checked for up to 5-letter combinations from 'aaaaa' to 'zzzzz' like this:

Does the current Lexicon word I am studying have an 'a'?
If yes, does it have an 'aa'?
    If yes, does it have an 'aaa'?
        If yes, does it have an 'aaaa'?
            If yes, does it have an 'aaaaa'?

In other words, I iterated through the alphabet from a-z, and if I got a 'hit' I then iterated again through the alphabet for a second character 'aa-az'. Any hit I got I iterated through the alphabet using a third character 'aaa'-'azz', then a fourth 'aaaa'-'azzz' and finally a fifth character 'aaaaa-azzzz'. This worked really well without going through every combination of 'aaaaa'-'zzzzz' since of course words that have no characters 'ab' could not possibly have characters 'aba' through 'abz' either!

LexiconSearchCharTable ended up with 211,561 rows. Here is what the first 20 rows look like:

1
E
8973
2
H
8973
3
HE
8973
4
T
8973
5
TH
8973
6
THE
8973
7
B
7136
8
BE
7136
9
BER
7136
10
BERR
7136
11
BERRY
7136
12
C
7136
13
E
7136
14
ER
7136
15
ERR
7136
16
ERRY
7136
17
R
7136
18
RR
7136
19
RRY
7136
20
RY
7136

The left column is the table's unique identifier for each row, the middle row contains the successfully tested characters, and the third column is the LexiconTable unique identifier for a word that matched. You can probably deduce that Lexicon ID 8973 is "THE" and 7136 has "BERRY" and a mysterious "C" - the word actually stored is "C/BERRY" (cranberry).

A user typing in 'B' will have row 7 returned. There will be other rows with just 'B' in them but they will be returned instantly since that column is indexed. Only rows with just 'B' in this column will be returned. Type in 'BE' and now row 8 is returned instead, really fast along with the other 'BE'-only rows,


Step 5 - Extract appropriate data on demand QUICKLY via a stored procedure
So now I have my data in a good place. I just needed a stored procedure to act swiftly on the tables to bring back the list of products quickly. Here we go:



CREATE PROCEDURE [dbo].[TescoGroceryInstantSearch]
(
@SearchCharacters varchar(5)
)
as
SELECT TOP 20
pt.ProductID,
pt.Description
FROM
ProductsTable pt
INNER JOIN  LexiconProductTable lpt ON pt.ProductID = lpt.ProductID
INNER JOIN LexiconTable lt ON lt.LexiconID = lpt.LexiconID
INNER JOIN LexiconSearchCharTable lsct on lsct.LexiconID = lt.LexiconID
WHERE
SearchCharacters = @SearchCharacters
ORDER BY lt.ProductCount DESC


Nice - an easy to read and a quick way to grab the data! As you type each characters into the search box on the web page, the list of products results quickly from this stored procedure. This is confirmed by the SQL Execution Plan revealed by my SQL Server - all the heavy percentage CPU costs are borne by clustered index seeking.

I may as well make the maximum use of the data so I have a second stored procedure that will run in parallel with the first one in my web application, revealing word suggestions to help the user get to the product even quicker. Here it is:


CREATE PROCEDURE [dbo].[TescoGroceryInstantSearchSuggestions]
(
@SearchCharacters varchar(5)
)
as
SELECT TOP 15
lt.LexiconID,
lt.LexiconWord
FROM
LexiconTable lt 
INNER JOIN LexiconSearchCharTable lsct on lsct.LexiconID = lt.LexiconID
WHERE
lsct.SearchCharacters = @SearchCharacters
ORDER BY lt.ProductCount DESC

Again its clustered index seeking all the way. Nice! No need for third party software - just some intelligent thinking around data storage, formatting and extraction. I always like to try stuff out myself before I recommend buying anything, as I prefer to use the tools I already have.

Step 6 - Build a simple web application to make it all happen!

Now back to the whole reason for doing this. Can you get to a chosen product quicker?
Go play at http://www.techfortesco.com/tescoinstant and let me know.

2 comments:

  1. Nice attempt Nick!

    I've been working on search algorithms over the last days (my problem is to match personal full names against watch lists) and have some humble conclusions so far..

    1. The problem is easily underestimated
    Like studying languages, you get enthusiastic with the 1st results, but what really makes the difference is the last 20% (fine tuning) which takes 80% of the effort.

    2. We are reinventing the wheel
    You just google for search algorithms at google to be flooded. There are so many solutions that the difficulty is to choose the appropriate one(s) to nail the last 20%

    Specifically trying your demo:

    1. looking for ILLY (coffe)
    - too many results for COFFE
    - ILL or ILLI don't do it
    - the only way to find it was typing ILLY

    Are you considering misspelling?
    Are you giving the same weight to each single word?
    "Illy Dark Roast Espresso Machine Coffee 250G"

    2. looking for BERTOLLI (tomato sauce)
    - BERTOLLI come in the lexic group but cant find it in the results (even clicking on the link)
    - the only way to find it was typing PECORINO (couldn't find the BASILICO one I used to buy :-)


    Think one way to get there is creating an open contest.
    It doesn't need to be those 1M from the http://www.netflixprize.com/
    Get 5 or 10K for 2 or 4 months and let the comp sci kids know about it (specially now with the fee's raise :) I'm sure they'll come with a far better solution for a fraction of the price of a contractor or off the shelf solution. Everybody will be happy.

    Just write the rules/dataset/sandbox and spend the rest of your time spreading the word and thinking about TNBT

    If you like competing/hands on, ask somebody else to manage the contest and take part on it :-)))

    Happy new year mate!
    Andre

    ReplyDelete
  2. Be better if you used AJAX for the request submissions!

    ReplyDelete

As this blog grows in readership - and because it carries the Tesco brand - I have had to become more careful about the sort of comments that are acceptable. The good news is that I'm a champion of free speech so please be as praising or as critical as you wish! The only comments I DON'T allow through are:

1. Comments which criticise an individual other than myself, or are critical of an organisation other than Tesco. This is simply because they cannot defend themselves so is unfair and possibly libellous. Comments about some aspect of Tesco being better/worse than another equivalent organisation are allowed as long as you start by saying "in my personal opinion.." or "I think that...". ... followed by a "...because.." and some reasoned argument.

2. Comments which are totally unrelated to the context of the original article. If I have written about a mobile app and you start complaining about the price of potatoes then your comment isn't going stay for long!

3. Advertising / web links / spam.

4. Insulting / obscene messages.


Ok, rules done - now it's your go: