Go play at http://www.techfortesco.com/tescoinstant !!
The rest of this article is about the technical design and implementation, of interest to anyone wishing to look at similar solutions.
Step 1 - Design!
Step 3 - Extract and store individual words to create the Lexicon.
I took each product description and extracted individual words thanks to C#'s String.Split(' ') method using space ' ' as a delimiter. Each word was inserted into the LexiconTable (unless it was already there, in which case I just incremented the ProductCount value at the rate of one per product - a value which would be useful later).
Since I was creating a many-to-many relationship between ProductTable and LexiconTable (that is, products have more than word and each word can be used by more than one product), I also created a lookup table - LexiconProductTable. That way I could get to list the products quickly using any particular word much because it would be indexed (the alternative would have been to search for all products containing a particular word using the 'LIKE' keyword in SQL, which is highly inefficient and slow).
LexiconTable ended up with 11,237 unique words that described all the Tesco products, and LexiconProductTable had inserted 124,413 lookup rows.
If I sort by descending product count, I can uncover the top words used in most Tesco products! Intrigued? Here is the top 20:
10 | Tesco | 6288 |
4 | And | 2337 |
14 | Pack | 1320 |
136 | Chicken | 873 |
238 | 75Cl | 732 |
311 | 500G | 714 |
275 | Sauce | 665 |
150 | Finest | 661 |
49 | 400G | 660 |
29 | In | 638 |
199 | 200G | 627 |
60 | Chocolate | 623 |
47 | 250Ml | 568 |
21 | Value | 532 |
24 | 1 | 517 |
465 | 2 | 517 |
266 | 250G | 508 |
37 | 4 | 498 |
120 | Organic | 486 |
25 | Litre | 475 |
The left column is the LexiconID unique identifier, and the right column is the count of the number of products in which the word is used.
Step 4 - Build Indexed Searching based in anticipated input
Now to get this application fast-acting when anyone typed in a search character. I needed a table that had a unique match for up to the first 5 letters typed in. That's ANY 1, 2, 3, 4, or 5 characters from 'aaaaa' to 'zzzzz' as long as there at is at least one lexicon word that satisfies the particular combination. Index this match and the system should react like lightning!
To do this I wrote some code to look at each word in the LexiconTable that had an 'a' in it and, if so, inserted an entry into a new table, LexiconSearchCharTable. I then searched 'b', 'c'' and so on. Indeed I had a 5-nested search loop that efficiently checked for up to 5-letter combinations from 'aaaaa' to 'zzzzz' like this:
Does the current Lexicon word I am studying have an 'a'?
If yes, does it have an 'aa'?
If yes, does it have an 'aaa'?
LexiconSearchCharTable ended up with 211,561 rows. Here is what the first 20 rows look like:
1 | E | 8973 |
2 | H | 8973 |
3 | HE | 8973 |
4 | T | 8973 |
5 | TH | 8973 |
6 | THE | 8973 |
7 | B | 7136 |
8 | BE | 7136 |
9 | BER | 7136 |
10 | BERR | 7136 |
11 | BERRY | 7136 |
12 | C | 7136 |
13 | E | 7136 |
14 | ER | 7136 |
15 | ERR | 7136 |
16 | ERRY | 7136 |
17 | R | 7136 |
18 | RR | 7136 |
19 | RRY | 7136 |
20 | RY | 7136 |
The left column is the table's unique identifier for each row, the middle row contains the successfully tested characters, and the third column is the LexiconTable unique identifier for a word that matched. You can probably deduce that Lexicon ID 8973 is "THE" and 7136 has "BERRY" and a mysterious "C" - the word actually stored is "C/BERRY" (cranberry).
A user typing in 'B' will have row 7 returned. There will be other rows with just 'B' in them but they will be returned instantly since that column is indexed. Only rows with just 'B' in this column will be returned. Type in 'BE' and now row 8 is returned instead, really fast along with the other 'BE'-only rows,
Step 5 - Extract appropriate data on demand QUICKLY via a stored procedure
So now I have my data in a good place. I just needed a stored procedure to act swiftly on the tables to bring back the list of products quickly. Here we go:
CREATE PROCEDURE [dbo].[TescoGroceryInstantSearch]
(
@SearchCharacters varchar(5)
)
as
SELECT TOP 20
pt.ProductID,
pt.Description
FROM
ProductsTable pt
INNER JOIN LexiconProductTable lpt ON pt.ProductID = lpt.ProductID
INNER JOIN LexiconTable lt ON lt.LexiconID = lpt.LexiconID
INNER JOIN LexiconSearchCharTable lsct on lsct.LexiconID = lt.LexiconID
WHERE
SearchCharacters = @SearchCharacters
ORDER BY lt.ProductCount DESC
Nice - an easy to read and a quick way to grab the data! As you type each characters into the search box on the web page, the list of products results quickly from this stored procedure. This is confirmed by the SQL Execution Plan revealed by my SQL Server - all the heavy percentage CPU costs are borne by clustered index seeking.
I may as well make the maximum use of the data so I have a second stored procedure that will run in parallel with the first one in my web application, revealing word suggestions to help the user get to the product even quicker. Here it is:
CREATE PROCEDURE [dbo].[TescoGroceryInstantSearchSuggestions]
(
@SearchCharacters varchar(5)
)
as
SELECT TOP 15
lt.LexiconID,
lt.LexiconWord
FROM
LexiconTable lt
INNER JOIN LexiconSearchCharTable lsct on lsct.LexiconID = lt.LexiconID
WHERE
lsct.SearchCharacters = @SearchCharacters
ORDER BY lt.ProductCount DESC
Again its clustered index seeking all the way. Nice! No need for third party software - just some intelligent thinking around data storage, formatting and extraction. I always like to try stuff out myself before I recommend buying anything, as I prefer to use the tools I already have.
Step 6 - Build a simple web application to make it all happen!
Now back to the whole reason for doing this. Can you get to a chosen product quicker?
Go play at http://www.techfortesco.com/tescoinstant and let me know.
Nice attempt Nick!
ReplyDeleteI've been working on search algorithms over the last days (my problem is to match personal full names against watch lists) and have some humble conclusions so far..
1. The problem is easily underestimated
Like studying languages, you get enthusiastic with the 1st results, but what really makes the difference is the last 20% (fine tuning) which takes 80% of the effort.
2. We are reinventing the wheel
You just google for search algorithms at google to be flooded. There are so many solutions that the difficulty is to choose the appropriate one(s) to nail the last 20%
Specifically trying your demo:
1. looking for ILLY (coffe)
- too many results for COFFE
- ILL or ILLI don't do it
- the only way to find it was typing ILLY
Are you considering misspelling?
Are you giving the same weight to each single word?
"Illy Dark Roast Espresso Machine Coffee 250G"
2. looking for BERTOLLI (tomato sauce)
- BERTOLLI come in the lexic group but cant find it in the results (even clicking on the link)
- the only way to find it was typing PECORINO (couldn't find the BASILICO one I used to buy :-)
Think one way to get there is creating an open contest.
It doesn't need to be those 1M from the http://www.netflixprize.com/
Get 5 or 10K for 2 or 4 months and let the comp sci kids know about it (specially now with the fee's raise :) I'm sure they'll come with a far better solution for a fraction of the price of a contractor or off the shelf solution. Everybody will be happy.
Just write the rules/dataset/sandbox and spend the rest of your time spreading the word and thinking about TNBT
If you like competing/hands on, ask somebody else to manage the contest and take part on it :-)))
Happy new year mate!
Andre
Be better if you used AJAX for the request submissions!
ReplyDelete