Categorization of product data

Date: Fri Jul 20 2007
It would be unwieldy and unbrowsable to simply present a list of products. Some of the merchants Shareasale offers have 20 thousand or more products available, and presenting them in a single list would be horrid for the users. To make it pleasant for users to browse and search for what they want, it is exceedingly helpful to break the products up into short lists.

This is where categorization comes in. By labeling each product with a category, you can simply make enough categories so that each category has a small list of products.

In the Shareasale datafeed format the relevant fields are Category, SubCategory, MerchantCategory and MerchantSubCategory. Using Category and SubCategory provides you with a two-level hierarchy of categories.

It's pretty straightforward to take this and create a simple topic-directory-style browsing experience that leads you to a given product. Under this scheme, each product is listed in exactly one place within the topic directory, namely the index page for its Category/SubCategory. Using MerchantCategory and MerchantSubCategory, the products could be listed in a second place within the directory.
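
As a rough illustration, the two-level directory could be built along these lines in Python. This is only a sketch: it assumes each product has already been parsed into a dict keyed by the datafeed field names, and the `build_directory` helper, the sample products and the "Uncategorized"/"General" placeholders are made up for the example.

```python
from collections import defaultdict

def build_directory(products):
    """Group product names into a two-level Category -> SubCategory index."""
    directory = defaultdict(lambda: defaultdict(list))
    for product in products:
        # Placeholder labels for blank fields are an assumption of this sketch.
        category = product.get("Category") or "Uncategorized"
        subcategory = product.get("SubCategory") or "General"
        directory[category][subcategory].append(product["Name"])
    return directory

# Made-up sample rows, not real datafeed content.
sample_products = [
    {"Name": "Garden Trowel", "Category": "Home & Garden", "SubCategory": "Tools"},
    {"Name": "Pruning Shears", "Category": "Home & Garden", "SubCategory": "Tools"},
    {"Name": "Yoga Mat", "Category": "Sports", "SubCategory": "Fitness"},
]

directory = build_directory(sample_products)
for category in sorted(directory):
    print(category)
    for subcategory in sorted(directory[category]):
        names = ", ".join(directory[category][subcategory])
        print("  " + subcategory + ": " + names)
```

Each Category/SubCategory pair maps to one index page, so every product shows up in exactly one place.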

For most products this should work fairly well. Some products are either hard to categorize, or will easily fall into multiple categories. There are other categorization methods which are more comprehensive, but this is the categorization which Shareasale offers.

A huge problem is that many of the Shareasale datafeeds have bad values in these category fields. The MerchantCategory fields are usually empty, for example. The Category field often holds an overly generic value that's incorrect for most products. Often the real categorization is stored in the CustomN fields. Sometimes the Category/SubCategory fields are simply blank.

What I've done is to process the datafeed files before inserting them into a database. The processing is handled differently for each merchant, because each merchant fills in these fields differently.

Sometimes the processing is a simple matter of rearranging data fields, such as copying Custom1 or Custom2 into the SubCategory field. Other times I end up looking at the Name or Description fields for keywords, and based on those keywords the script invents whole new Category/SubCategory values.
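
Here is a minimal Python sketch of both kinds of fix-up. It is not the actual script: the merchant names, the keyword table and the function names are all hypothetical, and rows are assumed to be dicts keyed by the datafeed field names.

```python
import re

def fix_from_custom1(row):
    """Hypothetical merchant whose real subcategory lives in Custom1."""
    fixed = dict(row)
    if fixed.get("Custom1"):
        fixed["SubCategory"] = fixed["Custom1"]
    return fixed

# Hypothetical keyword table: keyword -> (Category, SubCategory).
KEYWORD_RULES = [
    ("sleeping bag", ("Outdoors", "Camping")),
    ("tent", ("Outdoors", "Camping")),
    ("kayak", ("Outdoors", "Boating")),
]

def fix_by_keywords(row):
    """Hypothetical merchant whose category fields are useless; the first
    keyword found in Name or Description decides the category."""
    fixed = dict(row)
    text = (row.get("Name", "") + " " + row.get("Description", "")).lower()
    for keyword, (category, subcategory) in KEYWORD_RULES:
        # Match whole words only, so "tent" does not fire on "content".
        if re.search(r"\b" + re.escape(keyword) + r"\b", text):
            fixed["Category"] = category
            fixed["SubCategory"] = subcategory
            break
    return fixed

# One fixer per merchant, since each merchant needs different handling.
MERCHANT_FIXERS = {
    "merchant_a": fix_from_custom1,
    "merchant_b": fix_by_keywords,
}

def preprocess(merchant, rows):
    """Apply the merchant's fixer to every row; unknown merchants pass through."""
    fixer = MERCHANT_FIXERS.get(merchant, dict)
    return [fixer(row) for row in rows]
```

The output of `preprocess` would then be what gets inserted into the database. Keeping one small function per merchant makes it easy to add a new merchant without touching the others.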

The Category/SubCategory system, while simple, is rather rigid and old-style. The new Web 2.0 style is to tag things, and then let the users select items from the pile by specifying tags as a search string. Go to del.icio.us for an example of this sort of web site. The question I ran into while thinking about how to implement this was: what would be the most effective way to generate the tags?

I'm not going to generate, by myself, category tagging for a hundred thousand products. Nope. No way. That means the tagging would have to be algorithmic. For example, you could take the text in the description and name, trim out common words like "the" or "and", and assume that anything left over is an important word. Those words could be the tags.
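
That naive approach might look something like the following Python sketch. The stop-word list here is a tiny made-up sample, and `naive_tags` is just an illustrative name; a real list of common words would be much longer.

```python
import re

# A deliberately small, made-up stop-word list for illustration.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "or", "the", "to", "with"}

def naive_tags(name, description):
    """Return every non-stop-word from the name and description, lowercased,
    in first-seen order with duplicates dropped."""
    words = re.findall(r"[a-z0-9]+", (name + " " + description).lower())
    tags = []
    for word in words:
        if word not in STOP_WORDS and word not in tags:
            tags.append(word)
    return tags

print(naive_tags("Apple Slicer", "The slicer cores and cuts an apple in seconds"))
```

Every leftover word becomes a tag, whether or not it says anything useful about the product.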

Maybe that would be useful, but it didn't strike me that way. There are many words which have multiple meanings based on their context, and it's hard for an algorithm to know just what is being said in the text. It would be better, it seems, to have some kind of understanding closer to natural language processing.

Among Yahoo's public APIs is a "Term Extraction" service, which seems to be the best sort of approach. Yahoo and the other search engines have the expertise and the motive to develop a deep understanding of how to categorize text. It would be hard for me to beat the correctness of their Term Extraction service, as they most likely have world-class experts in this field.

Unfortunately Yahoo's terms of service preclude using the service for categorizing text related to products for sale.