Where tagging falls apart
2008-04-24

Tobey Maguire has a daughter named Ruby. How do I know? No, I haven't started following celebrity news. But I do regularly skim the Technorati Ruby tag.

Tags suck.
I get lots of what I'm after, namely posts about Ruby development, but I also get a hell of a lot of junk. Not junk as in spam, though there's that too, but junk as in semantic overload: terms that carry meanings I have no interest in. I don't care about celebrities and their families, nor do I care about gemstones, people gushing over their pets on their blogs, or any number of other things that end up tagged "ruby", even though the tag is perfectly reasonable seen through the eyes of the people who posted those entries, and presumably for a lot of users.

What it boils down to is that "folksonomies", which seemed to be all people blogged about a couple of years ago, work best when they are confined to niches, and not so well when the internet at large starts tagging. It's one of those nasty cases where all is well and fine at small scale, when your audience is relatively homogeneous, at least in terms of the terms they use, but which just falls apart as things scale up. (Incidentally, the failure of tagging is a pretty good example of why it makes next to no sense to test if your data set isn't realistic.)

We learned the hard way at Edgeio that a varied audience and consistent tagging do not mix. While we started out trying to use tags for most things, we eventually moved more and more towards using various classification and feature extraction methods to improve ranking and make the search results for the classifieds we were fed more cohesive, as people were chronically unable to apply the same semantic meanings to the same tags. And that was despite the fact that by far most of our listings came from large providers that fed us hundreds or thousands of listings, most of them professionals in their niches.

The feedback loop doesn't work.
One of the big hopes for folksonomies in many people's minds was that common usages of specific terms would crystallize as people saw how specific tags were used elsewhere. I.e. the Ruby gemstone people might see their stuff getting swamped by the Ruby programming language mob and start tagging their stuff "ruby gemstone" instead of just "ruby". So far I've seen very little evidence that people look at how other people tag their stuff and adapt. I'm not saying it doesn't happen, but certainly not enough to "clean up" tags with a massive semantic disconnect between different groups of users.

Another tag with unfortunate disconnects is Rack. When I first took a look at it to find people writing about the Ruby web server to framework adapter 'Rack', I was completely oblivious to the fact that of course I'd also be facing a lot of posts about scantily clad women... I was surprised, honest - while of course I knew the word is also used for breasts, it didn't cross my mind at all at the time.

Classification of data is a hard problem, and classification of short snippets of data even more so (people have less space to distinguish their listing from someone else's, and so each word has a greater chance to skew things). Tags are still useful hints - I could easily write about a specific Ruby project, for example, and manage to avoid including the word Ruby in the text. If so, including the term Ruby in the tags would be useful in reducing the chance of ambiguity about what the project name referred to, Rack being a good example.

But tagging isn't enough.
I'd be happy to type in extra search terms if I would get posts more closely related to what I'm looking for. Wikipedia style disambiguation, for example, or clustering like what some of the Google competitors such as Clusty are experimenting with. But most search engines still expect your entire search term to be a literal search. I.e. if I search for "Ruby programming", even most "next generation" engines expect me to be looking for documents containing both of these terms. I may very well be, but what I want to ask for is posts related to Ruby programming, regardless of the terms. In fact, almost regardless of what I search for, I am searching for something related to a concept, not something containing specific strings (unless I'm searching for a quote, which does happen).

There are lots of search startups with different approaches to this, such as the much hyped Powerset. But "generic" search is actually the lesser problem for me - I'm good at translating searches into text to find what I want. I'm more interested in finding recent blog posts matching a specific topic rather than matching specific tags or text strings from the content. I'd really like to hear about sites that are getting close to doing the right thing in this space.
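To make the difference between matching a tag and matching a concept a bit more concrete, here is a minimal Python sketch. It is a toy, not how Edgeio, Technorati or any real search engine works: the `SENSES` vocabularies, the sample `posts` and the function names `guess_sense` and `filter_by_sense` are all made up for illustration, and a real system would learn its vocabularies from data rather than use hand-picked word lists.

```python
import re

# Hypothetical, hand-picked vocabularies for three senses of the tag "ruby".
# These word lists are assumptions for illustration only.
SENSES = {
    "programming": {"rails", "gem", "code", "interpreter", "rack", "class", "method"},
    "gemstone": {"ring", "jewelry", "carat", "stone", "necklace"},
    "celebrity": {"daughter", "actor", "baby", "maguire"},
}


def guess_sense(text):
    """Return the sense whose vocabulary overlaps most with the text, or None."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {sense: len(words & vocab) for sense, vocab in SENSES.items()}
    best = max(scores, key=scores.get)
    # If no vocabulary word appears at all, refuse to guess.
    return best if scores[best] > 0 else None


def filter_by_sense(posts, tag, sense):
    """Keep posts that carry the tag AND whose text looks like the wanted sense."""
    return [
        p for p in posts
        if tag in p["tags"] and guess_sense(p["text"]) == sense
    ]


# Made-up sample posts, all tagged "ruby" by their (imaginary) authors.
posts = [
    {"tags": ["ruby"], "text": "Writing a Rack adapter in a few lines of code"},
    {"tags": ["ruby"], "text": "Tobey Maguire has a daughter named Ruby"},
    {"tags": ["ruby"], "text": "A ring set with a two carat stone"},
    {"tags": ["ruby"], "text": "Look at my adorable puppy"},
]

if __name__ == "__main__":
    for post in filter_by_sense(posts, "ruby", "programming"):
        print(post["text"])
```

Note that the first sample post never mentions the word Ruby in its text. The tag is the hint that gets it considered at all, and the surrounding words are what separate it from the gemstone and celebrity posts. That is the combination I'm after: tags as hints, with something concept-aware on top.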
