I can't quite remember where I heard this, but someone once compared reading their Twitter feed to dipping into a stream: they didn't attempt to keep up with it, and it was essentially luck if they popped in at the right time to pick up something interesting. I tend to be rather free with the "Follow" button, so my feed is fairly busy. Some time ago I idly wondered if anyone had implemented a Bayesian filter on Twitter, so that you could refine your feed over time - meaning you could still follow hundreds of people but be less likely to miss those 140 characters that will change your life.
After taking the necessary first step and gently rubbing some Google on the affected area, it looks like there's only one working attempt out there: a C# experiment from Ade Miller making use of the Witty client. There are a few speculative articles too, but as far as I could tell his was the only attempt that resulted in something concrete. So it's pleasing that I'm not too late to the party and there's still time to do something different with this idea. That's where the motivation for this article comes from, as I suspect the best way to make myself finish coding this is to publicly tell everyone that I will.
Currently, the proof-of-concept is a Zend Framework application on my netbook. It retrieves my feed, rates the tweets and lets me love or hate individual ones. Over the coming week or so, I'd like to get a working alpha up on tweetist.org for people to have a play with. However, to prove it's not total vapourware let's have a look under the hood at this first incarnation.
At the top of the library sits the Classifier, which takes a SimpleXMLElement representing a tweet from Zend_Service_Twitter and returns a score. Since we're talking about Bayesian classification, that score is a probability predicting whether or not the tweet will be worth reading. To make this decision, the Classifier needs a number of resources.
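To make the "score is a probability" idea concrete, here's a minimal sketch of the underlying maths - in Python rather than the project's actual PHP, and with invented counts and feature names. Each feature's "worth reading" probability can be estimated from training counts via Bayes' rule, treating the two classes as equally likely a priori (the approach popularised by Bayesian spam filters) and applying add-one smoothing so unseen features don't produce hard zeros:

```python
from collections import Counter

# Invented training data: how often each feature appeared in tweets
# I read versus tweets I skipped (not the real dataset).
read_counts = Counter({"word:bayesian": 8, "user:ademiller": 5, "word:lunch": 1})
skip_counts = Counter({"word:bayesian": 1, "user:ademiller": 2, "word:lunch": 9})
total_read, total_skip = 50, 150

def feature_probability(feature):
    """P(read | feature) with equal class priors and add-one
    smoothing on the per-class frequencies."""
    p_given_read = (read_counts[feature] + 1) / (total_read + 2)
    p_given_skip = (skip_counts[feature] + 1) / (total_skip + 2)
    return p_given_read / (p_given_read + p_given_skip)
```

With these numbers, "word:bayesian" scores high (it mostly appears in tweets I read), "word:lunch" scores low, and anything never seen before sits safely between 0 and 1.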
The FunctionExtractors specify the features that will be extracted from the tweet. The Probabilities class is responsible for pulling the various read/don't-read probabilities for those features from the dataset we've trained so far. Finally, the Scorer takes the probabilities and chooses how we combine them to reach a final rating for the tweet.
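For the combining step, the classic rule from Bayesian spam filtering multiplies the per-feature probabilities together for each class and renormalises. A sketch (again illustrative Python, not the project's Scorer), working in log space so a long list of features doesn't underflow:

```python
import math

def combine(probabilities):
    """Combine per-feature probabilities p1..pn into one score:
        P = (p1*...*pn) / (p1*...*pn + (1-p1)*...*(1-pn))
    computed via logs for numerical stability."""
    log_p = sum(math.log(p) for p in probabilities)  # evidence for "read"
    log_q = sum(math.log(1 - p) for p in probabilities)  # evidence against
    # equivalent to exp(log_p) / (exp(log_p) + exp(log_q))
    return 1 / (1 + math.exp(log_q - log_p))
```

This has the behaviour you'd want: neutral features (0.5) leave the score alone, agreeing features reinforce each other, and a strong positive and a strong negative cancel out.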
The function extractors are going to be the most important part of making this application work, as they determine which aspects of a tweet get rated. Currently, the code extracts the author, any other mentioned users, any topics and the words in the tweet. There are some obvious improvements to make to the word extractor - dropping common words, normalising, stemming and so on - which will cut down the noise. Once those are bedded in, there are more complex features that might be worth exploring, for example: the author's average number of tweets per day, trending topics, shared friends, etc.
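As a rough sketch of what those four extractors produce (Python for illustration; the function name and stop-word list are mine, and stemming is left out), a tweet can be boiled down to a set of prefixed features:

```python
import re

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "to", "of", "is", "and", "i", "my"}

def extract_features(author, text):
    """Pull the four feature types: author, @mentions, #topics, and
    the remaining words (lowercased, common words dropped)."""
    features = {"author:" + author.lower()}
    features |= {"mention:" + m.lower() for m in re.findall(r"@(\w+)", text)}
    features |= {"topic:" + t.lower() for t in re.findall(r"#(\w+)", text)}
    # Strip mentions/topics before tokenising so they aren't double-counted.
    words = re.findall(r"\b[a-zA-Z]+\b", re.sub(r"[@#]\w+", " ", text))
    features |= {"word:" + w.lower() for w in words if w.lower() not in STOP_WORDS}
    return features
```

Prefixing each feature with its type ("author:", "word:", ...) keeps the namespaces apart, so the word "bob" and a mention of @bob are counted as separate pieces of evidence.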
Anyway, hopefully that's enough to whet your appetite. I'll try to hold up my end of the bargain and get this public soon. There will also be follow-up posts covering the database structure and the maths involved.