The other day I threw together a little service which I’ve nicknamed Extractomatic. It’s a very simple web-based API to detect and extract the main content from a web page, removing all of the clutter, such as headers, footers, advertising and so on. I guess it’s somewhat similar to Readability or Instapaper, but more suitable for building into your own applications. Watch this space.
Under the hood it uses a Java library called Boilerpipe, which is excellent. Not every page comes out perfectly, but it’s more than good enough.
Extractomatic is written in Sinatra on JRuby on Google App Engine, which is sort of worthy of a blog post in its own right. If you’re interested in doing something similar, appengine-jruby is what you want to be looking at, but it’s still a bit bleeding edge, and you might find yourself trawling the Google Group and poring over stack traces. But when it comes together, it’s going to be a great way of getting little bits of HTTP glue onto the web with very little effort.
Oh, and there’s some code for Extractomatic on GitHub.