Extractomatic in Sinatra on JRuby on Google App Engine on the Internet

Thursday, January 28, 2010

The other day I threw together a little service which I’ve nicknamed Extractomatic. It’s a very simple web-based API to detect and extract the main content from a web page, removing all of the clutter, such as headers, footers, advertising and so on. I guess it’s somewhat similar to Readability or Instapaper, but more suitable for building into your own applications. Watch this space.

Under the hood it uses a Java library called Boilerpipe, which is excellent. Not every page comes out perfectly, but it’s more than good enough.

Extractomatic is written in Sinatra on JRuby on Google App Engine. Which is sort of worthy of a blog post in its own right. If you’re interested in doing something similar, appengine-jruby is what you want to be looking at, but it’s still a bit bleeding edge, and you might find yourself trawling the Google Group and poring over stack traces. But when it comes together, it’s going to be a great way of getting little bits of HTTP glue onto the web with very little effort.

Oh, and there’s some code for Extractomatic on Github.

If I Concentrated Hard Enough

Sunday, January 17, 2010

I seem to remember reading a story, perhaps aged 13 or 14, about quantum physics. It might have been The Time and Space of Uncle Albert, but I can’t find the contents anywhere to check.

Anyway, I was struck by a line in it. The line said that there was no particular reason that time travelled forwards, and that it was a possibility, albeit an unfeasibly small one, that an event could occur in reverse, purely by chance, in the everyday. I seem to remember it used the example of a diver, leaping backwards out of the swimming pool and onto the diving board.

This completely blew me away. If it was possible an event could ‘jump’ backwards in time, however infinitesimal, then surely it might have already happened? Or be about to happen somewhere? Or right in front of me? Perhaps, if I looked hard enough, it might.

I started thinking about laying physical models on top of the world, and laying the world on top of physical models.

Perhaps, as the water sloshed side to side in the sink in front of me, that moment, just then, as the water splashed up the side, would be the fastest velocity that those particular Oxygen and two bonded Hydrogen atoms would ever reach, in the entirety of time. What will be the knock-on effects of those two ripples meeting? If I touched one, would a plane fall out of the sky? Maybe, in that unexpected splash, out of the corner of my eye, next to the overflow drain, the one in a googolplex event occurred. And I missed it, because I was doing my teeth.

Perhaps, if I concentrated hard enough, all of the data would pour out of the surfaces and motions surrounding me. And maybe, just maybe, that’s what it would be like to be god.

It was a good book.

In The Future

Thursday, January 14, 2010

In the future, maybe by the year 2010, our watches will be able to tell us when the next bus is coming.

Muni Watch

This is my Sony Ericsson MBW-150 bluetooth watch, showing the next few SF Muni bus arrival times for a nearby stop. The code to fetch the arrival times is running on my Droid phone, and communicating with the watch using Marcel Dopita’s OpenWatch software for the Android platform.

Smart stuff by Joe Hughes.

The Practical Application of Codes and Pictures

Wednesday, January 13, 2010

Noticings code

It still amazes me that with the Practical Application of Codes and Pictures, 1145 lines of gobbledegook and 554KB of compressed images can be turned into this:

Why did I go out with my bike in this weather? For points on @noticings? Feel silly & soaked & prey my front wheel safe from thieves.

I mean, that’s fucking magic.

Atomkraft

Sunday, January 10, 2010

I might be wrong, but right now, and in lieu of a better alternative: Atomkraft? Ja Bitte.

Atomkraft? Ja Bitte

Also: SVG, EPS, AI

No idea where the original “Nein, Danke” logo came from, so it seems cheeky to license this as anything other than public domain.

Simulated failure

Tuesday, November 24, 2009

A circular from the Civil Aviation Authority (PDF), picked up by Chris Fleming on the OSM Talk-GB mailing list:

NOTIFICATION OF GPS JAMMING TRIALS – NORTH SCOTLAND 16-27 NOVEMBER 2009.
The purpose of this Circular is to give notification of the trial to be performed by the Ministry of Defence (MoD) Air Warfare Centre, in which Global Positioning System (GPS) signals will be intentionally jammed.

Date: 16-27 November 2009.
Time: A maximum of 6, fifteen minute periods between 1100 and 1500.
Location: The trial uses a 500 Watt airborne jammer at 10000 ft amsl, transmitting to the west along a 50 nm flight path on a 270° T radial from Kirkwall, Orkney Islands. The aircraft will fly between two points, situated at a distance of 10 nm and 60 nm from Kirkwall.

EG_Circ_2009_P_089_en.pdf

Infrastructure is only noticed when it’s not there. Failure, simulated or not, is sometimes the only way to remind us what we smother ourselves in.

Tomorrow: leaving my phone at home.

Noticings iPhone app

Thursday, November 19, 2009

Photos

Yesterday, slightly quicker than expected, the Noticings iPhone app went on sale. It does one thing well, and that’s getting your photos onto Flickr with all the metadata required for Noticings.

There’s been lots of chat about the App Store recently, specifically about the approval process. I was prepared for the worst, especially since the application reads the photos directly out of the /private/mobile directory to get access to the original EXIF metadata which the UIImagePickerController doesn’t provide. It’s not a private API, but I could see how it might be contentious.

Thankfully, my experience was smooth and painless. I submitted the app on 6th November, but resubmitted on the 11th with a bug fix. And it went on sale on the 18th. My contract and approval for paid applications was very quick and didn’t involve having to sign or fax any paperwork. Quite impressed really.

No-one is going to get rich off it, but hopefully it’ll provide a small revenue stream for Noticings, enough to keep the server bills paid and the game ticking over.

This screeching noise

Monday, November 16, 2009

This noise, this screeching, whining noise, has forever embedded itself in my auditory system as the sound of amazing things happening.

First, the long screech of the handshake; two machines sizing each other up. Questions asked and answered. Decisions made.

With both satisfied, the stuttering chatter. Quick as you can, what we have to say is too important to waste with frivolities.

In the silence — both machines process their exchange across the ether.

Somewhere, a bit flips. We repeat.

Update: Thanks to Chris for letting me know that, of course, Toshiba shamelessly ripped this advert off Simon Faithfull’s Escape Vehicle — which even has the original screeching noise.

Using Geoplanet Data in Ruby on Rails

Tuesday, November 10, 2009

The Unrendered City is Here for You to Use

Noticings is possibly one of the first services to integrate the Yahoo Geoplanet Data deeply, although it seems we can now add Twitter to the list. I imagine we’ll see a few more services begin to use it soon – Yahoo have released it under a Creative Commons Attribution license, and if Twitter are using it then a whole bunch of things are going to spring up around that.

It gives us the opportunity to use colloquial geography rather than bounding boxes and radial searches and the like. I banged on about this in my talk at the AGI conference recently. I am such a geography bore.

Anyway, we couldn’t have built Noticings without it.

However, it is a little bit difficult to get up and running at first, so I want to delve a bit into how we’re using it, hopefully helping others get rolling a bit quicker. Noticings is written in Ruby on Rails, but I’m sure the same principles apply to whatever framework/language you’re using.

First, some background. Geoplanet is a database of 5.4 million places in a hierarchy. Each entry has a unique, permanent ID (WOEID), a name and a place type. For example, Homerton (20089379) is a Suburb in the London Borough of Hackney (12695808), which is a LocalAdmin in London (44418), which is a Town in Greater London (23416974), and so on.
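To make the hierarchy concrete, here is a minimal sketch in plain Ruby. The WOEIDs and names are the ones from the example above; the PLACES hash and the lineage method are made up for illustration and have nothing to do with how Geoplanet itself is queried.

```ruby
# A hand-copied slice of the hierarchy described above.
# Each WOEID maps to a name, a place type (where stated) and a parent WOEID.
PLACES = {
  20089379 => { :name => "Homerton", :place_type => "Suburb", :parent_woeid => 12695808 },
  12695808 => { :name => "London Borough of Hackney", :place_type => "LocalAdmin", :parent_woeid => 44418 },
  44418    => { :name => "London", :place_type => "Town", :parent_woeid => 23416974 },
  # Greater London's parent is not in this sample, so the walk stops after it.
  23416974 => { :name => "Greater London", :parent_woeid => 24554868 }
}

# Walk from a WOEID up through its parents, collecting names,
# until we hit a parent that isn't in the sample.
def lineage(woeid, places = PLACES)
  names = []
  while (place = places[woeid])
    names << place[:name]
    woeid = place[:parent_woeid]
  end
  names
end

puts lineage(20089379).join(" < ")
# prints "Homerton < London Borough of Hackney < London < Greater London"
```

Every step up is one lookup by parent WOEID, which is fine for a single place but gets expensive once you want whole subtrees.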

Once you’ve discounted the place types that Flickr doesn’t use for associating photos with (there are huge numbers of zip codes and telephone dialling zones, for example), then there are about 1.4 million places that Noticings cares about.

The Geoplanet download contains three tab-separated files. Places, which does what it says. Aliases, which contains alternate language names for each place. Adjacencies, which contains info about which places are adjacent to each other (although not necessarily geographically contiguous).

There are three tables in our database, one for each of these:

create_table "geoplanet_adjacencies", :force => true do |t|
  t.integer "woeid",              :limit => 8
  t.string  "iso_code"
  t.integer "neighbour_woeid",    :limit => 8
  t.string  "neighbour_iso_code"
end
 
add_index "geoplanet_adjacencies", ["woeid"], :name => "index_geoplanet_adjacencies_on_woeid"
 
create_table "geoplanet_aliases", :force => true do |t|
  t.integer "woeid",         :limit => 8
  t.string  "name"
  t.string  "name_type"
  t.string  "language_code"
end
 
add_index "geoplanet_aliases", ["woeid"], :name => "index_geoplanet_aliases_on_woeid"
 
create_table "geoplanet_places", :force => true do |t|
  t.integer "woeid",        :limit => 8
  t.integer "parent_woeid", :limit => 8
  t.string  "country_code"
  t.string  "name"
  t.string  "language"
  t.string  "place_type"
  t.string  "ancestry"
end
 
add_index "geoplanet_places", ["ancestry"], :name => "index_geoplanet_places_on_ancestry"
add_index "geoplanet_places", ["parent_woeid"], :name => "index_geoplanet_places_on_parent_woeid"
add_index "geoplanet_places", ["woeid"], :name => "index_geoplanet_places_on_woeid", :unique => true

And there’s a rake task which handles the import. This takes ages. In addition to the 5.4 million places, there are about 2 million aliases and 8.4 million adjacencies. Go and make several cups of tea if you’re running this. Do the crossword too.

namespace :geoplanet do
 
  DATA_PATH = File.join(Rails.root, 'data', 'geoplanet', 'geoplanet_data_7.3.2')
 
  namespace :import do
 
    task :all => [:places, :aliases, :adjacencies]
 
    task :places => :environment do
      ActiveRecord::Base.connection.execute("TRUNCATE TABLE geoplanet_places")
      ActiveRecord::Base.connection.execute("ALTER TABLE geoplanet_places DISABLE KEYS")
      ActiveRecord::Base.connection.execute("LOAD DATA LOCAL INFILE '#{DATA_PATH}/geoplanet_places_7.3.2.tsv' REPLACE INTO TABLE geoplanet_places
      FIELDS TERMINATED BY '\\t' OPTIONALLY ENCLOSED BY '\"'
      IGNORE 1 LINES
      (woeid, country_code, name, language, place_type, parent_woeid);")
      ActiveRecord::Base.connection.execute("ALTER TABLE geoplanet_places ENABLE KEYS")
    end
 
    task :aliases => :environment do
      ActiveRecord::Base.connection.execute("TRUNCATE TABLE geoplanet_aliases")
      ActiveRecord::Base.connection.execute("ALTER TABLE geoplanet_aliases DISABLE KEYS")
      ActiveRecord::Base.connection.execute("LOAD DATA LOCAL INFILE '#{DATA_PATH}/geoplanet_aliases_7.3.2.tsv' REPLACE INTO TABLE geoplanet_aliases
      FIELDS TERMINATED BY '\\t' OPTIONALLY ENCLOSED BY '\"'
      IGNORE 1 LINES
      (woeid, name, name_type, language_code);")
      ActiveRecord::Base.connection.execute("ALTER TABLE geoplanet_aliases ENABLE KEYS")
    end
 
    task :adjacencies => :environment do
      ActiveRecord::Base.connection.execute("TRUNCATE TABLE geoplanet_adjacencies")
      ActiveRecord::Base.connection.execute("ALTER TABLE geoplanet_adjacencies DISABLE KEYS")
      ActiveRecord::Base.connection.execute("LOAD DATA LOCAL INFILE '#{DATA_PATH}/geoplanet_adjacencies_7.3.2.tsv' REPLACE INTO TABLE geoplanet_adjacencies
      FIELDS TERMINATED BY '\\t' OPTIONALLY ENCLOSED BY '\"'
      IGNORE 1 LINES
      (woeid, iso_code, neighbour_woeid, neighbour_iso_code);")
      ActiveRecord::Base.connection.execute("ALTER TABLE geoplanet_adjacencies ENABLE KEYS")
    end
  end
 
end
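
LOAD DATA INFILE is MySQL-specific. If you’re on something else, the same files can be read line by line in plain Ruby. This is only a rough sketch: the column order is the one from the places task above, the sample line is made up from the Homerton record rather than copied from the real file, and parse_place_line is a hypothetical helper. It splits naively on tabs, so it assumes no tabs inside quoted values.

```ruby
# Column order for the places file, as used in the LOAD DATA statement above.
PLACE_COLUMNS = [:woeid, :country_code, :name, :language, :place_type, :parent_woeid]

# Turn one tab-separated line into a hash, stripping the optional double quotes.
def parse_place_line(line)
  values = line.chomp.split("\t").map { |value| value.sub(/\A"/, "").sub(/"\z/, "") }
  row = Hash[PLACE_COLUMNS.zip(values)]
  row[:woeid] = row[:woeid].to_i
  row[:parent_woeid] = row[:parent_woeid].to_i
  row
end

# A made-up sample line in the assumed format.
sample = %Q(20089379\t"GB"\t"Homerton"\t"ENG"\t"Suburb"\t12695808\n)
place = parse_place_line(sample)
puts place[:name]         # prints "Homerton"
puts place[:parent_woeid] # prints 12695808
```

Inserting rows one at a time this way will be much slower than a bulk load, so it’s only really useful for poking at the data or importing a subset.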

We’ve also got three models — one for each of the tables.

class GeoplanetPlace < ActiveRecord::Base
 
  set_primary_key 'woeid'
 
  has_many :aliases, :class_name => 'GeoplanetAlias', :foreign_key => 'woeid'
  has_many :adjacencies, :class_name => 'GeoplanetAdjacency', :foreign_key => 'woeid'
  has_many :adjacent_places, :through => :adjacencies
 
end
 
class GeoplanetAlias < ActiveRecord::Base
 
  belongs_to :geoplanet_place, :foreign_key => 'woeid', :primary_key => 'woeid'
 
end
 
class GeoplanetAdjacency < ActiveRecord::Base
 
  belongs_to :place, :class_name => 'GeoplanetPlace', :foreign_key => 'woeid', :primary_key => 'woeid'
  belongs_to :adjacent_place, :class_name => 'GeoplanetPlace', :foreign_key => 'neighbour_woeid', :primary_key => 'woeid'
 
end

Now we’ve got a usable copy of Geoplanet, and we can do things like:

london = GeoplanetPlace.find_by_name "London"
londons_children = GeoplanetPlace.find(:all, :conditions => { :parent_woeid => london.id })

Which is all well and good, but if you want to fetch the second level children you have to add a JOIN, and another for third level children. It quickly becomes slow and impossible.

We need to start caching the tree for each row somehow, making it easier and quicker to find children, siblings and ancestors.

Step forward Ancestry, a plugin by Stefan Kroes for organising ActiveRecord models in a tree structure. It stores the ancestors for each row in a string, using the ‘LIKE’ operator to SELECT them.

>> homerton = GeoplanetPlace.find_by_name "Homerton"
=> #<GeoplanetPlace id: 51640, woeid: 20089379, parent_woeid: 12695808, country_code: "GB", name: "Homerton", language: "ENG", place_type: "Suburb", ancestry: "1/23424975/24554868/23416974/44418/12695808">
>> homerton.parent
=> #<GeoplanetPlace id: 2669, woeid: 12695808, parent_woeid: 44418, country_code: "GB", name: "London Borough of Hackney", language: "ENG", place_type: "LocalAdmin", ancestry: "1/23424975/24554868/23416974/44418">
>> homerton.parent.parent
=> #<GeoplanetPlace id: 3134, woeid: 44418, parent_woeid: 23416974, country_code: "GB", name: "London", language: "ENG", place_type: "Town", ancestry: "1/23424975/24554868/23416974">
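
The idea behind that ancestry column can be sketched without a database at all. The strings below are copied from the console output above; the method names are made up, and the prefix match is a plain-Ruby stand-in for the kind of LIKE query the plugin runs, not its actual implementation.

```ruby
# woeid => ancestry string, copied from the console output above.
ANCESTRY = {
  20089379 => "1/23424975/24554868/23416974/44418/12695808", # Homerton
  12695808 => "1/23424975/24554868/23416974/44418",          # London Borough of Hackney
  44418    => "1/23424975/24554868/23416974"                 # London
}

# The path that every direct child of a node carries in its ancestry column.
def child_ancestry(woeid)
  "#{ANCESTRY[woeid]}/#{woeid}"
end

# Stand-in for: ancestry = 'path' OR ancestry LIKE 'path/%'
def descendant_ids(woeid)
  path = child_ancestry(woeid)
  ANCESTRY.select { |_, ancestry| ancestry == path || ancestry.start_with?("#{path}/") }.keys
end

# Ancestors fall straight out of the string, no query needed.
def ancestor_ids(woeid)
  ANCESTRY[woeid].split("/").map { |id| id.to_i }
end

p descendant_ids(44418)   # => [20089379, 12695808]
p ancestor_ids(20089379)  # => [1, 23424975, 24554868, 23416974, 44418, 12695808]
```

Because the match is on the start of the string, an index on the ancestry column can do most of the work, which is why a whole subtree comes back in one query instead of one JOIN per level.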

Ancestry adds a class method called build_ancestry_from_parent_ids!, for transforming a more traditional parent_id tree structure into the format for Ancestry. That’s what we’ve got here, except the parent field is called parent_woeid. In our case Earth is the root of the tree, and has a parent_woeid of -1, not nil.

By overriding that method in GeoplanetPlace we can convert the Geoplanet tree structure into something usable by Ancestry.

def self.build_ancestry_from_parent_ids! parent_id = nil, ancestry = nil
  parent_id = parent_id || -1
  self.base_class.all(:conditions => {:parent_woeid => parent_id}).each do |node|
    node.without_ancestry_callbacks do
      node.update_attribute ancestry_column, ancestry
    end
    build_ancestry_from_parent_ids! node.id, if ancestry.nil? then "#{node.id}" else "#{ancestry}/#{node.id}" end
  end
end

This takes bloody ages. About 4 hours on my laptop. You can go to bed now if you like.
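
Most of that time goes on one query and one update per node. A possible alternative, which is not what we ran and which I haven’t tried against the full dataset, is to work the ancestry strings out in memory first and write them back afterwards. The sketch below assumes the rows have already been loaded as [woeid, parent_woeid] pairs; build_ancestry is a made-up name and the write-back step is left out.

```ruby
# Build ancestry strings for a set of rows without touching the database.
# rows: array of [woeid, parent_woeid] pairs; the root has a parent of -1.
def build_ancestry(rows, root_parent = -1)
  # Group the woeids by their parent so each node's children are one lookup away.
  children = Hash.new { |hash, key| hash[key] = [] }
  rows.each { |woeid, parent_woeid| children[parent_woeid] << woeid }

  ancestry = {}
  # Queue of [woeid, ancestry string for that node]; roots have nil ancestry.
  queue = children[root_parent].map { |woeid| [woeid, nil] }
  until queue.empty?
    woeid, path = queue.shift
    ancestry[woeid] = path
    child_path = path.nil? ? woeid.to_s : "#{path}/#{woeid}"
    children[woeid].each { |child| queue << [child, child_path] }
  end
  ancestry
end

# The chain from Earth down to Homerton, using the WOEIDs shown earlier.
rows = [
  [1, -1], [23424975, 1], [24554868, 23424975], [23416974, 24554868],
  [44418, 23416974], [12695808, 44418], [20089379, 12695808]
]
puts build_ancestry(rows)[20089379]
# prints "1/23424975/24554868/23416974/44418/12695808"
```

Holding 5.4 million pairs in memory is a fair chunk of RAM, and the results still have to be written back in bulk, so treat this as a starting point rather than a drop-in replacement.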

To save you all that here’s a prebuilt SQL dump of all three tables (156MB gzipped SQL), ready for import. It’s built from version 7.4.0 of the Geoplanet Data, but you should check whether that’s the latest.

And once you’ve done all that you’ll be able to do things like this in a blink of an eye:

# in_use is a named_scope in GeoplanetPlace with conditions on place_type
>> homerton.siblings.in_use.map(&:name)
=> ["Shoreditch", "Upper Clapton", "Kingsland", "Lower Clapton", "Shacklewell", "Haggerston", "Clapton Park", "Homerton", "Hackney Wick", "South Hackney", "Dalston", "De Beauvoir Town", "Dalston Kingsland", "Brownswood Park", "Stoke Newington", "Stamford Hill", "Finsbury Park", "Clapton", "Hackney"]
>> homerton.ancestors.map(&:name)
=> ["Earth", "United Kingdom", "England", "Greater London", "London", "London Borough of Hackney"]
>> homerton.parent.descendants.count
=> 5038
>> homerton.parent.descendants.in_use.count
=> 20

One thing is missing, which may or may not be an issue for you: bounding boxes and polylines for each place. You’ll have to use the Geoplanet live API or Flickr for that – the downloadable data provided by Yahoo doesn’t contain this information. But hopefully it soon will — Yahoo have said they will open up all their geodata by the end of 2010.

The Scruffying of Print

Saturday, November 7, 2009

Printers

The books and newspapers I read are, on the whole, designed by people pushing boxes around in expensive pieces of software. They’ve been carefully nudged, tweaked and adjusted to look Just Right. Designed, in the first order sense of the word.

Lots of the traditional print businesses are looking to print-on-demand for new business models and approaches. And rightly so – it’s exciting stuff. But if you’re going to, for example, tailor a portion of the content in your newspaper to the reader’s local area then you need to be able to automate that completely, because you can’t afford to tweak 300,000 newspapers individually.

The job of the designer becomes a second-order one — of creating templates, workflows and systems to support that — not pushing around boxes of text by hand.

Web designers are used to this. You quickly learn when making a website that coding each page by hand is impossible, and that you’re going to have to use layouts and templates. And these layouts and templates need to support headlines which are slightly longer than you might expect, or content that wraps in an unexpected place. So you design things with high tolerances and gentle failure modes which still look OK when everything isn’t quite as expected.

This is something that print is going to have to get used to. Slightly messier, with a few more bad line breaks and unbalanced columns. I don’t think readers are going to have any trouble with this – if anything people are getting used to the scruffy nature of the web, and they definitely don’t care as much about design as designers think they do. It’s the producers and editors that are going to have to get used to letting go, giving a little bit less time to perfecting visual design and a bit more time to making sure the content shines through.