Using Geoplanet Data in Ruby on Rails

The Unrendered City is Here for You to Use

Noticings is possibly one of the first services to integrate the Yahoo Geoplanet Data deeply, although it seems we can now add Twitter to the list. I imagine we’ll see a few more services begin to use it soon – Yahoo have released it under a Creative Commons Attribution license, and if Twitter are using it then a whole bunch of things are going to spring up around that.

It gives us the opportunity to use colloquial geography rather than bounding boxes and radial searches and the like. I banged on about this in my talk at the AGI conference recently. I am such a geography bore.

Anyway, we couldn’t have built Noticings without it.

However, it is a little bit difficult to get up and running at first, so I want to delve a bit into how we’re using it, hopefully helping others get rolling a bit quicker. Noticings is written in Ruby on Rails, but I’m sure the same principles apply to whatever you’re framework/language you’re using.

First, some background. Geoplanet is a database of 5.4 million places in a hierarchy. Each entry has a unique, permanent ID (WOEID), a name and a place type. For example, Homerton (20089379) is a Suburb in the London Borough of Hackney (12695808), which is a LocalAdmin in London (44418), which is a Town in Greater London (23416974), and so on.

Once you’ve discounted the place types that Flickr doesn’t use for associating photos with (there are huge numbers of zip codes and telephone dialling zones, for example), then there are about 1.4 million places that Noticings cares about.

The Geoplanet download contains three tab-separated files. Places, which does what it says. Aliases, which contains alternate language names for each place. Adjacencies, which contains info about which places are adjacent to each other (although not necessarily geographically continuous).

There are three tables in our database, one for each of these:

create_table "geoplanet_adjacencies", :force => true do |t|
  t.integer "woeid", :limit => 8
  t.string  "iso_code"
  t.integer "neighbour_woeid", :limit => 8
  t.string  "neighbour_iso_code"

add_index "geoplanet_adjacencies", ["woeid"], :name => "index_geoplanet_adjacencies_on_woeid"

create_table "geoplanet_aliases", :force => true do |t|
  t.integer "woeid", :limit => 8
  t.string  "name"
  t.string  "name_type"
  t.string  "language_code"

add_index "geoplanet_aliases", ["woeid"], :name => "index_geoplanet_aliases_on_woeid"

create_table "geoplanet_places", :force => true do |t|
  t.integer "woeid", :limit => 8
  t.integer "parent_woeid", :limit => 8
  t.string  "country_code"
  t.string  "name"
  t.string  "language"
  t.string  "place_type"
  t.string  "ancestry"

add_index "geoplanet_places", ["ancestry"], :name => "index_geoplanet_places_on_ancestry"
add_index "geoplanet_places", ["parent_woeid"], :name => "index_geoplanet_places_on_parent_woeid"
add_index "geoplanet_places", ["woeid"], :name => "index_geoplanet_places_on_woeid", :unique => true

And there’s rake task which handles the import. This takes ages. In addition to the 5.4 million places, there are about 2 million aliases and 8.4 million adjacencies. Go and make several cup of teas if you’re running this. Do the crossword too.

namespace :geoplanet do
  DATA_PATH = File.join(Rails.root, 'data', 'geoplanet', 'geoplanet_data_7.3.2')
  namespace :import do
    task :all => [:places, :aliases, :adjacencies]
    task :places => :environment do
      ActiveRecord::Base.connection.execute("TRUNCATE TABLE geoplanet_places")
      ActiveRecord::Base.connection.execute("ALTER TABLE geoplanet_places DISABLE KEYS")
      ActiveRecord::Base.connection.execute("LOAD DATA LOCAL INFILE '#{DATA_PATH}/geoplanet_places_7.3.2.tsv' REPLACE INTO TABLE geoplanet_places FIELDS TERMINATED BY '\t' OPTIONALLY ENCLOSED BY '"' IGNORE 1 LINES (woeid, country_code, name, language, place_type, parent_woeid);")
      ActiveRecord::Base.connection.execute("ALTER TABLE geoplanet_places ENABLE KEYS")

    task :aliases => :environment do      
      ActiveRecord::Base.connection.execute("TRUNCATE TABLE geoplanet_aliases")      
      ActiveRecord::Base.connection.execute("ALTER TABLE geoplanet_aliases DISABLE KEYS")      
      ActiveRecord::Base.connection.execute("LOAD DATA LOCAL INFILE '#{DATA_PATH}/geoplanet_aliases_7.3.2.tsv' REPLACE INTO TABLE geoplanet_aliases FIELDS TERMINATED BY '\t' OPTIONALLY ENCLOSED BY '"' IGNORE 1 LINES (woeid, name, name_type, language_code);")
      ActiveRecord::Base.connection.execute("ALTER TABLE geoplanet_aliases ENABLE KEYS")

    task :adjacencies => :environment do
      ActiveRecord::Base.connection.execute("TRUNCATE TABLE geoplanet_adjacencies")
      ActiveRecord::Base.connection.execute("ALTER TABLE geoplanet_adjacencies DISABLE KEYS")
      ActiveRecord::Base.connection.execute("LOAD DATA LOCAL INFILE '#{DATA_PATH}/geoplanet_adjacencies_7.3.2.tsv' REPLACE INTO TABLE geoplanet_adjacencies FIELDS TERMINATED BY '\t' OPTIONALLY ENCLOSED BY '"' IGNORE 1 LINES (woeid, iso_code, neighbour_woeid, neighbour_iso_code);")
      ActiveRecord::Base.connection.execute("ALTER TABLE geoplanet_adjacencies ENABLE KEYS")

We’ve also got three models — one for each of the tables.

class GeoplanetPlace < ActiveRecord::Base
  set_primary_key 'woeid'
  has_many :aliases, :class_name => 'GeoplanetAlias', :foreign_key => 'woeid'  has_many :adjacencies, :class_name => 'GeoplanetAdjacency', :foreign_key => 'woeid'  has_many :adjacent_places, :through => :adjacencies

class GeoplanetAlias < ActiveRecord::Base
  belongs_to :geoplanet_place, :foreign_key => 'woeid', :primary_key => 'woeid'

class GeoplanetAdjacency < ActiveRecord::Base
  belongs_to :place, :class_name => 'GeoplanetPlace', :foreign_key => 'woeid', :primary_key => 'woeid'  belongs_to :adjacent_place, :class_name => 'GeoplanetPlace', :foreign_key => 'neighbour_woeid', :primary_key => 'woeid'

Now we’ve got a usable copy of Geoplanet, and we can do things like:

london = GeoplanetPlace.find_by_name "London"
londons_children = GeoplanetPlace.find(:all, :conditions => { :parent_woeid => })

Which is all well and good, but if you want to fetch the second level children you have to add a JOIN, and another for third level children. It quickly becomes slow and impossible.

We need to start caching the tree for each row somehow, making it easier and quicker to find children, siblings and ancestors.

Step forward Ancestry, a plugin by Stefan Kroes for organising ActiveRecord models in a tree structure. It store the ancestors for each row in a string, using the ‘LIKE‘ statement to SELECT them.

>> homerton = GeoplanetPlace.find_by_name "Homerton"
>> homerton.parent
>> homerton.parent.parent

Ancestry adds a class method called build_ancestry_from_parent_ids!, for transforming a more traditional parent_id tree structure into the format for Ancestry. That’s what we’ve got here, except the parent field is called parent_woeid. In our case Earth is the root of the tree, and has a parent_woeid of -1, not nil.

By overriding that method in GeoplanetPlace we can convert the Geoplanet tree structure into something usable by Ancestry.

def self.build_ancestry_from_parent_ids! parent_id = nil, ancestry = nil
  parent_id = parent_id || -1
  self.base_class.all(:conditions => {:parent_woeid => parent_id}).each do |node|
    node.without_ancestry_callbacks do 
      node.update_attribute ancestry_column, ancestry
    build_ancestry_from_parent_ids!, if ancestry.nil? then "#{}" else "#{ancestry}/#{}" end

This takes bloody ages. About 4 hours on my laptop. You can go to bed now if you like.

To save you all that here’s a prebuilt SQL dump of all three tables (156MB gzipped SQL), ready for import. It’s built from version 7.4.0 of the Geoplanet Data, but you should check whether that’s the latest.

And once you’ve done all that you’ll be able to do things like this in a blink of an eye:

# in use is a named_scope in GeoplanetPlace with conditions on place_type
["Shoreditch", "Upper Clapton", "Kingsland", "Lower Clapton", "Shacklewell", "Haggerston", "Clapton Park", "Homerton", "Hackney Wick", "South Hackney", "Dalston", "De Beauvoir Town", "Dalston Kingsland", "Brownswood Park", "Stoke Newington", "Stamford Hill", "Finsbury Park", "Clapton", "Hackney"]
["Earth", "United Kingdom", "England", "Greater London", "London", "London Borough of Hackney"]
>> homerton.parent.descendants.count
>> homerton.parent.descendants.in_use.count

One thing is missing, which may or may not be an issue for you: bounding boxes and polylines for each place. You’ll have to use the Geoplanet live API or Flickr for that – the downloadable data provided by Yahoo doesn’t contain this information. But hopefully it soon will — Yahoo have said they will open up all their geodata by the end of 2010.