The version I have contains a whopping 6.7 million. names of populated places, locations and all kinds of other things that have a name and location. It's great for putting place names on a map for example.
I've been playing with it a bit and wanted to load the dataset into a SQL database to make it easier to work with. Here's an example Ruby script to use Sequel to load it into a Sqlite database.
It should be trivial to make it work with MySQL or Postgres (or any of the other databases supported by Sequel) too. In theory it's just a matter of changing the "DB =" line. In practice you may have to change the schema too.
To make this code work you need Rubygems, Sequel and Sqlite installed. You can install sequel from a gem with "gem install sequel". For Fedora/Redhat at least, Sqlite and the ruby bindings are included - if you don't have it installed do "yum install ruby-sqlite" or "up2date -i ruby-sqlite" if you're on RHEL.
First "model.rb", a tiny helper file requiring rubygems and Sequel, and initializing the database:
require 'rubygems'
require 'sequel'
DB = Sequel("sqlite:///allCountries.db")
require 'geoname'
Here's the main script:
require 'model'
ARGF.each do |line|
g = nil
begin
g = GeoName.parse(line)
rescue Exception => e
puts "Exception during parsing: #{e.message}"
end
if g
g.save
else
puts "Unable to parse line: #{line.chomp}"
end
end
As you can see it's a pretty trivial little driver that just loads the data from a file provided to standard input or as an argument on the command line, tries to parse the line, and insert it into the database (g.save), with some minimal error handling.
The guts of the import is in the GeoName model class ("geoname.rb"), which I'll break in two:
require 'time'
class GeoName < Model
set_schema do
primary_key :geonameid
text :name
text :asciiname
text :alternatenames
real :latitude
real :longitude
text :feature
text :feature_code
text :country_code
text :cc2
text :admin1
text :admin2
text :admin3
text :admin4
integer :population
integer :elevation
integer :gtopo30
text :timezone
timestamp :modification_date
index :population
index :name
index [:country_code, :feature, :population]
index [:country_code, :feature_code, :population]
end
create_table if !table_exists?
This is a simple way of specifying the database schema straight in your model. Note that Sequel doesn't require this - it will be perfectly happy to just work with whatever is in your database, and you can also use migrations to give you more flexibility in changing the database schema when your application changes. In this case, though, I'm expecting to blow away the existing table whenever I load a new version of the file, so it's convenient to just have the schema in the same location.
The "index" statement does exactly what they say: They create indexes on the columns listed after them. The indexes above are the ones I found convenient for my use - change as you please.
The column names have a 1-to-1 correspondence with the column names in the Geonames file.
Note the last line. It does exactly what it says: It creates the table it the table doesn't already exists. Combined with Sqlite this lets me just blow away the database file ("allCountries.db" in the same directory) and rerun the script and the database and table is recreated.
Here's the rest of the code:
def self.parse line
g = GeoName.new
s = line.chomp.split("\t")
g.geonameid = s[0]
g.name = s[1]
g.asciiname = s[2]
g.alternatenames = s[3]
g.latitude = s[4]
g.longitude = s[5]
g.feature = s[6]
g.feature_code = s[7]
g.country_code = s[8]
g.cc2 = s[9]
g.admin1 = s[10]
g.admin2 = s[11]
g.admin3 = s[12]
g.admin4 = s[13]
g.population = s[14]
g.elevation = s[15]
g.gtopo30 = s[16]
g.timezone = s[17]
g.modification_date = Time.parse(s[18])
return g
end
end
Trivial, isn't it? It just splits up the line and assign each field to the appropriate column. You could use some hack to make it more compact, but for a simple script like this I prefer just keeping it simple and straightforward.
That's all there's to it - after that you can use the database (and the model class above), to do queries using Ruby. Here's an example using irb: