
forceTen geographic APIs, a simple example about RDF design

Posted by fabriziogiudici on December 2, 2009 at 3:22 AM EST
forceTen was born as the container of components for rendering geographic views and representing the related models for the geotagging capabilities of blueMarine; but it has also been reused in two more server-side projects, where a special focus has been given to the models.

The most trivial feature is the capability of managing accurate geo tags: for instance, no duplicates and no erroneous spelling when entering the name of a place. These features imply the need to keep a hierarchical structure of names, in order to disambiguate places that might have the same name, but are in different provinces or countries.

At first, it seemed that all that I needed was the use of a geocoding service. I started with GeoNames and was able to import hierarchies in my applications (a GeoCoder is the model behind the GeoExplorer of forceTen).



The GeoExplorer is the panel at the left side with flags. You can see it in action in a screencast.



The first trivial issue was the need for having a permanent connection to the GeoNames web service in order to operate. This was easily solved by caching queries - a trivial task with a REST web service.
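The caching idea can be sketched in a few lines. This is only an illustration, not forceTen's actual code: the class name CachingFetcher, the in-memory map and the fetch function standing in for the HTTP GET are all assumptions (a cache meant for offline use would have to be persisted).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper: remembers the response body for each request URL,
// so the remote service is contacted at most once per distinct query.
// Not thread-safe; an in-memory sketch only.
class CachingFetcher {
    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> remote; // the real HTTP GET would go here
    private int remoteCalls = 0;

    CachingFetcher(Function<String, String> remote) {
        this.remote = remote;
    }

    String get(String url) {
        // computeIfAbsent invokes the remote function only on a cache miss
        return cache.computeIfAbsent(url, u -> {
            remoteCalls++;
            return remote.apply(u);
        });
    }

    int remoteCalls() {
        return remoteCalls;
    }

    public static void main(String[] args) {
        // The lambda fakes the web service; example.org is a placeholder URL
        CachingFetcher fetcher = new CachingFetcher(url -> "response for " + url);
        fetcher.get("http://example.org/query?name=Recco");
        fetcher.get("http://example.org/query?name=Recco"); // served from the cache
        System.out.println("remote calls: " + fetcher.remoteCalls());
    }
}
```

Since a REST query is fully described by its URL, the URL itself works as the cache key.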

The second, less trivial issue was a certain lack of coherence in GeoNames data: for instance, looking at how Italian regions are named, you discover that some are named in plain Italian (e.g. “Liguria”), others have the “Regione” prefix (e.g. “Regione Lombardia”), others have the English name (e.g. “Tuscany”). GeoNames also provides localized names and with some work this part can be normalized - but in the end, in at least one customer scenario the capability to customize the display name of the entity was requested. This introduced for the first time the idea that you can't just import an external database: you need at least some overriding information in local storage, bound to the external data.

Another issue was the capability to add more locations than those present in GeoNames. While the list of places up to the municipality level seems to be pretty complete, there are other named geo entities around: for instance, the name of a mountain, or the mouth of a river; inside a town, a customer needs to geotag at the building level and even finer (e.g. the “Galleria degli Uffizi” museum in Florence and “Sala del Caravaggio”, a specific room inside of it). This introduced the idea that the local data aren't just plain properties attached to GeoNames data, but a complete new hierarchy.

The last problem showed up late, but it should have been clear from the beginning. If you are creating a long-lived geotag archive, you must consider that geopolitics are mutable. For instance, in the last ten years many new provinces have been introduced in Italy, meaning that some subtrees of the hierarchy have been re-parented. If you keep the GeoNames cache forever, you'll never see these updates; if you expire it periodically, you have to re-parent your data structure, which is an annoyance.

One more thing: I didn't have any explicit request to change the underlying geocoding service (e.g. to use Yahoo! in place of GeoNames), but it might happen. So a well designed, reusable component library should be able to work with multiple geocoders - maybe at the same time.

This led me to the design of the new APIs of forceTen. The idea is to keep two concepts separate:
  • the GeoLocation that you use to tag your data
  • the GeoCoder data (just to give names, I call each node in a GeoCoder hierarchy a GeoCoderEntity).
GeoLocations are under your control: you can create and destroy them, give them names, bind them to your data, and possibly create hierarchies where you explicitly need them (such as in the example of the Galleria degli Uffizi). The GeoCoder assists you in finding the initial names, the coordinates and other attributes, and gives the whole hierarchy structure; the important point is that GeoCoder data is “attached” to your GeoLocations, which is a reversed view of the original idea.


The information about which is the parent of “Firenze” is in the GeoCoderEntity hierarchy, not in GeoLocations.

This means that you can strip the GeoCoder data at any time, and preserve your GeoLocations (and bound information) as they are; and later re-attach GeoCoder data to GeoLocations, for instance by matching coordinates. Of course, this means that you can attach data from a different GeoCoder too, and if the GeoCoder hierarchy has changed, this is not a problem of your GeoLocations. All GeoLocations are locally persisted, while only the strictly needed GeoCoderEntities behind them are locally persisted as a cache, for performance reasons and to support offline operations. The GeoCoder, at this point, becomes an implementation detail, that I hide behind a more abstract “GeoSchema” concept.
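Re-attaching by coordinates could look like the sketch below. It is an illustration under stated assumptions, not the forceTen implementation: the class and method names are made up, the GeoLocation is assumed to have kept its own coordinates after the GeoCoder data was stripped, and the 1 km threshold is arbitrary. The two candidate entities are the ones that appear in the RDF dumps later in this post.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: pick the GeoCoderEntity closest to a GeoLocation's coordinates.
class ReattachByCoordinates {
    static final double EARTH_RADIUS_KM = 6371.0;

    // Great-circle distance (haversine formula) between two lat/long points, in km.
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    // Returns the id of the closest candidate within maxKm, or null if none is close enough.
    static String closest(double lat, double lon, Map<String, double[]> candidates, double maxKm) {
        String best = null;
        double bestKm = maxKm;
        for (Map.Entry<String, double[]> e : candidates.entrySet()) {
            double km = distanceKm(lat, lon, e.getValue()[0], e.getValue()[1]);
            if (km <= bestKm) {
                best = e.getKey();
                bestKm = km;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, double[]> entities = new LinkedHashMap<>();
        entities.put("http://sws.geonames.org/6540563/", new double[] {44.363244, 9.137166}); // Recco
        entities.put("http://sws.geonames.org/3176217/", new double[] {44.5, 9.0666667});     // Province of Genoa
        // Coordinates kept by a GeoLocation whose GeoCoder data was stripped (assumed values)
        System.out.println(closest(44.3632, 9.1372, entities, 1.0));
    }
}
```

Matching on coordinates alone is fragile for nested entities (a municipality and its province may share a centroid), so a real matcher would probably also compare names or feature types.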

Let's look at some RDF stuff. Suppose I need to geotag something to Polanesi, a small village on the eastern Riviera. The GeoNames hierarchy is: Earth / Europe / Italy / Regione Liguria / Provincia di Genova / Recco, to which I add / Polanesi as a custom leaf (it is not in the GeoNames database).

First I define a GeoSchema:

GeoSchemaManager schemaManager = GeoSchemaManager.Locator.findSchemaManager();
GeoSchema geoSchema = schemaManager.findSchemaByName("GeoNames");

At the first call, the above code stores the following statements into the local RDF store:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
        xmlns:geo="http://www.tidalwave.it/rdf/geo/2009/02/22#"
        xmlns:wgs84_pos="http://www.w3.org/2003/01/geo/wgs84_pos#"
        xmlns:skos="http://www.w3.org/2004/02/skos/core#"
        xmlns:owl="http://www.w3.org/2002/07/owl#"
        xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

<geo:schema rdf:about="http://www.geonames.org">
        <rdf:type rdf:resource="http://www.w3.org/2004/02/skos/core#ConceptScheme"/>
        <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#Thing"/>
        <skos:prefLabel>GeoNames</skos:prefLabel>
</geo:schema>

</rdf:RDF>

I remind you that RDF is an abstract model, not necessarily related to XML - in an RDF store, data are made persistent in some implementation-dependent format. The RDF/XML I'm using in this example is what I see when I dump the repository to a file.
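To make the point concrete, here is the same schema description written as plain subject / predicate / object statements, printed in an N-Triples-like form. The class and helper names are made up for the illustration, and literal escaping is deliberately ignored. Note that in RDF/XML a typed element such as geo:schema also implies an rdf:type statement, so the dump above corresponds to four statements, not three.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the GeoNames schema description as bare triples.
class SchemaTriples {
    static final String RDF  = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String SKOS = "http://www.w3.org/2004/02/skos/core#";
    static final String OWL  = "http://www.w3.org/2002/07/owl#";
    static final String GEO  = "http://www.tidalwave.it/rdf/geo/2009/02/22#";

    // One statement: subject and predicate are URIs, the object is a URI or a plain literal.
    // No escaping is applied to literals, which is fine only for simple strings.
    static String triple(String subject, String predicate, String object, boolean literal) {
        String o = literal ? "\"" + object + "\"" : "<" + object + ">";
        return "<" + subject + "> <" + predicate + "> " + o + " .";
    }

    static List<String> geoNamesSchema() {
        String s = "http://www.geonames.org";
        List<String> out = new ArrayList<>();
        out.add(triple(s, RDF + "type", GEO + "schema", false));          // implied by the <geo:schema> element
        out.add(triple(s, RDF + "type", SKOS + "ConceptScheme", false));
        out.add(triple(s, RDF + "type", OWL + "Thing", false));
        out.add(triple(s, SKOS + "prefLabel", "GeoNames", true));
        return out;
    }

    public static void main(String[] args) {
        for (String line : geoNamesSchema()) {
            System.out.println(line);
        }
    }
}
```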

Let's see where these statements originated from:
  • Since GeoNames is a web service, I decided to use its URL as the id of the entity which represents it.
  • If you read my previous post where I introduced SKOS, you should know that SKOS is a reusable ontology (i.e. a set of standard definitions) that, among other things, is well suited to represent hierarchies of concepts. The ConceptScheme definition in SKOS can be used to identify such a hierarchy. SKOS also defines a way to attach labels to concepts: skos:prefLabel (a thing that we can consider equivalent to a “display name”).
  • Furthermore, you also learned that OWL is another reusable ontology that provides the concept of “semantic equivalence” (same-as) and works with a base concept named Thing: you'll see that I need it later. 

Now I can get to Recco (the lowest level branch in GeoNames) in two ways: navigating a hierarchy, or querying by its coordinates (for instance because I clicked a pixel in a map). In the former case, my code is:

GeoLocation earth = geoSchema.getRoot();
GeoLocation europe = earth.findChildren(geoSchema).name("Europe").result();
GeoLocation italy = europe.findChildren(geoSchema).name("Italy").result();
GeoLocation liguria = italy.findChildren(geoSchema).name("Liguria").result();
GeoLocation provinciaDiGenova = liguria.findChildren(geoSchema).name("Genoa").result();
GeoLocation recco = provinciaDiGenova.findChildren(geoSchema).name("Recco").result();

or, as a quicker alternative:

GeoLocation earth = geoSchema.getRoot();
GeoLocation recco = earth.findChildren(geoSchema).path("/Europe/Italy/Liguria/Genoa/Recco").result();

In the latter case, the code is:

List<GeoLocation> results = geoSchema.findLocations().closeTo(new Coordinate(44.363244, 9.137166), km(1)).results();
GeoLocation recco = results.get(0);

The data from the GeoCoder get imported into the local RDF repository as:

<geo:entity rdf:about="http://sws.geonames.org/6540563/">
        <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#Thing"/>
        <rdf:type rdf:resource="http://www.w3.org/2004/02/skos/core#Concept"/>
        <skos:prefLabel>Recco</skos:prefLabel>
        <wgs84_pos:lat rdf:datatype="http://www.w3.org/2001/XMLSchema#double">44.363244</wgs84_pos:lat>
        <wgs84_pos:long rdf:datatype="http://www.w3.org/2001/XMLSchema#double">9.137166</wgs84_pos:long>
        <wgs84_pos:alt rdf:datatype="http://www.w3.org/2001/XMLSchema#double">0.0</wgs84_pos:alt>
        <geo:type>ADM3</geo:type>
        <skos:inScheme rdf:resource="http://www.geonames.org"/>
        <skos:broader rdf:resource="http://sws.geonames.org/3176217/"/>
</geo:entity>

As before, let's see where these statements originated from:
  • I used the same id (http://sws.geonames.org/6540563/) as the one defined by GeoNames, also because it's a real URL - connecting to it you can download some further RDF assertions about this data item. More about this later.
  • You already know what a Thing, a Concept and a skos:prefLabel are. What's new here are the three wgs84_pos:* statements: they are part of another standard ontology (Basic Geo) that allows you to work with some geographic concepts.
  • geo:type is a custom statement from forceTen, and stores an attribute coming from GeoNames, describing the level of the tree where the node is located: ADM3 means “administrative level 3”, which for Italy is a municipality.
  • skos:inScheme is another new thing, and means that this node is part of the GeoNames ConceptScheme that I previously defined (note the matching id http://www.geonames.org).
  • Finally, you should know from my previous post that skos:broader is a way to say that this node is a child of an upper level node.
What's http://sws.geonames.org/3176217/?

<geo:entity rdf:about="http://sws.geonames.org/3176217/">
        <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#Thing"/>
        <rdf:type rdf:resource="http://www.w3.org/2004/02/skos/core#Concept"/>
        <skos:prefLabel>Genoa</skos:prefLabel>
        <wgs84_pos:lat rdf:datatype="http://www.w3.org/2001/XMLSchema#double">44.5</wgs84_pos:lat>
        <wgs84_pos:long rdf:datatype="http://www.w3.org/2001/XMLSchema#double">9.0666667</wgs84_pos:long>
        <wgs84_pos:alt rdf:datatype="http://www.w3.org/2001/XMLSchema#double">0.0</wgs84_pos:alt>
        <geo:code>GE</geo:code>
        <geo:type>ADM2</geo:type>
        <skos:inScheme rdf:resource="http://www.geonames.org"/>
        <skos:broader rdf:resource="http://sws.geonames.org/3174725/"/>
        <skos:narrower rdf:resource="http://sws.geonames.org/6540563/"/>
</geo:entity>

It's the data for the Province of Genoa. We could recursively track parents until we get to the root, which represents the Earth.
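Tracking parents is just a loop over skos:broader links. The sketch below is illustrative only (class and method names are assumptions) and uses nothing but the two skos:broader statements shown in the dumps above, so the chain stops at http://sws.geonames.org/3174725/ simply because its own parent is not in the map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: follow skos:broader links upwards from a node.
class BroaderChain {
    // Returns the chain of ids starting with the given id and ending at the
    // last node for which no broader statement is known.
    static List<String> ancestors(String id, Map<String, String> broader) {
        List<String> chain = new ArrayList<>();
        String current = id;
        // The contains() check guards against accidental cycles in the data
        while (current != null && !chain.contains(current)) {
            chain.add(current);
            current = broader.get(current);
        }
        return chain;
    }

    public static void main(String[] args) {
        // child id -> parent id, taken from the two skos:broader statements above
        Map<String, String> broader = new HashMap<>();
        broader.put("http://sws.geonames.org/6540563/", "http://sws.geonames.org/3176217/");
        broader.put("http://sws.geonames.org/3176217/", "http://sws.geonames.org/3174725/");
        for (String uri : ancestors("http://sws.geonames.org/6540563/", broader)) {
            System.out.println(uri);
        }
    }
}
```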

So, the GeoCoder data needed to back our GeoLocations is now part of our local repository. The next time we refer to the same GeoCoderEntities, forceTen won't query the remote web service any longer, but will use the data in the local repository.

Now, let's look at what's happening with GeoLocations:

<geo:location rdf:about="urn:tidalwave:geo/location#7f3a0f10-de68-11de-8523-002332c672e6">
        <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#Thing"/>
        <rdf:type rdf:resource="http://www.w3.org/2004/02/skos/core#Concept"/>
        <owl:sameAs rdf:resource="http://sws.geonames.org/6540563/"/>
        <skos:prefLabel>Recco</skos:prefLabel>
</geo:location>

This is the GeoLocation representing Recco. It has its own id (urn:tidalwave:geo/location#7f3a0f10-de68-11de-8523-002332c672e6, using my own prefix plus a UUID, which is IMHO the most convenient way to generate internal IDs), and its skos:prefLabel - it's “Recco”, but I could change it later. The most important thing is the owl:sameAs statement, which makes it semantically equivalent to http://sws.geonames.org/6540563/, which - of course - is the way GeoNames represents Recco. This binds my GeoLocation to the GeoNames hierarchy.

If I call:

GeoLocation provinceOfGenoa = recco.findParent(geoSchema);

I will get the GeoLocation representing the Province of Genoa: forceTen has been able to navigate the hierarchy by looking at the (cached) GeoNames data. Note that in order to find a parent you need to specify a GeoSchema: in fact, it is the schema that defines the hierarchy (and multiple GeoSchemata could define different hierarchies). If, hypothetically, Recco were moved to another province, I could just strip the cache of GeoNames data and retrieve fresh data to get the update (supposing, of course, that GeoNames plays fair and doesn't change the ids of existing entities, which that service guarantees).
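Conceptually, the parent lookup is three hops: from the GeoLocation to its equivalent GeoCoderEntity (owl:sameAs), one level up (skos:broader), and back to the GeoLocation bound to that parent. The sketch below shows the idea with plain maps; it is not how forceTen is implemented, one instance stands for a single GeoSchema, and the id used for the Genoa GeoLocation is a made-up placeholder.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of parent navigation through the GeoCoder hierarchy of one schema.
class FindParent {
    final Map<String, String> sameAs = new HashMap<>();  // GeoLocation id -> GeoCoderEntity id
    final Map<String, String> broader = new HashMap<>(); // GeoCoderEntity id -> parent GeoCoderEntity id

    // Returns the id of the GeoLocation bound to the parent entity, or null when
    // the location has no equivalent entity, the entity has no known parent,
    // or no GeoLocation is bound to that parent.
    String findParent(String locationId) {
        String entity = sameAs.get(locationId);
        String parentEntity = (entity == null) ? null : broader.get(entity);
        if (parentEntity == null) {
            return null;
        }
        for (Map.Entry<String, String> e : sameAs.entrySet()) {
            if (e.getValue().equals(parentEntity)) {
                return e.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        FindParent store = new FindParent();
        String recco = "urn:tidalwave:geo/location#7f3a0f10-de68-11de-8523-002332c672e6";
        String genoa = "urn:example:location#genoa"; // placeholder id, not from the real repository
        store.sameAs.put(recco, "http://sws.geonames.org/6540563/");
        store.sameAs.put(genoa, "http://sws.geonames.org/3176217/");
        store.broader.put("http://sws.geonames.org/6540563/", "http://sws.geonames.org/3176217/");
        System.out.println(store.findParent(recco));
    }
}
```

The GeoLocations themselves never store who their parent is here, so replacing the broader map (a refreshed cache, or a different GeoCoder) changes the answer without touching them.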

Now I can create Polanesi:

GeoLocation polanesi = recco.createChild().
                             name(Locale.ITALIAN, "Polanesi").
                             code("XYZ").
                             coordinate(new Coordinate(44.36667, 9.11667)).build();

which appears in the RDF repository as:

<geo:location rdf:about="urn:tidalwave:geo/location#7f7f2e60-de68-11de-8523-002332c672e6">
        <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#Thing"/>
        <rdf:type rdf:resource="http://www.w3.org/2004/02/skos/core#Concept"/>
        <skos:prefLabel xml:lang="en">Polanesi</skos:prefLabel>
        <wgs84_pos:lat rdf:datatype="http://www.w3.org/2001/XMLSchema#double">44.36667</wgs84_pos:lat>
        <wgs84_pos:long rdf:datatype="http://www.w3.org/2001/XMLSchema#double">9.11667</wgs84_pos:long>
        <wgs84_pos:alt rdf:datatype="http://www.w3.org/2001/XMLSchema#double">0.0</wgs84_pos:alt>
        <geo:code>XYZ</geo:code>
        <skos:prefLabel>Polanesi</skos:prefLabel>
        <skos:broader rdf:resource="urn:tidalwave:geo/location#7f3a0f10-de68-11de-8523-002332c672e6"/>
</geo:location>

Note the skos:broader statement that makes it a child of Recco; furthermore, there is no owl:sameAs statement, as there's no equivalent in GeoNames. This is a piece of hierarchy that I'm maintaining on my own and that doesn't rely on any external GeoCoder data. That's also the reason why, in this case, all the attributes such as the coordinates or the code are stored within the GeoLocation, while they were previously inferred from the equivalent GeoCoderEntity.
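The two cases (attributes stored on the GeoLocation versus inferred from the equivalent GeoCoderEntity) can be sketched as a lookup with a fallback. The class below is an illustration with assumed names, storing property values as strings for brevity; it is not the forceTen model.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: read a property from the resource itself, else from its owl:sameAs target.
class AttributeLookup {
    final Map<String, Map<String, String>> properties = new HashMap<>(); // resource id -> property -> value
    final Map<String, String> sameAs = new HashMap<>();                  // GeoLocation id -> GeoCoderEntity id

    void put(String id, String property, String value) {
        properties.computeIfAbsent(id, k -> new HashMap<>()).put(property, value);
    }

    private String own(String id, String property) {
        Map<String, String> p = properties.get(id);
        return (p == null) ? null : p.get(property);
    }

    // Local statements win; the equivalent entity is consulted only when nothing is stored locally.
    String get(String id, String property) {
        String value = own(id, property);
        if (value == null && sameAs.containsKey(id)) {
            value = own(sameAs.get(id), property);
        }
        return value;
    }

    public static void main(String[] args) {
        AttributeLookup store = new AttributeLookup();
        String recco = "urn:tidalwave:geo/location#7f3a0f10-de68-11de-8523-002332c672e6";
        String polanesi = "urn:tidalwave:geo/location#7f7f2e60-de68-11de-8523-002332c672e6";
        String reccoEntity = "http://sws.geonames.org/6540563/";

        store.put(reccoEntity, "wgs84_pos:lat", "44.363244");
        store.put(recco, "skos:prefLabel", "Recco");
        store.sameAs.put(recco, reccoEntity);          // Recco is bound to GeoNames
        store.put(polanesi, "wgs84_pos:lat", "44.36667"); // Polanesi carries its own coordinates

        System.out.println(store.get(recco, "wgs84_pos:lat"));    // comes from the GeoCoderEntity
        System.out.println(store.get(polanesi, "wgs84_pos:lat")); // comes from the GeoLocation itself
    }
}
```

The same fallback order is what makes a customized display name possible: a skos:prefLabel stored on the GeoLocation hides the one coming from the GeoCoder.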

The final word is about URLs that can be dereferenced, which makes them good candidates for ids. I previously said that http://sws.geonames.org/6540563/ is a real URL as it references a real document (not all URL-shaped strings in a semantic database necessarily do that). If you point your browser to it, you'll see a fact-sheet HTML page. The most interesting thing occurs when you explicitly ask for an RDF document, for instance by using the curl command (needed because it makes it possible to specify the MIME type of the requested document):

% curl -I -H "Accept: application/rdf+xml" http://sws.geonames.org/6540563/
HTTP/1.1 303 See Other
Date: Tue, 01 Dec 2009 12:17:02 GMT
Server: Apache/2.2.10 (Linux/SUSE)
Location: http://sws.geonames.org/6540563/about.rdf
Vary: Accept-Encoding
Content-Type: text/html; charset=iso-8859-1


Note the “See Other” status and the “Location” header. Let's follow the suggestion:

% curl http://sws.geonames.org/6540563/about.rdf
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<rdf:RDF xmlns="http://www.geonames.org/ontology#" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:wgs84_pos="http://www.w3.org/2003/01/geo/wgs84_pos#">
<Feature rdf:about="http://sws.geonames.org/6540563/">
<name>Recco</name>
<featureClass rdf:resource="http://www.geonames.org/ontology#A"/>
<featureCode rdf:resource="http://www.geonames.org/ontology#A.ADM3"/>
<inCountry rdf:resource="http://www.geonames.org/countries/#IT"/>
<wgs84_pos:lat>44.363244</wgs84_pos:lat>
<wgs84_pos:long>9.137166</wgs84_pos:long>
<parentFeature rdf:resource="http://sws.geonames.org/3176217/"/>
<childrenFeatures rdf:resource="http://sws.geonames.org/6540563/contains.rdf"/>
<locationMap rdf:resource="http://www.geonames.org/6540563/recco.html"/>
</Feature>
</rdf:RDF>

Here you can see some further RDF assertions about Recco, directly provided by GeoNames. In this specific case, they don't provide any further information about Recco, but they could (e.g. the population or other facts). This is a good example of how a distributed repository of information can be created. If I export my RDF, thanks to the use of the real GeoNames URL for identifying Recco, other people can perform aggregate queries across my data and theirs. The same would happen, for instance, if they followed my scheme and used the same URL inside their own database.

Advanced topic. Since the above document is RDF/XML, I could have just imported it and embedded it into my store as is. Why didn't I do that, and instead wrote some specific code to convert it to my geo:entity? Because GeoNames doesn't use SKOS for mapping the hierarchy, but its own ontology with parentFeature and childrenFeatures. If they used SKOS, I would have saved some work.
Even more advanced topic: I *think* it could be possible to declare a semantic equivalence between skos:broader and parentFeature (I have more doubts about skos:narrower and childrenFeatures). If that were possible, I would really be able to save code and import the GeoNames RDF as is. But it's too advanced for my current semantic skills.
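To give an idea of what such conversion code involves, here is a minimal sketch that reads parentFeature from a trimmed copy of the about.rdf document and rewrites it as a skos:broader statement. It is not the forceTen converter: the class name is made up, and it treats the RDF/XML as plain XML with the JDK DOM parser, which only works for this particular serialization shape (a real RDF parser would be the robust choice).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch: map the GeoNames parentFeature property to skos:broader.
class ParentFeatureToBroader {
    static final String GN  = "http://www.geonames.org/ontology#";
    static final String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Returns the URI referenced by the first parentFeature element, or null if there is none.
    static String parentOf(String rdfXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the namespace-based lookups below
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(rdfXml.getBytes(StandardCharsets.UTF_8)));
            NodeList parents = doc.getElementsByTagNameNS(GN, "parentFeature");
            if (parents.getLength() == 0) {
                return null;
            }
            return ((Element) parents.item(0)).getAttributeNS(RDF, "resource");
        } catch (Exception e) {
            throw new IllegalStateException("Cannot parse RDF/XML", e);
        }
    }

    public static void main(String[] args) {
        // Trimmed copy of the document returned by GeoNames for Recco
        String rdfXml =
              "<rdf:RDF xmlns=\"http://www.geonames.org/ontology#\""
            + " xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">"
            + "<Feature rdf:about=\"http://sws.geonames.org/6540563/\">"
            + "<name>Recco</name>"
            + "<parentFeature rdf:resource=\"http://sws.geonames.org/3176217/\"/>"
            + "</Feature></rdf:RDF>";
        System.out.println("<http://sws.geonames.org/6540563/> "
            + "<http://www.w3.org/2004/02/skos/core#broader> "
            + "<" + parentOf(rdfXml) + "> .");
    }
}
```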

I must point out that while this distributed aspect is the true nature of the Semantic Web, it's just one more thing you can do with it. I find RDF exciting enough just as a local database, thanks to its greater flexibility when compared with a rigid SQL schema.



PS I must say I lied: Polanesi is indeed in the GeoNames database. It must have been added after I wrote the referenced code (which actually comes from actual tests of forceTen). The example, of course, is still valid if you think of Polanesi as any other location not present in the GeoNames database.
