What is SPARQL?
by David Wood, Marsha Zaidman, Luke Ruth, and Michael Hausenblas, authors of Linked Data
SPARQL is the query language for RDF and Linked Data. SPARQL is to RDF data as the Structured Query Language (SQL) is to a relational database. SPARQL's name is nicely pronounceable and sounds interesting and fresh. SPARQL is interesting and fresh. This article, based on chapter 5 from Linked Data, shows you why.
SPARQL's name looks like an acronym, but the truth is the acronym was reverse engineered after the fact. The SPARQL Protocol and RDF Query Language is a recursive acronym in the tradition of the GNU (GNU's Not Unix) Project.
Like SQL, SPARQL is based on a widely implemented standard, but various vendors have extended the language to suit themselves or to show off particular features of their products. This article focuses on the standard language components, which is generally appropriate in any case: SPARQL implementations have not (yet) fragmented as much as SQL implementations.
NOTE The SPARQL Query Language for RDF is a rich language, far too complicated to describe in a single article. All the details may be found in the W3C Recommendation at http://www.w3.org/TR/rdf-sparql-query/. SPARQL version 1.1 is likely to become a W3C Recommendation shortly after this book is published; the specification for SPARQL version 1.1 is available at http://www.w3.org/TR/sparql11-query/.
Figure 1 W3C's SPARQL logo
SPARQL is defined by a family of W3C Recommendations and related Working Group Notes. The W3C SPARQL logo is shown in figure 1.
When you need SPARQL
SPARQL is best used when you want to query RDF graphs, as if one or more (possibly distributed) RDF graphs formed a database. Note that there are (many) native RDF databases, but you don't need to use one in order to query RDF using SPARQL.
Let's look at a sample query. We can query a real-world FOAF document for people that the FOAF document's owner knows.
Listing 1 SPARQL query to find FOAF friends
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix foaf: <http://xmlns.com/foaf/0.1/>
select ?name ?url
where {
  ?person rdfs:seeAlso ?url ;
          foaf:name ?name .
}
Look at the query in listing 1 and the RDF graph diagram in figure 2 at the same time. The query's WHERE clause defines “triple patterns,” ways of matching patterns within an RDF graph. We would like to query the graph in figure 2 (which represents the FOAF document for one of the authors).
The use of triple patterns in the WHERE clause is one of the largest differences one will notice from SQL. The other is probably the prefixes at the top of the query. Since RDF and Linked Data use (sometimes quite long) URIs as universal identifiers, we need a way to make our queries readable. Prefixes do that by mapping a short placeholder to a long URI that may then be used interchangeably in the rest of the query.
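To make the role of prefixes concrete, here is a sketch of the query from listing 1 written without them. It uses only the namespace URIs already declared in listing 1, with each prefixed name expanded to its full URI in angle brackets. Both forms should match the same triples; the prefixed version is simply easier to read.

```sparql
# Same query as listing 1, with rdfs:seeAlso and foaf:name
# expanded to the full URIs that the prefixes stand for.
select ?name ?url
where {
  ?person <http://www.w3.org/2000/01/rdf-schema#seeAlso> ?url ;
          <http://xmlns.com/foaf/0.1/name> ?name .
}
```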
Figure 2 has the triple patterns matched by the query highlighted in red. We wanted to find any resource (that we called ?person) that has a foaf:name property and an rdfs:seeAlso property. We expect the query to return the variables listed in the SELECT clause, which are the two variables ?name and ?url: the objects of the RDF statements that were matched by the query's WHERE clause.
We can see that query in action using the ARQ utility from the Apache Jena Project. ARQ is a SPARQL processor that may be used from a command line interface.
ARQ may be downloaded from the Jena project's site in the Apache Incubator (http://incubator.apache.org/jena/). Download and install ARQ in order to follow along with the examples.
Figure 2 Sample FOAF data showing the triple patterns matched by the SPARQL query in listing 1
Setting up ARQ is straightforward. A single environment variable is used to tell ARQ where its installation directory is located. Listing 2 shows how that is done on various operating systems.
Listing 2 Setting up ARQ
# For Unix-like systems, including Linux and OS X:
$ export ARQROOT='/Applications/ARQ-2.8.8'
$ /Applications/ARQ-2.8.8/bin/arq -h
# For Windows:
Put the SPARQL query from listing 1 into a file called foaf.rq (.rq is the standard file extension for SPARQL queries). Next, get some real FOAF data from (for example) http://3roundstones.com/dave/me.rdf and save it to a file called foaf.rdf. We can query those two files using ARQ. Listing 3 shows the relevant command line.
Executing ARQ as shown will result in the output shown in listing 4. ARQ will output its own textual representation of query results when run in a terminal. However, there are other result formats. ARQ's other output options may be found in the ARQ help.
Listing 3 Running a SPARQL query from the command line
$ /Applications/ARQ-2.8.8/bin/arq --query foaf.rq --data foaf.rdf
This query will return some number of people and their URLs. The exact number may change depending on when you make the query (because the file on the Web may change). Any people listed in the FOAF file with an rdfs:seeAlso URL and a foaf:name will be returned. To see how changes will impact your query results, edit the file foaf.rdf and either add some more people with those properties or change some of the existing data, then execute ARQ again.
Listing 4 Partial query results for the FOAF query
| name | url |
| "Michael Hausenblas" | <http://sw-app.org/foaf/mic.rdf> |
Here SPARQL is acting as a query language for RDF when the RDF data is in files. Later we will show how to use SPARQL to query RDF data on the Internet. First, though, we can demonstrate how to query an RDF graph that is built from multiple data sources.
SPARQL, unlike SQL, isn't limited to querying a single data source. We can use SPARQL to query multiple files, Web resources, databases, or a combination thereof. A simple example should make that clearer.
The personal information in a FOAF profile may be extended with, say, address information. A common way to represent address information in RDF is via the vCard vocabulary. vCard files are like virtual business cards. A minimal vCard address file that augments the sample FOAF data we have been working with is available at http://3roundstones.com/dave/vcard.ttl. That file is in Turtle format.
Take the vCard data from http://3roundstones.com/dave/vcard.ttl and put it into a file called vcard.ttl. Now we can run ARQ again with both the FOAF and vCard data acting as input, as shown in listing 5. Note the additional --data parameter. Make sure to put the contents of listing 6 into a file called foafvcard.rq.
This shows that RDF files may be combined, just like any other RDF graphs. Graphs of information combine well (unlike tables and trees). The magic is in the reuse of identifiers. Both files refer to the same URI identifying a person.
NOTE One of the primary assumptions of Linked Data is that two people using the same identifier are talking about the same thing. Reusing identifiers for resources allows data to be combined.
Listing 5 Running a SPARQL query from the command line with multiple data files
$ /Applications/ARQ-2.8.8/bin/arq --query foafvcard.rq --data foaf.rdf \
    --data vcard.ttl
Figure 3 shows the combined data graph created by merging the sample FOAF and vCard data. The triples highlighted in red match the triple patterns in the WHERE clause in listing 6.
Figure 3 Sample combined FOAF/vCard data showing the triple patterns matched by the SPARQL query in listing 6
Listing 6 shows a query against the combined FOAF and vCard graph. We wish to find the names of people and some address information associated with each person. The name comes from the sample FOAF data and the address information comes from the sample vCard data.
Listing 6 A SPARQL query that combines FOAF and vCard data
prefix foaf: <http://xmlns.com/foaf/0.1/>
prefix vcard: <http://www.w3.org/2006/vcard/ns#>
select ?name ?city ?state
where {
  ?person foaf:name ?name ;         #1
          vcard:adr ?address .      #2
  ?address vcard:locality ?city ;   #3
           vcard:region ?state .    #3
}
Note how the ?address variable is used. The first constraint in the WHERE clause matches a person with a foaf:name. The next line (#2 in the listing) matches the same person (that is, triples with the same subject) with a vcard:adr address. The address is a blank node, so it doesn't have an identifier, but we can get a handle on it by binding it to a variable called ?address. The ?address variable may then be used in the next two lines (#3) to find the city and state associated with that address. We don't care about the address per se. We are just “walking the graph” to ensure that the address we use is the same one that is associated with the person's name.
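Since the ?address variable is never returned, the same walk through the graph can also be sketched with SPARQL's square-bracket blank node syntax, which stands in for a node we do not need to name. This is an alternative phrasing of listing 6 under that assumption, not the form used in the book; the prefixes and properties are the ones from listing 6.

```sparql
prefix foaf: <http://xmlns.com/foaf/0.1/>
prefix vcard: <http://www.w3.org/2006/vcard/ns#>

# The [ ... ] block matches some node that is the object of vcard:adr
# and that has both a locality and a region, without naming it.
select ?name ?city ?state
where {
  ?person foaf:name ?name ;
          vcard:adr [ vcard:locality ?city ;
                      vcard:region ?state ] .
}
```

The explicit ?address form is easier to extend if you later decide to return or filter on the address node itself.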
The result of running the command in listing 5 (with the query in listing 6) is shown in listing 7. The name comes from the FOAF data, and the city and state come from the vCard data.
Listing 7 Query results for the combined FOAF and vCard query
| name | city | state |
| "David Wood" | "Fredericksburg" | "Virginia" |
Developers used to SQL might note that variable names in SPARQL's SELECT clause do not name variables to query from the database; they determine which variables used in the WHERE clause's triple patterns get returned in the output. That is confusing for some new users but makes sense once you wrap your mind around the concept of matching triple patterns against an RDF graph. The approach works even if the RDF graph is temporarily created for the purposes of satisfying the query!
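As a small illustration of that point, the following sketch narrows listing 6 so that only names are returned, while the address patterns still constrain which people match. It reuses the prefixes and properties from listing 6 and assumes nothing else.

```sparql
prefix foaf: <http://xmlns.com/foaf/0.1/>
prefix vcard: <http://www.w3.org/2006/vcard/ns#>

# Only ?name appears in the results. ?person, ?address, and ?city are
# still bound while the patterns are matched, but they are not returned.
select ?name
where {
  ?person foaf:name ?name ;
          vcard:adr ?address .
  ?address vcard:locality ?city .
}
```

A person with a foaf:name but no vcard:adr would not appear in the results, even though no address variable is selected.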
It is naturally possible to use SPARQL to query RDF on the Internet, too. Try modifying the query in listing 1 by adding a “FROM clause” after the SELECT clause. Add this line after the “select” line:
from <http://3roundstones.com/dave/me.rdf>
If you save the modified query into a file called livefoaf.rq (since .rq is the file extension for SPARQL queries), you can run the query as shown in listing 8.
Listing 8 Running a remote SPARQL query from the command line
$ /Applications/ARQ-2.8.8/bin/arq --query livefoaf.rq
The URL in the FROM clause points to David Wood's live FOAF file on the Internet. The query will return a larger number of people's names and rdfs:seeAlso URLs than the results shown in listing 4.
One of the things that make structured data on the Web interesting is that it is distributed, unlike a relational database where the data exists in a single system. SPARQL allows you to have multiple FROM clauses in a single query. Try it yourself by adding another FROM clause with a URL to another FOAF file to the query above (for example, Michael's FOAF URL is given in listing 4). Now you will see a listing of people's names and rdfs:seeAlso URLs from both David's and Michael's FOAF files. Try that with a relational database.
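A query with two FROM clauses might look like the following sketch. The first URL is the FOAF file mentioned earlier in this article and is assumed here to be the live file; the second is Michael's FOAF URL from listing 4. When several FROM clauses are given, the graphs they name are merged into the single graph that the WHERE clause is matched against.

```sparql
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix foaf: <http://xmlns.com/foaf/0.1/>

# Both documents are fetched and merged before the patterns are matched.
select ?name ?url
from <http://3roundstones.com/dave/me.rdf>
from <http://sw-app.org/foaf/mic.rdf>
where {
  ?person rdfs:seeAlso ?url ;
          foaf:name ?name .
}
```

Because the query names its own data sources, it can be run without any --data parameter, in the same way as listing 8.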
It is also possible to construct queries that provide detailed control over which triple patterns match which source graphs by naming graphs using SPARQL's FROM NAMED clause. The FROM NAMED clause operates much like the FROM clause, but allows data sources to be named via URIs and, therefore, referred to by those URIs in the WHERE clause. See the SPARQL version 1.1 specification for details (in the “RDF Dataset” section of that document, which is section 13 in the final Recommendation). While you are there, you might want to read the documentation on the SPARQL SERVICE keyword, which expands SPARQL's ability to control complex federated queries.
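A minimal sketch of FROM NAMED follows. It uses the same two FOAF URLs as assumptions (the file mentioned earlier in this article and Michael's URL from listing 4), and the variable name ?source is chosen here purely for illustration. The GRAPH keyword restricts a group of triple patterns to one named graph at a time and binds that graph's URI to the variable.

```sparql
prefix foaf: <http://xmlns.com/foaf/0.1/>

# Each result row reports which named graph the name was found in.
select ?source ?name
from named <http://3roundstones.com/dave/me.rdf>
from named <http://sw-app.org/foaf/mic.rdf>
where {
  graph ?source {
    ?person foaf:name ?name .
  }
}
```

Unlike plain FROM, the named graphs are kept separate rather than merged, so patterns written outside a GRAPH block would match nothing in this query.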
This article has introduced the SPARQL query language for RDF. SPARQL allows us to query the Web of Data as if it were a database, albeit a very large one with many distributed data sets.