Sunday, July 07, 2013

Visualizing SPARQL end point results with d3.js (but not yet for #WikiPathways)

[Screenshot: SVG pie chart visualization of SPARQL end point results in Firefox with the Web Console.]
SPARQL and RDF are very quickly becoming the (Open) standard for linking and accessing databases. Readers of my blog know I have been exploring the corners of what can and cannot be achieved with this for some time now.

Triggered by some nice visualization work at the BioHackathon on ChEMBL content, I picked up visualization of RDF data again (see this 2010 post where I asked people to visualize data using SPARQL). And since d3.js is cool nowadays (it was processing.js in the past), I had a go at the learning curve.

I started with a pie chart and this example code, because I was working on the SPARQL queries for metabolites in WikiPathways (using Andra's important WP-RDF work, doi:10.1038/npre.2011.6300.1).

Because SPARQL end points can return results in many formats, there are many ways to hook them up. CSV (MIME type text/csv) seems to be the simplest, and because my SPARQL query has all the semantics already, I am happy with simple column headers like source and count.
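
Just to show what that looks like in practice, here is a minimal sketch of fetching CSV from a SPARQL end point with a plain XMLHttpRequest; the end point URL and query below are placeholders, and it assumes the end point honors content negotiation via the Accept header (and allows cross-origin requests):

var endpoint = "http://example.org/sparql"; // placeholder end point URL
var query = "select * where { ?s ?p ?o } limit 5"; // any SPARQL query
var xhr = new XMLHttpRequest();
xhr.open("GET", endpoint + "?query=" + encodeURIComponent(query));
xhr.setRequestHeader("Accept", "text/csv"); // ask for CSV results
xhr.onload = function() {
  console.log(xhr.responseText); // the raw CSV text, one row per line
};
xhr.send();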

So, the first step is the SPARQL query, and I am asking here for the number of unique identifiers per database (note the ?source and ?count variable names, which come back as column headers in the CSV and in the JavaScript below):

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>

select
  (str(?datasource) as ?source)
  (count(distinct ?identifier) as ?count)
where {
  ?mb a wp:Metabolite ;
    dc:source ?datasource ;
    dc:identifier ?identifier .
}
group by ?datasource
order by desc(?count)

When I ask for the results as CSV, they look like this (note the column headers):

"source","count"
"HMDB",524
"Kegg Compound",389
"CAS",267
"ChEBI",240
"Entrez Gene",143
"PubChem-compound",87
"Wikipedia",8
"PubChem-substance",8
"ChemIDplus",7
"Chemspider",6
"Ensembl",4
"ChEMBL compound",2
"3DMET",1
"TAIR",1
"LIPID MAPS",1

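As a side note, d3 parses this kind of CSV into an array of JavaScript objects, keyed by the column headers, and with all values as strings; that is why the code below converts the count column with a + sign. A minimal sketch using d3.csv.parse() from d3 v3:

var csv = '"source","count"\n"HMDB",524\n"Kegg Compound",389';
var rows = d3.csv.parse(csv); // the first row provides the object keys
// rows == [ { source: "HMDB", count: "524" },
//           { source: "Kegg Compound", count: "389" } ]
console.log(+rows[0].count + +rows[1].count); // 913, after + converts the strings
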
Then, the HTML/JavaScript code I am using is based on the example code linked earlier:

<!DOCTYPE html>
<body>
<script src="d3.v3.js"></script>
<script>

// Size of the SVG and the radius of the pie chart.
var width = 300,
    height = 300,
    radius = Math.min(width, height) / 2;

// A categorical color scale with 20 colors.
var color = d3.scale.category20();

// The arc generator that draws each pie slice.
var arc = d3.svg.arc()
  .outerRadius(radius - 10)
  .innerRadius(0);

// The pie layout: slice sizes are taken from the "total" field.
var pie = d3.layout.pie()
  .sort(null)
  .value(function(d) { return d.total; });

// Create the SVG element and a group <g> moved to the center of the chart.
var svg = d3.select("body").append("svg")
  .attr("width", width)
  .attr("height", height)
  .append("g")
  .attr(
    "transform",
    "translate(" + width / 2 + "," + height / 2 + ")"
  );

d3.csv("data.csv", function(data) {

  // Convert the "count" column (a string) into a number.
  data.forEach(function(d) {
    d.total = +d.count;
  });

  // One group per pie slice.
  var g = svg.selectAll(".arc")
    .data(pie(data))
    .enter().append("g")
    .attr("class", "arc");

  // The slice itself, colored by the "source" column.
  g.append("path")
    .attr("d", arc)
    .style("fill", function(d) {
      return color(d.data.source); }
    );

  // A label at the centroid of the slice, but only for the larger slices.
  g.append("text")
    .attr("transform", function(d) {
      return "translate(" + arc.centroid(d) + ")";
    })
    .attr("dy", ".35em")
    .style("text-anchor", "middle")
    .text(function(d) {
      if (d.data.count > 40) {
        return d.data.source;
      } else {
        return "";
      }
    });
});
</script>
</body>

I cannot say I understand 100% of this code yet, but here goes: the first half defines some parameters, like how large the pie chart is, which colors to use, etc. The source and count names in the code are the column names from the CSV output. One of the more puzzling things was the code that calculates the slice sizes, so I renamed the converted "count" field to "total" there to make it more clear to me.

The d3.csv() method loads the data from a CSV file or stream. The stream is requested by replacing the file name (viz. data.csv) with the full URL of the SPARQL query, such as this one for the running example. This URL contains the bit "&format=text/csv", which ensures the SPARQL end point returns the data in CSV format; changing this to "&format=text/html" returns an HTML version of the results instead. The results are shown at the top right of this blog post.
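
In code, that comes down to something like this sketch; the base URL below is a placeholder for the real end point, but the query and the "&format=text/csv" bit are the ones from this post:

// The same d3.csv() call, but pointed at a SPARQL end point instead of
// a local file. The base URL is a placeholder; "&format=text/csv" asks
// Virtuoso for CSV output.
var endpoint = "http://example.org/sparql"; // placeholder end point URL
var query = "prefix wp: <http://vocabularies.wikipathways.org/wp#> " +
  "prefix dc: <http://purl.org/dc/elements/1.1/> " +
  "select (str(?datasource) as ?source) (count(distinct ?identifier) as ?count) " +
  "where { ?mb a wp:Metabolite ; dc:source ?datasource ; dc:identifier ?identifier . } " +
  "group by ?datasource order by desc(?count)";
var url = endpoint + "?query=" + encodeURIComponent(query) + "&format=text/csv";
d3.csv(url, function(data) {
  // ... the same drawing code as above ...
});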

However, you may now be wondering why I load the data from a file in the above example, rather than from the stream. Well, it turns out there is something wrong with the Virtuoso 6.1 installation (or that version) behind the WikiPathways SPARQL end point: it does not properly complete the stream. Indeed, my Firefox reports a partial download, even though the content is in fact complete. I tried it with DBPedia instead, which does close the stream, and there it does work directly from the SPARQL end point:

[Screenshot: the pie chart rendered directly from the DBPedia SPARQL end point.]
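
While waiting for that upgrade, d3 v3 also accepts a two-argument callback that reports request errors separately from the parsed rows, which at least makes such loading problems easier to spot. A minimal sketch, reusing the url variable from the sketch above; whether the truncated WikiPathways response actually surfaces as an error here is an assumption on my part:

d3.csv(url, function(error, data) {
  if (error || !data) {
    console.log("could not load the CSV results", error);
    return; // do not try to draw from incomplete data
  }
  // ... the same drawing code as above ...
});
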
But that problem aside, which I expect will resolve itself when the Virtuoso installation is upgraded: mission accomplished.