SVG from stats software: the good, the bad and the ugly

What and why

If there could only be one file format to save your charts in, it should be SVG. It stands for So Very Good (not really; it’s Scalable Vector Graphics). But there’s good SVG and not so good SVG, and I want to explore that a little here, while also hopefully winning over some new SVG fans. Throughout, I’m going to use scatterplots of Ronald Fisher’s classic iris data, sepal length against sepal width.

Advantage one: it’s a vector format, so you can rescale it and it doesn’t get fuzzy or pixelated. Vector graphics are a series of instructions to the computer: put a line here and a circle there, etc. Rescale it and the image is re-drawn. The alternative is raster graphics, like .png, .jpg or the old .bmp, which tell the computer that this pixel is blue and that pixel is red and so on. If you rescale a raster image (and of course, web browsers on other people’s computers do that all the time no matter what you tell them to do) then you get fuzz where it has to combine or interpolate pixels.
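To make the “series of instructions” idea concrete, here is a minimal sketch in Python that writes a tiny SVG by hand. The function name tiny_svg, the shapes and the colours are made up for illustration; none of them come from the charts in this post. The point is that the drawing instructions are identical at both sizes: only the width and height on the outer element change, and the viewer re-draws everything to fit.

```python
def tiny_svg(width, height):
    """Return a hand-written SVG with one line and one circle.

    The shapes live in a fixed 100 x 100 coordinate system (the viewBox);
    width and height only say how large to draw it on screen.
    """
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}" viewBox="0 0 100 100">\n'
        '  <line x1="10" y1="90" x2="90" y2="90" stroke="black"/>\n'
        '  <circle cx="50" cy="50" r="5" fill="steelblue"/>\n'
        '</svg>\n'
    )


small = tiny_svg(100, 100)     # thumbnail
large = tiny_svg(1000, 1000)   # ten times bigger, same instructions

# Everything after the opening <svg ...> line is the same in both versions.
print(small.splitlines()[1:] == large.splitlines()[1:])
```

Save either string to a file ending in .svg and open it in a browser to see that the large one is just as crisp as the small one.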

Advantage two: you can read it, and edit it. PDF is a vector format (although it can have embedded rasters) but isn’t human-readable. Let’s open iris_stata.pdf in a text editor:

[Screenshot: iris_stata.pdf opened in a text editor]

Not helpful. And now the corresponding iris_stata.svg file:

[Screenshot: iris_stata.svg opened in a text editor]

Mmmm, human-readable; nice.

This is the reason why open data activists get furious when PDFs are dumped on a webpage by a government agency claiming to have fulfilled its open data obligations. You can see immediately that SVG is plain text (which is to say, some kind of ASCII / UTF-8 / Unicode text that you can read and indeed edit), whereas PDF is squashed down to a binary format. Also, SVG can be trimmed back to the bare essentials, while PDFs are full of weird and wonderful things that are not relevant to your chart (like the ability to store bookmarks, highlight text or digitally sign forms).
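Because SVG is plain text (XML, in fact), a few lines of code can read it as well as a human can. The following is a rough sketch using Python’s standard library; the SAMPLE string is a made-up fragment for illustration, not the actual output of any of the programs discussed here, and count_elements is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# A made-up SVG fragment, small enough to read in full.
SAMPLE = """<svg xmlns="http://www.w3.org/2000/svg" width="200" height="150">
  <rect x="0" y="0" width="200" height="150" fill="white"/>
  <circle cx="60" cy="80" r="4" fill="navy"/>
  <circle cx="120" cy="40" r="4" fill="navy"/>
  <text x="100" y="145">Sepal width</text>
</svg>"""


def count_elements(svg_text):
    """Count how many elements of each kind an SVG document contains."""
    root = ET.fromstring(svg_text)
    # Tags come back as "{namespace}name"; keep only the name part.
    return Counter(element.tag.split("}")[-1] for element in root.iter())


counts = count_elements(SAMPLE)
for name, n in sorted(counts.items()):
    print(name, n)
```

Trying the same thing on a PDF would need a dedicated PDF library, which is rather the point.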

Making them

However, you can write those SVG instructions in many ways. The nice example above came from Stata 14, but now let’s compare how different software writes out the same chart as SVG. We’ll get them all out and then delve into the SVG code. Stata first:

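* Load the iris data
webuse iris, clear
* Scatterplot with no gridlines, on a white background
scatter seplen sepwid, ylabel(,nogrid) graphregion(color(white))
* Export the same graph as SVG, PNG and PDF
graph export "iris_stata.svg", replace
graph export "iris_stata.png", replace
graph export "iris_stata.pdf", replace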

Here’s the raster png:

[Image: iris_stata]

Being able to export to SVG is new as of version 14, and it was undocumented! Thanks to Tim Morris for discovering it. Let’s not question how he found it. The file you get is here. (Then it became an official feature of Stata 15, out a month or two back. I’m generating output here from 14 but I’m going to come back to 15 at the end.) I invite you to open these in your text editor as well as your browser. (Browsers will display SVGs, but if you find them interesting and potentially useful, you should get a vector graphics editor too. Good options are Inkscape (free) or Adobe Illustrator (not free).)

Next up, R. Let’s use the built-in svg() device (from the grDevices package) first.

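data(iris)
sw <- iris$Sepal.Width    # x-axis: sepal width
sl <- iris$Sepal.Length   # y-axis: sepal length
svg("iris_Rsvg.svg")      # open the built-in SVG device
plot(sw, sl)
dev.off()                 # close the device so the file is written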

You get this SVG file and this PNG:

[Image: iris_Rsvg]

But you can also install the svglite package.

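data(iris)
sw <- iris$Sepal.Width    # x-axis: sepal width
sl <- iris$Sepal.Length   # y-axis: sepal length
svglite::svglite("iris_Rsvglite.svg")   # open the svglite device instead
plot(sw, sl)
dev.off()                 # close the device so the file is written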

You might see a ‘fontconfig warning’ (I do), but it runs and you get this SVG file.

As a final option, any software that spits out PDFs can be useful after a fashion, because you could open them in Inkscape/Illustrator and then save them as SVG. So, in SPSS, we could get this PDF file and this PNG:

[Image: iris-spss]

When you open it in Inkscape, you can save it as “Plain SVG” and get this file, or as “Inkscape SVG” and get this file.

The circle markers are represented by paths, when circles would be much more compact. Presumably Inkscape can’t tell from the PDF that they are circles and therefore a special kind of path. You could write a little program to do some simplification like this (because SVG is just a plain text file), but it would be tedious to get it right each time you fed in a different chart. Interestingly though, if you open the simple Stata SVG in Inkscape and then save it as “Inkscape SVG”, you’ll find none of that complexity afterwards, so it is in the PDF-to-SVG conversion that things get tangled up.

Alternatively, you could save your chart from SPSS in .eps (Encapsulated PostScript) format, which is another old vector format. When converting from EPS, the Inkscape and plain SVGs are a bit different but not much simpler. I omit the .eps for download here because it’s just short of 1MB, but here’s the plain SVG that results. The SVG file sizes vary according to complexity. Stata is smallest at 18KB, R svglite is 25KB, R svg 70KB, SPSS-Inkscape plain SVG 77KB, and SPSS-Inkscape SVG 88KB.

But there’s no need for that complexity; it’s just bloat added around the core components of the chart. And although none of these files is big enough to trouble you, why waste space if you don’t have to? More importantly, it is squandering one of the great features of SVG if it’s no longer easily human-readable.

Embedding fonts

One of the big causes of the bloat is embedding fonts. Suppose I make a chart on an Apple OS X computer and want to send it to you on a Windows one. I have Helvetica, which comes with OS X, and you don’t. I can either embed it in the SVG, which takes up more space but means that when you open the file, it will look just as I intended, or I can just stick text in there and call it Helvetica, in which case it will look right if you have paid up for that ubiquitous font, and different otherwise. It’ll probably turn into Arial. Embedding works by defining the exact shape of the characters used in the chart, then placing them where needed like a rubber stamp. I don’t know about the legal niceties of embedding a copyrighted font.

Unpacking the SVG files

Now let’s take a look inside.

Starting with the simplest, the Stata SVG, we find 184 lines of code, which is pretty slim considering there are 150 observations to be drawn. After minimal preliminaries, it draws a rect for the background (what Stata calls the graphregion), then another rect for the background inside the axes (the plotregion). Then, if you have gridlines, they come in with one line of code for each, and then the circles. Then the line that is the y-axis, then its ticks (lines), labels and title (text), then the same for the x-axis. Essentially, all the contents from background to foreground. Here’s the snapshot again:

[Screenshot: iris_stata.svg opened in a text editor]
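If you want to get a feel for that background-to-foreground ordering, here is a rough sketch in Python that builds a bare-bones scatterplot in the same sequence. It is only an imitation of the layering described above, not Stata’s actual output: the function name scatter_svg, the sizes, the colours and the lack of ticks and labels are all my own simplifications. The four data points are the first four rows of the iris data (sepal width, sepal length).

```python
def scatter_svg(points, width=400, height=300, margin=40):
    """Return a minimal scatterplot as SVG text, drawn background to foreground.

    Assumes at least two distinct x values and two distinct y values.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    left, right = margin, width - margin
    top, bottom = margin, height - margin

    def px(x):  # data x -> pixel x
        return left + (x - x_min) / (x_max - x_min) * (right - left)

    def py(y):  # data y -> pixel y (SVG y grows downwards, so flip it)
        return bottom - (y - y_min) / (y_max - y_min) * (bottom - top)

    mid_x, mid_y = width // 2, height // 2
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">',
        # 1. background of the whole graph
        f'<rect x="0" y="0" width="{width}" height="{height}" fill="white"/>',
        # 2. background inside the axes
        f'<rect x="{left}" y="{top}" width="{right - left}" '
        f'height="{bottom - top}" fill="white"/>',
    ]
    # 3. one circle per observation
    for x, y in points:
        parts.append(f'<circle cx="{px(x):.1f}" cy="{py(y):.1f}" r="4" fill="navy"/>')
    # 4. y-axis line, then its title as plain text
    parts.append(f'<line x1="{left}" y1="{top}" x2="{left}" y2="{bottom}" stroke="black"/>')
    parts.append(f'<text x="12" y="{mid_y}" text-anchor="middle" '
                 f'transform="rotate(-90 12 {mid_y})">Sepal length</text>')
    # 5. x-axis line, then its title
    parts.append(f'<line x1="{left}" y1="{bottom}" x2="{right}" y2="{bottom}" stroke="black"/>')
    parts.append(f'<text x="{mid_x}" y="{height - 10}" text-anchor="middle">Sepal width</text>')
    parts.append('</svg>')
    return "\n".join(parts)


first_rows = [(3.5, 5.1), (3.0, 4.9), (3.2, 4.7), (3.1, 4.6)]
svg_text = scatter_svg(first_rows)
print(svg_text)
```

One element per line, in a predictable order, is what makes a file like this easy to read and easy to edit afterwards.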

Good features: no wasted code; the components of the chart appear in a consistent order; and using text for labels is way better than embedding fonts (unless you really want to).

Bad features: no id or class info makes it hard to see what is the x-axis, for example, until you realise there’s a fixed order to the components.

This is what I mean by id and class: instead of just having a line for a circle that states its centre co-ordinates cx and cy, its radius r, its fill and stroke colors and stroke width, you could also add class="datamarker" (for example) and id="weird_outlier_3" (for example). Those don’t affect what’s drawn at all, but they function like little bookmarks to help you find your way to things of interest, and they come into their own if you want to use JavaScript to manipulate the SVG interactively (that’s another story though, too much for this post).
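Adding those bookmarks after the fact is not hard, precisely because the file is plain text. Here is a minimal sketch in Python, assuming the markers are plain circle elements; the function name tag_circles, the made-up SAMPLE input and the marker_1, marker_2 numbering scheme are mine, purely for illustration.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
# Write elements back out as plain <circle>, not <ns0:circle>.
ET.register_namespace("", SVG_NS)

# A made-up two-marker SVG standing in for a real chart.
SAMPLE = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="200" height="150">'
    '<circle cx="60" cy="80" r="4" fill="navy" stroke="navy" stroke-width="1"/>'
    '<circle cx="120" cy="40" r="4" fill="navy" stroke="navy" stroke-width="1"/>'
    '</svg>'
)


def tag_circles(svg_text, css_class="datamarker"):
    """Give every circle a shared class and a numbered id."""
    root = ET.fromstring(svg_text)
    circles = root.iter(f"{{{SVG_NS}}}circle")
    for n, circle in enumerate(circles, start=1):
        circle.set("class", css_class)     # same label on every marker
        circle.set("id", f"marker_{n}")    # unique label per marker
    return ET.tostring(root, encoding="unicode")


tagged = tag_circles(SAMPLE)
print(tagged)
```

The drawing is unchanged, but a stylesheet or script can now pick out all the markers by class, or one of them by id.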

Next up is our base R output. There are some symbol definitions at the beginning which are the embedded characters. Scrolling to get past them takes much longer than you’d think; it feels like driving past the Googleplex. We have no idea what they are without drawing them.

[Screenshot: symbol definitions at the start of the base R SVG]

Then there are 150 path objects which each look like this:

[Screenshot: one of the 150 path objects in the base R SVG]

Those are circles. Wouldn’t it be much easier, shorter and more human-readable to make them, ya know, circles?
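To see the difference in readability, here is a small Python sketch that writes the same circle both ways. The path version uses two half-circle arc commands, which is one common way of tracing a circle as a path; I am not claiming it is what R’s device writes. Both function names are made up for this example.

```python
def circle_element(cx, cy, r):
    """A circle written as a circle: centre and radius, nothing else."""
    return f'<circle cx="{cx:g}" cy="{cy:g}" r="{r:g}"/>'


def circle_as_path(cx, cy, r):
    """The same circle traced as a path made of two half-circle arcs."""
    d = (
        f"M {cx - r:g} {cy:g} "                 # move to the leftmost point
        f"a {r:g} {r:g} 0 1 0 {2 * r:g} 0 "     # arc across to the rightmost point
        f"a {r:g} {r:g} 0 1 0 {-2 * r:g} 0 Z"   # arc back again and close
    )
    return f'<path d="{d}"/>'


as_circle = circle_element(50, 50, 5)
as_path = circle_as_path(50, 50, 5)
print(as_circle)
print(as_path)
print(len(as_circle), "characters versus", len(as_path))
```

Multiply that by 150 markers and it is easy to see where some of the extra kilobytes come from, and why you can no longer spot the centre and radius at a glance.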

Then there are little and big lines going from hither to yon. They are axes, ticks, and such. Each looks like this:

[Screenshot: SVG code for one axis or tick line in the base R SVG]

Getting tedious. Then the text comes in, like this:

[Screenshot: SVG code for one text label in the base R SVG]

That says something like “1.0” — talk about over-thinking it.

The svglite version uses circles and text (hooray), but wants everything to be in its own clip-path and its own <g> object (don’t worry too much if you don’t know what I mean), and gives silly long ids to them. It’s lite but it could be liter.

Going via Inkscape from PDF, on the other hand, produces a nightmare of specifying every conceivable option, rather than sticking to simple stuff and defaults. It does use text, but take a look at this “3.0”:

[Screenshot: SVG code for the “3.0” label in the Inkscape SVG]

As far as I’m concerned, font-family will do the trick. Who the hell knows what -inkscape-font-specification adds? And so it continues, like this. There’s nothing wrong with it, it just doesn’t put the customer first.

Oh no, what happened in Stata 15?

Now to say something about Stata 15. When you draw a scatterplot marker as a circle in Stata 14, you get one <circle> object. When you do it in 15, you get two almost identical ones. Whaaaat? The reason for this is that v15 introduced semi-transparent graphics, which is something I and others had been grumbling about for years, so I’m thrilled it’s there. The only problem is this odd SVG implementation. Going semitransparent makes the fill of the marker fade out but not the stroke (the line round the outside). And they achieve that by superimposing a hollow circle on top of a solid one. The solid one gets the transparency. But to make matters worse, they are not quite the same radius. This is just weird because you could get this effect (and it’s not at all clear that that’s what most people want when they ask for semi-transparency) by just having one circle and chucking in a fill-opacity value between 0 and 1 (default 1).
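The one-circle approach can be sketched in a few lines of Python. This is my own illustration, not StataCorp’s implementation: fade_fills is a hypothetical helper, and the single-marker input is made up. It sets fill-opacity on each circle and leaves the stroke alone, so the outline stays solid while the fill fades.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
# Write elements back out as plain <circle>, not <ns0:circle>.
ET.register_namespace("", SVG_NS)


def fade_fills(svg_text, opacity=0.5):
    """Make the fill of every circle semi-transparent, keeping the stroke solid."""
    if not 0 <= opacity <= 1:
        raise ValueError("opacity must be between 0 and 1")
    root = ET.fromstring(svg_text)
    for circle in root.iter(f"{{{SVG_NS}}}circle"):
        circle.set("fill-opacity", f"{opacity:g}")
    return ET.tostring(root, encoding="unicode")


# A made-up chart containing a single marker.
one_marker = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="40" height="40">'
    '<circle cx="20" cy="20" r="8" fill="navy" stroke="navy" stroke-width="2"/>'
    '</svg>'
)
faded = fade_fills(one_marker, 0.3)
print(faded)
```

If you wanted the outline to fade as well, the opacity attribute on the same element would do both at once.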

Edit 8 Sept 2017

Chinh Nguyen at StataCorp got in touch and clarified this. There is a good reason for the ring around the marker, which is to do with operating systems and browsers and unpleasant stuff like that. It’s worth reading this post on Statalist. I am vaguely reminded of the computing module with Assembler I did as an undergrad: changing from terminal ASCII display mode to pixel-drawing display mode was associated with the students’ brains melting. It’s hard to make graphics appear on the screen the way you intended. Nevertheless, the rings could be omitted from the SVG output. For now though, you can add the option mlcol(%0) to drop the ring in Stata 15.

Some fun with Stata

So, Tim Morris and I talked about having some simple little commands that take an SVG file that Stata has made, and amend it to create something that Stata doesn’t currently produce. Once you start playing with the text file that is SVG, the sky’s the limit. In particular I want to point you to Nadieh Bremer’s talks and web pages, and Sarah Drasner’s book. We’re talking about it on Friday at the Stata Users’ Group in olde London Towne. We have three commands at present: make stuff semi-transparent (handy if you’ve got Stata 14), make a hexbin plot, and embed your SVG in a web page with some simple interactive controls. The GitHub repo is here. We invite you all to play around too.
