What and why
If there could only be one file format to save your charts in, it should be SVG. It stands for So Very Good (not really; it's Scalable Vector Graphics). But there’s good SVG and not so good SVG, and I want to explore that a little here, while also hopefully winning over some new SVG fans. Throughout, I’m going to use scatterplots of Ronald Fisher’s classic iris data: sepal length against sepal width.
Advantage one: it’s a vector format, so you can rescale it and it doesn’t get fuzzy or pixelated. Vector graphics are a series of instructions to the computer: put a line here and a circle there, etc. Rescale it and the image is re-drawn. The alternative is raster graphics, like .png, .jpg or the old .bmp, which tell the computer that this pixel is blue and that pixel is red and so on. If you rescale a raster image (and of course, web browsers on other people’s computers do that all the time no matter what you tell them to do) then you get fuzz where it has to combine or interpolate pixels.
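To make the vector-versus-raster point concrete, here is a minimal sketch in plain Python. It is a toy, not how any browser actually works: the function names are made up for illustration, the "image" is a grid of 0s and 1s, and the enlargement uses simple pixel repetition (real software usually interpolates, which gives fuzz rather than blocks). The circle is re-drawn from its instruction at the larger size, while the raster can only stretch the pixels it already has.

```python
def rasterize_circle(n, cx=0.5, cy=0.5, r=0.4):
    """Draw the instruction 'circle at (cx, cy), radius r' on an n x n pixel grid.

    Coordinates are fractions of the image size; a pixel is 1 if its centre
    falls inside the circle, else 0.
    """
    grid = []
    for row in range(n):
        y = (row + 0.5) / n
        grid.append([
            1 if ((col + 0.5) / n - cx) ** 2 + (y - cy) ** 2 <= r ** 2 else 0
            for col in range(n)
        ])
    return grid


def upscale_nearest(grid, factor):
    """Enlarge a raster by repeating every pixel factor x factor times."""
    out = []
    for row in grid:
        wide = [pixel for pixel in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def show(grid):
    """Render a grid as text so the difference is visible in a terminal."""
    return "\n".join("".join("#" if p else "." for p in row) for row in grid)


small = rasterize_circle(8)
redrawn = rasterize_circle(16)         # vector route: re-draw from the instruction
stretched = upscale_nearest(small, 2)  # raster route: stretch the existing pixels

print(show(redrawn))
print()
print(show(stretched))
```

The second picture is built from 2 x 2 blocks, while the first has a finer edge, because only the first had the original instruction to go back to.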
Advantage two: you can read it, and edit it. PDF is a vector format (although it can have embedded rasters) but isn’t human-readable. Let’s open iris_stata.pdf in a text editor:
Not helpful. And now the corresponding iris_stata.svg file:
Mmmm, human-readable; nice.
This is the reason why open data activists get furious when PDFs are dumped on a webpage by a government agency claiming to have fulfilled its open data obligations. You can see immediately that SVG is plain text (which is to say, some kind of ASCII / UTF-8 / Unicode text that you can read and indeed edit), whereas PDF is squashed down to a binary format. Also, SVG can be trimmed back to the bare essentials, while PDFs are full of weird and wonderful things that are not relevant to your chart (like the ability to store bookmarks, highlight text or digitally sign forms).
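If you haven't looked inside an SVG before, the following sketch shows how little is needed. It is hand-written for illustration and is not the output of Stata or any other package; the function name, the colours and the pixel coordinates are all assumptions. It writes a background rectangle, one circle per point, an x-axis line and a text label, then checks that the result parses as XML.

```python
import xml.etree.ElementTree as ET


def minimal_scatter_svg(points, width=300, height=200, radius=4):
    """Return a tiny SVG scatterplot as a string.

    `points` are (x, y) pairs already in pixel units, with y measured
    downwards from the top of the image, as SVG does.
    """
    lines = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" '
        f'height="{height}" viewBox="0 0 {width} {height}">',
        # Background first, so everything else is drawn on top of it.
        f'  <rect x="0" y="0" width="{width}" height="{height}" fill="white"/>',
    ]
    for x, y in points:
        lines.append(
            f'  <circle cx="{x}" cy="{y}" r="{radius}" fill="none" stroke="navy"/>'
        )
    # A single line for the x-axis and a plain text element for its title.
    lines.append(
        f'  <line x1="30" y1="{height - 30}" x2="{width - 10}" '
        f'y2="{height - 30}" stroke="black"/>'
    )
    lines.append(
        f'  <text x="{width // 2}" y="{height - 8}" text-anchor="middle" '
        f'font-family="Helvetica" font-size="12">Sepal width</text>'
    )
    lines.append("</svg>")
    return "\n".join(lines)


svg = minimal_scatter_svg([(50, 60), (120, 90), (200, 40)])
ET.fromstring(svg)  # raises ParseError if the text is not well-formed XML
print(svg)
```

Saving the printed text as a .svg file and opening it in a browser should show three hollow circles above an axis line.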
However, you can write those SVG instructions in many ways. The nice example above (iris_stata.svg) came from Stata 14, but now let’s compare how different software writes out the same chart as SVG. We’ll get them all out and then delve into the SVG code. Stata first:
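* Load the iris data from Stata's web datasets
webuse iris, clear
* Sepal length (y) against sepal width (x), no gridlines, white background
scatter seplen sepwid, ylabel(,nogrid) graphregion(color(white))
* Export the same graph as SVG and PDF (vector) and PNG (raster)
graph export "iris_stata.svg", replace
graph export "iris_stata.png", replace
graph export "iris_stata.pdf", replace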
Here’s the raster png:
Being able to export to SVG is new as of version 14, and it was undocumented! Thanks to Tim Morris for discovering it. Let’s not question how he found it. The file you get is here. (Then it became an official feature of Stata 15, out a month or two back. I’m generating output here from 14 but I’m going to come back to 15 at the end.) I invite you to open these in your text editor as well as your browser. (Browsers will display SVGs, but if you find them interesting and potentially useful, you should get a vector graphics editor too. Good options are Inkscape (free) or Adobe Illustrator (not free).)
Next up, R. Let’s use the built-in grDevices::svg device first.
You get this SVG file and this PNG:
But you can also install the svglite package.
You might see a ‘fontconfig warning’ (I do), but it runs and you get this SVG file.
As a final option, any software that spits out PDFs can be useful after a fashion, because you could open them in Inkscape/Illustrator and then save them as SVG. So, in SPSS, we could get this PDF file and this PNG:
The circle markers are represented by paths, when circles would be much more compact. Presumably Inkscape can’t tell from the PDF that they are circles and therefore a special kind of path. You could write a little program to do some simplification like this (because SVG is just a plain text file), but it would be tedious to get it right each time you fed in a different chart. Interestingly though, if you open the simple Stata SVG in Inkscape and then save it as “Inkscape SVG”, you’ll find none of that complexity afterwards, so it is in the PDF-to-SVG conversion that things get tangled up.
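Here is a rough sketch of the kind of "little program" I mean, in Python. It rests on assumptions that may well not hold for your file: that each marker is a closed path made only of absolute M, C and Z commands, with exactly four cubic Bézier segments and plain decimal numbers. Under those assumptions the bounding box of all the coordinates is the circle's bounding box, so the centre and radius fall out directly. The function names are hypothetical, and anything that doesn't match is left alone.

```python
import re
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)  # keep plain <circle>, not <ns0:circle>, on output

NUMBER = re.compile(r"-?(?:\d+\.?\d*|\.\d+)")


def path_to_circle(d, tolerance=0.05):
    """Return (cx, cy, r) if path data `d` looks like a four-arc circle, else None."""
    letters = re.findall(r"[A-Za-z]", d)
    if not letters or not set(letters) <= {"M", "C", "Z"} or "Z" not in letters:
        return None
    numbers = [float(n) for n in NUMBER.findall(d)]
    if len(numbers) != 2 + 4 * 6:  # one move-to point plus four cubic segments
        return None
    xs, ys = numbers[0::2], numbers[1::2]
    width, height = max(xs) - min(xs), max(ys) - min(ys)
    size = max(width, height)
    # Reject degenerate shapes and anything that is clearly not round.
    if size == 0 or abs(width - height) > tolerance * size:
        return None
    return ((min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2, (width + height) / 4)


def simplify_circles(svg_text):
    """Rewrite circle-like <path> elements as <circle>, keeping other attributes."""
    root = ET.fromstring(svg_text)
    for elem in list(root.iter(f"{{{SVG_NS}}}path")):
        circle = path_to_circle(elem.get("d", ""))
        if circle is None:
            continue
        cx, cy, r = circle
        elem.tag = f"{{{SVG_NS}}}circle"
        del elem.attrib["d"]
        elem.set("cx", f"{cx:g}")
        elem.set("cy", f"{cy:g}")
        elem.set("r", f"{r:g}")
    return ET.tostring(root, encoding="unicode")


example_d = ("M 15 20 C 15 22.76 12.76 25 10 25 C 7.24 25 5 22.76 5 20 "
             "C 5 17.24 7.24 15 10 15 C 12.76 15 15 17.24 15 20 Z")
example_svg = f'<svg xmlns="{SVG_NS}"><path d="{example_d}" fill="none"/></svg>'
print(simplify_circles(example_svg))
```

This is exactly the tedium I mentioned: relative commands, transforms on a parent group, or a different number of segments would each need their own handling.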
Alternatively, you could save your chart from SPSS in .eps (Encapsulated PostScript) format, which is another old vector format. When converting from EPS, the Inkscape and plain SVGs are a bit different but not much simpler. I omit the .eps for download here because it’s just short of 1MB, but here’s the plain SVG that results. The SVG file sizes vary according to complexity. Stata is smallest at 18KB, R svglite is 25KB, R svg 70KB, SPSS-Inkscape plain SVG 77KB, and SPSS-Inkscape SVG 88KB.
But there’s no need for that complexity; it’s just bloat added around the core components of the chart. And although none of these files is big enough to trouble you, why waste space if you don’t have to? More importantly, it is squandering one of the great features of SVG if it’s no longer easily human-readable.
One of the big causes of the bloat is embedding fonts. Suppose I make a chart on an Apple OS X computer and want to send it to you on a Windows one. I have Helvetica, which comes with OS X, and you don’t. I can either embed it in the SVG, which takes up more space but means that when you open the file, it will look just as I intended, or I can just stick text in there and call it Helvetica, in which case it will look right if you have paid up for that ubiquitous font, and different otherwise. It’ll probably turn into Arial. Embedding works by defining the exact shape of the characters used in the chart, then placing them where needed like a rubber stamp. I don’t know about the legal niceties of embedding a copyrighted font.
Unpacking the SVG files
Now let’s take a look inside.
Starting with the simplest, the Stata SVG, we find 184 lines of code, which is pretty slim considering there are 150 observations to be drawn. After minimal preliminaries, it draws a rect for the background (what Stata calls the graphregion), then another rect for the background inside the axes (the plotregion). Then, if you have gridlines, they come in with one line of code for each, and then the circles. Then the line that is the y-axis, then its ticks (lines), labels and title (text), then the same for the x-axis. Essentially, all the contents from background to foreground. Here’s the snapshot again:
Good features: no wasted code; the components of the chart appear in a consistent order; and labels are plain text, which is way better than embedding fonts (unless you really want to).
Bad features: no id or class info makes it hard to see what is the x-axis, for example, until you realise there’s a fixed order to the components.
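If you'd rather not scroll through a file to see what it contains and in what order, a few lines of Python can take an inventory for you. This is a sketch with made-up function names and a tiny hand-written SVG standing in for a real export; it counts each element type and then lists the runs of consecutive elements in document order, which is what you would rely on when there are no ids or classes to go by.

```python
from collections import Counter
from itertools import groupby
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.split("}")[-1]


def count_svg_elements(svg_text):
    """Count how many elements of each type the SVG contains (including <svg>)."""
    root = ET.fromstring(svg_text)
    return Counter(local_name(elem.tag) for elem in root.iter())


def element_runs(svg_text):
    """List (tag, run length) pairs for the elements below <svg>, in document order."""
    root = ET.fromstring(svg_text)
    tags = [local_name(elem.tag) for elem in root.iter() if elem is not root]
    return [(tag, len(list(group))) for tag, group in groupby(tags)]


# A hand-written stand-in for a real export: backgrounds, markers, axis, label.
example = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<rect width="100" height="100"/>'
    '<rect x="10" y="10" width="80" height="80"/>'
    '<circle cx="30" cy="40" r="3"/>'
    '<circle cx="60" cy="70" r="3"/>'
    '<line x1="10" y1="90" x2="90" y2="90"/>'
    '<text x="50" y="98">Sepal width</text>'
    '</svg>'
)

print(count_svg_elements(example))
print(element_runs(example))
```

Reading a real file in with open("iris_stata.svg").read() and passing the text to either function should work the same way, provided the file is well-formed XML.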
Next up is our base R output. There are some symbol definitions at the beginning which are the embedded characters. Scrolling to get past them takes much longer than you’d think; it feels like driving past the Googleplex. We have no idea what they are without drawing them.
Then there are 150 path objects which each look like this:
Those are circles. Wouldn’t it be much easier, shorter and more human-readable to make them, ya know, circles?
Then there are little and big lines going from hither to yon. They are axes, ticks, and such. Each looks like this:
Getting tedious. Then the text comes in, like this:
That says something like “1.0” — talk about over-thinking it.
The svglite version uses circles and text (hooray), but wants everything to be in its own clip-path and its own <g> object (don’t worry too much if you don’t know what I mean), and gives silly long ids to them. It’s lite but it could be liter.
Going via Inkscape from PDF, on the other hand, produces a nightmare of specifying every conceivable option, rather than sticking to simple stuff and defaults. It does use text, but take a look at this “3.0”:
As far as I’m concerned, font-family will do the trick. Who the hell knows what -inkscape-font-specification adds? And so it continues, like this. There’s nothing wrong with it, it just doesn’t put the customer first.
Oh no, what happened in Stata 15?
Now to say something about Stata 15. When you draw a scatterplot marker as a circle in Stata 14, you get one <circle> object. When you do it in 15, you get two almost identical ones. Whaaaat? The reason for this is that v15 introduced semi-transparent graphics, which is something I and others had been grumbling about for years, so I’m thrilled it’s there. The only problem is this odd SVG implementation. Going semi-transparent makes the fill of the marker fade out but not the stroke (the line round the outside). And they achieve that by superimposing a hollow circle on top of a solid one.
The solid one gets the transparency. But to make matters worse, they are not quite the same radius. This is just weird because you could get this effect (and it’s not at all clear that that’s what most people want when they ask for semi-transparency) by just having one circle and chucking in a fill-opacity value between 0 and 1 (default 1).
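The single-circle approach can be sketched in a few lines of Python. This is an illustration, not the code behind any Stata command: the function name is made up, and it assumes the markers are <circle> elements whose opacity is not already fixed inside a style attribute (a style declaration would take precedence over the fill-opacity attribute set here).

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)  # write plain <circle> rather than <ns0:circle>


def set_fill_opacity(svg_text, opacity):
    """Return the SVG text with fill-opacity set on every <circle> element."""
    if not 0 <= opacity <= 1:
        raise ValueError("opacity must be between 0 and 1")
    root = ET.fromstring(svg_text)
    for circle in root.iter(f"{{{SVG_NS}}}circle"):
        # Only the fill fades; the stroke keeps its own opacity (default 1).
        circle.set("fill-opacity", f"{opacity:g}")
    return ET.tostring(root, encoding="unicode")


example = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">'
    '<circle cx="30" cy="40" r="5" fill="navy" stroke="navy"/>'
    '<circle cx="60" cy="70" r="5" fill="navy" stroke="navy"/>'
    '</svg>'
)

print(set_fill_opacity(example, 0.5))
```

One circle per marker, a faded fill and a solid outline: the same look as the two superimposed circles, in half the elements.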
Edit 8 Sept 2017
Chinh Nguyen at StataCorp got in touch and clarified this. There is a good reason for the ring around the marker, which is to do with operating systems and browsers and unpleasant stuff like that. It’s worth reading this post on Statalist. I am vaguely reminded of the computing module with Assembler I did as an undergrad: changing from terminal ASCII display mode to pixel-drawing display mode was associated with the students’ brains melting. It’s hard to make graphics appear on the screen the way you intended. Nevertheless, the rings could be omitted from the SVG output. For now though, you can add the option mlcol(%0) to drop the ring in Stata 15.
Some fun with Stata
So, Tim Morris and I talked about having some simple little commands that take an SVG file that Stata has made, and amend it to create something that Stata doesn’t currently produce. Once you start playing with the text file that is SVG, the sky’s the limit. In particular I want to point you to Nadieh Bremer’s talks and web pages, and Sarah Drasner’s book. We’re talking about it on Friday at the Stata Users’ Group in olde London Towne. We have three commands at present: make stuff semi-transparent (handy if you’ve got Stata 14), make a hexbin plot, and embed your SVG in a web page with some simple interactive controls. The GitHub repo is here. We invite you all to play around too.