I quickly want to go through some of the points that have been raised by the ODI’s report on PDFs and open data.
The most important one that I want to bring up is foolproofing. In quite a few of the jobs I’ve had in or around government in the past few years I have had to train people to use new tools (usually a CMS) or new practices (putting H1, H2 etc. into Word docs to help with accessibility).
I also write this as someone with a background in user research rather than open data, although I have worked on a number of data projects. My experience has significant crossover with the groups mentioned in the ODI’s report (parliament, town planners, nerds).
Most users of software take the easiest route. In the case of creating a PDF, that means saving as a PDF in Word. That means that embedded data may well be Word tables, because that’s what people know how to do.
People
Garbage in, garbage out.
To an extent I don’t think it matters hugely what format data is in. If we moved to BMP files, I wouldn’t mind as long as it works. The thing is that you need to mandate the format that takes the least technical nous to create. Creating data is not a job that is always well paid or high up the experience ladder. Creators are not often consumers. So, make it easy. Few options. Exporting as CSV makes sense in this case, even for data that might be contained within a PDF elsewhere.
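To give a sense of how little there is to a CSV export, here is a minimal sketch in Python using the standard library's `csv` module. The rows, the column names and the `to_csv` helper are all made up for illustration; they are not from the ODI report or from any real dataset.

```python
import csv
import io

# Hypothetical rows, standing in for a table that would otherwise
# end up as a Word table inside a PDF.
rows = [
    {"ward": "North", "applications": 12},
    {"ward": "South", "applications": 7},
]


def to_csv(rows):
    """Return the rows as CSV text: one header line, then one line per row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(rows[0].keys()),  # column order taken from the first row
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv(rows)

# Reading it back needs no knowledge of how the file was laid out visually.
parsed = list(csv.DictReader(io.StringIO(csv_text)))
```

The point is less the code than the shape of the output: a header line and plain rows that any spreadsheet or script can open, with no layout to work around.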
- You will never stop being “stuck thinking about PDFs” until you change your business practices so that you are no longer stuck with them.
One other thing I’d say is that I have worked in parts of the public sector where the release of important data was deliberately done poorly to make machine reading more difficult. I worry that a bells-and-whistles PDF format runs the risk of malicious hobbling.
What is the user need?
When we come back to the root of all this, I’m left wondering what problem the recommendation of PDF use solves.
If it is words on a screen, I would recommend HTML. It puts words on a screen without heating your phone up to do it.
If it is data in a grid, I would recommend CSV or whatever your spreadsheet format of choice is. Whatever.
Then we get into the murky world of mixed use case that I think the report is trying to address.
A quick aside: “with exceptions for accessible views or views on mobile” is a fucking big exception. That’s a) a set of users you have a legal requirement and a moral duty to satisfy and b) over 50% of users. Also, look at that highlighting. PDFs have done that forever, and they copy and paste like shit, full of formatting errors. They are no good as “a document that looks the same”.
Scientific papers are often PDFs because of legacy reasons such as controlling access to resources and the likelihood of the journal being printed. The challenge of open access publication makes this model seem to be in a bit of flux at the moment.
If the article were saying “hey, PDFs can do some things with data that they couldn’t do a while ago” then sure, fine.
If the user need is reading data, then use a data format. If the user need is reading words then use a (better) document format. Right now, no clear user need is expressed other than “sometimes you have both”, which is true but doesn’t then logically lead on to “…so PDF is the right solution”. It may be, but there is a logic gap there.
People mentioned some other things:
- Linkrot happens to all URLs. Anything on the internet can disappear. HTML is as prone to this as PDF, since both often live in the same CMS, with a filesystem for the PDFs too. I don’t think the ability to self-archive is the killer feature here. Until you download something you have no certainty about it. That is true of any internet document.
Conclusion
Long story short, PDF is fine if you already know how to extract data from it, are not on a mobile device, don’t use accessibility software, are acting after the zombocalypse and wish to analyse your pre-downloaded stash of data, and can assume that the author wanted you to do your analysis unimpeded.
If you write a report that is meant to be read on the internet by many people, then do it as a webpage. Terry is right. I usually read thinktank reports on the train to work in the morning. I don’t print them out. If you don’t understand user needs in the medium of the report, how can I be sure that you understand more complex ones?