Gov.uk content should be published in HTML and not PDF (2018) (gds.blog.gov.uk)
165 points by rahuldottech on Dec 22, 2019 | 59 comments


PDF seems to fit a world-view where the highest objective of a document is to be printed on paper.

From a machine-parsing perspective PDF files are a nightmare. Chunks of text may be broken anywhere, mid-sentence, mid-word. These chunks may appear in the document in any order.

Spaces may be encoded as spaces, or they may be created in a number of other ways, such as by positioning chunks or by setting character spacing per character.

The mapping from code point to glyph does not need to be pure Unicode; a PDF document may contain a custom font with additional glyphs.

This is all stuff I learned by trying to parse a limited set of PDFs found in the wild.

All of these gotchas are, by the way, completely PDF/A compliant.
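
For a concrete illustration, here is a minimal sketch of what an extractor actually sees, assuming pdfminer.six is installed and using a made-up file name; the grouping and ordering come from layout heuristics, not from any reading order stored in the file:

    # Sketch: dump the text chunks of the first page together with their
    # coordinates. Assumes pdfminer.six is installed; "example.pdf" is made up.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page in extract_pages("example.pdf", maxpages=1):
        for element in page:
            if isinstance(element, LTTextContainer):
                # bbox is (x0, y0, x1, y1) in PDF points; reading order has to
                # be reconstructed from this geometry.
                print(element.bbox, repr(element.get_text()))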


An alternative thesis - PDFs fit a world view where the highest objective of a document is to be read by a human.

If I am given something that I am personally expected to read thoroughly - be it a report, long-form article, slide deck, etc - then the most professional format by far is a LaTeX PDF.

I can't claim anything beyond personal experience, but if I want to signal to someone that a document is important and was written with care, then they are getting a PDF.


The thing is, human consumption tends to rely on machine consumption. We want search engines to index our documents, and we want to be able to search within the documents. These features rely on machine parsability.

It is perfectly possible to generate a PDF file with none of the issues mentioned; the problem is that most people don't have the required control over their toolchain, and a lot of tools will create such issues by default.


A LaTeX-generated PDF along with the .tex file used to generate it solves all the problems mentioned by the parent. Now to convince casual users that LaTeX is worth learning... that's a completely different problem.


Not true; I've seen the most abhorrent PDFs generated by LaTeX in academia. When I was working in the digitization department of a public university library, we realized that we needed to handle PDFs just like every other scanned page: rasterize, then OCR.
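
A minimal sketch of that rasterize-then-OCR approach, assuming pdf2image (which needs Poppler) and pytesseract are installed; the file name is made up:

    # Sketch: treat a PDF like a scanned page - rasterize each page, then OCR it.
    # Assumes pdf2image (with Poppler) and pytesseract; "thesis.pdf" is made up.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path("thesis.pdf", dpi=300)  # one PIL image per page
    text = "\n".join(pytesseract.image_to_string(page) for page in pages)
    print(text[:500])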


That doesn't make sense. In the case of a PDF+TeX bundle you just run the TeX through Pandoc and you have a neat result. Why would you OCR the PDF when you can just use the raw markup?
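
If the .tex source is available, that conversion is roughly a one-liner; a sketch using pypandoc (a thin wrapper around the pandoc CLI), with a made-up file name:

    # Sketch: convert the LaTeX source (not the PDF) to HTML via pandoc.
    # Assumes pandoc and pypandoc are installed; "paper.tex" is a made-up name.
    import pypandoc

    html = pypandoc.convert_file("paper.tex", "html", format="latex")
    with open("paper.html", "w", encoding="utf-8") as f:
        f.write(html)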


Because (a) almost no LaTeX document is published as PDF+TeX; it's either PDF or die. And (b) LaTeX has a bazillion Turing-complete extensions that don't make sense semantically until you render them.


Yet for a visually impaired reader, PDFs are a nightmare because of everything cited above.


So PDFs are the Mercedes Benz of document formats?


I worked on a bot that parses NDAs using (among other things) flags, regexes and ML to tell you whether you could sign them. In the end, of all the file types that had to be parsed (doc, docx, txt, rtf and pdf), PDF was the most troublesome. When parsing a PDF, nothing was 100% certain.

In the end it was one of the most interesting projects I worked on in my (short) career but sometimes it sucked.
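
To give a rough idea of what such a pipeline can look like, here is a toy sketch of the extract-then-flag layer, assuming pdfminer.six and python-docx are installed; the clause patterns and file name are made up, and nothing about the actual bot is implied:

    # Toy sketch of a flag/regex pass over mixed document formats.
    # Assumes pdfminer.six and python-docx; patterns and paths are made up.
    import re
    from pathlib import Path
    from pdfminer.high_level import extract_text as pdf_text
    from docx import Document

    FLAGS = {
        "non_compete": re.compile(r"non[- ]compete", re.I),
        "perpetual_term": re.compile(r"in perpetuity|perpetual", re.I),
    }

    def load_text(path: Path) -> str:
        if path.suffix == ".pdf":
            return pdf_text(str(path))          # no guarantee the text survives intact
        if path.suffix == ".docx":
            return "\n".join(p.text for p in Document(str(path)).paragraphs)
        return path.read_text(errors="ignore")  # txt, rtf-as-text, etc.

    text = load_text(Path("nda.pdf"))
    hits = [name for name, rx in FLAGS.items() if rx.search(text)]
    print("flags raised:", hits)

The PDF branch is exactly where the uncertainty lives: the extracted text may have lost spaces, order or whole characters before the regexes ever see it.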


>PDF seems to fit a world-view where the highest objective of a document is to be printed on paper.

I've seen this in academia as well. And, inspired by PoC||GTFO, I've been thinking about downloading academic PDFs, writing a web server that provides an interactive model of the topic in the paper, patching it into the PDF to turn it into a polyglot, and then re-uploading the PDFs. This way people who want their PDFs can have it, those who want something a little more modern can have that simply by interpreting the exact same file as a bash script, and I get to understand the paper by modeling it.


> highest objective of a document is to be printed on paper.

In my line of work, I must lay out information and facts as if they were on paper, in order, to create a specific narrative.

This narrative cannot be lost in hyperlinks or other web-specific constructs. Cases must be laid out in a very specific order from beginning to end, to make my argument as to why things were done the way they were.

No other medium fits this except paper, or PDF, in a digital sense.

Edit: On websites, content should be replicated in an appropriate format, but most certainly referenced to its original. And the original should be readily available.


Can you clarify what your line of work is? As it stands I'm unclear on why an HTML document can't represent "cases ... laid out in a very specific order from beginning to end". You don't need to use links or other functionality just because it's there.


Healthcare. Too many documents from many different sources in varying formats, too much time to compile electronically.

It’s such a mess.


>From a machine-parsing perspective PDF files are a nightmare.

Would someone with experience care to explain why? Does it have to do with each letter having an absolute position in the document? I have no clue, to be honest.


You have to essentially render the document yourself in order to figure out what the order of chunks is. Then you might be able to extract content from the chunks you're interested in - or not. A given zipped chunk might be literally anything.

Didier Stevens is the best PDF expert I know of: https://blog.didierstevens.com/programs/pdf-tools/


There's a whole pile of different gotchas.

* identifying characters that won't actually print [white text; zero-size font; beneath another layer; not in printable space]. Once this led to every letter aappppeeaarriinngg ttwwiiccee..

* text in fonts where a != aa [leaving the text as a complicated substitution cypher; caused by font embedding for only the characters in the document]

* text in images

* no spaces: you have to infer them from gaps between letters (see the sketch below)

And these are generated by a whole host of different software with different assumptions, and you never know if there's something else you're missing.
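
A minimal sketch of the no-spaces case mentioned above: rebuild word boundaries from horizontal gaps between glyphs, assuming pdfminer.six; the threshold is a crude, made-up heuristic and the file name is invented:

    # Sketch: infer word breaks from gaps between glyph bounding boxes when a
    # PDF encodes no explicit spaces. Assumes pdfminer.six; "no_spaces.pdf" and
    # the 0.3 threshold are made up.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    def line_to_words(line, gap_factor=0.3):
        words, current, prev = [], [], None
        for ch in line:
            if not isinstance(ch, LTChar):
                continue
            # A gap wider than a fraction of the glyph width starts a new word.
            if prev is not None and (ch.x0 - prev.x1) > gap_factor * ch.width:
                words.append("".join(current))
                current = []
            current.append(ch.get_text())
            prev = ch
        if current:
            words.append("".join(current))
        return words

    for page in extract_pages("no_spaces.pdf", maxpages=1):
        for box in page:
            if isinstance(box, LTTextContainer):
                for line in box:
                    if isinstance(line, LTTextLine):
                        print(line_to_words(line))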


I have no experience in handling raw PDF data, but as a user, I sometimes notice that the computer is not reading the PDF text the same way as I'm reading it. Here are a few examples:

1. When searching (Ctrl+F) a commonly used phrase that occurs multiple times in a PDF, some occurrences fail to show up because of line breaks, accidental hyphenation, etc.

2. Once in a while, I come across PDF files where searches for words containing "fi", "ff", etc. fail because of some insane ligature glyph replacements (see the sketch below).

3. Some PDF files that have a two-column layout for text still treat lines across the two columns as one line. Search fails again.
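
For point 2, Unicode compatibility normalization recovers the searchable form when the extractor hands back single ligature code points; a small sketch using only the standard library (the sample string is made up):

    # Sketch: fold ligature glyphs (U+FB01 "fi", U+FB00 "ff", ...) back into
    # plain letters so that Ctrl+F-style searches match again.
    import unicodedata

    extracted = "The \ufb01nal o\ufb00er"        # how some PDFs hand the text back
    normalized = unicodedata.normalize("NFKC", extracted)
    print(normalized)                             # The final offer
    print("fi" in extracted, "fi" in normalized)  # False True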


Yeah, pretty much exactly what you said. Since PDF is focused on presentation rather than content, it can write text content in any order, and the rules for converting the byte values to Unicode values are extremely complex, supporting many different font formats. Some fonts (Type 3) don't even include mappings to Unicode in some scenarios, instead just encoding the appearance of glyphs and not their meaning.

Assuming you have a reasonable PDF file, you have to parse the entire page content stream, which includes things like colors, lines, Bezier curves, etc., extract the text-showing operations, and then stitch the letters back into words and the words back into reading order, as best you can.

Many sensible PDF producers encode letters and whitespaces reasonably thereby preserving reading order, but this is far from universal.

For an idea of what content stream parsing involves, this is how I currently do it: https://github.com/UglyToad/PdfPig/blob/master/src/UglyToad....


It was never really designed for that, only for display and printing. There are some features that let you mark up the text for easier searching and selection, but not every producer uses them.

It is a bit of a dog's dinner now, mostly because of backwards compatibility. XPS is better but obviously failed in the market.


Sure... but HTML is not easier, if not harder. Can't be bothered to get printing working correctly either. I will take PDF any given day.


It is easier to generate a good document with HTML; you just have to leave out the bells and whistles.

But of course, with the HTML+CSS+JS stack being more a programming language than a document format, there are no bounds to how awful one can make it either.


Why not both? Wasn't the point of the PDF/A ISO standard to use it for archiving? I always felt PDF is better for content like this than HTML, which can change dynamically.

https://en.wikipedia.org/wiki/PDF/A

VeraPDF also exists for PDF/A validation: https://github.com/verapdf


> I always felt PDF is better for content like this than HTML which can change dynamically

It's not that hard to edit a PDF file either, though.


That's not the issue; the problem is we don't know what a document would look like in next year's browser, or whether it will render at all. That isn't an issue with PDF.


The same applies to PDFs too. There are hundreds of PDF readers and not all of them support all the features, and there is feature deprecation besides. You could run Flash in PDFs, so archived PDFs that relied on that will be broken.


Hence the aforementioned PDF/A, which is a version with a fixed set of features: https://en.wikipedia.org/wiki/PDF/A


Presumably there are text formats which are both machine parseable and which are static? Or why not simply use the static subset of HTML?


I painstakingly convert some PDFs from the Boston and Massachusetts governments into HTML, and people prefer them for the flexibility and accessibility.

My workflow is to use pdftohtml, then edit the result as Markdown and screenshot the figures, before converting to HTML with pandoc (see the sketch below).

I have been pushing to publish in HTML from the get-go, in part by citing this blog, with some success.

For example, the Massachusetts Secretary of the Office of Environmental Affairs recently published a typed letter as a raster PDF, which I converted into https://nattaylor.com/eastboston/blog/2019/3247-2017-logan-a... from https://eeaonline.eea.state.ma.us/EEA/emepa/mepacerts/2019/s...
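
For what it's worth, a sketch of that workflow scripted end to end, assuming Poppler's pdftohtml and pandoc are on the PATH; file names are made up and the middle step stays manual:

    # Sketch of the workflow above: pdftohtml for a rough first pass,
    # hand-editing as Markdown, then pandoc to clean standalone HTML.
    # Assumes Poppler's pdftohtml and pandoc are installed; names are made up.
    import subprocess

    # 1. Rough conversion; writes an HTML file plus extracted images.
    subprocess.run(["pdftohtml", "-s", "-noframes", "report.pdf", "rough"], check=True)

    # 2. Manual step: rewrite the rough HTML as report.md, screenshot the figures.

    # 3. Convert the edited Markdown to standalone HTML.
    subprocess.run(["pandoc", "report.md", "-s", "-o", "report.html"], check=True)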


The best idea would probably be a download option for good, old-fashioned XML, plus a nice XSLT stylesheet for transformation into HTML or processing to PDF. This approach would ensure that you can still reliably save and read the content in the future (compared to responsive JS nightmares)...
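
A minimal sketch of that route, assuming lxml is installed; the XML document and stylesheet names are made up:

    # Sketch: apply an XSLT stylesheet to an XML document to produce HTML.
    # Assumes lxml; "record.xml" and "to-html.xsl" are made-up names.
    from lxml import etree

    doc = etree.parse("record.xml")
    transform = etree.XSLT(etree.parse("to-html.xsl"))
    result = transform(doc)
    print(str(result))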


I think it would be best to have plain HTML available for download (without any JS or any other cruft). Even .epub (e-books) is just zipped HTML with some extra parts.



I really like the work of the UK's GDS (Government Digital Service).


Me too. Focusing on digital services was, and remains, one of the few things the UK government has got right in recent years, IMHO - and excelled at.


Is it anything to do with the government per se, or is it driven entirely by civil servants?


Francis Maude was the MP responsible for a lot of that. He was the only politician I ever heard talk about lean practices in software.

https://en.wikipedia.org/wiki/Francis_Maude#Return_to_Govern...


Yeah -- look into Francis Maude, he's broadly responsible for the current gov't focus on (good) IT practices.


Interesting about PDFs. I think they still have some uses though, like scanned documents (court records, for example) and downloadable books... I know some software, say invoicing for hosting, might offer a downloadable version created with some script, but just spitting out a web page is probably easier in that case, and still printable.

However, some PDFs don't even let you select text; they seem to be nothing but scans. I noticed this even on some government sites: one city had its ordinances in PDFs that looked like they were typed up on a computer or typewriter, signed by someone, scanned in and re-uploaded. Selectable text is not only useful for blind people, but also for searching the page or copying and pasting parts of it for research, and probably a bit of an SEO boost too, since I'm not sure robots try to parse text in images, even though it's probably very possible with machine learning.

I know there have been lawsuits over government websites having to be accessible, but it seems like many cities, and even some higher-level sites, aren't taking this seriously. Not sure if they are just unaware, under budget, or maybe just running older sites that would be better if rebuilt. I remember reading somewhere on HN that some city (I forget which one) just deleted their website and replaced it with a plain text web page after people complained, so now if you need anything you have to go to city hall. That basically negates having a website in the first place.


While PDF is supposed to be an open standard, I'm noticing more and more functionality in PDF files that isn't supported by readers that implement the open standard. I have a project right now to find out why Adobe and Chrome are so different, and, though it's obvious to me, the loss of functionality from Chrome not supporting certain things must be explained to those to whom it is not so obvious.


IIRC, implementing the PDF standard requires a JavaScript implementation. To many readers, that's not worth it.


My municipal government insists on sending out via email attached PDFs or links to download PDFs.

Some of the PDFs are prepared by the government, others by vendors working for the government (probably based on specs that called for this experience). It ignores the fact that many (if not most) people read on small-screen devices, don't have logon credentials handy, or don't even know where to find downloaded documents. It erodes civic engagement and leads to real problems when policies aren't followed or generate backlash because people didn't know about them before the changes.

The simplest thing to do would be just to send the damn data as plaintext email. The mayor actually gets this -- her newsletters are always in the body of the email itself, never as a PDF attachment or link to download a PDF. Yet her administration is still stuck on PDFs for everything.


It was refreshing to get a notice about cookie settings which, when ignored, accepted or rejected, remained readily accessible (on mobile at least). Normally, you get to reject the setting once and then it vanishes, meaning you can't update your choice either way. It was a pleasant surprise to be able to change my mind.


Still, I was annoyed at it being a floating header, covering up the bottom few lines.

I’m confused that the vast majority of “mobile optimized” sites don’t even have anyone look at the site in landscape mode.


>Normally, you get to reject setting once and then it vanishes meaning you can't update your choice either way.

Funny how you phrase it. Normally, dark-pattern websites let you accept the setting and never show it to you again, so you won't change your mind. If you reject it, you get a redirect or a nagging screen which lets you change your mind and accept.


Having to have a separately implemented version of this on every website in the world is the dark pattern. Browsers already let you white- or black-list individual domains for cookies; it's crazy that every website design has to accommodate this redundant feature that is never going to be possible to implement perfectly for all use cases.


When I worked at GDS, the aim was to use PDFs; they were considered to have a long shelf life, whereas HTML and the web in general can evolve and leave some people without a way to read it. I wonder what's changed?


PDFs aren't and have never been accessible, and PDFs aren't and have never been reactive. Administrations are just finding out how frustrating the format is.


No. Terrible idea. The authors take the perspective that content should be easy for browsers to render. That's important, but it shouldn't be the only consideration. Browsers can render HTML well, but browsers have incredibly complicated rendering engines. This makes HTML non-portable and non-exportable, not printable, not sharable. I'm sure for each one of those you can create a workaround to bridge the gap, but the big picture is that HTML content tends to be dynamic and rendered by a heavy server component from a database, which makes creating a true standard actually quite hard. HTML actually really sucks as a portable document standard without some extra semantic definition and a container format. So to make HTML workable you have to do something like what docx and odf do: XML definition+resources in zip container ... which now look a lot like PDF.

>We also intend to build functionality for users to automatically generate accessible PDFs from HTML documents.

How about the other way around: PDFs -> HTML documents?
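
As an aside, the zip-container point about docx/ODF is easy to see with nothing but the standard library; the file name is made up:

    # Sketch: a .docx really is XML plus resources inside a zip container.
    # Standard library only; "letter.docx" is a made-up file name.
    import zipfile

    with zipfile.ZipFile("letter.docx") as z:
        for name in z.namelist():
            print(name)                      # word/document.xml, word/media/..., etc.
        print(z.read("word/document.xml")[:200])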


> you have to do something like what docx and odf do: XML definition+resources in zip container ... which now look a lot like PDF.

Only superficially. .docx and .odt are single files as PDF is, but the contents of those files are very different. A PDF document is designed around the assumption of printed pages with a fixed layout. Yes, tagged PDF has been retrofitted onto the base PDF format for accessibility, but most PDFs that I've seen in the wild do not use this extra feature.

To understand why starting with a PDF and converting to HTML is a bad idea, try any PDF-to-HTML converter you can find; there's the pdftohtml utility included with Xpdf and Poppler, but the commercial alternatives are not a lot better. Once you learn more about how PDF works, you'll understand why PDF-to-HTML conversion can't be very good.


> We also intend to build functionality for users to automatically generate accessible PDFs from HTML documents. This would mean that publishers will only need to create and maintain one document, but users will still be able to download a PDF if they need to.

Oh, I see you've added that. The problem with PDF to HTML is that you're trying to fit something static back into something dynamic. I also don't see how PDF rendering is any less complicated. There are many browser implementations that can easily render gov.uk without issues.


This raises the question of how they create these 'HTML documents'. As far as I'm aware, there really isn't an HTML document format. More than likely, these documents will be created by some web-based system (maybe even proprietary and cloud-based, making portability extremely hard) and simply exported as PDF later. That's how every web-based content management system does it today. It's very convenient right now, but when it comes to document maintenance for the next few decades, I'm not sure that's better than something like PDF or ODF.


The code is available on GitHub, so you can see for yourself. It's mostly written in a customised Markdown variant called Govspeak [0].

You appear to be jumping to ridiculous conclusions. Decisions on how we deliver content are made on the back of a huge amount of research. It's more accessible than it's ever been, and other governments and organisations are following our lead.

To suggest PDF, ODF or other crippled, inaccessible, OS-constrained formats over HTML - something everyone can read with any of their devices - shows you need to do more research before posting your ill-informed views.

[0] https://github.com/alphagov/govspeak

Edit: Also, web pages have plenty of other advantages: indexed by every search engine, bookmarkable/shareable headings. What other medium offers that?


You can find out more about how GDS operates on their github[0].

[0]: https://alphagov.github.io/


> HTML actually really sucks as a portable document standard without some extra semantic definition and a container format.

You're probably looking for the WARC [0] ISO standard. Widely supported and used, including by the UK [1].

WARC is simple enough that tools to create it or extract from it are a dime a dozen.

As for needing incredibly complicated rendering engines to look at the extracted HTML when you want: even links (the text-mode browser) doesn't struggle to render the content.

[0] https://www.loc.gov/preservation/digital/formats/fdd/fdd0002...

[1] https://www.nationalarchives.gov.uk/pronom/fmt/289
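
A minimal sketch of reading one back, assuming the warcio library is installed; the archive name is made up:

    # Sketch: iterate over a WARC archive and pull out the HTML responses.
    # Assumes warcio is installed; "crawl.warc.gz" is a made-up file name.
    from warcio.archiveiterator import ArchiveIterator

    with open("crawl.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":
                url = record.rec_headers.get_header("WARC-Target-URI")
                body = record.content_stream().read()
                print(url, len(body), "bytes")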


You could offer content in an ebook format (EPUB, Mobipocket), which is just HTML and resources with styles, and offer it through the web as straight HTML with the option of downloading the container.


For sure. But notice you need to define another document standard because you probably don't want to just use an ebook format since that won't capture all your use-cases. So why not just use PDF, docx, or ODF?


Honestly because those formats are more restrictive (in an engineering sense), proprietary (in the sense of who really controls them) and difficult to parse. You’d need some sort of special subset of the format to make it so that people’s use didn’t obscure information, as is easily doable with scrambled PDFs and the like. You’d have to validate that format and reject submissions that didn’t use it. Laziness would probably lead to whatever the printer driver produces being the de facto standard.


That would not work perfectly, as some PDFs do not have the text stored as "text".


That's not the fault of PDF. You can have an HTML document with image links as well. The problem is that you have no standard way of exporting and storing this kind of document.



