PDF seems to fit a world-view where the highest objective of a document is to be printed on paper.
From a machine-parsing perspective PDF files are a nightmare. Chunks of text may be broken anywhere, mid-sentence, mid-word. These chunks may appear in the document in any order.
Spaces may be encoded as spaces, or they may be created a number of other ways, like by positioning chunks, or setting character spacing per character.
The mapping from code point to glyph does not need to be pure Unicode, a PDF document may contain a custom font with additional glyphs.
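To make this concrete, here is a minimal sketch that dumps the text-related operators from a page's content stream, assuming the pikepdf library and a placeholder example.pdf. Text arrives as Tj/TJ chunks interleaved with positioning and spacing operators, and a parser has to interpret all of them before it can even guess where the spaces are:

    import pikepdf

    # Operators we care about: Tj/TJ show text chunks, Td/TD/Tm move the
    # text cursor, Tc/Tw set character/word spacing.
    text_ops = tuple(pikepdf.Operator(op)
                     for op in ("Tj", "TJ", "Td", "TD", "Tm", "Tc", "Tw"))

    # "example.pdf" is a placeholder name.
    with pikepdf.open("example.pdf") as pdf:
        for operands, operator in pikepdf.parse_content_stream(pdf.pages[0]):
            if operator in text_ops:
                print(operator, operands)

The numbers inside a TJ array are kerning adjustments, which is one of the ways a visual space can exist without any space character in the file.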
This is all stuff I learned by trying to parse a limited set of PDFs found in the wild.
All of these gotchas are by the way completely PDF/A compliant.
An alternative thesis - PDFs fit a world view where the highest objective of a document is to be read by a human.
If I am given something that I am personally expected to read thoroughly - be it a report, long-form article, slide deck, etc - then the most professional format by far is a LaTeX PDF.
I can't claim anything beyond personal experience, but if I want to signal to someone that a document is important and was written with care, then they are getting a PDF.
The thing is, human consumption tends to rely on machine consumption. We want search engines to index our documents, and we want to be able to search within the documents themselves. These features rely on machine parsability.
It is perfectly possible to generate a PDF file with none of the issues mentioned; the problem is that most people don't have the required control of their toolchain, and a lot of tools will create such issues by default.
A LaTeX-generated PDF along with the .tex file used to generate it solves all the problems mentioned by the parent. Now to convince casual users that LaTeX is worth learning... that's a completely different problem.
Not true; I've seen the most abhorrent PDFs generated by LaTeX in academia. When I was working in the digitization department of a public university library, we realized that we needed to handle PDFs just like every other scanned page: rasterize, then OCR.
That doesn't make sense. In the case of a PDF+TeX bundle you just run the TeX through Pandoc and you have a neat result. Why would you OCR the PDF when you can just parse the raw markup?
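For what it's worth, a minimal sketch of that route, assuming pandoc is on the PATH and using placeholder file names (paper.tex, paper.html):

    import subprocess

    # Convert the LaTeX source to standalone HTML with pandoc.
    # "paper.tex" and "paper.html" are placeholder names.
    subprocess.run(
        ["pandoc", "paper.tex", "--standalone", "-o", "paper.html"],
        check=True,
    )

Of course this only helps when the .tex source is actually published alongside the PDF.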
Because a) almost no LaTeX document is published as PDF+TeX (it's either PDF or die), and b) LaTeX has a bazillion Turing-complete extensions that don't make sense semantically until you render them.
I worked on a bot that parsed NDAs using (among other things) flags, regexes and ML to tell you if you could sign them. Of all the files that had to be parsed (doc, docx, txt, rtf and pdf), PDF was the most troublesome. When parsing a PDF, nothing was 100% certain.
In the end it was one of the most interesting projects I worked on in my (short) career, but sometimes it sucked.
>PDF seems to fit a world-view where the highest objective of a document is to be printed on paper.
I've seen this in academia as well. And, inspired by PoC||GTFO, I've been thinking about downloading academic PDFs, writing a web server that provides an interactive model of the topic in the paper, patching it into the PDF to turn it into a polyglot, and then re-uploading the PDFs. This way people who want their PDFs can have it, those who want something a little more modern can have that simply by interpreting the exact same file as a bash script, and I get to understand the paper by modeling it.
> highest objective of a document is to be printed on paper.
In my line of work, I must lay out information and facts as if they were on paper, in order, to create a specific narrative.
This narrative cannot be lost in hyperlinks or other web-specific constructs. Cases must be laid out in a very specific order, from beginning to end, to make my argument as to why things were done the way they were.
No other medium fits this except paper, or PDF, in a digital sense.
Edit: On websites, content should be replicated in an appropriate format, but most certainly referenced to its original. And the original should be readily available.
Can you clarify what your line of work is? As it stands I'm unclear on why an HTML document can't represent "cases ... laid out in a very specific order from beginning to end". You don't need to use links or other functionality just because it's there.
>From a machine-parsing perspective PDF files are a nightmare.
Would someone with experience care to explain why? Does it have to do with each letter having an absolute position on the document? I have no clue, to be honest.
You have to essentially render the document yourself in order to figure out what the order of chunks is. Then you might be able to extract content from the chunks you're interested in - or not. A given zipped chunk might be literally anything.
* identifying characters that won't actually print [white text; zero-size font; beneath another layer; not in printable space]. Once this led to every letter aappppeeaarriinngg ttwwiiccee..
* text in fonts where a != aa [leaving the text as a complicated substitution cypher; caused by font embedding for only the characters in the document]
* text in images
* no spaces: you have to infer them from the gaps between letters (see the sketch after this list)
And these are generated by a whole host of different software with different assumptions, and you never know if there's something else you're missing.
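To illustrate the no-spaces bullet above, here is a minimal sketch of the gap heuristic, assuming pdfminer.six, a placeholder example.pdf, and an arbitrary 0.3 threshold that a real extractor would tune per font:

    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    def words_from_line(line, gap_factor=0.3):
        """Insert a space wherever the horizontal gap between two glyphs
        is large relative to the width of the preceding glyph."""
        chars = [c for c in line if isinstance(c, LTChar)]  # ignore pdfminer's own guesses
        if not chars:
            return ""
        out = [chars[0].get_text()]
        for prev, cur in zip(chars, chars[1:]):
            if cur.x0 - prev.x1 > gap_factor * prev.width:
                out.append(" ")
            out.append(cur.get_text())
        return "".join(out)

    for page in extract_pages("example.pdf"):
        for element in page:
            if isinstance(element, LTTextContainer):
                for line in element:
                    if isinstance(line, LTTextLine):
                        print(words_from_line(line))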
I have no experience in handling raw PDF data, but as a user, I sometimes notice that the computer is not reading the PDF text the same way as I'm reading it. Here are a few examples:
1. When searching (Ctrl+F) a commonly used phrase that occurs multiple times in a PDF, some occurrences fail to show up because of line breaks, accidental hyphenation, etc.
2. Once in a while, I come across PDF files where searches for words containing "fi", "ff", etc. fail because of some insane ligature glyph replacements (see the sketch after this list).
3. Some PDF files that have a two-column layout for text still treat lines across the two columns as one line. Search fails again.
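On point 2, one workaround on the search side is Unicode compatibility normalization, which expands the ligature glyphs back into separate letters; a minimal sketch:

    import unicodedata

    # Text copied out of a PDF that kept the "ffi" ligature as one glyph.
    extracted = "e\ufb03cient"               # U+FB03 = LATIN SMALL LIGATURE FFI
    print("ffi" in extracted)                # False: a plain search misses it

    # NFKC normalization decomposes the ligature back to "f", "f", "i".
    normalized = unicodedata.normalize("NFKC", extracted)
    print(normalized, "ffi" in normalized)   # efficient True

This only helps when the PDF's text layer actually contains the ligature code point; when a custom font maps arbitrary codes to ligature glyphs, even this fails.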
Yeah, pretty much exactly what you said. Since PDF is focused on presentation rather than content, it can write text content in any order, and the rules for converting byte values to Unicode values are extremely complex, supporting many different font formats. Some fonts (Type 3) don't even include mappings to Unicode in some scenarios, instead encoding only the appearance of glyphs and not their meaning.
Assuming you have a reasonable PDF file, you have to parse the entire page content stream, which includes things like colors, lines, Bezier curves, etc., extract the text-showing operations, and then stitch the letters back into words and the words back into reading order, as best you can.
Many sensible PDF producers encode letters and whitespace reasonably, thereby preserving reading order, but this is far from universal.
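A toy example of that stitching step, with hard-coded fragments standing in for what the content stream hands you; real code also has to cope with columns, rotated text and hyphenation:

    # (x, y, text) for each fragment, in the arbitrary order the file emitted them.
    # PDF coordinates grow upward, so a larger y means higher on the page.
    fragments = [
        (72.0, 712.0, "are a nightmare."),
        (200.0, 724.0, "perspective, PDF files"),
        (72.0, 724.0, "From a machine-parsing"),
    ]

    # Naive reading order: top-to-bottom, then left-to-right within a line.
    ordered = sorted(fragments, key=lambda f: (-round(f[1]), f[0]))
    print(" ".join(text for _, _, text in ordered))
    # -> From a machine-parsing perspective, PDF files are a nightmare.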
It was never really designed for that, only for display and printing. There are some features that let you mark up the text for easier searching and selection, though not every producer will use them.
It is a bit of a dog's dinner now, mostly because of backwards compatibility. XPS is better but obviously failed in the market.
It is easier to generate a good document with HTML; you just have to leave out the bells and whistles.
But of course, with the HTML+CSS+JS stack being more a programming language than a document format, there are no bounds to how awful one can make it either.
Why not both? Wasn't the point of the PDF/A ISO standard to use it for archiving? I always felt PDF is better for content like this than HTML, which can change dynamically.
That's not the issue; the problem is that we don't know what a document would look like in next year's browser, or whether it will render at all. That isn't an issue with PDF.
The same applies to PDFs too. There are hundreds of PDF readers, and not all of them support all the features; features also get deprecated. You could run Flash in PDFs, so archived PDFs that relied on that will be broken.
The best idea would probably be a download option for good, old-fashioned XML, along with a nice XSLT stylesheet for transformation into HTML or processing to PDF. This approach would ensure that you can still reliably save and read the content in the future (compared to responsive JS nightmares)...
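As a rough sketch of that approach (records.xml and to-html.xsl are placeholder names for the archived data and the published stylesheet), assuming lxml:

    from lxml import etree

    # Load the archived XML and the XSLT stylesheet shipped alongside it.
    doc = etree.parse("records.xml")
    transform = etree.XSLT(etree.parse("to-html.xsl"))

    # Apply the transformation and serialize the resulting HTML.
    html = transform(doc)
    print(etree.tostring(html, pretty_print=True).decode())

The same data could feed a second, print-oriented stylesheet for the PDF path.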
I think it would be best to have plain HTML available for download (without any JS or any other cruft). Even .epub (e-books) is just zipped HTML with some extra parts.
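Since .epub really is just a zip of (X)HTML, you can see this for yourself; book.epub is a placeholder name:

    import zipfile

    # An EPUB is an ordinary zip archive; the chapters are plain (X)HTML files.
    with zipfile.ZipFile("book.epub") as epub:
        chapters = [n for n in epub.namelist() if n.endswith((".xhtml", ".html"))]
        print(chapters)
        if chapters:
            # Any HTML parser or browser can read a chapter directly.
            print(epub.read(chapters[0])[:200])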
Interesting points about PDFs. I think they still have some uses though, like scanned documents (court records, for example) and downloadable books. I know some invoicing software (say, for hosting) might offer a downloadable version created with some script, but just spitting out a web page is probably easier in that case, and still printable.
However, I know some PDFs don't even let you select text; they seem to just be scans. I noticed this even for some government sites: one city had its ordinances in PDFs that looked like they were typed up on a computer or typewriter, signed by someone, scanned in and re-uploaded. Selectable text is not only useful for blind people, but also for searching the page or copying and pasting parts of it for research, and probably a bit of an SEO boost too, since I'm not sure robots try to parse text in images, even though it's probably very possible with machine learning.

I know there have been lawsuits establishing that government websites have to be accessible, but it seems like many cities, and even some higher-level sites, aren't taking this seriously. I'm not sure if they are just unaware, under budget, or simply running older sites, but if they rebuilt a more modern one it'd be better. I remember reading somewhere on HN once that some city (I forget which one) just deleted its website and replaced it with a plain-text web page after people complained, so now if you need anything you have to go to city hall. That basically negates having a website in the first place.
While PDF is supposed to be an open standard, I'm noticing more and more functionality in PDF files that isn't supported by readers that implement the open standard. I have a project right now to find out why Adobe and Chrome are so different, and, though it's obvious to me, the loss of functionality caused by Chrome not supporting certain things must be explained to those to whom it is not so obvious.
My municipal government insists on emailing out attached PDFs or links to download PDFs.
Some of the PDFs are prepared by the government, others by vendors working for the government (probably based on specs that called for this experience). It ignores the fact that many (if not most) people read on small-screen devices, don't have logon credentials handy, or don't even know where to find downloaded documents. It erodes civic engagement and leads to real problems when policies aren't followed or generate backlash because people didn't know about them before the changes.
The simplest thing to do would be just to send the damn data as plaintext email. The mayor actually gets this -- her newsletters are always in the body of the email itself, never as a PDF attachment or link to download a PDF. Yet her administration is still stuck on PDFs for everything.
It was refreshing to get a notice about cookie settings which, whether ignored, accepted or rejected, remained readily accessible (on mobile at least). Normally, you get to reject the setting once and then it vanishes, meaning you can't update your choice either way. It was a pleasant surprise to be able to change my mind.
>Normally, you get to reject the setting once and then it vanishes, meaning you can't update your choice either way.
Funny how you phrase it. Normally, dark-pattern websites let you accept the setting and never show it to you again, so you won't change your mind. If you reject it, you get a redirect or a nagging screen that lets you change your mind and accept.
Having to have a separately implemented version of this on every website in the world is the dark pattern. Browsers already let you white- or black-list individual domains for cookies; it's crazy that every website design has to accommodate this redundant feature that is never going to be possible to implement perfectly for all use cases.
When I worked at GDS, the aim was to use PDFs; they were considered to have a long shelf life, whereas HTML and the web in general can evolve and leave some people without a way to read it. I wonder what's changed?
PDFs aren't and have never been accessible, and PDFs aren't and have never been responsive. Administrations are just finding out how frustrating the format is.
No. Terrible idea. The authors take the perspective that content should be easy for browsers to render. That's important, but it shouldn't be the only consideration. Browsers can render HTML well, but browsers have incredibly complicated rendering engines. This makes HTML non-portable and non-exportable, not printable, not sharable. I'm sure for each one of those you can create a workaround to bridge the gap, but the big picture is that HTML content tends to be dynamic and rendered by a heavy server component from a database, which makes creating a true standard actually quite hard.

HTML actually really sucks as a portable document standard without some extra semantic definition and a container format. So to make HTML workable you have to do something like what docx and odf do: an XML definition plus resources in a zip container ... which now looks a lot like PDF.
>We also intend to build functionality for users to automatically generate accessible PDFs from HTML documents.
> you have to do something like what docx and odf do: XML definition+resources in zip container ... which now look a lot like PDF.
Only superficially. .docx and .odt are single files as PDF is, but the contents of those files are very different. A PDF document is designed around the assumption of printed pages with a fixed layout. Yes, tagged PDF has been retrofitted onto the base PDF format for accessibility, but most PDFs that I've seen in the wild do not use this extra feature.
To understand why starting with a PDF and converting to HTML is a bad idea, try any PDF-to-HTML converter you can find; there's the pdftohtml utility included with Xpdf and Poppler, and the commercial alternatives are not a lot better. Once you learn more about how PDF works, you'll understand why PDF-to-HTML conversion can't be very good.
> We also intend to build functionality for users to automatically generate accessible PDFs from HTML documents. This would mean that publishers will only need to create and maintain one document, but users will still be able to download a PDF if they need to.
Oh, I see you've added that. The problem with PDF to HTML is that you're trying to fit something static back into something dynamic. I also don't see how PDF rendering is any less complicated. There are many browser implementations that can easily render gov.uk without issues.
This raises the question of how they create these 'HTML documents'. As far as I'm aware, there really isn't an HTML document format. More than likely, these documents will be created by some web-based system (maybe even proprietary and cloud-based, making portability extremely hard) and simply exported as PDF later. That's how every web-based content management system does it today. It's very convenient right now, but when it comes to document maintenance for the next few decades, I'm not sure that's better than something like PDF or ODF.
The code is available on GitHub, you can see for yourself. It's mostly written in a customised markdown variant called Govspeak[0].
You appear to be jumping to ridiculous conclusions. Decisions on how we deliver content are made on the back of a huge amount of research. It's more accessible than it's ever been, other governments and organisations are following our lead.
To suggest PDF, ODF or other crippled, inaccessible, OS-constrained formats over HTML - something everyone can read with any of their devices - shows you need to do more research before posting your ill-informed views.
> HTML actually really sucks as a portable document standard without some extra semantic definition and a container format.
You're probably looking for the WARC [0] ISO standard. Widely supported and used. Including by the UK [1].
WARC is simple enough that tools to create it or extract from it are a dime a dozen.
As for the incredibly complicated rendering engines: when you want to look at the extracted HTML, even Links doesn't struggle to render the content.
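To give a sense of how simple the tooling is, here is a minimal sketch that walks the HTTP responses in an archive with the warcio package; crawl.warc.gz is a placeholder name:

    from warcio.archiveiterator import ArchiveIterator

    # Iterate over the records in a (possibly gzipped) WARC file.
    with open("crawl.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":
                url = record.rec_headers.get_header("WARC-Target-URI")
                body = record.content_stream().read()  # the archived HTML bytes
                print(url, len(body))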
You could offer content in an ebook format (ePub, Mobipocket), which is just HTML and resources with styles, and offer it through the web as straight HTML, with the option of downloading the container.
For sure. But notice you need to define another document standard because you probably don't want to just use an ebook format since that won't capture all your use-cases. So why not just use PDF, docx, or ODF?
Honestly because those formats are more restrictive (in an engineering sense), proprietary (in the sense of who really controls them) and difficult to parse. You’d need some sort of special subset of the format to make it so that people’s use didn’t obscure information, as is easily doable with scrambled PDFs and the like. You’d have to validate that format and reject submissions that didn’t use it. Laziness would probably lead to whatever the printer driver produces being the de facto standard.
That's not the fault of PDF. You can have an HTML document with image links as well. The problem is that you have no standard way of exporting and storing this kind of document.