
EDIT: Release notes: https://tug.org/texlive/doc/texlive-en/texlive-en.html#

One technology that has stood the test of time: the amazingly well-designed TeX by Donald Knuth and LaTeX by Leslie Lamport.

Plain TeX can be a bit beginner-unfriendly, because most beginners use LaTeX and experienced people have their own 'coding style', but it's amazingly powerful.

I only wish somebody would add the minimum required primitives to plain TeX to render reflowable text in browsers and e-book readers.

TeX was designed for "beautiful" typesetting, but it gives you much more control than that, IMHO. I'd urge everyone to try it out at least once; I use it offline, but I think overleaf.com allows plain TeX too. (XeTeX may work best for Unicode; it's plain TeX plus minimal additions for Unicode handling.)
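
If you want to give plain TeX a spin, a minimal document looks something like the sketch below. This is only an illustration: the font name is an assumption (any installed system font will do), and the file is meant to be compiled with xetex so that UTF-8 input works out of the box.

    % hello.tex -- minimal plain XeTeX document; compile with: xetex hello.tex
    % "Libertinus Serif" is only an example and must be installed on your system.
    \font\body="Libertinus Serif" at 11pt
    \body
    Hello, world. Plain \TeX\ with Unicode input: naïve café, déjà vu.
    \bye

Running xetex on this produces a PDF directly (the engine drives xdvipdfmx for you), which makes it an easy way to poke at plain TeX without any LaTeX machinery.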

On that note: I am not fond of PDFs because of their awfully poor Unicode search support. Does anybody knowledgeable know of a good target format I should use (and the appropriate drivers)?



You might like this LaTeX to HTML converter I'm working on: https://github.com/arxiv-vanity/engrafo

What primitives do you think need adding to TeX? LaTeXML, which powers Engrafo, does a pretty good job of converting plain TeX, as well as LaTeX.


Ironically, Computer Modern is terrible to read on computer displays. It was designed for ink and toner bleed, and looks great when printed on 1970s-era Xerox printers [1], but it's far too thin for digital rendering.

I might suggest Bitstream Charter for a well-designed and readable analogue to CMR for digital use.

[1] https://www.typografie.info/3/topic/22238-ist-die-computer-m... (German language)


That is the Type 1 realization of it; the original Metafont outputs bitmaps tailored to the target device. Unfortunately, no widely used font rendering library supports Metafont directly, so the fonts are usually converted to Type 1 or OpenType and lose that capability.


A very good font family for screens that's openly licensed is Inter [2]. It's well designed and also quite versatile.

[2] https://rsms.me/inter/


Out of interest, what are the advantages of LaTeXML over Pandoc's LaTeX to XML/HTML conversion? Why did you choose LaTeXML?


Last time I checked, pandoc’s LaTeX parser did not support much of the TeX syntax. Basically, it only works for some subset of LaTeX.


This is super cool, thank you for sharing!


The problem is that browsers choose not to implement high-quality justification algorithms like the Knuth-Plass algorithm that TeX uses, because it is computationally intensive. That’s why justified text looks like garbage on the web.

There are some experimental JavaScript implementations, but without browser support reflowing high-quality justified text is a non-starter on the web.


TeX’s line-breaking algorithm is certainly not computationally intensive. On my 7-year-old MacBook Pro, it takes 0.34 seconds to run the entire 495-page TeXbook on a single thread. That includes parsing, macro expansion, page-breaking, lots of (slower) math layout, and DVI output, which means that the line-breaking takes at most a few hundred microseconds per page.

Remember, TeX was written to be usable on a 1 megabyte, 10 megahertz machine, where it ran about a page a second. One of my contributions at the time was to modify the Pascal compiler on the SAIL PDP-10 to count cycles for every machine instruction executed by TeX over all users over a number of months, and Knuth fine-tuned the inner loops of TeX here and there based on the results (the code that automatically inserts kerns and ligatures got the most attention, IIRC).


My comment was based on what I've read about this from multiple supposed authorities on web development. Intuitively it made sense that you wouldn't want a computationally intensive algorithm in browsers on mobile devices, or where the content area changes size frequently, as on web pages. It's fascinating to have those assumptions overturned by someone so deeply involved in TeX.


Well, if you're going to be nice about it, here's some more info: The Mozilla discussion claims that TeX's line-breaking algorithm is "quadratic," which seems a bit far-fetched. So, I just pulled the raw text of Moby Dick off the web, removed the blank lines so it's all one paragraph, and ran it. TeX produces 112 pages (hmmm, it was just "Volume 1") in 2.1 seconds. So, 30x slower than "normal-size" paragraphs, but hardly quadratic, as the single-paragraph Moby Dick is 1000x as large as the average paragraph in The TeXbook. Of course, as pointed out elsewhere, with a little effort, one could make minor changes that would remove even this speed penalty.

I'm much more sympathetic to the point that, while TeX's line-breaking algorithm can easily handle paragraphs with different line lengths for each line, it needs to know at the start what the different line lengths are. It's not clear how to generalize it to be able to handle layouts where the length of the nth line of a paragraph depends on the earlier (or later!) line breaks. Think tall floating figures which impinge on the text area of the paragraph they're in. I'm guessing that was the real impediment in using it in Web-land.
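
To make that last point concrete, this is roughly what "knowing the line lengths at the start" means in TeX terms: the \parshape primitive takes all of the per-line indents and widths before the paragraph is broken. A small plain TeX sketch, with dimensions chosen arbitrarily for illustration:

    % \parshape n  i1 l1  i2 l2 ... in ln  sets indent/width pairs for lines 1..n;
    % any lines beyond n reuse the last pair.  Widths here are arbitrary examples.
    \parshape 3
      0pt \hsize       % line 1: full width
      0pt 0.6\hsize    % line 2: narrowed, e.g. to leave room for a floating figure
      0pt 0.6\hsize    % line 3 and beyond: still narrowed
    A long paragraph goes here; TeX's optimizer is still free to choose the breaks,
    but only because every line's width was fixed before breaking began.
    \bye

There is no way to say "make line 4 shorter only if the float ends up beside it", which is exactly the kind of dependency a browser layout engine has to resolve.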


My assumption was also that performance is the reason we aren't getting more esthetically pleasing line breaking. Until I read a comment[1] by Philip Walton, who works on WebRender at Mozilla, that is.

[1]: https://news.ycombinator.com/item?id=19473277


> ... it's not possible in the general case, at least not with the specs as they are today.

That's a fair response, but how about changing the (CSS) specs to allow better line breaking? Surely that would take less time than WebUSB, and Google or Mozilla could quickly push it through the W3C.


See also the 8-year-old Mozilla bug for a more detailed discussion: https://bugzilla.mozilla.org/show_bug.cgi?id=630181


I'm not a web developer, so take this with a huge pinch of salt, but, if floats are the problem, does that imply that with layouts that use CSS Grid or Flexbox, we could have a decent justification algorithm?


Oh, thanks for this. Very interesting.


He's Patrick, not Philip.


Whoops. Thanks for the correction. I think it's too late to edit my original post. :/


It's not (just) about the computational cost; it's also incompatible with the standard. The only compatible algorithm is a naive greedy one.


I haven't tried this myself, but I think ConTeXt can output PDFs with embedded XML markup and even EPUB files. You need to use \startsection and \stopsection instead of just \section, and maybe there are other limitations, but it's a small price to pay, isn't it? (A rough sketch follows the links below.)

https://wiki.contextgarden.net/Epub

https://wiki.contextgarden.net/Export
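
Based on those wiki pages (I haven't verified this end to end), a minimal ConTeXt document with the export back end enabled would look roughly like this; the section title and text are placeholders:

    % export.tex -- ConTeXt sketch, untested; compile with: context export.tex
    % \setupbackend[export=yes] asks ConTeXt to write an XML export alongside the PDF.
    \setupbackend[export=yes]
    \starttext
    \startsection[title={Introduction}]
      Some text that should end up both in the PDF and in the XML export.
    \stopsection
    \stoptext

The Epub wiki page above describes a further step that packages the exported XML into an EPUB container.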


TeX is fundamentally not compatible with reflowable text because, at the lowest level, it's about putting glyphs in positions.


I fail to see any fundamental incompatibility. Every text renderer of any kind is about putting glyphs in positions. The layout would declare a few things immovable, and the algorithms need to decide how to fit text around those constraints. The algorithms TeX uses are computationally expensive, but I don't see why, with faster computers, you can't reflow the text. About 15 years ago it used to take a dozen seconds to compile my typical PDF; now overleaf.com does it in just a second, in the browser.


If you are outputting a layout for a reflowable medium like HTML or EPUB, you are not putting glyphs in positions. You are constructing graphical objects and defining their relationships (and how those relationships change based on form factor), and you are permitting the output device to render glyphs and put them in positions.

This is why we don't use PDF for web pages.


The web browser also "puts glyphs into positions." Neither HTML nor TeX specifies positioning in the source format, though. The same TeX document can be re-rendered at different page sizes, font sizes, etc. I'm really not sure what fundamental difference you are seeing.
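
To illustrate that, here is a tiny plain TeX example of what "re-rendering at a different size" amounts to; the dimensions and font are arbitrary choices, not anything prescribed:

    % Same document, two "form factors": only the parameters change, not the text.
    \hsize=4.5in \vsize=7in      % a small, e-reader-ish text block
    % \hsize=6.5in \vsize=9in    % ...or uncomment this for a letter-sized block
    \font\body=cmr10 at 12pt \body
    The paragraph text itself never mentions positions; TeX recomputes the line
    and page breaks from these parameters every time it is run.
    \bye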


It's still way too slow. Lots of the documents I've worked on recently have taken over ten seconds to compile on my 2017 MacBook Pro (Touch Bar). That's 3 orders of magnitude too slow for reflowing at 60 Hz when you resize a window. I doubt a laptop will ever be 3 orders of magnitude faster (in sequential execution) than the one I have now, radical post-semiconductor computers notwithstanding.


Compilation seems to take place on the server.


Doesn't Xe(La)TeX/Lua(La)TeX produce PDFs with decent unicode search support? What issues do you have?


XeTeX in particular does, but my problem is the uncertainty around the whole process and the frustration when a search fails on a 300-page document.


? PDF works fine for searching.

It depends on the document; a PDF page could be:

1) Text

2) An image of text

3) Lines/Curves that happen to be in the shape of letters/text

If it's text, it's perfectly searchable. If it's one of the others, the creator has to also OCR it and add the text behind the image, or in front but invisible (no stroke or fill).


It could also be mostly text, but mixed with ligatures having their own codepoints (sometimes non-Unicode).


If you set the \XeTeXgenerateactualtext=1 option in a Xe(La)TeX document, the resulting PDF will include ActualText annotations to support searching even in complex scripts with ligatures, character reordering, etc.
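
For reference, a minimal XeLaTeX file using that switch might look like the sketch below; the fontspec setup and the font name are my own additions for illustration, not part of the switch itself:

    % actualtext.tex -- compile with: xelatex actualtext.tex
    \documentclass{article}
    \XeTeXgenerateactualtext=1   % embed /ActualText spans so copy/search recovers the input text
    \usepackage{fontspec}        % the font below is just an example
    \setmainfont{Libertinus Serif}
    \begin{document}
    Words such as office and waffle contain the ffi and ff ligatures, yet they
    should still be found when you search for the plain letters.
    \end{document}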


Unicode search in PDF works fine; there are just a few issues that can cripple it for specific files, or for specific files in specific PDF implementations. The following is advice for anyone working with the format; it should not be construed as making excuses for how PDF does things. It's a ginormous format with a ginormous spec that's been around for a very long time and has accumulated a lot of baroque qualities over the years. So I don't mean to condemn it, but I'm not exonerating it either. It is what it is, and if you have to work with it this might be useful. Anyhoo:

Regardless of what one thinks of their other qualities, if a PDF can be searched in Adobe Reader or Acrobat, then the file is probably OK, and the PDF reader you were trying to search it with has a bug on its end (or, more likely, hit some unimplemented dark corner of the PDF spec).

On the other hand, if the file isn't searchable via Reader/Acrobat, then the problem is most likely with the authoring of the file itself. The most common thing that breaks searching is when instead of embedding all fonts used, a PDF refers to fonts by name from the local system. This can cause unpredictable issues when reading the PDF from another OS that can't resolve those font names.

Another common breaker of search, and one that seems to be much more common with TeX, is workflows that somehow produce PDFs with embedded Type 3 fonts. Type 3 fonts represent glyphs using PDF drawing instructions; it's less a file format than something that exists only as an embedded font within a PDF, and I've only seen them in the wild when authored by pdflatex or similar. TrueType and OpenType seem to be the most reliable formats to embed across the most commonly used PDF implementations, but that's an educated guess. Type 3 font support is spotty in non-Adobe implementations.
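
For the pdflatex case specifically, Type 3 output is usually a sign that bitmap renditions of the Computer Modern/EC fonts were embedded. A common preamble tweak (a frequent fix, not a guarantee) is to load a vector clone of Computer Modern:

    % pdflatex preamble sketch to avoid bitmap Type 3 fonts in the output PDF.
    % Verify the result with a font inspector in your PDF reader of choice.
    \documentclass{article}
    \usepackage[T1]{fontenc}  % 8-bit font encoding
    \usepackage{lmodern}      % Latin Modern: Type 1 (vector) clone of Computer Modern
    \begin{document}
    This text should be embedded using vector fonts only.
    \end{document}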

Finally, font subsetting might screw up search. Most software that knows how to produce PDFs can produce PDFs with embedded but subsetted fonts. This means the software that created the file embedded TrueType fonts (for example), but created a special version of the font for embedding that only contained the glyphs used in the PDF. Depending on the quality of the software used to do the font subsetting, the output PDF might not retain the mapping of glyphs back to Unicode characters. If that happens, text using that font in the PDF becomes unsearchable, but the PDF remains renderable.

None of this is meant to excuse the shortcomings of the format, it's clearly overcomplicated and fragile. But when I've seen problems with searchability of PDFs, it's most often been with the software that was used to create the files, or how that software was configured by the author. And when it is a problem with the authored file itself, it's almost always because of fonts. Either the fonts aren't embedded, or they're embedded in a slightly oddball format that's in spec for PDF but not perfectly, universally supported by all of the various non-Adobe PDF implementations.


I hit the comment limit earlier, but thank you for writing in detail about what's actually going on behind the scenes, so I can take care not to let the problem affect me much.

That said, and despite you yourself having mentioned that there is no excuse for the shortcomings, I feel this decision in particular is just plain inexcusable: if you're going to write a portable document, at least use some form of UTF encoding rather than indexing into glyphs (or on top of it!)


Kind of.

I was deep into LaTeX during the '90s while at university; the state of my copy of Lamport's book reflects how much I used to refer back to it.

Nowadays I rather prefer the convenience of something like FrameMaker.


Thought about updating, but the actual changelog is pretty underwhelming for a 5+GB download... https://tug.org/texlive/doc/texlive-en/texlive-en.html#news



