Hacker News | jlink's comments

Makes me think of the cluster of 200 PS3s that EPFL used in 2009 to solve a 112-bit elliptic curve discrete logarithm problem: https://www.epfl.ch/labs/lacal/articles/112bit_prime/


Initially, I wanted to go way further and add 3D avatars dancing like the connected users. This would use the webcam + BodyPix to map each user's dance onto their 3D avatar. However, all of that requires just too much client-side computation to be usable. Anyway, any suggestions on this lighter version?


... I didn't know she was a celebrity. Initially, I intended to make a YouTube video compiling random people on Chatroulette trying to solve an integral I was showing on my webcam. I never finished that project, but now, 10 years later, I dug up that footage and it made me smile.


haha nice one!


It also works with empty cells in the middle.


I didn't know about Tabula, so I gave it a try right away. Apparently it only extracts tables and ignores everything around them. That might be fine in some cases, but it's a problem if you want to extract a form, a whole textbook, your bank statements, or anything else. I also noticed that Tabula struggles slightly when column separators aren't drawn in the table. But overall it's a good tool for extracting tables only, that's true.


Could be a nice feature, but it's not an easy task. I'll give it a try, though.


Please update us/me when you do. I'm also working on the same problem, would love to chat.


During development I compared my output with that of the pdftotext utility and obtained more or less similar results. The objective of my code was to provide an equivalent tool that is easily embeddable in any Java/Android project, and to learn more about Apache PDFBox.
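A comparison like that can be automated; for instance, a rough word-level Jaccard similarity between two extractors' outputs (a hypothetical helper for illustration, not the author's actual method):

```java
import java.util.HashSet;
import java.util.Set;

public class ExtractCompare {
    // Jaccard similarity over lowercase word sets: a rough way to check that
    // two text extractors produced "more or less similar" output.
    static double similarity(String a, String b) {
        Set<String> wa = words(a), wb = words(b);
        Set<String> inter = new HashSet<>(wa);
        inter.retainAll(wb);
        Set<String> union = new HashSet<>(wa);
        union.addAll(wb);
        return union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
    }

    private static Set<String> words(String s) {
        Set<String> out = new HashSet<>();
        for (String w : s.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) out.add(w);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(similarity("Hello PDF world", "hello pdf world"));  // 1.0
        System.out.println(similarity("Hello PDF world", "hello plain text")); // 0.2
    }
}
```

A score near 1.0 means the two extractions agree on vocabulary, which is a cheap sanity check even if it ignores word order and layout.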


I imagine it's not an easy task guessing about proportionally spaced fonts, overlapping bounding boxes, columns, tables, wrapping, and so forth.


Yes, definitely not easy, but fortunately PDFBox offers a solid base to start from.


Happy to know it could help you. Good cooking to you!


Both I and my accountant thank you haha.


glad to hear that!


Who would be interested in an online service doing this job?


If you really want to rake it in: serve text versions of the top 10,000 websites at static speeds (meaning instantly; I swear, boot a ramdrive (tmpfs) and serve static HTML from nginx, all from RAM). There is so much crap on most sites. Re-crawl hourly.

Monetize via Google AdWords.

EDIT: I'm not sure why I'm being downvoted. I am not suggesting serving PDFs. I am suggesting serving tiny text renders of top sites that are otherwise much too bloated.

The hard part is getting the text and layout right. Many people read many sites just for the text, IMO.

So I am suggesting you make an all-text version.

As an example, the front page of the New York Times right now, copied into Microsoft Word, is 2,504 words. When I save that Word document as .txt, I get a 16.4 KB file.

By comparison, when I put the site into a Page Size Checker -- http://smallseotools.com/website-page-size-checker/ -- I get 214.23 KB. That is impressively small, and it's a fast page.

If I try their competition, the Washington Post, I get 237 KB. If I try the Wall Street Journal, I get 938.15 KB -- nearly a full Megabyte. (This is actually more what I was expecting - I'm impressed by the Times.)

Suppose someone desperately wants to glance at the Wall Street Journal over a poor connection where they barely get any data. The difference between 12 KB and nearly a megabyte is huge. At roughly 3 KB/s, it's the difference between 4 seconds and 312 seconds: 4 seconds as compared with 5 full minutes.
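A quick back-of-the-envelope check of those numbers (assuming roughly 3 KB/s, the rate implied by the 312-second figure):

```java
public class DownloadTime {
    // Whole seconds to fetch a page of the given size at the given throughput
    // (truncated, matching the figures quoted above).
    static long seconds(double sizeKB, double kbPerSec) {
        return (long) (sizeKB / kbPerSec);
    }

    public static void main(String[] args) {
        double rate = 3.0; // KB/s, an assumed very poor connection
        System.out.println(seconds(12.0, rate));   // 4   -> text-only render
        System.out.println(seconds(938.15, rate)); // 312 -> full WSJ page
    }
}
```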

So in my opinion there is a real need for such a service for anyone who desperately wants a text render. Preserving any formatting at all helps hugely.


You can use the SHA-1 of the PDFs to avoid serving the same PDF twice.


SHA-256 ;)


Have you been away from HN for a few days? The SHA-1 collision example (SHAttered) uses PDFs in its demo, hence the other commenter saying SHA-256.
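For what it's worth, a minimal sketch of SHA-256-based dedup using only the Java standard library (the helper names are made up for illustration):

```java
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

public class PdfDedup {
    // Hex-encoded SHA-256 of a byte array, used as the dedup key.
    static String sha256Hex(byte[] data) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest(data)) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    // Returns true only the first time these exact bytes are seen,
    // i.e. when the PDF should actually be stored/served.
    static boolean firstTimeSeen(Set<String> seen, byte[] pdfBytes) throws Exception {
        return seen.add(sha256Hex(pdfBytes));
    }

    public static void main(String[] args) throws Exception {
        Set<String> seen = new HashSet<>();
        byte[] pdf = "fake pdf contents".getBytes();
        System.out.println(firstTimeSeen(seen, pdf)); // true
        System.out.println(firstTimeSeen(seen, pdf)); // false: duplicate skipped
    }
}
```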


The OP was almost certainly being sarcastic.


For clarity, can you edit your comment to name cozzyd (the OP you mention)? I am sometimes sarcastic, but not in this case. I'll then delete this comment.


Sounds cool and all but two huge problems:

1) Copyright: re-serving the complete content of the top 100 sites with your own ads does not fall under fair use and would almost certainly be a magnet for lawsuits.

2) Distribution: how do you find your niche of people with poor internet connections and get them to use your mirror instead of whatever site it is they want to read?


No clue on 2. For 1, you could frame it as "Opera Mini/Turbo as a service", so that you can argue you are just shifting the viewer to the site; it's still the user doing the viewing. It helps if you preserve any text ads on the site (or links with alt text, given you're probably not serving images; you could also replace images with grainy, black-and-white, very low-fidelity versions, which shifts most of the ads from the original site without adding hugely to your footprint). To be honest, I also thought JavaScript etc. could be run server-side, so that even the heaviest sites of all are downloaded and then turned into text versions. In many cases that would let someone browse a site that is otherwise incredibly slow.
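The grainy black-and-white image idea could be sketched like this in plain Java (the target width and formats are arbitrary assumptions):

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

public class LowFi {
    // Downscale an image to at most maxWidth pixels wide and convert it to
    // grayscale: a cheap, low-bandwidth stand-in for the original.
    static BufferedImage lowFi(BufferedImage src, int maxWidth) {
        int w = Math.min(maxWidth, src.getWidth());
        int h = Math.max(1, src.getHeight() * w / src.getWidth());
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = out.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                           RenderingHints.VALUE_INTERPOLATION_BILINEAR);
        g.drawImage(src, 0, 0, w, h, null); // scaling into a gray buffer converts color
        g.dispose();
        return out;
    }

    public static void main(String[] args) {
        BufferedImage big = new BufferedImage(1600, 900, BufferedImage.TYPE_INT_RGB);
        BufferedImage small = lowFi(big, 320);
        System.out.println(small.getWidth() + "x" + small.getHeight()); // 320x180
    }
}
```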

This isn't legal advice, just the approach I would take off the top of my head. I agree with you that it's hard. With the framing "Opera minifier/turbofier as a service" it could work, though. Like a remote browser (in a VM). Like presenting it as "Lynx as a service" (Lynx being an old terminal-based text browser). Something like that, anyway.


Doesn't Opera Mini or Turbo already provide this service? Perhaps add PPMd proxy text compression with an English dictionary, plus a JavaScript browser plugin on top of that. You can't get more efficient than that.


Maybe, but asking someone to use a new browser is asking a lot. If you like, you can think of this as Opera minifier/turbofier as a service.


For what it's worth, here's a service that does that https://documentalchemy.com/demo/pdf2txt (and more: https://documentalchemy.com/demo)


Just tried the demos on this website.

I tried to extract text from a PDF that already has searchable text, which can be copy-pasted. This should be the easiest task of all, but it made mistakes in every other word.

Then I asked the website to turn a PDF into a Word file. It just inserted the whole PDF as a picture in Word.


> Then I asked the website to turn a PDF into a Word file. It just inserted the whole PDF as a picture in Word.

Really? I'm pretty sure that's not the way this works.


Thanks for sharing this one; I didn't know it.


Yeah, sure: a public one for non-privacy-critical PDFs, plus something like a Heroku button to deploy your own secure app (with auth and no storage).

See e.g. my file sharing app https://github.com/andreif/SecretFile


I bet most would, but privacy would be a big concern for me at least. A script is the optimal format for me.


Privacy is the reason I'd prefer to do this in house.


Not to mention corporate privacy/IP.


Indeed. I might also show this to my wife (she works at Nissan); it might save her a bit of time.


Check out docparser.com.


Interesting service, which didn't exist yet back in 2015 when I wrote my class.


Correct! We launched in July 2016.

