Hacker News | jlink's comments

Makes me think of the cluster of 200 PS3s that EPFL used in 2009 to solve a 112-bit elliptic curve discrete logarithm problem: https://www.epfl.ch/labs/lacal/articles/112bit_prime/


Initially, I wanted to go way further and add 3D avatars dancing like the connected users. This would use the webcam + BodyPix to map each user's dance onto their 3D avatar. However, all of that requires just too much client-side computation to be usable. Anyway, any suggestions on this lighter version?


... I didn't know she was a celebrity. Initially, I intended to make a YouTube video compiling random people on Chatroulette trying to solve an integral I was showing on my webcam. I never finished that project, but now, 10 years later, I dug up that footage and it made me smile.


haha nice one!


It also works with empty cells in the middle.


I didn't know about Tabula, so I gave it a try right away. Apparently it only extracts tables and ignores everything around them. That might be fine in some cases, but it's a problem if you want to extract a form, a whole textbook, your bank statements, or anything else. I also noticed that Tabula struggles slightly when column separators aren't drawn in the table. But overall it's a good tool for extracting tables only, that's true.


Could be a nice feature, but it's not an easy task. I'll give it a try, though.


Please update us/me when you do. I'm also working on the same problem, would love to chat.


During development I compared my output with that of the pdftotext utility and obtained more or less similar results. The objective of my code was to provide an equivalent tool that is easily embeddable in any Java/Android project, and to learn more about Apache PDFBox.
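A comparison like that can be automated; for instance, a rough word-level Jaccard similarity between two extractors' outputs (a hypothetical helper for illustration, not the author's actual method):

```java
import java.util.HashSet;
import java.util.Set;

public class ExtractCompare {
    // Jaccard similarity over lowercase word sets: a rough way to check that
    // two text extractors produced "more or less similar" output.
    static double similarity(String a, String b) {
        Set<String> wa = words(a), wb = words(b);
        Set<String> inter = new HashSet<>(wa);
        inter.retainAll(wb);
        Set<String> union = new HashSet<>(wa);
        union.addAll(wb);
        return union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
    }

    private static Set<String> words(String s) {
        Set<String> out = new HashSet<>();
        for (String w : s.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) out.add(w);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(similarity("Hello PDF world", "hello pdf world"));  // 1.0
        System.out.println(similarity("Hello PDF world", "hello plain text")); // 0.2
    }
}
```

A score near 1.0 means the two extractions agree on vocabulary, which is a cheap sanity check even if it ignores word order and layout.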


I imagine it's not an easy task guessing about proportionally spaced fonts, overlapping bounding boxes, columns, tables, wrapping, and so forth.


Yes, definitely not easy, but fortunately PDFBox offers a solid base to start from.


Happy to know it could help you. Good cooking to you!


Both I and my accountant thank you haha.


glad to hear that!


Who would be interested in an online service doing this job?


If you really want to rake it in: serve text versions of the top 10,000 websites at static speeds (meaning instantly; I swear, boot a ramdrive (tmpfs) and serve static HTML from nginx, all from RAM). There is so much crap on most sites. Re-crawl hourly.

Monetize via Google AdWords.

EDIT: I'm not sure why I'm being downvoted. I am not suggesting serving PDFs. I am suggesting serving tiny text renders of top sites that are otherwise much too bloated.

The hard part is getting the text and layout right. Many people read many sites just for the text, IMO.

So I am suggesting you make an all-text version.

As an example, the front page of the New York Times right now, copied into Microsoft Word, is 2,504 words. When I save that Word document as .txt, I get a 16.4 KB file.

By comparison, when I put the site into a Page Size Checker -- http://smallseotools.com/website-page-size-checker/ -- I get 214.23 KB. That is impressively small, and it's a fast page.

If I try their competition, the Washington Post, I get 237 KB. If I try the Wall Street Journal, I get 938.15 KB -- nearly a full Megabyte. (This is actually more what I was expecting - I'm impressed by the Times.)

Suppose someone desperately wants to glance at the Wall Street Journal over a poor connection where they barely get any data. The difference between 12 KB and nearly a megabyte is huge. At roughly 3 KB/s, it's the difference between 4 seconds and 312 seconds: 4 seconds as compared with 5 full minutes.
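A quick back-of-the-envelope check of those numbers (assuming roughly 3 KB/s, the rate implied by the 312-second figure):

```java
public class DownloadTime {
    // Whole seconds to fetch a page of the given size at the given throughput
    // (truncated, matching the figures quoted above).
    static long seconds(double sizeKB, double kbPerSec) {
        return (long) (sizeKB / kbPerSec);
    }

    public static void main(String[] args) {
        double rate = 3.0; // KB/s, an assumed very poor connection
        System.out.println(seconds(12.0, rate));   // 4   -> text-only render
        System.out.println(seconds(938.15, rate)); // 312 -> full WSJ page
    }
}
```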

So in my opinion there is a real need for such a service for anyone who desperately wants a text render. Preserving any formatting at all helps hugely.


You can use the SHA-1 of the PDFs to avoid serving the same PDF twice.


SHA-256 ;)


Have you been away from HN for a few days? The SHA-1 collision example (SHAttered) uses PDFs in its demo, hence the other commenter saying SHA-256.
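For what it's worth, a minimal sketch of SHA-256-based dedup using only the Java standard library (the helper names are made up for illustration):

```java
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

public class PdfDedup {
    // Hex-encoded SHA-256 of a byte array, used as the dedup key.
    static String sha256Hex(byte[] data) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest(data)) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    // Returns true only the first time these exact bytes are seen,
    // i.e. when the PDF should actually be stored/served.
    static boolean firstTimeSeen(Set<String> seen, byte[] pdfBytes) throws Exception {
        return seen.add(sha256Hex(pdfBytes));
    }

    public static void main(String[] args) throws Exception {
        Set<String> seen = new HashSet<>();
        byte[] pdf = "fake pdf contents".getBytes();
        System.out.println(firstTimeSeen(seen, pdf)); // true
        System.out.println(firstTimeSeen(seen, pdf)); // false: duplicate skipped
    }
}
```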


The OP was almost certainly being sarcastic.


For clarity, can you edit your comment to name cozzyd (the OP you mention)? I am sometimes sarcastic, but not in this case. I'll then delete this comment.


Sounds cool and all but two huge problems:

1) Copyright: re-serving the complete content of the top 100 sites with your own ads does not fall under fair use and would almost certainly be a magnet for lawsuits.

2) Distribution: how do you find your niche of people with poor internet connections and get them to use your mirror instead of whatever site it is they want to read?


No clue on 2. For 1, you could frame it as "Opera Mini/Turbo as a service", so that you can argue you are just shifting the viewer to the site; it's still the user doing the viewing. It helps if you preserve any text ads on the site (or links with alt text, given you're probably not serving images; you could also replace images with grainy, black-and-white, very low-fidelity versions, which shifts most of the ads from the original site without adding hugely to your footprint). To be honest, I also thought JavaScript etc. could be run server-side, so that even the heaviest sites of all are downloaded and then turned into text versions. In many cases that would let someone browse a site that is otherwise incredibly slow.
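The grainy black-and-white image idea could be sketched like this in plain Java (the target width and formats are arbitrary assumptions):

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

public class LowFi {
    // Downscale an image to at most maxWidth pixels wide and convert it to
    // grayscale: a cheap, low-bandwidth stand-in for the original.
    static BufferedImage lowFi(BufferedImage src, int maxWidth) {
        int w = Math.min(maxWidth, src.getWidth());
        int h = Math.max(1, src.getHeight() * w / src.getWidth());
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = out.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                           RenderingHints.VALUE_INTERPOLATION_BILINEAR);
        g.drawImage(src, 0, 0, w, h, null); // scaling into a gray buffer converts color
        g.dispose();
        return out;
    }

    public static void main(String[] args) {
        BufferedImage big = new BufferedImage(1600, 900, BufferedImage.TYPE_INT_RGB);
        BufferedImage small = lowFi(big, 320);
        System.out.println(small.getWidth() + "x" + small.getHeight()); // 320x180
    }
}
```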

This isn't legal advice, just the approach I would take off the top of my head. I agree with you that it's hard. With the framing "Opera minifier/turbofier as a service" it could work, though. Like a remote browser (in a VM). Like presenting it as "Lynx as a service" (Lynx being an old terminal-based text browser). Something like that, anyway.


Doesn't Opera Mini or Turbo already provide this service? Perhaps add PPMd proxy text compression with an English dictionary, plus a JavaScript browser plugin on top of that. You can't get more efficient than that.


Maybe, but asking someone to use a new browser is asking a lot. If you like, you can think of this as Opera minifier/turbofier as a service.


For what it's worth, here's a service that does that https://documentalchemy.com/demo/pdf2txt (and more: https://documentalchemy.com/demo)


Just tried the demos on this website.

I tried to extract text from a PDF that already has searchable text, which can be copy-pasted. This should be the easiest task of all, but it made mistakes in every other word.

Then I asked the website to turn a PDF into a Word file. It just inserted the whole PDF as a picture in Word.


> Then I asked the website to turn a PDF into a Word file. It just inserted the whole PDF as a picture in Word.

Really? I'm pretty sure that's not the way this works.


Thanks for sharing this one; I didn't know it.


Yeah, sure: a public one for non-privacy-critical PDFs, plus something like a Heroku button to deploy your own secure app (with auth and no storage).

See e.g. my file sharing app https://github.com/andreif/SecretFile


I bet most would, but privacy would be a big concern for me at least. A script is the optimal format for me.


Privacy is the reason I'd prefer to do this in house.


Not to mention corporate privacy/IP.


Indeed. I might also show this to my wife (she works at Nissan); it might save her a bit of time.


Check out docparser.com.


Interesting service, which didn't exist yet back in 2015 when I wrote my class.


Correct! We launched in July 2016.

