Hacker News | avdempsey's comments

Internet Archive | https://archive.org | Full-Time | Remote PT-ET Hours | Non-Profit

Help build web crawling, digital preservation, and public access services for over one thousand partner libraries and other cultural heritage institutions.

https://app.trinethire.com/companies/32967-internet-archive/...


Hello! Love the Internet Archive (use it daily - thank you public web infrastructure!) and this would be a genuine dream job.

I can build web crawlers in my sleep. My last job was a founding engineer role that led to a Director of Data position.


I really wish companies like yours or Mozilla, with righteous missions, occasionally had opportunities for people below senior level.


This is USA only.


That's so cool. Good luck finding who you need. Love the Archive's work.


this website is sick


We’ve been using Temporal (the Python SDK) for some new projects at Internet Archive. It’s early days but we’re very excited. We run our own infrastructure, and we get more power-loss events and other intermittent issues than most. The durable execution of workflows that Temporal promises seems like it was made for us. And once code is “temporalized” we get to eject a bunch of ad-hoc resiliency stuff into the sun, and what’s left is a lot clearer.
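
To give a flavor, here's a minimal sketch of the pattern (not our actual code; the workflow, the activity, and names like fetch_page are made up). The point is that retries and crash recovery come from Temporal's replay, not hand-rolled code:

  from datetime import timedelta

  from temporalio import activity, workflow
  from temporalio.common import RetryPolicy

  @activity.defn
  async def fetch_page(url: str) -> str:
      # Side effects live in activities; Temporal retries them per the
      # policy below and records their results in the workflow history.
      return f"<html for {url}>"  # stand-in for real crawling work

  @workflow.defn
  class ArchiveWorkflow:
      @workflow.run
      async def run(self, url: str) -> str:
          # If a worker dies mid-run (say, a power-loss event), another
          # worker replays the event history and resumes right here.
          return await workflow.execute_activity(
              fetch_page,
              url,
              start_to_close_timeout=timedelta(minutes=5),
              retry_policy=RetryPolicy(maximum_attempts=5),
          )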

There’s a lot to learn with it. I’ve seen ramp-up take a few months per engineer, though we’re also making it a little harder on ourselves by self-hosting, being early adopters of the Python SDK (Go, Java, and TypeScript are the most mature, I think), and dealing with a mix of Python async and multiprocessing (there are a bunch of CPU-bound activities in the mix). The docs are solid, and the team is responsive to community users.
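
For the CPU-bound part, the SDK can run synchronous activities in a process pool. A sketch, with the same caveat that the names (derive_stats, the task queue) are hypothetical:

  import asyncio
  import multiprocessing
  from concurrent.futures import ProcessPoolExecutor

  from temporalio import activity
  from temporalio.client import Client
  from temporalio.worker import SharedStateManager, Worker

  @activity.defn
  def derive_stats(path: str) -> int:
      # Synchronous, CPU-bound work; runs in a separate process so it
      # neither blocks the async event loop nor fights the GIL.
      return len(path)  # stand-in for real number crunching

  async def main() -> None:
      client = await Client.connect("localhost:7233")
      worker = Worker(
          client,
          task_queue="cpu-bound",
          activities=[derive_stats],
          # Sync activities need an executor; a process pool suits
          # CPU-bound work.
          activity_executor=ProcessPoolExecutor(max_workers=4),
          # Required with a process pool so heartbeats and cancellation
          # can cross process boundaries.
          shared_state_manager=SharedStateManager.create_from_multiprocessing(
              multiprocessing.Manager()
          ),
      )
      await worker.run()

  if __name__ == "__main__":
      asyncio.run(main())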


Can I ask what issues or difficulties you faced going the self-hosting route? We’re a small outfit thinking of self-hosting Temporal to save costs, and I’d be interested to hear about your experience.


Our department is still in a world of Ansible and VMs, so we can't yet take advantage of some of the work that's gone into making Temporal easy to run in k8s. We're using Postgres for Temporal's persistence because we're already familiar with operating it (and it's pretty cool that you get some choice of DB).

We've hit a bug in the newest version of Temporal server that doesn't play nicely with pgbouncer in transaction pooling mode, but they've responded to our ticket and seem to have a solution; we're running on the previous version for now.

Other than that, it's just been the up-front cost of building the Ansible playbooks. We haven't pushed it to any kind of load limits yet, or built out a real high-availability deploy. Day-to-day operation at our utilization level has been no drama so far.


In my career I've seen resources pulled away from QA teams (sometimes to the point of disbanding them completely). It would be interesting to see things swing the other way.


I'm more or less waiting on the same thing, but I wonder how many checks it will be able to pick up. The maintainer has said that many of Pylint's remaining checks involve type-checking and inferring types for less-than-fully-typed code, and that writing a type checker is probably out of scope, since type checkers are massive projects in their own right. I've not seen any statement about the expected future coverage of Pylint's checks.

Perhaps running Ruff plus a type checker gets us close to what Pylint does today? Pylint is pretty comprehensive, and awesome for that, but I'd love to lint at the speed of Ruff.


Internet Archive | Data Engineer | Remote (US, CA) | Full-Time | archive.org

Internet Archive is a non-profit building a free library of all of the published works of humanity to share with the world. We're not there yet, but we've managed to accumulate some data along the way. Can you help us engineer it?

The Archiving and Data Services department provides services to mission-aligned organizations (primarily other libraries and cultural heritage institutions). These services include: web crawling SaaS, managed large-scale crawls, long-term digital preservation, and particularly relevant for this role: making use of these web archives and digital collections.

We're looking for a Data Engineer to help us with some of the following:

- Turn researcher Jupyter notebooks into robust systems (these notebooks are mostly in Scala)

- Develop data munging/wrangling/deriving workflows (we use Spark and Temporal.io)

- Help administer a 7.5-petabyte Hadoop cluster

- Potentially write jobs for our main, in-house long-term storage cluster

- There are always APIs that need work (these are mostly in Python)

- ML experience is an interesting bonus

We're fully remote, employees can be based anywhere in US or Canada.

This is a new opening as of Dec 1, so new that we're still working on getting it posted. If interested, please reach out to Alex at avdempsey [at] archive [dot] org.


Does ActivityPub support hierarchical federation? Or could it be grafted on?


I've been wondering about that. I joked not too long ago that Mastodon is the new FidoNet, but it does occur to me that there are probably things Mastodon could learn from FidoNet (or Usenet) if it hasn't learned them already.


The meat and meatspace “replacements” are bridge technologies that reduce the suffering of animals and people respectively. Sometimes a vegan just wants a burger. Sometimes an exec just needs a little help to let go.

Sure, these bridges have plenty of problems today, and maybe they'll lead nowhere. But maybe in the far future they'll take us to cruelty-free, healthy designer tissues that taste better than anything an animal makes, and to Bret Victor's dynamic medium.


This is just like early smartphones, early PCs, and the internet before the 2000s. Yes, there are many flaws, but can’t people see the future potential?


People can't see the future potential because the people actively working on it aren't showing us any future potential. Everything they're showing is worse versions of what we currently have, but "in VR!".


That’s not true. The main reason you’ve written this is that you have yet to try a VR system that wasn’t the equivalent of Google Cardboard.


I used one of the newer Oculus headsets recently. It was fun for a short period of time, but it wasn't so great that I'd spend a considerable amount of time wearing it, because it's so much effort to use.

The discussion here is about using it for meetings and such, and in that case, nothing they've shown us goes beyond "attend a meeting in person, remotely", which I can do in Zoom without needing to wear a heavy, uncomfortable thing on my entire face for most of a day. A number of things about it are actively worse. This is why I'm saying they're just showing us the same thing, in VR. There's nothing to be excited about.


Zoom is terrible for presence, as in the feeling that both of you are in the same place, which VR achieves quite easily even with terrible graphics. That is very hard to demo; you have to experience it yourself for more than one 5-minute session. I strongly doubt you’ve even used a Quest, just based on your comments. It has many problems and flaws, but “too much effort to use” isn’t one of them.


Putting on something that requires full immersion means blocking out time and having physical space available, with all of the necessary equipment. That physical space also needs to be a trusted space, since you're unable to know what's going on around you. With Zoom, I can join from my cell phone, from basically anywhere.

That's too much effort to use.


You can get pretty far if you don't have much in the way of concurrent writes. Concurrent writes can be safe, mind you, but that safety comes from a lock that waiting writers poll at increasing intervals (at least in my experience backing Django with SQLite). If one thread is making many short writes over a long period, another thread will retry the lock after 1 ms, then 2 ms, then 5 ms, and so on, up to 100 ms between polls, and unless a poll happens to land in a moment when the lock is free, it will keep polling until your timeout. That is, out of the box you don't have a fair queue for the write lock. That kind of access pattern doesn't describe all apps, of course! And in WAL mode your reads won't be blocked by these long sequences of writes.
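
For illustration, here's a minimal sketch of raising that timeout under Django (the "timeout" option is documented and is passed straight through to sqlite3.connect). A longer timeout buys a waiting writer more polls; it doesn't make the queue fair:

  # settings.py (sketch): a longer busy timeout for SQLite under Django.
  DATABASES = {
      "default": {
          "ENGINE": "django.db.backends.sqlite3",
          "NAME": "db.sqlite3",
          "OPTIONS": {
              # Seconds; forwarded to sqlite3.connect(). Only extends how
              # long a writer keeps polling; doesn't change the unfairness.
              "timeout": 20,
          },
      },
  }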


I don’t think there is any scheduling. Each connection polls to see if the write lock is available, up to the max busy timeout setting.

The connection polls at these intervals (note the delays are in milliseconds, not seconds):

  static const u8 delays[] = { 1, 2, 5, 10, 15, 20, 25, 25, 25, 50, 50, 100 };

So, if you are using the default 5-second timeout and you try to acquire a lock while an exclusive lock is held, you will sleep 1 ms, then 2 ms, then 5 ms, and so on, topping out at 100 ms between polls. The final sleep is clamped so that the connection gives up (SQLITE_BUSY) once the total wait reaches the timeout.
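
Here's a small Python sketch that mirrors that schedule (an illustration of the clamping logic in SQLite's sqliteDefaultBusyCallback, not the actual C code):

  import itertools

  # SQLite's default busy-handler delays, in milliseconds.
  DELAYS_MS = [1, 2, 5, 10, 15, 20, 25, 25, 25, 50, 50, 100]

  def poll_schedule(timeout_ms: int):
      """Yield the sleep (ms) before each retry until the timeout is spent."""
      total = 0
      for count in itertools.count():
          delay = DELAYS_MS[count] if count < len(DELAYS_MS) else DELAYS_MS[-1]
          if total + delay > timeout_ms:
              delay = timeout_ms - total
              if delay <= 0:
                  return  # give up: the caller sees SQLITE_BUSY
          yield delay
          total += delay

  # With a 5000 ms timeout: 12 quick polls in the first ~328 ms, then
  # roughly one poll every 100 ms until 5 s of waiting have accrued.
  print(len(list(poll_schedule(5000))), sum(poll_schedule(5000)))  # -> 59 5000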

If you have a thread running many fast queries in a loop, it can deny access to another thread that needs the write lock. The waiting thread may get lucky and poll at the exact moment in between the busy thread's locks, but it might not.
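
A hedged sketch of that access pattern with Python's sqlite3 (the table and timings are made up, and whether the second writer actually times out depends on scheduling luck, which is exactly the point):

  import sqlite3
  import threading
  import time

  DB = "busy_demo.db"  # hypothetical scratch database

  conn = sqlite3.connect(DB)
  conn.execute("PRAGMA journal_mode=WAL")
  conn.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")
  conn.commit()
  conn.close()

  def busy_writer():
      # Many short write transactions in a tight loop.
      conn = sqlite3.connect(DB, timeout=5.0)
      for i in range(50_000):
          with conn:  # each iteration briefly takes and releases the lock
              conn.execute("INSERT INTO t VALUES (?)", (i,))
      conn.close()

  def other_writer():
      # A single write that has to win a poll against the loop above.
      conn = sqlite3.connect(DB, timeout=5.0)
      try:
          with conn:
              conn.execute("INSERT INTO t VALUES (-1)")
          print("got the lock")
      except sqlite3.OperationalError as exc:
          print("starved:", exc)  # "database is locked"
      conn.close()

  t = threading.Thread(target=busy_writer)
  t.start()
  time.sleep(0.1)  # let the write loop get going
  other_writer()
  t.join()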


Internet Archive | Software Engineer | Remote | Full-Time | archive.org

How do you build a machine that preserves the ever-changing cultural expression of humanity and make it free for all?

https://archive.org/about/jobs.php

