
Can I be your friend? :)

I'm looking for a job, and just started hacking together a job posts database with data scraped from a few aggregators and a couple of specific companies. Currently it's all just local and for myself, mostly raw data with some basic statistics and keyword similarity heuristics.

I decided to try something new and different and take it into my own hands (instead of hoping LinkedIn shows me the "right job").

I'm frustrated by the sheer volume of clearly duplicate job posts, which often vary only by the "city" (even when they're 100% remote), and by the lack of any basic filtering for programming languages, tech stacks, platforms, years-of-experience requirements, company industry / size / age, etc.
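
To make "duplicate" concrete, this is roughly the kind of keyword-similarity heuristic I've been playing with; the field names and the 0.85 threshold are just illustrative guesses, not anything the aggregators provide:

    // Rough sketch: treat two posts as duplicates if their title+description
    // token sets are nearly identical, ignoring the city entirely.
    // Field names and the 0.85 threshold are assumptions for illustration.
    interface JobPost {
      title: string;
      description: string;
      city?: string; // deliberately unused below
    }

    function tokens(text: string): Set<string> {
      return new Set(
        text
          .toLowerCase()
          .split(/[^a-z0-9+#.]+/) // keeps tokens like "c++", "c#", ".net" intact
          .filter((t) => t.length > 1)
      );
    }

    // Jaccard similarity of the two token sets.
    function similarity(a: JobPost, b: JobPost): number {
      const ta = tokens(`${a.title} ${a.description}`);
      const tb = tokens(`${b.title} ${b.description}`);
      const shared = [...ta].filter((t) => tb.has(t)).length;
      const union = new Set([...ta, ...tb]).size;
      return union === 0 ? 0 : shared / union;
    }

    const isLikelyDuplicate = (a: JobPost, b: JobPost) => similarity(a, b) >= 0.85;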

I'm really curious what others who have been doing this for much longer have learned! How do you store your data? How do you process it? Have you eliminated manual cleansing and grooming? So many questions...




I'm scraping jobs for a specific country that I don't think you're in, but a few things:

- always keep the original data you scraped in as close to its original form as you can (for a Next.js site, for example, I just strip the unnecessary props - but only those I know are useless, see the next point) - there aren't that many job offers out there, text is easy to compress, and storage is cheap. Eventually you will change your mind about what's important, and having the original files to work with lets you re-process them. Kind of like the data vault approach in data engineering. (Rough sketch of this after the list.)

- prepare for the APIs to evolve, and version your stuff. Have stringent validation (like a JSON Schema with enums that doesn't allow extra fields) to detect that evolution, so that you can adapt. This is extra important for portals with complex data models that you might not fully understand at first. (Sketch after the list.)

- understand the key fields - it's often the case that the "posted date" gets auto-bumped every now and then, depending on which package the poster bought. Try going through the buy-a-listing workflow to understand the options. For multi-city listings there is often some way to de-duplicate (like the URL slug having a common prefix, or the data containing a "main" listing). Spending an hour going through the data you scraped goes a long way. (Sketch of the slug trick after the list.)

- a lot of the data comes from manual user input or from imports out of the company's ATS. You will need to normalize those values in some way, and it's not always possible to do that losslessly. A good example is the company name, which tends to change over time or be written in different formats. This is a pretty difficult problem to solve and I don't have any good pointers for it. (A rough, lossy normalization sketch is the last one after the list.)
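
To make the first point concrete, here's a minimal sketch of how archiving a raw listing could look; the __NEXT_DATA__ props I strip and the directory layout are assumptions for illustration, not anything a portal guarantees:

    // Sketch: keep the (almost) original JSON from a Next.js-based portal,
    // stripping only props I'm confident are useless, then gzip it to disk.
    import { gzipSync } from "node:zlib";
    import { mkdirSync, writeFileSync } from "node:fs";
    import { join } from "node:path";

    function archiveListing(portal: string, id: string, nextData: any): void {
      // Drop only what I know I'll never need; everything else stays untouched.
      // buildId/runtimeConfig are just example prop names here.
      const { buildId, runtimeConfig, ...keep } = nextData;

      const dir = join("raw", portal, new Date().toISOString().slice(0, 10));
      mkdirSync(dir, { recursive: true });

      // One gzipped JSON file per listing per day: text compresses well,
      // so keeping everything forever stays cheap.
      writeFileSync(join(dir, `${id}.json.gz`), gzipSync(JSON.stringify(keep)));
    }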
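
For the validation point, this is the level of strictness I mean - additionalProperties: false plus enums, here with Ajv; the fields and enum values are invented, and a real per-portal, per-version schema would be much bigger:

    // Sketch: a strict, versioned per-portal schema. An unknown field or a new
    // enum value fails validation, which is the signal that the API evolved.
    import Ajv from "ajv";

    const listingSchemaV2 = {
      type: "object",
      additionalProperties: false, // reject fields I haven't seen before
      required: ["id", "title", "contractType"],
      properties: {
        id: { type: "string" },
        title: { type: "string" },
        contractType: { enum: ["permanent", "b2b", "internship"] }, // made-up values
        salaryMin: { type: "number" },
        salaryMax: { type: "number" },
      },
    };

    const ajv = new Ajv({ allErrors: true });
    const validateListing = ajv.compile(listingSchemaV2);

    export function checkListing(raw: unknown): boolean {
      if (!validateListing(raw)) {
        // Keep the raw file anyway (previous point) and flag it for a scraper update.
        console.warn("schema drift detected:", validateListing.errors);
        return false;
      }
      return true;
    }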
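
The multi-city de-duplication via a common slug prefix, roughly; the "drop the last slug segment" rule is a heuristic that happens to work on the portals I scrape, nothing more:

    // Sketch: "senior-rust-dev-acme-berlin" and "senior-rust-dev-acme-munich"
    // collapse into one group keyed by the common prefix "senior-rust-dev-acme".
    interface ScrapedListing {
      slug: string;     // e.g. "senior-rust-dev-acme-berlin"
      city: string;
      postedAt: string; // ISO date, beware of auto-bumping
    }

    function slugPrefix(slug: string): string {
      const parts = slug.split("-");
      return parts.length > 1 ? parts.slice(0, -1).join("-") : slug;
    }

    function groupMultiCity(listings: ScrapedListing[]): Map<string, ScrapedListing[]> {
      const groups = new Map<string, ScrapedListing[]>();
      for (const l of listings) {
        const key = slugPrefix(l.slug);
        groups.set(key, [...(groups.get(key) ?? []), l]);
      }
      return groups; // each group = one "real" offer plus its city variants
    }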
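
And for company names, the best I can offer is a lossy normalization along these lines (the suffix list is incomplete on purpose - that part never really ends), which is another reason to keep the original values around:

    // Sketch: lossy company-name normalization. Good enough for grouping,
    // not good enough as a source of truth - store the raw name alongside it.
    const LEGAL_SUFFIXES = /\b(inc|llc|ltd|gmbh|s\.?a\.?|sp\.? z o\.?o\.?|co)\.?$/i;

    function normalizeCompany(name: string): string {
      return name
        .toLowerCase()
        .normalize("NFKD")                // split accented chars into base + combining marks...
        .replace(/[\u0300-\u036f]/g, "")  // ...then drop the accents
        .replace(LEGAL_SUFFIXES, "")      // drop a trailing legal form
        .replace(/[^a-z0-9]+/g, " ")      // collapse punctuation and whitespace
        .trim();
    }

    // normalizeCompany("ACME Sp. z o.o.") === normalizeCompany("Acme") // -> true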

For the most part I keep the data in SQLite databases in a format specific to each job portal; I haven't found a decent way of normalizing across portals, as they have very different ideas about how much information they keep in a structured form. The only processing I do is indexing the values I need and deduplicating multi-location listings and the same offers appearing across many days/weeks/months. This is good enough for my use case, but falls short of what any commercial use would need.
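
Roughly what one of those per-portal databases looks like - the raw blob is the source of truth and the extracted columns exist only to be indexed. Table and column names are illustrative, and better-sqlite3 is just the library I happen to reach for:

    // Sketch of a per-portal SQLite file: raw gzipped JSON plus a few
    // extracted, indexed columns for querying and de-duplication.
    import Database from "better-sqlite3";

    const db = new Database("portal_x.sqlite"); // one file per portal (made-up name)

    db.exec(`
      CREATE TABLE IF NOT EXISTS listings (
        portal_id    TEXT PRIMARY KEY,  -- the portal's own listing id
        first_seen   TEXT NOT NULL,     -- ISO date of the first scrape
        last_seen    TEXT NOT NULL,     -- bumped when the same offer reappears
        slug_prefix  TEXT,              -- for multi-city de-duplication
        company_norm TEXT,              -- lossy normalized company name
        raw_gz       BLOB NOT NULL      -- gzipped original JSON
      );
      CREATE INDEX IF NOT EXISTS idx_slug_prefix  ON listings (slug_prefix);
      CREATE INDEX IF NOT EXISTS idx_company_norm ON listings (company_norm);
    `);

    // Seeing the same offer again on a later day only bumps last_seen, which is
    // how "same offer across many days/weeks/months" collapses into one row.
    const upsert = db.prepare(`
      INSERT INTO listings (portal_id, first_seen, last_seen, slug_prefix, company_norm, raw_gz)
      VALUES (@id, @today, @today, @slugPrefix, @companyNorm, @rawGz)
      ON CONFLICT (portal_id) DO UPDATE SET last_seen = excluded.last_seen
    `);

    // upsert.run({ id, today, slugPrefix, companyNorm, rawGz });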



