
Can I be your friend? :)

I'm looking for a job, and just started hacking together a job posts database with data scraped from a few aggregators and a couple of specific companies. Currently it's all just local and for myself, mostly raw data with some basic statistics and keyword similarity heuristics.

I decided to try something new and different and take it into my own hands (instead of hoping LinkedIn shows me the "right job").

I'm frustrated by the sheer volume of clearly duplicate job posts, which often vary only by the "city" (even when they're 100% remote), and by the lack of any basic filtering for programming languages, tech stacks, platforms, years-of-experience requirements, company industry / size / age, etc.
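
To make "duplicate" concrete, this is roughly the kind of keyword-similarity heuristic I've been playing with; the field names and the 0.85 threshold are just illustrative guesses, not anything the aggregators provide:

    // Rough sketch: treat two posts as duplicates if their title+description
    // token sets are nearly identical, ignoring the city entirely.
    // Field names and the 0.85 threshold are assumptions for illustration.
    interface JobPost {
      title: string;
      description: string;
      city?: string; // deliberately unused below
    }

    function tokens(text: string): Set<string> {
      return new Set(
        text
          .toLowerCase()
          .split(/[^a-z0-9+#.]+/) // keeps tokens like "c++", "c#", ".net" intact
          .filter((t) => t.length > 1)
      );
    }

    // Jaccard similarity of the two token sets.
    function similarity(a: JobPost, b: JobPost): number {
      const ta = tokens(`${a.title} ${a.description}`);
      const tb = tokens(`${b.title} ${b.description}`);
      const shared = [...ta].filter((t) => tb.has(t)).length;
      const union = new Set([...ta, ...tb]).size;
      return union === 0 ? 0 : shared / union;
    }

    const isLikelyDuplicate = (a: JobPost, b: JobPost) => similarity(a, b) >= 0.85;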

I'm really curious what others who have been doing this for much longer have learned! How do you store your data? How do you process it? Have you eliminated manual cleansing and grooming? So many questions...




I'm scraping jobs for a specific country that I don't think you're in, but a few things:

- always keep the original data you scraped in as close to its original form as you can (for a Next.js site, for example, I just strip the unnecessary props - but only those I know are useless, see the next point) - there aren't that many job offers out there, text is easy to compress, and storage is cheap. Eventually you will change your mind about what's important, and having the original files to work with lets you re-process them. Kind of like the data vault approach in data engineering. (Rough sketch of this after the list.)

- prepare for the APIs to evolve, and version your stuff. Have stringent validation (like a JSON Schema with enums that doesn't allow extra fields) to detect that evolution, so that you can adapt. This is extra important for portals with complex data models that you might not fully understand at first. (Sketch after the list.)

- understand the key fields - it's often the case that the "posted date" gets auto-bumped every now and then, depending on which package the poster bought. Try going through the buy-a-listing workflow to understand the options. For multi-city listings there is often some way to de-duplicate (like the URL slug having a common prefix, or the data containing a "main" listing). Spending an hour going through the data you scraped goes a long way. (Sketch of the slug trick after the list.)

- a lot of the data comes from manual user input or from imports out of the company's ATS. You will need to normalize those values in some way, and it's not always possible to do that losslessly. A good example is the company name, which tends to change over time or be written in different formats. This is a pretty difficult problem to solve and I don't have any good pointers for it. (A rough, lossy normalization sketch is the last one after the list.)
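
To make the first point concrete, here's a minimal sketch of how archiving a raw listing could look; the __NEXT_DATA__ props I strip and the directory layout are assumptions for illustration, not anything a portal guarantees:

    // Sketch: keep the (almost) original JSON from a Next.js-based portal,
    // stripping only props I'm confident are useless, then gzip it to disk.
    import { gzipSync } from "node:zlib";
    import { mkdirSync, writeFileSync } from "node:fs";
    import { join } from "node:path";

    function archiveListing(portal: string, id: string, nextData: any): void {
      // Drop only what I know I'll never need; everything else stays untouched.
      // buildId/runtimeConfig are just example prop names here.
      const { buildId, runtimeConfig, ...keep } = nextData;

      const dir = join("raw", portal, new Date().toISOString().slice(0, 10));
      mkdirSync(dir, { recursive: true });

      // One gzipped JSON file per listing per day: text compresses well,
      // so keeping everything forever stays cheap.
      writeFileSync(join(dir, `${id}.json.gz`), gzipSync(JSON.stringify(keep)));
    }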
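
For the validation point, this is the level of strictness I mean - additionalProperties: false plus enums, here with Ajv; the fields and enum values are invented, and a real per-portal, per-version schema would be much bigger:

    // Sketch: a strict, versioned per-portal schema. An unknown field or a new
    // enum value fails validation, which is the signal that the API evolved.
    import Ajv from "ajv";

    const listingSchemaV2 = {
      type: "object",
      additionalProperties: false, // reject fields I haven't seen before
      required: ["id", "title", "contractType"],
      properties: {
        id: { type: "string" },
        title: { type: "string" },
        contractType: { enum: ["permanent", "b2b", "internship"] }, // made-up values
        salaryMin: { type: "number" },
        salaryMax: { type: "number" },
      },
    };

    const ajv = new Ajv({ allErrors: true });
    const validateListing = ajv.compile(listingSchemaV2);

    export function checkListing(raw: unknown): boolean {
      if (!validateListing(raw)) {
        // Keep the raw file anyway (previous point) and flag it for a scraper update.
        console.warn("schema drift detected:", validateListing.errors);
        return false;
      }
      return true;
    }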
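
The multi-city de-duplication via a common slug prefix, roughly; the "drop the last slug segment" rule is a heuristic that happens to work on the portals I scrape, nothing more:

    // Sketch: "senior-rust-dev-acme-berlin" and "senior-rust-dev-acme-munich"
    // collapse into one group keyed by the common prefix "senior-rust-dev-acme".
    interface ScrapedListing {
      slug: string;     // e.g. "senior-rust-dev-acme-berlin"
      city: string;
      postedAt: string; // ISO date, beware of auto-bumping
    }

    function slugPrefix(slug: string): string {
      const parts = slug.split("-");
      return parts.length > 1 ? parts.slice(0, -1).join("-") : slug;
    }

    function groupMultiCity(listings: ScrapedListing[]): Map<string, ScrapedListing[]> {
      const groups = new Map<string, ScrapedListing[]>();
      for (const l of listings) {
        const key = slugPrefix(l.slug);
        groups.set(key, [...(groups.get(key) ?? []), l]);
      }
      return groups; // each group = one "real" offer plus its city variants
    }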
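
And for company names, the best I can offer is a lossy normalization along these lines (the suffix list is incomplete on purpose - that part never really ends), which is another reason to keep the original values around:

    // Sketch: lossy company-name normalization. Good enough for grouping,
    // not good enough as a source of truth - store the raw name alongside it.
    const LEGAL_SUFFIXES = /\b(inc|llc|ltd|gmbh|s\.?a\.?|sp\.? z o\.?o\.?|co)\.?$/i;

    function normalizeCompany(name: string): string {
      return name
        .toLowerCase()
        .normalize("NFKD")                // split accented chars into base + combining marks...
        .replace(/[\u0300-\u036f]/g, "")  // ...then drop the accents
        .replace(LEGAL_SUFFIXES, "")      // drop a trailing legal form
        .replace(/[^a-z0-9]+/g, " ")      // collapse punctuation and whitespace
        .trim();
    }

    // normalizeCompany("ACME Sp. z o.o.") === normalizeCompany("Acme") // -> true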

For the most part I keep the data in SQLite databases in a format specific to each job portal; I haven't found a decent way of normalizing across portals, as they have very different ideas about how much information they keep in a structured form. The only processing I do is indexing the values I need and deduplicating multi-location listings and the same offers appearing across many days/weeks/months. This is good enough for my use case, but falls short of what any commercial use would need.
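
Roughly what one of those per-portal databases looks like - the raw blob is the source of truth and the extracted columns exist only to be indexed. Table and column names are illustrative, and better-sqlite3 is just the library I happen to reach for:

    // Sketch of a per-portal SQLite file: raw gzipped JSON plus a few
    // extracted, indexed columns for querying and de-duplication.
    import Database from "better-sqlite3";

    const db = new Database("portal_x.sqlite"); // one file per portal (made-up name)

    db.exec(`
      CREATE TABLE IF NOT EXISTS listings (
        portal_id    TEXT PRIMARY KEY,  -- the portal's own listing id
        first_seen   TEXT NOT NULL,     -- ISO date of the first scrape
        last_seen    TEXT NOT NULL,     -- bumped when the same offer reappears
        slug_prefix  TEXT,              -- for multi-city de-duplication
        company_norm TEXT,              -- lossy normalized company name
        raw_gz       BLOB NOT NULL      -- gzipped original JSON
      );
      CREATE INDEX IF NOT EXISTS idx_slug_prefix  ON listings (slug_prefix);
      CREATE INDEX IF NOT EXISTS idx_company_norm ON listings (company_norm);
    `);

    // Seeing the same offer again on a later day only bumps last_seen, which is
    // how "same offer across many days/weeks/months" collapses into one row.
    const upsert = db.prepare(`
      INSERT INTO listings (portal_id, first_seen, last_seen, slug_prefix, company_norm, raw_gz)
      VALUES (@id, @today, @today, @slugPrefix, @companyNorm, @rawGz)
      ON CONFLICT (portal_id) DO UPDATE SET last_seen = excluded.last_seen
    `);

    // upsert.run({ id, today, slugPrefix, companyNorm, rawGz });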



