It also shows how awful some sites are. I'm scraping a job offer portal written in Next.js (although I use the hydration endpoint), and one of the lovely things they do is pass in props containing every possible skill a candidate can have: they send about 200KB per listing, of which less than 10KB on average is job-specific (actually required) data.
Previously they were rendering it from a REST endpoint which only had the necessary data, but now I get a lovely dictionary of all possible values of various properties, provided on a silver platter.
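For anyone curious, the payload is usually just JSON behind a `/_next/data/<buildId>/...` route (or inside the `__NEXT_DATA__` script tag), so a few lines are enough to pull it and throw the dictionaries away. A rough sketch - the URL, build id, and prop names below are hypothetical placeholders, not the actual portal:

```typescript
// Minimal sketch: fetch a Next.js data route and keep only the listing fields.
// The URL, build id, and prop names are hypothetical placeholders.
type Listing = {
  id: string;
  title: string;
  description: string;
  skills: string[];
};

async function fetchListing(slug: string): Promise<Listing> {
  const res = await fetch(
    `https://jobs.example.com/_next/data/BUILD_ID/offers/${slug}.json`
  );
  const page = await res.json();
  // pageProps carries both the listing and the site-wide dictionaries
  // (all possible skills, cities, benefits, ...) - we simply ignore the latter.
  const offer = page.pageProps.offer;
  return {
    id: offer.id,
    title: offer.title,
    description: offer.description,
    skills: offer.requiredSkills ?? [],
  };
}
```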
Another portal I'm scraping uses GraphQL without any access control, so I get all the data they have on impressions, I know how much somebody paid for a listing, and I know when an offer is going to be auto-bumped - just lovely. And no need to use Playwright either.
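To illustrate what "no access control" means in practice (the field names below are invented, not the portal's actual schema), a plain unauthenticated POST is all it takes:

```typescript
// Hypothetical query - the field names are invented to illustrate the point,
// but an open GraphQL endpoint will return whatever its schema exposes.
const query = `
  query Offer($id: ID!) {
    offer(id: $id) {
      title
      impressions        # analytics the poster paid for
      listingPriceCents  # what the listing cost
      autoBumpAt         # when it gets bumped back to the top
    }
  }
`;

const res = await fetch("https://jobs.example.com/graphql", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ query, variables: { id: "123" } }),
});
console.log(JSON.stringify(await res.json(), null, 2));
```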
However, the entire post is worth reading, because other people contributed some good points as well.
The Mangadex devs seem to have followed some of the advice, and the referenced inefficient JS API call has been made more efficient since my comment, but notably it is still sending many times more data than required.
I once had a discussion with management who thought it was fine not to implement access control for GraphQL because no one knows the endpoint. I explained that it's possible to figure out the endpoint through the Network tab in the browser console. They assumed it was fine because that's too technical.
That's when you stop involving management in technical discussions. As a software engineer, you do what must be done. If access control is a must, management can wait on the less important features. If they are stupid and don't understand despite your explanations, they don't have to know.
Yeah, but don’t get yourself in trouble and be a renegade either.
I do not agree that engineers should avoid technical talks with managers. They should just always be prepared to break it down and provide examples. Look people in the eye, empathize with their sentiments, don’t be an asshole if you can help it, in my humble opinion.
One of the easiest fixes for Next.js page sizes is pruning unused prop data. It's super common, even in relatively professional settings. I think it isn't intuitive that unused data will be sent. The whole tree-shaking and optimization pipeline seems like it should handle stuff like that, but sure enough, it doesn't (for good reason, though).
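For anyone who hasn't hit this before: whatever you return under `props` from `getServerSideProps` (or `getStaticProps`) gets serialized into `__NEXT_DATA__` and shipped with the page, whether the component uses it or not. A rough sketch of the pruning, with illustrative names (the data-layer call is a hypothetical stand-in):

```typescript
// pages/offers/[slug].tsx (illustrative) - everything returned under `props`
// is serialized into __NEXT_DATA__ and shipped with the HTML, used or not.
import type { GetServerSideProps } from "next";

// Hypothetical data-layer call standing in for whatever loads the full record.
async function loadOffer(slug: string): Promise<any> {
  return fetch(`https://api.example.com/offers/${slug}`).then((r) => r.json());
}

export const getServerSideProps: GetServerSideProps = async (ctx) => {
  const offer = await loadOffer(String(ctx.params?.slug));
  return {
    props: {
      // Instead of `props: { offer, allPossibleSkills, allCities }`,
      // pick only what the component actually renders:
      offer: {
        id: offer.id,
        title: offer.title,
        description: offer.description,
        skills: offer.requiredSkills ?? [],
      },
    },
  };
};
```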
The irony is that I discovered this also while scraping a job portal. I wonder if we are scraping the same site lol. I am building a job board for tech jobs, but with very granular filters.
How do you plan to implement very granular filters when the data is rarely complete or neatly organised?
I had this discussion with my friend who runs ArbeitNow. It's hard to create good filters when the data is just paragraphs of text. It's hard to augment these listings when the data is not reliable.
Won't comment on this now, but will follow up once it launches. I've been working on this for 3 months now, so it isn't a simple aggregator. There is a lot of data enrichment happening by collecting auxiliary data from LinkedIn, Crunchbase, GitHub, etc.
Interesting. I was initially intending to do the same, but now I just use the project as an OSINT source for research on various companies; if I or any of my friends are looking for a job, they get a dump of everything those companies posted on popular portals in the last 3 years or so.
I'm looking for a job, and just started hacking together a job posts database with data scraped from a few aggregators and a couple of specific companies. Currently it's all just local and for myself, mostly raw data with some basic statistics and keyword similarity heuristics.
I decided to try something new and different and take it into my own hands (instead of hoping LinkedIn shows me the "right job").
I'm frustrated by the sheer volume of clearly duplicate job posts, which often vary only by "city" (even if they're 100% remote), and the lack of any kind of basic filtering for programming languages, tech stacks, platforms, years-of-experience requirements, company industry / size / age, etc.
I'm really curious what others who have been doing this for much longer have learned! How do you store your data? How do you process it? Have you eliminated manual cleansing and grooming? So many questions...
I'm scraping jobs for a specific country I don't think you're in, but a couple of things:
- always keep the original data you scraped in as close to its original form as you can (in the Next.js example, I just strip the unnecessary props - but only those I know are useless, see the next point) - there aren't that many job offers out there, text is easy to compress, and storage is cheap. Eventually you will change your mind about what's important, and having the original files allows you to re-process them. Kind of like the data vault approach in data engineering.
- prepare for the APIs to evolve, and version your stuff. Have stringent validation (like JSON Schema with enums and no extra fields allowed) to detect this evolution, so that you can adapt - see the sketch below the list. This is extra important for portals with complex data models that you might not fully understand at first.
- understand the key fields - it's often the case that the "posted date" is auto-bumped every now and then, depending on which package the poster bought. Try going through the buy-a-listing workflow to understand the options. For multi-city listings there is often some way of de-duplicating (like the URL slug having a common prefix, or the data containing a "main" listing). Spending an hour going through the data you scraped goes a long way.
- a lot of the data comes from manual user input or from some imports from the company's ATS. You will need to normalize those values in some way, and it's not always possible to do it in a lossless way. A good example is company name, which tends to change over time or be used in different formats. This is a pretty difficult problem to solve and I don't have any good pointers for this.
For the most part I keep the data in SQLite databases in a format specific to each job portal; I haven't found a decent way of normalizing the data, as the portals have very different ideas about how much information they keep in a structured form. The only processing I do is indexing the values I need and deduplicating multi-location listings and the same offers appearing across many days/weeks/months. This is good enough for my use case, but falls short of making it useful for any commercial use.
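To make the validation point above concrete, here is a minimal sketch using Ajv, with made-up field names and enum values; `additionalProperties: false` plus enums means a new field or a new value fails loudly instead of silently drifting into the database:

```typescript
// Minimal sketch of strict validation for scraped listings using Ajv.
// Field names and enum values are illustrative, not a real portal's schema.
import Ajv from "ajv";

const listingSchema = {
  type: "object",
  additionalProperties: false, // any field the portal adds later fails validation
  required: ["id", "title", "contractType", "postedAt"],
  properties: {
    id: { type: "string" },
    title: { type: "string" },
    contractType: { enum: ["b2b", "permanent", "contract"] }, // new values fail too
    postedAt: { type: "string" },
    salaryMin: { type: "number" },
    salaryMax: { type: "number" },
  },
};

const ajv = new Ajv();
const validateListing = ajv.compile(listingSchema);

export function checkListing(raw: unknown): void {
  if (!validateListing(raw)) {
    // Schema drift detected: log it and go look at the raw payload you kept.
    throw new Error(
      `listing failed validation: ${ajv.errorsText(validateListing.errors)}`
    );
  }
}
```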
That's not quite right: if you have a single-person company (with unlimited personal liability) you can choose between revenue and income tax. Both are flat rates - revenue tax is 12% for certain groups of companies, including software engineering with revenue below several million euro, and income tax is 19% universally.
My main use case for Postman-like tools is to get me into the middle of a workflow to reproduce some sort of scenario. Stateless tools just don't do it, as I need to spend a long time copying things around.
It allows arbitrary scripts to save the context of requests. Hardcoded examples you can't change without committing a change to the contract aren't even close to that functionality.
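Concretely, a Postman test script can stash values from one response into variables that later requests in the same workflow reference (the endpoint and field names below are made up):

```typescript
// Postman "Tests" script (runs after the response): stash values from a login
// response so later requests in the workflow can reference them as
// {{authToken}} / {{orderId}}. Field names here are hypothetical.
declare const pm: any; // provided by the Postman sandbox at runtime

const body = pm.response.json();
pm.environment.set("authToken", body.token);
pm.environment.set("orderId", body.order && body.order.id);
```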
I think it's fair to criticize them given a fair number of alternatives in this thread (with their own shortcomings). It's kind of silly to post something to hackernews and not expect criticism.
And if JMeter just isn't quite doing it for you, turn the dial back another few years and experience the joy of the MDI GUI provided by SoapUI. You too can experience the thrills of getting to tweak JVM memory options so it doesn't crash when you try to load a particularly large and unwieldy industry-specific WSDL for an interface standard that undergirds major national infrastructure.
Oh and, it actually does do OpenAPI specs for REST and even mock services and all that, so I'm delighted to share that it's actually relevant to this post! :^) What are you waiting for?
Your Scratch Pad data is safe and accessible.
It can be Migrated to a new Workspace, once you create an account, using the 'Cog Icon > Settings > Data > Migrate data' menu option.
Alternatively, it can be Exported from the Lightweight API Client (signed-out version) using the 'Cog Icon > Settings > Data > Export data' menu option.
I'm assuming you work for Postman. I hope you know that because of this, as well as the account requirement, other REST tools are going to eat your lunch. I don't think this was a good move.
Ah, I follow. I was confused because a small, open-source extension should be sufficient to handle sending test requests. Sounds like they're doing something much bigger than that.
POSTman started as an extension; hence my confusion. I was wondering if there was a technical issue I was unaware of.
It falls apart instantly when you need to pass data from one endpoint to another or add any sort of logic like filtering through data - so any time you have non-trivial workloads where you don't want to spend half your time fighting against jq or shell.
`curl ... > out.json` then `curl ... -d @out.json`. Wrap it in a shell script for quick iterations.
> filtering through data
`curl ... | jq . | grep`. I don't know of any tool that will find what I'm after faster than the shell.
For bootstrapping and quick experiments curl is right there at my fingertips, with no need to spin up an Electron app, make a bunch of definitions and all that. When I want something more usable, OpenAPI serves as a stateful, interactive test environment and documentation at the same time.
I do agree curl can get a little verbose, but create an alias: `alias jc='curl -H "Content-Type: application/json"'` and using it is as simple as `jc $URL` for GET or `jc -X POST -d '{ ... }' $URL` for the rest of the methods.
I really recommend getting comfortable in the shell; it's amazing how productive it can be, and it's becoming a bit of a lost art these days. All the tools are composable, and when they work together it's so zen.
You don't have to fight tools if you learn them, but I understand, as a fellow programmer, that you don't always have time to learn them. However, it's pretty easy to use pipes and tools like jq to do complex stuff.
While jq is powerful and I use it in scripts, it's one of the least intuitive languages I use, to the point I have to look up basically everything non-trivial.
It's not a bug to release a version which removed a massive chunk of functionality. It's intentionally fucking over their users who are not willing to create an account.