It also shows how awful some sites are. I'm scraping a job offer portal written in Next.js (although I use the hydration endpoint), and one of the lovely things they do is pass in props containing every possible skill a candidate can have: they send about 200KB per listing, of which less than 10KB on average is job-specific (actually required) data.
Previously they were rendering it from a REST endpoint which only had the necessary data, but now I get a lovely dictionary of all possible values of various properties, provided on a silver platter.
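For anyone curious, the payload is usually just JSON behind a `/_next/data/<buildId>/...` route (or inside the `__NEXT_DATA__` script tag), so a few lines are enough to pull it and throw the dictionaries away. A rough sketch - the URL, build id, and prop names below are hypothetical placeholders, not the actual portal:

```typescript
// Minimal sketch: fetch a Next.js data route and keep only the listing fields.
// The URL, build id, and prop names are hypothetical placeholders.
type Listing = {
  id: string;
  title: string;
  description: string;
  skills: string[];
};

async function fetchListing(slug: string): Promise<Listing> {
  const res = await fetch(
    `https://jobs.example.com/_next/data/BUILD_ID/offers/${slug}.json`
  );
  const page = await res.json();
  // pageProps carries both the listing and the site-wide dictionaries
  // (all possible skills, cities, benefits, ...) - we simply ignore the latter.
  const offer = page.pageProps.offer;
  return {
    id: offer.id,
    title: offer.title,
    description: offer.description,
    skills: offer.requiredSkills ?? [],
  };
}
```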
Another portal I'm scraping uses GraphQL without any access control, so I get all the data they have on impressions, I know how much somebody paid for a listing, and I know when an offer is going to be auto-bumped - just lovely. And no need to use Playwright either.
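To illustrate what "no access control" means in practice (the field names below are invented, not the portal's actual schema), a plain unauthenticated POST is all it takes:

```typescript
// Hypothetical query - the field names are invented to illustrate the point,
// but an open GraphQL endpoint will return whatever its schema exposes.
const query = `
  query Offer($id: ID!) {
    offer(id: $id) {
      title
      impressions        # analytics the poster paid for
      listingPriceCents  # what the listing cost
      autoBumpAt         # when it gets bumped back to the top
    }
  }
`;

const res = await fetch("https://jobs.example.com/graphql", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ query, variables: { id: "123" } }),
});
console.log(JSON.stringify(await res.json(), null, 2));
```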
However, the entire post is worth reading, because other people contributed some good points as well.
The Mangadex devs seem to have followed some of the advice, and the referenced inefficient JS API call has been made more efficient since my comment, but notably it is still sending many times more data than required.
I once had a discussion with management who thought it was fine not to implement access control for GraphQL because no one knows the endpoint. I explained that it's possible to figure out the endpoint through the Network tab in the browser console. They assumed it was fine because that's too technical.
That's when you stop involving management in technical discussions. As a software engineer, you do what must be done. If access control is a must, management can wait on the less important features. If they are stupid and don't understand despite your explanations, they don't have to know.
Yeah, but don’t get yourself in trouble and be a renegade either.
I do not agree that engineers should avoid technical talks with managers. They should just always be prepared to break it down and provide examples. Look people in the eye, empathize with their sentiments, don’t be an asshole if you can help it, in my humble opinion.
One of the easiest fixes for Next.js page sizes is pruning unused prop data. It's super common, even in relatively professional settings. I think it isn't intuitive that unused data will be sent. The whole tree-shaking and optimization pipeline seems like it should handle stuff like that, but sure enough, it doesn't (for good reason, though).
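For anyone who hasn't hit this before: whatever you return under `props` from `getServerSideProps` (or `getStaticProps`) gets serialized into `__NEXT_DATA__` and shipped with the page, whether the component uses it or not. A rough sketch of the pruning, with illustrative names (the data-layer call is a hypothetical stand-in):

```typescript
// pages/offers/[slug].tsx (illustrative) - everything returned under `props`
// is serialized into __NEXT_DATA__ and shipped with the HTML, used or not.
import type { GetServerSideProps } from "next";

// Hypothetical data-layer call standing in for whatever loads the full record.
async function loadOffer(slug: string): Promise<any> {
  return fetch(`https://api.example.com/offers/${slug}`).then((r) => r.json());
}

export const getServerSideProps: GetServerSideProps = async (ctx) => {
  const offer = await loadOffer(String(ctx.params?.slug));
  return {
    props: {
      // Instead of `props: { offer, allPossibleSkills, allCities }`,
      // pick only what the component actually renders:
      offer: {
        id: offer.id,
        title: offer.title,
        description: offer.description,
        skills: offer.requiredSkills ?? [],
      },
    },
  };
};
```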
The irony is that I discovered this also while scraping a job portal. I wonder if we are scraping the same site lol. I am building a job board for tech jobs, but with very granular filters.
How do you plan to implement very granular filters when the data is rarely complete or neatly organised?
I had this discussion with my friend who runs ArbeitNow. It's hard to create good filters when the data is just paragraphs of text. It's hard to augment these listings when the data is not reliable.
Won't comment on this now, but will follow up once it launches. I've been working on this for 3 months now, so it isn't a simple aggregator. There is a lot of data enrichment happening by collecting auxiliary data from LinkedIn, Crunchbase, GitHub, etc.
Interesting. I was initially intending to do the same, but now I just use the project as an OSINT source for research on various companies; if I or any of my friends are looking for a job, they get a dump of everything those companies posted on popular portals in the last 3 years or so.
I'm looking for a job, and just started hacking together a job posts database with data scraped from a few aggregators and a couple of specific companies. Currently it's all just local and for myself, mostly raw data with some basic statistics and keyword similarity heuristics.
I decided to try something new and different and take it into my own hands (instead of hoping LinkedIn shows me the "right job").
I'm frustrated by the sheer volume of clearly duplicate job posts, which often vary only by "city" (even if they're 100% remote), and the lack of any kind of basic filtering for programming languages, tech stacks, platforms, years-of-experience requirements, company industry / size / age, etc.
I'm really curious what others who have been doing this for much longer have learned! How do you store your data? How do you process it? Have you eliminated manual cleansing and grooming? So many questions...
I'm scraping jobs for a specific country I don't think you're in, but a couple of things:
- always keep the original data you scraped in as close to its original form as you can (in the Next.js example, I just strip the unnecessary props - but only those I know are useless, see the next point) - there aren't that many job offers out there, text is easy to compress, and storage is cheap. Eventually you will change your mind about what's important, and having the original files allows you to re-process them. Kind of like the data vault approach in data engineering.
- prepare for the APIs to evolve, and version your stuff. Have stringent validation (like JSON Schema with enums and no extra fields allowed) to detect this evolution, so that you can adapt - see the sketch below the list. This is extra important for portals with complex data models that you might not fully understand at first.
- understand the key fields - it's often the case that the "posted date" is auto-bumped every now and then, depending on which package the poster bought. Try going through the buy-a-listing workflow to understand the options. For multi-city listings there is often some way of de-duplicating (like the URL slug having a common prefix, or the data containing a "main" listing). Spending an hour going through the data you scraped goes a long way.
- a lot of the data comes from manual user input or from some imports from the company's ATS. You will need to normalize those values in some way, and it's not always possible to do it in a lossless way. A good example is company name, which tends to change over time or be used in different formats. This is a pretty difficult problem to solve and I don't have any good pointers for this.
For the most part I keep the data in SQLite databases in a format specific to each job portal; I haven't found a decent way of normalizing the data, as the portals have very different ideas about how much information they keep in a structured form. The only processing I do is indexing the values I need and deduplicating multi-location listings and the same offers appearing across many days/weeks/months. This is good enough for my use case, but falls short of making it useful for any commercial use.
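To make the validation point above concrete, here is a minimal sketch using Ajv, with made-up field names and enum values; `additionalProperties: false` plus enums means a new field or a new value fails loudly instead of silently drifting into the database:

```typescript
// Minimal sketch of strict validation for scraped listings using Ajv.
// Field names and enum values are illustrative, not a real portal's schema.
import Ajv from "ajv";

const listingSchema = {
  type: "object",
  additionalProperties: false, // any field the portal adds later fails validation
  required: ["id", "title", "contractType", "postedAt"],
  properties: {
    id: { type: "string" },
    title: { type: "string" },
    contractType: { enum: ["b2b", "permanent", "contract"] }, // new values fail too
    postedAt: { type: "string" },
    salaryMin: { type: "number" },
    salaryMax: { type: "number" },
  },
};

const ajv = new Ajv();
const validateListing = ajv.compile(listingSchema);

export function checkListing(raw: unknown): void {
  if (!validateListing(raw)) {
    // Schema drift detected: log it and go look at the raw payload you kept.
    throw new Error(
      `listing failed validation: ${ajv.errorsText(validateListing.errors)}`
    );
  }
}
```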
That's not quite right: if you have a single-person company (with unlimited personal liability) you can choose between revenue and income tax. Both are flat rates - revenue tax is 12% for certain groups of companies, including software engineering with revenue below several million euro, and income tax is 19% universally.
My main use case for Postman-like tools is to get me into the middle of a workflow to reproduce some sort of scenario. Stateless tools just don't do it, as I need to spend a long time copying things around.
It allows arbitrary scripts to save the context of requests. Hardcoded examples you can't change without committing a change to the contract aren't even close to that functionality.
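Concretely, a Postman test script can stash values from one response into variables that later requests in the same workflow reference (the endpoint and field names below are made up):

```typescript
// Postman "Tests" script (runs after the response): stash values from a login
// response so later requests in the workflow can reference them as
// {{authToken}} / {{orderId}}. Field names here are hypothetical.
declare const pm: any; // provided by the Postman sandbox at runtime

const body = pm.response.json();
pm.environment.set("authToken", body.token);
pm.environment.set("orderId", body.order && body.order.id);
```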
I think it's fair to criticize them given a fair number of alternatives in this thread (with their own shortcomings). It's kind of silly to post something to hackernews and not expect criticism.
And if JMeter just isn't quite doing it for you, turn the dial back another few years and experience the joy of the MDI GUI provided by SoapUI. You too can experience the thrills of getting to tweak JVM memory options so it doesn't crash when you try to load a particularly large and unwieldy industry-specific WSDL for an interface standard that undergirds major national infrastructure.
Oh and, it actually does do OpenAPI specs for REST and even mock services and all that, so I'm delighted to share that it's actually relevant to this post! :^) What are you waiting for?
Your Scratch Pad data is safe and accessible.
It can be Migrated to a new Workspace, once you create an account, using the 'Cog Icon > Settings > Data > Migrate data' menu option.
Alternatively, it can be Exported from the Lightweight API Client (signed-out version) using the 'Cog Icon > Settings > Data > Export data' menu option.
I'm assuming you work for Postman. I hope you know that because of this, as well as the account requirement, other REST tools are going to eat your lunch. I don't think this was a good move.
Ah, I follow. I was confused because a small, open-source extension should be sufficient to handle sending test requests. Sounds like they're doing something much bigger than that.
POSTman started as an extension; hence my confusion. I was wondering if there was a technical issue I was unaware of.
It falls apart instantly when you need to pass data from one endpoint to another or add any sort of logic like filtering through data - so any time you have non-trivial workloads where you don't want to spend half your time fighting against jq or shell.
`curl ... > out.json` then `curl ... -d @out.json`. Wrap it in a shell script for quick iterations.
> filtering through data
`curl ... | jq . | grep`. I don't know of any tool that will find what I'm after faster than the shell.
For bootstrapping and quick experiments curl is right there at my fingertips, with no need to spin up an Electron app, make a bunch of definitions and all that. When I want something more usable, OpenAPI serves as a stateful, interactive test environment and documentation at the same time.
I do agree curl can get a little verbose, but create an alias: `alias jc='curl -H "Content-Type: application/json"'` and using it is as simple as `jc $URL` for GET or `jc -X POST -d '{ ... }' $URL` for the rest of the methods.
I really recommend getting comfortable in the shell; it's amazing how productive it can be, and it's becoming a bit of a lost art these days. All the tools are composable, and when they work together it's so zen.
You don't have to fight tools if you learn them, but I understand, as a fellow programmer, that you don't always have time to learn them. However, it's pretty easy to use pipes and tools like jq to do complex stuff.
While jq is powerful and I use it in scripts, it's one of the least intuitive languages I use, to the point I have to look up basically everything non-trivial.
It's not a bug to release a version which removed a massive chunk of functionality. It's intentionally fucking over their users who are not willing to create an account.