
"We promise to limit our revenue"


What a strong ad for Java


2011 for Java is nothing. I've used a JDBC driver written for Java 1.4 (2002) in Java 17, and I'm absolutely sure it'll work with Java 21 just as well.

Java backwards compatibility is real and it works absolutely fine unless you do bad things.


I think the last breaking change I remember was enumerations being added, which broke any code that used the new 'enum' keyword as a variable name. But I could be wrong; it was almost 20 years ago.


With Java 11 they hid a lot of internal functions that people had used in their code, which broke things. But those were never really part of the public API, so strictly speaking it wasn't a breaking change.

I remember moving to Java 8 changed the iteration order of HashMaps etc., which also broke some stuff for us. But again, that was mostly our fault for relying on unspecified behavior.
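For what it's worth, a minimal sketch of that pitfall (keys and values are made up): HashMap's iteration order is unspecified and has changed between JDK releases, so code that needs a stable order should say so explicitly, e.g. with LinkedHashMap:

  import java.util.HashMap;
  import java.util.LinkedHashMap;
  import java.util.Map;

  public class IterationOrder {
    public static void main(String[] args) {
      Map<String, Integer> fragile = new HashMap<>();       // order unspecified, may differ per JDK
      Map<String, Integer> stable  = new LinkedHashMap<>(); // insertion order guaranteed
      for (String key : new String[] {"b", "a", "c"}) {
        fragile.put(key, 1);
        stable.put(key, 1);
      }
      System.out.println(fragile.keySet()); // whatever the hash layout happens to be
      System.out.println(stable.keySet());  // always [b, a, c]
    }
  }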


With Java 11 they only hid it; you just need to pass some parameters to get it working. I believe it only truly broke in Java 21, which finally removed some crutches people had been using to work around the new limitations.
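For example (assuming the old code pokes at JDK internals via reflection; the jar name is just a placeholder), flags along these lines usually keep it running on newer JDKs:

  java --add-opens java.base/java.lang=ALL-UNNAMED -jar legacy-app.jar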


Maybe not part of the language per se, but they threw out CORBA in Java 11, and this decimated some very old libraries that used it as a dependency.


Thread#stop no longer works as of at least Java 21.


True, I think since Java 17 they have been removing really obsolete stuff. I think that's when they removed CORBA from the JDK as well (though you can still get it as a library, I believe). Same with Nashorn (the JS runtime).


Personally I consider 1.4.2 the point where Java got stable and had proper IO (java.nio), along with a decent JIT (HotSpot).


Phew, that was around the internet Java applets phase, if I recall correctly?


Back then it was all about JSP/XML and stuff; J2EE too. JDO came soon after. The applets did exist and (javax.)swing was a thing - but it was far from the focus... and of course Java was still considered slow.


Java's maintained almost perfect backward compatibility. If you want code you wrote today to run in 10 more years, Java is probably the best choice. Most other languages have too much of a history of breaking changes, or if you pick C/C++, you'll have issues linking against an old UI library.


There are many other languages with a better Lindy effect rating (https://en.wikipedia.org/wiki/Lindy_effect): Common Lisp, Erlang, Fortran, and more.

I don't doubt that Java is doing well, but other languages have it beat.


Except they're not in widespread usage, so aren't relevant.

No grads are going "gee, do I go with a .NET shop, a JVM shop, or a BEAM shop?"

And as for Common Lisp, which implementation? They can't even be compatible amongst themselves, so I'm dubious that SBCL CL from 13 years ago would just work.


> And as for Common Lisp, which implementation? They can't even be compatible amongst themselves, so I'm dubious that SBCL CL from 13 years ago would just work.

For a bit more solid example:

https://web.archive.org/web/20150217111426/http://www.inform...

This too will likely end up being downvoted for talking against the hivemind here.


> Except they're not in widespread usage, so aren't relevant.

I don't know any Java programmers either. Apparently people use Java; I just don't know any who do it willingly. I'm sure they -must- use it, but Erlang/Fortran/etc. are as mythical to you as Java is to me.

> No grads are going "gee, do I go with a .NET shop, a JVM shop, or a BEAM shop?"

Fresh grads don't know better; they take whatever pays. Businesses choose the cheapest (not the best) option, and grads don't have the experience to choose better. Grads' choices are not a measure of quality or desirability.

> And as for Common Lisp, which implementation? They can't even be compatible amongst themselves, so I'm dubious that SBCL CL from 13 years ago would just work.

My SBCL code from 2010ish works. I've patched it up and improved it over time, but I haven't tried to run the original; it probably works too. It's been through svn to git so I've lost the history, and all of this is as anecdotal as anything else. Previous Lisp code that I wrote was trivial and ran out of the box on SBCL; however, that code is very self-contained and not 'networked' or 'modern'.

My Erlang code, however, has been running in a cluster since the early 2000s. It has also been through several releases for additional features, but I no longer have access to that code, so I can't validate what has been done to it for the last decade.

I like your arguments, I just don't think we're coming from the same historical viewpoint.


The Lindy effect assumes you have no other information about the thing you're estimating. Obviously, if I know the thing is on its deathbed, I can't invoke the Lindy effect. Similarly, a hundred-year-old language that only one person uses any more isn't likely to last another hundred years.

Given the relative sizes, I wouldn't bet on Common Lisp outlasting Java just because it's older.


It's likely we will both be dead before either of them is no longer being maintained.


Yeah, have you tried actually running a Common Lisp program from decades ago on a different implementation from the one it was developed on?


> common lisp, erlang, fortran

That's a pretty esoteric list. JavaScript would have been a better choice because of widespread deployment by multiple vendors and a heavy legacy.


JavaScript is still pretty young (comparatively); however, I do believe you'd be right in saying that it's likely to be around for a very long time.


Interestingly, JavaScript is just as old as Java, and Python is older.


Python kinda dies with each major release though; there are no backwards compatibility goals.


Seriously. Remember applets? My first real Java app was a small 3D Pong game using AWT, and it still works using appletviewer on my M1 MacBook Pro. I mean, it's circa 1998.


I found an old applet of mine from the late '90s, and tried running it. Crashed with a NullPointerException from deep inside AWT. My guess is that there is now some setup that needs to be done that didn't at the time.


That should be the norm.

The reality is that this is a strong ad against almost all modern frameworks, which may live for as little as a football season.


Not really. Not automatically, anyway. Realistically, code lasting forever is, the majority of the time, some engineer’s nerdy wet dream almost completely devoid of any real-world requirements. “This code should last 20 years” should, for most people, be fairly low on the list of desires for a technology stack. In the vast majority of cases, the processes that the software seeks to automate will have been thrown out LONG before then. The business went bust and the only surviving copy is on some developer’s personal computer. Darlene from accounting left and her replacement likes to do things differently, so all this custom stuff was replaced with something off-the-shelf. Your $40B unicorn dating network very unceremoniously fell from the charts after Gen Z decided to throw their phones away and connect in person like we used to. After all that, you’re left there holding a perfectly functional(?) solution to a problem nobody is asking to be solved anymore.

Let’s be clear: I know that banks run on COBOL. Everyone knows that. Please don’t say it. I can name 5-10 other industries off the top of my head where this sort of longevity matters. But let’s not kid ourselves that the stuff we’re writing is even intended to last a long time.


>> But let’s not kid ourselves that the stuff we’re writing is even intended to last a long time.

Not my experience at all. I am literally at this moment releasing a new version of a private app framework that was created by a few people (including me) about 18 years ago for a few clients on a long-forgotten platform, because a client (who is still paying support fees!) found an obscure bug while building a new application using this framework. The previous version was released about 8 years ago.


Longevity is very important in enterprise apps. Companies are full of small services / utilities which were developed many years ago, work for the most part just fine and need to be touched only rarely to fix a bug which just started manifesting, add a small feature or enable some integration.


>>“This code should last 20 years” should, for most people, be fairly low on the list of desires for a technology stack.

>> But let’s not kid ourselves that the stuff we’re writing is even intended to last a long time.

Well, it depends. If you write custom software for enterprises, they very much see it as a long-term investment. Software grows with the company and is embedded in it. Nobody wants to pay for complete rewrites every five years.


Stuff written for Java 0.9 (1996), even with the default package (no namespaces), still runs normally. 2011 is past Java 7.


Compile once -- run forever!

Seriously though, this seems to be due to happenstance (well, commercial interest motivating great continuous engineering effort) rather than by design (forward thinking); unlike, say, IBM's Technology Independent Machine Interface on the AS/400.


A bit late to the "by design" party: Java 'binary' (not source, which is easier) compatibility is an exceptionally important feature. E.g. changing a method signature like void x(int val) to void x(long val) does break binary compatibility, and it means the original method has to be preserved, potentially as something like void x(int val){x((long) val);}, which just calls the new method by casting val. In some cases the original method might be marked as deprecated.

There are countless examples of such behavior.
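To make that concrete, here's a minimal sketch (the Widget class and method names are made up) of the usual pattern: keep the old overload around, deprecate it, and delegate to the new one so already-compiled callers keep linking:

  public class Widget {
    // New, preferred signature.
    public void x(long val) {
      // ... actual implementation ...
    }

    // Old signature kept purely for binary compatibility: classes compiled
    // against x(int) keep resolving, and it just delegates to x(long).
    @Deprecated
    public void x(int val) {
      x((long) val);
    }
  }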


It's ABSOLUTELY by design.

On the Java Mailing Lists, the creators/stewards of Java are constantly fighting back so many feature requests BECAUSE those features would threaten backwards compatibility. And that mailing list has been going on for a long time now. You can see feature requests (and their subsequent rejections) going as far back as the late 90's lol


> rather than by design

Remember, Java came from Sun. Backward compatibility was an absolute requirement at Sun for nearly everything. Compatibility is hard-baked-in to the culture.

Oracle plays more loose, but a lot of the people are still around.

Definitely by design.


Except nowadays you're not encouraged to run a system-wide JVM!

You can still download a JVM for Java 21, but it's from weird third parties like Adoptium.


Adoptium is a JVM from the Eclipse Foundation. They're hardly weird, even if the branding is. It's basically "people into JVMs not controlled or licensed by Oracle that work good".

https://www.eclipse.org/membership/explore-membership/


Almost correct, except that it is a bit like saying Red Hat is a Linux not controlled by Linus.

https://devclass.com/2023/03/22/despite-openjdk-70-of-java-f...


I never realized it was from Eclipse! They've cleaned up the webpage so it's a bit clearer now. I always figured it was a weird consultancy or something like Gluon, where there's a paid product when you dig a little.

Thanks for clearing it up. Hope they rebrand and drop the Adoptium name eventually.


It used to be called AdoptOpenJDK, and was a project that essentially just provided prebuilt binaries of OpenJDK.

But post several Oracle changes that I admittedly have not kept up with, they have grown in scope and also been forced to remove OpenJDK from their name. They went with Adoptium, to keep the Adopt part that got them famous.


That's bullshit.

Here are the JDK distributions supported by SDKMAN, a really good JVM-oriented package manager: https://sdkman.io/jdks

There are a couple dozen vendors in there, including very weird ones like IBM, Microsoft, AWS, Azul, Eclipse, SAP, Red Hat, and even... Oracle!


Is there really no good open source form backend? That doesn't sound right.


Formbricks can do what Formspree does, but open source. See here: https://formbricks.com/vs-formspree


You could use Drupal and the very versatile Webform module: https://www.drupal.org/project/webform


Good. So more people can stop pretending America is about freedom. Tired of this bullshit rhetoric. Military industrial complex gets richer by robbing other countries, and politicians get richer by robbing Americans.


Provides the prerequisites for an authoritarian regime when they inevitably co-opt the internet.


Well, some authoritarian regime would otherwise just do it whenever it got started, and it would require maybe a week?


Maybe this is what's happening right now


With all due respect, this is ready to launch when you are using it for your own website.


"Prabhakar made search bad"


It is a little bit painful to read. Capital letters exist for a reason (to make reading easier).


what exactly do capital letters make easier to read? i don't think readability is why they are used for proper nouns, names, or the pronoun I. and obviously ALL CAPS isn't a readability improvement either. presumably just as delineation of one sentence to the next? (forgive my ironic non-use of caps to start sentences haha).


Correct. The delineation of one sentence to the next, which is somewhat an indicator of the end of one thought - or fragment of thought - and the beginning of the next.


Look how nicely davetron5000's comment below can be read. Proper capitalization makes text much more readable.


Sports and sex.


I don't know about sports, but sex, indeed. And just like any other art form, some are absolute virtuosos and some draw cheap logos and are about to be replaced by AI.


Other than competitive team sports like football, you also have figure skating and synchronized swimming. But the vast majority of sports are art.


It's a honeypot. He's telling people OpenAI doesn't respect robots.txt and just scrapes whatever the hell it wants.


Except the first thing OpenAI does is read robots.txt.

However, robots.txt doesn't cover multiple domains, and every link that's being crawled is to a new domain, which requires a new read of the robots.txt on the new domain.


> Except the first thing OpenAI does is read robots.txt.

Then they should see the "Disallow: /" line, which means they shouldn't crawl any links on the page (because even the homepage is disallowed). Which means they wouldn't follow any of the links to other subdomains.


This robots.txt has the Disallow rule commented out:

    # buzz off
    #User-agent: GPTBot
    #Disallow: /


And they do have (the same) robots.txt on every domain, tailored for GPTBot, e.g. https://petra-cody-carlene.web.sp.am/robots.txt

So GPTBot is not following robots.txt, apparently.


All the lines related to GPTBot are commented out. That robots.txt isn't trying to block it. Either it has been changed recently or most of this comment thread is mistaken.


It wasn't commented out a few hours ago when I checked it. I think that's a recent change.


Accessing a directly referenced page is common in order to receive the noindex header and/or meta tag, whose semantics are not implied by “Disallow: /”

And then all the links are to external domains, which aren't subject to the first site's robots.txt


This is a moderately persuasive argument.

Although the crawler should probably ignore all the html body. But it does feel like a grey area if I accept your first pint.


You've been able to convince me to accept his second pint. Friday it is.


Humans don't read/respect robots.txt, so in order to pass the Turing test, AIs need to mimic human behavior.


This must be why self-driving cars always ignore the speed limit. ;)


More directly: Tesla, e.g., boasts of training their FSD on data captured from their customers' unassisted driving. So it's hardly surprising that it imitates a lot of humans' bad habits, e.g. rolling past stop lines.


Jesus, that’s one of those ideas that looks good to an engineer but is why you really need to hire someone with a social sciences background (sociology, anthropology, psychology, literally anyone whose work includes humans), and probably should hire two, so the second one can tell you why the first died of an aneurysm after you explained your idea.


AI DRIVR claims that beta V12 is much better precisely because it takes rules less literally and drives more naturally.


Did we just figure out a DoS attack for AGI training? How large can a robots.txt file be?


No, because there’s no legal weight behind robots.txt.

The second someone weaponizes robots.txt all the scrapers will just start ignoring it.


That’s how you weaponize it. Set things up to give endless/randomized/poisoned data to anybody that ignores robots.txt.


You mean human users? That is and always will be the dominant group of clients that ignore robots.txt.

What you’re talking about is an arms race wherein bots try to mimic human users and sites try to ban the bots without also banning all their human users.

That’s not a fight you want to pick when one of the bot authors also owns the browser that 63% of your users use, and the dominant site analytics platform. They have terabytes of data to use to train a crawler to act like a human, and they can change Chrome to make normal users act like their crawler (or their crawler act more like a Chrome user).

Shit, if Google wanted, they could probably get their scrapes directly from Chrome and get rid of the scraper entirely. It wouldn’t be without consequence, but they could.


It’s fairly trivial to treat Google’s crawler differently if you want. https://developers.google.com/search/docs/crawling-indexing/...

The point here is to poison the well for freeloaders like OpenAI, not to actually prevent web crawlers. OpenAI will actually pay for access to good training data; don’t hand it over for free.

People don’t mindlessly click on things like terms of service; crawlers are quite dumb. Little need for an arms race, as the people running these crawlers rarely put much effort into any one source.


It’s trivial to treat it differently, but doing so runs the risk of being accused of cloaking and getting banned from Google’s index: https://developers.google.com/search/docs/essentials/spam-po...

> The point here is to poison the well for freeloaders like OpenAI not to actually prevent web crawlers. OpenAI will actually pay for access to good training data, don’t hand it over for free.

Sure, and they’ll pay the scrapers you haven’t banned for your content, because it costs those scrapers $0 to get a copy of your stuff so they can sell it for far less than you.

> People don’t mindlessly click on things like terms of service; crawlers are quite dumb. Little need for an arms race, as the people running these crawlers rarely put much effort into any one source.

The bots are currently dumb _because_ we don’t try to stop them. There’s no need for smarter scrapers.

Watch how quickly that changes if people start blocking bots enough that scraped content has millions of dollars of value.

At the scale of a company, it would be trivial to buy request log dumps from one of the adtech vendors and replay them so you are legitimately mimicking a real user.

Even if you are catching them, you also have to be doing it fast enough that they’re not getting data. If you catch them on the 1,000th request, they’re getting enough data that it’s worthwhile for them to just rotate AWS IPs when you catch them.

Worst case, they just offer to pay users directly. “Install this addon. It will give you a list of URLs you can click to send their contents to us. We’ll pay you $5 for every thousand you click on.” There’s a virtually unlimited supply of college students willing to do dumb tasks for beer money.

You can’t price segment a product that you give away to one segment. The segment you’re trying to upcharge will just get it for cheap from someone you gave it to for free. You will always be the most expensive supplier of your own content, because everyone else has a marginal cost of $0.


Google doesn’t care what you do to other crawlers that ignore your TOS. This isn’t a theoretical situation; it’s already going on. Crawling is easy enough to “block” that there are court cases on this stuff, because this is very much a case where the defense wins once they devote fairly trivial resources to the effort.

And again, blocking should never be the goal; poisoning the well is. Training AI on poisoned data is both harder to detect and vastly more harmful. A price comparison tool is only as good as the actual prices it can compare, etc.


What about making it slow? One byte at a time, for example, while keeping the connection open.


That would make it a tarpit, a very old technique to combat scrapers/scanners.
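A minimal sketch of the idea in Java, using the JDK's built-in com.sun.net.httpserver (the /trap path, port, and timing are made up): answer with a chunked response and dribble out one byte per second for as long as the scraper keeps the connection open:

  import com.sun.net.httpserver.HttpServer;
  import java.io.OutputStream;
  import java.net.InetSocketAddress;
  import java.util.concurrent.Executors;

  public class Tarpit {
    public static void main(String[] args) throws Exception {
      HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
      server.setExecutor(Executors.newCachedThreadPool()); // one thread per trapped client
      server.createContext("/trap", exchange -> {
        exchange.getResponseHeaders().set("Content-Type", "text/html");
        exchange.sendResponseHeaders(200, 0); // length 0 = chunked, size unknown
        try (OutputStream out = exchange.getResponseBody()) {
          while (true) {
            out.write('a');     // one byte...
            out.flush();
            Thread.sleep(1000); // ...per second, until the client gives up
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
      server.start();
    }
  }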


A slow stream that never ends?


This would be considered a Slow Loris attack, and I'm actually curious how scrapers would handle it.

I'm sure the big players like Google would deal with it gracefully.


Here you go (1 req/min, 10 bytes/sec), please report results :)

  http {
    # limit_req isn't valid inside "if", so key the zone off the user agent
    # instead: the key is empty (i.e. not rate-limited) for everyone but "mimo".
    map $http_user_agent $throttled_ua {
      default "";
      "mimo"  $binary_remote_addr;
    }
    limit_req_zone $throttled_ua zone=ten_bytes_per_second:10m rate=1r/m;
    server {
      location / {
        limit_req zone=ten_bytes_per_second burst=5;
        if ($http_user_agent = "mimo") {
          limit_rate 10;  # bytes per second
        }
      }
    }
  }


Scrapers of the future won't be if-else logic; they will be LLM agents themselves. The slow-loris robots.txt has to provide an interface to its own LLM, which engages the scraper LLM in conversation, aiming to extend it as long as possible. "OK, I will tell you whether or not I can be scraped. BUT FIRST, listen to this offer. I can give you TWO SCRAPES instead of one, if you can solve this riddle."


Can I interest you in a scrape-share with Claude?


Solid use case for Saul Goodman LLM alignment


You just set limits on everything (time, buffers, ...), which is easier said than done. You need to really understand your libraries and all the layers down to the OS, because it's enough to have one abstraction that doesn't support setting limits and it's an invitation for (counter-)abuse.


Doesn't seem like it should be all that complex to me assuming the crawler is written in a common programming language. It's a pretty common coding pattern for functions that make HTTP requests to set a timeout for requests made by your HTTP client. I believe the stdlib HTTP library in the language I usually write in actually sets a default timeout if I forget to set one.


Those are usually connection and no-data timeouts. A total time limit is in my experience less common.
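Right; e.g. Java's built-in java.net.http client gives you a connect timeout and a per-request timeout out of the box, but an overall wall-clock cap is something you bolt on yourself, for instance by waiting on the async call with a deadline (URL and durations below are just placeholders):

  import java.net.URI;
  import java.net.http.HttpClient;
  import java.net.http.HttpRequest;
  import java.net.http.HttpResponse;
  import java.time.Duration;
  import java.util.concurrent.TimeUnit;
  import java.util.concurrent.TimeoutException;

  public class BoundedFetch {
    public static void main(String[] args) throws Exception {
      HttpClient client = HttpClient.newBuilder()
          .connectTimeout(Duration.ofSeconds(5))   // connection setup only
          .build();
      HttpRequest request = HttpRequest.newBuilder(URI.create("https://example.com/"))
          .timeout(Duration.ofSeconds(10))         // per-request timeout
          .build();
      try {
        // Stop waiting after 30 seconds total, body included:
        HttpResponse<String> response = client
            .sendAsync(request, HttpResponse.BodyHandlers.ofString())
            .get(30, TimeUnit.SECONDS);
        System.out.println(response.statusCode());
      } catch (TimeoutException e) {
        System.err.println("gave up after 30 seconds");
      }
    }
  }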


Sounds like endlessh


> Except the first thing OpenAI does is read robots.txt.

What good is reading it if it doesn't respect it?


It seems to respect it, as the majority of the requests are for robots.txt.


He says 3 million, and 1.8 million are for robots.txt.

So that's 1.2 million non-robots.txt requests, when his robots.txt file is configured as follows:

    # buzz off
    User-agent: GPTBot
    Disallow: /
Theoretically, if they were actually respecting robots.txt, they wouldn't crawl any pages on the site. Which would also mean they wouldn't be following any links... aka not finding the N subdomains.


A lot of crawlers, if not all, have a policy like "if you disallow our robot, it might take a day or two before it notices". They surely follow the path "check if we have a robots.txt that allows us to scan this site; if we don't, get and store robots.txt, and scan at least the root of the site and its links". There won't be a second scan, and they consider that they are respecting robots.txt. Kind of "better to ask forgiveness than permission".


That is indistinguishable from not respecting robots.txt. There is a robots.txt at the root the first time they ask for it, and they read the page and follow its links regardless.


I agree with you. I only stated how the crawlers seem to work; if you read their pages or try to block/slow them down, it seems clear that they scan first and respect after. But somehow people understood that I approve of that behaviour.

For those bad crawlers, which I very much disapprove of, "not respecting robots.txt" equals "don't even read robots.txt, or if I read it, ignore it completely". For them, "respecting robots.txt" means "scan the page for potential links, and after that parse and respect robots.txt". Which I disapprove of and don't condone.


Except now it says

    # silly bing
    #User-agent: Amazonbot          
    #Disallow: /

    # buzz off
    #User-agent: GPTBot
    #Disallow: /

    # Don't Allow everyone
    User-agent: *
    Disallow: /archive

    # slow down, dudes
    #Crawl-delay: 60
Which means he's changing it. The default for all other bots is to allow crawling.


His site has a subdomain for every page, and the crawler is considering those each to be unique sites.


There are fewer than 10 links on each domain, so how did GPTBot find out about the 1.8M unique sites? By crawling the sites it's not supposed to crawl, ignoring robots.txt. "Disallow: /" doesn't mean "you may peek at the homepage to find outbound links that may have a different robots.txt".


Of course it’s considering them as unique sites. They are unique sites.


For the 1.2 million, are there other links he's not telling us about?


I'm assuming those are homepage requests for the subdomains.


I'm not sure any publisher means for their robots.txt to be read as:

"You're disallowed, but go ahead and slurp the content anyway so you can look for external links or any indication that maybe you are allowed to digest this material anyway, and then interpret that how you'd like. I trust you to know what's best and I'm sure you kind of get the gist of what I mean here."


How would one know he is disallowed without reading each site?


The convention is that crawlers first read /robots.txt to see what they're encouraged to scrape and what they're not meant to, and then hopefully honor those directions.

In this case, as in many, the disallow rules are intentionally meant to protect the signal quality and efficiency of the crawler.

