
"We promise to limit our revenue"


What a strong ad for Java


2011 for Java is nothing. I've used a JDBC driver written for Java 1.4 (2002) in Java 17, and I'm absolutely sure it'll work with Java 21 just as well.

Java backwards compatibility is real and it works absolutely fine unless you do bad things.


I think the last breaking change I remember was enumerations being added, which broke any code that used the new 'enum' keyword as a variable name. But I could be wrong; it was almost 20 years ago.


With Java 11 they hid a lot of internal functions that people had used in their code, which broke things. But those were never really part of the public API, so strictly speaking it wasn't a breaking change.

I remember moving to Java 8 changed the iteration order of HashMaps etc., which also broke some stuff for us. But again, that was mostly our fault for relying on unspecified behavior.
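For what it's worth, a minimal sketch of that pitfall (keys and values are made up): HashMap's iteration order is unspecified and has changed between JDK releases, so code that needs a stable order should say so explicitly, e.g. with LinkedHashMap:

  import java.util.HashMap;
  import java.util.LinkedHashMap;
  import java.util.Map;

  public class IterationOrder {
    public static void main(String[] args) {
      Map<String, Integer> fragile = new HashMap<>();       // order unspecified, may differ per JDK
      Map<String, Integer> stable  = new LinkedHashMap<>(); // insertion order guaranteed
      for (String key : new String[] {"b", "a", "c"}) {
        fragile.put(key, 1);
        stable.put(key, 1);
      }
      System.out.println(fragile.keySet()); // whatever the hash layout happens to be
      System.out.println(stable.keySet());  // always [b, a, c]
    }
  }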


With Java 11 they only hid it; you just need to pass some parameters to get it working. I believe it only truly broke in Java 21, which finally removed some crutches people had been using to work around the new limitations.
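For example (assuming the old code pokes at JDK internals via reflection; the jar name is just a placeholder), flags along these lines usually keep it running on newer JDKs:

  java --add-opens java.base/java.lang=ALL-UNNAMED -jar legacy-app.jar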


Maybe not part of the language per se, but they threw out CORBA in Java 11, and this decimated some very old libraries that used it as a dependency.


Thread#stop no longer works as of at least Java 21.


True, I think since Java 17 they have been removing really obsolete stuff. I think that's when they removed CORBA from the JDK as well (though you can still get it as a library, I believe). Same with Nashorn (the JS runtime).


Personally I consider 1.4.2 the point where Java got stable and had proper IO (java.nio), along with a decent JIT (HotSpot).


Phew, that was around the internet Java applets phase, if I recall correctly?


Back then it was all about JSP/XML and stuff; J2EE too. JDO came soon after. The applets did exist and (javax.)swing was a thing - but it was far from the focus... and of course Java was still considered slow.


Java's maintained almost perfect backward compatibility. If you want code you wrote today to run in 10 more years, Java is probably the best choice. Most other languages have too much of a history of breaking changes, or if you pick C/C++, you'll have issues linking against an old UI library.


There are many other languages with a better Lindy effect rating (https://en.wikipedia.org/wiki/Lindy_effect): Common Lisp, Erlang, Fortran, and more.

I don't doubt that Java is doing well, but other languages have it beat.


Except they're not in widespread usage, so aren't relevant.

No grads are going "gee, do I go with a .NET shop, a JVM shop, or a BEAM shop?"

And as for Common Lisp, which implementation? They can't even be compatible amongst themselves, so I'm dubious that SBCL CL from 13 years ago would just work.


> And as for Common Lisp, which implementation? They can't even be compatible amongst themselves, so I'm dubious that SBCL CL from 13 years ago would just work.

For a bit more solid example:

https://web.archive.org/web/20150217111426/http://www.inform...

This too will likely end up being downvoted for talking against the hivemind here.


> Except they're not in widespread usage, so aren't relevant.

I don't know any Java programmers either. Apparently people use Java; I just don't know any who do it willingly. I'm sure they -must- use it, but Erlang/Fortran/etc. are as mythical to you as Java is to me.

> No grads are going "gee, do I go with a .NET shop, a JVM shop, or a BEAM shop?"

Fresh grads don't know better; they take whatever pays. Businesses choose the cheapest (not the best) option, and grads don't have the experience to choose better. Grads' choices are not a measure of quality or desirability.

> And as for Common Lisp, which implementation? They can't even be compatible amongst themselves, so I'm dubious that SBCL CL from 13 years ago would just work.

My SBCL code from 2010ish works. I've patched it up and improved it over time, but I haven't tried to run the original; it probably works too. It's been through svn to git so I've lost the history, and all of this is as anecdotal as anything else. Previous Lisp code that I wrote was trivial and ran out of the box on SBCL; however, that code is very self-contained and not 'networked' or 'modern'.

My Erlang code, however, has been running in a cluster since the early 2000s. It has also been through several releases for additional features, but I no longer have access to that code, so I can't validate what has been done to it for the last decade.

I like your arguments, I just don't think we're coming from the same historical viewpoint.


The Lindy effect assumes you have no other information about the thing you're estimating. Obviously, if I know the thing is on its deathbed, I can't invoke the Lindy effect. Similarly, a hundred-year-old language that only one person uses any more isn't likely to last another hundred years.

Given the relative sizes, I wouldn't bet on Common Lisp outlasting Java just because it's older.


It's likely we will both be dead before either of them is no longer being maintained.


Yeah, have you tried actually running a Common Lisp program from decades ago on a different implementation from the one it was developed on?


> common lisp, erlang, fortran

That's a pretty esoteric list. JavaScript would have been a better choice because of widespread deployment by multiple vendors and a heavy legacy.


JavaScript is still pretty young (comparatively); however, I do believe you'd be right in saying that it's likely to be around for a very long time.


Interestingly, JavaScript is just as old as Java, and Python is older.


Python kinda dies with each major release though; there are no backwards compatibility goals.


Seriously. Remember applets? My first real Java app was a small 3D Pong game using AWT, and it still works using appletviewer on my M1 MacBook Pro. I mean, it's circa 1998.


I found an old applet of mine from the late '90s, and tried running it. Crashed with a NullPointerException from deep inside AWT. My guess is that there is now some setup that needs to be done that didn't at the time.


That should be the norm.

The reality is that this is a strong ad against almost all modern frameworks, which may live for as little as a football season.


Not really. Not automatically, anyway. Realistically, code lasting forever is, the majority of the time, some engineer’s nerdy wet dream almost completely devoid of any real-world requirements. “This code should last 20 years” should, for most people, be fairly low on the list of desires for a technology stack. In the vast majority of cases, the processes that the software seeks to automate will have been thrown out LONG before then. The business went bust and the only surviving copy is on some developer’s personal computer. Darlene from accounting left and her replacement likes to do things differently, so all this custom stuff was replaced with something off-the-shelf. Your $40B unicorn dating network very unceremoniously fell from the charts after Gen Z decided to throw their phones away and connect in person like we used to. After all that, you’re left there holding a perfectly functional(?) solution to a problem nobody is asking to be solved anymore.

Let’s be clear: I know that banks run on COBOL. Everyone knows that. Please don’t say it. I can name 5-10 other industries off the top of my head where this sort of longevity matters. But let’s not kid ourselves that the stuff we’re writing is even intended to last a long time.


>> But let’s not kid ourselves that the stuff we’re writing is even intended to last a long time.

Not my experience at all. I am literally at this moment releasing a new version of a private app framework that was created by a few people (including me) about 18 years ago for a few clients on a long-forgotten platform, because a client (who is still paying support fees!) found an obscure bug while building a new application using this framework. The previous version was released about 8 years ago.


Longevity is very important in enterprise apps. Companies are full of small services / utilities which were developed many years ago, work for the most part just fine and need to be touched only rarely to fix a bug which just started manifesting, add a small feature or enable some integration.


>>“This code should last 20 years” should, for most people, be fairly low on the list of desires for a technology stack.

>> But let’s not kid ourselves that the stuff we’re writing is even intended to last a long time.

Well, it depends. If you write custom software for enterprises, they very much see it as a long-term investment. Software grows with the company and is embedded in it. Nobody wants to pay for complete rewrites every five years.


Stuff written for Java 0.9 (1996), even with the default package (no namespaces), still runs normally. 2011 is past Java 7.


Compile once -- run forever!

Seriously though, this seems to be due to happenstance (well, commercial interest motivating great continuous engineering effort) rather than by design (forward thinking); unlike, say, IBM's Technology Independent Machine Interface on the AS/400.


A bit late to the "by design" party: Java 'binary' (not source, which is easier) compatibility is an exceptionally important feature. E.g. changing a method signature like void x(int val) to void x(long val) does break binary compatibility, and it means the original method has to be preserved, potentially as something like void x(int val){x((long) val);}, which just calls the new method by casting val. In some cases the original method might be marked as deprecated.

There are countless examples of such behavior.
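To make that concrete, here's a minimal sketch (the Widget class and method names are made up) of the usual pattern: keep the old overload around, deprecate it, and delegate to the new one so already-compiled callers keep linking:

  public class Widget {
    // New, preferred signature.
    public void x(long val) {
      // ... actual implementation ...
    }

    // Old signature kept purely for binary compatibility: classes compiled
    // against x(int) keep resolving, and it just delegates to x(long).
    @Deprecated
    public void x(int val) {
      x((long) val);
    }
  }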


It's ABSOLUTELY by design.

On the Java Mailing Lists, the creators/stewards of Java are constantly fighting back so many feature requests BECAUSE those features would threaten backwards compatibility. And that mailing list has been going on for a long time now. You can see feature requests (and their subsequent rejections) going as far back as the late 90's lol


> rather than by design

Remember, Java came from Sun. Backward compatibility was an absolute requirement at Sun for nearly everything. Compatibility is hard-baked-in to the culture.

Oracle plays more loose, but a lot of the people are still around.

Definitely by design.


Except nowadays you're not encouraged to run a system-wide JVM!

You can still download a JVM for Java 21, but it's from weird third parties like Adoptium.


Adoptium is a JVM from the Eclipse Foundation. They're hardly weird, even if the branding is. It's basically "people into JVMs not controlled or licensed by Oracle that work good".

https://www.eclipse.org/membership/explore-membership/


Almost correct, except that it is a bit like saying Red Hat is a Linux not controlled by Linus.

https://devclass.com/2023/03/22/despite-openjdk-70-of-java-f...


I never realized it was from Eclipse! They've cleaned up the webpage so it's a bit clearer now. I always figured it was a weird consultancy or something like Gluon, where there's a paid product when you dig a little.

Thanks for clearing it up. Hope they rebrand and drop the Adoptium name eventually.


It used to be called AdoptOpenJDK, and was a project that essentially just provided prebuilt binaries of OpenJDK.

But post several Oracle changes that I admittedly have not kept up with, they have grown in scope and also been forced to remove OpenJDK from their name. They went with Adoptium, to keep the Adopt part that got them famous.


That's bullshit.

Here are the JDK distributions supported by SDKMAN, a really good JVM-oriented package manager: https://sdkman.io/jdks

There are a couple dozen vendors in there, including very weird ones like IBM, Microsoft, AWS, Azul, Eclipse, SAP, Red Hat, and even... Oracle!


Is there really no good open source form backend? That doesn't sound right.


Formbricks can do what Formspree does, but open source. See here: https://formbricks.com/vs-formspree


You could use Drupal and the very versatile Webform module: https://www.drupal.org/project/webform


Good. So more people can stop pretending America is about freedom. Tired of this bullshit rhetoric. Military industrial complex gets richer by robbing other countries, and politicians get richer by robbing Americans.


Provides the prerequisites for an authoritarian regime when they inevitably co-opt the internet.


Well, some authoritarian regime would otherwise just do it whenever it got started, and it would require maybe a week?


Maybe this is what's happening right now


With all due respect, this is ready to launch when you are using it for your own website.


"Prabhakar made search bad"


It is a little bit painful to read. Capital letters exist for a reason (to make reading easier).


what exactly do capital letters make easier to read? i don't think readability is why they are used for proper nouns, names, or the pronoun I. and obviously ALL CAPS isn't a readability improvement either. presumably just as delineation of one sentence to the next? (forgive my ironic non-use of caps to start sentences haha).


Correct. The delineation of one sentence to the next, which is somewhat an indicator of the end of one thought - or fragment of thought - and the beginning of the next.


Look how nicely davetron5000's comment below can be read. Proper capitalization makes text much more readable.


Sports and sex.


I don't know about sports, but sex, indeed. And just like any other art form, some are absolute virtuosos and some draw cheap logos and are about to be replaced by AI.


Other than competitive team sports like football, you also have figure skating and synchronized swimming. But the vast majority of sports are art.


It's a honeypot. He's telling people OpenAI doesn't respect robots.txt and just scrapes whatever the hell it wants.


Except the first thing OpenAI does is read robots.txt.

However, robots.txt doesn't cover multiple domains, and every link that's being crawled is to a new domain, which requires a new read of the robots.txt on the new domain.


> Except the first thing OpenAI does is read robots.txt.

Then they should see the "Disallow: /" line, which means they shouldn't crawl any links on the page (because even the homepage is disallowed). Which means they wouldn't follow any of the links to other subdomains.


This robots.txt has the Disallow rule commented out:

    # buzz off
    #User-agent: GPTBot
    #Disallow: /


And they do have (the same) robots.txt on every domain, tailored for GPTBot, e.g. https://petra-cody-carlene.web.sp.am/robots.txt

So GPTBot is not following robots.txt, apparently.


All the lines related to GPTBot are commented out. That robots.txt isn't trying to block it. Either it has been changed recently or most of this comment thread is mistaken.


It wasn't commented out a few hours ago when I checked it. I think that's a recent change.


Accessing a directly referenced page is common in order to receive the noindex header and/or meta tag, whose semantics are not implied by “Disallow: /”

And then all the links are to external domains, which aren't subject to the first site's robots.txt


This is a moderately persuasive argument.

Although the crawler should probably ignore all the html body. But it does feel like a grey area if I accept your first pint.


You've been able to convince me to accept his second pint. Friday it is.


Humans don't read/respect robots.txt, so in order to pass the Turing test, AIs need to mimic human behavior.


This must be why self-driving cars always ignore the speed limit. ;)


More directly: Tesla, e.g., boasts of training their FSD on data captured from their customers' unassisted driving. So it's hardly surprising that it imitates a lot of humans' bad habits, e.g. rolling past stop lines.


Jesus, that’s one of those ideas that looks good to an engineer but is why you really need to hire someone with a social sciences background (sociology, anthropology, psychology, literally anyone whose work includes humans), and probably should hire two, so the second one can tell you why the first died of an aneurysm after you explained your idea.


AI DRIVR claims that beta V12 is much better precisely because it takes rules less literally and drives more naturally.


Did we just figure out a DoS attack for AGI training? How large can a robots.txt file be?


No, because there’s no legal weight behind robots.txt.

The second someone weaponizes robots.txt all the scrapers will just start ignoring it.


That’s how you weaponize it. Set things up to give endless/randomized/poisoned data to anybody that ignores robots.txt.


You mean human users? That is and always will be the dominant group of clients that ignore robots.txt.

What you’re talking about is an arms race wherein bots try to mimic human users and sites try to ban the bots without also banning all their human users.

That’s not a fight you want to pick when one of the bot authors also owns the browser that 63% of your users use, and the dominant site analytics platform. They have terabytes of data to use to train a crawler to act like a human, and they can change Chrome to make normal users act like their crawler (or their crawler act more like a Chrome user).

Shit, if Google wanted, they could probably get their scrapes directly from Chrome and get rid of the scraper entirely. It wouldn’t be without consequence, but they could.


It’s fairly trivial to treat Google’s crawler differently if you want. https://developers.google.com/search/docs/crawling-indexing/...

The point here is to poison the well for freeloaders like OpenAI, not to actually prevent web crawlers. OpenAI will actually pay for access to good training data; don’t hand it over for free.

People don’t mindlessly click on things like terms of service; crawlers are quite dumb. Little need for an arms race, as the people running these crawlers rarely put much effort into any one source.


It’s trivial to treat it differently, but doing so runs the risk of being accused of cloaking and getting banned from Google’s index: https://developers.google.com/search/docs/essentials/spam-po...

> The point here is to poison the well for freeloaders like OpenAI not to actually prevent web crawlers. OpenAI will actually pay for access to good training data, don’t hand it over for free.

Sure, and they’ll pay the scrapers you haven’t banned for your content, because it costs those scrapers $0 to get a copy of your stuff so they can sell it for far less than you.

> People don’t mindlessly click on things like terms of service; crawlers are quite dumb. Little need for an arms race, as the people running these crawlers rarely put much effort into any one source.

The bots are currently dumb _because_ we don’t try to stop them. There’s no need for smarter scrapers.

Watch how quickly that changes if people start blocking bots enough that scraped content has millions of dollars of value.

At the scale of a company, it would be trivial to buy request log dumps from one of the adtech vendors and replay them so you are legitimately mimicking a real user.

Even if you are catching them, you also have to be doing it fast enough that they’re not getting data. If you catch them on the 1,000th request, they’re getting enough data that it’s worthwhile for them to just rotate AWS IPs when you catch them.

Worst case, they just offer to pay users directly. “Install this addon. It will give you a list of URLs you can click to send their contents to us. We’ll pay you $5 for every thousand you click on.” There’s a virtually unlimited supply of college students willing to do dumb tasks for beer money.

You can’t price segment a product that you give away to one segment. The segment you’re trying to upcharge will just get it for cheap from someone you gave it to for free. You will always be the most expensive supplier of your own content, because everyone else has a marginal cost of $0.


Google doesn’t care what you do to other crawlers that ignore your TOS. This isn’t a theoretical situation; it’s already going on. Crawling is easy enough to “block” that there are court cases on this stuff, because this is very much a case where the defense wins once they devote fairly trivial resources to the effort.

And again, blocking should never be the goal; poisoning the well is. Training AI on poisoned data is both harder to detect and vastly more harmful. A price comparison tool is only as good as the actual prices it can compare, etc.


What about making it slow? One byte at a time, for example, while keeping the connection open.


That would make it a tarpit, a very old technique to combat scrapers/scanners.
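A minimal sketch of the idea in Java, using the JDK's built-in com.sun.net.httpserver (the /trap path, port, and timing are made up): answer with a chunked response and dribble out one byte per second for as long as the scraper keeps the connection open:

  import com.sun.net.httpserver.HttpServer;
  import java.io.OutputStream;
  import java.net.InetSocketAddress;
  import java.util.concurrent.Executors;

  public class Tarpit {
    public static void main(String[] args) throws Exception {
      HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
      server.setExecutor(Executors.newCachedThreadPool()); // one thread per trapped client
      server.createContext("/trap", exchange -> {
        exchange.getResponseHeaders().set("Content-Type", "text/html");
        exchange.sendResponseHeaders(200, 0); // length 0 = chunked, size unknown
        try (OutputStream out = exchange.getResponseBody()) {
          while (true) {
            out.write('a');     // one byte...
            out.flush();
            Thread.sleep(1000); // ...per second, until the client gives up
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
      server.start();
    }
  }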


A slow stream that never ends?


This would be considered a Slow Loris attack, and I'm actually curious how scrapers would handle it.

I'm sure the big players like Google would deal with it gracefully.


Here you go (1 req/min, 10 bytes/sec), please report results :)

  http {
    # limit_req isn't valid inside "if", so key the zone off the user agent
    # instead: the key is empty (i.e. not rate-limited) for everyone but "mimo".
    map $http_user_agent $throttled_ua {
      default "";
      "mimo"  $binary_remote_addr;
    }
    limit_req_zone $throttled_ua zone=ten_bytes_per_second:10m rate=1r/m;
    server {
      location / {
        limit_req zone=ten_bytes_per_second burst=5;
        if ($http_user_agent = "mimo") {
          limit_rate 10;  # bytes per second
        }
      }
    }
  }


Scrapers of the future won't be if-else logic; they will be LLM agents themselves. The slow-loris robots.txt has to provide an interface to its own LLM, which engages the scraper LLM in conversation, aiming to extend it as long as possible. "OK, I will tell you whether or not I can be scraped. BUT FIRST, listen to this offer. I can give you TWO SCRAPES instead of one, if you can solve this riddle."


Can I interest you in a scrape-share with Claude?


Solid use case for Saul Goodman LLM alignment


You just set limits on everything (time, buffers, ...), which is easier said than done. You need to really understand your libraries and all the layers down to the OS, because it's enough to have one abstraction that doesn't support setting limits and it's an invitation for (counter-)abuse.


Doesn't seem like it should be all that complex to me assuming the crawler is written in a common programming language. It's a pretty common coding pattern for functions that make HTTP requests to set a timeout for requests made by your HTTP client. I believe the stdlib HTTP library in the language I usually write in actually sets a default timeout if I forget to set one.


Those are usually connection and no-data timeouts. A total time limit is in my experience less common.
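Right; e.g. Java's built-in java.net.http client gives you a connect timeout and a per-request timeout out of the box, but an overall wall-clock cap is something you bolt on yourself, for instance by waiting on the async call with a deadline (URL and durations below are just placeholders):

  import java.net.URI;
  import java.net.http.HttpClient;
  import java.net.http.HttpRequest;
  import java.net.http.HttpResponse;
  import java.time.Duration;
  import java.util.concurrent.TimeUnit;
  import java.util.concurrent.TimeoutException;

  public class BoundedFetch {
    public static void main(String[] args) throws Exception {
      HttpClient client = HttpClient.newBuilder()
          .connectTimeout(Duration.ofSeconds(5))   // connection setup only
          .build();
      HttpRequest request = HttpRequest.newBuilder(URI.create("https://example.com/"))
          .timeout(Duration.ofSeconds(10))         // per-request timeout
          .build();
      try {
        // Stop waiting after 30 seconds total, body included:
        HttpResponse<String> response = client
            .sendAsync(request, HttpResponse.BodyHandlers.ofString())
            .get(30, TimeUnit.SECONDS);
        System.out.println(response.statusCode());
      } catch (TimeoutException e) {
        System.err.println("gave up after 30 seconds");
      }
    }
  }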


Sounds like endlessh


> Except the first thing OpenAI does is read robots.txt.

What good is reading it if it doesn't respect it?


It seems to respect it, as the majority of the requests are for robots.txt.


He says 3 million, and 1.8 million are for robots.txt.

So that's 1.2 million non-robots.txt requests, when his robots.txt file is configured as follows:

    # buzz off
    User-agent: GPTBot
    Disallow: /
Theoretically, if they were actually respecting robots.txt, they wouldn't crawl any pages on the site. Which would also mean they wouldn't be following any links... aka not finding the N subdomains.


A lot of crawlers, if not all, have a policy like "if you disallow our robot, it might take a day or two before it notices". They surely follow the path "check if we have a robots.txt that allows us to scan this site; if we don't, get and store robots.txt, and scan at least the root of the site and its links". There won't be a second scan, and they consider that they are respecting robots.txt. Kind of "better to ask forgiveness than permission".


That is indistinguishable from not respecting robots.txt. There is a robots.txt at the root the first time they ask for it, and they read the page and follow its links regardless.


I agree with you. I only stated how the crawlers seem to work; if you read their pages or try to block/slow them down, it seems clear that they scan first and respect after. But somehow people understood that I approve of that behaviour.

For those bad crawlers, which I very much disapprove of, "not respecting robots.txt" equals "don't even read robots.txt, or if I read it, ignore it completely". For them, "respecting robots.txt" means "scan the page for potential links, and after that parse and respect robots.txt". Which I disapprove of and don't condone.


Except now it says

    # silly bing
    #User-agent: Amazonbot          
    #Disallow: /

    # buzz off
    #User-agent: GPTBot
    #Disallow: /

    # Don't Allow everyone
    User-agent: *
    Disallow: /archive

    # slow down, dudes
    #Crawl-delay: 60
Which means he's changing it. The default for all other bots is to allow crawling.


His site has a subdomain for every page, and the crawler is considering those each to be unique sites.


There are fewer than 10 links on each domain, so how did GPTBot find out about the 1.8M unique sites? By crawling the sites it's not supposed to crawl, ignoring robots.txt. "Disallow: /" doesn't mean "you may peek at the homepage to find outbound links that may have a different robots.txt".


Of course it’s considering them as unique sites. They are unique sites.


For the 1.2 million, are there other links he's not telling us about?


I'm assuming those are homepage requests for the subdomains.


I'm not sure any publisher means for their robots.txt to be read as:

"You're disallowed, but go ahead and slurp the content anyway so you can look for external links or any indication that maybe you are allowed to digest this material anyway, and then interpret that how you'd like. I trust you to know what's best and I'm sure you kind of get the gist of what I mean here."


How would one know he is disallowed without reading each site?


The convention is that crawlers first read /robots.txt to see what they're encouraged to scrape and what they're not meant to, and then hopefully honor those directions.

In this case, as in many, the disallow rules are intentionally meant to protect the signal quality and efficiency of the crawler.

