> publishers can't pursue those violations too severely

A decade ago the publishing system harassed a researcher because he was downloading too many papers, going after him for millions in copyright "damages," only stopping proceedings after he ended his own life.

* https://en.m.wikipedia.org/wiki/United_States_v._Swartz


Yes, and we still talk about him and that one case today, a decade later. It was also a case where the surrounding circumstances (the 'breaking' into an unlocked cabinet, the 'hidden' laptop, the different university, the manifesto, and so on) allowed it to be presented as particularly bad by publishers and the government.

And that's the risk here, in part: this system allows the practice to be presented as a paid criminal enterprise, and allows individual users to be presented as criminal participants.


> dark UI patterns

Even better is when you tap "No Tip" and they pull out a receipt for you to sign with - you guessed it - another tip field.

Just in case you were on the fence about tipping for the $16 hamburger you drove yourself over to pick up, and haven't even received yet.


I tried using one of these dumps a year ago (wanted to play around and see what visualizations I could come up with based on text and the links between pages) and it was an incredibly unintuitive process.

It's not clear which files you need, and the dump itself is (or at least was, when I tried) "shipped" as a set of gigantic SQL scripts that rebuild the database, with so many lines that the SQL servers I tried gave up reading them, requiring yet another script to split them into chunks.

Then, when you finally do have the database, you still don't have a local copy of Wikipedia. You're missing several more files; for example, category information lives in a separate dump. You also need wiki software to use the dump and host the site. After a weekend of fucking around with SQL, this is the point where I gave up and just curled the 200 or so pages I was interested in.

I'm pretty sure they want you to "just" download the database dump and go to town, but it's such a pain in the ass that I can see why someone else would just crawl it.


> I tried using one of these dumps a year ago (wanted to play around and see what visualizations I could come up with based on text and the links between pages) and it was an incredibly unintuitive process.

More recently they started putting the data up on Kaggle in a format which is supposed to be easier to ingest.

https://enterprise.wikimedia.com/blog/kaggle-dataset/


"More recently" is very recently; there hasn't been enough time yet for data collectors to evaluate changing their processes.


Good timing to learn about this, given that it's Friday. Thanks! I'll check it out


I think there are engineers working for crawler companies who are paid well enough to figure out how to do this without kneecapping the most well-known noncommercial projects still surviving on the capitalized internet.


Have you tried any of the ZIM file exports?

https://dumps.wikimedia.org/kiwix/zim/wikipedia/


Yeah, it's a bit confusing at first to navigate. Luckily, they offer XML dumps that aren't too bad to work with:

1. Go to https://dumps.wikimedia.org/enwiki/latest/ (or a date of your choice in /enwiki)

2. Download https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-page... and https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-page.... The first file is a bz2-multistream-compressed XML dump containing all of English Wikipedia's text, while the second is an index that makes it easier to find specific articles.

3. You can either:

  a. unpack the first file
  b. use the second file to locate specific articles within the first file; it maps page title -> file offset for the relevant bz2 stream
  c. use a streaming decoder to process the entire Wiki without ever decompressing it wholly
4. Once you have the XML, getting at the actual text isn't too difficult; you should use a streaming XML decoder to avoid as much allocation as possible when processing this much data.

The XML contains pages like this:

    <page>
      <title>AccessibleComputing</title>
      <ns>0</ns>
      <id>10</id>
      <redirect title="Computer accessibility" />
      <revision>
        <id>1219062925</id>
        <parentid>1219062840</parentid>
        <timestamp>2024-04-15T14:38:04Z</timestamp>
        <contributor>
          <username>Asparagusus</username>
          <id>43603280</id>
        </contributor>
        <comment>Restored revision 1002250816 by [[Special:Contributions/Elli|Elli]] ([[User talk:Elli|talk]]): Unexplained redirect breaking</comment>
        <origin>1219062925</origin>
        <model>wikitext</model>
        <format>text/x-wiki</format>
        <text bytes="111" sha1="kmysdltgexdwkv2xsml3j44jb56dxvn" xml:space="preserve">#REDIRECT [[Computer accessibility]]

    {{rcat shell|
    {{R from move}}
    {{R from CamelCase}}
    {{R unprintworthy}}
    }}</text>
        <sha1>kmysdltgexdwkv2xsml3j44jb56dxvn</sha1>
      </revision>
    </page>
so all you need to do is get at the `text`.
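For what it's worth, here's a rough Python sketch of option 3b above: use the index to find the bz2 stream containing an article, decompress only that stream, and pull the `text` out with a standard XML parser. The local file names and the `offset:page_id:title` index layout are assumptions based on how the multistream dumps are usually laid out, so treat this as a starting point rather than a finished tool:

    import bz2
    import xml.etree.ElementTree as ET

    # Assumed local copies of the two files from step 2.
    DUMP = "enwiki-latest-pages-articles-multistream.xml.bz2"
    INDEX = "enwiki-latest-pages-articles-multistream-index.txt.bz2"

    def find_offset(title):
        """Scan the index; each line is 'byte_offset:page_id:page_title'."""
        with bz2.open(INDEX, "rt", encoding="utf-8") as f:
            for line in f:
                offset, _, page_title = line.rstrip("\n").split(":", 2)
                if page_title == title:
                    return int(offset)
        raise KeyError(title)

    def extract_text(title):
        """Decompress only the stream holding `title` and return its wikitext."""
        offset = find_offset(title)
        with open(DUMP, "rb") as f:
            f.seek(offset)
            decomp = bz2.BZ2Decompressor()
            chunks = []
            while not decomp.eof:  # one stream holds roughly 100 <page> elements
                data = f.read(256 * 1024)
                if not data:
                    break
                chunks.append(decomp.decompress(data))
        # The stream is a bare run of <page> elements; wrap it so it parses as XML.
        root = ET.fromstring(b"<pages>" + b"".join(chunks) + b"</pages>")
        for page in root.iter("page"):
            if page.findtext("title") == title:
                return page.findtext("revision/text")
        return None

    print(extract_text("AccessibleComputing"))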


The bigger problem is that this is wikitext markup. It would be helpful if they also provided HTML and/or plain-text dumps.

I know there are now a couple of pretty good wikitext parsers, but for years it was a bigger problem: the only "official" one was the huge PHP app itself.
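These days the path of least resistance (my pick, not an official recommendation) is the Python library mwparserfromhell. A minimal sketch of stripping wikitext down to plain text:

    import mwparserfromhell  # pip install mwparserfromhell

    # Hypothetical snippet of wikitext, just for illustration.
    wikitext = ("'''Computer accessibility''' is the accessibility of a computer "
                "system to all people, regardless of disability.{{citation needed}} "
                "See [[assistive technology]] for related tooling.")

    wikicode = mwparserfromhell.parse(wikitext)

    # Plain text with templates, bold/italic markup, and link syntax stripped.
    print(wikicode.strip_code().strip())

    # The parse tree is queryable too, e.g. for internal links and templates.
    print([str(link.title) for link in wikicode.filter_wikilinks()])
    print([str(tpl.name) for tpl in wikicode.filter_templates()])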


Oh, it's godawful; the format is a crime against all things structured. I use `parse-wiki-text-2` [0], which is a fork of `parse-wiki-text`, a Rust library by an author who has now disappeared into the wind. (Every day that I parse Wikipedia, I thank him for his contributions, wherever he may be.)

I wrote another Rust library [1] that wraps `parse-wiki-text-2` and offers a simplified AST that takes care of matching tags for you. It's designed to be bound to WASM [2], which is how I'm pretty reliably parsing Wikitext for my web application. (The existing JS libraries aren't fantastic, if I'm being honest.)

[0]: https://github.com/soerenmeier/parse-wiki-text-2

[1]: https://github.com/philpax/wikitext_simplified

[2]: https://github.com/genresinspace/genresinspace.github.io/blo...


What they need to do is have 'major edits' push out an updated static render to a physical file, the way old-school publishing processes would. Then either host those files somewhere as-is, or also offer them in a compressed format (e.g. a compressed weekly snapshot retained for a year?).

Also, point a CNAME from bots.wikipedia.org at that site.


Sci-Hub was an incredible achievement. It was the closest humanity came to the interconnected sharing of knowledge we dreamed the Internet would be in the 20th century.

And they tried their hardest to kill it because journals believe they're entitled to extract a century of rent from work they did not perform.


Something I wrote related to this in 2001: "An Open Letter to All Grantmakers and Donors On Copyright And Patent Policy In a Post-Scarcity Society" https://pdfernhout.net/open-letter-to-grantmakers-and-donors... "Foundations, other grantmaking agencies handling public tax-exempt dollars, and charitable donors need to consider the implications for their grantmaking or donation policies if they use a now obsolete charitable model of subsidizing proprietary publishing and proprietary research. In order to improve the effectiveness and collaborativeness of the non-profit sector overall, it is suggested these grantmaking organizations and donors move to requiring grantees to make any resulting copyrighted digital materials freely available on the internet, including free licenses granting the right for others to make and redistribute new derivative works without further permission. It is also suggested patents resulting from charitably subsidized research also be made freely available for general use. The alternative of allowing charitable dollars to result in proprietary copyrights and proprietary patents is corrupting the non-profit sector as it results in a conflict of interest between a non-profit's primary mission of helping humanity through freely sharing knowledge (made possible at little cost by the internet) and a desire to maximize short term revenues through charging licensing fees for access to patents and copyrights. In essence, with the change of publishing and communication economics made possible by the widespread use of the internet, tax-exempt non-profits have become, perhaps unwittingly, caught up in a new form of "self-dealing", and it is up to donors and grantmakers (and eventually lawmakers) to prevent this by requiring free licensing of results as a condition of their grants and donations."

Glad to see better policy happening -- even if all too slowly and only in some areas.


The difference is that while indirect costs are critical to research in most cases, journals are the poster child when it comes to skimming research funding.

They provide little to no real value beyond a CV trophy and only carry out the bare minimum to coordinate peer review. Their largest impact is siphoning tens of thousands of dollars from labs, and millions from cash-strapped university libraries.

Even if the current administration wasn't attacking university funding, the publishing system is in desperate need of reform.


That it survived two administrations in the current climate is a miracle to be thankful for


You'd be surprised at how many policies survived two administrations. The real big one (unless I'm missing something, or there have been CIA covert ops) is "not invading any new countries" (the Yemen conflict started under Obama).


If your tax dollars are funding research, you should be entitled to reading the results.

I don't think I've met any other researchers who prefer paywalls. The problem is the most prestigious journals (Cell, Nature, Science, etc) have extremely parasitic business models - you pay a bunch of money to publish in them, and then other people pay them to read. But in return you get a CV boost.

They charge out the nose for open access (the researcher pays). With funding as tight as it is these days, maybe we'll see a shift to a more ethical publishing model as researchers start questioning whether it's worth it.


>"If your tax dollars are funding research, you should be entitled to reading the results."

This statement begs the question, though I understand why it seemingly 'makes sense'. Your tax money also funds lots of things you don't have access to or visibility of, and it's not clear how far your logic should extend. Should you have access to intelligence assessments, or the ability to purchase any technology developed with government funding? What about licenses to patents developed with the aid of government funding? How about access to government or external labs, or the use of their equipment?


What goes to government should benefit the people, not the mythical entity


I don't think it's that crazy in certain careers; I've seen a similar sentiment in academics.

Back in grad school, we had several professors emeriti who were teaching a class or two, or collaborating with a lab, because they just enjoyed it.


> So... are _all_ introductions to machine learning just extremely wrong here?

It's more of a heuristic. Most people have their first experience in Excel, where you can fit a polynomial. Cranking up the degree will always improve R^2 (since Excel doesn't do a holdout), so it's a very common mistake for new students to make.

It's much more understandable at the beginner level to say "you'll overfit if you crank up the degree" than it is to explain regularization and bases. Later on you can introduce those, but early on they're confusing and distracting to students who might not even know how to fit an ordinary least squares model.
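A quick way to show the effect (my own toy example, not from any course material): fit polynomials of increasing degree and compare R^2 on the training points against R^2 on a held-out slice, which is exactly the check Excel's trendline never does. Training R^2 can only go up as the degree rises; the held-out score typically stalls and then degrades.

    import numpy as np

    rng = np.random.default_rng(0)

    # Noisy samples of a simple underlying curve.
    x = np.linspace(0, 1, 30)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

    # Hold out every third point -- the step Excel's trendline skips.
    test = np.arange(x.size) % 3 == 0
    train = ~test

    def r2(y_true, y_pred):
        ss_res = np.sum((y_true - y_pred) ** 2)
        ss_tot = np.sum((y_true - y_true.mean()) ** 2)
        return 1 - ss_res / ss_tot

    for degree in (1, 3, 6, 9):
        fit = np.poly1d(np.polyfit(x[train], y[train], degree))
        print(f"degree {degree}: train R2 = {r2(y[train], fit(x[train])):.3f}, "
              f"holdout R2 = {r2(y[test], fit(x[test])):.3f}")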


There are also color science packages like this one that let you do conversions to various spaces - https://www.colour-science.org/
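For example, a quick sketch with the colour package (pip name colour-science); the function names here are from its docs, so double-check them against whatever version you install:

    import numpy as np
    import colour  # pip install colour-science

    # A mid-tone orange, given as sRGB in the 0-1 range.
    rgb = np.array([0.9, 0.4, 0.1])

    # Chain a few conversions between spaces.
    xyz = colour.sRGB_to_XYZ(rgb)
    print("XYZ:", xyz)
    print("CIE Lab:", colour.XYZ_to_Lab(xyz))
    print("xy chromaticity:", colour.XYZ_to_xy(xyz))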

