As I understand it, the sanctioned way of sharing code added to UE is to fork it on GitHub and publish your changes to your fork.
Being a fork, it will only be visible to other people in the Epic Games GitHub org, i.e. only people who have agreed to Epic's licensing terms, and your modified engine remains under that same license.
It starts with building micrograd to give you an understanding of how PyTorch calculates gradients, then proceeds all the way to building a GPT-2 clone.
Looks like this is an effort to reorganize and build on that existing work.
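For anyone curious what "building micrograd" means, here's a hedged sketch of the core idea: every value remembers how it was computed, and backward() walks that graph applying the chain rule. This is illustrative only, not the course's actual code, and the names are made up.

    # Minimal scalar autograd in the spirit of micrograd (illustrative sketch).
    class Value:
        def __init__(self, data, parents=(), backward_fn=lambda: None):
            self.data = data          # the scalar value
            self.grad = 0.0           # d(output)/d(this value), filled in by backward()
            self._parents = parents   # values this one was computed from
            self._backward = backward_fn

        def __mul__(self, other):
            out = Value(self.data * other.data, (self, other))
            def backward_fn():
                # chain rule: d(out)/d(self) = other.data, d(out)/d(other) = self.data
                self.grad += other.data * out.grad
                other.grad += self.data * out.grad
            out._backward = backward_fn
            return out

        def backward(self):
            # topologically order the graph, then propagate gradients backwards
            order, seen = [], set()
            def visit(v):
                if v not in seen:
                    seen.add(v)
                    for p in v._parents:
                        visit(p)
                    order.append(v)
            visit(self)
            self.grad = 1.0
            for v in reversed(order):
                v._backward()

    x, w = Value(2.0), Value(3.0)
    y = x * w
    y.backward()
    print(x.grad, w.grad)  # 3.0 2.0

PyTorch does the same thing at tensor granularity with far more operations and a compiled backend, but the bookkeeping is conceptually this.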
The extension needs to be signed by Mozilla for normal production builds of Firefox to load it on startup. If it isn't signed, you have to load it manually through about:debugging each time you restart Firefox.
Mozilla is not preventing anyone from signing anything here (and the "security checks" on who can sign are so weak they might as well not exist in the first place).
By that logic the same applies to Chrome; it also lets you sideload unverified extensions, at the cost of annoyingly making you set it up again at every startup.
That you're pedantically language-lawyering my post while not engaging with the far greater falsehood the previous poster was perpetuating is not a good look.
And the reality is Mozilla can always block any extension they want. They can just change the Firefox source code. It doesn't matter what functionality does or doesn't exist now or what policies they do or don't have – everything can always be changed. That's true for almost anything.
So what they "could do" is a complete distraction in the first place, because they "could do" anything. What they ARE doing matters.
No, pointing out that your claims are conceptually false is a fine look.
It's not about things Mozilla could theoretically do to block you; it's that they require you to proactively get their permission to run an extension (in a production version of the browser on an ongoing basis, which I think is reasonable table stakes). Here are their official docs for self-distribution, i.e. not using AMO at all: https://extensionworkshop.com/documentation/publish/submitti... Notice that step 1 starts with giving Mozilla your extension to approve, and step 4 goes so far as to say that if your extension doesn't pass their checks then
> The message informs you of what failed. You cannot continue. Address these issues and return to step 1.
then step 7 is making sure Mozilla reviewers can read your source code, step 9 is waiting for them to get back to you, and step 13 is downloading the XPI that Mozilla has approved to run in their browser.
So yes, you absolutely need Mozilla's approval to publish an extension, even if you self-publish the XPI after they've blessed it. If they don't perform the action of signing it, it won't install; they don't need to change any source code. It may be true that in this case they have given that approval, but that doesn't invalidate the general point, and this is a fundamental restriction, not "language-lawyering".
I have to disagree that I'm perpetuating any falsehood here. Mozilla literally needs to approve an addon for it to behave normally. That you are satisfied with their approval process doesn't change that.
To me it seems absurd for a company that claims to be so pro-privacy not to allow any genuinely private extensions to exist. Anyone who wants to make a "real" addon has to share their code with Mozilla.
I actually mostly had the top poster in mind, not you, sorry for the confusion.
What you're saying is technically true, but also not relevant, as explained. They can have the best system in place today, and just change Firefox tomorrow. So it doesn't really matter how the system works now. This is true for anything from Mozilla to XFree86 to Redis to left-pad.
The de facto reality is that right now anyone can create an account, generate a signing key, and distribute their extensions $anywhere. Approval is little more than a rubber stamp; Mozilla is not going around granting "approval" or anything like that.
And they certainly didn't revoke the very weak "approval" here; people can distribute and install it. It's just not listed on the Russian add-on store. So that makes it doubly irrelevant.
I'm not aware of any other openly-licensed model of comparable size to 57B. That seems like a worthwhile addition to what is already available, imo.
The closest is Mixtral 8x7B, but that one only uses a fraction of its parameters for each pass. This one should produce better but slower results at roughly the same memory requirement.
Mixtral 8x7B has 13B activations (2 experts/pass) on 47B weights, so not so different from the Qwen 2 MoE (14B activations on 57B weights). I'd agree that the new model is probably the new strongest option in this "middle-sized" weight class, although Yi 1.5 34B isn't bad (a dense model, so 2.4X slower inference, but also almost half the weights).
One nice thing is that all three of these models are Apache 2.0 licensed.
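To make those comparisons concrete, here's the rough arithmetic behind them, using only the figures quoted above (a back-of-the-envelope sketch; real throughput depends on implementation and hardware):

    # Per-token decode cost scales roughly with the number of *active* parameters,
    # while memory requirement scales with total weights.
    models = {
        "Mixtral 8x7B":  {"total_b": 47, "active_b": 13},  # 2 of 8 experts per pass
        "Qwen2 57B MoE": {"total_b": 57, "active_b": 14},
        "Yi 1.5 34B":    {"total_b": 34, "active_b": 34},  # dense: every weight is active
    }
    base = models["Qwen2 57B MoE"]["active_b"]
    for name, m in models.items():
        ratio = m["active_b"] / base
        print(f"{name}: {m['total_b']}B weights to hold in memory, "
              f"~{m['active_b']}B used per token (~{ratio:.1f}x the MoE's per-token compute)")

That's where the "2.4X slower" figure for Yi 1.5 34B comes from: 34B active vs 14B active, with almost half the weights to keep in memory.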
It is ~260GB with presumably fp16 weights. It should fit into 64GB at 3-bit quantization (~49GB).
Edit: To add to this, I've had good luck getting solid output out of mixtral 8x7b at 3-bit, so that isn't small enough to completely kill the model's quality.
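For anyone checking the math, the estimate is just bits-per-weight scaling (it ignores quantization overhead like per-group scales and the KV cache, so treat it as a rough lower bound):

    # Back-of-the-envelope: checkpoint size scales with bits per weight.
    fp16_gb = 260  # size of the fp16 weights, per the comment above
    for bits in (8, 4, 3):
        print(f"{bits}-bit: ~{fp16_gb * bits / 16:.0f} GB")
    # 8-bit: ~130 GB, 4-bit: ~65 GB, 3-bit: ~49 GB -> the 3-bit figure fits in 64 GB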
Most of the distribution for this is via torrents/magnet links and in person hard drive exchanges. I'd go look at some public trackers if you want a copy and don't know someone that already has it.
Do be aware that it does include copyrighted content so distribution is piracy.
Almost all LLM training datasets include copyrighted content, so almost all open source LLM distribution is piracy and almost all API based LLMs, including ChatGPT, are also piracy and copyright laundering.
Also, most image-text datasets contain far worse than that. You might want to check out LAION-5B and what Stanford researchers have found in there. Technically, anyone who even touched that could in theory be in some serious, serious trouble. I find it quite remarkable that nothing has happened yet.
> almost all open source LLM distribution is piracy and almost all API based LLMs, including ChatGPT, are also piracy and copyright laundering
That's an overextension of copyright: original expression is protected, but not the ideas themselves; those are free. And don't forget that when we actually get to use these models we feed them questions and data, we give corrections - so they are not simply replicating the training set, they learn and do new things with new inputs.
In fact, if you think deeply about it, it is silly to accuse AI of copyright violation. Copying the actual book or article is much, much faster and cheaper, and exact. Why would I pay an LLM provider to generate it for me from the title and a starting phrase? If I already have part of the article, do I still need to generate it with AI? It's silly. LLM regurgitations are basically attacks with a special key, entrapments. They don't happen in normal use.
Models are not information archives. The size of the final model is orders of magnitude smaller than the size of the training data.
Somehow people are just not able to get this through their heads. Stable Diffusion is like 12GB or something, and you have people convinced it's a tool that is cutting and pasting copyrighted works from an enormous image archive.
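Back-of-the-envelope, using the rough numbers already in this thread (the ~12GB figure above and LAION-5B's ~5 billion image-text pairs):

    # Bytes of model weights per training image, using the thread's own rough figures.
    model_bytes = 12e9   # ~12 GB of weights (rough figure from the comment above)
    num_images = 5e9     # LAION-5B: ~5 billion image-text pairs
    print(model_bytes / num_images)  # ~2.4 bytes of weights per image seen in training

A typical training image is tens of kilobytes, so there simply isn't room in the weights to store verbatim copies of the training set.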
The courts (in the US) have not found LLM model weights to be piracy, nor the outputs, but it's really surprising that LAION was used for so long considering the content you allude to.
There exist databases of hashes of problematic photos (CSAM), so it seems trivial to check your billions of photos against them before training an AI. You can't catch everything, but this seems like an obvious miss considering they explicitly tried to scrape pornography.
These hashes are exactly how researchers later discovered this content, so it's clearly not hard.
The Stanford researchers also found a substantial number of CSAM images in the LAION-5B dataset which were not recognized by PhotoDNA, probably because the images in question were not in wide distribution prior to their inclusion in LAION.
You are uploading 5 billion examples of <something>. You cannot filter it manually, of course, because there are five billion of them. Given that it is the year 2024, how hard is it to be positive that a well-resourced team at Stanford in 2029 will not have better methods of identifying and filtering your data, or a better reference dataset to filter it against, than you do presently?
You don’t have to do it manually. There is a database of file hashes.
And this isn't just "one engineer". Companies like StabilityAI, Google, etc. have used LAION datasets. If you build a dataset you should expend some resources on automated filtering. Don't include explicit imagery as an intentional choice if you can't do basic filtering.
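A hedged sketch of what hash-blocklist filtering looks like in principle. Note that real systems (PhotoDNA, PDQ) use perceptual hashes that tolerate resizing and re-encoding and are only available to vetted organizations; the exact SHA-256 matching below is just a stand-in to show the shape of the pipeline, and every name here is made up.

    import hashlib

    # Illustrative only: real filtering uses perceptual hashes (e.g. PhotoDNA) matched
    # against databases maintained by clearinghouses, not plain SHA-256 of file bytes.
    def sha256_of_file(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def filter_dataset(image_paths, blocklist_hashes):
        """Yield only paths whose hash is not in the blocklist."""
        for path in image_paths:
            if sha256_of_file(path) not in blocklist_hashes:
                yield path

    # blocklist_hashes would come from a hash database you have access to (hypothetical):
    # kept = list(filter_dataset(candidate_paths, blocklist_hashes))

The limitation the other comment points out still applies: hash lookups only catch images already known to the database, which is why some of the LAION material wasn't flagged.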
It'll be some epic lawsuit like google-v-samsung that will get drawn out for a decade, awarded, reduced, appealed, etc., where the only winners will be both parties' lawyers.
- OpenAI and others will just settle with the MPAA, RIAA, and the like for a revenue stream (single-digit billions a year, likely) + some kind of control over what people can and cannot do with the AI + access to the technology to produce their own content.
- artists will see peanuts from the deal, and the big names are going to be able to stop doing any kind of business with artists, who are just expenses in their eyes. They will have been replaced by machines that were trained on their art with no compensation whatsoever.
IP is already predatory capitalism; AI will definitely be weaponized against the workers by the owners of the means of "production".
Also good to note that the Pile contains lots of curated sources, and the recent trend has been to take curated data sources and combine them with filtered webcrawls (i.e. Common Crawl with heavy processing). See Dolma or The Stack v2 (for code models), as others have mentioned.
I don't see it mentioned in the post: can any of these problems exist for models in the safetensors format? I can't say I know enough about model serialization to understand exactly how much safer it is.
Everything you type in the Start menu gets sent to Microsoft by default, and this can only be changed with Group Policy, so anyone on the Home edition is stuck with that.
Thanks for giving me an answer; all the other posts in this chain just seem to be people naming Microsoft products (I get it, people love bashing Microsoft...)