Call of Duty: Warzone Caldera Data Set for Academic Use (activision.com)
178 points by noch 9 months ago | hide | past | favorite | 95 comments



The link to the data is in a GitHub repo at the bottom: https://github.com/Activision/caldera

Reading the article, they don't seem to know what people should do with it. It feels like a recruiting tool more than anything, especially given the non-commercial license.


It's useful to have AAA tier sample game level assets available for engine development or for apps like Blender.


As an avid CoD player, I literally have no idea why this would be useful. Map data isn’t really interesting.

The player data seems far too low-resolution to be meaningful.


These sorts of data sets can be useful for graphics research, particularly as a data set to test ray tracing algorithms on.

See for example, the Moana Island data set. [1]

I definitely foresee papers on BVH construction using this scene.

For graphics research in academia, there's a dearth of real-world data sets like this, so the ones that do get released are gold. And for graphics research in industry, one may have access to good internal data sets for development and testing, but getting permission to publish anything with them tends to be a giant hassle. It's often easier to just use publicly available data sets. Plus, that makes it easier to compare results across papers.

[1] https://www.disneyanimation.com/resources/moana-island-scene...


The Moana island has complete material data though. This release seems to be only geometry. No materials or textures at all.


Yep. That's still fine for building BVHs and shooting some rays around.
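To make "shooting some rays around" concrete: the workhorse primitive inside any BVH traversal is a ray/AABB intersection test. Here's a minimal sketch of the slab method; the function and variable names are illustrative, not from any Caldera tooling.

```python
# Ray/AABB slab test -- the core intersection kernel a BVH traversal runs
# against every node's bounding box. A geometry-only scene like Caldera is
# enough to exercise this, since no materials are needed.

def ray_aabb(origin, inv_dir, box_min, box_max):
    """Return True if a ray (origin, 1/direction per axis) hits the box."""
    t_near, t_far = 0.0, float("inf")
    for o, inv, lo, hi in zip(origin, inv_dir, box_min, box_max):
        t0, t1 = (lo - o) * inv, (hi - o) * inv
        if t0 > t1:
            t0, t1 = t1, t0  # order the slab entry/exit distances
        t_near, t_far = max(t_near, t0), min(t_far, t1)
    return t_near <= t_far

origin = (0.0, 0.0, 0.0)
direction = (1.0, 1.0, 1.0)
inv_dir = tuple(1.0 / d for d in direction)  # precompute reciprocals
print(ray_aabb(origin, inv_dir, (2.0, 2.0, 2.0), (3.0, 3.0, 3.0)))   # hit
print(ray_aabb(origin, inv_dir, (2.0, -3.0, 2.0), (3.0, -2.0, 3.0)))  # miss
```

Benchmarking papers mostly vary how the tree above this test is built (SAH splits, spatial splits, etc.), which is exactly where a large real-world scene earns its keep.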


Thank you for explaining that. Very helpful.


Since they provide player movement data, you can train a transformer to predict which player will win the BR given movement patterns. Or maybe create "player embeddings" to see if player behaviors can be clustered. That could be a fun project...but definitely not useful.

Extracting and converting the player data from the .usd files would not be fun, though.
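Once the positions are extracted, though, the "player embeddings" idea is mostly feature engineering. A hedged sketch of what per-player movement features might look like; the `(t, x, y)` sample format and the stat names are made up, since the real telemetry layout would have to be pulled out of the .usd files first.

```python
# Hypothetical sketch: turn raw position samples into per-player movement
# features that could feed clustering or an embedding model. The sample
# format (t, x, y) is an assumption, not the actual Caldera schema.
import math

def movement_features(track):
    """track: list of (t, x, y) samples for one player, sorted by time."""
    dist = 0.0
    top_speed = 0.0
    for (t0, x0, y0), (t1, x1, y1) in zip(track, track[1:]):
        step = math.hypot(x1 - x0, y1 - y0)
        dist += step
        dt = t1 - t0
        if dt > 0:
            top_speed = max(top_speed, step / dt)
    net = math.hypot(track[-1][1] - track[0][1], track[-1][2] - track[0][2])
    duration = track[-1][0] - track[0][0]
    return {
        "total_distance": dist,
        "net_displacement": net,
        "wander_ratio": dist / net if net > 0 else 0.0,  # >1 = doubling back
        "top_speed": top_speed,
        "avg_speed": dist / duration if duration > 0 else 0.0,
    }

# Two caricatures: a camper who never moves, and a runner crossing the map.
camper = [(t, 100.0, 200.0) for t in range(0, 60, 5)]
runner = [(t, float(t) * 4.0, 0.0) for t in range(0, 60, 5)]
print(movement_features(camper))
print(movement_features(runner))
```

Feature vectors like these are what you'd cluster or feed as tokens to a sequence model to try the win-prediction idea.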


> Since they provide player movement data, you can train a transformer to predict which player will win the BR given movement patterns.

You didn't consider the main factor for CoD - cheating. Which clearly seems to be an inside thing.

Not sure if anything meaningful can be obtained by analyzing anything that has player data on it considering every video game out there is prone to this.


Why would having player movement data help cheating?

Why is the cheating clearly an insider thing?

Why aren't you sure if anything meaningful can be derived from the movement data?

What do you mean by "prone to this"?

Are you sure they didn't consider "cheating" as a possible use of the movement data?

Could they have considered it but thrown it away as off-topic and implausible?


They are implying player teleporting, which is a common hack in BRs.

Player movement data that is too fast for normal players could be seen as cheating. An AI isn't strictly needed for that, just check displacement over time.
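The "displacement over time" check is a few lines. A sketch, with a made-up speed cap and sample format (the real threshold would depend on sprint speeds, vehicles, and the telemetry's units):

```python
# Flag intervals where a player's implied speed exceeds a cap.
# MAX_SPEED and the (t, x, y) sample format are hypothetical values,
# not numbers from the game.
import math

MAX_SPEED = 12.0  # hypothetical on-foot cap, distance units per second

def teleport_intervals(track, max_speed=MAX_SPEED):
    """track: list of (t, x, y); return (t0, t1) pairs exceeding the cap."""
    flagged = []
    for (t0, x0, y0), (t1, x1, y1) in zip(track, track[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate/out-of-order samples
        speed = math.hypot(x1 - x0, y1 - y0) / dt
        if speed > max_speed:
            flagged.append((t0, t1))
    return flagged

track = [(0, 0.0, 0.0), (1, 8.0, 0.0), (2, 500.0, 0.0), (3, 505.0, 0.0)]
print(teleport_intervals(track))  # only the 8 -> 500 jump is flagged
```

As the comment says, no model needed: it's a threshold on a derivative.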


Is it really a common hack? I would have guessed teleportation is the easiest to detect server-side, or impossible from the start if the server is authoritative (clients send inputs, the server computes positions and any important changes and sends them back to clients; clients cannot hack their movement).


I’ve never seen it in CoD. Last time I saw this was in like 2010 when MWII was hacked to death.


Given all the other variables that introduce a bunch of noise to the player movement data, I doubt you could ever determine any useful predictive pattern.

If anything though, I could see how the behavior of match winners could be used to identify varying levels of cheaters, and players that use various methods for gaining an advantage (e.g., keyboard and mouse, joystick extensions, etc.), and automatically sequester or even handicap their accounts.

It appears to me that so much effort is placed on trying to identify and hamper cheaters in real time, when that both seems extremely resource intensive and unnecessary, considering you have all the digital evidence proof of cheating you need after the fact, you just have to understand what you are looking at.


> so much effort is placed on trying to identify and hamper cheaters in real time, when that both seems extremely resource intensive and unnecessary, considering you have all the digital evidence proof of cheating you need after the fact, you just have to understand what you are looking at

It's not resource intensive at all compared to the alternative of having humans do post-match reviews. It's all "AI" and automated reviews because it's cheaper. Half of the "anti-cheat" tactic is to use your computer's resources to run some anti-cheat tool anyway.

These games are optimized for revenue, so every action is dictated by that, including catching/banning cheaters. If it costs too much to do it properly, or if (and this is actually plausible) cheaters are a significant enough portion of the already small chunk of players who create recurring revenue, then there's no incentive to take real action.

This data is probably useful for actual academic rather than practical purposes today. They're building the knowledge they might want to use in a few years.


> considering you have all the digital evidence proof of cheating you need after the fact,

It's actually getting increasingly hard to tell. Old cheating used to be snap-to-the-head type of cheating.

The newer cheats work really hard to resemble natural players. Soft aim, intentionally missed shots, non-perfect recoil control.


>Given all the other variables that introduce a bunch of noise to the player movement data, I doubt you could ever determine any useful predictive pattern.

Predicting a winner will be difficult but I would not be surprised if you could loosely predict rank (does Warzone track player rank?) off of movement alone. You may be able to predict more accurately by looking at the associations between two players and their movement. From my prior experience in FPS games, positioning, awareness, and aim are the core pillars of success. Unfortunately as far as I can tell from the data set, only player position is tracked.


It sounds like this is simply not for you, then, and that's fine.


From an information theory perspective, it should be possible to define strategically important locations in terms of Empowerment [1]. As a map designer there are likely some rough rules you want to abide by, such as distributing highly empowered locations roughly evenly throughout the map to reduce location bias.

I remember an old game that defined rules for its FPS CTF maps: there should be more than one path to each flag (usually three), and flag areas should be partially visible from one base to another. There were lots of rules like these, some more flexible than others.

[1] https://arxiv.org/abs/1310.1863
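For deterministic dynamics there's a cheap proxy for n-step empowerment: log2 of the number of distinct states reachable in n moves. A toy grid-world sketch (the grid, move set, and step count are all illustrative; a map designer would run something like this over navmesh cells instead):

```python
# Empowerment proxy: log2(#states reachable in n steps). Dead-end pockets
# score low, open corridors score high. The grid below is a made-up example.
import math

GRID = [
    "########",
    "#......#",
    "#.####.#",  # (3,1) below is a dead-end pocket
    "#.#....#",
    "########",
]

def reachable(start, steps):
    """Cells reachable from start in at most `steps` moves (incl. staying put)."""
    frontier = {start}
    for _ in range(steps):
        nxt = set()
        for r, c in frontier:
            for dr, dc in ((0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if GRID[nr][nc] == ".":
                    nxt.add((nr, nc))
        frontier = nxt
    return frontier

def empowerment(cell, steps=3):
    return math.log2(len(reachable(cell, steps)))

# Open corridor cell vs. the dead-end pocket:
print(empowerment((1, 3), 2), empowerment((3, 1), 2))
```

Comparing these scores across candidate spawn or objective locations is one way to operationalize the "roughly equal highly empowered locations" rule.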


Not really, they released it primarily for artist training and tutorials. Getting hold of an XXL map of this quality from a super popular game is definitely something most game design training courses will use.


Is there anything particularly novel about this vs the game map of a different FPS?


Other FPS game maps don't have licenses that let you use them to stress test your renderer or game engine. Existing freely available scenes are all too small and poorly made to be proper stress tests for modern hardware (e.g. old Sponza is way too light, Intel's Sponza just spammed the subdivision modifier to make it stupidly high-poly, Bistro is small and really weirdly made, etc.).


Not that it makes it novel, but this appears to be a “battle royale” map based on the picture shown in the post. So it is fairly large, for whatever that’s worth.

The assumption with this type of game is that players will play the same big map over and over for the season (changing the map is very rare and might not happen at all over the lifespan of the game), but they can pick which part of the map they start in and explore from there. So it is, I guess, more like having data from all of the maps of a classic first-person shooter.


Specifically, during each play session, the map has a "storm" which converges on a random location over time: staying in the storm is lethal, so you are forced to eventually go to that location and points-of-interest along the way, which adds play variance.


Make killer robots on the island of caldera


Enemy AI has been pretty stagnant for the last couple decades, right? Can somebody use this to make less inhuman bots?


> less inhuman bots

In games, players don't want AI that is 100% strong (it's not fun); what we want is AI that makes mistakes like humans do.

So it's possible (assuming the data set is good); in fact it's already been done in chess[1].

> Maia’s goal is to play the human move — not necessarily the best move. As a result, Maia has a more human-like style than previous engines, matching moves played by human players in online games over 50% of the time.

Also something that I just realized: in this particular case, we want the AI to be biased like humans, which is easier to do since the bias is already in the data set. AI safety is the exact opposite, which is harder if not impossible.

[1] https://maiachess.com/


We don’t even want that. Human-like AI in a game like this would be really annoying for the vast majority of players, who absolutely suck. People who want to play against AI generally want something that’s really dumb.


Playing against an AI that's really dumb gets boring quickly. Playing against an AI that's way too good gets annoying quickly.

I want an AI that can play like a human at my level would, such that the game is competitive and fun.


You probably don’t though. It’s actually really unfun to lose 50% of matches against an AI, or worse, because it doesn’t get tired or tilted or distracted.

It’s much more fun to go against an AI that is dumber than you but generally more powerful.


Different kinds of AI are likely fun for different players. Games have difficulty levels partly because not everyone wants the same level of difficulty relative to their own skill level. Some may want something easily beatable for them, some may want something difficult for them to beat.


It's unfun if the AI feels like it's cheating.

In Counter Strike the AI can be super dumb and slow until progressively it becomes a dumb aimbot. It doesn't become a better player with game sense and tactics, just raw aim (try Arms Race if you wanna feel it yourself).

In Fear the AI is pretty advanced in a more organic way. It coordinates several enemies to locate you and engage with you in a way that sometimes feels totally human. That feels great and when you lose you try again thinking something like "I should have positioned myself better" instead of "I guess my aim is just not fast enough".

We just don't get enough of good AIs to know how good they can feel.


I think what people really want is a “human”-like AI that is worse than them, maybe quite a bit worse. But maybe not exactly what this type of dataset can offer. Which is to say:

I want to play against an AI that is dumb enough for me to beat it, but is dumb in human-ish ways. Depending on the genre of the game, I want it to make the sort of mistakes that actual people make in wars, but more often. Or I might want it to make the types of mistakes that the baddies make in action movies.

Human players in videogames might provide a little bit of a signal, but they do engage in a lot of game-y and not “realistic” movements, so point taken there. Some people are less familiar with games I think, so they tend to make less gamey movements. Anyone who’s been playing games since the 90’s will bounce around and circle strafe, so they need to be filtered out somehow, haha. Maybe they can provide training data for some sort of advanced “robot” enemy.

But the existing AI characters also make some pretty non-human mistakes. Like often it seems that AI difficulty slider is just like: I’m going to stand stupidly in the middle of the road either way, but on easy I’ll spray bullets randomly around you, and in very hard I’ll zap you with lightning reflex headshots.

Moves like examining a tree of possible movements, flanking, cover, better coordination, that sort of stuff would be more interesting. Maybe the player data-set can provide some of that? I’m actually not sure.

As far as I can think, the early Halo games and the FEAR series had the best AI. It’s been a while. Time to advance.


Yeah but the AI in the games you like was good because it was fun, not because they were circle strafing or doing movement mechanics that are explicitly effective but not realistic. Doing these things would make those AI much less fun as well as immersion breaking.


I haven't paid attention to player-simulating bots in online shooters in forever, what are the current issues? I'd argue that players are a bigger issue, e.g. cheaters, big skill gaps, or the fact they keep shooting at me.

There's singleplayer games, but those are intended to be gameplay challenges, not simulate other players. And in that area, I haven't heard any "this is really good" since F.E.A.R. which is nearly 20 years old now.


The original F.E.A.R. also still remains the game that has most impressed me with the enemy behavior patterns.


The problem regarding enemy AI is not lack of capability, it's that the market just doesn't care much.

At the end of the day, when playing against the computer, you want to be a badass shooting baddies. The vast majority of the playerbase doesn't care enough about the baddies being able to bait and flank you; they just want to shoot the baddies, and those who want a challenge find it in multiplayer anyway.


There is a mod for Single Player Tarkov, SAIN, that has, in my opinion, better AI than FEAR. It has some quirks sometimes; it's a mod for a game with severe issues, at the end of the day. But the bot behaviors are really human-like while making human mistakes. The difficulty can be tweaked too, giving them worse or slower eyesight, better or faster aim, and changing the chance they have of making specific decisions.

The game is a pain and it needs lots of mods to get good. Even then it's still janky, but that mod is so worth it if you miss something like FEAR.


I wonder what that would look like for the twitch shooter genre?


Very cool. Does anyone remember how Bungie would release heatmaps and other data for each Halo match?


Absolutely. I used to love all the data you'd get after an ODST match. I went digging for the online stats a while back and came up empty-handed... sad to see it all seems to be gone now.


Great, now I can finally find the top 5 hiding places for these friggin campers!


A lot of the comments here are very cynical, perhaps because they’re focused on the license or the use for gaming.

However, as someone in the graphics community: these kinds of assets are great for researchers and demo purposes. Other scenes like this are the Disney Moana Island, Intel's Moore Lane house, Sponza, various NVIDIA scenes, Amazon's Bistro, and Animal Logic's ALab2 scene. Khronos also maintains a set of glTF test assets for the same purpose.

When we develop content creation applications, they’re great for benchmarking ingest and making sure we have good feature coverage.

They’re great for graphics researchers to have shared bases for data processing, rendering and other important R&D.

The non-commercial aspect just means you can’t use them for commercial marketing, but they’re hugely beneficial for any kind of graphics research.

Having real production quality data is a huge undertaking for researchers to do in addition to their own novel work.

Thus far, many sample assets have been simple standalone assets, film-quality production assets, or archviz. Activision releasing something from a AAA game is a huge boon for people targeting that market.

I’ll also call attention to Natalya being involved. She recently joined Activision as CTO, but she was a very influential graphics engineer with a long and storied career before that. She has long helped run the excellent Advances in Real-time Rendering courses at SIGGRAPH (https://advances.realtimerendering.com/) and I believe this release comes from the same intention of mutually advancing shared knowledge.


Are there similar data sets for video & audio assets?


Yet again the word "open source" is being used in a way that doesn't make any sense. We're going to wind up with a weird situation where "open source" means "free and open source" for software specifically but just means "available free-of-charge" for data and ML model weights. Which is strange. The word "free" is right there. This is not "source code", and it certainly isn't "open source" even if it was.

I know this is a tangent, but unfortunately it bears repeating.


Right or wrong, licensing source code separately from data isn't a new thing. I can think of some very famous video games that have released their source code under a Free Software license but kept the game data proprietary.

According to the FSF there is a separation between data and code [1] (search for "data" on that page). They specifically say that data inputted to or outputted by a program isn't affected by the program's license, which indicates a separation from their perspective.

[1] https://www.gnu.org/licenses/gpl-faq.en.html


For any given data, it can be used as code, and vice versa. But for any given program, it should be very clear what's code and what's data.

If I send a Python file over SSH, it should most definitely be data for all software involved. And I for sure should be able to send a Python file via OpenSSH, no matter what either is licensed as.


> According to the FSF there is a separation between data and code

Which of course is a complete denial of reality. Code is data, and data is code. That duality is the crucial reason why general-purpose computers are so powerful. The only ones to profit from trying to make a distinction, as usual, are the lawyers and the corporations behind them who seek to restrict instead of empower.

Especially in this era when decompilers are close to "perfect" (and can sometimes even be better than reading the original source code!), and with the rise of AI, IMHO the whole idea of "source code" being somehow more special than the executable binary is quickly losing relevance.


> decompilers are close to "perfect" (and can sometimes even be better than reading the original source code!)

Citation needed.


I have definitely read some teammates’ code that felt like it would be more readable doing a compiler-decompiler round-trip. Never actually did it, but I doubt it would be less readable than that seemingly intentionally obfuscated garbage.


Can't wait for the JetBrains "deabstract" plugin that compiles it, decompiles it, and reconstructs an indirection-free AST, then cleaner code from that AST via AI. De-Tech-Bro-My-Code. Pull the plug on all-the-patterns-in-one-project devs and get cleaner code today.

Refactor> ThrowIt> IntoTheBin


Personal experience.

There is a lot of decompiler research which isn't public.

A sibling comment mentions Hex-Rays and Ghidra. Those are only now slowly approaching the capabilities of what I've used.

The fact that the majority of code tends to not be intentionally obfuscated and is compiler-generated and thus easily pattern-matched also makes it quite straightforward. Of course the fact that decompilers are often used on code that is (e.g. malware, DRM) skews a lot of people's perceptions.


Just to be completely clear, the conditions I have been using Ghidra/Hex-Rays/BN with were not that bad. I wasn't analyzing malware or heavily-DRM'd software. Even with symbols and full debug info, many of those gripes still apply. (Hex-Rays is able to do a lot more with debug info. It can usually get a lot of the vtable indirections typed correctly, including with a bit of effort, multi-inheritance offset this pointers.)

I'd love to see this non-public decompiler research but I have some skepticism, as a lot of the information that is lost would require domain-specific reconstruction to get back to anywhere near full fidelity. I do not deny that you have seen impressive results that I have not, but I really do wonder if the results are as generalizable as you're making it sound. That sounds like quite a breakthrough that I don't think Ghidra or IDA are slowly approaching.

But since it's non-public, I suppose I'll just have to take you at your word. I'll be looking forward to it some day.


> Especially in this era when decompilers are close to "perfect" (and can sometimes even be better than reading the original source code!)

As someone who is knee-deep in a few hobby reverse engineering projects, I certainly wish this was the case :)

Hex-Rays and Ghidra both do a very commendable job, but when it comes to compiled languages, it is almost never better than reading the original source code. Even the easier parts of reversing C++ binaries still aren't fully automated; nothing that I'm aware of is going to automatically pull your vtables and start inferring class hierarchies.

Variable names are lost in executable code. When it comes to naming variables, most of the tools support working backwards from "known" API calls to infer decent function names, but only Binary Ninja offers a novel approach to providing variable names. They have an LLM service called Sidekick which offers suggestions to improve the analysis, including naming variables. Of course, it isn't very impressive if you were to just drop into a random function in a random binary where you have no annotations and no debug information.

Most of the "framework" stuff that compiles down, by some form of metaprogramming, is nearly nonsense and requires you to know the inner workings of the frameworks that you're touching. In my case I spend a lot of time on Win32 binaries, so the tricky things I see often are a result of libraries like MFC/ATL/WTL/etc. And I'll grant you that in some cases the original source code wouldn't exactly be the most scrutable thing in the world, but I'd still really rather have the MFC message handler mapping in its original form :) COM becomes a complete mess as it's all vtable-indirected and there's just no good way for a decompiler to know which vtable(s) or (to some degree) the function signatures of the vtable slots, so you have to determine this by hand.

Vectorized code is also a nightmare. Even if the code was originally written using intrinsics, you are probably better off sticking to the graph view in the disassembly. Hex-Rays did improve this somewhat but last I checked it still struggled to actually get all the way through.

The truth is that the main benefit of the decompiler view in IDA/Ghidra/etc. is actually the control flow reconstruction. The control flow reconstruction makes it vastly easier to read than even the best graph view implementation, for me. And this, too, is not perfect. Switch statements that compile down to jump tables tend to be reconstructed correctly, but many switch statements decompile down to a binary tree of conditionals; this is the case a lot of the time for Win32 WndProc functions, presumably because the WM_* values are almost always too sparse to be efficient for a jump table. So I'd much rather have the original source code, even for that.

Of course it depends a bit on the target. C code on ELF platforms probably yields better results if I had to guess, due to the global offset table and lack of indirection in C code. Objective C is probably even better. And I know for a fact that Java and C# "decompiling" is basically full fidelity, since the bytecode is just a lot less far away from the source code. But in practice, I would say we're a number of major breakthroughs away from this statement in general not being a massive hyperbole.

(I'm not complaining either. Hex-Rays/Ghidra/BN/etc. are all amazing tools that I'm happy to have at my disposal. It's just... man. I wish. I really wish.)


The repo contains some source code, so therefore it's open source


These files are source assets, which is as close to source code as you can get with non-code stuff. For regular people who didn't drink the OSI koolaid, this is a perfectly valid and logical use of the term "open source". I don't know if that's the angle you're coming from, or if you just didn't know what usd was, but either way this is a good release.


The phrase "open source" is itself open source and is freely available for use, modification and redistribution.



Open Source, with the capitals, however, is not, and is a trademark of the Open Source Initiative (OSI).

https://opensource.org/trademark-guidelines


No, it's not. From the page you linked to:

> OSI, Open Source Initiative, and OSI logo (“OSI Logo”), either separately or in combination, are hereinafter referred to as “OSI Trademarks” and are trademarks of the Open Source Initiative.


> In all cases, use is permitted only provided that:

> the use of the term “Open Source” is used solely in reference to software distributed under OSI Approved Licenses.


The map data is provided in the USD format, which is a 3D authoring and interchange format that can be used with a lot of software. Unlike the final optimized data used by the game, this doesn't require reverse engineering and can be seen as source data that is in fact useful for graphics researchers and game developers.


I’m confused as to why the convention isn’t to consider ML weights data-sets instead of any type of code (closed or open).


Model weights are functions.

In the same way that

`lambda x: x > 0.25` is a function.
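To spell that point out with a toy example (the numbers and names here are invented, just to show the shape of the argument): a released weight file plus fixed inference code is exactly a function.

```python
# Illustrative only: a "model" is fixed inference code parameterized by
# released numbers. Shipping the weights ships the function, not the
# process (data + training loop) that produced them.

def make_classifier(weights, bias):
    def f(x):
        score = sum(w * xi for w, xi in zip(weights, x)) + bias
        return score > 0.25  # same shape as the lambda above
    return f

# These four numbers play the role of published model weights.
model = make_classifier(weights=[0.5, -0.2], bias=0.1)
print(model([1.0, 0.5]))  # 0.5*1.0 - 0.2*0.5 + 0.1 = 0.5 > 0.25 -> True
```

Whether that function counts as "source" is the debate in the rest of this thread.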


The article claims it’s open source (which it clearly isn’t, especially since they say things like “open source for non-commercial use”, which is a bit of a contradiction), but the GitHub repo makes no such claim, only stating that the OpenUSD format is open source.


Feels like a covert way to destroy the term “open source” by making it meaningless over time.


Yes it's all one big conspiracy.


No, it's a Schelling point but evil


I find it rather odd that after all the years of exposed and revealed conspiracies too numerous and pervasive to even necessitate listing any of them, people like you just reject the notion that any additional, unknown conspiracies may exist.

It is an odd phenomenon among humans that I at least don’t quite understand, the seeming tendency to ignore or dismiss possibilities of proven negative outcomes … for whatever reason. “I know all those other conspiracies I dismissed all turned out to be true, but I am sure I would know if there were any additional conspiracies” … totally ignoring one’s track record.

It appears to be the same kind of mentality of “hey, you know who we should trust with our lives … the government made up of people who lie to us, steal from us, and mass murder on a regular basis; that’s who we should give control over to.”

People conspire, I’ve witnessed it personally numerous times; sometimes for greedy business reasons, at other times to mass murder and commit genocide on a scale not seen since. Humans conspire, even if sometimes only because they’re not prevented from doing so naturally.


It’s probably convenient for them to dismiss this, as it aligns with whatever goals they have.


ML weights are code


I'll accept that if you would like. However, they are not source code if so. They are object code. And open source is about source code, not object code. (And this particular press release isn't about ML weights anyways, at least unless I'm grossly misunderstanding; it is just a dataset. So even failing this, it still doesn't really make any sense.)


No, it is not object code, unless you want to get so stupidly pedantic that you'd argue a Python script in a zip file can't be considered open source because it's compressed.

The model pickles unpack back to their original form. The pickled binary forms are merely for convenience.


Look, please go do research as to what "object code" and "source code" are before saying my argument is "stupidly pedantic". I'm not elaborating because the example you gave has nothing to do with what I said.


Your analogy does not make sense. ML weights are distributed in binary form, like object code, but they are nothing like a compiled binary. They are just temporarily in binary form for convenience, and they unpack directly into their original form.

This is not a technicality like “technically you can reverse engineer or modify binary code”. The binary form of model weights is just a fancy zip file format, useful because the weights are so large that text is impractical.


Source code is human readable. Object code is not, and produced from some mechanical process.

Model weights are not written by hand. You don't manually tweak individual weights. You have to run a training process that has multiple "raw" inputs. Trying to read model weights directly is no better than trying to read object code directly. Heck, reading object code directly is probably easier, because at least it's just machine code at the bottom; I will never be able to comprehend what's going on in an ML model just by reading the weights.

The closest thing to "source code" in ML models would be the inputs to the training process, because that's the "source" of the model weights that pops out the other end. If the analogy doesn't make sense, that's because ML models are probably not code in the same sense that source code and object code are.

(It may be tempting to look at "ML weights" as source code because of the existence of "closed-weight" API services. Please consider the following: If Amazon offers me a unique database service that I can only use with Amazon Web Services, and then releases a closed-source binary that you can run locally, that is still closed-source, because you don't have the source code.)


“Human readable” is not a requirement. Visual programming code breaks down to some obtuse data structure. But with the right tools, it’s easy for humans to interact with it. Visual programming node workflows can be open sourced. ML models are the same. Tooling is required to interact with it. The limits of your human understanding do not determine if something is open source. Otherwise a really complicated traditional program might be argued as not open source. You can individually explore specific vectors and layers of a model and their significance.

Being produced by a non-mechanical process is not a requirement either. I can generate a hello world script with code, and open source the hello world script. It does not matter how it was formed. I do not need to open source the hello world generator either.

Data and training code is not source code of the model. That is the source code of a model maker. That’s `make_hello_world.py` not `hello_world.py`

The closed source database is not a correct analogy. Excluding unreasonably difficult efforts to decompile the binary, you CANNOT modify the program without expecting it to break. With an ML model, the weights are the PREFERRED method of modifying the program. You do NOT want the original data and training code; that would just be a huge expense to get you what you already have. If you want the model to be different, you take the model weights and change them, not recreate them differently from scratch. Which is the same for all traditional code. Open source does not mean I provide you with the design documents and testing feedback that show how the code base got created; it means you get the code base. Recreating the codebase is not something we think about, because it doesn't make sense: we have the code and we have the models.


Human readable is a requirement. The existence of things that don't fit into this paradigm doesn't invalidate it entirely, it just proves that it is imperfect. However, it being imperfect does not mean that 1 + 1 != 2. Semantics debates don't grant you the power to just invalidate the entire purpose of words.

What you are proving repeatedly is that model weights are not code, not that they are "source" code.

- The existence (barely, btw) of visual programming does not prove that model weights are code. It proves that there are forms of code other than source code that are useful to humans. There are not really forms of model weights that are directly useful to humans. I can't open any set of model weights in some software and get a useful visualization of what's going on. It's not source code. (Any visual programming language can output some useful human readable equivalent if it wants to. For some of them, the actual on-disk format is in fact human-readable source code.)

(A key point here: if you write assembly code, it's source code. If you assemble it, it's object code. This already stresses the paradigm a bit, because disassembly is reversible... but it's only reversible to some degree. You lose macros, labels, and other details that may not be possible to recover. Even if it was almost entirely reversible though, that doesn't mean that object code is source code. It just means that you can convert the object code into meaningful source code, which is not normally the case, but sometimes it is.)

- The existence of fine-tuning doesn't have anything to do with source code versus object code. Bytecode is easy to modify. Minecraft is closed source but the modding community has absolutely no trouble modifying it to do literally anything without almost any reverse engineering effort. This is a reflection of how much information is lost during the compilation process, which is a lot more for most AOT-compiled languages (where you lose almost all symbols, relocations, variable and class names, etc.) than it is for some other languages (and it's not even split on that paradigm, either; completely AOT languages can still lose less information depending on a lot of factors.) The mechanical process of producing model weights loses some information too; in some models, you can even produce models that are less suitable for fine-tuning (by pruning them and removing meta information that is useful for training). A closer analogy here would be closed source with or without symbols.
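As a toy illustration of that lossy, mechanical step, here's a hypothetical magnitude-pruning pass (real pruning operates on tensors and may also strip optimizer state and other training metadata; plain Python lists stand in for tensors here):

```python
def magnitude_prune(weights, threshold=0.25):
    """Zero out small-magnitude weights. The original values are gone:
    you cannot recover them from the pruned model, just as you cannot
    recover symbol names from a stripped binary."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]

layer = [0.9, -0.1, 0.3, 0.05, -0.6]
pruned = magnitude_prune(layer)
print(pruned)  # [0.9, 0.0, 0.3, 0.0, -0.6]
```

The point being compared: both compilation and pruning are one-way transformations that discard information useful for later modification.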


> Human readable is a requirement. The existence of things that don't fit into this paradigm doesn't invalidate it entirely, it just proves that it is imperfect. However, it being imperfect does not mean that 1 + 1 = 2. Semantics debates don't grant you the power to just invalidate the entire purpose of words.

well first of all, 1+1 does actually equal 2

Secondly, contradictions to your supposed hard rules absolutely mean you don’t have hard rules. If you want to play the semantic game of saying words can mean whatever you want them to mean, then sure. But that’s pointless, and you’re just saying you want to be stubborn.

> I can't open any set of model weights in some software and get a useful visualization of what's going on. It's not source code.

Yes you can. Do you actually have any experience with what you’re talking about? This is a huge red flag that you do not.

Your Minecraft example is a straw man. I did not claim that the existence of fine-tuning means models are source code. I claimed that because fine-tuning is the preferred form of modifying models, weights meet the definitional requirement for being called open source.

Minecraft can be modified, but it is not the preferred form to do so, so it is not open source.

You are still failing to address helloworldmaker vs hello world. Helloworldmaker is explicitly not the source code of hello world. The model maker is not the source code of the model.

Appealing to your own lack of capabilities to understand something doesn’t make it not source code.


> well first of all, 1+1 does actually equal 2

Sigh. That's a typo. I almost feel like it's not important to fix it considering that it's pretty obvious what I meant, but alas.

> Secondly, contradictions to your supposed hard rules absolutely means you don’t have hard rules. If you want to play the semantic game of saying words can mean whatever you want them to mean then sure. But then that’s pointless and you’re just saying you just want to be stubborn.

The "semantics game" I'm using is the long-understood definition of the term 'source code'.

American Heritage® Dictionary of the English Language, 5th Edition:

> source code, noun

> 1. Code written by a programmer in a high-level language and readable by people but not computers. Source code must be converted to object code or machine language before a computer can read or execute the program.

> 2. Human-readable instructions in a programming language, to be transformed into machine instructions by a compiler, assembler or other translator, or to be carried out directly by an interpreter.

> 3. Program instructions written as an ASCII text file; must be translated by a compiler or interpreter or assembler into the object code for a particular computer before execution.

Oxford Languages via Google:

> source code /ˈsôrs ˌkōd/

> noun: source code; plural noun: source codes; noun: sourcecode; plural noun: sourcecodes

> a text listing of commands to be compiled or assembled into an executable computer program.

Merriam-Webster:

> source code, noun

> : a computer program in its original programming language (such as FORTRAN or C) before translation into object code usually by a compiler

Wikipedia:

> In computing, source code, or simply code or source, is a plain text computer program written in a programming language. A programmer writes the human readable source code to control the behavior of a computer.

So every source pretty much agrees. Merriam-Webster falls short of actually specifying that it must be "human readable", but all of them specify in enough detail that you can say with certainty that ML model weights simply don't come anywhere near the definition of source code. It's just not even close.

> Yes you can. Do you actually have any experience with what you’re talking about? This is a huge red flag that you do not.

I'm trying to be patient but having to explain things in such verbosity that you actually understand what I'm trying to say is so tiring that it should be a violation of the Hacker News guidelines.

YES, I am aware that tools which can input model weights and visualize them exist. NO, that doesn't mean that what you see is useful the way that a visual programming language is. You can not "see" the logic of model weights. This is the cornerstone of an entire huge problem with AI models in the first place: they're inherently opaque.

(P.S.: I will grant you that escalating my tone here is not productive, but this arguing goes nowhere if you're just going to take the weakest interpretation of everything I say and run with it. I have sincerely not been doing the same for you. I accepted early on that one could argue that model weights could be considered "code" even though I disagree with it, because there's absolutely zero ambiguity as to whether or not it's "source code", and yet here we are, several more comments deep and the point nowhere to be found.)

> Your Minecraft example is a straw man. I did not claim that the existence of fine tuning meant models are source code. I claimed that because fine tuning models is the preferred form of modifying models means that it meets the definitional requirement of being called open source.

First of all, to be called "open source", it first needs to meet the definition of being "source code". That's what the "source" part of "open source" means.

Secondly, to be called "open source", it also needs to meet the definition of being "open". That's the "open" part of open source.

Open-weight models that have actual open source licenses attached to them do meet the criteria for "open", but many models, like Meta's recent releases, do not. They have non-commercial licenses that don't even come close to meeting the requirements.

> Minecraft can be modified, but it is not the preferred form to do so, so it is not open source.

Whether or not source code is the preferred form to modify something is entirely beside the point. I'm not sure where you got this, but it's simply wrong. Please stop spreading blatant misinformation.

> You are still failing to address helloworldmaker vs hello world. Helloworldmaker is explicitly not the source code of hello world. Model maker is not the source code of model.

I'm not addressing it because it's not 100% agreed upon. If you read my above definitions, you will see that in some of them, the results of "Helloworldmaker" will qualify as source code, and in some of them, it wouldn't. Likewise, you can compile any Wasm blob down to C code, and I'd strongly argue that the resulting C code is not human readable source code, it's just in a programming language. This definition, though, has a degree of fallibility to it. Unfortunately, a rigid set of logic can not determine what should be considered source code.

That's OK though, because it actually has nothing to do with whether or not model weights are source code. They don't even come remotely close to anything resembling source code in this entire debate. Model training doesn't produce human-readable source code; it produces model weights, a bunch of data that is, on its own, not even particularly useful, let alone readable.

> Appealing to your own lack of capabilities to understand something doesn’t make it not source code.

With all due respect, I am not concerned about your judgement of my capabilities. (And it has nothing to do with this anyways. This is a pretty weak jab.)


> Whether or not source code is the preferred form to modify something is entirely beside the point. I'm not sure where you got this, but it's simply wrong. Please stop spreading blatant misinformation.

I’m not sure how you could read what I wrote in any way that contradicts that. Minecraft binaries are NOT open source because, unlike model weights, they are NOT the preferred way to modify Minecraft.

> I'm not addressing it because it's not 100% agreed upon. If you read my above definitions, you will see that in some of them, the results of "Helloworldmaker" will qualify as source code,

Helloworldmaker is 100% source code, but it’s not the source code for hello world. To make this even simpler: if I wrote a hello world program by rolling literal dice, SURELY you would not claim that the fully functional program’s source code is the procedure by which I rolled dice to generate the characters of code.

Or if we had an LLM spit out Doom, we would not claim that the Doom source code is the LLM model, or the training code that originally produced the model; the source code is the Doom code itself.

The origin of a program has no bearing on whether the program’s source code counts as source code.

Given that we have established this, you cannot argue that the training program and data, which are not even required to produce a given set of ML weights, are the source code of the ML model. Your only recourse here is to argue that there is no source code for this project, but frankly that seems very dumb. It is a structured file of logic, in a form that is convenient to modify. That’s open source! The only reason we felt the need to gatekeep with “preferred form” was to clarify that binaries being “technically able to be modified” shouldn’t count. But it’s ridiculous to assert that these assets shouldn’t meet the criteria just because they don’t resemble text code. And it’s ridiculous to argue that there is no source code. And it’s ridiculous to argue that the progenitor process that makes a program is the source code of the program.

Getting obsessive over antiquated definitions here is entirely missing the point of why source code and open source is defined the way it is.


> I’m not sure how you could read what I wrote in any way that contradicts that. Minecraft binaries are NOT open source because, unlike model weights, they are NOT the preferred way to modify Minecraft.

Minecraft "binaries" can not be open source because binaries are not source code.

> Helloworldmaker is 100% source code. Buts it’s not the source code for helloworld. To make this even simpler, if I wrote a hello world program by rolling literal dice, SURELY you would pretend that the fully functional program’s source code is the worldly logic by which I rolled dice to generate the characters of code.

What I said is that the results of "helloworldmaker" would not be universally considered source code. This is because whether generated code is source code is already debated. Most accurately, the source code for "helloworld" would be a script that generates it, by calling "helloworldmaker" with some set of parameters, not the result of that generation. That is source code, by every definition past, present and future. (Whether the resulting "helloworld" is also source code is unclear and depends on your definitions.)

> Or if we had an LLM spit out doom, we would not claim that the doom source code is neither the doom code, nor the LLM model but the training code for the model originally.

If you overfit an LLM to copy data in a roundabout way, then you're just having it spit out copies of human code in the first place, which isn't particularly novel. The only real wrench in the cogs re: LLMs is that LLMs are effectively 'equivalent' to humans in this case, as they can generate "novel" code that I agree would qualify as source code.

> The origin of a program has no bearing on whether the program’s source code is considerable to be source code.

I would advise you to check the definition of the word "source" before claiming asinine things like this.

> Given that we have established this, you cannot argue that the training program and data, which are not required to make a random set of ML weights, are the source code of the ML model. Your only recourse here is to argue that there is no source code for this project, but frankly that seems very dumb.

Yes that is correct, ML weights do not have source code, because they are data, not code. This isn't particularly stunning, as computers perform all kinds of computational operations over datasets that don't involve anything called source code. Database data in general is not source code. If you paint something in Photoshop, there is no source code for your painting; you can save it with greater or lesser fidelity, but none of those formats are "source code", they're just different degrees of fidelity to the original files you worked on.

I am not thusly claiming that computer graphics can't involve source code; it can, for example by producing graphics written as SVG code. Rendering this to raster is not producing "object code", though; "object code" would be more like converting the SVG into some compiled form like a PDF. This is a great example of how "source code" and "object code" are not universal terms. They have fairly specific meanings tied to programming that, while not universally 100% agreed upon, have clear bounds on what they are not.

> It is a structured file of logic, in a form that is convenient to modify. That’s open source!

No, it isn't "open source". Open source as it's used today was coined in the late 90s and refers to a specific, well-defined concept. Even if we ignore the OSI, dictionary definitions generally agree. Oxford says that "open source" is an adjective "denoting software for which the original source code is made freely available and may be redistributed and modified." Wikipedia says "Open-source software (OSS) is computer software that is released under a license in which the copyright holder grants users the rights to use, study, change, and distribute the software and its source code to anyone and for any purpose."

Importantly, "open source" refers to computer software and in particular, computer software source code. It also has a myriad of implications about what terms software is distributed under. Even ignoring the OSI definition, "free for non-commercial use" is not a concept that has ever been meaningfully recognized as "open source", especially not by the experts who use this definition.

> The only reason we felt the need to gatekeep “preferred form” was to clarify that binaries being “technically able to be modified” shouldn’t count. But it’s ridiculous to assert that these assets shouldn’t meet the criteria just because it doesn’t resemble text code. And it’s ridiculous to argue that there is no source code. And it’s ridiculous to argue that the progenitor process to make a program is the source code of the program.

Frankly I have no idea what you're on about with how it is ridiculous to argue there is no source code. I mean obviously, the software that does inference and training has "source code", but it is completely unclear to me why it's "ridiculous" that I don't consider ML model weights, which are quite literally just a bunch of numbers that we do statistics on, to be "source code". Again, ML weights don't even come close to any definition of source code that has ever been established.

> Getting obsessive over antiquated definitions here is entirely missing the point of why source code and open source is defined the way it is.

The reasoning for why Open Source is defined the way it is is quite well-documented, but I'm not sure what part of it to point to here, because there is no part of it that has anything in particular to do with this. The open source movement is about software programs.

I am not against an "open weight" movement, but co-opting the term "open source" is stupid, when it has nothing to do with it. The only thing that makes "open source" nice is that it has a good reputation, but it has a good reputation in large part because it has been gatekept to all hell. This was no mistake: in the late 90s when Netscape was being open sourced, a strategic effort was made to gatekeep the definition of open source.

But otherwise, it's unclear how these "free for non-commercial usage" ML weights and especially datasets have anything to do with open source at all.

It's not that the definition of the word "source code" has failed to keep up with the times. It has kept up just fine and still refers to what it always has. There is no need to expand the definition to some literally completely unrelated stuff that you feel bears some analogical resemblance.

(P.S.: The earliest documentation I was able to dig up for the definitions of the words "source code" and "object code" go back to about the 1950s. The Federal Register covers some disputes relating to how copyright law applies to computer code. At the time, it was standard to submit 50 pages of source code when registering a piece of software for copyright: the first 25 pages and last 25 pages. Some companies were hesitant to submit this, so exceptions were made to allow companies to submit the first and last 25 pages of object code instead. The definitions of "source code" and "object code" in these cases remains exactly the same as it is today.)


No, they really aren’t and I’m not sure why I keep seeing this take. ML weights are binary and it’s painfully obvious.

They are the end result of a compilation process in which the training data and model code are compiled into the resulting weights. If you can’t even theoretically recreate the weights on your own hardware it isn’t open source.


ML weights are not binary. They are modifiable.

If I produce a program that outputs a hello world file, I can open source the hello world script without open sourcing the hello world generator.


We can also say binaries are code, but if we’re being pedantic, a binary isn’t the source code that generated it (and I doubt anyone intends to hand-write binaries or manually input billions of weights). I’d reckon that’s why it’s called open source, not open code or open binary: the source code that generates the artifact is what gets distributed. I’d actually just call this for what it is: open weights.


Binary is not the equivalent of models. Source code is the equivalent of models.

It doesn’t matter if a machine generated source code or a human did for it to be open source code.


You keep asserting this but without any reason. Do you have a reason? It seems to go against the general open source idea of source code being convenient for people to modify.


ML weights ARE convenient for people to modify. You can go look at the dozens of modifications of diffusion models being produced daily on Civitai. It’s very easy.
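One common weight-level modification looks roughly like this hypothetical sketch, assuming a checkpoint loads as a dict of named arrays (plain lists stand in for tensors). This linear-blend "merge" is in the spirit of what many community model merges do, with no retraining involved:

```python
def merge_checkpoints(a, b, alpha=0.5):
    """Blend two checkpoints layer by layer: alpha * a + (1 - alpha) * b."""
    return {
        name: [alpha * wa + (1 - alpha) * wb for wa, wb in zip(a[name], b[name])]
        for name in a
    }

# Toy two-layer "checkpoints"; real ones hold millions of values per layer.
model_a = {"attn.weight": [1.0, 2.0], "mlp.weight": [4.0]}
model_b = {"attn.weight": [3.0, 0.0], "mlp.weight": [0.0]}

merged = merge_checkpoints(model_a, model_b, alpha=0.5)
print(merged)  # {'attn.weight': [2.0, 1.0], 'mlp.weight': [2.0]}
```

The edit happens directly on the weights, which is the claim being made: no training data or training code is needed to derive a new model.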


Would you say that once a model is trained, there's no need to go back and re-train it, even if you want to, say, remove some material from the training set? Anything can be done just with the weights? That's a big surprise to me.

Of course people hack binaries too, and binaries are obviously not source code. I once edited a book in PDF form because we didn't have the original Word/whatever document. It's not hard but a PDF still isn't considered to be source code for documentation despite that.


Technically, but it feels like you're intentionally missing the point being made. Sure, providing the weights is very useful given the cost of generating them, but you can't exactly learn much by looking through the 'code', make changes and gain an in-depth understanding in the same way you can from the code provided by an actual open source project.


You absolutely can and people do all the time. There are mountains of forks and dissections and improvements on open source models.


"Find us an AI use case that we can then turn around and market without compensating you for it you researching piece of shit.

Sincerely, Activision"


Player movement data can be used to build aimbot with undetectable lifelike movement. Thanks Activision!


This is not even necessary since current cheaters seemingly can't be detected anyway.


I wonder if the data includes information about which players were banned for cheating? That could open the door to new research into cheat detection.



