From the rest of the article, it sounds like a big chunk of these lines are from generated files. What I don't understand is why they're checking generated files into Git.
I've been switching a lot of generated files to being checked in (with CI verifying they haven't drifted from the source). The primary motivation has been performance. For example, in Rust code, it means I don't need to foist the code-gen process and all the dependencies needed for it on dependent crates. I've seen this play out similarly in other build systems and circumstances. The key is that the generated data needs to be independent of other factors (like the system doing the generation), and the rate of change of the code-gen source and generator has to be relatively low.
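To make the "CI verifying they haven't drifted" part concrete, here is a minimal sketch of such a check in Python. It is only an illustration under assumptions: `regenerate` is a hypothetical callable standing in for whatever generator a project uses, and the layout (one directory of checked-in output) is made up for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def digest_tree(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to the SHA-256 of its bytes."""
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def find_drift(checked_in: Path, regenerate) -> list[str]:
    """Regenerate into a scratch directory and list the files that differ.

    `regenerate` is a hypothetical callable that writes fresh output into
    the directory it is given; a real project would invoke its generator here.
    """
    with tempfile.TemporaryDirectory() as scratch:
        regenerate(Path(scratch))
        fresh = digest_tree(Path(scratch))
    committed = digest_tree(checked_in)
    # A file has drifted if it is missing on either side or its hash differs.
    return sorted(
        name
        for name in fresh.keys() | committed.keys()
        if fresh.get(name) != committed.get(name)
    )
```

A CI job could fail whenever `find_drift` returns a non-empty list. This only works if the generator's output does not depend on the machine running it, which is the independence condition mentioned above.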
They're using git as a cache. Having generated files stored there means they're available if they're needed (e.g. in CI) without needing further access controls, they're versioned, and it's a simple and understandable strategy. As the article states, most devs are set up to ignore those files so they're not much of a source of the slowness. It's a common pattern for apps that have to serve lots of different locales.
I don't think it's a good idea to store a cache in Git. Once committed, a file remains in the repository's history forever, so local and remote repositories become unnecessarily big.
The bigger issues for me are it makes history impossible to read (every change is hidden in an avalanche of crap), merges are a mess (you definitely want to spend forever merging autogened files, right?), PR reviews are annoying, etc.
Depends how much generated stuff is there. We have our GraphQL schema in git even though it's auto-generated via a library. But it's useful in PRs to see exactly how the schema changed as a result of the root change.
It's still surprising to have any generated things there. E.g. you could make the same case for keeping built binaries in Git as well.
Is there a reason why that type of file couldn't be better placed into an artifact repository, or just generated and consumed in CI as part of generating a final build output?
> It's still surprising to have any generated things there. E.g. you could make the same case for keeping built binaries in Git as well.
This is not surprising at all. In fact, it's quite standard to commit string translations. Just because you can run the code generation/string replacement step as part of the build that does not mean it's a good idea to generate everything from scratch at every single build.
String translations hardly change once they are introduced, running the build step takes a significant amount of time, and if anything fails then your product can break in critical and hard-to-notice ways.
I'm not saying don't have the translations at all. I'm saying:
1) caching things in git in general is a bad idea; why is it not in this case?
2) these are not - to my understanding - the raw resource files, but rather machine-generated intermediate files. This is why it's about caching, rather than minimal source files.
Additionally, to respond to your comment, if string translations don't change much then it may be possible to push them out as an internal 3rd-party library, and then they're even quicker to build.
> I'm not saying don't have the translations at all. I'm saying: 1) caching things in git in general is a bad idea (...)
You're missing the point. Storing translated files is caching things in git, and it is not a bad idea. It's a standard practice that saves your neck.
You either place faith in a build step working deterministically when it was not designed to work like that, or you track your generated files in your version control system.
If you decide to put faith in your ability to run deterministic builds with a potentially non-deterministic system, you waste minutes with each build regenerating files that you could very well have checked out, and in the process you risk sneaking in hard-to-track bugs. Then you need internationalization test steps for each localization running as part of your integration tests to verify that your build worked, which consumes even more resources.
Or... you stash them in git?
You use git to track changes, regardless of where they came from. Just because you place faith in some build step to always work deterministically that does not mean you are following a good practice and everyone else around you is wrong.
> You either place faith in a build step working deterministically when it was not designed to work like that
I'm sorry, what? Why would a build not work deterministically?
> If you decide to put faith in your ability to run deterministic builds with a potentially non-deterministic system
If your build is non-deterministic, how can you have any faith in the binaries it produces? You would have much larger problems in that case.
> You use git to track changes, regardless of where they came from
You probably don't want to do that if it is 70% of your codebase and slows down git for all your developers.
> Then you need to have internationalization test steps for each localization running as part of your integration tests to verify if your build worked
I'm convinced you've never used a build system before. Your build should fail if required files are missing. Downloading translation files at build time from some artefact repository vs storing them in git is how a lot of companies do it.
Anyone with any professional experience developing software can tell you countless war stories involving bugs that popped up when building the exact same project separate times. What leads you to believe that translations are any different? In fact, more often than not we see unexpected changes during translation update steps.
> If your build is non-deterministic, how can you have any faith in the binaries it produces?
First of all, builds are not deterministic by default.
To even come close to getting a deterministic build, you need to do all your own legwork after doing all your homework.
Did you ever do any of this sort of work? You didn't, did you? You're not looking and are instead just placing blind faith in stuff continuing to work by coincidence, aren't you?
> You probably don't want to do that (...)
Yes, I do. Anyone with their head on their shoulders wants to do that. It's either that or waste time tracking bugs that you allowed to go to production. Do you want to waste your time hunting down easily avoidable and hard to track bugs? Most of the professional world doesn't.
It is definitely possible to have determinism in a CI build step, and it's possible to have checks for it. If one needs determinism and a cache, they can store the files on S3 or some other place instead of git. Re-generating the files every time on the build isn't the only alternative. Instead of generate-and-commit, generate and upload. The difficulty is the same for developers.
If one has to be more granular than that, and have versioning and verification against the repository, they can still store the multiple versions on another service and store the hashes in git. That said, I'm not a fan of this for translations (especially if you have lots of languages/lots of strings), since there's an advantage to decoupling the translation process from the development process.
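As a rough sketch of the "files elsewhere, hashes in git" idea, in Python: only a small manifest of SHA-256 digests gets committed, and anything fetched from external storage is checked against it. The function names and the JSON manifest format are assumptions made up for this example, and the download step itself is left out.

```python
import hashlib
import json


def build_manifest(artifacts: dict[str, bytes]) -> str:
    """Return the JSON manifest that would be committed to git.

    Keys are sorted so the manifest text is stable across runs.
    """
    hashes = {
        name: hashlib.sha256(data).hexdigest()
        for name, data in artifacts.items()
    }
    return json.dumps(hashes, indent=2, sort_keys=True)


def verify(manifest_json: str, name: str, downloaded: bytes) -> bool:
    """Check bytes fetched from external storage against the committed hash."""
    expected = json.loads(manifest_json).get(name)
    return expected is not None and hashlib.sha256(downloaded).hexdigest() == expected
```

The build would call `verify` after each download and fail on `False`, so an old commit still pins exactly the files it was built with, without those files living in the repository.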
The problem with storing those files on git is that it can cause more problems, including developer experience issues.
It depends on how much you're storing on git. Some CSS files? Fine. 70% of files of the project, like in this case, slowing down everyone's workflow? Definitely not.
> Just because you place faith in some build step to always work deterministically that does not mean you are following a good practice and everyone else around you is wrong.
You're also doing that everywhere else. How do you think anything works? Why do you think Git is deterministic somehow? Why more so than including some files in a build?
Just as an example, I hit the non-deterministic case using JAXB to generate Java classes from XSD schema files. Running an Ant JAXB task to generate the classes from the same schema files would generate different class files each time. The class files were functionally the same, but it would reorder the methods, the variable definitions, etc., possibly because some internal code used a Map instead of a List, so order was not guaranteed. In our case the schema files were in source control and the Java/class files were not; the Java/class files were generated by the build, packaged into a jar and published to our artifact repository.
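The ordering problem is easy to reproduce outside JAXB. Below is a small Python analogue, not the JAXB behaviour itself (the Map-vs-List cause above is only a guess): a toy generator, with a made-up `render_class` name, that receives its fields as an unordered set and sorts them before emitting so the output stays byte-for-byte stable.

```python
def render_class(name: str, fields: set[str]) -> str:
    """Emit a tiny class definition with its fields in sorted order.

    Iterating the set directly could yield a different order from one
    run to the next (Python randomizes string hashing per process), so
    the fields are sorted before any output is written.
    """
    lines = [f"class {name}:"]
    for field in sorted(fields):
        lines.append(f"    {field} = None")
    return "\n".join(lines) + "\n"
```

Dropping the `sorted` call would give functionally identical output whose text can change between runs, which is the kind of churn that makes generated files painful to diff or hash.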
> Is there a reason why that type of file couldn't be better placed into an artifact repository, or just generated and consumed in CI as part of generating a final build output?
No reason at all, but when you need the files during development, and testing, and CI, and in production, and you don't want those things to fail when your artefact repo or source of data is down, then putting the latest versions in git makes sense.
The cost of having them in the repo is a tiny bit more complexity in your git workflow and config. The benefit is being able to access those files everywhere you access the code. It seems like a no-brainer to me.
This adds yet another moving part to the system, and another place things can go wrong.
> generated and consumed in CI as part of generating a final build output
This can get quite slow, and on larger projects you have to expend a lot of effort to keep build times reasonable.
Also, if you're serving a library for public consumption, you generally don't want to add the burden of extra build steps for the user to follow before they can use it. If it can all be automated to the point of invisibility to the user that's fine, but often it can't.
Author here. The xlf files are translations that are coupled with the texts we set in the code, so they're not really generated; I admit that was misleading. What I wanted to get across is that they're not touched directly by engineers, but they're still created through our translation pipeline, where real humans translate them.
Sorry, I've been a bit misleading. These xlf files aren't generated; engineers just don't interact with them, but they're still created and edited by humans as translations. We want to keep track of them so that if we deploy a different commit, the texts and translations in other languages will match.
Reading the article they are not generated files, but files that are never touched by developers. Translators will work with those files. I expect that for translators they have a different sparse checkout that only fetches .xlf files for their target languages.
Also plain text files are usually generated. They are arrays of 1s and 0s. No one wants to write that by hand.
I think that this is not a sensible definition of a generated file. A more sensible definition is that a generated file is created automatically from some source that is not user input (i.e. from another file). This means generated files do not need to be kept under git, as long as their source is checked in.
Translation files, even if they are not created with a plain text editor but with some other tool that handles the XML layer, are clearly not generated, as long as the translation is done by a human.
For the app I work on at the moment we use https://lokalise.com/. We add translation strings to a SaaS app, and then the translation team translates them. I've written a build tool that downloads the translation JSON files from the API using the CLI, or as part of our CI process. Other teams have tools that download their language packs for different iOS and Android apps. The translations are versioned in Lokalise and we use a branching strategy to manage the work. Lokalise has an option to generate xlf files (and JSON, xliff, arb, etc).
This is a very typical workflow. Most people are not out there modifying xlf files by opening them in a text editor. For a start, translations usually aren't done by developers.
(Huge shoutout to Lokalise btw. I can highly recommend it. It makes building a multi-lingual app across different platforms so much easier.)
You opted to keep translation files out of version control; you could also keep images there, or source files. All this stuff is the (pretty direct) output of non-deterministic human intervention.
(BTW, how do you build an old version of your application? Is lokalise able to give you the appropriate translations for a specific git commit / app version?)
.docx files are archives of xml-files. No one wants to write that by hand.
Or, in more words: The format of the files is just the representation on disk - it’s not directly connected to how the files are generated or edited. XML files can be written by hand with suitable editor support.
We check in some generated CSS files, which are produced by an external theme CLI, just to be sure that after a version update we can track all changes in the generated CSS.
Regenerating certain things might be fast, but some might not be.
Hundreds of engineers pushing code and having to wait for these to be regenerated both locally and on CI means that caching is quite cheap after all.