The idea that you can export a CSV file or Excel worksheet full of unitless numbers is scientifically whacky. In the same way most would scoff at a graph without axis labels, why too don't we scoff at a data file without units?
What's whacky about implicit units for a given format?
I use data files without explicit units all the time. A format specification, which might be explicit or implicit, might say that a given file is stored in CSV format where the first column contains a molecule representation (in my case as a SMILES string), followed by an identifier, followed by the molecular weight (in amu, which is the only reasonable unit), followed by surface area (in Å², which is always the case), followed by volume (in Å³, again, always the case). You'll note that none of these are SI units.
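To make the "units live in the format specification" idea concrete, here is a minimal sketch in Python. The column names, the `COLUMN_SPEC` table, and the sample record are all hypothetical (the numbers are invented placeholders, not measured data); only the column order and units follow the description above.

```python
import csv
import io

# Hypothetical format specification: column order plus the unit each
# column is understood to carry. None marks a column that is not numeric.
COLUMN_SPEC = [
    ("smiles", None),
    ("identifier", None),
    ("molecular_weight", "amu"),
    ("surface_area", "Å²"),
    ("volume", "Å³"),
]

# Invented sample record; the values are placeholders for illustration only.
SAMPLE = "CCO,example-1,46.07,60.0,55.0\n"

def read_records(text):
    """Parse unit-free CSV text and pair each value with its implied unit."""
    records = []
    for row in csv.reader(io.StringIO(text)):
        record = {}
        for (name, unit), raw in zip(COLUMN_SPEC, row):
            value = raw if unit is None else float(raw)
            record[name] = (value, unit)
        records.append(record)
    return records

records = read_records(SAMPLE)
print(records[0]["molecular_weight"])
```

The file itself stays unit-free; the reader supplies the units from the agreed-upon specification.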
I have another file containing a molecular structure in SDF format, which starts:
Line 4, the 44 is "number of atoms", the 45 is "number of bonds", and the (4.5411, 4.0194, 0.0000) is a coordinate in angstroms. I see a distinct lack of units in the data file.
I have another file, in PDB format, containing lines like:
ATOM 34 N GLY 1 6 8.420 50.899 85.486 0.50 51.30
The (8.420, 50.899, 85.486) is a coordinate in Å. The other numbers are identifiers of one sort or other, or the unitless occupancy and B-factor.
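As a rough illustration of how a reader of this format might handle the record above, here is a small Python sketch. It assumes, as stated in this comment, that the coordinates are in Å. Real PDB files use fixed-width columns; because the sample line here has its whitespace collapsed, the sketch splits on whitespace instead, and the function name `parse_atom_coords` is made up for this example.

```python
# Sample record copied from the comment above. Its whitespace is already
# collapsed, so this sketch splits on whitespace rather than fixed columns.
ATOM_LINE = "ATOM 34 N GLY 1 6 8.420 50.899 85.486 0.50 51.30"

# Assumption taken from the discussion: ATOM coordinates are in angstroms.
COORD_UNIT = "Å"
ANGSTROM_TO_NM = 0.1

def parse_atom_coords(line):
    """Return the (x, y, z) coordinates of an ATOM record as floats."""
    fields = line.split()
    if not fields or fields[0] != "ATOM":
        raise ValueError("not an ATOM record")
    # In this sample the last five fields are x, y, z, occupancy, B-factor.
    x, y, z = (float(v) for v in fields[-5:-2])
    return (x, y, z)

coords = parse_atom_coords(ATOM_LINE)
# Converting out of the domain-preferred unit is an explicit, separate step.
coords_nm = tuple(c * ANGSTROM_TO_NM for c in coords)
print(coords, COORD_UNIT)
```

Nothing in the record says "Å"; the constant in the reader does, which is the sense in which the units are implicit in the format.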
Are you really going to scoff at my entire field, for working with data files without explicit units since the late 1960s?
When working specifically with Excel files, an organization tends to stabilize on what certain column titles mean, so that a title of "MW" means "molecular weight in amu", etc. This is more complicated when applied to values which are parameter dependent (charge at pH 7.5 vs. 6.0), or depend on specific models and/or software version. You'll notice that pH is a unitless number.
Units are only a subset of the ontology usually omitted from a given file format. Others include unitless terms like B-factor, pH, "number of atoms", prediction model, and implementation version. This ontology is often instead made explicit in external format documentation or through shared knowledge of the users of that data file.
What advantage is there for me or my field to have 'SI units as a requirement for most datatypes', especially since we often deal with non-SI units like Å and amu? I don't see anything except more confusion, more chances for error, and performance overhead of going from/to SI units instead of staying in the domain-specific preferred system.
The people who I've seen try to use a Semantic Web/Linked Data approach end up bogged down in verbose, slow-to-parse data formats that make it hard to do real work, because the software has to be wary that the input one moment might be in Å, the next in nm, the third in m, and the fourth in yoctoparsecs.
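For a sense of the bookkeeping involved, here is a minimal Python sketch of the normalization step that software has to run on every value once the unit can vary per input. The `TO_METERS` table and the `to_angstrom` helper are hypothetical names for this example, and only three length units are covered.

```python
import math

# Conversion factors to meters for the length units the reader accepts.
TO_METERS = {"m": 1.0, "nm": 1e-9, "Å": 1e-10}

def to_angstrom(value, unit):
    """Convert a length with an explicit unit tag to angstroms."""
    try:
        factor = TO_METERS[unit]
    except KeyError:
        raise ValueError(f"unknown length unit: {unit!r}")
    return value * factor / TO_METERS["Å"]

# Three inputs describing the same length with different unit tags.
inputs = [(2.0, "Å"), (0.2, "nm"), (2.0e-10, "m")]
normalized = [to_angstrom(value, unit) for value, unit in inputs]
print(normalized)
```

Every value pays for a lookup and a multiplication, and any unit missing from the table becomes a runtime error, which is the kind of overhead a fixed, domain-specific unit avoids.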
The PDB format is an agreed-upon format that most fields don't have the luxury of using. And its implicit units keep it from being significantly more useful - outside of its own particular field. But you are correct, it has units - and that is a HUGE advantage over nearly any other scientific data format.
If the PDB format really had explicit units you could start to use it in other fields, easily - without knowing anything about the format itself. But again, PDB is an example of a well-codified format.
It'd be great if every figure/table you saw in a paper had an associated <.xsciformat> which was united (interesting that I meant unit-ed, but that's exactly what it would do - it would unite). That way you could download files from a gel-shift assay and directly and computationally compare the data with the diffusion data from a microscopy assay, and utilize pHs estimated from PDB files, or any other such really interesting co-interactions with the raw data itself. Right now this kind of co-linking of data across disparate fields is impossible. And I think much of it could be clarified if the user couldn't print out a graph/dataset that didn't have units - implicit or otherwise.
You asked "why too don't we scoff at a data file without units". You just answered your own question: because it's an "agreed-upon format."
All of my response was to point out that an Excel spreadsheet, CSV file, etc. can equally be considered an "agreed-upon format" by those who use it, and so doesn't need explicit units.
My original question was a simple one. Why should units be a requirement for most datatypes?
I know all of the reasons for why it's useful. I don't understand why it should be a requirement.
The PDB format is not an easy format to understand. Unit conversion is one of the least of the problems in using it outside of its field. Determining bond assignments is much harder, and bond type assignment harder still. In fact, I have a hard time figuring out an example where an explicit "this is in unit X" would make things appreciably easier, rather than being near-useless data taking up space.
Could you give an example of how someone could start to use it in another field, easily, where they couldn't now? I can only see it occurring by completely replacing the format, since adding an "A" after each coordinate, or a comment at the top that the coordinates are in angstroms, can't be what you mean. (Nor would including the PDB spec as a comment in each record be what you mean either - though it would be self-documenting!)
For that matter, the X-ray resolution field in a PDB record contains significant digits, so "2.0Å" and "2.00Å" mean different things. The ontology of units is not easy. 2.00Å is 2.00E-10m, not 2E-10m.
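The significant-digits point can be sketched in Python with the standard `decimal` module, which keeps trailing zeros where a plain float does not. The helper name `sig_digits` and the constant `ANGSTROM_IN_M` are made up for this example, and the digit count is simply "digits as written", not a full treatment of significant figures.

```python
from decimal import Decimal

# 1 Å expressed in meters, kept as an exact decimal.
ANGSTROM_IN_M = Decimal("1E-10")

def sig_digits(text):
    """Count the digits Decimal keeps for a value as written."""
    return len(Decimal(text).as_tuple().digits)

for text in ("2.0", "2.00"):
    in_meters = Decimal(text) * ANGSTROM_IN_M
    print(text, sig_digits(text), in_meters)

# A float conversion silently discards the distinction.
print(float("2.0") == float("2.00"))
```

With `Decimal`, "2.00" Å converts to 2.00E-10 m and "2.0" Å to 2.0E-10 m, while the float comparison reports the two inputs as equal.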
In any case, I deal with a lot of unitless numbers as well: pH, molarity, number of atoms and bonds, number of rotatable bonds, ratio between elongation and fixed elongation, etc. The ontology of values is not easy, and a required SI-unit system looks much more like it would get in the way than be useful.
I agree, but in some fields this is already done. For example, this is exactly why geoscientists have standardized on the fully self-describing netCDF file format. With netCDF, you can specify units, axis labels and other metadata very straightforwardly.