I looked at the wc implementation; there's no way it's compatible with the GNU tools in how it handles wide characters, which is heavily tied to the idiosyncratic C implementation. I know because I once wrote a wc implementation in modern C++ for my own education, and that's where most of my time and code complexity went. Also, the C versions are devilishly fast, with optimized code branches for every combination of flags.
You're right! Data.ByteString.Lazy is Word8 under the covers, so wide characters are truncated. tr takes a similar shortcut. Swapping to Data.Text would fix that.
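A toy illustration of the difference (hypothetical fragment, not from the repo): the same word measured as UTF-8 bytes versus as characters.

    {-# LANGUAGE OverloadedStrings #-}
    import qualified Data.ByteString.Char8 as B
    import qualified Data.Text as T
    import qualified Data.Text.Encoding as TE

    main :: IO ()
    main = do
      -- As UTF-8 bytes: the é encodes to two Word8s, so the "length" is 6.
      print (B.length (TE.encodeUtf8 "héllo"))
      -- As Text: counted in characters, so the length is 5.
      print (T.length "héllo")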
Where simplicity conflicted with compatibility, I've chosen the former so far. Targeting the BSD options and behavior is another example of that. The primary goal is to feel out the data flow for each utility, rather than dig into all the edges.
Poking at the Rust coreutils project, it seems like they're all faster than the C versions. The issue is the decades of accumulated features that someone somewhere uses but that seem safe to disregard because so few do.
A lot of the Rust coreutils were written by people learning the language, and the code is often _more_ complicated than the corresponding code for the Unix utils. They don't seem to get enough experienced programmers fixing them up, or enough usage, for me to actually recommend anybody use them, but maybe that's changing.
I read your blog. Maybe you should take a look at some of the other utils. I worked on sort and ls and both have long been faster than their GNU equivalents.
> They don't seem to get enough experienced programmers fixing them up, or enough usage, for me to actually recommend anybody use them, but maybe that's changing.
The issue is obviously compatibility, and uutils won't be broadly "usable" or stable until it hits 100% PASS / 0% FAIL on the GNU test suite, although the numbers are getting better and better. See: https://raw.githubusercontent.com/uutils/coreutils-tracking/...
> A lot of the Rust coreutils were written by people learning the language,
I really, really hate this way of framing the project -- which implicitly suggests that GNU is full of pros. Yes, lots of the code was written by people just learning Rust and systems programming. Once it reaches parity on the GNU test suite, which it is rapidly approaching, GNU coreutils will wish it had this kind of enthusiasm behind it.
> and the code is often _more_ complicated than the corresponding code for the Unix utils.
I think it really depends. Much of the code is much simpler. For instance -- I think it's much easier to generically do hashing in Rust, and it shows.
POSIX doesn't forbid new flags or options. It's up to the author to read the spec and test for portability, or else willingly rely on certain distros. Some GNU tools have had strict modes as a courtesy (they used to jokingly call it POSIX_ME_HARDER).
...and that's precisely why they're faster: if you can throw out decades of compatibility flags, you can avoid expensive branching and the like. That's why such projects should be called alternatives, not replacements; it's not a replacement if it can't be swapped in seamlessly while decades' worth of scripts continue working.
No problem. Don't branch in the middle of the algorithm; choose the appropriate optimized algorithm up front, based on the task. That's why we haven't had a separate fgrep binary for decades.
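A sketch of what that looks like in the Haskell setting (hypothetical code, not from the repo): inspect the flags once, pick the specialized loop, and keep the hot path branch-free.

    import qualified Data.ByteString.Lazy as BL
    import qualified Data.ByteString.Lazy.Char8 as BLC
    import Data.Int (Int64)

    -- Decided once from the command-line flags, before any data flows.
    data Mode = BytesOnly | LinesOnly | General

    -- Each mode gets its own tight loop; no per-byte flag checks.
    counter :: Mode -> BL.ByteString -> Int64
    counter BytesOnly = BL.length
    counter LinesOnly = BLC.count '\n'
    counter General   = BL.foldl' (\n _ -> n + 1) 0  -- stand-in for the full loop

    main :: IO ()
    main = BL.getContents >>= print . counter LinesOnly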
When is it time to talk about alternatives replacing the current apps in favor of faster ones? Ten years of rare usage? Twenty? At some point the edge cases should be covered by the scripts, not by the app, which only makes it slower for everyone.
I get what you're saying, but I don't care what it's called. Some things must die. E.g. Python 3 and the deprecation of Python 2 was a very controversial step, but ultimately the right choice at the right time.
python3 was practically the reason I'll never rely on Python for anything.
But it's not even just that. Even within a major version, so much changes every few months, and between different distros and platforms, that random non-packaged scripts never work when moved from the author's Fedora box to someone else's Debian box, or god forbid BSD or SCO or Windows, or even the same box a year later, thanks to any number of random library/module differences.
It's great for the author and miserable for every other user.
It's ok for writing and using today, but not at all for writing something to keep.
It's ok for packaged apps where the package manager ensures that the script has exactly the environment it expects.
It's ok for controlled environments where the gods at the top have decreed that it is just part of the environment: large companies with IT departments and a static, published environment that everyone must use.
That's a lot of "it's ok"s, and that's why it exists and is used in many places, but none of that changes the fact that it's quite terrible.
Python 3 is a hugely successful language and implementation, and almost no one regrets that it exists aside from a few noisy holdouts, and people who never liked any Python anywhere at any time.
I don't disagree that python3 is hugely successful, but that doesn't mean the Python 2->3 pain was necessary. Certainly many Python users started using it too recently to remember that pain, though.
I saw a solid dozen Python 2 projects leave Python entirely across a couple companies back during that timeframe.
The Python ecosystem has been growing overall especially because of the success of things like Pandas, but a lot of backend/fullstack web app programming did move away from it and never looked back.
(Though you might say the more interesting question is: would they have moved away anyway, to things like Node for async/perf, or JVM stuff for maintainability of a large old codebase with lots of devs? Maybe. But at this point Python has added a lot of things from those languages; then again, if those features had arrived five years earlier, with a cleaner upgrade story, maybe the migrations wouldn't have happened.)
python3 had no compatibility mode, so everyone needed 100% of their dependencies to migrate. This was so painful that some teams abandoned their legacy python2 code and reimplemented it in languages with better backwards-compatibility stories.
Python 3 was announced way ahead of time and came with migration tools that worked pretty well. Besides character handling, most of the stuff was pretty similar at the beginning, and I never understood why only a few projects made the switch early on.
That led to a chicken-and-egg situation: if you depended on libraries that did not migrate to python3, you were stuck on python2 as well.
I believe being nice to the community and supporting python2 for a long time was a mistake. They should have made a hard break and enforced the migration...
That document does not have the 2026 deadline. Instead: "At the same time, the authoring agencies acknowledge the commercial reality that transitioning to MSLs will involve significant investments and executive attention. Further, any such transition will take careful planning over a period of years." Also, for the roadmap, the "agencies urge executives of software manufacturers to prioritize using MSLs in their products", not eliminate C. And a lot is about accountability: "The authoring agencies encourage software manufacturers to lead from the top by publicly naming a business executive who will personally drive the elimination of memory safety vulnerabilities from the product line."
It seems to use hand-written SIMD code. Can't one write normal C loops and let the compiler optimize them into SIMD? I know that GCC can convert some loops into SIMD code.
Hi! A few years ago I found myself wanting an equivalent of `column` that didn’t strip color codes. After I implemented it in Haskell, I found it was useful to use Nix to force statically linking against libraries like gmp to reduce startup time. Perhaps what I ended up doing might be helpful for you too: https://github.com/bts/columnate/blob/master/default.nix
Thank you for the suggestion, I'll give this a whirl! I've fussed around with `--ghc-options '-optl-static -fPIC'` and the like in years past without success.
I imagine for streaming tools like these it's pretty convenient. You don't have to manage buffers etc.; you just write code against one massive string, and Haskell takes care of streaming it for you, pulling in more data when needed.
There are libraries that handle it, but they tend to have weird types; you can just use functions in the Prelude to write a lot of these basic utilities.
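For example, a serviceable `wc -l` is one line against lazy stdin (a toy sketch, Prelude only):

    -- Toy `wc -l`: interact reads stdin lazily, so `lines` consumes
    -- the input chunk by chunk instead of loading the whole file.
    main :: IO ()
    main = interact (\s -> show (length (lines s)) ++ "\n")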
For one thing, a "string" in Haskell by default is a linked list of unicode characters, so right out of the gate you've got big performance problems if you want to use strings. The exact way laziness is done also has serious performance consequences; when dealing with things as small as individual characters, all the overhead looms large on a percentage basis. One of the major purposes of any of the several variants of ByteString is to bundle the bytes together, but that means you're back to dealing with chunks. Haskell does end up with a nice API that can abstract over the chunks, but it still means you sometimes have to deal with chunks as chunks; if you turn them back into a normal Haskell "string" you lose all the performance advantages.
It can still come out fairly nice, but if you want performance it is definitely not just a matter of opening a file and pretending you've just got one big lazy string and you can just ignore all the details; some of the details still poke out.
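Concretely, "dealing with chunks as chunks" can look like this (a sketch; foldlChunks is the real Data.ByteString.Lazy API):

    import qualified Data.ByteString.Char8 as B
    import qualified Data.ByteString.Lazy as BL

    -- Count newlines by walking the lazy ByteString's strict chunks
    -- directly, never converting to a per-character String.
    countLines :: BL.ByteString -> Int
    countLines = BL.foldlChunks (\n chunk -> n + B.count '\n' chunk) 0

    main :: IO ()
    main = BL.getContents >>= print . countLines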
It definitely has some sharp edges. One advantage is skipping computations (and the IO they'd need) that don't end up getting used, which lets you do some clever-looking things and ignore some details. That's hard to take advantage of in practice, I think.
The other advantage is just deferring IO. For instance in split or tee, you could decide that you need 500 output files and open all the handles together in order to pass them to another function that will consume them. I'd squint at someone who wrote `void process_fds(int fds[500]);`, but here it doesn't matter.
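Something like this sketch of a split-style fan-out (hypothetical names, simplified error handling):

    import System.IO

    -- Open all the output handles up front and hand the whole list to
    -- the consumer; a list of 500 Handles is just an ordinary value.
    withOutputs :: Int -> ([Handle] -> IO a) -> IO a
    withOutputs n body = do
      hs <- mapM (\i -> openFile ("chunk-" ++ show i) WriteMode) [1 .. n]
      r  <- body hs
      mapM_ hClose hs
      pure r

    main :: IO ()
    main = withOutputs 500 $ \hs ->
      mapM_ (\(h, i) -> hPutStrLn h ("part " ++ show i)) (zip hs [(1 :: Int) ..])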
I think the scope of lazy constructs should usually be far less than that of strict constructs, so it's only in the cases where the librarified lazy abstractions don't fit that you need a rewrite. Lazy to strict isn't hard, but I don't want the performance and cognitive overhead of lazy-by-default.
Performance (execution, memory) is generally in the same ballpark as the BSD versions, with some caveats specific to utils that do lots of in-place data manipulation.
cut comes to mind as an example; slicing and dicing lines into fields quickly without a ton of copies isn't easy. Using Streaming.ByteString generally makes a huge difference, but it's extremely difficult to use unless you can get your mind to meld with the types it wants. Picking it up again months later takes some serious effort.
I'm used to reading about a given C/C++ program being implemented in Rust, and was delighted to see such an effort in a functional programming language.
I know little about functional programming languages, but I've always been interested in how languages like Ada and now Rust can help programmers write safer code. I'm curious what advantages a rewrite of a C/C++ app in an FP language provides, and also what advantages an FP language brings in comparison to a language like Rust.
The original coreutils are very weak on wide-char formatting, item delimiting (i.e. -null / -print0), flag name consistency, discoverability, and overall input/output uniformity. Solving the edge cases rigorously would probably require minor breakages, but would be worthwhile.
Most coreutil re-workings i’ve seen either double-down on JSON output, 24-bit color, sixels, & other lipsticks on the pig—without addressing any of the basic breakages we’ve become accustomed to—or they go off into an entirely different direction like nushell.
There is definitely room for improvement, but it is a thankless job.
Most are simple programs. If bug-for-bug compat + memory safety is the goal, carefully reviewing the C would have been a vastly better investment of time compared to making 13,498 commits.
Definitely. Depending on how long you've spent staring at the contents of /bin/ and /usr/bin/, you'll notice there are some array- or matrix-oriented utils (or options) missing, like column.
cut comes to mind as a difficult one. In C, you can just hop around the char buffer[] and drop NULs in place to terminate fields, etc., before printing. You could go that way with a Data.Array of Char, but that's hard to justify as functional.
Shouldn't cut be the easiest one to do in functional style?
You basically map one line of a stream to another, with some filtering and joining. Am I missing the part where it's terribly slow and/or not doable in Haskell or something?
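Something like this sketch is what I have in mind (hypothetical, not the repo's code): lines, split, select, join. ByteString's split even returns slices of the original buffer, so there's little actual copying.

    {-# LANGUAGE OverloadedStrings #-}
    import qualified Data.ByteString.Lazy.Char8 as B

    -- Toy `cut -f1,3`: keep the wanted fields of each tab-delimited line.
    cutFields :: [Int] -> B.ByteString -> B.ByteString
    cutFields wanted line =
      B.intercalate "\t"
        [ f | (i, f) <- zip [1 ..] (B.split '\t' line), i `elem` wanted ]

    main :: IO ()
    main = B.interact (B.unlines . map (cutFields [1, 3]) . B.lines)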
I'm not much of a functional programmer either, but generally in the beginning you'll feel this way because you're not used to thinking about problems in a functional way. You're used to solving problems with imperative structures. So you try to implement that in a functional language and it turns out completely stupid because the language isn't made to solve problems that way. But it usually has good ways to solve things.
I like FP, array programming, and logic programming (and the overlap), and I use them to solve problems that are considered (or genuinely are; I don't usually know going in) a bad fit for those paradigms, and then implement them shorter and faster than their imperative implementations. For some things it obviously doesn't work, but it's a really nice exercise for when I'm in the gym, walking, cooking, or driving: the performance- or LoC-optimizing parts fit in my head, so I can do the rewrites there and find ways to make them fit. For instance, going from an imperative piece of code to array programming (k/J), you have to find a way to translate the problem into a vector/matrix manipulation problem, and that you can do while doing something else entirely. When you find a good fit, the result can be many, many times faster than the original, just like that.
I got stuck trying to build an intuition for what would explode my memory usage due to lazy evaluation, and I never got a grip on the transition from pseudo-functional at the IO boundaries to really functional internally, either.
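The textbook example of that blowup, for reference (not from this project):

    import Data.List (foldl')

    main :: IO ()
    main = do
      -- foldl builds ten million nested (+) thunks before anything is
      -- forced, so memory balloons; foldl' forces the accumulator at
      -- each step and runs in constant space.
      print (foldl  (+) 0 [1 .. 10000000 :: Int])  -- space leak (without -O)
      print (foldl' (+) 0 [1 .. 10000000 :: Int])  -- constant space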
I noticed that they are about as long and complex as the C versions. In early C++/STL days, part of the pitch was reimplementing core utils in a much more concise way. I wonder if this is a result of focusing on that application, or a coincidence of design decisions.
> Symlinking doesn't appear to change the name reported by System.Environment.getProgName, so you'll need to create copies of the binary with different names
Would hardlinks help (would avoid multiple copies)?
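For context, the multi-call pattern in question looks roughly like this (a hypothetical sketch; the repo's dispatch may differ), which is why the invoked name matters:

    import System.Environment (getArgs, getProgName)

    -- Dispatch on the name the binary was invoked as (argv[0]).
    main :: IO ()
    main = do
      name <- getProgName
      args <- getArgs
      case name of
        "true" -> pure ()
        "echo" -> putStrLn (unwords args)
        _      -> putStrLn ("unknown applet: " ++ name)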
This reminds me of a similar attempt with Perl[0], done a few… erm, twenty years ago. Might be interesting to do a three-way comparison for a few well-covered cases.
Are we referring to reference implementations? I have always found it easier to understand the C code, as Rust and Python tend to obscure too many details behind high-level abstractions.
C is simple enough to know what is going on, IMO.
For example:
    numbers = [1, 2, 3, 4, 5]
    squared_numbers = [num * num for num in numbers]
    print(squared_numbers)
vs.
    fn main() {
        let numbers = vec![1, 2, 3, 4, 5];
        let squared_numbers: Vec<i32> = numbers.iter().map(|&num| num * num).collect();
        println!("{:?}", squared_numbers);
    }
vs.
    #include <stdio.h>

    int main() {
        int numbers[] = {1, 2, 3, 4, 5};
        int squared_numbers[5];

        // Squaring each number
        for (int i = 0; i < 5; i++) {
            squared_numbers[i] = numbers[i] * numbers[i];
        }

        // Printing the squared numbers
        printf("[");
        for (int i = 0; i < 5; i++) {
            printf("%d", squared_numbers[i]);
            if (i < 4) printf(", ");
        }
        printf("]\n");
        return 0;
    }
Python and Rust here seem concise and elegant, but they can be more difficult for the unaccustomed to follow because of what they hide. I am sure there are examples that involve even more higher-level abstractions, and reference implementations typically use a lot of them. In the case of this Rust code, abstractions like iterators and closures can also make the code more challenging for those who are not accustomed to functional programming paradigms.
What I am trying to say is that the C implementation is more straightforward, IMO.
I was referring to stuff like the value of CHAR_BIT depending on the platform.
E.g., SystemC exists as a superset to allow hardware design in C, because ISO C leaves many things unspecified.
People who write C nowadays generally try to write it in the most "boring" and easy-to-read way possible, I think. But C can also be hard to understand if it uses a lot of side-effectful expressions and pointer arithmetic. Just look at "How Old School C Programmers Process Arguments", which looks over an excerpt from K&R [0]. To make your code slightly harder to read, while still being plausible, you could change it to use pointer arithmetic, like so:
    // Printing the squared numbers
    printf("[");
    for (int n = 5, *p = squared_numbers; --n >= 0; p++)
        printf("%d%s", *p, n > 0 ? ", " : "]\n");
> as Rust and Python tend to obscure too many details behind high-level abstractions.
C, on the other hand, forces handling too many low-level details for a reference implementation, IMO. Rust also isn't really good in that regard.
In a reference implementation, what needs to be made explicit is the logic to be implemented, rather than details like memory management. For that I would prefer something like Datalog instead.
There's no way that C implementation is easier to read than the Python one. There's no high-level abstraction going on there and the list comprehension reads like natural language.
I agree Rust often gets distracting with unwrapping and lifetime notation.
I know it is easier to read, and it is not difficult to figure out what is going on (in this case), but many languages do not support list comprehensions (for example), so you would have to implement it in a way similar to the C version, which makes the C version more helpful.
The example above may not highlight what I was trying to, however.
Perhaps Ada could work even better than C? It is verbose, yes, but you will immediately know what is going on.
My point is that, in my opinion, verbosity is better than conciseness in the case of reference implementations.
Great work! As a nitpick, I noticed there's some contrived indirection going on inside `Coreutils.Util`. Instead of the Utility class, they could have just used a datatype. The current approach could be confusing to newcomers looking to learn from this repo (if that was the goal).
I'll admit it's mostly this way because I thought ExistentialQuantification sounded cool and wanted to give it a try with classes; this could definitely be tidied up.
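For comparison, the plain-datatype version might look something like this (a simplified, hypothetical sketch):

    import Data.List (find)
    import System.Environment (getArgs)

    -- A first-class record instead of a class plus
    -- ExistentialQuantification: each utility is just a value,
    -- and the dispatch table is an ordinary list.
    data Utility = Utility
      { utilName :: String
      , utilRun  :: [String] -> IO ()
      }

    utilities :: [Utility]
    utilities =
      [ Utility "echo" (putStrLn . unwords)
      , Utility "true" (\_ -> pure ())
      ]

    main :: IO ()
    main = do
      args <- getArgs
      case args of
        (name : rest) | Just u <- find ((== name) . utilName) utilities ->
          utilRun u rest
        _ -> putStrLn "usage: coreutils UTILITY [ARGS...]"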
I've seen several coreutils reimplementations just using the name "coreutils" for themselves. I doubt anyone on the original project cares, but it seems in bad taste to me.
The rules were different in 2014 when I made my account! It's actually quite annoying because lots of 3rd party GitHub integrations puke immediately saying I have an invalid username.
"On Windows - Symlinking doesn't appear to change the name reported by System.Environment.getProgName, so you'll need to create copies of the binary with different names."
I have not found this to be the case with "mklink /h".
https://bytepawn.com/tag/wc.html