
> The comparison is not very fair to modern day xargs.

I am curious how you came to that conclusion.

> `nproc` is a relatively standard utility (coreutils). So, xargs -P$(nproc) gets you core (or core-proportional) parallelism.

I follow you on this point. A bit harder on remote systems, but definitely doable.
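
For concreteness, the shape being discussed looks something like this (gzip over *.log is just a stand-in workload):

    # one gzip per file, at most $(nproc) jobs running at once
    printf '%s\0' *.log | xargs -0 -n1 -P"$(nproc)" gzip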

> Grouping output/Making a safe parallel grep is also easy-ish with `--process-slot-var=slot` and sending to `tmpOut.$slot`.

I tried spending 5 minutes on coding this, but the details seem very hard to get right: composed commands, grouping stderr, not leaving temp files behind if killed, and allowing the total output to be bigger than the free space on /tmp. I could not do it.
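
To show where I got stuck, here is the naive version of the tmpOut.$slot idea (the pattern and file set are placeholders); it groups per slot rather than per job, and it addresses none of the issues above:

    rm -f /tmp/tmpOut.*                # hope nothing else uses that name
    find . -name '*.log' -print0 |
      xargs -0 -n8 -P"$(nproc)" --process-slot-var=slot \
        sh -c 'grep -H ERROR "$@" >>"/tmp/tmpOut.$slot" || true' _
    cat /tmp/tmpOut.*                  # no trap, so a kill leaks the files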

Could you spend 5 minutes on showing, in code, how you would do it?

> Jobs on remote computers can be done similarly with any kind of `arrayVar[$slot]` setup where `arrayVar` has a bunch of `ssh` targets, possibly duplicates if you want to run >1 job per host. (In pure POSIX sh you could use eval and $1, $2 positional args with shell arithmetic..)

This one seemed even harder to me: It was completely unclear how you would make sure that a given number of jobs were constantly running; how you would need to quote data so that an eval would not turn "foo space space bar" into "foo space bar"; and how you would kill the remote jobs if the local script was killed.
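
My best reading of the suggestion is something like this (serverA/serverB and run_job are made up); it does keep one job per slot running, but it already mangles arguments containing spaces and does nothing to kill the remote jobs on a local kill:

    hosts=(serverA serverB serverA serverB)   # duplicates = 2 jobs per host
    export HOSTS="${hosts[*]}"
    printf '%s\n' task1 task2 task3 task4 task5 |
      xargs -n1 -P"${#hosts[@]}" --process-slot-var=slot \
        bash -c 'task=$1; set -- $HOSTS; shift "$slot"; ssh "$1" run_job "$task"' _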

If you believe this is simple, could you spend 5 minutes on showing the rest of us how you would do it in actual working code? Because it seems the devil is really in the detail.

> Last I looked at the source for GNU parallel it looked like mountains upon mountains of Perl I would rather not depend upon, personally, but to each his own.

Personally, I would take production-tested code over home-made untested code any day - no matter the language in which it was written.


> allowing for the total output to be bigger than the free space on /tmp. I could not do it.

This is an unreasonable standard when you do not know in advance how big the output is. What do you imagine GNU parallel does? Use `df` on every host it knows about to fill every disk partition it can? That sounds like a pretty system-hostile behavior to me.

Meanwhile, putting your temp files somewhere bigger is obviously as easy as setting $TMPDIR or similar.
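
E.g. (the mount point here is hypothetical):

    # GNU parallel, sort, and mktemp all honor TMPDIR
    mkdir -p /bigdisk/tmp && export TMPDIR=/bigdisk/tmp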

Best wishes/luck. I only have 5 minutes to explain why nothing can do the impossible, like reading a user's mind about free disk space management or the value of partial results. All software makes some assumptions... :-)


> This is an unreasonable standard when you do not know in advance how big the output is.

Why is that unreasonable?

Let us say a single job outputs 10% of the free space. As long as you run fewer than 10 jobs in parallel, GNU parallel can run forever, because it spits out a job's output when the job is done and then frees up that job's space while starting the next one.

A simple example:

    yes 1000000 | parallel -j10 seq | pv >/dev/null
On my laptop I get 600 MB/s, which would fill /tmp in a few minutes - and it does not.

When dealing with big data it is not uncommon that the total data piped between commands is way larger than the free space on /tmp (which is typically fast, whereas free space on $HOME is slow - thus setting $TMPDIR to $HOME/tmp may slow down your job drastically).

If you only have 5 minutes, I hope you will use them on providing actual code to support your claim that "The comparison is not very fair to modern day xargs."

If it takes longer than 5 minutes to code, I would say your use of "easy-ish" is unwarranted.

You leave me with the feeling that you have not thought this through, and that the reason you do not provide any code is that you are now realizing you are wrong but do not have the guts to admit it.

Prove me wrong by posting the code. It should be "easy-ish" :)

You can use this as the test case to implement:

    yes 1000000 | parallel -kj10 "echo 'This  is  double  spaced  '{#}; seq {}" | pv >/dev/null


You are just moving the goalposts from "grouping to not mix" in the comparison doc to "grouping to not mix, with the exact space-management profile of GNU parallel". Even worse, you now bring in IO space/speed assumptions, other use cases (hay generation, not needle search), various dissembling, and childish "taunts for proof" - when you clearly understood the suggestion well enough to analyze it for potential limitations. Your attitude is the problem, not missing code.

Also, I never said "/tmp"; the paths could be FIFOs with record-size/buffering limitations instead.

Speaking of /tmp filling and questionable space management defaults:

    yes 2000000000 | parallel seq | pv > /dev/null
fills my /tmp disk partition (or $TMPDIR) with invisible (unlinked) temp files before emitting one byte to pv. Not ideal. GNU sort at least shows me that the files are present, and it also seems to clean up on Ctrl-C.

There is likely some solution to this in the 15 kLOC of gross Perl; I did not find it in "5 minutes" (another unreasonable standard, since the many thousands of lines of GNU parallel docs take far longer than that to read - but you already seem to ignore my explanations of "unreasonable"). You even anticipate this in your 10% example: at least in my life, "way more" is often much more than 10x more. So you basically contradict yourself.

As to the actual subtopic: besides being unfair/out-of-date, the comparison tableau is also incomplete - maybe willfully so, as per all-too-common marketing dishonesty. "Proof?" People use parallelism to speed things up and need to make decisions about job granularity so that overhead does not kill performance. Some would say this matters more than 95% of the tableau's evaluation points. Yet there are no overhead benchmarks. Maybe they would make GNU parallel look bad?


If you feel I am "moving the goalposts", why not just prove your original case? If you are spending 5 minutes on reading the source code, why not instead spend them on proving your original assertion correct? You can then let the readers decide whether I "move the goalposts".

I included the example:

    yes 1000000 | parallel -kj10 "echo 'This  is  double  spaced  '{#}; seq {}" | pv >/dev/null
to give you some fixed "goalposts" to aim for: provide a solution that gives the same output, byte for byte.

Also, you do not seem to get the point about the amount of data. I regularly have output from a single job that is bigger than RAM, but rarely output from a single job that would fill /tmp. However, the total combined output from all the jobs will often take up more space than /tmp has free.

In numbers: RAM=32 GB, /tmp=400 GB, a single job=33 GB, number of jobs=1000, jobs in parallel=8. That is 33 TB of total output, of which only 8 x 33 GB = 264 GB is in flight at any time - so streaming each job's output as it finishes fits in /tmp, while buffering all outputs up front cannot.

In other words: running all the jobs and saving the outputs into files before emitting any data will not be useful for me. If you want to use FIFOs, I really cannot see how you can deal with output that is bigger than RAM, unless you mix output from different jobs - which again would not be useful to me. But prove me wrong by spending 5 minutes on building the solution.

As for your example:

    yes 2000000000 | parallel seq | pv > /dev/null
How would you design this, if output from different jobs is not allowed to mix?

If they are allowed to mix, parallel gives you:

    # bytes are allowed to mix
    yes 2000000000 | parallel -u seq | pv > /dev/null
    # only full lines are allowed to mix
    yes 2000000000 | parallel --lb seq | pv > /dev/null
Neither of these uses space in /tmp.

I am left with the feeling that you are willing to spend hours complaining, but not 5 minutes on proving your assertion that it can be done "easy-ish".

Prove me wrong: Spend 5 minutes on the task you believed was "easy-ish".

If it cannot be done in 5 minutes, be brave enough to admit you were wrong.


Would it not be more fruitful to address the hard issue: funding?

https://git.savannah.gnu.org/cgit/parallel.git/tree/doc/cita...

I think many free software developers would rejoice if you cracked that problem.


I think that is the right thing to do: Don't like it? Don't use it.

Also: https://git.savannah.gnu.org/cgit/parallel.git/tree/doc/cita...


I was curious how much breakage GNU Parallel has suffered. So I fetched all versions (in parallel) and ran:

    parallel -k --tag --argsep -- {} echo ::: 1 -- parallel-*
Every version since 20120622 works (except for 20121022). That is code that is almost 10 years old.


You need to try it with all the Perl versions, not just all the parallel versions.


In my anecdotal, n=1 experience, nothing Perl-based I've ever used has EVER broken over 20+ years, not even ONCE.

Compare this with PHP, whose breaking changes between releases have taken down my sites on multiple occasions.

Compare this with Python, whose breaking changes prevent me from running the overwhelming majority of Python things I've tried to use.


Also try:

    parallel --tmux ...


    cat hosts.txt | parallel --quote --timeout=10 ssh {} 'echo {} $(md5sum ~/.config/file)'
Also try:

    parallel --slf hosts.txt --timeout=10 --nonall --tag md5sum .config/file


> I don't care what GNU thinks, but it's simply not scalable.

How so?

A lot of software requires you to configure it before the first run, and we regard that as scalable.

A lot of software requires you to pay for it before the first run (most Microsoft server software comes to mind), yet we regard that as scalable. You can also pay for GNU Parallel: https://git.savannah.gnu.org/cgit/parallel.git/tree/doc/cita...

Is it because you insist that you get software for free (zero cost, in GNU speak)? Because that is really not what the free software movement is all about.


    $ ls
    Thank you for using the /bin/ls utility!
    Did you know that you can upgrade to LS PRO for a mere fraction of a bitcoin? 
    Or just post a selfie tagged #LS_PRO_RULES on Twitter!
    LS PRO has many amazing features that you are missing.
    This message can be removed by using the --no-awesome-ls-pro-upgrade-msg flag.
    Here is your file listing:
    .bashrc .catconf .cprc .ddconfig .dfprefs  ...
    $ exit -1


Honestly, I fail to see the problem if I had to run `ls --no-awesome-ls-pro-upgrade-msg` once, when I installed it for the first time. And if I did not like it, I could use one of the alternatives to `ls`, or build my own.

In LibreOffice I have to click "Don't show tip of the day again" every time I install it on a new machine, and personally I have no problem with that. If I had, I would use something else.

Zsh asks me to configure it the first time I run it. I find that slightly annoying, but not to the extent that I would even consider complaining, sending a patch, or using an alternative.

But I assume you are aware that your comparison is really not valid: Parallel is not limited in features - you do not get extra features by paying/citing. What you are doing is keeping it alive.

Also, if you really do not like the notice, why not just pay for it? Are you opposed to paying for free software? And if so, how do you suggest developers of free software make a living? And why are you not actively doing that for GNU Parallel, which you clearly have such strong opinions on that you are willing to spend time complaining, but not willing to ignore it (and use another tool)?


I still remember when those were not of the "OK, don't show this again" type, so you could not simply turn them off after the first run.


> If you're not describing an experiment or system that uses GNU parallel as one of its key components then it makes no sense to cite it any more than it does to cite any other utility.

GNU Parallel agrees with you, but also gives you a test of when to regard it as a "key component" (as you put it):

https://git.savannah.gnu.org/cgit/parallel.git/tree/doc/cita...

> If you feel the benefit from using GNU Parallel is too small to warrant a citation, then prove that by simply using another tool. [...] If it is too much work replacing the use of GNU Parallel, then it is a good indication that the contribution to the research is big enough to warrant a citation.


git-bisect is nice if you are looking for a git commit.

If you are looking for a limit or for the failing part of a file, have a look at: https://gitlab.com/ole.tange/tangetools/-/tree/master/find-f...
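
For the "looking for a limit" case, the core of such a tool is just a bisection loop. A minimal sketch, assuming a hypothetical ./test that succeeds below the limit and fails at or above it:

    lo=1; hi=1000000                     # assumed known-good and known-bad sizes
    while [ $((hi - lo)) -gt 1 ]; do
      mid=$(( (lo + hi) / 2 ))
      if ./test "$mid"; then lo=$mid; else hi=$mid; fi
    done
    echo "smallest failing size: $hi"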

