Every pattern being matched to every line can be a big win in more complex proce...

microtherion · on Oct 25, 2017

Yes, but as zephyrfalcon said, that maps onto a series of if statements in python. No 10:1 magic anywhere.

empthought · on Oct 25, 2017

The "series of if statements" also has to read the line, split it, and parse an integer. To behave like the AWK script it also has to catch an exception and continue when the input cannot be parsed as an integer.

Go ahead, write the Python script that behaves exactly as this AWK program does. It will likely be 4x as long, and only 4x because the number of different patterns and actions is quite low. More complex (and hence more situated and less easy-to-understand) use cases will benefit even more from AWK's defaults.

Moreover the pattern expressions are not constrained to simple tests: https://www.gnu.org/software/gawk/manual/html_node/Pattern-O...

They can match ranges, regular expressions, or indeed any AWK expression. They can use the variables managed by the AWK interpreter: https://www.gnu.org/software/gawk/manual/html_node/Auto_002d... (NR and NF are commonly used).

Actions can communicate one-way or two-way with coprocesses with minimal ceremony: https://www.gnu.org/software/gawk/manual/html_node/Two_002dw...

All of those mechanisms can be done in a Python script, but they add up to a lot of boilerplate and mindless yet error-prone translation to the standard library or Python looping and conditional logic.
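
To make that boilerplate concrete, here is a rough sketch of the loop AWK runs implicitly, written out in Python. The names `run_rules`, `rules`, and `fs` are made up for illustration, and features such as `next`, `BEGIN`/`END`, and range patterns are left out.

```python
def run_rules(lines, rules, fs=None):
    """Apply every (pattern, action) rule to every line, AWK-style.

    Each pattern and action is a callable taking (fields, nr).
    fields[0] is the whole line (like $0); fields[1:] are the columns.
    """
    for nr, raw in enumerate(lines, start=1):   # nr plays the role of NR
        line = raw.rstrip("\n")
        # fs=None splits on runs of whitespace, similar to AWK's default FS.
        fields = [line] + line.split(fs)        # len(fields) - 1 plays NF
        for pattern, action in rules:
            if pattern(fields, nr):
                action(fields, nr)
```

A rule that prints the second column of every two-column line would then be written as `(lambda f, nr: len(f) - 1 == 2, lambda f, nr: print(f[2]))`.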

microtherion · on Oct 25, 2017

> The "series of if statements" also has to read the line, split it, and parse an integer

All of which are built-in functions...

> To behave like the AWK script it also has to catch an exception and continue when the input cannot be parsed as an integer.

Not quite sure what behavior you're referring to here. When I tested your script, it happily treated "xy" as divisible by 15.

> Go ahead, write the Python script that behaves exactly as this AWK program does.

  import fileinput
  for line in fileinput.input():
    replaced = False
    if int(line) % 3 == 0: print("Fizz", end=''); replaced = True
    if int(line) % 5 == 0: print("Buzz", end=''); replaced = True
    if replaced: print();
    else: print(line, end='')

> Moreover the pattern expressions are not constrained to simple tests

And none of these, except maybe for the range operator, are particularly challenging for python.
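
For the range operator, a minimal sketch of what `/begin/,/end/` might look like as a Python generator. `awk_range` is a hypothetical helper, not a standard-library function, and it assumes both ends of the range are regular expressions.

```python
import re

def awk_range(lines, begin, end):
    """Yield the lines an AWK range pattern /begin/,/end/ would select."""
    in_range = False
    for line in lines:
        # The range opens on a line matching `begin`.
        if not in_range and re.search(begin, line):
            in_range = True
        if in_range:
            yield line
            # The end pattern is also tested on the line that opened the
            # range, so a line matching both forms a one-line range.
            if re.search(end, line):
                in_range = False
```

It is not long, but the state flag is exactly the kind of bookkeeping that AWK hides behind a comma.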

empthought · on Oct 26, 2017

1. Your script crashes when it is given input that does not parse as an integer. The awk script does not. In this way, the awk design favors robustness over correctness, which is a valid choice to make at times.

2. How would you modify it so it parsed a tab-delimited file and did FizzBuzz on the third column? With awk it is a simple matter of setting FS="\t" and changing $0 to $3.

3. How would you modify it so instead of being output unmodified, rows with $3 that are neither fizz nor buzz output the result of a subprocess called with the second column's contents?

Now you might say that this is all goalpost-moving, but that's the point. AWK is more flexible and less cluttered in situations where the goalposts tend to get moved, but where the basic text processing paradigm stays the same.

microtherion · on Oct 27, 2017

1. Sure, it's a valid choice, and one that can easily be reproduced by python:

  def intish(text):
    # Mimic awk's leniency: input that is not an integer counts as 0.
    try:
      return int(text)
    except ValueError:
      return 0

Can python's default be reproduced as easily in awk?

2. You'd insert field = line.split('\t') at the beginning of the loop and then refer to field[2]

3. os.popen or subprocess.run

I buy the "less cluttered" argument when the problem matches awk's defaults. I vehemently disagree with the "more flexible" argument. A problem perfectly suited to awk can easily turn to a poor fit with the addition of a single, seemingly innocuous requirement (e.g. in your subprocess example, log the standard error of your subprocess into a separate file).
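
On point 1, one caveat: intish is stricter than awk in one respect. awk converts a string using its leading numeric prefix, so "3abc" counts as 3 while "xy" counts as 0. A sketch closer to that rule follows; `awk_number` and the regular expression are illustrative assumptions, and real awk implementations may differ on edge cases.

```python
import re

# Optional whitespace and sign, then digits with an optional fraction,
# then an optional exponent. This only approximates awk's conversion.
_NUMERIC_PREFIX = re.compile(r"\s*[-+]?(\d+\.?\d*|\.\d+)([eE][-+]?\d+)?")

def awk_number(text):
    """Return the leading numeric prefix of text as a float, or 0.0."""
    match = _NUMERIC_PREFIX.match(text)
    return float(match.group(0)) if match else 0.0
```

With this, `awk_number("xy") % 15 == 0` is true, which is why "xy" gets treated as divisible by 15.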

empthought · on Oct 29, 2017

So what does that look like in your program? With respect to failing fast and verbose error reporting in AWK, it's as simple as

     !/^[0-9]+$/ {
         print "invalid input: " $0 > "/dev/stderr"
         exit 1
     }

at the beginning of the script. None of the other actions need to be changed; but with your implementation, all of the calls to "int" need to be changed to "intish".

I've got the following script (I stopped playing games with line breaks):

    #!/usr/bin/env gawk -f

    BEGIN {
	FS = "|"
    }

    $2 % 3 == 0 {
	printf("Fizz")
	replaced = 1
    }

    $2 % 5 == 0 {
	printf("Buzz")
	replaced = 1
    }

    replaced {
	replaced = 0
	printf("\n")
	next
    }

    {
	system("cal " $2 " 2018 2> errors.txt")
    }

Which can produce the following output:

    $ ./script.awk <<EOF
    > thing1|0
    > thing2|3
    > thing3|7
    > thing4|13
    > EOF
    FizzBuzz
    Fizz
	 July 2018
    Su Mo Tu We Th Fr Sa
     1  2  3  4  5  6  7
     8  9 10 11 12 13 14
    15 16 17 18 19 20 21
    22 23 24 25 26 27 28
    29 30 31

    $ cat errors.txt 
    cal: 13 is neither a month number (1..12) nor a name

- What does the equivalent program in Python look like?

- How many characters does it have compared with the awk script (259, including the shebang)?

- How many characters would need to change to split by "," instead? (1 for awk). (You can achieve this in Python, but you'll end up spending characters on a utility function.)

- How many characters would need to be added to print "INVALID: " and then the input value for lines with non-numeric values in the second column, then skip to the next line? (55 for awk)

Character adds/changes are the best proxy for "flexibility" I could think of that doesn't go far afield into static code analysis.

I love Python and don't think awk is a good solution for extremely large or complex programs; however, it seems obvious to me that it is significantly more flexible than Python in every line-oriented text-processing task. The combination of opinionated assumptions, built-in functions and automatically-set variables, and the pattern-action approach to code organization, all add up to a powerful tool that's still worth using in order to keep tasks from becoming large or complex in the first place.
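
For comparison, one possible Python counterpart to the script above, offered as a sketch rather than a definitive answer to the questions. The names `process` and `to_int` and the parameters `command`, `extra_args`, and `errors_path` are invented here for illustration. It also simplifies awk's number conversion to "anything that is not an integer counts as 0", and it runs the command directly instead of through a shell.

```python
import subprocess
import sys

def to_int(text):
    """Lenient parse: input that is not an integer counts as 0."""
    try:
        return int(text)
    except ValueError:
        return 0

def process(lines, sep="|", command=("cal",), extra_args=("2018",),
            errors_path="errors.txt", out=sys.stdout):
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        value = fields[1] if len(fields) > 1 else ""   # second column, like $2
        n = to_int(value)
        word = ("Fizz" if n % 3 == 0 else "") + ("Buzz" if n % 5 == 0 else "")
        if word:
            print(word, file=out)
            continue
        # Neither Fizz nor Buzz: pass the column to the external command.
        # The error file is reopened (and truncated) per call, as with
        # the "2> errors.txt" redirection in the awk script.
        with open(errors_path, "w") as errors:
            result = subprocess.run([*command, value, *extra_args],
                                    stdout=subprocess.PIPE, stderr=errors,
                                    text=True)
        out.write(result.stdout)
```

Calling `process(sys.stdin)` would approximate the awk script on the same input. Splitting on "," is a matter of passing `sep=","`, though that convenience is paid for up front in the function signature.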