
I'm wondering:

Does awk really provide that much more value over sed while being easier or faster to use than a fully-fledged scripting language (thinking of perl, python, etc.)?

(and yes, one may argue that awk IS a scripting language, I'm not disputing that, just asking)



AWK is a programming language. In the AWK book by Aho, Weinberger and Kernighan, towards the end of the book they implement an assembler and a virtual machine for a processor of their own invention, so the machine code they have just defined has something to run on. They also implement a relational database management system in AWK, as well as an auto-scaling graphing solution.

I myself have implemented an XML SOAP command-line client, a backup solution, a SAN UUID management application, an automated Oracle RAC SAN storage migration solution, a configuration management system, and Oracle database creation / management applications in AWK.

Usually I develop a thin getopts shell wrapper around an AWK core. It works every time; the executables are on the order of a few KB (the largest so far, the XML SOAP client, is 24.5 KB) and they all run like a bandit. Memory requirements are minuscule. Dependencies are minimal: the only external dependency so far in my software has been the xsltproc binary from the libxslt package.
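
A minimal sketch of that pattern (illustrative only; the option names and the AWK body are made up, not from any of the tools above):

    #!/bin/sh
    # thin getopts wrapper; the real work happens in the embedded AWK core
    verbose=0
    while getopts v opt; do
        case $opt in
            v) verbose=1 ;;
            *) echo "usage: $0 [-v] [file...]" >&2; exit 2 ;;
        esac
    done
    shift $((OPTIND - 1))
    awk -v verbose="$verbose" '
        verbose { printf("processing line %d\n", NR) > "/dev/stderr" }
        { print $1 }   # stand-in for the real core logic
    ' "$@"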

AWK is easier to use than Python or Perl, and is much faster than either of those. Typical code density ratio of Python versus AWK is 10:1, sometimes more. This means that if you have a 650-line Python program, you can implement the same functionality in about 280 lines of AWK, and the program will be far simpler. I once collapsed a 280+ line Python program into a simple 15 lines of code in AWK.

AWK is an extremely versatile, powerful programming language.

For even more speed, AWKA can be used to translate AWK source into C and then call an optimizing C compiler to turn it into a binary executable. Typical speedup is on the order of 100%: if your AWK program ran in 12 seconds, it'll now finish in six.


I once introduced AWK to a team and shared the same book with them. They didn't know that AWK was a programming language. I told them they could achieve more, faster, with AWK than with python & php for transforming and shuffling data around. They looked at me like I was crazy. :(


I have quite fond memories of awk, but some of these claims might be a bit on the [citation needed] side.

"Easier to use" - maybe so, on the particular subset of problems that awk was designed for. However, the ease of use upside is limited - awk constructs map pretty much 1:1 onto Python/Perl constructs that are not particularly complicated. Conversely, there is a vast set of problems that are still straightforward to solve in Python/Perl and would be rather awkward in awk.

"much faster" - the comparisons I've seen (and done) usually had awk and perl5 roughly at parity.

"code density ratio 10:1" - I call BS on that one. Sure, with the benefit of hindsight, it's sometimes possible to vastly simplify a script, but that has little to do with the languages involved. There is no awk solution that cannot be expressed in about 2x the lines of Python code (and that 2x is mostly because idiomatic awk puts conditions and code on one line, while Python puts them on two lines).


"""Typical code density ratio of Python versus AWK is 10:1, sometimes more. This means that if you have a 650 line Python program, you can implement the same functionality in about 280 lines of AWK, and the program will be far simpler. I've once collapsed a 280+ line Python program into a simple 15 lines of code in AWK."""

How does this work? I am not saying it can't be done, but the main benefit of Awk seems to be quick one-liners, which are possible because you get fields (splitting on whitespace) and records (splitting on newlines) and looping for free. But for larger programs, this easily translates to Python: just call readlines(), loop over it, call split() on each line. I would think that at this point, Awk doesn't have much of an advantage anymore... but apparently your experiences are different. What are some Awk constructs that would take a lot more code in Python?


Every pattern being matched against every line can be a big win in more complex processing. This is a simple but familiar example:

     seq 1 30 | awk '
     $0 % 3 == 0 { printf("Fizz"); replaced = 1 }
     $0 % 5 == 0 { printf("Buzz"); replaced = 1 }
     replaced { replaced = 0; printf("\n"); next }
     { print }'
Note that the awk script is far more general than the typical interview question, which specifies the numbers to be iterated in order. The awk script works on any sequence of numbers.


Yes, but as zephyrfalcon said, that maps onto a series of if statements in python. No 10:1 magic anywhere.


The "series of if statements" also has to read the line, split it, and parse an integer. To behave like the AWK script it also has to catch an exception and continue when the input cannot be parsed as an integer.

Go ahead, write the Python script that behaves exactly as this AWK program does. It will likely be 4x as long, and that's because the number of different patterns and actions to take is quite low. More complex (and hence more situated and less easy-to-understand) use cases will benefit even more from AWK's defaults.

Moreover the pattern expressions are not constrained to simple tests: https://www.gnu.org/software/gawk/manual/html_node/Pattern-O...

They can match ranges, regular expressions, or indeed any AWK expression. They can use the variables managed by the AWK interpreter: https://www.gnu.org/software/gawk/manual/html_node/Auto_002d... (NR and NF are commonly used).
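
For instance (illustrative one-liners; file is a placeholder):

    # skip a 3-line header, then print only lines with more than 2 fields
    awk 'NR > 3 && NF > 2 { print }' file
    # print everything from the first BEGIN CONFIG line to the next END CONFIG line
    awk '/BEGIN CONFIG/,/END CONFIG/' file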

Actions can communicate one-way or two-way with coprocesses with minimal ceremony: https://www.gnu.org/software/gawk/manual/html_node/Two_002dw...
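
A sketch of gawk's two-way |& operator (gawk-specific; this assumes the coprocess, here rev, flushes its output after each line, since block-buffering coprocesses can deadlock):

    # send each input line to a rev coprocess and read the reply back
    {
        print $0 |& "rev"
        "rev" |& getline reversed
        print reversed
    }
    END { close("rev") }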

All of those mechanisms can be done in a Python script, but they add up to a lot of boilerplate and mindless yet error-prone translation to the standard library or Python looping and conditional logic.


> The "series of if statements" also has to read the line, split it, and parse an integer

All of which are built-in functions...

> To behave like the AWK script it also has to catch an exception and continue when the input cannot be parsed as an integer.

Not quite sure what behavior you're referring to here. When I tested your script, it happily treated "xy" as divisible by 15.

> Go ahead, write the Python script that behaves exactly as this AWK program does.

  import fileinput
  for line in fileinput.input():
    replaced = False
    if int(line) % 3 == 0: print("Fizz", end=''); replaced = True
    if int(line) % 5 == 0: print("Buzz", end=''); replaced = True
    if replaced: print();
    else: print(line, end='')
> Moreover the pattern expressions are not constrained to simple tests

And none of these, except maybe for the range operator, are particularly challenging for python.


1. Your script crashes when it is given input that does not parse as an integer. The awk script does not. In this way, the awk design favors robustness over correctness, which is a valid choice to make at times.

2. How would you modify it so it parsed a tab-delimited file and did FizzBuzz on the third column? With awk it is a simple matter of setting FS="\t" and changing $0 to $3 (see the sketch after this list).

3. How would you modify it so instead of being output unmodified, rows with $3 that are neither fizz nor buzz output the result of a subprocess called with the second column's contents?

Now you might say that this is all goalpost-moving, but that's the point. AWK is more flexible and less cluttered in situations where the goalposts tend to get moved, but where the basic text processing paradigm stays the same.
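
For the record, a sketch of the point-2 variant:

    # FizzBuzz on the third column of a tab-delimited file
    BEGIN { FS = "\t" }
    $3 % 3 == 0 { printf("Fizz"); replaced = 1 }
    $3 % 5 == 0 { printf("Buzz"); replaced = 1 }
    replaced { replaced = 0; printf("\n"); next }
    { print }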


1. Sure, it's a valid choice, and one that can easily be reproduced by python:

  def intish(s):
    try:
      return int(s)
    except ValueError:  # treat anything non-numeric as 0, like awk does
      return 0
Can python's default be reproduced as easily in awk?

2. You'd insert field = line.split('\t') at the beginning of the loop and then refer to field[2]

3. os.popen or subprocess.run

I buy the "less cluttered" argument when the problem matches awk's defaults. I vehemently disagree with the "more flexible" argument. A problem perfectly suited to awk can easily turn to a poor fit with the addition of a single, seemingly innocuous requirement (e.g. in your subprocess example, log the standard error of your subprocess into a separate file).


So what does that look like in your program? With respect to failing fast and verbose error reporting in AWK, it's as simple as

     !/^[0-9]+$/ {
         print "invalid input: " $0 > "/dev/stderr"
         exit 1
     }
at the beginning of the script. None of the other actions need to be changed; but with your implementation, all of the calls to "int" need to be changed to "intish".

I've got the following script (I stopped playing games with line breaks):

    #!/usr/bin/env gawk -f

    BEGIN {
	FS = "|"
    }

    $2 % 3 == 0 {
	printf("Fizz")
	replaced = 1
    }

    $2 % 5 == 0 {
	printf("Buzz")
	replaced = 1
    }

    replaced {
	replaced = 0
	printf("\n")
	next
    }

    {
	system("cal " $2 " 2018 2> errors.txt")
    }

Which can produce the following output:

    $ ./script.awk <<EOF
    > thing1|0
    > thing2|3
    > thing3|7
    > thing4|13
    > EOF
    FizzBuzz
    Fizz
	 July 2018
    Su Mo Tu We Th Fr Sa
     1  2  3  4  5  6  7
     8  9 10 11 12 13 14
    15 16 17 18 19 20 21
    22 23 24 25 26 27 28
    29 30 31

    $ cat errors.txt 
    cal: 13 is neither a month number (1..12) nor a name

- What does the equivalent program in Python look like?

- How many characters does it have with respect to the number of characters in the awk script? (259 with shebang).

- How many characters would need to change to split by "," instead? (1 for awk). (You can achieve this in Python, but you'll end up spending characters on a utility function.)

- How many characters would need to be added to print "INVALID: " and then the input value for lines with non-numeric values in the second column, then skip to the next line? (55 for awk)

Character adds/changes are the best proxy for "flexibility" I could think of that doesn't go far afield into static code analysis.

I love Python and don't think awk is a good solution for extremely large or complex programs; however, it seems obvious to me that it is significantly more flexible than Python in every line-oriented text-processing task. The combination of opinionated assumptions, built-in functions and automatically-set variables, and the pattern-action approach to code organization, all add up to a powerful tool that's still worth using in order to keep tasks from becoming large or complex in the first place.


Hashed arrays are much simpler to do in AWK than they are in Python, for example.
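
The standard illustration is word-frequency counting (a generic example, not from any particular program):

    # count[] is an associative array; elements spring into existence on first use
    { for (i = 1; i <= NF; i++) count[$i]++ }
    END { for (w in count) print count[w], w }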

Also, record splitting and processing are highly configurable in AWK with RS, ORS, and OFS, and one gets it for free without having to write extra code. And don’t forget that Python needs about 25,000 files just to fire up, while AWK is a single 169 KB executable (on Solaris / illumos / SmartOS). That makes a huge difference come application deployment time.
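
For example, setting RS to the empty string switches awk into paragraph mode (illustrative; file is a placeholder):

    # records are blank-line-separated paragraphs, fields are single lines:
    # print each record's first line and its line count, joined by OFS
    awk 'BEGIN { RS = ""; FS = "\n"; OFS = ": " } { print $1, NF " lines" }' file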


* for search and replace on entire lines and other address-based filtering, I prefer sed (or perl if I need PCRE features like non-greedy quantifiers, lookarounds, code in the substitution section, etc.)

* for field processing, most multiple line processing, use of logic operators, arithmetic, control structures etc, I prefer awk or perl

* this repo is aimed at command-line text processing tools; most awk examples given are single-line, a few are 2-3 lines. Personally I prefer Python for larger programs

----

See also: https://unix.stackexchange.com/questions/303044/when-to-use-...


To me, awk always occupied an awkward (no pun intended) spot. Too complicated for a proper single-purpose command-line program, yet not expressive enough to be in the same sphere as a scripting language like the ones you mentioned. I was hoping this article might illuminate its purpose, but I'm still just as clueless.


Could you give an example use case?

>Too complicated

Please try out the examples given and let me know if they help you understand the syntax better.


For me, yes. I use Python plenty too, but Awk is great for a tiny state machine, and unique in the ease of setting up an event-stream view of a text file. I could use Python, add a loop and a big set of conditionals, but a one-page awk program gets all that right.
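
A minimal sketch of such a state machine (the section markers are made up):

    # print only the lines between BEGIN and END marker lines, exclusive
    /^BEGIN$/ { inside = 1; next }
    /^END$/   { inside = 0 }
    inside    { print }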


sed and awk complement each other, I think.

sed works on a line-by-line basis.

awk can work on a whole file. Subsequent line operations can depend on the state of previous lines.

Each has its own operating domain and you have to decide which tool is the best one for the task you have in mind.


Indeed: sed is useful for making small, line-wise tweaks to text. To be honest, I use it rarely (and this is largely because its regexp flavour leaves a lot to be desired): things like deleting the header line[1S] or a simple replacement[2S]. It, like Awk, has some useful line-targeting functions (e.g., print lines between two regular expressions[3S], etc.). Awk, on the other hand, is more like a finite state machine for text processing, with the notion of records and fields baked in[4]. You can do the same things in Awk as in sed (see the [*A] references), but they're often easier in sed; vice versa, some things would be impossible or very difficult to do in sed that would be easy in Awk (e.g., [4], which prints the fifth field whenever the first field is "foo"). This doesn't even get into the multiline/stateful stuff you can do in Awk, but the examples would be too big/specific to fit into this comment.

I also learned recently that GNU Awk has networking support[5]. I have no idea why!

[1S] sed '1d'

[1A] awk 'NR!=1 {print}'

[2S] sed 's/foo/bar/'

[2A] awk '{sub(/foo/, "bar"); print}'

[3S] sed -n '/start_regex/,/end_regex/p'

[3A] awk '/start_regex/,/end_regex/ {print}'

[4] awk '$1=="foo" {print $5}'

[5] https://www.gnu.org/software/gawk/manual/gawkinet/gawkinet.h...


[3A] awk '/start_regex/,/end_regex/ {print}'

can be simplified to:

awk '/start_regex/,/end_regex/'

because in awk, if no action is given, the default action is to print the lines (that match the pattern). And if the pattern is omitted but the action is given, it means do the action on all lines of the input.
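
And the converse, for illustration (file is a placeholder):

    awk '{ print $2 }' file   # action with no pattern: runs on every input line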



>[1S] sed '1d'

Similarly

sed 15q

will print only the first 15 lines of the input and then terminate. E.g.:

sed 15q file

or

some_command | sed 15q

So, when put in a shell script and then called (with filename arg or using stdin):

sed $1q

is like a specific use of the head command [1]; it prints the first n ($1) lines of the standard input or of the filename argument given - where the value of $1 comes from the first command-line argument passed to the script.

[1] In fact, on earlier Unix versions I worked on (which, IIRC, did not have the head command), I used to use this sed command in a script called head - similar to tail.

And I also had a script called body :) to complement head and tail, with the appropriate invocation of sed. It takes two command-line arguments ($1 and $2) and prints (only) the lines in that line number range, from the input.
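
A plausible reconstruction of that body script (my guess at the invocation, not the original):

    #!/bin/sh
    # body: print only lines $1 through $2 of the input, then stop reading
    first=$1 last=$2
    shift 2
    sed -n "${first},${last}p;${last}q" "$@"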


I did not know gawk had networking support. I wonder if it could be used on network traffic on the fly, sort of like iRules on an F5. Thank you for sharing!


"sed is useful for making small, line-wise tweaks to text."

Couldn't agree more!

A great example of this was using sed's -i and 's///g' (surprised in-place editing wasn't mentioned) while "cleaning" hundreds (seriously) of HTML/PHP files of injected content at a shared hosting provider.
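
Something in this spirit, say (the injected tag and domain are hypothetical):

    # strip a hypothetical injected script tag from every PHP file, in place
    # (GNU sed; -i.bak keeps a backup copy of each original)
    sed -i.bak 's|<script src="http://bad\.example/x\.js"></script>||g' *.php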


Honestly, that makes sense - doing multiline replaces with sed isn't very convenient (I believe it's possible if you replace newlines with NULL). I guess I'll probably learn awk then, it can't be that hard with the examples from this repo^^


I use awk every day because I need state (I work with text files full of sections and subsections), but I am sure that there has to be something better out there.

What is the definitive tool to process text? Perl? Haskell? Some Lisp dialect?


Biased because I've used Perl for over 20 years, but yeah, that's clearly one of its core reasons to exist. Regular expressions built into the language syntax instead of as a library makes a big difference.


> What is the definitive tool to process text? Perl? Haskell? Some Lisp dialect?

Definitive? Being snarky, the one you have already installed and are familiar with. Like most I use Awk for one-liners, Perl if I need a little more or better regexes in a one- or two-liner. For the last several years I've been using TXR[1] if it gets complex. Lately I've been doing more fiddling with JSON than text and I'm using Ruby/pry and jq[2].

[1] http://www.nongnu.org/txr/

[2] https://stedolan.github.io/jq/


Hi; I replied to your GitHub gist quite a while ago:

https://gist.github.com/rlonstein/90d53fdeea31d2137737

about a matter related to the hash bang line in the script.

TXR has a nice little hack (that apparently I invented) to implement the intent of "#!/usr/bin/env txr args ..." on systems where the hash bang mechanism supports only one argument after the interpreter name.


Perl. It was designed for doing just that :)


Perl has an intangible "write once" property: since it allows for writing extremely sloppy code under the "there's more than one way to do it!" mantra, nobody, including the original author, can debug it afterwards. Not even with the built-in Perl debugger. Perl encourages horrible spaghetti code.

In the interest of fair and accurate disclosure, I earned my bread for 3.5 years debugging Perl code for a living and I've also had formal education in Perl programming at the university. I would never want to do that again.


I spent my first 2.5 years out of college working on legacy Perl code and I cannot agree. Perl is a very nice language if you follow a coding style, and really, any language gets ugly pretty quickly if you don't. There's this adage that "some developers can write C code in any language", and it's probably similarly true that some developers can write Perl one-liners in any language. (In that legacy Perl codebase I maintained, one of the developers was clearly writing Fortran code in Perl. He was doing everything with nested loops over multi-dimensional integer arrays.)


I experimented a bit with writing Erlang-style code in Perl. Wasn't terribly successful; pattern matching, even with regular expressions built into the language, is a fairly tough feature to emulate.


The problem is that with regexps you're generally still doing text matching, which is inefficient and error-prone. Perl's default exception mechanism allows text-based errors as well, so you end up doing it there too if you use exceptions and haven't decided on and strictly used exception objects by default (and even then you need to deal with stray strings as you encounter them, e.g. by promoting them to an exception object of some sort). Objects at least allow you to definitively match on types. Perl's (now) experimental given/when constructs and smartmatch operator would help with this, but they've been plagued with problems for a long time (or at a minimum are still not seen as production-ready).


If I remember correctly, perl was created to combine the power of sed and awk into a single tool so you didn't have to keep switching back and forth between the two.

Personally, I would suggest AWK for one- or two-liners when doing one-time data transform tasks. Anything more complicated and I would suggest a more "fully-fledged" scripting language.


awk is my go-to tool as a simple command-line tokenizer. Hard to beat:

awk '{ print $1; }'

Other than that... not really. Maybe the advantage would be ubiquity, if you really, really want to avoid Perl.


>Hard to beat: awk '{ print $1; }'

How about: cut -f 1 -d ' '

You don't even need the -d flag if you happen to be able to use the default delimiter, like in your example.


I used to regularly do ad-hoc text processing, typically on a 24 hour log of GPS data (on an early-2000s-era computer). Surprisingly enough, awk is many times faster than cut for any data set big enough for you to notice time passing.


Awk delimits on whitespace by default; cut cannot do that, AFAICT. So if you have something like this, cut won't work:

    apples  1
    bananas 2


Exactly, which makes cut kinda shit for 99% of common "split-on-whitespace" tasks IMO.


You can always use awk with FS/OFS values to clean up the delimiters so you can pass the data off to cut ;-)
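
For example (file is a placeholder):

    # $1 = $1 forces awk to rebuild the line with OFS between fields,
    # turning runs of whitespace into single tabs that cut understands
    awk 'BEGIN { OFS = "\t" } { $1 = $1; print }' file | cut -f 2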


There's always plain old bash

while read a _; do echo "$a"; done


If you already know it, then yes, it's still handy sometimes. And it's a small language, unlike Perl.

OTOH I'm not sure I'd recommend bothering to learn it. Python is more verbose in Awk's domain, but not by so much as to make a huge difference, except at the scale of one-liners. (Or a-few-liners, at least.)

Another reason to learn it: the AWK book (by Aho, Kernighan, and Weinberger) is a great very short intro to the spirit of Unix-style coding. You could think of learning Awk as just the price of admission to that intro, paid along the way.

I wrote plenty of Awk in the 90s -- https://github.com/darius/awklisp isn't very representative but it was fun.


I'll admit that I don't know perl, which I believe is the typical sed+awk replacement.

According to my potentially miscalibrated gut, referencing a "fully-fledged" scripting language in a shell script or at the command line is an indication that you should probably just be working in that environment in the first place.

There will be exceptions, it depends chiefly on the problem being solved, but overall I prefer my shell scripts to reference utilities with very specific purposes. It feels UNIXier that way.


It's more powerful than sed while being much easier and quicker to use than perl or python, so yes. I use it quite often just as a filter. sed is still good for some things.



