> Alternatively, awk '{print $2}' netflix.tsv would have given us the same result. For this tutorial, I use cat to visually separate the input data from the AWK program itself. This also emphasizes that AWK can treat any input and not just existing files.
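Concretely, the two equivalent forms being compared are:

cat netflix.tsv | awk '{print $2}'   # input piped into awk from another command
awk '{print $2}' netflix.tsv         # awk reads the file directly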
Thanks to you... and OP! I have been using POSIX systems for years but never really touched awk, because what I have seen has seemed like archaic chants of dark magic. Very good resources for learning about this nifty tool.
I use sed & awk all the time. They are invaluable tools while debugging issues, extracting fields from log files, etc. I am not dissing Python or Perl; I use Python extensively as well. But while you are in the middle of an incident, it's hard to beat a quick one-liner, like this random example:
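(an illustrative one-liner of that kind, not the poster's actual example)

awk -F: '$3 >= 1000 { print $1 }' /etc/passwd

This lists the login names of regular accounts (UID >= 1000) in /etc/passwd.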
Interestingly, perl was designed to fill exactly that role, with a syntax that was at the time considered cleaner and more intuitive than awk's (and of course with a regex backend that was much more powerful). In this case, something like:
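For the hypothetical /etc/passwd one-liner above, an equivalent perl invocation (again an illustration, not the commenter's original) would be:

perl -F: -lane 'print $F[0] if $F[2] >= 1000' /etc/passwd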
Which is a shame, because I doubt I'll be loading modules from CPAN for awk any time soon, and I routinely use modules from CPAN, or that I myself have written, in my perl one-liners. In fact, I have a specialized project-perl.sh in my project's PATH that loads a few modules (Try::Tiny, Path::Tiny, Data::Dumper, etc.), sets up a local::lib, loads the main project module (allowing access to most of the functions I've written for the current project), and passes all the extra params through to the underlying perl command.
If my codebase wasn't already in perl I wouldn't necessarily be able to benefit from a large library of my own functions relevant to the problem, but at least I would still have CPAN to fall back on. I'm not sure awk will ever be able to compete with that.
Of course it does: the implicit looping and BEGIN/END blocks (I'll concede I had actually forgotten about the implicit splitting) were in fact deliberately designed to emulate awk. Nonetheless, I don't think that really changes the point much. The core language syntax is simple enough that perl one-liners can be written without those features and still get the point across.
Dumb side note: I always end up using `awk '{print $1}'` instead of `cut -f1` because cut's field separation defaults are unwieldy in that it doesn't intelligently consider any whitespace to be a field separator, which is what I need 99% of the time.
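A quick illustration of the difference:

printf 'total   42\n' | cut -f2           # prints the whole line: the default delimiter is tab
printf 'total   42\n' | cut -d' ' -f2     # prints an empty field: runs of spaces are not collapsed
printf 'total   42\n' | awk '{print $2}'  # prints 42: awk splits on any run of whitespace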
The fragment BEGIN{total=0} can be skipped (at least in the awks I've used; not sure if some more recent / strict awk differs), because awk initializes the variable total to 0.
I remember this because I do it all the time to quickly get the (non-recursive) sum of the sizes of the files in a directory - it's been pretty much muscle memory for a while now:
ls -l | awk '{ s += $5 } END { print s/1024 " KB" }'
For recursive size, one can use ls -lR or the du command with various options according to need.
But that only works because in the final print, you have done arithmetic with s.
So in other words, we can take out the BEGIN block, but then we must remember to change print total to print total + 0.
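A minimal demonstration (feeding awk empty input so the variable is never assigned):

printf '' | awk 'END { print total }'      # prints an empty line
printf '' | awk 'END { print total + 0 }'  # arithmetic forces numeric context, prints 0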
Also, using uninitialized variables is basically a code golfing stupidity that will bite you in any halfway complicated program.
GNU Awk has a useful --lint argument which spots uses of uninitialized variables. If you make a habit of writing code that way, then when you reach for --lint to find a bug, you have to wade through false positives.
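For example (file name made up):

gawk --lint '{ total += $1 } END { print total }' data.txt

Here gawk will flag the references to the uninitialized total.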
>Also, using uninitialized variables is basically a code golfing stupidity that will bite you in any halfway complicated program.
Nonsense. Not if you know what you are doing, and used it in a known way, which is what I did. The code I wrote works. I tested it on Linux before posting it. Also, such a usage (skipping the initializer) is mentioned (IIRC) in the classic Kernighan & Pike book "The Unix Programming Environment" (still a great resource, though not updated for modern Unix/Linux features), which is where I learned it from, years ago (and hence why I qualified my statement by saying it may not work in more strict or modern awk versions). Fine to talk about other variations but it does not mean that my variation is wrong.
Don't try to read my mind. My intention was not code golfing. Was just sharing some fun info. It's not a big deal to keep the initializer either, I'm quite aware of that.
You do not know 100% what you're doing. Evidence being, in the grandparent comment you wrote "[...] because awk initializes the variable total to 0" which isn't how awk works at all.
Your intention can be understood as the promotion of code golfing, as evidenced by these words:
The fragment BEGIN{total=0} can be skipped
by which you're clearly encouraging that other coder to make their code shorter by removing an initialization that works fine.
>Your intention can be understood as the promotion of code golfing, as evidenced by these words:
>The fragment BEGIN{total=0} can be skipped
>by which you're clearly encouraging that other coder to make their code shorter by removing an initialization that works fine.
Your statement (above) can be "understood" as not understanding my prior statement(s), including the one in which I said "Don't try to read my mind" (w.r.t. intention, because you cannot - it is mine (mind), not yours). If you cannot grok that after a second explanation, I have nothing further to say. Good day.
Although your true intent may be anyone's guess, true enough, the prima facie intent of a forum-posting persona is a fairly straightforward function of the content via which it portrays itself.
Generally good advice, but can be OK in specific situations where you have a known HTML structure and are just scraping some values out of it. This is not so much parsing HTML as it is matching the patterns of the values you want to extract.
Attribute value minimization would be one limitation, since that regex is for XML, but it's a more robust approach than writing naive "<x>.*</x>"-style regexes.
Does awk really provide that much more value over sed, while being easier or faster to use than a fully-fledged scripting language (thinking of perl, python, etc.)?
(and yes, one may argue that awk IS a scripting language, I'm not disputing that, just asking)
AWK is a programming language. In the AWK book by Aho, Weinberger and Kernighan, towards the end of the book they implement an assembler and an interpreter (a virtual machine) for a small processor they invent, so the machine code they just made up has something to run on. They also implement a relational database management system in AWK, as well as an auto-scaling graphing solution.
I myself have implemented an XML SOAP command-line client, a backup solution, a SAN UUID management application, an automated Oracle RAC SAN storage migration solution, a configuration management system, and Oracle database creation / management applications in AWK.
Usually I develop a thin getopts shell wrapper around an AWK core. Works every time: the executables are on the order of a few KB (the largest so far, the XML SOAP client, is 24.5 KB) and they all run like a bandit. Memory requirements are minuscule. Dependencies are minimal: the only external dependency so far in my software has been the xsltproc binary from the libxslt package.
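A minimal sketch of that pattern (the option and the awk body here are made-up placeholders, not one of the poster's actual tools):

#!/bin/sh
# thin getopts wrapper around an AWK core
delim=' '                                  # default: awk's normal whitespace splitting
while getopts d: opt; do
    case $opt in
        d) delim=$OPTARG ;;
        *) echo "usage: $0 [-d delimiter] [file ...]" >&2; exit 2 ;;
    esac
done
shift $((OPTIND - 1))

exec awk -F "$delim" '
    { count[$1]++ }                        # the real application logic would live here
    END { for (k in count) print count[k], k }
' "$@"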
AWK is easier to use than Python or Perl, and is much faster than either of those. Typical code density ratio of Python versus AWK is 10:1, sometimes more. This means that if you have a 650 line Python program, you can implement the same functionality in about 280 lines of AWK, and the program will be far simpler. I've once collapsed a 280+ line Python program into a simple 15 lines of code in AWK.
AWK is an extremely versatile, powerful programming language.
For even more speed, AWKA can be used to transpile AWK source into C and then it will call an optimizing C compiler to compile it into a binary executable. Typical speedup is on the order of 100%, so if your AWK program ran in 12 seconds, it'll now finish in six.
I once introduced AWK to a team and shared the same book with them. They didn't know that AWK was a programming language. Told them they could achieve more faster with AWK than using python & php for transforming and shuffling around data. They looked at me like I was crazy. :(
I have quite fond memories of awk, but some of these claims might be a bit on the [citation needed] side.
"Easier to use" - maybe so, on the particular subset of problems that awk was designed for. However, the ease of use upside is limited - awk constructs map pretty much 1:1 onto Python/Perl constructs that are not particularly complicated. Conversely, there is a vast set of problems that are still straightforward to solve in Python/Perl and would be rather awkward in awk.
"much faster" - the comparisons I've seen (and done) usually had awk and perl5 roughly at parity.
"code density ratio 10:1" - I call BS on that one. Sure, with the benefit of hindsight, it's sometimes possible to vastly simplify a script, but that has little to do with the languages involved. There is no awk solution that cannot be expressed in about 2x the lines of Python code (and that 2x is mostly because idiomatic awk puts conditions and code on one line, while Python puts them on two lines).
"""Typical code density ratio of Python versus AWK is 10:1, sometimes more. This means that if you have a 650 line Python program, you can implement the same functionality in about 280 lines of AWK, and the program will be far simpler. I've once collapsed a 280+ line Python program into a simple 15 lines of code in AWK."""
How does this work? I am not saying it can't be done, but the main benefit of Awk seems to be quick one-liners, which are possible because you get fields (splitting on whitespace) and records (splitting on newlines) and looping for free. But for larger programs, this easily translates to Python: just call readlines(), loop over it, and call split() on each line. I would think that at this point Awk doesn't have much of an advantage anymore... but apparently your experiences are different. What are some Awk constructs that would take a lot more code in Python?
Note that the awk script is far more general than the typical interview question, which specifies the numbers to be iterated in order. The awk script works on any sequence of numbers.
The "series of if statements" also has to read the line, split it, and parse an integer. To behave like the AWK script it also has to catch an exception and continue when the input cannot be parsed as an integer.
Go ahead, write the Python script that behaves exactly as this AWK program does. It will likely be 4x as long, and that's because the number of different patterns and actions to take is quite low. More complex (and hence more situated and less easy-to-understand) use cases will benefit even more from AWK's defaults.
All of those mechanisms can be done in a Python script, but they add up to a lot of boilerplate and mindless yet error-prone translation to the standard library or Python looping and conditional logic.
1. Your script crashes when it is given input that does not parse as an integer. The awk script does not. In this way, the awk design favors robustness over correctness, which is a valid choice to make at times.
2. How would you modify it so it parsed a tab-delimited file and did FizzBuzz on the third column? With awk it is a simple matter of setting FS="\t" and changing $0 to $3 (a sketch follows below).
3. How would you modify it so instead of being output unmodified, rows with $3 that are neither fizz nor buzz output the result of a subprocess called with the second column's contents?
Now you might say that this is all goalpost-moving, but that's the point. AWK is more flexible and less cluttered in situations where the goalposts tend to get moved, but where the basic text processing paradigm stays the same.
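For concreteness, a sketch of the modification in point 2, assuming a plain FizzBuzz awk script as the starting point (file name illustrative):

awk 'BEGIN { FS = "\t" }
     $3 % 15 == 0 { print "FizzBuzz"; next }
     $3 % 3  == 0 { print "Fizz"; next }
     $3 % 5  == 0 { print "Buzz"; next }
     { print }' data.tsv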
1. Can python's default be reproduced as easily in awk?
2. You'd insert field = line.split('\t') at the beginning of the loop and then refer to field[2]
3. os.popen or subprocess.run
I buy the "less cluttered" argument when the problem matches awk's defaults. I vehemently disagree with the "more flexible" argument. A problem perfectly suited to awk can easily turn to a poor fit with the addition of a single, seemingly innocuous requirement (e.g. in your subprocess example, log the standard error of your subprocess into a separate file).
at the beginning of the script. None of the other actions need to be changed; but with your implementation, all of the calls to "int" need to be changed to "intish".
I've got the following script (I stopped playing games with line breaks):
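A sketch along those lines, consistent with the session that follows (the original script isn't reproduced verbatim here): "|"-separated input, FizzBuzz on the second field, and otherwise a subprocess whose stderr is logged to a separate file.

#!/usr/bin/awk -f
BEGIN { FS = "|" }
$2 % 15 == 0 { print "FizzBuzz"; next }
$2 % 3  == 0 { print "Fizz"; next }
$2 % 5  == 0 { print "Buzz"; next }
{ system("cal " $2 " 2>>errors.txt") }   # anything else: run the subprocess, stderr to errors.txt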
$ ./script.awk <<EOF
> thing1|0
> thing2|3
> thing3|7
> thing4|13
> EOF
FizzBuzz
Fizz
July 2018
Su Mo Tu We Th Fr Sa
1 2 3 4 5 6 7
8 9 10 11 12 13 14
15 16 17 18 19 20 21
22 23 24 25 26 27 28
29 30 31
$ cat errors.txt
cal: 13 is neither a month number (1..12) nor a name
- What does the equivalent program in Python look like?
- How many characters does it have, compared to the awk script's 259 (with shebang)?
- How many characters would need to change to split by "," instead? (1 for awk). (You can achieve this in Python, but you'll end up spending characters on a utility function.)
- How many characters would need to be added to print "INVALID: " and then the input value for lines with non-numeric values in the second column, then skip to the next line? (55 for awk)
Character adds/changes are the best proxy for "flexibility" I could think of that doesn't go far afield into static code analysis.
I love Python and don't think awk is a good solution for extremely large or complex programs; however, it seems obvious to me that it is significantly more flexible than Python in every line-oriented text-processing task. The combination of opinionated assumptions, built-in functions and automatically-set variables, and the pattern-action approach to code organization, all add up to a powerful tool that's still worth using in order to keep tasks from becoming large or complex in the first place.
Hashed arrays are much simpler to do in AWK than they are in Python, for example.
Also record splitting and processing is highly configurable in AWK with RS, ORS, OFS and one gets it for free without having to write extra code. And don’t forget that Python needs about 25,000 files just to fire up, while AWK is a single 169 KB executable (on Solaris / illumos / SmartOS). Makes a huge difference come application deployment time.
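For instance, a frequency count keyed on the first field is a one-liner (file name illustrative):

awk '{ count[$1]++ } END { for (k in count) print count[k], k }' requests.log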
* for search and replace on entire lines and other address based filtering, I prefer sed (or perl if I need PCRE features like non-greedy, lookarounds, code in substitution section, etc)
* for field processing, most multiple line processing, use of logic operators, arithmetic, control structures etc, I prefer awk or perl
* this repo is aimed at command line text processing tools, most awk examples given are single line, a few are 2-3 lines. Personally I prefer Python for larger programs
To me, awk always occupied an awkward (no pun intended) spot. Too complicated for a proper single purpose command-line program. Not expressive enough to be in the same sphere as a scripting language like the ones you mentioned. I was hoping this article might illuminate its purpose, but I'm still just as clueless.
For me, yes. I use Python plenty too, but Awk is great for a tiny state machine, and unique in the ease of setting up an event stream view of a text file. I could use Python, add a loop and a big set of conditionals—but a one page awk program gets all that right.
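A sketch of that kind of tiny state machine, printing only the lines between two (made-up) section markers:

awk '/^BEGIN CONFIG/ { collecting = 1; next }
     /^END CONFIG/   { collecting = 0 }
     collecting' notes.txt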
Indeed: sed is useful for making small, line-wise tweaks to text. To be honest, I use it rarely (and this is largely because its regexp flavour leaves a lot to be desired). Things like delete the header line[1S] or a simple replacement[2S]. It, like Awk, has some useful line targeting functions (e.g., print lines between two regular expressions[3S], etc.) Awk, on the other hand, is more like a finite state machine for text processing, with the notion of records and fields baked in[4]. You can do the same thing in Awk as in sed (see [*A] references), but it's often easier in sed; vice versa, some things would be impossible or very difficult to do in sed which would be easy in Awk (e.g., [4], which prints the fifth field whenever the first field is "foo"). This doesn't even get into the multiline/statewise stuff you can do in Awk, but the examples would be too big/specific to fit into this comment.
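The kinds of one-liners alluded to above look roughly like this (file names and patterns illustrative):

sed '1d' data.csv                          # delete the header line
sed 's/colour/color/g' draft.txt           # a simple replacement
sed -n '/^START/,/^END/p' log.txt          # print lines between two regular expressions
awk '$1 == "foo" { print $5 }' data.txt    # print the fifth field whenever the first field is "foo"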
I also learned recently that GNU Awk has networking support[5]. I have no idea why!
Because in awk, if no action is given, the default action is to print the lines that match the pattern; and if the pattern is omitted but an action is given, the action is run on every line of the input.
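For example (file name illustrative):

awk '/error/' app.log       # pattern only: prints the matching lines
awk '{ print $1 }' app.log  # action only: runs on every line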
The command sed 15q will print only the first 15 lines of the input and then terminate. E.g.:
sed 15q file
or
some_command | sed 15q
So, when put in a shell script and then called (with filename arg or using stdin):
sed ${1}q $2
is like a specific use of the head command [1]; it prints the first n ($1) lines of the standard input or of the filename argument given - where the value of $1 comes from the first command-line argument passed to the script.
[1] In fact, on earlier Unix versions I worked on (which did not have the head command, IIRC), I used to use this sed command in a script called head - similar to tail.
And I also had a script called body :) to complement head and tail, with the appropriate invocation of sed. It takes two command-line arguments ($1 and $2) and prints (only) the lines in that line number range, from the input.
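A sketch of such a body script (the argument handling here is assumed, not the original):

#!/bin/sh
# usage: body first last [file ...]
first=$1
last=$2
shift 2
sed -n "${first},${last}p" "$@"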
I did not know gawk had networking support. I wonder if it could be used on network traffic on the fly, sort of like iRules on an F5. Thank you for sharing!
"sed is useful for making small, line-wise tweaks to text."
Couldn't agree more!
A great example of this was using sed (surprised it wasn't mentioned) with the -i flag and 's///g' expressions while "cleaning" hundreds (seriously) of HTML/PHP files of injected content at a shared hosting provider.
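A hypothetical example of that sort of cleanup (the injected markup and paths here are invented):

find . \( -name '*.php' -o -name '*.html' \) -exec \
    sed -i 's|<script src="http://bad\.example/x\.js"></script>||g' {} +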
Honestly, that makes sense - doing multiline replaces with sed isn't very convenient (I believe it's possible if you replace newlines with NULL).
I guess I'll probably learn awk then, it can't be that hard with the examples from this repo^^
I use awk every day because I need state (I work with text files full of sections and subsections) but I am sure that there has to be something better out there.
What is the definitive tool to process text? Perl? Haskell? Some Lisp dialect?
Biased because I've used Perl for over 20 years, but yeah, that's clearly one of its core reasons to exist. Regular expressions built into the language syntax instead of as a library makes a big difference.
> What is the definitive tool to process text? Perl? Haskell? Some Lisp dialect?
Definitive? Being snarky, the one you have already installed and are familiar with. Like most I use Awk for one-liners, Perl if I need a little more or better regexes in a one- or two-liner. For the last several years I've been using TXR[1] if it gets complex. Lately I've been doing more fiddling with JSON than text and I'm using Ruby/pry and jq[2].
about a matter related to the hash bang line in the script.
TXR has a nice little hack (that apparently I invented) to implement the intent of "#!/usr/bin/env txr args ..." on systems where the hash bang mechanism supports only one argument after the interpreter name.
Perl has an intangible "write once" property: since it allows for writing extremely sloppy code under the "there is more than one way to do it!" mantra, nobody, including the original author can debug it afterwards. Not even with the built in Perl debugger. Perl encourages horrible spaghetti code.
In the interest of fair and accurate disclosure, I earned my bread for 3.5 years debugging Perl code for a living and I've also had formal education in Perl programming at the university. I would never want to do that again.
I spent my first 2.5 years out of college working on legacy Perl code and I cannot agree. Perl is a very nice language if you follow a coding style, and really, any language gets ugly pretty quickly if you don't. There's this adage that "some developers can write C code in any language", and it's probably similarly true that some developers can write Perl one-liners in any language. (In that legacy Perl codebase that I maintained, one of the developers was clearly writing Fortran code in Perl. He was doing everything with nested loops over multi-dimensional integer arrays.)
I experimented a bit with writing Erlang-style code in Perl. Wasn't terribly successful; pattern matching, even with regular expressions built into the language, is a fairly tough feature to emulate.
The problem is that with regexps you're generally still doing text matching, which is inefficient and error prone. Perl's default exception mechanism allows text-based errors as well, so you end up doing text matching there too if you use exceptions and haven't decided on (and strictly used) exception objects by default; and even then you need to deal with strings as you encounter them, such as by promoting them to an exception object of some sort. Objects at least allow you to definitively match on types. Perl's (now) experimental given/when constructs and smartmatch operator would help with this, but they've been plagued with problems for a long time now (or at a minimum are still not seen as production ready).
If I remember correctly, I believe perl was created to combine the power of sed and awk into a single tool, so you didn't have to keep switching back and forth between the two.
Personally, I would suggest AWK for one- or two-liners when doing some one-time data transform tasks. Anything more complicated and I would suggest a more "fully-fledged" scripting language.
I used to regularly do ad-hoc text processing, typically on a 24 hour log of GPS data (on an early-2000s-era computer). Surprisingly enough, awk is many times faster than cut for any data set big enough for you to notice time passing.
If you already know it, then yes, it's still handy sometimes. And it's a small language, unlike Perl.
OTOH I'm not sure I'd recommend bothering to learn it. Python is more verbose in Awk's domain, but not by so much as makes a huge difference, except at the scale of one-liners. (Or a-few-liners, at least.)
Another reason to learn it: the AWK book (by Aho, Kernighan, and Weinberger) is a great very short intro to the spirit of Unix-style coding. You could think of learning Awk as just the price of admission to that intro, paid along the way.
I'll admit that I don't know perl, which I believe is the typical sed+awk replacement.
According to my potentially miscalibrated gut, referencing a "fully-fledged" scripting language in a shell script or at the command line is an indication that you should probably just be working in that environment in the first place.
There will be exceptions, it depends chiefly on the problem being solved, but overall I prefer my shell scripts to reference utilities with very specific purposes. It feels UNIXier that way.
It's more powerful than sed while being much easier and quicker to use than perl or python, so yes. I use it quite often just as a filter. sed is still good for some things.
These are the best stories on HN and why I subscribed here in the first place. I have often seen awk used on SO, but I've always put it off as something to learn later. Finally today I have some basic understanding of awk, and this is really great stuff! I did get by with Perl, but this is definitely more handy, and the example approach to teaching it makes it super easy to understand!
awk is what got me into web programming around 1994. I was working at a GE subsidiary and all the documentation for the RTOS I was working on was printed in huge binders from actively maintained Interleaf documents. Once I found the SGML source documents on the server it only took a few hours to learn enough awk to convert the SGML into a fully interlinked set of HTML documents with a table of contents. Granted SGML to HTML is not that hard but it was fun and useful and much nicer to search as opposed to laying out a bunch of binders on my cube's desk.
The advantage of awk is that it is faster in some cases and a little more convenient for simple command-lines due to automatic field splitting. The disadvantage is that it swallows errors silently even more than perl (I think).
Yep, Perl* is a great Swiss army knife that rendered sed and awk obsolete -- I am old enough (50) to have been through this progression. It's funny to see awk and sed come back around.
(It's a little like functional programming being rediscovered when Lisp has been around since the dawn of time).
* I think Perl got carried away when it added objects and folks started writing large programs with it (although I have written some large scripts for biologists doing genetic studies -- which is an interesting and popular use case -- there are Google groups and O'Reilly books focused on it).
Also, Larry Wall really jumped the shark with Perl 6.
20 years ago, when Python was not pre-installed on most systems, awk/gawk used to be my Swiss army knife. My first real programming job was translating a 1000 LOC awk program into C.
The syntax was simpler than ed's, and simpler than getting a combination of grep, uniq, cut, etc. correct.
It's still extremely useful for those of us in industries where change comes veeerrrryyyy slooowwwly. For instance, most production servers I work with are still on Python 1.5.2 if they have Python at all - but awk, that I can depend on!
And on the other end of the spectrum, it's becoming relevant again in container ops. A random container is not likely to have a Python, Ruby or even Perl, but if it has a shell, it most certainly also has an awk. Even Alpine has one in their busybox.
TXR's (awk) macro has analogs for all salient POSIX Awk features and most GNU Awk extensions. (Of course, not semantic cruft like the weak type system, or uninitialized variables serving as zero in arithmetic.)
Plus:
* You can embed (awk ...) expressions anywhere, including other (awk ...) expressions.
* You can capture a delimited continuation (awk ...) and yield out of there.
* It supports richer range expressions than Awk. Range expressions combine with other range expressions unlike in Awk, so that you can express a range which spans from one range to another. Also, there are variations of the operator to exclude either endpoint of the range: rng, -rng, rng- and -rng-.
* You can "awk" over a list of strings, possibly an infinitely lazy one.
1> (awk (:inputs '("a" "b") '("c" "d"))
        (t (prn nr fnr rec)))
1 1 a
2 2 b
3 1 c
4 2 d
nil
* It has a return value: whatever the last :end returns, or else nil:
1> (awk (:end 42) (:end 43))
[Ctrl-D]
43
Build a list from the first fields of /etc/passwd:
Type conversion of fields (which are just strings) is achieved by an elegant operator, fconv, which takes a condensed notation such as (fconv i : r : xz), meaning: convert the first field to a decimal integer, the last field to a hexadecimal integer, and the fields in between to reals. The xz means that if the last field is invalid, it gets converted to zero rather than nil. These letters are just the names of lexical functions available in the awk scope, rather than built-in fconv behaviors.
From what I know (which is not well sourced), the big problem with K is that it didn't have an open and accessible interpreter for a long time, and that hampered adoption. I know 5-6 years ago I was interested in it, but couldn't find any interpreters that were free for both non-commercial and commercial use (it was mostly work-related interest), and didn't stumble across Kona[1] at the time. Now it's on my list of languages to look into again, but that's a long list.