The Awk state machine parser pattern (2018) (two-wrongs.com)
169 points by Tomte on Jan 31, 2022 | 27 comments



A state machine of the form

    awk '/start/{f=1} f; /end/{f=0}'
is commonly used to work with text bounded by unique markers. You can also use `awk '/start/,/end/'`, but the state machine format is easily adapted for more variations, like excluding either or both of the markers.
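For instance, minimal sketches of those variants (reordering the rules decides whether the marker lines themselves get printed):

    awk '/start/{f=1} f; /end/{f=0}'   # include both markers
    awk '/start/{f=1} /end/{f=0} f'    # exclude the end marker
    awk 'f; /start/{f=1} /end/{f=0}'   # exclude the start marker
    awk '/end/{f=0} f; /start/{f=1}'   # exclude both markers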

Here's a chapter from my GNU awk one-liners book with more such examples: https://learnbyexample.github.io/learn_gnuawk/processing-mul...


Good book for all levels; I recall stealing several snippets for my cheatsheet.


I always forget that omitting the "action" after a pattern prints the line by default. Very useful tip! I'll definitely pick up a copy of this book.


I used a similar technique for my awk markdown parser: https://github.com/yiyus/md2html.awk

An awk state machine is a quite straightforward way to deal with data like this log file. It is less clear that this is the best way to write a relatively large piece of software, like a markdown parser. When I wrote md2html.awk in 2009, the standard md parser was the original one by John Gruber, written in Perl, so it actually was an improvement in code clarity, performance, and portability (we had no perl in Plan 9!). Nowadays, though, it is easy to find much better solutions.


Wow, I've bookmarked this.

I also wrote a Markdown-to-HTML converter in Awk once, though my purpose was to convert Markdown comments in source code into documentation. I started with an Awk script to extract the comments and then systematically added Markdown features. My end result isn't very elegant, though.

https://github.com/wernsey/d.awk/blob/master/d.awk


Glad I’m not the only one crazy enough to write long AWK parsers. Here’s my tool to convert “feature tables” into “bed” files, both formats that describe genomes; for some reason NCBI uses the former, even though it’s totally useless. https://github.com/ryandward/tbl2bed/blob/main/tbl2bed.awk


Not to disparage the nice awk script, but reading from /sys/class/hwmon/* seems more sensible...

(Which is my way of saying, rather than writing a script like this, I'd spend some time to get the data machine readable in the first place — or even just dig up where to already find it in a machine readable form.)


Reading a file from sysfs is great and script-friendly, sure. OTOH, finding the right file to read is less straightforward.

For one, depending on kernel version and compile options, temperature/voltage/rpm files could be found under /sys/class/hwmon, or /sys/devices/virtual, or /sys/devices/platform/soc. And then, say, the script found a dozen of those "temp", or "temp*_input", or "microvolts" files. How do you figure out which one is for the CPU, motherboard, battery, PSU, air intake? Probably with extra logic, reading the corresponding "temp*_label" files, if those even exist? Parsing /sys/firmware/devicetree? Taking hints from the parts of the pathname where such files are found?

lm_sensors is no silver bullet here either, but at least it does a passable job discovering/labeling sensors most of the time.
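A rough sketch of that discovery dance, assuming the conventional hwmon layout (a "name" file per chip, "temp*_input" values in millidegrees Celsius, optional "temp*_label" files):

  for f in /sys/class/hwmon/hwmon*/temp*_input; do
    chip=$(cat "${f%/*}/name")        # chip name, e.g. k10temp
    lbl=${f%_input}_label             # matching label file, if present
    [ -r "$lbl" ] && name=$(cat "$lbl") || name=${f##*/}
    awk -v c="$chip" -v n="$name" '{ printf "%s %s: %.1f°C\n", c, n, $1/1000 }' "$f"
  done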


Well, in that case:

  # sensors --help
  Usage: sensors [OPTION]... [CHIP]...
  …
    -j                    Json output
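Which makes ad-hoc extraction easy if jq happens to be installed; a hypothetical pipeline (the chip key is taken from the sample data further down this thread, and will vary by machine):

  $ sensors -j | jq 'keys'
  $ sensors -j | jq '."k10temp-pci-00c3".temp1'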


I have to say, that's the most readable and understandable Awk program I've seen.

Does anyone know if there's a repository of similarly literate awk scripts?


I’ve found that AWK is actually surprisingly readable, as long as you understand the execution model. I think people tend to think of it (with some justification) as in the same vein as Perl, but it isn’t nearly as cryptic on the surface. The syntax is just “C-style” with dynamic typing and built-in regular expressions.

I have a couple of AWK scripts to handle my personal finances (basically, consuming various bank/credit card statements and turning them to Ledger files) and it’s just the perfect language for that kind of task. My scripts look fairly similar to the examples in the blog post, they also use the same state-machine trick. With the possible exception of Perl/Raku, it’s my favourite language for that kind of thing.
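As a flavour of it (not my actual script, just a minimal sketch assuming a hypothetical "date,payee,amount" CSV export):

  awk -F, 'NR > 1 {   # skip the header row
    printf "%s %s\n    Expenses:Unknown    $%s\n    Assets:Checking\n\n", $1, $2, $3
  }' statement.csv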


Have a look at the classic "The AWK Programming Language" by the A, W, and K in awk. It's full of great examples.

https://ia803404.us.archive.org/0/items/pdfy-MgN0H1joIoDVoIC...


There used to be very good examples on awk.info. The domain is for sale now, but you can get all the old content from archive.org, and it is still very relevant.



nice!


I don't know if this is an idiomatic way of doing awk. It is just a port of a Python script to awk.

https://github.com/berry-thawson/diff2html

This is my first attempt at writing an awk script. I'd like to know how readable it is.

Edits: added a new line, changed some words.


This is basically what Perl used to be used for too.


If you think of Ruby as a more readable/maintainable Perl, it's much better suited to these text processing tasks.

Ruby even supports Perl regular expressions which are more powerful and convenient than Awk's.

Some version of Ruby is usually in the base install of every Linux system (perl5 is more ubiquitous, but much more cryptic).
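Ruby even inherits awk's range pattern: with -n, the flip-flop operator tests each input line, so the range one-liner upthread becomes (a sketch):

  $ ruby -ne 'print if /start/../end/' file.txt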


>Ruby even supports Perl regular expressions

No, Ruby Regexp is based on the https://github.com/k-takata/Onigmo library. There are plenty of differences compared to Perl, for example `^` and `$` anchors always match start/end of lines without needing a flag, subexpression syntax uses `\g` instead of `(?N)` and so on.
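A quick illustration of the anchor difference, assuming ruby and perl binaries on PATH:

  $ ruby -e 'p("a\nb" =~ /^b/)'                # => 2; ^ matches at a line start
  $ perl -e 'print(("a\nb" =~ /^b/) ? 1 : 0)'  # 0; Perl needs the /m flag here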


I'm a bit upset to say that this is one of the few times I've seen AWK code outside of a one-liner (some of those one-liners are pretty beastly, but still).

It reads pretty well, and now I'm interested in using it a bit more for my scripts. Any good AWK examples/resources anyone can recommend?


The AWK book, written by the A, W, and K of the program's name, has always been considered the bible.

Edit: misattributed to K&R.


This one is good; clear explanations of the concepts and various useful examples to crib from: https://www.grymoire.com/Unix/Awk.html

Their document covering sed is also excellent.


AWK supports conditional branching and switching, so you can represent nested states as well. I wouldn't recommend it beyond depth ~1, though... Use a proper language for that.
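A minimal sketch of one nested level, with made-up begin/end markers:

  awk '
    /^BEGIN_OUTER$/ { outer = 1; next }
    /^END_OUTER$/   { outer = inner = 0; next }
    outer && /^BEGIN_INNER$/ { inner = 1; next }
    outer && /^END_INNER$/   { inner = 0; next }
    outer && inner   # print only lines inside both levels
  '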


TXR:

  @(bind idtab @(relate '("radeon" "k10temp")
                        '("GPU"    "CPU")
                        "SYS"))
  @(bind thash @(hash))
  @(repeat)
  @id-@bus-@code
  Adapter: @name
  @  (repeat)
  temp@num: @{temp}°C@nil
  @    (do (set [thash `@[idtab id]_@num`] temp))
  @  (until)
  
  @  (end)
  @(end)
  @(do (dohash (tag temp thash)
         (sh `echo gmetric -t uint16 -u Celsius -n @tag -v @temp`)))


  $ txr temp.txr temp.dat 
  gmetric -t uint16 -u Celsius -n SYS_2 -v +43.0
  gmetric -t uint16 -u Celsius -n SYS_1 -v +31.0
  gmetric -t uint16 -u Celsius -n CPU_1 -v +36.8
  gmetric -t uint16 -u Celsius -n GPU_1 -v +50.5
  gmetric -t uint16 -u Celsius -n SYS_3 -v +38.0
That's based on what I think the Awk script is stuffing into the associative array. I was not able to run the code as pasted verbatim from the site:

  $ gawk -f temp.awk temp.dat 
  gawk: temp.awk:32:     temp = substr(
  gawk: temp.awk:32:                   ^ unexpected newline or end of string
  gawk: temp.awk:35:         matches[1, "length"]
  gawk: temp.awk:35:                             ^ unexpected newline or end of string
  gawk: temp.awk:35:     );
  gawk: temp.awk:35:     ^ 0 is invalid as number of arguments for substr

  $ mawk -f temp.awk temp.dat 
  mawk: temp.awk: line 30: regular expression compile failed (missing operand)
  +([0-9.]+)°C
  mawk: temp.awk: line 30: syntax error at or near ,
  mawk: temp.awk: line 31: missing ) near end of line
  mawk: temp.awk: line 32: syntax error at or near ,
  mawk: temp.awk: line 35: extra ')'
I suspect the author may have tweaked the code for presentation in the blog without rechecking that it still works. Or else it needs some specific implementation and version of awk, with specific command line arguments that are not given, unfortunately.


The blog's code works (with gawk) if some whitespace errors are fixed.

  $ gawk -f temp.awk temp.dat  # echo inserted into command
  gmetric -t uint16 -u Celsius -n GPU_1 -v 50.5
  gmetric -t uint16 -u Celsius -n SYS_1 -v 31.0
  gmetric -t uint16 -u Celsius -n SYS_2 -v 43.0
  gmetric -t uint16 -u Celsius -n SYS_3 -v 38.0
  gmetric -t uint16 -u Celsius -n CPU_1 -v 36.8

I think these newline sensitivities are a design flaw in Awk; I'm going to look into that and recommend changes to POSIX via the Austin Group mailing list, if it still exists.

Awk has some newline sensitivities due to the following ambiguities:

   condition             # condition with no action allowed: default { print } action
   { action }            # action with no condition allowed
   condition { action }  # both
Therefore, the following is not allowed (or, well, it is, but it codifies a separate condition with a default action, plus an unconditional action):

   condition 
   { action }
There can be no newline between a condition and the opening { of its action. And actions must be brace-enclosed.

And thus (IIRC) the awk lexical analyzer (in the original One True Awk implementation) returns an explicit newline token to the Yacc parser. In any phrase structure that doesn't deal with that token, a newline will cause a syntax error:

   function(     # no good
     arg
   )

   function("string "   # no good
            foo + bar
            " catenation")
When the lexer produces the token which is the opening brace of an action, it could shift into a freeform state, in which it consumes newlines internally. Then when the action is parsed, it can be returned to the newline-sensitive mode.

The newline sensitivities don't seem to serve a purpose in the C-like language within the actions.

That language also occurs outside of actions via the function construct:

  function whatever(...) {
  }
here the lexer would also be shifted into the freeform mode, as appropriate.
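In the meantime, the portable dodge for the blog's broken substr() calls is backslash continuation (POSIX only ignores a newline after tokens like comma, {, &&, ||, do, and else):

  awk 'BEGIN {
    s = substr( \
      "temperature", 1, 4)   # the backslash licenses the newline after "("
    print s                  # prints: temp
  }'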


My first thought when I saw this item in the HN list was TXR :)


Here is the data converted into JSON (keeping the values as strings until a post-processing pass):

  $ txr json.txr temp.dat 
  {"radeon-pci-0100":{"temp1":{"crit":120,"hyst":90,"value":50.5}},
   "f71889ed-isa-0480":{"temp2":{"hyst":77,"value":43,"high":85},"alarm":{"fan3":true,"fan2":true},
                        "sensor":{"crit":100,"name":"thermistor","hyst":92},"max-voltage":{"in1":2.04},
                        "temp1":{"hyst":81,"value":31,"high":85},"voltage":{"in5":1.23,"+3.3V":3.23,"in6":1.53,"in2":1.09,"Vbat":3.31,"in1":1.07,
                                                                            "in4":0.58,"3VSB":3.25,"in3":0.89},
                        "temp3":{"hyst":68,"value":38,"high":70},"rpm":{"fan3":0,"fan1":3978,"fan2":0}},
   "k10temp-pci-00c3":{"temp1":{"crit":80,"hyst":78,"value":36.8,"high":70}}}

Using just a straightforward approach of recognizing the cases that occur without trying to formally parse things. There is significant copy and paste between similar cases. I decided to use a post-processing pass on the dictionary to convert the numeric values to floating-point.

  @(bind dict @(hash))
  @(name file)
  @(repeat)
  @idstring
  Adapter: @adapter
  @  (collect :vars (entry))
  @    (line line)
  @    (assert error `unhandled stuff occurs at @file:@line`)
  @    (some)
  @{temp /temp\d+/}: @{val}°C  (crit = @{crit}°C,
                          hyst = @{hyst}°C)
  @      (bind entry @#J^{~temp : { "value" : ~val,
                                    "crit" : ~crit,
                                    "hyst" : ~hyst }})
  @    (or)
  @{temp /temp\d+/}: @{val}°C  (high = @{high}°C)
                         (crit = @{crit}°C,
                          hyst = @{hyst}°C)
  @      (bind entry @#J^{~temp : { "value" : ~val,
                                    "crit" : ~crit,
                                    "hyst" : ~hyst,
                                    "high" : ~high }})
  @    (or)
  @{temp /temp\d+/}: @{val}°C  (high = @{high}°C,
                          hyst = @{hyst}°C)
  @      (bind entry @#J^{~temp : { "value" : ~val,
                                    "high" : ~high,
                                    "hyst" : ~hyst }})
  @    (or)
                         (crit = @{crit}°C,
                          hyst = @{hyst}°C)
                          sensor = @sensor
  @      (bind entry @#J^{"sensor" : { "name" : ~sensor,
                                       "crit" : ~crit,
                                       "hyst" : ~hyst }})
  @    (or)
  @label: @voltage V
  @      (bind entry @#J^{"voltage" : {~label : ~voltage}})
  @    (or)
  @label: @voltage V (max = @max V)
  @      (bind entry @#J^{"voltage" : {~label : ~voltage},
                          "max-voltage" : {~label : ~max}})
  @    (or)
  @label: @rpm RPM
  @      (bind entry @#J^{"rpm" : {~label : ~rpm}})
  @    (or)
  @label: @rpm RPM ALARM
  @      (bind entry @#J^{"rpm" : {~label : ~rpm},
                          "alarm" : {~label : true}})
  @    (or)
  
  @      (bind entry @#J{})
  @    (end)
  @  (until)
  
  @  (end)
  @  (do (set [dict idstring]
              (reduce-left (op hash-uni @1 @2 hash-uni) entry #J{})))
  @(end)
  @(do
     (defun numify (dict)
       (dohash (k v dict dict)
         (typecase v
           (string (iflet ((f (tofloat v)))
                     (set [dict k] f)))
           (hash (numify v)))))
  
     (put-jsonl (numify dict)))



