Learn to Process Text in Linux Using Grep, Sed, and Awk (linode.com)
68 points by qlacus on Jan 6, 2023 | hide | past | favorite | 17 comments


I feel like I have a weird relationship with these command line tools. They're obviously powerful, and grep in particular can be a huge time saver for quickly processing a file.

Whenever I think sed or awk would be right for a problem, I find myself an hour later looking at a clever one-liner and thinking "I should have just used Python": it's untestable and unmaintainable, and I'll come back the next day and not be able to read it.

Are there people out there for whom these tools are daily drivers and genuinely part of their core toolset, or are they mostly hobbyist things that are just neat to know, like ham radio operation?


I've probably used awk 10-20 times a day consistently (and some days upwards of 100 times) for the last 15 years, despite having a full complement (and reasonable knowledge) of Python, Pandas, etc.

If I'm going to use the tool multiple times, it lands in Python, gets checked in, and might even get unit tests if I'm feeling energetic. But there are tons of times a day when I want to ask a question of a 5 GB text file that I can pound out in under 60 seconds with a bit of piped awk/sed/tr and some bash looping on the results.
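For example, a hedged sketch of that kind of question (the log file and its column layout are made up): "which client IPs hit us the most?"

    # hypothetical log: client IP in column 1
    awk '{print $1}' access.log | sort | uniq -c | sort -rn | head
    # a bit of bash looping on the results
    for ip in $(awk '{print $1}' access.log | sort -u | head -5); do
      echo "$ip: $(grep -cF "$ip" access.log) requests"
    done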


I have a ton of custom log monitoring written in awk/sed because it's way faster to process it that way in shell scripts than it would be in python. The key thing is that it doesn't need to hold the data in memory and it processes text as a stream by default, which means you're just zipping the log files through the processor cache and outputting the result to a file to get emailed off.

It's important to realize that when you have gigabytes of data to analyze, you can go really fast this way. Python sure does have stuff like StringIO, but it just doesn't feel like the right kind of tool for that job. A ton of data out there is just columnar, like CSV or log files, and people have been using shell tools to process it for decades.
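A hedged sketch of that kind of single-pass columnar processing (the file name and column positions are made up):

    # hypothetical CSV: HTTP status in column 2, response bytes in column 3
    awk -F, '$2 == 500 { errors++ } { bytes += $3 } END { print errors " errors, " bytes " bytes" }' requests.csv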

If you're having a hard time reading your complicated awk one-liners, I suggest pretty-printing them. [1] Plenty of people use awk in production by writing their scripts in files.

[1] https://stackoverflow.com/questions/55745956/is-it-possible-...
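For instance (the data file and field positions here are made up), the same filter as a cryptic one-liner and as a pretty-printed script file:

    # one-liner form
    awk -F, '$3>1000{n++;s+=$3}END{print n,s}' requests.csv

    # big_requests.awk -- same logic spread out; run as: awk -f big_requests.awk requests.csv
    BEGIN { FS = "," }
    $3 > 1000 {
        count++
        total += $3
    }
    END { print count, total }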


I use sed and awk fairly frequently (and I'm familiar with the tools to the point where I got half a dozen problems into AoC with just sed once).

Most of the sed and awk programs I write aren't supposed to be maintainable; they're typically one-off, for-the-moment stuff. I'll grant it's append-only code, but if I need to do the same thing later (or something adjacent) I'll typically just write another one-liner, since there's no real effort involved.

It's pretty much always more effort to write an actual script in some language like python, since it needs to be in a file and have a name and take arguments and so on.


My approach with these tools is to use them for speed, and to resist the (very strong) urge to golf with them.

So I always start with `cat filename | some tool`[1] and refine until the very moment that I think "damn, I can't remember how to do that with sed/awk/jq/xsv/..." and end my pipeline feeding into Python. There's no law against it! And on slow days I go back and figure out how I can do what I want with the other tools.

[1]: Yes, yes, UUOC[2], but I don't care!

[2]: https://groups.google.com/g/comp.unix.shell/c/532AcI3-zs4/m/...
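A minimal sketch of that workflow (the file, columns, and question are all hypothetical): start with cat, refine with the usual tools, and hand the last step to Python when I can't remember the incantation:

    cat orders.csv | grep -v '^#' | awk -F, '{ print $2 }' \
      | python3 -c 'import sys, collections; print(collections.Counter(map(str.strip, sys.stdin)).most_common(5))'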


I sorta, kinda agree. Tools written in AWK (and friends) are indeed somewhat unmaintainable, but they're really close to being just right for a LOT of applications. The vnlog toolkit (https://github.com/dkogan/vnlog) adds just a little bit of syntactic sugar to the usual command-line tools to make processing scripts robust and easy to read and write. This was not my intent initially, but I now do most of my data processing with the shell and vnl-wrapped awk (and sort, join, ...). It's really nice. If you write stuff in awk, you should check it out. (Disclaimer: I'm the author.)


Not just using it daily, but I have scripts I've written literally a decade ago, and which haven't been modified at all, which still work just as well as the day I wrote them. Whereas there's a whole plethora of Python projects I've had to abandon in that time because they were never ported to Python 3.


"Whenever I think Sed or Awk would be correct for a problem..." is a good time to share the problem with HN.

Generally, that rarely happens. We almost never see someone present a text processing problem (input.txt, expected_output.txt) and ask to see solutions using UNIX utilities, so that they could be compared with python, for example. Instead, we see endless criticisms of UNIX utilities, without any specific examples (input.txt, expected_output.txt) to illustrate the unsubstantiated arguments being made.

Without a working example to illustrate it, a criticism comes across as nothing more than a worthless opinion. Programmers commenting online provide a lifetime supply of such opinions.

For me, sed is a "daily driver" for text processing. There are numerous small scripts that I use every day that feature sed. I have a folder with hundreds of scripts that use sed. Some use grep and awk as well. However the best program for text processing for me is neither grep, sed nor awk. It's flex.

If the argument is that python looks like "pseudocode" and sed does not, then I agree. No contest. No examples needed. Someone else long ago decided what "pseudocode" should look like. UNIX utility authors pursued different styles.

If the argument is that python has the best or most libraries, again I would not contest that.

However if the argument is something like "python solutions for text processing

(a) are faster to write,

(b) are smaller,

(c) run faster,

(d) are more robust/reliable,

(e) are quicker/easier to edit ("maintain"),

(f) use less memory/CPU,

(g) have fewer dependencies, or

(h) are easier to read",

then we need to look at the problem in question (input.txt, expected_output.txt). Otherwise we cannot have a meaningful discussion.

Python is just too slow for me. It may be fast enough for someone else, but not for me. Being forced to use python due to a job requirement, using python as a result of pressure from other programmers, or using python because it was "recommended", is not the same as first learning UNIX utilities and then evaluating python. I learned sed and other UNIX utilities first and so any potential "replacement" must offer the same benefits.


Can you post some of your scripts you use? I’d love to get some inspiration.


If you provide a text-processing problem I will try to provide a solution.

Meanwhile here is one. Insert stdin or a src file at the top of a dst file.

     # insert stdin (1 arg) or a src file (2 args) at the top of dst ($1)
     test ! -h $1||exec echo $0: error: symlink
     case $# in :)
     ;;1) x=$(sed -n '$=' $1)    # line count of dst
     test $x -gt 1||exec echo usage: $0 dst src
     sed -i -e1r/dev/stdin -e1N $1    # r queues stdin; N flushes it before line 1 prints
     ;;2) printf '0r '$2'\nwq\n'|ed -s $1    # ed: read src before line 1, write, quit
     esac
For example,

     1.sh 1.c < 1.h
     echo 93.184.216.34 example.com|1.sh /etc/hosts 
or

     1.sh 1.c 1.h
     1.sh /etc/hosts map-ip-host.txt
How is this done in Python?


Here is another variation that does not use "sed -i". It uses a favourite hex editor called ired.

      test ! -h $1||exec echo $0: error: symlink
      case $# in :)
      ;;1) x=$(sed -n '$=' $1)
      test $x -gt 1||exec echo usage: $0 dst src
      od -tx1 -An|sed 's/^/w /'|ired -n $1 /dev/stdin
      ;;2) printf '0r '$2'\nwq\n'|ed -s $1
      esac


Two more, edited for HN of course. These are scripts/snippets that would often be used in or by other scripts. Unlike Python, these scripts will keep working for many years with zero "maintenance". They will probably still be working long after I have passed away.

Task 1: Transform (a) the BIND format of stub resolver output, i.e., the format used by "dig" and many other stub resolvers, to (b) some other format, like HOSTS file, BIND/tinydns zone file, haproxy map file, etc. The input would typically be catenated stub resolver output for hundreds to thousands of domains. The epoch program is three lines of C.

NB. For large input on Linux, dash is significantly faster than bash. I wrote a C program that does the job of this script, faster than dash, and much faster than python. However I still prefer the shell for testing ideas, quickly.

   EPOCH=$(epoch);
   tr -d '\12'|tr ';' '\12' \
   |sed -n '/ANSWER SECTION/{s/ ANSWER SECTION://;
   s/\.[^-0-9a-zA-Z].*IN[^-0-9a-zA-Z]*A[^-0-9a-zA-Z]/ 1 IN A /;
   # list domains to exclude
   /www.google.com/d;
   /^$/d;/^.$/d;p;}' \
   |{
   exec 2>/dev/null 3>&3 4>&4 5>&5 6>&6 7>&7 8>&8 9>&9;
   while read NAME TTL CLASS TYPE IPADDR ;do 
   echo $NAME $TTL $CLASS $TYPE $IPADDR;
   echo $NAME 1 $CLASS $TYPE $IPADDR  >&3;
   echo $NAME $IPADDR  >&4;
   echo $NAME $IPADDR \# $EPOCH >&5;
   echo =$NAME:$IPADDR:1 >&6;
    #_IP_()
    # { 
    # echo $IPADDR|grepcidr -f $1-ips.txt >/dev/null;
    # }
    #if _IP_ cloudflare ... 
    #if _IP_ aws ...
    #if _IP_ cloudfront ...
    #if _IP_ fastly ...
    #if _IP_ akamai ...
    # etc.
   done;
   }
For example,

    drill example.com|1.sh 3>>1.zone 
    kdig example.com|1.sh 4>>/etc/hosts 
    drill example.com|1.sh 5>>1.map
    kdig example.com|1.sh 6>>data
Task 2: Extract and transform CloudFront domains and their CNAMEs from BIND format stub resolver output to haproxy map format.

    grep -A1 ANSWER.SEC|tr '\11' '\40'|sed -n '/cloudfront.net\.$/{s/\. .* / /;s/\.$//p;}' 
For example,

    drill blogs.aws.amazon.com|2.sh >>2.map


I also don't enjoy coming back to large sed expressions I wrote previously but I'm starting to write them in a way that makes that much easier. I'm on my phone so this will probably be syntactically wrong but hopefully the gist is evident. Basically, load the expression from a commented 'file':

    sed --file=<(grep -v -E '^\s*#' <<-'SED_EXPR'
        # do a string replacement
        s#target#replacement#g

        # then do something else
        ..
    SED_EXPR
    ) ./target


I have a couple of basics I know by heart in awk, sed, and grep, and those enable me to quickly do all sorts of things I need for large datasets on a daily basis: cat user_uuid_list | sed s/$/,/ sort of stuff, or anything tabular-looking where I want specific columns (awk). Maybe there are better ways, but I don't really care, as this works for me, is available everywhere, and I will never forget how to do it. I also use python for lots of stuff!
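For example (file names and column positions are made up):

    cat user_uuid_list | sed 's/$/,/'        # append a comma to every uuid
    awk '{ print $1, $3 }' some_report.txt   # pull columns 1 and 3 out of tabular output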


I use sed a lot in shell scripts; it's very powerful for massaging output from one stream and feeding it into something else.
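e.g. something like this (the tool and its version banner are made up for illustration):

    # pull a version number out of a --version banner and reuse it downstream
    ver=$(mytool --version | sed -n 's/^mytool version \([0-9.][0-9.]*\).*/\1/p')
    echo "packaging release $ver"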


Maybe someone will come up with a wrapper command-line interface where we can express the desired outcome and GPT-3 / ChatGPT produces (then executes) the "correct" commands. I agree with the sentiment about the power of grep/awk/sed (and use them in that order), but I also would prefer Python for anything I need for longer than an hour. Having a natural-language command -> grep/awk/sed pipeline -> saved as a CLI shortcut could be cool.




