awk is a powerful programming language used for text processing and manipulation in Unix/Linux environments. It's particularly well-suited for tasks involving structured text files, such as column-oriented data files or CSV files. It gets its name from the initials of its creators: Aho, Weinberger, and Kernighan.
How awk Works
When you run awk, you specify an awk program that tells awk what to do. The program consists of a series of rules. Each rule specifies one pattern to search for and one action to perform upon finding the pattern.
Syntactically, a rule consists of a pattern followed by an action. The action is enclosed in braces to separate it from the pattern. Newlines usually separate rules. Therefore, an awk program looks like this:
awk [options] 'program' input-file(s)
OR
awk [options] 'pattern { action }' input-file(s)
pattern: Specifies when the action should be performed. If omitted, the action is applied to every line.
action: What to do when a line matches the pattern. Actions are enclosed in braces {}. If omitted, the matching line is printed.
The single quotes around program keep the shell from interpreting any awk characters as special shell characters. The quotes also cause the shell to treat all of program as a single argument for awk, and allow program to be more than one line long.
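As a minimal sketch of the rule structure described above, the following feeds three made-up lines to awk. The sample text and the variable names input and output are assumptions for illustration, not part of mail-list.txt. The program spans several lines inside one pair of single quotes and contains a rule with a pattern and an action, a rule with only a pattern, and a rule with only an action.

```shell
# Hypothetical three-line input, used only for illustration.
input='apple 3
banana 7
cherry 5'

# A multi-line awk program inside single quotes, with three kinds of rules.
output=$(printf '%s\n' "$input" | awk '
/banana/ { print "found: " $1 }   # pattern + action
$2 > 4                            # pattern only: the default action prints the line
{ count++ }                       # action only: runs for every line
END { print "lines seen: " count }
')

printf '%s\n' "$output"
```

With this input, the first rule fires once, the second prints the two lines whose second field exceeds 4, and the last line reports how many lines were read.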
Benefits of AWK
Supports complex pattern-matching and processing.
Designed for efficient text processing on both small and large files.
Easy to write and understand one-liners for basic tasks.
Well suited to transforming data files.
Well suited to producing formatted reports.
Available by default on virtually all Unix-like systems, so no separate installation or setup is usually needed.
Variables in AWK
Variables in AWK play a crucial role in processing text and data. They are used to store temporary data, manipulate fields, control the flow of the program, and customize output. AWK variables can be user-defined or built-in, with the latter providing access to various useful pieces of information or functionality. Some of the most commonly used built-in variables are:
Variable | Description |
FS (Field Separator) | Controls how fields in a record (line) are separated. The default is whitespace. You can change it to parse CSV or other formats. |
OFS (Output Field Separator) | Specifies the separator to use when printing multiple fields with print. The default is a single space. |
RS (Record Separator) | Determines how records are separated in the input data. By default, it's a newline character, so if you do not change it, a record is one line of the input file. |
ORS (Output Record Separator) | The separator used when printing output records. By default, it's a newline. |
NF (Number of Fields) | Contains the number of fields in the current record. |
NR (Number of Records) | The total number of input records processed so far. |
FILENAME | The name of the current input file. |
$0 | Represents the entire current record. |
$1, $2, ..., $n | Represents the first, second, ..., nth field in the current record. |
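The following is a small sketch of how a few of the built-in variables in the table behave together. The colon-separated records and the example.com / example.edu addresses are made up for illustration. The -F option sets FS and the -v option assigns OFS before any input is read.

```shell
# Hypothetical colon-separated records, used only for illustration.
data='ana:ana@example.com:F
ben:ben@example.edu:A'

# -F: sets FS to a colon; -v OFS=... sets the output field separator.
# NR is the record number, NF the number of fields in that record.
report=$(printf '%s\n' "$data" | awk -F: -v OFS=' | ' '{ print NR, NF, $1, $2 }')

printf '%s\n' "$report"
```

Because the fields passed to print are separated by commas, awk joins them with OFS, so each output line shows the record number, the field count, and the first two fields separated by " | ".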
BEGIN and END - These are not variables but special pattern blocks. The BEGIN block is executed before any input is read, and the END block is executed after all input has been processed. These blocks are useful for initialization and summary tasks, respectively.
Arrays - AWK supports associative arrays, which can be indexed by string or number. Arrays are useful for collecting and organizing data dynamically during execution.
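To illustrate BEGIN, END, and associative arrays together, here is a hedged sketch that counts how often each relationship code appears. The four records and the array names seen and codes are assumptions made for this example; they are not taken from mail-list.txt.

```shell
# Hypothetical records: a name followed by a relationship code.
data='ana F
ben A
cleo F
dev R'

counts=$(printf '%s\n' "$data" | awk '
BEGIN { print "code count" }          # runs before any input is read
{ seen[$2]++ }                        # associative array keyed by the code
END {                                 # runs after the last record
    # Fixed key order, because "for (k in seen)" iterates in unspecified order.
    n = split("A F R", codes, " ")
    for (i = 1; i <= n; i++)
        print codes[i], seen[codes[i]] + 0
}')

printf '%s\n' "$counts"
```

Adding 0 to each array element forces a numeric result, so a code that never appears would be reported as 0 rather than as an empty string.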
Hands-on Exercise Overview
The main goal of this hands-on exercise is to learn how to use the awk utility to manipulate data.
The input file for the examples provided below is the mail-list.txt file, which represents a list of people's names together with their email addresses and information about those people. Each record contains the name of a person, their phone number, their email address, and a code for their relationship with the author of the list. The columns are aligned using spaces. An ‘A’ in the last column means the person is an acquaintance. An ‘F’ in the last column means the person is a friend. An ‘R’ means that the person is a relative.
Hands-on Exercise
Print the 1st and 3rd columns:
awk '{ print $1 "\t" $3}' mail-list.txt
Print lines that match a certain pattern. Search the input file mail-list.txt for the character string li:
awk '/li/ { print $0 }' mail-list.txt
When lines containing ‘li’ are found, they are printed because print $0 means print the current line. The slashes indicate that ‘li’ is the pattern to search for. This type of pattern is called a regular expression. The pattern is allowed to match parts of words.
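Because the pattern is tested against the whole record, it can match inside any field. The sketch below contrasts that with restricting the match to one field using the ~ operator. The three records are made up for illustration and are not the contents of mail-list.txt.

```shell
# Hypothetical records: a name, then a note. Not the real mail-list.txt.
data='Julie friend
Samuel relative
Martin lives-nearby'

# /li/ is tested against the whole record, so it matches inside any field.
whole=$(printf '%s\n' "$data" | awk '/li/ { print $0 }')

# $1 ~ /li/ tests only the first field.
first=$(printf '%s\n' "$data" | awk '$1 ~ /li/ { print $1 }')

printf 'whole-record matches:\n%s\n' "$whole"
printf 'first-field matches:\n%s\n' "$first"
```

Here the third record matches the whole-record pattern only because of the word "lives" in its second field, so it drops out once the match is limited to $1.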
Print the third column of the lines that match a specific pattern:
awk '/\.edu/ { print $3 }' mail-list.txt
When awk locates a pattern match, it prints the whole record by default. You can change that default by giving an action that prints only certain fields, as print $3 does here. The backslash makes the dot literal; an unescaped dot in a regular expression matches any single character.
Print every line that is longer than 55 characters:
awk 'length($0) > 55' mail-list.txt
awk has a built-in length function that returns the length of a string. In this command, $0 holds the entire line, and because there is no action block, the default action (print) is taken. Therefore, if a line in mail-list.txt has more than 55 characters, the comparison evaluates to true and the line is printed.
Print the total number of bytes used by mail-list.txt:
ls -l mail-list.txt | awk '{ x += $5 } END { print "total bytes: " x }'
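The same accumulate-then-report idiom works on any numeric column. The sketch below runs it over a made-up two-line listing written in the usual ls -l layout, where the size is the fifth field; the file names and sizes are assumptions, and real ls output can differ between systems.

```shell
# Hypothetical listing in the usual "ls -l" layout; the sizes are made up.
listing='-rw-r--r-- 1 user group 120 Jan  1 10:00 a.txt
-rw-r--r-- 1 user group 380 Jan  1 10:05 b.txt'

# x starts out empty (treated as 0) and accumulates field 5 of every line.
total=$(printf '%s\n' "$listing" | awk '{ x += $5 } END { print "total bytes: " x }')

printf '%s\n' "$total"
```

No initialization is needed because an unset awk variable acts as 0 in arithmetic, and the END rule prints the sum once all lines have been read.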
Count the lines in mail-list.txt:
awk 'END { print NR }' mail-list.txt
END specifies that the action should be performed after all lines are processed. NR is a built-in variable that keeps track of the number of records (lines) read.
Print the even-numbered lines in the mail-list.txt file:
awk 'NR % 2 == 0' mail-list.txt
- If you used the expression ‘NR % 2 == 1’ instead, the program would print the odd-numbered lines.
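To see both variants side by side without depending on mail-list.txt, this sketch generates six numbered lines on the fly and filters them each way. The generated input and the variable names are assumptions made for the example.

```shell
# Six numbered lines generated on the fly, so no input file is needed.
lines=$(printf 'line %s\n' 1 2 3 4 5 6)

# A pattern with no action prints every record for which it is true.
even=$(printf '%s\n' "$lines" | awk 'NR % 2 == 0')
odd=$(printf '%s\n' "$lines" | awk 'NR % 2 == 1')

printf 'even-numbered:\n%s\n' "$even"
printf 'odd-numbered:\n%s\n' "$odd"
```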