AWK

Appeared in:
1977
File extensions:
.awk

AWK (the name is derived from the surnames of its authors) is an interpreted scripting language designed for processing text data.

AWK was created in 1977 at Bell Labs by Alfred Aho, Peter Weinberger and Brian Kernighan. This version of the language was included in Unix V7 (1979). In 1988 the book “The AWK Programming Language” was published; it described a new dialect of the language, which was included in Unix System V. The new dialect was not fully compatible with the old one, so to avoid confusion the two are referred to as oawk (old awk) and nawk (new awk), respectively. nawk was released under a free software license in 1996 and is maintained by Kernighan. The language is standardized in IEEE Std 1003.1-2004 (POSIX).

Nowadays AWK is a mandatory utility on POSIX-conforming systems: along with the Unix shell it is part of every standard Unix environment. Implementations exist for most major platforms.

An AWK program takes a stream of text data (from a file or the console) as input and processes it record by record. The program is a series of rules of the form pattern {action}, where pattern is an expression (typically a regular expression) and action is a series of commands. A program can also contain user-defined functions.
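The rule structure described above can be sketched as follows. This is a minimal illustration, not a canonical program: the shell wrapper shout_errors and the AWK function tag are hypothetical names, and the wrapper exists only so that the example carries its own input.

```shell
# Hypothetical wrapper; the AWK program inside the quotes is the point.
shout_errors() {
    awk '
    # User-defined function: returns its argument in upper case, in brackets.
    function tag(s) { return "[" toupper(s) "]" }

    # Rule: the pattern is a regular expression, the action runs for each matching record.
    /error/ { print tag($0) }
    '
}

printf 'ok\ndisk error\nok\n' | shout_errors
```

Only the second input line matches the pattern, so only that line is passed to the action.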

When the input stream is processed, each record (by default, a line) is matched against each pattern of the program, and the actions of the matching patterns are executed. A pattern can be given as:

  • a single expression: the action is executed for every line for which the expression evaluates to true.
  • a pair of expressions separated by a comma: the action is executed for every line in a range that starts with a line for which the first expression is true and ends with a line for which the second expression is true.
  • one of the two special patterns BEGIN and END: their actions are executed before and after processing the input, respectively.
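A small sketch of the three kinds of patterns, assuming a five-line sample input; the wrapper name show_patterns is hypothetical.

```shell
# Hypothetical wrapper that runs one AWK program using each kind of pattern.
show_patterns() {
    awk '
    BEGIN       { print "start" }         # runs before any input is read
    NR == 1     { print "first: " $0 }    # single expression
    /^a/, /^c/  { print "range: " $0 }    # from a line starting with a through one starting with c
    END         { print "lines: " NR }    # runs after the last record
    '
}

printf 'x\na\nb\nc\nd\n' | show_patterns
```

With this input the range rule fires for the lines a, b and c, and the END action reports five lines.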

AWK lets you process each line either as a string (stored in the variable $0) or as a record with a set of fields (stored in $1, $2, ...). The language has other built-in variables as well: NR, the number of records read so far; NF, the number of fields in the current record; and so on. System variables tune the data processing mode, for example by setting the record and field separators RS and FS (the defaults are a newline and a single space, the latter meaning that fields are separated by any run of whitespace).
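The built-in variables can be seen at work in a short sketch. The colon-separated sample data and the wrapper name describe_records are assumptions made for illustration.

```shell
# Hypothetical wrapper: reports record number, field count, first and last field.
describe_records() {
    awk '
    BEGIN { FS = ":" }    # split fields on colons instead of whitespace
    { print NR ": " NF " fields, first=" $1 ", last=" $NF }
    '
}

printf 'root:0:/root\nalice:1000\n' | describe_records
```

Here $NF refers to the last field of the current record, because NF holds the number of fields.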

AWK is a contextually typed language: scalar values are treated as strings or as numbers depending on the context in which they are used (for example, as numbers in arithmetic expressions). The main data structure of the language is the associative array (indexed by strings).
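As a rough sketch of both points, the following sums the second field per key in an associative array and then uses one value in a numeric and in a string context. The wrapper name sum_by_key and the sample data are hypothetical.

```shell
# Hypothetical wrapper: totals the second field per key found in the first field.
sum_by_key() {
    awk '
    { total[$1] += $2 }    # $2 arrives as text but is used as a number here
    END {
        # The order of "for (k in total)" is unspecified, so fixed keys are printed.
        print "apple=" total["apple"]
        print "pear=" total["pear"]
        x = "10"            # a string value
        print x + 5         # numeric context
        print x " items"    # string context (concatenation)
    }
    '
}

printf 'apple 3\npear 2\napple 4\n' | sum_by_key
```

The array is created on first use and indexed directly by the key strings; no declaration is needed.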

Language shortcomings include:

  • lack of captures in regular expressions: the language standard provides no way to extract the substrings matched by parenthesized groups of a regular expression. gawk adds this as an extension.
  • non-repeatability: it’s impossible to apply the same rule to a string more than once without explicitly programming it. Note that sed, one of AWK's predecessors, doesn’t have this limitation.
  • no way to evaluate a string as part of the program.
  • a confusing method of defining local variables for a user-defined function: they are declared as extra parameters that the caller does not pass.
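Two of these points can be illustrated together. The sketch below declares a local variable as an extra parameter and recovers a matched substring with match(), RSTART and RLENGTH, which cover only the whole match rather than individual groups. The names first_number, first_num and found are hypothetical.

```shell
# Hypothetical wrapper: prints the first run of digits found in each record.
first_number() {
    awk '
    # Extra parameters after the real ones (here: found) act as local variables;
    # the wide gap in the parameter list is only a naming convention.
    function first_num(s,    found) {
        found = ""
        # match() sets RSTART and RLENGTH for the whole match only.
        if (match(s, /[0-9]+/))
            found = substr(s, RSTART, RLENGTH)
        return found
    }
    { print first_num($0) }
    '
}

printf 'order 66 of 100\n' | first_number
```

Because found is a parameter, assigning to it inside the function does not create or change a global variable of the same name.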

Elements of syntax:

Inline comments: #
Case-sensitivity: yes
Variable identifier regexp: [_a-zA-Z][_a-zA-Z0-9]*
Variable assignment: varname = value
Variable declaration: none
Variable declaration with assignment: none
Grouping expressions: ( ... )
Block: { ... }
Physical (shallow) equality: a == b
Physical (shallow) inequality: a != b
Deep equality: a == b
Deep inequality: a != b
Comparison: < > <= >=
Function definition: function functionName(argname1, ..., argnameN)
Function call: functionName(arg1, ..., argN)
If-then: if (condition) trueBlock
For each value in a numeric range, 1 increment: for (i = first; i <= last; i++) loopBody
For each value in a numeric range, 1 decrement: for (i = last; i >= first; i--) loopBody

Examples:

Hello, World!:

Example for versions Jawk 1.02, gawk 3.1.6, mawk 1.3.3

The printing is done in the action of the BEGIN pattern, i.e., before any input is processed.

BEGIN { print "Hello, World!" }

Factorial:

Example for versions Jawk 1.02, gawk 3.1.6, mawk 1.3.3

This example uses the iterative definition of factorial. Individual statements within a code block can be separated with semicolons (;) or newlines.

BEGIN {
    f = 1
    print "0! = " f
    for (i=1; i<17; i++) {
        f *= i
        print i "! = " f
    }
}

Fibonacci numbers:

Example for versions Jawk 1.02, gawk 3.1.6, mawk 1.3.3

This example uses the iterative definition of Fibonacci numbers. fib is an associative array, and pr is a string that accumulates the output.

BEGIN {
    fib[1] = 1
    fib[2] = 1
    for (i=3; i<17; i++)
        fib[i] = fib[i-1]+fib[i-2]
    pr = ""
    for (i=1; i<17; i++)
        pr = pr fib[i] ", "
    print pr "..." 
}

Quadratic equation:

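Example for versions Jawk 1.02, gawk 3.1.6, mawk 1.3.3

Each input record supplies the coefficients A, B and C as its first three fields. Complex roots are printed as (real part, imaginary part) pairs.
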
{   A = $1
    B = $2
    C = $3
    if (A == 0) 
        print "Not a quadratic equation"
    else
    {   D = B*B-4*A*C
        if (D == 0)
            print "x = " (-B/2/A)
        else if (D > 0)
        {   print "x1 = " ((-B+sqrt(D))/2/A)
            print "x2 = " ((-B-sqrt(D))/2/A)
        }
        else
        {   print "x1 = (" (-B/2/A) "," (sqrt(-D)/2/A) ")"
            print "x2 = (" (-B/2/A) "," (-sqrt(-D)/2/A) ")"
        }
    }
}

CamelCase:

Example for versions Jawk 1.02, gawk 3.1.6, mawk 1.3.3

In mawk the function length cannot be used to get the size of an array, and neither can it in Jawk, where an attempt results in a “Cannot evaluate an unindexed array.” runtime error.

Instead we can use the fact that the function split returns the number of fragments it extracted from the string. Otherwise this example is identical to the gawk version below.

{   text = $0;
    res = "";    # reset the result for each input record
    N = split(text, words, /[^a-zA-Z]+/);    # N is the number of fragments
    for (i=1; i<=N; i++) {
        res = res toupper(substr(words[i],1,1)) tolower(substr(words[i],2));
    }
    print res
}

CamelCase:

Example for versions gawk 3.1.6

The variable $0 stores the whole record read (as opposed to the variables $1, $2 etc., which store the fields of the record). split breaks the string into fragments at each match of the regular expression and writes the result to the array words. After this, each element of the array is converted to the correct case using the functions substr, toupper and tolower.

{   text = $0;
    res = "";    # reset the result for each input record
    split(text, words, /[^a-zA-Z]+/);
    for (i=1; i<=length(words); i++) {    # length of an array: gawk extension
        res = res toupper(substr(words[i],1,1)) tolower(substr(words[i],2));
    }
    print res
}