The functions in this section look at or change the text of one or more
strings. gawk understands locales (see Locales) and does all string
processing in terms of characters, not bytes. This distinction is
particularly important to understand for locales where one character
may be represented by multiple bytes. Thus, for example, length()
returns the number of characters in a string, and not the number of bytes
used to represent those characters. Similarly, index() works with
character indices, and not byte indices.
In the following list, optional parameters are enclosed in square brackets ([ ]).
Several functions perform string substitution; the full discussion is
provided in the description of the sub() function, which comes
towards the end since the list is presented in alphabetic order.
Those functions that are specific to gawk are marked with a
pound sign (‘#’):
asort(source [, dest [, how ] ]) #
"ascending string" for the value of how. If the source array contains
subarrays as values, they come out last (first) in the dest array for an
‘ascending’ (‘descending’) order specification. The value of IGNORECASE
affects the sorting.
The third argument can also be a user-defined function name in which case
the value returned by the function is used to order the array elements
before constructing the result array.
See Array Sorting Functions, for more information.
For example, if the contents of a are as follows:
    a["last"] = "de"
    a["first"] = "sac"
    a["middle"] = "cul"
A call to asort():
    asort(a)
results in the following contents of a:
    a[1] = "cul"
    a[2] = "de"
    a[3] = "sac"
In order to reverse the direction of the sorted results in the above example,
asort() can be called with three arguments as follows:
    asort(a, a, "descending")
The asort() function is described in more detail in
Array Sorting Functions.
asort() is a gawk extension; it is not available
in compatibility mode (see Options).
asorti(source [, dest [, how ] ]) #
asort(), however, the indices are sorted, instead of the values.
(Here too, IGNORECASE affects the sorting.)
The asorti() function is described in more detail in
Array Sorting Functions.
asorti() is a gawk extension; it is not available
in compatibility mode (see Options).
gensub(regexp, replacement, how [, target]) #
$0. It returns the modified string as the result of the function and the
original target string is not changed.
gensub() is a general substitution function. Its purpose is to provide
more features than the standard sub() and gsub() functions.
gensub() provides an additional feature that is not available in sub()
or gsub(): the ability to specify components of a regexp in the
replacement text. This is done by using parentheses in the regexp to mark
the components and then specifying ‘\N’ in the replacement text, where N
is a digit from 1 to 9. For example:
    $ gawk '
    > BEGIN {
    >       a = "abc def"
    >       b = gensub(/(.+) (.+)/, "\\2 \\1", "g", a)
    >       print b
    > }'
    -| def abc
As with sub(), you must type two backslashes in order to get one into the
string. In the replacement text, the sequence ‘\0’ represents the entire
matched text, as does the character ‘&’.
The following example shows how you can use the third argument to control
which match of the regexp should be changed:
    $ echo a b c a b c |
    > gawk '{ print gensub(/a/, "AA", 2) }'
    -| a b c AA b c
In this case, $0 is the default target string. gensub() returns the new
string as its result, which is passed directly to print for printing.
If the how argument is a string that does not begin with ‘g’ or ‘G’, or if it is a number that is less than or equal to zero, only one substitution is performed. If how is zero, gawk issues a warning message.
If regexp does not match target, gensub()'s return value
is the original unchanged value of target.
gensub() is a gawk extension; it is not available
in compatibility mode (see Options).
gsub(regexp, replacement [, target])
gsub() stands for “global,” which means replace everywhere. For example:
    { gsub(/Britain/, "United Kingdom"); print }
replaces all occurrences of the string ‘Britain’ with ‘United Kingdom’ for all input records.
The gsub() function returns the number of substitutions made. If
the variable to search and alter (target) is omitted, then the entire
input record ($0) is used. As in sub(), the characters ‘&’ and ‘\’ are
special, and the third argument must be assignable.
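The return value is easy to overlook, so here is a minimal sketch of using it. The variable names are illustrative, and any POSIX-style awk is assumed to be available as awk.

```shell
# gsub() changes s in place and returns how many replacements it made.
result=$(awk 'BEGIN {
    s = "banana"
    n = gsub(/a/, "A", s)
    print n, s
}')
printf '%s\n' "$result"
```

A return value of zero indicates that the target was left alone, which makes gsub() usable directly as a condition.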
index(in, find)
    $ awk 'BEGIN { print index("peanut", "an") }'
    -| 3
If find is not found, index() returns zero.
(Remember that string indices in awk start at one.)
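One typical use of index() is to cut a string at the first occurrence of a delimiter. The following is a sketch under the assumption that the input has the form key=value; the sample data and variable names are made up for illustration.

```shell
# Split at the first "=" only; later "=" characters stay in the value.
result=$(awk 'BEGIN {
    s = "key=value=more"
    i = index(s, "=")                 # position of the first "="
    if (i > 0)
        print substr(s, 1, i - 1), substr(s, i + 1)
    print index(s, "#")               # "#" does not occur in s
}')
printf '%s\n' "$result"
```

Testing the result against zero before using it guards against the not-found case.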
length([string])
length("abcde") is five. By contrast, length(15 * 35) works out to three.
In this example, 15 * 35 = 525, and 525 is then converted to the string
"525", which has three characters.
If no argument is supplied, length() returns the length of $0.
NOTE: In older versions of awk, the length() function could be called
without any parentheses. Doing so is considered poor practice,
although the 2008 POSIX standard explicitly allows it, to
support historical practice. For programs to be maximally portable,
always supply the parentheses.
If length() is called with a variable that has not been used,
gawk forces the variable to be a scalar. Other
implementations of awk leave the variable without a type.
(d.c.)
Consider:
    $ gawk 'BEGIN { print length(x) ; x[1] = 1 }'
    -| 0
    error--> gawk: fatal: attempt to use scalar `x' as array
    $ nawk 'BEGIN { print length(x) ; x[1] = 1 }'
    -| 0
If --lint has been specified on the command line, gawk issues a warning about this.
With gawk and several other awk implementations, when given an
array argument, the length() function returns the number of elements
in the array. (c.e.)
This is less useful than it might seem at first, as the
array is not guaranteed to be indexed from one to the number of elements
in it.
If --lint is provided on the command line (see Options),
gawk warns that passing an array argument is not portable.
If --posix is supplied, using an array argument is a fatal error
(see Arrays).
match(string, regexp [, array])
The regexp argument may be either a regexp constant (/.../) or a string
constant ("...").
In the latter case, the string is treated as a regexp to be matched.
See Computed Regexps, for a
discussion of the difference between the two forms, and the
implications for writing your program correctly.
The order of the first two arguments is backwards from most other string
functions that work with regular expressions, such as sub() and gsub().
It might help to remember that for match(), the order is the same as for
the ‘~’ operator: ‘string ~ regexp’.
The match() function sets the built-in variable RSTART to the index.
It also sets the built-in variable RLENGTH to the
length in characters of the matched substring. If no match is found,
RSTART is set to zero, and RLENGTH to -1.
For example:
    {
        if ($1 == "FIND")
            regex = $2
        else {
            where = match($0, regex)
            if (where != 0)
                print "Match of", regex, "found at", where, "in", $0
        }
    }
This program looks for lines that match the regular expression stored in
the variable regex. This regular expression can be changed. If the
first word on a line is ‘FIND’, regex is changed to be the
second word on that line. Therefore, if given:
    FIND ru+n
    My program runs
    but not very quickly
    FIND Melvin
    JF+KM
    This line is property of Reality Engineering Co.
    Melvin was here.
awk prints:
    Match of ru+n found at 12 in My program runs
    Match of Melvin found at 1 in Melvin was here.
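RSTART and RLENGTH are most often passed straight to substr() to pull out the matched text. The following sketch shows that idiom along with the values set when nothing matches; the sample string is made up, and any POSIX-style awk is assumed.

```shell
# Extract the first run of digits, then show the no-match values.
result=$(awk 'BEGIN {
    s = "error code 404 found"
    if (match(s, /[0-9]+/))
        print RSTART, RLENGTH, substr(s, RSTART, RLENGTH)
    match(s, /xyz/)                   # no match
    print RSTART, RLENGTH
}')
printf '%s\n' "$result"
```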
If array is present, it is cleared, and then the zeroth element of array is set to the entire portion of string matched by regexp. If regexp contains parentheses, the integer-indexed elements of array are set to contain the portion of string matching the corresponding parenthesized subexpression. For example:
    $ echo foooobazbarrrrr |
    > gawk '{ match($0, /(fo+).+(bar*)/, arr)
    >         print arr[1], arr[2] }'
    -| foooo barrrrr
In addition, multidimensional subscripts are available providing the start index and length of each matched subexpression:
    $ echo foooobazbarrrrr |
    > gawk '{ match($0, /(fo+).+(bar*)/, arr)
    >           print arr[1], arr[2]
    >           print arr[1, "start"], arr[1, "length"]
    >           print arr[2, "start"], arr[2, "length"]
    > }'
    -| foooo barrrrr
    -| 1 5
    -| 9 7
There may not be subscripts for the start and length for every parenthesized
subexpression, since they may not all have matched text; thus they
should be tested for with the in operator
(see Reference to Elements).
The array argument to match() is a gawk extension. In compatibility mode
(see Options), using a third argument is a fatal error.
patsplit(string, array [, fieldpat [, seps ] ]) #
[1], the second piece in array[2], and so
forth. The third argument, fieldpat, is
a regexp describing the fields in string (just as FPAT is
a regexp describing the fields in input records).
It may be either a regexp constant or a string.
If fieldpat is omitted, the value of FPAT is used.
patsplit() returns the number of elements created.
seps[i] is the separator string between array[i] and array[i+1].
Any leading separator will be in seps[0].
The patsplit() function splits strings into pieces in a
manner similar to the way input lines are split into fields using FPAT
(see Splitting By Content).
Before splitting the string, patsplit() deletes any previously existing
elements in the arrays array and seps.
The patsplit() function is a gawk extension. In compatibility mode
(see Options), it is not available.
split(string, array [, fieldsep [, seps ] ])
[1], the second piece in array[2], and so
forth. The string value of the third argument, fieldsep, is
a regexp describing where to split string (much as FS can
be a regexp describing where to split input records;
see Regexp Field Splitting).
If fieldsep is omitted, the value of FS is used.
split() returns the number of elements created.
seps is a gawk extension with seps[i] being the separator string
between array[i] and array[i+1].
If fieldsep is a single space then any leading whitespace goes into
seps[0] and any trailing whitespace goes into seps[n] where n is the
return value of split() (that is, the number of elements in array).
The split() function splits strings into pieces in a
manner similar to the way input lines are split into fields. For example:
    split("cul-de-sac", a, "-", seps)
splits the string ‘cul-de-sac’ into three fields using ‘-’ as the
separator. It sets the contents of the array a as follows:
    a[1] = "cul"
    a[2] = "de"
    a[3] = "sac"
and sets the contents of the array seps as follows:
    seps[1] = "-"
    seps[2] = "-"
The value returned by this call to split() is three.
As with input field-splitting, when the value of fieldsep is " ",
leading and trailing whitespace is ignored in values assigned to
the elements of array but not in seps, and the elements
are separated by runs of whitespace.
Also as with input field-splitting, if fieldsep is the null string, each
individual character in the string is split into its own array element.
(c.e.)
Note, however, that RS has no effect on the way split()
works. Even though ‘RS = ""’ causes newline to also be an input
field separator, this does not affect how split() splits strings.
Modern implementations of awk, including gawk, allow
the third argument to be a regexp constant (/abc/) as well as a
string.
(d.c.)
The POSIX standard allows this as well.
See Computed Regexps, for a
discussion of the difference between using a string constant or a regexp constant,
and the implications for writing your program correctly.
Before splitting the string, split() deletes any previously existing
elements in the arrays array and seps.
If string is null, the array has no elements. (So this is a portable way to delete an entire array with one statement. See Delete.)
If string does not match fieldsep at all (but is not null),
array has one element only. The value of that element is the original
string.
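The last two cases can be checked with a short sketch. Any POSIX-style awk is assumed; the strings and the counter variable are illustrative.

```shell
# No separator present gives one element; a null string empties the array.
result=$(awk 'BEGIN {
    n = split("no-commas-here", parts, ",")
    print n, parts[1]
    n = split("", parts)              # clears parts
    count = 0
    for (k in parts)
        count++
    print n, count
}')
printf '%s\n' "$result"
```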
sprintf(format, expression1, ...)
printf would have printed out with the same arguments
(see Printf).
For example:
    pival = sprintf("pi = %.2f (approx.)", 22/7)
assigns the string ‘pi = 3.14 (approx.)’ to the variable pival.
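Because sprintf() returns its result instead of printing it, it is handy for building padded or aligned strings that are used later. The following is a minimal sketch with made-up values and variable names, assuming any POSIX-style awk.

```shell
# Build a zero-padded identifier and a fixed-width table row.
result=$(awk 'BEGIN {
    id  = sprintf("%05d", 42)
    row = sprintf("%-6s|%6.2f|", "pi", 3.14159)
    print id
    print row
}')
printf '%s\n' "$result"
```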
strtonum(str) #
strtonum() assumes that str
is an octal number. If str begins with a leading ‘0x’ or
‘0X’, strtonum() assumes that str is a hexadecimal number.
For example:
    $ echo 0x11 |
    > gawk '{ printf "%d\n", strtonum($1) }'
    -| 17
Using the strtonum() function is not the same as adding zero
to a string value; the automatic coercion of strings to numbers
works only for decimal data, not for octal or hexadecimal.[1]
Note also that strtonum() uses the current locale's decimal point
for recognizing numbers (see Locales).
strtonum() is a gawk extension; it is not available
in compatibility mode (see Options).
sub(regexp, replacement [, target])
The regexp argument may be either a regexp constant (/.../) or a string
constant ("...").
In the latter case, the string is treated as a regexp to be matched.
See Computed Regexps, for a
discussion of the difference between the two forms, and the
implications for writing your program correctly.
This function is peculiar because target is not simply
used to compute a value, and not just any expression will do: it
must be a variable, field, or array element so that sub() can
store a modified value there. If this argument is omitted, then the
default is to use and alter $0.[2]
For example:
    str = "water, water, everywhere"
    sub(/at/, "ith", str)
sets str to ‘wither, water, everywhere’, by replacing the
leftmost longest occurrence of ‘at’ with ‘ith’.
If the special character ‘&’ appears in replacement, it stands for the precise substring that was matched by regexp. (If the regexp can match more than one string, then this precise substring may vary.) For example:
{ sub(/candidate/, "& and his wife"); print }
changes the first occurrence of ‘candidate’ to ‘candidate and his wife’ on each input line. Here is another example:
    $ awk 'BEGIN {
    >         str = "daabaaa"
    >         sub(/a+/, "C&C", str)
    >         print str
    > }'
    -| dCaaCbaaa
This shows how ‘&’ can represent a nonconstant string and also illustrates the “leftmost, longest” rule in regexp matching (see Leftmost Longest).
The effect of this special character (‘&’) can be turned off by putting a backslash before it in the string. As usual, to insert one backslash in the string, you must write two backslashes. Therefore, write ‘\\&’ in a string constant to include a literal ‘&’ in the replacement. For example, the following shows how to replace the first ‘|’ on each line with an ‘&’:
{ sub(/\|/, "\\&"); print }
As mentioned, the third argument to sub() must
be a variable, field or array element.
Some versions of awk allow the third argument to
be an expression that is not an lvalue. In such a case, sub()
still searches for the pattern and returns zero or one, but the result of
the substitution (if any) is thrown away because there is no place
to put it. Such versions of awk accept expressions
like the following:
    sub(/USA/, "United States", "the USA and Canada")
For historical compatibility, gawk accepts such erroneous code. However, using any other nonchangeable object as the third parameter causes a fatal error and your program will not run.
Finally, if the regexp is not a regexp constant, it is converted into a
string, and then the value of that string is treated as the regexp to match.
substr(string, start [, length])
substr("washington", 5, 3) returns "ing".
If length is not present, substr() returns the whole suffix of
string that begins at character number start. For example,
substr("washington", 5) returns "ington". The whole
suffix is also returned
if length is greater than the number of characters remaining
in the string, counting from character start.
If start is less than one, substr() treats it as
if it was one. (POSIX doesn't specify what to do in this case:
Brian Kernighan's awk acts this way, and therefore gawk
does too.)
If start is greater than the number of characters
in the string, substr() returns the null string.
Similarly, if length is present but less than or equal to zero,
the null string is returned.
The string returned by substr() cannot be
assigned. Thus, it is a mistake to attempt to change a portion of
a string, as shown in the following example:
    string = "abcdef"
    # try to get "abCDEf", won't work
    substr(string, 3, 3) = "CDE"
It is also a mistake to use substr() as the third argument
of sub() or gsub():
    gsub(/xyz/, "pdq", substr($0, 5, 20))  # WRONG
(Some commercial versions of awk treat
substr() as assignable, but doing so is not portable.)
If you need to replace bits and pieces of a string, combine substr()
with string concatenation, in the following manner:
    string = "abcdef"
    ...
    string = substr(string, 1, 2) "CDE" substr(string, 6)
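The concatenation idiom can be wrapped in a small function when it is needed in several places. The function name replace_at and its parameters are hypothetical, introduced here only for illustration; any POSIX-style awk is assumed.

```shell
# replace_at (hypothetical helper): overwrite len characters of s,
# beginning at position start, with repl.
result=$(awk '
function replace_at(s, start, len, repl) {
    return substr(s, 1, start - 1) repl substr(s, start + len)
}
BEGIN {
    print replace_at("abcdef", 3, 3, "CDE")
    print replace_at("washington", 5, 3, "ING")
}')
printf '%s\n' "$result"
```

The function returns a new string, so the caller assigns the result back to the variable it wants changed.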
tolower(string)
tolower("MiXeD cAsE 123") returns "mixed case 123".
toupper(string)
toupper("MiXeD cAsE 123") returns "MIXED CASE 123".
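A common use of these functions is comparing strings without regard to case, by folding both sides first. The sketch below uses made-up values and variable names and assumes any POSIX-style awk.

```shell
# Compare two strings exactly, then after folding both to lowercase.
result=$(awk 'BEGIN {
    a = "Gawk"; b = "GAWK"
    same   = (a == b)
    folded = (tolower(a) == tolower(b))
    print same, folded
}')
printf '%s\n' "$result"
```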
[1] Unless you use the --non-decimal-data option, which isn't recommended. See Nondecimal Data, for more information.
[2] Note that this means
that the record will first be regenerated using the value of OFS if
any fields have been changed, and that the fields will be updated
after the substitution, even if the operation is a “no-op” such
as ‘sub(/^/, "")’.
[3] This is different from C and C++, in which the first character is number zero.