This is gawk.info, produced by makeinfo version 5.2 from gawk.texi.

Copyright (C) 1989, 1991, 1992, 1993, 1996-2005, 2007, 2009-2015
Free Software Foundation, Inc.


This is Edition 4.1 of 'GAWK: Effective AWK Programming: A User's
Guide for GNU Awk', for the 4.1.3 (or later) version of the GNU
implementation of AWK.

Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.3 or
any later version published by the Free Software Foundation; with the
Invariant Sections being "GNU General Public License", with the
Front-Cover Texts being "A GNU Manual", and with the Back-Cover Texts as
in (a) below. A copy of the license is included in the section entitled
"GNU Free Documentation License".

a. The FSF's Back-Cover Text is: "You have the freedom to copy and
modify this GNU manual."
INFO-DIR-SECTION Text creation and manipulation
START-INFO-DIR-ENTRY
* Gawk: (gawk). A text scanning and processing language.
END-INFO-DIR-ENTRY

INFO-DIR-SECTION Individual utilities
START-INFO-DIR-ENTRY
* awk: (gawk)Invoking gawk. Text scanning and processing.
END-INFO-DIR-ENTRY


File: gawk.info, Node: Top, Next: Foreword3, Up: (dir)

General Introduction
********************

This file documents 'awk', a program that you can use to select
particular records in a file and perform operations upon them.
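
A minimal sketch of what "selecting records and operating on them" can
look like in practice. The sample input and the shell function name
'big_spenders' are illustrative assumptions, not part of this manual:

```shell
# Hypothetical helper: for every record whose second field exceeds 100
# (the selection), print the first field (the operation).
big_spenders() {
    awk '$2 > 100 { print $1 }' "$@"
}

# Made-up sample data: a name and an amount on each line.
printf 'alice 250\nbob 40\ncarol 130\n' | big_spenders
```

With the sample data shown, only the records for 'alice' and 'carol'
satisfy the condition, so only those two names are printed.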

Copyright (C) 1989, 1991, 1992, 1993, 1996-2005, 2007, 2009-2015
Free Software Foundation, Inc.


This is Edition 4.1 of 'GAWK: Effective AWK Programming: A User's
Guide for GNU Awk', for the 4.1.3 (or later) version of the GNU
implementation of AWK.

Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.3 or
any later version published by the Free Software Foundation; with the
Invariant Sections being "GNU General Public License", with the
Front-Cover Texts being "A GNU Manual", and with the Back-Cover Texts as
in (a) below. A copy of the license is included in the section entitled
"GNU Free Documentation License".

a. The FSF's Back-Cover Text is: "You have the freedom to copy and
modify this GNU manual."

* Menu:

* Foreword3:: Some nice words about this Info file.
* Foreword4:: More nice words.
* Preface:: What this Info file is about; brief history and
        acknowledgments.
* Getting Started:: A basic introduction to using 'awk'. How to run an
        'awk' program. Command-line syntax.
* Invoking Gawk:: How to run 'gawk'.
* Regexp:: All about matching things using regular expressions.
* Reading Files:: How to read files and manipulate fields.
* Printing:: How to print using 'awk'. Describes the 'print' and
        'printf' statements. Also describes redirection of output.
* Expressions:: Expressions are the basic building blocks of statements.
* Patterns and Actions:: Overviews of patterns and actions.
* Arrays:: The description and use of arrays. Also includes
        array-oriented control statements.
* Functions:: Built-in and user-defined functions.
* Library Functions:: A Library of 'awk' Functions.
* Sample Programs:: Many 'awk' programs with complete explanations.
* Advanced Features:: Stuff for advanced users, specific to 'gawk'.
* Internationalization:: Getting 'gawk' to speak your language.
* Debugger:: The 'gawk' debugger.
* Arbitrary Precision Arithmetic:: Arbitrary precision arithmetic with
        'gawk'.
* Dynamic Extensions:: Adding new built-in functions to 'gawk'.
* Language History:: The evolution of the 'awk' language.
* Installation:: Installing 'gawk' under various operating systems.
* Notes:: Notes about adding things to 'gawk' and possible future work.
* Basic Concepts:: A very quick introduction to programming concepts.
* Glossary:: An explanation of some unfamiliar terms.
* Copying:: Your right to copy and distribute 'gawk'.
* GNU Free Documentation License:: The license for this Info file.
* Index:: Concept and Variable Index.

* History:: The history of 'gawk' and 'awk'.
* Names:: What name to use to find 'awk'.
* This Manual:: Using this Info file. Includes sample input files that
        you can use.
* Conventions:: Typographical Conventions.
* Manual History:: Brief history of the GNU project and this Info file.
* How To Contribute:: Helping to save the world.
* Acknowledgments:: Acknowledgments.
* Running gawk:: How to run 'gawk' programs; includes command-line
        syntax.
* One-shot:: Running a short throwaway 'awk' program.
* Read Terminal:: Using no input files (input from the keyboard instead).
* Long:: Putting permanent 'awk' programs in files.
* Executable Scripts:: Making self-contained 'awk' programs.
* Comments:: Adding documentation to 'gawk' programs.
* Quoting:: More discussion of shell quoting issues.
* DOS Quoting:: Quoting in Windows Batch Files.
* Sample Data Files:: Sample data files for use in the 'awk' programs
        illustrated in this Info file.
* Very Simple:: A very simple example.
* Two Rules:: A less simple one-line example using two rules.
* More Complex:: A more complex example.
* Statements/Lines:: Subdividing or combining statements into lines.
* Other Features:: Other Features of 'awk'.
* When:: When to use 'gawk' and when to use other things.
* Intro Summary:: Summary of the introduction.
* Command Line:: How to run 'awk'.
* Options:: Command-line options and their meanings.
* Other Arguments:: Input file names and variable assignments.
* Naming Standard Input:: How to specify standard input with other files.
* Environment Variables:: The environment variables 'gawk' uses.
* AWKPATH Variable:: Searching directories for 'awk' programs.
* AWKLIBPATH Variable:: Searching directories for 'awk' shared libraries.
* Other Environment Variables:: The environment variables.
* Exit Status:: 'gawk''s exit status.
* Include Files:: Including other files into your program.
* Loading Shared Libraries:: Loading shared libraries into your program.
* Obsolete:: Obsolete Options and/or features.
* Undocumented:: Undocumented Options and Features.
* Invoking Summary:: Invocation summary.
* Regexp Usage:: How to Use Regular Expressions.
* Escape Sequences:: How to write nonprinting characters.
* Regexp Operators:: Regular Expression Operators.
* Bracket Expressions:: What can go between '[...]'.
* Leftmost Longest:: How much text matches.
* Computed Regexps:: Using Dynamic Regexps.
* GNU Regexp Operators:: Operators specific to GNU software.
* Case-sensitivity:: How to do case-insensitive matching.
* Regexp Summary:: Regular expressions summary.
* Records:: Controlling how data is split into records.
* awk split records:: How standard 'awk' splits records.
* gawk split records:: How 'gawk' splits records.
* Fields:: An introduction to fields.
* Nonconstant Fields:: Nonconstant Field Numbers.
* Changing Fields:: Changing the Contents of a Field.
* Field Separators:: The field separator and how to change it.
* Default Field Splitting:: How fields are normally separated.
* Regexp Field Splitting:: Using regexps as the field separator.
* Single Character Fields:: Making each character a separate field.
* Command Line Field Separator:: Setting 'FS' from the command line.
* Full Line Fields:: Making the full line be a single field.
* Field Splitting Summary:: Some final points and a summary table.
* Constant Size:: Reading constant width data.
* Splitting By Content:: Defining Fields By Content
* Multiple Line:: Reading multiline records.
* Getline:: Reading files under explicit program control using the
        'getline' function.
* Plain Getline:: Using 'getline' with no arguments.
* Getline/Variable:: Using 'getline' into a variable.
* Getline/File:: Using 'getline' from a file.
* Getline/Variable/File:: Using 'getline' into a variable from a file.
* Getline/Pipe:: Using 'getline' from a pipe.
* Getline/Variable/Pipe:: Using 'getline' into a variable from a pipe.
* Getline/Coprocess:: Using 'getline' from a coprocess.
* Getline/Variable/Coprocess:: Using 'getline' into a variable from a
        coprocess.
* Getline Notes:: Important things to know about 'getline'.
* Getline Summary:: Summary of 'getline' Variants.
* Read Timeout:: Reading input with a timeout.
* Command-line directories:: What happens if you put a directory on
        the command line.
* Input Summary:: Input summary.
* Input Exercises:: Exercises.
* Print:: The 'print' statement.
* Print Examples:: Simple examples of 'print' statements.
* Output Separators:: The output separators and how to change them.
* OFMT:: Controlling Numeric Output With 'print'.
* Printf:: The 'printf' statement.
* Basic Printf:: Syntax of the 'printf' statement.
* Control Letters:: Format-control letters.
* Format Modifiers:: Format-specification modifiers.
* Printf Examples:: Several examples.
* Redirection:: How to redirect output to multiple files and pipes.
* Special FD:: Special files for I/O.
* Special Files:: File name interpretation in 'gawk'. 'gawk' allows
        access to inherited file descriptors.
* Other Inherited Files:: Accessing other open files with 'gawk'.
* Special Network:: Special files for network communications.
* Special Caveats:: Things to watch out for.
* Close Files And Pipes:: Closing Input and Output Files and Pipes.
* Output Summary:: Output summary.
* Output Exercises:: Exercises.
* Values:: Constants, Variables, and Regular Expressions.
* Constants:: String, numeric and regexp constants.
* Scalar Constants:: Numeric and string constants.
* Nondecimal-numbers:: What are octal and hex numbers.
* Regexp Constants:: Regular Expression constants.
* Using Constant Regexps:: When and how to use a regexp constant.
* Variables:: Variables give names to values for later use.
* Using Variables:: Using variables in your programs.
* Assignment Options:: Setting variables on the command line and a
        summary of command-line syntax. This is an advanced method of
        input.
* Conversion:: The conversion of strings to numbers and vice versa.
* Strings And Numbers:: How 'awk' Converts Between Strings And Numbers.
* Locale influences conversions:: How the locale may affect conversions.
* All Operators:: 'gawk''s operators.
* Arithmetic Ops:: Arithmetic operations ('+', '-', etc.)
* Concatenation:: Concatenating strings.
* Assignment Ops:: Changing the value of a variable or a field.
* Increment Ops:: Incrementing the numeric value of a variable.
* Truth Values and Conditions:: Testing for true and false.
* Truth Values:: What is "true" and what is "false".
* Typing and Comparison:: How variables acquire types and how this
        affects comparison of numbers and strings with '<', etc.
* Variable Typing:: String type versus numeric type.
* Comparison Operators:: The comparison operators.
* POSIX String Comparison:: String comparison with POSIX rules.
* Boolean Ops:: Combining comparison expressions using boolean
        operators '||' ("or"), '&&' ("and") and '!' ("not").
* Conditional Exp:: Conditional expressions select between two
        subexpressions under control of a third subexpression.
* Function Calls:: A function call is an expression.
* Precedence:: How various operators nest.
* Locales:: How the locale affects things.
* Expressions Summary:: Expressions summary.
* Pattern Overview:: What goes into a pattern.
* Regexp Patterns:: Using regexps as patterns.
* Expression Patterns:: Any expression can be used as a pattern.
* Ranges:: Pairs of patterns specify record ranges.
* BEGIN/END:: Specifying initialization and cleanup rules.
* Using BEGIN/END:: How and why to use BEGIN/END rules.
* I/O And BEGIN/END:: I/O issues in BEGIN/END rules.
* BEGINFILE/ENDFILE:: Two special patterns for advanced control.
* Empty:: The empty pattern, which matches every record.
* Using Shell Variables:: How to use shell variables with 'awk'.
* Action Overview:: What goes into an action.
* Statements:: Describes the various control statements in detail.
* If Statement:: Conditionally execute some 'awk' statements.
* While Statement:: Loop until some condition is satisfied.
* Do Statement:: Do specified action while looping until some
        condition is satisfied.
* For Statement:: Another looping statement, that provides
        initialization and increment clauses.
* Switch Statement:: Switch/case evaluation for conditional execution
        of statements based on a value.
* Break Statement:: Immediately exit the innermost enclosing loop.
* Continue Statement:: Skip to the end of the innermost enclosing loop.
* Next Statement:: Stop processing the current input record.
* Nextfile Statement:: Stop processing the current file.
* Exit Statement:: Stop execution of 'awk'.
* Built-in Variables:: Summarizes the predefined variables.
* User-modified:: Built-in variables that you change to control 'awk'.
* Auto-set:: Built-in variables where 'awk' gives you information.
* ARGC and ARGV:: Ways to use 'ARGC' and 'ARGV'.
* Pattern Action Summary:: Patterns and Actions summary.
* Array Basics:: The basics of arrays.
* Array Intro:: Introduction to Arrays
* Reference to Elements:: How to examine one element of an array.
* Assigning Elements:: How to change an element of an array.
* Array Example:: Basic Example of an Array
* Scanning an Array:: A variation of the 'for' statement. It loops
        through the indices of an array's existing elements.
* Controlling Scanning:: Controlling the order in which arrays are
        scanned.
* Numeric Array Subscripts:: How to use numbers as subscripts in 'awk'.
* Uninitialized Subscripts:: Using Uninitialized variables as subscripts.
* Delete:: The 'delete' statement removes an element from an array.
* Multidimensional:: Emulating multidimensional arrays in 'awk'.
* Multiscanning:: Scanning multidimensional arrays.
* Arrays of Arrays:: True multidimensional arrays.
* Arrays Summary:: Summary of arrays.
* Built-in:: Summarizes the built-in functions.
* Calling Built-in:: How to call built-in functions.
* Numeric Functions:: Functions that work with numbers, including
        'int()', 'sin()' and 'rand()'.
* String Functions:: Functions for string manipulation, such as
        'split()', 'match()' and 'sprintf()'.
* Gory Details:: More than you want to know about '\' and '&' with
        'sub()', 'gsub()', and 'gensub()'.
* I/O Functions:: Functions for files and shell commands.
* Time Functions:: Functions for dealing with timestamps.
* Bitwise Functions:: Functions for bitwise operations.
* Type Functions:: Functions for type information.
* I18N Functions:: Functions for string translation.
* User-defined:: Describes User-defined functions in detail.
* Definition Syntax:: How to write definitions and what they mean.
* Function Example:: An example function definition and what it does.
* Function Caveats:: Things to watch out for.
* Calling A Function:: Don't use spaces.
* Variable Scope:: Controlling variable scope.
* Pass By Value/Reference:: Passing parameters.
* Return Statement:: Specifying the value a function returns.
* Dynamic Typing:: How variable types can change at runtime.
* Indirect Calls:: Choosing the function to call at runtime.
* Functions Summary:: Summary of functions.
* Library Names:: How to best name private global variables in
        library functions.
* General Functions:: Functions that are of general use.
* Strtonum Function:: A replacement for the built-in 'strtonum()'
        function.
* Assert Function:: A function for assertions in 'awk' programs.
* Round Function:: A function for rounding if 'sprintf()' does not do
        it correctly.
* Cliff Random Function:: The Cliff Random Number Generator.
* Ordinal Functions:: Functions for using characters as numbers and
        vice versa.
* Join Function:: A function to join an array into a string.
* Getlocaltime Function:: A function to get formatted times.
* Readfile Function:: A function to read an entire file at once.
* Shell Quoting:: A function to quote strings for the shell.
* Data File Management:: Functions for managing command-line data files.
* Filetrans Function:: A function for handling data file transitions.
* Rewind Function:: A function for rereading the current file.
* File Checking:: Checking that data files are readable.
* Empty Files:: Checking for zero-length files.
* Ignoring Assigns:: Treating assignments as file names.
* Getopt Function:: A function for processing command-line arguments.
* Passwd Functions:: Functions for getting user information.
* Group Functions:: Functions for getting group information.
* Walking Arrays:: A function to walk arrays of arrays.
* Library Functions Summary:: Summary of library functions.
* Library Exercises:: Exercises.
* Running Examples:: How to run these examples.
* Clones:: Clones of common utilities.
* Cut Program:: The 'cut' utility.
* Egrep Program:: The 'egrep' utility.
* Id Program:: The 'id' utility.
* Split Program:: The 'split' utility.
* Tee Program:: The 'tee' utility.
* Uniq Program:: The 'uniq' utility.
* Wc Program:: The 'wc' utility.
* Miscellaneous Programs:: Some interesting 'awk' programs.
* Dupword Program:: Finding duplicated words in a document.
* Alarm Program:: An alarm clock.
* Translate Program:: A program similar to the 'tr' utility.
* Labels Program:: Printing mailing labels.
* Word Sorting:: A program to produce a word usage count.
* History Sorting:: Eliminating duplicate entries from a history file.
* Extract Program:: Pulling out programs from Texinfo source files.
* Simple Sed:: A Simple Stream Editor.
* Igawk Program:: A wrapper for 'awk' that includes files.
* Anagram Program:: Finding anagrams from a dictionary.
* Signature Program:: People do amazing things with too much time on
        their hands.
* Programs Summary:: Summary of programs.
* Programs Exercises:: Exercises.
* Nondecimal Data:: Allowing nondecimal input data.
* Array Sorting:: Facilities for controlling array traversal and
        sorting arrays.
* Controlling Array Traversal:: How to use PROCINFO["sorted_in"].
* Array Sorting Functions:: How to use 'asort()' and 'asorti()'.
* Two-way I/O:: Two-way communications with another process.
* TCP/IP Networking:: Using 'gawk' for network programming.
* Profiling:: Profiling your 'awk' programs.
* Advanced Features Summary:: Summary of advanced features.
* I18N and L10N:: Internationalization and Localization.
* Explaining gettext:: How GNU 'gettext' works.
* Programmer i18n:: Features for the programmer.
* Translator i18n:: Features for the translator.
* String Extraction:: Extracting marked strings.
* Printf Ordering:: Rearranging 'printf' arguments.
* I18N Portability:: 'awk'-level portability issues.
* I18N Example:: A simple i18n example.
* Gawk I18N:: 'gawk' is also internationalized.
* I18N Summary:: Summary of I18N stuff.
* Debugging:: Introduction to 'gawk' debugger.
* Debugging Concepts:: Debugging in General.
* Debugging Terms:: Additional Debugging Concepts.
* Awk Debugging:: Awk Debugging.
* Sample Debugging Session:: Sample debugging session.
* Debugger Invocation:: How to Start the Debugger.
* Finding The Bug:: Finding the Bug.
* List of Debugger Commands:: Main debugger commands.
* Breakpoint Control:: Control of Breakpoints.
* Debugger Execution Control:: Control of Execution.
* Viewing And Changing Data:: Viewing and Changing Data.
* Execution Stack:: Dealing with the Stack.
* Debugger Info:: Obtaining Information about the Program and the
        Debugger State.
* Miscellaneous Debugger Commands:: Miscellaneous Commands.
* Readline Support:: Readline support.
* Limitations:: Limitations and future plans.
* Debugging Summary:: Debugging summary.
* Computer Arithmetic:: A quick intro to computer math.
* Math Definitions:: Defining terms used.
* MPFR features:: The MPFR features in 'gawk'.
* FP Math Caution:: Things to know.
* Inexactness of computations:: Floating point math is not exact.
* Inexact representation:: Numbers are not exactly represented.
* Comparing FP Values:: How to compare floating point values.
* Errors accumulate:: Errors get bigger as they go.
* Getting Accuracy:: Getting more accuracy takes some work.
* Try To Round:: Add digits and round.
* Setting precision:: How to set the precision.
* Setting the rounding mode:: How to set the rounding mode.
* Arbitrary Precision Integers:: Arbitrary Precision Integer Arithmetic
        with 'gawk'.
* POSIX Floating Point Problems:: Standards Versus Existing Practice.
* Floating point summary:: Summary of floating point discussion.
* Extension Intro:: What is an extension.
* Plugin License:: A note about licensing.
* Extension Mechanism Outline:: An outline of how it works.
* Extension API Description:: A full description of the API.
* Extension API Functions Introduction:: Introduction to the API functions.
* General Data Types:: The data types.
* Memory Allocation Functions:: Functions for allocating memory.
* Constructor Functions:: Functions for creating values.
* Registration Functions:: Functions to register things with 'gawk'.
* Extension Functions:: Registering extension functions.
* Exit Callback Functions:: Registering an exit callback.
* Extension Version String:: Registering a version string.
* Input Parsers:: Registering an input parser.
* Output Wrappers:: Registering an output wrapper.
* Two-way processors:: Registering a two-way processor.
* Printing Messages:: Functions for printing messages.
* Updating 'ERRNO':: Functions for updating 'ERRNO'.
* Requesting Values:: How to get a value.
* Accessing Parameters:: Functions for accessing parameters.
* Symbol Table Access:: Functions for accessing global variables.
* Symbol table by name:: Accessing variables by name.
* Symbol table by cookie:: Accessing variables by "cookie".
* Cached values:: Creating and using cached values.
* Array Manipulation:: Functions for working with arrays.
* Array Data Types:: Data types for working with arrays.
* Array Functions:: Functions for working with arrays.
* Flattening Arrays:: How to flatten arrays.
* Creating Arrays:: How to create and populate arrays.
* Extension API Variables:: Variables provided by the API.
* Extension Versioning:: API Version information.
* Extension API Informational Variables:: Variables providing
        information about 'gawk''s invocation.
* Extension API Boilerplate:: Boilerplate code for using the API.
* Finding Extensions:: How 'gawk' finds compiled extensions.
* Extension Example:: Example C code for an extension.
* Internal File Description:: What the new functions will do.
* Internal File Ops:: The code for internal file operations.
* Using Internal File Ops:: How to use an external extension.
* Extension Samples:: The sample extensions that ship with 'gawk'.
* Extension Sample File Functions:: The file functions sample.
* Extension Sample Fnmatch:: An interface to 'fnmatch()'.
* Extension Sample Fork:: An interface to 'fork()' and other process
        functions.
* Extension Sample Inplace:: Enabling in-place file editing.
* Extension Sample Ord:: Character to value to character conversions.
* Extension Sample Readdir:: An interface to 'readdir()'.
* Extension Sample Revout:: Reversing output sample output wrapper.
* Extension Sample Rev2way:: Reversing data sample two-way processor.
* Extension Sample Read write array:: Serializing an array to a file.
* Extension Sample Readfile:: Reading an entire file into a string.
* Extension Sample Time:: An interface to 'gettimeofday()' and 'sleep()'.
* Extension Sample API Tests:: Tests for the API.
* gawkextlib:: The 'gawkextlib' project.
* Extension summary:: Extension summary.
* Extension Exercises:: Exercises.
* V7/SVR3.1:: The major changes between V7 and System V Release 3.1.
* SVR4:: Minor changes between System V Releases 3.1 and 4.
* POSIX:: New features from the POSIX standard.
* BTL:: New features from Brian Kernighan's version of 'awk'.
* POSIX/GNU:: The extensions in 'gawk' not in POSIX 'awk'.
* Feature History:: The history of the features in 'gawk'.
* Common Extensions:: Common Extensions Summary.
* Ranges and Locales:: How locales used to affect regexp ranges.
* Contributors:: The major contributors to 'gawk'.
* History summary:: History summary.
* Gawk Distribution:: What is in the 'gawk' distribution.
* Getting:: How to get the distribution.
* Extracting:: How to extract the distribution.
* Distribution contents:: What is in the distribution.
* Unix Installation:: Installing 'gawk' under various versions of Unix.
* Quick Installation:: Compiling 'gawk' under Unix.
* Additional Configuration Options:: Other compile-time options.
* Configuration Philosophy:: How it's all supposed to work.
* Non-Unix Installation:: Installation on Other Operating Systems.
* PC Installation:: Installing and Compiling 'gawk' on MS-DOS and OS/2.
* PC Binary Installation:: Installing a prepared distribution.
* PC Compiling:: Compiling 'gawk' for MS-DOS, Windows32, and OS/2.
* PC Testing:: Testing 'gawk' on PC systems.
* PC Using:: Running 'gawk' on MS-DOS, Windows32 and OS/2.
* Cygwin:: Building and running 'gawk' for Cygwin.
* MSYS:: Using 'gawk' In The MSYS Environment.
* VMS Installation:: Installing 'gawk' on VMS.
* VMS Compilation:: How to compile 'gawk' under VMS.
* VMS Dynamic Extensions:: Compiling 'gawk' dynamic extensions on VMS.
* VMS Installation Details:: How to install 'gawk' under VMS.
* VMS Running:: How to run 'gawk' under VMS.
* VMS GNV:: The VMS GNV Project.
* VMS Old Gawk:: An old version comes with some VMS systems.
* Bugs:: Reporting Problems and Bugs.
* Other Versions:: Other freely available 'awk' implementations.
* Installation summary:: Summary of installation.
* Compatibility Mode:: How to disable certain 'gawk' extensions.
* Additions:: Making Additions To 'gawk'.
* Accessing The Source:: Accessing the Git repository.
* Adding Code:: Adding code to the main body of 'gawk'.
* New Ports:: Porting 'gawk' to a new operating system.
* Derived Files:: Why derived files are kept in the Git repository.
* Future Extensions:: New features that may be implemented one day.
* Implementation Limitations:: Some limitations of the implementation.
* Extension Design:: Design notes about the extension API.
* Old Extension Problems:: Problems with the old mechanism.
* Extension New Mechanism Goals:: Goals for the new mechanism.
* Extension Other Design Decisions:: Some other design decisions.
* Extension Future Growth:: Some room for future growth.
* Old Extension Mechanism:: Some compatibility for old extensions.
* Notes summary:: Summary of implementation notes.
* Basic High Level:: The high level view.
* Basic Data Typing:: A very quick intro to data types.

To my parents, for their love, and for the wonderful example they set
for me.

To my wife Miriam, for making me complete. Thank you for building
your life together with me.

To our children Chana, Rivka, Nachum and Malka, for enrichening our
lives in innumerable ways.


File: gawk.info, Node: Foreword3, Next: Foreword4, Prev: Top, Up: Top

Foreword to the Third Edition
*****************************

Arnold Robbins and I are good friends. We were introduced in 1990 by
circumstances--and our favorite programming language, AWK. The
circumstances started a couple of years earlier. I was working at a new
job and noticed an unplugged Unix computer sitting in the corner. No
one knew how to use it, and neither did I. However, a couple of days
later, it was running, and I was 'root' and the one-and-only user. That
day, I began the transition from statistician to Unix programmer.

On one of many trips to the library or bookstore in search of books
on Unix, I found the gray AWK book, a.k.a. Alfred V. Aho, Brian W.
Kernighan, and Peter J. Weinberger's 'The AWK Programming Language'
(Addison-Wesley, 1988). 'awk''s simple programming paradigm--find a
pattern in the input and then perform an action--often reduced complex
or tedious data manipulations to a few lines of code. I was excited to
try my hand at programming in AWK.

Alas, the 'awk' on my computer was a limited version of the language
described in the gray book. I discovered that my computer had "old
'awk'" and the book described "new 'awk'." I learned that this was
typical; the old version refused to step aside or relinquish its name.
If a system had a new 'awk', it was invariably called 'nawk', and few
systems had it. The best way to get a new 'awk' was to 'ftp' the source
code for 'gawk' from 'prep.ai.mit.edu'. 'gawk' was a version of new
'awk' written by David Trueman and Arnold, and available under the GNU
General Public License.

(Incidentally, it's no longer difficult to find a new 'awk'. 'gawk'
ships with GNU/Linux, and you can download binaries or source code for
almost any system; my wife uses 'gawk' on her VMS box.)

My Unix system started out unplugged from the wall; it certainly was
not plugged into a network. So, oblivious to the existence of 'gawk'
and the Unix community in general, and desiring a new 'awk', I wrote my
own, called 'mawk'. Before I was finished, I knew about 'gawk', but it
was too late to stop, so I eventually posted to a 'comp.sources'
newsgroup.

A few days after my posting, I got a friendly email from Arnold
introducing himself. He suggested we share design and algorithms and
attached a draft of the POSIX standard so that I could update 'mawk' to
support language extensions added after publication of 'The AWK
Programming Language'.

Frankly, if our roles had been reversed, I would not have been so
open and we probably would have never met. I'm glad we did meet. He is
an AWK expert's AWK expert and a genuinely nice person. Arnold
contributes significant amounts of his expertise and time to the Free
Software Foundation.

This book is the 'gawk' reference manual, but at its core it is a
book about AWK programming that will appeal to a wide audience. It is a
definitive reference to the AWK language as defined by the 1987 Bell
Laboratories release and codified in the 1992 POSIX Utilities standard.

On the other hand, the novice AWK programmer can study a wealth of
practical programs that emphasize the power of AWK's basic idioms:
data-driven control flow, pattern matching with regular expressions, and
associative arrays. Those looking for something new can try out
'gawk''s interface to network protocols via special '/inet' files.
|
||
|
||
The programs in this book make clear that an AWK program is typically
|
||
much smaller and faster to develop than a counterpart written in C.
|
||
Consequently, there is often a payoff to prototyping an algorithm or
|
||
design in AWK to get it running quickly and expose problems early.
|
||
Often, the interpreted performance is adequate and the AWK prototype
|
||
becomes the product.
|
||
|
||
The new 'pgawk' (profiling 'gawk'), produces program execution
|
||
counts. I recently experimented with an algorithm that for n lines of
|
||
input, exhibited ~ C n^2 performance, while theory predicted ~ C n log n
|
||
behavior. A few minutes poring over the 'awkprof.out' profile
|
||
pinpointed the problem to a single line of code. 'pgawk' is a welcome
|
||
addition to my programmer's toolbox.
|
||
|
||
Arnold has distilled over a decade of experience writing and using
|
||
AWK programs, and developing 'gawk', into this book. If you use AWK or
|
||
want to learn how, then read this book.
|
||
|
||
Michael Brennan
|
||
Author of 'mawk'
|
||
March 2001
|
||
|
||
|
||
File: gawk.info, Node: Foreword4, Next: Preface, Prev: Foreword3, Up: Top

Foreword to the Fourth Edition
******************************

Some things don't change. Thirteen years ago I wrote: "If you use AWK
or want to learn how, then read this book." True then, and still true
today.

Learning to use a programming language is about more than mastering
the syntax. One needs to acquire an understanding of how to use the
features of the language to solve practical programming problems. A
focus of this book is many examples that show how to use AWK.

Some things do change. Our computers are much faster and have more
memory. Consequently, speed and storage inefficiencies of a high-level
language matter less. Prototyping in AWK and then rewriting in C for
performance reasons happens less, because more often the prototype is
fast enough.

Of course, there are computing operations that are best done in C or
C++. With 'gawk' 4.1 and later, you do not have to choose between
writing your program in AWK or in C/C++. You can write most of your
program in AWK and the aspects that require C/C++ capabilities can be
written in C/C++, and then the pieces glued together when the 'gawk'
module loads the C/C++ module as a dynamic plug-in. *note Dynamic
Extensions::, has all the details, and, as expected, many examples to
help you learn the ins and outs.

I enjoy programming in AWK and had fun (re)reading this book. I
think you will too.

Michael Brennan
Author of 'mawk'
October 2014

File: gawk.info, Node: Preface, Next: Getting Started, Prev: Foreword4, Up: Top

Preface
*******

Several kinds of tasks occur repeatedly when working with text files.
You might want to extract certain lines and discard the rest. Or you
may need to make changes wherever certain patterns appear, but leave the
rest of the file alone. Such jobs are often easy with 'awk'. The 'awk'
utility interprets a special-purpose programming language that makes it
easy to handle simple data-reformatting jobs.

The GNU implementation of 'awk' is called 'gawk'; if you invoke it
with the proper options or environment variables, it is fully compatible
with the POSIX(1) specification of the 'awk' language and with the Unix
version of 'awk' maintained by Brian Kernighan. This means that all
properly written 'awk' programs should work with 'gawk'. So most of the
time, we don't distinguish between 'gawk' and other 'awk'
implementations.

Using 'awk' you can:

  * Manage small, personal databases

  * Generate reports

  * Validate data

  * Produce indexes and perform other document-preparation tasks

  * Experiment with algorithms that you can adapt later to other
    computer languages

In addition, 'gawk' provides facilities that make it easy to:

  * Extract bits and pieces of data for processing

  * Sort data

  * Perform simple network communications

  * Profile and debug 'awk' programs

  * Extend the language with functions written in C or C++

This Info file teaches you about the 'awk' language and how you can
use it effectively. You should already be familiar with basic system
commands, such as 'cat' and 'ls',(2) as well as basic shell facilities,
such as input/output (I/O) redirection and pipes.

Implementations of the 'awk' language are available for many
different computing environments. This Info file, while describing the
'awk' language in general, also describes the particular implementation
of 'awk' called 'gawk' (which stands for "GNU 'awk'"). 'gawk' runs on a
broad range of Unix systems, from Intel-architecture PC-based
computers up through large-scale systems. 'gawk' has also been ported
to Mac OS X, Microsoft Windows (all versions) and OS/2 PCs, and
OpenVMS.(3)

* Menu:

* History::                     The history of 'gawk' and
                                'awk'.
* Names::                       What name to use to find 'awk'.
* This Manual::                 Using this Info file. Includes sample
                                input files that you can use.
* Conventions::                 Typographical Conventions.
* Manual History::              Brief history of the GNU project and this
                                Info file.
* How To Contribute::           Helping to save the world.
* Acknowledgments::             Acknowledgments.

---------- Footnotes ----------

(1) The 2008 POSIX standard is accessible online at
<http://www.opengroup.org/onlinepubs/9699919799/>.

(2) These utilities are available on POSIX-compliant systems, as well
as on traditional Unix-based systems. If you are using some other
operating system, you still need to be familiar with the ideas of I/O
redirection and pipes.

(3) Some other, obsolete systems to which 'gawk' was once ported are
no longer supported and the code for those systems has been removed.

File: gawk.info, Node: History, Next: Names, Up: Preface

History of 'awk' and 'gawk'
===========================

     Recipe for a Programming Language

     1 part 'egrep'    1 part 'snobol'
     2 parts 'ed'      3 parts C

     Blend all parts well using 'lex' and 'yacc'. Document minimally and
     release.

     After eight years, add another part 'egrep' and two more parts C.
     Document very well and release.

The name 'awk' comes from the initials of its designers: Alfred V.
Aho, Peter J. Weinberger, and Brian W. Kernighan. The original version
of 'awk' was written in 1977 at AT&T Bell Laboratories. In 1985, a new
version made the programming language more powerful, introducing
user-defined functions, multiple input streams, and computed regular
expressions. This new version became widely available with Unix System
V Release 3.1 (1987). The version in System V Release 4 (1989) added
some new features and cleaned up the behavior in some of the "dark
corners" of the language. The specification for 'awk' in the POSIX
Command Language and Utilities standard further clarified the language.
Both the 'gawk' designers and the original 'awk' designers at Bell
Laboratories provided feedback for the POSIX specification.

Paul Rubin wrote 'gawk' in 1986. Jay Fenlason completed it, with
advice from Richard Stallman. John Woods contributed parts of the code
as well. In 1988 and 1989, David Trueman, with help from me, thoroughly
reworked 'gawk' for compatibility with the newer 'awk'. Circa 1994, I
became the primary maintainer. Current development focuses on bug
fixes, performance improvements, standards compliance, and,
occasionally, new features.

In May 1997, Jürgen Kahrs felt the need for network access from
'awk', and with a little help from me, set about adding features to do
this for 'gawk'. At that time, he also wrote the bulk of 'TCP/IP
Internetworking with 'gawk'' (a separate document, available as part of
the 'gawk' distribution). His code finally became part of the main
'gawk' distribution with 'gawk' version 3.1.

John Haque rewrote the 'gawk' internals, in the process providing an
'awk'-level debugger. This version became available as 'gawk' version
4.0 in 2011.

*Note Contributors::, for a full list of those who have made
important contributions to 'gawk'.

File: gawk.info, Node: Names, Next: This Manual, Prev: History, Up: Preface

A Rose by Any Other Name
========================

The 'awk' language has evolved over the years. Full details are
provided in *note Language History::. The language described in this
Info file is often referred to as "new 'awk'." By analogy, the original
version of 'awk' is referred to as "old 'awk'."

On most current systems, when you run the 'awk' utility you get some
version of new 'awk'.(1) If your system's standard 'awk' is the old
one, you will see something like this if you try the test program:

     $ awk 1 /dev/null
     error-> awk: syntax error near line 1
     error-> awk: bailing out near line 1

In this case, you should find a version of new 'awk', or just install
'gawk'!
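
The test above can be wrapped in a small shell check. This is only a
sketch, assuming a POSIX shell; the variable names and the two verdict
strings are invented for illustration. A new 'awk' accepts the pattern
'1', reads nothing from '/dev/null', prints nothing, and exits
successfully, whereas an old 'awk' is expected to fail with a syntax
error as shown above:

```shell
# Run the test program quietly and look at the exit status.
# "out" collects anything awk prints (normally nothing for a new awk).
if out=$(awk 1 /dev/null 2>&1); then
    verdict="new awk"
else
    verdict="old awk"
fi
echo "$verdict"
```

If the check reports an old 'awk', the advice in the text applies: look
for a newer 'awk' on the system, or install 'gawk'.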

Throughout this Info file, whenever we refer to a language feature
that should be available in any complete implementation of POSIX 'awk',
we simply use the term 'awk'. When referring to a feature that is
specific to the GNU implementation, we use the term 'gawk'.

---------- Footnotes ----------

(1) Only Solaris systems still use an old 'awk' for the default 'awk'
utility. A more modern 'awk' lives in '/usr/xpg6/bin' on these systems.

File: gawk.info, Node: This Manual, Next: Conventions, Prev: Names, Up: Preface

Using This Book
===============

The term 'awk' refers to a particular program as well as to the language
you use to tell this program what to do. When we need to be careful, we
call the language "the 'awk' language," and the program "the 'awk'
utility." This Info file explains both how to write programs in the
'awk' language and how to run the 'awk' utility. The term "'awk'
program" refers to a program written by you in the 'awk' programming
language.

Primarily, this Info file explains the features of 'awk' as defined
in the POSIX standard. It does so in the context of the 'gawk'
implementation. While doing so, it also attempts to describe important
differences between 'gawk' and other 'awk' implementations.(1) Finally,
it notes any 'gawk' features that are not in the POSIX standard for
'awk'.

There are sidebars scattered throughout the Info file. They add a
more complete explanation of points that are relevant, but not likely to
be of interest on first reading. All appear in the index, under the
heading "sidebar."

Most of the time, the examples use complete 'awk' programs. Some of
the more advanced sections show only the part of the 'awk' program that
illustrates the concept being described.

Although this Info file is aimed principally at people who have not
been exposed to 'awk', there is a lot of information here that even the
'awk' expert should find useful. In particular, the description of
POSIX 'awk' and the example programs in *note Library Functions::, and
in *note Sample Programs::, should be of interest.

This Info file is split into several parts, as follows:

  * Part I describes the 'awk' language and the 'gawk' program in
    detail. It starts with the basics, and continues through all of
    the features of 'awk'. It contains the following chapters:

      - *note Getting Started::, provides the essentials you need to
        know to begin using 'awk'.

      - *note Invoking Gawk::, describes how to run 'gawk', the
        meaning of its command-line options, and how it finds 'awk'
        program source files.

      - *note Regexp::, introduces regular expressions in general, and
        in particular the flavors supported by POSIX 'awk' and 'gawk'.

      - *note Reading Files::, describes how 'awk' reads your data.
        It introduces the concepts of records and fields, as well as
        the 'getline' command. I/O redirection is first described
        here. Network I/O is also briefly introduced here.

      - *note Printing::, describes how 'awk' programs can produce
        output with 'print' and 'printf'.

      - *note Expressions::, describes expressions, which are the
        basic building blocks for getting most things done in a
        program.

      - *note Patterns and Actions::, describes how to write patterns
        for matching records, actions for doing something when a
        record is matched, and the predefined variables 'awk' and
        'gawk' use.

      - *note Arrays::, covers 'awk''s one-and-only data structure:
        the associative array. Deleting array elements and whole
        arrays is described, as well as sorting arrays in 'gawk'. The
        major node also describes how 'gawk' provides arrays of
        arrays.

      - *note Functions::, describes the built-in functions 'awk' and
        'gawk' provide, as well as how to define your own functions.
        It also discusses how 'gawk' lets you call functions
        indirectly.

  * Part II shows how to use 'awk' and 'gawk' for problem solving.
    There is lots of code here for you to read and learn from. This
    part contains the following chapters:

      - *note Library Functions::, provides a number of functions
        meant to be used from main 'awk' programs.

      - *note Sample Programs::, provides many sample 'awk' programs.

    Reading these two chapters allows you to see 'awk' solving real
    problems.

  * Part III focuses on features specific to 'gawk'. It contains the
    following chapters:

      - *note Advanced Features::, describes a number of advanced
        features. Of particular note are the abilities to control the
        order of array traversal, have two-way communications with
        another process, perform TCP/IP networking, and profile your
        'awk' programs.

      - *note Internationalization::, describes special features for
        translating program messages into different languages at
        runtime.

      - *note Debugger::, describes the 'gawk' debugger.

      - *note Arbitrary Precision Arithmetic::, describes advanced
        arithmetic facilities.

      - *note Dynamic Extensions::, describes how to add new variables
        and functions to 'gawk' by writing extensions in C or C++.

  * Part IV provides the appendices, the Glossary, and two licenses
    that cover the 'gawk' source code and this Info file, respectively.
    It contains the following appendices:

      - *note Language History::, describes how the 'awk' language has
        evolved since its first release to the present. It also
        describes how 'gawk' has acquired features over time.

      - *note Installation::, describes how to get 'gawk', how to
        compile it on POSIX-compatible systems, and how to compile and
        use it on different non-POSIX systems. It also describes how
        to report bugs in 'gawk' and where to get other freely
        available 'awk' implementations.

      - *note Notes::, describes how to disable 'gawk''s extensions,
        as well as how to contribute new code to 'gawk', and some
        possible future directions for 'gawk' development.

      - *note Basic Concepts::, provides some very cursory background
        material for those who are completely unfamiliar with computer
        programming.

        The *note Glossary::, defines most, if not all, of the
        significant terms used throughout the Info file. If you find
        terms that you aren't familiar with, try looking them up here.

      - *note Copying::, and *note GNU Free Documentation License::,
        present the licenses that cover the 'gawk' source code and
        this Info file, respectively.

---------- Footnotes ----------

(1) All such differences appear in the index under the entry
"differences in 'awk' and 'gawk'."

File: gawk.info, Node: Conventions, Next: Manual History, Prev: This Manual, Up: Preface

Typographical Conventions
=========================

This Info file is written in Texinfo
(http://www.gnu.org/software/texinfo/), the GNU documentation formatting
language. A single Texinfo source file is used to produce both the
printed and online versions of the documentation. This minor node
briefly documents the typographical conventions used in Texinfo.

Examples you would type at the command line are preceded by the
common shell primary and secondary prompts, '$' and '>'. Input that you
type is shown 'like this'. Output from the command is preceded by the
glyph "-|". This typically represents the command's standard output.
Error messages and other output on the command's standard error are
preceded by the glyph "error->". For example:

     $ echo hi on stdout
     -| hi on stdout
     $ echo hello on stderr 1>&2
     error-> hello on stderr

Characters that you type at the keyboard look 'like this'. In
particular, there are special characters called "control characters."
These are characters that you type by holding down both the 'CONTROL'
key and another key, at the same time. For example, a 'Ctrl-d' is typed
by first pressing and holding the 'CONTROL' key, next pressing the 'd'
key, and finally releasing both keys.

For the sake of brevity, throughout this Info file, we refer to Brian
Kernighan's version of 'awk' as "BWK 'awk'." (*Note Other Versions::,
for information on his and other versions.)

Dark Corners
------------

     Dark corners are basically fractal--no matter how much you
     illuminate, there's always a smaller but darker one.
                         -- _Brian Kernighan_

Until the POSIX standard (and 'GAWK: Effective AWK Programming'),
many features of 'awk' were either poorly documented or not documented
at all. Descriptions of such features (often called "dark corners") are
noted in this Info file with "(d.c.)." They also appear in the index
under the heading "dark corner."

But, as noted by the opening quote, any coverage of dark corners is
by definition incomplete.

Extensions to the standard 'awk' language that are supported by more
than one 'awk' implementation are marked "(c.e.)," and listed in the
index under "common extensions" and "extensions, common."

File: gawk.info, Node: Manual History, Next: How To Contribute, Prev: Conventions, Up: Preface

The GNU Project and This Book
=============================

The Free Software Foundation (FSF) is a nonprofit organization dedicated
to the production and distribution of freely distributable software. It
was founded by Richard M. Stallman, the author of the original Emacs
editor. GNU Emacs is the most widely used version of Emacs today.

The GNU(1) Project is an ongoing effort on the part of the Free
Software Foundation to create a complete, freely distributable,
POSIX-compliant computing environment. The FSF uses the GNU General
Public License (GPL) to ensure that its software's source code is always
available to the end user. A copy of the GPL is included for your
reference (*note Copying::). The GPL applies to the C language source
code for 'gawk'. To find out more about the FSF and the GNU Project
online, see the GNU Project's home page (http://www.gnu.org). This Info
file may also be read from GNU's website
(http://www.gnu.org/software/gawk/manual/).

A shell, an editor (Emacs), highly portable optimizing C, C++, and
Objective-C compilers, a symbolic debugger, and dozens of large and small
utilities (such as 'gawk') have all been completed and are freely
available. The GNU operating system kernel (the HURD) has been
released but remains in an early stage of development.

Until the GNU operating system is more fully developed, you should
consider using GNU/Linux, a freely distributable, Unix-like operating
system for Intel, Power Architecture, Sun SPARC, IBM S/390, and other
systems.(2) Many GNU/Linux distributions are available for download
from the Internet.

The Info file itself has gone through multiple previous editions.
Paul Rubin wrote the very first draft of 'The GAWK Manual'; it was
around 40 pages long. Diane Close and Richard Stallman improved it,
yielding a version that was around 90 pages and barely described the
original, "old" version of 'awk'.

I started working with that version in the fall of 1988. As work on
it progressed, the FSF published several preliminary versions (numbered
0.X). In 1996, edition 1.0 was released with 'gawk' 3.0.0. The FSF
published the first two editions under the title 'The GNU Awk User's
Guide'.

This edition maintains the basic structure of the previous editions.
For FSF edition 4.0, the content was thoroughly reviewed and updated.
All references to 'gawk' versions prior to 4.0 were removed. Of
significant note for that edition was the addition of *note Debugger::.

For FSF edition 4.1, the content has been reorganized into parts, and
the major new additions are *note Arbitrary Precision Arithmetic::, and
*note Dynamic Extensions::.

This Info file will undoubtedly continue to evolve. If you find an
error in the Info file, please report it! *Note Bugs::, for information
on submitting problem reports electronically.

---------- Footnotes ----------

(1) GNU stands for "GNU's Not Unix."

(2) The terminology "GNU/Linux" is explained in the *note Glossary::.

File: gawk.info, Node: How To Contribute, Next: Acknowledgments, Prev: Manual History, Up: Preface

How to Contribute
=================

As the maintainer of GNU 'awk', I once thought that I would be able to
manage a collection of publicly available 'awk' programs and I even
solicited contributions. Making things available on the Internet helps
keep the 'gawk' distribution down to manageable size.

The initial collection of material, such as it is, is still available
at <ftp://ftp.freefriends.org/arnold/Awkstuff>. In the hopes of doing
something more broad, I acquired the 'awk.info' domain.

However, I found that I could not dedicate enough time to managing
contributed code: the archive did not grow and the domain went unused
for several years.

Late in 2008, a volunteer took on the task of setting up an
'awk'-related website--<http://awk.info>--and did a very nice job.

If you have written an interesting 'awk' program, or have written a
'gawk' extension that you would like to share with the rest of the
world, please see <http://awk.info/?contribute> for how to contribute it
to the website.

File: gawk.info, Node: Acknowledgments, Prev: How To Contribute, Up: Preface

Acknowledgments
===============

The initial draft of 'The GAWK Manual' had the following
acknowledgments:

     Many people need to be thanked for their assistance in producing
     this manual. Jay Fenlason contributed many ideas and sample
     programs. Richard Mlynarik and Robert Chassell gave helpful
     comments on drafts of this manual. The paper 'A Supplemental
     Document for AWK' by John W. Pierce of the Chemistry Department at
     UC San Diego, pinpointed several issues relevant both to 'awk'
     implementation and to this manual, that would otherwise have
     escaped us.

I would like to acknowledge Richard M. Stallman, for his vision of a
better world and for his courage in founding the FSF and starting the
GNU Project.

Earlier editions of this Info file had the following
acknowledgments:

     The following people (in alphabetical order) provided helpful
     comments on various versions of this book: Rick Adams, Dr. Nelson
     H.F. Beebe, Karl Berry, Dr. Michael Brennan, Rich Burridge, Claire
     Cloutier, Diane Close, Scott Deifik, Christopher ("Topher") Eliot,
     Jeffrey Friedl, Dr. Darrel Hankerson, Michal Jaegermann, Dr.
     Richard J. LeBlanc, Michael Lijewski, Pat Rankin, Miriam Robbins,
     Mary Sheehan, and Chuck Toporek.

     Robert J. Chassell provided much valuable advice on the use of
     Texinfo. He also deserves special thanks for convincing me _not_
     to title this Info file 'How to Gawk Politely'. Karl Berry helped
     significantly with the TeX part of Texinfo.

     I would like to thank Marshall and Elaine Hartholz of Seattle and
     Dr. Bert and Rita Schreiber of Detroit for large amounts of quiet
     vacation time in their homes, which allowed me to make significant
     progress on this Info file and on 'gawk' itself.

     Phil Hughes of SSC contributed in a very important way by loaning
     me his laptop GNU/Linux system, not once, but twice, which allowed
     me to do a lot of work while away from home.

     David Trueman deserves special credit; he has done a yeoman job of
     evolving 'gawk' so that it performs well and without bugs.
     Although he is no longer involved with 'gawk', working with him on
     this project was a significant pleasure.

     The intrepid members of the GNITS mailing list, and most notably
     Ulrich Drepper, provided invaluable help and feedback for the
     design of the internationalization features.

     Chuck Toporek, Mary Sheehan, and Claire Cloutier of O'Reilly &
     Associates contributed significant editorial help for this Info
     file for the 3.1 release of 'gawk'.

Dr. Nelson Beebe, Andreas Buening, Dr. Manuel Collado, Antonio
Colombo, Stephen Davies, Scott Deifik, Akim Demaille, Darrel Hankerson,
Michal Jaegermann, Jürgen Kahrs, Stepan Kasal, John Malmberg, Dave
Pitts, Chet Ramey, Pat Rankin, Andrew Schorr, Corinna Vinschen, and Eli
Zaretskii (in alphabetical order) make up the current 'gawk' "crack
portability team." Without their hard work and help, 'gawk' would not
be nearly the robust, portable program it is today. It has been and
continues to be a pleasure working with this team of fine people.

Notable code and documentation contributions were made by a number of
people. *Note Contributors::, for the full list.

Thanks to Michael Brennan for the Forewords.

Thanks to Patrice Dumas for the new 'makeinfo' program. Thanks to
Karl Berry, who continues to work to keep the Texinfo markup language
sane.

Robert P.J. Day, Michael Brennan, and Brian Kernighan kindly acted as
reviewers for the 2015 edition of this Info file. Their feedback helped
improve the final work.

I would also like to thank Brian Kernighan for his invaluable
assistance during the testing and debugging of 'gawk', and for his
ongoing help and advice in clarifying numerous points about the
language. We could not have done nearly as good a job on either 'gawk'
or its documentation without his help.

Brian is in a class by himself as a programmer and technical author.
I have to thank him (yet again) for his ongoing friendship and for being
a role model to me for close to 30 years! Having him as a reviewer is
an exciting privilege. It has also been extremely humbling...

I must thank my wonderful wife, Miriam, for her patience through the
many versions of this project, for her proofreading, and for sharing me
with the computer. I would like to thank my parents for their love, and
for the grace with which they raised and educated me. Finally, I also
must acknowledge my gratitude to G-d, for the many opportunities He has
sent my way, as well as for the gifts He has given me with which to take
advantage of those opportunities.

Arnold Robbins
Nof Ayalon
Israel
February 2015

File: gawk.info, Node: Getting Started, Next: Invoking Gawk, Prev: Preface, Up: Top

1 Getting Started with 'awk'
****************************

The basic function of 'awk' is to search files for lines (or other units
of text) that contain certain patterns. When a line matches one of the
patterns, 'awk' performs specified actions on that line. 'awk'
continues to process input lines in this way until it reaches the end of
the input files.

Programs in 'awk' are different from programs in most other
languages, because 'awk' programs are "data driven" (i.e., you describe
the data you want to work with and then what to do when you find it).
Most other languages are "procedural"; you have to describe, in great
detail, every step the program should take. When working with
procedural languages, it is usually much harder to clearly describe the
data your program will process. For this reason, 'awk' programs are
often refreshingly easy to read and write.
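
To make "data driven" concrete, here is a minimal sketch. The sample
data, the temporary file, and the threshold of 10 are all invented for
illustration. The program only describes the data of interest (lines
whose second field exceeds 10) and what to do with it (print the first
field); the loop that reads each line is supplied by 'awk' itself:

```shell
# Invented sample data: one "name amount" pair per line.
data=$(mktemp)
printf 'rent 900\ncoffee 4\nbooks 35\ntea 3\n' > "$data"

# Pattern: second field greater than 10.  Action: print the first field.
big=$(awk '$2 > 10 { print $1 }' "$data")
echo "$big"

rm -f "$data"
```

A procedural version of the same job would need explicit code to open
the file, read each line, split it into fields, and test the value.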

When you run 'awk', you specify an 'awk' "program" that tells 'awk'
what to do. The program consists of a series of "rules" (it may also
contain "function definitions", an advanced feature that we will ignore
for now; *note User-defined::). Each rule specifies one pattern to
search for and one action to perform upon finding the pattern.

Syntactically, a rule consists of a "pattern" followed by an
"action". The action is enclosed in braces to separate it from the
pattern. Newlines usually separate rules. Therefore, an 'awk' program
looks like this:

     PATTERN { ACTION }
     PATTERN { ACTION }
     ...
|
||
|
||
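As a minimal sketch of this layout, the following command runs a
two-rule program on three lines of input supplied by 'printf'. The
input text, the two patterns, and the shell variable 'out' are
assumptions made for the illustration; they are not part of the sample
files used later:

```shell
# Two rules, one per line; each has the form PATTERN { ACTION }.
# The three input lines are invented for this illustration.
out=$(printf 'apple pie\nbanana split\ncherry tart\n' |
awk '/an/   { print "rule 1 matched: " $0 }
     /tart/ { print "rule 2 matched: " $0 }')
printf '%s\n' "$out"
```

Only the second line contains 'an' and only the third contains 'tart',
so each rule fires once.
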
* Menu:

* Running gawk::                How to run 'gawk' programs; includes
                                command-line syntax.
* Sample Data Files::           Sample data files for use in the 'awk'
                                programs illustrated in this Info file.
* Very Simple::                 A very simple example.
* Two Rules::                   A less simple one-line example using two
                                rules.
* More Complex::                A more complex example.
* Statements/Lines::            Subdividing or combining statements into
                                lines.
* Other Features::              Other Features of 'awk'.
* When::                        When to use 'gawk' and when to use
                                other things.
* Intro Summary::               Summary of the introduction.

File: gawk.info, Node: Running gawk, Next: Sample Data Files, Up: Getting Started

1.1 How to Run 'awk' Programs
=============================

There are several ways to run an 'awk' program. If the program is
short, it is easiest to include it in the command that runs 'awk', like
this:

     awk 'PROGRAM' INPUT-FILE1 INPUT-FILE2 ...

When the program is long, it is usually more convenient to put it in
a file and run it with a command like this:

     awk -f PROGRAM-FILE INPUT-FILE1 INPUT-FILE2 ...

This minor node discusses both mechanisms, along with several
variations of each.

* Menu:

* One-shot::                    Running a short throwaway 'awk'
                                program.
* Read Terminal::               Using no input files (input from the keyboard
                                instead).
* Long::                        Putting permanent 'awk' programs in
                                files.
* Executable Scripts::          Making self-contained 'awk' programs.
* Comments::                    Adding documentation to 'gawk'
                                programs.
* Quoting::                     More discussion of shell quoting issues.

File: gawk.info, Node: One-shot, Next: Read Terminal, Up: Running gawk

1.1.1 One-Shot Throwaway 'awk' Programs
---------------------------------------

Once you are familiar with 'awk', you will often type in simple programs
the moment you want to use them. Then you can write the program as the
first argument of the 'awk' command, like this:

     awk 'PROGRAM' INPUT-FILE1 INPUT-FILE2 ...

where PROGRAM consists of a series of patterns and actions, as described
earlier.

This command format instructs the "shell", or command interpreter, to
start 'awk' and use the PROGRAM to process records in the input file(s).
There are single quotes around PROGRAM so the shell won't interpret any
'awk' characters as special shell characters. The quotes also cause the
shell to treat all of PROGRAM as a single argument for 'awk', and allow
PROGRAM to be more than one line long.

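The following sketch illustrates both effects of the single quotes.
The input numbers and the names 'total' and 'out' are assumptions made
for this example. The program spans two lines, and the '$1' inside it
reaches 'awk' untouched instead of being expanded by the shell:

```shell
# The quoted program spans two lines and contains "$1", which the
# shell would try to expand if the program were not single-quoted.
out=$(printf '3\n4\n5\n' |
awk '{ total += $1 }
     END { print "total is " total }')
printf '%s\n' "$out"
```

Run in a POSIX shell, this should print 'total is 12'.
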
This format is also useful for running short or medium-sized 'awk'
programs from shell scripts, because it avoids the need for a separate
file for the 'awk' program. A self-contained shell script is more
reliable because there are no other files to misplace.

Later in this chapter, in *note Very Simple::, we'll see examples of
several short, self-contained programs.

File: gawk.info, Node: Read Terminal, Next: Long, Prev: One-shot, Up: Running gawk

1.1.2 Running 'awk' Without Input Files
---------------------------------------

You can also run 'awk' without any input files. If you type the
following command line:

     awk 'PROGRAM'

'awk' applies the PROGRAM to the "standard input", which usually means
whatever you type on the keyboard. This continues until you indicate
end-of-file by typing 'Ctrl-d'. (On non-POSIX operating systems, the
end-of-file character may be different. For example, on OS/2, it is
'Ctrl-z'.)

As an example, the following program prints a friendly piece of
advice (from Douglas Adams's 'The Hitchhiker's Guide to the Galaxy'), to
keep you from worrying about the complexities of computer programming:

     $ awk 'BEGIN { print "Don\47t Panic!" }'
     -| Don't Panic!

'awk' executes statements associated with 'BEGIN' before reading any
input. If there are no other statements in your program, as is the case
here, 'awk' just stops, instead of trying to read input it doesn't know
how to process. The '\47' is a magic way (explained later) of getting a
single quote into the program, without having to engage in ugly shell
quoting tricks.

     NOTE: If you use Bash as your shell, you should execute the command
     'set +H' before running this program interactively, to disable the
     C shell-style command history, which treats '!' as a special
     character. We recommend putting this command into your personal
     startup file.

This next simple 'awk' program emulates the 'cat' utility; it copies
whatever you type on the keyboard to its standard output (why this works
is explained shortly):

     $ awk '{ print }'
     Now is the time for all good men
     -| Now is the time for all good men
     to come to the aid of their country.
     -| to come to the aid of their country.
     Four score and seven years ago, ...
     -| Four score and seven years ago, ...
     What, me worry?
     -| What, me worry?
     Ctrl-d

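Standard input need not be the keyboard. As a small sketch (the input
text and the variable name 'out' are invented for the example), the
same 'cat'-like program can read from a pipe, in which case end-of-file
arrives when the upstream command finishes and no 'Ctrl-d' is needed:

```shell
# With no input files named, awk reads standard input; here that is
# the output of printf rather than the keyboard.
out=$(printf 'first line\nsecond line\n' | awk '{ print }')
printf '%s\n' "$out"
```
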
File: gawk.info, Node: Long, Next: Executable Scripts, Prev: Read Terminal, Up: Running gawk

1.1.3 Running Long Programs
---------------------------

Sometimes 'awk' programs are very long. In these cases, it is more
convenient to put the program into a separate file. In order to tell
'awk' to use that file for its program, you type:

     awk -f SOURCE-FILE INPUT-FILE1 INPUT-FILE2 ...

The '-f' instructs the 'awk' utility to get the 'awk' program from
the file SOURCE-FILE (*note Options::). Any file name can be used for
SOURCE-FILE. For example, you could put the program:

     BEGIN { print "Don't Panic!" }

into the file 'advice'. Then this command:

     awk -f advice

does the same thing as this one:

     awk 'BEGIN { print "Don\47t Panic!" }'

This was explained earlier (*note Read Terminal::). Note that you don't
usually need single quotes around the file name that you specify with
'-f', because most file names don't contain any of the shell's special
characters. Notice that in 'advice', the 'awk' program did not have
single quotes around it. The quotes are only needed for programs that
are provided on the 'awk' command line. (Also, placing the program in a
file allows us to use a literal single quote in the program text,
instead of the magic '\47'.)

If you want to clearly identify an 'awk' program file as such, you
can add the extension '.awk' to the file name. This doesn't affect the
execution of the 'awk' program but it does make "housekeeping" easier.

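To try this without opening an editor, a sketch such as the following
creates the program file from the shell and then runs it. The scratch
directory made with 'mktemp -d' and the file name 'advice.awk' are
assumptions chosen for the illustration; the quoted 'EOF' delimiter
keeps the shell from interpreting anything in the program text:

```shell
# Work in a scratch directory so that no existing file is overwritten.
dir=$(mktemp -d)

# Write the program into a file.  No surrounding single quotes are
# needed inside the file, and a literal apostrophe is fine.
cat > "$dir/advice.awk" <<'EOF'
BEGIN { print "Don't Panic!" }
EOF

out=$(awk -f "$dir/advice.awk")
printf '%s\n' "$out"
rm -r "$dir"
```
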
File: gawk.info, Node: Executable Scripts, Next: Comments, Prev: Long, Up: Running gawk

1.1.4 Executable 'awk' Programs
-------------------------------

Once you have learned 'awk', you may want to write self-contained 'awk'
scripts, using the '#!' script mechanism. You can do this on many
systems.(1) For example, you could update the file 'advice' to look
like this:

     #! /bin/awk -f

     BEGIN { print "Don't Panic!" }

After making this file executable (with the 'chmod' utility), simply
type 'advice' at the shell and the system arranges to run 'awk' as if
you had typed 'awk -f advice':

     $ chmod +x advice
     $ advice
     -| Don't Panic!

(We assume you have the current directory in your shell's search path
variable [typically '$PATH']. If not, you may need to type './advice'
at the shell.)

Self-contained 'awk' scripts are useful when you want to write a
program that users can invoke without their having to know that the
program is written in 'awk'.

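Because the location of 'awk' differs between systems, a cautious way
to experiment is to build the '#!' line from whatever 'command -v awk'
reports. This is only a sketch under several assumptions: the scratch
directory and the file name 'advice' are ours, 'awk' must be an
executable file found in the search path, and the scratch directory
must allow programs to be executed:

```shell
dir=$(mktemp -d)

# Put the full path of the local awk on the "#!" line, followed by
# the single argument -f.
printf '#! %s -f\n' "$(command -v awk)" > "$dir/advice"

# Append the program text itself.
cat >> "$dir/advice" <<'EOF'
BEGIN { print "Don't Panic!" }
EOF

chmod +x "$dir/advice"
out=$("$dir/advice")
printf '%s\n' "$out"
rm -r "$dir"
```

The script is invoked by its full path here, so the search path does
not come into play.
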
Understanding '#!'

'awk' is an "interpreted" language. This means that the 'awk'
utility reads your program and then processes your data according to the
instructions in your program. (This is different from a "compiled"
language such as C, where your program is first compiled into machine
code that is executed directly by your system's processor.) The 'awk'
utility is thus termed an "interpreter". Many modern languages are
interpreted.

The line beginning with '#!' lists the full file name of an
interpreter to run and a single optional initial command-line argument
to pass to that interpreter. The operating system then runs the
interpreter with the given argument and the full argument list of the
executed program. The first argument in the list is the full file name
of the 'awk' program. The rest of the argument list contains either
options to 'awk', or data files, or both. (Note that on many systems
'awk' may be found in '/usr/bin' instead of in '/bin'.)

Some systems limit the length of the interpreter name to 32
characters. Often, this can be dealt with by using a symbolic link.

You should not put more than one argument on the '#!' line after the
path to 'awk'. It does not work. The operating system treats the rest
of the line as a single argument and passes it to 'awk'. Doing this
leads to confusing behavior--most likely a usage diagnostic of some sort
from 'awk'.

Finally, the value of 'ARGV[0]' (*note Built-in Variables::) varies
depending upon your operating system. Some systems put 'awk' there,
some put the full pathname of 'awk' (such as '/bin/awk'), and some put
the name of your script ('advice'). (d.c.) Don't rely on the value of
'ARGV[0]' to provide your script name.

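To see what a particular system does, you can simply print the value.
The sketch below runs 'awk' directly instead of through a '#!' script,
so it shows only the command-line case, and the variable 'name' is our
own. No output is listed, because it differs between implementations
and systems:

```shell
# Print whatever this awk stores in ARGV[0].  It is commonly the
# name under which awk was invoked, but that is not guaranteed.
name=$(awk 'BEGIN { print ARGV[0] }')
printf 'ARGV[0] is "%s"\n' "$name"
```
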
---------- Footnotes ----------

(1) The '#!' mechanism works on GNU/Linux systems, BSD-based systems,
and commercial Unix systems.

File: gawk.info, Node: Comments, Next: Quoting, Prev: Executable Scripts, Up: Running gawk

1.1.5 Comments in 'awk' Programs
--------------------------------

A "comment" is some text that is included in a program for the sake of
human readers; it is not really an executable part of the program.
Comments can explain what the program does and how it works. Nearly all
programming languages have provisions for comments, as programs are
typically hard to understand without them.

In the 'awk' language, a comment starts with the number sign
character ('#') and continues to the end of the line. The '#' does not
have to be the first character on the line. The 'awk' language ignores
the rest of a line following a number sign. For example, we could have
put the following into 'advice':

     # This program prints a nice, friendly message. It helps
     # keep novice users from being afraid of the computer.
     BEGIN { print "Don't Panic!" }

You can put comment lines into keyboard-composed throwaway 'awk'
programs, but this usually isn't very useful; the purpose of a comment
is to help you or another person understand the program when reading it
at a later time.

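A comment may also follow code on the same line. In the sketch below,
a variation on the 'advice' program (the comment wording, the second
'print' statement, and the scratch file are ours), everything from each
comment's '#' to the end of its line is ignored, while a '#' inside a
string constant is ordinary text:

```shell
dir=$(mktemp -d)

cat > "$dir/advice.awk" <<'EOF'
# A whole-line comment.
BEGIN {
    print "Don't Panic!"     # a comment after a statement
    print "item #42"         # the "#" inside the string is data
}
EOF

out=$(awk -f "$dir/advice.awk")
printf '%s\n' "$out"
rm -r "$dir"
```
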
     CAUTION: As mentioned in *note One-shot::, you can enclose short to
     medium-sized programs in single quotes, in order to keep your shell
     scripts self-contained. When doing so, _don't_ put an apostrophe
     (i.e., a single quote) into a comment (or anywhere else in your
     program). The shell interprets the quote as the closing quote for
     the entire program. As a result, usually the shell prints a
     message about mismatched quotes, and if 'awk' actually runs, it
     will probably print strange messages about syntax errors. For
     example, look at the following:

          $ awk 'BEGIN { print "hello" } # let's be cute'
          >

     The shell sees that the first two quotes match, and that a new
     quoted object begins at the end of the command line. It therefore
     prompts with the secondary prompt, waiting for more input. With
     Unix 'awk', closing the quoted string produces this result:

          $ awk '{ print "hello" } # let's be cute'
          > '
          error-> awk: can't open file be
          error-> source line number 1

     Putting a backslash before the single quote in 'let's' wouldn't
     help, because backslashes are not special inside single quotes.
     The next node describes the shell's quoting rules.

File: gawk.info, Node: Quoting, Prev: Comments, Up: Running gawk

1.1.6 Shell Quoting Issues
--------------------------

* Menu:

* DOS Quoting::                 Quoting in Windows Batch Files.

For short to medium-length 'awk' programs, it is most convenient to
enter the program on the 'awk' command line. This is best done by
enclosing the entire program in single quotes. This is true whether you
are entering the program interactively at the shell prompt, or writing
it as part of a larger shell script:

     awk 'PROGRAM TEXT' INPUT-FILE1 INPUT-FILE2 ...

Once you are working with the shell, it is helpful to have a basic
knowledge of shell quoting rules. The following rules apply only to
POSIX-compliant, Bourne-style shells (such as Bash, the GNU Bourne-Again
Shell). If you use the C shell, you're on your own.

Before diving into the rules, we introduce a concept that appears
throughout this Info file, which is that of the "null", or empty,
string.

The null string is character data that has no value. In other words,
it is empty. It is written in 'awk' programs like this: '""'. In the
shell, it can be written using single or double quotes: '""' or ''''.
Although the null string has no characters in it, it does exist. For
example, consider this command:

     $ echo ""

Here, the 'echo' utility receives a single argument, even though that
argument has no characters in it. In the rest of this Info file, we use
the terms "null string" and "empty string" interchangeably. Now, on to
the quoting rules:

   * Quoted items can be concatenated with nonquoted items as well as
     with other quoted items. The shell turns everything into one
     argument for the command.

   * Preceding any single character with a backslash ('\') quotes that
     character. The shell removes the backslash and passes the quoted
     character on to the command.

   * Single quotes protect everything between the opening and closing
     quotes. The shell does no interpretation of the quoted text,
     passing it on verbatim to the command. It is _impossible_ to embed
     a single quote inside single-quoted text. Refer back to *note
     Comments::, for an example of what happens if you try.

   * Double quotes protect most things between the opening and closing
     quotes. The shell does at least variable and command substitution
     on the quoted text. Different shells may do additional kinds of
     processing on double-quoted text.

     Because certain characters within double-quoted text are processed
     by the shell, they must be "escaped" within the text. Of note are
     the characters '$', '`', '\', and '"', all of which must be
     preceded by a backslash within double-quoted text if they are to be
     passed on literally to the program. (The leading backslash is
     stripped first.) Thus, the example seen in *note Read Terminal:::

          awk 'BEGIN { print "Don\47t Panic!" }'

     could instead be written this way:

          $ awk "BEGIN { print \"Don't Panic!\" }"
          -| Don't Panic!

     Note that the single quote is not special within double quotes.

   * Null strings are removed when they occur as part of a non-null
     command-line argument, while explicit null objects are kept. For
     example, to specify that the field separator 'FS' should be set to
     the null string, use:

          awk -F "" 'PROGRAM' FILES # correct

     Don't use this:

          awk -F"" 'PROGRAM' FILES # wrong!

     In the second case, 'awk' attempts to use the text of the program
     as the value of 'FS', and the first file name as the text of the
     program! This results in syntax errors at best, and confusing
     behavior at worst.

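The rules above can be checked directly from the shell. The following
sketch uses a small helper function, 'count_args', which is our own
invention for this illustration, to show how many arguments a command
receives; the other variable names and the sample strings are likewise
assumptions:

```shell
# Hypothetical helper: report the number of arguments received.
count_args() { echo "$#"; }

# An explicit null string is kept as an argument of its own.
n_null=$(count_args "")

# A null string glued to other text disappears: -F"" is just -F.
glued=$(printf '%s' -F"")

# Quoted and unquoted pieces concatenate into a single argument.
n_concat=$(count_args 'abc'def"ghi")

# Inside double quotes, \$ yields a literal dollar sign.
dollar=$(printf '%s' "cost: \$5")

printf '%s\n' "$n_null" "$glued" "$n_concat" "$dollar"
```

The second result is the reason 'awk -F"" ...' goes wrong: 'awk' sees a
bare '-F' and takes the next argument as the separator.
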
Mixing single and double quotes is difficult. You have to resort to
shell quoting tricks, like this:

     $ awk 'BEGIN { print "Here is a single quote <'"'"'>" }'
     -| Here is a single quote <'>

This program consists of three concatenated quoted strings. The first
and the third are single-quoted, and the second is double-quoted.

This can be "simplified" to:

     $ awk 'BEGIN { print "Here is a single quote <'\''>" }'
     -| Here is a single quote <'>

Judge for yourself which of these two is the more readable.

Another option is to use double quotes, escaping the embedded,
'awk'-level double quotes:

     $ awk "BEGIN { print \"Here is a single quote <'>\" }"
     -| Here is a single quote <'>

This option is also painful, because double quotes, backslashes, and
dollar signs are very common in more advanced 'awk' programs.

A third option is to use the octal escape sequence equivalents (*note
Escape Sequences::) for the single- and double-quote characters, like
so:

     $ awk 'BEGIN { print "Here is a single quote <\47>" }'
     -| Here is a single quote <'>
     $ awk 'BEGIN { print "Here is a double quote <\42>" }'
     -| Here is a double quote <">

This works nicely, but you should comment clearly what the escapes mean.

A fourth option is to use command-line variable assignment, like
this:

     $ awk -v sq="'" 'BEGIN { print "Here is a single quote <" sq ">" }'
     -| Here is a single quote <'>

(Here, the two string constants and the value of 'sq' are
concatenated into a single string that is printed by 'print'.)

If you really need both single and double quotes in your 'awk'
program, it is probably best to move it into a separate file, where the
shell won't be part of the picture and you can say what you mean.

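As a sketch of that last suggestion, the program below uses both kinds
of quotes freely because it lives in a file. The scratch directory and
the file name 'quotes.awk' are assumptions for the example; the quoted
here-document delimiter serves only to create the file and is not part
of the 'awk' program:

```shell
dir=$(mktemp -d)

# Inside the file, the apostrophe needs no special treatment; the
# double quote is escaped only because it sits inside an awk string.
cat > "$dir/quotes.awk" <<'EOF'
BEGIN { print "Here is a single quote <'> and a double quote <\">" }
EOF

out=$(awk -f "$dir/quotes.awk")
printf '%s\n' "$out"
rm -r "$dir"
```
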
File: gawk.info, Node: DOS Quoting, Up: Quoting

1.1.6.1 Quoting in MS-Windows Batch Files
.........................................

Although this Info file generally only worries about POSIX systems and
the POSIX shell, the following issue arises often enough for many users
that it is worth addressing.

The "shells" on Microsoft Windows systems use the double-quote
character for quoting, and make it difficult or impossible to include an
escaped double-quote character in a command-line script. The following
example, courtesy of Jeroen Brink, shows how to print all lines in a
file surrounded by double quotes:

     gawk "{ print \"\042\" $0 \"\042\" }" FILE

File: gawk.info, Node: Sample Data Files, Next: Very Simple, Prev: Running gawk, Up: Getting Started

1.2 Data files for the Examples
===============================

Many of the examples in this Info file take their input from two sample
data files. The first, 'mail-list', represents a list of people's names
together with their email addresses and information about those people.
The second data file, called 'inventory-shipped', contains information
about monthly shipments. In both files, each line is considered to be
one "record".

In 'mail-list', each record contains the name of a person, his/her
phone number, his/her email address, and a code for his/her relationship
with the author of the list. The columns are aligned using spaces. An
'A' in the last column means that the person is an acquaintance. An 'F'
in the last column means that the person is a friend. An 'R' means that
the person is a relative:

     Amelia       555-5553     amelia.zodiacusque@gmail.com    F
     Anthony      555-3412     anthony.asserturo@hotmail.com   A
     Becky        555-7685     becky.algebrarum@gmail.com      A
     Bill         555-1675     bill.drowning@hotmail.com       A
     Broderick    555-0542     broderick.aliquotiens@yahoo.com R
     Camilla      555-2912     camilla.infusarum@skynet.be     R
     Fabius       555-1234     fabius.undevicesimus@ucb.edu    F
     Julie        555-6699     julie.perscrutabor@skeeve.com   F
     Martin       555-6480     martin.codicibus@hotmail.com    A
     Samuel       555-3430     samuel.lanceolis@shu.edu        A
     Jean-Paul    555-2127     jeanpaul.campanorum@nyu.edu     R

The data file 'inventory-shipped' represents information about
shipments during the year. Each record contains the month, the number
of green crates shipped, the number of red boxes shipped, the number of
orange bags shipped, and the number of blue packages shipped,
respectively. There are 16 entries, covering the 12 months of last year
and the first four months of the current year. An empty line separates
the data for the two years:

     Jan  13  25  15 115
     Feb  15  32  24 226
     Mar  15  24  34 228
     Apr  31  52  63 420
     May  16  34  29 208
     Jun  31  42  75 492
     Jul  24  34  67 436
     Aug  15  34  47 316
     Sep  13  55  37 277
     Oct  29  54  68 525
     Nov  20  87  82 577
     Dec  17  35  61 401

     Jan  21  36  64 620
     Feb  26  58  80 652
     Mar  24  75  70 495
     Apr  21  70  74 514

The sample files are included in the 'gawk' distribution, in the
directory 'awklib/eg/data'.

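If the distribution files are not at hand, a few records are enough to
experiment with. The sketch below copies three records from the
'mail-list' listing into a scratch file (the directory is an assumption
of the example) and prints the name and the relationship code, which
are the first and fourth whitespace-separated fields of each record:

```shell
dir=$(mktemp -d)

# Three records taken from the mail-list listing.
cat > "$dir/mail-list" <<'EOF'
Amelia       555-5553     amelia.zodiacusque@gmail.com    F
Anthony      555-3412     anthony.asserturo@hotmail.com   A
Broderick    555-0542     broderick.aliquotiens@yahoo.com R
EOF

# $1 is the name and $4 is the relationship code.
out=$(awk '{ print $1, $4 }' "$dir/mail-list")
printf '%s\n' "$out"
rm -r "$dir"
```
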
File: gawk.info, Node: Very Simple, Next: Two Rules, Prev: Sample Data Files, Up: Getting Started

1.3 Some Simple Examples
========================

The following command runs a simple 'awk' program that searches the
input file 'mail-list' for the character string 'li' (a grouping of
characters is usually called a "string"; the term "string" is based on
similar usage in English, such as "a string of pearls" or "a string of
cars in a train"):

     awk '/li/ { print $0 }' mail-list

When lines containing 'li' are found, they are printed because 'print $0'
means print the current line. (Just 'print' by itself means the same
thing, so we could have written that instead.)

You will notice that slashes ('/') surround the string 'li' in the
'awk' program. The slashes indicate that 'li' is the pattern to search
for. This type of pattern is called a "regular expression", which is
covered in more detail later (*note Regexp::). The pattern is allowed
to match parts of words. There are single quotes around the 'awk'
program so that the shell won't interpret any of it as special shell
characters.

Here is what this program prints:

     $ awk '/li/ { print $0 }' mail-list
     -| Amelia       555-5553     amelia.zodiacusque@gmail.com    F
     -| Broderick    555-0542     broderick.aliquotiens@yahoo.com R
     -| Julie        555-6699     julie.perscrutabor@skeeve.com   F
     -| Samuel       555-3430     samuel.lanceolis@shu.edu        A

In an 'awk' rule, either the pattern or the action can be omitted,
but not both. If the pattern is omitted, then the action is performed
for _every_ input line. If the action is omitted, the default action is
to print all lines that match the pattern.

Thus, we could leave out the action (the 'print' statement and the
braces) in the previous example and the result would be the same: 'awk'
prints all lines matching the pattern 'li'. By comparison, omitting the
'print' statement but retaining the braces makes an empty action that
does nothing (i.e., no lines are printed).

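The three cases can be compared side by side. In this sketch the two
input lines and all variable names are invented, and the results are
captured in shell variables only so that they can be shown together:

```shell
input='Amelia likes tea
Bill drinks coffee'

# Pattern only: the default action prints each matching line.
pattern_only=$(printf '%s\n' "$input" | awk '/li/')

# Action only: the action runs for every input line.
action_only=$(printf '%s\n' "$input" | awk '{ print $1 }')

# Pattern with an empty action: matching lines are found, but
# nothing is printed.
empty_action=$(printf '%s\n' "$input" | awk '/li/ { }')

printf 'pattern only: %s\n' "$pattern_only"
printf 'action only:  %s\n' "$action_only"
printf 'empty action: %s\n' "$empty_action"
```

Only the first input line contains 'li', so the pattern-only form
prints that line and the empty-action form prints nothing at all.
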
Many practical 'awk' programs are just a line or two long. Following
is a collection of useful, short programs to get you started. Some of
these programs contain constructs that haven't been covered yet. (The
description of the program will give you a good idea of what is going
on, but you'll need to read the rest of the Info file to become an 'awk'
expert!) Most of the examples use a data file named 'data'. This is
just a placeholder; if you use these programs yourself, substitute your
own file names for 'data'. For future reference, note that there is
often more than one way to do things in 'awk'. At some point, you may
want to look back at these examples and see if you can come up with
different ways to do the same things shown here:

   * Print every line that is longer than 80 characters:

          awk 'length($0) > 80' data

     The sole rule has a relational expression as its pattern and has no
     action--so it uses the default action, printing the record.

   * Print the length of the longest input line:

          awk '{ if (length($0) > max) max = length($0) }
               END { print max }' data

     The code associated with 'END' executes after all input has been
     read; it's the other side of the coin to 'BEGIN'.

   * Print the length of the longest line in 'data':

          expand data | awk '{ if (x < length($0)) x = length($0) }
                             END { print "maximum line length is " x }'

     This example differs slightly from the previous one: the input is
     processed by the 'expand' utility to change TABs into spaces, so
     the widths compared are actually the right-margin columns, as
     opposed to the number of input characters on each line.

   * Print every line that has at least one field:

          awk 'NF > 0' data

     This is an easy way to delete blank lines from a file (or rather,
     to create a new file similar to the old file but from which the
     blank lines have been removed).

   * Print seven random numbers from 0 to 100, inclusive:

          awk 'BEGIN { for (i = 1; i <= 7; i++)
                           print int(101 * rand()) }'

   * Print the total number of bytes used by FILES:

          ls -l FILES | awk '{ x += $5 }
                             END { print "total bytes: " x }'

   * Print the total number of kilobytes used by FILES:

          ls -l FILES | awk '{ x += $5 }
                             END { print "total K-bytes:", x / 1024 }'

   * Print a sorted list of the login names of all users:

          awk -F: '{ print $1 }' /etc/passwd | sort

   * Count the lines in a file:

          awk 'END { print NR }' data

   * Print the even-numbered lines in the data file:

          awk 'NR % 2 == 0' data

     If you used the expression 'NR % 2 == 1' instead, the program would
     print the odd-numbered lines.

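As one illustration of there being more than one way to do things, the
sketch below selects the even-numbered lines of a six-line input in two
ways. The input words and the variable names are assumptions of the
example. The second form relies on 'awk' treating a pattern whose value
is nonzero as true:

```shell
input=$(printf '%s\n' one two three four five six)

# The form used in the list: compare the remainder with zero.
even_a=$(printf '%s\n' "$input" | awk 'NR % 2 == 0')

# An alternative: NR % 2 is nonzero (true) for odd-numbered lines,
# so negating it selects the even-numbered ones.
even_b=$(printf '%s\n' "$input" | awk '!(NR % 2)')

printf '%s\n' "$even_a"
```

Both commands should select the lines 'two', 'four', and 'six'.
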
File: gawk.info, Node: Two Rules, Next: More Complex, Prev: Very Simple, Up: Getting Started

1.4 An Example with Two Rules
=============================

The 'awk' utility reads the input files one line at a time. For each
line, 'awk' tries the patterns of each rule. If several patterns match,
then several actions execute in the order in which they appear in the
'awk' program. If no patterns match, then no actions run.

After processing all the rules that match the line (and perhaps there
are none), 'awk' reads the next line. (However, *note Next Statement::,
and also *note Nextfile Statement::.) This continues until the program
reaches the end of the file. For example, the following 'awk' program
contains two rules:

     /12/ { print $0 }
     /21/ { print $0 }

The first rule has the string '12' as the pattern and 'print $0' as the
action. The second rule has the string '21' as the pattern and also has
'print $0' as the action. Each rule's action is enclosed in its own
pair of braces.

This program prints every line that contains the string '12' _or_ the
string '21'. If a line contains both strings, it is printed twice, once
by each rule.

This is what happens if we run this program on our two sample data
files, 'mail-list' and 'inventory-shipped':

     $ awk '/12/ { print $0 }
     >      /21/ { print $0 }' mail-list inventory-shipped
     -| Anthony      555-3412     anthony.asserturo@hotmail.com   A
     -| Camilla      555-2912     camilla.infusarum@skynet.be     R
     -| Fabius       555-1234     fabius.undevicesimus@ucb.edu    F
     -| Jean-Paul    555-2127     jeanpaul.campanorum@nyu.edu     R
     -| Jean-Paul    555-2127     jeanpaul.campanorum@nyu.edu     R
     -| Jan  21  36  64 620
     -| Apr  21  70  74 514

Note how the line beginning with 'Jean-Paul' in 'mail-list' was printed
twice, once for each rule.

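The same behavior can be reproduced on a smaller scale. The three
input lines in this sketch are invented, and each action is labeled so
that the rule responsible for a line of output is visible. The middle
input line contains both '12' and '21' and is therefore printed by each
rule in turn:

```shell
out=$(printf '12 apples\n121 pears\n34 plums\n' |
awk '/12/ { print "rule 1: " $0 }
     /21/ { print "rule 2: " $0 }')
printf '%s\n' "$out"
```

Both rules are tried on a line before 'awk' moves on, so the two copies
of '121 pears' appear together, and '34 plums' matches neither rule.
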
File: gawk.info, Node: More Complex, Next: Statements/Lines, Prev: Two Rules, Up: Getting Started

1.5 A More Complex Example
==========================

Now that we've mastered some simple tasks, let's look at what typical
'awk' programs do. This example shows how 'awk' can be used to
summarize, select, and rearrange the output of another utility. It uses
features that haven't been covered yet, so don't worry if you don't
understand all the details:

     ls -l | awk '$6 == "Nov" { sum += $5 }
                  END { print sum }'

This command prints the total number of bytes in all the files in the
current directory that were last modified in November (of any year).
The 'ls -l' part of this example is a system command that gives you a
listing of the files in a directory, including each file's size and the
date the file was last modified. Its output looks like this:

     -rw-r--r--  1 arnold   user   1933 Nov  7 13:05 Makefile
     -rw-r--r--  1 arnold   user  10809 Nov  7 13:03 awk.h
     -rw-r--r--  1 arnold   user    983 Apr 13 12:14 awk.tab.h
     -rw-r--r--  1 arnold   user  31869 Jun 15 12:20 awkgram.y
     -rw-r--r--  1 arnold   user  22414 Nov  7 13:03 awk1.c
     -rw-r--r--  1 arnold   user  37455 Nov  7 13:03 awk2.c
     -rw-r--r--  1 arnold   user  27511 Dec  9 13:07 awk3.c
     -rw-r--r--  1 arnold   user   7989 Nov  7 13:03 awk4.c

The first field contains read-write permissions, the second field
contains the number of links to the file, and the third field identifies
the file's owner. The fourth field identifies the file's group. The
fifth field contains the file's size in bytes. The sixth, seventh, and
eighth fields contain the month, day, and time, respectively, that the
file was last modified. Finally, the ninth field contains the file
name.

The '$6 == "Nov"' in our 'awk' program is an expression that tests
whether the sixth field of the output from 'ls -l' matches the string
'Nov'. Each time a line has the string 'Nov' for its sixth field, 'awk'
performs the action 'sum += $5'. This adds the fifth field (the file's
size) to the variable 'sum'. As a result, when 'awk' has finished
reading all the input lines, 'sum' is the total of the sizes of the
files whose lines matched the pattern. (This works because 'awk'
variables are automatically initialized to zero.)

After the last line of output from 'ls' has been processed, the 'END'
rule executes and prints the value of 'sum'. In this example, the value
of 'sum' is 80600.

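Because the output of 'ls -l' depends on the directory, the date, and
the locale, the total is easiest to verify by feeding the sample
listing to the same program. In this sketch the listing is stored in a
shell variable ('listing', a name chosen for the example) instead of
coming from 'ls':

```shell
# The sample listing from the text, used in place of real ls output.
listing='-rw-r--r--  1 arnold   user   1933 Nov  7 13:05 Makefile
-rw-r--r--  1 arnold   user  10809 Nov  7 13:03 awk.h
-rw-r--r--  1 arnold   user    983 Apr 13 12:14 awk.tab.h
-rw-r--r--  1 arnold   user  31869 Jun 15 12:20 awkgram.y
-rw-r--r--  1 arnold   user  22414 Nov  7 13:03 awk1.c
-rw-r--r--  1 arnold   user  37455 Nov  7 13:03 awk2.c
-rw-r--r--  1 arnold   user  27511 Dec  9 13:07 awk3.c
-rw-r--r--  1 arnold   user   7989 Nov  7 13:03 awk4.c'

# Add up field 5 (the size) on lines whose field 6 is "Nov".
sum=$(printf '%s\n' "$listing" |
awk '$6 == "Nov" { sum += $5 }
     END { print sum }')
printf '%s\n' "$sum"
```

The five November entries add up to 1933 + 10809 + 22414 + 37455 +
7989 = 80600.
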
These more advanced 'awk' techniques are covered in later sections
(*note Action Overview::). Before you can move on to more advanced
'awk' programming, you have to know how 'awk' interprets your input and
displays your output. By manipulating fields and using 'print'
statements, you can produce some very useful and impressive-looking
reports.

File: gawk.info, Node: Statements/Lines, Next: Other Features, Prev: More Complex, Up: Getting Started
|
||
|
||
1.6 'awk' Statements Versus Lines
|
||
=================================
|
||
|
||
Most often, each line in an 'awk' program is a separate statement or
|
||
separate rule, like this:
|
||
|
||
awk '/12/ { print $0 }
|
||
/21/ { print $0 }' mail-list inventory-shipped
|
||
|
||
However, 'gawk' ignores newlines after any of the following symbols
|
||
and keywords:
|
||
|
||
, { ? : || && do else
|
||
|
||
A newline at any other point is considered the end of the statement.(1)
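To make the rule concrete, here is a small sketch of our own (the input
data and the 'result' shell variable are invented for illustration). It
breaks lines only after '&&', '{', and ',', so it does not rely on the
'?' and ':' extension and needs no backslashes:

```shell
# Newlines after '&&', '{', and ',' do not end the statement.
result=$(printf '%s\n' 'alice 30 yes' 'bob 17 yes' 'carol 45 no' |
    awk '$2 >= 18 &&
         $3 == "yes" {
             print $1,
                   "is eligible"
         }')
printf '%s\n' "$result"    # prints: alice is eligible
```
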
If you would like to split a single statement into two lines at a
point where a newline would terminate it, you can "continue" it by
ending the first line with a backslash character ('\'). The backslash
must be the final character on the line in order to be recognized as a
continuation character. A backslash followed by a newline is allowed
anywhere in the statement, even in the middle of a string or regular
expression. For example:

     awk '/This regular expression is too long, so continue it\
      on the next line/ { print $1 }'

We have generally not used backslash continuation in our sample
programs. 'gawk' places no limit on the length of a line, so backslash
continuation is never strictly necessary; it just makes programs more
readable. For this same reason, as well as for clarity, we have kept
most statements short in the programs presented throughout the Info
file. Backslash continuation is most useful when your 'awk' program is
in a separate source file instead of entered from the command line. You
should also note that many 'awk' implementations are more particular
about where you may use backslash continuation. For example, they may
not allow you to split a string constant using backslash continuation.
Thus, for maximum portability of your 'awk' programs, it is best not to
split your lines in the middle of a regular expression or a string.

     CAUTION: _Backslash continuation does not work as described with
     the C shell._ It works for 'awk' programs in files and for
     one-shot programs, _provided_ you are using a POSIX-compliant
     shell, such as the Unix Bourne shell or Bash. But the C shell
     behaves differently! There you must use two backslashes in a row,
     followed by a newline. Note also that when using the C shell,
     _every_ newline in your 'awk' program must be escaped with a
     backslash. To illustrate:

          % awk 'BEGIN { \
          ?   print \\
          ?       "hello, world" \
          ? }'
          -| hello, world

     Here, the '%' and '?' are the C shell's primary and secondary
     prompts, analogous to the standard shell's '$' and '>'.

     Compare the previous example to how it is done with a
     POSIX-compliant shell:

          $ awk 'BEGIN {
          >   print \
          >       "hello, world"
          > }'
          -| hello, world

'awk' is a line-oriented language. Each rule's action has to begin
on the same line as the pattern. To have the pattern and action on
separate lines, you _must_ use backslash continuation; there is no other
option.

Another thing to keep in mind is that backslash continuation and
comments do not mix. As soon as 'awk' sees the '#' that starts a
comment, it ignores _everything_ on the rest of the line. For example:

     $ gawk 'BEGIN { print "dont panic" # a friendly \
     >                                    BEGIN rule
     > }'
     error-> gawk: cmd. line:2:                BEGIN rule
     error-> gawk: cmd. line:2:                ^ syntax error

In this case, it looks like the backslash would continue the comment
onto the next line. However, the backslash-newline combination is never
even noticed because it is "hidden" inside the comment. Thus, the
'BEGIN' is noted as a syntax error.

When 'awk' statements within one rule are short, you might want to
put more than one of them on a line. This is accomplished by separating
the statements with a semicolon (';'). This also applies to the rules
themselves. Thus, the program shown at the start of this minor node
could also be written this way:

     /12/ { print $0 } ; /21/ { print $0 }

     NOTE: The requirement that states that rules on the same line must
     be separated with a semicolon was not in the original 'awk'
     language; it was added for consistency with the treatment of
     statements within an action.

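The following sketch, with made-up input and an illustrative 'result'
shell variable, uses a semicolon in both roles: between two statements
inside an action, and between two rules written on one line:

```shell
# ';' separates the two statements in the first action, and also
# separates the first rule from the END rule on the same line.
result=$(printf '%s\n' 'apple 3' 'pear 5' 'apple 4' |
    awk '/apple/ { n++; total += $2 } ; END { print n, total }')
printf '%s\n' "$result"    # prints: 2 7
```
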
---------- Footnotes ----------

(1) The '?' and ':' referred to here are the two parts of the
three-operand conditional expression described in *note Conditional
Exp::. Splitting lines after '?' and ':' is a minor 'gawk' extension;
if '--posix' is specified (*note Options::), then this extension is
disabled.


File: gawk.info, Node: Other Features, Next: When, Prev: Statements/Lines, Up: Getting Started

1.7 Other Features of 'awk'
===========================

The 'awk' language provides a number of predefined, or "built-in",
variables that your programs can use to get information from 'awk'.
There are other variables your program can set as well to control how
'awk' processes your data.

In addition, 'awk' provides a number of built-in functions for doing
common computational and string-related operations. 'gawk' provides
built-in functions for working with timestamps, performing bit
manipulation, for runtime string translation (internationalization),
determining the type of a variable, and array sorting.

As we develop our presentation of the 'awk' language, we will
introduce most of the variables and many of the functions. They are
described systematically in *note Built-in Variables::, and in *note
Built-in::.

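As a small preview, the sketch below uses two built-in variables ('NR',
the record number, and 'NF', the number of fields) and two built-in
functions ('length()' and 'toupper()') that are available in POSIX
'awk'. The input text and the 'result' shell variable are our own
illustrative choices:

```shell
# For each record print: record number, field count, record length,
# and the first field converted to uppercase.
result=$(printf '%s\n' 'hello world' 'one two three' |
    awk '{ print NR, NF, length($0), toupper($1) }')
printf '%s\n' "$result"
# prints:
# 1 2 11 HELLO
# 2 3 13 ONE
```
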


File: gawk.info, Node: When, Next: Intro Summary, Prev: Other Features, Up: Getting Started

1.8 When to Use 'awk'
=====================

Now that you've seen some of what 'awk' can do, you might wonder how
'awk' could be useful for you. By using utility programs, advanced
patterns, field separators, arithmetic statements, and other selection
criteria, you can produce much more complex output. The 'awk' language
is very useful for producing reports from large amounts of raw data,
such as summarizing information from the output of other utility
programs like 'ls'. (*Note More Complex::.)

Programs written with 'awk' are usually much smaller than they would
be in other languages. This makes 'awk' programs easy to compose and
use. Often, 'awk' programs can be quickly composed at your keyboard,
used once, and thrown away. Because 'awk' programs are interpreted, you
can avoid the (usually lengthy) compilation part of the typical
edit-compile-test-debug cycle of software development.

Complex programs have been written in 'awk', including a complete
retargetable assembler for eight-bit microprocessors (*note Glossary::,
for more information), and a microcode assembler for a special-purpose
Prolog computer. The original 'awk''s capabilities were strained by
tasks of such complexity, but modern versions are more capable.

If you find yourself writing 'awk' scripts of more than, say, a few
hundred lines, you might consider using a different programming
language. The shell is good at string and pattern matching; in
addition, it allows powerful use of the system utilities. Python offers
a nice balance between high-level ease of programming and access to
system facilities.(1)

---------- Footnotes ----------

(1) Other popular scripting languages include Ruby and Perl.


File: gawk.info, Node: Intro Summary, Prev: When, Up: Getting Started

1.9 Summary
===========

   * Programs in 'awk' consist of PATTERN-ACTION pairs.

   * An ACTION without a PATTERN always runs. The default ACTION for a
     pattern without one is '{ print $0 }'.

   * Use either 'awk 'PROGRAM' FILES' or 'awk -f PROGRAM-FILE FILES' to
     run 'awk'.

   * You may use the special '#!' header line to create 'awk' programs
     that are directly executable.

   * Comments in 'awk' programs start with '#' and continue to the end
     of the same line.

   * Be aware of quoting issues when writing 'awk' programs as part of a
     larger shell script (or MS-Windows batch file).

   * You may use backslash continuation to continue a source line.
     Lines are automatically continued after a comma, open brace,
     question mark, colon, '||', '&&', 'do', and 'else'.


File: gawk.info, Node: Invoking Gawk, Next: Regexp, Prev: Getting Started, Up: Top

2 Running 'awk' and 'gawk'
**************************

This major node covers how to run 'awk', both POSIX-standard and
'gawk'-specific command-line options, and what 'awk' and 'gawk' do with
nonoption arguments. It then proceeds to cover how 'gawk' searches for
source files, reading standard input along with other files, 'gawk''s
environment variables, 'gawk''s exit status, using include files, and
obsolete and undocumented options and/or features.

Many of the options and features described here are discussed in more
detail later in the Info file; feel free to skip over things in this
major node that don't interest you right now.

* Menu:

* Command Line::                How to run 'awk'.
* Options::                     Command-line options and their meanings.
* Other Arguments::             Input file names and variable assignments.
* Naming Standard Input::       How to specify standard input with other
                                files.
* Environment Variables::       The environment variables 'gawk' uses.
* Exit Status::                 'gawk''s exit status.
* Include Files::               Including other files into your program.
* Loading Shared Libraries::    Loading shared libraries into your program.
* Obsolete::                    Obsolete Options and/or features.
* Undocumented::                Undocumented Options and Features.
* Invoking Summary::            Invocation summary.


File: gawk.info, Node: Command Line, Next: Options, Up: Invoking Gawk

2.1 Invoking 'awk'
==================

There are two ways to run 'awk'--with an explicit program or with one or
more program files. Here are templates for both of them; items enclosed
in [...] in these templates are optional:

     'awk' [OPTIONS] '-f' PROGFILE ['--'] FILE ...
     'awk' [OPTIONS] ['--'] ''PROGRAM'' FILE ...

In addition to traditional one-letter POSIX-style options, 'gawk'
also supports GNU long options.

It is possible to invoke 'awk' with an empty program:

     awk '' datafile1 datafile2

Doing so makes little sense, though; 'awk' exits silently when given an
empty program. (d.c.) If '--lint' has been specified on the command
line, 'gawk' issues a warning that the program is empty.

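The two templates can be tried side by side. In this sketch the file
names 'data.txt' and 'swap.awk', the scratch directory, and the shell
variable names are all hypothetical; the same one-rule program is run
once from a program file and once as command-line text:

```shell
# Create a scratch directory holding a data file and a program file.
tmp=$(mktemp -d)
printf '%s\n' 'red 1' 'green 2' > "$tmp/data.txt"
printf '%s\n' '{ print $2, $1 }' > "$tmp/swap.awk"

# First template: the program comes from a file named with -f.
from_file=$(awk -f "$tmp/swap.awk" "$tmp/data.txt")

# Second template: the program is the first nonoption argument.
inline=$(awk '{ print $2, $1 }' "$tmp/data.txt")

printf '%s\n' "$from_file"
# prints:
# 1 red
# 2 green
```

Both invocations should produce the same output, because only the way
the program text reaches 'awk' differs.
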


File: gawk.info, Node: Options, Next: Other Arguments, Prev: Command Line, Up: Invoking Gawk

2.2 Command-Line Options
========================

Options begin with a dash and consist of a single character. GNU-style
long options consist of two dashes and a keyword. The keyword can be
abbreviated, as long as the abbreviation allows the option to be
uniquely identified. If the option takes an argument, either the
keyword is immediately followed by an equals sign ('=') and the
argument's value, or the keyword and the argument's value are separated
by whitespace. If a particular option with a value is given more than
once, it is the last value that counts.

Each long option for 'gawk' has a corresponding POSIX-style short
option. The long and short options are interchangeable in all contexts.
The following list describes options mandated by the POSIX standard:

'-F FS'
'--field-separator FS'
     Set the 'FS' variable to FS (*note Field Separators::).

'-f SOURCE-FILE'
'--file SOURCE-FILE'
     Read the 'awk' program source from SOURCE-FILE instead of in the
     first nonoption argument. This option may be given multiple times;
     the 'awk' program consists of the concatenation of the contents of
     each specified SOURCE-FILE.

'-v VAR=VAL'
'--assign VAR=VAL'
     Set the variable VAR to the value VAL _before_ execution of the
     program begins. Such variable values are available inside the
     'BEGIN' rule (*note Other Arguments::).

     The '-v' option can only set one variable, but it can be used more
     than once, setting another variable each time, like this: 'awk -v foo=1
     -v bar=2 ...'.

          CAUTION: Using '-v' to set the values of the built-in
          variables may lead to surprising results. 'awk' will reset
          the values of those variables as it needs to, possibly
          ignoring any initial value you may have given.

'-W GAWK-OPT'
     Provide an implementation-specific option. This is the POSIX
     convention for providing implementation-specific options. These
     options also have corresponding GNU-style long options. Note that
     the long options may be abbreviated, as long as the abbreviations
     remain unique. The full list of 'gawk'-specific options is
     provided next.

'--'
     Signal the end of the command-line options. The following
     arguments are not treated as options even if they begin with '-'.
     This interpretation of '--' follows the POSIX argument parsing
     conventions.

     This is useful if you have file names that start with '-', or in
     shell scripts, if you have file names that will be specified by the
     user that could start with '-'. It is also useful for passing
     options on to the 'awk' program; see *note Getopt Function::.

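The POSIX options above combine naturally. The sketch below is our own
illustration: the colon-separated data, the file name '-users.txt'
(chosen deliberately to start with a dash), and the variable names 'min'
and 'greeting' are all invented for the example:

```shell
# Work in a scratch directory so the oddly named file is easy to find.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf '%s\n' 'root:x:0' 'alice:x:1000' 'bob:x:1001' > -users.txt

# -F: sets FS to a colon, -v sets 'min' before the program starts, and
# '--' ends option processing so that '-users.txt' is taken as a file.
result=$(awk -F: -v min=1000 -- '$3 >= min { print $1 }' -users.txt)
printf '%s\n' "$result"
# prints:
# alice
# bob

# A value set with -v is already visible inside a BEGIN rule.
greeting=$(awk -v greeting=hi 'BEGIN { print greeting }')
printf '%s\n' "$greeting"    # prints: hi
```
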
The following list describes 'gawk'-specific options:

'-b'
'--characters-as-bytes'
     Cause 'gawk' to treat all input data as single-byte characters. In
     addition, all output written with 'print' or 'printf' is treated as
     single-byte characters.

     Normally, 'gawk' follows the POSIX standard and attempts to process
     its input data according to the current locale (*note Locales::).
     This can often involve converting multibyte characters into wide
     characters (internally), and can lead to problems or confusion if
     the input data does not contain valid multibyte characters. This
     option is an easy way to tell 'gawk', "Hands off my data!"

'-c'
'--traditional'
     Specify "compatibility mode", in which the GNU extensions to the
     'awk' language are disabled, so that 'gawk' behaves just like BWK
     'awk'. *Note POSIX/GNU::, which summarizes the extensions. Also
     see *note Compatibility Mode::.

'-C'
'--copyright'
     Print the short version of the General Public License and then
     exit.

'-d'[FILE]
'--dump-variables'['='FILE]
     Print a sorted list of global variables, their types, and final
     values to FILE. If no FILE is provided, print this list to a file
     named 'awkvars.out' in the current directory. No space is allowed
     between the '-d' and FILE, if FILE is supplied.

     Having a list of all global variables is a good way to look for
     typographical errors in your programs. You would also use this
     option if you have a large program with a lot of functions, and you
     want to be sure that your functions don't inadvertently use global
     variables that you meant to be local. (This is a particularly easy
     mistake to make with simple variable names like 'i', 'j', etc.)

'-D'[FILE]
'--debug'['='FILE]
     Enable debugging of 'awk' programs (*note Debugging::). By
     default, the debugger reads commands interactively from the
     keyboard (standard input). The optional FILE argument allows you
     to specify a file with a list of commands for the debugger to
     execute noninteractively. No space is allowed between the '-D' and
     FILE, if FILE is supplied.

'-e' PROGRAM-TEXT
'--source' PROGRAM-TEXT
     Provide program source code in the PROGRAM-TEXT. This option
     allows you to mix source code in files with source code that you
     enter on the command line. This is particularly useful when you
     have library functions that you want to use from your command-line
     programs (*note AWKPATH Variable::).

'-E' FILE
'--exec' FILE
     Similar to '-f', read 'awk' program text from FILE. There are two
     differences from '-f':

        * This option terminates option processing; anything else on the
          command line is passed on directly to the 'awk' program.

        * Command-line variable assignments of the form 'VAR=VALUE' are
          disallowed.

     This option is particularly necessary for World Wide Web CGI
     applications that pass arguments through the URL; using this option
     prevents a malicious (or other) user from passing in options,
     assignments, or 'awk' source code (via '-e') to the CGI
     application.(1) This option should be used with '#!' scripts
     (*note Executable Scripts::), like so:

          #! /usr/local/bin/gawk -E

          AWK PROGRAM HERE ...

'-g'
'--gen-pot'
     Analyze the source program and generate a GNU 'gettext' portable
     object template file on standard output for all string constants
     that have been marked for translation. *Note
     Internationalization::, for information about this option.

'-h'
'--help'
     Print a "usage" message summarizing the short- and long-style
     options that 'gawk' accepts and then exit.

'-i' SOURCE-FILE
'--include' SOURCE-FILE
     Read an 'awk' source library from SOURCE-FILE. This option is
     completely equivalent to using the '@include' directive inside your
     program. It is very similar to the '-f' option, but there are two
     important differences. First, when '-i' is used, the program
     source is not loaded if it has been previously loaded, whereas with
     '-f', 'gawk' always loads the file. Second, because this option is
     intended to be used with code libraries, 'gawk' does not recognize
     such files as constituting main program input. Thus, after
     processing an '-i' argument, 'gawk' still expects to find the main
     source code via the '-f' option or on the command line.

'-l' EXT
'--load' EXT
     Load a dynamic extension named EXT. Extensions are stored as
     system shared libraries. This option searches for the library
     using the 'AWKLIBPATH' environment variable. The correct library
     suffix for your platform will be supplied by default, so it need
     not be specified in the extension name. The extension
     initialization routine should be named 'dl_load()'. An alternative
     is to use the '@load' keyword inside the program to load a shared
     library. This advanced feature is described in detail in *note
     Dynamic Extensions::.

'-L'[VALUE]
'--lint'['='VALUE]
     Warn about constructs that are dubious or nonportable to other
     'awk' implementations. No space is allowed between the '-L' and
     VALUE, if VALUE is supplied. Some warnings are issued when 'gawk'
     first reads your program. Others are issued at runtime, as your
     program executes. With an optional argument of 'fatal', lint
     warnings become fatal errors. This may be drastic, but its use
     will certainly encourage the development of cleaner 'awk' programs.
     With an optional argument of 'invalid', only warnings about things
     that are actually invalid are issued. (This is not fully
     implemented yet.)

     Some warnings are only printed once, even if the dubious constructs
     they warn about occur multiple times in your 'awk' program. Thus,
     when eliminating problems pointed out by '--lint', you should take
     care to search for all occurrences of each inappropriate construct.
     As 'awk' programs are usually short, doing so is not burdensome.

'-M'
'--bignum'
     Force arbitrary-precision arithmetic on numbers. This option has
     no effect if 'gawk' is not compiled to use the GNU MPFR and MP
     libraries (*note Arbitrary Precision Arithmetic::).

'-n'
'--non-decimal-data'
     Enable automatic interpretation of octal and hexadecimal values in
     input data (*note Nondecimal Data::).

          CAUTION: This option can severely break old programs. Use
          with care. Also note that this option may disappear in a
          future version of 'gawk'.

'-N'
'--use-lc-numeric'
     Force the use of the locale's decimal point character when parsing
     numeric input data (*note Locales::).

'-o'[FILE]
'--pretty-print'['='FILE]
     Enable pretty-printing of 'awk' programs. By default, the output
     program is created in a file named 'awkprof.out' (*note
     Profiling::). The optional FILE argument allows you to specify a
     different file name for the output. No space is allowed between
     the '-o' and FILE, if FILE is supplied.

          NOTE: Due to the way 'gawk' has evolved, with this option your
          program still executes. This will change in the next major
          release, such that 'gawk' will only pretty-print the program
          and not run it.

'-O'
'--optimize'
     Enable some optimizations on the internal representation of the
     program. At the moment, this includes just simple constant
     folding.

'-p'[FILE]
'--profile'['='FILE]
     Enable profiling of 'awk' programs (*note Profiling::). By
     default, profiles are created in a file named 'awkprof.out'. The
     optional FILE argument allows you to specify a different file name
     for the profile file. No space is allowed between the '-p' and
     FILE, if FILE is supplied.

     The profile contains execution counts for each statement in the
     program in the left margin, and function call counts for each
     function.

'-P'
'--posix'
     Operate in strict POSIX mode. This disables all 'gawk' extensions
     (just like '--traditional') and disables all extensions not allowed
     by POSIX. *Note Common Extensions::, for a summary of the
     extensions in 'gawk' that are disabled by this option. Also, the
     following additional restrictions apply:

        * Newlines do not act as whitespace to separate fields when 'FS'
          is equal to a single space (*note Fields::).

        * Newlines are not allowed after '?' or ':' (*note Conditional
          Exp::).

        * Specifying '-Ft' on the command line does not set the value of
          'FS' to be a single TAB character (*note Field Separators::).

        * The locale's decimal point character is used for parsing input
          data (*note Locales::).

     If you supply both '--traditional' and '--posix' on the command
     line, '--posix' takes precedence. 'gawk' issues a warning if both
     options are supplied.

'-r'
'--re-interval'
     Allow interval expressions (*note Regexp Operators::) in regexps.
     This is now 'gawk''s default behavior. Nevertheless, this option
     remains (both for backward compatibility and for use in combination
     with '--traditional').

'-S'
'--sandbox'
     Disable the 'system()' function, input redirections with 'getline',
     output redirections with 'print' and 'printf', and dynamic
     extensions. This is particularly useful when you want to run 'awk'
     scripts from questionable sources and need to make sure the scripts
     can't access your system (other than the specified input data
     file).

'-t'
'--lint-old'
     Warn about constructs that are not available in the original
     version of 'awk' from Version 7 Unix (*note V7/SVR3.1::).

'-V'
'--version'
     Print version information for this particular copy of 'gawk'. This
     allows you to determine if your copy of 'gawk' is up to date with
     respect to whatever the Free Software Foundation is currently
     distributing. It is also useful for bug reports (*note Bugs::).

As long as program text has been supplied, any other options are
flagged as invalid with a warning message but are otherwise ignored.

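As one illustration of the 'gawk'-specific options listed above, the
following sketch uses '--dump-variables' to write the final values of
the global variables to a file and then looks for a user variable in
that file. It assumes that 'gawk' is installed (the script skips itself
otherwise); the scratch directory, the file name 'vars.out', and the
variable name 'total' are hypothetical:

```shell
if command -v gawk >/dev/null 2>&1; then
    tmp=$(mktemp -d)
    # The program sums its input; gawk also writes the variable dump
    # to the file named after '--dump-variables='.
    printf '%s\n' 1 2 3 |
        gawk --dump-variables="$tmp/vars.out" \
             '{ total += $1 } END { print total }'
    # Show the dump entry for our own variable.
    grep '^total' "$tmp/vars.out"
else
    echo 'gawk not found; skipping this example'
fi
```

A misspelled variable name (say, 'totl') would show up in the dump as a
separate entry, which is the kind of typographical error the option is
meant to expose.
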
In compatibility mode, as a special case, if the value of FS supplied
to the '-F' option is 't', then 'FS' is set to the TAB character
('"\t"'). This is true only for '--traditional' and not for '--posix'
(*note Field Separators::).

The '-f' option may be used more than once on the command line. If
it is, 'awk' reads its program source from all of the named files, as if
they had been concatenated together into one big file. This is useful
for creating libraries of 'awk' functions. These functions can be
written once and then retrieved from a standard place, instead of having
to be included in each individual program. The '-i' option is similar
in this regard. (As mentioned in *note Definition Syntax::, function
names must be unique.)

With standard 'awk', library functions can still be used, even if the
program is entered at the keyboard, by specifying '-f /dev/tty'. After
typing your program, type 'Ctrl-d' (the end-of-file character) to
terminate it. (You may also use '-f -' to read program source from the
standard input, but then you will not be able to also use the standard
input as a source of data.)

Because it is clumsy using the standard 'awk' mechanisms to mix
source file and command-line 'awk' programs, 'gawk' provides the '-e'
option. This does not require you to preempt the standard input for
your source code; it allows you to easily mix command-line and library
source code (*note AWKPATH Variable::). As with '-f', the '-e' and '-i'
options may also be used multiple times on the command line.

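Here is a minimal sketch of mixing a library file with command-line
source. It assumes 'gawk' is installed (the script skips itself
otherwise); the library file 'lib.awk', its function 'double()', and the
shell variable names are invented for the example:

```shell
if command -v gawk >/dev/null 2>&1; then
    tmp=$(mktemp -d)
    # A one-function "library" kept in its own source file.
    printf '%s\n' 'function double(x) { return 2 * x }' > "$tmp/lib.awk"

    # -f loads the library; -e supplies the main program as text.
    result=$(gawk -f "$tmp/lib.awk" -e 'BEGIN { print double(21) }')
    printf '%s\n' "$result"    # prints: 42
else
    echo 'gawk not found; skipping this example'
fi
```
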
If no '-f' or '-e' option is specified, then 'gawk' uses the first
nonoption command-line argument as the text of the program source code.

If the environment variable 'POSIXLY_CORRECT' exists, then 'gawk'
behaves in strict POSIX mode, exactly as if you had supplied '--posix'.
Many GNU programs look for this environment variable to suppress
extensions that conflict with POSIX, but 'gawk' behaves differently: it
suppresses all extensions, even those that do not conflict with POSIX,
and behaves in strict POSIX mode. If '--lint' is supplied on the
command line and 'gawk' turns on POSIX mode because of
'POSIXLY_CORRECT', then it issues a warning message indicating that
POSIX mode is in effect. You would typically set this variable in your
shell's startup file. For a Bourne-compatible shell (such as Bash), you
would add these lines to the '.profile' file in your home directory:

     POSIXLY_CORRECT=true
     export POSIXLY_CORRECT

For a C shell-compatible shell,(2) you would add this line to the
'.login' file in your home directory:

     setenv POSIXLY_CORRECT true

Having 'POSIXLY_CORRECT' set is not recommended for daily use, but it
is good for testing the portability of your programs to other
environments.

---------- Footnotes ----------

(1) For more detail, please see Section 4.4 of RFC 3875
(http://www.ietf.org/rfc/rfc3875). Also see the explanatory note sent
to the 'gawk' bug mailing list
(http://lists.gnu.org/archive/html/bug-gawk/2014-11/msg00022.html).

(2) Not recommended.


File: gawk.info, Node: Other Arguments, Next: Naming Standard Input, Prev: Options, Up: Invoking Gawk

2.3 Other Command-Line Arguments
================================

Any additional arguments on the command line are normally treated as
input files to be processed in the order specified. However, an
argument that has the form 'VAR=VALUE', assigns the value VALUE to the
variable VAR--it does not specify a file at all. (See *note Assignment
Options::.) In the following example, COUNT=1 is a variable assignment,
not a file name:

     awk -f program.awk file1 count=1 file2

All the command-line arguments are made available to your 'awk'
program in the 'ARGV' array (*note Built-in Variables::). Command-line
options and the program text (if present) are omitted from 'ARGV'. All
other arguments, including variable assignments, are included. As each
element of 'ARGV' is processed, 'gawk' sets 'ARGIND' to the index in
'ARGV' of the current element.

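To see what ends up in 'ARGV', the sketch below prints the array from a
'BEGIN' rule. Because the program consists only of a 'BEGIN' rule, no
input is read, so 'file1' and 'file2' need not exist. The argument names
follow the example above; the 'result' shell variable is ours:

```shell
# Print ARGV[1] through ARGV[ARGC-1]; the option-free program text is
# not part of ARGV, but the variable assignment 'count=1' is.
result=$(awk 'BEGIN { for (i = 1; i < ARGC; i++) print i, ARGV[i] }' \
    file1 count=1 file2)
printf '%s\n' "$result"
# prints:
# 1 file1
# 2 count=1
# 3 file2
```
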
Changing 'ARGC' and 'ARGV' in your 'awk' program lets you control how
'awk' processes the input files; this is described in more detail in
*note ARGC and ARGV::.

The distinction between file name arguments and variable-assignment
arguments is made when 'awk' is about to open the next input file. At
that point in execution, it checks the file name to see whether it is
really a variable assignment; if so, 'awk' sets the variable instead of
reading a file.

Therefore, the variables actually receive the given values after all
previously specified files have been read. In particular, the values of
variables assigned in this fashion are _not_ available inside a 'BEGIN'
rule (*note BEGIN/END::), because such rules are run before 'awk' begins
scanning the argument list.

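The difference in timing is easy to observe. In the following sketch
(the variable name 'n', the one-line input file, and the shell variable
names 'late' and 'early' are illustrative), the same variable is set
once as a command-line argument and once with '-v':

```shell
tmp=$(mktemp -d)
printf 'x\n' > "$tmp/one-line.txt"

# Assignment given as an argument: processed only when awk reaches it
# in the argument list, which is after the BEGIN rule has run.
late=$(awk 'BEGIN { print "BEGIN sees [" n "]" }
                  { print "main sees [" n "]" }' n=5 "$tmp/one-line.txt")
printf '%s\n' "$late"
# prints:
# BEGIN sees []
# main sees [5]

# Assignment given with -v: made before the program starts.
early=$(awk -v n=5 'BEGIN { print "BEGIN sees [" n "]" }')
printf '%s\n' "$early"    # prints: BEGIN sees [5]
```
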
The variable values given on the command line are processed for
escape sequences (*note Escape Sequences::). (d.c.)

In some very early implementations of 'awk', when a variable
assignment occurred before any file names, the assignment would happen
_before_ the 'BEGIN' rule was executed. 'awk''s behavior was thus
inconsistent; some command-line assignments were available inside the
'BEGIN' rule, while others were not. Unfortunately, some applications
came to depend upon this "feature." When 'awk' was changed to be more
consistent, the '-v' option was added to accommodate applications that
depended upon the old behavior.

The variable assignment feature is most useful for assigning to
variables such as 'RS', 'OFS', and 'ORS', which control input and output
formats, before scanning the data files. It is also useful for
controlling state if multiple passes are needed over a data file. For
example:

     awk 'pass == 1 { PASS 1 STUFF }
          pass == 2 { PASS 2 STUFF }' pass=1 mydata pass=2 mydata

Given the variable assignment feature, the '-F' option for setting
the value of 'FS' is not strictly necessary. It remains for historical
compatibility.

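The two-pass template above can be filled in as follows. This is only a
sketch: the data, the scratch file 'mydata', and the choice of computing
each value's share of the total are our own. The first pass accumulates
a total; the second pass, over the same file, uses it:

```shell
tmp=$(mktemp -d)
printf '%s\n' 'a 10' 'b 30' 'c 60' > "$tmp/mydata"

# 'pass' is set to 1 before the first reading of the file and to 2
# before the second, so each rule runs during exactly one pass.
result=$(awk 'pass == 1 { total += $2 }
              pass == 2 { printf "%s %d%%\n", $1, 100 * $2 / total }' \
    pass=1 "$tmp/mydata" pass=2 "$tmp/mydata")
printf '%s\n' "$result"
# prints:
# a 10%
# b 30%
# c 60%
```
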


File: gawk.info, Node: Naming Standard Input, Next: Environment Variables, Prev: Other Arguments, Up: Invoking Gawk

2.4 Naming Standard Input
=========================

Often, you may wish to read standard input together with other files.
For example, you may wish to read one file, read standard input coming
from a pipe, and then read another file.

The way to name the standard input, with all versions of 'awk', is to
use a single, standalone minus sign or dash, '-'. For example:

     SOME_COMMAND | awk -f myprog.awk file1 - file2

Here, 'awk' first reads 'file1', then it reads the output of
SOME_COMMAND, and finally it reads 'file2'.

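A runnable version of this arrangement might look like the sketch below,
where 'printf' stands in for SOME_COMMAND and the two scratch files play
the roles of 'file1' and 'file2' (all names are illustrative). Numbering
the records with 'NR' makes the reading order visible:

```shell
tmp=$(mktemp -d)
printf 'first\n' > "$tmp/file1"
printf 'last\n' > "$tmp/file2"

# The standalone '-' tells awk to read standard input at that point,
# between the two named files.
result=$(printf 'middle\n' |
    awk '{ print NR ": " $0 }' "$tmp/file1" - "$tmp/file2")
printf '%s\n' "$result"
# prints:
# 1: first
# 2: middle
# 3: last
```
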
|
||
|
||
You may also use '"-"' to name standard input when reading files with
|
||
'getline' (*note Getline/File::).
|
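Both uses of '-' described above can be tried with a short sketch. The file names and contents are invented for the example, and record numbers are printed only to make the reading order visible:

```shell
dir=$(mktemp -d)
printf 'first\n' > "$dir/file1"
printf 'last\n'  > "$dir/file2"

# '-' as a command-line operand: standard input is read between the files.
ordered=$(printf 'piped\n' | awk '{ print NR ": " $0 }' "$dir/file1" - "$dir/file2")
printf '%s\n' "$ordered"

# "-" as a getline source: read standard input from inside a BEGIN rule.
viaget=$(printf 'hello\n' | awk 'BEGIN {
    while ((getline line < "-") > 0)
        print "got: " line
}')
printf '%s\n' "$viaget"

rm -rf "$dir"
```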
||
|
||
In addition, 'gawk' allows you to specify the special file name
|
||
'/dev/stdin', both on the command line and with 'getline'. Some other
|
||
versions of 'awk' also support this, but it is not standard. (Some
|
||
operating systems provide a '/dev/stdin' file in the filesystem;
|
||
however, 'gawk' always processes this file name itself.)
|
||
|
||
|
||
File: gawk.info, Node: Environment Variables, Next: Exit Status, Prev: Naming Standard Input, Up: Invoking Gawk
|
||
|
||
2.5 The Environment Variables 'gawk' Uses
|
||
=========================================
|
||
|
||
A number of environment variables influence how 'gawk' behaves.
|
||
|
||
* Menu:
|
||
|
||
* AWKPATH Variable:: Searching directories for 'awk'
|
||
programs.
|
||
* AWKLIBPATH Variable:: Searching directories for 'awk' shared
|
||
libraries.
|
||
* Other Environment Variables:: The environment variables.
|
||
|
||
|
||
File: gawk.info, Node: AWKPATH Variable, Next: AWKLIBPATH Variable, Up: Environment Variables
|
||
|
||
2.5.1 The 'AWKPATH' Environment Variable
|
||
----------------------------------------
|
||
|
||
The previous minor node described how 'awk' program files can be named
|
||
on the command line with the '-f' option. In most 'awk'
|
||
implementations, you must supply a precise pathname for each program
|
||
file, unless the file is in the current directory. But with 'gawk', if
|
||
the file name supplied to the '-f' or '-i' options does not contain a
|
||
directory separator '/', then 'gawk' searches a list of directories
|
||
(called the "search path") one by one, looking for a file with the
|
||
specified name.
|
||
|
||
The search path is a string consisting of directory names separated
|
||
by colons.(1) 'gawk' gets its search path from the 'AWKPATH'
|
||
environment variable. If that variable does not exist, or if it has an
|
||
empty value, 'gawk' uses a default path (described shortly).
|
||
|
||
The search path feature is particularly helpful for building
|
||
libraries of useful 'awk' functions. The library files can be placed in
|
||
a standard directory in the default path and then specified on the
|
||
command line with a short file name. Otherwise, you would have to type
|
||
the full file name for each file.
|
||
|
||
By using the '-i' or '-f' options, your command-line 'awk' programs
|
||
can use facilities in 'awk' library files (*note Library Functions::).
|
||
Path searching is not done if 'gawk' is in compatibility mode. This is
|
||
true for both '--traditional' and '--posix'. *Note Options::.
|
||
|
||
If the source code file is not found after the initial search, the
|
||
path is searched again after adding the suffix '.awk' to the file name.
|
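As a rough illustration of the search just described, the following sketch points 'AWKPATH' at a scratch directory and names a program file without its directory or suffix. The directory and the file 'hello.awk' are hypothetical, and the example assumes that 'gawk' is installed and is not running in compatibility mode:

```shell
if command -v gawk >/dev/null 2>&1; then
    lib=$(mktemp -d)
    # A library file stored with the conventional '.awk' suffix.
    printf 'BEGIN { print "hello from the library directory" }\n' > "$lib/hello.awk"

    # 'hello' contains no '/', so gawk looks in each AWKPATH directory;
    # when the plain name is not found, it retries with '.awk' appended.
    found=$(AWKPATH="$lib" gawk -f hello)
    printf '%s\n' "$found"

    rm -rf "$lib"
fi
```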
||
|
||
'gawk''s path search mechanism is similar to the shell's. (See 'The
|
||
Bourne-Again SHell manual' (http://www.gnu.org/software/bash/manual/).)
|
||
It treats a null entry in the path as indicating the current directory.
|
||
(A null entry is indicated by starting or ending the path with a colon
|
||
or by placing two colons next to each other ['::'].)
|
||
|
||
NOTE: To include the current directory in the path, either place
|
||
'.' as an entry in the path or write a null entry in the path.
|
||
|
||
Different past versions of 'gawk' would also look explicitly in the
|
||
current directory, either before or after the path search. As of
|
||
version 4.1.2, this no longer happens; if you wish to look in the
|
||
current directory, you must include '.' either as a separate entry
|
||
or as a null entry in the search path.
|
||
|
||
The default value for 'AWKPATH' is '.:/usr/local/share/awk'.(2)
|
||
Since '.' is included at the beginning, 'gawk' searches first in the
|
||
current directory and then in '/usr/local/share/awk'. In practice, this
|
||
means that you will rarely need to change the value of 'AWKPATH'.
|
||
|
||
'gawk' places the value of the search path that it used into
|
||
'ENVIRON["AWKPATH"]'. This provides access to the actual search path
|
||
value from within an 'awk' program.
|
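A quick way to inspect the path in use is to print the array element from a 'BEGIN' rule. In this sketch '/opt/awklib' is a made-up directory that does not need to exist, and 'gawk' is assumed to be installed:

```shell
if command -v gawk >/dev/null 2>&1; then
    # /opt/awklib is a hypothetical directory used only for illustration.
    seen=$(AWKPATH="/opt/awklib" gawk 'BEGIN { print ENVIRON["AWKPATH"] }')
    printf '%s\n' "$seen"
fi
```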
||
|
||
Although you can change 'ENVIRON["AWKPATH"]' within your 'awk'
|
||
program, this has no effect on the running program's behavior. This
|
||
makes sense: the 'AWKPATH' environment variable is used to find the
|
||
program source files. Once your program is running, all the files have
|
||
been found, and 'gawk' no longer needs to use 'AWKPATH'.
|
||
|
||
---------- Footnotes ----------
|
||
|
||
(1) Semicolons on MS-Windows and MS-DOS.
|
||
|
||
(2) Your version of 'gawk' may use a different directory; it will
|
||
depend upon how 'gawk' was built and installed. The actual directory is
|
||
the value of '$(datadir)' generated when 'gawk' was configured. You
|
||
probably don't need to worry about this, though.
|
||
|
||
|
||
File: gawk.info, Node: AWKLIBPATH Variable, Next: Other Environment Variables, Prev: AWKPATH Variable, Up: Environment Variables
|
||
|
||
2.5.2 The 'AWKLIBPATH' Environment Variable
|
||
-------------------------------------------
|
||
|
||
The 'AWKLIBPATH' environment variable is similar to the 'AWKPATH'
|
||
variable, but it is used to search for loadable extensions (stored as
|
||
system shared libraries) specified with the '-l' option rather than for
|
||
source files. If the extension is not found, the path is searched again
|
||
after adding the appropriate shared library suffix for the platform.
|
||
For example, on GNU/Linux systems, the suffix '.so' is used. The search
|
||
path specified is also used for extensions loaded via the '@load'
|
||
keyword (*note Loading Shared Libraries::).
|
||
|
||
If 'AWKLIBPATH' does not exist in the environment, or if it has an
|
||
empty value, 'gawk' uses a default path; this is typically
|
||
'/usr/local/lib/gawk', although it can vary depending upon how 'gawk'
|
||
was built.
|
||
|
||
'gawk' places the value of the search path that it used into
|
||
'ENVIRON["AWKLIBPATH"]'. This provides access to the actual search path
|
||
value from within an 'awk' program.
|
||
|
||
|
||
File: gawk.info, Node: Other Environment Variables, Prev: AWKLIBPATH Variable, Up: Environment Variables
|
||
|
||
2.5.3 Other Environment Variables
|
||
---------------------------------
|
||
|
||
A number of other environment variables affect 'gawk''s behavior, but
|
||
they are more specialized. Those in the following list are meant to be
|
||
used by regular users:
|
||
|
||
'GAWK_MSEC_SLEEP'
|
||
Specifies the interval between connection retries, in milliseconds.
|
||
On systems that do not support the 'usleep()' system call, the
|
||
value is rounded up to an integral number of seconds.
|
||
|
||
'GAWK_READ_TIMEOUT'
|
||
Specifies the time, in milliseconds, for 'gawk' to wait for input
|
||
before returning with an error. *Note Read Timeout::.
|
||
|
||
'GAWK_SOCK_RETRIES'
|
||
Controls the number of times 'gawk' attempts to retry a two-way
|
||
TCP/IP (socket) connection before giving up. *Note TCP/IP
|
||
Networking::.
|
||
|
||
'POSIXLY_CORRECT'
|
||
Causes 'gawk' to switch to POSIX-compatibility mode, disabling all
|
||
traditional and GNU extensions. *Note Options::.
|
||
|
||
The environment variables in the following list are meant for use by
|
||
the 'gawk' developers for testing and tuning. They are subject to
|
||
change. The variables are:
|
||
|
||
'AWKBUFSIZE'
|
||
This variable only affects 'gawk' on POSIX-compliant systems. With
|
||
a value of 'exact', 'gawk' uses the size of each input file as the
|
||
size of the memory buffer to allocate for I/O. Otherwise, the value
|
||
should be a number, and 'gawk' uses that number as the size of the
|
||
buffer to allocate. (When this variable is not set, 'gawk' uses
|
||
the smaller of the file's size and the "default" blocksize, which
|
||
is usually the filesystem's I/O blocksize.)
|
||
|
||
'AWK_HASH'
|
||
If this variable exists with a value of 'gst', 'gawk' switches to
|
||
using the hash function from GNU Smalltalk for managing arrays.
|
||
This function may be marginally faster than the standard function.
|
||
|
||
'AWKREADFUNC'
|
||
If this variable exists, 'gawk' switches to reading source files
|
||
one line at a time, instead of reading in blocks. This exists for
|
||
debugging problems on filesystems on non-POSIX operating systems
|
||
where I/O is performed in records, not in blocks.
|
||
|
||
'GAWK_MSG_SRC'
|
||
If this variable exists, 'gawk' includes the file name and line
|
||
number within the 'gawk' source code from which warning and/or
|
||
fatal messages are generated. Its purpose is to help isolate the
|
||
source of a message, as there are multiple places that produce the
|
||
same warning or error message.
|
||
|
||
'GAWK_NO_DFA'
|
||
If this variable exists, 'gawk' does not use the DFA regexp matcher
|
||
for "does it match" kinds of tests. This can cause 'gawk' to be
|
||
slower. Its purpose is to help isolate differences between the two
|
||
regexp matchers that 'gawk' uses internally. (There aren't
|
||
supposed to be differences, but occasionally theory and practice
|
||
don't coordinate with each other.)
|
||
|
||
'GAWK_NO_PP_RUN'
|
||
When 'gawk' is invoked with the '--pretty-print' option, it will
|
||
not run the program if this environment variable exists.
|
||
|
||
CAUTION: This variable will not survive into the next major
|
||
release.
|
||
|
||
'GAWK_STACKSIZE'
|
||
This specifies the amount by which 'gawk' should grow its internal
|
||
evaluation stack, when needed.
|
||
|
||
'INT_CHAIN_MAX'
|
||
This specifies the intended maximum number of items 'gawk' will
|
||
maintain on a hash chain for managing arrays indexed by integers.
|
||
|
||
'STR_CHAIN_MAX'
|
||
This specifies the intended maximum number of items 'gawk' will
|
||
maintain on a hash chain for managing arrays indexed by strings.
|
||
|
||
'TIDYMEM'
|
||
If this variable exists, 'gawk' uses the 'mtrace()' library calls
|
||
from the GNU C library to help track down possible memory leaks.
|
||
|
||
|
||
File: gawk.info, Node: Exit Status, Next: Include Files, Prev: Environment Variables, Up: Invoking Gawk
|
||
|
||
2.6 'gawk''s Exit Status
|
||
========================
|
||
|
||
If the 'exit' statement is used with a value (*note Exit Statement::),
|
||
then 'gawk' exits with the numeric value given to it.
|
||
|
||
Otherwise, if there were no problems during execution, 'gawk' exits
|
||
with the value of the C constant 'EXIT_SUCCESS'. This is usually zero.
|
||
|
||
If an error occurs, 'gawk' exits with the value of the C constant
|
||
'EXIT_FAILURE'. This is usually one.
|
||
|
||
If 'gawk' exits because of a fatal error, the exit status is two. On
|
||
non-POSIX systems, this value may be mapped to 'EXIT_FAILURE'.
|
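From the shell, the exit status is available in '$?' immediately after the command finishes. The following sketch shows the first two cases; the value 3 is arbitrary:

```shell
# An explicit value given to 'exit' becomes the exit status.
status=0
awk 'BEGIN { exit 3 }' || status=$?
echo "explicit exit: $status"

# With no 'exit' value and no errors, the status is EXIT_SUCCESS (usually 0).
clean=0
awk 'BEGIN { }' || clean=$?
echo "clean run: $clean"
```

The '|| status=$?' form captures a nonzero status without stopping a script that runs under 'set -e'.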
||
|
||
|
||
File: gawk.info, Node: Include Files, Next: Loading Shared Libraries, Prev: Exit Status, Up: Invoking Gawk
|
||
|
||
2.7 Including Other Files into Your Program
|
||
===========================================
|
||
|
||
This minor node describes a feature that is specific to 'gawk'.
|
||
|
||
The '@include' keyword can be used to read external 'awk' source
|
||
files. This gives you the ability to split large 'awk' source files
|
||
into smaller, more manageable pieces, and also lets you reuse common
|
||
'awk' code from various 'awk' scripts. In other words, you can group
|
||
together 'awk' functions used to carry out specific tasks into external
|
||
files. These files can be used just like function libraries, using the
|
||
'@include' keyword in conjunction with the 'AWKPATH' environment
|
||
variable. Note that source files may also be included using the '-i'
|
||
option.
|
||
|
||
Let's see an example. We'll start with two (trivial) 'awk' scripts,
|
||
namely 'test1' and 'test2'. Here is the 'test1' script:
|
||
|
||
BEGIN {
|
||
print "This is script test1."
|
||
}
|
||
|
||
and here is 'test2':
|
||
|
||
@include "test1"
|
||
BEGIN {
|
||
print "This is script test2."
|
||
}
|
||
|
||
Running 'gawk' with 'test2' produces the following result:
|
||
|
||
$ gawk -f test2
|
||
-| This is script test1.
|
||
-| This is script test2.
|
||
|
||
'gawk' runs the 'test2' script, which includes 'test1' using the
|
||
'@include' keyword. So, to include external 'awk' source files, you
|
||
just use '@include' followed by the name of the file to be included,
|
||
enclosed in double quotes.
|
||
|
||
NOTE: Keep in mind that this is a language construct and the file
|
||
name cannot be a string variable, but rather just a literal string
|
||
constant in double quotes.
|
||
|
||
The files to be included may be nested; e.g., given a third script,
|
||
namely 'test3':
|
||
|
||
@include "test2"
|
||
BEGIN {
|
||
print "This is script test3."
|
||
}
|
||
|
||
Running 'gawk' with the 'test3' script produces the following results:
|
||
|
||
$ gawk -f test3
|
||
-| This is script test1.
|
||
-| This is script test2.
|
||
-| This is script test3.
|
||
|
||
The file name can, of course, be a pathname. For example:
|
||
|
||
@include "../io_funcs"
|
||
|
||
and:
|
||
|
||
@include "/usr/awklib/network"
|
||
|
||
are both valid. The 'AWKPATH' environment variable can be of great
|
||
value when using '@include'. The same rules for the use of the
|
||
'AWKPATH' variable in command-line file searches (*note AWKPATH
|
||
Variable::) apply to '@include' also.
|
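The combination of 'AWKPATH' and '@include' might be used as in the sketch below. The library file 'strutil.awk' and its 'repeat()' function are hypothetical names invented for the example, and 'gawk' is assumed to be installed:

```shell
if command -v gawk >/dev/null 2>&1; then
    lib=$(mktemp -d)
    # A one-function "library": repeat(s, n) returns s concatenated n times.
    cat > "$lib/strutil.awk" <<'EOF'
function repeat(s, n,    out) { while (n-- > 0) out = out s; return out }
EOF

    # The include file is named without a directory, so it is located
    # through the AWKPATH search path.
    included=$(AWKPATH="$lib" gawk '@include "strutil.awk"
BEGIN { print repeat("ab", 3) }')
    printf '%s\n' "$included"

    rm -rf "$lib"
fi
```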
||
|
||
This is very helpful in constructing 'gawk' function libraries. If
|
||
you have a large script with useful, general-purpose 'awk' functions,
|
||
you can break it down into library files and put those files in a
|
||
special directory. You can then include those "libraries," either by
|
||
using the full pathnames of the files, or by setting the 'AWKPATH'
|
||
environment variable accordingly and then using '@include' with just the
|
||
file part of the full pathname. Of course, you can keep library files
|
||
in more than one directory; the more complex the working environment is,
|
||
the more directories you may need to organize the files to be included.
|
||
|
||
Given the ability to specify multiple '-f' options, the '@include'
|
||
mechanism is not strictly necessary. However, the '@include' keyword
|
||
can help you in constructing self-contained 'gawk' programs, thus
|
||
reducing the need for writing complex and tedious command lines. In
|
||
particular, '@include' is very useful for writing CGI scripts to be run
|
||
from web pages.
|
||
|
||
As mentioned in *note AWKPATH Variable::, the current directory is
|
||
searched for source files only if it appears in the search path (as it
|
||
does, first, in the default value of 'AWKPATH'); this also applies to
|
||
files named with '@include'.
|
||
|
||
|
||
File: gawk.info, Node: Loading Shared Libraries, Next: Obsolete, Prev: Include Files, Up: Invoking Gawk
|
||
|
||
2.8 Loading Dynamic Extensions into Your Program
|
||
================================================
|
||
|
||
This minor node describes a feature that is specific to 'gawk'.
|
||
|
||
The '@load' keyword can be used to read external 'awk' extensions
|
||
(stored as system shared libraries). This allows you to link in
|
||
compiled code that may offer superior performance and/or give you access
|
||
to extended capabilities not supported by the 'awk' language. The
|
||
'AWKLIBPATH' variable is used to search for the extension. Using
|
||
'@load' is completely equivalent to using the '-l' command-line option.
|
||
|
||
If the extension is not initially found in 'AWKLIBPATH', another
|
||
search is conducted after appending the platform's default shared
|
||
library suffix to the file name. For example, on GNU/Linux systems, the
|
||
suffix '.so' is used:
|
||
|
||
$ gawk '@load "ordchr"; BEGIN {print chr(65)}'
|
||
-| A
|
||
|
||
This is equivalent to the following example:
|
||
|
||
$ gawk -lordchr 'BEGIN {print chr(65)}'
|
||
-| A
|
||
|
||
For command-line usage, the '-l' option is more convenient, but '@load'
|
||
is useful for embedding inside an 'awk' source file that requires access
|
||
to an extension.
|
||
|
||
*note Dynamic Extensions::, describes how to write extensions (in C
|
||
or C++) that can be loaded with either '@load' or the '-l' option. It
|
||
also describes the 'ordchr' extension.
|
||
|
||
|
||
File: gawk.info, Node: Obsolete, Next: Undocumented, Prev: Loading Shared Libraries, Up: Invoking Gawk
|
||
|
||
2.9 Obsolete Options and/or Features
|
||
====================================
|
||
|
||
This minor node describes features and/or command-line options from
|
||
previous releases of 'gawk' that either are not available in the current
|
||
version or are still supported but deprecated (meaning that they will
|
||
_not_ be in the next release).
|
||
|
||
The process-related special files '/dev/pid', '/dev/ppid',
|
||
'/dev/pgrpid', and '/dev/user' were deprecated in 'gawk' 3.1, but still
|
||
worked. As of version 4.0, they are no longer interpreted specially by
|
||
'gawk'. (Use 'PROCINFO' instead; see *note Auto-set::.)
|
||
|
||
|
||
File: gawk.info, Node: Undocumented, Next: Invoking Summary, Prev: Obsolete, Up: Invoking Gawk
|
||
|
||
2.10 Undocumented Options and Features
|
||
======================================
|
||
|
||
Use the Source, Luke!
|
||
-- _Obi-Wan_
|
||
|
||
This minor node intentionally left blank.
|
||
|
||
|
||
File: gawk.info, Node: Invoking Summary, Prev: Undocumented, Up: Invoking Gawk
|
||
|
||
2.11 Summary
|
||
============
|
||
|
||
* Use either 'awk 'PROGRAM' FILES' or 'awk -f PROGRAM-FILE FILES' to
|
||
run 'awk'.
|
||
|
||
* The three standard options for all versions of 'awk' are '-f',
|
||
'-F', and '-v'. 'gawk' supplies these and many others, as well as
|
||
corresponding GNU-style long options.
|
||
|
||
* Nonoption command-line arguments are usually treated as file names,
|
||
unless they have the form 'VAR=VALUE', in which case they are taken
|
||
as variable assignments to be performed at that point in processing
|
||
the input.
|
||
|
||
* All nonoption command-line arguments, excluding the program text,
|
||
are placed in the 'ARGV' array. Adjusting 'ARGC' and 'ARGV'
|
||
affects how 'awk' processes input.
|
||
|
||
* You can use a single minus sign ('-') to refer to standard input on
|
||
the command line. 'gawk' also lets you use the special file name
|
||
'/dev/stdin'.
|
||
|
||
* 'gawk' pays attention to a number of environment variables.
|
||
'AWKPATH', 'AWKLIBPATH', and 'POSIXLY_CORRECT' are the most
|
||
important ones.
|
||
|
||
* 'gawk''s exit status conveys information to the program that
|
||
invoked it. Use the 'exit' statement from within an 'awk' program
|
||
to set the exit status.
|
||
|
||
* 'gawk' allows you to include other 'awk' source files into your
|
||
program using the '@include' statement and/or the '-i' and '-f'
|
||
command-line options.
|
||
|
||
* 'gawk' allows you to load additional functions written in C or C++
|
||
using the '@load' statement and/or the '-l' option. (This advanced
|
||
feature is described later, in *note Dynamic Extensions::.)
|
||
|
||
|
||
File: gawk.info, Node: Regexp, Next: Reading Files, Prev: Invoking Gawk, Up: Top
|
||
|
||
3 Regular Expressions
|
||
*********************
|
||
|
||
A "regular expression", or "regexp", is a way of describing a set of
|
||
strings. Because regular expressions are such a fundamental part of
|
||
'awk' programming, their format and use deserve a separate major node.
|
||
|
||
A regular expression enclosed in slashes ('/') is an 'awk' pattern
|
||
that matches every input record whose text belongs to that set. The
|
||
simplest regular expression is a sequence of letters, numbers, or both.
|
||
Such a regexp matches any string that contains that sequence. Thus, the
|
||
regexp 'foo' matches any string containing 'foo'. Consequently, the pattern
|
||
'/foo/' matches any input record containing the three adjacent
|
||
characters 'foo' _anywhere_ in the record. Other kinds of regexps let
|
||
you specify more complicated classes of strings.
|
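A minimal sketch of this behavior, using three made-up input lines; a pattern with no action prints every matching record:

```shell
# 'food' and 'buffoon' both contain the adjacent characters f, o, o.
matches=$(printf 'food\nbar\nbuffoon\n' | awk '/foo/')
printf '%s\n' "$matches"
```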
||
|
||
* Menu:
|
||
|
||
* Regexp Usage:: How to Use Regular Expressions.
|
||
* Escape Sequences:: How to write nonprinting characters.
|
||
* Regexp Operators:: Regular Expression Operators.
|
||
* Bracket Expressions:: What can go between '[...]'.
|
||
* Leftmost Longest:: How much text matches.
|
||
* Computed Regexps:: Using Dynamic Regexps.
|
||
* GNU Regexp Operators:: Operators specific to GNU software.
|
||
* Case-sensitivity:: How to do case-insensitive matching.
|
||
* Regexp Summary:: Regular expressions summary.
|
||
|
||
|
||
File: gawk.info, Node: Regexp Usage, Next: Escape Sequences, Up: Regexp
|
||
|
||
3.1 How to Use Regular Expressions
|
||
==================================
|
||
|
||
A regular expression can be used as a pattern by enclosing it in
|
||
slashes. Then the regular expression is tested against the entire text
|
||
of each record. (Normally, it only needs to match some part of the text
|
||
in order to succeed.) For example, the following prints the second
|
||
field of each record where the string 'li' appears anywhere in the
|
||
record:
|
||
|
||
$ awk '/li/ { print $2 }' mail-list
|
||
-| 555-5553
|
||
-| 555-0542
|
||
-| 555-6699
|
||
-| 555-3430
|
||
|
||
Regular expressions can also be used in matching expressions. These
|
||
expressions allow you to specify the string to match against; it need
|
||
not be the entire current input record. The two operators '~' and '!~'
|
||
perform regular expression comparisons. Expressions using these
|
||
operators can be used as patterns, or in 'if', 'while', 'for', and 'do'
|
||
statements. (*Note Statements::.) For example, the following is true
|
||
if the expression EXP (taken as a string) matches REGEXP:
|
||
|
||
EXP ~ /REGEXP/
|
||
|
||
This example matches, or selects, all input records with the uppercase
|
||
letter 'J' somewhere in the first field:
|
||
|
||
$ awk '$1 ~ /J/' inventory-shipped
|
||
-| Jan 13 25 15 115
|
||
-| Jun 31 42 75 492
|
||
-| Jul 24 34 67 436
|
||
-| Jan 21 36 64 620
|
||
|
||
So does this:
|
||
|
||
awk '{ if ($1 ~ /J/) print }' inventory-shipped
|
||
|
||
This next example is true if the expression EXP (taken as a character
|
||
string) does _not_ match REGEXP:
|
||
|
||
EXP !~ /REGEXP/
|
||
|
||
The following example matches, or selects, all input records whose
|
||
first field _does not_ contain the uppercase letter 'J':
|
||
|
||
$ awk '$1 !~ /J/' inventory-shipped
|
||
-| Feb 15 32 24 226
|
||
-| Mar 15 24 34 228
|
||
-| Apr 31 52 63 420
|
||
-| May 16 34 29 208
|
||
...
|
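The two operators can also be tried without the sample data files. This sketch feeds three invented records through one rule that uses '~' and one that uses '!~':

```shell
# Each record is tested by both rules; exactly one of them fires.
tagged=$(printf 'Jan 13\nFeb 15\nJun 31\n' | awk '
$1 ~ /J/  { print "match:", $0 }
$1 !~ /J/ { print "no match:", $0 }')
printf '%s\n' "$tagged"
```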
||
|
||
When a regexp is enclosed in slashes, such as '/foo/', we call it a
|
||
"regexp constant", much like '5.27' is a numeric constant and '"foo"' is
|
||
a string constant.
|
||
|
||
|
||
File: gawk.info, Node: Escape Sequences, Next: Regexp Operators, Prev: Regexp Usage, Up: Regexp
|
||
|
||
3.2 Escape Sequences
|
||
====================
|
||
|
||
Some characters cannot be included literally in string constants
|
||
('"foo"') or regexp constants ('/foo/'). Instead, they should be
|
||
represented with "escape sequences", which are character sequences
|
||
beginning with a backslash ('\'). One use of an escape sequence is to
|
||
include a double-quote character in a string constant. Because a plain
|
||
double quote ends the string, you must use '\"' to represent an actual
|
||
double-quote character as a part of the string. For example:
|
||
|
||
$ awk 'BEGIN { print "He said \"hi!\" to her." }'
|
||
-| He said "hi!" to her.
|
||
|
||
The backslash character itself is another character that cannot be
|
||
included normally; you must write '\\' to put one backslash in the
|
||
string or regexp. Thus, the string whose contents are the two
|
||
characters '"' and '\' must be written '"\"\\"'.
|
||
|
||
Other escape sequences represent unprintable characters such as TAB
|
||
or newline. There is nothing to stop you from entering most unprintable
|
||
characters directly in a string constant or regexp constant, but they
|
||
may look ugly.
|
||
|
||
The following list presents all the escape sequences used in 'awk'
|
||
and what they represent. Unless noted otherwise, all these escape
|
||
sequences apply to both string constants and regexp constants:
|
||
|
||
'\\'
|
||
A literal backslash, '\'.
|
||
|
||
'\a'
|
||
The "alert" character, 'Ctrl-g', ASCII code 7 (BEL). (This often
|
||
makes some sort of audible noise.)
|
||
|
||
'\b'
|
||
Backspace, 'Ctrl-h', ASCII code 8 (BS).
|
||
|
||
'\f'
|
||
Formfeed, 'Ctrl-l', ASCII code 12 (FF).
|
||
|
||
'\n'
|
||
Newline, 'Ctrl-j', ASCII code 10 (LF).
|
||
|
||
'\r'
|
||
Carriage return, 'Ctrl-m', ASCII code 13 (CR).
|
||
|
||
'\t'
|
||
Horizontal TAB, 'Ctrl-i', ASCII code 9 (HT).
|
||
|
||
'\v'
|
||
Vertical TAB, 'Ctrl-k', ASCII code 11 (VT).
|
||
|
||
'\NNN'
|
||
The octal value NNN, where NNN stands for 1 to 3 digits between '0'
|
||
and '7'. For example, the code for the ASCII ESC (escape)
|
||
character is '\033'.
|
||
|
||
'\xHH...'
|
||
The hexadecimal value HH, where HH stands for a sequence of
|
||
hexadecimal digits ('0'-'9', and either 'A'-'F' or 'a'-'f'). Like
|
||
the same construct in ISO C, the escape sequence continues until
|
||
the first nonhexadecimal digit is seen. (c.e.) However, using
|
||
more than two hexadecimal digits produces undefined results. (The
|
||
'\x' escape sequence is not allowed in POSIX 'awk'.)
|
||
|
||
CAUTION: The next major release of 'gawk' will change, such
|
||
that a maximum of two hexadecimal digits following the '\x'
|
||
will be used.
|
||
|
||
'\/'
|
||
A literal slash (necessary for regexp constants only). This
|
||
sequence is used when you want to write a regexp constant that
|
||
contains a slash (such as '/.*:\/home\/[[:alnum:]]+:.*/'; the
|
||
'[[:alnum:]]' notation is discussed in *note Bracket
|
||
Expressions::). Because the regexp is delimited by slashes, you
|
||
need to escape any slash that is part of the pattern, in order to
|
||
tell 'awk' to keep processing the rest of the regexp.
|
||
|
||
'\"'
|
||
A literal double quote (necessary for string constants only). This
|
||
sequence is used when you want to write a string constant that
|
||
contains a double quote (such as '"He said \"hi!\" to her."').
|
||
Because the string is delimited by double quotes, you need to
|
||
escape any quote that is part of the string, in order to tell 'awk'
|
||
to keep processing the rest of the string.
|
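Several of the sequences in the preceding list can be seen at work in one short program. The strings are arbitrary; the octal codes 101, 102, and 103 are the ASCII codes for 'A', 'B', and 'C':

```shell
escaped=$(awk 'BEGIN {
    print "He said \"hi!\" to her."        # \" inside a string constant
    print "col1\tcol2"                     # \t is a horizontal TAB
    print "\101\102\103"                   # octal escapes for A, B, C
    if ("/home/arnold" ~ /^\/home\//)      # \/ inside a regexp constant
        print "slash matched"
}')
printf '%s\n' "$escaped"
```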
||
|
||
In 'gawk', a number of additional two-character sequences that begin
|
||
with a backslash have special meaning in regexps. *Note GNU Regexp
|
||
Operators::.
|
||
|
||
In a regexp, a backslash before any character that is not in the
|
||
previous list and not listed in *note GNU Regexp Operators::, means that
|
||
the next character should be taken literally, even if it would normally
|
||
be a regexp operator. For example, '/a\+b/' matches the three
|
||
characters 'a+b'.
|
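A small check of this rule, as a sketch: the escaped '+' must match a literal plus sign, so a string that merely repeats 'a' should not match.

```shell
literal=$(awk 'BEGIN {
    if ("a+b" ~ /a\+b/)  print "a+b: match"       # literal plus sign present
    if ("aab" !~ /a\+b/) print "aab: no match"    # no plus sign in the text
}')
printf '%s\n' "$literal"
```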
||
|
||
For complete portability, use a backslash only before a character that
|
||
is shown in the previous list or that is a regexp operator.
|
||
|
||
Backslash Before Regular Characters
|
||
|
||
If you place a backslash in a string constant before something that
|
||
is not one of the characters previously listed, POSIX 'awk' purposely
|
||
leaves what happens as undefined. There are two choices:
|
||
|
||
Strip the backslash out
|
||
This is what BWK 'awk' and 'gawk' both do. For example, '"a\qc"'
|
||
is the same as '"aqc"'. (Because this is such an easy bug both to
|
||
introduce and to miss, 'gawk' warns you about it.) Consider 'FS =
|
||
"[ \t]+\|[ \t]+"' to use vertical bars surrounded by whitespace as
|
||
the field separator. There should be two backslashes in the
|
||
string: 'FS = "[ \t]+\\|[ \t]+"'.
|
||
|
||
Leave the backslash alone
|
||
Some other 'awk' implementations do this. In such implementations,
|
||
typing '"a\qc"' is the same as typing '"a\\qc"'.
|
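The corrected field separator can be exercised with a one-line sketch; the input record is made up for the example:

```shell
# "\\|" in the string constant yields the regexp \| (a literal vertical bar).
second=$(printf 'a | b | c\n' | awk 'BEGIN { FS = "[ \t]+\\|[ \t]+" } { print $2 }')
printf '%s\n' "$second"
```

With only one backslash, an implementation that strips it is left with an alternation of two whitespace patterns, so fields would be split on whitespace alone.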
||
|
||
To summarize:
|
||
|
||
* The escape sequences in the preceding list are always processed
|
||
first, for both string constants and regexp constants. This
|
||
happens very early, as soon as 'awk' reads your program.
|
||
|
||
* 'gawk' processes both regexp constants and dynamic regexps (*note
|
||
Computed Regexps::), for the special operators listed in *note GNU
|
||
Regexp Operators::.
|
||
|
||
* A backslash before any other character means to treat that
|
||
character literally.
|
||
|
||
Escape Sequences for Metacharacters
|
||
|
||
Suppose you use an octal or hexadecimal escape to represent a regexp
|
||
metacharacter. (See *note Regexp Operators::.) Does 'awk' treat the
|
||
character as a literal character or as a regexp operator?
|
||
|
||
Historically, such characters were taken literally. (d.c.) However,
|
||
the POSIX standard indicates that they should be treated as real
|
||
metacharacters, which is what 'gawk' does. In compatibility mode (*note
|
||
Options::), 'gawk' treats the characters represented by octal and
|
||
hexadecimal escape sequences literally when used in regexp constants.
|
||
Thus, '/a\52b/' is equivalent to '/a\*b/'.
|
||
|
||
|
||
File: gawk.info, Node: Regexp Operators, Next: Bracket Expressions, Prev: Escape Sequences, Up: Regexp
|
||
|
||
3.3 Regular Expression Operators
|
||
================================
|
||
|
||
You can combine regular expressions with special characters, called
|
||
"regular expression operators" or "metacharacters", to increase the
|
||
power and versatility of regular expressions.
|
||
|
||
The escape sequences described in *note Escape Sequences::, are valid
|
||
inside a regexp. They are introduced by a '\' and are recognized and
|
||
converted into corresponding real characters as the very first step in
|
||
processing regexps.
|
||
|
||
Here is a list of metacharacters. All characters that are not escape
|
||
sequences and that are not listed here stand for themselves:
|
||
|
||
'\'
|
||
This suppresses the special meaning of a character when matching.
|
||
For example, '\$' matches the character '$'.
|
||
|
||
'^'
|
||
This matches the beginning of a string. '^@chapter' matches
|
||
'@chapter' at the beginning of a string, for example, and can be
|
||
used to identify chapter beginnings in Texinfo source files. The
|
||
'^' is known as an "anchor", because it anchors the pattern to
|
||
match only at the beginning of the string.
|
||
|
||
It is important to realize that '^' does not match the beginning of
|
||
a line (the point right after a '\n' newline character) embedded in
|
||
a string. The condition is not true in the following example:
|
||
|
||
if ("line1\nLINE 2" ~ /^L/) ...
|
||
|
||
'$'
|
||
This is similar to '^', but it matches only at the end of a string.
|
||
For example, 'p$' matches a record that ends with a 'p'. The '$'
|
||
is an anchor and does not match the end of a line (the point right
|
||
before a '\n' newline character) embedded in a string. The
|
||
condition in the following example is not true:
|
||
|
||
if ("line1\nLINE 2" ~ /1$/) ...
|
||
|
||
'.' (period)
|
||
This matches any single character, _including_ the newline
|
||
character. For example, '.P' matches any single character followed
|
||
by a 'P' in a string. Using concatenation, we can make a regular
|
||
expression such as 'U.A', which matches any three-character
|
||
sequence that begins with 'U' and ends with 'A'.
|
||
|
||
In strict POSIX mode (*note Options::), '.' does not match the NUL
|
||
character, which is a character with all bits equal to zero.
|
||
Otherwise, NUL is just another character. Other versions of 'awk'
|
||
may not be able to match the NUL character.
|
||
|
||
'['...']'
|
||
This is called a "bracket expression".(1) It matches any _one_ of
|
||
the characters that are enclosed in the square brackets. For
|
||
example, '[MVX]' matches any one of the characters 'M', 'V', or 'X'
|
||
in a string. A full discussion of what can be inside the square
|
||
brackets of a bracket expression is given in *note Bracket
|
||
Expressions::.
|
||
|
||
'[^'...']'
|
||
This is a "complemented bracket expression". The first character
|
||
after the '[' _must_ be a '^'. It matches any one character _except_
|
||
those in the square brackets. For example, '[^awk]' matches any
|
||
character that is not an 'a', 'w', or 'k'.
|
||
|
||
'|'
|
||
This is the "alternation operator" and it is used to specify
|
||
alternatives. The '|' has the lowest precedence of all the regular
|
||
expression operators. For example, '^P|[aeiouy]' matches any
|
||
string that matches either '^P' or '[aeiouy]'. This means it
|
||
matches any string that starts with 'P' or contains (anywhere
|
||
within it) a lowercase English vowel.
|
||
|
||
The alternation applies to the largest possible regexps on either
|
||
side.
|
||
|
||
'('...')'
     Parentheses are used for grouping in regular expressions, as in
     arithmetic. They can be used to concatenate regular expressions
     containing the alternation operator, '|'. For example,
     '@(samp|code)\{[^}]+\}' matches both '@code{foo}' and '@samp{bar}'.
     (These are Texinfo formatting control sequences. The '+' is
     explained further on in this list.)

'*'
     This symbol means that the preceding regular expression should be
     repeated as many times as necessary to find a match. For example,
     'ph*' applies the '*' symbol to the preceding 'h' and looks for
     matches of one 'p' followed by any number of 'h's. This also
     matches just 'p' if no 'h's are present.

     There are two subtle points to understand about how '*' works.
     First, the '*' applies only to the single preceding regular
     expression component (e.g., in 'ph*', it applies just to the 'h').
     To cause '*' to apply to a larger subexpression, use parentheses:
     '(ph)*' matches 'ph', 'phph', 'phphph', and so on.

     Second, '*' finds as many repetitions as possible. If the text to
     be matched is 'phhhhhhhhhhhhhhooey', 'ph*' matches all of the 'h's.

'+'
     This symbol is similar to '*', except that the preceding expression
     must be matched at least once. This means that 'wh+y' would match
     'why' and 'whhy', but not 'wy', whereas 'wh*y' would match all
     three.

'?'
     This symbol is similar to '*', except that the preceding expression
     can be matched either once or not at all. For example, 'fe?d'
     matches 'fed' and 'fd', but nothing else.

'{'N'}'
'{'N',}'
'{'N','M'}'
     One or two numbers inside braces denote an "interval expression".
     If there is one number in the braces, the preceding regexp is
     repeated N times. If there are two numbers separated by a comma,
     the preceding regexp is repeated N to M times. If there is one
     number followed by a comma, then the preceding regexp is repeated
     at least N times:

     'wh{3}y'
          Matches 'whhhy', but not 'why' or 'whhhhy'.

     'wh{3,5}y'
          Matches 'whhhy', 'whhhhy', or 'whhhhhy' only.

     'wh{2,}y'
          Matches 'whhy', 'whhhy', and so on.

     Interval expressions were not traditionally available in 'awk'.
     They were added as part of the POSIX standard to make 'awk' and
     'egrep' consistent with each other.

     Initially, because old programs may use '{' and '}' in regexp
     constants, 'gawk' did _not_ match interval expressions in regexps.

     However, beginning with version 4.0, 'gawk' does match interval
     expressions by default. This is because compatibility with POSIX
     has become more important to most 'gawk' users than compatibility
     with old programs.

     For programs that use '{' and '}' in regexp constants, it is good
     practice to always escape them with a backslash. Then the regexp
     constants are valid and work the way you want them to, using any
     version of 'awk'.(2)

     Finally, when '{' and '}' appear in regexp constants in a way that
     cannot be interpreted as an interval expression (such as '/q{a}/'),
     then they stand for themselves.

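The anchor entries in the list above state that '^' and '$' match only
at the two ends of the whole string, never at an embedded newline. The
following is a minimal sketch that checks this, and that shows one
possible workaround (alternation with '\n'). The shell variable name
'result' is only an illustration:

```shell
# Test the anchors against a string that contains an embedded newline.
result=$(awk 'BEGIN {
    s = "line1\nLINE 2"
    print (s ~ /^L/)         # 0: the "L" follows a newline, not the string start
    print (s ~ /1$/)         # 0: the "1" precedes a newline, not the string end
    print (s ~ /(^|\n)L/)    # 1: "L" at the string start or right after a newline
    print (s ~ /1(\n|$)/)    # 1: "1" at the string end or right before a newline
}')
echo "$result"
```

The last two patterns are just one way to express "start of line" and
"end of line" inside a multiline string; splitting the string on
newlines first would work equally well.
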
In regular expressions, the '*', '+', and '?' operators, as well as
the braces '{' and '}', have the highest precedence, followed by
concatenation, and finally by '|'. As in arithmetic, parentheses can
change how operators are grouped.

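As a small illustration of these precedence rules, the sketch below
compares each pattern with its parenthesized counterpart. The test
words and the variable name 'result' are made up for this example:

```shell
# Each output line shows a word followed by two match results (1 or 0).
result=$(awk 'BEGIN {
    # "|" binds loosest: ^P|[aeiouy] means (^P) or ([aeiouy]).
    n = split("Pqr xyz bcd", word, " ")
    for (i = 1; i <= n; i++)
        print word[i], (word[i] ~ /^P|[aeiouy]/), (word[i] ~ /^(P|[aeiouy])/)
    # Repetition binds tightest: ph* means p(h*), not (ph)*.
    print "phph", ("phph" ~ /^ph*$/), ("phph" ~ /^(ph)*$/)
}')
echo "$result"
```

Here 'xyz' matches the first pattern (it contains a 'y') but not the
second, because the parentheses make the '^' anchor apply to both
alternatives.
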
In POSIX 'awk' and 'gawk', the '*', '+', and '?' operators stand for
themselves when there is nothing in the regexp that precedes them. For
example, '/+/' matches a literal plus sign. However, many other
versions of 'awk' treat such a usage as a syntax error.

If 'gawk' is in compatibility mode (*note Options::), interval
expressions are not available in regular expressions.

---------- Footnotes ----------

(1) In other literature, you may see a bracket expression referred to
as either a "character set", a "character class", or a "character list".

(2) Use two backslashes if you're using a string constant with a
regexp operator or function.


File: gawk.info, Node: Bracket Expressions, Next: Leftmost Longest, Prev: Regexp Operators, Up: Regexp

3.4 Using Bracket Expressions
=============================

As mentioned earlier, a bracket expression matches any character among
those listed between the opening and closing square brackets.

Within a bracket expression, a "range expression" consists of two
characters separated by a hyphen. It matches any single character that
sorts between the two characters, based upon the system's native
character set. For example, '[0-9]' is equivalent to '[0123456789]'.
(See *note Ranges and Locales::, for an explanation of how the POSIX
standard and 'gawk' have changed over time. This is mainly of
historical interest.)

To include one of the characters '\', ']', '-', or '^' in a bracket
expression, put a '\' in front of it. For example:

     [d\]]

matches either 'd' or ']'. Additionally, if you place ']' right after
the opening '[', the closing bracket is treated as one of the characters
to be matched.

The treatment of '\' in bracket expressions is compatible with other
'awk' implementations and is also mandated by POSIX. The regular
expressions in 'awk' are a superset of the POSIX specification for
Extended Regular Expressions (EREs). POSIX EREs are based on the
regular expressions accepted by the traditional 'egrep' utility.

"Character classes" are a feature introduced in the POSIX standard.
A character class is a special notation for describing lists of
characters that have a specific attribute, but the actual characters can
vary from country to country and/or from character set to character set.
For example, the notion of what is an alphabetic character differs
between the United States and France.

A character class is only valid in a regexp _inside_ the brackets of
a bracket expression. Character classes consist of '[:', a keyword
denoting the class, and ':]'. *note Table 3.1: table-char-classes.
lists the character classes defined by the POSIX standard.

Class        Meaning
--------------------------------------------------------------------------
'[:alnum:]'  Alphanumeric characters
'[:alpha:]'  Alphabetic characters
'[:blank:]'  Space and TAB characters
'[:cntrl:]'  Control characters
'[:digit:]'  Numeric characters
'[:graph:]'  Characters that are both printable and visible (a space is
             printable but not visible, whereas an 'a' is both)
'[:lower:]'  Lowercase alphabetic characters
'[:print:]'  Printable characters (characters that are not control
             characters)
'[:punct:]'  Punctuation characters (characters that are not letters,
             digits, control characters, or space characters)
'[:space:]'  Space characters (such as space, TAB, and formfeed, to name
             a few)
'[:upper:]'  Uppercase alphabetic characters
'[:xdigit:]' Characters that are hexadecimal digits

Table 3.1: POSIX character classes

For example, before the POSIX standard, you had to write
'/[A-Za-z0-9]/' to match alphanumeric characters. If your character set
had other alphabetic characters in it, this would not match them. With
the POSIX character classes, you can write '/[[:alnum:]]/' to match the
alphabetic and numeric characters in your character set.

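The following short sketch shows character classes used inside bracket
expressions. The three input lines are invented, and the example
assumes an 'awk' that supports POSIX character classes (some very old
implementations do not):

```shell
# Classify each input line with POSIX character classes.
result=$(printf 'abc123\n!!!\nfoo_bar\n' | awk '
    /[[:alnum:]]/     { print "has alnum: " $0 }
    /^[[:punct:]]+$/  { print "all punct: " $0 }
')
echo "$result"
```

Note the doubled brackets: the outer pair forms the bracket expression,
and the inner '[:' and ':]' delimit the class name.
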
Some utilities that match regular expressions provide a nonstandard
'[:ascii:]' character class; 'awk' does not. However, you can simulate
such a construct using '[\x00-\x7F]'. This matches all values
numerically between zero and 127, which is the defined range of the
ASCII character set. Use a complemented character list ('[^\x00-\x7F]')
to match any single-byte characters that are not in the ASCII range.

Two additional special sequences can appear in bracket expressions.
These apply to non-ASCII character sets, which can have single symbols
(called "collating elements") that are represented with more than one
character. They can also have several characters that are equivalent
for "collating", or sorting, purposes. (For example, in French, a plain
"e" and a grave-accented "è" are equivalent.) These sequences are:

Collating symbols
     Multicharacter collating elements enclosed between '[.' and '.]'.
     For example, if 'ch' is a collating element, then '[[.ch.]]' is a
     regexp that matches this collating element, whereas '[ch]' is a
     regexp that matches either 'c' or 'h'.

Equivalence classes
     Locale-specific names for a list of characters that are equal. The
     name is enclosed between '[=' and '=]'. For example, the name 'e'
     might be used to represent all of "e," "ê," "è," and "é." In
     this case, '[[=e=]]' is a regexp that matches any of 'e', 'ê',
     'é', or 'è'.

These features are very valuable in non-English-speaking locales.

     CAUTION: The library functions that 'gawk' uses for regular
     expression matching currently recognize only POSIX character
     classes; they do not recognize collating symbols or equivalence
     classes.


File: gawk.info, Node: Leftmost Longest, Next: Computed Regexps, Prev: Bracket Expressions, Up: Regexp

3.5 How Much Text Matches?
==========================

Consider the following:

     echo aaaabcd | awk '{ sub(/a+/, "<A>"); print }'

This example uses the 'sub()' function to make a change to the input
record. ('sub()' replaces the first instance of any text matched by the
first argument with the string provided as the second argument; *note
String Functions::.) Here, the regexp '/a+/' indicates "one or more 'a'
characters," and the replacement text is '<A>'.

The input contains four 'a' characters. 'awk' (and POSIX) regular
expressions always match the leftmost, _longest_ sequence of input
characters that can match. Thus, all four 'a' characters are replaced
with '<A>' in this example:

     $ echo aaaabcd | awk '{ sub(/a+/, "<A>"); print }'
     -| <A>bcd

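The same rule can be observed with 'match()', which reports where a
match starts and how long it is. The sketch below is only an
illustration (the strings and the variable name 'result' are made up).
Its second half shows that "leftmost" is decided before "longest": a
pattern that can match the empty string matches at the very first
position.

```shell
# Show the extent of a leftmost-longest match.
result=$(awk 'BEGIN {
    s = "aaaabcd"
    match(s, /a+/)
    print RSTART, RLENGTH     # 1 4: the match covers all four "a" characters
    t = "baaac"
    sub(/a*/, "<A>", t)
    print t                   # <A>baaac: the empty match at position 1 is leftmost
}')
echo "$result"
```
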
For simple match/no-match tests, this is not so important. But when
doing text matching and substitutions with the 'match()', 'sub()',
'gsub()', and 'gensub()' functions, it is very important. *Note String
Functions::, for more information on these functions. Understanding
this principle is also important for regexp-based record and field
splitting (*note Records::, and also *note Field Separators::).


File: gawk.info, Node: Computed Regexps, Next: GNU Regexp Operators, Prev: Leftmost Longest, Up: Regexp

3.6 Using Dynamic Regexps
=========================

The righthand side of a '~' or '!~' operator need not be a regexp
constant (i.e., a string of characters between slashes). It may be any
expression. The expression is evaluated and converted to a string if
necessary; the contents of the string are then used as the regexp. A
regexp computed in this way is called a "dynamic regexp" or a "computed
regexp":

     BEGIN { digits_regexp = "[[:digit:]]+" }
     $0 ~ digits_regexp { print }

This sets 'digits_regexp' to a regexp that describes one or more digits,
and tests whether the input record matches this regexp.

     NOTE: When using the '~' and '!~' operators, be aware that there is
     a difference between a regexp constant enclosed in slashes and a
     string constant enclosed in double quotes. If you are going to use
     a string constant, you have to understand that the string is, in
     essence, scanned _twice_: the first time when 'awk' reads your
     program, and the second time when it goes to match the string on
     the lefthand side of the operator with the pattern on the right.
     This is true of any string-valued expression (such as
     'digits_regexp', shown in the previous example), not just string
     constants.

What difference does it make if the string is scanned twice? The
answer has to do with escape sequences, and particularly with
backslashes. To get a backslash into a regular expression inside a
string, you have to type two backslashes.

For example, '/\*/' is a regexp constant for a literal '*'. Only one
backslash is needed. To do the same thing with a string, you have to
type '"\\*"'. The first backslash escapes the second one so that the
string actually contains the two characters '\' and '*'.

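The double scan can be checked directly. In this minimal sketch (the
test string and the variable name 'result' are illustrative), both
forms match a literal '*', and 'length()' confirms that the string
constant holds two characters after the first scan:

```shell
# Compare a regexp constant with the equivalent string constant.
result=$(awk 'BEGIN {
    s = "a*b"
    print (s ~ /\*/)       # 1: regexp constant, one backslash
    print (s ~ "\\*")      # 1: string constant, two backslashes
    print length("\\*")    # 2: the string holds a backslash and a star
}')
echo "$result"
```
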
Given that you can use both regexp and string constants to describe
regular expressions, which should you use? The answer is "regexp
constants," for several reasons:

   * String constants are more complicated to write and more difficult
     to read. Using regexp constants makes your programs less
     error-prone. Not understanding the difference between the two
     kinds of constants is a common source of errors.

   * It is more efficient to use regexp constants. 'awk' can note that
     you have supplied a regexp and store it internally in a form that
     makes pattern matching more efficient. When using a string
     constant, 'awk' must first convert the string into this internal
     form and then perform the pattern matching.

   * Using regexp constants is better form; it shows clearly that you
     intend a regexp match.

Using '\n' in Bracket Expressions of Dynamic Regexps

Some older versions of 'awk' do not allow the newline character to be
used inside a bracket expression for a dynamic regexp:

     $ awk '$0 ~ "[ \t\n]"'
     error-> awk: newline in character class [
     error-> ]...
     error-> source line number 1
     error-> context is
     error-> $0 ~ "[ >>> \t\n]" <<<

But a newline in a regexp constant works with no problem:

     $ awk '$0 ~ /[ \t\n]/'
     here is a sample line
     -| here is a sample line
     Ctrl-d

'gawk' does not have this problem, and it isn't likely to occur often
in practice, but it's worth noting for future reference.


File: gawk.info, Node: GNU Regexp Operators, Next: Case-sensitivity, Prev: Computed Regexps, Up: Regexp

3.7 'gawk'-Specific Regexp Operators
====================================

GNU software that deals with regular expressions provides a number of
additional regexp operators. These operators are described in this
minor node and are specific to 'gawk'; they are not available in other
'awk' implementations. Most of the additional operators deal with word
matching. For our purposes, a "word" is a sequence of one or more
letters, digits, or underscores ('_'):

'\s'
     Matches any whitespace character. Think of it as shorthand for
     '[[:space:]]'.

'\S'
     Matches any character that is not whitespace. Think of it as
     shorthand for '[^[:space:]]'.

'\w'
     Matches any word-constituent character--that is, it matches any
     letter, digit, or underscore. Think of it as shorthand for
     '[[:alnum:]_]'.

'\W'
     Matches any character that is not word-constituent. Think of it as
     shorthand for '[^[:alnum:]_]'.

'\<'
     Matches the empty string at the beginning of a word. For example,
     '/\<away/' matches 'away' but not 'stowaway'.

'\>'
     Matches the empty string at the end of a word. For example,
     '/stow\>/' matches 'stow' but not 'stowaway'.

'\y'
     Matches the empty string at either the beginning or the end of a
     word (i.e., the word boundar*y*). For example, '\yballs?\y'
     matches either 'ball' or 'balls', as a separate word.

'\B'
     Matches the empty string that occurs between two word-constituent
     characters. For example, '/\Brat\B/' matches 'crate', but it does
     not match 'dirty rat'. '\B' is essentially the opposite of '\y'.

There are two other operators that work on buffers. In Emacs, a
"buffer" is, naturally, an Emacs buffer. Other GNU programs, including
'gawk', consider the entire string to match as the buffer. The
operators are:

'\`'
     Matches the empty string at the beginning of a buffer (string)

'\''
     Matches the empty string at the end of a buffer (string)

Because '^' and '$' always work in terms of the beginning and end of
strings, these operators don't add any new capabilities for 'awk'. They
are provided for compatibility with other GNU software.

In other GNU software, the word-boundary operator is '\b'. However,
that conflicts with the 'awk' language's definition of '\b' as
backspace, so 'gawk' uses a different letter. An alternative method
would have been to require two backslashes in the GNU operators, but
this was deemed too confusing. The current method of using '\y' for the
GNU '\b' appears to be the lesser of two evils.

The various command-line options (*note Options::) control how 'gawk'
interprets characters in regexps:

No options
     In the default case, 'gawk' provides all the facilities of POSIX
     regexps and the GNU regexp operators described in *note Regexp
     Operators::.

'--posix'
     Match only POSIX regexps; the GNU operators are not special (e.g.,
     '\w' matches a literal 'w'). Interval expressions are allowed.

'--traditional'
     Match traditional Unix 'awk' regexps. The GNU operators are not
     special, and interval expressions are not available. Because BWK
     'awk' supports them, the POSIX character classes ('[[:alnum:]]',
     etc.) are available. Characters described by octal and
     hexadecimal escape sequences are treated literally, even if they
     represent regexp metacharacters.

'--re-interval'
     Allow interval expressions in regexps, if '--traditional' has been
     provided. Otherwise, interval expressions are available by
     default.


File: gawk.info, Node: Case-sensitivity, Next: Regexp Summary, Prev: GNU Regexp Operators, Up: Regexp

3.8 Case Sensitivity in Matching
================================

Case is normally significant in regular expressions, both when matching
ordinary characters (i.e., not metacharacters) and inside bracket
expressions. Thus, a 'w' in a regular expression matches only a
lowercase 'w' and not an uppercase 'W'.

The simplest way to do a case-independent match is to use a bracket
expression--for example, '[Ww]'. However, this can be cumbersome if you
need to use it often, and it can make the regular expressions harder to
read. There are two alternatives that you might prefer.

One way to perform a case-insensitive match at a particular point in
the program is to convert the data to a single case, using the
'tolower()' or 'toupper()' built-in string functions (which we haven't
discussed yet; *note String Functions::). For example:

     tolower($1) ~ /foo/ { ... }

converts the first field to lowercase before matching against it. This
works in any POSIX-compliant 'awk'.

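A complete, runnable version of this technique might look like the
following sketch. The three input lines and the variable name 'result'
are made up for the example:

```shell
# Print the lines whose first field matches "foo" in any letter case.
result=$(printf 'Foo one\nFOO two\nbar three\n' |
         awk 'tolower($1) ~ /foo/ { print }')
echo "$result"
```

Because 'tolower()' returns a new string, the record itself is not
changed; the matching lines are printed with their original
capitalization.
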
Another method, specific to 'gawk', is to set the variable
'IGNORECASE' to a nonzero value (*note Built-in Variables::). When
'IGNORECASE' is not zero, _all_ regexp and string operations ignore
case.

Changing the value of 'IGNORECASE' dynamically controls the case
sensitivity of the program as it runs. Case is significant by default
because 'IGNORECASE' (like most variables) is initialized to zero:

     x = "aB"
     if (x ~ /ab/) ... # this test will fail

     IGNORECASE = 1
     if (x ~ /ab/) ... # now it will succeed

In general, you cannot use 'IGNORECASE' to make certain rules case
insensitive and other rules case sensitive, as there is no
straightforward way to set 'IGNORECASE' just for the pattern of a
particular rule.(1) To do this, use either bracket expressions or
'tolower()'. However, one thing you can do with 'IGNORECASE' only is
dynamically turn case sensitivity on or off for all the rules at once.

'IGNORECASE' can be set on the command line or in a 'BEGIN' rule
(*note Other Arguments::; also *note Using BEGIN/END::). Setting
'IGNORECASE' from the command line is a way to make a program case
insensitive without having to edit it.

In multibyte locales, the equivalences between upper- and lowercase
characters are tested based on the wide-character values of the locale's
character set. Otherwise, the characters are tested based on the
ISO-8859-1 (ISO Latin-1) character set. This character set is a
superset of the traditional 128 ASCII characters, which also provides a
number of characters suitable for use with European languages.(2)

The value of 'IGNORECASE' has no effect if 'gawk' is in compatibility
mode (*note Options::). Case is always significant in compatibility
mode.

---------- Footnotes ----------

(1) Experienced C and C++ programmers will note that it is possible,
using something like 'IGNORECASE = 1 && /foObAr/ { ... }' and
'IGNORECASE = 0 || /foobar/ { ... }'. However, this is somewhat obscure
and we don't recommend it.

(2) If you don't understand this, don't worry about it; it just means
that 'gawk' does the right thing.


File: gawk.info, Node: Regexp Summary, Prev: Case-sensitivity, Up: Regexp

3.9 Summary
===========

   * Regular expressions describe sets of strings to be matched. In
     'awk', regular expression constants are written enclosed between
     slashes: '/'...'/'.

   * Regexp constants may be used standalone in patterns and in
     conditional expressions, or as part of matching expressions using
     the '~' and '!~' operators.

   * Escape sequences let you represent nonprintable characters and also
     let you represent regexp metacharacters as literal characters to be
     matched.

   * Regexp operators provide grouping, alternation, and repetition.

   * Bracket expressions give you a shorthand for specifying sets of
     characters that can match at a particular point in a regexp.
     Within bracket expressions, POSIX character classes let you specify
     certain groups of characters in a locale-independent fashion.

   * Regular expressions match the leftmost longest text in the string
     being matched. This matters for cases where you need to know the
     extent of the match, such as for text substitution and when the
     record separator is a regexp.

   * Matching expressions may use dynamic regexps (i.e., string values
     treated as regular expressions).

   * 'gawk''s 'IGNORECASE' variable lets you control the case
     sensitivity of regexp matching. In other 'awk' versions, use
     'tolower()' or 'toupper()'.


File: gawk.info, Node: Reading Files, Next: Printing, Prev: Regexp, Up: Top

4 Reading Input Files
*********************

In the typical 'awk' program, 'awk' reads all input either from the
standard input (by default, this is the keyboard, but often it is a pipe
from another command) or from files whose names you specify on the 'awk'
command line. If you specify input files, 'awk' reads them in order,
processing all the data from one before going on to the next. The name
of the current input file can be found in the predefined variable
'FILENAME' (*note Built-in Variables::).

The input is read in units called "records", and is processed by the
rules of your program one record at a time. By default, each record is
one line. Each record is automatically split into chunks called
"fields". This makes it more convenient for programs to work on the
parts of a record.

On rare occasions, you may need to use the 'getline' command. The
'getline' command is valuable both because it can do explicit input from
any number of files, and because the files used with it do not have to
be named on the 'awk' command line (*note Getline::).

* Menu:

* Records::                     Controlling how data is split into records.
* Fields::                      An introduction to fields.
* Nonconstant Fields::          Nonconstant Field Numbers.
* Changing Fields::             Changing the Contents of a Field.
* Field Separators::            The field separator and how to change it.
* Constant Size::               Reading constant width data.
* Splitting By Content::        Defining Fields By Content
* Multiple Line::               Reading multiline records.
* Getline::                     Reading files under explicit program control
                                using the 'getline' function.
* Read Timeout::                Reading input with a timeout.
* Command-line directories::    What happens if you put a directory on the
                                command line.
* Input Summary::               Input summary.
* Input Exercises::             Exercises.


File: gawk.info, Node: Records, Next: Fields, Up: Reading Files

4.1 How Input Is Split into Records
===================================

'awk' divides the input for your program into records and fields. It
keeps track of the number of records that have been read so far from the
current input file. This value is stored in a predefined variable
called 'FNR', which is reset to zero every time a new file is started.
Another predefined variable, 'NR', records the total number of input
records read so far from all data files. It starts at zero, but is
never automatically reset to zero.

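The difference between 'FNR' and 'NR' shows up as soon as more than one
input file is named. The following sketch creates two small temporary
files (their names and contents are invented for the example) and
prints both counters for every record:

```shell
# Print the file name, FNR, and NR for each record of two files.
dir=$(mktemp -d)
printf 'a\nb\n' > "$dir/one"
printf 'c\nd\ne\n' > "$dir/two"
result=$(cd "$dir" && awk '{ print FILENAME, FNR, NR }' one two)
echo "$result"
rm -r "$dir"
```

In the output, 'FNR' starts again at 1 for the first record of the file
'two', while 'NR' keeps counting up to 5.
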
* Menu:

* awk split records::           How standard 'awk' splits records.
* gawk split records::          How 'gawk' splits records.


File: gawk.info, Node: awk split records, Next: gawk split records, Up: Records

4.1.1 Record Splitting with Standard 'awk'
------------------------------------------

Records are separated by a character called the "record separator". By
default, the record separator is the newline character. This is why
records are, by default, single lines. To use a different character for
the record separator, simply assign that character to the predefined
variable 'RS'.

Like any other variable, the value of 'RS' can be changed in the
'awk' program with the assignment operator, '=' (*note Assignment
Ops::). The new record-separator character should be enclosed in
quotation marks, which indicate a string constant. Often, the right
time to do this is at the beginning of execution, before any input is
processed, so that the very first record is read with the proper
separator. To do this, use the special 'BEGIN' pattern (*note
BEGIN/END::). For example:

     awk 'BEGIN { RS = "u" }
     { print $0 }' mail-list

changes the value of 'RS' to 'u', before reading any input. The new
value is a string whose first character is the letter "u"; as a result,
records are separated by the letter "u". Then the input file is read,
and the second rule in the 'awk' program (the action with no pattern)
prints each record. Because each 'print' statement adds a newline at
the end of its output, this 'awk' program copies the input with each 'u'
changed to a newline. Here are the results of running the program on
'mail-list':

     $ awk 'BEGIN { RS = "u" }
     > { print $0 }' mail-list
     -| Amelia 555-5553 amelia.zodiac
     -| sq
     -| e@gmail.com F
     -| Anthony 555-3412 anthony.assert
     -| ro@hotmail.com A
     -| Becky 555-7685 becky.algebrar
     -| m@gmail.com A
     -| Bill 555-1675 bill.drowning@hotmail.com A
     -| Broderick 555-0542 broderick.aliq
     -| otiens@yahoo.com R
     -| Camilla 555-2912 camilla.inf
     -| sar
     -| m@skynet.be R
     -| Fabi
     -| s 555-1234 fabi
     -| s.
     -| ndevicesim
     -| s@
     -| cb.ed
     -| F
     -| J
     -| lie 555-6699 j
     -| lie.perscr
     -| tabor@skeeve.com F
     -| Martin 555-6480 martin.codicib
     -| s@hotmail.com A
     -| Sam
     -| el 555-3430 sam
     -| el.lanceolis@sh
     -| .ed
     -| A
     -| Jean-Pa
     -| l 555-2127 jeanpa
     -| l.campanor
     -| m@ny
     -| .ed
     -| R
     -|

Note that the entry for the name 'Bill' is not split. In the original
data file (*note Sample Data Files::), the line looks like this:

     Bill 555-1675 bill.drowning@hotmail.com A

It contains no 'u', so there is no reason to split the record, unlike
the others, which each have one or more occurrences of the 'u'. In
fact, this record is treated as part of the previous record; the newline
separating them in the output is the original newline in the data file,
not the one added by 'awk' when it printed the record!

Another way to change the record separator is on the command line,
using the variable-assignment feature (*note Other Arguments::):

     awk '{ print $0 }' RS="u" mail-list

This sets 'RS' to 'u' before processing 'mail-list'.

Using an alphabetic character such as 'u' for the record separator is
highly likely to produce strange results. Using an unusual character
such as '/' is more likely to produce correct behavior in the majority
of cases, but there are no guarantees. The moral is: Know Your Data.

When using regular characters as the record separator, there is one
unusual case that occurs when 'gawk' is being fully POSIX-compliant
(*note Options::). Then, the following (extreme) pipeline prints a
surprising '1':

     $ echo | gawk --posix 'BEGIN { RS = "a" } ; { print NF }'
     -| 1

There is one field, consisting of a newline. The value of the
built-in variable 'NF' is the number of fields in the current record.
(In the normal case, 'gawk' treats the newline as whitespace, printing
'0' as the result. Most other versions of 'awk' also act this way.)

Reaching the end of an input file terminates the current input
record, even if the last character in the file is not the character in
'RS'. (d.c.)

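A short sketch may make the last point concrete. The input below is
invented; it uses ';' as the record separator and deliberately does not
end with one, yet the final record is still delivered:

```shell
# Split the input on ";" and number the records.
result=$(printf 'a;b;c' | awk 'BEGIN { RS = ";" } { print NR ": " $0 }')
echo "$result"
```

If the input ended with a newline instead, that newline would simply be
part of the last record, because it no longer has any special meaning
once 'RS' has been changed.
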
The empty string '""' (a string without any characters) has a special
|
||
meaning as the value of 'RS'. It means that records are separated by
|
||
one or more blank lines and nothing else. *Note Multiple Line::, for
|
||
more details.

If you change the value of 'RS' in the middle of an 'awk' run, the
new value is used to delimit subsequent records, but the record
currently being processed, as well as records already processed, are not
affected.

After the end of the record has been determined, 'gawk' sets the
variable 'RT' to the text in the input that matched 'RS'.


File: gawk.info, Node: gawk split records, Prev: awk split records, Up: Records

4.1.2 Record Splitting with 'gawk'
----------------------------------

When using 'gawk', the value of 'RS' is not limited to a one-character
string. It can be any regular expression (*note Regexp::). (c.e.) In
general, each record ends at the next string that matches the regular
expression; the next record starts at the end of the matching string.
This general rule is actually at work in the usual case, where 'RS'
contains just a newline: a record ends at the beginning of the next
matching string (the next newline in the input), and the following
record starts just after the end of this string (at the first character
of the following line). The newline, because it matches 'RS', is not
part of either record.

When 'RS' is a single character, 'RT' contains the same single
character. However, when 'RS' is a regular expression, 'RT' contains
the actual input text that matched the regular expression.

If the input file ends without any text matching 'RS', 'gawk' sets
'RT' to the null string.

The following example illustrates both of these features. It sets
'RS' equal to a regular expression that matches either a newline or a
series of one or more uppercase letters with optional leading and/or
trailing whitespace:

     $ echo record 1 AAAA record 2 BBBB record 3 |
     > gawk 'BEGIN { RS = "\n|( *[[:upper:]]+ *)" }
     >             { print "Record =", $0,"and RT = [" RT "]" }'
     -| Record = record 1 and RT = [ AAAA ]
     -| Record = record 2 and RT = [ BBBB ]
     -| Record = record 3 and RT = [
     -| ]

The square brackets delineate the contents of 'RT', letting you see the
leading and trailing whitespace. The final value of 'RT' is a newline.
*Note Simple Sed::, for a more useful example of 'RS' as a regexp and
'RT'.
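
A smaller sketch of the same ideas follows. It assumes that a 'gawk'
binary is on the search path (the 'command -v' guard is only there so
that the snippet does nothing elsewhere), and the sample input is
invented for the illustration. The input deliberately ends without a
separator, so the last record leaves 'RT' empty:

```shell
# RS is a regexp matching one or more semicolons; RT shows what matched.
if command -v gawk >/dev/null 2>&1; then
    out=$(printf 'one;two;;three' |
          gawk 'BEGIN { RS = ";+" } { printf "%s [%s]\n", $0, RT }')
    printf '%s\n' "$out"
fi
```

With 'gawk', this prints 'one [;]', 'two [;;]', and 'three []'.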

If you set 'RS' to a regular expression that allows optional trailing
text, such as 'RS = "abc(XYZ)?"', it is possible, due to implementation
constraints, that 'gawk' may match the leading part of the regular
expression, but not the trailing part, particularly if the input text
that could match the trailing part is fairly long. 'gawk' attempts to
avoid this problem, but currently, there's no guarantee that this will
never happen.

     NOTE: Remember that in 'awk', the '^' and '$' anchor metacharacters
     match the beginning and end of a _string_, and not the beginning
     and end of a _line_. As a result, something like 'RS =
     "^[[:upper:]]"' can only match at the beginning of a file. This is
     because 'gawk' views the input file as one long string that happens
     to contain newline characters. It is thus best to avoid anchor
     metacharacters in the value of 'RS'.

The use of 'RS' as a regular expression and the 'RT' variable are
'gawk' extensions; they are not available in compatibility mode (*note
Options::). In compatibility mode, only the first character of the
value of 'RS' determines the end of the record.

'RS = "\0"' Is Not Portable

There are times when you might want to treat an entire data file as a
single record. The only way to make this happen is to give 'RS' a value
that you know doesn't occur in the input file. This is hard to do in a
general way, such that a program always works for arbitrary input files.

You might think that for text files, the NUL character, which
consists of a character with all bits equal to zero, is a good value to
use for 'RS' in this case:

     BEGIN { RS = "\0" }  # whole file becomes one record?

'gawk' in fact accepts this, and uses the NUL character for the
record separator. This works for certain special files, such as
'/proc/PID/environ' (for example, '/proc/self/environ') on GNU/Linux
systems, where the NUL character is in fact the record separator.
However, this usage is _not_ portable to most other 'awk'
implementations.

Almost all other 'awk' implementations(1) store strings internally as
C-style strings. C strings use the NUL character as the string
terminator. In effect, this means that 'RS = "\0"' is the same as 'RS =
""'. (d.c.)

It happens that recent versions of 'mawk' can use the NUL character
as a record separator. However, this is a special case: 'mawk' does not
allow embedded NUL characters in strings. (This may change in a future
version of 'mawk'.)

*Note Readfile Function::, for an interesting way to read whole
files. If you are using 'gawk', see *note Extension Sample Readfile::,
for another option.

---------- Footnotes ----------

(1) At least that we know about.


File: gawk.info, Node: Fields, Next: Nonconstant Fields, Prev: Records, Up: Reading Files

4.2 Examining Fields
====================

When 'awk' reads an input record, the record is automatically "parsed"
or separated by the 'awk' utility into chunks called "fields". By
default, fields are separated by "whitespace", like words in a line.
Whitespace in 'awk' means any string of one or more spaces, TABs, or
newlines;(1) other characters that are considered whitespace by other
languages (such as formfeed, vertical tab, etc.) are _not_ considered
whitespace by 'awk'.

The purpose of fields is to make it more convenient for you to refer
to these pieces of the record. You don't have to use them--you can
operate on the whole record if you want--but fields are what make simple
'awk' programs so powerful.

You use a dollar sign ('$') to refer to a field in an 'awk' program,
followed by the number of the field you want. Thus, '$1' refers to the
first field, '$2' to the second, and so on. (Unlike in the Unix shells,
the field numbers are not limited to single digits. '$127' is the 127th
field in the record.) For example, suppose the following is a line of
input:

     This seems like a pretty nice example.

Here the first field, or '$1', is 'This', the second field, or '$2', is
'seems', and so on. Note that the last field, '$7', is 'example.'.
Because there is no space between the 'e' and the '.', the period is
considered part of the seventh field.

'NF' is a predefined variable whose value is the number of fields in
the current record. 'awk' automatically updates the value of 'NF' each
time it reads a record. No matter how many fields there are, the last
field in a record can be represented by '$NF'. So, '$NF' is the same as
'$7', which is 'example.'. If you try to reference a field beyond the
last one (such as '$8' when the record has only seven fields), you get
the empty string. (If used in a numeric operation, you get zero.)
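
The points in the preceding paragraph can be checked directly. The
following is a small sketch that feeds the sample line to 'awk' and
prints 'NF', '$NF', the out-of-range field '$(NF+1)' between brackets,
and the same field used in a numeric context:

```shell
# The sample sentence has seven whitespace-separated fields.
out=$(printf 'This seems like a pretty nice example.\n' |
      awk '{ print NF; print $NF; print "[" $(NF+1) "]"; print $(NF+1) + 0 }')
printf '%s\n' "$out"
```

The four lines of output are '7', 'example.', '[]', and '0'.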

The use of '$0', which looks like a reference to the "zeroth" field,
is a special case: it represents the whole input record. Use it when
you are not interested in specific fields. Here are some more examples:

     $ awk '$1 ~ /li/ { print $0 }' mail-list
     -| Amelia 555-5553 amelia.zodiacusque@gmail.com F
     -| Julie 555-6699 julie.perscrutabor@skeeve.com F

This example prints each record in the file 'mail-list' whose first
field contains the string 'li'.

By contrast, the following example looks for 'li' in _the entire
record_ and prints the first and last fields for each matching input
record:

     $ awk '/li/ { print $1, $NF }' mail-list
     -| Amelia F
     -| Broderick R
     -| Julie F
     -| Samuel A

---------- Footnotes ----------

(1) In POSIX 'awk', newlines are not considered whitespace for
separating fields.


File: gawk.info, Node: Nonconstant Fields, Next: Changing Fields, Prev: Fields, Up: Reading Files

4.3 Nonconstant Field Numbers
=============================

A field number need not be a constant. Any expression in the 'awk'
language can be used after a '$' to refer to a field. The value of the
expression specifies the field number. If the value is a string, rather
than a number, it is converted to a number. Consider this example:

     awk '{ print $NR }'

Recall that 'NR' is the number of records read so far: one in the first
record, two in the second, and so on. So this example prints the first
field of the first record, the second field of the second record, and so
on. For the twentieth record, field number 20 is printed; most likely,
the record has fewer than 20 fields, so this prints a blank line. Here
is another example of using expressions as field numbers:

     awk '{ print $(2*2) }' mail-list

'awk' evaluates the expression '(2*2)' and uses its value as the
number of the field to print. The '*' represents multiplication, so the
expression '2*2' evaluates to four. The parentheses are used so that
the multiplication is done before the '$' operation; they are necessary
whenever there is a binary operator(1) in the field-number expression.
This example, then, prints the type of relationship (the fourth field)
for every line of the file 'mail-list'. (All of the 'awk' operators are
listed, in order of decreasing precedence, in *note Precedence::.)
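
To see why the parentheses matter, compare '$(2*2)' with '$2*2' on a
made-up record of four numbers. This is only a sketch; the input values
are chosen so that the two results differ:

```shell
# $(2*2) is field four; $2*2 is field two multiplied by two.
out=$(printf '1 3 5 7\n' | awk '{ print $(2*2); print $2*2 }')
printf '%s\n' "$out"
```

The first 'print' yields '7' (the fourth field), while the second
yields '6', because '$' binds more tightly than '*'.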

If the field number you compute is zero, you get the entire record.
Thus, '$(2-2)' has the same value as '$0'. Negative field numbers are
not allowed; trying to reference one usually terminates the program.
(The POSIX standard does not define what happens when you reference a
negative field number. 'gawk' notices this and terminates your program.
Other 'awk' implementations may behave differently.)

As mentioned in *note Fields::, 'awk' stores the current record's
number of fields in the built-in variable 'NF' (also *note Built-in
Variables::). Thus, the expression '$NF' is not a special feature--it
is the direct consequence of evaluating 'NF' and using its value as a
field number.

---------- Footnotes ----------

(1) A "binary operator", such as '*' for multiplication, is one that
takes two operands. The distinction is required because 'awk' also has
unary (one-operand) and ternary (three-operand) operators.


File: gawk.info, Node: Changing Fields, Next: Field Separators, Prev: Nonconstant Fields, Up: Reading Files

4.4 Changing the Contents of a Field
====================================

The contents of a field, as seen by 'awk', can be changed within an
'awk' program; this changes what 'awk' perceives as the current input
record. (The actual input is untouched; 'awk' _never_ modifies the
input file.) Consider the following example and its output:

     $ awk '{ nboxes = $3 ; $3 = $3 - 10
     >        print nboxes, $3 }' inventory-shipped
     -| 25 15
     -| 32 22
     -| 24 14
     ...

The program first saves the original value of field three in the
variable 'nboxes'. The '-' sign represents subtraction, so this program
reassigns field three, '$3', as the original value of field three minus
ten: '$3 - 10'. (*Note Arithmetic Ops::.) Then it prints the original
and new values for field three. (Someone in the warehouse made a
consistent mistake while inventorying the red boxes.)

For this to work, the text in '$3' must make sense as a number; the
string of characters must be converted to a number for the computer to
do arithmetic on it. The number resulting from the subtraction is
converted back to a string of characters that then becomes field three.
*Note Conversion::.

When the value of a field is changed (as perceived by 'awk'), the
text of the input record is recalculated to contain the new field where
the old one was. In other words, '$0' changes to reflect the altered
field. Thus, this program prints a copy of the input file, with 10
subtracted from the second field of each line:

     $ awk '{ $2 = $2 - 10; print $0 }' inventory-shipped
     -| Jan 3 25 15 115
     -| Feb 5 32 24 226
     -| Mar 5 24 34 228
     ...

It is also possible to assign contents to fields that are out of
range. For example:

     $ awk '{ $6 = ($5 + $4 + $3 + $2)
     >        print $6 }' inventory-shipped
     -| 168
     -| 297
     -| 301
     ...

We've just created '$6', whose value is the sum of fields '$2', '$3',
'$4', and '$5'. The '+' sign represents addition. For the file
'inventory-shipped', '$6' represents the total number of parcels shipped
for a particular month.

Creating a new field changes 'awk''s internal copy of the current
input record, which is the value of '$0'. Thus, if you do 'print $0'
after adding a field, the record printed includes the new field, with
the appropriate number of field separators between it and the previously
existing fields.

This recomputation affects and is affected by 'NF' (the number of
fields; *note Fields::). For example, the value of 'NF' is set to the
number of the highest field you create. The exact format of '$0' is
also affected by a feature that has not been discussed yet: the "output
field separator", 'OFS', used to separate the fields (*note Output
Separators::).

Note, however, that merely _referencing_ an out-of-range field does
_not_ change the value of either '$0' or 'NF'. Referencing an
out-of-range field only produces an empty string. For example:

     if ($(NF+1) != "")
         print "can't happen"
     else
         print "everything is normal"

should print 'everything is normal', because 'NF+1' is certain to be out
of range. (*Note If Statement::, for more information about 'awk''s
'if-else' statements. *Note Typing and Comparison::, for more
information about the '!=' operator.)

It is important to note that making an assignment to an existing
field changes the value of '$0' but does not change the value of 'NF',
even when you assign the empty string to a field. For example:

     $ echo a b c d | awk '{ OFS = ":"; $2 = ""
     >                       print $0; print NF }'
     -| a::c:d
     -| 4

The field is still there; it just has an empty value, delimited by the
two colons between 'a' and 'c'. This example shows what happens if you
create a new field:

     $ echo a b c d | awk '{ OFS = ":"; $2 = ""; $6 = "new"
     >                       print $0; print NF }'
     -| a::c:d::new
     -| 6

The intervening field, '$5', is created with an empty value (indicated
by the second pair of adjacent colons), and 'NF' is updated with the
value six.

Decrementing 'NF' throws away the values of the fields after the new
value of 'NF' and recomputes '$0'. (d.c.) Here is an example:

     $ echo a b c d e f | awk '{ print "NF =", NF;
     >                           NF = 3; print $0 }'
     -| NF = 6
     -| a b c

     CAUTION: Some versions of 'awk' don't rebuild '$0' when 'NF' is
     decremented.

Finally, there are times when it is convenient to force 'awk' to
rebuild the entire record, using the current values of the fields and
'OFS'. To do this, use the seemingly innocuous assignment:

     $1 = $1   # force record to be reconstituted
     print $0  # or whatever else with $0

This forces 'awk' to rebuild the record. It does help to add a comment,
as we've shown here.

There is a flip side to the relationship between '$0' and the fields.
Any assignment to '$0' causes the record to be reparsed into fields
using the _current_ value of 'FS'. This also applies to any built-in
function that updates '$0', such as 'sub()' and 'gsub()' (*note String
Functions::).
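
The reparsing can be observed with a short sketch such as the following.
The variable name 'n' and the replacement text are hypothetical, chosen
only for the illustration. The program records 'NF' for the record as
read, changes 'FS', and then assigns a new value to '$0':

```shell
# The record "a b c" is split on whitespace when read (NF is 3).
# Assigning to $0 afterwards re-splits using the new FS, a colon.
out=$(printf 'a b c\n' |
      awk '{ n = NF; FS = ":"; $0 = "x:y z"; print n, NF, $1 }')
printf '%s\n' "$out"
```

This prints '3 2 x': the original record had three fields, whereas the
newly assigned record is split at the colon into 'x' and 'y z'.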

Understanding '$0'

It is important to remember that '$0' is the _full_ record, exactly
as it was read from the input. This includes any leading or trailing
whitespace, and the exact whitespace (or other characters) that
separates the fields.

It is a common error to try to change the field separators in a
record simply by setting 'FS' and 'OFS', and then expecting a plain
'print' or 'print $0' to print the modified record.

But this does not work, because nothing was done to change the record
itself. Instead, you must force the record to be rebuilt, typically
with a statement such as '$1 = $1', as described earlier.
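
The following sketch shows both the mistake and the remedy on a made-up
colon-separated record. The first 'print' emits the record unchanged
even though 'OFS' has been set; only after '$1 = $1' does the record use
the new separator:

```shell
# Setting FS and OFS alone does not rewrite $0; assigning to a field does.
out=$(printf 'a:b:c\n' |
      awk 'BEGIN { FS = ":"; OFS = "-" } { print; $1 = $1; print }')
printf '%s\n' "$out"
```

The output is 'a:b:c' followed by 'a-b-c'.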


File: gawk.info, Node: Field Separators, Next: Constant Size, Prev: Changing Fields, Up: Reading Files

4.5 Specifying How Fields Are Separated
=======================================

* Menu:

* Default Field Splitting::       How fields are normally separated.
* Regexp Field Splitting::        Using regexps as the field separator.
* Single Character Fields::       Making each character a separate field.
* Command Line Field Separator::  Setting 'FS' from the command line.
* Full Line Fields::              Making the full line be a single field.
* Field Splitting Summary::       Some final points and a summary table.

The "field separator", which is either a single character or a regular
expression, controls the way 'awk' splits an input record into fields.
'awk' scans the input record for character sequences that match the
separator; the fields themselves are the text between the matches.

In the examples that follow, we use the bullet symbol (*) to
represent spaces in the output. If the field separator is 'oo', then
the following line:

     moo goo gai pan

is split into three fields: 'm', '*g', and '*gai*pan'. Note the leading
spaces in the values of the second and third fields.
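
The split just described can be reproduced with a short sketch. It sets
'FS' in a 'BEGIN' rule and prints each field between square brackets
(used here instead of the bullet symbol) so that the leading spaces are
visible:

```shell
# FS = "oo" splits "moo goo gai pan" into "m", " g", and " gai pan".
out=$(printf 'moo goo gai pan\n' |
      awk 'BEGIN { FS = "oo" }
           { for (i = 1; i <= NF; i++) printf "[%s]\n", $i }')
printf '%s\n' "$out"
```

The three lines of output are '[m]', '[ g]', and '[ gai pan]'.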

The field separator is represented by the predefined variable 'FS'.
Shell programmers take note: 'awk' does _not_ use the name 'IFS' that is
used by the POSIX-compliant shells (such as the Unix Bourne shell, 'sh',
or Bash).

The value of 'FS' can be changed in the 'awk' program with the
assignment operator, '=' (*note Assignment Ops::). Often, the right
time to do this is at the beginning of execution before any input has
been processed, so that the very first record is read with the proper
separator. To do this, use the special 'BEGIN' pattern (*note
BEGIN/END::). For example, here we set the value of 'FS' to the string
'","':

     awk 'BEGIN { FS = "," } ; { print $2 }'

Given the input line:

     John Q. Smith, 29 Oak St., Walamazoo, MI 42139

this 'awk' program extracts and prints the string '*29*Oak*St.'.

Sometimes the input data contains separator characters that don't
separate fields the way you thought they would. For instance, the
person's name in the example we just used might have a title or suffix
attached, such as:

     John Q. Smith, LXIX, 29 Oak St., Walamazoo, MI 42139

The same program would extract '*LXIX' instead of '*29*Oak*St.'. If you
were expecting the program to print the address, you would be surprised.
The moral is to choose your data layout and separator characters
carefully to prevent such problems. (If the data is not in a form that
is easy to process, perhaps you can massage it first with a separate
'awk' program.)


File: gawk.info, Node: Default Field Splitting, Next: Regexp Field Splitting, Up: Field Separators

4.5.1 Whitespace Normally Separates Fields
------------------------------------------

Fields are normally separated by whitespace sequences (spaces, TABs, and
newlines), not by single spaces. Two spaces in a row do not delimit an
empty field. The default value of the field separator 'FS' is a string
containing a single space, '" "'. If 'awk' interpreted this value in
the usual way, each space character would separate fields, so two spaces
in a row would make an empty field between them. The reason this does
not happen is that a single space as the value of 'FS' is a special
case--it is taken to specify the default manner of delimiting fields.

If 'FS' is any other single character, such as '","', then each
occurrence of that character separates two fields. Two consecutive
occurrences delimit an empty field. If the character occurs at the
beginning or the end of the line, that too delimits an empty field. The
space character is the only single character that does not follow these
rules.
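
A quick way to confirm these rules is to count fields. The sketch below
uses three made-up inputs; the shell variable names are arbitrary:

```shell
# Default FS: a run of spaces is one separator, so "a  b" has 2 fields.
n_default=$(printf 'a  b\n' | awk '{ print NF }')
# FS = ",": two commas in a row delimit an empty field (3 fields).
n_comma=$(printf 'a,,b\n' | awk 'BEGIN { FS = "," } { print NF }')
# Leading and trailing commas also delimit empty fields (3 fields).
n_edges=$(printf ',a,\n' | awk 'BEGIN { FS = "," } { print NF }')
printf '%s %s %s\n' "$n_default" "$n_comma" "$n_edges"
```

The counts printed are '2 3 3'.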


File: gawk.info, Node: Regexp Field Splitting, Next: Single Character Fields, Prev: Default Field Splitting, Up: Field Separators

4.5.2 Using Regular Expressions to Separate Fields
--------------------------------------------------

The previous node discussed the use of single characters or simple
strings as the value of 'FS'. More generally, the value of 'FS' may be
a string containing any regular expression. In this case, each match in
the record for the regular expression separates fields. For example,
the assignment:

     FS = ", \t"

makes every area of an input line that consists of a comma followed by a
space and a TAB into a field separator. ('\t' is an "escape sequence"
that stands for a TAB; *note Escape Sequences::, for the complete list
of similar escape sequences.)

For a less trivial example of a regular expression, try using single
spaces to separate fields the way single commas are used. 'FS' can be
set to '"[ ]"' (left bracket, space, right bracket). This regular
expression matches a single space and nothing else (*note Regexp::).
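
As a small sketch of the difference this makes, the following pipeline
splits a made-up record containing two adjacent spaces. With 'FS' set to
'"[ ]"', each space is its own separator, so an empty field appears
between them:

```shell
# "a  b" split on single spaces: fields are "a", "" and "b".
out=$(printf 'a  b\n' |
      awk 'BEGIN { FS = "[ ]" } { print NF; printf "[%s]\n", $2 }')
printf '%s\n' "$out"
```

This prints '3' and then '[]', showing that the second field is empty.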

There is an important difference between the two cases of 'FS = " "'
(a single space) and 'FS = "[ \t\n]+"' (a regular expression matching
one or more spaces, TABs, or newlines). For both values of 'FS', fields
are separated by "runs" (multiple adjacent occurrences) of spaces, TABs,
and/or newlines. However, when the value of 'FS' is '" "', 'awk' first
strips leading and trailing whitespace from the record and then decides
where the fields are. For example, the following pipeline prints 'b':

     $ echo ' a b c d ' | awk '{ print $2 }'
     -| b

However, this pipeline prints 'a' (note the extra spaces around each
letter):

     $ echo ' a  b  c  d ' | awk 'BEGIN { FS = "[ \t\n]+" }
     >                                  { print $2 }'
     -| a

In this case, the first field is null, or empty.

The stripping of leading and trailing whitespace also comes into play
whenever '$0' is recomputed. For instance, study this pipeline:

     $ echo '   a b c d' | awk '{ print; $2 = $2; print }'
     -|    a b c d
     -| a b c d

The first 'print' statement prints the record as it was read, with
leading whitespace intact. The assignment to '$2' rebuilds '$0' by
concatenating '$1' through '$NF' together, separated by the value of
'OFS' (which is a space by default). Because the leading whitespace was
ignored when finding '$1', it is not part of the new '$0'. Finally, the
last 'print' statement prints the new '$0'.

There is an additional subtlety to be aware of when using regular
expressions for field splitting. It is not well specified in the POSIX
standard, or anywhere else, what '^' means when splitting fields. Does
the '^' match only at the beginning of the entire record? Or is each
field separator a new string? It turns out that different 'awk'
versions answer this question differently, and you should not rely on
any specific behavior in your programs. (d.c.)

As a point of information, BWK 'awk' allows '^' to match only at the
beginning of the record. 'gawk' also works this way. For example:

     $ echo 'xxAA xxBxx C' |
     > gawk -F '(^x+)|( +)' '{ for (i = 1; i <= NF; i++)
     >                             printf "-->%s<--\n", $i }'
     -| --><--
     -| -->AA<--
     -| -->xxBxx<--
     -| -->C<--


File: gawk.info, Node: Single Character Fields, Next: Command Line Field Separator, Prev: Regexp Field Splitting, Up: Field Separators

4.5.3 Making Each Character a Separate Field
--------------------------------------------

There are times when you may want to examine each character of a record
separately. This can be done in 'gawk' by simply assigning the null
string ('""') to 'FS'. (c.e.) In this case, each individual character
in the record becomes a separate field. For example:

     $ echo a b | gawk 'BEGIN { FS = "" }
     >                  {
     >                      for (i = 1; i <= NF; i = i + 1)
     >                          print "Field", i, "is", $i
     >                  }'
     -| Field 1 is a
     -| Field 2 is
     -| Field 3 is b

Traditionally, the behavior of 'FS' equal to '""' was not defined.
In this case, most versions of Unix 'awk' simply treat the entire record
as only having one field. (d.c.) In compatibility mode (*note
Options::), if 'FS' is the null string, then 'gawk' also behaves this
way.


File: gawk.info, Node: Command Line Field Separator, Next: Full Line Fields, Prev: Single Character Fields, Up: Field Separators

4.5.4 Setting 'FS' from the Command Line
----------------------------------------

'FS' can be set on the command line. Use the '-F' option to do so. For
example:

     awk -F, 'PROGRAM' INPUT-FILES

sets 'FS' to the ',' character. Notice that the option uses an
uppercase 'F' instead of a lowercase 'f'. The latter option ('-f')
specifies a file containing an 'awk' program.

The value used for the argument to '-F' is processed in exactly the
same way as assignments to the predefined variable 'FS'. Any special
characters in the field separator must be escaped appropriately. For
example, to use a '\' as the field separator on the command line, you
would have to type:

     # same as FS = "\\"
     awk -F\\\\ '...' files ...

Because '\' is used for quoting in the shell, 'awk' sees '-F\\'. Then
'awk' processes the '\\' for escape characters (*note Escape
Sequences::), finally yielding a single '\' to use for the field
separator.

As a special case, in compatibility mode (*note Options::), if the
argument to '-F' is 't', then 'FS' is set to the TAB character. If you
type '-F\t' at the shell, without any quotes, the '\' gets deleted, so
'awk' figures that you really want your fields to be separated with TABs
and not 't's. Use '-v FS="t"' or '-F"[t]"' on the command line if you
really do want to separate your fields with 't's. Use '-F '\t'' when
not in compatibility mode to specify that TABs separate fields.
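
For instance, the following sketch (with made-up input, and assuming
'awk' is not running in compatibility mode) uses a quoted '\t' so that
only TABs separate fields; the spaces inside each field are preserved:

```shell
# The input has two TAB-separated fields, each containing a space.
out=$(printf 'a b\tc d\n' | awk -F '\t' '{ print $2 }')
printf '%s\n' "$out"
```

This prints 'c d', the whole second field.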

As an example, let's use an 'awk' program file called 'edu.awk' that
contains the pattern '/edu/' and the action 'print $1':

     /edu/   { print $1 }

Let's also set 'FS' to be the '-' character and run the program on
the file 'mail-list'. The following command prints a list of the names
of the people that work at or attend a university, and the first three
digits of their phone numbers:

     $ awk -F- -f edu.awk mail-list
     -| Fabius 555
     -| Samuel 555
     -| Jean

Note the third line of output. The third line in the original file
looked like this:

     Jean-Paul 555-2127 jeanpaul.campanorum@nyu.edu R

The '-' as part of the person's name was used as the field separator,
instead of the '-' in the phone number that was originally intended.
This demonstrates why you have to be careful in choosing your field and
record separators.

Perhaps the most common use of a single character as the field
separator occurs when processing the Unix system password file. On many
Unix systems, each user has a separate entry in the system password
file, with one line per user. The information in these lines is
separated by colons. The first field is the user's login name and the
second is the user's encrypted or shadow password. (A shadow password
is indicated by the presence of a single 'x' in the second field.) A
password file entry might look like this:

     arnold:x:2076:10:Arnold Robbins:/home/arnold:/bin/bash

The following program searches the system password file and prints
the entries for users whose full name is not indicated:

     awk -F: '$5 == ""' /etc/passwd
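
Because the contents of '/etc/passwd' differ from system to system, here
is a self-contained sketch of the same test run on two sample entries.
The second entry ('guest') is hypothetical and has an empty fifth field;
the program prints only the login name rather than the whole entry:

```shell
# Print the login name of every entry whose full-name field is empty.
out=$(printf '%s\n' \
        'arnold:x:2076:10:Arnold Robbins:/home/arnold:/bin/bash' \
        'guest:x:2077:10::/home/guest:/bin/sh' |
      awk -F: '$5 == "" { print $1 }')
printf '%s\n' "$out"
```

Only 'guest' is printed.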


File: gawk.info, Node: Full Line Fields, Next: Field Splitting Summary, Prev: Command Line Field Separator, Up: Field Separators

4.5.5 Making the Full Line Be a Single Field
--------------------------------------------

Occasionally, it's useful to treat the whole input line as a single
field. This can be done easily and portably simply by setting 'FS' to
'"\n"' (a newline):(1)

     awk -F'\n' 'PROGRAM' FILES ...

When you do this, '$1' is the same as '$0'.
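
A minimal sketch of this, using a made-up line with irregular spacing
and a hypothetical variable 's', is:

```shell
# With FS = "\n", the whole line is one field, whitespace included.
out=$(printf ' a  b c \n' |
      awk -F'\n' '{ s = ($1 == $0) ? "same" : "different"; print NF, s }')
printf '%s\n' "$out"
```

The output is '1 same': there is a single field, and it is identical to
the record.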

Changing 'FS' Does Not Affect the Fields

According to the POSIX standard, 'awk' is supposed to behave as if
each record is split into fields at the time it is read. In particular,
this means that if you change the value of 'FS' after a record is read,
the values of the fields (i.e., how they were split) should reflect the
old value of 'FS', not the new one.

However, many older implementations of 'awk' do not work this way.
Instead, they defer splitting the fields until a field is actually
referenced. The fields are split using the _current_ value of 'FS'!
(d.c.) This behavior can be difficult to diagnose. The following
example illustrates the difference between the two methods:

     sed 1q /etc/passwd | awk '{ FS = ":" ; print $1 }'

which usually prints:

     root

on an incorrect implementation of 'awk', while 'gawk' prints the full
first line of the file, something like:

     root:x:0:0:Root:/:

(The 'sed'(2) command prints just the first line of '/etc/passwd'.)

---------- Footnotes ----------

(1) Thanks to Andrew Schorr for this tip.

(2) The 'sed' utility is a "stream editor." Its behavior is also
defined by the POSIX standard.


File: gawk.info, Node: Field Splitting Summary, Prev: Full Line Fields, Up: Field Separators

4.5.6 Field-Splitting Summary
-----------------------------

It is important to remember that when you assign a string constant as
the value of 'FS', it undergoes normal 'awk' string processing. For
example, with Unix 'awk' and 'gawk', the assignment 'FS = "\.."' assigns
the character string '".."' to 'FS' (the backslash is stripped). This
creates a regexp meaning "fields are separated by occurrences of any two
characters." If instead you want fields to be separated by a literal
period followed by any single character, use 'FS = "\\.."'.
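
The doubled backslash can be tried out with a sketch like the following,
in which the input string is invented for the purpose. Each separator is
a literal period plus the character after it:

```shell
# FS = "\\.." becomes the regexp \.. (a period, then any one character).
out=$(printf 'a.xb.yc\n' |
      awk 'BEGIN { FS = "\\.." } { print NF; print $1, $2, $3 }')
printf '%s\n' "$out"
```

The separators '.x' and '.y' are removed, so the output is '3' followed
by 'a b c'.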
|
||
|
||
The following list summarizes how fields are split, based on the
|
||
value of 'FS' ('==' means "is equal to"):
|
||
|
||
'FS == " "'
|
||
Fields are separated by runs of whitespace. Leading and trailing
|
||
whitespace are ignored. This is the default.
|
||
|
||
'FS == ANY OTHER SINGLE CHARACTER'
|
||
Fields are separated by each occurrence of the character. Multiple
|
||
successive occurrences delimit empty fields, as do leading and
|
||
trailing occurrences. The character can even be a regexp
|
||
metacharacter; it does not need to be escaped.
|
||
|
||
'FS == REGEXP'
|
||
Fields are separated by occurrences of characters that match
|
||
REGEXP. Leading and trailing matches of REGEXP delimit empty
|
||
fields.
|
||
|
||
'FS == ""'
|
||
Each individual character in the record becomes a separate field.
|
||
(This is a common extension; it is not specified by the POSIX
|
||
standard.)
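The first three cases in the preceding list can be checked from the
shell. This is only a sketch with made-up input lines; the null-string
case is left out because it is an extension that not every 'awk'
provides:

```shell
# Default FS: runs of whitespace separate fields; leading and
# trailing whitespace is ignored.
default_nf=$(printf '  a   b  \n' | awk '{ print NF }')

# Single-character FS: every occurrence counts, so leading, trailing,
# and doubled separators produce empty fields.
colon_nf=$(printf ':a::b:\n' | awk -F: '{ print NF }')

# A single regexp metacharacter used as FS is taken literally.
pipe_f2=$(printf 'a|b\n' | awk -F'|' '{ print $2 }')

# Regexp FS: each match of the regexp is one separator.
regex_f2=$(printf 'a, b ,c\n' | awk -F' *, *' '{ print $2 }')

printf '%s %s %s %s\n' "$default_nf" "$colon_nf" "$pipe_f2" "$regex_f2"    # prints "2 5 b b"
```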
|
||
|
||
'FS' and 'IGNORECASE'
|
||
|
||
The 'IGNORECASE' variable (*note User-modified::) affects field
|
||
splitting _only_ when the value of 'FS' is a regexp. It has no effect
|
||
when 'FS' is a single character, even if that character is a letter.
|
||
Thus, in the following code:
|
||
|
||
FS = "c"
|
||
IGNORECASE = 1
|
||
$0 = "aCa"
|
||
print $1
|
||
|
||
The output is 'aCa'. If you really want to split fields on an
|
||
alphabetic character while ignoring case, use a regexp that will do it
|
||
for you (e.g., 'FS = "[c]"'). In this case, 'IGNORECASE' will take
|
||
effect.
|
||
|
||
|
||
File: gawk.info, Node: Constant Size, Next: Splitting By Content, Prev: Field Separators, Up: Reading Files
|
||
|
||
4.6 Reading Fixed-Width Data
|
||
============================
|
||
|
||
This minor node discusses an advanced feature of 'gawk'. If you are a
|
||
novice 'awk' user, you might want to skip it on the first reading.
|
||
|
||
'gawk' provides a facility for dealing with fixed-width fields with
|
||
no distinctive field separator. For example, data of this nature arises
|
||
in the input for old Fortran programs where numbers are run together, or
|
||
in the output of programs that did not anticipate the use of their
|
||
output as input for other programs.
|
||
|
||
An example of the latter is a table where all the columns are lined
|
||
up by the use of a variable number of spaces and _empty fields are just
|
||
spaces_. Clearly, 'awk''s normal field splitting based on 'FS' does not
|
||
work well in this case. Although a portable 'awk' program can use a
|
||
series of 'substr()' calls on '$0' (*note String Functions::), this is
|
||
awkward and inefficient for a large number of fields.
|
||
|
||
The splitting of an input record into fixed-width fields is specified
|
||
by assigning a string containing space-separated numbers to the built-in
|
||
variable 'FIELDWIDTHS'. Each number specifies the width of the field,
|
||
_including_ columns between fields. If you want to ignore the columns
|
||
between fields, you can specify the width as a separate field that is
|
||
subsequently ignored. It is a fatal error to supply a field width that
|
||
has a negative value. The following data is the output of the Unix 'w'
|
||
utility. It is useful to illustrate the use of 'FIELDWIDTHS':
|
||
|
||
 10:06pm  up 21 days, 14:04,  23 users
User     tty       login  idle   JCPU   PCPU  what
hzuo     ttyV0     8:58pm            9      5  vi p24.tex
hzang    ttyV3     6:37pm    50                -csh
eklye    ttyV5     9:53pm            7      1  em thes.tex
dportein ttyV6     8:17pm  1:47                -csh
gierd    ttyD3    10:00pm     1                elm
dave     ttyD4     9:47pm            4      4  w
brent    ttyp0    26Jun91  4:46  26:46   4:41  bash
dave     ttyq4    26Jun9115days     46     46  wnewmail
|
||
|
||
The following program takes this input, converts the idle time to
|
||
number of seconds, and prints out the first two fields and the
|
||
calculated idle time:
|
||
|
||
BEGIN  { FIELDWIDTHS = "9 6 10 6 7 7 35" }
NR > 2 {
    idle = $4
    sub(/^ +/, "", idle)   # strip leading spaces
    if (idle == "")
        idle = 0
    if (idle ~ /:/) {
        split(idle, t, ":")
        idle = t[1] * 60 + t[2]
    }
    if (idle ~ /days/)
        idle *= 24 * 60 * 60

    print $1, $2, idle
}
|
||
|
||
NOTE: The preceding program uses a number of 'awk' features that
|
||
haven't been introduced yet.
|
||
|
||
Running the program on the data produces the following results:
|
||
|
||
hzuo      ttyV0  0
hzang     ttyV3  50
eklye     ttyV5  0
dportein  ttyV6  107
gierd     ttyD3  1
dave      ttyD4  0
brent     ttyp0  286
dave      ttyq4  1296000
|
||
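The earlier advice about giving the columns between fields a width of
their own can be sketched as follows. The record layout here is
invented for the example (five columns of name, a gap, three columns of
number, a gap, two columns of state); 'FIELDWIDTHS' requires 'gawk':

```shell
# Widths 5 1 3 1 2: fields 2 and 4 hold the one-column gaps
# and are simply never printed.
fw_out=$(printf 'bob     7 CA\n' |
    gawk 'BEGIN { FIELDWIDTHS = "5 1 3 1 2" }
          { printf "[%s][%s][%s]\n", $1, $3, $5 }')
printf '%s\n' "$fw_out"    # prints "[bob  ][  7][CA]"
```

The brackets make the padding visible: each field keeps the spaces that
fall inside its width.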
|
||
Another (possibly more practical) example of fixed-width input data
|
||
is the input from a deck of balloting cards. In some parts of the
|
||
United States, voters mark their choices by punching holes in computer
|
||
cards. These cards are then processed to count the votes for any
|
||
particular candidate or on any particular issue. Because a voter may
|
||
choose not to vote on some issue, any column on the card may be empty.
|
||
An 'awk' program for processing such data could use the 'FIELDWIDTHS'
|
||
feature to simplify reading the data. (Of course, getting 'gawk' to run
|
||
on a system with card readers is another story!)
|
||
|
||
Assigning a value to 'FS' causes 'gawk' to use 'FS' for field
|
||
splitting again. Use 'FS = FS' to make this happen, without having to
|
||
know the current value of 'FS'. In order to tell which kind of field
|
||
splitting is in effect, use 'PROCINFO["FS"]' (*note Auto-set::). The
|
||
value is '"FS"' if regular field splitting is being used, or
|
||
'"FIELDWIDTHS"' if fixed-width field splitting is being used:
|
||
|
||
if (PROCINFO["FS"] == "FS")
    REGULAR FIELD SPLITTING ...
else if (PROCINFO["FS"] == "FIELDWIDTHS")
    FIXED-WIDTH FIELD SPLITTING ...
else
    CONTENT-BASED FIELD SPLITTING ... (see next minor node)
|
||
|
||
This information is useful when writing a function that needs to
|
||
temporarily change 'FS' or 'FIELDWIDTHS', read some records, and then
|
||
restore the original settings (*note Passwd Functions::, for an example
|
||
of such a function).
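The way 'PROCINFO["FS"]' follows the most recent assignment can be
observed directly. This is a small sketch for 'gawk' only; the width
and pattern values are arbitrary placeholders:

```shell
mode_out=$(gawk 'BEGIN {
    print PROCINFO["FS"]        # regular splitting is the default
    FIELDWIDTHS = "2 2"
    print PROCINFO["FS"]
    FS = FS                     # switch back without knowing the value of FS
    print PROCINFO["FS"]
    FPAT = "[^,]+"              # content-based splitting (see next minor node)
    print PROCINFO["FS"]
}')
printf '%s\n' "$mode_out"       # prints FS, FIELDWIDTHS, FS, FPAT on separate lines
```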
|
||
|
||
|
||
File: gawk.info, Node: Splitting By Content, Next: Multiple Line, Prev: Constant Size, Up: Reading Files
|
||
|
||
4.7 Defining Fields by Content
|
||
==============================
|
||
|
||
This minor node discusses an advanced feature of 'gawk'. If you are a
|
||
novice 'awk' user, you might want to skip it on the first reading.
|
||
|
||
Normally, when using 'FS', 'gawk' defines the fields as the parts of
|
||
the record that occur in between each field separator. In other words,
|
||
'FS' defines what a field _is not_, instead of what a field _is_.
|
||
However, there are times when you really want to define the fields by
|
||
what they are, and not by what they are not.
|
||
|
||
The most notorious such case is so-called "comma-separated values"
|
||
(CSV) data. Many spreadsheet programs, for example, can export their
|
||
data into text files, where each record is terminated with a newline,
|
||
and fields are separated by commas. If commas only separated the data,
|
||
there wouldn't be an issue. The problem comes when one of the fields
|
||
contains an _embedded_ comma. In such cases, most programs embed the
|
||
field in double quotes.(1) So, we might have data like this:
|
||
|
||
Robbins,Arnold,"1234 A Pretty Street, NE",MyTown,MyState,12345-6789,USA
|
||
|
||
The 'FPAT' variable offers a solution for cases like this. The value
|
||
of 'FPAT' should be a string that provides a regular expression. This
|
||
regular expression describes the contents of each field.
|
||
|
||
In the case of CSV data as presented here, each field is either
|
||
"anything that is not a comma," or "a double quote, anything that is not
|
||
a double quote, and a closing double quote." If written as a regular
|
||
expression constant (*note Regexp::), we would have
|
||
'/([^,]+)|("[^"]+")/'. Writing this as a string requires us to escape
|
||
the double quotes, leading to:
|
||
|
||
FPAT = "([^,]+)|(\"[^\"]+\")"
|
||
|
||
Putting this to use, here is a simple program to parse the data:
|
||
|
||
BEGIN {
    FPAT = "([^,]+)|(\"[^\"]+\")"
}

{
    print "NF = ", NF
    for (i = 1; i <= NF; i++) {
        printf("$%d = <%s>\n", i, $i)
    }
}
|
||
|
||
When run, we get the following:
|
||
|
||
$ gawk -f simple-csv.awk addresses.csv
|
||
NF =  7
|
||
$1 = <Robbins>
|
||
$2 = <Arnold>
|
||
$3 = <"1234 A Pretty Street, NE">
|
||
$4 = <MyTown>
|
||
$5 = <MyState>
|
||
$6 = <12345-6789>
|
||
$7 = <USA>
|
||
|
||
Note the embedded comma in the value of '$3'.
|
||
|
||
A straightforward improvement when processing CSV data of this sort
|
||
would be to remove the quotes when they occur, with something like this:
|
||
|
||
if (substr($i, 1, 1) == "\"") {
    len = length($i)
    $i = substr($i, 2, len - 2)    # Get text within the two quotes
}
|
||
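Putting the 'FPAT' assignment and the quote removal together gives
something like the following sketch. It requires 'gawk', uses a
shortened version of the sample record, and works on a copy of each
field (the variable 'f' is an assumption of this example) so that '$0'
is left as read:

```shell
csv_out=$(printf 'Robbins,Arnold,"1234 A Pretty Street, NE",MyTown\n' |
    gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" }
    {
        for (i = 1; i <= NF; i++) {
            f = $i
            if (substr(f, 1, 1) == "\"")
                f = substr(f, 2, length(f) - 2)   # text within the two quotes
            print i ": " f
        }
    }')
printf '%s\n' "$csv_out"
```

The third line of output is '3: 1234 A Pretty Street, NE', with the
embedded comma kept and the surrounding quotes removed.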
|
||
As with 'FS', the 'IGNORECASE' variable (*note User-modified::)
|
||
affects field splitting with 'FPAT'.
|
||
|
||
Assigning a value to 'FPAT' overrides field splitting with 'FS' and
|
||
with 'FIELDWIDTHS'. Similar to 'FIELDWIDTHS', the value of
|
||
'PROCINFO["FS"]' will be '"FPAT"' if content-based field splitting is
|
||
being used.
|
||
|
||
NOTE: Some programs export CSV data that contains embedded newlines
|
||
between the double quotes. 'gawk' provides no way to deal with
|
||
this. Even though a formal specification for CSV data exists,
|
||
there isn't much more to be done; the 'FPAT' mechanism provides an
|
||
elegant solution for the majority of cases, and the 'gawk'
|
||
developers are satisfied with that.
|
||
|
||
As written, the regexp used for 'FPAT' requires that each field
|
||
contain at least one character. A straightforward modification
|
||
(changing the first '+' to '*') allows fields to be empty:
|
||
|
||
FPAT = "([^,]*)|(\"[^\"]+\")"
|
||
|
||
Finally, the 'patsplit()' function makes the same functionality
|
||
available for splitting regular strings (*note String Functions::).
|
||
|
||
To recap, 'gawk' provides three independent methods to split input
|
||
records into fields. The mechanism used is based on which of the three
|
||
variables--'FS', 'FIELDWIDTHS', or 'FPAT'--was last assigned to.
|
||
|
||
---------- Footnotes ----------
|
||
|
||
(1) The CSV format lacked a formal standard definition for many
|
||
years. RFC 4180 (http://www.ietf.org/rfc/rfc4180.txt) standardizes the
|
||
most common practices.
|
||
|
||
|
||
File: gawk.info, Node: Multiple Line, Next: Getline, Prev: Splitting By Content, Up: Reading Files
|
||
|
||
4.8 Multiple-Line Records
|
||
=========================
|
||
|
||
In some databases, a single line cannot conveniently hold all the
|
||
information in one entry. In such cases, you can use multiline records.
|
||
The first step in doing this is to choose your data format.
|
||
|
||
One technique is to use an unusual character or string to separate
|
||
records. For example, you could use the formfeed character (written
|
||
'\f' in 'awk', as in C) to separate them, making each record a page of
|
||
the file. To do this, just set the variable 'RS' to '"\f"' (a string
|
||
containing the formfeed character). Any other character could equally
|
||
well be used, as long as it won't be part of the data in a record.
|
||
|
||
Another technique is to have blank lines separate records. By a
|
||
special dispensation, an empty string as the value of 'RS' indicates
|
||
that records are separated by one or more blank lines. When 'RS' is set
|
||
to the empty string, each record always ends at the first blank line
|
||
encountered. The next record doesn't start until the first nonblank
|
||
line that follows. No matter how many blank lines appear in a row, they
|
||
all act as one record separator. (Blank lines must be completely empty;
|
||
lines that contain only whitespace do not count.)
|
||
|
||
You can achieve the same effect as 'RS = ""' by assigning the string
|
||
'"\n\n+"' to 'RS'. This regexp matches the newline at the end of the
|
||
record and one or more blank lines after the record. In addition, a
|
||
regular expression always matches the longest possible sequence when
|
||
there is a choice (*note Leftmost Longest::). So, the next record
|
||
doesn't start until the first nonblank line that follows--no matter how
|
||
many blank lines appear in a row, they are considered one record
|
||
separator.
|
||
|
||
However, there is an important difference between 'RS = ""' and 'RS =
|
||
"\n\n+"'. In the first case, leading newlines in the input data file
|
||
are ignored, and if a file ends without extra blank lines after the last
|
||
record, the final newline is removed from the record. In the second
|
||
case, this special processing is not done. (d.c.)
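The special handling done for 'RS = ""' can be seen with a few lines of
made-up input that begins with blank lines and separates its two
records with several blank lines. This is a sketch only; each output
line shows 'NR', 'NF', and the length of the record:

```shell
para_out=$(printf '\n\nA\nB\n\n\n\nC\n' |
    awk 'BEGIN { RS = "" } { print NR, NF, length($0) }')
printf '%s\n' "$para_out"
# prints:
# 1 2 3
# 2 1 1
```

The leading blank lines produce no empty record, the run of blank lines
acts as a single separator, and the last record has length 1 because
its final newline is removed.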
|
||
|
||
Now that the input is separated into records, the second step is to
|
||
separate the fields in the records. One way to do this is to divide
|
||
each of the lines into fields in the normal manner. This happens by
|
||
default as the result of a special feature. When 'RS' is set to the
|
||
empty string _and_ 'FS' is set to a single character, the newline
|
||
character _always_ acts as a field separator. This is in addition to
|
||
whatever field separations result from 'FS'.(1)
|
||
|
||
The original motivation for this special exception was probably to
|
||
provide useful behavior in the default case (i.e., 'FS' is equal to
|
||
'" "'). This feature can be a problem if you really don't want the
|
||
newline character to separate fields, because there is no way to prevent
|
||
it. However, you can work around this by using the 'split()' function
|
||
to break up the record manually (*note String Functions::). If you have
|
||
a single-character field separator, you can work around the special
|
||
feature in a different way, by making 'FS' into a regexp for that single
|
||
character. For example, if the field separator is a percent character,
|
||
instead of 'FS = "%"', use 'FS = "[%]"'.
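The difference between the two forms can be observed by counting
fields in a two-line record. This sketch describes 'gawk' as documented
here; other 'awk' implementations may treat the newline as a separator
in both cases:

```shell
# In the printf format, %% produces a literal percent character.
single_nf=$(printf 'a%%b\nc%%d\n' | gawk 'BEGIN { RS = ""; FS = "%" }   { print NF }')
regex_nf=$(printf 'a%%b\nc%%d\n'  | gawk 'BEGIN { RS = ""; FS = "[%]" } { print NF }')
printf '%s %s\n' "$single_nf" "$regex_nf"    # prints "4 3"
```

With 'FS = "%"' the newline also separates fields, giving four. With
'FS = "[%]"' the newline stays inside the second field, giving three.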
|
||
|
||
Another way to separate fields is to put each field on a separate
|
||
line: to do this, just set the variable 'FS' to the string '"\n"'.
|
||
(This single-character separator matches a single newline.) A practical
|
||
example of a data file organized this way might be a mailing list, where
|
||
blank lines separate the entries. Consider a mailing list in a file
|
||
named 'addresses', which looks like this:
|
||
|
||
Jane Doe
|
||
123 Main Street
|
||
Anywhere, SE 12345-6789
|
||
|
||
John Smith
|
||
456 Tree-lined Avenue
|
||
Smallville, MW 98765-4321
|
||
...
|
||
|
||
A simple program to process this file is as follows:
|
||
|
||
# addrs.awk --- simple mailing list program

# Records are separated by blank lines.
# Each line is one field.
BEGIN { RS = "" ; FS = "\n" }

{
    print "Name is:", $1
    print "Address is:", $2
    print "City and State are:", $3
    print ""
}
|
||
|
||
Running the program produces the following output:
|
||
|
||
$ awk -f addrs.awk addresses
|
||
-| Name is: Jane Doe
|
||
-| Address is: 123 Main Street
|
||
-| City and State are: Anywhere, SE 12345-6789
|
||
-|
|
||
-| Name is: John Smith
|
||
-| Address is: 456 Tree-lined Avenue
|
||
-| City and State are: Smallville, MW 98765-4321
|
||
-|
|
||
...
|
||
|
||
*Note Labels Program::, for a more realistic program dealing with
|
||
address lists. The following list summarizes how records are split,
|
||
based on the value of 'RS'. ('==' means "is equal to.")
|
||
|
||
'RS == "\n"'
|
||
Records are separated by the newline character ('\n'). In effect,
|
||
every line in the data file is a separate record, including blank
|
||
lines. This is the default.
|
||
|
||
'RS == ANY SINGLE CHARACTER'
|
||
Records are separated by each occurrence of the character.
|
||
Multiple successive occurrences delimit empty records.
|
||
|
||
'RS == ""'
|
||
Records are separated by runs of blank lines. When 'FS' is a
|
||
single character, then the newline character always serves as a
|
||
field separator, in addition to whatever value 'FS' may have.
|
||
Leading and trailing newlines in a file are ignored.
|
||
|
||
'RS == REGEXP'
|
||
Records are separated by occurrences of characters that match
|
||
REGEXP. Leading and trailing matches of REGEXP delimit empty
|
||
records. (This is a 'gawk' extension; it is not specified by the
|
||
POSIX standard.)
|
||
|
||
If not in compatibility mode (*note Options::), 'gawk' sets 'RT' to
|
||
the input text that matched the value specified by 'RS'. But if the
|
||
input file ended without any text that matches 'RS', then 'gawk' sets
|
||
'RT' to the null string.
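The following sketch shows 'RT' next to each record when 'RS' is a
regexp. It requires 'gawk'; the input is an invented string with no
trailing newline, so the last record ends at end of file rather than at
a match of 'RS':

```shell
rt_out=$(printf 'a1b22c' | gawk 'BEGIN { RS = "[0-9]+" } { print $0 " [" RT "]" }')
printf '%s\n' "$rt_out"
# prints:
# a [1]
# b [22]
# c []
```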
|
||
|
||
---------- Footnotes ----------
|
||
|
||
(1) When 'FS' is the null string ('""') or a regexp, this special
|
||
feature of 'RS' does not apply. It does apply to the default field
|
||
separator of a single space: 'FS = " "'.
|
||
|
||
|
||
File: gawk.info, Node: Getline, Next: Read Timeout, Prev: Multiple Line, Up: Reading Files
|
||
|
||
4.9 Explicit Input with 'getline'
|
||
=================================
|
||
|
||
So far we have been getting our input data from 'awk''s main input
|
||
stream--either the standard input (usually your keyboard, sometimes the
|
||
output from another program) or the files specified on the command line.
|
||
The 'awk' language has a special built-in command called 'getline' that
|
||
can be used to read input under your explicit control.
|
||
|
||
The 'getline' command is used in several different ways and should
|
||
_not_ be used by beginners. The examples that follow the explanation of
|
||
the 'getline' command include material that has not been covered yet.
|
||
Therefore, come back and study the 'getline' command _after_ you have
|
||
reviewed the rest of this Info file and have a good knowledge of how
|
||
'awk' works.
|
||
|
||
The 'getline' command returns 1 if it finds a record and 0 if it
|
||
encounters the end of the file. If there is some error in getting a
|
||
record, such as a file that cannot be opened, then 'getline' returns -1.
|
||
In this case, 'gawk' sets the variable 'ERRNO' to a string describing
|
||
the error that occurred.
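The three return values can be observed with a short sketch. The
temporary file and the path '/nonexistent/gawk-demo-input' are
assumptions of this example; the latter is simply a name that should
not exist on your system:

```shell
tmp=$(mktemp)
printf 'one\ntwo\n' > "$tmp"
rc_out=$(awk -v f="$tmp" 'BEGIN {
    while ((rc = (getline line < f)) > 0)
        n++
    print n, rc                    # 2 records were read, then 0 at end of file
    close(f)
    rc = (getline line < "/nonexistent/gawk-demo-input")
    print rc                       # -1 because the file cannot be opened
}')
rm -f "$tmp"
printf '%s\n' "$rc_out"
```

Testing the result with '> 0', as the loop does, matters: a loop
written as 'while (getline line < f)' never ends if the file cannot be
opened, because -1 counts as true.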
|
||
|
||
In the following examples, COMMAND stands for a string value that
|
||
represents a shell command.
|
||
|
||
NOTE: When '--sandbox' is specified (*note Options::), reading
|
||
lines from files, pipes, and coprocesses is disabled.
|
||
|
||
* Menu:
|
||
|
||
* Plain Getline:: Using 'getline' with no arguments.
|
||
* Getline/Variable:: Using 'getline' into a variable.
|
||
* Getline/File:: Using 'getline' from a file.
|
||
* Getline/Variable/File:: Using 'getline' into a variable from a
|
||
file.
|
||
* Getline/Pipe:: Using 'getline' from a pipe.
|
||
* Getline/Variable/Pipe:: Using 'getline' into a variable from a
|
||
pipe.
|
||
* Getline/Coprocess:: Using 'getline' from a coprocess.
|
||
* Getline/Variable/Coprocess:: Using 'getline' into a variable from a
|
||
coprocess.
|
||
* Getline Notes:: Important things to know about 'getline'.
|
||
* Getline Summary:: Summary of 'getline' Variants.
|
||
|
||
|
||
File: gawk.info, Node: Plain Getline, Next: Getline/Variable, Up: Getline
|
||
|
||
4.9.1 Using 'getline' with No Arguments
|
||
---------------------------------------
|
||
|
||
The 'getline' command can be used without arguments to read input from
|
||
the current input file. All it does in this case is read the next input
|
||
record and split it up into fields. This is useful if you've finished
|
||
processing the current record, but want to do some special processing on
|
||
the next record _right now_. For example:
|
||
|
||
# Remove text between /* and */, inclusive
{
    if ((i = index($0, "/*")) != 0) {
        out = substr($0, 1, i - 1)  # leading part of the string
        rest = substr($0, i + 2)    # ... */ ...
        j = index(rest, "*/")       # is */ in trailing part?
        if (j > 0) {
            rest = substr(rest, j + 2)  # remove comment
        } else {
            while (j == 0) {
                # get more text
                if (getline <= 0) {
                    print("unexpected EOF or error:", ERRNO) > "/dev/stderr"
                    exit
                }
                # build up the line using string concatenation
                rest = rest $0
                j = index(rest, "*/")   # is */ in trailing part?
                if (j != 0) {
                    rest = substr(rest, j + 2)
                    break
                }
            }
        }
        # build up the output line using string concatenation
        $0 = out rest
    }
    print $0
}
|
||
|
||
This 'awk' program deletes C-style comments ('/* ... */') from the
|
||
input. It uses a number of features we haven't covered yet, including
|
||
string concatenation (*note Concatenation::) and the 'index()' and
|
||
'substr()' built-in functions (*note String Functions::). By replacing
|
||
the 'print $0' with other statements, you could perform more complicated
|
||
processing on the decommented input, such as searching for matches of a
|
||
regular expression. (This program has a subtle problem--it does not
|
||
work if one comment ends and another begins on the same line.)
|
||
|
||
This form of the 'getline' command sets 'NF', 'NR', 'FNR', 'RT', and
|
||
the value of '$0'.
|
||
|
||
NOTE: The new value of '$0' is used to test the patterns of any
|
||
subsequent rules. The original value of '$0' that triggered the
|
||
rule that executed 'getline' is lost. By contrast, the 'next'
|
||
statement reads a new record but immediately begins processing it
|
||
normally, starting with the first rule in the program. *Note Next
|
||
Statement::.
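A two-rule sketch makes the effect on later rules visible. The
three-line input is made up for the example:

```shell
plain_out=$(printf 'a\nb\nc\n' | awk '
NR == 1 { getline; print NR ": " $0 }    # consumes record 2; $0 and NR are updated
/b/     { print "later rule saw " $0 }   # tested against the new $0
')
printf '%s\n' "$plain_out"
# prints:
# 2: b
# later rule saw b
```

The second rule fires while the first input line is still being
processed, because by then '$0' holds the record that 'getline' read.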
|
||
|
||
|
||
File: gawk.info, Node: Getline/Variable, Next: Getline/File, Prev: Plain Getline, Up: Getline
|
||
|
||
4.9.2 Using 'getline' into a Variable
|
||
-------------------------------------
|
||
|
||
You can use 'getline VAR' to read the next record from 'awk''s input
|
||
into the variable VAR. No other processing is done. For example,
|
||
suppose the next line is a comment or a special string, and you want to
|
||
read it without triggering any rules. This form of 'getline' allows you
|
||
to read that line and store it in a variable so that the main
|
||
read-a-line-and-check-each-rule loop of 'awk' never sees it. The
|
||
following example swaps every two lines of input:
|
||
|
||
{
    if ((getline tmp) > 0) {
        print tmp
        print $0
    } else
        print $0
}
|
||
|
||
It takes the following list:
|
||
|
||
wan
|
||
tew
|
||
free
|
||
phore
|
||
|
||
and produces these results:
|
||
|
||
tew
|
||
wan
|
||
phore
|
||
free
|
||
|
||
The 'getline' command used in this way sets only the variables 'NR',
|
||
'FNR', and 'RT' (and, of course, VAR). The record is not split into
|
||
fields, so the values of the fields (including '$0') and the value of
|
||
'NF' do not change.
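As a quick check of which variables change, the following sketch reads
the second input line into a variable while the first is being
processed. The variable name 'nxt' and the input are assumptions of the
example:

```shell
var_out=$(printf 'a b c\nx\n' |
    awk 'NR == 1 { getline nxt; print NR, NF, $0, "nxt=" nxt }')
printf '%s\n' "$var_out"    # prints "2 3 a b c nxt=x"
```

'NR' has advanced to 2, but 'NF' and '$0' still describe the first
record.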
|
||
|
||
|
||
File: gawk.info, Node: Getline/File, Next: Getline/Variable/File, Prev: Getline/Variable, Up: Getline
|
||
|
||
4.9.3 Using 'getline' from a File
|
||
---------------------------------
|
||
|
||
Use 'getline < FILE' to read the next record from FILE. Here, FILE is a
|
||
string-valued expression that specifies the file name. '< FILE' is
|
||
called a "redirection" because it directs input to come from a different
|
||
place. For example, the following program reads its input record from
|
||
the file 'secondary.input' when it encounters a first field with a value
|
||
equal to 10 in the current input file:
|
||
|
||
{
    if ($1 == 10) {
        getline < "secondary.input"
        print
    } else
        print
}
|
||
|
||
Because the main input stream is not used, the values of 'NR' and
|
||
'FNR' are not changed. However, the record it reads is split into
|
||
fields in the normal manner, so the values of '$0' and the other fields
|
||
are changed, resulting in a new value of 'NF'. 'RT' is also set.
|
||
|
||
According to POSIX, 'getline < EXPRESSION' is ambiguous if EXPRESSION
|
||
contains unparenthesized operators other than '$'; for example, 'getline
|
||
< dir "/" file' is ambiguous because the concatenation operator (not
|
||
discussed yet; *note Concatenation::) is not parenthesized. You should
|
||
write it as 'getline < (dir "/" file)' if you want your program to be
|
||
portable to all 'awk' implementations.
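The parenthesized form looks like this in practice. This is a sketch;
the temporary directory and the file name 'data.txt' are assumptions of
the example:

```shell
dir=$(mktemp -d)
printf 'hello from file\n' > "$dir/data.txt"
file_out=$(awk -v dir="$dir" -v file="data.txt" 'BEGIN {
    # The parentheses make the concatenation the operand of "<".
    if ((getline < (dir "/" file)) > 0)
        print $0
    close(dir "/" file)
}')
rm -rf "$dir"
printf '%s\n' "$file_out"    # prints "hello from file"
```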
|
||
|
||
|
||
File: gawk.info, Node: Getline/Variable/File, Next: Getline/Pipe, Prev: Getline/File, Up: Getline
|
||
|
||
4.9.4 Using 'getline' into a Variable from a File
|
||
-------------------------------------------------
|
||
|
||
Use 'getline VAR < FILE' to read input from the file FILE, and put it in
|
||
the variable VAR. As earlier, FILE is a string-valued expression that
|
||
specifies the file from which to read.
|
||
|
||
In this version of 'getline', none of the predefined variables are
|
||
changed and the record is not split into fields. The only variable
|
||
changed is VAR.(1) For example, the following program copies all the
|
||
input files to the output, except for records that say
|
||
'@include FILENAME'. Such a record is replaced by the contents of the
|
||
file FILENAME:
|
||
|
||
{
    if (NF == 2 && $1 == "@include") {
        while ((getline line < $2) > 0)
            print line
        close($2)
    } else
        print
}
|
||
|
||
Note here how the name of the extra input file is not built into the
|
||
program; it is taken directly from the data, specifically from the
|
||
second field on the '@include' line.
|
||
|
||
The 'close()' function is called to ensure that if two identical
|
||
'@include' lines appear in the input, the entire specified file is
|
||
included twice. *Note Close Files And Pipes::.
|
||
|
||
One deficiency of this program is that it does not process nested
|
||
'@include' statements (i.e., '@include' statements in included files)
|
||
the way a true macro preprocessor would. *Note Igawk Program::, for a
|
||
program that does handle nested '@include' statements.
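One possible way to handle nesting is a recursive user-defined
function. The following is only a sketch, not the approach taken by the
program referenced above; the function name 'include' and the three
sample files are assumptions, and there is no protection against files
that include each other in a cycle:

```shell
work=$(mktemp -d)
printf 'top\n@include mid.txt\nbottom\n' > "$work/main.txt"
printf 'mid-start\n@include leaf.txt\nmid-end\n' > "$work/mid.txt"
printf 'leaf\n' > "$work/leaf.txt"

expanded=$(cd "$work" && awk '
function include(file,    line, parts) {
    while ((getline line < file) > 0) {
        if (split(line, parts, " ") == 2 && parts[1] == "@include")
            include(parts[2])      # recurse into the nested file
        else
            print line
    }
    close(file)                    # allow the same file to be included again
}
{
    if (NF == 2 && $1 == "@include")
        include($2)
    else
        print
}' main.txt)
rm -rf "$work"
printf '%s\n' "$expanded"
```

The output is 'top', 'mid-start', 'leaf', 'mid-end', and 'bottom', one
per line. Reading with 'getline line < file' leaves '$0' alone, which
is why the function uses 'split()' to examine each included line.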
|
||
|
||
---------- Footnotes ----------
|
||
|
||
(1) This is not quite true. 'RT' could be changed if 'RS' is a
|
||
regular expression.
|
||
|
||
|
||
File: gawk.info, Node: Getline/Pipe, Next: Getline/Variable/Pipe, Prev: Getline/Variable/File, Up: Getline
|
||
|
||
4.9.5 Using 'getline' from a Pipe
|
||
---------------------------------
|
||
|
||
Omniscience has much to recommend it. Failing that, attention to
|
||
details would be useful.
|
||
-- _Brian Kernighan_
|
||
|
||
The output of a command can also be piped into 'getline', using
|
||
'COMMAND | getline'. In this case, the string COMMAND is run as a shell
|
||
command and its output is piped into 'awk' to be used as input. This
|
||
form of 'getline' reads one record at a time from the pipe. For
|
||
example, the following program copies its input to its output, except
|
||
for lines that begin with '@execute', which are replaced by the output
|
||
produced by running the rest of the line as a shell command:
|
||
|
||
{
    if ($1 == "@execute") {
        tmp = substr($0, 10)        # Remove "@execute"
        while ((tmp | getline) > 0)
            print
        close(tmp)
    } else
        print
}
|
||
|
||
The 'close()' function is called to ensure that if two identical
|
||
'@execute' lines appear in the input, the command is run for each one.
|
||
*Note Close Files And Pipes::. Given the input:
|
||
|
||
foo
|
||
bar
|
||
baz
|
||
@execute who
|
||
bletch
|
||
|
||
the program might produce:
|
||
|
||
foo
|
||
bar
|
||
baz
|
||
arnold ttyv0 Jul 13 14:22
|
||
miriam ttyp0 Jul 13 14:23 (murphy:0)
|
||
bill ttyp1 Jul 13 14:23 (murphy:0)
|
||
bletch
|
||
|
||
Notice that this program ran the command 'who' and printed the result.
|
||
(If you try this program yourself, you will of course get different
|
||
results, depending upon who is logged in on your system.)
|
||
|
||
This variation of 'getline' splits the record into fields, sets the
|
||
value of 'NF', and recomputes the value of '$0'. The values of 'NR' and
|
||
'FNR' are not changed. 'RT' is set.
|
||
|
||
According to POSIX, 'EXPRESSION | getline' is ambiguous if EXPRESSION
|
||
contains unparenthesized operators other than '$'--for example, '"echo "
|
||
"date" | getline' is ambiguous because the concatenation operator is not
|
||
parenthesized. You should write it as '("echo " "date") | getline' if
|
||
you want your program to be portable to all 'awk' implementations.
|
||
|
||
NOTE: Unfortunately, 'gawk' has not been consistent in its
|
||
treatment of a construct like '"echo " "date" | getline'. Most
|
||
versions, including the current version, treat it as '("echo "
|
||
"date") | getline'. (This is also how BWK 'awk' behaves.) Some
|
||
versions instead treat it as '"echo " ("date" | getline)'. (This
|
||
is how 'mawk' behaves.) In short, _always_ use explicit
|
||
parentheses, and then you won't have to worry.
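A short sketch of the parenthesized form follows. The command is a
harmless 'echo', and the variable names 'prog' and 'cmd' are
assumptions of the example:

```shell
pipe_out=$(awk 'BEGIN {
    prog = "echo"
    cmd = prog " alpha beta"
    # The parentheses make the whole concatenation the command.
    if (((prog " alpha beta") | getline) > 0)
        print NF, $2, NR           # $0 and NF are set; NR is untouched
    close(cmd)
}')
printf '%s\n' "$pipe_out"    # prints "2 beta 0"
```

Keeping the command string in a variable such as 'cmd' also makes sure
that 'close()' receives exactly the string that opened the pipe.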
|
||
|
||
|
||
File: gawk.info, Node: Getline/Variable/Pipe, Next: Getline/Coprocess, Prev: Getline/Pipe, Up: Getline
|
||
|
||
4.9.6 Using 'getline' into a Variable from a Pipe
|
||
-------------------------------------------------
|
||
|
||
When you use 'COMMAND | getline VAR', the output of COMMAND is sent
|
||
through a pipe to 'getline' and into the variable VAR. For example, the
|
||
following program reads the current date and time into the variable
|
||
'current_time', using the 'date' utility, and then prints it:
|
||
|
||
BEGIN {
    "date" | getline current_time
    close("date")
    print "Report printed on " current_time
}
|
||
|
||
In this version of 'getline', none of the predefined variables are
|
||
changed and the record is not split into fields. However, 'RT' is set.
|
||
|
||
According to POSIX, 'EXPRESSION | getline VAR' is ambiguous if
|
||
EXPRESSION contains unparenthesized operators other than '$'; for
|
||
example, '"echo " "date" | getline VAR' is ambiguous because the
|
||
concatenation operator is not parenthesized. You should write it as '("echo "
|
||
"date") | getline VAR' if you want your program to be portable to other
|
||
'awk' implementations.
|
||
|
||
|
||
File: gawk.info, Node: Getline/Coprocess, Next: Getline/Variable/Coprocess, Prev: Getline/Variable/Pipe, Up: Getline
|
||
|
||
4.9.7 Using 'getline' from a Coprocess
|
||
--------------------------------------
|
||
|
||
Reading input into 'getline' from a pipe is a one-way operation. The
|
||
command that is started with 'COMMAND | getline' only sends data _to_
|
||
your 'awk' program.
|
||
|
||
On occasion, you might want to send data to another program for
|
||
processing and then read the results back. 'gawk' allows you to start a
|
||
"coprocess", with which two-way communications are possible. This is
|
||
done with the '|&' operator. Typically, you write data to the coprocess
|
||
first and then read the results back, as shown in the following:
|
||
|
||
print "SOME QUERY" |& "db_server"
|
||
"db_server" |& getline
|
||
|
||
which sends a query to 'db_server' and then reads the results.
|
||
|
||
The values of 'NR' and 'FNR' are not changed, because the main input
|
||
stream is not used. However, the record is split into fields in the
|
||
normal manner, thus changing the values of '$0', of the other fields,
|
||
and of 'NF' and 'RT'.
|
||
|
||
Coprocesses are an advanced feature. They are discussed here only
|
||
because this is the minor node on 'getline'. *Note Two-way I/O::, where
|
||
coprocesses are discussed in more detail.
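As a runnable stand-in for the 'db_server' fragment, the sketch below
uses the 'sort' utility as the coprocess. It requires 'gawk'; the words
being sorted are arbitrary:

```shell
co_out=$(gawk 'BEGIN {
    cmd = "sort"
    print "pear"  |& cmd
    print "apple" |& cmd
    close(cmd, "to")                  # end of input for sort; it can now emit output
    while ((cmd |& getline) > 0)      # each result line lands in $0
        print NR, $0
    close(cmd)
}')
printf '%s\n' "$co_out"
# prints:
# 0 apple
# 0 pear
```

Closing only the "to" end is needed here because 'sort' produces no
output until it has seen all of its input. 'NR' stays at 0 because the
main input stream is never read.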
|
||
|
||
|
||
File: gawk.info, Node: Getline/Variable/Coprocess, Next: Getline Notes, Prev: Getline/Coprocess, Up: Getline
|
||
|
||
4.9.8 Using 'getline' into a Variable from a Coprocess
|
||
------------------------------------------------------
|
||
|
||
When you use 'COMMAND |& getline VAR', the output from the coprocess
|
||
COMMAND is sent through a two-way pipe to 'getline' and into the
|
||
variable VAR.
|
||
|
||
In this version of 'getline', none of the predefined variables are
|
||
changed and the record is not split into fields. The only variable
|
||
changed is VAR. However, 'RT' is set.
|
||
|
||
Coprocesses are an advanced feature. They are discussed here only
|
||
because this is the minor node on 'getline'. *Note Two-way I/O::, where
|
||
coprocesses are discussed in more detail.
|
||
|
||
|
||
File: gawk.info, Node: Getline Notes, Next: Getline Summary, Prev: Getline/Variable/Coprocess, Up: Getline
|
||
|
||
4.9.9 Points to Remember About 'getline'
|
||
----------------------------------------
|
||
|
||
Here are some miscellaneous points about 'getline' that you should bear
|
||
in mind:
|
||
|
||
* When 'getline' changes the value of '$0' and 'NF', 'awk' does _not_
|
||
automatically jump to the start of the program and start testing
|
||
the new record against every pattern. However, the new record is
|
||
tested against any subsequent rules.
|
||
|
||
* Some very old 'awk' implementations limit the number of pipelines
|
||
that an 'awk' program may have open to just one. In 'gawk', there
|
||
is no such limit. You can open as many pipelines (and coprocesses)
|
||
as the underlying operating system permits.
|
||
|
||
* An interesting side effect occurs if you use 'getline' without a
|
||
redirection inside a 'BEGIN' rule. Because an unredirected
|
||
'getline' reads from the command-line data files, the first
|
||
'getline' command causes 'awk' to set the value of 'FILENAME'.
|
||
Normally, 'FILENAME' does not have a value inside 'BEGIN' rules,
|
||
because you have not yet started to process the command-line data
|
||
files. (d.c.) (See *note BEGIN/END::; also *note Auto-set::.)
|
||
|
||
* Using 'FILENAME' with 'getline' ('getline < FILENAME') is likely to
|
||
be a source of confusion. 'awk' opens a separate input stream from
|
||
the current input file. However, by not using a variable, '$0' and
|
||
'NF' are still updated. If you're doing this, it's probably by
|
||
accident, and you should reconsider what it is you're trying to
|
||
accomplish.
|
||
|
||
* *note Getline Summary::, presents a table summarizing the 'getline'
|
||
variants and which variables they can affect. It is worth noting
|
||
that those variants that do not use redirection can cause
|
||
'FILENAME' to be updated if they cause 'awk' to start reading a new
|
||
input file.
|
||
|
||
* If the variable being assigned is an expression with side effects,
|
||
different versions of 'awk' behave differently upon encountering
|
||
end-of-file. Some versions don't evaluate the expression; many
|
||
versions (including 'gawk') do. Here is an example, courtesy of
|
||
Duncan Moore:
|
||
|
||
     BEGIN {
         system("echo 1 > f")
         while ((getline a[++c] < "f") > 0) { }
         print c
     }
|
||
|
||
Here, the side effect is the '++c'. Is 'c' incremented if
|
||
end-of-file is encountered before the element in 'a' is assigned?
|
||
|
||
'gawk' treats 'getline' like a function call, and evaluates the
|
||
expression 'a[++c]' before attempting to read from 'f'. However,
|
||
some versions of 'awk' only evaluate the expression once they know
|
||
that there is a string value to be assigned.
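
One way to sidestep the difference described in the last item is to
keep the side effect out of the 'getline' target.  The following is a
minimal sketch under that assumption: it reads into a plain variable
and increments the counter only after a successful read, so 'c' ends up
as the number of records actually read.  The temporary file stands in
for the file 'f' of the original example, and the shell variable names
are hypothetical.

```shell
# Sketch: read into a plain variable, then assign with the side effect.
datafile=$(mktemp)
echo 1 > "$datafile"
count=$(awk -v f="$datafile" 'BEGIN {
    while ((getline line < f) > 0)   # no side effect in the getline target
        a[++c] = line                # increment only after a successful read
    print c
}')
rm -f "$datafile"
echo "$count"
```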
|
||
|
||
|
||
File: gawk.info, Node: Getline Summary, Prev: Getline Notes, Up: Getline
|
||
|
||
4.9.10 Summary of 'getline' Variants
|
||
------------------------------------
|
||
|
||
*note Table 4.1: table-getline-variants. summarizes the eight variants
|
||
of 'getline', listing which predefined variables are set by each one,
|
||
and whether the variant is standard or a 'gawk' extension. Note: for
|
||
each variant, 'gawk' sets the 'RT' predefined variable.
|
||
|
||
     Variant                   Effect                      'awk' / 'gawk'
     -------------------------------------------------------------------------
     'getline'                 Sets '$0', 'NF', 'FNR',     'awk'
                               'NR', and 'RT'
     'getline' VAR             Sets VAR, 'FNR', 'NR',      'awk'
                               and 'RT'
     'getline <' FILE          Sets '$0', 'NF', and 'RT'   'awk'
     'getline VAR < FILE'      Sets VAR and 'RT'           'awk'
     COMMAND '| getline'       Sets '$0', 'NF', and 'RT'   'awk'
     COMMAND '| getline' VAR   Sets VAR and 'RT'           'awk'
     COMMAND '|& getline'      Sets '$0', 'NF', and 'RT'   'gawk'
     COMMAND '|& getline' VAR  Sets VAR and 'RT'           'gawk'
|
||
|
||
Table 4.1: 'getline' variants and what they set
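
Two rows of the table can be checked with a short experiment.  The
sketch below contrasts 'getline VAR < FILE', which sets only VAR, with
plain 'getline', which replaces '$0' and advances 'NR'.  The input
lines and the temporary side file are made up for this illustration.

```shell
# Sketch: compare what two getline variants change.
sidefile=$(mktemp)
printf 'x\ny\n' > "$sidefile"
variants_out=$(printf 'a\nb\nc\n' | awk -v f="$sidefile" 'NR == 1 {
    getline var < f                  # sets var only; $0 and NR stay as they were
    print "file read: NR=" NR, "$0=" $0, "var=" var
    getline                          # sets $0, NF, FNR, and NR
    print "plain getline: NR=" NR, "$0=" $0
}')
rm -f "$sidefile"
echo "$variants_out"
```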
|
||
|
||
|
||
File: gawk.info, Node: Read Timeout, Next: Command-line directories, Prev: Getline, Up: Reading Files
|
||
|
||
4.10 Reading Input with a Timeout
|
||
=================================
|
||
|
||
This minor node describes a feature that is specific to 'gawk'.
|
||
|
||
You may specify a timeout in milliseconds for reading input from the
|
||
keyboard, a pipe, or two-way communication, including TCP/IP sockets.
|
||
This can be done on a per-input, per-command, or per-connection basis,
|
||
by setting a special element in the 'PROCINFO' array (*note Auto-set::):
|
||
|
||
PROCINFO["input_name", "READ_TIMEOUT"] = TIMEOUT IN MILLISECONDS
|
||
|
||
When set, this causes 'gawk' to time out and return failure if no
|
||
data is available to read within the specified timeout period. For
|
||
example, a TCP client can decide to give up on receiving any response
|
||
from the server after a certain amount of time:
|
||
|
||
     Service = "/inet/tcp/0/localhost/daytime"
     PROCINFO[Service, "READ_TIMEOUT"] = 100
     if ((Service |& getline) > 0)
         print $0
     else if (ERRNO != "")
         print ERRNO
|
||
|
||
Here is how to read interactively from the user(1) without waiting
|
||
for more than five seconds:
|
||
|
||
     PROCINFO["/dev/stdin", "READ_TIMEOUT"] = 5000
     while ((getline < "/dev/stdin") > 0)
         print $0
|
||
|
||
'gawk' terminates the read operation if input does not arrive after
|
||
waiting for the timeout period, returns failure, and sets 'ERRNO' to an
|
||
appropriate string value. A negative or zero value for the timeout is
|
||
the same as specifying no timeout at all.
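
The same mechanism can be tried without a network service.  The
following sketch assumes that a timeout on a command pipe behaves as
described above: the command deliberately takes longer than the
timeout, so the read is expected to fail and the 'else' branch runs.
The slow command and the variable names are hypothetical, and the
sketch is skipped when 'gawk' is not installed.

```shell
# Sketch only: requires gawk, because READ_TIMEOUT is a gawk extension.
if command -v gawk >/dev/null 2>&1; then
    timeout_out=$(gawk 'BEGIN {
        cmd = "sleep 1; echo late"           # hypothetical slow command
        PROCINFO[cmd, "READ_TIMEOUT"] = 100  # give up after 100 milliseconds
        if ((cmd | getline line) > 0)
            print "got: " line
        else
            print "gave up"                  # ERRNO describes the failure here
    }')
else
    timeout_out="gawk not installed"
fi
echo "$timeout_out"
```

Note that 'gawk' may still wait for the command to finish when the
program exits and the pipe is closed.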
|
||
|
||
A timeout can also be set for reading from the keyboard in the
|
||
implicit loop that reads input records and matches them against
|
||
patterns, like so:
|
||
|
||
$ gawk 'BEGIN { PROCINFO["-", "READ_TIMEOUT"] = 5000 }
|
||
> { print "You entered: " $0 }'
|
||
gawk
|
||
-| You entered: gawk
|
||
|
||
In this case, failure to respond within five seconds results in the
|
||
following error message:
|
||
|
||
error-> gawk: cmd. line:2: (FILENAME=- FNR=1) fatal: error reading input file `-': Connection timed out
|
||
|
||
The timeout can be set or changed at any time, and will take effect
|
||
on the next attempt to read from the input device. In the following
|
||
example, we start with a timeout value of one second, and progressively
|
||
reduce it by one-tenth of a second until we wait indefinitely for the
|
||
input to arrive:
|
||
|
||
     PROCINFO[Service, "READ_TIMEOUT"] = 1000
     while ((Service |& getline) > 0) {
         print $0
         PROCINFO[Service, "READ_TIMEOUT"] -= 100
     }
|
||
|
||
NOTE: You should not assume that the read operation will block
|
||
exactly after the tenth record has been printed. It is possible
|
||
that 'gawk' will read and buffer more than one record's worth of
|
||
data the first time.  Because of this, changing the value of the
timeout as in the preceding example is not very useful.
|
||
|
||
If the 'PROCINFO' element is not present and the 'GAWK_READ_TIMEOUT'
|
||
environment variable exists, 'gawk' uses its value to initialize the
|
||
timeout value. The exclusive use of the environment variable to specify
|
||
timeout has the disadvantage of not being able to control it on a
|
||
per-command or per-connection basis.
|
||
|
||
'gawk' considers a timeout event to be an error even though the
|
||
attempt to read from the underlying device may succeed in a later
|
||
attempt. This is a limitation, and it also means that you cannot use
|
||
this to multiplex input from two or more sources.
|
||
|
||
Assigning a timeout value prevents read operations from blocking
|
||
indefinitely. But bear in mind that there are other ways 'gawk' can
|
||
stall waiting for an input device to be ready. A network client can
|
||
sometimes take a long time to establish a connection before it can start
|
||
reading any data, or the attempt to open a FIFO special file for reading
|
||
can block indefinitely until some other process opens it for writing.
|
||
|
||
---------- Footnotes ----------
|
||
|
||
(1) This assumes that standard input is the keyboard.
|
||
|
||
|
||
File: gawk.info, Node: Command-line directories, Next: Input Summary, Prev: Read Timeout, Up: Reading Files
|
||
|
||
4.11 Directories on the Command Line
|
||
====================================
|
||
|
||
According to the POSIX standard, files named on the 'awk' command line
|
||
must be text files; it is a fatal error if they are not. Most versions
|
||
of 'awk' treat a directory on the command line as a fatal error.
|
||
|
||
By default, 'gawk' produces a warning for a directory on the command
|
||
line, but otherwise ignores it. This makes it easier to use shell
|
||
wildcards with your 'awk' program:
|
||
|
||
$ gawk -f whizprog.awk * Directories could kill this program
|
||
|
||
If either of the '--posix' or '--traditional' options is given, then
|
||
'gawk' reverts to treating a directory on the command line as a fatal
|
||
error.
|
||
|
||
*Note Extension Sample Readdir::, for a way to treat directories as
|
||
usable data from an 'awk' program.
|
||
|
||
|
||
File: gawk.info, Node: Input Summary, Next: Input Exercises, Prev: Command-line directories, Up: Reading Files
|
||
|
||
4.12 Summary
|
||
============
|
||
|
||
* Input is split into records based on the value of 'RS'. The
|
||
possibilities are as follows:
|
||
|
||
     Value of 'RS'       Records are split on ...     'awk' / 'gawk'
     ---------------------------------------------------------------------------
     Any single          That character               'awk'
     character
     The empty string    Runs of two or more          'awk'
     ('""')              newlines
     A regexp            Text that matches the        'gawk'
                         regexp
|
||
|
||
* 'FNR' indicates how many records have been read from the current
|
||
input file; 'NR' indicates how many records have been read in
|
||
total.
|
||
|
||
* 'gawk' sets 'RT' to the text matched by 'RS'.
|
||
|
||
* After splitting the input into records, 'awk' further splits the
|
||
records into individual fields, named '$1', '$2', and so on. '$0'
|
||
is the whole record, and 'NF' indicates how many fields there are.
|
||
The default way to split fields is between whitespace characters.
|
||
|
||
* Fields may be referenced using a variable, as in '$NF'. Fields may
|
||
also be assigned values, which causes the value of '$0' to be
|
||
recomputed when it is later referenced. Assigning to a field with
|
||
a number greater than 'NF' creates the field and rebuilds the
|
||
record, using 'OFS' to separate the fields. Incrementing 'NF' does
|
||
the same thing. Decrementing 'NF' throws away fields and rebuilds
|
||
the record.
|
||
|
||
* Field splitting is more complicated than record splitting:
|
||
|
||
     Field separator value         Fields are split ...            'awk' / 'gawk'
     ---------------------------------------------------------------------------
     'FS == " "'                   On runs of whitespace           'awk'
     'FS == ANY SINGLE CHARACTER'  On that character               'awk'
     'FS == REGEXP'                On text matching the regexp     'awk'
     'FS == ""'                    Such that each individual       'gawk'
                                   character is a separate field
     'FIELDWIDTHS == LIST OF       Based on character position     'gawk'
     COLUMNS'
     'FPAT == REGEXP'              On the text surrounding text    'gawk'
                                   matching the regexp
|
||
|
||
* Using 'FS = "\n"' causes the entire record to be a single field
|
||
(assuming that newlines separate records).
|
||
|
||
* 'FS' may be set from the command line using the '-F' option. This
|
||
can also be done using command-line variable assignment.
|
||
|
||
* Use 'PROCINFO["FS"]' to see how fields are being split.
|
||
|
||
* Use 'getline' in its various forms to read additional records from
|
||
the default input stream, from a file, or from a pipe or coprocess.
|
||
|
||
* Use 'PROCINFO[FILE, "READ_TIMEOUT"]' to cause reads to time out for
|
||
FILE.
|
||
|
||
* Directories on the command line are fatal for standard 'awk';
'gawk' warns about them and otherwise ignores them unless '--posix'
or '--traditional' is given.
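
The summary item on assigning to fields can be illustrated with a
short sketch.  It assumes an 'awk' that rebuilds the record when 'NF'
is decremented, as described above; the sample record and the value of
'OFS' are made up for the illustration.

```shell
# Sketch: assigning past NF and shrinking NF both rebuild $0 using OFS.
fields_out=$(echo 'a b c' | awk 'BEGIN { OFS = "-" }
{
    $5 = "e"        # creates $4 (empty) and $5, then rebuilds $0 with OFS
    print; print NF
    NF = 2          # discards $3 through $5 and rebuilds $0
    print
}')
echo "$fields_out"
```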
|
||
|
||
|
||
File: gawk.info, Node: Input Exercises, Prev: Input Summary, Up: Reading Files
|
||
|
||
4.13 Exercises
|
||
==============
|
||
|
||
1. Using the 'FIELDWIDTHS' variable (*note Constant Size::), write a
|
||
program to read election data, where each record represents one
|
||
voter's votes. Come up with a way to define which columns are
|
||
associated with each ballot item, and print the total votes,
|
||
including abstentions, for each item.
|
||
|
||
2. *note Plain Getline::, presented a program to remove C-style
|
||
comments ('/* ... */') from the input. That program does not work
|
||
if one comment ends on one line and another one starts later on the
|
||
same line. That can be fixed by making one simple change. What is
|
||
it?
|
||
|
||
|
||
File: gawk.info, Node: Printing, Next: Expressions, Prev: Reading Files, Up: Top
|
||
|
||
5 Printing Output
|
||
*****************
|
||
|
||
One of the most common programming actions is to "print", or output,
|
||
some or all of the input. Use the 'print' statement for simple output,
|
||
and the 'printf' statement for fancier formatting. The 'print'
|
||
statement is not limited when computing _which_ values to print.
|
||
However, with two exceptions, you cannot specify _how_ to print
|
||
them--how many columns, whether to use exponential notation or not, and
|
||
so on. (For the exceptions, *note Output Separators::, and *note
|
||
OFMT::.) For printing with specifications, you need the 'printf'
|
||
statement (*note Printf::).
|
||
|
||
Besides basic and formatted printing, this major node also covers I/O
|
||
redirections to files and pipes, introduces the special file names that
|
||
'gawk' processes internally, and discusses the 'close()' built-in
|
||
function.
|
||
|
||
* Menu:
|
||
|
||
* Print:: The 'print' statement.
|
||
* Print Examples:: Simple examples of 'print' statements.
|
||
* Output Separators:: The output separators and how to change them.
|
||
* OFMT:: Controlling Numeric Output With 'print'.
|
||
* Printf:: The 'printf' statement.
|
||
* Redirection:: How to redirect output to multiple files and
|
||
pipes.
|
||
* Special FD:: Special files for I/O.
|
||
* Special Files:: File name interpretation in 'gawk'.
|
||
'gawk' allows access to inherited file
|
||
descriptors.
|
||
* Close Files And Pipes:: Closing Input and Output Files and Pipes.
|
||
* Output Summary:: Output summary.
|
||
* Output Exercises:: Exercises.
|
||
|
||
|
||
File: gawk.info, Node: Print, Next: Print Examples, Up: Printing
|
||
|
||
5.1 The 'print' Statement
|
||
=========================
|
||
|
||
Use the 'print' statement to produce output with simple, standardized
|
||
formatting. You specify only the strings or numbers to print, in a list
|
||
separated by commas. They are output, separated by single spaces,
|
||
followed by a newline. The statement looks like this:
|
||
|
||
print ITEM1, ITEM2, ...
|
||
|
||
The entire list of items may be optionally enclosed in parentheses. The
|
||
parentheses are necessary if any of the item expressions uses the '>'
|
||
relational operator; otherwise it could be confused with an output
|
||
redirection (*note Redirection::).
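
The difference between the two readings of '>' can be seen in a small
sketch.  The first program parenthesizes the comparison and prints its
result; the second leaves the '>' bare, so 'awk' treats it as a
redirection and writes to the named file instead.  The temporary file
and the shell variable names are assumptions made for this
illustration.

```shell
# Sketch: parenthesized comparison versus output redirection.
outfile=$(mktemp)
compare_out=$(awk 'BEGIN { a = 7; b = 3; print (a > b) }')   # comparison result
awk -v out="$outfile" 'BEGIN { a = 7; print a > out }'       # writes a to the file
redirect_out=$(cat "$outfile")
rm -f "$outfile"
echo "comparison printed: $compare_out"
echo "file received: $redirect_out"
```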
|
||
|
||
The items to print can be constant strings or numbers, fields of the
|
||
current record (such as '$1'), variables, or any 'awk' expression.
|
||
Numeric values are converted to strings and then printed.
|
||
|
||
The simple statement 'print' with no items is equivalent to 'print
|
||
$0': it prints the entire current record. To print a blank line, use
|
||
'print ""'. To print a fixed piece of text, use a string constant, such
|
||
as '"Don't Panic"', as one item. If you forget to use the double-quote
|
||
characters, your text is taken as an 'awk' expression, and you will
|
||
probably get an error. Keep in mind that a space is printed between any
|
||
two items.
|
||
|
||
Note that the 'print' statement is a statement and not an
|
||
expression--you can't use it in the pattern part of a pattern-action
|
||
statement, for example.
|
||
|
||
|
||
File: gawk.info, Node: Print Examples, Next: Output Separators, Prev: Print, Up: Printing
|
||
|
||
5.2 'print' Statement Examples
|
||
==============================
|
||
|
||
Each 'print' statement makes at least one line of output. However, it
|
||
isn't limited to only one line. If an item value is a string containing
|
||
a newline, the newline is output along with the rest of the string. A
|
||
single 'print' statement can make any number of lines this way.
|
||
|
||
The following is an example of printing a string that contains
|
||
embedded newlines (the '\n' is an escape sequence, used to represent the
|
||
newline character; *note Escape Sequences::):
|
||
|
||
$ awk 'BEGIN { print "line one\nline two\nline three" }'
|
||
-| line one
|
||
-| line two
|
||
-| line three
|
||
|
||
The next example, which is run on the 'inventory-shipped' file,
|
||
prints the first two fields of each input record, with a space between
|
||
them:
|
||
|
||
$ awk '{ print $1, $2 }' inventory-shipped
|
||
-| Jan 13
|
||
-| Feb 15
|
||
-| Mar 15
|
||
...
|
||
|
||
A common mistake in using the 'print' statement is to omit the comma
|
||
between two items. This often has the effect of making the items run
|
||
together in the output, with no space. The reason for this is that
|
||
juxtaposing two string expressions in 'awk' means to concatenate them.
|
||
Here is the same program, without the comma:
|
||
|
||
$ awk '{ print $1 $2 }' inventory-shipped
|
||
-| Jan13
|
||
-| Feb15
|
||
-| Mar15
|
||
...
|
||
|
||
To someone unfamiliar with the 'inventory-shipped' file, neither
|
||
example's output makes much sense. A heading line at the beginning
|
||
would make it clearer. Let's add some headings to our table of months
|
||
('$1') and green crates shipped ('$2'). We do this using a 'BEGIN' rule
|
||
(*note BEGIN/END::) so that the headings are only printed once:
|
||
|
||
     awk 'BEGIN { print "Month Crates"
                  print "----- ------" }
                { print $1, $2 }' inventory-shipped
|
||
|
||
When run, the program prints the following:
|
||
|
||
Month Crates
|
||
----- ------
|
||
Jan 13
|
||
Feb 15
|
||
Mar 15
|
||
...
|
||
|
||
The only problem, however, is that the headings and the table data don't
|
||
line up! We can fix this by printing some spaces between the two
|
||
fields:
|
||
|
||
     awk 'BEGIN { print "Month Crates"
                  print "----- ------" }
                { print $1, "     ", $2 }' inventory-shipped
|
||
|
||
Lining up columns this way can get pretty complicated when there are
|
||
many columns to fix. Counting spaces for two or three columns is
|
||
simple, but any more than this can take up a lot of time. This is why
|
||
the 'printf' statement was created (*note Printf::); one of its
|
||
specialties is lining up columns of data.
|
||
|
||
NOTE: You can continue either a 'print' or 'printf' statement
|
||
simply by putting a newline after any comma (*note
|
||
Statements/Lines::).
|
||
|
||
|
||
File: gawk.info, Node: Output Separators, Next: OFMT, Prev: Print Examples, Up: Printing
|
||
|
||
5.3 Output Separators
|
||
=====================
|
||
|
||
As mentioned previously, a 'print' statement contains a list of items
|
||
separated by commas. In the output, the items are normally separated by
|
||
single spaces. However, this doesn't need to be the case; a single
|
||
space is simply the default. Any string of characters may be used as
|
||
the "output field separator" by setting the predefined variable 'OFS'.
|
||
The initial value of this variable is the string '" "' (i.e., a single
|
||
space).
|
||
|
||
The output from an entire 'print' statement is called an "output
|
||
record". Each 'print' statement outputs one output record, and then
|
||
outputs a string called the "output record separator" (or 'ORS'). The
|
||
initial value of 'ORS' is the string '"\n"' (i.e., a newline character).
|
||
Thus, each 'print' statement normally makes a separate line.
|
||
|
||
In order to change how output fields and records are separated,
|
||
assign new values to the variables 'OFS' and 'ORS'. The usual place to
|
||
do this is in the 'BEGIN' rule (*note BEGIN/END::), so that it happens
|
||
before any input is processed. It can also be done with assignments on
|
||
the command line, before the names of the input files, or using the '-v'
|
||
command-line option (*note Options::). The following example prints the
|
||
first and second fields of each input record, separated by a semicolon,
|
||
with a blank line added after each newline:
|
||
|
||
$ awk 'BEGIN { OFS = ";"; ORS = "\n\n" }
|
||
> { print $1, $2 }' mail-list
|
||
-| Amelia;555-5553
|
||
-|
|
||
-| Anthony;555-3412
|
||
-|
|
||
-| Becky;555-7685
|
||
-|
|
||
-| Bill;555-1675
|
||
-|
|
||
-| Broderick;555-0542
|
||
-|
|
||
-| Camilla;555-2912
|
||
-|
|
||
-| Fabius;555-1234
|
||
-|
|
||
-| Julie;555-6699
|
||
-|
|
||
-| Martin;555-6480
|
||
-|
|
||
-| Samuel;555-3430
|
||
-|
|
||
-| Jean-Paul;555-2127
|
||
-|
|
||
|
||
If the value of 'ORS' does not contain a newline, the program's
|
||
output runs together on a single line.
|
||
|
||
|
||
File: gawk.info, Node: OFMT, Next: Printf, Prev: Output Separators, Up: Printing
|
||
|
||
5.4 Controlling Numeric Output with 'print'
|
||
===========================================
|
||
|
||
When printing numeric values with the 'print' statement, 'awk'
|
||
internally converts each number to a string of characters and prints
|
||
that string. 'awk' uses the 'sprintf()' function to do this conversion
|
||
(*note String Functions::). For now, it suffices to say that the
|
||
'sprintf()' function accepts a "format specification" that tells it how
|
||
to format numbers (or strings), and that there are a number of different
|
||
ways in which numbers can be formatted. The different format
|
||
specifications are discussed more fully in *note Control Letters::.
|
||
|
||
The predefined variable 'OFMT' contains the format specification that
|
||
'print' uses with 'sprintf()' when it wants to convert a number to a
|
||
string for printing. The default value of 'OFMT' is '"%.6g"'. The way
|
||
'print' prints numbers can be changed by supplying a different format
|
||
specification for the value of 'OFMT', as shown in the following
|
||
example:
|
||
|
||
$ awk 'BEGIN {
|
||
> OFMT = "%.0f" # print numbers as integers (rounds)
|
||
> print 17.23, 17.54 }'
|
||
-| 17 18
|
||
|
||
According to the POSIX standard, 'awk''s behavior is undefined if 'OFMT'
|
||
contains anything but a floating-point conversion specification. (d.c.)
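
The following sketch shows the default 'OFMT' next to a changed one.
It also relies on a detail not covered in this node: values that are
integral are printed as integers regardless of 'OFMT'.  The sample
values are arbitrary.

```shell
# Sketch: effect of OFMT on print, before and after changing it.
ofmt_out=$(awk 'BEGIN {
    print 3.14159, 42     # default OFMT is "%.6g"
    OFMT = "%.2f"
    print 3.14159, 42     # the integral value 42 is not reformatted
}')
echo "$ofmt_out"
```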
|
||
|
||
|
||
File: gawk.info, Node: Printf, Next: Redirection, Prev: OFMT, Up: Printing
|
||
|
||
5.5 Using 'printf' Statements for Fancier Printing
|
||
==================================================
|
||
|
||
For more precise control over the output format than what is provided by
|
||
'print', use 'printf'. With 'printf' you can specify the width to use
|
||
for each item, as well as various formatting choices for numbers (such
|
||
as what output base to use, whether to print an exponent, whether to
|
||
print a sign, and how many digits to print after the decimal point).
|
||
|
||
* Menu:
|
||
|
||
* Basic Printf:: Syntax of the 'printf' statement.
|
||
* Control Letters:: Format-control letters.
|
||
* Format Modifiers:: Format-specification modifiers.
|
||
* Printf Examples:: Several examples.
|
||
|
||
|
||
File: gawk.info, Node: Basic Printf, Next: Control Letters, Up: Printf
|
||
|
||
5.5.1 Introduction to the 'printf' Statement
|
||
--------------------------------------------
|
||
|
||
A simple 'printf' statement looks like this:
|
||
|
||
printf FORMAT, ITEM1, ITEM2, ...
|
||
|
||
As for 'print', the entire list of arguments may optionally be enclosed
|
||
in parentheses. Here too, the parentheses are necessary if any of the
|
||
item expressions uses the '>' relational operator; otherwise, it can be
|
||
confused with an output redirection (*note Redirection::).
|
||
|
||
The difference between 'printf' and 'print' is the FORMAT argument.
|
||
This is an expression whose value is taken as a string; it specifies how
|
||
to output each of the other arguments. It is called the "format
|
||
string".
|
||
|
||
The format string is very similar to that in the ISO C library
|
||
function 'printf()'. Most of FORMAT is text to output verbatim.
|
||
Scattered among this text are "format specifiers"--one per item. Each
|
||
format specifier says to output the next item in the argument list at
|
||
that place in the format.
|
||
|
||
The 'printf' statement does not automatically append a newline to its
|
||
output. It outputs only what the format string specifies. So if a
|
||
newline is needed, you must include one in the format string. The
|
||
output separator variables 'OFS' and 'ORS' have no effect on 'printf'
|
||
statements. For example:
|
||
|
||
$ awk 'BEGIN {
|
||
> ORS = "\nOUCH!\n"; OFS = "+"
|
||
> msg = "Don\47t Panic!"
|
||
> printf "%s\n", msg
|
||
> }'
|
||
-| Don't Panic!
|
||
|
||
Here, neither the '+' nor the 'OUCH!' appears in the output message.
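
Because 'printf' adds nothing of its own, several 'printf' statements
can build up a single output line.  A minimal sketch, with made-up
strings:

```shell
# Sketch: two printf statements produce one line; only the second ends it.
printf_out=$(awk 'BEGIN {
    printf "%s has ", "cart"      # no newline is added here
    printf "%d items\n", 3        # the format string supplies the newline
}')
echo "$printf_out"
```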
|
||
|
||
|
||
File: gawk.info, Node: Control Letters, Next: Format Modifiers, Prev: Basic Printf, Up: Printf
|
||
|
||
5.5.2 Format-Control Letters
|
||
----------------------------
|
||
|
||
A format specifier starts with the character '%' and ends with a
|
||
"format-control letter"--it tells the 'printf' statement how to output
|
||
one item. The format-control letter specifies what _kind_ of value to
|
||
print. The rest of the format specifier is made up of optional
|
||
"modifiers" that control _how_ to print the value, such as the field
|
||
width. Here is a list of the format-control letters:
|
||
|
||
'%c'
|
||
Print a number as a character; thus, 'printf "%c", 65' outputs the
|
||
letter 'A'. The output for a string value is the first character
|
||
of the string.
|
||
|
||
NOTE: The POSIX standard says the first character of a string
|
||
is printed. In locales with multibyte characters, 'gawk'
|
||
attempts to convert the leading bytes of the string into a
|
||
valid wide character and then to print the multibyte encoding
|
||
of that character. Similarly, when printing a numeric value,
|
||
'gawk' allows the value to be within the numeric range of
|
||
values that can be held in a wide character. If the
|
||
conversion to multibyte encoding fails, 'gawk' uses the low
|
||
eight bits of the value as the character to print.
|
||
|
||
Other 'awk' versions generally restrict themselves to printing
|
||
the first byte of a string or to numeric values within the
|
||
range of a single byte (0-255).
|
||
|
||
'%d', '%i'
|
||
Print a decimal integer. The two control letters are equivalent.
|
||
(The '%i' specification is for compatibility with ISO C.)
|
||
|
||
'%e', '%E'
|
||
Print a number in scientific (exponential) notation. For example:
|
||
|
||
printf "%4.3e\n", 1950
|
||
|
||
prints '1.950e+03', with a total of four significant figures, three
|
||
of which follow the decimal point. (The '4.3' represents two
|
||
modifiers, discussed in the next node.) '%E' uses 'E' instead of
|
||
'e' in the output.
|
||
|
||
'%f'
|
||
Print a number in floating-point notation. For example:
|
||
|
||
printf "%4.3f", 1950
|
||
|
||
prints '1950.000', with three digits following the decimal point.  The
'4' is a minimum field width, which this value already exceeds.  (The
'4.3' represents two modifiers, discussed in the next node.)
|
||
|
||
On systems supporting IEEE 754 floating-point format, values
|
||
representing negative infinity are formatted as '-inf' or
|
||
'-infinity', and positive infinity as 'inf' or 'infinity'. The
|
||
special "not a number" value formats as '-nan' or 'nan' (*note Math
|
||
Definitions::).
|
||
|
||
'%F'
|
||
Like '%f', but the infinity and "not a number" values are spelled
|
||
using uppercase letters.
|
||
|
||
The '%F' format is a POSIX extension to ISO C; not all systems
|
||
support it. On those that don't, 'gawk' uses '%f' instead.
|
||
|
||
'%g', '%G'
|
||
Print a number in either scientific notation or in floating-point
|
||
notation, whichever uses fewer characters; if the result is printed
|
||
in scientific notation, '%G' uses 'E' instead of 'e'.
|
||
|
||
'%o'
|
||
Print an unsigned octal integer (*note Nondecimal-numbers::).
|
||
|
||
'%s'
|
||
Print a string.
|
||
|
||
'%u'
|
||
Print an unsigned decimal integer. (This format is of marginal
|
||
use, because all numbers in 'awk' are floating point; it is
|
||
provided primarily for compatibility with C.)
|
||
|
||
'%x', '%X'
|
||
Print an unsigned hexadecimal integer; '%X' uses the letters 'A'
|
||
through 'F' instead of 'a' through 'f' (*note
|
||
Nondecimal-numbers::).
|
||
|
||
'%%'
|
||
Print a single '%'. This does not consume an argument and it
|
||
ignores any modifiers.
|
||
|
||
NOTE: When using the integer format-control letters for values that
|
||
are outside the range of the widest C integer type, 'gawk' switches
|
||
to the '%g' format specifier. If '--lint' is provided on the
|
||
command line (*note Options::), 'gawk' warns about this. Other
|
||
versions of 'awk' may print invalid values or do something else
|
||
entirely. (d.c.)
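
Several of the control letters in the preceding list can be compared
side by side.  The sketch below uses arbitrary sample values and a '|'
between items so that each result is easy to pick out; the exponent
format shown by '%e' assumes a typical POSIX C library.

```shell
# Sketch: one printf exercising most format-control letters.
letters_out=$(awk 'BEGIN {
    printf "%c|%d|%e|%f|%o|%x|%X|%s|%%\n", 65, 42.9, 1950, 1950, 8, 255, 255, "str"
}')
echo "$letters_out"
```

Note that '%d' drops the fractional part of 42.9 rather than rounding
it.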
|
||
|
||
|
||
File: gawk.info, Node: Format Modifiers, Next: Printf Examples, Prev: Control Letters, Up: Printf
|
||
|
||
5.5.3 Modifiers for 'printf' Formats
|
||
------------------------------------
|
||
|
||
A format specification can also include "modifiers" that can control how
|
||
much of the item's value is printed, as well as how much space it gets.
|
||
The modifiers come between the '%' and the format-control letter. We
|
||
use the bullet symbol "*" in the following examples to represent spaces
|
||
in the output. Here are the possible modifiers, in the order in which
|
||
they may appear:
|
||
|
||
'N$'
|
||
An integer constant followed by a '$' is a "positional specifier".
|
||
Normally, format specifications are applied to arguments in the
|
||
order given in the format string. With a positional specifier, the
|
||
format specification is applied to a specific argument, instead of
|
||
what would be the next argument in the list. Positional specifiers
|
||
begin counting with one. Thus:
|
||
|
||
printf "%s %s\n", "don't", "panic"
|
||
printf "%2$s %1$s\n", "panic", "don't"
|
||
|
||
prints the famous friendly message twice.
|
||
|
||
At first glance, this feature doesn't seem to be of much use. It
|
||
is in fact a 'gawk' extension, intended for use in translating
|
||
messages at runtime. *Note Printf Ordering::, which describes how
|
||
and why to use positional specifiers. For now, we ignore them.
|
||
|
||
'-' (Minus)
|
||
The minus sign, used before the width modifier (see later on in
|
||
this list), says to left-justify the argument within its specified
|
||
width. Normally, the argument is printed right-justified in the
|
||
specified width. Thus:
|
||
|
||
printf "%-4s", "foo"
|
||
|
||
prints 'foo*'.
|
||
|
||
SPACE
|
||
For numeric conversions, prefix positive values with a space and
|
||
negative values with a minus sign.
|
||
|
||
'+'
|
||
The plus sign, used before the width modifier (see later on in this
|
||
list), says to always supply a sign for numeric conversions, even
|
||
if the data to format is positive. The '+' overrides the space
|
||
modifier.
|
||
|
||
'#'
|
||
Use an "alternative form" for certain control letters. For '%o',
|
||
supply a leading zero. For '%x' and '%X', supply a leading '0x' or
|
||
'0X' for a nonzero result. For '%e', '%E', '%f', and '%F', the
|
||
result always contains a decimal point. For '%g' and '%G',
|
||
trailing zeros are not removed from the result.
|
||
|
||
'0'
|
||
A leading '0' (zero) acts as a flag indicating that output should
|
||
be padded with zeros instead of spaces. This applies only to the
|
||
numeric output formats. This flag only has an effect when the
|
||
field width is wider than the value to print.
|
||
|
||
'''
|
||
A single quote or apostrophe character is a POSIX extension to ISO
|
||
C. It indicates that the integer part of a floating-point value, or
|
||
the entire part of an integer decimal value, should have a
|
||
thousands-separator character in it. This only works in locales
|
||
that support such characters. For example:
|
||
|
||
     $ cat thousands.awk          Show source program
     -| BEGIN { printf "%'d\n", 1234567 }
     $ LC_ALL=C gawk -f thousands.awk
     -| 1234567                   Results in "C" locale
     $ LC_ALL=en_US.UTF-8 gawk -f thousands.awk
     -| 1,234,567                 Results in US English UTF locale
|
||
|
||
For more information about locales and internationalization issues,
|
||
see *note Locales::.
|
||
|
||
NOTE: The ''' flag is a nice feature, but its use complicates
|
||
things: it becomes difficult to use it in command-line
|
||
programs. For information on appropriate quoting tricks, see
|
||
*note Quoting::.
|
||
|
||
WIDTH
     This is a number specifying the desired minimum width of a field.
     Inserting any number between the '%' sign and the format-control
     character forces the field to expand to this width. The default
     way to do this is to pad with spaces on the left. For example:

          printf "%4s", "foo"

     prints '*foo'.

     The value of WIDTH is a minimum width, not a maximum. If the item
     value requires more than WIDTH characters, it can be as wide as
     necessary. Thus, the following:

          printf "%4s", "foobar"

     prints 'foobar'.

     Preceding the WIDTH with a minus sign causes the output to be
     padded with spaces on the right, instead of on the left.

'.PREC'
     A period followed by an integer constant specifies the precision to
     use when printing. The meaning of the precision varies by control
     letter:

     '%d', '%i', '%o', '%u', '%x', '%X'
          Minimum number of digits to print.

     '%e', '%E', '%f', '%F'
          Number of digits to the right of the decimal point.

     '%g', '%G'
          Maximum number of significant digits.

     '%s'
          Maximum number of characters from the string that should
          print.

     Thus, the following:

          printf "%.4s", "foobar"

     prints 'foob'.

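The modifiers in the preceding list can be compared side by side. The
following is a minimal sketch, not taken from the original examples: it
formats the arbitrary values 42, 8, 255, 7, 3.14159, and "foobar", and
puts brackets around each field so that the padding is visible. The
shell variable names 'flags', 'alt', and 'prec' are illustrative only.

```shell
# Each bracketed field applies one modifier to the same value, 42:
# left-justify, right-justify, zero-pad, forced sign, space for sign.
flags=$(awk 'BEGIN { printf "[%-5d][%5d][%05d][%+d][% d]\n", 42, 42, 42, 42, 42 }')
printf '%s\n' "$flags"

# Alternative form for octal and hexadecimal output.
alt=$(awk 'BEGIN { printf "[%#o][%#x]\n", 8, 255 }')
printf '%s\n' "$alt"

# Precision: minimum digits, digits after the point, maximum characters.
prec=$(awk 'BEGIN { printf "[%.3d][%.2f][%.4s]\n", 7, 3.14159, "foobar" }')
printf '%s\n' "$prec"
```

The three lines printed are '[42   ][   42][00042][+42][ 42]',
'[010][0xff]', and '[007][3.14][foob]'.
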
The C library 'printf''s dynamic WIDTH and PREC capability (e.g.,
'"%*.*s"') is supported. Instead of supplying explicit WIDTH and/or
PREC values in the format string, they are passed in the argument list.
For example:

     w = 5
     p = 3
     s = "abcdefg"
     printf "%*.*s\n", w, p, s

is exactly equivalent to:

     s = "abcdefg"
     printf "%5.3s\n", s

Both programs output '**abc'. Earlier versions of 'awk' did not support
this capability. If you must use such a version, you may simulate this
feature by using concatenation to build up the format string, like so:

     w = 5
     p = 3
     s = "abcdefg"
     printf "%" w "." p "s\n", s

This is not particularly easy to read, but it does work.

C programmers may be used to supplying additional modifiers ('h',
'j', 'l', 'L', 't', and 'z') in 'printf' format strings. These are not
valid in 'awk'. Most 'awk' implementations silently ignore them. If
'--lint' is provided on the command line (*note Options::), 'gawk' warns
about their use. If '--posix' is supplied, their use is a fatal error.

File: gawk.info, Node: Printf Examples, Prev: Format Modifiers, Up: Printf

5.5.4 Examples Using 'printf'
-----------------------------

The following simple example shows how to use 'printf' to make an
aligned table:

     awk '{ printf "%-10s %s\n", $1, $2 }' mail-list

This command prints the names of the people ('$1') in the file
'mail-list' as a string of 10 characters that are left-justified. It
also prints the phone numbers ('$2') next on the line. This produces an
aligned two-column table of names and phone numbers, as shown here:

     $ awk '{ printf "%-10s %s\n", $1, $2 }' mail-list
     -| Amelia     555-5553
     -| Anthony    555-3412
     -| Becky      555-7685
     -| Bill       555-1675
     -| Broderick  555-0542
     -| Camilla    555-2912
     -| Fabius     555-1234
     -| Julie      555-6699
     -| Martin     555-6480
     -| Samuel     555-3430
     -| Jean-Paul  555-2127

In this case, the phone numbers had to be printed as strings because
the numbers are separated by dashes. Printing the phone numbers as
numbers would have produced just the first three digits: '555'. This
would have been pretty confusing.

It wasn't necessary to specify a width for the phone numbers because
they are last on their lines. They don't need to have spaces after
them.

The table could be made to look even nicer by adding headings to the
tops of the columns. This is done using a 'BEGIN' rule (*note
BEGIN/END::) so that the headers are only printed once, at the beginning
of the 'awk' program:

     awk 'BEGIN { print "Name       Number"
                  print "----       ------" }
                { printf "%-10s %s\n", $1, $2 }' mail-list

The preceding example mixes 'print' and 'printf' statements in the
same program. Using just 'printf' statements can produce the same
results:

     awk 'BEGIN { printf "%-10s %s\n", "Name", "Number"
                  printf "%-10s %s\n", "----", "------" }
                { printf "%-10s %s\n", $1, $2 }' mail-list

Printing each column heading with the same format specification used for
the column elements ensures that the headings are aligned just like the
columns.

The fact that the same format specification is used three times can
be emphasized by storing it in a variable, like this:

     awk 'BEGIN { format = "%-10s %s\n"
                  printf format, "Name", "Number"
                  printf format, "----", "------" }
                { printf format, $1, $2 }' mail-list

File: gawk.info, Node: Redirection, Next: Special FD, Prev: Printf, Up: Printing

5.6 Redirecting Output of 'print' and 'printf'
==============================================

So far, the output from 'print' and 'printf' has gone to the standard
output, usually the screen. Both 'print' and 'printf' can also send
their output to other places. This is called "redirection".

     NOTE: When '--sandbox' is specified (*note Options::), redirecting
     output to files, pipes, and coprocesses is disabled.

A redirection appears after the 'print' or 'printf' statement.
Redirections in 'awk' are written just like redirections in shell
commands, except that they are written inside the 'awk' program.

There are four forms of output redirection: output to a file, output
appended to a file, output through a pipe to another command, and output
to a coprocess. We show them all for the 'print' statement, but they
work identically for 'printf':

'print ITEMS > OUTPUT-FILE'
     This redirection prints the items into the output file named
     OUTPUT-FILE. The file name OUTPUT-FILE can be any expression. Its
     value is changed to a string and then used as a file name (*note
     Expressions::).

     When this type of redirection is used, the OUTPUT-FILE is erased
     before the first output is written to it. Subsequent writes to the
     same OUTPUT-FILE do not erase OUTPUT-FILE, but append to it. (This
     is different from how you use redirections in shell scripts.) If
     OUTPUT-FILE does not exist, it is created. For example, here is
     how an 'awk' program can write a list of people's names to one file
     named 'name-list', and a list of phone numbers to another file
     named 'phone-list':

          $ awk '{ print $2 > "phone-list"
          >        print $1 > "name-list" }' mail-list
          $ cat phone-list
          -| 555-5553
          -| 555-3412
          ...
          $ cat name-list
          -| Amelia
          -| Anthony
          ...

     Each output file contains one name or number per line.

'print ITEMS >> OUTPUT-FILE'
     This redirection prints the items into the preexisting output file
     named OUTPUT-FILE. The difference between this and the single-'>'
     redirection is that the old contents (if any) of OUTPUT-FILE are
     not erased. Instead, the 'awk' output is appended to the file. If
     OUTPUT-FILE does not exist, then it is created.

'print ITEMS | COMMAND'
     It is possible to send output to another program through a pipe
     instead of into a file. This redirection opens a pipe to COMMAND,
     and writes the values of ITEMS through this pipe to another process
     created to execute COMMAND.

     The redirection argument COMMAND is actually an 'awk' expression.
     Its value is converted to a string whose contents give the shell
     command to be run. For example, the following produces two files,
     one unsorted list of people's names, and one list sorted in reverse
     alphabetical order:

          awk '{ print $1 > "names.unsorted"
                 command = "sort -r > names.sorted"
                 print $1 | command }' mail-list

     The unsorted list is written with an ordinary redirection, while
     the sorted list is written by piping through the 'sort' utility.

     The next example uses redirection to mail a message to the mailing
     list 'bug-system'. This might be useful when trouble is
     encountered in an 'awk' script run periodically for system
     maintenance:

          report = "mail bug-system"
          print("Awk script failed:", $0) | report
          print("at record number", FNR, "of", FILENAME) | report
          close(report)

     The 'close()' function is called here because it's a good idea to
     close the pipe as soon as all the intended output has been sent to
     it. *Note Close Files And Pipes::, for more information.

     This example also illustrates the use of a variable to represent a
     FILE or COMMAND--it is not necessary to always use a string
     constant. Using a variable is generally a good idea, because (if
     you mean to refer to that same file or command) 'awk' requires that
     the string value be written identically every time.

'print ITEMS |& COMMAND'
     This redirection prints the items to the input of COMMAND. The
     difference between this and the single-'|' redirection is that the
     output from COMMAND can be read with 'getline'. Thus, COMMAND is a
     "coprocess", which works together with but is subsidiary to the
     'awk' program.

     This feature is a 'gawk' extension, and is not available in POSIX
     'awk'. *Note Getline/Coprocess::, for a brief discussion. *Note
     Two-way I/O::, for a more complete discussion.

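As a small self-contained sketch of the pipe form described in the
preceding list, the following shell fragment feeds three made-up
records to 'awk', which sends the first field of each one to a single
'sort' process. The sample data and the shell variable name 'sorted'
are assumptions made for illustration; they do not come from the
'mail-list' examples.

```shell
# Three sample records stand in for a real data file.
sorted=$(printf 'pear\napple\nfig\n' | awk '
    { print $1 | "sort" }    # every record goes to the same sort process
    END {
        close("sort")         # flush the pipe and wait for sort to finish
        print "done"
    }')
printf '%s\n' "$sorted"
```

Because 'close()' waits for 'sort' to finish, the sorted names
('apple', 'fig', 'pear') appear before the final 'done' line.
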
Redirecting output using '>', '>>', '|', or '|&' asks the system to
open a file, pipe, or coprocess only if the particular FILE or COMMAND
you specify has not already been written to by your program or if it has
been closed since it was last written to.

It is a common error to use '>' redirection for the first 'print' to
a file, and then to use '>>' for subsequent output:

     # clear the file
     print "Don't panic" > "guide.txt"
     ...
     # append
     print "Avoid improbability generators" >> "guide.txt"

This is indeed how redirections must be used from the shell. But in
'awk', it isn't necessary. In this kind of case, a program should use
'>' for all the 'print' statements, because the output file is only
opened once. (It happens that if you mix '>' and '>>', output is
produced in the expected order. However, mixing the operators for the
same file is definitely poor style, and is confusing to readers of your
program.)

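The behavior just described can be checked with a short experiment.
This is only a sketch under stated assumptions: the scratch file comes
from 'mktemp', and the names 'guide', 'out', and 'contents' are
hypothetical. The file starts out with stale contents, and the 'awk'
program uses '>' for both 'print' statements.

```shell
guide=$(mktemp)                       # scratch file; the name is arbitrary
printf 'stale contents\n' > "$guide"  # pre-existing data to be erased

awk -v out="$guide" 'BEGIN {
    print "first line"  > out         # opens the file, erasing old contents
    print "second line" > out         # same open file: output is appended
    close(out)
}'

contents=$(cat "$guide")
rm -f "$guide"
printf '%s\n' "$contents"
```

The file ends up holding both new lines and none of the stale text,
which is why a single '>' operator suffices for every 'print' that
writes to the same file.
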
Many older 'awk' implementations limit the number of pipelines that
an 'awk' program may have open to just one! In 'gawk', there is no such
limit. 'gawk' allows a program to open as many pipelines as the
underlying operating system permits.

Piping into 'sh'

A particularly powerful way to use redirection is to build command
lines and pipe them into the shell, 'sh'. For example, suppose you have
a list of files brought over from a system where all the file names are
stored in uppercase, and you wish to rename them to have names in all
lowercase. The following program is both simple and efficient:

     { printf("mv %s %s\n", $0, tolower($0)) | "sh" }

     END { close("sh") }

The 'tolower()' function returns its argument string with all
uppercase characters converted to lowercase (*note String Functions::).
The program builds up a list of command lines, using the 'mv' utility to
rename the files. It then sends the list to the shell for execution.

*Note Shell Quoting::, for a function that can help in generating
command lines to be fed to the shell.

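Before piping generated commands into 'sh', it can be useful to look at
them first. The following sketch reuses the 'printf' call from the
renaming program but leaves out the '| "sh"' redirection, so the
commands are only printed, not run. The two file names are made-up
sample input, and 'cmds' is an illustrative variable name. File names
containing spaces or shell metacharacters would additionally need the
quoting discussed in *note Shell Quoting::.

```shell
# Print the commands instead of running them (a dry run).
cmds=$(printf 'README.TXT\nMAIN.C\n' |
       awk '{ printf("mv %s %s\n", $0, tolower($0)) }')
printf '%s\n' "$cmds"
```

This prints 'mv README.TXT readme.txt' and 'mv MAIN.C main.c'. Once the
list looks right, adding the '| "sh"' redirection back executes it.
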
File: gawk.info, Node: Special FD, Next: Special Files, Prev: Redirection, Up: Printing

5.7 Special Files for Standard Preopened Data Streams
=====================================================

Running programs conventionally have three input and output streams
already available to them for reading and writing. These are known as
the "standard input", "standard output", and "standard error output".
These open streams (and any other open files or pipes) are often
referred to by the technical term "file descriptors".

These streams are, by default, connected to your keyboard and screen,
but they are often redirected with the shell, via the '<', '<<', '>',
'>>', '>&', and '|' operators. Standard error is typically used for
writing error messages; the reason there are two separate streams,
standard output and standard error, is so that they can be redirected
separately.

In traditional implementations of 'awk', the only way to write an
error message to standard error in an 'awk' program is as follows:

     print "Serious error detected!" | "cat 1>&2"

This works by opening a pipeline to a shell command that can access the
standard error stream that it inherits from the 'awk' process. This is
far from elegant, and it also requires a separate process. So people
writing 'awk' programs often don't do this. Instead, they send the
error messages to the screen, like this:

     print "Serious error detected!" > "/dev/tty"

('/dev/tty' is a special file supplied by the operating system that is
connected to your keyboard and screen. It represents the "terminal,"(1)
which on modern systems is a keyboard and screen, not a serial console.)
This generally has the same effect, but not always: although the
standard error stream is usually the screen, it can be redirected; when
that happens, writing to the screen is not correct. In fact, if 'awk'
is run from a background job, it may not have a terminal at all. Then
opening '/dev/tty' fails.

'gawk', BWK 'awk', and 'mawk' provide special file names for
accessing the three standard streams. If the file name matches one of
these special names when 'gawk' (or one of the others) redirects input
or output, then it directly uses the descriptor that the file name
stands for. These special file names work for all operating systems
that 'gawk' has been ported to, not just those that are POSIX-compliant:

'/dev/stdin'
     The standard input (file descriptor 0).

'/dev/stdout'
     The standard output (file descriptor 1).

'/dev/stderr'
     The standard error output (file descriptor 2).

With these facilities, the proper way to write an error message then
becomes:

     print "Serious error detected!" > "/dev/stderr"

Note the use of quotes around the file name. As with any other
redirection, the value must be a string. It is a common error to omit
the quotes, which leads to confusing results.

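The point of a separate error stream is that the shell can redirect it
independently of standard output. The following sketch runs the same
'awk' program twice, once discarding standard error and once discarding
standard output. The shell variable names 'prog', 'out', and 'err' are
illustrative assumptions.

```shell
# One awk program writes to both streams.
prog='BEGIN {
    print "normal output"
    print "Serious error detected!" > "/dev/stderr"
}'

out=$(awk "$prog" 2>/dev/null)        # keep only standard output
err=$(awk "$prog" 2>&1 >/dev/null)    # keep only standard error
printf 'stdout: %s\nstderr: %s\n' "$out" "$err"
```

Each variable captures only the line written to its own stream, which
would not be possible if the message had been sent to '/dev/tty'.
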
'gawk' does not treat these file names as special when in
POSIX-compatibility mode. However, because BWK 'awk' supports them,
'gawk' does support them even when invoked with the '--traditional'
option (*note Options::).

---------- Footnotes ----------

(1) The "tty" in '/dev/tty' stands for "Teletype," a serial terminal.

File: gawk.info, Node: Special Files, Next: Close Files And Pipes, Prev: Special FD, Up: Printing

5.8 Special File names in 'gawk'
================================

Besides access to standard input, standard output, and standard error,
'gawk' provides access to any open file descriptor. Additionally, there
are special file names reserved for TCP/IP networking.

* Menu:

* Other Inherited Files::       Accessing other open files with
                                'gawk'.
* Special Network::             Special files for network communications.
* Special Caveats::             Things to watch out for.

File: gawk.info, Node: Other Inherited Files, Next: Special Network, Up: Special Files

5.8.1 Accessing Other Open Files with 'gawk'
--------------------------------------------

Besides the '/dev/stdin', '/dev/stdout', and '/dev/stderr' special file
names mentioned earlier, 'gawk' provides syntax for accessing any other
inherited open file:

'/dev/fd/N'
     The file associated with file descriptor N. Such a file must be
     opened by the program initiating the 'awk' execution (typically the
     shell). Unless special pains are taken in the shell from which
     'gawk' is invoked, only descriptors 0, 1, and 2 are available.

The file names '/dev/stdin', '/dev/stdout', and '/dev/stderr' are
essentially aliases for '/dev/fd/0', '/dev/fd/1', and '/dev/fd/2',
respectively. However, those names are more self-explanatory.

Note that using 'close()' on a file name of the form '"/dev/fd/N"',
for file descriptor numbers above two, does actually close the given
file descriptor.

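The "special pains" mentioned above usually amount to a shell
redirection that opens the extra descriptor. In the following sketch,
the shell opens descriptor 3 on a scratch file and the 'awk' program
writes to it through '/dev/fd/3'. The scratch file and the variable
names 'audit', 'main', and 'side' are assumptions for illustration.
'gawk' interprets the '/dev/fd/3' name itself; with other 'awk'
implementations the example relies on the operating system providing a
'/dev/fd' directory.

```shell
audit=$(mktemp)     # scratch file that will receive descriptor 3

# The shell opens descriptor 3 on the scratch file; awk inherits it.
main=$(awk 'BEGIN {
    print "regular output"
    print "audit record" > "/dev/fd/3"
}' 3>"$audit")

side=$(cat "$audit")
rm -f "$audit"
printf 'stdout: %s\nfd 3:   %s\n' "$main" "$side"
```

Standard output carries only 'regular output', while 'audit record'
lands in the file attached to descriptor 3.
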
File: gawk.info, Node: Special Network, Next: Special Caveats, Prev: Other Inherited Files, Up: Special Files

5.8.2 Special Files for Network Communications
----------------------------------------------

'gawk' programs can open a two-way TCP/IP connection, acting as either a
client or a server. This is done using a special file name of the form:

     /NET-TYPE/PROTOCOL/LOCAL-PORT/REMOTE-HOST/REMOTE-PORT

The NET-TYPE is one of 'inet', 'inet4', or 'inet6'. The PROTOCOL is
one of 'tcp' or 'udp', and the other fields represent the other
essential pieces of information for making a networking connection.
These file names are used with the '|&' operator for communicating with
a coprocess (*note Two-way I/O::). This is an advanced feature,
mentioned here only for completeness. Full discussion is delayed until
*note TCP/IP Networking::.

File: gawk.info, Node: Special Caveats, Prev: Special Network, Up: Special Files

5.8.3 Special File name Caveats
-------------------------------

Here are some things to bear in mind when using the special file names
that 'gawk' provides:

   * Recognition of the file names for the three standard preopened
     files is disabled only in POSIX mode.

   * Recognition of the other special file names is disabled if 'gawk'
     is in compatibility mode (either '--traditional' or '--posix';
     *note Options::).

   * 'gawk' _always_ interprets these special file names. For example,
     using '/dev/fd/4' for output actually writes on file descriptor 4,
     and not on a new file descriptor that is 'dup()'ed from file
     descriptor 4. Most of the time this does not matter; however, it
     is important to _not_ close any of the files related to file
     descriptors 0, 1, and 2. Doing so results in unpredictable
     behavior.

File: gawk.info, Node: Close Files And Pipes, Next: Output Summary, Prev: Special Files, Up: Printing

5.9 Closing Input and Output Redirections
=========================================

If the same file name or the same shell command is used with 'getline'
more than once during the execution of an 'awk' program (*note
Getline::), the file is opened (or the command is executed) the first
time only. At that time, the first record of input is read from that
file or command. The next time the same file or command is used with
'getline', another record is read from it, and so on.

Similarly, when a file or pipe is opened for output, 'awk' remembers
the file name or command associated with it, and subsequent writes to
the same file or command are appended to the previous writes. The file
or pipe stays open until 'awk' exits.

This implies that special steps are necessary in order to read the
same file again from the beginning, or to rerun a shell command (rather
than reading more output from the same command). The 'close()' function
makes these things possible:

     close(FILENAME)

or:

     close(COMMAND)

The argument FILENAME or COMMAND can be any expression. Its value
must _exactly_ match the string that was used to open the file or start
the command (spaces and other "irrelevant" characters included). For
example, if you open a pipe with this:

     "sort -r names" | getline foo

then you must close it with this:

     close("sort -r names")

Once this function call is executed, the next 'getline' from that
file or command, or the next 'print' or 'printf' to that file or
command, reopens the file or reruns the command. Because the expression
that you use to close a file or pipeline must exactly match the
expression used to open the file or run the command, it is good practice
to use a variable to store the file name or command. The previous
example becomes the following:

     sortcom = "sort -r names"
     sortcom | getline foo
     ...
     close(sortcom)

This helps avoid hard-to-find typographical errors in your 'awk'
programs. Here are some of the reasons for closing an output file:

   * To write a file and read it back later on in the same 'awk'
     program. Close the file after writing it, then begin reading it
     with 'getline'.

   * To write numerous files, successively, in the same 'awk' program.
     If the files aren't closed, eventually 'awk' may exceed a system
     limit on the number of open files in one process. It is best to
     close each one when the program has finished writing it.

   * To make a command finish. When output is redirected through a
     pipe, the command reading the pipe normally continues to try to
     read input as long as the pipe is open. Often this means the
     command cannot really do its work until the pipe is closed. For
     example, if output is redirected to the 'mail' program, the message
     is not actually sent until the pipe is closed.

   * To run the same program a second time, with the same arguments.
     This is not the same thing as giving more input to the first run!

     For example, suppose a program pipes output to the 'mail' program.
     If it outputs several lines redirected to this pipe without closing
     it, they make a single message of several lines. By contrast, if
     the program closes the pipe after each line of output, then each
     line makes a separate message.

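The first reason in the preceding list, writing a file and then reading
it back, can be sketched as follows. The scratch file created with
'mktemp' and the names 'scratch', 'f', 'line', 'n', and 'summary' are
hypothetical; only 'close()' and 'getline' come from the text above.

```shell
scratch=$(mktemp)   # hypothetical scratch file for the round trip

summary=$(awk -v f="$scratch" 'BEGIN {
    print "alpha" > f
    print "beta"  > f
    close(f)                            # finish writing before rereading
    while ((getline line < f) > 0)      # reopens f and reads from the top
        n++
    close(f)
    print n " lines, last was " line
}')
rm -f "$scratch"
printf '%s\n' "$summary"
```

The program reports '2 lines, last was beta'. The first 'close()' is
what makes the data available for reading; the second simply releases
the file once the loop is finished.
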
If you use more files than the system allows you to have open, 'gawk'
attempts to multiplex the available open files among your data files.
'gawk''s ability to do this depends upon the facilities of your
operating system, so it may not always work. It is therefore both good
practice and good portability advice to always use 'close()' on your
files when you are done with them. In fact, if you are using a lot of
pipes, it is essential that you close commands when done. For example,
consider something like this:

     {
         ...
         command = ("grep " $1 " /some/file | my_prog -q " $3)
         while ((command | getline) > 0) {
             PROCESS OUTPUT OF command
         }
         # need close(command) here
     }

This example creates a new pipeline based on data in _each_ record.
Without the call to 'close()' indicated in the comment, 'awk' creates
child processes to run the commands, until it eventually runs out of
file descriptors for more pipelines.

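A runnable variation on this pattern is sketched below. It is not the
program from the text: the 'grep' and 'my_prog' pipeline is replaced
with 'echo' and 'tr', and the three input records are made up. Building
a command from input data is only safe here because that input is
fixed. Note that the first and third records produce the same command
string, so the 'close()' call is also what allows the command to run a
second time.

```shell
# The first and third records build the same command string.
upper=$(printf 'foo\nbar\nfoo\n' | awk '{
    command = "echo " $1 " | tr a-z A-Z"
    while ((command | getline line) > 0)
        print line
    close(command)      # release the pipe so the command can run again
}')
printf '%s\n' "$upper"
```

With the 'close()' call in place, the output is 'FOO', 'BAR', and 'FOO'
again, one line per input record.
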
Even though each command has finished (as indicated by the
end-of-file return status from 'getline'), the child process is not
terminated;(1) more importantly, the file descriptor for the pipe is not
closed and released until 'close()' is called or 'awk' exits.

'close()' silently does nothing if given an argument that does not
represent a file, pipe, or coprocess that was opened with a redirection.
In such a case, it returns a negative value, indicating an error. In
addition, 'gawk' sets 'ERRNO' to a string indicating the error.

Note also that 'close(FILENAME)' has no "magic" effects on the
implicit loop that reads through the files named on the command line.
It is, more likely, a close of a file that was never opened with a
redirection, so 'awk' silently does nothing, except return a negative
value.

When using the '|&' operator to communicate with a coprocess, it is
occasionally useful to be able to close one end of the two-way pipe
without closing the other. This is done by supplying a second argument
to 'close()'. As in any other call to 'close()', the first argument is
the name of the command or special file used to start the coprocess.
The second argument should be a string, with either of the values '"to"'
or '"from"'. Case does not matter. As this is an advanced feature,
discussion is delayed until *note Two-way I/O::, which describes it in
more detail and gives an example.

Using 'close()''s Return Value

In many older versions of Unix 'awk', the 'close()' function is
actually a statement. (d.c.) It is a syntax error to try to use the
return value from 'close()':

     command = "..."
     command | getline info
     retval = close(command)  # syntax error in many Unix awks

'gawk' treats 'close()' as a function. The return value is -1 if the
argument names something that was never opened with a redirection, or if
there is a system problem closing the file or process. In these cases,
'gawk' sets the predefined variable 'ERRNO' to a string describing the
problem.

In 'gawk', when closing a pipe or coprocess (input or output), the
return value is the exit status of the command.(2) Otherwise, it is the
return value from the system's 'close()' or 'fclose()' C functions when
closing input or output files, respectively. This value is zero if the
close succeeds, or -1 if it fails.

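For the simplest case, closing an ordinary output file, the return
value can be inspected as in the following sketch. It assumes an 'awk'
in which 'close()' is a function, as it is in 'gawk'; the scratch file
and the names 'scratch', 'f', 'result', and 'status' are illustrative.
Return values for pipes are deliberately left out, because they vary
between implementations.

```shell
scratch=$(mktemp)   # hypothetical scratch file

status=$(awk -v f="$scratch" 'BEGIN {
    print "some data" > f
    result = close(f)       # zero is expected when the close succeeds
    print result
}')
rm -f "$scratch"
printf 'close() returned %s\n' "$status"
```

A program that needs to detect write failures can compare this value
with zero and consult 'ERRNO' when running under 'gawk'.
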
The POSIX standard is very vague; it says that 'close()' returns zero
on success and a nonzero value otherwise. In general, different
implementations vary in what they report when closing pipes; thus, the
return value cannot be used portably. (d.c.) In POSIX mode (*note
Options::), 'gawk' just returns zero when closing a pipe.

---------- Footnotes ----------

(1) The technical terminology is rather morbid. The finished child
is called a "zombie," and cleaning up after it is referred to as
"reaping."

(2) This is a full 16-bit value as returned by the 'wait()' system
call. See the system manual pages for information on how to decode this
value.

File: gawk.info, Node: Output Summary, Next: Output Exercises, Prev: Close Files And Pipes, Up: Printing

5.10 Summary
============

   * The 'print' statement prints comma-separated expressions. Each
     expression is separated by the value of 'OFS' and terminated by the
     value of 'ORS'. 'OFMT' provides the conversion format for numeric
     values for the 'print' statement.

   * The 'printf' statement provides finer-grained control over output,
     with format-control letters for different data types and various
     flags that modify the behavior of the format-control letters.

   * Output from both 'print' and 'printf' may be redirected to files,
     pipes, and coprocesses.

   * 'gawk' provides special file names for access to standard input,
     output, and error, and for network communications.

   * Use 'close()' to close open file, pipe, and coprocess redirections.
     For coprocesses, it is possible to close only one direction of the
     communications.

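The roles of 'OFS', 'ORS', and 'OFMT' from the first summary item can
be seen together in one 'print' statement. The following is a minimal
sketch; the particular separator values and the shell variable name
'line' are arbitrary choices made for illustration.

```shell
line=$(awk 'BEGIN {
    OFS  = ":"          # placed between the comma-separated items
    ORS  = "!\n"        # written at the end of each print statement
    OFMT = "%.2f"       # applied by print to non-integer numbers
    print "pi", 3.14159
}')
printf '%s\n' "$line"
```

The statement prints 'pi:3.14!': the colon comes from 'OFS', the
rounding from 'OFMT', and the exclamation point from 'ORS'.
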
File: gawk.info, Node: Output Exercises, Prev: Output Summary, Up: Printing

5.11 Exercises
==============

  1. Rewrite the program:

          awk 'BEGIN { print "Month Crates"
                       print "----- ------" }
                     { print $1, " ", $2 }' inventory-shipped

     from *note Output Separators::, by using a new value of 'OFS'.

  2. Use the 'printf' statement to line up the headings and table data
     for the 'inventory-shipped' example that was covered in *note
     Print::.

  3. What happens if you forget the double quotes when redirecting
     output, as follows:

          BEGIN { print "Serious error detected!" > /dev/stderr }

File: gawk.info, Node: Expressions, Next: Patterns and Actions, Prev: Printing, Up: Top

6 Expressions
*************

Expressions are the basic building blocks of 'awk' patterns and actions.
An expression evaluates to a value that you can print, test, or pass to
a function. Additionally, an expression can assign a new value to a
variable or a field by using an assignment operator.

An expression can serve as a pattern or action statement on its own.
Most other kinds of statements contain one or more expressions that
specify the data on which to operate. As in other languages,
expressions in 'awk' can include variables, array references, constants,
and function calls, as well as combinations of these with various
operators.

* Menu:

* Values::                      Constants, Variables, and Regular Expressions.
* All Operators::               'gawk''s operators.
* Truth Values and Conditions:: Testing for true and false.
* Function Calls::              A function call is an expression.
* Precedence::                  How various operators nest.
* Locales::                     How the locale affects things.
* Expressions Summary::         Expressions summary.

File: gawk.info, Node: Values, Next: All Operators, Up: Expressions

6.1 Constants, Variables, and Conversions
=========================================

Expressions are built up from values and the operations performed upon
them. This minor node describes the elementary objects that provide the
values used in expressions.

* Menu:

* Constants::                   String, numeric and regexp constants.
* Using Constant Regexps::      When and how to use a regexp constant.
* Variables::                   Variables give names to values for later use.
* Conversion::                  The conversion of strings to numbers and vice
                                versa.

File: gawk.info, Node: Constants, Next: Using Constant Regexps, Up: Values

6.1.1 Constant Expressions
--------------------------

The simplest type of expression is the "constant", which always has the
same value. There are three types of constants: numeric, string, and
regular expression.

Each is used in the appropriate context when you need a data value
that isn't going to change. Numeric constants can have different forms,
but are internally stored in an identical manner.

* Menu:

* Scalar Constants::            Numeric and string constants.
* Nondecimal-numbers::          What are octal and hex numbers.
* Regexp Constants::            Regular Expression constants.

File: gawk.info, Node: Scalar Constants, Next: Nondecimal-numbers, Up: Constants
|
||
|
||
6.1.1.1 Numeric and String Constants
|
||
....................................
|
||
|
||
A "numeric constant" stands for a number. This number can be an
|
||
integer, a decimal fraction, or a number in scientific (exponential)
|
||
notation.(1) Here are some examples of numeric constants that all have
|
||
the same value:
|
||
|
||
105
|
||
1.05e+2
|
||
1050e-1
|
||
|
||
A "string constant" consists of a sequence of characters enclosed in
|
||
double quotation marks. For example:
|
||
|
||
"parrot"
|
||
|
||
represents the string whose contents are 'parrot'.  Strings in 'gawk'
can be of any length, and they can contain any eight-bit character
value, including ASCII NUL (character code zero).  Other 'awk'
implementations may have difficulty with some character codes.
|
||
|
||
---------- Footnotes ----------
|
||
|
||
(1) The internal representation of all numbers, including integers,
|
||
uses double-precision floating-point numbers. On most modern systems,
|
||
these are in IEEE 754 standard format. *Note Arbitrary Precision
|
||
Arithmetic::, for much more information.
|
||
|
||
|
||
File: gawk.info, Node: Nondecimal-numbers, Next: Regexp Constants, Prev: Scalar Constants, Up: Constants
|
||
|
||
6.1.1.2 Octal and Hexadecimal Numbers
|
||
.....................................
|
||
|
||
In 'awk', all numbers are in decimal (i.e., base 10). Many other
|
||
programming languages allow you to specify numbers in other bases, often
|
||
octal (base 8) and hexadecimal (base 16). In octal, the numbers go 0,
|
||
1, 2, 3, 4, 5, 6, 7, 10, 11, 12, and so on. Just as '11' in decimal is
|
||
1 times 10 plus 1, so '11' in octal is 1 times 8 plus 1. This equals 9
|
||
in decimal. In hexadecimal, there are 16 digits. Because the everyday
|
||
decimal number system only has ten digits ('0'-'9'), the letters 'a'
|
||
through 'f' are used to represent the rest. (Case in the letters is
|
||
usually irrelevant; hexadecimal 'a' and 'A' have the same value.) Thus,
|
||
'11' in hexadecimal is 1 times 16 plus 1, which equals 17 in decimal.
|
||
|
||
Just by looking at plain '11', you can't tell what base it's in. So,
|
||
in C, C++, and other languages derived from C, there is a special
|
||
notation to signify the base. Octal numbers start with a leading '0',
|
||
and hexadecimal numbers start with a leading '0x' or '0X':
|
||
|
||
'11'
|
||
Decimal value 11
|
||
|
||
'011'
|
||
Octal 11, decimal value 9
|
||
|
||
'0x11'
|
||
Hexadecimal 11, decimal value 17
|
||
|
||
This example shows the difference:
|
||
|
||
$ gawk 'BEGIN { printf "%d, %d, %d\n", 011, 11, 0x11 }'
|
||
-| 9, 11, 17
|
||
|
||
Being able to use octal and hexadecimal constants in your programs is
|
||
most useful when working with data that cannot be represented
|
||
conveniently as characters or as regular numbers, such as binary data of
|
||
various sorts.
|
||
|
||
'gawk' allows the use of octal and hexadecimal constants in your
|
||
program text. However, such numbers in the input data are not treated
|
||
differently; doing so by default would break old programs. (If you
|
||
really need to do this, use the '--non-decimal-data' command-line
|
||
option; *note Nondecimal Data::.) If you have octal or hexadecimal
|
||
data, you can use the 'strtonum()' function (*note String Functions::)
|
||
to convert the data into a number. Most of the time, you will want to
|
||
use octal or hexadecimal constants when working with the built-in
|
||
bit-manipulation functions; see *note Bitwise Functions::, for more
|
||
information.
|
||
|
||
Unlike in some early C implementations, '8' and '9' are not valid in
|
||
octal constants. For example, 'gawk' treats '018' as decimal 18:
|
||
|
||
$ gawk 'BEGIN { print "021 is", 021 ; print 018 }'
|
||
-| 021 is 17
|
||
-| 18
|
||
|
||
Octal and hexadecimal source code constants are a 'gawk' extension.
|
||
If 'gawk' is in compatibility mode (*note Options::), they are not
|
||
available.
|
||
|
||
A Constant's Base Does Not Affect Its Value
|
||
|
||
Once a numeric constant has been converted internally into a number,
|
||
'gawk' no longer remembers what the original form of the constant was;
|
||
the internal value is always used. This has particular consequences for
|
||
conversion of numbers to strings:
|
||
|
||
$ gawk 'BEGIN { printf "0x11 is <%s>\n", 0x11 }'
|
||
-| 0x11 is <17>
|
||
|
||
|
||
File: gawk.info, Node: Regexp Constants, Prev: Nondecimal-numbers, Up: Constants
|
||
|
||
6.1.1.3 Regular Expression Constants
|
||
....................................
|
||
|
||
A "regexp constant" is a regular expression description enclosed in
|
||
slashes, such as '/^beginning and end$/'. Most regexps used in 'awk'
|
||
programs are constant, but the '~' and '!~' matching operators can also
|
||
match computed or dynamic regexps (which are typically just ordinary
|
||
strings or variables that contain a regexp, but could be more complex
|
||
expressions).
|
||
|
||
|
||
File: gawk.info, Node: Using Constant Regexps, Next: Variables, Prev: Constants, Up: Values
|
||
|
||
6.1.2 Using Regular Expression Constants
|
||
----------------------------------------
|
||
|
||
When used on the righthand side of the '~' or '!~' operators, a regexp
|
||
constant merely stands for the regexp that is to be matched. However,
|
||
regexp constants (such as '/foo/') may be used like simple expressions.
|
||
When a regexp constant appears by itself, it has the same meaning as if
|
||
it appeared in a pattern (i.e., '($0 ~ /foo/)'). (d.c.) *Note
|
||
Expression Patterns::. This means that the following two code segments:
|
||
|
||
if ($0 ~ /barfly/ || $0 ~ /camelot/)
|
||
print "found"
|
||
|
||
and:
|
||
|
||
if (/barfly/ || /camelot/)
|
||
print "found"
|
||
|
||
are exactly equivalent. One rather bizarre consequence of this rule is
|
||
that the following Boolean expression is valid, but does not do what its
|
||
author probably intended:
|
||
|
||
# Note that /foo/ is on the left of the ~
|
||
if (/foo/ ~ $1) print "found foo"
|
||
|
||
This code is "obviously" testing '$1' for a match against the regexp
|
||
'/foo/'. But in fact, the expression '/foo/ ~ $1' really means '($0 ~
|
||
/foo/) ~ $1'. In other words, first match the input record against the
|
||
regexp '/foo/'. The result is either zero or one, depending upon the
|
||
success or failure of the match. That result is then matched against
|
||
the first field in the record. Because it is unlikely that you would
|
||
ever really want to make this kind of test, 'gawk' issues a warning when
|
||
it sees this construct in a program. Another consequence of this rule
|
||
is that the assignment statement:
|
||
|
||
matches = /foo/
|
||
|
||
assigns either zero or one to the variable 'matches', depending upon the
|
||
contents of the current input record.
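
   The assignment case can be tried out with a small sketch.  The shell
function name 'show_matches' and the sample input are made up for this
illustration; the 'awk' program itself only uses the assignment
described above.

```shell
# Hypothetical helper: report, per input line, the value that a bare
# regexp constant produces when it is assigned to a variable.
show_matches() {
    awk '{ matches = /foo/; print matches, $0 }'
}

printf 'foo bar\nbaz\n' | show_matches
```

   The first input line contains 'foo', so it should be reported with a
one; the second does not, so it should be reported with a zero.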
|
||
|
||
Constant regular expressions are also used as the first argument for
|
||
the 'gensub()', 'sub()', and 'gsub()' functions, as the second argument
|
||
of the 'match()' function, and as the third argument of the 'split()'
|
||
and 'patsplit()' functions (*note String Functions::). Modern
|
||
implementations of 'awk', including 'gawk', allow the third argument of
|
||
'split()' to be a regexp constant, but some older implementations do
|
||
not. (d.c.) Because some built-in functions accept regexp constants as
|
||
arguments, confusion can arise when attempting to use regexp constants
|
||
as arguments to user-defined functions (*note User-defined::). For
|
||
example:
|
||
|
||
     function mysub(pat, repl, str, global)
     {
         if (global)
             gsub(pat, repl, str)
         else
             sub(pat, repl, str)
         return str
     }

     {
         ...
         text = "hi! hi yourself!"
         mysub(/hi/, "howdy", text, 1)
         ...
     }
|
||
|
||
In this example, the programmer wants to pass a regexp constant to
|
||
the user-defined function 'mysub()', which in turn passes it on to
|
||
either 'sub()' or 'gsub()'. However, what really happens is that the
|
||
'pat' parameter is assigned a value of either one or zero, depending
|
||
upon whether or not '$0' matches '/hi/'. 'gawk' issues a warning when
|
||
it sees a regexp constant used as a parameter to a user-defined
|
||
function, because passing a truth value in this way is probably not what
|
||
was intended.
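
   One possible way around the problem, sketched here on the assumption
that a dynamic regexp is acceptable, is to pass the pattern as a string
instead of as a regexp constant.  The function body mirrors 'mysub()'
from the example above; the shell wrapper name 'demo_mysub' is
hypothetical.

```shell
demo_mysub() {
    awk '
    function mysub(pat, repl, str, global)
    {
        # pat is an ordinary string here; sub() and gsub() treat it
        # as a dynamic regexp.
        if (global)
            gsub(pat, repl, str)
        else
            sub(pat, repl, str)
        return str
    }

    BEGIN {
        text = "hi! hi yourself!"
        print mysub("hi", "howdy", text, 1)   # replace every match
        print mysub("hi", "howdy", text, 0)   # replace the first match only
    }'
}

demo_mysub
```

   Because the string '"hi"' is not matched against '$0' before the
call, 'pat' keeps the pattern text.  The first call should print 'howdy!
howdy yourself!' and the second 'howdy! hi yourself!'.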
|
||
|
||
|
||
File: gawk.info, Node: Variables, Next: Conversion, Prev: Using Constant Regexps, Up: Values
|
||
|
||
6.1.3 Variables
|
||
---------------
|
||
|
||
"Variables" are ways of storing values at one point in your program for
|
||
use later in another part of your program. They can be manipulated
|
||
entirely within the program text, and they can also be assigned values
|
||
on the 'awk' command line.
|
||
|
||
* Menu:
|
||
|
||
* Using Variables:: Using variables in your programs.
|
||
* Assignment Options:: Setting variables on the command line and a
|
||
summary of command-line syntax. This is an
|
||
advanced method of input.
|
||
|
||
|
||
File: gawk.info, Node: Using Variables, Next: Assignment Options, Up: Variables
|
||
|
||
6.1.3.1 Using Variables in a Program
|
||
....................................
|
||
|
||
Variables let you give names to values and refer to them later.
|
||
Variables have already been used in many of the examples. The name of a
|
||
variable must be a sequence of letters, digits, or underscores, and it
|
||
may not begin with a digit. Here, a "letter" is any one of the 52
|
||
upper- and lowercase English letters. Other characters that may be
|
||
defined as letters in non-English locales are not valid in variable
|
||
names. Case is significant in variable names; 'a' and 'A' are distinct
|
||
variables.
|
||
|
||
A variable name is a valid expression by itself; it represents the
|
||
variable's current value. Variables are given new values with
|
||
"assignment operators", "increment operators", and "decrement operators"
|
||
(*note Assignment Ops::). In addition, the 'sub()' and 'gsub()'
|
||
functions can change a variable's value, and the 'match()', 'split()',
|
||
and 'patsplit()' functions can change the contents of their array
|
||
parameters (*note String Functions::).
|
||
|
||
A few variables have special built-in meanings, such as 'FS' (the
|
||
field separator) and 'NF' (the number of fields in the current input
|
||
record). *Note Built-in Variables::, for a list of the predefined
|
||
variables. These predefined variables can be used and assigned just
|
||
like all other variables, but their values are also used or changed
|
||
automatically by 'awk'. All predefined variables' names are entirely
|
||
uppercase.
|
||
|
||
Variables in 'awk' can be assigned either numeric or string values.
|
||
The kind of value a variable holds can change over the life of a
|
||
program. By default, variables are initialized to the empty string,
|
||
which is zero if converted to a number.  There is no need to explicitly
initialize a variable in 'awk', unlike in C and in most other
traditional languages, where you would have to do so.
|
||
|
||
|
||
File: gawk.info, Node: Assignment Options, Prev: Using Variables, Up: Variables
|
||
|
||
6.1.3.2 Assigning Variables on the Command Line
|
||
...............................................
|
||
|
||
Any 'awk' variable can be set by including a "variable assignment" among
|
||
the arguments on the command line when 'awk' is invoked (*note Other
|
||
Arguments::). Such an assignment has the following form:
|
||
|
||
VARIABLE=TEXT
|
||
|
||
With it, a variable is set either at the beginning of the 'awk' run or
|
||
in between input files. When the assignment is preceded with the '-v'
|
||
option, as in the following:
|
||
|
||
-v VARIABLE=TEXT
|
||
|
||
the variable is set at the very beginning, even before the 'BEGIN' rules
|
||
execute. The '-v' option and its assignment must precede all the file
|
||
name arguments, as well as the program text. (*Note Options::, for more
|
||
information about the '-v' option.) Otherwise, the variable assignment
|
||
is performed at a time determined by its position among the input file
|
||
arguments--after the processing of the preceding input file argument.
|
||
For example:
|
||
|
||
awk '{ print $n }' n=4 inventory-shipped n=2 mail-list
|
||
|
||
prints the value of field number 'n' for all input records. Before the
|
||
first file is read, the command line sets the variable 'n' equal to
|
||
four. This causes the fourth field to be printed in lines from
|
||
'inventory-shipped'. After the first file has finished, but before the
|
||
second file is started, 'n' is set to two, so that the second field is
|
||
printed in lines from 'mail-list':
|
||
|
||
$ awk '{ print $n }' n=4 inventory-shipped n=2 mail-list
|
||
-| 15
|
||
-| 24
|
||
...
|
||
-| 555-5553
|
||
-| 555-3412
|
||
...
|
||
|
||
Command-line arguments are made available for explicit examination by
|
||
the 'awk' program in the 'ARGV' array (*note ARGC and ARGV::). 'awk'
|
||
processes the values of command-line assignments for escape sequences
|
||
(*note Escape Sequences::). (d.c.)
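
   The difference in timing between '-v' and a plain assignment argument
can be seen in a short sketch.  The variable names 'n' and 'm' and the
shell function name 'show_assignment_timing' are chosen only for this
illustration, and '/dev/null' stands in for a real input file.

```shell
show_assignment_timing() {
    # n is assigned with -v, so the BEGIN rule already sees it; m is
    # an ordinary argument, so it is only set once awk starts working
    # through its file arguments.
    awk -v n=1 'BEGIN { print "BEGIN: n=" n ", m=[" m "]" }
                END   { print "END: n=" n ", m=[" m "]" }' m=2 /dev/null
}

show_assignment_timing
```

   In the 'BEGIN' rule 'm' should still be empty, whereas by the time the
'END' rule runs it should hold the value two.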
|
||
|
||
|
||
File: gawk.info, Node: Conversion, Prev: Variables, Up: Values
|
||
|
||
6.1.4 Conversion of Strings and Numbers
|
||
---------------------------------------
|
||
|
||
Number-to-string and string-to-number conversion are generally
|
||
straightforward. There can be subtleties to be aware of; this minor
|
||
node discusses this important facet of 'awk'.
|
||
|
||
* Menu:
|
||
|
||
* Strings And Numbers:: How 'awk' Converts Between Strings And
|
||
Numbers.
|
||
* Locale influences conversions:: How the locale may affect conversions.
|
||
|
||
|
||
File: gawk.info, Node: Strings And Numbers, Next: Locale influences conversions, Up: Conversion
|
||
|
||
6.1.4.1 How 'awk' Converts Between Strings and Numbers
|
||
......................................................
|
||
|
||
Strings are converted to numbers and numbers are converted to strings,
|
||
if the context of the 'awk' program demands it. For example, if the
|
||
value of either 'foo' or 'bar' in the expression 'foo + bar' happens to
|
||
be a string, it is converted to a number before the addition is
|
||
performed. If numeric values appear in string concatenation, they are
|
||
converted to strings. Consider the following:
|
||
|
||
two = 2; three = 3
|
||
print (two three) + 4
|
||
|
||
This prints the (numeric) value 27. The numeric values of the variables
|
||
'two' and 'three' are converted to strings and concatenated together.
|
||
The resulting string is converted back to the number 23, to which 4 is
|
||
then added.
|
||
|
||
If, for some reason, you need to force a number to be converted to a
|
||
string, concatenate that number with the empty string, '""'. To force a
|
||
string to be converted to a number, add zero to that string. A string
|
||
is converted to a number by interpreting any numeric prefix of the
|
||
string as numerals: '"2.5"' converts to 2.5, '"1e3"' converts to 1,000,
|
||
and '"25fix"' has a numeric value of 25. Strings that can't be
|
||
interpreted as valid numbers convert to zero.
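
   The conversion rules described so far can be checked with a brief
sketch.  The shell function name 'show_conversions' and the sample
strings '"fix25"' and '" items"' are assumptions made for this
illustration; the other values come from the text above.

```shell
show_conversions() {
    awk 'BEGIN {
        two = 2; three = 3
        x = (two three) + 4     # concatenation yields "23", then 23 + 4
        print x
        print "25fix" + 0       # the leading numeric prefix is used
        print "1e3" + 0         # scientific notation is recognized
        print "fix25" + 0       # no numeric prefix, so the value is zero
        s = 12 ""               # number forced to a string
        print s " items"
    }'
}

show_conversions
```

   This should print 27, 25, 1000, 0, and '12 items', one per line.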
|
||
|
||
The exact manner in which numbers are converted into strings is
|
||
controlled by the 'awk' predefined variable 'CONVFMT' (*note Built-in
|
||
Variables::). Numbers are converted using the 'sprintf()' function with
|
||
'CONVFMT' as the format specifier (*note String Functions::).
|
||
|
||
'CONVFMT''s default value is '"%.6g"', which creates a value with at
|
||
most six significant digits. For some applications, you might want to
|
||
change it to specify more precision. On most modern machines, 17 digits
|
||
is usually enough to capture a floating-point number's value exactly.(1)
|
||
|
||
Strange results can occur if you set 'CONVFMT' to a string that
|
||
doesn't tell 'sprintf()' how to format floating-point numbers in a
|
||
useful way. For example, if you forget the '%' in the format, 'awk'
|
||
converts all numbers to the same constant string.
|
||
|
||
As a special case, if a number is an integer, then the result of
|
||
converting it to a string is _always_ an integer, no matter what the
|
||
value of 'CONVFMT' may be. Given the following code fragment:
|
||
|
||
CONVFMT = "%2.2f"
|
||
a = 12
|
||
b = a ""
|
||
|
||
'b' has the value '"12"', not '"12.00"'. (d.c.)
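
   To see the special case next to the ordinary one, the following
sketch converts both an integer and a non-integer value with the same
'CONVFMT' setting.  The variables 'c' and 'd' and the shell function
name 'show_convfmt' are additions made for this illustration, and a
locale that uses a period as the decimal point is assumed.

```shell
show_convfmt() {
    awk 'BEGIN {
        CONVFMT = "%2.2f"
        a = 12;      b = a ""
        c = 3.14159; d = c ""
        print b      # integer value: CONVFMT is not applied
        print d      # non-integer value: formatted with CONVFMT
    }'
}

show_convfmt
```

   The integer should come out as '12', while the non-integer value
should be rounded to '3.14'.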
|
||
|
||
Pre-POSIX 'awk' Used 'OFMT' for String Conversion
|
||
|
||
Prior to the POSIX standard, 'awk' used the value of 'OFMT' for
|
||
converting numbers to strings. 'OFMT' specifies the output format to
|
||
use when printing numbers with 'print'. 'CONVFMT' was introduced in
|
||
order to separate the semantics of conversion from the semantics of
|
||
printing. Both 'CONVFMT' and 'OFMT' have the same default value:
|
||
'"%.6g"'. In the vast majority of cases, old 'awk' programs do not
|
||
change their behavior. *Note Print::, for more information on the
|
||
'print' statement.
|
||
|
||
---------- Footnotes ----------
|
||
|
||
(1) Pathological cases can require up to 752 digits (!), but we doubt
|
||
that you need to worry about this.
|
||
|
||
|
||
File: gawk.info, Node: Locale influences conversions, Prev: Strings And Numbers, Up: Conversion
|
||
|
||
6.1.4.2 Locales Can Influence Conversion
|
||
........................................
|
||
|
||
Where you are can matter when it comes to converting between numbers and
|
||
strings. The local character set and language--the "locale"--can affect
|
||
numeric formats. In particular, for 'awk' programs, it affects the
|
||
decimal point character and the thousands-separator character. The
|
||
'"C"' locale, and most English-language locales, use the period
|
||
character ('.') as the decimal point and don't have a thousands
|
||
separator. However, many (if not most) European and non-English locales
|
||
use the comma (',') as the decimal point character. European locales
|
||
often use either a space or a period as the thousands separator, if they
|
||
have one.
|
||
|
||
The POSIX standard says that 'awk' always uses the period as the
|
||
decimal point when reading the 'awk' program source code, and for
|
||
command-line variable assignments (*note Other Arguments::). However,
|
||
when interpreting input data, for 'print' and 'printf' output, and for
|
||
number-to-string conversion, the local decimal point character is used.
|
||
(d.c.) In all cases, numbers in source code and in input data cannot
|
||
have a thousands separator. Here are some examples indicating the
|
||
difference in behavior, on a GNU/Linux system:
|
||
|
||
$ export POSIXLY_CORRECT=1 Force POSIX behavior
|
||
$ gawk 'BEGIN { printf "%g\n", 3.1415927 }'
|
||
-| 3.14159
|
||
$ LC_ALL=en_DK.utf-8 gawk 'BEGIN { printf "%g\n", 3.1415927 }'
|
||
-| 3,14159
|
||
$ echo 4,321 | gawk '{ print $1 + 1 }'
|
||
-| 5
|
||
$ echo 4,321 | LC_ALL=en_DK.utf-8 gawk '{ print $1 + 1 }'
|
||
-| 5,321
|
||
|
||
The 'en_DK.utf-8' locale is for English in Denmark, where the comma acts
|
||
as the decimal point separator. In the normal '"C"' locale, 'gawk'
|
||
treats '4,321' as 4, while in the Danish locale, it's treated as the
|
||
full number including the fractional part, 4.321.
|
||
|
||
Some earlier versions of 'gawk' fully complied with this aspect of
|
||
the standard. However, many users in non-English locales complained
|
||
about this behavior, because their data used a period as the decimal
|
||
point, so the default behavior was restored to use a period as the
|
||
decimal point character. You can use the '--use-lc-numeric' option
|
||
(*note Options::) to force 'gawk' to use the locale's decimal point
|
||
character. ('gawk' also uses the locale's decimal point character when
|
||
in POSIX mode, either via '--posix' or the 'POSIXLY_CORRECT' environment
|
||
variable, as shown previously.)
|
||
|
||
*note Table 6.1: table-locale-affects. describes the cases in which
|
||
the locale's decimal point character is used and when a period is used.
|
||
Some of these features have not been described yet.
|
||
|
||
     Feature        Default      '--posix' or '--use-lc-numeric'
     ------------------------------------------------------------
     '%'g'          Use locale   Use locale
     '%g'           Use period   Use locale
     Input          Use period   Use locale
     'strtonum()'   Use period   Use locale
|
||
|
||
Table 6.1: Locale decimal point versus a period
|
||
|
||
Finally, modern-day formal standards and the IEEE standard
|
||
floating-point representation can have an unusual but important effect
|
||
on the way 'gawk' converts some special string values to numbers. The
|
||
details are presented in *note POSIX Floating Point Problems::.
|
||
|
||
|
||
File: gawk.info, Node: All Operators, Next: Truth Values and Conditions, Prev: Values, Up: Expressions
|
||
|
||
6.2 Operators: Doing Something with Values
|
||
==========================================
|
||
|
||
This minor node introduces the "operators" that make use of the values
|
||
provided by constants and variables.
|
||
|
||
* Menu:
|
||
|
||
* Arithmetic Ops:: Arithmetic operations ('+', '-',
|
||
etc.)
|
||
* Concatenation:: Concatenating strings.
|
||
* Assignment Ops:: Changing the value of a variable or a field.
|
||
* Increment Ops:: Incrementing the numeric value of a variable.
|
||
|
||
|
||
File: gawk.info, Node: Arithmetic Ops, Next: Concatenation, Up: All Operators
|
||
|
||
6.2.1 Arithmetic Operators
|
||
--------------------------
|
||
|
||
The 'awk' language uses the common arithmetic operators when evaluating
|
||
expressions. All of these arithmetic operators follow normal precedence
|
||
rules and work as you would expect them to.
|
||
|
||
The following example uses a file named 'grades', which contains a
|
||
list of student names as well as three test scores per student (it's a
|
||
small class):
|
||
|
||
Pat 100 97 58
|
||
Sandy 84 72 93
|
||
Chris 72 92 89
|
||
|
||
This program takes the file 'grades' and prints the average of the
|
||
scores:
|
||
|
||
$ awk '{ sum = $2 + $3 + $4 ; avg = sum / 3
|
||
> print $1, avg }' grades
|
||
-| Pat 85
|
||
-| Sandy 83
|
||
-| Chris 84.3333
|
||
|
||
The following list provides the arithmetic operators in 'awk', in
|
||
order from the highest precedence to the lowest:
|
||
|
||
'X ^ Y'
|
||
'X ** Y'
|
||
Exponentiation; X raised to the Y power. '2 ^ 3' has the value
|
||
eight; the character sequence '**' is equivalent to '^'. (c.e.)
|
||
|
||
'- X'
|
||
Negation.
|
||
|
||
'+ X'
|
||
Unary plus; the expression is converted to a number.
|
||
|
||
'X * Y'
|
||
Multiplication.
|
||
|
||
'X / Y'
|
||
Division; because all numbers in 'awk' are floating-point numbers,
|
||
the result is _not_ rounded to an integer--'3 / 4' has the value
|
||
0.75. (It is a common mistake, especially for C programmers, to
|
||
forget that _all_ numbers in 'awk' are floating point, and that
|
||
division of integer-looking constants produces a real number, not
|
||
an integer.)
|
||
|
||
'X % Y'
|
||
Remainder; further discussion is provided in the text, just after
|
||
this list.
|
||
|
||
'X + Y'
|
||
Addition.
|
||
|
||
'X - Y'
|
||
Subtraction.
|
||
|
||
Unary plus and minus have the same precedence, the multiplication
|
||
operators all have the same precedence, and addition and subtraction
|
||
have the same precedence.
|
||
|
||
When computing the remainder of 'X % Y', the quotient is rounded
|
||
toward zero to an integer and multiplied by Y. This result is
|
||
subtracted from X; this operation is sometimes known as "trunc-mod."
|
||
The following relation always holds:
|
||
|
||
b * int(a / b) + (a % b) == a
|
||
|
||
One possibly undesirable effect of this definition of remainder is
|
||
that 'X % Y' is negative if X is negative. Thus:
|
||
|
||
-17 % 8 = -1
|
||
|
||
In other 'awk' implementations, the signedness of the remainder may
|
||
be machine-dependent.
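
   The relation and the sign of the result can be checked with a small
sketch.  It assumes an implementation that, like 'gawk', truncates the
quotient toward zero; the shell function name 'show_remainder' is made
up for this illustration.

```shell
show_remainder() {
    awk 'BEGIN {
        a = -17; b = 8
        r = a % b
        print r
        # Check the relation from the text:
        #   b * int(a / b) + (a % b) == a
        if (b * int(a / b) + r == a)
            print "relation holds"
        else
            print "relation fails"
    }'
}

show_remainder
```

   With 'gawk', this should print -1 followed by 'relation holds',
because 'int(-17 / 8)' is -2 and 8 times -2 plus -1 is -17.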
|
||
|
||
NOTE: The POSIX standard only specifies the use of '^' for
|
||
exponentiation. For maximum portability, do not use the '**'
|
||
operator.
|
||
|
||
|
||
File: gawk.info, Node: Concatenation, Next: Assignment Ops, Prev: Arithmetic Ops, Up: All Operators
|
||
|
||
6.2.2 String Concatenation
|
||
--------------------------
|
||
|
||
It seemed like a good idea at the time.
|
||
-- _Brian Kernighan_
|
||
|
||
There is only one string operation: concatenation. It does not have
|
||
a specific operator to represent it. Instead, concatenation is
|
||
performed by writing expressions next to one another, with no operator.
|
||
For example:
|
||
|
||
$ awk '{ print "Field number one: " $1 }' mail-list
|
||
-| Field number one: Amelia
|
||
-| Field number one: Anthony
|
||
...
|
||
|
||
Without the space in the string constant after the ':', the line runs
|
||
together. For example:
|
||
|
||
$ awk '{ print "Field number one:" $1 }' mail-list
|
||
-| Field number one:Amelia
|
||
-| Field number one:Anthony
|
||
...
|
||
|
||
Because string concatenation does not have an explicit operator, it
|
||
is often necessary to ensure that it happens at the right time by using
|
||
parentheses to enclose the items to concatenate. For example, you might
|
||
expect that the following code fragment concatenates 'file' and 'name':
|
||
|
||
file = "file"
|
||
name = "name"
|
||
print "something meaningful" > file name
|
||
|
||
This produces a syntax error with some versions of Unix 'awk'.(1) It is
|
||
necessary to use the following:
|
||
|
||
print "something meaningful" > (file name)
|
||
|
||
Parentheses should be used around concatenation in all but the most
|
||
common contexts, such as on the righthand side of '='. Be careful about
|
||
the kinds of expressions used in string concatenation. In particular,
|
||
the order of evaluation of expressions used for concatenation is
|
||
undefined in the 'awk' language. Consider this example:
|
||
|
||
     BEGIN {
         a = "don't"
         print (a " " (a = "panic"))
     }
|
||
|
||
It is not defined whether the second assignment to 'a' happens before or
|
||
after the value of 'a' is retrieved for producing the concatenated
|
||
value. The result could be either 'don't panic', or 'panic panic'.
|
||
|
||
The precedence of concatenation, when mixed with other operators, is
|
||
often counter-intuitive. Consider this example:
|
||
|
||
$ awk 'BEGIN { print -12 " " -24 }'
|
||
-| -12-24
|
||
|
||
This "obviously" is concatenating -12, a space, and -24. But where
|
||
did the space disappear to? The answer lies in the combination of
|
||
operator precedences and 'awk''s automatic conversion rules. To get the
|
||
desired result, write the program this way:
|
||
|
||
$ awk 'BEGIN { print -12 " " (-24) }'
|
||
-| -12 -24
|
||
|
||
This forces 'awk' to treat the '-' on the '-24' as unary. Otherwise,
|
||
it's parsed as follows:
|
||
|
||
     -12 (" " - 24)
     => -12 (0 - 24)
     => -12 (-24)
     => -12-24
|
||
|
||
As mentioned earlier, when mixing concatenation with other operators,
|
||
_parenthesize_. Otherwise, you're never quite sure what you'll get.
|
||
|
||
---------- Footnotes ----------
|
||
|
||
(1) It happens that BWK 'awk', 'gawk', and 'mawk' all "get it right,"
|
||
but you should not rely on this.
|
||
|
||
|
||
File: gawk.info, Node: Assignment Ops, Next: Increment Ops, Prev: Concatenation, Up: All Operators
|
||
|
||
6.2.3 Assignment Expressions
|
||
----------------------------
|
||
|
||
An "assignment" is an expression that stores a (usually different) value
|
||
into a variable. For example, let's assign the value one to the
|
||
variable 'z':
|
||
|
||
z = 1
|
||
|
||
After this expression is executed, the variable 'z' has the value
|
||
one. Whatever old value 'z' had before the assignment is forgotten.
|
||
|
||
Assignments can also store string values. For example, the following
|
||
stores the value '"this food is good"' in the variable 'message':
|
||
|
||
thing = "food"
|
||
predicate = "good"
|
||
message = "this " thing " is " predicate
|
||
|
||
This also illustrates string concatenation. The '=' sign is called an
|
||
"assignment operator". It is the simplest assignment operator because
|
||
the value of the righthand operand is stored unchanged. Most operators
|
||
(addition, concatenation, and so on) have no effect except to compute a
|
||
value. If the value isn't used, there's no reason to use the operator.
|
||
An assignment operator is different; it does produce a value, but even
|
||
if you ignore it, the assignment still makes itself felt through the
|
||
alteration of the variable. We call this a "side effect".
|
||
|
||
The lefthand operand of an assignment need not be a variable (*note
|
||
Variables::); it can also be a field (*note Changing Fields::) or an
|
||
array element (*note Arrays::). These are all called "lvalues", which
|
||
means they can appear on the lefthand side of an assignment operator.
|
||
The righthand operand may be any expression; it produces the new value
|
||
that the assignment stores in the specified variable, field, or array
|
||
element. (Such values are called "rvalues".)
|
||
|
||
It is important to note that variables do _not_ have permanent types.
|
||
A variable's type is simply the type of whatever value was last assigned
|
||
to it. In the following program fragment, the variable 'foo' has a
|
||
numeric value at first, and a string value later on:
|
||
|
||
foo = 1
|
||
print foo
|
||
foo = "bar"
|
||
print foo
|
||
|
||
When the second assignment gives 'foo' a string value, the fact that it
|
||
previously had a numeric value is forgotten.
|
||
|
||
   String values that do not begin with a valid number have a numeric
value of zero.  After executing the following code, the value of 'foo'
is five:
|
||
|
||
foo = "a string"
|
||
foo = foo + 5
|
||
|
||
NOTE: Using a variable as a number and then later as a string can
|
||
be confusing and is poor programming style. The previous two
|
||
examples illustrate how 'awk' works, _not_ how you should write
|
||
your programs!
|
||
|
||
An assignment is an expression, so it has a value--the same value
|
||
that is assigned. Thus, 'z = 1' is an expression with the value one.
|
||
One consequence of this is that you can write multiple assignments
|
||
together, such as:
|
||
|
||
x = y = z = 5
|
||
|
||
This example stores the value five in all three variables ('x', 'y', and
|
||
'z'). It does so because the value of 'z = 5', which is five, is stored
|
||
into 'y' and then the value of 'y = z = 5', which is five, is stored
|
||
into 'x'.
|
||
|
||
Assignments may be used anywhere an expression is called for. For
|
||
example, it is valid to write 'x != (y = 1)' to set 'y' to one, and then
test whether 'x' differs from one.  But this style tends to make
programs hard to read; such nesting of assignments should be avoided,
except perhaps in a one-shot program.
|
||
|
||
Aside from '=', there are several other assignment operators that do
|
||
arithmetic with the old value of the variable. For example, the
|
||
operator '+=' computes a new value by adding the righthand value to the
|
||
old value of the variable. Thus, the following assignment adds five to
|
||
the value of 'foo':
|
||
|
||
foo += 5
|
||
|
||
This is equivalent to the following:
|
||
|
||
foo = foo + 5
|
||
|
||
Use whichever makes the meaning of your program clearer.
|
||
|
||
There are situations where using '+=' (or any assignment operator) is
|
||
_not_ the same as simply repeating the lefthand operand in the righthand
|
||
expression. For example:
|
||
|
||
     # Thanks to Pat Rankin for this example
     BEGIN {
         foo[rand()] += 5
         for (x in foo)
             print x, foo[x]

         bar[rand()] = bar[rand()] + 5
         for (x in bar)
             print x, bar[x]
     }
|
||
|
||
The indices of 'bar' are practically guaranteed to be different, because
|
||
'rand()' returns different values each time it is called. (Arrays and
|
||
the 'rand()' function haven't been covered yet. *Note Arrays::, and
|
||
*note Numeric Functions::, for more information.) This example
|
||
illustrates an important fact about assignment operators: the lefthand
|
||
expression is only evaluated _once_.
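
   A deterministic variant of the same idea may make the point easier to
observe.  In this sketch, the user-defined function 'next_index()' and
the shell wrapper 'show_single_evaluation' are hypothetical stand-ins
for 'rand()'; the function counts how often the subscript expression is
evaluated.

```shell
show_single_evaluation() {
    awk '
    # Hypothetical helper with a visible side effect: it counts its calls.
    function next_index() { return ++calls }

    BEGIN {
        calls = 0
        foo[next_index()] += 5
        print "with +=: " calls " call(s)"

        calls = 0
        bar[next_index()] = bar[next_index()] + 5
        print "spelled out: " calls " call(s)"
    }'
}

show_single_evaluation
```

   The '+=' form should report one call, and the spelled-out form two,
regardless of the order in which the two sides are evaluated.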
|
||
|
||
It is up to the implementation as to which expression is evaluated
|
||
first, the lefthand or the righthand. Consider this example:
|
||
|
||
i = 1
|
||
a[i += 2] = i + 1
|
||
|
||
The value of 'a[3]' could be either two or four.
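
   One way to sidestep the ambiguity, offered as a sketch rather than a
rule, is to perform the side effect in a statement of its own.  The
function name 'portable_subscript' is hypothetical:

```shell
# portable_subscript is a hypothetical helper name.
portable_subscript() {
    awk 'BEGIN {
        i = 1
        i += 2          # perform the side effect in its own statement
        a[i] = i + 1    # now unambiguous: a[3] = 4
        print i, a[3]
    }'
}
portable_subscript
```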
|
||
|
||
*note Table 6.2: table-assign-ops. lists the arithmetic assignment
|
||
operators. In each case, the righthand operand is an expression whose
|
||
value is converted to a number.
|
||
|
||
     Operator                  Effect
     --------------------------------------------------------------------------
     LVALUE '+=' INCREMENT     Add INCREMENT to the value of LVALUE.
     LVALUE '-=' DECREMENT     Subtract DECREMENT from the value of LVALUE.
     LVALUE '*=' COEFFICIENT   Multiply the value of LVALUE by COEFFICIENT.
     LVALUE '/=' DIVISOR       Divide the value of LVALUE by DIVISOR.
     LVALUE '%=' MODULUS       Set LVALUE to its remainder by MODULUS.
     LVALUE '^=' POWER         Raise LVALUE to the power POWER.
     LVALUE '**=' POWER        Raise LVALUE to the power POWER.  (c.e.)
|
||
|
||
Table 6.2: Arithmetic assignment operators
|
||
|
||
NOTE: Only the '^=' operator is specified by POSIX. For maximum
|
||
portability, do not use the '**=' operator.
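
   The POSIX-specified operators from the table can be exercised one
after another, as in the following sketch.  The wrapper name
'assign_ops' and the starting value are arbitrary choices made for the
example:

```shell
# assign_ops is a hypothetical helper name.
assign_ops() {
    awk 'BEGIN {
        n = 10
        n += 5; print n    # 15
        n -= 3; print n    # 12
        n *= 2; print n    # 24
        n /= 4; print n    # 6
        n %= 4; print n    # 2
        n ^= 3; print n    # 8
    }'
}
assign_ops
```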
|
||
|
||
Syntactic Ambiguities Between '/=' and Regular Expressions
|
||
|
||
There is a syntactic ambiguity between the '/=' assignment operator
|
||
and regexp constants whose first character is an '='. (d.c.) This is
|
||
most notable in some commercial 'awk' versions. For example:
|
||
|
||
$ awk /==/ /dev/null
|
||
error-> awk: syntax error at source line 1
|
||
error-> context is
|
||
error-> >>> /= <<<
|
||
error-> awk: bailing out at source line 1
|
||
|
||
A workaround is:
|
||
|
||
awk '/[=]=/' /dev/null
|
||
|
||
'gawk' does not have this problem; BWK 'awk' and 'mawk' also do not.
|
||
|
||
|
||
File: gawk.info, Node: Increment Ops, Prev: Assignment Ops, Up: All Operators
|
||
|
||
6.2.4 Increment and Decrement Operators
|
||
---------------------------------------
|
||
|
||
"Increment" and "decrement operators" increase or decrease the value of
|
||
a variable by one. An assignment operator can do the same thing, so the
|
||
increment operators add no power to the 'awk' language; however, they
|
||
are convenient abbreviations for very common operations.
|
||
|
||
The operator used for adding one is written '++'. It can be used to
|
||
increment a variable either before or after taking its value. To
|
||
"pre-increment" a variable 'v', write '++v'. This adds one to the value
|
||
of 'v'--that new value is also the value of the expression. (The
|
||
assignment expression 'v += 1' is completely equivalent.) Writing the
|
||
'++' after the variable specifies "post-increment". This increments the
|
||
variable value just the same; the difference is that the value of the
|
||
increment expression itself is the variable's _old_ value. Thus, if
|
||
'foo' has the value four, then the expression 'foo++' has the value
|
||
four, but it changes the value of 'foo' to five. In other words, the
|
||
operator returns the old value of the variable, but with the side effect
|
||
of incrementing it.
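
   A small sketch of the difference between the two forms follows.  The
names 'incr_demo', 'old', and 'cur' are illustrative assumptions:

```shell
# incr_demo is a hypothetical helper name.
incr_demo() {
    awk 'BEGIN {
        foo = 4
        old = foo++        # post-increment: yields 4, then foo becomes 5
        print old, foo
        cur = ++foo        # pre-increment: foo becomes 6, yields 6
        print cur, foo
    }'
}
incr_demo
```

   The increments are kept in separate statements so that the result
does not depend on the order in which an implementation evaluates the
arguments of 'print'.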
|
||
|
||
The post-increment 'foo++' is nearly the same as writing '(foo += 1)
|
||
- 1'. It is not perfectly equivalent because all numbers in 'awk' are
|
||
floating point--in floating point, 'foo + 1 - 1' does not necessarily
|
||
equal 'foo'. But the difference is minute as long as you stick to
|
||
numbers that are fairly small (less than 10e12).
|
||
|
||
Fields and array elements are incremented just like variables. (Use
|
||
'$(i++)' when you want to do a field reference and a variable increment
|
||
at the same time. The parentheses are necessary because of the
|
||
precedence of the field reference operator '$'.)
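
   For instance, a loop over the fields of a record might be sketched
as follows (the wrapper name 'each_field' and the sample input are
assumptions for the example):

```shell
# each_field is a hypothetical helper name.
each_field() {
    echo 'a b c' | awk '{
        i = 1
        while (i <= NF)
            print $(i++)    # read field i, then advance i
    }'
}
each_field
```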
|
||
|
||
The decrement operator '--' works just like '++', except that it
|
||
subtracts one instead of adding it. As with '++', it can be used before
|
||
the lvalue to pre-decrement or after it to post-decrement. Following is
|
||
a summary of increment and decrement expressions:
|
||
|
||
'++LVALUE'
|
||
Increment LVALUE, returning the new value as the value of the
|
||
expression.
|
||
|
||
'LVALUE++'
|
||
Increment LVALUE, returning the _old_ value of LVALUE as the value
|
||
of the expression.
|
||
|
||
'--LVALUE'
|
||
Decrement LVALUE, returning the new value as the value of the
|
||
expression. (This expression is like '++LVALUE', but instead of
|
||
adding, it subtracts.)
|
||
|
||
'LVALUE--'
|
||
Decrement LVALUE, returning the _old_ value of LVALUE as the value
|
||
of the expression. (This expression is like 'LVALUE++', but
|
||
instead of adding, it subtracts.)
|
||
|
||
Operator Evaluation Order
|
||
|
||
Doctor, it hurts when I do this!
|
||
Then don't do that!
|
||
-- _Groucho Marx_
|
||
|
||
What happens for something like the following?
|
||
|
||
b = 6
|
||
print b += b++
|
||
|
||
Or something even stranger?
|
||
|
||
b = 6
|
||
b += ++b + b++
|
||
print b
|
||
|
||
In other words, when do the various side effects prescribed by the
|
||
postfix operators ('b++') take effect? When side effects happen is
|
||
"implementation-defined". In other words, it is up to the particular
|
||
version of 'awk'. The result for the first example may be 12 or 13, and
|
||
for the second, it may be 22 or 23.
|
||
|
||
In short, doing things like this is not recommended and definitely
|
||
not anything that you can rely upon for portability. You should avoid
|
||
such things in your own programs.
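
   If the intent is "add the old value of 'b' to the incremented 'b'",
one possible portable spelling keeps each side effect in its own
statement.  This is only a sketch of one reading of the first example;
the names 'explicit_order' and 'old' are hypothetical:

```shell
# explicit_order is a hypothetical helper name.
explicit_order() {
    awk 'BEGIN {
        b = 6
        old = b      # value to add, captured before any side effect
        b++          # increment as its own statement: b is 7
        b += old     # 7 + 6
        print b
    }'
}
explicit_order
```

   Written this way, the result no longer depends on when a particular
'awk' applies the increment.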
|
||
|
||
|
||
File: gawk.info, Node: Truth Values and Conditions, Next: Function Calls, Prev: All Operators, Up: Expressions
|
||
|
||
6.3 Truth Values and Conditions
|
||
===============================
|
||
|
||
In certain contexts, expression values also serve as "truth values";
|
||
i.e., they determine what should happen next as the program runs. This
|
||
minor node describes how 'awk' defines "true" and "false" and how values
|
||
are compared.
|
||
|
||
* Menu:
|
||
|
||
* Truth Values:: What is "true" and what is "false".
|
||
* Typing and Comparison:: How variables acquire types and how this
|
||
affects comparison of numbers and strings with
|
||
'<', etc.
|
||
* Boolean Ops:: Combining comparison expressions using boolean
|
||
operators '||' ("or"), '&&'
|
||
("and") and '!' ("not").
|
||
* Conditional Exp:: Conditional expressions select between two
|
||
subexpressions under control of a third
|
||
subexpression.
|
||
|
||
|
||
File: gawk.info, Node: Truth Values, Next: Typing and Comparison, Up: Truth Values and Conditions
|
||
|
||
6.3.1 True and False in 'awk'
|
||
-----------------------------
|
||
|
||
Many programming languages have a special representation for the
|
||
concepts of "true" and "false." Such languages usually use the special
|
||
constants 'true' and 'false', or perhaps their uppercase equivalents.
|
||
However, 'awk' is different. It borrows a very simple concept of true
|
||
and false from C. In 'awk', any nonzero numeric value _or_ any nonempty
|
||
string value is true. Any other value (zero or the null string, '""')
|
||
is false. The following program prints 'A strange truth value' three
|
||
times:
|
||
|
||
BEGIN {
|
||
if (3.1415927)
|
||
print "A strange truth value"
|
||
if ("Four Score And Seven Years Ago")
|
||
print "A strange truth value"
|
||
if (j = 57)
|
||
print "A strange truth value"
|
||
}
|
||
|
||
There is a surprising consequence of the "nonzero or non-null" rule:
|
||
the string constant '"0"' is actually true, because it is non-null.
|
||
(d.c.)
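
   The rule can be probed with a few constants, as in this sketch (the
wrapper name 'truth_demo' is an assumption):

```shell
# truth_demo is a hypothetical helper name.
truth_demo() {
    awk 'BEGIN {
        print "number 0:",     (0   ? "true" : "false")
        print "string 0:",     ("0" ? "true" : "false")
        print "empty string:", (""  ? "true" : "false")
        print "number 0.1:",   (0.1 ? "true" : "false")
    }'
}
truth_demo
```

   Only the numeric zero and the empty string should be reported as
false; the string constant '"0"' is non-null and therefore true.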
|
||
|
||
|
||
File: gawk.info, Node: Typing and Comparison, Next: Boolean Ops, Prev: Truth Values, Up: Truth Values and Conditions
|
||
|
||
6.3.2 Variable Typing and Comparison Expressions
|
||
------------------------------------------------
|
||
|
||
The Guide is definitive. Reality is frequently inaccurate.
|
||
-- _Douglas Adams, 'The Hitchhiker's Guide to the Galaxy'_
|
||
|
||
 Unlike in many other programming languages, in 'awk' variables do not have
|
||
a fixed type. Instead, they can be either a number or a string,
|
||
depending upon the value that is assigned to them. We look now at how
|
||
variables are typed, and how 'awk' compares variables.
|
||
|
||
* Menu:
|
||
|
||
* Variable Typing:: String type versus numeric type.
|
||
* Comparison Operators:: The comparison operators.
|
||
* POSIX String Comparison:: String comparison with POSIX rules.
|
||
|
||
|
||
File: gawk.info, Node: Variable Typing, Next: Comparison Operators, Up: Typing and Comparison
|
||
|
||
6.3.2.1 String Type versus Numeric Type
|
||
.......................................
|
||
|
||
The POSIX standard introduced the concept of a "numeric string", which
|
||
is simply a string that looks like a number--for example, '" +2"'. This
|
||
concept is used for determining the type of a variable. The type of the
|
||
variable is important because the types of two variables determine how
|
||
they are compared. Variable typing follows these rules:
|
||
|
||
* A numeric constant or the result of a numeric operation has the
|
||
"numeric" attribute.
|
||
|
||
* A string constant or the result of a string operation has the
|
||
"string" attribute.
|
||
|
||
* Fields, 'getline' input, 'FILENAME', 'ARGV' elements, 'ENVIRON'
|
||
elements, and the elements of an array created by 'match()',
|
||
'split()', and 'patsplit()' that are numeric strings have the
|
||
"strnum" attribute. Otherwise, they have the "string" attribute.
|
||
Uninitialized variables also have the "strnum" attribute.
|
||
|
||
* Attributes propagate across assignments but are not changed by any
|
||
use.
|
||
|
||
The last rule is particularly important. In the following program,
|
||
'a' has numeric type, even though it is later used in a string
|
||
operation:
|
||
|
||
BEGIN {
|
||
a = 12.345
|
||
b = a " is a cute number"
|
||
print b
|
||
}
|
||
|
||
When two operands are compared, either string comparison or numeric
|
||
comparison may be used. This depends upon the attributes of the
|
||
operands, according to the following symmetric matrix:
|
||
|
||
             +----------------------------------------------
             |       STRING          NUMERIC         STRNUM
     --------+----------------------------------------------
             |
     STRING  |       string          string          string
             |
     NUMERIC |       string          numeric         numeric
             |
     STRNUM  |       string          numeric         numeric
     --------+----------------------------------------------
|
||
|
||
The basic idea is that user input that looks numeric--and _only_ user
|
||
input--should be treated as numeric, even though it is actually made of
|
||
characters and is therefore also a string. Thus, for example, the
|
||
string constant '" +3.14"', when it appears in program source code, is a
|
||
string--even though it looks numeric--and is _never_ treated as a number
|
||
for comparison purposes.
|
||
|
||
In short, when one operand is a "pure" string, such as a string
|
||
constant, then a string comparison is performed. Otherwise, a numeric
|
||
comparison is performed.
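
   When a particular kind of comparison is wanted regardless of how the
data arrived, a common idiom is to concatenate the empty string (to get
a string) or to add zero (to get a number).  The following sketch
assumes the input line '1e2 100'; the name 'force_compare' is
hypothetical:

```shell
# force_compare is a hypothetical helper name.
force_compare() {
    echo '1e2 100' | awk '{
        print ($1 == $2)                # both strnum: numeric comparison
        print (($1 "") == ($2 ""))      # concatenation yields strings
        print (($1 + 0) == ($2 + 0))    # arithmetic yields numbers
    }'
}
force_compare
```

   The two fields are numerically equal but differ as strings, so the
middle comparison should be the only one that prints zero.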
|
||
|
||
This point bears additional emphasis: All user input is made of
|
||
characters, and so is first and foremost of string type; input strings
|
||
that look numeric are additionally given the strnum attribute. Thus,
|
||
the six-character input string ' +3.14' receives the strnum attribute.
|
||
In contrast, the eight characters '" +3.14"' appearing in program text
|
||
comprise a string constant. The following examples print '1' when the
|
||
comparison between the two different constants is true, and '0'
|
||
otherwise:
|
||
|
||
     $ echo ' +3.14' | awk '{ print($0 == " +3.14") }'    True
     -| 1
     $ echo ' +3.14' | awk '{ print($0 == "+3.14") }'     False
     -| 0
     $ echo ' +3.14' | awk '{ print($0 == "3.14") }'      False
     -| 0
     $ echo ' +3.14' | awk '{ print($0 == 3.14) }'        True
     -| 1
     $ echo ' +3.14' | awk '{ print($1 == " +3.14") }'    False
     -| 0
     $ echo ' +3.14' | awk '{ print($1 == "+3.14") }'     True
     -| 1
     $ echo ' +3.14' | awk '{ print($1 == "3.14") }'      False
     -| 0
     $ echo ' +3.14' | awk '{ print($1 == 3.14) }'        True
     -| 1
|
||
|
||
|
||
File: gawk.info, Node: Comparison Operators, Next: POSIX String Comparison, Prev: Variable Typing, Up: Typing and Comparison
|
||
|
||
6.3.2.2 Comparison Operators
|
||
............................
|
||
|
||
"Comparison expressions" compare strings or numbers for relationships
|
||
such as equality. They are written using "relational operators", which
|
||
are a superset of those in C. *note Table 6.3: table-relational-ops.
|
||
describes them.
|
||
|
||
     Expression             Result
     --------------------------------------------------------------------------
     X '<' Y                True if X is less than Y
     X '<=' Y               True if X is less than or equal to Y
     X '>' Y                True if X is greater than Y
     X '>=' Y               True if X is greater than or equal to Y
     X '==' Y               True if X is equal to Y
     X '!=' Y               True if X is not equal to Y
     X '~' Y                True if the string X matches the regexp denoted by Y
     X '!~' Y               True if the string X does not match the regexp
                            denoted by Y
     SUBSCRIPT 'in' ARRAY   True if the array ARRAY has an element with the
                            subscript SUBSCRIPT
|
||
|
||
Table 6.3: Relational operators
|
||
|
||
Comparison expressions have the value one if true and zero if false.
|
||
When comparing operands of mixed types, numeric operands are converted
|
||
to strings using the value of 'CONVFMT' (*note Conversion::).
|
||
|
||
Strings are compared by comparing the first character of each, then
|
||
the second character of each, and so on. Thus, '"10"' is less than
|
||
'"9"'. If there are two strings where one is a prefix of the other, the
|
||
shorter string is less than the longer one. Thus, '"abc"' is less than
|
||
'"abcd"'.
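
   The contrast between numeric and string ordering can be sketched as
follows.  The input line '10 9' and the wrapper name 'string_vs_number'
are assumptions made for the example:

```shell
# string_vs_number is a hypothetical helper name.
string_vs_number() {
    echo '10 9' | awk '{
        print ($1 < $2)          # fields that look numeric: 10 < 9 is false
        print ("10" < "9")       # string constants: "1" sorts before "9"
        print ("abc" < "abcd")   # a prefix is less than the longer string
    }'
}
string_vs_number
```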
|
||
|
||
It is very easy to accidentally mistype the '==' operator and leave
|
||
off one of the '=' characters. The result is still valid 'awk' code,
|
||
but the program does not do what is intended:
|
||
|
||
if (a = b) # oops! should be a == b
|
||
...
|
||
else
|
||
...
|
||
|
||
Unless 'b' happens to be zero or the null string, the 'if' part of the
|
||
test always succeeds. Because the operators are so similar, this kind
|
||
of error is very difficult to spot when scanning the source code.
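
   The effect of the slip can be seen in a short sketch (the values and
the name 'assign_vs_equal' are arbitrary):

```shell
# assign_vs_equal is a hypothetical helper name.
assign_vs_equal() {
    awk 'BEGIN {
        a = 1; b = 2
        if (a == b)
            print "equal"
        else
            print "not equal"
        if (a = b)             # assignment: a becomes 2, which is true
            print "a is now", a
    }'
}
assign_vs_equal
```

   The second test succeeds even though the two variables started out
different, and it silently changes 'a' as well.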
|
||
|
||
The following list of expressions illustrates the kinds of
|
||
comparisons 'awk' performs, as well as what the result of each
|
||
comparison is:
|
||
|
||
'1.5 <= 2.0'
|
||
Numeric comparison (true)
|
||
|
||
'"abc" >= "xyz"'
|
||
String comparison (false)
|
||
|
||
'1.5 != " +2"'
|
||
String comparison (true)
|
||
|
||
'"1e2" < "3"'
|
||
String comparison (true)
|
||
|
||
'a = 2; b = "2"'
|
||
'a == b'
|
||
String comparison (true)
|
||
|
||
'a = 2; b = " +2"'
|
||
'a == b'
|
||
String comparison (false)
|
||
|
||
In this example:
|
||
|
||
$ echo 1e2 3 | awk '{ print ($1 < $2) ? "true" : "false" }'
|
||
-| false
|
||
|
||
the result is 'false' because both '$1' and '$2' are user input. They
|
||
are numeric strings--therefore both have the strnum attribute, dictating
|
||
a numeric comparison. The purpose of the comparison rules and the use
|
||
of numeric strings is to attempt to produce the behavior that is "least
|
||
surprising," while still "doing the right thing."
|
||
|
||
String comparisons and regular expression comparisons are very
|
||
different. For example:
|
||
|
||
x == "foo"
|
||
|
||
has the value one (i.e., is true) if the variable 'x' is precisely 'foo'.
|
||
By contrast:
|
||
|
||
x ~ /foo/
|
||
|
||
has the value one if 'x' contains 'foo', such as '"Oh, what a fool am
|
||
I!"'.
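
   Both tests can be run side by side, as in this sketch (the wrapper
name 'exact_vs_match' is an assumption):

```shell
# exact_vs_match is a hypothetical helper name.
exact_vs_match() {
    awk 'BEGIN {
        x = "Oh, what a fool am I!"
        print (x == "foo")    # 0: x is not exactly foo
        print (x ~ /foo/)     # 1: x contains foo
        x = "foo"
        print (x == "foo")    # 1
    }'
}
exact_vs_match
```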
|
||
|
||
The righthand operand of the '~' and '!~' operators may be either a
|
||
regexp constant ('/'...'/') or an ordinary expression. In the latter
|
||
case, the value of the expression as a string is used as a dynamic
|
||
regexp (*note Regexp Usage::; also *note Computed Regexps::).
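
   A minimal sketch of a dynamic regexp held in a variable follows.
The variable name 're' and the wrapper name 'dynamic_regexp' are
illustrative only:

```shell
# dynamic_regexp is a hypothetical helper name.
dynamic_regexp() {
    awk 'BEGIN {
        re = "^a+b$"               # an ordinary string used as a regexp
        print ("aaab" ~ re)
        print ("abc" ~ re)
        print ("abc" !~ re)
    }'
}
dynamic_regexp
```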
|
||
|
||
A constant regular expression in slashes by itself is also an
|
||
expression. '/REGEXP/' is an abbreviation for the following comparison
|
||
expression:
|
||
|
||
$0 ~ /REGEXP/
|
||
|
||
One special place where '/foo/' is _not_ an abbreviation for '$0 ~
|
||
/foo/' is when it is the righthand operand of '~' or '!~'. *Note Using
|
||
Constant Regexps::, where this is discussed in more detail.
|
||
|
||
|
||
File: gawk.info, Node: POSIX String Comparison, Prev: Comparison Operators, Up: Typing and Comparison
|
||
|
||
6.3.2.3 String Comparison with POSIX Rules
|
||
..........................................
|
||
|
||
The POSIX standard says that string comparison is performed based on the
|
||
locale's "collating order". This is the order in which characters sort,
|
||
as defined by the locale (for more discussion, *note Locales::). This
|
||
order is usually very different from the results obtained when doing
|
||
straight character-by-character comparison.(1)
|
||
|
||
Because this behavior differs considerably from existing practice,
|
||
'gawk' only implements it when in POSIX mode (*note Options::). Here is
|
||
an example to illustrate the difference, in an 'en_US.UTF-8' locale:
|
||
|
||
$ gawk 'BEGIN { printf("ABC < abc = %s\n",
|
||
> ("ABC" < "abc" ? "TRUE" : "FALSE")) }'
|
||
-| ABC < abc = TRUE
|
||
$ gawk --posix 'BEGIN { printf("ABC < abc = %s\n",
|
||
> ("ABC" < "abc" ? "TRUE" : "FALSE")) }'
|
||
-| ABC < abc = FALSE
|
||
|
||
---------- Footnotes ----------
|
||
|
||
(1) Technically, string comparison is supposed to behave the same way
|
||
as if the strings were compared with the C 'strcoll()' function.
|
||
|
||
|
||
File: gawk.info, Node: Boolean Ops, Next: Conditional Exp, Prev: Typing and Comparison, Up: Truth Values and Conditions
|
||
|
||
6.3.3 Boolean Expressions
|
||
-------------------------
|
||
|
||
A "Boolean expression" is a combination of comparison expressions or
|
||
matching expressions, using the Boolean operators "or" ('||'), "and"
|
||
('&&'), and "not" ('!'), along with parentheses to control nesting. The
|
||
truth value of the Boolean expression is computed by combining the truth
|
||
values of the component expressions. Boolean expressions are also
|
||
referred to as "logical expressions". The terms are equivalent.
|
||
|
||
Boolean expressions can be used wherever comparison and matching
|
||
expressions can be used. They can be used in 'if', 'while', 'do', and
|
||
'for' statements (*note Statements::). They have numeric values (one if
|
||
true, zero if false) that come into play if the result of the Boolean
|
||
expression is stored in a variable or used in arithmetic.
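
   Because the value is one or zero, a Boolean expression can be added
directly to a counter.  The sketch below counts records that contain
both 'edu' and 'li'; the sample input and the name 'count_both' are
assumptions:

```shell
# count_both is a hypothetical helper name.
count_both() {
    printf 'a.edu li\nb.edu\nc li\n' | awk '
        { n += ($0 ~ /edu/ && $0 ~ /li/) }   # adds 1 or 0 per record
        END { print n + 0 }'
}
count_both
```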
|
||
|
||
In addition, every Boolean expression is also a valid pattern, so you
|
||
can use one as a pattern to control the execution of rules. The Boolean
|
||
operators are:
|
||
|
||
'BOOLEAN1 && BOOLEAN2'
|
||
True if both BOOLEAN1 and BOOLEAN2 are true. For example, the
|
||
following statement prints the current input record if it contains
|
||
both 'edu' and 'li':
|
||
|
||
if ($0 ~ /edu/ && $0 ~ /li/) print
|
||
|
||
The subexpression BOOLEAN2 is evaluated only if BOOLEAN1 is true.
|
||
This can make a difference when BOOLEAN2 contains expressions that
|
||
have side effects. In the case of '$0 ~ /foo/ && ($2 == bar++)',
|
||
the variable 'bar' is not incremented if there is no substring
|
||
'foo' in the record.
|
||
|
||
'BOOLEAN1 || BOOLEAN2'
|
||
True if at least one of BOOLEAN1 or BOOLEAN2 is true. For example,
|
||
the following statement prints all records in the input that
|
||
contain _either_ 'edu' or 'li':
|
||
|
||
if ($0 ~ /edu/ || $0 ~ /li/) print
|
||
|
||
The subexpression BOOLEAN2 is evaluated only if BOOLEAN1 is false.
|
||
This can make a difference when BOOLEAN2 contains expressions that
|
||
have side effects. (Thus, this test never really distinguishes
|
||
records that contain both 'edu' and 'li'--as soon as 'edu' is
|
||
matched, the full test succeeds.)
|
||
|
||
'! BOOLEAN'
|
||
True if BOOLEAN is false. For example, the following program
|
||
prints 'no home!' in the unusual event that the 'HOME' environment
|
||
variable is not defined:
|
||
|
||
BEGIN { if (! ("HOME" in ENVIRON))
|
||
print "no home!" }
|
||
|
||
(The 'in' operator is described in *note Reference to Elements::.)
|
||
|
||
The '&&' and '||' operators are called "short-circuit" operators
|
||
because of the way they work. Evaluation of the full expression is
|
||
"short-circuited" if the result can be determined partway through its
|
||
evaluation.
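
   Short-circuiting becomes visible when the righthand operand has a
side effect.  In this sketch the user-defined function 'note()' (a
hypothetical name, as is the wrapper 'short_circuit') records each call
in the variable 'seen':

```shell
# short_circuit and note are hypothetical names for this sketch.
short_circuit() {
    awk 'function note(label) { seen = seen label; return 1 }
    BEGIN {
        r = (0 && note("A"))    # note("A") is skipped
        s = (1 || note("B"))    # note("B") is skipped
        t = (1 && note("C"))    # note("C") runs
        print r, s, t, seen
    }'
}
short_circuit
```

   Only the third call should leave a trace in 'seen'.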
|
||
|
||
Statements that end with '&&' or '||' can be continued simply by
|
||
putting a newline after them. But you cannot put a newline in front of
|
||
either of these operators without using backslash continuation (*note
|
||
Statements/Lines::).
|
||
|
||
The actual value of an expression using the '!' operator is either
|
||
one or zero, depending upon the truth value of the expression it is
|
||
applied to. The '!' operator is often useful for changing the sense of
|
||
a flag variable from false to true and back again. For example, the
|
||
following program is one way to print lines in between special
|
||
bracketing lines:
|
||
|
||
$1 == "START" { interested = ! interested; next }
|
||
interested { print }
|
||
$1 == "END" { interested = ! interested; next }
|
||
|
||
The variable 'interested', as with all 'awk' variables, starts out
|
||
initialized to zero, which is also false. When a line is seen whose
|
||
first field is 'START', the value of 'interested' is toggled to true,
|
||
using '!'. The next rule prints lines as long as 'interested' is true.
|
||
When a line is seen whose first field is 'END', 'interested' is toggled
|
||
back to false.(1)
|
||
|
||
Most commonly, the '!' operator is used in the conditions of 'if' and
|
||
'while' statements, where it often makes more sense to phrase the logic
|
||
in the negative:
|
||
|
||
if (! SOME CONDITION || SOME OTHER CONDITION) {
|
||
... DO WHATEVER PROCESSING ...
|
||
}
|
||
|
||
NOTE: The 'next' statement is discussed in *note Next Statement::.
|
||
'next' tells 'awk' to skip the rest of the rules, get the next
|
||
record, and start processing the rules over again at the top. The
|
||
reason it's there is to avoid printing the bracketing 'START' and
|
||
'END' lines.
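
   A variant of the bracketing program, sketched under the assumption
that neither marker line should be printed, tests for 'END' before the
printing rule and sets the flag explicitly instead of toggling it.  The
sample input and the name 'between_markers' are made up for the
example:

```shell
# between_markers is a hypothetical helper name.
between_markers() {
    printf 'x\nSTART\na\nb\nEND\ny\n' | awk '
        $1 == "START" { interested = 1; next }
        $1 == "END"   { interested = 0; next }   # tested before the print rule
        interested    { print }'
}
between_markers
```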
|
||
|
||
---------- Footnotes ----------
|
||
|
||
(1) This program has a bug; it prints lines starting with 'END'. How
|
||
would you fix it?
|
||
|
||
|
||
File: gawk.info, Node: Conditional Exp, Prev: Boolean Ops, Up: Truth Values and Conditions
|
||
|
||
6.3.4 Conditional Expressions
|
||
-----------------------------
|
||
|
||
A "conditional expression" is a special kind of expression that has
|
||
three operands. It allows you to use one expression's value to select
|
||
one of two other expressions. The conditional expression in 'awk' is
|
||
the same as in the C language, as shown here:
|
||
|
||
SELECTOR ? IF-TRUE-EXP : IF-FALSE-EXP
|
||
|
||
There are three subexpressions. The first, SELECTOR, is always computed
|
||
first. If it is "true" (nonzero or non-null), then IF-TRUE-EXP is
|
||
computed next, and its value becomes the value of the whole expression.
|
||
Otherwise, IF-FALSE-EXP is computed next, and its value becomes the
|
||
value of the whole expression. For example, the following expression
|
||
produces the absolute value of 'x':
|
||
|
||
x >= 0 ? x : -x
|
||
|
||
Each time the conditional expression is computed, only one of
|
||
IF-TRUE-EXP and IF-FALSE-EXP is used; the other is ignored. This is
|
||
important when the expressions have side effects. For example, this
|
||
conditional expression examines element 'i' of either array 'a' or array
|
||
'b', and increments 'i':
|
||
|
||
x == y ? a[i++] : b[i++]
|
||
|
||
This is guaranteed to increment 'i' exactly once, because each time only
|
||
one of the two increment expressions is executed and the other is not.
|
||
*Note Arrays::, for more information about arrays.
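
   Both points can be checked together in a short sketch.  The array
contents and the wrapper name 'conditional_demo' are assumptions:

```shell
# conditional_demo is a hypothetical helper name.
conditional_demo() {
    awk 'BEGIN {
        x = -3
        print (x >= 0 ? x : -x)            # absolute value
        a[1] = "from a"; b[1] = "from b"
        i = 1; y = 0
        v = (x == y ? a[i++] : b[i++])     # only one branch is evaluated
        print v, i
    }'
}
conditional_demo
```

   After the conditional expression, 'i' has been incremented exactly
once, from one to two.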
|
||
|
||
As a minor 'gawk' extension, a statement that uses '?:' can be
|
||
continued simply by putting a newline after either character. However,
|
||
putting a newline in front of either character does not work without
|
||
using backslash continuation (*note Statements/Lines::). If '--posix'
|
||
is specified (*note Options::), this extension is disabled.
|
||
|
||
|
||
File: gawk.info, Node: Function Calls, Next: Precedence, Prev: Truth Values and Conditions, Up: Expressions
|
||
|
||
6.4 Function Calls
|
||
==================
|
||
|
||
A "function" is a name for a particular calculation. This enables you
|
||
to ask for it by name at any point in the program. For example, the
|
||
function 'sqrt()' computes the square root of a number.
|
||
|
||
A fixed set of functions are "built in", which means they are
|
||
available in every 'awk' program. The 'sqrt()' function is one of
|
||
these. *Note Built-in::, for a list of built-in functions and their
|
||
descriptions. In addition, you can define functions for use in your
|
||
program. *Note User-defined::, for instructions on how to do this.
|
||
Finally, 'gawk' lets you write functions in C or C++ that may be called
|
||
from your program (*note Dynamic Extensions::).
|
||
|
||
The way to use a function is with a "function call" expression, which
|
||
consists of the function name followed immediately by a list of
|
||
"arguments" in parentheses. The arguments are expressions that provide
|
||
the raw materials for the function's calculations. When there is more
|
||
than one argument, they are separated by commas. If there are no
|
||
arguments, just write '()' after the function name. The following
|
||
examples show function calls with and without arguments:
|
||
|
||
     sqrt(x^2 + y^2)        one argument
     atan2(y, x)            two arguments
     rand()                 no arguments
|
||
|
||
CAUTION: Do not put any space between the function name and the
|
||
opening parenthesis! A user-defined function name looks just like
|
||
the name of a variable--a space would make the expression look like
|
||
concatenation of a variable with an expression inside parentheses.
|
||
With built-in functions, space before the parenthesis is harmless,
|
||
 but it is best not to get into the habit of using space, so as to avoid
|
||
mistakes with user-defined functions.
|
||
|
||
Each function expects a particular number of arguments. For example,
|
||
the 'sqrt()' function must be called with a single argument, the number
|
||
of which to take the square root:
|
||
|
||
sqrt(ARGUMENT)
|
||
|
||
Some of the built-in functions have one or more optional arguments.
|
||
If those arguments are not supplied, the functions use a reasonable
|
||
default value. *Note Built-in::, for full details. If arguments are
|
||
omitted in calls to user-defined functions, then those arguments are
|
||
treated as local variables. Such local variables act like the empty
|
||
string if referenced where a string value is required, and like zero if
|
||
referenced where a numeric value is required (*note User-defined::).
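
   The behavior of omitted arguments can be sketched with a small
user-defined function.  The names 'show' and 'omitted_args' are
hypothetical:

```shell
# omitted_args and show are hypothetical names for this sketch.
omitted_args() {
    awk 'function show(s, n) { return "[" s "]" (n + 0) }
    BEGIN {
        print show("x", 2)    # both arguments supplied
        print show()          # s acts as "", n acts as 0
    }'
}
omitted_args
```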
|
||
|
||
As an advanced feature, 'gawk' provides indirect function calls,
|
||
which is a way to choose the function to call at runtime, instead of
|
||
when you write the source code to your program. We defer discussion of
|
||
this feature until later; see *note Indirect Calls::.
|
||
|
||
Like every other expression, the function call has a value, often
|
||
called the "return value", which is computed by the function based on
|
||
the arguments you give it. In this example, the return value of
|
||
'sqrt(ARGUMENT)' is the square root of ARGUMENT. The following program
|
||
reads numbers, one number per line, and prints the square root of each
|
||
one:
|
||
|
||
$ awk '{ print "The square root of", $1, "is", sqrt($1) }'
|
||
1
|
||
-| The square root of 1 is 1
|
||
3
|
||
-| The square root of 3 is 1.73205
|
||
5
|
||
-| The square root of 5 is 2.23607
|
||
Ctrl-d
|
||
|
||
A function can also have side effects, such as assigning values to
|
||
certain variables or doing I/O. This program shows how the 'match()'
|
||
function (*note String Functions::) changes the variables 'RSTART' and
|
||
'RLENGTH':
|
||
|
||
{
|
||
if (match($1, $2))
|
||
print RSTART, RLENGTH
|
||
else
|
||
print "no match"
|
||
}
|
||
|
||
Here is a sample run:
|
||
|
||
$ awk -f matchit.awk
|
||
aaccdd c+
|
||
-| 3 2
|
||
foo bar
|
||
-| no match
|
||
abcdefg e
|
||
-| 5 1
|
||
|
||
|
||
File: gawk.info, Node: Precedence, Next: Locales, Prev: Function Calls, Up: Expressions
|
||
|
||
6.5 Operator Precedence (How Operators Nest)
|
||
============================================
|
||
|
||
"Operator precedence" determines how operators are grouped when
|
||
different operators appear close by in one expression. For example, '*'
|
||
has higher precedence than '+'; thus, 'a + b * c' means to multiply 'b'
|
||
and 'c', and then add 'a' to the product (i.e., 'a + (b * c)').
|
||
|
||
The normal precedence of the operators can be overruled by using
|
||
parentheses. Think of the precedence rules as saying where the
|
||
parentheses are assumed to be. In fact, it is wise to always use
|
||
parentheses whenever there is an unusual combination of operators,
|
||
because other people who read the program may not remember what the
|
||
precedence is in this case. Even experienced programmers occasionally
|
||
forget the exact rules, which leads to mistakes. Explicit parentheses
|
||
help prevent any such mistakes.
|
||
|
||
When operators of equal precedence are used together, the leftmost
|
||
operator groups first, except for the assignment, conditional, and
|
||
exponentiation operators, which group in the opposite order. Thus, 'a -
|
||
b + c' groups as '(a - b) + c' and 'a = b = c' groups as 'a = (b = c)'.
|
||
|
||
Normally the precedence of prefix unary operators does not matter,
|
||
because there is only one way to interpret them: innermost first. Thus,
|
||
'$++i' means '$(++i)' and '++$x' means '++($x)'. However, when another
|
||
operator follows the operand, then the precedence of the unary operators
|
||
can matter. '$x^2' means '($x)^2', but '-x^2' means '-(x^2)', because
|
||
'-' has lower precedence than '^', whereas '$' has higher precedence.
|
||
Also, operators cannot be combined in a way that violates the precedence
|
||
rules; for example, '$$0++--' is not a valid expression because the
|
||
first '$' has higher precedence than the '++'; to avoid the problem the
|
||
expression can be rewritten as '$($0++)--'.
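
   The grouping rules described so far can be checked with a few
expressions.  The input line '3 4' and the wrapper name
'unary_precedence' are assumptions for this sketch:

```shell
# unary_precedence is a hypothetical helper name.
unary_precedence() {
    echo '3 4' | awk '{
        x = 2
        print -x^2        # -(x^2)
        print $x^2        # ($x)^2, the square of field 2
        print 2^3^2       # 2^(3^2)
        print 7 - 2 + 1   # (7 - 2) + 1
    }'
}
unary_precedence
```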
|
||
|
||
This list presents 'awk''s operators, in order of highest to lowest
|
||
precedence:
|
||
|
||
'('...')'
     Grouping.

'$'
     Field reference.

'++ --'
     Increment, decrement.

'^ **'
     Exponentiation. These operators group right to left.

'+ - !'
     Unary plus, minus, logical "not."

'* / %'
     Multiplication, division, remainder.

'+ -'
     Addition, subtraction.

String concatenation
     There is no special symbol for concatenation. The operands are
     simply written side by side (*note Concatenation::).

'< <= == != > >= >> | |&'
     Relational and redirection. The relational operators and the
     redirections have the same precedence level. Characters such as
     '>' serve both as relationals and as redirections; the context
     distinguishes between the two meanings.

     Note that the I/O redirection operators in 'print' and 'printf'
     statements belong to the statement level, not to expressions. The
     redirection does not produce an expression that could be the
     operand of another operator. As a result, it does not make sense
     to use a redirection operator near another operator of lower
     precedence without parentheses. Such combinations (e.g., 'print
     foo > a ? b : c') result in syntax errors. The correct way to
     write this statement is 'print foo > (a ? b : c)'.

'~ !~'
     Matching, nonmatching.

'in'
     Array membership.

'&&'
     Logical "and."

'||'
     Logical "or."

'?:'
     Conditional. This operator groups right to left.

'= += -= *= /= %= ^= **='
     Assignment. These operators group right to left.

     NOTE: The '|&', '**', and '**=' operators are not specified by
     POSIX. For maximum portability, do not use them.
|
||
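As a quick check of a few entries in the precedence list, the following
sketch prints values whose results depend on precedence and grouping.
The sample expressions are chosen for illustration and are not part of
the original text:

```shell
out=$(awk 'BEGIN {
    print -2 ^ 2        # exponentiation binds tighter than unary minus
    print 2 ^ 3 ^ 2     # ^ groups right to left, so this is 2 ^ 9
    print 1 " " 2 + 3   # addition binds tighter than concatenation
    x = y = 4           # assignment groups right to left
    print x + y
}')
printf '%s\n' "$out"
```

When in doubt about how an expression groups, adding parentheses is
always safe and usually easier to read.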
|
||
|
||
File: gawk.info, Node: Locales, Next: Expressions Summary, Prev: Precedence, Up: Expressions
|
||
|
||
6.6 Where You Are Makes a Difference
|
||
====================================
|
||
|
||
Modern systems support the notion of "locales": a way to tell the system
|
||
about the local character set and language. The ISO C standard defines
|
||
a default '"C"' locale, which is an environment that is typical of what
|
||
many C programmers are used to.
|
||
|
||
At one time, the locale setting affected regexp matching,
|
||
but this is no longer true (*note Ranges and Locales::).
|
||
|
||
Locales can affect record splitting. For the normal case of 'RS =
|
||
"\n"', the locale is largely irrelevant. For other single-character
|
||
record separators, setting 'LC_ALL=C' in the environment will give you
|
||
much better performance when reading records. Otherwise, 'gawk' has to
|
||
make several function calls, _per input character_, to find the record
|
||
terminator.
|
||
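The advice about single-character record separators can be tried out
with a sketch like the following. The sample data and the ';' separator
are assumptions made for illustration; any speedup depends on the 'gawk'
version and on the input, and only the record splitting itself is shown:

```shell
# Run awk in the "C" locale for this one command only.
out=$(printf 'alpha;beta;gamma' | LC_ALL=C awk -v RS=';' '{ print NR ": " $0 }')
printf '%s\n' "$out"
```

Setting 'LC_ALL' on the command line, as here, limits the change to that
single invocation and leaves the rest of the shell session alone.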
|
||
Locales can affect how dates and times are formatted (*note Time
|
||
Functions::). For example, a common way to abbreviate the date
|
||
September 4, 2015, in the United States is "9/4/15." In many countries
|
||
in Europe, however, it is abbreviated "4.9.15." Thus, the '%x'
|
||
specification in a '"US"' locale might produce '9/4/15', while in a
|
||
'"EUROPE"' locale, it might produce '4.9.15'.
|
||
|
||
According to POSIX, string comparison is also affected by locales
|
||
(similar to regular expressions). The details are presented in *note
|
||
POSIX String Comparison::.
|
||
|
||
Finally, the locale affects the value of the decimal point character
|
||
used when 'gawk' parses input data. This is discussed in detail in
|
||
*note Conversion::.
|
||
|
||
|
||
File: gawk.info, Node: Expressions Summary, Prev: Locales, Up: Expressions
|
||
|
||
6.7 Summary
|
||
===========
|
||
|
||
* Expressions are the basic elements of computation in programs.
|
||
They are built from constants, variables, function calls, and
|
||
combinations of the various kinds of values with operators.
|
||
|
||
* 'awk' supplies three kinds of constants: numeric, string, and
|
||
regexp. 'gawk' lets you specify numeric constants in octal and
|
||
hexadecimal (bases 8 and 16) as well as decimal (base 10). In
|
||
certain contexts, a standalone regexp constant such as '/foo/' has
|
||
the same meaning as '$0 ~ /foo/'.
|
||
|
||
* Variables hold values between uses in computations. A number of
|
||
built-in variables provide information to your 'awk' program, and a
|
||
number of others let you control how 'awk' behaves.
|
||
|
||
* Numbers are automatically converted to strings, and strings to
|
||
numbers, as needed by 'awk'. Numeric values are converted as if
|
||
they were formatted with 'sprintf()' using the format in 'CONVFMT'.
|
||
Locales can influence the conversions.
|
||
|
||
* 'awk' provides the usual arithmetic operators (addition,
|
||
subtraction, multiplication, division, modulus), and unary plus and
|
||
minus. It also provides comparison operators, Boolean operators,
|
||
an array membership testing operator, and regexp matching
|
||
operators. String concatenation is accomplished by placing two
|
||
expressions next to each other; there is no explicit operator. The
|
||
three-operand '?:' operator provides an "if-else" test within
|
||
expressions.
|
||
|
||
* Assignment operators provide convenient shorthands for common
|
||
arithmetic operations.
|
||
|
||
* In 'awk', a value is considered to be true if it is nonzero _or_
|
||
non-null. Otherwise, the value is false.
|
||
|
||
* A variable's type is set upon each assignment and may change over
|
||
its lifetime. The type determines how it behaves in comparisons
|
||
(string or numeric).
|
||
|
||
* Function calls return a value that may be used as part of a larger
|
||
expression. Expressions used to pass parameter values are fully
|
||
evaluated before the function is called. 'awk' provides built-in
|
||
and user-defined functions; this is described in *note Functions::.
|
||
|
||
* Operator precedence specifies the order in which operations are
|
||
performed, unless explicitly overridden by parentheses. 'awk''s
|
||
operator precedence is compatible with that of C.
|
||
|
||
* Locales can affect the format of data as output by an 'awk'
|
||
program, and occasionally the format for data read as input.
|
||
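The rule for truth values given in the summary can be checked with a
small sketch. The test values are illustrative; note that the string
constant '"0"' counts as true because it is a non-null string:

```shell
out=$(awk 'BEGIN {
    if (0)   print "number 0"      # false: zero
    if ("")  print "empty string"  # false: null string
    if (1)   print "number 1"      # true: nonzero
    if ("0") print "string 0"      # true: non-null string constant
    if ("a") print "string a"      # true: non-null string
}')
printf '%s\n' "$out"
```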
|
||
|
||
File: gawk.info, Node: Patterns and Actions, Next: Arrays, Prev: Expressions, Up: Top
|
||
|
||
7 Patterns, Actions, and Variables
|
||
**********************************
|
||
|
||
As you have already seen, each 'awk' statement consists of a pattern
|
||
with an associated action. This major node describes how you build
|
||
patterns and actions, what kinds of things you can do within actions,
|
||
and 'awk''s predefined variables.
|
||
|
||
The pattern-action rules and the statements available for use within
|
||
actions form the core of 'awk' programming. In a sense, everything
|
||
covered up to here has been the foundation that programs are built on
|
||
top of. Now it's time to start building something useful.
|
||
|
||
* Menu:
|
||
|
||
* Pattern Overview:: What goes into a pattern.
|
||
* Using Shell Variables:: How to use shell variables with 'awk'.
|
||
* Action Overview:: What goes into an action.
|
||
* Statements:: Describes the various control statements in
|
||
detail.
|
||
* Built-in Variables:: Summarizes the predefined variables.
|
||
* Pattern Action Summary:: Patterns and Actions summary.
|
||
|
||
|
||
File: gawk.info, Node: Pattern Overview, Next: Using Shell Variables, Up: Patterns and Actions
|
||
|
||
7.1 Pattern Elements
|
||
====================
|
||
|
||
* Menu:
|
||
|
||
* Regexp Patterns:: Using regexps as patterns.
|
||
* Expression Patterns:: Any expression can be used as a pattern.
|
||
* Ranges:: Pairs of patterns specify record ranges.
|
||
* BEGIN/END:: Specifying initialization and cleanup rules.
|
||
* BEGINFILE/ENDFILE:: Two special patterns for advanced control.
|
||
* Empty:: The empty pattern, which matches every record.
|
||
|
||
Patterns in 'awk' control the execution of rules--a rule is executed
|
||
when its pattern matches the current input record. The following is a
|
||
summary of the types of 'awk' patterns:
|
||
|
||
'/REGULAR EXPRESSION/'
     A regular expression. It matches when the text of the input record
     fits the regular expression. (*Note Regexp::.)

'EXPRESSION'
     A single expression. It matches when its value is nonzero (if a
     number) or non-null (if a string). (*Note Expression Patterns::.)

'BEGPAT, ENDPAT'
     A pair of patterns separated by a comma, specifying a "range" of
     records. The range includes both the initial record that matches
     BEGPAT and the final record that matches ENDPAT. (*Note Ranges::.)

'BEGIN'
'END'
     Special patterns for you to supply startup or cleanup actions for
     your 'awk' program. (*Note BEGIN/END::.)

'BEGINFILE'
'ENDFILE'
     Special patterns for you to supply startup or cleanup actions to be
     done on a per-file basis. (*Note BEGINFILE/ENDFILE::.)

'EMPTY'
     The empty pattern matches every input record. (*Note Empty::.)
|
||
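The pattern types summarized above can be seen side by side in one
small program. This is only a sketch; the input lines and the variable
names 'n' and 'total' are made up for the illustration:

```shell
out=$(printf 'start\nfoo 3\nbar 7\nstop\nbaz 1\n' | awk '
BEGIN            { print "begin" }              # runs before any input
/foo/            { print "regexp:", $0 }        # regexp pattern
$2 > 5           { print "expression:", $0 }    # expression pattern
/start/, /stop/  { n++ }                        # range pattern
                 { total++ }                    # empty pattern
END              { print n, "of", total, "records in range" }')
printf '%s\n' "$out"
```

Every rule is tested against every record, so a single record can
trigger several actions.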
|
||
|
||
File: gawk.info, Node: Regexp Patterns, Next: Expression Patterns, Up: Pattern Overview
|
||
|
||
7.1.1 Regular Expressions as Patterns
|
||
-------------------------------------
|
||
|
||
Regular expressions are one of the first kinds of patterns presented in
|
||
this book. This kind of pattern is simply a regexp constant in the
|
||
pattern part of a rule. Its meaning is '$0 ~ /PATTERN/'. The pattern
|
||
matches when the input record matches the regexp. For example:
|
||
|
||
/foo|bar|baz/ { buzzwords++ }
|
||
END { print buzzwords, "buzzwords seen" }
|
||
|
||
|
||
File: gawk.info, Node: Expression Patterns, Next: Ranges, Prev: Regexp Patterns, Up: Pattern Overview
|
||
|
||
7.1.2 Expressions as Patterns
|
||
-----------------------------
|
||
|
||
Any 'awk' expression is valid as an 'awk' pattern. The pattern matches
|
||
if the expression's value is nonzero (if a number) or non-null (if a
|
||
string). The expression is reevaluated each time the rule is tested
|
||
against a new input record. If the expression uses fields such as '$1',
|
||
the value depends directly on the new input record's text; otherwise, it
|
||
depends on only what has happened so far in the execution of the 'awk'
|
||
program.
|
||
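As a sketch of a pattern that does not look at the fields at all, the
following uses the record counter 'NR' to select every second record.
The four-line input is an assumption made for the example:

```shell
# The pattern is reevaluated for each record; it is true on even-numbered ones.
out=$(printf 'a\nb\nc\nd\n' | awk 'NR % 2 == 0')
printf '%s\n' "$out"
```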
|
||
Comparison expressions, using the comparison operators described in
|
||
*note Typing and Comparison::, are a very common kind of pattern.
|
||
Regexp matching and nonmatching are also very common expressions. The
|
||
left operand of the '~' and '!~' operators is a string. The right
|
||
operand is either a constant regular expression enclosed in slashes
|
||
('/REGEXP/'), or any expression whose string value is used as a dynamic
|
||
regular expression (*note Computed Regexps::). The following example
|
||
prints the second field of each input record whose first field is
|
||
precisely 'li':
|
||
|
||
$ awk '$1 == "li" { print $2 }' mail-list
|
||
|
||
(There is no output, because there is no person with the exact name
|
||
'li'.) Contrast this with the following regular expression match, which
|
||
accepts any record with a first field that contains 'li':
|
||
|
||
$ awk '$1 ~ /li/ { print $2 }' mail-list
|
||
-| 555-5553
|
||
-| 555-6699
|
||
|
||
A regexp constant as a pattern is also a special case of an
|
||
expression pattern. The expression '/li/' has the value one if 'li'
|
||
appears in the current input record. Thus, as a pattern, '/li/' matches
|
||
any record containing 'li'.
|
||
|
||
Boolean expressions are also commonly used as patterns. Whether the
|
||
pattern matches an input record depends on whether its subexpressions
|
||
match. For example, the following command prints all the records in
|
||
'mail-list' that contain both 'edu' and 'li':
|
||
|
||
$ awk '/edu/ && /li/' mail-list
|
||
-| Samuel 555-3430 samuel.lanceolis@shu.edu A
|
||
|
||
The following command prints all records in 'mail-list' that contain
|
||
_either_ 'edu' or 'li' (or both, of course):
|
||
|
||
$ awk '/edu/ || /li/' mail-list
|
||
-| Amelia 555-5553 amelia.zodiacusque@gmail.com F
|
||
-| Broderick 555-0542 broderick.aliquotiens@yahoo.com R
|
||
-| Fabius 555-1234 fabius.undevicesimus@ucb.edu F
|
||
-| Julie 555-6699 julie.perscrutabor@skeeve.com F
|
||
-| Samuel 555-3430 samuel.lanceolis@shu.edu A
|
||
-| Jean-Paul 555-2127 jeanpaul.campanorum@nyu.edu R
|
||
|
||
The following command prints all records in 'mail-list' that do _not_
|
||
contain the string 'li':
|
||
|
||
$ awk '! /li/' mail-list
|
||
-| Anthony 555-3412 anthony.asserturo@hotmail.com A
|
||
-| Becky 555-7685 becky.algebrarum@gmail.com A
|
||
-| Bill 555-1675 bill.drowning@hotmail.com A
|
||
-| Camilla 555-2912 camilla.infusarum@skynet.be R
|
||
-| Fabius 555-1234 fabius.undevicesimus@ucb.edu F
|
||
-| Martin 555-6480 martin.codicibus@hotmail.com A
|
||
-| Jean-Paul 555-2127 jeanpaul.campanorum@nyu.edu R
|
||
|
||
The subexpressions of a Boolean operator in a pattern can be constant
|
||
regular expressions, comparisons, or any other 'awk' expressions. Range
|
||
patterns are not expressions, so they cannot appear inside Boolean
|
||
patterns. Likewise, the special patterns 'BEGIN', 'END', 'BEGINFILE',
|
||
and 'ENDFILE', which never match any input record, are not expressions
|
||
and cannot appear inside Boolean patterns.
|
||
|
||
The precedence of the different operators that can appear in patterns
|
||
is described in *note Precedence::.
|
||
|
||
|
||
File: gawk.info, Node: Ranges, Next: BEGIN/END, Prev: Expression Patterns, Up: Pattern Overview
|
||
|
||
7.1.3 Specifying Record Ranges with Patterns
|
||
--------------------------------------------
|
||
|
||
A "range pattern" is made of two patterns separated by a comma, in the
|
||
form 'BEGPAT, ENDPAT'. It is used to match ranges of consecutive input
|
||
records. The first pattern, BEGPAT, controls where the range begins,
|
||
while ENDPAT controls where the range ends. For example, the
|
||
following:
|
||
|
||
awk '$1 == "on", $1 == "off"' myfile
|
||
|
||
prints every record in 'myfile' between 'on'/'off' pairs, inclusive.
|
||
|
||
A range pattern starts out by matching BEGPAT against every input
|
||
record. When a record matches BEGPAT, the range pattern is "turned on",
|
||
and the range pattern matches this record as well. As long as the range
|
||
pattern stays turned on, it automatically matches every input record
|
||
read. The range pattern also matches ENDPAT against every input record;
|
||
when this succeeds, the range pattern is "turned off" again for the
|
||
following record. Then the range pattern goes back to checking BEGPAT
|
||
against each record.
|
||
|
||
The record that turns on the range pattern and the one that turns it
|
||
off both match the range pattern. If you don't want to operate on these
|
||
records, you can write 'if' statements in the rule's action to
|
||
distinguish them from the records you are interested in.
|
||
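The on/off behavior described above can be observed with a short
sketch. The sample data is invented for the illustration. The first
command prints the ranges including their delimiters; the second uses an
'if' statement in the action to leave the delimiting records out:

```shell
data='a
on
b
off
c
on
d'

# Range including the records that turn it on and off.
inclusive=$(printf '%s\n' "$data" | awk '$1 == "on", $1 == "off"')

# Same range, but the action ignores the delimiting records.
inner=$(printf '%s\n' "$data" |
    awk '$1 == "on", $1 == "off" { if ($1 != "on" && $1 != "off") print }')

printf '%s\n' "$inclusive" "--" "$inner"
```

Note that the second 'on' starts a new range that is never turned off,
so it simply runs to the end of the input.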
|
||
It is possible for a pattern to be turned on and off by the same
|
||
record. If the record satisfies both conditions, then the action is
|
||
executed for just that record. For example, suppose there is text
|
||
between two identical markers (e.g., the '%' symbol), each on its own
|
||
line, that should be ignored. A first attempt would be to combine a
|
||
range pattern that describes the delimited text with the 'next'
|
||
statement (not discussed yet, *note Next Statement::). This causes
|
||
'awk' to skip any further processing of the current record and start
|
||
over again with the next input record. Such a program looks like this:
|
||
|
||
/^%$/,/^%$/ { next }
|
||
{ print }
|
||
|
||
This program fails because the range pattern is both turned on and
|
||
turned off by the first line, which just has a '%' on it. To accomplish
|
||
this task, write the program in the following manner, using a flag:
|
||
|
||
/^%$/ { skip = ! skip; next }
|
||
skip == 1 { next } # skip lines with `skip' set
{ print }          # print every remaining line
|
||
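To compare the two approaches, the following sketch runs both on the
same invented input. A final '{ print }' rule is included in the flag
version so that its output is visible:

```shell
data=$(printf '%s\n' keep1 % hidden % keep2)

# Range version: each % line turns the range on and off again at once,
# so only the marker lines themselves are skipped.
range_out=$(printf '%s\n' "$data" | awk '/^%$/,/^%$/ { next } { print }')

# Flag version: each % line toggles the flag, hiding the text in between.
flag_out=$(printf '%s\n' "$data" |
    awk '/^%$/     { skip = ! skip; next }
         skip == 1 { next }
                   { print }')

printf '%s\n' "$range_out" "--" "$flag_out"
```

The range version still prints 'hidden'; the flag version prints only
'keep1' and 'keep2'.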
|
||
In a range pattern, the comma (',') has the lowest precedence of all
|
||
the operators (i.e., it is evaluated last). Thus, the following program
|
||
attempts to combine a range pattern with another, simpler test:
|
||
|
||
echo Yes | awk '/1/,/2/ || /Yes/'
|
||
|
||
The intent of this program is '(/1/,/2/) || /Yes/'. However, 'awk'
|
||
interprets this as '/1/, (/2/ || /Yes/)'. This cannot be changed or
|
||
worked around; range patterns do not combine with other patterns:
|
||
|
||
$ echo Yes | gawk '(/1/,/2/) || /Yes/'
|
||
error-> gawk: cmd. line:1: (/1/,/2/) || /Yes/
|
||
error-> gawk: cmd. line:1: ^ syntax error
|
||
|
||
As a minor point of interest, although it is poor style, POSIX allows
|
||
you to put a newline after the comma in a range pattern. (d.c.)
|
||
|
||
|
||
File: gawk.info, Node: BEGIN/END, Next: BEGINFILE/ENDFILE, Prev: Ranges, Up: Pattern Overview
|
||
|
||
7.1.4 The 'BEGIN' and 'END' Special Patterns
|
||
--------------------------------------------
|
||
|
||
All the patterns described so far are for matching input records. The
|
||
'BEGIN' and 'END' special patterns are different. They supply startup
|
||
and cleanup actions for 'awk' programs. 'BEGIN' and 'END' rules must
|
||
have actions; there is no default action for these rules because there
|
||
is no current record when they run. 'BEGIN' and 'END' rules are often
|
||
referred to as "'BEGIN' and 'END' blocks" by longtime 'awk' programmers.
|
||
|
||
* Menu:
|
||
|
||
* Using BEGIN/END:: How and why to use BEGIN/END rules.
|
||
* I/O And BEGIN/END:: I/O issues in BEGIN/END rules.
|
||
|
||
|
||
File: gawk.info, Node: Using BEGIN/END, Next: I/O And BEGIN/END, Up: BEGIN/END
|
||
|
||
7.1.4.1 Startup and Cleanup Actions
|
||
...................................
|
||
|
||
A 'BEGIN' rule is executed once only, before the first input record is
|
||
read. Likewise, an 'END' rule is executed once only, after all the
|
||
input is read. For example:
|
||
|
||
$ awk '
|
||
> BEGIN { print "Analysis of \"li\"" }
|
||
> /li/ { ++n }
|
||
> END { print "\"li\" appears in", n, "records." }' mail-list
|
||
-| Analysis of "li"
|
||
-| "li" appears in 4 records.
|
||
|
||
This program finds the number of records in the input file
|
||
'mail-list' that contain the string 'li'. The 'BEGIN' rule prints a
|
||
title for the report. There is no need to use the 'BEGIN' rule to
|
||
initialize the counter 'n' to zero, as 'awk' does this automatically
|
||
(*note Variables::). The second rule increments the variable 'n' every
|
||
time a record containing the pattern 'li' is read. The 'END' rule
|
||
prints the value of 'n' at the end of the run.
|
||
|
||
The special patterns 'BEGIN' and 'END' cannot be used in ranges or
|
||
with Boolean operators (indeed, they cannot be used with any operators).
|
||
An 'awk' program may have multiple 'BEGIN' and/or 'END' rules. They are
|
||
executed in the order in which they appear: all the 'BEGIN' rules at
|
||
startup and all the 'END' rules at termination. 'BEGIN' and 'END' rules
|
||
may be intermixed with other rules. This feature was added in the 1987
|
||
version of 'awk' and is included in the POSIX standard. The original
|
||
(1978) version of 'awk' required the 'BEGIN' rule to be placed at the
|
||
beginning of the program, the 'END' rule to be placed at the end, and
|
||
only allowed one of each. This is no longer required, but it is a good
|
||
idea to follow this template in terms of program organization and
|
||
readability.
|
||
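The execution order of intermixed 'BEGIN' and 'END' rules can be seen
in a minimal sketch. The messages are illustrative, and '/dev/null'
serves as an empty input file so that the 'END' rules run immediately:

```shell
out=$(awk '
BEGIN { print "BEGIN 1" }
END   { print "END 1" }
BEGIN { print "BEGIN 2" }
END   { print "END 2" }' /dev/null)
printf '%s\n' "$out"
```

Both 'BEGIN' rules run first, in source order, followed by both 'END'
rules, regardless of how they are interleaved in the program text.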
|
||
Multiple 'BEGIN' and 'END' rules are useful for writing library
|
||
functions, because each library file can have its own 'BEGIN' and/or
|
||
'END' rule to do its own initialization and/or cleanup. The order in
|
||
which library files are named on the command line controls the order
|
||
in which their 'BEGIN' and 'END' rules are executed. Therefore, you
|
||
have to be careful when writing such rules in library files so that the
|
||
order in which they are executed doesn't matter. *Note Options::, for
|
||
more information on using library functions. *Note Library Functions::,
|
||
for a number of useful library functions.
|
||
|
||
If an 'awk' program has only 'BEGIN' rules and no other rules, then
|
||
the program exits after the 'BEGIN' rules are run.(1) However, if an
|
||
'END' rule exists, then the input is read, even if there are no other
|
||
rules in the program. This is necessary in case the 'END' rule checks
|
||
the 'FNR' and 'NR' variables.
|
||
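A small sketch makes the difference visible. The three-line input is
an assumption for the example; 'NR' shows how many records were read by
the time each rule runs:

```shell
# Only a BEGIN rule: awk exits without reading any input.
begin_only=$(printf 'a\nb\nc\n' | awk 'BEGIN { print NR }')

# Only an END rule: the input is read first, so NR is the record count.
end_only=$(printf 'a\nb\nc\n' | awk 'END { print NR }')

printf '%s\n' "$begin_only" "$end_only"
```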
|
||
---------- Footnotes ----------
|
||
|
||
(1) The original version of 'awk' kept reading and ignoring input
|
||
until the end of the file was seen.
|
||
|
||
|
||
File: gawk.info, Node: I/O And BEGIN/END, Prev: Using BEGIN/END, Up: BEGIN/END
|
||
|
||
7.1.4.2 Input/Output from 'BEGIN' and 'END' Rules
|
||
.................................................
|
||
|
||
There are several (sometimes subtle) points to be aware of when doing
|
||
I/O from a 'BEGIN' or 'END' rule. The first has to do with the value of
|
||
'$0' in a 'BEGIN' rule. Because 'BEGIN' rules are executed before any
|
||
input is read, there simply is no input record, and therefore no fields,
|
||
when executing 'BEGIN' rules. References to '$0' and the fields yield a
|
||
null string or zero, depending upon the context. One way to give '$0' a
|
||
real value is to execute a 'getline' command without a variable (*note
|
||
Getline::). Another way is simply to assign a value to '$0'.
|
||
|
||
The second point is similar to the first, but from the other
|
||
direction. Traditionally, due largely to implementation issues, '$0'
|
||
and 'NF' were _undefined_ inside an 'END' rule. The POSIX standard
|
||
specifies that 'NF' is available in an 'END' rule. It contains the
|
||
number of fields from the last input record. Most probably due to an
|
||
oversight, the standard does not say that '$0' is also preserved,
|
||
although logically one would think that it should be. In fact, all of
|
||
BWK 'awk', 'mawk', and 'gawk' preserve the value of '$0' for use in
|
||
'END' rules. Be aware, however, that some other implementations and
|
||
many older versions of Unix 'awk' do not.
|
||
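The first two points can be tried with the following sketch. The input
lines are invented, and the second command assumes an implementation
that preserves '$0' in 'END' rules (as BWK 'awk', 'mawk', and 'gawk'
do):

```shell
# A plain getline in BEGIN reads the first record and sets $0.
in_begin=$(printf 'first line\nsecond line\n' |
    awk 'BEGIN { getline; print "BEGIN sees: " $0 }')

# In END, NF and (in these implementations) $0 reflect the last record.
in_end=$(printf 'first line\nlast line here\n' |
    awk 'END { print NF ": " $0 }')

printf '%s\n' "$in_begin" "$in_end"
```

A portable program that needs the last record in its 'END' rule can
save it explicitly, for example with a rule such as '{ last = $0 }'.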
|
||
The third point follows from the first two. The meaning of 'print'
|
||
inside a 'BEGIN' or 'END' rule is the same as always: 'print $0'. If
|
||
'$0' is the null string, then this prints an empty record. Many
|
||
longtime 'awk' programmers use an unadorned 'print' in 'BEGIN' and 'END'
|
||
rules, to mean 'print ""', relying on '$0' being null. Although one
|
||
might generally get away with this in 'BEGIN' rules, it is a very bad
|
||
idea in 'END' rules, at least in 'gawk'. It is also poor style, because
|
||
if an empty line is needed in the output, the program should print one
|
||
explicitly.
|
||
|
||
Finally, the 'next' and 'nextfile' statements are not allowed in a
|
||
'BEGIN' rule, because the implicit
|
||
read-a-record-and-match-against-the-rules loop has not started yet.
|
||
Similarly, those statements are not valid in an 'END' rule, because all
|
||
the input has been read. (*Note Next Statement::, and *note Nextfile
|
||
Statement::.)
|
||
|
||
|
||
File: gawk.info, Node: BEGINFILE/ENDFILE, Next: Empty, Prev: BEGIN/END, Up: Pattern Overview
|
||
|
||
7.1.5 The 'BEGINFILE' and 'ENDFILE' Special Patterns
|
||
----------------------------------------------------
|
||
|
||
This minor node describes a 'gawk'-specific feature.
|
||
|
||
Two special kinds of rule, 'BEGINFILE' and 'ENDFILE', give you
|
||
"hooks" into 'gawk''s command-line file processing loop. As with the
|
||
'BEGIN' and 'END' rules (*note BEGIN/END::), all 'BEGINFILE' rules in a
|
||
program are merged, in the order they are read by 'gawk', and all
|
||
'ENDFILE' rules are merged as well.
|
||
|
||
The body of the 'BEGINFILE' rules is executed just before 'gawk'
|
||
reads the first record from a file. 'FILENAME' is set to the name of
|
||
the current file, and 'FNR' is set to zero.
|
||
|
||
The 'BEGINFILE' rule provides you the opportunity to accomplish two
|
||
tasks that would otherwise be difficult or impossible to perform:
|
||
|
||
* You can test if the file is readable. Normally, it is a fatal
|
||
error if a file named on the command line cannot be opened for
|
||
reading. However, you can bypass the fatal error and move on to
|
||
the next file on the command line.
|
||
|
||
You do this by checking if the 'ERRNO' variable is not the empty
|
||
string; if so, then 'gawk' was not able to open the file. In this
|
||
case, your program can execute the 'nextfile' statement (*note
|
||
Nextfile Statement::). This causes 'gawk' to skip the file
|
||
entirely. Otherwise, 'gawk' exits with the usual fatal error.
|
||
|
||
* If you have written extensions that modify the record handling (by
|
||
inserting an "input parser"; *note Input Parsers::), you can invoke
|
||
them at this point, before 'gawk' has started processing the file.
|
||
(This is a _very_ advanced feature, currently used only by the
|
||
'gawkextlib' project (http://sourceforge.net/projects/gawkextlib).)
|
||
|
||
The 'ENDFILE' rule is called when 'gawk' has finished processing the
|
||
last record in an input file. For the last input file, it will be
|
||
called before any 'END' rules. The 'ENDFILE' rule is executed even for
|
||
empty input files.
|
||
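The following sketch shows both hooks at work. It requires 'gawk', so
it first checks that 'gawk' is installed; the temporary directory, file
names, and messages are assumptions made for the illustration:

```shell
if command -v gawk >/dev/null 2>&1; then
    dir=$(mktemp -d)
    printf 'a\nb\n' > "$dir/data.txt"
    : > "$dir/empty.txt"

    # Skip a file that cannot be opened instead of dying with a fatal error.
    skipped=$(gawk '
    BEGINFILE {
        if (ERRNO != "") {
            print "cannot open a file, skipping it"
            nextfile
        }
    }
    { n++ }
    END { print n, "records read" }' "$dir/data.txt" "$dir/missing.txt")

    # ENDFILE runs once per file, including the empty one.
    per_file=$(gawk 'ENDFILE { print FNR }' "$dir/data.txt" "$dir/empty.txt")

    printf '%s\n' "$skipped" "$per_file"
    rm -rf "$dir"
fi
```

A real program would normally report 'FILENAME' and 'ERRNO' on the
standard error stream rather than print a fixed message.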
|
||
Normally, if an error occurs while reading input in the normal
|
||
input-processing loop, the error is fatal. However, if an 'ENDFILE'
|
||
rule is present, the error becomes non-fatal, and instead 'ERRNO' is
|
||
set. This makes it possible to catch and process I/O errors at the
|
||
level of the 'awk' program.
|
||
|
||
The 'next' statement (*note Next Statement::) is not allowed inside
|
||
either a 'BEGINFILE' or an 'ENDFILE' rule. The 'nextfile' statement is
|
||
allowed only inside a 'BEGINFILE' rule, not inside an 'ENDFILE' rule.
|
||
|
||
The 'getline' statement (*note Getline::) is restricted inside both
|
||
'BEGINFILE' and 'ENDFILE': only redirected forms of 'getline' are
|
||
allowed.
|
||
|
||
'BEGINFILE' and 'ENDFILE' are 'gawk' extensions. In most other 'awk'
|
||
implementations, or if 'gawk' is in compatibility mode (*note
|
||
Options::), they are not special.
|
||
|
||
|
||
File: gawk.info, Node: Empty, Prev: BEGINFILE/ENDFILE, Up: Pattern Overview
|
||
|
||
7.1.6 The Empty Pattern
|
||
-----------------------
|
||
|
||
An empty (i.e., nonexistent) pattern is considered to match _every_
|
||
input record. For example, the program:
|
||
|
||
awk '{ print $1 }' mail-list
|
||
|
||
prints the first field of every record.
|
||
|
||
|
||
File: gawk.info, Node: Using Shell Variables, Next: Action Overview, Prev: Pattern Overview, Up: Patterns and Actions
|
||
|
||
7.2 Using Shell Variables in Programs
|
||
=====================================
|
||
|
||
'awk' programs are often used as components in larger programs written
|
||
in shell. For example, it is very common to use a shell variable to
|
||
hold a pattern that the 'awk' program searches for. There are two ways
|
||
to get the value of the shell variable into the body of the 'awk'
|
||
program.
|
||
|
||
A common method is to use shell quoting to substitute the variable's
|
||
value into the program inside the script. For example, consider the
|
||
following program:
|
||
|
||
printf "Enter search pattern: "
|
||
read pattern
|
||
awk "/$pattern/ "'{ nmatches++ }
|
||
END { print nmatches, "found" }' /path/to/data
|
||
|
||
The 'awk' program consists of two pieces of quoted text that are
|
||
concatenated together to form the program. The first part is
|
||
double-quoted, which allows substitution of the 'pattern' shell variable
|
||
inside the quotes. The second part is single-quoted.
|
||
|
||
Variable substitution via quoting works, but can potentially be
|
||
messy. It requires a good understanding of the shell's quoting rules
|
||
(*note Quoting::), and it's often difficult to correctly match up the
|
||
quotes when reading the program.
|
||
|
||
A better method is to use 'awk''s variable assignment feature (*note
|
||
Assignment Options::) to assign the shell variable's value to an 'awk'
|
||
variable. Then use dynamic regexps to match the pattern (*note Computed
|
||
Regexps::). The following shows how to redo the previous example using
|
||
this technique:
|
||
|
||
printf "Enter search pattern: "
|
||
read pattern
|
||
awk -v pat="$pattern" '$0 ~ pat { nmatches++ }
|
||
END { print nmatches, "found" }' /path/to/data
|
||
|
||
Now, the 'awk' program is just one single-quoted string. The assignment
|
||
'-v pat="$pattern"' still requires double quotes, in case there is
|
||
whitespace in the value of '$pattern'. The 'awk' variable 'pat' could
|
||
be named 'pattern' too, but that would be more confusing. Using a
|
||
variable also provides more flexibility, as the variable can be used
|
||
anywhere inside the program--for printing, as an array subscript, or for
|
||
any other use--without requiring the quoting tricks at every point in
|
||
the program.
|
||
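A noninteractive sketch of the '-v' technique follows. The pattern
value and the input are assumptions chosen for the example; the value
contains a slash, which would end the regexp constant early if it were
substituted into '/$pattern/' by the shell:

```shell
pattern='usr/bin'     # hypothetical search pattern containing a slash
out=$(printf '/usr/bin/awk\n/bin/sh\n' |
    awk -v pat="$pattern" '$0 ~ pat { nmatches++ }
                           END { print nmatches + 0, "found" }')
printf '%s\n' "$out"
```

Keep in mind that 'awk' processes escape sequences in a '-v'
assignment, so a value containing backslashes may need them doubled.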
|
||
|
||
File: gawk.info, Node: Action Overview, Next: Statements, Prev: Using Shell Variables, Up: Patterns and Actions
|
||
|
||
7.3 Actions
|
||
===========
|
||
|
||
An 'awk' program or script consists of a series of rules and function
|
||
definitions interspersed. (Functions are described later. *Note
|
||
User-defined::.) A rule contains a pattern and an action, either of
|
||
which (but not both) may be omitted. The purpose of the "action" is to
|
||
tell 'awk' what to do once a match for the pattern is found. Thus, in
|
||
outline, an 'awk' program generally looks like this:
|
||
|
||
[PATTERN] '{ ACTION }'
|
||
PATTERN ['{ ACTION }']
|
||
...
|
||
'function NAME(ARGS) { ... }'
|
||
...
|
||
|
||
An action consists of one or more 'awk' "statements", enclosed in
|
||
braces ('{...}'). Each statement specifies one thing to do. The
|
||
statements are separated by newlines or semicolons. The braces around
|
||
an action must be used even if the action contains only one statement,
|
||
or if it contains no statements at all. However, if you omit the action
|
||
entirely, omit the braces as well. An omitted action is equivalent to
|
||
'{ print $0 }':
|
||
|
||
/foo/ { } match 'foo', do nothing -- empty action
|
||
/foo/ match 'foo', print the record -- omitted action
|
||
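The difference between an empty action and an omitted one can be
confirmed with a two-line input (invented for this sketch):

```shell
# Empty action: the record matches, but nothing is done with it.
empty_action=$(printf 'foo\nbar\n' | awk '/foo/ { }')

# Omitted action: the matching record is printed.
omitted_action=$(printf 'foo\nbar\n' | awk '/foo/')

printf 'empty: [%s]\nomitted: [%s]\n' "$empty_action" "$omitted_action"
```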
|
||
The following types of statements are supported in 'awk':
|
||
|
||
Expressions
     Call functions or assign values to variables (*note Expressions::).
     Executing this kind of statement simply computes the value of the
     expression. This is useful when the expression has side effects
     (*note Assignment Ops::).

Control statements
     Specify the control flow of 'awk' programs. The 'awk' language
     gives you C-like constructs ('if', 'for', 'while', and 'do') as
     well as a few special ones (*note Statements::).

Compound statements
     Enclose one or more statements in braces. A compound statement is
     used in order to put several statements together in the body of an
     'if', 'while', 'do', or 'for' statement.

Input statements
     Use the 'getline' command (*note Getline::). Also supplied in
     'awk' are the 'next' statement (*note Next Statement::) and the
     'nextfile' statement (*note Nextfile Statement::).

Output statements
     Such as 'print' and 'printf'. *Note Printing::.

Deletion statements
     For deleting array elements. *Note Delete::.
|
||
|
||
|
||
File: gawk.info, Node: Statements, Next: Built-in Variables, Prev: Action Overview, Up: Patterns and Actions
|
||
|
||
7.4 Control Statements in Actions
|
||
=================================
|
||
|
||
"Control statements", such as 'if', 'while', and so on, control the flow
of execution in 'awk' programs. Most of 'awk''s control statements are
patterned after similar statements in C.

All the control statements start with special keywords, such as 'if'
and 'while', to distinguish them from simple expressions. Many control
statements contain other statements. For example, the 'if' statement
contains another statement that may or may not be executed. The
contained statement is called the "body". To include more than one
statement in the body, group them into a single "compound statement"
with braces, separating them with newlines or semicolons.

* Menu:

* If Statement::                Conditionally execute some 'awk'
                                statements.
* While Statement::             Loop as long as some condition is satisfied.
* Do Statement::                Do specified action, then keep looping as
                                long as some condition is satisfied.
* For Statement::               Another looping statement, that provides
                                initialization and increment clauses.
* Switch Statement::            Switch/case evaluation for conditional
                                execution of statements based on a value.
* Break Statement::             Immediately exit the innermost enclosing loop.
* Continue Statement::          Skip to the end of the innermost enclosing
                                loop.
* Next Statement::              Stop processing the current input record.
* Nextfile Statement::          Stop processing the current file.
* Exit Statement::              Stop execution of 'awk'.


File: gawk.info, Node: If Statement, Next: While Statement, Up: Statements

7.4.1 The 'if'-'else' Statement
-------------------------------

The 'if'-'else' statement is 'awk''s decision-making statement. It
looks like this:

     'if (CONDITION) THEN-BODY' ['else ELSE-BODY']

The CONDITION is an expression that controls what the rest of the
statement does. If the CONDITION is true, THEN-BODY is executed;
otherwise, ELSE-BODY is executed. The 'else' part of the statement is
optional. The condition is considered false if its value is zero or the
null string; otherwise, the condition is true. Refer to the following:

     if (x % 2 == 0)
         print "x is even"
     else
         print "x is odd"

In this example, if the expression 'x % 2 == 0' is true (i.e., if the
value of 'x' is evenly divisible by two), then the first 'print'
statement is executed; otherwise, the second 'print' statement is
executed. If the 'else' keyword appears on the same line as THEN-BODY
and THEN-BODY is not a compound statement (i.e., not surrounded by
braces), then a semicolon must separate THEN-BODY from the 'else'. To
illustrate this, the previous example can be rewritten as:

     if (x % 2 == 0) print "x is even"; else
             print "x is odd"

If the ';' is left out, 'awk' can't interpret the statement and it
produces a syntax error. Don't actually write programs this way,
because a human reader might fail to see the 'else' if it is not the
first thing on its line.
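
As a further illustration, here is a minimal sketch that wraps a small
'awk' program in a shell function. The function name 'classify_parity'
and the variable 'evens' are invented for this example. It shows a
compound THEN-BODY grouped with braces, and an 'END' rule whose
condition is false when 'evens' was never assigned (an unset variable
has the null string as its value):

```shell
# Hypothetical helper: classify each number on standard input.
classify_parity() {
    awk '{
        if ($1 % 2 == 0) {
            # Two statements grouped into one compound statement.
            evens++
            print $1, "is even"
        } else
            print $1, "is odd"
    }
    END {
        # An unset variable is the null string, so this test is false
        # when no even number was seen.
        if (evens)
            print evens, "even number(s) seen"
        else
            print "no even numbers seen"
    }'
}

printf '4\n7\n10\n' | classify_parity
```

Because the closing brace ends the THEN-BODY, no semicolon is needed
before the 'else' that follows it on the same line.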


File: gawk.info, Node: While Statement, Next: Do Statement, Prev: If Statement, Up: Statements

7.4.2 The 'while' Statement
---------------------------

In programming, a "loop" is a part of a program that can be executed two
or more times in succession. The 'while' statement is the simplest
looping statement in 'awk'. It repeatedly executes a statement as long
as a condition is true. It looks like this:

     while (CONDITION)
         BODY

BODY is a statement called the "body" of the loop, and CONDITION is an
expression that controls how long the loop keeps running. The first
thing the 'while' statement does is test the CONDITION. If the
CONDITION is true, it executes the statement BODY. (The CONDITION is
true when the value is not zero and not a null string.) After BODY has
been executed, CONDITION is tested again, and if it is still true, BODY
executes again. This process repeats until the CONDITION is no longer
true. If the CONDITION is initially false, the body of the loop never
executes and 'awk' continues with the statement following the loop.
This example prints the first three fields of each record, one per line:

     awk '
     {
         i = 1
         while (i <= 3) {
             print $i
             i++
         }
     }' inventory-shipped

The body of this loop is a compound statement enclosed in braces,
containing two statements. The loop works in the following manner:
first, the value of 'i' is set to one. Then, the 'while' statement
tests whether 'i' is less than or equal to three. This is true when 'i'
equals one, so the 'i'th field is printed. Then the 'i++' increments
the value of 'i' and the loop repeats. The loop terminates when 'i'
reaches four.

A newline is not required between the condition and the body;
however, using one makes the program clearer unless the body is a
compound statement or else is very simple. The newline after the open
brace that begins the compound statement is not required either, but the
program is harder to read without it.
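
The following sketch is a small variation on the previous example. It
tests against 'NF' instead of the constant three, so that it prints
every field of each record. The shell function name 'print_fields' is
invented for illustration. For an empty record, 'NF' is zero, the
condition is false the first time it is tested, and the body never
executes:

```shell
# Hypothetical helper: print each field with its position.
print_fields() {
    awk '{
        i = 1
        while (i <= NF) {     # NF is the number of fields in this record
            print i ": " $i
            i++
        }
    }'
}

printf 'a b c\nd e\n' | print_fields
```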


File: gawk.info, Node: Do Statement, Next: For Statement, Prev: While Statement, Up: Statements

7.4.3 The 'do'-'while' Statement
--------------------------------

The 'do' loop is a variation of the 'while' looping statement. The 'do'
loop executes the BODY once and then repeats the BODY as long as the
CONDITION is true. It looks like this:

     do
         BODY
     while (CONDITION)

Even if the CONDITION is false at the start, the BODY executes at
least once (and only once, unless executing BODY makes CONDITION true).
Contrast this with the corresponding 'while' statement:

     while (CONDITION)
         BODY

This statement does not execute the BODY even once if the CONDITION is
false to begin with. The following is an example of a 'do' statement:

     {
         i = 1
         do {
             print $0
             i++
         } while (i <= 10)
     }

This program prints each input record 10 times. However, it isn't a
very realistic example, because in this case an ordinary 'while' would
do just as well. This situation reflects actual experience; only
occasionally is there a real use for a 'do' statement.
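
To make the difference between the two loops concrete, the following
sketch runs the same body under a 'do' loop and under a 'while' loop,
starting from a value passed on the command line. The function name
'compare_loops' and the variable names are assumptions made for this
example only:

```shell
# Hypothetical helper: run the same countdown with do and with while.
compare_loops() {
    awk -v start="$1" 'BEGIN {
        n = start
        do {                      # body runs before the first test
            print "do body, n =", n
            n--
        } while (n > 0)

        n = start
        while (n > 0) {           # test comes first; body may never run
            print "while body, n =", n
            n--
        }
    }'
}

compare_loops 0
```

With a starting value of zero, only the 'do' loop prints a line. With a
positive starting value, both loops execute the body the same number of
times.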


File: gawk.info, Node: For Statement, Next: Switch Statement, Prev: Do Statement, Up: Statements

7.4.4 The 'for' Statement
-------------------------

The 'for' statement makes it more convenient to count iterations of a
loop. The general form of the 'for' statement looks like this:

     for (INITIALIZATION; CONDITION; INCREMENT)
         BODY

The INITIALIZATION, CONDITION, and INCREMENT parts are arbitrary 'awk'
expressions, and BODY stands for any 'awk' statement.

The 'for' statement starts by executing INITIALIZATION. Then, as
long as the CONDITION is true, it repeatedly executes BODY and then
INCREMENT. Typically, INITIALIZATION sets a variable to either zero or
one, INCREMENT adds one to it, and CONDITION compares it against the
desired number of iterations. For example:

     awk '
     {
         for (i = 1; i <= 3; i++)
             print $i
     }' inventory-shipped

This prints the first three fields of each input record, with one field
per line.

It isn't possible to set more than one variable in the INITIALIZATION
part without using a multiple assignment statement such as 'x = y = 0'.
This makes sense only if all the initial values are equal. (But it is
possible to initialize additional variables by writing their assignments
as separate statements preceding the 'for' loop.)

The same is true of the INCREMENT part. Incrementing additional
variables requires separate statements at the end of the loop. The C
compound expression, using C's comma operator, is useful in this
context, but it is not supported in 'awk'.
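
The workaround just described can be sketched as follows. The second
variable is assigned before the loop and updated by a separate
statement at the end of the body. The function name 'two_counters' and
the variable 'j' are invented for this illustration:

```shell
# Hypothetical helper: drive two variables from one for loop.
two_counters() {
    awk 'BEGIN {
        j = 10                      # extra variable, set before the loop
        for (i = 1; i <= 3; i++) {
            print i, j
            j -= 2                  # extra variable, updated in the body
        }
    }'
}

two_counters
```

Note that a 'continue' statement placed before the 'j -= 2' line would
skip that update but not the 'i++', so this arrangement is not fully
equivalent to C's comma operator.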

Most often, INCREMENT is an increment expression, as in the previous
example. But this is not required; it can be any expression whatsoever.
For example, the following statement prints all the powers of two
between 1 and 100:

     for (i = 1; i <= 100; i *= 2)
         print i

If there is nothing to be done, any of the three expressions in the
parentheses following the 'for' keyword may be omitted. Thus,
'for (; x > 0;)' is equivalent to 'while (x > 0)'. If the CONDITION is
omitted, it is treated as true, effectively yielding an "infinite loop"
(i.e., a loop that never terminates).

In most cases, a 'for' loop is an abbreviation for a 'while' loop, as
shown here:

     INITIALIZATION
     while (CONDITION) {
         BODY
         INCREMENT
     }

The only exception is when the 'continue' statement (*note Continue
Statement::) is used inside the loop. Changing a 'for' statement to a
'while' statement in this way can change the effect of the 'continue'
statement inside the loop.

The 'awk' language has a 'for' statement in addition to a 'while'
statement because a 'for' loop is often both less work to type and more
natural to think of. Counting the number of iterations is very common
in loops. It can be easier to think of this counting as part of looping
rather than as something to do inside the loop.

There is an alternative version of the 'for' loop, for iterating over
all the indices of an array:

     for (i in array)
         DO SOMETHING WITH array[i]

*Note Scanning an Array::, for more information on this version of the
'for' loop.
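
As a brief sketch of both forms working together, the following program
uses a counting 'for' loop to tally each field, and the array form to
report the tallies. The names 'word_counts', 'count', and 'word' are
invented here. The order in which the array form visits the indices is
not something the program controls, so this sketch pipes the output
through 'sort' to make it predictable:

```shell
# Hypothetical helper: count how often each word appears in the input.
word_counts() {
    awk '{
        for (i = 1; i <= NF; i++)     # counting form: one pass per field
            count[$i]++
    }
    END {
        for (word in count)           # array form: one pass per index
            print word, count[word]
    }' | sort
}

printf 'a b a\nc a\n' | word_counts
```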


File: gawk.info, Node: Switch Statement, Next: Break Statement, Prev: For Statement, Up: Statements

7.4.5 The 'switch' Statement
----------------------------

This minor node describes a 'gawk'-specific feature. If 'gawk' is in
compatibility mode (*note Options::), it is not available.

The 'switch' statement allows the evaluation of an expression and the
execution of statements based on a 'case' match. Case statements are
checked for a match in the order they are defined. If no suitable
'case' is found, the 'default' section is executed, if supplied.

Each 'case' contains a single constant, be it numeric, string, or
regexp. The 'switch' expression is evaluated, and then each 'case''s
constant is compared against the result in turn. The type of constant
determines the comparison: numeric or string do the usual comparisons.
A regexp constant does a regular expression match against the string
value of the original expression. The general form of the 'switch'
statement looks like this:

     switch (EXPRESSION) {
     case VALUE OR REGULAR EXPRESSION:
         CASE-BODY
     default:
         DEFAULT-BODY
     }

Control flow in the 'switch' statement works as it does in C. Once a
match to a given case is made, the case statement bodies execute until a
'break', 'continue', 'next', 'nextfile', or 'exit' is encountered, or
the end of the 'switch' statement itself. For example:

     while ((c = getopt(ARGC, ARGV, "aksx")) != -1) {
         switch (c) {
         case "a":
             # report size of all files
             all_files = TRUE
             break
         case "k":
             BLOCK_SIZE = 1024       # 1K block size
             break
         case "s":
             # do sums only
             sum_only = TRUE
             break
         case "x":
             # don't cross filesystems
             fts_flags = or(fts_flags, FTS_XDEV)
             break
         case "?":
         default:
             usage()
             break
         }
     }

Note that if none of the statements specified here halt execution of
a matched 'case' statement, execution falls through to the next 'case'
until execution halts. In this example, the 'case' for '"?"' falls
through to the 'default' case, which is to call a function named
'usage()'. (The 'getopt()' function being called here is described in
*note Getopt Function::.)
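
The previous example uses only string constants. The following sketch
adds a regexp 'case' and a deliberate fall-through from one string
'case' to the next. It assumes that 'gawk' is installed under that
name, and the function name 'classify_token' is invented for this
example:

```shell
# Hypothetical helper; requires gawk because switch is a gawk extension.
classify_token() {
    gawk '{
        switch ($1) {
        case /^[0-9]+$/:            # regexp match against the string value
            print $1 ": integer"
            break
        case "yes":                 # no break: falls through to "no"
        case "no":
            print $1 ": boolean word"
            break
        default:
            print $1 ": other"
        }
    }'
}

# Skip the demonstration when gawk is not available.
if command -v gawk >/dev/null 2>&1; then
    printf '42\nyes\nno\nfoo\n' | classify_token
fi
```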


File: gawk.info, Node: Break Statement, Next: Continue Statement, Prev: Switch Statement, Up: Statements

7.4.6 The 'break' Statement
---------------------------

The 'break' statement jumps out of the innermost 'for', 'while', or 'do'
loop that encloses it. The following example finds the smallest divisor
of any integer, and also identifies prime numbers:

     # find smallest divisor of num
     {
         num = $1
         for (divisor = 2; divisor * divisor <= num; divisor++) {
             if (num % divisor == 0)
                 break
         }
         if (num % divisor == 0)
             printf "Smallest divisor of %d is %d\n", num, divisor
         else
             printf "%d is prime\n", num
     }

When the remainder is zero in the first 'if' statement, 'awk'
immediately "breaks out" of the containing 'for' loop. This means that
'awk' proceeds immediately to the statement following the loop and
continues processing. (This is very different from the 'exit'
statement, which stops the entire 'awk' program. *Note Exit
Statement::.)

The following program illustrates how the CONDITION of a 'for' or
'while' statement could be replaced with a 'break' inside an 'if':

     # find smallest divisor of num
     {
         num = $1
         for (divisor = 2; ; divisor++) {
             if (num % divisor == 0) {
                 printf "Smallest divisor of %d is %d\n", num, divisor
                 break
             }
             if (divisor * divisor > num) {
                 printf "%d is prime\n", num
                 break
             }
         }
     }

The 'break' statement is also used to break out of the 'switch'
statement. This is discussed in *note Switch Statement::.

The 'break' statement has no meaning when used outside the body of a
loop or 'switch'. However, although it was never documented, historical
implementations of 'awk' treated the 'break' statement outside of a loop
as if it were a 'next' statement (*note Next Statement::). (d.c.)
Recent versions of BWK 'awk' no longer allow this usage, nor does
'gawk'.


File: gawk.info, Node: Continue Statement, Next: Next Statement, Prev: Break Statement, Up: Statements

7.4.7 The 'continue' Statement
------------------------------

Similar to 'break', the 'continue' statement is used only inside 'for',
'while', and 'do' loops. It skips over the rest of the loop body,
causing the next cycle around the loop to begin immediately. Contrast
this with 'break', which jumps out of the loop altogether.

The 'continue' statement in a 'for' loop directs 'awk' to skip the
rest of the body of the loop and resume execution with the
increment-expression of the 'for' statement. The following program
illustrates this fact:

     BEGIN {
          for (x = 0; x <= 20; x++) {
              if (x == 5)
                  continue
              printf "%d ", x
          }
          print ""
     }

This program prints all the numbers from 0 to 20--except for 5, for
which the 'printf' is skipped. Because the increment 'x++' is not
skipped, 'x' does not remain stuck at 5. Contrast the 'for' loop from
the previous example with the following 'while' loop:

     BEGIN {
          x = 0
          while (x <= 20) {
              if (x == 5)
                  continue
              printf "%d ", x
              x++
          }
          print ""
     }

This program loops forever once 'x' reaches 5, because the increment
('x++') is never reached.

The 'continue' statement has no special meaning with respect to the
'switch' statement, nor does it have any meaning when used outside the
body of a loop. Historical versions of 'awk' treated a 'continue'
statement outside a loop the same way they treated a 'break' statement
outside a loop: as if it were a 'next' statement (*note Next
Statement::). (d.c.) Recent versions of BWK 'awk' no longer work this
way, nor does 'gawk'.


File: gawk.info, Node: Next Statement, Next: Nextfile Statement, Prev: Continue Statement, Up: Statements

7.4.8 The 'next' Statement
--------------------------

The 'next' statement forces 'awk' to immediately stop processing the
current record and go on to the next record. This means that no further
rules are executed for the current record, and the rest of the current
rule's action isn't executed.

Contrast this with the effect of the 'getline' function (*note
Getline::). That also causes 'awk' to read the next record immediately,
but it does not alter the flow of control in any way (i.e., the rest of
the current action executes with a new input record).

At the highest level, 'awk' program execution is a loop that reads an
input record and then tests each rule's pattern against it. If you
think of this loop as a 'for' statement whose body contains the rules,
then the 'next' statement is analogous to a 'continue' statement. It
skips to the end of the body of this implicit loop and executes the
increment (which reads another record).

For example, suppose an 'awk' program works only on records with four
fields, and it shouldn't fail when given bad input. To avoid
complicating the rest of the program, write a "weed out" rule near the
beginning, in the following manner:

     NF != 4 {
         printf("%s:%d: skipped: NF != 4\n", FILENAME, FNR) > "/dev/stderr"
         next
     }

Because of the 'next' statement, the program's subsequent rules won't
see the bad record. The error message is redirected to the standard
error output stream, as error messages should be. For more detail, see
*note Special Files::.
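
To show the "weed out" rule in context, here is a minimal sketch of a
complete program built around it. The second rule and the 'END' rule
are additions made for this illustration, as are the names
'sum_fourth_field' and 'total'. Only records that survive the first
rule reach the rule that adds up the fourth field:

```shell
# Hypothetical helper: sum field 4, skipping malformed records.
sum_fourth_field() {
    awk '
    NF != 4 {
        printf("%s:%d: skipped: NF != 4\n", FILENAME, FNR) > "/dev/stderr"
        next                  # later rules never see this record
    }
    { total += $4 }           # reached only for four-field records
    END { print "total:", total + 0 }
    '
}

printf 'a b c 10\nbad line\nd e f 5\n' | sum_fourth_field
```

The total goes to standard output and the complaint about the second
record goes to standard error, so the two can be redirected separately.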

If the 'next' statement causes the end of the input to be reached,
then the code in any 'END' rules is executed. *Note BEGIN/END::.

The 'next' statement is not allowed inside 'BEGINFILE' and 'ENDFILE'
rules. *Note BEGINFILE/ENDFILE::.

According to the POSIX standard, the behavior is undefined if the
'next' statement is used in a 'BEGIN' or 'END' rule. 'gawk' treats it
as a syntax error. Although POSIX does not disallow it, most other
'awk' implementations don't allow the 'next' statement inside function
bodies (*note User-defined::). Just as with any other 'next' statement,
a 'next' statement inside a function body reads the next record and
starts processing it with the first rule in the program.


File: gawk.info, Node: Nextfile Statement, Next: Exit Statement, Prev: Next Statement, Up: Statements

7.4.9 The 'nextfile' Statement
------------------------------

The 'nextfile' statement is similar to the 'next' statement. However,
instead of abandoning processing of the current record, the 'nextfile'
statement instructs 'awk' to stop processing the current data file.

Upon execution of the 'nextfile' statement, 'FILENAME' is updated to
the name of the next data file listed on the command line, 'FNR' is
reset to one, and processing starts over with the first rule in the
program. If the 'nextfile' statement causes the end of the input to be
reached, then the code in any 'END' rules is executed. An exception to
this is when 'nextfile' is invoked during execution of any statement in
an 'END' rule; in this case, it causes the program to stop immediately.
*Note BEGIN/END::.

The 'nextfile' statement is useful when there are many data files to
process but it isn't necessary to process every record in every file.
Without 'nextfile', in order to move on to the next data file, a program
would have to continue scanning the unwanted records. The 'nextfile'
statement accomplishes this much more efficiently.
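
As a sketch of this kind of use, the following program prints the name
of each data file that contains a match for a pattern, and abandons the
file as soon as the first match is found. The function name
'files_matching', the variable 'pat', and the sample files are all
invented for this example. It assumes an 'awk' that supports
'nextfile':

```shell
# Hypothetical helper: list the files that contain a match for $1.
files_matching() {
    pattern=$1
    shift
    awk -v pat="$pattern" '
    $0 ~ pat {
        print FILENAME
        nextfile              # one match is enough; skip the rest
    }' "$@"
}

dir=$(mktemp -d)
printf 'alpha\nerror here\nerror again\n' > "$dir/one.txt"
printf 'gamma\n' > "$dir/two.txt"
printf 'error\n' > "$dir/three.txt"
files_matching error "$dir/one.txt" "$dir/two.txt" "$dir/three.txt"
rm -r "$dir"
```

Without the 'nextfile', the first file would be reported twice, once
for each matching record.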

In 'gawk', execution of 'nextfile' causes additional things to
happen: any 'ENDFILE' rules are executed if 'gawk' is not currently in
an 'END' or 'BEGINFILE' rule, 'ARGIND' is incremented, and any
'BEGINFILE' rules are executed. ('ARGIND' hasn't been introduced yet.
*Note Built-in Variables::.)

With 'gawk', 'nextfile' is useful inside a 'BEGINFILE' rule to skip
over a file that would otherwise cause 'gawk' to exit with a fatal
error. In this case, 'ENDFILE' rules are not executed. *Note
BEGINFILE/ENDFILE::.

Although it might seem that 'close(FILENAME)' would accomplish the
same as 'nextfile', this isn't true. 'close()' is reserved for closing
files, pipes, and coprocesses that are opened with redirections. It is
not related to the main processing that 'awk' does with the files listed
in 'ARGV'.

     NOTE: For many years, 'nextfile' was a common extension. In
     September 2012, it was accepted for inclusion into the POSIX
     standard. See the Austin Group website
     (http://austingroupbugs.net/view.php?id=607).

The current version of BWK 'awk' and 'mawk' also support 'nextfile'.
However, they don't allow the 'nextfile' statement inside function
bodies (*note User-defined::). 'gawk' does; a 'nextfile' inside a
function body reads the first record from the next file and starts
processing it with the first rule in the program, just as any other
'nextfile' statement.


File: gawk.info, Node: Exit Statement, Prev: Nextfile Statement, Up: Statements

7.4.10 The 'exit' Statement
---------------------------

The 'exit' statement causes 'awk' to immediately stop executing the
current rule and to stop processing input; any remaining input is
ignored. The 'exit' statement is written as follows:

     'exit' [RETURN CODE]

When an 'exit' statement is executed from a 'BEGIN' rule, the program
stops processing everything immediately. No input records are read.
However, if an 'END' rule is present, as part of executing the 'exit'
statement, the 'END' rule is executed (*note BEGIN/END::). If 'exit' is
used in the body of an 'END' rule, it causes the program to stop
immediately.

An 'exit' statement that is not part of a 'BEGIN' or 'END' rule stops
the execution of any further automatic rules for the current record,
skips reading any remaining input records, and executes the 'END' rule
if there is one. 'gawk' also skips any 'ENDFILE' rules; they do not
execute.

In such a case, if you don't want the 'END' rule to do its job, set a
variable to a nonzero value before the 'exit' statement and check that
variable in the 'END' rule. *Note Assert Function::, for an example
that does this.
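
A minimal sketch of this flag technique follows. It is not the example
from the node cited above. The names 'check_records', 'failed', and
'sum' are invented for this illustration:

```shell
# Hypothetical helper: sum field 2, but give up on the first bad record.
check_records() {
    awk '
    NF != 2 {
        print "line " NR ": expected 2 fields" > "/dev/stderr"
        failed = 1            # tell the END rule to stay quiet
        exit 1                # the END rule still runs after this
    }
    { sum += $2 }
    END {
        if (!failed)
            print "sum:", sum + 0
    }
    '
}

printf 'a 3\nb 4\n' | check_records
```

When a bad record is seen, the 'END' rule runs but prints nothing, and
the process finishes with the status given to the 'exit' statement.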

If an argument is supplied to 'exit', its value is used as the exit
status code for the 'awk' process. If no argument is supplied, 'exit'
causes 'awk' to return a "success" status. In the case where an
argument is supplied to a first 'exit' statement, and then 'exit' is
called a second time from an 'END' rule with no argument, 'awk' uses the
previously supplied exit value. (d.c.) *Note Exit Status::, for more
information.

For example, suppose an error condition occurs that is difficult or
impossible to handle. Conventionally, programs report this by exiting
with a nonzero status. An 'awk' program can do this using an 'exit'
statement with a nonzero argument, as shown in the following example:

     BEGIN {
         if (("date" | getline date_now) <= 0) {
             print "Can't get system date" > "/dev/stderr"
             exit 1
         }
         print "current date is", date_now
         close("date")
     }

     NOTE: For full portability, exit values should be between zero and
     126, inclusive. Negative values, and values of 127 or greater, may
     not produce consistent results across different operating systems.


File: gawk.info, Node: Built-in Variables, Next: Pattern Action Summary, Prev: Statements, Up: Patterns and Actions

7.5 Predefined Variables
========================

Most 'awk' variables are available to use for your own purposes; they
never change unless your program assigns values to them, and they never
affect anything unless your program examines them. However, a few
variables in 'awk' have special built-in meanings. 'awk' examines some
of these automatically, so that they enable you to tell 'awk' how to do
certain things. Others are set automatically by 'awk', so that they
carry information from the internal workings of 'awk' to your program.

This minor node documents all of 'gawk''s predefined variables, most
of which are also documented in the major nodes describing their areas
of activity.

* Menu:

* User-modified::               Built-in variables that you change to control
                                'awk'.
* Auto-set::                    Built-in variables where 'awk' gives
                                you information.
* ARGC and ARGV::               Ways to use 'ARGC' and 'ARGV'.


File: gawk.info, Node: User-modified, Next: Auto-set, Up: Built-in Variables

7.5.1 Built-in Variables That Control 'awk'
-------------------------------------------

The following is an alphabetical list of variables that you can change
to control how 'awk' does certain things.

The variables that are specific to 'gawk' are marked with a pound
sign ('#'). These variables are 'gawk' extensions. In other 'awk'
implementations or if 'gawk' is in compatibility mode (*note Options::),
they are not special. (Any exceptions are noted in the description of
each variable.)

'BINMODE #'
     On non-POSIX systems, this variable specifies use of binary mode
     for all I/O. Numeric values of one, two, or three specify that
     input files, output files, or all files, respectively, should use
     binary I/O. A numeric value less than zero is treated as zero, and
     a numeric value greater than three is treated as three.
     Alternatively, string values of '"r"' or '"w"' specify that input
     files and output files, respectively, should use binary I/O. A
     string value of '"rw"' or '"wr"' indicates that all files should
     use binary I/O. Any other string value is treated the same as
     '"rw"', but causes 'gawk' to generate a warning message. 'BINMODE'
     is described in more detail in *note PC Using::. 'mawk' (*note
     Other Versions::) also supports this variable, but only using
     numeric values.

'CONVFMT'
     A string that controls the conversion of numbers to strings (*note
     Conversion::). It works by being passed, in effect, as the first
     argument to the 'sprintf()' function (*note String Functions::).
     Its default value is '"%.6g"'. 'CONVFMT' was introduced by the
     POSIX standard.

'FIELDWIDTHS #'
     A space-separated list of columns that tells 'gawk' how to split
     input with fixed columnar boundaries. Assigning a value to
     'FIELDWIDTHS' overrides the use of 'FS' and 'FPAT' for field
     splitting. *Note Constant Size::, for more information.

'FPAT #'
     A regular expression (as a string) that tells 'gawk' to create the
     fields based on text that matches the regular expression.
     Assigning a value to 'FPAT' overrides the use of 'FS' and
     'FIELDWIDTHS' for field splitting. *Note Splitting By Content::,
     for more information.

'FS'
     The input field separator (*note Field Separators::). The value is
     a single-character string or a multicharacter regular expression
     that matches the separations between fields in an input record. If
     the value is the null string ('""'), then each character in the
     record becomes a separate field. (This behavior is a 'gawk'
     extension. POSIX 'awk' does not specify the behavior when 'FS' is
     the null string. Nonetheless, some other versions of 'awk' also
     treat '""' specially.)

     The default value is '" "', a string consisting of a single space.
     As a special exception, this value means that any sequence of
     spaces, TABs, and/or newlines is a single separator.(1) It also
     causes spaces, TABs, and newlines at the beginning and end of a
     record to be ignored.

     You can set the value of 'FS' on the command line using the '-F'
     option:

          awk -F, 'PROGRAM' INPUT-FILES

     If 'gawk' is using 'FIELDWIDTHS' or 'FPAT' for field splitting,
     assigning a value to 'FS' causes 'gawk' to return to the normal,
     'FS'-based field splitting. An easy way to do this is to simply
     say 'FS = FS', perhaps with an explanatory comment.

'IGNORECASE #'
     If 'IGNORECASE' is nonzero or non-null, then all string comparisons
     and all regular expression matching are case-independent. This
     applies to regexp matching with '~' and '!~', the 'gensub()',
     'gsub()', 'index()', 'match()', 'patsplit()', 'split()', and
     'sub()' functions, record termination with 'RS', and field
     splitting with 'FS' and 'FPAT'. However, the value of 'IGNORECASE'
     does _not_ affect array subscripting and it does not affect field
     splitting when using a single-character field separator. *Note
     Case-sensitivity::.

'LINT #'
     When this variable is true (nonzero or non-null), 'gawk' behaves as
     if the '--lint' command-line option is in effect (*note Options::).
     With a value of '"fatal"', lint warnings become fatal errors. With
     a value of '"invalid"', only warnings about things that are
     actually invalid are issued. (This is not fully implemented yet.)
     Any other true value prints nonfatal warnings. Assigning a false
     value to 'LINT' turns off the lint warnings.

     This variable is a 'gawk' extension. It is not special in other
     'awk' implementations. Unlike with the other special variables,
     changing 'LINT' does affect the production of lint warnings, even
     if 'gawk' is in compatibility mode. Much as the '--lint' and
     '--traditional' options independently control different aspects of
     'gawk''s behavior, the control of lint warnings during program
     execution is independent of the flavor of 'awk' being executed.

'OFMT'
     A string that controls conversion of numbers to strings (*note
     Conversion::) for printing with the 'print' statement. It works by
     being passed as the first argument to the 'sprintf()' function
     (*note String Functions::). Its default value is '"%.6g"'.
     Earlier versions of 'awk' used 'OFMT' to specify the format for
     converting numbers to strings in general expressions; this is now
     done by 'CONVFMT'.

'OFS'
     The output field separator (*note Output Separators::). It is
     output between the fields printed by a 'print' statement. Its
     default value is '" "', a string consisting of a single space.

'ORS'
     The output record separator. It is output at the end of every
     'print' statement. Its default value is '"\n"', the newline
     character. (*Note Output Separators::.)

'PREC #'
     The working precision of arbitrary-precision floating-point
     numbers, 53 bits by default (*note Setting precision::).

'ROUNDMODE #'
     The rounding mode to use for arbitrary-precision arithmetic on
     numbers, by default '"N"' ('roundTiesToEven' in the IEEE 754
     standard; *note Setting the rounding mode::).

'RS'
     The input record separator. Its default value is a string
     containing a single newline character, which means that an input
     record consists of a single line of text. It can also be the null
     string, in which case records are separated by runs of blank lines.
     If it is a regexp, records are separated by matches of the regexp
     in the input text. (*Note Records::.)

     The ability for 'RS' to be a regular expression is a 'gawk'
     extension. In most other 'awk' implementations, or if 'gawk' is in
     compatibility mode (*note Options::), just the first character of
     'RS''s value is used.

'SUBSEP'
     The subscript separator. It has the default value of '"\034"' and
     is used to separate the parts of the indices of a multidimensional
     array. Thus, the expression 'foo["A", "B"]' really accesses
     'foo["A\034B"]' (*note Multidimensional::).

'TEXTDOMAIN #'
     Used for internationalization of programs at the 'awk' level. It
     sets the default text domain for specially marked string constants
     in the source text, as well as for the 'dcgettext()',
     'dcngettext()', and 'bindtextdomain()' functions (*note
     Internationalization::). The default value of 'TEXTDOMAIN' is
     '"messages"'.
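
Several of the variables in this list are easiest to understand by
watching them work. The following sketch exercises only variables that
are not 'gawk' extensions: 'OFS' and 'ORS', then 'OFMT' and 'CONVFMT',
then 'SUBSEP'. The function name 'show_separators' and the variable and
array names are invented for this illustration:

```shell
# Hypothetical helper: demonstrate OFS, ORS, OFMT, CONVFMT, and SUBSEP.
show_separators() {
    awk 'BEGIN {
        # OFS goes between the items of a print statement and ORS goes
        # after the last one.
        OFS = "-"; ORS = "!\n"
        print "a", "b", "c"
        ORS = "\n"

        # OFMT is used when print outputs a number directly. CONVFMT is
        # used when a number is converted to a string, as happens in a
        # concatenation.
        OFMT = "%.2f"; CONVFMT = "%.3f"
        x = 3.14159
        print x
        print (x "")

        # SUBSEP joins the parts of a multidimensional subscript.
        grid["A", "B"] = 1
        if (("A" SUBSEP "B") in grid)
            print "found A,B"
    }'
}

show_separators
```

The two lines printed for 'x' differ because the first is governed by
'OFMT' and the second by 'CONVFMT'. Integer values are not affected by
either format.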
|
||
|
||
---------- Footnotes ----------
|
||
|
||
(1) In POSIX 'awk', newline does not count as whitespace.
|
||
|
||
|
||
File: gawk.info, Node: Auto-set, Next: ARGC and ARGV, Prev: User-modified, Up: Built-in Variables
|
||
|
||
7.5.2 Built-in Variables That Convey Information
|
||
------------------------------------------------
|
||
|
||
The following is an alphabetical list of variables that 'awk' sets
|
||
automatically on certain occasions in order to provide information to
|
||
your program.
|
||
|
||
The variables that are specific to 'gawk' are marked with a pound
|
||
sign ('#'). These variables are 'gawk' extensions. In other 'awk'
|
||
implementations or if 'gawk' is in compatibility mode (*note Options::),
|
||
they are not special:
|
||
|
||
'ARGC', 'ARGV'
     The command-line arguments available to 'awk' programs are stored
     in an array called 'ARGV'. 'ARGC' is the number of command-line
     arguments present. *Note Other Arguments::. Unlike most 'awk'
     arrays, 'ARGV' is indexed from 0 to 'ARGC' - 1. In the following
     example:

          $ awk 'BEGIN {
          >         for (i = 0; i < ARGC; i++)
          >             print ARGV[i]
          >      }' inventory-shipped mail-list
          -| awk
          -| inventory-shipped
          -| mail-list

     'ARGV[0]' contains 'awk', 'ARGV[1]' contains 'inventory-shipped',
     and 'ARGV[2]' contains 'mail-list'. The value of 'ARGC' is three,
     one more than the index of the last element in 'ARGV', because the
     elements are numbered from zero.

     The names 'ARGC' and 'ARGV', as well as the convention of indexing
     the array from 0 to 'ARGC' - 1, are derived from the C language's
     method of accessing command-line arguments.

     The value of 'ARGV[0]' can vary from system to system. Also, you
     should note that the program text is _not_ included in 'ARGV', nor
     are any of 'awk''s command-line options. *Note ARGC and ARGV::,
     for information about how 'awk' uses these variables. (d.c.)

'ARGIND #'
     The index in 'ARGV' of the current file being processed. Every
     time 'gawk' opens a new data file for processing, it sets 'ARGIND'
     to the index in 'ARGV' of the file name. When 'gawk' is processing
     the input files, 'FILENAME == ARGV[ARGIND]' is always true.

     This variable is useful in file processing; it allows you to tell
     how far along you are in the list of data files as well as to
     distinguish between successive instances of the same file name on
     the command line.

     While you can change the value of 'ARGIND' within your 'awk'
     program, 'gawk' automatically sets it to a new value when it opens
     the next file.

'ENVIRON'
     An associative array containing the values of the environment. The
     array indices are the environment variable names; the elements are
     the values of the particular environment variables. For example,
     'ENVIRON["HOME"]' might be '"/home/arnold"'. Changing this array
     does not affect the environment passed on to any programs that
     'awk' may spawn via redirection or the 'system()' function. (In a
     future version of 'gawk', it may do so.)

     Some operating systems may not have environment variables. On such
     systems, the 'ENVIRON' array is empty (except for 'ENVIRON["AWKPATH"]'
     and 'ENVIRON["AWKLIBPATH"]'; *note AWKPATH Variable::, and *note
     AWKLIBPATH Variable::).

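The following sketch shows one way to read an environment variable
through 'ENVIRON'. The variable name 'GREETING' and its value are
hypothetical; they are set on the command line only for the duration of
this one 'awk' invocation:

```shell
out=$(GREETING=hello awk 'BEGIN {
    # Use "in" to test for a variable without creating an empty element.
    if ("GREETING" in ENVIRON)
        print "GREETING is " ENVIRON["GREETING"]
    else
        print "GREETING is not set"
}')
echo "$out"    # GREETING is hello
```

Testing with 'in' rather than comparing the element to '""' avoids
creating a new, empty element in the array as a side effect.
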
'ERRNO #'
     If a system error occurs during a redirection for 'getline', during
     a read for 'getline', or during a 'close()' operation, then 'ERRNO'
     contains a string describing the error.

     In addition, 'gawk' clears 'ERRNO' before opening each command-line
     input file. This enables checking if the file is readable inside a
     'BEGINFILE' pattern (*note BEGINFILE/ENDFILE::).

     Otherwise, 'ERRNO' works similarly to the C variable 'errno'.
     Except for the case just mentioned, 'gawk' _never_ clears it (sets
     it to zero or '""'). Thus, you should only expect its value to be
     meaningful when an I/O operation returns a failure value, such as
     'getline' returning -1. You are, of course, free to clear it
     yourself before doing an I/O operation.

'FILENAME'
     The name of the current input file. When no data files are listed
     on the command line, 'awk' reads from the standard input and
     'FILENAME' is set to '"-"'. 'FILENAME' changes each time a new
     file is read (*note Reading Files::). Inside a 'BEGIN' rule, the
     value of 'FILENAME' is '""', because there are no input files being
     processed yet.(1) (d.c.) Note, though, that using 'getline'
     (*note Getline::) inside a 'BEGIN' rule can give 'FILENAME' a
     value.

'FNR'
     The current record number in the current file. 'awk' increments
     'FNR' each time it reads a new record (*note Records::). 'awk'
     resets 'FNR' to zero each time it starts a new input file.

'NF'
     The number of fields in the current input record. 'NF' is set each
     time a new record is read, when a new field is created, or when
     '$0' changes (*note Fields::).

     Unlike most of the variables described in this node, assigning a
     value to 'NF' has the potential to affect 'awk''s internal
     workings. In particular, assignments to 'NF' can be used to create
     fields in or remove fields from the current record. *Note Changing
     Fields::.

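To make the effect of assigning to 'NF' concrete, here is a small
sketch. The input line and the choice of '-' for 'OFS' are arbitrary;
'OFS' is changed only so that the rebuilt record is easy to see:

```shell
out=$(echo 'a b c d' | awk 'BEGIN { OFS = "-" }
{
    NF = 2       # discard fields 3 and 4; $0 is rebuilt using OFS
    print        # a-b
    $5 = "e"     # assigning past the end creates empty fields 3 and 4
    print NF     # 5
    print        # a-b---e
}')
echo "$out"
```
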
'FUNCTAB #'
     An array whose indices and corresponding values are the names of
     all the built-in, user-defined, and extension functions in the
     program.

          NOTE: Attempting to use the 'delete' statement with the
          'FUNCTAB' array causes a fatal error. Any attempt to assign
          to an element of 'FUNCTAB' also causes a fatal error.

'NR'
     The number of input records 'awk' has processed since the beginning
     of the program's execution (*note Records::). 'awk' increments
     'NR' each time it reads a new record.

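Because 'NR' and 'FNR' differ only in when they are reset, a short run
over two files shows the distinction. The following is a minimal
sketch; the file names 'first' and 'second' and their contents are made
up for the illustration:

```shell
tmpdir=$(mktemp -d)
printf 'a\nb\n' > "$tmpdir/first"
printf 'c\nd\ne\n' > "$tmpdir/second"
# FNR restarts for each file, while NR keeps counting across both.
out=$(cd "$tmpdir" && awk '{ print FILENAME, FNR, NR }' first second)
echo "$out"
rm -rf "$tmpdir"
```

The second column starts over at 1 when 'second' is opened, while the
third column continues from 3 through 5.
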
'PROCINFO #'
     The elements of this array provide access to information about the
     running 'awk' program. The following elements (listed
     alphabetically) are guaranteed to be available:

     'PROCINFO["egid"]'
          The value of the 'getegid()' system call.

     'PROCINFO["euid"]'
          The value of the 'geteuid()' system call.

     'PROCINFO["FS"]'
          This is '"FS"' if field splitting with 'FS' is in effect,
          '"FIELDWIDTHS"' if field splitting with 'FIELDWIDTHS' is in
          effect, or '"FPAT"' if field matching with 'FPAT' is in
          effect.

     'PROCINFO["identifiers"]'
          A subarray, indexed by the names of all identifiers used in
          the text of the 'awk' program. An "identifier" is simply the
          name of a variable (be it scalar or array), built-in function,
          user-defined function, or extension function. For each
          identifier, the value of the element is one of the following:

          '"array"'
               The identifier is an array.

          '"builtin"'
               The identifier is a built-in function.

          '"extension"'
               The identifier is an extension function loaded via
               '@load' or '-l'.

          '"scalar"'
               The identifier is a scalar.

          '"untyped"'
               The identifier is untyped (could be used as a scalar or
               an array; 'gawk' doesn't know yet).

          '"user"'
               The identifier is a user-defined function.

          The values indicate what 'gawk' knows about the identifiers
          after it has finished parsing the program; they are _not_
          updated while the program runs.

     'PROCINFO["gid"]'
          The value of the 'getgid()' system call.

     'PROCINFO["pgrpid"]'
          The process group ID of the current process.

     'PROCINFO["pid"]'
          The process ID of the current process.

     'PROCINFO["ppid"]'
          The parent process ID of the current process.

     'PROCINFO["sorted_in"]'
          If this element exists in 'PROCINFO', its value controls the
          order in which array indices will be processed by 'for (INDX
          in ARRAY)' loops. This is an advanced feature, so we defer
          the full description until later; see *note Scanning an
          Array::.

     'PROCINFO["strftime"]'
          The default time format string for 'strftime()'. Assigning a
          new value to this element changes the default. *Note Time
          Functions::.

     'PROCINFO["uid"]'
          The value of the 'getuid()' system call.

     'PROCINFO["version"]'
          The version of 'gawk'.

     The following additional elements in the array are available to
     provide information about the MPFR and GMP libraries if your
     version of 'gawk' supports arbitrary-precision arithmetic (*note
     Arbitrary Precision Arithmetic::):

     'PROCINFO["mpfr_version"]'
          The version of the GNU MPFR library.

     'PROCINFO["gmp_version"]'
          The version of the GNU MP library.

     'PROCINFO["prec_max"]'
          The maximum precision supported by MPFR.

     'PROCINFO["prec_min"]'
          The minimum precision required by MPFR.

     The following additional elements in the array are available to
     provide information about the version of the extension API, if your
     version of 'gawk' supports dynamic loading of extension functions
     (*note Dynamic Extensions::):

     'PROCINFO["api_major"]'
          The major version of the extension API.

     'PROCINFO["api_minor"]'
          The minor version of the extension API.

     On some systems, there may be elements in the array, '"group1"'
     through '"groupN"' for some N. N is the number of supplementary
     groups that the process has. Use the 'in' operator to test for
     these elements (*note Reference to Elements::).

     The 'PROCINFO' array has the following additional uses:

        * It may be used to provide a timeout when reading from any open
          input file, pipe, or coprocess. *Note Read Timeout::, for
          more information.

        * It may be used to cause coprocesses to communicate over
          pseudo-ttys instead of through two-way pipes; this is
          discussed further in *note Two-way I/O::.

'RLENGTH'
     The length of the substring matched by the 'match()' function
     (*note String Functions::). 'RLENGTH' is set by invoking the
     'match()' function. Its value is the length of the matched string,
     or -1 if no match is found.

'RSTART'
     The start index in characters of the substring that is matched by
     the 'match()' function (*note String Functions::). 'RSTART' is set
     by invoking the 'match()' function. Its value is the position of
     the string where the matched substring starts, or zero if no match
     was found.

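'RSTART' and 'RLENGTH' are most often used together with 'substr()' to
extract the text that 'match()' found. The following sketch
demonstrates both the successful and the failing case; the string
'"foobar"' and the two regexps are arbitrary examples:

```shell
out=$(awk 'BEGIN {
    s = "foobar"
    # On success, RSTART and RLENGTH plug directly into substr().
    if (match(s, /o+/))
        print RSTART, RLENGTH, substr(s, RSTART, RLENGTH)
    # There is no "z" in the string, so this call fails.
    match(s, /z/)
    print RSTART, RLENGTH
}')
echo "$out"
```

The first line of output is '2 2 oo'; the second is '0 -1', the values
that signal a failed match.
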
'RT #'
     The input text that matched the text denoted by 'RS', the record
     separator. It is set every time a record is read.

'SYMTAB #'
     An array whose indices are the names of all defined global
     variables and arrays in the program. 'SYMTAB' makes 'gawk''s
     symbol table visible to the 'awk' programmer. It is built as
     'gawk' parses the program and is complete before the program starts
     to run.

     The array may be used for indirect access to read or write the
     value of a variable:

          foo = 5
          SYMTAB["foo"] = 4
          print foo    # prints 4

     The 'isarray()' function (*note Type Functions::) may be used to
     test if an element in 'SYMTAB' is an array. Also, you may not use
     the 'delete' statement with the 'SYMTAB' array.

     You may use an index for 'SYMTAB' that is not a predefined
     identifier:

          SYMTAB["xxx"] = 5
          print SYMTAB["xxx"]

     This works as expected: in this case 'SYMTAB' acts just like a
     regular array. The only difference is that you can't then delete
     'SYMTAB["xxx"]'.

     The 'SYMTAB' array is more interesting than it looks. Andrew
     Schorr points out that it effectively gives 'awk' data pointers.
     Consider his example:

          # Indirect multiply of any variable by amount, return result

          function multiply(variable, amount)
          {
              return SYMTAB[variable] *= amount
          }

          NOTE: In order to avoid severe time-travel paradoxes,(2)
          neither 'FUNCTAB' nor 'SYMTAB' is available as an element
          within the 'SYMTAB' array.

                        Changing 'NR' and 'FNR'

'awk' increments 'NR' and 'FNR' each time it reads a record, instead
of setting them to the absolute value of the number of records read.
This means that a program can change these variables and their new
values are incremented for each record. (d.c.) The following example
shows this:

     $ echo '1
     > 2
     > 3
     > 4' | awk 'NR == 2 { NR = 17 }
     > { print NR }'
     -| 1
     -| 17
     -| 18
     -| 19

Before 'FNR' was added to the 'awk' language (*note V7/SVR3.1::), many
'awk' programs used this feature to track the number of records in a
file by resetting 'NR' to zero when 'FILENAME' changed.

   ---------- Footnotes ----------

   (1) Some early implementations of Unix 'awk' initialized 'FILENAME'
to '"-"', even if there were data files to be processed. This behavior
was incorrect and should not be relied upon in your programs.

   (2) Not to mention difficult implementation issues.


File: gawk.info, Node: ARGC and ARGV, Prev: Auto-set, Up: Built-in Variables

7.5.3 Using 'ARGC' and 'ARGV'
-----------------------------

*note Auto-set::, presented the following program describing the
information contained in 'ARGC' and 'ARGV':

     $ awk 'BEGIN {
     >        for (i = 0; i < ARGC; i++)
     >            print ARGV[i]
     >      }' inventory-shipped mail-list
     -| awk
     -| inventory-shipped
     -| mail-list

In this example, 'ARGV[0]' contains 'awk', 'ARGV[1]' contains
'inventory-shipped', and 'ARGV[2]' contains 'mail-list'. Notice that
the 'awk' program is not entered in 'ARGV'. The other command-line
options, with their arguments, are also not entered. This includes
variable assignments done with the '-v' option (*note Options::).
Normal variable assignments on the command line _are_ treated as
arguments and do show up in the 'ARGV' array. Given the following
program in a file named 'showargs.awk':

     BEGIN {
         printf "A=%d, B=%d\n", A, B
         for (i = 0; i < ARGC; i++)
             printf "\tARGV[%d] = %s\n", i, ARGV[i]
     }
     END   { printf "A=%d, B=%d\n", A, B }

Running it produces the following:

     $ awk -v A=1 -f showargs.awk B=2 /dev/null
     -| A=1, B=0
     -|        ARGV[0] = awk
     -|        ARGV[1] = B=2
     -|        ARGV[2] = /dev/null
     -| A=1, B=2

A program can alter 'ARGC' and the elements of 'ARGV'. Each time
'awk' reaches the end of an input file, it uses the next element of
'ARGV' as the name of the next input file. By storing a different
string there, a program can change which files are read. Use '"-"' to
represent the standard input. Storing additional elements and
incrementing 'ARGC' causes additional files to be read.

If the value of 'ARGC' is decreased, that eliminates input files from
the end of the list. By recording the old value of 'ARGC' elsewhere, a
program can treat the eliminated arguments as something other than file
names.

To eliminate a file from the middle of the list, store the null
string ('""') into 'ARGV' in place of the file's name. As a special
feature, 'awk' ignores file names that have been replaced with the null
string. Another option is to use the 'delete' statement to remove
elements from 'ARGV' (*note Delete::).

All of these actions are typically done in the 'BEGIN' rule, before
actual processing of the input begins. *Note Split Program::, and *note
Tee Program::, for examples of each way of removing elements from
'ARGV'.

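The following sketch combines two of the techniques just described: it
blanks out a file name in the middle of the list and appends one that
was not given on the command line. The file names 'a' through 'd' and
their contents are invented for the illustration:

```shell
tmpdir=$(mktemp -d)
for name in a b c d; do
    echo "from $name" > "$tmpdir/$name"
done
out=$(cd "$tmpdir" && awk 'BEGIN {
    ARGV[2] = ""        # skip the second file, b
    ARGV[ARGC] = "d"    # append a file that was not on the command line
    ARGC++
}
{ print FILENAME ": " $0 }' a b c)
echo "$out"
rm -rf "$tmpdir"
```

Only 'a', 'c', and 'd' are read. The new element is stored at index
'ARGC' before 'ARGC' is incremented, because 'ARGV' is indexed from
zero.
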
To actually get options into an 'awk' program, end the 'awk' options
with '--' and then supply the 'awk' program's options, in the following
manner:

     awk -f myprog.awk -- -v -q file1 file2 ...

The following fragment processes 'ARGV' in order to examine, and then
remove, the previously mentioned command-line options:

     BEGIN {
         for (i = 1; i < ARGC; i++) {
             if (ARGV[i] == "-v")
                 verbose = 1
             else if (ARGV[i] == "-q")
                 debug = 1
             else if (ARGV[i] ~ /^-./) {
                 e = sprintf("%s: unrecognized option -- %c",
                         ARGV[0], substr(ARGV[i], 2, 1))
                 print e > "/dev/stderr"
             } else
                 break
             delete ARGV[i]
         }
     }

Ending the 'awk' options with '--' isn't necessary in 'gawk'. Unless
'--posix' has been specified, 'gawk' silently puts any unrecognized
options into 'ARGV' for the 'awk' program to deal with. As soon as it
sees an unknown option, 'gawk' stops looking for other options that it
might otherwise recognize. The previous command line with 'gawk' would
be:

     gawk -f myprog.awk -q -v file1 file2 ...

Because '-q' is not a valid 'gawk' option, it and the following '-v' are
passed on to the 'awk' program. (*Note Getopt Function::, for an 'awk'
library function that parses command-line options.)

When designing your program, you should choose options that don't
conflict with 'gawk''s, because it will process any options that it
accepts before passing the rest of the command line on to your program.
Using '#!' with the '-E' option may help (*note Executable Scripts::
and *note Options::).


File: gawk.info, Node: Pattern Action Summary, Prev: Built-in Variables, Up: Patterns and Actions

7.6 Summary
===========

   * Pattern-action pairs make up the basic elements of an 'awk'
     program. Patterns are either normal expressions, range
     expressions, or regexp constants; one of the special keywords
     'BEGIN', 'END', 'BEGINFILE', or 'ENDFILE'; or empty. The action
     executes if the current record matches the pattern. Empty
     (missing) patterns match all records.

   * I/O from 'BEGIN' and 'END' rules has certain constraints. This is
     also true, only more so, for 'BEGINFILE' and 'ENDFILE' rules. The
     latter two give you "hooks" into 'gawk''s file processing, allowing
     you to recover from a file that otherwise would cause a fatal error
     (such as a file that cannot be opened).

   * Shell variables can be used in 'awk' programs by careful use of
     shell quoting. It is easier to pass a shell variable into 'awk' by
     using the '-v' option and an 'awk' variable.

   * Actions consist of statements enclosed in curly braces. Statements
     are built up from expressions, control statements, compound
     statements, input and output statements, and deletion statements.

   * The control statements in 'awk' are 'if'-'else', 'while', 'for',
     and 'do'-'while'. 'gawk' adds the 'switch' statement. There are
     two flavors of 'for' statement: one for performing general looping,
     and the other for iterating through an array.

   * 'break' and 'continue' let you exit early or start the next
     iteration of a loop (or get out of a 'switch').

   * 'next' and 'nextfile' let you read the next record and start over
     at the top of your program or skip to the next input file and start
     over, respectively.

   * The 'exit' statement terminates your program. When executed from
     an action (or function body), it transfers control to the 'END'
     rules. From the body of an 'END' rule, it exits immediately.
     You may pass an optional numeric value to be used as 'awk''s exit
     status.

   * Some predefined variables provide control over 'awk', mainly for
     I/O. Other variables convey information from 'awk' to your program.

   * 'ARGC' and 'ARGV' make the command-line arguments available to your
     program. Manipulating them from a 'BEGIN' rule lets you control
     how 'awk' will process the provided data files.


File: gawk.info, Node: Arrays, Next: Functions, Prev: Patterns and Actions, Up: Top

8 Arrays in 'awk'
*****************

An "array" is a table of values called "elements". The elements of an
array are distinguished by their "indices". Indices may be either
numbers or strings.

This major node describes how arrays work in 'awk', how to use array
elements, how to scan through every element in an array, and how to
remove array elements. It also describes how 'awk' simulates
multidimensional arrays, as well as some of the less obvious points
about array usage. The major node moves on to discuss 'gawk''s facility
for sorting arrays, and ends with a brief description of 'gawk''s
ability to support true arrays of arrays.

* Menu:

* Array Basics::                The basics of arrays.
* Numeric Array Subscripts::    How to use numbers as subscripts in
                                'awk'.
* Uninitialized Subscripts::    Using uninitialized variables as subscripts.
* Delete::                      The 'delete' statement removes an element
                                from an array.
* Multidimensional::            Emulating multidimensional arrays in
                                'awk'.
* Arrays of Arrays::            True multidimensional arrays.
* Arrays Summary::              Summary of arrays.


File: gawk.info, Node: Array Basics, Next: Numeric Array Subscripts, Up: Arrays

8.1 The Basics of Arrays
========================

This minor node presents the basics: working with elements in arrays one
at a time, and traversing all of the elements in an array.

* Menu:

* Array Intro::                 Introduction to Arrays
* Reference to Elements::       How to examine one element of an array.
* Assigning Elements::          How to change an element of an array.
* Array Example::               Basic Example of an Array
* Scanning an Array::           A variation of the 'for' statement. It
                                loops through the indices of an array's
                                existing elements.
* Controlling Scanning::        Controlling the order in which arrays are
                                scanned.


File: gawk.info, Node: Array Intro, Next: Reference to Elements, Up: Array Basics

8.1.1 Introduction to Arrays
----------------------------

     Doing linear scans over an associative array is like trying to club
     someone to death with a loaded Uzi.
                                                     -- _Larry Wall_

The 'awk' language provides one-dimensional arrays for storing groups
of related strings or numbers. Every 'awk' array must have a name.
Array names have the same syntax as variable names; any valid variable
name would also be a valid array name. But one name cannot be used in
both ways (as an array and as a variable) in the same 'awk' program.

Arrays in 'awk' superficially resemble arrays in other programming
languages, but there are fundamental differences. In 'awk', it isn't
necessary to specify the size of an array before starting to use it.
Additionally, any number or string, not just consecutive integers, may
be used as an array index.

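A brief sketch makes both differences visible. The array name 'data'
and the indices are arbitrary choices for the illustration; note that
nothing declares the array or its size, and that the indices are a
string, a negative number, and a large number with nothing in between:

```shell
out=$(awk 'BEGIN {
    # No declaration or size is needed before the first assignment.
    data["apple"] = 3
    data[-7] = "negative"
    data[1000000] = "sparse"
    n = 0
    for (k in data)
        n++
    print n, data["apple"], data[-7], data[1000000]
}')
echo "$out"    # 3 3 negative sparse
```

The array holds exactly three elements; using 1000000 as an index does
not create the million elements below it.
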
In most other languages, arrays must be "declared" before use,
including a specification of how many elements or components they
contain. In such languages, the declaration causes a contiguous block
of memory to be allocated for that many elements. Usually, an index in
the array must be a nonnegative integer. For example, the index zero
specifies the first element in the array, which is actually stored at
the beginning of the block of memory. Index one specifies the second
element, which is stored in memory right after the first element, and so
on. It is impossible to add more elements to the array, because it has
room only for as many elements as given in the declaration. (Some
languages allow arbitrary starting and ending indices--e.g., '15 ..
27'--but the size of the array is still fixed when the array is
declared.)

A contiguous array of four elements might look like *note Figure 8.1:
figure-array-elements, conceptually, if the element values are eight,
'"foo"', '""', and 30.

|