donna-fletcher
1
awk
• awk is a file-processing programming language.
• Makes it easy to perform text manipulation tasks.
• Is used in
– Generating reports
– Matching patterns
– Validating data
– Filtering data for transmission
• An awk program is a sequence of statements of the form
– Pattern {action}
– Scans the input lines, in order, one at a time.
– Searches for the pattern and if pattern is found, the corresponding action is performed.
– Each statement of awk program is executed for each line of input.
2
awk
3
awk programming model
• awk program consists of a main input loop (you don’t write the loop but the main program works as one).
• The main routine reads one line of input from a file and makes it available for processing. The main loop executes as many times as there are lines in the input.
• Preprocessing before the main loop and post processing after the loop are done with BEGIN and END.
• The routine is applied to each input line, one line at a time.
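The model above can be sketched with a three-rule program. This is a minimal illustration, not part of the original slides: the three-line input and the message texts are made up, and the shell wrapper only feeds awk some input.

```sh
# Hypothetical input: three lines. BEGIN runs once before it, END once after it.
out=$(printf 'alpha\nbeta\ngamma\n' | awk '
BEGIN { print "start" }            # preprocessing, before any input is read
      { print NR ": " $0 }         # main routine, run once per input line
END   { print "lines read: " NR }  # postprocessing, after the last line
')
printf '%s\n' "$out"
```

The middle rule has no pattern, so it applies to every line; NR still holds the count of records read when END runs.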
4
awk
• Two ways to present the program to awk.
– Make the program the first argument on the command line, if the program is short.
– awk 'program' [filename ...]
– Examples:
%awk '/Smith/ {print}' people
%awk '/Smith/ {print}' -
– Put the program in a separate file and tell awk to use the program file on the input files.
– Example:
awk -f awkprog file1 file2
• Keywords and some important functions
– BEGIN, END, FILENAME, FS, NF, NR, OFS, ORS, OFMT, RS
– break, close, continue, exit, exp, for, getline, if, in, index, int, length
– log, next, number, print, printf, split, sprintf, sqrt, string, substr, while
• Operators
– Assignment, compound assignment, arithmetic, relational, logical and regular expression matching operators.
5
Some Regular Expression Metacharacters
• \ - escapes any meta character that follows, including itself.
• ^ - anchors the following regular expression to the beginning of string.
• $ - anchors the following regular expression to the end of string.
• . (dot) Matches any character including newline
• […] – matches any one of the class characters enclosed between the brackets.
• [^] – A circumflex as first character inside [] reverses the match to all characters except those listed in the [].
• r1 | r2: between two regular expressions r1 and r2, it allows either of the regular expressions to be matched.
• r* - Matches any number (including zero) of the regular expression that precedes it.
• r+ - Matches one or more occurrences of the regular expression that precedes it.
• r? - Matches 0 or 1 occurrences of the regular expression that precedes it.
• () – groups regular expressions
• \{n,m\} – Matches a range of occurrences of the single character that precedes it: any number of occurrences between n and m. This backslash form is the grep/sed spelling; in awk's extended regular expressions the interval is written {n,m}. May not be available in very old versions.
6
Writing Regular Expressions
• Writing regular expressions involves three steps:
– Specification: Knowing what you want to match.
– Coding: Writing an expression to describe what you want to match
– Testing: Testing the pattern to see what it matches.
– Testing your regular expression may result in,
• Hits: Lines you wanted to match
• Misses: Lines you did not want to match
• Omissions: Lines you wanted to match but did not.
• False Alarms: The lines you matched but did not want to match.
– Eliminate false alarms by limiting the matches and capture the omissions by expanding the possible matches.
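The specify/code/test cycle can be tried on a small sample. The sketch below assumes a made-up specification ("lines containing the word cat, in either capitalization") and three made-up test lines; neither comes from the slides.

```sh
lines='the cat sat
concatenate files
Cat food'

# First attempt: one hit, one false alarm (concatenate), one omission (Cat food).
first=$(printf '%s\n' "$lines" | awk '/cat/')

# Refined: require a non-letter (or a line edge) on both sides and accept C or c.
refined=$(printf '%s\n' "$lines" | awk '/(^|[^A-Za-z])[Cc]at([^A-Za-z]|$)/')

printf 'first:\n%s\nrefined:\n%s\n' "$first" "$refined"
```

Limiting the match (the surrounding non-letters) removes the false alarm, and expanding it ([Cc]) captures the omission.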
7
Some Examples
What do they match?
• [a-zA-Z?+!] -
• [a-zA-Z][?+!] -
• [-+*/] -
• AB\{2,4\}C -
• UNIX|LINUX -
• Compan(y|ies) -
• [0-9][0-9]*\.\{2,\}[a-z][a-z]* -
8
Multiline Records
• FS – default value is a single space. FS can be set to a single character. When more than one character is given it is interpreted as a regular expression.
• RS – default value is a newline. Default value can be changed.
• Example:
BEGIN {RS = "" ; FS = "\n"} # Record separator is a blank line
{ print "Name ", $1
print "Zip ", $NF
}
Input file:
John Smith
235 Alameda
Santa Clara
CA
95053
Output:
Name John Smith
Zip 95053
9
Examples
cat prog1.awk
# test for integer, string or a blank line.
/[0-9]+/ {print $0 ": An integer"}
/[A-Za-z]+/ { print $0 ": A String"}
/^$/ {print "A Blank line"}
# + metacharacter – one or more
cat testfile
1234
This is a test
789 Hello
%awk -f prog1.awk testfile
1234: An integer
This is a test: A String
789 Hello: An integer
789 Hello: A String
A Blank line
A Blank line
10
Examples
%cat prog2.awk
BEGIN {FS = ","} # Comma is the field separator
{ print $1
print $2
print $3}
% cat prog3.awk
BEGIN {FS = ","}
/CA/ {print $1 "," $3} # will match any field with CA
$3 ~ /CA/ {print $1 "," $3} # field match
%cat testfile2
John Smith, Santa Clara, CA
Mary Jones, Red Bank, NJ
Susan Wang, Denver, CO
% awk -f prog2.awk testfile2
What is the output?
• More than one character can be specified as a field separator; it is then interpreted as a regular expression.
• Examples:
FS = "\t+"
How many fields are in the following line?
IJK\t\tXYZ
FS = "[':,\t]"
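The field-count question can be checked directly. This is a small sketch: the input line is the one from the slide, with \t expanded to real tabs by printf, and the variable names are illustrative.

```sh
# FS is a single tab: every tab separates, so the empty field between them counts.
single=$(printf 'IJK\t\tXYZ\n' | awk 'BEGIN { FS = "\t" } { print NF }')

# FS is the regular expression "one or more tabs": the run of tabs is one separator.
multi=$(printf 'IJK\t\tXYZ\n' | awk 'BEGIN { FS = "\t+" } { print NF }')

printf 'FS="\\t" gives %s fields, FS="\\t+" gives %s fields\n' "$single" "$multi"
```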
11
Examples
$cat prog4.awk
BEGIN { printf("Scores\n") }
{ print $0; total = total + $2}
#NR – number of input records that are read
END {print "Average score is ", total / NR }
$cat scores
Smith 80
Jones 97
Chan 95
King 78
$ awk -f prog4.awk scores
Scores
Smith 80
Jones 97
Chan 95
King 78
Average score is 87.5
12
Passing Parameters into awk script
• Parameters can be passed from the command line into an awk script. Variables are set on the command line and can be accessed from the awk script.
• Parameters that are passed in are not available in BEGIN; they become available only once awk starts reading the input file that follows them.
• Example – param.awk
BEGIN {print "Passing Parameters"}
{print "arg1 = ", arg1
print "arg2 = ", arg2}
From the command line, invoke
awk -f param.awk arg1=100 arg2=200 datafile
A shell script's command line arguments can be passed in as follows. Assume that the following line is in a shell script called awktest.sh (each assignment is quoted separately so that it stays a separate argument):
awk -f param.awk "arg1=$1" "arg2=$2" datafile
$1 and $2 are the positional parameters given as arguments on the command line when awktest.sh is invoked as
awktest.sh 100 200
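The timing of command-line assignments can be observed with a short sketch. The -v option used here is not mentioned in the slides; it is the standard awk option for assigning a variable before BEGIN runs. The variable name early and the one-line input are made up for illustration.

```sh
out=$(printf 'data\n' | awk -v early=1 '
BEGIN { print "BEGIN sees arg1=[" arg1 "] early=[" early "]" }
      { print "main sees arg1=[" arg1 "]" }
' arg1=100 -)
printf '%s\n' "$out"
```

The trailing - names standard input as the file operand, so arg1=100 is applied just before awk starts reading it.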
13
Patterns Using Regular Expressions
# print lines ending with ia
awk '/ia$/ {print}' countries
# print countries ending with ia
awk '$1 ~ /ia$/ {print $1 }' countries
# select lines where the third field matches Asia or begins with North or South
$3 ~ /Asia|^North|^South/ {print}
# Pattern ranges
/Russia/,/Brazil/ {print}
# Replace USA by United States
/USA/ {$1 = "United States"; print}
%cat countries
Australia 3000 Australia
USA 3615 North America
Argentina 1072 South America
India 1270 Asia
Russia 8650 Asia
China 3692 Asia
Brazil 3286 South America
14
Associative Arrays
• Arrays in awk are associative arrays where the index can be a number or a string.
• The order in which for (item in array) retrieves the items is unspecified and may appear random.
%cat prog6.awk
{ x [$1] = $2 }
END {
for (item in x)
print item,x[item]
}
%awk -f prog6.awk scores
Jones 89
Smith 65
Chen 100
King 120
Lowel 200
15
Example: Computing Grades
%cat prog7.awk
BEGIN { OFS = "\t" }
{
  # main loop applied to all input lines
  total = 0
  for (I = 2; I <= NF; ++I)
    total += $I
  average = total / (NF - 1)
  # store each student average
  stAvg[NR] = average
  avgByName[$1] = average
  # determine the letter grade
  if (average >= 90) grade = "A"
  else if (average >= 80) grade = "B"
  else if (average >= 70) grade = "C"
  else grade = "F"
  # store a count of the letter grades
  ++classGrade[grade]
  # print the per-student line that appears in the sample output
  print $1, average, grade
}
16
# class statistics
END {
  # calculate class average
  for (x = 1; x <= NR; x++)
    classTotal += stAvg[x]
  classAve = classTotal / NR
  print "Class Average = " classAve
  # determine how many above or below average
  # print number of students per letter grade
  print "Enter name "
  getline name < "-"
  print name ": " avgByName[name]
  for (letterGrade in classGrade)
    print letterGrade ":" classGrade[letterGrade] | "sort"
}
17
%cat grades
Smith 90 80 50
Jones 20 0 70
Wang 67 90 80
Wolf 70 100 90
Pratt 90 88 92
%awk -f prog7.awk grades
Smith 73.3333 C
Jones 30 F
Wang 79 C
Wolf 86.6667 B
Pratt 90 A
Class Average = 71.8
Enter name
Smith
Smith: 73.3333
A:1
B:1
C:2
F:1
18
Multidimensional arrays
# awk offers a syntax for subscripts that simulates a reference to multidimensional arrays
{
  for (i = 1; i <= NF; ++i)
    table[NR,i] = $i
}
END {
  for (k = 1; k <= NR; ++k) {
    for (i = 1; i <= 4; ++i) {
      total += table[k,i]
      printf("%d ", table[k,i])
    }
    printf("\n")
  }
  print "Total = " total
}
19
next and getline
• next causes the next input line to be read.
• The next statement passes control back to the top of the script.
%cat prog9.awk
NF == 2 {next} # skips to the next record and starts the program from the
# beginning
/USA/ {$4 = "United States Of America"; print $0}
{print NR }
%cat countries
Japan Asia
2: UK Europe
3: Brazil S.America
Egypt Africa
5: USA N.America
Canada N.America
% awk -f prog9.awk countries
2
3
5: USA N.America United States Of America
5
20
Using getline
# Using the getline function to read the next line of input
/^\/+/ { getline
  print $1 }
# get input from the user (standard input)
BEGIN {
  printf "Enter your name: "
  getline name < "-"
  print name
}
/Smith/ { getline
  print $1 }
21
#Reading from a pipe using a getline
{while (("who" | getline) > 0) # stop on end of output or on error
terminal[$1] = $2
}
END{
for (item in terminal)
print item, terminal[item]
}
22
Example - A word lookup
# reads a file with acronyms and their expansions,
# handles users queries
BEGIN { FS = "\t"; OFS = "\t"
printf ("Enter a word for lookup: ");
}
#Load the file named acronyms
FILENAME == "acronyms" {
wordList[$1] = $2
next
}
23
Example - A word lookup (cont)
# scan for command to exit program
$0 ~ /^(quit|[qQ]|[Xx]|exit)$/ { exit }
# process any non-empty line
$0 != "" {
if ( $0 in wordList) { print wordList[$0]}
else print $0 " not found"}
# Prompt user to enter another word
{printf ("Enter another word or q|Q to quit");
} acronyms -
24
split ()
• split() is a built-in function that can parse any string into elements of an array.
• Syntax:
• numberOfElements = split(string, array, separator). If no separator is specified, FS is used as the field separator.
{ n = split($0, days)
for (j = 1; j <= n; ++j)
print days[j]
}
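A separator can also be passed explicitly as the third argument. The sketch below uses a made-up date string and a hypothetical array name, part, to show the return value and the 1-based elements.

```sh
out=$(printf '2024-01-15\n' | awk '{
  n = split($0, part, "-")        # n is the number of elements stored in part
  print n
  for (j = 1; j <= n; ++j)
    print part[j]
}')
printf '%s\n' "$out"
```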
25
next
• The next statement forces awk to immediately stop processing the current record and go on to the next record. The rest of the current rule's action is not executed either.
• If you think of the main body in awk as a loop, the next statement is analogous to a continue statement: it skips to the end of the body of this implicit loop, and executes the increment (which reads another record).
• Note: the getline function causes awk to read the next record immediately, but it does not alter the flow of control in any way. So the rest of the current action executes with a new input record.
• For example, if your awk program works only on records with four fields, and you don't want it to fail when given bad input, you might put a rule near the beginning of the program that skips every other record with next.
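The slide does not show the rule itself, so the following is only a sketch of what such a guard might look like. The three input lines and the second rule are invented for illustration.

```sh
out=$(printf 'a b c d\nbad line\ne f g h\n' | awk '
NF != 4 { next }          # skip any record without exactly four fields
        { print $4, $1 }  # only reached for well-formed records
')
printf '%s\n' "$out"
```

Because next abandons the current record, the second rule never sees the two-field line.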
26
Example:
FILENAME == "names.txt" {
count += 1;
next
}
{print $0 }
END{
print count
}
#Counts each line in the file, “names.txt”.
28
getline
• getline is used to read the next line of input from the current input file, from a specified file, or from a pipe.
• The getline command can be used without arguments to read input from the current input file.
• It reads the next input record and splits it up into fields. This is useful if you've finished processing the current record, but you want to continue processing from the next record.
• Note: the new value of $0 is used in testing the patterns of any subsequent rules. The original value of $0 that triggered the rule which executed getline is lost.
29
Example:
/^[0-9]+/ {print "Line number ", NR, ":", "starts with a number" }
/^\/\*/ { getline }
{print NR ":" $0 }
Input:
This is a cat
1234 a cat
A test
/* A comment line */
990 is the score
Output:
1:This is a cat
Line number 2 : starts with a number
2:1234 a cat
3:A test
5:990 is the score
30
getline
• Using getline to read a line into a variable
• You can use `getline variable' to read the next record from awk's input into the variable variable. No other processing is done.
• For example, suppose the next line is a comment, or a special string, and you want to read it, without triggering any rules. This form of getline allows you to read that line and store it in a variable so that the main read-a-line-and-check-each-rule loop of awk never sees it.
• The getline command used in this way sets only the variables NR and FNR.
• The record is not split into fields, so the values of the fields (including $0) and the value of NF do not change.
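The effect on $0 and NR can be seen in a small sketch. The four input lines and the variable name note are made up for illustration.

```sh
out=$(printf 'HEADER\nskip me\nA\nB\n' | awk '
/^HEADER/ { getline note                      # next record goes into note only
            print "note=" note " $0=" $0 " NR=" NR
            next }
          { print "rule saw: " $0 }
')
printf '%s\n' "$out"
```

The line stored in note never reaches the second rule, $0 still holds the record that triggered the first rule, and NR has advanced to 2.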
31
• What is the output of the following program on the input file given below:
/^[A-Za-z]/ { getline tmp
  print tmp }
{print $0 }
Input file:
ABCD
1234
EFGH
5678
32
getline
• Using `getline < file' to read the next record from the file file.
• Here file is a string-valued expression that specifies the file name. `< file' is called a redirection since it directs input to come from a different place.
• For example, the following program reads its input record from the file `input.dat' when it encounters a first field with a value equal to 10 in the current input file.
awk '{ if ($1 == 10) {
         getline < "input.dat"
         print
       } else
         print }'
• Since the main input stream is not used, the values of NR and FNR are not changed. But the record read is split into fields in the normal manner, so the values of $0 and other fields are changed. So is the value of NF.
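The statement about NR, $0 and NF can be checked with a temporary side file. This is a sketch under assumptions: it relies on the mktemp utility being available, and the file contents and variable names are made up.

```sh
side=$(mktemp)                      # hypothetical side file for the redirection
printf '10 20 30\n' > "$side"

out=$(printf 'main record\n' | awk -v f="$side" '
{
  before = $0
  if ((getline < f) > 0)            # replaces $0 and NF, leaves NR alone
    print before " -> " $0 ", NF=" NF ", NR=" NR
  close(f)
}')
rm -f "$side"
printf '%s\n' "$out"
```

Testing the return value against 0 guards against a missing file, for which getline returns -1.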
33
• Using getline to read the output of a command from a pipe:
• You can pipe the output of a command into getline, using `command | getline'. In this case, the string command is run as a shell command and its output is piped into awk to be used as input. This form of getline reads one record at a time from the pipe.
• For example, the following program copies its input to its output, except for lines that begin with `@execute', which are replaced by the output produced by running the rest of the line as a shell command:
awk '{ if ($1 == "@execute") {
         tmp = substr($0, 10)
         while ((tmp | getline) > 0)
           print
         close(tmp)
       } else
         print }' input
The close function is called to ensure that if two identical `@execute' lines appear in the input, the command is run for each one.
34
close()
• close() allows you to close open files and pipes.
– There may be a limitation on the number of files and pipes that can be open at the same time.
– Closing a pipe allows you to run the same command twice.
– Example: close("who")
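The "run the same command twice" point can be seen by reading from a pipe before and after closing it. The command string and variable names below are illustrative, not from the slides.

```sh
out=$(awk 'BEGIN {
  cmd = "echo hello"
  cmd | getline first
  again = (cmd | getline second)    # 0: the pipe is already at end of output
  close(cmd)
  rerun = (cmd | getline second)    # 1: the command is started again
  print again, rerun, second
}')
printf '%s\n' "$out"
```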
35
What is the output for the given input file
Jsmith
Mjones
@execute who
TWolf
36
• Using getline to read the output of a command from pipe into a variable:
• When you use `command | getline var', the output of the command command is sent through a pipe to getline and into the variable var.
• Example:
awk 'BEGIN { "date" | getline current_time
             close("date")
             print "Report printed on " current_time }'
• In this version of getline, none of the built-in variables are changed, and the record is not split into fields.
37
Using system()
• The system() function executes a command supplied as an expression.
• The output generated from executing system() is not available within the program for processing.
• system() returns the exit status of the program that was executed.
Example:
#!/bin/awk -f
BEGIN{
status = system ("mkdir temp")
if (status != 0)
print "command failed"
}
38
User-defined functions
• A function definition can be anywhere that a pattern-action rule can be.
• Inputs to the function are passed as a list of parameters.
Example:
# inserts a string, insertStr after position in aString
function insertString(aString, position, insertStr){
before = substr(aString, 1,position)
after = substr(aString,position +1)
return before insertStr after
}
{ print insertString($1,5,"BBBB") }
# No spaces are allowed between the function name and the left parenthesis.
39
• All the variables in the parameter list are considered local to the function.
• All variables defined in the body of the function are treated as global variables.
• Therefore any temporary variables that are declared are put at the end of the parameter list.
• Example:
function insertString(aString, position, insertStr, after){
  before = substr(aString, 1, position)
  after = substr(aString, position + 1)
  return before insertStr after
}
{ print insertString($1,5,"BBBB") }
{ print aString }
{ print "before: " before}
{ print "after: " after }
40
cat testFile
HelloWorld
This is a test
XYZ1234567890
awk -f fun2.awk testFile
HelloBBBBWorld
before: Hello
after:
ThisBBBB
before: This
after:
XYZ12BBBB34567890
before: XYZ12
after:
41
Functions
• Arrays are passed by reference
#!/bin/awk -f
function moveSmallest(LIST, SIZE, temp, small, smallIndex) {
  small = LIST[1]
  smallIndex = 1   # so the swap below is a no-op if LIST[1] is already smallest
  for (i = 2; i <= SIZE; ++i) {
    if (LIST[i] < small) {
      small = LIST[i]
      smallIndex = i
    }
  }
  LIST[smallIndex] = LIST[1]
  LIST[1] = small
  return
}
END {
  array[1] = 12; array[2] = 0; array[3] = -1; array[4] = 100
  moveSmallest(array, 4)
  for (i = 1; i <= 4; ++i) {
    print array[i]
  }
}
42
Some built-in Functions
• Arithmetic Functions
• cos, exp, int, log, sin, sqrt, atan2, rand, srand
• Some useful String Functions
• index, length, split, sub, substr, tolower, toupper
• gsub(regExp, replaceWithString, inString) – globally substitutes replaceWithString for regExp in inString.
• match(string, regExp) – returns the position where regExp is found in string, or 0 if no occurrences are found.
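A short sketch of gsub and match on a made-up string. Beyond what the slide states, it relies on two standard behaviours: gsub returns the number of substitutions it made, and match sets the built-in variables RSTART and RLENGTH.

```sh
out=$(printf 'foo bar foo\n' | awk '{
  s = $0
  n = gsub(/foo/, "baz", s)       # s is modified in place; n counts replacements
  pos = match(s, /bar/)           # position of the match, also stored in RSTART
  print s
  print n
  print pos, RSTART, RLENGTH
}')
printf '%s\n' "$out"
```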
43
Passing parameters into a script
• Input is passed into an awk script by setting variables on the command line.
• Example:
– awk -f awkprog x=1 y=2 inputfile
– The variables x and y can be accessed in the main loop (not in the BEGIN section).
– The system variables ARGC and ARGV can be used to access the command line arguments
Example:
BEGIN { print "BEGIN: " n }
NR == 1 { print ARGC; print n
  for (i = 0; i < ARGC; ++i) {
    print ARGV[i]
  }
}
% awk -f param.awk n=20 testfile
BEGIN:
3
20
awk
n=20
testfile
44
An array of Environment variables
#!/bin/awk -f
BEGIN{
for (env in ENVIRON){
print env "=" ENVIRON[env]
}
print "Logname = ", ENVIRON["LOGNAME"]
}