Regular expressions are a powerful tool for text processing in awk. They allow you to search for patterns in a text file and manipulate the data based on those patterns. In this article, we will explore how to use regular expressions in awk with examples.
Regular Expression Basics
A regular expression is a pattern that describes a set of strings. The following table lists some of the basic regular expression metacharacters that you can use in awk:
Metacharacter | Description |
---|---|
. | Matches any single character |
[ ] | Matches any single character listed within the brackets |
^ | Matches the beginning of the string being tested (by default, the current line) |
$ | Matches the end of the string being tested (by default, the current line) |
* | Matches zero or more occurrences of the preceding character or group |
+ | Matches one or more occurrences of the preceding character or group |
? | Matches zero or one occurrence of the preceding character or group |
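As a quick check of the table above, the following sketch tests a few of these metacharacters with awk's pattern syntax. The sample words and the shell variable name matches are made up for illustration.

```shell
# Sample words are made up for illustration.
matches=$(printf '%s\n' color colour colr coloor cat cut c.t |
awk '
    /^colou?r$/ { print "u optional: " $0 }   # ? : zero or one "u"
    /^colo+r$/  { print "o repeated: " $0 }   # + : one or more "o"
    /^c.t$/     { print "any middle: " $0 }   # . : any single character
    /^c[au]t$/  { print "a or u: " $0 }       # [ ] : one character from the set
')
printf '%s\n' "$matches"
```

Each input word is tested against every pattern, so a word such as "color" is reported by more than one rule, while "colr" is reported by none.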
Awk can apply regular expressions through the ~ match operator and through built-in functions such as match(), sub(), and gsub(). The match() function finds the first occurrence of a regular expression in a string, sub() replaces the first occurrence, and gsub() replaces every occurrence. Here are some examples:
Example 1: Matching a Regular Expression
Let’s say we have a file containing a list of email addresses, and we want to find all email addresses that end with “.com”. We can use the match() function to accomplish this task as follows:
awk '{
    # match() returns a non-zero position when the line ends in ".com"
    if (match($0, /\.com$/)) {
        print $0
    }
}' email.txt
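Here, we use the match() function to search for the regular expression /\.com$/ in each line of the file. The backslash makes the dot literal, and $ anchors the match to the end of the line, so the pattern matches any line that ends with “.com”. Because match() returns 0 when nothing matches, the if test succeeds only for matching lines, which we then print.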
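When only a yes-or-no test is needed, a bare pattern does the same job as the match() call above. match() earns its place when its side effects are needed: it sets RSTART (the 1-based position of the match) and RLENGTH (its length). A small sketch, with made-up addresses fed in through printf instead of email.txt:

```shell
# The addresses are made-up samples.
info=$(printf '%s\n' 'alice@example.com' 'bob@example.org' |
awk '
    /\.com$/ {                      # bare pattern: same test as match() above
        match($0, /@[^@]+$/)        # locate the part from "@" to the end
        print $0, RSTART, RLENGTH, substr($0, RSTART + 1)
    }
')
printf '%s\n' "$info"   # prints: alice@example.com 6 12 example.com
```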
Example 2: Replacing a Regular Expression
Let’s say we have a file containing a list of phone numbers, and we want to replace all instances of “555” with “666”. Because sub() replaces only the first match in each line, we use its companion gsub(), which replaces every match:
awk '{
    # replace every "555" in the current line
    gsub(/555/, "666", $0)
    print $0
}' phone.txt
Here, we use the gsub() function to search for the regular expression /555/ in each line of the file and replace each occurrence with “666”. We then print the modified line. Note that the pattern is not anchored, so it also matches “555” inside a longer run of digits.
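The difference between sub() and gsub() is easy to see on a line that contains the pattern more than once. A minimal sketch, using a made-up number: sub() changes only the first match, while gsub() changes every match and returns the number of replacements it made.

```shell
# The phone line is a made-up sample containing "555" twice.
line='555-555-0199'
first=$(printf '%s\n' "$line" | awk '{ sub(/555/, "666"); print }')
every=$(printf '%s\n' "$line" | awk '{ n = gsub(/555/, "666"); print n ": " $0 }')
printf '%s\n%s\n' "$first" "$every"
# prints:
# 666-555-0199
# 2: 666-666-0199
```

When the third argument is omitted, both functions operate on $0.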
Advanced Regular Expression Techniques
Beyond the basic metacharacters, more complex text processing tasks call for further techniques. Some are built into awk's extended regular expressions; others, common in Perl-style engines, are not available in every awk and need an extension or a workaround. These include:
1. Grouping:
You can group parts of a regular expression together using parentheses. This allows you to apply a quantifier or an alternation to the group as a whole. In GNU awk (gawk), groups can also be used to extract specific parts of the matched string: match() accepts an optional third argument, an array that receives the text matched by each group. This is a gawk extension and is not available in POSIX awk.
Let’s say we have a file containing a list of employee names and salaries, and we want to extract the names and salaries separately. We can use grouping to accomplish this task as follows:
gawk '{
    # m[1] and m[2] receive the text matched by the first and second group
    if (match($0, /^([[:alnum:]_]+)[[:space:]]+([0-9]+)$/, m)) {
        name = m[1]
        salary = m[2]
        print name
        print salary
    }
}' employees.txt
Here, we use grouping to match the regular expression /^([[:alnum:]_]+)[[:space:]]+([0-9]+)$/ (which matches a line containing one or more word characters, followed by one or more whitespace characters, followed by one or more digits) and extract the name and salary separately. Bracket expressions such as [0-9] and [[:space:]] are used because awk does not recognize \d, and \w and \s are gawk extensions.
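Capturing the text matched by a group is not part of POSIX awk, so a portable alternative is to validate the whole line first and then locate each piece with its own match() call. The sketch below assumes made-up records and also shows a quantifier applied to a group: (-[A-Za-z]+)* allows any number of hyphenated name parts.

```shell
# The employee records are made-up samples.
out=$(printf '%s\n' 'alice   52000' 'mary-ann 61000' 'not a record' |
awk '
    /^[A-Za-z]+(-[A-Za-z]+)*[ \t]+[0-9]+$/ {
        match($0, /^[^ \t]+/)                # leading run of non-blank characters
        name = substr($0, RSTART, RLENGTH)
        match($0, /[0-9]+$/)                 # trailing run of digits
        salary = substr($0, RSTART, RLENGTH)
        print name ":" salary
    }
')
printf '%s\n' "$out"
# prints:
# alice:52000
# mary-ann:61000
```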
2. Backreferences:
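Backreferences (i.e., \1, \2, etc.) refer to the text matched by a group, which allows you to reuse matched substrings in the replacement string. The sub() and gsub() functions do not support them; the only special character in their replacement string is &, which stands for the whole match. GNU awk (gawk) provides the gensub() function, whose replacement string may contain \\1 through \\9. gensub() returns the modified string instead of changing its target.
Let’s say we have a file containing a list of phone numbers in the format (XXX) XXX-XXXX, and we want to change the format to XXX-XXX-XXXX. We can use backreferences to accomplish this task as follows:
gawk '{
    # the third argument 1 replaces only the first match; gensub() works on $0 by default
    print gensub(/\(([0-9]{3})\) ([0-9]{3})-([0-9]{4})/, "\\1-\\2-\\3", 1)
}' phone.txt
Here, we use backreferences (i.e., \\1, \\2, and \\3) to refer to the three groups of digits matched by the regular expression /\(([0-9]{3})\) ([0-9]{3})-([0-9]{4})/ (which matches a phone number in the format (XXX) XXX-XXXX) and replace the format with XXX-XXX-XXXX. The backslashes are doubled because awk processes the string constant first, turning \\1 into \1 before gensub() sees it.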
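In POSIX awk, the only special sequence in a sub() or gsub() replacement is &, which stands for the whole matched text; numbered references to groups need gawk's gensub(). As a portable sketch of the same reformatting, assuming made-up input, we can locate the number with match() and reassemble it with substr(). The digit classes are spelled out instead of using {3}, since some older awk builds lack interval expressions.

```shell
# The phone number and the "id 42" line are made-up samples.
out=$(printf '%s\n' 'Call (212) 555-0147 today' |
awk '{
    if (match($0, /\([0-9][0-9][0-9]\) [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]/)) {
        phone = substr($0, RSTART, RLENGTH)                 # "(212) 555-0147"
        fixed = substr(phone, 2, 3) "-" substr(phone, 7)    # drop "(" and ") "
        $0 = substr($0, 1, RSTART - 1) fixed substr($0, RSTART + RLENGTH)
    }
    print
}')
printf '%s\n' "$out"      # prints: Call 212-555-0147 today

# & in the replacement stands for the whole match
tagged=$(printf '%s\n' 'id 42' | awk '{ gsub(/[0-9]+/, "[&]"); print }')
printf '%s\n' "$tagged"   # prints: id [42]
```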
3. Lookahead and Lookbehind:
Perl-style engines offer lookahead (?=) and lookbehind (?<=) to match patterns only if they are followed by or preceded by another pattern, respectively. awk's regular expressions do not support either construct, but the effect can usually be reproduced by matching the surrounding text as well and then trimming it off with substr().
Let’s say we have a file containing a list of URLs, and we want to extract only the domain names (i.e., the text between “http://” and the next “/” character). We can emulate a lookbehind to accomplish this task as follows:
awk '{
    # match the prefix together with the domain, then skip the 7-character prefix
    if (match($0, /http:\/\/[^\/]+/)) {
        print substr($0, RSTART + 7, RLENGTH - 7)
    }
}' urls.txt
Here, the regular expression /http:\/\/[^\/]+/ matches "http://" together with the characters that follow it up to the next "/" character. Since "http://" is 7 characters long, adding 7 to RSTART and subtracting 7 from RLENGTH leaves only the domain name.
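Since awk's regular expressions have no lookahead or lookbehind operators, the usual workaround is to match the surrounding context too and then trim it off with substr(). The sketch below emulates a lookahead on made-up input: it extracts a number only when it is followed by "px", without including the "px" in the result.

```shell
# The CSS-like line is a made-up sample.
out=$(printf '%s\n' 'width: 120px; height: 50%' |
awk '{
    if (match($0, /[0-9]+px/)) {
        # keep the digits, drop the two-character "px" suffix
        print substr($0, RSTART, RLENGTH - 2)
    }
}')
printf '%s\n' "$out"   # prints: 120
```

The "50" is skipped because it is followed by "%" and not by "px".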
4. Negated character classes:
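A character class that begins with ^, such as [^@], is negated: it matches any single character that is not listed. Let's say we have a file containing a list of email addresses, and we want to extract only the addresses that belong to a specific domain (e.g., example.com). We can use negated character classes to accomplish this task as follows: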
awk '{
    # one or more non-"@" characters, then the literal domain
    if (match($0, /^[^@]+@example\.com$/)) {
        print $0
    }
}' emails.txt
Here, we use a negated character class ([^@]+) to match the username, that is, one or more characters that are not "@", and then match the literal string "@example.com" to ensure that the address belongs to the specified domain. The ^ and $ anchors make the pattern cover the whole line.
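The same negated class can also pull out the username itself. A minimal sketch, assuming made-up addresses: the pattern selects the lines for the domain, and a second match() with /^[^@]+/ locates everything before the first "@".

```shell
# The addresses are made-up samples.
users=$(printf '%s\n' 'alice@example.com' 'bob@example.org' 'carol@example.com' |
awk '
    /^[^@]+@example\.com$/ {
        match($0, /^[^@]+/)               # everything before the first "@"
        print substr($0, RSTART, RLENGTH)
    }
')
printf '%s\n' "$users"
# prints:
# alice
# carol
```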
5. Alternation:
The | operator matches either the expression on its left or the one on its right. Let's say we have a file containing a list of phone numbers, and we want to extract only the numbers that are either in the format "(XXX) XXX-XXXX" or "XXX-XXX-XXXX". We can use alternation to accomplish this task as follows:
awk '{
    # either (XXX) XXX-XXXX or XXX-XXX-XXXX
    if (match($0, /\([0-9]{3}\) [0-9]{3}-[0-9]{4}|[0-9]{3}-[0-9]{3}-[0-9]{4}/)) {
        print substr($0, RSTART, RLENGTH)
    }
}' phones.txt
Here, we use alternation (|) to match either the regular expression /\([0-9]{3}\) [0-9]{3}-[0-9]{4}/ (which matches a phone number in the format (XXX) XXX-XXXX) or the regular expression /[0-9]{3}-[0-9]{3}-[0-9]{4}/ (which matches a phone number in the format XXX-XXX-XXXX). Digits are written as [0-9] because awk does not recognize \d. The {3} and {4} interval expressions are part of POSIX awk, although some older implementations do not support them.
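Alternation has the lowest precedence of all regular expression operators, so anchors bind to a single alternative unless the alternatives are grouped. The sketch below, using made-up words, contrasts /^cat|dog$/ (starts with "cat" or ends with "dog") with /^(cat|dog)$/ (the whole line is "cat" or "dog").

```shell
# The words are made-up samples.
out=$(printf '%s\n' cat dog catalog hotdog |
awk '
    /^cat|dog$/   { loose = loose " " $0 }    # means (^cat) or (dog$)
    /^(cat|dog)$/ { strict = strict " " $0 }  # whole line is cat or dog
    END { print "loose:" loose; print "strict:" strict }
')
printf '%s\n' "$out"
# prints:
# loose: cat dog catalog hotdog
# strict: cat dog
```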
Conclusion
Regular expressions are a powerful tool for text processing in awk. They allow you to search for patterns in a text file, and manipulate the data based on those patterns. By mastering regular expressions in awk, you can become more effective and efficient in your text processing tasks, and accomplish complex data manipulation with ease.