Perl string manipulations
Perl's most powerful features are its large collection of string manipulation operators and functions. Perl would not be as popular as it is in bioinformatics were it not for these flexible string-handling capabilities. This class note cannot cover all of them, but the following are the most commonly used ones; they are useful in most applications and worth remembering:
String concatenation
To concatenate two strings, use the dot operator (.):
- $a . $b;
- $c = $a . $b;
- $a = $a . $b;
- $a .= $b;
The first expression concatenates $a and $b, but the result is immediately lost unless it is saved to a third string $c, as in the second case. If $b is meant to be appended to the end of $a, the .= operator is more convenient. As with other assignments in Perl, an assignment written as $a = $a op expr, where op stands for a binary operator and expr stands for the rest of the statement, can be shortened by moving the op in front of the equals sign: $a op= expr. The string concatenation operator . is just one of many operators that can be shortened this way.
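The shorthand can be tried directly. The following is only a minimal sketch; the variable names and the sample sequences are made up for illustration.

```perl
use strict;
use warnings;

my $seq  = "ATG";
my $tail = "CCC";

my $joined = $seq . $tail;   # $seq itself is not changed by this line
$seq .= $tail;               # same as $seq = $seq . $tail

my $count = 10;
$count += 5;                 # same as $count = $count + 5

print "$joined $seq $count\n";
```

The names $seq and $tail are used instead of $a and $b because Perl's sort function treats $a and $b specially, so other names are safer in longer programs.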
Substring extraction
The counterpart of string concatenation is substring extraction. To extract the substring at a certain location inside a string, use the substr function:
- $second_char = substr($a, 1, 1);
- $last_char = substr($a, -1, 1);
- $last_three_char = substr($a, -3);
The first argument to the substr function is the source string, the second is the start position of the substring in the source string (counted from 0, so position 1 is the second character), and the third is the length of the substring to extract. The second argument can be negative, in which case the start position is counted from the end of the source string. The third argument can be omitted, in which case the substring runs to the end of the source string. As with any expression, if the value returned by substr is not saved to a variable, it is lost right after the statement. The source string is not changed by the extractions above. However, a particularly interesting feature of Perl is that substr can also be assigned to, so in addition to extracting a substring it can replace one:
- substr($a, 1, 1) = 'b'; # change the second character to b
- substr($a, -1) = 'abc'; # replace the last character with abc (the string grows by two characters)
- substr($a, 1, 0) = 'abc'; # insert abc in front of the second character
Note that most Perl functions cannot be assigned to; substr is one of the few exceptions. This feature is uncommon in other programming languages.
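To see extraction and assignment side by side, here is a small sketch. The sample string and the variable names are assumptions chosen for illustration; the comments trace the string after each assignment.

```perl
use strict;
use warnings;

my $dna = "ACGTACGT";

my $second     = substr($dna, 1, 1);   # "C"
my $last_three = substr($dna, -3);     # "CGT"

my $copy = $dna;                       # work on a copy, $dna stays intact
substr($copy, 1, 1) = 'g';             # ACGTACGT   -> AgGTACGT
substr($copy, -1)   = 'nnn';           # AgGTACGT   -> AgGTACGnnn
substr($copy, 1, 0) = 'xx';            # AgGTACGnnn -> AxxgGTACGnnn

print "$second $last_three $copy\n";
```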
Substring search
In order to provide the second argument to substr, usually you need to locate the substring to be extracted or replaced first. The index function does the job:
- $loc1 = index($string, "abc");
- $loc2 = index($string, "abc", $loc1+1);
- print "not found" if $loc2<0;
The index function takes two arguments: the source string to search and the substring to locate inside it. It can optionally take a third argument giving the position at which the search starts. Thus, you can find every occurrence of a substring by repeatedly passing a third argument that is one more than the last location found. When index finds no further occurrence, it returns -1. Note that index does not interpret the substring as a regular expression (see the next section), so it is efficient for locating a fixed substring but cannot locate patterns, which require the pattern matching methods below.
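The idea of restarting the search one position past the last hit can be sketched as a loop. The sample string and the @positions array are assumptions made for this illustration.

```perl
use strict;
use warnings;

my $string = "abcXabcYabc";
my @positions;

# Keep searching until index reports -1 (no further occurrence).
my $loc = index($string, "abc");
while ($loc >= 0) {
    push @positions, $loc;
    $loc = index($string, "abc", $loc + 1);   # restart just past the last hit
}

print "found at: @positions\n";   # found at: 0 4 8
```

Restarting at $loc + 1 rather than past the end of the match means overlapping occurrences would also be reported.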
Regular expression
A regular expression is a way to write a pattern that describes a set of substrings. The number of strings that can match a pattern is usually large, so you describe them with a regular expression instead of listing every possibility. If only one fixed substring can match, the index function is usually the simpler and more efficient choice. The following are some basic syntax rules of regular expressions:
- Any character except the following special ones stands for itself. Thus abc matches 'abc', and xyz matches 'xyz'.
- The character . matches any single character (by default, any character except a newline). To match only the . character itself, put an escape \ in front of it, so \. will match only '.', but . will match any single character. To match the escape character itself, type two of them: \\.
- If instead of matching any character, you just want to match a subset of characters, put all of them into brackets [ ], thus [abc] will match 'a', 'b', or 'c'. It is also possible to shorten the listing if characters in a set are consecutive, so [a-z] will match any lowercase letter, [0-9] will match any single digit, etc. A character set can be negated by the special ^ character, thus [^0-9] will match any character that is not a digit, and [^a-f] will match anything but 'a' through 'f'. Again, if you just want to match the special symbols themselves, put an escape in front of them, e.g., \[, \^ and \].
- Everything above matches just a single character. The power of regular expressions lies in their ability to match multiple characters with some meta symbols. The * will match 0 or more of the previous symbol, the + will match 1 or more of the previous symbol, and ? will match 0 or 1 of the previous symbol. For example, a* will match any number of a's, including none at all (the empty string), a+ will match 1 or more a's, and a? will match zero or one a. A more useful example is matching integers, which can be written [0-9]+. To match real numbers, you need to write [0-9]+\.?[0-9]*. Note that the decimal point and the fractional digits can be omitted, which is why we use ? and * instead of +.
- If you want to combine two regular expressions together, just write them consecutively. If you want to use either one of the two regular expressions, use the | meta symbol. Thus, a|b will match a or b, which is equivalent to [ab], and a+|b+ will match any string of a's or b's. The second case cannot be expressed using character subset because [ab]+ does not mean the same thing as a+|b+.
- Finally, regular expressions can be grouped together with parentheses to change the order of their interpretation. For example, a(b|c)d will match 'abd' or 'acd'. Without the parentheses, it would match 'ab' or 'cd'.
The rules above are simple, but it takes some experience to apply them successfully to the actual substrings you wish to match. There is no better way to learn this than to write some regular expressions and see whether they match the substrings you have in mind. The following are some examples:
- [A-Z][a-z]* will match words whose first character is capitalized
- [A-Za-z_][A-Za-z0-9_]* will match legal Perl variable names (without the leading $, @ or %)
- [+-]?[0-9]+\.?[0-9]*([eE][+-]?[0-9]+)? will match numbers in scientific notation
- [acgtACGT]+ will match all DNA strings
- ^> will match the > symbol only at the beginning of a string
- a$ will match the a letter only at the end of a string
The last two examples above introduce two more special symbols. When ^ is not used inside a character set to negate it, it stands for the beginning of the string. Thus, ^> will match '>' only when it is the first character of the string. Similarly, $ inside a regular expression means the end of the string, so a$ will match 'a' only when it is the last character of the string. These are called anchors. Another commonly used anchor is \b, which stands for a word boundary. In addition, Perl provides predefined character sets for some commonly used patterns: \d stands for a digit and, for ASCII text, is equivalent to [0-9]; \w stands for a word character, that is, a letter, a digit, or the underscore; and so on. The scientific number pattern above can therefore be rewritten as:
- [+-]?\d+\.?\d*([eE][+-]?\d+)?
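A quick way to gain the experience mentioned above is to run a pattern against a handful of candidate strings. The sketch below does this for the scientific number pattern. It wraps the pattern in the ^ and $ anchors so that the whole string must match, and it stores the pattern with the qr// operator, which is not covered in this note; the candidate strings are made up for illustration.

```perl
use strict;
use warnings;

# ^ and $ force the pattern to cover the whole string, not just part of it.
my $number = qr/^[+-]?\d+\.?\d*([eE][+-]?\d+)?$/;

my @matched;
for my $candidate ("42", "-3.14", "6.02e23", "1e-9", "abc", "1.2.3") {
    push @matched, $candidate if $candidate =~ $number;
}

print "@matched\n";   # 42 -3.14 6.02e23 1e-9
```

Without the anchors, "1.2.3" would also count as a match, because the pattern would be satisfied by the leading "1.2" alone.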
Pattern matching
Regular expressions are used in several Perl statements, most commonly in pattern matching. To match a regular expression pattern against $string, use the binding operator =~ combined with the pattern matching operator / /:
- $string =~ /\w+/; # match alphanumeric words in $string
- $string =~ /\d+/; # match numbers in $string
The pattern matching operator / / does not alter the source $string. Instead, it just returns a true or false value indicating whether the pattern was found in $string:
- if ($string =~ /\d+/) {
print "there are numbers in $string\n";
}
Sometimes you want to know not only whether the pattern occurs in a string, but also what it actually matched. In that case, put parentheses around the parts of the pattern you are interested in; if the match succeeds, the substrings they matched are assigned to the special variables $1, $2, ...:
- if ($string =~ /(\d+)\s+(\d+)\s+(\d+)/) {
print "first three matched numbers are $1, $2, $3 in $string\n";
}
Note that all three numbers above must be found for the whole pattern to match, so $1, $2 and $3 are all defined when the if condition is true. Inside the regular expression itself, the same remembered substrings are available as \1, \2, \3, etc. So, to check whether the same digit occurs twice in $string, you can do this:
- if ($string =~ /(\d).*\1/) {
print "$1 occurred at least twice in $string\n";
}
You cannot use $1 inside the pattern to refer to the group being matched, because Perl interpolates variables into the pattern before matching starts, so $1 there would hold the value left by an earlier successful match. Use \1 instead.
Pattern substitution
In addition to matching a pattern, you can replace the matched substring with a new string using the substitution operator. In this case, just write the substitution string after the pattern to match and replace:
- $string =~ s/\d+/0/; # replace a number with zero
- $string =~ s:/:\\:; # replace the forward slash with backward slash
Unlike the pattern matching operator, the substitution operator does change $string if a match is found. The second example shows that you do not have to use / to separate the pattern from the replacement; almost any punctuation character placed right after s can serve as the delimiter. Since the second example replaces the forward slash itself, using / as the delimiter would be cumbersome and would require extra escape characters:
- $string =~ s/\//\\/; # this is the same but much harder to read
For pattern matching, you can also choose the delimiter by writing the m operator explicitly, e.g., m"/" will match the forward slash symbol. Naturally, the replacement string may (and often does) refer to the substrings just matched. Inside the replacement, write them as $1, $2, etc.; the \1 form is also accepted there but is discouraged and draws a warning when warnings are enabled. For example, the following adds parentheses around the first matched number in $string:
- $string =~ s/(\d+)/($1)/;
The parentheses in the replacement string have no special meaning, so they are simply added around the matched number.
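Both ideas, an alternative delimiter and a remembered substring in the replacement, can be tried in a few lines. This is a sketch only; the sample path and text are assumptions made for illustration.

```perl
use strict;
use warnings;

my $path = "data/reads.txt";
$path =~ s:/:\\:;            # the first / becomes a backslash

my $text = "lane 7 of 12";
$text =~ s/(\d+)/($1)/;      # only the first number is wrapped: "lane (7) of 12"

print "$path\n$text\n";
```

In the replacement part the parentheses are literal characters, while $1 is replaced by whatever the group in the pattern matched.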
Modifiers to pattern matching and substitution
You can add some suffix modifiers to Perl pattern matching or substitution operators to tell them more precisely what you intend to do:
- /g tells Perl to match every occurrence of the pattern, thus the following prints all numbers in $string
while ($string =~ /(\d+)/g) {
print "$1\n";
}
- $string =~ s/\d+/0/g; # replace all numbers in $string with zero
- /i tells Perl to ignore cases, thus
$string =~ /abc/i; # matches AbC, abC, Abc, etc.
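- /m tells Perl to treat the string as multiple lines, so that ^ and $ also match at the start and end of every line inside it, thus
"a\na\na" =~ /a$/m will match the a before the first newline; without /m, only the last a in the string matches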
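The effect of the modifiers is easiest to see by counting matches. In the sketch below, a match with /g in list context returns every matched substring, so the size of each array is the number of matches; the sample strings are made up for illustration.

```perl
use strict;
use warnings;

my $text = "a\nb\na";

# Without /m, $ matches only at the very end of the string
# (or just before a final newline).
my @plain = $text =~ /a$/g;

# With /m, $ also matches just before every embedded newline.
my @multi = $text =~ /a$/mg;

# /i ignores case; combined with /g every spelling is found.
my @hits = "abc ABC aBc" =~ /abc/ig;

printf "%d %d %d\n", scalar(@plain), scalar(@multi), scalar(@hits);   # 1 2 3
```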
Greedy or non-greedy
Perl's default behavior is to match as many characters as possible when you make use of the * and + meta symbols. This may be confusing and surprising to new users of Perl:
- my $s = "a123b123b";
$s =~ /(a.*b)/;
print "$1\n";
You may be surprised to see that the whole string is matched. Your intention may have been to match just 'a123b', but .* can match any character, including b. It first consumes everything up to the end of the string and then backs off one character at a time until the b in the pattern can match, which happens at the last b of the source string. There are several fixes to this problem:
- $s =~ /(a\d*b)/; # explicitly say that you want to match numbers only between a and b
- $s =~ /(a[^b]*b)/; # do not allow match to b in the * pattern
- $s =~ /(a.*?b)/; # tell Perl your * should not be greedy and should stop when it can
Not every one of the three solutions applies in every situation. The first says that you know there are only digits between a and b, so the first b cannot be consumed by the * pattern and must match the b you specified. The second is similar, but allows anything except b to match the * pattern. The third tells Perl to match only as many characters as needed to reach the first occurrence of b, i.e., not to be greedy. In general, whenever you use + or * patterns that can match an arbitrary number of characters, you need to be careful about the greediness of the match. Also note that the question mark after * in the third solution is not the question mark that means zero or one of the previous symbol; because it follows a quantifier rather than an ordinary symbol, Perl knows you mean to make the * non-greedy.
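The greedy match and the three fixes can be compared on the same string. The sketch below assigns each match in list context, which returns the captured group directly; that idiom and the variable names are assumptions of this example, not something introduced earlier in the note.

```perl
use strict;
use warnings;

my $s = "a123b123b";

# In list context a successful match returns the captured groups.
my ($greedy)  = $s =~ /(a.*b)/;      # a123b123b
my ($digits)  = $s =~ /(a\d*b)/;     # a123b
my ($negated) = $s =~ /(a[^b]*b)/;   # a123b
my ($lazy)    = $s =~ /(a.*?b)/;     # a123b

print "$greedy $digits $negated $lazy\n";
```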
The split function
There are other uses of regular expression patterns. The split function we mentioned previously actually takes a pattern that determines where to divide a line into words. The special pattern " " splits on runs of one or more whitespace characters and also skips any leading whitespace; apart from that last point, it behaves like the pattern /\s+/:
- my @words = split " ", $string; # special " " means one or more whitespace characters, leading whitespace ignored
- my @words = split /\s+/, $string; # same separator, but leading whitespace produces an empty first field
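The two forms give different results only when the string starts with whitespace. A minimal sketch, using a made-up line with two leading spaces:

```perl
use strict;
use warnings;

my $line = "  alpha  beta gamma";

my @special = split " ", $line;      # leading whitespace is skipped
my @pattern = split /\s+/, $line;    # leading whitespace yields an empty first field

print scalar(@special), " ", scalar(@pattern), "\n";   # 3 4
```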
The join function
Although it is not a pattern matching function, join is the counterpart of split: it connects all the members of an array with a fixed string:
- my $string = join " ", @words; # this time the " " is just what it is, one space character
One might sometimes need to exchange data with Microsoft Excel, do some calculations in Perl, and then return the data to Excel. For example, suppose column 3 of an Excel spreadsheet lists people's names in 'last, first' order, and you want to replace that with the more common 'first last' order without the comma. You could use Perl to help you:
- First save your Excel file in text format using the tab character as the field separator
- Use the following Perl code to do the work; remember field 3 has index 2 in Perl
while (<>) {
my @fields = split /\t/, $_; # now all Excel fields are separated in @fields
$fields[2] =~ s/(\w+),\s*(\w+)/$2 $1/; # now the name is reversed and the comma is gone
print join "\t", @fields; # put back the tab characters
}
- Then import the text file back to Excel.
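The same steps can be tried without Excel by running them on a few in-memory rows. The rows below are hypothetical stand-ins for the exported file, and the @rows and @output arrays are names chosen for this sketch.

```perl
use strict;
use warnings;

# Hypothetical rows standing in for the tab-separated export.
my @rows = (
    "1\tlab A\tSmith, John\n",
    "2\tlab B\tDoe,Jane\n",
);

my @output;
for my $row (@rows) {
    my @fields = split /\t/, $row;
    $fields[2] =~ s/(\w+),\s*(\w+)/$2 $1/;   # 'last, first' -> 'first last'
    push @output, join("\t", @fields);       # the trailing newline is still on the last field
}

print @output;
```

Because \w+ matches only letters, digits and underscores, names containing hyphens, spaces or apostrophes would need a more permissive pattern.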
There is much more to Perl's string manipulation than this short class note can cover. However, if you master at least the material above, you should be able to take most strings apart and manipulate them successfully in Perl.