Perl hashes

What is a hash?

Previously, in discussing arrays, we introduced the map function, which applies a function you specify to each member of an array and returns a list of the mapped values. Examples of map usage are the following:

  • map { length } @words;         # map words to their length values
  • map { scalar reverse } @words; # reverse each word in the array
  • map { sin } @numbers;          # get the sine value of each number
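The three calls above can be tried out with a short script. This is a minimal sketch; the contents of @words and @numbers are made up for illustration.

```perl
use strict;
use warnings;

# Sample data, invented for this illustration.
my @words   = ('gene', 'protein', 'DNA');
my @numbers = (0, 1);

my @lengths  = map { length } @words;          # length of each word
my @reversed = map { scalar reverse } @words;  # each word spelled backwards
my @sines    = map { sin } @numbers;           # sine of each number

print "@lengths\n";    # prints "4 7 3"
print "@reversed\n";   # prints "eneg nietorp AND"
print "$sines[0]\n";   # prints "0", since sin(0) is 0
```

In each block the function works on $_, which map sets to the current member of the list.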

These kinds of mapping use a fixed function that determines the mapped value solely from the input: the length, the reversal and the sine value are all easily calculated from the input to the mapping function. However, there are situations where the mapping needs to be spelled out explicitly, such as mapping each DNA base to its complement, i.e., A -> T, T -> A, etc. In those cases, you could try to put the mapping logic inside the block of Perl code:

  • map {
      if ($_ eq 'A') {
        'T';
      } elsif ($_ eq 'T') {
        'A';
      } elsif ($_ eq 'G') {
        'C';
      } elsif ($_ eq 'C') {
        'G';
      }
    } @DNA_bases;

If a DNA sequence has been split into the @DNA_bases array, one base per member, then this map command will give you the complement of each base. There are easy ways to split a string into an array of single characters for use with the map command above. We will cover those in more detail later, in the string and pattern matching lectures. It suffices here to give one example; the following command splits a DNA sequence into an array of individual bases:

  • my @DNA_bases = split //, "ACGTTTACGT";
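Putting the split and the if-elsif map together gives a complete script. This is a minimal sketch that assumes the sequence contains only the letters A, C, G and T; the @complement array name is introduced here for illustration.

```perl
use strict;
use warnings;

# Split the sequence into single bases, as in the command above.
my @DNA_bases = split //, "ACGTTTACGT";

# Map each base to its complement with an explicit if-elsif chain.
my @complement = map {
    if    ($_ eq 'A') { 'T' }
    elsif ($_ eq 'T') { 'A' }
    elsif ($_ eq 'G') { 'C' }
    elsif ($_ eq 'C') { 'G' }
} @DNA_bases;

print join('', @complement), "\n";   # prints "TGCAAATGCA"
```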

In the map example above, you have to spell out the mapping explicitly; there is no simple function like length, reverse or sin that does it automatically. Although you can use an if-elsif-elsif-... block of code to map, things get really complicated when you want to map DNA codons to amino acids: that would require about 60 elsif branches!

How to use a hash

To provide a more general framework for explicit mapping, you can use Perl's hash data type. At a glance, a hash is just like a lookup table. For example, the following hash spells out the mapping from DNA codons to one-letter amino acid codes:

  • my %DNA_code = (
    'GCT' => 'A', 'GCC' => 'A', 'GCA' => 'A', 'GCG' => 'A', 'TTA' => 'L',
    'TTG' => 'L', 'CTT' => 'L', 'CTC' => 'L', 'CTA' => 'L', 'CTG' => 'L',
    'CGT' => 'R', 'CGC' => 'R', 'CGA' => 'R', 'CGG' => 'R', 'AGA' => 'R',
    'AGG' => 'R', 'AAA' => 'K', 'AAG' => 'K', 'AAT' => 'N', 'AAC' => 'N',
    'ATG' => 'M', 'GAT' => 'D', 'GAC' => 'D', 'TTT' => 'F', 'TTC' => 'F',
    'TGT' => 'C', 'TGC' => 'C', 'CCT' => 'P', 'CCC' => 'P', 'CCA' => 'P',
    'CCG' => 'P', 'CAA' => 'Q', 'CAG' => 'Q', 'TCT' => 'S', 'TCC' => 'S',
    'TCA' => 'S', 'TCG' => 'S', 'AGT' => 'S', 'AGC' => 'S', 'GAA' => 'E',
    'GAG' => 'E', 'ACT' => 'T', 'ACC' => 'T', 'ACA' => 'T', 'ACG' => 'T',
    'GGT' => 'G', 'GGC' => 'G', 'GGA' => 'G', 'GGG' => 'G', 'TGG' => 'W',
    'CAT' => 'H', 'CAC' => 'H', 'TAT' => 'Y', 'TAC' => 'Y', 'ATT' => 'I',
    'ATC' => 'I', 'ATA' => 'I', 'GTT' => 'V', 'GTC' => 'V', 'GTA' => 'V',
    'GTG' => 'V',);

Here we declared a hash called %DNA_code and initialized its content with pairs that map a particular DNA codon to its amino acid code. For example, we know GCT is translated into alanine, so we write 'GCT' => 'A', where A stands for alanine. All other translations are set up similarly. (The three stop codons TAA, TAG and TGA are not included in this table.) Once the %DNA_code hash has been established, you can easily use it to map DNA sequences to protein sequences. Again, assuming some string manipulation has already put the three-base codons of a DNA sequence into an array @DNA_codons, instead of writing this big if-elsif block:

  • map {
      if ($_ eq 'GCT') {
        'A';
      } elsif ($_ eq 'GCC') {
        'A';
      } elsif ($_ eq 'GCA') {
        'A';
      } elsif ($_ eq 'GCG') {
        'A';
      } ...
      ... # 57 more to continue !!
    } @DNA_codons;

You may now map with a simple hash lookup:

  • map { $DNA_code{$_} } @DNA_codons;

This achieves the same thing with much simpler syntax. Looking up a hash element is very similar to looking up an array element. With an array, you get the 5th element by typing $array[4] (the index is 4 because the first element has index 0). Here, to look up the amino acid code for the codon 'GCT', you type $DNA_code{'GCT'}, which gives you 'A' if that key exists in the hash and undef if it does not. In other words, a hash uses strings instead of numeric indices to look up its members. That is why hashes are sometimes also called associative arrays: they are arrays that associate strings with other values.
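A complete translation run might look like the following. This is a minimal sketch: the hash holds only four entries of the full table, the sequence is made up, and cutting the string with unpack and the "(A3)*" template is just one possible way to obtain @DNA_codons, not necessarily the one used later in these lectures.

```perl
use strict;
use warnings;

# Only a few entries of the full %DNA_code table, to keep the sketch short.
my %DNA_code = ('ATG' => 'M', 'GCT' => 'A', 'AAA' => 'K', 'TGG' => 'W');

# Cut a made-up sequence into three-base codons.
my @DNA_codons = unpack '(A3)*', 'ATGGCTAAATGG';

# Translate each codon by a hash lookup and join the results.
my $protein = join '', map { $DNA_code{$_} } @DNA_codons;
print "$protein\n";    # prints "MAKW"

# Looking up a key that is not in the hash yields undef, not an error.
my $missing = $DNA_code{'XYZ'};
print "XYZ has no entry\n" unless defined $missing;
```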

Other uses of hashes

The typical use of a hash is for mapping, as we have seen above. As another example, suppose you have constructed a hash that maps student IDs to names:

  • my %students = (
    # IDs are quoted: an unquoted number with a leading 0 is read as octal
    '012345678' => 'John Smart',
    '012345679' => 'Mary Beautiful',
    # ....
    );

Then, whenever you need to look up a name given a student ID, just refer to it as $students{$id}. In this case the hash functions like a small database. Hashes can also be used to count how often strings occur. For example, to count codon usage frequency, you can use the following code:

  • my %DNA_codon_counters;
    for (@DNA_codons) {
      $DNA_codon_counters{$_}++;  # a codon seen for the first time starts counting from 0
    }

After the for loop has finished, the content of the %DNA_codon_counters hash will be similar to the %DNA_code table above, except that it maps each codon not to an amino acid code but to the number of times that codon occurs in @DNA_codons:

  • %DNA_codon_counters = (
    'GCT' => GCT's occurrence count in number
    'GCC' => GCC's occurrence count in number
    ...
    );

Now you know how they determined the most frequently used English words, right?
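Finding the most frequent items is then a matter of sorting the keys by their counts. The sketch below uses a made-up codon list; the @by_count array name and the alphabetical tie-break are choices made here for illustration.

```perl
use strict;
use warnings;

# A made-up codon list for illustration.
my @DNA_codons = qw(GCT AAA GCT ATG GCT AAA);

# Count how many times each codon occurs.
my %DNA_codon_counters;
$DNA_codon_counters{$_}++ for @DNA_codons;

# Sort by count, highest first; break ties alphabetically.
my @by_count = sort {
    $DNA_codon_counters{$b} <=> $DNA_codon_counters{$a} or $a cmp $b
} keys %DNA_codon_counters;

print "$_: $DNA_codon_counters{$_}\n" for @by_count;
# prints:
#   GCT: 3
#   AAA: 2
#   ATG: 1
```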

Going through the hash content

To go over all hash elements and do something with the stored values, use the each function:

  • while (my ($key, $value) = each %some_hash) {
      print "$key ===> $value\n";
      $reverse_hash{$value} = $key;
    }

The each function goes over the entries of the hash one pair at a time and assigns each pair to the $key and $value variables. The pairs come back in no particular order. In the above, we also show how to create a reverse hash by assigning the opposite ($value, $key) pairs to another hash, %reverse_hash. However, you should know that if the values are not unique, %reverse_hash will not be the exact opposite of %some_hash. The reason is that keys must be unique in a hash, while values are not required to be unique, so a later pair overwrites an earlier one that has the same value.
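The loss of entries on reversal is easy to see with a small hash in which two keys share a value. This is a minimal sketch with made-up contents.

```perl
use strict;
use warnings;

# Two codons share the value 'K', so the reversed hash can keep only one of them.
my %some_hash = ('AAA' => 'K', 'AAG' => 'K', 'ATG' => 'M');

my %reverse_hash;
while (my ($key, $value) = each %some_hash) {
    $reverse_hash{$value} = $key;
}

# Three entries go in, two come out.
print scalar(keys %some_hash), " -> ", scalar(keys %reverse_hash), "\n";  # prints "3 -> 2"

# Whether 'K' now maps to AAA or AAG depends on the order in which each
# returned the pairs, which Perl does not guarantee.
print "K maps back to $reverse_hash{'K'}\n";
```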

If you just want the keys, use the keys function instead:

  • for (keys %some_hash) {
      print "$_ occurs $some_hash{$_} times\n";
    }

The usage here is slightly different from the above because the keys function returns all keys of a hash at once, while the each function returns just one pair each time it is called. Similarly, to get just the values of a hash, you can use the values function.
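Both functions can be tried on a small hash. This is a minimal sketch with made-up counts; sorting the keys is a choice made here so that the output is repeatable, since keys returns them in no particular order.

```perl
use strict;
use warnings;

# Made-up counts for illustration.
my %some_hash = ('GCT' => 3, 'AAA' => 2, 'ATG' => 1);

# Visit the keys in alphabetical order.
for (sort keys %some_hash) {
    print "$_ occurs $some_hash{$_} times\n";
}

# values returns just the stored values; here they are added up.
my $total = 0;
$total += $_ for values %some_hash;
print "total: $total\n";   # prints "total: 6"
```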

Checking the existence of a hash entry and deleting entries

Sometimes the value stored in a hash does not matter; the mere existence of the mapping serves the purpose. In this case, the hash functions as a mathematical set whose members are the keys of the hash. For example, to find out whether the codon GTT is valid, i.e., to check whether it has an entry in the %DNA_code table, use the exists function:

  • if (exists $DNA_code{'GTT'}) {
      print "GTT is a valid DNA codon\n";
    }

To delete an entry from the hash, use the delete function:

  • delete $DNA_code{'GTT'}; # afterward GTT is no longer mapped to V in %DNA_code
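The set-like use of a hash, together with exists and delete, can be sketched as follows. The %seen hash and its three members are made up for illustration; storing 1 as every value is a common convention, not a requirement.

```perl
use strict;
use warnings;

# A hash used as a set: only the keys matter, so every value is simply 1.
my %seen = map { $_ => 1 } qw(GTT GTC GTA);

print "GTT is in the set\n"     if     exists $seen{'GTT'};
print "GUU is not in the set\n" unless exists $seen{'GUU'};

# delete removes the entry and returns the value that was stored there.
my $removed = delete $seen{'GTT'};
print "removed value: $removed\n";             # prints "removed value: 1"

print "GTT is gone\n" unless exists $seen{'GTT'};
print scalar(keys %seen), " entries left\n";   # prints "2 entries left"
```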

Last modified July 13, 2007. All rights reserved.