Substitution Matrix in Sequence Alignment
Definition
A substitution matrix assigns a score to every pair of residues and serves as the criterion for matches and mismatches when scoring a sequence alignment.
Example
using BioAlignments
EDNAFULL
BLOSUM45
PAM30
Let’s dive straight into the examples. Julia has a package called BioAlignments that makes it easy to load the desired substitution matrix. Loading EDNAFULL, which is frequently used for DNA analysis, or the BLOSUM (BLOcks SUbstitution Matrix) and PAM (Point Accepted Mutation) matrices, which are used for protein sequences, as in the snippet above displays the corresponding score table.
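As a rough sketch of how these matrices might be used once loaded, the snippet below looks up individual scores and passes a matrix to a pairwise alignment. It assumes that BioSequences is installed alongside BioAlignments and supplies the residue symbols (DNA_A, AA_I, and so on) and the aa"..." string macro. The two peptide sequences and the gap penalties are made-up illustrative values, not recommendations.

```julia
using BioSequences    # assumed to provide DNA_A, AA_I, ... and the aa"..." macro
using BioAlignments

# Look up single entries: the score for aligning one residue against another.
EDNAFULL[DNA_A, DNA_A]    # identical nucleotides
EDNAFULL[DNA_A, DNA_C]    # different nucleotides
BLOSUM45[AA_I, AA_L]      # isoleucine vs. leucine (chemically similar)
BLOSUM45[AA_I, AA_D]      # isoleucine vs. aspartate (chemically dissimilar)

# Use a substitution matrix inside a scoring model for pairwise alignment.
# The gap penalties here are arbitrary example values.
model = AffineGapScoreModel(BLOSUM45, gap_open=-10, gap_extend=-1)
result = pairalign(GlobalAlignment(), aa"MKVLIT", aa"MKILVT", model)
score(result)             # total alignment score under BLOSUM45
```

Comparing the looked-up entries is a quick way to see that a matrix distinguishes between "mild" and "severe" mismatches rather than applying one flat penalty.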
Not all matches are treated equally, nor are all mismatches. This reflects frequency-based insight into which substitutions occur more often. An analogy from information retrieval is typo correction based on the keyboard layout. Consider the nonsensical word “juloa”: since i and o are next to each other on the QWERTY keyboard, it is reasonable to give a low mismatch penalty between i and o, which suggests that juloa might be a typo for julia. Similarly, julea could also be a typo for julia, but since i and e are farther apart on the keyboard, julea is less likely than juloa to be an accidental slip and should receive a larger mismatch penalty.
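The keyboard analogy can be sketched in plain Julia without any packages. The scoring rule here is a hypothetical one chosen only for illustration: a match scores +2, and a mismatch is penalized by the Manhattan distance between the two keys on an idealized QWERTY grid. The names QWERTY_ROWS, KEY_POS, typo_score, and word_score are invented for this sketch.

```julia
# Idealized QWERTY layout: three rows of letters, ignoring the physical stagger.
const QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

# Map each letter to its (row, column) position on the grid.
const KEY_POS = Dict(c => (r, col)
                     for (r, row) in enumerate(QWERTY_ROWS)
                     for (col, c) in enumerate(row))

# Hypothetical substitution score: +2 for a match; for a mismatch,
# the penalty grows with the distance between the two keys.
function typo_score(a::Char, b::Char)
    a == b && return 2
    (r1, c1), (r2, c2) = KEY_POS[a], KEY_POS[b]
    return -(abs(r1 - r2) + abs(c1 - c2))
end

# Score two equal-length words position by position (no gaps).
word_score(x, y) = sum(typo_score(a, b) for (a, b) in zip(x, y))

word_score("juloa", "julia")   # 7: o and i are adjacent, so the penalty is only -1
word_score("julea", "julia")   # 3: e and i are five keys apart, so the penalty is -5
```

Under this toy rule juloa scores higher against julia than julea does, mirroring how a substitution matrix rewards plausible substitutions and punishes implausible ones.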