Smith-Waterman Alignment: Local Sequence Alignment
Overview
Finding the alignment of the most similar parts from two base sequences is called local alignment, and the most widely used method for this is introducing the Smith-Waterman Algorithm.
There are so many possible cases in Sequence Alignment, so there is a need to calculate efficiently and quickly through Dynamic Programming.
Algorithm1
Input
Given two strings $\textbf{v}, \textbf{w}$ and Substitution Matrix $\delta$, let’s say they are displayed as follows.
$$ \textbf{v} := \left( v_{1} , \cdots , v_{m} \right) \\ \textbf{w} := \left( w_{1} , \cdots , w_{n} \right) $$
The length of $\textbf{v}$ is $m$, the length of $\textbf{w}$ is $n$. The penalty for two characters $x,y$ is shown as $\delta (x,y)$, and especially, the character representing the gap is indicated as $-$.
Step 1. Initialization
Create an empty matrix $\left( s_{i,j} \right)_{i=0, \cdots, m \\ j = 0, \cdots , n}$ and initialize the first row and first column as follows.
$$ s_{i,0} \gets 0 \\ s_{0,j} \gets 0 $$
Step 2. Dynamic Programming
If $s_{i-1,j}$ and $s_{i,j-1}$ have already been calculated, then calculate $s_{i,j}$ as follows. (Of course, $s_{i-1,j-1}$ will have been calculated.)
$$ s_{i,j} \gets \max \begin{cases} s_{i-1,j} + \delta \left( v_{i}, - \right) \\ s_{i,j-1} + \delta \left( -, w_{j} \right) \\ s_{i-1, j-1} + \delta \left( v_{i} , w_{j} \right) \\ 0 \end{cases}$$
Ouput
Let’s say the element with the highest score in the matrix is $s_{\text{max}}$. Starting from $s_{\text{max}}$, stop when you encounter 0 while choosing the element with the highest score among the left, upper-left, and top elements. Obtains the local alignment, which is most similar to a part of two strings. That score is $s_{\text{max}}$.
Example
Input
julia> s1 = "ATTAGCTCAC";
julia> s2 = "ACTATAGCGCAG";
Perform global alignment on two base sequences as above.
julia> EDNAFULL
SubstitutionMatrix{BioSymbols.DNA,Int64}:
A C M G R S V T W Y H K D B N
A 5 -4 1 -4 1 -4 -1 -4 1 -4 -1 -4 -1 -4 -2
C -4 5 1 -4 -4 1 -1 -4 -4 1 -1 -4 -4 -1 -2
M 1 1 -1 -4 -2 -2 -1 -4 -2 -2 -1 -4 -3 -3 -1
G -4 -4 -4 5 1 1 -1 -4 -4 -4 -4 1 -1 -1 -2
R 1 -4 -2 1 -1 -2 -1 -4 -2 -4 -3 -2 -1 -3 -1
S -4 1 -2 1 -2 -1 -1 -4 -4 -2 -3 -2 -3 -1 -1
V -1 -1 -1 -1 -1 -1 -1 -4 -3 -3 -2 -3 -2 -2 -1
T -4 -4 -4 -4 -4 -4 -4 5 1 1 -1 1 -1 -1 -2
W 1 -4 -2 -4 -2 -4 -3 1 -1 -2 -1 -2 -1 -3 -1
Y -4 1 -2 -4 -4 -2 -3 1 -2 -1 -1 -2 -3 -1 -1
H -1 -1 -1 -4 -3 -3 -2 -1 -1 -1 -1 -3 -2 -2 -1
K -4 -4 -4 1 -2 -2 -3 1 -2 -2 -3 -1 -1 -1 -1
D -1 -4 -3 -1 -1 -3 -2 -1 -1 -3 -2 -1 -1 -2 -1
B -4 -1 -3 -1 -3 -1 -2 -1 -3 -1 -2 -1 -2 -1 -1
N -2 -2 -1 -2 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1
(underlined values are default ones)
Let’s say the gap penalty in the Substitution Matrix is linearly $-5$ points. Then, the used characters are only four major bases, receiving $5$ points when characters match, $-4$ points when mismatched, and $-5$ as a gap penalty.
For understanding, let’s implement it simply in Excel.
Step 1. Write the query sequence as rows and the reference sequence as columns and initialize.
Step 2. Calculate the score in each cell.
Enter the function as follows at C3 (third row, third column) and copy it to all cells.
=MAX(B2+IF(C$1=$A3,5,-4), C2-5,B3-5,0)
Output. Trace back from the bottom right.
Starting from the bottom right corner, climb backward according to the following rules.
- Red ↖: The row and column do not match, so either side can be chosen.
- Orange ↖: The row and column match, so choose that base.
- Grey ← or ↑: Represents a gap, Indel. Choose the relatively higher scoring side between the top and left.
Code
As with dynamic programming, it can be implemented simply in Excel as shown in the example. Although it’s fine for study purposes, continuously reusing it is certainly unwise. The following is code that performs the same task easily and quickly in Julia.
using BioAlignments
s1 = "ATTAGCTCAC";
s2 = "ACTATAGCGCAG";
scoremodel = AffineGapScoreModel(EDNAFULL, gap_open=-5, gap_extend=0);
local_result = pairalign(LocalAlignment(), s1, s2, scoremodel)
Execution Result
julia> using BioAlignments
julia> s1 = "ATTAGCTCAC";
julia> s2 = "ACTATAGCGCAG";
julia> scoremodel = AffineGapScoreModel(EDNAFULL, gap_open=-5, gap_extend=0);
julia> local_result = pairalign(LocalAlignment(), s1, s2, scoremodel)
PairwiseAlignmentResult{Int64,String,String}:
score: 26
seq: 1 ATTAGCTCA 9
|| ||| ||
ref: 4 AT-AGCGCA 11
Del C. Jones, Pavel A. Pevzner. (2009). An Introduction to Bioinformatics Algorithms (Computational Molecular Biology): p181. ↩︎