Smith-Waterman Alignment: Local Sequence Alignment 📂Algorithm

Smith-Waterman Alignment: Local Sequence Alignment

Overview

Finding the alignment of the most similar parts from two base sequences is called local alignment, and the most widely used method for this is introducing the Smith-Waterman Algorithm.

There are so many possible cases in Sequence Alignment, so there is a need to calculate efficiently and quickly through Dynamic Programming.

Algorithm¹

Input

Given two strings $\textbf{v}, \textbf{w}$ and Substitution Matrix $\delta$ , let’s say they are displayed as follows.

$\textbf{v} := \left( v_{1} , \cdots , v_{m} \right) \\ \textbf{w} := \left( w_{1} , \cdots , w_{n} \right)$

The length of $\textbf{v}$ is $m$ , the length of $\textbf{w}$ is $n$ . The penalty for two characters $x,y$ is shown as $\delta (x,y)$ , and especially, the character representing the gap is indicated as $-$ .

Step 1. Initialization

Create an empty matrix $\left( s_{i,j} \right)_{i=0, \cdots, m \\ j = 0, \cdots , n}$ and initialize the first row and first column as follows.

$s_{i,0} \gets 0 \\ s_{0,j} \gets 0$

Step 2. Dynamic Programming

If $s_{i-1,j}$ and $s_{i,j-1}$ have already been calculated, then calculate $s_{i,j}$ as follows. (Of course, $s_{i-1,j-1}$ will have been calculated.)

$s_{i,j} \gets \max \begin{cases} s_{i-1,j} + \delta \left( v_{i}, - \right) \\ s_{i,j-1} + \delta \left( -, w_{j} \right) \\ s_{i-1, j-1} + \delta \left( v_{i} , w_{j} \right) \\ 0 \end{cases}$

Ouput

Let’s say the element with the highest score in the matrix is $s_{\text{max}}$ . Starting from $s_{\text{max}}$ , stop when you encounter 0 while choosing the element with the highest score among the left, upper-left, and top elements. Obtains the local alignment, which is most similar to a part of two strings. That score is $s_{\text{max}}$ .

Example

Input

julia> s1 = "ATTAGCTCAC";

julia> s2 = "ACTATAGCGCAG";

Perform global alignment on two base sequences as above.

julia> EDNAFULL
SubstitutionMatrix{BioSymbols.DNA,Int64}:
     A  C  M  G  R  S  V  T  W  Y  H  K  D  B  N
  A  5 -4  1 -4  1 -4 -1 -4  1 -4 -1 -4 -1 -4 -2
  C -4  5  1 -4 -4  1 -1 -4 -4  1 -1 -4 -4 -1 -2
  M  1  1 -1 -4 -2 -2 -1 -4 -2 -2 -1 -4 -3 -3 -1
  G -4 -4 -4  5  1  1 -1 -4 -4 -4 -4  1 -1 -1 -2
  R  1 -4 -2  1 -1 -2 -1 -4 -2 -4 -3 -2 -1 -3 -1
  S -4  1 -2  1 -2 -1 -1 -4 -4 -2 -3 -2 -3 -1 -1
  V -1 -1 -1 -1 -1 -1 -1 -4 -3 -3 -2 -3 -2 -2 -1
  T -4 -4 -4 -4 -4 -4 -4  5  1  1 -1  1 -1 -1 -2
  W  1 -4 -2 -4 -2 -4 -3  1 -1 -2 -1 -2 -1 -3 -1
  Y -4  1 -2 -4 -4 -2 -3  1 -2 -1 -1 -2 -3 -1 -1
  H -1 -1 -1 -4 -3 -3 -2 -1 -1 -1 -1 -3 -2 -2 -1
  K -4 -4 -4  1 -2 -2 -3  1 -2 -2 -3 -1 -1 -1 -1
  D -1 -4 -3 -1 -1 -3 -2 -1 -1 -3 -2 -1 -1 -2 -1
  B -4 -1 -3 -1 -3 -1 -2 -1 -3 -1 -2 -1 -2 -1 -1
  N -2 -2 -1 -2 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1
(underlined values are default ones)

Let’s say the gap penalty in the Substitution Matrix is linearly $-5$ points. Then, the used characters are only four major bases, receiving $5$ points when characters match, $-4$ points when mismatched, and $-5$ as a gap penalty.

For understanding, let’s implement it simply in Excel.

Step 1. Write the query sequence as rows and the reference sequence as columns and initialize.

step1

Step 2. Calculate the score in each cell.

step2

Enter the function as follows at C3 (third row, third column) and copy it to all cells.

=MAX(B2+IF(C$1=$A3,5,-4), C2-5,B3-5,0)

Output. Trace back from the bottom right.

step3

Starting from the bottom right corner, climb backward according to the following rules.

Red ↖: The row and column do not match, so either side can be chosen.
Orange ↖: The row and column match, so choose that base.
Grey ← or ↑: Represents a gap, Indel. Choose the relatively higher scoring side between the top and left.

Code

As with dynamic programming, it can be implemented simply in Excel as shown in the example. Although it’s fine for study purposes, continuously reusing it is certainly unwise. The following is code that performs the same task easily and quickly in Julia.

using BioAlignments

s1 = "ATTAGCTCAC";
s2 = "ACTATAGCGCAG";

scoremodel = AffineGapScoreModel(EDNAFULL, gap_open=-5, gap_extend=0);
local_result = pairalign(LocalAlignment(), s1, s2, scoremodel)

Execution Result

julia> using BioAlignments

julia> s1 = "ATTAGCTCAC";

julia> s2 = "ACTATAGCGCAG";

julia> scoremodel = AffineGapScoreModel(EDNAFULL, gap_open=-5, gap_extend=0);

julia> local_result = pairalign(LocalAlignment(), s1, s2, scoremodel)
PairwiseAlignmentResult{Int64,String,String}:
  score: 26
  seq:  1 ATTAGCTCA  9
          || ||| ||
  ref:  4 AT-AGCGCA 11

Del C. Jones, Pavel A. Pevzner. (2009). An Introduction to Bioinformatics Algorithms (Computational Molecular Biology): p181. ↩︎