Sequence Alignment Scores and Gap Penalty
Definitions
Let’s assume we are given a reference sequence and a query sequence. A Sequence Alignment Score is a number that quantifies how well the two sequences match when they are aligned. It is calculated from weights assigned to the following:
- Match: The number of positions at which the two aligned sequences carry the same character.
- Mismatch: The number of positions at which the two aligned sequences carry different characters.
Example
For instance, consider the two nucleotide sequences shown above. There are many ways to align them, so we need an objective basis for deciding which alignment is best. The two sequences in this example have different lengths, so one would run past the end of the other; gaps are introduced to make the lengths match.
If we want to quantify how well they match, it is reasonable to compute a score from the number of matches and mismatches. There are 6 matches, marked in green, and 2 mismatches, marked in red. Let’s consider two scoring systems A and B. System A gives 2 points for a match and -1 point for a mismatch, whereas System B gives 1 point for a match and -2 points for a mismatch. Thus, System A would result in a score of $2\cdot 6 - 1 \cdot 2= 10$, and System B would result in a score of $1\cdot 6 - 2 \cdot 2= 2$. The two systems evaluate the same alignment from different perspectives: A puts more weight on similarity, B on difference. From now on, let’s assign 1 point for a match and -1 point for a mismatch, which gives the above alignment a score of $6-2=4$. While the bonuses and penalties for matches and mismatches can be chosen arbitrarily, widely used score tables known as substitution matrices are available. They can be applied to protein sequences as well as nucleotide sequences, and they give each pair of characters its own score according to how similar the two characters are.
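The arithmetic above can be sketched in a few lines of Python. The function name and the two strings are hypothetical: the strings are not the sequences from the figure, only a made-up ungapped pair chosen to have 6 matches and 2 mismatches.

```python
def match_mismatch_score(ref, query, match=1, mismatch=-1):
    """Score an ungapped, equal-length alignment position by position."""
    if len(ref) != len(query):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(a == b for a, b in zip(ref, query))
    mismatches = len(ref) - matches
    return match * matches + mismatch * mismatches


# Hypothetical pair (not the figure's sequences): 6 matches, 2 mismatches.
ref = "ACGTACGT"
query = "ACGAACCT"

print(match_mismatch_score(ref, query, match=2, mismatch=-1))  # System A: 10
print(match_mismatch_score(ref, query, match=1, mismatch=-2))  # System B: 2
print(match_mismatch_score(ref, query))                        # +1/-1: 4
```

Keeping the weights as parameters makes it easy to compare scoring systems on the same alignment.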
Gaps introduced to even out the lengths also represent differences between the sequences, so they are penalized as well. These penalties are known as gap penalties.
Gap Penalties
- Gap: The number of blank positions inserted to make the lengths match.
- Constant Gap Penalty: Penalizes the presence of gaps.
- Linear Gap Penalty: Penalizes according to the number of gaps.
- Affine Gap Penalty: Penalizes both the presence of gaps and their lengths.
- Open (Gap Penalty): Counted once for each run of consecutive gaps, at the position where the run starts.
- Extension (Gap Penalty): Counted once for each gap that continues a run after its opening position.
Example
If a constant gap penalty of -10 is given, then the score for the alignment below would be $4-10 = -6$. With a linear gap penalty of -1, the penalty is proportional to the number of gaps, so with 5 gaps the score would be $4-1\cdot 5 = -1$.
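A minimal sketch of these two calculations, assuming `-` marks a gap. The function names and the aligned strings are hypothetical (the strings are not the figure's alignment, only a made-up one with 5 gaps). The constant penalty is charged once if any gap is present, following the example above; some references instead charge it once per run of gaps.

```python
GAP = "-"  # assumed gap symbol


def count_gaps(ref, query):
    """Total number of gap positions across both aligned sequences."""
    return ref.count(GAP) + query.count(GAP)


def constant_gap_score(base, n_gaps, penalty=-10):
    """Charge the penalty once if the alignment contains any gap."""
    return base + (penalty if n_gaps > 0 else 0)


def linear_gap_score(base, n_gaps, penalty=-1):
    """Charge the penalty once per gap position."""
    return base + penalty * n_gaps


# Hypothetical alignment: 6 matches, 2 mismatches, 5 gaps.
ref = "ACGTTACGTGGAT"
query = "ACG--AAGC---T"

n_gaps = count_gaps(ref, query)
print(n_gaps)                         # 5
print(constant_gap_score(4, n_gaps))  # 4 - 10 = -6
print(linear_gap_score(4, n_gaps))    # 4 - 1*5 = -1
```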
The Affine Gap Penalty is the most widely used method. Open refers to the start of a run of consecutive gaps, and Extension to each further gap that continues that run. Suppose we assign a penalty of -2 for Open and -1 for Extension. The 5 gaps form two runs, so Open is counted 2 times and Extension the remaining $5-2=3$ times. Starting from the match/mismatch score of 4 used above, the final score is $4-2\cdot 2 - 1 \cdot 3 = -3$.
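The Open and Extension counts can be derived from the lengths of the gap runs. The sketch below assumes `-` marks a gap and that the first gap of a run is charged Open while each later gap in the same run is charged Extension, as in the example. The function names and the string are hypothetical.

```python
from itertools import groupby


def gap_run_lengths(aligned, gap="-"):
    """Lengths of each run of consecutive gap symbols in one aligned sequence."""
    return [sum(1 for _ in run) for ch, run in groupby(aligned) if ch == gap]


def affine_gap_penalty(run_lengths, open_penalty=-2, extend_penalty=-1):
    """Open is charged once per run, Extension once per further gap in it."""
    return sum(open_penalty + extend_penalty * (k - 1) for k in run_lengths)


# Hypothetical aligned query with two gap runs, 5 gaps in total.
query = "ACG--AAGC---T"

runs = gap_run_lengths(query)
print(runs)                         # [2, 3]
print(len(runs))                    # Open count: 2
print(sum(runs) - len(runs))        # Extension count: 3
print(affine_gap_penalty(runs))     # -(2*2 + 1*3) = -7
```

The total gap penalty is then added to whatever match/mismatch score is in use.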
The rationale behind the names of the gap penalties is easy to see from the following diagram. Constant applies a fixed penalty for the mere presence of a gap, Linear applies a penalty in proportion to the number of gaps, and Affine adds a 'translation', that is, a constant offset, to the linear penalty, which fits the way the affine penalty line in the diagram is a shifted version of the linear one.
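As a rough summary, the three penalties can be written as functions of the length $k \ge 1$ of a single run of consecutive gaps. The symbols $c$, $\ell$, $o$, and $e$ are notation assumed here for illustration, and the affine form follows the Open/Extension counting used in the example above:

$$
g_{\text{constant}}(k) = c, \qquad g_{\text{linear}}(k) = \ell \cdot k, \qquad g_{\text{affine}}(k) = o + e \cdot (k-1)
$$

With $o=-2$ and $e=-1$, runs of length 2 and 3 cost $-3$ and $-4$, for a total of $-7$. When $o = e$, the affine form reduces to the linear one.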
https://en.wikipedia.org/wiki/Gap_penalty