シーケンスアラインメントとは?
定義
塩基配列間の類似度に基づいて並べることをシーケンスアライメントsequence Alignmentという。
Description
In bioinformatics, since genomes are extremely long, even digitalizing them is a huge task. Ideally, we would like to store information by reading from the upstream to the downstream of DNA like polymerase, but realistically, that is not possible. Therefore, we need to obtain many short base sequences called Reads and concatenate them. During this process, it is necessary to identify overlapping parts at the ends of these sequences, and thus Sequence Alignment is performed.
Even though this process might seem doubtful at first glance, it is surprisingly accurate. For instance, imagine that the ends of two reads match exactly in 4 places as shown in ${\color{red}AATC}$: $$ ATCGCC {\color{red}AATC} \\ {\color{red}AATC} TATTC $$ Thinking very simply, the chance of four bases coincidentally matching is one in $4^{4} = 256$ possibilities, and with a probability distribution, we can also verify how certain this match is through hypothesis testing. With just four matching bases, one could argue it is quite reliable. Of course, there are mistakes with this method, so more reads are sequenced to compensate, and the long sequence created this way is called a Consensus Sequence. In fact, NGS (Next Generation Sequence) data provides the bases at each position and the probability of those bases being accurate. Once a consensus sequence is created, it’s easier to compare with other entities, hence in the context of sequence alignment, it is also called a Reference Sequence, and the compared sequence is called a Query Sequence.
Sequence alignments can be broadly divided into Pairwise Sequence Alignment, which compares two sequences for similarity, and Multiple Sequence Alignment, which compares several sequences. Pairwise Sequence Alignment can be further divided as follows:
- Global Alignment: Finds the alignment that maximizes the number of common parts between two sequences.
- Semi-global Alignment: Finds the most similar part of the query sequence in the reference sequence.
- Local Alignment: Finds the most similar parts among the two sequences. Naively, since sequence alignment requires a tremendous amount of computation, algorithms utilizing dynamic programming have been developed. The representative algorithms for global and local alignments are, respectively, Needleman-Wunsch Alignment and Smith-Waterman Alignment.