Bioinformatics: DNA Sequencing

Build-up

A polymer is a large molecule composed of repeating monomeric units linked together by chemical bonds.
Phosphoric Acid is a type of inorganic oxyacid, with the chemical formula $H_{3}PO_{4}$.
A monosaccharide with five carbon atoms is called Pentose.
The molecule that functions as the basic unit of genetic information is known as a Nitrogenous base or simply Base.
A nucleotide is a molecule that consists of a phosphate, a pentose, and a base, and serves as the monomeric unit of nucleic acids.
Essential for life processes, the polymer of nucleotides is called Nucleic Acid.
The nucleic acid that forms a chain structure based on the pentose ribose is called Ribonucleic Acid (RNA).
The nucleic acid that is based on the pentose deoxyribose and consists of two long strands of nucleotide polymers twisted together into a double helix is called Deoxyribonucleic Acid (DNA).
DNA and RNA are known as Genetic Material.

Comparison with Chromosomes

If we accept these explanations as definitions, then DNA and RNA are more than an abstract notion of genetic material: they are polymers that actually exist, and depending on which bases make up their nucleotides, the number of possible combinations is practically unlimited. RNA takes the form of an ordered chain, while in DNA two chains are joined by base pairs into a double helix. This means that by examining genetic material and recording the order of its bases, we can obtain genetic information.
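To make the idea of base pairs concrete, here is a minimal sketch assuming the standard Watson-Crick pairing in DNA (A with T, G with C), which the text above does not spell out. The function name `paired_strand` is hypothetical and chosen only for illustration.

```python
# Standard Watson-Crick pairing in DNA: A pairs with T, G pairs with C.
# The table maps each base to its partner on the opposite strand.
DNA_PAIR = str.maketrans("ATGC", "TACG")


def paired_strand(strand: str) -> str:
    """Return the bases that pair with `strand`, position by position."""
    return strand.upper().translate(DNA_PAIR)


print(paired_strand("GATTACA"))
```

Because each base has exactly one partner, recording the order of bases on one strand is enough to recover the other. The sketch lists the partner bases position by position and does not deal with the direction in which the second strand is conventionally read.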

Taking this information as our starting point, we now leave chemistry and biology behind. The difference from chromosomes is this: when DNA or RNA is twisted and clumped together, the result is a chromosome, whereas when its information is read and written down, the result is a nucleotide sequence. Chromosomes are physical entities, whereas nucleotide sequences are data.

Definition

The sequential arrangement of the bases of a genetic material is called a Nucleotide Sequence (also nucleic acid sequence).

Major Bases and Characters

In the context of bioinformatics, the major bases are limited to five characters $A, T, G, C, U$. These stand for adenine, thymine, guanine, cytosine, and uracil, respectively, with thymine appearing in DNA and uracil taking its place in RNA. From a computer science perspective, a nucleotide sequence encoded as information is simply a string over these five characters. It’s important to understand that what data analysts deal with is not DNA or RNA itself but its nucleotide sequence.
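As a minimal sketch of this string view, assuming plain Python strings, a nucleotide sequence can be checked against the five-character alphabet and summarized by counting its bases. The helper names `is_valid_sequence` and `base_counts` are hypothetical and do not come from any library.

```python
from collections import Counter

# The five major bases: adenine, thymine, guanine, cytosine, uracil.
MAJOR_BASES = frozenset("ATGCU")


def is_valid_sequence(seq: str) -> bool:
    """Return True if every character of seq is one of the five major bases."""
    return all(base in MAJOR_BASES for base in seq.upper())


def base_counts(seq: str) -> dict[str, int]:
    """Count how often each base occurs in seq (case-insensitive)."""
    if not is_valid_sequence(seq):
        raise ValueError("sequence contains characters outside A, T, G, C, U")
    return dict(Counter(seq.upper()))


print(is_valid_sequence("AUGC"))   # an RNA-style string
print(base_counts("ATGCGA"))       # a DNA-style string
```

Nothing here touches a molecule: the functions only ever see characters, which is exactly the sense in which the analyst works with the sequence and not with the DNA or RNA itself.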

The main difficulty in dealing with nucleotide sequences is that they are far from small. For instance, the human genome consists of roughly 3 billion base pairs, so analyzing it and producing meaningful results requires smarter approaches than naively reading it from front to back.
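To give a feel for the scale, stored as one byte per character a sequence of $n$ bases occupies about $n$ bytes, so a genome of a few billion bases takes a few gigabytes. One common way to shrink this, sketched below under the assumption that only the four DNA bases occur (no $U$ and no ambiguity codes), is to spend 2 bits per base. The code table and the names `pack`, `unpack`, and `packed_size_bytes` are hypothetical choices for illustration, not a standard format.

```python
# Hypothetical 2-bit codes for the four DNA bases (an illustrative assumption).
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASE = "ACGT"  # inverse lookup: the code value is the index of the base


def pack(seq: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed bytes."""
    return "".join(
        BASE[(data[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(length)
    )


def packed_size_bytes(n_bases: int) -> int:
    """Bytes needed to store n_bases at 2 bits per base."""
    return (n_bases + 3) // 4


packed = pack("GATTACA")
print(len(packed), unpack(packed, 7))
```

Packing only reduces storage by a factor of four. It does not by itself make analysis faster, which is why the choice of algorithm still matters for sequences of this size.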

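Moreover, the direction of a nucleotide sequence can be determined by checking which end is upstream and which is downstream (by convention, toward the 5′ end and toward the 3′ end, respectively), so there’s no need to worry about a sequence having been recorded in reverse.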