logo

String Edit Distance 📂Algorithm

String Edit Distance

Buildup 1

Strings have the following four operations:

  1. Insertion: Insert a new character into the string.
  2. Deletion: Remove a character from the string.
  3. Replacement: Replace one character in the string with another.
  4. Transposition: Swap the positions of two characters.

Definition

The edit distance is a distance function between strings, categorized into the following types by allowing or forbidding certain editing methods:

  • (1) Hamming distance: Hamming distance is the simplest, allowing only replacements. In other words, given vectors of the same length, it merely counts the positions where they differ. Though not much to it, it’s often omitted in explanations, so it’s a good idea to remember the term itself. For example, the distance between the following two vectors is $3$. $$ x = (1,1,0,1) \\ y = (0,1,1,0) $$
  • (2) Levenshtein distance: Levenshtein distance allows insertion, deletion, and replacement. It is the most majorly used distance when thinking of an edit distance proper and is encountered in fields such as information retrieval theory and bioinformatics, among others. An algorithm for measuring Levenshtein distance is the Levenshtein algorithm. For example, “cats” and “facts” undergo the following two edits, so their Levenshtein distance is $2$.
    • (Replacement) cats → fats
    • (Insertion) fats → facts
  • (3) Jaro distance: Contrary to Levenshtein distance, Jaro distance allows only transpositions.
  • (4) Longest common subsequence distance: LCS distance allows only insertions and deletions.