n-Gram and Jaccard Coefficient 📂Algorithm

n-Gram and Jaccard Coefficient

Definition

An n-gram is a contiguous sequence of n items from a given sample of text or speech.
The Jaccard Coefficient is a measure of how similar two sets are, ranging between $0$ and $1$ . It can be mathematically represented as follows: $JC(A,B) = {{| A \cap B|} \over {| A \cup B| }} = {{| A \cap B|} \over { |A|+ |B| -| A \cap B| }}$

Example

For instance, the bigrams (2-grams) of the string ‘오마이갓’ would be ‘오마’, ‘마이’, and ‘이갓’. Using these two concepts, one can express the similarity between two given strings in concrete numbers. How to set $n$ properly and from what value of the Jaccard Coefficient the strings are considered similar is up to the developer.

$20180412\_105813.png$

The Jaccard Coefficient calculated after dividing ‘oh my god’ and ‘oh my girl’ into bigrams is $0.5454545$ . If $0.5$ is considered the threshold for similarity, then ‘oh my god’ and ‘oh my girl’ can be considered similar strings.

Code

JC<-function(A,B,k=2)
{
  subA<-character()
  subB<-character()
  for(i in 1:(nchar(A)-k+1))
  {
    subA<-c(subA,substring(A,i,i+k-1))
  }
  subB<-character()
  for(i in 1:(nchar(B)-k+1))
  {
    subB<-c(subB,substring(B,i,i+k-1))
  }
  
  return(length(intersect(subA,subB))/(length(subA)+length(subB)-length(intersect(subA,subB))))
}
 
JC("oh my god",'oh my girl',2)