logo

n-Gram and Jaccard Coefficient 📂Algorithm

n-Gram and Jaccard Coefficient

Definition

  • An n-gram is a contiguous sequence of n items from a given sample of text or speech.
  • The Jaccard Coefficient is a measure of how similar two sets are, ranging between 00 and 11. It can be mathematically represented as follows: JC(A,B)=ABAB=ABA+BAB JC(A,B) = {{| A \cap B|} \over {| A \cup B| }} = {{| A \cap B|} \over { |A|+ |B| -| A \cap B| }}

Example

For instance, the bigrams (2-grams) of the string ‘오마이갓’ would be ‘오마’, ‘마이’, and ‘이갓’. Using these two concepts, one can express the similarity between two given strings in concrete numbers. How to set nn properly and from what value of the Jaccard Coefficient the strings are considered similar is up to the developer.

20180412\_105813.png

The Jaccard Coefficient calculated after dividing ‘oh my god’ and ‘oh my girl’ into bigrams is 0.54545450.5454545. If 0.50.5 is considered the threshold for similarity, then ‘oh my god’ and ‘oh my girl’ can be considered similar strings.

Code

JC<-function(A,B,k=2)
{
  subA<-character()
  subB<-character()
  for(i in 1:(nchar(A)-k+1))
  {
    subA<-c(subA,substring(A,i,i+k-1))
  }
  subB<-character()
  for(i in 1:(nchar(B)-k+1))
  {
    subB<-c(subB,substring(B,i,i+k-1))
  }
  
  return(length(intersect(subA,subB))/(length(subA)+length(subB)-length(intersect(subA,subB))))
}
 
JC("oh my god",'oh my girl',2)