n-Gram and Jaccard Coefficient
Definition
- An n-gram is a contiguous sequence of n items from a given sample of text or speech.
- The Jaccard Coefficient is a measure of how similar two sets are, ranging between $0$ and $1$. It can be mathematically represented as follows: $$ JC(A,B) = {{| A \cap B|} \over {| A \cup B| }} = {{| A \cap B|} \over { |A|+ |B| -| A \cap B| }} $$
Example
For instance, the bigrams (2-grams) of the string ‘오마이갓’ are ‘오마’, ‘마이’, and ‘이갓’. Combining the two concepts, that is, splitting each string into its set of n-grams and computing the Jaccard Coefficient of the two sets, expresses the similarity between two strings as a concrete number. The choice of $n$, and of the Jaccard Coefficient threshold above which two strings count as similar, is up to the developer.
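The n-gram split described above can be sketched in R as follows. This is only an illustration: the helper name `ngrams` is hypothetical, and it assumes `nchar` counts characters rather than bytes (the default, in a UTF-8 session) so that a string such as ‘오마이갓’ is treated as four characters.

```r
# Hypothetical helper: return the character n-grams of a single string.
ngrams <- function(s, n = 2) {
  len <- nchar(s)
  # A string shorter than n contains no n-gram at all.
  if (len < n) return(character(0))
  # Start positions of every window of width n.
  starts <- seq_len(len - n + 1)
  # substring() is vectorized over the start and end positions.
  substring(s, starts, starts + n - 1)
}

ngrams("오마이갓", 2)   # bigrams of the example string
ngrams("oh my god", 3)  # trigrams; spaces count as characters
```

A string of length $L$ yields $L - n + 1$ n-grams, which is why larger values of $n$ leave fewer, more specific items to compare.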
Splitting ‘oh my god’ and ‘oh my girl’ into bigrams (spaces included) and computing the Jaccard Coefficient gives approximately $0.5454545$. With $0.5$ as the similarity threshold, ‘oh my god’ and ‘oh my girl’ are therefore considered similar strings.
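As a check on this value, the bigrams can be counted by hand. Counting the space as a character, ‘oh my god’ yields 8 distinct bigrams and ‘oh my girl’ yields 9, and the two sets share 6 of them (‘oh’, ‘h ’, ‘ m’, ‘my’, ‘y ’, and ‘ g’). Writing $A$ and $B$ for the two bigram sets and substituting into the definition:

$$ JC(A,B) = {{| A \cap B|} \over { |A|+ |B| -| A \cap B| }} = {{6} \over {8 + 9 - 6}} = {{6} \over {11}} \approx 0.5454545 $$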
Code
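# Jaccard Coefficient of two strings, based on their sets of character k-grams.
# Returns NaN when neither string is long enough to contain a k-gram (0/0).
JC <- function(A, B, k = 2)
{
  # Collect the k-grams of one string; a string shorter than k has none.
  kgrams <- function(s)
  {
    sub <- character()
    if (nchar(s) < k) return(sub)
    for (i in 1:(nchar(s) - k + 1))
    {
      sub <- c(sub, substring(s, i, i + k - 1))
    }
    # Keep each k-gram once so that the lengths below are set sizes.
    return(unique(sub))
  }
  subA <- kgrams(A)
  subB <- kgrams(B)
  common <- length(intersect(subA, subB))
  # |A ∩ B| / (|A| + |B| - |A ∩ B|)
  return(common / (length(subA) + length(subB) - common))
}
JC("oh my god", 'oh my girl', 2)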