Introduction to the Julia Natural Language Processing Package TextAnalysis.jl

Overview

This section introduces TextAnalysis.jl, a natural language processing package for Julia [1]. In practical natural language processing work, other packages are often used alongside it, such as Snowball.jl for stemming [2].

Code

julia> using TextAnalysis

julia> crps = Corpus(StringDocument.([
             "To be or not to be",
             "It should be a question to be",
             "It is not a question"
       ]))
A Corpus with 3 documents:
 * 3 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens

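A corpus is created with the Corpus constructor from a collection of documents. Each document can be any of the types listed in the output above: StringDocument, FileDocument, TokenDocument, or NGramDocument. Note that the lexicon and the index are both empty immediately after construction.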

Updating the Lexicon: update_lexicon!

julia> update_lexicon!(crps)

julia> sort(crps.lexicon)
OrderedCollections.OrderedDict{String, Int64} with 10 entries:
  "It"       => 2
  "To"       => 1
  "a"        => 2
  "be"       => 4
  "is"       => 1
  "not"      => 2
  "or"       => 1
  "question" => 2
  "should"   => 1
  "to"       => 2

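Calling update_lexicon! builds the lexicon of the corpus. Inspecting the lexicon field then shows how many times each word occurs across the whole corpus; for example, "be" appears 4 times in total. Because case is still significant at this point, "To" and "to" are counted as separate entries.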
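To make these counts concrete, the following is a minimal sketch in plain Julia that reproduces them without TextAnalysis.jl. It assumes that tokens are simply whitespace-separated words, which matches this small example but is not necessarily how the package tokenizes text; `count_tokens` is a hypothetical helper name introduced only for illustration.

```julia
docs = [
    "To be or not to be",
    "It should be a question to be",
    "It is not a question",
]

# Hypothetical helper: count every whitespace-separated token across all documents.
function count_tokens(docs)
    counts = Dict{String, Int}()
    for doc in docs, tok in split(doc)
        word = String(tok)                      # SubString -> String
        counts[word] = get(counts, word, 0) + 1
    end
    return counts
end

lexicon = count_tokens(docs)

# Print the entries in sorted order, like sort(crps.lexicon) does.
for word in sort(collect(keys(lexicon)))
    println(rpad(word, 10), " => ", lexicon[word])
end
```

The printed counts agree with the lexicon shown above, including the separate entries for "To" (1) and "to" (2).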

Document-Term Matrix: DocumentTermMatrix, dtm

julia> M = DocumentTermMatrix(crps)
A 3 X 10 DocumentTermMatrix

julia> dtm(M)
3×10 SparseArrays.SparseMatrixCSC{Int64, Int64} with 16 stored entries:
 ⋅  1  ⋅  2  ⋅  1  1  ⋅  ⋅  1
 1  ⋅  1  2  ⋅  ⋅  ⋅  1  1  1
 1  ⋅  1  ⋅  1  1  ⋅  1  ⋅  ⋅

Using DocumentTermMatrix together with dtm, you can check how often each word occurs in each document. The result is a sparse matrix whose rows are documents and whose columns are the terms of the vocabulary. The columns are sorted lexicographically, so terms starting with an uppercase letter ("It", "To") come before the lowercase ones.
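The layout of this matrix can be reproduced by hand. The sketch below is plain Julia, not the package's implementation: it assumes whitespace tokenization and uses a dense matrix for simplicity, whereas dtm returns a sparse one. The names `tokens`, `vocab`, and `col` are illustrative.

```julia
docs = [
    "To be or not to be",
    "It should be a question to be",
    "It is not a question",
]

# Tokenize each document on whitespace (an assumption of this sketch).
tokens = [String.(split(d)) for d in docs]

# Sorted vocabulary: Julia compares strings by code point,
# so uppercase-initial terms sort before lowercase ones.
vocab = sort(unique(reduce(vcat, tokens)))

# Map each term to its column index.
col = Dict(term => j for (j, term) in enumerate(vocab))

# Rows are documents, columns are terms, entries are counts.
M = zeros(Int, length(docs), length(vocab))
for (i, toks) in enumerate(tokens), t in toks
    M[i, col[t]] += 1
end

println(vocab)
display(M)
```

The column order and the counts match the dtm(M) output above, with zeros in place of the unstored entries.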

julia> dtm(M)'dtm(M)
10×10 SparseArrays.SparseMatrixCSC{Int64, Int64} with 72 stored entries:
 2  ⋅  2  2  1  1  ⋅  2  1  1
 ⋅  1  ⋅  2  ⋅  1  1  ⋅  ⋅  1
 2  ⋅  2  2  1  1  ⋅  2  1  1
 2  2  2  8  ⋅  2  2  2  2  4
 1  ⋅  1  ⋅  1  1  ⋅  1  ⋅  ⋅
 1  1  1  2  1  2  1  1  ⋅  1
 ⋅  1  ⋅  2  ⋅  1  1  ⋅  ⋅  1
 2  ⋅  2  2  1  1  ⋅  2  1  1
 1  ⋅  1  2  ⋅  ⋅  ⋅  1  1  1
 1  1  1  4  ⋅  1  1  1  1  2

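Multiplying the document-term matrix $M$ by its transpose, $M^{T} M$, gives a symmetric term-by-term matrix that resembles a covariance matrix without mean-centering: entry $(i, j)$ is the sum, over all documents, of the count of term $i$ times the count of term $j$.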
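As a quick check of how individual entries arise, the sketch below copies the counts from the dtm(M) output into a dense matrix and recomputes two entries of the product. It uses only the LinearAlgebra standard library; the variable names `G`, `be`, and `to` are illustrative.

```julia
using LinearAlgebra

# Column labels and counts copied from the dtm(M) output above.
terms = ["It", "To", "a", "be", "is", "not", "or", "question", "should", "to"]
M = [0 1 0 2 0 1 1 0 0 1;
     1 0 1 2 0 0 0 1 1 1;
     1 0 1 0 1 1 0 1 0 0]

G = M' * M                        # 10×10 term-by-term matrix

be = findfirst(==("be"), terms)   # column index of "be"
to = findfirst(==("to"), terms)   # column index of "to"

println(G[be, be])                # 2*2 + 2*2 + 0*0 = 8
println(G[be, to])                # 2*1 + 2*1 + 0*0 = 4
println(issymmetric(G))           # the product of a matrix with its transpose is symmetric
```

Both values agree with the fourth row of the product shown above (8 on the diagonal, 4 in the last column).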

Removing Case: remove_case!

julia> remove_case!(crps)

julia> update_lexicon!(crps)

julia> sort(crps.lexicon)
OrderedCollections.OrderedDict{String, Int64} with 9 entries:
  "a"        => 2
  "be"       => 4
  "is"       => 1
  "it"       => 2
  "not"      => 2
  "or"       => 1
  "question" => 2
  "should"   => 1
  "to"       => 3

In many natural language processing tasks there is no need to distinguish uppercase from lowercase. The remove_case! function removes this distinction by converting the text to lowercase [3]. After calling update_lexicon! again, entries that differed only by case are merged: "To" and "to" become a single "to" with a count of 3, and the lexicon shrinks from 10 entries to 9.
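The effect of case folding on the counts can be sketched in plain Julia with the built-in `lowercase` function. This is only an illustration under the assumption of whitespace tokenization, not the package's implementation; `count_tokens` is a hypothetical helper.

```julia
docs = [
    "To be or not to be",
    "It should be a question to be",
    "It is not a question",
]

# Hypothetical helper: count whitespace-separated tokens across documents.
function count_tokens(docs)
    counts = Dict{String, Int}()
    for doc in docs, tok in split(doc)
        word = String(tok)
        counts[word] = get(counts, word, 0) + 1
    end
    return counts
end

before = count_tokens(docs)              # case-sensitive counts
after  = count_tokens(lowercase.(docs))  # counts after case folding

println(length(before), " entries before, ", length(after), " after")
println("to => ", after["to"])
```

Only "To" and "to" actually merge here; "It" is simply renamed to "it" and keeps its count of 2.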

Full Code

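using TextAnalysis

# Build a corpus from three short documents
crps = Corpus(StringDocument.([
      "To be or not to be",
      "It should be a question to be",
      "It is not a question"
]))

# Build the lexicon and inspect the word frequencies
update_lexicon!(crps)
sort(crps.lexicon)

# Document-term matrix and the term-by-term product M'M
M = DocumentTermMatrix(crps)
dtm(M)
dtm(M)'dtm(M)

# Convert the corpus to lowercase, then rebuild the lexicon
remove_case!(crps)
update_lexicon!(crps)
sort(crps.lexicon)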

Beyond what is shown here, TextAnalysis.jl implements many other features that are essential for natural language processing.

Environment

  • OS: Windows
  • julia: v1.10.0