Introduction to the Julia Natural Language Processing Package TextAnalysis.jl
Overview
This section introduces TextAnalysis.jl, a natural language processing package for Julia [1]. In practical natural language processing, it is often used together with other packages, such as Snowball.jl for stemming [2].
Code
julia> using TextAnalysis
julia> crps = Corpus(StringDocument.([
"To be or not to be",
"It should be a question to be",
"It is not a question"
]))
A Corpus with 3 documents:
* 3 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
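A corpus is primarily defined using the Corpus constructor, and its documents can be of type StringDocument, FileDocument, TokenDocument, NGramDocument, etc. Note that a newly constructed corpus reports 0 tokens in its lexicon: the lexicon is not built until it is explicitly updated.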
Vocabulary Update update_lexicon!
julia> update_lexicon!(crps)
julia> sort(crps.lexicon)
OrderedCollections.OrderedDict{String, Int64} with 10 entries:
"It" => 2
"To" => 1
"a" => 2
"be" => 4
"is" => 1
"not" => 2
"or" => 1
"question" => 2
"should" => 1
"to" => 2
The lexicon of the corpus is built using the update_lexicon! function. Checking the lexicon property afterwards shows how often each word occurs across the whole corpus.
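To make the counting concrete, here is a minimal sketch in plain Julia of what such a lexicon contains. It is not how TextAnalysis.jl is implemented: the helper count_tokens is hypothetical, and splitting on whitespace is a simplifying assumption that happens to be enough for these three sentences.

```julia
docs = [
    "To be or not to be",
    "It should be a question to be",
    "It is not a question",
]

# Hypothetical helper (not part of TextAnalysis.jl): count how often each
# whitespace-separated token occurs across all documents.
function count_tokens(docs)
    counts = Dict{String,Int}()
    for doc in docs, token in split(doc)
        key = String(token)
        counts[key] = get(counts, key, 0) + 1
    end
    return counts
end

word_counts = count_tokens(docs)

# Print the entries in sorted key order, similar to sort(crps.lexicon).
for word in sort(collect(keys(word_counts)))
    println(rpad(word, 10), word_counts[word])
end
```

Because the count is case-sensitive, "To" and "to" are separate entries here, just as in the lexicon above.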
Document Term Matrix DocumentTermMatrix, dtm
julia> M = DocumentTermMatrix(crps)
A 3 X 10 DocumentTermMatrix
julia> dtm(M)
3×10 SparseArrays.SparseMatrixCSC{Int64, Int64} with 16 stored entries:
⋅ 1 ⋅ 2 ⋅ 1 1 ⋅ ⋅ 1
1 ⋅ 1 2 ⋅ ⋅ ⋅ 1 1 1
1 ⋅ 1 ⋅ 1 1 ⋅ 1 ⋅ ⋅
Using the DocumentTermMatrix constructor together with the dtm function, the frequency of each word in each document can be checked. The result is a sparse matrix whose rows are documents and whose columns are vocabulary terms. The terms are sorted in lexicographic order, so words starting with an uppercase letter come before those starting with a lowercase letter.
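The structure of this matrix can be reproduced by hand with the SparseArrays standard library. The following is only a sketch under the assumption of whitespace tokenization, and all variable names are illustrative; it does not use or mirror the internals of TextAnalysis.jl.

```julia
using SparseArrays  # standard library

docs = [
    "To be or not to be",
    "It should be a question to be",
    "It is not a question",
]

# Tokenize by whitespace (a simplification) and build the sorted vocabulary.
tokens = [String.(split(doc)) for doc in docs]
vocab = sort(unique(vcat(tokens...)))  # uppercase sorts before lowercase

# Map each term to its column index.
col = Dict(term => j for (j, term) in enumerate(vocab))

# Collect one (row, column, 1) triplet per token occurrence.
rows = Int[]
cols = Int[]
vals = Int[]
for (i, toks) in enumerate(tokens), t in toks
    push!(rows, i)
    push!(cols, col[t])
    push!(vals, 1)
end

# sparse() adds up duplicate (row, column) pairs, which yields the counts.
D = sparse(rows, cols, vals, length(docs), length(vocab))
```

Here D[1, col["be"]] is 2, since "be" appears twice in the first document, and D has 16 stored entries, matching the output of dtm(M) above.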
julia> dtm(M)'dtm(M)
10×10 SparseArrays.SparseMatrixCSC{Int64, Int64} with 72 stored entries:
2 ⋅ 2 2 1 1 ⋅ 2 1 1
⋅ 1 ⋅ 2 ⋅ 1 1 ⋅ ⋅ 1
2 ⋅ 2 2 1 1 ⋅ 2 1 1
2 2 2 8 ⋅ 2 2 2 2 4
1 ⋅ 1 ⋅ 1 1 ⋅ 1 ⋅ ⋅
1 1 1 2 1 2 1 1 ⋅ 1
⋅ 1 ⋅ 2 ⋅ 1 1 ⋅ ⋅ 1
2 ⋅ 2 2 1 1 ⋅ 2 1 1
1 ⋅ 1 2 ⋅ ⋅ ⋅ 1 1 1
1 1 1 4 ⋅ 1 1 1 1 2
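Simply computing $M^{T} M$ from the document term matrix $M$ yields a term-by-term matrix whose $(i, j)$ entry sums, over all documents, the product of the counts of terms $i$ and $j$. It resembles a covariance matrix, except that the columns have not been centered or rescaled.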
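The relationship to the covariance matrix can be checked numerically. The sketch below copies the counts from the dtm(M) output above into a dense matrix D (an illustrative name) and uses the Statistics standard library; it only demonstrates the textbook identity between the Gram matrix and the sample covariance, and is not a feature of TextAnalysis.jl.

```julia
using Statistics  # standard library: mean, cov

# Rows are documents; columns are the terms
# It, To, a, be, is, not, or, question, should, to.
D = [0 1 0 2 0 1 1 0 0 1
     1 0 1 2 0 0 0 1 1 1
     1 0 1 0 1 1 0 1 0 0]

# Gram matrix: G[i, j] is the sum over documents d of D[d, i] * D[d, j].
G = D' * D

n = size(D, 1)          # number of documents
mu = mean(D, dims=1)    # 1×10 row of per-term mean counts

# Subtracting the mean term and dividing by n - 1 turns the Gram matrix
# into the sample covariance matrix of the term counts.
C = (G - n * (mu' * mu)) / (n - 1)

println(C ≈ cov(D))
```

The diagonal entry G[4, 4] is 8, which is the sum of squared counts of "be" (2² + 2² + 0²), in agreement with the matrix shown above.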
Remove Case remove_case!
julia> remove_case!(crps)
julia> update_lexicon!(crps)
julia> sort(crps.lexicon)
OrderedCollections.OrderedDict{String, Int64} with 9 entries:
"a" => 2
"be" => 4
"is" => 1
"it" => 2
"not" => 2
"or" => 1
"question" => 2
"should" => 1
"to" => 3
In many natural language processing tasks, it is unnecessary to distinguish between uppercase and lowercase. The remove_case! function removes the case distinction [3], and rebuilding the lexicon with the update_lexicon! function shows that words containing uppercase letters have been merged with their lowercase counterparts: the lexicon shrinks from 10 to 9 entries, and "to" now has a count of 3.
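The effect on the counts can be sketched in plain Julia by folding the keys of the case-sensitive lexicon to lowercase and adding up the counts that collide. The helper fold_case is hypothetical and works on a plain Dict; remove_case! itself operates on the documents of the corpus, so this only illustrates the resulting numbers.

```julia
# Case-sensitive counts, copied from the lexicon shown earlier.
cased = Dict(
    "It" => 2, "To" => 1, "a" => 2, "be" => 4, "is" => 1,
    "not" => 2, "or" => 1, "question" => 2, "should" => 1, "to" => 2,
)

# Hypothetical helper (not part of TextAnalysis.jl): merge entries whose
# keys differ only by case, summing their counts.
function fold_case(counts)
    folded = Dict{String,Int}()
    for (word, n) in counts
        key = lowercase(word)
        folded[key] = get(folded, key, 0) + n
    end
    return folded
end

folded = fold_case(cased)

for word in sort(collect(keys(folded)))
    println(rpad(word, 10), folded[word])
end
```

The entries "To" and "to" collapse into a single "to" with count 3, and "It" becomes "it", leaving 9 entries.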
Full Code
using TextAnalysis
crps = Corpus(StringDocument.([
"To be or not to be",
"It should be a question to be",
"It is not a question"
]))
update_lexicon!(crps)
sort(crps.lexicon)
M = DocumentTermMatrix(crps)
dtm(M)
dtm(M)'dtm(M)
remove_case!(crps)
update_lexicon!(crps)
sort(crps.lexicon)
Beyond the functions shown here, TextAnalysis.jl implements many other features that are essential for natural language processing.
Environment
- OS: Windows
- julia: v1.10.0