Introduction to TextAnalysis.jl, a Natural Language Processing Package for Julia

Overview

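This post gives a brief introduction to TextAnalysis.jl, a natural language processing package for Julia¹. For practical natural language processing, packages such as Snowball.jl, which provides stemming, are used alongside it².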

Code

julia> using TextAnalysis

julia> crps = Corpus(StringDocument.([
             "To be or not to be",
             "It should be a question to be",
             "It is not a question"
       ]))
A Corpus with 3 documents:
 * 3 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens

A corpus is basically defined through the `Corpus` constructor, with its documents supplied as `StringDocument`, `FileDocument`, `TokenDocument`, `NGramDocument`, and so on.
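As a rough illustration of the document types named above, the sketch below builds one document of each in-memory kind and places them in a single corpus. It assumes that `TokenDocument` and `NGramDocument` accept a plain string and that `Corpus` accepts a vector of mixed document types. `FileDocument` is left out because it needs a file on disk.

```julia
using TextAnalysis

sd = StringDocument("To be or not to be")            # keeps the raw text
td = TokenDocument("It should be a question to be")  # stores the text as tokens
nd = NGramDocument("It is not a question")           # stores n-gram counts

# Assumption: Corpus accepts a vector that mixes document types.
crps = Corpus([sd, td, nd])
```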

Updating the lexicon `update_lexicon!`

julia> update_lexicon!(crps)

julia> sort(crps.lexicon)
OrderedCollections.OrderedDict{String, Int64} with 10 entries:
  "It"       => 2
  "To"       => 1
  "a"        => 2
  "be"       => 4
  "is"       => 1
  "not"      => 2
  "or"       => 1
  "question" => 2
  "should"   => 1
  "to"       => 2

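A newly constructed corpus has an empty lexicon (note the 0 tokens reported when it was created), and calling `update_lexicon!` fills it in. Inspecting the `lexicon` property then shows the frequency of each term across the whole corpus.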
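To make the meaning of these counts concrete, here is a minimal plain-Julia sketch that reproduces them by splitting each sentence on whitespace. It is only an illustration of what is being counted, not how TextAnalysis.jl implements the lexicon, and whitespace splitting is an assumption that happens to suffice for these punctuation-free sentences.

```julia
docs = [
    "To be or not to be",
    "It should be a question to be",
    "It is not a question",
]

# Count every whitespace-separated token across all documents.
counts = Dict{String,Int}()
for doc in docs, token in split(doc)
    counts[token] = get(counts, token, 0) + 1
end

# Print the terms in sorted order with their counts.
for term in sort(collect(keys(counts)))
    println(rpad(term, 10), counts[term])
end
```

Because the counting is case-sensitive, "To" and "to" are tallied as separate terms, just as in the lexicon shown above.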

Document-term matrix `DocumentTermMatrix`, `dtm`

julia> M = DocumentTermMatrix(crps)
A 3 X 10 DocumentTermMatrix

julia> dtm(M)
3×10 SparseArrays.SparseMatrixCSC{Int64, Int64} with 16 stored entries:
 ⋅  1  ⋅  2  ⋅  1  1  ⋅  ⋅  1
 1  ⋅  1  2  ⋅  ⋅  ⋅  1  1  1
 1  ⋅  1  ⋅  1  1  ⋅  1  ⋅  ⋅

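The two functions `DocumentTermMatrix` and `dtm` show how many times each term occurs in each document. Rows are documents and columns are terms, and the result is returned as a sparse matrix. The terms are sorted in lexicographic order, with uppercase letters before lowercase ones.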
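The layout described above can be sketched by hand in plain Julia. The names `vocab`, `col`, and `counts_matrix` are hypothetical, and the code uses a dense matrix and whitespace tokenization for simplicity. It is not the package's implementation.

```julia
docs = [
    "To be or not to be",
    "It should be a question to be",
    "It is not a question",
]
tokenized = [split(doc) for doc in docs]

# Sorted vocabulary. Julia compares strings by code point,
# so "It" and "To" come before every lowercase term.
vocab = sort(unique(Iterators.flatten(tokenized)))
col = Dict(term => j for (j, term) in enumerate(vocab))

# Rows are documents, columns are terms, entries are occurrence counts.
counts_matrix = zeros(Int, length(docs), length(vocab))
for (i, toks) in enumerate(tokenized), t in toks
    counts_matrix[i, col[t]] += 1
end

println(vocab)
display(counts_matrix)
```

Reading a column index back through `vocab` tells you which term a given column of the matrix refers to.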

julia> dtm(M)'dtm(M)
10×10 SparseArrays.SparseMatrixCSC{Int64, Int64} with 72 stored entries:
 2  ⋅  2  2  1  1  ⋅  2  1  1
 ⋅  1  ⋅  2  ⋅  1  1  ⋅  ⋅  1
 2  ⋅  2  2  1  1  ⋅  2  1  1
 2  2  2  8  ⋅  2  2  2  2  4
 1  ⋅  1  ⋅  1  1  ⋅  1  ⋅  ⋅
 1  1  1  2  1  2  1  1  ⋅  1
 ⋅  1  ⋅  2  ⋅  1  1  ⋅  ⋅  1
 2  ⋅  2  2  1  1  ⋅  2  1  1
 1  ⋅  1  2  ⋅  ⋅  ⋅  1  1  1
 1  1  1  4  ⋅  1  1  1  1  2

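Simply taking $M^{T} M$ for the document-term matrix $M$ gives a term-by-term matrix that resembles a covariance matrix, except that the columns are not mean-centered.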
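To show how close $M^{T} M$ is to a covariance matrix, the sketch below copies the counts from the output above into a dense matrix, forms the product, and then mean-centers the columns to recover the sample covariance. The names `G`, `Mc`, and `S` are introduced here for illustration only.

```julia
using Statistics

# Document-term counts copied from the dtm(M) output above.
M = [0 1 0 2 0 1 1 0 0 1
     1 0 1 2 0 0 0 1 1 1
     1 0 1 0 1 1 0 1 0 0]

# Uncentered product: entry (i, j) sums count_i * count_j over documents.
G = M' * M

# Subtracting each column mean first and dividing by n - 1
# gives the sample covariance matrix of the term counts.
Mc = M .- mean(M, dims=1)
S = Mc' * Mc / (size(M, 1) - 1)

println(G[4, 4])     # sum of squared counts of "be"
println(S ≈ cov(M))  # compares S with Statistics.cov
```

The diagonal of `G` holds the sum of squared counts of each term, and the off-diagonal entries grow when two terms occur in the same documents.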

Removing case `remove_case!`

julia> remove_case!(crps)

julia> update_lexicon!(crps)

julia> sort(crps.lexicon)
OrderedCollections.OrderedDict{String, Int64} with 9 entries:
  "a"        => 2
  "be"       => 4
  "is"       => 1
  "it"       => 2
  "not"      => 2
  "or"       => 1
  "question" => 2
  "should"   => 1
  "to"       => 3

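In many cases, natural language processing does not need to distinguish uppercase from lowercase. `remove_case!` lowercases the text, and after refreshing the lexicon with `update_lexicon!` you can see that the uppercase terms have been merged into their lowercase forms.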
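The effect of case folding on the counts can be checked with a small plain-Julia sketch that lowercases each sentence before counting. It uses Base's `lowercase` and whitespace splitting as a stand-in and is not the package's own code path.

```julia
docs = [
    "To be or not to be",
    "It should be a question to be",
    "It is not a question",
]

# Lowercase each document first, then count whitespace-separated tokens.
counts = Dict{String,Int}()
for doc in docs, token in split(lowercase(doc))
    counts[token] = get(counts, token, 0) + 1
end

println(length(counts))  # number of distinct terms after case folding
println(counts["to"])    # "To" and "to" are now counted together
```

The vocabulary shrinks by one term because "To" is folded into "to", while "It" simply becomes "it".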

Full code

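using TextAnalysis

# Build a corpus from three short string documents
crps = Corpus(StringDocument.([
      "To be or not to be",
      "It should be a question to be",
      "It is not a question"
]))

# Populate the lexicon, then view the term counts in sorted order
update_lexicon!(crps)
sort(crps.lexicon)

# Document-term matrix: rows are documents, columns are terms
M = DocumentTermMatrix(crps)
dtm(M)
dtm(M)'dtm(M)  # term-by-term matrix M'M

# Lowercase all documents and rebuild the lexicon
remove_case!(crps)
update_lexicon!(crps)
sort(crps.lexicon)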

Beyond these, the package implements many other features needed for natural language processing.

Environment

OS: Windows
julia: v1.10.0