Introduction to TextAnalysis.jl, a Natural Language Processing Package for Julia

Overview

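This post gives a brief introduction to TextAnalysis.jl, a natural language processing package for Julia¹. For practical natural language processing work, it is typically combined with other packages, such as Snowball.jl for stemming².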

Code

julia> using TextAnalysis

julia> crps = Corpus(StringDocument.([
             "To be or not to be",
             "It should be a question to be",
             "It is not a question"
       ]))
A Corpus with 3 documents:
 * 3 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens

A corpus is normally defined with the Corpus constructor, and its documents are supplied as StringDocument, FileDocument, TokenDocument, NGramDocument, and similar types.

Updating the lexicon `update_lexicon!`

julia> update_lexicon!(crps)

julia> sort(crps.lexicon)
OrderedCollections.OrderedDict{String, Int64} with 10 entries:
  "It"       => 2
  "To"       => 1
  "a"        => 2
  "be"       => 4
  "is"       => 1
  "not"      => 2
  "or"       => 1
  "question" => 2
  "should"   => 1
  "to"       => 2

A freshly built corpus has an empty lexicon, so it is populated by calling update_lexicon!. Inspecting the lexicon property then shows the frequency of each term across the whole corpus.
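
To make the counts above easier to verify, the following is a minimal sketch that reproduces them in plain Julia. It assumes naive whitespace tokenization, and `count_terms` is a hypothetical helper introduced only for this illustration; it is not how TextAnalysis.jl is implemented internally.

```julia
# Hand-rolled term counting; an illustration only,
# not TextAnalysis.jl's actual implementation.
docs = [
    "To be or not to be",
    "It should be a question to be",
    "It is not a question",
]

# count_terms is a hypothetical helper name used only in this sketch.
function count_terms(docs)
    counts = Dict{String,Int}()
    for doc in docs, token in split(doc)   # naive whitespace tokenization
        key = String(token)
        counts[key] = get(counts, key, 0) + 1
    end
    return counts
end

lexicon = count_terms(docs)

# Print the terms in sorted order, uppercase before lowercase.
for term in sort(collect(keys(lexicon)))
    println(rpad(term, 10), lexicon[term])
end
```

Because the comparison is case-sensitive, "To" and "to" are counted as separate terms, which is why the lexicon has 10 entries.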

Document-term matrix `DocumentTermMatrix`, `dtm`

julia> M = DocumentTermMatrix(crps)
A 3 X 10 DocumentTermMatrix

julia> dtm(M)
3×10 SparseArrays.SparseMatrixCSC{Int64, Int64} with 16 stored entries:
 ⋅  1  ⋅  2  ⋅  1  1  ⋅  ⋅  1
 1  ⋅  1  2  ⋅  ⋅  ⋅  1  1  1
 1  ⋅  1  ⋅  1  1  ⋅  1  ⋅  ⋅

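The two functions DocumentTermMatrix and dtm give the term counts for each document. The result is a sparse matrix whose rows are documents and whose columns are terms, with the terms sorted in lexical order, uppercase letters before lowercase.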
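
As a rough sketch of what such a matrix contains, the same 3×10 layout can be assembled by hand with the SparseArrays standard library. The variable names (`terms`, `col`, `rows`, `cols`) are assumptions made for this example, and whitespace tokenization is again assumed; the package's own construction may differ.

```julia
using SparseArrays

docs = [
    "To be or not to be",
    "It should be a question to be",
    "It is not a question",
]

# Tokenize each document on whitespace (an assumption for this sketch).
tokens = [String.(split(doc)) for doc in docs]

# Sorted vocabulary; in Julia's string ordering, uppercase sorts before lowercase.
terms = sort(unique(vcat(tokens...)))
col = Dict(term => j for (j, term) in enumerate(terms))

# Collect one (row, column) pair per token occurrence.
rows = Int[]
cols = Int[]
for (i, doc_tokens) in enumerate(tokens), token in doc_tokens
    push!(rows, i)
    push!(cols, col[token])
end

# sparse() sums values at duplicate positions, so repeated terms accumulate.
M = sparse(rows, cols, ones(Int, length(rows)), length(docs), length(terms))
```

Here `M[1, col["be"]]` is 2 because "be" appears twice in the first document.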

julia> dtm(M)'dtm(M)
10×10 SparseArrays.SparseMatrixCSC{Int64, Int64} with 72 stored entries:
 2  ⋅  2  2  1  1  ⋅  2  1  1
 ⋅  1  ⋅  2  ⋅  1  1  ⋅  ⋅  1
 2  ⋅  2  2  1  1  ⋅  2  1  1
 2  2  2  8  ⋅  2  2  2  2  4
 1  ⋅  1  ⋅  1  1  ⋅  1  ⋅  ⋅
 1  1  1  2  1  2  1  1  ⋅  1
 ⋅  1  ⋅  2  ⋅  1  1  ⋅  ⋅  1
 2  ⋅  2  2  1  1  ⋅  2  1  1
 1  ⋅  1  2  ⋅  ⋅  ⋅  1  1  1
 1  1  1  4  ⋅  1  1  1  1  2

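Simply computing $M^{T} M$ for the document-term matrix $M$ yields a term-by-term matrix that resembles a covariance matrix: entry $(i, j)$ sums, over all documents, the product of the counts of term $i$ and term $j$.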
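
The entries of this product can be checked directly. The sketch below copies the document-term matrix printed earlier into a dense array (an assumption made for readability; the original is sparse) and inspects a few entries.

```julia
using LinearAlgebra

# Dense copy of the 3×10 document-term matrix printed earlier
# (columns: It, To, a, be, is, not, or, question, should, to).
M = [0 1 0 2 0 1 1 0 0 1;
     1 0 1 2 0 0 0 1 1 1;
     1 0 1 0 1 1 0 1 0 0]

G = M' * M            # 10×10 term-by-term matrix

be, to = 4, 10        # column indices of "be" and "to"
@show G[be, be]       # sum of squared counts of "be": 2^2 + 2^2 + 0^2
@show G[be, to]       # 2*1 + 2*1 + 0*0
@show issymmetric(G)
```

Unlike a true covariance matrix, the columns here are not mean-centered and nothing is divided by the number of documents, so the result is better read as a matrix of raw co-occurrence strengths.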

Removing case `remove_case!`

julia> remove_case!(crps)

julia> update_lexicon!(crps)

julia> sort(crps.lexicon)
OrderedCollections.OrderedDict{String, Int64} with 9 entries:
  "a"        => 2
  "be"       => 4
  "is"       => 1
  "it"       => 2
  "not"      => 2
  "or"       => 1
  "question" => 2
  "should"   => 1
  "to"       => 3

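In many natural language processing tasks there is no need to distinguish uppercase from lowercase. The remove_case! function removes the case distinction³, and once the lexicon is refreshed with update_lexicon!, the uppercase entries are merged into their lowercase counterparts: the lexicon shrinks from 10 to 9 entries and "to" now has a count of 3.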
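
The effect on the counts can be illustrated without the package. This is only a sketch that starts from the case-sensitive counts shown earlier and folds them with Base's `lowercase`; the names `cased` and `merged` are made up for the example.

```julia
# Case-sensitive counts, copied from the first lexicon printed earlier.
cased = Dict("It" => 2, "To" => 1, "a" => 2, "be" => 4, "is" => 1,
             "not" => 2, "or" => 1, "question" => 2, "should" => 1, "to" => 2)

# Fold every term to lowercase and add up the counts that collide.
merged = Dict{String,Int}()
for (term, n) in cased
    key = lowercase(term)
    merged[key] = get(merged, key, 0) + n
end

for term in sort(collect(keys(merged)))
    println(rpad(term, 10), merged[term])
end
```

Only "To" collides with an existing lowercase entry, which is why the lexicon loses exactly one entry.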

Full Code

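using TextAnalysis

# Build a corpus from three in-memory string documents
crps = Corpus(StringDocument.([
    "To be or not to be",
    "It should be a question to be",
    "It is not a question"
]))

# Populate the lexicon and inspect term frequencies
update_lexicon!(crps)
sort(crps.lexicon)

# Document-term matrix and the term-by-term product M'M
M = DocumentTermMatrix(crps)
dtm(M)
dtm(M)'dtm(M)

# Remove the case distinction, then refresh the lexicon
remove_case!(crps)
update_lexicon!(crps)
sort(crps.lexicon)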

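Beyond what is shown here, the package implements many other features expected of a natural language processing toolkit, such as stop-word removal, n-grams, and TF-IDF weighting.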

Environment

OS: Windows
julia: v1.10.0