Slicing Only a Part of Unicode Strings in Julia

Overview

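As in many programming languages, strings in Julia are Unicode text. A String is stored as UTF-8, in which ASCII characters such as English letters take one byte each, while characters such as Chinese, Japanese, and Korean take several bytes each. The trouble, unlike in many other languages, is that string indices in Julia refer to bytes rather than characters, so slicing such strings is quite tricky. This is a deliberate design choice made for performance reasons¹, so one has no choice but to bear with it and use strings as they are.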
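The gap between characters and bytes can be checked directly. The following is a minimal sketch using only the Base functions length and ncodeunits; the three literals are the same strings used in the rest of this post.

```julia
for s in ("English", "日本語", "한국어")
    # length counts characters; ncodeunits counts the bytes of the UTF-8 encoding
    println(s, ": ", length(s), " characters, ", ncodeunits(s), " bytes")
end
# English: 7 characters, 7 bytes
# 日本語: 3 characters, 9 bytes
# 한국어: 3 characters, 9 bytes
```

Only for the pure ASCII string do the two counts agree, which is why integer indexing behaves as expected there and nowhere else.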

Code

julia> str1 = "English"
"English"

julia> str2 = "日本語"
"日本語"

julia> str3 = "한국어"
"한국어"

For example, let’s say the strings are given as above.

julia> str1[2:end]
"nglish"

str1 contains only ASCII characters, each of which takes exactly one byte, so byte indices coincide with character positions and the string can be sliced like a regular array, as above.

julia> str2[2:end]
ERROR: StringIndexError: invalid index [2], valid nearby indices [1]=>'日', [4]=>'本'
Stacktrace:
 [1] string_index_err(s::String, i::Int64)
   @ Base .\strings\string.jl:12
 [2] getindex(s::String, r::UnitRange{Int64})
   @ Base .\strings\string.jl:266
 [3] top-level scope
   @ c:\Users\rmsms\OneDrive\lab\population_dynamics\REPL.jl:6

However, str2 consists of kanji, each of which takes three bytes in UTF-8, and as shown, index 2 raises an error because it points into the middle of the first character. The error message indicates that the second character starts at index 4, not 2, and indeed, starting the slice at 4 gives the originally intended result.

julia> str2[4:end]
"本語"
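Rather than reading the valid indices off the error message, they can be computed. A small sketch, assuming nothing beyond the Base functions eachindex, firstindex, and nextind; the variable name second is only for illustration.

```julia
str2 = "日本語"

# eachindex yields only the byte indices at which a character starts
valid = collect(eachindex(str2))            # [1, 4, 7]

# nextind steps from one valid index to the next one
second = nextind(str2, firstindex(str2))    # 4

# slicing from a computed index never lands in the middle of a character
tail = str2[second:end]                     # "本語"
```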

The same applies to Korean. Each Hangul syllable here also takes three bytes in UTF-8, so there is no reason for it to behave differently.

julia> str3[4:end]
"국어"

julia> str3[6]
ERROR: StringIndexError: invalid index [6], valid nearby indices [4]=>'국', [7]=>'어'
Stacktrace:
 [1] string_index_err(s::String, i::Int64)
   @ Base .\strings\string.jl:12
 [2] getindex_continued(s::String, i::Int64, u::UInt32)
   @ Base .\strings\string.jl:237
 [3] getindex(s::String, i::Int64)
   @ Base .\strings\string.jl:230
 [4] top-level scope
   @ c:\Users\rmsms\OneDrive\lab\population_dynamics\REPL.jl:9

julia> str3[7]
'어': Unicode U+C5B4 (category Lo: Letter, other)
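To avoid the error when an index may fall inside a character, the index can be tested or snapped back to the start of its character first. This is a sketch using the Base functions isvalid and thisind; whether snapping back is the right behaviour depends on the use case.

```julia
str3 = "한국어"

# isvalid reports whether a byte index is the start of a character
ok = isvalid(str3, 6)        # false: byte 6 is the last byte of '국'

# thisind returns the start index of the character that contains byte 6
start = thisind(str3, 6)     # 4

ch = str3[start]             # '국'
```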

Trick

julia> String(collect(str3)[2:3])
"국어"

A somewhat more convenient approach is to unpack the string into an array of characters with collect(), slice that array by character position, and then reassemble the result into a string, as above.
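The collect() approach builds an intermediate array of characters. As a possible alternative, Base also offers first, last, and chop, which count in characters, and nextind(s, 0, n), which returns the byte index of the n-th character. The helper charslice below is a hypothetical name introduced only for this sketch and is not part of Julia.

```julia
str3 = "한국어"

head2 = first(str3, 2)                 # "한국": the first two characters
tail2 = last(str3, 2)                  # "국어": the last two characters
rest  = chop(str3, head=1, tail=0)     # "국어": drop one character from the front

# Hypothetical helper: slice by character positions i..j (1-based, inclusive)
function charslice(s::AbstractString, i::Integer, j::Integer)
    start = nextind(s, 0, i)   # byte index where the i-th character starts
    stop  = nextind(s, 0, j)   # byte index where the j-th character starts
    return s[start:stop]
end

sliced = charslice(str3, 2, 3)         # "국어"
```

Because charslice converts character positions into byte indices before slicing, the same call gives the expected result for ASCII strings too, for example charslice("English", 2, 7) returns "nglish".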

Environment

OS: Windows
julia: v1.8.3

¹ https://discourse.julialang.org/t/weird-string-slicing-in-korean/92252/2