Text segmentation is a way to divide text into units like characters, words, and sentences.
Let’s say you have the following Japanese text and you’d like to perform a word count:
吾輩は猫である。名前はたろう。
If you’re unfamiliar with Japanese, you might try built-in string methods in your first attempt.
For English strings, a rough way to count the words is to split by space characters:
const str = "How many words. Are there?";
const words = str.split(" ");
console.log(words);
console.log(words.length);
The punctuation is mixed in with the word matches, so the result isn’t exact, but it’s a good approximation.
The problem is that the Japanese string has no spaces separating the words.
Maybe your next idea would be to reach for str.length to count the characters.
Using string length, you’d get 15, and if you remove the full stops (。) you might guess 13 words.
The problem is we actually have 8 words in the string without punctuation: '吾輩' 'は' '猫' 'で' 'ある' '名前' 'は' 'たろう'.
If you rely on string methods for a word count, you’ll quickly run into trouble: you can’t reliably split on a specific character, and you can’t use spaces as separators like you can in English.
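To make the mismatch concrete, here is a minimal sketch of the naive attempts described above. It assumes the sample text is the string 吾輩は猫である。名前はたろう。; the variable names are only for illustration.

```javascript
// Assumed sample text: 15 UTF-16 code units, two of which are full stops.
const jaText = "吾輩は猫である。名前はたろう。";

// There are no spaces to split on, so the whole string comes back as one item.
const bySpace = jaText.split(" ");
console.log(bySpace.length); // 1

// The string length counts characters (code units), not words.
console.log(jaText.length); // 15

// Removing the full stops still only gives a character count.
const withoutStops = jaText.replace(/。/g, "");
console.log(withoutStops.length); // 13
```

None of these numbers matches the 8 words in the sentence.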
This is what locale-sensitive segmentation is built for.
The format for creating a segmenter in the Intl namespace is as follows:
new Intl.Segmenter(locales, options);
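As a rough illustration of that signature, the sketch below creates a few segmenters and inspects them with resolvedOptions(). Both arguments are optional, and the granularity falls back to "grapheme" when it is not set. The "en" locale and the variable names are just example choices.

```javascript
// With no arguments, the runtime's default locale and "grapheme" granularity are used.
const defaultSegmenter = new Intl.Segmenter();
console.log(defaultSegmenter.resolvedOptions().granularity); // "grapheme"

// The granularity option accepts "grapheme", "word", or "sentence".
const wordSegmenter = new Intl.Segmenter("en", { granularity: "word" });
const sentenceSegmenter = new Intl.Segmenter("en", { granularity: "sentence" });

console.log(wordSegmenter.resolvedOptions().granularity); // "word"
console.log(sentenceSegmenter.resolvedOptions().granularity); // "sentence"
```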
Let’s try passing the string into a segmenter with the ja-JP locale for Japanese, explicitly setting the granularity to word so that each segment is a word:
const jaSegmenter = new Intl.Segmenter("ja-JP", { granularity: "word" });
const segments = jaSegmenter.segment("吾輩は猫である。名前はたろう。");
console.log(Array.from(segments));
This example logs the following array to the console:
[
  { "segment": "吾輩", "index": 0, "input": "吾輩は猫である。名前はたろう。", "isWordLike": true },
  { "segment": "は", "index": 2, "input": "吾輩は猫である。名前はたろう。", "isWordLike": true },
  { "segment": "猫", "index": 3, "input": "吾輩は猫である。名前はたろう。", "isWordLike": true },
  // …
]
For each item in the array, we get the segment, its index in the original string, the full input string, and a Boolean isWordLike to distinguish words from punctuation and other non-word segments.
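The same fields can be read while iterating the segments directly, without building an array first. The following is a small sketch using an English string, chosen here only so the positions are easy to follow; the variable names are illustrative.

```javascript
const enSegmenter = new Intl.Segmenter("en", { granularity: "word" });
const enInput = "Hello, world!";

const wordLike = [];
// The object returned by segment() is iterable, so for...of works on it.
for (const { segment, index, isWordLike } of enSegmenter.segment(enInput)) {
  // Punctuation and whitespace are returned as segments too, with isWordLike set to false.
  console.log(index, JSON.stringify(segment), isWordLike);
  if (isWordLike) {
    wordLike.push({ segment, index });
  }
}

console.log(wordLike.map((item) => item.segment)); // ["Hello", "world"]
```

Each index is the position of the segment in the input string, so slicing the input from that index yields the segment again.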
Now we have a robust and structured way to interact with the words that is locale-aware.
The segmenter’s granularity is word in this example, so we can filter the items on their isWordLike property to ignore punctuation:
const jaString = "吾輩は猫である。名前はたろう。";
const jaSegmenter = new Intl.Segmenter("ja-JP", { granularity: "word" });
const segments = jaSegmenter.segment(jaString);
const words = Array.from(segments)
  .filter((item) => item.isWordLike)
  .map((item) => item.segment);
console.log(words);
console.log(words.length);
This looks much better.
We have an array with Japanese words using the segmenter, ready for adding locale-aware word count to our application.
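One way to package this for reuse is a small helper that takes the text and a locale. This is only a sketch: countWords is a hypothetical name, not part of the Intl API, and the exact segmentation of a given string depends on the locale data shipped with the runtime.

```javascript
// Hypothetical helper: counts the word-like segments of `text` for `locale`.
function countWords(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  let count = 0;
  for (const { isWordLike } of segmenter.segment(text)) {
    if (isWordLike) {
      count += 1;
    }
  }
  return count;
}

// Punctuation and spaces are skipped, so only the five words are counted.
console.log(countWords("How many words. Are there?", "en")); // 5

// The same helper works for text without spaces between words.
console.log(countWords("吾輩は猫である。名前はたろう。", "ja-JP"));
```

Counting in a loop avoids creating an intermediate array, which can matter for long texts, but the filter-and-map version above is just as valid when you also need the words themselves.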
We’ll explore that use case a bit more with a small example in the following sections.
Before that, we’ll take a look at the rest of the options that you can pass into a segmenter.