How does Chinese-Japanese-Korean (CJK) searching work?

Tokenization

As with searching in any language, Chinese, Japanese, and Korean searching has as many nuances as we have the time and willingness to explore, but the core issue that makes CJK searching a challenge is tokenization: how we break text into the low-level units of meaning a user might be looking for.

For most other languages, we break text into words using spaces and punctuation. This is imperfect, but it's the core of what we do. While we sometimes build important ideas out of a series of words, it's important that we never match a user's search term against part of a word, or vice versa.

The idea of finding the Journal of environmental health by searching for "mental health" is ridiculous, but it's exactly what could happen if we broke our words down into still smaller units. The problem comes when we try to apply a system of searching designed for Western languages to Chinese, Japanese, and Korean text. At heart, the problem is that these languages don't require, and don't typically use, spaces to divide conceptual units (like words) within a sentence.
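To make the contrast concrete, here is a minimal sketch of the whitespace-and-punctuation tokenizer described above, and what happens when it meets unspaced Chinese text. The function name and example strings are ours, purely for illustration:

```python
import re

def whitespace_tokenize(text):
    # Split on runs of whitespace and basic punctuation, the way a
    # typical Western-language analyzer would.
    return [t for t in re.split(r"[\s.,;:!?()]+", text) if t]

# English: spaces hand us the word boundaries.
print(whitespace_tokenize("Journal of environmental health"))
# -> ['Journal', 'of', 'environmental', 'health']
# A search for "mental" compares whole tokens, never substrings,
# so it does not match "environmental".

# Chinese: no spaces, so the whole phrase comes back as one token,
# and nothing short of the exact full string would match it.
print(whitespace_tokenize("南开大学课程建设"))
# -> ['南开大学课程建设']
```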

For example, consider this work:

[Image: catalog record for a work on course development at Nankai University]

This work on the process of course development at Nankai University should be found when searching in Chinese for the name of the university or for the concept of course development. Yet neither of these concepts exists neatly separated by spaces or other markers that might set it apart from the rest of the text. But the answer is not to treat each character as an isolated unit of meaning. Indeed, if we were to take the four characters that spell out Nankai University (南开大学) and treat each character individually, we would have the concepts of South, Open, Large, and Study. Instead, it's how the characters are used together that gives them meaning.

So how do we know when a series of characters must be taken together to attain their useful meaning, and when they should be considered separately? Without a dictionary-powered indexing protocol (which we don't currently have for any language), our computers can only guess. Our solution instead is to index the characters both individually and in sequence: we look for works that use the characters in a user's search, giving precedence to those that actually contain the user's characters in sequence.
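In practice this amounts to overlapping n-gram indexing: every character becomes a unigram, and every adjacent pair a bigram (Lucene/Solr's CJKBigramFilter works along these lines). The sketch below, including its toy scoring function and contrived document strings, is our own illustration rather than the production code:

```python
def cjk_ngrams(text):
    # Index every character on its own (unigrams) AND every adjacent
    # pair (bigrams), so a phrase is searchable both as loose
    # characters and as in-sequence runs.
    unigrams = list(text)
    bigrams = [text[i:i + 2] for i in range(len(text) - 1)]
    return unigrams + bigrams

print(cjk_ngrams("南开大学"))
# -> ['南', '开', '大', '学', '南开', '开大', '大学']

def score(query, doc):
    # Toy relevance score: shared unigrams count once, and shared
    # bigrams count again, so a record containing the query
    # characters *in sequence* floats to the top.
    q, d = set(cjk_ngrams(query)), set(cjk_ngrams(doc))
    return len(q & d)

# An in-sequence match beats the same four characters scattered
# through unrelated text:
print(score("南开大学", "南开大学课程建设"))  # -> 7
print(score("南开大学", "南方大雪开学通知"))  # -> 4
```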

Other Indexing Nuances

First, we have Chinese traditional/simplified character equivalence set up. Simplified characters are used almost exclusively in mainland China these days, though older works and those published elsewhere may still use the traditional forms. With this equivalence in place, a user will find a work even if it doesn't use the same form as the user's search. To go back to Nankai University: the form in the record above is the simplified one, and the traditional form differs in both the second and fourth characters. Because the university is older than the official adoption of simplified forms in China, it's quite likely that older works use the traditional form, and a user searching for 南开大学 should find 南開大學. For example, this work, originally published in 1951, will also be among the search results:

[Image: catalog record, title: Nankai University Introduction]
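One simple way to implement this equivalence is to fold every character to a canonical (here, simplified) form at both index time and query time. The two-entry table below is a deliberately tiny illustration; a real setup would use a full mapping, such as ICU's Traditional-Simplified transform:

```python
# A tiny slice of a traditional-to-simplified folding table,
# for illustration only.
TRAD_TO_SIMP = {"開": "开", "學": "学"}

def fold(text):
    # Normalize every character to its simplified form so that
    # 南開大學 and 南开大学 index, and search, identically.
    return "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)

print(fold("南開大學"))                      # -> 南开大学
print(fold("南開大學") == fold("南开大学"))  # -> True
```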

Second, for Japanese text we have a similar equivalence set up between the two main phonetic scripts used in Japanese: katakana and hiragana. Though we haven't yet gotten any user feedback on the utility of this mapping, we are sticking with it, primarily because Stanford University seems happy with it.
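Because katakana and hiragana occupy parallel Unicode blocks, this kind of folding can be done with a simple code-point shift. The sketch below is our own illustration, not the production analyzer (which also has to handle details such as the prolonged sound mark ー):

```python
def kata_to_hira(text):
    # Katakana (U+30A1..U+30F6) and hiragana (U+3041..U+3096) occupy
    # parallel Unicode blocks offset by 0x60, so most katakana fold
    # to hiragana with a single code-point shift.
    return "".join(
        chr(ord(ch) - 0x60) if 0x30A1 <= ord(ch) <= 0x30F6 else ch
        for ch in text
    )

# "toshokan" (library) written in either script now matches:
print(kata_to_hira("トショカン"))                            # -> としょかん
print(kata_to_hira("トショカン") == kata_to_hira("としょかん"))  # -> True
```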

Much of our work to implement support for Chinese, Japanese, and Korean searching is based on the excellent example of Stanford's SearchWorks interface.