Master-Hash/ewt-rs

ewt-rs

Emacs Tokenizer tokenizing CJK words with WinRT API or ICU.

EWT stands for Emacs Windows Tokenizer, but it works on all platforms when built with ICU.

Installation

This crate provides a dynamic module that emt.el consumes. Install emt.el first, then put the module's dynamic library into emt-lib-path (by default located at ~/.emacs.d/modules/libEMT.{dll,so,etc}).

Pre-built

Download from Releases, or from the CI artifacts for unversioned binaries.

I offer .dll files for the msvc/gnu/gnullvm targets, which are all ABI compatible, so mixing them with either a UCRT or a CLANG64 Emacs is fine. You may need to build the module yourself if you use an Emacs built with MSVCRT.

The gnullvm target binary links against libunwind.dll, so make sure the directory containing that DLL is in PATH. It won't work if the DLL is only in load-path or in the same directory as the module dll.

Manually build

  1. Install Rust toolchain
  2. (On Windows) Install MSYS2, and put ${MSYSTEM}/bin to PATH to make libclang work
  3. cargo build --release to use ICU
  4. cargo build --release --no-default-features -F windows to use WinRT API

It's possible to build for *-pc-windows-gnullvm, but this requires manually adjusting the -I include directory, the libclang target, the link target, and the link sysroot (when cross-compiling). You may refer to the CI script.

Adjustment

The segmenter language used with the WinRT API is hardcoded. Users can change zh-CN to their preferred language.

C vs C++ vs Rust

Microsoft doesn't and will never provide WinRT API for C.

C++ 20 is required for cppwinrt. I encountered an auto type deduction error in the cppwinrt header file, which I could not fix. The binary could be much smaller (~100k?) though, so if it worked, it would be preferable.

I have to use unsafe extern "C" all the way to write the Rust binding. The safety is no better than C++, but Rust has better WinRT API support and type inference. When built with LTO, the size of ~260K is acceptable.
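As a rough illustration of what writing against a C ABI from Rust looks like, here is a minimal sketch. The function `count_chars` is hypothetical and is not part of this crate or of the Emacs module API; it only shows the pattern of an unsafe extern "C" function that validates a raw pointer before touching it.

```rust
use std::ffi::{c_char, CStr};

/// Hypothetical C-ABI helper (an assumption for illustration, not this crate's API).
/// Returns the number of chars in a NUL-terminated UTF-8 string,
/// or -1 if the pointer is null or the bytes are not valid UTF-8.
///
/// # Safety
/// `text` must be null or point to a valid NUL-terminated string.
pub unsafe extern "C" fn count_chars(text: *const c_char) -> isize {
    if text.is_null() {
        return -1;
    }
    // SAFETY: the caller guarantees that a non-null `text` is NUL-terminated.
    let c_str = unsafe { CStr::from_ptr(text) };
    match c_str.to_str() {
        Ok(s) => s.chars().count() as isize,
        Err(_) => -1,
    }
}
```

Every call into such a function has to be wrapped in `unsafe` on the Rust side as well, which is why the binding gains little safety over C++.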

WinRT API vs ICU

Personally, I recommend the WinRT API for Simplified Chinese and ICU for Traditional Chinese.

WinRT API                 ICU
'有|异曲同工|之|妙'       '有异|曲|同工|之|妙'
'有|異|曲|同工|之|妙'     '有|異曲同工|之|妙'
'丧心病狂|的|异想天开'    '丧心病狂|的|异|想|天|开'
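To make the comparison concrete, the '|' notation in the table can be turned into word ranges. The sketch below is only an illustration: `word_ranges` is a hypothetical helper, not a function of this crate, and it assumes positions are counted in chars.

```rust
/// Convert a '|'-delimited segmentation into half-open (start, end)
/// ranges counted in chars. Hypothetical helper for illustration only.
pub fn word_ranges(segmented: &str) -> Vec<(usize, usize)> {
    let mut ranges = Vec::new();
    let mut start = 0;
    for word in segmented.split('|') {
        let len = word.chars().count();
        // Skip empty pieces produced by leading, trailing, or doubled bars.
        if len > 0 {
            ranges.push((start, start + len));
        }
        start += len;
    }
    ranges
}
```

For example, `word_ranges("有|异曲同工|之|妙")` yields `[(0, 1), (1, 5), (5, 6), (6, 7)]`, while the ICU result `"有异|曲|同工|之|妙"` yields `[(0, 2), (2, 3), (3, 5), (5, 6), (6, 7)]`.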

Note on UTF-8 Grapheme Cluster

This crate handles String at the char level instead of the grapheme cluster level. However, this causes no problems, probably because emt.el only uses the helper function when moving through CJK characters.
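The difference between char level and grapheme cluster level can be sketched with the standard library alone. `char_to_byte` below is a hypothetical helper, not taken from this crate; it maps a char index to a byte offset, treating every Unicode scalar value as one unit.

```rust
/// Map a char index to its byte offset in `s`. The index equal to the
/// number of chars maps to `s.len()`; anything larger gives `None`.
/// Hypothetical helper for illustration only.
pub fn char_to_byte(s: &str, char_idx: usize) -> Option<usize> {
    s.char_indices()
        .map(|(byte_offset, _)| byte_offset)
        .chain(std::iter::once(s.len()))
        .nth(char_idx)
}

/// Number of char-level units in `s`. A combining sequence such as
/// "e\u{301}" counts as 2 here, although it is a single grapheme cluster.
pub fn char_len(s: &str) -> usize {
    s.chars().count()
}
```

For typical CJK text each character is a single char, so char indices and what a user perceives as characters coincide. They diverge only for combining sequences, where `char_len("e\u{301}")` is 2.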

Future Work

  • Try ICU Backend
  • Find out why M-S-{F,B} doesn't select anything
  • Stop linking against libunwind.dll

Credit
