Emacs Tokenizer tokenizing CJK words with WinRT API or ICU.
EWT stands for Emacs Windows Tokenizer, but it works on all platforms if built with ICU.
This crate provides the dynamic module that emt.el consumes. Install emt.el first, then put the module's dynamic library into `emt-lib-path` (by default located at `~/.emacs.d/modules/libEMT.{dll,so,etc}`).
Download a binary from Releases, or get an unversioned binary from the CI Artifact.
I offer `.dll` files for the msvc/gnu/gnullvm targets, which are all ABI compatible, so mixing them with either a UCRT or CLANG64 Emacs is fine. You may need to build the module yourself if you use an Emacs built with MSVCRT.
If you use the gnullvm target binary, it links against `libunwind.dll`, so make sure the directory containing that DLL is in PATH. It won't work if the DLL is only in `load-path` or in the same directory as the module DLL.
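As a quick diagnostic for the PATH requirement above, the following minimal sketch looks for a file in the directories of a PATH-style variable. The function name `find_on_path` is hypothetical and not part of this crate. It only reports where a file would be found, and makes no claim about how Emacs or Windows actually resolves the DLL.

```rust
use std::env;
use std::ffi::OsStr;
use std::path::PathBuf;

/// Return the first `dir/file_name` that exists as a regular file,
/// scanning the directories listed in `path_var` in order.
/// `path_var` uses the platform's PATH separator (`;` on Windows).
fn find_on_path(path_var: &OsStr, file_name: &str) -> Option<PathBuf> {
    env::split_paths(path_var)
        .map(|dir| dir.join(file_name))
        .find(|candidate| candidate.is_file())
}
```

Calling `find_on_path(&env::var_os("PATH").unwrap_or_default(), "libunwind.dll")` returns `None` when no directory in PATH contains the DLL, which would match the failure case described above.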
- Install Rust toolchain
- (On Windows) Install MSYS2, and add `${MSYSTEM}/bin` to PATH to make libclang work
- `cargo build --release` to use ICU
- `cargo build --release --no-default-features -F windows` to use WinRT API
It's possible to build for `*-pc-windows-gnullvm`, but manual adjustment of the `-I` include directory, libclang target, link target and link sysroot (when cross compiling) is required. You may refer to the CI script.
The segmenter language used with the WinRT API is hardcoded. Users can change `zh-CN` to their preferred language.
Microsoft doesn't, and never will, provide a WinRT API for C.
C++ 20 is required for cppwinrt. I encountered an auto type deduction error in the cppwinrt header file, which I could not fix. The binary could be much smaller (~100k?) though, so C++ would be preferable if it worked.
I have to use unsafe extern "C" all the way to write the Rust binding. The safety is no better than C++, but Rust has better WinRT API support and type inference. When built with LTO, the size of ~260K is acceptable.
Personally, I recommend the WinRT API for Simplified Chinese and ICU for Traditional Chinese.
WinRT API | ICU |
---|---|
'有|异曲同工|之|妙' | '有异|曲|同工|之|妙' |
'有|異|曲|同工|之|妙' | '有|異曲同工|之|妙' |
'丧心病狂|的|异想天开' | '丧心病狂|的|异|想|天|开' |
This crate handles String at the char level instead of the grapheme cluster level. However, this causes no problems, probably because emt.el only uses the helper function when moving through CJK characters.
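To illustrate what char-level handling means, here is a minimal sketch that converts a segmentation written in the `|`-separated notation of the table above into char-offset ranges. The function `char_bounds` is hypothetical and is not the crate's actual API. It uses only the standard library.

```rust
/// Turn a display string such as "有|异曲同工|之|妙" into half-open
/// (start, end) ranges measured in `char`s, not bytes or graphemes.
fn char_bounds(segmented: &str) -> Vec<(usize, usize)> {
    let mut bounds = Vec::new();
    let mut start = 0;
    for word in segmented.split('|') {
        // Each CJK character here is one `char` but three UTF-8 bytes.
        let end = start + word.chars().count();
        bounds.push((start, end));
        start = end;
    }
    bounds
}
```

For `"有|异曲同工|之|妙"` this yields `[(0, 1), (1, 5), (5, 6), (6, 7)]`. The char-level caveat shows up with combining sequences: `"e\u{301}"` (e followed by a combining acute accent) is one grapheme cluster but counts as two `char`s, so offsets would differ from what a user perceives. Common CJK ideographs are single `char`s, which is consistent with the observation that this causes no problems in practice.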
- Try ICU Backend
- Find out why M-S-{F,B} doesn't select anything
- Stop linking against libunwind.dll
- emt.el
- ubolonton/emacs-module-rs: I don't use it because of an issue, but it helped me learn how Emacs Dynamic Modules work, and it provides useful functions.
- Article: Writing an Emacs module in Rust