Added some prose, alternatives, closure-taking API, and future direct… #2
Conversation
I have compiled but not tested this code, it might be bonkers 😅
#### Stability over ancient versions of Unicode
Early versions of Unicode experienced significant churn, and the modern definition of normalization stability was added in version 4.1. When we talk about the stability of assigned code points, we are referring to notions of stability from 4.1 onwards. A normalized string is stable from either the version of Unicode in which its latest code point was assigned or Unicode version 4.1, whichever is later.
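To make the "whichever is later" rule concrete, here is a small sketch. The type and function names (`UnicodeVersion`, `stableSince`) are hypothetical helpers for illustration, not proposed API:

```swift
// Hypothetical helper types, for illustration only.
struct UnicodeVersion: Comparable {
    var major: Int
    var minor: Int
    static func < (a: UnicodeVersion, b: UnicodeVersion) -> Bool {
        (a.major, a.minor) < (b.major, b.minor)
    }
}

// A normalized string is stable from the later of Unicode 4.1
// (when normalization stability was defined) and the version in
// which the string's newest scalar was assigned.
func stableSince(latestAssignment: UnicodeVersion) -> UnicodeVersion {
    max(latestAssignment, UnicodeVersion(major: 4, minor: 1))
}
```

For example, a string whose scalars were all assigned by Unicode 3.0 is still only stable from 4.1 onwards, while a string containing a scalar assigned in 15.0 is stable from 15.0.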
I think we're going to want more of a discussion about what we mean by "all versions of Unicode" in doc comments.
### Normalizing initializers
We could define initializers similar to `String(decoding:as:)`, e.g. `String(normalizing:)`, which decode and transform the contents into the stdlib's preferred normal form without always needing to create an intermediary `String` instance. Such strings will compare much faster with each other, but may have different contents when viewed through the `UnicodeScalarView`, `UTF8View`, or `UTF16View`.
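A rough sketch of what such an initializer could look like. `String(normalizing:)` is a pitched API, not part of the shipped stdlib, and the body below is a naive placeholder; a real implementation would fuse decoding and normalization rather than materialize an intermediate `String`:

```swift
extension String {
    /// Hypothetical: decodes `codeUnits` as UTF-8 and stores the
    /// result in the stdlib's preferred normal form.
    init<C: Collection>(normalizing codeUnits: C) where C.Element == UInt8 {
        // Placeholder implementation: decode first, normalize after.
        // A fused fast path would avoid this intermediate allocation
        // (the normalization step itself is elided in this sketch).
        self = String(decoding: codeUnits, as: UTF8.self)
    }
}

// Usage: "e" followed by U+0301 (combining acute accent) would be
// stored in normalized form, e.g. as the single scalar U+00E9 in NFC.
let s = String(normalizing: Array("e\u{301}".utf8))
```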
My personal preference would be to see what we can do about pulling this into this pitch / proposal, or immediately afterwards, as the perf improvements to comparison, Regex matching (when we get the corresponding optimizations next release), etc. are dramatic.
Normalization API is presented in terms of `Unicode.Scalar`s, but a lower-level interface might run directly on top of (validated) UTF-8 code units. Many useful normalization fast-paths can be implemented directly on top of UTF-8 without needing to decode.
For example, any `Unicode.Scalar` that is less than `U+0300` is in NFC and can be skipped over for NFC normalization. In UTF-8, this amounts to skipping until we find a byte that is `0xCC` or larger.
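That fast path can be sketched directly on code units. This works because every scalar below `U+0300` encodes with a lead byte below `0xCC` (`U+02FF` is `0xCB 0xBF`), and continuation bytes (`0x80`–`0xBF`) are also below `0xCC`, so in valid UTF-8 the first byte that is `0xCC` or larger is necessarily a scalar boundary. `nfcQuickSkip` is a hypothetical helper name, not stdlib API:

```swift
// Sketch of the NFC fast path described above: everything before the
// returned offset is already in NFC and needs no further work.
func nfcQuickSkip(_ utf8: some Collection<UInt8>) -> Int {
    var skipped = 0
    for byte in utf8 {
        // Bytes below 0xCC can only belong to scalars below U+0300,
        // all of which are unchanged by NFC normalization.
        guard byte < 0xCC else { break }
        skipped += 1
    }
    // Offset of the first byte that might begin a scalar requiring
    // normalization (in valid UTF-8, a scalar boundary).
    return skipped
}
```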
My thought is that rather than have `UTF8Span` vend a stateful closure, we might want to have `UTF8Span` be a basis for low-level fast normal views. But that might be pending non-escapable sequences/iterators.