Added some prose, alternatives, closure-taking API, and future direct… #2


Merged · 1 commit · Jul 17, 2024

Conversation

milseman

No description provided.


I have compiled but not tested this code, it might be bonkers 😅

#### Stability over ancient versions of Unicode

Early versions of Unicode experienced significant churn, and the modern definition of normalization stability was added in version 4.1. When we talk about the stability of assigned code points, we mean stability from 4.1 onwards. A normalized string is stable from whichever is later: Unicode 4.1, or the version in which its latest code point was assigned.
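As an illustration of this rule, a hypothetical helper (the function name `stabilityVersion(of:)` is invented here, not part of the pitched API; `Unicode.Scalar.Properties.age` is existing stdlib API) could derive a string's stability version from the assignment age of its scalars:

```swift
// Sketch only: computes the earliest Unicode version from which a
// normalized form of `s` is stable, per the rule above.
func stabilityVersion(of s: String) -> (major: Int, minor: Int) {
    // The modern definition of normalization stability begins at 4.1.
    var latest = (major: 4, minor: 1)
    for scalar in s.unicodeScalars {
        // `age` is the Unicode version in which the scalar was assigned
        // (nil for unassigned scalars, which contribute nothing here).
        if let age = scalar.properties.age,
           (age.major, age.minor) > (latest.major, latest.minor) {
            latest = (major: age.major, minor: age.minor)
        }
    }
    return latest
}
```

For a purely ASCII string this returns 4.1, since every ASCII scalar predates 4.1; a string containing a recently assigned scalar returns that scalar's age instead.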


I think we're going to want more of a discussion about what we mean by "all versions of Unicode" in doc comments.

### Normalizing initializers

We could define initializers similar to `String(decoding:as:)`, e.g. `String(normalizing:)`, which decode and transform the contents into the stdlib's preferred normal form without always needing to create an intermediate `String` instance. Such strings compare much faster with each other, but may have different contents when viewed through the `UnicodeScalarView`, `UTF8View`, or `UTF16View`.
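The initializer itself is not sketched here, but the observable effect it trades on already exists in the stdlib: canonically equivalent strings compare equal as `String`s while their lower-level views differ, which is why storing a preferred normal form can speed comparison without changing `==` results. A small illustration:

```swift
// "é" in two canonically equivalent encodings:
let nfc = "caf\u{E9}"    // U+00E9 LATIN SMALL LETTER E WITH ACUTE
let nfd = "cafe\u{301}"  // "e" followed by U+0301 COMBINING ACUTE ACCENT

// String comparison is canonical-equivalence-aware, so these are equal...
print(nfc == nfd)                // true

// ...but the scalar and code-unit views expose the differing contents.
print(nfc.unicodeScalars.count)  // 4
print(nfd.unicodeScalars.count)  // 5
```

A normalizing initializer would let both inputs land in the same stored form up front, so later comparisons need no on-the-fly normalization.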


My personal preference would be to see what we can do about pulling this into this pitch / proposal, or immediately afterwards, as the perf improvements to comparison, Regex matching (when we get the corresponding optimizations next release), etc. are dramatic.

The normalization API is presented in terms of `Unicode.Scalar`s, but a lower-level interface might run directly on top of (validated) UTF-8 code units. Many useful normalization fast paths can be implemented directly on top of UTF-8 without needing to decode.

For example, any `Unicode.Scalar` that is less than `U+0300` is in NFC and can be skipped over for NFC normalization. In UTF-8, this amounts to skipping until we find a byte that is `0xCC` or larger.
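That byte-level observation can be sketched as follows; the function name and shape are illustrative only, not proposed API:

```swift
// Sketch: given validated UTF-8, count the leading bytes whose scalars
// are all below U+0300 and therefore already in NFC. The first UTF-8
// byte of U+0300 is 0xCC, so any byte below 0xCC belongs to a scalar
// below U+0300, and no decoding is needed to skip it.
func nfcSafePrefixLength(ofValidatedUTF8 bytes: some Sequence<UInt8>) -> Int {
    var count = 0
    for byte in bytes {
        if byte >= 0xCC { break }
        count += 1
    }
    return count
}
```

A real normalizer would additionally back up to the preceding scalar boundary at the break point, since a combining mark there may compose with the starter just before it (e.g. "e" followed by U+0301 composes to "é").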


My thought is that rather than have UTF8Span vend a stateful closure, we might want to have UTF8Span be a basis of low-level fast normal views. But that might be pending non-escapable sequences/iterators.

@karwa karwa merged commit 0bffa9d into karwa:unicode Jul 17, 2024