Added some prose, alternatives, closure-taking API, and future direct… #2


Merged · 1 commit · Jul 17, 2024

Conversation

milseman

No description provided.


I have compiled but not tested this code, it might be bonkers 😅

#### Stability over ancient versions of Unicode

Early versions of Unicode experienced significant churn, and the modern definition of normalization stability was added in version 4.1. When we talk about the stability of assigned code points, we mean stability from 4.1 onwards. A normalized string is stable from whichever is later: Unicode 4.1, or the version in which its latest code point was assigned.
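As an illustration of this rule, a hypothetical helper (the function name `stabilityVersion(of:)` is invented here, not part of the pitched API; `Unicode.Scalar.Properties.age` is existing stdlib API) could derive a string's stability version from the assignment age of its scalars:

```swift
// Sketch only: computes the earliest Unicode version from which a
// normalized form of `s` is stable, per the rule above.
func stabilityVersion(of s: String) -> (major: Int, minor: Int) {
    // The modern definition of normalization stability begins at 4.1.
    var latest = (major: 4, minor: 1)
    for scalar in s.unicodeScalars {
        // `age` is the Unicode version in which the scalar was assigned
        // (nil for unassigned scalars, which contribute nothing here).
        if let age = scalar.properties.age,
           (age.major, age.minor) > (latest.major, latest.minor) {
            latest = (major: age.major, minor: age.minor)
        }
    }
    return latest
}
```

For a purely ASCII string this returns 4.1, since every ASCII scalar predates 4.1; a string containing a recently assigned scalar returns that scalar's age instead.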


I think we're going to want more of a discussion about what we mean by "all versions of Unicode" in doc comments.

### Normalizing initializers

We could define initializers similar to `String(decoding:as:)`, e.g. `String(normalizing:)`, which decode and transform the contents into the stdlib's preferred normal form without always needing to create an intermediate `String` instance. Such strings compare much faster with each other, but may have different contents when viewed through the `UnicodeScalarView`, `UTF8View`, or `UTF16View`.
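The initializer itself is not sketched here, but the observable effect it trades on already exists in the stdlib: canonically equivalent strings compare equal as `String`s while their lower-level views differ, which is why storing a preferred normal form can speed comparison without changing `==` results. A small illustration:

```swift
// "é" in two canonically equivalent encodings:
let nfc = "caf\u{E9}"    // U+00E9 LATIN SMALL LETTER E WITH ACUTE
let nfd = "cafe\u{301}"  // "e" followed by U+0301 COMBINING ACUTE ACCENT

// String comparison is canonical-equivalence-aware, so these are equal...
print(nfc == nfd)                // true

// ...but the scalar and code-unit views expose the differing contents.
print(nfc.unicodeScalars.count)  // 4
print(nfd.unicodeScalars.count)  // 5
```

A normalizing initializer would let both inputs land in the same stored form up front, so later comparisons need no on-the-fly normalization.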


My personal preference would be to see what we can do about pulling this into this pitch / proposal, or immediately afterwards, as the perf improvements to comparison, Regex matching (when we get the corresponding optimizations next release), etc. are dramatic.

The normalization API is presented in terms of `Unicode.Scalar`s, but a lower-level interface might run directly on top of (validated) UTF-8 code units. Many useful normalization fast paths can be implemented directly on top of UTF-8 without needing to decode.

For example, any `Unicode.Scalar` that is less than `U+0300` is in NFC and can be skipped over for NFC normalization. In UTF-8, this amounts to skipping until we find a byte that is `0xCC` or larger.
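That byte-level observation can be sketched as follows; the function name and shape are illustrative only, not proposed API:

```swift
// Sketch: given validated UTF-8, count the leading bytes whose scalars
// are all below U+0300 and therefore already in NFC. The first UTF-8
// byte of U+0300 is 0xCC, so any byte below 0xCC belongs to a scalar
// below U+0300, and no decoding is needed to skip it.
func nfcSafePrefixLength(ofValidatedUTF8 bytes: some Sequence<UInt8>) -> Int {
    var count = 0
    for byte in bytes {
        if byte >= 0xCC { break }
        count += 1
    }
    return count
}
```

A real normalizer would additionally back up to the preceding scalar boundary at the break point, since a combining mark there may compose with the starter just before it (e.g. "e" followed by U+0301 composes to "é").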


My thought is that rather than have UTF8Span vend a stateful closure, we might want to have UTF8Span be a basis of low-level fast normal views. But that might be pending non-escapable sequences/iterators.

@karwa karwa merged commit 0bffa9d into karwa:unicode Jul 17, 2024