Conversation
This contains some serious feature creep; I think it is better suited as a (native?) module.
Have to disagree. It's actually rather small: it is very similar to stuff we are already doing (e.g. `buffer.toString('hex')` re-encodes), uses code we already have in our dependency tree, limits itself only to the encodings we already support so as to avoid stepping on userland, and provides useful functionality that is difficult (update: well, "difficult" is a bit strong; complicated) to do in a performant way otherwise (slicing on the UTF-8 boundary, for instance). I fail to see how this could qualify as "serious feature creep".

Essentially, we're getting this functionality for free from code we already use.
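To illustrate the point: core's existing `Buffer` API already performs encoding conversions of exactly this kind. A quick sketch with today's API:

```javascript
// Re-encoding between representations is something core already does:
// toString('hex') converts the buffer's bytes to a hex string, and
// Buffer.from(..., 'hex') converts back.
const buf = Buffer.from('hi', 'utf8');

console.log(buf.toString('hex'));                   // '6869'
console.log(Buffer.from('6869', 'hex').toString()); // 'hi'
```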
XXX-icu-module.md
| Title  | ICU Module |
|--------|------------|
| Author | @jasnell   |
| Status | REJECTED   |
Just curious. Why is this labeled as REJECTED?
Ha!! Copied the header from a different one lol
On Apr 27, 2016 2:56 PM, "Trevor Norris" [email protected] wrote:
I'm in principle okay with exposing ICU in core, but character encoding detection doesn't strike me as super useful and I'd hate to see

The encoding detection is definitely one that would be more convenience than anything, in particular when reading arbitrary data whose origin and encoding you don't already know in advance. It is, however, a best guess on ICU's part and isn't perfect by any means, so dropping that bit is fine. On the conversion part, ICU contains a huge number of converters built in if we're using the full-icu data. Small-icu only includes the algorithmic Unicode converters. We could obsolete iconv and iconv-lite if we make it so that, when full-icu is used, the full range of converters is available. Just not sure how much value there is in opening it up fully like that.

@jasnell You know if win1252 requires the full data set? That is the one exception I'd like to make for converting the Buffer to utf8 so it can be properly turned into a string. This would help mitigate issues in what I believe is the only case where the WHATWG specification differs from every other specification of what ISO-8859-1 actually is (reference the table entry "windows-1252" at https://encoding.spec.whatwg.org/#names-and-labels). Meaning, in every other specification "latin1" aliases ISO-8859-1, which was first released as ISO_8859-1:1987. Then in 1992 the IANA released an RFC for an extension of the specification (ref http://tools.ietf.org/html/rfc1345). This was then later finalized as ISO/IEC 8859-1:1998 (ref http://www.open-std.org/JTC1/SC2/WG3/docs/n411.pdf). So we can see that the byte-to-code-point mappings aren't consistent between how the WHATWG specifies ISO-8859-1 and how everyone else does. Unfortunately browsers render
win1252 is not included in the small-icu data set we use. It could likely be added.

It's not just data size. We build with the conversion code ifdeffed out also.

Yep, even with the non-legacy conversion code switched back on (which I do in the prototype), it's not available. @srl295 .. do you know offhand if windows-1252 requires the legacy conversions and/or data?

Yes and yes. Pretty much anything that's not one of the UTFs or iso-8859-1 or ascii requires both code and data.
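The latin1-vs-windows-1252 divergence described above is easy to observe in a modern Node.js build (assuming full-icu, the default in current releases, so that `TextDecoder` accepts the `windows-1252` label):

```javascript
// Bytes in the 0x80..0x9F range are where ISO-8859-1 and the WHATWG
// "windows-1252" interpretation disagree.
const buf = Buffer.from([0x80, 0x93, 0x48, 0x69, 0x94]);

// Node's 'latin1' maps each byte directly to U+0000..U+00FF
// (true ISO-8859-1): 0x80 decodes to the C1 control character U+0080.
console.log(buf.toString('latin1')); // '\u0080\u0093Hi\u0094'

// The WHATWG windows-1252 mapping makes those bytes printable:
// 0x80 -> U+20AC (€), 0x93/0x94 -> curly quotes.
console.log(new TextDecoder('windows-1252').decode(buf)); // '€“Hi”'
```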
@jasnell Why was this closed?

I opted to close due to critical personal feedback I received via private conversation implying that it was somehow inappropriate for me to be trying to "ram" such new capabilities into core. If others wish to continue pursuing this then I will support that effort in whatever way I can, but I will not pursue it further myself.

@jasnell Even rejected EPs are merged. :-) And screw private conversations. This is a public discussion. If someone wants to say something they can do it out in the open. No EP will be closed without a justified reason.

And for the record, I'd not consider approval for this except for the fact that we're already including the code because of ICU support. So it seems dumb for us to leave good features on the table that already exist within core. I do have thoughts on what the API should look like, for sure, but that's for another post. Time for bed.
Just seeing this now .. sorry. I think I'm with @trevnorris on this in that we have the functionality sitting there and although I'm generally -1 on feature creep, having solid unicode support in Node is turning out to be pretty important for its global growth.
Can someone educate me on how the

@rvagg Currently you need to do something like
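The snippet in the comment above is cut off, but the usual two-step workaround it presumably alludes to looks something like this (a sketch of the pattern, not the proposed API): decode the bytes into an intermediate JS string, then re-encode.

```javascript
// 'café' encoded as latin1: the 'é' is the single byte 0xe9.
const latin1Bytes = Buffer.from([0x63, 0x61, 0x66, 0xe9]);

// Workaround: decode to a string, then re-encode as UTF-8.
// This allocates an intermediate string, which is the overhead a
// one-shot transcoder would avoid.
const utf8Bytes = Buffer.from(latin1Bytes.toString('latin1'), 'utf8');

console.log(utf8Bytes);                  // <Buffer 63 61 66 c3 a9>
console.log(utf8Bytes.toString('utf8')); // 'café'
```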
Thanks, I thought we had the functionality already, so wondered why it was needed. So the clarification here is that it's both a performance improvement and an API simplification. I'm cool with that.
If there is enough agreement to move forward on iterating this proposal On Jul 13, 2016 8:13 PM, "Rod Vagg" [email protected] wrote:
Sounds good to me.
Looking at revitalizing this proposal. Before I do, however, here's what I'm proposing:

In other words, this would isolate the additional APIs strictly to the new

Updated.
Updated work in progress implementation: nodejs/node#8075 |
So if full-icu is installed, does that mean conversion between other encodings (not natively supported by Buffer -- such as utf-32) would still not be available?

For now, yes. The initially stated concern when I first suggested this last spring was that it would be "feature creep" to do too much. However, if full-icu were installed, it would be fairly trivial to support the full range of encodings built in to ICU.

//fyi @martinheidegger

btw, at this point in time I am not pursuing this. May return to it later, however.
I filed a similar request at nodejs/node#13550 but with a slightly different take. I was mostly interested in the data tables and exotic functionality such as the bidi algorithm, collation, and IDNA, whereas I guess this EP at least originally focused on encoding functionality. To me the main argument is that the code is already bundled, so not exposing it to JS is just wasteful, leading to large libraries like https://github.com./Sebmaster/tr46.js which have to pull in large Unicode data tables instead of using the built-in ones. Just exposing the data tables would make that library much smaller, but exposing IDNA would probably mostly obsolete the library.
I'm generally all for doing something here. Happy to help work through it.
Closing as this seems inactive. By all means, comment or re-open if this is still an ongoing concern, although I suspect it should be moved to a more active repository (probably the main node repo) if that's the case.
Description
The ICU4C library that we use for internationalization contains a significant array of additional functionality not currently exposed by the EcmaScript 402 standard. Some of this additional functionality would be useful to expose via a new `icu` (or `unicode`) module.

Note: Updated on 2016-08-15 to reflect the current state of the proposal.
Interface
Initially, the `icu` module would provide methods for the following:

One-Shot Transcoding

ICU includes code for converting from one encoding to another. This is similar to what is provided by `iconv` but it is built in to ICU4C. The `icu` module would include converters for only the character encodings directly supported by core. Developers would continue to use `iconv` or `iconv-lite` for more exotic things.

UTF-8 and UTF-16 aware `codePointAt()` and `charAt()` methods for `Buffer`

This one is pretty straightforward. They would return either the Unicode codepoint or the character at the given byte offset, even if the byte offset is not on a UTF-8 or UTF-16 lead byte. These are intended to be symmetrical with `String.prototype.codePointAt()` and `String.prototype.charAt()`.

UTF-16 and UTF-8 aware `slice()` for `Buffer`

This is similar to the existing `Buffer.prototype.slice()` except that, rather than byte offsets, the `start` and `end` are codepoint/character offsets. This would make it symmetrical with `String.prototype.slice()` but for Buffers. The advantage is that this allows the Buffer to be sliced in a way that ensures proper alignment with UTF-8 or UTF-16 encodings. Passing in either `ascii` or `binary` would fall back to the current `Buffer.slice()` behavior.

Proof-of-concept
A proof-of-concept implementation that includes everything except the streaming converters can be found at:
https://github.com./jasnell/node/tree/icu-module
The module requires turning on the basic ICU converter switch which has a nominal impact on binary size but nothing significant. Since the conversions we would require in core are purely algorithmic, there is no additional ICU data requirement (the current small-icu is just fine).
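The boundary-alignment problem that the proposed codepoint-aware `slice()` addresses can be seen with today's byte-based API; the `icu.slice()` call shown in the comment is a hypothetical based on this proposal, not a shipped API:

```javascript
// 'é' is two bytes in UTF-8 (0xc3 0xa9), so a byte-offset slice can
// cut through the middle of a character:
const buf = Buffer.from('héllo', 'utf8');

// The incomplete trailing sequence decodes to U+FFFD (replacement char).
console.log(buf.slice(0, 2).toString('utf8')); // 'h\ufffd' — mid-character cut

// A codepoint-aware slice per the proposal would instead return 'hé',
// e.g. something like: icu.slice(buf, 0, 2, 'utf8')  // hypothetical API
```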
/cc @trevnorris @bnoordhuis