This repository was archived by the owner on Jul 31, 2018. It is now read-only.

00X: Add 'icu' module #15

Closed
wants to merge 3 commits into from

Conversation

jasnell
Member

@jasnell jasnell commented Apr 20, 2016

Description

The ICU4C library that we use for internationalization contains a significant array of additional functionality not currently exposed by the ECMAScript 402 standard. Some of this additional functionality would be useful to expose via a new 'icu' (or 'unicode') module.

Note: Updated on 2016-08-15 to reflect the current state of the proposal

Interface

Initially, the 'icu' module would provide methods for the following:

One-Shot Transcoding

ICU includes code for converting from one encoding to another. This is similar to what is provided by iconv, but it is built into ICU4C. The 'icu' module would include converters only for the character encodings directly supported by core. Developers would continue to use iconv or iconv-lite for more exotic things.

const icu = require('icu');

// One-shot conversion. Converts the entire Buffer in one go.
// Assumes that the Buffer is properly aligned on UTF-8 boundaries
const myBuffer = Buffer.from(getUtf8DataSomehow(), 'utf8');
const newBuffer = icu.transcode(myBuffer, 'utf8', 'ucs2');
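
Assuming transcode() behaves as described, the result should round-trip (a sketch, not output from the prototype):

// The UCS-2 copy should decode back to the same string.
console.log(newBuffer.toString('ucs2') === myBuffer.toString('utf8')); // true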

UTF-8 and UTF-16 aware codePointAt() and charAt() methods for Buffer

This one is pretty straightforward. They would return either the Unicode code point or the character at the given byte offset, even if that offset does not fall on a UTF-8 or UTF-16 lead byte. These are intended to be symmetrical with String.prototype.codePointAt() and String.prototype.charAt().

const icu = require('icu');

const myBuffer = Buffer.from('a€bc', 'utf8');

// '€' is encoded as the three bytes 0xE2 0x82 0xAC starting at offset 1.
console.log(icu.codePointAt(myBuffer, 1, 'utf8')); // 8364 (U+20AC)

console.log(icu.charAt(myBuffer, 1, 'utf8')); // '€'

UTF-16 and UTF-8 aware slice() for Buffer

This is similar to the existing Buffer.prototype.slice() except that, rather than byte offsets, the start and end are codepoint/character offsets. This would make it symmetrical with String.prototype.slice(), but for Buffers. The advantage is that this allows a Buffer to be sliced in a way that ensures proper alignment with UTF-8 or UTF-16 encodings.

const icu = require('icu');

const myBuffer = Buffer.from('a€bc', 'utf8');

icu.slice(myBuffer, 'utf8', 1, 2); // returns a Buffer with €

Passing in either 'ascii' or 'binary' would fall back to the current Buffer.prototype.slice() behavior.
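
For contrast, slicing by byte offset through the middle of a multi-byte character with today's Buffer.prototype.slice() yields mojibake:

// Byte offset 1 grabs only 0xE2, the lead byte of '€' (0xE2 0x82 0xAC).
console.log(myBuffer.slice(1, 2).toString('utf8')); // '\ufffd' (replacement character)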

Proof-of-concept

A proof-of-concept implementation that includes everything except the streaming converters can be found at:

https://github.com./jasnell/node/tree/icu-module

The module requires turning on the basic ICU converter switch, which has a nominal impact on binary size. Since the conversions we would require in core are purely algorithmic, there is no additional ICU data requirement (the current small-icu is just fine).

/cc @trevnorris @bnoordhuis

@jasnell changed the title from "Proposal: Add 'icu' module" to "00X: Add 'icu' module" on Apr 20, 2016
@Fishrock123

This contains some serious feature creep, I think it is better suited as a (native?) module.

@jasnell
Member Author

jasnell commented Apr 20, 2016

Have to disagree. It's actually rather small: it is very similar to stuff we are already doing (e.g. `buffer.toString('hex')` re-encodes), uses code we already have in our dependency tree, limits itself only to the encodings we already support so as to avoid stepping on userland, and provides useful functionality that is difficult (update: well, "difficult" is a bit strong; complicated) to do in a performant way otherwise (slicing on a UTF-8 boundary, for instance). I fail to see how this could qualify as "serious feature creep".

@jasnell
Member Author

jasnell commented Apr 20, 2016

Essentially, we're getting this functionality for free from code we already use.

@trevnorris
Contributor

Commenting on this header in XXX-icu-module.md:

| Title  | ICU Module |
|--------|------------|
| Author | @jasnell   |
| Status | REJECTED   |

Just curious. Why is this labeled as REJECTED?

@jasnell
Member Author

Ha!! Copied the header from a different one lol

@bnoordhuis
Member

I'm in principle okay with exposing ICU in core but character encoding detection doesn't strike me as super useful and I'd hate to see Buffer.prototype burdened with yet more convenience methods of limited generality. Character encoding conversion seems useful however and I wouldn't mind if it obsoleted iconv and iconv-lite.

@jasnell
Member Author

jasnell commented Apr 28, 2016

The encoding detection is definitely more a convenience than anything else ... in particular when reading arbitrary data whose origin and encoding you don't already know in advance. It is, however, a best guess on ICU's part and isn't perfect by any means. So dropping that bit is fine.

On the conversion part, ICU contains a huge number of converters built in if we're using the full-icu data. The small-icu data set only includes the algorithmic Unicode converters. We could obsolete iconv and iconv-lite if we made the full range of converters available whenever full-icu is used. I'm just not sure how much value there is in opening it up fully like that.

@trevnorris
Contributor

@jasnell Do you know if win1252 requires the full data set? That is the one exception I'd like to make for converting a Buffer to utf8 so it can be properly turned into a string. This would help mitigate issues where http.get() grabs a page with charset=win1252 (or one of the other WHATWG variant aliases) and we simply don't have a way to decode it into a string.

I believe this is the only case where the WHATWG specification differs from every other specification of what ISO-8859-1 actually is (see the table entry "windows-1252" at https://encoding.spec.whatwg.org/#names-and-labels). Meaning, in every other specification "latin1" aliases ISO-8859-1, which was first released as ISO_8859-1:1987. Then in 1992 the IANA released an RFC for an extension of the specification (ref http://tools.ietf.org/html/rfc1345). This was then later finalized as ISO/IEC 8859-1:1998 (ref http://www.open-std.org/JTC1/SC2/WG3/docs/n411.pdf). So the byte-to-code-point mappings aren't consistent between the WHATWG's definition of ISO-8859-1 and everyone else's.

Unfortunately, browsers render charset=iso-8859-1 as windows-1252, so it's not uncommon for servers to return the same. Fortunately windows-1252 is a strict superset of iso-8859-1, so at worst a few specific characters won't decode. But in case a server does send bytes encoded as windows-1252, http.get() should be able to interpret them properly.
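
A quick illustration of the disagreement (the windows-1252 mapping in the comment comes from the WHATWG tables; Node itself has no windows-1252 decoder today):

// Bytes 0x80-0x9F are where the two encodings disagree. Node's 'latin1'
// (true ISO-8859-1) maps byte 0x80 to the C1 control character U+0080,
// while windows-1252 maps the same byte to U+20AC ('€').
const buf = Buffer.from([0x80]);
console.log(buf.toString('latin1')); // '\u0080' (not '€')
// A windows-1252-aware decoder (e.g. iconv-lite) would yield '€' here.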

@jasnell
Member Author

jasnell commented Apr 30, 2016

win1252 is not included in the small-icu data set we use. It could likely be added.

@jasnell jasnell closed this Apr 30, 2016
@srl295
Member

srl295 commented Apr 30, 2016

It's not just data size. We build with the conversion code ifdeffed out also.

@jasnell
Member Author

jasnell commented Apr 30, 2016

yep, even with the non-legacy conversion code switched back on (which I do in the prototype), it's not available. @srl295 .. do you know offhand if windows-1252 requires the legacy conversions and/or data?

@srl295
Member

srl295 commented Apr 30, 2016

Yes and yes. Pretty much anything that's not one of the UTFs or iso-8859-1 or ascii requires both code and data

@trevnorris
Contributor

@jasnell Why was this closed?

@jasnell
Member Author

jasnell commented May 4, 2016

I opted to close due to critical personal feedback I received via private conversation implying that it was somehow inappropriate for me to be trying to "ram" such new capabilities into core. If others wish to continue pursuing this, I will support that effort in whatever way I can, but I will not pursue it further myself.

@trevnorris
Contributor

@jasnell even rejected EPs get merged. :-)

And screw private conversations. This is a public discussion. If someone wants to say something they can do it out in the open. No EP will be closed without a justified reason.

@trevnorris trevnorris reopened this May 5, 2016
@trevnorris
Contributor

And for the record, I'd not consider approving this except for the fact that we're already including the code for ICU support. So it seems dumb for us to leave good features on the table that already exist within core. I do have thoughts on what the API should look like, for sure, but that's for another post. Time for bed.

@rvagg
Member

rvagg commented Jul 13, 2016

Just seeing this now .. sorry. I think I'm with @trevnorris on this in that we have the functionality sitting there and although I'm generally -1 on feature creep, having solid unicode support in Node is turning out to be pretty important for its global growth.

'unicode' would be a better name for this, I think, but that package name is already far too heavily used to usurp: https://www.npmjs.com/package/unicode

Can someone educate me on how the reencode() functionality here is different from what you can already do with Buffer?

@trevnorris
Contributor

@rvagg Currently you need to do something like Buffer.from(b.toString('utf8'), 'ucs2') in order to convert it. Meaning it needs to copy the data into the V8 heap as a string. This method would allow that to happen outside of the V8 heap.
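
Roughly, assuming the proposed API (a sketch, not a final signature):

const icu = require('icu');
const utf8Buf = Buffer.from('a€bc', 'utf8');

// Today: decode into a JS string, then re-encode. Two copies through the V8 heap.
const ucs2ViaString = Buffer.from(utf8Buf.toString('utf8'), 'ucs2');

// Proposed: transcode directly between Buffers, no intermediate string.
const ucs2Direct = icu.transcode(utf8Buf, 'utf8', 'ucs2');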

@rvagg
Member

rvagg commented Jul 14, 2016

Thanks, I thought we had the functionality already, so I wondered why it was needed. So the clarification here is that it's both a performance improvement and an API simplification. I'm cool with that.

@jasnell
Member Author

jasnell commented Jul 16, 2016

If there is enough agreement to move forward on iterating this proposal, I'll keep working on it (after vacation, of course).


@trevnorris
Contributor

Sounds good to me.

@jasnell
Member Author

jasnell commented Aug 11, 2016

Looking at revitalizing this proposal. Before I do, however, here's what I'm proposing:

  • Renaming icu.reencode() to icu.transcode(). No new methods would be added to Buffer.prototype
  • Removing the charset detection logic
  • Keeping the icu.codePointAt() and icu.charAt() APIs, but not adding them to Buffer.prototype
  • Keeping icu.slice() to allow for UTF-8 and UTF-16 aware Buffer slicing, but not adding it to Buffer.prototype
  • Renaming the module from icu to unicode

In other words, this would isolate the additional APIs strictly to the new unicode module.
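
A minimal sketch of the resulting surface, using the method shapes from the examples above (the behavior comments are assumptions based on this thread):

const unicode = require('unicode');
const buf = Buffer.from('a€bc', 'utf8');

unicode.transcode(buf, 'utf8', 'ucs2'); // formerly icu.reencode()
unicode.codePointAt(buf, 1, 'utf8');    // 8364 (U+20AC)
unicode.charAt(buf, 1, 'utf8');         // '€'
unicode.slice(buf, 'utf8', 1, 2);       // Buffer containing '€'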

@jasnell
Member Author

jasnell commented Aug 11, 2016

Updated..

@jasnell
Member Author

jasnell commented Aug 12, 2016

Updated work in progress implementation: nodejs/node#8075

@mscdex

mscdex commented Aug 12, 2016

So if full-icu is installed, does that mean conversion between other encodings (not natively supported by Buffer -- such as utf-32) would still not be available?

@jasnell
Member Author

jasnell commented Aug 12, 2016

For now, yes. The initial concern when I first suggested this last spring was that doing too much would be "feature creep". However, if full-icu were installed, it would be fairly trivial to support the full range of encodings built into ICU.

@srl295
Member

srl295 commented Feb 24, 2017

//fyi @martinheidegger

@jasnell
Member Author

jasnell commented Feb 24, 2017

btw, at this point in time I am not pursuing this. May return to it later, however.

@domenic

domenic commented Jun 8, 2017

I filed a similar request at nodejs/node#13550 but with a slightly different take. I was mostly interested in the data tables and exotic functionality such as the bidi algorithm, collation, and IDNA. Whereas I guess this EP at least originally focused on encoding functionality.

To me the main argument is that the code is already bundled, so not exposing it to JS is just wasteful. It leads to large libraries like https://github.com./Sebmaster/tr46.js having to pull in large Unicode data tables instead of using the built-in ones. Just exposing the data tables would make that library much smaller, but exposing IDNA would probably mostly obsolete it.
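
For instance, a sketch of what an exposed IDNA helper might look like (the name and placement are purely assumptions; no such API exists in this proposal yet):

const { domainToASCII } = require('icu'); // hypothetical
console.log(domainToASCII('bücher.example')); // 'xn--bcher-kva.example'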

@jasnell
Member Author

jasnell commented Jun 8, 2017

I'm generally all for doing something here. Happy to help work through it.

@Trott
Member

Trott commented Jun 12, 2018

Closing as this seems inactive. By all means, comment or re-open if this is still an ongoing concern, although I suspect it should be moved to a more active repository (probably the main node repo) if that's the case.

@Trott Trott closed this Jun 12, 2018