This repository was archived by the owner on Jul 31, 2018. It is now read-only.

00X: Add 'icu' module #15

Closed
wants to merge 3 commits into from

Conversation

jasnell
Member

@jasnell jasnell commented Apr 20, 2016

Description

The ICU4C library that we use for internationalization contains a significant array of additional functionality not currently exposed by the ECMAScript 402 standard. Some of this additional functionality would be useful to expose via a new 'icu' (or 'unicode') module.

Note: Updated on 2016-08-15 to reflect the current state of the proposal

Interface

Initially, the 'icu' module would provide methods for the following:

One-Shot Transcoding

ICU includes code for converting from one encoding to another. This is similar to what is provided by iconv, but it is built into ICU4C. The 'icu' module would include converters only for the character encodings directly supported by core. Developers would continue to use iconv or iconv-lite for more exotic things.

const icu = require('icu');

// One-shot conversion. Converts the entire Buffer in one go.
// Assumes that the Buffer is properly aligned on UTF-8 boundaries
const myBuffer = Buffer.from(getUtf8DataSomehow(), 'utf8');
const newBuffer = icu.transcode(myBuffer, 'utf8', 'ucs2');
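
Assuming transcode() behaves as described, the result should round-trip (a sketch, not output from the prototype):

// The UCS-2 copy should decode back to the same string.
console.log(newBuffer.toString('ucs2') === myBuffer.toString('utf8')); // true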

UTF-8 and UTF-16 aware codePointAt() and charAt() methods for Buffer

This one is pretty straightforward. They would return either the Unicode code point or the character at the given byte offset, even if that offset does not fall on a UTF-8 or UTF-16 lead byte. These are intended to be symmetrical with String.prototype.codePointAt() and String.prototype.charAt().

const icu = require('icu');

const myBuffer = Buffer.from('a€bc', 'utf8');

// '€' is encoded as the three bytes 0xE2 0x82 0xAC starting at offset 1.
console.log(icu.codePointAt(myBuffer, 1, 'utf8')); // 8364 (U+20AC)

console.log(icu.charAt(myBuffer, 1, 'utf8')); // '€'

UTF-16 and UTF-8 aware slice() for Buffer

This is similar to the existing Buffer.prototype.slice() except that, rather than byte offsets, the start and end are codepoint/character offsets. This would make it symmetrical with String.prototype.slice(), but for Buffers. The advantage is that this allows a Buffer to be sliced in a way that ensures proper alignment with UTF-8 or UTF-16 encodings.

const icu = require('icu');

const myBuffer = Buffer.from('a€bc', 'utf8');

icu.slice(myBuffer, 'utf8', 1, 2); // returns a Buffer with €

Passing in either 'ascii' or 'binary' would fall back to the current Buffer.prototype.slice() behavior.
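
For contrast, slicing by byte offset through the middle of a multi-byte character with today's Buffer.prototype.slice() yields mojibake:

// Byte offset 1 grabs only 0xE2, the lead byte of '€' (0xE2 0x82 0xAC).
console.log(myBuffer.slice(1, 2).toString('utf8')); // '\ufffd' (replacement character)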

Proof-of-concept

A proof-of-concept implementation that includes everything except the streaming converters can be found at:

https://github.com./jasnell/node/tree/icu-module

The module requires turning on the basic ICU converter switch, which has a nominal impact on binary size. Since the conversions we would require in core are purely algorithmic, there is no additional ICU data requirement (the current small-icu is just fine).

/cc @trevnorris @bnoordhuis

@jasnell changed the title from "Proposal: Add 'icu' module" to "00X: Add 'icu' module" on Apr 20, 2016
@Fishrock123

This contains some serious feature creep, I think it is better suited as a (native?) module.

@jasnell
Member Author

jasnell commented Apr 20, 2016

Have to disagree. It's actually rather small: it is very similar to stuff we are already doing (e.g. `buffer.toString('hex')` re-encodes), uses code we already have in our dependency tree, limits itself only to the encodings we already support so as to avoid stepping on userland, and provides useful functionality that is difficult (update: well, "difficult" is a bit strong; complicated) to do in a performant way otherwise (slicing on a UTF-8 boundary, for instance). I fail to see how this could qualify as "serious feature creep".

@jasnell
Member Author

jasnell commented Apr 20, 2016

Essentially, we're getting this functionality for free from code we already use.

@trevnorris
Contributor

Commenting on this header in XXX-icu-module.md:

| Title  | ICU Module |
|--------|------------|
| Author | @jasnell   |
| Status | REJECTED   |

Just curious. Why is this labeled as REJECTED?

@jasnell
Member Author

Ha!! Copied the header from a different one lol

@bnoordhuis
Member

I'm in principle okay with exposing ICU in core but character encoding detection doesn't strike me as super useful and I'd hate to see Buffer.prototype burdened with yet more convenience methods of limited generality. Character encoding conversion seems useful however and I wouldn't mind if it obsoleted iconv and iconv-lite.

@jasnell
Member Author

jasnell commented Apr 28, 2016

The encoding detection is definitely more a convenience than anything else ... in particular when reading arbitrary data whose origin and encoding you don't already know in advance. It is, however, a best guess on ICU's part and isn't perfect by any means. So dropping that bit is fine.

On the conversion part, ICU contains a huge number of converters built in if we're using the full-icu data. The small-icu data set only includes the algorithmic Unicode converters. We could obsolete iconv and iconv-lite if we made the full range of converters available whenever full-icu is used. I'm just not sure how much value there is in opening it up fully like that.

@trevnorris
Contributor

@jasnell Do you know if win1252 requires the full data set? That is the one exception I'd like to make for converting a Buffer to utf8 so it can be properly turned into a string. This would help mitigate issues where http.get() grabs a page with charset=win1252 (or one of the other WHATWG variant aliases) and we simply don't have a way to decode it into a string.

I believe this is the only case where the WHATWG specification differs from every other specification of what ISO-8859-1 actually is (see the table entry "windows-1252" at https://encoding.spec.whatwg.org/#names-and-labels). Meaning, in every other specification "latin1" aliases ISO-8859-1, which was first released as ISO_8859-1:1987. Then in 1992 the IANA released an RFC for an extension of the specification (ref http://tools.ietf.org/html/rfc1345). This was then later finalized as ISO/IEC 8859-1:1998 (ref http://www.open-std.org/JTC1/SC2/WG3/docs/n411.pdf). So the byte-to-code-point mappings aren't consistent between the WHATWG's definition of ISO-8859-1 and everyone else's.

Unfortunately, browsers render charset=iso-8859-1 as windows-1252, so it's not uncommon for servers to return the same. Fortunately windows-1252 is a strict superset of iso-8859-1, so at worst a few specific characters won't decode. But in case a server does send bytes encoded as windows-1252, http.get() should be able to interpret them properly.
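
A quick illustration of the disagreement (the windows-1252 mapping in the comment comes from the WHATWG tables; Node itself has no windows-1252 decoder today):

// Bytes 0x80-0x9F are where the two encodings disagree. Node's 'latin1'
// (true ISO-8859-1) maps byte 0x80 to the C1 control character U+0080,
// while windows-1252 maps the same byte to U+20AC ('€').
const buf = Buffer.from([0x80]);
console.log(buf.toString('latin1')); // '\u0080' (not '€')
// A windows-1252-aware decoder (e.g. iconv-lite) would yield '€' here.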

@jasnell
Member Author

jasnell commented Apr 30, 2016

win1252 is not included in the small-icu data set we use. It could likely be added.

@jasnell jasnell closed this Apr 30, 2016
@srl295
Member

srl295 commented Apr 30, 2016

It's not just data size. We build with the conversion code ifdeffed out also.

@jasnell
Member Author

jasnell commented Apr 30, 2016

yep, even with the non-legacy conversion code switched back on (which I do in the prototype), it's not available. @srl295 .. do you know offhand if windows-1252 requires the legacy conversions and/or data?

@srl295
Member

srl295 commented Apr 30, 2016

Yes and yes. Pretty much anything that's not one of the UTFs or iso-8859-1 or ascii requires both code and data

@trevnorris
Contributor

@jasnell Why was this closed?

@jasnell
Member Author

jasnell commented May 4, 2016

I opted to close due to critical personal feedback I received via private conversation implying that it was somehow inappropriate for me to be trying to "ram" such new capabilities into core. If others wish to continue pursuing this, I will support that effort in whatever way I can, but I will not pursue it further myself.

@trevnorris
Contributor

@jasnell even rejected EPs get merged. :-)

And screw private conversations. This is a public discussion. If someone wants to say something they can do it out in the open. No EP will be closed without a justified reason.

@trevnorris trevnorris reopened this May 5, 2016
@trevnorris
Contributor

And for the record, I'd not consider approving this except for the fact that we're already including the code for ICU support. So it seems dumb for us to leave good features on the table that already exist within core. I do have thoughts on what the API should look like, for sure, but that's for another post. Time for bed.

@rvagg
Member

rvagg commented Jul 13, 2016

Just seeing this now .. sorry. I think I'm with @trevnorris on this in that we have the functionality sitting there and although I'm generally -1 on feature creep, having solid unicode support in Node is turning out to be pretty important for its global growth.

'unicode' would be a better name for this, I think, but that package name is already far too heavily used to usurp: https://www.npmjs.com/package/unicode

Can someone educate me on how the reencode() functionality here is different from what you can already do with Buffer?

@trevnorris
Contributor

@rvagg Currently you need to do something like Buffer.from(b.toString('utf8'), 'ucs2') in order to convert it. Meaning it needs to copy the data into the V8 heap as a string. This method would allow that to happen outside of the V8 heap.
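
Roughly, assuming the proposed API (a sketch, not a final signature):

const icu = require('icu');
const utf8Buf = Buffer.from('a€bc', 'utf8');

// Today: decode into a JS string, then re-encode. Two copies through the V8 heap.
const ucs2ViaString = Buffer.from(utf8Buf.toString('utf8'), 'ucs2');

// Proposed: transcode directly between Buffers, no intermediate string.
const ucs2Direct = icu.transcode(utf8Buf, 'utf8', 'ucs2');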

@rvagg
Member

rvagg commented Jul 14, 2016

Thanks, I thought we had the functionality already, so I wondered why it was needed. So the clarification here is that it's both a performance improvement and an API simplification. I'm cool with that.

@jasnell
Member Author

jasnell commented Jul 16, 2016

If there is enough agreement to move forward on iterating this proposal, I'll keep working on it (after vacation, of course).


@trevnorris
Contributor

Sounds good to me.

@jasnell
Member Author

jasnell commented Aug 11, 2016

Looking at revitalizing this proposal. Before I do, however, here's what I'm proposing:

  • Renaming icu.reencode() to icu.transcode(). No new methods would be added to Buffer.prototype
  • Removing the charset detection logic
  • Keeping the icu.codePointAt() and icu.charAt() APIs, but not adding them to Buffer.prototype
  • Keeping icu.slice() to allow for UTF-8 and UTF-16 aware Buffer slicing, but not adding it to Buffer.prototype
  • Renaming the module from icu to unicode

In other words, this would isolate the additional APIs strictly to the new unicode module.
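
A minimal sketch of the resulting surface, using the method shapes from the examples above (the behavior comments are assumptions based on this thread):

const unicode = require('unicode');
const buf = Buffer.from('a€bc', 'utf8');

unicode.transcode(buf, 'utf8', 'ucs2'); // formerly icu.reencode()
unicode.codePointAt(buf, 1, 'utf8');    // 8364 (U+20AC)
unicode.charAt(buf, 1, 'utf8');         // '€'
unicode.slice(buf, 'utf8', 1, 2);       // Buffer containing '€'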

@jasnell
Member Author

jasnell commented Aug 11, 2016

Updated..

@jasnell
Member Author

jasnell commented Aug 12, 2016

Updated work in progress implementation: nodejs/node#8075

@mscdex

mscdex commented Aug 12, 2016

So if full-icu is installed, does that mean conversion between other encodings (not natively supported by Buffer -- such as utf-32) would still not be available?

@jasnell
Member Author

jasnell commented Aug 12, 2016

For now, yes. The initial concern when I first suggested this last spring was that doing too much would be "feature creep". However, if full-icu were installed, it would be fairly trivial to support the full range of encodings built into ICU.

@srl295
Member

srl295 commented Feb 24, 2017

//fyi @martinheidegger

@jasnell
Member Author

jasnell commented Feb 24, 2017

btw, at this point in time I am not pursuing this. May return to it later, however.

@domenic

domenic commented Jun 8, 2017

I filed a similar request at nodejs/node#13550 but with a slightly different take. I was mostly interested in the data tables and exotic functionality such as the bidi algorithm, collation, and IDNA. Whereas I guess this EP at least originally focused on encoding functionality.

To me the main argument is that the code is already bundled, so not exposing it to JS is just wasteful. It leads to large libraries like https://github.com./Sebmaster/tr46.js having to pull in large Unicode data tables instead of using the built-in ones. Just exposing the data tables would make that library much smaller, but exposing IDNA would probably mostly obsolete it.
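
For instance, a sketch of what an exposed IDNA helper might look like (the name and placement are purely assumptions; no such API exists in this proposal yet):

const { domainToASCII } = require('icu'); // hypothetical
console.log(domainToASCII('bücher.example')); // 'xn--bcher-kva.example'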

@jasnell
Member Author

jasnell commented Jun 8, 2017

I'm generally all for doing something here. Happy to help work through it.

@Trott
Member

Trott commented Jun 12, 2018

Closing as this seems inactive. By all means, comment or re-open if this is still an ongoing concern, although I suspect it should be moved to a more active repository (probably the main node repo) if that's the case.

@Trott Trott closed this Jun 12, 2018