Intl.Segmenter: Unicode segmentation in JavaScript

Translator's preface



This is a translation of the explanatory part of the Intl.Segmenter proposal, which will probably be added to the next ECMAScript specification.



The proposal has already been implemented in V8 and is available without a flag as of version 8.7 (more precisely, 8.7.38 and later), so it can be tested in Google Chrome Canary (from version 87.0.4252.0) or in Node.js V8 Canary (from version v15.0.0-v8-canary202009025a2ca762b8; for Windows, binaries are available from v15.0.0-v8-canary202009173b56586162).



If you test in earlier versions with the --harmony-intl-segmenter flag, be careful: the specification has changed, and the implementation behind the flag may be outdated. Check your results against the output shown in the code examples.



After the translation, you will find links to materials on the problems that this proposal addresses.






Intl.Segmenter: Unicode segmentation in JavaScript



The proposal is at Stage 3 and is championed by Richard Gibson.



Motivation



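A code point is not a «letter» or a unit displayed on the screen. That role belongs to the grapheme, which can consist of several code points (for example, with accent marks or conjoining Korean characters). Unicode defines a grapheme segmentation algorithm for finding the boundaries between graphemes. This can be useful when implementing advanced editors, input methods, or other kinds of text processing.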



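Unicode also defines an algorithm for finding the boundaries between words and sentences, which CLDR (Common Locale Data Repository) tailors for individual locales. These boundaries can be useful, for example, in a text editor that has commands for jumping between or highlighting words and sentences.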



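Grapheme, word, and sentence segmentation is defined in UAX 29. Web browsers need an implementation of this kind of segmentation in order to function, and exposing it to JavaScript saves memory and network bandwidth compared with expecting developers to implement it themselves in JavaScript.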



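For several years Chrome has shipped its own nonstandard segmentation API, Intl.v8BreakIterator. For a number of reasons, that API does not seem suitable for standardization. This document describes a new API that aims to be more in line with modern JavaScript API design, that is, the design that followed ES2015.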







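Objects returned by the segment() method of an Intl.Segmenter instance find boundaries and expose the segments between them through the Iterable interface.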



// Create a locale-specific word segmenter.
let segmenter = new Intl.Segmenter("fr", {granularity: "word"});

// Use it to get an iterable over the segments of a string.
let input = "Moi?  N'est-ce pas.";
let segments = segmenter.segment(input);

// Use that for segmentation!
for (let {segment, index, isWordLike} of segments) {
  console.log("segment at code units [%d, %d): «%s»%s",
    index, index + segment.length,
    segment,
    isWordLike ? " (word-like)" : ""
  );
}

// console.log output:
// segment at code units [0, 3): «Moi» (word-like)
// segment at code units [3, 4): «?»
// segment at code units [4, 6): «  »
// segment at code units [6, 11): «N'est» (word-like)
// segment at code units [11, 12): «-»
// segment at code units [12, 14): «ce» (word-like)
// segment at code units [14, 15): « »
// segment at code units [15, 18): «pas» (word-like)
// segment at code units [18, 19): «.»


In addition, the API supports random access to segments.



// ┃0 1 2 3 4 5┃6┃7┃8┃9
// ┃A l l o n s┃-┃y┃!┃
let input = "Allons-y!";

let segmenter = new Intl.Segmenter("fr", {granularity: "word"});
let segments = segmenter.segment(input);
let current = undefined;

current = segments.containing(0)
// → { index: 0, segment: "Allons", isWordLike: true }

current = segments.containing(5)
// → { index: 0, segment: "Allons", isWordLike: true }

current = segments.containing(6)
// → { index: 6, segment: "-", isWordLike: false }

current = segments.containing(current.index + current.segment.length)
// → { index: 7, segment: "y", isWordLike: true }

current = segments.containing(current.index + current.segment.length)
// → { index: 8, segment: "!", isWordLike: false }

current = segments.containing(current.index + current.segment.length)
// → undefined


API



A polyfill exists for a historical snapshot of this proposal.



new Intl.Segmenter(locale, options)



Creates a new locale-dependent segmenter.



If options is provided, it is treated as an object, and its granularity property specifies the segmenter granularity ("grapheme", "word", or "sentence"; the default is "grapheme").
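As a rough illustration of the three granularities, the sketch below segments one short string three ways. The helper name segmentsOf and the sample text are made up for this example; only Intl.Segmenter, segment(), and resolvedOptions() belong to the API itself.

```javascript
// Hypothetical helper (not part of the proposal): collect the segment strings
// of `text` for a given granularity.
function segmentsOf(text, granularity, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity });
  return Array.from(segmenter.segment(text), (data) => data.segment);
}

const sample = "Hi there. Bye!";

const graphemes = segmentsOf(sample, "grapheme"); // one entry per user-perceived character
const words = segmentsOf(sample, "word");         // words plus the spaces and punctuation between them
const sentences = segmentsOf(sample, "sentence"); // whole sentences, trailing spaces included

console.log(graphemes.length, words.length, sentences.length);

// When no options are passed, the granularity falls back to "grapheme".
const defaultGranularity = new Intl.Segmenter("en").resolvedOptions().granularity;
console.log(defaultGranularity); // → "grapheme"
```

Whatever the granularity, the segments always cover the whole input, so joining them gives back the original string.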



Intl.Segmenter.prototype.segment(string)



Creates a new Iterable %Segments% instance for the input string, using the locale and granularity of the segmenter.





Segments are described by plain objects with the following data properties:



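  • segment: the string segment.
  • index: the code unit index in the string at which the segment begins.
  • input: the string being segmented.
  • isWordLike: true when the granularity is "word" and the segment is word-like (it consists of letters, digits, ideographs, and so on); false when the granularity is "word" and the segment is not word-like (it consists of spaces, punctuation, and so on); undefined when the granularity is not "word".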
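A minimal sketch of how these properties might be used together: it keeps only the word-like segments of a sample string and checks how segment, index, and input relate. The sample text and variable names are illustrative assumptions.

```javascript
const segmenter = new Intl.Segmenter("en", { granularity: "word" });
const text = "Hello, world! 42";

// Keep only the word-like segments; spaces and punctuation are dropped.
const words = [];
for (const data of segmenter.segment(text)) {
  if (data.isWordLike) words.push(data.segment);
}
console.log(words); // → ["Hello", "world", "42"]

// Every segment data object carries the original string in `input`,
// and slicing `input` from `index` gives back `segment`.
const [first] = segmenter.segment(text);
console.log(first.input === text); // → true
console.log(first.input.slice(first.index, first.index + first.segment.length)); // → "Hello"

// With a granularity other than "word", isWordLike is undefined.
const [firstGrapheme] = new Intl.Segmenter("en").segment(text);
console.log(firstGrapheme.isWordLike); // → undefined
```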


Methods of %Segments%.prototype:



%Segments%.prototype.containing(index)



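Returns a segment data object describing the segment of the string that includes the code unit at the specified index, or undefined if the index is out of bounds.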
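One possible use of containing() is looking up the word under a caret position. The sketch below reuses the "Allons-y!" string from the example above; the helper wordAt is hypothetical and not part of the API.

```javascript
// Hypothetical helper: return the word-like segment that covers a code unit
// index (for instance, a caret position), or undefined if there is none.
function wordAt(text, index, locale = "fr") {
  const segments = new Intl.Segmenter(locale, { granularity: "word" }).segment(text);
  const data = segments.containing(index);
  return data && data.isWordLike ? data.segment : undefined;
}

console.log(wordAt("Allons-y!", 2)); // → "Allons"
console.log(wordAt("Allons-y!", 6)); // → undefined (the hyphen is not word-like)
console.log(wordAt("Allons-y!", 7)); // → "y"
console.log(wordAt("Allons-y!", 9)); // → undefined (index 9 is out of bounds)
```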



%Segments%.prototype[Symbol.iterator]



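Creates a new %SegmentIterator% instance that finds segments in the input string lazily (that is, on demand), using the locale and granularity of the segmenter and keeping track of its position within the string.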



Methods of %SegmentIterator%.prototype:



%SegmentIterator%.prototype.next()



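The next() method implements the Iterator interface: it finds the next segment and returns the corresponding IteratorResult object, whose value property is a segment data object as described above.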
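For readers who want to see the iterator protocol without for...of, here is a small sketch that drives next() by hand. The sample string and variable names are arbitrary.

```javascript
const segments = new Intl.Segmenter("en", { granularity: "word" }).segment("Hi you");
const iterator = segments[Symbol.iterator]();

const seen = [];
let result = iterator.next();
while (!result.done) {
  // result.value is a segment data object: { segment, index, input, isWordLike }
  seen.push(`${result.value.index}:${result.value.segment}`);
  result = iterator.next();
}
console.log(seen); // → ["0:Hi", "2: ", "3:you"]

// Each call to [Symbol.iterator]() starts a fresh pass over the string,
// so the same %Segments% object can be iterated again.
const secondPassLength = [...segments].length;
console.log(secondPassLength); // → 3
```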



FAQ



? ?



— , . . . CLDR. , CLDR/ICU , .



API ?



, 3- , . TC39 . ; , , .



Why is line breaking not included?



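Line breaking was part of an earlier version of this API, but it was removed because a standalone line-breaking API is incomplete: real line breaking needs a higher-level API that knows about text layout (for example, one that can measure text). Such an API is expected to come from CSS Houdini.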



Why is hyphenation not included?



Hyphenation is expected to need a differently shaped API, for several reasons:



  • .
  • .
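  • Hyphenation interacts with line layout and font rendering in a more complex way, and it may be better to expose it at that level (that is, in a Web API of the Web Platform rather than in ECMAScript).
  • Hyphenation is a less mature area of internationalization. CLDR and ICU do not support it yet. Some web browsers are only now gaining support for it in CSS. It is often not done perfectly and could use more time to mature. By contrast, grapheme, word, sentence, and line breaks have been part of the Unicode specification for a long time; this project is ready to go.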


?



%SegmentIterator%.prototype, (, seek([inclusiveStartIndex = thisIterator.index + 1]) seekBefore([exclusiveLastIndex = thisIterator.index]), . ECMA-262 ( ). , , .



Why is this an Intl API rather than a set of String methods?



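All of these break types are locale-dependent, and some allow complex options. The result of the segment() method is a SegmentIterator. For many nontrivial cases like this one, analogous APIs are placed on the Intl object of ECMA-402. This allows the work done on each instantiation to be shared, which improves performance. A convenience method on String could be added later, in a follow-on proposal.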
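The point about sharing the work of instantiation can be sketched as follows: one segmenter is created once and then reused for every string. The helper graphemeCount and the sample strings are assumptions made for illustration.

```javascript
// One segmenter instance, created once and reused for every string.
const graphemeSegmenter = new Intl.Segmenter("en", { granularity: "grapheme" });

// Hypothetical helper: count user-perceived characters (graphemes).
function graphemeCount(text) {
  let count = 0;
  for (const _ of graphemeSegmenter.segment(text)) count++;
  return count;
}

const accented = "e\u0301";        // "e" followed by a combining acute accent
const flag = "\u{1F1EB}\u{1F1F7}"; // two regional indicator symbols forming one flag

console.log(accented.length, graphemeCount(accented)); // → 2 1
console.log(flag.length, graphemeCount(flag));         // → 4 1
```

A String convenience method, if one were ever added, would presumably wrap this same pattern.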



What exactly does the index refer to?



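An index n refers to the code unit at which a segment starts. For example, when the string "Hello, world\u{1F499}" is iterated by word in English (the last character is a blue heart emoji), the segments start at indexes 0, 5, 6, 7, and 12. In other words, the string is segmented as ┃Hello┃,┃ ┃world┃\u{1F499}┃, and the last segment is two code units long but a single code point.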
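The numbers in this answer can be checked with a short sketch like the one below, which lists the start index of every word segment and then inspects the final segment.

```javascript
const text = "Hello, world\u{1F499}";
const segmenter = new Intl.Segmenter("en", { granularity: "word" });

// Start index (in code units) of each segment.
const starts = Array.from(segmenter.segment(text), (data) => data.index);
console.log(starts); // → [0, 5, 6, 7, 12]

// The last segment is one code point encoded as two code units.
const last = segmenter.segment(text).containing(12);
console.log(last.segment.length);      // → 2 (code units)
console.log([...last.segment].length); // → 1 (code point)

// Index 13 is the second code unit of that same segment.
const fromSecondUnit = segmenter.segment(text).containing(13);
console.log(fromSecondUnit.index); // → 12
```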



What happens when an empty string is segmented?



In that case no segments are found, and the iterator completes on the first call to next().
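A quick sketch of the empty-string case, under the assumption that the engine follows the current specification:

```javascript
const segments = new Intl.Segmenter("en", { granularity: "word" }).segment("");

const count = [...segments].length;
const firstResult = segments[Symbol.iterator]().next();

console.log(count);                   // → 0
console.log(firstResult.done);        // → true
console.log(segments.containing(0));  // → undefined (index 0 is already out of bounds)
```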



What happens when random access is used with a non-integer index?



, - QA ;)



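Arguments are converted to integer Numbers: null becomes 0, booleans become 0 or 1, strings are parsed as numeric literals, objects are converted to primitives, and Symbol, BigInt, undefined, and NaN fail*. Fractional parts are truncated, and infinite values are accepted (although they are always out of bounds).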



* It is not entirely clear what "fail" means here. In Chrome Canary, Symbol and BigInt throw a TypeError, while undefined and NaN are apparently treated as 0.
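The conversions can be explored with a sketch like this one. The expected values follow the integer conversion in the current specification as far as I understand it, and they match the Chrome Canary behavior described in the note; older builds behind the flag may differ. The helper startOf is hypothetical.

```javascript
const segments = new Intl.Segmenter("fr", { granularity: "word" }).segment("Allons-y!");

// Hypothetical helper: start index of the segment found for `arg`, if any.
const startOf = (arg) => segments.containing(arg)?.index;

console.log(startOf(7.9));       // → 7 (fractional part truncated)
console.log(startOf("6"));       // → 6 (string parsed as a number)
console.log(startOf(null));      // → 0
console.log(startOf(true));      // → 0 (true becomes 1, which lies inside "Allons")
console.log(startOf(undefined)); // → 0
console.log(startOf(NaN));       // → 0
console.log(startOf(Infinity));  // → undefined (out of bounds)
console.log(startOf(-1));        // → undefined (out of bounds)

// BigInt (like Symbol) cannot be converted to a Number and throws.
let bigIntError;
try {
  segments.containing(1n);
} catch (error) {
  bigIntError = error;
}
console.log(bigIntError instanceof TypeError); // → true
```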








Materials on the problems this proposal addresses in JavaScript.



  1. Joel Spolsky. The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
  2. Dmitri Pavlutin. What every JavaScript developer should know about Unicode
  3. Dr. Axel Rauschmayer. JavaScript for impatient programmers: 17. Unicode – a brief introduction
  4. Dr. Axel Rauschmayer. JavaScript for impatient programmers: 18.6. Atoms of text: Unicode characters, JavaScript characters, grapheme clusters
  5. Jonathan New. "\u{1F4A9}".length === 2
  6. Nicolás Bevacqua. ES6 Strings (and Unicode, ❤) in Depth
  7. Mathias Bynens. JavaScript has a Unicode problem
  8. Mathias Bynens. Unicode-aware regular expressions in ECMAScript 6
  9. Mathias Bynens. Unicode property escapes in JavaScript regular expressions
  10. Mathias Bynens. Unicode sequence property escapes
  11. Awesome Unicode: a curated list of delightful Unicode tidbits, packages and resources


