[ES6] Add support for Unicode regular expressions
author msaboff@apple.com <msaboff@apple.com@268f45cc-cd09-0410-ab3c-d52691b4dbfc>
Wed, 2 Mar 2016 00:39:01 +0000 (00:39 +0000)
committer msaboff@apple.com <msaboff@apple.com@268f45cc-cd09-0410-ab3c-d52691b4dbfc>
Wed, 2 Mar 2016 00:39:01 +0000 (00:39 +0000)
https://bugs.webkit.org/show_bug.cgi?id=154842

Reviewed by Filip Pizlo.

Source/JavaScriptCore:

Added processing of Unicode regular expressions to the Yarr interpreter.

Changed parsing of regular expression patterns and PatternTerms to process characters as
UChar32 in the Yarr code.  The parser converts matched surrogate pairs into the appropriate
Unicode character when the expression is parsed.  When matching a unicode expression and
reading source characters, we convert a proper surrogate pair into a Unicode character and
advance the source cursor, "pos", one more position.  The exception is when we already know,
while generating a fixed character atom, that we need to match a unicode character
that doesn't fit in 16 bits.  The code calls this an extendedUnicodeCharacter and has a
helper to determine this.
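
The surrogate pair handling described above can be sketched in JavaScript roughly as follows.  This is
an illustration only, not the Yarr C++ code: readCodePoint, isLeadSurrogate, isTrailSurrogate and
isExtendedUnicodeCharacter are hypothetical names chosen for this sketch.

```javascript
// Hypothetical helpers for illustration; these are not the actual Yarr functions.
function isLeadSurrogate(unit) {
  return unit >= 0xd800 && unit <= 0xdbff;
}

function isTrailSurrogate(unit) {
  return unit >= 0xdc00 && unit <= 0xdfff;
}

// Reads one character at `pos`.  In unicode mode a proper surrogate pair is
// combined into one code point and the cursor advances one extra position.
function readCodePoint(input, pos, isUnicode) {
  const first = input.charCodeAt(pos);
  if (isUnicode && isLeadSurrogate(first) && pos + 1 < input.length) {
    const second = input.charCodeAt(pos + 1);
    if (isTrailSurrogate(second)) {
      const codePoint = 0x10000 + ((first - 0xd800) << 10) + (second - 0xdc00);
      return { codePoint, nextPos: pos + 2 };
    }
  }
  // Non-unicode mode, a BMP character, or a dangling surrogate: one code unit.
  return { codePoint: first, nextPos: pos + 1 };
}

// True when the character does not fit in 16 bits.
function isExtendedUnicodeCharacter(codePoint) {
  return codePoint > 0xffff;
}

const decoded = readCodePoint("\ud801\udc4f", 0, true);
console.log(decoded.codePoint.toString(16), decoded.nextPos); // 1044f 2
```

The pair \ud801\udc4f is the same one used in the new regexp-unicode.js test, where it is expected
to be equivalent to \u{1044f}.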

Added 'u' flag and 'unicode' identifier to regular expression classes.  Added an "isUnicode"
parameter to YarrPattern pattern() and internal users of that function.
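
As a rough illustration of the script-visible side of this change, the snippet below shows the 'u'
flag and the "unicode" property as seen from JavaScript.  The expected values mirror the
regexp-flags.js checks updated in this commit; the variable names are made up for the example, and
it assumes an engine with ES6 'u' flag support.

```javascript
// Flags may be written in any order; the `flags` getter reports them in a fixed order.
const literal = /a/uimg;
const constructed = new RegExp("a", "uimg");

console.log(literal.unicode);    // true
console.log(literal.flags);      // "gimu"
console.log(constructed.flags);  // "gimu"
console.log(/a/gim.unicode);     // false

// The getter lives on the prototype, which is why the
// Object.getOwnPropertyNames expectations gain a 'unicode' entry.
const hasUnicodeGetter = Object.getOwnPropertyNames(RegExp.prototype).includes("unicode");
console.log(hasUnicodeGetter);   // true
```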

Updated the generation of the canonicalization tables to include a new set of tables that
follow ES 6.0, 21.2.2.8.2 Step 2.  Renamed the YarrCanonicalizeUCS2.* files to
YarrCanonicalizeUnicode.*.

Added a new LayoutTests/js test that tests the added functionality.  Updated other tests that
have minor es6 unicode checks and look for valid flags.

Ran the ChakraCore Unicode regular expression tests as well.

* CMakeLists.txt:
* JavaScriptCore.vcxproj/JavaScriptCore.vcxproj:
* JavaScriptCore.vcxproj/JavaScriptCore.vcxproj.filters:
* JavaScriptCore.xcodeproj/project.pbxproj:

* inspector/ContentSearchUtilities.cpp:
(Inspector::ContentSearchUtilities::findMagicComment):
* yarr/RegularExpression.cpp:
(JSC::Yarr::RegularExpression::Private::compile):
Updated use of pattern().

* runtime/CommonIdentifiers.h:
* runtime/RegExp.cpp:
(JSC::regExpFlags):
(JSC::RegExpFunctionalTestCollector::outputOneTest):
(JSC::RegExp::finishCreation):
(JSC::RegExp::compile):
(JSC::RegExp::compileMatchOnly):
* runtime/RegExp.h:
* runtime/RegExpKey.h:
* runtime/RegExpPrototype.cpp:
(JSC::regExpProtoFuncCompile):
(JSC::flagsString):
(JSC::regExpProtoGetterMultiline):
(JSC::regExpProtoGetterUnicode):
(JSC::regExpProtoGetterFlags):
Updated for new 'u' (unicode) flag.  Added a check to use the interpreter for unicode regular expressions.

* tests/es6.yaml:
* tests/stress/static-getter-in-names.js:
Updated tests for new flag and for passing the minimal es6 regular expression processing.

* yarr/Yarr.h: Updated the size of information now kept for backtracking.

* yarr/YarrCanonicalizeUCS2.cpp: Removed.
* yarr/YarrCanonicalizeUCS2.h: Removed.
* yarr/YarrCanonicalizeUCS2.js: Removed.
* yarr/YarrCanonicalizeUnicode.cpp: Copied from Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.cpp.
* yarr/YarrCanonicalizeUnicode.h: Copied from Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.h.
(JSC::Yarr::canonicalCharacterSetInfo):
(JSC::Yarr::canonicalRangeInfoFor):
(JSC::Yarr::getCanonicalPair):
(JSC::Yarr::isCanonicallyUnique):
(JSC::Yarr::areCanonicallyEquivalent):
(JSC::Yarr::rangeInfoFor): Deleted.
* yarr/YarrCanonicalizeUnicode.js: Copied from Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.js.
(printHeader):
(printFooter):
(hex):
(canonicalize):
(canonicalizeUnicode):
(createUCS2CanonicalGroups):
(createUnicodeCanonicalGroups):
(cu.in.groupedCanonically.characters.sort): Deleted.
(cu.in.groupedCanonically.else): Deleted.
Refactored to output two sets of tables, one for UCS2 and one for Unicode.  The UCS2 tables follow
the legacy canonicalization rules now specified in ES 6.0, 21.2.2.8.2 Step 3.  The new Unicode
tables follow the rules specified in ES 6.0, 21.2.2.8.2 Step 2.  Eliminated the unused Latin1 tables.
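
The difference between the two rule sets can be sketched as follows.  canonicalizeUCS2 here is a
hypothetical JavaScript rendering of the legacy rule in ES 6.0, 21.2.2.8.2 Step 3, written for this
example; it is not the table generator in YarrCanonicalizeUnicode.js.  Step 2 depends on the simple
case folding data in the Unicode CaseFolding.txt file, so it is only shown through its observable
effect on a 'u' flagged expression.

```javascript
// Legacy (non-unicode) canonicalization of a single UTF-16 code unit,
// following ES 6.0, 21.2.2.8.2 Step 3.  Hypothetical helper for illustration.
function canonicalizeUCS2(ch) {
  const upper = ch.toUpperCase();
  // An uppercase form that is not exactly one code unit is ignored.
  if (upper.length !== 1) return ch;
  // A non-ASCII character is never mapped onto an ASCII one.
  if (ch.charCodeAt(0) >= 128 && upper.charCodeAt(0) < 128) return ch;
  return upper;
}

console.log(canonicalizeUCS2("k") === "K");            // true
console.log(canonicalizeUCS2("\u212a") === "\u212a");  // true: KELVIN SIGN stays as-is

// Observable consequence: KELVIN SIGN (U+212A) matches "k" only with the
// 'u' flag, because simple case folding maps U+212A to U+006B.
const kelvinLegacy = /\u212a/i.test("k");
const kelvinUnicode = /\u212a/iu.test("k");
console.log(kelvinLegacy, kelvinUnicode);              // false true
```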

* yarr/YarrInterpreter.cpp:
(JSC::Yarr::Interpreter::InputStream::InputStream):
(JSC::Yarr::Interpreter::InputStream::readChecked):
(JSC::Yarr::Interpreter::InputStream::readSurrogatePairChecked):
(JSC::Yarr::Interpreter::InputStream::reread):
(JSC::Yarr::Interpreter::InputStream::prev):
(JSC::Yarr::Interpreter::testCharacterClass):
(JSC::Yarr::Interpreter::checkCharacter):
(JSC::Yarr::Interpreter::checkSurrogatePair):
(JSC::Yarr::Interpreter::checkCasedCharacter):
(JSC::Yarr::Interpreter::tryConsumeBackReference):
(JSC::Yarr::Interpreter::backtrackPatternCharacter):
(JSC::Yarr::Interpreter::matchCharacterClass):
(JSC::Yarr::Interpreter::backtrackCharacterClass):
(JSC::Yarr::Interpreter::matchParenthesesTerminalEnd):
(JSC::Yarr::Interpreter::matchDisjunction):
(JSC::Yarr::Interpreter::Interpreter):
(JSC::Yarr::ByteCompiler::assertionWordBoundary):
(JSC::Yarr::ByteCompiler::atomPatternCharacter):
* yarr/YarrInterpreter.h:
(JSC::Yarr::ByteTerm::ByteTerm):
(JSC::Yarr::BytecodePattern::BytecodePattern):
* yarr/YarrJIT.cpp:
(JSC::Yarr::YarrGenerator::optimizeAlternative):
(JSC::Yarr::YarrGenerator::matchCharacterClassRange):
(JSC::Yarr::YarrGenerator::matchCharacterClass):
(JSC::Yarr::YarrGenerator::notAtEndOfInput):
(JSC::Yarr::YarrGenerator::jumpIfCharNotEquals):
(JSC::Yarr::YarrGenerator::generatePatternCharacterOnce):
(JSC::Yarr::YarrGenerator::generatePatternCharacterFixed):
(JSC::Yarr::YarrGenerator::generatePatternCharacterGreedy):
(JSC::Yarr::YarrGenerator::backtrackPatternCharacterNonGreedy):
* yarr/YarrParser.h:
(JSC::Yarr::Parser::CharacterClassParserDelegate::atomPatternCharacter):
(JSC::Yarr::Parser::Parser):
(JSC::Yarr::Parser::parseEscape):
(JSC::Yarr::Parser::consumePossibleSurrogatePair):
(JSC::Yarr::Parser::parseCharacterClass):
(JSC::Yarr::Parser::parseTokens):
(JSC::Yarr::Parser::parse):
(JSC::Yarr::Parser::atEndOfPattern):
(JSC::Yarr::Parser::patternRemaining):
(JSC::Yarr::Parser::peek):
(JSC::Yarr::parse):
* yarr/YarrPattern.cpp:
(JSC::Yarr::CharacterClassConstructor::CharacterClassConstructor):
(JSC::Yarr::CharacterClassConstructor::append):
(JSC::Yarr::CharacterClassConstructor::putChar):
(JSC::Yarr::CharacterClassConstructor::putUnicodeIgnoreCase):
(JSC::Yarr::CharacterClassConstructor::putRange):
(JSC::Yarr::CharacterClassConstructor::charClass):
(JSC::Yarr::CharacterClassConstructor::addSorted):
(JSC::Yarr::CharacterClassConstructor::addSortedRange):
(JSC::Yarr::YarrPatternConstructor::YarrPatternConstructor):
(JSC::Yarr::YarrPatternConstructor::assertionWordBoundary):
(JSC::Yarr::YarrPatternConstructor::atomPatternCharacter):
(JSC::Yarr::YarrPatternConstructor::atomCharacterClassBegin):
(JSC::Yarr::YarrPatternConstructor::atomCharacterClassAtom):
(JSC::Yarr::YarrPatternConstructor::atomCharacterClassRange):
(JSC::Yarr::YarrPatternConstructor::setupAlternativeOffsets):
(JSC::Yarr::YarrPattern::compile):
(JSC::Yarr::YarrPattern::YarrPattern):
* yarr/YarrPattern.h:
(JSC::Yarr::CharacterRange::CharacterRange):
(JSC::Yarr::CharacterClass::CharacterClass):
(JSC::Yarr::PatternTerm::PatternTerm):
(JSC::Yarr::YarrPattern::reset):
* yarr/YarrSyntaxChecker.cpp:
(JSC::Yarr::SyntaxChecker::assertionBOL):
(JSC::Yarr::SyntaxChecker::assertionEOL):
(JSC::Yarr::SyntaxChecker::assertionWordBoundary):
(JSC::Yarr::SyntaxChecker::atomPatternCharacter):
(JSC::Yarr::SyntaxChecker::atomBuiltInCharacterClass):
(JSC::Yarr::SyntaxChecker::atomCharacterClassBegin):
(JSC::Yarr::SyntaxChecker::atomCharacterClassAtom):
(JSC::Yarr::checkSyntax):

LayoutTests:

Added a new test for the added unicode regular expression processing.
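
A condensed sketch of the kind of behavior the new test covers, contrasting the same pattern with
and without the 'u' flag.  The character-class expectations are taken from regexp-unicode.js; the
variable names are made up for this example, and the non-'u' dot case is added here for contrast.

```javascript
// U+1D306 is outside the BMP, so it occupies two UTF-16 code units.
const tetragram = "\u{1d306}";

// Without 'u' the class holds two separate surrogate code units plus "a",
// so only the first code unit is matched.
const legacyClass = new RegExp("[" + tetragram + "a]");
// With 'u' the class holds the single code point U+1D306 plus "a".
const unicodeClass = new RegExp("[" + tetragram + "a]", "u");

console.log(tetragram.match(legacyClass)[0].length);   // 1
console.log(tetragram.match(unicodeClass)[0].length);  // 2

// "." consumes a whole code point only with 'u'.
console.log(/^.$/.test(tetragram));   // false
console.log(/^.$/u.test(tetragram));  // true
```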

Updated several tests for the u flag changes and "unicode" property.

* js/regexp-unicode-expected.txt: Added.
* js/regexp-unicode.html: Added.
* js/script-tests/regexp-unicode.js: Added.
New test.

* js/Object-getOwnPropertyNames-expected.txt:
* js/regexp-flags-expected.txt:
* js/script-tests/Object-getOwnPropertyNames.js:
* js/script-tests/regexp-flags.js:
(RegExp.prototype.hasOwnProperty):
Updated tests.

git-svn-id: https://svn.webkit.org/repository/webkit/trunk@197426 268f45cc-cd09-0410-ab3c-d52691b4dbfc

35 files changed:
LayoutTests/ChangeLog
LayoutTests/js/Object-getOwnPropertyNames-expected.txt
LayoutTests/js/regexp-flags-expected.txt
LayoutTests/js/regexp-unicode-expected.txt [new file with mode: 0644]
LayoutTests/js/regexp-unicode.html [new file with mode: 0644]
LayoutTests/js/script-tests/Object-getOwnPropertyNames.js
LayoutTests/js/script-tests/regexp-flags.js
LayoutTests/js/script-tests/regexp-unicode.js [new file with mode: 0644]
Source/JavaScriptCore/CMakeLists.txt
Source/JavaScriptCore/ChangeLog
Source/JavaScriptCore/JavaScriptCore.vcxproj/JavaScriptCore.vcxproj
Source/JavaScriptCore/JavaScriptCore.vcxproj/JavaScriptCore.vcxproj.filters
Source/JavaScriptCore/JavaScriptCore.xcodeproj/project.pbxproj
Source/JavaScriptCore/inspector/ContentSearchUtilities.cpp
Source/JavaScriptCore/runtime/CommonIdentifiers.h
Source/JavaScriptCore/runtime/RegExp.cpp
Source/JavaScriptCore/runtime/RegExp.h
Source/JavaScriptCore/runtime/RegExpKey.h
Source/JavaScriptCore/runtime/RegExpPrototype.cpp
Source/JavaScriptCore/tests/es6.yaml
Source/JavaScriptCore/tests/stress/static-getter-in-names.js
Source/JavaScriptCore/yarr/RegularExpression.cpp
Source/JavaScriptCore/yarr/Yarr.h
Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.cpp [deleted file]
Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.js [deleted file]
Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.cpp [new file with mode: 0644]
Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.h [moved from Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.h with 65% similarity]
Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.js [new file with mode: 0644]
Source/JavaScriptCore/yarr/YarrInterpreter.cpp
Source/JavaScriptCore/yarr/YarrInterpreter.h
Source/JavaScriptCore/yarr/YarrJIT.cpp
Source/JavaScriptCore/yarr/YarrParser.h
Source/JavaScriptCore/yarr/YarrPattern.cpp
Source/JavaScriptCore/yarr/YarrPattern.h
Source/JavaScriptCore/yarr/YarrSyntaxChecker.cpp

index fd9525c..eae6c9d 100644 (file)
@@ -1,3 +1,26 @@
+2016-03-01  Michael Saboff  <msaboff@apple.com>
+
+        [ES6] Add support for Unicode regular expressions
+        https://bugs.webkit.org/show_bug.cgi?id=154842
+
+        Reviewed by Filip Pizlo.
+
+        Added a new test for the added unicode regular expression processing.
+
+        Updated several tests for the u flag changes and "unicode" property.
+
+        * js/regexp-unicode-expected.txt: Added.
+        * js/regexp-unicode.html: Added.
+        * js/script-tests/regexp-unicode.js: Added.
+        New test.
+
+        * js/Object-getOwnPropertyNames-expected.txt:
+        * js/regexp-flags-expected.txt:
+        * js/script-tests/Object-getOwnPropertyNames.js:
+        * js/script-tests/regexp-flags.js:
+        (RegExp.prototype.hasOwnProperty):
+        Updated tests.
+
 2016-03-01  Ryan Haddad  <ryanhaddad@apple.com>
 
         Marking fast/text/crash-complex-text-surrogate.html as flaky on mac
index 66687e8..5b9b8a1 100644 (file)
@@ -56,7 +56,7 @@ PASS getSortedOwnPropertyNames(Number.prototype) is ['constructor', 'toExponenti
 PASS getSortedOwnPropertyNames(Date) is ['UTC', 'length', 'name', 'now', 'parse', 'prototype']
 PASS getSortedOwnPropertyNames(Date.prototype) is ['constructor', 'getDate', 'getDay', 'getFullYear', 'getHours', 'getMilliseconds', 'getMinutes', 'getMonth', 'getSeconds', 'getTime', 'getTimezoneOffset', 'getUTCDate', 'getUTCDay', 'getUTCFullYear', 'getUTCHours', 'getUTCMilliseconds', 'getUTCMinutes', 'getUTCMonth', 'getUTCSeconds', 'getYear', 'setDate', 'setFullYear', 'setHours', 'setMilliseconds', 'setMinutes', 'setMonth', 'setSeconds', 'setTime', 'setUTCDate', 'setUTCFullYear', 'setUTCHours', 'setUTCMilliseconds', 'setUTCMinutes', 'setUTCMonth', 'setUTCSeconds', 'setYear', 'toDateString', 'toGMTString', 'toISOString', 'toJSON', 'toLocaleDateString', 'toLocaleString', 'toLocaleTimeString', 'toString', 'toTimeString', 'toUTCString', 'valueOf']
 PASS getSortedOwnPropertyNames(RegExp) is ['$&', "$'", '$*', '$+', '$1', '$2', '$3', '$4', '$5', '$6', '$7', '$8', '$9', '$_', '$`', 'input', 'lastMatch', 'lastParen', 'leftContext', 'length', 'multiline', 'name', 'prototype', 'rightContext']
-PASS getSortedOwnPropertyNames(RegExp.prototype) is ['compile', 'constructor', 'exec', 'flags', 'global', 'ignoreCase', 'lastIndex', 'multiline', 'source', 'test', 'toString']
+PASS getSortedOwnPropertyNames(RegExp.prototype) is ['compile', 'constructor', 'exec', 'flags', 'global', 'ignoreCase', 'lastIndex', 'multiline', 'source', 'test', 'toString', 'unicode']
 PASS getSortedOwnPropertyNames(Error) is ['length', 'name', 'prototype']
 PASS getSortedOwnPropertyNames(Error.prototype) is ['constructor', 'message', 'name', 'toString']
 PASS getSortedOwnPropertyNames(Math) is ['E','LN10','LN2','LOG10E','LOG2E','PI','SQRT1_2','SQRT2','abs','acos','acosh','asin','asinh','atan','atan2','atanh','cbrt','ceil','clz32','cos','cosh','exp','expm1','floor','fround','hypot','imul','log','log10','log1p','log2','max','min','pow','random','round','sign','sin','sinh','sqrt','tan','tanh','trunc']
index 64ae8bb..8c4a803 100644 (file)
@@ -23,6 +23,10 @@ PASS flags.call({}) is ''
 PASS flags.call({global: true, multiline: true, ignoreCase: true}) is 'gim'
 PASS flags.call({global: 1, multiline: 0, ignoreCase: 2}) is 'gi'
 PASS flags.call({ __proto__: { multiline: true } }) is 'm'
+unicode flag
+PASS /a/uimg.flags is 'gimu'
+PASS new RegExp('a', 'uimg').flags is 'gimu'
+PASS flags.call({global: true, multiline: true, ignoreCase: true, unicode: true}) is 'gimu'
 PASS successfullyParsed is true
 
 TEST COMPLETE
diff --git a/LayoutTests/js/regexp-unicode-expected.txt b/LayoutTests/js/regexp-unicode-expected.txt
new file mode 100644 (file)
index 0000000..a4ed10f
--- /dev/null
@@ -0,0 +1,97 @@
+Test for unicode regular expression processing
+
+On success, you will see a series of "PASS" messages, followed by "TEST COMPLETE".
+
+
+PASS "a".match(/a/)[0].length is 1
+PASS "a".match(/A/i)[0].length is 1
+PASS "a".match(/a/u)[0].length is 1
+PASS "a".match(/A/iu)[0].length is 1
+PASS "Ȓ".match(/Ȓ/)[0].length is 1
+PASS "Ȓ".match(/Ȓ/u)[0].length is 1
+PASS "ሴ".match(/ሴ/)[0].length is 1
+PASS "ሴ".match(/ሴ/u)[0].length is 1
+PASS "⪼".match(/⪼/)[0].length is 1
+PASS "㿭".match(/㿭/u)[0].length is 1
+PASS "𒍅".match(/𒍅/u)[0].length is 2
+PASS "𒍅".match(/𒍅/u)[0].length is 2
+PASS "𝌆".match(/𝌆/)[0].length is 2
+PASS /𐑏/u.test("𐑏") is true
+PASS /𐑏/u.test("𐑏") is true
+PASS "𝌆".match(/𝌆/u)[0].length is 2
+PASS /(𐀀|𐐀|𐐩)/u.test("𐐀") is true
+PASS "𐄣".match(/a|𐄣|b/u)[0].length is 2
+PASS "b".match(/a|𐄣|b/u)[0].length is 1
+PASS /(?:a|𐄣|b)x/u.test("𐄣") is false
+PASS /(?:a|𐄣|b)x/u.test("𐄣x") is true
+PASS /(?:a|𐄣|b)x/u.test("b") is false
+PASS /(?:a|𐄣|b)x/u.test("bx") is true
+PASS "a𐄣x".match(/a𐄣b|a𐄣x/u)[0].length is 4
+PASS /(𐀀|𐐀|𐐩)x/ui.test("𐐀x") is true
+PASS /(𐀀|𐐀|𐐩)x/ui.test("𐐩x") is true
+PASS /(𐀀|𐐀|𐐩)x/ui.test("𐐁x") is true
+PASS /(𐀀|𐐀|𐐩)x/ui.test("𐐨x") is true
+PASS "𐐩".match(/a|𐐁|b/iu)[0].length is 2
+PASS "B".match(/a|𐄣|b/iu)[0].length is 1
+PASS /(?:A|𐄣|b)x/iu.test("𐄣") is false
+PASS /(?:A|𐄣|b)x/iu.test("𐄣x") is true
+PASS /(?:A|𐄣|b)x/iu.test("b") is false
+PASS /(?:A|𐄣|b)x/iu.test("bx") is true
+PASS "a𐄣X".match(/a𐄣b|a𐄣x/iu)[0].length is 4
+PASS "Ťx".match(/ťx/iu)[0].length is 2
+PASS "𝌆".match(/^.$/u)[0].length is 2
+PASS "It is 78°".match(/.*/u)[0].length is 9
+PASS "𝌆".match(/[𝌆a]/)[0].length is 1
+PASS "𝌆".match(/[a𝌆]/u)[0].length is 2
+PASS "𝌆".match(/[𝌆a]/u)[0].length is 2
+PASS "𝌆".match(/[a-𝌆]/)[0].length is 1
+PASS "𝌆".match(/[a-𝌆]/u)[0].length is 2
+PASS "X".match(/[ -𐑏]/u)[0].length is 1
+PASS "က".match(/[ -𐑏]/u)[0].length is 1
+PASS "𐐧".match(/[ -𐑏]/u)[0].length is 2
+PASS re1.test("Z") is false
+PASS re1.test("က") is false
+PASS re1.test("𐐀") is false
+PASS re2.test("A") is true
+PASS re2.test("￿") is false
+PASS re2.test("𒍅") is true
+PASS "𐌑𐌑𐌑".match(/𐌑*a|𐌑*./u)[0] is "𐌑𐌑𐌑"
+PASS "a𐌑𐌑".match(/a𐌑*?$/u)[0] is "a𐌑𐌑"
+PASS "a𐌑𐌑𐌑c".match(/a𐌑*cd|a𐌑*c/u)[0] is "a𐌑𐌑𐌑c"
+PASS "a𐌑𐌑𐌑c".match(/a𐌑+cd|a𐌑+c/u)[0] is "a𐌑𐌑𐌑c"
+PASS "𐌑𐌑𐌑".match(/𐌑+?a|𐌑+?./u)[0] is "𐌑𐌑"
+PASS "𐌑𐌑𐌑".match(/𐌑+?a|𐌑+?$/u)[0] is "𐌑𐌑𐌑"
+PASS "a𐌑𐌑𐌑c".match(/a𐌑*?cd|a𐌑*?c/u)[0] is "a𐌑𐌑𐌑c"
+PASS "a𐌑𐌑𐌑c".match(/a𐌑+?cd|a𐌑+?c/u)[0] is "a𐌑𐌑𐌑c"
+PASS "𐌑𐌑𐌑".match(/𐌑+?a|𐌑+?./iu)[0] is "𐌑𐌑"
+PASS "𐐪𐐪𐌑".match(/𐐂*𐈀|𐐂*𐌑/iu)[0] is "𐐪𐐪𐌑"
+PASS "𐐪𐐪𐌑".match(/𐐂+𐈀|𐐂+𐌑/iu)[0] is "𐐪𐐪𐌑"
+PASS "𐐪𐐪𐌑".match(/𐐂*?𐈀|𐐂*?𐌑/iu)[0] is "𐐪𐐪𐌑"
+PASS "𐐪𐐪𐌑".match(/𐐂+?𐈀|𐐂+?𐌑/iu)[0] is "𐐪𐐪𐌑"
+PASS "ab𐌑c𐨁".match(/abc|ab𐌑cd|ab𐌑c𐨁d|ab𐌑c𐨁/u)[0] is "ab𐌑c𐨁"
+PASS "ab𐐨c𐨁".match(/abc|ab𐐀cd|ab𐐀c𐨁d|ab𐐀c𐨁/iu)[0] is "ab𐐨c𐨁"
+PASS /abc|ab𐐀cd|ab𐐀c𐨁d|ab𐐀c𐨁/iu.test("qwerty123") is false
+PASS "a𐐨𐐨𐐨c".match(/ac|a𐐀*cd|a𐐀+cd|a𐐀+c/iu)[0] is "a𐐨𐐨𐐨c"
+PASS "ab𐐨𐐨𐐨c𐨁".match(/abc|ab𐐀*cd|ab𐐀+c𐨁d|ab𐐀+c𐨁/iu)[0] is "ab𐐨𐐨𐐨c𐨁"
+PASS "ab𐐨𐐨𐐨".match(/abc|ab𐐨*./u)[0] is "ab𐐨𐐨𐐨"
+PASS "ab𐐨𐐨𐐨".match(/abc|ab𐐀*./iu)[0] is "ab𐐨𐐨𐐨"
+PASS match3[0] is "a𐐐𐐐b"
+PASS match3[1] is undefined.
+PASS match3[2] is "a𐐐𐐐b"
+PASS match4[0] is "a𐐸𐐸b"
+PASS match4[1] is undefined.
+PASS match4[2] is "𐐸𐐸"
+PASS match5[0] is "a𐐒𐐒b𐐒𐐒"
+PASS match5[1] is undefined.
+PASS match5[2] is "𐐒𐐒"
+PASS match6[0] is "a𐐒𐐒b𐐺𐐒"
+PASS match6[1] is undefined.
+PASS match6[2] is "𐐒𐐒"
+PASS /ẚbc/ui.test("abc") is true
+PASS /abc/ui.test("ẚbc") is true
+PASS /texẗ/ui.test("text") is true
+PASS /text/ui.test("ẗext") is true
+PASS successfullyParsed is true
+
+TEST COMPLETE
+
diff --git a/LayoutTests/js/regexp-unicode.html b/LayoutTests/js/regexp-unicode.html
new file mode 100644 (file)
index 0000000..6352ac4
--- /dev/null
@@ -0,0 +1,10 @@
+<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
+<html>
+<head>
+<script src="../resources/js-test-pre.js"></script>
+</head>
+<body>
+<script src="script-tests/regexp-unicode.js"></script>
+<script src="../resources/js-test-post.js"></script>
+</body>
+</html>
index 3c742f0..924ab05 100644 (file)
@@ -65,7 +65,7 @@ var expectedPropertyNamesSet = {
     "Date": "['UTC', 'length', 'name', 'now', 'parse', 'prototype']",
     "Date.prototype": "['constructor', 'getDate', 'getDay', 'getFullYear', 'getHours', 'getMilliseconds', 'getMinutes', 'getMonth', 'getSeconds', 'getTime', 'getTimezoneOffset', 'getUTCDate', 'getUTCDay', 'getUTCFullYear', 'getUTCHours', 'getUTCMilliseconds', 'getUTCMinutes', 'getUTCMonth', 'getUTCSeconds', 'getYear', 'setDate', 'setFullYear', 'setHours', 'setMilliseconds', 'setMinutes', 'setMonth', 'setSeconds', 'setTime', 'setUTCDate', 'setUTCFullYear', 'setUTCHours', 'setUTCMilliseconds', 'setUTCMinutes', 'setUTCMonth', 'setUTCSeconds', 'setYear', 'toDateString', 'toGMTString', 'toISOString', 'toJSON', 'toLocaleDateString', 'toLocaleString', 'toLocaleTimeString', 'toString', 'toTimeString', 'toUTCString', 'valueOf']",
     "RegExp": "['$&', \"$'\", '$*', '$+', '$1', '$2', '$3', '$4', '$5', '$6', '$7', '$8', '$9', '$_', '$`', 'input', 'lastMatch', 'lastParen', 'leftContext', 'length', 'multiline', 'name', 'prototype', 'rightContext']",
-    "RegExp.prototype": "['compile', 'constructor', 'exec', 'flags', 'global', 'ignoreCase', 'lastIndex', 'multiline', 'source', 'test', 'toString']",
+    "RegExp.prototype": "['compile', 'constructor', 'exec', 'flags', 'global', 'ignoreCase', 'lastIndex', 'multiline', 'source', 'test', 'toString', 'unicode']",
     "Error": "['length', 'name', 'prototype']",
     "Error.prototype": "['constructor', 'message', 'name', 'toString']",
     "Math": "['E','LN10','LN2','LOG10E','LOG2E','PI','SQRT1_2','SQRT2','abs','acos','acosh','asin','asinh','atan','atan2','atanh','cbrt','ceil','clz32','cos','cosh','exp','expm1','floor','fround','hypot','imul','log','log10','log1p','log2','max','min','pow','random','round','sign','sin','sinh','sqrt','tan','tanh','trunc']",
index d45aad9..8124635 100644 (file)
@@ -28,6 +28,11 @@ shouldBe("flags.call({global: 1, multiline: 0, ignoreCase: 2})", "'gi'");
 // inherited properties count
 shouldBe("flags.call({ __proto__: { multiline: true } })", "'m'");
 
+debug("unicode flag");
+shouldBe("/a/uimg.flags", "'gimu'");
+shouldBe("new RegExp('a', 'uimg').flags", "'gimu'");
+shouldBe("flags.call({global: true, multiline: true, ignoreCase: true, unicode: true})", "'gimu'");
+
 if (RegExp.prototype.hasOwnProperty('sticky')) {
   debug("sticky flag");
   // when the engine supports "sticky", these tests will fail by design.
@@ -36,11 +41,3 @@ if (RegExp.prototype.hasOwnProperty('sticky')) {
   shouldBe("new RegExp('a', 'yimg').flags", "'gimy'");
   shouldBe("flags.call({global: true, multiline: true, ignoreCase: true, sticky: true})", "'gimy'");
 }
-if (RegExp.prototype.hasOwnProperty('unicode')) {
-  debug("unicode flag");
-  // when the engine supports "unicode", these tests will fail by design.
-  // Hopefully, only the expected output will need updating.
-  shouldBe("/a/uimg.flags", "'gimu'");
-  shouldBe("new RegExp('a', 'uimg').flags", "'gimu'");
-  shouldBe("flags.call({global: true, multiline: true, ignoreCase: true, unicode: true})", "'gimu'");
-}
diff --git a/LayoutTests/js/script-tests/regexp-unicode.js b/LayoutTests/js/script-tests/regexp-unicode.js
new file mode 100644 (file)
index 0000000..fadf623
--- /dev/null
@@ -0,0 +1,142 @@
+description(
+'Test for unicode regular expression processing'
+);
+
+// Test \u{} escapes in a regular expression
+shouldBe('"a".match(/\u{61}/)[0].length', '1');
+shouldBe('"a".match(/\u{41}/i)[0].length', '1');
+shouldBe('"a".match(/\u{061}/u)[0].length', '1');
+shouldBe('"a".match(/\u{041}/iu)[0].length', '1');
+shouldBe('"\u{212}".match(/\u{212}/)[0].length', '1');
+shouldBe('"\u{212}".match(/\u{0212}/u)[0].length', '1');
+shouldBe('"\u{1234}".match(/\u{1234}/)[0].length', '1');
+shouldBe('"\u{1234}".match(/\u{01234}/u)[0].length', '1');
+shouldBe('"\u{2abc}".match(/\u{2abc}/)[0].length', '1');
+shouldBe('"\u{03fed}".match(/\u{03fed}/u)[0].length', '1');
+shouldBe('"\u{12345}".match(/\u{12345}/u)[0].length', '2');
+shouldBe('"\u{12345}".match(/\u{012345}/u)[0].length', '2');
+shouldBe('"\u{1d306}".match(/\u{1d306}/)[0].length', '2');
+shouldBeTrue('/\u{1044f}/u.test("\ud801\udc4f")');
+shouldBeTrue('/\ud801\udc4f/u.test("\u{1044f}")');
+
+// Test basic unicode flag processing
+shouldBe('"\u{1d306}".match(/\u{1d306}/u)[0].length', '2');
+shouldBeTrue('/(\u{10000}|\u{10400}|\u{10429})/u.test("\u{10400}")');
+shouldBe('"\u{10123}".match(/a|\u{10123}|b/u)[0].length', '2');
+shouldBe('"b".match(/a|\u{10123}|b/u)[0].length', '1');
+shouldBeFalse('/(?:a|\u{10123}|b)x/u.test("\u{10123}")');
+shouldBeTrue('/(?:a|\u{10123}|b)x/u.test("\u{10123}x")');
+shouldBeFalse('/(?:a|\u{10123}|b)x/u.test("b")');
+shouldBeTrue('/(?:a|\u{10123}|b)x/u.test("bx")');
+shouldBe('"a\u{10123}x".match(/a\u{10123}b|a\u{10123}x/u)[0].length', '4');
+
+// Test unicode flag with ignore case
+shouldBeTrue('/(\u{10000}|\u{10400}|\u{10429})x/ui.test("\u{10400}x")');
+shouldBeTrue('/(\u{10000}|\u{10400}|\u{10429})x/ui.test("\u{10429}x")');
+shouldBeTrue('/(\u{10000}|\u{10400}|\u{10429})x/ui.test("\u{10401}x")');
+shouldBeTrue('/(\u{10000}|\u{10400}|\u{10429})x/ui.test("\u{10428}x")');
+shouldBe('"\u{10429}".match(/a|\u{10401}|b/iu)[0].length', '2');
+shouldBe('"B".match(/a|\u{10123}|b/iu)[0].length', '1');
+shouldBeFalse('/(?:A|\u{10123}|b)x/iu.test("\u{10123}")');
+shouldBeTrue('/(?:A|\u{10123}|b)x/iu.test("\u{10123}x")');
+shouldBeFalse('/(?:A|\u{10123}|b)x/iu.test("b")');
+shouldBeTrue('/(?:A|\u{10123}|b)x/iu.test("bx")');
+shouldBe('"a\u{10123}X".match(/a\u{10123}b|a\u{10123}x/iu)[0].length', '4');
+shouldBe('"\u0164x".match(/\u0165x/iu)[0].length', '2');
+
+// Test . matches with Unicode flag
+shouldBe('"\u{1D306}".match(/^.$/u)[0].length', '2');
+shouldBe('"It is 78\u00B0".match(/.*/u)[0].length', '9');
+// FIXME: These tests are disabled until https://bugs.webkit.org/show_bug.cgi?id=154863 is fixed
+// shouldBe('"\ud801XXX".match(/.*/u)[0].length', '4'); // We should match a dangling first surrogate as 1 character
+// shouldBe('"X\udfffXX".match(/.*/u)[0].length', '4'); // We should match a dangling second surrogate as 1 character
+
+// Test character classes with unicode characters with and without unicode flag
+shouldBe('"\u{1d306}".match(/[\u{1d306}a]/)[0].length', '1');
+shouldBe('"\u{1d306}".match(/[a\u{1d306}]/u)[0].length', '2');
+shouldBe('"\u{1d306}".match(/[\u{1d306}a]/u)[0].length', '2');
+shouldBe('"\u{1d306}".match(/[a-\u{1d306}]/)[0].length', '1');
+shouldBe('"\u{1d306}".match(/[a-\u{1d306}]/u)[0].length', '2');
+
+// Test a character class that is a range from one UTF16 to a Unicode character
+shouldBe('"X".match(/[\u0020-\ud801\udc4f]/u)[0].length', '1');
+shouldBe('"\u1000".match(/[\u0020-\ud801\udc4f]/u)[0].length', '1');
+shouldBe('"\ud801\udc27".match(/[\u0020-\ud801\udc4f]/u)[0].length', '2');
+
+var re1 = new RegExp("[^\u0020-\ud801\udc4f]", "u");
+shouldBeFalse('re1.test("Z")');
+shouldBeFalse('re1.test("\u{1000}")');
+shouldBeFalse('re1.test("\u{10400}")');
+
+var re2 = new RegExp("[a-z\u{10000}-\u{15000}]", "iu");
+shouldBeTrue('re2.test("A")');
+shouldBeFalse('re2.test("\uffff")');
+shouldBeTrue('re2.test("\u{12345}")');
+
+// Make sure we properly handle dangling surrogates and combined surrogates
+// FIXME: These tests are disabled until https://bugs.webkit.org/show_bug.cgi?id=154863 is fixed
+// shouldBe('/[\u{10c01}\uD803#\uDC01]/u.exec("\u{10c01}").toString()', '"\u{10c01}"');
+// shouldBe('/[\uD803\u{10c01}\uDC01]/u.exec("\u{10c01}").toString()', '"\u{10c01}"');
+// shouldBe('/[\uD803#\uDC01\u{10c01}]/u.exec("\u{10c01}").toString()', '"\u{10c01}"');
+// shouldBe('/[\uD803\uD803\uDC01\uDC01]/u.exec("\u{10c01}").toString()', '"\u{10c01}"');
+// shouldBeNull('/[\u{10c01}\uD803#\uDC01]{2}/u.exec("\u{10c01}")');
+// shouldBeNull('/[\uD803\u{10c01}\uDC01]{2}/u.exec("\u{10c01}")');
+// shouldBeNull('/[\uD803#\uDC01\u{10c01}]{2}/u.exec("\u{10c01}")');
+// shouldBeNull('/[\uD803\uD803\uDC01\uDC01]{2}/u.exec("\u{10c01}")');
+// shouldBe('/\uD803|\uDC01|\u{10c01}/u.exec("\u{10c01}").toString()', '"\u{10c01}"');
+// shouldBe('/\uD803|\uD803\uDC01|\uDC01/u.exec("\u{10c01}").toString()', '"\u{10c01}"');
+// shouldBe('/\uD803|\uDC01|\u{10c01}/u.exec("\u{D803}").toString()', '"\u{D803}"');
+// shouldBe('/\uD803|\uD803\uDC01|\uDC01/u.exec("\u{DC01}").toString()', '"\u{DC01}"');
+// shouldBeNull('/\uD803\u{10c01}/u.exec("\u{10c01}")');
+// shouldBeNull('/\uD803\u{10c01}/u.exec("\uD803")');
+// shouldBe('"\uD803\u{10c01}".match(/\uD803\u{10c01}/u)[0].length', '3');
+
+// Check back tracking on partial matches
+shouldBe('"\u{10311}\u{10311}\u{10311}".match(/\u{10311}*a|\u{10311}*./u)[0]', '"\u{10311}\u{10311}\u{10311}"');
+shouldBe('"a\u{10311}\u{10311}".match(/a\u{10311}*?$/u)[0]', '"a\u{10311}\u{10311}"');
+shouldBe('"a\u{10311}\u{10311}\u{10311}c".match(/a\u{10311}*cd|a\u{10311}*c/u)[0]', '"a\u{10311}\u{10311}\u{10311}c"');
+shouldBe('"a\u{10311}\u{10311}\u{10311}c".match(/a\u{10311}+cd|a\u{10311}+c/u)[0]', '"a\u{10311}\u{10311}\u{10311}c"');
+shouldBe('"\u{10311}\u{10311}\u{10311}".match(/\u{10311}+?a|\u{10311}+?./u)[0]', '"\u{10311}\u{10311}"');
+shouldBe('"\u{10311}\u{10311}\u{10311}".match(/\u{10311}+?a|\u{10311}+?$/u)[0]', '"\u{10311}\u{10311}\u{10311}"');
+shouldBe('"a\u{10311}\u{10311}\u{10311}c".match(/a\u{10311}*?cd|a\u{10311}*?c/u)[0]', '"a\u{10311}\u{10311}\u{10311}c"');
+shouldBe('"a\u{10311}\u{10311}\u{10311}c".match(/a\u{10311}+?cd|a\u{10311}+?c/u)[0]', '"a\u{10311}\u{10311}\u{10311}c"');
+shouldBe('"\u{10311}\u{10311}\u{10311}".match(/\u{10311}+?a|\u{10311}+?./iu)[0]', '"\u{10311}\u{10311}"');
+shouldBe('"\u{1042a}\u{1042a}\u{10311}".match(/\u{10402}*\u{10200}|\u{10402}*\u{10311}/iu)[0]', '"\u{1042a}\u{1042a}\u{10311}"');
+shouldBe('"\u{1042a}\u{1042a}\u{10311}".match(/\u{10402}+\u{10200}|\u{10402}+\u{10311}/iu)[0]', '"\u{1042a}\u{1042a}\u{10311}"');
+shouldBe('"\u{1042a}\u{1042a}\u{10311}".match(/\u{10402}*?\u{10200}|\u{10402}*?\u{10311}/iu)[0]', '"\u{1042a}\u{1042a}\u{10311}"');
+shouldBe('"\u{1042a}\u{1042a}\u{10311}".match(/\u{10402}+?\u{10200}|\u{10402}+?\u{10311}/iu)[0]', '"\u{1042a}\u{1042a}\u{10311}"');
+shouldBe('"ab\u{10311}c\u{10a01}".match(/abc|ab\u{10311}cd|ab\u{10311}c\u{10a01}d|ab\u{10311}c\u{10a01}/u)[0]', '"ab\u{10311}c\u{10a01}"');
+shouldBe('"ab\u{10428}c\u{10a01}".match(/abc|ab\u{10400}cd|ab\u{10400}c\u{10a01}d|ab\u{10400}c\u{10a01}/iu)[0]', '"ab\u{10428}c\u{10a01}"');
+shouldBeFalse('/abc|ab\u{10400}cd|ab\u{10400}c\u{10a01}d|ab\u{10400}c\u{10a01}/iu.test("qwerty123")');
+shouldBe('"a\u{10428}\u{10428}\u{10428}c".match(/ac|a\u{10400}*cd|a\u{10400}+cd|a\u{10400}+c/iu)[0]', '"a\u{10428}\u{10428}\u{10428}c"');
+shouldBe('"ab\u{10428}\u{10428}\u{10428}c\u{10a01}".match(/abc|ab\u{10400}*cd|ab\u{10400}+c\u{10a01}d|ab\u{10400}+c\u{10a01}/iu)[0]', '"ab\u{10428}\u{10428}\u{10428}c\u{10a01}"');
+shouldBe('"ab\u{10428}\u{10428}\u{10428}".match(/abc|ab\u{10428}*./u)[0]', '"ab\u{10428}\u{10428}\u{10428}"');
+shouldBe('"ab\u{10428}\u{10428}\u{10428}".match(/abc|ab\u{10400}*./iu)[0]', '"ab\u{10428}\u{10428}\u{10428}"');
+
+var re3 = new RegExp("(a\u{10410}*bc)|(a\u{10410}*b)", "u");
+var match3 = "a\u{10410}\u{10410}b".match(re3);
+shouldBe('match3[0]', '"a\u{10410}\u{10410}b"');
+shouldBeUndefined('match3[1]');
+shouldBe('match3[2]', '"a\u{10410}\u{10410}b"');
+
+var re4 = new RegExp("a(\u{10410}*)bc|a(\u{10410}*)b", "ui");
+var match4 = "a\u{10438}\u{10438}b".match(re4);
+shouldBe('match4[0]', '"a\u{10438}\u{10438}b"');
+shouldBeUndefined('match4[1]');
+shouldBe('match4[2]', '"\u{10438}\u{10438}"');
+
+var match5 = "a\u{10412}\u{10412}b\u{10412}\u{10412}".match(/a(\u{10412}*)bc\1|a(\u{10412}*)b\2/u);
+shouldBe('match5[0]', '"a\u{10412}\u{10412}b\u{10412}\u{10412}"');
+shouldBeUndefined('match5[1]');
+shouldBe('match5[2]', '"\u{10412}\u{10412}"');
+
+var match6 = "a\u{10412}\u{10412}b\u{1043a}\u{10412}\u{1043a}".match(/a(\u{1043a}*)bc\1|a(\u{1043a}*)b\2/iu);
+shouldBe('match6[0]', '"a\u{10412}\u{10412}b\u{1043a}\u{10412}"');
+shouldBeUndefined('match6[1]');
+shouldBe('match6[2]', '"\u{10412}\u{10412}"');
+
+// Miscellaneous tests
+shouldBeTrue('/\u1e9Abc/ui.test("abc")');
+shouldBeTrue('/abc/ui.test("\u1e9Abc")');
+shouldBeTrue('/tex\u1e97/ui.test("text")');
+shouldBeTrue('/text/ui.test("\u1e97ext")');
index 0f78308..6f13e5a 100644 (file)
@@ -826,7 +826,7 @@ set(JavaScriptCore_SOURCES
     wasm/WASMReader.cpp
 
     yarr/RegularExpression.cpp
-    yarr/YarrCanonicalizeUCS2.cpp
+    yarr/YarrCanonicalizeUnicode.cpp
     yarr/YarrInterpreter.cpp
     yarr/YarrJIT.cpp
     yarr/YarrPattern.cpp
index 31f44a3..83ba0de 100644 (file)
@@ -1,3 +1,169 @@
+2016-03-01  Michael Saboff  <msaboff@apple.com>
+
+        [ES6] Add support for Unicode regular expressions
+        https://bugs.webkit.org/show_bug.cgi?id=154842
+
+        Reviewed by Filip Pizlo.
+
+        Added processing of Unicode regular expressions to the Yarr interpreter.
+
+        Changed parsing of regular expression patterns and PatternTerms to process characters as
+        UChar32 in the Yarr code.  The parser converts matched surrogate pairs into the appropriate
+        Unicode character when the expression is parsed.  When matching a unicode expression and
+        reading source characters, we convert a proper surrogate pair into a Unicode character and
+        advance the source cursor, "pos", one more position.  The exception is when we already know,
+        while generating a fixed character atom, that we need to match a unicode character
+        that doesn't fit in 16 bits.  The code calls this an extendedUnicodeCharacter and has a
+        helper to determine this.
+
+        Added 'u' flag and 'unicode' identifier to regular expression classes.  Added an "isUnicode"
+        parameter to YarrPattern pattern() and internal users of that function.
+
+        Updated the generation of the canonicalization tables to include a new set of tables that
+        follow ES 6.0, 21.2.2.8.2 Step 2.  Renamed the YarrCanonicalizeUCS2.* files to
+        YarrCanonicalizeUnicode.*. 
+
+        Added a new LayoutTests/js test that tests the added functionality.  Updated other tests that
+        have minor es6 unicode checks and look for valid flags.
+
+        Ran the ChakraCore Unicode regular expression tests as well.
+
+        * CMakeLists.txt:
+        * JavaScriptCore.vcxproj/JavaScriptCore.vcxproj:
+        * JavaScriptCore.vcxproj/JavaScriptCore.vcxproj.filters:
+        * JavaScriptCore.xcodeproj/project.pbxproj:
+
+        * inspector/ContentSearchUtilities.cpp:
+        (Inspector::ContentSearchUtilities::findMagicComment):
+        * yarr/RegularExpression.cpp:
+        (JSC::Yarr::RegularExpression::Private::compile):
+        Updated use of pattern().
+
+        * runtime/CommonIdentifiers.h:
+        * runtime/RegExp.cpp:
+        (JSC::regExpFlags):
+        (JSC::RegExpFunctionalTestCollector::outputOneTest):
+        (JSC::RegExp::finishCreation):
+        (JSC::RegExp::compile):
+        (JSC::RegExp::compileMatchOnly):
+        * runtime/RegExp.h:
+        * runtime/RegExpKey.h:
+        * runtime/RegExpPrototype.cpp:
+        (JSC::regExpProtoFuncCompile):
+        (JSC::flagsString):
+        (JSC::regExpProtoGetterMultiline):
+        (JSC::regExpProtoGetterUnicode):
+        (JSC::regExpProtoGetterFlags):
+        Updated for new 'u' (unicode) flag.  Added a check to use the interpreter for unicode regular expressions.
+
+        * tests/es6.yaml:
+        * tests/stress/static-getter-in-names.js:
+        Updated tests for new flag and for passing the minimal es6 regular expression processing.
+
+        * yarr/Yarr.h: Updated the size of information now kept for backtracking.
+
+        * yarr/YarrCanonicalizeUCS2.cpp: Removed.
+        * yarr/YarrCanonicalizeUCS2.h: Removed.
+        * yarr/YarrCanonicalizeUCS2.js: Removed.
+        * yarr/YarrCanonicalizeUnicode.cpp: Copied from Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.cpp.
+        * yarr/YarrCanonicalizeUnicode.h: Copied from Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.h.
+        (JSC::Yarr::canonicalCharacterSetInfo):
+        (JSC::Yarr::canonicalRangeInfoFor):
+        (JSC::Yarr::getCanonicalPair):
+        (JSC::Yarr::isCanonicallyUnique):
+        (JSC::Yarr::areCanonicallyEquivalent):
+        (JSC::Yarr::rangeInfoFor): Deleted.
+        * yarr/YarrCanonicalizeUnicode.js: Copied from Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.js.
+        (printHeader):
+        (printFooter):
+        (hex):
+        (canonicalize):
+        (canonicalizeUnicode):
+        (createUCS2CanonicalGroups):
+        (createUnicodeCanonicalGroups):
+        (cu.in.groupedCanonically.characters.sort): Deleted.
+        (cu.in.groupedCanonically.else): Deleted.
+        Refactored to output two sets of tables, one for UCS2 and one for Unicode.  The UCS2 tables follow
+        the legacy canonicalization rules now specified in ES 6.0, 21.2.2.8.2 Step 3.  The new Unicode
+        tables follow the rules specified in ES 6.0, 21.2.2.8.2 Step 2.  Eliminated the unused Latin1 tables.
+
+        * yarr/YarrInterpreter.cpp:
+        (JSC::Yarr::Interpreter::InputStream::InputStream):
+        (JSC::Yarr::Interpreter::InputStream::readChecked):
+        (JSC::Yarr::Interpreter::InputStream::readSurrogatePairChecked):
+        (JSC::Yarr::Interpreter::InputStream::reread):
+        (JSC::Yarr::Interpreter::InputStream::prev):
+        (JSC::Yarr::Interpreter::testCharacterClass):
+        (JSC::Yarr::Interpreter::checkCharacter):
+        (JSC::Yarr::Interpreter::checkSurrogatePair):
+        (JSC::Yarr::Interpreter::checkCasedCharacter):
+        (JSC::Yarr::Interpreter::tryConsumeBackReference):
+        (JSC::Yarr::Interpreter::backtrackPatternCharacter):
+        (JSC::Yarr::Interpreter::matchCharacterClass):
+        (JSC::Yarr::Interpreter::backtrackCharacterClass):
+        (JSC::Yarr::Interpreter::matchParenthesesTerminalEnd):
+        (JSC::Yarr::Interpreter::matchDisjunction):
+        (JSC::Yarr::Interpreter::Interpreter):
+        (JSC::Yarr::ByteCompiler::assertionWordBoundary):
+        (JSC::Yarr::ByteCompiler::atomPatternCharacter):
+        * yarr/YarrInterpreter.h:
+        (JSC::Yarr::ByteTerm::ByteTerm):
+        (JSC::Yarr::BytecodePattern::BytecodePattern):
+        * yarr/YarrJIT.cpp:
+        (JSC::Yarr::YarrGenerator::optimizeAlternative):
+        (JSC::Yarr::YarrGenerator::matchCharacterClassRange):
+        (JSC::Yarr::YarrGenerator::matchCharacterClass):
+        (JSC::Yarr::YarrGenerator::notAtEndOfInput):
+        (JSC::Yarr::YarrGenerator::jumpIfCharNotEquals):
+        (JSC::Yarr::YarrGenerator::generatePatternCharacterOnce):
+        (JSC::Yarr::YarrGenerator::generatePatternCharacterFixed):
+        (JSC::Yarr::YarrGenerator::generatePatternCharacterGreedy):
+        (JSC::Yarr::YarrGenerator::backtrackPatternCharacterNonGreedy):
+        * yarr/YarrParser.h:
+        (JSC::Yarr::Parser::CharacterClassParserDelegate::atomPatternCharacter):
+        (JSC::Yarr::Parser::Parser):
+        (JSC::Yarr::Parser::parseEscape):
+        (JSC::Yarr::Parser::consumePossibleSurrogatePair):
+        (JSC::Yarr::Parser::parseCharacterClass):
+        (JSC::Yarr::Parser::parseTokens):
+        (JSC::Yarr::Parser::parse):
+        (JSC::Yarr::Parser::atEndOfPattern):
+        (JSC::Yarr::Parser::patternRemaining):
+        (JSC::Yarr::Parser::peek):
+        (JSC::Yarr::parse):
+        * yarr/YarrPattern.cpp:
+        (JSC::Yarr::CharacterClassConstructor::CharacterClassConstructor):
+        (JSC::Yarr::CharacterClassConstructor::append):
+        (JSC::Yarr::CharacterClassConstructor::putChar):
+        (JSC::Yarr::CharacterClassConstructor::putUnicodeIgnoreCase):
+        (JSC::Yarr::CharacterClassConstructor::putRange):
+        (JSC::Yarr::CharacterClassConstructor::charClass):
+        (JSC::Yarr::CharacterClassConstructor::addSorted):
+        (JSC::Yarr::CharacterClassConstructor::addSortedRange):
+        (JSC::Yarr::YarrPatternConstructor::YarrPatternConstructor):
+        (JSC::Yarr::YarrPatternConstructor::assertionWordBoundary):
+        (JSC::Yarr::YarrPatternConstructor::atomPatternCharacter):
+        (JSC::Yarr::YarrPatternConstructor::atomCharacterClassBegin):
+        (JSC::Yarr::YarrPatternConstructor::atomCharacterClassAtom):
+        (JSC::Yarr::YarrPatternConstructor::atomCharacterClassRange):
+        (JSC::Yarr::YarrPatternConstructor::setupAlternativeOffsets):
+        (JSC::Yarr::YarrPattern::compile):
+        (JSC::Yarr::YarrPattern::YarrPattern):
+        * yarr/YarrPattern.h:
+        (JSC::Yarr::CharacterRange::CharacterRange):
+        (JSC::Yarr::CharacterClass::CharacterClass):
+        (JSC::Yarr::PatternTerm::PatternTerm):
+        (JSC::Yarr::YarrPattern::reset):
+        * yarr/YarrSyntaxChecker.cpp:
+        (JSC::Yarr::SyntaxChecker::assertionBOL):
+        (JSC::Yarr::SyntaxChecker::assertionEOL):
+        (JSC::Yarr::SyntaxChecker::assertionWordBoundary):
+        (JSC::Yarr::SyntaxChecker::atomPatternCharacter):
+        (JSC::Yarr::SyntaxChecker::atomBuiltInCharacterClass):
+        (JSC::Yarr::SyntaxChecker::atomCharacterClassBegin):
+        (JSC::Yarr::SyntaxChecker::atomCharacterClassAtom):
+        (JSC::Yarr::checkSyntax):
+
 2016-03-01  Saam barati  <sbarati@apple.com>
 
         Remove FIXMEs and add valid test cases after necessary patch has landed.
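
Note on the ChangeLog entry above: it says that, when matching a unicode expression, a proper surrogate pair read from the source is converted into one Unicode character and "pos" is advanced one extra position. The following is a minimal standalone sketch of that idea, not the Yarr code itself. The names isLeadSurrogate, isTrailSurrogate, and readCodePoint are hypothetical and chosen only for illustration; std::u16string stands in for the engine's own string type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helpers for illustration; not taken from the Yarr sources.
inline bool isLeadSurrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
inline bool isTrailSurrogate(char16_t c) { return c >= 0xDC00 && c <= 0xDFFF; }

// Reads one character starting at `pos` and advances `pos` past it.
// In unicode mode, a lead surrogate immediately followed by a trail surrogate
// is combined into a single code point and `pos` moves by two code units.
// In every other case exactly one code unit is consumed and returned as-is.
inline char32_t readCodePoint(const std::u16string& input, std::size_t& pos, bool unicode)
{
    assert(pos < input.size()); // caller must not read past the end
    char16_t first = input[pos++];
    if (unicode && isLeadSurrogate(first) && pos < input.size() && isTrailSurrogate(input[pos])) {
        char16_t second = input[pos++];
        return 0x10000 + ((static_cast<char32_t>(first) - 0xD800) << 10) + (second - 0xDC00);
    }
    return first;
}
```

With unicode set to false the same input yields the two surrogates as separate 16-bit values, which corresponds to the legacy behavior the patch keeps for non-'u' expressions. An unpaired surrogate is returned unchanged in both modes.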
index a621e3c..774b179 100644
     <ClCompile Include="..\wasm\WASMModuleParser.cpp" />
     <ClCompile Include="..\wasm\WASMReader.cpp" />
     <ClCompile Include="..\yarr\RegularExpression.cpp" />
-    <ClCompile Include="..\yarr\YarrCanonicalizeUCS2.cpp" />
+    <ClCompile Include="..\yarr\YarrCanonicalizeUnicode.cpp" />
     <ClCompile Include="..\yarr\YarrInterpreter.cpp" />
     <ClCompile Include="..\yarr\YarrJIT.cpp" />
     <ClCompile Include="..\yarr\YarrPattern.cpp" />
     <ClInclude Include="..\wasm\WASMReader.h" />
     <ClInclude Include="..\yarr\RegularExpression.h" />
     <ClInclude Include="..\yarr\Yarr.h" />
-    <ClInclude Include="..\yarr\YarrCanonicalizeUCS2.h" />
+    <ClInclude Include="..\yarr\YarrCanonicalizeUnicode.h" />
     <ClInclude Include="..\yarr\YarrInterpreter.h" />
     <ClInclude Include="..\yarr\YarrJIT.h" />
     <ClInclude Include="..\yarr\YarrParser.h" />
index 790424f..b53fc6f 100644
     <ClCompile Include="..\yarr\RegularExpression.cpp">
       <Filter>yarr</Filter>
     </ClCompile>
-    <ClCompile Include="..\yarr\YarrCanonicalizeUCS2.cpp">
+    <ClCompile Include="..\yarr\YarrCanonicalizeUnicode.cpp">
       <Filter>yarr</Filter>
     </ClCompile>
     <ClCompile Include="..\yarr\YarrInterpreter.cpp">
     <ClInclude Include="..\yarr\RegularExpression.h">
       <Filter>yarr</Filter>
     </ClInclude>
-    <ClInclude Include="..\yarr\YarrCanonicalizeUCS2.h">
+    <ClInclude Include="..\yarr\YarrCanonicalizeUnicode.h">
       <Filter>yarr</Filter>
     </ClInclude>
     <ClInclude Include="..\yarr\YarrInterpreter.h">
index e1b98b3..a6b9e28 100644
                862553D116136DA9009F17D0 /* JSProxy.cpp in Sources */ = {isa = PBXBuildFile; fileRef = 862553CE16136AA5009F17D0 /* JSProxy.cpp */; };
                862553D216136E1A009F17D0 /* JSProxy.h in Headers */ = {isa = PBXBuildFile; fileRef = 862553CF16136AA5009F17D0 /* JSProxy.h */; settings = {ATTRIBUTES = (Private, ); }; };
                863B23E00FC6118900703AA4 /* MacroAssemblerCodeRef.h in Headers */ = {isa = PBXBuildFile; fileRef = 863B23DF0FC60E6200703AA4 /* MacroAssemblerCodeRef.h */; settings = {ATTRIBUTES = (Private, ); }; };
-               863C6D9C1521111A00585E4E /* YarrCanonicalizeUCS2.cpp in Sources */ = {isa = PBXBuildFile; fileRef = 863C6D981521111200585E4E /* YarrCanonicalizeUCS2.cpp */; };
+               863C6D9C1521111A00585E4E /* YarrCanonicalizeUnicode.cpp in Sources */ = {isa = PBXBuildFile; fileRef = 863C6D981521111200585E4E /* YarrCanonicalizeUnicode.cpp */; };
                8642C510151C06A90046D4EF /* RegExpCachedResult.cpp in Sources */ = {isa = PBXBuildFile; fileRef = 86F75EFB151C062F007C9BA3 /* RegExpCachedResult.cpp */; };
                8642C512151C083D0046D4EF /* RegExpMatchesArray.cpp in Sources */ = {isa = PBXBuildFile; fileRef = 86F75EFD151C062F007C9BA3 /* RegExpMatchesArray.cpp */; };
                865A30F1135007E100CDB49E /* JSCJSValueInlines.h in Headers */ = {isa = PBXBuildFile; fileRef = 865A30F0135007E100CDB49E /* JSCJSValueInlines.h */; settings = {ATTRIBUTES = (Private, ); }; };
                862553CE16136AA5009F17D0 /* JSProxy.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = JSProxy.cpp; sourceTree = "<group>"; };
                862553CF16136AA5009F17D0 /* JSProxy.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = JSProxy.h; sourceTree = "<group>"; };
                863B23DF0FC60E6200703AA4 /* MacroAssemblerCodeRef.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = MacroAssemblerCodeRef.h; sourceTree = "<group>"; };
-               863C6D981521111200585E4E /* YarrCanonicalizeUCS2.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; name = YarrCanonicalizeUCS2.cpp; path = yarr/YarrCanonicalizeUCS2.cpp; sourceTree = "<group>"; };
-               863C6D991521111200585E4E /* YarrCanonicalizeUCS2.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; name = YarrCanonicalizeUCS2.h; path = yarr/YarrCanonicalizeUCS2.h; sourceTree = "<group>"; };
-               863C6D9A1521111200585E4E /* YarrCanonicalizeUCS2.js */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.javascript; name = YarrCanonicalizeUCS2.js; path = yarr/YarrCanonicalizeUCS2.js; sourceTree = "<group>"; };
+               863C6D981521111200585E4E /* YarrCanonicalizeUnicode.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; name = YarrCanonicalizeUnicode.cpp; path = yarr/YarrCanonicalizeUnicode.cpp; sourceTree = "<group>"; };
+               863C6D991521111200585E4E /* YarrCanonicalizeUnicode.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; name = YarrCanonicalizeUnicode.h; path = yarr/YarrCanonicalizeUnicode.h; sourceTree = "<group>"; };
+               863C6D9A1521111200585E4E /* YarrCanonicalizeUnicode.js */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.javascript; name = YarrCanonicalizeUnicode.js; path = yarr/YarrCanonicalizeUnicode.js; sourceTree = "<group>"; };
                8640923B156EED3B00566CB2 /* ARM64Assembler.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = ARM64Assembler.h; sourceTree = "<group>"; };
                8640923C156EED3B00566CB2 /* MacroAssemblerARM64.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = MacroAssemblerARM64.h; sourceTree = "<group>"; };
                865A30F0135007E100CDB49E /* JSCJSValueInlines.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = JSCJSValueInlines.h; sourceTree = "<group>"; };
                                A57D23EB1891B5540031C7FA /* RegularExpression.cpp */,
                                A57D23EC1891B5540031C7FA /* RegularExpression.h */,
                                451539B812DC994500EF7AC4 /* Yarr.h */,
-                               863C6D981521111200585E4E /* YarrCanonicalizeUCS2.cpp */,
-                               863C6D991521111200585E4E /* YarrCanonicalizeUCS2.h */,
-                               863C6D9A1521111200585E4E /* YarrCanonicalizeUCS2.js */,
+                               863C6D981521111200585E4E /* YarrCanonicalizeUnicode.cpp */,
+                               863C6D991521111200585E4E /* YarrCanonicalizeUnicode.h */,
+                               863C6D9A1521111200585E4E /* YarrCanonicalizeUnicode.js */,
                                86704B7D12DBA33700A9FE7B /* YarrInterpreter.cpp */,
                                86704B7E12DBA33700A9FE7B /* YarrInterpreter.h */,
                                86704B7F12DBA33700A9FE7B /* YarrJIT.cpp */,
                                0FC8150B14043C0E00CFA603 /* WriteBarrierSupport.cpp in Sources */,
                                A7E5AB3A1799E4B200D2833D /* X86Disassembler.cpp in Sources */,
                                0F2BBD971C5FF3F50023EF23 /* B3Variable.cpp in Sources */,
-                               863C6D9C1521111A00585E4E /* YarrCanonicalizeUCS2.cpp in Sources */,
+                               863C6D9C1521111A00585E4E /* YarrCanonicalizeUnicode.cpp in Sources */,
                                86704B8412DBA33700A9FE7B /* YarrInterpreter.cpp in Sources */,
                                86704B8612DBA33700A9FE7B /* YarrJIT.cpp in Sources */,
                                86704B8912DBA33700A9FE7B /* YarrPattern.cpp in Sources */,
index f3515f4..323d480 100644
@@ -176,7 +176,7 @@ static String findMagicComment(const String& content, const String& patternStrin
 {
     ASSERT(!content.isNull());
     const char* error = nullptr;
-    JSC::Yarr::YarrPattern pattern(patternString, false, true, &error);
+    JSC::Yarr::YarrPattern pattern(patternString, false, true, false, &error);
     ASSERT(!error);
     BumpPointerAllocator regexAllocator;
     auto bytecodePattern = JSC::Yarr::byteCompile(pattern, &regexAllocator);
index 9739c3e..b217a2b 100644
     macro(toPrecision) \
     macro(toString) \
     macro(top) \
+    macro(unicode) \
     macro(usage) \
     macro(value) \
     macro(valueOf) \
index 26750b3..84b37a2 100644
@@ -66,6 +66,12 @@ RegExpFlags regExpFlags(const String& string)
             flags = static_cast<RegExpFlags>(flags | FlagMultiline);
             break;
 
+        case 'u':
+            if (flags & FlagUnicode)
+                return InvalidFlags;
+            flags = static_cast<RegExpFlags>(flags | FlagUnicode);
+            break;
+                
         default:
             return InvalidFlags;
         }
@@ -126,6 +132,8 @@ void RegExpFunctionalTestCollector::outputOneTest(RegExp* regExp, const String&
             fputc('i', m_file);
         if (regExp->multiline())
             fputc('m', m_file);
+        if (regExp->unicode())
+            fputc('u', m_file);
         fprintf(m_file, "\n");
     }
 
@@ -240,7 +248,7 @@ RegExp::RegExp(VM& vm, const String& patternString, RegExpFlags flags)
 void RegExp::finishCreation(VM& vm)
 {
     Base::finishCreation(vm);
-    Yarr::YarrPattern pattern(m_patternString, ignoreCase(), multiline(), &m_constructionError);
+    Yarr::YarrPattern pattern(m_patternString, ignoreCase(), multiline(), unicode(), &m_constructionError);
     if (m_constructionError)
         m_state = ParseError;
     else
@@ -280,7 +288,7 @@ RegExp* RegExp::create(VM& vm, const String& patternString, RegExpFlags flags)
 
 void RegExp::compile(VM* vm, Yarr::YarrCharSize charSize)
 {
-    Yarr::YarrPattern pattern(m_patternString, ignoreCase(), multiline(), &m_constructionError);
+    Yarr::YarrPattern pattern(m_patternString, ignoreCase(), multiline(), unicode(), &m_constructionError);
     if (m_constructionError) {
         RELEASE_ASSERT_NOT_REACHED();
 #if COMPILER_QUIRK(CONSIDERS_UNREACHABLE_CODE)
@@ -297,7 +305,7 @@ void RegExp::compile(VM* vm, Yarr::YarrCharSize charSize)
     }
 
 #if ENABLE(YARR_JIT)
-    if (!pattern.m_containsBackreferences && !pattern.containsUnsignedLengthPattern() && vm->canUseRegExpJIT()) {
+    if (!pattern.m_containsBackreferences && !pattern.containsUnsignedLengthPattern() && !unicode() && vm->canUseRegExpJIT()) {
         Yarr::jitCompile(pattern, charSize, vm, m_regExpJITCode);
         if (!m_regExpJITCode.isFallBack()) {
             m_state = JITCode;
@@ -399,7 +407,7 @@ int RegExp::match(VM& vm, const String& s, unsigned startOffset, Vector<int, 32>
 
 void RegExp::compileMatchOnly(VM* vm, Yarr::YarrCharSize charSize)
 {
-    Yarr::YarrPattern pattern(m_patternString, ignoreCase(), multiline(), &m_constructionError);
+    Yarr::YarrPattern pattern(m_patternString, ignoreCase(), multiline(), unicode(), &m_constructionError);
     if (m_constructionError) {
         RELEASE_ASSERT_NOT_REACHED();
 #if COMPILER_QUIRK(CONSIDERS_UNREACHABLE_CODE)
@@ -416,7 +424,7 @@ void RegExp::compileMatchOnly(VM* vm, Yarr::YarrCharSize charSize)
     }
 
 #if ENABLE(YARR_JIT)
-    if (!pattern.m_containsBackreferences && !pattern.containsUnsignedLengthPattern() && vm->canUseRegExpJIT()) {
+    if (!pattern.m_containsBackreferences && !pattern.containsUnsignedLengthPattern() && !unicode() && vm->canUseRegExpJIT()) {
         Yarr::jitCompile(pattern, charSize, vm, m_regExpJITCode, Yarr::MatchOnly);
         if (!m_regExpJITCode.isFallBack()) {
             m_state = JITCode;
index 3777fd5..611c5ab 100644
@@ -55,6 +55,7 @@ public:
     bool global() const { return m_flags & FlagGlobal; }
     bool ignoreCase() const { return m_flags & FlagIgnoreCase; }
     bool multiline() const { return m_flags & FlagMultiline; }
+    bool unicode() const { return m_flags & FlagUnicode; }
 
     const String& pattern() const { return m_patternString; }
 
index 58fa387..557923e 100644
@@ -38,7 +38,8 @@ enum RegExpFlags {
     FlagGlobal = 1,
     FlagIgnoreCase = 2,
     FlagMultiline = 4,
-    InvalidFlags = 8,
+    FlagUnicode = 8,
+    InvalidFlags = 16,
     DeletedValueFlags = -1
 };
 
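
Note on the two hunks above (regExpFlags and the RegExpFlags enum): each flag is one bit, a repeated flag is rejected, and InvalidFlags moves from 8 to 16 because FlagUnicode now occupies bit 8. The sketch below reproduces that scheme in isolation. The function name parseFlagsSketch is hypothetical, NoFlags = 0 is an assumption (it is not visible in the hunk), and the duplicate check is applied uniformly to all four flags on the assumption that the existing cases behave like the new 'u' case.

```cpp
#include <cassert>
#include <string>

// Bit values as shown in the RegExpKey.h hunk; NoFlags is assumed.
enum RegExpFlags {
    NoFlags = 0,
    FlagGlobal = 1,
    FlagIgnoreCase = 2,
    FlagMultiline = 4,
    FlagUnicode = 8,
    InvalidFlags = 16,
    DeletedValueFlags = -1
};

// Hypothetical standalone parser: returns the combined flag bits, or
// InvalidFlags for an unknown character or a flag that appears twice.
inline RegExpFlags parseFlagsSketch(const std::string& text)
{
    int flags = NoFlags;
    for (char c : text) {
        int bit = 0;
        switch (c) {
        case 'g': bit = FlagGlobal; break;
        case 'i': bit = FlagIgnoreCase; break;
        case 'm': bit = FlagMultiline; break;
        case 'u': bit = FlagUnicode; break;
        default: return InvalidFlags;
        }
        if (flags & bit)
            return InvalidFlags; // each flag may appear at most once
        flags |= bit;
    }
    return static_cast<RegExpFlags>(flags);
}
```

Because InvalidFlags is a distinct bit above all valid flags, a caller can tell a failed parse apart from any valid combination, including the full set "gimu" (value 15). The 'y' flag is still rejected here, which matches the es6.yaml hunk leaving the y-flag tests marked as failing.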
index 3b1d41c..185549d 100644
@@ -48,6 +48,7 @@ static EncodedJSValue JSC_HOST_CALL regExpProtoFuncSearch(ExecState*);
 static EncodedJSValue JSC_HOST_CALL regExpProtoGetterGlobal(ExecState*);
 static EncodedJSValue JSC_HOST_CALL regExpProtoGetterIgnoreCase(ExecState*);
 static EncodedJSValue JSC_HOST_CALL regExpProtoGetterMultiline(ExecState*);
+static EncodedJSValue JSC_HOST_CALL regExpProtoGetterUnicode(ExecState*);
 static EncodedJSValue JSC_HOST_CALL regExpProtoGetterSource(ExecState*);
 static EncodedJSValue JSC_HOST_CALL regExpProtoGetterFlags(ExecState*);
 
@@ -68,6 +69,7 @@ const ClassInfo RegExpPrototype::s_info = { "RegExp", &RegExpObject::s_info, &re
   global        regExpProtoGetterGlobal     DontEnum|Accessor
   ignoreCase    regExpProtoGetterIgnoreCase DontEnum|Accessor
   multiline     regExpProtoGetterMultiline  DontEnum|Accessor
+  unicode       regExpProtoGetterUnicode    DontEnum|Accessor
   source        regExpProtoGetterSource     DontEnum|Accessor
   flags         regExpProtoGetterFlags      DontEnum|Accessor
 @end
@@ -146,7 +148,7 @@ EncodedJSValue JSC_HOST_CALL regExpProtoFuncCompile(ExecState* exec)
     return JSValue::encode(jsUndefined());
 }
 
-typedef std::array<char, 3 + 1> FlagsString; // 3 different flags and a null character terminator.
+typedef std::array<char, 4 + 1> FlagsString; // 4 different flags and a null character terminator.
 
 static inline FlagsString flagsString(ExecState* exec, JSObject* regexp)
 {
@@ -159,6 +161,9 @@ static inline FlagsString flagsString(ExecState* exec, JSObject* regexp)
     if (exec->hadException())
         return string;
     JSValue multilineValue = regexp->get(exec, exec->propertyNames().multiline);
+    if (exec->hadException())
+        return string;
+    JSValue unicodeValue = regexp->get(exec, exec->propertyNames().unicode);
 
     unsigned index = 0;
     if (globalValue.toBoolean(exec))
@@ -167,6 +172,8 @@ static inline FlagsString flagsString(ExecState* exec, JSObject* regexp)
         string[index++] = 'i';
     if (multilineValue.toBoolean(exec))
         string[index++] = 'm';
+    if (unicodeValue.toBoolean(exec))
+        string[index++] = 'u';
     ASSERT(index < string.size());
     string[index] = 0;
     return string;
@@ -225,6 +232,15 @@ EncodedJSValue JSC_HOST_CALL regExpProtoGetterMultiline(ExecState* exec)
     return JSValue::encode(jsBoolean(asRegExpObject(thisValue)->regExp()->multiline()));
 }
 
+EncodedJSValue JSC_HOST_CALL regExpProtoGetterUnicode(ExecState* exec)
+{
+    JSValue thisValue = exec->thisValue();
+    if (!thisValue.inherits(RegExpObject::info()))
+        return throwVMTypeError(exec);
+    
+    return JSValue::encode(jsBoolean(asRegExpObject(thisValue)->regExp()->unicode()));
+}
+
 EncodedJSValue JSC_HOST_CALL regExpProtoGetterFlags(ExecState* exec)
 {
     JSValue thisValue = exec->thisValue();
index d490061..3940537 100644
 - path: es6/RegExp_is_subclassable_correct_prototype_chain.js
   cmd: runES6 :normal
 - path: es6/RegExp_y_and_u_flags_u_flag.js
-  cmd: runES6 :fail
+  cmd: runES6 :normal
 - path: es6/RegExp_y_and_u_flags_u_flag_Unicode_code_point_escapes.js
-  cmd: runES6 :fail
+  cmd: runES6 :normal
 - path: es6/RegExp_y_and_u_flags_y_flag.js
   cmd: runES6 :fail
 - path: es6/RegExp_y_and_u_flags_y_flag_lastIndex.js
index 852a2e5..bc37deb 100644
@@ -3,5 +3,5 @@ function shouldBe(actual, expected) {
         throw new Error('bad value: ' + actual);
 }
 
-shouldBe(JSON.stringify(Object.getOwnPropertyNames(RegExp.prototype).sort()), '["compile","constructor","exec","flags","global","ignoreCase","lastIndex","multiline","source","test","toString"]');
+shouldBe(JSON.stringify(Object.getOwnPropertyNames(RegExp.prototype).sort()), '["compile","constructor","exec","flags","global","ignoreCase","lastIndex","multiline","source","test","toString","unicode"]');
 shouldBe(JSON.stringify(Object.getOwnPropertyNames(/Cocoa/).sort()), '["lastIndex"]');
index 0c70896..93fef3c 100644
@@ -57,7 +57,7 @@ private:
 
     std::unique_ptr<JSC::Yarr::BytecodePattern> compile(const String& patternString, TextCaseSensitivity caseSensitivity, MultilineMode multilineMode)
     {
-        JSC::Yarr::YarrPattern pattern(patternString, (caseSensitivity == TextCaseInsensitive), (multilineMode == MultilineEnabled), &m_constructionError);
+        JSC::Yarr::YarrPattern pattern(patternString, (caseSensitivity == TextCaseInsensitive), (multilineMode == MultilineEnabled), false, &m_constructionError);
         if (m_constructionError) {
             LOG_ERROR("RegularExpression: YARR compile failed with '%s'", m_constructionError);
             return nullptr;
index 463623e..cfcf3ea 100644
@@ -33,8 +33,8 @@
 
 namespace JSC { namespace Yarr {
 
-#define YarrStackSpaceForBackTrackInfoPatternCharacter 1 // Only for !fixed quantifiers.
-#define YarrStackSpaceForBackTrackInfoCharacterClass 1 // Only for !fixed quantifiers.
+#define YarrStackSpaceForBackTrackInfoPatternCharacter 2 // Only for !fixed quantifiers.
+#define YarrStackSpaceForBackTrackInfoCharacterClass 2 // Only for !fixed quantifiers.
 #define YarrStackSpaceForBackTrackInfoBackReference 2
 #define YarrStackSpaceForBackTrackInfoAlternative 1 // One per alternative.
 #define YarrStackSpaceForBackTrackInfoParentheticalAssertion 1
diff --git a/Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.cpp b/Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.cpp
deleted file mode 100644
index 52cb1a9..0000000
+++ /dev/null
@@ -1,463 +0,0 @@
-/*
- * Copyright (C) 2012 Apple Inc. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions
- * are met:
- * 1. Redistributions of source code must retain the above copyright
- *    notice, this list of conditions and the following disclaimer.
- * 2. Redistributions in binary form must reproduce the above copyright
- *    notice, this list of conditions and the following disclaimer in the
- *    documentation and/or other materials provided with the distribution.
- *
- * THIS SOFTWARE IS PROVIDED BY APPLE INC. ``AS IS'' AND ANY
- * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
- * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
- * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL APPLE INC. OR
- * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
- * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
- * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
- * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
- * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
- */
-
-// DO NOT EDIT! - this file autogenerated by YarrCanonicalizeUCS2.js
-
-#include "config.h"
-#include "YarrCanonicalizeUCS2.h"
-
-namespace JSC { namespace Yarr {
-
-#include <stdint.h>
-
-const uint16_t ucs2CharacterSet0[] = { 0x01c4u, 0x01c5u, 0x01c6u, 0 };
-const uint16_t ucs2CharacterSet1[] = { 0x01c7u, 0x01c8u, 0x01c9u, 0 };
-const uint16_t ucs2CharacterSet2[] = { 0x01cau, 0x01cbu, 0x01ccu, 0 };
-const uint16_t ucs2CharacterSet3[] = { 0x01f1u, 0x01f2u, 0x01f3u, 0 };
-const uint16_t ucs2CharacterSet4[] = { 0x0392u, 0x03b2u, 0x03d0u, 0 };
-const uint16_t ucs2CharacterSet5[] = { 0x0395u, 0x03b5u, 0x03f5u, 0 };
-const uint16_t ucs2CharacterSet6[] = { 0x0398u, 0x03b8u, 0x03d1u, 0 };
-const uint16_t ucs2CharacterSet7[] = { 0x0345u, 0x0399u, 0x03b9u, 0x1fbeu, 0 };
-const uint16_t ucs2CharacterSet8[] = { 0x039au, 0x03bau, 0x03f0u, 0 };
-const uint16_t ucs2CharacterSet9[] = { 0x00b5u, 0x039cu, 0x03bcu, 0 };
-const uint16_t ucs2CharacterSet10[] = { 0x03a0u, 0x03c0u, 0x03d6u, 0 };
-const uint16_t ucs2CharacterSet11[] = { 0x03a1u, 0x03c1u, 0x03f1u, 0 };
-const uint16_t ucs2CharacterSet12[] = { 0x03a3u, 0x03c2u, 0x03c3u, 0 };
-const uint16_t ucs2CharacterSet13[] = { 0x03a6u, 0x03c6u, 0x03d5u, 0 };
-const uint16_t ucs2CharacterSet14[] = { 0x1e60u, 0x1e61u, 0x1e9bu, 0 };
-
-static const size_t UCS2_CANONICALIZATION_SETS = 15;
-const uint16_t* const characterSetInfo[UCS2_CANONICALIZATION_SETS] = {
-    ucs2CharacterSet0,
-    ucs2CharacterSet1,
-    ucs2CharacterSet2,
-    ucs2CharacterSet3,
-    ucs2CharacterSet4,
-    ucs2CharacterSet5,
-    ucs2CharacterSet6,
-    ucs2CharacterSet7,
-    ucs2CharacterSet8,
-    ucs2CharacterSet9,
-    ucs2CharacterSet10,
-    ucs2CharacterSet11,
-    ucs2CharacterSet12,
-    ucs2CharacterSet13,
-    ucs2CharacterSet14,
-};
-
-const size_t UCS2_CANONICALIZATION_RANGES = 364;
-const UCS2CanonicalizationRange rangeInfo[UCS2_CANONICALIZATION_RANGES] = {
-    { 0x0000u, 0x0040u, 0x0000u, CanonicalizeUnique },
-    { 0x0041u, 0x005au, 0x0020u, CanonicalizeRangeLo },
-    { 0x005bu, 0x0060u, 0x0000u, CanonicalizeUnique },
-    { 0x0061u, 0x007au, 0x0020u, CanonicalizeRangeHi },
-    { 0x007bu, 0x00b4u, 0x0000u, CanonicalizeUnique },
-    { 0x00b5u, 0x00b5u, 0x0009u, CanonicalizeSet },
-    { 0x00b6u, 0x00bfu, 0x0000u, CanonicalizeUnique },
-    { 0x00c0u, 0x00d6u, 0x0020u, CanonicalizeRangeLo },
-    { 0x00d7u, 0x00d7u, 0x0000u, CanonicalizeUnique },
-    { 0x00d8u, 0x00deu, 0x0020u, CanonicalizeRangeLo },
-    { 0x00dfu, 0x00dfu, 0x0000u, CanonicalizeUnique },
-    { 0x00e0u, 0x00f6u, 0x0020u, CanonicalizeRangeHi },
-    { 0x00f7u, 0x00f7u, 0x0000u, CanonicalizeUnique },
-    { 0x00f8u, 0x00feu, 0x0020u, CanonicalizeRangeHi },
-    { 0x00ffu, 0x00ffu, 0x0079u, CanonicalizeRangeLo },
-    { 0x0100u, 0x012fu, 0x0000u, CanonicalizeAlternatingAligned },
-    { 0x0130u, 0x0131u, 0x0000u, CanonicalizeUnique },
-    { 0x0132u, 0x0137u, 0x0000u, CanonicalizeAlternatingAligned },
-    { 0x0138u, 0x0138u, 0x0000u, CanonicalizeUnique },
-    { 0x0139u, 0x0148u, 0x0000u, CanonicalizeAlternatingUnaligned },
-    { 0x0149u, 0x0149u, 0x0000u, CanonicalizeUnique },
-    { 0x014au, 0x0177u, 0x0000u, CanonicalizeAlternatingAligned },
-    { 0x0178u, 0x0178u, 0x0079u, CanonicalizeRangeHi },
-    { 0x0179u, 0x017eu, 0x0000u, CanonicalizeAlternatingUnaligned },
-    { 0x017fu, 0x017fu, 0x0000u, CanonicalizeUnique },
-    { 0x0180u, 0x0180u, 0x00c3u, CanonicalizeRangeLo },
-    { 0x0181u, 0x0181u, 0x00d2u, CanonicalizeRangeLo },
-    { 0x0182u, 0x0185u, 0x0000u, CanonicalizeAlternatingAligned },
-    { 0x0186u, 0x0186u, 0x00ceu, CanonicalizeRangeLo },
-    { 0x0187u, 0x0188u, 0x0000u, CanonicalizeAlternatingUnaligned },
-    { 0x0189u, 0x018au, 0x00cdu, CanonicalizeRangeLo },
-    { 0x018bu, 0x018cu, 0x0000u, CanonicalizeAlternatingUnaligned },
-    { 0x018du, 0x018du, 0x0000u, CanonicalizeUnique },
-    { 0x018eu, 0x018eu, 0x004fu, CanonicalizeRangeLo },
-    { 0x018fu, 0x018fu, 0x00cau, CanonicalizeRangeLo },
-    { 0x0190u, 0x0190u, 0x00cbu, CanonicalizeRangeLo },
-    { 0x0191u, 0x0192u, 0x0000u, CanonicalizeAlternatingUnaligned },
-    { 0x0193u, 0x0193u, 0x00cdu, CanonicalizeRangeLo },
-    { 0x0194u, 0x0194u, 0x00cfu, CanonicalizeRangeLo },
-    { 0x0195u, 0x0195u, 0x0061u, CanonicalizeRangeLo },
-    { 0x0196u, 0x0196u, 0x00d3u, CanonicalizeRangeLo },
-    { 0x0197u, 0x0197u, 0x00d1u, CanonicalizeRangeLo },
-    { 0x0198u, 0x0199u, 0x0000u, CanonicalizeAlternatingAligned },
-    { 0x019au, 0x019au, 0x00a3u, CanonicalizeRangeLo },
-    { 0x019bu, 0x019bu, 0x0000u, CanonicalizeUnique },
-    { 0x019cu, 0x019cu, 0x00d3u, CanonicalizeRangeLo },
-    { 0x019du, 0x019du, 0x00d5u, CanonicalizeRangeLo },
-    { 0x019eu, 0x019eu, 0x0082u, CanonicalizeRangeLo },
-    { 0x019fu, 0x019fu, 0x00d6u, CanonicalizeRangeLo },
-    { 0x01a0u, 0x01a5u, 0x0000u, CanonicalizeAlternatingAligned },
-    { 0x01a6u, 0x01a6u, 0x00dau, CanonicalizeRangeLo },
-    { 0x01a7u, 0x01a8u, 0x0000u, CanonicalizeAlternatingUnaligned },
-    { 0x01a9u, 0x01a9u, 0x00dau, CanonicalizeRangeLo },
-    { 0x01aau, 0x01abu, 0x0000u, CanonicalizeUnique },
-    { 0x01acu, 0x01adu, 0x0000u, CanonicalizeAlternatingAligned },
-    { 0x01aeu, 0x01aeu, 0x00dau, CanonicalizeRangeLo },
-    { 0x01afu, 0x01b0u, 0x0000u, CanonicalizeAlternatingUnaligned },
-    { 0x01b1u, 0x01b2u, 0x00d9u, CanonicalizeRangeLo },
-    { 0x01b3u, 0x01b6u, 0x0000u, CanonicalizeAlternatingUnaligned },
-    { 0x01b7u, 0x01b7u, 0x00dbu, CanonicalizeRangeLo },
-    { 0x01b8u, 0x01b9u, 0x0000u, CanonicalizeAlternatingAligned },
-    { 0x01bau, 0x01bbu, 0x0000u, CanonicalizeUnique },
-    { 0x01bcu, 0x01bdu, 0x0000u, CanonicalizeAlternatingAligned },
-    { 0x01beu, 0x01beu, 0x0000u, CanonicalizeUnique },
-    { 0x01bfu, 0x01bfu, 0x0038u, CanonicalizeRangeLo },
-    { 0x01c0u, 0x01c3u, 0x0000u, CanonicalizeUnique },
-    { 0x01c4u, 0x01c6u, 0x0000u, CanonicalizeSet },
-    { 0x01c7u, 0x01c9u, 0x0001u, CanonicalizeSet },
-    { 0x01cau, 0x01ccu, 0x0002u, CanonicalizeSet },
-    { 0x01cdu, 0x01dcu, 0x0000u, CanonicalizeAlternatingUnaligned },
-    { 0x01ddu, 0x01ddu, 0x004fu, CanonicalizeRangeHi },
-    { 0x01deu, 0x01efu, 0x0000u, CanonicalizeAlternatingAligned },
-    { 0x01f0u, 0x01f0u, 0x0000u, CanonicalizeUnique },
-    { 0x01f1u, 0x01f3u, 0x0003u, CanonicalizeSet },
-    { 0x01f4u, 0x01f5u, 0x0000u, CanonicalizeAlternatingAligned },
-    { 0x01f6u, 0x01f6u, 0x0061u, CanonicalizeRangeHi },
-    { 0x01f7u, 0x01f7u, 0x0038u, CanonicalizeRangeHi },
-    { 0x01f8u, 0x021fu, 0x0000u, CanonicalizeAlternatingAligned },
-    { 0x0220u, 0x0220u, 0x0082u, CanonicalizeRangeHi },
-    { 0x0221u, 0x0221u, 0x0000u, CanonicalizeUnique },
-    { 0x0222u, 0x0233u, 0x0000u, CanonicalizeAlternatingAligned },
-    { 0x0234u, 0x0239u, 0x0000u, CanonicalizeUnique },
-    { 0x023au, 0x023au, 0x2a2bu, CanonicalizeRangeLo },
-    { 0x023bu, 0x023cu, 0x0000u, CanonicalizeAlternatingUnaligned },
-    { 0x023du, 0x023du, 0x00a3u, CanonicalizeRangeHi },
-    { 0x023eu, 0x023eu, 0x2a28u, CanonicalizeRangeLo },
-    { 0x023fu, 0x0240u, 0x2a3fu, CanonicalizeRangeLo },
-    { 0x0241u, 0x0242u, 0x0000u, CanonicalizeAlternatingUnaligned },
-    { 0x0243u, 0x0243u, 0x00c3u, CanonicalizeRangeHi },
-    { 0x0244u, 0x0244u, 0x0045u, CanonicalizeRangeLo },
-    { 0x0245u, 0x0245u, 0x0047u, CanonicalizeRangeLo },
-    { 0x0246u, 0x024fu, 0x0000u, CanonicalizeAlternatingAligned },
-    { 0x0250u, 0x0250u, 0x2a1fu, CanonicalizeRangeLo },
-    { 0x0251u, 0x0251u, 0x2a1cu, CanonicalizeRangeLo },
-    { 0x0252u, 0x0252u, 0x2a1eu, CanonicalizeRangeLo },
-    { 0x0253u, 0x0253u, 0x00d2u, CanonicalizeRangeHi },
-    { 0x0254u, 0x0254u, 0x00ceu, CanonicalizeRangeHi },
-    { 0x0255u, 0x0255u, 0x0000u, CanonicalizeUnique },
-    { 0x0256u, 0x0257u, 0x00cdu, CanonicalizeRangeHi },
-    { 0x0258u, 0x0258u, 0x0000u, CanonicalizeUnique },
-    { 0x0259u, 0x0259u, 0x00cau, CanonicalizeRangeHi },
-    { 0x025au, 0x025au, 0x0000u, CanonicalizeUnique },
-    { 0x025bu, 0x025bu, 0x00cbu, CanonicalizeRangeHi },
-    { 0x025cu, 0x025fu, 0x0000u, CanonicalizeUnique },
-    { 0x0260u, 0x0260u, 0x00cdu, CanonicalizeRangeHi },
-    { 0x0261u, 0x0262u, 0x0000u, CanonicalizeUnique },
-    { 0x0263u, 0x0263u, 0x00cfu, CanonicalizeRangeHi },
-    { 0x0264u, 0x0264u, 0x0000u, CanonicalizeUnique },
-    { 0x0265u, 0x0265u, 0xa528u, CanonicalizeRangeLo },
-    { 0x0266u, 0x0267u, 0x0000u, CanonicalizeUnique },
-    { 0x0268u, 0x0268u, 0x00d1u, CanonicalizeRangeHi },
-    { 0x0269u, 0x0269u, 0x00d3u, CanonicalizeRangeHi },
-    { 0x026au, 0x026au, 0x0000u, CanonicalizeUnique },
-    { 0x026bu, 0x026bu, 0x29f7u, CanonicalizeRangeLo },
-    { 0x026cu, 0x026eu, 0x0000u, CanonicalizeUnique },
-    { 0x026fu, 0x026fu, 0x00d3u, CanonicalizeRangeHi },
-    { 0x0270u, 0x0270u, 0x0000u, CanonicalizeUnique },
-    { 0x0271u, 0x0271u, 0x29fdu, CanonicalizeRangeLo },
-    { 0x0272u, 0x0272u, 0x00d5u, CanonicalizeRangeHi },
-    { 0x0273u, 0x0274u, 0x0000u, CanonicalizeUnique },
-    { 0x0275u, 0x0275u, 0x00d6u, CanonicalizeRangeHi },
-    { 0x0276u, 0x027cu, 0x0000u, CanonicalizeUnique },
-    { 0x027du, 0x027du, 0x29e7u, CanonicalizeRangeLo },
-    { 0x027eu, 0x027fu, 0x0000u, CanonicalizeUnique },
-    { 0x0280u, 0x0280u, 0x00dau, CanonicalizeRangeHi },
-    { 0x0281u, 0x0282u, 0x0000u, CanonicalizeUnique },
-    { 0x0283u, 0x0283u, 0x00dau, CanonicalizeRangeHi },
-    { 0x0284u, 0x0287u, 0x0000u, CanonicalizeUnique },
-    { 0x0288u, 0x0288u, 0x00dau, CanonicalizeRangeHi },
-    { 0x0289u, 0x0289u, 0x0045u, CanonicalizeRangeHi },
-    { 0x028au, 0x028bu, 0x00d9u, CanonicalizeRangeHi },
-    { 0x028cu, 0x028cu, 0x0047u, CanonicalizeRangeHi },
-    { 0x028du, 0x0291u, 0x0000u, CanonicalizeUnique },
-    { 0x0292u, 0x0292u, 0x00dbu, CanonicalizeRangeHi },
-    { 0x0293u, 0x0344u, 0x0000u, CanonicalizeUnique },
-    { 0x0345u, 0x0345u, 0x0007u, CanonicalizeSet },
-    { 0x0346u, 0x036fu, 0x0000u, CanonicalizeUnique },
-    { 0x0370u, 0x0373u, 0x0000u, CanonicalizeAlternatingAligned },
-    { 0x0374u, 0x0375u, 0x0000u, CanonicalizeUnique },
-    { 0x0376u, 0x0377u, 0x0000u, CanonicalizeAlternatingAligned },
-    { 0x0378u, 0x037au, 0x0000u, CanonicalizeUnique },
-    { 0x037bu, 0x037du, 0x0082u, CanonicalizeRangeLo },
-    { 0x037eu, 0x0385u, 0x0000u, CanonicalizeUnique },
-    { 0x0386u, 0x0386u, 0x0026u, CanonicalizeRangeLo },
-    { 0x0387u, 0x0387u, 0x0000u, CanonicalizeUnique },
-    { 0x0388u, 0x038au, 0x0025u, CanonicalizeRangeLo },
-    { 0x038bu, 0x038bu, 0x0000u, CanonicalizeUnique },
-    { 0x038cu, 0x038cu, 0x0040u, CanonicalizeRangeLo },
-    { 0x038du, 0x038du, 0x0000u, CanonicalizeUnique },
-    { 0x038eu, 0x038fu, 0x003fu, CanonicalizeRangeLo },
-    { 0x0390u, 0x0390u, 0x0000u, CanonicalizeUnique },
-    { 0x0391u, 0x0391u, 0x0020u, CanonicalizeRangeLo },
-    { 0x0392u, 0x0392u, 0x0004u, CanonicalizeSet },
-    { 0x0393u, 0x0394u, 0x0020u, CanonicalizeRangeLo },
-    { 0x0395u, 0x0395u, 0x0005u, CanonicalizeSet },
-    { 0x0396u, 0x0397u, 0x0020u, CanonicalizeRangeLo },
-    { 0x0398u, 0x0398u, 0x0006u, CanonicalizeSet },
-    { 0x0399u, 0x0399u, 0x0007u, CanonicalizeSet },
-    { 0x039au, 0x039au, 0x0008u, CanonicalizeSet },
-    { 0x039bu, 0x039bu, 0x0020u, CanonicalizeRangeLo },
-    { 0x039cu, 0x039cu, 0x0009u, CanonicalizeSet },
-    { 0x039du, 0x039fu, 0x0020u, CanonicalizeRangeLo },
-    { 0x03a0u, 0x03a0u, 0x000au, CanonicalizeSet },
-    { 0x03a1u, 0x03a1u, 0x000bu, CanonicalizeSet },
-    { 0x03a2u, 0x03a2u, 0x0000u, CanonicalizeUnique },
-    { 0x03a3u, 0x03a3u, 0x000cu, CanonicalizeSet },
-    { 0x03a4u, 0x03a5u, 0x0020u, CanonicalizeRangeLo },
-    { 0x03a6u, 0x03a6u, 0x000du, CanonicalizeSet },
-    { 0x03a7u, 0x03abu, 0x0020u, CanonicalizeRangeLo },
-    { 0x03acu, 0x03acu, 0x0026u, CanonicalizeRangeHi },
-    { 0x03adu, 0x03afu, 0x0025u, CanonicalizeRangeHi },
-    { 0x03b0u, 0x03b0u, 0x0000u, CanonicalizeUnique },
-    { 0x03b1u, 0x03b1u, 0x0020u, CanonicalizeRangeHi },
-    { 0x03b2u, 0x03b2u, 0x0004u, CanonicalizeSet },
-    { 0x03b3u, 0x03b4u, 0x0020u, CanonicalizeRangeHi },
-    { 0x03b5u, 0x03b5u, 0x0005u, CanonicalizeSet },
-    { 0x03b6u, 0x03b7u, 0x0020u, CanonicalizeRangeHi },
-    { 0x03b8u, 0x03b8u, 0x0006u, CanonicalizeSet },
-    { 0x03b9u, 0x03b9u, 0x0007u, CanonicalizeSet },
-    { 0x03bau, 0x03bau, 0x0008u, CanonicalizeSet },
-    { 0x03bbu, 0x03bbu, 0x0020u, CanonicalizeRangeHi },
-    { 0x03bcu, 0x03bcu, 0x0009u, CanonicalizeSet },
-    { 0x03bdu, 0x03bfu, 0x0020u, CanonicalizeRangeHi },
-    { 0x03c0u, 0x03c0u, 0x000au, CanonicalizeSet },
-    { 0x03c1u, 0x03c1u, 0x000bu, CanonicalizeSet },
-    { 0x03c2u, 0x03c3u, 0x000cu, CanonicalizeSet },
-    { 0x03c4u, 0x03c5u, 0x0020u, CanonicalizeRangeHi },
-    { 0x03c6u, 0x03c6u, 0x000du, CanonicalizeSet },
-    { 0x03c7u, 0x03cbu, 0x0020u, CanonicalizeRangeHi },
-    { 0x03ccu, 0x03ccu, 0x0040u, CanonicalizeRangeHi },
-    { 0x03cdu, 0x03ceu, 0x003fu, CanonicalizeRangeHi },
-    { 0x03cfu, 0x03cfu, 0x0008u, CanonicalizeRangeLo },
-    { 0x03d0u, 0x03d0u, 0x0004u, CanonicalizeSet },
-    { 0x03d1u, 0x03d1u, 0x0006u, CanonicalizeSet },
-    { 0x03d2u, 0x03d4u, 0x0000u, CanonicalizeUnique },
-    { 0x03d5u, 0x03d5u, 0x000du, CanonicalizeSet },
-    { 0x03d6u, 0x03d6u, 0x000au, CanonicalizeSet },
-    { 0x03d7u, 0x03d7u, 0x0008u, CanonicalizeRangeHi },
-    { 0x03d8u, 0x03efu, 0x0000u, CanonicalizeAlternatingAligned },
-    { 0x03f0u, 0x03f0u, 0x0008u, CanonicalizeSet },
-    { 0x03f1u, 0x03f1u, 0x000bu, CanonicalizeSet },
-    { 0x03f2u, 0x03f2u, 0x0007u, CanonicalizeRangeLo },
-    { 0x03f3u, 0x03f4u, 0x0000u, CanonicalizeUnique },
-    { 0x03f5u, 0x03f5u, 0x0005u, CanonicalizeSet },
-    { 0x03f6u, 0x03f6u, 0x0000u, CanonicalizeUnique },
-    { 0x03f7u, 0x03f8u, 0x0000u, CanonicalizeAlternatingUnaligned },
-    { 0x03f9u, 0x03f9u, 0x0007u, CanonicalizeRangeHi },
-    { 0x03fau, 0x03fbu, 0x0000u, CanonicalizeAlternatingAligned },
-    { 0x03fcu, 0x03fcu, 0x0000u, CanonicalizeUnique },
-    { 0x03fdu, 0x03ffu, 0x0082u, CanonicalizeRangeHi },
-    { 0x0400u, 0x040fu, 0x0050u, CanonicalizeRangeLo },
-    { 0x0410u, 0x042fu, 0x0020u, CanonicalizeRangeLo },
-    { 0x0430u, 0x044fu, 0x0020u, CanonicalizeRangeHi },
-    { 0x0450u, 0x045fu, 0x0050u, CanonicalizeRangeHi },
-    { 0x0460u, 0x0481u, 0x0000u, CanonicalizeAlternatingAligned },
-    { 0x0482u, 0x0489u, 0x0000u, CanonicalizeUnique },
-    { 0x048au, 0x04bfu, 0x0000u, CanonicalizeAlternatingAligned },
-    { 0x04c0u, 0x04c0u, 0x000fu, CanonicalizeRangeLo },
-    { 0x04c1u, 0x04ceu, 0x0000u, CanonicalizeAlternatingUnaligned },
-    { 0x04cfu, 0x04cfu, 0x000fu, CanonicalizeRangeHi },
-    { 0x04d0u, 0x0527u, 0x0000u, CanonicalizeAlternatingAligned },
-    { 0x0528u, 0x0530u, 0x0000u, CanonicalizeUnique },
-    { 0x0531u, 0x0556u, 0x0030u, CanonicalizeRangeLo },
-    { 0x0557u, 0x0560u, 0x0000u, CanonicalizeUnique },
-    { 0x0561u, 0x0586u, 0x0030u, CanonicalizeRangeHi },
-    { 0x0587u, 0x109fu, 0x0000u, CanonicalizeUnique },
-    { 0x10a0u, 0x10c5u, 0x1c60u, CanonicalizeRangeLo },
-    { 0x10c6u, 0x1d78u, 0x0000u, CanonicalizeUnique },
-    { 0x1d79u, 0x1d79u, 0x8a04u, CanonicalizeRangeLo },
-    { 0x1d7au, 0x1d7cu, 0x0000u, CanonicalizeUnique },
-    { 0x1d7du, 0x1d7du, 0x0ee6u, CanonicalizeRangeLo },
-    { 0x1d7eu, 0x1dffu, 0x0000u, CanonicalizeUnique },
-    { 0x1e00u, 0x1e5fu, 0x0000u, CanonicalizeAlternatingAligned },
-    { 0x1e60u, 0x1e61u, 0x000eu, CanonicalizeSet },
-    { 0x1e62u, 0x1e95u, 0x0000u, CanonicalizeAlternatingAligned },
-    { 0x1e96u, 0x1e9au, 0x0000u, CanonicalizeUnique },
-    { 0x1e9bu, 0x1e9bu, 0x000eu, CanonicalizeSet },
-    { 0x1e9cu, 0x1e9fu, 0x0000u, CanonicalizeUnique },
-    { 0x1ea0u, 0x1effu, 0x0000u, CanonicalizeAlternatingAligned },
-    { 0x1f00u, 0x1f07u, 0x0008u, CanonicalizeRangeLo },
-    { 0x1f08u, 0x1f0fu, 0x0008u, CanonicalizeRangeHi },
-    { 0x1f10u, 0x1f15u, 0x0008u, CanonicalizeRangeLo },
-    { 0x1f16u, 0x1f17u, 0x0000u, CanonicalizeUnique },
-    { 0x1f18u, 0x1f1du, 0x0008u, CanonicalizeRangeHi },
-    { 0x1f1eu, 0x1f1fu, 0x0000u, CanonicalizeUnique },
-    { 0x1f20u, 0x1f27u, 0x0008u, CanonicalizeRangeLo },
-    { 0x1f28u, 0x1f2fu, 0x0008u, CanonicalizeRangeHi },
-    { 0x1f30u, 0x1f37u, 0x0008u, CanonicalizeRangeLo },
-    { 0x1f38u, 0x1f3fu, 0x0008u, CanonicalizeRangeHi },
-    { 0x1f40u, 0x1f45u, 0x0008u, CanonicalizeRangeLo },
-    { 0x1f46u, 0x1f47u, 0x0000u, CanonicalizeUnique },
-    { 0x1f48u, 0x1f4du, 0x0008u, CanonicalizeRangeHi },
-    { 0x1f4eu, 0x1f50u, 0x0000u, CanonicalizeUnique },
-    { 0x1f51u, 0x1f51u, 0x0008u, CanonicalizeRangeLo },
-    { 0x1f52u, 0x1f52u, 0x0000u, CanonicalizeUnique },
-    { 0x1f53u, 0x1f53u, 0x0008u, CanonicalizeRangeLo },
-    { 0x1f54u, 0x1f54u, 0x0000u, CanonicalizeUnique },
-    { 0x1f55u, 0x1f55u, 0x0008u, CanonicalizeRangeLo },
-    { 0x1f56u, 0x1f56u, 0x0000u, CanonicalizeUnique },
-    { 0x1f57u, 0x1f57u, 0x0008u, CanonicalizeRangeLo },
-    { 0x1f58u, 0x1f58u, 0x0000u, CanonicalizeUnique },
-    { 0x1f59u, 0x1f59u, 0x0008u, CanonicalizeRangeHi },
-    { 0x1f5au, 0x1f5au, 0x0000u, CanonicalizeUnique },
-    { 0x1f5bu, 0x1f5bu, 0x0008u, CanonicalizeRangeHi },
-    { 0x1f5cu, 0x1f5cu, 0x0000u, CanonicalizeUnique },
-    { 0x1f5du, 0x1f5du, 0x0008u, CanonicalizeRangeHi },
-    { 0x1f5eu, 0x1f5eu, 0x0000u, CanonicalizeUnique },
-    { 0x1f5fu, 0x1f5fu, 0x0008u, CanonicalizeRangeHi },
-    { 0x1f60u, 0x1f67u, 0x0008u, CanonicalizeRangeLo },
-    { 0x1f68u, 0x1f6fu, 0x0008u, CanonicalizeRangeHi },
-    { 0x1f70u, 0x1f71u, 0x004au, CanonicalizeRangeLo },
-    { 0x1f72u, 0x1f75u, 0x0056u, CanonicalizeRangeLo },
-    { 0x1f76u, 0x1f77u, 0x0064u, CanonicalizeRangeLo },
-    { 0x1f78u, 0x1f79u, 0x0080u, CanonicalizeRangeLo },
-    { 0x1f7au, 0x1f7bu, 0x0070u, CanonicalizeRangeLo },
-    { 0x1f7cu, 0x1f7du, 0x007eu, CanonicalizeRangeLo },
-    { 0x1f7eu, 0x1fafu, 0x0000u, CanonicalizeUnique },
-    { 0x1fb0u, 0x1fb1u, 0x0008u, CanonicalizeRangeLo },
-    { 0x1fb2u, 0x1fb7u, 0x0000u, CanonicalizeUnique },
-    { 0x1fb8u, 0x1fb9u, 0x0008u, CanonicalizeRangeHi },
-    { 0x1fbau, 0x1fbbu, 0x004au, CanonicalizeRangeHi },
-    { 0x1fbcu, 0x1fbdu, 0x0000u, CanonicalizeUnique },
-    { 0x1fbeu, 0x1fbeu, 0x0007u, CanonicalizeSet },
-    { 0x1fbfu, 0x1fc7u, 0x0000u, CanonicalizeUnique },
-    { 0x1fc8u, 0x1fcbu, 0x0056u, CanonicalizeRangeHi },
-    { 0x1fccu, 0x1fcfu, 0x0000u, CanonicalizeUnique },
-    { 0x1fd0u, 0x1fd1u, 0x0008u, CanonicalizeRangeLo },
-    { 0x1fd2u, 0x1fd7u, 0x0000u, CanonicalizeUnique },
-    { 0x1fd8u, 0x1fd9u, 0x0008u, CanonicalizeRangeHi },
-    { 0x1fdau, 0x1fdbu, 0x0064u, CanonicalizeRangeHi },
-    { 0x1fdcu, 0x1fdfu, 0x0000u, CanonicalizeUnique },
-    { 0x1fe0u, 0x1fe1u, 0x0008u, CanonicalizeRangeLo },
-    { 0x1fe2u, 0x1fe4u, 0x0000u, CanonicalizeUnique },
-    { 0x1fe5u, 0x1fe5u, 0x0007u, CanonicalizeRangeLo },
-    { 0x1fe6u, 0x1fe7u, 0x0000u, CanonicalizeUnique },
-    { 0x1fe8u, 0x1fe9u, 0x0008u, CanonicalizeRangeHi },
-    { 0x1feau, 0x1febu, 0x0070u, CanonicalizeRangeHi },
-    { 0x1fecu, 0x1fecu, 0x0007u, CanonicalizeRangeHi },
-    { 0x1fedu, 0x1ff7u, 0x0000u, CanonicalizeUnique },
-    { 0x1ff8u, 0x1ff9u, 0x0080u, CanonicalizeRangeHi },
-    { 0x1ffau, 0x1ffbu, 0x007eu, CanonicalizeRangeHi },
-    { 0x1ffcu, 0x2131u, 0x0000u, CanonicalizeUnique },
-    { 0x2132u, 0x2132u, 0x001cu, CanonicalizeRangeLo },
-    { 0x2133u, 0x214du, 0x0000u, CanonicalizeUnique },
-    { 0x214eu, 0x214eu, 0x001cu, CanonicalizeRangeHi },
-    { 0x214fu, 0x215fu, 0x0000u, CanonicalizeUnique },
-    { 0x2160u, 0x216fu, 0x0010u, CanonicalizeRangeLo },
-    { 0x2170u, 0x217fu, 0x0010u, CanonicalizeRangeHi },
-    { 0x2180u, 0x2182u, 0x0000u, CanonicalizeUnique },
-    { 0x2183u, 0x2184u, 0x0000u, CanonicalizeAlternatingUnaligned },
-    { 0x2185u, 0x24b5u, 0x0000u, CanonicalizeUnique },
-    { 0x24b6u, 0x24cfu, 0x001au, CanonicalizeRangeLo },
-    { 0x24d0u, 0x24e9u, 0x001au, CanonicalizeRangeHi },
-    { 0x24eau, 0x2bffu, 0x0000u, CanonicalizeUnique },
-    { 0x2c00u, 0x2c2eu, 0x0030u, CanonicalizeRangeLo },
-    { 0x2c2fu, 0x2c2fu, 0x0000u, CanonicalizeUnique },
-    { 0x2c30u, 0x2c5eu, 0x0030u, CanonicalizeRangeHi },
-    { 0x2c5fu, 0x2c5fu, 0x0000u, CanonicalizeUnique },
-    { 0x2c60u, 0x2c61u, 0x0000u, CanonicalizeAlternatingAligned },
-    { 0x2c62u, 0x2c62u, 0x29f7u, CanonicalizeRangeHi },
-    { 0x2c63u, 0x2c63u, 0x0ee6u, CanonicalizeRangeHi },
-    { 0x2c64u, 0x2c64u, 0x29e7u, CanonicalizeRangeHi },
-    { 0x2c65u, 0x2c65u, 0x2a2bu, CanonicalizeRangeHi },
-    { 0x2c66u, 0x2c66u, 0x2a28u, CanonicalizeRangeHi },
-    { 0x2c67u, 0x2c6cu, 0x0000u, CanonicalizeAlternatingUnaligned },
-    { 0x2c6du, 0x2c6du, 0x2a1cu, CanonicalizeRangeHi },
-    { 0x2c6eu, 0x2c6eu, 0x29fdu, CanonicalizeRangeHi },
-    { 0x2c6fu, 0x2c6fu, 0x2a1fu, CanonicalizeRangeHi },
-    { 0x2c70u, 0x2c70u, 0x2a1eu, CanonicalizeRangeHi },
-    { 0x2c71u, 0x2c71u, 0x0000u, CanonicalizeUnique },
-    { 0x2c72u, 0x2c73u, 0x0000u, CanonicalizeAlternatingAligned },
-    { 0x2c74u, 0x2c74u, 0x0000u, CanonicalizeUnique },
-    { 0x2c75u, 0x2c76u, 0x0000u, CanonicalizeAlternatingUnaligned },
-    { 0x2c77u, 0x2c7du, 0x0000u, CanonicalizeUnique },
-    { 0x2c7eu, 0x2c7fu, 0x2a3fu, CanonicalizeRangeHi },
-    { 0x2c80u, 0x2ce3u, 0x0000u, CanonicalizeAlternatingAligned },
-    { 0x2ce4u, 0x2ceau, 0x0000u, CanonicalizeUnique },
-    { 0x2cebu, 0x2ceeu, 0x0000u, CanonicalizeAlternatingUnaligned },
-    { 0x2cefu, 0x2cffu, 0x0000u, CanonicalizeUnique },
-    { 0x2d00u, 0x2d25u, 0x1c60u, CanonicalizeRangeHi },
-    { 0x2d26u, 0xa63fu, 0x0000u, CanonicalizeUnique },
-    { 0xa640u, 0xa66du, 0x0000u, CanonicalizeAlternatingAligned },
-    { 0xa66eu, 0xa67fu, 0x0000u, CanonicalizeUnique },
-    { 0xa680u, 0xa697u, 0x0000u, CanonicalizeAlternatingAligned },
-    { 0xa698u, 0xa721u, 0x0000u, CanonicalizeUnique },
-    { 0xa722u, 0xa72fu, 0x0000u, CanonicalizeAlternatingAligned },
-    { 0xa730u, 0xa731u, 0x0000u, CanonicalizeUnique },
-    { 0xa732u, 0xa76fu, 0x0000u, CanonicalizeAlternatingAligned },
-    { 0xa770u, 0xa778u, 0x0000u, CanonicalizeUnique },
-    { 0xa779u, 0xa77cu, 0x0000u, CanonicalizeAlternatingUnaligned },
-    { 0xa77du, 0xa77du, 0x8a04u, CanonicalizeRangeHi },
-    { 0xa77eu, 0xa787u, 0x0000u, CanonicalizeAlternatingAligned },
-    { 0xa788u, 0xa78au, 0x0000u, CanonicalizeUnique },
-    { 0xa78bu, 0xa78cu, 0x0000u, CanonicalizeAlternatingUnaligned },
-    { 0xa78du, 0xa78du, 0xa528u, CanonicalizeRangeHi },
-    { 0xa78eu, 0xa78fu, 0x0000u, CanonicalizeUnique },
-    { 0xa790u, 0xa791u, 0x0000u, CanonicalizeAlternatingAligned },
-    { 0xa792u, 0xa79fu, 0x0000u, CanonicalizeUnique },
-    { 0xa7a0u, 0xa7a9u, 0x0000u, CanonicalizeAlternatingAligned },
-    { 0xa7aau, 0xff20u, 0x0000u, CanonicalizeUnique },
-    { 0xff21u, 0xff3au, 0x0020u, CanonicalizeRangeLo },
-    { 0xff3bu, 0xff40u, 0x0000u, CanonicalizeUnique },
-    { 0xff41u, 0xff5au, 0x0020u, CanonicalizeRangeHi },
-    { 0xff5bu, 0xffffu, 0x0000u, CanonicalizeUnique },
-};
-
-const size_t LATIN_CANONICALIZATION_RANGES = 20;
-LatinCanonicalizationRange latinRangeInfo[LATIN_CANONICALIZATION_RANGES] = {
-    { 0x0000u, 0x0040u, 0x0000u, CanonicalizeLatinSelf },
-    { 0x0041u, 0x005au, 0x0000u, CanonicalizeLatinMask0x20 },
-    { 0x005bu, 0x0060u, 0x0000u, CanonicalizeLatinSelf },
-    { 0x0061u, 0x007au, 0x0000u, CanonicalizeLatinMask0x20 },
-    { 0x007bu, 0x00bfu, 0x0000u, CanonicalizeLatinSelf },
-    { 0x00c0u, 0x00d6u, 0x0000u, CanonicalizeLatinMask0x20 },
-    { 0x00d7u, 0x00d7u, 0x0000u, CanonicalizeLatinSelf },
-    { 0x00d8u, 0x00deu, 0x0000u, CanonicalizeLatinMask0x20 },
-    { 0x00dfu, 0x00dfu, 0x0000u, CanonicalizeLatinSelf },
-    { 0x00e0u, 0x00f6u, 0x0000u, CanonicalizeLatinMask0x20 },
-    { 0x00f7u, 0x00f7u, 0x0000u, CanonicalizeLatinSelf },
-    { 0x00f8u, 0x00feu, 0x0000u, CanonicalizeLatinMask0x20 },
-    { 0x00ffu, 0x00ffu, 0x0000u, CanonicalizeLatinSelf },
-    { 0x0100u, 0x0177u, 0x0000u, CanonicalizeLatinInvalid },
-    { 0x0178u, 0x0178u, 0x00ffu, CanonicalizeLatinOther },
-    { 0x0179u, 0x039bu, 0x0000u, CanonicalizeLatinInvalid },
-    { 0x039cu, 0x039cu, 0x00b5u, CanonicalizeLatinOther },
-    { 0x039du, 0x03bbu, 0x0000u, CanonicalizeLatinInvalid },
-    { 0x03bcu, 0x03bcu, 0x00b5u, CanonicalizeLatinOther },
-    { 0x03bdu, 0xffffu, 0x0000u, CanonicalizeLatinInvalid },
-};
-
-} } // JSC::Yarr
-
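Review note: the rows of the removed rangeInfo table are easier to read with a small lookup sketch. The snippet below is an illustration only, not the Yarr C++ lookup code. It copies a few rows from the table above into a hypothetical `ranges` array and interprets each type the way the generator script (YarrCanonicalizeUCS2.js, next in this diff) assigns it: RangeLo/RangeHi store the distance to the partner, and the Alternating types pair a character with its immediate neighbour. The names `findRange` and `partnerOf` are assumptions made for this sketch, and CanonicalizeSet rows are deliberately left out.

```javascript
// A few rows copied from the removed table, as { begin, end, value, type }.
// Sorted by begin; rows not copied here are simply reported as unknown.
const ranges = [
    { begin: 0x018b, end: 0x018c, value: 0x0000, type: "CanonicalizeAlternatingUnaligned" },
    { begin: 0x018e, end: 0x018e, value: 0x004f, type: "CanonicalizeRangeLo" },
    { begin: 0x01dd, end: 0x01dd, value: 0x004f, type: "CanonicalizeRangeHi" },
    { begin: 0x01de, end: 0x01ef, value: 0x0000, type: "CanonicalizeAlternatingAligned" },
    { begin: 0x0410, end: 0x042f, value: 0x0020, type: "CanonicalizeRangeLo" },
    { begin: 0x0430, end: 0x044f, value: 0x0020, type: "CanonicalizeRangeHi" },
    { begin: 0x0587, end: 0x109f, value: 0x0000, type: "CanonicalizeUnique" },
];

// Binary search for the row whose [begin, end] interval contains ch.
function findRange(ch) {
    let lo = 0;
    let hi = ranges.length - 1;
    while (lo <= hi) {
        const mid = (lo + hi) >> 1;
        const r = ranges[mid];
        if (ch < r.begin)
            hi = mid - 1;
        else if (ch > r.end)
            lo = mid + 1;
        else
            return r;
    }
    return null;
}

// Returns the case-insensitive partner of ch, null if ch is unique,
// or undefined if the row is not part of this excerpt (or is a set).
function partnerOf(ch) {
    const r = findRange(ch);
    if (!r)
        return undefined;
    switch (r.type) {
    case "CanonicalizeUnique":
        return null;
    case "CanonicalizeRangeLo":
        return ch + r.value; // lower member of a pair, partner is delta above
    case "CanonicalizeRangeHi":
        return ch - r.value; // upper member of a pair, partner is delta below
    case "CanonicalizeAlternatingAligned":
        return ch ^ 1; // pairs start on an even code unit
    case "CanonicalizeAlternatingUnaligned":
        return ((ch - 1) ^ 1) + 1; // pairs start on an odd code unit
    default:
        return undefined; // CanonicalizeSet is omitted from this sketch
    }
}
```

For example, `partnerOf(0x0410)` yields 0x0430 and `partnerOf(0x0430)` yields 0x0410, which matches the two Cyrillic rows that share the delta 0x0020.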
diff --git a/Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.js b/Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.js
deleted file mode 100644 (file)
index 00361dd..0000000
+++ /dev/null
@@ -1,219 +0,0 @@
-/*
- * Copyright (C) 2012 Apple Inc. All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions
- * are met:
- * 1. Redistributions of source code must retain the above copyright
- *    notice, this list of conditions and the following disclaimer.
- * 2. Redistributions in binary form must reproduce the above copyright
- *    notice, this list of conditions and the following disclaimer in the
- *    documentation and/or other materials provided with the distribution.
- *
- * THIS SOFTWARE IS PROVIDED BY APPLE INC. ``AS IS'' AND ANY
- * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
- * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
- * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL APPLE INC. OR
- * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
- * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
- * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
- * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
- * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
- */
-
-// See ES 5.1, 15.10.2.8
-function canonicalize(ch)
-{
-    var u = String.fromCharCode(ch).toUpperCase();
-    if (u.length > 1)
-        return ch;
-    var cu = u.charCodeAt(0);
-    if (ch >= 128 && cu < 128)
-        return ch;
-    return cu;
-}
-
-var MAX_UCS2 = 0xFFFF;
-var MAX_LATIN = 0xFF;
-
-var groupedCanonically = [];
-// Pass 1: populate groupedCanonically - this is mapping from canonicalized
-// values back to the set of character code that canonicalize to them.
-for (var i = 0; i <= MAX_UCS2; ++i) {
-    var ch = canonicalize(i);
-    if (!groupedCanonically[ch])
-        groupedCanonically[ch] = [];
-    groupedCanonically[ch].push(i);
-}
-
-var typeInfo = [];
-var latinTypeInfo = [];
-var characterSetInfo = [];
-// Pass 2: populate typeInfo & characterSetInfo. For every character calculate
-// a typeInfo value, described by the types above, and a value payload.
-for (cu in groupedCanonically) {
-    // The set of characters that canonicalize to cu
-    var characters = groupedCanonically[cu];
-
-    // If there is only one, it is unique.
-    if (characters.length == 1) {
-        typeInfo[characters[0]] = "CanonicalizeUnique:0";
-        latinTypeInfo[characters[0]] = characters[0] <= MAX_LATIN ? "CanonicalizeLatinSelf:0" : "CanonicalizeLatinInvalid:0";
-        continue;
-    }
-
-    // Sort the array.
-    characters.sort(function(x,y){return x-y;});
-
-    // If there are more than two characters, create an entry in characterSetInfo.
-    if (characters.length > 2) {
-        for (i in characters)
-            typeInfo[characters[i]] = "CanonicalizeSet:" + characterSetInfo.length;
-        characterSetInfo.push(characters);
-
-        if (characters[1] <= MAX_LATIN)
-            throw new Error("sets with more than one latin character not supported!");
-        if (characters[0] <= MAX_LATIN) {
-            for (i in characters)
-                latinTypeInfo[characters[i]] = "CanonicalizeLatinOther:" + characters[0];
-            latinTypeInfo[characters[0]] = "CanonicalizeLatinSelf:0";
-        } else {
-            for (i in characters)
-                latinTypeInfo[characters[i]] = "CanonicalizeLatinInvalid:0";
-        }
-
-        continue;
-    }
-
-    // We have a pair, mark alternating ranges, otherwise track whether this is the low or high partner.
-    var lo = characters[0];
-    var hi = characters[1];
-    var delta = hi - lo;
-    if (delta == 1) {
-        var type = lo & 1 ? "CanonicalizeAlternatingUnaligned:0" : "CanonicalizeAlternatingAligned:0";
-        typeInfo[lo] = type;
-        typeInfo[hi] = type;
-    } else {
-        typeInfo[lo] = "CanonicalizeRangeLo:" + delta;
-        typeInfo[hi] = "CanonicalizeRangeHi:" + delta;
-    }
-
-    if (lo > MAX_LATIN) {
-        latinTypeInfo[lo] = "CanonicalizeLatinInvalid:0"; 
-        latinTypeInfo[hi] = "CanonicalizeLatinInvalid:0";
-    } else if (hi > MAX_LATIN) {
-        latinTypeInfo[lo] = "CanonicalizeLatinSelf:0"; 
-        latinTypeInfo[hi] = "CanonicalizeLatinOther:" + lo;
-    } else {
-        if (delta != 0x20 || lo & 0x20)
-            throw new Error("pairs of latin characters that don't mask with 0x20 not supported!");
-        latinTypeInfo[lo] = "CanonicalizeLatinMask0x20:0";
-        latinTypeInfo[hi] = "CanonicalizeLatinMask0x20:0";
-    }
-}
-
-var rangeInfo = [];
-// Pass 3: coallesce types into ranges.
-for (var end = 0; end <= MAX_UCS2; ++end) {
-    var begin = end;
-    var type = typeInfo[end];
-    while (end < MAX_UCS2 && typeInfo[end + 1] == type)
-        ++end;
-    rangeInfo.push({begin:begin, end:end, type:type});
-}
-
-var latinRangeInfo = [];
-// Pass 4: coallesce latin-1 types into ranges.
-for (var end = 0; end <= MAX_UCS2; ++end) {
-    var begin = end;
-    var type = latinTypeInfo[end];
-    while (end < MAX_UCS2 && latinTypeInfo[end + 1] == type)
-        ++end;
-    latinRangeInfo.push({begin:begin, end:end, type:type});
-}
-
-
-// Helper function to convert a number to a fixed width hex representation of a C uint16_t.
-function hex(x)
-{
-    var s = Number(x).toString(16);
-    while (s.length < 4)
-        s = 0 + s;
-    return "0x" + s + "u";
-}
-
-var copyright = (
-    "/*"                                                                            + "\n" +
-    " * Copyright (C) 2012 Apple Inc. All rights reserved."                         + "\n" +
-    " *"                                                                            + "\n" +
-    " * Redistribution and use in source and binary forms, with or without"         + "\n" +
-    " * modification, are permitted provided that the following conditions"         + "\n" +
-    " * are met:"                                                                   + "\n" +
-    " * 1. Redistributions of source code must retain the above copyright"          + "\n" +
-    " *    notice, this list of conditions and the following disclaimer."           + "\n" +
-    " * 2. Redistributions in binary form must reproduce the above copyright"       + "\n" +
-    " *    notice, this list of conditions and the following disclaimer in the"     + "\n" +
-    " *    documentation and/or other materials provided with the distribution."    + "\n" +
-    " *"                                                                            + "\n" +
-    " * THIS SOFTWARE IS PROVIDED BY APPLE INC. ``AS IS'' AND ANY"                  + "\n" +
-    " * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE"          + "\n" +
-    " * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR"         + "\n" +
-    " * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL APPLE INC. OR"                   + "\n" +
-    " * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,"      + "\n" +
-    " * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,"        + "\n" +
-    " * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR"         + "\n" +
-    " * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY"        + "\n" +
-    " * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT"               + "\n" +
-    " * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE"      + "\n" +
-    " * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. "      + "\n" +
-    " */");
-
-print(copyright);
-print();
-print("// DO NOT EDIT! - this file autogenerated by YarrCanonicalizeUCS2.js");
-print();
-print('#include "config.h"');
-print('#include "YarrCanonicalizeUCS2.h"');
-print();
-print("namespace JSC { namespace Yarr {");
-print();
-print("#include <stdint.h>");
-print();
-
-for (i in characterSetInfo) {
-    var characters = ""
-    var set = characterSetInfo[i];
-    for (var j in set)
-        characters += hex(set[j]) + ", ";
-    print("uint16_t ucs2CharacterSet" + i + "[] = { " + characters + "0 };");
-}
-print();
-print("static const size_t UCS2_CANONICALIZATION_SETS = " + characterSetInfo.length + ";");
-print("uint16_t* characterSetInfo[UCS2_CANONICALIZATION_SETS] = {");
-for (i in characterSetInfo)
-print("    ucs2CharacterSet" + i + ",");
-print("};");
-print();
-print("const size_t UCS2_CANONICALIZATION_RANGES = " + rangeInfo.length + ";");
-print("UCS2CanonicalizationRange rangeInfo[UCS2_CANONICALIZATION_RANGES] = {");
-for (i in rangeInfo) {
-    var info = rangeInfo[i];
-    var typeAndValue = info.type.split(':');
-    print("    { " + hex(info.begin) + ", " + hex(info.end) + ", " + hex(typeAndValue[1]) + ", " + typeAndValue[0] + " },");
-}
-print("};");
-print();
-print("const size_t LATIN_CANONICALIZATION_RANGES = " + latinRangeInfo.length + ";");
-print("LatinCanonicalizationRange latinRangeInfo[LATIN_CANONICALIZATION_RANGES] = {");
-for (i in latinRangeInfo) {
-    var info = latinRangeInfo[i];
-    var typeAndValue = info.type.split(':');
-    print("    { " + hex(info.begin) + ", " + hex(info.end) + ", " + hex(typeAndValue[1]) + ", " + typeAndValue[0] + " },");
-}
-print("};");
-print();
-print("} } // JSC::Yarr");
-print();
-
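Review note: the removed generator works in passes (group by canonical value, classify each group, coalesce equal neighbours into ranges). The sketch below restates passes 1 to 3 in a compact, runnable form so the table rows are easier to trace back to their origin. It is a simplification, not the script that shipped: the Latin-1 table and the characterSetInfo indexing are left out, sets get a bare "CanonicalizeSet" tag, and the helper names `groupByCanonical`, `classify`, and `coalesce` are assumptions made for this illustration. Only `canonicalize` follows the removed code line for line.

```javascript
// Same logic as canonicalize() in the removed script (ES 5.1, 15.10.2.8).
function canonicalize(ch) {
    const u = String.fromCharCode(ch).toUpperCase();
    if (u.length > 1)
        return ch; // multi-character uppercase forms are not used
    const cu = u.charCodeAt(0);
    if (ch >= 128 && cu < 128)
        return ch; // a non-ASCII character never canonicalizes to ASCII
    return cu;
}

// Pass 1: map each canonical value to the code units that produce it.
function groupByCanonical(max) {
    const groups = new Map();
    for (let i = 0; i <= max; ++i) {
        const c = canonicalize(i);
        if (!groups.has(c))
            groups.set(c, []);
        groups.get(c).push(i); // pushed in ascending order
    }
    return groups;
}

// Pass 2 (simplified): tag every code unit with "Type:payload".
function classify(groups) {
    const typeInfo = [];
    for (const characters of groups.values()) {
        if (characters.length === 1) {
            typeInfo[characters[0]] = "CanonicalizeUnique:0";
            continue;
        }
        if (characters.length > 2) {
            for (const c of characters)
                typeInfo[c] = "CanonicalizeSet"; // set index omitted here
            continue;
        }
        const [lo, hi] = characters;
        const delta = hi - lo;
        if (delta === 1) {
            const type = lo & 1 ? "CanonicalizeAlternatingUnaligned:0" : "CanonicalizeAlternatingAligned:0";
            typeInfo[lo] = type;
            typeInfo[hi] = type;
        } else {
            typeInfo[lo] = "CanonicalizeRangeLo:" + delta;
            typeInfo[hi] = "CanonicalizeRangeHi:" + delta;
        }
    }
    return typeInfo;
}

// Pass 3: merge runs of identical tags into { begin, end, type } ranges.
function coalesce(typeInfo, max) {
    const rangeInfo = [];
    for (let end = 0; end <= max; ++end) {
        const begin = end;
        const type = typeInfo[end];
        while (end < max && typeInfo[end + 1] === type)
            ++end;
        rangeInfo.push({ begin, end, type });
    }
    return rangeInfo;
}

// Restricting the run to ASCII keeps the result independent of the
// engine's Unicode version.
const asciiRanges = coalesce(classify(groupByCanonical(0x7f)), 0x7f);
```

Limited to 0x00..0x7f, the result has five ranges: unique up to 0x40, RangeLo with delta 32 for 0x41..0x5a, unique for 0x5b..0x60, RangeHi with delta 32 for 0x61..0x7a, and unique for the rest. These correspond to the first rows of the generated tables, apart from their upper bound.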
diff --git a/Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.cpp b/Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.cpp
new file mode 100644 (file)
index 0000000..289e66f
--- /dev/null
@@ -0,0 +1,1182 @@
+/*
+ * Copyright (C) 2012-2013, 2015-2016 Apple Inc. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY APPLE INC. ``AS IS'' AND ANY
+ * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL APPLE INC. OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+ * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+ * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */
+
+// DO NOT EDIT! - this file autogenerated by YarrCanonicalizeUnicode.js
+
+#include "config.h"
+#include "YarrCanonicalizeUnicode.h"
+
+namespace JSC { namespace Yarr {
+
+#include <stdint.h>
+
+const UChar32 ucs2CharacterSet0[] = { 0x01c4, 0x01c5, 0x01c6, 0 };
+const UChar32 ucs2CharacterSet1[] = { 0x01c7, 0x01c8, 0x01c9, 0 };
+const UChar32 ucs2CharacterSet2[] = { 0x01ca, 0x01cb, 0x01cc, 0 };
+const UChar32 ucs2CharacterSet3[] = { 0x01f1, 0x01f2, 0x01f3, 0 };
+const UChar32 ucs2CharacterSet4[] = { 0x0392, 0x03b2, 0x03d0, 0 };
+const UChar32 ucs2CharacterSet5[] = { 0x0395, 0x03b5, 0x03f5, 0 };
+const UChar32 ucs2CharacterSet6[] = { 0x0398, 0x03b8, 0x03d1, 0 };
+const UChar32 ucs2CharacterSet7[] = { 0x0345, 0x0399, 0x03b9, 0x1fbe, 0 };
+const UChar32 ucs2CharacterSet8[] = { 0x039a, 0x03ba, 0x03f0, 0 };
+const UChar32 ucs2CharacterSet9[] = { 0x00b5, 0x039c, 0x03bc, 0 };
+const UChar32 ucs2CharacterSet10[] = { 0x03a0, 0x03c0, 0x03d6, 0 };
+const UChar32 ucs2CharacterSet11[] = { 0x03a1, 0x03c1, 0x03f1, 0 };
+const UChar32 ucs2CharacterSet12[] = { 0x03a3, 0x03c2, 0x03c3, 0 };
+const UChar32 ucs2CharacterSet13[] = { 0x03a6, 0x03c6, 0x03d5, 0 };
+const UChar32 ucs2CharacterSet14[] = { 0x1e60, 0x1e61, 0x1e9b, 0 };
+
+static const size_t UCS2_CANONICALIZATION_SETS = 15;
+const UChar32* const ucs2CharacterSetInfo[UCS2_CANONICALIZATION_SETS] = {
+    ucs2CharacterSet0,
+    ucs2CharacterSet1,
+    ucs2CharacterSet2,
+    ucs2CharacterSet3,
+    ucs2CharacterSet4,
+    ucs2CharacterSet5,
+    ucs2CharacterSet6,
+    ucs2CharacterSet7,
+    ucs2CharacterSet8,
+    ucs2CharacterSet9,
+    ucs2CharacterSet10,
+    ucs2CharacterSet11,
+    ucs2CharacterSet12,
+    ucs2CharacterSet13,
+    ucs2CharacterSet14,
+};
+
+const size_t UCS2_CANONICALIZATION_RANGES = 391;
+const CanonicalizationRange ucs2RangeInfo[UCS2_CANONICALIZATION_RANGES] = {
+    { 0x0000, 0x0040, 0x0000, CanonicalizeUnique },
+    { 0x0041, 0x005a, 0x0020, CanonicalizeRangeLo },
+    { 0x005b, 0x0060, 0x0000, CanonicalizeUnique },
+    { 0x0061, 0x007a, 0x0020, CanonicalizeRangeHi },
+    { 0x007b, 0x00b4, 0x0000, CanonicalizeUnique },
+    { 0x00b5, 0x00b5, 0x0009, CanonicalizeSet },
+    { 0x00b6, 0x00bf, 0x0000, CanonicalizeUnique },
+    { 0x00c0, 0x00d6, 0x0020, CanonicalizeRangeLo },
+    { 0x00d7, 0x00d7, 0x0000, CanonicalizeUnique },
+    { 0x00d8, 0x00de, 0x0020, CanonicalizeRangeLo },
+    { 0x00df, 0x00df, 0x0000, CanonicalizeUnique },
+    { 0x00e0, 0x00f6, 0x0020, CanonicalizeRangeHi },
+    { 0x00f7, 0x00f7, 0x0000, CanonicalizeUnique },
+    { 0x00f8, 0x00fe, 0x0020, CanonicalizeRangeHi },
+    { 0x00ff, 0x00ff, 0x0079, CanonicalizeRangeLo },
+    { 0x0100, 0x012f, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x0130, 0x0131, 0x0000, CanonicalizeUnique },
+    { 0x0132, 0x0137, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x0138, 0x0138, 0x0000, CanonicalizeUnique },
+    { 0x0139, 0x0148, 0x0000, CanonicalizeAlternatingUnaligned },
+    { 0x0149, 0x0149, 0x0000, CanonicalizeUnique },
+    { 0x014a, 0x0177, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x0178, 0x0178, 0x0079, CanonicalizeRangeHi },
+    { 0x0179, 0x017e, 0x0000, CanonicalizeAlternatingUnaligned },
+    { 0x017f, 0x017f, 0x0000, CanonicalizeUnique },
+    { 0x0180, 0x0180, 0x00c3, CanonicalizeRangeLo },
+    { 0x0181, 0x0181, 0x00d2, CanonicalizeRangeLo },
+    { 0x0182, 0x0185, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x0186, 0x0186, 0x00ce, CanonicalizeRangeLo },
+    { 0x0187, 0x0188, 0x0000, CanonicalizeAlternatingUnaligned },
+    { 0x0189, 0x018a, 0x00cd, CanonicalizeRangeLo },
+    { 0x018b, 0x018c, 0x0000, CanonicalizeAlternatingUnaligned },
+    { 0x018d, 0x018d, 0x0000, CanonicalizeUnique },
+    { 0x018e, 0x018e, 0x004f, CanonicalizeRangeLo },
+    { 0x018f, 0x018f, 0x00ca, CanonicalizeRangeLo },
+    { 0x0190, 0x0190, 0x00cb, CanonicalizeRangeLo },
+    { 0x0191, 0x0192, 0x0000, CanonicalizeAlternatingUnaligned },
+    { 0x0193, 0x0193, 0x00cd, CanonicalizeRangeLo },
+    { 0x0194, 0x0194, 0x00cf, CanonicalizeRangeLo },
+    { 0x0195, 0x0195, 0x0061, CanonicalizeRangeLo },
+    { 0x0196, 0x0196, 0x00d3, CanonicalizeRangeLo },
+    { 0x0197, 0x0197, 0x00d1, CanonicalizeRangeLo },
+    { 0x0198, 0x0199, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x019a, 0x019a, 0x00a3, CanonicalizeRangeLo },
+    { 0x019b, 0x019b, 0x0000, CanonicalizeUnique },
+    { 0x019c, 0x019c, 0x00d3, CanonicalizeRangeLo },
+    { 0x019d, 0x019d, 0x00d5, CanonicalizeRangeLo },
+    { 0x019e, 0x019e, 0x0082, CanonicalizeRangeLo },
+    { 0x019f, 0x019f, 0x00d6, CanonicalizeRangeLo },
+    { 0x01a0, 0x01a5, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x01a6, 0x01a6, 0x00da, CanonicalizeRangeLo },
+    { 0x01a7, 0x01a8, 0x0000, CanonicalizeAlternatingUnaligned },
+    { 0x01a9, 0x01a9, 0x00da, CanonicalizeRangeLo },
+    { 0x01aa, 0x01ab, 0x0000, CanonicalizeUnique },
+    { 0x01ac, 0x01ad, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x01ae, 0x01ae, 0x00da, CanonicalizeRangeLo },
+    { 0x01af, 0x01b0, 0x0000, CanonicalizeAlternatingUnaligned },
+    { 0x01b1, 0x01b2, 0x00d9, CanonicalizeRangeLo },
+    { 0x01b3, 0x01b6, 0x0000, CanonicalizeAlternatingUnaligned },
+    { 0x01b7, 0x01b7, 0x00db, CanonicalizeRangeLo },
+    { 0x01b8, 0x01b9, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x01ba, 0x01bb, 0x0000, CanonicalizeUnique },
+    { 0x01bc, 0x01bd, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x01be, 0x01be, 0x0000, CanonicalizeUnique },
+    { 0x01bf, 0x01bf, 0x0038, CanonicalizeRangeLo },
+    { 0x01c0, 0x01c3, 0x0000, CanonicalizeUnique },
+    { 0x01c4, 0x01c6, 0x0000, CanonicalizeSet },
+    { 0x01c7, 0x01c9, 0x0001, CanonicalizeSet },
+    { 0x01ca, 0x01cc, 0x0002, CanonicalizeSet },
+    { 0x01cd, 0x01dc, 0x0000, CanonicalizeAlternatingUnaligned },
+    { 0x01dd, 0x01dd, 0x004f, CanonicalizeRangeHi },
+    { 0x01de, 0x01ef, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x01f0, 0x01f0, 0x0000, CanonicalizeUnique },
+    { 0x01f1, 0x01f3, 0x0003, CanonicalizeSet },
+    { 0x01f4, 0x01f5, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x01f6, 0x01f6, 0x0061, CanonicalizeRangeHi },
+    { 0x01f7, 0x01f7, 0x0038, CanonicalizeRangeHi },
+    { 0x01f8, 0x021f, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x0220, 0x0220, 0x0082, CanonicalizeRangeHi },
+    { 0x0221, 0x0221, 0x0000, CanonicalizeUnique },
+    { 0x0222, 0x0233, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x0234, 0x0239, 0x0000, CanonicalizeUnique },
+    { 0x023a, 0x023a, 0x2a2b, CanonicalizeRangeLo },
+    { 0x023b, 0x023c, 0x0000, CanonicalizeAlternatingUnaligned },
+    { 0x023d, 0x023d, 0x00a3, CanonicalizeRangeHi },
+    { 0x023e, 0x023e, 0x2a28, CanonicalizeRangeLo },
+    { 0x023f, 0x0240, 0x2a3f, CanonicalizeRangeLo },
+    { 0x0241, 0x0242, 0x0000, CanonicalizeAlternatingUnaligned },
+    { 0x0243, 0x0243, 0x00c3, CanonicalizeRangeHi },
+    { 0x0244, 0x0244, 0x0045, CanonicalizeRangeLo },
+    { 0x0245, 0x0245, 0x0047, CanonicalizeRangeLo },
+    { 0x0246, 0x024f, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x0250, 0x0250, 0x2a1f, CanonicalizeRangeLo },
+    { 0x0251, 0x0251, 0x2a1c, CanonicalizeRangeLo },
+    { 0x0252, 0x0252, 0x2a1e, CanonicalizeRangeLo },
+    { 0x0253, 0x0253, 0x00d2, CanonicalizeRangeHi },
+    { 0x0254, 0x0254, 0x00ce, CanonicalizeRangeHi },
+    { 0x0255, 0x0255, 0x0000, CanonicalizeUnique },
+    { 0x0256, 0x0257, 0x00cd, CanonicalizeRangeHi },
+    { 0x0258, 0x0258, 0x0000, CanonicalizeUnique },
+    { 0x0259, 0x0259, 0x00ca, CanonicalizeRangeHi },
+    { 0x025a, 0x025a, 0x0000, CanonicalizeUnique },
+    { 0x025b, 0x025b, 0x00cb, CanonicalizeRangeHi },
+    { 0x025c, 0x025c, 0xa54f, CanonicalizeRangeLo },
+    { 0x025d, 0x025f, 0x0000, CanonicalizeUnique },
+    { 0x0260, 0x0260, 0x00cd, CanonicalizeRangeHi },
+    { 0x0261, 0x0261, 0xa54b, CanonicalizeRangeLo },
+    { 0x0262, 0x0262, 0x0000, CanonicalizeUnique },
+    { 0x0263, 0x0263, 0x00cf, CanonicalizeRangeHi },
+    { 0x0264, 0x0264, 0x0000, CanonicalizeUnique },
+    { 0x0265, 0x0265, 0xa528, CanonicalizeRangeLo },
+    { 0x0266, 0x0266, 0xa544, CanonicalizeRangeLo },
+    { 0x0267, 0x0267, 0x0000, CanonicalizeUnique },
+    { 0x0268, 0x0268, 0x00d1, CanonicalizeRangeHi },
+    { 0x0269, 0x0269, 0x00d3, CanonicalizeRangeHi },
+    { 0x026a, 0x026a, 0x0000, CanonicalizeUnique },
+    { 0x026b, 0x026b, 0x29f7, CanonicalizeRangeLo },
+    { 0x026c, 0x026c, 0xa541, CanonicalizeRangeLo },
+    { 0x026d, 0x026e, 0x0000, CanonicalizeUnique },
+    { 0x026f, 0x026f, 0x00d3, CanonicalizeRangeHi },
+    { 0x0270, 0x0270, 0x0000, CanonicalizeUnique },
+    { 0x0271, 0x0271, 0x29fd, CanonicalizeRangeLo },
+    { 0x0272, 0x0272, 0x00d5, CanonicalizeRangeHi },
+    { 0x0273, 0x0274, 0x0000, CanonicalizeUnique },
+    { 0x0275, 0x0275, 0x00d6, CanonicalizeRangeHi },
+    { 0x0276, 0x027c, 0x0000, CanonicalizeUnique },
+    { 0x027d, 0x027d, 0x29e7, CanonicalizeRangeLo },
+    { 0x027e, 0x027f, 0x0000, CanonicalizeUnique },
+    { 0x0280, 0x0280, 0x00da, CanonicalizeRangeHi },
+    { 0x0281, 0x0282, 0x0000, CanonicalizeUnique },
+    { 0x0283, 0x0283, 0x00da, CanonicalizeRangeHi },
+    { 0x0284, 0x0286, 0x0000, CanonicalizeUnique },
+    { 0x0287, 0x0287, 0xa52a, CanonicalizeRangeLo },
+    { 0x0288, 0x0288, 0x00da, CanonicalizeRangeHi },
+    { 0x0289, 0x0289, 0x0045, CanonicalizeRangeHi },
+    { 0x028a, 0x028b, 0x00d9, CanonicalizeRangeHi },
+    { 0x028c, 0x028c, 0x0047, CanonicalizeRangeHi },
+    { 0x028d, 0x0291, 0x0000, CanonicalizeUnique },
+    { 0x0292, 0x0292, 0x00db, CanonicalizeRangeHi },
+    { 0x0293, 0x029d, 0x0000, CanonicalizeUnique },
+    { 0x029e, 0x029e, 0xa512, CanonicalizeRangeLo },
+    { 0x029f, 0x0344, 0x0000, CanonicalizeUnique },
+    { 0x0345, 0x0345, 0x0007, CanonicalizeSet },
+    { 0x0346, 0x036f, 0x0000, CanonicalizeUnique },
+    { 0x0370, 0x0373, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x0374, 0x0375, 0x0000, CanonicalizeUnique },
+    { 0x0376, 0x0377, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x0378, 0x037a, 0x0000, CanonicalizeUnique },
+    { 0x037b, 0x037d, 0x0082, CanonicalizeRangeLo },
+    { 0x037e, 0x037e, 0x0000, CanonicalizeUnique },
+    { 0x037f, 0x037f, 0x0074, CanonicalizeRangeLo },
+    { 0x0380, 0x0385, 0x0000, CanonicalizeUnique },
+    { 0x0386, 0x0386, 0x0026, CanonicalizeRangeLo },
+    { 0x0387, 0x0387, 0x0000, CanonicalizeUnique },
+    { 0x0388, 0x038a, 0x0025, CanonicalizeRangeLo },
+    { 0x038b, 0x038b, 0x0000, CanonicalizeUnique },
+    { 0x038c, 0x038c, 0x0040, CanonicalizeRangeLo },
+    { 0x038d, 0x038d, 0x0000, CanonicalizeUnique },
+    { 0x038e, 0x038f, 0x003f, CanonicalizeRangeLo },
+    { 0x0390, 0x0390, 0x0000, CanonicalizeUnique },
+    { 0x0391, 0x0391, 0x0020, CanonicalizeRangeLo },
+    { 0x0392, 0x0392, 0x0004, CanonicalizeSet },
+    { 0x0393, 0x0394, 0x0020, CanonicalizeRangeLo },
+    { 0x0395, 0x0395, 0x0005, CanonicalizeSet },
+    { 0x0396, 0x0397, 0x0020, CanonicalizeRangeLo },
+    { 0x0398, 0x0398, 0x0006, CanonicalizeSet },
+    { 0x0399, 0x0399, 0x0007, CanonicalizeSet },
+    { 0x039a, 0x039a, 0x0008, CanonicalizeSet },
+    { 0x039b, 0x039b, 0x0020, CanonicalizeRangeLo },
+    { 0x039c, 0x039c, 0x0009, CanonicalizeSet },
+    { 0x039d, 0x039f, 0x0020, CanonicalizeRangeLo },
+    { 0x03a0, 0x03a0, 0x000a, CanonicalizeSet },
+    { 0x03a1, 0x03a1, 0x000b, CanonicalizeSet },
+    { 0x03a2, 0x03a2, 0x0000, CanonicalizeUnique },
+    { 0x03a3, 0x03a3, 0x000c, CanonicalizeSet },
+    { 0x03a4, 0x03a5, 0x0020, CanonicalizeRangeLo },
+    { 0x03a6, 0x03a6, 0x000d, CanonicalizeSet },
+    { 0x03a7, 0x03ab, 0x0020, CanonicalizeRangeLo },
+    { 0x03ac, 0x03ac, 0x0026, CanonicalizeRangeHi },
+    { 0x03ad, 0x03af, 0x0025, CanonicalizeRangeHi },
+    { 0x03b0, 0x03b0, 0x0000, CanonicalizeUnique },
+    { 0x03b1, 0x03b1, 0x0020, CanonicalizeRangeHi },
+    { 0x03b2, 0x03b2, 0x0004, CanonicalizeSet },
+    { 0x03b3, 0x03b4, 0x0020, CanonicalizeRangeHi },
+    { 0x03b5, 0x03b5, 0x0005, CanonicalizeSet },
+    { 0x03b6, 0x03b7, 0x0020, CanonicalizeRangeHi },
+    { 0x03b8, 0x03b8, 0x0006, CanonicalizeSet },
+    { 0x03b9, 0x03b9, 0x0007, CanonicalizeSet },
+    { 0x03ba, 0x03ba, 0x0008, CanonicalizeSet },
+    { 0x03bb, 0x03bb, 0x0020, CanonicalizeRangeHi },
+    { 0x03bc, 0x03bc, 0x0009, CanonicalizeSet },
+    { 0x03bd, 0x03bf, 0x0020, CanonicalizeRangeHi },
+    { 0x03c0, 0x03c0, 0x000a, CanonicalizeSet },
+    { 0x03c1, 0x03c1, 0x000b, CanonicalizeSet },
+    { 0x03c2, 0x03c3, 0x000c, CanonicalizeSet },
+    { 0x03c4, 0x03c5, 0x0020, CanonicalizeRangeHi },
+    { 0x03c6, 0x03c6, 0x000d, CanonicalizeSet },
+    { 0x03c7, 0x03cb, 0x0020, CanonicalizeRangeHi },
+    { 0x03cc, 0x03cc, 0x0040, CanonicalizeRangeHi },
+    { 0x03cd, 0x03ce, 0x003f, CanonicalizeRangeHi },
+    { 0x03cf, 0x03cf, 0x0008, CanonicalizeRangeLo },
+    { 0x03d0, 0x03d0, 0x0004, CanonicalizeSet },
+    { 0x03d1, 0x03d1, 0x0006, CanonicalizeSet },
+    { 0x03d2, 0x03d4, 0x0000, CanonicalizeUnique },
+    { 0x03d5, 0x03d5, 0x000d, CanonicalizeSet },
+    { 0x03d6, 0x03d6, 0x000a, CanonicalizeSet },
+    { 0x03d7, 0x03d7, 0x0008, CanonicalizeRangeHi },
+    { 0x03d8, 0x03ef, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x03f0, 0x03f0, 0x0008, CanonicalizeSet },
+    { 0x03f1, 0x03f1, 0x000b, CanonicalizeSet },
+    { 0x03f2, 0x03f2, 0x0007, CanonicalizeRangeLo },
+    { 0x03f3, 0x03f3, 0x0074, CanonicalizeRangeHi },
+    { 0x03f4, 0x03f4, 0x0000, CanonicalizeUnique },
+    { 0x03f5, 0x03f5, 0x0005, CanonicalizeSet },
+    { 0x03f6, 0x03f6, 0x0000, CanonicalizeUnique },
+    { 0x03f7, 0x03f8, 0x0000, CanonicalizeAlternatingUnaligned },
+    { 0x03f9, 0x03f9, 0x0007, CanonicalizeRangeHi },
+    { 0x03fa, 0x03fb, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x03fc, 0x03fc, 0x0000, CanonicalizeUnique },
+    { 0x03fd, 0x03ff, 0x0082, CanonicalizeRangeHi },
+    { 0x0400, 0x040f, 0x0050, CanonicalizeRangeLo },
+    { 0x0410, 0x042f, 0x0020, CanonicalizeRangeLo },
+    { 0x0430, 0x044f, 0x0020, CanonicalizeRangeHi },
+    { 0x0450, 0x045f, 0x0050, CanonicalizeRangeHi },
+    { 0x0460, 0x0481, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x0482, 0x0489, 0x0000, CanonicalizeUnique },
+    { 0x048a, 0x04bf, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x04c0, 0x04c0, 0x000f, CanonicalizeRangeLo },
+    { 0x04c1, 0x04ce, 0x0000, CanonicalizeAlternatingUnaligned },
+    { 0x04cf, 0x04cf, 0x000f, CanonicalizeRangeHi },
+    { 0x04d0, 0x052f, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x0530, 0x0530, 0x0000, CanonicalizeUnique },
+    { 0x0531, 0x0556, 0x0030, CanonicalizeRangeLo },
+    { 0x0557, 0x0560, 0x0000, CanonicalizeUnique },
+    { 0x0561, 0x0586, 0x0030, CanonicalizeRangeHi },
+    { 0x0587, 0x109f, 0x0000, CanonicalizeUnique },
+    { 0x10a0, 0x10c5, 0x1c60, CanonicalizeRangeLo },
+    { 0x10c6, 0x10c6, 0x0000, CanonicalizeUnique },
+    { 0x10c7, 0x10c7, 0x1c60, CanonicalizeRangeLo },
+    { 0x10c8, 0x10cc, 0x0000, CanonicalizeUnique },
+    { 0x10cd, 0x10cd, 0x1c60, CanonicalizeRangeLo },
+    { 0x10ce, 0x1d78, 0x0000, CanonicalizeUnique },
+    { 0x1d79, 0x1d79, 0x8a04, CanonicalizeRangeLo },
+    { 0x1d7a, 0x1d7c, 0x0000, CanonicalizeUnique },
+    { 0x1d7d, 0x1d7d, 0x0ee6, CanonicalizeRangeLo },
+    { 0x1d7e, 0x1dff, 0x0000, CanonicalizeUnique },
+    { 0x1e00, 0x1e5f, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x1e60, 0x1e61, 0x000e, CanonicalizeSet },
+    { 0x1e62, 0x1e95, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x1e96, 0x1e9a, 0x0000, CanonicalizeUnique },
+    { 0x1e9b, 0x1e9b, 0x000e, CanonicalizeSet },
+    { 0x1e9c, 0x1e9f, 0x0000, CanonicalizeUnique },
+    { 0x1ea0, 0x1eff, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x1f00, 0x1f07, 0x0008, CanonicalizeRangeLo },
+    { 0x1f08, 0x1f0f, 0x0008, CanonicalizeRangeHi },
+    { 0x1f10, 0x1f15, 0x0008, CanonicalizeRangeLo },
+    { 0x1f16, 0x1f17, 0x0000, CanonicalizeUnique },
+    { 0x1f18, 0x1f1d, 0x0008, CanonicalizeRangeHi },
+    { 0x1f1e, 0x1f1f, 0x0000, CanonicalizeUnique },
+    { 0x1f20, 0x1f27, 0x0008, CanonicalizeRangeLo },
+    { 0x1f28, 0x1f2f, 0x0008, CanonicalizeRangeHi },
+    { 0x1f30, 0x1f37, 0x0008, CanonicalizeRangeLo },
+    { 0x1f38, 0x1f3f, 0x0008, CanonicalizeRangeHi },
+    { 0x1f40, 0x1f45, 0x0008, CanonicalizeRangeLo },
+    { 0x1f46, 0x1f47, 0x0000, CanonicalizeUnique },
+    { 0x1f48, 0x1f4d, 0x0008, CanonicalizeRangeHi },
+    { 0x1f4e, 0x1f50, 0x0000, CanonicalizeUnique },
+    { 0x1f51, 0x1f51, 0x0008, CanonicalizeRangeLo },
+    { 0x1f52, 0x1f52, 0x0000, CanonicalizeUnique },
+    { 0x1f53, 0x1f53, 0x0008, CanonicalizeRangeLo },
+    { 0x1f54, 0x1f54, 0x0000, CanonicalizeUnique },
+    { 0x1f55, 0x1f55, 0x0008, CanonicalizeRangeLo },
+    { 0x1f56, 0x1f56, 0x0000, CanonicalizeUnique },
+    { 0x1f57, 0x1f57, 0x0008, CanonicalizeRangeLo },
+    { 0x1f58, 0x1f58, 0x0000, CanonicalizeUnique },
+    { 0x1f59, 0x1f59, 0x0008, CanonicalizeRangeHi },
+    { 0x1f5a, 0x1f5a, 0x0000, CanonicalizeUnique },
+    { 0x1f5b, 0x1f5b, 0x0008, CanonicalizeRangeHi },
+    { 0x1f5c, 0x1f5c, 0x0000, CanonicalizeUnique },
+    { 0x1f5d, 0x1f5d, 0x0008, CanonicalizeRangeHi },
+    { 0x1f5e, 0x1f5e, 0x0000, CanonicalizeUnique },
+    { 0x1f5f, 0x1f5f, 0x0008, CanonicalizeRangeHi },
+    { 0x1f60, 0x1f67, 0x0008, CanonicalizeRangeLo },
+    { 0x1f68, 0x1f6f, 0x0008, CanonicalizeRangeHi },
+    { 0x1f70, 0x1f71, 0x004a, CanonicalizeRangeLo },
+    { 0x1f72, 0x1f75, 0x0056, CanonicalizeRangeLo },
+    { 0x1f76, 0x1f77, 0x0064, CanonicalizeRangeLo },
+    { 0x1f78, 0x1f79, 0x0080, CanonicalizeRangeLo },
+    { 0x1f7a, 0x1f7b, 0x0070, CanonicalizeRangeLo },
+    { 0x1f7c, 0x1f7d, 0x007e, CanonicalizeRangeLo },
+    { 0x1f7e, 0x1faf, 0x0000, CanonicalizeUnique },
+    { 0x1fb0, 0x1fb1, 0x0008, CanonicalizeRangeLo },
+    { 0x1fb2, 0x1fb7, 0x0000, CanonicalizeUnique },
+    { 0x1fb8, 0x1fb9, 0x0008, CanonicalizeRangeHi },
+    { 0x1fba, 0x1fbb, 0x004a, CanonicalizeRangeHi },
+    { 0x1fbc, 0x1fbd, 0x0000, CanonicalizeUnique },
+    { 0x1fbe, 0x1fbe, 0x0007, CanonicalizeSet },
+    { 0x1fbf, 0x1fc7, 0x0000, CanonicalizeUnique },
+    { 0x1fc8, 0x1fcb, 0x0056, CanonicalizeRangeHi },
+    { 0x1fcc, 0x1fcf, 0x0000, CanonicalizeUnique },
+    { 0x1fd0, 0x1fd1, 0x0008, CanonicalizeRangeLo },
+    { 0x1fd2, 0x1fd7, 0x0000, CanonicalizeUnique },
+    { 0x1fd8, 0x1fd9, 0x0008, CanonicalizeRangeHi },
+    { 0x1fda, 0x1fdb, 0x0064, CanonicalizeRangeHi },
+    { 0x1fdc, 0x1fdf, 0x0000, CanonicalizeUnique },
+    { 0x1fe0, 0x1fe1, 0x0008, CanonicalizeRangeLo },
+    { 0x1fe2, 0x1fe4, 0x0000, CanonicalizeUnique },
+    { 0x1fe5, 0x1fe5, 0x0007, CanonicalizeRangeLo },
+    { 0x1fe6, 0x1fe7, 0x0000, CanonicalizeUnique },
+    { 0x1fe8, 0x1fe9, 0x0008, CanonicalizeRangeHi },
+    { 0x1fea, 0x1feb, 0x0070, CanonicalizeRangeHi },
+    { 0x1fec, 0x1fec, 0x0007, CanonicalizeRangeHi },
+    { 0x1fed, 0x1ff7, 0x0000, CanonicalizeUnique },
+    { 0x1ff8, 0x1ff9, 0x0080, CanonicalizeRangeHi },
+    { 0x1ffa, 0x1ffb, 0x007e, CanonicalizeRangeHi },
+    { 0x1ffc, 0x2131, 0x0000, CanonicalizeUnique },
+    { 0x2132, 0x2132, 0x001c, CanonicalizeRangeLo },
+    { 0x2133, 0x214d, 0x0000, CanonicalizeUnique },
+    { 0x214e, 0x214e, 0x001c, CanonicalizeRangeHi },
+    { 0x214f, 0x215f, 0x0000, CanonicalizeUnique },
+    { 0x2160, 0x216f, 0x0010, CanonicalizeRangeLo },
+    { 0x2170, 0x217f, 0x0010, CanonicalizeRangeHi },
+    { 0x2180, 0x2182, 0x0000, CanonicalizeUnique },
+    { 0x2183, 0x2184, 0x0000, CanonicalizeAlternatingUnaligned },
+    { 0x2185, 0x24b5, 0x0000, CanonicalizeUnique },
+    { 0x24b6, 0x24cf, 0x001a, CanonicalizeRangeLo },
+    { 0x24d0, 0x24e9, 0x001a, CanonicalizeRangeHi },
+    { 0x24ea, 0x2bff, 0x0000, CanonicalizeUnique },
+    { 0x2c00, 0x2c2e, 0x0030, CanonicalizeRangeLo },
+    { 0x2c2f, 0x2c2f, 0x0000, CanonicalizeUnique },
+    { 0x2c30, 0x2c5e, 0x0030, CanonicalizeRangeHi },
+    { 0x2c5f, 0x2c5f, 0x0000, CanonicalizeUnique },
+    { 0x2c60, 0x2c61, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x2c62, 0x2c62, 0x29f7, CanonicalizeRangeHi },
+    { 0x2c63, 0x2c63, 0x0ee6, CanonicalizeRangeHi },
+    { 0x2c64, 0x2c64, 0x29e7, CanonicalizeRangeHi },
+    { 0x2c65, 0x2c65, 0x2a2b, CanonicalizeRangeHi },
+    { 0x2c66, 0x2c66, 0x2a28, CanonicalizeRangeHi },
+    { 0x2c67, 0x2c6c, 0x0000, CanonicalizeAlternatingUnaligned },
+    { 0x2c6d, 0x2c6d, 0x2a1c, CanonicalizeRangeHi },
+    { 0x2c6e, 0x2c6e, 0x29fd, CanonicalizeRangeHi },
+    { 0x2c6f, 0x2c6f, 0x2a1f, CanonicalizeRangeHi },
+    { 0x2c70, 0x2c70, 0x2a1e, CanonicalizeRangeHi },
+    { 0x2c71, 0x2c71, 0x0000, CanonicalizeUnique },
+    { 0x2c72, 0x2c73, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x2c74, 0x2c74, 0x0000, CanonicalizeUnique },
+    { 0x2c75, 0x2c76, 0x0000, CanonicalizeAlternatingUnaligned },
+    { 0x2c77, 0x2c7d, 0x0000, CanonicalizeUnique },
+    { 0x2c7e, 0x2c7f, 0x2a3f, CanonicalizeRangeHi },
+    { 0x2c80, 0x2ce3, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x2ce4, 0x2cea, 0x0000, CanonicalizeUnique },
+    { 0x2ceb, 0x2cee, 0x0000, CanonicalizeAlternatingUnaligned },
+    { 0x2cef, 0x2cf1, 0x0000, CanonicalizeUnique },
+    { 0x2cf2, 0x2cf3, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x2cf4, 0x2cff, 0x0000, CanonicalizeUnique },
+    { 0x2d00, 0x2d25, 0x1c60, CanonicalizeRangeHi },
+    { 0x2d26, 0x2d26, 0x0000, CanonicalizeUnique },
+    { 0x2d27, 0x2d27, 0x1c60, CanonicalizeRangeHi },
+    { 0x2d28, 0x2d2c, 0x0000, CanonicalizeUnique },
+    { 0x2d2d, 0x2d2d, 0x1c60, CanonicalizeRangeHi },
+    { 0x2d2e, 0xa63f, 0x0000, CanonicalizeUnique },
+    { 0xa640, 0xa66d, 0x0000, CanonicalizeAlternatingAligned },
+    { 0xa66e, 0xa67f, 0x0000, CanonicalizeUnique },
+    { 0xa680, 0xa69b, 0x0000, CanonicalizeAlternatingAligned },
+    { 0xa69c, 0xa721, 0x0000, CanonicalizeUnique },
+    { 0xa722, 0xa72f, 0x0000, CanonicalizeAlternatingAligned },
+    { 0xa730, 0xa731, 0x0000, CanonicalizeUnique },
+    { 0xa732, 0xa76f, 0x0000, CanonicalizeAlternatingAligned },
+    { 0xa770, 0xa778, 0x0000, CanonicalizeUnique },
+    { 0xa779, 0xa77c, 0x0000, CanonicalizeAlternatingUnaligned },
+    { 0xa77d, 0xa77d, 0x8a04, CanonicalizeRangeHi },
+    { 0xa77e, 0xa787, 0x0000, CanonicalizeAlternatingAligned },
+    { 0xa788, 0xa78a, 0x0000, CanonicalizeUnique },
+    { 0xa78b, 0xa78c, 0x0000, CanonicalizeAlternatingUnaligned },
+    { 0xa78d, 0xa78d, 0xa528, CanonicalizeRangeHi },
+    { 0xa78e, 0xa78f, 0x0000, CanonicalizeUnique },
+    { 0xa790, 0xa793, 0x0000, CanonicalizeAlternatingAligned },
+    { 0xa794, 0xa795, 0x0000, CanonicalizeUnique },
+    { 0xa796, 0xa7a9, 0x0000, CanonicalizeAlternatingAligned },
+    { 0xa7aa, 0xa7aa, 0xa544, CanonicalizeRangeHi },
+    { 0xa7ab, 0xa7ab, 0xa54f, CanonicalizeRangeHi },
+    { 0xa7ac, 0xa7ac, 0xa54b, CanonicalizeRangeHi },
+    { 0xa7ad, 0xa7ad, 0xa541, CanonicalizeRangeHi },
+    { 0xa7ae, 0xa7af, 0x0000, CanonicalizeUnique },
+    { 0xa7b0, 0xa7b0, 0xa512, CanonicalizeRangeHi },
+    { 0xa7b1, 0xa7b1, 0xa52a, CanonicalizeRangeHi },
+    { 0xa7b2, 0xff20, 0x0000, CanonicalizeUnique },
+    { 0xff21, 0xff3a, 0x0020, CanonicalizeRangeLo },
+    { 0xff3b, 0xff40, 0x0000, CanonicalizeUnique },
+    { 0xff41, 0xff5a, 0x0020, CanonicalizeRangeHi },
+    { 0xff5b, 0xffff, 0x0000, CanonicalizeUnique },
+};
+
+const UChar32 unicodeCharacterSet0[] = { 0x0041, 0x0061, 0x1e9a, 0 };
+const UChar32 unicodeCharacterSet1[] = { 0x0046, 0x0066, 0xfb00, 0xfb01, 0xfb02, 0xfb03, 0xfb04, 0 };
+const UChar32 unicodeCharacterSet2[] = { 0x0048, 0x0068, 0x1e96, 0 };
+const UChar32 unicodeCharacterSet3[] = { 0x0049, 0x0069, 0x0131, 0 };
+const UChar32 unicodeCharacterSet4[] = { 0x004a, 0x006a, 0x01f0, 0 };
+const UChar32 unicodeCharacterSet5[] = { 0x0053, 0x0073, 0x00df, 0x017f, 0xfb05, 0xfb06, 0 };
+const UChar32 unicodeCharacterSet6[] = { 0x0054, 0x0074, 0x1e97, 0 };
+const UChar32 unicodeCharacterSet7[] = { 0x0057, 0x0077, 0x1e98, 0 };
+const UChar32 unicodeCharacterSet8[] = { 0x0059, 0x0079, 0x1e99, 0 };
+const UChar32 unicodeCharacterSet9[] = { 0x01c4, 0x01c5, 0x01c6, 0 };
+const UChar32 unicodeCharacterSet10[] = { 0x01c7, 0x01c8, 0x01c9, 0 };
+const UChar32 unicodeCharacterSet11[] = { 0x01ca, 0x01cb, 0x01cc, 0 };
+const UChar32 unicodeCharacterSet12[] = { 0x01f1, 0x01f2, 0x01f3, 0 };
+const UChar32 unicodeCharacterSet13[] = { 0x0386, 0x03ac, 0x1fb4, 0 };
+const UChar32 unicodeCharacterSet14[] = { 0x0389, 0x03ae, 0x1fc4, 0 };
+const UChar32 unicodeCharacterSet15[] = { 0x038f, 0x03ce, 0x1ff4, 0 };
+const UChar32 unicodeCharacterSet16[] = { 0x0391, 0x03b1, 0x1fb3, 0x1fb6, 0x1fb7, 0x1fbc, 0 };
+const UChar32 unicodeCharacterSet17[] = { 0x0392, 0x03b2, 0x03d0, 0 };
+const UChar32 unicodeCharacterSet18[] = { 0x0395, 0x03b5, 0x03f5, 0 };
+const UChar32 unicodeCharacterSet19[] = { 0x0397, 0x03b7, 0x1fc3, 0x1fc6, 0x1fc7, 0x1fcc, 0 };
+const UChar32 unicodeCharacterSet20[] = { 0x0398, 0x03b8, 0x03d1, 0 };
+const UChar32 unicodeCharacterSet21[] = { 0x0345, 0x0390, 0x0399, 0x03b9, 0x1fbe, 0x1fd2, 0x1fd3, 0x1fd6, 0x1fd7, 0 };
+const UChar32 unicodeCharacterSet22[] = { 0x039a, 0x03ba, 0x03f0, 0 };
+const UChar32 unicodeCharacterSet23[] = { 0x00b5, 0x039c, 0x03bc, 0 };
+const UChar32 unicodeCharacterSet24[] = { 0x03a0, 0x03c0, 0x03d6, 0 };
+const UChar32 unicodeCharacterSet25[] = { 0x03a1, 0x03c1, 0x03f1, 0x1fe4, 0 };
+const UChar32 unicodeCharacterSet26[] = { 0x03a3, 0x03c2, 0x03c3, 0 };
+const UChar32 unicodeCharacterSet27[] = { 0x03a5, 0x03b0, 0x03c5, 0x1f50, 0x1f52, 0x1f54, 0x1f56, 0x1fe2, 0x1fe3, 0x1fe6, 0x1fe7, 0 };
+const UChar32 unicodeCharacterSet28[] = { 0x03a6, 0x03c6, 0x03d5, 0 };
+const UChar32 unicodeCharacterSet29[] = { 0x03a9, 0x03c9, 0x1ff3, 0x1ff6, 0x1ff7, 0x1ffc, 0 };
+const UChar32 unicodeCharacterSet30[] = { 0x0535, 0x0565, 0x0587, 0 };
+const UChar32 unicodeCharacterSet31[] = { 0x0544, 0x0574, 0xfb13, 0xfb14, 0xfb15, 0xfb17, 0 };
+const UChar32 unicodeCharacterSet32[] = { 0x054e, 0x057e, 0xfb16, 0 };
+const UChar32 unicodeCharacterSet33[] = { 0x1e60, 0x1e61, 0x1e9b, 0 };
+const UChar32 unicodeCharacterSet34[] = { 0x1f00, 0x1f08, 0x1f80, 0x1f88, 0 };
+const UChar32 unicodeCharacterSet35[] = { 0x1f01, 0x1f09, 0x1f81, 0x1f89, 0 };
+const UChar32 unicodeCharacterSet36[] = { 0x1f02, 0x1f0a, 0x1f82, 0x1f8a, 0 };
+const UChar32 unicodeCharacterSet37[] = { 0x1f03, 0x1f0b, 0x1f83, 0x1f8b, 0 };
+const UChar32 unicodeCharacterSet38[] = { 0x1f04, 0x1f0c, 0x1f84, 0x1f8c, 0 };
+const UChar32 unicodeCharacterSet39[] = { 0x1f05, 0x1f0d, 0x1f85, 0x1f8d, 0 };
+const UChar32 unicodeCharacterSet40[] = { 0x1f06, 0x1f0e, 0x1f86, 0x1f8e, 0 };
+const UChar32 unicodeCharacterSet41[] = { 0x1f07, 0x1f0f, 0x1f87, 0x1f8f, 0 };
+const UChar32 unicodeCharacterSet42[] = { 0x1f20, 0x1f28, 0x1f90, 0x1f98, 0 };
+const UChar32 unicodeCharacterSet43[] = { 0x1f21, 0x1f29, 0x1f91, 0x1f99, 0 };
+const UChar32 unicodeCharacterSet44[] = { 0x1f22, 0x1f2a, 0x1f92, 0x1f9a, 0 };
+const UChar32 unicodeCharacterSet45[] = { 0x1f23, 0x1f2b, 0x1f93, 0x1f9b, 0 };
+const UChar32 unicodeCharacterSet46[] = { 0x1f24, 0x1f2c, 0x1f94, 0x1f9c, 0 };
+const UChar32 unicodeCharacterSet47[] = { 0x1f25, 0x1f2d, 0x1f95, 0x1f9d, 0 };
+const UChar32 unicodeCharacterSet48[] = { 0x1f26, 0x1f2e, 0x1f96, 0x1f9e, 0 };
+const UChar32 unicodeCharacterSet49[] = { 0x1f27, 0x1f2f, 0x1f97, 0x1f9f, 0 };
+const UChar32 unicodeCharacterSet50[] = { 0x1f60, 0x1f68, 0x1fa0, 0x1fa8, 0 };
+const UChar32 unicodeCharacterSet51[] = { 0x1f61, 0x1f69, 0x1fa1, 0x1fa9, 0 };
+const UChar32 unicodeCharacterSet52[] = { 0x1f62, 0x1f6a, 0x1fa2, 0x1faa, 0 };
+const UChar32 unicodeCharacterSet53[] = { 0x1f63, 0x1f6b, 0x1fa3, 0x1fab, 0 };
+const UChar32 unicodeCharacterSet54[] = { 0x1f64, 0x1f6c, 0x1fa4, 0x1fac, 0 };
+const UChar32 unicodeCharacterSet55[] = { 0x1f65, 0x1f6d, 0x1fa5, 0x1fad, 0 };
+const UChar32 unicodeCharacterSet56[] = { 0x1f66, 0x1f6e, 0x1fa6, 0x1fae, 0 };
+const UChar32 unicodeCharacterSet57[] = { 0x1f67, 0x1f6f, 0x1fa7, 0x1faf, 0 };
+const UChar32 unicodeCharacterSet58[] = { 0x1f70, 0x1fb2, 0x1fba, 0 };
+const UChar32 unicodeCharacterSet59[] = { 0x1f74, 0x1fc2, 0x1fca, 0 };
+const UChar32 unicodeCharacterSet60[] = { 0x1f7c, 0x1ff2, 0x1ffa, 0 };
+
+static const size_t UNICODE_CANONICALIZATION_SETS = 61;
+const UChar32* const unicodeCharacterSetInfo[UNICODE_CANONICALIZATION_SETS] = {
+    unicodeCharacterSet0,
+    unicodeCharacterSet1,
+    unicodeCharacterSet2,
+    unicodeCharacterSet3,
+    unicodeCharacterSet4,
+    unicodeCharacterSet5,
+    unicodeCharacterSet6,
+    unicodeCharacterSet7,
+    unicodeCharacterSet8,
+    unicodeCharacterSet9,
+    unicodeCharacterSet10,
+    unicodeCharacterSet11,
+    unicodeCharacterSet12,
+    unicodeCharacterSet13,
+    unicodeCharacterSet14,
+    unicodeCharacterSet15,
+    unicodeCharacterSet16,
+    unicodeCharacterSet17,
+    unicodeCharacterSet18,
+    unicodeCharacterSet19,
+    unicodeCharacterSet20,
+    unicodeCharacterSet21,
+    unicodeCharacterSet22,
+    unicodeCharacterSet23,
+    unicodeCharacterSet24,
+    unicodeCharacterSet25,
+    unicodeCharacterSet26,
+    unicodeCharacterSet27,
+    unicodeCharacterSet28,
+    unicodeCharacterSet29,
+    unicodeCharacterSet30,
+    unicodeCharacterSet31,
+    unicodeCharacterSet32,
+    unicodeCharacterSet33,
+    unicodeCharacterSet34,
+    unicodeCharacterSet35,
+    unicodeCharacterSet36,
+    unicodeCharacterSet37,
+    unicodeCharacterSet38,
+    unicodeCharacterSet39,
+    unicodeCharacterSet40,
+    unicodeCharacterSet41,
+    unicodeCharacterSet42,
+    unicodeCharacterSet43,
+    unicodeCharacterSet44,
+    unicodeCharacterSet45,
+    unicodeCharacterSet46,
+    unicodeCharacterSet47,
+    unicodeCharacterSet48,
+    unicodeCharacterSet49,
+    unicodeCharacterSet50,
+    unicodeCharacterSet51,
+    unicodeCharacterSet52,
+    unicodeCharacterSet53,
+    unicodeCharacterSet54,
+    unicodeCharacterSet55,
+    unicodeCharacterSet56,
+    unicodeCharacterSet57,
+    unicodeCharacterSet58,
+    unicodeCharacterSet59,
+    unicodeCharacterSet60,
+};
+
+const size_t UNICODE_CANONICALIZATION_RANGES = 585;
+const CanonicalizationRange unicodeRangeInfo[UNICODE_CANONICALIZATION_RANGES] = {
+    { 0x0000, 0x0040, 0x0000, CanonicalizeUnique },
+    { 0x0041, 0x0041, 0x0000, CanonicalizeSet },
+    { 0x0042, 0x0045, 0x0020, CanonicalizeRangeLo },
+    { 0x0046, 0x0046, 0x0001, CanonicalizeSet },
+    { 0x0047, 0x0047, 0x0020, CanonicalizeRangeLo },
+    { 0x0048, 0x0048, 0x0002, CanonicalizeSet },
+    { 0x0049, 0x0049, 0x0003, CanonicalizeSet },
+    { 0x004a, 0x004a, 0x0004, CanonicalizeSet },
+    { 0x004b, 0x0052, 0x0020, CanonicalizeRangeLo },
+    { 0x0053, 0x0053, 0x0005, CanonicalizeSet },
+    { 0x0054, 0x0054, 0x0006, CanonicalizeSet },
+    { 0x0055, 0x0056, 0x0020, CanonicalizeRangeLo },
+    { 0x0057, 0x0057, 0x0007, CanonicalizeSet },
+    { 0x0058, 0x0058, 0x0020, CanonicalizeRangeLo },
+    { 0x0059, 0x0059, 0x0008, CanonicalizeSet },
+    { 0x005a, 0x005a, 0x0020, CanonicalizeRangeLo },
+    { 0x005b, 0x0060, 0x0000, CanonicalizeUnique },
+    { 0x0061, 0x0061, 0x0000, CanonicalizeSet },
+    { 0x0062, 0x0065, 0x0020, CanonicalizeRangeHi },
+    { 0x0066, 0x0066, 0x0001, CanonicalizeSet },
+    { 0x0067, 0x0067, 0x0020, CanonicalizeRangeHi },
+    { 0x0068, 0x0068, 0x0002, CanonicalizeSet },
+    { 0x0069, 0x0069, 0x0003, CanonicalizeSet },
+    { 0x006a, 0x006a, 0x0004, CanonicalizeSet },
+    { 0x006b, 0x0072, 0x0020, CanonicalizeRangeHi },
+    { 0x0073, 0x0073, 0x0005, CanonicalizeSet },
+    { 0x0074, 0x0074, 0x0006, CanonicalizeSet },
+    { 0x0075, 0x0076, 0x0020, CanonicalizeRangeHi },
+    { 0x0077, 0x0077, 0x0007, CanonicalizeSet },
+    { 0x0078, 0x0078, 0x0020, CanonicalizeRangeHi },
+    { 0x0079, 0x0079, 0x0008, CanonicalizeSet },
+    { 0x007a, 0x007a, 0x0020, CanonicalizeRangeHi },
+    { 0x007b, 0x00b4, 0x0000, CanonicalizeUnique },
+    { 0x00b5, 0x00b5, 0x0017, CanonicalizeSet },
+    { 0x00b6, 0x00bf, 0x0000, CanonicalizeUnique },
+    { 0x00c0, 0x00d6, 0x0020, CanonicalizeRangeLo },
+    { 0x00d7, 0x00d7, 0x0000, CanonicalizeUnique },
+    { 0x00d8, 0x00de, 0x0020, CanonicalizeRangeLo },
+    { 0x00df, 0x00df, 0x0005, CanonicalizeSet },
+    { 0x00e0, 0x00f6, 0x0020, CanonicalizeRangeHi },
+    { 0x00f7, 0x00f7, 0x0000, CanonicalizeUnique },
+    { 0x00f8, 0x00fe, 0x0020, CanonicalizeRangeHi },
+    { 0x00ff, 0x00ff, 0x0079, CanonicalizeRangeLo },
+    { 0x0100, 0x012f, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x0130, 0x0130, 0x0000, CanonicalizeUnique },
+    { 0x0131, 0x0131, 0x0003, CanonicalizeSet },
+    { 0x0132, 0x0137, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x0138, 0x0138, 0x0000, CanonicalizeUnique },
+    { 0x0139, 0x0148, 0x0000, CanonicalizeAlternatingUnaligned },
+    { 0x0149, 0x0149, 0x0173, CanonicalizeRangeLo },
+    { 0x014a, 0x0177, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x0178, 0x0178, 0x0079, CanonicalizeRangeHi },
+    { 0x0179, 0x017e, 0x0000, CanonicalizeAlternatingUnaligned },
+    { 0x017f, 0x017f, 0x0005, CanonicalizeSet },
+    { 0x0180, 0x0180, 0x00c3, CanonicalizeRangeLo },
+    { 0x0181, 0x0181, 0x00d2, CanonicalizeRangeLo },
+    { 0x0182, 0x0185, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x0186, 0x0186, 0x00ce, CanonicalizeRangeLo },
+    { 0x0187, 0x0188, 0x0000, CanonicalizeAlternatingUnaligned },
+    { 0x0189, 0x018a, 0x00cd, CanonicalizeRangeLo },
+    { 0x018b, 0x018c, 0x0000, CanonicalizeAlternatingUnaligned },
+    { 0x018d, 0x018d, 0x0000, CanonicalizeUnique },
+    { 0x018e, 0x018e, 0x004f, CanonicalizeRangeLo },
+    { 0x018f, 0x018f, 0x00ca, CanonicalizeRangeLo },
+    { 0x0190, 0x0190, 0x00cb, CanonicalizeRangeLo },
+    { 0x0191, 0x0192, 0x0000, CanonicalizeAlternatingUnaligned },
+    { 0x0193, 0x0193, 0x00cd, CanonicalizeRangeLo },
+    { 0x0194, 0x0194, 0x00cf, CanonicalizeRangeLo },
+    { 0x0195, 0x0195, 0x0061, CanonicalizeRangeLo },
+    { 0x0196, 0x0196, 0x00d3, CanonicalizeRangeLo },
+    { 0x0197, 0x0197, 0x00d1, CanonicalizeRangeLo },
+    { 0x0198, 0x0199, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x019a, 0x019a, 0x00a3, CanonicalizeRangeLo },
+    { 0x019b, 0x019b, 0x0000, CanonicalizeUnique },
+    { 0x019c, 0x019c, 0x00d3, CanonicalizeRangeLo },
+    { 0x019d, 0x019d, 0x00d5, CanonicalizeRangeLo },
+    { 0x019e, 0x019e, 0x0082, CanonicalizeRangeLo },
+    { 0x019f, 0x019f, 0x00d6, CanonicalizeRangeLo },
+    { 0x01a0, 0x01a5, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x01a6, 0x01a6, 0x00da, CanonicalizeRangeLo },
+    { 0x01a7, 0x01a8, 0x0000, CanonicalizeAlternatingUnaligned },
+    { 0x01a9, 0x01a9, 0x00da, CanonicalizeRangeLo },
+    { 0x01aa, 0x01ab, 0x0000, CanonicalizeUnique },
+    { 0x01ac, 0x01ad, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x01ae, 0x01ae, 0x00da, CanonicalizeRangeLo },
+    { 0x01af, 0x01b0, 0x0000, CanonicalizeAlternatingUnaligned },
+    { 0x01b1, 0x01b2, 0x00d9, CanonicalizeRangeLo },
+    { 0x01b3, 0x01b6, 0x0000, CanonicalizeAlternatingUnaligned },
+    { 0x01b7, 0x01b7, 0x00db, CanonicalizeRangeLo },
+    { 0x01b8, 0x01b9, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x01ba, 0x01bb, 0x0000, CanonicalizeUnique },
+    { 0x01bc, 0x01bd, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x01be, 0x01be, 0x0000, CanonicalizeUnique },
+    { 0x01bf, 0x01bf, 0x0038, CanonicalizeRangeLo },
+    { 0x01c0, 0x01c3, 0x0000, CanonicalizeUnique },
+    { 0x01c4, 0x01c6, 0x0009, CanonicalizeSet },
+    { 0x01c7, 0x01c9, 0x000a, CanonicalizeSet },
+    { 0x01ca, 0x01cc, 0x000b, CanonicalizeSet },
+    { 0x01cd, 0x01dc, 0x0000, CanonicalizeAlternatingUnaligned },
+    { 0x01dd, 0x01dd, 0x004f, CanonicalizeRangeHi },
+    { 0x01de, 0x01ef, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x01f0, 0x01f0, 0x0004, CanonicalizeSet },
+    { 0x01f1, 0x01f3, 0x000c, CanonicalizeSet },
+    { 0x01f4, 0x01f5, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x01f6, 0x01f6, 0x0061, CanonicalizeRangeHi },
+    { 0x01f7, 0x01f7, 0x0038, CanonicalizeRangeHi },
+    { 0x01f8, 0x021f, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x0220, 0x0220, 0x0082, CanonicalizeRangeHi },
+    { 0x0221, 0x0221, 0x0000, CanonicalizeUnique },
+    { 0x0222, 0x0233, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x0234, 0x0239, 0x0000, CanonicalizeUnique },
+    { 0x023a, 0x023a, 0x2a2b, CanonicalizeRangeLo },
+    { 0x023b, 0x023c, 0x0000, CanonicalizeAlternatingUnaligned },
+    { 0x023d, 0x023d, 0x00a3, CanonicalizeRangeHi },
+    { 0x023e, 0x023e, 0x2a28, CanonicalizeRangeLo },
+    { 0x023f, 0x0240, 0x2a3f, CanonicalizeRangeLo },
+    { 0x0241, 0x0242, 0x0000, CanonicalizeAlternatingUnaligned },
+    { 0x0243, 0x0243, 0x00c3, CanonicalizeRangeHi },
+    { 0x0244, 0x0244, 0x0045, CanonicalizeRangeLo },
+    { 0x0245, 0x0245, 0x0047, CanonicalizeRangeLo },
+    { 0x0246, 0x024f, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x0250, 0x0250, 0x2a1f, CanonicalizeRangeLo },
+    { 0x0251, 0x0251, 0x2a1c, CanonicalizeRangeLo },
+    { 0x0252, 0x0252, 0x2a1e, CanonicalizeRangeLo },
+    { 0x0253, 0x0253, 0x00d2, CanonicalizeRangeHi },
+    { 0x0254, 0x0254, 0x00ce, CanonicalizeRangeHi },
+    { 0x0255, 0x0255, 0x0000, CanonicalizeUnique },
+    { 0x0256, 0x0257, 0x00cd, CanonicalizeRangeHi },
+    { 0x0258, 0x0258, 0x0000, CanonicalizeUnique },
+    { 0x0259, 0x0259, 0x00ca, CanonicalizeRangeHi },
+    { 0x025a, 0x025a, 0x0000, CanonicalizeUnique },
+    { 0x025b, 0x025b, 0x00cb, CanonicalizeRangeHi },
+    { 0x025c, 0x025c, 0xa54f, CanonicalizeRangeLo },
+    { 0x025d, 0x025f, 0x0000, CanonicalizeUnique },
+    { 0x0260, 0x0260, 0x00cd, CanonicalizeRangeHi },
+    { 0x0261, 0x0261, 0xa54b, CanonicalizeRangeLo },
+    { 0x0262, 0x0262, 0x0000, CanonicalizeUnique },
+    { 0x0263, 0x0263, 0x00cf, CanonicalizeRangeHi },
+    { 0x0264, 0x0264, 0x0000, CanonicalizeUnique },
+    { 0x0265, 0x0265, 0xa528, CanonicalizeRangeLo },
+    { 0x0266, 0x0266, 0xa544, CanonicalizeRangeLo },
+    { 0x0267, 0x0267, 0x0000, CanonicalizeUnique },
+    { 0x0268, 0x0268, 0x00d1, CanonicalizeRangeHi },
+    { 0x0269, 0x0269, 0x00d3, CanonicalizeRangeHi },
+    { 0x026a, 0x026a, 0x0000, CanonicalizeUnique },
+    { 0x026b, 0x026b, 0x29f7, CanonicalizeRangeLo },
+    { 0x026c, 0x026c, 0xa541, CanonicalizeRangeLo },
+    { 0x026d, 0x026e, 0x0000, CanonicalizeUnique },
+    { 0x026f, 0x026f, 0x00d3, CanonicalizeRangeHi },
+    { 0x0270, 0x0270, 0x0000, CanonicalizeUnique },
+    { 0x0271, 0x0271, 0x29fd, CanonicalizeRangeLo },
+    { 0x0272, 0x0272, 0x00d5, CanonicalizeRangeHi },
+    { 0x0273, 0x0274, 0x0000, CanonicalizeUnique },
+    { 0x0275, 0x0275, 0x00d6, CanonicalizeRangeHi },
+    { 0x0276, 0x027c, 0x0000, CanonicalizeUnique },
+    { 0x027d, 0x027d, 0x29e7, CanonicalizeRangeLo },
+    { 0x027e, 0x027f, 0x0000, CanonicalizeUnique },
+    { 0x0280, 0x0280, 0x00da, CanonicalizeRangeHi },
+    { 0x0281, 0x0282, 0x0000, CanonicalizeUnique },
+    { 0x0283, 0x0283, 0x00da, CanonicalizeRangeHi },
+    { 0x0284, 0x0286, 0x0000, CanonicalizeUnique },
+    { 0x0287, 0x0287, 0xa52a, CanonicalizeRangeLo },
+    { 0x0288, 0x0288, 0x00da, CanonicalizeRangeHi },
+    { 0x0289, 0x0289, 0x0045, CanonicalizeRangeHi },
+    { 0x028a, 0x028b, 0x00d9, CanonicalizeRangeHi },
+    { 0x028c, 0x028c, 0x0047, CanonicalizeRangeHi },
+    { 0x028d, 0x0291, 0x0000, CanonicalizeUnique },
+    { 0x0292, 0x0292, 0x00db, CanonicalizeRangeHi },
+    { 0x0293, 0x029d, 0x0000, CanonicalizeUnique },
+    { 0x029e, 0x029e, 0xa512, CanonicalizeRangeLo },
+    { 0x029f, 0x02bb, 0x0000, CanonicalizeUnique },
+    { 0x02bc, 0x02bc, 0x0173, CanonicalizeRangeHi },
+    { 0x02bd, 0x0344, 0x0000, CanonicalizeUnique },
+    { 0x0345, 0x0345, 0x0015, CanonicalizeSet },
+    { 0x0346, 0x036f, 0x0000, CanonicalizeUnique },
+    { 0x0370, 0x0373, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x0374, 0x0375, 0x0000, CanonicalizeUnique },
+    { 0x0376, 0x0377, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x0378, 0x037a, 0x0000, CanonicalizeUnique },
+    { 0x037b, 0x037d, 0x0082, CanonicalizeRangeLo },
+    { 0x037e, 0x037e, 0x0000, CanonicalizeUnique },
+    { 0x037f, 0x037f, 0x0074, CanonicalizeRangeLo },
+    { 0x0380, 0x0385, 0x0000, CanonicalizeUnique },
+    { 0x0386, 0x0386, 0x000d, CanonicalizeSet },
+    { 0x0387, 0x0387, 0x0000, CanonicalizeUnique },
+    { 0x0388, 0x0388, 0x0025, CanonicalizeRangeLo },
+    { 0x0389, 0x0389, 0x000e, CanonicalizeSet },
+    { 0x038a, 0x038a, 0x0025, CanonicalizeRangeLo },
+    { 0x038b, 0x038b, 0x0000, CanonicalizeUnique },
+    { 0x038c, 0x038c, 0x0040, CanonicalizeRangeLo },
+    { 0x038d, 0x038d, 0x0000, CanonicalizeUnique },
+    { 0x038e, 0x038e, 0x003f, CanonicalizeRangeLo },
+    { 0x038f, 0x038f, 0x000f, CanonicalizeSet },
+    { 0x0390, 0x0390, 0x0015, CanonicalizeSet },
+    { 0x0391, 0x0391, 0x0010, CanonicalizeSet },
+    { 0x0392, 0x0392, 0x0011, CanonicalizeSet },
+    { 0x0393, 0x0394, 0x0020, CanonicalizeRangeLo },
+    { 0x0395, 0x0395, 0x0012, CanonicalizeSet },
+    { 0x0396, 0x0396, 0x0020, CanonicalizeRangeLo },
+    { 0x0397, 0x0397, 0x0013, CanonicalizeSet },
+    { 0x0398, 0x0398, 0x0014, CanonicalizeSet },
+    { 0x0399, 0x0399, 0x0015, CanonicalizeSet },
+    { 0x039a, 0x039a, 0x0016, CanonicalizeSet },
+    { 0x039b, 0x039b, 0x0020, CanonicalizeRangeLo },
+    { 0x039c, 0x039c, 0x0017, CanonicalizeSet },
+    { 0x039d, 0x039f, 0x0020, CanonicalizeRangeLo },
+    { 0x03a0, 0x03a0, 0x0018, CanonicalizeSet },
+    { 0x03a1, 0x03a1, 0x0019, CanonicalizeSet },
+    { 0x03a2, 0x03a2, 0x0000, CanonicalizeUnique },
+    { 0x03a3, 0x03a3, 0x001a, CanonicalizeSet },
+    { 0x03a4, 0x03a4, 0x0020, CanonicalizeRangeLo },
+    { 0x03a5, 0x03a5, 0x001b, CanonicalizeSet },
+    { 0x03a6, 0x03a6, 0x001c, CanonicalizeSet },
+    { 0x03a7, 0x03a8, 0x0020, CanonicalizeRangeLo },
+    { 0x03a9, 0x03a9, 0x001d, CanonicalizeSet },
+    { 0x03aa, 0x03ab, 0x0020, CanonicalizeRangeLo },
+    { 0x03ac, 0x03ac, 0x000d, CanonicalizeSet },
+    { 0x03ad, 0x03ad, 0x0025, CanonicalizeRangeHi },
+    { 0x03ae, 0x03ae, 0x000e, CanonicalizeSet },
+    { 0x03af, 0x03af, 0x0025, CanonicalizeRangeHi },
+    { 0x03b0, 0x03b0, 0x001b, CanonicalizeSet },
+    { 0x03b1, 0x03b1, 0x0010, CanonicalizeSet },
+    { 0x03b2, 0x03b2, 0x0011, CanonicalizeSet },
+    { 0x03b3, 0x03b4, 0x0020, CanonicalizeRangeHi },
+    { 0x03b5, 0x03b5, 0x0012, CanonicalizeSet },
+    { 0x03b6, 0x03b6, 0x0020, CanonicalizeRangeHi },
+    { 0x03b7, 0x03b7, 0x0013, CanonicalizeSet },
+    { 0x03b8, 0x03b8, 0x0014, CanonicalizeSet },
+    { 0x03b9, 0x03b9, 0x0015, CanonicalizeSet },
+    { 0x03ba, 0x03ba, 0x0016, CanonicalizeSet },
+    { 0x03bb, 0x03bb, 0x0020, CanonicalizeRangeHi },
+    { 0x03bc, 0x03bc, 0x0017, CanonicalizeSet },
+    { 0x03bd, 0x03bf, 0x0020, CanonicalizeRangeHi },
+    { 0x03c0, 0x03c0, 0x0018, CanonicalizeSet },
+    { 0x03c1, 0x03c1, 0x0019, CanonicalizeSet },
+    { 0x03c2, 0x03c3, 0x001a, CanonicalizeSet },
+    { 0x03c4, 0x03c4, 0x0020, CanonicalizeRangeHi },
+    { 0x03c5, 0x03c5, 0x001b, CanonicalizeSet },
+    { 0x03c6, 0x03c6, 0x001c, CanonicalizeSet },
+    { 0x03c7, 0x03c8, 0x0020, CanonicalizeRangeHi },
+    { 0x03c9, 0x03c9, 0x001d, CanonicalizeSet },
+    { 0x03ca, 0x03cb, 0x0020, CanonicalizeRangeHi },
+    { 0x03cc, 0x03cc, 0x0040, CanonicalizeRangeHi },
+    { 0x03cd, 0x03cd, 0x003f, CanonicalizeRangeHi },
+    { 0x03ce, 0x03ce, 0x000f, CanonicalizeSet },
+    { 0x03cf, 0x03cf, 0x0008, CanonicalizeRangeLo },
+    { 0x03d0, 0x03d0, 0x0011, CanonicalizeSet },
+    { 0x03d1, 0x03d1, 0x0014, CanonicalizeSet },
+    { 0x03d2, 0x03d4, 0x0000, CanonicalizeUnique },
+    { 0x03d5, 0x03d5, 0x001c, CanonicalizeSet },
+    { 0x03d6, 0x03d6, 0x0018, CanonicalizeSet },
+    { 0x03d7, 0x03d7, 0x0008, CanonicalizeRangeHi },
+    { 0x03d8, 0x03ef, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x03f0, 0x03f0, 0x0016, CanonicalizeSet },
+    { 0x03f1, 0x03f1, 0x0019, CanonicalizeSet },
+    { 0x03f2, 0x03f2, 0x0007, CanonicalizeRangeLo },
+    { 0x03f3, 0x03f3, 0x0074, CanonicalizeRangeHi },
+    { 0x03f4, 0x03f4, 0x0000, CanonicalizeUnique },
+    { 0x03f5, 0x03f5, 0x0012, CanonicalizeSet },
+    { 0x03f6, 0x03f6, 0x0000, CanonicalizeUnique },
+    { 0x03f7, 0x03f8, 0x0000, CanonicalizeAlternatingUnaligned },
+    { 0x03f9, 0x03f9, 0x0007, CanonicalizeRangeHi },
+    { 0x03fa, 0x03fb, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x03fc, 0x03fc, 0x0000, CanonicalizeUnique },
+    { 0x03fd, 0x03ff, 0x0082, CanonicalizeRangeHi },
+    { 0x0400, 0x040f, 0x0050, CanonicalizeRangeLo },
+    { 0x0410, 0x042f, 0x0020, CanonicalizeRangeLo },
+    { 0x0430, 0x044f, 0x0020, CanonicalizeRangeHi },
+    { 0x0450, 0x045f, 0x0050, CanonicalizeRangeHi },
+    { 0x0460, 0x0481, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x0482, 0x0489, 0x0000, CanonicalizeUnique },
+    { 0x048a, 0x04bf, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x04c0, 0x04c0, 0x000f, CanonicalizeRangeLo },
+    { 0x04c1, 0x04ce, 0x0000, CanonicalizeAlternatingUnaligned },
+    { 0x04cf, 0x04cf, 0x000f, CanonicalizeRangeHi },
+    { 0x04d0, 0x052f, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x0530, 0x0530, 0x0000, CanonicalizeUnique },
+    { 0x0531, 0x0534, 0x0030, CanonicalizeRangeLo },
+    { 0x0535, 0x0535, 0x001e, CanonicalizeSet },
+    { 0x0536, 0x0543, 0x0030, CanonicalizeRangeLo },
+    { 0x0544, 0x0544, 0x001f, CanonicalizeSet },
+    { 0x0545, 0x054d, 0x0030, CanonicalizeRangeLo },
+    { 0x054e, 0x054e, 0x0020, CanonicalizeSet },
+    { 0x054f, 0x0556, 0x0030, CanonicalizeRangeLo },
+    { 0x0557, 0x0560, 0x0000, CanonicalizeUnique },
+    { 0x0561, 0x0564, 0x0030, CanonicalizeRangeHi },
+    { 0x0565, 0x0565, 0x001e, CanonicalizeSet },
+    { 0x0566, 0x0573, 0x0030, CanonicalizeRangeHi },
+    { 0x0574, 0x0574, 0x001f, CanonicalizeSet },
+    { 0x0575, 0x057d, 0x0030, CanonicalizeRangeHi },
+    { 0x057e, 0x057e, 0x0020, CanonicalizeSet },
+    { 0x057f, 0x0586, 0x0030, CanonicalizeRangeHi },
+    { 0x0587, 0x0587, 0x001e, CanonicalizeSet },
+    { 0x0588, 0x109f, 0x0000, CanonicalizeUnique },
+    { 0x10a0, 0x10c5, 0x1c60, CanonicalizeRangeLo },
+    { 0x10c6, 0x10c6, 0x0000, CanonicalizeUnique },
+    { 0x10c7, 0x10c7, 0x1c60, CanonicalizeRangeLo },
+    { 0x10c8, 0x10cc, 0x0000, CanonicalizeUnique },
+    { 0x10cd, 0x10cd, 0x1c60, CanonicalizeRangeLo },
+    { 0x10ce, 0x1d78, 0x0000, CanonicalizeUnique },
+    { 0x1d79, 0x1d79, 0x8a04, CanonicalizeRangeLo },
+    { 0x1d7a, 0x1d7c, 0x0000, CanonicalizeUnique },
+    { 0x1d7d, 0x1d7d, 0x0ee6, CanonicalizeRangeLo },
+    { 0x1d7e, 0x1dff, 0x0000, CanonicalizeUnique },
+    { 0x1e00, 0x1e5f, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x1e60, 0x1e61, 0x0021, CanonicalizeSet },
+    { 0x1e62, 0x1e95, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x1e96, 0x1e96, 0x0002, CanonicalizeSet },
+    { 0x1e97, 0x1e97, 0x0006, CanonicalizeSet },
+    { 0x1e98, 0x1e98, 0x0007, CanonicalizeSet },
+    { 0x1e99, 0x1e99, 0x0008, CanonicalizeSet },
+    { 0x1e9a, 0x1e9a, 0x0000, CanonicalizeSet },
+    { 0x1e9b, 0x1e9b, 0x0021, CanonicalizeSet },
+    { 0x1e9c, 0x1e9f, 0x0000, CanonicalizeUnique },
+    { 0x1ea0, 0x1eff, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x1f00, 0x1f00, 0x0022, CanonicalizeSet },
+    { 0x1f01, 0x1f01, 0x0023, CanonicalizeSet },
+    { 0x1f02, 0x1f02, 0x0024, CanonicalizeSet },
+    { 0x1f03, 0x1f03, 0x0025, CanonicalizeSet },
+    { 0x1f04, 0x1f04, 0x0026, CanonicalizeSet },
+    { 0x1f05, 0x1f05, 0x0027, CanonicalizeSet },
+    { 0x1f06, 0x1f06, 0x0028, CanonicalizeSet },
+    { 0x1f07, 0x1f07, 0x0029, CanonicalizeSet },
+    { 0x1f08, 0x1f08, 0x0022, CanonicalizeSet },
+    { 0x1f09, 0x1f09, 0x0023, CanonicalizeSet },
+    { 0x1f0a, 0x1f0a, 0x0024, CanonicalizeSet },
+    { 0x1f0b, 0x1f0b, 0x0025, CanonicalizeSet },
+    { 0x1f0c, 0x1f0c, 0x0026, CanonicalizeSet },
+    { 0x1f0d, 0x1f0d, 0x0027, CanonicalizeSet },
+    { 0x1f0e, 0x1f0e, 0x0028, CanonicalizeSet },
+    { 0x1f0f, 0x1f0f, 0x0029, CanonicalizeSet },
+    { 0x1f10, 0x1f15, 0x0008, CanonicalizeRangeLo },
+    { 0x1f16, 0x1f17, 0x0000, CanonicalizeUnique },
+    { 0x1f18, 0x1f1d, 0x0008, CanonicalizeRangeHi },
+    { 0x1f1e, 0x1f1f, 0x0000, CanonicalizeUnique },
+    { 0x1f20, 0x1f20, 0x002a, CanonicalizeSet },
+    { 0x1f21, 0x1f21, 0x002b, CanonicalizeSet },
+    { 0x1f22, 0x1f22, 0x002c, CanonicalizeSet },
+    { 0x1f23, 0x1f23, 0x002d, CanonicalizeSet },
+    { 0x1f24, 0x1f24, 0x002e, CanonicalizeSet },
+    { 0x1f25, 0x1f25, 0x002f, CanonicalizeSet },
+    { 0x1f26, 0x1f26, 0x0030, CanonicalizeSet },
+    { 0x1f27, 0x1f27, 0x0031, CanonicalizeSet },
+    { 0x1f28, 0x1f28, 0x002a, CanonicalizeSet },
+    { 0x1f29, 0x1f29, 0x002b, CanonicalizeSet },
+    { 0x1f2a, 0x1f2a, 0x002c, CanonicalizeSet },
+    { 0x1f2b, 0x1f2b, 0x002d, CanonicalizeSet },
+    { 0x1f2c, 0x1f2c, 0x002e, CanonicalizeSet },
+    { 0x1f2d, 0x1f2d, 0x002f, CanonicalizeSet },
+    { 0x1f2e, 0x1f2e, 0x0030, CanonicalizeSet },
+    { 0x1f2f, 0x1f2f, 0x0031, CanonicalizeSet },
+    { 0x1f30, 0x1f37, 0x0008, CanonicalizeRangeLo },
+    { 0x1f38, 0x1f3f, 0x0008, CanonicalizeRangeHi },
+    { 0x1f40, 0x1f45, 0x0008, CanonicalizeRangeLo },
+    { 0x1f46, 0x1f47, 0x0000, CanonicalizeUnique },
+    { 0x1f48, 0x1f4d, 0x0008, CanonicalizeRangeHi },
+    { 0x1f4e, 0x1f4f, 0x0000, CanonicalizeUnique },
+    { 0x1f50, 0x1f50, 0x001b, CanonicalizeSet },
+    { 0x1f51, 0x1f51, 0x0008, CanonicalizeRangeLo },
+    { 0x1f52, 0x1f52, 0x001b, CanonicalizeSet },
+    { 0x1f53, 0x1f53, 0x0008, CanonicalizeRangeLo },
+    { 0x1f54, 0x1f54, 0x001b, CanonicalizeSet },
+    { 0x1f55, 0x1f55, 0x0008, CanonicalizeRangeLo },
+    { 0x1f56, 0x1f56, 0x001b, CanonicalizeSet },
+    { 0x1f57, 0x1f57, 0x0008, CanonicalizeRangeLo },
+    { 0x1f58, 0x1f58, 0x0000, CanonicalizeUnique },
+    { 0x1f59, 0x1f59, 0x0008, CanonicalizeRangeHi },
+    { 0x1f5a, 0x1f5a, 0x0000, CanonicalizeUnique },
+    { 0x1f5b, 0x1f5b, 0x0008, CanonicalizeRangeHi },
+    { 0x1f5c, 0x1f5c, 0x0000, CanonicalizeUnique },
+    { 0x1f5d, 0x1f5d, 0x0008, CanonicalizeRangeHi },
+    { 0x1f5e, 0x1f5e, 0x0000, CanonicalizeUnique },
+    { 0x1f5f, 0x1f5f, 0x0008, CanonicalizeRangeHi },
+    { 0x1f60, 0x1f60, 0x0032, CanonicalizeSet },
+    { 0x1f61, 0x1f61, 0x0033, CanonicalizeSet },
+    { 0x1f62, 0x1f62, 0x0034, CanonicalizeSet },
+    { 0x1f63, 0x1f63, 0x0035, CanonicalizeSet },
+    { 0x1f64, 0x1f64, 0x0036, CanonicalizeSet },
+    { 0x1f65, 0x1f65, 0x0037, CanonicalizeSet },
+    { 0x1f66, 0x1f66, 0x0038, CanonicalizeSet },
+    { 0x1f67, 0x1f67, 0x0039, CanonicalizeSet },
+    { 0x1f68, 0x1f68, 0x0032, CanonicalizeSet },
+    { 0x1f69, 0x1f69, 0x0033, CanonicalizeSet },
+    { 0x1f6a, 0x1f6a, 0x0034, CanonicalizeSet },
+    { 0x1f6b, 0x1f6b, 0x0035, CanonicalizeSet },
+    { 0x1f6c, 0x1f6c, 0x0036, CanonicalizeSet },
+    { 0x1f6d, 0x1f6d, 0x0037, CanonicalizeSet },
+    { 0x1f6e, 0x1f6e, 0x0038, CanonicalizeSet },
+    { 0x1f6f, 0x1f6f, 0x0039, CanonicalizeSet },
+    { 0x1f70, 0x1f70, 0x003a, CanonicalizeSet },
+    { 0x1f71, 0x1f71, 0x004a, CanonicalizeRangeLo },
+    { 0x1f72, 0x1f73, 0x0056, CanonicalizeRangeLo },
+    { 0x1f74, 0x1f74, 0x003b, CanonicalizeSet },
+    { 0x1f75, 0x1f75, 0x0056, CanonicalizeRangeLo },
+    { 0x1f76, 0x1f77, 0x0064, CanonicalizeRangeLo },
+    { 0x1f78, 0x1f79, 0x0080, CanonicalizeRangeLo },
+    { 0x1f7a, 0x1f7b, 0x0070, CanonicalizeRangeLo },
+    { 0x1f7c, 0x1f7c, 0x003c, CanonicalizeSet },
+    { 0x1f7d, 0x1f7d, 0x007e, CanonicalizeRangeLo },
+    { 0x1f7e, 0x1f7f, 0x0000, CanonicalizeUnique },
+    { 0x1f80, 0x1f80, 0x0022, CanonicalizeSet },
+    { 0x1f81, 0x1f81, 0x0023, CanonicalizeSet },
+    { 0x1f82, 0x1f82, 0x0024, CanonicalizeSet },
+    { 0x1f83, 0x1f83, 0x0025, CanonicalizeSet },
+    { 0x1f84, 0x1f84, 0x0026, CanonicalizeSet },
+    { 0x1f85, 0x1f85, 0x0027, CanonicalizeSet },
+    { 0x1f86, 0x1f86, 0x0028, CanonicalizeSet },
+    { 0x1f87, 0x1f87, 0x0029, CanonicalizeSet },
+    { 0x1f88, 0x1f88, 0x0022, CanonicalizeSet },
+    { 0x1f89, 0x1f89, 0x0023, CanonicalizeSet },
+    { 0x1f8a, 0x1f8a, 0x0024, CanonicalizeSet },
+    { 0x1f8b, 0x1f8b, 0x0025, CanonicalizeSet },
+    { 0x1f8c, 0x1f8c, 0x0026, CanonicalizeSet },
+    { 0x1f8d, 0x1f8d, 0x0027, CanonicalizeSet },
+    { 0x1f8e, 0x1f8e, 0x0028, CanonicalizeSet },
+    { 0x1f8f, 0x1f8f, 0x0029, CanonicalizeSet },
+    { 0x1f90, 0x1f90, 0x002a, CanonicalizeSet },
+    { 0x1f91, 0x1f91, 0x002b, CanonicalizeSet },
+    { 0x1f92, 0x1f92, 0x002c, CanonicalizeSet },
+    { 0x1f93, 0x1f93, 0x002d, CanonicalizeSet },
+    { 0x1f94, 0x1f94, 0x002e, CanonicalizeSet },
+    { 0x1f95, 0x1f95, 0x002f, CanonicalizeSet },
+    { 0x1f96, 0x1f96, 0x0030, CanonicalizeSet },
+    { 0x1f97, 0x1f97, 0x0031, CanonicalizeSet },
+    { 0x1f98, 0x1f98, 0x002a, CanonicalizeSet },
+    { 0x1f99, 0x1f99, 0x002b, CanonicalizeSet },
+    { 0x1f9a, 0x1f9a, 0x002c, CanonicalizeSet },
+    { 0x1f9b, 0x1f9b, 0x002d, CanonicalizeSet },
+    { 0x1f9c, 0x1f9c, 0x002e, CanonicalizeSet },
+    { 0x1f9d, 0x1f9d, 0x002f, CanonicalizeSet },
+    { 0x1f9e, 0x1f9e, 0x0030, CanonicalizeSet },
+    { 0x1f9f, 0x1f9f, 0x0031, CanonicalizeSet },
+    { 0x1fa0, 0x1fa0, 0x0032, CanonicalizeSet },
+    { 0x1fa1, 0x1fa1, 0x0033, CanonicalizeSet },
+    { 0x1fa2, 0x1fa2, 0x0034, CanonicalizeSet },
+    { 0x1fa3, 0x1fa3, 0x0035, CanonicalizeSet },
+    { 0x1fa4, 0x1fa4, 0x0036, CanonicalizeSet },
+    { 0x1fa5, 0x1fa5, 0x0037, CanonicalizeSet },
+    { 0x1fa6, 0x1fa6, 0x0038, CanonicalizeSet },
+    { 0x1fa7, 0x1fa7, 0x0039, CanonicalizeSet },
+    { 0x1fa8, 0x1fa8, 0x0032, CanonicalizeSet },
+    { 0x1fa9, 0x1fa9, 0x0033, CanonicalizeSet },
+    { 0x1faa, 0x1faa, 0x0034, CanonicalizeSet },
+    { 0x1fab, 0x1fab, 0x0035, CanonicalizeSet },
+    { 0x1fac, 0x1fac, 0x0036, CanonicalizeSet },
+    { 0x1fad, 0x1fad, 0x0037, CanonicalizeSet },
+    { 0x1fae, 0x1fae, 0x0038, CanonicalizeSet },
+    { 0x1faf, 0x1faf, 0x0039, CanonicalizeSet },
+    { 0x1fb0, 0x1fb1, 0x0008, CanonicalizeRangeLo },
+    { 0x1fb2, 0x1fb2, 0x003a, CanonicalizeSet },
+    { 0x1fb3, 0x1fb3, 0x0010, CanonicalizeSet },
+    { 0x1fb4, 0x1fb4, 0x000d, CanonicalizeSet },
+    { 0x1fb5, 0x1fb5, 0x0000, CanonicalizeUnique },
+    { 0x1fb6, 0x1fb7, 0x0010, CanonicalizeSet },
+    { 0x1fb8, 0x1fb9, 0x0008, CanonicalizeRangeHi },
+    { 0x1fba, 0x1fba, 0x003a, CanonicalizeSet },
+    { 0x1fbb, 0x1fbb, 0x004a, CanonicalizeRangeHi },
+    { 0x1fbc, 0x1fbc, 0x0010, CanonicalizeSet },
+    { 0x1fbd, 0x1fbd, 0x0000, CanonicalizeUnique },
+    { 0x1fbe, 0x1fbe, 0x0015, CanonicalizeSet },
+    { 0x1fbf, 0x1fc1, 0x0000, CanonicalizeUnique },
+    { 0x1fc2, 0x1fc2, 0x003b, CanonicalizeSet },
+    { 0x1fc3, 0x1fc3, 0x0013, CanonicalizeSet },
+    { 0x1fc4, 0x1fc4, 0x000e, CanonicalizeSet },
+    { 0x1fc5, 0x1fc5, 0x0000, CanonicalizeUnique },
+    { 0x1fc6, 0x1fc7, 0x0013, CanonicalizeSet },
+    { 0x1fc8, 0x1fc9, 0x0056, CanonicalizeRangeHi },
+    { 0x1fca, 0x1fca, 0x003b, CanonicalizeSet },
+    { 0x1fcb, 0x1fcb, 0x0056, CanonicalizeRangeHi },
+    { 0x1fcc, 0x1fcc, 0x0013, CanonicalizeSet },
+    { 0x1fcd, 0x1fcf, 0x0000, CanonicalizeUnique },
+    { 0x1fd0, 0x1fd1, 0x0008, CanonicalizeRangeLo },
+    { 0x1fd2, 0x1fd3, 0x0015, CanonicalizeSet },
+    { 0x1fd4, 0x1fd5, 0x0000, CanonicalizeUnique },
+    { 0x1fd6, 0x1fd7, 0x0015, CanonicalizeSet },
+    { 0x1fd8, 0x1fd9, 0x0008, CanonicalizeRangeHi },
+    { 0x1fda, 0x1fdb, 0x0064, CanonicalizeRangeHi },
+    { 0x1fdc, 0x1fdf, 0x0000, CanonicalizeUnique },
+    { 0x1fe0, 0x1fe1, 0x0008, CanonicalizeRangeLo },
+    { 0x1fe2, 0x1fe3, 0x001b, CanonicalizeSet },
+    { 0x1fe4, 0x1fe4, 0x0019, CanonicalizeSet },
+    { 0x1fe5, 0x1fe5, 0x0007, CanonicalizeRangeLo },
+    { 0x1fe6, 0x1fe7, 0x001b, CanonicalizeSet },
+    { 0x1fe8, 0x1fe9, 0x0008, CanonicalizeRangeHi },
+    { 0x1fea, 0x1feb, 0x0070, CanonicalizeRangeHi },
+    { 0x1fec, 0x1fec, 0x0007, CanonicalizeRangeHi },
+    { 0x1fed, 0x1ff1, 0x0000, CanonicalizeUnique },
+    { 0x1ff2, 0x1ff2, 0x003c, CanonicalizeSet },
+    { 0x1ff3, 0x1ff3, 0x001d, CanonicalizeSet },
+    { 0x1ff4, 0x1ff4, 0x000f, CanonicalizeSet },
+    { 0x1ff5, 0x1ff5, 0x0000, CanonicalizeUnique },
+    { 0x1ff6, 0x1ff7, 0x001d, CanonicalizeSet },
+    { 0x1ff8, 0x1ff9, 0x0080, CanonicalizeRangeHi },
+    { 0x1ffa, 0x1ffa, 0x003c, CanonicalizeSet },
+    { 0x1ffb, 0x1ffb, 0x007e, CanonicalizeRangeHi },
+    { 0x1ffc, 0x1ffc, 0x001d, CanonicalizeSet },
+    { 0x1ffd, 0x2131, 0x0000, CanonicalizeUnique },
+    { 0x2132, 0x2132, 0x001c, CanonicalizeRangeLo },
+    { 0x2133, 0x214d, 0x0000, CanonicalizeUnique },
+    { 0x214e, 0x214e, 0x001c, CanonicalizeRangeHi },
+    { 0x214f, 0x215f, 0x0000, CanonicalizeUnique },
+    { 0x2160, 0x216f, 0x0010, CanonicalizeRangeLo },
+    { 0x2170, 0x217f, 0x0010, CanonicalizeRangeHi },
+    { 0x2180, 0x2182, 0x0000, CanonicalizeUnique },
+    { 0x2183, 0x2184, 0x0000, CanonicalizeAlternatingUnaligned },
+    { 0x2185, 0x24b5, 0x0000, CanonicalizeUnique },
+    { 0x24b6, 0x24cf, 0x001a, CanonicalizeRangeLo },
+    { 0x24d0, 0x24e9, 0x001a, CanonicalizeRangeHi },
+    { 0x24ea, 0x2bff, 0x0000, CanonicalizeUnique },
+    { 0x2c00, 0x2c2e, 0x0030, CanonicalizeRangeLo },
+    { 0x2c2f, 0x2c2f, 0x0000, CanonicalizeUnique },
+    { 0x2c30, 0x2c5e, 0x0030, CanonicalizeRangeHi },
+    { 0x2c5f, 0x2c5f, 0x0000, CanonicalizeUnique },
+    { 0x2c60, 0x2c61, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x2c62, 0x2c62, 0x29f7, CanonicalizeRangeHi },
+    { 0x2c63, 0x2c63, 0x0ee6, CanonicalizeRangeHi },
+    { 0x2c64, 0x2c64, 0x29e7, CanonicalizeRangeHi },
+    { 0x2c65, 0x2c65, 0x2a2b, CanonicalizeRangeHi },
+    { 0x2c66, 0x2c66, 0x2a28, CanonicalizeRangeHi },
+    { 0x2c67, 0x2c6c, 0x0000, CanonicalizeAlternatingUnaligned },
+    { 0x2c6d, 0x2c6d, 0x2a1c, CanonicalizeRangeHi },
+    { 0x2c6e, 0x2c6e, 0x29fd, CanonicalizeRangeHi },
+    { 0x2c6f, 0x2c6f, 0x2a1f, CanonicalizeRangeHi },
+    { 0x2c70, 0x2c70, 0x2a1e, CanonicalizeRangeHi },
+    { 0x2c71, 0x2c71, 0x0000, CanonicalizeUnique },
+    { 0x2c72, 0x2c73, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x2c74, 0x2c74, 0x0000, CanonicalizeUnique },
+    { 0x2c75, 0x2c76, 0x0000, CanonicalizeAlternatingUnaligned },
+    { 0x2c77, 0x2c7d, 0x0000, CanonicalizeUnique },
+    { 0x2c7e, 0x2c7f, 0x2a3f, CanonicalizeRangeHi },
+    { 0x2c80, 0x2ce3, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x2ce4, 0x2cea, 0x0000, CanonicalizeUnique },
+    { 0x2ceb, 0x2cee, 0x0000, CanonicalizeAlternatingUnaligned },
+    { 0x2cef, 0x2cf1, 0x0000, CanonicalizeUnique },
+    { 0x2cf2, 0x2cf3, 0x0000, CanonicalizeAlternatingAligned },
+    { 0x2cf4, 0x2cff, 0x0000, CanonicalizeUnique },
+    { 0x2d00, 0x2d25, 0x1c60, CanonicalizeRangeHi },
+    { 0x2d26, 0x2d26, 0x0000, CanonicalizeUnique },
+    { 0x2d27, 0x2d27, 0x1c60, CanonicalizeRangeHi },
+    { 0x2d28, 0x2d2c, 0x0000, CanonicalizeUnique },
+    { 0x2d2d, 0x2d2d, 0x1c60, CanonicalizeRangeHi },
+    { 0x2d2e, 0xa63f, 0x0000, CanonicalizeUnique },
+    { 0xa640, 0xa66d, 0x0000, CanonicalizeAlternatingAligned },
+    { 0xa66e, 0xa67f, 0x0000, CanonicalizeUnique },
+    { 0xa680, 0xa69b, 0x0000, CanonicalizeAlternatingAligned },
+    { 0xa69c, 0xa721, 0x0000, CanonicalizeUnique },
+    { 0xa722, 0xa72f, 0x0000, CanonicalizeAlternatingAligned },
+    { 0xa730, 0xa731, 0x0000, CanonicalizeUnique },
+    { 0xa732, 0xa76f, 0x0000, CanonicalizeAlternatingAligned },
+    { 0xa770, 0xa778, 0x0000, CanonicalizeUnique },
+    { 0xa779, 0xa77c, 0x0000, CanonicalizeAlternatingUnaligned },
+    { 0xa77d, 0xa77d, 0x8a04, CanonicalizeRangeHi },
+    { 0xa77e, 0xa787, 0x0000, CanonicalizeAlternatingAligned },
+    { 0xa788, 0xa78a, 0x0000, CanonicalizeUnique },
+    { 0xa78b, 0xa78c, 0x0000, CanonicalizeAlternatingUnaligned },
+    { 0xa78d, 0xa78d, 0xa528, CanonicalizeRangeHi },
+    { 0xa78e, 0xa78f, 0x0000, CanonicalizeUnique },
+    { 0xa790, 0xa793, 0x0000, CanonicalizeAlternatingAligned },
+    { 0xa794, 0xa795, 0x0000, CanonicalizeUnique },
+    { 0xa796, 0xa7a9, 0x0000, CanonicalizeAlternatingAligned },
+    { 0xa7aa, 0xa7aa, 0xa544, CanonicalizeRangeHi },
+    { 0xa7ab, 0xa7ab, 0xa54f, CanonicalizeRangeHi },
+    { 0xa7ac, 0xa7ac, 0xa54b, CanonicalizeRangeHi },
+    { 0xa7ad, 0xa7ad, 0xa541, CanonicalizeRangeHi },
+    { 0xa7ae, 0xa7af, 0x0000, CanonicalizeUnique },
+    { 0xa7b0, 0xa7b0, 0xa512, CanonicalizeRangeHi },
+    { 0xa7b1, 0xa7b1, 0xa52a, CanonicalizeRangeHi },
+    { 0xa7b2, 0xfaff, 0x0000, CanonicalizeUnique },
+    { 0xfb00, 0xfb04, 0x0001, CanonicalizeSet },
+    { 0xfb05, 0xfb06, 0x0005, CanonicalizeSet },
+    { 0xfb07, 0xfb12, 0x0000, CanonicalizeUnique },
+    { 0xfb13, 0xfb15, 0x001f, CanonicalizeSet },
+    { 0xfb16, 0xfb16, 0x0020, CanonicalizeSet },
+    { 0xfb17, 0xfb17, 0x001f, CanonicalizeSet },
+    { 0xfb18, 0xff20, 0x0000, CanonicalizeUnique },
+    { 0xff21, 0xff3a, 0x0020, CanonicalizeRangeLo },
+    { 0xff3b, 0xff40, 0x0000, CanonicalizeUnique },
+    { 0xff41, 0xff5a, 0x0020, CanonicalizeRangeHi },
+    { 0xff5b, 0x103ff, 0x0000, CanonicalizeUnique },
+    { 0x10400, 0x10427, 0x0028, CanonicalizeRangeLo },
+    { 0x10428, 0x1044f, 0x0028, CanonicalizeRangeHi },
+    { 0x10450, 0x1189f, 0x0000, CanonicalizeUnique },
+    { 0x118a0, 0x118bf, 0x0020, CanonicalizeRangeLo },
+    { 0x118c0, 0x118df, 0x0020, CanonicalizeRangeHi },
+    { 0x118e0, 0x10ffff, 0x0000, CanonicalizeUnique },
+};
+
+} } // JSC::Yarr
+
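Reviewer note: the sketch below is one way to read the range table that ends above. It is not the generated code. `Range`, `RangeType`, `findRange`, and `canonicalPair` are hypothetical stand-ins for the Yarr types and helpers, and the per-type arithmetic is an assumption inferred from the table entries and the enum comments in the header (the body of `getCanonicalPair` is not shown in this diff). The six entries are copied from the table, so the search only covers 0x03fd..0x0481.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical mirror of the canonicalization types used by the table.
enum class RangeType { Unique, Set, RangeLo, RangeHi, AlternatingAligned, AlternatingUnaligned };

// Hypothetical mirror of one table row: [begin, end], a value, and a type.
struct Range {
    int32_t begin;
    int32_t end;
    int32_t value;
    RangeType type;
};

// Contiguous excerpt of the table above (Greek tail and Cyrillic block).
static const Range sampleRanges[] = {
    { 0x03fd, 0x03ff, 0x0082, RangeType::RangeHi },
    { 0x0400, 0x040f, 0x0050, RangeType::RangeLo },
    { 0x0410, 0x042f, 0x0020, RangeType::RangeLo },
    { 0x0430, 0x044f, 0x0020, RangeType::RangeHi },
    { 0x0450, 0x045f, 0x0050, RangeType::RangeHi },
    { 0x0460, 0x0481, 0x0000, RangeType::AlternatingAligned },
};
static const size_t sampleRangeCount = sizeof(sampleRanges) / sizeof(sampleRanges[0]);

// Binary search over the sorted, non-overlapping ranges.
// Returns nullptr when ch falls outside the excerpt; the full table
// covers every code point, so the real lookup never needs that case.
inline const Range* findRange(int32_t ch)
{
    size_t lo = 0;
    size_t hi = sampleRangeCount;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (ch < sampleRanges[mid].begin)
            hi = mid;
        else if (ch > sampleRanges[mid].end)
            lo = mid + 1;
        else
            return &sampleRanges[mid];
    }
    return nullptr;
}

// Assumed interpretation of the types that map to exactly one partner.
inline int32_t canonicalPair(const Range& range, int32_t ch)
{
    switch (range.type) {
    case RangeType::RangeLo:
        return ch + range.value;      // partner lies 'value' above
    case RangeType::RangeHi:
        return ch - range.value;      // partner lies 'value' below
    case RangeType::AlternatingAligned:
        return ch ^ 1;                // even/odd pair, e.g. 0x1f4 and 0x1f5
    case RangeType::AlternatingUnaligned:
        return ((ch - 1) ^ 1) + 1;    // odd/even pair, e.g. 0x241 and 0x242
    default:
        return ch;                    // Unique and Set are handled elsewhere
    }
}
```

Under these assumptions, U+0410 lands in a RangeLo entry with value 0x20 and pairs with U+0430, while U+0451 lands in a RangeHi entry with value 0x50 and pairs with U+0401. Entries of type `CanonicalizeSet` instead store an index into a separate character-set table and are not covered by this sketch.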
@@ -1,5 +1,5 @@
 /*
- * Copyright (C) 2012 Apple Inc. All rights reserved.
+ * Copyright (C) 2012-2016 Apple Inc. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
  */
 
-#ifndef YarrCanonicalizeUCS2_H
-#define YarrCanonicalizeUCS2_H
+#ifndef YarrCanonicalizeUnicode_h
+#define YarrCanonicalizeUnicode_h
 
 #include <stdint.h>
 #include <unicode/utypes.h>
 
 namespace JSC { namespace Yarr {
 
-// This set of data (autogenerated using YarrCanonicalizeUCS2.js into YarrCanonicalizeUCS2.cpp)
+// This set of data (autogenerated using YarrCanonicalizeUnicode.js into YarrCanonicalizeUnicode.cpp)
 // provides information for each UCS2 code point as to the set of code points that it should
 // match under the ES5.1 case insensitive RegExp matching rules, specified in 15.10.2.8.
 enum UCS2CanonicalizationType {
@@ -42,32 +42,38 @@ enum UCS2CanonicalizationType {
     CanonicalizeAlternatingAligned,   // Aligned consecutive pair, e.g. 0x1f4,0x1f5.
     CanonicalizeAlternatingUnaligned, // Unaligned consecutive pair, e.g. 0x241,0x242.
 };
-struct UCS2CanonicalizationRange { uint16_t begin, end, value, type; };
-extern const size_t UCS2_CANONICALIZATION_RANGES;
-extern const uint16_t* const characterSetInfo[];
-extern const UCS2CanonicalizationRange rangeInfo[];
-
-// This table is similar to the full rangeInfo table, however this maps from UCS2 codepoints to
-// the set of Latin1 codepoints that could match.
-enum LatinCanonicalizationType {
-    CanonicalizeLatinSelf,     // This character is in the Latin1 range, but has no canonical equivalent in the range.
-    CanonicalizeLatinMask0x20, // One of a pair of characters, under the mask 0x20.
-    CanonicalizeLatinOther,    // This character is not in the Latin1 range, but canonicalizes to another that is.
-    CanonicalizeLatinInvalid,  // Cannot match against Latin1 input.
+struct CanonicalizationRange {
+    UChar32 begin;
+    UChar32 end;
+    UChar32 value;
+    UCS2CanonicalizationType type;
 };
-struct LatinCanonicalizationRange { uint16_t begin, end, value, type; };
-extern const size_t LATIN_CANONICALIZATION_RANGES;
-extern LatinCanonicalizationRange latinRangeInfo[];
 
-// This searches in log2 time over ~364 entries, so should typically result in 8 compares.
-inline const UCS2CanonicalizationRange* rangeInfoFor(UChar ch)
+extern const size_t UCS2_CANONICALIZATION_RANGES;
+extern const UChar32* const ucs2CharacterSetInfo[];
+extern const CanonicalizationRange ucs2RangeInfo[];
+
+extern const size_t UNICODE_CANONICALIZATION_RANGES;
+extern const UChar32* const unicodeCharacterSetInfo[];
+extern const CanonicalizationRange unicodeRangeInfo[];
+
+enum class CanonicalMode { UCS2, Unicode };
+
+inline const UChar32* canonicalCharacterSetInfo(unsigned index, CanonicalMode canonicalMode)
+{
+    const UChar32* const* rangeInfo = canonicalMode == CanonicalMode::UCS2 ? ucs2CharacterSetInfo : unicodeCharacterSetInfo;
+    return rangeInfo[index];
+}
+
+// This searches in log2 time over ~400-600 entries, so should typically result in 9 compares.
+inline const CanonicalizationRange* canonicalRangeInfoFor(UChar32 ch, CanonicalMode canonicalMode = CanonicalMode::UCS2)
 {
-    const UCS2CanonicalizationRange* info = rangeInfo;
-    size_t entries = UCS2_CANONICALIZATION_RANGES;
+    const CanonicalizationRange* info = canonicalMode == CanonicalMode::UCS2 ? ucs2RangeInfo : unicodeRangeInfo;
+    size_t entries = canonicalMode == CanonicalMode::UCS2 ? UCS2_CANONICALIZATION_RANGES : UNICODE_CANONICALIZATION_RANGES;
 
     while (true) {
         size_t candidate = entries >> 1;
-        const UCS2CanonicalizationRange* candidateInfo = info + candidate;
+        const CanonicalizationRange* candidateInfo = info + candidate;
         if (ch < candidateInfo->begin)
             entries = candidate;
         else if (ch <= candidateInfo->end)
@@ -80,7 +86,7 @@ inline const UCS2CanonicalizationRange* rangeInfoFor(UChar ch)
 }
 
 // Should only be called for characters that have one canonically matching value.
-inline UChar getCanonicalPair(const UCS2CanonicalizationRange* info, UChar ch)
+inline UChar32 getCanonicalPair(const CanonicalizationRange* info, UChar32 ch)
 {
     ASSERT(ch >= info->begin && ch <= info->end);
     switch (info->type) {
@@ -100,20 +106,20 @@ inline UChar getCanonicalPair(const UCS2CanonicalizationRange* info, UChar ch)
 }
 
 // Returns true if no other UCS2 codepoint can match this value.
-inline bool isCanonicallyUnique(UChar ch)
+inline bool isCanonicallyUnique(UChar32 ch, CanonicalMode canonicalMode = CanonicalMode::UCS2)
 {
-    return rangeInfoFor(ch)->type == CanonicalizeUnique;
+    return canonicalRangeInfoFor(ch, canonicalMode)->type == CanonicalizeUnique;
 }
 
 // Returns true if values are equal, under the canonicalization rules.
-inline bool areCanonicallyEquivalent(UChar a, UChar b)
+inline bool areCanonicallyEquivalent(UChar32 a, UChar32 b, CanonicalMode canonicalMode = CanonicalMode::UCS2)
 {
-    const UCS2CanonicalizationRange* info = rangeInfoFor(a);
+    const CanonicalizationRange* info = canonicalRangeInfoFor(a, canonicalMode);
     switch (info->type) {
     case CanonicalizeUnique:
         return a == b;
     case CanonicalizeSet: {
-        for (const uint16_t* set = characterSetInfo[info->value]; (a = *set); ++set) {
+        for (const UChar32* set = canonicalCharacterSetInfo(info->value, canonicalMode); (a = *set); ++set) {
             if (a == b)
                 return true;
         }
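The range lookup in `canonicalRangeInfoFor` is a binary search over a sorted table whose ranges cover the whole code point space without gaps. A minimal standalone sketch of that search follows. It assumes a tiny hand-written table, and the names `Range`, `RangeType`, `sampleRanges` and `findRange` are hypothetical stand-ins, not the generated `ucs2RangeInfo` / `unicodeRangeInfo` data.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for the generated tables; the values are illustrative only.
enum class RangeType { Unique, AlternatingAligned };

struct Range {
    char32_t begin;
    char32_t end;
    RangeType type;
};

// Sorted, contiguous, and covering every code point from 0 to 0x10FFFF.
static const Range sampleRanges[] = {
    { 0x0000, 0x00FF, RangeType::Unique },
    { 0x0100, 0x012F, RangeType::AlternatingAligned },
    { 0x0130, 0x10FFFF, RangeType::Unique },
};

// Binary search that shrinks the window on every step.
// The caller must pass ch <= 0x10FFFF; the loop relies on the table
// covering the value and has no "not found" exit.
static const Range* findRange(char32_t ch)
{
    const Range* info = sampleRanges;
    std::size_t entries = sizeof(sampleRanges) / sizeof(sampleRanges[0]);

    while (true) {
        std::size_t candidate = entries >> 1;
        const Range* candidateInfo = info + candidate;
        if (ch < candidateInfo->begin)
            entries = candidate; // Keep the lower half.
        else if (ch <= candidateInfo->end)
            return candidateInfo; // ch falls inside this range.
        else {
            info = candidateInfo + 1; // Keep the upper half.
            entries -= candidate + 1;
        }
    }
}
```

Because every code point belongs to exactly one range, the search needs no failure path, which is presumably why the table generator emits `CanonicalizeUnique` ranges for characters with no case partner instead of leaving holes.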
diff --git a/Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.js b/Source/JavaScriptCore/yarr/YarrCanonicalizeUnicode.js
new file mode 100644 (file)
index 0000000..22ad9fc
--- /dev/null
@@ -0,0 +1,221 @@
+/*
+ * Copyright (C) 2012, 2016 Apple Inc. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY APPLE INC. ``AS IS'' AND ANY
+ * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL APPLE INC. OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+ * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+ * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */
+
+function printHeader()
+{
+    var copyright = (
+                     "/*"                                                                            + "\n" +
+                     " * Copyright (C) 2012-2013, 2015-2016 Apple Inc. All rights reserved."         + "\n" +
+                     " *"                                                                            + "\n" +
+                     " * Redistribution and use in source and binary forms, with or without"         + "\n" +
+                     " * modification, are permitted provided that the following conditions"         + "\n" +
+                     " * are met:"                                                                   + "\n" +
+                     " * 1. Redistributions of source code must retain the above copyright"          + "\n" +
+                     " *    notice, this list of conditions and the following disclaimer."           + "\n" +
+                     " * 2. Redistributions in binary form must reproduce the above copyright"       + "\n" +
+                     " *    notice, this list of conditions and the following disclaimer in the"     + "\n" +
+                     " *    documentation and/or other materials provided with the distribution."    + "\n" +
+                     " *"                                                                            + "\n" +
+                     " * THIS SOFTWARE IS PROVIDED BY APPLE INC. ``AS IS'' AND ANY"                  + "\n" +
+                     " * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE"          + "\n" +
+                     " * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR"         + "\n" +
+                     " * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL APPLE INC. OR"                   + "\n" +
+                     " * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,"      + "\n" +
+                     " * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,"        + "\n" +
+                     " * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR"         + "\n" +
+                     " * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY"        + "\n" +
+                     " * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT"               + "\n" +
+                     " * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE"      + "\n" +
+                     " * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. "      + "\n" +
+                     " */");
+    
+    print(copyright);
+    print();
+    print("// DO NOT EDIT! - this file autogenerated by YarrCanonicalizeUnicode.js");
+    print();
+    print('#include "config.h"');
+    print('#include "YarrCanonicalizeUnicode.h"');
+    print();
+    print("namespace JSC { namespace Yarr {");
+    print();
+    print("#include <stdint.h>");
+    print();
+}
+
+function printFooter()
+{
+    print("} } // JSC::Yarr");
+    print();
+}
+
+// Helper function to convert a number to a fixed width hex representation of a UChar32.
+function hex(x)
+{
+    var s = Number(x).toString(16);
+    while (s.length < 4)
+        s = 0 + s;
+    return "0x" + s;
+}
+
+// See ES 6.0, 21.2.2.8.2 Step 3
+function canonicalize(ch)
+{
+    var u = String.fromCharCode(ch).toUpperCase();
+    if (u.length > 1)
+        return ch;
+    var cu = u.charCodeAt(0);
+    if (ch >= 128 && cu < 128)
+        return ch;
+    return cu;
+}
+
+// See ES 6.0, 21.2.2.8.2 Step 2
+function canonicalizeUnicode(ch)
+{
+    if (ch < 128)
+        return canonicalize(ch);
+
+    return String.fromCodePoint(ch).toUpperCase().codePointAt(0);
+}
+
+var MAX_UCS2 = 0xFFFF;
+var MAX_UNICODE = 0x10FFFF;
+
+function createUCS2CanonicalGroups()
+{
+    var groupedCanonically = [];
+    // Pass 1: populate groupedCanonically - this is a mapping from canonicalized
+    // values back to the set of character codes that canonicalize to them.
+    for (var i = 0; i <= MAX_UCS2; ++i) {
+        var ch = canonicalize(i);
+        if (!groupedCanonically[ch])
+            groupedCanonically[ch] = [];
+        groupedCanonically[ch].push(i);
+    }
+
+    return groupedCanonically;
+}
+
+function createUnicodeCanonicalGroups()
+{
+    var groupedCanonically = [];
+    // Pass 1: populate groupedCanonically - this is a mapping from canonicalized
+    // values back to the set of character codes that canonicalize to them.
+    for (var i = 0; i <= MAX_UNICODE; ++i) {
+        var ch = canonicalizeUnicode(i);
+        if (!groupedCanonically[ch])
+            groupedCanonically[ch] = [];
+        groupedCanonically[ch].push(i);
+    }
+
+    return groupedCanonically;
+}
+
+function createTables(prefix, maxValue, canonicalGroups)
+{
+    var prefixLower = prefix.toLowerCase();
+    var prefixUpper = prefix.toUpperCase();
+    var typeInfo = [];
+    var characterSetInfo = [];
+    // Pass 2: populate typeInfo & characterSetInfo. For every character calculate
+    // a typeInfo value, described by the types above, and a value payload.
+    for (cu in canonicalGroups) {
+        // The set of characters that canonicalize to cu
+        var characters = canonicalGroups[cu];
+
+        // If there is only one, it is unique.
+        if (characters.length == 1) {
+            typeInfo[characters[0]] = "CanonicalizeUnique:0";
+            continue;
+        }
+
+        // Sort the array.
+        characters.sort(function(x,y){return x-y;});
+
+        // If there are more than two characters, create an entry in characterSetInfo.
+        if (characters.length > 2) {
+            for (i in characters)
+                typeInfo[characters[i]] = "CanonicalizeSet:" + characterSetInfo.length;
+            characterSetInfo.push(characters);
+
+            continue;
+        }
+
+        // We have a pair, mark alternating ranges, otherwise track whether this is the low or high partner.
+        var lo = characters[0];
+        var hi = characters[1];
+        var delta = hi - lo;
+        if (delta == 1) {
+            var type = lo & 1 ? "CanonicalizeAlternatingUnaligned:0" : "CanonicalizeAlternatingAligned:0";
+            typeInfo[lo] = type;
+            typeInfo[hi] = type;
+        } else {
+            typeInfo[lo] = "CanonicalizeRangeLo:" + delta;
+            typeInfo[hi] = "CanonicalizeRangeHi:" + delta;
+        }
+    }
+
+    var rangeInfo = [];
+    // Pass 3: coalesce types into ranges.
+    for (var end = 0; end <= maxValue; ++end) {
+        var begin = end;
+        var type = typeInfo[end];
+        while (end < maxValue && typeInfo[end + 1] == type)
+            ++end;
+        rangeInfo.push({begin:begin, end:end, type:type});
+    }
+
+    for (i in characterSetInfo) {
+        var characters = "";
+        var set = characterSetInfo[i];
+        for (var j in set)
+            characters += hex(set[j]) + ", ";
+        print("const UChar32 " + prefixLower + "CharacterSet" + i + "[] = { " + characters + "0 };");
+    }
+    print();
+    print("static const size_t " + prefixUpper + "_CANONICALIZATION_SETS = " + characterSetInfo.length + ";");
+    print("const UChar32* const " + prefixLower + "CharacterSetInfo[" + prefixUpper + "_CANONICALIZATION_SETS] = {");
+    for (i in characterSetInfo)
+        print("    " + prefixLower + "CharacterSet" + i + ",");
+    print("};");
+    print();
+    print("const size_t " + prefixUpper + "_CANONICALIZATION_RANGES = " + rangeInfo.length + ";");
+    print("const CanonicalizationRange " + prefixLower + "RangeInfo[" + prefixUpper + "_CANONICALIZATION_RANGES] = {");
+    for (i in rangeInfo) {
+        var info = rangeInfo[i];
+        var typeAndValue = info.type.split(':');
+        print("    { " + hex(info.begin) + ", " + hex(info.end) + ", " + hex(typeAndValue[1]) + ", " + typeAndValue[0] + " },");
+    }
+    print("};");
+    print();
+}
+
+printHeader();
+
+createTables("UCS2", MAX_UCS2, createUCS2CanonicalGroups());
+createTables("Unicode", MAX_UNICODE, createUnicodeCanonicalGroups());
+
+printFooter();
+
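For two-character groups, the generator above records only a type and a delta, which is enough for the matcher to compute a character's partner arithmetically. The sketch below illustrates that encoding under the assumption that the four pair types behave as their names and the `lo & 1` test in `createTables` suggest. `PairType` and `partnerOf` are hypothetical names, not the `getCanonicalPair` implementation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mirror of the pair types emitted by createTables().
enum class PairType { RangeLo, RangeHi, AlternatingAligned, AlternatingUnaligned };

// Given one member of a two-character group, return the other member.
static char32_t partnerOf(PairType type, char32_t delta, char32_t ch)
{
    switch (type) {
    case PairType::RangeLo:
        return ch + delta; // ch is the low partner; the high one is delta above it.
    case PairType::RangeHi:
        return ch - delta; // ch is the high partner; the low one is delta below it.
    case PairType::AlternatingAligned:
        return ch ^ 1; // Pairs start on even code points: (2n, 2n + 1).
    case PairType::AlternatingUnaligned:
        return ((ch - 1) ^ 1) + 1; // Pairs start on odd code points: (2n + 1, 2n + 2).
    }
    return ch;
}
```

For example, `partnerOf(PairType::RangeLo, 0x20, 0x41)` maps `A` to `a`, and the alternating types cover runs such as U+0100/U+0101 where upper and lower case forms are adjacent, so one range entry can describe many pairs.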
index 99b7315..1eaed95 100644 (file)
@@ -1,5 +1,5 @@
 /*
- * Copyright (C) 2009 Apple Inc. All rights reserved.
+ * Copyright (C) 2009, 2013, 2016 Apple Inc. All rights reserved.
  * Copyright (C) 2010 Peter Varga (pvarga@inf.u-szeged.hu), University of Szeged
  *
  * Redistribution and use in source and binary forms, with or without
@@ -28,7 +28,7 @@
 #include "YarrInterpreter.h"
 
 #include "Yarr.h"
-#include "YarrCanonicalizeUCS2.h"
+#include "YarrCanonicalizeUnicode.h"
 #include <wtf/BumpPointerAllocator.h>
 #include <wtf/DataLog.h>
 #include <wtf/text/CString.h>
@@ -44,9 +44,11 @@ public:
     struct ParenthesesDisjunctionContext;
 
     struct BackTrackInfoPatternCharacter {
+        uintptr_t begin; // Only needed for unicode patterns
         uintptr_t matchAmount;
     };
     struct BackTrackInfoCharacterClass {
+        uintptr_t begin; // Only needed for unicode patterns
         uintptr_t matchAmount;
     };
     struct BackTrackInfoBackReference {
@@ -167,10 +169,11 @@ public:
 
     class InputStream {
     public:
-        InputStream(const CharType* input, unsigned start, unsigned length)
+        InputStream(const CharType* input, unsigned start, unsigned length, bool decodeSurrogatePairs)
             : input(input)
             , pos(start)
             , length(length)
+            , decodeSurrogatePairs(decodeSurrogatePairs)
         {
         }
 
@@ -204,13 +207,43 @@ public:
             RELEASE_ASSERT(pos >= negativePositionOffest);
             unsigned p = pos - negativePositionOffest;
             ASSERT(p < length);
-            return input[p];
+            int result = input[p];
+            if (U16_IS_LEAD(result) && decodeSurrogatePairs && p + 1 < length
+                && U16_IS_TRAIL(input[p + 1])) {
+                if (atEnd())
+                    return -1;
+                
+                result = U16_GET_SUPPLEMENTARY(result, input[p + 1]);
+                next();
+            }
+            return result;
+        }
+        
+        int readSurrogatePairChecked(unsigned negativePositionOffest)
+        {
+            RELEASE_ASSERT(pos >= negativePositionOffest);
+            unsigned p = pos - negativePositionOffest;
+            ASSERT(p < length);
+            if (p + 1 >= length)
+                return -1;
+
+            int first = input[p];
+            if (U16_IS_LEAD(first) && U16_IS_TRAIL(input[p + 1]))
+                return U16_GET_SUPPLEMENTARY(first, input[p + 1]);
+
+            return -1;
         }
 
         int reread(unsigned from)
         {
             ASSERT(from < length);
-            return input[from];
+            int result = input[from];
+            if (U16_IS_LEAD(result) && decodeSurrogatePairs && from + 1 < length
+                && U16_IS_TRAIL(input[from + 1])) {
+                
+                result = U16_GET_SUPPLEMENTARY(result, input[from + 1]);
+            }
+            return result;
         }
 
         int prev()
@@ -281,11 +314,12 @@ public:
         const CharType* input;
         unsigned pos;
         unsigned length;
+        bool decodeSurrogatePairs;
     };
 
     bool testCharacterClass(CharacterClass* characterClass, int ch)
     {
-        if (ch & 0xFF80) {
+        if (ch & 0x1FFF80) {
             for (unsigned i = 0; i < characterClass->m_matchesUnicode.size(); ++i)
                 if (ch == characterClass->m_matchesUnicode[i])
                     return true;
@@ -309,6 +343,11 @@ public:
         return testChar == input.readChecked(negativeInputOffset);
     }
 
+    bool checkSurrogatePair(int testUnicodeChar, unsigned negativeInputOffset)
+    {
+        return testUnicodeChar == input.readSurrogatePairChecked(negativeInputOffset);
+    }
+
     bool checkCasedCharacter(int loChar, int hiChar, unsigned negativeInputOffset)
     {
         int ch = input.readChecked(negativeInputOffset);
@@ -328,32 +367,30 @@ public:
         if (!input.checkInput(matchSize))
             return false;
 
-        if (pattern->m_ignoreCase) {
-            for (unsigned i = 0; i < matchSize; ++i) {
-                int oldCh = input.reread(matchBegin + i);
-                int ch = input.readChecked(negativeInputOffset + matchSize - i);
+        for (unsigned i = 0; i < matchSize; ++i) {
+            int oldCh = input.reread(matchBegin + i);
+            int ch;
+            if (!U_IS_BMP(oldCh)) {
+                ch = input.readSurrogatePairChecked(negativeInputOffset + matchSize - i);
+                ++i;
+            } else
+                ch = input.readChecked(negativeInputOffset + matchSize - i);
 
-                if (oldCh == ch)
-                    continue;
+            if (oldCh == ch)
+                continue;
 
-                // The definition for canonicalize (see ES 5.1, 15.10.2.8) means that
+            if (pattern->m_ignoreCase) {
+                // The definition for canonicalize (see ES 6.0, 21.2.2.8.2) means that
                 // unicode values are never allowed to match against ascii ones.
                 if (isASCII(oldCh) || isASCII(ch)) {
                     if (toASCIIUpper(oldCh) == toASCIIUpper(ch))
                         continue;
-                } else if (areCanonicallyEquivalent(oldCh, ch))
+                } else if (areCanonicallyEquivalent(oldCh, ch, unicode ? CanonicalMode::Unicode : CanonicalMode::UCS2))
                     continue;
-
-                input.uncheckInput(matchSize);
-                return false;
-            }
-        } else {
-            for (unsigned i = 0; i < matchSize; ++i) {
-                if (!checkCharacter(input.reread(matchBegin + i), negativeInputOffset + matchSize - i)) {
-                    input.uncheckInput(matchSize);
-                    return false;
-                }
             }
+
+            input.uncheckInput(matchSize);
+            return false;
         }
 
         return true;
@@ -396,7 +433,10 @@ public:
         case QuantifierGreedy:
             if (backTrack->matchAmount) {
                 --backTrack->matchAmount;
-                input.uncheckInput(1);
+                if (unicode && !U_IS_BMP(term.atom.patternCharacter))
+                    input.uncheckInput(2);
+                else
+                    input.uncheckInput(1);
                 return true;
             }
             break;
@@ -407,7 +447,7 @@ public:
                 if (checkCharacter(term.atom.patternCharacter, term.inputPosition + 1))
                     return true;
             }
-            input.uncheckInput(backTrack->matchAmount);
+            input.setPos(backTrack->begin);
             break;
         }
 
@@ -446,10 +486,23 @@ public:
     bool matchCharacterClass(ByteTerm& term, DisjunctionContext* context)
     {
         ASSERT(term.type == ByteTerm::TypeCharacterClass);
-        BackTrackInfoPatternCharacter* backTrack = reinterpret_cast<BackTrackInfoPatternCharacter*>(context->frame + term.frameLocation);
+        BackTrackInfoCharacterClass* backTrack = reinterpret_cast<BackTrackInfoCharacterClass*>(context->frame + term.frameLocation);
 
         switch (term.atom.quantityType) {
         case QuantifierFixedCount: {
+            if (unicode) {
+                backTrack->begin = input.getPos();
+                unsigned matchAmount = 0;
+                for (matchAmount = 0; matchAmount < term.atom.quantityCount; ++matchAmount) {
+                    if (!checkCharacterClass(term.atom.characterClass, term.invert(), term.inputPosition - matchAmount)) {
+                        input.setPos(backTrack->begin);
+                        return false;
+                    }
+                }
+
+                return true;
+            }
+
             for (unsigned matchAmount = 0; matchAmount < term.atom.quantityCount; ++matchAmount) {
                 if (!checkCharacterClass(term.atom.characterClass, term.invert(), term.inputPosition - matchAmount))
                     return false;
@@ -458,6 +511,7 @@ public:
         }
 
         case QuantifierGreedy: {
+            backTrack->begin = input.getPos();
             unsigned matchAmount = 0;
             while ((matchAmount < term.atom.quantityCount) && input.checkInput(1)) {
                 if (!checkCharacterClass(term.atom.characterClass, term.invert(), term.inputPosition + 1)) {
@@ -472,6 +526,7 @@ public:
         }
 
         case QuantifierNonGreedy:
+            backTrack->begin = input.getPos();
             backTrack->matchAmount = 0;
             return true;
         }
@@ -483,14 +538,28 @@ public:
     bool backtrackCharacterClass(ByteTerm& term, DisjunctionContext* context)
     {
         ASSERT(term.type == ByteTerm::TypeCharacterClass);
-        BackTrackInfoPatternCharacter* backTrack = reinterpret_cast<BackTrackInfoPatternCharacter*>(context->frame + term.frameLocation);
+        BackTrackInfoCharacterClass* backTrack = reinterpret_cast<BackTrackInfoCharacterClass*>(context->frame + term.frameLocation);
 
         switch (term.atom.quantityType) {
         case QuantifierFixedCount:
+            if (unicode)
+                input.setPos(backTrack->begin);
             break;
 
         case QuantifierGreedy:
             if (backTrack->matchAmount) {
+                if (unicode) {
+                    // Rematch one less match
+                    input.setPos(backTrack->begin);
+                    --backTrack->matchAmount;
+                    for (unsigned matchAmount = 0; (matchAmount < backTrack->matchAmount) && input.checkInput(1); ++matchAmount) {
+                        if (!checkCharacterClass(term.atom.characterClass, term.invert(), term.inputPosition + 1)) {
+                            input.uncheckInput(1);
+                            break;
+                        }
+                    }
+                    return true;
+                }
                 --backTrack->matchAmount;
                 input.uncheckInput(1);
                 return true;
@@ -503,7 +572,7 @@ public:
                 if (checkCharacterClass(term.atom.characterClass, term.invert(), term.inputPosition + 1))
                     return true;
             }
-            input.uncheckInput(backTrack->matchAmount);
+            input.setPos(backTrack->begin);
             break;
         }
 
@@ -773,7 +842,7 @@ public:
         if (backTrack->begin == input.getPos())
             return false;
 
-        // Successful match! Okay, what's next? - loop around and try to match moar!
+        // Successful match! Okay, what's next? - loop around and try to match more!
         context->term -= (term.atom.parenthesesWidth + 1);
         return true;
     }
@@ -1154,9 +1223,23 @@ public:
 
         case ByteTerm::TypePatternCharacterOnce:
         case ByteTerm::TypePatternCharacterFixed: {
+            if (unicode) {
+                if (!U_IS_BMP(currentTerm().atom.patternCharacter)) {
+                    for (unsigned matchAmount = 0; matchAmount < currentTerm().atom.quantityCount; ++matchAmount) {
+                        if (!checkSurrogatePair(currentTerm().atom.patternCharacter, currentTerm().inputPosition - matchAmount)) {
+                            BACKTRACK();
+                        }
+                    }
+                    MATCH_NEXT();
+                }
+            }
+            unsigned position = input.getPos(); // May need to back out reading a surrogate pair.
+
             for (unsigned matchAmount = 0; matchAmount < currentTerm().atom.quantityCount; ++matchAmount) {
-                if (!checkCharacter(currentTerm().atom.patternCharacter, currentTerm().inputPosition - matchAmount))
+                if (!checkCharacter(currentTerm().atom.patternCharacter, currentTerm().inputPosition - matchAmount)) {
+                    input.setPos(position);
                     BACKTRACK();
+                }
             }
             MATCH_NEXT();
         }
@@ -1176,12 +1259,28 @@ public:
         }
         case ByteTerm::TypePatternCharacterNonGreedy: {
             BackTrackInfoPatternCharacter* backTrack = reinterpret_cast<BackTrackInfoPatternCharacter*>(context->frame + currentTerm().frameLocation);
+            backTrack->begin = input.getPos();
             backTrack->matchAmount = 0;
             MATCH_NEXT();
         }
 
         case ByteTerm::TypePatternCasedCharacterOnce:
         case ByteTerm::TypePatternCasedCharacterFixed: {
+            if (unicode) {
+                // Case insensitive matching of unicode characters is handled as TypeCharacterClass
+                ASSERT(U_IS_BMP(currentTerm().atom.patternCharacter));
+
+                unsigned position = input.getPos(); // May need to back out reading a surrogate pair.
+                
+                for (unsigned matchAmount = 0; matchAmount < currentTerm().atom.quantityCount; ++matchAmount) {
+                    if (!checkCasedCharacter(currentTerm().atom.casedCharacter.lo, currentTerm().atom.casedCharacter.hi, currentTerm().inputPosition - matchAmount)) {
+                        input.setPos(position);
+                        BACKTRACK();
+                    }
+                }
+                MATCH_NEXT();
+            }
+
             for (unsigned matchAmount = 0; matchAmount < currentTerm().atom.quantityCount; ++matchAmount) {
                 if (!checkCasedCharacter(currentTerm().atom.casedCharacter.lo, currentTerm().atom.casedCharacter.hi, currentTerm().inputPosition - matchAmount))
                     BACKTRACK();
@@ -1190,6 +1289,10 @@ public:
         }
         case ByteTerm::TypePatternCasedCharacterGreedy: {
             BackTrackInfoPatternCharacter* backTrack = reinterpret_cast<BackTrackInfoPatternCharacter*>(context->frame + currentTerm().frameLocation);
+
+            // Case insensitive matching of unicode characters is handled as TypeCharacterClass
+            ASSERT(!unicode || U_IS_BMP(currentTerm().atom.patternCharacter));
+
             unsigned matchAmount = 0;
             while ((matchAmount < currentTerm().atom.quantityCount) && input.checkInput(1)) {
                 if (!checkCasedCharacter(currentTerm().atom.casedCharacter.lo, currentTerm().atom.casedCharacter.hi, currentTerm().inputPosition + 1)) {
@@ -1204,6 +1307,10 @@ public:
         }
         case ByteTerm::TypePatternCasedCharacterNonGreedy: {
             BackTrackInfoPatternCharacter* backTrack = reinterpret_cast<BackTrackInfoPatternCharacter*>(context->frame + currentTerm().frameLocation);
+
+            // Case insensitive matching of unicode characters is handled as TypeCharacterClass
+            ASSERT(!unicode || U_IS_BMP(currentTerm().atom.patternCharacter));
+            
             backTrack->matchAmount = 0;
             MATCH_NEXT();
         }
@@ -1439,8 +1546,9 @@ public:
 
     Interpreter(BytecodePattern* pattern, unsigned* output, const CharType* input, unsigned length, unsigned start)
         : pattern(pattern)
+        , unicode(pattern->m_unicode)
         , output(output)
-        , input(input, start, length)
+        , input(input, start, length, pattern->m_unicode)
         , allocatorPool(0)
         , remainingMatchCount(matchLimit)
     {
@@ -1448,6 +1556,7 @@ public:
 
 private:
     BytecodePattern* pattern;
+    bool unicode;
     unsigned* output;
     InputStream input;
     BumpPointerPool* allocatorPool;
@@ -1506,14 +1615,14 @@ public:
         m_bodyDisjunction->terms.append(ByteTerm::WordBoundary(invert, inputPosition));
     }
 
-    void atomPatternCharacter(UChar ch, unsigned inputPosition, unsigned frameLocation, Checked<unsigned> quantityCount, QuantifierType quantityType)
+    void atomPatternCharacter(UChar32 ch, unsigned inputPosition, unsigned frameLocation, Checked<unsigned> quantityCount, QuantifierType quantityType)
     {
         if (m_pattern.m_ignoreCase) {
-            ASSERT(u_tolower(ch) <= 0xFFFF);
-            ASSERT(u_toupper(ch) <= 0xFFFF);
+            ASSERT(u_tolower(ch) <= UCHAR_MAX_VALUE);
+            ASSERT(u_toupper(ch) <= UCHAR_MAX_VALUE);
 
-            UChar lo = u_tolower(ch);
-            UChar hi = u_toupper(ch);
+            UChar32 lo = u_tolower(ch);
+            UChar32 hi = u_toupper(ch);
 
             if (lo != hi) {
                 m_bodyDisjunction->terms.append(ByteTerm(lo, hi, inputPosition, frameLocation, quantityCount, quantityType));
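The interpreter changes above hinge on one idea: in unicode mode a single pattern character may consume two UTF-16 code units, so reads must decode surrogate pairs and backtracking must restore a saved position rather than subtract a match count. The sketch below shows the decoding step on its own, without ICU. `Decoded`, `decodeAt` and `sample` are hypothetical names used for illustration, and the bit arithmetic is a plain restatement of what `U16_IS_LEAD`, `U16_IS_TRAIL` and `U16_GET_SUPPLEMENTARY` compute.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

static bool isLead(char16_t c) { return (c & 0xFC00) == 0xD800; }
static bool isTrail(char16_t c) { return (c & 0xFC00) == 0xDC00; }

struct Decoded {
    char32_t codePoint;
    std::size_t units; // Number of UTF-16 code units consumed (1 or 2).
};

// Decode the character at input[pos]. A lead surrogate followed by a trail
// surrogate is combined only when decodeSurrogatePairs is set; otherwise,
// and for unpaired surrogates, the single code unit is returned as is.
static Decoded decodeAt(const char16_t* input, std::size_t length, std::size_t pos, bool decodeSurrogatePairs)
{
    Decoded result;
    result.codePoint = input[pos];
    result.units = 1;
    if (decodeSurrogatePairs && isLead(input[pos]) && pos + 1 < length && isTrail(input[pos + 1])) {
        char32_t high = static_cast<char32_t>(input[pos]) - 0xD800;
        char32_t low = static_cast<char32_t>(input[pos + 1]) - 0xDC00;
        result.codePoint = 0x10000 + (high << 10) + low;
        result.units = 2;
    }
    return result;
}

// "a", U+1F600 encoded as the pair D83D DE00, then "b".
static const char16_t sample[] = { u'a', 0xD83D, 0xDE00, u'b' };
```

With decoding enabled, position 1 of `sample` yields U+1F600 and consumes two units. With it disabled, the same position yields the lone lead surrogate 0xD83D. The variable width is why a count of matched characters no longer tells the matcher how far to rewind.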
index dc2f3f7..3a5bc28 100644 (file)
@@ -1,5 +1,5 @@
 /*
- * Copyright (C) 2009, 2010 Apple Inc. All rights reserved.
+ * Copyright (C) 2009, 2010-2012, 2014, 2016 Apple Inc. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
@@ -74,10 +74,10 @@ struct ByteTerm {
     union {
         struct {
             union {
-                UChar patternCharacter;
+                UChar32 patternCharacter;
                 struct {
-                    UChar lo;
-                    UChar hi;
+                    UChar32 lo;
+                    UChar32 hi;
                 } casedCharacter;
                 CharacterClass* characterClass;
                 unsigned subpatternId;
@@ -105,7 +105,7 @@ struct ByteTerm {
     bool m_invert : 1;
     unsigned inputPosition;
 
-    ByteTerm(UChar ch, int inputPos, unsigned frameLocation, Checked<unsigned> quantityCount, QuantifierType quantityType)
+    ByteTerm(UChar32 ch, int inputPos, unsigned frameLocation, Checked<unsigned> quantityCount, QuantifierType quantityType)
         : frameLocation(frameLocation)
         , m_capture(false)
         , m_invert(false)
@@ -128,7 +128,7 @@ struct ByteTerm {
         inputPosition = inputPos;
     }
 
-    ByteTerm(UChar lo, UChar hi, int inputPos, unsigned frameLocation, Checked<unsigned> quantityCount, QuantifierType quantityType)
+    ByteTerm(UChar32 lo, UChar32 hi, int inputPos, unsigned frameLocation, Checked<unsigned> quantityCount, QuantifierType quantityType)
         : frameLocation(frameLocation)
         , m_capture(false)
         , m_invert(false)
@@ -341,6 +341,7 @@ public:
         : m_body(WTFMove(body))
         , m_ignoreCase(pattern.m_ignoreCase)
         , m_multiline(pattern.m_multiline)
+        , m_unicode(pattern.m_unicode)
         , m_allocator(allocator)
     {
         m_body->terms.shrinkToFit();
@@ -360,6 +361,7 @@ public:
     std::unique_ptr<ByteDisjunction> m_body;
     bool m_ignoreCase;
     bool m_multiline;
+    bool m_unicode;
     // Each BytecodePattern is associated with a RegExp, each RegExp is associated
     // with a VM.  Cache a pointer to our VM's m_regExpAllocator.
     BumpPointerAllocator* m_allocator;
index d600781..92f6d7c 100644 (file)
@@ -1,5 +1,5 @@
 /*
- * Copyright (C) 2009, 2013 Apple Inc. All rights reserved.
+ * Copyright (C) 2009, 2013, 2015-2016 Apple Inc. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
@@ -30,7 +30,7 @@
 #include "LinkBuffer.h"
 #include "Options.h"
 #include "Yarr.h"
-#include "YarrCanonicalizeUCS2.h"
+#include "YarrCanonicalizeUnicode.h"
 
 #if ENABLE(YARR_JIT)
 
@@ -140,7 +140,7 @@ class YarrGenerator : private MacroAssembler {
         }
     }
 
-    void matchCharacterClassRange(RegisterID character, JumpList& failures, JumpList& matchDest, const CharacterRange* ranges, unsigned count, unsigned* matchIndex, const UChar* matches, unsigned matchCount)
+    void matchCharacterClassRange(RegisterID character, JumpList& failures, JumpList& matchDest, const CharacterRange* ranges, unsigned count, unsigned* matchIndex, const UChar32* matches, unsigned matchCount)
     {
         do {
             // pick which range we're going to generate
@@ -200,15 +200,15 @@ class YarrGenerator : private MacroAssembler {
 
             if (charClass->m_matchesUnicode.size()) {
                 for (unsigned i = 0; i < charClass->m_matchesUnicode.size(); ++i) {
-                    UChar ch = charClass->m_matchesUnicode[i];
+                    UChar32 ch = charClass->m_matchesUnicode[i];
                     matchDest.append(branch32(Equal, character, Imm32(ch)));
                 }
             }
 
             if (charClass->m_rangesUnicode.size()) {
                 for (unsigned i = 0; i < charClass->m_rangesUnicode.size(); ++i) {
-                    UChar lo = charClass->m_rangesUnicode[i].begin;
-                    UChar hi = charClass->m_rangesUnicode[i].end;
+                    UChar32 lo = charClass->m_rangesUnicode[i].begin;
+                    UChar32 hi = charClass->m_rangesUnicode[i].end;
 
                     Jump below = branch32(LessThan, character, Imm32(lo));
                     matchDest.append(branch32(LessThanOrEqual, character, Imm32(hi)));
@@ -285,7 +285,7 @@ class YarrGenerator : private MacroAssembler {
         return branch32(NotEqual, index, length);
     }
 
-    Jump jumpIfCharNotEquals(UChar ch, int inputPosition, RegisterID character)
+    Jump jumpIfCharNotEquals(UChar32 ch, int inputPosition, RegisterID character)
     {
         readCharacter(inputPosition, character);
 
@@ -766,7 +766,7 @@ class YarrGenerator : private MacroAssembler {
         YarrOp* nextOp = &m_ops[opIndex + 1];
 
         PatternTerm* term = op.m_term;
-        UChar ch = term->patternCharacter;
+        UChar32 ch = term->patternCharacter;
 
         if ((ch > 0xff) && (m_charSize == Char8)) {
             // Have a 16 bit pattern character and an 8 bit string - short circuit
@@ -813,7 +813,7 @@ class YarrGenerator : private MacroAssembler {
             int shiftAmount = (m_charSize == Char8 ? 8 : 16) * numberCharacters;
 #endif
 
-            UChar currentCharacter = nextTerm->patternCharacter;
+            UChar32 currentCharacter = nextTerm->patternCharacter;
 
             if ((currentCharacter > 0xff) && (m_charSize == Char8)) {
                 // Have a 16 bit pattern character and an 8 bit string - short circuit
@@ -882,7 +882,7 @@ class YarrGenerator : private MacroAssembler {
     {
         YarrOp& op = m_ops[opIndex];
         PatternTerm* term = op.m_term;
-        UChar ch = term->patternCharacter;
+        UChar32 ch = term->patternCharacter;
 
         const RegisterID character = regT0;
         const RegisterID countRegister = regT1;
@@ -919,7 +919,7 @@ class YarrGenerator : private MacroAssembler {
     {
         YarrOp& op = m_ops[opIndex];
         PatternTerm* term = op.m_term;
-        UChar ch = term->patternCharacter;
+        UChar32 ch = term->patternCharacter;
 
         const RegisterID character = regT0;
         const RegisterID countRegister = regT1;
@@ -977,7 +977,7 @@ class YarrGenerator : private MacroAssembler {
     {
         YarrOp& op = m_ops[opIndex];
         PatternTerm* term = op.m_term;
-        UChar ch = term->patternCharacter;
+        UChar32 ch = term->patternCharacter;
 
         const RegisterID character = regT0;
         const RegisterID countRegister = regT1;
index 761acb5..51d5ef3 100644 (file)
@@ -1,5 +1,5 @@
 /*
- * Copyright (C) 2009 Apple Inc. All rights reserved.
+ * Copyright (C) 2009, 2014-2016 Apple Inc. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
@@ -46,7 +46,7 @@ template<class Delegate, typename CharType>
 class Parser {
 private:
     template<class FriendDelegate>
-    friend const char* parse(FriendDelegate&, const String& pattern, unsigned backReferenceLimit);
+    friend const char* parse(FriendDelegate&, const String& pattern, bool isUnicode, unsigned backReferenceLimit);
 
     enum ErrorCode {
         NoError,
@@ -60,6 +60,7 @@ private:
         CharacterClassUnmatched,
         CharacterClassOutOfOrder,
         EscapeUnterminated,
+        InvalidUnicodeEscape,
         NumberOfErrorCodes
     };
 
@@ -101,7 +102,7 @@ private:
          * mode we will allow a hyphen to be treated as indicating a range (i.e. /[a-z]/
          * is different to /[a\-z]/).
          */
-        void atomPatternCharacter(UChar ch, bool hyphenIsRange = false)
+        void atomPatternCharacter(UChar32 ch, bool hyphenIsRange = false)
         {
             switch (m_state) {
             case AfterCharacterClass:
@@ -225,16 +226,17 @@ private:
             AfterCharacterClass,
             AfterCharacterClassHyphen,
         } m_state;
-        UChar m_character;
+        UChar32 m_character;
     };
 
-    Parser(Delegate& delegate, const String& pattern, unsigned backReferenceLimit)
+    Parser(Delegate& delegate, const String& pattern, bool isUnicode, unsigned backReferenceLimit)
         : m_delegate(delegate)
         , m_backReferenceLimit(backReferenceLimit)
         , m_err(NoError)
         , m_data(pattern.characters<CharType>())
         , m_size(pattern.length())
         , m_index(0)
+        , m_isUnicode(isUnicode)
         , m_parenthesesNestingDepth(0)
     {
     }
@@ -411,11 +413,55 @@ private:
         // UnicodeEscape
         case 'u': {
             consume();
+            if (atEndOfPattern()) {
+                delegate.atomPatternCharacter('u');
+                break;
+            }
+
+            if (peek() == '{') {
+                consume();
+                UChar32 codePoint = 0;
+                do {
+                    if (atEndOfPattern() || !WTF::isASCIIHexDigit(peek())) {
+                        m_err = InvalidUnicodeEscape;
+                        break;
+                    }
+
+                    codePoint = (codePoint << 4) | WTF::toASCIIHexValue(consume());
+
+                    if (codePoint > UCHAR_MAX_VALUE)
+                        m_err = InvalidUnicodeEscape;
+                } while (!atEndOfPattern() && peek() != '}');
+                if (!atEndOfPattern())
+                    consume();
+                if (m_err)
+                    return false;
+
+                delegate.atomPatternCharacter(codePoint);
+                break;
+            }
             int u = tryConsumeHex(4);
             if (u == -1)
                 delegate.atomPatternCharacter('u');
-            else
+            else {
+                // If we have the first of a surrogate pair, look for the second.
+                if (U16_IS_LEAD(u) && m_isUnicode && (patternRemaining() >= 6) && peek() == '\\') {
+                    ParseState state = saveState();
+                    consume();
+                    
+                    if (tryConsume('u')) {
+                        int surrogate2 = tryConsumeHex(4);
+                        if (U16_IS_TRAIL(surrogate2)) {
+                            u = U16_GET_SUPPLEMENTARY(u, surrogate2);
+                            delegate.atomPatternCharacter(u);
+                            break;
+                        }
+                    }
+
+                    restoreState(state);
+                }
                 delegate.atomPatternCharacter(u);
+            }
             break;
         }
 
@@ -427,6 +473,22 @@ private:
         return true;
     }
 
+    UChar32 consumePossibleSurrogatePair()
+    {
+        UChar32 ch = consume();
+        if (U16_IS_LEAD(ch) && m_isUnicode && (patternRemaining() > 0)) {
+            ParseState state = saveState();
+
+            UChar32 surrogate2 = consume();
+            if (U16_IS_TRAIL(surrogate2))
+                ch = U16_GET_SUPPLEMENTARY(ch, surrogate2);
+            else
+                restoreState(state);
+        }
+
+        return ch;
+    }
+
     /*
      * parseAtomEscape(), parseCharacterClassEscape():
      *
@@ -470,7 +532,7 @@ private:
                 break;
 
             default:
-                characterClassConstructor.atomPatternCharacter(consume(), true);
+                characterClassConstructor.atomPatternCharacter(consumePossibleSurrogatePair(), true);
             }
 
             if (m_err)
@@ -662,7 +724,7 @@ private:
             FALLTHROUGH;
 
             default:
-                m_delegate.atomPatternCharacter(consume());
+                m_delegate.atomPatternCharacter(consumePossibleSurrogatePair());
                 lastTokenWasAnAtom = true;
             }
 
@@ -701,6 +763,7 @@ private:
             REGEXP_ERROR_PREFIX "missing terminating ] for character class",
             REGEXP_ERROR_PREFIX "range out of order in character class",
-            REGEXP_ERROR_PREFIX "\\ at end of pattern"
+            REGEXP_ERROR_PREFIX "\\ at end of pattern",
+            REGEXP_ERROR_PREFIX "invalid unicode {} escape"
         };
 
         return errorMessages[m_err];
@@ -726,6 +789,12 @@ private:
         return m_index == m_size;
     }
 
+    unsigned patternRemaining()
+    {
+        ASSERT(m_index <= m_size);
+        return m_size - m_index;
+    }
+
     int peek()
     {
         ASSERT(m_index < m_size);
@@ -805,6 +874,7 @@ private:
     const CharType* m_data;
     unsigned m_size;
     unsigned m_index;
+    bool m_isUnicode;
     unsigned m_parenthesesNestingDepth;
 
     // Derived by empirical testing of compile time in PCRE and WREC.
@@ -825,11 +895,11 @@ private:
  *    void assertionEOL();
  *    void assertionWordBoundary(bool invert);
  *
- *    void atomPatternCharacter(UChar ch);
+ *    void atomPatternCharacter(UChar32 ch);
  *    void atomBuiltInCharacterClass(BuiltInCharacterClassID classID, bool invert);
  *    void atomCharacterClassBegin(bool invert)
- *    void atomCharacterClassAtom(UChar ch)
- *    void atomCharacterClassRange(UChar begin, UChar end)
+ *    void atomCharacterClassAtom(UChar32 ch)
+ *    void atomCharacterClassRange(UChar32 begin, UChar32 end)
  *    void atomCharacterClassBuiltIn(BuiltInCharacterClassID classID, bool invert)
  *    void atomCharacterClassEnd()
  *    void atomParenthesesSubpatternBegin(bool capture = true);
@@ -871,11 +941,11 @@ private:
  */
 
 template<class Delegate>
-const char* parse(Delegate& delegate, const String& pattern, unsigned backReferenceLimit = quantifyInfinite)
+const char* parse(Delegate& delegate, const String& pattern, bool isUnicode, unsigned backReferenceLimit = quantifyInfinite)
 {
     if (pattern.is8Bit())
-        return Parser<Delegate, LChar>(delegate, pattern, backReferenceLimit).parse();
-    return Parser<Delegate, UChar>(delegate, pattern, backReferenceLimit).parse();
+        return Parser<Delegate, LChar>(delegate, pattern, isUnicode, backReferenceLimit).parse();
+    return Parser<Delegate, UChar>(delegate, pattern, isUnicode, backReferenceLimit).parse();
 }
 
 } } // namespace JSC::Yarr
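
Note on the parser changes above: with the 'u' flag, a surrogate pair in the pattern and a `\u{...}` escape are each decoded into one code point before the delegate sees them. The following is a minimal standalone sketch of that decoding, not the Yarr code itself. It assumes a UTF-16 pattern held in a `std::u16string`. The names `isLeadSurrogate`, `isTrailSurrogate`, `combineSurrogates`, `readCodePoint`, `hexDigitValue` and `parseBracedEscape` are hypothetical; the first three stand in for ICU's `U16_IS_LEAD`, `U16_IS_TRAIL` and `U16_GET_SUPPLEMENTARY`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-ins for ICU's U16_IS_LEAD, U16_IS_TRAIL and U16_GET_SUPPLEMENTARY.
inline bool isLeadSurrogate(char32_t c) { return (c & 0xFFFFFC00u) == 0xD800u; }
inline bool isTrailSurrogate(char32_t c) { return (c & 0xFFFFFC00u) == 0xDC00u; }
inline char32_t combineSurrogates(char32_t lead, char32_t trail)
{
    return 0x10000u + ((lead - 0xD800u) << 10) + (trail - 0xDC00u);
}

// Reads one pattern character at `index` and advances past it. In unicode mode a
// lead surrogate followed by a trail surrogate is returned as one code point.
char32_t readCodePoint(const std::u16string& pattern, std::size_t& index, bool isUnicode)
{
    assert(index < pattern.size());
    char32_t ch = pattern[index++];
    if (isUnicode && isLeadSurrogate(ch) && index < pattern.size()
        && isTrailSurrogate(pattern[index]))
        ch = combineSurrogates(ch, pattern[index++]);
    return ch;
}

// Returns the value of an ASCII hex digit, or -1 if `c` is not one.
inline int hexDigitValue(char16_t c)
{
    if (c >= u'0' && c <= u'9')
        return c - u'0';
    if (c >= u'a' && c <= u'f')
        return c - u'a' + 10;
    if (c >= u'A' && c <= u'F')
        return c - u'A' + 10;
    return -1;
}

// Parses the body of "\u{...}" with `index` just past the '{'. On success stores
// the code point, leaves `index` after the '}' and returns true. Returns false for
// a non-hex digit, a value above U+10FFFF, empty braces or a missing '}'.
bool parseBracedEscape(const std::u16string& pattern, std::size_t& index, char32_t& codePoint)
{
    codePoint = 0;
    bool sawDigit = false;
    while (index < pattern.size() && pattern[index] != u'}') {
        int digit = hexDigitValue(pattern[index]);
        if (digit < 0)
            return false;
        codePoint = (codePoint << 4) | static_cast<char32_t>(digit);
        if (codePoint > 0x10FFFF)
            return false;
        sawDigit = true;
        ++index;
    }
    if (!sawDigit || index == pattern.size())
        return false;
    ++index; // Consume the closing '}'.
    return true;
}
```

The sketch checks for end of input before every read, so a truncated escape is reported as an error rather than read past the end of the pattern.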
index 00339b7..68b4f8f 100644 (file)
@@ -1,5 +1,5 @@
 /*
- * Copyright (C) 2009, 2013 Apple Inc. All rights reserved.
+ * Copyright (C) 2009, 2013-2016 Apple Inc. All rights reserved.
  * Copyright (C) 2010 Peter Varga (pvarga@inf.u-szeged.hu), University of Szeged
  *
  * Redistribution and use in source and binary forms, with or without
@@ -28,7 +28,7 @@
 #include "YarrPattern.h"
 
 #include "Yarr.h"
-#include "YarrCanonicalizeUCS2.h"
+#include "YarrCanonicalizeUnicode.h"
 #include "YarrParser.h"
 #include <wtf/Vector.h>
 
@@ -40,8 +40,9 @@ namespace JSC { namespace Yarr {
 
 class CharacterClassConstructor {
 public:
-    CharacterClassConstructor(bool isCaseInsensitive = false)
+    CharacterClassConstructor(bool isCaseInsensitive, CanonicalMode canonicalMode)
         : m_isCaseInsensitive(isCaseInsensitive)
+        , m_canonicalMode(canonicalMode)
     {
     }
     
@@ -65,7 +66,7 @@ public:
             addSortedRange(m_rangesUnicode, other->m_rangesUnicode[i].begin, other->m_rangesUnicode[i].end);
     }
 
-    void putChar(UChar ch)
+    void putChar(UChar32 ch)
     {
         // Handle ascii cases.
         if (ch <= 0x7f) {
@@ -84,33 +85,32 @@ public:
         }
 
         // Add multiple matches, if necessary.
-        const UCS2CanonicalizationRange* info = rangeInfoFor(ch);
+        const CanonicalizationRange* info = canonicalRangeInfoFor(ch, m_canonicalMode);
         if (info->type == CanonicalizeUnique)
             addSorted(m_matchesUnicode, ch);
         else
             putUnicodeIgnoreCase(ch, info);
     }
 
-    void putUnicodeIgnoreCase(UChar ch, const UCS2CanonicalizationRange* info)
+    void putUnicodeIgnoreCase(UChar32 ch, const CanonicalizationRange* info)
     {
         ASSERT(m_isCaseInsensitive);
-        ASSERT(ch > 0x7f);
         ASSERT(ch >= info->begin && ch <= info->end);
         ASSERT(info->type != CanonicalizeUnique);
         if (info->type == CanonicalizeSet) {
-            for (const uint16_t* set = characterSetInfo[info->value]; (ch = *set); ++set)
-                addSorted(m_matchesUnicode, ch);
+            for (const UChar32* set = canonicalCharacterSetInfo(info->value, m_canonicalMode); (ch = *set); ++set)
+                addSorted(ch);
         } else {
-            addSorted(m_matchesUnicode, ch);
-            addSorted(m_matchesUnicode, getCanonicalPair(info, ch));
+            addSorted(ch);
+            addSorted(getCanonicalPair(info, ch));
         }
     }
 
-    void putRange(UChar lo, UChar hi)
+    void putRange(UChar32 lo, UChar32 hi)
     {
         if (lo <= 0x7f) {
             char asciiLo = lo;
-            char asciiHi = std::min(hi, (UChar)0x7f);
+            char asciiHi = std::min(hi, (UChar32)0x7f);
             addSortedRange(m_ranges, lo, asciiHi);
             
             if (m_isCaseInsensitive) {
@@ -123,16 +123,16 @@ public:
         if (hi <= 0x7f)
             return;
 
-        lo = std::max(lo, (UChar)0x80);
+        lo = std::max(lo, (UChar32)0x80);
         addSortedRange(m_rangesUnicode, lo, hi);
         
         if (!m_isCaseInsensitive)
             return;
 
-        const UCS2CanonicalizationRange* info = rangeInfoFor(lo);
+        const CanonicalizationRange* info = canonicalRangeInfoFor(lo, m_canonicalMode);
         while (true) {
             // Handle the range [lo .. end]
-            UChar end = std::min<UChar>(info->end, hi);
+            UChar32 end = std::min<UChar32>(info->end, hi);
 
             switch (info->type) {
             case CanonicalizeUnique:
@@ -140,7 +140,7 @@ public:
                 break;
             case CanonicalizeSet: {
-                UChar ch;
-                for (const uint16_t* set = characterSetInfo[info->value]; (ch = *set); ++set)
+                UChar32 ch;
+                for (const UChar32* set = canonicalCharacterSetInfo(info->value, m_canonicalMode); (ch = *set); ++set)
                     addSorted(m_matchesUnicode, ch);
                 break;
             }
@@ -188,7 +188,12 @@ public:
     }
 
 private:
-    void addSorted(Vector<UChar>& matches, UChar ch)
+    void addSorted(UChar32 ch)
+    {
+        addSorted(ch <= 0x7f ? m_matches : m_matchesUnicode, ch);
+    }
+
+    void addSorted(Vector<UChar32>& matches, UChar32 ch)
     {
         unsigned pos = 0;
         unsigned range = matches.size();
@@ -214,7 +219,7 @@ private:
             matches.insert(pos, ch);
     }
 
-    void addSortedRange(Vector<CharacterRange>& ranges, UChar lo, UChar hi)
+    void addSortedRange(Vector<CharacterRange>& ranges, UChar32 lo, UChar32 hi)
     {
         unsigned end = ranges.size();
         
@@ -260,10 +265,11 @@ private:
     }
 
     bool m_isCaseInsensitive;
+    CanonicalMode m_canonicalMode;
 
-    Vector<UChar> m_matches;
+    Vector<UChar32> m_matches;
     Vector<CharacterRange> m_ranges;
-    Vector<UChar> m_matchesUnicode;
+    Vector<UChar32> m_matchesUnicode;
     Vector<CharacterRange> m_rangesUnicode;
 };
 
@@ -271,7 +277,7 @@ class YarrPatternConstructor {
 public:
     YarrPatternConstructor(YarrPattern& pattern)
         : m_pattern(pattern)
-        , m_characterClassConstructor(pattern.m_ignoreCase)
+        , m_characterClassConstructor(pattern.m_ignoreCase, pattern.m_unicode ? CanonicalMode::Unicode : CanonicalMode::UCS2)
         , m_invertParentheticalAssertion(false)
     {
         auto body = std::make_unique<PatternDisjunction>();
@@ -313,16 +319,16 @@ public:
         m_alternative->m_terms.append(PatternTerm::WordBoundary(invert));
     }
 
-    void atomPatternCharacter(UChar ch)
+    void atomPatternCharacter(UChar32 ch)
     {
         // We handle case-insensitive checking of unicode characters which do have both
         // cases by handling them as if they were defined using a CharacterClass.
-        if (!m_pattern.m_ignoreCase || isASCII(ch)) {
+        if (!m_pattern.m_ignoreCase || (isASCII(ch) && !m_pattern.m_unicode)) {
             m_alternative->m_terms.append(PatternTerm(ch));
             return;
         }
 
-        const UCS2CanonicalizationRange* info = rangeInfoFor(ch);
+        const CanonicalizationRange* info = canonicalRangeInfoFor(ch, m_pattern.m_unicode ? CanonicalMode::Unicode : CanonicalMode::UCS2);
         if (info->type == CanonicalizeUnique) {
             m_alternative->m_terms.append(PatternTerm(ch));
             return;
@@ -357,12 +363,12 @@ public:
         m_invertCharacterClass = invert;
     }
 
-    void atomCharacterClassAtom(UChar ch)
+    void atomCharacterClassAtom(UChar32 ch)
     {
         m_characterClassConstructor.putChar(ch);
     }
 
-    void atomCharacterClassRange(UChar begin, UChar end)
+    void atomCharacterClassRange(UChar32 begin, UChar32 end)
     {
         m_characterClassConstructor.putRange(begin, end);
     }
@@ -596,6 +602,8 @@ public:
                     term.frameLocation = currentCallFrameSize;
                     currentCallFrameSize += YarrStackSpaceForBackTrackInfoPatternCharacter;
                     alternative->m_hasFixedSize = false;
+                } else if (m_pattern.m_unicode) {
+                    currentInputPosition += (!U_IS_BMP(term.patternCharacter) ? 2 : 1) * term.quantityCount;
                 } else
                     currentInputPosition += term.quantityCount;
                 break;
@@ -606,6 +614,11 @@ public:
                     term.frameLocation = currentCallFrameSize;
                     currentCallFrameSize += YarrStackSpaceForBackTrackInfoCharacterClass;
                     alternative->m_hasFixedSize = false;
+                } else if (m_pattern.m_unicode) {
+                    term.frameLocation = currentCallFrameSize;
+                    currentCallFrameSize += YarrStackSpaceForBackTrackInfoCharacterClass;
+                    currentInputPosition += term.quantityCount;
+                    alternative->m_hasFixedSize = false;
                 } else
                     currentInputPosition += term.quantityCount;
                 break;
@@ -832,7 +845,7 @@ const char* YarrPattern::compile(const String& patternString)
 {
     YarrPatternConstructor constructor(*this);
 
-    if (const char* error = parse(constructor, patternString))
+    if (const char* error = parse(constructor, patternString, m_unicode))
         return error;
     
     // If the pattern contains illegal backreferences reset & reparse.
@@ -846,7 +859,7 @@ const char* YarrPattern::compile(const String& patternString)
 #if !ASSERT_DISABLED
         const char* error =
 #endif
-            parse(constructor, patternString, numSubpatterns);
+            parse(constructor, patternString, m_unicode, numSubpatterns);
 
         ASSERT(!error);
         ASSERT(numSubpatterns == m_numSubpatterns);
@@ -861,9 +874,10 @@ const char* YarrPattern::compile(const String& patternString)
     return 0;
 }
 
-YarrPattern::YarrPattern(const String& pattern, bool ignoreCase, bool multiline, const char** error)
+YarrPattern::YarrPattern(const String& pattern, bool ignoreCase, bool multiline, bool unicode, const char** error)
     : m_ignoreCase(ignoreCase)
     , m_multiline(multiline)
+    , m_unicode(unicode)
     , m_containsBackreferences(false)
     , m_containsBOL(false)
     , m_containsUnsignedLengthPattern(false)
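
Note on the input-position accounting above: in unicode mode a fixed-count pattern character outside the BMP occupies two UTF-16 code units in the subject string, so the position advances by two per repetition. A minimal sketch of that arithmetic follows. The names `codeUnitsPerCharacter` and `advanceInputPosition` are hypothetical and are not part of Yarr.

```cpp
#include <cassert>

// Number of UTF-16 code units one pattern character consumes from the input.
// Outside unicode mode every pattern character is treated as one code unit.
inline unsigned codeUnitsPerCharacter(char32_t ch, bool unicode)
{
    assert(ch <= 0x10FFFF);
    return (unicode && ch > 0xFFFF) ? 2 : 1;
}

// New input position after a fixed-count term matching `ch` exactly
// `quantityCount` times, starting at `position`.
inline unsigned advanceInputPosition(unsigned position, char32_t ch, unsigned quantityCount, bool unicode)
{
    return position + codeUnitsPerCharacter(ch, unicode) * quantityCount;
}
```

For example, three repetitions of U+1F600 advance the position by six code units in unicode mode, and three repetitions of 'a' advance it by three.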
index 5482de5..e7fefc8 100644 (file)
@@ -1,5 +1,5 @@
 /*
- * Copyright (C) 2009, 2013 Apple Inc. All rights reserved.
+ * Copyright (C) 2009, 2013-2014, 2016 Apple Inc. All rights reserved.
  * Copyright (C) 2010 Peter Varga (pvarga@inf.u-szeged.hu), University of Szeged
  *
  * Redistribution and use in source and binary forms, with or without
@@ -37,10 +37,10 @@ namespace JSC { namespace Yarr {
 struct PatternDisjunction;
 
 struct CharacterRange {
-    UChar begin;
-    UChar end;
+    UChar32 begin;
+    UChar32 end;
 
-    CharacterRange(UChar begin, UChar end)
+    CharacterRange(UChar32 begin, UChar32 end)
         : begin(begin)
         , end(end)
     {
@@ -62,9 +62,9 @@ public:
         , m_tableInverted(inverted)
     {
     }
-    Vector<UChar> m_matches;
+    Vector<UChar32> m_matches;
     Vector<CharacterRange> m_ranges;
-    Vector<UChar> m_matchesUnicode;
+    Vector<UChar32> m_matchesUnicode;
     Vector<CharacterRange> m_rangesUnicode;
 
     const char* m_table;
@@ -93,7 +93,7 @@ struct PatternTerm {
     bool m_capture :1;
     bool m_invert :1;
     union {
-        UChar patternCharacter;
+        UChar32 patternCharacter;
         CharacterClass* characterClass;
         unsigned backReferenceSubpatternId;
         struct {
@@ -113,7 +113,7 @@ struct PatternTerm {
     int inputPosition;
     unsigned frameLocation;
 
-    PatternTerm(UChar ch)
+    PatternTerm(UChar32 ch)
         : type(PatternTerm::TypePatternCharacter)
         , m_capture(false)
         , m_invert(false)
@@ -300,7 +300,7 @@ struct TermChain {
 };
 
 struct YarrPattern {
-    JS_EXPORT_PRIVATE YarrPattern(const String& pattern, bool ignoreCase, bool multiline, const char** error);
+    JS_EXPORT_PRIVATE YarrPattern(const String& pattern, bool ignoreCase, bool multiline, bool unicode, const char** error);
 
     void reset()
     {
@@ -392,6 +392,7 @@ struct YarrPattern {
 
     bool m_ignoreCase : 1;
     bool m_multiline : 1;
+    bool m_unicode : 1;
     bool m_containsBackreferences : 1;
     bool m_containsBOL : 1;
     bool m_containsUnsignedLengthPattern : 1; 
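
Note on the widened fields above: `CharacterRange` and the match vectors need 32-bit storage because a 16-bit field would truncate a code point such as U+1F600 to 0xF600. The sketch below illustrates membership testing with 32-bit ranges under simplifying assumptions. `Range` and `SimpleCharacterClass` are hypothetical types, and unlike `CharacterClass` they do not keep separate ASCII and Unicode vectors.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical 32-bit inclusive range, analogous in spirit to CharacterRange.
struct Range {
    char32_t begin;
    char32_t end;
};

// Simplified character class: one sorted vector of single code points
// and one vector of inclusive ranges.
struct SimpleCharacterClass {
    std::vector<char32_t> matches; // Must be kept sorted for binary_search.
    std::vector<Range> ranges;

    bool contains(char32_t ch) const
    {
        if (std::binary_search(matches.begin(), matches.end(), ch))
            return true;
        for (const Range& range : ranges) {
            if (ch >= range.begin && ch <= range.end)
                return true;
        }
        return false;
    }
};
```

With a range covering U+1F600 to U+1F64F, `contains(0x1F600)` is true while `contains(0xF600)` is false, which is the distinction a 16-bit representation would lose.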
index aa98c4a..535611d 100644 (file)
@@ -1,5 +1,5 @@
 /*
- * Copyright (C) 2011 Apple Inc. All rights reserved.
+ * Copyright (C) 2011, 2016 Apple Inc. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
@@ -35,7 +35,7 @@ public:
     void assertionBOL() {}
     void assertionEOL() {}
     void assertionWordBoundary(bool) {}
-    void atomPatternCharacter(UChar) {}
+    void atomPatternCharacter(UChar32) {}
     void atomBuiltInCharacterClass(BuiltInCharacterClassID, bool) {}
     void atomCharacterClassBegin(bool = false) {}
     void atomCharacterClassAtom(UChar) {}
@@ -53,7 +53,7 @@ public:
 const char* checkSyntax(const String& pattern)
 {
     SyntaxChecker syntaxChecker;
-    return parse(syntaxChecker, pattern);
+    return parse(syntaxChecker, pattern, false);
 }
 
 }} // JSC::YARR