Implement Unicode RegExp support in the YARR JIT
authormsaboff@apple.com <msaboff@apple.com@268f45cc-cd09-0410-ab3c-d52691b4dbfc>
Tue, 22 Aug 2017 22:43:08 +0000 (22:43 +0000)
committermsaboff@apple.com <msaboff@apple.com@268f45cc-cd09-0410-ab3c-d52691b4dbfc>
Tue, 22 Aug 2017 22:43:08 +0000 (22:43 +0000)
https://bugs.webkit.org/show_bug.cgi?id=174646

Reviewed by Filip Pizlo.

Source/JavaScriptCore:

This support is only implemented for 64 bit platforms.  It wouldn't be too hard to add support
for 32 bit platforms with a reasonable number of spare registers.  This code slightly refactors
register usage to reduce the number of callee save registers used for non-Unicode expressions.
For Unicode expressions, there are several more registers used to store constants values for
processing surrogate pairs as well as discerning whether a character belongs to the Basic
Multilingual Plane (BMP) or one of the Supplemental Planes.

This implements JIT support for Unicode expressions very similar to how the interpreter works.
Just like in the interpreter, backtracking code uses more space on the stack to save positions.
Moved the BackTrackInfo* structs to YarrPattern as separate functions.  Added xxxIndex()
functions to each of these to simplify how the JIT code reads and writes the structure fields.

Given that reading surrogate pairs and transforming them into a single code point takes a
little processing, the code that implements reading a Unicode character is implemented as a
leaf function added to the end of the JIT'ed code.  The calling convention for
"tryReadUnicodeCharacterHelper()" is non-standard given that the rest of the code assumes
that argument values stay in argument registers for most of the generated code.
That helper takes the starting character address in one register, regUnicodeInputAndTrail,
and uses another dedicated temporary register, regUnicodeTemp.  The result is typically
returned in regT0.  If another return register is requested, we'll create an inline copy of
that function.

Added a new flag to CharacterClass to signify if a class has non-BMP characters.  This flag
is used in optimizeAlternative() where we swap the order of a fixed character class term with
a fixed character term that immediately follows it.  Since the non-BMP character class may
increment "index" when matching, that must be done first before trying to match a fixed
character term later in the string.

Given the usefulness of the LEA instruction on X86 to create a single pointer value from a
base with index and offset, which the YARR JIT uses heavily, I added a new macroAssembler
function, getEffectiveAddress64(), with an ARM64 implementation.  It just calls x86Lea64()
on X86-64.  Also added an ImplicitAddress version of load16Unaligned().

(JSC::MacroAssemblerARM64::load16Unaligned):
(JSC::MacroAssemblerARM64::getEffectiveAddress64):
* assembler/MacroAssemblerX86Common.h:
(JSC::MacroAssemblerX86Common::load16Unaligned):
(JSC::MacroAssemblerX86Common::load16):
* assembler/MacroAssemblerX86_64.h:
(JSC::MacroAssemblerX86_64::getEffectiveAddress64):
* create_regex_tables:
* runtime/RegExp.cpp:
(JSC::RegExp::compile):
* yarr/YarrInterpreter.cpp:
* yarr/YarrJIT.cpp:
(JSC::Yarr::YarrGenerator::optimizeAlternative):
(JSC::Yarr::YarrGenerator::matchCharacterClass):
(JSC::Yarr::YarrGenerator::tryReadUnicodeCharImpl):
(JSC::Yarr::YarrGenerator::tryReadUnicodeChar):
(JSC::Yarr::YarrGenerator::readCharacter):
(JSC::Yarr::YarrGenerator::jumpIfCharNotEquals):
(JSC::Yarr::YarrGenerator::matchAssertionWordchar):
(JSC::Yarr::YarrGenerator::generateAssertionWordBoundary):
(JSC::Yarr::YarrGenerator::generatePatternCharacterOnce):
(JSC::Yarr::YarrGenerator::generatePatternCharacterFixed):
(JSC::Yarr::YarrGenerator::generatePatternCharacterGreedy):
(JSC::Yarr::YarrGenerator::backtrackPatternCharacterGreedy):
(JSC::Yarr::YarrGenerator::generatePatternCharacterNonGreedy):
(JSC::Yarr::YarrGenerator::backtrackPatternCharacterNonGreedy):
(JSC::Yarr::YarrGenerator::generateCharacterClassOnce):
(JSC::Yarr::YarrGenerator::backtrackCharacterClassOnce):
(JSC::Yarr::YarrGenerator::generateCharacterClassFixed):
(JSC::Yarr::YarrGenerator::generateCharacterClassGreedy):
(JSC::Yarr::YarrGenerator::backtrackCharacterClassGreedy):
(JSC::Yarr::YarrGenerator::generateCharacterClassNonGreedy):
(JSC::Yarr::YarrGenerator::backtrackCharacterClassNonGreedy):
(JSC::Yarr::YarrGenerator::generate):
(JSC::Yarr::YarrGenerator::backtrack):
(JSC::Yarr::YarrGenerator::generateTryReadUnicodeCharacterHelper):
(JSC::Yarr::YarrGenerator::generateEnter):
(JSC::Yarr::YarrGenerator::generateReturn):
(JSC::Yarr::YarrGenerator::YarrGenerator):
(JSC::Yarr::YarrGenerator::compile):
* yarr/YarrJIT.h:
* yarr/YarrPattern.cpp:
(JSC::Yarr::CharacterClassConstructor::CharacterClassConstructor):
(JSC::Yarr::CharacterClassConstructor::reset):
(JSC::Yarr::CharacterClassConstructor::charClass):
(JSC::Yarr::CharacterClassConstructor::addSorted):
(JSC::Yarr::CharacterClassConstructor::addSortedRange):
(JSC::Yarr::CharacterClassConstructor::hasNonBMPCharacters):
(JSC::Yarr::YarrPatternConstructor::setupAlternativeOffsets):
* yarr/YarrPattern.h:
(JSC::Yarr::CharacterClass::CharacterClass):
(JSC::Yarr::BackTrackInfoPatternCharacter::beginIndex):
(JSC::Yarr::BackTrackInfoPatternCharacter::matchAmountIndex):
(JSC::Yarr::BackTrackInfoCharacterClass::beginIndex):
(JSC::Yarr::BackTrackInfoCharacterClass::matchAmountIndex):
(JSC::Yarr::BackTrackInfoBackReference::beginIndex):
(JSC::Yarr::BackTrackInfoBackReference::matchAmountIndex):
(JSC::Yarr::BackTrackInfoAlternative::offsetIndex):
(JSC::Yarr::BackTrackInfoParentheticalAssertion::beginIndex):
(JSC::Yarr::BackTrackInfoParenthesesOnce::beginIndex):
(JSC::Yarr::BackTrackInfoParenthesesTerminal::beginIndex):

LayoutTests:

Updated tests.

* js/regexp-unicode-expected.txt:
* js/script-tests/regexp-unicode.js:

git-svn-id: https://svn.webkit.org/repository/webkit/trunk@221052 268f45cc-cd09-0410-ab3c-d52691b4dbfc

14 files changed:
LayoutTests/ChangeLog
LayoutTests/js/regexp-unicode-expected.txt
LayoutTests/js/script-tests/regexp-unicode.js
Source/JavaScriptCore/ChangeLog
Source/JavaScriptCore/assembler/MacroAssemblerARM64.h
Source/JavaScriptCore/assembler/MacroAssemblerX86Common.h
Source/JavaScriptCore/assembler/MacroAssemblerX86_64.h
Source/JavaScriptCore/create_regex_tables
Source/JavaScriptCore/runtime/RegExp.cpp
Source/JavaScriptCore/yarr/YarrInterpreter.cpp
Source/JavaScriptCore/yarr/YarrJIT.cpp
Source/JavaScriptCore/yarr/YarrJIT.h
Source/JavaScriptCore/yarr/YarrPattern.cpp
Source/JavaScriptCore/yarr/YarrPattern.h

index 541b0a6..dfa0d54 100644 (file)
@@ -1,3 +1,15 @@
+2017-08-22  Michael Saboff  <msaboff@apple.com>
+
+        Implement Unicode RegExp support in the YARR JIT
+        https://bugs.webkit.org/show_bug.cgi?id=174646
+
+        Reviewed by Filip Pizlo.
+
+        Updated tests.
+
+        * js/regexp-unicode-expected.txt:
+        * js/script-tests/regexp-unicode.js:
+
 2017-08-22  Brent Fulgham  <bfulgham@apple.com>
 
         Unreviewed test fix after r221017.
index bfd9120..00e9866 100644 (file)
@@ -88,6 +88,8 @@ PASS re2.test("
 PASS re2.test("𒍅") is true
 PASS /πŒ†{2}/u.test("πŒ†πŒ†") is true
 PASS /πŒ†{2}/u.test("πŒ†πŒ†") is true
+PASS "𐐅𐐅𐐅𐐅".match(/𐐅{3}/u)[0] is "𐐅𐐅𐐅"
+PASS "𐐂𐐅𐐅𐐅".match(/𐐅{3}/u)[0] is "𐐅𐐅𐐅"
 PASS "𐐁𐐁𐐀".match(/𐐁{1,3}/u)[0] is "𐐁𐐁"
 PASS "𐐁𐐩".match(/𐐁{1,3}/iu)[0] is "𐐁𐐩"
 PASS "𐐁𐐩πͺ𐐩".match(/𐐁{1,}/iu)[0] is "𐐁𐐩"
@@ -116,6 +118,7 @@ PASS "𐐀".match(/a*/ui)[0].length is 0
 PASS "𐐀".match(/\d*/u)[0].length is 0
 PASS "123𐐀".match(/\d*/u)[0] is "123"
 PASS "12X3𐐀4".match(/\d{0,1}/ug) is ["1", "2", "", "3", "", "4", ""]
+PASS "𐐂𐐅𐐅𐐂𐐅𐐅𐐅".match(/𐐅{3}/u)[0] is "𐐅𐐅𐐅"
 PASS match3[0] is "a𐐐𐐐b"
 PASS match3[1] is undefined.
 PASS match3[2] is "a𐐐𐐐b"
@@ -136,9 +139,28 @@ PASS /\u{1}/.test("u") is true
 PASS /\u{4}/.test("u") is false
 PASS /\u{4}/.test("uuuu") is true
 PASS "800-555-1212".match(/[0-9\-]*/u)[0].length is 12
+PASS "πŸ‚‘πŸƒ‘πŸ‚ΈπŸƒ‰πŸƒš".match(re7)[0] is "πŸ‚‘πŸƒ‘"
+PASS "πŸ‚‘πŸƒ‘πŸ‚±πŸƒ‰πŸƒš".match(re7)[0] is "πŸ‚‘πŸƒ‘πŸ‚±"
+PASS "πŸ‚‘πŸƒ‘πŸ‚±πŸƒπŸƒš".match(re7)[0] is "πŸ‚‘πŸƒ‘πŸ‚±πŸƒ"
+PASS "πŸ‚£πŸƒ‘πŸ‚±πŸƒπŸƒš".match(re7)[0] is "πŸƒ‘πŸ‚±πŸƒ"
+PASS "πŒ‘πŒπŒ‘".match(/[πŒπŒ‘]*a|[πŒπŒ‘]*./iu)[0] is "πŒ‘πŒπŒ‘"
+PASS "πŒ‘πŒπŒ‘".match(/[πŒπŒ‘]*?a|[πŒπŒ‘]*?./iu)[0] is "πŒ‘"
+PASS "πŒ‘πŒπŒ‘".match(/[πŒπŒ‘]+a|[πŒπŒ‘]+./iu)[0] is "πŒ‘πŒπŒ‘"
+PASS "πŒ‘πŒπŒ‘".match(/[πŒπŒ‘]+?a|[πŒπŒ‘]+?./iu)[0] is "πŒ‘πŒ"
+PASS "C83|НАЧАВЬ".match(re8)[0] is "C83|НАЧАВЬ"
+PASS "This.Is.16.Chars|НАЧАВЬ".match(re8)[0] is "This.Is.16.Chars|НАЧАВЬ"
+PASS "Testing\nሴ 1 2 3".match(/^[α€€-𐃿] 1 2 3/um)[0] is "ሴ 1 2 3"
+PASS "Testing\n𐃰 1 2 3".match(/^[α€€-𐃿] 1 2 3/um)[0] is "𐃰 1 2 3"
+PASS "g\nሴ 1 2 3".match(/g\n^[α€€-𐃿] 1 2 3/um)[0] is "g\nሴ 1 2 3"
+PASS "g\n𐃰 1 2 3".match(/g\n^[α€€-𐃿] 1 2 3/um)[0] is "g\n𐃰 1 2 3"
+PASS "Testing αˆ΄\n1 2 3".match(/Testing [α€€-𐃿]$/um)[0] is "Testing αˆ΄"
+PASS "Testing πƒ°\n1 2 3".match(/Testing [α€€-𐃿]$/um)[0] is "Testing πƒ°"
+PASS "Testing αˆ΄\n1 2 3".match(/g [α€€-𐃿]$\n1/um)[0] is "g αˆ΄\n1"
+PASS "Testing πƒ°\n1 2 3".match(/g [α€€-𐃿]$\n1/um)[0] is "g πƒ°\n1"
 PASS "this is b\ba test".match(/is b\cha test/u)[0].length is 11
 PASS new RegExp("\\/", "u").source is "\\/"
 PASS r = new RegExp("\\u{110000}", "u") threw exception SyntaxError: Invalid regular expression: invalid unicode {} escape.
+PASS r = new RegExp("𐐅{2147483648}", "u") threw exception SyntaxError: Invalid regular expression: pattern exceeds string length limits.
 PASS r = new RegExp("\\-", "u") threw exception SyntaxError: Invalid regular expression: invalid escaped character for unicode pattern.
 PASS r = new RegExp("\\a", "u") threw exception SyntaxError: Invalid regular expression: invalid escaped character for unicode pattern.
 PASS r = new RegExp("[\\a]", "u") threw exception SyntaxError: Invalid regular expression: invalid escaped character for unicode pattern.
index d5b1e73..0f2597b 100644 (file)
@@ -124,6 +124,8 @@ shouldBeTrue('re2.test("\u{12345}")');
 // Check quantified matches
 shouldBeTrue('/\u{1d306}{2}/u.test("\u{1d306}\u{1d306}")');
 shouldBeTrue('/\uD834\uDF06{2}/u.test("\uD834\uDF06\uD834\uDF06")');
+shouldBe('"\u{10405}\u{10405}\u{10405}\u{10405}".match(/\u{10405}{3}/u)[0]', '"\u{10405}\u{10405}\u{10405}"');
+shouldBe('"\u{10402}\u{10405}\u{10405}\u{10405}".match(/\u{10405}{3}/u)[0]', '"\u{10405}\u{10405}\u{10405}"');
 shouldBe('"\u{10401}\u{10401}\u{10400}".match(/\u{10401}{1,3}/u)[0]', '"\u{10401}\u{10401}"');
 shouldBe('"\u{10401}\u{10429}".match(/\u{10401}{1,3}/iu)[0]', '"\u{10401}\u{10429}"');
 shouldBe('"\u{10401}\u{10429}\u{1042a}\u{10429}".match(/\u{10401}{1,}/iu)[0]', '"\u{10401}\u{10429}"');
@@ -154,6 +156,7 @@ shouldBe('"\u{10400}".match(/a*/ui)[0].length', '0');
 shouldBe('"\u{10400}".match(/\\d*/u)[0].length', '0');
 shouldBe('"123\u{10400}".match(/\\d*/u)[0]', '"123"');
 shouldBe('"12X3\u{10400}4".match(/\\d{0,1}/ug)', '["1", "2", "", "3", "", "4", ""]');
+shouldBe('"\u{10402}\u{10405}\u{10405}\u{10402}\u{10405}\u{10405}\u{10405}".match(/\u{10405}{3}/u)[0]', '"\u{10405}\u{10405}\u{10405}"');
 
 var re3 = new RegExp("(a\u{10410}*bc)|(a\u{10410}*b)", "u");
 var match3 = "a\u{10410}\u{10410}b".match(re3);
@@ -191,12 +194,38 @@ shouldBeTrue('/\\u{4}/.test("uuuu")');
 // Check that \- escape works in a character class for a unicode pattern
 shouldBe('"800-555-1212".match(/[0-9\\-]*/u)[0].length', '12');
 
+// Check that counted Unicode character classes work.
+var re7 = new RegExp("(?:[\u{1f0a1}\u{1f0b1}\u{1f0d1}\u{1f0c1}]{2,4})", "u");
+shouldBe('"\u{1f0a1}\u{1f0d1}\u{1f0b8}\u{1f0c9}\u{1f0da}".match(re7)[0]', '"\u{1f0a1}\u{1f0d1}"');
+shouldBe('"\u{1f0a1}\u{1f0d1}\u{1f0b1}\u{1f0c9}\u{1f0da}".match(re7)[0]', '"\u{1f0a1}\u{1f0d1}\u{1f0b1}"');
+shouldBe('"\u{1f0a1}\u{1f0d1}\u{1f0b1}\u{1f0c1}\u{1f0da}".match(re7)[0]', '"\u{1f0a1}\u{1f0d1}\u{1f0b1}\u{1f0c1}"');
+shouldBe('"\u{1f0a3}\u{1f0d1}\u{1f0b1}\u{1f0c1}\u{1f0da}".match(re7)[0]', '"\u{1f0d1}\u{1f0b1}\u{1f0c1}"');
+shouldBe('"\u{10311}\u{10310}\u{10311}".match(/[\u{10301}\u{10311}]*a|[\u{10310}\u{10311}]*./iu)[0]', '"\u{10311}\u{10310}\u{10311}"');
+shouldBe('"\u{10311}\u{10310}\u{10311}".match(/[\u{10301}\u{10311}]*?a|[\u{10310}\u{10311}]*?./iu)[0]', '"\u{10311}"');
+shouldBe('"\u{10311}\u{10310}\u{10311}".match(/[\u{10301}\u{10311}]+a|[\u{10310}\u{10311}]+./iu)[0]', '"\u{10311}\u{10310}\u{10311}"');
+shouldBe('"\u{10311}\u{10310}\u{10311}".match(/[\u{10301}\u{10311}]+?a|[\u{10310}\u{10311}]+?./iu)[0]', '"\u{10311}\u{10310}"');
+
+var re8 = new  RegExp("^([0-9a-z\.]{3,16})\\|\u{041d}\u{0410}\u{0427}\u{0410}\u{0422}\u{042c}", "ui");
+shouldBe('"C83|\u{041d}\u{0410}\u{0427}\u{0410}\u{0422}\u{042c}".match(re8)[0]', '"C83|\u{041d}\u{0410}\u{0427}\u{0410}\u{0422}\u{042c}"');
+shouldBe('"This.Is.16.Chars|\u{041d}\u{0410}\u{0427}\u{0410}\u{0422}\u{042c}".match(re8)[0]', '"This.Is.16.Chars|\u{041d}\u{0410}\u{0427}\u{0410}\u{0422}\u{042c}"');
+
+// Check that unicode characters work with ^ and $ for multiline patterns
+shouldBe('"Testing\\n\u{1234} 1 2 3".match(/^[\u{1000}-\u{100ff}] 1 2 3/um)[0]', '"\u{1234} 1 2 3"');
+shouldBe('"Testing\\n\u{100f0} 1 2 3".match(/^[\u{1000}-\u{100ff}] 1 2 3/um)[0]', '"\u{100f0} 1 2 3"');
+shouldBe('"g\\n\u{1234} 1 2 3".match(/g\\n^[\u{1000}-\u{100ff}] 1 2 3/um)[0]', '"g\\n\u{1234} 1 2 3"');
+shouldBe('"g\\n\u{100f0} 1 2 3".match(/g\\n^[\u{1000}-\u{100ff}] 1 2 3/um)[0]', '"g\\n\u{100f0} 1 2 3"');
+shouldBe('"Testing \u{1234}\\n1 2 3".match(/Testing [\u{1000}-\u{100ff}]$/um)[0]', '"Testing \u{1234}"');
+shouldBe('"Testing \u{100f0}\\n1 2 3".match(/Testing [\u{1000}-\u{100ff}]$/um)[0]', '"Testing \u{100f0}"');
+shouldBe('"Testing \u{1234}\\n1 2 3".match(/g [\u{1000}-\u{100ff}]$\\n1/um)[0]', '"g \u{1234}\\n1"');
+shouldBe('"Testing \u{100f0}\\n1 2 3".match(/g [\u{1000}-\u{100ff}]$\\n1/um)[0]', '"g \u{100f0}\\n1"');
+
 // Check that control letter escapes work with unicode flag
 shouldBe('"this is b\ba test".match(/is b\\cha test/u)[0].length', '11');
 
 // Check that invalid unicode patterns throw exceptions
 shouldBe('new RegExp("\\\\/", "u").source', '"\\\\/"');
 shouldThrow('r = new RegExp("\\\\u{110000}", "u")', '"SyntaxError: Invalid regular expression: invalid unicode {} escape"');
+shouldThrow('r = new RegExp("\u{10405}{2147483648}", "u")', '"SyntaxError: Invalid regular expression: pattern exceeds string length limits"');
 
 var invalidEscapeException = "SyntaxError: Invalid regular expression: invalid escaped character for unicode pattern";
 var newRegExp;
index f0c83fa..9fbe521 100644 (file)
@@ -1,3 +1,105 @@
+2017-08-22  Michael Saboff  <msaboff@apple.com>
+
+        Implement Unicode RegExp support in the YARR JIT
+        https://bugs.webkit.org/show_bug.cgi?id=174646
+
+        Reviewed by Filip Pizlo.
+
+        This support is only implemented for 64 bit platforms.  It wouldn't be too hard to add support
+        for 32 bit platforms with a reasonable number of spare registers.  This code slightly refactors
+        register usage to reduce the number of callee save registers used for non-Unicode expressions.
+        For Unicode expressions, there are several more registers used to store constants values for
+        processing surrogate pairs as well as discerning whether a character belongs to the Basic
+        Multilingual Plane (BMP) or one of the Supplemental Planes.
+
+        This implements JIT support for Unicode expressions very similar to how the interpreter works.
+        Just like in the interpreter, backtracking code uses more space on the stack to save positions.
+        Moved the BackTrackInfo* structs to YarrPattern as separate functions.  Added xxxIndex()
+        functions to each of these to simplify how the JIT code reads and writes the structure fields.
+
+        Given that reading surrogate pairs and transforming them into a single code point takes a
+        little processing, the code that implements reading a Unicode character is implemented as a
+        leaf function added to the end of the JIT'ed code.  The calling convention for
+        "tryReadUnicodeCharacterHelper()" is non-standard given that the rest of the code assumes
+        that argument values stay in argument registers for most of the generated code.
+        That helper takes the starting character address in one register, regUnicodeInputAndTrail,
+        and uses another dedicated temporary register, regUnicodeTemp.  The result is typically
+        returned in regT0.  If another return register is requested, we'll create an inline copy of
+        that function.
+
+        Added a new flag to CharacterClass to signify if a class has non-BMP characters.  This flag
+        is used in optimizeAlternative() where we swap the order of a fixed character class term with
+        a fixed character term that immediately follows it.  Since the non-BMP character class may
+        increment "index" when matching, that must be done first before trying to match a fixed
+        character term later in the string.
+
+        Given the usefulness of the LEA instruction on X86 to create a single pointer value from a
+        base with index and offset, which the YARR JIT uses heavily, I added a new macroAssembler
+        function, getEffectiveAddress64(), with an ARM64 implementation.  It just calls x86Lea64()
+        on X86-64.  Also added an ImplicitAddress version of load16Unaligned().
+
+        (JSC::MacroAssemblerARM64::load16Unaligned):
+        (JSC::MacroAssemblerARM64::getEffectiveAddress64):
+        * assembler/MacroAssemblerX86Common.h:
+        (JSC::MacroAssemblerX86Common::load16Unaligned):
+        (JSC::MacroAssemblerX86Common::load16):
+        * assembler/MacroAssemblerX86_64.h:
+        (JSC::MacroAssemblerX86_64::getEffectiveAddress64):
+        * create_regex_tables:
+        * runtime/RegExp.cpp:
+        (JSC::RegExp::compile):
+        * yarr/YarrInterpreter.cpp:
+        * yarr/YarrJIT.cpp:
+        (JSC::Yarr::YarrGenerator::optimizeAlternative):
+        (JSC::Yarr::YarrGenerator::matchCharacterClass):
+        (JSC::Yarr::YarrGenerator::tryReadUnicodeCharImpl):
+        (JSC::Yarr::YarrGenerator::tryReadUnicodeChar):
+        (JSC::Yarr::YarrGenerator::readCharacter):
+        (JSC::Yarr::YarrGenerator::jumpIfCharNotEquals):
+        (JSC::Yarr::YarrGenerator::matchAssertionWordchar):
+        (JSC::Yarr::YarrGenerator::generateAssertionWordBoundary):
+        (JSC::Yarr::YarrGenerator::generatePatternCharacterOnce):
+        (JSC::Yarr::YarrGenerator::generatePatternCharacterFixed):
+        (JSC::Yarr::YarrGenerator::generatePatternCharacterGreedy):
+        (JSC::Yarr::YarrGenerator::backtrackPatternCharacterGreedy):
+        (JSC::Yarr::YarrGenerator::generatePatternCharacterNonGreedy):
+        (JSC::Yarr::YarrGenerator::backtrackPatternCharacterNonGreedy):
+        (JSC::Yarr::YarrGenerator::generateCharacterClassOnce):
+        (JSC::Yarr::YarrGenerator::backtrackCharacterClassOnce):
+        (JSC::Yarr::YarrGenerator::generateCharacterClassFixed):
+        (JSC::Yarr::YarrGenerator::generateCharacterClassGreedy):
+        (JSC::Yarr::YarrGenerator::backtrackCharacterClassGreedy):
+        (JSC::Yarr::YarrGenerator::generateCharacterClassNonGreedy):
+        (JSC::Yarr::YarrGenerator::backtrackCharacterClassNonGreedy):
+        (JSC::Yarr::YarrGenerator::generate):
+        (JSC::Yarr::YarrGenerator::backtrack):
+        (JSC::Yarr::YarrGenerator::generateTryReadUnicodeCharacterHelper):
+        (JSC::Yarr::YarrGenerator::generateEnter):
+        (JSC::Yarr::YarrGenerator::generateReturn):
+        (JSC::Yarr::YarrGenerator::YarrGenerator):
+        (JSC::Yarr::YarrGenerator::compile):
+        * yarr/YarrJIT.h:
+        * yarr/YarrPattern.cpp:
+        (JSC::Yarr::CharacterClassConstructor::CharacterClassConstructor):
+        (JSC::Yarr::CharacterClassConstructor::reset):
+        (JSC::Yarr::CharacterClassConstructor::charClass):
+        (JSC::Yarr::CharacterClassConstructor::addSorted):
+        (JSC::Yarr::CharacterClassConstructor::addSortedRange):
+        (JSC::Yarr::CharacterClassConstructor::hasNonBMPCharacters):
+        (JSC::Yarr::YarrPatternConstructor::setupAlternativeOffsets):
+        * yarr/YarrPattern.h:
+        (JSC::Yarr::CharacterClass::CharacterClass):
+        (JSC::Yarr::BackTrackInfoPatternCharacter::beginIndex):
+        (JSC::Yarr::BackTrackInfoPatternCharacter::matchAmountIndex):
+        (JSC::Yarr::BackTrackInfoCharacterClass::beginIndex):
+        (JSC::Yarr::BackTrackInfoCharacterClass::matchAmountIndex):
+        (JSC::Yarr::BackTrackInfoBackReference::beginIndex):
+        (JSC::Yarr::BackTrackInfoBackReference::matchAmountIndex):
+        (JSC::Yarr::BackTrackInfoAlternative::offsetIndex):
+        (JSC::Yarr::BackTrackInfoParentheticalAssertion::beginIndex):
+        (JSC::Yarr::BackTrackInfoParenthesesOnce::beginIndex):
+        (JSC::Yarr::BackTrackInfoParenthesesTerminal::beginIndex):
+
 2017-08-22  Per Arne Vollan  <pvollan@apple.com>
 
         Implement 64-bit MacroAssembler::probe support for Windows.
index 10f2bea..d24e0f5 100644 (file)
@@ -1176,6 +1176,11 @@ public:
         m_assembler.ldrh(dest, address.base, memoryTempRegister);
     }
     
+    void load16Unaligned(ImplicitAddress address, RegisterID dest)
+    {
+        load16(address, dest);
+    }
+
     void load16Unaligned(BaseIndex address, RegisterID dest)
     {
         load16(address, dest);
@@ -1535,6 +1540,13 @@ public:
         m_assembler.strb(src, dest, simm);
     }
 
+    void getEffectiveAddress64(BaseIndex address, RegisterID dest)
+    {
+        m_assembler.add<64>(dest, address.base, address.index, ARM64Assembler::LSL, address.scale);
+        if (address.offset)
+            add64(TrustedImm32(address.offset), dest);
+    }
+
     // Floating-point operations:
 
     static bool supportsFloatingPoint() { return true; }
index cb1a748..6877529 100644 (file)
@@ -1163,6 +1163,11 @@ public:
         load32(address, dest);
     }
 
+    void load16Unaligned(ImplicitAddress address, RegisterID dest)
+    {
+        load16(address, dest);
+    }
+
     void load16Unaligned(BaseIndex address, RegisterID dest)
     {
         load16(address, dest);
@@ -1225,6 +1230,11 @@ public:
         m_assembler.movsbl_rr(src, dest);
     }
     
+    void load16(ImplicitAddress address, RegisterID dest)
+    {
+        m_assembler.movzwl_mr(address.offset, address.base, dest);
+    }
+
     void load16(BaseIndex address, RegisterID dest)
     {
         m_assembler.movzwl_mr(address.offset, address.base, address.index, address.scale, dest);
index a715c73..e8564bf 100644 (file)
@@ -364,6 +364,11 @@ public:
         m_assembler.leaq_mr(index.offset, index.base, index.index, index.scale, dest);
     }
 
+    void getEffectiveAddress64(BaseIndex address, RegisterID dest)
+    {
+        return x86Lea64(address, dest);
+    }
+
     void addPtrNoFlags(TrustedImm32 imm, RegisterID srcDest)
     {
         m_assembler.leaq_mr(imm.m_value, srcDest, srcDest);
index 11a683a..ac9378d 100644 (file)
@@ -1,4 +1,4 @@
-# Copyright (C) 2010, 2013 Apple Inc. All rights reserved.
+# Copyright (C) 2010, 2013-2017 Apple Inc. All rights reserved.
 # 
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -97,6 +97,7 @@ for name, classes in types.items():
             function += ("    auto characterClass = std::make_unique<CharacterClass>(_%sData, false);\n" % (name))
     else:
         function += ("    auto characterClass = std::make_unique<CharacterClass>();\n")
+    hasNonBMPCharacters = False
     for (min, max) in ranges:
         if (min == max):
             if (min > 127):
@@ -108,6 +109,9 @@ for name, classes in types.items():
             function += ("    characterClass->m_rangesUnicode.append(CharacterRange(0x%04x, 0x%04x));\n" % (min, max))
         else:
             function += ("    characterClass->m_ranges.append(CharacterRange(0x%02x, 0x%02x));\n" % (min, max))
+        if max >= 0x10000:
+            hasNonBMPCharacters = True
+    function += ("    characterClass->m_hasNonBMPCharacters = %s;\n" % ("true" if hasNonBMPCharacters else "false"))
     function += ("    return characterClass;\n")
     function += ("}\n\n")
     functions += function
index 60232b7..436b36e 100644 (file)
@@ -1,6 +1,6 @@
 /*
  *  Copyright (C) 1999-2001, 2004 Harri Porten (porten@kde.org)
- *  Copyright (c) 2007, 2008, 2016 Apple Inc. All rights reserved.
+ *  Copyright (c) 2007, 2008, 2016-2017 Apple Inc. All rights reserved.
  *  Copyright (C) 2009 Torch Mobile, Inc.
  *  Copyright (C) 2010 Peter Varga (pvarga@inf.u-szeged.hu), University of Szeged
  *
@@ -281,7 +281,7 @@ void RegExp::compile(VM* vm, Yarr::YarrCharSize charSize)
     }
 
 #if ENABLE(YARR_JIT)
-    if (!pattern.m_containsBackreferences && !pattern.containsUnsignedLengthPattern() && !unicode() && vm->canUseRegExpJIT()) {
+    if (!pattern.m_containsBackreferences && !pattern.containsUnsignedLengthPattern() && vm->canUseRegExpJIT()) {
         Yarr::jitCompile(pattern, charSize, vm, m_regExpJITCode);
         if (!m_regExpJITCode.isFallBack()) {
             m_state = JITCode;
index c9f4302..2eeba8a 100644 (file)
@@ -1,5 +1,5 @@
 /*
- * Copyright (C) 2009, 2013, 2016 Apple Inc. All rights reserved.
+ * Copyright (C) 2009, 2013-2017 Apple Inc. All rights reserved.
  * Copyright (C) 2010 Peter Varga (pvarga@inf.u-szeged.hu), University of Szeged
  *
  * Redistribution and use in source and binary forms, with or without
@@ -44,30 +44,6 @@ class Interpreter {
 public:
     struct ParenthesesDisjunctionContext;
 
-    struct BackTrackInfoPatternCharacter {
-        uintptr_t begin; // Only needed for unicode patterns
-        uintptr_t matchAmount;
-    };
-    struct BackTrackInfoCharacterClass {
-        uintptr_t begin; // Only needed for unicode patterns
-        uintptr_t matchAmount;
-    };
-    struct BackTrackInfoBackReference {
-        uintptr_t begin; // Not really needed for greedy quantifiers.
-        uintptr_t matchAmount; // Not really needed for fixed quantifiers.
-    };
-    struct BackTrackInfoAlternative {
-        uintptr_t offset;
-    };
-    struct BackTrackInfoParentheticalAssertion {
-        uintptr_t begin;
-    };
-    struct BackTrackInfoParenthesesOnce {
-        uintptr_t begin;
-    };
-    struct BackTrackInfoParenthesesTerminal {
-        uintptr_t begin;
-    };
     struct BackTrackInfoParentheses {
         uintptr_t matchAmount;
         ParenthesesDisjunctionContext* lastContext;
@@ -2082,12 +2058,12 @@ unsigned interpret(BytecodePattern* bytecode, const UChar* input, unsigned lengt
 }
 
 // These should be the same for both UChar & LChar.
-COMPILE_ASSERT(sizeof(Interpreter<UChar>::BackTrackInfoPatternCharacter) == (YarrStackSpaceForBackTrackInfoPatternCharacter * sizeof(uintptr_t)), CheckYarrStackSpaceForBackTrackInfoPatternCharacter);
-COMPILE_ASSERT(sizeof(Interpreter<UChar>::BackTrackInfoCharacterClass) == (YarrStackSpaceForBackTrackInfoCharacterClass * sizeof(uintptr_t)), CheckYarrStackSpaceForBackTrackInfoCharacterClass);
-COMPILE_ASSERT(sizeof(Interpreter<UChar>::BackTrackInfoBackReference) == (YarrStackSpaceForBackTrackInfoBackReference * sizeof(uintptr_t)), CheckYarrStackSpaceForBackTrackInfoBackReference);
-COMPILE_ASSERT(sizeof(Interpreter<UChar>::BackTrackInfoAlternative) == (YarrStackSpaceForBackTrackInfoAlternative * sizeof(uintptr_t)), CheckYarrStackSpaceForBackTrackInfoAlternative);
-COMPILE_ASSERT(sizeof(Interpreter<UChar>::BackTrackInfoParentheticalAssertion) == (YarrStackSpaceForBackTrackInfoParentheticalAssertion * sizeof(uintptr_t)), CheckYarrStackSpaceForBackTrackInfoParentheticalAssertion);
-COMPILE_ASSERT(sizeof(Interpreter<UChar>::BackTrackInfoParenthesesOnce) == (YarrStackSpaceForBackTrackInfoParenthesesOnce * sizeof(uintptr_t)), CheckYarrStackSpaceForBackTrackInfoParenthesesOnce);
+COMPILE_ASSERT(sizeof(BackTrackInfoPatternCharacter) == (YarrStackSpaceForBackTrackInfoPatternCharacter * sizeof(uintptr_t)), CheckYarrStackSpaceForBackTrackInfoPatternCharacter);
+COMPILE_ASSERT(sizeof(BackTrackInfoCharacterClass) == (YarrStackSpaceForBackTrackInfoCharacterClass * sizeof(uintptr_t)), CheckYarrStackSpaceForBackTrackInfoCharacterClass);
+COMPILE_ASSERT(sizeof(BackTrackInfoBackReference) == (YarrStackSpaceForBackTrackInfoBackReference * sizeof(uintptr_t)), CheckYarrStackSpaceForBackTrackInfoBackReference);
+COMPILE_ASSERT(sizeof(BackTrackInfoAlternative) == (YarrStackSpaceForBackTrackInfoAlternative * sizeof(uintptr_t)), CheckYarrStackSpaceForBackTrackInfoAlternative);
+COMPILE_ASSERT(sizeof(BackTrackInfoParentheticalAssertion) == (YarrStackSpaceForBackTrackInfoParentheticalAssertion * sizeof(uintptr_t)), CheckYarrStackSpaceForBackTrackInfoParentheticalAssertion);
+COMPILE_ASSERT(sizeof(BackTrackInfoParenthesesOnce) == (YarrStackSpaceForBackTrackInfoParenthesesOnce * sizeof(uintptr_t)), CheckYarrStackSpaceForBackTrackInfoParenthesesOnce);
 COMPILE_ASSERT(sizeof(Interpreter<UChar>::BackTrackInfoParentheses) == (YarrStackSpaceForBackTrackInfoParentheses * sizeof(uintptr_t)), CheckYarrStackSpaceForBackTrackInfoParentheses);
 
 
index 3fae2ce..d7829aa 100644 (file)
@@ -1,5 +1,5 @@
 /*
- * Copyright (C) 2009, 2013, 2015-2016 Apple Inc. All rights reserved.
+ * Copyright (C) 2009-2017 Apple Inc. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
@@ -51,12 +51,12 @@ class YarrGenerator : private MacroAssembler {
 
     static const RegisterID regT0 = ARMRegisters::r4;
     static const RegisterID regT1 = ARMRegisters::r5;
-
     static const RegisterID initialStart = ARMRegisters::r6;
-#define HAVE_INITIAL_START_REG
 
     static const RegisterID returnRegister = ARMRegisters::r0;
     static const RegisterID returnRegister2 = ARMRegisters::r1;
+
+#define HAVE_INITIAL_START_REG
 #elif CPU(ARM64)
     static const RegisterID input = ARM64Registers::x0;
     static const RegisterID index = ARM64Registers::x1;
@@ -65,12 +65,19 @@ class YarrGenerator : private MacroAssembler {
 
     static const RegisterID regT0 = ARM64Registers::x4;
     static const RegisterID regT1 = ARM64Registers::x5;
-
-    static const RegisterID initialStart = ARM64Registers::x6;
-#define HAVE_INITIAL_START_REG
+    static const RegisterID regUnicodeInputAndTrail = ARM64Registers::x6;
+    static const RegisterID regUnicodeTemp = ARM64Registers::x7;
+    static const RegisterID initialStart = ARM64Registers::x8;
+    static const RegisterID supplementaryPlanesBase = ARM64Registers::x9;
+    static const RegisterID surrogateTagMask = ARM64Registers::x10;
+    static const RegisterID leadingSurrogateTag = ARM64Registers::x11;
+    static const RegisterID trailingSurrogateTag = ARM64Registers::x12;
 
     static const RegisterID returnRegister = ARM64Registers::x0;
     static const RegisterID returnRegister2 = ARM64Registers::x1;
+
+#define HAVE_INITIAL_START_REG
+#define JIT_UNICODE_EXPRESSIONS
 #elif CPU(MIPS)
     static const RegisterID input = MIPSRegisters::a0;
     static const RegisterID index = MIPSRegisters::a1;
@@ -79,12 +86,12 @@ class YarrGenerator : private MacroAssembler {
 
     static const RegisterID regT0 = MIPSRegisters::t4;
     static const RegisterID regT1 = MIPSRegisters::t5;
-
     static const RegisterID initialStart = MIPSRegisters::t6;
-#define HAVE_INITIAL_START_REG
 
     static const RegisterID returnRegister = MIPSRegisters::v0;
     static const RegisterID returnRegister2 = MIPSRegisters::v1;
+
+#define HAVE_INITIAL_START_REG
 #elif CPU(X86)
     static const RegisterID input = X86Registers::eax;
     static const RegisterID index = X86Registers::edx;
@@ -113,30 +120,45 @@ class YarrGenerator : private MacroAssembler {
 #endif
 
     static const RegisterID regT0 = X86Registers::eax;
-    static const RegisterID regT1 = X86Registers::ebx;
+#if !OS(WINDOWS)
+    static const RegisterID regT1 = X86Registers::r8;
+#else
+    static const RegisterID regT1 = X86Registers::ecx;
+#endif
 
+    static const RegisterID initialStart = X86Registers::ebx;
 #if !OS(WINDOWS)
-    static const RegisterID initialStart = X86Registers::r8;
+    static const RegisterID regUnicodeInputAndTrail = X86Registers::r9;
+    static const RegisterID regUnicodeTemp = X86Registers::r10;
 #else
-    static const RegisterID initialStart = X86Registers::ecx;
+    static const RegisterID regUnicodeInputAndTrail = X86Registers::esi;
+    static const RegisterID regUnicodeTemp = X86Registers::edi;
 #endif
-#define HAVE_INITIAL_START_REG
+    static const RegisterID supplementaryPlanesBase = X86Registers::r12;
+    static const RegisterID surrogateTagMask = X86Registers::r13;
+    static const RegisterID leadingSurrogateTag = X86Registers::r14;
+    static const RegisterID trailingSurrogateTag = X86Registers::r15;
 
     static const RegisterID returnRegister = X86Registers::eax;
     static const RegisterID returnRegister2 = X86Registers::edx;
+
+#define HAVE_INITIAL_START_REG
+#define JIT_UNICODE_EXPRESSIONS
 #endif
 
     void optimizeAlternative(PatternAlternative* alternative)
     {
-        if (!alternative->m_terms.size())
+        if (!alternative->m_terms.size() || m_decodeSurrogatePairs)
             return;
 
         for (unsigned i = 0; i < alternative->m_terms.size() - 1; ++i) {
             PatternTerm& term = alternative->m_terms[i];
             PatternTerm& nextTerm = alternative->m_terms[i + 1];
 
+            // We can move BMP only character classes after fixed character terms.
             if ((term.type == PatternTerm::TypeCharacterClass)
                 && (term.quantityType == QuantifierFixedCount)
+                && (!term.characterClass->m_hasNonBMPCharacters)
                 && (nextTerm.type == PatternTerm::TypePatternCharacter)
                 && (nextTerm.quantityType == QuantifierFixedCount)) {
                 PatternTerm termCopy = term;
@@ -195,14 +217,16 @@ class YarrGenerator : private MacroAssembler {
 
     void matchCharacterClass(RegisterID character, JumpList& matchDest, const CharacterClass* charClass)
     {
-        if (charClass->m_table) {
+        if (charClass->m_table && !m_decodeSurrogatePairs) {
             ExtendedAddress tableEntry(character, reinterpret_cast<intptr_t>(charClass->m_table));
             matchDest.append(branchTest8(charClass->m_tableInverted ? Zero : NonZero, tableEntry));
             return;
         }
-        Jump unicodeFail;
+        JumpList unicodeFail;
         if (charClass->m_matchesUnicode.size() || charClass->m_rangesUnicode.size()) {
-            Jump isAscii = branch32(LessThanOrEqual, character, TrustedImm32(0x7f));
+            JumpList isAscii;
+            if (charClass->m_matches.size() || charClass->m_ranges.size())
+                isAscii.append(branch32(LessThanOrEqual, character, TrustedImm32(0x7f)));
 
             if (charClass->m_matchesUnicode.size()) {
                 for (unsigned i = 0; i < charClass->m_matchesUnicode.size(); ++i) {
@@ -222,7 +246,8 @@ class YarrGenerator : private MacroAssembler {
                 }
             }
 
-            unicodeFail = jump();
+            if (charClass->m_matches.size() || charClass->m_ranges.size())
+                unicodeFail = jump();
             isAscii.link(this);
         }
 
@@ -322,12 +347,52 @@ class YarrGenerator : private MacroAssembler {
         return BaseIndex(input, indexReg, TimesTwo, (characterOffset * static_cast<int32_t>(sizeof(UChar))).unsafeGet());
     }
 
+#ifdef JIT_UNICODE_EXPRESSIONS
+    void tryReadUnicodeCharImpl(RegisterID resultReg)
+    {
+        ASSERT(m_charSize == Char16);
+
+        JumpList notUnicode;
+        load16Unaligned(regUnicodeInputAndTrail, resultReg);
+        and32(surrogateTagMask, resultReg, regUnicodeTemp);
+        notUnicode.append(branch32(NotEqual, regUnicodeTemp, leadingSurrogateTag));
+        addPtr(TrustedImm32(2), regUnicodeInputAndTrail);
+        getEffectiveAddress64(BaseIndex(input, length, TimesTwo), regUnicodeTemp);
+        notUnicode.append(branch32(AboveOrEqual, regUnicodeInputAndTrail, regUnicodeTemp));
+        load16Unaligned(Address(regUnicodeInputAndTrail), regUnicodeInputAndTrail);
+        and32(surrogateTagMask, regUnicodeInputAndTrail, regUnicodeTemp);
+        notUnicode.append(branch32(NotEqual, regUnicodeTemp, trailingSurrogateTag));
+        sub32(leadingSurrogateTag, resultReg);
+        sub32(trailingSurrogateTag, regUnicodeInputAndTrail);
+        lshift32(TrustedImm32(10), resultReg);
+        or32(regUnicodeInputAndTrail, resultReg);
+        add32(supplementaryPlanesBase, resultReg);
+        notUnicode.link(this);
+    }
+
+    void tryReadUnicodeChar(BaseIndex address, RegisterID resultReg)
+    {
+        ASSERT(m_charSize == Char16);
+
+        getEffectiveAddress64(address, regUnicodeInputAndTrail);
+
+        if (resultReg == regT0)
+            m_tryReadUnicodeCharacterCalls.append(nearCall());
+        else
+            tryReadUnicodeCharImpl(resultReg);
+    }
+#endif
+
     void readCharacter(Checked<unsigned> negativeCharacterOffset, RegisterID resultReg, RegisterID indexReg = index)
     {
         BaseIndex address = negativeOffsetIndexedAddress(negativeCharacterOffset, resultReg, indexReg);
 
         if (m_charSize == Char8)
             load8(address, resultReg);
+#ifdef JIT_UNICODE_EXPRESSIONS
+        else if (m_decodeSurrogatePairs)
+            tryReadUnicodeChar(address, resultReg);
+#endif
         else
             load16Unaligned(address, resultReg);
     }
@@ -338,7 +403,7 @@ class YarrGenerator : private MacroAssembler {
 
         // For case-insesitive compares, non-ascii characters that have different
         // upper & lower case representations are converted to a character class.
-        ASSERT(!m_pattern.ignoreCase() || isASCIIAlpha(ch) || isCanonicallyUnique(ch));
+        ASSERT(!m_pattern.ignoreCase() || isASCIIAlpha(ch) || isCanonicallyUnique(ch, m_canonicalMode));
         if (m_pattern.ignoreCase() && isASCIIAlpha(ch)) {
             or32(TrustedImm32(0x20), character);
             ch |= 0x20;
@@ -746,7 +811,15 @@ class YarrGenerator : private MacroAssembler {
             nextIsNotWordChar.append(atEndOfInput());
 
         readCharacter(m_checkedOffset - term->inputPosition, character);
-        matchCharacterClass(character, nextIsWordChar, m_pattern.wordcharCharacterClass());
+
+        CharacterClass* wordcharCharacterClass;
+
+        if (m_unicodeIgnoreCase)
+            wordcharCharacterClass = m_pattern.wordUnicodeIgnoreCaseCharCharacterClass();
+        else
+            wordcharCharacterClass = m_pattern.wordcharCharacterClass();
+
+        matchCharacterClass(character, nextIsWordChar, wordcharCharacterClass);
     }
 
     void generateAssertionWordBoundary(size_t opIndex)
@@ -761,7 +834,15 @@ class YarrGenerator : private MacroAssembler {
         if (!term->inputPosition)
             atBegin = branch32(Equal, index, Imm32(m_checkedOffset.unsafeGet()));
         readCharacter(m_checkedOffset - term->inputPosition + 1, character);
-        matchCharacterClass(character, matchDest, m_pattern.wordcharCharacterClass());
+
+        CharacterClass* wordcharCharacterClass;
+
+        if (m_unicodeIgnoreCase)
+            wordcharCharacterClass = m_pattern.wordUnicodeIgnoreCaseCharCharacterClass();
+        else
+            wordcharCharacterClass = m_pattern.wordcharCharacterClass();
+
+        matchCharacterClass(character, matchDest, wordcharCharacterClass);
         if (!term->inputPosition)
             atBegin.link(this);
 
@@ -833,7 +914,7 @@ class YarrGenerator : private MacroAssembler {
 
         // For case-insesitive compares, non-ascii characters that have different
         // upper & lower case representations are converted to a character class.
-        ASSERT(!m_pattern.ignoreCase() || isASCIIAlpha(ch) || isCanonicallyUnique(ch));
+        ASSERT(!m_pattern.ignoreCase() || isASCIIAlpha(ch) || isCanonicallyUnique(ch, m_canonicalMode));
 
         if (m_pattern.ignoreCase() && isASCIIAlpha(ch))
 #if CPU(BIG_ENDIAN)
@@ -848,7 +929,8 @@ class YarrGenerator : private MacroAssembler {
             if (nextTerm->type != PatternTerm::TypePatternCharacter
                 || nextTerm->quantityType != QuantifierFixedCount
                 || nextTerm->quantityMaxCount != 1
-                || nextTerm->inputPosition != (startTermPosition + numberCharacters))
+                || nextTerm->inputPosition != (startTermPosition + numberCharacters)
+                || (U16_LENGTH(nextTerm->patternCharacter) != 1 && m_decodeSurrogatePairs))
                 break;
 
             nextOp->m_isDeadCode = true;
@@ -869,7 +951,7 @@ class YarrGenerator : private MacroAssembler {
 
             // For case-insesitive compares, non-ascii characters that have different
             // upper & lower case representations are converted to a character class.
-            ASSERT(!m_pattern.ignoreCase() || isASCIIAlpha(currentCharacter) || isCanonicallyUnique(currentCharacter));
+            ASSERT(!m_pattern.ignoreCase() || isASCIIAlpha(currentCharacter) || isCanonicallyUnique(currentCharacter, m_canonicalMode));
 
             allCharacters |= (currentCharacter << shiftAmount);
 
@@ -930,20 +1012,27 @@ class YarrGenerator : private MacroAssembler {
         const RegisterID countRegister = regT1;
 
         move(index, countRegister);
-        sub32(Imm32(term->quantityMaxCount.unsafeGet()), countRegister);
+        Checked<unsigned> scaledMaxCount = term->quantityMaxCount;
+        scaledMaxCount *= U_IS_BMP(ch) ? 1 : 2;
+        sub32(Imm32(scaledMaxCount.unsafeGet()), countRegister);
 
         Label loop(this);
-        readCharacter(m_checkedOffset - term->inputPosition - term->quantityMaxCount, character, countRegister);
+        readCharacter(m_checkedOffset - term->inputPosition - scaledMaxCount, character, countRegister);
         // For case-insesitive compares, non-ascii characters that have different
         // upper & lower case representations are converted to a character class.
-        ASSERT(!m_pattern.ignoreCase() || isASCIIAlpha(ch) || isCanonicallyUnique(ch));
+        ASSERT(!m_pattern.ignoreCase() || isASCIIAlpha(ch) || isCanonicallyUnique(ch, m_canonicalMode));
         if (m_pattern.ignoreCase() && isASCIIAlpha(ch)) {
             or32(TrustedImm32(0x20), character);
             ch |= 0x20;
         }
 
         op.m_jumps.append(branch32(NotEqual, character, Imm32(ch)));
-        add32(TrustedImm32(1), countRegister);
+#ifdef JIT_UNICODE_EXPRESSIONS
+        if (m_decodeSurrogatePairs && !U_IS_BMP(ch))
+            add32(TrustedImm32(2), countRegister);
+        else
+#endif
+            add32(TrustedImm32(1), countRegister);
         branch32(NotEqual, countRegister, index).linkTo(loop, this);
     }
     void backtrackPatternCharacterFixed(size_t opIndex)
@@ -969,8 +1058,18 @@ class YarrGenerator : private MacroAssembler {
             failures.append(atEndOfInput());
             failures.append(jumpIfCharNotEquals(ch, m_checkedOffset - term->inputPosition, character));
 
-            add32(TrustedImm32(1), countRegister);
             add32(TrustedImm32(1), index);
+#ifdef JIT_UNICODE_EXPRESSIONS
+            if (m_decodeSurrogatePairs && !U_IS_BMP(ch)) {
+                Jump surrogatePairOk = notAtEndOfInput();
+                sub32(TrustedImm32(1), index);
+                failures.append(jump());
+                surrogatePairOk.link(this);
+                add32(TrustedImm32(1), index);
+            }
+#endif
+            add32(TrustedImm32(1), countRegister);
+
             if (term->quantityMaxCount == quantifyInfinite)
                 jump(loop);
             else
@@ -980,7 +1079,7 @@ class YarrGenerator : private MacroAssembler {
         }
         op.m_reentry = label();
 
-        storeToFrame(countRegister, term->frameLocation);
+        storeToFrame(countRegister, term->frameLocation + BackTrackInfoPatternCharacter::matchAmountIndex());
     }
     void backtrackPatternCharacterGreedy(size_t opIndex)
     {
@@ -991,10 +1090,13 @@ class YarrGenerator : private MacroAssembler {
 
         m_backtrackingState.link(this);
 
-        loadFromFrame(term->frameLocation, countRegister);
+        loadFromFrame(term->frameLocation + BackTrackInfoPatternCharacter::matchAmountIndex(), countRegister);
         m_backtrackingState.append(branchTest32(Zero, countRegister));
         sub32(TrustedImm32(1), countRegister);
-        sub32(TrustedImm32(1), index);
+        if (!m_decodeSurrogatePairs || U_IS_BMP(term->patternCharacter))
+            sub32(TrustedImm32(1), index);
+        else
+            sub32(TrustedImm32(2), index);
         jump(op.m_reentry);
     }
 
@@ -1007,7 +1109,7 @@ class YarrGenerator : private MacroAssembler {
 
         move(TrustedImm32(0), countRegister);
         op.m_reentry = label();
-        storeToFrame(countRegister, term->frameLocation);
+        storeToFrame(countRegister, term->frameLocation + BackTrackInfoPatternCharacter::matchAmountIndex());
     }
     void backtrackPatternCharacterNonGreedy(size_t opIndex)
     {
@@ -1020,7 +1122,7 @@ class YarrGenerator : private MacroAssembler {
 
         m_backtrackingState.link(this);
 
-        loadFromFrame(term->frameLocation, countRegister);
+        loadFromFrame(term->frameLocation + BackTrackInfoPatternCharacter::matchAmountIndex(), countRegister);
 
         // Unless have a 16 bit pattern character and an 8 bit string - short circuit
         if (!((ch > 0xff) && (m_charSize == Char8))) {
@@ -1030,13 +1132,27 @@ class YarrGenerator : private MacroAssembler {
                 nonGreedyFailures.append(branch32(Equal, countRegister, Imm32(term->quantityMaxCount.unsafeGet())));
             nonGreedyFailures.append(jumpIfCharNotEquals(ch, m_checkedOffset - term->inputPosition, character));
 
-            add32(TrustedImm32(1), countRegister);
             add32(TrustedImm32(1), index);
+#ifdef JIT_UNICODE_EXPRESSIONS
+            if (m_decodeSurrogatePairs && !U_IS_BMP(ch)) {
+                Jump surrogatePairOk = notAtEndOfInput();
+                sub32(TrustedImm32(1), index);
+                nonGreedyFailures.append(jump());
+                surrogatePairOk.link(this);
+                add32(TrustedImm32(1), index);
+            }
+#endif
+            add32(TrustedImm32(1), countRegister);
 
             jump(op.m_reentry);
             nonGreedyFailures.link(this);
         }
 
+        if (m_decodeSurrogatePairs && !U_IS_BMP(ch)) {
+            // subtract countRegister*2 for non-BMP characters
+            lshift32(TrustedImm32(1), countRegister);
+        }
+
         sub32(countRegister, index);
         m_backtrackingState.fallthrough();
     }
@@ -1048,6 +1164,9 @@ class YarrGenerator : private MacroAssembler {
 
         const RegisterID character = regT0;
 
+        if (m_decodeSurrogatePairs)
+            storeToFrame(index, term->frameLocation + BackTrackInfoCharacterClass::beginIndex());
+
         JumpList matchDest;
         readCharacter(m_checkedOffset - term->inputPosition, character);
         matchCharacterClass(character, matchDest, term->characterClass);
@@ -1058,9 +1177,27 @@ class YarrGenerator : private MacroAssembler {
             op.m_jumps.append(jump());
             matchDest.link(this);
         }
+
+#ifdef JIT_UNICODE_EXPRESSIONS
+        if (m_decodeSurrogatePairs) {
+            Jump isBMPChar = branch32(LessThan, character, supplementaryPlanesBase);
+            add32(TrustedImm32(1), index);
+            isBMPChar.link(this);
+        }
+#endif
     }
     void backtrackCharacterClassOnce(size_t opIndex)
     {
+#ifdef JIT_UNICODE_EXPRESSIONS
+        if (m_decodeSurrogatePairs) {
+            YarrOp& op = m_ops[opIndex];
+            PatternTerm* term = op.m_term;
+
+            m_backtrackingState.link(this);
+            loadFromFrame(term->frameLocation + BackTrackInfoCharacterClass::beginIndex(), index);
+            m_backtrackingState.fallthrough();
+        }
+#endif
         backtrackTermDefault(opIndex);
     }
 
@@ -1088,6 +1225,15 @@ class YarrGenerator : private MacroAssembler {
         }
 
         add32(TrustedImm32(1), countRegister);
+#ifdef JIT_UNICODE_EXPRESSIONS
+        if (m_decodeSurrogatePairs) {
+            Jump isBMPChar = branch32(LessThan, character, supplementaryPlanesBase);
+            op.m_jumps.append(atEndOfInput());
+            add32(TrustedImm32(1), countRegister);
+            add32(TrustedImm32(1), index);
+            isBMPChar.link(this);
+        }
+#endif
         branch32(NotEqual, countRegister, index).linkTo(loop, this);
     }
     void backtrackCharacterClassFixed(size_t opIndex)
@@ -1103,6 +1249,8 @@ class YarrGenerator : private MacroAssembler {
         const RegisterID character = regT0;
         const RegisterID countRegister = regT1;
 
+        if (m_decodeSurrogatePairs)
+            storeToFrame(index, term->frameLocation + BackTrackInfoCharacterClass::beginIndex());
         move(TrustedImm32(0), countRegister);
 
         JumpList failures;
@@ -1122,6 +1270,15 @@ class YarrGenerator : private MacroAssembler {
 
         add32(TrustedImm32(1), countRegister);
         add32(TrustedImm32(1), index);
+#ifdef JIT_UNICODE_EXPRESSIONS
+        if (m_decodeSurrogatePairs) {
+            failures.append(atEndOfInput());
+            Jump isBMPChar = branch32(LessThan, character, supplementaryPlanesBase);
+            add32(TrustedImm32(1), index);
+            isBMPChar.link(this);
+        }
+#endif
+
         if (term->quantityMaxCount != quantifyInfinite) {
             branch32(NotEqual, countRegister, Imm32(term->quantityMaxCount.unsafeGet())).linkTo(loop, this);
             failures.append(jump());
@@ -1131,7 +1288,7 @@ class YarrGenerator : private MacroAssembler {
         failures.link(this);
         op.m_reentry = label();
 
-        storeToFrame(countRegister, term->frameLocation);
+        storeToFrame(countRegister, term->frameLocation + BackTrackInfoCharacterClass::matchAmountIndex());
     }
     void backtrackCharacterClassGreedy(size_t opIndex)
     {
@@ -1142,10 +1299,34 @@ class YarrGenerator : private MacroAssembler {
 
         m_backtrackingState.link(this);
 
-        loadFromFrame(term->frameLocation, countRegister);
+        loadFromFrame(term->frameLocation + BackTrackInfoCharacterClass::matchAmountIndex(), countRegister);
         m_backtrackingState.append(branchTest32(Zero, countRegister));
         sub32(TrustedImm32(1), countRegister);
-        sub32(TrustedImm32(1), index);
+        if (!m_decodeSurrogatePairs)
+            sub32(TrustedImm32(1), index);
+        else {
+            const RegisterID character = regT0;
+
+            loadFromFrame(term->frameLocation + BackTrackInfoCharacterClass::beginIndex(), index);
+            // Rematch one less
+            storeToFrame(countRegister, term->frameLocation + BackTrackInfoCharacterClass::matchAmountIndex());
+
+            Label rematchLoop(this);
+            readCharacter(m_checkedOffset - term->inputPosition, character);
+
+            sub32(TrustedImm32(1), countRegister);
+            add32(TrustedImm32(1), index);
+
+#ifdef JIT_UNICODE_EXPRESSIONS
+            Jump isBMPChar = branch32(LessThan, character, supplementaryPlanesBase);
+            add32(TrustedImm32(1), index);
+            isBMPChar.link(this);
+#endif
+
+            branchTest32(Zero, countRegister).linkTo(rematchLoop, this);
+
+            loadFromFrame(term->frameLocation + BackTrackInfoCharacterClass::matchAmountIndex(), countRegister);
+        }
         jump(op.m_reentry);
     }
 
@@ -1158,8 +1339,11 @@ class YarrGenerator : private MacroAssembler {
 
         move(TrustedImm32(0), countRegister);
         op.m_reentry = label();
-        storeToFrame(countRegister, term->frameLocation);
+        if (m_decodeSurrogatePairs)
+            storeToFrame(index, term->frameLocation + BackTrackInfoCharacterClass::beginIndex());
+        storeToFrame(countRegister, term->frameLocation + BackTrackInfoCharacterClass::matchAmountIndex());
     }
+
     void backtrackCharacterClassNonGreedy(size_t opIndex)
     {
         YarrOp& op = m_ops[opIndex];
@@ -1172,7 +1356,9 @@ class YarrGenerator : private MacroAssembler {
 
         m_backtrackingState.link(this);
 
-        loadFromFrame(term->frameLocation, countRegister);
+        if (m_decodeSurrogatePairs)
+            loadFromFrame(term->frameLocation + BackTrackInfoCharacterClass::beginIndex(), index);
+        loadFromFrame(term->frameLocation + BackTrackInfoCharacterClass::matchAmountIndex(), countRegister);
 
         nonGreedyFailures.append(atEndOfInput());
         nonGreedyFailures.append(branch32(Equal, countRegister, Imm32(term->quantityMaxCount.unsafeGet())));
@@ -1190,6 +1376,13 @@ class YarrGenerator : private MacroAssembler {
 
         add32(TrustedImm32(1), countRegister);
         add32(TrustedImm32(1), index);
+#ifdef JIT_UNICODE_EXPRESSIONS
+        if (m_decodeSurrogatePairs) {
+            Jump isBMPChar = branch32(LessThan, character, supplementaryPlanesBase);
+            add32(TrustedImm32(1), index);
+            isBMPChar.link(this);
+        }
+#endif
 
         jump(op.m_reentry);
 
@@ -1650,12 +1843,12 @@ class YarrGenerator : private MacroAssembler {
                 //
                 // FIXME: for capturing parens, could use the index in the capture array?
                 if (term->quantityType == QuantifierGreedy)
-                    storeToFrame(index, parenthesesFrameLocation);
+                    storeToFrame(index, parenthesesFrameLocation + BackTrackInfoParenthesesOnce::beginIndex());
                 else if (term->quantityType == QuantifierNonGreedy) {
-                    storeToFrame(TrustedImm32(-1), parenthesesFrameLocation);
+                    storeToFrame(TrustedImm32(-1), parenthesesFrameLocation + BackTrackInfoParenthesesOnce::beginIndex());
                     op.m_jumps.append(jump());
                     op.m_reentry = label();
-                    storeToFrame(index, parenthesesFrameLocation);
+                    storeToFrame(index, parenthesesFrameLocation + BackTrackInfoParenthesesOnce::beginIndex());
                 }
 
                 // If the parenthese are capturing, store the starting index value to the
@@ -1731,7 +1924,7 @@ class YarrGenerator : private MacroAssembler {
 
                 // Store the start index of the current match; we need to reject zero
                 // length matches.
-                storeToFrame(index, term->frameLocation);
+                storeToFrame(index, term->frameLocation + BackTrackInfoParenthesesTerminal::beginIndex());
                 break;
             }
             case OpParenthesesSubpatternTerminalEnd: {
@@ -1764,7 +1957,7 @@ class YarrGenerator : private MacroAssembler {
                 // Store the current index - assertions should not update index, so
                 // we will need to restore it upon a successful match.
                 unsigned parenthesesFrameLocation = term->frameLocation;
-                storeToFrame(index, parenthesesFrameLocation);
+                storeToFrame(index, parenthesesFrameLocation + BackTrackInfoParentheticalAssertion::beginIndex());
 
                 // Check 
                 op.m_checkAdjust = m_checkedOffset - term->inputPosition;
@@ -1779,7 +1972,7 @@ class YarrGenerator : private MacroAssembler {
 
                 // Restore the input index value.
                 unsigned parenthesesFrameLocation = term->frameLocation;
-                loadFromFrame(parenthesesFrameLocation, index);
+                loadFromFrame(parenthesesFrameLocation + BackTrackInfoParentheticalAssertion::beginIndex(), index);
 
                 // If inverted, a successful match of the assertion must be treated
                 // as a failure, so jump to backtracking.
@@ -2221,7 +2414,7 @@ class YarrGenerator : private MacroAssembler {
                     if (term->quantityType == QuantifierGreedy) {
                         // Clear the flag in the stackframe indicating we ran through the subpattern.
                         unsigned parenthesesFrameLocation = term->frameLocation;
-                        storeToFrame(TrustedImm32(-1), parenthesesFrameLocation);
+                        storeToFrame(TrustedImm32(-1), parenthesesFrameLocation + BackTrackInfoParenthesesOnce::beginIndex());
                         // Jump to after the parentheses, skipping the subpattern.
                         jump(m_ops[op.m_nextOp].m_reentry);
                         // A backtrack from after the parentheses, when skipping the subpattern,
@@ -2590,12 +2783,46 @@ class YarrGenerator : private MacroAssembler {
         lastOp.m_nextOp = repeatLoop;
     }
 
+    void generateTryReadUnicodeCharacterHelper()
+    {
+#ifdef JIT_UNICODE_EXPRESSIONS
+        if (m_tryReadUnicodeCharacterCalls.isEmpty())
+            return;
+
+        ASSERT(m_decodeSurrogatePairs);
+
+        m_tryReadUnicodeCharacterEntry = label();
+
+        tryReadUnicodeCharImpl(regT0);
+
+        ret();
+#endif
+    }
+
     void generateEnter()
     {
 #if CPU(X86_64)
         push(X86Registers::ebp);
         move(stackPointerRegister, X86Registers::ebp);
-        push(X86Registers::ebx);
+
+        if (m_pattern.m_saveInitialStartValue)
+            push(X86Registers::ebx);
+
+        if (m_decodeSurrogatePairs) {
+#if OS(WINDOWS)
+            push(X86Registers::edi);
+            push(X86Registers::esi);
+#endif
+            push(X86Registers::r12);
+            push(X86Registers::r13);
+            push(X86Registers::r14);
+            push(X86Registers::r15);
+
+            move(TrustedImm32(0x10000), supplementaryPlanesBase);
+            move(TrustedImm32(0xfffffc00), surrogateTagMask);
+            move(TrustedImm32(0xd800), leadingSurrogateTag);
+            move(TrustedImm32(0xdc00), trailingSurrogateTag);
+        }
         // The ABI doesn't guarantee the upper bits are zero on unsigned arguments, so clear them ourselves.
         zeroExtend32ToPtr(index, index);
         zeroExtend32ToPtr(length, length);
@@ -2622,6 +2849,14 @@ class YarrGenerator : private MacroAssembler {
             loadPtr(Address(X86Registers::ebp, 2 * sizeof(void*)), output);
     #endif
 #elif CPU(ARM64)
+        if (m_decodeSurrogatePairs) {
+            pushPair(framePointerRegister, linkRegister);
+            move(TrustedImm32(0x10000), supplementaryPlanesBase);
+            move(TrustedImm32(0xfffffc00), surrogateTagMask);
+            move(TrustedImm32(0xd800), leadingSurrogateTag);
+            move(TrustedImm32(0xdc00), trailingSurrogateTag);
+        }
+
         // The ABI doesn't guarantee the upper bits are zero on unsigned arguments, so clear them ourselves.
         zeroExtend32ToPtr(index, index);
         zeroExtend32ToPtr(length, length);
@@ -2647,13 +2882,28 @@ class YarrGenerator : private MacroAssembler {
         store64(returnRegister2, Address(X86Registers::ecx, sizeof(void*)));
         move(X86Registers::ecx, returnRegister);
 #endif
-        pop(X86Registers::ebx);
+        if (m_decodeSurrogatePairs) {
+            pop(X86Registers::r15);
+            pop(X86Registers::r14);
+            pop(X86Registers::r13);
+            pop(X86Registers::r12);
+#if OS(WINDOWS)
+            pop(X86Registers::esi);
+            pop(X86Registers::edi);
+#endif
+        }
+
+        if (m_pattern.m_saveInitialStartValue)
+            pop(X86Registers::ebx);
         pop(X86Registers::ebp);
 #elif CPU(X86)
         pop(X86Registers::esi);
         pop(X86Registers::edi);
         pop(X86Registers::ebx);
         pop(X86Registers::ebp);
+#elif CPU(ARM64)
+        if (m_decodeSurrogatePairs)
+            popPair(framePointerRegister, linkRegister);
 #elif CPU(ARM)
         pop(ARMRegisters::r6);
         pop(ARMRegisters::r5);
@@ -2670,11 +2920,21 @@ public:
         , m_pattern(pattern)
         , m_charSize(charSize)
         , m_shouldFallBack(false)
+        , m_decodeSurrogatePairs(m_charSize == Char16 && m_pattern.unicode())
+        , m_unicodeIgnoreCase(m_pattern.unicode() && m_pattern.ignoreCase())
+        , m_canonicalMode(m_pattern.unicode() ? CanonicalMode::Unicode : CanonicalMode::UCS2)
     {
     }
 
     void compile(YarrCodeBlock& jitObject)
     {
+#ifndef JIT_UNICODE_EXPRESSIONS
+        if (m_decodeSurrogatePairs) {
+            jitObject.setFallBack(true);
+            return;
+        }
+#endif
+
         generateEnter();
 
         Jump hasInput = checkInput();
@@ -2709,12 +2969,21 @@ public:
         generate();
         backtrack();
 
+        generateTryReadUnicodeCharacterHelper();
+
         LinkBuffer linkBuffer(*this, REGEXP_CODE_ID, JITCompilationCanFail);
         if (linkBuffer.didFailToAllocate()) {
             jitObject.setFallBack(true);
             return;
         }
 
+        if (!m_tryReadUnicodeCharacterCalls.isEmpty()) {
+            CodeLocationLabel tryReadUnicodeCharacterHelper = linkBuffer.locationOf(m_tryReadUnicodeCharacterEntry);
+
+            for (auto call : m_tryReadUnicodeCharacterCalls)
+                linkBuffer.link(call, tryReadUnicodeCharacterHelper);
+        }
+
         m_backtrackingState.linkDataLabels(linkBuffer);
 
         if (compileMode == MatchOnly) {
@@ -2742,6 +3011,12 @@ private:
     // supported in the JIT; fall back to the interpreter when this is detected.
     bool m_shouldFallBack;
 
+    bool m_decodeSurrogatePairs;
+    bool m_unicodeIgnoreCase;
+    CanonicalMode m_canonicalMode;
+    Vector<Call> m_tryReadUnicodeCharacterCalls;
+    Label m_tryReadUnicodeCharacterEntry;
+
     // The regular expression expressed as a linear sequence of operations.
     Vector<YarrOp, 128> m_ops;
 
index 9e17574..b72ba9b 100644 (file)
@@ -1,5 +1,5 @@
 /*
- * Copyright (C) 2009 Apple Inc. All rights reserved.
+ * Copyright (C) 2009-2017 Apple Inc. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
index 7222cc0..ac9d7bf 100644 (file)
@@ -45,6 +45,7 @@ class CharacterClassConstructor {
 public:
     CharacterClassConstructor(bool isCaseInsensitive, CanonicalMode canonicalMode)
         : m_isCaseInsensitive(isCaseInsensitive)
+        , m_hasNonBMPCharacters(false)
         , m_canonicalMode(canonicalMode)
     {
     }
@@ -55,6 +56,7 @@ public:
         m_ranges.clear();
         m_matchesUnicode.clear();
         m_rangesUnicode.clear();
+        m_hasNonBMPCharacters = false;
     }
 
     void append(const CharacterClass* other)
@@ -185,6 +187,7 @@ public:
         characterClass->m_ranges.swap(m_ranges);
         characterClass->m_matchesUnicode.swap(m_matchesUnicode);
         characterClass->m_rangesUnicode.swap(m_rangesUnicode);
+        characterClass->m_hasNonBMPCharacters = hasNonBMPCharacters();
 
         return characterClass;
     }
@@ -200,6 +203,9 @@ private:
         unsigned pos = 0;
         unsigned range = matches.size();
 
+        if (!U_IS_BMP(ch))
+            m_hasNonBMPCharacters = true;
+
         // binary chop, find position to insert char.
         while (range) {
             unsigned index = range >> 1;
@@ -224,7 +230,10 @@ private:
     void addSortedRange(Vector<CharacterRange>& ranges, UChar32 lo, UChar32 hi)
     {
         unsigned end = ranges.size();
-        
+
+        if (!U_IS_BMP(hi))
+            m_hasNonBMPCharacters = true;
+
         // Simple linear scan - I doubt there are that many ranges anyway...
         // feel free to fix this with something faster (eg binary chop).
         for (unsigned i = 0; i < end; ++i) {
@@ -266,7 +275,13 @@ private:
         ranges.append(CharacterRange(lo, hi));
     }
 
+    bool hasNonBMPCharacters()
+    {
+        return m_hasNonBMPCharacters;
+    }
+
     bool m_isCaseInsensitive;
+    bool m_hasNonBMPCharacters;
     CanonicalMode m_canonicalMode;
 
     Vector<UChar32> m_matches;
@@ -617,7 +632,11 @@ public:
                     currentCallFrameSize += YarrStackSpaceForBackTrackInfoPatternCharacter;
                     alternative->m_hasFixedSize = false;
                 } else if (m_pattern.unicode()) {
-                    currentInputPosition += U16_LENGTH(term.patternCharacter) * term.quantityMaxCount;
+                    Checked<unsigned, RecordOverflow> tempCount = term.quantityMaxCount;
+                    tempCount *= U16_LENGTH(term.patternCharacter);
+                    if (tempCount.hasOverflowed())
+                        return YarrPattern::OffsetTooLarge;
+                    currentInputPosition += tempCount;
                 } else
                     currentInputPosition += term.quantityMaxCount;
                 break;
index cad61ac..2cebbce 100644 (file)
@@ -1,5 +1,5 @@
 /*
- * Copyright (C) 2009, 2013-2014, 2016 Apple Inc. All rights reserved.
+ * Copyright (C) 2009, 2013-2017 Apple Inc. All rights reserved.
  * Copyright (C) 2010 Peter Varga (pvarga@inf.u-szeged.hu), University of Szeged
  *
  * Redistribution and use in source and binary forms, with or without
@@ -56,11 +56,13 @@ public:
     // specified matches and ranges)
     CharacterClass()
         : m_table(0)
+        , m_hasNonBMPCharacters(false)
     {
     }
     CharacterClass(const char* table, bool inverted)
         : m_table(table)
         , m_tableInverted(inverted)
+        , m_hasNonBMPCharacters(false)
     {
     }
     Vector<UChar32> m_matches;
@@ -70,6 +72,7 @@ public:
 
     const char* m_table;
     bool m_tableInverted;
+    bool m_hasNonBMPCharacters;
 };
 
 enum QuantifierType {
@@ -493,4 +496,52 @@ private:
     CharacterClass* nonwordUnicodeIgnoreCasecharCached;
 };
 
+    struct BackTrackInfoPatternCharacter {
+        uintptr_t begin; // Only needed for unicode patterns
+        uintptr_t matchAmount;
+
+        static unsigned beginIndex() { return offsetof(BackTrackInfoPatternCharacter, begin) / sizeof(uintptr_t); }
+        static unsigned matchAmountIndex() { return offsetof(BackTrackInfoPatternCharacter, matchAmount) / sizeof(uintptr_t); }
+    };
+
+    struct BackTrackInfoCharacterClass {
+        uintptr_t begin; // Only needed for unicode patterns
+        uintptr_t matchAmount;
+
+        static unsigned beginIndex() { return offsetof(BackTrackInfoCharacterClass, begin) / sizeof(uintptr_t); }
+        static unsigned matchAmountIndex() { return offsetof(BackTrackInfoCharacterClass, matchAmount) / sizeof(uintptr_t); }
+    };
+
+    struct BackTrackInfoBackReference {
+        uintptr_t begin; // Not really needed for greedy quantifiers.
+        uintptr_t matchAmount; // Not really needed for fixed quantifiers.
+
+        unsigned beginIndex() { return offsetof(BackTrackInfoBackReference, begin) / sizeof(uintptr_t); }
+        unsigned matchAmountIndex() { return offsetof(BackTrackInfoBackReference, matchAmount) / sizeof(uintptr_t); }
+    };
+
+    struct BackTrackInfoAlternative {
+        uintptr_t offset;
+
+        static unsigned offsetIndex() { return offsetof(BackTrackInfoAlternative, offset) / sizeof(uintptr_t); }
+    };
+
+    struct BackTrackInfoParentheticalAssertion {
+        uintptr_t begin;
+
+        static unsigned beginIndex() { return offsetof(BackTrackInfoParentheticalAssertion, begin) / sizeof(uintptr_t); }
+    };
+
+    struct BackTrackInfoParenthesesOnce {
+        uintptr_t begin;
+
+        static unsigned beginIndex() { return offsetof(BackTrackInfoParenthesesOnce, begin) / sizeof(uintptr_t); }
+    };
+
+    struct BackTrackInfoParenthesesTerminal {
+        uintptr_t begin;
+
+        static unsigned beginIndex() { return offsetof(BackTrackInfoParenthesesTerminal, begin) / sizeof(uintptr_t); }
+    };
+
 } } // namespace JSC::Yarr