YARR: . doesn't match non-BMP Unicode characters in some cases
author msaboff@apple.com <msaboff@apple.com@268f45cc-cd09-0410-ab3c-d52691b4dbfc>
Tue, 10 Jul 2018 17:34:34 +0000 (17:34 +0000)
committer msaboff@apple.com <msaboff@apple.com@268f45cc-cd09-0410-ab3c-d52691b4dbfc>
Tue, 10 Jul 2018 17:34:34 +0000 (17:34 +0000)
https://bugs.webkit.org/show_bug.cgi?id=187248

Reviewed by Geoffrey Garen.

JSTests:

New regression test.

* stress/regexp-with-nonBMP-any.js: Added.

Source/JavaScriptCore:

The safety check in optimizeAlternative() for moving character classes that consist only of BMP
characters did not take into account that the character class may be inverted.  In this case, we
represent '.' as "not a newline" using the newline character class with an inverted check.
An inverted class matches everything outside the class, so it includes non-BMP characters.

The fix is to check that the character class doesn't have non-BMP characters AND it isn't an
inverted use of that character class.
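
As a rough illustration of the condition described above, the sketch below models the old and the
fixed check as plain JavaScript predicates. It is not the YarrJIT code: the function names and the
object fields (hasNonBMPCharacters, invert) are hypothetical stand-ins for
term.characterClass->m_hasNonBMPCharacters and term.m_invert, and decodeSurrogatePairs stands in
for m_decodeSurrogatePairs.

```javascript
// Illustrative model only; names are assumptions, not the YarrJIT API.

// Old check: looks only at the members of the class and ignores inversion.
function canMoveOld(term, decodeSurrogatePairs) {
    return !decodeSurrogatePairs || !term.hasNonBMPCharacters;
}

// Fixed check: the class must be BMP-only AND must not be used inverted.
function canMoveFixed(term, decodeSurrogatePairs) {
    return !decodeSurrogatePairs
        || (!term.hasNonBMPCharacters && !term.invert);
}

// '.' modelled as the newline class (BMP-only members) with an inverted check.
const dotTerm = { hasNonBMPCharacters: false, invert: true };

// A non-inverted, BMP-only class, for comparison.
const bmpClass = { hasNonBMPCharacters: false, invert: false };

console.log("old check, '.':       ", canMoveOld(dotTerm, true));
console.log("fixed check, '.':     ", canMoveFixed(dotTerm, true));
console.log("fixed check, BMP-only:", canMoveFixed(bmpClass, true));
```

In this model the old predicate accepts the '.' term because the newline class itself has no
non-BMP members, while the fixed predicate rejects it because the term is inverted. When surrogate
pairs are not decoded, both predicates allow the move.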

* yarr/YarrJIT.cpp:
(JSC::Yarr::YarrGenerator::optimizeAlternative):

git-svn-id: https://svn.webkit.org/repository/webkit/trunk@233690 268f45cc-cd09-0410-ab3c-d52691b4dbfc

JSTests/ChangeLog
JSTests/stress/regexp-with-nonBMP-any.js [new file with mode: 0644]
Source/JavaScriptCore/ChangeLog
Source/JavaScriptCore/yarr/YarrJIT.cpp

diff --git a/JSTests/ChangeLog b/JSTests/ChangeLog
index 99b484c..427bb6d 100644
@@ -1,3 +1,14 @@
+2018-07-10  Michael Saboff  <msaboff@apple.com>
+
+        YARR: . doesn't match non-BMP Unicode characters in some cases
+        https://bugs.webkit.org/show_bug.cgi?id=187248
+
+        Reviewed by Geoffrey Garen.
+
+        New regression test.
+
+        * stress/regexp-with-nonBMP-any.js: Added.
+
 2018-07-09  Michael Saboff  <msaboff@apple.com>
 
         REGRESSION (ICU-62100.0.1): JSC test mozilla-tests.yaml/ecma/String/15.5.4.12-3.js is failing
diff --git a/JSTests/stress/regexp-with-nonBMP-any.js b/JSTests/stress/regexp-with-nonBMP-any.js
new file mode 100644
index 0000000..979936b
--- /dev/null
@@ -0,0 +1,10 @@
+// This tests that . followed by fixed character terms works with non-BMP characters
+
+if (!/^.-clef/u.test("\u{1D123}-clef"))
+    throw "Should have matched string with leading non-BMP with BOL anchored . in RE";
+
+if (!/c.lef/u.test("c\u{1C345}lef"))
+    throw "Should have matched string with non-BMP with . in RE";
+
+
+
diff --git a/Source/JavaScriptCore/ChangeLog b/Source/JavaScriptCore/ChangeLog
index 8f9e1fa..0a78a40 100644
@@ -1,3 +1,21 @@
+2018-07-10  Michael Saboff  <msaboff@apple.com>
+
+        YARR: . doesn't match non-BMP Unicode characters in some cases
+        https://bugs.webkit.org/show_bug.cgi?id=187248
+
+        Reviewed by Geoffrey Garen.
+
+        The safety check in optimizeAlternative() for moving character classes that only consist of BMP
+        characters did not take into account that the character class is inverted.  In this case, we
+        represent '.' as "not a newline" using the newline character class with an inverted check.
+        Clearly that includes non-BMP characters.
+
+        The fix is to check that the character class doesn't have non-BMP characters AND it isn't an
+        inverted use of that character class.
+
+        * yarr/YarrJIT.cpp:
+        (JSC::Yarr::YarrGenerator::optimizeAlternative):
+
 2018-07-09  Mark Lam  <mark.lam@apple.com>
 
         Add --traceLLIntExecution and --traceLLIntSlowPath options.
diff --git a/Source/JavaScriptCore/yarr/YarrJIT.cpp b/Source/JavaScriptCore/yarr/YarrJIT.cpp
index 26fcd69..38e9d7a 100644
@@ -321,7 +321,7 @@ class YarrGenerator : private MacroAssembler {
             // We can move BMP only character classes after fixed character terms.
             if ((term.type == PatternTerm::TypeCharacterClass)
                 && (term.quantityType == QuantifierFixedCount)
-                && (!m_decodeSurrogatePairs || !term.characterClass->m_hasNonBMPCharacters)
+                && (!m_decodeSurrogatePairs || (!term.characterClass->m_hasNonBMPCharacters && !term.m_invert))
                 && (nextTerm.type == PatternTerm::TypePatternCharacter)
                 && (nextTerm.quantityType == QuantifierFixedCount)) {
                 PatternTerm termCopy = term;