Add basic pattern matching support to the url filters
authorbenjamin@webkit.org <benjamin@webkit.org@268f45cc-cd09-0410-ab3c-d52691b4dbfc>
Tue, 13 Jan 2015 01:37:25 +0000 (01:37 +0000)
committerbenjamin@webkit.org <benjamin@webkit.org@268f45cc-cd09-0410-ab3c-d52691b4dbfc>
Tue, 13 Jan 2015 01:37:25 +0000 (01:37 +0000)
https://bugs.webkit.org/show_bug.cgi?id=140283

Reviewed by Andreas Kling.

Source/JavaScriptCore:

* JavaScriptCore.xcodeproj/project.pbxproj:
Make YarrParser.h private in order to use it from WebCore.

Source/WebCore:

This patch adds some basic generic pattern support for the url filters
of ContentExtensions.

Instead of writting a new parser, I re-used Gavin's parser for JavaScript
RegExp.

This patch only implements the very basic stuffs: transition on any character
and repetition.

* WebCore.xcodeproj/project.pbxproj:
* contentextensions/ContentExtensionsBackend.cpp:
(WebCore::ContentExtensions::ContentExtensionsBackend::setRuleList):
Use the new parser.

* contentextensions/DFA.cpp:
(WebCore::ContentExtensions::DFA::DFA):
(WebCore::ContentExtensions::printRange):
(WebCore::ContentExtensions::printTransition):
(WebCore::ContentExtensions::DFA::debugPrintDot):
* contentextensions/NFA.cpp:
(WebCore::ContentExtensions::printRange):
(WebCore::ContentExtensions::printTransition):
(WebCore::ContentExtensions::NFA::debugPrintDot):
The graphs generated with the extended patterns are vastly more complicated
than the old prefix matcher.
I changed the debug output to have a single link between any two nodes
instead of one per transition. This makes the graph a little more manageable.

* contentextensions/NFA.cpp:
(WebCore::ContentExtensions::NFA::addTransition):
(WebCore::ContentExtensions::NFA::addEpsilonTransition):
(WebCore::ContentExtensions::NFA::graphSize):
(WebCore::ContentExtensions::NFA::restoreToGraphSize):
* contentextensions/NFA.h:
* contentextensions/NFANode.h:
(WebCore::ContentExtensions::epsilonClosure):
The new parser can generate transitions back to the root node of index zero.
All the hash structures had to be updated to support this kind of key.

* contentextensions/NFAToDFA.cpp:
(WebCore::ContentExtensions::HashableNodeIdSetHash::hash):
Two tiny improvements:
-Don't hash zero to zero, it causes more conflicts that needed.
-The hash operation must use a commutative operation, otherwise the order
 of elements can affect the hash, which is undesired for a set.
I'll improve this further later.

(WebCore::ContentExtensions::NFAToDFA::convert):

* contentextensions/URLFilterParser.cpp: Added.
(WebCore::ContentExtensions::GraphBuilder::GraphBuilder):
(WebCore::ContentExtensions::GraphBuilder::m_lastAtom):
(WebCore::ContentExtensions::GraphBuilder::finalize):
(WebCore::ContentExtensions::GraphBuilder::errorMessage):
(WebCore::ContentExtensions::GraphBuilder::atomPatternCharacter):
(WebCore::ContentExtensions::GraphBuilder::atomBuiltInCharacterClass):
(WebCore::ContentExtensions::GraphBuilder::quantifyAtom):
(WebCore::ContentExtensions::GraphBuilder::atomBackReference):
(WebCore::ContentExtensions::GraphBuilder::atomCharacterClassAtom):
(WebCore::ContentExtensions::GraphBuilder::assertionBOL):
(WebCore::ContentExtensions::GraphBuilder::assertionEOL):
(WebCore::ContentExtensions::GraphBuilder::assertionWordBoundary):
(WebCore::ContentExtensions::GraphBuilder::atomCharacterClassBegin):
(WebCore::ContentExtensions::GraphBuilder::atomCharacterClassRange):
(WebCore::ContentExtensions::GraphBuilder::atomCharacterClassBuiltIn):
(WebCore::ContentExtensions::GraphBuilder::atomCharacterClassEnd):
(WebCore::ContentExtensions::GraphBuilder::atomParenthesesSubpatternBegin):
(WebCore::ContentExtensions::GraphBuilder::atomParentheticalAssertionBegin):
(WebCore::ContentExtensions::GraphBuilder::atomParenthesesEnd):
(WebCore::ContentExtensions::GraphBuilder::disjunction):
(WebCore::ContentExtensions::GraphBuilder::hasError):
(WebCore::ContentExtensions::GraphBuilder::fail):
(WebCore::ContentExtensions::URLFilterParser::parse):
* contentextensions/URLFilterParser.h:
(WebCore::ContentExtensions::URLFilterParser::hasError):
(WebCore::ContentExtensions::URLFilterParser::errorMessage):

git-svn-id: https://svn.webkit.org/repository/webkit/trunk@178313 268f45cc-cd09-0410-ab3c-d52691b4dbfc

12 files changed:
Source/JavaScriptCore/ChangeLog
Source/JavaScriptCore/JavaScriptCore.xcodeproj/project.pbxproj
Source/WebCore/ChangeLog
Source/WebCore/WebCore.xcodeproj/project.pbxproj
Source/WebCore/contentextensions/ContentExtensionsBackend.cpp
Source/WebCore/contentextensions/DFA.cpp
Source/WebCore/contentextensions/NFA.cpp
Source/WebCore/contentextensions/NFA.h
Source/WebCore/contentextensions/NFANode.h
Source/WebCore/contentextensions/NFAToDFA.cpp
Source/WebCore/contentextensions/URLFilterParser.cpp [new file with mode: 0644]
Source/WebCore/contentextensions/URLFilterParser.h [new file with mode: 0644]

index 1995331..9a27da0 100644 (file)
@@ -1,3 +1,13 @@
+2015-01-12  Benjamin Poulain  <benjamin@webkit.org>
+
+        Add basic pattern matching support to the url filters
+        https://bugs.webkit.org/show_bug.cgi?id=140283
+
+        Reviewed by Andreas Kling.
+
+        * JavaScriptCore.xcodeproj/project.pbxproj:
+        Make YarrParser.h private in order to use it from WebCore.
+
 2015-01-12  Geoffrey Garen  <ggaren@apple.com>
 
         Out of bounds read in IdentifierArena::makeIdentifier
index 62958bb..e7d4231 100644 (file)
                86704B8512DBA33700A9FE7B /* YarrInterpreter.h in Headers */ = {isa = PBXBuildFile; fileRef = 86704B7E12DBA33700A9FE7B /* YarrInterpreter.h */; settings = {ATTRIBUTES = (Private, ); }; };
                86704B8612DBA33700A9FE7B /* YarrJIT.cpp in Sources */ = {isa = PBXBuildFile; fileRef = 86704B7F12DBA33700A9FE7B /* YarrJIT.cpp */; };
                86704B8712DBA33700A9FE7B /* YarrJIT.h in Headers */ = {isa = PBXBuildFile; fileRef = 86704B8012DBA33700A9FE7B /* YarrJIT.h */; settings = {ATTRIBUTES = (Private, ); }; };
-               86704B8812DBA33700A9FE7B /* YarrParser.h in Headers */ = {isa = PBXBuildFile; fileRef = 86704B8112DBA33700A9FE7B /* YarrParser.h */; settings = {ATTRIBUTES = (); }; };
+               86704B8812DBA33700A9FE7B /* YarrParser.h in Headers */ = {isa = PBXBuildFile; fileRef = 86704B8112DBA33700A9FE7B /* YarrParser.h */; settings = {ATTRIBUTES = (Private, ); }; };
                86704B8912DBA33700A9FE7B /* YarrPattern.cpp in Sources */ = {isa = PBXBuildFile; fileRef = 86704B8212DBA33700A9FE7B /* YarrPattern.cpp */; };
                86704B8A12DBA33700A9FE7B /* YarrPattern.h in Headers */ = {isa = PBXBuildFile; fileRef = 86704B8312DBA33700A9FE7B /* YarrPattern.h */; settings = {ATTRIBUTES = (Private, ); }; };
                86880F1F14328BB900B08D42 /* DFGSpeculativeJIT32_64.cpp in Sources */ = {isa = PBXBuildFile; fileRef = 86880F1B14328BB900B08D42 /* DFGSpeculativeJIT32_64.cpp */; };
index f417294..1b85ad1 100644 (file)
@@ -1,3 +1,87 @@
+2015-01-12  Benjamin Poulain  <benjamin@webkit.org>
+
+        Add basic pattern matching support to the url filters
+        https://bugs.webkit.org/show_bug.cgi?id=140283
+
+        Reviewed by Andreas Kling.
+
+        This patch adds some basic generic pattern support for the url filters
+        of ContentExtensions.
+
+        Instead of writting a new parser, I re-used Gavin's parser for JavaScript
+        RegExp.
+
+        This patch only implements the very basic stuffs: transition on any character
+        and repetition.
+
+        * WebCore.xcodeproj/project.pbxproj:
+        * contentextensions/ContentExtensionsBackend.cpp:
+        (WebCore::ContentExtensions::ContentExtensionsBackend::setRuleList):
+        Use the new parser.
+
+        * contentextensions/DFA.cpp:
+        (WebCore::ContentExtensions::DFA::DFA):
+        (WebCore::ContentExtensions::printRange):
+        (WebCore::ContentExtensions::printTransition):
+        (WebCore::ContentExtensions::DFA::debugPrintDot):
+        * contentextensions/NFA.cpp:
+        (WebCore::ContentExtensions::printRange):
+        (WebCore::ContentExtensions::printTransition):
+        (WebCore::ContentExtensions::NFA::debugPrintDot):
+        The graphs generated with the extended patterns are vastly more complicated
+        than the old prefix matcher.
+        I changed the debug output to have a single link between any two nodes
+        instead of one per transition. This makes the graph a little more manageable.
+
+        * contentextensions/NFA.cpp:
+        (WebCore::ContentExtensions::NFA::addTransition):
+        (WebCore::ContentExtensions::NFA::addEpsilonTransition):
+        (WebCore::ContentExtensions::NFA::graphSize):
+        (WebCore::ContentExtensions::NFA::restoreToGraphSize):
+        * contentextensions/NFA.h:
+        * contentextensions/NFANode.h:
+        (WebCore::ContentExtensions::epsilonClosure):
+        The new parser can generate transitions back to the root node of index zero.
+        All the hash structures had to be updated to support this kind of key.
+
+        * contentextensions/NFAToDFA.cpp:
+        (WebCore::ContentExtensions::HashableNodeIdSetHash::hash):
+        Two tiny improvements:
+        -Don't hash zero to zero, it causes more conflicts that needed.
+        -The hash operation must use a commutative operation, otherwise the order
+         of elements can affect the hash, which is undesired for a set.
+        I'll improve this further later.
+
+        (WebCore::ContentExtensions::NFAToDFA::convert):
+
+        * contentextensions/URLFilterParser.cpp: Added.
+        (WebCore::ContentExtensions::GraphBuilder::GraphBuilder):
+        (WebCore::ContentExtensions::GraphBuilder::m_lastAtom):
+        (WebCore::ContentExtensions::GraphBuilder::finalize):
+        (WebCore::ContentExtensions::GraphBuilder::errorMessage):
+        (WebCore::ContentExtensions::GraphBuilder::atomPatternCharacter):
+        (WebCore::ContentExtensions::GraphBuilder::atomBuiltInCharacterClass):
+        (WebCore::ContentExtensions::GraphBuilder::quantifyAtom):
+        (WebCore::ContentExtensions::GraphBuilder::atomBackReference):
+        (WebCore::ContentExtensions::GraphBuilder::atomCharacterClassAtom):
+        (WebCore::ContentExtensions::GraphBuilder::assertionBOL):
+        (WebCore::ContentExtensions::GraphBuilder::assertionEOL):
+        (WebCore::ContentExtensions::GraphBuilder::assertionWordBoundary):
+        (WebCore::ContentExtensions::GraphBuilder::atomCharacterClassBegin):
+        (WebCore::ContentExtensions::GraphBuilder::atomCharacterClassRange):
+        (WebCore::ContentExtensions::GraphBuilder::atomCharacterClassBuiltIn):
+        (WebCore::ContentExtensions::GraphBuilder::atomCharacterClassEnd):
+        (WebCore::ContentExtensions::GraphBuilder::atomParenthesesSubpatternBegin):
+        (WebCore::ContentExtensions::GraphBuilder::atomParentheticalAssertionBegin):
+        (WebCore::ContentExtensions::GraphBuilder::atomParenthesesEnd):
+        (WebCore::ContentExtensions::GraphBuilder::disjunction):
+        (WebCore::ContentExtensions::GraphBuilder::hasError):
+        (WebCore::ContentExtensions::GraphBuilder::fail):
+        (WebCore::ContentExtensions::URLFilterParser::parse):
+        * contentextensions/URLFilterParser.h:
+        (WebCore::ContentExtensions::URLFilterParser::hasError):
+        (WebCore::ContentExtensions::URLFilterParser::errorMessage):
+
 2015-01-11  Sam Weinig  <sam@webkit.org>
 
         Remove support for SharedWorkers
index ffade7b..2215c7a 100644 (file)
                267725FF1A5B3AD9003C24DD /* DFANode.h in Headers */ = {isa = PBXBuildFile; fileRef = 267725F91A5B3AD9003C24DD /* DFANode.h */; };
                267726001A5B3AD9003C24DD /* NFAToDFA.cpp in Sources */ = {isa = PBXBuildFile; fileRef = 267725FA1A5B3AD9003C24DD /* NFAToDFA.cpp */; };
                267726011A5B3AD9003C24DD /* NFAToDFA.h in Headers */ = {isa = PBXBuildFile; fileRef = 267725FB1A5B3AD9003C24DD /* NFAToDFA.h */; };
+               267726041A5DF6F2003C24DD /* URLFilterParser.cpp in Sources */ = {isa = PBXBuildFile; fileRef = 267726021A5DF6F2003C24DD /* URLFilterParser.cpp */; };
+               267726051A5DF6F2003C24DD /* URLFilterParser.h in Headers */ = {isa = PBXBuildFile; fileRef = 267726031A5DF6F2003C24DD /* URLFilterParser.h */; };
                269239961505E1AA009E57FC /* JSIDBVersionChangeEvent.h in Headers */ = {isa = PBXBuildFile; fileRef = 269239921505E1AA009E57FC /* JSIDBVersionChangeEvent.h */; };
                269397211A4A412F00E8349D /* NFANode.cpp in Sources */ = {isa = PBXBuildFile; fileRef = 2693971F1A4A412F00E8349D /* NFANode.cpp */; };
                269397221A4A412F00E8349D /* NFANode.h in Headers */ = {isa = PBXBuildFile; fileRef = 269397201A4A412F00E8349D /* NFANode.h */; };
                267725F91A5B3AD9003C24DD /* DFANode.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = DFANode.h; sourceTree = "<group>"; };
                267725FA1A5B3AD9003C24DD /* NFAToDFA.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = NFAToDFA.cpp; sourceTree = "<group>"; };
                267725FB1A5B3AD9003C24DD /* NFAToDFA.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = NFAToDFA.h; sourceTree = "<group>"; };
+               267726021A5DF6F2003C24DD /* URLFilterParser.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = URLFilterParser.cpp; sourceTree = "<group>"; };
+               267726031A5DF6F2003C24DD /* URLFilterParser.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = URLFilterParser.h; sourceTree = "<group>"; };
                269239911505E1AA009E57FC /* JSIDBVersionChangeEvent.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = JSIDBVersionChangeEvent.cpp; sourceTree = "<group>"; };
                269239921505E1AA009E57FC /* JSIDBVersionChangeEvent.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = JSIDBVersionChangeEvent.h; sourceTree = "<group>"; };
                2693971F1A4A412F00E8349D /* NFANode.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = NFANode.cpp; sourceTree = "<group>"; };
                                269397201A4A412F00E8349D /* NFANode.h */,
                                267725FA1A5B3AD9003C24DD /* NFAToDFA.cpp */,
                                267725FB1A5B3AD9003C24DD /* NFAToDFA.h */,
+                               267726021A5DF6F2003C24DD /* URLFilterParser.cpp */,
+                               267726031A5DF6F2003C24DD /* URLFilterParser.h */,
                        );
                        path = contentextensions;
                        sourceTree = "<group>";
                                85992EBE0AA5069500AC0785 /* DOMHTMLLinkElement.h in Headers */,
                                85E711B80AC5D5350053270F /* DOMHTMLLinkElementInternal.h in Headers */,
                                85ECBEF30AA7626900544F0B /* DOMHTMLMapElement.h in Headers */,
+                               267726051A5DF6F2003C24DD /* URLFilterParser.h in Headers */,
                                85E711B90AC5D5350053270F /* DOMHTMLMapElementInternal.h in Headers */,
                                BC51579F0C03BBD3008BB0EE /* DOMHTMLMarqueeElement.h in Headers */,
                                BC5156EA0C03B741008BB0EE /* DOMHTMLMarqueeElementInternal.h in Headers */,
                                07B5A2DB1464320A00A81ECE /* JSTextTrackList.cpp in Sources */,
                                07B5A30D14687D7100A81ECE /* JSTextTrackListCustom.cpp in Sources */,
                                E446141A0CD6826900FADA75 /* JSTimeRanges.cpp in Sources */,
+                               267726041A5DF6F2003C24DD /* URLFilterParser.cpp in Sources */,
                                0FDA7C16188322EB00C954B5 /* JSTouch.cpp in Sources */,
                                0FDA7C18188322EB00C954B5 /* JSTouchEvent.cpp in Sources */,
                                0FDA7C1A188322EB00C954B5 /* JSTouchList.cpp in Sources */,
index a6d1aa3..023739d 100644 (file)
 #include "NFA.h"
 #include "NFAToDFA.h"
 #include "URL.h"
+#include "URLFilterParser.h"
+#include <wtf/DataLog.h>
 #include <wtf/NeverDestroyed.h>
+#include <wtf/text/CString.h>
 
 namespace WebCore {
 
@@ -55,21 +58,18 @@ void ContentExtensionsBackend::setRuleList(const String& identifier, const Vecto
     }
 
     NFA nfa;
-    unsigned rootNode = nfa.root();
     for (unsigned ruleIndex = 0; ruleIndex < ruleList.size(); ++ruleIndex) {
         const ContentExtensionRule& contentExtensionRule = ruleList[ruleIndex];
         const ContentExtensionRule::Trigger& trigger = contentExtensionRule.trigger();
         ASSERT(trigger.urlFilter.length());
 
-        unsigned lastNode = rootNode;
+        URLFilterParser urlFilterParser;
+        urlFilterParser.parse(trigger.urlFilter, ruleIndex, nfa);
 
-        for (unsigned i = 0; i < trigger.urlFilter.length(); ++i) {
-            unsigned newNode = nfa.createNode(ruleIndex);
-            nfa.addTransition(lastNode, newNode, trigger.urlFilter[i]);
-            lastNode = newNode;
+        if (urlFilterParser.hasError()) {
+            dataLogF("Error while parsing %s: %s", trigger.urlFilter.utf8().data(), urlFilterParser.errorMessage().utf8().data());
+            continue;
         }
-
-        nfa.setFinal(lastNode);
     }
 
     // FIXME: never add a DFA that only matches the empty set.
index e4abbfb..2900c69 100644 (file)
@@ -43,7 +43,7 @@ DFA::DFA(Vector<DFANode>&& nodes, unsigned rootIndex)
     : m_nodes(WTF::move(nodes))
     , m_root(rootIndex)
 {
-    ASSERT(rootIndex < nodes.size());
+    ASSERT(rootIndex < m_nodes.size());
 }
 
 DFA::DFA(const DFA& dfa)
@@ -80,6 +80,62 @@ const Vector<uint64_t>& DFA::actions(unsigned currentState) const
 }
 
 #ifndef NDEBUG
+static void printRange(bool firstRange, char rangeStart, char rangeEnd)
+{
+    if (!firstRange)
+        dataLogF(", ");
+    if (rangeStart == rangeEnd) {
+        if (rangeStart == '"' || rangeStart == '\\')
+            dataLogF("\\%c", rangeStart);
+        else if (rangeStart >= '!' && rangeStart <= '~')
+            dataLogF("%c", rangeStart);
+        else
+            dataLogF("\\\\%d", rangeStart);
+    } else
+        dataLogF("\\\\%d-\\\\%d", rangeStart, rangeEnd);
+}
+
+static void printTransition(unsigned sourceNode, const HashMap<uint16_t, unsigned>& transitions)
+{
+    if (transitions.isEmpty())
+        return;
+
+    HashMap<unsigned, Vector<uint16_t>, DefaultHash<unsigned>::Hash, WTF::UnsignedWithZeroKeyHashTraits<unsigned>> transitionsPerTarget;
+
+    // First, we build the list of transitions coming to each target node.
+    for (const auto& transition : transitions) {
+        unsigned target = transition.value;
+        transitionsPerTarget.add(target, Vector<uint16_t>());
+
+        transitionsPerTarget.find(target)->value.append(transition.key);
+    }
+
+    // Then we go over each one an display the ranges one by one.
+    for (const auto& transitionPerTarget : transitionsPerTarget) {
+        dataLogF("        %d -> %d [label=\"", sourceNode, transitionPerTarget.key);
+
+        Vector<uint16_t> incommingCharacters = transitionPerTarget.value;
+        std::sort(incommingCharacters.begin(), incommingCharacters.end());
+
+        char rangeStart = incommingCharacters.first();
+        char rangeEnd = rangeStart;
+        bool first = true;
+        for (unsigned sortedTransitionIndex = 1; sortedTransitionIndex < incommingCharacters.size(); ++sortedTransitionIndex) {
+            char nextChar = incommingCharacters[sortedTransitionIndex];
+            if (nextChar == rangeEnd+1) {
+                rangeEnd = nextChar;
+                continue;
+            }
+            printRange(first, rangeStart, rangeEnd);
+            rangeStart = rangeEnd = nextChar;
+            first = false;
+        }
+        printRange(first, rangeStart, rangeEnd);
+
+        dataLogF("\"];\n");
+    }
+}
+
 void DFA::debugPrintDot() const
 {
     dataLogF("digraph DFA_Transitions {\n");
@@ -117,10 +173,9 @@ void DFA::debugPrintDot() const
     dataLogF("    }\n");
 
     dataLogF("    {\n");
-    for (unsigned i = 0; i < m_nodes.size(); ++i) {
-        for (const auto& slot : m_nodes[i].transitions)
-            dataLogF("        %d -> %d [label=\"%c\"];\n", i, slot.value, slot.key);
-    }
+    for (unsigned i = 0; i < m_nodes.size(); ++i)
+        printTransition(i, m_nodes[i].transitions);
+
     dataLogF("    }\n");
     dataLogF("}\n");
 }
index 51e9a7e..728879d 100644 (file)
@@ -53,7 +53,7 @@ void NFA::addTransition(unsigned from, unsigned to, char character)
     ASSERT(to < m_nodes.size());
     ASSERT(character);
 
-    auto addResult = m_nodes[from].transitions.add(character, HashSet<unsigned>());
+    auto addResult = m_nodes[from].transitions.add(character, HashSet<unsigned, DefaultHash<unsigned>::Hash, WTF::UnsignedWithZeroKeyHashTraits<unsigned>>());
     addResult.iterator->value.add(to);
 }
 
@@ -62,7 +62,7 @@ void NFA::addEpsilonTransition(unsigned from, unsigned to)
     ASSERT(from < m_nodes.size());
     ASSERT(to < m_nodes.size());
 
-    auto addResult = m_nodes[from].transitions.add(epsilonTransitionCharacter, HashSet<unsigned>());
+    auto addResult = m_nodes[from].transitions.add(epsilonTransitionCharacter, HashSet<unsigned, DefaultHash<unsigned>::Hash, WTF::UnsignedWithZeroKeyHashTraits<unsigned>>());
     addResult.iterator->value.add(to);
 }
 
@@ -72,7 +72,79 @@ void NFA::setFinal(unsigned node)
     m_nodes[node].isFinal = true;
 }
 
+unsigned NFA::graphSize() const
+{
+    return m_nodes.size();
+}
+
+void NFA::restoreToGraphSize(unsigned size)
+{
+    ASSERT(size > 1);
+    ASSERT(size <= graphSize());
+
+    m_nodes.shrink(size);
+}
+
 #ifndef NDEBUG
+
+static void printRange(bool firstRange, uint16_t rangeStart, uint16_t rangeEnd, uint16_t epsilonTransitionCharacter)
+{
+    if (!firstRange)
+        dataLogF(", ");
+    if (rangeStart == rangeEnd) {
+        if (rangeStart == epsilonTransitionCharacter)
+            dataLogF("ɛ");
+        else if (rangeStart == '"' || rangeStart == '\\')
+            dataLogF("\\%c", rangeStart);
+        else if (rangeStart >= '!' && rangeStart <= '~')
+            dataLogF("%c", rangeStart);
+        else
+            dataLogF("\\\\%d", rangeStart);
+    } else {
+        if (rangeStart == 1 && rangeEnd == 127)
+            dataLogF("[any input]");
+        else
+            dataLogF("\\\\%d-\\\\%d", rangeStart, rangeEnd);
+    }
+}
+
+static void printTransition(unsigned sourceNode, const HashMap<uint16_t, HashSet<unsigned, DefaultHash<unsigned>::Hash, WTF::UnsignedWithZeroKeyHashTraits<unsigned>>>& transitions, uint16_t epsilonTransitionCharacter)
+{
+    HashMap<unsigned, HashSet<uint16_t>, DefaultHash<unsigned>::Hash, WTF::UnsignedWithZeroKeyHashTraits<unsigned>> transitionsPerTarget;
+
+    for (const auto& transition : transitions) {
+        for (unsigned targetNode : transition.value) {
+            transitionsPerTarget.add(targetNode, HashSet<uint16_t>());
+            transitionsPerTarget.find(targetNode)->value.add(transition.key);
+        }
+    }
+
+    for (const auto& transitionPerTarget : transitionsPerTarget) {
+        dataLogF("        %d -> %d [label=\"", sourceNode, transitionPerTarget.key);
+
+        Vector<uint16_t> incommingCharacters;
+        copyToVector(transitionPerTarget.value, incommingCharacters);
+        std::sort(incommingCharacters.begin(), incommingCharacters.end());
+
+        uint16_t rangeStart = incommingCharacters.first();
+        uint16_t rangeEnd = rangeStart;
+        bool first = true;
+        for (unsigned sortedTransitionIndex = 1; sortedTransitionIndex < incommingCharacters.size(); ++sortedTransitionIndex) {
+            uint16_t nextChar = incommingCharacters[sortedTransitionIndex];
+            if (nextChar == rangeEnd+1) {
+                rangeEnd = nextChar;
+                continue;
+            }
+            printRange(first, rangeStart, rangeEnd, epsilonTransitionCharacter);
+            rangeStart = rangeEnd = nextChar;
+            first = false;
+        }
+        printRange(first, rangeStart, rangeEnd, epsilonTransitionCharacter);
+
+        dataLogF("\"];\n");
+    }
+}
+
 void NFA::debugPrintDot() const
 {
     dataLogF("digraph NFA_Transitions {\n");
@@ -93,16 +165,8 @@ void NFA::debugPrintDot() const
     dataLogF("    }\n");
 
     dataLogF("    {\n");
-    for (unsigned i = 0; i < m_nodes.size(); ++i) {
-        for (const auto& slot : m_nodes[i].transitions) {
-            for (unsigned nextState : slot.value) {
-                if (slot.key == epsilonTransitionCharacter)
-                    dataLogF("        %d -> %d [label=\"ɛ\"];\n", i, nextState);
-                else
-                    dataLogF("        %d -> %d [label=\"%c\"];\n", i, nextState, slot.key);
-            }
-        }
-    }
+    for (unsigned i = 0; i < m_nodes.size(); ++i)
+        printTransition(i, m_nodes[i].transitions, epsilonTransitionCharacter);
     dataLogF("    }\n");
     dataLogF("}\n");
 }
index 7cc1445..109a699 100644 (file)
@@ -50,6 +50,9 @@ public:
     void addEpsilonTransition(unsigned from, unsigned to);
     void setFinal(unsigned node);
 
+    unsigned graphSize() const;
+    void restoreToGraphSize(unsigned);
+
 #ifndef NDEBUG
     void debugPrintDot() const;
 #endif
index 3239d66..ea24377 100644 (file)
@@ -40,7 +40,7 @@ class NFANode {
 public:
     NFANode(uint64_t ruleId);
 
-    HashMap<uint16_t, HashSet<unsigned>> transitions;
+    HashMap<uint16_t, HashSet<unsigned, DefaultHash<unsigned>::Hash, WTF::UnsignedWithZeroKeyHashTraits<unsigned>>> transitions;
     bool isFinal;
     const uint64_t ruleId;
 };
index 8666fdc..b3f3158 100644 (file)
@@ -58,7 +58,7 @@ static NodeIdSet epsilonClosure(const NodeIdSet& nodeSet, const Vector<NFANode>&
             const NFANode& node = graph[nodeId];
             auto epsilonTransitionSlot = node.transitions.find(epsilonTransitionCharacter);
             if (epsilonTransitionSlot != node.transitions.end()) {
-                const HashSet<unsigned>& targets = epsilonTransitionSlot->value;
+                const HashSet<unsigned, DefaultHash<unsigned>::Hash, WTF::UnsignedWithZeroKeyHashTraits<unsigned>>& targets = epsilonTransitionSlot->value;
                 for (unsigned targetNodeId : targets) {
                     if (!outputNodeSet.contains(targetNodeId))
                         nextGenerationDiscoveredNodes.add(targetNodeId);
@@ -148,9 +148,9 @@ private:
 struct HashableNodeIdSetHash {
     static unsigned hash(const HashableNodeIdSet& p)
     {
-        unsigned hash = 0;
+        unsigned hash = 4207445155;
         for (unsigned nodeId : p.nodeIdSet())
-            hash ^= DefaultHash<unsigned>::Hash::hash(nodeId);
+            hash += DefaultHash<unsigned>::Hash::hash(nodeId);
         return hash;
     }
 
@@ -212,7 +212,7 @@ DFA NFAToDFA::convert(const NFA& nfa)
     do {
         HashableNodeIdSet stateSet = unprocessedStateSets.takeAny();
 
-        ASSERT(!processedStateSets.contains(stateSet));
+        ASSERT(!processedStateSets.contains(stateSet.nodeIdSet()));
         processedStateSets.add(stateSet.nodeIdSet());
 
         unsigned dfaNodeId = nfaToDFANodeMap.get(stateSet);
diff --git a/Source/WebCore/contentextensions/URLFilterParser.cpp b/Source/WebCore/contentextensions/URLFilterParser.cpp
new file mode 100644 (file)
index 0000000..a3ad606
--- /dev/null
@@ -0,0 +1,245 @@
+/*
+ * Copyright (C) 2014, 2015 Apple Inc. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY APPLE INC. AND ITS CONTRIBUTORS ``AS IS''
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+ * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL APPLE INC. OR ITS CONTRIBUTORS
+ * BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+ * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+ * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+ * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF
+ * THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "config.h"
+#include "URLFilterParser.h"
+
+#if ENABLE(CONTENT_EXTENSIONS)
+
+#include "NFA.h"
+#include <JavaScriptCore/YarrParser.h>
+
+namespace WebCore {
+
+namespace ContentExtensions {
+
+class GraphBuilder {
+private:
+    struct BoundedSubGraph {
+        unsigned start;
+        unsigned end;
+    };
+public:
+    GraphBuilder(NFA& nfa, uint64_t patternId)
+        : m_nfa(nfa)
+        , m_patternId(patternId)
+        , m_activeGroup({ nfa.root(), nfa.root() })
+        , m_lastAtom(m_activeGroup)
+    {
+    }
+
+    void finalize()
+    {
+        if (hasError())
+            return;
+        if (m_activeGroup.start != m_activeGroup.end)
+            m_nfa.setFinal(m_activeGroup.end);
+        else
+            fail(ASCIILiteral("The pattern cannot match anything."));
+    }
+
+    const String& errorMessage() const
+    {
+        return m_errorMessage;
+    }
+
+    void atomPatternCharacter(UChar character)
+    {
+        if (isASCII(character)) {
+            fail(ASCIILiteral("Only ASCII characters are supported in pattern."));
+            return;
+        }
+
+        if (hasError())
+            return;
+
+        m_hasValidAtom = true;
+        unsigned newEnd = m_nfa.createNode(m_patternId);
+        m_nfa.addTransition(m_lastAtom.end, newEnd, static_cast<char>(character));
+        m_lastAtom.start = m_lastAtom.end;
+        m_lastAtom.end = newEnd;
+        m_activeGroup.end = m_lastAtom.end;
+    }
+
+    void atomBuiltInCharacterClass(JSC::Yarr::BuiltInCharacterClassID builtInCharacterClassID, bool inverted)
+    {
+        if (hasError())
+            return;
+
+        if (builtInCharacterClassID == JSC::Yarr::NewlineClassID && inverted) {
+            // FIXME: handle new line properly.
+            m_hasValidAtom = true;
+            unsigned newEnd = m_nfa.createNode(m_patternId);
+            for (unsigned i = 1; i < 128; ++i)
+                m_nfa.addTransition(m_lastAtom.end, newEnd, i);
+            m_lastAtom.start = m_lastAtom.end;
+            m_lastAtom.end = newEnd;
+            m_activeGroup.end = m_lastAtom.end;
+        } else
+            fail(ASCIILiteral("Character class is not supported."));
+    }
+
+    void quantifyAtom(unsigned minimum, unsigned maximum, bool)
+    {
+        if (hasError())
+            return;
+
+        ASSERT(m_hasValidAtom);
+        if (!m_hasValidAtom) {
+            fail(ASCIILiteral("Quantifier without corresponding atom to quantify."));
+            return;
+        }
+
+        if (!minimum && maximum == 1)
+            m_nfa.addEpsilonTransition(m_lastAtom.start, m_lastAtom.end);
+        else if (!minimum && maximum == JSC::Yarr::quantifyInfinite) {
+            m_nfa.addEpsilonTransition(m_lastAtom.start, m_lastAtom.end);
+            m_nfa.addEpsilonTransition(m_lastAtom.end, m_lastAtom.start);
+        } else if (minimum == 1 && maximum == JSC::Yarr::quantifyInfinite)
+            m_nfa.addEpsilonTransition(m_lastAtom.end, m_lastAtom.start);
+        else
+            fail(ASCIILiteral("Arbitrary atom repetitions are not supported."));
+    }
+
+    NO_RETURN_DUE_TO_ASSERT void atomBackReference(unsigned)
+    {
+        fail(ASCIILiteral("Patterns cannot contain backreferences."));
+        ASSERT_NOT_REACHED();
+    }
+
+    void atomCharacterClassAtom(UChar)
+    {
+        fail(ASCIILiteral("Character class atoms are not supported yet."));
+    }
+
+    void assertionBOL()
+    {
+        fail(ASCIILiteral("Line boundary assertions are not supported yet."));
+    }
+
+    void assertionEOL()
+    {
+        fail(ASCIILiteral("Line boundary assertions are not supported yet."));
+    }
+
+    void assertionWordBoundary(bool)
+    {
+        fail(ASCIILiteral("Word boundaries assertions are not supported yet."));
+    }
+
+    void atomCharacterClassBegin(bool = false)
+    {
+        fail(ASCIILiteral("Character class atoms are not supported yet."));
+    }
+
+    void atomCharacterClassRange(UChar, UChar)
+    {
+        fail(ASCIILiteral("Character class ranges are not supported yet."));
+    }
+
+    void atomCharacterClassBuiltIn(JSC::Yarr::BuiltInCharacterClassID, bool)
+    {
+        fail(ASCIILiteral("Buildins character class atoms are not supported yet."));
+    }
+
+    void atomCharacterClassEnd()
+    {
+        fail(ASCIILiteral("Character class are not supported yet."));
+    }
+
+    void atomParenthesesSubpatternBegin(bool = true)
+    {
+        fail(ASCIILiteral("Groups are not supported yet."));
+    }
+
+    void atomParentheticalAssertionBegin(bool = false)
+    {
+        fail(ASCIILiteral("Groups are not supported yet."));
+    }
+
+    void atomParenthesesEnd()
+    {
+        fail(ASCIILiteral("Groups are not supported yet."));
+    }
+
+    void disjunction()
+    {
+        fail(ASCIILiteral("Disjunctions are not supported yet."));
+    }
+
+
+private:
+    bool hasError() const
+    {
+        return !m_errorMessage.isNull();
+    }
+
+    void fail(const String& errorMessage)
+    {
+        if (hasError())
+            return;
+        m_errorMessage = errorMessage;
+    }
+
+    NFA& m_nfa;
+    const uint64_t m_patternId;
+
+    BoundedSubGraph m_activeGroup;
+
+    bool m_hasValidAtom = false;
+    BoundedSubGraph m_lastAtom;
+
+    String m_errorMessage;
+};
+
+void URLFilterParser::parse(const String& pattern, uint64_t patternId, NFA& nfa)
+{
+    if (!pattern.containsOnlyASCII())
+        m_errorMessage = ASCIILiteral("URLFilterParser only supports ASCII patterns.");
+    ASSERT(!pattern.isEmpty());
+
+    if (pattern.isEmpty())
+        return;
+
+    unsigned oldSize = nfa.graphSize();
+
+    GraphBuilder graphBuilder(nfa, patternId);
+    const char* error = JSC::Yarr::parse(graphBuilder, pattern, 0);
+    if (error)
+        m_errorMessage = String(error);
+    else
+        graphBuilder.finalize();
+
+    if (!error)
+        m_errorMessage = graphBuilder.errorMessage();
+
+    if (hasError())
+        nfa.restoreToGraphSize(oldSize);
+}
+
+} // namespace ContentExtensions
+} // namespace WebCore
+
+#endif // ENABLE(CONTENT_EXTENSIONS)
diff --git a/Source/WebCore/contentextensions/URLFilterParser.h b/Source/WebCore/contentextensions/URLFilterParser.h
new file mode 100644 (file)
index 0000000..3ec8024
--- /dev/null
@@ -0,0 +1,61 @@
+/*
+ * Copyright (C) 2014, 2015 Apple Inc. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY APPLE INC. AND ITS CONTRIBUTORS ``AS IS''
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+ * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL APPLE INC. OR ITS CONTRIBUTORS
+ * BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+ * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+ * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+ * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF
+ * THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef URLFilterParser_h
+#define URLFilterParser_h
+
+#if ENABLE(CONTENT_EXTENSIONS)
+
+#include <wtf/text/WTFString.h>
+
+namespace WebCore {
+
+namespace ContentExtensions {
+
+class NFA;
+
+class URLFilterParser {
+public:
+    void parse(const String& pattern, uint64_t patternId, NFA&);
+
+    bool hasError() const { return !m_errorMessage.isNull(); }
+    String errorMessage() const { return m_errorMessage; }
+
+private:
+    struct BoundedSubGraph {
+        unsigned start;
+        unsigned end;
+    };
+
+    String m_errorMessage;
+};
+
+} // namespace ContentExtensions
+
+} // namespace WebCore
+
+#endif // ENABLE(CONTENT_EXTENSIONS)
+
+#endif // URLFilterParser_h