2010-08-12 Adam Barth <abarth@webkit.org>
author eric@webkit.org <eric@webkit.org@268f45cc-cd09-0410-ab3c-d52691b4dbfc>
Sat, 14 Aug 2010 03:18:16 +0000 (03:18 +0000)
committer eric@webkit.org <eric@webkit.org@268f45cc-cd09-0410-ab3c-d52691b4dbfc>
Sat, 14 Aug 2010 03:18:16 +0000 (03:18 +0000)
        Reviewed by Eric Seidel.

        Add support for MathML entities
        https://bugs.webkit.org/show_bug.cgi?id=43949

        Test progression for proper entity support.

        * html5lib/runner-expected-html5.txt:
        * html5lib/runner-expected.txt:

2010-08-09  Adam Barth  <abarth@webkit.org>

        Reviewed by Eric Seidel.

        Add support for MathML entities
        https://bugs.webkit.org/show_bug.cgi?id=43949

        Implementing the HTML5 entity parsing algorithm requires refactoring how
        we search for entity names.  Instead of using a perfect hash, we now
        use a sorted list.  As we advance through the input, we walk down a
        binary search of the table looking for an entity.

        Using this data structure lets us keep track of whether the current
        string is a prefix of an existing entity, which we need for the
        algorithm.  In a future patch, I plan to add some indices to the
        table, which should let us narrow down the range of interesting entries
        more quickly.

        The one nasty piece of the algorithm is when we walk too far down the
        input and need to back up to a previous match.  In this patch, we
        accomplish this by rewinding the input and consuming a known number of
        characters to resync the source.
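
        The search and back-up steps described above can be sketched roughly
        as follows.  This is an illustration only, not the code in
        HTMLEntitySearch.cpp: Entry, kTable, PrefixSearch and longestEntity
        are hypothetical names, the six-entry table is a hand-written
        stand-in for the generated one, and the input is a plain std::string
        rather than a SegmentedString.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical table entry; the real generated table may be laid out differently.
struct Entry {
    std::string name; // Entity name without the leading '&'.
    unsigned value;   // Code point the entity expands to.
};

// Hand-written sample (an assumption, not the generated data), sorted by name.
static const std::vector<Entry> kTable = {
    {"gt", 0x3E}, {"gt;", 0x3E}, {"not", 0xAC}, {"not;", 0xAC},
    {"notin;", 0x2209}, {"notinva;", 0x2209},
};

class PrefixSearch {
public:
    PrefixSearch()
    {
        // The narrowing below is only valid on a sorted table.
        assert(std::is_sorted(kTable.begin(), kTable.end(),
            [](const Entry& a, const Entry& b) { return a.name < b.name; }));
    }

    // Narrow [m_first, m_last) to the entries whose next character is c.
    void advance(char c)
    {
        const std::size_t pos = m_length++;
        const int wanted = static_cast<unsigned char>(c);
        // Entries that end at pos sort before all longer ones, hence -1.
        auto charAt = [pos](const Entry& e) {
            return pos < e.name.size() ? static_cast<unsigned char>(e.name[pos]) : -1;
        };
        const auto begin = kTable.begin();
        const auto lo = std::lower_bound(begin + m_first, begin + m_last, wanted,
            [&](const Entry& e, int ch) { return charAt(e) < ch; });
        const auto hi = std::upper_bound(lo, begin + m_last, wanted,
            [&](int ch, const Entry& e) { return ch < charAt(e); });
        m_first = static_cast<std::size_t>(lo - begin);
        m_last = static_cast<std::size_t>(hi - begin);
        // The shortest candidate sorts first; if it ends here, it is a full match.
        if (isPrefix() && kTable[m_first].name.size() == m_length)
            m_best = &kTable[m_first];
    }

    // True while the consumed characters are a prefix of at least one entry.
    bool isPrefix() const { return m_first < m_last; }
    std::size_t bestLength() const { return m_best ? m_best->name.size() : 0; }
    unsigned bestValue() const { return m_best ? m_best->value : 0; }

private:
    std::size_t m_first = 0;
    std::size_t m_last = kTable.size();
    std::size_t m_length = 0;
    const Entry* m_best = nullptr; // Longest complete match seen so far.
};

// Returns the code point of the longest entity at the start of input, or 0.
// consumed is set to the number of characters that belong to that entity.
unsigned longestEntity(const std::string& input, std::size_t& consumed)
{
    PrefixSearch search;
    for (char c : input) {
        search.advance(c);
        if (!search.isPrefix())
            break;
    }
    // We may have walked past the last complete match; report how far to
    // back up instead of how far we read.
    consumed = search.bestLength();
    return search.bestValue();
}
```

        With this sample table, longestEntity("notit; I tell you", consumed)
        walks as far as "noti" before the range becomes empty, then backs up
        to "not" and reports U+00AC with consumed == 3.  For
        "notin; I tell you" it reports U+2209 with consumed == 6.  A caller
        holding a stream would rewind and re-consume exactly that many
        characters.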

        * WebCore.xcodeproj/project.pbxproj:
        * html/HTMLEntityParser.cpp:
        (WebCore::consumeHTMLEntity):
        * html/HTMLEntitySearch.cpp: Added.
        (WebCore::):
        (WebCore::HTMLEntitySearch::HTMLEntitySearch):
        (WebCore::HTMLEntitySearch::compare):
        (WebCore::HTMLEntitySearch::findStart):
        (WebCore::HTMLEntitySearch::findEnd):
        (WebCore::HTMLEntitySearch::advance):
        * html/HTMLEntitySearch.h: Added.
        (WebCore::HTMLEntitySearch::isEntityPrefix):
        (WebCore::HTMLEntitySearch::currentValue):
        (WebCore::HTMLEntitySearch::lastMatch):
        (WebCore::HTMLEntitySearch::):
        (WebCore::HTMLEntitySearch::fail):
        * html/HTMLEntityTable.h: Added.
        (WebCore::HTMLEntityTableEntry::lastCharacter):

2010-08-12  Adam Barth  <abarth@webkit.org>

        Reviewed by Eric Seidel.

        Add support for MathML entities
        https://bugs.webkit.org/show_bug.cgi?id=43949

        A script for generating the C++ state data structure describing all the
        entities from a JSON description.

        * Scripts/create-html-entity-table: Added.

git-svn-id: https://svn.webkit.org/repository/webkit/trunk@65351 268f45cc-cd09-0410-ab3c-d52691b4dbfc

21 files changed:
LayoutTests/ChangeLog
LayoutTests/html5lib/runner-expected-html5.txt
LayoutTests/html5lib/runner-expected.txt
WebCore/CMakeLists.txt
WebCore/ChangeLog
WebCore/DerivedSources.make
WebCore/GNUmakefile.am
WebCore/WebCore.gyp/WebCore.gyp
WebCore/WebCore.gypi
WebCore/WebCore.pri
WebCore/WebCore.pro
WebCore/WebCore.vcproj/WebCore.vcproj
WebCore/WebCore.xcodeproj/project.pbxproj
WebCore/html/HTMLEntityNames.gperf [deleted file]
WebCore/html/HTMLEntityParser.cpp
WebCore/html/HTMLEntitySearch.cpp [new file with mode: 0644]
WebCore/html/HTMLEntitySearch.h [new file with mode: 0644]
WebCore/html/HTMLEntityTable.h [new file with mode: 0644]
WebCore/make-hash-tools.pl
WebKitTools/ChangeLog
WebKitTools/Scripts/create-html-entity-table [new file with mode: 0755]

index a8a1f3e..5892cf0 100644
@@ -1,3 +1,15 @@
+2010-08-12  Adam Barth  <abarth@webkit.org>
+
+        Reviewed by Eric Seidel.
+
+        Add support for MathML entities
+        https://bugs.webkit.org/show_bug.cgi?id=43949
+
+        Test progression for proper entity support.
+
+        * html5lib/runner-expected-html5.txt:
+        * html5lib/runner-expected.txt:
+
 2010-08-13  Mihai Parparita  <mihaip@chromium.org>
 
         Reviewed by Dimitri Glazkov.
index 2eb01b4..84c7217 100644
@@ -118,92 +118,10 @@ resources/doctype01.dat: PASS
 
 resources/scriptdata01.dat: PASS
 
-resources/html5test-com.dat:
-7
-9
-10
-11
-
-Test 7 of 24 in resources/html5test-com.dat failed. Input:
-&lang;&rang;
-Got:
-| <html>
-|   <head>
-|   <body>
-|     "〈〉"
-Expected:
-| <html>
-|   <head>
-|   <body>
-|     "⟨⟩"
-
-Test 9 of 24 in resources/html5test-com.dat failed. Input:
-&ImaginaryI;
-Got:
-| <html>
-|   <head>
-|   <body>
-|     "&ImaginaryI;"
-Expected:
-| <html>
-|   <head>
-|   <body>
-|     "ⅈ"
-
-Test 10 of 24 in resources/html5test-com.dat failed. Input:
-&Kopf;
-Got:
-| <html>
-|   <head>
-|   <body>
-|     "&Kopf;"
-Expected:
-| <html>
-|   <head>
-|   <body>
-|     "𝕂"
+resources/html5test-com.dat: PASS
 
-Test 11 of 24 in resources/html5test-com.dat failed. Input:
-&notinva;
-Got:
-| <html>
-|   <head>
-|   <body>
-|     "&notinva;"
-Expected:
-| <html>
-|   <head>
-|   <body>
-|     "∉"
-resources/entities01.dat:
-2
-5
-
-Test 2 of 68 in resources/entities01.dat failed. Input:
-FOO&gtBAR
-Got:
-| <html>
-|   <head>
-|   <body>
-|     "FOO&gtBAR"
-Expected:
-| <html>
-|   <head>
-|   <body>
-|     "FOO>BAR"
+resources/entities01.dat: PASS
 
-Test 5 of 68 in resources/entities01.dat failed. Input:
-I'm &notit; I tell you
-Got:
-| <html>
-|   <head>
-|   <body>
-|     "I'm &notit; I tell you"
-Expected:
-| <html>
-|   <head>
-|   <body>
-|     "I'm ¬it; I tell you"
 resources/entities02.dat: PASS
 
 resources/comments01.dat: PASS
index 8a8f7be..c9ae245 100644
@@ -191,92 +191,10 @@ resources/doctype01.dat: PASS
 
 resources/scriptdata01.dat: PASS
 
-resources/html5test-com.dat:
-7
-9
-10
-11
-
-Test 7 of 24 in resources/html5test-com.dat failed. Input:
-&lang;&rang;
-Got:
-| <html>
-|   <head>
-|   <body>
-|     "〈〉"
-Expected:
-| <html>
-|   <head>
-|   <body>
-|     "⟨⟩"
-
-Test 9 of 24 in resources/html5test-com.dat failed. Input:
-&ImaginaryI;
-Got:
-| <html>
-|   <head>
-|   <body>
-|     "&ImaginaryI;"
-Expected:
-| <html>
-|   <head>
-|   <body>
-|     "ⅈ"
-
-Test 10 of 24 in resources/html5test-com.dat failed. Input:
-&Kopf;
-Got:
-| <html>
-|   <head>
-|   <body>
-|     "&Kopf;"
-Expected:
-| <html>
-|   <head>
-|   <body>
-|     "𝕂"
+resources/html5test-com.dat: PASS
 
-Test 11 of 24 in resources/html5test-com.dat failed. Input:
-&notinva;
-Got:
-| <html>
-|   <head>
-|   <body>
-|     "&notinva;"
-Expected:
-| <html>
-|   <head>
-|   <body>
-|     "∉"
-resources/entities01.dat:
-2
-5
+resources/entities01.dat: PASS
 
-Test 2 of 68 in resources/entities01.dat failed. Input:
-FOO&gtBAR
-Got:
-| <html>
-|   <head>
-|   <body>
-|     "FOO&gtBAR"
-Expected:
-| <html>
-|   <head>
-|   <body>
-|     "FOO>BAR"
-
-Test 5 of 68 in resources/entities01.dat failed. Input:
-I'm &notit; I tell you
-Got:
-| <html>
-|   <head>
-|   <body>
-|     "I'm &notit; I tell you"
-Expected:
-| <html>
-|   <head>
-|   <body>
-|     "I'm ¬it; I tell you"
 resources/entities02.dat: PASS
 
 resources/comments01.dat: PASS
index 95897ae..7d8e70c 100644
@@ -971,6 +971,7 @@ SET(WebCore_SOURCES
     html/HTMLDocument.cpp
     html/HTMLElement.cpp
     html/HTMLElementStack.cpp
+    html/HTMLEntitySearch.cpp
     html/HTMLEmbedElement.cpp
     html/HTMLFieldSetElement.cpp
     html/HTMLFormattingElementList.cpp
index af43244..e26427c 100644
@@ -1,3 +1,45 @@
+2010-08-09  Adam Barth  <abarth@webkit.org>
+
+        Reviewed by Eric Seidel.
+
+        Add support for MathML entities
+        https://bugs.webkit.org/show_bug.cgi?id=43949
+
+        Implementing the HTML5 entity parsing algorithm requires refactoring how
+        we search for entity names.  Instead of using a perfect hash, we now
+        use a sorted list.  As we advance through the input, we walk down a
+        binary search of the table looking for an entity.
+
+        Using this data structure lets us keep track of whether the current
+        string is a prefix of an existing entity, which we need for the
+        algorithm.  In a future patch, I plan to add some indices to the
+        table, which should let us narrow down the range of interesting entries
+        more quickly.
+
+        The one nasty piece of the algorithm is if we walk too far down the
+        input and we need to back up to a previous match.  In this patch, we
+        accomplish this by rewinding the input and consuming a known number of
+        characters to resync the source.
+
+        * WebCore.xcodeproj/project.pbxproj:
+        * html/HTMLEntityParser.cpp:
+        (WebCore::consumeHTMLEntity):
+        * html/HTMLEntitySearch.cpp: Added.
+        (WebCore::):
+        (WebCore::HTMLEntitySearch::HTMLEntitySearch):
+        (WebCore::HTMLEntitySearch::compare):
+        (WebCore::HTMLEntitySearch::findStart):
+        (WebCore::HTMLEntitySearch::findEnd):
+        (WebCore::HTMLEntitySearch::advance):
+        * html/HTMLEntitySearch.h: Added.
+        (WebCore::HTMLEntitySearch::isEntityPrefix):
+        (WebCore::HTMLEntitySearch::currentValue):
+        (WebCore::HTMLEntitySearch::lastMatch):
+        (WebCore::HTMLEntitySearch::):
+        (WebCore::HTMLEntitySearch::fail):
+        * html/HTMLEntityTable.h: Added.
+        (WebCore::HTMLEntityTableEntry::lastCharacter):
+
 2010-08-13  Tony Gentilcore  <tonyg@chromium.org>
 
         Reviewed by Eric Seidel.
index 37c2f10..bda4a7d 100644
@@ -505,7 +505,7 @@ all : \
     ColorData.cpp \
     DocTypeStrings.cpp \
     HTMLElementFactory.cpp \
-    HTMLEntityNames.cpp \
+    HTMLEntityTable.cpp \
     HTMLNames.cpp \
     WMLElementFactory.cpp \
     WMLNames.cpp \
@@ -600,8 +600,8 @@ DocTypeStrings.cpp : html/DocTypeStrings.gperf $(WebCore)/make-hash-tools.pl
 
 # HTML entity names
 
-HTMLEntityNames.cpp : html/HTMLEntityNames.gperf $(WebCore)/make-hash-tools.pl
-       perl $(WebCore)/make-hash-tools.pl . $(WebCore)/html/HTMLEntityNames.gperf
+HTMLEntityTable.cpp : html/HTMLEntityNames.json $(WebCore)/../WebKitTools/Scripts/create-html-entity-table
+       python $(WebCore)/../WebKitTools/Scripts/create-html-entity-table -o HTMLEntityTable.cpp $(WebCore)/html/HTMLEntityNames.json
 
 # --------
 
index adab026..6237fc7 100644
@@ -92,7 +92,7 @@ webcore_built_sources += \
        DerivedSources/WebCore/CSSValueKeywords.h \
        DerivedSources/WebCore/HTMLElementFactory.cpp \
        DerivedSources/WebCore/HTMLElementFactory.h \
-       DerivedSources/WebCore/HTMLEntityNames.cpp \
+       DerivedSources/WebCore/HTMLEntityTable.cpp \
        DerivedSources/WebCore/HTMLNames.cpp \
        DerivedSources/WebCore/HTMLNames.h \
        DerivedSources/WebCore/InspectorBackendDispatcher.cpp \
@@ -1427,6 +1427,8 @@ webcore_sources += \
        WebCore/html/HTMLElement.h \
        WebCore/html/HTMLElementStack.cpp \
        WebCore/html/HTMLElementStack.h \
+       WebCore/html/HTMLEntitySearch.cpp \
+       WebCore/html/HTMLEntitySearch.h \
        WebCore/html/HTMLEmbedElement.cpp \
        WebCore/html/HTMLEmbedElement.h \
        WebCore/html/HTMLFieldSetElement.cpp \
@@ -4395,8 +4397,8 @@ DerivedSources/WebCore/DocTypeStrings.cpp : $(WebCore)/html/DocTypeStrings.gperf
        $(PERL) $(WebCore)/make-hash-tools.pl $(GENSOURCES_WEBCORE) $(WebCore)/html/DocTypeStrings.gperf
 
 # HTML entity names
-DerivedSources/WebCore/HTMLEntityNames.cpp : $(WebCore)/html/HTMLEntityNames.gperf $(WebCore)/make-hash-tools.pl
-       $(PERL) $(WebCore)/make-hash-tools.pl $(GENSOURCES_WEBCORE) $(WebCore)/html/HTMLEntityNames.gperf
+DerivedSources/WebCore/HTMLEntityTable.cpp : $(WebCore)/html/HTMLEntityNames.json $(WebCore)/../WebKitTools/Scripts/create-html-entity-table
+       $(PYTHON) $(WebCore)/../WebKitTools/Scripts/create-html-entity-table -o $(GENSOURCES_WEBCORE)/HTMLEntityTable.cpp $(WebCore)/html/HTMLEntityNames.json
 
 # color names
 DerivedSources/WebCore/ColorData.cpp: $(WebCore)/platform/ColorData.gperf $(WebCore)/make-hash-tools.pl
index a28ee5d..30b9633 100644
 
         # gperf rule
         '../html/DocTypeStrings.gperf',
-        '../html/HTMLEntityNames.gperf',
         '../platform/ColorData.gperf',
 
+        # json rule
+        '../html/HTMLEntityNames.json',
+
         # idl rules
         '<@(bindings_idl_files)',
       ],
           'outputs': [
             '<(SHARED_INTERMEDIATE_DIR)/webkit/<(RULE_INPUT_ROOT).cpp',
           ],
-          'dependencies': [
+          'inputs': [
             '../make-hash-tools.pl',
           ],
           'action': [
           ],
           'process_outputs_as_sources': 0,
         },
+        {
+          'rule_name': 'json',
+          'extension': 'json',
+          #
+          # json outputs are generated by WebKitTools/Scripts/create-html-entity-table
+          #
+          'outputs': [
+            '<(SHARED_INTERMEDIATE_DIR)/webkit/HTMLEntityTable.cpp',
+          ],
+          'inputs': [
+            '../../WebKitTools/Scripts/create-html-entity-table',
+          ],
+          'action': [
+            'python',
+            '../../WebKitTools/Scripts/create-html-entity-table',
+            '-o',
+            '<(SHARED_INTERMEDIATE_DIR)/webkit/HTMLEntityTable.cpp',
+            '<(RULE_INPUT_PATH)',
+          ],
+        },
         # Rule to build generated JavaScript (V8) bindings from .idl source.
         {
           'rule_name': 'binding',
index 69f3f99..a9b9704 100644
             'html/HTMLElement.h',
             'html/HTMLElementStack.cpp',
             'html/HTMLElementStack.h',
+            'html/HTMLEntitySearch.cpp',
+            'html/HTMLEntitySearch.h',
             'html/HTMLEmbedElement.cpp',
             'html/HTMLEmbedElement.h',
             'html/HTMLFieldSetElement.cpp',
index b0effee..71818c2 100644
@@ -29,7 +29,7 @@ XML_NAMES = $$PWD/xml/xmlattrs.in
 
 XMLNS_NAMES = $$PWD/xml/xmlnsattrs.in
 
-ENTITIES_GPERF = $$PWD/html/HTMLEntityNames.gperf
+HTML_ENTITIES = $$PWD/html/HTMLEntityNames.json
 
 COLORDATA_GPERF = $$PWD/platform/ColorData.gperf
 
@@ -590,12 +590,12 @@ xmlnames.commands = perl -I$$PWD/bindings/scripts $$xmlnames.wkScript --attrs $$
 addExtraCompiler(xmlnames)
 
 # GENERATOR 8-A:
-entities.output = $${WC_GENERATED_SOURCES_DIR}/HTMLEntityNames.cpp
-entities.input = ENTITIES_GPERF
-entities.wkScript = $$PWD/make-hash-tools.pl
-entities.commands = perl $$entities.wkScript $${WC_GENERATED_SOURCES_DIR} $$ENTITIES_GPERF
+entities.output = $${WC_GENERATED_SOURCES_DIR}/HTMLEntityTable.cpp
+entities.input = HTML_ENTITIES
+entities.wkScript = $$PWD/../WebKitTools/Scripts/create-html-entity-table
+entities.commands = python $$entities.wkScript -o $${WC_GENERATED_SOURCES_DIR}/HTMLEntityTable.cpp $$HTML_ENTITIES
 entities.clean = ${QMAKE_FILE_OUT}
-entities.depends = $$PWD/make-hash-tools.pl
+entities.depends = $$PWD/../WebKitTools/Scripts/create-html-entity-table
 addExtraCompiler(entities)
 
 # GENERATOR 8-B:
index 1ff749d..bb8e978 100644
@@ -671,6 +671,7 @@ SOURCES += \
     html/HTMLDocument.cpp \
     html/HTMLElement.cpp \
     html/HTMLElementStack.cpp \
+    html/HTMLEntitySearch.cpp \
     html/HTMLEmbedElement.cpp \
     html/HTMLFieldSetElement.cpp \
     html/HTMLFontElement.cpp \
index d47bb23..0cf3344 100644
                                >
                        </File>
                        <File
+                               RelativePath="..\html\HTMLEntitySearch.cpp"
+                               >
+                       </File>
+                       <File
+                               RelativePath="..\html\HTMLEntitySearch.h"
+                               >
+                       </File>
+                       <File
                                RelativePath="..\html\HTMLEmbedElement.cpp"
                                >
                                <FileConfiguration
index f12bcc2..c399e6c 100644
                A8A564A611DC0E59003AC2F0 /* HTMLFormattingElementList.cpp in Sources */ = {isa = PBXBuildFile; fileRef = A8A564A411DC0E59003AC2F0 /* HTMLFormattingElementList.cpp */; };
                A8A909AC0CBCD6B50029B807 /* RenderSVGTransformableContainer.h in Headers */ = {isa = PBXBuildFile; fileRef = A8A909AA0CBCD6B50029B807 /* RenderSVGTransformableContainer.h */; };
                A8A909AD0CBCD6B50029B807 /* RenderSVGTransformableContainer.cpp in Sources */ = {isa = PBXBuildFile; fileRef = A8A909AB0CBCD6B50029B807 /* RenderSVGTransformableContainer.cpp */; };
+               A8BC044E1214EB2A00B5F122 /* HTMLEntitySearch.cpp in Sources */ = {isa = PBXBuildFile; fileRef = 970C4FDF1211266200C3D393 /* HTMLEntitySearch.cpp */; };
+               A8BC044F1214EB2B00B5F122 /* HTMLEntitySearch.h in Headers */ = {isa = PBXBuildFile; fileRef = 970C4FE01211266200C3D393 /* HTMLEntitySearch.h */; };
+               A8BC04921214F69600B5F122 /* HTMLEntityTable.cpp in Sources */ = {isa = PBXBuildFile; fileRef = A8BC04911214F69600B5F122 /* HTMLEntityTable.cpp */; };
                A8BCFD05120A046100B5F122 /* SVGPathSeg.cpp in Sources */ = {isa = PBXBuildFile; fileRef = A8BCFD04120A046100B5F122 /* SVGPathSeg.cpp */; };
                A8C2280E11D4A59700D5A7D3 /* DocumentParser.cpp in Sources */ = {isa = PBXBuildFile; fileRef = A8C2280D11D4A59700D5A7D3 /* DocumentParser.cpp */; };
                A8C228A111D5722E00D5A7D3 /* DecodedDataDocumentParser.h in Headers */ = {isa = PBXBuildFile; fileRef = A8C2289F11D5722E00D5A7D3 /* DecodedDataDocumentParser.h */; };
                97059974107D975200A50A7C /* PolicyCallback.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = PolicyCallback.h; sourceTree = "<group>"; };
                97059975107D975200A50A7C /* PolicyChecker.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = PolicyChecker.cpp; sourceTree = "<group>"; };
                97059976107D975200A50A7C /* PolicyChecker.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = PolicyChecker.h; sourceTree = "<group>"; };
+               970C4FDF1211266200C3D393 /* HTMLEntitySearch.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = HTMLEntitySearch.cpp; sourceTree = "<group>"; };
+               970C4FE01211266200C3D393 /* HTMLEntitySearch.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = HTMLEntitySearch.h; sourceTree = "<group>"; };
+               970C4FE11211266200C3D393 /* HTMLEntityTable.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = HTMLEntityTable.cpp; sourceTree = "<group>"; };
+               970C4FE21211266200C3D393 /* HTMLEntityTable.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = HTMLEntityTable.h; sourceTree = "<group>"; };
                9719AEFF11D09F2C00D45831 /* HTMLInputStream.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = HTMLInputStream.h; sourceTree = "<group>"; };
                9738899E116EA9DC00ADF313 /* DocumentWriter.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = DocumentWriter.cpp; sourceTree = "<group>"; };
                9738899F116EA9DC00ADF313 /* DocumentWriter.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = DocumentWriter.h; sourceTree = "<group>"; };
                A8A564A411DC0E59003AC2F0 /* HTMLFormattingElementList.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = HTMLFormattingElementList.cpp; sourceTree = "<group>"; };
                A8A909AA0CBCD6B50029B807 /* RenderSVGTransformableContainer.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = RenderSVGTransformableContainer.h; sourceTree = "<group>"; };
                A8A909AB0CBCD6B50029B807 /* RenderSVGTransformableContainer.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = RenderSVGTransformableContainer.cpp; sourceTree = "<group>"; };
+               A8BC04911214F69600B5F122 /* HTMLEntityTable.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = HTMLEntityTable.cpp; sourceTree = "<group>"; };
                A8BCFD04120A046100B5F122 /* SVGPathSeg.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = SVGPathSeg.cpp; sourceTree = "<group>"; };
                A8C2280D11D4A59700D5A7D3 /* DocumentParser.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = DocumentParser.cpp; sourceTree = "<group>"; };
                A8C2289F11D5722E00D5A7D3 /* DecodedDataDocumentParser.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = DecodedDataDocumentParser.h; sourceTree = "<group>"; };
                E1FF57A50F01256B00891EBB /* ThreadGlobalData.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = ThreadGlobalData.cpp; sourceTree = "<group>"; };
                E406F3FA1198304D009D59D6 /* DocTypeStrings.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = DocTypeStrings.cpp; sourceTree = "<group>"; };
                E406F3FB1198307D009D59D6 /* ColorData.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = ColorData.cpp; sourceTree = "<group>"; };
-               E406F4021198329A009D59D6 /* HTMLEntityNames.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = HTMLEntityNames.cpp; sourceTree = "<group>"; };
                E415F10C0D9A05870033CE97 /* ElementTimeControl.idl */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = text; path = ElementTimeControl.idl; sourceTree = "<group>"; };
                E415F1680D9A165D0033CE97 /* DOMElementTimeControl.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = DOMElementTimeControl.h; sourceTree = "<group>"; };
                E415F1830D9A1A830033CE97 /* ElementTimeControl.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = ElementTimeControl.h; sourceTree = "<group>"; };
                                E406F3FA1198304D009D59D6 /* DocTypeStrings.cpp */,
                                A17C81200F2A5CF7005DAAEB /* HTMLElementFactory.cpp */,
                                A17C81210F2A5CF7005DAAEB /* HTMLElementFactory.h */,
-                               E406F4021198329A009D59D6 /* HTMLEntityNames.cpp */,
+                               A8BC04911214F69600B5F122 /* HTMLEntityTable.cpp */,
                                A8D06B380A265DCD005E7203 /* HTMLNames.cpp */,
                                A8D06B370A265DCD005E7203 /* HTMLNames.h */,
                                938E65F609F0985D008A48EC /* JSHTMLElementWrapperFactory.cpp */,
                                859128790AB222EC00202265 /* HTMLEmbedElement.idl */,
                                976E895E11C0CA3A00EA9CA9 /* HTMLEntityParser.cpp */,
                                976E895F11C0CA3A00EA9CA9 /* HTMLEntityParser.h */,
+                               970C4FDF1211266200C3D393 /* HTMLEntitySearch.cpp */,
+                               970C4FE01211266200C3D393 /* HTMLEntitySearch.h */,
+                               970C4FE11211266200C3D393 /* HTMLEntityTable.cpp */,
+                               970C4FE21211266200C3D393 /* HTMLEntityTable.h */,
                                A81369B9097374F500D74463 /* HTMLFieldSetElement.cpp */,
                                A81369B8097374F500D74463 /* HTMLFieldSetElement.h */,
                                1AE2A9F40A1CDA5700B42B25 /* HTMLFieldSetElement.idl */,
                                97DD4D870FDF4D6E00ECF9A4 /* XSSAuditor.h in Headers */,
                                CE172E011136E8CE0062A533 /* ZoomMode.h in Headers */,
                                2EED57FE1214A9C2007656BB /* ThreadableBlobRegistry.h in Headers */,
+                               A8BC044F1214EB2B00B5F122 /* HTMLEntitySearch.h in Headers */,
                        );
                        runOnlyForDeploymentPostprocessing = 0;
                };
                                E1BE512D0CF6C512002EA959 /* XSLTUnicodeSort.cpp in Sources */,
                                97DD4D860FDF4D6E00ECF9A4 /* XSSAuditor.cpp in Sources */,
                                2EED57FD1214A9C2007656BB /* ThreadableBlobRegistry.cpp in Sources */,
+                               A8BC044E1214EB2A00B5F122 /* HTMLEntitySearch.cpp in Sources */,
+                               A8BC04921214F69600B5F122 /* HTMLEntityTable.cpp in Sources */,
                        );
                        runOnlyForDeploymentPostprocessing = 0;
                };
diff --git a/WebCore/html/HTMLEntityNames.gperf b/WebCore/html/HTMLEntityNames.gperf
deleted file mode 100644
index c665efe..0000000
+++ /dev/null
@@ -1,303 +0,0 @@
-%{
-/*
-     Copyright (C) 1999 Lars Knoll (knoll@mpi-hd.mpg.de)
-     Copyright (C) 2002, 2003, 2004, 2005 Apple Inc. All rights reserved.
-  
-     This library is free software; you can redistribute it and/or
-     modify it under the terms of the GNU Library General Public
-     License as published by the Free Software Foundation; either
-     version 2 of the License, or (at your option) any later version.
-  
-     This library is distributed in the hope that it will be useful,
-     but WITHOUT ANY WARRANTY; without even the implied warranty of
-     MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-     Library General Public License for more details.
-  
-     You should have received a copy of the GNU Library General Public License
-     along with this library; see the file COPYING.LIB.  If not, write to
-     the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor,
-     Boston, MA 02110-1301, USA.
-  
-  ----------------------------------------------------------------------------
-  
-    HTMLEntityNames.gperf: input file to generate a hash table for entities
-    HTMLEntityNames.cpp: DO NOT EDIT! generated by WebCore/make-hash-tools.pl
-*/
-%}
-%struct-type
-struct Entity {
-    const char *name;
-    int code;
-};
-%language=ANSI-C
-%readonly-tables
-%global-table
-%compare-strncmp
-%define lookup-function-name findEntity
-%define hash-function-name entity_hash_function
-%includes
-%enum
-%%
-AElig, 0x00c6
-AMP, 38
-Aacute, 0x00c1
-Acirc, 0x00c2
-Agrave, 0x00c0
-Alpha, 0x0391
-Aring, 0x00c5
-Atilde, 0x00c3
-Auml, 0x00c4
-Beta, 0x0392
-COPY, 0x00a9
-Ccedil, 0x00c7
-Chi, 0x03a7
-Dagger, 0x2021
-Delta, 0x0394
-ETH, 0x00d0
-Eacute, 0x00c9
-Ecirc, 0x00ca
-Egrave, 0x00c8
-Epsilon, 0x0395
-Eta, 0x0397
-Euml, 0x00cb
-GT, 62
-Gamma, 0x0393
-Iacute, 0x00cd
-Icirc, 0x00ce
-Igrave, 0x00cc
-Iota, 0x0399
-Iuml, 0x00cf
-Kappa, 0x039a
-LT, 60
-Lambda, 0x039b
-Mu, 0x039c
-Ntilde, 0x00d1
-Nu, 0x039d
-OElig, 0x0152
-Oacute, 0x00d3
-Ocirc, 0x00d4
-Ograve, 0x00d2
-Omega, 0x03a9
-Omicron, 0x039f
-Oslash, 0x00d8
-Otilde, 0x00d5
-Ouml, 0x00d6
-Phi, 0x03a6
-Pi, 0x03a0
-Prime, 0x2033
-Psi, 0x03a8
-QUOT, 34
-REG, 0x00ae
-Rho, 0x03a1
-Scaron, 0x0160
-Sigma, 0x03a3
-THORN, 0x00de
-Tau, 0x03a4
-Theta, 0x0398
-Uacute, 0x00da
-Ucirc, 0x00db
-Ugrave, 0x00d9
-Upsilon, 0x03a5
-Uuml, 0x00dc
-Xi, 0x039e
-Yacute, 0x00dd
-Yuml, 0x0178
-Zeta, 0x0396
-aacute, 0x00e1
-acirc, 0x00e2
-acute, 0x00b4
-aelig, 0x00e6
-agrave, 0x00e0
-alefsym, 0x2135
-alpha, 0x03b1
-amp, 38
-and, 0x2227
-ang, 0x2220
-apos, 0x0027
-aring, 0x00e5
-asymp, 0x2248
-atilde, 0x00e3
-auml, 0x00e4
-bdquo, 0x201e
-beta, 0x03b2
-brvbar, 0x00a6
-bull, 0x2022
-cap, 0x2229
-ccedil, 0x00e7
-cedil, 0x00b8
-cent, 0x00a2
-chi, 0x03c7
-circ, 0x02c6
-clubs, 0x2663
-cong, 0x2245
-copy, 0x00a9
-crarr, 0x21b5
-cup, 0x222a
-curren, 0x00a4
-dArr, 0x21d3
-dagger, 0x2020
-darr, 0x2193
-deg, 0x00b0
-delta, 0x03b4
-diams, 0x2666
-divide, 0x00f7
-eacute, 0x00e9
-ecirc, 0x00ea
-egrave, 0x00e8
-empty, 0x2205
-emsp, 0x2003
-ensp, 0x2002
-epsilon, 0x03b5
-equiv, 0x2261
-eta, 0x03b7
-eth, 0x00f0
-euml, 0x00eb
-euro, 0x20ac
-exist, 0x2203
-fnof, 0x0192
-forall, 0x2200
-frac12, 0x00bd
-frac14, 0x00bc
-frac34, 0x00be
-frasl, 0x2044
-gamma, 0x03b3
-ge, 0x2265
-gt, 62
-hArr, 0x21d4
-harr, 0x2194
-hearts, 0x2665
-hellip, 0x2026
-iacute, 0x00ed
-icirc, 0x00ee
-iexcl, 0x00a1
-igrave, 0x00ec
-image, 0x2111
-infin, 0x221e
-int, 0x222b
-iota, 0x03b9
-iquest, 0x00bf
-isin, 0x2208
-iuml, 0x00ef
-kappa, 0x03ba
-lArr, 0x21d0
-lambda, 0x03bb
-lang, 0x3008
-laquo, 0x00ab
-larr, 0x2190
-lceil, 0x2308
-ldquo, 0x201c
-le, 0x2264
-lfloor, 0x230a
-lowast, 0x2217
-loz, 0x25ca
-lrm, 0x200e
-lsaquo, 0x2039
-lsquo, 0x2018
-lt, 60
-macr, 0x00af
-mdash, 0x2014
-micro, 0x00b5
-middot, 0x00b7
-minus, 0x2212
-mu, 0x03bc
-nabla, 0x2207
-nbsp, 0x00a0
-ndash, 0x2013
-ne, 0x2260
-ni, 0x220b
-not, 0x00ac
-notin, 0x2209
-nsub, 0x2284
-nsup, 0x2285
-ntilde, 0x00f1
-nu, 0x03bd
-oacute, 0x00f3
-ocirc, 0x00f4
-oelig, 0x0153
-ograve, 0x00f2
-oline, 0x203e
-omega, 0x03c9
-omicron, 0x03bf
-oplus, 0x2295
-or, 0x2228
-ordf, 0x00aa
-ordm, 0x00ba
-oslash, 0x00f8
-otilde, 0x00f5
-otimes, 0x2297
-ouml, 0x00f6
-para, 0x00b6
-part, 0x2202
-percnt, 0x0025
-permil, 0x2030
-perp, 0x22a5
-phi, 0x03c6
-pi, 0x03c0
-piv, 0x03d6
-plusmn, 0x00b1
-pound, 0x00a3
-prime, 0x2032
-prod, 0x220f
-prop, 0x221d
-psi, 0x03c8
-quot, 34
-rArr, 0x21d2
-radic, 0x221a
-rang, 0x3009
-raquo, 0x00bb
-rarr, 0x2192
-rceil, 0x2309
-rdquo, 0x201d
-real, 0x211c
-reg, 0x00ae
-rfloor, 0x230b
-rho, 0x03c1
-rlm, 0x200f
-rsaquo, 0x203a
-rsquo, 0x2019
-sbquo, 0x201a
-scaron, 0x0161
-sdot, 0x22c5
-sect, 0x00a7
-shy, 0x00ad
-sigma, 0x03c3
-sigmaf, 0x03c2
-sim, 0x223c
-spades, 0x2660
-sub, 0x2282
-sube, 0x2286
-sum, 0x2211
-sup, 0x2283
-sup1, 0x00b9
-sup2, 0x00b2
-sup3, 0x00b3
-supe, 0x2287
-supl, 0x00b9
-szlig, 0x00df
-tau, 0x03c4
-there4, 0x2234
-theta, 0x03b8
-thetasym, 0x03d1
-thinsp, 0x2009
-thorn, 0x00fe
-tilde, 0x02dc
-times, 0x00d7
-trade, 0x2122
-uArr, 0x21d1
-uacute, 0x00fa
-uarr, 0x2191
-ucirc, 0x00fb
-ugrave, 0x00f9
-uml, 0x00a8
-upsih, 0x03d2
-upsilon, 0x03c5
-uuml, 0x00fc
-weierp, 0x2118
-xi, 0x03be
-yacute, 0x00fd
-yen, 0x00a5
-yuml, 0x00ff
-zeta, 0x03b6
-zwj, 0x200d
-zwnj, 0x200c
-%%
index 6bec819..af3b9f3 100644
 #include "config.h"
 #include "HTMLEntityParser.h"
 
+#include "HTMLEntitySearch.h"
+#include "HTMLEntityTable.h"
 #include <wtf/Vector.h>
 
-#include "HTMLEntityNames.cpp"
-
 using namespace WTF;
 
 namespace WebCore {
@@ -102,7 +102,6 @@ unsigned consumeHTMLEntity(SegmentedString& source, bool& notEnoughCharacters, U
     EntityState entityState = Initial;
     unsigned result = 0;
     Vector<UChar, 10> consumedCharacters;
-    Vector<char, 10> entityName;
 
     while (!source.isEmpty()) {
         UChar cc = *source;
@@ -166,7 +165,7 @@ unsigned consumeHTMLEntity(SegmentedString& source, bool& notEnoughCharacters, U
             else if (cc == ';') {
                 source.advancePastNonNewline();
                 return legalEntityFor(result);
-            } else 
+            } else
                 return legalEntityFor(result);
             break;
         }
@@ -181,48 +180,48 @@ unsigned consumeHTMLEntity(SegmentedString& source, bool& notEnoughCharacters, U
             break;
         }
         case Named: {
-            // FIXME: This code is wrong. We need to find the longest matching entity.
-            //        The examples from the spec are:
-            //            I'm &notit; I tell you
-            //            I'm &notin; I tell you
-            //        In the first case, "&not" is the entity.  In the second
-            //        case, "&notin;" is the entity.
-            // FIXME: Our list of HTML entities is incomplete.
-            // FIXME: The number 8 below is bogus.
-            while (!source.isEmpty() && entityName.size() <= 8) {
+            HTMLEntitySearch entitySearch;
+            while (!source.isEmpty()) {
                 cc = *source;
-                if (cc == ';') {
-                    const Entity* entity = findEntity(entityName.data(), entityName.size());
-                    if (entity) {
-                        source.advanceAndASSERT(';');
-                        return entity->code;
-                    }
-                    break;
-                }
-                if (!isAlphaNumeric(cc)) {
-                    const Entity* entity = findEntity(entityName.data(), entityName.size());
-                    if (entity) {
-                        // HTML5 tells us to ignore this entity, for historical reasons,
-                        // if the lookhead character is '='.
-                        if (additionalAllowedCharacter && cc == '=')
-                            break;
-                        // Some entities require a terminating semicolon, whereas other
-                        // entities do not.  The HTML5 spec has a giant list:
-                        //
-                        // http://www.whatwg.org/specs/web-apps/current-work/multipage/named-character-references.html#named-character-references
-                        //
-                        // However, the list seems to boil down to this branch:
-                        if (entity->code > 255)
-                            break;
-                        return entity->code;
-                    }
+                entitySearch.advance(cc);
+                if (!entitySearch.isEntityPrefix())
                     break;
-                }
-                entityName.append(cc);
                 consumedCharacters.append(cc);
                 source.advanceAndASSERT(cc);
             }
             notEnoughCharacters = source.isEmpty();
+            if (notEnoughCharacters) {
+                // We can't return an entity because there might be a longer entity
+                // that we could match if we had more data.
+                unconsumeCharacters(source, consumedCharacters);
+                return 0;
+            }
+            if (!entitySearch.lastMatch()) {
+                ASSERT(!entitySearch.currentValue());
+                unconsumeCharacters(source, consumedCharacters);
+                return 0;
+            }
+            if (entitySearch.lastMatch()->length != entitySearch.currentLength()) {
+                // We've consumed too many characters.  We need to walk the
+                // source back to the point at which we had consumed an
+                // actual entity.
+                unconsumeCharacters(source, consumedCharacters);
+                consumedCharacters.clear();
+                const int length = entitySearch.lastMatch()->length;
+                const UChar* reference = entitySearch.lastMatch()->entity;
+                for (int i = 0; i < length; ++i) {
+                    cc = *source;
+                    ASSERT_UNUSED(reference, cc == *reference++);
+                    consumedCharacters.append(cc);
+                    source.advanceAndASSERT(cc);
+                    ASSERT(!source.isEmpty());
+                }
+                cc = *source;
+            }
+            if (entitySearch.lastMatch()->lastCharacter() == ';')
+                return entitySearch.lastMatch()->value;
+            if (!additionalAllowedCharacter || !(isAlphaNumeric(cc) || cc == '='))
+                return entitySearch.lastMatch()->value;
             unconsumeCharacters(source, consumedCharacters);
             return 0;
         }
@@ -238,8 +237,18 @@ unsigned consumeHTMLEntity(SegmentedString& source, bool& notEnoughCharacters, U
 
 UChar decodeNamedEntity(const char* name)
 {
-    const Entity* e = findEntity(name, strlen(name));
-    return e ? e->code : 0;
+    HTMLEntitySearch search;
+    while (*name && search.isEntityPrefix())
+        search.advance(*name++);
+    search.advance(';');
+    UChar32 entityValue = search.currentValue();
+    if (U16_LENGTH(entityValue) != 1) {
+        // Callers need to move off this API if the entity table has values
+        // which do not fit in a 16 bit UChar!
+        ASSERT_NOT_REACHED();
+        return 0;
+    }
+    return static_cast<UChar>(entityValue);
 }
 
 } // namespace WebCore
diff --git a/WebCore/html/HTMLEntitySearch.cpp b/WebCore/html/HTMLEntitySearch.cpp
new file mode 100644 (file)
index 0000000..c0526a3
--- /dev/null
@@ -0,0 +1,132 @@
+/*
+ * Copyright (C) 2010 Google, Inc. All Rights Reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY APPLE INC. ``AS IS'' AND ANY
+ * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL APPLE INC. OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+ * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+ * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */
+
+#include "config.h"
+#include "HTMLEntitySearch.h"
+
+#include "HTMLEntityTable.h"
+
+namespace WebCore {
+
+namespace {
+    
+const HTMLEntityTableEntry* halfway(const HTMLEntityTableEntry* left, const HTMLEntityTableEntry* right)
+{
+    return &left[(right - left) / 2];
+}
+
+}
+    
+HTMLEntitySearch::HTMLEntitySearch()
+    : m_currentLength(0)
+    , m_currentValue(0)
+    , m_lastMatch(0)
+    , m_start(HTMLEntityTable::start())
+    , m_end(HTMLEntityTable::end())
+{
+}
+
+HTMLEntitySearch::CompareResult HTMLEntitySearch::compare(const HTMLEntityTableEntry* entry, UChar nextCharacter) const
+{
+    if (entry->length < m_currentLength + 1)
+        return Before;
+    UChar entryNextCharacter = entry->entity[m_currentLength];
+    if (entryNextCharacter == nextCharacter)
+        return Prefix;
+    return entryNextCharacter < nextCharacter ? Before : After;
+}
+
+const HTMLEntityTableEntry* HTMLEntitySearch::findStart(UChar nextCharacter) const
+{
+    const HTMLEntityTableEntry* left = m_start;
+    const HTMLEntityTableEntry* right = m_end;
+    if (left == right)
+        return left;
+    CompareResult result = compare(left, nextCharacter);
+    if (result == Prefix)
+        return left;
+    if (result == After)
+        return right;
+    while (left + 1 < right) {
+        const HTMLEntityTableEntry* probe = halfway(left, right);
+        result = compare(probe, nextCharacter);
+        if (result == Before)
+            left = probe;
+        else {
+            ASSERT(result == After || result == Prefix);
+            right = probe;
+        }
+    }
+    ASSERT(left + 1 == right);
+    return right;
+}
+
+const HTMLEntityTableEntry* HTMLEntitySearch::findEnd(UChar nextCharacter) const
+{
+    const HTMLEntityTableEntry* left = m_start;
+    const HTMLEntityTableEntry* right = m_end;
+    if (left == right)
+        return right;
+    CompareResult result = compare(right, nextCharacter);
+    if (result == Prefix)
+        return right;
+    if (result == Before)
+        return left;
+    while (left + 1 < right) {
+        const HTMLEntityTableEntry* probe = halfway(left, right);
+        result = compare(probe, nextCharacter);
+        if (result == After)
+            right = probe;
+        else {
+            ASSERT(result == Before || result == Prefix);
+            left = probe;
+        }
+    }
+    ASSERT(left + 1 == right);
+    return left;
+}
+
+void HTMLEntitySearch::advance(UChar nextCharacter)
+{
+    ASSERT(isEntityPrefix());
+    if (!m_currentLength) {
+        m_start = HTMLEntityTable::start(nextCharacter);
+        m_end = HTMLEntityTable::end(nextCharacter);
+    } else {
+        m_start = findStart(nextCharacter);
+        m_end = findEnd(nextCharacter);
+        if (m_start == m_end && compare(m_start, nextCharacter) != Prefix)
+            return fail();
+    }
+    ++m_currentLength;
+    if (m_start->length != m_currentLength) {
+        m_currentValue = 0;
+        return;
+    }
+    m_lastMatch = m_start;
+    m_currentValue = m_lastMatch->value;
+}
+
+}
diff --git a/WebCore/html/HTMLEntitySearch.h b/WebCore/html/HTMLEntitySearch.h
new file mode 100644 (file)
index 0000000..e57859d
--- /dev/null
@@ -0,0 +1,75 @@
+/*
+ * Copyright (C) 2010 Google, Inc. All Rights Reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY APPLE INC. ``AS IS'' AND ANY
+ * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL APPLE INC. OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+ * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+ * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */
+
+#ifndef HTMLEntitySearch_h
+#define HTMLEntitySearch_h
+
+#include "PlatformString.h"
+
+namespace WebCore {
+
+struct HTMLEntityTableEntry;
+
+class HTMLEntitySearch {
+public:
+    HTMLEntitySearch();
+
+    void advance(UChar);
+
+    bool isEntityPrefix() const { return !!m_start; }
+    int currentValue() const { return m_currentValue; }
+    int currentLength() const { return m_currentLength; }
+
+    const HTMLEntityTableEntry* lastMatch() const { return m_lastMatch; }
+
+private:
+    enum CompareResult {
+        Before,
+        Prefix,
+        After,
+    };
+
+    CompareResult compare(const HTMLEntityTableEntry*, UChar) const;
+    const HTMLEntityTableEntry* findStart(UChar) const;
+    const HTMLEntityTableEntry* findEnd(UChar) const;
+
+    void fail()
+    {
+        m_currentValue = 0;
+        m_start = 0;
+        m_end = 0;
+    }
+
+    int m_currentLength;
+    int m_currentValue;
+
+    const HTMLEntityTableEntry* m_lastMatch;
+    const HTMLEntityTableEntry* m_start;
+    const HTMLEntityTableEntry* m_end;
+};
+
+}
+
+#endif
diff --git a/WebCore/html/HTMLEntityTable.h b/WebCore/html/HTMLEntityTable.h
new file mode 100644 (file)
index 0000000..35a1afd
--- /dev/null
@@ -0,0 +1,52 @@
+/*
+ * Copyright (C) 2010 Google, Inc. All Rights Reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY APPLE INC. ``AS IS'' AND ANY
+ * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL APPLE INC. OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+ * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+ * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */
+
+#ifndef HTMLEntityTable_h
+#define HTMLEntityTable_h
+
+#include "PlatformString.h"
+
+namespace WebCore {
+
+struct HTMLEntityTableEntry {
+    UChar lastCharacter() const { return entity[length - 1]; }
+
+    const UChar* entity;
+    int length;
+    int value;
+};
+
+class HTMLEntityTable {
+public:
+    static const HTMLEntityTableEntry* start();
+    static const HTMLEntityTableEntry* end();
+
+    static const HTMLEntityTableEntry* start(UChar);
+    static const HTMLEntityTableEntry* end(UChar);
+};
+
+}
+
+#endif
index 42cb6fd..8cc9952 100644 (file)
@@ -29,16 +29,6 @@ my $option = basename($ARGV[0],".gperf");
 
 switch ($option) {
 
-case "HTMLEntityNames" {
-
-    my $htmlEntityNamesGenerated   = "$outdir/HTMLEntityNames.cpp";
-    my $htmlEntityNamesGperf       = $ARGV[0];
-    shift;
-
-    system("gperf --key-positions=\"*\" -D -s 2 $htmlEntityNamesGperf > $htmlEntityNamesGenerated") == 0 || die "calling gperf failed: $?";
-
-} # case "HTMLEntityNames"
-
 case "DocTypeStrings" {
 
     my $docTypeStringsGenerated    = "$outdir/DocTypeStrings.cpp";
index 5f67e8c..fa278e3 100644 (file)
@@ -1,3 +1,15 @@
+2010-08-12  Adam Barth  <abarth@webkit.org>
+
+        Reviewed by Eric Seidel.
+
+        Add support for MathML entities
+        https://bugs.webkit.org/show_bug.cgi?id=43949
+
+        A script for generating the C++ static data structure describing all the
+        entities from a JSON description.
+
+        * Scripts/create-html-entity-table: Added.
+
 2010-08-13  Dirk Pranke  <dpranke@chromium.org>
 
         Reviewed by Eric Seidel.
diff --git a/WebKitTools/Scripts/create-html-entity-table b/WebKitTools/Scripts/create-html-entity-table
new file mode 100755 (executable)
index 0000000..14d55f7
--- /dev/null
@@ -0,0 +1,177 @@
+#!/usr/bin/env python
+# Copyright (c) 2010 Google Inc. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are
+# met:
+# 
+#     * Redistributions of source code must retain the above copyright
+# notice, this list of conditions and the following disclaimer.
+#     * Redistributions in binary form must reproduce the above
+# copyright notice, this list of conditions and the following disclaimer
+# in the documentation and/or other materials provided with the
+# distribution.
+#     * Neither the name of Google Inc. nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+# 
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+# "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+# A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+# OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+import os.path
+import string
+import sys
+
+import webkitpy.thirdparty.simplejson as simplejson
+
+
+def convert_entity_to_cpp_name(entity):
+    postfix = "EntityName"
+    if entity[-1] == ";":
+        return "%sSemicolon%s" % (entity[:-1], postfix)
+    return "%s%s" % (entity, postfix)
+
+
+def convert_entity_to_uchar_array(entity):
+    return "{'%s'}" % "', '".join(entity)
+
+
+def convert_value_to_int(value):
+    assert(value[0] == "U")
+    assert(value[1] == "+")
+    return "0x" + value[2:]
+
+
+def offset_table_entry(offset):
+    return "    &staticEntityTable[%s]," % offset
+
+
+program_name = os.path.basename(__file__)
+if len(sys.argv) < 4 or sys.argv[1] != "-o":
+    print >> sys.stderr, "Usage: %s -o OUTPUT_FILE INPUT_FILE" % program_name
+    exit(1)
+
+output_path = sys.argv[2]
+input_path = sys.argv[3]
+
+html_entity_names_file = open(input_path)
+entries = simplejson.load(html_entity_names_file)
+html_entity_names_file.close()
+
+entries = sorted(entries, key=lambda entry: entry['entity'])
+entity_count = len(entries)
+
+output_file = open(output_path, "w")
+
+print >> output_file, """/*
+ * Copyright (C) 2010 Google, Inc. All Rights Reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY APPLE INC. ``AS IS'' AND ANY
+ * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL APPLE INC. OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+ * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+ * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */
+
+// THIS FILE IS GENERATED BY WebKitTools/Scripts/create-html-entity-table
+// DO NOT EDIT (unless you are a ninja)!
+
+#include "config.h"
+#include "HTMLEntityTable.h"
+
+namespace WebCore {
+
+namespace {
+"""
+
+for entry in entries:
+    print >> output_file, "const UChar %sEntityName[] = %s;" % (
+        convert_entity_to_cpp_name(entry["entity"]),
+        convert_entity_to_uchar_array(entry["entity"]))
+
+print >> output_file, """
+HTMLEntityTableEntry staticEntityTable[%s] = {""" % entity_count
+
+index = {}
+offset = 0
+for entry in entries:
+    letter = entry["entity"][0]
+    if letter not in index:
+        index[letter] = offset
+    print >> output_file, '    { %sEntityName, %s, %s },' % (
+        convert_entity_to_cpp_name(entry["entity"]),
+        len(entry["entity"]),
+        convert_value_to_int(entry["value"]))
+    offset += 1
+
+print >> output_file, """};
+"""
+
+print >> output_file, "const HTMLEntityTableEntry* uppercaseOffset[] = {"
+for letter in string.uppercase:
+    print >> output_file, offset_table_entry(index[letter])
+print >> output_file, offset_table_entry(index['a'])
+print >> output_file, """};
+
+const HTMLEntityTableEntry* lowercaseOffset[] = {"""
+for letter in string.lowercase:
+    print >> output_file, offset_table_entry(index[letter])
+print >> output_file, offset_table_entry(entity_count)
+print >> output_file, """};
+
+}
+
+const HTMLEntityTableEntry* HTMLEntityTable::start(UChar c)
+{
+    if (c >= 'A' && c <= 'Z')
+        return uppercaseOffset[c - 'A'];
+    if (c >= 'a' && c <= 'z')
+        return lowercaseOffset[c - 'a'];
+    return 0;
+}
+
+const HTMLEntityTableEntry* HTMLEntityTable::end(UChar c)
+{
+    if (c >= 'A' && c <= 'Z')
+        return uppercaseOffset[c - 'A' + 1] - 1;
+    if (c >= 'a' && c <= 'z')
+        return lowercaseOffset[c - 'a' + 1] - 1;
+    return 0;
+}
+
+const HTMLEntityTableEntry* HTMLEntityTable::start()
+{
+    return &staticEntityTable[0];
+}
+
+const HTMLEntityTableEntry* HTMLEntityTable::end()
+{
+    return &staticEntityTable[%s - 1];
+}
+
+}
+""" % entity_count
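
Note on the search strategy: the ChangeLog describes walking a sorted table one input character at a time while remembering the last complete match. The following is a minimal standalone sketch of that idea, not the code in this patch. The names `Entry`, `kTable`, `PrefixSearch`, and `longestMatch` are hypothetical, the five-entry table is made up for illustration, and `std::partition_point` stands in for the hand-written `findStart`/`findEnd` binary searches.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical miniature table; the real one is generated from a JSON file.
struct Entry {
    std::string name; // entity text after '&', possibly ending in ';'
    int value;
};

// Must be sorted by name in plain byte order, like the generated table.
static const std::vector<Entry> kTable = {
    {"amp", 0x26}, {"amp;", 0x26}, {"not", 0xAC}, {"not;", 0xAC},
    {"notin;", 0x2209},
};

// Character of an entry at a position, or -1 if the entry is too short.
// Shorter entries therefore sort before longer ones with the same prefix.
static int keyAt(const Entry& e, std::size_t pos)
{
    return pos < e.name.size() ? static_cast<unsigned char>(e.name[pos]) : -1;
}

class PrefixSearch {
public:
    PrefixSearch() : lo_(kTable.begin()), hi_(kTable.end()) {}

    // Feeds one character. Returns false once no entry can still match.
    bool advance(char c)
    {
        assert(lo_ <= hi_);
        const int k = static_cast<unsigned char>(c);
        const std::size_t pos = length_;
        // Every entry in [lo_, hi_) shares the first length_ characters, so
        // the keys at position pos are sorted and can be binary searched.
        auto first = std::partition_point(lo_, hi_,
            [&](const Entry& e) { return keyAt(e, pos) < k; });
        auto last = std::partition_point(first, hi_,
            [&](const Entry& e) { return keyAt(e, pos) <= k; });
        lo_ = first;
        hi_ = last;
        if (lo_ == hi_)
            return false;
        ++length_;
        // The shortest candidate comes first; if it is exactly as long as
        // the consumed prefix, it is a complete match.
        if (lo_->name.size() == length_)
            lastMatch_ = &*lo_;
        return true;
    }

    const Entry* lastMatch() const { return lastMatch_; }

private:
    std::vector<Entry>::const_iterator lo_;
    std::vector<Entry>::const_iterator hi_;
    std::size_t length_ = 0;
    const Entry* lastMatch_ = nullptr;
};

// Returns the longest table entry that is a prefix of input, or nullptr.
const Entry* longestMatch(const std::string& input)
{
    PrefixSearch search;
    for (char c : input) {
        if (!search.advance(c))
            break;
    }
    return search.lastMatch();
}
```

With this table, the two inputs from the old FIXME comment behave as the spec requires: "notit; I tell you" resolves to "not", and "notin; I tell you" resolves to "notin;". A caller that has read past the last match still has to rewind its input, which is the resync step the patch performs with unconsumeCharacters.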