Implement RegExp Unicode property escapes
author msaboff@apple.com <msaboff@apple.com@268f45cc-cd09-0410-ab3c-d52691b4dbfc>
Mon, 9 Oct 2017 23:14:46 +0000 (23:14 +0000)
committer msaboff@apple.com <msaboff@apple.com@268f45cc-cd09-0410-ab3c-d52691b4dbfc>
Mon, 9 Oct 2017 23:14:46 +0000 (23:14 +0000)
commit df56f59d9ed34d976cad413e73d15710f0c341b3
tree 749b2c73f37aecad2cc5874379d2a4a6bbc1874b
parent 823a3c6a51688a00a3fab3d5d1040acf413078a6
Implement RegExp Unicode property escapes
https://bugs.webkit.org/show_bug.cgi?id=172069

Reviewed by JF Bastien.

JSTests:

Enabled Unicode Property tests.

* test262.yaml:

Source/JavaScriptCore:

Added Unicode Properties by extending the existing CharacterClass processing.

Introduced a new Python script, generateYarrUnicodePropertyTables.py, that parses
Unicode Database files to create character class data.  The result is a set of functions
that return character classes, one for each of the required Unicode properties.
In many cases a single function handles several properties, primarily because of
property aliases, but also because some Script_Extensions properties are the same as the
Script property for the same script value.
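
As a rough illustration of the parsing step described above, the sketch below reads lines
in the layout used by UCD files such as Scripts.txt and collects single code points
(matches) and inclusive ranges per property value.  The function name parse_ucd_lines and
the SAMPLE excerpt are hypothetical; this is not the code in
generateYarrUnicodePropertyTables.py, only a minimal sketch of the same idea.

```python
from collections import defaultdict


def parse_ucd_lines(lines):
    """Collect matches (single code points) and inclusive ranges per property value.

    Hypothetical helper: assumes each data line looks like
    "XXXX..YYYY ; Name # comment" or "XXXX ; Name # comment".
    """
    data = defaultdict(lambda: {"matches": [], "ranges": []})
    for line in lines:
        # Drop the trailing comment and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        code_points, name = fields[0], fields[1]
        if ".." in code_points:
            first, last = code_points.split("..")
            data[name]["ranges"].append((int(first, 16), int(last, 16)))
        else:
            data[name]["matches"].append(int(code_points, 16))
    return dict(data)


# Short excerpt in the Scripts.txt layout, used only for illustration.
SAMPLE = """\
# Illustrative excerpt
0041..005A    ; Latin # LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
00AA          ; Latin # FEMININE ORDINAL INDICATOR
0370..0373    ; Greek
""".splitlines()

if __name__ == "__main__":
    for name, entry in parse_ucd_lines(SAMPLE).items():
        print(name, entry)
```

A generator along these lines could then emit one character class function per distinct
property value and point every alias of that value at the same function.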

Extended the BuiltInCharacterClassID enum so it can also be used for Unicode property
character classes.  A Unicode property's enum value is BaseUnicodePropertyID plus a
zero-based index that selects the corresponding character class
function.  The generation script also creates static hashing tables similar to those we
use for the generated .lut.h lookup table files.  These hashing tables map property
names to the function index.  Using these hashing tables, we can look up a property
name and, if present, convert it to a function index.  We add that index to
BaseUnicodePropertyID to create a BuiltInCharacterClassID.

When we do syntax parsing, we convert the property to its corresponding BuiltInCharacterClassID.
When doing real parsing we take the returned BuiltInCharacterClassID and use it to get
the actual character class by calling the corresponding generated function.
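
The ID scheme and the two parsing passes described above can be sketched roughly as
follows.  Everything here is an assumption made for illustration: the value of
BASE_UNICODE_PROPERTY_ID, the create_* functions and their contents, and the use of a
plain dict in place of the generated static hash tables.  The real implementation is C++
in yarr/YarrUnicodeProperties.cpp.

```python
# Hypothetical stand-in for the BaseUnicodePropertyID enum value.
BASE_UNICODE_PROPERTY_ID = 100


def create_latin():
    # Illustrative data only, not the full Latin script.
    return {"matches": [0xAA], "ranges": [(0x41, 0x5A)]}


def create_greek():
    # Illustrative data only, not the full Greek script.
    return {"matches": [], "ranges": [(0x370, 0x373)]}


# Index into this list is the zero-based function index.
CREATE_FUNCTIONS = [create_latin, create_greek]

# A dict stands in for the generated hash table; aliases share one index.
PROPERTY_INDEX = {"Latin": 0, "Latn": 0, "Greek": 1, "Grek": 1}


def class_id_for(name):
    """Syntax pass: map a property name to a class ID, or None if unknown."""
    index = PROPERTY_INDEX.get(name)
    if index is None:
        return None
    return BASE_UNICODE_PROPERTY_ID + index


def character_class_for(class_id):
    """Real parse: recover the function index and build the character class."""
    return CREATE_FUNCTIONS[class_id - BASE_UNICODE_PROPERTY_ID]()


if __name__ == "__main__":
    class_id = class_id_for("Grek")
    print(class_id, character_class_for(class_id))
```

Keeping only the small integer ID during the syntax pass means the character class data
is built just once, when the pattern is actually constructed.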

Added a new CharacterClass constructor that can take literal arrays for ranges and matches
to make the creation of large static character classes more efficient.

Since the Unicode character classes typically have many more matches and ranges than the
existing built-in classes, the character class matching in the interpreter has been
updated to use binary searching for matches and ranges with more than 6 entries.
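
A minimal sketch of that lookup strategy, written in Python rather than the interpreter's
C++ and assuming the matches and ranges are sorted and the ranges do not overlap.  The
names LINEAR_SEARCH_LIMIT, matches_contains and ranges_contains are hypothetical; the
limit of 6 comes from the description above.

```python
from bisect import bisect_left

# Lists with more than this many entries are searched with binary search.
LINEAR_SEARCH_LIMIT = 6


def matches_contains(matches, ch):
    """Return True if ch is in the sorted list of single code points."""
    if len(matches) <= LINEAR_SEARCH_LIMIT:
        return ch in matches  # short list: a linear scan is fine
    i = bisect_left(matches, ch)
    return i < len(matches) and matches[i] == ch


def ranges_contains(ranges, ch):
    """Return True if ch falls inside one of the sorted, inclusive (begin, end) ranges."""
    if len(ranges) <= LINEAR_SEARCH_LIMIT:
        return any(begin <= ch <= end for begin, end in ranges)
    low, high = 0, len(ranges)
    while low < high:
        mid = (low + high) // 2
        begin, end = ranges[mid]
        if ch < begin:
            high = mid      # ch can only be in an earlier range
        elif ch > end:
            low = mid + 1   # ch can only be in a later range
        else:
            return True
    return False


if __name__ == "__main__":
    ranges = [(n, n + 5) for n in range(0, 80, 10)]  # 8 ranges, so binary search is used
    print(ranges_contains(ranges, 33), ranges_contains(ranges, 36))
```

For short lists the linear scan avoids the overhead of the search loop, which is
presumably why a small threshold is used rather than always searching.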

* CMakeLists.txt:
* DerivedSources.make:
* JavaScriptCore.xcodeproj/project.pbxproj:
* Scripts/generateYarrUnicodePropertyTables.py: Added.
(openOrExit):
(openUCDFileOrExit):
(verifyUCDFilesExist):
(ceilingToPowerOf2):
(Aliases):
(Aliases.__init__):
(Aliases.parsePropertyAliasesFile):
(Aliases.parsePropertyValueAliasesFile):
(Aliases.globalAliasesFor):
(Aliases.generalCategoryAliasesFor):
(Aliases.generalCategoryForAlias):
(Aliases.scriptAliasesFor):
(Aliases.scriptNameForAlias):
(PropertyData):
(PropertyData.__init__):
(PropertyData.setAliases):
(PropertyData.makeCopy):
(PropertyData.getIndex):
(PropertyData.getCreateFuncName):
(PropertyData.addMatch):
(PropertyData.addRange):
(PropertyData.addMatchUnorderedForMatchesAndRanges):
(PropertyData.addRangeUnorderedForMatchesAndRanges):
(PropertyData.addMatchUnordered):
(PropertyData.addRangeUnordered):
(PropertyData.removeMatchFromRanges):
(PropertyData.removeMatch):
(PropertyData.dumpMatchData):
(PropertyData.dump):
(PropertyData.dumpAll):
(PropertyData.dumpAll.std):
(PropertyData.createAndDumpHashTable):
(Scripts):
(Scripts.__init__):
(Scripts.parseScriptsFile):
(Scripts.parseScriptExtensionsFile):
(Scripts.dump):
(GeneralCategory):
(GeneralCategory.__init__):
(GeneralCategory.createSpecialPropertyData):
(GeneralCategory.findPropertyGroupFor):
(GeneralCategory.addNextCodePoints):
(GeneralCategory.parse):
(GeneralCategory.dump):
(BinaryProperty):
(BinaryProperty.__init__):
(BinaryProperty.parsePropertyFile):
(BinaryProperty.dump):
* Scripts/hasher.py: Added.
(stringHash):
* Sources.txt:
* ucd/DerivedBinaryProperties.txt: Added.
* ucd/DerivedCoreProperties.txt: Added.
* ucd/DerivedNormalizationProps.txt: Added.
* ucd/PropList.txt: Added.
* ucd/PropertyAliases.txt: Added.
* ucd/PropertyValueAliases.txt: Added.
* ucd/ScriptExtensions.txt: Added.
* ucd/Scripts.txt: Added.
* ucd/UnicodeData.txt: Added.
* ucd/emoji-data.txt: Added.
* yarr/Yarr.h:
* yarr/YarrInterpreter.cpp:
(JSC::Yarr::Interpreter::testCharacterClass):
* yarr/YarrParser.h:
(JSC::Yarr::Parser::parseEscape):
(JSC::Yarr::Parser::parseTokens):
(JSC::Yarr::Parser::isUnicodePropertyValueExpressionChar):
(JSC::Yarr::Parser::tryConsumeUnicodePropertyExpression):
* yarr/YarrPattern.cpp:
(JSC::Yarr::CharacterClassConstructor::appendInverted):
(JSC::Yarr::YarrPatternConstructor::atomBuiltInCharacterClass):
(JSC::Yarr::YarrPatternConstructor::atomCharacterClassBuiltIn):
(JSC::Yarr::YarrPattern::errorMessage):
(JSC::Yarr::PatternTerm::dump):
* yarr/YarrPattern.h:
(JSC::Yarr::CharacterRange::CharacterRange):
(JSC::Yarr::CharacterClass::CharacterClass):
(JSC::Yarr::YarrPattern::reset):
(JSC::Yarr::YarrPattern::unicodeCharacterClassFor):
* yarr/YarrUnicodeProperties.cpp: Added.
(JSC::Yarr::HashTable::entry const):
(JSC::Yarr::unicodeMatchPropertyValue):
(JSC::Yarr::unicodeMatchProperty):
(JSC::Yarr::createUnicodeCharacterClassFor):
* yarr/YarrUnicodeProperties.h: Added.

Source/WebCore:

Refactoring change - Added BuiltInCharacterClassID:: prefix to uses of the enum.

* contentextensions/URLFilterParser.cpp:
(WebCore::ContentExtensions::PatternParser::atomBuiltInCharacterClass):

LayoutTests:

New test.

* js/regexp-unicode-properties-expected.txt: Added.
* js/regexp-unicode-properties.html: Added.
* js/script-tests/regexp-unicode-properties.js: Added.

git-svn-id: https://svn.webkit.org/repository/webkit/trunk@223081 268f45cc-cd09-0410-ab3c-d52691b4dbfc
32 files changed:
JSTests/ChangeLog
JSTests/test262.yaml
LayoutTests/ChangeLog
LayoutTests/js/regexp-unicode-properties-expected.txt [new file with mode: 0644]
LayoutTests/js/regexp-unicode-properties.html [new file with mode: 0644]
LayoutTests/js/script-tests/regexp-unicode-properties.js [new file with mode: 0644]
Source/JavaScriptCore/CMakeLists.txt
Source/JavaScriptCore/ChangeLog
Source/JavaScriptCore/DerivedSources.make
Source/JavaScriptCore/JavaScriptCore.xcodeproj/project.pbxproj
Source/JavaScriptCore/Scripts/generateYarrUnicodePropertyTables.py [new file with mode: 0644]
Source/JavaScriptCore/Scripts/hasher.py [new file with mode: 0644]
Source/JavaScriptCore/Sources.txt
Source/JavaScriptCore/ucd/DerivedBinaryProperties.txt [new file with mode: 0644]
Source/JavaScriptCore/ucd/DerivedCoreProperties.txt [new file with mode: 0644]
Source/JavaScriptCore/ucd/DerivedNormalizationProps.txt [new file with mode: 0644]
Source/JavaScriptCore/ucd/PropList.txt [new file with mode: 0644]
Source/JavaScriptCore/ucd/PropertyAliases.txt [new file with mode: 0644]
Source/JavaScriptCore/ucd/PropertyValueAliases.txt [new file with mode: 0644]
Source/JavaScriptCore/ucd/ScriptExtensions.txt [new file with mode: 0644]
Source/JavaScriptCore/ucd/Scripts.txt [new file with mode: 0644]
Source/JavaScriptCore/ucd/UnicodeData.txt [new file with mode: 0644]
Source/JavaScriptCore/ucd/emoji-data.txt [new file with mode: 0644]
Source/JavaScriptCore/yarr/Yarr.h
Source/JavaScriptCore/yarr/YarrInterpreter.cpp
Source/JavaScriptCore/yarr/YarrParser.h
Source/JavaScriptCore/yarr/YarrPattern.cpp
Source/JavaScriptCore/yarr/YarrPattern.h
Source/JavaScriptCore/yarr/YarrUnicodeProperties.cpp [new file with mode: 0644]
Source/JavaScriptCore/yarr/YarrUnicodeProperties.h [new file with mode: 0644]
Source/WebCore/ChangeLog
Source/WebCore/contentextensions/URLFilterParser.cpp