Improve normalization code, including moving from unorm.h to unorm2.h
author darin@apple.com <darin@apple.com@268f45cc-cd09-0410-ab3c-d52691b4dbfc>
Sun, 17 Mar 2019 02:20:52 +0000 (02:20 +0000)
committer darin@apple.com <darin@apple.com@268f45cc-cd09-0410-ab3c-d52691b4dbfc>
Sun, 17 Mar 2019 02:20:52 +0000 (02:20 +0000)
commit 93fd227d2368a9965cd5746219a024f73c16e61b
tree 349e79cc4b34dfdbadeaad7f5f26a1201d2b0312
parent 9cded5fe56716d4ad0f14821b09cc172ddc0eef2
https://bugs.webkit.org/show_bug.cgi?id=195330

Reviewed by Michael Catanzaro.

Source/JavaScriptCore:

* runtime/JSString.h: Move StringViewWithUnderlyingString to StringView.h.

* runtime/StringPrototype.cpp: Include unorm2.h instead of unorm.h.
(JSC::normalizer): Added. Returns the normalizer object for the given
enumeration value. Simplified because we know the function will not fail,
so we don't need error handling code.
(JSC::normalize): Changed this function to take a JSString* so we can
optimize the case where no normalization is needed. Added an early exit
if the string is stored as 8-bit and another if the string is already
normalized, using unorm2_isNormalized. Changed error handling to only
check cases that can actually fail in practice. Also did other small
optimizations like passing VM rather than ExecState.
(JSC::stringProtoFuncNormalize): Used smaller enumeration names that are
identical to the names used in the API and normalization parlance rather
than longer ones that expand the acronyms. Updated to pass JSString* to
the normalize function, so we can optimize 8-bit and already-normalized
cases, rather than calling the expensive String::upconvertedCharacters
function. Use throwVMRangeError.
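
Illustrative sketch only, not the code in StringPrototype.cpp: one way the
short enumeration names could map from the form argument. The enum, the
function name parseNormalizationFormSketch, and the use of std::optional are
assumptions for this example; an empty result stands for the case where the
caller would raise a RangeError.

```cpp
#include <optional>
#include <string_view>

// Assumed enumeration: names match the Unicode normalization form names.
enum class NormalizationForm { NFC, NFD, NFKC, NFKD };

// Hypothetical helper. Returns std::nullopt for any name that is not one of
// the four forms; a caller would turn that into a RangeError.
std::optional<NormalizationForm> parseNormalizationFormSketch(std::string_view name)
{
    if (name == "NFC")
        return NormalizationForm::NFC;
    if (name == "NFD")
        return NormalizationForm::NFD;
    if (name == "NFKC")
        return NormalizationForm::NFKC;
    if (name == "NFKD")
        return NormalizationForm::NFKD;
    return std::nullopt;
}
```

The comparison is case-sensitive, so "nfc" is rejected in this sketch.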

Source/WebCore:

* editing/TextIterator.cpp: Include unorm2.h.
(WebCore::normalizeCharacters): Rewrote to use unorm2_normalize rather than
unorm_normalize, but left the logic otherwise the same.

* platform/graphics/SurrogatePairAwareTextIterator.cpp: Include unorm2.h.
(WebCore::SurrogatePairAwareTextIterator::normalizeVoicingMarks):
Use unorm2_composePair instead of unorm_normalize.
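
Illustrative sketch only: what composing a base character with a following
voicing mark looks like. This does not call ICU; composeVoicingMarkSketch is
a hypothetical stand-in with a toy table of four pairs, and it assumes the
same convention as unorm2_composePair of returning a negative value when the
pair has no composite.

```cpp
#include <cstdint>

// Hypothetical lookup covering a handful of kana. Returns the composite code
// point, or -1 when the pair has no composite in this toy table.
int32_t composeVoicingMarkSketch(char32_t base, char32_t mark)
{
    constexpr char32_t voicedMark = 0x3099;     // COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
    constexpr char32_t semiVoicedMark = 0x309A; // COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK

    if (mark == voicedMark) {
        if (base == 0x304B)
            return 0x304C; // HIRAGANA KA -> GA
        if (base == 0x306F)
            return 0x3070; // HIRAGANA HA -> BA
        if (base == 0x30AB)
            return 0x30AC; // KATAKANA KA -> GA
    } else if (mark == semiVoicedMark) {
        if (base == 0x306F)
            return 0x3071; // HIRAGANA HA -> PA
    }
    return -1;
}
```

A single pair lookup avoids setting up source and destination buffers for a
two-character normalization.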

* platform/graphics/cairo/FontCairoHarfbuzzNG.cpp:
(characterSequenceIsEmoji): Changed to use existing SurrogatePairAwareTextIterator.
(FontCascade::fontForCombiningCharacterSequence): Use normalizedNFC instead of
calling unorm2_normalize directly.

* platform/graphics/freetype/SimpleFontDataFreeType.cpp:
Removed unneeded include of <unicode/normlzr.h>.

* platform/text/TextEncoding.cpp:
(WebCore::TextEncoding::encode const): Use normalizedNFC instead of the
code that was here. The normalizedNFC function is better in multiple ways,
but primarily it handles 8-bit strings and other already-normalized
strings much more efficiently.

Source/WTF:

* wtf/URLHelpers.cpp: Removed unneeded include of unorm.h since the
normalization code is now in StringView.cpp.
(WTF::URLHelpers::escapeUnsafeCharacters): Renamed from
createStringWithEscapedUnsafeCharacters since it now only creates
a new string if one is needed. Use unsigned for string lengths, since
that's what WTF::String uses, not size_t. Added a first loop so that
we can return the string unmodified if no lookalike characters are
found. Removed unnecessary round trip from UTF-16 and then back in
the case where the character is not a lookalike.
(WTF::URLHelpers::toNormalizationFormC): Deleted. Moved this logic
into the WTF::normalizedNFC function in StringView.cpp.
(WTF::URLHelpers::userVisibleURL): Call escapeUnsafeCharacters and
normalizedNFC. The normalizedNFC function is better in multiple ways,
but primarily it handles 8-bit strings and other already-normalized
strings much more efficiently.
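
Illustrative sketch only, not the code in URLHelpers.cpp: the scan-first
structure described for escapeUnsafeCharacters. The names
isLookalikeSketch, appendPercentEscaped and escapeUnsafeSketch are
hypothetical, the lookalike set is reduced to U+2044 FRACTION SLASH, and the
%XX-of-UTF-8 escape format is an assumption. It uses std::u32string rather
than WTF::String, so the "unmodified" return is a copy here rather than a
shared reference.

```cpp
#include <cstddef>
#include <string>

// Hypothetical predicate: this sketch treats only U+2044 as a lookalike.
static bool isLookalikeSketch(char32_t c)
{
    return c == 0x2044;
}

// Appends the UTF-8 bytes of one code point as %XX escapes (assumed format).
static void appendPercentEscaped(std::u32string& out, char32_t c)
{
    unsigned char bytes[4];
    int count = 0;
    if (c < 0x80)
        bytes[count++] = static_cast<unsigned char>(c);
    else if (c < 0x800) {
        bytes[count++] = static_cast<unsigned char>(0xC0 | (c >> 6));
        bytes[count++] = static_cast<unsigned char>(0x80 | (c & 0x3F));
    } else if (c < 0x10000) {
        bytes[count++] = static_cast<unsigned char>(0xE0 | (c >> 12));
        bytes[count++] = static_cast<unsigned char>(0x80 | ((c >> 6) & 0x3F));
        bytes[count++] = static_cast<unsigned char>(0x80 | (c & 0x3F));
    } else {
        bytes[count++] = static_cast<unsigned char>(0xF0 | (c >> 18));
        bytes[count++] = static_cast<unsigned char>(0x80 | ((c >> 12) & 0x3F));
        bytes[count++] = static_cast<unsigned char>(0x80 | ((c >> 6) & 0x3F));
        bytes[count++] = static_cast<unsigned char>(0x80 | (c & 0x3F));
    }
    static const char hexDigits[] = "0123456789ABCDEF";
    for (int i = 0; i < count; ++i) {
        out += U'%';
        out += static_cast<char32_t>(hexDigits[bytes[i] >> 4]);
        out += static_cast<char32_t>(hexDigits[bytes[i] & 0xF]);
    }
}

std::u32string escapeUnsafeSketch(const std::u32string& input)
{
    // First loop: find the first lookalike. If there is none, nothing is built.
    std::size_t i = 0;
    while (i < input.size() && !isLookalikeSketch(input[i]))
        ++i;
    if (i == input.size())
        return input;

    // Second loop: copy the clean prefix, then escape only the lookalikes.
    std::u32string result(input, 0, i);
    for (; i < input.size(); ++i) {
        if (isLookalikeSketch(input[i]))
            appendPercentEscaped(result, input[i]);
        else
            result += input[i];
    }
    return result;
}
```

Characters that are not lookalikes are appended as they are, with no
conversion to another encoding and back.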

* wtf/text/StringView.cpp:
(WTF::normalizedNFC): Added. This has two overloads. One is for when
we already have a String, and want to re-use it if no normalization
is needed, and another is when we only have a StringView, and may need
to allocate a String to hold the result. Includes a fast special case
for 8-bit and already-normalized strings, and uses the same strategy
that JSC::normalize was already using: calls unorm2_normalize twice,
first just to determine the length.
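
Illustrative sketch only, not the code in StringView.cpp: the shape of
"fast paths first, then call the normalizer twice". To stay self-contained
it does not call ICU. toyIsNormalized and toyNormalize are hypothetical
stand-ins that know a single composition (e + U+0301 -> U+00E9), and
ViewWithStorage is an assumed analogue of StringViewWithUnderlyingString.
The Latin-1 check approximates the 8-bit case; text made only of code units
up to U+00FF is already in NFC.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <string_view>

// Toy stand-in for an "is already normalized" query.
static bool toyIsNormalized(std::u16string_view s)
{
    return s.find(u"e\u0301") == std::u16string_view::npos;
}

// Toy stand-in with a preflight contract: always returns the length the full
// output needs, and writes a code unit only while it fits within capacity.
static int32_t toyNormalize(std::u16string_view src, char16_t* dest, int32_t capacity)
{
    int32_t length = 0;
    for (std::size_t i = 0; i < src.size(); ++i) {
        char16_t c = src[i];
        if (c == u'e' && i + 1 < src.size() && src[i + 1] == 0x0301) {
            c = 0x00E9; // compose the pair into one precomposed character
            ++i;
        }
        if (length < capacity)
            dest[length] = c;
        ++length;
    }
    return length;
}

struct ViewWithStorage {
    std::u16string_view view;                // what callers read
    std::shared_ptr<std::u16string> storage; // set only when a buffer was allocated
};

ViewWithStorage normalizedSketch(std::u16string_view input)
{
    // Fast paths: Latin-1-only text, or text the normalizer reports as done.
    bool allLatin1 = std::all_of(input.begin(), input.end(),
        [](char16_t c) { return c <= 0xFF; });
    if (allLatin1 || toyIsNormalized(input))
        return { input, nullptr };

    // First call: no destination, only the required length is computed.
    int32_t length = toyNormalize(input, nullptr, 0);

    // Second call: fill a buffer of exactly that size.
    auto storage = std::make_shared<std::u16string>(static_cast<std::size_t>(length), u'\0');
    toyNormalize(input, storage->data(), length);
    return { std::u16string_view(*storage), storage };
}
```

When no work is needed the returned view points at the caller's own
characters and storage stays empty, so nothing is allocated.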

* wtf/text/StringView.h: Added normalizedNFC, which can be called with
either a StringView or a String. Also moved StringViewWithUnderlyingString
here from JSString.h for use as the return value of normalizedNFC;
it is used for a similar purpose in the JavaScriptCore rope implementation.
Also removed an inaccurate comment.

git-svn-id: https://svn.webkit.org/repository/webkit/trunk@243049 268f45cc-cd09-0410-ab3c-d52691b4dbfc
13 files changed:
Source/JavaScriptCore/ChangeLog
Source/JavaScriptCore/runtime/JSString.h
Source/JavaScriptCore/runtime/StringPrototype.cpp
Source/WTF/ChangeLog
Source/WTF/wtf/URLHelpers.cpp
Source/WTF/wtf/text/StringView.cpp
Source/WTF/wtf/text/StringView.h
Source/WebCore/ChangeLog
Source/WebCore/editing/TextIterator.cpp
Source/WebCore/platform/graphics/SurrogatePairAwareTextIterator.cpp
Source/WebCore/platform/graphics/cairo/FontCairoHarfbuzzNG.cpp
Source/WebCore/platform/graphics/freetype/SimpleFontDataFreeType.cpp
Source/WebCore/platform/text/TextEncoding.cpp