URLs are encoded in UTF-8, then decoded as if they are Latin1
author benjamin@webkit.org <benjamin@webkit.org@268f45cc-cd09-0410-ab3c-d52691b4dbfc>
Thu, 1 Dec 2011 22:51:13 +0000 (22:51 +0000)
committer benjamin@webkit.org <benjamin@webkit.org@268f45cc-cd09-0410-ab3c-d52691b4dbfc>
Thu, 1 Dec 2011 22:51:13 +0000 (22:51 +0000)
commit e117f5139e1a484c444a327360f8dd1e74498ffb
tree d542278f8b653f8acac1abbc7c581ad994806b7a
parent 28b6ecca243260a8a861ad7e8e769deeee270f30
https://bugs.webkit.org/show_bug.cgi?id=71758

Reviewed by Darin Adler.

Source/JavaScriptCore:

Add the operator == between a String and a Vector of char. The implementation
is the same as the comparison of String and char* but adds the length as a
parameter for comparing the strings.

* JavaScriptCore.exp:
* wtf/text/StringImpl.h:
(WTF::equal):
* wtf/text/WTFString.h:
(WTF::operator==):
(WTF::operator!=):
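
As a rough illustration of the length-aware comparison described above, here is a minimal sketch. It does not use the real WTF types: `SketchString` and `equalToBytes` are hypothetical stand-ins, and the assumption is that each byte of the buffer is compared as one Latin-1 character against one UTF-16 code unit.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a UTF-16 string class; not WTF::String.
struct SketchString {
    std::u16string characters;
};

// Hypothetical helper: same idea as comparing against a char*, except the
// length is passed in instead of relying on a terminating NUL.
inline bool equalToBytes(const SketchString& a, const char* b, std::size_t length)
{
    if (a.characters.size() != length)
        return false;
    for (std::size_t i = 0; i < length; ++i) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        if (a.characters[i] != static_cast<unsigned char>(b[i]))
            return false;
    }
    return true;
}

inline bool operator==(const SketchString& a, const std::vector<char>& b)
{
    return equalToBytes(a, b.data(), b.size());
}

inline bool operator!=(const SketchString& a, const std::vector<char>& b)
{
    return !(a == b);
}
```

Passing the length explicitly means the buffer does not need to be NUL-terminated, and a NUL byte inside the buffer counts as data rather than as the end of the string.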

Source/WebCore:

Previously, invalid URLs could have a string emanating from a
partial parsing of the input. The creation of the string was done
through the Latin1 codec regardless of the encoding of the char* url.

This caused two types of issues: URLs were evaluated as half-parsed,
and the encoding and decoding of the string were not consistent.
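
To make the inconsistency concrete, the following sketch shows what happens when UTF-8 bytes are decoded as Latin-1. `decodeAsLatin1` is a hypothetical helper written for this illustration, not a WebKit function.

```cpp
#include <string>

// Hypothetical helper: decodes a byte buffer as Latin-1, so every byte
// becomes exactly one UTF-16 code unit regardless of the real encoding.
inline std::u16string decodeAsLatin1(const std::string& bytes)
{
    std::u16string result;
    result.reserve(bytes.size());
    for (unsigned char byte : bytes)
        result.push_back(static_cast<char16_t>(byte));
    return result;
}
```

For example, U+2A74 is encoded in UTF-8 as the three bytes E2 A9 B4. Calling `decodeAsLatin1("\xE2\xA9\xB4")` yields three code units (U+00E2, U+00A9, U+00B4) instead of the single original character.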

This patch changes KURL::parse() to fall back on the original string
whenever the parsing of the URL fails.
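
The fallback can be pictured with the sketch below. It is only a toy under stated assumptions: `ParsedURL`, `tryCanonicalize`, and `parseWithFallback` are hypothetical names, and the "canonicalizer" merely lowercases its input and rejects spaces, which is far simpler than KURL's actual grammar.

```cpp
#include <cctype>
#include <string>

// Hypothetical result type for this sketch.
struct ParsedURL {
    std::string text; // canonical form when valid, otherwise the original input
    bool isValid;
};

// Hypothetical toy canonicalizer, not KURL's real parser: lowercases the
// input and fails as soon as it meets a space.
inline bool tryCanonicalize(const std::string& input, std::string& output)
{
    output.clear();
    for (unsigned char c : input) {
        if (c == ' ')
            return false; // output now holds only a half-parsed prefix
        output.push_back(static_cast<char>(std::tolower(c)));
    }
    return true;
}

inline ParsedURL parseWithFallback(const std::string& original)
{
    std::string buffer;
    if (tryCanonicalize(original, buffer))
        return { buffer, true };
    // Discard the partial buffer and keep the input exactly as it was given.
    return { original, false };
}
```

With this shape, an invalid input such as "HTTP://www .Example/" comes back unchanged and flagged invalid, rather than as whatever prefix the parser had produced before it gave up.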

Test: fast/url/invalid-urls-utf8.html

* platform/KURL.cpp:
(WebCore::KURL::KURL):
(WebCore::KURL::init):
(WebCore::KURL::parse):
Previously, originalString was only used as an optimization to avoid
the allocation of a string. That optimization depended on comparing
the incoming string with the encoded buffer.
This patch generalizes originalString to always be the original string
being parsed by KURL. The optimization is kept by comparing that string
and the final parsed result.
* platform/KURL.h:
(WebCore::KURL::parse):
* platform/cf/KURLCFNet.cpp:
(WebCore::KURL::KURL):
* platform/mac/KURLMac.mm:
(WebCore::KURL::KURL):
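
The originalString optimization noted under KURL::parse can be sketched roughly as follows. This assumes a shared, immutable string stands in for WebKit's ref-counted String; `SharedText` and `adoptParsedResult` are hypothetical names used only here.

```cpp
#include <algorithm>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a ref-counted immutable string.
using SharedText = std::shared_ptr<const std::string>;

// If the parsed buffer is identical to the original string, reuse the
// original and skip the allocation; otherwise build a new string.
inline SharedText adoptParsedResult(const SharedText& original, const std::vector<char>& parsed)
{
    if (original->size() == parsed.size()
        && std::equal(parsed.begin(), parsed.end(), original->begin()))
        return original; // share the existing buffer
    return std::make_shared<std::string>(parsed.begin(), parsed.end());
}
```

The comparison runs against the final parsed result, so the original string stays available for the failure path while the common case of an already-canonical URL still avoids a copy.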

LayoutTests:

* fast/url/invalid-urls-utf8-expected.txt: Added.
* fast/url/invalid-urls-utf8.html: Added.
New test for invalid URLs whose Unicode characters were mangled
by the parsing.

* fast/url/file-expected.txt:
* fast/url/file-http-base-expected.txt:
Two URLs were expanded by their base URL before being found invalid. The
partially parsed result was saved as the new URL.

* fast/url/host-expected.txt:
The hosts of two URLs were invalid and were partially modified by the parsing.

* fast/url/idna2003-expected.txt:
The first URL, 'http://www.lookout.net⩴80/', is encoded for parsing as http://www.lookout.net::=80/
and fails as invalid. The new result does not modify the original string.

The whitespace in 'http://www .lookout.net/' causes the parsing to fail when parsing
the username because a space is not a UserInfoChar.

* fast/url/port-expected.txt:
The Unicode characters used as the port number were transformed because
the UTF-8 bytes were converted to Unicode through the Latin1 codec.

* platform/chromium/test_expectations.txt: Skip the test on Chromium for now since Google URL
does not implement the extended version of parse().

git-svn-id: https://svn.webkit.org/repository/webkit/trunk@101713 268f45cc-cd09-0410-ab3c-d52691b4dbfc
18 files changed:
LayoutTests/ChangeLog
LayoutTests/fast/url/file-expected.txt
LayoutTests/fast/url/file-http-base-expected.txt
LayoutTests/fast/url/host-expected.txt
LayoutTests/fast/url/idna2003-expected.txt
LayoutTests/fast/url/invalid-urls-utf8-expected.txt [new file with mode: 0644]
LayoutTests/fast/url/invalid-urls-utf8.html [new file with mode: 0644]
LayoutTests/fast/url/port-expected.txt
LayoutTests/platform/chromium/test_expectations.txt
Source/JavaScriptCore/ChangeLog
Source/JavaScriptCore/JavaScriptCore.exp
Source/JavaScriptCore/wtf/text/StringImpl.h
Source/JavaScriptCore/wtf/text/WTFString.h
Source/WebCore/ChangeLog
Source/WebCore/platform/KURL.cpp
Source/WebCore/platform/KURL.h
Source/WebCore/platform/cf/KURLCFNet.cpp
Source/WebCore/platform/mac/KURLMac.mm