<rdar://problem/13666412> Clean up some edge cases of URL parsing.
authorap@apple.com <ap@apple.com@268f45cc-cd09-0410-ab3c-d52691b4dbfc>
Sat, 11 May 2013 05:51:04 +0000 (05:51 +0000)
committerap@apple.com <ap@apple.com@268f45cc-cd09-0410-ab3c-d52691b4dbfc>
Sat, 11 May 2013 05:51:04 +0000 (05:51 +0000)
        https://bugs.webkit.org/show_bug.cgi?id=104919

        Reviewed by Darin Adler.

WebCore:
        * page/SecurityOrigin.cpp:
        (WebCore::schemeRequiresHost):
        (WebCore::shouldTreatAsUniqueOrigin):
        Updated function name and comments (host is not the same as authority). We still
        need this check - KURL can still produce http URLs with an empty host (even as this
        patch reduces the number of such cases). So can Gecko and current draft of URL
        Standard.
        It would be good to have a guarantee that such useless URLs can not come out of
        URL parser, as relying on downstream code re-parsing the URL correctly would be fragile.

        * platform/KURL.cpp:
        (WebCore::hostPortIsEmptyButCredentialsArePresent): Updated an argument name for
        correctness.
        (WebCore::KURL::parse):
        1. Reverted behavior changes from <http://trac.webkit.org/changeset/82181> - I could
        find no reason to allow "@" in hostnames, and having a URL like this re-parsed by
        a different parser would likely produce different results. It's better to just treat
        these edge case URLs as invalid.
        2. When hostname component is a lone colon, preserve it in parsed URL string,
        as otherwise path would get pushed in its place when re-parsing.
        3. When authority component is a lone colon, don't forget to "//" after scheme, too.
        4. Added some assertions about contents of authority component, to catch potential
        mis-parsing earlier.

LayoutTests:
        * fast/dom/HTMLAnchorElement/script-tests/set-href-attribute-pathname.js:
        * fast/dom/HTMLAnchorElement/set-href-attribute-pathname-expected.txt:
        Updated expectations of one sub-test. We previously tried to keep the test passing
        as is (see bug 57291), but I couldn't find any reason to prefer the old behavior.

        * fast/url/host-expected.txt:
        * fast/url/host.html:
        Updated one subtest to new results, which match at least Gecko (original of the
        test actually claims that all browsers including Safari already do what we'll do now).

        * fast/url/segments-userinfo-vs-host-expected.txt: Added.
        * fast/url/segments-userinfo-vs-host.html: Added.
        Added a number of tests, with detailed explanations of the differences with Firefox,
        and with rationales.

        * http/tests/uri/username-with-no-hostname-expected.txt: Removed.
        * http/tests/uri/username-with-no-hostname.html-disabled: Removed.
        * platform/win/http/tests/uri/username-with-no-hostname-expected.txt: Removed.
        This test has been disabled for a long time, and being an end-to-end test for
        invalid URL handling, it would be difficult to make work again. We have multiple
        parsing tests for URLs like this.

git-svn-id: https://svn.webkit.org/repository/webkit/trunk@149925 268f45cc-cd09-0410-ab3c-d52691b4dbfc

13 files changed:
LayoutTests/ChangeLog
LayoutTests/fast/dom/HTMLAnchorElement/script-tests/set-href-attribute-pathname.js
LayoutTests/fast/dom/HTMLAnchorElement/set-href-attribute-pathname-expected.txt
LayoutTests/fast/url/host-expected.txt
LayoutTests/fast/url/host.html
LayoutTests/fast/url/segments-userinfo-vs-host-expected.txt [new file with mode: 0644]
LayoutTests/fast/url/segments-userinfo-vs-host.html [new file with mode: 0644]
LayoutTests/http/tests/uri/username-with-no-hostname-expected.txt [deleted file]
LayoutTests/http/tests/uri/username-with-no-hostname.html-disabled [deleted file]
LayoutTests/platform/win/http/tests/uri/username-with-no-hostname-expected.txt [deleted file]
Source/WebCore/ChangeLog
Source/WebCore/page/SecurityOrigin.cpp
Source/WebCore/platform/KURL.cpp

index b77e3e8..01f485d 100644 (file)
@@ -1,3 +1,32 @@
+2013-05-10  Alexey Proskuryakov  <ap@apple.com>
+
+        <rdar://problem/13666412> Clean up some edge cases of URL parsing.
+        https://bugs.webkit.org/show_bug.cgi?id=104919
+
+        Reviewed by Darin Adler.
+
+        * fast/dom/HTMLAnchorElement/script-tests/set-href-attribute-pathname.js:
+        * fast/dom/HTMLAnchorElement/set-href-attribute-pathname-expected.txt:
+        Updated expectations of one sub-test. We previously tried to keep the test passing
+        as is (see bug 57291), but I couldn't find any reason to prefer the old behavior.
+
+        * fast/url/host-expected.txt:
+        * fast/url/host.html:
+        Updated one subtest to new results, which match at least Gecko (original of the
+        test actually claims that all browsers including Safari already do what we'll do now).
+
+        * fast/url/segments-userinfo-vs-host-expected.txt: Added.
+        * fast/url/segments-userinfo-vs-host.html: Added.
+        Added a number of tests, with detailed explanations of the differences with Firefox,
+        and with rationales.
+
+        * http/tests/uri/username-with-no-hostname-expected.txt: Removed.
+        * http/tests/uri/username-with-no-hostname.html-disabled: Removed.
+        * platform/win/http/tests/uri/username-with-no-hostname-expected.txt: Removed.
+        This test has been disabled for a long time, and being an end-to-end test for
+        invalid URL handling, it would be difficult to make work again. We have multiple
+        parsing tests for URLs like this.
+
 2013-05-10  Christophe Dumez  <ch.dumez@sisa.samsung.com>
 
         Remove [NoInterfaceObject] from several WebAudio IDL interfaces
index 7ff0b3f..b7a89b9 100644 (file)
@@ -50,12 +50,12 @@ a.href = "https://www.my|d[]()omain.com/path/testurl.html?key=value";
 a.pathname = "p$a|th";
 shouldBe("a.href", "'https://www.my|d[]()omain.com/path/testurl.html?key=value'");
 
-// IE8 throws a security exception.
+// IE8 throws a security exception. Gecko parses this as a URL with an empty hostname.
 try {
 debug("Set pathname to URL that contains '@' in host");
 a.href = "http://w@#ww";
 a.pathname = "path";
-shouldBe("a.href", "'http://w@/path#ww'");
+shouldBe("a.href", "'http://w@#ww'");
 } catch(e) {
 debug("Exception: " + e.description);
 }
index 5b2d8ae..e5e0b7b 100644 (file)
@@ -18,7 +18,7 @@ PASS a.href is 'https://www.mydomain.com/?key=value'
 Set pathname that includes illegal characters to URL that contains illegal characters.
 PASS a.href is 'https://www.my|d[]()omain.com/path/testurl.html?key=value'
 Set pathname to URL that contains '@' in host
-PASS a.href is 'http://w@/path#ww'
+PASS a.href is 'http://w@#ww'
 Set pathname to a URL with non-hierarchical protocol
 PASS a.href is 'tel:+1800-555-1212'
 PASS successfullyParsed is true
index 96150ab..90bc481 100644 (file)
@@ -38,7 +38,7 @@ PASS canonicalize('http://\google.com/foo/') is 'http://google.com/foo/'
 FAIL canonicalize('http://\\google.com/foo/') should be http://google.com/foo/. Was http:/google.com/foo/.
 PASS canonicalize('http:////asdf@//') is 'http://asdf@//'
 PASS canonicalize('http:////:81/') is 'http://:81/'
-FAIL canonicalize('http://:///') should be http://///. Was http:///.
+PASS canonicalize('http://:///') is 'http://:///'
 PASS canonicalize('http://c:/') is 'http://c/'
 PASS canonicalize('http://xxxx:/') is 'http://xxxx/'
 PASS canonicalize('http://.:./') is 'http://.:./'
index e88b474..ce3c6d6 100644 (file)
@@ -79,7 +79,7 @@ cases = [
   ["\\\\google.com/foo","google.com/foo"],
   ["//asdf@/","asdf@/"],
   ["//:81",":81"],
-  ["://","//"],
+  ["://","://"],
   ["c:","c"],
   ["xxxx:","xxxx"],
   [".:.",".:."],
diff --git a/LayoutTests/fast/url/segments-userinfo-vs-host-expected.txt b/LayoutTests/fast/url/segments-userinfo-vs-host-expected.txt
new file mode 100644 (file)
index 0000000..87b8d61
--- /dev/null
@@ -0,0 +1,135 @@
+Canonicalization of URLs that start with something that may or may not be userinfo
+
+On success, you will see a series of "PASS" messages, followed by "TEST COMPLETE".
+
+
+PASS canonicalize('http:@www.apple.com') is 'http://www.apple.com/'
+PASS segments('http:@www.apple.com') is '["http:","www.apple.com","","/","",""]'
+
+PASS canonicalize('http:/@www.apple.com') is 'http://www.apple.com/'
+PASS segments('http:/@www.apple.com') is '["http:","www.apple.com","","/","",""]'
+
+PASS canonicalize('http://@www.apple.com') is 'http://www.apple.com/'
+PASS segments('http://@www.apple.com') is '["http:","www.apple.com","","/","",""]'
+
+PASS canonicalize('http:a:b@www.apple.com') is 'http://a:b@www.apple.com/'
+PASS segments('http:a:b@www.apple.com') is '["http:","www.apple.com","","/","",""]'
+
+PASS canonicalize('http:/a:b@www.apple.com') is 'http://a:b@www.apple.com/'
+PASS segments('http:/a:b@www.apple.com') is '["http:","www.apple.com","","/","",""]'
+
+PASS canonicalize('http://a:b@www.apple.com') is 'http://a:b@www.apple.com/'
+PASS segments('http://a:b@www.apple.com') is '["http:","www.apple.com","","/","",""]'
+
+PASS canonicalize('http://@pple.com') is 'http://pple.com/'
+PASS segments('http://@pple.com') is '["http:","pple.com","","/","",""]'
+
+PASS canonicalize('http::b@www.apple.com') is 'http://:b@www.apple.com/'
+PASS segments('http::b@www.apple.com') is '["http:","www.apple.com","","/","",""]'
+
+PASS canonicalize('http:/:b@www.apple.com') is 'http://:b@www.apple.com/'
+PASS segments('http:/:b@www.apple.com') is '["http:","www.apple.com","","/","",""]'
+
+PASS canonicalize('http://:b@www.apple.com') is 'http://:b@www.apple.com/'
+PASS segments('http://:b@www.apple.com') is '["http:","www.apple.com","","/","",""]'
+
+PASS canonicalize('http:/:@/www.apple.com') is 'http:/:@/www.apple.com'
+PASS segments('http:/:@/www.apple.com') is '[":","","","","",""]'
+
+PASS canonicalize('http://user@/www.apple.com') is 'http://user@/www.apple.com'
+PASS segments('http://user@/www.apple.com') is '[":","","","","",""]'
+
+PASS canonicalize('http:@/www.apple.com') is 'http:@/www.apple.com'
+PASS segments('http:@/www.apple.com') is '[":","","","","",""]'
+
+PASS canonicalize('http:/@/www.apple.com') is 'http:/@/www.apple.com'
+PASS segments('http:/@/www.apple.com') is '[":","","","","",""]'
+
+PASS canonicalize('http://@/www.apple.com') is 'http://@/www.apple.com'
+PASS segments('http://@/www.apple.com') is '[":","","","","",""]'
+
+PASS canonicalize('https:@/www.apple.com') is 'https:@/www.apple.com'
+PASS segments('https:@/www.apple.com') is '[":","","","","",""]'
+
+PASS canonicalize('http:a:b@/www.apple.com') is 'http:a:b@/www.apple.com'
+PASS segments('http:a:b@/www.apple.com') is '[":","","","","",""]'
+
+PASS canonicalize('http:/a:b@/www.apple.com') is 'http:/a:b@/www.apple.com'
+PASS segments('http:/a:b@/www.apple.com') is '[":","","","","",""]'
+
+PASS canonicalize('http://a:b@/www.apple.com') is 'http://a:b@/www.apple.com'
+PASS segments('http://a:b@/www.apple.com') is '[":","","","","",""]'
+
+PASS canonicalize('http::@/www.apple.com') is 'http::@/www.apple.com'
+PASS segments('http::@/www.apple.com') is '[":","","","","",""]'
+
+PASS canonicalize('http:a:@www.apple.com') is 'http://a@www.apple.com/'
+PASS segments('http:a:@www.apple.com') is '["http:","www.apple.com","","/","",""]'
+
+PASS canonicalize('http:/a:@www.apple.com') is 'http://a@www.apple.com/'
+PASS segments('http:/a:@www.apple.com') is '["http:","www.apple.com","","/","",""]'
+
+PASS canonicalize('http://a:@www.apple.com') is 'http://a@www.apple.com/'
+PASS segments('http://a:@www.apple.com') is '["http:","www.apple.com","","/","",""]'
+
+PASS canonicalize('http://a:b@www.@pple.com/p@th') is 'http://a:b@www.@pple.com/p@th'
+PASS segments('http://a:b@www.@pple.com/p@th') is '[":","","","","",""]'
+
+PASS canonicalize('http://www.@@pple.com') is 'http://www.@@pple.com'
+PASS segments('http://www.@@pple.com') is '[":","","","","",""]'
+
+PASS canonicalize('http://@@pple.com') is 'http://@@pple.com'
+PASS segments('http://@@pple.com') is '[":","","","","",""]'
+
+PASS canonicalize('http://@@@pple.com') is 'http://@@@pple.com'
+PASS segments('http://@@@pple.com') is '[":","","","","",""]'
+
+PASS canonicalize('http:@@/www.apple.com') is 'http:@@/www.apple.com'
+PASS segments('http:@@/www.apple.com') is '[":","","","","",""]'
+
+PASS canonicalize('http:/@@/www.apple.com') is 'http:/@@/www.apple.com'
+PASS segments('http:/@@/www.apple.com') is '[":","","","","",""]'
+
+PASS canonicalize('http://@@/www.apple.com') is 'http://@@/www.apple.com'
+PASS segments('http://@@/www.apple.com') is '[":","","","","",""]'
+
+PASS canonicalize('http:@:b@www.apple.com') is 'http:@:b@www.apple.com'
+PASS segments('http:@:b@www.apple.com') is '[":","","","","",""]'
+
+PASS canonicalize('http:/@:b@www.apple.com') is 'http:/@:b@www.apple.com'
+PASS segments('http:/@:b@www.apple.com') is '[":","","","","",""]'
+
+PASS canonicalize('http://@:b@www.apple.com') is 'http://@:b@www.apple.com'
+PASS segments('http://@:b@www.apple.com') is '[":","","","","",""]'
+
+PASS canonicalize('http://www.@pple.com') is 'http://www.@pple.com/'
+PASS segments('http://www.@pple.com') is '["http:","pple.com","","/","",""]'
+
+PASS canonicalize('http:@:www.apple.com') is 'http:@:www.apple.com'
+PASS segments('http:@:www.apple.com') is '[":","","","","",""]'
+
+PASS canonicalize('http:/@:www.apple.com') is 'http:/@:www.apple.com'
+PASS segments('http:/@:www.apple.com') is '[":","","","","",""]'
+
+PASS canonicalize('http://@:www.apple.com') is 'http://@:www.apple.com'
+PASS segments('http://@:www.apple.com') is '[":","","","","",""]'
+
+PASS canonicalize('http://:@www.apple.com') is 'http://www.apple.com/'
+PASS segments('http://:@www.apple.com') is '["http:","www.apple.com","","/","",""]'
+
+PASS canonicalize('http:@:/www.apple.com') is 'http://:/www.apple.com'
+PASS segments('http:@:/www.apple.com') is '["http:","","0","/www.apple.com","",""]'
+
+PASS canonicalize('http:/@:/www.apple.com') is 'http://:/www.apple.com'
+PASS segments('http:/@:/www.apple.com') is '["http:","","0","/www.apple.com","",""]'
+
+PASS canonicalize('http://@:/www.apple.com') is 'http://:/www.apple.com'
+PASS segments('http://@:/www.apple.com') is '["http:","","0","/www.apple.com","",""]'
+
+PASS canonicalize('http:@:80/www.apple.com') is 'http://:80/www.apple.com'
+PASS segments('http:@:80/www.apple.com') is '["http:","","80","/www.apple.com","",""]'
+
+PASS successfullyParsed is true
+
+TEST COMPLETE
+
diff --git a/LayoutTests/fast/url/segments-userinfo-vs-host.html b/LayoutTests/fast/url/segments-userinfo-vs-host.html
new file mode 100644 (file)
index 0000000..a6c8b66
--- /dev/null
@@ -0,0 +1,88 @@
+<!DOCTYPE html>
+<html>
+<head>
+<script src="../js/resources/js-test-pre.js"></script>
+<script src="resources/utilities.js"></script>
+</head>
+<body>
+<script>
+description("Canonicalization of URLs that start with something that may or may not be userinfo");
+
+cases = [
+    // These cases currently match between WebKit and Gecko, so the results should probably never change.
+    ["http:@www.apple.com", "http://www.apple.com/", ["http:","www.apple.com","","/","",""]],
+    ["http:/@www.apple.com", "http://www.apple.com/", ["http:","www.apple.com","","/","",""]],
+    ["http://@www.apple.com", "http://www.apple.com/", ["http:","www.apple.com","","/","",""]],
+    ["http:a:b@www.apple.com", "http://a:b@www.apple.com/", ["http:","www.apple.com","","/","",""]],
+    ["http:/a:b@www.apple.com", "http://a:b@www.apple.com/", ["http:","www.apple.com","","/","",""]],
+    ["http://a:b@www.apple.com", "http://a:b@www.apple.com/", ["http:","www.apple.com","","/","",""]],
+    ["http://@pple.com", "http://pple.com/", ["http:","pple.com","","/","",""]],
+    ["http::b@www.apple.com", "http://:b@www.apple.com/", ["http:","www.apple.com","","/","",""]],
+    ["http:/:b@www.apple.com", "http://:b@www.apple.com/", ["http:","www.apple.com","","/","",""]],
+    ["http://:b@www.apple.com", "http://:b@www.apple.com/", ["http:","www.apple.com","","/","",""]],
+
+    // Gecko and WebKit both treat this URLs as invalid, but Gecko reports "http:" as protocol.
+    ["http:/:@/www.apple.com", "http:/:@/www.apple.com", [":","","","","",""]],
+
+    // HTTP URLs with an empty host are invalid, because otherwise, there is just too much risk that another parsing pass (possibly by a network layer with a subtly different algorithm) would treat path as hostname.
+    ["http://user@/www.apple.com", "http://user@/www.apple.com", [":","","","","",""]], // Regression test for <https://bugs.webkit.org/show_bug.cgi?id=57220>.
+    ["http:@/www.apple.com", "http:@/www.apple.com", [":","","","","",""]], // Regression test for <https://bugs.webkit.org/show_bug.cgi?id=104919>.    
+    ["http:/@/www.apple.com", "http:/@/www.apple.com", [":","","","","",""]],
+    ["http://@/www.apple.com", "http://@/www.apple.com", [":","","","","",""]],
+    ["https:@/www.apple.com", "https:@/www.apple.com", [":","","","","",""]], // Regression test for <https://bugs.webkit.org/show_bug.cgi?id=104919>.    
+    ["http:a:b@/www.apple.com", "http:a:b@/www.apple.com", [":","","","","",""]],
+    ["http:/a:b@/www.apple.com", "http:/a:b@/www.apple.com", [":","","","","",""]],
+    ["http://a:b@/www.apple.com", "http://a:b@/www.apple.com", [":","","","","",""]],
+
+    // Gecko treats this URL as relative. WebKit treats it as invalid, because of empty host.
+    ["http::@/www.apple.com", "http::@/www.apple.com", [":","","","","",""]],
+
+    // Empty password. Gecko still adds a ':', WebKit does not.
+    ["http:a:@www.apple.com", "http://a@www.apple.com/", ["http:","www.apple.com","","/","",""]],
+    ["http:/a:@www.apple.com", "http://a@www.apple.com/", ["http:","www.apple.com","","/","",""]],
+    ["http://a:@www.apple.com", "http://a@www.apple.com/", ["http:","www.apple.com","","/","",""]],
+
+    // Some tests for multiple @-signs. WebKit treats these URLs as invalid, while Gecko allows '@' in userinfo, searching for the last instance in authority as a delimiter.
+    ["http://a:b@www.@pple.com/p@th", "http://a:b@www.@pple.com/p@th", [":","","","","",""]],
+    ["http://www.@@pple.com", "http://www.@@pple.com", [":","","","","",""]],
+    ["http://@@pple.com", "http://@@pple.com", [":","","","","",""]],
+    ["http://@@@pple.com", "http://@@@pple.com", [":","","","","",""]],
+    ["http:@@/www.apple.com", "http:@@/www.apple.com", [":","","","","",""]],
+    ["http:/@@/www.apple.com", "http:/@@/www.apple.com", [":","","","","",""]],
+    ["http://@@/www.apple.com", "http://@@/www.apple.com", [":","","","","",""]],
+    ["http:@:b@www.apple.com", "http:@:b@www.apple.com", [":","","","","",""]],
+    ["http:/@:b@www.apple.com", "http:/@:b@www.apple.com", [":","","","","",""]],
+    ["http://@:b@www.apple.com", "http://@:b@www.apple.com", [":","","","","",""]],
+
+    // Gecko escapes '.' in username, WebKit does not.
+    ["http://www.@pple.com", "http://www.@pple.com/", ["http:","pple.com","","/","",""]],
+
+    // Invalid in WebKit. Gecko thinks that hostname is ":www.apple.com", which doesn't seem like a good idea.
+    ["http:@:www.apple.com", "http:@:www.apple.com", [":","","","","",""]],
+    ["http:/@:www.apple.com", "http:/@:www.apple.com", [":","","","","",""]],
+    ["http://@:www.apple.com", "http://@:www.apple.com", [":","","","","",""]],
+
+    // Invalid in Gecko. WebKit just treats this as a URL with empty username and password, which appears to make sense.
+    ["http://:@www.apple.com", "http://www.apple.com/", ["http:","www.apple.com","","/","",""]],
+
+    // Host is empty, but hostname is not, so the resulting parsed URL string is not particularly ambiguous.
+    // Gecko builds the same parsed URL string, although its components are different.
+    // FIXME: Unsure why we are getting a "0" as port here, but it probably doesn't matter for such an edge case.
+    ["http:@:/www.apple.com", "http://:/www.apple.com", ["http:","","0","/www.apple.com","",""]],
+    ["http:/@:/www.apple.com", "http://:/www.apple.com", ["http:","","0","/www.apple.com","",""]],
+    ["http://@:/www.apple.com", "http://:/www.apple.com", ["http:","","0","/www.apple.com","",""]],
+    ["http:@:80/www.apple.com", "http://:80/www.apple.com", ["http:","","80","/www.apple.com","",""]],
+];
+
+for (var i = 0; i < cases.length; ++i) {
+    var test_string = cases[i][0];
+    var expected_canonicalized_url = cases[i][1];
+    var expected_url_segments = cases[i][2];
+    shouldBe("canonicalize('" + test_string + "')", "'" + expected_canonicalized_url + "'");
+    shouldBe("segments('" + test_string + "')", "'" + JSON.stringify(expected_url_segments) + "'");
+    debug("");
+}
+</script>
+<script src="../js/resources/js-test-post.js"></script>
+</body>
+</html>
diff --git a/LayoutTests/http/tests/uri/username-with-no-hostname-expected.txt b/LayoutTests/http/tests/uri/username-with-no-hostname-expected.txt
deleted file mode 100644 (file)
index fa082de..0000000
+++ /dev/null
@@ -1,4 +0,0 @@
-Blocked access to external URL http://user@/localhost:8000/misc/resources/compass.jpg
-This test checks that URLs with a username but no hostname do not mistakenly get loaded. If it fails in the browser, you will also see an image.
-PASS document.getElementsByTagName("img")[0].width is 0
-
diff --git a/LayoutTests/http/tests/uri/username-with-no-hostname.html-disabled b/LayoutTests/http/tests/uri/username-with-no-hostname.html-disabled
deleted file mode 100644 (file)
index 93c11b6..0000000
+++ /dev/null
@@ -1,22 +0,0 @@
-<!DOCTYPE html>
-<html>
-<head>
-  <script src="/js-test-resources/js-test-pre.js"></script>
-</head>
-<body onload="load()">
-
-This test checks that URLs with a username but no hostname do not
-mistakenly get loaded. If it fails in the browser, you will also see
-an image.
-
-<p id="description"></p>
-<div id="console"></div>
-
-<img src="http://user@/localhost:8000/misc/resources/compass.jpg" onerror="this.width=0">
-<script>
-function load() {
-    shouldBe('document.getElementsByTagName("img")[0].width', "0");
-}
-</script>       
-
-</body>
diff --git a/LayoutTests/platform/win/http/tests/uri/username-with-no-hostname-expected.txt b/LayoutTests/platform/win/http/tests/uri/username-with-no-hostname-expected.txt
deleted file mode 100644 (file)
index 655bd25..0000000
+++ /dev/null
@@ -1,3 +0,0 @@
-This test checks that URLs with a username but no hostname do not mistakenly get loaded. If it fails in the browser, you will also see an image.
-PASS document.getElementsByTagName("img")[0].width is 0
-
index 58365fd..141f228 100644 (file)
@@ -1,5 +1,38 @@
 2013-05-10  Alexey Proskuryakov  <ap@apple.com>
 
+        <rdar://problem/13666412> Clean up some edge cases of URL parsing.
+        https://bugs.webkit.org/show_bug.cgi?id=104919
+
+        Reviewed by Darin Adler.
+
+        Test: fast/url/segments-userinfo-vs-host.html
+
+        * page/SecurityOrigin.cpp:
+        (WebCore::schemeRequiresHost):
+        (WebCore::shouldTreatAsUniqueOrigin):
+        Updated function name and comments (host is not the same as authority). We still
+        need this check - KURL can still produce http URLs with an empty host (even as this
+        patch reduces the number of such cases). So can Gecko and current draft of URL
+        Standard.
+        It would be good to have a guarantee that such useless URLs can not come out of
+        URL parser, as relying on downstream code re-parsing the URL correctly would be fragile.
+
+        * platform/KURL.cpp:
+        (WebCore::hostPortIsEmptyButCredentialsArePresent): Updated an argument name for
+        correctness.
+        (WebCore::KURL::parse):
+        1. Reverted behavior changes from <http://trac.webkit.org/changeset/82181> - I could
+        find no reason to allow "@" in hostnames, and having a URL like this re-parsed by
+        a different parser would likely produce different results. It's better to just treat
+        these edge case URLs as invalid.
+        2. When hostname component is a lone colon, preserve it in parsed URL string,
+        as otherwise path would get pushed in its place when re-parsing.
+        3. When authority component is a lone colon, don't forget to "//" after scheme, too.
+        4. Added some assertions about contents of authority component, to catch potential
+        mis-parsing earlier.
+
+2013-05-10  Alexey Proskuryakov  <ap@apple.com>
+
         Make TextCodecICU not depend on TextEncoding
         https://bugs.webkit.org/show_bug.cgi?id=115848
 
index 47ea1f5..f66dc0c 100644 (file)
@@ -44,7 +44,7 @@ namespace WebCore {
 const int InvalidPort = 0;
 const int MaxAllowedPort = 65535;
 
-static bool schemeRequiresAuthority(const KURL& url)
+static bool schemeRequiresHost(const KURL& url)
 {
     // We expect URLs with these schemes to have authority components. If the
     // URL lacks an authority component, we get concerned and mark the origin
@@ -101,9 +101,9 @@ static bool shouldTreatAsUniqueOrigin(const KURL& url)
     // FIXME: Check whether innerURL is valid.
 
     // For edge case URLs that were probably misparsed, make sure that the origin is unique.
-    // FIXME: Do we really need to do this? This looks to be a hack around a
-    // security bug in CFNetwork that might have been fixed.
-    if (schemeRequiresAuthority(innerURL) && innerURL.host().isEmpty())
+    // This is an additional safety net against bugs in KURL parsing, and for network back-ends that parse URLs differently,
+    // and could misinterpret another component for hostname.
+    if (schemeRequiresHost(innerURL) && innerURL.host().isEmpty())
         return true;
 
     // SchemeRegistry needs a lower case protocol because it uses HashMaps
index a747f39..396f3b5 100644 (file)
@@ -1028,9 +1028,9 @@ static inline bool isDefaultPortForScheme(const char* port, size_t portLength, c
     return false;
 }
 
-static inline bool hostPortIsEmptyButCredentialsArePresent(int hostStart, int portEnd, char userEndChar)
+static inline bool hostPortIsEmptyButCredentialsArePresent(int hostStart, int portEnd, char userinfoEndChar)
 {
-    return userEndChar == '@' && hostStart == portEnd;
+    return userinfoEndChar == '@' && hostStart == portEnd;
 }
 
 static bool isNonFileHierarchicalScheme(const char* scheme, size_t schemeLength)
@@ -1207,10 +1207,10 @@ void KURL::parse(const char* url, const String* originalString)
             return;
         }
 
-        if (hostPortIsEmptyButCredentialsArePresent(hostStart, portEnd, url[userEnd])) {
-            // in this circumstance, act as if there is an erroneous hostname containing an '@'
-            userEnd = userStart;
-            hostStart = userEnd;
+        if (hostPortIsEmptyButCredentialsArePresent(hostStart, portEnd, url[passwordEnd])) {
+            m_string = originalString ? *originalString : url;
+            invalidate();
+            return;
         }
 
         if (userStart == portEnd && !m_protocolIsInHTTPFamily && !isFile) {
@@ -1281,7 +1281,9 @@ void KURL::parse(const char* url, const String* originalString)
     // File URLs need a host part unless it is just file:// or file://localhost
     bool degenerateFilePath = pathStart == pathEnd && (hostStart == hostEnd || hostIsLocalHost);
 
-    bool haveNonHostAuthorityPart = userStart != userEnd || passwordStart != passwordEnd || portStart != portEnd;
+    // We drop empty credentials, but keep a colon in an empty host/port pair.
+    // Removing hostname completely would change the structure of the URL on re-parsing.
+    bool haveNonHostAuthorityPart = userStart != userEnd || passwordStart != passwordEnd || hostEnd != portEnd;
 
     // add ":" after scheme
     *p++ = ':';
@@ -1296,8 +1298,11 @@ void KURL::parse(const char* url, const String* originalString)
         // copy in the user
         strPtr = url + userStart;
         const char* userEndPtr = url + userEnd;
-        while (strPtr < userEndPtr)
-            *p++ = *strPtr++;
+        while (strPtr < userEndPtr) {
+            char c = *strPtr++;
+            ASSERT(isUserInfoChar(c));
+            *p++ = c;
+        }
         m_userEnd = p - buffer.data();
 
         // copy in the password
@@ -1305,8 +1310,11 @@ void KURL::parse(const char* url, const String* originalString)
             *p++ = ':';
             strPtr = url + passwordStart;
             const char* passwordEndPtr = url + passwordEnd;
-            while (strPtr < passwordEndPtr)
-                *p++ = *strPtr++;
+            while (strPtr < passwordEndPtr) {
+                char c = *strPtr++;
+                ASSERT(isUserInfoChar(c));
+                *p++ = c;
+            }
         }
         m_passwordEnd = p - buffer.data();
 
@@ -1319,20 +1327,27 @@ void KURL::parse(const char* url, const String* originalString)
             strPtr = url + hostStart;
             const char* hostEndPtr = url + hostEnd;
             if (isCanonicalHostnameLowercaseForScheme(buffer.data(), m_schemeEnd)) {
-                while (strPtr < hostEndPtr)
-                    *p++ = toASCIILower(*strPtr++);
+                while (strPtr < hostEndPtr) {
+                    char c = toASCIILower(*strPtr++);
+                    ASSERT(isHostnameChar(c) || c == '[' || c == ']' || c == ':');
+                    *p++ = c;
+                }
             } else {
-                while (strPtr < hostEndPtr)
-                    *p++ = *strPtr++;
+                while (strPtr < hostEndPtr) {
+                    char c = *strPtr++;
+                    ASSERT(isHostnameChar(c) || c == '[' || c == ']' || c == ':');
+                    *p++ = c;
+                }
             }
         }
         m_hostEnd = p - buffer.data();
 
-        // Copy in the port if the URL has one (and it's not default).
+        // Copy in the port if the URL has one (and it's not default). Also, copy it if there was no hostname, so that there is still something in authority component.
         if (hostEnd != portStart) {
             const char* portStr = url + portStart;
             size_t portLength = portEnd - portStart;
-            if (portLength && !isDefaultPortForScheme(portStr, portLength, buffer.data(), m_schemeEnd)) {
+            if ((portLength && !isDefaultPortForScheme(portStr, portLength, buffer.data(), m_schemeEnd))
+                || (hostStart == hostEnd && hostEnd != portStart)) {
                 *p++ = ':';
                 const char* portEndPtr = url + portEnd;
                 while (portStr < portEndPtr)