bmalloc: speed up the lock slow path
authorggaren@apple.com <ggaren@apple.com@268f45cc-cd09-0410-ab3c-d52691b4dbfc>
Mon, 22 Aug 2016 23:18:09 +0000 (23:18 +0000)
committerggaren@apple.com <ggaren@apple.com@268f45cc-cd09-0410-ab3c-d52691b4dbfc>
Mon, 22 Aug 2016 23:18:09 +0000 (23:18 +0000)
https://bugs.webkit.org/show_bug.cgi?id=161058

Reviewed by Filip Pizlo.

It is generally accepted practice that a lock should yield instead of
spinning when a lock acquisition fails, to avoid wasting CPU and power.

There are two problems with this generally accepted practice:

(1) It's a fallacy that yielding is free. In reality, yielding itself
consumes CPU and power -- by performing a syscall, running the OS
scheduler, and possibly performing a context switch. (Instruments
traces of MallocBench show the cost of yielding.) Therefore, spinning a
little to avoid yielding can actually *save* CPU and power.

(2) std::this_thread_yield() on Darwin is way too aggressive: It not only
yields but also depresses your priority to absolute zero for 10ms. A
recent PLT trace showed a few spots where the main thread just gave up
on loading and rendering a page for 10ms so an unimportant background
task could run.

To correct these problems, this patch adds a little bit of spinning to
the bmalloc lock slow path.

Below are performance results on various CPUs.

Mac Pro (12 hyperthreaded cores = 24 threads):

                                                    Baseline                       Patch                           Δ
    Execution Time:
        message_one                                    173ms                       173ms
        message_many                                   953ms                       927ms              ^ 1.03x faster
        churn --parallel                                60ms                        41ms              ^ 1.46x faster
        list_allocate --parallel                       224ms                       143ms              ^ 1.57x faster
        tree_allocate --parallel                     1,190ms                       758ms              ^ 1.57x faster
        tree_churn --parallel                        1,517ms                       906ms              ^ 1.67x faster
        facebook --parallel                          6,519ms                     4,580ms              ^ 1.42x faster
        reddit --parallel                            5,097ms                     3,411ms              ^ 1.49x faster
        flickr --parallel                            4,903ms                     3,501ms               ^ 1.4x faster
        theverge --parallel                          6,641ms                     4,505ms              ^ 1.47x faster

        <geometric mean>                             1,158ms                       832ms              ^ 1.39x faster
        <arithmetic mean>                            2,728ms                     1,895ms              ^ 1.44x faster
        <harmonic mean>                                332ms                       240ms              ^ 1.38x faster

MacBook Air (2 hyperthreaded cores = 4 threads):

                                                    Baseline                       Patch                           Δ
    Execution Time:
        message_one                                    911ms                       907ms               ^ 1.0x faster
        message_many                                   515ms                       513ms               ^ 1.0x faster
        churn --parallel                               132ms                       134ms              ! 1.02x slower
        list_allocate --parallel                       104ms                       102ms              ^ 1.02x faster
        tree_allocate --parallel                       117ms                       111ms              ^ 1.05x faster
        tree_churn --parallel                          154ms                       151ms              ^ 1.02x faster
        facebook --parallel                            719ms                       687ms              ^ 1.05x faster
        reddit --parallel                              382ms                       341ms              ^ 1.12x faster
        flickr --parallel                              372ms                       345ms              ^ 1.08x faster
        theverge --parallel                            489ms                       444ms               ^ 1.1x faster

        <geometric mean>                               299ms                       287ms              ^ 1.04x faster
        <arithmetic mean>                              390ms                       374ms              ^ 1.04x faster
        <harmonic mean>                                227ms                       220ms              ^ 1.03x faster

iPad (2 cores = 2 threads):

    [ Doesn't run Ruby, so no pretty subtest output. ]

                                                    Baseline                       Patch                           Δ
    Execution Time:                                 174.14ms                     171.5ms              ^ 1.02x faster

* bmalloc.xcodeproj/project.pbxproj:

* bmalloc/ScopeExit.h: Added. A barebones very wimpy version of
WTF::ScopeExit.
(bmalloc::ScopeExit::ScopeExit):
(bmalloc::ScopeExit::~ScopeExit):
(bmalloc::makeScopeExit):

* bmalloc/StaticMutex.cpp:
(bmalloc::StaticMutex::lockSlowCase): Spin before yielding -- that's the
speedup. Don't spin if another CPU is already spinning. In theory, more
than one spinner accomplishes nothing, and I found that there's a cutoff
around 8 or 16 spinners that becomes performance negative on Mac Pro.

(Note: Another way to accomplish a similar result, if you don't want to
use a bit of state in the lock, is to spin for a random duration between
0 and aLot. I tested a version of WTF::WeakRandom with unsynchronized
static state and it worked great. But I ultimately opted for the explicit
bit because I thought it was clearer.)

* bmalloc/StaticMutex.h:
(bmalloc::StaticMutex::init): Initialize our new bit.

* bmalloc/ThreadSwitch.h: Added.
(bmalloc::threadSwitch): Don't call yield() on Darwin because it's too
aggressive. swtch() does what we want: Go run something else, without
any other side-effects.

git-svn-id: https://svn.webkit.org/repository/webkit/trunk@204744 268f45cc-cd09-0410-ab3c-d52691b4dbfc

Source/bmalloc/ChangeLog
Source/bmalloc/bmalloc.xcodeproj/project.pbxproj
Source/bmalloc/bmalloc/ScopeExit.h [new file with mode: 0644]
Source/bmalloc/bmalloc/StaticMutex.cpp
Source/bmalloc/bmalloc/StaticMutex.h
Source/bmalloc/bmalloc/ThreadSwitch.h [new file with mode: 0644]

index d24263f..24dd9d5 100644 (file)
@@ -1,3 +1,105 @@
+2016-08-22  Geoffrey Garen  <ggaren@apple.com>
+
+        bmalloc: speed up the lock slow path
+        https://bugs.webkit.org/show_bug.cgi?id=161058
+
+        Reviewed by Filip Pizlo.
+
+        It is generally accepted practice that a lock should yield instead of
+        spinning when a lock acquisition fails, to avoid wasting CPU and power.
+
+        There are two problems with this generally accepted practice:
+
+        (1) It's a fallacy that yielding is free. In reality, yielding itself
+        consumes CPU and power -- by performing a syscall, running the OS
+        scheduler, and possibly performing a context switch. (Instruments
+        traces of MallocBench show the cost of yielding.) Therefore, spinning a
+        little to avoid yielding can actually *save* CPU and power.
+
+        (2) std::this_thread_yield() on Darwin is way too aggressive: It not only
+        yields but also depresses your priority to absolute zero for 10ms. A
+        recent PLT trace showed a few spots where the main thread just gave up
+        on loading and rendering a page for 10ms so an unimportant background
+        task could run.
+
+        To correct these problems, this patch adds a little bit of spinning to
+        the bmalloc lock slow path.
+
+        Below are performance results on various CPUs.
+
+        Mac Pro (12 hyperthreaded cores = 24 threads):
+
+                                                            Baseline                       Patch                           Δ
+            Execution Time:
+                message_one                                    173ms                       173ms                            
+                message_many                                   953ms                       927ms              ^ 1.03x faster
+                churn --parallel                                60ms                        41ms              ^ 1.46x faster
+                list_allocate --parallel                       224ms                       143ms              ^ 1.57x faster
+                tree_allocate --parallel                     1,190ms                       758ms              ^ 1.57x faster
+                tree_churn --parallel                        1,517ms                       906ms              ^ 1.67x faster
+                facebook --parallel                          6,519ms                     4,580ms              ^ 1.42x faster
+                reddit --parallel                            5,097ms                     3,411ms              ^ 1.49x faster
+                flickr --parallel                            4,903ms                     3,501ms               ^ 1.4x faster
+                theverge --parallel                          6,641ms                     4,505ms              ^ 1.47x faster
+
+                <geometric mean>                             1,158ms                       832ms              ^ 1.39x faster
+                <arithmetic mean>                            2,728ms                     1,895ms              ^ 1.44x faster
+                <harmonic mean>                                332ms                       240ms              ^ 1.38x faster
+
+        MacBook Air (2 hyperthreaded cores = 4 threads):
+
+                                                            Baseline                       Patch                           Δ
+            Execution Time:
+                message_one                                    911ms                       907ms               ^ 1.0x faster
+                message_many                                   515ms                       513ms               ^ 1.0x faster
+                churn --parallel                               132ms                       134ms              ! 1.02x slower
+                list_allocate --parallel                       104ms                       102ms              ^ 1.02x faster
+                tree_allocate --parallel                       117ms                       111ms              ^ 1.05x faster
+                tree_churn --parallel                          154ms                       151ms              ^ 1.02x faster
+                facebook --parallel                            719ms                       687ms              ^ 1.05x faster
+                reddit --parallel                              382ms                       341ms              ^ 1.12x faster
+                flickr --parallel                              372ms                       345ms              ^ 1.08x faster
+                theverge --parallel                            489ms                       444ms               ^ 1.1x faster
+
+                <geometric mean>                               299ms                       287ms              ^ 1.04x faster
+                <arithmetic mean>                              390ms                       374ms              ^ 1.04x faster
+                <harmonic mean>                                227ms                       220ms              ^ 1.03x faster
+
+        iPad (2 cores = 2 threads):
+
+            [ Doesn't run Ruby, so no pretty subtest output. ]
+
+                                                            Baseline                       Patch                           Δ
+            Execution Time:                                 174.14ms                     171.5ms              ^ 1.02x faster
+
+        * bmalloc.xcodeproj/project.pbxproj:
+
+        * bmalloc/ScopeExit.h: Added. A barebones very wimpy version of
+        WTF::ScopeExit.
+        (bmalloc::ScopeExit::ScopeExit):
+        (bmalloc::ScopeExit::~ScopeExit):
+        (bmalloc::makeScopeExit):
+
+        * bmalloc/StaticMutex.cpp:
+        (bmalloc::StaticMutex::lockSlowCase): Spin before yielding -- that's the
+        speedup. Don't spin if another CPU is already spinning. In theory, more
+        than one spinner accomplishes nothing, and I found that there's a cutoff
+        around 8 or 16 spinners that becomes performance negative on Mac Pro.
+
+        (Note: Another way to accomplish a similar result, if you don't want to
+        use a bit of state in the lock, is to spin for a random duration between
+        0 and aLot. I tested a version of WTF::WeakRandom with unsynchronized
+        static state and it worked great. But I ultimately opted for the explicit
+        bit because I thought it was clearer.)
+
+        * bmalloc/StaticMutex.h:
+        (bmalloc::StaticMutex::init): Initialize our new bit.
+
+        * bmalloc/ThreadSwitch.h: Added.
+        (bmalloc::threadSwitch): Don't call yield() on Darwin because it's too
+        aggressive. swtch() does what we want: Go run something else, without
+        any other side-effects.
+
 2016-08-03  Geoffrey Garen  <ggaren@apple.com>
 
         [bmalloc] Merging of XLargeRanges can leak the upper range
index 7f7a2ec..ded5c1a 100644 (file)
@@ -24,6 +24,8 @@
                147DC6E31CA5B70B00724E8D /* Chunk.h in Headers */ = {isa = PBXBuildFile; fileRef = 147DC6E21CA5B70B00724E8D /* Chunk.h */; settings = {ATTRIBUTES = (Private, ); }; };
                14895D911A3A319C0006235D /* Environment.cpp in Sources */ = {isa = PBXBuildFile; fileRef = 14895D8F1A3A319C0006235D /* Environment.cpp */; };
                14895D921A3A319C0006235D /* Environment.h in Headers */ = {isa = PBXBuildFile; fileRef = 14895D901A3A319C0006235D /* Environment.h */; settings = {ATTRIBUTES = (Private, ); }; };
+               148EFAE81D6B953B008E721E /* ScopeExit.h in Headers */ = {isa = PBXBuildFile; fileRef = 148EFAE61D6B953B008E721E /* ScopeExit.h */; };
+               148EFAE91D6B953B008E721E /* ThreadSwitch.h in Headers */ = {isa = PBXBuildFile; fileRef = 148EFAE71D6B953B008E721E /* ThreadSwitch.h */; };
                14C8992B1CC485E70027A057 /* Map.h in Headers */ = {isa = PBXBuildFile; fileRef = 14C8992A1CC485E70027A057 /* Map.h */; settings = {ATTRIBUTES = (Private, ); }; };
                14C8992D1CC578330027A057 /* XLargeRange.h in Headers */ = {isa = PBXBuildFile; fileRef = 14C8992C1CC578330027A057 /* XLargeRange.h */; settings = {ATTRIBUTES = (Private, ); }; };
                14C919C918FCC59F0028DB43 /* BPlatform.h in Headers */ = {isa = PBXBuildFile; fileRef = 14C919C818FCC59F0028DB43 /* BPlatform.h */; settings = {ATTRIBUTES = (Private, ); }; };
                1485656018A43DBA00ED6942 /* ObjectType.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; name = ObjectType.h; path = bmalloc/ObjectType.h; sourceTree = "<group>"; };
                14895D8F1A3A319C0006235D /* Environment.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; name = Environment.cpp; path = bmalloc/Environment.cpp; sourceTree = "<group>"; };
                14895D901A3A319C0006235D /* Environment.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; name = Environment.h; path = bmalloc/Environment.h; sourceTree = "<group>"; };
+               148EFAE61D6B953B008E721E /* ScopeExit.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; name = ScopeExit.h; path = bmalloc/ScopeExit.h; sourceTree = "<group>"; };
+               148EFAE71D6B953B008E721E /* ThreadSwitch.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; name = ThreadSwitch.h; path = bmalloc/ThreadSwitch.h; sourceTree = "<group>"; };
                14B650C518F39F4800751968 /* Base.xcconfig */ = {isa = PBXFileReference; lastKnownFileType = text.xcconfig; path = Base.xcconfig; sourceTree = "<group>"; };
                14B650C618F39F4800751968 /* bmalloc.xcconfig */ = {isa = PBXFileReference; lastKnownFileType = text.xcconfig; path = bmalloc.xcconfig; sourceTree = "<group>"; };
                14B650C718F39F4800751968 /* DebugRelease.xcconfig */ = {isa = PBXFileReference; lastKnownFileType = text.xcconfig; path = DebugRelease.xcconfig; sourceTree = "<group>"; };
                                14446A0717A61FA400F9EA1D /* PerProcess.h */,
                                144469FD17A61F1F00F9EA1D /* PerThread.h */,
                                145F6878179E3A4400D65598 /* Range.h */,
+                               148EFAE61D6B953B008E721E /* ScopeExit.h */,
                                143CB81A19022BC900B16A45 /* StaticMutex.cpp */,
                                143CB81B19022BC900B16A45 /* StaticMutex.h */,
                                1417F64F18B7280C0076FA3F /* Syscall.h */,
+                               148EFAE71D6B953B008E721E /* ThreadSwitch.h */,
                                1479E21217A1A255006D4E9D /* Vector.h */,
                                1479E21417A1A63E006D4E9D /* VMAllocate.h */,
                        );
                                4426E2831C839547008EB042 /* BSoftLinking.h in Headers */,
                                14DD789018F48CEB00950702 /* Sizes.h in Headers */,
                                14DD78C718F48D7500950702 /* BAssert.h in Headers */,
+                               148EFAE91D6B953B008E721E /* ThreadSwitch.h in Headers */,
                                14DD78D018F48D7500950702 /* VMAllocate.h in Headers */,
                                14DD78CE18F48D7500950702 /* Syscall.h in Headers */,
                                14DD78C618F48D7500950702 /* AsyncTask.h in Headers */,
                                14895D921A3A319C0006235D /* Environment.h in Headers */,
                                1400274A18F89C2300115C97 /* VMHeap.h in Headers */,
                                1400274918F89C1300115C97 /* Heap.h in Headers */,
+                               148EFAE81D6B953B008E721E /* ScopeExit.h in Headers */,
                                140FA00319CE429C00FFD3C8 /* BumpRange.h in Headers */,
                                4426E2811C838EE0008EB042 /* Logging.h in Headers */,
                                14DD78C518F48D7500950702 /* Algorithm.h in Headers */,
diff --git a/Source/bmalloc/bmalloc/ScopeExit.h b/Source/bmalloc/bmalloc/ScopeExit.h
new file mode 100644 (file)
index 0000000..3d99181
--- /dev/null
@@ -0,0 +1,53 @@
+/*
+ * Copyright (C) 2016 Apple Inc. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY APPLE INC. ``AS IS'' AND ANY
+ * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL APPLE INC. OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+ * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+ * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */
+
+#include <type_traits>
+
+namespace bmalloc {
+
+template<typename ExitFunction>
+class ScopeExit {
+public:
+    explicit ScopeExit(ExitFunction&& exitFunction)
+        : m_exitFunction(exitFunction)
+    {
+    }
+
+    ~ScopeExit()
+    {
+        m_exitFunction();
+    }
+
+private:
+    ExitFunction m_exitFunction;
+};
+
+template<typename ExitFunction>
+ScopeExit<ExitFunction> makeScopeExit(ExitFunction&& exitFunction)
+{
+    return ScopeExit<ExitFunction>(std::forward<ExitFunction>(exitFunction));
+}
+    
+} // namespace bmalloc
index a99614d..6649a57 100644 (file)
  * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
  */
 
+#include "ScopeExit.h"
 #include "StaticMutex.h"
-#include <thread>
+#include "ThreadSwitch.h"
 
 namespace bmalloc {
 
 void StaticMutex::lockSlowCase()
 {
+    // The longest critical section in bmalloc is much shorter than the
+    // time it takes to make a system call to yield to the OS scheduler.
+    // So, we try again a lot before we yield.
+    static const size_t aLot = 256;
+    
+    if (!m_isSpinning.test_and_set()) {
+        auto clear = makeScopeExit([&] { m_isSpinning.clear(); });
+
+        for (size_t i = 0; i < aLot; ++i) {
+            if (try_lock())
+                return;
+        }
+    }
+
+    // Avoid spinning pathologically.
     while (!try_lock())
-        std::this_thread::yield();
+        threadSwitch();
 }
 
 } // namespace bmalloc
index 4404e95..9e13617 100644 (file)
@@ -52,6 +52,7 @@ private:
     void lockSlowCase();
 
     std::atomic_flag m_flag;
+    std::atomic_flag m_isSpinning;
 };
 
 static inline void sleep(
@@ -78,6 +79,7 @@ static inline void waitUntilFalse(
 inline void StaticMutex::init()
 {
     m_flag.clear();
+    m_isSpinning.clear();
 }
 
 inline bool StaticMutex::try_lock()
diff --git a/Source/bmalloc/bmalloc/ThreadSwitch.h b/Source/bmalloc/bmalloc/ThreadSwitch.h
new file mode 100644 (file)
index 0000000..385c84f
--- /dev/null
@@ -0,0 +1,44 @@
+/*
+ * Copyright (C) 2016 Apple Inc. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY APPLE INC. ``AS IS'' AND ANY
+ * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL APPLE INC. OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+ * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+ * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#if BOS(DARWIN)
+#include <mach/thread_switch.h>
+#endif
+#include <thread>
+
+namespace bmalloc {
+    
+inline void threadSwitch()
+{
+    // yield() on Darwin will depress your priority to absolute 0 for 10ms,
+    // and possibly clock down the CPU -- so we avoid it.
+#if BOS(DARWIN)
+    swtch();
+#else
+    std::this_thread::yield();
+#endif
+}
+
+} // namespace bmalloc