Network cache Bloom filter is too big
author antti@apple.com <antti@apple.com@268f45cc-cd09-0410-ab3c-d52691b4dbfc>
Sun, 5 Apr 2015 15:18:47 +0000 (15:18 +0000)
committer antti@apple.com <antti@apple.com@268f45cc-cd09-0410-ab3c-d52691b4dbfc>
Sun, 5 Apr 2015 15:18:47 +0000 (15:18 +0000)
https://bugs.webkit.org/show_bug.cgi?id=143400

Reviewed by Chris Dumez.

It is currently 1MB.

This patch switches the cache from a counting filter (CountingBloomFilter) to a bit filter (BloomFilter).

It also reduces the filter size from 2^20 to 2^18 elements, which is good for ~26000 cache entries while
still keeping the false positive rate below 1%. The current cache capacity allows around 4000 entries
with typical web content.

The new filter size is 32KB.
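
As a back-of-the-envelope check of the numbers above (a sketch, not WebKit code;
the one-byte-per-counter figure is an assumption consistent with the 1MB size):

    #include <cstddef>
    #include <cstdio>

    int main()
    {
        // Old: CountingBloomFilter<20> has 2^20 counters; at one byte per
        // counter (assumed) that is 1MB.
        std::size_t oldBytes = (1u << 20) * sizeof(unsigned char); // 1048576 = 1MB

        // New: BloomFilter<18> has 2^18 one-bit entries, i.e. 32KB.
        std::size_t newBytes = (1u << 18) / 8; // 32768 = 32KB

        // ~10 bits per element keep the false positive rate below 1%,
        // so 2^18 bits are good for about 26000 entries.
        std::size_t capacity = (1u << 18) / 10; // 26214

        std::printf("old=%zu new=%zu capacity=%zu\n", oldBytes, newBytes, capacity);
        return 0;
    }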

* NetworkProcess/cache/NetworkCacheStorage.cpp:
(WebKit::NetworkCache::Storage::Storage):
(WebKit::NetworkCache::Storage::synchronize):

    Turn the initialization function into a general-purpose synchronization function.

(WebKit::NetworkCache::Storage::addToContentsFilter):

    Collect newly added hashes so we don't miss entries that were added during synchronization
    (see the first sketch below).

(WebKit::NetworkCache::Storage::mayContain):
(WebKit::NetworkCache::Storage::remove):
(WebKit::NetworkCache::Storage::retrieve):
(WebKit::NetworkCache::Storage::store):
(WebKit::NetworkCache::Storage::dispatchPendingWriteOperations):
(WebKit::NetworkCache::Storage::dispatchFullWriteOperation):
(WebKit::NetworkCache::Storage::dispatchHeaderWriteOperation):
(WebKit::NetworkCache::Storage::setMaximumSize):
(WebKit::NetworkCache::Storage::clear):
(WebKit::NetworkCache::Storage::shrinkIfNeeded):
(WebKit::NetworkCache::Storage::shrink):

    A non-counting Bloom filter does not support removals, so this requires a new strategy
    (see the second sketch below).

    Shrink code now simply deletes entries. The filter is updated by calling synchronize() at the end.
    While we could synchronize the filter during traversal, it is better to have just one function for that.

(WebKit::NetworkCache::Storage::initialize): Deleted.
* NetworkProcess/cache/NetworkCacheStorage.h:
(WebKit::NetworkCache::Storage::mayContain):
(WebKit::NetworkCache::Storage::cacheMayContain): Deleted.

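First sketch: a minimal, single-threaded simulation of the synchronization pattern
used here (buffer additions while a replacement filter is being built, then merge
them in when it lands). All names are illustrative, not the WebKit ones.

    #include <cstdio>
    #include <memory>
    #include <utility>
    #include <vector>

    // Stand-in for the Bloom filter: records hashes instead of setting bits.
    struct Filter {
        std::vector<unsigned> hashes;
        void add(unsigned hash) { hashes.push_back(hash); }
    };

    struct Cache {
        std::unique_ptr<Filter> filter;
        std::vector<unsigned> addedDuringSync;
        bool syncInProgress { false };

        void add(unsigned hash)
        {
            if (filter)
                filter->add(hash);
            // Entries added while the rebuild is in flight must also reach
            // the replacement filter, or lookups would miss them.
            if (syncInProgress)
                addedDuringSync.push_back(hash);
        }

        void finishSync(std::unique_ptr<Filter> rebuilt)
        {
            for (unsigned hash : addedDuringSync)
                rebuilt->add(hash);
            addedDuringSync.clear();
            filter = std::move(rebuilt);
            syncInProgress = false;
        }
    };

    int main()
    {
        Cache cache;
        cache.syncInProgress = true;                  // background rebuild starts
        cache.add(42);                                // a store() arrives mid-rebuild
        cache.finishSync(std::make_unique<Filter>()); // rebuilt filter lands
        std::printf("hashes=%zu\n", cache.filter->hashes.size()); // prints 1, not 0
        return 0;
    }

In the actual patch the rebuild runs on backgroundIOQueue() and the merge happens
in a RunLoop::main() dispatch, so both filter mutations stay on the main thread.
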
git-svn-id: https://svn.webkit.org/repository/webkit/trunk@182357 268f45cc-cd09-0410-ab3c-d52691b4dbfc

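Second sketch: why a plain bit filter cannot support remove(). Distinct keys can
map to the same bit positions, so clearing bits for one key would create false
negatives for another; a counting filter avoids this by paying a counter per
position. A one-hash toy illustration (WTF's BloomFilter derives multiple bit
positions per key):

    #include <bitset>
    #include <cstddef>

    template<unsigned keyBits>
    class TinyBloomFilter {
    public:
        static const std::size_t tableSize = 1 << keyBits;

        void add(unsigned hash) { m_table.set(hash % tableSize); }
        bool mayContain(unsigned hash) const { return m_table.test(hash % tableSize); }

        // Intentionally no remove(): if two hashes map to the same bit,
        // clearing it for one would make mayContain() of the other return
        // false, and a Bloom filter must never produce false negatives.

    private:
        std::bitset<tableSize> m_table;
    };

    int main()
    {
        TinyBloomFilter<4> filter; // 16 bits, tiny so collisions are easy
        filter.add(3);
        filter.add(19); // 19 % 16 == 3: shares its only bit with the first key
        // "Removing" 3 by clearing the bit would also lose 19.
        return filter.mayContain(19) ? 0 : 1;
    }

This is why shrink() now just deletes files and leaves stale hashes in the filter;
the synchronize() pass at the end rebuilds it from what is actually on disk.
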
Source/WebKit2/ChangeLog
Source/WebKit2/NetworkProcess/cache/NetworkCacheStorage.cpp
Source/WebKit2/NetworkProcess/cache/NetworkCacheStorage.h

index ff3160cd142e692316e1429830b9eea56f6179cb..bebfa002487019d40801dd376e74210f4070492a 100644 (file)
@@ -1,3 +1,52 @@
+2015-04-04  Antti Koivisto  <antti@apple.com>
+
+        Network cache Bloom filter is too big
+        https://bugs.webkit.org/show_bug.cgi?id=143400
+
+        Reviewed by Chris Dumez.
+
+        It is currently 1MB.
+
+        This patch switches the cache from a counting filter (CountingBloomFilter) to a bit filter (BloomFilter).
+
+        It also reduces the filter size from 2^20 to 2^18 elements, which is good for ~26000 cache entries while
+        still keeping the false positive rate below 1%. The current cache capacity allows around 4000 entries
+        with typical web content.
+
+        The new filter size is 32KB.
+
+        * NetworkProcess/cache/NetworkCacheStorage.cpp:
+        (WebKit::NetworkCache::Storage::Storage):
+        (WebKit::NetworkCache::Storage::synchronize):
+
+            Turn the initialization function into a general-purpose synchronization function.
+
+        (WebKit::NetworkCache::Storage::addToContentsFilter):
+
+            Collect newly added hashes so we don't miss entries that were added during synchronization.
+
+        (WebKit::NetworkCache::Storage::mayContain):
+        (WebKit::NetworkCache::Storage::remove):
+        (WebKit::NetworkCache::Storage::retrieve):
+        (WebKit::NetworkCache::Storage::store):
+        (WebKit::NetworkCache::Storage::dispatchPendingWriteOperations):
+        (WebKit::NetworkCache::Storage::dispatchFullWriteOperation):
+        (WebKit::NetworkCache::Storage::dispatchHeaderWriteOperation):
+        (WebKit::NetworkCache::Storage::setMaximumSize):
+        (WebKit::NetworkCache::Storage::clear):
+        (WebKit::NetworkCache::Storage::shrinkIfNeeded):
+        (WebKit::NetworkCache::Storage::shrink):
+
+            A non-counting Bloom filter does not support removals, so this requires a new strategy.
+
+            Shrink code now simply deletes entries. The filter is updated by calling synchronize() at the end.
+            While we could synchronize the filter during traversal, it is better to have just one function for that.
+
+        (WebKit::NetworkCache::Storage::initialize): Deleted.
+        * NetworkProcess/cache/NetworkCacheStorage.h:
+        (WebKit::NetworkCache::Storage::mayContain):
+        (WebKit::NetworkCache::Storage::cacheMayContain): Deleted.
+
 2015-04-04  Andy Estes  <aestes@apple.com>
 
         [Content Filtering] Blocked page is not always displayed when it should be
index 6916d02872e829e6ef735e17bdf9a0df8626b5b0..9fb86f018684436f06a847606bba3648bb2e41ca 100644 (file)
@@ -70,34 +70,75 @@ Storage::Storage(const String& baseDirectoryPath)
     , m_serialBackgroundIOQueue(WorkQueue::create("com.apple.WebKit.Cache.Storage.serialBackground", WorkQueue::Type::Serial, WorkQueue::QOS::Background))
 {
     deleteOldVersions();
-    initialize();
+    synchronize();
 }
 
-void Storage::initialize()
+void Storage::synchronize()
 {
     ASSERT(RunLoop::isMain());
 
-    StringCapture cachePathCapture(m_directoryPath);
+    if (m_synchronizationInProgress || m_shrinkInProgress)
+        return;
+    m_synchronizationInProgress = true;
+
+    LOG(NetworkCacheStorage, "(NetworkProcess) synchronizing cache");
 
+    StringCapture cachePathCapture(m_directoryPath);
     backgroundIOQueue().dispatch([this, cachePathCapture] {
         String cachePath = cachePathCapture.string();
-        traverseCacheFiles(cachePath, [this](const String& fileName, const String& partitionPath) {
+
+        auto filter = std::make_unique<ContentsFilter>();
+        size_t size = 0;
+        unsigned count = 0;
+        traverseCacheFiles(cachePath, [&filter, &size, &count](const String& fileName, const String& partitionPath) {
             Key::HashType hash;
             if (!Key::stringToHash(fileName, hash))
                 return;
-            unsigned shortHash = Key::toShortHash(hash);
-            RunLoop::main().dispatch([this, shortHash] {
-                m_contentsFilter.add(shortHash);
-            });
             auto filePath = WebCore::pathByAppendingComponent(partitionPath, fileName);
             long long fileSize = 0;
             WebCore::getFileSize(filePath, fileSize);
-            m_approximateSize += fileSize;
+            if (!fileSize)
+                return;
+            filter->add(Key::toShortHash(hash));
+            size += fileSize;
+            ++count;
+        });
+
+        auto* filterPtr = filter.release();
+        RunLoop::main().dispatch([this, filterPtr, size] {
+            auto filter = std::unique_ptr<ContentsFilter>(filterPtr);
+
+            for (auto hash : m_contentsFilterHashesAddedDuringSynchronization)
+                filter->add(hash);
+            m_contentsFilterHashesAddedDuringSynchronization.clear();
+
+            m_contentsFilter = WTF::move(filter);
+            m_approximateSize = size;
+            m_synchronizationInProgress = false;
         });
-        m_hasPopulatedContentsFilter = true;
+
+        LOG(NetworkCacheStorage, "(NetworkProcess) cache synchronization completed approximateSize=%zu count=%d", size, count);
     });
 }
 
+void Storage::addToContentsFilter(const Key& key)
+{
+    ASSERT(RunLoop::isMain());
+
+    if (m_contentsFilter)
+        m_contentsFilter->add(key.shortHash());
+
+    // If we get new entries during filter synchronization take care to add them to the new filter as well.
+    if (m_synchronizationInProgress)
+        m_contentsFilterHashesAddedDuringSynchronization.append(key.shortHash());
+}
+
+bool Storage::mayContain(const Key& key) const
+{
+    ASSERT(RunLoop::isMain());
+    return !m_contentsFilter || m_contentsFilter->mayContain(key.shortHash());
+}
+
 static String directoryPathForKey(const Key& key, const String& cachePath)
 {
     ASSERT(!key.partition().isEmpty());
@@ -288,11 +329,9 @@ void Storage::remove(const Key& key)
 {
     ASSERT(RunLoop::isMain());
 
-    // For simplicity we don't reduce m_approximateSize on removals.
-    // The next cache shrink will update the size.
-
-    if (m_contentsFilter.mayContain(key.shortHash()))
-        m_contentsFilter.remove(key.shortHash());
+    // We can't remove the key from the Bloom filter (but some false positives are expected anyway).
+    // For simplicity we also don't reduce m_approximateSize on removals.
+    // The next synchronization will update everything.
 
     StringCapture filePathCapture(filePathForKey(key, m_directoryPath));
     serialBackgroundIOQueue().dispatch([this, filePathCapture] {
@@ -385,7 +424,7 @@ void Storage::retrieve(const Key& key, unsigned priority, RetrieveCompletionHand
         return;
     }
 
-    if (!cacheMayContain(key.shortHash())) {
+    if (!mayContain(key)) {
         completionHandler(nullptr);
         return;
     }
@@ -412,7 +451,7 @@ void Storage::store(const Record& record, StoreCompletionHandler&& completionHan
     m_pendingWriteOperations.append(new WriteOperation { record, { }, WTF::move(completionHandler) });
 
     // Add key to the filter already here as we do lookups from the pending operations too.
-    m_contentsFilter.add(record.key.shortHash());
+    addToContentsFilter(record.key);
 
     dispatchPendingWriteOperations();
 }
@@ -479,7 +518,7 @@ void Storage::dispatchPendingWriteOperations()
         auto& write = *writeOperation;
         m_activeWriteOperations.add(WTF::move(writeOperation));
 
-        if (write.existingRecord && cacheMayContain(write.record.key.shortHash())) {
+        if (write.existingRecord && mayContain(write.record.key)) {
             dispatchHeaderWriteOperation(write);
             continue;
         }
@@ -492,8 +531,8 @@ void Storage::dispatchFullWriteOperation(const WriteOperation& write)
     ASSERT(RunLoop::isMain());
     ASSERT(m_activeWriteOperations.contains(&write));
 
-    if (!m_contentsFilter.mayContain(write.record.key.shortHash()))
-        m_contentsFilter.add(write.record.key.shortHash());
+    // This was added already when starting the store but filter might have been wiped.
+    addToContentsFilter(write.record.key);
 
     StringCapture cachePathCapture(m_directoryPath);
     backgroundIOQueue().dispatch([this, &write, cachePathCapture] {
@@ -505,14 +544,10 @@ void Storage::dispatchFullWriteOperation(const WriteOperation& write)
         size_t bodyOffset = encodedHeader.size();
 
         channel->write(0, headerAndBodyData, [this, &write, bodyOffset, fd](int error) {
-            LOG(NetworkCacheStorage, "(NetworkProcess) write complete error=%d", error);
-            if (error) {
-                if (m_contentsFilter.mayContain(write.record.key.shortHash()))
-                    m_contentsFilter.remove(write.record.key.shortHash());
-            }
             size_t bodySize = write.record.body.size();
             size_t totalSize = bodyOffset + bodySize;
 
+            // On error the entry still stays in the contents filter until next synchronization.
             m_approximateSize += totalSize;
 
             bool shouldMapBody = !error && bodySize >= pageSize();
@@ -523,6 +558,8 @@ void Storage::dispatchFullWriteOperation(const WriteOperation& write)
             ASSERT(m_activeWriteOperations.contains(&write));
             m_activeWriteOperations.remove(&write);
             dispatchPendingWriteOperations();
+
+            LOG(NetworkCacheStorage, "(NetworkProcess) write complete error=%d", error);
         });
     });
 
@@ -534,7 +571,7 @@ void Storage::dispatchHeaderWriteOperation(const WriteOperation& write)
     ASSERT(RunLoop::isMain());
     ASSERT(write.existingRecord);
     ASSERT(m_activeWriteOperations.contains(&write));
-    ASSERT(cacheMayContain(write.record.key.shortHash()));
+    ASSERT(mayContain(write.record.key));
 
     // Try to update the header of an existing entry.
     StringCapture cachePathCapture(m_directoryPath);
@@ -570,6 +607,16 @@ void Storage::dispatchHeaderWriteOperation(const WriteOperation& write)
 void Storage::setMaximumSize(size_t size)
 {
     ASSERT(RunLoop::isMain());
+
+#if !ASSERT_DISABLED
+    const size_t assumedAverageRecordSize = 50 << 10;
+    size_t maximumRecordCount = size / assumedAverageRecordSize;
+    // ~10 bits per element are required for <1% false positive rate.
+    size_t effectiveBloomFilterCapacity = ContentsFilter::tableSize / 10;
+    // If this gets hit it might be time to increase the filter size.
+    ASSERT(maximumRecordCount < effectiveBloomFilterCapacity);
+#endif
+
     m_maximumSize = size;
 
     shrinkIfNeeded();
@@ -580,7 +627,8 @@ void Storage::clear()
     ASSERT(RunLoop::isMain());
     LOG(NetworkCacheStorage, "(NetworkProcess) clearing cache");
 
-    m_contentsFilter.clear();
+    if (m_contentsFilter)
+        m_contentsFilter->clear();
     m_approximateSize = 0;
 
     StringCapture directoryPathCapture(m_directoryPath);
@@ -629,20 +677,24 @@ void Storage::shrinkIfNeeded()
 {
     ASSERT(RunLoop::isMain());
 
-    if (m_approximateSize <= m_maximumSize)
-        return;
-    if (m_shrinkInProgress)
+    if (m_approximateSize > m_maximumSize)
+        shrink();
+}
+
+void Storage::shrink()
+{
+    ASSERT(RunLoop::isMain());
+
+    if (m_shrinkInProgress || m_synchronizationInProgress)
         return;
     m_shrinkInProgress = true;
 
     LOG(NetworkCacheStorage, "(NetworkProcess) shrinking cache approximateSize=%zu, m_maximumSize=%zu", static_cast<size_t>(m_approximateSize), m_maximumSize);
 
-    m_approximateSize = 0;
-
     StringCapture cachePathCapture(m_directoryPath);
     backgroundIOQueue().dispatch([this, cachePathCapture] {
         String cachePath = cachePathCapture.string();
-        traverseCacheFiles(cachePath, [this](const String& fileName, const String& partitionPath) {
+        traverseCacheFiles(cachePath, [](const String& fileName, const String& partitionPath) {
             auto filePath = WebCore::pathByAppendingComponent(partitionPath, fileName);
 
             auto times = fileTimes(filePath);
@@ -651,22 +703,8 @@ void Storage::shrinkIfNeeded()
 
             LOG(NetworkCacheStorage, "Deletion probability=%f shouldDelete=%d", probability, shouldDelete);
 
-            if (!shouldDelete) {
-                long long fileSize = 0;
-                WebCore::getFileSize(filePath, fileSize);
-                m_approximateSize += fileSize;
-                return;
-            }
-
-            WebCore::deleteFile(filePath);
-            Key::HashType hash;
-            if (!Key::stringToHash(fileName, hash))
-                return;
-            unsigned shortHash = Key::toShortHash(hash);
-            RunLoop::main().dispatch([this, shortHash] {
-                if (m_contentsFilter.mayContain(shortHash))
-                    m_contentsFilter.remove(shortHash);
-            });
+            if (shouldDelete)
+                WebCore::deleteFile(filePath);
         });
 
         // Let system figure out if they are really empty.
@@ -675,9 +713,13 @@ void Storage::shrinkIfNeeded()
             WebCore::deleteEmptyDirectory(partitionPath);
         });
 
-        m_shrinkInProgress = false;
+        RunLoop::main().dispatch([this] {
+            m_shrinkInProgress = false;
+            // We could synchronize during the shrink traversal. However this is fast and it is better to have just one code path.
+            synchronize();
+        });
 
-        LOG(NetworkCacheStorage, "(NetworkProcess) cache shrink completed approximateSize=%zu", static_cast<size_t>(m_approximateSize));
+        LOG(NetworkCacheStorage, "(NetworkProcess) cache shrink completed");
     });
 }
 
index 4cab39e775be5f8e755a30f576a6feef3a305931..7659b6b4d7bd00cd6543fd66f1654f2a68431c60 100644 (file)
@@ -85,9 +85,10 @@ public:
 private:
     Storage(const String& directoryPath);
 
-    void initialize();
+    void synchronize();
     void deleteOldVersions();
     void shrinkIfNeeded();
+    void shrink();
 
     struct ReadOperation {
         Key key;
@@ -111,18 +112,24 @@ private:
     WorkQueue& backgroundIOQueue() { return m_backgroundIOQueue.get(); }
     WorkQueue& serialBackgroundIOQueue() { return m_serialBackgroundIOQueue.get(); }
 
-    bool cacheMayContain(unsigned shortHash) { return !m_hasPopulatedContentsFilter || m_contentsFilter.mayContain(shortHash); }
+    bool mayContain(const Key&) const;
+
+    // 2^18 bit filter can support up to 26000 entries with false positive rate < 1%.
+    using ContentsFilter = BloomFilter<18>;
+    void addToContentsFilter(const Key&);
 
     const String m_baseDirectoryPath;
     const String m_directoryPath;
 
     size_t m_maximumSize { std::numeric_limits<size_t>::max() };
+    size_t m_approximateSize { 0 };
+
+    std::unique_ptr<ContentsFilter> m_contentsFilter;
 
-    CountingBloomFilter<20> m_contentsFilter;
-    std::atomic<bool> m_hasPopulatedContentsFilter { false };
+    bool m_synchronizationInProgress { false };
+    bool m_shrinkInProgress { false };
 
-    std::atomic<size_t> m_approximateSize { 0 };
-    std::atomic<bool> m_shrinkInProgress { false };
+    Vector<unsigned> m_contentsFilterHashesAddedDuringSynchronization;
 
     static const int maximumRetrievePriority = 4;
     Deque<std::unique_ptr<const ReadOperation>> m_pendingReadOperationsByPriority[maximumRetrievePriority + 1];