Bloom Filters: The Approximate Memory that Revolutionizes Performance
- 👉 This article is available in english too.
- 👉 Este artículo también está disponible en español.
- 👉 Questo articolo è disponibile anche in italiano.
- 👉 Este artigo também está disponível em português do Brasil.
The Problem: When Checking Becomes Expensive
In recent years, DelphiMVCFramework has been adopted in mission-critical systems worldwide: from banking APIs processing transactions worth millions of euros, to hospital systems managing vital patient data, to industrial platforms coordinating automated production lines. These systems share a common characteristic: they must handle millions of API calls daily while maintaining excellent performance and predictable response times.
In such contexts, even seemingly simple operations can become critical bottlenecks when scaled to enterprise volumes. A perfect example is what you might encounter when developing a user registration API for a social network that has reached 100 million subscribers. Every time someone attempts to register, you need to verify if the email already exists in the database. An apparently innocent query like SELECT COUNT(*) FROM users WHERE email = ?
may seem trivial, but when executed thousands of times per minute on a table with 100 million rows, even with the best indexes, the database starts to suffer.
The paradox is that the vast majority of these queries return zero results - most emails being checked are actually new. You’re literally wasting computational resources to confirm something you could know in advance with a much more efficient method.
This scenario repeats in countless contexts: checking URLs already visited in a web crawler, verifying keys in distributed caches, anti-spam filters, deduplication systems. All situations where the cost of verifying if an element “might be present” is enormously lower than the cost of a definitive verification.
The Traditional Solution and Its Limits
The classic approach involves using data structures like HashSet or Dictionary to keep all previously seen elements in memory. For our email case, this would mean loading 100 million strings into memory - we’re talking about several gigabytes of RAM just for this functionality.
Even assuming you have the available memory, the cost of keeping this structure synchronized with the database becomes prohibitive. Every insertion, modification, or deletion requires updates to both the database and the in-memory structure.
Enter Bloom Filters: The Approximate Memory
A Bloom Filter is essentially an “approximate memory” that allows you to answer a fundamental question: “Have I ever seen this element before?” with two possible answers:
- “Definitely NO” (100% guaranteed answer)
- “Probably YES” (with a configurable probability of error)
Think of a Bloom Filter as that friend who has an exceptional memory for things they’ve NEVER seen, but sometimes confuses things they’ve already encountered. If they tell you “I’ve never seen it,” you can trust them blindly. If they tell you “I’ve seen it before,” they might be wrong, but most of the time they’re right.
How It Works: The Elegance of Simplicity
A Bloom Filter consists of a bit array (initially all zeros) and a set of hash functions. When you want to “remember” an element:
- You calculate various hash functions of the element
- You use these values as indices in the bit array
- You set all corresponding bits to 1
To verify if an element is already present:
- You calculate the same hash functions
- You check if all corresponding bits are 1
- If even one is 0, the element is definitely not there
- If they’re all 1, the element is probably there
The beauty lies in compactness: regardless of how many elements you insert, the size of the Bloom Filter remains constant. For our case of 100 million emails, instead of gigabytes of memory, we could use just a few megabytes while maintaining a false positive probability of 1%.
Solving the Original Problem
Returning to our user registration scenario, the flow becomes:
- User attempts registration with email X
- Quick check in the Bloom Filter (microseconds)
- If it says “Definitely NO,” proceed directly with insertion
- If it says “Probably YES,” make the database query for confirmation
Considering that 99% of emails are actually new, we eliminate 99% of database queries. The remaining 1% includes both true duplicates and false positives, but both require final verification anyway.
Simple but Efficient Implementation in Delphi
The following implementation is deliberately simple but uses advanced techniques to maximize performance. Instead of using a TArray<Boolean>
that wastes memory, we use a TArray<UInt32>
for compact bit storage.
unit BloomFilter;
interface
uses
System.SysUtils, System.Hash;
type
TBloomFilter = class
private
fBitArray: TArray<UInt32>; // Compact storage with UInt32 array
fBitCount: UInt32; // Total number of bits in the filter
fHashFunctions: Integer; // Number of hash functions to use
function GetBitArraySize: Integer;
procedure SetBit(aIndex: UInt32);
function GetBit(aIndex: UInt32): Boolean;
function ComputeHash(const aValue: string; aSeed: UInt32): UInt32;
public
constructor Create(aBitCount: UInt32; aHashFunctions: Integer = 3);
destructor Destroy; override;
procedure Add(const aValue: string);
function MightContain(const aValue: string): Boolean;
procedure Clear;
end;
implementation
constructor TBloomFilter.Create(aBitCount: UInt32; aHashFunctions: Integer);
begin
inherited Create;
if aBitCount = 0 then
raise EArgumentException.Create('Bit count must be greater than 0');
if aHashFunctions <= 0 then
raise EArgumentException.Create('Hash function count must be greater than 0');
fBitCount := aBitCount;
fHashFunctions := aHashFunctions;
// Calculate required UInt32 array size
SetLength(fBitArray, GetBitArraySize);
Clear;
end;
destructor TBloomFilter.Destroy;
begin
SetLength(fBitArray, 0);
inherited;
end;
function TBloomFilter.GetBitArraySize: Integer;
begin
// Ceiling division: (fBitCount + 31) div 32
Result := (fBitCount + 31) div 32;
end;
procedure TBloomFilter.SetBit(aIndex: UInt32);
var
lArrayIndex: Integer;
lBitIndex: Integer;
begin
lArrayIndex := aIndex div 32;
lBitIndex := aIndex mod 32;
fBitArray[lArrayIndex] := fBitArray[lArrayIndex] or (1 shl lBitIndex);
end;
function TBloomFilter.GetBit(aIndex: UInt32): Boolean;
var
lArrayIndex: Integer;
lBitIndex: Integer;
begin
lArrayIndex := aIndex div 32;
lBitIndex := aIndex mod 32;
Result := (fBitArray[lArrayIndex] and (1 shl lBitIndex)) <> 0;
end;
function TBloomFilter.ComputeHash(const aValue: string; aSeed: UInt32): UInt32;
var
lHash1Val, lHash2Val: Cardinal;
lCombined: UInt64; // UInt64 to prevent overflow
begin
// Double hashing to simulate independent hash functions
lHash1Val := Cardinal(THashBobJenkins.GetHashValue(aValue));
lHash2Val := Cardinal(THashFNV1a32.GetHashValue(aValue));
// Combine with seed to create different hash functions
lCombined := UInt64(lHash1Val) + UInt64(aSeed) * UInt64(lHash2Val);
// Modulo ensures result is always in valid range
Result := lCombined mod fBitCount;
end;
procedure TBloomFilter.Add(const aValue: string);
var
lHashValue: UInt32;
begin
for var lI := 0 to fHashFunctions - 1 do
begin
lHashValue := ComputeHash(aValue, UInt32(lI));
SetBit(lHashValue);
end;
end;
function TBloomFilter.MightContain(const aValue: string): Boolean;
var
lHashValue: UInt32;
begin
Result := True;
for var lI := 0 to fHashFunctions - 1 do
begin
lHashValue := ComputeHash(aValue, UInt32(lI));
if not GetBit(lHashValue) then
begin
Result := False;
Exit; // Optimization: early exit
end;
end;
end;
procedure TBloomFilter.Clear;
begin
for var lI := 0 to Length(fBitArray) - 1 do
fBitArray[lI] := 0;
end;
end.
The implementation uses several techniques to maximize performance:
Compact Storage: Using TArray<UInt32>
instead of TArray<Boolean>
reduces memory usage by 32 times. Each UInt32 can store 32 bits, while each Boolean would use an entire one.
Double Hashing: The ComputeHash
function combines two different hash algorithms (Bob Jenkins and FNV1a) with a seed to simulate independent hash functions, avoiding the need to implement multiple separate hash functions.
Overflow Prevention: Using UInt64
in intermediate calculations prevents overflow during arithmetic operations, ensuring correct results even with large values.
Early Exit: In MightContain
, the loop stops at the first unset bit, optimizing performance in the most common case (elements not present).
Modern Naming: The class follows modern Delphi conventions with private members starting with “f” and local variables with “l”, along with the use of inline syntax for loops.
The Secret of Double Hashing: Why Do We Need Multiple Hash Functions?
A fundamental question in Bloom Filter implementation is: why not use just one hash function? The answer lies in the mathematics of probabilistic distribution and the need to minimize false positives.
The Problem with a Single Hash Function: If we used only one hash function, different elements could easily “collide” on the same bit, drastically reducing the filter’s effectiveness. With multiple hash functions, the probability that two different elements have all the same hash positions becomes extremely low.
The Double Hashing Technique: Instead of implementing k completely independent hash functions (computationally expensive), our implementation uses an elegant technique: simulating k hash functions using only two base functions.
The mathematical formula is:
Gi(x) = (H1(x) + i * H2(x)) mod m
Where:
Gi(x)
is the i-th simulated hash function for element xH1(x)
andH2(x)
are the two base hash functions (Bob Jenkins and FNV1a)i
is the hash function index (0, 1, 2…)m
is the size of the bit array
How It Works in Our Implementation:
function TBloomFilter.ComputeHash(const aValue: string; aSeed: UInt32): UInt32;
var
lHash1Val, lHash2Val: Cardinal;
lCombined: UInt64;
begin
// Two base hash functions
lHash1Val := Cardinal(THashBobJenkins.GetHashValue(aValue)); // H1(x)
lHash2Val := Cardinal(THashFNV1a32.GetHashValue(aValue)); // H2(x)
// Formula: G_i(x) = (H1(x) + i * H2(x)) mod m
lCombined := UInt64(lHash1Val) + UInt64(aSeed) * UInt64(lHash2Val);
Result := lCombined mod fBitCount;
end;
The aSeed
parameter represents the index i
of the hash function. When we call the method with seeds 0, 1, 2… we get different hash functions that are mathematically correlated in an optimal way.
Advantages of This Approach:
- Efficiency: We compute only two base hashes instead of k separate hashes
- Independence: The resulting functions are sufficiently independent to minimize false positives
- Simplicity: One formula instead of k different implementations
- Proven: This technique has been mathematically proven almost equivalent to k truly independent functions
This is why our Bloom Filter can be both simple and extremely effective: we leverage mathematics to achieve maximum results with minimum computational effort.
Quick Demo of Functionality
Before seeing practical use in production, here’s a simple console demo showing the API in action:
program BloomFilterDemo;
{$APPTYPE CONSOLE}
uses
System.SysUtils,
BloomFilter in 'BloomFilter.pas';
var
lFilter: TBloomFilter;
begin
// Create a Bloom Filter with 10000 bits and 3 hash functions
lFilter := TBloomFilter.Create(10000, 3);
try
Writeln('=== BLOOM FILTER DEMO ===');
Writeln;
// Test emails not yet inserted
Write('Is email "mario@test.com" present? ');
if lFilter.MightContain('mario@test.com') then
Writeln('PROBABLY YES')
else
Writeln('DEFINITELY NO');
Write('Is email "luigi@test.com" present? ');
if lFilter.MightContain('luigi@test.com') then
Writeln('PROBABLY YES')
else
Writeln('DEFINITELY NO');
Writeln;
Writeln('Adding "mario@test.com" to the filter...');
lFilter.Add('mario@test.com');
Writeln('Adding "francesco@test.com" to the filter...');
lFilter.Add('francesco@test.com');
Writeln;
Writeln('=== TEST AFTER INSERTIONS ===');
// Test inserted elements
Write('Is email "mario@test.com" present? ');
if lFilter.MightContain('mario@test.com') then
Writeln('PROBABLY YES (correct!)')
else
Writeln('DEFINITELY NO');
Write('Is email "francesco@test.com" present? ');
if lFilter.MightContain('francesco@test.com') then
Writeln('PROBABLY YES (correct!)')
else
Writeln('DEFINITELY NO');
// Test element never inserted
Write('Is email "luigi@test.com" present? ');
if lFilter.MightContain('luigi@test.com') then
Writeln('PROBABLY YES (false positive)')
else
Writeln('DEFINITELY NO (correct!)');
Writeln;
Writeln('Note: Inserted elements are always detected (zero false negatives)');
Writeln('Non-inserted elements might occasionally give false positives');
Readln;
finally
lFilter.Free;
end;
end.
The typical output will show how the Bloom Filter responds “DEFINITELY NO” for elements never seen, “PROBABLY YES” for inserted elements, and occasionally “PROBABLY YES” even for non-inserted elements (false positives). The key point is that we’ll never have false negatives: if an element has been inserted, it will always be detected.
Practical Usage
// In your registration controller
var
lBloomFilter: TBloomFilter;
// Initialization (to be done only once)
lBloomFilter := TBloomFilter.Create(1000000, 3); // 1M bits, 3 hash functions
// During registration
procedure RegisterUser(const aEmail: string);
begin
if not lBloomFilter.MightContain(aEmail) then
begin
// Definitely new email, insert directly
InsertNewUser(aEmail);
lBloomFilter.Add(aEmail);
end
else
begin
// Might be duplicate, verify with database
if not EmailExistsInDB(aEmail) then
begin
InsertNewUser(aEmail);
lBloomFilter.Add(aEmail);
end
else
raise Exception.Create('Email already registered');
end;
end;
Final Considerations
Bloom Filters represent a perfect example of how sometimes the best solution isn’t the perfect one, but the “good enough” one that solves the real problem. ☝️ As Voltaire said, “the perfect is the enemy of the good.” By accepting a small percentage of uncertainty, we achieve enormous advantages in terms of memory and performance.
When to Use Bloom Filters
Bloom Filters are the ideal choice when:
- You need to perform many membership checks on a very large set
- Memory is a limited resource
- You can tolerate a small percentage of false positives
- You don’t need to remove elements from the set (once a bit is 1, it cannot return to 0)
Real-World Use Cases
Beyond our user registration API example, Bloom Filters are used in:
Web Browsers: Chrome uses Bloom Filters for Safe Browsing - quickly checks if a URL is in the list of malicious sites without having to query the server for every link.
Distributed Databases: Systems like Cassandra and HBase use them to avoid expensive disk reads. Before searching for specific data, they consult the Bloom Filter to know if it’s worth accessing the disk.
Caching Systems: To quickly check if an item is already in cache before retrieving it from a slower source.
Content Delivery Networks (CDN): To determine if content is already available on an edge server without having to make expensive queries.
Duplicate Detection: In applications where it’s important to know if an element has already been seen (avoiding spam, suggesting unseen content).
Trade-offs to Consider
Advantages:
- Extreme space efficiency (a few megabytes for millions of elements)
- Near-constant time operations
- No false negatives guaranteed
Disadvantages:
- Possible false positives (configurable)
- Inability to remove elements
- Fixed size of the bit array
In the world of distributed architectures and big data, Bloom Filters have become an indispensable tool. Their conceptual simplicity hides surprising power in optimizing real systems, especially when combined with frameworks like DelphiMVCFramework to create high-performance APIs.
Comments
comments powered by Disqus