Bloom Filters: The Approximate Memory that Revolutionizes Performance


The Problem: When Checking Becomes Expensive

In recent years, DelphiMVCFramework has been adopted in mission-critical systems worldwide: from banking APIs processing transactions worth millions of euros, to hospital systems managing vital patient data, to industrial platforms coordinating automated production lines. These systems share a common characteristic: they must handle millions of API calls daily while maintaining excellent performance and predictable response times.

In such contexts, even seemingly simple operations can become critical bottlenecks when scaled to enterprise volumes. A perfect example is what you might encounter when developing a user registration API for a social network that has reached 100 million subscribers. Every time someone attempts to register, you need to verify if the email already exists in the database. An apparently innocent query like SELECT COUNT(*) FROM users WHERE email = ? may seem trivial, but when executed thousands of times per minute on a table with 100 million rows, even with the best indexes, the database starts to suffer.

The paradox is that the vast majority of these queries return zero results - most emails being checked are actually new. You’re literally wasting computational resources to confirm something you could know in advance with a much more efficient method.

This scenario repeats in countless contexts: checking URLs already visited in a web crawler, verifying keys in distributed caches, anti-spam filters, deduplication systems. All situations where the cost of verifying if an element “might be present” is enormously lower than the cost of a definitive verification.

The Traditional Solution and Its Limits

The classic approach involves using data structures like HashSet or Dictionary to keep all previously seen elements in memory. For our email case, this would mean loading 100 million strings into memory - we’re talking about several gigabytes of RAM just for this functionality.

Even assuming you have the available memory, the cost of keeping this structure synchronized with the database becomes prohibitive. Every insertion, modification, or deletion requires updates to both the database and the in-memory structure.
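To get a feel for the scale, here is a rough back-of-envelope estimate; the average email length and per-entry overhead are assumptions for illustration, not measurements:

```python
n = 100_000_000          # subscribers
avg_email_chars = 30     # assumed average email length
bytes_per_char = 2       # Delphi strings are UTF-16
per_entry_overhead = 40  # assumed string header + hash-table slot bookkeeping

naive_bytes = n * (avg_email_chars * bytes_per_char + per_entry_overhead)
print(f"~{naive_bytes / 2**30:.1f} GiB just to hold the set in RAM")
```

Even with conservative assumptions, a full in-memory set lands in the multi-gigabyte range before counting the cost of keeping it synchronized.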

Enter Bloom Filters: The Approximate Memory

A Bloom Filter is essentially an “approximate memory” that allows you to answer a fundamental question: “Have I ever seen this element before?” with two possible answers:

  • “Definitely NO” (100% guaranteed answer)
  • “Probably YES” (with a configurable probability of error)

Think of a Bloom Filter as that friend who has an exceptional memory for things they’ve NEVER seen, but sometimes confuses things they’ve already encountered. If they tell you “I’ve never seen it,” you can trust them blindly. If they tell you “I’ve seen it before,” they might be wrong, but most of the time they’re right.

How It Works: The Elegance of Simplicity

A Bloom Filter consists of a bit array (initially all zeros) and a set of hash functions. When you want to “remember” an element:

  1. You calculate various hash functions of the element
  2. You use these values as indices in the bit array
  3. You set all corresponding bits to 1

To verify if an element is already present:

  1. You calculate the same hash functions
  2. You check if all corresponding bits are 1
  3. If even one is 0, the element is definitely not there
  4. If they’re all 1, the element is probably there

The beauty lies in compactness: regardless of how many elements you insert, the size of the Bloom Filter remains constant, and it never stores the elements themselves. For our case of 100 million emails, instead of gigabytes of memory we need on the order of a hundred megabytes (about 9.6 bits per element) while maintaining a false positive probability of 1%.
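The standard sizing formulas make this concrete: to hold n elements with false positive probability p, a Bloom Filter needs about m = -n·ln(p)/(ln 2)² bits and k = (m/n)·ln 2 hash functions. A quick check in Python:

```python
import math

n = 100_000_000   # expected number of elements
p = 0.01          # target false positive probability

# Standard Bloom Filter sizing:
#   m = -n * ln(p) / (ln 2)^2  bits
#   k = (m / n) * ln(2)        hash functions
m = math.ceil(-n * math.log(p) / math.log(2) ** 2)
k = round(m / n * math.log(2))

print(f"{m} bits = {m / 8 / 2**20:.0f} MiB, optimal k = {k}")
```

Roughly 115 MB for 100 million entries, versus several gigabytes for storing the strings themselves.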

Solving the Original Problem

Returning to our user registration scenario, the flow becomes:

  1. User attempts registration with email X
  2. Quick check in the Bloom Filter (microseconds)
  3. If it says “Definitely NO,” proceed directly with insertion
  4. If it says “Probably YES,” make the database query for confirmation

Considering that 99% of emails are actually new, we eliminate roughly 98% of database queries: the database is consulted only for true duplicates (1% of attempts) plus the false positives among the genuinely new emails (about 1% of the remaining 99%), and both of those cases require final verification anyway.
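The arithmetic behind that claim can be sketched in a few lines; the daily volume is an assumed figure for illustration:

```python
total = 1_000_000      # registration checks per day (assumed volume)
new_fraction = 0.99    # share of emails that are genuinely new
fp_rate = 0.01         # configured false positive probability

# The database is hit only when the filter answers "probably yes":
# real duplicates, plus false positives among the genuinely new emails.
db_queries = total * (1 - new_fraction) + total * new_fraction * fp_rate
print(f"DB queries: {db_queries:.0f} of {total} ({db_queries / total:.2%})")
```

About 19,900 queries instead of a million: the expensive path survives only for the small fraction the filter cannot rule out.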

Simple but Efficient Implementation in Delphi

The following implementation is deliberately simple but uses advanced techniques to maximize performance. Instead of using a TArray<Boolean> that wastes memory, we use a TArray<UInt32> for compact bit storage.

unit BloomFilter;

interface

uses
  System.SysUtils, System.Hash;

type
  TBloomFilter = class
  private
    fBitArray: TArray<UInt32>;    // Compact storage with UInt32 array
    fBitCount: UInt32;            // Total number of bits in the filter
    fHashFunctions: Integer;      // Number of hash functions to use
    
    function GetBitArraySize: Integer;
    procedure SetBit(aIndex: UInt32);
    function GetBit(aIndex: UInt32): Boolean;
    function ComputeHash(const aValue: string; aSeed: UInt32): UInt32;
    
  public
    constructor Create(aBitCount: UInt32; aHashFunctions: Integer = 3);
    destructor Destroy; override;
    
    procedure Add(const aValue: string);
    function MightContain(const aValue: string): Boolean;
    procedure Clear;
  end;

implementation

constructor TBloomFilter.Create(aBitCount: UInt32; aHashFunctions: Integer);
begin
  inherited Create;
  
  if aBitCount = 0 then
    raise EArgumentException.Create('Bit count must be greater than 0');
  if aHashFunctions <= 0 then
    raise EArgumentException.Create('Hash function count must be greater than 0');
    
  fBitCount := aBitCount;
  fHashFunctions := aHashFunctions;
  
  // Calculate required UInt32 array size
  SetLength(fBitArray, GetBitArraySize);
  Clear;
end;

destructor TBloomFilter.Destroy;
begin
  SetLength(fBitArray, 0);
  inherited;
end;

function TBloomFilter.GetBitArraySize: Integer;
begin
  // Ceiling division: (fBitCount + 31) div 32
  Result := (fBitCount + 31) div 32;
end;

procedure TBloomFilter.SetBit(aIndex: UInt32);
var
  lArrayIndex: Integer;
  lBitIndex: Integer;
begin
  lArrayIndex := aIndex div 32;
  lBitIndex := aIndex mod 32;
  // UInt32(1) avoids a signed shift when lBitIndex is 31
  fBitArray[lArrayIndex] := fBitArray[lArrayIndex] or (UInt32(1) shl lBitIndex);
end;

function TBloomFilter.GetBit(aIndex: UInt32): Boolean;
var
  lArrayIndex: Integer;
  lBitIndex: Integer;
begin
  lArrayIndex := aIndex div 32;
  lBitIndex := aIndex mod 32;
  // UInt32(1) avoids a signed shift when lBitIndex is 31
  Result := (fBitArray[lArrayIndex] and (UInt32(1) shl lBitIndex)) <> 0;
end;

function TBloomFilter.ComputeHash(const aValue: string; aSeed: UInt32): UInt32;
var
  lHash1Val, lHash2Val: Cardinal;
  lCombined: UInt64;  // UInt64 to prevent overflow
begin
  // Double hashing to simulate independent hash functions
  lHash1Val := Cardinal(THashBobJenkins.GetHashValue(aValue));
  lHash2Val := Cardinal(THashFNV1a32.GetHashValue(aValue));
  
  // Combine with seed to create different hash functions
  lCombined := UInt64(lHash1Val) + UInt64(aSeed) * UInt64(lHash2Val);
  
  // Modulo ensures result is always in valid range
  Result := lCombined mod fBitCount;
end;

procedure TBloomFilter.Add(const aValue: string);
var
  lHashValue: UInt32;
begin
  for var lI := 0 to fHashFunctions - 1 do
  begin
    lHashValue := ComputeHash(aValue, UInt32(lI));
    SetBit(lHashValue);
  end;
end;

function TBloomFilter.MightContain(const aValue: string): Boolean;
var
  lHashValue: UInt32;
begin
  Result := True;
  for var lI := 0 to fHashFunctions - 1 do
  begin
    lHashValue := ComputeHash(aValue, UInt32(lI));
    if not GetBit(lHashValue) then
    begin
      Result := False;
      Exit; // Optimization: early exit
    end;
  end;
end;

procedure TBloomFilter.Clear;
begin
  for var lI := 0 to Length(fBitArray) - 1 do
    fBitArray[lI] := 0;
end;

end.

The implementation uses several techniques to maximize performance:

Compact Storage: Using TArray<UInt32> instead of TArray<Boolean> reduces memory usage by a factor of eight: each UInt32 stores 32 bits, while each Boolean occupies a whole byte just to represent a single flag.

Double Hashing: The ComputeHash function combines two different hash algorithms (Bob Jenkins and FNV1a) with a seed to simulate independent hash functions, avoiding the need to implement multiple separate hash functions.

Overflow Prevention: Using UInt64 in intermediate calculations prevents overflow during arithmetic operations, ensuring correct results even with large values.

Early Exit: In MightContain, the loop stops at the first unset bit, optimizing performance in the most common case (elements not present).

Modern Naming: The class follows modern Delphi conventions, with private fields prefixed with “f” and local variables with “l”, along with inline variable declarations in the for loops.

The Secret of Double Hashing: Why Do We Need Multiple Hash Functions?

A fundamental question in Bloom Filter implementation is: why not use just one hash function? The answer lies in the mathematics of probabilistic distribution and the need to minimize false positives.

The Problem with a Single Hash Function: If we used only one hash function, different elements could easily “collide” on the same bit, drastically reducing the filter’s effectiveness. With multiple hash functions, the probability that two different elements have all the same hash positions becomes extremely low.

The Double Hashing Technique: Instead of implementing k completely independent hash functions (computationally expensive), our implementation uses an elegant technique: simulating k hash functions using only two base functions.

The mathematical formula is:

Gi(x) = (H1(x) + i * H2(x)) mod m

Where:

  • Gi(x) is the i-th simulated hash function for element x
  • H1(x) and H2(x) are the two base hash functions (Bob Jenkins and FNV1a)
  • i is the hash function index (0, 1, 2…)
  • m is the size of the bit array

How It Works in Our Implementation:

function TBloomFilter.ComputeHash(const aValue: string; aSeed: UInt32): UInt32;
var
  lHash1Val, lHash2Val: Cardinal;
  lCombined: UInt64;
begin
  // Two base hash functions
  lHash1Val := Cardinal(THashBobJenkins.GetHashValue(aValue));  // H1(x)
  lHash2Val := Cardinal(THashFNV1a32.GetHashValue(aValue));     // H2(x)
  
  // Formula: G_i(x) = (H1(x) + i * H2(x)) mod m
  lCombined := UInt64(lHash1Val) + UInt64(aSeed) * UInt64(lHash2Val);
  Result := lCombined mod fBitCount;
end;

The aSeed parameter represents the index i of the hash function. When we call the method with seeds 0, 1, 2… we get different hash functions that are mathematically correlated in an optimal way.

Advantages of This Approach:

  • Efficiency: We compute only two base hashes instead of k separate hashes
  • Independence: The resulting functions are sufficiently independent to minimize false positives
  • Simplicity: One formula instead of k different implementations
  • Proven: Kirsch and Mitzenmacher showed that this technique is asymptotically equivalent to using k truly independent hash functions

This is why our Bloom Filter can be both simple and extremely effective: we leverage mathematics to achieve maximum results with minimum computational effort.

Quick Demo of Functionality

Before seeing practical use in production, here’s a simple console demo showing the API in action:

program BloomFilterDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  BloomFilter in 'BloomFilter.pas';

var
  lFilter: TBloomFilter;
begin
  // Create a Bloom Filter with 10000 bits and 3 hash functions
  lFilter := TBloomFilter.Create(10000, 3);
  try
    Writeln('=== BLOOM FILTER DEMO ===');
    Writeln;
    
    // Test emails not yet inserted
    Write('Is email "mario@test.com" present? ');
    if lFilter.MightContain('mario@test.com') then
      Writeln('PROBABLY YES')
    else  
      Writeln('DEFINITELY NO');
      
    Write('Is email "luigi@test.com" present? ');
    if lFilter.MightContain('luigi@test.com') then
      Writeln('PROBABLY YES')
    else  
      Writeln('DEFINITELY NO');
    
    Writeln;
    Writeln('Adding "mario@test.com" to the filter...');
    lFilter.Add('mario@test.com');
    
    Writeln('Adding "francesco@test.com" to the filter...');
    lFilter.Add('francesco@test.com');
    
    Writeln;
    Writeln('=== TEST AFTER INSERTIONS ===');
    
    // Test inserted elements
    Write('Is email "mario@test.com" present? ');
    if lFilter.MightContain('mario@test.com') then
      Writeln('PROBABLY YES (correct!)')
    else  
      Writeln('DEFINITELY NO');
      
    Write('Is email "francesco@test.com" present? ');
    if lFilter.MightContain('francesco@test.com') then
      Writeln('PROBABLY YES (correct!)')
    else  
      Writeln('DEFINITELY NO');
    
    // Test element never inserted
    Write('Is email "luigi@test.com" present? ');
    if lFilter.MightContain('luigi@test.com') then
      Writeln('PROBABLY YES (false positive)')
    else  
      Writeln('DEFINITELY NO (correct!)');
      
    Writeln;
    Writeln('Note: Inserted elements are always detected (zero false negatives)');
    Writeln('Non-inserted elements might occasionally give false positives');
    
    Readln;
  finally
    lFilter.Free;
  end;
end.

The typical output will show how the Bloom Filter responds “DEFINITELY NO” for elements never seen, “PROBABLY YES” for inserted elements, and occasionally “PROBABLY YES” even for non-inserted elements (false positives). The key point is that we’ll never have false negatives: if an element has been inserted, it will always be detected.

Practical Usage

// In your registration controller
var
  lBloomFilter: TBloomFilter;
  
// Initialization (to be done only once, at startup; in a multi-threaded
// controller, access to the filter must be protected with a lock)
lBloomFilter := TBloomFilter.Create(1000000, 3); // 1M bits, 3 hash functions

// During registration
procedure RegisterUser(const aEmail: string);
begin
  if not lBloomFilter.MightContain(aEmail) then
  begin
    // Definitely new email, insert directly
    InsertNewUser(aEmail);
    lBloomFilter.Add(aEmail);
  end
  else
  begin
    // Might be duplicate, verify with database
    if not EmailExistsInDB(aEmail) then
    begin
      InsertNewUser(aEmail);
      lBloomFilter.Add(aEmail);
    end
    else
      raise Exception.Create('Email already registered');
  end;
end;

Final Considerations

Bloom Filters represent a perfect example of how sometimes the best solution isn’t the perfect one, but the “good enough” one that solves the real problem. ☝️ As Voltaire said, “the perfect is the enemy of the good.” By accepting a small percentage of uncertainty, we achieve enormous advantages in terms of memory and performance.

When to Use Bloom Filters

Bloom Filters are the ideal choice when:

  • You need to perform many membership checks on a very large set
  • Memory is a limited resource
  • You can tolerate a small percentage of false positives
  • You don’t need to remove elements from the set (once a bit is 1, it cannot return to 0)

Real-World Use Cases

Beyond our user registration API example, Bloom Filters are used in:

Web Browsers: Chrome has used a Bloom Filter for Safe Browsing, quickly checking whether a URL is on the list of malicious sites without having to query the server for every link.

Distributed Databases: Systems like Cassandra and HBase use them to avoid expensive disk reads. Before searching for specific data, they consult the Bloom Filter to know if it’s worth accessing the disk.

Caching Systems: To quickly check if an item is already in cache before retrieving it from a slower source.

Content Delivery Networks (CDN): To determine if content is already available on an edge server without having to make expensive queries.

Duplicate Detection: In applications where it’s important to know if an element has already been seen (avoiding spam, suggesting unseen content).

Trade-offs to Consider

Advantages:

  • Extreme space efficiency (a few megabytes for millions of elements)
  • Near-constant time operations
  • Guaranteed absence of false negatives

Disadvantages:

  • Possible false positives (configurable)
  • Inability to remove elements
  • Fixed size of the bit array
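That “configurable” false positive rate follows a closed-form expression. As a sketch in Python, using the standard approximation and applying it to the 10,000-bit, 3-hash filter from the demo earlier:

```python
import math

def false_positive_rate(n: int, m: int, k: int) -> float:
    """Approximate false positive probability after inserting n elements
    into an m-bit Bloom Filter that uses k hash functions:
        p = (1 - e^(-k*n/m))^k
    """
    return (1.0 - math.exp(-k * n / m)) ** k

# The demo's 10,000-bit, 3-hash filter at increasing fill levels
for n in (100, 1000, 5000):
    print(f"n={n:>5}: p = {false_positive_rate(n, 10_000, 3):.4f}")
```

The rate stays negligible while the filter is lightly loaded and degrades as it fills, which is why sizing the bit array for the expected number of elements matters.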

In the world of distributed architectures and big data, Bloom Filters have become an indispensable tool. Their conceptual simplicity hides surprising power in optimizing real systems, especially when combined with frameworks like DelphiMVCFramework to create high-performance APIs.
