At Doppel, we classify enormous volumes of URLs to detect attacks against the world’s largest brands. Over the last few months, our system comfortably handled approximately 100 million URLs per day. But as the number of customers and threat actors grew, the next stage of the roadmap became inevitable: scale the platform to more than 1 billion URLs without exploding compute cost or latency.
That goal forced us to confront the limits of our original architecture. The early pipeline, built around per-URL loops, was simple, readable, and perfectly adequate at lower volume. But as we continued to scale, the symptoms were clear: CPU underutilization, ballooning latency, and an execution model that fundamentally couldn’t stretch another order of magnitude.
Reaching 1B+ URLs was going to take more than just a simple optimization; it required a different way of structuring work. This post is the story of how we rebuilt the system to get there.
Recognizing the Problem: Per-URL Python = Bottleneck
The original design did exactly what you’d expect a first version to do:
- One URL at a time. Each incoming URL triggered normalization, parsing, fuzzy matching, and customer comparison independently.
- Python-level nested loops. For every URL–customer pair, the system executed string comparisons and fuzzy matching inside Python.
- Frequent per-string operations. Unicode processing, homoglyph detection, and tokenization all happened as repeated Python function calls.
We tried multi-threading, but the GIL prevented CPU saturation. We tried multi-processing, but each process still handled thousands of tiny Python operations. In both cases, the workload was fragmented into pieces too small and too interpreter-bound to exploit modern hardware.
This mismatch between massive data volume and tiny, GIL-bound tasks made the old pipeline fundamentally unscalable.

How We Fixed It
Microbatches: Reshape the Workload
Our phishing detection workload has two competing demands. On one hand, time to detection must be as short as possible: we want to take down credential-harvesting threats quickly. On the other hand, effective detection requires scanning massive portions of the internet.
These goals pull in opposite directions: low latency pushes you toward streaming systems, while global-scale coverage pushes you toward batch efficiency. Microbatching provides a Goldilocks zone that aligns directly with the realities of phishing defense.
| Mode | Time to detection | Cost efficiency |
|---|---|---|
| Streaming | Very fast | Low |
| Batch | Slow | Very high |
| Microbatch | Fast | High |
The key optimization here is taking advantage of amortization: one expensive setup for many cheap operations.

```text
In streaming:   (cost_setup + cost_compute) × N
In microbatch:  cost_setup + (cost_compute × N)
```

We drive down cost_compute in the microbatch by spending more on cost_setup. Optimizations like columnar normalization, fused lazy expressions, SIMD-friendly kernels, and sparse-matrix filtering all rely on having enough data to work on at once. Microbatching did not speed up the pipeline by itself, but it enabled the architectural techniques that actually provide the speedups.
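To make the amortization concrete, here is a toy calculation; the costs and batch size below are illustrative placeholders, not measured numbers from our pipeline:

```python
# Hypothetical costs, for illustration only.
SETUP_MS = 50.0     # per-invocation setup: building indexes, allocating buffers, etc.
COMPUTE_MS = 0.05   # per-URL compute once setup is done
N = 100_000         # URLs in one microbatch

streaming_ms = (SETUP_MS + COMPUTE_MS) * N   # setup paid for every URL
microbatch_ms = SETUP_MS + COMPUTE_MS * N    # setup paid once per batch

print(f"streaming:  {streaming_ms / 1000:>8,.1f} s")   # ~5,005 s
print(f"microbatch: {microbatch_ms / 1000:>8,.1f} s")  # ~5.1 s
```

The bigger the batch, the more completely the setup cost is amortized - which is exactly the trade-off the table above describes.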
Optimize the Data Layout: Lazy Execution and Columnar Format
While batching solved the “too many tiny tasks” problem, we had another major opportunity for optimization. We needed to organize our data to take advantage of modern CPU architectures. To push toward this new scale, we had to restructure both how data moved and which comparisons we performed at all.
The first part of that was adopting a lazy, columnar execution model. Instead of running step-by-step Python loops, we expressed the entire normalization, tokenization, and feature-extraction pipeline as Polars dataframe operations.

```python
def process_batch(self, url: pl.LazyFrame) -> pl.LazyFrame:
    # Normalize the raw URL column without leaving Polars' native engine.
    return url.with_columns(
        pl.col(self._input_column)
        .str.strip_chars()
        .str.to_lowercase()
        .alias(self._output_column)
    )
```

Most importantly, Polars is built on Apache Arrow, which stores each column in a contiguous, type-pure memory buffer. That single design choice unlocks a cascade of performance benefits:
- Fewer passes over data - operations are fused into a handful of native passes
- Predictable memory access - data flows linearly through CPU caches instead of hopping through Python object pointers
- SIMD-ready layout - values are tightly aligned, enabling wide-vector instructions (AVX2/AVX-512)
- No Python overhead - the hot path stays entirely in Rust until you explicitly ask for Python objects
By the time a microbatch reaches our similarity-scoring stage, the data is already laid out in an ideal format for SIMD execution, giving us near-hardware-limit throughput with minimal CPU overhead.
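To make the lazy, fused style concrete, here is a hedged sketch: the column names, normalization steps, and feature choices are illustrative rather than our exact pipeline, and it assumes a recent Polars version.

```python
import polars as pl

def featurize(urls: list[str]) -> pl.DataFrame:
    lf = pl.DataFrame({"url": urls}).lazy()
    lf = lf.with_columns(
        pl.col("url").str.strip_chars().str.to_lowercase().alias("url_norm")
    ).with_columns(
        # Crude host extraction for illustration: drop the scheme, keep up to the first "/".
        pl.col("url_norm")
        .str.replace(r"^https?://", "")
        .str.extract(r"^([^/]+)", 1)
        .alias("host")
    ).with_columns(
        pl.col("host").str.len_chars().alias("host_len"),
        pl.col("host").str.count_matches("-").alias("hyphen_count"),
    )
    # Nothing has executed yet; collect() runs the fused, columnar plan in one go.
    return lf.collect()

print(featurize(["HTTPS://a-d0ppel.com/login ", "https://example.org/x"]))
```

Because the whole plan is declared before execution, Polars can fuse these expressions and run them in a few native passes over Arrow buffers instead of one Python call per URL.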
Rust Kernels: Accelerating the Hot Path
So far we’ve built up the foundation of our pipeline, but there’s another critical reason we chose this architecture: we get to write Rust! That was a bit tongue-in-cheek, but this setup lets us offload performance-sensitive code to Rust while keeping orchestration in Python.
To illustrate, we ran some benchmarks on a simple feature we compute: the vowel/consonant ratio of a domain. Computing this ratio is trivial, but when you're processing millions of domains, every microsecond counts.
We benchmarked four implementations.
Approach 1: Pure Python
```python
VOWELS = frozenset("aeiouAEIOU")
CONSONANTS = frozenset("bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ")

def compute_ratio(domain: str) -> float:
    vowel_count = 0
    consonant_count = 0
    for char in domain:
        if char in VOWELS:
            vowel_count += 1
        elif char in CONSONANTS:
            consonant_count += 1
    total = vowel_count + consonant_count
    return vowel_count / total if total else 0.0

# Process batch
results = [compute_ratio(d) for d in domains]
```

Approach 2: Polars + Python UDF - Polars DataFrame with map_elements()
```python
import polars as pl

def batch_with_polars(domains: list[str]) -> list[float]:
    df = pl.DataFrame({"domain": domains})
    result = df.with_columns(
        pl.col("domain")
        .map_elements(compute_ratio, return_dtype=pl.Float64)
        .alias("ratio")
    )
    return result["ratio"].to_list()
```

Approach 3 & 4: Rust - Single-threaded (3) & Multi-threaded with Rayon (4)
```rust
use pyo3::prelude::*;
use rayon::prelude::*;

// Helpers mirroring the Python frozensets above.
fn is_vowel(c: char) -> bool {
    matches!(c, 'a' | 'e' | 'i' | 'o' | 'u' | 'A' | 'E' | 'I' | 'O' | 'U')
}

fn is_consonant(c: char) -> bool {
    c.is_ascii_alphabetic() && !is_vowel(c)
}

fn compute_vowel_ratio(s: &str) -> f64 {
    let mut vowel_count = 0u32;
    let mut consonant_count = 0u32;
    for c in s.chars() {
        if is_vowel(c) { vowel_count += 1; }
        else if is_consonant(c) { consonant_count += 1; }
    }
    let total = vowel_count + consonant_count;
    if total == 0 { 0.0 } else { vowel_count as f64 / total as f64 }
}

// Sequential version
#[pyfunction]
fn batch_vowel_ratio(domains: Vec<String>) -> Vec<f64> {
    domains.iter()
        .map(|d| compute_vowel_ratio(d))
        .collect()
}

// Parallel version with Rayon
#[pyfunction]
fn batch_vowel_ratio_parallel(domains: Vec<String>) -> Vec<f64> {
    domains.par_iter()
        .map(|d| compute_vowel_ratio(d))
        .collect()
}
```

We started with processing 100k domains through each implementation. Here both single- and multi-threaded Rust easily beat the Python implementations. A surprising finding is that Polars + Python UDF is in fact slower than pure Python: the overhead of DataFrame creation and of map_elements() calling back into Python outweighs any benefits.
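These #[pyfunction]s get exposed to Python through a PyO3 module compiled with maturin. As a hedged sketch of the Python side of the benchmark (the module name url_kernels is hypothetical, and it assumes the functions are registered in a #[pymodule]):

```python
import time

import url_kernels  # hypothetical name for the compiled PyO3/maturin extension

domains = [f"login-{i}.example-brand.com" for i in range(100_000)]

implementations = {
    "pure python": lambda ds: [compute_ratio(d) for d in ds],  # Approach 1 above
    "rust sequential": url_kernels.batch_vowel_ratio,
    "rust + rayon": url_kernels.batch_vowel_ratio_parallel,
}

for name, fn in implementations.items():
    start = time.perf_counter()
    fn(domains)
    print(f"{name:>15}: {(time.perf_counter() - start) * 1e3:.1f} ms")
```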

Next, we benchmarked how throughput is impacted by dataset size. Parallel Rust actually gets faster with larger batches, likely because per-thread overhead is better amortized and core utilization improves.

Now you might wonder: why can’t we just parallelize Python?
Python's GIL only allows one thread to execute Python bytecode at a time. Multithreading helps for I/O-bound work, but not for CPU-bound computation like ours. Multiprocessing bypasses the GIL by spawning separate processes, but adds overhead from process creation and serializing data between processes.
To show this we added two more implementations to the benchmark, testing on another 100k domains.
Multithreading
```python
from concurrent.futures import ThreadPoolExecutor

def batch_multithreading(domains: list[str]) -> list[float]:
    with ThreadPoolExecutor(max_workers=14) as executor:
        return list(executor.map(compute_ratio, domains))
```

Multiprocessing
```python
import multiprocessing as mp

def batch_multiprocessing(domains: list[str]) -> list[float]:
    with mp.Pool(14) as pool:
        return pool.map(compute_ratio, domains, chunksize=1000)
```
Both Python approaches are slower than the single-threaded baseline. Multithreading suffers from GIL contention, and while multiprocessing avoids the GIL, it must pickle/unpickle data between processes. Our computation takes microseconds per domain, but serialization takes milliseconds per batch. Only Rust + Rayon achieves real speedup, with shared memory, no serialization, and lightweight threads.
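A quick way to see the serialization tax (an illustrative measurement, not one of the benchmarks above) is to time pickling the batch itself:

```python
import pickle
import time

domains = [f"login-{i}.example-brand.com" for i in range(100_000)]

start = time.perf_counter()
payload = pickle.dumps(domains)
elapsed_ms = (time.perf_counter() - start) * 1e3
print(f"pickled {len(payload) / 1e6:.1f} MB in {elapsed_ms:.1f} ms")
```

Multiprocessing pays that cost - plus unpickling on the workers and pickling the results coming back - before computing a single ratio.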
Algorithmic Shift: From Cartesian Loops to Sparse Indexing
The second part of the redesign was eliminating Cartesian work entirely. As we add more customers, each with their own domains and brand metadata, and push toward our new scaling target, the number of possible URL-customer comparisons explodes. To process a microbatch efficiently, we needed an indexing strategy that avoids this core scaling pitfall.
To avoid this, we converted the problem into a matrix multiplication and built an in-memory index that lets us skip nearly all irrelevant comparisons.
Let’s illustrate this with an n-gram index for Jaccard-style similarity, walking through a realistic phishing example.
Phishing Example: “a-d0ppel.com”
Imagine we’re protecting these brands:
- Doppel
- NebulaPay
- BrightCart
Step 1 - Tokenize everything into n-grams.
Customer metadata
```text
doppel.com     -> ["dop", "opp", "ppe", ...]
nebulapay.com  -> ["neb", "ebu", "bul", "ula", ...]
brightcart.com -> ...
```

Phishing URL

```text
a-d0ppel.com -> ["a-d", "-d0", "d0p", "0pp", "ppe", ...]
```

Step 2 - Look up each n-gram in our in-memory index
For each n-gram in our URL, we check our index to answer the question:
What brand metadata contains this n-gram?
```text
url n-grams = ["a-d", "-d0", ..., "ppe", "pel"]

"a-d" -> {None}
"-d0" -> {None}
...
"ppe" -> {matches "ppe" n-gram in do[ppe]l.com}
"pel" -> {matches "pel" n-gram in dop[pel].com}
```
Another way to frame this is: does URL i contain n-gram g, and does metadata j also contain n-gram g? We can convert this lookup into a matrix multiplication problem.
To start, let's assign each n-gram to a column in the matrix. We’ll do this for both the customer metadata and the batch of URLs.
```text
              n-grams ->
URLs ↓    [dop  opp  ppe  pel  ...  neb  ebu  bul  ...]
--------------------------------------------------------
url #1      0    1    0    0         1    1    0
url #2      1    0    1    0         0    0    0
url #3      0    1    0    1         0    0    0
...
```

Once we have a matrix for (URLs, n-grams) and (n-grams, customer metadata), we can multiply these to get:
```text
(URL, n-gram) * (n-gram, metadata) = (URL, metadata)
```

A non-zero result in position (url_i, metadata_j) means the URL and the metadata share at least one n-gram, so they are real similarity candidates.
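Here is a hedged sketch of that candidate generation using SciPy sparse matrices (vocabulary handling, names, and the bare labels are simplified relative to our pipeline):

```python
import numpy as np
from scipy.sparse import csr_matrix

def n_grams(s: str, n: int = 3) -> set[str]:
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def to_sparse(strings: list[str], vocab: dict[str, int]) -> csr_matrix:
    # One row per string, one column per n-gram, 1 where the string contains it.
    rows, cols = [], []
    for i, s in enumerate(strings):
        for g in n_grams(s):
            if g in vocab:
                rows.append(i)
                cols.append(vocab[g])
    data = np.ones(len(rows), dtype=np.int8)
    return csr_matrix((data, (rows, cols)), shape=(len(strings), len(vocab)))

brands = ["doppel", "nebulapay", "brightcart"]
urls = ["a-d0ppel", "nebula-pay-secure", "totally-unrelated"]

vocab = {g: j for j, g in enumerate(sorted({g for b in brands for g in n_grams(b)}))}
U = to_sparse(urls, vocab)    # (URL, n-gram)
B = to_sparse(brands, vocab)  # (metadata, n-gram)

candidates = U @ B.T          # (URL, metadata): counts of shared n-grams
print(candidates.toarray())   # non-zero entries are the only pairs worth full scoring
```

Both operands and the result stay sparse, so compute and memory scale with the number of real candidate pairs rather than with the full URL × customer cross product.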
We benchmarked this approach against the naive implementation.
Naive n-gram Jaccard Similarity
```python
def get_n_grams(s: str, n: int = 3) -> set[str]:
    # Standard character n-grams (trigrams by default).
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard_similarity(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if (a or b) else 0.0

def naive_cross_join_jaccard(queries: list[str], bases: list[str]) -> list[tuple]:
    query_n_grams = [get_n_grams(q) for q in queries]
    base_n_grams = [get_n_grams(b) for b in bases]
    results = []
    for q_idx, q_n_grams in enumerate(query_n_grams):
        for b_idx, b_n_grams in enumerate(base_n_grams):
            sim = jaccard_similarity(q_n_grams, b_n_grams)
            if sim > 0:
                results.append((q_idx, b_idx, sim))
    return results
```

We saw a massive difference between the naive implementation and the indexed operation. More importantly, the gap keeps increasing as we scale the operation - perfect for our microbatch architecture.

Memory Optimization
As a side note - most of the entries here are 0s, so we can further optimize by storing only the non-zero entries. Here we use a compressed sparse representation (CSR/CSC) for the matrix.
So instead of storing:
```text
[0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...]   // thousands of columns
```
We store:
```text
[(col 4, 1), (col 7, 1), (col 11, 1)]
```
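With SciPy, for instance (a sketch; our production representation differs in the details), the compressed row keeps exactly those column indices and values:

```python
import numpy as np
from scipy.sparse import csr_matrix

dense_row = np.zeros((1, 10_000), dtype=np.int8)
dense_row[0, [4, 7, 11]] = 1

sparse_row = csr_matrix(dense_row)
print(sparse_row.indices, sparse_row.data)  # [ 4  7 11] [1 1 1]

dense_bytes = dense_row.nbytes
sparse_bytes = sparse_row.data.nbytes + sparse_row.indices.nbytes + sparse_row.indptr.nbytes
print(dense_bytes, sparse_bytes)  # ~10 KB dense vs a few dozen bytes compressed
```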
The Future
The new architecture positions Doppel for long-term growth: a pipeline that scales naturally as customers grow, threat actors evolve, and the surface area of the internet continues to explode. Instead of stretching a system past its breaking point, we now operate on infrastructure engineered to absorb our next ambitions.
Note: Benchmarks run on a 14-core Apple M-series CPU.





