
Battling Scams at Scale: Inside Doppel’s High-Throughput ML Platform

How our engineering team took our ML platform from zero to inference at internet scale in four months.
Justin D'Souza, William Gill, and Rohit Mukerji
August 28, 2025

The internet is both a powerful marketplace and a playground for threat actors. At Doppel, we see firsthand how a single undetected phishing site can cost a brand millions of dollars in fraud losses and erode customer trust overnight. With billions of URLs live at any moment—and new malicious sites spun up in seconds—our window to catch threats is razor‑thin.

Our mission at Doppel is to protect organizations from social engineering—fake login pages, scam ads, spoofed social accounts, and malicious mobile apps that trick real people into handing over their data. As a full‑stack digital risk protection solution, we handle everything from initial detections to full takedowns. A “takedown” involves working with hosting providers, domain registrars, and platform operators to remove or disable malicious sites, ads, or accounts the moment we identify them.

Figure 1. Overview of Doppel’s Digital Risk Protection Platform: live Internet signals flow into the system, where we automatically Detect malicious content, initiate Takedowns, and maintain Continuous Monitoring in a closed‑loop to catch any re‑emergence.

In our early days, we leaned on third‑party machine learning (ML) vendors to help us keep pace. But these tools left us:

  • Vulnerable to rapid attacker innovation, with retraining cycles measured in days.
  • In the dark about model decisions, because of black‑box predictions.
  • Burdened by tuning constraints, driving up false positives and missing real scams.

Faced with escalating volume and stakes, we knew end‑to‑end control was the only way forward.

Why build an ML Platform?

To overcome these limitations, we needed more than individual point solutions—we needed a unified ML platform. As our business scaled, we found that feature engineering, model training, and serving are all highly repetitive workflows: pulling data, transforming features, spinning up training jobs, packaging models, and deploying endpoints. Manually orchestrating each step not only slowed development and wasted engineering cycles, but also led to divergent design patterns across teams and inconsistent results.

By abstracting away this complexity and standardizing end‑to‑end patterns, we set out to build a platform that lets any engineer at Doppel:

  • Write and ship new features rapidly: Define and deploy feature transformations without rebuilding data pipelines.
  • Use a unified training & serving interface: Track experiments, version datasets, and run validation checks through a single workflow.
  • Push models live with confidence: Deploy models with built‑in performance monitoring and auto‑scaling for real‑time throughput.

With our ML platform now fully in place, we’ve transformed how quickly and effectively we defend against threats. In this post, we’ll walk through how we bootstrapped adversarial model development, built a real‑time serving stack at scale, and distilled key lessons from running ML in production.

TL;DR: Impact

In just 4 months, we’ve achieved the following outcomes:

  • Expanded model portfolio: We’ve replaced ~3 opaque vendor classifiers with six in‑house models, each backed by versioned feature sets.
  • Unified feature platform: We’ve moved from brittle DIY abstractions—and the headaches of training‑serving skew—to a single source of truth that standardizes training and serving pipelines and supports hundreds of batch and real‑time features powering our detection systems.
  • Accelerated time from ideation to production: We’ve cut the train‑and‑deploy cycle from multiple days to mere hours.

1. Bootstrapping Model Development in an Adversarial Space

When you’re detecting malicious content at scale, the first challenge isn’t infrastructure; it’s signal. You need labeled data to train models, but in adversarial environments like ours, that data is sparse, messy, and constantly evolving. Ground truth is often unclear: threat actors don’t announce themselves, and most URLs on the internet are benign noise.

Weak signals, strong patterns

Our earliest models were trained on a combination of:

  • Confirmed takedowns from real impersonation cases
  • Manual reviews by our security operations team
  • Heuristics encoded from subject-matter expertise — things like measuring keyword similarity to official brand names, computing domain string entropy, and flagging inconsistencies between page content and the claimed brand.

These signals were noisy, but directional. We built labeling pipelines that could aggregate weak supervision at scale, and prioritized models that could help rank content for human review, not make binary decisions.
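As a concrete illustration of the heuristic signals above, here is a minimal sketch of a domain string entropy check. The 3.5-bit threshold is purely illustrative, not our production value:

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Shannon entropy of a string, in bits per character."""
    if not s:
        return 0.0
    counts = Counter(s)
    total = len(s)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Algorithmically generated domains tend toward higher entropy than
# human-chosen brand names; the threshold here is illustrative only.
def looks_machine_generated(domain: str, threshold: float = 3.5) -> bool:
    label = domain.split(".")[0]  # ignore the TLD
    return shannon_entropy(label) > threshold
```

On its own this signal is weak — plenty of legitimate domains score high — which is exactly why such heuristics fed a ranking model for human review rather than making binary calls.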

Biases in the data, and how we handled them

Our initial labeled sets risked overfitting to high‑confidence edge cases. To combat this, we built a suite of data build tool (dbt) models in our data warehouse that:

  • Codify label sources: We formalize derived labels (e.g., heuristic flags) alongside human review annotations, and apply consistent transformation logic so every signal follows the same lineage.
  • Unify external and internal datasets: We ingest third‑party threat feeds and internal security‑ops tags, then merge and reconcile these sources into a single, versioned training table.

Whenever our security team flags a benign domain as suspicious, we capture that correction in dbt and send it through the exact same pipelines as our other labels. This way, every false positive becomes part of our training data, keeping our models grounded in real‑world feedback.
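The dbt models themselves are SQL, but the reconciliation logic they encode can be sketched in Python. The source names and trust ordering below are hypothetical simplifications:

```python
from dataclasses import dataclass

@dataclass
class Label:
    url: str
    verdict: str        # "malicious" or "benign"
    source: str         # where the label came from
    observed_at: str    # ISO-8601 timestamp

# Hypothetical trust ordering: human corrections override
# automated signals when sources disagree on the same URL.
SOURCE_PRIORITY = {"secops_review": 3, "confirmed_takedown": 2, "heuristic": 1}

def reconcile(labels: list[Label]) -> dict[str, Label]:
    """Merge labels from all sources into one winning label per URL."""
    merged: dict[str, Label] = {}
    for label in labels:
        current = merged.get(label.url)
        if current is None or (
            SOURCE_PRIORITY[label.source],
            label.observed_at,
        ) > (SOURCE_PRIORITY[current.source], current.observed_at):
            merged[label.url] = label
    return merged
```

Because every correction flows through the same merge, a security-ops override of a heuristic flag automatically wins in the versioned training table.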

Optimizing for learning velocity

We didn’t optimize for model performance out of the gate. We optimized for learning velocity — how quickly we could train, evaluate, ship, and get feedback. That meant:

  • Detailed, versioned datasets with rich label metadata: we tag every training example with standardized metadata, like label source, timestamp, and labeler, so it’s easy to trace exactly how each label was generated and updated over time.
  • Lightweight model experimentation with reproducible notebooks and metrics
  • Tight integration with our takedown system to close the loop between predictions and outcomes

This early investment let us move fast without losing track of what was working, and gave us a foundation we could confidently scale on top of.

2. Serving Real-Time ML at Scale

Model training is one challenge—deploying inference in production at scale is another entirely. Our serving infrastructure must process a continuous stream of 100 million+ URL checks per day, maintaining sub‑100 ms P99 latency under bursty traffic. In an adversarial context, a single false negative lets a phishing site slip through undetected, while thousands of false positives per second would drown our SOC team in noise and degrade our signal‑to‑noise ratio.

We needed a serving stack that was:

  • Low-latency: supporting both bulk‑style and point‑based feature generation and inference in real time
  • High-throughput: tens of millions of predictions daily, with peak hour spikes
  • Auditable and debuggable: we had to be able to explain predictions to ourselves and to customers

Chalk as our feature store

At the core of our real-time serving is Chalk — a real-time feature platform we use to compute and serve features dynamically based on the latest web content. The features we define are “resolved” by Python-native resolver functions that compute everything from domain string features (e.g., entropy, brand overlap, token patterns) to page-level metadata extracted from crawled content.

This pattern enables us to write features that are:

  • Versioned and testable
  • Composable into higher-level features
  • Servable at request time or in batch depending on use case

This lets us reuse the same logic in both training and production, reducing drift and improving reproducibility.

Productionizing model inference

Model inference is orchestrated within Chalk: each model consumes raw feature primitives—domain strings, extracted HTML, metadata, and upstream feature outputs—and emits its predictions as first‑class features. This lets us treat model scores just like any other resolver, composing them seamlessly into downstream workflows.

As an illustrative example, imagine we had a general phishing detection model that consumes three intuitive signals to gauge phishing risk:

  • num_login_forms: The count of login forms on the page—more forms can indicate an attempt to harvest credentials.
  • has_suspicious_language: A boolean flag for whether the HTML content language is in a suspicious language list.
  • external_reputation_score: A third‑party trust metric, where lower scores signal riskier domains.

We could first spell out those features in code in a feature class like so:

Python
from chalk.features import features

@features
class Url:
    # Value of the URL
    value: str

    # The count of login forms on the page
    num_login_forms: int

    # Whether the HTML content language is in a suspicious language list
    has_suspicious_language: bool

    # A third-party trust metric (lower = riskier)
    external_reputation_score: float

We could then write a resolver that passes these primitives to our phishing detection service, which returns a single floating‑point phishing_probability between 0.0 (safe) and 1.0 (definitely phishing). That value is exposed as Url.phishing_probability, making it seamlessly available for any downstream resolver or workflow.

Python
import requests

from chalk import online
from chalk.features import Features

MODEL_ENDPOINT_URL = "https://sample_url.com/v1/phish-detection"


@online
def phishing_detection_classifier(
    num_login_forms: Url.num_login_forms,
    has_suspicious_language: Url.has_suspicious_language,
    external_reputation_score: Url.external_reputation_score,
) -> Features[Url.phishing_probability]:
    params = {
        "forms": num_login_forms,
        "has_language_in_blocklist": has_suspicious_language,
        "reputation": external_reputation_score,
    }
    response = requests.get(MODEL_ENDPOINT_URL, params=params)
    # Fail loudly on transport errors rather than returning a stale score
    response.raise_for_status()
    result = response.json()
    return result["phishing_probability"]

Behind the scenes, we package each model and its dependencies into custom Docker containers and serve them via lightweight Cloud Run services. This serverless approach keeps inference modular and testable, and allows us to scale, version, and monitor each model independently, while keeping feature logic and model orchestration centralized in Chalk.

To meet real‑time latency targets, we optimize along three key dimensions:

  • Raw input caching: Store crawled HTML in the online feature store with a TTL to eliminate redundant fetches and parsing.
  • Entity‑level feature cache: Precompute and cache customer‑related feature vectors to slash per‑request computation.
  • Prediction caching: Cache model outputs for frequently seen URLs to bypass repeat inference.

Built‑in observability spans schema enforcement, performance telemetry, and end‑to‑end traceability:

  • Input/output contracts: Enforce schemas with Pydantic to catch data mismatches before inference.
  • Metrics tracking: Surface latency, throughput, and error rates via Cloud Run dashboards and alerts.
  • Audit logs: Persist full feature snapshots and model version metadata per prediction for compliance and post‑hoc analysis.
  • Inference lineage: Leverage Chalk’s query planning DAG to reconstruct the exact computation graph for every score—vital for debugging, validation, and customer audits.
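A minimal sketch of the input-contract idea. Production uses Pydantic; this stdlib version only illustrates the shape of the checks, and the field names and value ranges mirror the hypothetical phishing example above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PhishingModelInput:
    """Value checks enforced before a request reaches the model."""
    num_login_forms: int
    has_suspicious_language: bool
    external_reputation_score: float

    def __post_init__(self):
        if self.num_login_forms < 0:
            raise ValueError("num_login_forms must be non-negative")
        # The [0, 1] range is an assumption for this illustration
        if not 0.0 <= self.external_reputation_score <= 1.0:
            raise ValueError("external_reputation_score must be in [0, 1]")
```

Rejecting malformed inputs at the boundary means a broken upstream resolver produces a loud validation error instead of a silently wrong score.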

Figure 2: High‑level real‑time serving architecture where multiple detection workloads (A, B, C) funnel through Chalk’s feature platform for on‑the‑fly feature generation, and individual models (A, B, C) are deployed as containerized Cloud Run services to provide low‑latency inference back into the feature store.

3. Lessons from Running ML in Production

Owning our ML stack end‑to‑end revealed critical engineering insights that map directly back to the challenges we tackled:

  • Centralized label management with dbt: Relying on ad hoc CSVs and manual tagging led to brittle, non‑reproducible training sets. By codifying label logic in dbt—merging external threat feeds, security‑ops annotations, and false‑positive flags into a single versioned table—we keep our training data in lockstep with evolving attacker behaviors.
  • Treat features as first‑class, versioned artifacts: Divergent feature definitions across teams spawned training‑serving skew and hard‑to‑debug errors. Centralizing all feature resolvers in one platform, versioning them, and sharing the same code paths for batch and real‑time compute eliminated drift and made model behavior predictable.
  • Layered caching + containerized inference for sub‑100 ms P99: Real‑time detection at 100M+ URL checks/day demanded more than raw compute power. Our layers of caching combined with custom Docker containers on Cloud Run smashed redundant work, drove down tail latency, and kept costs under control.
  • Shift‑left validation & end‑to‑end observability: Late‑stage surprises—schema mismatches, silent feature drift, or performance regressions—are unacceptable in adversarial settings. We baked Pydantic schema checks, dataset‑drift alerts, and “shadow” inference tests into our CI/CD pipelines, and log every inference with full feature snapshots and model metadata to BigQuery for instant traceability.
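The “shadow” inference tests mentioned above can be sketched roughly like this; the model interfaces and tolerance are hypothetical:

```python
def shadow_compare(inputs, live_model, candidate_model, tolerance=0.05):
    """Run a candidate model alongside the live one and report disagreements.

    The candidate's scores never affect production decisions; we only
    measure how often it diverges from the live model beyond `tolerance`.
    """
    disagreements = []
    for x in inputs:
        live_score = live_model(x)
        shadow_score = candidate_model(x)
        if abs(live_score - shadow_score) > tolerance:
            disagreements.append((x, live_score, shadow_score))
    return disagreements
```

Reviewing the disagreement set before promotion surfaces regressions on real traffic without ever risking a customer-facing mistake.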

These practices have transformed Doppel’s ML platform from a collection of one‑off scripts into a robust, scalable ecosystem—empowering engineers to safely ship and operate new models at internet scale.

We’re Hiring

Interested in pushing the boundaries of AI applications in cybersecurity? We’re hiring—let’s build the future of Social Engineering Defense.
