The internet is both a powerful marketplace and a playground for threat actors. At Doppel, we see firsthand how a single undetected phishing site can cost a brand millions of dollars in fraud losses and erode customer trust overnight. With billions of URLs live at any moment—and new malicious sites spun up in seconds—our window to catch threats is razor‑thin.
Our mission at Doppel is to protect organizations from social engineering—fake login pages, scam ads, spoofed social accounts, and malicious mobile apps that trick real people into handing over their data. As a full‑stack digital risk protection solution, we handle everything from initial detections to full takedowns. A “takedown” involves working with hosting providers, domain registrars, and platform operators to remove or disable malicious sites, ads, or accounts the moment we identify them.
Figure 1. Overview of Doppel’s Digital Risk Protection Platform: live Internet signals flow into the system, where we automatically Detect malicious content, initiate Takedowns, and maintain Continuous Monitoring in a closed‑loop to catch any re‑emergence.
In our early days, we leaned on third‑party machine learning (ML) vendors to help us keep pace. But these tools left us:
Faced with escalating volume and stakes, we knew end‑to‑end control was the only way forward.
To overcome these limitations, we needed more than individual point solutions—we needed a unified ML platform. As our business scaled, we found that feature engineering, model training, and serving are all highly repetitive workflows: pulling data, transforming features, spinning up training jobs, packaging models, and deploying endpoints. Manually orchestrating each step not only slowed development and wasted engineering cycles, but also led to divergent design patterns across teams and inconsistent results.
By abstracting away this complexity and standardizing end‑to‑end patterns, we set out to build a platform that lets any engineer at Doppel:
With our ML platform now fully in place, we’ve transformed how quickly and effectively we defend against threats. In this post, we’ll walk through how we bootstrapped adversarial model development, built a real‑time serving stack at scale, and distilled key lessons from running ML in production.
In just 4 months, we’ve achieved the following outcomes:
When you’re detecting malicious content at scale, the first challenge isn’t infrastructure; it’s signal. You need labeled data to train models, but in adversarial environments like ours, that data is sparse, messy, and constantly evolving. The ground truth is often unclear. Threat actors don’t announce themselves, and most URLs on the internet are benign noise.
Our earliest models were trained on a combination of:
These signals were noisy, but directional. We built labeling pipelines that could aggregate weak supervision at scale, and prioritized models that could help rank content for human review, not make binary decisions.
Our initial labeled sets risked overfitting to high‑confidence edge cases. To combat this, we built a suite of data build tool (dbt) models in our data warehouse that:
Whenever our security team flags a benign domain as suspicious, we capture that correction in dbt and send it through the exact same pipelines as our other labels. This way, every false positive becomes part of our training data, keeping our models grounded in real‑world feedback.
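The precedence rule above can be sketched as a small helper: an analyst correction always overrides the noisy weak sources, which otherwise resolve by majority vote. The function names and the majority-vote fallback are illustrative assumptions, not our production dbt logic.

```python
from collections import Counter
from typing import Optional


def resolve_label(weak_labels: list[str], analyst_override: Optional[str] = None) -> str:
    """Combine noisy weak-supervision labels for one URL.

    Analyst corrections (e.g. a false positive re-marked as 'benign')
    always win; otherwise fall back to a majority vote across the
    weak sources.
    """
    if analyst_override is not None:
        return analyst_override
    label, _ = Counter(weak_labels).most_common(1)[0]
    return label
```

For example, `resolve_label(["phishing", "phishing", "benign"])` resolves to `"phishing"`, while the same inputs with `analyst_override="benign"` resolve to `"benign"`, feeding the correction back into training data.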
We didn’t optimize for model performance out of the gate. We optimized for learning velocity — how quickly we could train, evaluate, ship, and get feedback. That meant:
This early investment let us move fast without losing track of what was working, and gave us a foundation we could confidently scale on top of.
Model training is one challenge—deploying inference in production at scale is another entirely. Our serving infrastructure must process a continuous stream of 100 million+ URL checks per day, maintaining sub‑100 ms P99 latency under bursty traffic. In an adversarial context, a single false negative lets a phishing site slip through undetected, while thousands of false positives per second would drown our SOC team in noise and degrade our signal‑to‑noise ratio.
We needed a serving stack that was:
At the core of our real-time serving is Chalk — a real-time feature platform we use to compute and serve features dynamically based on the latest web content. The features we define are “resolved” by Python-native resolver functions that compute everything from domain string features (e.g., entropy, brand overlap, token patterns) to page-level metadata extracted from crawled content.
This pattern enables us to write features which are:
This lets us reuse the same logic in both training and production, reducing drift and improving reproducibility.
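As an illustrative sketch of the domain-string features mentioned above, here are simplified stand-ins for entropy and brand overlap; the exact definitions in our resolvers differ.

```python
import math
from collections import Counter


def domain_entropy(domain: str) -> float:
    """Shannon entropy of the domain string's characters; algorithmically
    generated domains tend to score higher than human-chosen ones."""
    counts = Counter(domain)
    total = len(domain)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def brand_token_overlap(domain: str, brand_tokens: set[str]) -> int:
    """Count of monitored brand tokens appearing in the domain, a common
    lookalike signal (e.g. 'paypal-secure-login.com')."""
    return sum(token in domain for token in brand_tokens)
```

Because these are plain Python functions, the same implementations can back both offline training pipelines and online resolvers.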
Model inference is orchestrated within Chalk: Each model consumes raw feature primitives—domain strings, extracted HTML, metadata, and upstream feature outputs—and emits its predictions as first‑class features. This lets us treat model scores just like any other resolver, composing them seamlessly into downstream workflows.
As an illustrative example, imagine we had a general phishing detection model which consumes four intuitive signals to gauge phishing risk:
We could first spell out those features in code in a feature class like so:
```python
from chalk.features import features


@features
class Url:
    # Value of the URL
    value: str
    # The count of login forms on the page
    num_login_forms: int
    # Whether the HTML content language is in a suspicious language list
    has_suspicious_language: bool
    # A third-party trust metric (lower = riskier)
    external_reputation_score: float
    # The model's output, populated by a resolver
    phishing_probability: float
```
We could then write a resolver which passes these primitives to our phishing detection service, which returns a single floating‑point phishing_probability between 0.0 (safe) and 1.0 (definitely phishing). That value is exposed as Url.phishing_probability, making it seamlessly available for any downstream resolver or workflow.
```python
import requests

from chalk import online
from chalk.features import Features

MODEL_ENDPOINT_URL = "https://sample_url.com/v1/phish-detection"


@online
def phishing_detection_classifier(
    num_login_forms: Url.num_login_forms,
    has_suspicious_language: Url.has_suspicious_language,
    external_reputation_score: Url.external_reputation_score,
) -> Features[Url.phishing_probability]:
    # Forward the feature primitives to the model's scoring endpoint.
    params = {
        "forms": num_login_forms,
        "has_language_in_blocklist": has_suspicious_language,
        "reputation": external_reputation_score,
    }
    response = requests.get(MODEL_ENDPOINT_URL, params=params)
    response.raise_for_status()
    result = response.json()
    return result["phishing_probability"]
```
Behind the scenes, we package each model and its dependencies into custom Docker containers and serve them via lightweight Cloud Run services. This serverless approach keeps inference modular and testable, and allows us to scale, version, and monitor each model independently, while keeping feature logic and model orchestration centralized in Chalk.
To meet real‑time latency targets, we optimize along three key dimensions:
Built‑in observability spans schema enforcement, performance telemetry, and end‑to‑end traceability:
Figure 2: High‑level real‑time serving architecture where multiple detection workloads (A, B, C) funnel through Chalk’s feature platform for on‑the‑fly feature generation, and individual models (A, B, C) are deployed as containerized Cloud Run services to provide low‑latency inference back into the feature store.
Owning our ML stack end‑to‑end revealed critical engineering insights that map directly back to the challenges we tackled:
These practices have transformed Doppel’s ML platform from a collection of one‑off scripts into a robust, scalable ecosystem—empowering engineers to safely ship and operate new models at internet scale.
We’re Hiring
Interested in pushing the boundaries of AI applications in cybersecurity? We’re hiring. Let’s build the future of Social Engineering Defense.