This post explains how StyloBot turns a request (and a session, and a client over time) into a behavioural vector, and why that matters when the bots get smart.
I started building StyloBot after solving a problem for a customer: how do you ensure only legitimate clients can access endpoints and use APIs WITHOUT the brittleness of current methods?
StyloBot is ENTIRELY FREE TO RUN...in future I'll sell realtime management and reporting (to try and, you know...eat). But the engine in the exe IS StyloBot. Commercial just adds options: distributed topology support, realtime updates (no reload to apply config changes) & more DB options.
All the source is here: https://github.com/scottgal/stylobot
To install it:

**macOS (Homebrew)**

```bash
brew install scottgal/stylobot/stylobot
stylobot 5080 http://localhost:3000
```

**Linux (apt - Debian/Ubuntu)**

```bash
curl -1sLf 'https://dl.cloudsmith.io/public/mostlylucid/stylobot/setup.deb.sh' | sudo bash
sudo apt update && sudo apt install stylobot
stylobot 5080 http://localhost:3000
```

**Linux (manual / ARM64)**

```bash
# Download from GitHub Releases: stylobot-linux-x64.tar.gz or stylobot-linux-arm64.tar.gz
tar xzf stylobot-linux-x64.tar.gz && chmod +x stylobot && sudo mv stylobot /usr/local/bin/
stylobot 5080 http://localhost:3000
```

**Docker**

```bash
docker run --rm -p 8080:8080 -e DEFAULT_UPSTREAM=http://host.docker.internal:3000 \
  scottgal/stylobot-gateway:latest
```

**NuGet (embed as ASP.NET Core middleware)**

```bash
dotnet add package mostlylucid.botdetection
dotnet add package mostlylucid.botdetection.ui
```

```csharp
builder.Services.AddStyloBot(dashboard => {
    dashboard.AllowUnauthenticatedAccess = true; // dev only
});

app.UseRouting();
app.UseStyloBot(); // broadcast, detection, dashboard: correct ordering guaranteed
app.MapControllers();
```
Dashboard at /_stylobot. Detection at ~150µs per request from first request.
Then just run it (`stylobot 5080 http://localhost:3000`) and voilà, StyloBot is listening in front of your upstream site (use `--mode block` to actually block too).
Related reading: StyloBot Part 1 - Fighting Back Against Scrapers, StyloBot Part 2 - The New Frontier in Bot Detection, StyloBot Part 3 - As Simple As Possible And No Simpler.
So early in the article? Yup...the current market really shows the issues StyloBot attempts to solve:
- Cheap, simple, but always after the fact
- Fast and cheap-ish, but only for known patterns
- Powerful but expensive, and can affect user experience
- Cheap control, but blunt instrument
- Essential infra layer, but not behavioural
- Deep insight, but slow and expensive...used sparingly
- Identity-heavy, comes with compliance and cost baggage
- Visibility layer...expensive but necessary
- The "we had to fix gaps" layer
SO, simple, right? This is a market which does cover MOST of the bases you need. However, it's expensive, can be slow, and is prone to false positives (read: actual users having bad experiences).
Notice anything about the market players above? Lots of them need the UA / IP to remain identifiable, or need manual config per endpoint to avoid blocking 'legitimate' traffic. They're also SLOW...if every request goes through this pipeline, that's likely a significant chunk of your time spent processing requests & responding.
So these traditional players work...up to a point. At some point, no matter how much you spend, you won't block them AND you'll be spending more than you save.
Just like the defensive systems on the market above offer different types of protection (at different cost and complexity), 'bots' have their own hierarchy.
**Level 1** (curl, scanners, brute force, invalid paths)
Failure point: none, everything catches these
Really these are the scripts which have been around since the start of the web (Perl FTW!); they're the 'go to site, scrape content' types. EASY to identify (single endpoint, generally the same IP, same UA...).
**Level 2** (rotating UA, valid endpoints, simple scraping)
Failure point: systems relying on obvious mistakes
Starting to get harder. Now you need to identify known patterns, process traffic after the fact, etc.
**Level 3** (Puppeteer/Playwright, JS execution, real flows)
Failure point: anything based on request correctness or signatures
Easy ONLY because they're being used legitimately most of the time (e.g. scraping, SEO); however, this is false-positive city...telling legit from illegitimate traffic is HARD.
**Level 4** (proxy rotation, residential IPs, fingerprint spoofing)
Failure point: this is where false positives start rising. Push harder and all of your normal identifiers start to fall away; you need to be able to identify the same client despite a deceptive identity.
**Level 5** (slow, distributed, learn site behaviour, adjust dynamically)
Failure point: these bots behave "correctly" and evolve. LLMs can adapt to standard attempts to block them (CAPTCHA solvers, randomizers).
THIS is where StyloBot is aimed...right NOW these bots are expensive to operate at scale. THAT IS CHANGING.
As we move down the list we also move through time: from simple identity (block the IP) to needing to understand huge quantities of traffic and log files.
To defend from INTELLIGENT scrapers like we see at level 5 you need INTELLIGENT detection AND protection.
In previous articles I've written about my Behavioural Inference systems. In essence these are a CHEAT that became a feature.
The problem: single 'sensors' are easy to bypass now.
In all the examples above the only constant is how they attempt to deceive: changing factors about their identity (headers, IP, UA, etc.), changing timings and endpoints. So any ONE sensor can be bypassed; combining them gives more sensitivity (read: catches more bots). HOWEVER, in static systems false positives grow as you add sensors: if a single one is enough to trigger a false positive, then you have a problem.
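A toy sketch of that trade-off (illustrative Python, not StyloBot's actual fusion code; all values are made up): OR-ing thresholded sensors means one quirky browser trips a false positive, while weighted evidence fusion lets the combined picture decide.

```python
# Illustrative sketch: why OR-ing sensors inflates false positives
# while weighted evidence fusion does not. Numbers are invented.

def or_fusion(scores, trigger=0.9):
    """Any single sensor over its threshold triggers a verdict."""
    return any(s >= trigger for s in scores)

def weighted_fusion(scores, weights, verdict=0.7):
    """Sensors only contribute evidence; the combined score decides."""
    total = sum(s * w for s, w in zip(scores, weights))
    return total / sum(weights) >= verdict

weights = [1.0, 1.0, 1.0, 1.0, 1.0]

# A human whose browser happens to trip ONE sensor (e.g. an odd header):
human = [0.95, 0.1, 0.2, 0.05, 0.1]
or_fusion(human)                 # True  -> false positive
weighted_fusion(human, weights)  # False -> the outlier is averaged out

# A bot that trips several sensors moderately, none decisively:
bot = [0.8, 0.75, 0.7, 0.85, 0.6]
or_fusion(bot)                   # False -> each sensor alone is bypassed
weighted_fusion(bot, weights)    # True  -> the combined shape gives it away
```

The point is that no single detector gets veto power in either direction; evidence accumulates.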
What behavioural inference does is profile -> characterise -> remember, whether it's in lucidRAG or StyloBot...that's all it does really.
In StyloBot it's remembering these behavioural vectors; THAT is what a client's behaviour becomes (note: behaviour, NOT identity...).
To IT you are a projection over a 130+ dimensional vector space.
In short, StyloBot is a behavioural inference engine applied to web traffic. It uses a large vector space to characterise and identify the class and type of web requests in order to identify automations vs humans.
It's closer to a sensor fusion system than a rules engine. Detectors don't make decisions; they emit signals. The signals are the system. Detectors are just producers.
The market leaders all share one of two shapes: either they rely on simple static rules (updated constantly, like OWASP feeds) or they analyse TONS of real traffic and are heavy beasts that need a SaaS to live in.
StyloBot aims for the distribution model of Fail2Ban (run an exe, point at upstream) with the power of the enterprise stacks.
It ALSO downloads lists of user agents, CVEs, exploits, and other indicators of compromise to enrich detection. HOWEVER, these are just one factor in a decision, never the verdict on their own.
Under the hood StyloBot runs ~50 'contributors': small, focused bits of code that look like this:
```csharp
using Microsoft.AspNetCore.Http;
using Mostlylucid.BotDetection.Models;

namespace Mostlylucid.BotDetection.Detectors;

/// <summary>
/// Execution stage for detectors. Detectors in the same stage run in parallel.
/// Higher stages wait for lower stages to complete.
/// </summary>
public enum DetectorStage
{
    /// <summary>
    /// Raw signal extraction (UA, headers, IP, client-side).
    /// No dependencies on other detectors.
    /// </summary>
    RawSignals = 0,

    /// <summary>
    /// Behavioral analysis that may depend on raw signals.
    /// Runs after Stage 0 completes.
    /// </summary>
    Behavioral = 1,

    /// <summary>
    /// Meta-analysis layers (inconsistency detection, risk assessment).
    /// Reads signals from stages 0 and 1.
    /// </summary>
    MetaAnalysis = 2,

    /// <summary>
    /// AI/ML-based detection that can use all prior signals.
    /// Runs last, can learn from all other signals.
    /// </summary>
    Intelligence = 3
}

/// <summary>
/// Interface for bot detection strategies
/// </summary>
public interface IDetector
{
    /// <summary>
    /// Name of the detector
    /// </summary>
    string Name { get; }

    /// <summary>
    /// Execution stage for this detector.
    /// Detectors in the same stage run in parallel.
    /// Higher stages wait for lower stages to complete.
    /// </summary>
    DetectorStage Stage => DetectorStage.RawSignals;

    /// <summary>
    /// Analyze an HTTP request for bot characteristics.
    /// Legacy method - prefer DetectAsync with DetectionContext.
    /// </summary>
    /// <param name="context">HTTP context</param>
    /// <param name="cancellationToken">Cancellation token</param>
    /// <returns>Detection result with confidence score and reasons</returns>
    Task<DetectorResult> DetectAsync(HttpContext context, CancellationToken cancellationToken = default);

    /// <summary>
    /// Analyze an HTTP request for bot characteristics using shared context.
    /// Detectors should read signals from prior stages and write their own signals.
    /// </summary>
    /// <param name="detectionContext">Shared detection context with signal bus</param>
    /// <returns>Detection result with confidence score and reasons</returns>
    Task<DetectorResult> DetectAsync(DetectionContext detectionContext)
    {
        // Default implementation for backward compatibility
        return DetectAsync(detectionContext.HttpContext, detectionContext.CancellationToken);
    }
}

/// <summary>
/// Result from an individual detector
/// </summary>
public class DetectorResult
{
    /// <summary>
    /// Confidence score from this detector (0.0 to 1.0)
    /// </summary>
    public double Confidence { get; set; }

    /// <summary>
    /// Reasons found by this detector
    /// </summary>
    public List<DetectionReason> Reasons { get; set; } = new();

    /// <summary>
    /// Bot type if identified
    /// </summary>
    public BotType? BotType { get; set; }

    /// <summary>
    /// Bot name if known
    /// </summary>
    public string? BotName { get; set; }
}
```
Each detector declares what it is, whether it depends on others (via stages and signals), and what results it gives.
NOTE: This is a core concept for how I built it. StyloBot is a LARGE system with MINIMAL concepts. So adding detectors is SIMPLE.
That stage ordering is the discipline. Stage 0 runs in parallel and writes signals. Stage 1+ reads what came before instead of re-extracting from the raw request. Most requests never get past stage 0.
Using my mostlylucid.ephemeral framework (more on it in Building a Reusable Ephemeral Execution Library and Ephemeral Signals - Turning Atoms into a Sensing Network) it emits what I call 'signals': tiny strings like 'ua.score=0.75' which act BOTH as metadata for the request AND as logging / diagnostic data. This lets me tune the system very finely, as a code LLM (or the system itself) can use these signals to identify efficiencies (auto-tuning).
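To make the 'name=value' signal idea concrete, here's a minimal sketch of the pattern in Python; this is NOT the mostlylucid.ephemeral API (its real types and methods differ), just the shape of the idea: detectors write named signals to a shared bus, later stages read them instead of re-extracting from the raw request.

```python
# Hedged sketch of the signal-string pattern. Not the real
# mostlylucid.ephemeral API; names here are illustrative only.

def parse_signal(raw: str) -> tuple:
    """Split 'name=value' and coerce numeric values to float."""
    name, _, value = raw.partition("=")
    try:
        return name, float(value)
    except ValueError:
        return name, value  # non-numeric signals stay as strings

class SignalBus:
    """Detectors write signals; later stages read them rather than
    re-parsing the raw request."""
    def __init__(self):
        self._signals = {}

    def emit(self, raw: str) -> None:
        name, value = parse_signal(raw)
        self._signals[name] = value

    def get(self, name: str, default=None):
        return self._signals.get(name, default)

bus = SignalBus()
bus.emit("ua.score=0.75")   # a stage-0 detector writes...
bus.emit("ua.family=curl")
bus.get("ua.score")         # ...and a stage-1+ detector reads: 0.75
```

Because the same strings feed both detection and logging, a trace of a request IS its signal history.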
Aside: Ephemeral also gives StyloBot LFU / sliding-window processing; it drops human requests while retaining a window so that if a future request crosses a bot threshold we can look back and reprocess the older ones for clues. That mechanism deserves its own post; for now just know it's why retention costs nothing in the steady state.
StyloBot has 50 detectors. It rarely runs more than 5-7 per request. They aren't 50 independent verdicts; they're 50 ways of observing the same underlying behaviour, each contributing evidence toward a single behavioural model.
The 50 is the CAPABILITY; it only uses what it needs.
**Fast path (the common case).** 5-7 SUPER fast (sub-millisecond) initial detectors and fingerprinters. From that fingerprint it can decide what sort of thing you are...AND what your next requests are likely to be (content->resource pathing). It then predicts the next request, compares against what actually arrives, and escalates only if the shape diverges. ~150µs end to end. This is what the vast majority of human traffic ever sees.
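The predict-then-compare idea can be sketched as a simple next-request model (assumed mechanics for illustration; StyloBot's real predictor is richer): learn which resource usually follows which page, then escalate only when the observed next request doesn't match anything seen before.

```python
# Illustrative sketch of fast-path prediction, not StyloBot's code.
from collections import Counter, defaultdict

class NextRequestModel:
    def __init__(self):
        # For each path, count what has been requested next.
        self.transitions = defaultdict(Counter)

    def observe(self, path: str, next_path: str) -> None:
        self.transitions[path][next_path] += 1

    def predict(self, path: str):
        counts = self.transitions.get(path)
        return counts.most_common(1)[0][0] if counts else None

    def diverges(self, path: str, actual_next: str) -> bool:
        """True when the next request matches nothing we've seen
        follow this page -> escalate to the slow path."""
        counts = self.transitions.get(path)
        return counts is not None and actual_next not in counts

model = NextRequestModel()
# Human page loads pull the page's assets in a predictable waterfall:
for _ in range(100):
    model.observe("/blog/post-1", "/css/site.css")
    model.observe("/css/site.css", "/js/app.js")

model.predict("/blog/post-1")                      # "/css/site.css"
model.diverges("/blog/post-1", "/css/site.css")    # False -> stay fast
model.diverges("/blog/post-1", "/wp-admin/setup")  # True  -> escalate
```

The paths here are made up; the useful property is that the common case is a dictionary lookup, which is why the fast path stays in microseconds.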
**Slow path (the interesting case).** When the fast path is ambiguous (signals contradict each other, the shape doesn't match anything we've seen, confidence sits in the dead zone) or the request looks novel (a new attack pattern, a fresh CVE probe, an LLM-driven scraper trying something we haven't fingerprinted yet), StyloBot opens the throttle. ALL 50 detectors run. The Intelligence stage gets to consult an LLM that takes the full signal bundle and contributes another dimension of resolution: pattern-matching against threat intel, reasoning about request intent, spotting things the heuristics aren't shaped for yet.
That's the deal: pay microseconds when you can, pay milliseconds when you must, never pay both. The slow path is rare by design (typically <1% of traffic) but it's where StyloBot earns its keep against the level-5 adaptive bots from earlier; the ones that will slip past any fixed pipeline. And every slow-path verdict feeds back as new fast-path signal, so next time around the cheap detectors catch what the expensive ones discovered.
The full set of layers (you only see all of them on a slow-path request that genuinely needs every angle):
| Layer | Detectors | What it catches |
|---|---|---|
| Identity | Signature, HeaderCorrelation, Periodicity | UA rotation, identity factors, temporal patterns |
| Protocol | TLS (JA3/JA4), TCP/IP (p0f), HTTP/2, HTTP/3, Transport, StreamAbuse | Spoofed browser fingerprints, protocol inconsistencies |
| Behavioral | Waveform, SessionVector, AdvancedBehavioral, CacheBehavior, CookieBehavior, ResourceWaterfall, ContentSequence | Timing patterns, Markov chains, missing assets, page-load sequence divergence |
| Content | UserAgent, Header, AiScraper, Haxxor, SecurityTool, VersionAge | Known bots, attack payloads, impossible browser versions |
| Network | IP, GeoChange, ResponseBehavior, MultiLayerCorrelation, CveProbe | Datacenter IPs, impossible travel, CVE scanning, cross-layer mismatches |
| Intelligence | FastPathReputation, ReputationBias, TimescaleReputation, Cluster, Similarity, Intent | Historical reputation, Leiden clustering, HNSW similarity, threat scoring |
| Ad Fraud | ClickFraud, PiiQueryString | IAB SIVT: datacenter/VPN/headless on paid traffic, referrer spoofing, immediate bounce |
| AI | Heuristic, HeuristicLate, LLM | 50-feature model (<1ms), optional LLM for ambiguous cases |
| Client | ClientSide, FingerprintApproval, ChallengeVerification | JS timing probes, headless detection, PoW challenges |
So now to get to the POINT. With 50 detectors and hundreds of signals we have a LOT of metadata about each client. None of it on its own is a verdict; together it's a position in a 130+ dimensional space.
Remember this from earlier? This is a projection (Wikipedia) of that underlying vector space, built from the contributors we just saw. So this is essentially a 'low resolution' image of the fingerprint.
Your bots aren't just a bunch of numbers. They're SHAPES. These shapes are DIFFERENT to human ones.
Humans are noisy but consistent in structure. Bots are consistent but wrong in structure.
That's the whole trick. Once you can see the shape, the per-detector confidence scores stop mattering individually; what matters is whether the projection looks like a human or like something pretending to be one.
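"Shape over magnitude" is exactly what cosine similarity measures: it compares the direction of two vectors and ignores their overall intensity. A tiny sketch (the vectors below are invented 5-dimensional stand-ins for the real 130+ dimensions):

```python
# Hedged illustration: cosine similarity as "same shape, different
# noise". Vectors are made-up stand-ins, not real StyloBot features.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Two humans: noisy values, but the same dimensions are active.
human_a = [0.9, 0.1, 0.8, 0.2, 0.7]
human_b = [0.7, 0.2, 0.9, 0.1, 0.8]  # different numbers, same shape

# A bot: clean values, but active in the wrong dimensions.
bot = [0.1, 0.9, 0.1, 0.9, 0.1]

cosine(human_a, human_b)  # high: same structure despite the noise
cosine(human_a, bot)      # low: a structurally different shape
```

Per-dimension thresholds would flag both humans as different from each other; comparing shape puts them together and leaves the bot out on its own.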
Then we combine that with tracking across ALL your sessions (don't worry, the system collects ZERO PII). Looking at a single session, a client might look totally human (it might even be a recording of a human). HOWEVER...sensitivity across TIME (looking for automated cadences, even human fingerprints which are later USED as bots) is where the shape really gives them away.
Once everything is a shape, bots stop hiding from each other. They cluster.
StyloBot runs Leiden community detection over the live vector space. This trick is borrowed wholesale from my GraphRAG work; if you've read GraphRAG: Why Vector Search Breaks Down at the Corpus Level and GraphRAG Part 2: Minimum Viable GraphRAG you've already seen this exact pattern. There it builds communities of meaning over document chunks so a query can pull a whole connected idea instead of a handful of disconnected snippets. Here it does the structurally identical job over behavioural vectors; it builds communities of clients that move alike. Same algorithm, same insight, different domain. A bot family is just a community in the graph; a GraphRAG topic is the same shape over text.
Bots that share an origin (same toolkit, same operator, same scraping campaign) land in the same neighbourhood even when they've rotated IPs, headers, fingerprints and timing. They didn't co-ordinate to look the same; they look the same because they ARE the same, structurally.
That gives StyloBot two superpowers for free: a brand-new client that lands inside a known bot community inherits that community's reputation immediately, and a verdict against one member strengthens the evidence against the whole family.
This is also where similarity search (HNSW over the same vectors) earns its keep; "show me the 20 closest things to this request right now" becomes a near-instant lookup, not a scan over history.
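The question HNSW answers, posed naively (brute-force scan shown for clarity only; an HNSW index returns approximately the same neighbours in roughly logarithmic rather than linear time):

```python
# The k-nearest-neighbour question, brute force. Illustrative data;
# a real deployment would use an HNSW index over these vectors.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query, vectors, k=2):
    """Return the k client ids whose vectors sit closest to the query."""
    ranked = sorted(((n, euclidean(query, v)) for n, v in vectors.items()),
                    key=lambda t: t[1])
    return [name for name, _ in ranked[:k]]

history = {
    "scraper-a": [0.9, 0.1, 0.8],
    "scraper-b": [0.85, 0.15, 0.75],
    "human-1":   [0.2, 0.9, 0.1],
}
nearest([0.88, 0.12, 0.78], history)  # ['scraper-a', 'scraper-b']
```

The new request's nearest neighbours are the known scrapers, so it inherits their context before any detector has said a word about it.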
Note what I DIDN'T say...I didn't say 'once set up' or 'when properly configured', because that's StyloBot's secret...it ships with a good default set but it learns.
As it runs it starts to profile your traffic and understand your users. Not creepily; it works out what request patterns, endpoints, and timings look like for your human vs your automated traffic.
You can THEN decide, or let the system take care of it (set a bot threshold of, say, 0.8 for most endpoints and 0.6 for secure ones). The defaults that ship are good; the defaults that emerge after a few hours on your traffic are better.
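The threshold idea is simple enough to sketch in a few lines (the route names and config shape here are hypothetical, not StyloBot's actual options): a stricter limit on sensitive routes, a looser default everywhere else.

```python
# Hedged sketch of per-endpoint bot thresholds. The routes and the
# config shape are invented for illustration.
THRESHOLDS = {
    "default": 0.8,   # most endpoints tolerate more ambiguity
    "/admin":  0.6,   # secure endpoints block earlier
}

def decide(path: str, bot_score: float) -> str:
    """Block when the behavioural score crosses the route's threshold."""
    limit = THRESHOLDS.get(path, THRESHOLDS["default"])
    return "block" if bot_score >= limit else "allow"

decide("/blog/post-1", 0.7)  # 'allow': 0.7 is below the 0.8 default
decide("/admin", 0.7)        # 'block': the secure route uses 0.6
```

The learning described above effectively tunes these numbers per-site instead of leaving them to you.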
StyloBot is NOW live. The detection engine, the dashboard, the NuGet packages, the gateway exe; all of it is shipping right now and FREE to run on your own infra. Grab the source at github.com/scottgal/stylobot or brew install scottgal/stylobot/stylobot and point it at your upstream.
I'll add commercial features shortly (managed dashboards, hosted reputation, multi-site reporting, the things that need a server somewhere I have to keep paying for) but the core engine stays free. If you want the technical lead-up to this post: Part 1, Part 2 and Part 3 cover the why, the architecture, and the two-line drop-in. The behavioural inference foundations live in Behavioural Inference and the signal plumbing in Ephemeral Signals.
© 2026 Scott Galloway — Unlicense — All content and source code on this site is free to use, copy, modify, and sell.