LLM red-team harness
Find where your
model breaks.
Profile a prompt’s intent, generate targeted attacks, and score every break with a composite no single evaluator can fool.
Pipeline phases
Attacks per run
Scoring signals
Severity bands
Every deployed model
is an attack surface.
Jailbreaks. Injections.
New vectors every week.
Most teams don't know
where they're exposed.
keep scrolling
01 / Analyze
Before a single attack is generated, a Gemini classifier profiles the prompt: how risky it is, which threat category it belongs to, and the intent behind it. Every later phase is steered by this, so testing stays targeted.
Ingest the raw prompt
Exactly as a user would submit it, with no sanitising. The model sees what an attacker types.
Classify the intent
Gemini reads the prompt for adversarial signal and reasons about the real objective.
Return strict JSON
A 0–1 risk score, the most specific threat category, and a one-line statement of intent.
Seed the attack phase
The profile is handed downstream so every generated variant inherits the same target.
02 / Attack
The threat profile becomes ammunition. A Gemini red-teamer expands the prompt into ten distinct adversarial variants. Each preserves the goal while probing a different route: alternate phrasings, indirect framings, realistic threat scenarios.
Inherit the context
Prompt, category and intent from Analyze become the red-teamer's brief.
Diversify with temperature
Runs at temperature 0.8 to maximise variety across the ten variants, not rephrase one sentence.
Explore the surface
Alternate phrasings, indirect attack paths and plausible scenarios, with meaning preserved throughout.
Hand off for review
All ten variants return so you can inspect, trim and select before scoring.
03 / Score
Every variant is fired at an unguarded target model: Gemini with no safety system prompt, the worst-case deployment. Four independent signals grade each response, fuse into one composite, then map to a severity badge and rank.
A strict Gemini auditor rates the response 0–1 on a five-band scale, with a written rationale and confidence.
Programmatic hallucination, toxicity and bias metrics combine into one model-risk score.
Cosine similarity between attack and response. High similarity means the model echoed the goal.
A zero-LLM check: a clean refusal phrase drops this term to 0 and is rewarded.
Composite formula
Clamped to 0–1. Every attack is scored in parallel, then sorted by composite descending, so the most dangerous break always leads the report.
Severity ladder
Three phases, one composite score, every break ranked. Run it on any prompt.