
WildGuard: Open One-Stop Moderation Tools for Safety Risks, …
Jun 26, 2024 · We introduce WildGuard -- an open, light-weight moderation tool for LLM safety that achieves three goals: (1) identifying malicious intent in user prompts, (2) detecting safety risks of model responses, and (3) determining model refusal rate.
WildGuard: Open One-stop Moderation Tools for Safety Risks ... - GitHub
WildGuard is a safety classification model for user-model chat exchanges. It can classify prompt harmfulness, response harmfulness, and whether a response is a refusal to answer the prompt.
allenai/wildguard · Hugging Face
The high-level risk categories covered by WildGuard are: 1) Privacy, 2) Misinformation, 3) Harmful language, and 4) Malicious uses. Each category has a set of subcategories, consisting of total 13 subcategories.
WildGuard - Wikipedia
WildGuard is a comic book series about a "made-for-TV" superhero team. The series was created, written and drawn by Todd Nauck and published by Image Comics. Nauck initially published the characters in 1993, [citation needed] in ashcan comics he published while sending out submissions. It was published again in 2003.
allenai/wildguardmix · Datasets at Hugging Face
WildGuardMix consists of two splits, WildGuardTrain and WildGuardTest. Here's the breakdown of WildGuardMix: Data Size: corresponds of 86,759 examples, of which 48,783 are prompt-only and 37,976 contain a prompt and response. Data types: synthetic data (87%), in-the-wild user-LLLM interactions (11%), and existing annotator-written data (2%).
Our comprehensive evaluations on WILDGUARDTEST and ten existing public benchmarks show that WILDGUARD outperforms the strongest existing open-source baselines (e.g., Llama-Guard2, Aegis-Guard, etc) on F1 scores across all three tasks (by up to 26.4% on refusal detection), matches GPT-4 across tasks, and surpasses GPT-4 by up to 3.9% on ...
wildguard · PyPI
Jul 10, 2024 · WildGuard is a safety classification model for user-model chat exchanges. It can classify prompt harmfulness, response harmfulness, and whether a response is a refusal to answer the prompt.
WildGuard: Open One-stop Moderation Tools for Safety Risks...
Sep 26, 2024 · We introduce WildGuard---an open, light-weight moderation tool for LLM safety that achieves three goals: (1) identifying malicious intent in user prompts, (2) detecting safety risks of model responses, and (3) determining model refusal rate.
WildGuard: Open One-Stop Moderation Tools for Safety Risks, …
Jun 26, 2024 · WildGuard isn't just another safety tool; it's a comprehensive system that identifies malicious user prompts (like those sneaky "jailbreak" attempts to bypass safety protocols), flags risky model responses, and measures how often a model refuses inappropriate requests.
WildGuard: A Light-weight, Multi-Purpose Moderation Tool for …
Jul 2, 2024 · A team of researchers from the Allen Institute for AI, the University of Washington, and Seoul National University propose WILDGUARD, a novel, lightweight moderation tool designed to address the limitations of existing methods.