# Set The Bar; See Who Clears It

Most AI vendors can show you a compelling demo, but demos are controlled presentations, optimized to impress you rather than to match your needs. Swept runs a methodologically sound review of how any AI system will perform with your data and your users, under real conditions.

Swept's Evaluation offering helps you choose the AI agent(s) that fit your requirements by running role-aware tests on your data, defining acceptance thresholds, and producing a scorecard you can use to make the decision.

[Contact us](/contact)

## What We Do

1. **Connect Data and Pick a Template.** We start with a sample of your own data to use in the evaluation. Templates include help desk, copilot, agentic workflows, and question answering, or you can bring your own task definitions.
2. **Define Acceptance Thresholds.** We use your thresholds to set targets by role and task: accuracy, hallucination rate, latency, and cost. Swept keeps runs consistent and reproducible so comparisons stay meaningful.
3. **Run Suites and Review the Scorecard.** Results appear side by side across models, prompts, and configurations, with wins, regressions, and tradeoffs all in one place.
4. **Decide and Save the Baseline.** Once you approve a model and prompt set, we save the evaluation as a baseline for production supervision. If you want our Supervision offering, we can connect it to your baseline in one click.
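
To make the idea of acceptance thresholds concrete, here is a minimal sketch of how a pass or fail could be computed per metric. It is an illustration only, not Swept's implementation: the `Threshold` class, the metric names, and every number below are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """One acceptance threshold (hypothetical structure for illustration)."""
    metric: str
    limit: float
    higher_is_better: bool

    def passes(self, value: float) -> bool:
        # Accuracy-style metrics must meet or exceed the limit;
        # rate, latency, and cost metrics must stay at or below it.
        return value >= self.limit if self.higher_is_better else value <= self.limit


def score(results: dict[str, float], thresholds: list[Threshold]) -> dict[str, bool]:
    """Return pass/fail per metric. A metric missing from results counts as a fail."""
    return {
        t.metric: t.metric in results and t.passes(results[t.metric])
        for t in thresholds
    }


def overall_pass(scorecard: dict[str, bool]) -> bool:
    """A candidate passes only if every metric clears its threshold."""
    return all(scorecard.values())


# Illustrative thresholds; real targets come from your own requirements.
THRESHOLDS = [
    Threshold("accuracy", 0.90, higher_is_better=True),
    Threshold("hallucination_rate", 0.02, higher_is_better=False),
    Threshold("latency_p95_s", 2.0, higher_is_better=False),
    Threshold("cost_per_task_usd", 0.05, higher_is_better=False),
]

# Made-up results for two hypothetical candidates.
candidate_a = {"accuracy": 0.93, "hallucination_rate": 0.01,
               "latency_p95_s": 1.4, "cost_per_task_usd": 0.03}
candidate_b = {"accuracy": 0.95, "hallucination_rate": 0.04,
               "latency_p95_s": 1.1, "cost_per_task_usd": 0.02}

for name, results in [("A", candidate_a), ("B", candidate_b)]:
    card = score(results, THRESHOLDS)
    print(name, "PASS" if overall_pass(card) else "FAIL", card)
```

In this sketch, candidate B is more accurate, faster, and cheaper, yet it fails because its hallucination rate exceeds the limit. That is the kind of tradeoff a threshold-based scorecard is meant to surface.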

## What You Get

- **The Swept Scorecard.** A structured record covering quality, safety, privacy, bias, performance, and cost, with a clear pass or fail against your thresholds.
- **Fair Model Comparisons.** We run the same suite across OpenAI, Anthropic, Vertex AI, Azure OpenAI, AWS Bedrock, Mistral, and open-source models. Same data, same tasks, same thresholds. The comparison is only meaningful if the conditions are identical.
- **Pre-Rollout Issue Detection.** We test known failure patterns before launch, including adversarial prompts, sensitive attribute flips, privacy leakage, and exploit resistance. Finding a problem here is cheaper than finding it in production.
- **Data Preparation, Built In.** We can prepare data to any degree you need: import files or connect a warehouse, tag examples by intent and difficulty, and use stratified sampling for coverage that holds up to scrutiny. We provide a version history for audits.
- **Collaboration and Handoff.** As part of our Implementation Support, we assign owners to each suite. We export comments on runs, requested changes, and retests to your notebook, repo, or ticketing system. When you're ready, we hand off the approved baseline.
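
For readers curious what stratified sampling by intent and difficulty looks like in practice, the following is a rough sketch using only the Python standard library. The function name, the tag values, and the synthetic examples are assumptions made for illustration; they do not describe Swept's internal tooling.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(examples, key, per_stratum, seed=0):
    """Draw up to `per_stratum` examples from each stratum defined by `key`.

    A fixed seed keeps the draw reproducible across runs.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for ex in examples:
        strata[key(ex)].append(ex)

    sample = []
    for label in sorted(strata):  # sorted so the order is deterministic
        group = strata[label]
        k = min(per_stratum, len(group))  # small strata contribute all they have
        sample.extend(rng.sample(group, k))
    return sample


# Synthetic, deliberately imbalanced dataset (hypothetical tags and counts).
counts = {
    ("billing", "easy"): 20,
    ("billing", "hard"): 5,
    ("reset_password", "easy"): 10,
    ("reset_password", "hard"): 2,
}
examples = [
    {"id": f"{intent}-{difficulty}-{i}", "intent": intent, "difficulty": difficulty}
    for (intent, difficulty), n in counts.items()
    for i in range(n)
]


def by_tags(ex):
    return (ex["intent"], ex["difficulty"])


sample = stratified_sample(examples, key=by_tags, per_stratum=3)
print(Counter(by_tags(ex) for ex in sample))
```

Sampling a fixed number per stratum keeps rare but important cases, such as hard password resets here, from being crowded out by the most common ones.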

## FAQ

- **How much data do I need?** Small samples work to start. The system scales to larger datasets for statistical significance and can generate hundreds or thousands of test cases.
- **Can I bring custom metrics?** Yes. You choose tasks, graders, and metrics, including domain-specific ones, and set your own targets for accuracy, safety, latency, and cost.
- **What counts as a pass?** Meeting or beating your acceptance thresholds across quality, safety, privacy, and performance. Passes are clearly marked on the scorecard, with owners and next steps attached.
- **How do I compare multiple agents fairly?** Run the same suite on the same data with the same thresholds. Swept keeps runs consistent so the deltas you see are real.