Prompt & Eval · Open Source · ✦ Free Tier

DeepEval

LLM evaluation framework — 14+ metrics

5,500 stars · Health 80 · Active · App Infrastructure

About

Open-source evaluation framework with 14+ metrics including faithfulness, relevancy, and hallucination detection. Integrates with CI/CD.

Choose DeepEval when…

  • You want a pytest-style framework for unit-test-like LLM evals (see the sketch after this list)
  • You need RAG-specific metrics like faithfulness and relevancy
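
A minimal sketch of what this looks like in practice, assuming deepeval is installed and OPENAI_API_KEY is set for the default judge model; the test data and thresholds here are illustrative:

```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

def test_rag_answer():
    # One eval case: the model's answer plus the retrieved context it was grounded in.
    test_case = LLMTestCase(
        input="What is DeepEval?",
        actual_output="DeepEval is an open-source framework for evaluating LLM outputs.",
        retrieval_context=["DeepEval is an open-source LLM evaluation framework with 14+ metrics."],
    )
    # Fails the test if any metric scores below its threshold.
    assert_test(test_case, [
        AnswerRelevancyMetric(threshold=0.7),
        FaithfulnessMetric(threshold=0.7),
    ])
```

Run it with `deepeval test run tests/` (a thin pytest wrapper); that command is also the usual CI/CD hook.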

Builder Slot

How do you know it's working? (Optional for most stacks)

Tests, evals, and experiment tracking to measure and improve your AI output quality

  • Dev Tools: Not applicable
  • App Infra: Recommended
  • Hybrid: Optional


Stack Genome Detection

AIchitect's Genome scanner detects DeepEval in your project via these signals:

  • pip packages: deepeval
  • env vars: CONFIDENT_API_KEY
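
A hypothetical sketch of this kind of check; AIchitect's actual scanner logic isn't public, so the function name and dependency-file list below are illustrative:

```python
import os
from pathlib import Path

def detects_deepeval(project_root: str) -> bool:
    """Illustrative check mirroring the two signals above."""
    root = Path(project_root)
    # Signal 1: the deepeval pip package named in common dependency files.
    for dep_file in ("requirements.txt", "pyproject.toml", "Pipfile"):
        path = root / dep_file
        if path.is_file() and "deepeval" in path.read_text():
            return True
    # Signal 2: the Confident AI API key present in the environment.
    return "CONFIDENT_API_KEY" in os.environ
```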

Integrates with (2)

Langfuse · LLM Infrastructure

DeepEval sends evaluation results to Langfuse as trace scores via a built-in integration.

Quality metrics such as faithfulness, hallucination rate, and G-Eval scores become visible alongside the raw traces that produced them.
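
A hedged sketch of the pattern, assuming the v2-style Langfuse Python SDK's score call and a placeholder trace ID; check the current integration docs for the exact API:

```python
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY from the environment

test_case = LLMTestCase(
    input="What does DeepEval measure?",
    actual_output="DeepEval scores LLM outputs on metrics such as faithfulness.",
    retrieval_context=["DeepEval provides 14+ LLM evaluation metrics."],
)

metric = FaithfulnessMetric(threshold=0.7)
metric.measure(test_case)

# Attach the metric result as a score on the trace that produced the output.
langfuse.score(
    trace_id="trace-id-of-the-generation",  # illustrative placeholder
    name="faithfulness",
    value=metric.score,
)
```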

OpenAI API · LLM Infrastructure

DeepEval uses OpenAI's API as the judge model to score generated outputs on metrics like faithfulness, relevance, and hallucination rate.

LLM-as-judge quality metrics powered by GPT-4o yield structured, reproducible evaluation scores for any AI output.
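
For example, pinning the judge model on a single metric; DeepEval metrics accept a model name, and the case data here is illustrative:

```python
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="Summarize the context.",
    actual_output="DeepEval ships more than fourteen evaluation metrics.",
    context=["DeepEval ships 14+ evaluation metrics."],
)

# GPT-4o acts as the judge; requires OPENAI_API_KEY in the environment.
metric = HallucinationMetric(threshold=0.5, model="gpt-4o")
metric.measure(test_case)
print(metric.score, metric.reason)  # numeric score plus the judge's rationale
```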



Pricing

✦ Free tier available

In 2 stacks

Ruled out by 1 stack:

  • Evaluation & Quality Stack: Promptfoo covers the CI regression testing role; DeepEval shines in Python-only stacks where it's the sole eval tool rather than one of several.

Badge

Add to your GitHub README

DeepEval on AIchitect:

```markdown
[![DeepEval](https://aichitect.dev/badge/tool/deepeval)](https://aichitect.dev/tool/deepeval)
```

Explore the full AI landscape

See how DeepEval fits into the bigger picture — browse all 207 tools and their relationships.

Explore graph →