# Custom code scorers and classifiers

> Write scorers and classifiers as custom code with full control over scoring logic, business rules, and pattern matching to evaluate AI outputs.

Custom code scorers and classifiers let you write evaluation logic with full control over the result. A scorer returns a numeric score, while a classifier returns a categorical label. They can use any packages you need and are best when you have specific rules, patterns, or calculations to implement.

You can define custom code scorers in three places:

* **Inline in SDK code**: Define scorers directly in your evaluation scripts for local development or application-specific logic.
* **Pushed via CLI**: Define scorers in TypeScript or Python files and push them to Braintrust for team-wide sharing and automatic evaluation of production logs.
* **Created in UI**: Build scorers in the Braintrust web interface using the built-in code editor.

Most teams prototype in the UI, then push production-ready scorers via the CLI. See [Scorers overview](/evaluate/write-scorers#where-to-define-scorers-and-classifiers) for guidance.

## Score spans

Span-level scorers evaluate individual operations or outputs. Use them for measuring single LLM responses, checking specific tool calls, or validating individual outputs. Each matching span receives an independent score.

Your scorer function receives these parameters:

* `input`: The input to your task
* `output`: The output from your task
* `expected`: The expected output (optional)
* `metadata`: Custom metadata from the test case

Return a number between 0 and 1, or an object with a `score` and optional `name` and `metadata` fields. To skip a case, return `null` (`None` in Python, `nil` in Ruby).

In Ruby, declare only the parameters you need as keyword arguments, for example `|output:, expected:|`. The runner automatically filters out the rest.
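As a rough illustration of the two return shapes, the sketch below defines one scorer that returns a bare number and one that returns an object with `score` and `metadata`. It is plain Python with no SDK calls; the order-number pattern, the `max_chars` metadata key, and the linear decay rule are hypothetical business rules chosen for illustration, not part of Braintrust.

```python
import re

# Hypothetical business rule: a reply should cite an order number like "#12345".
ORDER_ID = re.compile(r"#\d{5}\b")


def has_order_id(input, output, expected=None, metadata=None):
    """Return a bare number: 1 if the output cites an order number, else 0."""
    return 1 if ORDER_ID.search(output or "") else 0


def length_budget(input, output, expected=None, metadata=None):
    """Return an object with a score plus metadata that explains it."""
    # "max_chars" is a hypothetical key supplied in the test case metadata.
    max_chars = (metadata or {}).get("max_chars")
    if not max_chars:
        return None  # skip cases that do not define a budget

    used = len(output or "")
    # Full credit within budget, then linear decay to 0 at twice the budget.
    score = max(0.0, min(1.0, 2 - used / max_chars))
    return {
        "name": "Length budget",
        "score": score,
        "metadata": {"chars": used, "max_chars": max_chars},
    }
```

Both functions accept the four parameters listed above, so either could be passed in a `scores` list in the same way as the scorers in the SDK examples on this page.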

<Tabs className="tabs-border">
  <Tab title="SDK" icon="code">
    Use scorers inline in your evaluation code:

    <CodeGroup dropdown>
      ```typescript equality_scorer.eval.ts theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      import { Eval, type EvalScorer } from "braintrust";
      import OpenAI from "openai";

      const client = new OpenAI();

      const DATASET = [
        {
          input: "What is 2+2?",
          expected: "4",
        },
        {
          input: "What is the capital of France?",
          expected: "Paris",
        },
      ];

      async function task(input: string): Promise<string> {
        const response = await client.responses.create({
          model: "gpt-5-mini",
          input: [
            { role: "user", content: input },
          ],
        });
        return response.output_text ?? "";
      }

      const equalityScorer: EvalScorer<string, string, string> = ({ output, expected }) => {
        if (!expected) return null;
        const matches = output === expected;
        return {
          name: "Equality",
          score: matches ? 1 : 0,
          metadata: { exact_match: matches },
        };
      };

      const containsScorer: EvalScorer<string, string, string> = ({ output, expected }) => {
        if (!expected) return null;
        const contains = output.toLowerCase().includes(expected.toLowerCase());
        return {
          name: "Contains expected",
          score: contains ? 1 : 0,
        };
      };

      Eval("Custom Code Scorer Example", {
        data: DATASET,
        task,
        scores: [equalityScorer, containsScorer],
      });
      ```

      ```python eval_custom_scorer.py theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      from braintrust import Eval
      from openai import OpenAI

      client = OpenAI()

      DATASET = [
          {
              "input": "What is 2+2?",
              "expected": "4",
          },
          {
              "input": "What is the capital of France?",
              "expected": "Paris",
          },
      ]


      def task(input):
          response = client.responses.create(
              model="gpt-5-mini",
              input=[
                  {"role": "user", "content": input},
              ],
          )
          return response.output_text


      def equality_scorer(input, output, expected, metadata):
          if not expected:
              return None
          matches = output == expected
          return {
              "name": "Equality",
              "score": 1 if matches else 0,
              "metadata": {"exact_match": matches},
          }


      def contains_scorer(input, output, expected, metadata):
          if not expected:
              return None
          contains = expected.lower() in output.lower()
          return {
              "name": "Contains expected",
              "score": 1 if contains else 0,
          }


      Eval(
          "Custom Code Scorer Example",
          data=DATASET,
          task=task,
          scores=[equality_scorer, contains_scorer],
      )
      ```

      ```java theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      import com.openai.client.okhttp.OpenAIOkHttpClient;
      import com.openai.models.chat.completions.ChatCompletionCreateParams;
      import dev.braintrust.Braintrust;
      import dev.braintrust.eval.*;
      import dev.braintrust.instrumentation.openai.BraintrustOpenAI;
      import java.util.List;
      import java.util.function.Function;

      class CustomCodeScorerExample {

          public static void main(String[] args) {
              var braintrust = Braintrust.get();
              var openTelemetry = braintrust.openTelemetryCreate();
              var client = BraintrustOpenAI.wrapOpenAI(openTelemetry, OpenAIOkHttpClient.fromEnv());

              Function<String, String> task =
                      input -> {
                          var request =
                                  ChatCompletionCreateParams.builder()
                                          .model("gpt-5-mini")
                                          .addUserMessage(input)
                                          .build();
                          return client.chat().completions().create(request).choices().get(0).message()
                                  .content()
                                  .orElse("");
                      };

              // Scorer.of builds a single-score scorer from an (expected, result) function
              var equalityScorer =
                      Scorer.<String, String>of(
                              "Equality",
                              (expected, result) ->
                                      expected != null && expected.equals(result) ? 1.0 : 0.0);

              // Implement Scorer directly for custom logic; return an empty list to skip a case
              var containsScorer =
                      new Scorer<String, String>() {
                          @Override
                          public String getName() {
                              return "Contains expected";
                          }

                          @Override
                          public List<Score> score(TaskResult<String, String> taskResult) {
                              var expected = taskResult.datasetCase().expected();
                              if (expected == null) {
                                  return List.of();
                              }
                              boolean contains =
                                      taskResult.result().toLowerCase().contains(expected.toLowerCase());
                              return List.of(new Score(getName(), contains ? 1.0 : 0.0));
                          }
                      };

              var eval =
                      braintrust
                              .<String, String>evalBuilder()
                              .name("Custom Code Scorer Example")
                              .cases(
                                      DatasetCase.of("What is 2+2?", "4"),
                                      DatasetCase.of("What is the capital of France?", "Paris"))
                              .taskFunction(task)
                              .scorers(equalityScorer, containsScorer)
                              .build();

              var result = eval.run();
              System.out.println(result.createReportString());
          }
      }
      ```

      ```ruby eval_custom_scorer.rb theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      require "braintrust"
      require "openai"

      Braintrust.init

      client = OpenAI::Client.new(api_key: ENV.fetch("OPENAI_API_KEY", nil))

      DATASET = [
        {input: "What is 2+2?", expected: "4"},
        {input: "What is the capital of France?", expected: "Paris"},
      ]

      equality_scorer = Braintrust::Scorer.new("equality") do |output:, expected:|
        next nil unless expected
        matches = output == expected
        {name: "Equality", score: matches ? 1.0 : 0.0, metadata: {exact_match: matches}}
      end

      contains_scorer = Braintrust::Scorer.new("contains_expected") do |output:, expected:|
        next nil unless expected
        contains = output.downcase.include?(expected.downcase)
        {name: "Contains expected", score: contains ? 1.0 : 0.0}
      end

      Braintrust::Eval.run(
        project: "Custom Code Scorer Example",
        cases: DATASET,
        task: lambda do |input:|
          response = client.chat.completions.create(
            model: "gpt-5-mini",
            messages: [{role: "user", content: input}]
          )
          response.choices.first.message.content || ""
        end,
        scorers: [equality_scorer, contains_scorer]
      )

      OpenTelemetry.tracer_provider.shutdown
      ```

      ```csharp eval_custom_scorer.cs theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      using Braintrust.Sdk;
      using Braintrust.Sdk.Eval;
      using Braintrust.Sdk.OpenAI;
      using OpenAI;
      using OpenAI.Chat;

      sealed class ContainsScorer : IScorer<string, string>
      {
          public string Name => "Contains expected";

          public Task<IReadOnlyList<Score>> Score(TaskResult<string, string> taskResult)
          {
              if (taskResult.DatasetCase.Expected is null)
                  return Task.FromResult<IReadOnlyList<Score>>([]);

              var contains = taskResult.Result.Contains(
                  taskResult.DatasetCase.Expected, StringComparison.OrdinalIgnoreCase);
              return Task.FromResult<IReadOnlyList<Score>>(
                  [new Score(Name, contains ? 1.0 : 0.0)]);
          }
      }

      class Program
      {
          static readonly DatasetCase<string, string>[] Dataset =
          [
              DatasetCase.Of("What is 2+2?", "4"),
              DatasetCase.Of("What is the capital of France?", "Paris"),
          ];

          static async Task Main(string[] args)
          {
              var equalityScorer = new FunctionScorer<string, string>(
                  "Equality",
                  (expected, actual) => actual == expected ? 1.0 : 0.0);

              var braintrust = Braintrust.Sdk.Braintrust.Get();
              var activitySource = braintrust.GetActivitySource();
              var openAIClient = BraintrustOpenAI.WrapOpenAI(
                  activitySource, Environment.GetEnvironmentVariable("OPENAI_API_KEY")!);

              async Task<string> Task(string input)
              {
                  var response = await openAIClient.GetChatClient("gpt-5-mini")
                      .CompleteChatAsync([new UserChatMessage(input)]);
                  return response.Value.Content[0].Text;
              }

              var eval = await braintrust
                  .EvalBuilder<string, string>()
                  .Name("Custom Code Scorer Example")
                  .Cases(Dataset)
                  .TaskFunction(Task)
                  .Scorers(equalityScorer, new ContainsScorer())
                  .BuildAsync();

              var result = await eval.RunAsync();
              Console.WriteLine(result.CreateReportString());
          }
      }
      ```
    </CodeGroup>
  </Tab>

  <Tab title="CLI" icon="terminal">
    Define TypeScript or Python scorers in code and push to Braintrust:

    <CodeGroup dropdown>
      ```typescript title="code_scorer.ts" theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      import braintrust from "braintrust";
      import { z } from "zod";

      const project = braintrust.projects.create({ name: "my-project" });

      project.scorers.create({
        name: "Equality scorer",
        slug: "equality-scorer",
        description: "Check if output equals expected",
        parameters: z.object({
          output: z.string(),
          expected: z.string(),
        }),
        handler: async ({ output, expected }) => {
          const matches = output === expected;
          return {
            score: matches ? 1 : 0,
            metadata: { exact_match: matches },
          };
        },
        metadata: {
          __pass_threshold: 0.5,
        },
      });
      ```

      ```python title="code_scorer.py" theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      import braintrust
      from pydantic import BaseModel

      project = braintrust.projects.create(name="my-project")

      class EqualityParams(BaseModel):
          output: str
          expected: str

      def equality_scorer(output: str, expected: str):
          matches = output == expected
          return {
              "score": 1 if matches else 0,
              "metadata": {"exact_match": matches},
          }

      project.scorers.create(
          name="Equality scorer",
          slug="equality-scorer",
          description="Check if output equals expected",
          parameters=EqualityParams,
          handler=equality_scorer,
          metadata={"__pass_threshold": 0.5},
      )
      ```
    </CodeGroup>

    Push to Braintrust:

    <CodeGroup>
      ```bash TypeScript theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      bt functions push code_scorer.ts
      ```

      ```bash Python theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      bt functions push code_scorer.py
      ```
    </CodeGroup>

    <Note>
      **Important notes for Python scorers:**

      * Scorers must be pushed from within their directory (e.g., `bt functions push scorer.py`); pushing with relative paths (e.g., `bt functions push path/to/scorer.py`) is unsupported and will cause import errors.
      * Scorers using local imports must be defined at the project root.
      * The maximum supported Python version for scorers created with the Braintrust CLI is `3.13`.
      * Braintrust uses uv to cross-bundle dependencies to Linux. This works for binary dependencies except libraries requiring on-demand compilation.
    </Note>

    <Accordion title="TypeScript bundling">
      In TypeScript, Braintrust uses `esbuild` to bundle your code and dependencies. This works for most dependencies but does not support native (compiled) libraries like SQLite.

      If you have trouble bundling dependencies, [file an issue in the braintrust-sdk repo](https://github.com/braintrustdata/braintrust-sdk/issues).
    </Accordion>

    <Accordion title="Python external dependencies">
      Python scorers created via the CLI have these default packages:

      * `autoevals`
      * `braintrust`
      * `openai`
      * `pydantic`
      * `requests`

      For additional packages, use the `--requirements` flag.

      For scorers with external dependencies:

      ```python title="scorer-with-deps.py" theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      import braintrust
      from langdetect import detect
      from pydantic import BaseModel

      project = braintrust.projects.create(name="my-project")

      class LanguageMatchParams(BaseModel):
          output: str
          expected: str

      @project.scorers.create(
          name="Language match",
          slug="language-match",
          description="Check if output and expected are same language",
          parameters=LanguageMatchParams,
          metadata={"__pass_threshold": 0.5},
      )
      def language_match_scorer(output: str, expected: str):
          return 1.0 if detect(output) == detect(expected) else 0.0
      ```

      Create a `requirements.txt` file:

      ```
      langdetect==1.0.9
      ```

      Push with requirements:

      ```bash theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      bt functions push scorer-with-deps.py --requirements requirements.txt
      ```
    </Accordion>
  </Tab>

  <Tab title="UI" icon="mouse-pointer-2">
    1. Go to [**<Icon icon="triangle" /> Scorers**](https://www.braintrust.dev/app/~/scorers) > **+ Scorer**.
    2. Enter a scorer name and slug.
    3. Select **TypeScript** or **Python**.
    4. Write your scorer function. The code editor provides real-time linting and autocomplete.
    5. Click **Save as custom scorer**.

    <CodeGroup dropdown>
      ```typescript theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      function handler({
        input,
        output,
        expected,
        metadata,
      }: {
        input: any;
        output: any;
        expected: any;
        metadata: Record<string, any>;
      }): number | null {
        // Loose equality also skips cases where expected is undefined.
        if (expected == null) return null;
        return output === expected ? 1 : 0;
      }
      ```

      ```python theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      from typing import Any

      def handler(
          input: Any,
          output: Any,
          expected: Any,
          metadata: dict[str, Any],
      ) -> float | None:
          if expected is None:
              return None
          return 1.0 if output == expected else 0.0
      ```
    </CodeGroup>

    <Note>
      UI scorers have access to these packages:

      * `anthropic`
      * `autoevals`
      * `braintrust`
      * `json`
      * `math`
      * `openai`
      * `re`
      * `requests`
      * `typing`

      For additional packages, use the CLI.
    </Note>
  </Tab>
</Tabs>

## Score traces

Trace-level scorers evaluate entire execution traces, including all spans and conversation history. Use them to assess multi-turn conversation quality, agent behavior such as tool usage and trajectory, or overall workflow completion: any case where a scorer needs the full execution context rather than a single span. The scorer runs once per trace.

Your handler function receives the `trace` parameter, which provides methods for accessing execution data:

* **Get spans**: Returns spans matching the filter. Each span includes `input`, `output`, `expected`, `metadata`, `tags`, `scores`, `metrics`, `error` (populated when the span failed), `span_id`, `span_parents`, and `span_attributes`. Omit the filter to get all spans, or pass multiple types like `["llm", "tool"]`.
  * TypeScript: `trace.getSpans({ spanType: ["llm"] })`
  * Python: `trace.get_spans(span_type=["llm"])`
  * Java: `trace.getSpans("llm")`
  * Ruby: `trace.spans(span_type: "llm")`
  * C#: `trace.GetSpansAsync("llm")`

* **Get thread**: Returns an array of conversation messages extracted from LLM spans.
  * TypeScript: `trace.getThread()`
  * Python: `trace.get_thread()`
  * Java: `trace.getLLMConversationThread()`
  * Ruby: `trace.thread`
  * C#: `trace.GetThreadAsync()`

`input`, `output`, `expected`, and `metadata` are automatically populated from the root span and passed to your scorer function.
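The two trace methods can be exercised locally before wiring a scorer into an eval. The sketch below is a minimal illustration under stated assumptions: `FakeSpan` and `FakeTrace` are hypothetical stand-ins written for this example (they are not SDK classes), spans are assumed to expose `error` and `span_attributes` as attributes with the span type stored under `span_attributes["type"]`, and thread messages are assumed to be dictionaries with a `role` key. Adjust the field access if your SDK version returns a different shape.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional

MAX_ASSISTANT_TURNS = 3  # hypothetical budget for this example


@dataclass
class FakeSpan:
    """Hypothetical stand-in for a span; only the fields used below."""
    span_attributes: dict
    error: Optional[str] = None


class FakeTrace:
    """Hypothetical stub that mimics get_spans and get_thread for local tests."""

    def __init__(self, spans, thread):
        self._spans, self._thread = spans, thread

    async def get_spans(self, span_type=None):
        if span_type is None:  # no filter: return every span
            return list(self._spans)
        return [s for s in self._spans if s.span_attributes.get("type") in span_type]

    async def get_thread(self):
        return list(self._thread)


async def tool_reliability(input, output, expected=None, trace=None):
    """Score the share of tool spans that finished without an error."""
    if not trace:
        return None
    tool_spans = await trace.get_spans(span_type=["tool"])
    if not tool_spans:
        return None  # no tool calls, so there is nothing to judge
    failed = sum(1 for span in tool_spans if span.error)
    return {
        "name": "Tool reliability",
        "score": (len(tool_spans) - failed) / len(tool_spans),
        "metadata": {"tool_calls": len(tool_spans), "failed": failed},
    }


async def turn_budget(input, output, expected=None, trace=None):
    """Score 1 when the assistant stayed within the turn budget."""
    if not trace:
        return None
    thread = await trace.get_thread()
    turns = sum(1 for msg in thread if msg.get("role") == "assistant")
    return {
        "name": "Turn budget",
        "score": 1 if turns <= MAX_ASSISTANT_TURNS else 0,
        "metadata": {"assistant_turns": turns},
    }
```

To try a scorer without running an eval, build a `FakeTrace` from a few `FakeSpan` objects and a list of message dictionaries, then call it with `asyncio.run(tool_reliability("question", "answer", trace=fake_trace))`.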

<Note>
  Trace-level scoring requires TypeScript SDK v2.2.1+, Python SDK v0.5.6+, Java SDK v0.3.8+, Ruby SDK v0.2.1+, or C# SDK v0.2.3+.
</Note>

<Tip>
  In the TypeScript SDK (v3.16.0 or later), `LocalTrace` is the concrete `Trace` implementation passed to trace-level scorers. Import it from `braintrust` to construct a `Trace` directly for advanced or manual scoring.
</Tip>

<Tabs className="tabs-border">
  <Tab title="SDK" icon="code">
    Use scorers inline in your evaluation code:

    <CodeGroup dropdown>
      ```typescript trace_code_scorer.eval.ts theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      import { Eval, wrapOpenAI, wrapTraced, type EvalScorer } from "braintrust";
      import OpenAI from "openai";

      const client = wrapOpenAI(new OpenAI());

      const SUPPORT_DATASET = [
        { input: "My order hasn't arrived yet. Order #12345." },
        { input: "I need help resetting my password." },
      ];

      const callLLM = wrapTraced(async function callLLM(messages: Array<{ role: string; content: string }>) {
        const response = await client.chat.completions.create({
          model: "gpt-5-mini",
          messages,
        });
        return response.choices[0].message.content || "";
      });

      async function supportTask(input: string): Promise<string> {
        const messages: Array<{ role: string; content: string }> = [
          { role: "system", content: "You are a helpful customer support agent." }
        ];

        messages.push({ role: "user", content: input });
        const response1 = await callLLM(messages);
        messages.push({ role: "assistant", content: response1 });

        messages.push({ role: "user", content: "Can you provide more details?" });
        const response2 = await callLLM(messages);
        messages.push({ role: "assistant", content: response2 });

        messages.push({ role: "user", content: "Thank you for your help!" });
        const response3 = await callLLM(messages);

        return response3;
      }

      const politenessScorer: EvalScorer<string, string, unknown> = async ({ trace }) => {
        if (!trace) return 0;

        const thread = await trace.getThread();
        // Copy before reversing so the original thread array is not mutated.
        const lastAssistantMsg = [...thread].reverse().find(msg => msg.role === "assistant");
        const content = lastAssistantMsg?.content?.toLowerCase() || "";

        const politeWords = ["welcome", "glad", "happy", "pleasure", "thank"];
        const isPolite = politeWords.some(word => content.includes(word));

        return {
          name: "Politeness",
          score: isPolite ? 1 : 0,
          metadata: { checked_message_preview: content.slice(0, 80) },
        };
      };

      const efficiencyScorer: EvalScorer<string, string, unknown> = async ({ trace }) => {
        if (!trace) return 0;

        const llmSpans = await trace.getSpans({ spanType: ["llm"] });
        const isEfficient = llmSpans.length >= 3 && llmSpans.length <= 5;

        return {
          name: "Efficiency",
          score: isEfficient ? 1 : 0,
          metadata: { llm_calls: llmSpans.length },
        };
      };

      Eval("Support Quality", {
        data: SUPPORT_DATASET,
        task: supportTask,
        scores: [politenessScorer, efficiencyScorer],
      });
      ```

      ```python eval_trace_code_scorer.py theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      from braintrust import Eval, wrap_openai, traced
      from openai import AsyncOpenAI

      client = wrap_openai(AsyncOpenAI())

      SUPPORT_DATASET = [
          {"input": "My order hasn't arrived yet. Order #12345."},
          {"input": "I need help resetting my password."},
      ]


      @traced
      async def call_llm(messages):
          response = await client.chat.completions.create(
              model="gpt-5-mini",
              messages=messages,
          )
          return response.choices[0].message.content or ""


      async def support_task(input):
          messages = [
              {"role": "system", "content": "You are a helpful customer support agent."}
          ]

          messages.append({"role": "user", "content": input})
          response1 = await call_llm(messages)
          messages.append({"role": "assistant", "content": response1})

          messages.append({"role": "user", "content": "Can you provide more details?"})
          response2 = await call_llm(messages)
          messages.append({"role": "assistant", "content": response2})

          messages.append({"role": "user", "content": "Thank you for your help!"})
          response3 = await call_llm(messages)

          return response3


      async def politeness_scorer(input, output, expected, trace=None):
          if not trace:
              return 0

          thread = await trace.get_thread()
          last_assistant_msg = next(
              (msg for msg in reversed(thread) if msg.get("role") == "assistant"), None
          )
          content = (last_assistant_msg.get("content") or "").lower() if last_assistant_msg else ""

          polite_words = ["welcome", "glad", "happy", "pleasure", "thank"]
          is_polite = any(word in content for word in polite_words)

          return {
              "name": "Politeness",
              "score": 1 if is_polite else 0,
              "metadata": {"checked_message_preview": content[:80]},
          }


      async def efficiency_scorer(input, output, expected, trace=None):
          if not trace:
              return 0

          llm_spans = await trace.get_spans(span_type=["llm"])
          is_efficient = 3 <= len(llm_spans) <= 5

          return {
              "name": "Efficiency",
              "score": 1 if is_efficient else 0,
              "metadata": {"llm_calls": len(llm_spans)},
          }


      Eval(
          "Support Quality",
          data=SUPPORT_DATASET,
          task=support_task,
          scores=[politeness_scorer, efficiency_scorer],
      )
      ```

      ```java theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      import com.openai.client.OpenAIClient;
      import com.openai.client.okhttp.OpenAIOkHttpClient;
      import com.openai.models.chat.completions.ChatCompletionCreateParams;
      import dev.braintrust.Braintrust;
      import dev.braintrust.eval.*;
      import dev.braintrust.instrumentation.openai.BraintrustOpenAI;
      import dev.braintrust.trace.BrainstoreTrace;
      import java.util.List;
      import java.util.function.Function;

      class TraceScoringExample {

          public static void main(String[] args) {
              var braintrust = Braintrust.get();
              var openTelemetry = braintrust.openTelemetryCreate();
              var client = BraintrustOpenAI.wrapOpenAI(openTelemetry, OpenAIOkHttpClient.fromEnv());

              Function<String, String> supportTask =
                      input -> {
                          var messages =
                                  ChatCompletionCreateParams.builder()
                                          .model("gpt-5-mini")
                                          .addSystemMessage("You are a helpful customer support agent.");

                          messages.addUserMessage(input);
                          messages.addAssistantMessage(complete(client, messages));

                          messages.addUserMessage("Can you provide more details?");
                          messages.addAssistantMessage(complete(client, messages));

                          messages.addUserMessage("Thank you for your help!");
                          return complete(client, messages);
                      };

              // Implement TracedScorer to receive the trace; score(TaskResult, BrainstoreTrace) runs once per trace
              var politenessScorer =
                      new TracedScorer<String, String>() {
                          @Override
                          public String getName() {
                              return "Politeness";
                          }

                          @Override
                          public List<Score> score(
                                  TaskResult<String, String> taskResult, BrainstoreTrace trace) {
                              var thread = trace.getLLMConversationThread();
                              var lastAssistant =
                                      thread.stream()
                                              .filter(msg -> "assistant".equals(msg.get("role")))
                                              .reduce((first, second) -> second)
                                              .orElse(null);
                              var content =
                                      lastAssistant == null
                                              ? ""
                                              : String.valueOf(lastAssistant.getOrDefault("content", ""))
                                                      .toLowerCase();

                              var politeWords =
                                      List.of("welcome", "glad", "happy", "pleasure", "thank");
                              boolean isPolite = politeWords.stream().anyMatch(content::contains);

                              return List.of(new Score(getName(), isPolite ? 1.0 : 0.0));
                          }
                      };

              var efficiencyScorer =
                      new TracedScorer<String, String>() {
                          @Override
                          public String getName() {
                              return "Efficiency";
                          }

                          @Override
                          public List<Score> score(
                                  TaskResult<String, String> taskResult, BrainstoreTrace trace) {
                              var llmSpans = trace.getSpans("llm");
                              boolean isEfficient = llmSpans.size() >= 3 && llmSpans.size() <= 5;

                              return List.of(new Score(getName(), isEfficient ? 1.0 : 0.0));
                          }
                      };

              var eval =
                      braintrust
                              .<String, String>evalBuilder()
                              .name("Support Quality")
                              .cases(
                                      DatasetCase.of("My order hasn't arrived yet. Order #12345.", ""),
                                      DatasetCase.of("I need help resetting my password.", ""))
                              .taskFunction(supportTask)
                              .scorers(politenessScorer, efficiencyScorer)
                              .build();

              var result = eval.run();
              System.out.println(result.createReportString());
          }

          private static String complete(OpenAIClient client, ChatCompletionCreateParams.Builder builder) {
              return client.chat().completions().create(builder.build()).choices().get(0).message()
                      .content()
                      .orElse("");
          }
      }
      ```

      ```ruby eval_trace_code_scorer.rb theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      require "braintrust"
      require "openai"

      Braintrust.init

      client = OpenAI::Client.new(api_key: ENV.fetch("OPENAI_API_KEY", nil))

      SUPPORT_DATASET = [
        {input: "My order hasn't arrived yet. Order #12345."},
        {input: "I need help resetting my password."},
      ]

      def chat(client, messages)
        client.chat.completions.create(model: "gpt-5-mini", messages: messages)
          .choices.first.message.content || ""
      end

      support_task = Braintrust::Task.new("support") do |input:|
        messages = [{role: "system", content: "You are a helpful customer support agent."}]

        messages << {role: "user", content: input}
        messages << {role: "assistant", content: chat(client, messages)}

        messages << {role: "user", content: "Can you provide more details?"}
        messages << {role: "assistant", content: chat(client, messages)}

        messages << {role: "user", content: "Thank you for your help!"}
        chat(client, messages)
      end

      politeness_scorer = Braintrust::Scorer.new("politeness") do |trace:|
        next 0 unless trace

        thread = trace.thread
        last_assistant = thread.reverse.find { |msg| msg["role"] == "assistant" }
        content = (last_assistant&.dig("content") || "").downcase

        polite_words = ["welcome", "glad", "happy", "pleasure", "thank"]
        is_polite = polite_words.any? { |word| content.include?(word) }

        {score: is_polite ? 1.0 : 0.0, metadata: {checked_message_preview: content[0, 80]}}
      end

      efficiency_scorer = Braintrust::Scorer.new("efficiency") do |trace:|
        next 0 unless trace

        llm_spans = trace.spans(span_type: "llm")
        is_efficient = llm_spans.length.between?(3, 5)

        {score: is_efficient ? 1.0 : 0.0, metadata: {llm_calls: llm_spans.length}}
      end

      Braintrust::Eval.run(
        project: "Support Quality",
        cases: SUPPORT_DATASET,
        task: support_task,
        scorers: [politeness_scorer, efficiency_scorer]
      )

      OpenTelemetry.tracer_provider.shutdown
      ```

      ```csharp #skip-compile theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      using Braintrust.Sdk.Eval;
      using Braintrust.Sdk.OpenAI;
      using OpenAI.Chat;

      var braintrust = Braintrust.Sdk.Braintrust.Get();
      var activitySource = braintrust.GetActivitySource();
      var openAIClient = BraintrustOpenAI.WrapOpenAI(
          activitySource, Environment.GetEnvironmentVariable("OPENAI_API_KEY"));
      var chatClient = openAIClient.GetChatClient("gpt-5-mini");

      string SupportTask(string input)
      {
          var messages = new List<ChatMessage>
          {
              new SystemChatMessage("You are a helpful customer support agent."),
              new UserChatMessage(input),
          };

          messages.Add(new AssistantChatMessage(chatClient.CompleteChat(messages).Value.Content[0].Text));
          messages.Add(new UserChatMessage("Can you provide more details?"));
          messages.Add(new AssistantChatMessage(chatClient.CompleteChat(messages).Value.Content[0].Text));
          messages.Add(new UserChatMessage("Thank you for your help!"));

          return chatClient.CompleteChat(messages).Value.Content[0].Text;
      }

      var eval = await braintrust
          .EvalBuilder<string, string>()
          .Name("Support Quality")
          .Cases(
              DatasetCase.Of("My order hasn't arrived yet. Order #12345.", ""),
              DatasetCase.Of("I need help resetting my password.", ""))
          .TaskFunction(SupportTask)
          .Scorers(new PolitenessScorer(), new EfficiencyScorer())
          .BuildAsync();

      await eval.RunAsync();

      // Scores the last assistant message in the conversation thread reconstructed from the trace
      class PolitenessScorer : ITracedScorer<string, string>
      {
          public string Name => "Politeness";

          public Task<IReadOnlyList<Score>> Score(TaskResult<string, string> taskResult) =>
              Task.FromResult<IReadOnlyList<Score>>([new Score(Name, 0.0)]);

          public async Task<IReadOnlyList<Score>> Score(
              TaskResult<string, string> taskResult, EvalTrace trace)
          {
              var thread = await trace.GetThreadAsync();
              var lastAssistant = thread.LastOrDefault(m =>
                  m.TryGetValue("role", out var role) && role as string == "assistant");
              var content = (lastAssistant?.GetValueOrDefault("content") as string ?? "").ToLowerInvariant();

              string[] politeWords = ["welcome", "glad", "happy", "pleasure", "thank"];
              var isPolite = politeWords.Any(content.Contains);

              return [new Score(Name, isPolite ? 1.0 : 0.0,
                  new Dictionary<string, object> { ["checked_message_preview"] = content[..Math.Min(80, content.Length)] })];
          }
      }

      // Scores efficiency based on the number of LLM spans in the trace
      class EfficiencyScorer : ITracedScorer<string, string>
      {
          public string Name => "Efficiency";

          public Task<IReadOnlyList<Score>> Score(TaskResult<string, string> taskResult) =>
              Task.FromResult<IReadOnlyList<Score>>([new Score(Name, 0.0)]);

          public async Task<IReadOnlyList<Score>> Score(
              TaskResult<string, string> taskResult, EvalTrace trace)
          {
              var llmSpans = await trace.GetSpansAsync("llm");
              var isEfficient = llmSpans.Count is >= 3 and <= 5;

              return [new Score(Name, isEfficient ? 1.0 : 0.0,
                  new Dictionary<string, object> { ["llm_calls"] = llmSpans.Count })];
          }
      }
      ```
    </CodeGroup>
  </Tab>

  <Tab title="CLI" icon="terminal">
    Define TypeScript or Python scorers in code and push to Braintrust:

    <CodeGroup dropdown>
      ```typescript title="trace_code_scorer.ts" theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      import braintrust from "braintrust";
      import { z } from "zod";

      const project = braintrust.projects.create({ name: "my-project" });

      project.scorers.create({
        name: "Politeness scorer",
        slug: "politeness-scorer",
        description: "Check if assistant responds politely",
        parameters: z.object({
          trace: z.any(),
        }),
        handler: async ({ trace }) => {
          if (!trace) return 0;

          const thread = await trace.getThread();
          const lastAssistantMsg = thread.reverse().find(msg => msg.role === "assistant");
          const content = lastAssistantMsg?.content?.toLowerCase() || "";

          const politeWords = ["welcome", "glad", "happy", "pleasure", "thank"];
          const isPolite = politeWords.some(word => content.includes(word));

          return {
            score: isPolite ? 1 : 0,
            metadata: { checked_message_preview: content.slice(0, 80) },
          };
        },
      });

      project.scorers.create({
        name: "Efficiency scorer",
        slug: "efficiency-scorer",
        description: "Check if conversation was efficient",
        parameters: z.object({
          trace: z.any(),
        }),
        handler: async ({ trace }) => {
          if (!trace) return 0;

          const llmSpans = await trace.getSpans({ spanType: ["llm"] });
          const isEfficient = llmSpans.length >= 3 && llmSpans.length <= 5;

          return {
            score: isEfficient ? 1 : 0,
            metadata: { llm_calls: llmSpans.length },
          };
        },
      });
      ```

      ```python title="trace_code_scorer.py" theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
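      from typing import Any

      import braintrust
      from pydantic import BaseModel

      project = braintrust.projects.create(name="my-project")

      class TraceParams(BaseModel):
          # The trace is an object with async methods such as get_thread(), not a plain dict
          trace: Any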

      async def politeness_scorer(trace):
          if not trace:
              return 0

          thread = await trace.get_thread()
          last_assistant_msg = next(
              (msg for msg in reversed(thread) if msg.get("role") == "assistant"), None
          )
          content = (last_assistant_msg.get("content") or "").lower() if last_assistant_msg else ""

          polite_words = ["welcome", "glad", "happy", "pleasure", "thank"]
          is_polite = any(word in content for word in polite_words)

          return {
              "score": 1 if is_polite else 0,
              "metadata": {"checked_message_preview": content[:80]},
          }

      async def efficiency_scorer(trace):
          if not trace:
              return 0

          llm_spans = await trace.get_spans(span_type=["llm"])
          is_efficient = 3 <= len(llm_spans) <= 5

          return {
              "score": 1 if is_efficient else 0,
              "metadata": {"llm_calls": len(llm_spans)},
          }

      project.scorers.create(
          name="Politeness scorer",
          slug="politeness-scorer",
          description="Check if assistant responds politely",
          parameters=TraceParams,
          handler=politeness_scorer,
      )

      project.scorers.create(
          name="Efficiency scorer",
          slug="efficiency-scorer",
          description="Check if conversation was efficient",
          parameters=TraceParams,
          handler=efficiency_scorer,
      )
      ```
    </CodeGroup>

    Push to Braintrust:

    <CodeGroup>
      ```bash TypeScript theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      bt functions push trace_code_scorer.ts
      ```

      ```bash Python theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      bt functions push trace_code_scorer.py
      ```
    </CodeGroup>
  </Tab>

  <Tab title="UI" icon="mouse-pointer-2">
    1. Go to [**<Icon icon="triangle" /> Scorers**](https://www.braintrust.dev/app/~/scorers) > **+ Scorer**.
    2. Enter a scorer name and slug.
    3. Select **TypeScript** or **Python**.
    4. Write your scorer function with the `trace` parameter. The code editor provides real-time linting and autocomplete.
    5. Click **Save as custom scorer**.

    <CodeGroup dropdown>
      ```typescript theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      import type { Trace } from 'braintrust';

      async function handler({
        input,
        output,
        expected,
        metadata,
        trace,
      }: {
        input: any;
        output: any;
        expected: any;
        metadata: Record<string, any>;
        trace: Trace;
      }): Promise<
        | number
        | { score: number; name?: string; metadata?: Record<string, unknown> }
        | null
      > {
        // Skip scoring when no expected value is provided (null or undefined)
        if (expected === null || expected === undefined) return null;

        const allSpans = await trace.getSpans();
        const llmSpans = await trace.getSpans({ spanType: ["llm"] });

        return {
          name: "span count scorer",
          score: output === expected ? 1 : 0,
          metadata: {
            totalSpanCount: allSpans.length,
            llmSpanCount: llmSpans.length,
          },
        };
      }
      ```

      ```python theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      from typing import Any

      async def handler(
        input: Any,
        output: Any,
        expected: Any,
        metadata: dict[str, Any],
        trace: Any
      ) -> float | dict[str, Any] | None:
        if expected is None:
          return None

        all_spans = await trace.get_spans()
        llm_spans = await trace.get_spans(span_type=['llm'])

        return {
          'name': 'span count scorer',
          'score': 1.0 if output == expected else 0.0,
          'metadata': {
            'total_span_count': len(all_spans),
            'llm_span_count': len(llm_spans),
          },
        }
      ```
    </CodeGroup>

    <Note>
      UI scorers have access to these packages:

      * `anthropic`
      * `autoevals`
      * `braintrust`
      * `json`
      * `math`
      * `openai`
      * `re`
      * `requests`
      * `typing`

      For additional packages, use the CLI.
    </Note>
  </Tab>
</Tabs>
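
The thread-based check in the politeness scorer can be tried out on a plain list of messages before it runs against a real trace. The following is a minimal sketch under that assumption: `score_politeness` is a hypothetical name, and the message shape (dictionaries with `role` and `content`) is taken from the examples above.

```python
POLITE_WORDS = ["welcome", "glad", "happy", "pleasure", "thank"]


def score_politeness(thread):
    # Find the most recent assistant message, if any.
    last_assistant = next(
        (msg for msg in reversed(thread) if msg.get("role") == "assistant"), None
    )
    # Treat a missing message or missing content as an empty string.
    content = (last_assistant.get("content") or "").lower() if last_assistant else ""
    is_polite = any(word in content for word in POLITE_WORDS)
    return {
        "score": 1 if is_polite else 0,
        "metadata": {"checked_message_preview": content[:80]},
    }


thread = [
    {"role": "system", "content": "You are a helpful customer support agent."},
    {"role": "user", "content": "Thank you for your help!"},
    {"role": "assistant", "content": "You're welcome! Happy to help."},
]

print(score_politeness(thread))
print(score_politeness(thread[:2]))  # no assistant message in the thread
```

Only the last assistant message is checked, so polite wording in a user message does not raise the score.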

### Trace scorer recipes

Use trace scorers for checks that depend on the agent's trajectory, such as tool usage, tool failures, or step budgets. Add any of these scorers to the `scores` array in an `Eval`, or adapt the handler body for a CLI or UI scorer.

<CodeGroup dropdown>
  ```typescript trace_scorer_recipes.eval.ts theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
  import { type EvalScorer } from "braintrust";

  function spanName(span: { span_attributes?: { name?: string } }): string {
    return span.span_attributes?.name ?? "unknown";
  }

  function stringField(value: unknown, fieldName: string): string | null {
    if (typeof value !== "object" || value === null) return null;

    const field = Object.getOwnPropertyDescriptor(value, fieldName)?.value;
    return typeof field === "string" ? field : null;
  }

  // Check if a specific tool was called at least once.
  const requiredToolCalled: EvalScorer<string, string, unknown> = async ({
    trace,
  }) => {
    if (!trace) return null;

    const toolSpans = await trace.getSpans({ spanType: ["tool"] });
    const editViewCalls = toolSpans.filter(
      (span) => span.span_attributes?.name === "edit_view",
    );

    return {
      name: "edit_view called",
      score: editViewCalls.length > 0 ? 1 : 0,
      metadata: { edit_view_calls: editViewCalls.length },
    };
  };

  // Check if a tool was called with an argument matching the expected value.
  const requiredToolCalledWithArg: EvalScorer<
    string,
    string,
    unknown
  > = async ({ expected, trace }) => {
    if (!trace) return null;

    const documentId = stringField(expected, "document_id");
    if (!documentId) return null;

    const toolSpans = await trace.getSpans({ spanType: ["tool"] });
    const searchCalls = toolSpans.filter(
      (span) => span.span_attributes?.name === "search_docs",
    );
    const matchedCall = searchCalls.some(
      (span) => stringField(span.input, "document_id") === documentId,
    );

    return {
      name: "searched expected document",
      score: matchedCall ? 1 : 0,
      metadata: {
        expected_document_id: documentId,
        search_docs_calls: searchCalls.length,
      },
    };
  };

  // Check that no tool from a denylist was called.
  const noDisallowedTools: EvalScorer<string, string, unknown> = async ({
    trace,
  }) => {
    if (!trace) return null;

    const disallowedToolNames = new Set(["send_email", "delete_record"]);
    const toolSpans = await trace.getSpans({ spanType: ["tool"] });
    const disallowedCalls = toolSpans.filter((span) => {
      const name = span.span_attributes?.name;
      return typeof name === "string" && disallowedToolNames.has(name);
    });

    return {
      name: "no disallowed tools",
      score: disallowedCalls.length === 0 ? 1 : 0,
      metadata: {
        disallowed_tools: disallowedCalls.map(spanName),
      },
    };
  };

  // Check that every tool call completed without error.
  const allToolsSucceeded: EvalScorer<string, string, unknown> = async ({
    trace,
  }) => {
    if (!trace) return null;

    const toolSpans = await trace.getSpans({ spanType: ["tool"] });
    const failedToolCalls = toolSpans.filter((span) => Boolean(span.error));

    return {
      name: "tool calls succeeded",
      score: failedToolCalls.length === 0 ? 1 : 0,
      metadata: {
        failed_tools: failedToolCalls.map(spanName),
        tool_calls: toolSpans.length,
      },
    };
  };

  // Check if the agent stayed within a step budget.
  const trajectoryBudget: EvalScorer<string, string, unknown> = async ({
    trace,
  }) => {
    if (!trace) return null;

    const maxSteps = 8;
    const agentSpans = await trace.getSpans({ spanType: ["llm", "tool"] });

    return {
      name: "trajectory budget",
      score: agentSpans.length <= maxSteps ? 1 : 0,
      metadata: {
        agent_steps: agentSpans.length,
        max_steps: maxSteps,
      },
    };
  };
  ```

  ```python eval_trace_scorer_recipes.py theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
  def span_name(span):
      return (span.span_attributes or {}).get("name", "unknown")


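  def string_field(value, field_name):
      # Return the field only when it is a string, matching the TypeScript helper.
      if not isinstance(value, dict):
          return None
      field = value.get(field_name)
      return field if isinstance(field, str) else None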


  # Check if a specific tool was called at least once.
  async def required_tool_called(input, output, expected, trace=None):
      if not trace:
          return None

      tool_spans = await trace.get_spans(span_type=["tool"])
      edit_view_calls = [
          span
          for span in tool_spans
          if (span.span_attributes or {}).get("name") == "edit_view"
      ]

      return {
          "name": "edit_view called",
          "score": 1 if edit_view_calls else 0,
          "metadata": {"edit_view_calls": len(edit_view_calls)},
      }


  # Check if a tool was called with an argument matching the expected value.
  async def required_tool_called_with_arg(input, output, expected, trace=None):
      if not trace:
          return None

      document_id = string_field(expected, "document_id")
      if not isinstance(document_id, str):
          return None

      tool_spans = await trace.get_spans(span_type=["tool"])
      search_calls = [
          span
          for span in tool_spans
          if (span.span_attributes or {}).get("name") == "search_docs"
      ]
      matched_call = any(
          string_field(span.input, "document_id") == document_id
          for span in search_calls
      )

      return {
          "name": "searched expected document",
          "score": 1 if matched_call else 0,
          "metadata": {
              "expected_document_id": document_id,
              "search_docs_calls": len(search_calls),
          },
      }


  # Check that no tool from a denylist was called.
  async def no_disallowed_tools(input, output, expected, trace=None):
      if not trace:
          return None

      disallowed_tool_names = {"send_email", "delete_record"}
      tool_spans = await trace.get_spans(span_type=["tool"])
      disallowed_calls = [
          span
          for span in tool_spans
          if (span.span_attributes or {}).get("name") in disallowed_tool_names
      ]

      return {
          "name": "no disallowed tools",
          "score": 1 if not disallowed_calls else 0,
          "metadata": {
              "disallowed_tools": [span_name(span) for span in disallowed_calls],
          },
      }


  # Check that every tool call completed without error.
  async def all_tools_succeeded(input, output, expected, trace=None):
      if not trace:
          return None

      tool_spans = await trace.get_spans(span_type=["tool"])
      failed_tool_calls = [span for span in tool_spans if span.error]

      return {
          "name": "tool calls succeeded",
          "score": 1 if not failed_tool_calls else 0,
          "metadata": {
              "failed_tools": [span_name(span) for span in failed_tool_calls],
              "tool_calls": len(tool_spans),
          },
      }


  # Check if the agent stayed within a step budget.
  async def trajectory_budget(input, output, expected, trace=None):
      if not trace:
          return None

      max_steps = 8
      agent_spans = await trace.get_spans(span_type=["llm", "tool"])

      return {
          "name": "trajectory budget",
          "score": 1 if len(agent_spans) <= max_steps else 0,
          "metadata": {
              "agent_steps": len(agent_spans),
              "max_steps": max_steps,
          },
      }
  ```
</CodeGroup>
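
Because these recipes only depend on `trace.get_spans`, you can exercise them locally before wiring them into an `Eval`. The following is a minimal sketch, not part of the Braintrust SDK: `FakeSpan` and `FakeTrace` are hypothetical stand-ins that mimic only the attributes the recipes read (`span_attributes`, `input`, `error`), and storing the span type under `span_attributes["type"]` is an assumption made for this stub.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class FakeSpan:
    # Hypothetical stand-in for a span; mirrors only the fields the recipes read.
    span_attributes: dict = field(default_factory=dict)
    input: Any = None
    error: Optional[str] = None


class FakeTrace:
    # Hypothetical stand-in for the trace object passed to a scorer.
    def __init__(self, spans):
        self._spans = spans

    async def get_spans(self, span_type=None):
        if span_type is None:
            return list(self._spans)
        # Assumption for this stub: the span type lives in span_attributes["type"].
        return [s for s in self._spans if s.span_attributes.get("type") in span_type]


# Same logic as the all_tools_succeeded recipe, repeated so the sketch runs on its own.
async def all_tools_succeeded(input, output, expected, trace=None):
    if not trace:
        return None

    tool_spans = await trace.get_spans(span_type=["tool"])
    failed = [s for s in tool_spans if s.error]
    return {
        "name": "tool calls succeeded",
        "score": 1 if not failed else 0,
        "metadata": {
            "failed_tools": [s.span_attributes.get("name", "unknown") for s in failed],
            "tool_calls": len(tool_spans),
        },
    }


trace = FakeTrace([
    FakeSpan({"type": "llm", "name": "chat"}),
    FakeSpan({"type": "tool", "name": "search_docs"}, input={"document_id": "doc-1"}),
    FakeSpan({"type": "tool", "name": "edit_view"}, error="timeout"),
])

result = asyncio.run(all_tools_succeeded("question", "answer", None, trace=trace))
print(result)
```

A stub like this checks the scoring logic only. It does not confirm how real spans are shaped, so run the scorer against an actual trace before relying on it.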

## Set pass thresholds

Define minimum acceptable scores to automatically mark results as passing or failing. When configured, scores that meet or exceed the threshold are marked as **passing** (green highlighting with checkmark), while scores below are marked as **failing** (red highlighting).

<Note>
  Pass thresholds apply only to scorers that output numeric scores. Classifiers, which output labels, don't use them.
</Note>
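
The pass/fail rule described above reduces to a single comparison. As an illustrative sketch only (the helper `is_passing` is hypothetical and not a Braintrust API), a score passes when it is greater than or equal to the threshold:

```python
def is_passing(score: float, pass_threshold: float) -> bool:
    # Hypothetical helper: mirrors the documented rule that scores meeting or
    # exceeding the threshold pass, and scores below it fail.
    return score >= pass_threshold


threshold = 0.8
for score in (0.5, 0.8, 1.0):
    status = "passing" if is_passing(score, threshold) else "failing"
    print(f"score={score} -> {status}")
```

With a scorer that only returns 0 or 1, any threshold above 0 gives the same outcome: 1 passes and 0 fails. Thresholds matter most for scorers that return intermediate values.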

<Tabs className="tabs-border">
  <Tab title="SDK" icon="code">
    Add `__pass_threshold` to the scorer's metadata (value between 0 and 1):

    <CodeGroup dropdown>
      ```typescript theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
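      import braintrust from "braintrust";

      const project = braintrust.projects.create({ name: "my-project" });

      project.scorers.create({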
        name: "Quality checker",
        slug: "quality-checker",
        handler: async ({ output, expected }) => {
          return output === expected ? 1 : 0;
        },
        metadata: {
          __pass_threshold: 0.8,
        },
      });
      ```

      ```python theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      import braintrust

      project = braintrust.projects.create(name="my-project")

      @project.scorers.create(
          name="Quality checker",
          slug="quality-checker",
          metadata={"__pass_threshold": 0.8},
      )
      def quality_checker(output, expected):
          return 1 if output == expected else 0
      ```

      ```java #skip-compile theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      // Pass thresholds are not supported in the Java SDK.
      // Use the UI or push a TypeScript/Python scorer via the CLI to set a pass threshold.
      ```

      ```ruby theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      # Pass thresholds are not supported in the Ruby SDK.
      # Use the UI or push a TypeScript/Python scorer via the CLI to set a pass threshold.
      ```

      ```csharp #skip-compile theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      // Pass thresholds are not supported in the C# SDK.
      // Use the UI or push a TypeScript/Python scorer via the CLI to set a pass threshold.
      ```
    </CodeGroup>
  </Tab>

  <Tab title="UI" icon="mouse-pointer-2">
    When creating or editing a scorer in the UI:

    1. Look for the **Pass threshold** slider in the scorer configuration.
    2. Drag the slider to set your minimum acceptable score (0–1).
    3. Click **Save as custom scorer**.

    You can set a threshold on any scorer that returns a numeric score.
  </Tab>
</Tabs>

## Return multiple scores

A single scorer can return an array of score objects to emit multiple named metrics from one call. This is useful when several quality dimensions can be computed together or share computation. Each item appears as its own score column in the Braintrust UI.

Each item requires `name` and `score`. `metadata` is optional.

<CodeGroup dropdown>
  ```typescript theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
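  import { Eval } from "braintrust";

  // DATASET and task are defined elsewhere in your eval script.
  Eval("Summary Quality", {
    data: DATASET,
    task,
    scores: [
      ({ output, expected }) => {
        // filter(Boolean) drops the empty strings that split() yields for empty or padded output
        const words = (output ?? "").toLowerCase().split(/\s+/).filter(Boolean);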
        const keyTerms: string[] = expected.key_terms;
        const covered = keyTerms.filter((t) => words.includes(t)).length;
        return [
          {
            name: "coverage",
            score: keyTerms.length ? covered / keyTerms.length : 1,
            metadata: { missing: keyTerms.filter((t) => !words.includes(t)) },
          },
          {
            name: "conciseness",
            score: words.length <= expected.max_words ? 1 : 0,
            metadata: { word_count: words.length, limit: expected.max_words },
          },
        ];
      },
    ],
  });
  ```

  ```python theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
  from braintrust import Eval, Score

  def summary_quality(output, expected, **kwargs):
      words = (output or "").lower().split()
      key_terms = expected["key_terms"]
      covered = sum(1 for t in key_terms if t in words)
      return [
          Score(
              name="coverage",
              score=covered / len(key_terms) if key_terms else 1.0,
              metadata={"missing": [t for t in key_terms if t not in words]},
          ),
          Score(
              name="conciseness",
              score=1.0 if len(words) <= expected["max_words"] else 0.0,
              metadata={"word_count": len(words), "limit": expected["max_words"]},
          ),
      ]

  Eval("Summary Quality", data=DATASET, task=task, scores=[summary_quality])
  ```

  ```java #skip-compile theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
  import dev.braintrust.eval.*;
  import java.util.List;
  import java.util.Map;

  // A scorer returns List<Score>, so a single scorer can emit several named metrics.
  // The Java Score record holds a name and value; pass per-case criteria through case metadata.
  var summaryQuality =
          new Scorer<String, String>() {
              @Override
              public String getName() {
                  return "Summary quality";
              }

              @Override
              @SuppressWarnings("unchecked")
              public List<Score> score(TaskResult<String, String> taskResult) {
                  var words = List.of(taskResult.result().toLowerCase().split("\\s+"));
                  Map<String, Object> criteria = taskResult.datasetCase().metadata();
                  var keyTerms = (List<String>) criteria.getOrDefault("key_terms", List.of());
                  int maxWords = (Integer) criteria.getOrDefault("max_words", Integer.MAX_VALUE);

                  long covered = keyTerms.stream().filter(words::contains).count();

                  return List.of(
                          new Score(
                                  "coverage",
                                  keyTerms.isEmpty() ? 1.0 : (double) covered / keyTerms.size()),
                          new Score("conciseness", words.size() <= maxWords ? 1.0 : 0.0));
              }
          };
  ```

  ```ruby multi_score.rb theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
  summary_quality = Braintrust::Scorer.new("summary_quality") do |output:, expected:|
    words = output.to_s.downcase.split
    key_terms = expected[:key_terms]
    covered = key_terms.count { |t| words.include?(t) }

    [
      {
        name: "coverage",
        score: key_terms.empty? ? 1.0 : covered.to_f / key_terms.size,
        metadata: {missing: key_terms - words}
      },
      {
        name: "conciseness",
        score: words.size <= expected[:max_words] ? 1.0 : 0.0,
        metadata: {word_count: words.size, limit: expected[:max_words]}
      }
    ]
  end

  class StyleChecker
    include Braintrust::Scorer

    def call(output:, **)
      text = output.to_s
      # Compare whole words so that "i" does not match inside words like "this"
      words = text.downcase.scan(/[a-z]+/)
      [
        {name: "ends_with_period", score: text.strip.end_with?(".") ? 1.0 : 0.0},
        {name: "no_first_person", score: (%w[i me my we us] & words).empty? ? 1.0 : 0.0}
      ]
    end
  end
  ```
</CodeGroup>
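
Since each item in the returned array needs a `name` and a `score`, it can help to check a multi-score scorer's output in a unit test before running an eval. The sketch below is one way to do that. `validate_scores` is a hypothetical helper, not part of the SDK, and the scorer returns plain dictionaries in place of `Score` objects so that the example runs without dependencies. The 0-to-1 range check follows the score range stated earlier on this page.

```python
def summary_quality(output, expected):
    # Same logic as the summary_quality example, using plain dicts instead of Score.
    words = (output or "").lower().split()
    key_terms = expected["key_terms"]
    covered = sum(1 for t in key_terms if t in words)
    return [
        {
            "name": "coverage",
            "score": covered / len(key_terms) if key_terms else 1.0,
            "metadata": {"missing": [t for t in key_terms if t not in words]},
        },
        {
            "name": "conciseness",
            "score": 1.0 if len(words) <= expected["max_words"] else 0.0,
            "metadata": {"word_count": len(words), "limit": expected["max_words"]},
        },
    ]


def validate_scores(items):
    # Hypothetical helper: returns a list of problems found in a multi-score result.
    problems = []
    for i, item in enumerate(items):
        if not item.get("name"):
            problems.append(f"item {i}: missing name")
        score = item.get("score")
        if not isinstance(score, (int, float)) or not 0 <= score <= 1:
            problems.append(f"item {i}: score must be a number between 0 and 1")
    return problems


scores = summary_quality(
    "Refunds take five business days",
    {"key_terms": ["refunds", "days", "email"], "max_words": 10},
)
print(scores)
print(validate_scores(scores))
```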

## Apply classification labels

A [classifier](/evaluate/write-scorers#classifiers) returns a categorical label instead of a numeric score. Define a custom code classifier inline in your eval code as a function that evaluates a result and returns one or more classifications.

Each classification your function returns sets a `name` (the group it belongs to, such as `intent`), an `id` (the value you filter by, such as `password_reset`), an optional `label` for display (such as `Password reset`), and optional `metadata`. Unlike an LLM-as-a-judge classifier, custom code sets these fields independently and can return more than one classification at a time.

<Note>
  To create a classifier in the UI, build an [LLM-as-a-judge classifier](/evaluate/llm-as-a-judge#apply-classification-labels).
</Note>

<CodeGroup dropdown>
  ```typescript theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
  import { Eval } from "braintrust";

  const DATASET = [
    {
      input: "Hello! Can you help me reset my password?",
      expected: "password_reset",
    },
  ];

  async function task(input: string): Promise<string> {
    // Stand-in for your LLM call
    return `Thanks for reaching out. ${input}`;
  }

  function intentClassifier({ output }: { output: string }) {
    if (output.toLowerCase().includes("password")) {
      return {
        name: "intent",
        id: "password_reset",
        label: "Password reset",
      };
    }

    return {
      name: "intent",
      id: "other",
      label: "Other",
    };
  }

  Eval("Support intent", {
    data: DATASET,
    task,
    classifiers: [intentClassifier],
  });
  ```

  ```python theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
  from braintrust import Classification, Eval

  DATASET = [
      {
          "input": "Hello! Can you help me reset my password?",
          "expected": "password_reset",
      },
  ]


  def task(input):
      # Stand-in for your LLM call
      return f"Thanks for reaching out. {input}"


  def intent_classifier(input, output, expected):
      if "password" in output.lower():
          return Classification(
              name="intent",
              id="password_reset",
              label="Password reset",
          )

      return Classification(name="intent", id="other", label="Other")


  Eval(
      "Support intent",
      data=DATASET,
      task=task,
      classifiers=[intent_classifier],
  )
  ```

  ```go classifier.go theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
  package main

  import (
  	"context"
  	"strings"

  	"github.com/braintrustdata/braintrust-sdk-go"
  	"github.com/braintrustdata/braintrust-sdk-go/eval"
  	"go.opentelemetry.io/otel"
  	"go.opentelemetry.io/otel/sdk/trace"
  )

  func main() {
  	tp := trace.NewTracerProvider()
  	defer tp.Shutdown(context.Background())
  	otel.SetTracerProvider(tp)

  	bt, err := braintrust.New(tp, braintrust.WithProject("Support intent"))
  	if err != nil {
  		panic(err)
  	}

  	intentClassifier := eval.NewClassifier("intent",
  		func(_ context.Context, r eval.TaskResult[string, string]) (eval.Classifications, error) {
  			if strings.Contains(strings.ToLower(r.Output), "password") {
  				return eval.Classifications{{ID: "password_reset", Label: "Password reset"}}, nil
  			}
  			return eval.Classifications{{ID: "other", Label: "Other"}}, nil
  		})

  	evaluator := braintrust.NewEvaluator[string, string](bt)
  	_, err = evaluator.Run(context.Background(), eval.Opts[string, string]{
  		Experiment: "Support intent",
  		Dataset: eval.NewDataset([]eval.Case[string, string]{
  			{Input: "Hello! Can you help me reset my password?", Expected: "password_reset"},
  		}),
  		Task: eval.T(func(_ context.Context, input string) (string, error) {
  			return "Thanks for reaching out. " + input, nil // Stand-in for your LLM call
  		}),
  		Classifiers: []eval.Classifier[string, string]{intentClassifier},
  	})
  	if err != nil {
  		panic(err)
  	}
  }
  ```

  ```java #skip-compile theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
  import dev.braintrust.Braintrust;
  import dev.braintrust.eval.Classification;
  import dev.braintrust.eval.Classifier;
  import dev.braintrust.eval.DatasetCase;

  class Main {
    public static void main(String... args) {
      var braintrust = Braintrust.get();
      braintrust.openTelemetryCreate();

      Classifier<String, String> intentClassifier =
          Classifier.single(
              "intent",
              tr -> {
                if (tr.result().toLowerCase().contains("password")) {
                  return Classification.of("intent", "password_reset", "Password reset");
                }
                return Classification.of("intent", "other", "Other");
              });

      var eval =
          braintrust
              .<String, String>evalBuilder()
              .name("Support intent")
              .cases(DatasetCase.of("Hello! Can you help me reset my password?", "password_reset"))
              .taskFunction(input -> "Thanks for reaching out. " + input) // Stand-in for your LLM call
              .classifiers(intentClassifier)
              .build();

      eval.run();
    }
  }
  ```

  ```csharp #skip-compile theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
  using System;
  using System.Collections.Generic;
  using System.Threading.Tasks;
  using Braintrust.Sdk;
  using Braintrust.Sdk.Eval;

  class Program
  {
      static async Task Main(string[] args)
      {
          var braintrust = Braintrust.Sdk.Braintrust.Get();

          var intentClassifier = new FunctionClassifier<string, string>(
              "intent",
              taskResult =>
              {
                  if (taskResult.Result.Contains("password", StringComparison.OrdinalIgnoreCase))
                  {
                      return new Classification(Id: "password_reset", Name: "intent", Label: "Password reset");
                  }
                  return new Classification(Id: "other", Name: "intent", Label: "Other");
              });

          var eval = await braintrust
              .EvalBuilder<string, string>()
              .Name("Support intent")
              .Cases(
                  new DatasetCase<string, string>(
                      "Hello! Can you help me reset my password?", "password_reset"))
              .TaskFunction(input => "Thanks for reaching out. " + input) // Stand-in for your LLM call
              .Classifiers(intentClassifier)
              .BuildAsync();

          await eval.RunAsync();
      }
  }
  ```
</CodeGroup>

<Note>
  For the C# and Java examples, use the `BRAINTRUST_DEFAULT_PROJECT_NAME` environment variable to set a project name. Otherwise, the default project is `default-dotnet-project` (C#) or `default-java-project` (Java).
</Note>

In a single evaluation, you can use scorers, classifiers, or both. Classifier failures do not stop the evaluation or affect other scorers and classifiers. Braintrust records classifier errors in the result metadata under `classifier_errors`.
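
To make the failure isolation described above concrete, here is a minimal, SDK-free sketch of the general pattern. It is not Braintrust's runner: `run_classifiers` is a hypothetical helper, plain dicts stand in for classification objects, and the shape of `classifier_errors` (classifier name mapped to an error message) is an assumption made for illustration.

```python
from typing import Any, Callable

# A classifier here is any function from the task output to one classification dict.
Classifier = Callable[[str], dict[str, str]]


def run_classifiers(output: str, classifiers: dict[str, Classifier]) -> dict[str, Any]:
    """Run each classifier independently and collect failures instead of raising."""
    classifications: list[dict[str, str]] = []
    errors: dict[str, str] = {}
    for name, classify in classifiers.items():
        try:
            classifications.append(classify(output))
        except Exception as exc:
            # One failing classifier must not stop the others.
            errors[name] = f"{type(exc).__name__}: {exc}"
    result: dict[str, Any] = {"classifications": classifications}
    if errors:
        # Assumed shape: classifier name -> error message.
        result["metadata"] = {"classifier_errors": errors}
    return result


def intent(output: str) -> dict[str, str]:
    if "password" in output.lower():
        return {"name": "intent", "id": "password_reset", "label": "Password reset"}
    return {"name": "intent", "id": "other", "label": "Other"}


def broken(output: str) -> dict[str, str]:
    # Simulates a classifier that fails at runtime.
    raise ValueError("lookup table missing")


result = run_classifiers(
    "Can you help me reset my password?",
    {"intent": intent, "broken": broken},
)
```

In this sketch, `result["classifications"]` still contains the `intent` label, and the failure from `broken` appears only under `result["metadata"]["classifier_errors"]`.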

A classifier can also assign multiple labels at once:

<CodeGroup dropdown>
  ```typescript theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
  // Return an array to assign several labels to the same output.
  function intentClassifier() {
    return [
      { name: "intent", id: "billing", label: "Billing" },
      { name: "intent", id: "login", label: "Login" },
    ];
  }
  ```

  ```python theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
  # Return a list to assign several labels to the same output.
  def intent_classifier(input, output, expected):
      return [
          Classification(name="intent", id="billing", label="Billing"),
          Classification(name="intent", id="login", label="Login"),
      ]
  ```

  ```go #skip-compile theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
  intentClassifier := eval.NewClassifier("intent",
  	func(_ context.Context, r eval.TaskResult[string, string]) (eval.Classifications, error) {
  		return eval.Classifications{
  			{ID: "billing", Label: "Billing"},
  			{ID: "login", Label: "Login"},
  		}, nil
  	})
  ```

  ```java #skip-compile theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
  Classifier<String, String> intentClassifier =
      Classifier.of(
          "intent",
          tr ->
              java.util.List.of(
                  Classification.of("intent", "billing", "Billing"),
                  Classification.of("intent", "login", "Login")));
  ```

  ```csharp #skip-compile theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
  var intentClassifier = new FunctionClassifier<string, string>(
      "intent",
      taskResult => (IReadOnlyList<Classification>)new[]
      {
          new Classification(Id: "billing", Name: "intent", Label: "Billing"),
          new Classification(Id: "login", Name: "intent", Label: "Login"),
      });
  ```
</CodeGroup>
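
The snippets above return a fixed list. In practice, a multi-label classifier usually derives its labels from the output. The following is a rough sketch of that idea, assuming a hypothetical keyword table (`INTENT_KEYWORDS`) and using plain dicts in place of the SDK's `Classification` objects so that it runs without any dependencies. Check your SDK version for the exact return type it accepts.

```python
# Hypothetical keyword table: label id -> (display label, trigger keywords).
INTENT_KEYWORDS = {
    "billing": ("Billing", ("invoice", "charge", "refund")),
    "login": ("Login", ("password", "sign in", "locked out")),
}


def intent_classifier(input, output, expected=None):
    """Return one classification dict per intent whose keywords appear in the output."""
    text = output.lower()
    return [
        {"name": "intent", "id": intent_id, "label": label}
        for intent_id, (label, keywords) in INTENT_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    ]


labels = intent_classifier(
    input="I was charged twice and now I can't sign in.",
    output="Sorry about the duplicate charge. I can also help you sign in again.",
)
```

Here `labels` holds both the `billing` and `login` classifications, because the output mentions a charge and signing in.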

<Note>
  Classifiers require TypeScript SDK v3.9.0+, Python SDK v0.16.0+, Go SDK v0.8.0+, Java SDK v0.3.12+, or C# SDK v0.2.8+.
</Note>

## Next steps

* [Autoevals](/evaluate/autoevals) for pre-built scorers without writing code
* [LLM-as-a-judge](/evaluate/llm-as-a-judge) for natural language evaluation criteria
* [Run evaluations](/evaluate/run-evaluations) using your scorers
* [Score production logs](/evaluate/score-online) with online scoring rules
