# Custom code scorers and classifiers

> Write scorers and classifiers as custom code with full control over scoring logic, business rules, and pattern matching to evaluate AI outputs.

Custom code scorers and classifiers let you write evaluation logic with full control over the result. A scorer returns a numeric score, while a classifier returns a categorical label. They can use any packages you need and are best when you have specific rules, patterns, or calculations to implement.

You can define custom code scorers in three places:

* **Inline in SDK code**: Define scorers directly in your evaluation scripts for local development or application-specific logic.
* **Pushed via CLI**: Define scorers in TypeScript or Python files and push them to Braintrust for team-wide sharing and automatic evaluation of production logs.
* **Created in UI**: Build scorers in the Braintrust web interface using the built-in code editor.

Most teams prototype in the UI, then push production-ready scorers via the CLI. See [Scorers overview](/evaluate/write-scorers#where-to-define-scorers-and-classifiers) for guidance.

## Score spans

Span-level scorers evaluate individual operations or outputs. Use them for measuring single LLM responses, checking specific tool calls, or validating individual outputs. Each matching span receives an independent score.

Your scorer function receives these parameters:

* `input`: The input to your task
* `output`: The output from your task
* `expected`: The expected output (optional)
* `metadata`: Custom metadata from the test case

Return a number between 0 and 1, or an object with `score` and optional metadata.

In Ruby, declare only the parameters you need as keyword arguments (for example, `|output:, expected:|`). The runner automatically filters out the rest.
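The two return shapes described above can be sketched as plain Python functions. This is a minimal illustration rather than SDK code: `has_citation_scorer` and `word_overlap_scorer` are hypothetical names, and the scoring rules are invented for the example.

```python
import re


def has_citation_scorer(input, output, expected=None, metadata=None):
    """Return a bare number: 1.0 if the output contains a citation like [1]."""
    return 1.0 if re.search(r"\[\d+\]", output) else 0.0


def word_overlap_scorer(input, output, expected=None, metadata=None):
    """Return an object with a score between 0 and 1 plus explanatory metadata."""
    if not expected:
        return None  # skip cases that have no expected value
    expected_words = set(expected.lower().split())
    if not expected_words:
        return None  # expected was only whitespace
    output_words = set(output.lower().split())
    overlap = expected_words & output_words
    return {
        "name": "Word overlap",  # hypothetical score name
        "score": len(overlap) / len(expected_words),
        "metadata": {"matched_words": sorted(overlap)},
    }
```

Both functions take the four parameters listed above and need no imports from the SDK, so they can be unit-tested on their own before being used in an evaluation.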
Use scorers inline in your evaluation code:

```typescript equality_scorer.eval.ts
import { Eval, type EvalScorer } from "braintrust";
import OpenAI from "openai";

const client = new OpenAI();

const DATASET = [
  {
    input: "What is 2+2?",
    expected: "4",
  },
  {
    input: "What is the capital of France?",
    expected: "Paris",
  },
];

async function task(input: string): Promise<string> {
  const response = await client.responses.create({
    model: "gpt-5-mini",
    input: [
      { role: "user", content: input },
    ],
  });
  return response.output_text ?? "";
}

const equalityScorer: EvalScorer<string, string, string> = ({ output, expected }) => {
  if (!expected) return null;
  const matches = output === expected;
  return {
    name: "Equality",
    score: matches ? 1 : 0,
    metadata: { exact_match: matches },
  };
};

const containsScorer: EvalScorer<string, string, string> = ({ output, expected }) => {
  if (!expected) return null;
  const contains = output.toLowerCase().includes(expected.toLowerCase());
  return {
    name: "Contains expected",
    score: contains ? 1 : 0,
  };
};

Eval("Custom Code Scorer Example", {
  data: DATASET,
  task,
  scores: [equalityScorer, containsScorer],
});
```

```python eval_custom_scorer.py
from braintrust import Eval
from openai import OpenAI

client = OpenAI()

DATASET = [
    {
        "input": "What is 2+2?",
        "expected": "4",
    },
    {
        "input": "What is the capital of France?",
        "expected": "Paris",
    },
]


def task(input):
    response = client.responses.create(
        model="gpt-5-mini",
        input=[
            {"role": "user", "content": input},
        ],
    )
    return response.output_text


def equality_scorer(input, output, expected, metadata):
    if not expected:
        return None
    matches = output == expected
    return {
        "name": "Equality",
        "score": 1 if matches else 0,
        "metadata": {"exact_match": matches},
    }


def contains_scorer(input, output, expected, metadata):
    if not expected:
        return None
    contains = expected.lower() in output.lower()
    return {
        "name": "Contains expected",
        "score": 1 if contains else 0,
    }


Eval(
    "Custom Code Scorer Example",
    data=DATASET,
    task=task,
    scores=[equality_scorer, contains_scorer],
)
```

```java
import com.openai.client.okhttp.OpenAIOkHttpClient;
import com.openai.models.chat.completions.ChatCompletionCreateParams;
import dev.braintrust.Braintrust;
import dev.braintrust.eval.*;
import dev.braintrust.instrumentation.openai.BraintrustOpenAI;
import java.util.List;
import java.util.function.Function;

class CustomCodeScorerExample {
    public static void main(String[] args) {
        var braintrust = Braintrust.get();
        var openTelemetry = braintrust.openTelemetryCreate();
        var client = BraintrustOpenAI.wrapOpenAI(openTelemetry, OpenAIOkHttpClient.fromEnv());

        Function<String, String> task = input -> {
            var request = ChatCompletionCreateParams.builder()
                .model("gpt-5-mini")
                .addUserMessage(input)
                .build();
            return client.chat().completions().create(request).choices().get(0).message()
                .content()
                .orElse("");
        };

        // Scorer.of builds a single-score scorer from an (expected, result) function
        var equalityScorer = Scorer.of(
            "Equality",
            (expected, result) -> expected != null && expected.equals(result) ? 1.0 : 0.0);

        // Implement Scorer directly for custom logic; return an empty list to skip a case
        var containsScorer = new Scorer<String, String>() {
            @Override
            public String getName() {
                return "Contains expected";
            }

            @Override
            public List<Score> score(TaskResult<String, String> taskResult) {
                var expected = taskResult.datasetCase().expected();
                if (expected == null) {
                    return List.of();
                }
                boolean contains = taskResult.result().toLowerCase().contains(expected.toLowerCase());
                return List.of(new Score(getName(), contains ? 1.0 : 0.0));
            }
        };

        var eval = braintrust
            .<String, String>evalBuilder()
            .name("Custom Code Scorer Example")
            .cases(
                DatasetCase.of("What is 2+2?", "4"),
                DatasetCase.of("What is the capital of France?", "Paris"))
            .taskFunction(task)
            .scorers(equalityScorer, containsScorer)
            .build();

        var result = eval.run();
        System.out.println(result.createReportString());
    }
}
```

```ruby eval_custom_scorer.rb
require "braintrust"
require "openai"

Braintrust.init

client = OpenAI::Client.new(api_key: ENV.fetch("OPENAI_API_KEY", nil))

DATASET = [
  {input: "What is 2+2?", expected: "4"},
  {input: "What is the capital of France?", expected: "Paris"},
]

equality_scorer = Braintrust::Scorer.new("equality") do |output:, expected:|
  next nil unless expected
  matches = output == expected
  {name: "Equality", score: matches ? 1.0 : 0.0, metadata: {exact_match: matches}}
end

contains_scorer = Braintrust::Scorer.new("contains_expected") do |output:, expected:|
  next nil unless expected
  contains = output.downcase.include?(expected.downcase)
  {name: "Contains expected", score: contains ? 1.0 : 0.0}
end

Braintrust::Eval.run(
  project: "Custom Code Scorer Example",
  cases: DATASET,
  task: lambda do |input:|
    response = client.chat.completions.create(
      model: "gpt-5-mini",
      messages: [{role: "user", content: input}]
    )
    response.choices.first.message.content || ""
  end,
  scorers: [equality_scorer, contains_scorer]
)

OpenTelemetry.tracer_provider.shutdown
```

```csharp eval_custom_scorer.cs
using Braintrust.Sdk;
using Braintrust.Sdk.Eval;
using Braintrust.Sdk.OpenAI;
using OpenAI;
using OpenAI.Chat;

sealed class ContainsScorer : IScorer
{
    public string Name => "Contains expected";

    public Task> Score(TaskResult taskResult)
    {
        if (taskResult.DatasetCase.Expected is null)
            return Task.FromResult>([]);

        var contains = taskResult.Result.Contains(
            taskResult.DatasetCase.Expected, StringComparison.OrdinalIgnoreCase);
        return Task.FromResult>(
            [new Score(Name, contains ? 1.0 : 0.0)]);
    }
}

class Program
{
    static readonly DatasetCase[] Dataset =
    [
        DatasetCase.Of("What is 2+2?", "4"),
        DatasetCase.Of("What is the capital of France?", "Paris"),
    ];

    static async Task Main(string[] args)
    {
        var equalityScorer = new FunctionScorer(
            "Equality",
            (expected, actual) => actual == expected ? 1.0 : 0.0);

        var braintrust = Braintrust.Sdk.Braintrust.Get();
        var activitySource = braintrust.GetActivitySource();
        var openAIClient = BraintrustOpenAI.WrapOpenAI(
            activitySource, Environment.GetEnvironmentVariable("OPENAI_API_KEY")!);

        async Task<string> Task(string input)
        {
            var response = await openAIClient.GetChatClient("gpt-5-mini")
                .CompleteChatAsync([new UserChatMessage(input)]);
            return response.Value.Content[0].Text;
        }

        var eval = await braintrust
            .EvalBuilder()
            .Name("Custom Code Scorer Example")
            .Cases(Dataset)
            .TaskFunction(Task)
            .Scorers(equalityScorer, new ContainsScorer())
            .BuildAsync();

        var result = await eval.RunAsync();
        Console.WriteLine(result.CreateReportString());
    }
}
```

Define TypeScript or Python scorers in code and push to Braintrust:

```typescript title="code_scorer.ts"
import braintrust from "braintrust";
import { z } from "zod";

const project = braintrust.projects.create({ name: "my-project" });

project.scorers.create({
  name: "Equality scorer",
  slug: "equality-scorer",
  description: "Check if output equals expected",
  parameters: z.object({
    output: z.string(),
    expected: z.string(),
  }),
  handler: async ({ output, expected }) => {
    const matches = output === expected;
    return {
      score: matches ? 1 : 0,
      metadata: { exact_match: matches },
    };
  },
  metadata: {
    __pass_threshold: 0.5,
  },
});
```

```python title="code_scorer.py"
import braintrust
from pydantic import BaseModel

project = braintrust.projects.create(name="Tracing quickstart")


class EqualityParams(BaseModel):
    output: str
    expected: str


def equality_scorer(output: str, expected: str):
    matches = output == expected
    return {
        "score": 1 if matches else 0,
        "metadata": {"exact_match": matches},
    }


project.scorers.create(
    name="Equality scorer",
    slug="equality-scorer",
    description="Check if output equals expected",
    parameters=EqualityParams,
    handler=equality_scorer,
    metadata={"__pass_threshold": 0.5},
)
```

Push to Braintrust:

```bash TypeScript
bt functions push code_scorer.ts
```

```bash Python
bt functions push code_scorer.py
```

**Important notes for Python scorers:**

* Scorers must be pushed from within their directory (e.g., `bt functions push scorer.py`); pushing with relative paths (e.g., `bt functions push path/to/scorer.py`) is unsupported and will cause import errors.
* Scorers using local imports must be defined at the project root.
* The maximum supported Python version for scorers created with the Braintrust CLI is `3.13`.
* Braintrust uses uv to cross-bundle dependencies to Linux. This works for binary dependencies except libraries requiring on-demand compilation.

In TypeScript, Braintrust uses `esbuild` to bundle your code and dependencies. This works for most dependencies but does not support native (compiled) libraries like SQLite.

If you have trouble bundling dependencies, [file an issue in the braintrust-sdk repo](https://github.com/braintrustdata/braintrust-sdk/issues).
Python scorers created via the CLI have these default packages:

* `autoevals`
* `braintrust`
* `openai`
* `pydantic`
* `requests`

For additional packages, use the `--requirements` flag.

For scorers with external dependencies:

```python title="scorer-with-deps.py"
import braintrust
from langdetect import detect
from pydantic import BaseModel

project = braintrust.projects.create(name="my-project")


class LanguageMatchParams(BaseModel):
    output: str
    expected: str


@project.scorers.create(
    name="Language match",
    slug="language-match",
    description="Check if output and expected are same language",
    parameters=LanguageMatchParams,
    metadata={"__pass_threshold": 0.5},
)
def language_match_scorer(output: str, expected: str):
    return 1.0 if detect(output) == detect(expected) else 0.0
```

Create a requirements file:

```
langdetect==1.0.9
```

Push with requirements:

```bash
bt functions push scorer-with-deps.py --requirements requirements.txt
```

1. Go to [**Scorers**](https://www.braintrust.dev/app/~/scorers) > **+ Scorer**.
2. Enter a scorer name and slug.
3. Select **TypeScript** or **Python**.
4. Write your scorer function. The code editor provides real-time linting and autocomplete.
5. Click **Save as custom scorer**.

```typescript
function handler({
  input,
  output,
  expected,
  metadata,
}: {
  input: any;
  output: any;
  expected: any;
  metadata: Record<string, any>;
}): number | null {
  if (expected === null) return null;
  return output === expected ? 1 : 0;
}
```

```python
from typing import Any


def handler(
    input: Any,
    output: Any,
    expected: Any,
    metadata: dict[str, Any]
) -> float | None:
    if expected is None:
        return None
    return 1.0 if output == expected else 0.0
```

UI scorers have access to these packages:

* `anthropic`
* `autoevals`
* `braintrust`
* `json`
* `math`
* `openai`
* `re`
* `requests`
* `typing`

For additional packages, use the CLI.

## Score traces

Trace-level scorers evaluate entire execution traces including all spans and conversation history. Use these for assessing multi-turn conversation quality, agent behavior such as tool usage and trajectory, or overall workflow completion. Trace-level scorers are the right choice whenever a scorer needs the full execution context rather than a single span. The scorer runs once per trace.

Your handler function receives the `trace` parameter, which provides methods for accessing execution data:

* **Get spans**: Returns spans matching the filter. Each span includes `input`, `output`, `expected`, `metadata`, `tags`, `scores`, `metrics`, `error` (populated when the span failed), `span_id`, `span_parents`, and `span_attributes`. Omit the filter to get all spans, or pass multiple types like `["llm", "tool"]`.
  * TypeScript: `trace.getSpans({ spanType: ["llm"] })`
  * Python: `trace.get_spans(span_type=["llm"])`
  * Java: `trace.getSpans("llm")`
  * Ruby: `trace.spans(span_type: "llm")`
  * C#: `trace.GetSpansAsync("llm")`
* **Get thread**: Returns an array of conversation messages extracted from LLM spans.
  * TypeScript: `trace.getThread()`
  * Python: `trace.get_thread()`
  * Java: `trace.getLLMConversationThread()`
  * Ruby: `trace.thread`
  * C#: `trace.GetThreadAsync()`

`input`, `output`, `expected`, and `metadata` are automatically populated from the root span and passed to your scorer function.
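Because a trace-level scorer only calls methods on the `trace` object it receives, its logic can be exercised offline with a stand-in. The sketch below assumes a hypothetical `FakeTrace` class that mimics the Python methods listed above (`get_spans` and `get_thread`). It represents spans as plain dicts and assumes the span type is stored under `span_attributes["type"]`; the real trace object and span shape come from the SDK and may differ. `tool_ratio_scorer` is likewise an invented example rule.

```python
import asyncio


class FakeTrace:
    """Hypothetical stand-in for the SDK trace object, for local testing only."""

    def __init__(self, spans, thread):
        self._spans = spans
        self._thread = thread

    async def get_spans(self, span_type=None):
        # Mirror the documented filter: omitting it returns every span.
        if span_type is None:
            return list(self._spans)
        # Assumption: the span type lives under span_attributes["type"].
        return [s for s in self._spans if s["span_attributes"].get("type") in span_type]

    async def get_thread(self):
        return list(self._thread)


async def tool_ratio_scorer(input, output, expected=None, metadata=None, trace=None):
    """Score 1 when the trace has no more tool calls than LLM calls."""
    if not trace:
        return None
    llm_spans = await trace.get_spans(span_type=["llm"])
    tool_spans = await trace.get_spans(span_type=["tool"])
    within_budget = len(tool_spans) <= len(llm_spans)
    return {
        "name": "Tool ratio",
        "score": 1 if within_budget else 0,
        "metadata": {"llm_calls": len(llm_spans), "tool_calls": len(tool_spans)},
    }


fake = FakeTrace(
    spans=[
        {"span_attributes": {"type": "llm", "name": "chat"}},
        {"span_attributes": {"type": "tool", "name": "search_docs"}},
        {"span_attributes": {"type": "llm", "name": "chat"}},
    ],
    thread=[
        {"role": "user", "content": "hi"},
        {"role": "assistant", "content": "hello"},
    ],
)
result = asyncio.run(tool_ratio_scorer("hi", "hello", trace=fake))
```

Keeping the scorer dependent only on `get_spans` and `get_thread` means the same function body can later be passed to an evaluation unchanged, with the SDK supplying the real trace.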
Trace-level scoring requires TypeScript SDK v2.2.1+, Python SDK v0.5.6+, Java SDK v0.3.8+, Ruby SDK v0.2.1+, or C# SDK v0.2.3+. In the TypeScript SDK (v3.16.0 or later), `LocalTrace` is the concrete `Trace` implementation passed to trace-level scorers. Import it from `braintrust` to construct a `Trace` directly for advanced or manual scoring. Use scorers inline in your evaluation code: ```typescript trace_code_scorer.eval.ts theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}} import { Eval, wrapOpenAI, wrapTraced, type EvalScorer } from "braintrust"; import OpenAI from "openai"; const client = wrapOpenAI(new OpenAI()); const SUPPORT_DATASET = [ { input: "My order hasn't arrived yet. Order #12345." }, { input: "I need help resetting my password." }, ]; const callLLM = wrapTraced(async function callLLM(messages: Array<{ role: string; content: string }>) { const response = await client.chat.completions.create({ model: "gpt-5-mini", messages, }); return response.choices[0].message.content || ""; }); async function supportTask(input: string): Promise { const messages: Array<{ role: string; content: string }> = [ { role: "system", content: "You are a helpful customer support agent." } ]; messages.push({ role: "user", content: input }); const response1 = await callLLM(messages); messages.push({ role: "assistant", content: response1 }); messages.push({ role: "user", content: "Can you provide more details?" }); const response2 = await callLLM(messages); messages.push({ role: "assistant", content: response2 }); messages.push({ role: "user", content: "Thank you for your help!" 
}); const response3 = await callLLM(messages); return response3; } const politenessScorer: EvalScorer = async ({ trace }) => { if (!trace) return 0; const thread = await trace.getThread(); const lastAssistantMsg = thread.reverse().find(msg => msg.role === "assistant"); const content = lastAssistantMsg?.content?.toLowerCase() || ""; const politeWords = ["welcome", "glad", "happy", "pleasure", "thank"]; const isPolite = politeWords.some(word => content.includes(word)); return { name: "Politeness", score: isPolite ? 1 : 0, metadata: { checked_message_preview: content.slice(0, 80) }, }; }; const efficiencyScorer: EvalScorer = async ({ trace }) => { if (!trace) return 0; const llmSpans = await trace.getSpans({ spanType: ["llm"] }); const isEfficient = llmSpans.length >= 3 && llmSpans.length <= 5; return { name: "Efficiency", score: isEfficient ? 1 : 0, metadata: { llm_calls: llmSpans.length }, }; }; Eval("Support Quality", { data: SUPPORT_DATASET, task: supportTask, scores: [politenessScorer, efficiencyScorer], }); ``` ```python eval_trace_code_scorer.py theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}} from braintrust import Eval, wrap_openai, traced from openai import AsyncOpenAI client = wrap_openai(AsyncOpenAI()) SUPPORT_DATASET = [ {"input": "My order hasn't arrived yet. 
Order #12345."}, {"input": "I need help resetting my password."}, ] @traced async def call_llm(messages): response = await client.chat.completions.create( model="gpt-5-mini", messages=messages, ) return response.choices[0].message.content or "" async def support_task(input): messages = [ {"role": "system", "content": "You are a helpful customer support agent."} ] messages.append({"role": "user", "content": input}) response1 = await call_llm(messages) messages.append({"role": "assistant", "content": response1}) messages.append({"role": "user", "content": "Can you provide more details?"}) response2 = await call_llm(messages) messages.append({"role": "assistant", "content": response2}) messages.append({"role": "user", "content": "Thank you for your help!"}) response3 = await call_llm(messages) return response3 async def politeness_scorer(input, output, expected, trace=None): if not trace: return 0 thread = await trace.get_thread() last_assistant_msg = next( (msg for msg in reversed(thread) if msg.get("role") == "assistant"), None ) content = (last_assistant_msg.get("content") or "").lower() if last_assistant_msg else "" polite_words = ["welcome", "glad", "happy", "pleasure", "thank"] is_polite = any(word in content for word in polite_words) return { "name": "Politeness", "score": 1 if is_polite else 0, "metadata": {"checked_message_preview": content[:80]}, } async def efficiency_scorer(input, output, expected, trace=None): if not trace: return 0 llm_spans = await trace.get_spans(span_type=["llm"]) is_efficient = 3 <= len(llm_spans) <= 5 return { "name": "Efficiency", "score": 1 if is_efficient else 0, "metadata": {"llm_calls": len(llm_spans)}, } Eval( "Support Quality", data=SUPPORT_DATASET, task=support_task, scores=[politeness_scorer, efficiency_scorer], ) ``` ```java theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}} import com.openai.client.OpenAIClient; import com.openai.client.okhttp.OpenAIOkHttpClient; import 
com.openai.models.chat.completions.ChatCompletionCreateParams; import dev.braintrust.Braintrust; import dev.braintrust.eval.*; import dev.braintrust.instrumentation.openai.BraintrustOpenAI; import dev.braintrust.trace.BrainstoreTrace; import java.util.List; import java.util.function.Function; class TraceScoringExample { public static void main(String[] args) { var braintrust = Braintrust.get(); var openTelemetry = braintrust.openTelemetryCreate(); var client = BraintrustOpenAI.wrapOpenAI(openTelemetry, OpenAIOkHttpClient.fromEnv()); Function supportTask = input -> { var messages = ChatCompletionCreateParams.builder() .model("gpt-5-mini") .addSystemMessage("You are a helpful customer support agent."); messages.addUserMessage(input); messages.addAssistantMessage(complete(client, messages)); messages.addUserMessage("Can you provide more details?"); messages.addAssistantMessage(complete(client, messages)); messages.addUserMessage("Thank you for your help!"); return complete(client, messages); }; // Implement TracedScorer to receive the trace; score(TaskResult, BrainstoreTrace) runs once per trace var politenessScorer = new TracedScorer() { @Override public String getName() { return "Politeness"; } @Override public List score( TaskResult taskResult, BrainstoreTrace trace) { var thread = trace.getLLMConversationThread(); var lastAssistant = thread.stream() .filter(msg -> "assistant".equals(msg.get("role"))) .reduce((first, second) -> second) .orElse(null); var content = lastAssistant == null ? "" : String.valueOf(lastAssistant.getOrDefault("content", "")) .toLowerCase(); var politeWords = List.of("welcome", "glad", "happy", "pleasure", "thank"); boolean isPolite = politeWords.stream().anyMatch(content::contains); return List.of(new Score(getName(), isPolite ? 
1.0 : 0.0)); } }; var efficiencyScorer = new TracedScorer() { @Override public String getName() { return "Efficiency"; } @Override public List score( TaskResult taskResult, BrainstoreTrace trace) { var llmSpans = trace.getSpans("llm"); boolean isEfficient = llmSpans.size() >= 3 && llmSpans.size() <= 5; return List.of(new Score(getName(), isEfficient ? 1.0 : 0.0)); } }; var eval = braintrust .evalBuilder() .name("Support Quality") .cases( DatasetCase.of("My order hasn't arrived yet. Order #12345.", ""), DatasetCase.of("I need help resetting my password.", "")) .taskFunction(supportTask) .scorers(politenessScorer, efficiencyScorer) .build(); var result = eval.run(); System.out.println(result.createReportString()); } private static String complete(OpenAIClient client, ChatCompletionCreateParams.Builder builder) { return client.chat().completions().create(builder.build()).choices().get(0).message() .content() .orElse(""); } } ``` ```ruby eval_trace_code_scorer.rb theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}} require "braintrust" require "openai" Braintrust.init client = OpenAI::Client.new(api_key: ENV.fetch("OPENAI_API_KEY", nil)) SUPPORT_DATASET = [ {input: "My order hasn't arrived yet. 
Order #12345."}, {input: "I need help resetting my password."}, ] def chat(client, messages) client.chat.completions.create(model: "gpt-5-mini", messages: messages) .choices.first.message.content || "" end support_task = Braintrust::Task.new("support") do |input:| messages = [{role: "system", content: "You are a helpful customer support agent."}] messages << {role: "user", content: input} messages << {role: "assistant", content: chat(client, messages)} messages << {role: "user", content: "Can you provide more details?"} messages << {role: "assistant", content: chat(client, messages)} messages << {role: "user", content: "Thank you for your help!"} chat(client, messages) end politeness_scorer = Braintrust::Scorer.new("politeness") do |trace:| next 0 unless trace thread = trace.thread last_assistant = thread.reverse.find { |msg| msg["role"] == "assistant" } content = (last_assistant&.dig("content") || "").downcase polite_words = ["welcome", "glad", "happy", "pleasure", "thank"] is_polite = polite_words.any? { |word| content.include?(word) } {score: is_polite ? 1.0 : 0.0, metadata: {checked_message_preview: content[0, 80]}} end efficiency_scorer = Braintrust::Scorer.new("efficiency") do |trace:| next 0 unless trace llm_spans = trace.spans(span_type: "llm") is_efficient = llm_spans.length.between?(3, 5) {score: is_efficient ? 
1.0 : 0.0, metadata: {llm_calls: llm_spans.length}} end Braintrust::Eval.run( project: "Support Quality", cases: SUPPORT_DATASET, task: support_task, scorers: [politeness_scorer, efficiency_scorer] ) OpenTelemetry.tracer_provider.shutdown ``` ```csharp #skip-compile theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}} using Braintrust.Sdk.Eval; using Braintrust.Sdk.OpenAI; using OpenAI.Chat; var braintrust = Braintrust.Sdk.Braintrust.Get(); var activitySource = braintrust.GetActivitySource(); var openAIClient = BraintrustOpenAI.WrapOpenAI( activitySource, Environment.GetEnvironmentVariable("OPENAI_API_KEY")); var chatClient = openAIClient.GetChatClient("gpt-5-mini"); string SupportTask(string input) { var messages = new List { new SystemChatMessage("You are a helpful customer support agent."), new UserChatMessage(input), }; messages.Add(new AssistantChatMessage(chatClient.CompleteChat(messages).Value.Content[0].Text)); messages.Add(new UserChatMessage("Can you provide more details?")); messages.Add(new AssistantChatMessage(chatClient.CompleteChat(messages).Value.Content[0].Text)); messages.Add(new UserChatMessage("Thank you for your help!")); return chatClient.CompleteChat(messages).Value.Content[0].Text; } var eval = await braintrust .EvalBuilder() .Name("Support Quality") .Cases( DatasetCase.Of("My order hasn't arrived yet. 
Order #12345.", ""), DatasetCase.Of("I need help resetting my password.", "")) .TaskFunction(SupportTask) .Scorers(new PolitenessScorer(), new EfficiencyScorer()) .BuildAsync(); await eval.RunAsync(); // Scores the last assistant message in the conversation thread reconstructed from the trace class PolitenessScorer : ITracedScorer { public string Name => "Politeness"; public Task> Score(TaskResult taskResult) => Task.FromResult>([new Score(Name, 0.0)]); public async Task> Score( TaskResult taskResult, EvalTrace trace) { var thread = await trace.GetThreadAsync(); var lastAssistant = thread.LastOrDefault(m => m.TryGetValue("role", out var role) && role as string == "assistant"); var content = (lastAssistant?.GetValueOrDefault("content") as string ?? "").ToLowerInvariant(); string[] politeWords = ["welcome", "glad", "happy", "pleasure", "thank"]; var isPolite = politeWords.Any(content.Contains); return [new Score(Name, isPolite ? 1.0 : 0.0, new Dictionary { ["checked_message_preview"] = content[..Math.Min(80, content.Length)] })]; } } // Scores efficiency based on the number of LLM spans in the trace class EfficiencyScorer : ITracedScorer { public string Name => "Efficiency"; public Task> Score(TaskResult taskResult) => Task.FromResult>([new Score(Name, 0.0)]); public async Task> Score( TaskResult taskResult, EvalTrace trace) { var llmSpans = await trace.GetSpansAsync("llm"); var isEfficient = llmSpans.Count is >= 3 and <= 5; return [new Score(Name, isEfficient ? 
1.0 : 0.0, new Dictionary { ["llm_calls"] = llmSpans.Count })]; } } ``` Define TypeScript or Python scorers in code and push to Braintrust: ```typescript title="trace_code_scorer.ts" theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}} import braintrust from "braintrust"; import { z } from "zod"; const project = braintrust.projects.create({ name: "my-project" }); project.scorers.create({ name: "Politeness scorer", slug: "politeness-scorer", description: "Check if assistant responds politely", parameters: z.object({ trace: z.any(), }), handler: async ({ trace }) => { if (!trace) return 0; const thread = await trace.getThread(); const lastAssistantMsg = thread.reverse().find(msg => msg.role === "assistant"); const content = lastAssistantMsg?.content?.toLowerCase() || ""; const politeWords = ["welcome", "glad", "happy", "pleasure", "thank"]; const isPolite = politeWords.some(word => content.includes(word)); return { score: isPolite ? 1 : 0, metadata: { checked_message_preview: content.slice(0, 80) }, }; }, }); project.scorers.create({ name: "Efficiency scorer", slug: "efficiency-scorer", description: "Check if conversation was efficient", parameters: z.object({ trace: z.any(), }), handler: async ({ trace }) => { if (!trace) return 0; const llmSpans = await trace.getSpans({ spanType: ["llm"] }); const isEfficient = llmSpans.length >= 3 && llmSpans.length <= 5; return { score: isEfficient ? 
      1 : 0,
      metadata: { llm_calls: llmSpans.length },
    };
  },
});
```

```python title="trace_code_scorer.py" theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
import braintrust
from pydantic import BaseModel

project = braintrust.projects.create(name="my-project")


class TraceParams(BaseModel):
    trace: dict


async def politeness_scorer(trace):
    if not trace:
        return 0
    thread = await trace.get_thread()
    last_assistant_msg = next(
        (msg for msg in reversed(thread) if msg.get("role") == "assistant"), None
    )
    content = (last_assistant_msg.get("content") or "").lower() if last_assistant_msg else ""
    polite_words = ["welcome", "glad", "happy", "pleasure", "thank"]
    is_polite = any(word in content for word in polite_words)
    return {
        "score": 1 if is_polite else 0,
        "metadata": {"checked_message_preview": content[:80]},
    }


async def efficiency_scorer(trace):
    if not trace:
        return 0
    llm_spans = await trace.get_spans(span_type=["llm"])
    is_efficient = 3 <= len(llm_spans) <= 5
    return {
        "score": 1 if is_efficient else 0,
        "metadata": {"llm_calls": len(llm_spans)},
    }


project.scorers.create(
    name="Politeness scorer",
    slug="politeness-scorer",
    description="Check if assistant responds politely",
    parameters=TraceParams,
    handler=politeness_scorer,
)

project.scorers.create(
    name="Efficiency scorer",
    slug="efficiency-scorer",
    description="Check if conversation was efficient",
    parameters=TraceParams,
    handler=efficiency_scorer,
)
```

Push to Braintrust:

```bash TypeScript theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
bt functions push trace_code_scorer.ts
```

```bash Python theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
bt functions push trace_code_scorer.py
```

To create the scorer in the UI instead:

1. Go to [**Scorers**](https://www.braintrust.dev/app/~/scorers) > **+ Scorer**.
2. Enter a scorer name and slug.
3. Select **TypeScript** or **Python**.
4. Write your scorer function with the `trace` parameter. The code editor provides real-time linting and autocomplete.
5. Click **Save as custom scorer**.

```typescript theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
import type { Trace } from 'braintrust';

async function handler({
  input,
  output,
  expected,
  metadata,
  trace,
}: {
  input: any;
  output: any;
  expected: any;
  metadata: Record<string, any>;
  trace: Trace;
}): Promise<
  | number
  | { score: number; name?: string; metadata?: Record<string, any> }
  | null
> {
  if (expected === null) return null;
  const allSpans = await trace.getSpans();
  const llmSpans = await trace.getSpans({ spanType: ["llm"] });
  return {
    name: "span count scorer",
    score: output === expected ? 1 : 0,
    metadata: {
      totalSpanCount: allSpans.length,
      llmSpanCount: llmSpans.length,
    },
  };
}
```

```python theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
from typing import Any


async def handler(
    input: Any,
    output: Any,
    expected: Any,
    metadata: dict[str, Any],
    trace: Any
) -> float | dict[str, Any] | None:
    if expected is None:
        return None
    all_spans = await trace.get_spans()
    llm_spans = await trace.get_spans(span_type=['llm'])
    return {
        'name': 'span count scorer',
        'score': 1.0 if output == expected else 0.0,
        'metadata': {
            'total_span_count': len(all_spans),
            'llm_span_count': len(llm_spans),
        },
    }
```

UI scorers have access to these packages:

* `anthropic`
* `autoevals`
* `braintrust`
* `json`
* `math`
* `openai`
* `re`
* `requests`
* `typing`

For additional packages, use the CLI.

### Trace scorer recipes

Use trace scorers for checks that depend on the agent's trajectory, such as tool usage, tool failures, or step budgets. Add any of these scorers to the `scores` array in an `Eval`, or adapt the handler body for a CLI or UI scorer.

```typescript trace_scorer_recipes.eval.ts theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
import { type EvalScorer } from "braintrust";

function spanName(span: { span_attributes?: { name?: string } }): string {
  return span.span_attributes?.name ?? "unknown";
}

function stringField(value: unknown, fieldName: string): string | null {
  if (typeof value !== "object" || value === null) return null;
  const field = Object.getOwnPropertyDescriptor(value, fieldName)?.value;
  return typeof field === "string" ? field : null;
}

// Check if a specific tool was called at least once.
const requiredToolCalled: EvalScorer<string, string, unknown> = async ({
  trace,
}) => {
  if (!trace) return null;
  const toolSpans = await trace.getSpans({ spanType: ["tool"] });
  const editViewCalls = toolSpans.filter(
    (span) => span.span_attributes?.name === "edit_view",
  );
  return {
    name: "edit_view called",
    score: editViewCalls.length > 0 ? 1 : 0,
    metadata: { edit_view_calls: editViewCalls.length },
  };
};

// Check if a tool was called with an argument matching the expected value.
const requiredToolCalledWithArg: EvalScorer<
  string,
  string,
  unknown
> = async ({ expected, trace }) => {
  if (!trace) return null;
  const documentId = stringField(expected, "document_id");
  if (!documentId) return null;
  const toolSpans = await trace.getSpans({ spanType: ["tool"] });
  const searchCalls = toolSpans.filter(
    (span) => span.span_attributes?.name === "search_docs",
  );
  const matchedCall = searchCalls.some(
    (span) => stringField(span.input, "document_id") === documentId,
  );
  return {
    name: "searched expected document",
    score: matchedCall ? 1 : 0,
    metadata: {
      expected_document_id: documentId,
      search_docs_calls: searchCalls.length,
    },
  };
};

// Check that no tool from a denylist was called.
const noDisallowedTools: EvalScorer<string, string, unknown> = async ({
  trace,
}) => {
  if (!trace) return null;
  const disallowedToolNames = new Set(["send_email", "delete_record"]);
  const toolSpans = await trace.getSpans({ spanType: ["tool"] });
  const disallowedCalls = toolSpans.filter((span) => {
    const name = span.span_attributes?.name;
    return typeof name === "string" && disallowedToolNames.has(name);
  });
  return {
    name: "no disallowed tools",
    score: disallowedCalls.length === 0 ?
      1 : 0,
    metadata: {
      disallowed_tools: disallowedCalls.map(spanName),
    },
  };
};

// Check that every tool call completed without error.
const allToolsSucceeded: EvalScorer<string, string, unknown> = async ({
  trace,
}) => {
  if (!trace) return null;
  const toolSpans = await trace.getSpans({ spanType: ["tool"] });
  const failedToolCalls = toolSpans.filter((span) => Boolean(span.error));
  return {
    name: "tool calls succeeded",
    score: failedToolCalls.length === 0 ? 1 : 0,
    metadata: {
      failed_tools: failedToolCalls.map(spanName),
      tool_calls: toolSpans.length,
    },
  };
};

// Check if the agent stayed within a step budget.
const trajectoryBudget: EvalScorer<string, string, unknown> = async ({
  trace,
}) => {
  if (!trace) return null;
  const maxSteps = 8;
  const agentSpans = await trace.getSpans({ spanType: ["llm", "tool"] });
  return {
    name: "trajectory budget",
    score: agentSpans.length <= maxSteps ? 1 : 0,
    metadata: {
      agent_steps: agentSpans.length,
      max_steps: maxSteps,
    },
  };
};
```

```python eval_trace_scorer_recipes.py theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
def span_name(span):
    return (span.span_attributes or {}).get("name", "unknown")


def string_field(value, field_name):
    return value.get(field_name) if isinstance(value, dict) else None


# Check if a specific tool was called at least once.
async def required_tool_called(input, output, expected, trace=None):
    if not trace:
        return None
    tool_spans = await trace.get_spans(span_type=["tool"])
    edit_view_calls = [
        span
        for span in tool_spans
        if (span.span_attributes or {}).get("name") == "edit_view"
    ]
    return {
        "name": "edit_view called",
        "score": 1 if edit_view_calls else 0,
        "metadata": {"edit_view_calls": len(edit_view_calls)},
    }


# Check if a tool was called with an argument matching the expected value.
async def required_tool_called_with_arg(input, output, expected, trace=None):
    if not trace:
        return None
    document_id = string_field(expected, "document_id")
    if not isinstance(document_id, str):
        return None
    tool_spans = await trace.get_spans(span_type=["tool"])
    search_calls = [
        span
        for span in tool_spans
        if (span.span_attributes or {}).get("name") == "search_docs"
    ]
    matched_call = any(
        string_field(span.input, "document_id") == document_id
        for span in search_calls
    )
    return {
        "name": "searched expected document",
        "score": 1 if matched_call else 0,
        "metadata": {
            "expected_document_id": document_id,
            "search_docs_calls": len(search_calls),
        },
    }


# Check that no tool from a denylist was called.
async def no_disallowed_tools(input, output, expected, trace=None):
    if not trace:
        return None
    disallowed_tool_names = {"send_email", "delete_record"}
    tool_spans = await trace.get_spans(span_type=["tool"])
    disallowed_calls = [
        span
        for span in tool_spans
        if (span.span_attributes or {}).get("name") in disallowed_tool_names
    ]
    return {
        "name": "no disallowed tools",
        "score": 1 if not disallowed_calls else 0,
        "metadata": {
            "disallowed_tools": [span_name(span) for span in disallowed_calls],
        },
    }


# Check that every tool call completed without error.
async def all_tools_succeeded(input, output, expected, trace=None):
    if not trace:
        return None
    tool_spans = await trace.get_spans(span_type=["tool"])
    failed_tool_calls = [span for span in tool_spans if span.error]
    return {
        "name": "tool calls succeeded",
        "score": 1 if not failed_tool_calls else 0,
        "metadata": {
            "failed_tools": [span_name(span) for span in failed_tool_calls],
            "tool_calls": len(tool_spans),
        },
    }


# Check if the agent stayed within a step budget.
async def trajectory_budget(input, output, expected, trace=None):
    if not trace:
        return None
    max_steps = 8
    agent_spans = await trace.get_spans(span_type=["llm", "tool"])
    return {
        "name": "trajectory budget",
        "score": 1 if len(agent_spans) <= max_steps else 0,
        "metadata": {
            "agent_steps": len(agent_spans),
            "max_steps": max_steps,
        },
    }
```

## Set pass thresholds

Define minimum acceptable scores to automatically mark results as passing or failing. When configured, scores that meet or exceed the threshold are marked as **passing** (green highlighting with checkmark), while scores below are marked as **failing** (red highlighting).

Pass thresholds apply only to scorers that output numeric scores. Classifiers, which output labels, don't use them.

Add `__pass_threshold` to the scorer's metadata (value between 0 and 1):

```typescript theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
project.scorers.create({
  name: "Quality checker",
  slug: "quality-checker",
  handler: async ({ output, expected }) => {
    return output === expected ? 1 : 0;
  },
  metadata: {
    __pass_threshold: 0.8,
  },
});
```

```python theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
@project.scorers.create(
    name="Quality checker",
    slug="quality-checker",
    metadata={"__pass_threshold": 0.8},
)
def quality_checker(output, expected):
    return 1 if output == expected else 0
```

```java #skip-compile theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
// Pass thresholds are not supported in the Java SDK.
// Use the UI or push a TypeScript/Python scorer via the CLI to set a pass threshold.
```

```ruby theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
# Pass thresholds are not supported in the Ruby SDK.
# Use the UI or push a TypeScript/Python scorer via the CLI to set a pass threshold.
```

```csharp #skip-compile theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
// Pass thresholds are not supported in the C# SDK.
// Use the UI or push a TypeScript/Python scorer via the CLI to set a pass threshold.
```

When creating or editing a scorer in the UI:

1. Look for the **Pass threshold** slider in the scorer configuration.
2. Drag the slider to set your minimum acceptable score (0–1).
3. Click **Save as custom scorer**.

The threshold can be set for any scorer type.

## Return multiple scores

A single scorer can return an array of score objects to emit multiple named metrics from one call. This is useful when several quality dimensions can be computed together or share computation. Each item appears as its own score column in the Braintrust UI.

Each item requires `name` and `score`. `metadata` is optional.

```typescript theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
Eval("Summary Quality", {
  data: DATASET,
  task,
  scores: [
    ({ output, expected }) => {
      const words = (output ?? "").toLowerCase().split(/\s+/);
      const keyTerms: string[] = expected.key_terms;
      const covered = keyTerms.filter((t) => words.includes(t)).length;
      return [
        {
          name: "coverage",
          score: keyTerms.length ? covered / keyTerms.length : 1,
          metadata: { missing: keyTerms.filter((t) => !words.includes(t)) },
        },
        {
          name: "conciseness",
          score: words.length <= expected.max_words ?
            1 : 0,
          metadata: { word_count: words.length, limit: expected.max_words },
        },
      ];
    },
  ],
});
```

```python theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
from braintrust import Eval, Score


def summary_quality(output, expected, **kwargs):
    words = (output or "").lower().split()
    key_terms = expected["key_terms"]
    covered = sum(1 for t in key_terms if t in words)
    return [
        Score(
            name="coverage",
            score=covered / len(key_terms) if key_terms else 1.0,
            metadata={"missing": [t for t in key_terms if t not in words]},
        ),
        Score(
            name="conciseness",
            score=1.0 if len(words) <= expected["max_words"] else 0.0,
            metadata={"word_count": len(words), "limit": expected["max_words"]},
        ),
    ]


Eval("Summary Quality", data=DATASET, task=task, scores=[summary_quality])
```

```java #skip-compile theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
import dev.braintrust.eval.*;
import java.util.List;
import java.util.Map;

// A scorer returns List<Score>, so a single scorer can emit several named metrics.
// The Java Score record holds a name and value; pass per-case criteria through case metadata.
var summaryQuality = new Scorer<String, String>() {
  @Override
  public String getName() {
    return "Summary quality";
  }

  @Override
  @SuppressWarnings("unchecked")
  public List<Score> score(TaskResult<String, String> taskResult) {
    var words = List.of(taskResult.result().toLowerCase().split("\\s+"));
    Map<String, Object> criteria = taskResult.datasetCase().metadata();
    var keyTerms = (List<String>) criteria.getOrDefault("key_terms", List.of());
    int maxWords = (Integer) criteria.getOrDefault("max_words", Integer.MAX_VALUE);
    long covered = keyTerms.stream().filter(words::contains).count();
    return List.of(
        new Score(
            "coverage", keyTerms.isEmpty() ? 1.0 : (double) covered / keyTerms.size()),
        new Score("conciseness", words.size() <= maxWords ?
            1.0 : 0.0));
  }
};
```

```ruby multi_score.rb theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
summary_quality = Braintrust::Scorer.new("summary_quality") do |output:, expected:|
  words = output.to_s.downcase.split
  key_terms = expected[:key_terms]
  covered = key_terms.count { |t| words.include?(t) }
  [
    {
      name: "coverage",
      score: key_terms.empty? ? 1.0 : covered.to_f / key_terms.size,
      metadata: {missing: key_terms - words}
    },
    {
      name: "conciseness",
      score: words.size <= expected[:max_words] ? 1.0 : 0.0,
      metadata: {word_count: words.size, limit: expected[:max_words]}
    }
  ]
end

class StyleChecker
  include Braintrust::Scorer

  def call(output:, **)
    text = output.to_s
    # Compare whole words so "i" or "us" don't match inside words like "this" or "focus".
    words = text.downcase.scan(/[a-z]+/)
    [
      {name: "ends_with_period", score: text.strip.end_with?(".") ? 1.0 : 0.0},
      {name: "no_first_person", score: (words & %w[i me my we us]).empty? ? 1.0 : 0.0}
    ]
  end
end
```

## Apply classification labels

A [classifier](/evaluate/write-scorers#classifiers) returns a categorical label instead of a numeric score. Define custom code classifiers inline in your eval code, as a function that evaluates a result and constructs one or more classifications.

Each classification your function returns sets a `name` (the group it belongs to, such as `intent`), an `id` (the value you filter by, such as `password_reset`), an optional `label` for display (such as `Password reset`), and optional `metadata`. Unlike an LLM-as-a-judge classifier, custom code sets these fields independently and can return more than one classification at a time.

To create a classifier in the UI, build an [LLM-as-a-judge classifier](/evaluate/llm-as-a-judge#apply-classification-labels).

```typescript theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
import { Eval } from "braintrust";

const DATASET = [
  {
    input: "Hello! Can you help me reset my password?",
    expected: "password_reset",
  },
];

async function task(input: string): Promise<string> {
  // Stand-in for your LLM call
  return `Thanks for reaching out. ${input}`;
}

function intentClassifier({ output }: { output: string }) {
  if (output.toLowerCase().includes("password")) {
    return {
      name: "intent",
      id: "password_reset",
      label: "Password reset",
    };
  }
  return {
    name: "intent",
    id: "other",
    label: "Other",
  };
}

Eval("Support intent", {
  data: DATASET,
  task,
  classifiers: [intentClassifier],
});
```

```python theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
from braintrust import Classification, Eval

DATASET = [
    {
        "input": "Hello! Can you help me reset my password?",
        "expected": "password_reset",
    },
]


def task(input):
    # Stand-in for your LLM call
    return f"Thanks for reaching out. {input}"


def intent_classifier(input, output, expected):
    if "password" in output.lower():
        return Classification(
            name="intent",
            id="password_reset",
            label="Password reset",
        )
    return Classification(name="intent", id="other", label="Other")


Eval(
    "Support intent",
    data=DATASET,
    task=task,
    classifiers=[intent_classifier],
)
```

```go classifier.go theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
package main

import (
	"context"
	"strings"

	"github.com/braintrustdata/braintrust-sdk-go"
	"github.com/braintrustdata/braintrust-sdk-go/eval"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	tp := trace.NewTracerProvider()
	defer tp.Shutdown(context.Background())
	otel.SetTracerProvider(tp)

	bt, err := braintrust.New(tp, braintrust.WithProject("Support intent"))
	if err != nil {
		panic(err)
	}

	intentClassifier := eval.NewClassifier("intent", func(_ context.Context, r eval.TaskResult[string, string]) (eval.Classifications, error) {
		if strings.Contains(strings.ToLower(r.Output), "password") {
			return eval.Classifications{{ID: "password_reset", Label: "Password reset"}}, nil
		}
		return eval.Classifications{{ID: "other", Label: "Other"}}, nil
	})

	evaluator := braintrust.NewEvaluator[string, string](bt)
	_, err = evaluator.Run(context.Background(), eval.Opts[string, string]{
		Experiment: "Support intent",
		Dataset: eval.NewDataset([]eval.Case[string, string]{
			{Input: "Hello! Can you help me reset my password?", Expected: "password_reset"},
		}),
		Task: eval.T(func(_ context.Context, input string) (string, error) {
			return "Thanks for reaching out. " + input, nil // Stand-in for your LLM call
		}),
		Classifiers: []eval.Classifier[string, string]{intentClassifier},
	})
	if err != nil {
		panic(err)
	}
}
```

```java #skip-compile theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
import dev.braintrust.Braintrust;
import dev.braintrust.eval.Classification;
import dev.braintrust.eval.Classifier;
import dev.braintrust.eval.DatasetCase;

class Main {
  public static void main(String... args) {
    var braintrust = Braintrust.get();
    braintrust.openTelemetryCreate();

    Classifier<String, String> intentClassifier =
        Classifier.single(
            "intent",
            tr -> {
              if (tr.result().toLowerCase().contains("password")) {
                return Classification.of("intent", "password_reset", "Password reset");
              }
              return Classification.of("intent", "other", "Other");
            });

    var eval =
        braintrust
            .<String, String>evalBuilder()
            .name("Support intent")
            .cases(DatasetCase.of("Hello! Can you help me reset my password?", "password_reset"))
            .taskFunction(input -> "Thanks for reaching out. " + input) // Stand-in for your LLM call
            .classifiers(intentClassifier)
            .build();

    eval.run();
  }
}
```

```csharp #skip-compile theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Braintrust.Sdk;
using Braintrust.Sdk.Eval;

class Program
{
    static async Task Main(string[] args)
    {
        var braintrust = Braintrust.Sdk.Braintrust.Get();

        var intentClassifier = new FunctionClassifier<string, string>(
            "intent",
            taskResult =>
            {
                if (taskResult.Result.Contains("password", StringComparison.OrdinalIgnoreCase))
                {
                    return new Classification(Id: "password_reset", Name: "intent", Label: "Password reset");
                }
                return new Classification(Id: "other", Name: "intent", Label: "Other");
            });

        var eval = await braintrust
            .EvalBuilder<string, string>()
            .Name("Support intent")
            .Cases(
                new DatasetCase<string, string>(
                    "Hello! Can you help me reset my password?", "password_reset"))
            .TaskFunction(input => "Thanks for reaching out. " + input) // Stand-in for your LLM call
            .Classifiers(intentClassifier)
            .BuildAsync();

        await eval.RunAsync();
    }
}
```

For the C# and Java examples, use the `BRAINTRUST_DEFAULT_PROJECT_NAME` environment variable to set a project name. Otherwise, the default project is `default-dotnet-project` (C#) or `default-java-project` (Java).

In a single evaluation, you can use scorers, classifiers, or both. Classifier failures do not stop the evaluation or affect other scorers and classifiers. Braintrust records classifier errors in the result metadata under `classifier_errors`.
A classifier can also assign multiple labels at once:

```typescript theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
function intentClassifier() {
  return [
    { name: "intent", id: "billing", label: "Billing" },
    { name: "intent", id: "login", label: "Login" },
  ];
}
```

```python theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
def intent_classifier(input, output, expected):
    return [
        Classification(name="intent", id="billing", label="Billing"),
        Classification(name="intent", id="login", label="Login"),
    ]
```

```go #skip-compile theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
intentClassifier := eval.NewClassifier("intent", func(_ context.Context, r eval.TaskResult[string, string]) (eval.Classifications, error) {
	return eval.Classifications{
		{ID: "billing", Label: "Billing"},
		{ID: "login", Label: "Login"},
	}, nil
})
```

```java #skip-compile theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
Classifier<String, String> intentClassifier =
    Classifier.of(
        "intent",
        tr ->
            java.util.List.of(
                Classification.of("intent", "billing", "Billing"),
                Classification.of("intent", "login", "Login")));
```

```csharp #skip-compile theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
var intentClassifier = new FunctionClassifier<string, string>(
    "intent",
    taskResult => (IReadOnlyList<Classification>)new[]
    {
        new Classification(Id: "billing", Name: "intent", Label: "Billing"),
        new Classification(Id: "login", Name: "intent", Label: "Login"),
    });
```

Classifiers require TypeScript SDK v3.9.0+, Python SDK v0.16.0+, Go SDK v0.8.0+, Java SDK v0.3.12+, or C# SDK v0.2.8+.

## Next steps

* [Autoevals](/evaluate/autoevals) for pre-built scorers without writing code
* [LLM-as-a-judge](/evaluate/llm-as-a-judge) for natural language evaluation criteria
* [Run evaluations](/evaluate/run-evaluations) using your scorers
* [Score production logs](/evaluate/score-online) with online scoring rules