# Dataset pipelines

> Transform spans or traces from your project logs into dataset rows in bulk with declarative, version-controlled pipelines.

A dataset pipeline automates the same flow as [promoting individual traces from logs](/annotate/datasets/create#promote-traces-from-logs), but applies it to many spans or traces at once and keeps the logic in version-controlled code.

You define a pipeline with `DatasetPipeline(...)` in a TypeScript or Python file, then run it with the [`bt datasets pipeline`](/reference/cli/datasets#bt-datasets-pipeline) CLI command.

<Warning>
  **Beta:** This feature is subject to change.
</Warning>

<Note>
  Dataset pipelines require `bt` CLI v0.10.0 or later, plus the `braintrust` SDK for the language you write the pipeline in: TypeScript SDK v3.16.0 or later, or Python SDK v0.23.0 or later.
</Note>

## Define a pipeline

A pipeline definition has three parts:

* **`source`**: Which project to read from, an optional [SQL](/reference/sql) filter, and whether to operate on individual spans or entire traces (`scope`).
* **`transform`**: A function that receives a source span or trace and returns one or more dataset rows, or nothing to skip it.
* **`target`**: The dataset to write to. Braintrust creates the dataset if it does not exist.

<CodeGroup dropdown>
  ```typescript pipeline.ts theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
  import { DatasetPipeline } from "braintrust";

  DatasetPipeline({
    name: "support-cases", // Optional; required only when a file defines multiple pipelines
    source: {
      projectName: "Customer Support", // Replace with your source project
      filter: "metadata.flagged = true", // Optional SQL filter; omit to consider all spans
      scope: "span", // "span" (default) or "trace"
    },
    transform: ({ id, input, output, expected, metadata, trace }) => {
      return {
        input,
        expected: output, // Map the production output to the expected value
        metadata,
      };
    },
    target: {
      projectName: "Customer Support", // Replace with your target project
      datasetName: "Support cases", // Created if it does not exist
    },
  });
  ```

  ```python pipeline.py theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
  from braintrust import DatasetPipeline

  def to_row(id=None, input=None, output=None, metadata=None, expected=None, trace=None):
      return {
          "input": input,
          "expected": output,  # Map the production output to the expected value
          "metadata": metadata,
      }

  DatasetPipeline(
      name="support-cases",  # Optional; required only when a file defines multiple pipelines
      source={
          "project_name": "Customer Support",  # Replace with your source project
          "filter": "metadata.flagged = true",  # Optional SQL filter; omit to consider all spans
          "scope": "span",  # "span" (default) or "trace"
      },
      transform=to_row,
      target={
          "project_name": "Customer Support",  # Replace with your target project
          "dataset_name": "Support cases",  # Created if it does not exist
      },
  )
  ```
</CodeGroup>

The `transform` function returns a single row, a list of rows, or `null`/`None` to skip the source. A row accepts the standard dataset fields: `input`, `expected`, `metadata`, `tags`, and `id`. When you omit `id`, the row inherits the source span or trace ID.

When `scope` is `"span"`, the transform receives the span's `id`, `input`, `output`, `expected`, and `metadata`, along with the full `trace`. When `scope` is `"trace"`, it receives only the `trace`.
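
As a minimal sketch of those return shapes, the span-scope transform below skips spans that lack an input or output, returns one row in the common case, and adds a second row when the logged `expected` value differs from the output. The function name `to_rows`, the tag values, and the `-reviewed` ID suffix are hypothetical. The sketch also assumes the transform is called with the same keyword arguments as `to_row` in the example above.

```python
def to_rows(id=None, input=None, output=None, expected=None, metadata=None, trace=None):
    """Span-scope transform that returns None, or a list of one or two rows."""
    # Returning None skips this source span entirely.
    if input is None or output is None:
        return None

    # Copy the metadata so the rows do not alias the source span's dict.
    base_metadata = dict(metadata or {})

    # First row: treat the production output as the expected value.
    # No "id" is set, so the row inherits the source span ID.
    rows = [
        {
            "input": input,
            "expected": output,
            "metadata": base_metadata,
            "tags": ["from-logs"],  # Hypothetical tag
        }
    ]

    # Second row: only when the span already carries a different expected value.
    if expected is not None and expected != output:
        rows.append(
            {
                "id": f"{id}-reviewed",  # Hypothetical explicit ID
                "input": input,
                "expected": expected,
                "metadata": base_metadata,
                "tags": ["from-logs", "reviewed"],  # Hypothetical tags
            }
        )

    return rows
```

The second row sets its own `id` on the assumption that two rows produced from the same source should not share the inherited one. Pass the function as `transform=to_rows` in the pipeline definition.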

## Run a pipeline

To run the full pipeline in one step, use `bt datasets pipeline run`. The `--limit` flag caps how many source spans or traces to discover, and `--window` sets how far back to look (default: `1d`, the last day). For the full list of flags, see the [`bt datasets pipeline`](/reference/cli/datasets#bt-datasets-pipeline) CLI reference.

<CodeGroup>
  ```bash TypeScript theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
  bt datasets pipeline run ./pipeline.ts --limit 100 --window 30d
  ```

  ```bash Python theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
  bt datasets pipeline run ./pipeline.py --limit 100 --window 30d
  ```
</CodeGroup>

`run` discovers source refs, transforms them, and inserts the resulting rows into the target dataset.

## Staged workflow

For larger jobs, or when you want to inspect or edit rows before writing them, split the run into three stages:

```bash theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
# 1. Discover source refs and write them to pulled.jsonl
bt datasets pipeline pull ./pipeline.ts --limit 500

# 2. Apply the transform and write the results to transformed.jsonl
bt datasets pipeline transform ./pipeline.ts

# 3. Push the transformed rows to the target dataset
bt datasets pipeline push ./pipeline.ts
```

Each stage writes its artifacts to `bt-sync/` by default: `pull` writes `pulled.jsonl` and `transform` writes `transformed.jsonl`. You can inspect or edit `transformed.jsonl` before running `push`. Use `--root` to change the directory, or `--out` and `--in` to override individual artifact paths.
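
One way to inspect an artifact before running `push` is a short script like the sketch below. It counts the records in a JSONL file, tallies the top-level keys, and reports any line that is not valid JSON, which helps catch mistakes after hand-editing. The helper name `summarize_jsonl` is hypothetical, and the script assumes only that each non-empty line is a JSON value. It makes no assumption about the record schema, which is not documented here.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_jsonl(path):
    """Count records and top-level keys in a JSONL file, noting malformed lines."""
    records = 0
    bad_lines = []
    key_counts = Counter()

    with Path(path).open(encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # Ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                bad_lines.append(line_number)  # Remember where parsing failed
                continue
            records += 1
            if isinstance(record, dict):
                key_counts.update(record.keys())

    return {"records": records, "bad_lines": bad_lines, "keys": dict(key_counts)}


if __name__ == "__main__":
    # Assumed default location; adjust if you used --root or --out.
    artifact = Path("bt-sync/transformed.jsonl")
    if artifact.exists():
        print(summarize_jsonl(artifact))
    else:
        print(f"{artifact} not found; run the transform stage first")
```

An empty `bad_lines` list means every line parsed, so the file is at least well-formed JSONL.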

For the full set of flags, including source and target overrides and concurrency controls, see the [`bt datasets pipeline`](/reference/cli/datasets#bt-datasets-pipeline) CLI reference.

## Next steps

* Browse the [`bt datasets pipeline`](/reference/cli/datasets#bt-datasets-pipeline) command reference for every flag.
* Learn other ways to [create datasets](/annotate/datasets/create) from uploads, the SDK, or production logs.
* [Use datasets in evaluations](/annotate/datasets/use-in-evaluations) once your rows are in place.
