Introduction

Sifter is an open-source document extraction engine. It turns documents — contracts, invoices, receipts, reports, and more — into a structured, queryable database, defined by a schema you describe in natural language. Once extracted, your records are queryable via API, filterable, aggregatable, and hookable into any downstream pipeline.

Why not RAG?

RAG is the default answer for “AI + documents.” It works well on diverse corpora — manuals, research papers, knowledge bases. It breaks on homogeneous collections like invoices, contracts, or receipts.

RAG on 500 invoices

Ask “How much did I invoice to Acme in September?” — RAG searches by similarity. All invoices look alike. It returns the highest-scoring chunks, not all the matching ones. It can’t filter by date or sum totals. You get a guess.

Sifter on 500 invoices

Sifter extracts client, date, total from every invoice once, and stores them as rows in a database. The same query becomes a real aggregation: filter by client and month, sum the totals. Exact and reproducible, every time.

“Total invoiced per client per month” is an aggregation query, not a retrieval query. RAG was built for retrieval. Sifter was built for this.
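To make the contrast concrete, here is a minimal sketch of that aggregation in plain Python. The sample records, the field names (client, date, total), and the two helper functions are illustrative assumptions, not Sifter's actual record format or API.

```python
from collections import defaultdict
from datetime import date
from decimal import Decimal

# Hypothetical extracted records: one row per invoice.
records = [
    {"client": "Acme", "date": date(2024, 9, 3), "total": Decimal("1200.00")},
    {"client": "Acme", "date": date(2024, 9, 17), "total": Decimal("450.50")},
    {"client": "Acme", "date": date(2024, 10, 1), "total": Decimal("300.00")},
    {"client": "Globex", "date": date(2024, 9, 9), "total": Decimal("980.00")},
]


def total_invoiced(rows, client, year, month):
    """Sum the totals for one client in one calendar month."""
    return sum(
        (
            r["total"]
            for r in rows
            if r["client"] == client
            and r["date"].year == year
            and r["date"].month == month
        ),
        Decimal("0"),
    )


def totals_per_client_month(rows):
    """Group by (client, year, month) and sum the totals in each group."""
    grouped = defaultdict(Decimal)
    for r in rows:
        grouped[(r["client"], r["date"].year, r["date"].month)] += r["total"]
    return dict(grouped)


print(total_invoiced(records, "Acme", 2024, 9))  # 1650.50
```

Because every matching row is included in the sum, the result does not depend on similarity scores or on how many chunks a retriever happens to return.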

How it works

Define a Sift

Give it a name and describe what to extract in natural language: "Extract: client name, invoice date, total amount, VAT number". Sifter infers the JSON schema automatically from the first processed document.
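As a rough illustration of what inferring a schema from a first document could involve, the sketch below derives a JSON-Schema-style description from one extracted record. The record contents, the field names, and the infer_schema helper are hypothetical; the schema Sifter actually produces may look different.

```python
import json


def json_type(value):
    """Map a Python value to a JSON Schema type name."""
    # bool is checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "number"
    if isinstance(value, str):
        return "string"
    if value is None:
        return "null"
    if isinstance(value, list):
        return "array"
    return "object"


def infer_schema(record):
    """Build a flat object schema from the fields of a single record."""
    return {
        "type": "object",
        "properties": {key: {"type": json_type(val)} for key, val in record.items()},
    }


# Hypothetical result of extracting the first invoice.
first_record = {
    "client_name": "Acme S.r.l.",
    "invoice_date": "2024-09-03",
    "total_amount": 1200.0,
    "vat_number": "IT01234567890",
}

schema = infer_schema(first_record)
print(json.dumps(schema, indent=2))
```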

Upload documents

Upload documents via the web UI, the REST API, the Python or TypeScript SDK, or the CLI. Organize them into folders to run multiple extractors automatically.

Query and export

Browse extracted records in a table. Filter, sort, run aggregations, or ask questions in natural language. Export to CSV or query programmatically via the API.

Key concepts

Concept	Description
Sift	An extraction schema defined in natural language. One sift → one structured table.
Folder	A document container. Link it to multiple sifts — every upload triggers all of them.
Record	A single extracted result: one document processed by one sift.
Dashboard	A live board of KPIs and charts generated from extracted records.
Webhook	HTTP callback fired on extraction events. Wildcard patterns, retry on failure.
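The webhook behavior described in the table can be sketched as follows. This is an illustration only: the event names (such as record.created), the glob-style matching via fnmatch, and the exponential backoff policy are assumptions, not Sifter's documented semantics.

```python
import time
from fnmatch import fnmatchcase


def matches(pattern, event):
    """Glob-style match, e.g. 'record.*' matches 'record.created'."""
    return fnmatchcase(event, pattern)


def deliver_with_retry(send, payload, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call send(payload) until it succeeds or the attempts run out.

    Waits base_delay, then 2 * base_delay, and so on between failed attempts.
    Returns the number of attempts used; re-raises the last error on failure.
    """
    for attempt in range(1, attempts + 1):
        try:
            send(payload)
            return attempt
        except Exception:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Usage with a fake sender that fails once, then succeeds.
calls = []


def flaky_send(payload):
    calls.append(payload)
    if len(calls) < 2:
        raise ConnectionError("temporary failure")


event = "record.created"  # hypothetical event name
used = None
if matches("record.*", event):
    # The sleep is stubbed out so the example runs instantly.
    used = deliver_with_retry(flaky_send, {"event": event}, sleep=lambda s: None)
    print(used)  # 2
```

In a real receiver, send would be an HTTP POST to the subscriber's URL, and a non-2xx response would be treated as a failure.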

Three pillars

Extract

Define what to extract in plain language. Sifter infers the schema, processes every document, and stores the results as structured records — no templates, no code.

Analyze

Extracted records are real structured data. Filter, sort, build dashboards, or ask questions in natural language. Export to CSV or pipe to your warehouse.

Build

REST API, Python SDK, TypeScript SDK, CLI, webhooks, and an MCP server for Claude, ChatGPT, Gemini, Cursor — any MCP-aware client.

Open source vs. Cloud

Sifter is MIT licensed. The OSS engine ships the complete product (chat, dashboards, webhooks, SDK, MCP stdio) and self-hosts with a single docker compose up. Bring your own LLM key; the software itself costs nothing. Sifter Cloud is the managed version at sifter.run: hosted infra, remote MCP endpoint, Google Drive + mail-to-upload ingress, Stripe billing, SSO, audit log, share links. See Pricing.

Quickstart

Get up and running in 5 minutes
