DataFlow Operator

DataFlow Operator is a Kubernetes operator that streams data between sources and sinks and can transform messages in flight.

Current Versions

Component	Version
DataFlow Operator	—
Helm Charts	—
DataFlow MCP	—
DataFlow Web	—

Overview

DataFlow Operator lets you declaratively define data flows between sources and sinks through Kubernetes Custom Resource Definitions (CRDs). The operator automatically manages the lifecycle of each data flow, processes messages, and applies the configured transformations.

Key Features

Multiple Data Source Support

Kafka — read and write messages from/to Kafka topics (TLS, SASL, Avro, Schema Registry)
PostgreSQL — read from tables and write with custom SQL, batch inserts, UPSERT mode
ClickHouse — polling, batch inserts, auto-create MergeTree tables
Trino — SQL queries, Keycloak OAuth2, batch inserts
Nessie — Apache Iceberg tables via Nessie catalog (branches, Basic/Bearer auth, polling, batch appends)

Rich Transformation Set

Timestamp - add timestamp to each message
Flatten - expand arrays into separate messages while preserving parent fields
Filter - filter messages based on JSONPath conditions
Mask - mask sensitive data with or without preserving length
Router - route messages to different sinks based on conditions
Select - select specific fields from messages
Remove - remove specified fields from messages
SnakeCase - convert field names to snake_case
CamelCase - convert field names to CamelCase
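
To make the list above more concrete, here is a minimal Python sketch of three of the transformations (Flatten, Mask, SnakeCase). It illustrates the behavior described in the list and is not the operator's implementation: the function names, the fixed three-character mask used when length is not preserved, and the dict-based message shape are all assumptions made for this example.

```python
import re
from typing import Any, Dict, List

Message = Dict[str, Any]


def flatten(message: Message, field: str) -> List[Message]:
    """Expand the list stored under `field` into one message per element,
    copying every other (parent) field into each output message."""
    items = message.get(field)
    if not isinstance(items, list):
        return [message]  # nothing to expand; pass the message through
    parent = {k: v for k, v in message.items() if k != field}
    return [{**parent, field: item} for item in items]


def mask(value: str, keep_length: bool = True, char: str = "*") -> str:
    """Mask a sensitive string. Without length preservation the result is
    a fixed three-character mask (an arbitrary choice for this sketch)."""
    return char * len(value) if keep_length else char * 3


def to_snake_case(name: str) -> str:
    """Convert a camelCase field name to snake_case by inserting "_" before
    each uppercase letter that follows a lowercase letter or digit."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


order = {"orderId": 7, "items": [{"sku": "a"}, {"sku": "b"}]}
flattened = flatten(order, "items")
# flattened holds two messages, each carrying orderId 7 and a single item
```

Note that Flatten turns one message into several, so a transformation in this model returns a list rather than a single message.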

Flexible Routing

The operator supports conditional routing of messages to different sinks based on JSONPath expressions, enabling complex data processing scenarios.
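
As a rough illustration of conditional routing, the Python sketch below picks a sink name for each message. It makes several assumptions: `get_path` is a simplified dotted-path lookup standing in for a real JSONPath evaluator, the sink names are hypothetical, and "first matching route wins" is a design choice for this example rather than documented operator behavior.

```python
from typing import Any, Callable, Dict, List, Tuple

Message = Dict[str, Any]
# Each route pairs a condition with the name of a (hypothetical) sink.
Route = Tuple[Callable[[Message], bool], str]


def get_path(message: Message, path: str) -> Any:
    """Look up a dotted path such as "user.country"; return None if absent.
    Simplified stand-in for a JSONPath expression."""
    current: Any = message
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def route(message: Message, routes: List[Route], default: str) -> str:
    """Return the sink name of the first route whose condition matches,
    or `default` when no condition matches."""
    for condition, sink in routes:
        if condition(message):
            return sink
    return default


routes: List[Route] = [
    (lambda m: get_path(m, "user.country") == "DE", "eu-sink"),
    (lambda m: get_path(m, "level") == "error", "error-sink"),
]

target = route({"user": {"country": "DE"}, "level": "info"}, routes, "default-sink")
# target == "eu-sink"
```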

Simple Management

Declarative configuration through Kubernetes CRDs makes it easy to manage data flows, keep configurations under version control, and integrate with CI/CD systems.

Secure Configuration

Connectors can read their settings from Kubernetes Secrets through SecretRef, so credentials, tokens, and connection strings are stored securely instead of being written in plain text in the DataFlow specification.
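
The idea behind a secret reference can be sketched in a few lines of Python. This is only an illustration: the `{"name": ..., "key": ...}` shape of the reference, the `resolve_secret_ref` helper, and the in-memory `SECRETS` dictionary are assumptions, not the operator's SecretRef schema. The one detail taken from Kubernetes itself is that values in a Secret's `data` map are base64-encoded.

```python
import base64
from typing import Dict

# Stand-in for Secret objects: secret name -> `data` map.
# Kubernetes stores the values of `data` base64-encoded.
SECRETS: Dict[str, Dict[str, str]] = {
    "kafka-credentials": {
        "password": base64.b64encode(b"s3cr3t").decode("ascii"),
    },
}


def resolve_secret_ref(ref: Dict[str, str],
                       secrets: Dict[str, Dict[str, str]]) -> str:
    """Resolve a hypothetical {"name": ..., "key": ...} reference to the
    decoded secret value; raise ValueError if it cannot be found."""
    try:
        encoded = secrets[ref["name"]][ref["key"]]
    except KeyError as exc:
        raise ValueError(f"secret reference {ref!r} cannot be resolved") from exc
    return base64.b64decode(encoded).decode("utf-8")


password = resolve_secret_ref(
    {"name": "kafka-credentials", "key": "password"}, SECRETS
)
```

The specification then only carries the reference, and the plain-text value exists only at runtime inside the connector.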

Quick Start

Install in minutes

Install the operator via Helm and create your first data flow in under 5 minutes.

Installing the Operator

# Install operator via Helm from OCI registry
helm install dataflow-operator oci://ghcr.io/dataflow-operator/helm-charts/dataflow-operator

# Verify installation
kubectl get pods -l app.kubernetes.io/name=dataflow-operator

# Verify CRD
kubectl get crd dataflows.dataflow.dataflow.io

Creating Your First Data Flow

Create a simple data flow from Kafka to PostgreSQL:

kubectl apply -f dataflow/config/samples/kafka-to-postgres.yaml

Check status:

kubectl get dataflow kafka-to-postgres
kubectl describe dataflow kafka-to-postgres

Local Development

For local development and testing:

# Start dependencies (Kafka, PostgreSQL)
docker-compose up -d

# Run operator locally
make run

Architecture

The operator consists of the following components:

CRD (Custom Resource Definitions)

Defines the schema for the DataFlow resource, which describes the data flow configuration, including source, sink, and list of transformations.

Controller

Kubernetes controller that monitors changes to DataFlow resources and manages their lifecycle. The controller creates and manages processors for each active data flow.

Connectors

Modular connector system for various data sources and sinks. Each connector implements a standard interface for reading or writing data.
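
A "standard interface" for connectors could look roughly like the following Python sketch. The `Source` and `Sink` protocols, their `read` and `write` methods, and the in-memory connectors are hypothetical names chosen for this example; the operator's actual interface may differ.

```python
from typing import Any, Dict, Iterable, Iterator, List, Protocol

Message = Dict[str, Any]


class Source(Protocol):
    """Anything that can produce a stream of messages."""
    def read(self) -> Iterator[Message]: ...


class Sink(Protocol):
    """Anything that can accept a stream of messages."""
    def write(self, messages: Iterable[Message]) -> None: ...


class ListSource:
    """In-memory source that yields a fixed list of messages."""
    def __init__(self, messages: List[Message]) -> None:
        self._messages = messages

    def read(self) -> Iterator[Message]:
        yield from self._messages


class ListSink:
    """In-memory sink that collects everything written to it."""
    def __init__(self) -> None:
        self.written: List[Message] = []

    def write(self, messages: Iterable[Message]) -> None:
        self.written.extend(messages)


def copy(source: Source, sink: Sink) -> None:
    """Move every message from source to sink, knowing only the interfaces."""
    sink.write(source.read())


sink = ListSink()
copy(ListSource([{"id": 1}, {"id": 2}]), sink)
```

Because `copy` depends only on the two protocols, any pair of connectors that implements them can be combined without changing the calling code.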

Transformers

Message transformation modules that are applied sequentially to each message in the order specified in the configuration.
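
Sequential application means that order matters. The Python sketch below models a transformer as a function from one message to a list of messages (an assumption made for this example, so that filtering can return zero messages and flattening can return many) and shows how swapping two steps changes the result.

```python
from typing import Any, Callable, Dict, Iterable, List

Message = Dict[str, Any]
# A transformer maps one message to zero, one, or many messages.
Transformer = Callable[[Message], List[Message]]


def apply_pipeline(message: Message,
                   transformers: Iterable[Transformer]) -> List[Message]:
    """Feed the output of each transformer into the next, in order."""
    batch = [message]
    for transform in transformers:
        batch = [out for msg in batch for out in transform(msg)]
    return batch


def keep_active(msg: Message) -> List[Message]:
    """Filter: drop messages whose "active" field is missing or falsy."""
    return [msg] if msg.get("active") else []


def remove_active(msg: Message) -> List[Message]:
    """Remove: strip the "active" field from the message."""
    return [{k: v for k, v in msg.items() if k != "active"}]


event = {"id": 1, "active": True}
filtered_then_removed = apply_pipeline(event, [keep_active, remove_active])
removed_then_filtered = apply_pipeline(event, [remove_active, keep_active])
# The first order yields [{"id": 1}]; the second yields [] because the
# filter runs after the field it inspects has already been removed.
```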

Processor

Message processing orchestrator that coordinates the work of source, transformations, and sink. Handles errors, maintains statistics, and manages the data flow lifecycle.
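
The coordinating role of the processor can be sketched as a single loop. This Python example is an assumption-laden illustration, not the operator's code: the `run` function, the `Stats` dataclass, and the `on_error` callback are hypothetical, and the policy of counting a failed message and continuing is one possible choice.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Callable, Dict, Iterable, List, Optional

Message = Dict[str, Any]


@dataclass
class Stats:
    processed_count: int = 0
    error_count: int = 0
    last_processed_time: Optional[datetime] = None


def run(source: Iterable[Message],
        transform: Callable[[Message], Message],
        sink: Callable[[Message], None],
        on_error: Callable[[Message, Exception], None]) -> Stats:
    """Read each message, transform it, write it, and keep statistics."""
    stats = Stats()
    for message in source:
        try:
            sink(transform(message))
        except Exception as exc:
            # A failing message is counted and reported, then skipped,
            # so one bad record does not stop the whole flow.
            stats.error_count += 1
            on_error(message, exc)
            continue
        stats.processed_count += 1
        stats.last_processed_time = datetime.now(timezone.utc)
    return stats


written: List[Message] = []
failed: List[Message] = []
stats = run(
    source=[{"n": "1"}, {"n": "x"}, {"n": "3"}],
    transform=lambda m: {"n": int(m["n"])},  # raises ValueError for "x"
    sink=written.append,
    on_error=lambda m, exc: failed.append(m),
)
# Two messages reach the sink; the malformed one goes to `failed`.
```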

Monitoring and Status

Each DataFlow resource has a status that includes:

Phase - current phase of the data flow (Running, Error, etc.)
Message - additional status information
LastProcessedTime - time of the last processed message
ProcessedCount - number of processed messages
ErrorCount - number of errors

The operator also exports Prometheus metrics and supports Sentry for detailed monitoring:

Number of messages received/sent per manifest
Errors in connectors and transformers
Message processing time and transformer execution time
Connector connection status
Sentry error monitoring and distributed tracing (optional)

See Metrics for more details.

Documentation

Getting Started — installation and first data flow
Installation & Helm Configuration — Helm charts, values, installation options
Web GUI — web interface: how it works and capabilities
Connectors — Kafka, PostgreSQL, ClickHouse, Trino, Nessie (sources and sinks)
Transformations — message transformations
Examples — practical examples
Errors — error handling and error sink
Fault Tolerance — at-least-once semantics, idempotent sinks, data consistency
Metrics — Prometheus metrics
Development — developer guide

License

Apache License 2.0