Getting Started

This guide will help you get started with DataFlow Operator. You'll learn how to install the operator, create your first data flow, and set up a local development environment.

Prerequisites

For Production Installation

Kubernetes cluster (version 1.24+)
Helm 3.0+
kubectl configured to work with the cluster
Access to data sources (Kafka, PostgreSQL)

For Local Development

Go 1.21+
Docker and Docker Compose
Task (optional, for using Taskfile commands)
Access to ports: 8080, 5050, 15672, 8081, 5432, 9092, 5672

Installation

CRD Management

The DataFlow CRD (Custom Resource Definition) defines the DataFlow resource type in Kubernetes.

Automatic Installation (via Helm)

When installing via Helm (recommended), the CRD is installed and updated automatically. No separate kubectl apply step is needed — the chart manages the CRD lifecycle with crds.install: true (default).

Manual Installation

If you manage CRDs separately (e.g. with ArgoCD, FluxCD, or crds.install: false in Helm values), install the CRD manually:

kubectl apply -f https://raw.githubusercontent.com/dataflow-operator/dataflow/refs/heads/main/config/crd/bases/dataflow.dataflow.io_dataflows.yaml

Or from a local file:

kubectl apply -f dataflow/config/crd/bases/dataflow.dataflow.io_dataflows.yaml

CRD Helm Configuration

Parameter	Default	Description
`crds.install`	`true`	Install and update CRD on `helm install` / `helm upgrade`
`crds.keep`	`true`	Add `helm.sh/resource-policy: keep` annotation to prevent CRD deletion on `helm uninstall`

Upgrade behavior: The CRD is updated on every helm upgrade, so schema changes are applied automatically.

Uninstall behavior: With crds.keep: true (default), the CRD remains in the cluster after helm uninstall. This protects against accidental deletion of all DataFlow resources. To disable CRD installation via Helm:

crds:
  install: false

Installation via Helm (Recommended)

Basic Installation

The simplest way to install the operator from OCI registry:

helm install dataflow-operator oci://ghcr.io/dataflow-operator/helm-charts/dataflow-operator

This command will install the operator with default settings in the default namespace.

Installation in a Specific Namespace

helm install dataflow-operator oci://ghcr.io/dataflow-operator/helm-charts/dataflow-operator \
  --namespace dataflow-system \
  --create-namespace

Note: For local development, you can also use the local chart:

helm install dataflow-operator ./helm-charts/dataflow-operator

Installation with Custom Settings

You can override default values via flags:

helm install dataflow-operator oci://ghcr.io/dataflow-operator/helm-charts/dataflow-operator \
  --set image.repository=your-registry/controller \
  --set image.tag=v1.0.0 \
  --set replicaCount=2 \
  --set resources.limits.memory=1Gi \
  --set resources.limits.cpu=500m \
  --set resources.requests.memory=256Mi \
  --set resources.requests.cpu=100m

Installation with Values File

For more complex configurations, create a my-values.yaml file:

image:
  repository: your-registry/controller
  tag: v1.0.0

replicaCount: 2

resources:
  limits:
    cpu: 1000m
    memory: 1Gi
  requests:
    cpu: 500m
    memory: 512Mi

# Settings for working with Kubernetes API
serviceAccount:
  create: true
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/dataflow-operator

# Security settings
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  fsGroup: 1000

# Optional: Sentry for error monitoring and tracing
# sentry:
#   enabled: true
#   dsn: "https://xxx@o0.ingest.sentry.io/123"
#   environment: production
#   tracesSampleRate: 0.1

Then install:

helm install dataflow-operator oci://ghcr.io/dataflow-operator/helm-charts/dataflow-operator -f my-values.yaml

Verification

After installation, check the status:

# Check pod status
kubectl get pods -l app.kubernetes.io/name=dataflow-operator

# Check CRD
kubectl get crd dataflows.dataflow.dataflow.io

# Check operator logs
kubectl logs -l app.kubernetes.io/name=dataflow-operator --tail=50

# Check deployment status
kubectl get deployment dataflow-operator

Expected output:

NAME                                  READY   STATUS    RESTARTS   AGE
dataflow-operator-7d8f9c4b5d-xxxxx   1/1     Running   0          1m

Updating

To update the operator to a new version:

helm upgrade dataflow-operator oci://ghcr.io/dataflow-operator/helm-charts/dataflow-operator

With custom values:

helm upgrade dataflow-operator oci://ghcr.io/dataflow-operator/helm-charts/dataflow-operator -f my-values.yaml

To update to a specific version:

helm upgrade dataflow-operator oci://ghcr.io/dataflow-operator/helm-charts/dataflow-operator \
  --set image.tag=v1.1.0

Uninstallation

To uninstall the operator:

helm uninstall dataflow-operator

CRD behavior on uninstall: With crds.keep: true (default), the CRD remains in the cluster after helm uninstall. Existing DataFlow resources are preserved but will stop being processed.

To completely remove the CRD and all DataFlow resources in the cluster:

# Delete all DataFlow resources first
kubectl delete dataflow --all --all-namespaces

# Then uninstall the operator
helm uninstall dataflow-operator

# Finally, remove the CRD (only if crds.keep was true)
kubectl delete crd dataflows.dataflow.dataflow.io

Warning

Deleting the CRD removes all DataFlow resources across all namespaces in the cluster. Make sure this is intended.

First DataFlow

Simple Example: Kafka → PostgreSQL

Create a simple DataFlow resource to transfer data from Kafka to PostgreSQL:

apiVersion: dataflow.dataflow.io/v1
kind: DataFlow
metadata:
  name: kafka-to-postgres
  namespace: default
spec:
  source:
    type: kafka
    config:
      brokers:
        - kafka-broker:9092
      topic: input-topic
      consumerGroup: dataflow-group
  sink:
    type: postgresql
    config:
      connectionString: "postgres://user:password@postgres-host:5432/dbname?sslmode=disable"
      table: output_table
      autoCreateTable: true

Apply the resource:

kubectl apply -f dataflow/config/samples/kafka-to-postgres.yaml

Note: Each DataFlow resource creates a separate pod (Deployment) for processing. You can configure resources, node selection, affinity, and tolerations. See Examples for details.

Example with Kubernetes Secrets

For secure credential storage, use Kubernetes Secrets. See example:

kubectl apply -f dataflow/config/samples/kafka-to-postgres-secrets.yaml

This example demonstrates using SecretRef for connector configuration. For more details, see the Using Kubernetes Secrets section in the connectors documentation.

Checking Status

Check the status of the created data flow:

# Get DataFlow information
kubectl get dataflow kafka-to-postgres

# Detailed information
kubectl describe dataflow kafka-to-postgres

# View status in YAML format
kubectl get dataflow kafka-to-postgres -o yaml

Expected status:

status:
  phase: Running
  processedCount: 150
  errorCount: 0
  lastProcessedTime: "2024-01-15T10:30:00Z"
  message: "Processing messages successfully"

Sending Test Message

To test the data flow, send a message to the Kafka topic:

# Using kafka-console-producer
kafka-console-producer --broker-list localhost:9092 --topic input-topic
# Enter JSON message and press Enter
{"id": 1, "name": "Test", "value": 100}

Or use the project script:

./scripts/send-test-message.sh

Checking Data in PostgreSQL

Connect to PostgreSQL and check the data:

psql postgres://user:password@postgres-host:5432/dbname

# Check the table
SELECT * FROM output_table;

Local Development

Starting Dependencies

Use docker-compose to start all dependencies locally:

docker-compose up -d

This command will start:

Kafka (port 9092) with Kafka UI (port 8080)
PostgreSQL (port 5432) with pgAdmin (port 5050)

Accessing UI Interfaces

After starting, the following UIs are available:

Kafka UI: http://localhost:8080
View topics, messages, consumer groups
pgAdmin: http://localhost:5050
Login: admin@admin.com, password: admin
PostgreSQL database management
Queue and exchange management

Running Operator Locally

For development, run the operator locally:

# Install CRD in cluster (if using kind or minikube)
task install

# Run the operator
task run

Or use the script:

./scripts/run-local.sh

Setting Up Local Cluster (Optional)

For full testing, use kind (Kubernetes in Docker):

# Create kind cluster
./scripts/setup-kind.sh

# Install CRD
task install

# Run operator locally
task run

Debugging

For debugging, use operator logs:

# If operator is running locally, logs are output to console
# For operator in cluster:
kubectl logs -l app.kubernetes.io/name=dataflow-operator -f

Check Kubernetes events:

kubectl get events --sort-by='.lastTimestamp' | grep dataflow

Next Steps

Now that you've installed the operator and created your first data flow:

Study Connectors to understand all available sources and sinks
Familiarize yourself with Transformations for working with message transformations
Check out Examples for practical usage examples
Read Development to participate in development

Troubleshooting

Operator Not Starting

# Check logs
kubectl logs -l app.kubernetes.io/name=dataflow-operator

# Check events
kubectl describe pod -l app.kubernetes.io/name=dataflow-operator

# Check CRD
kubectl get crd dataflows.dataflow.dataflow.io -o yaml

DataFlow Not Processing Messages

Check DataFlow status:
```
kubectl describe dataflow <name>
```

Check connection to data source:

# For Kafka
kafka-console-consumer --bootstrap-server localhost:9092 --topic <topic>

# For PostgreSQL
psql <connection-string> -c "SELECT * FROM <table> LIMIT 10;"

Check operator logs for errors

Connection Issues

Ensure data sources are accessible from the cluster
Check Kubernetes network policies
Verify connection strings and credentials are correct
For local development, use localhost or host.docker.internal