StreamSets

Overview

StreamSets Data Collector provides data pipeline processing for ETL and data integration workflows.

Architecture

graph LR
    subgraph Sources
        S1[Databases]
        S2[Files]
        S3[APIs]
    end

    subgraph StreamSets
        DC[Data Collector]
        Pipelines[Pipelines]
    end

    subgraph Destinations
        D1[Elasticsearch]
        D2[InfluxDB]
        D3[MongoDB]
    end

    S1 --> DC
    S2 --> DC
    S3 --> DC
    DC --> Pipelines
    Pipelines --> D1
    Pipelines --> D2
    Pipelines --> D3

Deployment

Kubernetes Resources

| Resource | Name | Namespace |
|---|---|---|
| Deployment | streamsets | streamsets |
| Service | streamsets | streamsets |
| PVC | streamsets-data | streamsets |
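
The table lists a Service, but this page does not show a manifest for it. The following is a minimal sketch of what it could look like, not the actual manifest. The name, namespace, and port come from this page. The `ClusterIP` type, the port name `web`, and the `app: streamsets` selector label are assumptions; the selector must match whatever labels the Deployment's pod template carries.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: streamsets
  namespace: streamsets
spec:
  type: ClusterIP            # assumption: in-cluster access only
  selector:
    app: streamsets          # assumption: must match the pod template labels
  ports:
  - name: web                # assumption: illustrative port name
    port: 18630              # Data Collector web UI port from this page
    targetPort: 18630
```

If the UI must be reachable from outside the cluster, an Ingress or a different Service type would sit on top of this; that choice depends on the cluster and is not covered here.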

Configuration

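apiVersion: apps/v1
kind: Deployment
metadata:
  name: streamsets
  namespace: streamsets
spec:
  replicas: 1
  # apps/v1 requires a selector that matches the pod template labels
  # (the label key/value "app: streamsets" is an illustrative choice)
  selector:
    matchLabels:
      app: streamsets
  template:
    metadata:
      labels:
        app: streamsets
    spec:
      containers:
      - name: streamsets
        image: streamsets/datacollector:latest
        ports:
        - containerPort: 18630   # web UI
        volumeMounts:
        - name: data
          mountPath: /data
      # Back the "data" mount with the streamsets-data PVC
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: streamsets-data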

Pipelines

Pipeline Types

| Type | Description |
|---|---|
| Batch | Process fixed datasets |
| Streaming | Continuous data flow |
| CDC | Change data capture |

Pipeline Components

Origins

| Origin | Use Case |
|---|---|
| JDBC | Database tables |
| File | Local/network files |
| HTTP | REST API polling |
| Kafka | Message streaming |

Processors

| Processor | Function |
|---|---|
| Field Type Converter | Type transformation |
| Expression Evaluator | Custom expressions |
| Jython/JavaScript | Script processing |
| Record Deduplicator | Remove duplicates |

Destinations

| Destination | Target |
|---|---|
| Elasticsearch | Search index |
| InfluxDB | Time series |
| MongoDB | Document store |
| JDBC | Database tables |

Web Interface

Access

| Property | Value |
|---|---|
| Port | 18630 |
| Protocol | HTTPS |
| Default User | admin |
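
Because the web interface listens on port 18630, a readiness probe on that port can keep traffic away from the pod until Data Collector is accepting connections. The fragment below is a hedged sketch meant for the `streamsets` container in the Deployment. It uses a plain TCP check so that it works whichever protocol the port serves, and the timing values are assumptions to tune for the cluster.

```yaml
# Fragment: belongs under the streamsets container in the Deployment spec
readinessProbe:
  tcpSocket:
    port: 18630              # web UI port from the table above
  initialDelaySeconds: 30    # assumption: allow time for JVM startup
  periodSeconds: 10          # assumption: illustrative check interval
```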

Features

  • Pipeline designer
  • Preview mode
  • Metrics dashboard
  • Alert configuration

Data Processing

Example Pipeline

[MySQL Origin]
    → [Field Type Converter]
    → [Expression Evaluator]
    → [Elasticsearch Destination]

Batch Processing

# Pipeline configuration (illustrative outline; Data Collector exports pipelines as JSON)
pipeline:
  name: mysql-to-elasticsearch
  executionMode: STANDALONE
  origin:
    type: JDBC
    config:
      connectionString: jdbc:mysql://mysql:3306/db
      query: "SELECT * FROM events"
  processors:
    - type: FIELD_TYPE_CONVERTER
      config:
        fields:
          - name: timestamp
            type: DATETIME
  destination:
    type: ELASTICSEARCH
    config:
      httpUrls:
        - http://elasticsearch:9200
      index: events

Monitoring

Metrics

| Metric | Description |
|---|---|
| Records processed | Total records |
| Batch time | Processing duration |
| Error records | Failed records |
| Stage metrics | Per-stage stats |

Alerts

  • Pipeline failure
  • High error rate
  • Slow processing
  • Connection issues

Storage

Persistent Data

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: streamsets-data
  namespace: streamsets
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 50Gi

Data Directories

| Path | Purpose |
|---|---|
| /data/sdc | Pipeline configs |
| /data/sdc/log | Application logs |
| /data/sdc/resources | External resources |

Best Practices

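  1. Use preview - Run each pipeline in preview mode before starting it
  2. Handle errors - Configure error record handling for every pipeline
  3. Monitor metrics - Watch record counts, batch time, and error records
  4. Version pipelines - Export pipeline configurations and keep them under version control
  5. Set resource limits - Size container memory for the pipeline workload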
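
The resource-limit advice in the last item can be sketched as a fragment of the Deployment's container spec. Every value here is an illustrative assumption, not a recommendation taken from this setup. The `SDC_JAVA_OPTS` variable is assumed to be picked up by the image's startup script to set the JVM heap, and the heap is kept below the container memory limit so the JVM is not killed for exceeding it.

```yaml
# Fragment: belongs under spec.template.spec in the Deployment
containers:
- name: streamsets
  image: streamsets/datacollector:latest
  env:
  - name: SDC_JAVA_OPTS      # assumption: read by the image's startup script
    value: "-Xms1g -Xmx2g"   # assumption: heap sized below the memory limit
  resources:
    requests:
      memory: 2Gi            # illustrative values; size to the workload
      cpu: 500m
    limits:
      memory: 3Gi
```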

Troubleshooting

Common Issues

| Issue | Cause | Resolution |
|---|---|---|
| Pipeline won't start | Config error | Check validation |
| Slow processing | Resource limits | Increase memory |
| Connection timeout | Network issue | Verify connectivity |
| Data loss | Error handling | Configure error records |