StreamSets¶

Overview¶

StreamSets Data Collector provides data pipeline processing for ETL and data integration workflows.

Architecture¶

graph LR
    subgraph Sources
        S1[Databases]
        S2[Files]
        S3[APIs]
    end

    subgraph StreamSets
        DC[Data Collector]
        Pipelines[Pipelines]
    end

    subgraph Destinations
        D1[Elasticsearch]
        D2[InfluxDB]
        D3[MongoDB]
    end

    S1 --> DC
    S2 --> DC
    S3 --> DC
    DC --> Pipelines
    Pipelines --> D1
    Pipelines --> D2
    Pipelines --> D3

Deployment¶

Kubernetes Resources¶

Resource	Name	Namespace
Deployment	streamsets	streamsets
Service	streamsets	streamsets
PVC	streamsets-data	streamsets

Configuration¶

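apiVersion: apps/v1
kind: Deployment
metadata:
  name: streamsets
  namespace: streamsets
spec:
  replicas: 1
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: streamsets
  template:
    metadata:
      labels:
        app: streamsets
    spec:
      containers:
      - name: streamsets
        image: streamsets/datacollector:latest
        ports:
        - containerPort: 18630
        volumeMounts:
        - name: data
          mountPath: /data
      # Back the "data" mount with the streamsets-data PVC
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: streamsets-data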
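
The resource table lists a Service named streamsets, but no manifest for it is shown. The following is a minimal sketch rather than the deployed definition. The `ClusterIP` type, the port name `web`, and the `app: streamsets` selector label are assumptions; the selector must match whatever labels the Deployment's pod template actually carries.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: streamsets
  namespace: streamsets
spec:
  type: ClusterIP          # assumption: internal-only access
  selector:
    app: streamsets        # assumption: must match the pod template labels
  ports:
    - name: web            # hypothetical port name
      port: 18630          # web interface port from the Access table
      targetPort: 18630
```

With a Service like this, other workloads in the cluster could reach the web interface at `streamsets.streamsets.svc` on port 18630.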

Pipelines¶

Pipeline Types¶

Type	Description
Batch	Process fixed datasets
Streaming	Continuous data flow
CDC	Change data capture

Pipeline Components¶

Origins¶

Origin	Use Case
JDBC	Database tables
File	Local/network files
HTTP	REST API polling
Kafka	Message streaming

Processors¶

Processor	Function
Field Type Converter	Type transformation
Expression Evaluator	Custom expressions
Jython/JavaScript	Script processing
Record Deduplicator	Remove duplicates

Destinations¶

Destination	Target
Elasticsearch	Search index
InfluxDB	Time series
MongoDB	Document store
JDBC	Database tables

Web Interface¶

Access¶

Property	Value
Port	18630
Protocol	HTTPS
Default User	admin
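
Because the web interface listens on port 18630, that port can also serve as a basic health signal for Kubernetes. The fragment below is a sketch to be placed under the streamsets container in the Deployment. It uses a TCP check so that it does not depend on whether the listener speaks HTTP or HTTPS, and all timing values are illustrative assumptions rather than tuned settings.

```yaml
# Fragment: add under the "streamsets" container in the Deployment.
readinessProbe:
  tcpSocket:
    port: 18630
  initialDelaySeconds: 30   # assumption: allow time for JVM start-up
  periodSeconds: 10
livenessProbe:
  tcpSocket:
    port: 18630
  initialDelaySeconds: 60   # assumption: avoid restarts during slow starts
  periodSeconds: 20
  failureThreshold: 3
```

A TCP probe only confirms that the port accepts connections; it does not show that any pipeline is running or healthy.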

Features¶

Pipeline designer
Preview mode
Metrics dashboard
Alert configuration

Data Processing¶

Example Pipeline¶

[MySQL Origin]
    → [Field Type Converter]
    → [Expression Evaluator]
    → [Elasticsearch Destination]

Batch Processing¶

# Illustrative pipeline outline (not an importable format; Data Collector exports pipelines as JSON)
pipeline:
  name: mysql-to-elasticsearch
  executionMode: STANDALONE
  origin:
    type: JDBC
    config:
      connectionString: jdbc:mysql://mysql:3306/db
      query: "SELECT * FROM events"
  processors:
    - type: FIELD_TYPE_CONVERTER
      config:
        fields:
          - name: timestamp
            type: DATETIME
  destination:
    type: ELASTICSEARCH
    config:
      httpUrls:
        - http://elasticsearch:9200
      index: events

Monitoring¶

Metrics¶

Metric	Description
Records processed	Total records
Batch time	Processing duration
Error records	Failed records
Stage metrics	Per-stage stats

Alerts¶

Pipeline failure
High error rate
Slow processing
Connection issues

Storage¶

Persistent Data¶

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: streamsets-data
  namespace: streamsets
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 50Gi

Data Directories¶

Path	Purpose
`/data/sdc`	Pipeline configs
`/data/sdc/log`	Application logs
`/data/sdc/resources`	External resources

Best Practices¶

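Use preview - Validate pipeline logic against sample data before running
Handle errors - Configure error record handling for every pipeline
Monitor metrics - Watch record counts, batch time, and error records
Version pipelines - Export pipeline configurations and keep them under version control
Resource limits - Set container memory to suit the pipeline workload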
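
The resource-limit recommendation above could be expressed as the container fragment below. This is a sketch under stated assumptions: the request and limit values are placeholders, not sizing guidance, and the `SDC_JAVA_OPTS` environment variable is assumed to be honoured by the image's start-up scripts, so verify it against the image in use before relying on it.

```yaml
# Fragment: the container entry in the streamsets Deployment.
containers:
  - name: streamsets
    image: streamsets/datacollector:latest
    resources:
      requests:
        memory: "2Gi"      # placeholder value
        cpu: "500m"        # placeholder value
      limits:
        memory: "4Gi"      # placeholder value
    env:
      - name: SDC_JAVA_OPTS          # assumption: read by the image at start-up
        value: "-Xms2g -Xmx2g"       # keep the heap below the memory limit
```

Keeping the JVM heap well below the container memory limit leaves room for non-heap memory, which lowers the chance of the pod being OOM-killed.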

Troubleshooting¶

Common Issues¶

Issue	Cause	Resolution
Pipeline won't start	Configuration error	Run validation and fix the reported issues
Slow processing	Resource limits too low	Increase memory limits
Connection timeout	Network issue	Verify connectivity to the source or destination
Data loss	Error records discarded	Configure error record handling