StreamSets
Overview
StreamSets Data Collector provides data pipeline processing for ETL and data integration workflows.
Architecture
graph LR
    subgraph Sources
        S1[Databases]
        S2[Files]
        S3[APIs]
    end
    subgraph StreamSets
        DC[Data Collector]
        Pipelines[Pipelines]
    end
    subgraph Destinations
        D1[Elasticsearch]
        D2[InfluxDB]
        D3[MongoDB]
    end
    S1 --> DC
    S2 --> DC
    S3 --> DC
    DC --> Pipelines
    Pipelines --> D1
    Pipelines --> D2
    Pipelines --> D3
Deployment
Kubernetes Resources
| Resource | Name | Namespace |
|----------|------|-----------|
| Deployment | streamsets | streamsets |
| Service | streamsets | streamsets |
| PVC | streamsets-data | streamsets |
Configuration
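apiVersion: apps/v1
kind: Deployment
metadata:
  name: streamsets
  namespace: streamsets
spec:
  replicas: 1
  # apps/v1 requires a selector that matches the pod template labels.
  # The label key/value below is an assumption; use whatever your Service selects on.
  selector:
    matchLabels:
      app: streamsets
  template:
    metadata:
      labels:
        app: streamsets
    spec:
      containers:
        - name: streamsets
          image: streamsets/datacollector:latest
          ports:
            - containerPort: 18630
          volumeMounts:
            - name: data
              mountPath: /data
      # Back the "data" mount with the streamsets-data PVC defined under Storage.
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: streamsets-data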
Pipelines
Pipeline Types
| Type | Description |
|------|-------------|
| Batch | Process fixed datasets |
| Streaming | Continuous data flow |
| CDC | Change data capture |
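
To make the CDC row concrete, here is a minimal Python sketch of what applying change events to a target looks like, as opposed to re-reading a whole dataset in a batch run. The event shape (`op`, `key`, `row`) is a hypothetical one chosen for illustration; it is not the record format of any particular Data Collector origin.

```python
from typing import Any, Dict, Iterable

Row = Dict[str, Any]


def apply_changes(target: Dict[Any, Row], events: Iterable[Dict[str, Any]]) -> Dict[Any, Row]:
    """Apply insert/update/delete events to a table held as {primary_key: row}."""
    for event in events:
        op, key = event["op"], event["key"]
        if op == "insert":
            target[key] = dict(event["row"])
        elif op == "update":
            # Merge only the changed columns; create the row if it is missing.
            target.setdefault(key, {}).update(event["row"])
        elif op == "delete":
            target.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return target


if __name__ == "__main__":
    table = {1: {"id": 1, "status": "new"}}
    changes = [
        {"op": "update", "key": 1, "row": {"status": "shipped"}},
        {"op": "insert", "key": 2, "row": {"id": 2, "status": "new"}},
    ]
    print(apply_changes(table, changes))
```

Only the changed rows cross the wire, which is why a CDC pipeline can keep a destination current without rescanning the source table.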
Pipeline Components
Origins
| Origin | Use Case |
|--------|----------|
| JDBC | Database tables |
| File | Local/network files |
| HTTP | REST API polling |
| Kafka | Message streaming |
Processors
| Processor | Function |
|-----------|----------|
| Field Type Converter | Type transformation |
| Expression Evaluator | Custom expressions |
| Jython/JavaScript | Script processing |
| Record Deduplicator | Remove duplicates |
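
The following Python sketch illustrates, conceptually, what two of these processors do to a stream of records: type conversion with failed records routed to an error list, and deduplication on a set of key fields. It is not the Data Collector API; the function names and the dict-based record shape are assumptions made for the example, and the real Record Deduplicator is configured in the pipeline designer rather than written by hand.

```python
from datetime import datetime
from typing import Any, Callable, Dict, Iterable, List, Tuple

Record = Dict[str, Any]


def convert_field_types(
    records: Iterable[Record], converters: Dict[str, Callable[[Any], Any]]
) -> Tuple[List[Record], List[Tuple[Record, str]]]:
    """Convert the named fields; records that fail are collected, not dropped."""
    converted: List[Record] = []
    errors: List[Tuple[Record, str]] = []
    for record in records:
        try:
            out = dict(record)
            for field, convert in converters.items():
                out[field] = convert(out[field])
            converted.append(out)
        except (KeyError, TypeError, ValueError) as exc:
            errors.append((record, f"{type(exc).__name__}: {exc}"))
    return converted, errors


def deduplicate(records: Iterable[Record], key_fields: List[str]) -> List[Record]:
    """Keep the first record seen for each combination of key field values."""
    seen = set()
    unique: List[Record] = []
    for record in records:
        key = tuple(record.get(field) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

For example, `convert_field_types(rows, {"timestamp": datetime.fromisoformat})` returns the converted rows plus a list of `(record, reason)` pairs for any row whose timestamp could not be parsed.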
Destinations
| Destination | Target |
|-------------|--------|
| Elasticsearch | Search index |
| InfluxDB | Time series |
| MongoDB | Document store |
| JDBC | Database tables |
Web Interface
Access
| Property | Value |
|----------|-------|
| Port | 18630 |
| Protocol | HTTPS |
| Default User | admin |
Features
- Pipeline designer
- Preview mode
- Metrics dashboard
- Alert configuration
Data Processing
Example Pipeline
[MySQL Origin]
→ [Field Type Converter]
→ [Expression Evaluator]
→ [Elasticsearch Destination]
Batch Processing
# Pipeline configuration
pipeline:
  name: mysql-to-elasticsearch
  executionMode: STANDALONE
  origin:
    type: JDBC
    config:
      connectionString: jdbc:mysql://mysql:3306/db
      query: "SELECT * FROM events"
  processors:
    - type: FIELD_TYPE_CONVERTER
      config:
        fields:
          - name: timestamp
            type: DATETIME
  destination:
    type: ELASTICSEARCH
    config:
      httpUrls:
        - http://elasticsearch:9200
      index: events
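
As a rough illustration of what batch processing means for the configuration above, the Python sketch below splits a stream of records into fixed-size batches and renders each batch as an Elasticsearch bulk request body for the `events` index. The helper names and the batch size are assumptions for the example; Data Collector performs this batching internally, and nothing here is sent over the network.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List

Record = Dict[str, Any]


def batches(records: Iterable[Record], size: int) -> Iterator[List[Record]]:
    """Yield lists of at most `size` records, preserving input order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[Record] = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


def to_bulk_body(batch: List[Record], index: str) -> str:
    """Render one batch as newline-delimited JSON for the Elasticsearch bulk API."""
    lines = []
    for record in batch:
        lines.append(json.dumps({"index": {"_index": index}}))
        # default=str lets datetime values serialize as strings.
        lines.append(json.dumps(record, default=str))
    return "".join(line + "\n" for line in lines)
```

Each batch becomes one request, so the batch size trades memory per request against the number of round trips to the destination.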
Monitoring
Metrics
| Metric | Description |
|--------|-------------|
| Records processed | Total records |
| Batch time | Processing duration |
| Error records | Failed records |
| Stage metrics | Per-stage stats |
Alerts
- Pipeline failure
- High error rate
- Slow processing
- Connection issues
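
One way to reason about these alerts is as threshold checks over the metrics listed earlier. The Python sketch below shows that idea; the `PipelineStats` fields and the default thresholds are hypothetical values picked for illustration, and it assumes that "records processed" counts every input record, including those that ended up as error records. Connection issues are left out because they need a probe rather than a counter.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PipelineStats:
    records_processed: int   # all input records, including failed ones (assumption)
    error_records: int
    batch_seconds: float     # duration of the most recent batch
    running: bool


def evaluate_alerts(
    stats: PipelineStats,
    max_error_rate: float = 0.05,
    max_batch_seconds: float = 30.0,
) -> List[str]:
    """Return the names of the alert conditions that currently hold."""
    alerts: List[str] = []
    if not stats.running:
        alerts.append("pipeline failure")
    # Guard against division by zero before any records have arrived.
    if stats.records_processed > 0:
        error_rate = stats.error_records / stats.records_processed
        if error_rate > max_error_rate:
            alerts.append("high error rate")
    if stats.batch_seconds > max_batch_seconds:
        alerts.append("slow processing")
    return alerts
```

In Data Collector itself the equivalent rules are defined through the alert configuration in the web interface; the sketch is only meant to show which metric feeds which alert.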
Storage
Persistent Data
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: streamsets-data
  namespace: streamsets
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 50Gi
Data Directories
| Path | Purpose |
|------|---------|
| /data/sdc | Pipeline configs |
| /data/sdc/log | Application logs |
| /data/sdc/resources | External resources |
Best Practices
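- Use preview: test a pipeline in preview mode before running it
- Handle errors: configure error record handling for every pipeline
- Monitor metrics: watch record counts, batch time, and error records for early signs of trouble
- Version pipelines: export pipeline configurations and keep them under version control
- Set resource limits: give the container enough memory for its pipelines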
Troubleshooting
Common Issues
| Issue | Cause | Resolution |
|-------|-------|------------|
| Pipeline won't start | Config error | Check validation |
| Slow processing | Resource limits | Increase memory |
| Connection timeout | Network issue | Verify connectivity |
| Data loss | Error handling | Configure error records |
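
For the "Verify connectivity" step, a quick TCP probe run from a pod in the same namespace can separate network problems from pipeline misconfiguration. The Python sketch below is one minimal way to do that using only the standard library; `can_connect` is a hypothetical helper name, and a successful TCP connection only shows that the port is reachable, not that credentials or TLS settings are correct.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

With the hostnames from the example pipeline, the checks would be `can_connect("mysql", 3306)` and `can_connect("elasticsearch", 9200)`.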