Parquet API
Overview
The Parquet API service stores, inspects, and queries Apache Parquet files over HTTP, giving clients efficient columnar access to the data.
Architecture
graph LR
    subgraph Kubernetes
        API[Parquet API]
        Storage[(Longhorn PVC)]
    end
    Clients[Clients] --> API
    API --> Storage
Deployment
Kubernetes Resources
| Resource | Name | Namespace |
|----------|------|-----------|
| Deployment | parquet-api | parquet-api |
| Service | parquet-api | parquet-api |
| PVC | parquet-data | parquet-api |
Configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: parquet-api
  namespace: parquet-api
spec:
  replicas: 1
  template:
    spec:
      containers:
        - name: parquet-api
          image: ajxfear/parquet-api:latest
          ports:
            - containerPort: 8080
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: parquet-data
API Endpoints
File Operations
| Endpoint | Method | Description |
|----------|--------|-------------|
| /api/files | GET | List parquet files |
| /api/files/{id} | GET | Get file metadata |
| /api/files/{id}/data | GET | Read file data |
| /api/files | POST | Upload file |
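As a rough illustration of the file endpoints in the table above, the sketch below builds (but does not send) the corresponding HTTP requests with Python's standard library. The base URL is an assumption derived from the Service name, namespace, and container port listed in this document, and the upload content type is likewise assumed, since the request format is not specified here.

```python
import urllib.request
from urllib.parse import quote

# Assumption: in-cluster DNS name built from the Service "parquet-api" in
# namespace "parquet-api"; port 8080 is the documented containerPort.
BASE_URL = "http://parquet-api.parquet-api.svc.cluster.local:8080"


def list_files_request(base_url=BASE_URL):
    """Build GET /api/files (list parquet files)."""
    return urllib.request.Request(f"{base_url}/api/files", method="GET")


def file_metadata_request(file_id, base_url=BASE_URL):
    """Build GET /api/files/{id} (file metadata)."""
    path = f"/api/files/{quote(file_id, safe='')}"
    return urllib.request.Request(base_url + path, method="GET")


def file_data_request(file_id, base_url=BASE_URL):
    """Build GET /api/files/{id}/data (file contents)."""
    path = f"/api/files/{quote(file_id, safe='')}/data"
    return urllib.request.Request(base_url + path, method="GET")


def upload_request(payload, base_url=BASE_URL):
    """Build POST /api/files.

    Assumption: the file is sent as a raw binary body; the service may
    expect multipart form data instead.
    """
    return urllib.request.Request(
        f"{base_url}/api/files",
        data=payload,
        method="POST",
        headers={"Content-Type": "application/octet-stream"},
    )
```

A request built this way can be sent with `urllib.request.urlopen(req)` from a pod inside the cluster. The ID is percent-encoded so that unusual characters cannot change the URL path.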
Query Operations
| Endpoint | Method | Description |
|----------|--------|-------------|
| /api/query | POST | Execute query on file |
| /api/schema/{id} | GET | Get file schema |
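The request body for `/api/query` is not documented here, so the following is only a sketch of what a query call might look like. The JSON field names `file_id`, `columns`, and `filters` are hypothetical placeholders, not the service's confirmed schema.

```python
import json
import urllib.request


def build_query_request(base_url, file_id, columns=None, filters=None):
    """Build POST /api/query with a JSON body (request is not sent).

    Hypothetical schema: "file_id", "columns" and "filters" are
    illustrative names only; check the service for the real fields.
    """
    body = {"file_id": file_id}
    if columns:
        body["columns"] = list(columns)   # column projection
    if filters:
        body["filters"] = list(filters)   # predicates to push down
    return urllib.request.Request(
        f"{base_url}/api/query",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )
```

Keeping request construction separate from sending makes the payload easy to inspect or log before it goes over the wire.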
Features
Supported Operations
- Read parquet files
- Query with column projection
- Filter pushdown
- Schema introspection
- Metadata extraction
Data Types
| Parquet Type | API Type |
|--------------|----------|
| INT32 / INT64 | integer |
| FLOAT / DOUBLE | number |
| BYTE_ARRAY | string |
| BOOLEAN | boolean |
| TIMESTAMP | datetime |
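For client code that needs to interpret schema responses, the table above can be expressed as a simple lookup. This is a minimal sketch: the dictionary mirrors the table exactly, while the helper name `api_type` is made up for illustration.

```python
# Mirrors the Data Types table; keys are Parquet type names.
PARQUET_TO_API_TYPE = {
    "INT32": "integer",
    "INT64": "integer",
    "FLOAT": "number",
    "DOUBLE": "number",
    "BYTE_ARRAY": "string",
    "BOOLEAN": "boolean",
    "TIMESTAMP": "datetime",
}


def api_type(parquet_type):
    """Return the API type for a Parquet type name (case-insensitive).

    Raises KeyError for types the table does not list, so unsupported
    types fail loudly instead of being silently mapped.
    """
    return PARQUET_TO_API_TYPE[parquet_type.strip().upper()]
```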
Storage
Persistent Volume
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: parquet-data
  namespace: parquet-api
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 50Gi
File Organization
/data/
├── uploads/ # Uploaded files
├── processed/ # Processed files
└── temp/ # Temporary files
Monitoring
Metrics
| Metric | Description |
|--------|-------------|
| parquet_files_total | Total files stored |
| parquet_queries_total | Total queries executed |
| parquet_bytes_read_total | Bytes read from files |
| parquet_query_duration_seconds | Query execution time |
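The metric names above follow Prometheus naming conventions, so the sketch below assumes they are exposed in the Prometheus text exposition format. That assumption, and the helper name `parse_metrics`, are not confirmed by this document. The parser reads only simple unlabelled samples and ignores labelled series such as histogram buckets.

```python
def parse_metrics(text):
    """Parse unlabelled samples from Prometheus-style text into a dict.

    Comment lines (# HELP / # TYPE) and labelled series (name{...} value)
    are skipped; this is not a full exposition-format parser.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "{" in line:
            continue
        parts = line.split()
        if len(parts) < 2:
            continue  # malformed line, ignore
        samples[parts[0]] = float(parts[1])
    return samples
```

Given scraped text, `parse_metrics(text).get("parquet_queries_total")` would return the query counter as a float, or `None` if the metric is absent.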
Health Checks
livenessProbe:
  httpGet:
    path: /health
    port: 8080
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
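When debugging probe failures, it can help to run the same check by hand. The sketch below mimics how a Kubernetes `httpGet` probe judges a response, where any status from 200 up to but not including 400 counts as success. The function name `probe` and the injectable `opener` parameter are illustrative choices, not part of the service.

```python
import urllib.error
import urllib.request


def probe(base_url, path, timeout=2.0, opener=urllib.request.urlopen):
    """Return True if GET base_url + path answers with status 200-399.

    Connection errors, timeouts and HTTP error statuses all count as a
    failed probe, as they would for the kubelet.
    """
    url = base_url.rstrip("/") + path
    try:
        with opener(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        return False
```

For example, after `kubectl port-forward` to the pod, `probe("http://localhost:8080", "/health")` and `probe("http://localhost:8080", "/ready")` check the liveness and readiness paths respectively.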
Optimization Tips
- Column projection: request only the columns you need.
- Filter pushdown: put filters in the query so non-matching data is skipped at read time.
- Partitioning: organize data by date or key.
- Compression: use Snappy for a balance of speed and size.
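To make the first two tips concrete, here is a toy, pure-Python model of column projection and filter pushdown. It is not the service's implementation: `RowGroup` and `scan` are hypothetical names, and the min/max statistics are computed on the fly here, whereas Parquet stores them in file metadata at write time.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Toy stand-in for a Parquet row group: column name -> list of values."""
    columns: dict

    def stats(self, column):
        """Min/max of a column (real Parquet reads these from metadata)."""
        values = self.columns[column]
        return min(values), max(values)


def scan(row_groups, select, filter_column, min_value):
    """Return rows where filter_column >= min_value, plus groups read.

    Filter pushdown: a row group is skipped entirely when its max value
    is below min_value. Column projection: only the columns in `select`
    (and the filter column) are ever touched.
    """
    rows, groups_read = [], 0
    for rg in row_groups:
        _, highest = rg.stats(filter_column)
        if highest < min_value:
            continue  # pruned using statistics alone
        groups_read += 1
        keep = [i for i, v in enumerate(rg.columns[filter_column])
                if v >= min_value]
        rows.extend({c: rg.columns[c][i] for c in select} for i in keep)
    return rows, groups_read


groups = [
    RowGroup({"id": [1, 2], "year": [2019, 2020], "payload": ["a", "b"]}),
    RowGroup({"id": [3, 4], "year": [2023, 2024], "payload": ["c", "d"]}),
]
rows, groups_read = scan(groups, select=["id"], filter_column="year",
                         min_value=2023)
print(rows, groups_read)  # [{'id': 3}, {'id': 4}] 1
```

Only one of the two row groups is read, and the `payload` column is never accessed. This is why a selective filter on well-partitioned data combined with a narrow column list reduces both I/O and memory use.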
Resource Requirements
| Resource | Request | Limit |
|----------|---------|-------|
| CPU | 100m | 500m |
| Memory | 256Mi | 1Gi |
Security
Access Control
- Namespace isolation
- No external exposure (internal only)
- PVC access restricted to pod
Troubleshooting
Common Issues
| Issue | Cause | Resolution |
|-------|-------|------------|
| Out of memory | Large file | Increase memory limit |
| Slow queries | Full scan | Use column projection |
| File not found | Path error | Check mount path |