Parquet API

Overview

The Parquet API service exposes Apache Parquet files over HTTP. Clients can upload, list, inspect, and query files, and reads benefit from Parquet's columnar layout.

Architecture

graph LR
    subgraph Kubernetes
        API[Parquet API]
        Storage[(Longhorn PVC)]
    end

    Clients[Clients] --> API
    API --> Storage

Deployment

Kubernetes Resources

Resource     Name           Namespace
Deployment   parquet-api    parquet-api
Service      parquet-api    parquet-api
PVC          parquet-data   parquet-api

Configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: parquet-api
  namespace: parquet-api
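spec:
  replicas: 1
  # apps/v1 requires a selector that matches the pod template labels;
  # the label key and value used here are an assumption.
  selector:
    matchLabels:
      app: parquet-api
  template:
    metadata:
      labels:
        app: parquet-api
    spec: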
      containers:
      - name: parquet-api
        image: ajxfear/parquet-api:latest
        ports:
        - containerPort: 8080
        volumeMounts:
        - name: data
          mountPath: /data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: parquet-data

API Endpoints

File Operations

Endpoint               Method   Description
/api/files             GET      List Parquet files
/api/files/{id}        GET      Get file metadata
/api/files/{id}/data   GET      Read file data
/api/files             POST     Upload file
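
The read-only endpoints above can be exercised with a small client. The following is a minimal sketch using only the Python standard library. The base URL (in-cluster Service DNS name on port 8080) and the JSON response format are assumptions; this document does not state either. Upload is left out because the expected request encoding is not documented.

```python
import json
import urllib.request
from urllib.parse import quote

# Assumption: the Service is reachable in-cluster on the container port.
BASE_URL = "http://parquet-api.parquet-api.svc.cluster.local:8080"


def build_request(method, path, base_url=BASE_URL):
    """Build (but do not send) a request for one of the documented endpoints."""
    return urllib.request.Request(base_url.rstrip("/") + path, method=method)


def list_files_request(base_url=BASE_URL):
    # GET /api/files: list Parquet files
    return build_request("GET", "/api/files", base_url)


def file_metadata_request(file_id, base_url=BASE_URL):
    # GET /api/files/{id}: file metadata; the id is URL-escaped defensively
    return build_request("GET", f"/api/files/{quote(file_id, safe='')}", base_url)


def file_data_request(file_id, base_url=BASE_URL):
    # GET /api/files/{id}/data: file contents
    return build_request("GET", f"/api/files/{quote(file_id, safe='')}/data", base_url)


def send_json(request, timeout=10.0):
    """Send the request and decode the body as JSON (response format is assumed)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

From a pod in the cluster, `send_json(list_files_request())` would issue the listing call. Building and sending are kept separate so the URLs can be checked without network access.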

Query Operations

Endpoint           Method   Description
/api/query         POST     Execute query on file
/api/schema/{id}   GET      Get file schema
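
A query call might be assembled as sketched below. The request body schema is not specified in this document, so the JSON field names ("file_id", "columns", "filters") and the shape of each filter are hypothetical placeholders; only the method and path come from the table above.

```python
import json
import urllib.request


def build_query_request(base_url, file_id, columns=None, filters=None):
    """Build a POST /api/query request.

    The field names "file_id", "columns" and "filters" are hypothetical;
    check the service's actual schema before relying on them.
    """
    payload = {"file_id": file_id}
    if columns:
        payload["columns"] = list(columns)   # column projection
    if filters:
        payload["filters"] = list(filters)   # candidates for filter pushdown
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Leaving `columns` and `filters` out of the payload when they are empty keeps the request minimal. Whether the service treats a missing `columns` field as "all columns" is likewise an assumption.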

Features

Supported Operations

  • Read Parquet files
  • Query with column projection
  • Filter pushdown
  • Schema introspection
  • Metadata extraction

Data Types

Parquet Type     API Type
INT32 / INT64    integer
FLOAT / DOUBLE   number
BYTE_ARRAY       string
BOOLEAN          boolean
TIMESTAMP        datetime
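
On the client side, the table above can be kept as a simple lookup. This is a sketch: the function name is illustrative, and raising an error for types outside the table is a choice made here, since the document does not say how the service reports other Parquet types.

```python
# Mapping taken directly from the Data Types table.
PARQUET_TO_API_TYPE = {
    "INT32": "integer",
    "INT64": "integer",
    "FLOAT": "number",
    "DOUBLE": "number",
    "BYTE_ARRAY": "string",
    "BOOLEAN": "boolean",
    "TIMESTAMP": "datetime",
}


def api_type(parquet_type):
    """Return the documented API type for a Parquet type name (case-insensitive)."""
    key = parquet_type.strip().upper()
    try:
        return PARQUET_TO_API_TYPE[key]
    except KeyError:
        # Types not listed in the table are undocumented, so fail loudly.
        raise ValueError(f"no documented API type for Parquet type {parquet_type!r}") from None
```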

Storage

Persistent Volume

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: parquet-data
  namespace: parquet-api
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 50Gi

File Organization

/data/
├── uploads/          # Uploaded files
├── processed/        # Processed files
└── temp/             # Temporary files
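
Code that reads or writes under this layout can build paths through one helper, which also guards against paths that leave the mount. The sketch below assumes the `/data` mount path from the Deployment. The helper name and the escape check are illustrative and do not describe how the service itself resolves paths.

```python
from pathlib import Path

DATA_ROOT = Path("/data")                  # mountPath from the Deployment
SUBDIRS = ("uploads", "processed", "temp")  # directories from the layout above


def resolve_data_path(subdir, filename, root=DATA_ROOT):
    """Return root/subdir/filename, rejecting unknown subdirs and path escapes."""
    if subdir not in SUBDIRS:
        raise ValueError(f"unknown subdirectory: {subdir!r}")
    base = (Path(root) / subdir).resolve()
    candidate = (base / filename).resolve()
    # The resolved file must sit somewhere below the chosen subdirectory.
    if base not in candidate.parents:
        raise ValueError(f"path escapes {base}: {filename!r}")
    return candidate
```

For example, `resolve_data_path("uploads", "events.parquet")` yields a path under `/data/uploads`, while a filename such as `../../etc/passwd` is rejected.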

Monitoring

Metrics

Metric                           Description
parquet_files_total              Total files stored
parquet_queries_total            Total queries executed
parquet_bytes_read_total         Bytes read from files
parquet_query_duration_seconds   Query execution time
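
The metric names follow Prometheus naming conventions, so the sketch below assumes they are exposed in the Prometheus text exposition format. The document confirms neither the format nor the metrics endpoint. It parses samples into a dictionary and derives average bytes read per query from two of the counters above.

```python
def parse_metrics(text):
    """Parse Prometheus text-format samples into {series: value}.

    Assumes one sample per line; an optional trailing timestamp is ignored.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks, HELP and TYPE lines
            continue
        if "}" in line:
            # Labelled series: the name runs up to the closing brace.
            end = line.rindex("}") + 1
            series, rest = line[:end], line[end:].split()
        else:
            series, *rest = line.split()
        samples[series] = float(rest[0])
    return samples


def bytes_per_query(samples):
    """Average bytes read per query, or None when no query has run yet."""
    queries = samples.get("parquet_queries_total", 0.0)
    if queries == 0:
        return None
    return samples.get("parquet_bytes_read_total", 0.0) / queries
```

A steadily rising bytes-per-query figure is one hint that clients are scanning whole files rather than projecting columns.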

Health Checks

livenessProbe:
  httpGet:
    path: /health
    port: 8080
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
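
The same endpoints can be checked by hand, for example through a port-forward. The sketch below mirrors the kubelet's rule for `httpGet` probes, where any status from 200 up to (but not including) 400 counts as success. The function name and the injectable `opener` parameter are illustrative.

```python
import urllib.error
import urllib.request


def probe(base_url, path, timeout=2.0, opener=urllib.request.urlopen):
    """Return True if GET base_url + path succeeds with a status in [200, 400)."""
    try:
        with opener(base_url.rstrip("/") + path, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # urlopen raises HTTPError for 4xx/5xx; refused connections and
        # timeouts surface as URLError or OSError. All count as failure.
        return False
```

With a port-forward to local port 8080, `probe("http://localhost:8080", "/health")` and `probe("http://localhost:8080", "/ready")` report liveness and readiness respectively.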

Performance

Optimization Tips

  1. Column projection - Request only the columns you need
  2. Filter pushdown - Put filters in the query so they are applied while the file is read
  3. Partitioning - Organize data by date or key
  4. Compression - Use Snappy for a balance of speed and size
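
The first two tips can be illustrated with a toy model. The sketch below is not the service's implementation: `RowGroup` is a hypothetical in-memory stand-in for a Parquet row group, and its min/max statistics are computed on the fly, whereas a real Parquet file stores them in its metadata. It shows how statistics let a reader skip whole row groups (filter pushdown) and how only the requested columns are materialized (column projection).

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Toy stand-in for a Parquet row group: column name -> list of values."""
    columns: dict

    def stats(self, column):
        # Real files keep min/max in metadata; here they are computed on demand.
        values = self.columns[column]
        return min(values), max(values)


def scan(row_groups, columns, filter_column, lower_bound):
    """Return (rows, groups_read) for rows where filter_column >= lower_bound."""
    rows, groups_read = [], 0
    for group in row_groups:
        _, group_max = group.stats(filter_column)
        if group_max < lower_bound:
            continue   # pushdown: statistics prove that no row can match
        groups_read += 1
        for i, value in enumerate(group.columns[filter_column]):
            if value >= lower_bound:
                # projection: only the requested columns are copied out
                rows.append({name: group.columns[name][i] for name in columns})
    return rows, groups_read
```

When the data is organized by date or key, matching values cluster into few row groups, so the skip on `group_max` fires more often.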

Resource Requirements

Resource   Request   Limit
CPU        100m      500m
Memory     256Mi     1Gi

Security

Access Control

  • Namespace isolation
  • No external exposure (internal only)
  • PVC access restricted to pod

Troubleshooting

Common Issues

Issue            Cause        Resolution
Out of memory    Large file   Increase the memory limit
Slow queries     Full scan    Use column projection
File not found   Path error   Check that the path is under the /data mount