Google Cloud Storage
Synopsis
Creates a collector that subscribes to a Google Cloud Pub/Sub subscription to receive OBJECT_FINALIZE notifications, then downloads and ingests the corresponding objects from Google Cloud Storage.
Schema
```yaml
- id: <numeric>
  name: <string>
  description: <string>
  type: gcs
  tags: <string[]>
  pipelines: <pipeline[]>
  status: <boolean>
  properties:
    project_id: <string>
    subscription_id: <string>
    credentials_json: <string>
    bucket: <string>
    object_size: <numeric>
    file_name_filter: <string>
    max_files_in_archive: <numeric>
    max_size_archive_bytes: <numeric>
    timeout: <numeric>
```
Configuration
The following fields are used to define the device:
Device
| Field | Required | Default | Description |
|---|---|---|---|
| id | Y | - | Unique numeric identifier |
| name | Y | - | Device name |
| description | N | - | Optional description |
| type | Y | - | Must be gcs |
| tags | N | - | Array of labels for categorization |
| pipelines | N | - | Array of preprocessing pipeline references |
| status | N | true | Enable/disable the device |
Connection
| Field | Required | Default | Description |
|---|---|---|---|
| project_id | Y | - | Google Cloud project ID that owns the Pub/Sub subscription |
| subscription_id | Y | - | Pub/Sub subscription ID that receives GCS object notifications |
| credentials_json | N | - | Service account credentials JSON. If omitted, Application Default Credentials are used |
| bucket | N | - | Restrict ingestion to objects from this bucket name. If omitted, all buckets in the subscription are processed |
Object Processing
| Field | Required | Default | Description |
|---|---|---|---|
| object_size | N | 100000000 | Maximum object size in bytes to download. Objects exceeding this limit are skipped. Set to 0 for no limit |
| file_name_filter | N | ".*" | Regular expression applied to the object name before download. Objects not matching the pattern are skipped |
| timeout | N | 10 | Config poll interval in seconds. Accepts values between 1 and 10 |
Archive Extraction
| Field | Required | Default | Description |
|---|---|---|---|
| max_files_in_archive | N | 100 | Maximum number of files extracted from a single archive. Set to 0 for no limit |
| max_size_archive_bytes | N | 104857600 | Maximum total extracted size in bytes across all files in an archive. Set to 0 for no limit |
Details
The device uses two Google Cloud clients concurrently. A Pub/Sub subscriber listens on the configured subscription and receives notifications for each newly written GCS object. Only OBJECT_FINALIZE events are processed; all other event types are acknowledged and discarded. When a matching notification arrives, the GCS client downloads the object content for ingestion.
If bucket is set, notifications referencing a different bucket are acknowledged and skipped without downloading any data. This allows a single Pub/Sub subscription to cover multiple buckets while the device ingests from one specific bucket.
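For reference, a GCS Pub/Sub notification carries the event type, bucket, and object name in its message attributes; the bucket comparison above operates on this information. A sketch of the attributes for a finalize event, with illustrative bucket and object names:

```json
{
  "eventType": "OBJECT_FINALIZE",
  "bucketId": "app-logs",
  "objectId": "logs/2024/app.json",
  "payloadFormat": "JSON_API_V1"
}
```

A notification with any other eventType, or with a bucketId that does not match the configured bucket, is acknowledged without a download.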
Authentication uses the credentials_json field when provided. If the field is empty, the client falls back to Application Default Credentials, which covers Workload Identity, the GOOGLE_APPLICATION_CREDENTIALS environment variable, and gcloud CLI user credentials.
The following file formats are supported directly: .json, .jsonl, .parquet (also .parq, .pq). Files with .log or .txt extensions are auto-detected. Archive formats .gz, .zip, .bz2, .tar, .tgz, .tar.gz, and .tar.bz2 are extracted before ingestion. Files with any other extension are acknowledged and discarded.
Object size is validated before download using the GCS object metadata. If object_size is set to a positive value, any object whose stored size exceeds the limit is skipped. For archives, object_size limits the compressed download size while max_size_archive_bytes limits the total extracted content size.
The file_name_filter regex is applied to the object name (the key path within the bucket) before download. If the pattern does not compile, the device falls back to the default .* pattern and logs a warning. The pattern is re-evaluated on each config poll interval, so it can be changed without restarting the device.
Transient processing errors cause the Pub/Sub message to be nacked so Pub/Sub retries delivery. Permanent errors (unsupported format, size limit exceeded, archive constraint violations, file name filter mismatch) cause the message to be acknowledged to prevent infinite redelivery.
Examples
The following are commonly used configuration types.
Basic
The minimum required configuration creates the collector using Application Default Credentials:
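A minimal sketch following the schema above; the project and subscription names are placeholders:

```yaml
- id: 1
  name: gcs_basic
  type: gcs
  properties:
    project_id: "my-project"
    subscription_id: "gcs-object-notifications"
```

With no credentials_json set, the collector authenticates via Application Default Credentials.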
Service Account
A service account credentials JSON can be provided explicitly:
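A sketch using illustrative names; the credentials JSON is inlined as a YAML block scalar, with the remaining service account key fields elided:

```yaml
- id: 2
  name: gcs_service_account
  type: gcs
  properties:
    project_id: "my-project"
    subscription_id: "gcs-object-notifications"
    credentials_json: |
      {"type": "service_account", "project_id": "my-project", ...}
```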
Bucket Filter
Ingestion can be restricted to a specific bucket when the subscription covers multiple buckets:
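A sketch with placeholder names; only objects from the app-logs bucket are downloaded, and notifications for other buckets on the shared subscription are acknowledged and skipped:

```yaml
- id: 3
  name: gcs_single_bucket
  type: gcs
  properties:
    project_id: "my-project"
    subscription_id: "shared-notifications"
    bucket: "app-logs"
```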
File Name Filter
Object names can be filtered by regex to ingest only matching files:
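A sketch with placeholder names; the regex matches only .json objects under a logs/ prefix, and everything else is skipped before download:

```yaml
- id: 4
  name: gcs_json_logs
  type: gcs
  properties:
    project_id: "my-project"
    subscription_id: "gcs-object-notifications"
    file_name_filter: "^logs/.*\\.json$"
```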
Archive Processing
Archive extraction limits prevent oversized archives from consuming excessive resources:
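A sketch with placeholder names; this caps the compressed download at 200 MiB (object_size), extraction at 50 files per archive, and the total extracted content at 50 MiB:

```yaml
- id: 5
  name: gcs_archives
  type: gcs
  properties:
    project_id: "my-project"
    subscription_id: "gcs-object-notifications"
    object_size: 209715200
    max_files_in_archive: 50
    max_size_archive_bytes: 52428800
```

Archives exceeding any of these limits are acknowledged and discarded rather than retried.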