
Google Cloud Storage

Synopsis

Creates a collector that subscribes to a Google Cloud Pub/Sub subscription to receive OBJECT_FINALIZE notifications, then downloads and ingests the corresponding objects from Google Cloud Storage.

Schema

```yaml
- id: <numeric>
  name: <string>
  description: <string>
  type: gcs
  tags: <string[]>
  pipelines: <pipeline[]>
  status: <boolean>
  properties:
    project_id: <string>
    subscription_id: <string>
    credentials_json: <string>
    bucket: <string>
    object_size: <numeric>
    file_name_filter: <string>
    max_files_in_archive: <numeric>
    max_size_archive_bytes: <numeric>
    timeout: <numeric>
```

Configuration

The following fields are used to define the device:

Device

| Field | Required | Default | Description |
|-------|----------|---------|-------------|
| `id` | Y | | Unique numeric identifier |
| `name` | Y | | Device name |
| `description` | N | - | Optional description |
| `type` | Y | | Must be `gcs` |
| `tags` | N | - | Array of labels for categorization |
| `pipelines` | N | - | Array of preprocessing pipeline references |
| `status` | N | `true` | Enable/disable the device |

Connection

| Field | Required | Default | Description |
|-------|----------|---------|-------------|
| `project_id` | Y | | Google Cloud project ID that owns the Pub/Sub subscription |
| `subscription_id` | Y | | Pub/Sub subscription ID that receives GCS object notifications |
| `credentials_json` | N | - | Service account credentials JSON. If omitted, Application Default Credentials are used |
| `bucket` | N | - | Restrict ingestion to objects from this bucket name. If omitted, all buckets in the subscription are processed |

Object Processing

| Field | Required | Default | Description |
|-------|----------|---------|-------------|
| `object_size` | N | `100000000` | Maximum object size in bytes to download. Objects exceeding this limit are skipped. Set to `0` for no limit |
| `file_name_filter` | N | `".*"` | Regular expression applied to the object name before download. Objects not matching the pattern are skipped |
| `timeout` | N | `10` | Config poll interval in seconds. Accepts values between `1` and `10` |

Archive Extraction

| Field | Required | Default | Description |
|-------|----------|---------|-------------|
| `max_files_in_archive` | N | `100` | Maximum number of files extracted from a single archive. Set to `0` for no limit |
| `max_size_archive_bytes` | N | `104857600` | Maximum total extracted size in bytes across all files in an archive. Set to `0` for no limit |

Details

The device uses two Google Cloud clients concurrently. A Pub/Sub subscriber listens on the configured subscription and receives notifications for each newly written GCS object. Only OBJECT_FINALIZE events are processed; all other event types are acknowledged and discarded. When a matching notification arrives, the GCS client downloads the object content for ingestion.

If bucket is set, notifications referencing a different bucket are acknowledged and skipped without downloading any data. This allows a single Pub/Sub subscription to cover multiple buckets while the device ingests from one specific bucket.

Authentication uses the credentials_json field when provided. If the field is empty, the client falls back to Application Default Credentials, which includes Workload Identity, environment variables, and the gcloud CLI credential chain.
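
The credential selection can be sketched as follows. This is an illustrative reduction of the fallback behavior; the real client hands either the parsed JSON or the ADC chain to the Google Cloud SDK.

```python
import json


def resolve_auth_mode(credentials_json: str) -> str:
    """Pick the authentication mode: explicit service account JSON when
    provided, Application Default Credentials otherwise."""
    if credentials_json.strip():
        info = json.loads(credentials_json)  # must be valid service account JSON
        if info.get("type") != "service_account":
            raise ValueError("credentials_json must describe a service_account")
        return "service_account"
    return "application_default_credentials"
```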

The following file formats are supported directly: .json, .jsonl, .parquet (also .parq, .pq). Files with .log or .txt extensions have their content format auto-detected. Archive formats .gz, .zip, .bz2, .tar, .tgz, .tar.gz, and .tar.bz2 are extracted before ingestion. Files with any other extension are acknowledged and discarded.
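
The extension routing can be sketched like this; the extension lists mirror the documentation above, while the function and category names are illustrative.

```python
def classify_object(name: str) -> str:
    """Route an object by extension: direct formats, auto-detected text
    logs, archives, or discard."""
    lowered = name.lower()
    archives = (".tar.gz", ".tar.bz2", ".tgz", ".gz", ".zip", ".bz2", ".tar")
    direct = (".json", ".jsonl", ".parquet", ".parq", ".pq")
    autodetect = (".log", ".txt")
    if lowered.endswith(archives):
        return "archive"
    if lowered.endswith(direct):
        return "direct"
    if lowered.endswith(autodetect):
        return "autodetect"
    return "discard"  # acknowledged and dropped
```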

Object size is validated before download using the GCS object metadata. If object_size is set to a positive value, any object whose stored size exceeds the limit is skipped. For archives, object_size limits the compressed download size while max_size_archive_bytes limits the total extracted content size.
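
The pre-download check reduces to a comparison against the size reported in the object metadata, with `0` disabling the limit. A minimal sketch:

```python
def within_size_limit(stored_size: int, object_size_limit: int) -> bool:
    """Check the GCS metadata size against object_size before downloading.
    A limit of 0 disables the check entirely."""
    return object_size_limit == 0 or stored_size <= object_size_limit
```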

The file_name_filter regex is applied to the object name (the key path within the bucket) before download. If the pattern does not compile, the device falls back to the default .* pattern and logs a warning. The pattern is re-evaluated on each config poll interval, so it can be changed without restarting the device.
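
The compile-with-fallback behavior can be sketched as below; the device additionally logs a warning when the fallback fires, which this illustration omits.

```python
import re

DEFAULT_PATTERN = ".*"


def compile_filter(pattern: str) -> re.Pattern:
    """Compile file_name_filter, falling back to the default '.*'
    pattern when the expression does not compile."""
    try:
        return re.compile(pattern)
    except re.error:
        return re.compile(DEFAULT_PATTERN)
```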

Transient processing errors cause the Pub/Sub message to be nacked so Pub/Sub retries delivery. Permanent errors (unsupported format, size limit exceeded, archive constraint violations, file name filter mismatch) cause the message to be acknowledged to prevent infinite redelivery.

Examples

The following examples cover commonly used configurations.

Basic

The minimum required configuration creates the collector using Application Default Credentials:

Creating a basic GCS collector with ADC authentication...

```yaml
devices:
  - id: 1
    name: gcs_basic
    type: gcs
    properties:
      project_id: "my-gcp-project"
      subscription_id: "gcs-notifications-sub"
```

Service Account

A service account credentials JSON can be provided explicitly:

Authenticating with a service account credentials file...

```yaml
devices:
  - id: 2
    name: gcs_service_account
    type: gcs
    properties:
      project_id: "my-gcp-project"
      subscription_id: "gcs-notifications-sub"
      credentials_json: '{"type":"service_account","project_id":"my-gcp-project","private_key_id":"...","private_key":"-----BEGIN RSA PRIVATE KEY-----\n...\n-----END RSA PRIVATE KEY-----\n","client_email":"[email protected]","client_id":"...","auth_uri":"https://accounts.google.com/o/oauth2/auth","token_uri":"https://oauth2.googleapis.com/token"}'
```

Bucket Filter

Ingestion can be restricted to a specific bucket when the subscription covers multiple buckets:

Filtering to a single bucket from a shared subscription...

```yaml
devices:
  - id: 3
    name: gcs_bucket_filter
    type: gcs
    properties:
      project_id: "my-gcp-project"
      subscription_id: "gcs-all-buckets-sub"
      bucket: "security-logs-bucket"
```

File Name Filter

Object names can be filtered by regex to ingest only matching files:

Ingesting only JSON log files from a specific prefix...

```yaml
devices:
  - id: 4
    name: gcs_filtered
    type: gcs
    properties:
      project_id: "my-gcp-project"
      subscription_id: "gcs-notifications-sub"
      bucket: "app-logs-bucket"
      file_name_filter: "^logs/firewall/.*\\.json$"
      object_size: 52428800 # 50 MB download cap
```

Archive Processing

Archive extraction limits prevent oversized archives from consuming excessive resources:

Configuring archive extraction with size and file count limits...

```yaml
devices:
  - id: 5
    name: gcs_archives
    type: gcs
    pipelines:
      - json_parser
    properties:
      project_id: "my-gcp-project"
      subscription_id: "gcs-notifications-sub"
      bucket: "archive-logs-bucket"
      max_files_in_archive: 50
      max_size_archive_bytes: 52428800 # 50 MB total extracted content
      object_size: 10485760 # 10 MB compressed download cap
```