Google Cloud Storage
Synopsis
Creates a collector that subscribes to a Google Cloud Pub/Sub subscription to receive OBJECT_FINALIZE notifications, then downloads and ingests the corresponding objects from Google Cloud Storage.
Schema
```yaml
- id: <numeric>
  name: <string>
  description: <string>
  type: gcs
  tags: <string[]>
  pipelines: <pipeline[]>
  status: <boolean>
  properties:
    project_id: <string>
    subscription_id: <string>
    credentials_json: <string>
    bucket: <string>
    object_size: <numeric>
    file_name_filter: <string>
    max_files_in_archive: <numeric>
    max_size_archive_bytes: <numeric>
    timeout: <numeric>
```
Configuration
The following fields are used to define the device:
Device
| Field | Required | Default | Description |
|---|---|---|---|
| id | Y | - | Unique numeric identifier |
| name | Y | - | Device name |
| description | N | - | Optional description |
| type | Y | - | Must be gcs |
| tags | N | - | Array of labels for categorization |
| pipelines | N | - | Array of preprocessing pipeline references |
| status | N | true | Enable/disable the device |
Connection
| Field | Required | Default | Description |
|---|---|---|---|
| project_id | Y | - | Google Cloud project ID that owns the Pub/Sub subscription |
| subscription_id | Y | - | Pub/Sub subscription ID that receives GCS object notifications |
| credentials_json | N | - | Service account credentials JSON. If omitted, Application Default Credentials are used |
| bucket | N | - | Restrict ingestion to objects from this bucket name. If omitted, all buckets in the subscription are processed |
Object Processing
| Field | Required | Default | Description |
|---|---|---|---|
| object_size | N | 100000000 | Maximum object size in bytes to download. Objects exceeding this limit are skipped. Set to 0 for no limit |
| file_name_filter | N | ".*" | Regular expression applied to the object name before download. Objects not matching the pattern are skipped |
| timeout | N | 10 | Config poll interval in seconds. Accepts values between 1 and 10 |
Archive Extraction
| Field | Required | Default | Description |
|---|---|---|---|
| max_files_in_archive | N | 100 | Maximum number of files extracted from a single archive. Set to 0 for no limit |
| max_size_archive_bytes | N | 104857600 | Maximum total extracted size in bytes across all files in an archive. Set to 0 for no limit |
Details
The device uses two Google Cloud clients concurrently. A Pub/Sub subscriber listens on the configured subscription and receives notifications for each newly written GCS object. Only OBJECT_FINALIZE events are processed; all other event types are acknowledged and discarded. When a matching notification arrives, the GCS client downloads the object content for ingestion.
If bucket is set, notifications referencing a different bucket are acknowledged and skipped without downloading any data. This allows a single Pub/Sub subscription to cover multiple buckets while the device ingests from one specific bucket.
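For reference, a GCS Pub/Sub notification carries the event type, bucket, and object name in its message attributes; the bucket comparison above operates on this information. A sketch of the attributes for a finalize event, with illustrative bucket and object names:

```json
{
  "eventType": "OBJECT_FINALIZE",
  "bucketId": "app-logs",
  "objectId": "logs/2024/app.json",
  "payloadFormat": "JSON_API_V1"
}
```

A notification with any other eventType, or with a bucketId that does not match the configured bucket, is acknowledged without a download.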
Authentication uses the credentials_json field when provided. If the field is empty, the client falls back to Application Default Credentials, which covers Workload Identity, the GOOGLE_APPLICATION_CREDENTIALS environment variable, and gcloud CLI user credentials.
The following file formats are supported directly: .json, .jsonl, .parquet (also .parq, .pq). Files with .log or .txt extensions are auto-detected. Archive formats .gz, .zip, .bz2, .tar, .tgz, .tar.gz, and .tar.bz2 are extracted before ingestion. Files with any other extension are acknowledged and discarded.
Object size is validated before download using the GCS object metadata. If object_size is set to a positive value, any object whose stored size exceeds the limit is skipped. For archives, object_size limits the compressed download size while max_size_archive_bytes limits the total extracted content size.
The file_name_filter regex is applied to the object name (the key path within the bucket) before download. If the pattern does not compile, the device falls back to the default .* pattern and logs a warning. The pattern is re-evaluated on each config poll interval, so it can be changed without restarting the device.
Transient processing errors cause the Pub/Sub message to be nacked so Pub/Sub retries delivery. Permanent errors (unsupported format, size limit exceeded, archive constraint violations, file name filter mismatch) cause the message to be acknowledged to prevent infinite redelivery.
Examples
The following are commonly used configuration types.
Basic
The minimum required configuration creates the collector using Application Default Credentials:
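A minimal sketch following the schema above; the project and subscription names are placeholders:

```yaml
- id: 1
  name: gcs_basic
  type: gcs
  properties:
    project_id: "my-project"
    subscription_id: "gcs-object-notifications"
```

With no credentials_json set, the collector authenticates via Application Default Credentials.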
Service Account
A service account credentials JSON can be provided explicitly:
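A sketch using illustrative names; the credentials JSON is inlined as a YAML block scalar, with the remaining service account key fields elided:

```yaml
- id: 2
  name: gcs_service_account
  type: gcs
  properties:
    project_id: "my-project"
    subscription_id: "gcs-object-notifications"
    credentials_json: |
      {"type": "service_account", "project_id": "my-project", ...}
```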
Bucket Filter
Ingestion can be restricted to a specific bucket when the subscription covers multiple buckets:
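A sketch with placeholder names; only objects from the app-logs bucket are downloaded, and notifications for other buckets on the shared subscription are acknowledged and skipped:

```yaml
- id: 3
  name: gcs_single_bucket
  type: gcs
  properties:
    project_id: "my-project"
    subscription_id: "shared-notifications"
    bucket: "app-logs"
```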
File Name Filter
Object names can be filtered by regex to ingest only matching files:
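A sketch with placeholder names; the regex matches only .json objects under a logs/ prefix, and everything else is skipped before download:

```yaml
- id: 4
  name: gcs_json_logs
  type: gcs
  properties:
    project_id: "my-project"
    subscription_id: "gcs-object-notifications"
    file_name_filter: "^logs/.*\\.json$"
```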
Archive Processing
Archive extraction limits prevent oversized archives from consuming excessive resources:
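A sketch with placeholder names; this caps the compressed download at 200 MiB (object_size), extraction at 50 files per archive, and the total extracted content at 50 MiB:

```yaml
- id: 5
  name: gcs_archives
  type: gcs
  properties:
    project_id: "my-project"
    subscription_id: "gcs-object-notifications"
    object_size: 209715200
    max_files_in_archive: 50
    max_size_archive_bytes: 52428800
```

Archives exceeding any of these limits are acknowledged and discarded rather than retried.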