Google Cloud Storage
Synopsis
Creates a target that writes log messages to Google Cloud Storage buckets with support for various file formats, authentication methods, and multipart uploads. The target handles large file uploads efficiently with configurable rotation based on size or event count.
Schema
- name: <string>
  description: <string>
  type: gcpstorage
  pipelines: <pipeline[]>
  status: <boolean>
  properties:
    credentials: <string>
    project_id: <string>
    bucket: <string>
    buckets:
      - bucket: <string>
        name: <string>
        format: <string>
        compression: <string>
        extension: <string>
        schema: <string>
    name: <string>
    format: <string>
    compression: <string>
    extension: <string>
    schema: <string>
    max_size: <numeric>
    batch_size: <numeric>
    timeout: <numeric>
    field_format: <string>
    interval: <string|numeric>
    cron: <string>
    debug:
      status: <boolean>
      dont_send_logs: <boolean>
Configuration
The following fields are used to define the target:
| Field | Required | Default | Description |
|---|---|---|---|
| name | Y | - | Target name |
| description | N | - | Optional description |
| type | Y | - | Must be gcpstorage |
| pipelines | N | - | Optional post-processor pipelines |
| status | N | true | Enable/disable the target |
Google Cloud Storage Credentials
| Field | Required | Default | Description |
|---|---|---|---|
| credentials | N | - | Service account credentials JSON. Uses Application Default Credentials if not provided |
| project_id | Y | - | Google Cloud project ID |
Connection
| Field | Required | Default | Description |
|---|---|---|---|
| timeout | N | 30 | Connection timeout in seconds |
| field_format | N | - | Data normalization format. See applicable Normalization section |
Files
| Field | Required | Default | Description |
|---|---|---|---|
| bucket | N* | - | Default GCS bucket name (acts as catch-all when buckets is also specified) |
| buckets | N* | - | Array of bucket configurations for file distribution |
| buckets.bucket | Y | - | GCS bucket name |
| buckets.name | Y | - | File name template |
| buckets.format | N | "json" | Output format: json, multijson, avro, parquet |
| buckets.compression | N | - | Compression algorithm. See Compression below |
| buckets.extension | N | Matches format | File extension override |
| buckets.schema | N** | - | Schema definition file path (required for Avro and Parquet formats) |
| name | N | "vmetric.{{.Timestamp}}.{{.Extension}}" | Default file name template (used with bucket for catch-all) |
| format | N | "json" | Default output format (used with bucket for catch-all) |
| compression | N | - | Default compression (used with bucket for catch-all) |
| extension | N | Matches format | Default file extension (used with bucket for catch-all) |
| schema | N | - | Default schema path (used with bucket for catch-all) |
| max_size | N | 0 | Maximum file size in bytes before rotation |
| batch_size | N | 100000 | Maximum number of messages per file |
* = Either bucket or buckets must be specified.
** = Conditionally required for Avro and Parquet formats when using buckets.
When max_size is reached, the current file is uploaded to GCS and a new file is created. For unlimited file size, set the field to 0.
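For illustration, the following sketch combines both limits; the bucket name and thresholds are illustrative, and rotation is expected to occur when either limit is reached:

targets:
  - name: rotated_gcs
    type: gcpstorage
    properties:
      project_id: "my-project-123456"
      bucket: "rotated-logs"             # illustrative bucket name
      name: "logs-{{.Timestamp}}.json"
      format: "json"
      max_size: 268435456                # rotate after roughly 256 MB
      batch_size: 50000                  # or after 50,000 messages per file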
Scheduler
| Field | Required | Default | Description |
|---|---|---|---|
| interval | N | realtime | Execution frequency. See Interval for details |
| cron | N | - | Cron expression for scheduled execution. See Cron for details |
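As a sketch, a target can be switched from realtime to scheduled delivery with either field; the values below are illustrative:

targets:
  - name: scheduled_gcs
    type: gcpstorage
    properties:
      project_id: "my-project-123456"
      bucket: "scheduled-logs"           # illustrative bucket name
      name: "logs-{{.Timestamp}}.json"
      format: "json"
      interval: 300                      # numeric interval (see the Interval section for units)
      # cron: "0 * * * *"                # alternatively, a cron expression (see the Cron section)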
Debug Options
| Field | Required | Default | Description |
|---|---|---|---|
| debug.status | N | false | Enable debug logging |
| debug.dont_send_logs | N | false | Process logs but don't send to target (testing) |
Details
The Google Cloud Storage target writes log data to GCS buckets in JSON, multi-JSON, Avro, or Parquet format, with optional compression. GCS itself offers very high durability (99.999999999%), strong read-after-write consistency, and integration with Google Cloud's security and analytics ecosystem.
Authentication Methods
The target accepts service account credentials as JSON via the credentials field. When it runs on Google Cloud infrastructure, the credentials field can be omitted and Application Default Credentials are used instead.
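For example, on a Compute Engine instance or GKE workload with an attached service account, a minimal configuration can rely entirely on Application Default Credentials (project and bucket names below are illustrative):

targets:
  - name: adc_gcs
    type: gcpstorage
    properties:
      project_id: "my-project-123456"    # illustrative project ID
      bucket: "datastream-logs"
      # no credentials field: Application Default Credentials are used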
IAM Permissions
The service account requires the following IAM role:
| IAM Role | Role ID | Purpose |
|---|---|---|
| Storage Object Creator | roles/storage.objectCreator | Upload (create) objects in GCS buckets |
Minimum permissions: storage.objects.create
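As a sketch, the binding can be expressed as a standard IAM policy in YAML (for example, a file passed to gcloud's set-iam-policy command); the service account email shown is hypothetical:

# Bucket-level IAM binding granting only object creation.
bindings:
  - role: roles/storage.objectCreator
    members:
      - serviceAccount:datastream-writer@my-project-123456.iam.gserviceaccount.com   # hypothetical account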
Storage Classes
Google Cloud Storage supports multiple storage classes for cost optimization:
| Storage Class | Use Case |
|---|---|
| Standard | Frequently accessed data |
| Nearline | Data accessed less than once per month |
| Coldline | Data accessed less than once per quarter |
| Archive | Data accessed less than once per year |
Available Regions
Google Cloud Storage is available in multiple regions worldwide:
| Region Code | Location |
|---|---|
| us-central1 | Iowa, USA |
| us-east1 | South Carolina, USA |
| us-west1 | Oregon, USA |
| europe-west1 | Belgium |
| europe-west2 | London, UK |
| europe-west3 | Frankfurt, Germany |
| asia-east1 | Taiwan |
| asia-northeast1 | Tokyo, Japan |
| asia-southeast1 | Singapore |
| australia-southeast1 | Sydney, Australia |
Templates
The following template variables can be used in file names:
| Variable | Description | Example |
|---|---|---|
| {{.Year}} | Current year | 2024 |
| {{.Month}} | Current month | 01 |
| {{.Day}} | Current day | 15 |
| {{.Timestamp}} | Current timestamp in nanoseconds | 1703688533123456789 |
| {{.Format}} | File format | json |
| {{.Extension}} | File extension | json |
| {{.Compression}} | Compression type | zstd |
| {{.TargetName}} | Target name | my_logs |
| {{.TargetType}} | Target type | gcpstorage |
| {{.Table}} | Bucket name | logs |
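For instance, using the example values from the table above, a date-partitioned template such as the one below (bucket name illustrative) would produce object names like logs/2024/01/15/events-1703688533123456789.json:

targets:
  - name: partitioned_gcs
    type: gcpstorage
    properties:
      project_id: "my-project-123456"
      bucket: "partitioned-logs"         # illustrative bucket name
      name: "logs/{{.Year}}/{{.Month}}/{{.Day}}/events-{{.Timestamp}}.{{.Extension}}"
      format: "json"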
Multiple Buckets
A single target can write to multiple GCS buckets with different configurations, enabling data distribution strategies (e.g. raw data to one bucket, processed data to another); see the multi-bucket examples below.
Schema Requirements
Avro and Parquet formats require schema definition files. Schema files must be accessible at the path specified in the schema parameter during target initialization.
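As a sketch, a Parquet bucket entry can reference its schema by file path; the path below is hypothetical and must exist when the target starts:

targets:
  - name: parquet_schema_example
    type: gcpstorage
    properties:
      project_id: "my-project-123456"
      buckets:
        - bucket: "analytics-data"
          name: "analytics-{{.Timestamp}}.parquet"
          format: "parquet"
          schema: "/etc/datastream/schemas/events.schema.json"   # hypothetical schema file path
          compression: "snappy"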
Integration with Google Cloud
GCS integrates seamlessly with other Google Cloud services including BigQuery for analytics, Cloud Functions for serverless processing, and Cloud Logging for centralized logging.
Examples
Basic Configuration
The minimum configuration for a JSON GCS target:
targets:
  - name: basic_gcs
    type: gcpstorage
    properties:
      project_id: "my-project-123456"
      bucket: "datastream-logs"
Service Account Authentication
Configuration with explicit service account credentials:
targets:
  - name: gcs_service_account
    type: gcpstorage
    properties:
      project_id: "my-project-123456"
      credentials: |
        {
          "type": "service_account",
          "project_id": "my-project-123456",
          "private_key_id": "key-id",
          "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
          "client_email": "datastream@my-project-123456.iam.gserviceaccount.com",
          "client_id": "123456789",
          "auth_uri": "https://accounts.google.com/o/oauth2/auth",
          "token_uri": "https://oauth2.googleapis.com/token"
        }
      bucket: "datastream-logs"
Pipeline-Based Routing
Dynamic bucket routing using pipeline processors to analyze log content and route to appropriate buckets:
targets:
  - name: smart_routing_gcs
    type: gcpstorage
    pipelines:
      - dynamic_routing
    properties:
      project_id: "my-project-123456"
      credentials: "${GCP_CREDENTIALS_JSON}"
      buckets:
        - bucket: "security-events"
          name: "security-{{.Year}}-{{.Month}}-{{.Day}}.json"
          format: "json"
        - bucket: "application-events"
          name: "app-{{.Year}}-{{.Month}}-{{.Day}}.json"
          format: "json"
        - bucket: "system-events"
          name: "system-{{.Year}}-{{.Month}}-{{.Day}}.json"
          format: "json"
      bucket: "other-events"
      name: "other-{{.Timestamp}}.json"
      format: "json"

pipelines:
  - name: dynamic_routing
    processors:
      - set:
          field: "_vmetric.bucket"
          value: "security-events"
          if: "ctx.event_type == 'security'"
      - set:
          field: "_vmetric.bucket"
          value: "application-events"
          if: "ctx.event_type == 'application'"
      - set:
          field: "_vmetric.bucket"
          value: "system-events"
          if: "ctx.event_type == 'system'"
Multiple Buckets with Catch-All
Configuration for routing different log types to specific buckets with a catch-all for unmatched logs:
targets:
  - name: multi_bucket_routing
    type: gcpstorage
    properties:
      project_id: "my-project-123456"
      credentials: "${GCP_CREDENTIALS_JSON}"
      buckets:
        - bucket: "security-logs"
          name: "security-{{.Year}}-{{.Month}}-{{.Day}}.json"
          format: "json"
        - bucket: "application-logs"
          name: "app-{{.Year}}-{{.Month}}-{{.Day}}.json"
          format: "json"
      bucket: "general-logs"
      name: "general-{{.Timestamp}}.json"
      format: "json"
Multiple Buckets with Different Formats
Configuration for distributing data across multiple GCS buckets with different formats:
targets:
  - name: multi_bucket_export
    type: gcpstorage
    properties:
      project_id: "my-project-123456"
      credentials: "${GCP_CREDENTIALS_JSON}"
      buckets:
        - bucket: "raw-data-archive"
          name: "raw-{{.Year}}-{{.Month}}-{{.Day}}.json"
          format: "multijson"
          compression: "gzip"
        - bucket: "analytics-data"
          name: "analytics-{{.Year}}/{{.Month}}/{{.Day}}/data_{{.Timestamp}}.parquet"
          format: "parquet"
          schema: "<schema definition>"
          compression: "snappy"
Parquet Format
Configuration for daily partitioned Parquet files:
targets:
  - name: parquet_analytics
    type: gcpstorage
    properties:
      project_id: "my-project-123456"
      credentials: "${GCP_CREDENTIALS_JSON}"
      bucket: "analytics-lake"
      name: "events/year={{.Year}}/month={{.Month}}/day={{.Day}}/part-{{.Timestamp}}.parquet"
      format: "parquet"
      schema: "<schema definition>"
      compression: "snappy"
      max_size: 536870912
High Reliability
Configuration with a checkpoint pipeline and an extended connection timeout for dependable delivery:
targets:
  - name: reliable_gcs
    type: gcpstorage
    pipelines:
      - checkpoint
    properties:
      project_id: "my-project-123456"
      credentials: "${GCP_CREDENTIALS_JSON}"
      bucket: "critical-logs"
      name: "logs-{{.Timestamp}}.json"
      format: "json"
      timeout: 60
With Field Normalization
Using field normalization to write logs in the CIM standard format:
targets:
  - name: normalized_gcs
    type: gcpstorage
    properties:
      project_id: "my-project-123456"
      credentials: "${GCP_CREDENTIALS_JSON}"
      bucket: "normalized-logs"
      name: "logs-{{.Timestamp}}.json"
      format: "json"
      field_format: "cim"
BigQuery Integration
Configuration optimized for staging data to be loaded into BigQuery:
targets:
  - name: bigquery_ready
    type: gcpstorage
    properties:
      project_id: "my-project-123456"
      credentials: "${GCP_CREDENTIALS_JSON}"
      bucket: "bigquery-staging"
      name: "bq-import/{{.Year}}/{{.Month}}/{{.Day}}/data-{{.Timestamp}}.json"
      format: "json"
      compression: "gzip"
      max_size: 1073741824
Debug Configuration
Configuration with debugging enabled:
targets:
  - name: debug_gcs
    type: gcpstorage
    properties:
      project_id: "my-project-123456"
      credentials: "${GCP_CREDENTIALS_JSON}"
      bucket: "test-logs"
      name: "test-{{.Timestamp}}.json"
      format: "json"
      debug:
        status: true
        dont_send_logs: true