Data Warehouse Sync

If you'd like to export Customer.io data to your Data Warehouse, one option is to use our Data Warehouse Sync functionality. (Another is to use Reporting Webhooks and/or a product like Stitch.)

Our Data Warehouse Sync exports to an Amazon S3 or Google Cloud Storage bucket of your choosing. Exports run every 15 minutes by default, though a different interval can be configured.

The exported data comes in Parquet format. Once the Parquet files are in your S3 or GCS bucket, you can import them into data integration tools like Fivetran or data warehouses like Redshift, BigQuery, and Snowflake.
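
For example, here's a minimal sketch of a one-time BigQuery load of exported deliveries files from a GCS bucket, using Google's Go client library. The project, bucket, dataset, and table names are placeholders, and the wildcard assumes the file naming convention described in the FAQ below.

    package main

    import (
        "context"
        "log"

        "cloud.google.com/go/bigquery"
    )

    func main() {
        ctx := context.Background()
        client, err := bigquery.NewClient(ctx, "my-project") // hypothetical project ID
        if err != nil {
            log.Fatal(err)
        }
        defer client.Close()

        // The wildcard matches every deliveries export for workspace 12345.
        gcsRef := bigquery.NewGCSReference("gs://my-cio-bucket/deliveries_12345_*") // hypothetical bucket
        gcsRef.SourceFormat = bigquery.Parquet

        loader := client.Dataset("customerio").Table("deliveries").LoaderFrom(gcsRef)
        loader.WriteDisposition = bigquery.WriteAppend // append to any existing rows

        job, err := loader.Run(ctx)
        if err != nil {
            log.Fatal(err)
        }
        status, err := job.Wait(ctx)
        if err != nil {
            log.Fatal(err)
        }
        if status.Err() != nil {
            log.Fatal(status.Err())
        }
    }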

The following data is included in our export:

  • Deliveries. Email, push, SMS, Slack, and webhook records.
  • Metrics. The various events that occur to each delivery (e.g. sent, opened).
  • Subjects. The unique workflow journeys that people take through Campaigns and API Triggered Broadcasts.
  • Outputs. The unique steps within each workflow journey.
  • People. For compliance purposes, a list of people IDs, as well as whether a person has been deleted or suppressed.

On first sync, all historical data is backfilled except for People. The People parquet file represents everyone who exists as of the first sync; people deleted or suppressed before that point are not included.

Data Schema and Definitions

Deliveries

Field Name | Description
workspace_id | INTEGER (Required). The ID of the Customer.io workspace associated with the delivery record.
delivery_id | STRING (Required). The ID of the delivery record.
internal_customer_id | STRING (Nullable). The ID of the person in question. Use the people parquet file to resolve this ID to your customer_id.
subject_id | STRING (Nullable). If the delivery was created as part of a Campaign or API Triggered Broadcast workflow, this is the ID for the path the person went through in the workflow.
event_id | STRING (Nullable). If the delivery was created as part of an event-triggered Campaign, this is the ID for the unique event that triggered the workflow.
delivery_type | STRING (Required). The type of delivery: email, push, sms, slack, or webhook.
campaign_id | INTEGER (Nullable). If the delivery was created as part of a Campaign or API Triggered Broadcast workflow, this is the ID for the Campaign or API Triggered Broadcast.
action_id | INTEGER (Nullable). If the delivery was created as part of a Campaign or API Triggered Broadcast workflow, this is the ID for the unique workflow item that caused the delivery to be created.
newsletter_id | INTEGER (Nullable). If the delivery was created as part of a Newsletter, this is the unique ID of that Newsletter.
content_id | INTEGER (Nullable). If the delivery was created as part of a Newsletter split test, this is the unique ID of the Newsletter variant.
trigger_id | INTEGER (Nullable). If the delivery was created as part of an API Triggered Broadcast, this is the unique trigger ID associated with the API call that triggered the broadcast.
created_at | TIMESTAMP (Required). The timestamp the delivery was created at.

Metrics

Field Name | Description
event_id | STRING (Required). The unique ID of the metric event. This can be useful for deduplicating purposes.
workspace_id | INTEGER (Required). The ID of the Customer.io workspace associated with the metric record.
delivery_id | STRING (Required). The ID of the delivery record.
metric | STRING (Required). The type of metric (e.g. sent, delivered, opened, clicked).
reason | STRING (Nullable). For certain metrics (e.g. attempted), the reason behind the action.
link_id | INTEGER (Nullable). For "clicked" metrics, the unique ID of the link being clicked.
link_url | STRING (Nullable). For "clicked" metrics, the URL of the clicked link. (Truncated to 1000 bytes.)
created_at | TIMESTAMP (Required). The timestamp the metric was created at.
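
Since event_id is noted above as useful for deduplication, one approach after import is to keep exactly one row per event_id. A minimal sketch in Go against BigQuery, assuming the metrics files have been loaded into a (hypothetical) customerio.metrics table:

    package main

    import (
        "context"
        "log"

        "cloud.google.com/go/bigquery"
    )

    func main() {
        ctx := context.Background()
        client, err := bigquery.NewClient(ctx, "my-project") // hypothetical project ID
        if err != nil {
            log.Fatal(err)
        }
        defer client.Close()

        // Keep exactly one row per event_id, preferring the earliest created_at.
        q := client.Query(`
            CREATE OR REPLACE TABLE customerio.metrics_deduped AS
            SELECT * EXCEPT (rn) FROM (
              SELECT *, ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY created_at) AS rn
              FROM customerio.metrics
            )
            WHERE rn = 1`)
        job, err := q.Run(ctx)
        if err != nil {
            log.Fatal(err)
        }
        status, err := job.Wait(ctx)
        if err != nil {
            log.Fatal(err)
        }
        if status.Err() != nil {
            log.Fatal(status.Err())
        }
    }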

Subjects

Field Name | Description
workspace_id | INTEGER (Required). The ID of the Customer.io workspace associated with the subject record.
subject_id | STRING (Required). The ID for the path a person went through in a Campaign or API Triggered Broadcast workflow.
internal_customer_id | STRING (Nullable). The ID of the person in question. Use the people parquet file to resolve this ID to your customer_id.
campaign_type | STRING (Required). The type of Campaign (segment, event, or triggered_broadcast).
campaign_id | INTEGER (Required). The ID of the Campaign or API Triggered Broadcast.
event_id | STRING (Required). The ID for the unique event that triggered the workflow.
trigger_id | INTEGER (Required). If the subject was created as part of an API Triggered Broadcast, this is the unique trigger ID associated with the API call that triggered the broadcast.
started_campaign_at | TIMESTAMP (Required). The timestamp when the person first matched the campaign trigger. For event-triggered campaigns, this is the timestamp of the trigger event. For segment-triggered campaigns, this is the time the user entered the segment.
created_at | TIMESTAMP (Required). The timestamp the subject was created at.

Outputs

Field Name | Description
workspace_id | INTEGER (Required). The ID of the Customer.io workspace associated with the output record.
output_id | STRING (Required). The ID for the step of the unique path a person went through in a Campaign or API Triggered Broadcast workflow.
subject_id | STRING (Required). The ID for the path a person went through in a Campaign or API Triggered Broadcast workflow.
output_type | STRING (Required). The type of step a person went through in a Campaign or API Triggered Broadcast workflow. Note that the "delay" output_type covers many use cases: a Time Delay or Time Window workflow item, a "grace period", or a Date-based campaign trigger.
action_id | INTEGER (Required). The ID for the unique workflow item associated with the output.
explanation | STRING (Required). The explanation for the output.
delivery_id | STRING (Nullable). If a delivery resulted from this step of the workflow, this is the ID of that delivery.
draft | BOOLEAN (Nullable). If a delivery resulted from this step of the workflow, this indicates whether the delivery was created as a draft.
link_tracked | BOOLEAN (Nullable). If a delivery resulted from this step of the workflow, this indicates whether links within the delivery are configured for tracking.
split_test_index | INTEGER (Nullable). If the step of the workflow was a Split Test, this indicates the variant of the Split Test.
delay_ends_at | TIMESTAMP (Nullable). If the step of the workflow involves a delay, this is the timestamp for when the delay will end.
branch_index | INTEGER (Nullable). If the step of the workflow was a T/F Branch, a Multi-Split Branch, or a Random Cohort Branch, this indicates the branch that was followed.
manual_segment_id | INTEGER (Nullable). If the step of the workflow was a Manual Segment Update, this is the ID of the Manual Segment involved.
add_to_manual_segment | BOOLEAN (Nullable). If the step of the workflow was a Manual Segment Update, this indicates whether a person was added to or removed from the Manual Segment involved.
created_at | TIMESTAMP (Required). The timestamp the output was created at.

People

Field Name | Description
workspace_id | INTEGER (Required). The ID of the Customer.io workspace associated with the person.
customer_id | STRING (Required). The ID of the person in question. This will match the ID you see in the Customer.io UI.
internal_customer_id | STRING (Required). The internal ID of the person in question.
deleted | BOOLEAN (Nullable). This indicates whether the person has been deleted.
suppressed | BOOLEAN (Nullable). This indicates whether the person has been suppressed.
created_at | TIMESTAMP (Required). The timestamp the profile was created at. Note that this is the timestamp the profile was created in Customer.io's database, which may not match the created_at profile attribute.
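
Because delivery and subject records reference people only by internal_customer_id, a join against this table resolves them to your own customer_id. Here's a sketch of such a query in Go, again assuming (hypothetical) customerio.deliveries and customerio.people tables in BigQuery:

    package main

    import (
        "context"
        "fmt"
        "log"

        "cloud.google.com/go/bigquery"
        "google.golang.org/api/iterator"
    )

    func main() {
        ctx := context.Background()
        client, err := bigquery.NewClient(ctx, "my-project") // hypothetical project ID
        if err != nil {
            log.Fatal(err)
        }
        defer client.Close()

        // Resolve each delivery to your customer_id, skipping deleted or suppressed profiles.
        q := client.Query(`
            SELECT d.delivery_id, d.delivery_type, d.created_at, p.customer_id
            FROM customerio.deliveries AS d
            JOIN customerio.people AS p
              ON d.internal_customer_id = p.internal_customer_id
             AND d.workspace_id = p.workspace_id
            WHERE p.deleted IS NOT TRUE AND p.suppressed IS NOT TRUE`)
        it, err := q.Read(ctx)
        if err != nil {
            log.Fatal(err)
        }
        for {
            var row []bigquery.Value
            err := it.Next(&row)
            if err == iterator.Done {
                break
            }
            if err != nil {
                log.Fatal(err)
            }
            fmt.Println(row)
        }
    }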

Getting Started

To configure a Google Cloud Storage Bucket:

  1. Create a service account key in JSON format that grants read/write permissions to the GCS bucket you want us to export to.

  2. Log in to Customer.io.

  3. Navigate to Integrations on the left sidebar, then search for whichever service you want to sync to (e.g. BigQuery or Google Cloud Storage).

  4. Choose the desired option in the list, then select Get Started on the next page.

  5. Follow the instructions to submit your configuration request. You'll need:

    • The name of your GCS bucket
    • The path to your GCS bucket
    • The JSON of the service account key you created (please paste this in)
    • The ID(s) of the Customer.io workspace(s) whose data you want synced. You can find a Workspace ID by navigating to the workspace in question, then copying the number from your browser's URL:
      [Screenshot: Workspace ID in the browser URL]

Configuring a Data Warehouse Sync usually takes 1-2 business days. You'll be emailed when configuration is complete.

To configure an Amazon S3 Bucket:

  1. Create an Access Key and a Secret Key with read/write permissions to the S3 bucket you want us to export to.

  2. Log in to Customer.io.

  3. Navigate to Integrations on the left sidebar, then search for whichever service you want to sync to (e.g. Redshift or S3).

  4. Choose the desired option in the list, then select Get Started on the next page.

  5. Follow the instructions to submit your configuration request. You'll need:

    • The name of your S3 bucket
    • The path to your S3 bucket
    • The Access Key (please paste this in)
    • The Secret Key (please paste this in)
    • The AWS Region your S3 bucket is located in
    • The ID(s) of the Customer.io workspace(s) whose data you want synced. You can find a Workspace ID by navigating to the workspace in question, then copying the number from your browser's URL:
      [Screenshot: Workspace ID in the browser URL]

Configuring a Data Warehouse Sync usually takes 1-2 business days. You'll be emailed when configuration is complete.

FAQ

1. How are the exported parquet files organized?

Each parquet file is named <name>_<workspace_id>_<sequence>.

  • <name> is one of deliveries, metrics, subjects, outputs, or people.
  • <workspace_id> refers to the Customer.io workspace whose data is included.
  • <sequence> is an ever-increasing value over time.
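
If you route files programmatically, the name can simply be split on underscores, since none of the five table names contain one. A small sketch in Go:

    package main

    import (
        "fmt"
        "log"
        "strings"
    )

    // exportFile holds the three components of an export file name:
    // <name>_<workspace_id>_<sequence>.
    type exportFile struct {
        Name        string // deliveries, metrics, subjects, outputs, or people
        WorkspaceID string
        Sequence    string
    }

    func parseExportName(s string) (exportFile, error) {
        parts := strings.SplitN(s, "_", 3)
        if len(parts) != 3 {
            return exportFile{}, fmt.Errorf("unexpected export file name: %q", s)
        }
        return exportFile{Name: parts[0], WorkspaceID: parts[1], Sequence: parts[2]}, nil
    }

    func main() {
        f, err := parseExportName("deliveries_12345_42") // hypothetical file name
        if err != nil {
            log.Fatal(err)
        }
        fmt.Printf("%+v\n", f)
    }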

2. How do you handle export failures?

If we experience an internal failure, we monitor and repair it promptly. If there's an external failure such as a networking problem, we retry the export. If there's an issue with bucket access, we'll reach out to you with details.

In each retry case, the next export will contain all data since the prior export. No data will be lost.

3. How should I import data from my S3 or GCS bucket to my data warehouse?

There are many approaches to this, but here's one example we've seen work for moving data from a GCS bucket to BigQuery.

  • Implement a Cloud Function to automatically import the parquet files from your GCS bucket to a BigQuery table.
  • Set an expiration on the parquet files so they get automatically deleted (a sketch follows this list).
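
For the expiration step, a GCS lifecycle rule handles the cleanup automatically. A minimal sketch using Google's Go storage client; the bucket name and the 7-day window are placeholder choices:

    package main

    import (
        "context"
        "log"

        "cloud.google.com/go/storage"
    )

    func main() {
        ctx := context.Background()
        client, err := storage.NewClient(ctx)
        if err != nil {
            log.Fatal(err)
        }
        defer client.Close()

        // Delete exported files 7 days after creation (an arbitrary window;
        // pick one comfortably longer than your import cadence).
        _, err = client.Bucket("my-cio-bucket").Update(ctx, storage.BucketAttrsToUpdate{ // hypothetical bucket
            Lifecycle: &storage.Lifecycle{
                Rules: []storage.LifecycleRule{{
                    Action:    storage.LifecycleAction{Type: storage.DeleteAction},
                    Condition: storage.LifecycleCondition{AgeInDays: 7},
                }},
            },
        })
        if err != nil {
            log.Fatal(err)
        }
    }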

Here's a screenshot of an example Cloud Function:

[Screenshot: example BigQuery Cloud Function]

You can download sample code for this Cloud Function below:

main.go
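
If you'd rather start from scratch, here's a rough sketch of what such a GCS-triggered function might look like in Go. It is not the downloadable sample above; the project and dataset names are placeholders, and it assumes the file naming convention from FAQ 1:

    // Package p contains a GCS-triggered Cloud Function. (Cloud Functions
    // requires a non-main package.)
    package p

    import (
        "context"
        "fmt"
        "strings"

        "cloud.google.com/go/bigquery"
    )

    // GCSEvent is the payload of a GCS "object finalized" event.
    type GCSEvent struct {
        Bucket string `json:"bucket"`
        Name   string `json:"name"`
    }

    // ImportParquet loads a newly exported parquet file into a BigQuery table
    // named after its export type (deliveries, metrics, subjects, outputs, people).
    func ImportParquet(ctx context.Context, e GCSEvent) error {
        client, err := bigquery.NewClient(ctx, "my-project") // hypothetical project ID
        if err != nil {
            return fmt.Errorf("bigquery.NewClient: %v", err)
        }
        defer client.Close()

        // File names follow <name>_<workspace_id>_<sequence>; route on <name>.
        table := strings.SplitN(e.Name, "_", 2)[0]

        gcsRef := bigquery.NewGCSReference(fmt.Sprintf("gs://%s/%s", e.Bucket, e.Name))
        gcsRef.SourceFormat = bigquery.Parquet

        loader := client.Dataset("customerio").Table(table).LoaderFrom(gcsRef) // hypothetical dataset
        loader.WriteDisposition = bigquery.WriteAppend

        job, err := loader.Run(ctx)
        if err != nil {
            return err
        }
        status, err := job.Wait(ctx)
        if err != nil {
            return err
        }
        return status.Err()
    }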

4. How can I get information about a campaign, workflow action, or message that's not included in the export?

With the data in our export, it's possible to query our API to pull any extra information you need.

For example, calling https://beta-api.customer.io/v1/api/campaigns/:id with the id of the campaign from a subject record will give you details about the campaign.
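
For instance, here's a sketch of that call in Go, assuming Bearer-token authentication; the API key and campaign ID are placeholders:

    package main

    import (
        "fmt"
        "io"
        "log"
        "net/http"
    )

    func main() {
        campaignID := 42 // use the campaign_id from a subject record
        url := fmt.Sprintf("https://beta-api.customer.io/v1/api/campaigns/%d", campaignID)

        req, err := http.NewRequest("GET", url, nil)
        if err != nil {
            log.Fatal(err)
        }
        req.Header.Set("Authorization", "Bearer YOUR_API_KEY") // placeholder credential

        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()

        body, err := io.ReadAll(resp.Body)
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(string(body)) // JSON details about the campaign
    }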

Check out our full API docs here.
