Data Warehouse Sync

If you want to sync all historical and future data from Customer.io to your data warehouse, you can use our Data Warehouse Sync feature. For a limited, future-only set of data, you can use Reporting Webhooks and/or a product like Stitch.

Our Data Warehouse Sync exports individual parquet files to an Amazon S3 or Google Cloud Storage bucket that you host. We attempt to sync every 15 minutes, though actual sync intervals and processing times may vary. When syncing large data sets, or when Customer.io experiences a high volume of concurrent sync operations, it can take up to several hours to process and export data. This feature is not intended to sync data in real time.

The exported parquet files include the following types of Customer.io data: Deliveries, Metrics, Subjects, Outputs, People, and Attributes (see definitions below). Each file will contain new rows for any data that has changed since the last export.

During the first sync, complete historical data is backfilled for Deliveries, Metrics, Subjects, and Outputs. However, any People who were deleted or suppressed prior to the first sync will NOT be included in the People file export, and the historical data in the other export files will be anonymized for those deleted and suppressed People.

Once the parquet files are in your S3 or GCS bucket, you can import them into data platforms like Fivetran or data warehouses like Redshift, BigQuery, and Snowflake.
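
For example, here's a minimal Snowflake sketch for loading the deliveries files; the stage name cio_sync and the table name deliveries are placeholder assumptions, not part of the product:

-- Load every deliveries parquet file from an external stage pointed at your bucket,
-- matching parquet columns to table columns by name.
copy into deliveries
  from @cio_sync
  pattern = '.*deliveries_v2_.*[.]parquet'
  file_format = (type = parquet)
  match_by_column_name = case_insensitive;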

Data Definitions:

  • Deliveries. Email, push, SMS, Slack, and webhook records.
  • Metrics. The various events that occur to each delivery (e.g. sent, opened).
  • Subjects. The unique workflow journeys that people take through Campaigns and API Triggered Broadcasts.
  • Outputs. The unique steps within each workflow journey.
  • People. For compliance purposes, a list of people IDs, as well as whether a person has been deleted or suppressed.
  • Attributes. A record of changes to profile attributes over time.

Data warehouse v2 changes

We've updated the data warehouse feature to v2. This update includes:

  • Support for transactional messages (as transactional_message_id) in the Deliveries schema.
  • A fix for an issue that caused missing rows in Subjects and Outputs data. As a result of this bug, data warehouse v1 no longer supports Subjects or Outputs data.

If you used our initial data warehouse release, we recommend updating to the v2 implementation. However, you can continue using our original data warehouse sync feature if you don't use:

  • Subjects and/or Outputs data
  • Transactional messages

Update from data warehouse sync v1 to v2

  1. Update your ingestion logic to recognize new v2 filenames.
    • v1: <name>_<workspace_id>_<sequence>.parquet
    • v2: <name>_v2_<workspace_id>_<sequence>.parquet
  2. Update your ingestion logic to accept the new fields for Subjects, Outputs, and Deliveries. For Subjects and Outputs, the schemas are similar, but the formats of the subject_id and output_id fields have changed.
  3. Truncate your historical Subjects and Outputs tables (i.e. delete any existing rows; see the example after these steps). When you update, we will send you all of your Subjects and Outputs data from the beginning of your history using the new schema.
  4. Contact Customer.io and confirm that you're ready to update to v2.
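
For step 3, the truncation itself is a single statement per table in most warehouses; a sketch, assuming your tables are simply named subjects and outputs:

-- Remove all historical rows so the v2 backfill starts from a clean slate.
truncate table subjects;
truncate table outputs;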

Data Schema and Definitions

All schemas and definitions are based on our v2 update. If using data warehouse v1, you will not receive exports for Subjects or Outputs, and the Deliveries schema will not include transactional_message_id.

Deliveries

After sync is enabled, the first deliveries export file will include a backfill of historical data. Subsequent files will contain rows for any data that has changed since the last export.

  • workspace_id. INTEGER (Required). The ID of the Customer.io workspace associated with the delivery record.
  • delivery_id. STRING (Required). The ID of the delivery record.
  • internal_customer_id. STRING (Nullable). The ID of the person in question. Use the people parquet file to resolve this ID to your customer_id.
  • subject_id. STRING (Nullable). If the delivery was created as part of a Campaign or API Triggered Broadcast workflow, this is the ID for the path the person went through in the workflow.
  • event_id. STRING (Nullable). If the delivery was created as part of an event-triggered Campaign, this is the ID for the unique event that triggered the workflow.
  • delivery_type. STRING (Required). The type of delivery: email, push, sms, slack, or webhook.
  • campaign_id. INTEGER (Nullable). If the delivery was created as part of a Campaign or API Triggered Broadcast workflow, this is the ID for the Campaign or API Triggered Broadcast.
  • action_id. INTEGER (Nullable). If the delivery was created as part of a Campaign or API Triggered Broadcast workflow, this is the ID for the unique workflow item that caused the delivery to be created.
  • newsletter_id. INTEGER (Nullable). If the delivery was created as part of a Newsletter, this is the unique ID of that Newsletter.
  • content_id. INTEGER (Nullable). If the delivery was created as part of a Newsletter split test, this is the unique ID of the Newsletter variant.
  • trigger_id. INTEGER (Nullable). If the delivery was created as part of an API Triggered Broadcast, this is the unique trigger ID associated with the API call that triggered the broadcast.
  • created_at. TIMESTAMP (Required). The timestamp the delivery was created at.
  • transactional_message_id. INTEGER (Nullable). If the delivery occurred as part of a transactional message, this is the unique identifier for the API call that triggered the message.
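
Since internal_customer_id resolves to your own IDs through the people file, a common first query joins the two; a sketch, assuming you've loaded the exports into tables named deliveries and people:

-- Attach your customer_id to each delivery record.
select d.delivery_id, d.delivery_type, d.created_at, p.customer_id
from deliveries d
left join people p
  on p.workspace_id = d.workspace_id
  and p.internal_customer_id = d.internal_customer_id;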

Metrics

The first metrics export file will include a backfill of historical data. Subsequent files will contain rows for any data that has changed since the last export.

  • event_id. STRING (Required). The unique ID of the metric event. This can be useful for deduplicating purposes.
  • workspace_id. INTEGER (Required). The ID of the Customer.io workspace associated with the metric record.
  • delivery_id. STRING (Required). The ID of the delivery record.
  • metric. STRING (Required). The type of metric (e.g. sent, delivered, opened, clicked).
  • reason. STRING (Nullable). For certain metrics (e.g. attempted), the reason behind the action.
  • link_id. INTEGER (Nullable). For "clicked" metrics, the unique ID of the link being clicked.
  • link_url. STRING (Nullable). For "clicked" metrics, the URL of the clicked link. (Truncated to 1000 bytes.)
  • created_at. TIMESTAMP (Required). The timestamp the metric was created at.
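
Because event_id is unique per metric event, you can use it to collapse duplicates that appear across export files; a sketch in the style of the Snowflake query later in this doc, assuming a metrics table:

-- Keep a single row per metric event, preferring the latest created_at.
select * from (
  select m.*, row_number() over (partition by event_id order by created_at desc) as rno
  from metrics m
)
where rno = 1;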

Subjects

The first subjects export file will include a backfill of historical data. Subsequent files will contain rows for any data that has changed since the last export.

  • workspace_id. INTEGER (Required). The ID of the Customer.io workspace associated with the subject record.
  • subject_id. STRING (Required). The ID for the path a person went through in a Campaign or API Triggered Broadcast workflow.
  • subject_name. STRING (Required). A secondary unique ID for the path a person took through a campaign or broadcast workflow.
  • internal_customer_id. STRING (Nullable). The ID of the person in question. Use the people parquet file to resolve this ID to your customer_id.
  • campaign_type. STRING (Required). The type of Campaign (segment, event, or triggered_broadcast).
  • campaign_id. INTEGER (Required). The ID of the Campaign or API Triggered Broadcast.
  • event_id. STRING (Required). The ID for the unique event that triggered the workflow.
  • trigger_id. INTEGER (Required). If the subject was created as part of an API Triggered Broadcast, this is the unique trigger ID associated with the API call that triggered the broadcast.
  • started_campaign_at. TIMESTAMP (Required). The timestamp when the person first matched the campaign trigger. For event-triggered campaigns, this is the timestamp of the trigger event. For segment-triggered campaigns, this is the time the user entered the segment.
  • created_at. TIMESTAMP (Required). The timestamp the subject was created at.
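
Since each subject row represents one person's journey through a campaign or broadcast, a simple rollup gives entry counts; a sketch, assuming a subjects table:

-- Count distinct workflow journeys per campaign or broadcast.
select campaign_id, campaign_type, count(distinct subject_id) as journeys
from subjects
group by campaign_id, campaign_type
order by journeys desc;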

Outputs

The first outputs file will include a backfill of historical data. Subsequent files will contain rows for any data that has changed since the last export.

  • workspace_id. INTEGER (Required). The ID of the Customer.io workspace associated with the output record.
  • output_id. STRING (Required). The ID for the step of the unique path a person went through in a Campaign or API Triggered Broadcast workflow.
  • subject_id. STRING (Required). The ID for the path a person went through in a Campaign or API Triggered Broadcast workflow.
  • output_type. STRING (Required). The type of step a person went through in a Campaign or API Triggered Broadcast workflow. Note that the "delay" output_type covers many use cases: a Time Delay or Time Window workflow item, a "grace period", or a Date-based campaign trigger.
  • action_id. INTEGER (Required). The ID for the unique workflow item associated with the output.
  • explanation. STRING (Required). The explanation for the output.
  • delivery_id. STRING (Nullable). If a delivery resulted from this step of the workflow, this is the ID of that delivery.
  • draft. BOOLEAN (Nullable). If a delivery resulted from this step of the workflow, this indicates whether the delivery was created as a draft.
  • link_tracked. BOOLEAN (Nullable). If a delivery resulted from this step of the workflow, this indicates whether links within the delivery are configured for tracking.
  • split_test_index. INTEGER (Nullable). If the step of the workflow was a Split Test, this indicates the variant of the Split Test.
  • delay_ends_at. TIMESTAMP (Nullable). If the step of the workflow involves a delay, this is the timestamp for when the delay will end.
  • branch_index. INTEGER (Nullable). If the step of the workflow was a T/F Branch, a Multi-Split Branch, or a Random Cohort Branch, this indicates the branch that was followed.
  • manual_segment_id. INTEGER (Nullable). If the step of the workflow was a Manual Segment Update, this is the ID of the Manual Segment involved.
  • add_to_manual_segment. BOOLEAN (Nullable). If the step of the workflow was a Manual Segment Update, this indicates whether the person was added to or removed from the Manual Segment involved.
  • created_at. TIMESTAMP (Required). The timestamp the output was created at.
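
Joining outputs to subjects reconstructs the ordered path a person took through a workflow; a sketch, assuming outputs and subjects tables and a placeholder person ID:

-- List every step of one person's journeys, in the order the steps occurred.
select o.subject_id, o.output_type, o.explanation, o.delivery_id, o.created_at
from outputs o
join subjects s
  on s.workspace_id = o.workspace_id
  and s.subject_id = o.subject_id
where s.internal_customer_id = 'xxxxxxxxxxx'
order by o.subject_id, o.created_at;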

People

The first People export file includes a list of current People at the time of first sync (deleted or suppressed people will not be included in the first file). Subsequent exports will include any People deleted or suppressed since the last export.

  • workspace_id. INTEGER (Required). The ID of the Customer.io workspace associated with the person.
  • customer_id. STRING (Required). The ID of the person in question. This will match the ID you see in the Customer.io UI.
  • internal_customer_id. STRING (Required). The internal ID of the person in question.
  • deleted. BOOLEAN (Nullable). This indicates whether the person has been deleted.
  • suppressed. BOOLEAN (Nullable). This indicates whether the person has been suppressed.
  • created_at. TIMESTAMP (Required). The timestamp the profile was created at. Note that this is the timestamp the profile was created in Customer.io's database, which may not match the created_at profile attribute.
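
Because the deleted and suppressed flags exist for compliance purposes, one common pattern is to purge those people from your own copies of the data; a sketch, assuming deliveries and people tables (repeat for each table you retain):

-- Remove delivery rows for anyone flagged as deleted or suppressed.
delete from deliveries
where internal_customer_id in (
  select internal_customer_id
  from people
  where deleted = true or suppressed = true
);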

Attributes

On setup, the initial Attributes export will include a list of profiles and their current attributes. Subsequent files will only contain attribute changes, with one change per row.

  • workspace_id. INTEGER (Required). The ID of the Customer.io workspace associated with the person.
  • internal_customer_id. STRING (Required). The internal ID of the person in question.
  • attribute_name. STRING (Required). The attribute that was updated.
  • attribute_value. STRING (Required). The new value of the attribute.
  • timestamp. TIMESTAMP (Required). The timestamp of the attribute update.

Getting Started

To configure a Google Cloud Storage Bucket:

  1. Create a service account key in JSON format that grants read/write permissions to the GCS bucket you want us to export to.

  2. Log in to Customer.io.

  3. Navigate to Integrations on the left sidebar, then search for whichever service you want to sync to (e.g. BigQuery or Google Cloud Storage).

  4. Choose the desired option in the list, then select Get Started on the next page.

  5. Follow the instructions to submit your configuration request. You'll need:

    • The name of your GCS bucket
    • The path to your GCS bucket
    • The JSON of the service account key you created (please paste this in)
    • The ID(s) of the Customer.io workspace whose data you want synced. You can find a Workspace ID by navigating to the workspace in question, then copying the number from your browser's URL.

Configuring a Data Warehouse Sync usually takes 1-2 business days. You'll be emailed when configuration is complete.

To configure an Amazon S3 Bucket:

  1. Create an Access Key and a Secret Key with read/write permissions to the S3 bucket you want us to export to.

  2. Log in to Customer.io.

  3. Navigate to Integrations on the left sidebar, then search for whichever service you want to sync to (e.g. Redshift or S3).

  4. Choose the desired option in the list, then select Get Started on the next page.

  5. Follow the instructions to submit your configuration request. You'll need:

    • The name of your S3 bucket
    • The path to your S3 bucket
    • The Access Key (please paste this in)
    • The Secret Key (please paste this in)
    • The AWS Region your S3 bucket is located in
    • The ID(s) of the Customer.io workspace whose data you want synced. You can find a Workspace ID by navigating to the workspace in question, then copying the number from your browser's URL.

Configuring a Data Warehouse Sync usually takes 1-2 business days. You'll be emailed when configuration is complete.

FAQ

1. How are the exported parquet files organized?

Each parquet file is named <name>_v2_<workspace_id>_<sequence>.parquet.

  • <name> is deliveries, metrics, subjects, outputs, people, or attributes.
  • v2 indicates the schema version. The current version is v2; v1 schemas have no version indicator and are deprecated.
  • <workspace_id> refers to the Customer.io workspace whose data is included.
  • <sequence> is an ever-increasing value over time; later exports always have larger values.

2. How do you handle export failures?

If we experience an internal failure, we monitor and repair it promptly. If there's an external failure such as a networking problem, we retry the export. If there's an issue with bucket access, we'll reach out to you with details.

In each case, the next export will contain all data since the prior successful export. No data will be lost.

3. How should I import data from my S3 or GCS bucket to my data warehouse?

There are many approaches to this, but here's one example we've seen work for moving data from a GCS bucket to BigQuery.

  • Implement a Cloud Function to automatically import the parquet files from your GCS bucket to a BigQuery table.
  • Set an expiration on the parquet files so they get automatically deleted.

Here's a screenshot of an example Cloud Function:

(Screenshot: BigQuery Cloud Function)

You can download sample code for this Cloud Function below:

main.go
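
If you'd rather not maintain a Cloud Function, a scheduled query can load the files directly; a sketch using BigQuery's LOAD DATA statement, where the cio.deliveries table and the bucket path are placeholder assumptions:

-- Append all deliveries parquet files from the bucket to a BigQuery table.
load data into cio.deliveries
from files (
  format = 'PARQUET',
  uris = ['gs://your-bucket/your-path/deliveries_v2_*.parquet']
);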

4. How can I get information about a campaign, workflow action, or message that's not included in the export?

With the data in our export, it's possible to query our API to pull any extra information you need.

For example, calling https://beta-api.customer.io/v1/api/campaigns/:id with the id of the campaign from a subject record will give you details about the campaign.

Check out our full API docs here.

5. How do I get all of the current attributes for a profile?

On setup, the initial Attributes export will include a list of profiles and their current attributes. Subsequent files will only contain attribute changes, with one change per row. In order to get the most recent attributes for a particular profile, you'll need to use the timestamp value to query for the latest.

An example query for Snowflake would be:

select internal_customer_id, attribute_name, attribute_value
from (
  select internal_customer_id, attribute_name, attribute_value,
         row_number() over (partition by internal_customer_id, attribute_name order by timestamp desc) as rno
  from attributes
)
where rno = 1 and internal_customer_id = 'xxxxxxxxxxx';