Heap Connect for S3

For access to Heap Connect, contact your Account Manager or [email protected].

Heap Connect for S3 enables any downstream system (e.g. Hadoop, Stitch, or Fivetran) to access Heap data at scale. This allows you to reap the benefits of codeless event creation, retroactivity, and cross-device user identity.

ETL Requirements

Heap Connect for S3 is designed to support building a custom data pipeline, not for querying directly in an Enterprise Data Lake. Interested customers will need to work with one of our ETL partners or provision Data Engineering resources to build and maintain a data pipeline.

Process Overview

Heap will provide a periodic dump of data into S3 (nightly by default). That data will be delivered in the form of Avro-encoded files, each of which corresponds to one downstream table (though there can be multiple files per table). Dumps will be incremental, though individual table dumps can be full resyncs, depending on whether the table was recently toggled or its event definition was modified.

We’ll include the following tables:

  • users
  • pageviews
  • sessions
  • toggled event tables (separate tables per event)
  • user_migrations (a fully materialized mapping of users merged as a result of heap.identify calls)

Each periodic data delivery will be accompanied by a manifest metadata file, which will describe the target schema and provide a full list of relevant data files for each table.

Metadata

For each dump, there will be a metadata file with the following information:

  • dump_id - a monotonically increasing sequence number for dumps.
  • tables - for each table synced:
    • name - the name of the table.
    • columns - an array consisting of the columns contained in the table. This can be used to determine which columns need to be added or removed downstream.
    • files - an array of full s3 paths to the Avro-encoded files for the relevant table.
    • incremental - a boolean denoting whether the data for the table is incremental on top of previous dumps. A value of false means it is a full/fresh resync of this table, and all previous data is invalid.
  • property_definitions - the s3 path to the defined property definition file.

An example of this metadata file can be found below:

{
  "dump_id": 1234,
  "tables": [
    {
      "name": "users",
      "files": [
        "s3://customer/sync_1234/users/a97432cba49732.avro",
        "s3://customer/sync_1234/users/584cdba3973c32.avro",
        "s3://customer/sync_1234/users/32917bc3297a3c.avro"
      ],
      "columns": [
        "user_id",
        "last_modified",
        ...
      ],
      "incremental": true
    },
    {
      "name": "user_migrations",
      "files": [
        "s3://customer/sync_1234/user_migrations/2a345bc452456c.avro",
        "s3://customer/sync_1234/user_migrations/4382abc432862c.avro"
      ],
      "columns": [
        "from_user_id",
        "to_user_id",
        ...
      ],
      "incremental": false  // Will always be false for migrations
    },
    {
      "name": "defined_event",
      "files": [
        "s3://customer/sync_1234/defined_event/2fa2dbe2456c.avro"
      ],
      "columns": [
        "user_id",
        "event_id",
        "time",
        "session_id",
        ...
      ],
      "incremental": true
    }
  ],
  "property_definitions": "s3://customer/sync_1234/property_definitions.json"
}
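
As a hedged illustration, here is a minimal Python sketch of how a downstream consumer might use the manifest to plan each table's load. It assumes the manifest metadata file has already been downloaded locally; the file name is hypothetical.

import json

with open("manifest.json") as f:
    manifest = json.load(f)

print(f"Processing dump {manifest['dump_id']}")
for table in manifest["tables"]:
    if not table["incremental"]:
        # Full resync: all previously loaded data for this table is invalid.
        print(f"Replacing all existing rows in {table['name']}")
    # Each entry in 'files' is a full s3:// path to an Avro-encoded file.
    for path in table["files"]:
        print(f"  load {path} into {table['name']} (columns: {table['columns']})")

# The defined property definitions are delivered as a separate JSON file.
print(f"Property definitions: {manifest['property_definitions']}")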

Data Types

The user_id, event_id, and session_id columns are the only columns with long types. All other columns should be inferred as string types.
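
As a hedged illustration, a minimal Python helper for assigning downstream column types based on this rule (the function name and the non-identifier column names are hypothetical):

def infer_column_type(column_name):
    # Only the identifier columns are longs; everything else is a string.
    long_columns = {"user_id", "event_id", "session_id"}
    return "long" if column_name in long_columns else "string"

# Example: build a downstream schema from a table's column list in the manifest.
columns = ["user_id", "event_id", "time", "session_id", "path"]
schema = {name: infer_column_type(name) for name in columns}
# {'user_id': 'long', 'event_id': 'long', 'time': 'string', 'session_id': 'long', 'path': 'string'}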

Data Delivery

Data will sync directly to customers’ S3 buckets. Customers will create a bucket policy for Heap, and we’ll use that policy when dumping to S3. The target S3 bucket name needs to begin with the prefix 'heap-rs3-' for Heap's systems to have access to it.

No additional user/role is required.

Sync Reporting

Each sync will be accompanied by a sync log file that reports on delivery status. These log files will be placed in the sync_reports directory. Each report will be in a JSON format as follows:

{
  "start_time":1566968405225,
  "finish_time":1566968649169,
  "status":"succeeded",
  "next_sync_at":1567054800000,
  "error":null
}

start_time, finish_time, and next_sync_at are represented as epoch timestamps in milliseconds.
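
For example, a consumer might read the report and convert the timestamps as follows (a minimal Python sketch; the local file name is hypothetical):

import json
from datetime import datetime, timezone

with open("sync_report.json") as f:
    report = json.load(f)

if report["status"] != "succeeded":
    raise RuntimeError(f"Sync failed: {report['error']}")

# Timestamps are epoch milliseconds.
finish = datetime.fromtimestamp(report["finish_time"] / 1000, tz=timezone.utc)
next_sync = datetime.fromtimestamp(report["next_sync_at"] / 1000, tz=timezone.utc)
print(f"Sync finished at {finish}; next sync expected at {next_sync}")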

You can learn about how the data will be structured upon sync by viewing our docs on data syncing.

Granting Access

Add the following policy to the destination S3 bucket. This policy applies only to the Heap bucket you created specifically for this export.

If you would like to restrict the allowed actions, the minimum required actions are s3:PutObject and s3:ListBucket.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1441164338000",
      "Effect": "Allow",
      "Action": [
        "s3:*"
      ],
      "Resource": [
        "arn:aws:s3:::<bucket-name>",
        "arn:aws:s3:::<bucket-name>/*"
      ],
      "Principal": {
        "AWS": [
          "arn:aws:iam::085120003701:root"
        ]
      }
    }
  ]
}
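
If you do restrict the actions, the statement might look like the following (a sketch of the same policy limited to the minimum actions noted above; the Sid is arbitrary, and you should verify the policy against your own security requirements):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "HeapS3ConnectMinimal",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::<bucket-name>",
        "arn:aws:s3:::<bucket-name>/*"
      ],
      "Principal": {
        "AWS": [
          "arn:aws:iam::085120003701:root"
        ]
      }
    }
  ]
}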

Completion of a dump is signaled by delivery of a new manifest file. You should poll s3://<BUCKET>/heap_exports/manifests/* for new manifests. Upon receipt of a new manifest, ETL can proceed downstream.
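
As a hedged sketch of one way to poll for new manifests with boto3 (the bucket name, polling interval, and state tracking are illustrative only):

import time
import boto3

s3 = boto3.client("s3")
bucket = "heap-rs3-example"            # hypothetical bucket name
prefix = "heap_exports/manifests/"

def new_manifest_keys(seen):
    # List manifest objects under the prefix and return any not yet processed.
    # (For buckets with many manifests, use a paginator instead of a single call.)
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    keys = [obj["Key"] for obj in response.get("Contents", [])]
    return sorted(k for k in keys if k not in seen)

seen = set()
while True:
    for key in new_manifest_keys(seen):
        print(f"New manifest: s3://{bucket}/{key} -- kick off downstream ETL")
        seen.add(key)
    time.sleep(600)  # poll on an interval that suits your pipeline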

Defined Properties JSON File

We will sync defined property definitions daily and provide a JSON file containing all defined properties and their definitions. Downstream consumers will be responsible for applying these definitions to generate the defined property values for each row.

The JSON file format is as follows:


[
  {
    "property_name": "Channel",
    "type": "<event|user>",
    "cases": [
      {
        "value": {...}, // Refer to values spec below
        "condition": {...} // Refer to conditions spec below
      }
    ],
    "default_value": {...} // Refer to values spec below. This field is optional
  },
  ...
]

Property Values

The value in cases and the default_value can be a constant or another non-defined property on the same entity (e.g. event defined properties will only refer to other properties on the event).


{
  "type": "<property|constant>",
  "value": <name of property|constant value>
}

Conditions

Each case produces a value for the defined property if the conditions evaluate to true. Notes:

  • Case statements are evaluated in order, so if the cases aren’t mutually exclusive, the value of the defined property will come from the first case to evaluate to true.
  • We currently only support 1 level of condition nesting beyond the top level, but this format can support more than that.
  • The conditions can be traversed to represent the logic in another format, such as SQL CASE statements.

{
  "clause_combinator": "<and|or>",
  "clauses": [...] // Refer to clauses spec below
}

Clauses


{
  "property_name": "utm_source",
  "operator": "...", // Refer to operators spec below
  "value": ... // Refer to clause values spec below
}

Operators

These are the names Heap gives operators internally. They're reasonably readable, so they're used as-is in the definitions file.

Operator        Description
=               Equal
!=              Not Equal
contains        Contains
notcontains     Does not contain
isdef           Is defined
notdef          Is not defined
matches         Wildcard matches (SQL equivalent of ILIKE)
notmatches      Doesn't wildcard match (SQL equivalent of NOT ILIKE)
includedin      Included in a collection of values
notincludedin   Not included in a collection of values

Clause values

All operators except includedin and notincludedin take string values. The values for includedin and notincludedin are supplied via a file in the defined property definition UI; the file's contents (split by newline) are stored as a JSON array, and that is the representation used in the definitions file.

Example defined properties file

[
  {
    "property_name": "channel",
    "type": "event",
    "cases": [
      {
        "value": {
          "type": "constant",
          "value": "Social"
        },
        "condition": {
          "clause_combinator": "or",
          "clauses": [
            {
              "clause_combinator": "and",
              "clauses": [
                {
                  "property_name": "campaign_name",
                  "operator": "=",
                  "value": "myfavoritecampaign"
                },
                {
                  "property_name": "utm_source",
                  "operator": "=",
                  "value": "facebook"
                }
              ]
            },
            {
              "property_name": "utm_source",
              "operator": "=",
              "value": "instagram"
            }
          ]
        }
      },
      {
        "value": {
          "type": "property",
          "value": "utm_source" // This is a property on the event
        },
        "condition": {
          "clause_combinator": "or",
          "clauses": [
            {
              "property_name": "utm_source",
              "operator": "=",
              "value": "google"
            },
            {
              "property_name": "utm_source",
              "operator": "=",
              "value": "bing"
            }
          ]
        }
      }
    ],
    "default_value": {
      "type": "constant",
      "value": "Idk"
    }
  }
]
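
As a hedged illustration of applying these definitions downstream, here is a minimal Python sketch. The function names are hypothetical, and only the equality operators and constant/property values are handled; a full implementation would need to cover every operator listed above.

def evaluate_clause(clause, row):
    # A clause is either a nested condition (with a clause_combinator) or a leaf comparison.
    if "clause_combinator" in clause:
        return evaluate_condition(clause, row)
    actual = row.get(clause["property_name"])
    if clause["operator"] == "=":
        return actual == clause["value"]
    if clause["operator"] == "!=":
        return actual != clause["value"]
    raise NotImplementedError(f"Operator {clause['operator']} is not handled in this sketch")

def evaluate_condition(condition, row):
    results = (evaluate_clause(c, row) for c in condition["clauses"])
    return any(results) if condition["clause_combinator"] == "or" else all(results)

def resolve_value(value_spec, row):
    # A value is either a constant or a reference to another property on the same entity.
    if value_spec["type"] == "constant":
        return value_spec["value"]
    return row.get(value_spec["value"])

def apply_defined_property(definition, row):
    # Cases are evaluated in order; the first case whose condition is true wins.
    for case in definition["cases"]:
        if evaluate_condition(case["condition"], row):
            return resolve_value(case["value"], row)
    default = definition.get("default_value")
    return resolve_value(default, row) if default else None

# Using the "channel" definition above:
# apply_defined_property(definition, {"campaign_name": "myfavoritecampaign",
#                                     "utm_source": "facebook"})   # -> "Social"
# apply_defined_property(definition, {"utm_source": "google"})     # -> "google"
# apply_defined_property(definition, {"utm_source": "newsletter"}) # -> "Idk"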

ETL Considerations

  • Data across dumps/files is not guaranteed to be disjoint. As a result, downstream consumers are responsible for de-duplication. De-duplication must happen after applying user migrations. We recommend the following de-duplication strategy (a combined sketch of these steps follows this list):

    Table           De-duplication Columns
    Sessions        session_id, user_id
    Users           user_id
    Event tables    event_id, user_id

  • Updated users (users whose properties have changed since the last sync) will re-appear in the sync files, so every repeated occurrence of a user (matched on user_id) should replace the old record to ensure that the corresponding property updates are picked up.
  • user_migrations is a fully materialized mapping of from_user_ids to to_user_ids. Downstream consumers are responsible for joining this with events/users tables downstream to resolve identity retroactively. For complete steps, see Identity Resolution.
  • For v2, we only sync defined property definitions rather than the actual defined property values. Downstream consumers are responsible for applying these definitions to generate the defined property values for each row.
  • Schemas are expected to evolve over time (i.e. properties can be added to the user and events tables).
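
As a hedged sketch of the migration and de-duplication steps above (it assumes the synced rows have been loaded into pandas DataFrames; the sample rows are illustrative only):

import pandas as pd

# Illustrative rows; in practice these come from the Avro files of one or more dumps.
events = pd.DataFrame([
    {"event_id": 1, "user_id": 10, "time": "2019-08-28T00:00:05Z"},
    {"event_id": 1, "user_id": 10, "time": "2019-08-28T00:00:05Z"},  # duplicate across dumps
    {"event_id": 2, "user_id": 11, "time": "2019-08-28T00:01:00Z"},
])
migrations = pd.DataFrame([
    {"from_user_id": 10, "to_user_id": 11},  # user 10 was merged into user 11
])

# 1. Apply user migrations first: remap merged user_ids to their target users.
mapping = dict(zip(migrations["from_user_id"], migrations["to_user_id"]))
events["user_id"] = events["user_id"].map(lambda u: mapping.get(u, u))

# 2. De-duplicate afterwards, using the recommended columns for event tables.
events = events.drop_duplicates(subset=["event_id", "user_id"], keep="last")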
