Dataset and features
Dataset format
When you do a `GET` request on the URL under a specific dataset key in `dataUrls`, such as `initial`, you'll get the full dataset with all features returned as JSON.
Under `data` you'll find an array of rows, where each row represents a unique user with its features and properties. Under `metadata.columns` you will find the column names, where each index number corresponds to the column index for each user row in `data`.
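As a minimal sketch of reading that format with only the Python standard library (the URL is illustrative; use the real value from `dataUrls`):

```python
import json
from urllib.request import urlopen

# Hypothetical URL; use the real value found under dataUrls in the manifest.
DATA_URL = "https://api.example.com/datasets/initial"

with urlopen(DATA_URL) as resp:
    dataset = json.load(resp)

columns = dataset["metadata"]["columns"]  # column names, in row order
rows = dataset["data"]                    # one array per user

# Turn each positional row into a dict keyed by column name.
users = [dict(zip(columns, row)) for row in rows]
print(users[0]["user_id"], users[0]["y_value"])
```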
Here are the column names that are always available for each dataset. All timestamps are in ISO 8601 format.
- user_id: The unique ID of the user. If the user has been identified by a custom ID, this will be that ID.
- user_created: The timestamp when the user was first created/tracked.
- data_now: The timestamp when the dataset was created.
- y_value: Will be `"true"` if the user converted and `"false"` if not. Note this is a string type, not a boolean!
- y_timestamp: If the user converted, this is the earliest timestamp at which the conversion happened. If the conversion goal is that a user did `play_song`, and the user did that event 10 times, this is the timestamp of the first time that event happened.
- random: A random value between 0 and 1. Can be used to query for a smaller sample of all users in the dataset; see the next section.
- moment_key: This contains the same value as the key of the dataset this is for, such as `initial`, `train_more`, etc.
- user_moment_base_timestamp: The user base timestamp defaults to the user's 'created at', so in that case it will be the same as the `user_created` column above. But we can also choose any other user property, such as 'identified at' or any other timestamp that you're tracking. See the Insight API intelligence plugin section later on how to specify a different user base timestamp.
- moment_timestamp: A dataset is always based on a snapshot taken a number of seconds after the user's creation date, or whatever `user_moment_base_timestamp` was specified. For the `initial` dataset, where the moment is 0 seconds, `moment_timestamp` equals `user_created`. For a moment of 60 seconds it will be `user_moment_base_timestamp + 60 seconds`, and if the dataset moment is `latest`, it will be equal to `data_now`. The moment timestamp is useful for a specific type of analysis, such as predicting what behavior leads to conversion: we don't want to analyze users whose snapshot state is beyond the conversion `y_timestamp`, as our analysis would be biased then. So in this specific case we want to keep only users where `moment_timestamp < y_timestamp` (see the sketch after this list).
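Continuing the sketch above, that filter could look as follows in Python (assuming `y_timestamp` is null for users who never converted):

```python
from datetime import datetime

def ts(value):
    # Parse an ISO 8601 timestamp; the "Z" replacement keeps this working
    # on Python versions older than 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

# Keep non-converters plus converters whose snapshot predates their conversion.
unbiased = [
    u for u in users
    if u["y_timestamp"] is None
    or ts(u["moment_timestamp"]) < ts(u["y_timestamp"])
]
```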
We can also have one or more features. Each column name that starts with `feature_` is a feature, and we can find the corresponding feature details in the JSON manifest discussed in the previous section. Here is the features part again:
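(An illustrative sketch rather than the verbatim manifest; the fields follow the descriptions below, and the keys and values are hypothetical.)

```json
"features": {
  "feature_abcd1234": {
    "name": "Country",
    "moment": "static",
    "type": "categorical",
    "nativeType": "string",
    "details": {}
  },
  "feature_ef567890": {
    "name": "Songs played",
    "moment": "dynamic",
    "type": "integer",
    "nativeType": "integer",
    "details": {}
  }
}
```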
Each feature has a user-friendly `name`.

The `moment` of a feature can be either `static`, which is for user properties such as country or device: those are always known at the initial moment (0 seconds since user creation) and don't change if the moment of the dataset changes from, say, 0 seconds since user creation to 3600 seconds since user creation. Features with a `dynamic` moment, on the other hand, do change, as we calculate them at the moment of the dataset snapshot time, for example 300 seconds since user creation, so that users have 5 minutes to do some actions.
Each feature can have a `type`:

- "integer": Discrete numeric value (1, 2, 9, 10)
- "numeric": Continuous numeric value (3.330, 8.4846)
- "categorical": String based categorical value ("true", "false", "red", "green"). Boolean values are encoded as categorical too.
- "text": Content based text value (a user review or comment, for example), which could be used with NLP or other text based analysis.
- "string": A string value that is mostly unusable; for example a user ID, some random hash, etc.
`nativeType` can be one of `string`, `integer`, `float`, `boolean`, `timestamp`. For example, country would be encoded as a string, so its `type` would be `categorical` and its `nativeType` would be `string`.
Finally, the `details` of a feature can be looked up, but this is used only in rare cases.
Querying datasets
When you do a `GET` on any of the `dataUrls`, it will return the whole dataset. While you can then trim the dataset down by filtering locally, this can get inefficient. A better way is to append the `query` parameter to those data URLs, which lets you run any SQL query directly against the dataset table. All columns, such as `y_value`, `y_timestamp` and `feature_abcd1234`, are available as-is and can be used in the query. The table is called `DATA_TABLE`.
Here is an example that returns 10% of the dataset where time since conversion > 1 hour:
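(A sketch only: the exact timestamp arithmetic depends on the SQL dialect the endpoint supports, and "time since conversion" is read here as the conversion having happened more than an hour before `data_now`.)

```sql
SELECT *
FROM DATA_TABLE
WHERE random < 0.1                                 -- 10% sample
  AND y_timestamp IS NOT NULL                      -- converted users only
  AND y_timestamp < data_now - INTERVAL '1 hour'   -- converted over an hour before the snapshot
```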
Don't forget to escape the `query` parameter when used in a `GET` request.
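For example, with the Python standard library (reusing the illustrative `DATA_URL` from earlier):

```python
from urllib.parse import quote

query = "SELECT * FROM DATA_TABLE WHERE random < 0.1"
url = f"{DATA_URL}?query={quote(query)}"  # percent-escape the SQL for the URL
```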
The `random` column is in the range of 0.0 to 1.0, so we can use it to get a sub-selection of the dataset. There is a shortcut using the `range_start_gt_or_eq` and `range_end_lt` query parameters:

- `range_start_gt_or_eq=0.50` translates to `random >= 0.50`
- `range_end_lt=0.33` translates to `random < 0.33`
- `range_start_gt_or_eq=0.10&range_end_lt=0.75` translates to `random >= 0.10 AND random < 0.75`
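Put together, fetching the slice `random >= 0.10 AND random < 0.75` could look like this (again reusing the illustrative `DATA_URL`):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Fetch the 65% slice of users with 0.10 <= random < 0.75.
params = urlencode({"range_start_gt_or_eq": 0.10, "range_end_lt": 0.75})
with urlopen(f"{DATA_URL}?{params}") as resp:
    sample = json.load(resp)
```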
Events input data
Events input data is a list of events with their timestamp, ID, name and the requested event properties. Because each row in the complete dataset represents a unique user, the events belonging to that user are encoded in a column as a JSON encoded nested array. Each element in that array represents an event, such as an item in a webshop being interacted with.
A dataset can contain many event input data features, so each one will be represented in a separate column, encoded as JSON. Each event array always contains three elements:
- The first element is the unique ID of the event. Because we can have multiple event input data features, such as `price` and `color`, each will be JSON encoded in its own column in the dataset. To tie those two properties `price` and `color` to the same event in your code, do so through the unique ID of the event. In the JSON example below we can see how one event's three properties (SKU id, color and price) are represented in three different arrays.
- The second element contains the ISO 8601 timestamp of the event. Sometimes you want to take only certain events into consideration, such as the ones that happened before the conversion timestamp, available under `y_timestamp` in the dataset. To do so, you need to filter events in your own code (Python, etc.) where this array's second element, the event timestamp, is `< y_timestamp`; see the sketch after this list.
- The third element always contains the requested event property, such as color, price or SKU id.
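A sketch of that timestamp filter, continuing the `users` dicts from the earlier example (the column name `feature_evt_price` is hypothetical):

```python
import json

def events_before_conversion(user, column):
    """Decode one event input data column and keep events before conversion."""
    events = json.loads(user[column] or "[]")  # JSON encoded nested array
    if user["y_timestamp"] is None:            # user never converted: keep all
        return events
    return [
        e for e in events
        # e == [event_id, event_timestamp, event_property]; ISO 8601 strings
        # in a consistent format and timezone compare correctly as strings.
        if e[1] < user["y_timestamp"]
    ]

pre_conversion_prices = events_before_conversion(users[0], "feature_evt_price")
```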
Because events input data is based on events, it collects only events up to the dataset moment. So for 0 seconds since user creation, there will never be any events input data. For the dataset moment `latest` it will have the full history for that event, so be sure to always collect events input data from the right dataset. In addition, the number of events in each input data JSON encoded array is limited to a maximum of 200.
Below is an example of roughly how the arrays of events will look, where each one comes from a different column in the dataset:
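(A hypothetical reconstruction: the column names, event IDs, timestamps and values are illustrative, and each array is stored JSON encoded in its own dataset column. Note how the shared event IDs tie the three properties to the same events.)

```json
{
  "feature_evt_sku": [
    ["evt_01", "2024-05-01T12:00:00Z", "SKU-123"],
    ["evt_02", "2024-05-01T12:05:30Z", "SKU-456"]
  ],
  "feature_evt_color": [
    ["evt_01", "2024-05-01T12:00:00Z", "red"],
    ["evt_02", "2024-05-01T12:05:30Z", "green"]
  ],
  "feature_evt_price": [
    ["evt_01", "2024-05-01T12:00:00Z", 19.95],
    ["evt_02", "2024-05-01T12:05:30Z", 24.50]
  ]
}
```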
The event input data above maps to the JSON manifest `inputData` as shown below. The important part is `column`, which indicates what column of the dataset this event input data can be found under.
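(Again an illustrative sketch, not the verbatim manifest; only the `column` key and the type values are taken from the descriptions here.)

```json
"inputData": {
  "evt_price": {
    "name": "Price",
    "type": "float",
    "column": "feature_evt_price"
  },
  "evt_color": {
    "name": "Color",
    "type": "text",
    "column": "feature_evt_color"
  }
}
```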
Note that for `inputData`, the type is as-is; `text` doesn't mean it's NLP-able, it's just a text type. For numeric values the type can be `integer` or `float`.