Wikimedia Product/Analytics Infrastructure/Stream configuration
This page describes the features of the Event Platform Client (EPC) that pertain to stream configuration in the Modern Event Platform (MEP). For analytics purposes, stream configuration is set in PHP with wgEventStreams in wmf-config/InitialiseSettings.php (from EventStreamConfig).
Required components
The following fields are required for each stream:
- stream: the name of the stream
- schema: just the title of the schema the stream uses (note: schema name and schema version should be specified by the instrumentation)
- destination_event_service: which EventGate instance to send events to. For analytics events logged with EventLogging's mw.eventLog.submit() or Event Platform Client on Android/iOS, this will be "eventgate-analytics-external" and those events will be sent to intake-analytics.wikimedia.org/v1/events
The following fields are optional:
- sampling: settings that define the rules for determining whether an event is sent or thrown away, see below
Sampling settings
Future developments
Stream cc-ing
EPC supports cc-ing to streams sharing prefixes. This makes it possible to direct or copy events to different streams without additional instrumentation work. Specifically, the cc-ing feature lets engineers and analysts log events to additional streams without having to perform multiple log() calls manually.
To illustrate this concept, suppose we have an analytics/editing/attempt-step-v2 schema and the following stream:
'wgEventStreams' => [
    'default' => [
        [
            'stream' => 'edit',
            'schema_title' => 'analytics/editing/attempt-step-v2',
            'destination_event_service' => 'eventgate-analytics-external',
            'sampling' => [
                'rate' => 0.1,
            ],
        ],
    ],
]
As detailed in the sampling logic section above, events in this stream are in-sample (and therefore sent) for 10% of sessions, because the stream uses the default identifier (the session token).
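To make the idea of per-session sampling concrete, here is a minimal sketch of one way a deterministic in-sample check could work. It is not the EPC implementation: the function name isInSample, the assumption that the session token is a hexadecimal string, and the choice to use its first 8 hex digits are all hypothetical. What it illustrates is that the same token always yields the same decision, so a session is either entirely in or entirely out of the sample.

```javascript
// Hypothetical helper, NOT the actual EPC code.
// Assumes `token` is a hex string of at least 8 characters.
function isInSample(token, rate) {
  // Per this page, a stream without a sampling rate uses the default 100% rate.
  if (rate === undefined) {
    return true;
  }
  // Interpret the first 8 hex digits (32 bits) as a number in [0, 1).
  const fraction = parseInt(token.slice(0, 8), 16) / 0x100000000;
  // The session is in-sample when its fraction falls below the rate.
  return fraction < rate;
}

// The decision depends only on the token, so it is stable within a session.
console.log(isInSample("0a3f19c2d4e5f607", 0.1)); // true  (about 0.04 < 0.1)
console.log(isInSample("e7b2c1009f8e7d6c", 0.1)); // false (about 0.91 >= 0.1)
```

With a rate of 0.1, roughly one in ten uniformly random tokens would fall below the threshold, which matches the "10% of sessions" behavior described above.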
The Growth team wants to collect editing behavior data on Czech and Korean Wikipedias, but sampled at a higher rate. They can create a new stream (e.g. "edit.growth") for those wikis, which will use the default 100% rate. When the instrumentation logs events to the "edit" stream, those events are then logged to the "edit.growth" stream automatically, without the need for a separate log() call (mw.eventLog.submit on MediaWiki):
'wgEventStreams' => [
    'default' => [
        [
            'stream' => 'edit',
            'schema_title' => 'analytics/editing/attempt-step-v2',
            'destination_event_service' => 'eventgate-analytics-external',
            'sampling' => [
                'rate' => 0.1,
            ],
        ],
    ],
    'cswiki' => [
        [
            'stream' => 'edit.growth',
            'schema_title' => 'analytics/editing/attempt-step-v2',
            'destination_event_service' => 'eventgate-analytics-external',
        ],
    ],
    'kowiki' => [
        [
            'stream' => 'edit.growth',
            'schema_title' => 'analytics/editing/attempt-step-v2',
            'destination_event_service' => 'eventgate-analytics-external',
        ],
    ],
]
Remember, in the MEP paradigm streams map to tables. This configuration gives the Growth team a separate table "edit_growth" to work with, to which they can apply a different retention policy. For example, if data in the "edit" table is stored for 90 days maximum but the Growth team has an exemption from Legal to retain data for 270 days, that exemption can be applied to the "edit_growth" table.
cc'd streams
The child streams to be cc'd are determined by shared prefixes separated by dots, starting at the beginning of the stream name and up to a maximum depth of 1 level (direct child). To prevent duplication, only direct children are cc'd. The parent stream does not need to exist in the stream configuration for its children to be cc'd. See the example below for clarification.
Suppose we have 4 streams in a (loaded) stream configuration:
- a
- a.b
- a.b.c
- b.c
and that we log 5 separate events in the instrumentation: one to each of the 4 configured streams, plus one to a stream "b" that is not in the configuration. Here's what happens:
- log("a", "/analytics/example/1.0.0", data1)
  - data1 is posted to stream "a" depending on its sampling
  - data1 is cc'd to the only child stream for "a" ("a.b", NOT "a.b.c") via log("a.b", "/analytics/example/1.0.0", data1)
    - data1 is posted to stream "a.b" depending on its sampling
    - data1 is cc'd to the only child stream for "a.b" ("a.b.c") via log("a.b.c", "/analytics/example/1.0.0", data1)
      - data1 is posted to stream "a.b.c" depending on its sampling
- log("a.b", "/analytics/example/1.0.0", data2)
  - data2 is posted to stream "a.b" depending on its sampling
  - data2 is cc'd to the only child stream for "a.b" ("a.b.c") via log("a.b.c", "/analytics/example/1.0.0", data2)
    - data2 is posted to stream "a.b.c" depending on its sampling
- log("a.b.c", "/analytics/example/1.0.0", data3)
  - data3 is posted to stream "a.b.c" depending on its sampling
- log("b", "/analytics/example/1.0.0", data4)
  - data4 is NOT posted to stream "b" because there's no stream by that name in the configuration
  - HOWEVER, data4 IS cc'd to the only child stream for "b" ("b.c") via log("b.c", "/analytics/example/1.0.0", data4)
    - data4 is posted to stream "b.c" depending on its sampling
- log("b.c", "/analytics/example/1.0.0", data5)
  - data5 is posted to stream "b.c" depending on its sampling
Assuming the sampling logic evaluates to TRUE in all cases, here's what we end up with:
Table | Event data | In instrumentation | Explanation |
---|---|---|---|
a | data1 | log("a", "/analytics/example/1.0.0", data1) | Logged directly |
a_b | data1 | log("a", "/analytics/example/1.0.0", data1) | Logged via cc |
a_b | data2 | log("a.b", "/analytics/example/1.0.0", data2) | Logged directly |
a_b_c | data1 | log("a", "/analytics/example/1.0.0", data1) | Logged via cc |
a_b_c | data2 | log("a.b", "/analytics/example/1.0.0", data2) | Logged via cc |
a_b_c | data3 | log("a.b.c", "/analytics/example/1.0.0", data3) | Logged directly |
b_c | data4 | log("b", "/analytics/example/1.0.0", data4) | Logged via cc |
b_c | data5 | log("b.c", "/analytics/example/1.0.0", data5) | Logged directly |
There's no table "b" because there is no stream "b" in the configuration, even though data4 was logged to that stream.
Notice that when logging to parent stream "a", only its direct child ("a.b") is cc'd. The stream "a.b.c" is not a direct child of "a". Imagine if all levels of children were considered: data1 would have been cc'd to "a.b.c" twice – once from "a" and once from "a.b". Also notice that even though b.c's parent stream "b" does not exist in the stream config, "b.c" still got cc'd.
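The cc-ing rules in this section can be sketched as a small recursive function. This is only an illustration under stated assumptions, not the EPC source: the names streamConfig, directChildren, tableName, and this log signature are hypothetical, sampling is assumed to pass for every stream, and the dot-to-underscore table naming is inferred from the examples on this page.

```javascript
// Hypothetical loaded stream configuration (sampling settings omitted).
const streamConfig = { "a": {}, "a.b": {}, "a.b.c": {}, "b.c": {} };

// Direct children of `name`: configured streams made of `name` plus exactly
// one more dot-separated segment.
function directChildren(name, config) {
  const prefix = name + ".";
  return Object.keys(config).filter(
    (s) => s.startsWith(prefix) && !s.slice(prefix.length).includes(".")
  );
}

// Streams map to tables; in this page's examples dots become underscores.
function tableName(stream) {
  return stream.replace(/\./g, "_");
}

// Post to `name` if it is configured, then cc to each direct child.
// Each child cc's its own direct children in turn, so deeper descendants
// receive the event exactly once. `schema` is carried along but unused here.
function log(name, schema, data, config, posted) {
  if (Object.prototype.hasOwnProperty.call(config, name)) {
    posted.push([tableName(name), data]); // sampling assumed to pass
  }
  for (const child of directChildren(name, config)) {
    log(child, schema, data, config, posted);
  }
  return posted;
}

// Replay the five calls from the example.
const posted = [];
const schema = "/analytics/example/1.0.0";
log("a", schema, "data1", streamConfig, posted);
log("a.b", schema, "data2", streamConfig, posted);
log("a.b.c", schema, "data3", streamConfig, posted);
log("b", schema, "data4", streamConfig, posted);
log("b.c", schema, "data5", streamConfig, posted);
console.log(posted);
```

Under these assumptions, posted ends up with the same eight (table, event data) pairs as the table above, and nothing is recorded for a table "b". Restricting each step to direct children is what keeps data1 from reaching "a.b.c" twice.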
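Stream cc-ing is a powerful feature, but with great power comes great responsibility: any stream configured under a shared prefix automatically receives events logged to its parent, so stream names should be chosen deliberately.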
Specifying exemptions
In a later version of the Modern Event Platform Client Libraries we'd like a more detailed, more sophisticated targeting solution. One way we could achieve that is by adding a new configurable to sampling: an exemptions field which can be used to override the stream's rate in specific situations. The core use-cases we wanted to support are:
- being able to specify per-wiki sampling rates, for example:
  - to decrease volume of events sent from English Wikipedia
  - to increase volume of events sent from Czech and Korean Wikipedias
- being able to specify per-platform sampling rates, for example:
  - to enable a stream on desktop but not mobile web
  - to disable a stream on desktop and mobile web, but not mobile apps
- being able to specify sampling rates based on key-value pairs in persistent storage
  - to only enable a stream if a key has a specific value (assuming the key exists at all in the persistent storage)
  - to only enable a stream if a key has one of several values
  - to only enable a stream for specific combinations of keys
The various ways to specify exemptions (wiki, platform, key, keys) can be combined together, resulting in very specific sampling logic. Here are some examples that illustrate how the streams can be configured to have specific sampling behaviors.