User:AGreen (WMF)/Draft:Mapping of EventLogging properties to FundraisingImpressions database
This page sets out how data will be input into the pgehres database by new scripts that will replace DjangoBannerStats.
Tables to deprecate edit
Table name | Summary | No e-com queries | No other uses reported |
---|---|---|---|
landingpageimpressions | Not populated by existing scripts, table is empty. | ||
globalcollect_orderids | Table doesn't exist on production, only in create.sql. | ||
globalcollect_orderids_unused | Table doesn't exist on production, only in create.sql. | ||
squidhost | Lookup table of Varnish (formerly squid) hosts. Referenced by landingpageimpression_raw and bannerimpression_raw, but those columns are not used in queries. | ||
squidrecord | Not populated by existing scripts, table is empty. |
Table for tracking file ingestion, to update edit
In legacy, the squidlog
table records which files that have been ingressed. We'll rename it to files
, and for backwards compatibility, create a view.
Also, we'll add several columns:
New column | Description |
---|---|
status | Enum column for the file's processing status. Can be processing or consumed . (Unprocessed files are not included in the table.)
|
consumed_events | Total events consumed from the file. |
ignored_events | Total events ignored in the file (i.e., read and validated but not stored). |
invalid_events | Total invalid events (or lines) in the file. |
Tables to maintain edit
Table | Description |
---|---|
language | Lookup table for languages. |
country | Lookup table for countries. |
project | Lookup table for projects. |
bannerimpressions | Aggregates CentralNotice banner impressions on specific dimensions. |
landingpageimpression_raw | Individual landing page pageview events (only certain properties). |
donatewiki_unique | Individual landing page pageview events (only certain properties). |
Mapping from LandingPageImpression events edit
Legacy could process events from two sites, wikimediafoundation.org and and donate.wikimedia.org, but only put events from the second site in donatewiki_unique.
Since effectively we're no longer receiving events from the first site, the new script will check all LandingPageImpression events for 'donatewiki'
in the wiki
property.
Input | Source | Validation | Transformation | DB location |
---|---|---|---|---|
dt | event metadata | Parseable by datetime.strptime() using configured format.
|
Parsed from JSON and provided to mysql.connector as datetime.datetime object.
|
landingpageimpression_raw.timestamp,
donatewiki_unique.timestamp |
country | event | ^[a-zA-Z]{2,2}$
|
Reference in lookup table. | landingpageimpression_raw.country_id |
language | event | ^[a-z]+([_-][a-zA-Z0-9]+)?$
|
Reference in lookup table. Default to 'en' if value is missing or is an empty string. | landingpageimpression_raw.language_id |
utm_source | event | Default validation regex (from config) | landingpageimpression_raw.utm_source,
donatewiki_unique.utm_source | |
utm_campaign | event | Default validation regex (from config) | landingpageimpression_raw.utm_campaign,
donatewiki_unique.utm_campaign | |
utm_medium | event | Default validation regex (from config) | landingpageimpression_raw.utm_medium | |
utm_key | event | Default validation regex (from config) | landingpageimpression_raw.utm_key | |
contact_id | event | Default validation regex (from config) | donatewiki_unique.contact_id | |
link_id | event | Default validation regex (from config) | donatewiki_unique.link_id | |
template | event | Default validation regex (from config) | Use only if landingpage is 'Special:LandingPage''. In that case, join all event properties with '~'; for any that are absent, use the string 'default'. | landingpageimpression_raw.landingpage |
appeal | event | Default validation regex (from config) | landingpageimpression_raw.landingpage | |
appeal_template | event | Default validation regex (from config) | landingpageimpression_raw.landingpage | |
form_template | event | Default validation regex (from config) | landingpageimpression_raw.landingpage | |
form_countryspecific | event | Default validation regex (from config) | landingpageimpression_raw.landingpage | |
landingpage | event | Default validation regex (from config) | If this value is not 'Special:LandingPage', remove the last two parts separated by ('/'); otherwise, ignore. | landingpageimpression_raw.landingpage |
sample_rate | event | (none) | ||
(none) | landingpageimpression_raw.squid_id,
landingpageimpression_raw.squid_sequence | |||
(none) | landingpageimpression_raw.processed |
Mapping from CentralNoticeImpression events edit
Input | Source | Validation | Transformation | Output | Notes |
---|---|---|---|---|---|
dt | event metadata | Parseable by datetime.strptime() using configured format.
|
Parsed from JSON and provided to mysql.connector as datetime.datetime object.
|
bannerimpressions.timestamp | |
banner | event | ^[A-Za-z0-9_]+$
|
bannerimpressions.banner | ||
campaign | event | Default validation regex (from config) | bannerimpressions.campaign | ||
project | event | ^[a-z0-9\-_\.]+$
|
Reference in lookup table. | bannerimpressions.project_id | |
uselang | event | ^[a-z]+([_-][a-zA-Z0-9]+)?$
|
Only certain languages processed separately, following this legacy setting. Remaining languages are aggregated. Reference in lookup table. | bannerimpressions.language_id | |
country | event | ^[a-zA-Z]{2,2}$
|
Reference in lookup table. | bannerimpressions.country_id | |
impressionEventSampleRate | event | (not used, as per legacy) | |||
(aggregate of all of the above) | Total number of hits with the same values for banner, campaign, project, language (see above) and country, received in a given log file, for a given minute, scaled according to combined (server- and client-side) sample rate. | bannerimpressions.count |