Wikimedia Product/Data dictionary/content pv
The cchen.content_pv table (available on Hive) contains daily pageview data broken down by content topic. It is generated by aggregating wmf.pageview_hourly and joining the result with isaacj.article_topics_outlinks on Hive. It is stored in the Parquet columnar file format and partitioned by year, month and day.
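Because the table is partitioned by year, month and day, a query that filters on those partitions only needs to read the files for the requested day. The following is a minimal sketch of building such a query as a string. It assumes the partition columns are named `year`, `month` and `day` and hold integers; the helper names `partition_filter` and `daily_pageviews_query` are hypothetical and exist only for illustration.

```python
from datetime import date


def partition_filter(day: date) -> str:
    """Return a WHERE-clause predicate that selects one daily partition.

    Assumption: the partition columns are integers named year, month, day.
    """
    return f"year = {day.year} AND month = {day.month} AND day = {day.day}"


def daily_pageviews_query(day: date, table: str = "cchen.content_pv") -> str:
    """Build a query string that totals pageviews for a single day."""
    return (
        "SELECT SUM(pageviews) AS total_pageviews\n"
        f"FROM {table}\n"
        f"WHERE {partition_filter(day)}"
    )


print(daily_pageviews_query(date(2021, 5, 29)))
```

The sketch only produces query text; running it still requires a Hive client of your choice.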
This page describes the content_pv data set, which is loaded from cchen.content_pv on Hive through Presto and can be accessed via Superset.
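As an illustration of the kind of query one might run against the Presto-backed data set, the sketch below builds a query that ranks top-level topics by pageviews for one day. It uses the column names from the schema on this page (`date`, `main_topic`, `pageviews`). The unqualified table name `content_pv`, the double-quoting of `date`, and the helper name `top_topics_query` are assumptions, not confirmed details of the actual setup.

```python
from datetime import date


def top_topics_query(day: str, limit: int = 10, table: str = "content_pv") -> str:
    """Build a Presto-style query ranking main topics by pageviews for one day.

    `day` must be an ISO date such as "2021-05-29"; it is parsed first so
    that malformed input raises ValueError instead of reaching the query.
    """
    day_iso = date.fromisoformat(day).isoformat()
    return (
        "SELECT main_topic, SUM(pageviews) AS total_pageviews\n"
        f"FROM {table}\n"
        # Assumption: the date column is a timestamp set to midnight.
        f"WHERE \"date\" = TIMESTAMP '{day_iso} 00:00:00'\n"
        "GROUP BY main_topic\n"
        "ORDER BY total_pageviews DESC\n"
        f"LIMIT {int(limit)}"
    )


print(top_topics_query("2021-05-29", limit=5))
```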
Schema
Field name | data type | description | data example | source schema | source field |
---|---|---|---|---|---|
date | timestamp | The date of pageviews | 2021-05-29 00:00:00.0 | wmf.pageview_hourly | event_timestamp |
project | string | Project name from hostname | hu.wikipedia | wmf.pageview_hourly | project |
market | string | Global markets (see definition) | Global North | canonical_data.countries | economic_region |
country | string | Country | Albania | canonical_data.countries | country |
country_code | string | ISO code for country | AL | canonical_data.countries | country_code |
topics | string | Topics assigned to articles by the outlink-based model (refer to the taxonomy for detailed article topics) | Geography.Geographical | isaacj.article_topics_outlinks | topic |
main_topic | string | Top level of the topic | Geography | cchen.topic_component | main_topic |
sub_topic | string | Second level of the topic | Geographical | cchen.topic_component | sub_topic |
pageviews | bigint | Number of pageviews | 10000 | wmf.pageview_hourly | count(1), then aggregated by year, month, and day |
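The data examples above suggest how `topics` relates to `main_topic` and `sub_topic`: `Geography.Geographical` yields `Geography` and `Geographical`. The sketch below reproduces that relationship by splitting on the dot. It is an assumption based only on those examples; the actual values come from cchen.topic_component, which may handle deeper or irregular topic names differently. The names `TopicParts` and `split_topic` are hypothetical.

```python
from typing import NamedTuple, Optional


class TopicParts(NamedTuple):
    main_topic: str
    sub_topic: Optional[str]


def split_topic(topic: str) -> TopicParts:
    """Split a dotted topic label into its first and second levels.

    Assumption: levels are separated by "." and any levels beyond the
    second are ignored. A label with no dot has no sub_topic.
    """
    parts = topic.split(".")
    sub = parts[1] if len(parts) > 1 else None
    return TopicParts(parts[0], sub)


print(split_topic("Geography.Geographical"))
# TopicParts(main_topic='Geography', sub_topic='Geographical')
```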