Wikimedia Maps/Reports/OSM replication disk usage 2020
Starting from December 24, 2019, disk space usage increased on the Maps servers, especially on PostgreSQL. Further investigation found a correlation between OSM replication and the disk space increase. The issue was identified on January 24, 2020. At that point, replication was changed to an hourly-based schedule, executing every 2 hours.
Solution
The most straightforward solution found by the team was to re-import OSM and re-enable OSM replication at a longer interval, as part of T137939. The new replication rate is 12 hours; this initial work was tracked at T249086. To make sure the disk space problem would not happen again, we performed a set of experiments in the codfw cluster.
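For illustration, a minimal sketch of what a 12-hour replication schedule can look like as a system cron entry. The script name, path and log location are assumptions for the example only, not the production configuration, which is managed through configuration management:

 # Hypothetical /etc/cron.d entry: run OSM replication twice a day, 12 hours apart.
 # Script path and log file are placeholders, not the actual Maps setup.
 0 2,14 * * * postgres /usr/local/bin/replicate-osm >> /var/log/osm/replication.log 2>&1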
Experiment timeline
- Run the osm-initial-import script
- After the re-import operation, DB size was equal to 793GB (measured at 04/06/2020 13:04)
- Monitor disk-space usage using the following commands:
 sudo -u postgres bash
 du -sh /srv/postgresql/9.6/main/base/*
 psql -d gis -t -c "SELECT TABLE_NAME, index_bytes, toast_bytes,
        total_bytes - index_bytes - COALESCE(toast_bytes, 0) AS table_bytes
   FROM (
     SELECT relname AS TABLE_NAME,
            pg_total_relation_size(c.oid) AS total_bytes,
            pg_indexes_size(c.oid) AS index_bytes,
            pg_total_relation_size(reltoastrelid) AS toast_bytes
       FROM pg_class c
       LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
      WHERE relkind = 'r' AND nspname = 'public'
   ) a;"
- After collecting enough data, we verified that the DB was still growing by 1GB/day
- The DB size increased to 817GB in 11 days (measured at 04/17/2020 14:50)
- We performed VACUUM FULL on the PostgreSQL database to see how much of the extra disk space could be recovered
- We performed REINDEX DATABASE gis to see how much of the extra disk space could be recovered (a condensed sketch of these maintenance commands follows this list)
- After these steps, the database disk usage went back down to 793GB (measured at 04/28/2020 14:35), the same as the initial size after the data re-import operation
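For reference, the checks and maintenance steps above can be condensed into a few psql one-liners. This is only a sketch using standard PostgreSQL commands; the depooling, repooling and replication toggling that has to happen around these commands is deliberately left out, and the exact invocation on the Maps hosts may differ:

 # Overall size of the gis database, human readable
 sudo -u postgres psql -d gis -t -c "SELECT pg_size_pretty(pg_database_size('gis'));"

 # Reclaim dead space; VACUUM FULL takes exclusive locks, so the host
 # has to be depooled before running it
 sudo -u postgres psql -d gis -c "VACUUM FULL;"

 # Rebuild all indexes; REINDEX blocks writes (so OSM replication must be
 # paused) but reads can continue
 sudo -u postgres psql -d gis -c "REINDEX DATABASE gis;"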
Conclusion
We were able to better understand how the OSM DB behaves in our infrastructure, and we can apply better solutions in the future to prevent the problem from happening again. Our findings are:
- OSM replication causes disk space growth on the PostgreSQL DB, mostly in indexes
- Re-importing the OSM DB did work in the short term, but then disk space started increasing again
- The interval between replications seems to be decisive for the growth rate of disk space
- Autovacuum is not enough to clean up the dangling data in the database; VACUUM FULL is more effective (see the bloat-check sketch after this list)
- VACUUM FULL blocks the database for read/write operations [1], which forces us to depool the machines before running it
- REINDEX DATABASE gis only blocks write operations [2], requiring us to disable OSM replication just for a short period of time
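To make the autovacuum finding easier to act on, here is a possible bloat check using only standard PostgreSQL statistics views and size functions. The queries and the LIMIT values are illustrative and were not part of the original investigation:

 # Dead tuples per table: large values that keep growing despite autovacuum
 # suggest that VACUUM FULL (and possibly REINDEX) will reclaim space
 sudo -u postgres psql -d gis -c "
   SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
     FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC
    LIMIT 10;"

 # Index size per table, since most of the observed growth was in indexes
 sudo -u postgres psql -d gis -c "
   SELECT relname, pg_size_pretty(pg_indexes_size(oid)) AS index_size
     FROM pg_class
    WHERE relkind = 'r'
    ORDER BY pg_indexes_size(oid) DESC
    LIMIT 10;"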