Wikimedia Maps/Reports/OSM replication disk usage 2020

Starting on December 24, 2019, disk space usage increased on the Maps servers, especially on the PostgreSQL hosts. Further investigation found a correlation between OSM replication and the disk space increase. The issue was identified on January 24, 2020; at that point, the replication schedule was changed to run every 2 hours instead of hourly.

Solution


The most straightforward solution found by the team was to re-import OSM and re-enable OSM replication at a longer interval, as part of T137939. The new replication interval is 12 hours; this initial work was tracked in T249086. To make sure the disk space issue wouldn't happen again, we performed a set of experiments in the codfw cluster.
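
As a rough illustration, a 12-hour replication schedule could be expressed as a cron entry like the one below; the actual job is managed by Puppet, and the script path shown here is a placeholder rather than the exact production configuration.

 # Illustrative only: the real replication job and its schedule are Puppet-managed;
 # /usr/local/bin/osm-replicate is a placeholder path.
 # m  h     dom mon dow   user      command
 0    */12  *   *   *     postgres  /usr/local/bin/osm-replicate >> /var/log/osm/replicate.log 2>&1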

Experiment timeline

  • Run the osm-initial-import script
    • After the re-import operation, the DB size was 793 GB (measured at 04/06/2020 13:04)
  • Monitor disk-space usage using the following commands:
 # Switch to the postgres user
 sudo -u postgres bash
 # On-disk size of each database directory
 du -sh /srv/postgresql/9.6/main/base/*
 # Per-table breakdown of table, index and TOAST sizes in the gis database
 psql -d gis -t -c "SELECT TABLE_NAME, index_bytes, toast_bytes, total_bytes-index_bytes-COALESCE(toast_bytes,0) AS table_bytes FROM (
    SELECT relname AS TABLE_NAME
         , pg_total_relation_size(c.oid) AS total_bytes
         , pg_indexes_size(c.oid) AS index_bytes
         , pg_total_relation_size(reltoastrelid) AS toast_bytes
    FROM pg_class c
    LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE relkind = 'r' AND nspname = 'public'
  ) a;"
  • After collecting enough data, we verified that the DB was still growing by 1 GB/day
    • The DB size had increased to 817 GB after 11 days (measured at 04/17/2020 14:50)
  • We performed VACUUM FULL on the PostgreSQL database to see how much of the extra disk space could be recovered
  • We performed REINDEX DATABASE gis to see how much of the extra disk space could be recovered (a command sketch for both steps follows this list)
    • After this step, the database disk usage shrank back to 793 GB (measured at 04/28/2020 14:35), the same as the initial size after the data re-import operation
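
For reference, a minimal sketch of how these two recovery steps can be run from psql is shown below; the VERBOSE option is an illustrative choice, not part of the recorded procedure.

 # Reclaim space held by dead rows; VACUUM FULL rewrites each table and
 # takes an exclusive lock, so the host must be depooled first.
 sudo -u postgres psql -d gis -c "VACUUM (FULL, VERBOSE);"
 # Rebuild all indexes in the gis database to shed index bloat.
 sudo -u postgres psql -d gis -c "REINDEX DATABASE gis;"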

Conclusion


We were able to better understand how the OSM DB behaves in our infrastructure and can apply better solutions in the future to prevent the problem from happening again. Our findings are:

  1. OSM replication causes disk space usage to grow on the PostgreSQL DB, mostly in indexes
  2. The solution of re-importing the OSM DB worked in the short term, but then the disk space started increasing again
  3. The interval between replications seems to be decisive for the growth rate of disk space
  4. Autovacuum is not enough to clean up the dead rows in the database; VACUUM FULL is more effective
  5. VACUUM FULL blocks the database for read/write operations [1], which forces us to depool the machines before running it
  6. REINDEX DATABASE gis only blocks write operations [2], so we only need to disable OSM replication for a short period of time
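
Putting findings 5 and 6 together, a maintenance run could be sequenced roughly as sketched below; the depool/pool steps and the replication timer name stand in for whatever load-balancer and scheduling tooling actually applies, so treat this as an assumption-laden outline rather than the documented procedure.

 # 1. Remove the host from serving traffic before the blocking VACUUM FULL
 #    (placeholder for the site-specific depool command).
 depool
 sudo -u postgres psql -d gis -c "VACUUM FULL;"
 # 2. Put the host back into rotation once the table rewrite is done.
 pool
 # 3. REINDEX only blocks writes, so pausing OSM replication is enough
 #    (hypothetical timer name below).
 sudo systemctl stop osm-replication.timer
 sudo -u postgres psql -d gis -c "REINDEX DATABASE gis;"
 sudo systemctl start osm-replication.timer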

References
