Technical decision making/Decision records/T292402

What are your constraints?

General Assumptions and Requirements
  • Deploying configuration for MediaWiki MUST NOT require rebuilding a docker container. Source: RelEng (Dan Duval)
  • At least some specific config changes MUST be possible within seconds (for incident response). Source: Ops (Giuseppe)
  • It SHOULD be easy to see the effective configuration for a given site (and data center, and server group). Source: RelEng (Ahmon), SRE
  • Loading configuration MUST be performant (at least comparable to current performance). Source: Performance Team (Timo)
Security Requirements
  • It MUST be possible to securely provide secrets (e.g. database passwords) to MediaWiki. Source: Common sense
  • There MUST be a way to deploy ad-hoc PHP code as a security measure. Source: Security team (Scott)
Privacy Requirements
n/a

Important Questions

Question: What caching strategy should be used for configuration?
Who can answer: ServiceOps, Performance
Resolution: Three strategies, for different use cases:
  • Pre-generated PHP files are the fastest option, used for things that rarely change.
  • Config read from YAML files can be cached in APCu and invalidated based on mtime.
  • Config coming from etcd can be cached briefly in APCu and must support stale reads, to be resilient against failure of the etcd service.

Question: How do we de-risk deployment of a new config loading mechanism?
Who can answer: ServiceOps
Resolution: A feature switch in the MultiVersion wrapper, toggled based on host name and patched in manually.

Question: How will the new approach relate to the existing functionality in the SiteConfiguration class?
Who can answer: RelEng
Resolution: SiteConfiguration can in the future be used to generate per-wiki config files which get copied into the deployment image.

Question: How do we overcome the 1MB limit of k8s config maps?
Who can answer: RelEng
Resolution: Large parts of the config rarely change and can be deployed as part of the image. Smaller parts that may change more frequently can be deployed via config maps. Highly critical overrides (like the name of the active data center) can be set from etcd.
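
The third caching strategy listed above (brief caching with stale reads) can be illustrated with a minimal sketch. It is written in Python for brevity rather than PHP, and it is not the production implementation: `StaleTolerantCache`, the `fetch` callable (a stand-in for an etcd read) and the injectable `clock` are all hypothetical names introduced for this example, and an in-process attribute stands in for APCu.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class StaleTolerantCache:
    """Caches a fetched value briefly; serves the stale copy if a refresh fails."""

    def __init__(
        self,
        fetch: Callable[[], Dict[str, Any]],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch          # hypothetical stand-in for an etcd read
        self._ttl = ttl_seconds
        self._clock = clock
        self._entry: Optional[Tuple[float, Dict[str, Any]]] = None

    def get(self) -> Dict[str, Any]:
        now = self._clock()
        if self._entry is not None and now - self._entry[0] < self._ttl:
            return self._entry[1]    # fresh enough, no backend call
        try:
            value = self._fetch()
        except Exception:
            if self._entry is None:
                raise                # nothing cached yet, the failure cannot be masked
            return self._entry[1]    # backend unavailable: serve the stale value
        self._entry = (now, value)
        return value
```

In this sketch a failed refresh leaves the old timestamp in place, so the next call tries the backend again. A real deployment would probably want to limit how often a failing backend is retried.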

Decision

Selected Option Option 2: Allow MediaWiki to load configuration from a set of JSON files.
Rationale The current situation (option 1) is undesirable because it is hard to determine the effective configuration for each wiki. The current form of configuration also does not fit well with how configuration management is done in Kubernetes.

The alternative approach of implementing all needed functionality in the mediawiki-config repo (option 3) is inferior, since it would not allow development environments, CI scenarios, and third parties to benefit from the new configuration mechanism.

Data See the stakeholder consultation minutes and the design decision section of this document.
Informing PET has been informing the stakeholders in a series of meetings. We will publish a write-up that summarizes the changes that will be in the 1.38 release.
Who Daniel Kinzler
Date The general approach was decided in December 2021, with some tuning on the details up to March 2022.

What are your options?

Option 1: Do Nothing
Description Configuration remains as complex executable code, deployed as part of the MediaWiki image.
Benefits No development effort, no risk of breaking things.
Risks Configuration is hard to maintain and risky to change. Configuration changes require the docker image to be rebuilt.
Effort none
Costs none
Testing Same as always: use a “mostly similar” configuration in Beta Cluster.
Performance & Scaling none
Deployment The ones we currently have
Rollback and reversibility None needed
Operations & Monitoring Same as we already have
Additional References https://wikitech.wikimedia.org/wiki/MediaWiki_at_WMF#MediaWiki_configuration
Consultations
RelEng Undesirable. The current structure makes it hard to see the effective configuration of a given wiki. The large number of conditionals in CommonSettings.php makes it hard to reason about.
SRE Undesirable. The current structure makes it hard to see the effective configuration of a given wiki. The large number of conditionals in CommonSettings.php makes it hard to reason about.
Code Health Undesirable, since it does not give us a good way to control configuration for end-to-end test scenarios.
Growth Undesirable, since it does not give us a good way to control configuration for end-to-end test scenarios.
Performance The current system is probably the fastest, so would be ok.
Security The current system provides an easy way to inject ad hoc security measures. However, since the configuration is hard to reason about, it is not ideal from a security perspective.
Option 2: Allow MediaWiki to load configuration from a set of JSON files.
Description Allow MediaWiki to load configuration from a set of JSON files. This needs a mechanism to pick the correct file for the requested site (multi-tenancy) as well as a mechanism to merge configuration from different sources.

The concept of “configuration” also needs to be expanded beyond configuration settings, to include the list of extensions and skins to load, and adjustments to be made to the PHP runtime environment.
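
The two mechanisms named in the description (picking the files for the requested site, then merging them) could look roughly like the sketch below. It is in Python for brevity and rests on assumptions not made in this document: the file names `common.json` and `<site>.json`, the function names, and a shallow key-by-key override as the merge rule are all hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def settings_files_for(site: str, config_dir: Path) -> List[Path]:
    """Return the files that apply to `site`, lowest precedence first."""
    candidates = [config_dir / "common.json", config_dir / f"{site}.json"]
    return [path for path in candidates if path.exists()]


def load_effective_config(site: str, config_dir: Path) -> Dict[str, Any]:
    """Load each applicable file; later files override earlier ones key by key."""
    effective: Dict[str, Any] = {}
    for path in settings_files_for(site, config_dir):
        effective.update(json.loads(path.read_text(encoding="utf-8")))
    return effective


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        config_dir = Path(tmp)
        (config_dir / "common.json").write_text(
            '{"Sitename": "Wiki", "Lang": "en"}', encoding="utf-8"
        )
        (config_dir / "dewiki.json").write_text('{"Lang": "de"}', encoding="utf-8")
        # Shared defaults plus the per-site override for the hypothetical "dewiki"
        print(load_effective_config("dewiki", config_dir))
```

Because the result is a plain data structure, the effective configuration for any site can be printed or compared directly.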

Benefits
  • Allow us to fully benefit from Kubernetes.
  • Align with industry best practice.
  • Makes it easier to configure local development environments.
  • Provides a foundation for automated configuration management for end-to-end tests.
  • Provides a standard mechanism for setting up multi-wiki environments (wiki farms) for testing and development, as well as third party installations.
  • Reduces the effort needed to move MediaWiki core away from relying on global variables for configuration.
Risks Adding the new capability to MediaWiki is virtually risk-free. Transitioning our production environment to using the new system will involve risks. That transition however is outside the scope of this proposal.
Effort Four to eight weeks (three senior engineers) to do the changes in core. Transitioning our production environment to using the new system will be an iterative process, and will take more time. That transition however is outside the scope of this proposal.
Costs none
Testing None for the change to MediaWiki core. When transitioning our production environment to using the new system, we should probably make this change for the deployment-prep (aka “beta”) environment first.
Performance & Scaling We should create a mock configuration that corresponds to what we would be loading in production, and compare loading time to the time it currently takes to execute CommonSettings.php.
Deployment None for the change to MediaWiki core. Transitioning our production environment to using the new system will require careful planning. That transition however is outside the scope of this proposal.
Rollback and reversibility When transitioning the loading of config defaults to the new system, a feature switch will need to be introduced into the MultiVersion wrapper so we can easily switch back to the old system if problems arise.

While transitioning our production environment to using the new system, it should easily be possible to go back to the old system of configuration by simply reverting to an old version of the mediawiki-config repository. That transition however is outside the scope of this proposal.
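
A host-name-based feature switch of the kind described for the MultiVersion wrapper could be as small as the sketch below. It is in Python for brevity and is only an illustration: the host names, `NEW_LOADER_HOSTS`, `use_new_loader` and `load_config` are hypothetical and do not reflect the actual wrapper code.

```python
import socket
from typing import Any, Callable, Dict, FrozenSet, Optional

# Hypothetical host names; a real list would be patched in manually.
NEW_LOADER_HOSTS: FrozenSet[str] = frozenset({"canary1001", "canary1002"})


def use_new_loader(hostname: Optional[str] = None) -> bool:
    """Return True when this host should use the new config loading mechanism."""
    if hostname is None:
        hostname = socket.gethostname()
    short_name = hostname.split(".", 1)[0]   # ignore any domain suffix
    return short_name in NEW_LOADER_HOSTS


def load_config(
    legacy: Callable[[], Dict[str, Any]],
    new: Callable[[], Dict[str, Any]],
    hostname: Optional[str] = None,
) -> Dict[str, Any]:
    """Dispatch to exactly one loader; removing a host from the set is the rollback."""
    return new() if use_new_loader(hostname) else legacy()
```

Keeping both loaders callable side by side lets the switch be reverted without touching the configuration data itself.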

Operations & Monitoring When transitioning the loading of config defaults to the new system, the performance impact needs to be monitored carefully.

Transitioning our production environment to using the new system will require careful monitoring, since it affects all configuration, and problems could manifest in various ways in any part of the system. That transition however is outside the scope of this proposal.

Additional References
  • Exploration notes: https://docs.google.com/document/d/1wJ5aL6sr-AvuU9ZDkuIIflWFq_oJXCiDbNYToAwKBy0/edit
  • Requirement gathering: https://docs.google.com/document/d/1fLZJsoy_XnP3VppVyu6tZ6vSl_i0CSsd-m5Kgr7FLDA/edit#heading=h.gel4l5urnr5x
  • Stakeholder conversation log: https://docs.google.com/document/d/1XjypA5wxwWJq46_2cFjuJxNLRxF1P3kL01KFypLGaBE/edit#
Consultations
RelEng Support. Would provide us with flexibility as to how we represent configuration and how we combine different parts and aspects. This would allow us to transition to a system that makes configuration easier to maintain and reason about.

Note: we will need to change how our end of the config system works in any case. But having good support for loading and combining configuration in core will make this much easier. The explorations that have been done around T263166 will likely be useful in improving our end of the config system.

SRE Support. Would provide us with flexibility as to how we represent configuration and how we combine different parts and aspects. This would allow us to transition to a system that makes configuration easier to maintain and reason about.

Caution: Care needs to be taken to get the caching characteristics right, with respect to performance but also fault tolerance and the ability to quickly change configuration.

Code Health Support. This improves testability of our configuration system, and configurability of our testing system. In addition, it moves us away from relying on global variables for configuration, which should improve the overall testability of the code base.
Growth Support. This doesn’t quite provide the support for testing scenarios we need, but it is moving in the right direction. Once this is implemented, it should be much easier for us to get what we need.
Performance No objections.

Caution: Care must be taken to design the new system in a performant way, especially with respect to caching. Runtime overhead must be measured carefully during deployment.

Security No objections. The design of the proposed system doesn’t introduce security issues.

We need to retain a way to deploy ad hoc PHP code as a security measure, though.

Option 3: No changes to MediaWiki core, rewrite mediawiki-config repo
Description Implement the logic for loading and merging configuration from static files as described in option 2, but in the mediawiki-config repo, not in MediaWiki itself. This basically means doing the “transition production to the new config system” project that would follow option 2 immediately.
Benefits No changes needed to the MediaWiki software. Avoids generalization, fully customized to WMF’s production needs. Only one project needed (transitioning production config) instead of two (option 2 adds capabilities to core first, which are then used to transition the prod system).
Risks
  • Ownership of the code in CommonSettings.php is unclear.
  • Code changes follow config lifecycle, not code deployment lifecycle.
  • Additional effort needed to allow CI environments to benefit.
  • Configuration management for end-to-end testing will need to be implemented separately (duplicate effort).
  • Local development and testing environments will not benefit.
  • Hard to test locally, no CI or any kind of test suite available.
  • Further cements MediaWiki core’s use of global variables for configuration input.
Effort A first milestone can probably be reached in four to eight weeks by three senior engineers. Completion of the transition will be an iterative process.
Costs None at the moment, leave everything for later.
Testing The only way we have for testing configuration is deployment-prep (aka “beta”).
Performance & Scaling Similar to option 2: Manual benchmarking will have to be injected into the configuration code. Performance changes will have to be monitored for each change.
Deployment Changes would be deployed like regular configuration changes.
Rollback and reversibility Changes to the mediawiki-config environment are easy to roll back.
Operations & Monitoring Careful monitoring will be required immediately, since every change potentially affects all configuration, and problems could manifest in various ways in any part of the system.

Note that there is no CI testing for the production config system.

Additional References There has been some exploration of this idea:
Consultations
RelEng Undesirable. It would fall on us to come up with a new system of maintaining configuration, and mapping it directly to the global state of a PHP code base, rather than data structures that can be compared and validated. It would also be much more work to allow the CI system to benefit.

Note: we will need to change how our end of the config system works in any case. But having good support for loading and combining configuration in core will make this much easier. The explorations that have been done around T263166 will likely be useful in improving our end of the config system.

SRE Neutral. As long as someone comes up with a better way to manage configuration, we do not care who writes it or where it lives.
Code Health Undesirable, since it offers no improvement over the current situation. Development environments and CI systems would not benefit, nor would the quality of the MediaWiki code base itself.
Growth Undesirable, since it does not move us closer to the functionality we need, namely control of end-to-end testing scenarios.
Performance Neutral. As long as the new system doesn’t slow things down, we do not care who writes it or where it lives.
Security Neutral. As long as the new system doesn’t pose a risk and we can still deploy ad hoc measures, we do not care who writes it or where it lives.

Design Decisions

Should the new config loading mechanism become part of MediaWiki core? YES. Having the option to maintain configuration as standalone data files, and especially the ability to easily and safely combine multiple such files, is likely to benefit development environments, CI setup, as well as third party installations.
Should settings files and extension.json files have the same schema/structure? NOT RIGHT NOW, but perhaps eventually. Settings files and extension.json files serve essentially the same purpose, and need to be processed in the same way. Having them use the same structure would avoid confusion and duplication of logic. However, we would have to retain backwards compatibility with the old format for the foreseeable future, so there is no immediate benefit.

Converting extension.json to the same structure as settings files internally seems advantageous though, since it will allow us to share code between config loading and extension registration, and avoid inconsistencies.

Shall we support YAML in addition to JSON? YES. Configuration files need to be human editable. JSON files are hard to read and edit and don’t allow comments. Performance of loading YAML isn’t great, but can be improved by using a native PHP extension rather than a YAML parser written in PHP. Also, the performance implication of loading files will be mitigated by a transparent caching layer.

However, YAML is a complex format with surprising edge cases and unintuitive behavior. We should investigate tooling to mitigate these issues.

Should we use APCu for caching configuration? YES, for now. But we also need to support loading from generated PHP files to make use of the opcode cache, which is by far the fastest option.

However, there is a desire to disable opcode cache revalidation in production, which means that we can’t update config represented as PHP arrays without a pod restart. The solution is to use config loaded from PHP arrays as a baseline, and override it with values coming from config maps or etcd, and cached in APCu.
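
The baseline-plus-override approach can be sketched as follows, in Python for brevity. This is an illustration under stated assumptions, not the actual design: `OverridableConfig` is a hypothetical name, a dict stands in for config loaded from PHP arrays, a JSON file stands in for a mounted config map, and an instance attribute stands in for APCu. The override file is re-parsed only when its mtime changes.

```python
import json
import os
import tempfile
from typing import Any, Dict, Mapping, Optional


class OverridableConfig:
    """Fixed baseline plus overrides re-read when the override file's mtime changes."""

    def __init__(self, baseline: Mapping[str, Any], overrides_path: str) -> None:
        self._baseline = dict(baseline)   # stands in for config baked into the image
        self._path = overrides_path       # stands in for a mounted config map file
        self._mtime_ns: Optional[int] = None
        self._overrides: Dict[str, Any] = {}

    def _refresh(self) -> None:
        try:
            mtime_ns = os.stat(self._path).st_mtime_ns
        except FileNotFoundError:
            self._mtime_ns, self._overrides = None, {}
            return
        if mtime_ns != self._mtime_ns:    # file changed since it was last parsed
            with open(self._path, encoding="utf-8") as handle:
                self._overrides = json.load(handle)
            self._mtime_ns = mtime_ns

    def effective(self) -> Dict[str, Any]:
        """Return the baseline with current overrides applied on top."""
        self._refresh()
        return {**self._baseline, **self._overrides}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "overrides.json")
        config = OverridableConfig({"ReadOnly": False}, path)
        print(config.effective())         # baseline only, no override file yet
        with open(path, "w", encoding="utf-8") as handle:
            json.dump({"ReadOnly": True}, handle)
        print(config.effective())         # override picked up via the new mtime
```

One limitation of mtime-based invalidation is that two writes within the same timestamp tick can go unnoticed, so a real implementation may need an additional signal such as file size or a content hash.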

Shall we merge together multiple settings files before caching? NOT RIGHT NOW, but the design needs to allow for us to change direction on this. Batching reduces the amount of work needed to be done while loading configuration from cache. However, because of the large number of possible permutations of config files, we may end up evicting cache entries and degrade performance. Which solution is better depends on a large number of factors, such as how we end up splitting the configuration, the size of each data file, the complexity of the merge operations and the hardware specification of the application servers.
Do we need to support interpolation or other pre-processing of settings values, such as PHP constants? NOT RIGHT NOW, but keep the option to add this feature later. In particular, the ability to reference namespace constants would be nice to have in manually maintained YAML files. But the additional complexity does not seem worthwhile at this time; also, there is a danger of adding in more complex pre-processing such as expression evaluation, which could defeat the idea of making configuration easier to reason about.
Should we convert DefaultSettings.php to data files (JSON or YAML or such)? YES and NO: we need a schema, but it should be defined in PHP, for better integration with phpdoc and so we can use constants.

We need a schema, rather than just default values, so we can determine the merge strategy for each configuration key.

After some experimentation, we settled on defining JSON schema structures for each config setting as a constant in a PHP file. This allows us to generate a proper JSON Schema file, while retaining much of the structure and documentation currently present in DefaultSettings.php. We also retain the ability to use PHP constants (especially for namespaces) in default values, which would have been lost when representing the schema as YAML or JSON.
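
To show why a schema is needed to determine the merge strategy for each key, here is a minimal sketch in Python. The setting names, the `mergeStrategy` field and the three strategy names (`replace`, `map_merge`, `list_append`) are hypothetical and chosen for this example; they are not the strategies or schema layout decided on here.

```python
import copy
from typing import Any, Dict

# Hypothetical schema: each key declares a default and how layers are combined.
SCHEMA: Dict[str, Dict[str, Any]] = {
    "Sitename": {"default": "MediaWiki", "mergeStrategy": "replace"},
    "ExtraNamespaces": {"default": {}, "mergeStrategy": "map_merge"},
    "AllowedHosts": {"default": [], "mergeStrategy": "list_append"},
}


def merge_value(strategy: str, current: Any, incoming: Any) -> Any:
    """Combine two values for one key according to the declared strategy."""
    if strategy == "replace":
        return incoming
    if strategy == "map_merge":
        return {**current, **incoming}   # incoming entries win on key collisions
    if strategy == "list_append":
        return [*current, *incoming]
    raise ValueError(f"unknown merge strategy: {strategy}")


def defaults() -> Dict[str, Any]:
    """Build the starting configuration from the schema defaults."""
    return {key: copy.deepcopy(spec["default"]) for key, spec in SCHEMA.items()}


def apply_layer(config: Dict[str, Any], layer: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new config with `layer` merged in; keys without a schema are replaced."""
    merged = dict(config)
    for key, incoming in layer.items():
        strategy = SCHEMA.get(key, {}).get("mergeStrategy", "replace")
        if key in merged:
            merged[key] = merge_value(strategy, merged[key], incoming)
        else:
            merged[key] = incoming
    return merged
```

With default values alone, a loader could not tell whether a second file's `AllowedHosts` should extend or replace the first one; the per-key strategy makes that explicit.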

Having a schema for configuration will be useful for validating configuration, especially when parts of the configuration are maintained by the community on-wiki.
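
As a rough illustration of schema-based validation, the Python sketch below checks settings against a schema fragment that uses only the JSON Schema `type` keyword. The setting names, `TYPE_SCHEMA` and `validate` are hypothetical, and a real setup would more likely rely on a full JSON Schema validator than on a hand-written check like this one.

```python
from typing import Any, Dict, List

# Hypothetical schema fragment using the JSON Schema "type" keyword only.
TYPE_SCHEMA: Dict[str, Dict[str, str]] = {
    "Sitename": {"type": "string"},
    "MaxUploadSize": {"type": "integer"},
    "EnableUploads": {"type": "boolean"},
}

_PYTHON_TYPES = {
    "string": str,
    "integer": int,
    "boolean": bool,
    "array": list,
    "object": dict,
}


def validate(config: Dict[str, Any], schema: Dict[str, Dict[str, str]]) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors: List[str] = []
    for key, value in config.items():
        spec = schema.get(key)
        if spec is None:
            errors.append(f"{key}: unknown setting")
            continue
        expected = _PYTHON_TYPES[spec["type"]]
        # bool is a subclass of int in Python, so reject it for "integer" explicitly
        wrong_bool = expected is int and isinstance(value, bool)
        if not isinstance(value, expected) or wrong_bool:
            errors.append(f"{key}: expected {spec['type']}")
    return errors
```

Returning all problems at once, rather than stopping at the first, suits configuration that is edited by hand, since an editor can fix every reported issue in one pass.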