Reposted from https://phabricator.wikimedia.org/T281499#7085780
Especially when naming symbols for data (which almost everything is), I'd really like to stress that capital letters in data keys are a really bad idea.
https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas/Guidelines#Identifier_Naming_Rules
Data moves around. It will be used in different languages with different typing and different naming rules. It will certainly be used in SQL systems, which are for the most part case insensitive. The only common identifier naming rule that will function in all of these systems is snake_case.
Any time data passes through a case insensitive system, it will be normalized, most likely to all lower case.
Fields like isPartOf and mainEntity will become ispartof and mainentity. Longer names that include acronyms get even worse. In camelCase, it isn't clear what the acronym capitalization rules are. E.g. HTTPURLID? HttpUrlId? Whatever the camelCase acronym rule is, the name will be normalized in SQL systems to e.g. httpurlid.
Data integration automation code has to reason about which fields are the same. If ingesting data that has capital letters, it is possible that two different fields end up normalized to the same lower cased name. Then we just have to guess about how to ingest data.
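To make the collision problem concrete, here is a minimal Python sketch of the kind of check an ingestion job might need. The function name find_case_collisions and the sample field list are made up for illustration; they are not taken from any existing pipeline.

```python
from collections import defaultdict


def find_case_collisions(field_names):
    """Group field names that collapse to the same lower-cased name.

    Hypothetical helper: mimics a case-insensitive system that
    normalizes identifiers to lower case.
    """
    groups = defaultdict(list)
    for name in field_names:
        groups[name.lower()].append(name)
    # Keep only normalized names claimed by more than one distinct field.
    return {norm: names for norm, names in groups.items() if len(set(names)) > 1}


# Illustrative field names only.
fields = ["isPartOf", "mainEntity", "HTTPURLID", "HttpUrlId", "page_id"]
print(find_case_collisions(fields))
# {'httpurlid': ['HTTPURLID', 'HttpUrlId']}
```

Once two fields land in the same group, nothing in the data itself says which one should win; snake_case names never reach this state because lower-casing them is a no-op.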
Every time someone needs to move camelCased data identifiers through case-insensitive systems, they will have to write code that reasons about the case changes. If we avoid upper-cased field names in our schemas, we are less likely to encounter bugs and breakages in data pipelines.
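As a rough illustration of what "code that reasons about the case changes" tends to look like, here is a small regex-based camelCase to snake_case converter in Python. The function camel_to_snake is a hypothetical helper and its acronym handling is just one possible guess, which is exactly the problem.

```python
import re


def camel_to_snake(name):
    """Best-effort camelCase -> snake_case conversion (hypothetical helper)."""
    # Break between a lower-case letter or digit and an upper-case letter:
    # "isPartOf" -> "is_Part_Of"
    s = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)
    # Break an acronym run from a following capitalized word:
    # "HTTPRequest" -> "HTTP_Request"
    s = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", s)
    return s.lower()


print(camel_to_snake("isPartOf"))                      # is_part_of
print(camel_to_snake("incomingHTTPRequestIpAddress"))  # incoming_http_request_ip_address
print(camel_to_snake("HttpUrlId"))                     # http_url_id
print(camel_to_snake("HTTPURLID"))                     # httpurlid
```

Note that HttpUrlId and HTTPURLID come out as different names even though they presumably mean the same thing, and the conversion cannot be applied at all once a case-insensitive system has already flattened the name to httpurlid.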
Additionally, I've heard that camelCase can be difficult for non-native English speakers. incomingHTTPRequestIpAddress (which is normalized to incominghttprequestipaddress) is (subjectively) more difficult to read than incoming_http_request_ip_address.
Is it worth adding this to this Naming_Things page?