User:BDavis (WMF)/Projects/Structured logging

This is a collection of ideas about Structured logging in MediaWiki and related software projects. For the purposes of this discussion "logging" refers to wfErrorLog() and it's related family of functions and not audit logs stored in the database by Special:Log or similar systems.

Problem edit

Logging in MediaWiki (and WMF in general?) is optimized for human consumption. This works well for local development and testing and can scale to management of a small to mid size wiki depending on the number of eyes applied to the logs and the acumen of the operational support personnel with grep, cut, sed, awk and other text processing tools.

Unfortunately what works well for a single developer doesn't work as well for analyzing the log output of a large production site.

  • need better tools for wider audience
  • aggregation
  • de-dup
  • cross system correlation
  • alerting
  • reporting

This is not a wholly new idea. Let's look at what's out there and see if we can find a solution or at least borrow the best bits.

Current logging edit

wfDebug( $text, $logonly = false )
Logs developer provided free-form text + optional global prefix string
Possibly has time-elapsed since start of request and real memory usage inserted between prefix and message
Delegates to wfErrorLog()
wfDebugMem( $exact = false )
Uses wfDebug() to log "Memory usage: N (kilo)?bytes"
wfDebugLog( $logGroup, $text, $public = true)
Logs either to a custom log sink defined in $wgDebugLogGroups or via wfDebug()
Default
Prepends "[$logGroup] " to message
Custom sink
May log only a fraction of occurrences via mt_rand() sampling
Prepends wfTimestamp( TS_DB ) wfWikiID() wfHostname(): to message
Delegates to wfErrorLog() to actually write to sink
wfLogDBError( $text )
Enabled/disabled with $wgDBerrorLog sink location
Logs "$date\t$host\t$wiki\$text" via wfErrorLog() to $wgDBerrorLog sink
Date format is 'D M j G:i:s T Y' with possible custom timezone specified by $wgDBerrorLogTZ
wfErrorLog( $text, $file )
Writes $text to either a local file or a UDP packet depending on the value of $file
UDP
If $file ends with a string following the host name/IP it will be used as a prefix to $text
The final message with optional prefix added will be trimmed to 65507 bytes and a trailing newline may be added
FILE
$text will be appended to file unless the resulting file size would be >= 0x7fffffff bytes (~2G)
wfLogProfilingData()
Delegates to wfErrorLog() using $wgDebugLogFile sink
Creates a tab delimited log message including timestamp, elapsed request time, requesting IPs, and request URL followed by a newline and the profiler output.
Date is from gmdate( 'YmdHis' )

Structured logging standards edit

Graylog Extended Log Format (GELF) edit

From http://www.graylog2.org/about/gelf:

"The Graylog Extended Log Format (GELF) avoids the shortcomings of classic plain syslog:

- Limited to length of 1024 byte - Not much space for payloads like backtraces

- Unstructured. You can only build a long message string and define priority, severity etc."

Structured logs messages sent via UDP as gzip'd json with chunking support.

  • client libraries in PHP, Ruby, Java, Python, etc
  • native input for graylog, plugin for logstash
version
GELF spec version ("1.0")
host
hostname sending message
short_message
short descriptive message
full_message
long message (eg backtrace, env vars)
timestamp
UNIX microsecond timestamp
level
numeric syslog level
facility
event generator label
line
source file line
file
source file
_<whatever>
sender specified field

Metlog edit

From https://wiki.mozilla.org/Services/Sagrada/Metlog:

"The Metlog project is part of Project Sagrada, providing a service for applications to capture and inject arbitrary data into a back end storage suitable for out-of-band analytics and processing."

3-tier architecture: generator (eg an app), router (logstash), endpoints (back end event destinations: statsd, Sentry, HDFS, Esper, OpenTSDB, ...)

  • python and node.js client libs for formatting and emitting messages.
  • logstash plugins for input processing and output to backends
  • json serialized messages
timestamp
time message generated
logger
message generator (eg app name)
type
type of message payload
(application defined; correlated with payload and fields)
severity
rfc5424 numeric code (syslog level)
payload
message contents
fields
arbitrary key/value pairs
env_version
message envelope version (0.8)

Common Event Expression (CEE) edit

From http://cee.mitre.org/:

"Common Event Expression (CEE™) improves the audit process and the ability of users to effectively interpret and analyze event log and audit data. This is accomplished by defining an extensible unified event structure, which users and developers can leverage to describe, encode, and exchange their CEE Event Records."

Other ideas edit

Proposal edit

Serialization Format edit

Rather than dive down a rabbit hole of trying to find a universal spec for log file formats let's just keep things simple. PHP loves dictionaries (well they call them arrays but whatever; key=value collections) and has a pretty fast json formatter. So the simplest thing that will work reasonably well would be to keep log events internally as PHP arrays and serialize them as json objects. This will be relatively easy to recreate on other internally developed applications as well with the possible exception of apps written in low level languages such as C that don't have ready made key=value data structures.

Data collected edit

Here's a list of the data points that I think we should definitely have:

timestamp
Local system time that event occurred either as UNIX epoch timestamp or ISO 8601 formatted string
date( 'c' )
host
FQDN of system where event occurred
php_uname( 'n' )
source
Name of application generating events; correlates to APP-NAME of RFC 5424
'Mediawiki'
pid
Unix process id, thread id, thread name or other process identifier
getmypid()
severity
Message severity (RFC 5424 levels)
'WARN'
channel
Log channel. Often the function/module/class creating message (similar to wgDebugLogGroups groups)
get_class( $this )
message
Message body
"Help! I'm trapped in a logger factory!"

Additionally I would suggest that we add a semi-structured "context" component to logs. This would be a collection of key=value pairs that the developers determine to be useful for debugging. In the past I have found it useful to have two different methods available to add such data. The first is as an optional argument to the logging method itself and the second is a global collection patterned after the Log4J Mapped Diagnostic Context (MDC).

The local collection is useful for obvious reasons such as attaching class/method state data to the log output and deferring stringification of resources in the event that runtime configuration is ignoring messages of the provided level.

file
Source file triggering message
line
Source line triggering message
errcode
Numeric or string identifier for the error
exception
Live exception object to be stringified by the log event emitter
args
key=value map of method arguments

The global collection is very useful for attaching global application state data to all log messages that may be emitted. Examples of data that could be included:

vhost
Apache vhost processing request
$_SERVER['HTTP_HOST']
ip
Requesting ip address
$_SERVER['REMOTE_ADDR']
user
Authenticated user identity
req
Request ID; UUID or similar token that can be used to correlate all log messages connected to a given request

There are obvious privacy concerns with some of these data fields. Careful thought will need to be given to designing a system such that wikis can choose whether or not to log any of these fields.

Not collected (rejected fields) edit

ua
User agent

sess

Application session ID (or a proxy token that can be used to correlate events connected to the session)

API edit

FIXME: Propose a PHP API for logging and a migration plan for existing wfErrorLog() calls

See Also edit