Logging - this is something we as developers do very badly - perhaps we need to ban the debugger. Unless you've worked in first-line support or in OPS, the only time you ever look at a log file is as a last resort. This is unfortunate, since the last resort may occur when you are trying to figure out why someone has lost money or when your company is about to be sued by a client (usually for losing their money, video, kids photo, etc). I've decided to compile a short guide about logging to both stimulate my thinking, and to jot down some notes for the next time I stand in front of a whiteboard and rant about logging in a production system. Not all of this is my own thinking and I am grateful to John Dickson for first sparking the idea that logging matters.

Why do we log?

  • Audit trail & application status
  • Automatic monitoring for errors and warnings
  • Helping track down configuration problems
  • Helping track down bugs

What logging level should be used?

FATAL

This should generally only be used for recording a failure that prevents the system starting, i.e. the system is completely unusable. It is also possible that errors during operation will also render the system unusable, but typically these will be identified as java.lang.Error instances (such as an OutOfMemoryError), and hence we will not likely catch them, since catching Throwable should only be done in very special cases.

ERROR

Records that something went wrong, i.e. some sort of failure occurred, and either:

  1. The system was not able to recover from the error, or
  2. The system was able to recover, but at the expense of losing some information or failing to honour a request.

This should be immediately brought to the attention of an operator. Or to rephrase it, if your error does not need immediate investigation by an operator, then it isn't an error.

To permit monitoring tools to watch the log files for ERRORs and WARNings is crucial that:

  • These get logged
  • Sufficient information is provided to identify the cause of the problem
  • The logging is done in a standard way, which lends itself to automatic monitoring.

For example, if the error is caused by a configuration failure, the configuration filename should be provided (especially if you have more than one file, yuck), as well as the property causing the problem.

WARN

A WARN message records that something in the system was not as expected. It is not an error, i.e. it is not preventing correct operation of the system or any part of it, but it is still an indicator that something is wrong with the system that the operator should be aware of, and may wish to investigate. This level may be used for errors in user-supplied information.

INFO

INFO priority messages are intended to show what's going on in the system, at a broad-brush level. INFO messages do not indicate that something's amiss (use WARN or ERROR for that), and the system should be able to run at full speed in production with INFO level logging.

The following types of message are probably appropriate at INFO level:

  • <System component> successfully initialised
  • <Transaction type> transaction started, member: <member number>, amount: <amount>
  • <Transaction type> transaction completed, txNo:<transaction number>, member: <member number>, amount: <amount>, result: <result code>

Using Log4J's MDC or NDC should provide a mechanism to avoid repetition of contextual information when creating log files, allowing the log file entries to have a standard format (see the DEBUG section below).

DEBUG

DEBUG messages are intended to help isolate a problem in a running system, by showing the code that is executed, and the context information used during that execution. In many cases, it is that context information that is most important, so you should take pains to make the context as useful as possible. For example, the message 'doTransaction() started' says nothing about which merchant the transaction is for, the type of transaction, the amount, or anything else that might help us to relate this to a transaction that failed.

Using Log4J's mapped diagnostic context (MDC) or nested diagnostic context (NDC) there is a common mechanism provided for thread-based (effectively request-based) context (things like session IDs, transaction IDs etc.), so if some of the context in your message should be common to all log messages for this request, set it in the context (and make sure your PatternLayout can display it).

In normal operation, a production system would not be expected to run at DEBUG level. However, if there is an occasional problem being experienced, DEBUG logging may be enabled for an extended period, so it's important that the overhead of this is not too high (up to 25% is perhaps OK).

The following types of message are probably appropriate at DEBUG level:

  • Entering <class name>.<method name>, <argument name>: <argument value>, [<argument name>: <argument value>]
  • Method <class name>.<method name> completed [, returning: <return value>]
  • <class name>.<method name>: <description of some action being taken, complete with context information>
  • <class name>.<method name>: <description of some calculated value, or decision made, complete with context information>

Please note that DEBUG messages are intended to used for debugging in production systems, so must be written for public consumption. In particular, please avoid any messages in a non-standard format, e.g.

DEBUG ++++++++++++ This is here cause company "Blah" sucks +++++++++++

If a DEBUG message is very expensive to generate, you can guard it with a logger.isDebugEnabled() if check. Just make sure that nothing that happens inside that if block is required for normal system operation. Only sendmail should require debug logging to work ;-)

When to log an Exception?

Ideally, an exception should only be logged by the code that handles the exception. Code that merely translates the exception should do no logging. With nested exceptions available on Java 5+, no context should be lost when an exception is translated.

'Enterprise' Logging

Some OPS environments use automatic monitoring tools that will display to an operator a screen containing error details parsed from the log file and suggested information for dealing with the error. In order to do this such systems usually require prior knowledge of all possible error conditions and a way to map them to suggestions. It is common in such systems to have specific error codes for certain conditions (e.g. host unavailable is error 1024). Error codes also provide a way to avoid needing to internationalise error messages for various locales while still providing 'enough' detail to get someone started.