Tuesday, September 16, 2008

Logging Best Practices

Applications need to log diagnostic information so that problems can be detected when something goes wrong or unexpected things happen. Developers also need extensive trace information in the logs for debugging, both during and after development. Good logging practices go a long way toward serving these requirements effectively.

Logging API

Use the standard logging API available in your environment.

.NET applications must use log4net. Java applications must use the de-facto standard logging API - Apache Commons Logging with log4j beneath it.

Logger Category Name

It is common practice to instantiate the logger using the fully qualified class name as the message category name. This allows the application to limit and fine-tune log messages at a granularity as fine as an individual class.

However, there are instances where this approach is not sufficient. For instance, the application may have to raise an alarm if it detects an attempt to violate access permissions. In a banking application, account X trying to read the details of account Y may indicate a serious break-in attempt. The application may have to log such attempts to a separate category named "security" with a high-decibel alarm, meaning this logger category may email/SMS the security team about the incident. Such a requirement has to be analyzed and decided on a per-application basis.
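The separate "security" category can be sketched as follows. This is a minimal illustration using java.util.logging so it is self-contained; the same idea applies with commons-logging/log4j. The category name "security", the method name and the alert-recording handler are assumptions - in a real system the handler would email/SMS the security team.

```java
import java.util.logging.Handler;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

class SecurityAudit {
    // Dedicated "security" category, separate from the per-class loggers.
    private static final Logger SECURITY = Logger.getLogger("security");

    // A handler attached to the "security" category would alert the security
    // team; this stand-in just remembers the last message for illustration.
    public static String lastAlert;

    static {
        SECURITY.addHandler(new Handler() {
            @Override public void publish(LogRecord record) { lastAlert = record.getMessage(); }
            @Override public void flush() {}
            @Override public void close() {}
        });
    }

    // Hypothetical entry point called when an access violation is detected.
    public static void reportAccessViolation(String actor, String target) {
        SECURITY.warning("Account " + actor + " attempted to read account " + target);
    }
}
```

Because the category is independent of any class name, its threshold and handlers can be configured separately from the rest of the application's loggers.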

Logging Level

These are the commonly available log levels in logging frameworks. This set is taken from the Apache Commons Logging framework; the log levels of pretty much any other logging framework can be mapped to this set.

  • FATAL
  • ERROR
  • WARN
  • INFO
  • DEBUG
  • TRACE

Very strict rules apply to the two extremes of the log levels. The rest are debatable, and where to draw the lines is an aesthetic decision to be made by the team.

FATAL - Reserved for problems that lead to complete failure of the component/application. For example, a network error on a database connection is FATAL. Wrong user inputs leading to problems are not fatal - they can be WARN or INFO at most. However, this rule changes slightly for re-usable components, because the component cannot decide whether a problem is FATAL - that is the job of the calling code. Therefore, re-usable code should avoid logging such problems and instead indicate them to the outside world through the exception mechanism. If you have to log them, use WARN or INFO depending on the severity of the situation arising out of the intended usage of the re-usable code.

ERROR - Genuine errors arising within your code should be logged at ERROR level. Problems you catch from the code you depend on are not errors. Such problems should be logged as WARN if you can recover from them, or FATAL if you decide to give up. However, this rule changes slightly for re-usable components, because the component cannot decide whether a problem is an ERROR - that is the job of the calling code. Therefore, re-usable code should avoid logging such problems and instead indicate them to the outside world through the exception mechanism. If you have to log them, use WARN or INFO depending on the severity of the problem.

WARN - Situations where something went seriously wrong but the code can recover should be logged at WARN level.

INFO - As the name suggests, information should be logged at INFO level. For example, log configuration data read from configuration files at INFO level, because you want to inform the world about what was read from the configuration file when the application started. Another example: log the exceptions caught in re-usable components at INFO, because you have insufficient information in that code to decide on any other severity level, yet you want to record that such a thing happened.

DEBUG - Log low-level information concerning code logic decisions, internal state, etc. This should allow a developer to understand the behavior of the code. It is expected to increase the log size. This log level is turned off in production deployments.

TRACE - The lowest log level. This level is used to log the finest details of code flow, including method entry and exit. It should allow a developer to debug the application code in the absence of an interactive debugger. This log level is expected to generate huge logs and is turned off in production deployments. [See "Tracing" later in the document.]

Your Code - Logger Calls

Create a static logger instance, because you don't want to keep one logger instance per class instance. The logger instance should also be final, because you don't want the instance to change after it is created.

public class MyClass {
    // LogFactory.getLog returns a commons-logging Log instance.
    private static final Log LOG = LogFactory.getLog(MyClass.class);
    ...
    ...
    public void myMethod() {
        LOG.warn("Failed to fetch profile data.", profileException);
        ...
        LOG.info("Redirecting to input page.");
        ...
    }
}

Exception/throwable

An exception/throwable should be logged in its entirety, including the stack trace. This is achieved by passing the exception/throwable object to the logger call as a parameter so that the logger can log all the details. The logger uses this exception/throwable to log the full stack trace, including the root causes wrapped in the exception, down to the bottom-most cause.

LOG.fatal("Earth has gone out of trajectory.", trajectoryException);

Don't log exceptions as shown below - that is bad practice because you lose the stack trace. It reduces the log to a mere information line without telling exactly what went wrong and where.

/* Bad practice */
LOG.fatal("Earth has gone out of trajectory. " + trajectoryException);
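The difference is easy to demonstrate. The sketch below uses java.util.logging (so it runs standalone; `log.log(Level.SEVERE, msg, e)` is the analogue of `LOG.fatal(msg, e)` in commons-logging) and captures the formatted output to show that passing the throwable as a separate parameter preserves the full stack trace. The logger name and exception message are invented for the example.

```java
import java.io.ByteArrayOutputStream;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;
import java.util.logging.StreamHandler;

class ExceptionLogging {
    // Logs an exception the right way and returns the formatted log output.
    public static String logWithThrowable() {
        Logger log = Logger.getLogger("demo");
        log.setUseParentHandlers(false);          // keep output out of the console
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        StreamHandler handler = new StreamHandler(out, new SimpleFormatter());
        log.addHandler(handler);
        try {
            throw new IllegalStateException("orbit decay detected");
        } catch (IllegalStateException e) {
            // Throwable passed as a parameter: the formatter appends the stack trace.
            log.log(Level.SEVERE, "Earth has gone out of trajectory.", e);
        }
        handler.flush();
        return out.toString();
    }
}
```

Had the exception been concatenated into the message string instead, the output would contain only the exception's toString(), with no stack frames to point at the failing line.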

Conditional logging

Log statements must be enclosed in conditionals to avoid the unnecessary runtime overhead of evaluating the log statement's arguments. This is preferably done using AOP to wrap all your log statements within conditionals - provided your environment supports AOP. The end effect of wrapping the log statements in conditionals should be the same as the following piece of code. Using AOP reduces code clutter in the application code and avoids the coding effort involved.

if (LOG.isDebugEnabled()) {
    LOG.debug("User state: isEmployed=" + user.isEmployed() + ", isRegistered=" + user.isRegistered());
}
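The guard matters because the string concatenation happens before the logger is even called. The runnable sketch below uses java.util.logging, where `isLoggable(Level.FINE)` plays the role of `isDebugEnabled()`; the counter and the invented `expensiveState()` method exist only to make the saved work observable.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class GuardedLogging {
    private static final Logger LOG = Logger.getLogger(GuardedLogging.class.getName());

    // Counts how often the expensive argument is actually built.
    public static int expensiveCalls = 0;

    // Stands in for a costly computation the guard should skip.
    static String expensiveState() {
        expensiveCalls++;
        return "isEmployed=true, isRegistered=false";
    }

    public static void logUserState() {
        // Equivalent of LOG.isDebugEnabled() in commons-logging.
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine("User state: " + expensiveState());
        }
    }
}
```

With the default INFO threshold, calling `logUserState()` never evaluates `expensiveState()`; without the guard it would be evaluated on every call even though the message is discarded.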

Fall back to coding the conditionals in application code only when you cannot use AOP because of environment restrictions - for example, when the client does not want to use AOP, the app server where you are deploying your application does not support it, or your programming environment does not support it.

Log formatting

Don't use any kind of formatting in log statements. Log lines are not for printing purposes - they exist only to help you debug problems, so you don't need any kind of formatting in them. These are all bad practices:

/* All these are bad practices */
LOG.info("######### Executing the action. #############");
LOG.debug("Line 3333333 executed.");
LOG.debug("Employee: --->" + employee.getName() + "\r\n\t" + employee.getId());

Log context

Context data must be included in the logs for the logs to be useful. In the example above (under conditional logging), just logging the user state is useless unless you know which user's state it is - that is the context we need. This context can be a unique piece of information like the user principal or session id. Similarly, the thread id is also needed in the logs to trace the code flow of a specific use-case execution by the user. In clustered environments, the server name or IP address is also needed to identify the server. Context insertion has to be achieved in an unobtrusive way - without major changes to application code for inserting the context wherever logger methods are called.

Example: Java web applications using the log4j implementation can achieve this with a servlet filter that injects context information into log4j's NDC (see http://logging.apache.org/log4j/1.2/manual.html for more information).
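The mechanism behind log4j's NDC is essentially a per-thread stack of context strings. A minimal self-contained sketch of that idea, using a ThreadLocal (the assumption being that a servlet filter would call push() on request entry and pop() on exit, and that the layout would prepend get() to each log line):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;

class DiagnosticContext {
    // One context stack per thread, mirroring log4j's NDC.
    private static final ThreadLocal<Deque<String>> STACK =
            ThreadLocal.withInitial(ArrayDeque::new);

    public static void push(String context) { STACK.get().push(context); }
    public static void pop() { STACK.get().pop(); }

    // Current context, oldest entry first, like NDC's %x output in log4j.
    public static String get() {
        StringBuilder sb = new StringBuilder();
        for (Iterator<String> it = STACK.get().descendingIterator(); it.hasNext(); ) {
            sb.append(it.next());
            if (it.hasNext()) sb.append(' ');
        }
        return sb.toString();
    }
}
```

Because the stack is thread-local, each request thread sees only its own context, which is exactly what makes the filter-based approach unobtrusive: application code never passes the context explicitly.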

Note: Using context-logging features like class name, method name and line number may lead to runtime overhead. Use them with caution, and disable such context logging in the production environment after debugging is over.

Tracing

Fine-grained tracing like method entry/exit should not be hand-coded. Whenever possible, use AOP to insert fine-grained trace logs. This avoids code clutter and makes it easy to change the tracing from one place.
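In plain Java (without an AOP framework like AspectJ), the same effect can be approximated with a dynamic proxy that intercepts every interface call and logs entry/exit around it. A minimal sketch - the `Greeter` interface and the trace list are invented for the example; a real version would call `LOG.trace(...)` instead of appending to a list:

```java
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

class TraceProxy {
    // Stand-in for TRACE-level logger output, so the test can inspect it.
    public static final List<String> trace = new ArrayList<>();

    // Wraps any interface implementation with entry/exit tracing, keeping
    // the trace statements out of the business code (AOP-style interception).
    @SuppressWarnings("unchecked")
    public static <T> T wrap(Class<T> iface, T target) {
        return (T) Proxy.newProxyInstance(iface.getClassLoader(), new Class<?>[]{iface},
            (proxy, method, args) -> {
                trace.add("ENTER " + method.getName());
                try {
                    return method.invoke(target, args);
                } finally {
                    trace.add("EXIT " + method.getName());
                }
            });
    }

    // Example interface to demonstrate the proxy.
    interface Greeter { String greet(String name); }
}
```

The business class implements only `Greeter`; the entry/exit tracing lives entirely in `wrap`, so it can be changed, or removed for production, in one place.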

Don't log sensitive data

Sensitive data should not be logged. For example, never write a user's password to the logs. There is no magic to avoid logging sensitive data - developers have to be careful, and only careful reviews can catch the problem if it exists.
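For data that must appear in logs for traceability but should not be exposed in full (card numbers, tokens - passwords should simply never be logged at all), a small masking helper can reduce the risk. This is a hypothetical helper, not part of any logging framework:

```java
class SensitiveData {
    // Masks all but the last 4 characters so the value stays traceable
    // in logs without exposing it; short values are masked entirely.
    public static String mask(String value) {
        if (value == null || value.length() <= 4) return "****";
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < value.length() - 4; i++) sb.append('*');
        return sb.append(value.substring(value.length() - 4)).toString();
    }
}
```

Even with such a helper, reviews remain the real safety net: the helper only works where developers remember to call it.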

Clock sync

You must synchronize the system clocks in a cluster to get the right sequence of log entries when the logs are merged for analysis. Out-of-sync log entries lead to confusing situations at best, and completely failed log analysis at worst.

Dynamic log level threshold changes

Preferably, it should be possible to change the logger's log level threshold dynamically. This is very useful in production environments, where restarting the app server just to make a new log level threshold take effect for collecting debug information is not an option.
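In java.util.logging this is directly supported via `Logger.setLevel` (log4j offers the analogous `setLevel` on its loggers). A minimal sketch - the category name is an assumption, and the trigger in a real system would be something like a JMX bean or an admin page rather than a direct call:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class DynamicLevel {
    private static final Logger LOG = Logger.getLogger("com.example.orders");

    // Raises or lowers the threshold at runtime, without a server restart.
    public static void setThreshold(Level level) {
        LOG.setLevel(level);
    }

    public static boolean debugEnabled() {
        return LOG.isLoggable(Level.FINE);  // FINE ~ DEBUG
    }
}
```

Because the change takes effect immediately for all subsequent `isLoggable` checks, debug logging can be switched on for a live incident and switched back off once enough information has been collected.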

Logging inside loops

Avoid logging inside loops. The major problem is that logging can become a considerable overhead when done inside a loop, and more severe still when that loop is called from within another loop - the number of log statements suddenly explodes. The only exception to the rule is when you badly need it for debugging purposes. In that case, log at DEBUG or TRACE level and enclose the log statements within conditionals so that they carry no overhead in production, where DEBUG is turned off.
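A common alternative is to accumulate inside the loop and emit one summary line after it. A sketch (the order-processing scenario and counter are invented; the counter only makes the single log call observable):

```java
import java.util.List;
import java.util.logging.Logger;

class LoopLogging {
    private static final Logger LOG = Logger.getLogger(LoopLogging.class.getName());

    // Counts logger calls, to show the loop itself logs nothing.
    public static int logCalls = 0;

    public static void processOrders(List<String> orders) {
        int failed = 0;
        for (String order : orders) {
            if (order.isEmpty()) failed++;   // no per-iteration log call
        }
        // One summary line after the loop instead of one line per iteration.
        LOG.info("Processed " + orders.size() + " orders, " + failed + " failed.");
        logCalls++;
    }
}
```

The summary preserves the information that usually matters (totals and failure counts) at a fixed cost, independent of how many iterations the loop runs.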

Logging huge objects/lists

Huge objects/lists should never be logged. Doing so becomes a major runtime overhead and also increases the log file size dramatically.

Log asynchronously

Enable the asynchronous logging capabilities of the logger. Otherwise, your code will slow down because of logger calls - significantly so in inherently multi-threaded web applications.

Log rotation

Use a sensible log rotation policy to avoid huge log files. Huge log files create problems with log analysis - it becomes impossible to wade through a multi-GB log file for debugging. They may also lead to disk space problems: you may run out of disk space. Another problem with huge logs is that you may hit the file system's limit on file size, leading to unexpected logger behavior.

Size based log rotation and time based log rotation are commonly used rotation policies.

Size-based log rotation: This policy rotates the logs based on size - for example, the logger creates a new log file whenever the current log file reaches 10 MB.

Time-based log rotation: This policy rotates the logs based on time - for example, the logger creates a new log file every day.
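With the log4j 1.2 setup this document assumes, both policies are available out of the box: RollingFileAppender for size-based rotation and DailyRollingFileAppender for time-based rotation. A sketch of the configuration (appender names, file paths and limits are assumptions to adapt to your application):

```properties
# Size-based rotation: new file at 10 MB, keep 10 backups
log4j.appender.R=org.apache.log4j.RollingFileAppender
log4j.appender.R.File=logs/app.log
log4j.appender.R.MaxFileSize=10MB
log4j.appender.R.MaxBackupIndex=10
log4j.appender.R.layout=org.apache.log4j.PatternLayout
log4j.appender.R.layout.ConversionPattern=%d [%t] %-5p %c - %x %m%n

# Time-based rotation: new file every day
log4j.appender.D=org.apache.log4j.DailyRollingFileAppender
log4j.appender.D.File=logs/app.log
log4j.appender.D.DatePattern='.'yyyy-MM-dd
log4j.appender.D.layout=org.apache.log4j.PatternLayout
log4j.appender.D.layout.ConversionPattern=%d [%t] %-5p %c - %m%n
```

Note the `%x` in the first pattern: it prints the NDC context discussed under "Log context", and `%t` prints the thread name.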

The general rule is to configure log rotation in such a way that you can open the log file in a tool for analysis. At the same time, log files should not be too small, or your analysis effort will span too many files. For example, a log file size of 10 kB each may be too small: you may have to analyze thousands of files when tracing a problem across one full day. On the other hand, log files bigger than 50 or 100 MB may not be easy to browse using an editor on the Windows platform. Unix environments (including Linux) are a little more forgiving here, because it is easy to grep huge files for specific lines. Therefore, on the Windows platform, don't exceed 50 MB per log file. Don't exceed 1 GB on Unix environments.

You can do log rotation using the logger's built-in facilities, such as its log rotation policy. On Unix environments you can also use the system-wide log rotation facility (known as logrotate) in combination with the logger's own rotation policy.

Log cleanup

Use scheduled scripts/jobs to archive or delete logs older than some duration, to avoid running out of disk space. Keep only the minimum needed log files online - archive/delete the older ones. If you decide to archive old logs, compress them to save storage space.

Use the logrotate facility on Unix (and Linux) environments for compressing, archiving and cleaning up old log files. Use a scheduled job on Windows for the same purpose.
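A logrotate configuration covering rotation, compression and cleanup in one place might look like this (the path under /etc/logrotate.d/, the log location and the retention values are assumptions):

```
# /etc/logrotate.d/myapp
/var/log/myapp/*.log {
    daily             # time-based rotation: one file per day
    rotate 14         # cleanup: keep 14 rotated files, delete older ones
    compress          # archive old logs compressed to save disk space
    delaycompress     # keep yesterday's log uncompressed for easy grep
    missingok         # don't complain if a log file is absent
    notifempty        # skip rotation when the file is empty
}
```

This keeps the rotation and cleanup policy for the application in a single file that can be promoted unchanged from dev through QA and staging to production, as recommended below.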

Important: Both log rotation and log cleanup should be configured on all servers, from the dev environment through to production. Starting in dev and moving through QA and staging allows the process to stabilize before reaching production, which dramatically minimizes the problems you may face in production later. For example, knowing the daily log file size and putting an archival/cleanup procedure in place in QA or staging helps you avoid disk space issues in production, because you can move the same procedure to production.