So you use logging. We all do — after being taught to log and not to print.

What is usually tracked in a log? Logs track events: API requests, file uploads/downloads, start and end of large subroutines, triggers, and, of course, errors with tracebacks; sometimes also progress indicators for heavy operations.

That is how logs work in practice.

What is a good log, anyway?

At first glance, a log is a chronological record of events in a system. From this perspective, a “good” log should contain everything — there is rarely a clear convention for what should not be logged (aside from obvious things like credentials).

I disagree with this approach and, more broadly, with dumping a ton of information into logs. The result is bloated log files that help in almost no scenario. In large distributed systems, logs can grow so large that processing them becomes too slow.

A common suggestion is to tune logging filters, assuming one can identify which parts generate redundant logs. I could imagine logging being reduced automatically if, for example, no one used a particular type of log entry for a week, but I don’t think anyone has yet built tooling that can confirm specific log records went unused in post-analysis.

There are three common scenarios where we rely on log files:

  • initial development
    • logging is helpful here: essentially print-debugging but maxed out. Once the implementation is confirmed, most logging can be removed (or downgraded to debug or info)
  • large jobs: verifying that the system is not stuck and is making progress
    • a rolling log of the last 1k lines is sufficient (see the sketch after this list)
  • post-incident analysis
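
For the large-jobs case, here is a minimal sketch of such a rolling buffer in Python (RollingHandler and its dump helper are hypothetical names, not a stdlib API):

    import logging
    from collections import deque

    class RollingHandler(logging.Handler):
        """Keep only the most recent `capacity` formatted log lines in memory."""

        def __init__(self, capacity=1000):
            super().__init__()
            self.lines = deque(maxlen=capacity)  # older lines fall off automatically

        def emit(self, record):
            self.lines.append(self.format(record))

        def dump(self, path):
            # write the retained tail, e.g. when the job looks stuck
            with open(path, "w") as f:
                f.write("\n".join(self.lines) + "\n")

    handler = RollingHandler(capacity=1000)
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logging.getLogger().addHandler(handler)
    logging.getLogger().setLevel(logging.INFO)  # let progress entries through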

The last case is both the most valuable and the most painful. We hope logs will answer “why did this fail?” and “what exactly happened.” In practice, logs are often poor at revealing root causes, which leads to stuffing every message with context — because reconstructing history from output produced by hundreds or thousands of processes is hard.

Log level suggestions

Debates about what should be TRACE/DEBUG/INFO/WARN/ERROR/FATAL are as old as log levels themselves. One popular framing is: FATAL means “wake an admin at midnight”, ERROR means “needs investigation, but not immediately”, and so on.

This sounds actionable, but it isn’t: logging systems lack the context needed to decide what is important and what is not.

What should an admin actually do when they see an error at midnight? What if the admin works for your client, not your company? What if you are the admin?

If code was just committed, shouldn’t the same error be more severe? If no one fixes it for a month, should its level be lowered automatically? Should deprecations be warn or info? (warn is noisy and info is easy to miss).

If an error is caused by invalid user input, should that be info? What if the person running the program is also the one providing the input; would it be helpful to assign the highest level to the message then?

Finally, when a program crashes with a fatal error, do I want to see preceding warn entries? Absolutely. Do I want to see info entries? Very likely. Yet their presence depends entirely on configuration, not on the outcome.

Last but not least: I’ve never seen admins actually woken up by logging. Either they are on shift because the system is critical, or there is a dedicated alerting mechanism not tied to logging.

More about trace and debug levels

I will not use trace for a library I don’t develop. The same applies to debug (and likely to info).

Fundamentally, developer-focused debugging and observability for downstream users are different use cases. We try to handle both with the same logging system. There should be better tools for debugging than logging.

Logs for timing

Logging systems allow one to track the time spent on an operation. I don’t recommend doing so, for two reasons: first, keeping per-operation timing entries makes the log very noisy; second, it is inconvenient to trace back why something was slow. Instead, an entry should appear in the log only if the code took longer than expected, and that entry should contain all relevant information.
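
A minimal sketch of this in Python (expect_faster_than is a hypothetical helper, not a standard API): nothing is written on the happy path, and a single rich entry appears only when the operation exceeds its budget.

    import logging
    import time
    from contextlib import contextmanager

    logger = logging.getLogger(__name__)

    @contextmanager
    def expect_faster_than(seconds, operation, **context):
        """Silent on the happy path; one rich entry only when the budget is exceeded."""
        start = time.monotonic()
        try:
            yield
        finally:
            elapsed = time.monotonic() - start
            if elapsed > seconds:
                logger.warning(
                    "%s took %.2fs (expected < %.2fs); context: %r",
                    operation, elapsed, seconds, context,
                )

    # usage: logs nothing unless the upload takes longer than 5 seconds
    with expect_faster_than(5.0, "upload_report", size_mb=120, destination="s3"):
        time.sleep(0.1)  # upload_report(...)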

Eliot and causality in logs

Logging systems are surprisingly similar across languages and ecosystems, likely due to simplicity and inertia.

The idea of introducing causal relationships between log records appeared on my radar with Eliot, an unusual logging system by Itamar Turner-Trauring. Eliot’s tagline: “the logging system that tells you why it happened.”

The official documentation and examples are worth five minutes of your time:
https://eliot.readthedocs.io/en/stable/quickstart.html

In short: log entries are nested using context managers and explicit parent-child relationships. One log entry is written on entering a context manager, and another on exiting it.
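
A minimal example adapted from the quickstart linked above: each start_action block emits a “started” record on entry and a “succeeded” or “failed” record (with the exception) on exit, nested under its parent action.

    import requests
    from eliot import start_action, to_file

    to_file(open("linkcheck.log", "w"))  # Eliot writes JSON records to this file

    def check_links(urls):
        # outer action: the parent of every per-URL action below
        with start_action(action_type="check_links", urls=urls):
            for url in urls:
                # inner action: "started" on entry; "succeeded" or "failed" on exit
                with start_action(action_type="download", url=url):
                    response = requests.get(url)
                    response.raise_for_status()

    check_links(["https://eliot.readthedocs.io"])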

Analyzing logs means matching these relationships, and Eliot provides tooling to visualize logs as a tree. It also does not use log levels: everything is logged, and “filtering” happens later, when you inspect the tree.

Less is more: what I changed compared to Eliot’s approach

We should not log things that worked. If a job completes without exceptions or suspicious conditions, all I want to see is “job completed as expected.” Inputs and outputs should be tracked, but not logs.

I see two practical approaches: one introduces a logging service that can automatically discard records once we no longer need them; the other keeps rich, multi-level logging context in memory until an error or noteworthy situation occurs.

The second approach is simpler, but with a drawback: if the process crashes, nothing is logged. Additionally, it requires keeping objects alive in the logging context, and it may misbehave if those objects are mutable.

For its simplicity (and efficiency), I chose the latter. I can come up with many “what ifs”, but after more than a year in production it works very well.
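
A minimal sketch of the idea (step and _stack are illustrative names, it assumes a single-threaded program, and it is not the exact production code): context accumulates in memory and is written out only when a step fails.

    import logging
    from contextlib import contextmanager

    logger = logging.getLogger(__name__)
    _stack = []  # in-memory context: one dict of details per nested step

    @contextmanager
    def step(name, **details):
        """Accumulate context in memory; emit it only if this step fails."""
        _stack.append({"step": name, **details})
        try:
            yield
            # happy path: the context for this step is silently discarded
        except Exception as exc:
            # log the full accumulated context once, at the innermost failure
            if not getattr(exc, "_context_logged", False):
                logger.exception("failed in %r; context: %r", name, list(_stack))
                exc._context_logged = True
            raise
        finally:
            _stack.pop()

    # usage: a clean run logs nothing; a failing run logs every enclosing step
    with step("process_batch", batch_id=42):
        with step("load_rows", source="s3://bucket/data.csv"):
            pass  # load(...)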

Log less, log only what requires your action, and then you won’t need to assign priorities like fatal/error/warn at all.

I still keep more detailed logs, but they don’t meaningfully contribute to the debugging experience.



👋




Sidenotes: I don’t discuss observability solutions here that have an explicit (questionable) goal of tracking everything (like OpenTelemetry). Those systems are designed with causality tracking built in, which is hardly avoidable given the sheer amount of information. I should also mention that in languages without GC, keeping arbitrary context around requires additional hassle.

There is also a lost benefit: “you can configure logging using just configs”. This is critical for compiled third-party code, and much less so for your own code.

Update: Boris Tane argues for a similar (but more explicit) approach with “wide events”.