What Gets Logged and Why (and What You're Losing by Logging Everything)
Jeff Straney
I inherited a system where every function logged entry and exit. Every database call logged the query and the results. Every HTTP request logged headers and body. The log files were 10GB a day. Searching for an actual error took 30 minutes of grepping through noise.
The problem was not that logging was bad. The problem was that logging everything made finding anything nearly impossible. The logs were technically complete. They were also useless.
The Signal/Noise Problem
When you log everything, you are hoping that when something breaks, the information you need will be in the logs. That is true. It will be. It will also be buried under a mountain of information about things that work fine every day.
A request that fails to authenticate: that is important. The fact that the code checked the authorization header and found no token: also important. The fact that 100,000 requests per day are being checked and the vast majority pass: not important. It is noise.
The time you spend filtering the noise is time you are not spending thinking about the problem. The time you spend thinking about the logs is time you are not spending fixing the issue.
I have debugged production incidents where the logs were so high-volume that I had to write a script to filter them. The script took longer than the fix would have taken. The information I was looking for was in the logs. But the signal-to-noise ratio was so bad that I could not find it without automation.
What Actually Needs Logging
The decision is simpler than it sounds: log what you would need to know if the system were broken, not what the system does when it is working.
Log when something fails. Log why it failed if you know. That is the primary signal. Beyond that, log the branch points where something meaningful happened: a user authenticated, a retry triggered, a fallback activated. Log state transitions: a job completed, a cache invalidated, a connection opened. These are the touchstones for reconstructing what happened. Log performance outliers: a request that took ten times longer than usual, a query that hit the timeout.
What you do not need to log: routine success. If a request succeeded, the absence of an error log tells you so. Logging function entry and exit adds noise without signal; the calling code tells you whether the function ran. And anything that happens thousands of times per second at normal load does not belong in production logs unless it is slow or broken. A cache hit is not interesting. A cache hit that took 500ms is.
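As a minimal sketch of this policy in Python's standard `logging` module: log the miss (a branch point where a fallback will trigger) and the slow hit (a performance outlier), and stay silent on the routine hit. The function name and the threshold are hypothetical, not from any particular codebase.

```python
import logging
import time

logger = logging.getLogger("cache")

SLOW_HIT_MS = 100  # hypothetical cutoff for "performance outlier"

def get_with_logging(cache, key):
    """Fetch from a cache, logging only failures and outliers -- never routine hits."""
    start = time.monotonic()
    try:
        value = cache[key]
    except KeyError:
        # A miss is a meaningful branch point: a fallback is about to trigger.
        logger.info("cache miss key=%s", key)
        raise
    elapsed_ms = (time.monotonic() - start) * 1000
    if elapsed_ms > SLOW_HIT_MS:
        # A hit is not interesting; a slow hit is.
        logger.warning("slow cache hit key=%s duration_ms=%.0f", key, elapsed_ms)
    return value
```

At normal load this emits nothing, so the log volume is proportional to how often something is slow or missing, not to traffic.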
How to Actually Implement This
Use log levels. ERROR is for things that are wrong. WARN is for things that are unexpected but recoverable. INFO is for significant state changes. DEBUG is for everything else.
In production, log at INFO level or above. Log at DEBUG only when you are actively investigating a problem. High-volume DEBUG logs are for development and staging, not production.
Structure your logs so they are parseable. A log entry should be: timestamp, level, context (what operation triggered this), message (what happened), and any relevant data. Make the message consistent so you can grep for it. Make the data structured so you can parse it.
2026-04-15 14:23:01 ERROR auth userId=42 error="token_expired" request_id="abc123"
2026-04-15 14:23:02 WARN cache operation=invalidate key="user:42" reason="token_changed"
2026-04-15 14:23:03 INFO job completed job_id="job:999" duration_ms=1200 rows_processed=5000
Each of these entries tells you something specific that happened. When you search the logs for userId=42, you get a narrative of what happened with that user, not a timeline of every function that executed while that user's request was being processed.
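Entries like the ones above can be produced with a small helper that renders fields as stable `key=value` pairs, so messages stay consistent enough to grep and structured enough to parse. The `kv` helper below is a sketch of that idea, not a library API.

```python
import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    level=logging.INFO,
)

def kv(**fields):
    """Render fields as key=value pairs in a fixed order so entries stay greppable."""
    return " ".join(
        f'{k}="{v}"' if isinstance(v, str) else f"{k}={v}"
        for k, v in fields.items()
    )

logging.getLogger("auth").error(kv(userId=42, error="token_expired", request_id="abc123"))
logging.getLogger("job").info(kv(job_id="job:999", duration_ms=1200, rows_processed=5000))
```

A JSON formatter would serve the same purpose; the point is that every entry carries the same shape of context, so `grep userId=42` reliably pulls out one user's narrative.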
The Payoff
When something breaks at 2am, you need to know what is broken and why. You do not need to know everything the system did while it was breaking. You need the signal. You need to be able to read the logs in less time than it takes to fix the problem, not more.
I have been in two incidents: one where logs were verbose and I spent 30 minutes filtering them, and one where logs were selective and I found the problem in 5 minutes. The difference was not that one team was smarter. The difference was that one team had decided in advance what was worth logging and what was noise.
That decision, made when the system is working, saves hours when it is broken.
