Update: I originally posted this in 2012. The fix we put in has been working well for the last three years! I will be posting more technical details in another post soon!
We all know that quote. This is applicable to computer software also. Software* talks to us through “logging”. Excessive “talking” in one of our applications almost killed it! By fixing that issue, we were able to make the server “Do more”!
* I am referring to software running on the server.
I am subscribed to emails from Java Code Geeks. Today, I got an e-mail titled “Logging can be tricky, here is how to properly do it”. In that they had links to several posts. One particular post 10 Tips for Proper Application Logging caught my attention. It is a very nice article. I have been through this recently. Logging in our application almost killed it.
Anybody who has worked with an enterprise application knows about the pains of inefficient logging. Sometimes it’s too much logging, taking up resources. Sometimes, when you critically need it, there is poor logging or no logging at all. We have those. We have a huge n-tier application written in PowerBuilder (PB) running inside Sybase EAServer (aka Jaguar). We also have several web applications hooking into this PB application server. The logging problem is everywhere, but it’s prevalent in the web applications, as we get a lot more users.
One particular application is written in Java JSP. It’s a tax collection portal. This software also runs inside an EAServer. It “talks” to the database through business logic in PowerBuilder code running on the server. This website is very active between Oct – Feb each year. And in Feb, it gets really busy. Apparently, during the peak days, it had crashed several times every day, since it went live about 8 years ago. During that time, the server would get crazily busy rejecting every tax payer signing onto our web site. After some research, I found out it might be due to improper and excessive logging! Here is how.
When I got on board, I noticed a few memory leaks and plugged them. It got better, but it would still crash frequently during the peak days. The infrastructure type in our group talked about adding more EAServers in the cluster. Though this would help, all the EAServers essentially connected to one PROD database. So, my goal was to find out whether the bottleneck was on the “Client to Server” or on the “Server to DB” side. While digging for clues, I noticed a most of the logging from the web application actually ended up in a specific table the database. It had accumulated over a million rows each year prior. Closer look at the database during the peak days exposed very high # of database sessions, particularly for logging!
For the convenience of centalizing logs, some developer had decided to write all error logs in the database. Though, there was a setting to turn it off, that same setting turned of the “logger” as well, which resulted in bad logging via sysout. To make matters worse, another developer tried to piggy back on that twist to logs from the web application in the database. To do this, he/she made all web logging as “error” logging!!! Such shoddy decisions resulted in inadvertent traffic to the database, thus taking away precious database connection resources from tax payers accessing business rules!
During the peak days, the server was busy doing just logging (looking at the database sessions and SQLs) in the database. A lot of them were redundant, and sometimes even useless, as they lacked the information we really needed. We started looking at logging in terms of the web application.
I got down to business. We drafted a project plan to fix the “logging” problem in the web application. For this, we will not touch any business logic. That will make the project much bigger.
- I decided to move from the existing logger to a more universal solution. I wanted to separate out the logging from the main application framework. Existing logger was intertwined with business rules too much. So, we will not try to fix it.
- Next, we will identify all the log statements in the web application and correct their message types (debug, error etc). Once identified, I will work with other team members to fix the logging in the JSP code.
I decided to switch to log4J for the logging in the JSP website. After several hours of research (dear Google, books) and several tries later I created a wrapper class in Java so as to shield log4J details from our application code. I then wrote a Perl script to parse our code and list all the logging in the code. We then drafted a project plan, got down to changing the old logger calls to the new wrapper calls. While doing this, we also removed redundant/useless logging.
For identifying all the log statements in JSP code, I wrote a Perl script to parse our code and list all the logging in the code. This helped us save a lot of time in going over the code manually. We then drafted a project plan, got down to changing the old logger calls to the new wrapper calls. While doing this, we also removed redundant/useless logging.
Voila! That did it! The next peak season, we didn’t run out of database connections! We had a max of 600, we almost touched it and got back, but did not exceed it. Even better, the EA Servers did not crash. Of course, we did several other things along with this fix – including changing EA Server parameter like connection caches, object pooling etc. We also added couple of more EA Servers into the cluster. While all these helped with the higher availability of the Server itself, one culprit that stole database sessions earlier was logging. That now stands cleaned. More users can actually login to the system to do the taxes now!
So, there you have it – logging (nearly) killed our application! By rearranging the logging module and controlling logging better, we were able to make our application serve its purpose. But, we have only done this exercise on the one JSP web application. We still have the PB application that’s doing bulk of the logging. Need to get this, one of these days!