The moment when it clicks

Standard

There are moments when something suddenly clicks. Something that appeared to be veiled and impossible to understand suddenly becomes intelligible and clear. It is a very astonishing moment that occasionally happens when we study something. My understanding of my own learning of a particular subject—whether it be a tool or a domain—is that it happens gradually. I add pieces to the puzzle and over time a more complete picture evolves. It can be a tedious affair. Sudden insight, as if passing through a door that unexpectedly opens, does not happen to me a lot.

Yet last week I had such a moment. I had been toying around with Kibana during the last couple of weeks, but without a lot of success. We use Kibana to sift through the logging that is generated in the production environment. We try to gather relevant statistics, signals through the aggregation of the data that is logged. I personally think the tester should familiarize himself with the usage of logs to analyze what is going on in production. The data gathered can inform testing, can tell him about the actual usage of the product and can reveal risk and help him direct his testing.

So, since our team uses Kibana (Kibana 3, to be precise), I felt like I had no excuse to dodge that bullet. I probably could have gotten away with avoiding looking at the logging. In my team there are at least two engineers who regularly look at the dashboards and I could have left it up to them to monitor the production environment and perhaps do some requests for me. But I personally wanted to get more out of monitoring and so I had to try to tackle the Elastic Stack.

For weeks I struggled with the Kibana dashboard. The queries and filtering seemed counter-intuitive and the results almost random. The creation of rows and panels (the layout of the dashboard) baffled me. It was my first encounter with Log4j and Tomcat logging and my inexperience with many of the parts of the Elastic Stack caused frustration. I would spend a couple of hours creating some queries but never ended up with the right result. The Elastic query DSL just failed to make a logical connection in my head. I looked up tutorials and some instructions videos on Youtube, but I did not advance. It was like knocking at the same door all the time to find it shut tight.

And last week the door suddenly opened. In the matter of an hour I went from hitting keys in frustration to freely and joyfully playing around with the tool. I do not think there is a single thing that unlocked the door, but in retrospect there are some things that helped. I’d like to offer a quick examination of those things.

First off, last week, I set myself a small, well-defined Kibana task, caused by the following. My team uses a Grafana dashboard to keep track of the errors that are generated in the production environment. The dashboard is shown on a wide screen television that is on all the time. Errors appear on our dashboard but it seems that we pay only marginal attention to them. The lack of interest that I noticed is a common one. It is the same lack of interest that can be observed when putting the results of flaky automated tests on a dashboard. Over time, the lack of trust in the results of these tests causes a kind of boredom, the shutting out of the false alarm. Since the Grafana dashboard does not facilitate the splitting up of the errors by root cause but Kibana does, my only task was to split up the errors by root cause and therewith increase our insight in the errors. This task was within my reach. The fact that there were some examples, created by other teams, readily available also helped.

Second, I finally took the time to notice the things that were going on in the Kibana dashboard. I should have paid attention to them long ago, but I think my frustration got in the way. For example; it is pretty easy to create a query in Kibana that will run indefinitely. Setting the scope of the query to a large number of days can do that for you. It will leave you guessing endlessly about the flakiness of your query unless you notice the tiny, tiny progress indicator running in the right upper corner of the panel.

Also, different panels of the dashboard will react differently to the results of the query. The table panel, which shows a paginated table of records matching your query, can show results pretty quickly, but a graph potentially takes a lot of time to build up. This seems downright obvious and yet understanding this dynamic takes away a lot of the frustration of working with a Kibana dashboard. It is a delicate tool and you have to think through each query in terms of performance.

Thirdly, I think determination also contributed to the click moment. I desperately wanted to win the battle against Kibana and I wanted to take away some of fuziness of the dashboard. Last week I noticed a difference between the number of errors as shown in the Grafana dashboard and the number of errors (for the same time period) as gathered from Kibana. So there was a bug in our dashboard. Then I knew for certain that Kibana can serve as a testing tool. Once I was fully aware of its potential, I knew there was only one way forward.