Tuesday, April 1, 2014

Zabbix Performance

Email and monitoring are two services that I believe can and should be outsourced if you don't have the people power to dedicate to them fully. Like so many technologies, they can be fairly simple to get up and running, but how often is the out-of-the-box install sufficient for your business? You can install Exchange fairly easily, and there are countless tutorials and articles out there to guide you through it, but what happens when it breaks or when some new functionality needs to be added? Monitoring systems love to advertise with catchy slogans like "Get your entire infrastructure monitored in less than 10 minutes". That's true...if all you're monitoring are basic Linux stats like disk space, CPU, memory, etc. Need to add JMX queries, or monitor a custom app? Now it's time to roll up your sleeves and get to work.

We were using LogicMonitor to monitor our entire infrastructure when I came onboard almost two years ago (my Old Boss had brought it in). Every device, Java app, file size, and custom query we needed to keep an eye on our production platform was handled by this app. My New Old Boss (who started about 6 months ago after my Old Boss quit, and has since left as well) came in and wanted to chuck the system. He was a big advocate of "if you can do it in-house, do it". LogicMonitor wasn't perfect, and we had our share of problems with it, but in hindsight a large part of that pain was that we hadn't followed through with setup. We were using default values and triggers for a lot of things, and they created noise. We didn't tune the metrics we received from LogicMonitor's setup team to match our environment. Rather than invest the time to learn the system we had, we scrapped it and went with Zabbix.

I hate Zabbix. I hated it from the beginning. We didn't have much of a burn-in of the product; it got installed, we started "testing" it, and suddenly it was our production monitoring platform for all of the new equipment we were putting into production. This was part of a major rollout as we were introducing a new platform, so it all became one and the same. One of my biggest complaints about Zabbix was that it wasn't user-friendly. Adding a host and its metrics to Zabbix involved a pretty unnatural multi-step process, as did adding users/groups and setting up alerts. For example, to set up a host with alerting or particular metrics you have to first add the host, then create items, then create triggers for the items, then create actions for the triggers. You can create host groups to aggregate hosts of like purpose, and you can create a template to group items, but you can't apply a template to a host group; you still have to apply templates to hosts individually. When I raised concerns about the complexity of its front end (because really, a GUI should not be harder to use than, say, creating the same thing from the command line in Nagios), New Old Boss explained that Zabbix was "expert-friendly".

Months later, New Old Boss has moved on and I've inherited Zabbix, which I had pretty much let him handle while he was here, and what a bag of trouble it has turned into. "Bag of trouble" is of course polite blog speak for what I really want to call it.


Historically I knew it had problems. Every now and again we would get the well-known "Zabbix agent on host X is unreachable for 5 minutes" error message from a large number of servers, but would find that the Zabbix agent on those servers was alive and kicking, and that zabbix_get requests from the Zabbix server itself went through without issue. New Old Boss would inevitably go in and "do things" to get it back up and going again. It was unclear whether "things" were being done to the Zabbix server itself or to the PostgreSQL backend database. Whatever it was, it wasn't documented, and needless to say, once New Boss became New Old Boss, we ran into trouble.
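
(That agent check, for anyone unfamiliar, is just something like the following run from the Zabbix server; the IP here is a placeholder. If it returns 1, the agent is alive and listening and the problem is elsewhere.)

    zabbix_get -s 10.10.1.50 -k agent.ping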

Initially a restart of the Zabbix server instance would fix the problem. That worked until I added our network devices (which were still being handled by LogicMonitor) into Zabbix. Suddenly Zabbix went haywire. Graphs, which had been intermittently flaky before, were missing huge swaths of data. You can't alert too well on "no data". We were getting inundated with error messages about hosts being unreachable, and Zabbix server processes were spiking to over 75% utilization. I immediately disabled the new network hosts and templates, but that didn't alleviate the problem. Several Zabbix server and database restarts later, we were still struggling.

Here's what I know: everyone has these problems with Zabbix. The first thing they tell you to do is to tune your database, especially if you're running PostgreSQL. MySQL apparently performs fine with the default install, but PostgreSQL needs some finessing. I read through a number of best practice docs and slides and found that for the most part our PGSQL install was spot on. A few relevant lines from the PGSQL config file:

max_connections = 2048
checkpoint_segments = 256              # at 16MB per segment, a checkpoint roughly every 4GB of WAL
checkpoint_completion_target = 0.9     # spread checkpoint writes across 90% of the checkpoint interval
shared_buffers = 8196MB                # ~8GB
wal_buffers = 512                      # in 8KB WAL pages, i.e. 4MB

And some from Zabbix itself:

### Option: StartPollers
# StartPollers=5
StartPollers=128
### Option: StartIPMIPollers
# StartIPMIPollers=0
StartIPMIPollers=24
### Option: StartPollersUnreachable
# StartPollersUnreachable=1
StartPollersUnreachable=80
### Option: StartTrappers
# StartTrappers=5
StartTrappers=32
### Option: StartPingers
# StartPingers=1
StartPingers=24
### Option: StartDiscoverers
# StartDiscoverers=1
StartDiscoverers=4
### Option: StartHTTPPollers
# StartHTTPPollers=1
# Only required if Java pollers are started.
### Option: StartJavaPollers
StartJavaPollers=64
### Option: StartSNMPTrapper
# If 1, SNMP trapper process is started.
# StartSNMPTrapper=0
StartSNMPTrapper=1
### Option: StartDBSyncers
# StartDBSyncers=4
StartDBSyncers=4
### Option: StartProxyPollers
# StartProxyPollers=1
********
HousekeepingFrequency=1
DisableHousekeeping=1
CacheSize=2G
HistoryCacheSize=2G
TrendCacheSize=2G
HistoryTextCacheSize=2G
LogSlowQueries=3000

In addition, the Zabbix history_uint table was partitioned in PostgreSQL according to the instructions on this website. So on the surface everything looked fine, as if we'd actually followed best practices for installation, but we were still getting error emails and missing data.
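
For anyone who hasn't seen that kind of setup, the general shape is daily child tables inheriting from history_uint, each carrying a CHECK constraint on the clock column and its own (itemid, clock) index, with a trigger on the parent routing inserts into the right child. A minimal sketch of one day's partition (not the exact script from that guide):

    -- One child table per day, constrained by clock (epoch seconds).
    CREATE TABLE history_uint_p2014_04_01 (
        CHECK (clock >= 1396310400 AND clock < 1396396800)   -- 2014-04-01 UTC
    ) INHERITS (history_uint);

    CREATE INDEX history_uint_p2014_04_01_1
        ON history_uint_p2014_04_01 (itemid, clock);

    -- A BEFORE INSERT trigger on history_uint then redirects each row into the
    -- child whose clock range matches, and expired days get dropped with a
    -- simple DROP TABLE instead of relying on the housekeeper's DELETEs.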

As you might expect, I began Googling, searching through the Zabbix documentation, asking for help in IRC channels, and posting in the forums. Unfortunately, Zabbix documentation is not the greatest. It's missing some vital, basic information on how Zabbix works under the hood. For example, I couldn't find any clear info about what the history syncer process does, or how it does it. One could guess of course, but when trying to fix your monitoring server it's good to know exactly what it's doing. The forums are a place where questions go to die. I asked two, got over 30 "views" for each, and not a single response. The Zabbix IRC channel is a much better resource. I asked questions and got useful answers there every time. Googling is always a good choice because even if you don't find the answers to your exact question/scenario, it's nice to know that you're not the only one for whom everything is blowing up. 

So, I set up slow query logging (that's the LogSlowQueries=3000 line in the config above) and turned on monitoring of Zabbix's own internal processes.
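
Zabbix exposes its own health through internal item keys, which is what I mean by monitoring its internal processes. These are the sort of keys involved (illustrative examples rather than our exact item list):

    zabbix[process,poller,avg,busy]          (% of time the poller processes are busy)
    zabbix[process,history syncer,avg,busy]  (the same, for the history syncer processes)
    zabbix[queue,10m]                        (items 10+ minutes overdue for a check)
    zabbix[wcache,history,pfree]             (% free space left in the history write cache)

Between the slow query log, those internal metrics, and some more Googling, I came back with the following information: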

  • Slow queries, slow queries, slow queries. A steady stream of log lines that looked like this:

    4004:20140331:140245.675 slow query: 88.236755 sec, "select value from history_uint where itemid=50217 and clock<=1396264947"
    4006:20140331:140332.836 slow query: 273.012893 sec, "select value from history_uint where itemid=50471 and clock<=1396264691"
    4004:20140331:140402.007 slow query: 76.314164 sec, "select value from history_uint where itemid=50218 and clock<=1396264948"
    4004:20140331:140433.793 slow query: 31.765498 sec, "select value from history_uint where itemid=50458 and clock<=1396264948"
    4006:20140331:140504.464 slow query: 90.653878 sec, "select value from history_uint where itemid=50351 and clock<=1396264691"

  • Deadlocks from processes waiting for things like AccessExclusiveLock and AccessShareLock on relations.

  • The queries themselves were monstrous. I didn't know much PostgreSQL before I started trying to debug this, and one of the tips I picked up along the way was how to essentially "walk" a query plan using EXPLAIN ANALYZE. I walked a sample query like the SELECT statements in the first bullet above and got something akin to the following:

     Result  (cost=0.00..132813.99 rows=36238 width=5)
    
       ->  Append  (cost=0.00..132813.99 rows=36238 width=5)
             ->  Seq Scan on history_uint  (cost=0.00..0.00 rows=1 width=20)
                   Filter: ((clock <= 1395242011) AND (itemid = 51355))
             ->  Bitmap Heap Scan on history_uint_p2014_03_04 history_uint  (cost=104.88..9114.16 rows=2488 width=5)
                   Recheck Cond: ((itemid = 51355) AND (clock <= 1395242011))
                   ->  Bitmap Index Scan on history_uint_p2014_03_04_1  (cost=0.00..104.26 rows=2488 width=0)
                         Index Cond: ((itemid = 51355) AND (clock <= 1395242011))
             ->  Bitmap Heap Scan on history_uint_p2014_03_05 history_uint  (cost=104.85..9103.67 rows=2485 width=5)
                   Recheck Cond: ((itemid = 51355) AND (clock <= 1395242011))
                   ->  Bitmap Index Scan on history_uint_p2014_03_05_1  (cost=0.00..104.23 rows=2485 width=0)
                         Index Cond: ((itemid = 51355) AND (clock <= 1395242011))
             ->  Bitmap Heap Scan on history_uint_p2014_03_06 history_uint  (cost=104.87..9111.01 rows=2487 width=5)
                   Recheck Cond: ((itemid = 51355) AND (clock <= 1395242011))
                   ->  Bitmap Index Scan on history_uint_p2014_03_06_1  (cost=0.00..104.25 rows=2487 width=0)
                         Index Cond: ((itemid = 51355) AND (clock <= 1395242011))
             ->  Bitmap Heap Scan on history_uint_p2014_03_07 history_uint  (cost=104.93..9128.92 rows=2492 width=5)
                   Recheck Cond: ((itemid = 51355) AND (clock <= 1395242011))
                   ->  Bitmap Index Scan on history_uint_p2014_03_07_1  (cost=0.00..104.31 rows=2492 width=0)
                         Index Cond: ((itemid = 51355) AND (clock <= 1395242011))
             ->  Bitmap Heap Scan on history_uint_p2014_03_08 history_uint  (cost=104.68..9075.58 rows=2479 width=5)
                   Recheck Cond: ((itemid = 51355) AND (clock <= 1395242011))
                   ->  Bitmap Index Scan on history_uint_p2014_03_08_1  (cost=0.00..104.06 rows=2479 width=0)
                         Index Cond: ((itemid = 51355) AND (clock <= 1395242011))
             ->  Bitmap Heap Scan on history_uint_p2014_03_09 history_uint  (cost=104.99..9176.11 rows=2509 width=5)
                   Recheck Cond: ((itemid = 51355) AND (clock <= 1395242011))
                   ->  Bitmap Index Scan on history_uint_p2014_03_09_1  (cost=0.00..104.36 rows=2509 width=0)
                         Index Cond: ((itemid = 51355) AND (clock <= 1395242011))
             ->  Bitmap Heap Scan on history_uint_p2014_03_10 history_uint  (cost=104.69..9075.95 rows=2479 width=5)
                   Recheck Cond: ((itemid = 51355) AND (clock <= 1395242011))
                   ->  Bitmap Index Scan on history_uint_p2014_03_10_1  (cost=0.00..104.07 rows=2479 width=0)
                         Index Cond: ((itemid = 51355) AND (clock <= 1395242011))
             ->  Bitmap Heap Scan on history_uint_p2014_03_11 history_uint  (cost=105.02..9181.30 rows=2510 width=5)
                   Recheck Cond: ((itemid = 51355) AND (clock <= 1395242011))
                   ->  Bitmap Index Scan on history_uint_p2014_03_11_1  (cost=0.00..104.40 rows=2510 width=0)
                         Index Cond: ((itemid = 51355) AND (clock <= 1395242011))
             ->  Bitmap Heap Scan on history_uint_p2014_03_12 history_uint  (cost=104.97..9161.08 rows=2504 width=5)
                   Recheck Cond: ((itemid = 51355) AND (clock <= 1395242011))
                   ->  Bitmap Index Scan on history_uint_p2014_03_12_1  (cost=0.00..104.35 rows=2504 width=0)
                         Index Cond: ((itemid = 51355) AND (clock <= 1395242011))
             ->  Bitmap Heap Scan on history_uint_p2014_03_13 history_uint  (cost=104.64..9037.30 rows=2466 width=5)
                   Recheck Cond: ((itemid = 51355) AND (clock <= 1395242011))
                   ->  Bitmap Index Scan on history_uint_p2014_03_13_1  (cost=0.00..104.03 rows=2466 width=0)
                         Index Cond: ((itemid = 51355) AND (clock <= 1395242011))
             ->  Bitmap Heap Scan on history_uint_p2014_03_14 history_uint  (cost=104.90..9034.43 rows=2458 width=5)
                   Recheck Cond: ((itemid = 51355) AND (clock <= 1395242011))
                   ->  Bitmap Index Scan on history_uint_p2014_03_14_1  (cost=0.00..104.29 rows=2458 width=0)
                         Index Cond: ((itemid = 51355) AND (clock <= 1395242011))
             ->  Bitmap Heap Scan on history_uint_p2014_03_15 history_uint  (cost=104.78..8989.47 rows=2445 width=5)
                   Recheck Cond: ((itemid = 51355) AND (clock <= 1395242011))
                   ->  Bitmap Index Scan on history_uint_p2014_03_15_1  (cost=0.00..104.17 rows=2445 width=0)
                         Index Cond: ((itemid = 51355) AND (clock <= 1395242011))
             ->  Bitmap Heap Scan on history_uint_p2014_03_16 history_uint  (cost=99.83..8701.71 rows=2365 width=5)
                   Recheck Cond: ((itemid = 51355) AND (clock <= 1395242011))
                   ->  Bitmap Index Scan on history_uint_p2014_03_16_1  (cost=0.00..99.24 rows=2365 width=0)
                         Index Cond: ((itemid = 51355) AND (clock <= 1395242011))
             ->  Bitmap Heap Scan on history_uint_p2014_03_17 history_uint  (cost=76.14..6547.17 rows=1773 width=5)
                   Recheck Cond: ((itemid = 51355) AND (clock <= 1395242011))
                   ->  Bitmap Index Scan on history_uint_p2014_03_17_1  (cost=0.00..75.69 rows=1773 width=0)
                         Index Cond: ((itemid = 51355) AND (clock <= 1395242011))
             ->  Bitmap Heap Scan on history_uint_p2014_03_18 history_uint  (cost=41.75..3691.50 rows=997 width=5)
                   Recheck Cond: ((itemid = 51355) AND (clock <= 1395242011))

That didn't look right. 
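
For reference, the plan above came from something along these lines (the itemid and clock values are taken from the plan output; plain EXPLAIN prints the estimated plan, while EXPLAIN ANALYZE actually executes the query, which you may not want to do against a server that is already struggling):

    EXPLAIN
    SELECT value
      FROM history_uint
     WHERE itemid = 51355
       AND clock <= 1395242011;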


  • Our default templates were too expansive, and we were relying on too many of their default settings.

  • This last bit was probably the most important discovery. Our single Zabbix server was monitoring 12556 items on 179 hosts with 4935 triggers. By default we were updating items every 30 seconds on average (with some as low as every 5 seconds), keeping history for 90 days, and keeping trends for 365. According to this website (which arguably gave me some of the most useful data on Zabbix that I have found to date), this kind of lazy use of templates can kill your performance. In our environment, an item updating every 5 seconds with the other parameters at their defaults of 90 days history and 365 days trends means roughly 1,555,200 rows per item sitting in the history table (90 days × 86,400 seconds per day ÷ a 5-second interval), plus a constant stream of inserts to keep it topped up. That's some heavy writing. Also, as seen in the query analysis above, the queries only have an upper bound on the clock field, so the planner can't prune its way straight to the partition that actually holds the data; it walks every partition up to that bound, even though the value it wants is almost always in the most recent one.
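
    If you want to put numbers to that on your own install, something along these lines against the Zabbix database gives a rough feel for the write load. This is only a sketch, assuming the stock 2.x schema, where items.delay is the polling interval in seconds and items.history is retention in days; flexible intervals and trapper items are ignored.

        -- Rough write-load estimate over enabled, polled items.
        SELECT count(*)                                 AS polled_items,
               round(sum(1.0 / delay), 1)               AS new_values_per_second,
               round(avg((86400.0 / delay) * history))  AS avg_history_rows_per_item
          FROM items
         WHERE status = 0     -- enabled items only
           AND delay  > 0;    -- skip trapper items, which have no polling interval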

    That discovery led me to check I/O on the server specifically (I had looked at top and network stats but neglected to check disk activity), and I found that the ZFS pool was pegged at 100% most of the time.

                     extended device statistics            
        r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
      214.1   36.0 27039.3 3169.4  2.0  5.1    8.1   20.3  11 100 rpool1
       86.0   16.0 11012.8 1056.5  1.2  0.6   12.0    5.9  32  61 c4t3d0
       55.0   16.0 6921.5 1056.5  1.0  0.5   14.3    6.7  30  47 c4t1d0
       73.0   17.0 9105.0 1056.5  1.2  0.6   12.9    6.4  31  57 c4t2d0
                        extended device statistics            
        r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
       44.0 1130.0 4659.9 135806.8 204.2 18.4  174.0   15.7  83 100 rpool1
        9.0  383.0  909.0 45268.9  5.7  0.6   14.6    1.4  85  56 c4t3d0
       16.0  364.0 1440.5 45269.0  5.5  0.6   14.6    1.6  87  60 c4t1d0
       19.0  383.0 2310.4 45269.0  5.3  0.6   13.1    1.5  86  61 c4t2d0
                        extended device statistics            
        r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
      150.0  290.0 18835.8 2436.0  5.9  4.4   13.4   10.1  25 100 rpool1
       50.0   97.0 6400.1  812.0  1.1  0.5    7.3    3.1  39  46 c4t3d0
       49.0  114.0 6150.6  812.0  1.1  0.5    6.6    3.0  39  48 c4t1d0
       51.0  103.0 6285.1  812.0  0.9  0.5    5.6    2.9  34  45 c4t2d0
                        extended device statistics            
        r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
      173.0    9.0 21657.5  432.0  0.0  2.0    0.0   11.2   0 100 rpool1
       44.0    5.0 5631.9  144.0  0.1  0.4    2.4    8.1  12  39 c4t3d0
       66.0    5.0 8326.3  144.0  0.3  0.6    3.6    8.3  24  59 c4t1d0
       63.0    5.0 7699.3  144.0  0.1  0.5    2.1    7.6  12  52 c4t2d0
                        extended device statistics            
        r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
      171.0    3.0 21628.1   36.0  0.0  2.1    0.0   12.0   0 100 rpool1
       55.0    2.0 7040.0   12.0  0.2  0.5    2.9    8.3  16  47 c4t3d0
       61.0    2.0 7548.0   12.0  0.2  0.6    3.9   10.0  24  63 c4t1d0
       55.0    2.0 7040.0   12.0  0.1  0.4    2.4    7.5  12  43 c4t2d0
                        extended device statistics            
        r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
      200.0    3.0 25477.3   36.0  0.0  2.7    0.2   13.2   2 100 rpool1
       69.0    2.0 8709.8   12.0  0.2  0.5    3.3    7.3  21  52 c4t3d0
       68.0    2.0 8703.8   12.0  0.5  0.6    7.4    9.1  31  63 c4t1d0
       63.0    2.0 8063.8   12.0  0.3  0.5    4.4    7.1  16  46 c4t2d0

    The obvious answer seemed to be to remove some items, change polling intervals, and so on. We found ourselves in a Catch-22, however, because trying to update a large number of items from the GUI resulted in deadlocks. It became almost impossible to update even a single item, let alone remove templates. To buy some time, I wound up having to stop the Zabbix server and run UPDATE statements directly in PostgreSQL, changing the polling intervals wholesale to 60 seconds and history retention to 7 days. I then manually dropped all history partitions older than 7 days.
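
    Roughly speaking, with the Zabbix server stopped, the emergency surgery looked like this (a sketch against the stock schema, not the exact statements I ran):

        -- Slow down anything polling more often than once a minute,
        -- and cap history retention at 7 days.
        UPDATE items SET delay = 60  WHERE delay > 0 AND delay < 60;
        UPDATE items SET history = 7 WHERE history > 7;

        -- Then drop each daily history_uint partition older than 7 days, e.g.:
        DROP TABLE history_uint_p2014_03_04;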

    Once that was done, our queue immediately caught up and the server was stable again. I was able to begin the process of pruning our templates and making them more efficient. While clearly there was user error involved in our Zabbix woes (lesson: do not use templates without customizing them first), there are some known problems in the way Zabbix works with its database backend. Googling "Zabbix deadlock" yields tons of bug reports. It seems to be a recurring issue that spans multiple versions of Zabbix and databases, and ultimately getting Zabbix scaled to a large environment requires a dedicated resource who can spend time tuning both ends, which is where I started this post. You can of course always go through the official support channel, which I considered, but in my opinion the pricing was exorbitant.

    Next steps are to start deploying proxy servers, which are supposed to take some of the edge off the main Zabbix server, and to convert some of our checks from passive to active (having the clients send their data in instead of relying on the Zabbix server to poll them).
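
    For the active-check piece, most of the client-side change lives in the agent config; a sketch, with placeholder hostnames (the items themselves also have to be switched to the "Zabbix agent (active)" type in the frontend):

        # zabbix_agentd.conf
        ServerActive=zabbix.example.com    # server (or proxy) the agent pushes results to
        Hostname=web01.example.com         # must match the host name configured in the frontend
        RefreshActiveChecks=120            # how often the agent asks for its list of active checks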


