Thursday, September 5, 2013

Postgres Shutdown and Startup

Before I start, I have to give a quick shout-out to the folks over at the illumos IRC channel. I'm always disappointed when I pop into a channel looking for some advice and get crickets...for hours. Or, even worse, one half-hearted, uninterested reply that only goes so far as to inform you that you're "doing it wrong" and "should've done it like x". These folks were fast and spot on. I had no fewer than 4 people respond to my admittedly newb question, and it felt good. It felt right. It felt like what I expect open source to be about. So kudos, folks.

Let's get back to the matter at hand. Why was I even on the illumos channel, you might ask? You know how sometimes a new member joins your team and they're eager to bring in the technology they like? That happened. My new boss is a big ol' OpenSolaris fan, and proselytizes OpenIndiana wherever he can. I didn't find an OpenIndiana room but I did find an illumos room, and since OI is built on illumos...well, that was where I went. Good thing, too, since my boss uses OI whenever he has the opportunity: when he was tasked with building out Postgres servers he chose OI instead of CentOS, which is used on every other box in our environment...I won't go into my concerns about bringing a new OS into a production environment where only one person has the expertise to set up, maintain, and troubleshoot it... :)

So we had Postgres installed and had started using it with our Hadoop cluster when we ran into an issue in our staging environment. The Hive server refused to start up. Based on the logs it looked like this was the issue: https://issues.apache.org/jira/browse/HIVE-3994. I made a change to the pg_hba.conf file and needed to restart the service. Were this CentOS or another *nix variant with which I'm familiar, I would've gone with service svcname restart or something similar, because it probably would have been installed with an init script. I don't know how OpenIndiana handles packages by default, but there was no init script for postgres. I did some Googling and found the pg_ctl command. I tried it: pg_ctl -D /var/postgres/9.1/data_64 stop -m immediate. Postgres failed to stop, and in fact the command errored out. I looked through the shell history, figuring that at some point during the installation my boss must have had to restart postgres, and I found svcadm commands.

What is svcadm? It's a tool that belongs to the Service Management Facility (SMF) in Solaris. I believe it is essentially the Solaris equivalent of the service command in Linux, but it goes further than that. It's part of an underlying self-healing aspect of the system: SMF keeps track of service dependencies and restarts failed services where possible. That's apparently why running pg_ctl directly may not work: OI sees that the service stopped outside of SMF, assumes that was unintentional, and tries to start it up again. (And I realize that I am using OI, Solaris, and illumos a bit interchangeably, which is incorrect. Solaris begat OpenSolaris begat illumos begat OI.)
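For anyone else coming from Linux, here is a rough sketch of how the familiar service verbs map onto SMF. The pg_service function and the PG_FMRI variable are names I made up for illustration; svcadm enable, disable, and restart and svcs are the real commands. One difference worth knowing: enable and disable persist across reboots unless you pass -t.

```shell
# Hypothetical wrapper: map Linux-style "service" verbs onto SMF commands.
# PG_FMRI is the service identifier from this post; adjust for your system.
PG_FMRI="svc:/application/database/postgresql_91:default_64bit"

pg_service() {
    case "$1" in
        start)   svcadm enable "$PG_FMRI" ;;    # persistent; add -t for temporary
        stop)    svcadm disable "$PG_FMRI" ;;   # persistent; add -t for temporary
        restart) svcadm restart "$PG_FMRI" ;;
        status)  svcs "$PG_FMRI" ;;
        *)       echo "usage: pg_service {start|stop|restart|status}" >&2
                 return 2 ;;
    esac
}
```

With that defined, pg_service restart is the closest thing to the service svcname restart I was reaching for.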

I used svcadm disable svc:/application/database/postgresql_91:default_64bit to stop postgres, and svcs -a | grep postgres to monitor the state of the service. Five minutes later the output still showed:

root@fdsma002:~# svcs -a | grep postgres
disabled       Aug_20   svc:/application/database/postgresql_91:default_32bit
online*        14:36:19 svc:/application/database/postgresql_91:default_64bit

That asterisk concerned me. I searched around for what it meant. I had my hunch, and got it confirmed via IRC. It meant that the service was stuck in transition: it had gotten the message to stop, but hadn't finished doing so. Since it had been in that state for a good long bit at this point, it became clear that manual intervention was necessary. I gathered as much info about the situation as I could on my own (and found a great admin page from Oracle in the process) before I had to ask for help because I was l-o-s-t.
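If you'd rather not eyeball the output for asterisks, something like the following should pick out services that are mid-transition. This is a minimal sketch that assumes the three-column STATE / STIME / FMRI layout shown above; in_transition is a hypothetical helper name, not an SMF command.

```shell
# Print the FMRI of every service whose STATE column ends in "*",
# i.e. services that are in transition. Reads `svcs -a` style output on stdin.
# Assumes the default three-column layout: STATE, STIME, FMRI.
in_transition() {
    awk '$1 ~ /\*$/ { print $3 }'
}

# Usage (on a system with SMF):
#   svcs -a | in_transition
```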

Here are the steps that were necessary to pull myself out of this mess, as described by my buddies on the illumos channel:

1. Review the logs to see how svcadm was trying to stop postgres. This line told me what I needed to know: [ Sep  5 14:37:19 Executing stop method ("/lib/svc/method/postgres_91 stop"). ]

2. Check that file to see exactly how pg_ctl is being called. The line for stopping (part of a case statement) was "$PGBIN/pg_ctl -D $PGDATA stop", which translated to "/usr/bin/pg_ctl -D /var/postgres/9.1/data_64/ stop". I added the "-m fast" portion, giving "/usr/bin/pg_ctl -D /var/postgres/9.1/data_64/ stop -m fast". This flag apparently does what I needed from the beginning, which was to just drop connections and not wait for existing clients to disconnect on their own.

3. Start the service up again using svcadm (after checking that it was showing as disabled at long last): svcadm enable svc:/application/database/postgresql_91:default_64bit

4. Verify that my Hive server was happy and able to start back up with the new config change. 
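For next time, the log-hunting and state-checking parts of the steps above could be sketched roughly like this. The helper names (smf_log_path, last_stop_method, wait_for_state) and the POLL_INTERVAL variable are my own inventions. The log location relies on the usual SMF convention of /var/svc/log/ plus the FMRI with "svc:/" stripped and slashes turned into dashes, so treat that as an assumption and check your own system.

```shell
PG_FMRI="svc:/application/database/postgresql_91:default_64bit"

# Step 1: derive the conventional SMF log path for a service.
# Assumption: logs live in /var/svc/log, named after the FMRI with
# the "svc:/" prefix removed and each "/" replaced by "-".
smf_log_path() {
    printf '/var/svc/log/%s.log\n' \
        "$(printf '%s\n' "$1" | sed -e 's|^svc:/||' -e 's|/|-|g')"
}

# Step 1 (continued): print the most recent "Executing stop method"
# line from a service log supplied on stdin.
last_stop_method() {
    grep 'Executing stop method' | tail -n 1
}

# Steps 3-4: poll until the service reports the wanted state.
# $1 = FMRI, $2 = wanted state, $3 = max attempts (default 30).
# Returns 0 once the state matches, 1 if we run out of attempts.
wait_for_state() {
    _tries=${3:-30}
    while [ "$_tries" -gt 0 ]; do
        if [ "$(svcs -H -o state "$1")" = "$2" ]; then
            return 0
        fi
        _tries=$((_tries - 1))
        sleep "${POLL_INTERVAL:-2}"
    done
    return 1
}

# Usage (on a system with SMF):
#   last_stop_method < "$(smf_log_path "$PG_FMRI")"
#   wait_for_state "$PG_FMRI" disabled && svcadm enable "$PG_FMRI"
#   wait_for_state "$PG_FMRI" online
```

A service stuck the way mine was reports online* rather than disabled, so wait_for_state would eventually give up and return 1 instead of hanging forever.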


Phew. 
