NetBackup 2505: Semaphores have run out

Issue:

The NetBackup database has gone offline. All jobs are reporting a 2505 error. The online community recommends setting the kernel semaphore limits to the official recommendations, but backupserver already has the recommended settings in place.

Let’s dig into that.

View the current limits: sysctl -a | grep kernel.sem
kernel.sem = 300 307200 32 1024
View the current usage in a very messy manner: `ipcs`
Count the semaphore arrays currently in use (ipcs lists one row per array):
ipcs | grep Semaphore\ Arrays -A2000 | grep ^0 | wc -l (yes… there are cleaner ways of doing this- I am not making this pretty though)
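For the record, here is a sketch of one of those cleaner ways. It assumes the usual `ipcs -s` layout, where every semaphore array is one row starting with a hex key (0x...). The function name and the sample rows are made up for illustration.

```shell
# Count semaphore arrays in `ipcs -s` style output read from stdin.
# Assumes each array is one row whose first column is a hex key (0x...).
count_sem_arrays() {
    grep -c '^0x'
}

# On a live system: ipcs -s | count_sem_arrays
# Demo against a small hand-written sample (values are made up):
printf '%s\n' \
    '' \
    '------ Semaphore Arrays --------' \
    'key        semid      owner      perms      nsems' \
    '0x00000000 0          root       600        1' \
    '0x00000000 32769      root       600        1' \
    | count_sem_arrays
```

Asking `ipcs` for only the semaphore section with -s avoids the grep -A2000 trick entirely.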

This showed 1024 arrays currently in use, out of a limit of 1024 (the last field of kernel.sem).
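Why 1024? The four numbers in kernel.sem are, in order, SEMMSL, SEMMNS, SEMOPM and SEMMNI, and the last one caps the number of semaphore arrays on the system. A small sketch that labels the fields (the function name is hypothetical, the sample line is the one from backupserver):

```shell
# Label the four fields of a kernel.sem line.
# Field order on Linux: SEMMSL SEMMNS SEMOPM SEMMNI.
describe_sem_limits() {
    # $1 is a line such as "kernel.sem = 300 307200 32 1024"
    echo "$1" | awk '{
        print "SEMMSL (max semaphores per array): " $(NF-3)
        print "SEMMNS (max semaphores system-wide): " $(NF-2)
        print "SEMOPM (max operations per semop call): " $(NF-1)
        print "SEMMNI (max semaphore arrays): " $NF
    }'
}

# On a live system: describe_sem_limits "$(sysctl kernel.sem)"
describe_sem_limits "kernel.sem = 300 307200 32 1024"
```

So hitting 1024 rows in `ipcs` means the box is out of arrays, even though the system-wide semaphore count (307200) is nowhere near exhausted.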

Stopping the DB resulted in 1015 consumed…

The thought at this point was that either NetBackup was not releasing semaphores, or another process was consuming them. I hear they are tasty.

Let’s map semaphores to PIDs- this is a huge hassle to do manually as you have to do a lookup of each semaphore ID to a PID with the -s -i flags (specify semaphore with -s, specify -i for “print details on resource identified by id”)

Can anyone say “FOR LOOP”?

for semid in $( ipcs -s | awk '/0x/{ print $2 }' ) ; do ipcs -s -i $semid | head -9 | tail -n1 | awk '{print $5}' ; done | uniq -c

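… this pulls the semaphore ID (the second column) out of the results of ipcs -s, dropping the key, and then tosses each ID to ipcs via -i for a lookup on that specific semaphore array. The head/tail/awk bit grabs the pid column from the first semaphore in the array, which is the last process to operate on it. Easy.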

We get this fallout: (with the highly sophisticated uniq -c tool…. Sarcasm is enhanced after 6 cups of coffee.)

3 2137
1 32550
5 2137
683 32550
1 36917
3 32550
281 36917
1 16356
6 36917
23 16356
1 6639
3 16356

We see a pattern here: pid 32550 has a bunch open!
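Since uniq -c only collapses adjacent duplicates, the same PID shows up on several rows above. A quick sketch that sums the counts per PID (the helper name is made up; the input is the fallout pasted from above):

```shell
# Sum `uniq -c` style "count pid" pairs from stdin into one total per PID,
# largest total first.
sum_by_pid() {
    awk '{ total[$2] += $1 } END { for (p in total) print total[p], p }' | sort -rn
}

sum_by_pid <<'EOF'
3 2137
1 32550
5 2137
683 32550
1 36917
3 32550
281 36917
1 16356
6 36917
23 16356
1 6639
3 16356
EOF
# prints:
# 687 32550
# 288 36917
# 27 16356
# 8 2137
# 1 6639
```

Sorting the PIDs before handing them to uniq -c would give the same totals in one pass.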

So… what is that process?

Given that our max pid is 40960 on backupserver… we could have wrapped around again… (see: `cat /proc/sys/kernel/pid_max`) but if we are to believe what /var/log/messages has to say…

This was ‘nfsidmap[32550]’

The second-most semaphore-loving PID was also nfsidmap: nfsidmap[36917]

Processes that were no longer living never released these. This was likely an issue with nfsidmap.
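I went with a reboot, but for the curious, here is a hedged sketch of how one might spot arrays whose recorded PID is gone and print (not run) an `ipcrm -s` for each. Assumptions to keep in mind: the PID that ipcs reports is the last process to touch the semaphore, not necessarily its owner, and PIDs get reused, so a "live" PID may be an unrelated process. The function names and the semaphore IDs in the demo are hypothetical.

```shell
# Return success if a process with the given PID currently exists.
# kill -0 sends no signal; the /proc check covers processes we may not signal.
pid_alive() {
    kill -0 "$1" 2>/dev/null || [ -d "/proc/$1" ]
}

# Read "semid pid" pairs on stdin and print an ipcrm command for each array
# whose recorded PID no longer exists. Dry run only: nothing is removed.
suggest_cleanup() {
    while read -r semid pid; do
        pid_alive "$pid" || echo "ipcrm -s $semid"
    done
}

# Demo: our own shell is alive, PID 99999999 is above any Linux pid_max.
printf '%s\n' "12345 $$" '67890 99999999' | suggest_cleanup
```

Review the printed commands by hand before running any of them; removing an array that a live application still uses will break that application.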

So I rebooted backupserver- satisfied that the mine was still active in our environment… lurking for another day.

Conclusion:

Running backups again…

Only seeing 13 semaphores actively used:

[root@backupserver ~]# ipcs | grep Semaphore\ Arrays -A2000 | grep ^0 | wc -l

13


Story Time: The Server Room is on Fire

The result of a bad draft and a firmware update
2:13am: Phone rings with a very concerned person on the other end of the line. “The server room is full of smoke, we think a server is on fire.”

In a groggy state, I ask the obvious question: “Is it an emergency? Do you want me to come in and check it out?” Response: “YES!”

Considering myself to be a daily firefighter- I didn’t think much of the situation, so I drove on the deserted streets to work. Upon quick glance, I did not see any flames from outside the building- which I took to be a good sign.

I get escorted up to the server room by a panicked guard. I ask the reasonable question of “how is your night going?” to lighten the mood.

I should take a moment to explain… most offices keep their server rooms where the business folks do not have to see or smell the IT folk, but this was a new concept building that believed in highlighting the technology advances in a prominent fashion.

Upon walking into the server room, I notice the smoke monitoring system is indeed alerting on a higher than average smoke concentration. Still no flames visible.

Going off of the basics, I am smelling for a pleasant “magic smoke” essence as I walk on the hot aisle. Much to my dismay, I am unable to locate the scent I was looking for. Instead I am able to get close enough to the room's air intake to smell that it is coming from the roof HVAC units.

A splendid opportunity as I had yet to test my access to the roof. I begin crawling into the access doors of the HVAC units- trying to trace the smell, and eventually follow them to the inlets from the outside.

Looking around- I spot the source: the generators were spewing diesel soot into the air- all part of the regular maintenance.

This morning was a special morning- the wind had been blowing from the West which directed all of the exhaust into the server room. As it turns out- the building management had run a firmware upgrade which reset the generator testing schedule to a time that closely matched 2am.