Bug #10418

Puppet agent hangs when listen is true and reading from /proc filesystem on redhat

Added by Jo Rhett 7 months ago. Updated 4 months ago.

Status: Closed
Start date: 11/01/2011
Priority: Normal
Due date:
Assignee: Patrick Otto
% Done: 0%
Category: agent
Target version: -
Affected Puppet version: 2.6.12
Branch:
Keywords: enabledisable hang select proc listen redhat
Votes: 9

Description

Mon Oct 31 23:03:31 +0000 2011 Puppet (notice): Caught TERM; calling stop

Ever since the 2.6.12 upgrade I’ve been seeing these reports reach us, from about a hundred of our roughly five hundred machines. Most of the time we find that $vardir/state/puppetdlock is in place and blocking further puppet runs, which requires manual resolution.

I wrote a quick cron script to look for puppetdlock files older than one hour, remove them, and mail me a report, and I’ve received several dozen in the last few hours. Something is clearly broken in 2.6.12; we are downgrading our systems to 2.6.11.
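
A minimal sketch of that kind of stopgap is shown here. It is not the reporter's actual script: the function names, the default state directory (taken from the paths shown elsewhere in this ticket) and the 60-minute threshold are assumptions, and it relies on the `-maxdepth` and `-mmin` options of GNU find.

```shell
#!/bin/sh
# Sketch only: function names and defaults are illustrative assumptions.

# Print puppetdlock files in a state directory that are older than N minutes.
find_stale_locks() {
    state_dir="${1:-/var/lib/puppet/state}"
    max_age_min="${2:-60}"
    find "$state_dir" -maxdepth 1 -type f -name puppetdlock \
        -mmin "+$max_age_min" 2>/dev/null
}

# Remove each stale lock and print one report line per removed file.
remove_stale_locks() {
    find_stale_locks "$@" | while IFS= read -r lock; do
        rm -f -- "$lock" && printf 'removed stale lock: %s\n' "$lock"
    done
}
```

Run from cron, any lines printed by `remove_stale_locks` would be mailed to the job owner when `MAILTO` is set, which covers the reporting part without a separate mail command.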

I have no other information beyond this: it crosses all of our machine types, and we have had no significant changes in our modules in this time period. Many of the machines which have failed have had zero module or manifest changes which would apply to them. I cannot get this to replicate on the command line.


Related issues

related to Puppet - Bug #2888: puppetd doesn't always cleanup lockfile properly Accepted 12/04/2009
related to Puppet - Bug #5139: puppetdlock file can be empty Needs More Information 10/28/2010
related to Puppet - Bug #10819: Puppet agent hangs when 'listen = true' on Centos 5.7 Duplicate 11/14/2011
related to Facter - Bug #10909: Shell out when reading from /proc Rejected 11/16/2011
related to Puppet - Bug #12588: Running Puppet in one-time mode from cron leaves hung pup... Duplicate 02/12/2012
duplicated by Puppet - Bug #7201: [puppet2.7]puppet kick exit code 3 Duplicate 04/21/2011
duplicated by Puppet - Bug #11135: puppet kick / puppet agent Duplicate 12/02/2011
duplicated by Puppet - Bug #11360: puppet client hangs after period of being unable to conta... Closed 12/12/2011

History

Updated by Jo Rhett 7 months ago

So we have found some consistency in the systems which are affected. Certain classes of hosts are more often affected than others. Very oddly, the most affected class of servers is one of the classes of hosts where the fewest classes are applied, and every class applied to them is applied to hundreds of other hosts in our environment!

So I logged into a host which hadn’t checked in for a while, and found that it seems to be stuck within a single loop, whereas the puppet processes on systems running normally have more variance in their strace output. Here’s a quick example:

Healthy system:

select(9, [4 7], [], [], {1, 999999})   = 0 (Timeout)
select(9, [4 7], [], [], {0, 698})      = 0 (Timeout)
select(9, [4 7], [], [], {0, 0})        = 0 (Timeout)
select(6, [4], [], [], {0, 0})          = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
select(9, [4 7], [], [], {1, 999999})   = 0 (Timeout)
select(9, [4 7], [], [], {0, 0})        = 0 (Timeout)
select(6, [4], [], [], {0, 0})          = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
select(9, [4 7], [], [], {1, 999999})   = 0 (Timeout)
select(9, [4 7], [], [], {0, 0})        = 0 (Timeout)
select(6, [4], [], [], {0, 0})          = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
select(9, [4 7], [], [], {0, 465482})   = 0 (Timeout)
select(9, [4 7], [], [], {0, 0})        = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
select(9, [7], [], [], {0, 0})          = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0

So it is definitely looping through a couple of selects, but the calls seem to vary a bit and it goes into different functions at times, like checking /etc/puppet/puppet.conf.

On the systems which are hung, the process stays forever within a single loop without variance:

select(9, [7 8], [], [], {1, 999999})   = 0 (Timeout)
select(9, [7 8], [], [], {0, 88})       = 0 (Timeout)
select(9, [7 8], [], [], {0, 0})        = 0 (Timeout)
select(9, [8], [], [], {0, 0})          = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
select(9, [7 8], [], [], {1, 999999})   = 0 (Timeout)
select(9, [7 8], [], [], {0, 0})        = 0 (Timeout)
select(9, [8], [], [], {0, 0})          = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
select(9, [7 8], [], [], {1, 999999})   = 0 (Timeout)
select(9, [7 8], [], [], {0, 85})       = 0 (Timeout)
select(9, [7 8], [], [], {0, 0})        = 0 (Timeout)
select(9, [8], [], [], {0, 0})          = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
select(9, [7 8], [], [], {1, 999999})   = 0 (Timeout)
select(9, [7 8], [], [], {0, 113})      = 0 (Timeout)
select(9, [7 8], [], [], {0, 0})        = 0 (Timeout)
select(9, [8], [], [], {0, 0})          = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
select(9, [7 8], [], [], {1, 999998})   = 0 (Timeout)
select(9, [7 8], [], [], {0, 56})       = 0 (Timeout)
select(9, [7 8], [], [], {0, 0})        = 0 (Timeout)
select(9, [8], [], [], {0, 0})          = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0

These two systems are both running puppet 2.6.12 on CentOS 5.5.

Updated by Jo Rhett 7 months ago

More info: not all systems have puppetdlock files. However, they always seem to stop after a run. They even accept kicks, but do nothing with them. Here’s an example of “grep puppet /var/log/messages”

Nov  2 01:55:59 us0101acdc008 puppet-agent[143582]: (/File[/local/tomcat/webapps/abregistrar/WEB-INF/property-configurer.xml]) Filebucketed /local/tomcat/webapps/abregistrar/WEB-INF/property-configurer.xml to puppet with sum 83231d183f9d40f0ed44db880504a3f4
Nov  2 01:55:59 us0101acdc008 puppet-agent[143582]: (/File[/local/tomcat/webapps/abregistrar/WEB-INF/property-configurer.xml]/content) content changed '{md5}83231d183f9d40f0ed44db880504a3f4' to '{md5}1a4d31b2b0df03846e51017226a7ead6'
Nov  2 01:55:59 us0101acdc008 puppet-agent[143582]: (/Stage[main]/Webapps::Deploy/File[webinf]) Scheduling refresh of Exec[start-tomcat]
Nov  2 01:55:59 us0101acdc008 puppet-agent[143582]: (/Stage[main]/Webapps::Deploy/Exec[start-tomcat]/returns) executed successfully
Nov  2 01:55:59 us0101acdc008 puppet-agent[143582]: (/Stage[main]/Webapps::Deploy/Exec[start-tomcat]) Triggered 'refresh' from 127 events
Nov  2 01:56:14 us0101acdc008 puppet-agent[143582]: Finished catalog run in 91.72 seconds
Nov  2 03:26:31 us0101acdc008 puppet-agent[90299]: triggered run
[04:23 root@us0101acdc008 ~]$ 

As you can see, it observed the kick request but did nothing about it. System was bored silly in the same period:

[04:26 root@us0101acdc008 ~]$ sar
Linux 2.6.18-274.7.1.el5 (us0101acdc008.tangome.gbl)    11/02/2011

12:00:01 AM       CPU     %user     %nice   %system   %iowait    %steal     %idle
12:10:01 AM       all      1.25      0.00      0.79      0.00      0.00     97.96
12:20:01 AM       all      1.36      0.00      0.83      0.00      0.00     97.81
12:30:01 AM       all      2.32      0.00      0.90      0.00      0.00     96.79
12:40:01 AM       all      1.48      0.00      0.92      0.00      0.00     97.60
12:50:01 AM       all      1.34      0.00      0.84      0.00      0.00     97.82
01:00:01 AM       all      1.24      0.00      0.78      0.00      0.00     97.98
01:10:01 AM       all      1.31      0.00      0.82      0.00      0.00     97.87
01:20:01 AM       all      1.18      0.00      0.71      0.00      0.00     98.11
01:30:01 AM       all      1.27      0.00      0.89      0.00      0.00     97.84
01:40:01 AM       all      1.09      0.00      0.66      0.00      0.00     98.25
01:50:01 AM       all      0.91      0.00      0.59      0.00      0.00     98.50
02:00:01 AM       all      4.82      0.00      0.50      0.03      0.00     94.65
02:10:01 AM       all      0.14      0.00      0.16      0.00      0.00     99.70
02:20:01 AM       all      0.12      0.00      0.17      0.00      0.00     99.72
02:30:01 AM       all      1.22      0.00      0.42      0.00      0.00     98.35
02:40:01 AM       all      0.87      0.00      0.57      0.00      0.00     98.56
02:50:01 AM       all      0.78      0.00      0.53      0.00      0.00     98.69
03:00:01 AM       all      0.65      0.00      0.51      0.00      0.00     98.84
03:10:01 AM       all      0.69      0.00      0.47      0.00      0.00     98.84
03:20:01 AM       all      0.78      0.00      0.53      0.00      0.00     98.69
03:30:01 AM       all      0.83      0.00      0.58      0.00      0.00     98.59
03:40:01 AM       all      0.84      0.00      0.57      0.00      0.00     98.59
03:50:01 AM       all      0.58      0.00      0.43      0.00      0.00     98.99
04:00:01 AM       all      0.65      0.00      0.45      0.00      0.00     98.90
04:10:01 AM       all      0.60      0.00      0.42      0.14      0.00     98.83
04:20:01 AM       all      1.20      0.00      0.40      0.00      0.00     98.39
Average:          all      1.13      0.00      0.59      0.01      0.00     98.26

Updated by Jo Rhett 7 months ago

I’d like to send you an strace debug log, but I need an SSL-secure (or better) way to send the file to you, and assurances that the file will not be publicly shared. I have a single host where I can easily replicate the problem at will: I stop puppet, then restart the puppet client. It creates the puppetdlock file but never contacts the puppet master to download a catalog.

Updated by James Turnbull 7 months ago

  • Status changed from Unreviewed to Investigating

Jo – email to james@lovedthanlost.net (my GPG/PGP key – http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x215AFE50E4147032)

Updated by Jo Rhett 7 months ago

I just mailed the files. Note that I have the system standing by for other tests as needed today, but if we can’t fix it then I’ll be forced to roll back to 2.6.11 tonight.

Updated by James Turnbull 7 months ago

Jo – I can’t promise we’ll be able to look at this today.

Updated by James Turnbull 7 months ago

And you are sure the only change is the 2.6.12 upgrade?

Updated by Jo Rhett 7 months ago

I understand on timing, just letting you know my timing. I did try changing the passenger setup last night per my last e-mail to the list, and it didn’t affect the client problems but managed to kill the box. We’re now running the original/stock passenger definition again and it’s sane/working but the client problem remains unaffected all through this.

We did not see this problem happen until 2.6.12 was pushed out to clients. We’re observing them via a NRPE check on state/last_run_summary.yaml and that’s when Nagios started to see problems. Oddly enough, it’s happening the most to a certain set of machines … but those machines have the least amount of classes applied (ie, just the base classes in node default that everyone inherits)
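
A check of that kind could look roughly like the sketch below. It is not the reporter's actual NRPE plugin: the function name and the 90-minute threshold are assumptions, and it treats the file's modification time as a proxy for the last completed run instead of parsing the YAML. The return values follow the usual Nagios plugin convention (0 for OK, 2 for CRITICAL).

```shell
#!/bin/sh
# Sketch only: name, threshold and the mtime-based approach are assumptions.

# Report CRITICAL when last_run_summary.yaml is missing or too old.
check_last_run() {
    summary="${1:-/var/lib/puppet/state/last_run_summary.yaml}"
    max_age_min="${2:-90}"
    if [ ! -f "$summary" ]; then
        echo "CRITICAL: $summary is missing"
        return 2
    fi
    # find prints the path only when the file is older than the threshold.
    if [ -n "$(find "$summary" -mmin "+$max_age_min")" ]; then
        echo "CRITICAL: no puppet run in over $max_age_min minutes"
        return 2
    fi
    echo "OK: puppet ran within the last $max_age_min minutes"
    return 0
}
```

A hung agent of the kind described in this ticket stops refreshing the summary file, so an age-based check would flag it even though the process is still alive.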

The only factor which might be related, though I can’t yet correlate it, is that these systems have a higher number of open TCP sessions than most other systems … but not all. I’m trying to validate any consistency on those metrics to see if they correlate.

Updated by Jo Rhett 7 months ago

Any IRC or chat channel I can go back and forth with someone? Is that possible? Trying hard to track this down, and I have an active system that won’t replay no matter what… I’m “jorhett” on every IM service.

Updated by James Turnbull 7 months ago

Jo – I am on #puppet on Freenode – jamesturnbull

Updated by James Turnbull 7 months ago

Can you show me the logs on the agent around the message you’re receiving? Is it possible to run the agent in --verbose --debug --trace mode and replicate the failure?

Updated by James Turnbull 7 months ago

  • Keywords set to enabledisable

Updated by Jo Rhett 7 months ago

Here’s the agent output when run with debug verbose trace but not --onetime:

$ /usr/sbin/puppetd --server=puppetmaster --logdest=syslog --debug --verbose --trace
debug: Puppet::Type::User::ProviderPw: file pw does not exist
debug: Puppet::Type::User::ProviderUser_role_add: file roleadd does not exist
debug: Puppet::Type::User::ProviderDirectoryservice: file /usr/bin/dscl does not exist
debug: Failed to load library 'ldap' for feature 'ldap'
debug: Puppet::Type::User::ProviderLdap: feature ldap is missing
debug: Failed to load library 'rubygems' for feature 'rubygems'
debug: Puppet::Type::File::ProviderMicrosoft_windows: feature microsoft_windows is missing
debug: /File[/var/lib/puppet/ssl/private]: Autorequiring File[/var/lib/puppet/ssl]
debug: /File[/var/lib/puppet/log/http.log]: Autorequiring File[/var/lib/puppet/log]
debug: /File[/var/lib/puppet/ssl]: Autorequiring File[/var/lib/puppet]
debug: /File[/var/lib/puppet/client_yaml]: Autorequiring File[/var/lib/puppet]
debug: /File[/var/lib/puppet/ssl/certs/us0101acm017.tangome.gbl.pem]: Autorequiring File[/var/lib/puppet/ssl/certs]
debug: /File[/etc/puppet/puppet.conf]: Autorequiring File[/etc/puppet]
debug: /File[/var/lib/puppet/state/last_run_summary.yaml]: Autorequiring File[/var/lib/puppet/state]
debug: /File[/etc/puppet/namespaceauth.conf]: Autorequiring File[/etc/puppet]
debug: /File[/var/lib/puppet/ssl/public_keys/us0101acm017.tangome.gbl.pem]: Autorequiring File[/var/lib/puppet/ssl/public_keys]
debug: /File[/var/lib/puppet/lib]: Autorequiring File[/var/lib/puppet]
debug: /File[/var/lib/puppet/ssl/public_keys]: Autorequiring File[/var/lib/puppet/ssl]
debug: /File[/var/lib/puppet/state/state.yaml]: Autorequiring File[/var/lib/puppet/state]
debug: /File[/var/lib/puppet/log]: Autorequiring File[/var/lib/puppet]
debug: /File[/var/lib/puppet/ssl/certs]: Autorequiring File[/var/lib/puppet/ssl]
debug: /File[/var/lib/puppet/classes.txt]: Autorequiring File[/var/lib/puppet]
debug: /File[/var/lib/puppet/client_data]: Autorequiring File[/var/lib/puppet]
debug: /File[/var/lib/puppet/state]: Autorequiring File[/var/lib/puppet]
debug: /File[/var/lib/puppet/facts]: Autorequiring File[/var/lib/puppet]
debug: /File[/var/lib/puppet/ssl/crl.pem]: Autorequiring File[/var/lib/puppet/ssl]
debug: /File[/var/lib/puppet/clientbucket]: Autorequiring File[/var/lib/puppet]
debug: /File[/var/lib/puppet/ssl/certs/ca.pem]: Autorequiring File[/var/lib/puppet/ssl/certs]
debug: /File[/var/lib/puppet/state/graphs]: Autorequiring File[/var/lib/puppet/state]
debug: /File[/var/lib/puppet/ssl/private_keys/us0101acm017.tangome.gbl.pem]: Autorequiring File[/var/lib/puppet/ssl/private_keys]
debug: /File[/var/lib/puppet/state/last_run_report.yaml]: Autorequiring File[/var/lib/puppet/state]
debug: /File[/var/lib/puppet/ssl/private_keys]: Autorequiring File[/var/lib/puppet/ssl]
debug: /File[/var/lib/puppet/ssl/certificate_requests]: Autorequiring File[/var/lib/puppet/ssl]
debug: Finishing transaction 23761369257900

Updated by Jo Rhett 7 months ago

Here’s another strace. It’s the very end of the output of “strace puppet agent --enable”. What it shows is puppet failing to remove the puppetdlock file. There is something wrong in this piece of code:

rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
stat("/var/lib/puppet/state/puppetdlock", {st_mode=S_IFREG|0644, st_size=5, ...}) = 0
open("/var/lib/puppet/state/puppetdlock", O_RDONLY) = 6
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
fstat(6, {st_mode=S_IFREG|0644, st_size=5, ...}) = 0
fstat(6, {st_mode=S_IFREG|0644, st_size=5, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2ad97fefe000
lseek(6, 0, SEEK_CUR)                   = 0
read(6, "13115", 4096)                  = 5
read(6, "", 4096)                       = 0
close(6)                                = 0
munmap(0x2ad97fefe000, 4096)            = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
stat("/var/lib/puppet/state/puppetdlock", {st_mode=S_IFREG|0644, st_size=5, ...}) = 0
open("/var/lib/puppet/state/puppetdlock", O_RDONLY) = 6
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
fstat(6, {st_mode=S_IFREG|0644, st_size=5, ...}) = 0
fstat(6, {st_mode=S_IFREG|0644, st_size=5, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2ad97fefe000
lseek(6, 0, SEEK_CUR)                   = 0
read(6, "13115", 4096)                  = 5
read(6, "", 4096)                       = 0
close(6)                                = 0
munmap(0x2ad97fefe000, 4096)            = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigaction(SIGINT, {SIG_DFL, [INT], SA_RESTORER|SA_RESTART, 0x301ac302d0}, {0x36b8c8ccf0, [INT], SA_RESTORER|SA_RESTART, 0x301ac302d0}, 8) = 0
close(4)                                = 0
munmap(0x2ad97f1a6000, 4096)            = 0
close(3)                                = 0
munmap(0x2ad97f1a5000, 4096)            = 0
exit_group(0)                           = ?
$ ls /var/lib/puppet/state/
graphs  last_run_report.yaml  last_run_summary.yaml  puppetdlock  state.yaml
$ cat /var/lib/puppet/state/puppetdlock 
13115

(same as it was before — the running daemon’s pid)
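
Since the lock file holds a PID, one way to tell a live lock from a leftover one is to test whether that PID still exists. The following is a minimal sketch under assumptions: the function name and messages are made up for illustration, and `kill -0` only gives a reliable answer when run as root or as the user that owns the agent process.

```shell
#!/bin/sh
# Sketch only: function name and messages are illustrative assumptions.

# Describe the state of the process recorded in a puppetdlock file.
lock_owner_status() {
    lock_file="${1:-/var/lib/puppet/state/puppetdlock}"
    if [ ! -f "$lock_file" ]; then
        echo "no lock file"
        return 0
    fi
    # Strip the trailing newline and any other whitespace.
    pid=$(tr -d '[:space:]' < "$lock_file")
    case "$pid" in
        ''|*[!0-9]*)
            # Covers an empty lock file as well as non-numeric content.
            echo "lock file has no valid pid"
            return 0
            ;;
    esac
    # Signal 0 delivers nothing; it only checks that the process exists.
    if kill -0 "$pid" 2>/dev/null; then
        echo "held by running pid $pid"
    else
        echo "stale: pid $pid is not running"
    fi
}
```

In the situation shown above, this would report the lock as held by the running daemon, which matches the observation that the lock is not stale but is also never released.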

Updated by Jo Rhett 7 months ago

Just FYI, it’s a mix of centos 5.5 to 5.7, but all systems have these versions of the following:

libselinux-ruby-1.33.4-5.7.el5
ruby-libs-1.8.5-19.el5_6.1
ruby-shadow-1.4.1-7.el5
ruby-1.8.5-19.el5_6.1
ruby-augeas-0.4.1-1.el5
facter-1.6.2-1.el5
puppet-2.6.12-2.el5

The last two were downloaded from yum.puppetlabs.com, the previous ones are all from centos or epel.

Updated by Jason Smith 7 months ago

I also observed the exact same behavior, right down to the identical strace for the hung puppet daemon. When looking at lsof and the strace together I see that puppet is waiting on the puppet agent listening port (as expected) and a read fd open to a /proc path, often /proc/cpuinfo, but not always. At first I also thought it was the updated puppet version 2.6.12, but after testing several combinations, I think I narrowed it down to the RHEL5.7 kernel version (2.6.18-274.7.1.el5). Any system using this kernel, no matter what puppet version I try, always hangs. If I reboot a system with a hung puppet daemon into an earlier RHEL5.7 kernel then puppet starts to work again. Note, the supposedly bad RHEL5.7 kernel was released just a few days before the most recent puppet security update, on October 20th, see: RHSA-2011-1386. Could puppet be hung waiting to read info from /proc and this kernel has a bug somewhere in /proc? I also tried searching RedHat’s bugzilla and didn’t see any obvious related bugs yet, but it has only been 2 weeks since the kernel was released.
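
Where lsof is not at hand, the same observation can be approximated by reading the symlinks under /proc/PID/fd on Linux. This is a minimal sketch, not the commands used above; the function name is an assumption, and it needs permission to read the target process's fd directory (typically root).

```shell
#!/bin/sh
# Sketch only: function name is an illustrative assumption. Linux-specific.

# List the file descriptors of a process that point into /proc.
proc_fds() {
    pid="$1"
    for fd in /proc/"$pid"/fd/*; do
        # Skip entries that vanish or cannot be read.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case "$target" in
            /proc/*) printf '%s -> %s\n' "${fd##*/}" "$target" ;;
        esac
    done
}
```

Called with the PID of a hung agent, it would print lines such as `8 -> /proc/cpuinfo` if the descriptor seen in the select() calls is indeed a /proc file.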

Updated by Jo Rhett 7 months ago

Oh sweet jesus, yeah that timeline matches exactly with the explosion of problem reports because the class of systems which are all having this problem were rebooted shortly after the puppet upgrade so would have restarted with that kernel.

Updated by Jason Smith 7 months ago

FYI, I just opened a ticket in RedHat’s bugzilla #751214.

Updated by Jo Rhett 7 months ago

I’ve rebooted a problematic system after reverting to kernel 2.6.18-274.3.1.el5 and confirmed that the puppet daemon locking problem has disappeared.

Updated by Todd Zullinger 6 months ago

I’ve seen this same issue and also found that reverting to a previous kernel fixed it. I do have listen = true in the puppet config; I’ve not tried disabling that to see if it affects things (someone on the list mentioned that possibility).

Unfortunately, the RHEL bug is not accessible to me (but hey, I’m only logged in with my credentials as an EPEL puppet maintainer ;). Any chance to get that bug opened up to the public or add me to the Cc list?

Updated by Jo Rhett 6 months ago

So apparently the listen bit is a factor, according to discussion in that bug. Apparently when ruby has only a single file open, it just opens the file in /proc which works fine. But if more than one file will be open (ie, the listen socket plus the /proc file) then ruby uses select() on the file instead which is what trips this bug.

So yeah, disabling listen might relieve the symptoms a bit.

I’ll add you to the bug if I can. Erm, no. Yeah, they locked the bug down to just redhat internal plus reporter and CC list, and I don’t seem to be able to edit the bug. Good news is that they tested a new kernel this morning that appears to fix the problem: (remainder is copy/paste from there)

David Howells 2011-11-16 05:14:07 EST:
I’ve put a test kernel with the patch applied for download at:

http://people.redhat.com/~dhowells/.067ac120438d738257e2a305a3ddac64/kernel-2.6.18-298.el5.bz751214.1.x86_64.rpm

Comment 31, Dmitry Zamaruev 2011-11-16 05:53:06 EST:
I could confirm that with given kernel test passes:

[root@app ~]# uname -r
2.6.18-298.el5.bz751214.1
[root@app ~]# ./test /proc/uptime
153.99 137.31

And Ruby applications (Chef/Shef in my case) which hung on 274.7.1 – works as expected on this kernel.

Updated by Josh Cooper 5 months ago

  • Subject changed from "Caught TERM; calling stop" with state/puppetdlock left in place to Puppet agent hangs when listen is true and reading from /proc filesystem on redhat
  • Keywords changed from enabledisable to enabledisable hang select proc listen redhat

Updated by XiangJun Wu 5 months ago

Does CentOS6.1 fix it?

Updated by Peter Meier 5 months ago

XiangJun Wu wrote:

Does CentOS6.1 fix it?

No. According to the RHEL Bugtracker this will be fixed in kernel-2.6.18-299.el5, which will then be taken up by CentOS.

Updated by XiangJun Wu 5 months ago

Hopefully, CentOS6.2 will include the fix.

Peter Meier wrote:

XiangJun Wu wrote:

Does CentOS6.1 fix it?

No. According to the RHEL Bugtracker this will be fixed in kernel-2.6.18-299.el5, which will then be taken up by CentOS.

Updated by Jason Smith 5 months ago

I don’t know about anyone else, but I have not seen this problem in any of the RHEL6 kernels, neither 6.1 nor 6.2. Note, the RHEL6 kernel (based on 2.6.32) is completely different from the RHEL5 one (based on 2.6.18). As Peter said, for the RHEL5 kernels, the bugzilla ticket says that RedHat has verified that they have fixed it in their 2.6.18-299 kernel, so I would assume any RHEL5 kernel version greater than that would have the fix. Hopefully this is almost done going through RedHat’s QA and will be released soon, either in a RHEL5.7 errata or possibly the soon to be released 5.8.


Updated by Peter Meier 5 months ago

Jason Smith wrote:

I don’t know about anyone else, but I have not seen this problem in any of the RHEL6 kernels, neither 6.1 nor 6.2. Note, the RHEL6 kernel (based on 2.6.32) is completely different from the RHEL5 one (based on 2.6.18). As Peter said, for the RHEL5 kernels, the bugzilla ticket says that RedHat has verified that they have fixed it in their 2.6.18-299 kernel, so I would assume any RHEL5 kernel version greater than that would have the fix. Hopefully this is almost done going through RedHat’s QA and will be released soon, either in a RHEL5.7 errata or possibly the soon to be released 5.8.

Ah, good point that this is about RHEL5 :)

Updated by Corey Osman 4 months ago

This also affects RHEL4.

Updated by Michael Stahnke 4 months ago

I’d love to know if kernels from https://rhn.redhat.com/errata/RHSA-2012-0007.html fix the issue. I haven’t had a chance to throw them onto our test environment yet.

Updated by Jo Rhett 4 months ago

Michael Stahnke wrote:

I’d love to know if kernels from https://rhn.redhat.com/errata/RHSA-2012-0007.html fix the issue. I haven’t had a chance to throw them onto our test environment yet.

I don’t see any mention of the redhat bug :(

Updated by Andrew Beresford 4 months ago

Jo Rhett wrote:

Michael Stahnke wrote:

I’d love to know if kernels from https://rhn.redhat.com/errata/RHSA-2012-0007.html fix the issue. I haven’t had a chance to throw them onto our test environment yet.

I don’t see any mention of the redhat bug :(

Neither the 2.6.18-274.17.1 kernel in that errata nor the 2.6.18-300 currently in RHEL5 beta seem to help.

Updated by Jo Rhett 4 months ago

Andrew Beresford wrote:

Neither the 2.6.18-274.17.1 kernel in that errata nor the 2.6.18-300 currently in RHEL5 beta seem to help.

Hm. Redhat claimed otherwise in the ticket today:

A fix for the current RHEL5 minor release (5.7) was tracked through bug #755483 and is included as of kernel 2.6.18-274.17.1.el5 from http://rhn.redhat.com/errata/RHSA-2012-0007.html.

I’ll try applying the kernel and test here.

Updated by Jo Rhett 4 months ago

According to the most basic test, puppet agent appears to run cleanly and finish its job without leaving puppetdlock in the state directory. This is with listen enabled, and on a system that did not work with the 274.7.1 kernel.

$ uname -a
Linux xabbcd4 2.6.18-274.17.1.el5 #1 SMP Tue Jan 10 17:25:58 EST 2012 x86_64 x86_64 x86_64 GNU/Linux

Jan 11 23:17:46 xabbcd4 puppet-agent[4009]: Starting Puppet client version 2.6.12
Jan 11 23:20:05 xabbcd4 puppet-agent[4009]: Finished catalog run in 131.27 seconds

$ ls /var/lib/puppet/state
graphs  last_run_report.yaml  last_run_summary.yaml  state.yaml

Updated by Mark Chappell 4 months ago

Also working for me on Linux 2.6.18-274.17.1.el5 #1 SMP Wed Jan 4 22:45:44 EST 2012 x86_64 x86_64 x86_64 GNU/Linux

Updated by Brian Pitts 4 months ago

The kernel 2.6.18-274.17.1.el5 also resolved the issue for me in CentOS.

Updated by Marc Cortinas Val 4 months ago

Yes, I’ve updated the kernel to 2.6.18-274.17.1.el5 and that fixed it. Thanks for your support, guys!

Updated by Patrick Otto 4 months ago

  • Status changed from Investigating to Closed
  • Assignee set to Patrick Otto

Seems like this is fixed with 2.6.18-274.17.1.el5
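
For auditing a fleet, the kernel releases named in this ticket can be turned into a quick classification. The sketch below is only a summary of what commenters reported here, not an authoritative list from Red Hat; the function name and the status strings are assumptions, and any release not explicitly mentioned in the thread is deliberately left unclassified.

```shell
#!/bin/sh
# Sketch only: encodes the kernel releases reported in this ticket.

# Classify a kernel release string (defaults to the running kernel).
kernel_status() {
    release="${1:-$(uname -r)}"
    case "$release" in
        2.6.18-274.7.1.el5*)
            echo "reported affected" ;;
        2.6.18-274.3.1.el5*|2.6.18-274.17.1.el5*)
            echo "reported working" ;;
        *)
            echo "not covered by this ticket" ;;
    esac
}
```

Running `kernel_status` with no argument checks the local machine; passing a release string allows checking values collected from other hosts.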
