Bug #10418
Puppet agent hangs when listen is true and reading from /proc filesystem on redhat
| Status: | Closed | Start date: | 11/01/2011 |
|---|---|---|---|
| Priority: | Normal | Due date: | |
| Assignee: | | % Done: | 0% |
| Category: | agent | | |
| Target version: | - | | |
| Affected Puppet version: | 2.6.12 | Branch: | |
| Keywords: | enabledisable hang select proc listen redhat | | |
| Votes: | 9 | | |
Description
Mon Oct 31 23:03:31 +0000 2011 Puppet (notice): Caught TERM; calling stop
Ever since the 2.6.12 upgrade I’ve been seeing these reports reach us, on about a hundred of roughly five hundred machines. Most of the time we find that $vardir/state/puppetdlock is in place and blocking further puppet runs, which requires a manual resolution.
I wrote a quick cron script to look for puppetdlock files older than one hour, remove them, and mail me a report, and I’ve received several dozen in the last few hours. Something is clearly broken in 2.6.12; we are downgrading our systems to 2.6.11.
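For readers who want a similar check, here is a minimal Ruby sketch of the "find stale lock files" part of such a cron job. It is not the reporter's script: the helper name `stale_locks` is hypothetical, the one-hour threshold comes from the description above, and the sketch only lists stale files rather than removing them or sending mail.

```ruby
# Hypothetical helper (not from the original report): return the lock files
# whose modification time is more than max_age seconds before `now`.
def stale_locks(paths, max_age = 3600, now = Time.now)
  paths.select do |path|
    # Skip paths that do not exist; a missing lock file is the healthy case.
    File.file?(path) && (now - File.mtime(path)) > max_age
  end
end
```

A cron job could call `stale_locks(['/var/lib/puppet/state/puppetdlock'])` (the path assumes vardir is /var/lib/puppet, as in the listings later in this thread) and report whatever comes back before deciding whether to delete anything.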
No— I have no other information than that it crosses all of our machine types, and we have had no significant changes in our modules in this time period. Many of the machines which have failed have had zero module or manifest changes which would apply to them. I cannot get this to replicate on the command line.
History
Updated by Jo Rhett 7 months ago
So we have found some consistency in the systems which are affected. Certain classes of hosts are more often affected than others. Very oddly, this class of servers is one of the classes of hosts where the fewest classes are applied — and every class applied to them is applied to hundreds of other hosts in our environment!
So I logged into a host which hadn’t checked in for a while, and found that it seems to be stuck within a single loop, whereas the puppet processes on systems running normally have more variance in their strace output. Here’s a quick example:
Healthy system:
select(9, [4 7], [], [], {1, 999999}) = 0 (Timeout)
select(9, [4 7], [], [], {0, 698}) = 0 (Timeout)
select(9, [4 7], [], [], {0, 0}) = 0 (Timeout)
select(6, [4], [], [], {0, 0}) = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
select(9, [4 7], [], [], {1, 999999}) = 0 (Timeout)
select(9, [4 7], [], [], {0, 0}) = 0 (Timeout)
select(6, [4], [], [], {0, 0}) = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
select(9, [4 7], [], [], {1, 999999}) = 0 (Timeout)
select(9, [4 7], [], [], {0, 0}) = 0 (Timeout)
select(6, [4], [], [], {0, 0}) = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
select(9, [4 7], [], [], {0, 465482}) = 0 (Timeout)
select(9, [4 7], [], [], {0, 0}) = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
select(9, [7], [], [], {0, 0}) = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
So it is definitely looping through a couple of selects, but the calls vary a bit and it goes into different functions at times, like checking /etc/puppet/puppet.conf.
On the systems which are hung, the process stays forever within a single loop without variance:
select(9, [7 8], [], [], {1, 999999}) = 0 (Timeout)
select(9, [7 8], [], [], {0, 88}) = 0 (Timeout)
select(9, [7 8], [], [], {0, 0}) = 0 (Timeout)
select(9, [8], [], [], {0, 0}) = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
select(9, [7 8], [], [], {1, 999999}) = 0 (Timeout)
select(9, [7 8], [], [], {0, 0}) = 0 (Timeout)
select(9, [8], [], [], {0, 0}) = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
select(9, [7 8], [], [], {1, 999999}) = 0 (Timeout)
select(9, [7 8], [], [], {0, 85}) = 0 (Timeout)
select(9, [7 8], [], [], {0, 0}) = 0 (Timeout)
select(9, [8], [], [], {0, 0}) = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
select(9, [7 8], [], [], {1, 999999}) = 0 (Timeout)
select(9, [7 8], [], [], {0, 113}) = 0 (Timeout)
select(9, [7 8], [], [], {0, 0}) = 0 (Timeout)
select(9, [8], [], [], {0, 0}) = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
select(9, [7 8], [], [], {1, 999998}) = 0 (Timeout)
select(9, [7 8], [], [], {0, 56}) = 0 (Timeout)
select(9, [7 8], [], [], {0, 0}) = 0 (Timeout)
select(9, [8], [], [], {0, 0}) = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
These two systems are both running puppet 2.6.12 on CentOS 5.5.
Updated by Jo Rhett 7 months ago
More info: not all systems have puppetdlock files. However, they always seem to stop after a run. They even accept kicks, but do nothing with them. Here’s an example of “grep puppet /var/log/messages”
Nov 2 01:55:59 us0101acdc008 puppet-agent[143582]: (/File[/local/tomcat/webapps/abregistrar/WEB-INF/property-configurer.xml]) Filebucketed /local/tomcat/webapps/abregistrar/WEB-INF/property-configurer.xml to puppet with sum 83231d183f9d40f0ed44db880504a3f4
Nov 2 01:55:59 us0101acdc008 puppet-agent[143582]: (/File[/local/tomcat/webapps/abregistrar/WEB-INF/property-configurer.xml]/content) content changed '{md5}83231d183f9d40f0ed44db880504a3f4' to '{md5}1a4d31b2b0df03846e51017226a7ead6'
Nov 2 01:55:59 us0101acdc008 puppet-agent[143582]: (/Stage[main]/Webapps::Deploy/File[webinf]) Scheduling refresh of Exec[start-tomcat]
Nov 2 01:55:59 us0101acdc008 puppet-agent[143582]: (/Stage[main]/Webapps::Deploy/Exec[start-tomcat]/returns) executed successfully
Nov 2 01:55:59 us0101acdc008 puppet-agent[143582]: (/Stage[main]/Webapps::Deploy/Exec[start-tomcat]) Triggered 'refresh' from 127 events
Nov 2 01:56:14 us0101acdc008 puppet-agent[143582]: Finished catalog run in 91.72 seconds
Nov 2 03:26:31 us0101acdc008 puppet-agent[90299]: triggered run
[04:23 root@us0101acdc008 ~]$
As you can see, it observed the kick request but did nothing about it. System was bored silly in the same period:
[04:26 root@us0101acdc008 ~]$ sar
Linux 2.6.18-274.7.1.el5 (us0101acdc008.tangome.gbl) 11/02/2011
12:00:01 AM CPU %user %nice %system %iowait %steal %idle
12:10:01 AM all 1.25 0.00 0.79 0.00 0.00 97.96
12:20:01 AM all 1.36 0.00 0.83 0.00 0.00 97.81
12:30:01 AM all 2.32 0.00 0.90 0.00 0.00 96.79
12:40:01 AM all 1.48 0.00 0.92 0.00 0.00 97.60
12:50:01 AM all 1.34 0.00 0.84 0.00 0.00 97.82
01:00:01 AM all 1.24 0.00 0.78 0.00 0.00 97.98
01:10:01 AM all 1.31 0.00 0.82 0.00 0.00 97.87
01:20:01 AM all 1.18 0.00 0.71 0.00 0.00 98.11
01:30:01 AM all 1.27 0.00 0.89 0.00 0.00 97.84
01:40:01 AM all 1.09 0.00 0.66 0.00 0.00 98.25
01:50:01 AM all 0.91 0.00 0.59 0.00 0.00 98.50
02:00:01 AM all 4.82 0.00 0.50 0.03 0.00 94.65
02:10:01 AM all 0.14 0.00 0.16 0.00 0.00 99.70
02:20:01 AM all 0.12 0.00 0.17 0.00 0.00 99.72
02:30:01 AM all 1.22 0.00 0.42 0.00 0.00 98.35
02:40:01 AM all 0.87 0.00 0.57 0.00 0.00 98.56
02:50:01 AM all 0.78 0.00 0.53 0.00 0.00 98.69
03:00:01 AM all 0.65 0.00 0.51 0.00 0.00 98.84
03:10:01 AM all 0.69 0.00 0.47 0.00 0.00 98.84
03:20:01 AM all 0.78 0.00 0.53 0.00 0.00 98.69
03:30:01 AM all 0.83 0.00 0.58 0.00 0.00 98.59
03:40:01 AM all 0.84 0.00 0.57 0.00 0.00 98.59
03:50:01 AM all 0.58 0.00 0.43 0.00 0.00 98.99
04:00:01 AM all 0.65 0.00 0.45 0.00 0.00 98.90
04:10:01 AM all 0.60 0.00 0.42 0.14 0.00 98.83
04:20:01 AM all 1.20 0.00 0.40 0.00 0.00 98.39
Average: all 1.13 0.00 0.59 0.01 0.00 98.26
Updated by Jo Rhett 7 months ago
I’d like to send you an strace debug log, but I need an SSL-secure (or better) way to send the file to you, and assurances that the file will not be publicly shared. I have a single host where I can easily replicate the problem at will: I stop puppet and restart the puppet client. It creates the puppetdlock file but never contacts the puppet master to download a catalog.
Updated by James Turnbull 7 months ago
- Status changed from Unreviewed to Investigating
Jo – email to james@lovedthanlost,net (my GPG/PGP key – http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x215AFE50E4147032)
Updated by Jo Rhett 7 months ago
I just mailed the files. Note that I have the system standing by for other tests as needed today, but if we can’t fix it then I’ll be forced to roll back to 2.6.11 tonight.
Updated by James Turnbull 7 months ago
Jo – I can’t promise we’ll be able to look at this today.
Updated by James Turnbull 7 months ago
And you are sure the only change is the 2.6.12 upgrade?
Updated by Jo Rhett 7 months ago
I understand on timing, just letting you know my timing. I did try changing the passenger setup last night per my last e-mail to the list, and it didn’t affect the client problems but managed to kill the box. We’re now running the original/stock passenger definition again and it’s sane/working but the client problem remains unaffected all through this.
We did not see this problem happen until 2.6.12 was pushed out to clients. We’re observing them via a NRPE check on state/last_run_summary.yaml and that’s when Nagios started to see problems. Oddly enough, it’s happening the most to a certain set of machines … but those machines have the least amount of classes applied (ie, just the base classes in node default that everyone inherits)
The only factor which might be related but I can’t correlate is that these systems have a higher number of open TCP sessions than most other systems … but not all. I’m trying to validate any consistency on those metrics to see if they correlate.
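A check of the kind described above (watching state/last_run_summary.yaml for agents that have stopped running) might look roughly like the Ruby sketch below. This is not the NRPE check used by the reporter. The helper name `seconds_since_last_run` is hypothetical, and the sketch assumes the summary file carries a top-level `time` hash with a `last_run` epoch timestamp; verify that against the file your Puppet version writes before relying on it.

```ruby
require 'yaml'

# Hypothetical helper: number of seconds between `now` and the last run
# recorded in a summary file. Returns nil when the expected keys are absent.
# Assumption: the file has the shape  time: { last_run: <epoch seconds> }.
def seconds_since_last_run(summary_path, now = Time.now)
  summary = YAML.load_file(summary_path)
  times = summary.is_a?(Hash) ? summary['time'] : nil
  last_run = times.is_a?(Hash) ? times['last_run'] : nil
  return nil if last_run.nil?
  now.to_i - last_run.to_i
end
```

A monitoring wrapper would compare the returned age against a threshold somewhat larger than the agent's run interval and raise an alert when it is exceeded or when nil comes back.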
Updated by Jo Rhett 7 months ago
Any IRC or chat channel I can go back and forth with someone? Is that possible? Trying hard to track this down, and I have an active system that won’t replay no matter what… I’m “jorhett” on every IM service.
Updated by James Turnbull 7 months ago
Jo – I am on #puppet on Freenode – jamesturnbull
Updated by James Turnbull 7 months ago
Can you show me the logs on the agent around the message you’re receiving? Is it possible to run the agent in --verbose --debug --trace mode and replicate the failure?
Updated by James Turnbull 7 months ago
- Keywords set to enabledisable
Updated by Jo Rhett 7 months ago
Here’s the agent output when run with debug verbose trace but not --onetime:
$ /usr/sbin/puppetd --server=puppetmaster --logdest=syslog --debug --verbose --trace
debug: Puppet::Type::User::ProviderPw: file pw does not exist
debug: Puppet::Type::User::ProviderUser_role_add: file roleadd does not exist
debug: Puppet::Type::User::ProviderDirectoryservice: file /usr/bin/dscl does not exist
debug: Failed to load library 'ldap' for feature 'ldap'
debug: Puppet::Type::User::ProviderLdap: feature ldap is missing
debug: Failed to load library 'rubygems' for feature 'rubygems'
debug: Puppet::Type::File::ProviderMicrosoft_windows: feature microsoft_windows is missing
debug: /File[/var/lib/puppet/ssl/private]: Autorequiring File[/var/lib/puppet/ssl]
debug: /File[/var/lib/puppet/log/http.log]: Autorequiring File[/var/lib/puppet/log]
debug: /File[/var/lib/puppet/ssl]: Autorequiring File[/var/lib/puppet]
debug: /File[/var/lib/puppet/client_yaml]: Autorequiring File[/var/lib/puppet]
debug: /File[/var/lib/puppet/ssl/certs/us0101acm017.tangome.gbl.pem]: Autorequiring File[/var/lib/puppet/ssl/certs]
debug: /File[/etc/puppet/puppet.conf]: Autorequiring File[/etc/puppet]
debug: /File[/var/lib/puppet/state/last_run_summary.yaml]: Autorequiring File[/var/lib/puppet/state]
debug: /File[/etc/puppet/namespaceauth.conf]: Autorequiring File[/etc/puppet]
debug: /File[/var/lib/puppet/ssl/public_keys/us0101acm017.tangome.gbl.pem]: Autorequiring File[/var/lib/puppet/ssl/public_keys]
debug: /File[/var/lib/puppet/lib]: Autorequiring File[/var/lib/puppet]
debug: /File[/var/lib/puppet/ssl/public_keys]: Autorequiring File[/var/lib/puppet/ssl]
debug: /File[/var/lib/puppet/state/state.yaml]: Autorequiring File[/var/lib/puppet/state]
debug: /File[/var/lib/puppet/log]: Autorequiring File[/var/lib/puppet]
debug: /File[/var/lib/puppet/ssl/certs]: Autorequiring File[/var/lib/puppet/ssl]
debug: /File[/var/lib/puppet/classes.txt]: Autorequiring File[/var/lib/puppet]
debug: /File[/var/lib/puppet/client_data]: Autorequiring File[/var/lib/puppet]
debug: /File[/var/lib/puppet/state]: Autorequiring File[/var/lib/puppet]
debug: /File[/var/lib/puppet/facts]: Autorequiring File[/var/lib/puppet]
debug: /File[/var/lib/puppet/ssl/crl.pem]: Autorequiring File[/var/lib/puppet/ssl]
debug: /File[/var/lib/puppet/clientbucket]: Autorequiring File[/var/lib/puppet]
debug: /File[/var/lib/puppet/ssl/certs/ca.pem]: Autorequiring File[/var/lib/puppet/ssl/certs]
debug: /File[/var/lib/puppet/state/graphs]: Autorequiring File[/var/lib/puppet/state]
debug: /File[/var/lib/puppet/ssl/private_keys/us0101acm017.tangome.gbl.pem]: Autorequiring File[/var/lib/puppet/ssl/private_keys]
debug: /File[/var/lib/puppet/state/last_run_report.yaml]: Autorequiring File[/var/lib/puppet/state]
debug: /File[/var/lib/puppet/ssl/private_keys]: Autorequiring File[/var/lib/puppet/ssl]
debug: /File[/var/lib/puppet/ssl/certificate_requests]: Autorequiring File[/var/lib/puppet/ssl]
debug: Finishing transaction 23761369257900
Updated by Jo Rhett 7 months ago
Here’s another strace. It’s the very end of the output of “strace puppet agent --enable”. What it shows is puppet failing to remove the puppetdlock file. There is something wrong in this piece of code:
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
stat("/var/lib/puppet/state/puppetdlock", {st_mode=S_IFREG|0644, st_size=5, ...}) = 0
open("/var/lib/puppet/state/puppetdlock", O_RDONLY) = 6
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
fstat(6, {st_mode=S_IFREG|0644, st_size=5, ...}) = 0
fstat(6, {st_mode=S_IFREG|0644, st_size=5, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2ad97fefe000
lseek(6, 0, SEEK_CUR) = 0
read(6, "13115", 4096) = 5
read(6, "", 4096) = 0
close(6) = 0
munmap(0x2ad97fefe000, 4096) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
stat("/var/lib/puppet/state/puppetdlock", {st_mode=S_IFREG|0644, st_size=5, ...}) = 0
open("/var/lib/puppet/state/puppetdlock", O_RDONLY) = 6
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
fstat(6, {st_mode=S_IFREG|0644, st_size=5, ...}) = 0
fstat(6, {st_mode=S_IFREG|0644, st_size=5, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2ad97fefe000
lseek(6, 0, SEEK_CUR) = 0
read(6, "13115", 4096) = 5
read(6, "", 4096) = 0
close(6) = 0
munmap(0x2ad97fefe000, 4096) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigaction(SIGINT, {SIG_DFL, [INT], SA_RESTORER|SA_RESTART, 0x301ac302d0}, {0x36b8c8ccf0, [INT], SA_RESTORER|SA_RESTART, 0x301ac302d0}, 8) = 0
close(4) = 0
munmap(0x2ad97f1a6000, 4096) = 0
close(3) = 0
munmap(0x2ad97f1a5000, 4096) = 0
exit_group(0) = ?
$ ls /var/lib/puppet/state/
graphs last_run_report.yaml last_run_summary.yaml puppetdlock state.yaml
$ cat /var/lib/puppet/state/puppetdlock
13115
(same as it was before — the running daemon’s pid)
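Because the lock file holds a pid, one quick way to tell a stale lock from a live one is to ask the kernel whether that pid still exists. The sketch below is illustrative only and is not Puppet's own lock handling; the helper name `lock_holder_alive?` is hypothetical.

```ruby
# Hypothetical helper: report whether the pid stored in a lock file belongs
# to a running process. Returns nil when the file is missing or holds no pid.
def lock_holder_alive?(lock_path)
  return nil unless File.file?(lock_path)
  text = File.read(lock_path).strip
  return nil unless text =~ /\A\d+\z/
  begin
    Process.kill(0, text.to_i) # signal 0 delivers nothing; it only tests existence
    true
  rescue Errno::ESRCH
    false                      # no such process, so the lock is stale
  rescue Errno::EPERM
    true                       # process exists but is owned by another user
  end
end
```

In the situation shown above the result would be true, since 13115 is the running daemon; the lock is held by a live but hung process, which is why checking the pid alone does not detect this failure.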
Updated by Jo Rhett 7 months ago
Just FYI, it’s a mix of CentOS 5.5 to 5.7, but all systems have these versions of the following:
libselinux-ruby-1.33.4-5.7.el5
ruby-libs-1.8.5-19.el5_6.1
ruby-shadow-1.4.1-7.el5
ruby-1.8.5-19.el5_6.1
ruby-augeas-0.4.1-1.el5
facter-1.6.2-1.el5
puppet-2.6.12-2.el5
The last two were downloaded from yum.puppetlabs.com, the previous ones are all from centos or epel.
Updated by Jason Smith 7 months ago
I also observed the exact same behavior, right down to the identical strace for the hung puppet daemon. When looking at lsof and the strace together I see that puppet is waiting on the puppet agent listening port (as expected) and a read fd open to a /proc path, often /proc/cpuinfo, but not always. At first I also thought it was the updated puppet version 2.6.12, but after testing several combinations, I think I narrowed it down to the RHEL5.7 kernel version (2.6.18-274.7.1.el5). Any system using this kernel, no matter what puppet version I try, always hangs. If I reboot a system with a hung puppet daemon into an earlier RHEL5.7 kernel then puppet starts to work again. Note, the supposed bad RHEL5.7 kernel was released just a few days before the most recent puppet security update, on October 20th, see: RHSA-2011-1386. Could puppet be hung waiting to read info from /proc and this kernel has a bug somewhere in /proc? I also tried searching RedHat’s bugzilla and didn’t see any obvious related bugs yet, but it has only been 2 weeks since the kernel was released.
Updated by Jo Rhett 7 months ago
Oh sweet jesus, yeah that timeline matches exactly with the explosion of problem reports because the class of systems which are all having this problem were rebooted shortly after the puppet upgrade so would have restarted with that kernel.
Updated by Jason Smith 7 months ago
FYI, I just opened a ticket in RedHat’s bugzilla #751214.
Updated by Jo Rhett 7 months ago
I’ve rebooted a problematic system after reverting to kernel 2.6.18-274.3.1.el5 and confirmed that the puppet daemon locking problem has disappeared.
Updated by Todd Zullinger 6 months ago
I’ve seen this same issue and also found that reverting to a previous kernel fixed it. I do have listen = true in the puppet config, I’ve not tried disabling that to see if it affects things (someone on the list mentioned that possibility).
Unfortunately, the RHEL bug is not accessible to me (but hey, I’m only logged in with my credentials as an EPEL puppet maintainer ;). Any chance to get that bug opened up to the public or add me to the Cc list?
Updated by Jo Rhett 6 months ago
So apparently the listen bit is a factor, according to discussion in that bug. Apparently when ruby has only a single file open, it just opens the file in /proc which works fine. But if more than one file will be open (ie, the listen socket plus the /proc file) then ruby uses select() on the file instead which is what trips this bug.
So yeah, disabling listen might relieve the symptoms a bit.
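The access pattern described above, waiting on a file with select() before reading it, can be sketched in a few lines of Ruby. This is an illustration of the pattern only, not the reproducer attached to the Red Hat bug and not Puppet or Facter code; the helper name `read_via_select` and the two-second timeout are assumptions made for the example.

```ruby
# Hypothetical helper: wait until `path` is reported readable by select(),
# then read it. Returns the file contents, or nil if select() times out.
def read_via_select(path, timeout = 2)
  File.open(path) do |io|
    ready = IO.select([io], nil, nil, timeout)
    # IO.select returns nil on timeout. On a healthy kernel a regular or
    # /proc file should be reported readable straight away.
    return nil if ready.nil?
    io.read
  end
end
```

Calling `read_via_select('/proc/uptime')` on a healthy system should return the file contents; a nil result on a readable /proc file would be consistent with the hang discussed in this thread, where select() keeps timing out.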
I’ll add you to the bug if I can. Erm, no. Yeah, they locked the bug down to just redhat internal plus reporter and CC list, and I don’t seem to be able to edit the bug. Good news is that they tested a new kernel this morning that appears to fix the problem: (remainder is copy/paste from there)
David Howells 2011-11-16 05:14:07 EST
I’ve put a test kernel with the patch applied for download at:
http://people.redhat.com/~dhowells/.067ac120438d738257e2a305a3ddac64/kernel-2.6.18-298.el5.bz751214.1.x86_64.rpm
Comment 31 Dmitry Zamaruev 2011-11-16 05:53:06 EST
I could confirm that with given kernel test passes:
[root@app ~]# uname -r
2.6.18-298.el5.bz751214.1
[root@app ~]# ./test /proc/uptime
153.99 137.31
And Ruby applications (Chef/Shef in my case) which hung on 274.7.1 – works as expected on this kernel.
Updated by Josh Cooper 5 months ago
- Subject changed from "Caught TERM; calling stop" with state/puppetdlock left in place to Puppet agent hangs when listen is true and reading from /proc filesystem on redhat
- Keywords changed from enabledisable to enabledisable hang select proc listen redhat
Updated by XiangJun Wu 5 months ago
Does CentOS6.1 fix it?
Updated by Peter Meier 5 months ago
XiangJun Wu wrote:
Does CentOS6.1 fix it?
No. According to the RHEL Bugtracker this will be fixed in kernel-2.6.18-299.el5 , which will then be taken up by CentOS.
Updated by XiangJun Wu 5 months ago
Hopefully, CentOS 6.2 will include the fix.
Peter Meier wrote:
XiangJun Wu wrote:
Does CentOS6.1 fix it?
No. According to the RHEL Bugtracker this will be fixed in kernel-2.6.18-299.el5 , which will then be taken up by CentOS.
Updated by Jason Smith 5 months ago
I don’t know about anyone else, but I have not seen this problem in any of the RHEL6 kernels, neither 6.1 nor 6.2. Note, the RHEL6 kernel (based on 2.6.32) is completely different than the rhel5 one (based on 2.6.18). As Peter said, for the RHEL5 kernels, the bugzilla ticket says that RedHat has verified that they have fixed it in their 2.6.18-299 kernel, so I would assume any RHEL5 kernel version greater than that would have the fix. Hopefully this is almost done going through RedHat’s QA and will be released soon, either in a RHEL5.7 errata or possibly the soon to be released 5.8.
XiangJun Wu wrote:
Hopefully, CentOS6.2 will include fix. Peter Meier wrote:
XiangJun Wu wrote:
Does CentOS6.1 fix it?
No. According to the RHEL Bugtracker this will be fixed in kernel-2.6.18-299.el5 , which will then be taken up by CentOS.
Updated by Peter Meier 5 months ago
Jason Smith wrote:
I don’t know about anyone else, but I have not seen this problem in any of the RHEL6 kernels, neither 6.1 nor 6.2. Note, the RHEL6 kernel (based on 2.6.32) is completely different than the rhel5 one (based on 2.6.18). As Peter said, for the RHEL5 kernels, the bugzilla ticket says that RedHat has verified that they have fixed it in their 2.6.18-299 kernel, so I would assume any RHEL5 kernel version greater than that would have the fix. Hopefully this is almost done going through RedHat’s QA and will be released soon, either in a RHEL5.7 errata or possibly the soon to be released 5.8.
Ah, good point that this is about RHEL5 :)
Updated by Corey Osman 4 months ago
This also affects RHEL4.
Updated by Michael Stahnke 4 months ago
I’d love to know if kernels from https://rhn.redhat.com/errata/RHSA-2012-0007.html fix the issue. I haven’t had a chance to throw them onto our test environment yet.
Updated by Jo Rhett 4 months ago
Michael Stahnke wrote:
I’d love to know if kernels from https://rhn.redhat.com/errata/RHSA-2012-0007.html fix the issue. I haven’t had a chance to throw them onto our test environment yet.
I don’t see any mention of the redhat bug :(
Updated by Andrew Beresford 4 months ago
Jo Rhett wrote:
Michael Stahnke wrote:
I’d love to know if kernels from https://rhn.redhat.com/errata/RHSA-2012-0007.html fix the issue. I haven’t had a chance to throw them onto our test environment yet.
I don’t see any mention of the redhat bug :(
Neither the 2.6.18-274.17.1 kernel in that errata nor the 2.6.18-300 currently in RHEL5 beta seem to help.
Updated by Jo Rhett 4 months ago
Andrew Beresford wrote:
Neither the 2.6.18-274.17.1 kernel in that errata nor the 2.6.18-300 currently in RHEL5 beta seem to help.
Hm. Redhat claimed otherwise in the ticket today:
A fix for the current RHEL5 minor release (5.7) was tracked through bug #755483 and is included as of kernel 2.6.18-274.17.1.el5 from http://rhn.redhat.com/errata/RHSA-2012-0007.html.
I’ll try applying the kernel and test here.
Updated by Jo Rhett 4 months ago
According to the most basic test, puppet agent appears to run cleanly and finish its job without leaving puppetdlock in the state directory. This is with listen enabled, and on a system that did not work with the 274.7.1 kernel.
$ uname -a
Linux xabbcd4 2.6.18-274.17.1.el5 #1 SMP Tue Jan 10 17:25:58 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
Jan 11 23:17:46 xabbcd4 puppet-agent[4009]: Starting Puppet client version 2.6.12
Jan 11 23:20:05 xabbcd4 puppet-agent[4009]: Finished catalog run in 131.27 seconds
$ ls /var/lib/puppet/state
graphs last_run_report.yaml last_run_summary.yaml state.yaml
Updated by Mark Chappell 4 months ago
Also working for me on Linux 2.6.18-274.17.1.el5 #1 SMP Wed Jan 4 22:45:44 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
Updated by Brian Pitts 4 months ago
The kernel 2.6.18-274.17.1.el5 also resolved the issue for me in CentOS.
Updated by Marc Cortinas Val 4 months ago
Yes, I’ve updated the kernel to 2.6.18-274.17.1.el5 and that fixed it. Thanks for your support, guys!
Updated by Patrick Otto 4 months ago
- Status changed from Investigating to Closed
- Assignee set to Patrick Otto
Seems like this is fixed with 2.6.18-274.17.1.el5