Bug #4681
Puppet 2.6.1.r2 randomly hangs on Exec in FreeBSD
| Status: | Rejected | Start date: | 09/01/2010 | |
|---|---|---|---|---|
| Priority: | Normal | Due date: | ||
| Assignee: | % Done: | 0% |
||
| Category: | FreeBSD | |||
| Target version: | - | |||
| Affected Puppet version: | 0.25.5 | Branch: | ||
| Keywords: | ||||
| Votes: | 2 |
Description
relevant info from pkg_info:
- ruby-1.8.7.248_3,1 (compiled with debug)
- puppet-2.6.1.r2
- facter-1.5.7_1
uname -a FreeBSD localhost 8.1-RELEASE FreeBSD 8.1-RELEASE #0: Wed Jul 28 11:44:21 UTC 2010 root@localhost:/usr/obj/usr/src/sys/GENERIC amd64
I did also see this problem in FreeBSD 8.0 and puppet 0.25. Puppet does not hang every time it’s run, only on random Exec statements. When debugging this I simply created an endless loop which ran the puppet configuration until it hang. Sometimes it would only run a few times before it hang and sometimes it run for hours. The manifest is quite big, but contains some sensitive stuff so I can’t upload it here, but this is probably reproducable with any manifest that uses a great deal of Exec statements.
A few examples of output when it hang with command ‘ruby18 —debug /usr/local/bin/puppet agent —test —debug —trace’:
/usr/local/lib/ruby/site_ruby/1.8/puppet/type.rb:1881: warning: instance variable @virtual not initialized /usr/local/lib/ruby/site_ruby/1.8/puppet/type.rb:1881: warning: instance variable @virtual not initialized /usr/local/lib/ruby/site_ruby/1.8/puppet/type.rb:1881: warning: instance variable @virtual not initialized debug: /Stage[main]/Webbase::Backup/Line[sudo_backup]/Exec[echo 'backup ALL= ( root ) NOPASSWD: /usr/sbin/backup' >> '/usr/local/etc/sudoers']: Executing check '/usr/bin/grep -qFx 'backup ALL= ( root ) NOPASSWD: /usr/sbin/backup' '/usr/local/etc/sudoers'' debug: Executing '/usr/bin/grep -qFx 'backup ALL= ( root ) NOPASSWD: /usr/sbin/backup' '/usr/local/etc/sudoers'' removing /tmp/puppet20100901-36732-16mgs1p-0...done [here puppet hangs]
debug: /Stage[main]/Webbase::Apache/Exec[apache_purge]: Executing check '/bin/test -e /usr/local/etc/apache/ssl.crt' debug: Executing '/bin/test -e /usr/local/etc/apache/ssl.crt' removing /tmp/puppet20100901-50535-1p6j13j-0...done [here puppet hangs]
/usr/local/lib/ruby/site_ruby/1.8/puppet/type.rb:1881: warning: instance variable @virtual not initialized debug: /Stage[main]/Webbase::Nas/Exec[/sbin/mount /proc]: Executing check '/sbin/mount | /usr/bin/egrep ^procfs' debug: Executing '/sbin/mount | /usr/bin/egrep ^procfs' removing /tmp/puppet20100901-95974-1n5jg13-0...done [here puppet hangs]
Since I have very little experience debugging stuff, and no previous experience with ruby I just hacked in a few debug statements in type/exec.rb, and the relevant code appears to be this code from the run function:
withenv environment do
Timeout::timeout(self[:timeout]) do
output, status = Puppet::Util::SUIDManager.run_and_capture(
[command], self[:user], self[:group]
)
Puppet.debug "returned from run_and_capture #{self[:timeout]}"
end
Puppet.debug "timeout ended"
# The shell returns 127 if the command is missing.
if status.exitstatus == 127
Puppet.debug "argument error"
raise ArgumentError, output
end
Puppet.debug "no argument error"
end
When puppet hangs it is after
Puppet.debug "returned from run_and_capture #{self[:timeout]}"
but before
Puppet.debug "timeout ended"
Since I don’t really have time to figure out how the timeout –> do –> end thing works I’ll just file this bug here so hopefully someone can say if this is a bug in puppet or an upstream bug in ruby. Let me know if you need any more info.
Related issues
History
Updated by Markus Roberts over 1 year ago
- Status changed from Unreviewed to Investigating
- Assignee set to Markus Roberts
- Affected Puppet version set to 0.25.5
Thanks for the excellent bug report.
Updated by Markus Roberts over 1 year ago
- Status changed from Investigating to Accepted
Looking at the code in question:
def timeout(sec, klass = nil)
return yield if sec == nil or sec.zero?
raise ThreadError, "timeout within critical session" if Thread.critical
exception = klass || Class.new(ExitException)
begin
x = Thread.current
y = Thread.start {
begin
sleep sec
rescue => e
x.raise e
else
x.raise exception, "execution expired" if x.alive?
end
}
yield sec
# return true
rescue exception => e
[snip]
ensure
if y and y.alive?
y.kill
y.join # make sure y is dead.
end
end
it appears ruby isn’t being sufficiently paranoid with regard to thread critical and we (or another library) aren’t being sufficiently careful. It’s pretty easy to construct a case in which the time-out doesn’t occur:
require "timeout"
begin
Timeout::timeout(1) {
p "foo"
sleep 200
}
rescue Timeout::Error
puts "timed-out"
end
begin
Timeout::timeout(1) {
Thread.critical = true
p "bar"
sleep 200
}
rescue Timeout::Error
puts "timed-out"
end
but that’s still short of explaining the hang as described, in that it would still require the underlying code to hang and add the additional complication (clue?) that this happen in a critical region.
The next question is probably how can run_and_capture return while still leaving things in an odd state (e.g. Thread.critical = true, dangling joins, or…)?
Updated by Markus Roberts over 1 year ago
From conversation with a site that is experiencing the problem:
When it stops, the process does not respond to -KILL, -HUP or anything; truss (like linux strace) works, and there is some activity. Connecting with gdb produces:
Loaded symbols for /libexec/ld-elf.so.1 [Switching to Thread 0x28a79be0 (LWP 100156)] 0x2819d1a7 in __error () from /lib/libthr.so.3 (gdb) backtrace #0 0x2819d1a7 in __error () from /lib/libthr.so.3 #1 0x2819cd88 in __error () from /lib/libthr.so.3 #2 0x29e02060 in ?? () #3 0x00000008 in ?? () #4 0x00000001 in ?? () #5 0x29e02040 in ?? () #6 0xbc7fbf2c in ?? () #7 0x29e01020 in ?? () #8 0x00000000 in ?? () #9 0x2819b6df in pthread_setcancelstate () from /lib/libthr.so.3 #10 0x2819af7d in pthread_cond_signal () from /lib/libthr.so.3 #11 0x280a8710 in thread_timer () from /usr/local/lib/libruby18.so.18 #12 0x281926ff in pthread_getprio () from /lib/libthr.so.3 #13 0xbc6fafec in ?? () ### calling rb_raise.... (gdb) call (void)rb_raise((int)rb_eException, "raising an exception") /usr/local/lib/ruby/1.8/timeout.rb:64: [BUG] cross-thread violation on rb_thread_schedule() ruby 1.8.7 (2009-12-24 patchlevel 248) [i386-freebsd7]
Updated by Markus Roberts about 1 year ago
- Status changed from Accepted to Needs More Information
This appears to be a ruby bug (possibly triggered by something in puppet); either upgrading the ruby18 port to 1.8.7.302 from 1.8.7.284/285 or puppet from 0.25.5 to 2.6.2 seems to eliminate the symptoms to a reasonable if not conclusive statistical confidence level.
If anyone else experience the problem please note it on this ticket; otherwise, we can probably consider it resolved.
Updated by Nigel Kersten about 1 year ago
- Status changed from Needs More Information to Rejected
Updated by Lasse Brandt about 1 year ago
I experienced the same problems with puppet back when this ticket was created. Puppets hangs randomly.
I haven’t tested with puppet lately, but I had similar problems with other ruby apps on freebsd. Seem to be fixed in ruby 1.8.7.302 ( http://www.freshports.org/lang/ruby18/ ). ( And never had any problems with ruby 1.9 )