Bug #4681

Puppet 2.6.1.r2 randomly hangs on Exec in FreeBSD

Added by Fredrik Eriksson over 1 year ago. Updated 3 months ago.

Status:Rejected Start date:09/01/2010
Priority:Normal Due date:
Assignee:Markus Roberts % Done:

0%

Category:FreeBSD
Target version:-
Affected Puppet version:0.25.5 Branch:
Keywords:
Votes: 2

Description

relevant info from pkg_info:

  • ruby-1.8.7.248_3,1 (compiled with debug)
  • puppet-2.6.1.r2
  • facter-1.5.7_1
uname -a
FreeBSD localhost 8.1-RELEASE FreeBSD 8.1-RELEASE #0: Wed Jul 28 11:44:21 UTC 2010     root@localhost:/usr/obj/usr/src/sys/GENERIC  amd64

I did also see this problem in FreeBSD 8.0 and puppet 0.25. Puppet does not hang every time it’s run, only on random Exec statements. When debugging this I simply created an endless loop which ran the puppet configuration until it hang. Sometimes it would only run a few times before it hang and sometimes it run for hours. The manifest is quite big, but contains some sensitive stuff so I can’t upload it here, but this is probably reproducable with any manifest that uses a great deal of Exec statements.

A few examples of output when it hang with command ‘ruby18 —debug /usr/local/bin/puppet agent —test —debug —trace’:

/usr/local/lib/ruby/site_ruby/1.8/puppet/type.rb:1881: warning: instance variable @virtual not initialized
/usr/local/lib/ruby/site_ruby/1.8/puppet/type.rb:1881: warning: instance variable @virtual not initialized
/usr/local/lib/ruby/site_ruby/1.8/puppet/type.rb:1881: warning: instance variable @virtual not initialized
debug: /Stage[main]/Webbase::Backup/Line[sudo_backup]/Exec[echo 'backup  ALL= ( root )   NOPASSWD: /usr/sbin/backup' >> '/usr/local/etc/sudoers']: Executing check '/usr/bin/grep -qFx 'backup     ALL= ( root )   NOPASSWD: /usr/sbin/backup' '/usr/local/etc/sudoers''
debug: Executing '/usr/bin/grep -qFx 'backup     ALL= ( root )   NOPASSWD: /usr/sbin/backup' '/usr/local/etc/sudoers''
removing /tmp/puppet20100901-36732-16mgs1p-0...done
[here puppet hangs]
debug: /Stage[main]/Webbase::Apache/Exec[apache_purge]: Executing check '/bin/test -e /usr/local/etc/apache/ssl.crt'
debug: Executing '/bin/test -e /usr/local/etc/apache/ssl.crt'
removing /tmp/puppet20100901-50535-1p6j13j-0...done
[here puppet hangs]
/usr/local/lib/ruby/site_ruby/1.8/puppet/type.rb:1881: warning: instance variable @virtual not initialized
debug: /Stage[main]/Webbase::Nas/Exec[/sbin/mount /proc]: Executing check '/sbin/mount | /usr/bin/egrep ^procfs'
debug: Executing '/sbin/mount | /usr/bin/egrep ^procfs'
removing /tmp/puppet20100901-95974-1n5jg13-0...done 
[here puppet hangs]

Since I have very little experience debugging stuff, and no previous experience with ruby I just hacked in a few debug statements in type/exec.rb, and the relevant code appears to be this code from the run function:

          withenv environment do
            Timeout::timeout(self[:timeout]) do
              output, status = Puppet::Util::SUIDManager.run_and_capture(
                [command], self[:user], self[:group]
              )
              Puppet.debug "returned from run_and_capture #{self[:timeout]}"
            end
            Puppet.debug "timeout ended"
            # The shell returns 127 if the command is missing.
            if status.exitstatus == 127
              Puppet.debug "argument error"
              raise ArgumentError, output
            end
            Puppet.debug "no argument error"
          end

When puppet hangs it is after

Puppet.debug "returned from run_and_capture #{self[:timeout]}"

but before

Puppet.debug "timeout ended"

Since I don’t really have time to figure out how the timeout –> do –> end thing works I’ll just file this bug here so hopefully someone can say if this is a bug in puppet or an upstream bug in ruby. Let me know if you need any more info.


Related issues

related to Puppet - Bug #5139: puppetdlock file can be empty Accepted 10/28/2010

History

Updated by Markus Roberts over 1 year ago

  • Status changed from Unreviewed to Investigating
  • Assignee set to Markus Roberts
  • Affected Puppet version set to 0.25.5

Thanks for the excellent bug report.

Updated by Markus Roberts over 1 year ago

  • Status changed from Investigating to Accepted

Looking at the code in question:

  def timeout(sec, klass = nil)
    return yield if sec == nil or sec.zero?
    raise ThreadError, "timeout within critical session" if Thread.critical
    exception = klass || Class.new(ExitException)
    begin
      x = Thread.current
      y = Thread.start {
        begin
          sleep sec
        rescue => e
          x.raise e
        else
          x.raise exception, "execution expired" if x.alive?
        end
      }
      yield sec
      #    return true
    rescue exception => e
      [snip]
    ensure
      if y and y.alive?
        y.kill
        y.join # make sure y is dead.
      end
    end

it appears ruby isn’t being sufficiently paranoid with regard to thread critical and we (or another library) aren’t being sufficiently careful. It’s pretty easy to construct a case in which the time-out doesn’t occur:

require "timeout"

begin
  Timeout::timeout(1) {
    p "foo"
    sleep 200
  }
rescue Timeout::Error
  puts "timed-out"
end

begin
  Timeout::timeout(1) {
    Thread.critical = true
    p "bar"
    sleep 200
  }
rescue Timeout::Error
  puts "timed-out"
end

but that’s still short of explaining the hang as described, in that it would still require the underlying code to hang and add the additional complication (clue?) that this happen in a critical region.

The next question is probably how can run_and_capture return while still leaving things in an odd state (e.g. Thread.critical = true, dangling joins, or…)?

Updated by Markus Roberts over 1 year ago

From conversation with a site that is experiencing the problem:

When it stops, the process does not respond to -KILL, -HUP or anything; truss (like linux strace) works, and there is some activity. Connecting with gdb produces:

Loaded symbols for /libexec/ld-elf.so.1
[Switching to Thread 0x28a79be0 (LWP 100156)]
0x2819d1a7 in __error () from /lib/libthr.so.3

(gdb) backtrace

#0  0x2819d1a7 in __error () from /lib/libthr.so.3
#1  0x2819cd88 in __error () from /lib/libthr.so.3
#2  0x29e02060 in ?? ()
#3  0x00000008 in ?? ()
#4  0x00000001 in ?? ()
#5  0x29e02040 in ?? ()
#6  0xbc7fbf2c in ?? ()
#7  0x29e01020 in ?? ()
#8  0x00000000 in ?? ()
#9  0x2819b6df in pthread_setcancelstate () from /lib/libthr.so.3
#10 0x2819af7d in pthread_cond_signal () from /lib/libthr.so.3
#11 0x280a8710 in thread_timer () from /usr/local/lib/libruby18.so.18
#12 0x281926ff in pthread_getprio () from /lib/libthr.so.3
#13 0xbc6fafec in ?? ()


### calling rb_raise.... 

(gdb) call (void)rb_raise((int)rb_eException, "raising an exception")

/usr/local/lib/ruby/1.8/timeout.rb:64: [BUG] cross-thread violation on rb_thread_schedule()
ruby 1.8.7 (2009-12-24 patchlevel 248) [i386-freebsd7]

Updated by Markus Roberts about 1 year ago

  • Status changed from Accepted to Needs More Information

This appears to be a ruby bug (possibly triggered by something in puppet); either upgrading the ruby18 port to 1.8.7.302 from 1.8.7.284/285 or puppet from 0.25.5 to 2.6.2 seems to eliminate the symptoms to a reasonable if not conclusive statistical confidence level.

If anyone else experience the problem please note it on this ticket; otherwise, we can probably consider it resolved.

Updated by Nigel Kersten about 1 year ago

  • Status changed from Needs More Information to Rejected

Updated by Lasse Brandt about 1 year ago

I experienced the same problems with puppet back when this ticket was created. Puppets hangs randomly.

I haven’t tested with puppet lately, but I had similar problems with other ruby apps on freebsd. Seem to be fixed in ruby 1.8.7.302 ( http://www.freshports.org/lang/ruby18/ ). ( And never had any problems with ruby 1.9 )

Also available in: Atom PDF