The Puppet Labs Issue Tracker has Moved: https://tickets.puppetlabs.com

This issue tracker is now in read-only archive mode and automatic ticket export has been disabled. Redmine users will need to create a new JIRA account to file tickets at https://tickets.puppetlabs.com.

Bug #12185

Puppet agents get stuck

Added by Carlos Cruz Luengo over 4 years ago. Updated about 3 years ago.

Status: Closed
Start date: 01/26/2012
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: agent
Target version: -
Affected Puppet version: 2.7.9
Branch:
Keywords: customer

Description

I am having some problems with my Puppet installation. I have one Puppet instance in master mode running with:

  • Foreman as web interface.
  • Storeconfigs running.
  • Passenger Apache module.

I have 25 nodes running the Puppet agent every 5 minutes (runinterval=300) and sending their reports to the master node. Everything is fine, but suddenly (and arbitrarily) the puppet agent stops running: I left all nodes working yesterday, and today 20 of them had stopped asking for their catalogues, some 1 hour ago, others 4 hours ago, etc. The process is still running, and when I run a ‘puppet kick’ from the master node they start working again. Is something wrong in my config? (I am running 2.7.9-1puppetlabs3 on both master and agent nodes.)

*** puppet.conf in master node ***
[main]
logdir=/var/log/puppet
vardir=/var/lib/puppet
ssldir=/var/lib/puppet/ssl
rundir=/var/run/puppet
factpath=$vardir/lib/facter
templatedir=$confdir/templates
certname=puppet.YYYYY.com
node_name=cert
server=puppet.YYYYY.com

[master]
ssl_client_header = SSL_CLIENT_S_DN
ssl_client_verify_header = SSL_CLIENT_VERIFY
storeconfigs=true
dbadapter=mysql
dbuser=puppet
dbname=foreman
dbpassword=XXXXX
dbserver=localhost
dbsocket=/var/run/mysqld/mysqld.sock
reports=log, foreman
node_terminus = exec
external_nodes = /etc/puppet/scripts/external_node.rb
allow_duplicate_certs = true

[agent]
report = true
listen = true
runinterval = 300

*** puppet.conf in agent node ***
[main]
logdir=/var/log/puppet
vardir=/var/lib/puppet
ssldir=/var/lib/puppet/ssl
rundir=/var/run/puppet
factpath=$vardir/lib/facter
templatedir=$confdir/templates
server=puppet.YYYYYY.com

[agent]
report = true
listen = true
runinterval = 300
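
For reference, a quick way to double-check which settings the running agent actually picked up from these files is --configprint (a sketch; run it on the node in question):

puppet agent --configprint runinterval
puppet agent --configprint listen
puppet agent --configprint server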

Thank you very much in advance.

07-Thu-Apr-05-2012.log - Last successful run. (115 KB) Steve Traylen, 04/12/2012 03:05 am


Related issues

Related to Puppet - Bug #2888: puppetd doesn't always cleanup lockfile properly Needs More Information 12/04/2009
Related to Puppet - Bug #10418: Puppet agent hangs when listen is true and reading from /... Closed 11/01/2011

History

#1 Updated by James Turnbull over 4 years ago

  • Description updated (diff)

#2 Updated by James Turnbull over 4 years ago

  • Status changed from Unreviewed to Needs More Information

Any log output or errors?

#3 Updated by Carlos Cruz Luengo over 4 years ago

There are no errors in the logs; it just seems that the puppet process gets stuck:

root@AS016:~# tail -f /var/log/daemon.log 
Jan 26 17:03:24 AS016 xinetd[688]: Started working: 1 available service
Jan 26 17:03:25 AS016 puppet-agent[730]: Reopening log files
Jan 26 17:03:26 AS016 puppet-agent[730]: Starting Puppet client version 2.7.9
Jan 26 17:03:35 AS016 puppet-agent[730]: Finished catalog run in 0.03 seconds
Jan 26 17:04:50 AS016 puppet-agent[730]: triggered run
Jan 26 17:05:24 AS016 puppet-agent[730]: Finished catalog run in 0.03 seconds
Jan 26 17:08:41 AS016 puppet-agent[730]: Finished catalog run in 0.02 seconds
Jan 26 17:13:46 AS016 puppet-agent[730]: Finished catalog run in 0.03 seconds
Jan 26 17:18:52 AS016 puppet-agent[730]: Finished catalog run in 0.03 seconds
Jan 26 17:23:57 AS016 puppet-agent[730]: Finished catalog run in 0.11 seconds
root@AS016:~# date
Thu Jan 26 17:40:37 CET 2012
root@AS016:~# pgrep -l puppet
730 puppet

If I trigger ‘puppet kick’ from the master node, it asks for its catalogue and starts running again normally (until it suddenly stops…):

(triggered by master node)
Jan 26 17:41:36 AS016 puppet-agent[730]: Finished catalog run in 0.03 seconds

(five minutes later, it asks for its catalogue)
Jan 26 17:46:41 AS016 puppet-agent[730]: Finished catalog run in 0.03 seconds

Am I missing something? Thank you very much!

#4 Updated by Anonymous over 4 years ago

Can you run one of your nodes with debug output turned on, using the --debug flag, and see if that sheds more light on where the agent is when it hangs?

Can you also supply details of the OS and kernel version you are seeing this on?
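
For example, something along these lines (a sketch; adjust the service name and log path to your setup):

# stop the packaged daemon, then do a single run in the foreground with debug output
service puppet stop
puppet agent --test --debug

# or leave it daemonized and send debug output to a file
puppet agent --debug --verbose --logdest /var/log/puppet/debug.log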

#5 Updated by Carlos Cruz Luengo over 4 years ago

Both master and agent nodes run Debian Squeeze (2.6.32-5-xen-amd64 kernel).

I have enabled debug and this is what I get:

Jan 28 19:57:48 AS006 puppet-agent[6265]: Loaded state in 0.00 seconds
Jan 28 19:57:51 AS006 puppet-agent[6265]: Executing 'apt-get update'
Jan 28 19:57:52 AS006 puppet-agent[6265]: catalog supports formats: b64_zlib_yaml dot marshal pson raw yaml; using pson
Jan 28 19:57:52 AS006 puppet-agent[6265]: Caching catalog for as006.preprod.sip.euskaltel.com
Jan 28 19:57:52 AS006 puppet-agent[6265]: Creating default schedules
Jan 28 19:57:52 AS006 puppet-agent[6265]: Loaded state in 0.00 seconds
Jan 28 19:57:52 AS006 puppet-agent[6265]: Applying configuration version '1327769647'
Jan 28 19:57:52 AS006 puppet-agent[6265]: (/Schedule[daily]) Skipping device resources because running on a host
Jan 28 19:57:52 AS006 puppet-agent[6265]: (/Schedule[monthly]) Skipping device resources because running on a host
Jan 28 19:57:52 AS006 puppet-agent[6265]: (/Schedule[hourly]) Skipping device resources because running on a host
Jan 28 19:57:52 AS006 puppet-agent[6265]: (/Schedule[never]) Skipping device resources because running on a host
Jan 28 19:57:52 AS006 puppet-agent[6265]: (/Schedule[weekly]) Skipping device resources because running on a host
Jan 28 19:57:52 AS006 puppet-agent[6265]: (/Schedule[puppet]) Skipping device resources because running on a host
Jan 28 19:57:52 AS006 puppet-agent[6265]: Finishing transaction 70042503419660
Jan 28 19:57:52 AS006 puppet-agent[6265]: Storing state
Jan 28 19:57:52 AS006 puppet-agent[6265]: Stored state in 0.00 seconds
Jan 28 19:57:52 AS006 puppet-agent[6265]: Finished catalog run in 0.02 seconds
Jan 28 19:57:52 AS006 puppet-agent[6265]: Value of 'preferred_serialization_format' (pson) is invalid for report, using default (marshal)
Jan 28 19:57:52 AS006 puppet-agent[6265]: report supports formats: b64_zlib_yaml marshal raw yaml; using marshal
Jan 28 20:02:52 AS006 puppet-agent[6265]: Loaded state in 0.00 seconds

The 19:57 run goes OK (and the previous ones are identical), but the 20:02 run gets stuck at the first step… any idea what it could be?

Thank you very much.

#6 Updated by Carlos Cruz Luengo over 4 years ago

I have found something interesting… On all the stalled nodes I have a /var/lib/puppet/state/puppetdlock file. I have read that this prevents puppet from running; when and why was this file created?

This is the last log message before it gets stuck:

Jan 29 04:18:39 ekt-aux-puppetagent1 puppet-agent[8061]: Loaded state in 0.00 seconds

And the file:

# ls -l /var/lib/puppet/state/puppetdlock 
-rw-r--r-- 1 root root 4 Jan 29 04:18 puppetdlock

Thanks, I keep reading!
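
A quick way to compare the lock with the running agent (a sketch; paths as above, and assuming that when the agent creates the lock for a run the file just contains its PID):

# show the PID recorded in the lock file and the PID of the running agent
cat /var/lib/puppet/state/puppetdlock; echo
pgrep -l puppet

# show when the lock was created
ls -l /var/lib/puppet/state/puppetdlock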

#7 Updated by Carlos Cruz Luengo over 4 years ago

It’s the same thing that happens here, and it was 3 years ago!

I have tried to force the lock by creating a puppetdlock file myself, but this is what I get on the next run:

Jan 29 20:56:25 ekt-aux-puppetagent2 puppet-agent[773]: Run of Puppet configuration client already in progress; skipping

It is different from the last log message I get on ‘naturally’ blocked puppet agents…

#8 Updated by Carlos Cruz Luengo over 4 years ago

Definitely, /var/lib/puppet/state/puppetdlock is not the lock of the previous run; it is the lock of the current (and blocked) run. It sets the lock and then loads the state:

Jan 29 04:18:39 ekt-aux-puppetagent1 puppet-agent[8061]: Loaded state in 0.00 seconds

There must be something wrong with the loaded state that makes it hang… any ideas?

Thanks in advance.

#9 Updated by Anonymous over 4 years ago

Carlos Cruz Luengo wrote:

Definitely, /var/lib/puppet/state/puppetdlock is not the lock of the previous run; it is the lock of the current (and blocked) run. It sets the lock and then loads the state:

[…]

There must be something wrong with the loaded state that makes it hang… any ideas?

No idea. Is there any confidential information in your manifests? Could you share the state file with us?

#10 Updated by Carlos Cruz Luengo over 4 years ago

No confidential information. I will wait until one of the nodes hangs and then share the state.yaml file.

Thanks!

#11 Updated by Carlos Cruz Luengo over 4 years ago

This is the /var/lib/puppet/state/state.yaml file of a just blocked agent node:

--- 
  "Filebucket[puppet]": 
    !ruby/sym checked: 2012-01-31 12:03:28.715136 +01:00
  "File[/var/lib/puppet]": 
    !ruby/sym checked: 2012-01-23 21:06:31.811621 +01:00
  "File[/var/lib/puppet/ssl/certs]": 
    !ruby/sym checked: 2012-01-23 21:06:31.823280 +01:00
  "File[/var/lib/puppet/client_data]": 
    !ruby/sym checked: 2012-01-23 21:06:37.702577 +01:00
    !ruby/sym synced: 2012-01-23 21:06:31.679875 +01:00
  "Class[Main]": 
    !ruby/sym checked: 2012-01-31 12:03:28.717358 +01:00
  "File[/var/run/puppet]": 
    !ruby/sym checked: 2012-01-23 21:06:31.825033 +01:00
  "File[/var/lib/puppet/ssl/public_keys/MYHOST.MYDOMAIN.pem]": 
    !ruby/sym checked: 2012-01-23 21:06:31.824179 +01:00
    !ruby/sym synced: 2012-01-23 21:06:31.678270 +01:00
  "File[/var/lib/puppet/ssl/public_keys]": 
    !ruby/sym checked: 2012-01-23 21:06:31.822391 +01:00
  "Class[Settings]": 
    !ruby/sym checked: 2012-01-31 12:03:28.710386 +01:00
  "File[/var/lib/puppet/state/graphs]": 
    !ruby/sym checked: 2012-01-23 21:06:37.703446 +01:00
    !ruby/sym synced: 2012-01-23 21:06:31.681333 +01:00
  "File[/var/run/puppet/agent.pid]": 
    !ruby/sym checked: 2012-01-23 21:06:37.700733 +01:00
  "File[/var/lib/puppet/clientbucket]": 
    !ruby/sym checked: 2012-01-23 21:06:37.705281 +01:00
    !ruby/sym synced: 2012-01-23 21:06:31.685539 +01:00
  "File[/var/lib/puppet/ssl/private_keys/MYHOST.MYDOMAIN.pem]": 
    !ruby/sym checked: 2012-01-23 21:06:31.827236 +01:00
    !ruby/sym synced: 2012-01-23 21:06:31.684125 +01:00
  "File[/var/log/puppet]": 
    !ruby/sym checked: 2012-01-23 21:06:31.817084 +01:00
  "File[/etc/puppet/puppet.conf]": 
    !ruby/sym checked: 2012-01-23 21:06:37.701629 +01:00
  "File[/var/lib/puppet/client_yaml]": 
    !ruby/sym checked: 2012-01-23 21:06:37.704396 +01:00
    !ruby/sym synced: 2012-01-23 21:06:31.682270 +01:00
  "File[/var/lib/puppet/lib]": 
    !ruby/sym checked: 2012-01-23 21:06:31.831439 +01:00
  "File[/var/lib/puppet/facts]": 
    !ruby/sym checked: 2012-01-23 21:06:31.816493 +01:00
  !ruby/sym configuration: {}
  "Stage[main]": 
    !ruby/sym checked: 2012-01-31 12:03:28.719005 +01:00
  "File[/var/lib/puppet/ssl/private]": 
    !ruby/sym checked: 2012-01-23 21:06:31.826345 +01:00
  "File[/var/lib/puppet/state]": 
    !ruby/sym checked: 2012-01-23 21:06:31.825675 +01:00
  "File[/var/lib/puppet/ssl/certificate_requests]": 
    !ruby/sym checked: 2012-01-23 21:06:31.832076 +01:00
  "File[/var/lib/puppet/ssl/private_keys]": 
    !ruby/sym checked: 2012-01-23 21:06:31.819048 +01:00
  "File[/var/lib/puppet/ssl]": 
    !ruby/sym checked: 2012-01-23 21:06:31.817843 +01:00
  "File[/etc/puppet]": 

I hope this helps… Thanks!
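
If the state file were the culprit, a safe way to rule it out would be to move it aside while the agent is stopped; the agent rebuilds it on the next run (a sketch, paths as above):

service puppet stop
mv /var/lib/puppet/state/state.yaml /var/lib/puppet/state/state.yaml.bak
service puppet start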

#12 Updated by Anonymous over 4 years ago

Darn. That helps in the sense that it says nothing is wrong or odd with the state file, but it doesn’t help work out why the process is hanging.

#13 Updated by Davide Guerri about 4 years ago

Hi all, doing an strace I found that the agent hangs polling a socket fd with select.

The following shows a “normal” puppet agent “polling”:

# strace -p 9494
Process 9494 attached - interrupt to quit
select(9, [3 7], [], [], {0, 464613})   = 0 (Timeout)
select(9, [3 7], [], [], {0, 0})        = 0 (Timeout)
stat("/etc/puppet/puppet.conf", {st_mode=S_IFREG|0644, st_size=27, ...}) = 0
select(9, [3 7], [], [], {0, 944414})   = 0 (Timeout)
select(9, [3 7], [], [], {0, 0})        = 0 (Timeout)
select(9, [3 7], [], [], {1, 999999})   = 0 (Timeout)
select(9, [3 7], [], [], {0, 0})        = 0 (Timeout)
select(9, [3 7], [], [], {1, 999999}^C <unfinished ...>
Process 9494 detached

The file descriptors correspond to the following:

# ls -la /proc/9494/fd/
total 0
dr-x------ 2 root root  0 Feb  8 14:39 .
dr-xr-xr-x 7 root root  0 Feb  8 12:50 ..
lr-x------ 1 root root 64 Feb  8 14:39 0 -> /dev/null
l-wx------ 1 root root 64 Feb  8 14:39 1 -> /dev/null
l-wx------ 1 root root 64 Feb  8 14:39 2 -> /dev/null
lr-x------ 1 root root 64 Feb  8 14:39 3 -> pipe:[268775]
l-wx------ 1 root root 64 Feb  8 14:39 4 -> pipe:[268775]
lrwx------ 1 root root 64 Feb  8 14:39 5 -> socket:[268784]
lrwx------ 1 root root 64 Feb  8 14:39 6 -> /var/lib/puppet/log/http.log
lrwx------ 1 root root 64 Feb  8 14:39 7 -> socket:[268792]

File descriptor 7 points to a TCP socket that is listening on port 8139 (used for the RESTful API, if I’m not mistaken).

# lsof | grep 268792
puppet     9494     root    7u     IPv4             268792      0t0        TCP *:8139 (LISTEN)

Here is an strace of a stuck puppet agent:

# strace -p 9494
Process 9494 attached - interrupt to quit
select(8, [7], NULL, NULL, {0, 433733}) = 0 (Timeout)
select(8, [7], NULL, NULL, {2, 0})      = 0 (Timeout)
select(8, [7], NULL, NULL, {2, 0})      = 0 (Timeout)
select(8, [7], NULL, NULL, {2, 0})      = 0 (Timeout)
select(8, [7], NULL, NULL, {2, 0}^C <unfinished ...>
Process 9494 detached

A simple

echo "" | nc localhost 8139

will un-hang the process. I guess it’s something related to the RESTful API but I don’t have enough knowledge about puppet internals to solve this issue.
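
A sketch that strings the checks above together into a small local “unstick” helper (the one-hour threshold and paths are assumptions):

#!/bin/bash
# If an agent process exists but its lock file has been sitting around for more
# than an hour, poke the listen port to wake it up (the nc trick above).
LOCK=/var/lib/puppet/state/puppetdlock
if pgrep -x puppet >/dev/null && [ -n "$(find "$LOCK" -mmin +60 2>/dev/null)" ]; then
    echo '' | nc localhost 8139
fi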

#14 Updated by Steffen Zieger about 4 years ago

Same here. Running echo "" | nc localhost 8139 will “fix” it. For now, I’ve set listen = false.

Puppet version: 2.7.11 (both master and clients)

#15 Updated by Davide Guerri about 4 years ago

I’m using puppet to install this workaround

 cron { "puppet gets stuck wa":
  command => "echo '' | nc localhost 8139",
  user => "nobody",
  minute => "*/5"
 }

I know it’s dirty but there’s no fix for this issue yet and it works very well for me ;)
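
If you would rather not manage it through Puppet, the equivalent plain cron entry would be something like this (the cron.d file name is just an example):

# /etc/cron.d/puppet-unstick
*/5 * * * * nobody echo '' | nc localhost 8139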

#16 Updated by Jason Antman about 4 years ago

As an FYI, I believe this is the same issue that I had been talking about (incorrectly, I guess) in Issue #2888

I have this happening with a number of Puppet 2.6.12 clients running on CentOS 6.2 x86_64. I’ve been cron'ing a “puppet kick” from the master for any nodes that haven’t run in over an hour, and that seems to work around the problem as well (of course, it’s not a good solution until Issue #10104 is implemented).

#17 Updated by Steve Traylen about 4 years ago

We also see the exact same symptoms with puppet 2.7.11 on RHEL6, and the netcat trick above unsticks puppetd. Will try to get some more information.

#18 Updated by Steve Traylen about 4 years ago

Have now caught a ‘puppet agent stuck’ with debugging enabled.

Puppet running with:

/usr/bin/ruby /usr/sbin/puppetd --debug --logdest /tmp/puppet.log --autoflush true

The last successful run was perfectly normal:

Thu Apr 05 19:07:06 +0200 2012 Puppet (debug): Loaded state in 0.01 seconds
Thu Apr 05 19:07:06 +0200 2012 Puppet (info): Retrieving plugin
Thu Apr 05 19:07:06 +0200 2012 /File[/var/lib/puppet/lib]/seluser (debug): Found seluser default 'system_u' for /var/lib/puppet/lib
Thu Apr 05 19:07:06 +0200 2012 /File[/var/lib/puppet/lib]/selrole (debug): Found s
...
...
Thu Apr 05 19:07:07 +0200 2012 /File[/var/lib/puppet/lib/puppet/provider/network_config/interfaces.rb]/selrange (debug): Found selrange default 's0' for /var/lib/puppet/lib/puppet/provider/network_config/interfaces.rb
Thu Apr 05 19:07:07 +0200 2012 Puppet (debug): file_metadata supports formats: b64_zlib_yaml marshal pson raw yaml; using pson
Thu Apr 05 19:07:08 +0200 2012 Puppet (debug): Finishing transaction 70060370436820

Full log attached

At this point puppet agent is stuck, the strace is as above.

Running

echo "" | nc localhost 8139

and puppet immediately starts up again with:

Thu Apr 12 11:44:36 +0200 2012 Puppet (info): Loading facts in /var/lib/puppet/lib/facter/bios_and_system.rb
Thu Apr 12 11:44:36 +0200 2012 Puppet (info): Loading facts in /var/lib/puppet/lib/facter/iptables.rb
Thu Apr 12 11:44:36 +0200 2012 Puppet (info): Loading facts in /var/lib/puppet/lib/facter/lemon.rb
...

and a normal run happens. Everything is fine from this point.

One thing I noticed is that the transaction id ‘70060370436820’ from the last good run happens to be a repeat of one from a couple of days ago. Are these not unique? Looking at the log:

Tue Apr 03 06:36:04 +0200 2012 Puppet (debug): Finishing transaction 70060370436820
Thu Apr 05 19:07:08 +0200 2012 Puppet (debug): Finishing transaction 70060370436820

But there are other repeated occurrences as well, so I guess this is normal.

#19 Updated by Dave Alden about 4 years ago

I see the current status is “Needs More Information” — what information is still needed?

#20 Updated by Steve Traylen about 4 years ago

Just to exclude something, I ran tcpdump on the puppet client's port 8139 to see whether some stray packet on the network causes the lockup. This does not seem to be the case: puppet locked up with no packets arriving.

Current idea: we only see this on a particular service configuration that restarts iptables quite frequently. I’m checking now whether puppet stopping and that are correlated… Waiting for it to happen.
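
For anyone who wants to repeat the tcpdump check above, a capture along these lines shows any traffic reaching the agent's listen port (the interface name is an assumption):

tcpdump -ni eth0 tcp port 8139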

#21 Updated by Andreas Unterkircher about 4 years ago

Current idea: we only see this on a particular service configuration that restarts iptables quite frequently. I’m checking now whether puppet stopping and that are correlated… Waiting for it to happen.

I have the same problem, but also on non-firewalled nodes. What seems strange to me is that puppet-agent cycles around an fd (8 in my example):

select(8, [7], NULL, NULL, {1, 992446}) = 0 (Timeout)
select(8, [7], NULL, NULL, {2, 0})      = 0 (Timeout)
select(8, [7], NULL, NULL, {2, 0})      = 0 (Timeout)
...

but the process has nothing open there

lr-x------ 1 root root 64 Apr 30 09:25 0 -> /dev/null
l-wx------ 1 root root 64 Apr 30 09:25 1 -> /dev/null
l-wx------ 1 root root 64 Apr 30 09:25 2 -> /dev/null
lr-x------ 1 root root 64 Apr 30 09:25 3 -> pipe:[10845151]
l-wx------ 1 root root 64 Apr 30 09:25 4 -> pipe:[10845151]
lrwx------ 1 root root 64 Apr 30 09:25 5 -> socket:[10845173]
lrwx------ 1 root root 64 Apr 30 09:25 6 -> /var/log/puppet/http.log
lrwx------ 1 root root 64 Apr 30 09:25 7 -> socket:[10845216]

#22 Updated by Gregory Etling almost 4 years ago

I have also begun to see this behavior on three of my 2.6.13 nodes. The strace output isn’t an exact match, but it is largely similar:

select(8, [7], NULL, NULL, {1, 905284}) = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
select(8, [7], NULL, NULL, {2, 0})      = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0

And the nc “fix” will successfully unstick the process.

Could this also be related to #7141 or #13000?

#23 Updated by Andreas Unterkircher almost 4 years ago

I have kept two agents running in the foreground with debugging enabled (listen=true). After their last successful run, they get stuck right after “Loaded state in”:

debug: Puppet::Type::Package::ProviderApt: Executing '/usr/bin/apt-cache policy tzdata'
debug: file_metadata supports formats: b64_zlib_yaml pson raw yaml; using pson
debug: Finishing transaction 70258316897200
debug: Storing state
debug: Stored state in 0.25 seconds
notice: Finished catalog run in 29.88 seconds
debug: Loaded state in 0.14 seconds
debug: Puppet::Type::Package::ProviderApt: Executing '/usr/bin/apt-cache policy tzdata'
debug: file_metadata supports formats: b64_zlib_yaml pson raw yaml; using pson
debug: Finishing transaction 69976307883480
debug: Storing state
debug: Stored state in 0.19 seconds
notice: Finished catalog run in 29.26 seconds
debug: Loaded state in 0.08 seconds

As soon as I telnet to port 8139, it continues with:

debug: catalog supports formats: b64_zlib_yaml dot pson raw yaml; using pson
info: Caching catalog for somehost

#24 Updated by James Turnbull almost 4 years ago

  • Category set to agent
  • Status changed from Needs More Information to Accepted

#25 Updated by Jo Rhett almost 4 years ago

removed

Wrong ticket, my bad.

#26 Updated by konrad rzentarzewski almost 4 years ago

Maybe related: about 10% of my puppet agents started acting strangely (delayed runs) today; we hadn’t observed this earlier. The puppetd daemon is restarted nightly to avoid leaks.

Jul  2 05:52:03 pciuch puppet-agent[5451]: Starting Puppet client version 2.7.17
Jul  2 06:14:09 pciuch puppet-agent[5451]: Applying configuration version '20120701110111-r21129-x@polly'
Jul  2 06:14:59 pciuch puppet-agent[5451]: Finished catalog run in 49.46 seconds
Jul  2 06:22:33 pciuch puppet-agent[5451]: Applying configuration version '20120701110111-r21129-x@polly'
Jul  2 06:23:26 pciuch puppet-agent[5451]: Finished catalog run in 53.41 seconds
Jul  2 06:52:44 pciuch puppet-agent[5451]: Applying configuration version '20120701110111-r21129-x@polly'
Jul  2 06:53:37 pciuch puppet-agent[5451]: Finished catalog run in 53.01 seconds
Jul  2 07:22:54 pciuch puppet-agent[5451]: Applying configuration version '20120701110111-r21129-x@polly'
Jul  2 07:23:43 pciuch puppet-agent[5451]: Finished catalog run in 49.00 seconds
Jul  2 07:53:01 pciuch puppet-agent[5451]: Applying configuration version '20120701110111-r21129-x@polly'
Jul  2 07:53:58 pciuch puppet-agent[5451]: Finished catalog run in 57.03 seconds
Jul  2 08:23:01 pciuch puppet-agent[5451]: Applying configuration version '20120701110111-r21129-x@polly'
Jul  2 08:23:52 pciuch puppet-agent[5451]: Finished catalog run in 51.19 seconds
Jul  2 08:53:10 pciuch puppet-agent[5451]: Applying configuration version '20120701110111-r21129-x@polly'
Jul  2 08:54:04 pciuch puppet-agent[5451]: Finished catalog run in 54.20 seconds
Jul  2 09:23:22 pciuch puppet-agent[5451]: Applying configuration version '20120701110111-r21129-x@polly'
Jul  2 09:24:11 pciuch puppet-agent[5451]: Finished catalog run in 49.53 seconds
Jul  2 09:53:29 pciuch puppet-agent[5451]: Applying configuration version '20120701110111-r21129-x@polly'
Jul  2 09:54:24 pciuch puppet-agent[5451]: Finished catalog run in 55.16 seconds
Jul  2 10:23:43 pciuch puppet-agent[5451]: Applying configuration version '20120702095348-r21130-y@polly'
Jul  2 10:24:35 pciuch puppet-agent[5451]: Finished catalog run in 52.25 seconds
Jul  2 10:53:53 pciuch puppet-agent[5451]: Applying configuration version '20120702095348-r21130-y@polly'
Jul  2 10:54:50 pciuch puppet-agent[5451]: Finished catalog run in 57.16 seconds
Jul  2 11:23:57 pciuch puppet-agent[5451]: Applying configuration version '20120702105731-r21131-z@polly'
Jul  2 11:24:48 pciuch puppet-agent[5451]: Finished catalog run in 51.22 seconds
Jul  2 12:48:23 pciuch puppet-agent[5451]: Applying configuration version '20120702124537-r21137-z@polly'
Jul  2 12:49:23 pciuch puppet-agent[5451]: Finished catalog run in 59.57 seconds

The normal run interval is 30 minutes; however, there was an 84-minute gap after the last run.

We use storeconfigs (PuppetDB), but we don’t use listen.

We limit the reporting timeout to 15 minutes on haproxy.

#27 Updated by Steve Traylen almost 4 years ago

Maybe related: about 10% of my puppet agents started acting strangely (delayed runs) today

Is this on RHEL6 machines? We observed very sick puppets, as above, due to the leap second. If you have puppetd processes running with very high load, then try “service ntpd stop ; date -s $(date)” to see if that helps.

Restarting puppet will not help with this one… If it is this one at all.
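
Spelled out, that workaround looks like this (quoting the date so the embedded spaces survive; bringing ntpd back afterwards is optional):

service ntpd stop
date -s "$(date)"
service ntpd start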

#28 Updated by konrad rzentarzewski almost 4 years ago

CentOS 5.

#29 Updated by konrad rzentarzewski almost 4 years ago

It didn’t help. I.e., I restarted ntpd and set the date at 14:23 and experienced another gap at 17:08–18:31 (no puppet run in that period; normally it should have run at 17:38 and 18:08, but it ran, delayed, at 18:33):

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
ntp      12830  0.0  0.0  23820  5456 ?        SNLs 14:23   0:00 ntpd -u ntp:ntp -p /var/run/ntpd.pid -g

Jul  2 14:24:00 blade602 puppet-agent[13511]: Finished catalog run in 57.92 seconds
Jul  2 15:38:31 blade602 puppet-agent[13511]: Finished catalog run in 61.29 seconds
Jul  2 16:08:50 blade602 puppet-agent[13511]: Finished catalog run in 59.85 seconds
Jul  2 16:38:33 blade602 puppet-agent[13511]: Finished catalog run in 53.15 seconds
Jul  2 17:08:51 blade602 puppet-agent[13511]: Finished catalog run in 59.46 seconds
Jul  2 18:33:24 blade602 puppet-agent[13511]: Finished catalog run in 67.76 seconds

#30 Updated by Kristjan Koppel almost 4 years ago

Steve Traylen wrote:

Maybe related: about 10% of my puppet agents started acting strangely (delayed runs) today

Is this on RHEL6 machines? We observed very sick puppets, as above, due to the leap second. If you have puppetd processes running with very high load, then try “service ntpd stop ; date -s $(date)” to see if that helps.

Restarting puppet will not help with this one… If it is this one at all.

FYI, we also noticed some puppet agents (versions 2.6.2 and 2.7.18) running on Debian Squeeze VMs using 100% CPU and seeming stuck based on strace output. Telnet to port 8139 didn’t help at all, but setting the date solved this problem for us.

#31 Updated by Brian Almeida over 3 years ago

We are seeing this issue in our Ubuntu 12.04 (64-bit) environment; we didn’t have this problem under 11.10.

We are using the Ubuntu 2.7.11-1ubuntu2.1 packages. I’ve worked around the hung-agent problem for now with a simple cron job that runs on the master:

#!/bin/bash

THRESHOLD=1200

cd /var/lib/puppet/yaml/node

for i in *.yaml; do
    /usr/lib/nagios/plugins/check_file_age -w $THRESHOLD $i > /dev/null 2>&1
    rc=$?
    if [ $rc -ne 0 ]; then
        node=`basename $i .yaml`
        puppet kick $node > /dev/null 2>&1
    fi
done
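
And a sketch of how it could be scheduled on the master (the script path and interval are just examples):

# /etc/cron.d/puppet-kick-stale
*/10 * * * * root /usr/local/sbin/kick-stale-agents.sh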

#32 Updated by Anonymous over 3 years ago

I’ve just pushed a fix to a very closely related issue in #2888, whereby two Puppet agent processes will eventually become deadlocked on the puppetdlock file.

If possible, could someone who has been troubled by this issue please try running Puppet from the 2.7.x branch, or with this patch applied?

Thanks, -Jeff
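
A rough sketch of trying the 2.7.x branch from a source checkout, for anyone who wants to test it without repackaging (paths are examples):

git clone https://github.com/puppetlabs/puppet.git /tmp/puppet-src
cd /tmp/puppet-src && git checkout 2.7.x
RUBYLIB=/tmp/puppet-src/lib ruby /tmp/puppet-src/bin/puppet agent --test --debug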

#33 Updated by Brian Almeida over 3 years ago

Jeff,

I applied the patch to all our systems (54) and restarted puppet, and they are still getting stuck (unstuck via a puppet kick or “echo ‘’|nc localhost 8139”).

So it doesn’t look like it fixed this particular issue.

Thanks,

Brian

Jeff McCune wrote:

I’ve just pushed a fix to a very closely related issue in #2888, whereby two Puppet agent processes will eventually become deadlocked on the puppetdlock file.

If possible, could someone who has been troubled by this issue please try running Puppet from the 2.7.x branch, or with this patch applied?

Thanks, -Jeff

#34 Updated by Anonymous over 3 years ago

Ugh, let’s try that again without yelling this time.

@Brian,

Thanks for trying this out and following up. The problem you’re running into sounds like #10418 since the agent becomes unstuck when data arrives on the listening 8139 TCP port.

#10418 has some really good information, particularly strace calls from a healthy system and strace calls from a hung system.

If you haven’t tried this already, perhaps upgrading to Facter 1.6.12 will resolve the issue for you, Brian. If it does, please let us know.

-Jeff

#35 Updated by Brian Almeida over 3 years ago

Jeff,

Upgrading our environment to the apt.puppetlabs.com repo versions of puppet (2.7.19) and facter (1.6.12) resolved the issue for us (along with a few others).

Thanks for the tip!

Brian

#36 Updated by Anonymous over 3 years ago

  • Status changed from Accepted to Closed

I believe that this issue was a duplicate of #10418 or #2888. Both of these have been resolved, and so this should be resolved as well. See the related bugs for more information about those issues.

#37 Updated by Charlie Sharpsteen about 3 years ago

  • Keywords set to customer
