The Puppet Labs Issue Tracker has Moved: https://tickets.puppetlabs.com

Bug #15420

puppet master resource starvation on http report url timeouts instead of failures

Added by Ramon Alteren almost 2 years ago. Updated over 1 year ago.

Status:ClosedStart date:07/09/2012
Priority:NormalDue date:
Assignee:eric sorenson% Done:

0%

Category:reports
Target version:-
Affected Puppet version:2.7.13 Branch:
Keywords:ntbf

We've Moved!

Ticket tracking is now hosted in JIRA: https://tickets.puppetlabs.com

This issue is currently not available for export. If you are experiencing the issue described below, please file a new ticket in JIRA. Once a new ticket has been created, please add a link to it that points back to this Redmine ticket.


Description

We recently had an issue with puppetdashboard that caused it to timeout instead of fail / succeed. Since the default http timeout in ruby is set to 60 seconds, the puppetmaster process is busy for the entire catalog-handling + and additional 60secs to wait on report http post timeout.

This causes severe resource starvation on the master resulting in failed puppet runs for all nodes

It would make sense to add a timeout parameter for http reports or in the long run split off http reporting into a separate thread/process on the master…

History

#1 Updated by Erik Dalén almost 2 years ago

Code fixing the issue in pull request 923

#2 Updated by Patrick Carlisle over 1 year ago

  • Status changed from Unreviewed to Investigating

It appears to me from reading the code and doing some testing that the master respects the configtimeout option, which defaults to 120 seconds. The documentation on that option implies it is only for the agent, but all http connections share the same code. Would it be practical to use that? If so, we could change the docs for that option. If not, we should probably separate them completely.

While playing with this I also discovered that on ruby 1.8.7 net::http does not respect timeout during the ssl handshake.

#3 Updated by Patrick Carlisle over 1 year ago

Now I’ve realized that change was put into 3.0. Configtimeout doesn’t work for reports on 2.7, as would be expected from reading the config docs. I was looking at 3.0 since the pull request is targeted at the master branch.

#4 Updated by Erik Dalén over 1 year ago

I think it makes more sense to have it as a separate option really. But I can make sure we switch the pull request to target 2.7.x, as I guess this can be considered a bugfix?

#5 Updated by Ramon Alteren over 1 year ago

I think it doesn’t make sense to share the configtimeout option.

We’d want the report timeout to be short so any problem there results in a very quick “report upload fail” We’d want a much more lax configtimeout for the agent, since we run the agent on the master as well, sharing the timeout between the two would mean that the agent ont the master would have the same timeout for http connects as the reporting engine. That doesn’t sound like a good idea at all….

#6 Updated by Ramon Alteren over 1 year ago

  • Branch set to 2.7.x

Rebased change against 2.7.x branch in pull request 956

#7 Updated by Ramon Alteren over 1 year ago

  • Branch deleted (2.7.x)

Erik Dalén wrote:

Code fixing the issue in pull request 923

This pull request still merges with master

#8 Updated by Daniel Pittman over 1 year ago

I think this should be rejected, or at least the code as-is should be: Puppet Labs can’t control the code that receives the report, and this is a public API, imposing this (relatively) short timeout for response puts a new burden on the external system.

If this was only applied to the HTTP open timeout, that wouldn’t be so bad – it is the read timeout that concerns me. (Though it still has some concerns in that case, too.)

It would also be relatively acceptable if the master would not just discard the report when this request failed. Since it will do that, and the entire report will be lost, we should strive to be extremely tolerant of problems with external systems, even at the cost of overall throughput.

#9 Updated by Erik Dalén over 1 year ago

It doesn’t impose a shot timeout by default, the default should still be 60. It just allows the user to have a lower timeout.

The other option to avoid the bug this tries to fix would be to fork this off, but that seems to have some problems under passenger when workers are killed etc.

But I would rather loose a report than have the puppetmaster die.

#10 Updated by Patrick Carlisle over 1 year ago

  • Status changed from Investigating to Needs Decision
  • Assignee set to eric sorenson

#11 Updated by eric sorenson over 1 year ago

  • Status changed from Needs Decision to Closed
  • Keywords set to ntbf

Some comments:

I don’t think we’d add a new config option into 2.7.x at this point.

Even in 3.x, it seems like adding timeout settings for the various http connections is not a good choice either, because it adds a lot of complexity that is really just used to cover up underlying problems. If you have to bump your catalog / file fetch timeout to high values because of slow puppet master response, stacking up connections is probably adding to the death spiral rather than helping it; conversely if you need to use really short timeouts because of an underlying app server problem you should probably be fixing that problem rather than working around it with http settings.

If I understand correctly, Ramon/Erik can use the :configtimeout setting to tune the timeout down today without any code changes upstream, it’s just that this will also affect catalog and file fetches.

Relatedly, as Patrick discovered, the configtimeout option is erroneously named and ought to be corrected and used for this purpose in 3.x, since it affects all http connections. I’ve created #15611 to address this.

#12 Updated by Patrick Carlisle over 1 year ago

That’s not quite right. In 2.7 there is no way to configure the timeout for an http report processor. It is the only http connection that doesn’t use configtimeout.

#13 Updated by Ramon Alteren over 1 year ago

eric sorenson wrote:

Some comments:

I don’t think we’d add a new config option into 2.7.x at this point.

Even in 3.x, it seems like adding timeout settings for the various http connections is not a good choice either, because it adds a lot of complexity that is really just used to cover up underlying problems. If you have to bump your catalog / file fetch timeout to high values because of slow puppet master response, stacking up connections is probably adding to the death spiral rather than helping it; conversely if you need to use really short timeouts because of an underlying app server problem you should probably be fixing that problem rather than working around it with http settings.

Fair enough. The death spiral of stacking up connections was exactly the problem we were trying to solve with this patch. Although I agree that the problem should probably be solved at puppetdashboard level I’m a bit doubtful I’ll be able to avoid all error situations that could result in a (long) timeout. The original cause was a filled mysql disk causing puppet dashboard to timeout, but I can think of a number of network related failures that would have roughly the same effect but can never be solved at the target (reporting) system.

We rely on our puppetmasters to keep functioning regardless of their ability to report status to any secondary services, in general that would be the preferred behavior I think ? It’s really bad for us to have the primary function as config management system fail due it’s reporting facilities. We care less about the reporting and more about a correct functioning config management system.

If I understand correctly, Ramon/Erik can use the :configtimeout setting to tune the timeout down today without any code changes upstream, it’s just that this will also affect catalog and file fetches.

That sadly doesn’t work for 2.7, would it be possible to set the :configtimeout setting for reports only in 3.0?

Relatedly, as Patrick discovered, the configtimeout option is erroneously named and ought to be corrected and used for this purpose in 3.x, since it affects all http connections. I’ve created #15611 to address this.

That would definitely help, thanks for looking into this.

Also available in: Atom PDF