Bug #12623
Long timeout for SRV DNS resolution
| Status: | Accepted | Start date: | 02/14/2012 |
|---|---|---|---|
| Priority: | Normal | Due date: | |
| Assignee: | - | % Done: | 0% |
| Category: | network | | |
| Target version: | - | | |
| Affected Puppet version: | development | Branch: | |
| Keywords: | | | |
Description
This issue has come up for me several times when setting up nodes for acceptance tests. It is probably most easily reproducible with a fresh copy of our CentOS6 VM.
In /etc/sysconfig/network there is a line that looks like this:
HOSTNAME=pe-centos6.localdomain
This causes facter to return that same string for “fqdn”. Then, when you attempt to use this VM as an agent node for acceptance tests, it blocks for an extremely long time during the 03_ValidateSignCerts phase. Running the offending commands outside of the acceptance test framework:
puppet master --dns_alt_names="puppet,$(hostname -s),$(hostname -f)" --verbose --no-daemonize --logdest=/var/lib/puppet/log/puppetmaster.log --debug --trace
and
puppet agent --test --debug
Will reproduce the slowness, and you’ll see the following in the agent output:
debug: Searching for SRV records for domain: localdomain
debug: Found 0 SRV records for: _x-puppet-report._tcp.localdomain
There may be a delay of 5 minutes or more in between those two lines being printed, however.
The delay occurs in lib/puppet/network/resolver.rb, in the method each_srv_record. However, this code simply calls Ruby’s Resolv::DNS#getresources method. Reviewing the documentation for this method, I don’t see a way to specify a timeout interval.
I’m also not entirely sure whether reducing the timeout is the correct solution, or if there is some other alternative.
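For reference, a lookup of this kind through Ruby's standard resolver looks roughly like the sketch below. It is not the actual each_srv_record code: the helper name srv_records_for, its parameters, and the sorting by priority are assumptions made for illustration. The record name follows the _service._tcp.domain form seen in the debug output above.

```ruby
require 'resolv'

# Hypothetical helper, NOT Puppet's each_srv_record.
# Builds an SRV record name such as "_x-puppet-report._tcp.localdomain"
# and asks the resolver for the matching SRV resources.
def srv_records_for(domain, service, resolver = Resolv::DNS.new)
  name = "_#{service}._tcp.#{domain}"
  records = resolver.getresources(name, Resolv::DNS::Resource::IN::SRV)
  # SRV records with a lower priority value are meant to be tried first.
  records.sort_by { |record| record.priority }
         .map { |record| [record.target.to_s, record.port] }
end

# Usage (performs a real DNS query, so it may block if DNS is misconfigured):
#   srv_records_for('localdomain', 'x-puppet-report')
```

The call to getresources is where the agent would be blocked while the resolver waits for an answer.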
As a workaround, simply fixing the “HOSTNAME” line in /etc/sysconfig/network so that it does not contain the domain name seems to dramatically improve the performance.
History
#1
Updated by Chris Price over 1 year ago
Discussed this with Josh. There is a configuration option that you can use to turn this off; when I get a minute, I’m going to re-break my DNS, confirm that I see the slowdown again, and then try running with that option turned off just to confirm.
Pending the outcome of that experiment, we may want to reconsider whether or not this is enabled by default.
One other bit of info: I now have a suspicion that this affects EL flavors of Linux out-of-the-box much more readily than Debian flavors. I do not recall having this problem with my Ubuntu 10.04 VM, but did have it with Cent5 and Cent6. My theory is that “hostname” returns (for all intents and purposes) the contents of /etc/hostname on Debian systems, and Debian doesn’t seem to try to put any domain name in there. Meanwhile, on the EL flavors, “hostname” seems to return what’s in /etc/sysconfig/network… which, for me, on out-of-the-box Cent installs appears to be “hostname.localdomain”. Facter then sees this as the FQDN and things get weird from there… all of this could use further investigation, though.
#2
Updated by Chris Price over 1 year ago
- Status changed from Unreviewed to Investigating
- Assignee set to Chris Price
#3
Updated by Chris Price over 1 year ago
Confirmed that setting use_srv_records to false also alleviates this problem.
In my acceptance testing environment, I logged into my CentOS6 node and changed the /etc/sysconfig/network line back to its original value:
HOSTNAME=pe-centos6.localdomain
Rebooted, and ‘hostname’ now returns a pseudo-fqdn again:
[root@pe-centos6 ~]# hostname
pe-centos6.localdomain
[root@pe-centos6 ~]#
Facter interprets this as the fqdn again:
[root@pe-centos6 ~]# facter fqdn
pe-centos6.localdomain
[root@pe-centos6 ~]#
And running this command:
time puppet agent --test --debug
regularly takes over three minutes to execute. It very visibly hangs on this line:
debug: Searching for SRV records for domain: localdomain
at least twice during each execution of the agent. If I then edit /etc/puppet/puppet.conf, adding the line:
use_srv_records=false
and then run the agent, the agent execution time drops from over 3 minutes to less than 5 seconds.
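To check whether the delay comes from the DNS lookup itself rather than from anything else the agent does, the SRV query can be timed in isolation. This is a minimal diagnostic sketch: the helper name time_srv_lookup is hypothetical, and the record name is the one from the debug output in the description.

```ruby
require 'benchmark'
require 'resolv'

# Hypothetical diagnostic helper, not part of Puppet.
# Returns the number of SRV records found and the wall-clock seconds
# the lookup took.
def time_srv_lookup(name, resolver = Resolv::DNS.new)
  records = nil
  elapsed = Benchmark.realtime do
    records = resolver.getresources(name, Resolv::DNS::Resource::IN::SRV)
  end
  [records.size, elapsed]
end

# Usage on the affected node (performs a real DNS query):
#   count, seconds = time_srv_lookup('_x-puppet-report._tcp.localdomain')
#   puts format('%d SRV records in %.2f seconds', count, seconds)
```

If the lookup alone accounts for most of the delay seen with "time puppet agent --test --debug", that would point at the resolver wait rather than at the agent.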
#4
Updated by Chris Price over 1 year ago
- Status changed from Investigating to Needs Decision
- Assignee changed from Chris Price to Daniel Pittman
Daniel, Josh suggested that either you or Nigel might want to make a decision as to whether or not this setting should be the default, given the possibility that customers could experience the same issues that we are experiencing internally if their DNS is not set up correctly.
#5
Updated by Chris Price over 1 year ago
Another option to consider might be to disable this option for acceptance testing? i.e., when the framework generates the config file for agents, it could add this setting?
This would lessen the likelihood of developers encountering this problem, but would also introduce a slight variation between our acceptance testing environment and our customers' likely default environments, which might not be desirable.
#6
Updated by Daniel Pittman over 1 year ago
- Status changed from Needs Decision to Accepted
Chris Price wrote:
Daniel, Josh suggested that either you or Nigel might want to make a decision as to whether or not this setting should be the default, given the possibility that customers could experience the same issues that we are experiencing internally if their DNS is not set up correctly.
We should try to limit the timeout strictly, to one or two seconds total, and leave it on by default. If that isn’t possible we can fall back to, e.g., disabling this out of the box. If we can’t do that before the next substantial release we should probably also disable it in the interim, but preferably we should just fix the problem: that we delay longer than is reasonable.
#7
Updated by Chris Price over 1 year ago
FWIW I spent a bit of time looking at the code in question, and the Ruby Resolv::DNS library that we are using did not have any obvious hooks for setting a timeout… so I didn’t see any easy way of dealing with that, short of launching a new thread that we could kill after our timeout window had passed. And it sounds like we’ve not been having much fun with code that spawns new threads recently.
Perhaps I overlooked the timeout setting, or perhaps there is another library that would provide this functionality…
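A rough sketch of the thread-based idea, for discussion only. The helper name with_thread_deadline and its fallback parameter are hypothetical, and killing a thread in the middle of a call may leave sockets or other resources in an unclear state, which is part of why this approach is unattractive.

```ruby
# Hypothetical helper: run the given block in a worker thread and give up
# after `seconds`, returning `fallback` instead of the block's result.
def with_thread_deadline(seconds, fallback)
  worker = Thread.new { yield }
  if worker.join(seconds)
    # The worker finished in time; Thread#value returns the block's result
    # (and re-raises any exception the block raised).
    worker.value
  else
    # Deadline passed: stop the worker and use the fallback.
    worker.kill
    fallback
  end
end

# Usage (the lookup itself is only indicated here):
#   records = with_thread_deadline(2, []) { resolver.getresources(name, type) }
```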
#8
Updated by Daniel Pittman over 1 year ago
Chris Price wrote:
FWIW I spent a bit of time looking at the code in question, and the Ruby Resolv::DNS library that we are using did not have any obvious hooks for setting a timeout… so I didn’t see any easy way of dealing with that, short of launching a new thread that we could kill after our timeout window had passed. And it sounds like we’ve not been having much fun with code that spawns new threads recently.
Perhaps I overlooked the timeout setting, or perhaps there is another library that would provide this functionality…
You probably want the timeout module from the standard library, which should implement that all correctly, I believe.
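A minimal sketch of that suggestion, assuming the lookup is wrapped at the call site. The helper name srv_lookup_with_timeout is hypothetical, the 2-second default follows the "one or two seconds" figure mentioned earlier in this ticket, and treating a timeout as "no SRV records" is an assumption about the desired fallback.

```ruby
require 'resolv'
require 'timeout'

# Hypothetical wrapper, not Puppet code.
# Gives the SRV lookup at most `seconds` to complete; on timeout it
# behaves as if no SRV records were found.
def srv_lookup_with_timeout(name, seconds = 2, resolver = Resolv::DNS.new)
  Timeout.timeout(seconds) do
    resolver.getresources(name, Resolv::DNS::Resource::IN::SRV)
  end
rescue Timeout::Error
  []
end

# Usage (performs a real DNS query):
#   srv_lookup_with_timeout('_x-puppet-report._tcp.localdomain')
```

Newer Ruby versions (2.0 and later) also provide Resolv::DNS#timeouts=, which may be a simpler option where it is available.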
#9
Updated by Daniel Pittman 13 days ago
- Assignee deleted (Daniel Pittman)