The Puppet Labs Issue Tracker has Moved: https://tickets.puppetlabs.com

Bug #10291

Puppet manifests fail when UTF-8 non-breaking space (0xc2a0) is used as whitespace

Added by Oliver Hookins almost 3 years ago. Updated 11 months ago.

Status: Accepted
Start date: 10/25/2011
Priority: Normal
Due date:
Assignee: Jeff McCune
% Done: 0%
Category: utf8
Target version: -
Affected Puppet version: 2.6.7
Branch:
Keywords: customer

Description

err: Could not parse for environment production: Could not match  Yum::Repo at /home/ohookins/svn/redacted/repo.pp:4

The actual code is unremarkable, but the problem is here:

00000020  20 7b 0a 20 c2 a0 59 75  6d 3a 3a 52 65 70 6f 20  | {. ..Yum::Repo |
00000030  7b 0a 20 c2 a0 c2 a0 c2  a0 6d 65 74 61 64 61 74  |{. ......metadat|

Somehow we’ve ended up with a UTF8 “nbsp” in our manifest (the 0xc2a0). Sure, I can just remove these characters but it suggests to me that perhaps the Unicode support in the parser is incomplete, which is a larger problem for internationalisation.
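For reference, the byte pair in the hex dump can be checked against its code point with a few lines of Ruby (1.9 or later). This is only an illustrative sketch; the variable names are made up here and do not come from the ticket.

```ruby
# The two bytes seen in the hex dump, written as escapes so that the
# snippet itself stays pure ASCII.
raw = "\xC2\xA0".dup.force_encoding(Encoding::UTF_8)

# Decoding the bytes as UTF-8 yields one code point: U+00A0 (NO-BREAK SPACE).
codepoints = raw.unpack("U*")
hex_bytes  = raw.bytes.map { |b| format("%02x", b) }.join

puts format("U+%04X is encoded as 0x%s", codepoints.first, hex_bytes)
```

So the manifest contains one character per `c2 a0` pair, not two stray bytes.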

test.pp.gz - Manifest with Unicode characters. (81 Bytes) Oliver Hookins, 11/15/2011 06:56 am


Related issues

Related to Puppet - Bug #11246: Puppet should work with UTF-8 encoded ERB templates Closed 12/06/2011
Related to Puppet - Bug #20522: Improve Puppet's handling of non-ASCII character encodings Accepted

History

#1 Updated by Kelsey Hightower almost 3 years ago

  • Status changed from Unreviewed to Investigating
  • Assignee set to Kelsey Hightower

Thanks for reporting this issue. I will try and reproduce this error on my setup. I wonder if the default ruby encoding has any effect on this bug.

#2 Updated by Kelsey Hightower almost 3 years ago

  • Status changed from Investigating to Needs More Information

Oliver,

To help troubleshoot your issue, can you provide the following bits of information?

  • ruby version
  • operating system details
  • a version of the manifest that can reproduce the error (Scrub important info)
  • the complete error message

#3 Updated by Oliver Hookins almost 3 years ago

# ruby -v
ruby 1.8.5 (2006-08-25) [x86_64-linux]

# cat /etc/redhat-release 
CentOS release 5.6 (Final)

I’ll try to attach the manifest but no guarantees Redmine doesn’t mangle it.

Here is the full output:

# puppet apply test.pp -v --debug --trace
/usr/lib/ruby/site_ruby/1.8/puppet/parser/lexer.rb:19:in `lex_error'
/usr/lib/ruby/site_ruby/1.8/puppet/parser/lexer.rb:452:in `scan'
/usr/lib/ruby/1.8/racc/parser.rb:152:in `_racc_yyparse_c'
/usr/lib/ruby/1.8/racc/parser.rb:152:in `__send__'
/usr/lib/ruby/1.8/racc/parser.rb:152:in `yyparse'
/usr/lib/ruby/site_ruby/1.8/puppet/parser/parser_support.rb:181:in `parse'
/usr/lib/ruby/site_ruby/1.8/puppet/resource/type_collection.rb:168:in `perform_initial_import'
/usr/lib/ruby/site_ruby/1.8/puppet/node/environment.rb:84:in `known_resource_types'
/usr/lib/ruby/1.8/monitor.rb:238:in `synchronize'
/usr/lib/ruby/site_ruby/1.8/puppet/node/environment.rb:81:in `known_resource_types'
/usr/lib/ruby/site_ruby/1.8/puppet/resource/type_collection_helper.rb:5:in `known_resource_types'
/usr/lib/ruby/site_ruby/1.8/puppet/parser/compiler.rb:434:in `initvars'
/usr/lib/ruby/site_ruby/1.8/puppet/parser/compiler.rb:192:in `initialize'
/usr/lib/ruby/site_ruby/1.8/puppet/parser/compiler.rb:18:in `new'
/usr/lib/ruby/site_ruby/1.8/puppet/parser/compiler.rb:18:in `compile'
/usr/lib/ruby/site_ruby/1.8/puppet/indirector/catalog/compiler.rb:77:in `compile'
/usr/lib/ruby/site_ruby/1.8/puppet/util.rb:198:in `benchmark'
/usr/lib/ruby/site_ruby/1.8/puppet/indirector/catalog/compiler.rb:75:in `compile'
/usr/lib/ruby/site_ruby/1.8/puppet/indirector/catalog/compiler.rb:34:in `find'
/usr/lib/ruby/site_ruby/1.8/puppet/indirector/indirection.rb:188:in `find'
/usr/lib/ruby/site_ruby/1.8/puppet/indirector.rb:50:in `find'
/usr/lib/ruby/site_ruby/1.8/puppet/application/apply.rb:115:in `main'
/usr/lib/ruby/site_ruby/1.8/puppet/application/apply.rb:35:in `run_command'
/usr/lib/ruby/site_ruby/1.8/puppet/application.rb:304:in `run'
/usr/lib/ruby/site_ruby/1.8/puppet/application.rb:410:in `exit_on_fail'
/usr/lib/ruby/site_ruby/1.8/puppet/application.rb:304:in `run'
/usr/lib/ruby/site_ruby/1.8/puppet/util/command_line.rb:59:in `execute'
/usr/bin/puppet:4
/usr/lib/ruby/site_ruby/1.8/puppet/parser/compiler.rb:21:in `compile'
/usr/lib/ruby/site_ruby/1.8/puppet/indirector/catalog/compiler.rb:77:in `compile'
/usr/lib/ruby/site_ruby/1.8/puppet/util.rb:198:in `benchmark'
/usr/lib/ruby/site_ruby/1.8/puppet/indirector/catalog/compiler.rb:75:in `compile'
/usr/lib/ruby/site_ruby/1.8/puppet/indirector/catalog/compiler.rb:34:in `find'
/usr/lib/ruby/site_ruby/1.8/puppet/indirector/indirection.rb:188:in `find'
/usr/lib/ruby/site_ruby/1.8/puppet/indirector.rb:50:in `find'
/usr/lib/ruby/site_ruby/1.8/puppet/application/apply.rb:115:in `main'
/usr/lib/ruby/site_ruby/1.8/puppet/application/apply.rb:35:in `run_command'
/usr/lib/ruby/site_ruby/1.8/puppet/application.rb:304:in `run'
/usr/lib/ruby/site_ruby/1.8/puppet/application.rb:410:in `exit_on_fail'
/usr/lib/ruby/site_ruby/1.8/puppet/application.rb:304:in `run'
/usr/lib/ruby/site_ruby/1.8/puppet/util/command_line.rb:59:in `execute'
/usr/bin/puppet:4
Could not parse for environment production: Could not match   notice('hello at /root/test.pp:2 on node testnode

#4 Updated by Jeff McCune almost 3 years ago

  • Category set to ruby19
  • Status changed from Needs More Information to Accepted
  • Assignee changed from Kelsey Hightower to Jeff McCune

Working

I’m taking this as part of a related commercial support ticket filed by a customer. My current plan is to implement this fix for Puppet 2.7.x and on.

Oliver, do you have an extremely pressing need to have this in 2.6?

Is it possible for you to work around the problem by setting the LANG environment variable to en_US.UTF-8 if you have UTF-8 encoded manifests?

-Jeff

#5 Updated by Jeff McCune almost 3 years ago

Additional Information

This is a more general encoding issue with Strings in Ruby 1.9 and later. We’ll need to try and detect the encoding of each file we load and switch the encoding of the resulting string object on the fly. Related to the paying customer support ticket (535) we specifically need to make this work with templates and the template() function.
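A rough sketch of what "detect the encoding and switch it on the fly" could look like in Ruby 1.9 or later. The helper name `read_as_utf8` is hypothetical and this is not Puppet's actual loader; it only assumes that files are either valid UTF-8 or should be left as raw bytes.

```ruby
require "tempfile"

# Hypothetical helper: read the raw bytes, then tag them as UTF-8 only
# if they really form valid UTF-8. Otherwise return the untouched
# binary string so the caller can decide what to do with it.
def read_as_utf8(path)
  data = File.binread(path)                       # tagged ASCII-8BIT
  utf8 = data.dup.force_encoding(Encoding::UTF_8) # same bytes, new label
  utf8.valid_encoding? ? utf8 : data
end

# Usage: a throwaway file containing a UTF-8 acute accent (0xc2b4).
Tempfile.create("manifest") do |f|
  f.binmode
  f.write("# sysadmin\xC2\xB4s\n".b)
  f.flush
  puts read_as_utf8(f.path).encoding
end
```

`force_encoding` only relabels the bytes and never transcodes them, which is why the `valid_encoding?` check matters before trusting the new label.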

A great description of the context and surrounding issues is located at: http://blog.grayproductions.net/articles/ruby_19s_three_default_encodings

An excerpt:

I suspect early contact with the new m17n engine is going to come to Rubyists in the form of this error message:

invalid multibyte char (US-ASCII)

Ruby 1.8 didn't care what you stuck in a random String literal, but 1.9 is a touch pickier. I think you'll see that the change is for the better, but we do need to spend some time learning to play by Ruby's new rules.

That takes us to the first of Ruby's three default Encodings.

The Source Encoding

In Ruby's new grown up world of all encoded data, each and every String needs an Encoding. That means an Encoding must be selected for a String as soon as it is created. One way that a String can be created is for Ruby to execute some code with a String literal in it, like this:

str = "A new String"

That's a pretty simple String, but what if I use a literal like the following instead?

str = "Résumé"

What Encoding is that in? That fundamental question is probably the main reason we all struggle a bit with character encodings. You can't tell just from looking at that data what Encoding it is in. Now, if I showed you the bytes you may be able to make an educated guess, but the data just isn't wearing an Encoding name tag.

That's true of a frightening lot of data we deal with every day. A plain text file doesn't generally say what Encoding the data inside is in. When you think about that, it's a miracle we can successfully read a lot of things.

When we're talking about program code, the problem gets worse. I may want to write my code in UTF-8, but some Japanese programmer may want to write his code in Shift JIS. Ruby should support that and, in fact, 1.9 does. Let's complicate things a bit more though: imagine that I bundle up that UTF-8 code I wrote in a gem and the Japanese programmer later uses it to help with his Shift JIS code. How do we make that work seamlessly?

The Ruby 1.8 strategy of one global variable won't survive a test like this, so it was time to switch strategies. Ruby 1.9's answer to this problem is the source Encoding.

All Ruby source code now has some Encoding. When you create a String literal in your code, it is assigned the Encoding of your source. That simple rule solves all the problems I just described pretty nicely. As long as my source Encoding is UTF-8 and the Japanese programmer's source Encoding is Shift JIS, my literals will work as I expect and his will work as he expects. Obviously if we share any data, we will need to establish some rules about our shared formats using documentation or code that can adapt to different Encodings, but we should have been doing that all along anyway.

Thus the only question becomes, what's my source Encoding and how do I change it?
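To make the excerpt concrete, here is a small sketch of how the source encoding shows up in a running program. In Ruby 1.9 a magic comment such as `# encoding: utf-8` on the first line of a file sets it, and from Ruby 2.0 onwards UTF-8 is the default. The snippet deliberately uses `\u` escapes so that it does not depend on how the file itself is encoded; the variable names are illustrative only.

```ruby
# __ENCODING__ reports the source encoding of the current file.
ascii_literal = "A new String"
puts __ENCODING__
puts ascii_literal.encoding == __ENCODING__  # plain literals take the source encoding

# A \u escape always yields a UTF-8 String, whatever the source encoding is.
resume = "R\u00e9sum\u00e9"
puts resume.encoding       # UTF-8
puts resume.length         # 6 characters
puts resume.bytes.length   # 8 bytes
```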

#6 Updated by Oliver Hookins almost 3 years ago

It’s not a pressing issue for us and although it is not possible to work around it, we do at least know the warning signs and can test for it in our Lint-style checking stage.

The problem occurs when people share code fragments using Microsoft Communicator/Lync, which insists on converting regular spaces into UTF-8 nbsp characters.
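A lint-style check along these lines might look like the following sketch. The helper `find_nbsp` is hypothetical (it is not part of Puppet or any existing lint tool) and assumes the manifest text has already been read as UTF-8.

```ruby
NBSP = "\u00A0"

# Hypothetical lint helper: returns [line_number, column] pairs (both
# 1-based, counted in characters) for every U+00A0 found in the text.
def find_nbsp(text)
  hits = []
  text.each_line.with_index(1) do |line, lineno|
    col = -1
    while (col = line.index(NBSP, col + 1))
      hits << [lineno, col + 1]
    end
  end
  hits
end

# Usage: a manifest with one non-breaking space hidden in the indentation.
manifest = "class test {\n \u00A0notice('hello')\n}\n"
find_nbsp(manifest).each do |lineno, col|
  puts "line #{lineno}, column #{col}: U+00A0 non-breaking space"
end
```

Reporting the position rather than silently replacing the character keeps the decision with the author of the manifest.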

#7 Updated by Jeff McCune almost 3 years ago

  • Assignee changed from Jeff McCune to Nigel Kersten

Target Versions

Nigel, could I please get a decision on which mainline integration branches this fix should be targeted at? I’d like to suggest 2.7.x and master only, but I have no real reason other than development time. =) I can’t yet say if it’s any more expensive to develop the fix for 2.6.x onwards or not.

-Jeff

#8 Updated by Jeff McCune almost 3 years ago

  • Assignee changed from Nigel Kersten to Jeff McCune

I talked with Nigel in person. This will be targeted at 2.7.x and later only since 2.6 is going into a security fix only state soon.

-Jeff

#9 Updated by Jeff McCune almost 3 years ago

Confirmed this is an issue in 1.9.2

I’ve only been able to trigger this when UTF-8 non-breaking spaces (0xc2a0) are present in the manifest. Unicode snowmen in parameters and resource titles don’t appear to cause any issues.

% /Users/jeff/.rvm/rubies/ruby-1.9.2-p290/bin/ruby -KU -- `which rspec` -f d -b spec/classes/support535_utf8_oliver_spec.rb 
/Users/jeff/.rvm/rubies/ruby-1.9.2-p290/lib/ruby/site_ruby/1.9.1/rubygems/custom_require.rb:36:in `require': /vagrant/src/puppet/lib/puppet/util/zaml.rb:224: invalid multibyte escape: /([\x80-\xFF])/ (SyntaxError)
invalid multibyte escape: /[\x00-\x08\x0B\x0C\x0E-\x1F\x80-\xFF]/
        from /Users/jeff/.rvm/rubies/ruby-1.9.2-p290/lib/ruby/site_ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
        from /vagrant/src/puppet/lib/puppet/util/monkey_patches.rb:17:in `'
        from /Users/jeff/.rvm/rubies/ruby-1.9.2-p290/lib/ruby/site_ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
        from /Users/jeff/.rvm/rubies/ruby-1.9.2-p290/lib/ruby/site_ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
        from /vagrant/src/puppet/lib/puppet/util.rb:4:in `'
        from /Users/jeff/.rvm/rubies/ruby-1.9.2-p290/lib/ruby/site_ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
        from /Users/jeff/.rvm/rubies/ruby-1.9.2-p290/lib/ruby/site_ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
        from /vagrant/src/puppet/lib/puppet.rb:11:in `'
        from /Users/jeff/.rvm/rubies/ruby-1.9.2-p290/lib/ruby/site_ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
        from /Users/jeff/.rvm/rubies/ruby-1.9.2-p290/lib/ruby/site_ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
        from /Users/jeff/vms/puppet/src/support535/spec/spec_helper.rb:1:in `'
        from /Users/jeff/.rvm/rubies/ruby-1.9.2-p290/lib/ruby/site_ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
        from /Users/jeff/.rvm/rubies/ruby-1.9.2-p290/lib/ruby/site_ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
        from /Users/jeff/vms/puppet/src/support535/spec/classes/support535_utf8_oliver_spec.rb:4:in `'
        from /Users/jeff/.rvm/gems/ruby-1.9.2-p290@puppet/gems/rspec-core-2.7.1/lib/rspec/core/configuration.rb:459:in `load'
        from /Users/jeff/.rvm/gems/ruby-1.9.2-p290@puppet/gems/rspec-core-2.7.1/lib/rspec/core/configuration.rb:459:in `block in load_spec_files'
        from /Users/jeff/.rvm/gems/ruby-1.9.2-p290@puppet/gems/rspec-core-2.7.1/lib/rspec/core/configuration.rb:459:in `map'
        from /Users/jeff/.rvm/gems/ruby-1.9.2-p290@puppet/gems/rspec-core-2.7.1/lib/rspec/core/configuration.rb:459:in `load_spec_files'
        from /Users/jeff/.rvm/gems/ruby-1.9.2-p290@puppet/gems/rspec-core-2.7.1/lib/rspec/core/command_line.rb:18:in `run'
        from /Users/jeff/.rvm/gems/ruby-1.9.2-p290@puppet/gems/rspec-core-2.7.1/lib/rspec/core/runner.rb:80:in `run_in_process'
        from /Users/jeff/.rvm/gems/ruby-1.9.2-p290@puppet/gems/rspec-core-2.7.1/lib/rspec/core/runner.rb:69:in `run'
        from /Users/jeff/.rvm/gems/ruby-1.9.2-p290@puppet/gems/rspec-core-2.7.1/lib/rspec/core/runner.rb:10:in `block in autorun'

#10 Updated by Jeff McCune almost 3 years ago

  • Subject changed from UTF8 non-breaking space in a manifest breaks the parser to Puppet manifests should fully support UTF-8 character encoding

Expanding the subject

I’m going to expand the scope on this because it’s not only non-breaking spaces that cause problems. Another support customer reports the following issue:

$ cat test.pp 
class test { 
# sysadmin´s 
} 

$ od -c test.pp 
0000000 c l a s s t e s t { \n 
0000020 # s y s a d m i n 302 264 s \n } 
0000040 \n 
0000041 

$ echo $LANG 
en_US.UTF-8
$ puppet parser validate test.pp 
$ unset LANG 
$ puppet parser validate test.pp 
err: Could not parse for environment production: invalid byte sequence in US-ASCII at /tmp/test.pp:1 
err: Try 'puppet help parser validate' for usage

At the least, our parser does not handle non-breaking spaces or UTF-8 characters in comments.
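The dependence on LANG can be illustrated outside of Puppet. Under Ruby 1.9 or later, the same bytes are fine when labelled UTF-8 and broken when labelled US-ASCII, which is the encoding Ruby falls back to when no UTF-8 locale is set. The following is a sketch of that general String behaviour, not of Puppet's lexer; the variable names are illustrative.

```ruby
# The bytes of the comment line from test.pp, as they sit on disk.
bytes = "# sysadmin\xC2\xB4s\n".b

as_utf8  = bytes.dup.force_encoding(Encoding::UTF_8)
as_ascii = bytes.dup.force_encoding(Encoding::US_ASCII)

puts as_utf8.valid_encoding?    # true
puts as_ascii.valid_encoding?   # false

# Scanning the US-ASCII labelled copy with a regexp is the kind of
# operation that typically raises ArgumentError for such a string.
begin
  as_ascii =~ /\s/
rescue ArgumentError => e
  puts e.message
end
```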

I’m debating proposing we should switch the entire Puppet source code base to UTF-8 encoding as well which will be a 3rd ticket…

-Jeff

#11 Updated by Jeff McCune almost 3 years ago

  • Subject changed from Puppet manifests should fully support UTF-8 character encoding to Puppet manifests fail when UTF-8 non-breaking space (0xc2a0) is used as whitespace
  • Status changed from Accepted to Needs More Information
  • Assignee changed from Jeff McCune to Oliver Hookins

Tightening the scope

After some investigating it’s clear the non-breaking space issue Oliver reported is distinctly different than the apostrophe in a comment issue Marcin reported. As a result, I’ve filed #11303 to track the comment issue and am tightening the scope on this ticket back to the non-breaking space.

Oliver, sorry but I’m also putting this back into our backlog since it’s a very different problem. I suspect whitespace is whitespace (/\s/) in our parser. Are you looking for a better error message, or do you expect Puppet to treat the 0xc2a0 byte sequence as whitespace as well?

-Jeff

#12 Updated by Jeff McCune almost 3 years ago

  • Category changed from ruby19 to utf8

#13 Updated by Oliver Hookins almost 3 years ago

I would expect the non-breaking space to be handled as a regular space token or, failing that (if that is deemed incorrect behaviour and therefore not fixable), a reasonable error message, e.g. unknown token.

#14 Updated by Jeff McCune almost 3 years ago

  • Status changed from Needs More Information to Accepted
  • Assignee changed from Oliver Hookins to Jeff McCune

#15 Updated by Anonymous almost 3 years ago

Oliver Hookins wrote:

I would expect the non-breaking space to be handled as a regular space token or, failing that (if that is deemed incorrect behaviour and therefore not fixable), a reasonable error message, e.g. unknown token.

For what it is worth, treating anything with the Unicode category “Space” as whitespace seems like a reasonable choice for the composed character in the reader. Generally, preferring to follow the Unicode standard is the best option, I believe.

#16 Updated by Nigel Kersten over 2 years ago

So this sounds like a relatively simple fix? We just expand our parser definition of a space to include the whole Unicode category “space”?

Jeff, is this particular fix that simple? Or are there deeper issues?

#17 Updated by Anonymous over 2 years ago

Nigel Kersten wrote:

So this sounds like a relatively simple fix? We just expand our parser definition of a space to include the whole Unicode category “space”?

I think it is relatively simple, in that sense. I have no idea what problems this will expose in the parser, however.
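To show what "expanding the definition of a space" means at the regexp level, here is a small Ruby sketch (it needs Ruby 2.4 or later for `match?`). It only demonstrates how Ruby's own character classes treat U+00A0; the helper `normalise_spaces` is hypothetical and is not a proposed patch to the Puppet lexer.

```ruby
nbsp = "\u00A0"

puts nbsp.match?(/\s/)            # false: \s covers ASCII whitespace only
puts nbsp.match?(/[[:space:]]/)   # true: POSIX brackets are Unicode-aware
puts nbsp.match?(/\p{Zs}/)        # true: Unicode "Space_Separator" category
puts " ".match?(/\p{Zs}/)         # true: the ordinary space is in Zs as well

# Hypothetical pre-processing step: fold every Unicode space separator
# into a plain ASCII space before the text reaches a lexer.
def normalise_spaces(text)
  text.gsub(/\p{Zs}/, " ")
end

puts normalise_spaces("class\u00A0test {}")
```

Note that a blanket substitution like this would also rewrite non-breaking spaces inside quoted strings, which is one of the problems such a change could expose.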

#18 Updated by Charlie Sharpsteen over 1 year ago

  • Keywords set to customer

#20 Updated by Adrien Thebo 11 months ago

I did some investigating on this: Ruby, Python, and (unsurprisingly) C all fail if non-breaking spaces are used in actual source code. They can appear in strings, but nbsp cannot be used as actual whitespace, and a number of common languages fail with parse errors when encountering U+00A0. Considering that there are a number of other Unicode whitespace characters (http://unicode-table.com/en/search/?q=non+breaking+space), it seems like trying to allow non-breaking spaces as whitespace would be the road to madness.
