Simple pattern match failing - Possibly unicode issue

irahul has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Simple pattern match failing - Possibly unicode issue by Corion (Patriarch) on Jun 04, 2010 at 10:55 UTC
Your whitespace is optional: `\s*`. If there is always at least one blank, you might want to use `\s+` for one or more, or `\s` for exactly one whitespace character. Also, while debugging, you might want to print out the whole string you match against.
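A minimal sketch of how the three quantifiers differ. The original question's data is not shown in this thread, so the sample strings (of the assumed form `YYYYMMDD HHMM`) and the helper `matching_patterns` are hypothetical:

```perl
use strict;
use warnings;

# Hypothetical sample inputs; the thread does not show the real data.
my @samples = ('20100604 0748', '201006040748', '20100604  0748');

my %patterns = (
    'zero or more (\s*)' => qr/^(\d{8})\s*(\d{4})$/,
    'one or more (\s+)'  => qr/^(\d{8})\s+(\d{4})$/,
    'exactly one (\s)'   => qr/^(\d{8})\s(\d{4})$/,
);

# Return the labels of the patterns that match a given string.
sub matching_patterns {
    my ($string) = @_;
    return sort grep { $string =~ $patterns{$_} } keys %patterns;
}

for my $s (@samples) {
    # Print the whole string in brackets so stray whitespace is visible.
    printf "[%s] matches: %s\n", $s, join(', ', matching_patterns($s)) || 'none';
}
```

Only the `\s*` variant accepts the string with no blank at all, and only `\s*` and `\s+` accept the one with two blanks.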
Re: Simple pattern match failing - Possibly unicode issue by fod (Friar) on Jun 04, 2010 at 12:06 UTC
I suspect that your production code has an extra set of parentheses around the last two capturing groups of the regex, which is usurping $4: `my $regexp = '(\d{4})(\d{2})(\d{2})\s*((\d{2})(\d{2}))';` Edit: oops - identical response to cdarke's - you have to be quick around here :)
Re^2: Simple pattern match failing - Possibly unicode issue by irahul (Initiate) on Jun 04, 2010 at 14:16 UTC
I am new on perlmonks. Good to see a pro-active and helpful community.
Re: Simple pattern match failing - Possibly unicode issue by JavaFan (Canon) on Jun 04, 2010 at 11:30 UTC
Are you sure? I mean, regardless of which characters it's going to match, `/\d{2}/` will match exactly two characters. It's never going to match `0748`, regardless of Unicode issues. Could you provide us with a short snippet of code (including relevant data) that shows this bizarre behaviour? Obviously, the code having the problem isn't the code you are showing; it's hard, if not impossible, to debug code when you are shown different, working code.
Re^2: Simple pattern match failing - Possibly unicode issue by cdarke (Prior) on Jun 04, 2010 at 12:01 UTC
Following up on JavaFan's suggestion, I can reproduce your problem by adding a set of parentheses: `my $regexp = '(\d{4})(\d{2})(\d{2})\s*((\d{2})(\d{2}))';` Are you sure your production code does not have them?
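The reproduction can be sketched as follows. The input string is a hypothetical value chosen to fit the pattern, since the original post's data is not reproduced in this thread:

```perl
use strict;
use warnings;

# Hypothetical input; the original post's data is not reproduced in this thread.
my $datetime = '20100604 0748';

my $flat   = '(\d{4})(\d{2})(\d{2})\s*(\d{2})(\d{2})';
my $nested = '(\d{4})(\d{2})(\d{2})\s*((\d{2})(\d{2}))';

# Groups are numbered by the position of their opening parenthesis,
# so the extra outer group becomes $4 and pushes the others to $5 and $6.
my @flat_groups   = $datetime =~ /$flat/;
my @nested_groups = $datetime =~ /$nested/;

print "flat:   @flat_groups\n";    # flat:   2010 06 04 07 48
print "nested: @nested_groups\n";  # nested: 2010 06 04 0748 07 48
```

If the pattern comes from a config file, named captures such as `(?<hour>\d{2})`, read back through `$+{hour}` (available since Perl 5.10), are one way to keep the code from depending on group positions.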
Re^2: Simple pattern match failing - Possibly unicode issue by irahul (Initiate) on Jun 04, 2010 at 14:12 UTC
`/\d{2}/` will always match exactly two characters. But doesn't perl produce polymorphic opcodes for pattern matching that do different things based on the input string's encoding? In the case of a multi-byte encoding, what is `/.{2}/` supposed to match: 2 bytes, or 2 characters in the given encoding? And yes, the code having the problem isn't the code I posted. But I can assure you the code having the problem is doing the same thing. The actual code reads the value of $datetime from a Unicode-encoded XML file, reads the pattern to match from a config file, and populates the fields accordingly. I am dumping both $datetime and $regex before doing the pattern match, and they are exactly what I have shown here. I have anecdotal evidence that perl's Unicode implementation has a role to play in this: I removed the `use utf8;` directive and now it works as it's supposed to.
Re^3: Simple pattern match failing - Possibly unicode issue by JavaFan (Canon) on Jun 04, 2010 at 15:38 UTC
`/.{2}/` will match two characters. Which may be 4 bytes, or even more. But there's no 2-character UTF-8 byte sequence which, when interpreted as non-UTF-8, will result in a string of 4 ASCII digits. "And yes, the code having the problem isn't the code I posted. But I can assure you the code having the problem is doing the same thing." Ehm, if it's doing the same thing, you wouldn't be having the problem you're encountering. You cannot have two pieces of code "doing the same thing" when one has a problem, and the other doesn't. If they do the same thing, they output the same thing. "I have anecdotal evidence that perl's unicode implementation have a role to play in this. I removed: `use utf8;`" That "directive" tells perl your source code is encoded in UTF-8. Since the code fragment you showed contains ASCII only, the use of this pragma is neither wrong, nor necessary. I'd really like to see the code that you claim to exhibit the behaviour you describe. You have stumbled upon an (unknown) bug.
Re^3: Simple pattern match failing - Possibly unicode issue by ikegami (Patriarch) on Jun 04, 2010 at 17:29 UTC
"In case of multi-byte encoding, what is /.{2}/ supposed to match? 2 bytes or 2 characters in the given encoding?" It's quite simple. If you match against bytes (e.g. encoded text), it'll match two bytes. If you match against chars (e.g. decoded text), it'll match two chars. For example, if you match against the four bytes of the UCS-2BE encoding of 'AB', you'll get the two bytes 00 and 41. `"\x{00}\x{41}\x{00}\x{42}" =~ /(.{2})/s; # Two bytes 00 and 41.` For example, if you match against the four chars NUL, A, NUL, B, you'll get the two chars NUL and A. `"\x{00}\x{41}\x{00}\x{42}" =~ /(.{2})/s; # Two chars NUL and A`
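A small sketch of the same distinction using the core `Encode` module, which is an addition here and not part of the reply above. The regex is identical in both matches; only the string it is applied to differs (encoded bytes versus decoded characters):

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Encoded text: the four bytes of 'AB' in UCS-2BE (00 41 00 42).
my $bytes = encode('UCS-2BE', 'AB');
my ($two_bytes) = $bytes =~ /(.{2})/s;

# Decoded text: the two characters 'A' and 'B'.
my $chars = decode('UCS-2BE', $bytes);
my ($two_chars) = $chars =~ /(.{2})/s;

# Show the captured bytes as hex, and the captured characters as text.
printf "bytes: %s\n", join ' ', map { sprintf '%02X', ord } split //, $two_bytes;  # bytes: 00 41
printf "chars: %s\n", $two_chars;                                                   # chars: AB
```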
Re^3: Simple pattern match failing - Possibly unicode issue by Anonymous Monk on Jun 04, 2010 at 14:29 UTC
"And yes, the code having the problem isn't the code I posted. But I can assure you the code having the problem is doing the same thing." Except that it isn't :) The code you posted will never demonstrate your problem. You should post code that does demonstrate your problem.
Re^3: Simple pattern match failing - Possibly unicode issue by proceng (Scribe) on Jun 04, 2010 at 23:33 UTC
It is not just anecdotal. See "Re: Odd problems with UTF-8, regexps, and newer Perl versions". Unless the source code is expected to be in UTF-8, the docs say not to use the pragma: "Do not use this pragma for anything else than telling Perl that your script is written in UTF-8"
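Since `use utf8;` only describes the encoding of the source file, input data has to be decoded separately. A minimal sketch, assuming the data arrives as UTF-8 octets; the sample string and the variable names are hypothetical and not taken from the original post:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical input: the UTF-8 octets of "café 0748", as they would arrive from a file.
# The source itself is pure ASCII, so no `use utf8;` is needed.
my $octets = "caf\xC3\xA9 0748";

# Decoding turns the 10 octets into 9 characters.
my $text = decode('UTF-8', $octets);

my ($hour, $minute) = $text =~ /\s(\d{2})(\d{2})$/;
printf "%d bytes, %d characters, hour=%s minute=%s\n",
    length($octets), length($text), $hour, $minute;
```

When reading from a file, opening it with an `:encoding(UTF-8)` layer performs the same decoding on input.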