unicode string comparison (perl 5.26)

md351 has asked for the wisdom of the Perl Monks concerning the following question:

Re: unicode string comparison (perl 5.26) by choroba (Cardinal) on Nov 01, 2019 at 08:18 UTC
The problem with `=$s` instead of `=~s` has already been mentioned. Note that `|` is special in regular expressions, so if you want to use it as a separator, you have to escape it. Also, you probably meant `/[^\x00-\x7f]/`, not `/[^/x00-\x7f]/`. `map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]`
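As a rough illustration of the two points above, the following sketch compares an unescaped and an escaped pipe in `split`, and wraps the corrected character class in a small helper. The sample line and the helper name `has_non_ascii` are made up for this example; they do not come from the original question.

```perl
use strict;
use warnings;

# Hypothetical sample line; the pipe separator is an assumption for illustration.
my $line = "alpha|beta|42";

# Unescaped, /|/ is an alternation of two empty patterns, so it matches the
# empty string and split breaks the line into single characters.
my @wrong = split /|/, $line;

# Escaped, the pipe is treated as a literal separator.
my @right = split /\|/, $line;

# Returns 1 if the string contains any character outside the ASCII range.
sub has_non_ascii {
    my ($text) = @_;
    return $text =~ /[^\x00-\x7f]/ ? 1 : 0;
}

printf "unescaped: %d fields, escaped: %d fields\n", scalar @wrong, scalar @right;
printf "plain sample has non-ASCII: %d\n",    has_non_ascii("cafe");
printf "accented sample has non-ASCII: %d\n", has_non_ascii("caf\x{e9}");
```

If the separator is held in a variable, `split /\Q$separator\E/, $line` gives the same protection without hand-escaping.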
Re: unicode string comparison (perl 5.26) by Tux (Canon) on Nov 01, 2019 at 07:10 UTC
Parsing CSV with regular expressions (or split, which uses regular expressions) is brittle and error-prone. Think nested quotes, separators and newlines, but also Unicode. Even if using a real CSV parser, like Text::CSV_XS or Text::CSV, looks like overkill, it will probably save you hours of pulling your hair out later on. The two mentioned modules deal with Unicode quite well, so you probably will not see any of these issues anymore. `my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1, sep_char => $separator }); while (my $row = $csv->getline ($fh)) { $row->[-1] =~ m/^\s*([0-9]+)\s*\z/ or warn "Last field of row ", $csv->record_number, " is not numeric: '$row->[-1]'\n"; }` Enjoy, Have FUN! H.Merijn
Re: unicode string comparison (perl 5.26) by haj (Vicar) on Nov 01, 2019 at 01:25 UTC
That's a bit short on information... and also probably a bad copy-paste in the line `#clean it up`. At which step in the process do you decode your CSV file from "unicode" (which must be one of the Unicode encodings, probably UTF-8)? The decoding, if done with proper error handling, should take care of malformed characters before you do any regex operations. Where exactly do you get that message? Perl regular expressions work on characters, and I doubt that it is possible to feed them "malformed characters" from a Perl scalar, unless there's some XS code involved.
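One way to get "decoding with proper error handling" is sketched below, assuming the input really is meant to be UTF-8. It uses the core `Encode` module with `FB_CROAK`, so malformed bytes are rejected at decode time instead of surfacing later in a regex. The helper name `decode_utf8_strict` and the sample byte strings are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode raw bytes as strict UTF-8. Returns the decoded text, or undef if the
# input is malformed; in list context the error message is returned as well.
sub decode_utf8_strict {
    my ($bytes) = @_;
    # With a CHECK value set, decode() may modify its argument, so use a copy.
    my $copy = $bytes;
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK()) };
    return wantarray ? ($text, $@) : $text;
}

my ($ok)        = decode_utf8_strict("caf\xc3\xa9");  # valid UTF-8 bytes
my ($bad, $err) = decode_utf8_strict("caf\xe4\x22");  # start byte without continuation

print "valid input decoded to ", length($ok), " characters\n";
print "invalid input rejected\n" unless defined $bad;
```

When reading from a file, opening the handle with an `:encoding(UTF-8)` layer is the other common place to do this decoding.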
Re^2: unicode string comparison (perl 5.26) by md351 (Initiate) on Nov 01, 2019 at 05:03 UTC
What more info do we need? The error does not occur when removing whitespace (cleanup), but on checking whether $last contains a digit or whether $last is unicode, whichever comes first (tried either version, i.e.): `I1: if($last=~/[^/x00-\x7f]){...#here the exception occurs} I2: if($last !~/\d/){... # here the exception occurs if I1 is commented out }`
Re^3: unicode string comparison (perl 5.26) by haj (Vicar) on Nov 01, 2019 at 14:20 UTC
What more info do we need? With the information you've provided so far, I just can't help. I am pretty sure that none of the lines you've shown so far can throw a "Malformed UTF-8 character". Instead, two of the lines are syntax errors. Please take care when providing code samples that they actually demonstrate your point. You also haven't quoted the exact error message, which might contain more information about the offending character, as in the following examples: `Malformed UTF-8 character: \xa4 (unexpected continuation byte 0xa4, with no preceding start byte) at /tmp/a.pl line 3` and `Malformed UTF-8 character: \xe4\x22\x20 (unexpected non-continuation byte 0x22, immediately after start byte 0xe4; need 3 bytes, got 1) at /tmp/a.pl line 7.` Finally, you haven't answered my question about your decoding routine. Perl complains about malformed UTF-8 characters when you feed it a string which you declare as UTF-8 but which isn't, but I can't see any of this in your code.
Re^3: unicode string comparison (perl 5.26) by swl (Prior) on Nov 01, 2019 at 06:15 UTC
haj was noting that the line `$last=$s/\s+//g; #clean it up` should be `$last =~ s/\s+//g; #clean it up`. Are you able to provide some example data for others to test with? Also, your code does not compile. The `if ($last !~/\d/) {...} or if($last=~/[^/x00-\x7f]) {...}` block should be `if ($last !~ /\d/) {...} elsif ($last =~ /[^\x00-\x7f]/) {...}` (note also `\x00` rather than `/x00`, and the closing `/`).
Re^4: unicode string comparison (perl 5.26) by AnomalousMonk (Archbishop) on Nov 01, 2019 at 19:24 UTC
Re: unicode string comparison (perl 5.26) by BillKSmith (Monsignor) on Nov 02, 2019 at 14:18 UTC
Your comments suggest that you are trying to determine whether or not the string in $last is a valid number. Neither of your attempts will do this (not even with the syntactic corrections already posted). It is not clear what you mean by 'number'. If you mean an unsigned decimal integer, your first try is on the right track. A string is almost certainly an integer if it does not contain any non-digits ("\D"). `if ($last =~ m/\D/) { ... # Process as a non-integer string } else { ... # Process the integer }` For any other definition of 'number', you probably should use a module. Your error message is almost certainly from an unrelated problem. Fix this much and then post the offending code. Bill
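For the "use a module" route, one possible choice (an assumption here, since the post does not name a module) is `looks_like_number` from the core `Scalar::Util` module, which applies Perl's own idea of what a numeric string is. The candidate values below are illustrative only.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# Classify a few sample strings using Perl's own numeric-string rules.
my %verdict;
for my $candidate ("42", "-3.5", "1e4", "0xABCD", "abc", "") {
    $verdict{$candidate} = looks_like_number($candidate) ? "number" : "not a number";
    print qq{"$candidate": $verdict{$candidate}\n};
}
```

Note that this accepts signed values, decimals and exponent notation, so it answers a broader question than "unsigned decimal integer".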
Re^2: unicode string comparison (perl 5.26) by Tux (Canon) on Nov 02, 2019 at 14:31 UTC
I don't know if that will address the OP's problems: `$ perl -wE'"6\x{0666}\x{07c6}"=~/\D/ or say "All digits"'` prints `All digits`, and `$ perl -Mutf8 -wE'"6٦۶߆६৬੬૬୬௬౬೬൬෬๖໖༦၆႖៦᠖᥌᧖᪆᪖᭖᮶᱆᱖꘦꣖꤆꧖꧶꩖꯶６𐒦𑁬𑃶𑄼𑇖𑋶𑑖𑓖𑙖𑛆𑜶𑣦𑱖𑵖𖩦𖭖𝟔𝟞𝟨𝟲𝟼𞥖6"=~m/\D/ or say "All digits!"'` prints `All digits!` Enjoy, Have FUN! H.Merijn
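If only ASCII digits should count, two common options are an explicit `[^0-9]` class or the `/a` regex modifier (available since Perl 5.14). The sketch below reuses the three-character string from the first one-liner; the variable names are made up for illustration.

```perl
use strict;
use warnings;

# ASCII 6, ARABIC-INDIC DIGIT SIX, NKO DIGIT SIX
my $mixed = "6\x{0666}\x{07c6}";

# Under Unicode rules, \D sees nothing but digits in this string.
my $unicode_rules = $mixed =~ /\D/     ? "has non-digits" : "all digits";

# An explicit ASCII class treats the two non-ASCII digits as non-digits.
my $ascii_class   = $mixed =~ /[^0-9]/ ? "has non-digits" : "all digits";

# The /a modifier restricts \d and \D to ASCII as well.
my $ascii_flag    = $mixed =~ /\D/a    ? "has non-digits" : "all digits";

print "\\D:      $unicode_rules\n";
print "[^0-9]:  $ascii_class\n";
print "\\D with /a: $ascii_flag\n";
```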
Re^3: unicode string comparison (perl 5.26) by BillKSmith (Monsignor) on Nov 02, 2019 at 19:00 UTC
My English/ASCII only background has certainly left me with tunnel vision concerning what is a 'digit'. My algorithm is correct, but the OP will probably have to change the character class to reflect his requirement. Bill
Re^2: unicode string comparison (perl 5.26) by afoken (Chancellor) on Nov 02, 2019 at 17:44 UTC
A string is almost certainly an integer if it does not contain any non-digits ("\D"). Almost: `>perl -e 'for ("1", "22", "abc", "1e4", "0xABCD", "") { /\D/ or print "\"$_\" is an integer\n" }' "1" is an integer "22" is an integer "" is an integer >` Alexander -- Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
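The empty-string case disappears if the test is turned into a positive, anchored match that requires at least one digit. The following is a minimal sketch of that idea, restricted to ASCII digits; the helper name `is_unsigned_integer` is hypothetical.

```perl
use strict;
use warnings;

# Returns 1 only for a non-empty string made up entirely of ASCII digits.
sub is_unsigned_integer {
    my ($text) = @_;
    return (defined($text) && $text =~ /\A[0-9]+\z/) ? 1 : 0;
}

for my $candidate ("1", "22", "abc", "1e4", "0xABCD", "") {
    printf qq{"%s" %s\n}, $candidate,
        is_unsigned_integer($candidate) ? "is an integer" : "is not an integer";
}
```

Using `\z` rather than `$` also rejects a string with a trailing newline, which `$` would let through.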
Re^3: unicode string comparison (perl 5.26) by BillKSmith (Monsignor) on Nov 02, 2019 at 18:52 UTC
I intentionally ignored unlikely exceptions such as a null string or any 'integer' which cannot be represented exactly in perl's floating point format. Bill