md351 has asked for the wisdom of the Perl Monks concerning the following question:

I am parsing a csv file with unicode characters. What I want to say is:

my @tmp=split/$separator/;   # e.g. , or | is the separator
my $last=pop @tmp;           # get the last element
$last=$s/\s+//g;             #clean it up
#normally $last would be a number. If it is not a number but a unicode
# string, I want to throw an exception. The problem
# is that I get a fatal malformed Utf-8 character error when I try either
if($last !~/\d/){...}
or
if($last=~/[^/x00-\x7f]){...}

Any wisdom on this?

Replies are listed 'Best First'.
Re: unicode string comparison (perl 5.26)
by choroba (Cardinal) on Nov 01, 2019 at 08:18 UTC
    The problem with =$s instead of =~s has already been mentioned.

    Note that | is special in regular expressions, so if you want to use it as a separator, you have to escape it.
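
    A minimal sketch of that escaping, assuming the separator arrives in a plain scalar: \Q...\E (or quotemeta) makes split treat | literally. The helper name split_fields is hypothetical, and this still has the usual limits of split-based CSV handling (no quoting support).

```perl
use strict;
use warnings;

# Hypothetical helper: split a line on a literal separator string.
sub split_fields {
    my ($line, $separator) = @_;
    # \Q...\E quotes regex metacharacters such as | . or *
    # The -1 limit keeps trailing empty fields.
    return split /\Q$separator\E/, $line, -1;
}

my @fields = split_fields('a|b|42', '|');
print join(',', @fields), "\n";   # prints a,b,42
```

    Without the quoting, /|/ is an empty alternation that matches between every pair of characters, so the line would be split into single characters.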

    Also, you probably meant

    /[^\x00-\x7f]/
    not
    /[^/x00-\x7f]/

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: unicode string comparison (perl 5.26)
by Tux (Canon) on Nov 01, 2019 at 07:10 UTC

    Parsing CSV with regular expressions (or split, which is using regular expressions) is brittle and error-prone. Think nested quotes, separators and newlines, but also Unicode.

    Even if using a real CSV parser, like Text::CSV_XS or Text::CSV, looks like overkill, it will probably save you hours of pulling your hair out later on.

    The two mentioned modules will deal with Unicode quite well, so you probably will not see any of these issues anymore.

    my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1, sep_char => $separator });
    while (my $row = $csv->getline ($fh)) {
        $row->[-1] =~ m/^\s*([0-9]+)\s*\z/ or
            warn "Last field of row ", $csv->record_number, " is not numeric: '$row->[-1]'\n";
    }

    Enjoy, Have FUN! H.Merijn
Re: unicode string comparison (perl 5.26)
by haj (Vicar) on Nov 01, 2019 at 01:25 UTC

    That's a bit short on information... and also probably a bad copypaste in the line #clean it up.

    At which step in the process do you decode your CSV file from "unicode" (which must be one of the Unicode encodings, probably UTF-8)? The decoding, if done with proper error handling, should take care of malformed characters before you do any regex operations.

    Where exactly do you get that message? Perl regular expressions work on characters, and I doubt that it is possible to feed them "malformed characters" from a Perl scalar - unless there's some XS code involved.

      What more info do we need? The error does not occur when removing whitespaces (cleanup), but on checking if $last involves a digit or if $last is unicode, whichever comes first (tried either version, i.e.):

      I1: if($last=~/[^/x00-\x7f]){...#here the exception occurs}
      I2: if($last !~/\d/){... # here the exception occurs if I1 is commented out }
        What more info do we need?

        With the information you've provided so far, I just can't help. I am pretty sure that none of the lines you've shown so far can throw a "Malformed UTF-8 character". Instead, two of the lines are syntax errors. Please take care when providing code samples that they actually demonstrate your point.

        You also haven't quoted the exact error message, which might contain more information about the offending character, as in the following examples:

        Malformed UTF-8 character: \xa4 (unexpected continuation byte 0xa4, with no preceding start byte) at /tmp/a.pl line 3
        Malformed UTF-8 character: \xe4\x22\x20 (unexpected non-continuation byte 0x22, immediately after start byte 0xe4; need 3 bytes, got 1) at /tmp/a.pl line 7.

        Finally, you haven't answered my question about your decoding routine. Perl complains about malformed UTF-8 characters when you feed it a string which you declare as UTF-8 but it isn't, but I can't see any of this in your code.
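
        As a rough sketch of what decoding with error handling can look like, assuming the raw line is held as bytes: Encode's decode with the FB_CROAK check dies on malformed input, so bad bytes are caught before any regex sees them. The helper name decode_utf8_strict is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode raw bytes as strict UTF-8, returning
# undef instead of dying when the bytes are malformed.
sub decode_utf8_strict {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # the input buffer untouched.
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
    my $chars = eval { decode('UTF-8', $bytes, $check) };
    return $chars;
}

my $good = decode_utf8_strict("caf\xc3\xa9");   # valid UTF-8 bytes
my $bad  = decode_utf8_strict("caf\xe4 ");      # start byte with no continuation
print defined $good ? "decoded ok\n" : "malformed\n";
print defined $bad  ? "decoded ok\n" : "malformed\n";
```

        When reading from a file, opening the handle with the '<:encoding(UTF-8)' layer is the usual way to have the decoding done at read time.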

        haj was noting that the line

        $last=$s/\s+//g;  #clean it up

        should be

        $last =~ s/\s+//g;  #clean it up

        Are you able to provide some example data for others to test with?

        Also, your code does not compile. The if ($last !~/\d/) {...} or if($last=~/[^/x00-\x7f]) {...} block should be if ($last !~/\d/) {...} elsif ($last=~/[^\x00-\x7f]/) {...}

Re: unicode string comparison (perl 5.26)
by BillKSmith (Monsignor) on Nov 02, 2019 at 14:18 UTC

    Your comments suggest that you are trying to determine whether or not the string in $last is a valid number. Neither of your attempts will do this (Not even with the syntactic corrections already posted). It is not clear what you mean by 'number'. If you mean an unsigned decimal integer, your first try is on the right track. A string is almost certainly an integer if it does not contain any non-digits ("\D").

    if($last =~ m/\D/) {
        ... # Process as a non-integer string
    }
    else {
        ... # Process the integer
    }

    For any other definition of 'number', you probably should use a module.

    Your error message is almost certainly from an unrelated problem. Fix this much and then post the offending code.

    Bill

      I don't know if that will address the OP's problems:

      $ perl -wE'"6\x{0666}\x{07c6}"=~/\D/ or say "All digits"'
      All digits
      $ perl -Mutf8 -wE'"6٦۶߆६৬੬૬୬௬౬೬൬෬๖໖༦၆႖៦᠖᥌᧖᪆᪖᭖᮶᱆᱖꘦꣖꤆꧖꧶꩖꯶6𐒦𑁬𑃶𑄼𑇖𑋶𑑖𑓖𑙖𑛆𑜶𑣦𑱖𑵖𖩦𖭖𝟔𝟞𝟨𝟲𝟼𞥖6"=~m/\D/ or say "All digits!"'
      All digits!
      

      Enjoy, Have FUN! H.Merijn
        My English/ASCII only background has certainly left me with tunnel vision concerning what is a 'digit'. My algorithm is correct, but the OP will probably have to change the character class to reflect his requirement.
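
        One possible way to change the character class, sketched under the assumption that only the ASCII digits 0-9 should count: either spell out the class as [^0-9] or add the /a modifier (available since perl 5.14) so that \D uses ASCII rules. The helper name has_non_ascii_digit is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string contains anything other than 0-9.
sub has_non_ascii_digit {
    my ($s) = @_;
    return $s =~ /[^0-9]/ ? 1 : 0;
}

my $mixed = "6\x{0666}\x{07c6}";   # ASCII 6, ARABIC-INDIC DIGIT SIX, NKO DIGIT SIX

print "\\D finds nothing\n"   unless $mixed =~ /\D/;    # Unicode digits pass \d
print "\\D with /a objects\n" if     $mixed =~ /\D/a;   # /a restricts \d to 0-9
print "[^0-9] objects\n"      if     has_non_ascii_digit($mixed);
```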
        Bill
      A string is almost certainly an integer if it does not contain any non-digits ("\D").

      Almost:

      >perl -e 'for ("1", "22", "abc", "1e4", "0xABCD", "") { /\D/ or print "\"$_\" is an integer\n" }'
      "1" is an integer
      "22" is an integer
      "" is an integer
      >
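
      A sketch of a stricter check, assuming "integer" means a non-empty run of ASCII digits: anchoring a positive match with \A and \z and requiring at least one digit rejects the empty string as well. The helper name is_unsigned_int is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: true only for a non-empty string of ASCII digits.
sub is_unsigned_int {
    my ($s) = @_;
    return $s =~ /\A[0-9]+\z/ ? 1 : 0;
}

# Same inputs as the one-liner above; only "1" and "22" are reported.
for my $s ("1", "22", "abc", "1e4", "0xABCD", "") {
    print "\"$s\" is an integer\n" if is_unsigned_int($s);
}
```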

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
        I intentionally ignored unlikely exceptions such as a null string or any 'integer' which cannot be represented exactly in perl's floating point format.
        Bill