CliffG has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I need some guidance. I am trying to write a little subroutine that takes before and after values and creates a substitution expression, so that I can, for example, replace non-displayable characters in a file with displayable ones. However, I cannot make this work with hex values; what I'm attempting looks like this (it is meant to replace hex 'e9' with char '\x65'):
    my $pre = '\xe9';
    my $post = '\x65';
    my $re = qr/s\/$pre\/$post\/g/;
    my $path = 'C:\Scripts\Working2';
    my $fileSep = "\\";
    my $file = 'Users_0.xml';
    my $tempFile = 'C:\Scripts\Working2\Users2.xml';
    if (!open(IF, "<$path$fileSep$file")) {
        die("Could not open file $path$fileSep$file $!");
    }
    if (!open(OF, ">$tempFile")) {
        die("Could not open file $tempFile $!");
    }
    while($str = <IF>) {
        $str =~ $re;
        print OF $str;
    }
    close IF;
    close OF;
I know the substitution '$str =~ s/\xe9/\x65/g;' works just fine but parameterising it is the problem. Straightforward for char strings but seemingly less so for hex ones ... Can someone please advise?

Replies are listed 'Best First'.
Re: Hex-matching Regex pattern in scalar
by hippo (Archbishop) on May 20, 2016 at 11:02 UTC

    It's not entirely clear to me what you are trying to achieve here, especially with no sample data. However, since it looks to me like you might be trying to transliterate characters perhaps tr is the way to go? Ignoring all the file ops:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Test::More;

    my $instr = "a\xe9b\xe9";
    my $outstr = $instr =~ tr/\xe9/e/r;
    is ($outstr, 'aebe');
    done_testing;

    Does this fit your requirements and if not please specify in detail how not?

Re: Hex-matching Regex pattern in scalar
by Corion (Patriarch) on May 20, 2016 at 11:04 UTC

    If you want to change é to e, maybe you want to use Text::Unidecode instead?

      Hello all, and thanks for your thoughts. Perhaps I should explain differently ...

      I am GETting xml docs from an IBM tool using LWP and using LibXML to parse them. This keeps failing due to unparsable characters such as e acute (x'e9') so I need to substitute those characters with parsable ones. My idea was to GET the xml doc then call a subroutine to replace x'e9' with x'65', x'a0' with x'20' and so on before parsing the doc with LibXML.

      The subroutine would write to a temp file then delete the original and rename the temp file. The subroutine would call another whose job it is to replace in a string all instances of one hex value with another.

      So, another way to describe my problem is that I have not been able to write a subroutine that accepts a string, a 'from' hex value and a 'to' hex value and returns a modified string.

      The xml snip I showed as test data is real data snipped from an xml doc retrieved from the tool, and the two unparsable chars I've encountered so far are x'a0' and x'e9' (just e9 in the snip)... there are likely to be others so a generalised 'replacer' seems a good way to go.

      What seemed like a straightforward thing to do has proven otherwise, hence asking the question here - I apologise if what I'm trying to achieve wasn't sufficiently clear. Any help with what ought to be a simple subroutine will be warmly welcomed.
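The generalised 'replacer' described above could be sketched as a single pass driven by a lookup table. This is only an illustration: the name `replace_bytes` and the contents of `%map` are assumptions, not something from the thread, and each key is expected to be exactly one character.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): replace every single character
# that appears as a key of %$map with the corresponding value.
sub replace_bytes {
    my ($in, $map) = @_;

    # Build a character class from the keys, escaping anything that
    # would otherwise be special inside a regex.
    my $class = join '', map { quotemeta } keys %$map;
    return $in if $class eq '';

    # One pass over the string; the replacement is looked up per match.
    $in =~ s/([$class])/$map->{$1}/g;
    return $in;
}

# Assumed mapping, based on the two characters mentioned so far.
my %map = ( "\xe9" => "e", "\xa0" => " " );
my $out = replace_bytes("Shilpa\xe9\xa0Durgale", \%map);
print "$out\n";
```

Adding another problem character then only means adding another entry to the table rather than another substitution.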

        It looks as if your input XML data is encoded as Latin-1 (despite the header claiming it to be UTF-8). So why not Encode::decode it from Latin-1 and save it as UTF-8 and then have LibXML process it?
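A minimal sketch of that re-encoding step, assuming the downloaded document really is Latin-1 (ISO-8859-1) throughout. The helper name `latin1_to_utf8` is made up for illustration; `decode` and `encode` are the functions exported by the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: interpret the raw bytes as Latin-1 and return
# the same text as UTF-8 encoded bytes.
sub latin1_to_utf8 {
    my ($bytes) = @_;
    my $chars = decode('ISO-8859-1', $bytes);   # bytes -> Perl characters
    return encode('UTF-8', $chars);             # characters -> UTF-8 bytes
}

my $raw  = "<name>Shilpa\xe9 Durgale</name>";   # \xe9 is e acute in Latin-1
my $utf8 = latin1_to_utf8($raw);                # e acute is now the two bytes \xc3\xa9
```

With this approach the é survives instead of being flattened to e, and the resulting bytes would match the UTF-8 declaration in the XML header.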

        I entirely agree with Corion in that it seems to be a problem with encoding. It would be ideal to fix this at source (the IBM tool). If that isn't possible then Corion's approach sounds like the next best plan.

        However, since you said:

        So, another way to describe my problem is that I have not been able to write a subroutine that accepts a string, a 'from' hex value and a 'to' hex value and returns a modified string.

        let me supply this alternative which shows such a subroutine:

        #!/usr/bin/env perl
        use strict;
        use warnings;
        use Test::More;

        my $instr = "a\xe9b\xe9";
        my $outstr = replace ($instr, "\xe9", "\x65");
        is ($outstr, 'aebe');
        done_testing;

        sub replace {
            my ($in, $find, $replace) = @_;
            $in =~ s/$find/$replace/g;
            return $in;
        }
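Note that the subroutine above is passed double-quoted strings, so "\xe9" is already a single character by the time it is called. In the original question the values were single-quoted, so '\x65' is the four characters backslash, x, 6, 5, and the replacement side of s/// does not re-interpret those. If the 'from' and 'to' values arrive as hex digits in plain text, one possible variant converts them with `chr` and `hex` first. The name `replace_hex` and the two-digit restriction are assumptions made for this sketch.

```perl
use strict;
use warnings;

# Hypothetical variant: $from_hex and $to_hex are two-digit hex strings
# such as 'e9' and '65' (an assumption, not part of the thread).
sub replace_hex {
    my ($in, $from_hex, $to_hex) = @_;

    for my $h ($from_hex, $to_hex) {
        die "not a two-digit hex value: $h"
            unless $h =~ /\A[0-9a-fA-F]{2}\z/;
    }

    my $find    = chr hex $from_hex;   # 'e9' -> the character with code 0xE9
    my $replace = chr hex $to_hex;     # '65' -> 'e'

    # \Q...\E keeps characters such as '.' or '*' from acting as regex syntax.
    $in =~ s/\Q$find\E/$replace/g;
    return $in;
}

my $outstr = replace_hex("a\xe9b\xe9", 'e9', '65');
```

The quoting matters once the 'from' value is something like 2e (a full stop), which would otherwise match every character.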
Re: Hex-matching Regex pattern in scalar ( substitution
by Anonymous Monk on May 20, 2016 at 09:51 UTC

    You're not performing any substitution, you're just matching, because qr only returns a regex object.

    #!/usr/bin/perl --
    use strict;
    use warnings;
    use Path::Tiny qw/ path /;

    my $infile = '...';
    my $outfile = '...';
    my $find = VerifyHex( '\xe9' );
    my $replace = VerifyHex( '\x65' );
    my $IF = path( $infile )->openr_raw;
    my $OF = path( $outfile )->openw_raw;
    while( my $str = <$IF> ){
        $str =~ s{$find}{$replace}g;
        print $OF $str;
    }
    close $IF;
    close $OF;

    sub VerifyHex {
        my( $str ) = @_;
        if( $str =~ m/(\\[a-zA-Z0-9][a-zA-Z0-9])/ ){
            return "$1";
        }
        die "evil input $str";
    }

    Yes you could use https://metacpan.org/pod/Path::Tiny#edit_lines-edit_lines_utf8-edit_lines_raw but I didn't want to change code too much
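Since qr// can only hold the pattern half of a substitution, one way to pass a whole "substitution expression" around is to wrap it in a code reference. This is a sketch of that idea rather than anything proposed in the thread; `make_replacer` is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical factory: returns a code ref that performs the substitution.
sub make_replacer {
    my ($find, $replace) = @_;
    my $re = qr/\Q$find\E/;        # the compiled pattern only, no s///
    return sub {
        my ($str) = @_;
        $str =~ s/$re/$replace/g;  # the substitution happens here
        return $str;
    };
}

# Double-quoted arguments, so each is a single character.
my $fix = make_replacer("\xe9", "\x65");
my $out = $fix->("a\xe9b\xe9");
```

The returned sub can then be applied to each line read from the input file.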

      Thanks, but this doesn't work for me. I know that qr returns a regex object; I was only using it because a straight substitution doesn't do the trick. I don't have Path::Tiny so I amended your example:
      #!/usr/bin/perl --
      use strict;
      use warnings;

      my $infile = 'C:\Scripts\Working2\Users_0.xml';
      my $outfile = 'C:\Scripts\Working2\Users2.xml';
      my $find = VerifyHex( '\xe9' );
      my $replace = VerifyHex( '\x65' );
      open(IF, "<$infile") or die "Could not open $infile $!";
      open(OF, ">$outfile") or die "Could not open $outfile $!";
      binmode IF;
      binmode OF;
      while( my $str = <IF> ){
          $str =~ s{$find}{$replace}g;
          print OF $str;
      }
      close IF;
      close OF;

      sub VerifyHex {
          my( $str ) = @_;
          if( $str =~ m/(\\[a-zA-Z0-9][a-zA-Z0-9])/ ){
              return "$1";
          }
          die "evil input $str";
      }
      The input file contains this (edited from an extract from an IBM tool), which says it's UTF-8 but isn't:
      <?xml version="1.0" encoding="UTF-8" ?>
      <foundation Version="1.0.0">
        <contributor>
          <userId>C12760</userId>
          <name>Shilpaé Durgale</name>
        </contributor>
      </foundation>
      Sadly the output is unchanged from the input, the é is not replaced with e. As you can tell I'm no genius with Perl and I'm sure I'm missing something fundamental. Thoughts?
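One likely reason the output is unchanged: the pattern in VerifyHex captures a backslash followed by only two characters, so for '\xe9' it returns the three characters \xe, which as a regex matches character 0x0E rather than 0xE9. The replacement side has the same problem and would in any case be inserted literally instead of being interpreted as an escape. A hedged sketch of a stricter parser that returns the actual character instead; the name `parse_hex_escape` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical replacement for VerifyHex: accept exactly the text \xHH
# (backslash, x, two hex digits) and return the character it denotes.
sub parse_hex_escape {
    my ($str) = @_;
    if ($str =~ /\A\\x([0-9a-fA-F]{2})\z/) {
        return chr hex $1;
    }
    die "unexpected input $str";
}

my $find    = parse_hex_escape('\xe9');   # single quotes: four literal characters in
my $replace = parse_hex_escape('\x65');   # one character out ('e')

my $str = "Shilpa\xe9 Durgale";
$str =~ s/\Q$find\E/$replace/g;
print "$str\n";
```

Because both values are real characters before they reach s///, neither side of the substitution depends on escape sequences being re-parsed.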