in reply to Hex-matching Regex pattern in scalar

If you want to change é to e, maybe you want to use Text::Unidecode instead?

Re^2: Hex-matching Regex pattern in scalar
by CliffG (Novice) on May 20, 2016 at 12:17 UTC
    Hello all, and thanks for your thoughts. Perhaps I should explain differently ...

    I am GETting xml docs from an IBM tool using LWP and using LibXML to parse them. This keeps failing due to unparsable characters such as e acute (x'e9') so I need to substitute those characters with parsable ones. My idea was to GET the xml doc then call a subroutine to replace x'e9' with x'65', x'a0' with x'20' and so on before parsing the doc with LibXML.

    The subroutine would write to a temp file then delete the original and rename the temp file. The subroutine would call another whose job it is to replace in a string all instances of one hex value with another.

    So, another way to describe my problem is that I have not been able to write a subroutine that accepts a string, a 'from' hex value and a 'to' hex value and returns a modified string.

    The xml snip I showed as test data is real data snipped from an xml doc retrieved from the tool, and the two unparsable chars I've encountered so far are x'a0' and x'e9' (just e9 in the snip)... there are likely to be others so a generalised 'replacer' seems a good way to go.

    What seemed like a straightforward thing to do has proven otherwise, hence asking the question here - I apologise if what I'm trying to achieve wasn't sufficiently clear. Any help with what ought to be a simple subroutine will be warmly welcomed.

      It looks as if your input XML data is encoded as Latin-1 (despite the header claiming it to be UTF-8). So why not Encode::decode it from Latin-1, save it as UTF-8, and then have LibXML process it?
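A minimal sketch of that decode-then-re-encode step, using only the core Encode module. The helper name latin1_to_utf8 and the sample string are made up for illustration, and the sketch assumes the whole response body really is Latin-1 (ISO-8859-1):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not from the thread): treat the raw bytes as
# Latin-1 and return the same text as UTF-8 bytes.
sub latin1_to_utf8 {
    my ($raw_bytes) = @_;
    my $chars = decode('ISO-8859-1', $raw_bytes);   # bytes -> Perl characters
    return encode('UTF-8', $chars);                 # characters -> UTF-8 bytes
}

# Sample input containing the two problem bytes, x'e9' and x'a0'.
my $fixed = latin1_to_utf8("caf\xe9\xa0au lait");
```

After this step the bytes should match the UTF-8 claim in the XML header, so the result could be handed to LibXML unchanged. With LWP, the raw body would presumably come from $response->content rather than decoded_content, so that nothing else has touched the bytes first.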

      I entirely agree with Corion in that it seems to be a problem with encoding. It would be ideal to fix this at source (the IBM tool). If that isn't possible then Corion's approach sounds like the next best plan.

      However, since you said:

      So, another way to describe my problem is that I have not been able to write a subroutine that accepts a string, a 'from' hex value and a 'to' hex value and returns a modified string.

      let me supply this alternative which shows such a subroutine:

      #!/usr/bin/env perl
      use strict;
      use warnings;
      use Test::More;

      my $instr  = "a\xe9b\xe9";
      my $outstr = replace ($instr, "\xe9", "\x65");
      is ($outstr, 'aebe');
      done_testing;

      # Replace every occurrence of $find in $in with $replace.
      sub replace {
          my ($in, $find, $replace) = @_;
          # \Q...\E keeps any regex metacharacters in $find literal
          $in =~ s/\Q$find\E/$replace/g;
          return $in;
      }
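Since the question asked for 'from' and 'to' hex values, here is a possible variant that takes them as hex strings such as 'e9' and '65'. The name replace_hex and the %map table are assumptions for illustration; only the two bytes mentioned in the thread are mapped:

```perl
use strict;
use warnings;

# Hypothetical variant: $from_hex and $to_hex are hex strings like 'e9'.
sub replace_hex {
    my ($in, $from_hex, $to_hex) = @_;
    my $find    = chr(hex($from_hex));   # 'e9' -> the single byte \xe9
    my $replace = chr(hex($to_hex));     # '65' -> 'e'
    $in =~ s/\Q$find\E/$replace/g;       # \Q...\E keeps the match literal
    return $in;
}

# Illustrative table of substitutions, applied one after another.
my %map  = (e9 => '65', a0 => '20');
my $text = "caf\xe9\xa0noir";
$text = replace_hex($text, $_, $map{$_}) for sort keys %map;
```

Note that this throws information away (é becomes a plain e), which is why fixing the encoding is probably the better route if the accented characters matter.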
        Thanks to all for your help, and the different approach was what I needed. I can't change IBM's product so I took a slightly different angle and simply replaced encoding="UTF-8" with encoding="Latin1" and that allowed LibXML to parse the doc correctly.

        Job done :o)
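For anyone landing here later, the relabelling described above might look roughly like this. The helper name relabel_encoding and the sample document are hypothetical, and the sketch assumes the declaration sits at the very start of the document. ISO-8859-1 is used as the registered name for Latin-1; "Latin1" is the spelling reported to work in this thread:

```perl
use strict;
use warnings;

# Hypothetical helper: change only the encoding named in the XML
# declaration and leave the rest of the document untouched.
sub relabel_encoding {
    my ($xml, $new_encoding) = @_;
    # $1 = everything up to "encoding=", $2 = the quote character used
    $xml =~ s/\A(<\?xml[^>]*?encoding=)(["'])UTF-8\2/$1$2$new_encoding$2/i;
    return $xml;
}

# Sample document: claims UTF-8 but contains the Latin-1 byte x'e9'.
my $doc = qq{<?xml version="1.0" encoding="UTF-8"?>\n<name>Andr\xe9</name>\n};
my $relabelled = relabel_encoding($doc, 'ISO-8859-1');
```

The document body is not modified at all; only the label changes, so the parser interprets x'e9' and x'a0' as Latin-1 characters instead of rejecting them as invalid UTF-8.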