shamanoff has asked for the wisdom of the Perl Monks concerning the following question:

Hello wise monks, I'm using the following regexp for substituting non-ascii characters with a question mark:

s/[[:^ascii:]]/\?/g;

My question is how to substitute all non-ASCII characters with their hex codes using a regexp. For example, the required output would be: "This is string with non-ascii characters, like e2"

Re: hex in regexp
by jwkrahn (Abbot) on May 22, 2012 at 04:55 UTC

    The standard idiom is:

    s/([[:^ascii:]])/ sprintf '%02x', ord $1 /eg;
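
    As a minimal sketch of how that idiom behaves, assuming the string already holds characters rather than encoded bytes, the following applies it to a made-up sample. The variable names and the sample text are hypothetical; the character is written as an escape so the result does not depend on the encoding of the script file.

```perl
use strict;
use warnings;

# Hypothetical sample: "\x{e2}" is U+00E2 (a with circumflex).
my $text = "like \x{e2}";

# Copy the string, then replace each non-ASCII character in the copy
# with its code point written as at least two lowercase hex digits.
(my $hex = $text) =~ s/([[:^ascii:]])/ sprintf '%02x', ord $1 /eg;

print "$hex\n";    # prints "like e2"
```

    Note that '%02x' only pads to two digits; a code point above 0xff simply produces more digits.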
Re: hex in regexp
by mbethke (Hermit) on May 22, 2012 at 06:35 UTC

    If you can be sure that all non-ASCII characters fit in 8 bits (in whatever codepage you and the receiver hopefully agree upon), the suggestions so far are fine. Otherwise I'd use the following, which adds a "%U" before each hex sequence. That makes the sequence easier to tell apart from a word like "face" or "decade" that consists only of valid hex digits, though not perfectly unless you escape existing percent signs too:

    perl -Mutf8 -Mstrict -Mwarnings -E '
        my $x = "φοοβαρ"; # Greek "foobar"
        $x =~ s/([[:^ascii:]])/sprintf("%%U%x", unpack q{U}, $1)/eg;
        say $x;
    '
Re: hex in regexp
by kcott (Archbishop) on May 22, 2012 at 03:47 UTC

    The standard way to do this is like the following:

    $ perl -Mstrict -Mwarnings -E '
    >     my $x = q{X-X!X?X};
    >     $x =~ s/([[:^upper:]])/unpack q{H2}, $1/eg;
    >     say $x;
    > '
    X2dX21X3fX

    Note: I used upper instead of ascii for my example code.

    There may be modules that also supply this functionality. If there are, I'm sure other monks will provide details.

    -- Ken

      Thank you Ken! Your code almost worked. For some reason it didn't substitute the symbols with their hex codes, but just removed them from the string. Here is an example of the string I'm trying to modify: "çe quil Y a Yå". I am still testing to figure out why this happens. An additional question: how exactly can I wrap the hex code in <> brackets in this example?

        Changing upper to ascii and using your string, I get:

        $ perl -Mstrict -Mwarnings -E '
            my $x = q{çe quil Y a Yå};
            $x =~ s/([[:^ascii:]])/unpack q{H2}, $1/eg;
            say $x;
        '
        c383c2a7e quil Y a Yc383c2a5

        Check that you didn't make a typo when entering your code. If you are still having problems, please post your code - as it is, I can't reproduce your problem.

        To wrap the codes in < and >, or any other characters, you can just concatenate the characters at the beginning and end of the hex code:

        $ perl -Mstrict -Mwarnings -E '
            my $x = q{çe quil Y a Yå};
            $x =~ s/([[:^ascii:]])/q{<} . unpack(q{H2}, $1) . q{>}/eg;
            say $x;
        '
        <c3><83><c2><a7>e quil Y a Y<c3><83><c2><a5>

        -- Ken
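
        The examples above emit several hex pairs per accented letter, which is what you would expect when the substitution sees encoded bytes rather than characters. That reading is an assumption; the thread does not confirm it. A minimal sketch of the difference, using the core Encode module and a hypothetical input holding the UTF-8 bytes for "ç" followed by "e":

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical input: the two UTF-8 bytes for U+00E7, then an ASCII "e".
my $bytes = "\xc3\xa7e";

# Byte-wise: each of the two bytes is matched and replaced on its own.
(my $per_byte = $bytes) =~ s/([[:^ascii:]])/'<' . unpack('H2', $1) . '>'/eg;

# Character-wise: decode the bytes first, then format the code point.
my $chars = decode('UTF-8', $bytes);
(my $per_char = $chars) =~ s/([[:^ascii:]])/sprintf '<%02x>', ord $1/eg;

print "$per_byte\n";    # prints "<c3><a7>e"
print "$per_char\n";    # prints "<e7>e"
```

        Which of the two you want depends on whether the receiver expects byte values or code points.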

Re: hex in regexp
by tinita (Parson) on May 22, 2012 at 11:09 UTC
    if you want to learn, look at the various regex suggestions and play around with them. if you want to do it right, use a module like URI::Escape (which also can escape utf8 strings, for example)
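
    A minimal sketch of the module route, assuming URI::Escape is installed from CPAN (it is not a core module). The sample string is hypothetical. Note that uri_escape_utf8 escapes every character outside its default safe set, so spaces and other ASCII punctuation are encoded too, not only non-ASCII characters.

```perl
use strict;
use warnings;
use URI::Escape qw(uri_escape_utf8);

# Hypothetical sample: U+00E7 followed by ASCII text containing a space.
my $text = "\x{e7}e quil";

# Encode the string as UTF-8, then percent-escape every unsafe byte.
my $escaped = uri_escape_utf8($text);

print "$escaped\n";    # prints "%C3%A7e%20quil"
```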
Re: hex in regexp
by Anonymous Monk on May 22, 2012 at 03:29 UTC

    Like this

    use CGI;
    my $freeStuff = CGI->escape( $stuff );

    or this

    use JSON;
    my $freeStuff = JSON->new->ascii(1)->pretty(1)->encode([ $stuff ]);

      Sorry, I did not understand your answer at all. I have just started to learn Perl and sometimes do not recognize the value of an example. My task is to get rid of the non-ASCII symbols that are mixed with ASCII symbols in a text file with multiple rows. I am not sure that the solutions in the first reply do the trick (what is the meaning of $freeStuff and $stuff, anyway?). Could you please provide a clearer example or some direction?

        See, I assumed an XY Problem. I assumed that what you really wanted was to use an already existing encoder/serializer instead of inventing your own, so I showed you two that use hex-encoded characters: one that only does strings (CGI->escape) and one that does complex data structures (JSON).

        I guess since this might be homework a practical solution doesn't fit :)