Re: Regex with HTML::Entities

by Fletch (Bishop)
on Nov 23, 2021 at 07:33 UTC (id://11139042)


in reply to Regex with HTML::Entities

Not sure I'm completely following, but it jumps out at me in your second block that the string in $b contains regular expression metacharacters (specifically parens), so that's possibly the problem. Your "Adjektive (Nominalflexion)~87" is going to be treated as a pattern looking for the string "Adjektive" followed by a space, followed by the string "Nominalflexion" (which will be captured because of the parens), followed by "~87". The literal parentheses in your text are never matched.

If you use \Q...\E escapes to set it up as $c = "\Q{$sep$b$sep}\E", that should escape the metacharacters appropriately and let them match literally.
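To make the difference concrete, here is a minimal sketch of the same idea. The variable names ($needle standing in for your $b, plus $text, $unquoted and $quoted) and the sample input are assumptions for illustration, not your actual code. It only shows that the unquoted interpolation fails on the literal parens while the \Q...\E version matches.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical stand-ins for the variables in the original question.
my $sep    = "\x{2736}";                        # separator character U+2736
my $needle = "Adjektive (Nominalflexion)~87";   # text to find; contains parens
my $text   = "{" . $sep . $needle . $sep . "}"; # sample input: {<sep>...<sep>}

# Unquoted: the parens become a capture group, so the pattern looks for
# "Adjektive Nominalflexion~87" and never matches the literal parens.
my $unquoted = $text =~ /\{$sep$needle$sep\}/ ? "match" : "no match";

# Quoted: \Q...\E backslash-escapes every metacharacter in $needle,
# so the parens, the space and the tilde are all matched literally.
my ($captured) = $text =~ /\{$sep(\Q$needle\E)$sep\}/;
my $quoted = defined $captured ? "match" : "no match";

print "unquoted: $unquoted\n";
print "quoted:   $quoted\n";
print "captured: $captured\n" if defined $captured;
```

If the quoted form still fails against your real data, the pattern is probably not the culprit and the input itself is worth inspecting.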

The cake is a lie.
The cake is a lie.
The cake is a lie.

Re^2: Regex with HTML::Entities
by Horst.Lohnstein (Initiate) on Nov 23, 2021 at 08:57 UTC
    Hi Fletch, thank you for your advice! I checked it with \Q...\E and also with (?: ...), which avoids capturing the expression in the parens, but nothing appears to help. I was wondering whether the tilde ~ was causing trouble, but when quoted that should not be the case. Best, Horst

      Don't know what to tell you other than to try providing an SSCCE that can actually be run. The code below works as I expect, so either you're doing something strange or (entirely possible) your problem statement is being misread.

      (Also, the <code> formatting is doing weird things: I actually have a literal ✶ in my source and in the output, but it is being replaced with the entity below everywhere except the initialization of $wonky_char. Not sure what the right way is to get literal UTF-8 chars into sample code that uses utf8.)

      #!/usr/bin/env perl
      use 5.034;
      use HTML::Entities qw( decode_entities );
      use utf8;

      my $input = qq{{&#10038;Adjektive (Nominalflexion)~87&#10038;}};
      my $wonky_char = decode_entities( q{&#10038;} );
      binmode( STDOUT, q{:utf8} );
      say qq{\$input: $input};
      say qq{\$wonky_char: $wonky_char};

      my $to_match = "Adjektive (Nominalflexion)~87";
      my $new_string = $input =~ s{\{$wonky_char(\Q$to_match\E)$wonky_char\}}{<div>I found '$1'</div>}r;
      say qq{\$new_string: $new_string};

      my $cleaner_regex_sample = $input =~ s{
          \{ $wonky_char
          (\Q$to_match\E)
          $wonky_char \}
      }{<div>Also found '$1'</div>}rx;
      say qq{cleaner: $cleaner_regex_sample};

      exit 0;

      __END__
      $input: {&#10038;Adjektive (Nominalflexion)~87&#10038;}
      $wonky_char: &#10038;
      $new_string: <div>I found 'Adjektive (Nominalflexion)~87'</div>
      cleaner: <div>Also found 'Adjektive (Nominalflexion)~87'</div>

      The cake is a lie.
      The cake is a lie.
      The cake is a lie.

      Please show the output of

      printf "%vX\n", $text;

      I bet your text doesn't actually contain ✶. Did you decode your inputs? You probably have it in its encoded form.
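As a rough illustration of what that printf can reveal, here is a small sketch comparing three forms the separator might take. The variable names and sample strings are assumptions, not taken from your program. It uses the core Encode module to simulate text read from a file without a decoding layer.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw( encode decode );

my $char   = "\x{2736}";              # one decoded character, U+2736
my $bytes  = encode('UTF-8', $char);  # the same character as undecoded UTF-8 bytes
my $entity = '&#10038;';              # the HTML entity, still undecoded

# %vX prints the ordinal of each character in hex, joined by dots.
my $dump_char   = sprintf "%vX", $char;    # 2736
my $dump_bytes  = sprintf "%vX", $bytes;   # E2.9C.B6
my $dump_entity = sprintf "%vX", $entity;  # 26.23.31.30.30.33.38.3B
print "$dump_char\n$dump_bytes\n$dump_entity\n";

# A pattern built from the decoded character only matches decoded text.
my $hit_raw     = $bytes =~ /$char/                   ? 1 : 0;
my $hit_decoded = decode('UTF-8', $bytes) =~ /$char/  ? 1 : 0;
print "raw bytes match: $hit_raw, decoded match: $hit_decoded\n";
```

If the dump of your text shows E2.9C.B6 or the 26.23... sequence where you expect 2736, the input needs decoding before the match can succeed.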


      By the way,

      my $sep = decode_entities('&#10038;');

      is a complicated way of writing

      my $sep = "\N{U+2736}";

      or

      my $sep = "\x{2736}";

      or

      use utf8;
      my $sep = "✶";
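A quick way to convince yourself that these spellings are interchangeable is the sketch below. It assumes the file is saved as UTF-8 (required for the literal under use utf8), and the chr(10038) form and the variable names are additions for illustration, included only to show that decimal 10038 is hex 2736.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;   # the source file itself must be saved as UTF-8

# Four spellings of the same single character, U+2736.
my @forms = ( "\N{U+2736}", "\x{2736}", "✶", chr(10038) );

# Count how many of them equal the first; all four should.
my $same = grep { $_ eq $forms[0] } @forms;
my $len  = length $forms[0];

print "identical forms: $same of ", scalar(@forms), "\n";
print "length in characters: $len\n";
```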
      
