PerlMonks  

perlre inverse check for several patterns

by averlon (Sexton)
on Jun 02, 2023 at 13:58 UTC ( [id://11152608]=perlquestion )

averlon has asked for the wisdom of the Perl Monks concerning the following question:

I have the following string as an example

xxx<pre>xxx<www>xxx<strong>xxx

In my case "pre" is allowed and "strong" as well. But "www" is not allowed.

Just as a test, I tried to find one exception with:

<(?!strong)>

But that did not produce a match!

In the end I need to define all allowed patterns in the regex (there are not many in my case), and it should match if some pattern that is not allowed is found.

Can someone help me with this?

Regards Kallewirsch

Replies are listed 'Best First'.
Re: perlre inverse check for several patterns
by hv (Prior) on Jun 02, 2023 at 15:36 UTC

    The easy way would be to set up a hash of "good" words, then scan for each substring to check against the hash:

    my %good = map +($_ => 1), qw{ pre strong };
    my $string = "xxx<pre>xxx<www>xxx<strong>xxx";
    while ($string =~ m{<(\w+)>}g) {
        warn "bad word '$1'" unless $good{$1};
    }

    If you don't care why a string is invalid, then it is easily done in a single match, something like:

    print "ok\n" if $string =~ m{ ^ ( [^<] | < (?: pre | strong ) > )* $ }x;

    (Note that this will also reject a string with an unclosed '<', which the first example will not.)
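
To make the difference between the two approaches concrete, here is a minimal sketch that wraps each one in a sub and runs both over three inputs, the last with an unclosed '<'. The sub names bad_words and is_clean and the two extra test strings are made up for illustration; only the two patterns come from the post above. The group in the second pattern is written as non-capturing here, which should not change what it matches.

```perl
use strict;
use warnings;

# Allowed words, as in the question. The sub names below are hypothetical.
my %good = map { $_ => 1 } qw( pre strong );

# Approach 1: collect every <word> that is not in the allowed set.
sub bad_words {
    my ($string) = @_;
    my @bad;
    while ($string =~ m{<(\w+)>}g) {
        push @bad, $1 unless $good{$1};
    }
    return @bad;
}

# Approach 2: accept the string only if it consists entirely of
# non-'<' characters and allowed tags.
sub is_clean {
    my ($string) = @_;
    return $string =~ m{ ^ (?: [^<] | < (?: pre | strong ) > )* $ }x ? 1 : 0;
}

for my $s ('xxx<pre>xxx<www>xxx<strong>xxx', 'xxx<pre>xxx', 'xxx<pre>xxx<oops') {
    my @bad = bad_words($s);
    printf "%-32s bad words: [%s] clean: %d\n", $s, join(',', @bad), is_clean($s);
}
```

The hash scan reports which words are bad but does not notice the dangling '<oops'; the anchored match only answers yes or no, but it does reject that string.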

    If neither of those is what you want, it would be useful if you could say more about precisely what you want to achieve.

Re: perlre inverse check for several patterns
by hippo (Bishop) on Jun 02, 2023 at 14:22 UTC

    That looks like HTML. Don't parse HTML with regex, that way lies madness.

    OK, with that out of the way, your match fails because a lookahead is zero-width: it consumes nothing, so your pattern requires the > to come directly after the <. Include the right angle bracket inside the lookahead and you're good to go.

    use strict;
    use warnings;
    use Test::More tests => 1;

    my $str = 'xxx<pre>xxx<www>xxx<strong>xxx';
    # passes if the string contains a '<' that does not start '<strong>'
    like $str, qr/<(?!strong>)/, "Valid tag found";
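
A small sketch of why the position of the '>' matters, using only core Perl. A lookahead is an assertion and consumes no characters, so in the pattern from the question the literal '>' has to sit directly after the '<'. The variable names and the 'xxx<>xxx' string are illustrative and not from the thread.

```perl
use strict;
use warnings;

my $str = 'xxx<pre>xxx<www>xxx<strong>xxx';

# '<', then "not followed by 'strong'" (consumes nothing), then '>' must
# come next: this can only ever match the two literal characters '<>'.
my $outside = $str =~ /<(?!strong)>/ ? 1 : 0;

# With '>' inside the lookahead: any '<' that does not begin '<strong>'.
my $inside = $str =~ /<(?!strong>)/ ? 1 : 0;

# The pattern from the question does match when a bare '<>' is present.
my $bare = 'xxx<>xxx' =~ /<(?!strong)>/ ? 1 : 0;

print "outside=$outside inside=$inside bare=$bare\n";
```

Note that the second pattern matches at '<pre' here, so on its own it only excludes one tag.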

    But again, don't do this. Use an HTML parser. You'll thank me later. :-)


    🦛

      Hi hippo!

      No, it is not HTML. It is some interface using some formatting strings like (!!like!!) HTML. But unfortunately the interface crashes if some "<>" strings are included which do not match the allowed formatting strings.

      The strings I process are lines from logfiles. Unfortunately, some of these lines include "<xxx>" strings. This causes trouble for the interface I use, so I need to filter them out.

      Meanwhile I found out that I get a "true" if I use the following code:

      $av_tmp_STRING = "xxx<pre>xxx<www>xxx<strong>xxx";
      if ( $av_tmp_STRING =~ m/<(?!strong>)(?!pre)/ ) {
          #do something with the string which contains wrong patterns
      }

      Still testing whether it really works.

      But anyhow, I will keep the example in mind for other uses!

      Thanks

      Regards Kallewirsch
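
One possible way to generalize the pattern above to a list of allowed tags is a single negative lookahead with an alternation, sketched below. The pattern in the post writes (?!pre) without the closing bracket, so a string such as '<pree>' would count as allowed; keeping the '>' inside the lookahead avoids that. The names @allowed, $bad_tag and has_bad_tag are hypothetical, and the sketch assumes only opening tags as in the question (a closing tag such as '</strong>' would be reported as bad unless the pattern is extended).

```perl
use strict;
use warnings;

# Assumed allow-list, taken from the example in the question.
my @allowed = qw( pre strong );
my $alt     = join '|', map { quotemeta($_) } @allowed;

# Matches a '<' that does not start one of the allowed tags. The '>' is
# inside the lookahead, so '<pree>' is not mistaken for '<pre>'.
my $bad_tag = qr/<(?!(?:$alt)>)/;

sub has_bad_tag {
    my ($line) = @_;
    return $line =~ $bad_tag ? 1 : 0;
}

for my $line ('xxx<pre>xxx<www>xxx<strong>xxx', 'xxx<pre>xxx<strong>xxx', 'xxx<pree>xxx') {
    printf "%-32s %s\n", $line, has_bad_tag($line) ? 'rejected' : 'ok';
}
```

Like the single-match approach elsewhere in this thread, this also flags a stray '<' that never closes.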

        > no, it is not HTML. It is some interface using some formatting strings like (!!like!!) HTML. But unfortunately the interface crashes if some "<>" strings are included which do not match the allowed formatting strings.

        Could you enlighten us as to what exactly this format is and perhaps provide a more representative sample? Also, is it not feasible to fix the crashes in the interface?

        > it is not HTML. It is some interface using some formatting strings like (!!like!!) HTML. But unfortunately the interface crashes if some "<>" strings are included which do not match the allowed formatting strings.

        So what's wrong with tybalt89's approach?

        see Re: perlre inverse check for several patterns

        Cheers Rolf
        (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
        Wikisyntax for the Monastery

Re: perlre inverse check for several patterns
by tybalt89 (Monsignor) on Jun 02, 2023 at 17:22 UTC

    Perhaps you just want to remove any patterns that are not allowed...

    #!/usr/bin/perl

    use strict; # https://perlmonks.org/?node_id=11152608
    use warnings;

    my @lines = split /^/, <<END;
    xxx<pre>xxx<www>xxx<strong>xxx
    xxx<pre>xxxxxx<strong>xxx
    xxx<pre>xxx<pr>xxx<strong>xxx
    xxx<pre>xxx<pree>xxx<strong>xxx
    xxx<pre>xxx<strong>xxx<strong>xxx
    xxx<pre>xxx<pre>xxx<strong>xxx
    END

    my %allowed = map { ( '<'.$_.'>' ) x 2 } qw( pre strong );

    for ( @lines ) {
        my $clean = s[<\w+>][ $allowed{$&} // '' ]ger;
        print $clean;
    }

    Outputs:

    xxx<pre>xxxxxx<strong>xxx
    xxx<pre>xxxxxx<strong>xxx
    xxx<pre>xxxxxx<strong>xxx
    xxx<pre>xxxxxx<strong>xxx
    xxx<pre>xxx<strong>xxx<strong>xxx
    xxx<pre>xxx<pre>xxx<strong>xxx

Re: perlre inverse check for several patterns
by haukex (Archbishop) on Jun 02, 2023 at 14:35 UTC

    Do not use regular expressions to parse HTML/XML. Assuming your input is indeed HTML, here's a possible solution using Mojo::DOM, based on my code here.

    use warnings;
    use strict;

    print html_filter(<<'END_HTML', qw/pre strong i/), "\n";
    aaa<pre>bbb</pre>ccc<www><i>ddd</i><strong>eee</strong>fff</www>ggg
    END_HTML

    use Mojo::DOM;
    sub html_filter {
        my $html = shift;
        my %allowed = map {$_=>1} @_;
        my $walk;
        $walk = sub {
            my ($in, $out) = @_;
            for my $n ( @{ $in->child_nodes } ) {
                if ( $n->type eq 'cdata' || $n->type eq 'text' ) {
                    $out->append_content($n->content)
                }
                elsif ( $n->type eq 'tag' ) {
                    if ($allowed{$n->tag}) {
                        my $t = $out->new_tag( $n->tag, %{$n->attr} )
                            ->child_nodes->first;
                        $walk->($n, $t);
                        $out->append_content($t);
                    }
                    else { $walk->($n, $out) }
                } # ignore other node types for now
            }
            return $out;
        };
        return $walk->(Mojo::DOM->new($html), Mojo::DOM->new)->to_string;
    }
    __END__
    aaa<pre>bbb</pre>ccc<i>ddd</i><strong>eee</strong>fffggg

Re: perlre inverse check for several patterns
by kcott (Archbishop) on Jun 04, 2023 at 11:41 UTC

    G'day Kallewirsch,

    It would have been better had you provided all information up-front. From your responses to hippo and haukex, I've determined the following.

    You're using WWW::Telegram::BotAPI. This is a front-end to "Telegram Bot API". Its sendMessage method documentation describes "HTML style". You should read that entire section; this extract highlights the main point that applies to you:

    "... All <, > and & symbols ... must be replaced with the corresponding HTML entities ..."

    From that, and based on what you've revealed so far, you need to modify $av_tmp_LINE before combining it with $av_tmp_STRING. Here's an example:

    $ perl -e '
        use 5.010;
        use strict;
        use warnings;

        my $av_tmp_LINE = "Jun 3 23:20:05 f42252s5 postfix/pickup[204714]: E1E63A045C: uid=33 from=<www-data>";
        say "BEFORE: $av_tmp_LINE";

        $av_tmp_LINE =~ s/([&<>])/char_to_entity($1)/eg;
        say "AFTER: $av_tmp_LINE";

        my $av_tmp_STRING = "Logfile: "
            . "<strong>" . q{$av_obj_TMP->{input}} . "</strong>"
            . " " . $av_tmp_LINE;
        say "\$av_tmp_STRING[$av_tmp_STRING]";

        sub char_to_entity {
            my ($char) = @_;
            state $entity_for = {qw{& &amp; < &lt; > &gt;}};
            return $entity_for->{$char};
        }
    '
    BEFORE: Jun 3 23:20:05 f42252s5 postfix/pickup[204714]: E1E63A045C: uid=33 from=<www-data>
    AFTER: Jun 3 23:20:05 f42252s5 postfix/pickup[204714]: E1E63A045C: uid=33 from=&lt;www-data&gt;
    $av_tmp_STRING[Logfile: <strong>$av_obj_TMP->{input}</strong> Jun 3 23:20:05 f42252s5 postfix/pickup[204714]: E1E63A045C: uid=33 from=&lt;www-data&gt;]

    Note: I haven't tried to interpolate $av_obj_TMP->{input} as I've no idea what its value is.

    — Ken
