Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

My code to remove all content before the <form tag doesn't seem to work. It needs to work across multiple lines.
$contents =~ s/^[^<form]*(?=<form)//mg;

Replies are listed 'Best First'.
Re: Substitution remove all before
by GrandFather (Saint) on May 18, 2023 at 21:44 UTC

    There are a number of problems with this regex. If you are trying to parse some HTML or similar markup, this is REALLY NOT THE WAY TO DO IT, even if it were doing what you want! But I strongly suspect it doesn't do what you think it does. Consider:

    use strict;

    my @testStrs = ("all this<form>", "meh!<form>");

    for my $str (@testStrs) {
        my $contents = $str;

        $contents =~ s/^[^<form]*(?=<form)//smg;
        print qq(After: "$contents" Before: "$str"\n);
    }

    Prints:

    After: "<form>" Before: "all this<form>"
    After: "meh!<form>" Before: "meh!<form>"

    [^<form] matches any one character that is not '<', 'f', 'o', 'r' or 'm'. Guessing at what you may actually want to do, consider instead:

    use strict;
    use warnings;
    use HTML::TreeBuilder;

    my $htmlFrag = ("<div><form>All good men</form></div>");
    my $root = HTML::TreeBuilder->new_from_content($htmlFrag)->elementify();
    my $form = $root->look_down("_tag", "form");

    print $form->as_text();

    which prints

    All good men

    See HTML::TreeBuilder and HTML::Element.
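
To make the character-class point concrete, here is a minimal sketch that tests each character of "meh!" against [^<form]. The %accepted hash and the $matched flag are illustrative names only, not part of the original reply.

```perl
use strict;
use warnings;

# Record, for each character of "meh!", whether [^<form] accepts it.
my %accepted;
for my $ch (split //, 'meh!') {
    $accepted{$ch} = $ch =~ /[^<form]/ ? 1 : 0;
    printf "%s: %s\n", $ch, $accepted{$ch} ? 'accepted' : 'rejected';
}

# Because 'm' is excluded by the class, the class matches zero characters at
# the start of the string, and the lookahead then fails to see "<form" there.
my $matched = 'meh!<form>' =~ /^[^<form]*(?=<form)/ ? 1 : 0;
print "whole pattern matched: ", ($matched ? 'yes' : 'no'), "\n";
```

The 'm' is rejected because the brackets define a set of single characters, not the literal sequence "<form"; that is why the second test string is left untouched.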

    Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
Re: Substitution remove all before
by kcott (Archbishop) on May 18, 2023 at 22:26 UTC

    Attempting to parse or manipulate HTML (or XML or similar) with a regular expression is nearly always a bad idea. I'm actually working at the moment and don't have time to pull out references; however, I'm sure others will do so.

    From a purely academic perspective:

    • ^ anchors to the start of $contents — that's fine.
    • [^<form]* is a negated, bracketed character class (zero or more times) — not what you want. See perlrecharclass for details. It would be better to use .+? (one or more of any character, non-greedily); with a /s modifier, . will also match newlines.
    • (?=<form) matches up to but not including "<form" — that's fine.
    • The /m modifier lets ^ match at the start of every line rather than only at the start of $contents; that doesn't help here, so leave it off.
    • The /g modifier — you only want to remove content once; don't use this.
    • See perlre for any of the above that you haven't understood.
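
The effect of the /s modifier on . can be checked with a small sketch. The sample string in $html is an assumption made for illustration, since the original question does not show the real content.

```perl
use strict;
use warnings;

# Hypothetical multi-line input standing in for the poster's $contents.
my $html = "<html>\n<body>\n<form>...</form>\n</body>\n";

# Without /s, . cannot cross a newline, so nothing on the first line reaches
# "<form" and the substitution leaves the string unchanged.
(my $without_s = $html) =~ s/^.+?(?=<form)//;

# With /s, . also matches newlines, so everything before "<form" is removed.
(my $with_s = $html) =~ s/^.+?(?=<form)//s;

print "Without /s:\n$without_s\n";
print "With /s:\n$with_s";
```

Note that .+? needs at least one character, so a string that already begins with "<form" is not matched at that first tag.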

    Here's a guess at your original content with a demonstration of your posted regex and my suggested one.

    perl -E '
        my $content = q{<!DOCTYPE html>
    <html>
    <head>
    <title>Whatever</title>
    </head>
    <body>
    <h1>Heading</h1>
    <form>...</form>
    </body>
    </html>
    };
        say "Full:";
        say $content;

        my $contents = $content;
        $contents =~ s/^[^<form]*(?=<form)//mg;
        say "\nWith your s///:";
        say $contents;

        $content =~ s/^.+?(?=<form)//s;
        say "\nShortened:";
        say $content;
    '
    Full:
    <!DOCTYPE html>
    <html>
    <head>
    <title>Whatever</title>
    </head>
    <body>
    <h1>Heading</h1>
    <form>...</form>
    </body>
    </html>

    With your s///:
    <!DOCTYPE html>
    <html>
    <head>
    <title>Whatever</title>
    </head>
    <body>
    <h1>Heading</h1>
    <form>...</form>
    </body>
    </html>

    Shortened:
    <form>...</form>
    </body>
    </html>

    For general usage when regexes aren't doing what you expected, I can highly recommend Regexp::Debugger.

    — Ken

        I think this external link is also worth mentioning.

        map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: Substitution remove all before
by jwkrahn (Abbot) on May 18, 2023 at 20:57 UTC
    $contents =~ s/^.+?(?=<form)//smg;
    Naked blocks are fun! -- Randal L. Schwartz, Perl hacker
      Works great. Thanks.
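
For the literal task in the question (drop everything before the first "<form"), a regex is not strictly required. The following is a minimal sketch using the core index and substr functions; the sample value of $contents is hypothetical, and the search is case-sensitive, so "<FORM" would not be found.

```perl
use strict;
use warnings;

# Hypothetical sample input; the original post does not show $contents.
my $contents = "<html>\n<body>\n<h1>Heading</h1>\n"
             . "<form action=\"x\">...</form>\n</body>\n</html>\n";

# index returns -1 when the substring is absent, so guard before trimming.
my $pos = index($contents, '<form');
$contents = substr($contents, $pos) if $pos >= 0;

print $contents;
```

This still treats the markup as plain text, so the earlier advice about using a real HTML parser for anything beyond a quick trim applies here too.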