Re: Substitution remove all before

Attempting to parse or manipulate HTML (or XML or similar) with a regular expression is nearly always a bad idea. I'm actually working at the moment and don't have time to pull out references; however, I'm sure others will do so.

From a purely academic perspective:

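^ anchors to the start of $contents: that's fine.
[^<form]* is a negated, bracketed character class (zero or more times): it matches a run of characters other than <, f, o, r and m, not "everything up to the string <form", so it's not what you want. See perlrecharclass for details. It would be better to use .+? (one or more of any character, non-greedily); with a /s modifier, . will also match newlines.
(?=<form) is a positive lookahead: it asserts that "<form" comes next without consuming it, so the match stops just before "<form". That's fine.
The /m modifier: this makes ^ match at the start of every line, not just at the start of $contents. You want to remove from the very start of the string, so it works against you here; don't use this.
The /g modifier: you only want to remove content once; don't use this.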
See perlre for any of the above that you haven't understood.
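
The points above can be checked in isolation with a few lines of core Perl. This is only a minimal sketch: the sample strings and variable names are made up for illustration and are not taken from the original question.

```perl
use strict;
use warnings;

# A string made entirely of characters that [^<form] accepts.
my $class = qr/^[^<form]*$/;

# The class rejects each of <, f, o, r and m individually, in any order.
my $plain_ok = 'HTML 123' =~ $class ? 1 : 0;   # none of the five characters
my $has_m    = 'html'     =~ $class ? 1 : 0;   # contains a lone "m"

# Hypothetical two-line sample with an indented <form> on the second line.
my $text = "<p>intro</p>\n    <form>...</form>\n";

# Without /m, ^ matches only at the start of the string, where "<" stops
# the class immediately, so nothing is removed.
(my $no_m = $text) =~ s/^[^<form]*(?=<form)//;

# With /m, ^ also matches after the newline, so only the indentation
# in front of <form is removed.
(my $with_m = $text) =~ s/^[^<form]*(?=<form)//m;

# The lazy .+? with /s crosses the newline and stops just before <form.
(my $lazy = $text) =~ s/^.+?(?=<form)//s;

print "plain_ok=$plain_ok has_m=$has_m\n";
print "no_m unchanged: ", ($no_m eq $text ? "yes" : "no"), "\n";
print "with_m: $with_m";
print "lazy:   $lazy";
```

The (my $copy = $text) =~ s/...// form is used so that each substitution works on its own copy and the three results can be compared side by side.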

Here's a guess at your original content with a demonstration of your posted regex and my suggested one.

perl -E '
    my $content = q{<!DOCTYPE html>
<html>
<head>
    <title>Whatever</title>
</head>
<body>
    <h1>Heading</h1>
    <form>...</form>
</body>
</html>
};

say "Full:";
say $content;

my $contents = $content;
$contents =~ s/^[^<form]*(?=<form)//mg;
say "\nWith your s///:";
say $contents;

$content =~ s/^.+?(?=<form)//s;

say "\nShortened:";
say $content;
'
Full:
<!DOCTYPE html>
<html>
<head>
    <title>Whatever</title>
</head>
<body>
    <h1>Heading</h1>
    <form>...</form>
</body>
</html>


With your s///:
<!DOCTYPE html>
<html>
<head>
    <title>Whatever</title>
</head>
<body>
    <h1>Heading</h1>
<form>...</form>
</body>
</html>


Shortened:
<form>...</form>
</body>
</html>
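
When the marker is a fixed string such as "<form", the same text-level cut can also be made without a regex, using the core index and substr functions. This is a different technique from the s/// shown above, offered only as a sketch: strip_before is a hypothetical helper name, and like the regex it knows nothing about HTML (a "<form" inside a comment or attribute would be found just the same).

```perl
use strict;
use warnings;

# Hypothetical helper (not from the post above): return $text with
# everything before the first literal occurrence of $marker removed.
sub strip_before {
    my ($text, $marker) = @_;
    my $pos = index($text, $marker);
    # index() returns -1 when the marker is absent; leave the text alone then.
    return $pos < 0 ? $text : substr($text, $pos);
}

# Made-up sample input, shorter than the document used above.
my $html = "<body>\n    <h1>Heading</h1>\n    <form>...</form>\n</body>\n";
print strip_before($html, '<form');
```

Returning the input unchanged when the marker is missing mirrors what s/// does when it fails to match.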

For general usage when regexes aren't doing what you expected, I can highly recommend Regexp::Debugger.

— Ken

Re^2: Substitution remove all before (Parse HTML/XML with Regex References) by eyepopslikeamosquito (Archbishop) on May 18, 2023 at 23:20 UTC
"Attempting to parse or manipulate HTML (or XML or similar) with a regular expression is nearly always a bad idea. I'm actually working at the moment and don't have time to pull out references; however, I'm sure others will do so."

Surprised I don't have a list of references on this topic. Here's a start:

XML (wikipedia)
HTML (wikipedia)
Why a regex really isn't good enough for HTML and XML, even for "simple" tasks by haukex (2020)
Parsing HTML/XML with Regular Expressions by haukex (2017)
Re: Creating an abstract (updated) by haukex (2021) - uses Mojo::DOM
Re: perlre inverse check for several patterns by haukex (2023) - uses Mojo::DOM

References Added Later

regex match open tags except XHTML... (SO)
Re: pattern matching once by me (Aug 2023)
Regexp for HTML by gossamer (Jan 2024)
Regular Expression Assistance by g_speran (Jun 2024)
Parsing a large html with perl by zesys (Jun 2020)
Re^3: Regexp for HTML by marto (Jan 2024) - uses Mojo::DOM
Re: Batch remove URLs by marto (2017) - uses Mojo::DOM
wrap abbreviations in XML element by LexPl (2025) - question re parsing XML with regex (quick response from haukex)
XML::Smart how to prevent encoding <body> tag by zatlas1 (2025) - using XML::Smart (which has not been updated in over 10 years and has issues) ... sadly OP will switch to JSON/Python rather than using a better Perl CPAN module

See Also

Re: material for a talk about regexes (RegEx References)

Re^3: Substitution remove all before by choroba (Cardinal) on May 19, 2023 at 10:55 UTC
I think this external link is also worth mentioning.

`map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]`