Attempting to parse or manipulate HTML (or XML or similar) with a regular expression is nearly always a bad idea.
I'm actually working at the moment and don't have time to pull out references; however, I'm sure others will do so.
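In the meantime, here's a minimal sketch of what the parser route could look like. It assumes Mojo::DOM (part of the Mojolicious distribution on CPAN) is installed; that module choice, the sample document and the variable names are my assumptions, not anything from your post. Note that it pulls out the form element itself rather than chopping off the text in front of it.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;    # assumption: Mojolicious is installed from CPAN

# Hypothetical document, modelled on the kind of page being discussed.
my $html = <<'HTML';
<!DOCTYPE html>
<html>
<head>
<title>Whatever</title>
</head>
<body>
<h1>Heading</h1>
<form>...</form>
</body>
</html>
HTML

# Parse the markup and select the first <form> element with a CSS selector.
my $dom  = Mojo::DOM->new($html);
my $form = $dom->at('form');

# at() returns undef when nothing matches, so guard before rendering.
my $form_html = defined $form ? $form->to_string : '';
say $form_html;
```

Because the selection works on the parsed tree, it doesn't care about attribute order, case or line breaks inside the tag, which is where hand-written regexes usually come unstuck.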
From a purely academic perspective:
- ^ anchors to the start of $contents: that's fine.
- [^<form]* is a negated, bracketed character class (zero or more times). It matches any run of characters other than <, f, o, r and m, which is not what you want.
  See perlrecharclass for details.
  It would be better to use .+? (one or more of any character, non-greedily);
  with a /s modifier, . will also match newlines.
- (?=<form) is a positive lookahead, so the match runs up to but not including "<form": that's fine.
- The /m modifier: this makes ^ match at the start of every line rather than only at the start of $contents, which doesn't help here; drop it.
- The /g modifier: you only want to remove content once; don't use this.
- See perlre for any of the above that you haven't understood.
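To make the points above concrete, here's a small sketch comparing the character class with the lazy dot, and ^ with and without /m. The sample strings and variable names are made up purely for illustration.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical sample string, chosen so the difference is easy to see.
my $text = "hello from\n<form>";

# [^<form]* means "any characters except <, f, o, r, m",
# so it stops at the first 'o', long before "<form".
my ($class) = $text =~ /^([^<form]*)/;
say "character class: '$class'";

# .+? with /s lazily consumes anything, newlines included,
# until the lookahead finds "<form".
my ($lazy) = $text =~ /^(.+?)(?=<form)/s;
say "lazy dot: '$lazy'";

# Without /m, ^ matches only at the start of the string;
# with /m, it matches at the start of every line.
my $lines        = "one\ntwo\nthree\n";
my @string_start = $lines =~ /^(\w+)/g;
my @line_starts  = $lines =~ /^(\w+)/mg;
say "without /m: @string_start";
say "with /m: @line_starts";
```

The character class captures only "hell", while the lazy dot captures everything up to "<form", including the newline. The last two lines print "one" and "one two three" respectively.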
Here's a guess at your original content with a demonstration of your posted regex and my suggested one.
perl -E '
my $content = q{<!DOCTYPE html>
<html>
<head>
<title>Whatever</title>
</head>
<body>
<h1>Heading</h1>
<form>...</form>
</body>
</html>
};
say "Full:";
say $content;
my $contents = $content;
$contents =~ s/^[^<form]*(?=<form)//mg;
say "\nWith your s///:";
say $contents;
$content =~ s/^.+?(?=<form)//s;
say "\nShortened:";
say $content;
'
Full:
<!DOCTYPE html>
<html>
<head>
<title>Whatever</title>
</head>
<body>
<h1>Heading</h1>
<form>...</form>
</body>
</html>
With your s///:
<!DOCTYPE html>
<html>
<head>
<title>Whatever</title>
</head>
<body>
<h1>Heading</h1>
<form>...</form>
</body>
</html>
Shortened:
<form>...</form>
</body>
</html>
For general usage when regexes aren't doing what you expected,
I can highly recommend Regexp::Debugger.