Substitution remove all before

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: Substitution remove all before by GrandFather (Saint) on May 18, 2023 at 21:44 UTC
There are a number of problems with this regex. If you are trying to parse HTML or similar markup, this is REALLY NOT THE WAY TO DO IT, even if it were doing what you want! But I strongly suspect it doesn't do what you think it does. Consider: `use strict; my @testStrs = ("all this<form>", "meh!<form>"); for my $str (@testStrs) { my $contents = $str; $contents =~ s/^[^<form]*(?=<form)//smg; print qq(After: "$contents" Before: "$str"\n); }` Prints: `After: "<form>" Before: "all this<form>" After: "meh!<form>" Before: "meh!<form>"` `[^<form]` matches any one character that is not '<', 'f', 'o', 'r' or 'm', so the leading 'm' of "meh!" stops the class from consuming anything and the lookahead fails. Guessing at what you may actually want to do, consider instead: `use strict; use warnings; use HTML::TreeBuilder; my $htmlFrag = ("<div><form>All good men</form></div>"); my $root = HTML::TreeBuilder->new_from_content($htmlFrag)->elementify(); my $form = $root->look_down("_tag", "form"); print $form->as_text();` which prints `All good men`. See HTML::TreeBuilder and HTML::Element. Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
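To make the character-class point concrete, here is a minimal sketch in core Perl. The helper name `strip_before_form` and the sample strings are invented for illustration, and it uses `.*?` rather than a bracketed class; it is one possible way to cut a leading prefix, not a substitute for a real HTML parser.

```perl
use strict;
use warnings;

# Hypothetical helper: remove everything before the first "<form".
# /s lets "." match newlines; there is no /m or /g, so at most one
# removal happens, anchored at the very start of the string.
# .*? (zero or more) is used so a string that already starts with
# "<form" is left alone.
sub strip_before_form {
    my ($text) = @_;
    (my $copy = $text) =~ s/^.*?(?=<form)//s;
    return $copy;
}

# The class [^<form] rejects the single characters "<", "f", "o", "r"
# and "m"; it knows nothing about the five-character sequence "<form".
my $class_matches_m = ('m' =~ /\A[^<form]\z/) ? 1 : 0;   # 0: "m" is excluded
my $class_matches_a = ('a' =~ /\A[^<form]\z/) ? 1 : 0;   # 1: "a" is allowed

for my $str ("all this<form>", "meh!<form>") {
    printf qq(%-18s => "%s"\n), qq("$str"), strip_before_form($str);
}
```

Both sample strings come out as `<form>` with this helper, including the "meh!" case that the bracketed class could not handle.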
Re: Substitution remove all before by kcott (Archbishop) on May 18, 2023 at 22:26 UTC
Attempting to parse or manipulate HTML (or XML or similar) with a regular expression is nearly always a bad idea. I'm actually working at the moment and don't have time to pull out references; however, I'm sure others will do so. From a purely academic perspective: `^` anchors to the start of `$contents`: that's fine. `[^<form]` is a negated, bracketed character class (zero or more times): not what you want. See perlrecharclass for details. It would be better to use `.+?` (one or more of any character, non-greedily); with a `/s` modifier, `.` will also match newlines. `(?=<form)` matches up to but not including "<form": that's fine. The `/m` modifier: there's nothing in your regex that makes this useful. The `/g` modifier: you only want to remove content once; don't use this. See perlre for any of the above that you haven't understood. Here's a guess at your original content with a demonstration of your posted regex and my suggested one: `perl -E ' my $content = q{<!DOCTYPE html> <html> <head> <title>Whatever</title> </head> <body> <h1>Heading</h1> <form>...</form> </body> </html> }; say "Full:"; say $content; my $contents = $content; $contents =~ s/^[^<form](?=<form)//mg; say "\nWith your s///:"; say $contents; $content =~ s/^.+?(?=<form)//s; say "\nShortened:"; say $content; '` Output: `Full: <!DOCTYPE html> <html> <head> <title>Whatever</title> </head> <body> <h1>Heading</h1> <form>...</form> </body> </html> With your s///: <!DOCTYPE html> <html> <head> <title>Whatever</title> </head> <body> <h1>Heading</h1> <form>...</form> </body> </html> Shortened: <form>...</form> </body> </html>` For general usage when regexes aren't doing what you expected, I can highly recommend Regexp::Debugger. - Ken
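The effect of the `/s` modifier described above can be seen in isolation with a small core-Perl sketch. The sample `$page` string is an assumption loosely modelled on the guessed content, with real newlines between lines.

```perl
use strict;
use warnings;

# Assumed sample input: a small multi-line page containing one form.
my $page = "<html>\n<body>\n<h1>Heading</h1>\n<form>...</form>\n</body>\n</html>\n";

# Without /s, "." stops at newlines, so ^.+? can never reach "<form"
# from the start of a multi-line string and nothing is removed.
(my $no_s = $page) =~ s/^.+?(?=<form)//;

# With /s, "." also matches "\n", so the non-greedy .+? runs from the
# start of the string up to (but not including) the first "<form".
(my $with_s = $page) =~ s/^.+?(?=<form)//s;

print "without /s: ", ($no_s eq $page ? "unchanged" : "changed"), "\n";
print "with /s:\n$with_s";
```

The copy-then-substitute idiom `(my $copy = $orig) =~ s/.../.../` keeps `$page` intact so both variants start from the same input.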
Re^2: Substitution remove all before (Parse HTML/XML with Regex References) by eyepopslikeamosquito (Archbishop) on May 18, 2023 at 23:20 UTC
Quoting kcott: "Attempting to parse or manipulate HTML (or XML or similar) with a regular expression is nearly always a bad idea. I'm actually working at the moment and don't have time to pull out references; however, I'm sure others will do so."
Surprised I don't have a list of references on this topic. Here's a start:
- XML (wikipedia)
- HTML (wikipedia)
- Why a regex really isn't good enough for HTML and XML, even for "simple" tasks by haukex (2020)
- Parsing HTML/XML with Regular Expressions by haukex (2017)
- Re: Creating an abstract (updated) by haukex (2021) - uses Mojo::DOM
- Re: perlre inverse check for several patterns by haukex (2023) - uses Mojo::DOM
References Added Later
- regex match open tags except XHTML... (SO)
- Re: pattern matching once by me (Aug 2023)
- Regexp for HTML by gossamer (Jan 2024)
- Regular Expression Assistance by g_speran (Jun 2024)
- Parsing a large html with perl by zesys (Jun 2020)
- Re^3: Regexp for HTML by marto (Jan 2024) - uses Mojo::DOM
- Re: Batch remove URLs by marto (2017) - uses Mojo::DOM
- wrap abbreviations in XML element by LexPl (2025) - question re parsing XML with regex (quick response from haukex)
- XML::Smart how to prevent encoding <body> tag by zatlas1 (2025) - using XML::Smart (which has not been updated in over 10 years and has issues) ... sadly OP will switch to JSON/Python rather than using a better Perl CPAN module
See Also
- Re: material for a talk about regexes (RegEx References)
Re^3: Substitution remove all before by choroba (Cardinal) on May 19, 2023 at 10:55 UTC
I think this external link is also worth mentioning. `map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]`
Re: Substitution remove all before by jwkrahn (Abbot) on May 18, 2023 at 20:57 UTC
`$contents =~ s/^.+?(?=<form)//smg;` Naked blocks are fun! -- Randal L. Schwartz, Perl hacker
Re^2: Substitution remove all before by Anonymous Monk on May 18, 2023 at 21:21 UTC
Works great. Thanks.
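For a page with a single form, the `/smg` one-liner and the `/s`-only version give the same result. The sketch below, in core Perl with an invented two-form `$page`, shows where they can diverge, which is the reason for the earlier advice to drop `/m` and `/g`.

```perl
use strict;
use warnings;

# Assumed sample input: two forms, each on its own line.
my $page = "<p>a</p>\n<form>1</form>\n<p>b</p>\n<form>2</form>\n";

# /s only: a single match anchored at the start of the string.
(my $once = $page) =~ s/^.+?(?=<form)//s;

# /smg: ^ also matches at each line start and the substitution repeats,
# so the line holding the first form is itself consumed up to the
# second "<form".
(my $repeated = $page) =~ s/^.+?(?=<form)//smg;

print "with /s:\n$once\n";
print "with /smg:\n$repeated";
```

With this input the `/s` version keeps both forms, while the `/smg` version keeps only the second one.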