How to extract a pattern in Perl regex?

SergioQ has asked for the wisdom of the Perl Monks concerning the following question:

Re: How to extract a pattern in Perl regex? by Fletch (Bishop) on Apr 30, 2020 at 02:47 UTC
Before you go too far down this route, be forewarned that parsing (arbitrary) HTML with regular expressions is going to be a world of pain. You'll be better served by a real parser (Mojo::DOM or the like). That being said, there's no way that `s/^(alt img)//` is going to match your sample text, since it ignores the initial `<` character on the alt tag.
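A minimal sketch of the anchoring problem described above. The sample string is a hypothetical stand-in, since the thread does not show the original input:

```perl
use strict;
use warnings;

# Hypothetical sample; the original input is not shown in the thread.
my $text = '<alt img="logo.png">';

# Anchored at the start of the string: fails, because the string
# begins with '<', not with 'alt img'.
my $anchored = $text =~ /^(alt img)/ ? 'match' : 'no match';

# Accounting for the leading '<' lets the anchored pattern match.
my $with_bracket = $text =~ /^<(alt img)/ ? 'match' : 'no match';

print "anchored: $anchored\n";
print "with '<': $with_bracket\n";
```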
Re: How to extract a pattern in Perl regex? by kcott (Archbishop) on Apr 30, 2020 at 08:06 UTC
G'day SergioQ,
If the data you're dealing with is HTML, then `<alt img="....">` is invalid. I suspect this is meant to be the img element, which may look like any of these:
`<img alt="..." src="...">`
`<img src="..." alt="...">`
`<img src="...">`
or any number of other variations, including a variety of other attributes (`id="..."`, `class="..."`, and so on) which could appear anywhere between `<img` and `>`; it may also have `/>` instead of `>` at the end. Even if it's not HTML (perhaps it's XML), you'll likely have the same problem with an expected order. This is why you've been advised against using regular expressions for this type of work. I strongly recommend you take a look at "Parsing HTML/XML with Regular Expressions". This expands on the issues and provides many alternatives: you'd do well to choose one of these.
- Ken
Re: How to extract a pattern in Perl regex? by AnomalousMonk (Archbishop) on Apr 30, 2020 at 06:08 UTC
I second Fletch's advice to avoid parsing HTML with regex. However, another problem you may have is that you are not, strictly speaking, matching, but substituting (with the empty string). E.g.:
`c:\@Work\Perl\monks>perl -wMstrict -le "my $urlresult = 'an Alt Img and another ALT IMG here'; print qq{string has 'alt img': '$urlresult'}; ;; if ($urlresult =~ s/(alt img)//igm) { print qq{string had 'alt img', but no more: '$urlresult'}; } "`
`string has 'alt img': 'an Alt Img and another ALT IMG here'`
`string had 'alt img', but no more: 'an  and another  here'`
(This example only works because I've removed the `^` start-of-string anchor.) Please see perlre, perlretut and perlrequick. If you have more questions, please feel free to ask. Please see "How do I post a question effectively?", "Short, Self-Contained, Correct Example" and "How to ask better questions using Test::More and sample data".
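To make the match-versus-substitute distinction concrete, here is a small sketch using the same sample string. It assumes Perl 5.14 or later for the `/r` flag; the variable names are illustrative only:

```perl
use strict;
use warnings;

my $text = 'an Alt Img and another ALT IMG here';

# Matching in list context returns the captures and leaves $text alone.
my @found = $text =~ /(alt img)/ig;

# s///r (Perl 5.14+) returns a modified copy instead of editing in place.
my $stripped = $text =~ s/alt img//igr;

print "found:    @found\n";
print "original: $text\n";
print "stripped: $stripped\n";
```

Only a plain `s///` without `/r` changes the original variable, which is what happened in the one-liner above.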
Re: How to extract a pattern in Perl regex? by marto (Cardinal) on Apr 30, 2020 at 07:46 UTC
Mojo::DOM is for sure the way to go here; it makes this kind of work trivial. See "Re^2: running an example script with WWW::Mechanize* module"; Super Search will find more. Post a URL or some HTML you are working with and I'll make some suggestions. Update: fixed link, too early for me...
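As a rough illustration of why a parser helps with the attribute-order problem raised earlier in the thread, the sketch below runs Mojo::DOM over a literal string rather than a live page. The HTML is made up for the example, and it assumes Mojolicious is installed:

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Hypothetical HTML; attribute order and tag case vary on purpose.
my $html = <<'HTML';
<html><head><title>Sample page</title></head>
<body>
<img alt="first" src="a.png">
<IMG src="b.png" class="x" alt="second" />
<img src="c.png">
</body></html>
HTML

my $dom = Mojo::DOM->new($html);

say 'Title: ', $dom->at('title')->text;

# 'img[alt]' selects only the img elements that carry an alt attribute,
# wherever that attribute appears inside the tag.
for my $img ( $dom->find('img[alt]')->each ) {
    say 'alt: ', $img->attr('alt'), ' src: ', $img->attr('src');
}
```

The CSS selector does the work that an order-sensitive regex cannot: the third image is skipped simply because it has no `alt` attribute.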
Re: How to extract a pattern in Perl regex? by hippo (Archbishop) on Apr 30, 2020 at 08:39 UTC
You've had some great advice so far - please take time to read through it all. You will reap the benefits. My own small addition to it is to point out that there is an FAQ which covers precisely this topic. The fact that it echoes what you've already heard just helps to reinforce the point. As a final gift, I will also point out that extracting the contents of the `<title>` element from an HTML doc is one of the provided examples in the HTML::Parser documentation.
Re^2: How to extract a pattern in Perl regex? by SergioQ (Scribe) on May 01, 2020 at 03:08 UTC
Yes, I'm looking at the recommended methods, and that "^" was a typo. However, part of my question was how to extract, in one statement, what's in between the `<title>` tags. The way I worked around it was:
`$result =~ /(<title>.*<\/title>)/mgi; my $newresult = $1; $newresult =~ s/<title>//i; $newresult =~ s/<\/title>//i;`
Surely there's a simpler way?
Re^3: How to extract a pattern in Perl regex? by marto (Cardinal) on May 01, 2020 at 09:27 UTC
Using Mojo::DOM (with Mojo::UserAgent to pull live data):
`#!/usr/bin/perl`
`use strict;`
`use warnings;`
`use feature 'say';`
`use Mojo::Util 'trim';`
`use Mojo::UserAgent;`
`# get perlmonks`
`my $ua = Mojo::UserAgent->new;`
`my $dom = $ua->get('https://perlmonks.org')->res->dom;`
`say 'Title: ' . trim( $dom->at('title')->text );`
`say 'Image src: ' . trim( $dom->at('img')->attr->{'src'} );`
`say 'Image alt: ' . trim( $dom->at('img')->attr->{'alt'} );`
Output:
`Title: PerlMonks - The Monastery Gates`
`Image src: //promote.pair.com/i/pair-banner-current.gif`
`Image alt: Beefy Boxes and Bandwidth Generously Provided by pair Networks`
Mojo::DOM makes parsing fun and simple.
Re^4: How to extract a pattern in Perl regex? by haukex (Archbishop) on May 02, 2020 at 09:23 UTC
Re^5: How to extract a pattern in Perl regex? by marto (Cardinal) on May 02, 2020 at 09:26 UTC
Re^3: How to extract a pattern in Perl regex? by hippo (Archbishop) on May 01, 2020 at 09:10 UTC
"Surely there's a simpler way?"
Just capture what you want. Let's change the task to remove the elephant in the room of parsing HTML with regex, which you now know you shouldn't do. Instead, suppose you want to extract everything between 'foo' and 'bar' and ignore all the rest. Here's the simple approach:
`use strict; use warnings; use Test::More tests => 1; my $in = 'abcfooHellobarxyz'; my $want = 'Hello'; my ($have) = ($in =~ /foo(.*)bar/); is $have, $want, "Extracted $want";`
The only real caveat to this is to remember to use the `/s` modifier if the text you are extracting might contain `\n`.
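The `/s` caveat, along with the related question of greedy versus non-greedy matching, can be sketched as follows. The input string is invented for the illustration:

```perl
use strict;
use warnings;

# Hypothetical input: two foo...bar pairs, the first spanning a newline.
my $in = "foo Hello\nWorld bar and foo again bar";

# Without /s, '.' does not match "\n", so the first pair cannot match
# and the regex falls through to the second pair.
my ($no_s) = $in =~ /foo(.*)bar/;

# With /s, greedy .* runs from the first 'foo' to the last 'bar'.
my ($greedy) = $in =~ /foo(.*)bar/s;

# Non-greedy .*? stops at the first 'bar' that follows 'foo'.
my ($lazy) = $in =~ /foo(.*?)bar/s;

print "no /s:  [$no_s]\n";
print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
```

When the text may contain more than one delimited section, the non-greedy form is usually the one you want.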
Re^4: How to extract a pattern in Perl regex? by SergioQ (Scribe) on May 01, 2020 at 22:24 UTC
Re^4: How to extract a pattern in Perl regex? by AnomalousMonk (Archbishop) on May 01, 2020 at 10:33 UTC
Re^5: How to extract a pattern in Perl regex? by hippo (Archbishop) on May 01, 2020 at 10:50 UTC
Re^3: How to extract a pattern in Perl regex? (updated) by AnomalousMonk (Archbishop) on May 01, 2020 at 03:58 UTC
`c:\@Work\Perl\monks>perl -wMstrict -le "my $result = '<title>The Rain in Spain</tItLe>'; my ($newresult) = $result =~ m{ <title> (.*?) </title> }xmsi; print qq{'$newresult'}; "`
`'The Rain in Spain'`
Update: Or, going a step further:
`c:\@Work\Perl\monks>perl -wMstrict -le "use Data::Dump qw(dd); ;; my $result = 'yada <title>The Rain in Spain</tItLe> blah <TITLE>How Now Brown Cow</TitlE> foo'; my @titles = $result =~ m{ (?i) <title> (.*?) </title> }xmsg; dd \@titles; "`
`["The Rain in Spain", "How Now Brown Cow"]`