Re: How to extract a pattern in Perl regex?
by Fletch (Bishop) on Apr 30, 2020 at 02:47 UTC
Before you go too far down this route, be forewarned that parsing (arbitrary) HTML with regular expressions is going to be a world of pain. You'll be better served by a real parser (Mojo::DOM or the like).
That being said, there's no way that s/^(alt img)// is going to match your sample text, since the pattern is anchored to the start of the string and ignores the initial < character on the alt tag.
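A minimal sketch of that anchoring problem, assuming a made-up sample string (the original poster's exact text is not shown here):

```perl
use strict;
use warnings;

# Hypothetical sample text; not the original poster's data.
my $text = '<alt img="logo.png">';

# ^ pins the match to position 0, where the string has '<', not 'a'.
my $anchored   = $text =~ /^(alt img)/  ? 1 : 0;
# Accounting for the leading '<' lets the anchored pattern match.
my $with_angle = $text =~ /^<(alt img)/ ? 1 : 0;

print "anchored, no '<':   $anchored\n";
print "anchored, with '<': $with_angle\n";
```

The first pattern fails and the second succeeds, which is all the ^ anchor is doing here.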
The cake is a lie.
Re: How to extract a pattern in Perl regex?
by kcott (Archbishop) on Apr 30, 2020 at 08:06 UTC
G'day SergioQ,
If the data you're dealing with is HTML, then '<alt img="....">' is invalid.
I suspect this is meant to be the img element, which may look like:
<img alt="..." src="...">
<img src="..." alt="...">
<img src="...">
or any number of other variations including a variety of other attributes (id="...", class="...", and so on)
which could appear anywhere between '<img' and '>';
it may have '/>', instead of '>', at the end.
Even if it's not HTML — perhaps it's XML — you'll likely have the same problem with an expected order.
This is why you've been advised against using regular expressions for this type of work.
I strongly recommend you take a look at "Parsing HTML/XML with Regular Expressions".
This expands on the issues and provides many alternatives:
you'd do well to choose one of these.
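To make the attribute-order problem concrete, here is a small sketch. The tags, file names and the $naive pattern are invented for illustration; it shows a regex that assumes alt comes first, not a recommended approach:

```perl
use strict;
use warnings;

# Hypothetical img elements; names and values are made up.
my @tags = (
    '<img alt="Logo" src="logo.png">',
    '<img src="logo.png" alt="Logo">',
    '<img src="logo.png">',
    '<img id="x" src="logo.png" alt="Logo" />',
);

# A pattern that assumes alt immediately follows '<img'.
my $naive = qr{<img \s+ alt="([^"]*)"}xi;

my @found;
for my $tag (@tags) {
    my ($alt) = $tag =~ $naive;    # undef when the pattern does not match
    push @found, $alt;
    printf "%-42s => %s\n", $tag, defined $alt ? "'$alt'" : 'no match';
}
```

Only the first tag matches. The second and fourth do carry an alt attribute, but in a position the pattern did not anticipate.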
Re: How to extract a pattern in Perl regex?
by AnomalousMonk (Archbishop) on Apr 30, 2020 at 06:08 UTC
I second Fletch's advice to avoid parsing HTML with regex.
However, another problem you may have is that you are not, strictly speaking, matching, but substituting (with the empty string). E.g.:
c:\@Work\Perl\monks>perl -wMstrict -le
"my $urlresult = 'an Alt Img and another ALT IMG here';
print qq{string has 'alt img': '$urlresult'};
;;
if ($urlresult =~ s/(alt img)//igm) {
print qq{string had 'alt img', but no more: '$urlresult'};
}
"
string has 'alt img': 'an Alt Img and another ALT IMG here'
string had 'alt img', but no more: 'an and another here'
(This example only works because I've removed the ^ start-of-string anchor.) Please see perlre, perlretut and perlrequick.
If you have more questions, please feel free to ask. Please see How do I post a question effectively?, Short, Self-Contained, Correct Example and How to ask better questions using Test::More and sample data.
Give a man a fish: <%-{-{-{-<
Re: How to extract a pattern in Perl regex?
by marto (Cardinal) on Apr 30, 2020 at 07:46 UTC
Re: How to extract a pattern in Perl regex?
by hippo (Archbishop) on Apr 30, 2020 at 08:39 UTC
You've had some great advice so far - please take time to read through it all. You will reap the benefits.
My own small addition to it is to point out that there is an FAQ which covers precisely this topic. The fact that it echoes what you've already heard just helps to reinforce the point.
As a final gift, I will also point out that extracting the contents of the <title> element from an HTML doc is one of the provided examples in the HTML::Parser documentation.
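For a rough idea of what that looks like, here is a minimal sketch using HTML::Parser's version 3 handler API. It assumes HTML::Parser is installed from CPAN, is not the example from the module's documentation, and uses a made-up input string and variable names:

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical input document.
my $html = '<html><head><TITLE>The Rain in Spain</TITLE></head>'
         . '<body><p>text</p></body></html>';

my $title    = '';
my $in_title = 0;

my $p = HTML::Parser->new(
    api_version => 3,
    # 'tagname' is passed lower-cased, so <TITLE> and <title> both work.
    start_h => [ sub { $in_title = 1 if $_[0] eq 'title' }, 'tagname' ],
    end_h   => [ sub { $in_title = 0 if $_[0] eq 'title' }, 'tagname' ],
    # 'dtext' is the text with entities decoded; collect it while inside <title>.
    text_h  => [ sub { $title .= $_[0] if $in_title }, 'dtext' ],
);
$p->parse($html);
$p->eof;

print "Title: $title\n";
```

The parser handles the tag case for you, so there is no need for /i or for stripping the tags afterwards.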
|
Yes, I'm looking at the recommended methods, and that "^" was a typo.
However part of my question was how do I extract in one statement what's in between the "title tags".
The way I worked around it was:
$result =~ /(<title>.*<\/title>)/mgi;
my $newresult = $1;
$newresult =~ s/<title>//i;
$newresult =~ s/<\/title>//i;
Surely there's a simpler way?
|
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use Mojo::Util 'trim';
use Mojo::UserAgent;
# get perlmonks
my $ua = Mojo::UserAgent->new;
my $dom = $ua->get('https://perlmonks.org')->res->dom;
say 'Title: ' . trim( $dom->at('title')->text );
say 'Image src: ' . trim( $dom->at('img')->attr->{'src'} );
say 'Image alt: ' . trim( $dom->at('img')->attr->{'alt'} );
Output:
Title: PerlMonks - The Monastery Gates
Image src: //promote.pair.com/i/pair-banner-current.gif
Image alt: Beefy Boxes and Bandwidth Generously Provided by pair Networks
Mojo::DOM makes parsing fun and simple.
|
use strict;
use warnings;
use Test::More tests => 1;
my $in = 'abcfooHellobarxyz';
my $want = 'Hello';
my ($have) = ($in =~ /foo(.*)bar/);
is $have, $want, "Extracted $want";
The only real caveat to this is to remember to use the /s modifier if the text you are extracting might contain \n.
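A quick sketch of that caveat, using a made-up input string with a newline between the foo and bar markers:

```perl
use strict;
use warnings;

# Hypothetical input: the wanted text spans two lines.
my $in = "abcfooHello\nWorldbarxyz";

# Without /s, '.' does not match "\n", so the capture cannot reach 'bar'.
my ($without_s) = $in =~ /foo(.*)bar/;
# With /s, '.' matches any character, including "\n".
my ($with_s)    = $in =~ /foo(.*)bar/s;

print 'without /s: ', (defined $without_s ? 'matched' : 'no match'), "\n";
print 'with /s:    ', (defined $with_s    ? 'matched' : 'no match'), "\n";
```

Without /s the match fails outright, rather than returning a truncated capture.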
|
c:\@Work\Perl\monks>perl -wMstrict -le
"my $result = '<title>The Rain in Spain</tItLe>';
my ($newresult) = $result =~ m{ <title> (.*?) </title> }xmsi;
print qq{'$newresult'};
"
'The Rain in Spain'
Update: Or, going a step further:
c:\@Work\Perl\monks>perl -wMstrict -le
"use Data::Dump qw(dd);
;;
my $result = 'yada <title>The Rain in Spain</tItLe> blah <TITLE>How Now Brown Cow</TitlE> foo';
my @titles = $result =~ m{ (?i) <title> (.*?) </title> }xmsg;
dd \@titles;
"
["The Rain in Spain", "How Now Brown Cow"]
Give a man a fish: <%-{-{-{-<