SergioQ has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I need to find patterns in a large Perl variable, which I assume is multi-lined. There are different patterns I am looking for.

<alt img="....">

<title>....</title>

The .... represents any characters.

But I can't figure out the regex code for even the basics. I have tried:

if ($urlresult =~ s/^(alt img)//igm)

But I get nowhere. I realize that's not the full extract I want, but if I can't even get a simpler version working then I can't proceed to the next steps (multiple wildcard characters, etc.)

Any help would be appreciated.

Replies are listed 'Best First'.
Re: How to extract a pattern in Perl regex?
by Fletch (Bishop) on Apr 30, 2020 at 02:47 UTC

    Before you go too far down this route be forewarned that parsing (arbitrary) HTML with regular expressions is going to be a world of pain. You'll be better served using a real parser (Mojo::DOM or the like).

    That being said there's no way that s/^(alt img)// is going to match your sample text since it's ignoring the initial < character on the alt tag.
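    A minimal sketch of that point, using a made-up three-line string in place of the real data (the sample content and variable names other than $urlresult are assumptions). It only tests whether each anchored pattern matches:

```perl
use strict;
use warnings;

# Hypothetical sample standing in for the poster's data.
my $urlresult = qq{<html>\n<alt img="logo.png">\n</html>\n};

# The original pattern: with /m, ^ anchors at the start of each line,
# but every line here starts with '<', so 'alt img' never follows ^.
my $anchored = $urlresult =~ /^(alt img)/im ? 1 : 0;

# Including the leading '<' lets the anchored pattern match.
my $with_bracket = $urlresult =~ /^<(alt img)/im ? 1 : 0;

print "without '<': $anchored\n";
print "with '<': $with_bracket\n";
```

    This only illustrates the anchoring problem; it is not a recommendation to parse HTML this way.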

    The cake is a lie.

Re: How to extract a pattern in Perl regex?
by kcott (Archbishop) on Apr 30, 2020 at 08:06 UTC

    G'day SergioQ,

    If the data you're dealing with is HTML, then '<alt img="....">' is invalid. I suspect this is meant to be the img element which may look like:

    <img alt="..." src="...">
    <img src="..." alt="...">
    <img src="...">

    or any number of other variations including a variety of other attributes (id="...", class="...", and so on) which could appear anywhere between '<img' and '>'; it may have '/>', instead of '>', at the end.
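    To make the ordering problem concrete, here is a small sketch with invented attribute values. The pattern is a deliberately naive one that assumes alt comes directly after '<img'; it is an illustration, not something from the original question:

```perl
use strict;
use warnings;

# Three of the img variations listed above (attribute values are made up).
my @tags = (
    '<img alt="logo" src="a.png">',
    '<img src="a.png" alt="logo">',
    '<img src="a.png">',
);

# Naive pattern: only matches when alt is the first attribute.
my @alts;
for my $tag (@tags) {
    my ($alt) = $tag =~ /<img alt="([^"]*)"/;
    push @alts, $alt;    # undef when the pattern did not match
}

for my $i (0 .. $#tags) {
    printf "%-32s => %s\n", $tags[$i], $alts[$i] // '(no match)';
}
```

    Only the first tag yields a value. The second has an alt attribute too, but the pattern misses it because src comes first.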

    Even if it's not HTML — perhaps it's XML — you'll likely have the same problem with an expected order. This is why you've been advised against using regular expressions for this type of work.

    I strongly recommend you take a look at "Parsing HTML/XML with Regular Expressions". This expands on the issues and provides many alternatives: you'd do well to choose one of these.

    — Ken

Re: How to extract a pattern in Perl regex?
by AnomalousMonk (Archbishop) on Apr 30, 2020 at 06:08 UTC

    I second Fletch's advice to avoid parsing HTML with regex.

    However, another problem you may have is that you are not, strictly speaking, matching, but substituting (with the empty string). E.g.:

    c:\@Work\Perl\monks>perl -wMstrict -le
    "my $urlresult = 'an Alt Img and another ALT IMG here';
     print qq{string has 'alt img': '$urlresult'};
     ;;
     if ($urlresult =~ s/(alt img)//igm) {
         print qq{string had 'alt img', but no more: '$urlresult'};
     }
    "
    string has 'alt img': 'an Alt Img and another ALT IMG here'
    string had 'alt img', but no more: 'an  and another  here'

    (This example only works because I've removed the ^ start-of-string anchor.) Please see perlre, perlretut and perlrequick.
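    For contrast, a short sketch of matching versus substituting on the same sample string. The variable names @found, $after_match and $count are introduced here for illustration:

```perl
use strict;
use warnings;

my $urlresult = 'an Alt Img and another ALT IMG here';

# m// in list context with /g returns the captures and leaves the string alone.
my @found = $urlresult =~ /(alt img)/ig;
my $after_match = $urlresult;

# s///g removes every occurrence and returns how many it replaced.
my $count = ($urlresult =~ s/alt img//ig);

print "found: @found\n";
print "after match: $after_match\n";
print "replaced $count, leaving: $urlresult\n";
```

    If the goal is to extract text rather than delete it, the match form is the one to build on.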

    If you have more questions, please feel free to ask. Please see How do I post a question effectively?, Short, Self-Contained, Correct Example and How to ask better questions using Test::More and sample data.


    Give a man a fish:  <%-{-{-{-<

Re: How to extract a pattern in Perl regex?
by marto (Cardinal) on Apr 30, 2020 at 07:46 UTC
Re: How to extract a pattern in Perl regex?
by hippo (Archbishop) on Apr 30, 2020 at 08:39 UTC

    You've had some great advice so far - please take time to read through it all. You will reap the benefits.

    My own small addition to it is to point out that there is an FAQ which covers precisely this topic. The fact that it echoes what you've already heard just helps to reinforce the point.

    As a final gift, I will also point out that extracting the contents of the <title> element from an HTML doc is one of the provided examples in the HTML::Parser documentation.

      Yes, I'm looking at the recommended methods, and that "^" was a typo.

      However part of my question was how do I extract in one statement what's in between the "title tags".

      The way I worked around it was:

      $result =~ /(<title>.*<\/title>)/mgi;
      my $newresult = $1;
      $newresult =~ s/<title>//i;
      $newresult =~ s/<\/title>//i;

      Surely there's a simpler way?

        Using Mojo::DOM (pulling live data using Mojo::UserAgent):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use feature 'say';
        use Mojo::Util 'trim';
        use Mojo::UserAgent;

        # get perlmonks
        my $ua  = Mojo::UserAgent->new;
        my $dom = $ua->get('https://perlmonks.org')->res->dom;

        say 'Title: ' . trim( $dom->at('title')->text );
        say 'Image src: ' . trim( $dom->at('img')->attr->{'src'} );
        say 'Image alt: ' . trim( $dom->at('img')->attr->{'alt'} );

        Output:

        Title: PerlMonks - The Monastery Gates
        Image src: //promote.pair.com/i/pair-banner-current.gif
        Image alt: Beefy Boxes and Bandwidth Generously Provided by pair Networks

        Mojo::DOM makes parsing fun and simple.

        Surely there's a simpler way?

        Just capture what you want. Let's change the task to remove the elephant in the room of parsing HTML with regex which you now know you shouldn't do. Instead suppose you want to extract everything between 'foo' and 'bar' and ignore all the rest. Here's the simple approach:

        use strict;
        use warnings;
        use Test::More tests => 1;

        my $in   = 'abcfooHellobarxyz';
        my $want = 'Hello';
        my ($have) = ($in =~ /foo(.*)bar/);
        is $have, $want, "Extracted $want";

        The only real caveat to this is to remember to use the /s modifier if the text you are extracting might contain \n.
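        A minimal sketch of that caveat, reusing the foo/bar task with a newline added to the input (the input string and the names $without_s and $with_s are made up for illustration):

```perl
use strict;
use warnings;

my $in = "abcfooHello\nWorldbarxyz";

# Without /s, '.' does not match "\n", so the pattern cannot reach 'bar'.
my ($without_s) = $in =~ /foo(.*)bar/;

# With /s, '.' matches any character, including "\n".
my ($with_s) = $in =~ /foo(.*)bar/s;

print defined $without_s ? "without /s: $without_s\n" : "without /s: no match\n";
print "with /s: $with_s\n";
```

        Without /s the capture stays undefined; with /s it holds both lines.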

        c:\@Work\Perl\monks>perl -wMstrict -le
        "my $result = '<title>The Rain in Spain</tItLe>';
         my ($newresult) = $result =~ m{ <title> (.*?) </title> }xmsi;
         print qq{'$newresult'};
        "
        'The Rain in Spain'

        Update: Or, going a step further:

        c:\@Work\Perl\monks>perl -wMstrict -le
        "use Data::Dump qw(dd);
         ;;
         my $result = 'yada <title>The Rain in Spain</tItLe> blah <TITLE>How Now Brown Cow</TitlE> foo';
         my @titles = $result =~ m{ (?i) <title> (.*?) </title> }xmsg;
         dd \@titles;
        "
        ["The Rain in Spain", "How Now Brown Cow"]
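        The one-liners above use the non-greedy .*? rather than .*, which matters as soon as the string holds more than one title. A small sketch of the difference, on an invented input string:

```perl
use strict;
use warnings;

my $result = 'yada <title>One</title> blah <title>Two</title> foo';

# Greedy: .* runs to the LAST </title> in the string.
my ($greedy) = $result =~ m{<title>(.*)</title>}is;

# Non-greedy: .*? stops at the FIRST </title> after the opening tag.
my ($lazy) = $result =~ m{<title>(.*?)</title>}is;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

        The greedy capture swallows everything between the first opening tag and the last closing tag, while the non-greedy one returns just the first title.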


        Give a man a fish:  <%-{-{-{-<