in reply to How to extract a pattern in Perl regex?

You've had some great advice so far - please take time to read through it all. You will reap the benefits.

My own small addition to it is to point out that there is an FAQ which covers precisely this topic. The fact that it echoes what you've already heard just helps to reinforce the point.

As a final gift, I will also point out that extracting the contents of the <title> element from an HTML doc is one of the provided examples in the HTML::Parser documentation.

Re^2: How to extract a pattern in Perl regex?
by SergioQ (Scribe) on May 01, 2020 at 03:08 UTC

    Yes, I'm looking at the recommended methods, and that "^" was a typo.

    However part of my question was how do I extract in one statement what's in between the "title tags".

    The way I worked around it was:

    $result =~ /(<title>.*<\/title>)/mgi;
    my $newresult = $1;
    $newresult =~ s/<title>//i;
    $newresult =~ s/<\/title>//i;

    Surely there's a simpler way?

      Using Mojo::DOM (pulling live data with Mojo::UserAgent):

      #!/usr/bin/perl
      use strict;
      use warnings;
      use feature 'say';
      use Mojo::Util 'trim';
      use Mojo::UserAgent;

      # get perlmonks
      my $ua  = Mojo::UserAgent->new;
      my $dom = $ua->get('https://perlmonks.org')->res->dom;

      say 'Title: ' . trim( $dom->at('title')->text );
      say 'Image src: ' . trim( $dom->at('img')->attr->{'src'} );
      say 'Image alt: ' . trim( $dom->at('img')->attr->{'alt'} );

      Output:

      Title: PerlMonks - The Monastery Gates
      Image src: //promote.pair.com/i/pair-banner-current.gif
      Image alt: Beefy Boxes and Bandwidth Generously Provided by pair Networks

      Mojo::DOM makes parsing fun and simple.

        Mojo::DOM makes parsing fun and simple.

        Agreed, and ojo makes it even more fun ;-)

        $ perl -Mojo -e 'say g("https://perlmonks.org")->dom->at("title")->all_text=~s/^\s+|\s+$//gr'
        PerlMonks - The Monastery Gates
      Surely there's a simpler way?

      Just capture what you want. Let's change the task to sidestep the elephant in the room: parsing HTML with a regex, which you now know you shouldn't do. Suppose instead that you want to extract everything between 'foo' and 'bar' and ignore all the rest. Here's the simple approach:

      use strict;
      use warnings;
      use Test::More tests => 1;

      my $in   = 'abcfooHellobarxyz';
      my $want = 'Hello';
      my ($have) = ($in =~ /foo(.*)bar/);
      is $have, $want, "Extracted $want";

      The only real caveat to this is to remember to use the /s modifier if the text you are extracting might contain \n.
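
      To make that caveat concrete, here is a minimal sketch. The two-line input string and the variable names are made up for illustration; only the pattern comes from the example above.

```perl
use strict;
use warnings;

# Hypothetical input: the text between 'foo' and 'bar' spans two lines.
my $in = "abcfooHello\nWorldbarxyz";

# Without /s the dot will not match "\n", so the whole match fails
# and the list assignment leaves $without_s undefined.
my ($without_s) = ($in =~ /foo(.*)bar/);

# With /s the dot matches any character, including "\n".
my ($with_s) = ($in =~ /foo(.*)bar/s);

print defined $without_s ? "without /s: $without_s\n" : "without /s: no match\n";
print "with /s: $with_s\n";
```

      The first match fails outright rather than returning a partial capture, which is why a forgotten /s is easy to miss until the input happens to contain a newline.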

        Thank you! Yes, this was the main part of what I was looking for. I remember going through a rather large Perl handbook, and it ended the Regex chapter (or started it) by saying that "there is so much to Regex that whole books are written on it." I really see why now.
        ... caveat ... is to remember to use the /s modifier if the text you are extracting might contain \n.

        Simpler still is to always use /s (along with /x and /m in a consistent /xms modifier tail) on every qr//, m// and s/// you write. Then the rule is simply "Dot matches all." Period.
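
        As a rough sketch of what a consistent /xms tail buys you (the sample text and variable names below are invented for illustration, not taken from the original question):

```perl
use strict;
use warnings;

# Hypothetical multi-line input, for illustration only.
my $text = "first line\nfoo Hello\nWorld bar\nlast line\n";

# /x lets the pattern be laid out with whitespace and comments;
# /s lets the dot cross the line break between Hello and World.
my ($between) = $text =~ m{
    foo \s+     # opening marker, layout whitespace is ignored
    (.*?)       # lazy capture, the dot also matches newlines
    \s+ bar     # closing marker
}xms;

# /m lets ^ match at the start of every line, not only at the
# start of the string; /g in list context returns every capture.
my @line_starts = $text =~ m{ ^ (\w+) }xmsg;

print "between: $between\n";
print "line starts: @line_starts\n";
```

        With the same tail on every pattern, the behaviour of dot, ^ and $ never depends on which modifiers a particular regex happened to get.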


        Give a man a fish:  <%-{-{-{-<

      c:\@Work\Perl\monks>perl -wMstrict -le
      "my $result = '<title>The Rain in Spain</tItLe>';
       my ($newresult) = $result =~ m{ <title> (.*?) </title> }xmsi;
       print qq{'$newresult'};
      "
      'The Rain in Spain'

      Update: Or, going a step further:

      c:\@Work\Perl\monks>perl -wMstrict -le
      "use Data::Dump qw(dd);
       ;;
       my $result = 'yada <title>The Rain in Spain</tItLe> blah <TITLE>How Now Brown Cow</TitlE> foo';
       my @titles = $result =~ m{ (?i) <title> (.*?) </title> }xmsg;
       dd \@titles;
      "
      ["The Rain in Spain", "How Now Brown Cow"]

