Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I've been trying to plot out a way to do this for most of the afternoon and can't quite figure it out.

I have to match multiple things in a regex and on top of that I need to match things in pairs. To keep things in insertion order I was thinking of creating an array and just combining match1::match2 and ripping through them later on. But since I need the URL and name for each of the urls, that's a little much for an array, isn't it?

The code I need to match is as follows:

<TR class=g><TD class=i><A href="/page1.html">Page name here</A></TD><TD class=j><A href="/page2.html">Other page name here</A></TD></TR>
<TR class=h><TD class=i><A href="/blahblah.html">lah blah</A></TD><TD class=j><A href="/ribbit.html">Ribbit</A></TD></TR>
There are hundreds of these matches per page and I NEED to pair up the first and second link of each match. Also, the TR CLASS alternates between "g" and "h" each time.

Any advice on the regex to use to match the URL and the title of each set, AND on how to group these together for easy access later (preferably in the order they are found), would be great.

Replies are listed 'Best First'.
Re: multiple regex matching
by tlm (Prior) on Apr 13, 2005 at 22:10 UTC

    Why don't you use HTML::TableExtract to extract the rows and columns of that table? After that, the matching problem becomes trivial.

    the lowliest monk

Re: multiple regex matching
by gam3 (Curate) on Apr 14, 2005 at 00:52 UTC
    It is not clear to me, but is this an answer?
    while (<DATA>) {
        if (my ($a, $b, $c, $d) = (m/TD class=i><A href="([^"]*)"\s*>([^<]*)<.*TD class=j><A href="([^"]*)"\s*>([^<]*)</)) {
            print "($a, $b, $c, $d)\n";
        }
    }
    __END__
    <TR class=g><TD class=i><A href="/page1.html">Page name here</A></TD><TD class=j><A href="/page2.html">Other page name here</A></TD></TR>
    <TR class=h><TD class=i><A href="/blahblah.html">lah blah</A></TD><TD class=j><A href="/ribbit.html">Ribbit</A></TD></TR>
    -- gam3
    A picture is worth a thousand words, but takes 200K.
      yes!!! That is exactly it! That's what I need to do.

      I can get your demo working but do I need to save the html into an array in order to do this while loop?

      Thanks.
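A minimal sketch, assuming the whole page is already held in one scalar: a global match (`/g`) in a `while` loop walks the string directly, so no intermediate array of lines is needed. The helper name `pair_links` and the sample `$source` are illustrative only, and the pattern assumes the rows look exactly like the two samples in the original post.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the scraped page; in practice $source would
# already hold the full HTML dump.
my $source = join "\n",
    '<TR class=g><TD class=i><A href="/page1.html">Page name here</A></TD><TD class=j><A href="/page2.html">Other page name here</A></TD></TR>',
    '<TR class=h><TD class=i><A href="/blahblah.html">lah blah</A></TD><TD class=j><A href="/ribbit.html">Ribbit</A></TD></TR>';

# pair_links is a hypothetical helper name. It returns one array ref per
# table row, [ url1, name1, url2, name2 ], in the order the rows are found.
sub pair_links {
    my ($html) = @_;
    my @rows;
    while ( $html =~ m{
            <TR\s+class=[gh]>\s*
            <TD\s+class=i><A\s+href="([^"]*)"\s*>([^<]*)</A></TD>\s*
            <TD\s+class=j><A\s+href="([^"]*)"\s*>([^<]*)</A></TD>
        }gix ) {
        push @rows, [ $1, $2, $3, $4 ];
    }
    return @rows;
}

my @rows = pair_links($source);
print join( ' | ', @$_ ), "\n" for @rows;
```

Because each row is captured by a single match, the two links of a row can never drift apart, and pushing onto an array keeps them in page order.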

        My solution saves the data into an array of hashes.


        ($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
        =~y~b-v~a-z~s; print
Re: multiple regex matching
by Cody Pendant (Prior) on Apr 13, 2005 at 23:58 UTC
    There are hundreds of these matches per page and I NEED to pair up the first and second link of each match. Also, the TR CLASS alternates between "g" and "h" each time.

    I don't understand what you want. There are a number of things in each page which match. Within the match there are more than two URLs, but you only want the first two?

    Give us an example like "here's a page, (some code). In this page I need to extract the links (list of links you need to extract)".



    ($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
    =~y~b-v~a-z~s; print
      Let me retry to explain what I need to do.

      I am scraping a web page and I have $source, which is the HTML dump. Within the source there is a table that has a ton of rows but always 2 columns (as you can see from the two samples above).

      There is 1 URL in each table cell which means 2 URLS per table row. I need to collect the URL and the text of the URL for EACH table cell (which means 2 urls per row).

      The 2 samples displayed in the OP are what the table looks like. There are dozens upon dozens of <TR>s and I need to extract each URL and its text while keeping each row of links connected. That is to say, I need to keep the two URLs per row together somehow because they are related to each other. The code below is the beginning of the table plus 3 rows. You can see it's the same type of code I pasted earlier. I need the URLs and the text, but with each row of data kept together.

      <TABLE width=468 style="border-collapse:collapse"><TR><TD><TABLE width=468 style="border-collapse:collapse" cellpadding=2>
      <TR><TD class=head>Artist</TD><TD class=head>Song</TD></TR>
      <TR class=g><TD class=i><A href="/my-bloody-valentine-rxzg3.html">My Bloody Valentine</A></TD><TD class=j><A href="/my-bloody-valentine-lovely-sweet-darlene-9lgrnkb.html">Lovely Sweet Darlene *</A></TD></TR>
      <TR class=h><TD class=i><A href="/jennifer-love-hewitt-d193f.html">Jennifer Love Hewitt</A></TD><TD class=j><A href="/jennifer-love-hewitt-i-want-a-love-i-can-see-jv2mj67.html">I Want A Love I Can See *</A></TD></TR>
      <TR class=g><TD class=i><A href="/jennifer-love-hewitt-d193f.html">Jennifer Love Hewitt</A></TD><TD class=j><A href="/jennifer-love-hewitt-no-ordinary-love-4rftntr.html">No Ordinary Love *</A></TD></TR>
      From that code I want to get /my-bloody-valentine-rxzg3.html and My Bloody Valentine from the first cell of the first row, and /my-bloody-valentine-lovely-sweet-darlene-9lgrnkb.html and Lovely Sweet Darlene * from the second cell. I need this data collected for each row.

      Does this make any more sense?

      Edited by Chady -- escaped <TR>

        Well, this does it, but it's not the optimal solution. There's lots of repetition and one side is clearly the artist and the other a song, so it looks kind of off to me. You know better than me what you want though...

        I never thought I'd see MBV and JLH mentioned in the same breath...

        #!/usr/bin/perl
        use strict;
        use warnings;
        use diagnostics;
        use Data::Dumper;

        my $html = '<TABLE width=468 style="border-collapse:collapse">
        <TR><TD><TABLE width=468 style="border-collapse:collapse" cellpadding=2>
        <TR><TD class=head>Artist</TD><TD class=head>Song</TD></TR>
        <TR class=g><TD class=i><A href="/my-bloody-valentine-rxzg3.html">My Bloody Valentine</A></TD><TD class=j><A href="/my-bloody-valentine-lovely-sweet-darlene-9lgrnkb.html">Lovely Sweet Darlene *</A></TD></TR>
        <TR class=h><TD class=i><A href="/jennifer-love-hewitt-d193f.html">Jennifer Love Hewitt</A></TD><TD class=j><A href="/jennifer-love-hewitt-i-want-a-love-i-can-see-jv2mj67.html">I Want A Love I Can See *</A></TD></TR>
        <TR class=g><TD class=i><A href="/jennifer-love-hewitt-d193f.html">Jennifer Love Hewitt</A></TD><TD class=j><A href="/jennifer-love-hewitt-no-ordinary-love-4rftntr.html">No Ordinary Love *</A></TD></TR>';

        my @data = ();
        while ( $html =~ m|<TD class=\w><A href="([^"]+)">([^<]+)</A></TD>|g ) {
            push( @data, { link => $1, name => $2 } );
        }
        print Dumper(\@data);

        #### result ####
        # $VAR1 = [
        #   {
        #     'link' => '/my-bloody-valentine-rxzg3.html',
        #     'name' => 'My Bloody Valentine'
        #   },
        #   {
        #     'link' => '/my-bloody-valentine-lovely-sweet-darlene-9lgrnkb.html',
        #     'name' => 'Lovely Sweet Darlene *'
        #   },
        #   {
        #     'link' => '/jennifer-love-hewitt-d193f.html',
        #     'name' => 'Jennifer Love Hewitt'
        #   },
        #   {
        #     'link' => '/jennifer-love-hewitt-i-want-a-love-i-can-see-jv2mj67.html',
        #     'name' => 'I Want A Love I Can See *'
        #   },
        #   {
        #     'link' => '/jennifer-love-hewitt-d193f.html',
        #     'name' => 'Jennifer Love Hewitt'
        #   },
        #   {
        #     'link' => '/jennifer-love-hewitt-no-ordinary-love-4rftntr.html',
        #     'name' => 'No Ordinary Love *'
        #   }
        # ];
        #
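Since every data row has exactly two cells, a flat list of cells like the one above can be regrouped into rows by taking two entries at a time. The following is only a sketch of that idea: `group_rows` is a hypothetical helper, and the `artist`/`song` keys are assumptions based on the table's header row.

```perl
use strict;
use warnings;

# Flat list in the shape produced by a cell-by-cell match: one hash per
# table cell, in document order (values taken from the sample rows).
my @data = (
    { link => '/my-bloody-valentine-rxzg3.html', name => 'My Bloody Valentine' },
    { link => '/my-bloody-valentine-lovely-sweet-darlene-9lgrnkb.html', name => 'Lovely Sweet Darlene *' },
    { link => '/jennifer-love-hewitt-d193f.html', name => 'Jennifer Love Hewitt' },
    { link => '/jennifer-love-hewitt-no-ordinary-love-4rftntr.html', name => 'No Ordinary Love *' },
);

# group_rows is a hypothetical helper; it pairs consecutive cells into
# one hash per row. The artist/song key names are assumptions.
sub group_rows {
    my @cells = @_;
    die "odd number of cells\n" if @cells % 2;
    my @rows;
    # splice removes and returns the first two cells on each pass;
    # the loop ends when @cells is empty.
    while ( my ( $artist, $song ) = splice @cells, 0, 2 ) {
        push @rows, { artist => $artist, song => $song };
    }
    return @rows;
}

for my $row ( group_rows(@data) ) {
    print "$row->{artist}{name}: $row->{song}{name} ($row->{song}{link})\n";
}
```

This relies on the cell-level pattern never matching anything outside the two-column rows; a stray match would shift every later pair, which is why the helper at least refuses an odd count.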


        ($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
        =~y~b-v~a-z~s; print
Re: multiple regex matching
by Popcorn Dave (Abbot) on Apr 14, 2005 at 19:37 UTC
    Since you're scraping the page, you might also take a look at HTML::TokeParser. That makes the work of parsing HTML absolutely trivial. From there I would think it would be easy to match what you needed.
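As a rough illustration of the HTML::TokeParser route, the sketch below collects the links row by row. It assumes the module is installed from CPAN (it ships with the HTML-Parser distribution, not with core Perl); `rows_from_html` and the trimmed sample `$html` are hypothetical and only mirror the rows posted earlier in the thread.

```perl
use strict;
use warnings;
use HTML::TokeParser;    # assumed installed; part of HTML-Parser on CPAN

# Trimmed, hypothetical sample in the same shape as the posted table.
my $html = join "\n",
    '<TABLE><TR><TD class=head>Artist</TD><TD class=head>Song</TD></TR>',
    '<TR class=g><TD class=i><A href="/my-bloody-valentine-rxzg3.html">My Bloody Valentine</A></TD><TD class=j><A href="/my-bloody-valentine-lovely-sweet-darlene-9lgrnkb.html">Lovely Sweet Darlene *</A></TD></TR>',
    '<TR class=h><TD class=i><A href="/jennifer-love-hewitt-d193f.html">Jennifer Love Hewitt</A></TD><TD class=j><A href="/jennifer-love-hewitt-no-ordinary-love-4rftntr.html">No Ordinary Love *</A></TD></TR>',
    '</TABLE>';

# rows_from_html is a hypothetical helper. It returns one array ref per
# row that contains exactly two links, each link as { link, name }.
sub rows_from_html {
    my ($doc) = @_;
    my $p = HTML::TokeParser->new( \$doc ) or die "cannot parse document\n";
    my ( @rows, @cells );
    # Stop at every <a> start tag and every </tr> end tag.
    while ( my $tag = $p->get_tag( 'a', '/tr' ) ) {
        if ( $tag->[0] eq 'a' ) {
            my $href = $tag->[1]{href};
            my $name = $p->get_trimmed_text('/a');
            push @cells, { link => $href, name => $name };
        }
        else {
            # End of a row: keep it only if it held the expected two links.
            push @rows, [@cells] if @cells == 2;
            @cells = ();
        }
    }
    return @rows;
}

for my $row ( rows_from_html($html) ) {
    print join( ' => ', map { "$_->{name} [$_->{link}]" } @$row ), "\n";
}
```

The parser lowercases tag names and copes with unquoted attributes such as `class=g`, so the alternating row classes do not need special handling here.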

    Useless trivia: In the 2004 Las Vegas phone book there are approximately 28 pages of ads for massage, but almost 200 for lawyers.