jcpunk has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to get the data out of the following string
<tr align=right><td>/export/home3</td><td>308218</td><td>307200</td><td>308224</td><td>7.0 days</td><td>0</td><td>0</td><td>0</td><td>-</td></tr>
I was brave enough to code up
/.*?<tr align=right><td>(?:\/\w*?)+<\/td><td>(\d*?)<\/td><td>(\d*?)<\/td><td>(\d*?)<\/td><td>+?<\/td><td>(\d*?)<\/td><td>(\d*?)<\/td><td>(\d*?)<\/td>/
in an attempt to match it, but it was for nothing, as it does not match. Any hints towards fixing this, or alternate methods of getting the values out (including the "-"), would be excellent.

jcpunk
All code is tested, and doesn't work, so there :p (variant on a common PM sig, for my own amusement)

break it down ( un-uglify )
by PodMaster (Abbot) on Mar 30, 2004 at 07:50 UTC
    Break it down. You're using (?:\/\w*?)+ to match /export/home3, and you know \w doesn't match that extra /. If I were you, I'd start from scratch, build the regex from pieces, use alternate delimiters, and use the x option.
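As a rough sketch of the "build it from pieces" advice: the fragment `<td>+?<\/td>` in the original pattern can only match one or more `>` characters, so it cannot match the `7.0 days` cell. The version below assumes every row has exactly nine cells, as in the sample line, and that no cell contains nested tags; `parse_row` is a hypothetical helper name, not something from the thread.

```perl
use strict;
use warnings;

# Sample row from the question.
my $line = '<tr align=right><td>/export/home3</td><td>308218</td><td>307200</td>'
         . '<td>308224</td><td>7.0 days</td><td>0</td><td>0</td><td>0</td><td>-</td></tr>';

# One cell: capture everything up to the next tag.
# Alternate delimiters ({}) mean the slashes need no escaping;
# the x option lets us space the pieces out.
my $cell = qr{ <td> ( [^<]* ) </td> }x;

# Assumption: nine cells per row, as in the sample.
my $cells = join '', ($cell) x 9;
my $row   = qr{ <tr [^>]* > $cells </tr> }x;

# Hypothetical helper: returns the nine cell values, or an empty list on no match.
sub parse_row {
    my ($text) = @_;
    my @fields = $text =~ $row;
    return @fields;
}

print join(' | ', parse_row($line)), "\n";
```

Capturing `[^<]*` instead of `\d*?` is what lets the same piece handle the path, the numbers, `7.0 days`, and the trailing `-`.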

    You should also think in more general terms. Think about parsing the html. What you're looking for is stuff in between > and <.

    # C:\dev\loose\html.treebuilder3.pl
    use strict;
    use warnings;
    use HTML::TreeBuilder;

    my $html = q~
    <tr align=right><td>/export/home3</td><td>308218</td><td>307200</td><td>308224</td><td>7.0 days</td><td>0</td><td>0</td><td>0</td><td>-</td></tr>
    ~;

    my $t = HTML::TreeBuilder->new();
    $t->parse($html);
    $t->eof;

    for my $row ( $t->find_by_tag_name('tr') ){
        print join ' | ', map { ref $_ ? $_->as_text : $_ } @{ $row->content() },$/;
    }

    warn $_ for $html =~ m{> ( [^>]+ ) </}gx;

    __END__
    /export/home3 | 308218 | 307200 | 308224 | 7.0 days | 0 | 0 | 0 | - |
    /export/home3 at html.treebuilder3.pl line 23.
    308218 at html.treebuilder3.pl line 23.
    307200 at html.treebuilder3.pl line 23.
    308224 at html.treebuilder3.pl line 23.
    7.0 days at html.treebuilder3.pl line 23.
    0 at html.treebuilder3.pl line 23.
    0 at html.treebuilder3.pl line 23.
    0 at html.treebuilder3.pl line 23.
    - at html.treebuilder3.pl line 23.

    MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
    I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
    ** The third rule of perl club is a statement of fact: pod is sexy.

Re: ugly regex question
by matija (Priest) on Mar 30, 2004 at 09:07 UTC
    That string looks like it's part of a larger table.

    I think you can save yourself a lot of pain by using HTML::TableExtract.
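A minimal sketch of what the HTML::TableExtract route might look like, assuming the module is installed from CPAN and that the row really does sit inside a `<table>` element (the wrapper below is made up for illustration, as is the helper name `table_rows`). With no constraints passed to `new`, every table in the input is extracted.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Assumption: the row from the question, wrapped in a <table> for illustration.
my $html = <<'HTML';
<table>
<tr align=right><td>/export/home3</td><td>308218</td><td>307200</td><td>308224</td><td>7.0 days</td><td>0</td><td>0</td><td>0</td><td>-</td></tr>
</table>
HTML

# Hypothetical helper: returns one array ref of cell values per table row.
sub table_rows {
    my ($source) = @_;
    my $te = HTML::TableExtract->new();   # no constraints: take every table
    $te->parse($source);
    my @rows;
    for my $table ( $te->tables ) {
        push @rows, [ @$_ ] for $table->rows;
    }
    return @rows;
}

for my $row ( table_rows($html) ) {
    # Empty cells may come back undefined, so print them as empty strings.
    print join(' | ', map { defined $_ ? $_ : '' } @$row), "\n";
}
```

If the real page has header cells, passing `headers => [...]` to `new` would narrow the extraction to the matching table and columns.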

Re: ugly regex question
by jweed (Chaplain) on Mar 30, 2004 at 07:40 UTC
Re: ugly regex question
by Tomte (Priest) on Mar 30, 2004 at 07:54 UTC

    I think you'd be better off using something like HTML::Parser, but you can just remove all the tags, can't you?

    my $text = '<tr align=right><td>/export/home3</td><td>308218</td><td>307200</td><td>308224</td><td>7.0 days</td><td>0</td><td>0</td><td>0</td><td>-</td></tr>';
    $text =~ s!<[^>]+>!^!g;          # replace each tag with a caret
    my @data = split(/\^/, $text);   # split at the carets
    @data = grep { length } @data;   # drop empty strings, but keep the "0" fields
    print join(" : ", @data);
    __END__
    /export/home3 : 308218 : 307200 : 308224 : 7.0 days : 0 : 0 : 0 : -
    Edit: removed part of my shell-prompt (system-time [0932]) after the trailing -.
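For comparison, a rough sketch of the HTML::Parser route mentioned at the top of this reply. It assumes the module is installed and that only the text inside `<td>` elements is of interest; `cell_texts` is a hypothetical helper name.

```perl
use strict;
use warnings;
use HTML::Parser;

my $html = '<tr align=right><td>/export/home3</td><td>308218</td><td>307200</td>'
         . '<td>308224</td><td>7.0 days</td><td>0</td><td>0</td><td>0</td><td>-</td></tr>';

# Hypothetical helper: collects the text of every <td> element, in order.
sub cell_texts {
    my ($source) = @_;
    my @cells;
    my $in_td = 0;
    my $p = HTML::Parser->new(
        api_version => 3,
        # Open a new, empty cell on each <td>.
        start_h => [ sub { if ( $_[0] eq 'td' ) { $in_td = 1; push @cells, '' } }, 'tagname' ],
        # Stop collecting on </td>.
        end_h   => [ sub { $in_td = 0 if $_[0] eq 'td' }, 'tagname' ],
        # Append decoded text to the current cell.
        text_h  => [ sub { $cells[-1] .= $_[0] if $in_td }, 'dtext' ],
    );
    $p->parse($source);
    $p->eof;
    return @cells;
}

print join(' : ', cell_texts($html)), "\n";
```

Unlike the tag-stripping approach, this keeps a cell even when it is empty, since each `<td>` start tag opens a new entry.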

    hth, regards,
    tomte


    Hlade's Law:

    If you have a difficult task, give it to a lazy person --
    they will find an easier way to do it.

Re: ugly regex question
by kiat (Vicar) on Mar 30, 2004 at 08:17 UTC
    my $string = q~<tr align=right><td>/export/home3</td><td>308218</td><td>307200</td><td>308224</td><td>7.0 days</td><td>0</td><td>0</td><td>0</td><td>-</td></tr>~;
    $string =~ s/<.*?><.*?>/\n/g and print $string;

    # output
    /export/home3
    308218
    307200
    308224
    7.0 days
    0
    0
    0
    -
    You probably want to use the solutions by the others above. I was just playing with it...