ugly regex question

jcpunk has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to get the data out of the following string

<tr align=right><td>/export/home3</td><td>308218</td><td>307200</td><td>308224</td><td>7.0 days</td><td>0</td><td>0</td><td>0</td><td>-</td></tr>

I was brave enough to code up

/.*?<tr align=right><td>(?:\/\w*?)+<\/td><td>(\d*?)<\/td><td>(\d*?)<\/td><td>(\d*?)<\/td><td>+?<\/td><td>(\d*?)<\/td><td>(\d*?)<\/td><td>(\d*?)<\/td>/

in an attempt to match it, but it was for nothing, as it does not match. Any hints towards fixing this, or alternate methods of getting the values out (including the "-"), would be excellent.

jcpunk

all code is tested, and doesn't work so there :p (variant on common PM sig for my own amusement)



break it down ( un-uglify )
by PodMaster (Abbot) on Mar 30, 2004 at 07:50 UTC

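Start with a single piece of the pattern, such as

(?:\/\w*?)+

and check it against a single piece of the data, such as

/extport/home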
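A minimal sketch of the "break it down" idea, assuming the pattern from the question is split at the cell boundaries and the pieces are added back one at a time until the match stops working. The labels and the names @pieces, $matched and $first_failure are made up for this illustration; they are not from the original post.

```perl
use strict;
use warnings;

# The row from the question, on a single line.
my $row = '<tr align=right><td>/export/home3</td><td>308218</td><td>307200</td>'
        . '<td>308224</td><td>7.0 days</td><td>0</td><td>0</td><td>0</td><td>-</td></tr>';

# The first pieces of the question's pattern, in order (labels are hypothetical).
my @pieces = (
    [ 'row start',    qr{<tr align=right>}    ],
    [ 'path cell',    qr{<td>(?:/\w*?)+</td>} ],
    [ 'number cell',  qr{<td>(\d*?)</td>}     ],
    [ 'number cell',  qr{<td>(\d*?)</td>}     ],
    [ 'number cell',  qr{<td>(\d*?)</td>}     ],
    [ 'skipped cell', qr{<td>+?</td>}         ],
);

my $matched = qr{^};    # everything that has matched so far, anchored at the start
my $first_failure;      # label of the first piece that breaks the match, if any

for my $piece (@pieces) {
    my ($label, $re) = @$piece;
    my $candidate = qr{$matched$re};    # previous pieces plus the next one
    if ($row =~ $candidate) {
        $matched = $candidate;
        print "ok:   $label\n";
    }
    else {
        $first_failure = $label;
        print "FAIL: $label ($re)\n";
        last;
    }
}
```

On this row the pieces up to the third number cell match and the "skipped cell" piece is the first to fail: `<td>+?` asks for one or more `>` characters rather than for any cell content. A guess is that something like `[^<]+` was meant there. Note also that the full pattern stops before the last cell, so it could not capture the "-" even once that piece is fixed.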

You should also think in more general terms. Think about parsing the html. What you're looking for is stuff in between > and <.

# C:\dev\loose\html.treebuilder3.pl
use strict;
use warnings;

use HTML::TreeBuilder;

my $html = q~
<tr align=right><td>/export/home3</td><td>308218</td><td>307200</td><td>308224</td><td>7.0 days</td><td>0</td><td>0</td><td>0</td><td>-</td></tr>
~;

my $t = HTML::TreeBuilder->new();
$t->parse($html);
$t->eof;

for my $row ( $t->find_by_tag_name('tr') ){
    print join ' | ', map {
                        ref $_
                        ? $_->as_text
                        : $_
                    } @{ $row->content() },$/;
}

warn $_ for $html =~ m{> ( [^>]+ ) </}gx;


__END__
/export/home3 | 308218 | 307200 | 308224 | 7.0 days | 0 | 0 | 0 | - |
/export/home3 at html.treebuilder3.pl line 23.
308218 at html.treebuilder3.pl line 23.
307200 at html.treebuilder3.pl line 23.
308224 at html.treebuilder3.pl line 23.
7.0 days at html.treebuilder3.pl line 23.
0 at html.treebuilder3.pl line 23.
0 at html.treebuilder3.pl line 23.
0 at html.treebuilder3.pl line 23.
- at html.treebuilder3.pl line 23.

MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
** The third rule of perl club is a statement of fact: pod is sexy.


Re: ugly regex question
by matija (Priest) on Mar 30, 2004 at 09:07 UTC

I think you can save yourself a lot of pain by using HTML::TableExtract.
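A rough sketch of what that could look like, assuming HTML::TableExtract is installed from CPAN. The row from the question is wrapped in `<table>` tags here, which is an assumption about the surrounding page, since the module extracts tables rather than bare rows. The variable names are only illustrative.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Assumption: the row lives inside a <table> element on the real page.
my $html = '<table><tr align=right><td>/export/home3</td><td>308218</td>'
         . '<td>307200</td><td>308224</td><td>7.0 days</td><td>0</td><td>0</td>'
         . '<td>0</td><td>-</td></tr></table>';

# With no headers, depth or count constraints, every table is extracted.
my $te = HTML::TableExtract->new();
$te->parse($html);

my @rows;
for my $table ($te->tables) {
    for my $row ($table->rows) {
        # Each row is an array reference of cell texts; empty cells may be undef.
        push @rows, [ map { defined $_ ? $_ : '' } @$row ];
    }
}

print join(' | ', @$_), "\n" for @rows;
```

Each entry in @rows is then a plain list of cell values, so the numbers and the "-" can be picked out by position instead of by pattern.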


Re: ugly regex question
by jweed (Chaplain) on Mar 30, 2004 at 07:40 UTC

use HTML::Strip

Code is (almost) always untested.
http://www.justicepoetic.net/


Re: ugly regex question
by Tomte (Priest) on Mar 30, 2004 at 07:54 UTC

I think you'd be better off using something like HTML::Parser, but you can just remove all the tags, can't you?

my $text = '<tr align=right><td>/export/home3</td><td>308218</td><td>307200</td><td>308224</td><td>7.0 days</td><td>0</td><td>0</td><td>0</td><td>-</td></tr>';
$text =~ s!<[^>]+>!^!g;         # replace each tag with a caret
my @data = split(/\^/, $text);  # split at the carets
@data = grep { length } @data;  # drop empty strings, but keep "0"
print join(" : ", @data);
__END__
/export/home3 : 308218 : 307200 : 308224 : 7.0 days : 0 : 0 : 0 : -


hth, regards,
tomte

Hlade's Law:

If you have a difficult task, give it to a lazy person --
they will find an easier way to do it.


Re: ugly regex question
by kiat (Vicar) on Mar 30, 2004 at 08:17 UTC

my $string = q~<tr align=right><td>/export/home3</td><td>308218</td><td>307200</td><td>308224</td><td>7.0 days</td><td>0</td><td>0</td><td>0</td><td>-</td></tr>~;

$string =~ s/<.*?><.*?>/\n/g and print $string;

# output

/export/home3
308218
307200
308224
7.0 days
0
0
0
-
