html tag matching confusion

moonlord has asked for the wisdom of the Perl Monks concerning the following question:

Hi there. I have some data (a credit card statement) in the format below, and I've been trying to adapt a Perl script I found posted on Google to parse the data into a QIF file for import into Microsoft Money.

<tr><td bgcolor='#DCDCDC' align='left' width='100'><font face='arial,helvetica' size='-2'>&nbsp;28 Jun 2001</font></td><td bgcolor='#DCDCDC' align='left' width='300'><font face='arial,helvetica' size='-2'>&nbsp;HMV UK LTD              NOTTINGHAM    GB</font></td><td bgcolor='#DCDCDC' align='right' width='75'><font face='arial,helvetica' size='-2'>&pound;10.99 &nbsp;</font></td></tr><tr><td bgcolor='#DCDCDC' align='left' width='100'><font face='arial,helvetica' size='-2'>&nbsp;28 Jun 2001</font></td><td bgcolor='#DCDCDC' align='left' width='300'><font face='arial,helvetica' size='-2'>&nbsp;MARKS   SPENCER         NOTTINGHAM 06 GB</font></td><td bgcolor='#DCDCDC' align='right' width='75'><font face='arial,helvetica' size

I want to find certain tags using this code snippet and output a QIF file. The problem I'm having is in matching the start and end tags.

$start="<td bgcolor=\'#DCDCDC\' align=\'left\' width=\'300\'><font face='arial,helvetica' size='-2'>&nbsp;";
$end="</font></td>";
while (<>)
  {
    if (/$start(.*?)$end/g)
      {
        print "\n\n\nDOODAH:".$1."\n";
      }
  }

It never seems to match the start tag, and if I change the start tag to something simpler like

$start="<td bgcolor";

then Perl never seems to stop when it hits something that matches the $end variable. I've been going round and round on this and I just can't figure it out, so any advice would be greatly appreciated.

Cheers,
moonlord


Replies are listed 'Best First'.
Re: html tag matching confusion by rob_au (Abbot) on Nov 25, 2001 at 05:47 UTC
Have you considered looking at HTML::TokeParser for parsing this HTML? There is even an excellent tutorial for it here on this site by crazyinsomniac.

Ooohhh, Rob no beer function well without!
Re: html tag matching confusion by jarich (Curate) on Nov 25, 2001 at 08:33 UTC
I'm not sure how your block of HTML up there should be broken up. Is it all one line, or is it spread over many lines? On the assumption that it _might_ sooner or later span several lines, I offer you the following code:

my $start = q{<font\s+face='arial,helvetica'\s+size='-2'>&nbsp;};
my $end = q{</font></td>};
my @list;
{
    local $/;                # file slurping mode
    my $filecontents = <>;   # take in the whole file
    @list = ($filecontents =~ /$start(.*?)$end/sg);
}
print "\n\n\nDOODAH: @list\n";

You'll notice that we don't really need to look out for the td tag here, as what we want is between font tags. You might need to watch for this, since other cells use the same font tag.

We use \s+ instead of a literal space because this will catch newlines. Using the /s modifier on our regular expression allows . to match newlines as well. Calling the regular expression in list context with the /g modifier ensures that all possible matches are stored in our array @list. Likewise, using q{} to quote our variables means that we don't need to escape the quotes inside them.

This code works for me on the snippet of HTML that you provided. Good luck.
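As a rough sketch of the slurp-and-match approach described above, the following variation keeps the td tag in the pattern so that only the width='300' cells are captured. The sample data is made up to resemble the statement in the question, the in-memory filehandle stands in for the real input file, and the whitespace clean-up at the end is an addition, not part of the original suggestion.

```perl
use strict;
use warnings;

# Made-up sample modelled on the statement in the question; the line break
# inside the second description is deliberate, to exercise the /s modifier.
my $font = "<font face='arial,helvetica' size='-2'>";
my $html = join "\n",
    "<tr><td bgcolor='#DCDCDC' align='left' width='100'>$font&nbsp;28 Jun 2001</font></td>",
    "<td bgcolor='#DCDCDC' align='left' width='300'>$font&nbsp;HMV UK LTD              NOTTINGHAM    GB</font></td>",
    "<td bgcolor='#DCDCDC' align='right' width='75'>$font&pound;10.99 &nbsp;</font></td></tr>",
    "<tr><td bgcolor='#DCDCDC' align='left' width='100'>$font&nbsp;28 Jun 2001</font></td>",
    "<td bgcolor='#DCDCDC' align='left' width='300'>$font&nbsp;MARKS   SPENCER         NOTTINGHAM\n 06 GB</font></td></tr>";

# Slurp from an in-memory filehandle; leaving $/ undefined reads everything at once.
my $contents = do {
    open my $fh, '<', \$html or die "Cannot open in-memory file: $!";
    local $/;
    <$fh>;
};

# Only td cells carrying width='300' qualify as a start tag here.
my $start = q{<td[^>]*\swidth='300'[^>]*>\s*<font[^>]*>&nbsp;};
my $end   = q{</font></td>};

# List context plus /g returns every capture; /s lets . cross newlines.
my @descriptions = $contents =~ /$start(.*?)$end/sg;

# Collapse runs of whitespace (including the embedded newline) to one space.
s/\s+/ /g for @descriptions;

print "$_\n" for @descriptions;
```

With this sample the date and amount cells are skipped because their td tags carry a different width.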
Re: html tag matching confusion by demerphq (Chancellor) on Nov 25, 2001 at 16:54 UTC
Looking at your data, my guess is that you are pulling records from an HTML table. Presumably this table has headers for each column, in which case the easiest way to extract your data is to use HTML::TableExtract. Here is an excerpt from the pod:

use HTML::TableExtract;
$te = HTML::TableExtract->new( headers => [qw(Date Price Cost)] );
$te->parse($html_string);

# Examine all matching tables
foreach $ts ($te->table_states) {
    print "Table (", join(',', $ts->coords), "):\n";
    foreach $row ($ts->rows) {
        print join(',', @$row), "\n";
    }
}

It's a very easy to use and powerful tool. Check it out. (Err, and it's useful even if there aren't column headers.)

Yves / DeMerphq -- Have you registered your Name Space?
Re: html tag matching confusion by Rich36 (Chaplain) on Nov 25, 2001 at 05:47 UTC
Maybe it's tripping up on the ";" in `&nbsp;` in your match. Try "\"ing it. Does it compile OK? Have you tried running "perl -c"? Also, I'm sure that there are modules out there to parse HTML tags. I'm not familiar with them, not having used them, but I'm sure someone could suggest something.

Rich36
There's more than one way to screw it up...
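For what it's worth, a bare ";" has no special meaning inside a Perl pattern, so escaping it should not change anything. If escaping is a concern, a minimal sketch like the one below uses \Q...\E to quote every non-word character in an interpolated string. The $cell value is a shortened, made-up version of one cell from the question.

```perl
use strict;
use warnings;

# Shortened, made-up cell modelled on the data in the question.
my $cell  = "<font face='arial,helvetica' size='-2'>&nbsp;HMV UK LTD</font></td>";
my $start = "<font face='arial,helvetica' size='-2'>&nbsp;";

# A bare ";" is an ordinary character in a pattern, so this matches as written.
my $plain_ok = $cell =~ /&nbsp;/ ? 1 : 0;

# \Q...\E escapes every non-word character in the interpolated string,
# so $start is matched literally whatever punctuation it contains.
my ($text) = $cell =~ /\Q$start\E(.*?)<\/font>/;

print "plain match: $plain_ok\n";
print "captured: $text\n";
```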
Re: html tag matching confusion by mattr (Curate) on Nov 25, 2001 at 12:36 UTC
1) You could try using B::Deparse on the command line to see what Perl makes of the script you are trying to use.
2) It is possible there could be a bug if you have an old Perl. At any rate, to stay sane, why not make up your own tag names like "\nSTART\t" and do a global replace on the data first? Then maybe you could read it yourself and have less trouble debugging.
3) Also, you just don't want to use dot-star. Really. ".*?" is dangerous, especially for finding things with quotes embedded in them, as that link (Ovid's) will show. Ovid suggests a negated character class. You could also use an available HTML parser, or in the beginning just strip out all the bad stuff first (you need to know you are not stripping good data by accident). You could also inch through the data using pos to parse a bit at a time.

Move SIG!*
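The marker idea in point 2 might look something like the sketch below. The tags in $data are simplified, made-up stand-ins for the real ones, and the END marker is an assumption added to pair with the suggested "\nSTART\t". Once the markers are in place, a negated character class picks out each field.

```perl
use strict;
use warnings;

# Simplified, made-up cells; the real statement has more attributes per tag.
my $data = "<td width='300'><font size='-2'>&nbsp;HMV UK LTD</font></td>"
         . "<td width='75'><font size='-2'>&pound;10.99</font></td>";

# Replace the noisy opening and closing tags with easy-to-read markers.
(my $marked = $data) =~ s/<td[^>]*><font[^>]*>/\nSTART\t/g;
$marked =~ s{</font></td>}{\tEND}g;

# Each field now sits on its own line between START and END; the negated
# class [^\t]* cannot run past the tab that precedes END.
my @fields = $marked =~ /^START\t([^\t]*)\tEND$/mg;

print "$_\n" for @fields;
```

Printing $marked on its own is also a quick way to eyeball the data while debugging.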
Re: Re: html tag matching confusion by jarich (Curate) on Nov 25, 2001 at 14:26 UTC
I think that in this particular instance, .*? ought to be fine, as embedded font tags are not legal (whereas in Ovid's example, embedded "s are fine). In this case we're looking for stuff between <font ...> and </font>, so .*? works a charm. However, if you had code like:

<font ..> text <font ..> more text </font> and some </font>

you'd get

text <font ..> more text

out. This would be awkward, but a negated character class won't save us. If it is possible that you're getting insane HTML, then you have to expect bugginess from any regexp we come up with.
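The nested-tag case above can be tried directly. This small sketch compares the non-greedy match with one possible negated character class, [^<]*, chosen here purely for illustration. Neither returns the full outer contents.

```perl
use strict;
use warnings;

# The nested example from the reply above, with "<font ..>" taken literally.
my $html = "<font ..> text <font ..> more text </font> and some </font>";

# Non-greedy: starts at the first opening tag, stops at the first closing tag.
my ($lazy) = $html =~ m{<font \.\.>(.*?)</font>};

# Negated class: cannot cross any "<", so the first opening tag fails to
# match and the engine settles on the inner pair instead.
my ($negated) = $html =~ m{<font \.\.>([^<]*)</font>};

print "lazy:    [$lazy]\n";
print "negated: [$negated]\n";
```

The non-greedy version returns text with a stray opening tag in it, while the negated class silently drops the outer text, which is why neither is a real fix for malformed input.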