in reply to Extracting the few useful lines from HTML garble

It would be useful if you could show a cut-down, simplified example of the HTML and what "the few useful lines" you need actually contain.

If you're after, say, the subject, date and from fields, these may not all be on the same line, so you may need to consider a parser.

Something like HTML::TokeParser::Simple or HTML::TreeBuilder may be more suitable than a regex especially if it's garbled. :-)
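As a rough illustration of the parser approach, here is a minimal sketch using HTML::TreeBuilder. The sample HTML is invented (the real markup hasn't been posted), so the table layout and cell contents are assumptions; the point is only that a cell split across several source lines still comes back as one value.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample: the original poster's HTML was not shown,
# so this table layout is an assumption for illustration only.
my $html = <<'HTML';
<table>
  <tr><th>From</th><th>Subject</th></tr>
  <tr><td>someone</td>
      <td>A subject
          split over lines</td></tr>
</table>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

my @rows;
for my $tr ( $tree->look_down( _tag => 'tr' ) ) {
    # as_trimmed_text strips leading/trailing whitespace and
    # collapses internal runs of whitespace to a single space.
    my @cells = map { $_->as_trimmed_text } $tr->look_down( _tag => 'td' );
    next unless @cells;    # the header row has <th> cells only
    push @rows, join( ' | ', @cells );
}

print "$_\n" for @rows;

$tree->delete;             # free the parse tree
```

A line-by-line regex would see the subject cell as two separate lines; the parser works on elements, so line breaks inside a cell don't matter.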

Re^2: Extracting the few useful lines from HTML garble
by handheld-penguin (Initiate) on Aug 11, 2008 at 17:26 UTC
      Here's my go with HTML::TableExtract.
      #!/usr/local/bin/perl
      use strict;
      use warnings;
      use HTML::TableExtract;

      my @headers = qw{From Subject Received Size};
      my $te = HTML::TableExtract->new(headers => \@headers);
      $te->parse_file(q{html/monk.html}) or die qq{parse failed\n};
      my $ts = $te->first_table_found();
      foreach my $row ($ts->rows) {
          for my $i (0..$#{$row}){
              print qq{$headers[$i]: $row->[$i]\n};
          }
      }
      From: usernme
      Subject: Personal Statement - 08/09/08
      Received: Sat 09/08/2008 04:25 PM
      Size: 124 KB
      There's a non-breaking space after the size.

      Ok. If you want the whole line, then just change my code above to the following (provided that "/Inbox/email.EML" is on every line; since I can see only one line, I have no idea which parts are static and which are variable):

      while (<F>) { $theline=$_ if ( m{/Inbox/email\.EML}xms ); }
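      For anyone who wants to try the loop above in isolation, here is a self-contained sketch. The input data is made up and read from an in-memory filehandle instead of the real file, and it collects every matching line in a hypothetical @matches array rather than keeping only the last one.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input standing in for the real HTML file.
my $data = <<'END';
<td>noise</td>
<a href="/Inbox/email.EML">Personal Statement</a>
<td>more noise</td>
END

# Open the string as a read-only in-memory file.
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @matches;
while ( my $line = <$fh> ) {
    # Keep every line that contains the fixed path.
    push @matches, $line if $line =~ m{/Inbox/email\.EML};
}
close $fh;

print @matches;
```

      Replacing `\$data` with a file name reads from disk instead.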

      If this is not the answer you want, you will have to be more specific.

      By the way, I just noticed that you wrote EMl in your first post, which looks exactly like EMI in my web browser. So to get my first code snippet above to work, you have to change EMI to EML there.