Re: Extracting the few useful lines from HTML garble

It would be useful if you could show a cut-down, simplified example of the HTML and say what "the few useful lines" you need actually contain.

If you're after, say, the subject, date and sender, these may not all be on the same line, and you may need to consider a parser.

Something like HTML::TokeParser::Simple or HTML::TreeBuilder may be more suitable than a regex, especially if the HTML is garbled. :-)

Re^2: Extracting the few useful lines from HTML garble by handheld-penguin (Initiate) on Aug 11, 2008 at 17:26 UTC
Here is the HTML in question. Everything I want is on one big line, at line 75: http://pastebin.com/mf0291b0
Re^3: Extracting the few useful lines from HTML garble by wfsp (Abbot) on Aug 12, 2008 at 05:22 UTC
Here's my go with HTML::TableExtract.

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

use HTML::TableExtract;

my @headers = qw{From Subject Received Size};
my $te = HTML::TableExtract->new(headers => \@headers);
$te->parse_file(q{html/monk.html})
  or die qq{parse failed\n};

my $ts = $te->first_table_found();

foreach my $row ($ts->rows) {
    for my $i (0..$#{$row}){
        print qq{$headers[$i]: $row->[$i]\n};
    }
}
```

Output:

```
From: usernme
Subject: Personal Statement - 08/09/08
Received: Sat 09/08/2008 04:25 PM
Size: 124 KB
```

There's a non-breaking space after the size.
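If the non-breaking space gets in the way later (for comparisons or printing, say), it can be normalized before use. What follows is only a minimal sketch using core Perl: `clean_cell` is a hypothetical helper name, the sample value is made up, and it assumes the cell text is a decoded character string in which the non-breaking space is the single character U+00A0. If the text is still raw UTF-8 bytes, it would need decoding first.

```perl
use strict;
use warnings;

# Hypothetical helper: replace non-breaking spaces with ordinary spaces
# and trim whitespace from both ends. Assumes decoded character data,
# where a non-breaking space is the single character U+00A0.
sub clean_cell {
    my ($text) = @_;
    return '' unless defined $text;
    $text =~ s/\x{00A0}/ /g;      # non-breaking space -> plain space
    $text =~ s/^\s+|\s+$//g;      # trim leading and trailing whitespace
    return $text;
}

# Made-up value standing in for a Size cell.
my $size = "124\x{00A0}KB";
print clean_cell($size), "\n";
```

In the loop above, the same helper could be applied to `$row->[$i]` before printing.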
Re^3: Extracting the few useful lines from HTML garble by jethro (Monsignor) on Aug 11, 2008 at 23:25 UTC
Ok. If you want the whole line, then just change my code above to the following (provided that "/Inbox/email.EML" is on every line you want; since I can see only one line, I have no idea which parts are static and which are variable):

```perl
while (<F>) {
    $theline=$_ if ( m{/Inbox/email\.EML}xms );
}
```

If this is not the answer you want, you will have to be more specific.

By the way, I just noticed that you wrote EMl in your first post, which looks exactly like EMI in my web browser. So to get my first code snippet above to work, you have to change EMI to EML there.
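For anyone who wants to try the line-matching idea without the original file, here is a self-contained sketch of the same loop under `use strict`. The helper name `find_mail_line` and the sample HTML are made up for illustration; the only thing taken from the thread is the marker "/Inbox/email.EML". An in-memory filehandle stands in for the real page.

```perl
use strict;
use warnings;

# Hypothetical helper: return the last line read from $fh that contains
# the marker "/Inbox/email.EML", or undef if no line matches.
sub find_mail_line {
    my ($fh) = @_;
    my $theline;
    while ( my $line = <$fh> ) {
        $theline = $line if $line =~ m{/Inbox/email\.EML};
    }
    return $theline;
}

# Made-up sample input standing in for the real page.
my $html = <<'END_HTML';
<html><body>
<p>navigation and other clutter</p>
<tr><td><a href="/Inbox/email.EML">Personal Statement</a></td></tr>
<p>footer</p>
</body></html>
END_HTML

# Read the string through an in-memory filehandle, as if it were a file.
open my $fh, '<', \$html or die "cannot open in-memory file: $!";
my $found = find_mail_line($fh);
close $fh;

print defined $found ? $found : "no matching line\n";
```

Like the original loop, this keeps only the last matching line. If several lines can match, pushing each one onto an array would be the obvious variation.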