Improved HTML2TXT regexp parser!!!

So I'm here again. I tried improving my skills in using regexps to parse HTML, though advised to desist ;-) and here's what I did:

- better handling of multi-line tags: it works even if < and > are REALLY far away from each other!
- better use of the regexps themselves, "stripping" many similar actions down to a single one;
- use of a sub (I learned how to use them, finally), and this makes me proud!
- general optimization, so that the code is many bytes shorter and works better!

The main intent was to keep the code under 1K, and so it is.

I still believe that accomplishing strange tasks in non-canonical ways is a good means to improve one's skill. This (obviously) doesn't mean that people should drive blindfolded to improve their driving skill!!! Programming snippets is fun; driving needs responsibility.

So enjoy my code, or dislike it, or even hate it. But if you find it interesting (for instance as a means to demonstrate the INFINITE SLOWNESS of the Winword HTML->TXT filter), feel free to send me an email at baginov@hotmail.com.

And if you feel really well, please do something good for other people. Many people need help, just take a look around.

enjoy
SiG

-------------------------------------------------------
This is the third edit.... I got Chady's advice and applied it! Thanks again.
-------------------------------------------------------

#!/usr/bin/perl
use strict; use warnings;
my $fn = $ARGV[0];
if (!$fn){
    print "Input File:\n";
    chomp($fn = <STDIN>);
    }
open (INF, "<", $fn) or die "$fn: $!";
# write to .txt, never back onto the input file
$fn =~ s/\.html?$/.txt/i or $fn .= ".txt";
open (OUF, ">", $fn) or die "$fn: $!";
# strip the markup from one chunk
sub par
    {
    my $par = shift;
    $par =~ s/<script\b.*?<\/script\s*>/----- Script -----\n/gsi;
    $par =~ s/<img\b[^>]*>/\n---------\n| Image |\n---------\n/gi;
    $par =~ s/<br\s*\/?>/\n/gi;
    $par =~ s/<.*?>//gs;
    $par =~ s/&nbsp;/ /g;
    $par =~ s/&(\w)(?:grave|acute);/$1\'/g;
    $par =~ s/&lt;/</g;
    $par =~ s/&gt;/>/g;
    $par =~ s/&quot;/"/g;
    return $par;
    }
my $cl;
while (my $nl = <INF>)
{
$cl .= $nl;
    # wait while a tag or script is still open
    next if $cl =~ /<[^>]*\z/ or $cl =~ /<script\b(?!.*<\/script)/si;
    print OUF par($cl);
    undef $cl;
}
print OUF par($cl) if defined $cl;
close (INF);
close (OUF);
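
The multi-line tag handling mentioned in the post boils down to buffering lines until no tag is left open. Here is a minimal sketch of that idea on an in-memory list of lines instead of a file. The sub names `tag_open` and `chunks` are made up for this illustration and are not part of the snippet, and the "open tag" test is an assumption about what counts as unfinished (a `<` with no `>` after it).

```perl
use strict;
use warnings;

# Hypothetical helper: true while the buffer ends inside an unclosed tag,
# i.e. there is a "<" with no ">" between it and the end of the buffer.
sub tag_open {
    my ($buf) = @_;
    return $buf =~ /<[^>]*\z/ ? 1 : 0;
}

# Hypothetical helper: group lines into chunks that never split a tag.
sub chunks {
    my @lines = @_;
    my (@out, $buf);
    for my $line (@lines) {
        $buf .= $line;
        next if tag_open($buf);      # tag still open: keep reading
        push @out, $buf;
        undef $buf;
    }
    push @out, $buf if defined $buf; # leftover, possibly a broken tag
    return @out;
}

my @c = chunks("<a\n", "  href='x'\n", ">link</a>\n", "plain text\n");
print scalar(@c), " chunks\n";       # prints "2 chunks"
```

The first three lines end up in one chunk because the `<a` tag only closes on the third line; the plain text line is a chunk of its own. Each chunk can then be handed to the tag-stripping sub without any risk of cutting a tag in half.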

Re: Improved HTML2TXT regexp parser!!! by Chady (Priest) on Aug 01, 2001 at 13:15 UTC
Just a few comments:

I assume that this: `if (!$ARGV[0]){ print "Input File:\n"; $fn = ; chop($fn); }` was meant to be: `if (!$ARGV[0]){ print "Input File:\n"; chomp($fn = <STDIN>); }`

- I think that `$par = $_[0];` is better off as `$par = shift;`
- `<.*script.*>` is greedy
- what's that? `$par =~ s/\"/\"/g;` There's a lot more with the regexes but I won't talk about that anyway...
- `while ($nl=)` is missing the `<INF>`
- `$cl = $cl.$nl` should be substituted with `$cl .= $nl`
- `$cl = "";` is better as `undef $cl`

There's a lot more in the coding. Now, I know TIMTOWTDI, but some ways are more efficient, and after all you are using Perl, so you should use it.

Good to try new things, but try to improve your coding skills by looking at other people's code and analysing it.

Improved, but not by much.

He who asks will be a fool for five minutes, but he who doesn't ask will remain a fool for life.
Chady | http://chady.net/
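Two of the points in this reply, the greedy script pattern and `chop` versus `chomp`, are easy to see in a few lines. This is only a sketch on made-up sample strings; the non-greedy pattern shown is one possible fix, not necessarily the one the reply had in mind.

```perl
use strict;
use warnings;

my $html = "<p>one</p><script src='x.js'></script><p>two</p>";

# Greedy: .* runs from the first < to the last >, eating all the text.
(my $greedy = $html) =~ s/<.*script.*>/[script]/gsi;

# Non-greedy and anchored on the tag name: only the script element goes.
(my $lazy = $html) =~ s/<script\b.*?<\/script\s*>/[script]/gsi;

print "greedy: $greedy\n";   # greedy: [script]
print "lazy:   $lazy\n";     # lazy:   <p>one</p>[script]<p>two</p>

# chop removes the last character whatever it is; chomp only removes
# a trailing newline, so it is harmless on input that has none.
my $chopped = "page.html";
chop $chopped;
my $chomped = "page.html";
chomp $chomped;

print "chop:  $chopped\n";   # chop:  page.htm
print "chomp: $chomped\n";   # chomp: page.html
```

With the greedy pattern both paragraphs vanish along with the script, which is why the pattern is worth tightening before it is run over a whole page.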
Re: Re: Improved HTML2TXT regexp parser!!! by Sigmund (Pilgrim) on Aug 01, 2001 at 14:20 UTC
Sorry Chady.... I had problems when cutting & pasting.... from X!!!!!! Strange. Now I've edited my snippet and it should be OK. Please take a look at the second edit. Bye