
So I'm here again. I tried improving my skills in using regexps to parse HTML, though advised to desist ;-) and here's what I did: better handling of multi-line tags (it works even if < and > are REALLY far away from each other!); better use of the regexps themselves, "stripping" many similar actions into only one; use of a sub (I learned how to use them, finally), and this makes me proud!; and general optimization, so that the length of the code is reduced by many bytes and it works better!

The main intent was to keep the code under 1K, and so it is. I still believe that accomplishing strange tasks in non-canonical ways is a good means of improving one's skills. This doesn't (obviously) mean that people should drive blindfolded to improve their driving skills!!! Programming snippets is fun; driving needs responsibility.

So, enjoy my code, or dislike it, or even hate it. But if you find it interesting (for instance as a means to demonstrate the INFINITE SLOWNESS of the Winword HTML->TXT filter), feel free to send me an email at baginov@hotmail.com. And if you feel really well, please do something good for other people. Many people need help, just take a look around.

enjoy
SiG

-------------------------------------------------------
This is the third edit. I got chady's advice and applied it! Thanks again.
-------------------------------------------------------

#!/usr/bin/perl
use strict;
use warnings;
my $fn = shift;
unless (defined $fn) {
    print "Input File:\n";
    chomp($fn = <STDIN>);
}
open(my $inf, '<', $fn) or die "$fn: $!";
# don't clobber the input file
$fn =~ s/\.html?$/.txt/i or $fn .= '.txt';
open(my $ouf, '>', $fn) or die "$fn: $!";
sub par {
    my $par = shift;
    $par =~ s/<script\b.*?<\/script>/----- Script -----\n/gsi;
    $par =~ s/<img\b[^>]*>/\n---------\n| Image |\n---------\n/gi;
    $par =~ s/<br\b[^>]*>/\n/gi;
    $par =~ s/<[^>]*>//g;
    $par =~ s/&nbsp;/ /g;
    $par =~ s/&(\w)(?:grave|acute);/$1\'/g;
    $par =~ s/&lt;/</g;
    $par =~ s/&gt;/>/g;
    $par =~ s/&quot;/"/g;
    $par =~ s/&amp;/&/g;    # must come last
    return $par;
}
my $cl = '';
while (my $nl = <$inf>) {
    $cl .= $nl;
    # flush once no tag or script is left open
    if ($cl =~ />[^<]*\n/ && $cl !~ /<script\b(?!.*<\/script)/si) {
        print {$ouf} par($cl);
        $cl = '';
    }
}
print {$ouf} par($cl);
close($inf);
close($ouf);
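
For anyone who wants to play with the multi-line tag trick on its own, here is a minimal sketch that runs the same kind of buffering test as the while loop above on a few lines held in memory instead of a file. The name strip_buffered and the sample lines are my own assumptions for illustration, and only the plain tag-stripping regexp is applied.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# strip_buffered is a hypothetical helper, not part of the script above.
# It strips tags from a list of lines, holding lines back until the
# newest one has a '>' with no '<' after it.
sub strip_buffered {
    my @lines = @_;
    my ($buf, $out) = ('', '');
    for my $line (@lines) {
        $buf .= $line;
        next unless $buf =~ />[^<]*\n/;   # a tag may still be open
        $buf =~ s/<[^>]*>//g;
        $out .= $buf;
        $buf = '';
    }
    $buf =~ s/<[^>]*>//g;                 # flush the leftovers
    return $out . $buf;
}

# An <a> tag whose '<' and '>' sit on different lines.
my @html = ("Click <a\n", "   href=\"x.html\"\n", "   >here</a> now\n");
print strip_buffered(@html);    # prints "Click here now"
```

The three input lines are glued together before any regexp runs, so the tag is removed as a whole even though it starts on the first line and ends on the third.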

In reply to Improved HTML2TXT regexp parser!!! by Sigmund
