
if ($line =~ /<h3>(.*?)(\d.*)<\/h3>/) {
    print "Generic name is: $1 \n\t Dosage is:$2\n";
}

Sure, it may work for the input shown in "parsing a line with $1, $2, $3", but it is very fragile:

Imagine someone at the data source suddenly decides that upper case tags are "more pretty" (despite all recommendations to use lower case tags). It is still valid HTML, but you have to update your RE.

Imagine that "someone" also decides that the headings should be formatted by inline CSS (despite all recommendations not to use inline CSS). <H3> will be changed to <H3 STYLE="color:red;font-weight:bold;">. It is still valid HTML, but you have to update your RE.

Imagine that "someone" is told that semantic markup is the new trend. So he adds a class attribute, but still keeps the inline style. Unfortunately, the order of the attributes varies from line to line, because the HTML is manually edited using ed on a VT52. The header now starts either with <H3 CLASS="generica" STYLE="color:red;font-weight:bold;"> or with <H3 STYLE="color:red;font-weight:bold;" CLASS="generica">. Except for some cases where the class attribute was accidentally omitted. It is still valid HTML, but you have to update your RE (unless it allows arbitrary attributes).

And to make things worse, our imaginary ed-user randomly inserted TABs and CRs into the tags while running low on caffeine. It is still valid HTML, but you have to update your RE.
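To make the fragility concrete, here is a small sketch that runs the unchanged RE against a handful of such variants. The drug name and dosage are made up for illustration; only the pattern comes from the quoted code.

```perl
use strict;
use warnings;

# The pattern from the quoted snippet, unchanged.
my $re = qr/<h3>(.*?)(\d.*)<\/h3>/;

# Hypothetical input lines; name and dosage are invented.
my @variants = (
    '<h3>Aspirin 500 mg</h3>',
    '<H3>Aspirin 500 mg</H3>',
    '<H3 STYLE="color:red;font-weight:bold;">Aspirin 500 mg</H3>',
    '<H3 CLASS="generica" STYLE="color:red;font-weight:bold;">Aspirin 500 mg</H3>',
    "<h3\t>Aspirin 500 mg</h3\r>",
);

# 1 where the pattern matches, 0 where it does not.
my @matched = map { $_ =~ $re ? 1 : 0 } @variants;

for my $i (0 .. $#variants) {
    (my $shown = $variants[$i]) =~ s/[\t\r]/ /g;   # keep the report on one line
    printf "%-8s %s\n", $matched[$i] ? 'match' : 'NO MATCH', $shown;
}
```

Only the first line matches. Every other variant needs yet another change to the pattern.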

You could try to keep up with the problem by making the RE more complex with every minor change in the input file. Or you could simply use an HTML parser that can handle all of those cases, and nasty things like SGML minimization. A parser also takes care of decoding escaped characters (entities), unlike your RE.
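A minimal sketch of the parser approach, assuming HTML::TokeParser (part of the HTML-Parser distribution on CPAN, not core Perl) is installed. The sample input is invented to cover the cases above, and it assumes every heading has the form "name, then dosage starting with a digit". Any other HTML parser would do as well.

```perl
use strict;
use warnings;
use HTML::TokeParser;   # from the HTML-Parser distribution on CPAN, not core Perl

# Hypothetical input: mixed case, attributes in either order,
# TAB and CR/LF inside a tag, and an escaped ampersand.
my $html = <<"END_HTML";
<h3>Aspirin 500 mg</h3>
<H3 STYLE="color:red;font-weight:bold;">Ibuprofen 200 mg</H3>
<H3 CLASS="generica"\tSTYLE="color:red;font-weight:bold;"\r\n>Paracetamol 1 g</H3>
<h3 style="color:red;font-weight:bold;" class="generica">A&amp;B Cough Syrup 10 ml</h3>
END_HTML

my $parser = HTML::TokeParser->new(\$html)
    or die "Cannot set up parser";

my @found;
# get_tag() compares lower-cased tag names, so <H3 ...> is found as well.
while (my $tag = $parser->get_tag('h3')) {
    # Text up to the closing tag, with entities decoded and whitespace trimmed.
    my $text = $parser->get_trimmed_text('/h3');
    # The RE now only has to split plain text, not HTML.
    if (my ($name, $dosage) = $text =~ /^(.*?)\s*(\d.*)$/) {
        push @found, [$name, $dosage];
        print "Generic name is: $name\n\tDosage is: $dosage\n";
    }
}
```

The RE is still there, but it only splits the heading text into name and dosage. Tag case, attributes, and whitespace inside tags are the parser's problem.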

Alexander

--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

In reply to Re^3: parsing a line with $1, $2, $3 by afoken
in thread parsing a line with $1, $2, $3 by kevyt
