Re^2: parsing a line with $1, $2, $3

The tricky part is separating the generic name from dosage, since the HTML markup won't help with that. If you want everything from the first numeric character to the closing </h3>, this will do it:

if ($line =~ /<h3>(.*?)(\d.*)<\/h3>/) {
    print "Generic name is: $1 \n\t Dosage is:$2\n";
}

But as ww says above, be *really* sure that the data are consistent. There will probably be some special cases whose dosage doesn't start with a number, or that have something else wrong with them, just to give you a headache.

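Edit, for the OP: I used "*?" (the minimal-match quantifier) instead of "*" to match the name. A plain "*" is greedy: it grabs as much as it can, so the name capture would run all the way to the last digit before the closing tag and leave only that digit and whatever follows it for the dosage. With the "?" modifier it matches only as much as it needs to satisfy the match, so the name stops at the first digit.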
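To see the difference concretely, here is a small sketch that runs both patterns over one line. The sample line is made up for illustration (it is not from the OP's data), and the variable names are just ones I picked.

```perl
use strict;
use warnings;

# Hypothetical sample line, not taken from the original data.
my $line = '<h3>Amoxicillin 500 mg capsule</h3>';

# Non-greedy: (.*?) stops as soon as the rest of the pattern can match,
# i.e. at the first digit.
my ($lazy_name, $lazy_dose) = $line =~ /<h3>(.*?)(\d.*)<\/h3>/;

# Greedy: (.*) grabs everything, then backtracks only as far as the
# last digit that still lets the rest of the pattern match.
my ($greedy_name, $greedy_dose) = $line =~ /<h3>(.*)(\d.*)<\/h3>/;

print "non-greedy: name='$lazy_name' dosage='$lazy_dose'\n";
print "greedy:     name='$greedy_name' dosage='$greedy_dose'\n";
```

With this sample, the non-greedy version should give the name "Amoxicillin " (note the trailing space) and the dosage "500 mg capsule", while the greedy one splits inside the number, giving "Amoxicillin 50" and "0 mg capsule".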

Another edit: ww reminded me of something I thought of while typing a reply that I accidentally closed before posting. You might want to get a separate list of generic names without the additional information, so you can isolate those from the records you're dismantling. This may be what you're trying to generate, though...


Re^3: parsing a line with $1, $2, $3 by afoken (Chancellor) on Mar 10, 2012 at 21:06 UTC
`if ($line =~ /<h3>(.*?)(\d.*)<\/h3>/) { print "Generic name is: $1 \n\t Dosage is:$2\n"; }`

Sure, that may work for the input shown in "parsing a line with $1, $2, $3". But it is very fragile:

Imagine someone at the data source suddenly decides that upper case tags are "more pretty" (despite all recommendations to use lower case tags). It is still valid HTML, but you have to update your RE.

Imagine that "someone" also decides that the headings should be formatted by inline CSS (despite all recommendations not to use inline CSS). `<H3>` will be changed to `<H3 STYLE="color:red;font-weight:bold;">`. It is still valid HTML, but you have to update your RE.

Imagine that "someone" is told that semantic markup is the new trend. So he adds a `class` attribute, but still keeps the inline style. Unfortunately, the order of the attributes varies from line to line, because the HTML is manually edited using ed on a VT52. The header now starts either with `<H3 CLASS="generica" STYLE="color:red;font-weight:bold;">` or with `<H3 STYLE="color:red;font-weight:bold;" CLASS="generica">`. Except for some cases where the class attribute was accidentally omitted. It is still valid HTML, but you have to update your RE (unless it allows arbitrary attributes).

And to make things worse, our imaginary ed-user randomly inserted TABs and CRs into the tags while running low on caffeine. It is still valid HTML, but you have to update your RE.

You could try to keep up with the problem by making the RE more complex with every minor change in the input file. Or you could simply use an HTML parser that can handle all of those cases, and nasty things like SGML minimization. It also takes care of decoding escaped characters, unlike your RE.

Alexander
-- Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
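A quick way to check the fragility claim is to feed the quoted RE a few variants of the same heading. This is only a sketch: every sample string below is invented for illustration, and the labels and the `%result` hash are hypothetical names, not anything from the original data.

```perl
use strict;
use warnings;

# Hypothetical variants of one heading; all are made up for illustration.
my @variants = (
    [ 'plain'      => '<h3>Amoxicillin 500 mg</h3>' ],
    [ 'upper case' => '<H3>Amoxicillin 500 mg</H3>' ],
    [ 'inline CSS' => '<h3 style="color:red;">Amoxicillin 500 mg</h3>' ],
    [ 'tab in tag' => "<h3\t>Amoxicillin 500 mg</h3>" ],
    [ 'entity'     => '<h3>Aspirin &amp; Caffeine 500 mg</h3>' ],
);

my %result;
for my $variant (@variants) {
    my ($label, $html) = @$variant;

    # The RE from the parent post, unchanged.
    if ($html =~ /<h3>(.*?)(\d.*)<\/h3>/) {
        $result{$label} = "name='$1' dosage='$2'";
    }
    else {
        $result{$label} = 'NO MATCH';
    }
    print "$label: $result{$label}\n";
}
```

Only the plain and entity lines match. The upper case, inline CSS, and tab variants fall through to NO MATCH, and the entity line matches but keeps the raw `&amp;` in the name, because a regex does no entity decoding.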
Re^4: parsing a line with $1, $2, $3 by bitingduck (Deacon) on Mar 11, 2012 at 00:36 UTC
I don't disagree with any of that. I'd use an HTML parser to get the content out of the various tags, which is the easy part. The harder (and even more fragile) part is separating the generic names, doses, and forms, because there are no markup indicators at all, and there's probably just enough inconsistency in the source data to make you crazy. I don't use Perl a lot, but it's what I use when I need to deal with HTML, XML, or structured stuff that I can coerce into looking like HTML or XML, because I can then use something from CPAN that will be better behaved, in less time, than anything I could write myself.
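For the splitting step, a rough sketch of how I'd cope with the inconsistency: split the heading text at the first digit, and set aside anything that doesn't fit the pattern instead of guessing. It assumes the text has already been pulled out of the tags. The records and the names `split_record`, `@parsed`, and `@needs_review` are invented for illustration, and real data will almost certainly need more rules than this.

```perl
use strict;
use warnings;

# Split heading text (already extracted from the HTML) into name and dosage.
# Returns an empty list unless the text is a non-digit name followed by
# something that starts with a digit.
sub split_record {
    my ($text) = @_;
    return unless $text =~ /^\s*(\D+?)\s*(\d.*?)\s*$/;
    return ($1, $2);
}

# Hypothetical records; not taken from the original data.
my @records = (
    'Amoxicillin 500 mg capsule',
    'Ibuprofen 200 mg tablet',
    'Nystatin oral suspension',    # no number at all
    '5-Fluorouracil 50 mg/ml',     # name itself starts with a digit
);

my (@parsed, @needs_review);
for my $text (@records) {
    if (my ($name, $dosage) = split_record($text)) {
        push @parsed, { name => $name, dosage => $dosage };
    }
    else {
        push @needs_review, $text;    # leave for a human to look at
    }
}

print "Parsed: $_->{name} | $_->{dosage}\n" for @parsed;
print "Needs review: $_\n" for @needs_review;
```

The point of the review list is that the odd records get looked at by hand rather than silently mangled.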