in reply to parsing a line with $1, $2, $3
You'll do well to study the references Kcott provided above. Your problems stem largely from an unclear idea of what constitutes a matchable pattern and, in part, from confusion about escaping characters -- that's only necessary when a character is special to Perl's regex engine or is the delimiter of the pattern itself.
<aside> RichardK's point is well taken... but you've asked how to do this with a regex. Revising this to utilize $1, $2, $3 is left as an exercise (and not necessarily worth the time: it's often less error-prone to write multiple regexen rather than one which is monolithic.)</aside>
So, assuming that you've worked out how to get your data into the form you've shown, the following is executable (which is strongly recommended in the Monastery's guidance on asking a question) -- NOT so that you can cargo-cult what follows, but rather, to address what seem to be some coding issues.
#!/usr/bin/perl
use 5.014;      # enables 'strict' and 'say', but NOT 'warnings'
use warnings;

# 958840

my @array = <DATA>;

for my $line (@array) {
    if ( $line =~ /<h3>(.*)<\/h3>/ ) {                       # Note 1
        say "\n\t NAME is: $1";
    }
    if ( $line =~ /<p class='cd1'>(.*)(?=<\/p><p>mfg)/ ) {   # Note 2
        say "\t Instructions 2 are: $1";
    }
    if ( $line =~ /\d+<\/p><p>(.*)(?=<\/p>$)/ ) {            # Note 3
        say "\t GENERIC NAME IS: $1.\n\t" . "-" x 15;        # Note 4
    }
}

__DATA__
<h3>FENTANYL 25MCG/HR PATCH, TRANSDERMAL 72 HOURS</h3> <p class='cd1'>Restricted to NDC labeler code 50458 (Janssen) and to a maximum of ten (10) transdermal patches per dispensing and a maximum of three (3) dispensings of any strength in a 75-day period only. </p><p>mfg codes:50458</p><p>DURAGESIC</p>
<h3>ACETAMINOPHEN 80MG/0.8ML SUSPENSION, DROPS(FINAL DOSAGE FORM)(ML)</h3> <p class='cd1'>Restricted to individuals younger than 21 years of age for the liquid and drops only. </p><p>mfg codes:68016, 63868, 63162, 49348, 46122, 36800, 00904, 00536, 00472, 00113, 00067</p><p>TRIAMINIC FEVER REDUCER | PAIN & FEVER | NORTEMP | MAPAP INFANT | INFANTS' NON-ASPIRIN | INFANT'S PAIN RELIEVER | INFANT'S PAIN RELIEF | INFANT'S NON-ASPIRIN | ACETAMINOPHEN</p>
Note 1: The "m" and the escapes of '>' are unnecessary, but when you use a (greedy) .* you need to tell the match operator where it can stop -- namely, at "</h3>" (where the slash in the HTML tag has to be escaped, as you did at your line 10, because it is also the pattern delimiter).
Note 2: The (?=...) is a look-ahead (qv in the regex docs): it has to match, but it is not included in what's captured. Again, you need to specify what data marks the end of what you're trying to capture -- eg, the point where the regex sees the </p><p>mfg code ....
Note 3: The $ tells the regex that the closing para tag it's looking for should be the last set of chars before the newline at the end of the line.
Note 4: The tabs, newlines and repeated "-" simply help make the output more readable... for me, that is and YMMV
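To make Notes 1 through 3 a bit more concrete, here's a small sketch run against a made-up one-line record. The sample string and the sub names are mine, purely for illustration; the patterns are the ones from the script above, written with m{...} so the slashes don't need escaping. The last sub is there only for contrast: it shows how far a greedy .* runs when the pattern doesn't say precisely where to stop.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical record, shaped like the lines under __DATA__ above.
my $sample = "<h3>NAME ONE</h3> <p class='cd1'>Some rule. </p>"
           . "<p>mfg codes:12345</p><p>BRAND</p>\n";

# Note 1: greedy .* grabs as much as it can, then backs off until the
# rest of the pattern (the closing tag) can match.
sub name_of {
    my ($line) = @_;
    return $line =~ m{<h3>(.*)</h3>} ? $1 : undef;
}

# Note 2: the look-ahead must match, but is not part of the capture.
sub instructions_of {
    my ($line) = @_;
    return $line =~ m{<p class='cd1'>(.*)(?=</p><p>mfg)} ? $1 : undef;
}

# Note 3: the $ pins the final </p> to the end of the line.
sub last_field_of {
    my ($line) = @_;
    return $line =~ m{\d+</p><p>(.*)(?=</p>$)} ? $1 : undef;
}

# Contrast: with only "</p>" as the stop marker, greedy .* runs on
# to the LAST </p> on the line, not the first.
sub too_greedy {
    my ($line) = @_;
    return $line =~ m{<p class='cd1'>(.*)</p>} ? $1 : undef;
}

print "name:         [", name_of($sample),         "]\n";
print "instructions: [", instructions_of($sample), "]\n";
print "last field:   [", last_field_of($sample),   "]\n";
print "too greedy:   [", too_greedy($sample),      "]\n";
```

The "too greedy" line is the one to look at: it swallows the mfg codes and the last paragraph along with the instructions, which is exactly what the look-ahead in Note 2 prevents.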
Note a potential hazard: the contents of $1, once captured, are retained until replaced by another successful match. A match that fails does not clear them.
Thus, unless you're Really, REALLY SURE that your data is absotively, posilutely regular and well-formed, it's better practice to test that you have a fresh capture before blindly using whatever happens to be there. That discussion is well beyond the scope of this thread.
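Without opening that whole discussion, a minimal sketch of the hazard and of two common ways around it may help. The sub names and sample strings are hypothetical. The if ( $line =~ ... ) tests in the script above already provide this kind of guard; the risk arises when $1 is used without checking that the match succeeded.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Risky: $1 is used without checking whether the match succeeded.
sub unchecked_pair {
    my ( $first, $second ) = @_;
    my @out;
    $first =~ m{<h3>(.*)</h3>};
    push @out, $1;
    $second =~ m{<h3>(.*)</h3>};    # fails if $second has no <h3>
    push @out, $1;                  # ...so this is still the old capture
    return @out;
}

# Safer: only touch $1 inside the branch where the match succeeded.
sub checked_name {
    my ($line) = @_;
    if ( $line =~ m{<h3>(.*)</h3>} ) {
        return $1;
    }
    return undef;
}

# Also safe: in list context a match returns its captures, or the
# empty list on failure, so $name is undef when nothing matched.
sub name_or_undef {
    my ($line) = @_;
    my ($name) = $line =~ m{<h3>(.*)</h3>};
    return $name;
}

my @pair = unchecked_pair( "<h3>FIRST</h3>", "no heading here" );
print "unchecked: @pair\n";

for my $line ( "<h3>FIRST</h3>", "no heading here" ) {
    my $name = checked_name($line);
    print "checked:   ", ( defined $name ? $name : "(no match)" ), "\n";
}
```

In the unchecked version the second element repeats the first line's name even though the second line has no heading at all, and nothing warns you about it.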
PS -- you'd do well to include a sample of desired output (or, in the alternative, an explanation in your narrative) for those of us who may not know which part of your data contains the proprietary name and which is the "GENERIC_NAME." A question that leaves the Monks having to wonder about the meanings of terms used by a Seeker of Perl Wisdom is apt to get answers that run wide of the mark you intended.