Re: parsing a line with $1, $2, $3
by ww (Archbishop) on Mar 10, 2012 at 13:57 UTC
|
You'll do well to study the references kcott provided above. The problems reflect, in some large measure, your need to clarify your understanding of what constitutes a matchable pattern and also, in part, your confusion about escaping characters -- that's only necessary when the characters are special to Perl's regex engine.
<aside> RichardK's point is well taken... but you've asked how to do this with a regex. Revising this to utilize $1, $2, $3 is left as an exercise (and not necessarily worth the time: it's often less error-prone to write multiple regexen rather than one which is monolithic.)</aside>
So, assuming that you've worked out how to get your data into the form you've shown, the following is executable (which is strongly recommended in the Monastery's guidance on asking a question) -- NOT so that you can cargo-cult what follows, but rather, to address what seem to be some coding issues.
#!/usr/bin/perl
use 5.014;      # enables 'strict' and features such as 'say', but not 'warnings'
use warnings;
# 958840
my @array = <DATA>;
for my $line (@array) {
    if ( $line =~ /<h3>(.*)<\/h3>/ ) {                        # Note 1
        say "\n\t NAME is: $1";
    }
    if ( $line =~ /<p class='cd1'>(.*)(?=<\/p><p>mfg)/ ) {    # Note 2
        say "\t Instructions 2 are: $1";
    }
    if ( $line =~ /\d+<\/p><p>(.*)(?=<\/p>$)/ ) {             # Note 3
        say "\t GENERIC NAME IS: $1.\n\t" . "-" x 15;         # Note 4
    }
}
__DATA__
<h3>FENTANYL 25MCG/HR PATCH, TRANSDERMAL 72 HOURS</h3> <p class='cd1'>Restricted to NDC labeler code 50458 (Janssen) and to a maximum of ten (10) transdermal patches per dispensing and a maximum of three (3) dispensings of any strength in a 75-day period only. </p><p>mfg codes:50458</p><p>DURAGESIC</p>
<h3>ACETAMINOPHEN 80MG/0.8ML SUSPENSION, DROPS(FINAL DOSAGE FORM)(ML)</h3> <p class='cd1'>Restricted to individuals younger than 21 years of age for the liquid and drops only. </p><p>mfg codes:68016, 63868, 63162, 49348, 46122, 36800, 00904, 00536, 00472, 00113, 00067</p><p>TRIAMINIC FEVER REDUCER | PAIN & FEVER | NORTEMP | MAPAP INFANT | INFANTS' NON-ASPIRIN | INFANT'S PAIN RELIEVER | INFANT'S PAIN RELIEF | INFANT'S NON-ASPIRIN | ACETAMINOPHEN</p>
Note 1: The "m" and the escapes of '>' are unnecessary, but when you use a (greedy) .* you need to tell the match operator where it can stop -- namely, at "</h3>" (where the slash in the HTML tag has to be escaped, as you did at your line 10).
Note 2: The (?=...) is a look-ahead (q.v. in the regex docs) and, again, you need to specify what data marks the end of what you're trying to capture -- e.g., when it sees the </p><p>mfg code ....
Note 3: The $ tells the regex that the closing para tag it's looking for should be the last set of chars before a newline.
Note 4: The tabs, newlines and repeated "-" simply help make the output more readable... for me, that is; YMMV.
Note a potential hazard: the contents of $1, once captured, are retained until the next successful match replaces them (or the enclosing block ends); a failed match leaves them untouched.
Thus, unless you're Really, REALLY SURE that your data is absotively, posilutely regular and well-formed, it's better practice to test that you have a fresh capture before blindly using whatever happens to be there. A fuller discussion is well beyond the scope of this thread.
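A minimal sketch of that hazard, using core Perl only. The sub names careless_names and checked_name and the two sample strings are made up for this illustration; they are not from the OP's code.

```perl
use strict;
use warnings;

# Careless version: reads $1 without checking that the match succeeded.
sub careless_names {
    my ($first, $second) = @_;
    $first =~ /<h3>(.*?)<\/h3>/;
    my $name1 = $1;
    $second =~ /<h3>(.*?)<\/h3>/;    # if this fails, $1 is left over from $first
    my $name2 = $1;
    return ($name1, $name2);
}

# Safer version: take the capture only from a successful match.
sub checked_name {
    my ($line) = @_;
    my ($name) = $line =~ /<h3>(.*?)<\/h3>/;    # empty list on failure
    return $name;                                # undef means "no fresh capture"
}

my ($n1, $n2) = careless_names(
    '<h3>FENTANYL 25MCG/HR PATCH</h3>',
    '<p>mfg codes:50458</p>',
);
print "careless: [$n1] [$n2]\n";    # the second value is the stale first capture

my $checked = checked_name('<p>mfg codes:50458</p>');
print "checked : ", ( defined $checked ? "[$checked]" : 'no fresh capture' ), "\n";
```

Wrapping the use of $1 in the if ( ... =~ ... ) { } block, as the script above does, gives the same protection; the list-assignment form is just another way to spell it.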
PS -- you'd do well to include a sample of desired output for those of us who may not know which part of your data contains the proprietary name and which is the "GENERIC_NAME" or, in the alternative, an explanation in your narrative. A question that leaves the Monks having to wonder about the meanings of terms used by a Seeker of Perl Wisdom is apt to get answers that run wide of the mark you intended.
|
|
if ($line =~ /<h3>(.*?)(\d.*)<\/h3>/) {
print "Generic name is: $1 \n\t Dosage is:$2\n";
}
But as ww says above, be *really* sure that the data are consistent. There will probably be some special cases that don't start with a number or have something else wrong with them, just to give you a headache.
Edit: for the op: I used "*?" (minimal match quantifier) instead of "*" to match the name-- "*" by itself will match all the way up to the closing tag, but with the "?" modifier it only matches as much as it needs to to satisfy the match.
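A small sketch of the difference, run against the FENTANYL heading from the thread. The helper name split_heading and its $greedy flag are hypothetical, added only for this comparison.

```perl
use strict;
use warnings;

# Hypothetical helper: split an <h3> heading at a digit, using either a
# greedy (.*) or a minimal (.*?) quantifier for the first group.
sub split_heading {
    my ($line, $greedy) = @_;
    my $re = $greedy
        ? qr/<h3>(.*)(\d.*)<\/h3>/
        : qr/<h3>(.*?)(\d.*)<\/h3>/;
    return $line =~ $re;    # ($1, $2) in list context; empty list on failure
}

my $heading = '<h3>FENTANYL 25MCG/HR PATCH, TRANSDERMAL 72 HOURS</h3>';
printf "greedy : [%s] [%s]\n", split_heading($heading, 1);
printf "minimal: [%s] [%s]\n", split_heading($heading, 0);
# greedy : [FENTANYL 25MCG/HR PATCH, TRANSDERMAL 7] [2 HOURS]
# minimal: [FENTANYL ] [25MCG/HR PATCH, TRANSDERMAL 72 HOURS]
```

The greedy group backs off only as far as the last digit it can give up, so the dosage ends up split in the wrong place. Note that the minimal version still leaves a trailing space in $1.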
Another edit: ww reminded me of something I thought of when I was typing a reply that I accidentally closed before posting -- you might want to get a separate list of generic names without the additional information, so you can isolate those from the records you're dismantling. This may be what you're trying to generate, though...
|
|
if ($line =~ /<h3>(.*?)(\d.*)<\/h3>/) {
print "Generic name is: $1 \n\t Dosage is:$2\n";
}
Sure, this may work for the input shown in "parsing a line with $1, $2, $3". But it is very fragile:
Imagine someone at the data source suddenly decides that upper case tags are "more pretty" (despite all recommendations to use lower case tags). It is still valid HTML, but you have to update your RE.
Imagine that "someone" also decides that the headings should be formatted by inline CSS (despite all recommendations not to use inline CSS). <H3> will be changed to <H3 STYLE="color:red;font-weight:bold;">. It is still valid HTML, but you have to update your RE.
Imagine that "someone" is told that semantic markup is the new trend. So he adds a class attribute, but still keeps the inline style. Unfortunately, the order of the attributes varies from line to line, because the HTML is manually edited using ed on a VT52. The header now starts either with <H3 CLASS="generica" STYLE="color:red;font-weight:bold;"> or with <H3 STYLE="color:red;font-weight:bold;" CLASS="generica">. Except for some cases where the class attribute was accidentally omitted. It is still valid HTML, but you have to update your RE (unless it allows arbitrary attributes).
And to make things worse, our imaginary ed-user randomly inserted TABs and CRs into the tags while running low on caffeine. It is still valid HTML, but you have to update your RE.
You could try to keep up with the problem by making the RE more complex with every minor change in the input file. Or you could simply use an HTML parser that can handle all of those cases, and nasty things like SGML minimization. It also takes care of decoding escaped characters, unlike your RE.
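To make the fragility concrete, here is a small sketch (core Perl only) that runs the regex quoted above against a few of the imagined variants. The sample strings and the name heading_matches are made up for illustration.

```perl
use strict;
use warnings;

# The regex quoted above, unchanged; returns 1 on a match, 0 otherwise.
sub heading_matches {
    my ($html) = @_;
    return $html =~ /<h3>(.*?)(\d.*)<\/h3>/ ? 1 : 0;
}

# Made-up variants of one heading, following the scenarios described above.
my %variant = (
    'as posted'      => '<h3>FENTANYL 25MCG/HR PATCH</h3>',
    'upper-case tag' => '<H3>FENTANYL 25MCG/HR PATCH</H3>',
    'inline style'   => '<h3 style="color:red;font-weight:bold;">FENTANYL 25MCG/HR PATCH</h3>',
    'tab inside tag' => "<h3\t>FENTANYL 25MCG/HR PATCH</h3>",
);

for my $name ( sort keys %variant ) {
    printf "%-15s %s\n", $name,
        heading_matches( $variant{$name} ) ? 'matches' : 'MISSED';
}
```

Only the "as posted" variant matches; each of the others needs yet another patch to the RE.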
Alexander
--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
|
|
|
|
Thank you.
The input & desired output are:
Input:
id:"200207",
label : "MACROLIDES",
type:"category",
close:["200476"],
drug:[
"<h3>AZITHROMYCIN 250 MG TABLET</h3> <p class='cd1'>Restricted to a maximum quantity per dispensing of eight (8) tablets ... </p><p>mfg codes:68084, 64679, 60505</p><p>ZITHROMAX | AZITHROMYCIN</p> ",
"<h3>AZITHROMYCIN 1 G PACKET (EA)</h3><p>mfg codes:59762, 00069</p><p>ZITHROMAX | AZITHROMYCIN</p> ",
"<h3>AZITHROMYCIN 200 MG/5ML SUSPENSION, RECONSTITUTED, ORAL (ML)</h3> <p class='cd1'>Restricted to use for individuals less than eight ... </p><p>mfg codes:59762, 501119</p><p>ZITHROMAX | AZITHROMYCIN</p> ",
,NCdrug:[
"AZITHROMYCIN ZMAX ZITHROMAX AZITHROMYCIN",
"AZITHROMYCIN HYDROGEN CITRATE AZITHROMYCIN",
"CLARITHROMYCIN CLARITHROMYCIN ER CLARITHROMYCIN BIAXIN",
"DIRITHROMYCIN DYNABAC",
"ERY E-SUCC/SULFISOXAZOLE PEDIAZOLE ERYTHROMYCIN-SULFISOXAZOLE",
"ERYTHROMYCIN BASE ERYTHROMYCIN BASE ERYTHROMYCIN ERYC ERY-TAB",
"ERYTHROMYCIN ESTOLATE ERYTHROMYCIN ESTOLATE",
"ERYTHROMYCIN ETHYLSUCCINATE ERYTHROMYCIN ETHYLSUCCINATE ERYPED 400 E.E.S. 400",
"ERYTHROMYCIN LACTOBIONATE ERYTHROCIN LACTOBIONATE",
"ERYTHROMYCIN STEARATE MY-E FILM ERYTHROMYCIN STEARATE",
"FIDAXOMICIN DIFICID"]
},
id:"939383",
label : "FIBROMYALGIA AGENTS,SEROTONIN-NOREPINEPH RU INHIB",
type:"category",
drug:[
"<h3>MILNACIPRAN HCL 12.5 MG TABLET</h3><p>mfg codes:00456</p><p>SAVELLA</p> ",
"<h3>MILNACIPRAN HCL 25 MG TABLET</h3><p>mfg codes:00456</p><p>SAVELLA</p> ",
"<h3>MILNACIPRAN HCL 50 MG TABLET</h3><p>mfg codes:00456</p><p>SAVELLA</p> ",
"<h3>MILNACIPRAN HCL 100 MG TABLET</h3><p>mfg codes:00456</p><p>SAVELLA</p> ",
"<h3>MILNACIPRAN HCL 12.5-25-50 TABLET, DOSE PACK</h3><p>mfg codes:00456</p><p>SAVELLA</p> "]
}]}
Desired Output
# Spaces added between fields to make it more readable.
ID| CATEGORY| GENERIC_PRODUCT_NAME| GENERIC_NAME| STRENGTH_DESCRIPTION| INSTRUCTIONS| MFG_CODE| TRADE_NAME
# ROW ONE
### Different MFG_Code
200207 | MACROLIDES | AZITHROMYCIN 250 MG TABLET| AZITHROMYCIN| 250 MG TABLET| Restricted to a maximum quantity per dispensing of eight (8) tablets ...| 68084| ZITHROMAX
200207 | MACROLIDES | AZITHROMYCIN 250 MG TABLET| AZITHROMYCIN| 250 MG TABLET| Restricted to a maximum quantity per dispensing of eight (8) tablets ...| 64679| ZITHROMAX
200207 | MACROLIDES | AZITHROMYCIN 250 MG TABLET| AZITHROMYCIN| 250 MG TABLET| Restricted to a maximum quantity per dispensing of eight (8) tablets ... |60505| ZITHROMAX
#### Different trade names than above (last column)
200207| MACROLIDES| AZITHROMYCIN 250 MG TABLET| AZITHROMYCIN| 250 MG TABLET| Restricted to a maximum quantity per dispensing of eight ...| 68084| AZITHROMYCIN
200207| MACROLIDES| AZITHROMYCIN 250 MG TABLET| AZITHROMYCIN| 250 MG TABLET| Restricted to a maximum quantity per dispensing of eight ...| 64679| AZITHROMYCIN
200207| MACROLIDES| AZITHROMYCIN 250 MG TABLET| AZITHROMYCIN| 250 MG TABLET| Restricted to a maximum quantity per dispensing of eight ...|60505| AZITHROMYCIN
# ROW 2
200207| MACROLIDES| AZITHROMYCIN 1 G PACKET (EA)| AZITHROMYCIN| 1 G PACKET| NULL| 59762| ZITHROMAX
200207| MACROLIDES| AZITHROMYCIN 1 G PACKET (EA)| AZITHROMYCIN| 1 G PACKET| NULL| 00069| ZITHROMAX
200207| MACROLIDES| AZITHROMYCIN 1 G PACKET (EA)| AZITHROMYCIN| 1 G PACKET| NULL| 59762| AZITHROMYCIN
200207| MACROLIDES| AZITHROMYCIN 1 G PACKET (EA)| AZITHROMYCIN| 1 G PACKET| NULL| 00069| AZITHROMYCIN
# ROW 3
200207| MACROLIDES| AZITHROMYCIN 200 MG/5ML SUSPENSION, RECONSTITUTED, ORAL (ML)| Restricted to use for individuals less than eight ...| 59762| ZITHROMAX
200207| MACROLIDES| AZITHROMYCIN 200 MG/5ML SUSPENSION, RECONSTITUTED, ORAL (ML)| Restricted to use for individuals less than eight ...| 501119| ZITHROMAX
200207| MACROLIDES| AZITHROMYCIN 200 MG/5ML SUSPENSION, RECONSTITUTED, ORAL (ML)| Restricted to use for individuals less than eight ...| 59762| AZITHROMYCIN
200207| MACROLIDES| AZITHROMYCIN 200 MG/5ML SUSPENSION, RECONSTITUTED, ORAL (ML)| Restricted to use for individuals less than eight ...| 501119| AZITHROMYCIN
...
|
|
Thanks ww. The notes are very helpful.
Kevin
Re: parsing a line with $1, $2, $3
by kcott (Archbishop) on Mar 10, 2012 at 10:18 UTC
|
|
|
Thank you. I'm reading this now.
Re: parsing a line with $1, $2, $3
by RichardK (Parson) on Mar 10, 2012 at 11:20 UTC
|
Your data looks like it's HTML, in which case you might get better results using an HTML parser.
I'd use HTML::TreeBuilder but there are other options ;)
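A hedged sketch of what that could look like, assuming HTML::TreeBuilder is installed from CPAN (it is not part of core Perl). The helper name parse_record is made up, and the sample record is an abbreviated version of the thread's data with an upper-case, styled H3 added on purpose.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;    # CPAN module, not in core Perl

# Hypothetical helper: pull the heading and paragraph texts out of one record.
sub parse_record {
    my ($html) = @_;
    my $tree  = HTML::TreeBuilder->new_from_content($html);
    my $h3    = $tree->look_down( _tag => 'h3' );    # first <h3>, or undef
    my @paras = map { $_->as_trimmed_text } $tree->look_down( _tag => 'p' );
    my $record = {
        heading    => $h3 ? $h3->as_trimmed_text : undef,
        paragraphs => \@paras,
    };
    $tree->delete;    # release the tree once the text has been copied out
    return $record;
}

# Abbreviated sample record; tag case and the STYLE attribute are assumptions.
my $rec = parse_record(
      q{<H3 STYLE="color:red;">FENTANYL 25MCG/HR PATCH, TRANSDERMAL 72 HOURS</H3> }
    . q{<p class='cd1'>Restricted to NDC labeler code 50458 (Janssen). </p>}
    . q{<p>mfg codes:50458</p><p>DURAGESIC</p>}
);
print "Heading: $rec->{heading}\n";
print "  para: $_\n" for @{ $rec->{paragraphs} };
```

The parser lower-cases tag names and ignores extra attributes, so the upper-case, styled heading is found without any special handling.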
Re: parsing a line with $1, $2, $3
by CountZero (Bishop) on Mar 10, 2012 at 21:34 UTC
|
I am not sure I fully understood your requirement, but here is my best try. It is not in one regex, but a combination of a split, a regex and a substitution.
use Modern::Perl;
use Data::Dump qw /dump/;
while (<DATA>) {
    chomp;
    my ($namedosageform, $instructions, $code) = grep {/\S/} split /<.*?>/;
    my ($name, $dosage, $form) = $namedosageform =~ /^([^\d]+)(\d+[^,]+),\s*(.*)$/;
    $code =~ s/mfg codes:(.*)/$1/;
    say "Name: $name\nDosage: $dosage\nForm: $form\nInstructions: $instructions\nMFG codes: $code\n";
}
__DATA__
<h3>FENTANYL 25MCG/HR PATCH, TRANSDERMAL 72 HOURS</h3> <p class='cd1'>Restricted to NDC labeler code 50458 (Janssen) and to a maximum of ten (10) transdermal patches per dispensing and a maximum of three (3) dispensings of any strength in a 75-day period only. </p><p>mfg codes:50458</p><p>DURAGESIC</p>
<h3>ACETAMINOPHEN 80MG/0.8ML SUSPENSION, DROPS(FINAL DOSAGE FORM)(ML)</h3> <p class='cd1'>Restricted to individuals younger than 21 years of age for the liquid and drops only. </p><p>mfg codes:68016, 63868, 63162, 49348, 46122, 36800, 00904, 00536, 00472, 00113, 00067</p><p>TRIAMINIC FEVER REDUCER | PAIN & FEVER | NORTEMP | MAPAP INFANT | INFANTS' NON-ASPIRIN | INFANT'S PAIN RELIEVER | INFANT'S PAIN RELIEF | INFANT'S NON-ASPIRIN | ACETAMINOPHEN</p>
Output:
Name: FENTANYL
Dosage: 25MCG/HR PATCH
Form: TRANSDERMAL 72 HOURS
Instructions: Restricted to NDC labeler code 50458 (Janssen) and to a maximum of ten (10) transdermal patches per dispensing and a maximum of three (3) dispensings of any strength in a 75-day period only.
MFG codes: 50458
Name: ACETAMINOPHEN
Dosage: 80MG/0.8ML SUSPENSION
Form: DROPS(FINAL DOSAGE FORM)(ML)
Instructions: Restricted to individuals younger than 21 years of age for the liquid and drops only.
MFG codes: 68016, 63868, 63162, 49348, 46122, 36800, 00904, 00536, 00472, 00113, 00067
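The desired output posted elsewhere in this thread wants one row per manufacturer code and trade name. A hedged sketch of that expansion step, taking fields like the ones split out above; expand_rows is a made-up name, and the reduced three-column row is an assumption for brevity, not the full layout that was asked for.

```perl
use strict;
use warnings;

# Hypothetical helper: one "heading| code| trade name" row per combination.
sub expand_rows {
    my ($heading, $codes, $trade_names) = @_;
    my @codes = split /\s*,\s*/,  $codes;          # "59762, 00069"
    my @names = split /\s*\|\s*/, $trade_names;    # "ZITHROMAX | AZITHROMYCIN"
    my @rows;
    for my $name (@names) {      # trade names vary slowest, as in the desired output
        for my $code (@codes) {
            push @rows, join '| ', $heading, $code, $name;
        }
    }
    return @rows;
}

print "$_\n" for expand_rows(
    'AZITHROMYCIN 1 G PACKET (EA)',
    '59762, 00069',
    'ZITHROMAX | AZITHROMYCIN',
);
# AZITHROMYCIN 1 G PACKET (EA)| 59762| ZITHROMAX
# AZITHROMYCIN 1 G PACKET (EA)| 00069| ZITHROMAX
# AZITHROMYCIN 1 G PACKET (EA)| 59762| AZITHROMYCIN
# AZITHROMYCIN 1 G PACKET (EA)| 00069| AZITHROMYCIN
```

Keeping the codes as strings matters here: treating "00069" as a number would drop the leading zeros.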
CountZero
"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
My blog: Imperial Deltronics