Parsing Quark file with RegEx

rael9 has asked for the wisdom of the Perl Monks concerning the following question:

I have written a script to parse a text file into Quark .xtg format for my newspaper. Everything is working except one regex snippet, which looks correct to me. The Perl code in question is:

grep s/(<Bz8>)([^ ]*)/$1$2<\$>/, @FILE;

An example of the text to be parsed is:

<*ra0*p(9,-9,0,9,0,0,g,"U.S. English")><Bz8>ATTN: Out of school youth 16-21. Get help to finish high school and/or find job training, and get a good job. Call "Options": ask for Kathy M. at EASTCONN 111-1111.

If I'm not mistaken, my code should return:

<*ra0*p(9,-9,0,9,0,0,g,"U.S. English")><Bz8>ATTN:<$> Out of school youth 16-21. Get help to finish high school and/or find job training, and get a good job. Call "Options": ask for Kathy M. at EASTCONN 111-1111.

But instead it returns:

<*ra0*p(9,-9,0,9,0,0,g,"U.S. English")><Bz8><$>ATTN: Out of school youth 16-21. Get help to finish high school and/or find job training, and get a good job. Call "Options": ask for Kathy M. at EASTCONN 111-1111.

Am I doing it wrong? I am a relative newb, so I may have just misread the documentation. Any help in this regard is much appreciated.

20040904 Edit by castaway: Changed title from 'Regular expression hell'


Replies are listed 'Best First'.
Re: Parsing Quark file with RegEx by ikegami (Patriarch) on Sep 03, 2004 at 20:33 UTC
I think you were confusing grep (used to filter) with map (used to transform):
`@FILE2 = map { local $_ = $_; s/(<Bz8>)([^ ]*)/$1$2<\$>/; $_ } @FILE;`
You can also do the change in place:
`s/(<Bz8>)([^ ]*)/$1$2<\$>/ foreach (@FILE);`
or
`foreach (@FILE) { s/(<Bz8>)([^ ]*)/$1$2<\$>/; }`
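For anyone following along, here is a minimal sketch of the grep/map difference described above. The two-element `@FILE` and the names `@copy` and `@matched` are made up for illustration. Note that grep aliases `$_` to each element, so the `s///` still edits `@FILE` in place; what changes is only what gets returned.

```perl
use strict;
use warnings;

# Hypothetical sample data, not the poster's real file.
my @FILE = ("<Bz8>ATTN: Out of school youth\n", "no tag on this line\n");

# map with a per-element copy: @FILE is left untouched, @copy holds the result.
my @copy = map { my $line = $_; $line =~ s/(<Bz8>)([^ ]*)/$1$2<\$>/; $line } @FILE;

# grep aliases $_ to each element, so s/// edits @FILE in place as a side
# effect; the return value is only the elements where the substitution matched.
my @matched = grep { s/(<Bz8>)([^ ]*)/$1$2<\$>/ } @FILE;

print "copy:    $_" for @copy;
print "matched: $_" for @matched;
print "file:    $_" for @FILE;
```

So the grep form does transform the array, it just reads as a filter, which is why the foreach form is usually easier to follow.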
Re^2: Parsing Quark file with RegEx by ikegami (Patriarch) on Sep 03, 2004 at 21:09 UTC
Maybe you need `s/.../.../g` instead of `s/.../.../` to replace all instances in the line, instead of just the first?

Tip: In the replacement (the second half of s///), you only need to escape $, @, \ and / (or whatever your separator is). For example:

`s/[0]{4}\n/<v2\.05><e1>\n\@Normal\=<Ps100t0h100z12k0b0cKf\"ArialMT\">\n\@Normal\=\[S\"\",\"Normal\",\"Normal\"\]<\*L\*h\"Standard\"\*kn0\*kt0\*ra0\*rb0\*d0\*p\(0,0,0,0,0,0,g,\"U\.S\. English\"\)>\n\@\$\:<\*J\*p\(9,\-9,0,9,0,0,g,\"U\.S\. English\"\)><z8f\"Helvetica\">\n/;`

and

`s/[0]{4}\n/<v2.05><e1>\n\@Normal=<Ps100t0h100z12k0b0cKf"ArialMT">\n\@Normal=[S"","Normal","Normal"]<*L*h"Standard"*kn0*kt0*ra0*rb0*d0*p(0,0,0,0,0,0,g,"U.S. English")>\n\@\$:<*J*p(9,-9,0,9,0,0,g,"U.S. English")><z8f"Helvetica">\n/;`

are equivalent.
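A shortened sketch of that equivalence, which you can run to check it yourself. The replacement text here is a cut-down stand-in for the real one, and the variable names `$escaped` and `$unescaped` are only for illustration.

```perl
use strict;
use warnings;

my $escaped   = "0000\n";
my $unescaped = "0000\n";

# Heavily escaped replacement, in the style of the original script.
$escaped =~ s/[0]{4}\n/<\*ra0\*p\(9,\-9,0,9,0,0,g,\"U\.S\. English\"\)>\n\@\$\:/;

# Minimal escaping: only @ and $ need a backslash in this replacement.
$unescaped =~ s/[0]{4}\n/<*ra0*p(9,-9,0,9,0,0,g,"U.S. English")>\n\@\$:/;

print $escaped eq $unescaped ? "identical\n" : "different\n";
print $unescaped, "\n";
```

The replacement side is interpolated like a double-quoted string, so a backslash in front of punctuation such as `*`, `(` or `.` is simply dropped there.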
Re^3: Parsing Quark file with RegEx by rael9 (Novice) on Sep 03, 2004 at 21:15 UTC
I tried it with the 'g' as well and still no go. I just don't get it. Thanks for the tip, by the way. I found that out while reading more about perl regex today, but that part was working, so I didn't want to mess around with it. I have a headache as it is ;)
Re^4: Parsing Quark file with RegEx by ikegami (Patriarch) on Sep 03, 2004 at 21:23 UTC
Re^2: Parsing Quark file with RegEx by rael9 (Novice) on Sep 03, 2004 at 20:56 UTC
removed
Re: Parsing Quark file with RegEx by Eimi Metamorphoumai (Deacon) on Sep 03, 2004 at 20:25 UTC
It looks to me like your problem is elsewhere. One question I have is why you're using `grep` there. But the regexp does what it's supposed to.

`# shortened for clarity, but works with the full thing`
`$_ = '<*ra0*p(9,-9,0,9,0,0,g,"U.S. English")><Bz8>ATTN: Out of school youth';`
`s/(<Bz8>)([^ ]*)/$1$2<\$>/;`
`print;`

prints

`<*ra0*p(9,-9,0,9,0,0,g,"U.S. English")><Bz8>ATTN:<$> Out of school youth`

on my system.
Re: Parsing Quark file with RegEx by lidden (Curate) on Sep 03, 2004 at 20:35 UTC
This seems to be doing what you want. `foreach (@FILE) { s/<Bz8>(\S*)/<Bz8>$1<\$>/; }`
Re: Parsing Quark file with RegEx by TheEnigma (Pilgrim) on Sep 03, 2004 at 20:42 UTC
It worked when I tried it (on my Windows box). Maybe you could try: `for (@FILE) { s/(<Bz8>)([^ ]*)/$1$2<\$>/; }`

Update: oops! Don't need the g at the end of the regex.

TheEnigma
Re: Parsing Quark file with RegEx by rael9 (Novice) on Sep 03, 2004 at 21:10 UTC
OK. The foreach statement structure makes more sense, but I'm still at an impasse. It still gives the same result. There are actually several passes made in the regex to get all the formatting done, like so:

`foreach (@FILE) {`
`s/[0]{4}\n/<v2\.05><e1>\n\@Normal\=<Ps100t0h100z12k0b0cKf\"ArialMT\">\n\@Normal\=\[S\"\",\"Normal\",\"Normal\"\]<\*L\*h\"Standard\"\*kn0\*kt0\*ra0\*rb0\*d0\*p\(0,0,0,0,0,0,g,\"U\.S\. English\"\)>\n\@\$\:<\*J\*p\(9,\-9,0,9,0,0,g,\"U\.S\. English\"\)><z8f\"Helvetica\">\n/;`
`s/^[0-9]{4}\n/<\*ra\(1,0,K,100,\-9,0,0\%\)\*p\(9,\-9,0,7,0,0,g,\"U\.S\. English\"\)><z2>\n<\*ra0\*p\(9,\-9,0,9,0,0,g,\"U\.S\. English\"\)><Bz8>/;`
`s/^[0-9]{4}D\n/<\*ra\(1,0,K,100,\-9,0,0\%\)\*p\(9,\-9,0,7,0,0,g,\"U\.S\. English\"\)><z2>\n<\*ra0\*p\(9,\-9,0,9,0,0,g,\"U\.S\. English\"\)><Bz8>\n\*\*\*DISPLAY\*\*\*\*\n/;`
`s/(<Bz8>)([^ ]*)/$1$2<\$>/;`
`}`

Could one of the other passes be screwing something up? Could there be an invisible character between <Bz8> and the next word that is screwing it up? It does it right when it hits:

`<*ra0*p(9,-9,0,9,0,0,g,"U.S. English")><Bz8> ***DISPLAY**** DRIVER OPENINGS`

but:

`<*ra0*p(9,-9,0,9,0,0,g,"U.S. English")><Bz8>KETTLE WORKERS F/T & P/T, Nov.-Dec., $6.90/hr. Must be clean, honest & neat. Apply at the Salvation Army, 316 Pleasant St., Wmtc., wkdys., 9-6 p.m.`

still screws it up.
Re^2: Parsing Quark file with RegEx by TimToady (Parson) on Sep 04, 2004 at 02:15 UTC
Could you by any chance have a tab character in your `[^ ]` where you think you have a space? It is right about at the 16th column, so it could be fooling you visually.
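A small sketch of why a space/tab mix-up matters with this kind of pattern. It looks at the data side of the question (a tab in the input rather than in the regex), and the sample string and variable names are invented for the demonstration.

```perl
use strict;
use warnings;

# Hypothetical input where a tab, not a space, follows the first word.
my $with_tab = "<Bz8>ATTN:\tOut of school youth";

# [^ ]* stops only at a literal space, so it runs straight across the tab.
(my $space_class = $with_tab) =~ s/(<Bz8>)([^ ]*)/$1$2<\$>/;

# \S* stops at any whitespace character, including the tab.
(my $nonspace = $with_tab) =~ s/(<Bz8>)(\S*)/$1$2<\$>/;

print "[^ ]*: $space_class\n";
print "\\S*  : $nonspace\n";
```

With `[^ ]*` the `<$>` lands after "Out"; with `\S*` it lands right after "ATTN:". If tabs can appear in the input, `\S*` is probably the safer choice.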
Re^2: Parsing Quark file with RegEx by ikegami (Patriarch) on Sep 03, 2004 at 21:12 UTC
Replied up here
Re: Parsing Quark file with RegEx by rael9 (Novice) on Sep 07, 2004 at 19:08 UTC
UPDATE! I finally fixed it. After the first three regex search-and-replace passes, the <Bz8> was technically right next to the text following it, but the two were still in different elements of the @FILE array, so they were being treated separately. I fixed it by printing @FILE out to a "working" file, reading it back in to @FILE so that the line breaks were now updated in the array, and then doing the final regex search and replace. There's probably an easier way, but being relatively new to Perl and having been out of the programming loop for quite some time now, I was just happy to get it working. Thanks for all the help, folks!
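The situation described in this update can be reproduced with a tiny sketch. The two-element `@FILE` below is a made-up stand-in for the array after the earlier passes, assuming the tag ended up at the end of one element and the ad text at the start of the next.

```perl
use strict;
use warnings;

# Hypothetical state of @FILE after the earlier passes: the tag sits at the
# end of one element and the ad text begins the next one.
my @FILE = ('<Bz8>', "ATTN: Out of school youth\n");

# Each element is matched on its own, so [^ ]* finds nothing after <Bz8>
# in the first element and <$> is inserted immediately.
s/(<Bz8>)([^ ]*)/$1$2<\$>/ for @FILE;

my $result = join '', @FILE;
print $result;    # <Bz8><$>ATTN: Out of school youth
```

Printed back to back, the elements look like one line, which is why the output seemed to contradict the regex.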
Re^2: Parsing Quark file with RegEx by Eimi Metamorphoumai (Deacon) on Sep 07, 2004 at 20:08 UTC
The easier way to do it, if I'm reading what you want correctly, would be `@FILE = split(/\n/, join('', @FILE));` which concatenates all the items into one big string, then splits it on newlines.
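Here is a rough sketch of that join-and-split approach on made-up data. One thing to be aware of: `split /\n/` drops the newlines, so the second variant, which splits after each newline with a lookbehind, is shown as an assumption about what you might want if the line endings need to survive. The names `@lines` and `@kept` are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical array where one logical line is spread over two elements.
my @FILE = ("0001\n", '<Bz8>', "ATTN: Out of school youth\n", "next ad\n");

# As in the reply: join everything, then split on newlines (newlines are dropped).
my @lines = split /\n/, join('', @FILE);

# Variant that keeps each trailing newline by splitting just after it.
my @kept = split /(?<=\n)/, join('', @FILE);

# Now the tag and its text share an element, so the final pass works.
s/(<Bz8>)([^ ]*)/$1$2<\$>/ for @lines, @kept;

print "$_\n" for @lines;
print @kept;
```

Either way, the elements are rebuilt so that each one is a real line, without going through a temporary file.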