Improved regexp sought

myomancer has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
•Re: Improved regexp sought by merlyn (Sage) on Oct 27, 2004 at 14:12 UTC
"Can anybody suggest a change to the second line, and take me further along the path?" Sure, change the second line to: `my @fields = qw(0010 2 O'Reilly);` But if you want more help, you'll need to help us by describing what you want in words, not just have us second-guess your regex. -- Randal L. Schwartz, Perl hacker. Be sure to read my standard disclaimer if this is a reply.
Re^2: Improved regexp sought by myomancer (Novice) on Oct 27, 2004 at 14:32 UTC
Point taken, merlyn. I have a file with multiple lines. Each line consists of a variable number of variable-length fields separated by a + character. Each line is terminated by a ' character. Sometimes a field might have a ' character in it; if so, the ' is preceded by a question mark. Here are some example lines: `0010+2+O'Reilly'` and `023++++234+35+White+++17+'` and `g?'day mate+++'`. I want to break each line up into its constituent fields. I can do it with brute force, but would prefer elegance. Thanks, Myomancer
Re^3: Improved regexp sought by duff (Parson) on Oct 27, 2004 at 14:39 UTC
Mayhap you want to take a multi-step approach. `$string =~ s/'$//; $string =~ s/\?'/'/g; @fields = split /\+/, $string;` "I want to break each line up into its constituent fields. I can do it with brute force, but would prefer elegance." I usually choose "working" over "not working" :-) duff
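For anyone wanting to try the multi-step approach, here is a minimal sketch wrapped in a sub. The name `parse_record` is made up for illustration, and the sample records assume the embedded apostrophe is always escaped as `?'`, as described in the question. One deliberate difference from the snippet above: a limit of -1 is passed to `split` so that trailing empty fields are kept, which may or may not be what the OP wants.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): split one record into its fields.
sub parse_record {
    my ($record) = @_;
    $record =~ s/'\s*\z//;   # drop the terminating apostrophe and any trailing newline
    $record =~ s/\?'/'/g;    # turn each escaped ?' into a plain apostrophe
    # A limit of -1 keeps trailing empty fields, which split discards by default.
    return split /\+/, $record, -1;
}

for my $record ("0010+2+O?'Reilly'", "023++++234+35+White+++17+'", "g?'day mate+++'") {
    my @fields = parse_record($record);
    print scalar(@fields), " fields: ", join('', map { "[$_]" } @fields), "\n";
}
```

Dropping the `-1` gives the default `split` behaviour used elsewhere in this thread, where empty fields at the end of a record silently disappear.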
Re^3: Improved regexp sought by diotalevi (Canon) on Oct 27, 2004 at 14:54 UTC
Use Text::CSV_XS. I guessed that your lines were terminated with apostrophe + newline. Alter the code to fit. `use Text::CSV_XS; my $parser = Text::CSV_XS->new( { eol => "'\n", escape_char => "'", sep_char => "+" } ); while ( my $line = <$fh> ) { $parser->parse( $line ); print join( ", ", $parser->fields ) . "\n"; }`
Re^3: Improved regexp sought by Limbic~Region (Chancellor) on Oct 27, 2004 at 14:43 UTC
myomancer, "I can do it with brute force, but would prefer elegance." I hope you aren't confusing conciseness with elegance. They are not always related. See the following: `my $str = "0010+2+O?'Reilly'"; my @field = map {s/\?'/'/g; $_ } split /\+/ , substr($str,0, (length $str) - 1); print "[$_]$field[$_]\n" for 0 .. $#field;` IMO, the code would be more elegant broken out into multiple lines, perhaps with comments. Cheers - L~R
Re: Improved regexp sought by Roy Johnson (Monsignor) on Oct 27, 2004 at 16:29 UTC
Is this elegant? `my @fields = map {s/\?'/'/g; $_} split /\+|(?<!\?)'/, $line;` The split is on plus signs or on apostrophes not preceded by a question mark. The map turns "?'" into just "'". Update: It's probably clearer (and thus better) to say `my @fields = split /\+|(?<!\?)'/, $line; s/\?'/'/g for @fields;` Caution: Contents may have been coded under pressure.
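A small runnable check of the split-then-unescape idea, written with an unescaped alternation `|` between the two separators. The sample line is an assumption: it has already been chomped and its embedded apostrophe is escaped as `?'`.

```perl
use strict;
use warnings;

# Assumed sample record: chomped, with the embedded apostrophe escaped.
my $line = "0010+2+O?'Reilly'";

# Split on '+' or on an apostrophe that is NOT preceded by '?'.
# The terminating apostrophe yields a trailing empty field, which split drops.
my @fields = split /\+|(?<!\?)'/, $line;

# Unescape: each ?' becomes a plain apostrophe.
s/\?'/'/g for @fields;

print "[$_] $fields[$_]\n" for 0 .. $#fields;
```

If the line still carries its newline, the newline shows up as an extra final field, so chomp first.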
Re: Improved regexp sought by Random_Walk (Prior) on Oct 27, 2004 at 14:45 UTC
`#!/usr/local/bin/perl -w use strict; $/="\'\n"; # set the input record separator while (<DATA>) { chomp; # chomp strips the record separator s/\?\'/\'/g; # fix the "quoted" ' marks my @fields=split /\+/; print "[$_] $fields[$_]\n" for 0..$#fields; } __DATA__ 0010+2+O'Reilly' 023++++234+35+White+++17+' g?'day mate+++' # output [0] 0010 [1] 2 [2] O'Reilly [0] 023 [1] [2] [3] [4] 234 [5] 35 [6] White [7] [8] [9] 17 [0] g'day mate` Update: possibly a bit more elegant, for certain values of elegance: `$/="\'\n"; while (<DATA>) { chomp; s/\?\'/\'/g; my $i; print "[",$i++,"] $_\n" for split /\+/; }` I keep getting voted down for this node, so I think I had better explain myself. I am not doing it all in a regex as the OP wanted, but the OP says the data is in a file, one record per line, terminated with a '. As he has to read the line in anyway, and we can guess from the example string given in the lead post that he is also chomping the line he reads, why not use the input record separator ($/) to solve the terminal ' issue efficiently? Once you have reached this point, fixing the "?'" and splitting is surely more efficient and maintainable than some confusing regex. Cheers, R.
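The record-separator trick can be tried without a data file. The sketch below reads from an in-memory filehandle and localises `$/` so the change does not leak into other code. The sample data is an assumption (the lines from the question, with the embedded apostrophe escaped as `?'`), and the `@records` array is only there for illustration.

```perl
use strict;
use warnings;

# Assumed sample data: three records, each terminated by apostrophe + newline.
my $data = "0010+2+O?'Reilly'\n023++++234+35+White+++17+'\ng?'day mate+++'\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @records;
{
    local $/ = "'\n";            # input record separator: apostrophe plus newline
    while (my $record = <$fh>) {
        chomp $record;           # chomp removes whatever $/ currently holds
        $record =~ s/\?'/'/g;    # unescape ?' into a plain apostrophe
        push @records, [ split /\+/, $record ];
    }
}
close $fh;

for my $fields (@records) {
    print join(', ', map { "[$_]" } @$fields), "\n";
}
```

Because `split` is called without a limit here, trailing empty fields are dropped, just as in the output shown above.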
Re: Improved regexp sought by TedPride (Priest) on Oct 27, 2004 at 15:57 UTC
I tested, and map is significantly faster than for or foreach at large numbers of iterations (about 13/8 as fast). I couldn't figure out how to put everything into one statement given the ?' format, but the following works fine: `use strict; use warnings; my $i; while (<DATA>) { chomp; chop; s/\?'/'/g; $i = 0; map { print '['.$i++."]$_\n" } split(/\+/); } __DATA__ 0010+2+O?'Reilly' 012+90+Penguin'`
Re^2: Improved regexp sought by revdiablo (Prior) on Oct 27, 2004 at 16:52 UTC
"I tested, and map is significantly faster than for or foreach at large numbers of iterations." I'm sorry, I know it's off-topic, but I'd really like to see the benchmarks to back this statement up. It's not that I don't believe you, just that it would be really surprising to me if it were true. Update: well, I am indeed surprised. I cooked up a benchmark, and it shows that map is indeed faster for something like this. Of course, the rates for both are still quite high, and actually choosing between map and for based on speed seems silly, but I'm still surprised.
Re^3: Improved regexp sought by ihb (Deacon) on Oct 28, 2004 at 08:47 UTC
For me, `for` is faster on one Perl and slower on another. The lesson learned is that optimizations like this are version dependent and not something you should bother about when trying to speed up your program. `Benchmark: running tp_for, tp_map for at least 2 CPU seconds... tp_for: 2 wallclock secs ( 1.81 usr + 0.24 sys = 2.05 CPU) @ 11155.46/s (n=22891) tp_map: 3 wallclock secs ( 1.96 usr + 0.23 sys = 2.19 CPU) @ 10615.60/s (n=23280) Rate tp_map tp_for tp_map 10616/s -- -5% tp_for 11155/s 5% -- This is perl, v5.8.0 built for MSWin32-x86-multi-thread` `Benchmark: running tp_for, tp_map for at least 2 CPU seconds... tp_for: 2 wallclock secs ( 2.02 usr + 0.28 sys = 2.30 CPU) @ 35695.65/s (n=82100) tp_map: 2 wallclock secs ( 1.80 usr + 0.30 sys = 2.10 CPU) @ 39710.48/s (n=83392) Rate tp_for tp_map tp_for 35696/s -- -10% tp_map 39710/s 11% -- This is perl, v5.8.4 built for i486-linux` `ihb` See perltoc if you don't know which perldoc to read! Read argumentation in its context!
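For readers who want to repeat the comparison on their own Perl, here is a rough sketch using the core Benchmark module. It is not the benchmark that was actually run in this thread: the sub names `tp_for` and `tp_map` are borrowed from the output above, but their bodies and the `@lines` sample data are assumptions made for illustration.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Assumed sample data: already chomped and unescaped, repeated to give some volume.
my @lines = ("0010+2+O'Reilly", "023++++234+35+White+++17+", "g'day mate+++") x 100;

# Build the list of "[index]field" strings with a statement-modifier for loop.
sub tp_for {
    my @out;
    for my $line (@lines) {
        my $i = 0;
        push @out, '[' . $i++ . "]$_" for split /\+/, $line;
    }
    return scalar @out;
}

# Build the same list with map.
sub tp_map {
    my @out;
    for my $line (@lines) {
        my $i = 0;
        push @out, map { '[' . $i++ . "]$_" } split /\+/, $line;
    }
    return scalar @out;
}

# A negative count means "run each sub for at least that many CPU seconds".
cmpthese(-1, { tp_for => \&tp_for, tp_map => \&tp_map });
```

As the two result sets above suggest, which sub wins is likely to vary with the Perl version and platform, so treat any single run as anecdotal.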
