Re: Improved regexp sought
by merlyn (Sage) on Oct 27, 2004 at 14:12 UTC
Can anybody suggest a change to the second line, and take me further along the path?
Sure, change the second line to:
my @fields = qw(0010 2 O'Reilly);
But if you want more help, you'll need to help us by describing what you want in words, not just have us second-guess your regex.
Point taken merlyn.
I have a file containing multiple lines. Each line consists of a variable number of variable-length fields separated by a + character. Each line is terminated by a ' character. Sometimes a field has a ' character in it; if so, the ' is preceded by a question mark.
Here are some example lines:
0010+2+O'Reilly'
023++++234+35+White+++17+'
g?'day mate+++'
I want to break each line up into its constituent fields. I can do it with brute force, but would prefer elegance.
Thanks
Myomancer
$string =~ s/'$//;
$string =~ s/\?'/'/g;
@fields = split /\+/, $string;
"I want to break each line up into its constituent fields. I can do it with brute force, but would prefer elegance."
I usually choose "working" over "not working" :-)
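One thing the three statements above do not show is that split drops trailing empty fields unless it is given a negative limit. Here is a minimal sketch, assuming trailing empty fields matter for this data; parse_record is a hypothetical helper name, not something from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): split one record into fields.
sub parse_record {
    my ($record) = @_;
    $record =~ s/'\s*\z//;                 # drop the terminating ' (and any newline)
    my @fields = split /\+/, $record, -1;  # -1 keeps trailing empty fields
    s/\?'/'/g for @fields;                 # turn the escaped ?' back into '
    return @fields;
}

my @fields = parse_record("g?'day mate+++'");
print scalar(@fields), " fields: ", join('|', @fields), "\n";
```

With the default limit, the same record would come back as a single field, because split discards the three empty ones at the end.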
use Text::CSV_XS;
my $parser = Text::CSV_XS->new( {
    eol         => "'\n",
    escape_char => "?",    # the data escapes an embedded ' with ?
    sep_char    => "+",
} );
while ( my $line = <$fh> ) {
    $parser->parse( $line );
    print join( ", ", $parser->fields ) . "\n";
}
my $str = "0010+2+O?'Reilly'";
my @field = map { s/\?'/'/g; $_ } split /\+/, substr($str, 0, length($str) - 1);
print "[$_]$field[$_]\n" for 0 .. $#field;
IMO, the code would be more elegant broken out into multiple lines - perhaps with comments.
Re: Improved regexp sought
by Roy Johnson (Monsignor) on Oct 27, 2004 at 16:29 UTC
my @fields = map {s/\?'/'/g; $_} split /\+|(?<!\?)'/, $line;
The split is on plus or apostrophes not preceded by a question mark. The map turns "?'" into just "'".
Update: It's probably clearer (and thus better) to say
my @fields = split /\+|(?<!\?)'/, $line;
s/\?'/'/g for @fields;
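A small sketch of the two-statement version applied to one of the sample lines, assuming the line still carries its newline when it is read. split_record is a hypothetical wrapper name used only for illustration.

```perl
use strict;
use warnings;

# Hypothetical wrapper (not from the thread) around the split shown above.
sub split_record {
    my ($line) = @_;
    chomp $line;    # otherwise the newline after the final ' becomes a field of its own

    # Split on + or on an apostrophe that is not preceded by a question mark.
    my @fields = split /\+|(?<!\?)'/, $line;
    s/\?'/'/g for @fields;    # unescape ?' in each field
    return @fields;
}

print join('|', split_record("0010+2+O?'Reilly'\n")), "\n";    # prints 0010|2|O'Reilly
```

The terminating apostrophe acts as a delimiter here, so it yields one trailing empty field, which split then discards along with any other trailing empties.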
Caution: Contents may have been coded under pressure.
Re: Improved regexp sought
by Random_Walk (Prior) on Oct 27, 2004 at 14:45 UTC
#!/usr/local/bin/perl -w
use strict;
$/ = "'\n";                    # set the input record separator
while (<DATA>) {
    chomp;                     # chomp strips the record separator
    s/\?'/'/g;                 # fix the "quoted" ' marks
    my @fields = split /\+/;
    print "[$_] $fields[$_]\n" for 0 .. $#fields;
}
__DATA__
0010+2+O'Reilly'
023++++234+35+White+++17+'
g?'day mate+++'
# output
[0] 0010
[1] 2
[2] O'Reilly
[0] 023
[1]
[2]
[3]
[4] 234
[5] 35
[6] White
[7]
[8]
[9] 17
[0] g'day mate
update
possibly a bit more elegant for certain values of elegance
$/ = "'\n";
while (<DATA>) {
    chomp;
    s/\?'/'/g;
    my $i;
    print "[", $i++, "] $_\n" for split /\+/;
}
I keep getting voted down for this node, so I think I had better explain myself. I am not doing it all in a regex as the OP wanted, but the OP says the data is in a file, one record per line, each terminated with a '. As he has to read the line in anyway, and we can guess from the example string in the lead post that he is also chomping each line he reads, why not use the input record separator ($/) to handle the terminal ' efficiently? Once you have reached this point, fixing the "?'" and splitting is surely more efficient and maintainable than some confusing regex.
Cheers, R.
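To make the record-separator idea concrete, here is a minimal sketch that reads the sample lines from an in-memory filehandle. The in-memory string and the local scoping of $/ are assumptions for illustration; the original script sets $/ globally and reads from DATA.

```perl
use strict;
use warnings;

# Sample records from the thread, held in a string instead of a file.
my $data = "0010+2+O?'Reilly'\n023++++234+35+White+++17+'\ng?'day mate+++'\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @records;
{
    local $/ = "'\n";          # a record ends with an apostrophe followed by a newline
    while (my $record = <$fh>) {
        chomp $record;         # chomp removes the current value of $/
        push @records, $record;
    }
}
close $fh;

print "$_\n" for @records;
```

Scoping the change with local keeps the rest of the program reading ordinary newline-terminated lines.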
Re: Improved regexp sought
by TedPride (Priest) on Oct 27, 2004 at 15:57 UTC
I tested, and map is significantly faster than for or foreach at large numbers of iterations (about 13/8 as fast). I couldn't figure out how to put everything into one statement given the ?' format, but the following works fine:
use strict;
use warnings;
my $i;
while (<DATA>) {
    chomp; chop; s/\?'/'/g;
    $i = 0; map { print '['.$i++."]$_\n" } split(/\+/);
}
__DATA__
0010+2+O?'Reilly'
012+90+Penguin'
I tested, and map is significantly faster than for or foreach at large numbers of iterations
I'm sorry, I know it's off-topic, but I'd really like to see the Benchmarks to back this statement up. It's not that I don't believe you, just that it would be really surprising to me if it were true.
Update: well, I am indeed surprised. Here's a benchmark I cooked up:
Which shows that map is indeed faster for something like this. Of course, the rates for both are still quite high, and actually choosing between map and for based on speed seems silly, but I'm still surprised.
For me, for is faster on one Perl and slower on another. The lesson learned is that optimizations like this are version-dependent and not something you should bother about when trying to speed up your program.
Benchmark: running tp_for, tp_map for at least 2 CPU seconds...
tp_for: 2 wallclock secs ( 1.81 usr + 0.24 sys = 2.05 CPU) @ 11155.46/s (n=22891)
tp_map: 3 wallclock secs ( 1.96 usr + 0.23 sys = 2.19 CPU) @ 10615.60/s (n=23280)
          Rate tp_map tp_for
tp_map 10616/s     --    -5%
tp_for 11155/s     5%     --
This is perl, v5.8.0 built for MSWin32-x86-multi-thread
Benchmark: running tp_for, tp_map for at least 2 CPU seconds...
tp_for: 2 wallclock secs ( 2.02 usr + 0.28 sys = 2.30 CPU) @ 35695.65/s (n=82100)
tp_map: 2 wallclock secs ( 1.80 usr + 0.30 sys = 2.10 CPU) @ 39710.48/s (n=83392)
          Rate tp_for tp_map
tp_for 35696/s     --   -10%
tp_map 39710/s    11%     --
This is perl, v5.8.4 built for i486-linux
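The code behind these numbers is not shown above, so the following is only a sketch of how such a comparison could be set up with the core Benchmark module. The names tp_for and tp_map are taken from the output; the sub bodies and the test data are assumptions, not the code that was actually timed.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my @data = map { "field$_" } 1 .. 100;    # made-up input, not the thread's data

# Assumed workloads: build the same bracketed list with a for loop and with map.
sub tp_for { my @out; push @out, "[$_]" for @data; return scalar @out }
sub tp_map { my @out = map { "[$_]" } @data;       return scalar @out }

# A negative count means "run each sub for at least that many CPU seconds".
cmpthese(-1, { tp_for => \&tp_for, tp_map => \&tp_map });
```

Running the same script under different Perl builds is what exposes the kind of version-dependent difference reported above.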
ihb
See perltoc if you don't know which perldoc to read!
Read argumentation in its context!