Re: 30 Spaces- 1 question
by extremely (Priest) on Oct 09, 2000 at 06:55 UTC
my @w = split /\s+/,$text,31;
pop @w;
my $words30= join " ", @w;
I like dchetlin's regex too (I modified it a little):
$text =~ m/^((?:\S+\s+){1,30})/;
my $words30 = $1;
Though with that, the usual warning applies: if the match fails,
$1 keeps whatever an earlier successful match set, so check that
the match succeeded before using it.
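A minimal sketch of one way to handle the stale $1 caveat, assuming all you want is the leading words: test the match and only read $1 on success. The helper name leading_words is made up for illustration and is not from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: returns up to $max leading words (including the
# whitespace that follows each one), or undef if the string has no words.
sub leading_words {
    my ($text, $max) = @_;
    # Only trust $1 when this particular match succeeded.
    if ($text =~ /^((?:\S+\s*){1,$max})/) {
        return $1;
    }
    return undef;
}

my $hit  = leading_words("one two three", 2);
my $miss = leading_words("   ", 30);

print "hit:  '$hit'\n";
print "miss: ", (defined($miss) ? "'$miss'" : "no match"), "\n";
```

Returning undef on failure makes the caller decide what "no words" means, instead of silently reusing an old capture.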
If ABSOLUTELY sure there are no double spaces or other
nastiness, you could use index() in a for loop too.
But it's ugly so someone else can do that...
--
$you = new YOU;
honk() if $you->love(perl)
[~] $ perl -wnle'/^((?:\S+\s+){1,30})/;print $1'
one two three four five six seven
one two three four five six
versus mine:
[~] $ perl -wnle'/^((?:\S+\s*){1,30})/;print $1'
one two three four five six seven
one two three four five six seven
Also, your split solution has a minor problem:
[~] $ perl -wnle'@w=split/\s+/,$_,31;pop@w;print join " ",@w'
one two three four five
one two three four
(Granted, the problem specs weren't that great, but it seems reasonable to assume that if we have a line with fewer than 30 words, we don't want to throw away the last one.)
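A sketch of a split-based version that keeps the last word on short lines, assuming words are separated by arbitrary whitespace. The helper name first_words is hypothetical. It only pops the final element when split actually produced a leftover chunk.

```perl
use strict;
use warnings;

# Hypothetical helper: the first $n whitespace-separated words of $text,
# joined by single spaces.
sub first_words {
    my ($text, $n) = @_;
    $text =~ s/\s+\z//;                  # trailing whitespace would leave an empty last field
    my @w = split ' ', $text, $n + 1;    # at most $n words plus one "rest" chunk
    pop @w if @w > $n;                   # drop the rest chunk only when there is one
    return join ' ', @w;
}

print first_words("one two three four five", 30), "\n";
```

With a five-word line and a limit of 30, nothing is popped, so all five words come back.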
-dlc
Actually, I had rather taken it as gospel that the
string was more than 30 chunks. =) good spot on that.
I still hate throwing the regex engine at this problem
tho. Maybe I'll Benchmark em all and post that. Make me
feel better for being a goof.
--
$you = new YOU;
honk() if $you->love(perl)
Re: 30 Spaces- 1 question
by extremely (Priest) on Oct 09, 2000 at 13:18 UTC
Benchmarks, first the code: (GAK! See post from dchetlin)
use strict;
use Benchmark qw(cmpthese);
my @x;
$x[0]= "mark " x 35;
$x[1]= "asdfasdfasdfasdfasdf " x 35;
$x[2]= "as " x 31;
$x[3]= "asdfasdfasdfasdfasdfasdf " x 100;
$x[4]= "asdfasdfasdfasdfasdfasdf " x 10;
# http://www.cpan.org/doc/manual/html/pod/perlfunc/index.html
#
# Well, that was no help... fricking http conventions
#
cmpthese (1000000, {
'regexp' => ' foreach (@x) {
/^((?:\S+\s*){1,30})/;
print $1;
} ',
'split_join' => ' foreach (@x) {
print join ( " ", (split " ", $_,31)[0..29] );
} ',
'for_index_substr' => ' foreach (@x) {
my $ind = index ($_, " ");
for (0..29) {
last if $ind == -1;
$ind = index $_, " ", $ind;
}
print substr($_,0,$ind);
} ',
} );
and now the results (basically, a huge fricking waste of
time, we are talking a MILLION reps on p166 here.)
                    Rate split_join for_index_substr regexp
split_join       69061/s         --              -7%    -7%
for_index_substr 73910/s         7%               --    -1%
regexp           74627/s         8%               1%     --
### second run, no changes... hmmm....
                    Rate for_index_substr split_join regexp
for_index_substr 68027/s               --        -9%    -9%
split_join       74349/s               9%         --    -0%
regexp           74516/s              10%         0%     --
At those rates, any of those would likely be waiting on
the hard drive to feed them data. My dataset is about 3.5kB,
so that means a 232MB/s feed rate on the WORST run there and
255MB/s on the best. =) Do it any way you want. It's not going
to matter, not one iota, in the long run.
--
$you = new YOU;
honk() if $you->love(perl)
Hmm. Sorry to rain on the parade, but the benchmark isn't actually measuring what it should. @x is a lexical here, and when Benchmark takes those strings and evals them inside its own code, @x is out of scope. That's why they're all going so fast; there's nothing to loop over.
Also, a more minor point: your for_index_substr routine isn't working. $_ is re-aliased by the inner for loop, so the actual target string is lost.
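The scoping problem can be reproduced without Benchmark. The sketch below is an illustration under assumptions, not Benchmark's actual internals: StringRunner is a made-up stand-in for any module that compiles a code string somewhere the caller's lexicals are not visible.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a module that evals code strings in its own scope.
package StringRunner;

sub count_seen {
    my ($code) = @_;
    no strict 'vars';                        # unknown names fall back to package variables
    my @seen = eval "package main; $code";   # compiled here, away from the caller's lexicals
    die $@ if $@;
    return scalar @seen;
}

package main;

my  @lexical = (1, 2, 3);    # not visible to code compiled inside StringRunner
our @global  = (1, 2, 3);    # package variable, reachable by name

my $lex_count     = StringRunner::count_seen('@lexical');
my $global_count  = StringRunner::count_seen('@global');
my $closure_count = sub { scalar @lexical }->();

print "string eval, lexical: $lex_count\n";
print "string eval, global:  $global_count\n";
print "closure, lexical:     $closure_count\n";
```

The string version sees an empty @main::lexical, while the package variable and the closure both see three elements. That is why passing code refs (or using package variables) gives the benchmark something real to loop over.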
I took the liberty of making a couple of changes and re-running:
use strict;
use Benchmark qw(cmpthese);
my @x;
my @result;
$x[0]= "mark " x 35;
$x[1]= "asdfasdfasdfasdfasdf " x 35;
$x[2]= "as " x 31;
$x[3]= "asdfasdfasdfasdfasdfasdf " x 100;
$x[4]= "asdfasdfasdfasdfasdfasdf " x 10;
cmpthese (-5, {
'regexp' =>
sub {
my $i;
foreach (@x) {
/^((?:\S+\s*){1,30})/;
$result[$i++]{REx} = $1;
}
},
'split_join' =>
sub {
my $i;
foreach (@x) {
$result[$i++]{split} =
join " ", (split " ", $_,31)[0..29];
}
},
'for_index_substr' =>
sub {
my $i;
foreach (@x) {
my $ind = index ($_, " ");
for my $foo (0..28) {
last if $ind == -1;
$ind = index $_, " ", $ind + 1;
}
$result[$i++]{index} = substr($_,0,$ind);
}
},
} );
for (@result) {
print "bad!" unless ($_{REx} eq $_{split} and
$_{split} eq $_{index});
}
This produces:
                   Rate split_join for_index_substr regexp
split_join       4294/s         --              -7%   -27%
for_index_substr 4618/s         8%               --   -21%
regexp           5849/s        36%              27%     --
-dlc
*sigh* That will teach me, eh? I should have run it
with it not printing to STDERR and that piped to /dev/null
=)
Also, in my defense, it was after my bedtime =) Don't
apologize for being sharp...
What is worse is that I tested a working version of
for_index_substr but "cleaned it up" for the final run.
I normally use vars ... to avoid the scope
issue. =( *woe is me*
OTOH, I don't get your numbers when running your code.
Number one, your results check should be:
foreach (@result) {
print "bad!" unless ($_->{REx} eq $_->{split} and
$_->{split} eq $_->{index});
}
Those arrows are important. Second, the comparison will
always fail because your regex includes the terminal space
and the split version doesn't. Also, if there were multiple
spaces or other types of whitespace they would fail
but we both agreed to ignore that. =)
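One way to make the two results comparable is to keep the trailing whitespace out of the capture. This is a sketch with a variant pattern that is not from the thread: one word, then up to 29 more "whitespace + word" groups, so the capture always ends on a word. It still assumes single spaces between words, as agreed above.

```perl
use strict;
use warnings;

my $text = "mark " x 35;

# Variant pattern (an assumption, not the thread's regex): the capture
# ends on the 30th word instead of on the whitespace after it.
my ($by_regex) = $text =~ /^(\S+(?:\s+\S+){0,29})/;

# The split version from the benchmark, for comparison.
my $by_split = join ' ', (split ' ', $text, 31)[0 .. 29];

my $verdict = $by_regex eq $by_split ? "same" : "different";
print "$verdict\n";
```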
Third, I still get equivalent results. (split_join_2 uses
the non-magical /\s+/ regex, just for fun.) I get
this with cmpthese 20000; cmpthese -10 gives
equivalent results.
                   Rate split_join_2 split_join regexp for_index_substr
split_join_2     3205/s           --        -0%    -4%              -6%
split_join       3205/s           0%         --    -4%              -6%
regexp           3344/s           4%         4%     --              -2%
for_index_substr 3413/s           6%         6%     2%               --
Worse, the array slice hack on the end of (split...)[0..29]
throws warnings on -w if there are too few items in the slice
to join. Join doesn't like undef it appears =)
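A small sketch of one way to quiet those warnings, assuming you want to keep the slice: filter out the undef slots before joining. This is just one option; bounding the slice by the number of words found would work too.

```perl
use strict;
use warnings;

my $short = "as " x 10;    # only 10 words, fewer than the slice asks for

# The slice returns undef for every missing index; grep drops those
# before join ever sees them.
my $words = join ' ', grep { defined $_ } (split ' ', $short)[0 .. 29];

print "$words\n";
```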
All in all, I'm less enlightened now. Did I cut-n-paste
wrong? I print debugged to verify that bits were working
all thru it. *sigh*
--
$you = new YOU;
honk() if $you->love(perl)
Whoa! Thanks, that's more than I asked for!
Re: 30 Spaces- 1 question
by dchetlin (Friar) on Oct 09, 2000 at 06:54 UTC
tr is not the right tool here.
if ($desc =~ m#((?:\S+\s*){1,30})#) {
$first_30 = $1;
} else {
# Didn't match here
}
-dlc
RE: 30 Spaces- 1 question
by Zarathustra (Beadle) on Oct 09, 2000 at 07:00 UTC
for ( 1..90) {
$line .= "word ";
}
( $thirty = $line ) =~ s/^((\w+\s){1,30}).*$/$1/;
print $thirty;
That's my shoddy solution at any rate.
Re: 30 Spaces- 1 question
by Trimbach (Curate) on Oct 09, 2000 at 09:18 UTC
@first_thirty[0..29] = split " ", $string;
or, if you want the first 30 words as a single string you can:
$first_thirty = join " ", (split " ", $string)[0..29];
...which doesn't avoid the split/join thing, but by specifying an array slice you might avoid the memory hit of splitting the entire string when all you want is the first 30 words.
Gary Blackburn
Trained Killer
split has a form for avoiding oversplitting: you can
give it a count of how many parts you want, see my first post.
@array = split / /, $string, 31;
OTOH, your array slice will fix my dropping the last item
with pop foolishness...
--
$you = new YOU;
honk() if $you->love(perl)
Re: 30 Spaces- 1 question
by merlyn (Sage) on Oct 09, 2000 at 07:16 UTC