30 Spaces- 1 question

AgentM has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: 30 Spaces- 1 question by extremely (Priest) on Oct 09, 2000 at 06:55 UTC
My recommendation:
`my @w = split /\s+/,$text,31; pop @w; my $words30= join " ", @w;`
I like dchetlin's regex too (I modified it a little):
`$text =~ m/^((?:\S+\s+){1,30})/; my $words30 = $1;`
Though the usual warning applies: if the match fails, $1 still holds the value from a prior successful match, so that may need dealing with. If you are ABSOLUTELY sure there are no double spaces or other nastiness, you could use index() in a for loop too. But it's ugly, so someone else can do that...
-- $you = new YOU; honk() if $you->love(perl)
RE: Re: 30 Spaces- 1 question by dchetlin (Friar) on Oct 09, 2000 at 07:01 UTC
I appreciate your liking my REx, but be careful; that `\s*` is a `*` for a reason. Try yours out on the input line `one two three four five six seven`:
`[~] $ perl -wnle'/^((?:\S+\s+){1,30})/;print $1'`
prints `one two three four five six` (the last word is lost because nothing follows it for `\s+` to match), versus mine:
`[~] $ perl -wnle'/^((?:\S+\s*){1,30})/;print $1'`
which prints `one two three four five six seven`.
Also, your split solution has a minor problem. Given the input line `one two three four five`,
`[~] $ perl -wnle'@w=split/\s+/,$_,31;pop@w;print join " ",@w'`
prints `one two three four`.
(Granted, the problem specs weren't that great, but it seems reasonable to assume that if we have a line with fewer than 30 words, we don't want to throw away the last.)
-dlc
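A minimal sketch of both fixes discussed in this subthread, assuming the goal is the first N whitespace-separated words joined by single spaces (that normalization is an assumption; the original question does not specify it). The names `first_words_split` and `first_words_regex` are hypothetical and do not come from any poster's code.

```perl
use strict;
use warnings;

# Hypothetical helper: split with a limit of $n + 1 so the unsplit remainder
# lands in the last field, then drop that field only when it actually exists.
sub first_words_split {
    my ($text, $n) = @_;
    my @w = split ' ', $text, $n + 1;
    pop @w if @w > $n;                 # remainder beyond the first $n words
    pop @w if @w && $w[-1] eq '';      # empty field left by trailing whitespace
    return join ' ', @w;
}

# Hypothetical helper: the \s* form of the regex, so the last word still
# matches when no whitespace follows it. Trailing whitespace is trimmed.
sub first_words_regex {
    my ($text, $n) = @_;
    return '' unless $text =~ /^\s*((?:\S+\s*){1,$n})/;
    (my $words = $1) =~ s/\s+\z//;
    return $words;
}

print first_words_split('one two three four five', 30), "\n";
print first_words_regex('one two three four five six seven', 30), "\n";
```

Checking the match result before reading `$1` also sidesteps the stale-`$1` concern raised earlier in the thread. Note that the regex version keeps the original spacing between words, while the split version normalizes it.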
RE: RE: Re: 30 Spaces- 1 question by extremely (Priest) on Oct 09, 2000 at 10:07 UTC
Actually, I had rather taken it as gospel that the string was more than 30 chunks. =) Good spot on that. I still hate throwing the regex engine at this problem, though. Maybe I'll Benchmark 'em all and post that. Make me feel better for being a goof.
-- $you = new YOU; honk() if $you->love(perl)
(dchetlin) Re(4): 30 Spaces- 1 question by dchetlin (Friar) on Oct 09, 2000 at 11:07 UTC
Re: 30 Spaces- 1 question by extremely (Priest) on Oct 09, 2000 at 13:18 UTC
Benchmarks, first the code: (GAK! See post from dchetlin)

    use strict;
    use Benchmark qw(cmpthese);
    my @x;
    $x[0]= "mark " x 35;
    $x[1]= "asdfasdfasdfasdfasdf " x 35;
    $x[2]= "as " x 31;
    $x[3]= "asdfasdfasdfasdfasdfasdf " x 100;
    $x[4]= "asdfasdfasdfasdfasdfasdf " x 10;
    # http://www.cpan.org/doc/manual/html/pod/perlfunc/index.html
    #
    # Well, that was no help... fricking http conventions
    #
    cmpthese (1000000, {
        'regexp' => '
            foreach (@x) {
                /^((?:\S+\s*){1,30})/;
                print $1;
            }
        ',
        'split_join' => '
            foreach (@x) {
                print join ( " ", (split " ", $_,31)[0..29] );
            }
        ',
        'for_index_substr' => '
            foreach (@x) {
                my $ind = index ($_, " ");
                for (0..29) {
                    last if $ind == -1;
                    $ind = index $_, " ", $ind;
                }
                print substr($_,0,$ind);
            }
        ',
    } );

and now the results (basically, a huge fricking waste of time, we are talking a MILLION reps on a p166 here):

                        Rate split_join for_index_substr regexp
    split_join       69061/s         --              -7%    -7%
    for_index_substr 73910/s         7%               --    -1%
    regexp           74627/s         8%               1%     --

    ### second run, no changes... hmmm....

                        Rate for_index_substr split_join regexp
    for_index_substr 68027/s               --        -9%    -9%
    split_join       74349/s               9%         --    -0%
    regexp           74516/s              10%         0%     --

At those rates, any of those would likely be waiting on the hard drive to feed them data. My dataset is about 3.5kB, so that means a 232MB/s feed rate on the WORST run there and 255MB/s on the best. =) Do it any way you want. Not going to matter, not one iota, in the long run.
-- $you = new YOU; honk() if $you->love(perl)
(dchetlin: Benchmark fixes) 30 Spaces- 1 question by dchetlin (Friar) on Oct 10, 2000 at 06:40 UTC
Hmm. Sorry to be the parade-rainer, but the benchmark isn't actually measuring what it should be. `@x` is a lexical here, and when Benchmark takes those strings and evaluates them, `@x` is out of scope. That's why they're all going so fast; there's nothing to loop over.
Also, and more minorly, your for_index_substr routine isn't working -- `$_` is being re-aliased on the second for loop and the actual target string is being lost.
I took the liberty of making a couple of changes and re-running:

    use strict;
    use Benchmark qw(cmpthese);
    my @x;
    my @result;
    $x[0]= "mark " x 35;
    $x[1]= "asdfasdfasdfasdfasdf " x 35;
    $x[2]= "as " x 31;
    $x[3]= "asdfasdfasdfasdfasdfasdf " x 100;
    $x[4]= "asdfasdfasdfasdfasdfasdf " x 10;
    cmpthese (-5, {
        'regexp' => sub {
            my $i;
            foreach (@x) {
                /^((?:\S+\s*){1,30})/;
                $result[$i++]{REx} = $1;
            }
        },
        'split_join' => sub {
            my $i;
            foreach (@x) {
                $result[$i++]{split} = join " ", (split " ", $_,31)[0..29];
            }
        },
        'for_index_substr' => sub {
            my $i;
            foreach (@x) {
                my $ind = index ($_, " ");
                for my $foo (0..28) {
                    last if $ind == -1;
                    $ind = index $_, " ", $ind + 1;
                }
                $result[$i++]{index} = substr($_,0,$ind);
            }
        },
    } );
    for (@result) {
        print "bad!" unless ($_{REx} eq $_{split} and $_{split} eq $_{index});
    }

This produces:

                       Rate split_join for_index_substr regexp
    split_join       4294/s         --              -7%   -27%
    for_index_substr 4618/s         8%               --   -21%
    regexp           5849/s        36%              27%     --

-dlc
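The scoping point can be illustrated without Benchmark itself. In this sketch, `Runner::count_items` is a hypothetical stand-in for any module that compiles a code string inside its own scope; it is only an analogy and makes no claim about how Benchmark is implemented. The variable names are likewise made up for the example.

```perl
use strict;
use warnings;

{
    # Hypothetical module-like package that evaluates code strings.
    package Runner;
    no strict 'vars';   # unknown names silently resolve to package variables

    sub count_items {
        my ($code) = @_;
        # The string is compiled here, where the caller's lexicals are not visible.
        my $n = eval "package main; $code";
        die $@ if $@;
        return $n;
    }
}

my  @lexical = ('word') x 5;   # lexical: invisible to code strings compiled elsewhere
our @global  = ('word') x 5;   # package variable: reachable as @main::global

my $from_string_lexical = Runner::count_items('scalar @lexical');
my $from_string_global  = Runner::count_items('scalar @global');
my $from_closure        = sub { scalar @lexical }->();

print "string + lexical: $from_string_lexical\n";
print "string + global:  $from_string_global\n";
print "closure:          $from_closure\n";
```

The code string that names the lexical ends up looking at an empty package array, which is the "nothing to loop over" effect described above. Passing code references, as the revised benchmark does, or declaring the data as a package variable avoids it.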
RE: (dchetlin: Benchmark fixes) 30 Spaces- 1 question by extremely (Priest) on Oct 10, 2000 at 08:29 UTC
Sigh. That will teach me, eh? I should have run it with it not printing to STDERR and that piped to /dev/null =) Also, in my defense, it was after my bedtime =) Don't apologize for being sharp...
What is worse is that I tested a working version of the for_index_substr but "cleaned it up" for the final run. I normally `use vars ...` to avoid the scope issue. =( Woe is me.
OTOH, I don't get your numbers when running your code. Number one, your results check should be:

    foreach (@result) {
        print "bad!" unless ($_->{REx} eq $_->{split} and $_->{split} eq $_->{index});
    }

Those arrows are important. Second, the comparison will always fail because your regex includes the terminal space and the split version doesn't. Also, if there were multiple spaces or other types of whitespace they would fail, but we both agreed to ignore that. =)
Third, I still get equivalent results (split_join_2 has the non-magical `/\s+/` regex, just for fun). I get this with cmpthese 20,000 (using cmpthese -10 gives equivalent results):

                       Rate split_join_2 split_join regexp for_index_substr
    split_join_2     3205/s           --        -0%    -4%              -6%
    split_join       3205/s           0%         --    -4%              -6%
    regexp           3344/s           4%         4%     --              -2%
    for_index_substr 3413/s           6%         6%     2%               --

Worse, the array slice hack on the end of `(split...)[0..29]` throws warnings under -w if there are too few items in the slice to join. Join doesn't like undef, it appears =)
All in all, I'm less enlightened now. Did I cut-n-paste wrong? I print-debugged to verify that bits were working all through it. Sigh.
-- $you = new YOU; honk() if $you->love(perl)
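The slice warning mentioned above is easy to reproduce. This is a small sketch, assuming a three-word input and a `grep` guard as one possible workaround; neither the input nor the guard comes from the posts themselves.

```perl
use strict;
use warnings;

my $text = 'one two three';   # assumed sample input with fewer than 30 words
my @warnings;
my ($naive, $guarded);
{
    # Collect warnings in an array instead of printing them, for this block only.
    local $SIG{__WARN__} = sub { push @warnings, @_ };

    # Indices 3..29 do not exist in the split result, so the slice yields
    # undef for them and join warns about each one.
    $naive = join ' ', (split ' ', $text, 31)[0 .. 29];

    # Dropping the undef entries before joining avoids the warnings and the
    # padding of extra separators at the end of the string.
    $guarded = join ' ', grep { defined $_ } (split ' ', $text, 31)[0 .. 29];
}

printf "%d warnings; naive length %d, guarded length %d\n",
    scalar(@warnings), length($naive), length($guarded);
```

The naive result is longer than the guarded one because every undef slot still contributes a separator, which is one more reason the three benchmark variants do not compare as equal.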
RE: RE: (dchetlin: Benchmark fixes) 30 Spaces- 1 question by dchetlin (Friar) on Oct 10, 2000 at 08:34 UTC
RE: RE: RE: (dchetlin: Benchmark fixes) 30 Spaces- 1 question by extremely (Priest) on Oct 10, 2000 at 08:39 UTC
RE: Re: 30 Spaces- 1 question by AgentM (Curate) on Oct 09, 2000 at 18:59 UTC
Whoa! Thanks, that's more than I asked for!
AgentM Systems or Nasca Enterprises is not responsible for the comments made by AgentM- anywhere.
Re: 30 Spaces- 1 question by dchetlin (Friar) on Oct 09, 2000 at 06:54 UTC
tr is not the right tool here.

    if ($desc =~ m#((?:\S+\s*){1,30})#) {
        $first_30 = $1;
    } else {
        # Didn't match here
    }

-dlc
RE: 30 Spaces- 1 question by Zarathustra (Beadle) on Oct 09, 2000 at 07:00 UTC
    for ( 1..90) { $line .= "word "; }
    ( $thirty = $line ) =~ s/^((\w+\s){1,30}).*$/$1/;
    print $thirty;

That's my shoddy solution at any rate.
Re: 30 Spaces- 1 question by Trimbach (Curate) on Oct 09, 2000 at 09:18 UTC
How about this?
`@first_thirty[0..29] = split " ", $string;`
or, if you want the first 30 words as a single string, you can:
`$first_thirty = join " ", (split " ", $string)[0..29];`
...which doesn't avoid the split/join thing, but by specifying an array slice you might avoid the memory hit of splitting the entire string when all you want is the first 30 words.
Gary Blackburn
Trained Killer
RE: Re: 30 Spaces- 1 question by extremely (Priest) on Oct 09, 2000 at 09:58 UTC
split has a form for avoiding oversplitting: you can give it a count of how many parts you want, see my first post.
`@array = split / /, $string, 31;`
OTOH, your array slice will fix my foolishness of dropping the last item with pop...
-- $you = new YOU; honk() if $you->love(perl)
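A short sketch of what the limit argument does, using a made-up 100-word string. It uses the `' '` pattern rather than `/ /`, because a literal single-space pattern produces empty fields wherever two spaces are adjacent.

```perl
use strict;
use warnings;

# Assumed sample data: the words w1 .. w100 separated by single spaces.
my $string = join ' ', map { "w$_" } 1 .. 100;

my @all     = split ' ', $string;       # splits the whole string
my @limited = split ' ', $string, 31;   # stops after producing 31 fields

# Take at most the first 30 fields, never indexing past the end of the array.
my $last    = $#limited < 29 ? $#limited : 29;
my @first30 = @limited[0 .. $last];

print scalar(@all), " fields without a limit, ", scalar(@limited), " with one\n";
print "field 31 begins: ", substr($limited[30], 0, 7), "\n";
print "first 30 end at: $first30[-1]\n";
```

With the limit, the 31st field holds the entire unsplit remainder, so the work stops once the first 30 words have been found.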
Re: 30 Spaces- 1 question by merlyn (Sage) on Oct 09, 2000 at 07:16 UTC
I can't tell, but would Text::Wrap or Text::Autoformat be apropos here?
-- Randal L. Schwartz, Perl hacker
