Fast Way to Split String in to Chunk of Equal Length

neversaint has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Fast Way to Split String in to Chunk of Equal Length by ikegami (Patriarch) on Nov 25, 2011 at 07:22 UTC
`my @parts = $seq =~ /.{3}/sg;` [download] `my @parts = unpack('(a3)*', $seq);` [download] You didn't specify how you want to handle strings whose length isn't a multiple of three, so I presume it can't happen.	[reply] [d/l] [select]
Re: Fast Way to Split String in to Chunk of Equal Length by davido (Cardinal) on Nov 25, 2011 at 08:11 UTC
Are you sure that chunking the strings is where your performance bottleneck is? Have you profiled? Could IO be a more significant constraint to execution time? I only ask because it doesn't seem like the performance of substr for 10,000,000 strings is all that bad for the problem domain. If this section of code is really significant, here's a comparison of the valid solutions provided up to this point in the thread. Naturally unpack wins. It's only a few seconds slower in 10,000,000 iterations than the "control" case (which isn't a solution, but just a check to see what the framework for each solution costs). Read more... (1176 Bytes) Here's the output. `Benchmark: timing 10000000 iterations of control, match, substr, unpac +k... control: 5 wallclock secs ( 5.65 usr + 0.00 sys = 5.65 CPU) @ 17 +69911.50/s (n=10000000) match: 22 wallclock secs (21.12 usr + 0.00 sys = 21.12 CPU) @ 47 +3484.85/s (n=10000000) substr: 12 wallclock secs (11.88 usr + 0.00 sys = 11.88 CPU) @ 84 +1750.84/s (n=10000000) unpack: 8 wallclock secs ( 9.04 usr + 0.00 sys = 9.04 CPU) @ 11 +06194.69/s (n=10000000)` [download] By the way: This question was crossposted to StackOverflow here. Dave	[reply] [d/l] [select]
Re: Fast Way to Split String in to Chunk of Equal Length by CountZero (Bishop) on Nov 25, 2011 at 07:16 UTC
I am not sure it is fast but it works. `use Modern::Perl; use Data::Dump qw/dump/; my $seq = "CTTCGAATT"; # in this case length of 9 my @strings = $seq =~ /(.{3})/g ; say dump(@strings);` [download] Output: `("CTT", "CGA", "ATT")` [download] CountZero A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James	[reply] [d/l] [select]
Re^2: Fast Way to Split String in to Chunk of Equal Length by bart (Canon) on Nov 25, 2011 at 11:51 UTC
~~I think your code is just a tad too example specific. I doubt that in his real data the string will always be 9 characters. But it will always be a multiple of 3.~~ ~~If it is always 9 characters, then your code is as good as an alternatives.~~ update Wait a minute, did you just change your code? If not, I must have replied to the wrong node.	[reply]
Re^3: Fast Way to Split String in to Chunk of Equal Length by CountZero (Bishop) on Nov 25, 2011 at 16:41 UTC
Nope, did not change my code at all. CountZero A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James	[reply]
Re: Fast Way to Split String in to Chunk of Equal Length by BrowserUk (Patriarch) on Nov 25, 2011 at 10:31 UTC
unpack is the way to go, but one tip is that `unpack 'a3' x $n, $str` runs a few percent quicker than `unpack '(a3)*', $str` if you know in advance how many fields there are. Also, if your problem allows you to unpack the strings just in time at the point where you need them, rather than en masse, avoiding the allocation, creation and destruction of 10 million concurrent small arrays -- even if you have ample memory for them -- can save a significant amounts of time. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l] [select]
Re^2: Fast Way to Split String in to Chunk of Equal Length by ikegami (Patriarch) on Nov 25, 2011 at 17:09 UTC
unpack is the way to go, but one tip is that unpack 'a3' x $n, $str runs a few percent quicker than unpack '(a3)', $str if you know in advance how many fields there are.* How can you not know in advance how many fields there are (`int(length($seq)/3)`)? avoiding the allocation, creation and destruction of 10 million concurrent small arrays -- even if you have ample memory for them -- can save a significant amounts of time Any implementation in mind? Something like the following? `while ($seq =~ /(.{3})/sg) { # Substring in $1 ... }` [download]	[reply] [d/l] [select]
Re^3: Fast Way to Split String in to Chunk of Equal Length by BrowserUk (Patriarch) on Nov 25, 2011 at 17:20 UTC
Assuming your question is in regard to: Also, if your problem allows you to unpack the strings just in time at the point where you need them, rather than en masse, avoiding the allocation, creation and destruction of 10 million concurrent small arrays -- even if you have ample memory for them -- can save a significant amounts of time. I meant -- unpack the strings just in time at the point where you need them -- something like: `for my $seq ( @seqs ) { my @bits = unpack 'a3a3a3', $seq; ## $bits[ ... ]; }` [download] rather than -- en masse --: `@seqs = map[ unpack '(a3)*', $_ ], @seqs; for my $ref ( @seqs ) { ## $ref->[ ... ] }` [download] With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l] [select]
Re^3: Fast Way to Split String in to Chunk of Equal Length by BrowserUk (Patriarch) on Nov 25, 2011 at 20:23 UTC
At it again. Unannounced updates. The above post originally consisted entirely of just: Any implementation in mind? Something like the following? `while ($seq =~ /(.{3})/sg) { # Substring in $1 ... }` [download] To answer the dumb question that wasn't there originally, How can you not know in advance how many fields there are `(int(length($seq)/3))?` If you have to calculate the number, you throw away the few percent gained. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l] [select]
Re: Fast Way to Split String in to Chunk of Equal Length by pvaldes (Chaplain) on Nov 25, 2011 at 12:35 UTC
Similar, but better like this, less error prone (and a little faster probably) `my $seq = qw(CCCTCTCTCCCAGGGGCA); my @parts = $seq =~ /[CGTAU]{3}/sg; foreach my $codon(@parts){print $codon," "}` [download]	[reply] [d/l]
Re: Fast Way to Split String in to Chunk of Equal Length by TJPride (Pilgrim) on Nov 25, 2011 at 11:58 UTC
I would think that substr would be fast enough to do what you need. The main bottleneck, as BrowserUK said, is probably not substr, but rather disk read (if you're reading one record at a time) or memory allocation. On the former, you could try using read() and a buffer variable. Be simple enough to read in some large multiple of 10 (9 + separator character) and then generate directly from that: `use strict; use warnings; my ($handle, $buffer, $size, @parsed, $i); open($handle, 'my-data.txt'); while ($size = read($handle, $buffer, 30)) { for ($i = 0; $i < $size; $i += 10) { push @parsed, [ substr($buffer, $i, 3), substr($buffer, $i+3, 3), substr($buffer, $i+6, 3) ]; } }` [download]	[reply] [d/l]
Re^2: Fast Way to Split String in to Chunk of Equal Length by TJPride (Pilgrim) on Nov 25, 2011 at 16:48 UTC
Note that I was testing with a read of 30 - should have modified it back to 10240 before posting.	[reply]
Re: Fast Way to Split String in to Chunk of Equal Length by pvaldes (Chaplain) on Nov 25, 2011 at 12:37 UTC
note also that in this code the last non-triplet is silently discarded. `my $seq = qw(CCCTCTCTCCCAGGGGCAAA); prints: CCC TCT CTC CCA GGG GCA AA discarded` [download] This could be, or not, what you want, so maybe you'll need to fix this	[reply] [d/l]
A reply falls below the community's threshold of quality. You may see it by logging in.


"be consistent"
	PerlMonks