in reply to Performance problems on splitting long strings

Why does the long string have to be split into individual containers? Can you work with the large string, just in segments at offset multiples of five?

unpack is already lightning fast, but even less work is done if you can avoid making copies at all. Of course this is Perl, and sometimes the more idiomatic approach (unpack, for example) can be faster than the algorithmic one (walking through the existing string without making copies). But if unpack is still too slow, and if whatever you are using those little substrings for could be done in place, it might be worth attempting a solution that works on the original string.
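
A minimal sketch of what that might look like, with placeholder data and a purely illustrative in-place replacement:

    use strict;
    use warnings;

    my $string = '1234598765555553456733333';   # stand-in for the real data

    for (my $offset = 0; $offset < length $string; $offset += 5) {
        # Three-argument substr copies only one five-character
        # segment at a time, never the whole string:
        my $segment = substr $string, $offset, 5;

        # Four-argument substr writes a replacement back in place
        # (the condition here is purely illustrative):
        substr $string, $offset, 5, 'xxxxx' if $segment eq '55555';
    }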


Dave


Re^2: Performance problems on splitting long strings
by AnomalousMonk (Archbishop) on Jan 31, 2014 at 22:24 UTC

    An example of davido's in-place string modification combined with boftx's dispatch table handler:

    >perl -wMstrict -le
    "use constant LEN => 5;
    ;;
    my $s = '1234598765555553456733333';
    print qq{'$s'};
    ;;
    my %dispatch = (
        '55555' => sub { return 'x' x length $_[0]; },
        );
    ;;
    for (my $offset = 0; $offset < length $s; $offset += LEN) {
        for (substr $s, $offset, LEN) {
            $_ = exists $dispatch{$_} ? $dispatch{$_}->($_) : $_ + 2;
        }
    }
    print qq{'$s'};
    "
    '1234598765555553456733333'
    '1234798767xxxxx3456933335'

    Update: Changed example code to also exemplify topicalization of sub-string segment via for-structure (given no longer being quite kosher).
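
    The same logic reads more easily as a standalone script; the comments spell out the in-place aliasing that the one-liner relies on:

        use strict;
        use warnings;
        use constant LEN => 5;

        my $s = '1234598765555553456733333';

        my %dispatch = (
            '55555' => sub { return 'x' x length $_[0]; },
        );

        for (my $offset = 0; $offset < length $s; $offset += LEN) {
            # for() aliases $_ to the lvalue returned by substr, so
            # assigning to $_ writes straight back into $s; no copy
            # of the segment outlives the loop body.
            for (substr $s, $offset, LEN) {
                $_ = exists $dispatch{$_} ? $dispatch{$_}->($_) : $_ + 2;
            }
        }

        print "'$s'\n";   # '1234798767xxxxx3456933335'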

Re^2: Performance problems on splitting long strings
by Laurent_R (Canon) on Jan 31, 2014 at 01:10 UTC
    Hi Dave, thank you for your input. The truth of the matter is that I am the victim of earlier poor design. The long string that I am talking about is a list of billing services for a client, and every billing service code is five characters long. What I need to do with this list (simplifying a bit) is to keep the services that are of use for our purpose and to sort them. The ultimate goal is to compare the two large files after having pre-processed them.

      Sounds like there is going to be a dispatch table in there somewhere to handle what to do with the different billing codes that are of interest. :)

      It helps to remember that the primary goal is to drain the swamp even when you are hip-deep in alligators.
        Well, yes, I am using a data pipeline afterwards, something like this:
        my $field16 = join '|', sort grep { exists $hash{$_} } @subfields;
        but I wanted to keep the splitting separate to start with: I suspected it might be a performance hot spot, and having it as a separate instruction enables finer benchmarking.
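
        Putting the two statements together, a sketch of how the separate split and the pipeline line up (%hash and the sample data here are placeholders, not the real codes):

            use strict;
            use warnings;

            my %hash = map { $_ => 1 } qw(AB123 ZZ999);   # codes of interest (sample values)
            my $services = 'AB123QQQQQZZ999';             # stand-in for one long string

            # Step 1: the split, kept as its own statement so it can be
            # benchmarked separately.
            my @subfields = unpack '(A5)*', $services;

            # Step 2: the pipeline quoted above -- filter, sort, join.
            my $field16 = join '|', sort grep { exists $hash{$_} } @subfields;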