Berislav has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, could you tell me the maximum string length that Perl can handle without needing any special treatment? I'm counting certain elements in relatively long strings (which can go up to 100MB), and I have a Windows executable written by a friend of mine (I think he did the programming in C++) and a Perl script written by me. For relatively short strings the two programs give the same results, but for longer strings the results differ, although the programs should work the same. Moreover, the string lengths reported by both programs are the same; only the number of certain elements in the same strings differs depending on the program used. So either I did something wrong, or my friend did something wrong, and I believe it could be me because my friend is a professional programmer...:-) Btw, I'm using Perl under the Fedora Core distribution. Thank you all very much in advance. Berislav Lisnic.

Re: Maximum string length
by borisz (Canon) on Feb 22, 2006 at 11:35 UTC
    try:
    my $l = do{ use bytes; length $my_string }; print $l;
    There is a difference if your string is a UTF-8 string and you use Perl 5.8.x: length then counts characters, not bytes.
    Boris
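A minimal sketch of the difference mentioned above, assuming a string that contains one character outside the Latin-1 range. The variable names are made up for illustration; only `length` and the `bytes` pragma come from the reply.

```perl
use strict;
use warnings;

# U+263A needs three bytes in UTF-8; the other three characters need one each.
my $s = "\x{263A}abc";

# Character count (what length normally reports).
my $chars = length $s;

# Byte count of the internal UTF-8 representation.
my $bytes = do { use bytes; length $s };

print "$chars characters, $bytes bytes\n";   # prints "4 characters, 6 bytes"
```

For plain ASCII DNA data the two numbers should be identical, so a mismatch here would point at the input encoding rather than at a string length limit.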
Re: Maximum string length
by holli (Abbot) on Feb 22, 2006 at 13:06 UTC
    Theoretically, the maximum length of a string in Perl depends only on how much memory is available. Practically, such big strings are unmanageable. Try running the following script to get an impression:
    $i = 256;
    while (1) {
        print ++$i, "\n";
        system("perl -e \"\$_ = 'x' x ($i*1024*1024)\"");
    }


    holli, /regexed monk/
      why shell out?
      my $i = 1;
      my $x;
      while (1) {
          print $i, "\n";
          $x = 'x' x ($i*1024*1024);
          $i += 10;
      }
      happily eats up all my memory
Re: Maximum string length
by marto (Cardinal) on Feb 22, 2006 at 11:38 UTC
    Hi Berislav,

    I notice that this is your first post, welcome to the Monastery! Please read the PerlMonks FAQ if you have not already done so. I think it would be best if you posted your Perl code and a short example of the input data. That way, people here can point out any problems or suggest enhancements to your code.

    Hope this helps.

    Martin
      Thank you everyone for your replies....:-) In the meantime I've found out where the problem was, but I will post the code, and an example of input data, nevertheless. Please note that the purpose of this code is to test an algorithm, and it's not a final version of the program I have in mind.
      system "clear";
      print "Palindrome - gamma version\n";
      print "--------------------------\n\n";
      print "Please enter DNA filename: ";
      $filename=<STDIN>;
      chomp $filename;
      unless (-e $filename) {
          print "No such file...exiting\n\n";
          exit;
      }
      unless (open(DNASEQ, $filename)) {
          print "Cannot open file...exiting\n\n";
          exit;
      }
      @dna=<DNASEQ>;
      $dna=join('', @dna);
      $dna=~ s/\s//g;
      $count_of_2=0;
      for ($lb=0; $lb<length $dna; ++$lb) {
          $lba=substr ($dna, $lb, 1);
          $rba=substr ($dna, $lb+1, 1);
          $rba=~ tr/atgc/tacg/;
          if ($lba eq $rba) {
              ++$count_of_2;
          }
          else {
          }
      }
      print "Number of 2bp palindromes: ", $count_of_2, "\n";
      exit;
      Input data are files that contain DNA sequences arranged in the following format:
      CGACAGCTACGATCGTAC
      CAGTATCATCACTACGTA
      CACGAGAGTACGATCGAC
      ......etc.........

      The program should work with both lower- and uppercase sequences, but I forgot to add something like
      $dna=~ tr/ATGC/atgc/;

      so when I loaded an uppercase DNA sequence it just didn't do the job right. Now the two programs give the same results; I've tested them with sequences containing up to 29160000 characters.
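The fix described above can be sketched as a small function. This is only an illustration under assumptions: the name `count_2bp` is hypothetical, and it folds case with `lc` instead of the `tr/ATGC/atgc/` mentioned in the post (for pure ACGT input the two behave the same).

```perl
use strict;
use warnings;

# Count adjacent base pairs where the left base equals the complement
# of the right base (at, ta, cg, gc), ignoring case and whitespace.
sub count_2bp {
    my ($dna) = @_;
    $dna =~ s/\s+//g;     # drop newlines and spaces
    $dna = lc $dna;       # fold case so uppercase input is handled too

    my $count = 0;
    # Stop one short of the end: the last base has no right neighbour.
    for my $i (0 .. length($dna) - 2) {
        my $left = substr($dna, $i, 1);
        (my $right = substr($dna, $i + 1, 1)) =~ tr/atgc/tacg/;
        ++$count if $left eq $right;
    }
    return $count;
}

print count_2bp("ATGC\natgc\n"), "\n";   # prints 4
```

In the example the whitespace-stripped string is "atgcatgc", and the matching pairs are at, gc, at, gc.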
        Out of curiosity: how long is the C source code your friend wrote? And ... have you ever looked at bioperl?


        holli, /regexed monk/
Re: Maximum string length
by turo (Friar) on Feb 22, 2006 at 19:39 UTC

    hi Berislav,
    I think you have 2 problems in your code:

    1. In for ($lb=0; $lb<length $dna; ++$lb), the loop bound should be length($dna) - 1: the last element has nothing after it to compare with (Perl accepts this, though).
    2. The 'tr' only converts between lowercase letters. The example you gave us has uppercase letters, so the result isn't the same.

    I've retouched your script to read the file line by line, so the problem of the large string will not affect you:
    #!/usr/bin/perl -w
    use strict;

    system "clear";
    print "Palindrome - gamma version\n";
    print "--------------------------\n\n";
    print "Please enter DNA filename: ";
    my $filename=<STDIN>;
    chomp $filename;
    die "No such file...exiting\n\n" unless (-e $filename);
    open(DNASEQ, $filename) or die "Cannot open file...exiting\n\n";
    my $last_protein;
    my $count_of_2=0;
    while (<DNASEQ>) {
        chomp;
        my ($lba,$rba)= ($last_protein, undef);
        for (my $lb = 0; $lb < (length) - 1; ++$lb) {
            $lba = substr ($_, $lb, 1);
            $rba = substr ($_, $lb+1, 1);
            $rba =~ tr/atgcATGC/tacgTACG/;
            ++$count_of_2 if ($lba eq $rba);
        }
        $last_protein = $rba;
    }
    print "Number of 2bp palindromes: ", $count_of_2, "\n";
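One thing to watch when reading line by line: a pair whose two bases sit on either side of a line break is only counted if the last base of the previous line is carried into the next one. The following is a rough sketch of that idea, not the script above; the function name `count_2bp_streaming` and the in-memory test data are assumptions for illustration.

```perl
use strict;
use warnings;

# Count 2bp palindromes (at, ta, cg, gc) from a filehandle without
# holding the whole file in memory, including pairs that span lines.
sub count_2bp_streaming {
    my ($fh) = @_;
    my $count = 0;
    my $carry = '';                        # last base of the previous line
    while (my $line = <$fh>) {
        $line =~ tr/atgcATGC//cd;          # keep only the bases
        next if $line eq '';
        my $chunk = lc($carry . $line);    # prepend the carried base
        # Lookaheads let overlapping pairs be counted individually.
        $count++ while $chunk =~ /a(?=t)|t(?=a)|c(?=g)|g(?=c)/g;
        $carry = substr($chunk, -1);
    }
    return $count;
}

# Usage with an in-memory filehandle standing in for a DNA file.
my $data = "ATG\nCAT\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
print count_2bp_streaming($fh), "\n";      # prints 3 (at, gc, at)
close $fh;
```

Counting each line in isolation would give 2 for this input, because the "gc" pair straddles the line break.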

    hope that helps :-)

    perl -Te 'print map { chr((ord)-((10,20,2,7)[$i++])) } split //,"turo"'
Re: Maximum string length
by Anonymous Monk on Feb 22, 2006 at 19:58 UTC
    A big problem I see in your program is that you are reading the whole file into an array first and then are assigning that to a scalar so you have TWO copies of the whole file in memory. It would be more memory efficient just to read the file directly into a scalar.
    use warnings;
    use strict;

    system 'clear';
    print "Palindrome - gamma version\n";
    print "--------------------------\n\n";
    print 'Please enter DNA filename: ';
    chomp( my $filename = <STDIN> );
    open DNASEQ, $filename or die "Cannot open $filename: $!";
    ( my $dna = do { local $/; <DNASEQ> } ) =~ tr/atcgATCG//cd;
    my $count_of_2 = () = $dna =~ /a(?=t)|t(?=a)|c(?=g)|g(?=c)/ig;
    print "Number of 2bp palindromes: ", $count_of_2, "\n";
    __END__
    See if that produces the same result as your friend's C++ code.