Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi fellow Monks!
I have two lines, like the following:
$str1='---DAAAGLRG--G--G-P-LT-I--A--PG----A-----T----LG---G-YG-------- +--------------------------------SVT---------------------------------- +---------------------------------------------------G-------NV-T------ +NN---G----TI----SVANALPSLASSLPGDFRIF--------------------------------- +--------------------------GTLTNAGVVELRGRVVGN--G-LA-V-S------------G-- +------N---Y---VGQN----------------------GAVN-------------MN-TT------- +--L--AG--D----------------------------------------------------------- +--------------------------------------------------------------------- +--------------------------------------------------------------------- +----------------------------------------------------G---------------- +-----A-------PS-------D-TL-LI---------------GGVPA-VATAS---------G---- +K--------T----T---------L-------------------------------------------- +--------------N-----VTNVGG---------------AGAL------------------------ +------------------------------------------TK-SDGI---------RL-VY------ +----------AVNFA-N---------T-------------------G---N-A--F--TLAG----GTV +S--AG---------------------------------------------------------------- +---------AYSYY--------------LV--KGGV-T------------------------------- +--------------------------------------------------------------------- +--------------------------------------------------------------------- +--------------------------------------------------------------------- +--------------------------------------------------------------------- +--------------------------------------------------------------------- +--------------------A-----------------LTG---------EDWYLR-S----------- +---------------------------------------------------TVPPR-P-DQ---P---- +--------------------------------------------------------------------- +--------------------------------------------------------------------- +--------------------------------------------------------------------- +--------------------------------------------------------------------- 
+--------------------------------------------------------------------- +--------------------------------------------------------------------- +--------------------------------------------------------------------- +-T-QQ--PPF----------------------------------------------------------- +---------------------S--V-A---DG-TP-ES--I-----------V---------------- +--E--AV-K---N----------------A--AP-DA-------------------------------- +--------------------------------------------------------K-------PEP-- +------------------------------------------------------------V-------- +----------YR--------------------------------------------------------- +----------PEV--PL-YS-----------EVP----------------------------------- +--------------------------------------------------------------------- +---------------------------------------------A--VARQ----------------- +-----LG---L-L------------Q--------IDT-F-H----------------DRQ-------G- +EQG--LL-----AEN-G-S-------------------------------------------------- +--------------------------------------------------------------------- +--------------------------------------------------------------------- +------VP----VSWSRVW-----------GGY---SN------IKQ-NG------------------- +-------DVTPSY--DGTVW-----G--MQVGQ---DLY-----ADNRP-------SGHRNHYGFF--- +-LGF------SR--AIGDVNGFA--------------------------------------LAQPDL-- +------GVGSLQVN-A-Y-N----L--G--G-YWT-----------------------------H---- +IGPG--------------GWYTDA--------------------------VV--MGS-V--LT---V-- +RTHSN-------------------------------N------NVSGS--T-D--GNA--VTGS-V--E +AGV--P--I------------SL------G-YG----------L--------------T----L----- +----E-PQA-QLLW-QWLS-LA--RFND------G-------V-------------------------- +--------SDV----T--W-----NN-GNTFLGR----IG-ARL--------QY-----AFDAN----- +-GVSWK--------------------PYLRVNVLR--S--FG-S--DD----------RTT-----FG- +----GS----TT------------------------IG-TQ-VG-------Q--T--AGQIGA-GL-VA +-Q--LT-KR----GSVYA--T--V--S---Y---------LT-NL-----GG----E----H----QR- 
+---T---I--T---GNAGVRW--'; $str2='XXXXXXXXXXX..X..X.X.XX.X..X..XX....X.....X....XX...X.XX........ +................................XXX.................................. +...................................................X.......XX.X...... +XX...X....XX....XXXXX................................................ +.........................................XXX..X.XX.X.X............X.. +......X...X...XXXX......................XXXX.............XX.XX....... +..X..XX..X........................................................... +..................................................................... +..................................................................... +....................................................X................ +.....X.......XX.......X.XX.XX...............XXX......XX.........X.... +X........X....X.........X............................................ +..............X.....XXXXXX................XXX........................ +..........................................XX.XXXX.........XX.X....... +...........XXXX.X..X....X.X...................X...X.X..X..XXX......XX +X..XX................................................................ +.........XXXXX..............XX..XXXX.X............................... +..................................................................... +..................................................................... +..................................................................... +..................................................................... +..................................................................... +....................X..............XXXXXX.........XXXXXX.X........... +...................................................XXXXX.X.XX...X.... +..................................................................... +..................................................................... +..................................................................... 
+..................................................................... +..................................................................... +..................................................................... +..................................................................... +.X.XX..XXX.........XX..X...X......................................... +.....................X..X.X...XX.XX.XX..X...........X................ +..X..XX.X...X................X..XX.XX................................ +........................................................X.......XXX.. +............................................................X........ +...X..X.X.XX......................................................... +..........XXX..XX.XX...........XXX................................... +..................................................................... +.............................................X..XXXX................. +.....XX...X.XX.........XXX........XXX.X.X................XXX.......X. +XXX..XX......XX.X.X.................................................. +..................................................................... +..................................................................... +......XX....XIIIIII...........III...II.......XXXX.................... +........XXXXX..XXXXX.....X..XXXXX...XXX.....XXX............XXXXXXX... +.XXX......XX..XXXXXX.............................................X... +........XXXXXX.X.X.X....X..X..X.XXX.............................X.... +XXXX...............XXXXX..........................XX..XXX.X..XX...X.. +XXXXXX.XX..XX......................XX......XXXXX..X.X..XXX..XXXX.X..X +XXX..X..X............XX......X..X..........X..............X....X..... +....X.XXX.XXXX.XXXX.XX..XXXX......X.......XX....X.................... +X.X.....XXX....X..X.....XX.XXXXXXX....XX.XXX........XX.....XXXXX..... +.XXXXX....................XXXXXXXXX..X..XX.X..XX....XX...XXXX.....XX. 
+....XX....XX............X....XX....XXX.XX.XX.......X..X..XXXXXX.XX.XX +.X..XX.XX....XXXXX..X..X..X...X.........X...X......X....X....X....XX. +...X...X..X...XXXXXXXXX';

The goal is, for each position in $str1 that is a -, to erase the corresponding position in $str2. The desired output would then be:
DAAAGLRGGGPLTIAPGATLGGYGSVTGNVTNNGTISVANALPSLASSLPGDFRIFGTLTNAGVVELRGR +VVGNGLAVSGNYVGQNGAVNMNTTLAGDGAPSDTLLIGGVPAVATASGKTTLNVTNVGGAGALTKSDGI +RLVYAVNFANTGNAFTLAGGTVSAGAYSYYLVKGGVTALTGEDWYLRSTVPPRPDQPTQQPPFSVADGT +PESIVEAVKNAAPDAKPEPVYRPEVPLYSEVPAVARQLGLLQIDTFHDRQGEQGLLAENGSVPVSWSRV +WGGYSNIKQNGDVTPSYDGTVWGMQVGQDLYADNRPSGHRNHYGFFLGFSRAIGDVNGFALAQPDLGVG +SLQVNAYNLGGYWTHIGPGGWYTDAVVMGSVLTVRTHSN------NNVSGSTDGNAVTGSVEAGVPISL +GYGLTLEPQAQLLWQWLSLARFNDGV----SDVTWNNGNTFLGRIGARLQYAFDANGVSWKPYLRVNVL +RSFGSDDRTTFGGSTTIGTQVGQTAGQIGAGLVAQLTKRGSVYATVSYLTNLGGEHQRTITGNAGVRW XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX............................. +.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX.....XXXXXXXXXXXXXX.XXXXXXXXX +XXX..XXXXXXXXXXXXX..XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX +XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX..XXXXIIIII +IIIIIIXXXX..XXXXXXXXXXXXXXXXXXXXXX.....XXXXXXXXXXXXXXXXXX.......X...X +XXXXXXXXXXXXXXXXXXX.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX +X.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX +XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX..X.X.XXXXXXXXXXXXX

My approach would be to split $str1 and $str2 and then, for each position in $str1 that is a -, erase the corresponding position in $str2.
The problem is that I have a very large file of such cases, and I reckon split would be rather slow.
Is there a faster way, maybe?

Re: How to make this substitutions without splitting the strings?
by AnomalousMonk (Archbishop) on Jul 28, 2014 at 10:42 UTC

    (NB: It's not really necessary to post a reeeeealy looooong string to convince us you're dealing with long strings. We're prepared to take your word. Shorter example strings would seem to have been sufficient.)

    Insofar as I understand your problem, this is the classic, idiomatic Perlish solution. (I don't know if you really need your  $str1 string transformed also; I do it essentially to support the example.) This approach assumes that  $s1 and  $s2 are always the same length, and also that the null character  \x00 is never a valid part of the  $s1 or  $s2 string.

    c:\@Work\Perl\monks>perl -wMstrict -le
    "my $s1 = '---DAAAGLRG--G--G-P-LT-IGGY';
     my $s2 = '...XXXXXXXX..X..X.X.XX.X..X';
     print qq{'$s1'};
     print qq{'$s2'};
     ;;
     (my $mask = $s1) =~ tr{-\x00-,.-\xff}{\x00\xff};
     my $t1 = $s1 & $mask;
     my $t2 = $s2 & $mask;
     $t1 =~ tr{\x00}''d;
     $t2 =~ tr{\x00}''d;
     ;;
     print qq{'$t1'};
     print qq{'$t2'};
    "
    '---DAAAGLRG--G--G-P-LT-IGGY'
    '...XXXXXXXX..X..X.X.XX.X..X'
    'DAAAGLRGGGPLTIGGY'
    'XXXXXXXXXXXXXX..X'

    Update: Replaced  s/// with  tr/// in masked character removal step:  $t1 =~ tr{\x00}''d; instead of  $t1 =~ s{ \x00+ }''xmsg;

      Thank you very much, it works like a charm...
      Can you please tell me what does this line:
      (my $mask = $s1) =~ tr{-\x00-,.-\xff}{\x00\xff};
      do? Also, which is the null character?
      Thanks again!
        ... what does this line:

        (my $mask = $s1) =~ tr{-\x00-,.-\xff}{\x00\xff};

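        It copies the string  $s1 to the new lexical  $mask and then transliterates the copy in place with the  tr/// operator, leaving  $s1 itself unchanged: every  - becomes  \x00 and every other character becomes  \xff (the last replacement character is repeated to cover the rest of the search list). If you have Perl version 5.14+, you can simplify this statement somewhat by using the  /r modifier: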
            my $mask = $s1 =~ tr{-\x00-,.-\xff}{\x00\xff}r;
        See  tr/// in the Quote-Like Operators section in perlop.

        ... which is the null character?

        It is the  \0 character (or byte), hex 0x00, octal 000 (also decimal 0). (Update: Along with its cousin 0xff, it's very useful for creating bit-masks for strings.)
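For readers following along, here is a minimal sketch of the same mask technique wrapped in a sub. The name strip_gaps and the short test strings are illustrative, not from the thread. It assumes both strings have the same length, never contain \x00, and hold only byte-sized characters.

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption): remove every position
# where $s1 holds '-' from both $s1 and $s2.
sub strip_gaps {
    my ($s1, $s2) = @_;
    die "strings differ in length\n" unless length($s1) == length($s2);

    # Build the mask: '-' => \x00, every other byte => \xff.
    (my $mask = $s1) =~ tr{-\x00-,.-\xff}{\x00\xff};

    # Bitwise string AND: bytes under \xff survive, bytes under \x00 become \x00.
    my $t1 = $s1 & $mask;
    my $t2 = $s2 & $mask;

    # Delete the zeroed positions.
    tr/\x00//d for $t1, $t2;

    return ($t1, $t2);
}

my ($t1, $t2) = strip_gaps('--AB-C', 'xyXX.X');
print "$t1\n$t2\n";    # ABC, then XXX
```

One caveat, as far as I know: under the bitwise feature (enabled by use v5.28 or later) the plain & operator treats its operands as numbers, and the string form is spelled &. instead, so this sketch deliberately sticks to use strict and use warnings.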

Re: How to make this substitutions without splitting the strings? (compute not reckon)
by Anonymous Monk on Jul 28, 2014 at 09:37 UTC
Re: How to make this substitutions without splitting the strings?
by Anonymous Monk on Jul 28, 2014 at 09:47 UTC

    You don't need to split into an array to access positions in a string by index, you can use substr (or index to search for the indices of "-", or even a regex and pos). Also, you can output a new file with only the characters you want instead of manipulating a large string.

    Show us some of your code and we can help you with that...

    (I dimly remember seeing a similar question on PerlMonks recently... have you Googled your question?)

      Remember to process the strings backwards so the 'erase' does not change the position of characters yet to be processed.
      Bill

        Note this only applies if you don't follow the parent's recommendation of generating new strings.
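The two suggestions in this sub-thread can be sketched roughly as follows. The sub names erase_in_place and copy_kept are made up for illustration, and both assume the two strings have equal length. The first edits a copy of the second string with four-argument substr, walking backwards as advised; the second builds a new string from the runs of non-gap characters found by a regex.

```perl
use strict;
use warnings;

# Hypothetical helper: delete from $s2 every position where $s1 holds '-'.
# Walking from the end means a deletion never shifts an index that is
# still to be visited.
sub erase_in_place {
    my ($s1, $s2) = @_;
    for (my $i = length($s1) - 1; $i >= 0; $i--) {
        substr($s2, $i, 1, '') if substr($s1, $i, 1) eq '-';
    }
    return $s2;
}

# Hypothetical helper: build a new string instead, copying from $s2 each
# run of positions where $s1 has something other than '-'.
sub copy_kept {
    my ($s1, $s2) = @_;
    my $out = '';
    while ($s1 =~ /[^-]+/g) {
        # $-[0] and $+[0] are the start and end offsets of the last match.
        $out .= substr($s2, $-[0], $+[0] - $-[0]);
    }
    return $out;
}

print erase_in_place('--AB-C', 'xyXX.X'), "\n";    # XXX
print copy_kept('--AB-C', 'xyXX.X'), "\n";         # XXX
```

Building a new string avoids the ordering concern entirely, since the source strings are only ever read.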

Re: How to make this substitutions without splitting the strings?
by Anonymous Monk on Jul 28, 2014 at 11:53 UTC
    Here's a reasonably fast one.
    #include <stdio.h>
    #include <stdlib.h>
    #include <errno.h>

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "Usage: %s <input_file> <comparison_file>\n", argv[0]);
            exit(1);
        }

        FILE *in_file = fopen(argv[1], "r");
        if (in_file == NULL) {
            perror(argv[1]);
            exit(1);
        }

        FILE *cmp_file = fopen(argv[2], "r");
        if (cmp_file == NULL) {
            perror(argv[2]);
            exit(1);
        }

        for (;;) {
            char cmp = fgetc(cmp_file);
            char inp = fgetc(in_file);
            if (cmp == '-') {
                continue;
            }
            if (cmp == EOF || inp == EOF) {
                break;
            }
            putchar(inp);
        }

        fclose(in_file);
        fclose(cmp_file);
        exit(0);
    }
    Compile:
    gcc -O3 tr_big_string.c -o tr_big_string
    Usage:
    ./tr_big_string ./big_string_file ./comparison_file

      For the serendipitous explorer: please do observe in the above example code the oft-perpetrated C pitfall of forcing the fgetc() result into a char. The return value of fgetc() is an int; this allows an end-of-file or error condition to be signaled via an out-of-band value (EOF).

Re: How to make this substitutions without splitting the strings?
by Anonymous Monk on Jul 28, 2014 at 23:21 UTC

    AnomalousMonk pretty much nailed it. The string-based approach appears to be about 7.6 times faster than unpack/pack (61949/s versus 8128/s).

              Rate unpack string
    unpack  8128/s     --   -87%
    string 61949/s   662%     --
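The benchmarked code is not shown above, so the following is only a guess at what such a comparison might look like, using the core Benchmark module. The sub names by_string and by_unpack, the test data, and the unpack/pack variant itself are all assumptions; the rates it prints will differ from the table above.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Assumed test data: the short example strings repeated to get some length.
my $s1 = '---DAAAGLRG--G--G-P-LT-IGGY' x 200;
my $s2 = '...XXXXXXXX..X..X.X.XX.X..X' x 200;

# String-mask version, following AnomalousMonk's approach.
sub by_string {
    my ($p, $q) = @_;
    (my $mask = $p) =~ tr{-\x00-,.-\xff}{\x00\xff};
    $q &= $mask;           # zero the positions where $p has '-'
    $q =~ tr/\x00//d;      # then delete them
    return $q;
}

# One possible unpack/pack version (an assumption, not the poster's code).
sub by_unpack {
    my ($p, $q) = @_;
    my $dash = ord '-';
    my @p = unpack 'C*', $p;
    my @q = unpack 'C*', $q;
    return pack 'C*', @q[ grep { $p[$_] != $dash } 0 .. $#p ];
}

# Sanity check before timing: both versions must agree.
die "results differ\n" unless by_string($s1, $s2) eq by_unpack($s1, $s2);

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese(-1, {
    string => sub { by_string($s1, $s2) },
    unpack => sub { by_unpack($s1, $s2) },
});
```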
    

      (my $p = shift) =~ tr/-\0-\377/\0\377/;

      In the  tr/-\0-\377/\0\377/ expression, the '-' (hyphen) character appears twice in the search list: initially, and also within the  \0-\377 range. In tests I did with some Win32 Perls in the range 5.8 to 5.14, the test code

      c:\@Work\Perl\monks>perl -wMstrict -le
      "my $s = 'XXXooX';
       (my $t = $s) =~ tr/XoX/ab/;
       print qq{'$t'};
      "
      'aaabba'
      (and identically for  tr/X\x00-\xff/ab/) always produced the same result: the leftmost occurrence of a character in the search list is selected for matching to and replacement by the corresponding character in the replacement list.

      I considered using the much neater  tr/-\0-\377/\0\377/ version, but I couldn't find anything in the docs to guarantee the behavior shown in my tests must always prevail. Despite the tests, I didn't feel comfortable using an "undocumented feature". Do you know of any documentation of this "leftmost match" feature in the  tr/// built-in? In a regex, the rule would be "leftmost longest match", but  tr/// isn't really a regex, it's a transliterator — isn't it?

        Do you know of any documentation of this "leftmost match" feature in the tr/// built-in?
        From perlop:

        "If multiple transliterations are given for a character, only the first one is used:

        tr/AAA/XYZ/

        will transliterate any A to X."
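A small sketch of what that documented rule means for the shorter mask expression. The sub name make_mask and the sample string are illustrative only. Because the leading  - is listed before the  \0-\377 range, its mapping to  \0 wins, and every other byte falls through to  \377.

```perl
use strict;
use warnings;

# Hypothetical helper: '-' => \x00, every other byte => \xff.
# The leading '-' is a literal hyphen; \0-\377 is the full byte range.
sub make_mask {
    my ($s) = @_;
    (my $mask = $s) =~ tr/-\0-\377/\0\377/;
    return $mask;
}

# Same test as earlier in the thread: the first X mapping (to a) is used.
(my $t = 'XXXooX') =~ tr/XoX/ab/;
print "$t\n";                                      # aaabba

# Show the mask bytes in hex.
print unpack('H*', make_mask('A-B--C')), "\n";     # ff00ff0000ff
```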