Vkhaw has asked for the wisdom of the Perl Monks concerning the following question:

I am new to Perl and have written a script that splits each line of file 1, then uses the resulting pattern to search and paste into file 2. However, a series of numbers shows up in my output that does not belong to either file 1 or file 2. Could you please help me find the bug? It seems that a series of digits (1-9) is being added after the search pattern. This is urgent, as I need the output for data analysis.

#!/usr/bin/perl
#Copy Die2 as Output file
use File::Copy;
copy ("Die2.txt","CombineDie1Die2.txt") or die "copy failed: $!";
#Open Die1 Input file
open (Label, "Die1.txt") or die "can't open Die1: $!";
#Search and replace using 1 liner command
while (<Label>) {
    $replace = $_;
    chomp ($replace);
    @temp = split (/,/, $replace);
    $search = $temp[0].",".$temp[1];
    $command_line = "perl -pi\.bak -e s\/".$search."\/".$replace."\/g\; CombineDie1Die2.txt";
    system ($command_line);
}

input file 1
WL,BL,Die1
WL0,BL0,1708
WL0,BL1,1708
WL0,BL2,1708
WL0,BL3,1931
WL0,BL4,1931
WL0,BL5,1931
WL0,BL6,1931
WL0,BL7,1931
WL0,BL8,1708
WL0,BL9,1931
WL0,BL10,1708
WL0,BL11,1708
WL0,BL12,1708
WL0,BL13,1931
WL0,BL14,1931
WL0,BL15,1931
WL0,BL16,1931
WL0,BL17,2153
WL0,BL18,2153

input file 2
WL,BL,Die 2
WL0,BL0,1708
WL0,BL1,1708
WL0,BL2,1708
WL0,BL3,1931
WL0,BL4,1931
WL0,BL5,1931
WL0,BL6,1931
WL0,BL7,1931
WL0,BL8,1708
WL0,BL9,1931
WL0,BL10,1708
WL0,BL11,1708
WL0,BL12,1708
WL0,BL13,1931
WL0,BL14,1931
WL0,BL15,1931
WL0,BL16,1931
WL0,BL17,2153
WL0,BL18,2153

Error output:
(Example: at line 12, "WL0,BL1,17080,1708", the 3rd value had an additional "0" added to it. After that, the added number increments on each following line in this area.)
WL0,BL1,17081,1708 <-- 1708 instead of 17081
WL0,BL1,17082,1708 <-- 1708 instead of 17082
WL0,BL1,17083,1931 <-- 1708 instead of 17083
==========================
Exact Error output file
WL,BL,Die1,Die 2
WL0,BL0,1708,1708
WL0,BL1,1708,1708
WL0,BL2,1708,1708
WL0,BL3,1931,1931
WL0,BL4,1931,1931
WL0,BL5,1931,1931
WL0,BL6,1931,1931
WL0,BL7,1931,1931
WL0,BL8,1708,1708
WL0,BL9,1931,1931
WL0,BL1,17080,1708
WL0,BL1,17081,1708
WL0,BL1,17082,1708
WL0,BL1,17083,1931
WL0,BL1,17084,1931
WL0,BL1,17085,1931
WL0,BL1,17086,1931
WL0,BL1,17087,2153
WL0,BL1,17088,2153

Re: Unknown numbers show up
by Athanasius (Archbishop) on Apr 12, 2015 at 14:48 UTC

    Hello Vkhaw, and welcome to the Monastery!

    For each line in the output file, you are running a series of substitutions. For example, for this line:

    WL0,BL10,1708

    you run:

    command line = 'perl -pi.bak -e s/WL0,BL1/WL0,BL1,1708/g; CombineDie1Die2.txt'

    and later

    command line = 'perl -pi.bak -e s/WL0,BL10/WL0,BL10,1708/g; CombineDie1Die2.txt'

    and you expect the second substitution to be applied to the line. But the first substitution finds a match, and so replaces it:

    WL0,BL10,1708
    *******
    s/WL0,BL1/WL0,BL1,1708/g

    (The match is marked by asterisks.) This results in:

    WL0,BL1,17080,1708

    and the second match (the one you want) is never applied.

    One way to fix this problem would be to re-order the matches so that the longer matches occur first. Another way would be to add a look-ahead assertion to match a comma:

    s/WL0,BL1(?=,)/WL0,BL1,1708/g;
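
    To see the effect in isolation, here is a minimal sketch that applies the same substitution to a single line, with and without the look-ahead. The helper name apply_sub and its optional $guard argument are invented for this illustration, and \Q...\E is added so that the key is matched literally.

    ```perl
    use strict;
    use warnings;

    # Replace every occurrence of $search in $line with $replace.
    # $guard is an optional regex fragment appended after the literal key.
    sub apply_sub {
        my ($line, $search, $replace, $guard) = @_;
        $guard //= '';
        $line =~ s/\Q$search\E$guard/$replace/g;
        return $line;
    }

    my $line = 'WL0,BL10,1708';

    # Without a guard, the key WL0,BL1 matches the front of WL0,BL10.
    print apply_sub($line, 'WL0,BL1', 'WL0,BL1,1708'), "\n";           # prints WL0,BL1,17080,1708

    # With (?=,) the key must be followed by a comma, so BL10 is left alone.
    print apply_sub($line, 'WL0,BL1', 'WL0,BL1,1708', '(?=,)'), "\n";  # prints WL0,BL10,1708
    ```

    The look-ahead checks for the comma without consuming it, so the replacement text does not need to put the comma back.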

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      Dear Athanasius,

      I would like to thank you for the detailed explanation of why the extra numbers show up.

      However, I still do not really know how to solve the issue. As I am dealing with 2 files of 8 million lines each, I have to use your second proposal, the assertion that matches the comma.

      As this is my first script, I am not really sure how to edit this particular code:  $command_line = "perl -pi\.bak -e s\/".$search."\/".$replace."\/g\; CombineDie1Die2.txt";

      into what you mentioned:  s/WL0,BL1(?=,)/WL0,BL1,1708/g;

      I mean, how and where exactly do I insert this (?=,)?

      Regards, Vkhaw

        Like this:

        my $command_line = 'perl -pi.bak -e s/' . $search . '(?=,)/' . $replace . '/g; CombineDie1Die2.txt';
        #                                           Add this ^^^^^

        (See “Look-Around Assertions” in perlre#Extended-Patterns.)

        By the way, there is another problem with your code: the command switch -pi.bak makes a backup of the target file each time a substitution is applied. This means that CombineDie1Die2.txt.bak ends up almost the same as CombineDie1Die2.txt (only the final substitution is omitted). This is almost certainly not what you want. It provides yet another reason to follow flexvault’s advice and remove the nested calls to Perl one-liners from within your script. Much better to re-cast the logic and use Perl’s substitution facilities, etc., directly, rather than invoking a new Perl interpreter millions of times!

        Hope that helps,

        Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re: Unknow number show up
by flexvault (Monsignor) on Apr 12, 2015 at 14:40 UTC

    Welcome Vkhaw,

    I downloaded your code and added structures and executed the script. I didn't get your results, so I commented out the 'system' call and added:

    print "$command_line\n";
    This will help you see how you're re-calling Perl.

    Note: you don't need to call Perl in the loop! This adds a lot of memory usage and would be very slow in production. Do the search and save the results to a 3rd file.

    Regards...Ed

    "Well done is better than well said." - Benjamin Franklin

      Dear Ed,

      When I replaced "system" with "print" in my code as you suggested, it stopped working, in the sense that nothing happens to my output file. In the command prompt window it gives me this kind of display:

      perl -pi.bak -e s/WL0,BL195(?=,)/WL0,BL195,1708/g; CombineDie1Die2.txt
      perl -pi.bak -e s/WL0,BL196(?=,)/WL0,BL196,1708/g; CombineDie1Die2.txt
      perl -pi.bak -e s/WL0,BL197(?=,)/WL0,BL197,1931/g; CombineDie1Die2.txt
      perl -pi.bak -e s/WL0,BL198(?=,)/WL0,BL198,1931/g; CombineDie1Die2.txt

      I am facing another problem now: my file 1 has about 8 million lines, and so does file 2. It took around 3~4 hours to process only a thousand lines. I need a solution for this. I would appreciate it if you could help.

        Dear Vkhaw,

        The 'print' was to help you see what your 'system' call was doing, not to fix your problem. You have some excellent suggestions from other posters; integrate their suggestions and then post additional questions.

        My current suggestion is to take some time to better understand your file inputs, i.e.:

        • Are the files sorted in any way? Could they be sorted for performance?
        • Can file2 be loaded into an array in memory? That way you read the file once before the loop and then inside the loop, you read file1 a line at a time...Much faster!
        • Do your 's///' directly without re-calling Perl! Much, much faster!
        And please keep asking questions, that's how we all learned!
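
        The points above could be combined into a single pass along these lines. This is only a sketch under some assumptions: the first two comma-separated fields form the key, file 1 fits in memory as a hash (with 8 million lines that may take a good deal of RAM), and the sub names index_die1, merge_line and combine_files are made up for the example. A hash is used here in place of an array so that each lookup avoids a search.

        ```perl
        use strict;
        use warnings;

        # Read Die1 lines from a filehandle and index each full line by its "WL,BL" key.
        sub index_die1 {
            my ($fh) = @_;
            my %die1;
            while (my $line = <$fh>) {
                chomp $line;
                my ($wl, $bl) = split /,/, $line;
                next unless defined $bl;          # skip blank or malformed lines
                $die1{"$wl,$bl"} = $line;
            }
            return \%die1;
        }

        # Merge one Die2 line: Die1's full line, then Die2's remaining column(s).
        # Lines with no matching key are returned unchanged (minus the newline).
        sub merge_line {
            my ($die1, $line) = @_;
            chomp $line;
            my ($wl, $bl, $rest) = split /,/, $line, 3;
            return $line unless defined $rest && exists $die1->{"$wl,$bl"};
            return $die1->{"$wl,$bl"} . ",$rest";
        }

        # Read each input file once and write the merged lines to a third file.
        sub combine_files {
            my ($die1_file, $die2_file, $out_file) = @_;
            open my $in1, '<', $die1_file or die "can't open $die1_file: $!";
            my $die1 = index_die1($in1);
            close $in1;
            open my $in2, '<', $die2_file or die "can't open $die2_file: $!";
            open my $out, '>', $out_file  or die "can't write $out_file: $!";
            while (my $line = <$in2>) {
                print {$out} merge_line($die1, $line), "\n";
            }
            close $out or die "can't close $out_file: $!";
            close $in2;
        }

        # Only runs when three file names are given on the command line.
        combine_files(@ARGV) if @ARGV == 3;
        ```

        It could be run as: perl combine.pl Die1.txt Die2.txt CombineDie1Die2.txt (combine.pl being a hypothetical script name). Because the key is looked up whole, WL0,BL1 can no longer match the front of WL0,BL10, and no extra Perl interpreter or .bak file is involved.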

        Regards...Ed

        "Well done is better than well said." - Benjamin Franklin

Re: Unknow number show up
by ww (Archbishop) on Apr 12, 2015 at 15:53 UTC

    Extending the replies above:

    re Re: Unknown numbers show up: An alternative to Athanasius' look-ahead assertion would be to insert an anchor in $search (which would also require you to use a "quote-like" operator, qr/ ... /;)
              ...thus:  $search = qr/$temp[0],$temp[1]$/;
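
    A quick way to check how an anchor behaves on this data is to try it on a few sample lines. In the sketch below, $ matches only at the end of the line, so it finds the key when nothing follows it; for three-column lines a word boundary \b is shown as one possible alternative. The \b variant, the \Q...\E quoting and the helper name matches are assumptions for illustration, not part of the suggestion above.

    ```perl
    use strict;
    use warnings;

    my @temp = ('WL0', 'BL1');
    my $key  = join ',', @temp;

    my $end_anchor = qr/\Q$key\E$/;     # key must be the last thing on the line
    my $word_bound = qr/\Q$key\E\b/;    # key must not run on into another digit

    # Return 1 if $line matches $re, 0 otherwise.
    sub matches {
        my ($re, $line) = @_;
        return $line =~ $re ? 1 : 0;
    }

    # Show both results for a bare key, a matching line and a longer key.
    for my $line ('WL0,BL1', 'WL0,BL1,1708', 'WL0,BL10,1708') {
        printf "%-15s end-anchor=%d word-boundary=%d\n",
            $line, matches($end_anchor, $line), matches($word_bound, $line);
    }
    ```

    Which anchor fits depends on whether the lines being searched contain anything after the second field.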

    You might also want to read perldoc perlretut and/or perldoc perlre for a better understanding of the impact your quotes, dots, etc have on Ln 18,

       $command_line = "perl -pi\.bak -e s\/".$search."\/".$replace."\/g\; CombineDie1Die2.txt";

    And, re flexvault's note: definitely, don't call another instance of Perl to execute your  s/// and especially, don't call it in a loop. Do rewrite it so the substitution is executed without system..... That's left as an exercise for the Seeker.
    I'm not sure that writing to a third file is necessary, though... but that may well be merely my blindspot.

    #!/usr/bin/perl
    use 5.010;
    # 1123214
    #Copy Die2 as Output file
    use File::Copy;
    copy ("Die2.txt","CombineDie1Die2.txt") or die "copy failed: $!";
    #Open Die1 Input file
    open (Label, "Die1.txt") or die "can't open Die1: $!";
    #Search and replace using 1 liner command
    while (<Label>) {
        $replace = $_;
        chomp ($replace);
        @temp = split (/,/, $replace);
        $search = qr/$temp[0],$temp[1]$/;    # using an anchor
        print "\t search is $search";
        $command_line = "perl -pi\.bak -e s\/$search/$replace/g\; CombineDie1Die2.txt";
        # NB the removal of numerous dots, quotes and backslashes
        system ($command_line);
    }