Vkhaw has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have written a Perl script to post-process the data in two files and write the result to an output file. However, each file contains over 8 million lines, and my script is not effective at all: after several hours it has only reached a few thousand lines. I am seeking help here; could anyone fix my script in a better way?

The logic basically takes the first two columns of file 1 and matches them against file 2. If they match, the third column of file 1 is inserted into file 2. Also, file 1 and file 2 are already sorted identically (that means line 1 in file 1 corresponds to line 1 in file 2, so no searching is actually needed).

File 1 example:

WL,BL,Die1
WL0,BL0,1708
WL0,BL1,1708
WL0,BL2,1708
WL0,BL3,1931
WL0,BL4,1931

File 2 example:

WL,BL,Die 2
WL0,BL0,1708
WL0,BL1,1931
WL0,BL2,1708
WL0,BL3,1931
WL0,BL3,1708

Output after script:

WL,BL,Die1, Die2
WL0,BL0,1708,1708
WL0,BL1,1708,1931
WL0,BL2,1708,1708
WL0,BL3,1931,1931
WL0,BL4,1931,1708

My script:

#!/usr/bin/perl
# Copy Die2 as the output file
use File::Copy;
copy("Die2_10k.txt", "CombineDie1Die2.txt") or die "copy failed: $!";

# Open the Die1 input file
open(Label, "Die1_10k.txt") or die "can't open Die1: $!";

# Search and replace using a one-liner command per input line
while (<Label>) {
    $replace = $_;
    chomp($replace);
    @temp = split(/,/, $replace);
    $search = $temp[0] . "," . $temp[1];
    $command_line = "perl -pi\.bak -e s\/" . $search . "(?=,)\/" . $replace . "\/g\; CombineDie1Die2.txt";
    system($command_line);
}

I would be grateful if someone could provide an effective way to process these two 8-million-line files with a script for the above purpose. Thank you.

Re: Solve the large file size issue
by GrandFather (Saint) on Apr 14, 2015 at 01:22 UTC

    Spawning another copy of Perl to handle a search and replace feels completely wrong. Since your files are "in sync" you can simply read them in parallel and generate the output as you go. The following code uses a trick described in "I know what I mean. Why don't you?" to make a self-contained example. You should replace the open statements for production code.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $File1 = <<FILE;
    WL,BL,Die1
    WL0,BL0,1708
    WL0,BL1,1708
    WL0,BL2,1708
    WL0,BL3,1931
    WL0,BL4,1931
    FILE

    my $File2 = <<FILE;
    WL,BL,Die 2
    WL0,BL0,1708
    WL0,BL1,1931
    WL0,BL2,1708
    WL0,BL3,1931
    WL0,BL3,1708
    FILE

    open my $in1, '<', \$File1;
    open my $in2, '<', \$File2;

    while (my $line1 = <$in1>) {
        chomp $line1;
        defined(my $line2 = <$in2>) or last;
        print $line1, ',', (split ',', $line2)[-1];
    }

    Prints:

    WL,BL,Die1,Die 2
    WL0,BL0,1708,1708
    WL0,BL1,1708,1931
    WL0,BL2,1708,1708
    WL0,BL3,1931,1931
    WL0,BL4,1931,1708
    Perl is the programming world's equivalent of English
      Thank you for the brief guide. Since I am handling two input files, I want to save the output to another file. Could you guide me on how to modify your script for this requirement?

      At the same time, could you explain what this part of the code means: "(my $line2 = <$in2>) or last"?

        You should really be able to figure out the I/O changes for yourself but, to ensure you follow best practice, change:

        open my $in1, '<', \$File1;

        to:

        open my $in1, '<', $file1 or die "Can't open '$file1': $!\n";

        where $file1 contains the path to input file 1. Do the same for $file2. Add:

        open my $out, '>', $fileOut or die "Can't create '$fileOut': $!\n";

        and change the print statement to:

        print $out $line1, ',', (split ',', $line2)[-1];

        (my $line2 = <$in2>) reads a line from $in2 and assigns it to the variable $line2. or last; exits the loop if $in2 contains fewer lines than $in1.
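
        Putting those pieces together, a minimal sketch of the production version (the file names come from the original post; adjust the paths for your environment):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # File names taken from the original post; adjust as needed.
        my $file1   = 'Die1_10k.txt';
        my $file2   = 'Die2_10k.txt';
        my $fileOut = 'CombineDie1Die2.txt';

        open my $in1, '<', $file1   or die "Can't open '$file1': $!\n";
        open my $in2, '<', $file2   or die "Can't open '$file2': $!\n";
        open my $out, '>', $fileOut or die "Can't create '$fileOut': $!\n";

        while (my $line1 = <$in1>) {
            chomp $line1;
            # Stop if file 2 runs out of lines before file 1 does.
            defined(my $line2 = <$in2>) or last;
            # Append the last column of file 2's line (it keeps its newline).
            print $out $line1, ',', (split ',', $line2)[-1];
        }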

        Perl is the programming world's equivalent of English
Re: Solve the large file size issue
by ww (Archbishop) on Apr 14, 2015 at 11:20 UTC
    1. Please use a separate code block to post each datum or script.
    2. See answers in #1123214 for prior advice (like GrandFather's) AGAINST spawning another copy of Perl to do the substitution.
    3. If you don't know how to convert what appears to be a cargo-culted one-liner (s///) into line(s) of inline code, read perldoc perlretut and perldoc perlre at your console.
    4. What precisely happens after your code "only reach thousand line" (sic)? How do you know it doesn't|won't continue? What messages appear? Where (did you check your error logs?)

    Questions containing the words "doesn't work" (or their moral equivalent) will usually get a downvote from me unless accompanied by:
    1. code
    2. verbatim error and/or warning messages
    3. a coherent explanation of what "doesn't work" actually means.
Re: Solve the large file size issue
by Corion (Patriarch) on Apr 14, 2015 at 11:23 UTC

    When converting one-liners to Perl code, there are two ways. The first is to convert the one-liner to a full Perl program:

    perl -MO=Deparse -pi.bak -e 's/$search(?=,)/$replace/g'
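
    which prints (roughly, depending on your Perl version) the full program the one-liner implies:

    BEGIN { $^I = ".bak"; }
    LINE: while (defined($_ = readline ARGV)) {
        s/$search(?=,)/$replace/g;
    }
    continue {
        die "-p destination: $!\n" unless print $_;
    }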

    The alternative is to use File::Inplace, which implements the -p loop as a Perl module.
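
    A minimal sketch of the File::Inplace route, assuming its documented new/next_line/replace_line/commit interface, with $search and $replace standing in for values built per line as in the original script:

    use strict;
    use warnings;
    use File::Inplace;

    # Assumed example values; the original script builds these per input line.
    my ($search, $replace) = ('WL0,BL0', 'WL0,BL0,1708');

    # Edit the file in place, keeping a .bak backup, as -pi.bak would.
    my $editor = File::Inplace->new(file => 'CombineDie1Die2.txt', suffix => '.bak');
    while (my ($line) = $editor->next_line) {
        $line =~ s/\Q$search\E(?=,)/$replace/g;
        $editor->replace_line($line);
    }
    $editor->commit;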

Re: Solve the large file size issue
by soonix (Chancellor) on Apr 15, 2015 at 11:44 UTC
    probably a bit late, but to throw a database in (DBD::CSV):
    #!/usr/bin/env perl
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:CSV:', undef, undef,
        {f_dir => '/tmp', f_ext => '.txt'}) or die $DBI::errstr;
    my $select = $dbh->do('CREATE TABLE CombineDie1Die2 AS
        SELECT WL, BL, Die1, Die_2 as Die2
        from Die1_10k a, Die2_10k b
        Where a.WL = b.WL and a.BL = b.BL') or die "Failed to execute - " . $dbh->errstr;
    • The part "Die_2 as Die2" is because of the space in your File 2's field name
    • This is an "inner join", so WL,BL combinations are skipped if they exist in only one of the files
    • for syntax see → SQL::Statement::Syntax (as http://sqlmonks.org seems to be dormant since its inception) :-)
    Post Scriptum: OK, this example is convoluted, but the general advantage of SQL is: you can concentrate more on what you want to do with the data instead of how.
      Hi soonix, I don't understand most of the syntax here. Does this mean you are still in Perl but calling SQL-style statements? Also, I tried to copy your code and run it at the command prompt, but I only get one line of output: wl,bl,die1,die_2. Could you guide me on the coding side if I wish to modify this script to open two input files named "Die1.txt" and "Die2.txt", process all 8 million lines from both files, and write the output to "CombineDie1Die2.txt"? I would appreciate it if you could explain the logic too.

        Well, we stay in Perl but use an additional language (SQL); the disadvantage, obviously, is that we now have two languages. However, I find this better suited for some applications (though not for everything).

        The parameters to connect carry the information needed to interface your environment (plus Perl, of course) with the "database".
        The parameters in my example say that the tables are .txt files, so the OP's file "Die1_10k.txt" is a table "Die1_10k" (my example looks for it in the /tmp folder/directory). Different mappings of file names to table names are possible; this is described in → DBD::CSV.
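
        For example, a file that doesn't follow the default mapping can be attached to a table name via the csv_tables attribute described in DBD::CSV (a small sketch; the table name "die1" is just an illustration):

        # Map the SQL table name "die1" to an arbitrarily named file.
        $dbh->{csv_tables}{die1} = { f_file => 'Die1.txt' };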

        Possible difficulties you may encounter:

        • SQL is partially case-insensitive; your file system may not be (most Linux/Unix file systems are case-sensitive). I prefer all-lowercase names because they cause the fewest problems.
        • Spaces and other characters in file names: what makes a valid file name in your file system may not make a valid table name in SQL.
        • Similar considerations apply to field/column names (DBD::CSV tries to get the field names from the first row, but you can instruct it otherwise).
        I should have written the SQL a bit tidier, e.g.

        CREATE TABLE CombineDie1Die2 AS
          SELECT WL, BL, Die1, Die_2 as Die2
          FROM Die1_10k a, Die2_10k b
          WHERE a.WL = b.WL AND a.BL = b.BL
        so that it is more readable.

        The OP's data contains column headings, with one column named "Die 2" (with a space in between). DBD::CSV sees this as Die_2 (with an underscore), but we want it to be simply "Die2". If your column names are different (or your data doesn't have a header line), you have to decide where to adjust: the first line of your data, the info in the connect statement, or the names in the SQL statement itself, with some "renaming" as I did with that field and the tables (a and b).

        What the program printed for you was the heading only, meaning that no data was SELECTed: either your data does not match (see what I wrote about the "inner join" at the end of my previous posting), or your columns are named differently (with the effect that they don't match)...

        Update: to find out more, you could try out a simple SQL statement such as:
        use strict;
        use warnings;
        use Data::Dumper;
        use DBI;

        my $dbh = DBI->connect('dbi:CSV:', undef, undef, {
            f_dir => '/tmp',   # where your files live
            f_ext => '.txt',
        }) or die $DBI::errstr;
        my $select = $dbh->prepare('SELECT * FROM Die1_10k')
            or die "Failed to prepare - " . $dbh->errstr;
        $select->execute or die "Failed to execute - " . $select->errstr;
        print Dumper $select->{NAME}, $select->fetchrow_hashref();
        This prints out your column names, and the first row of data (or undef if it doesn't find data).

        Update 2: I don't have really big CSV files, so I don't know how well this will work out, but 2 files × 8 million lines × 12 bytes (let's assume some overhead, so 128 bytes per line) comes to about 2 GB; depending on how "modern" your computer is, it might fail.

        Anyway, you did not yet tell us whether
        - your data is sorted
        - your data is "in sync" (as assumed in GrandFather's example)
        The database solution would work for unsorted data as well, but if your data is sorted, that is an advantage for the other solutions.

Re: Solve the large file size issue
by sundialsvc4 (Abbot) on Apr 14, 2015 at 22:05 UTC

    It superficially appears to me that what you are setting out to do here is simply "a merge." If you know that you have two files which are sorted by an identical key, you can write very efficient logic to process the two files. Or, quite likely, you can find an already-existing CPAN module that does this. (Sort::Merge and File::MergeSort both look interesting.)

    If you want to "code your own" solution, here's how I presented a solution to my COBOL classes, all those years ago (utterly ignoring the textbook's complicated examples). Use a state-machine approach: first, figure out what state you're in, then do the right thing. In a "two files" scenario, there are the following states:

    1. Initial state: nothing has been read from either file yet.
    2. Final state: STOP RUN.
    3. You have reached the end of both files. (Therefore, switch to the final state.)
    4. You have reached the end of file #1 but not file #2.
    5. You have reached the end of file #2 but not file #1.
    6. You have records from both files, and the keys are identical.
    7. ... and the key from file #1 is smaller.
    8. ... and the key from file #1 is larger.

    (koff, koff ...) Interesting stuff for a late-night community college class, yes, and nice because it can easily be extended to deal with any number of input files. But otherwise, this is "a thing already done." This sort of data-processing has (literally ...) been done, and done very well, since the days of Herman Hollerith. Grab an existing, off-the-shelf CPAN module and use it.

    The computer should positively scream through a "mere" 8 million lines, since it only has to make one sequential pass through the file(s) to produce the right answer. You should have your solution in, say, "worst case, a second or so..."
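
    For illustration, a minimal sketch of that state-machine merge in Perl, joining on the first two CSV columns (the file names are the OP's; the OP's in-sync files don't strictly need this, as GrandFather's parallel read shows):

    #!/usr/bin/perl
    use strict;
    use warnings;

    open my $in1, '<', 'Die1_10k.txt' or die "Can't open Die1: $!";
    open my $in2, '<', 'Die2_10k.txt' or die "Can't open Die2: $!";

    # Read one record: returns (key, whole line, last field), or () at EOF.
    sub next_rec {
        my ($fh) = @_;
        defined(my $line = <$fh>) or return;
        chomp $line;
        my @f = split /,/, $line;
        return (join(',', @f[0, 1]), $line, $f[-1]);
    }

    my ($k1, $l1)      = next_rec($in1);
    my ($k2, $l2, $v2) = next_rec($in2);

    # Both files must be sorted by the same string ordering of the key.
    while (defined $k1 && defined $k2) {
        if ($k1 eq $k2) {               # keys identical: emit the joined row
            print "$l1,$v2\n";
            ($k1, $l1)      = next_rec($in1);
            ($k2, $l2, $v2) = next_rec($in2);
        }
        elsif ($k1 lt $k2) {            # key only in file 1: skip it
            ($k1, $l1) = next_rec($in1);
        }
        else {                          # key only in file 2: skip it
            ($k2, $l2, $v2) = next_rec($in2);
        }
    }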