in reply to Merge and sort large data

Hi barbar,

I can't make any sense of this post. What code now works? Perhaps you should take a look at the PerlMonks FAQ and How do I post a question effectively?.

Update: I checked CB60 shortly after posting, and found no chat relating to this post.

Martin

Re^2: Merge and sort large data
by planetscape (Chancellor) on Jul 23, 2007 at 11:12 UTC

    barbar's post looked like this when I approved it:


    I have been struggling with sorting and merging data for some time. I managed to find a script by roboticus on PerlMonks (node_id=596095), which said:

    mergefile.1
    15 20 foo
    22 30 bar
    30 33 baz
    14 22 fubar

    mergefile.2
    alpha baz 17.30
    gamma foobar 22.35
    gamma bar 19.01
    delta fromish 33.03
    sigma bear 14.56

    mergefile.out
    bar 22 30 gamma 19.01
    baz 30 33 alpha 17.30
    bear null null sigma 14.56
    foo 15 20 null null
    foobar null null gamma 22.35
    fromish null null delta 33.03
    fubar 14 22 null null

    The code used to merge mergefile.1 and mergefile.2 to create mergefile.out is below

    #!/usr/bin/perl -w
    use strict;
    use warnings;

    open F1, 'sort -k3 mergefile.1|' or die "opening file 1";
    open F2, 'sort -k2 mergefile.2|' or die "opening file 2";
    open OUF, '>', 'mergefile.out' or die "opening output file";

    my @in1;
    my @in2;

    sub getrec1 {
        @in1 = ();
        if (!eof(F1)) {
            (@in1) = split /\t/, <F1>;
            chomp $in1[2];
        }
    }

    sub getrec2 {
        @in2 = ();
        if (!eof(F2)) {
            (@in2) = split /\t/, <F2>;
            chomp $in2[2];
        }
    }

    sub write1 {
        print OUF "$in1[2]\t$in1[0]\t$in1[1]\tnull\tnull\n";
        getrec1;
    }

    sub write2 {
        print OUF "$in2[1]\tnull\tnull\t$in2[0]\t$in2[2]\n";
        getrec2;
    }

    sub writeboth {
        print OUF "$in1[2]\t$in1[0]\t$in1[1]\t$in2[0]\t$in2[2]\n";
        getrec1;
        getrec2;
    }

    # Prime the pump
    getrec1;
    getrec2;

    while (1) {
        last if $#in1<0 and $#in2<0;

        if ($#in1<0 or $#in2<0) {
            # Only one file is left...
            write2 if $#in1<0;
            write1 if $#in2<0;
        }
        elsif ($in1[2] eq $in2[1]) {
            # Matching records, merge & write 'em
            writeboth;
        }
        elsif ($in1[2] lt $in2[1]) {
            # unmatched item in file 1, write it & get next rec
            write1;
        }
        else {
            # unmatched item in file 2, write it & get next rec
            write2;
        }
    }

    My question is - how can I get this to work.??

    I have these files saved in a directory and when I try to run this from Unix command it errors: "Input file specified two times."

    If I run this in Korn Shell it comes up with the warning: "sort: last character not record delimiter" and the output is:

    bar null null gamma 19.01
    bar 22 30 null null
    baz null null alpha 17.30
    baz 30 33 null null
    bear null null sigma 14.56
    foo 15 20 null null
    foobar null null gamma 22.35
    fromish null null delta 33.03
    fubar 14 22 null null

    Which shows that the code is not working.

    I altered the line "chomp $in2[2];" to read "chomp $in2[1];"

    I changed the delimiters in the files to be commas, and changed the script from \t to ,

    I really think this script will be able to solve my problem of sorting and merging large files, if only I could get it to work and understand why it was not working in the first place.

    Please can anyone help me by giving me any pointers?????


    (I happened to have left the browser tab open and that is how I produced this copy.)

    Note to barbar: Please don't go in and completely alter a node once posted. It messes with people's heads. And mine is definitely messed up enough already. Thank you.

    HTH,

    planetscape
Re^2: Merge and sort large data
by roboticus (Chancellor) on Jul 24, 2007 at 13:12 UTC

    marto:

    He was referring to an old post of mine. I found a /msg from him on CB when I logged in. Lucky for me, the issue was resolved by the time I saw the question. 8^)

    ...roboticus

Re^2: Merge and sort large data
by barbar (Initiate) on Jul 23, 2007 at 10:58 UTC
    Sorry I have been unclear. The following code now works:
    #!/usr/bin/perl -w
    use strict;
    use warnings;

    open F1, 'sort -k3 mergefile1|' or die "opening file 1";
    open F2, 'sort -k2 mergefile2|' or die "opening file 2";
    open OUF, '>', 'mergefile.out' or die "opening output file";

    my @in1;
    my @in2;

    sub getrec1 {
        @in1 = ();
        if (!eof(F1)) {
            (@in1) = split /\t/, <F1>;
            chomp $in1[2];
        }
    }

    sub getrec2 {
        @in2 = ();
        if (!eof(F2)) {
            (@in2) = split /\t/, <F2>;
            chomp $in2[1];
        }
    }

    sub write1 {
        print OUF "write1 $in1[2]\t$in1[0]\t$in1[1]\tnull\tnull\n";
        getrec1;
    }

    sub write2 {
        print OUF "write2 $in2[1]\tnull\tnull\t$in2[0]\t$in2[2]\n";
        getrec2;
    }

    sub writeboth {
        print OUF "writeboth $in1[2]\t$in1[0]\t$in1[1]\t$in2[0]\t$in2[2]\n";
        getrec1;
        getrec2;
    }

    # Prime the pump
    getrec1;
    getrec2;

    while (1) {
        last if $#in1<0 and $#in2<0;

        if ($#in1<0 or $#in2<0) {
            # Only one file is left...
            write2 if $#in1<0;
            write1 if $#in2<0;
        }
        elsif ($in1[2] eq $in2[1]) {
            # Matching records, merge & write 'em
            writeboth;
        }
        elsif ($in1[2] lt $in2[1]) {
            # unmatched item in file 1, write it & get next rec
            write1;
        }
        else {
            # unmatched item in file 2, write it & get next rec
            write2;
        }
    }
    using the file mergefile1 below
    15 20 foo
    22 30 bar
    30 33 baz
    14 22 fubar
    And the file mergefile2 below
    alpha baz 17.30
    gamma foobar 22.35
    gamma bar 19.01
    delta fromish 33.03
    sigma bear 14.56
    Can anyone tell me why this only works on Korn shell and not in command? Thank you

      I'm not sure why this "only works on Korn shell and not in command", but there are two possibilities:

      Traditionally, Unix shells perform wildcard expansion (globbing) on command-line arguments, while command.com and cmd.exe, the shells on Windows, do not. If you want to perform globbing yourself, use the glob function, best as follows:

      # Near the top of your program
      use File::Glob qw(bsd_glob);  # sane whitespace handling
      my @files = bsd_glob('../INPUT/*');

      The other possibility is that you have two programs called sort.exe and $ENV{PATH} is set differently between your ksh and your cmd.exe. The sort.exe program that comes with Windows is incompatible with the sort.exe program that behaves like the Unixish programs do. The easiest way to make sure that the "right" sort.exe program is invoked is to give the full path, maybe C:\\programs\\cygwin\\usr\\bin\\sort.exe, instead of calling it implicitly.
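
      A minimal sketch of that second suggestion, assuming the Unix-style sort lives at the path shown (substitute your own) and that neither the path nor the file names contain spaces. The helper names sort_pipe_command and open_sorted are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: build the command for a piped open with an explicit
# path to sort, so $ENV{PATH} no longer decides which sort.exe gets run.
# Assumes neither the path nor the file name contains spaces.
sub sort_pipe_command {
    my ($sort_exe, $key_field, $file) = @_;
    return "$sort_exe -k$key_field $file |";
}

# Hypothetical helper: open a read pipe from the chosen sort program,
# failing early with a clear message if the program is not there.
sub open_sorted {
    my ($sort_exe, $key_field, $file) = @_;
    die "no sort program at $sort_exe\n" unless -e $sort_exe;
    open my $fh, sort_pipe_command($sort_exe, $key_field, $file)
        or die "cannot run $sort_exe on $file: $!";
    return $fh;
}

# Assumed location; substitute wherever your Unix-style sort really lives.
my $sort_exe = 'C:\programs\cygwin\usr\bin\sort.exe';

# Show the command that would be handed to open() for each input file.
print sort_pipe_command($sort_exe, 3, 'mergefile1'), "\n";
print sort_pipe_command($sort_exe, 2, 'mergefile2'), "\n";
```

      In the script above, the two piped opens could then become my $f1 = open_sorted($sort_exe, 3, 'mergefile1'); and my $f2 = open_sorted($sort_exe, 2, 'mergefile2');, with F1 and F2 replaced by those handles in getrec1 and getrec2.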