tradersjoe0 has asked for the wisdom of the Perl Monks concerning the following question:

I have a large file which has blocked data (block 1 to block 5, represented by the first character of each record being 1, 2, 3, 4 or 5 to distinguish the block number; block 1 is the start of a block and block 5 is the end of each block). I can have around 20,000 to 21,000 blocks in each file.

1nkndnfd
2nsnskdnsdn
3ddjsjd
4fksjsdj
2jjdjdidjsijs
3ndskcnsdkndskdns
5kdsjdskjdskj
1ksdjdjsk
2dsjskj
3djkdjsljs
4fdkkjdskjsk
3saddnhsdhsh
4dsnhdshshsshk
5sadjjdjdodjs

Likewise, I can have thousands of blocks. I need to split this large file into smaller 4 files by doing a round robin of each block ending with block 5. The end of each block is represented by a block 5. Any help is much appreciated.

Replies are listed 'Best First'.
Re: Splitting a Blocked file in Round Robin into smaller files
by Corion (Patriarch) on Dec 14, 2015 at 15:41 UTC

    You use very confusing terminology, mixing "record", "line" and "block". Assuming that you consider "unit of work" boundary to be between a "block" 5 and a "block" 1, then why not simply use seek to seek to a position roughly ($file_size / $number_of_files) * $this_file and read forward until you've encountered one "block" 5? After that, the current set of "unit of work" starts.
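
    A minimal sketch of the seek idea, assuming one record per line and that a line starting with '5' ends a unit of work. The helper name chunk_starts is hypothetical. Note that this yields contiguous chunks of roughly equal size, not a round-robin distribution:

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: return the byte offsets at which each of $parts
    # chunks starts, each one placed right after a record starting with '5'.
    sub chunk_starts {
        my ( $path, $parts ) = @_;
        my $size = -s $path;
        open my $in, '<', $path or die "Cannot open $path: $!";
        my @starts = (0);
        for my $i ( 1 .. $parts - 1 ) {
            my $guess = int( $size / $parts * $i );
            seek( $in, $guess, 0 ) or die "seek failed: $!";
            my $partial = <$in>;    # discard the line we landed in
            while ( my $line = <$in> ) {
                last if $line =~ /^5/;    # end-of-block record found
            }
            my $pos = tell $in;
            # Skip offsets that hit end of file or repeat an earlier start.
            push @starts, $pos if $pos > $starts[-1] && $pos < $size;
        }
        close $in;
        return @starts;
    }
    ```

    Each worker would then seek to its own offset and read until the next offset (or end of file). Fewer than $parts offsets can come back when the file is small.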

      I am using the code below, but it's not working exactly how I would like it to.
      #!/usr/bin/env perl
      use strict;
      use warnings;

      my $num_files_to_write = 4;
      use Data::Dumper;

      my @filehandles;
      for my $id ( 1 .. $num_files_to_write ) {
          open( my $fh, '>', "file_$id.txt" ) or die $!;
          push @filehandles, $fh;
      }

      local $/ = '5';
      while (<>) {
          select $filehandles[ $. % $num_files_to_write ];
          print;
      }

      foreach my $fh (@filehandles) {
          close($fh);
      }

        So, how does your program fail for you?

Re: Splitting a Blocked file in Round Robin into smaller files
by BrowserUk (Patriarch) on Dec 14, 2015 at 15:35 UTC

    Are you saying that you want all the records starting with '1' in one file, all those starting with '2' in a second file, all those starting with '3' in a third, and so on?

    If not, you'll need to clarify your explanation because it is very confused.

    E.g. What does "I need to split this large file into smaller 4 files by doing a round robin of each block(block 1 to 5)" mean?

    Did you typo? Should that be "5 smaller files"?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.

      I am sorry for the confusion. Let's say the actual file has 5 blocks as shown below (the end of each block is marked by a record whose first character is '5'). Now, I will be dividing this actual file into 4 smaller files.

      Step 1: Go through the file, find the first record whose first character is '5', and copy everything up to and including it to the first file.

      Step 2: Copy the next block, up to the next record whose first character is '5', into the second file, and so on until the entire file is divided into 4 smaller files in round-robin fashion.

      Actual File:
      1this is block 1
      2this is block 1
      3this is block 1
      4this is block 1
      2this is block 1
      3this is block 1
      5this is block 1
      1this is block 2
      2this is block 2
      3this is block 2
      2this is block 2
      3this is block 2
      5this is block 2
      1this is block 3
      2this is block 3
      5this is block 3
      1this is block 4
      2this is block 4
      5this is block 4
      1this is block 5
      3this is block 5
      5this is block 5

      File 1:
      1this is block 1
      2this is block 1
      3this is block 1
      4this is block 1
      2this is block 1
      3this is block 1
      5this is block 1
      1this is block 5
      3this is block 5
      5this is block 5

      File 2:
      1this is block 2
      2this is block 2
      3this is block 2
      2this is block 2
      3this is block 2
      5this is block 2

      File 3:
      1this is block 3
      2this is block 3
      5this is block 3

      File 4:
      1this is block 4
      2this is block 4
      5this is block 4

        Much better explanation, thank you. Try this:

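        #! perl -sw
        use strict;

        my $file = $ARGV[0];
        open I, '<', $file or die $!;

        # Open the four output files once, as "$file.1" .. "$file.4".
        my @outs;
        open $outs[ $_ ], '>', "$file.$_" or die $! for 1 .. 4;

        my $out = 1;
        while( <I> ) {
            print { $outs[ $out ] } $_;
            # A record starting with '5' ends the block: move on to the next file.
            if( /^5/ ) {
                ++$out;
                $out = 1 if $out > 4;
            }
        }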

        Call it as scriptname filename. The 4 output files will be named filename.1, filename.2, filename.3 and filename.4.
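
        The same rotation can be sketched without any file I/O, which makes it easy to check against the example in the question. The helper name round_robin_blocks is hypothetical, and the sample records are a shortened version of the five-block example:

        ```perl
        use strict;
        use warnings;

        # Hypothetical helper: distribute records into $n buckets, moving to
        # the next bucket after each record that starts with '5'.
        sub round_robin_blocks {
            my ( $records, $n ) = @_;
            my @buckets = map { [] } 1 .. $n;
            my $current = 0;
            for my $record (@$records) {
                push @{ $buckets[$current] }, $record;
                $current = ( $current + 1 ) % $n if $record =~ /^5/;
            }
            return \@buckets;
        }

        # Five blocks, each made of a '1', a '2' and a '5' record.
        my @records;
        for my $block ( 1 .. 5 ) {
            push @records, map { "${_}this is block $block" } 1, 2, 5;
        }

        my $buckets = round_robin_blocks( \@records, 4 );
        printf "file %d: %d records\n", $_ + 1, scalar @{ $buckets->[$_] } for 0 .. 3;
        ```

        With five blocks and four buckets, the first bucket receives blocks 1 and 5, matching the File 1 listing in the question.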


Re: Splitting a Blocked file in Round Robin into smaller files
by james28909 (Deacon) on Dec 14, 2015 at 16:12 UTC
    I'm not quite sure what you want to end up with, but try this:

    use strict;
    use warnings;

    my @array = qw(
        1nkndnfd
        2nsnskdnsdn
        3ddjsjd
        4fksjsdj
        5kdsjdskjdskj
        1ksdjdjsk
        hg
        2dsjskj
        3djkdjsljs
        4fdkkjdskjsk
        5sadjjdjdodjs
        6sadjjdjdodjs
    );

    foreach (@array) {
        # Split each record into its leading digit and the rest of the data.
        my ( $num, $data ) = /(\d)(.*)/;
        next if ( !length $num || !length $data || $num !~ /[1-5]/ );
        # Append to one file per leading digit: round_robin_1.txt .. round_robin_5.txt.
        open my $file, '+>>', "round_robin_$num" . ".txt"
            or die "Cannot open round_robin_$num.txt: $!";
        print $file "$num - $data\n";
    }

    EDIT: Just noticed the other replies; it looks like this is NOT what he/she was after.

      Thank you so much for taking time to look into my question

        Hey, no problem man, I do try to help when I can. I at least give ideas if nothing else haha :)
Re: Splitting a Blocked file in Round Robin into smaller files
by KurtSchwind (Chaplain) on Dec 14, 2015 at 16:08 UTC

    I think he considers each 1,2,3,4,5 as a unit and wants 4 files with round-robin units in each.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use POSIX;

    my @outfile = ('file1.txt', 'file2.txt', 'file3.txt', 'file4.txt');
    my $infile  = 'infile.txt';
    my $lineno  = 0;

    open (my $ifh, '<', $infile) or die "Unable to open $infile: $!";
    while (<$ifh>) {
        # Treat every 5 consecutive lines as one unit and rotate through the 4 files.
        open (my $fh, '>>', $outfile[floor($lineno / 5) % 4])
            or die "Unable to open outfile";
        print $fh $_;
        close $fh;
        $lineno++;
    }
    close ($ifh);
    --
    “For the Present is the point at which time touches eternity.” - CS Lewis

      Why open & close the output files for every write?

      Do you know about the variable $.?
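
      For illustration, a sketch that opens each output handle once and uses $. (the line number of the most recently read handle) instead of a hand-maintained counter. It keeps the fixed 5-lines-per-unit assumption of the code above; the helper name split_by_line_count and the in-memory handles are only there for demonstration:

      ```perl
      use strict;
      use warnings;

      # Hypothetical helper: copy lines from $in to the handles in @$outs,
      # switching to the next handle every $unit lines.
      sub split_by_line_count {
          my ( $in, $outs, $unit ) = @_;
          my $count = @$outs;
          while ( my $line = <$in> ) {
              # $. is 1-based, so subtract 1 before dividing into units.
              my $index = int( ( $. - 1 ) / $unit ) % $count;
              print { $outs->[$index] } $line;
          }
          return;
      }

      # Demonstration with in-memory handles instead of real files.
      my $input = join '', map { "line $_\n" } 1 .. 12;
      open my $in, '<', \$input or die "Cannot open input: $!";

      my @buffers = ( '', '' );
      my @outs;
      for my $i ( 0 .. 1 ) {
          open $outs[$i], '>', \$buffers[$i] or die "Cannot open buffer $i: $!";
      }

      split_by_line_count( $in, \@outs, 5 );
      close $_ for @outs;
      close $in;
      ```

      Each output handle is opened and closed exactly once, however many lines are written.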



        I'm in the habit of forcing a flush.

        Would the flush happen regardless? If so, I've picked up something new.
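
        Perl normally flushes a handle's buffer when the handle is closed and at normal program exit, so reopening the file for every write is not needed just to get the data out. If a flush after every print is wanted, the autoflush method from IO::Handle is one option. A small sketch, using a temporary file purely for demonstration:

        ```perl
        use strict;
        use warnings;
        use IO::Handle;                 # provides the autoflush method
        use File::Temp qw(tempfile);

        my ( $out, $path ) = tempfile();    # temporary file, demo only
        $out->autoflush(1);                 # flush after every print to $out
        print {$out} "first line\n";

        # The handle is still open, yet the data has already been written out.
        my $size_while_open = -s $path;

        close $out or die "close failed: $!";
        unlink $path;
        ```

        Without the autoflush call, the size check before close would usually still see an empty file, because the line would be sitting in the buffer.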
