mmittiga17 has asked for the wisdom of the Perl Monks concerning the following question:

Hi All, I am looking for a way to split a large file in to smaller chunks based on the unique value of field one. Lets say field 1 eq A5273KRU9, there code be 8 lines starting with that and that is considered 1 record. Id like to break the file up into individual file made up of 100 records each. So file1 would have all lines starting with A5273KRU9,A9294WLR3,C63278AX0 and so on.
data A5273KRU9 CA000120060919A5273KRU9 C 00000000 A + A5273KRU9 CA300120080518 211558 USEDT + A5273KRU9 CB0001000000380000 N2 201310201200305282003102020121020FX + A5273KRU9 CB100100038359ANDN 00000001000EUREUREUR 00000 +0000000 A5273KRU9 CB2001A AF N C N 0000000000000000 0000000000000000 000 +00000 A5273KRU9 CB300100000001000000000001000000000001000000000000000 + A5273KRU9 CC4001F S 0000000000000000 0000000000000000 00000000 +00000000 A5273KRU9 CDA001REPUBLIK OESTERREICH + A5273KRU9 CD0001REBUBLIK OESTERREICH EURO$ MTN 144A13 + A5273KRU9 CD1001OUTFBGZ 000000 0 1 4 000010000000EUR00Y + ATAT G Y A5273KRU9 CD20012003052801058231700000 00000099679000EURF0 N + A5273KRU9 CD3001 00105823200000 320060901 + A5273KRU9 CD5001C AP15 2003052800000000 PRIV + N A5273KRU9 CI000100000000000000000000000000000000 00000000000000 +0000380000 A5273KRU9 CI1001A 000 00000000000000 BN 000000000000000000 + A5273KRU9 CR1001 00000000 00000000 00000000 + A5273KRU9 CX0001A5273KRU9 016947857 7742543AT00 +00385992 A5273KRU9 CX4001 + A9294WLR3 CA000120050415A9294WLR3 C 00000000 A + A9294WLR3 CA300120090305 142038 USEST + A9294WLR3 CB0001000000000000 F3 201107013200307012003110100000000 + A9294WLR3 CB100100000000 NUN 00000001000 USD 00000 +0000000 A9294WLR3 CB2001E AF N C N 0000000000000000 0000000000000000 000 +00000 A9294WLR3 CB300100000000000 00000000000 00000001000000000000000 + A9294WLR3 CDA001VORARLBERGER LANDES UND HYPOTHEKENBANK AG VORMALS HY +POTHEKENBA A9294WLR3 CD0001VORARLBERGER LANDES UND HYPOTH RANGE ACRL VAR11 + A9294WLR3 CD1001OUTFBPN 000000 0 3 4 000010000000USD00Y + AUUS A N A9294WLR3 CD20010000000000000000000000 00000000000000 S4 + A9294WLR3 CD3001 00000000000000 00000000 + A9294WLR3 CD5001 0000000000000000 + N A9294WLR3 CI000100000000000000000000000000000000 00000000000000 +0000000000 A9294WLR3 CI1001C 000 00000000000000 DN 000000000000000000 + A9294WLR3 CR1001 00000000 00000000 00000000 + A9294WLR3 CX0001A9294WLR3 + A9294WLR3 CX4001 + C63278AX0 CA000120050425C63278AX0 C 00000000 A + C63278AX0 CA300120090707 100627 USEDT + C63278AX0 CB0001000000512500 N5 201012291200302192003122920091229AFG + C63278AX0 CB100100049261AN N 00000001000CADCADCAD 00000 +0000000 C63278AX0 CB2001A AF N C N 0000000000000000 0000000000000000 000 +00000 C63278AX0 CB300100000001000000000001000000000001000000000000000 + C63278AX0 CC00012003021910000000000000000000000000000000000000002003 +0219 0030A C63278AX0 CC2001 2003021910000000 0000000000000000 0000000000 +000000 C63278AX0 CC4001A M 0000000000000000 0000000000000000 00000000 +00000000 C63278AX0 CDA001PROVINCE OF NEWFOUNDLAND AND LABRADOR + C63278AX0 CD0001PROVINCE OF NEWFOUNDLAND AND NT 5.125%10 + C63278AX0 CD1001CANFB N 000000 0 1 4 000010000000CAD00Y + CACA M N C63278AX0 CD20012003021700020000000000 00000101255000CADF0 N + C63278AX0 CD3001 00020000000000 320050401 + C63278AX0 CD5001C AP15 2003021920030219 B RBBR + N C63278AX0 CI00012009070720091229200912140000000072800000000051250000 +0000512500 C63278AX0 CI1001A 000 00000000000000 BN 000000000000000000 + C63278AX0 CR1001 00000000 00000000 00000000 + C63278AX0 CX0001C63278AX0 016207268 7546198XS01 +62072689 C63278AX0 CX4001 + C7631MBG4 CA000120041129C7631MBG4 CI 20090105 A + C7631MBG4 CA300120090210 081019 USEST + C7631MBG4 CB0001000000512500 N5 200901041199808181999010420080104 + C7631MBG4 CB100100000000 N N 00000001000 EUR 00000 +0000000 C7631MBG4 CB2001A AF N C N 0000000000000000 0000000000000000 000 +00000 C7631MBG4 CB300100000000000 00000000000 00000001000000000000000 + C7631MBG4 CDA001PROVINCE OF QUEBEC + C7631MBG4 CD0001PROVINCE OF QUEBEC FR 5.125%010409 + C7631MBG4 CD1001CANFB X 000000 0 1 4 000010000000EUR00Y + CACA M N C7631MBG4 CD20010000000000063911500000 00000000000000 S0 E + C7631MBG4 CD3001 00063911500000 320041101 + C7631MBG4 CD5001 0000000000000000 + N C7631MBG4 CI000100000000000000000000000000000000 00000000000000 +0000512500 C7631MBG4 CI1001A 000 00000000000000 BN 000000000000000000 + C7631MBG4 CR1001 00000000 00000000 00000000 + C7631MBG4 CX0001C7631MBG4 + C7631MBG4 CX4001 + D20658C46 CA000120050707D20658C46 C 00000000 A + D20658C46 CA300120090625 180129 USEDT + D20658C46 CB0001000000325000 N2 201004091200504012006040920090409FX + D20658C46 CB100100031362AN N 00000001000EUREUREUR 00000 +0000000 D20658C46 CB2001A AF N C N 0000000000000000 0000000000000000 000 +00000 D20658C46 CB300100000001000000000001000000000001000000000000000 + D20658C46 CC4001F S 0000000000000000 0000000000000000 00000000 +00000000 D20658C46 CDA001FEDERAL REPUBLIC OF GERMANY + D20658C46 CD0001FEDERAL REPUBLIC OF GERMANY NT-146 3.25%10 + D20658C46 CD1001OUTFBGN 000000 0 1 4 000010000000EUR00Y + DEUS G N D20658C46 CD20012005033000700000000000 00000100520000EURF0 N + D20658C46 CD3001 01700000000000 320090601 + D20658C46 CD5001C AP15 2005040100000000 + N D20658C46 CI000100000000000000000000000000000000 00000000000000 +0000325000 D20658C46 CI1001A 000 00000000000000 BN 000000000000000000 + D20658C46 CR1001 00000000 00000000 00000000 + D20658C46 CX0001D20658C46 021609714 B06Y4T2DE00 +01141463 D20658C46 CX4001 + D20658NG7 CA000120051107D20658NG7 C 00000000 A + D20658NG7 CA300120090625 180129 USEDT + D20658NG7 CB0001000000537500 N2 201001041199910222001010420090104FX + D20658NG7 CB100100054227ANDN 00000001000EUREUREUR 00000 +0000000 D20658NG7 CB2001A AF N C N 0000000000000000 0000000000000000 000 +00000 D20658NG7 CB300100000001000000000001000000000001000000000000000 + D20658NG7 CC4001F S 0000000000000000 0000000000000000 00000000 +00000000 D20658NG7 CDA001FEDERAL REPUBLIC OF GERMANY + D20658NG7 CD0001FEDERAL REPUBLIC OF GERMANY EURO BD 5.375%10 + D20658NG7 CD1001OUTFBGD 000000 0 1 4 000010000000EUR00Y + DEDE G N D20658NG7 CD20011999102202000000000000 00000099610000EURF0 N + D20658NG7 CD3001 00202500000000 320051101 + D20658NG7 CD5001C AP15 1999102200000000 + N D20658NG7 CI000100000000000000000000000000000000 00000000000000 +0000537500 D20658NG7 CI1001A 000 00000000000000 BN 000000000000000000 + D20658NG7 CR1001 00000000 00000000 00000000 + D20658NG7 CX0001D20658NG7 010339677 5809343DE00 +01135135 D20658NG7 CX4001 + D20658SY3 CA000120041118D20658SY3 C 00000000 A + D20658SY3 CA300120060601 120347 USEDT + D20658SY3 CB0001000000500000 N2 201201041200201042003010420110104 + D20658SY3 CB100100000000 N N 00000001000 EUR 00000 +0000000 D20658SY3 CB2001A AF N C N 0000000000000000 0000000000000000 000 +00000 D20658SY3 CB300100000000000 00000000000 00000001000000000000000 + D20658SY3 CDA001FEDERAL REPUBLIC OF GERMANY + D20658SY3 CD0001FEDERAL REPUBLIC OF GERMANY EURO BD 5%12 + D20658SY3 CD1001OUTFBGD 000000 0 1 4 000010000000EUR00Y + DEDE G N D20658SY3 CD20010000000000000000000000 00000000000000 S0 + D20658SY3 CD3001 00000000000000 00000000 + D20658SY3 CD5001 0000000000000000 + N D20658SY3 CI000100000000000000000000000000000000 00000000000000 +0000500000 D20658SY3 CI1001A 000 00000000000000 BN 000000000000000000 + D20658SY3 CR1001 00000000 00000000 00000000 + D20658SY3 CX0001D20658SY3 014105646 7281466DE00 +01135192 D20658SY3 CX4001 + D20658YB6 CA000120060725D20658YB6 C 00000000 A
I get lost trying to figure out how to count unique fields#1 and print + out in chunks This is as far as I get: #!/usr/bin/perl use Data::Dumper; $inFile = shift(@ARGV); open(DATA, $inFile); my %buf; while( my $line = <DATA> ){ my $rec = substr($line, 0,9 ); $buf{$rec} = $line if $rec; print "$rec => $buf{$rec}"; }
Any help will be appreciated.

Replies are listed 'Best First'.
Re: Help spliting file into chunks
by ikegami (Patriarch) on Jul 28, 2009 at 16:18 UTC

    The solutions you are getting don't all do the same thing because there are ambiguities in your question.

    • Do you want 100 id per file or at most 100 lines per file?
    • If the latter, are you ok with splitting ids across multiple files?
    • If not, what should the code do if there are more than 100 lines for one of the ids?
      Thanks for all of your help. If there would never be more than 17-18 lines per Id.
        I suppose that answers the third question, but what about the two more important questions?
Re: Help spliting file into chunks
by Utilitarian (Vicar) on Jul 28, 2009 at 15:17 UTC
    If your file is sorted as shown, you just have to read through it one line @ a time and check for a change of the first field in the record...warning untested but something like the following
    use strict; use warnings; my ($FH, $OUT,$past,$record,$count); open ($FH,"<","$ARGV[0]"); $record=""; $count=0; while ($line=<$FH>){ my($key ,@record)=split(/\s+/, $line); if $past ne $key){ $count++; if ($count == 100){ open ($OUT, ">" $past); print $OUT $record; close $OUT; $record=""; $count=0; } $past=$key; } $record.= $line; } open ($OUT, ">" $past); print $OUT $record; close $OUT;
    Update: Added counter
Re: Help spliting file into chunks
by jwkrahn (Abbot) on Jul 28, 2009 at 15:44 UTC

    Something like this may work (UNTESTED):

    #!/usr/bin/perl use warnings; use strict; $inFile = shift or die "usage: $0 filename\n"; open DATA, '<', $inFile or die "Cannot open '$inFile' $!"; my ( $file_count, %records ); while ( my $line = <DATA> ) { if ( 1 == $. || 100 == keys %records ) { ++$file_count; open OUT, '>', "file$file_count" or die "Cannot open 'file$fil +e_count' $!"; %records = (); } my $key = substr $line, 0, 9; $records{ $key }++; print OUT; }
Re: Help spliting file into chunks
by moritz (Cardinal) on Jul 28, 2009 at 15:22 UTC
    The logic you need is, in pseudo code
    while there's a next line extract the ID check if ID matches the previous ID if not: set previous ID to current ID increment a counter if counter > 100 close current output file open a new output file reset counter write the line to the current output file.

    Try to write that in perl you can almost translate it directly. If you have troubles with a specific step, shows us what you've tried and where your problem is.

      That can create empty files. Fix:

      while there's a next line extract the ID check if ID matches the previous ID if not: set previous ID to current ID increment a counter if counter > 100 close current output file reset counter if output file isn't open open a new output file write the line to the current output file.

      It can also create files with more than 100 records. Fix:

      my $last_id; my @group; my $fh; my $line_counter = 0; my $file_counter = 0; sub output { if ($line_counter + @group > 100) { $fh = undef; $line_counter = 0; } if (!defined($fh)) { my $fn = sprintf('file%04d', $file_counter++); open($fh, '>', $fn) or die("Error create file $fn: $!\n"); } $line_counter += @group; print($fh splice(@group)); } while (<>) { my ($id) = /^(\S+)/; $last_id = $id if !defined($last_id); if ($id eq $last_id) { push @group, $_; } else { output(); } } output() if @group;

      If there's more than 100 record for one id, it'll put them in the same file despite the limit.

      Note that both my code and the parent's pseudocode assume that the records are grouped by id in the input file.

Re: Help spliting file into chunks
by ErwinWinter (Novice) on Jul 28, 2009 at 20:57 UTC
    Hi, this code splits the records into several (here 3) files. Hope you meant this... Erwin
    #!/usr/bin/perl -w use strict; use warnings; # my $inFile = shift(@ARGV); # open(DATA, $inFile); # for easier testing use data from end of file my $recordNameOld = ''; my @buffer = (); my $recordCount = 0; my $MaxRecords = 3; # only 3 for testing ! my $recordFile = 'RecFile000'; while (<DATA>) { # read next available line into $_ /^(\w+)/ or die "-ERROR- line $. no leftbound recordName found !"; my $recordName = $1; if ($recordNameOld ne $recordName) { $recordNameOld = $recordName; if (@buffer) { print join('',@buffer); } @buffer = (); # reset buffer unless ($recordCount % $MaxRecords) { close REC if ($recordCount); $recordFile++; my $file = $recordFile . '.txt'; # add file-type (& pat +h) open(REC,'>',$file) or die "-ERROR- failed to create $file + $!"; select(REC); # now all print are directed to REC $recordCount = 0; } $recordCount++; } push(@buffer,$_); # store current line } print join('',@buffer) if (@buffer); close REC; #------------------- data for testing ----- __DATA__ A5273KRU9 CA000120060919A5273KRU9 C 00000000 A + A5273KRU9 CA300120080518 211558 USEDT + A5273KRU9 CB0001000000380000 N2 201310201200305282003102020121020FX + A5273KRU9 CB100100038359ANDN 00000001000EUREUREUR 00000 +0000000 A5273KRU9 CB2001A AF N C N 0000000000000000 0000000000000000 000 +00000 A5273KRU9 CB300100000001000000000001000000000001000000000000000 + A5273KRU9 CC4001F S 0000000000000000 0000000000000000 00000000 +00000000 A5273KRU9 CDA001REPUBLIK OESTERREICH + A5273KRU9 CD0001REBUBLIK OESTERREICH EURO$ MTN 144A13 + A5273KRU9 CD1001OUTFBGZ 000000 0 1 4 000010000000EUR00Y + ATAT G Y A5273KRU9 CD20012003052801058231700000 00000099679000EURF0 N + A5273KRU9 CD3001 00105823200000 320060901 + A5273KRU9 CD5001C AP15 2003052800000000 PRIV + N A5273KRU9 CI000100000000000000000000000000000000 00000000000000 +0000380000 A5273KRU9 CI1001A 000 00000000000000 BN 000000000000000000 + A5273KRU9 CR1001 00000000 00000000 00000000 + A5273KRU9 CX0001A5273KRU9 016947857 7742543AT00 +00385992 A5273KRU9 CX4001 + A9294WLR3 CA000120050415A9294WLR3 C 00000000 A + A9294WLR3 CA300120090305 142038 USEST + A9294WLR3 CB0001000000000000 F3 201107013200307012003110100000000 + A9294WLR3 CB100100000000 NUN 00000001000 USD 00000 +0000000 A9294WLR3 CB2001E AF N C N 0000000000000000 0000000000000000 000 +00000 A9294WLR3 CB300100000000000 00000000000 00000001000000000000000 + A9294WLR3 CDA001VORARLBERGER LANDES UND HYPOTHEKENBANK AG VORMALS HY +POTHEKENBA A9294WLR3 CD0001VORARLBERGER LANDES UND HYPOTH RANGE ACRL VAR11 + A9294WLR3 CD1001OUTFBPN 000000 0 3 4 000010000000USD00Y + AUUS A N A9294WLR3 CD20010000000000000000000000 00000000000000 S4 + A9294WLR3 CD3001 00000000000000 00000000 + A9294WLR3 CD5001 0000000000000000 + N A9294WLR3 CI000100000000000000000000000000000000 00000000000000 +0000000000 A9294WLR3 CI1001C 000 00000000000000 DN 000000000000000000 + A9294WLR3 CR1001 00000000 00000000 00000000 + A9294WLR3 CX0001A9294WLR3 + A9294WLR3 CX4001 + C63278AX0 CA000120050425C63278AX0 C 00000000 A + C63278AX0 CA300120090707 100627 USEDT + C63278AX0 CB0001000000512500 N5 201012291200302192003122920091229AFG + C63278AX0 CB100100049261AN N 00000001000CADCADCAD 00000 +0000000 C63278AX0 CB2001A AF N C N 0000000000000000 0000000000000000 000 +00000 C63278AX0 CB300100000001000000000001000000000001000000000000000 + C63278AX0 CC00012003021910000000000000000000000000000000000000002003 +0219 0030A C63278AX0 CC2001 2003021910000000 0000000000000000 0000000000 +000000 C63278AX0 CC4001A M 0000000000000000 0000000000000000 00000000 +00000000 C63278AX0 CDA001PROVINCE OF NEWFOUNDLAND AND LABRADOR + C63278AX0 CD0001PROVINCE OF NEWFOUNDLAND AND NT 5.125%10 + C63278AX0 CD1001CANFB N 000000 0 1 4 000010000000CAD00Y + CACA M N C63278AX0 CD20012003021700020000000000 00000101255000CADF0 N + C63278AX0 CD3001 00020000000000 320050401 + C63278AX0 CD5001C AP15 2003021920030219 B RBBR + N C63278AX0 CI00012009070720091229200912140000000072800000000051250000 +0000512500 C63278AX0 CI1001A 000 00000000000000 BN 000000000000000000 + C63278AX0 CR1001 00000000 00000000 00000000 + C63278AX0 CX0001C63278AX0 016207268 7546198XS01 +62072689 C63278AX0 CX4001 + C7631MBG4 CA000120041129C7631MBG4 CI 20090105 A + C7631MBG4 CA300120090210 081019 USEST + C7631MBG4 CB0001000000512500 N5 200901041199808181999010420080104 + C7631MBG4 CB100100000000 N N 00000001000 EUR 00000 +0000000 C7631MBG4 CB2001A AF N C N 0000000000000000 0000000000000000 000 +00000 C7631MBG4 CB300100000000000 00000000000 00000001000000000000000 + C7631MBG4 CDA001PROVINCE OF QUEBEC + C7631MBG4 CD0001PROVINCE OF QUEBEC FR 5.125%010409 + C7631MBG4 CD1001CANFB X 000000 0 1 4 000010000000EUR00Y + CACA M N C7631MBG4 CD20010000000000063911500000 00000000000000 S0 E + C7631MBG4 CD3001 00063911500000 320041101 + C7631MBG4 CD5001 0000000000000000 + N C7631MBG4 CI000100000000000000000000000000000000 00000000000000 +0000512500 C7631MBG4 CI1001A 000 00000000000000 BN 000000000000000000 + C7631MBG4 CR1001 00000000 00000000 00000000 + C7631MBG4 CX0001C7631MBG4 + C7631MBG4 CX4001 + D20658C46 CA000120050707D20658C46 C 00000000 A + D20658C46 CA300120090625 180129 USEDT + D20658C46 CB0001000000325000 N2 201004091200504012006040920090409FX + D20658C46 CB100100031362AN N 00000001000EUREUREUR 00000 +0000000 D20658C46 CB2001A AF N C N 0000000000000000 0000000000000000 000 +00000 D20658C46 CB300100000001000000000001000000000001000000000000000 + D20658C46 CC4001F S 0000000000000000 0000000000000000 00000000 +00000000 D20658C46 CDA001FEDERAL REPUBLIC OF GERMANY + D20658C46 CD0001FEDERAL REPUBLIC OF GERMANY NT-146 3.25%10 + D20658C46 CD1001OUTFBGN 000000 0 1 4 000010000000EUR00Y + DEUS G N D20658C46 CD20012005033000700000000000 00000100520000EURF0 N + D20658C46 CD3001 01700000000000 320090601 + D20658C46 CD5001C AP15 2005040100000000 + N D20658C46 CI000100000000000000000000000000000000 00000000000000 +0000325000 D20658C46 CI1001A 000 00000000000000 BN 000000000000000000 + D20658C46 CR1001 00000000 00000000 00000000 + D20658C46 CX0001D20658C46 021609714 B06Y4T2DE00 +01141463 D20658C46 CX4001 + D20658NG7 CA000120051107D20658NG7 C 00000000 A + D20658NG7 CA300120090625 180129 USEDT + D20658NG7 CB0001000000537500 N2 201001041199910222001010420090104FX + D20658NG7 CB100100054227ANDN 00000001000EUREUREUR 00000 +0000000 D20658NG7 CB2001A AF N C N 0000000000000000 0000000000000000 000 +00000 D20658NG7 CB300100000001000000000001000000000001000000000000000 + D20658NG7 CC4001F S 0000000000000000 0000000000000000 00000000 +00000000 D20658NG7 CDA001FEDERAL REPUBLIC OF GERMANY + D20658NG7 CD0001FEDERAL REPUBLIC OF GERMANY EURO BD 5.375%10 + D20658NG7 CD1001OUTFBGD 000000 0 1 4 000010000000EUR00Y + DEDE G N D20658NG7 CD20011999102202000000000000 00000099610000EURF0 N + D20658NG7 CD3001 00202500000000 320051101 + D20658NG7 CD5001C AP15 1999102200000000 + N D20658NG7 CI000100000000000000000000000000000000 00000000000000 +0000537500 D20658NG7 CI1001A 000 00000000000000 BN 000000000000000000 + D20658NG7 CR1001 00000000 00000000 00000000 + D20658NG7 CX0001D20658NG7 010339677 5809343DE00 +01135135 D20658NG7 CX4001 + D20658SY3 CA000120041118D20658SY3 C 00000000 A + D20658SY3 CA300120060601 120347 USEDT + D20658SY3 CB0001000000500000 N2 201201041200201042003010420110104 + D20658SY3 CB100100000000 N N 00000001000 EUR 00000 +0000000 D20658SY3 CB2001A AF N C N 0000000000000000 0000000000000000 000 +00000 D20658SY3 CB300100000000000 00000000000 00000001000000000000000 + D20658SY3 CDA001FEDERAL REPUBLIC OF GERMANY + D20658SY3 CD0001FEDERAL REPUBLIC OF GERMANY EURO BD 5%12 + D20658SY3 CD1001OUTFBGD 000000 0 1 4 000010000000EUR00Y + DEDE G N D20658SY3 CD20010000000000000000000000 00000000000000 S0 + D20658SY3 CD3001 00000000000000 00000000 + D20658SY3 CD5001 0000000000000000 + N D20658SY3 CI000100000000000000000000000000000000 00000000000000 +0000500000 D20658SY3 CI1001A 000 00000000000000 BN 000000000000000000 + D20658SY3 CR1001 00000000 00000000 00000000 + D20658SY3 CX0001D20658SY3 014105646 7281466DE00 +01135192 D20658SY3 CX4001 + D20658YB6 CA000120060725D20658YB6 C 00000000 A
      Thanks everyone for your help. I appreciate it as well as learned something.