
By "unique lines" I mean lines grouped by the number at the start of each line.
I'm trying to read a data file and write its lines into separate files, one per leading number:

One file with all the "01" lines.
One file with all the "02" lines.
One file with all the "03" lines, and so on.

Sorry for the lack of explanation.
Re^3: Grouping unique lines into a file.
by Not_a_Number (Prior) on Apr 21, 2014 at 17:34 UTC

    So does this do what you want?

    use strict;
    use warnings;

    my %accounts;

    # Group the text of each line under its leading number.
    while ( <DATA> ) {
        push @{ $accounts{$1} }, $2 if /^(\d+)\s+(.+)/;
    }

    # Write one file per number: 01.txt, 02.txt, ...
    for my $k ( keys %accounts ) {
        open my $fh, '>', $k . '.txt'
            or die "Can't open '$k.txt': $!\n";
        print $fh join "\n", @{ $accounts{$k} };
    }

    Update: Modified code slightly.
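
To see what the hash of arrays looks like before anything is written to disk, the grouping step can be isolated into a small function. This is only a sketch: the name `group_by_prefix` is made up for illustration and is not part of the reply above, and the input is three of the sample lines from this thread.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original reply): group lines by their
# leading number. Returns a hash ref mapping each number to an array ref
# holding the rest of each matching line.
sub group_by_prefix {
    my @lines = @_;
    my %groups;
    for my $line (@lines) {
        # Same pattern as the reply: leading digits, whitespace, the rest.
        # Lines that do not match are silently skipped.
        push @{ $groups{$1} }, $2 if $line =~ /^(\d+)\s+(.+)/;
    }
    return \%groups;
}

my $groups = group_by_prefix(
    "01 The quick red fox and dog as test.\n",
    "02 Time flies like an arrow.\n",
    "01 The quick red fox jumped over the lazy brown dog.\n",
);

# Report how many lines ended up under each number.
for my $prefix ( sort keys %$groups ) {
    printf "%s: %d line(s)\n", $prefix, scalar @{ $groups->{$prefix} };
}
```

Note that `$2` holds only the text after the number, so the prefix itself is not written to the output files. All lines stay in memory until the write loop runs.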

Re^3: Grouping unique lines into a file.
by bigj (Monk) on Apr 21, 2014 at 18:20 UTC
    Here's a solution that caches the filehandles (so each file is opened only once instead of being opened and closed repeatedly), writes each line out as it is read (so you don't have to keep all lines in memory)*, and skips duplicate lines:
    #!/usr/bin/perl -w
    use strict;
    use warnings;
    use autodie;    # I'm too lazy to write open ... or die stuff right here

    # cache for file handles
    my %fh   = ();
    my %seen = ();

    while (<DATA>) {
        next unless /^(\d{2})/;    # ignore lines starting with anything else than 2 digits
        next if $seen{$_}++;       # ignore if a line comes again
        unless ($fh{$1}) {
            warn "'$1.txt' already exists" if -e "$1.txt";
            open my $FH, '>>', "$1.txt";
            $fh{$1} = $FH;
        }
        print {$fh{$1}} $_;
    }

    foreach my $FH (values %fh) { close $FH }

    __DATA__
    01 The quick red fox and dog as test.
    02 Time flies like an arrow, fruit flies like a banana.
    02 Time flies like an arrow, fruit flies like a banana.
    03 Now is the time for all good men to come to the aid of their party.
    01 The quick red fox jumped over the lazy brown dog.
    01 The quick red fox jumped over the lazy brown dog.
    02 Time flies like an arrow.
    03 Now is the time for all good men to come to the aid of their party and not going.
    03 Now is the time for all.

    Greetings,
    Janek Schleicher

    *PS: O.K., that's not the whole truth, as the keys of %seen are the lines themselves :-). If that becomes a memory problem, we can replace the keys with a digest of each line, e.g. SHA1:
    ...
    use Digest::SHA1 qw/sha1/;

    # cache for file handles
    my %fh   = ();
    my %seen = ();

    while (<DATA>) {
        next unless /^(\d{2})/;       # ignore lines starting with anything else than 2 digits
        next if $seen{sha1($_)}++;    # ignore if a line comes again
        ....
    ...
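
For anyone who wants to try the digest variant end to end, here is one way the elided parts might be filled in. It is a sketch under some assumptions rather than bigj's actual code: it uses `Digest::SHA` (which ships with Perl and also exports `sha1`) in place of `Digest::SHA1`, the function name `split_unique_lines` and its `$dir` parameter are invented for illustration, and input is read from any filehandle instead of `DATA`.

```perl
use strict;
use warnings;
use autodie;
use Digest::SHA qw(sha1);    # assumption: core module used instead of Digest::SHA1

# Hypothetical helper: read lines from $in and append each distinct line to
# "$dir/NN.txt", where NN is the line's two-digit prefix.
# Returns the number of output files that were opened.
sub split_unique_lines {
    my ( $in, $dir ) = @_;
    my ( %fh, %seen );

    while ( my $line = <$in> ) {
        next unless $line =~ /^(\d{2})/;    # skip lines without a 2-digit prefix
        my $prefix = $1;

        # Key on the 20-byte binary digest rather than the full line.
        next if $seen{ sha1($line) }++;

        # Open each output file once and cache the handle.
        unless ( $fh{$prefix} ) {
            open my $out, '>>', "$dir/$prefix.txt";
            $fh{$prefix} = $out;
        }
        print { $fh{$prefix} } $line;
    }

    close $_ for values %fh;
    return scalar keys %fh;
}
```

It could be called as `split_unique_lines( \*STDIN, '.' )`. The digest only saves memory when lines are longer than the 20 bytes that `sha1` returns, so for short lines like the sample data the plain `$seen{$_}` version is just as good.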