Grouping unique lines into a file.

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: Grouping unique lines into a file. by Limbic~Region (Chancellor) on Apr 21, 2014 at 15:31 UTC
Anonymous Monk,

If I have understood what you are trying to do, it boils down to these requirements:

- A single file contains multiple lines.
- Each line consists of an account followed by data.
- The desired end state is for each account to be in its own file.
- Any duplicate data for an account should be ignored.

Assuming the above is correct, here is how I would do it:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %seen;
while (<DATA>) {
    chomp;
    # Two-digit account number, then the rest of the line as the data
    my ($acct, $data) = $_ =~ m{^(\d\d)(.*)};
    next if $seen{$acct}{$data}++;
    append_data($acct, $_);
}

# If you know you are not going to exceed the open filehandle limit
# You can improve performance by caching filehandles
sub append_data {
    my ($acct, $line) = @_;
    open(my $fh, '>>', "$acct.txt")
        or die "Unable to open '$acct.txt' for appending: $!\n";
    print $fh "$line\n";    # restore the newline removed by chomp
}
```

Here is how it would look if you know you will be safe caching file handles.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my (%seen, %fh);
while (<DATA>) {
    chomp;
    # Two-digit account number, then the rest of the line as the data
    my ($acct, $data) = $_ =~ m{^(\d\d)(.*)};
    next if $seen{$acct}{$data}++;
    append_data($acct, $_, \%fh);
}

# Improved performance by caching filehandles
sub append_data {
    my ($acct, $line, $fh) = @_;
    if (! $fh->{$acct}) {
        open($fh->{$acct}, '>>', "$acct.txt")
            or die "Unable to open '$acct.txt' for appending: $!\n";
    }
    print { $fh->{$acct} } "$line\n";    # restore the newline removed by chomp
}
```

Cheers - L~R
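The caching variant in the reply above is only safe while the number of accounts stays under the open filehandle limit. If that cannot be guaranteed, one option is to cap the cache and flush it when it fills up. The following is a minimal sketch under that assumption, not code from the original reply: `append_bounded`, the `$max` parameter and the temporary directory are all hypothetical names chosen for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Hypothetical helper: append $line to "$dir/$acct.txt" while keeping
# at most $max filehandles open at any one time.
sub append_bounded {
    my ($dir, $acct, $line, $cache, $max) = @_;
    if (!$cache->{$acct}) {
        # Cache is full: close every handle and start over. Reopening a
        # file later is harmless because all files are opened for append.
        if (keys(%$cache) >= $max) {
            close $_ for values %$cache;
            %$cache = ();
        }
        my $path = File::Spec->catfile($dir, "$acct.txt");
        open($cache->{$acct}, '>>', $path)
            or die "Unable to open '$path' for appending: $!\n";
    }
    print { $cache->{$acct} } "$line\n";
}

# Demo: three accounts, but never more than two handles open at once.
my $dir = tempdir(CLEANUP => 1);
my %fh;
append_bounded($dir, $_->[0], $_->[1], \%fh, 2)
    for ['01', 'a'], ['02', 'b'], ['03', 'c'], ['01', 'd'];
close $_ for values %fh;
```

Closing the whole cache at once is the crudest possible eviction policy. It trades some extra `open` calls for very little bookkeeping, which may be acceptable when lines for the same account tend to arrive close together.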
Re: Grouping unique lines into a file. by ww (Archbishop) on Apr 21, 2014 at 15:53 UTC
There are no "unique lines" in your sample data... at least, not within the normal meaning of "unique" (computer usage included). Perhaps Limbic~Region's interpretation is correct; IMO that's likely and entirely plausible... but what do you want to do if an account includes multiple, unduplicated lines, as `03` does, for example?

If not, please explain what you really want to do, using more precise language, since another (semi-)plausible interpretation of your description is that you want each unique data element, per account number, included in a single file which includes all account numbers.

*Questions containing the words "doesn't work" (or their moral equivalent) will usually get a downvote from me unless accompanied by:*

- code
- verbatim error and/or warning messages
- a coherent explanation of what "doesn't work" actually means.

check Ln42!
Re^2: Grouping unique lines into a file. by Anonymous Monk on Apr 21, 2014 at 17:27 UTC
What is meant to be treated as unique are the numbers in front of each line. I am trying to read a data file and group its lines into separate files when the numbers in front are the same: one file with all the "01s", one file with all the "02s", one file with all the "03s", and so on. Sorry for the lack of explanation.
Re^3: Grouping unique lines into a file. by Not_a_Number (Prior) on Apr 21, 2014 at 17:34 UTC
So does this do what you want?

```perl
use strict;
use warnings;

my %accounts;

# Collect the data for each account, keyed by the leading number
while ( <DATA> ) {
    push @{ $accounts{$1} }, $2 if /^(\d+)\s+(.+)/;
}

# Write one file per account
for my $k ( keys %accounts ) {
    open my $fh, '>', "$k.txt" or die "Can't open '$k.txt': $!\n";
    print $fh join( "\n", @{ $accounts{$k} } ), "\n";
}
```

Update: Modified code slightly.
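If repeated data for the same account should also be dropped, as Limbic~Region assumed earlier in the thread, the in-memory grouping could be combined with a `%seen` hash. This is a minimal sketch rather than anyone's posted solution; `group_unique` is a hypothetical name, and the sample lines are borrowed from the test data elsewhere in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: group lines by their leading account number,
# dropping data that has already been seen for that account and
# keeping the remaining entries in first-seen order.
sub group_unique {
    my @lines = @_;
    my (%accounts, %seen);
    for my $line (@lines) {
        next unless $line =~ /^(\d+)\s+(.+)/;    # skip lines without an account
        my ($acct, $data) = ($1, $2);
        next if $seen{$acct}{$data}++;           # duplicate for this account
        push @{ $accounts{$acct} }, $data;
    }
    return \%accounts;
}

my $grouped = group_unique(
    '01 The quick red fox and dog as test.',
    '02 Time flies like an arrow.',
    '02 Time flies like an arrow.',
    '01 The quick red fox jumped over the lazy brown dog.',
);

for my $acct (sort keys %$grouped) {
    print "$acct: $_\n" for @{ $grouped->{$acct} };
}
```

Returning a hash of arrays keeps the grouping separate from the file writing, so the same result can be fed to either the one-file-per-account loop or a single combined report.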
Re^3: Grouping unique lines into a file. by bigj (Monk) on Apr 21, 2014 at 18:20 UTC
Here's a solution that caches the filehandles (so there is no need for repeated open and close calls), works in place (meaning you don't have to keep all lines hanging around in memory)*, and avoids duplicate lines:

```perl
#!/usr/bin/perl -w
use strict;
use warnings;
use autodie;   # I'm too lazy to write open ... or die stuff right here

# cache for file handles;
my %fh   = ();
my %seen = ();

while (<DATA>) {
    next unless /^(\d{2})/;   # ignore lines starting with anything else than 2 digits
    next if $seen{$_}++;      # ignore if a line comes again
    unless ($fh{$1}) {
        warn "'$1.txt' already exists" if -e "$1.txt";
        open my $FH, '>>', "$1.txt";
        $fh{$1} = $FH;
    }
    print {$fh{$1}} $_;
}

foreach my $FH (values %fh) {close $FH};

__DATA__
01 The quick red fox and dog as test.
02 Time flies like an arrow, fruit flies like a banana.
02 Time flies like an arrow, fruit flies like a banana.
03 Now is the time for all good men to come to the aid of their party.
01 The quick red fox jumped over the lazy brown dog.
01 The quick red fox jumped over the lazy brown dog.
02 Time flies like an arrow.
03 Now is the time for all good men to come to the aid of their party and not going.
03 Now is the time for all.
```

Greetings, Janek Schleicher

*PS: O.K., that's not the whole truth, as the keys of %seen are the lines :-). If memory becomes a problem, we can replace them with a digest of each line, e.g. with SHA1 like:

```perl
...
use Digest::SHA1 qw/sha1/;

# cache for file handles;
my %fh   = ();
my %seen = ();

while (<DATA>) {
    next unless /^(\d{2})/;      # ignore lines starting with anything else than 2 digits
    next if $seen{sha1($_)}++;   # ignore if a line comes again
    ....
...
```
Re: Grouping unique lines into a file. by Laurent_R (Canon) on Apr 21, 2014 at 16:20 UTC
In addition to what other monks have said about the lack of clarity of your requirement, I certainly don't understand your code. In particular, you are testing `if($account=~/$group{$line}/g)` and, whether this conditional returns true or false, you do exactly the same thing. This is unlikely to be your real intent. The way you are populating the `%group` hash also appears to be inconsistent with the way you are using it in the above conditional, unless I am missing something.