PerlMonks  

Re: Get unique fields from file

by Marshall (Canon)
on Jan 06, 2022 at 23:51 UTC ( #11140228=note )


in reply to Get unique fields from file

I looked at some of the other solutions and some appear to be complex for a beginner. Added: I looked back at this and even mine seems pretty complicated! The size of the potential data involved is a complicating wrinkle... So I tried yet another approach. You have a potentially huge file, and you will very likely run out of memory if you try to do all the processing in one pass.

So, this code focuses on one column at a time. It reads the whole file and counts the number of times things in the "current column of interest" are "seen". When that is finished, the "unique" values are the ones which were only seen once.

Then the file pointer is moved back to the beginning of data, and the next column is processed.

If you have a large number of columns, this will be slow. But speed may not matter much. You don't say.

You are probably fine splitting on /\|/ (regex for a literal |). I work with database files all the time that are |-delimited and are not CSV files, meaning, amongst other things, that | is not a legal data value. If this file really is a CSV file, then you should use Text::CSV to parse it for you (the quoting rules are incredibly tricky). But if it's not a CSV file, then a simple split is fine. If you see any " marks in the file, it probably is a CSV file.
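To show why quoting matters, here is a minimal sketch with Text::CSV, assuming that module is installed and using a made-up line of |-separated data with a quoted field. A plain split(/\|/) would break the quoted field in two; Text::CSV keeps it intact:

```perl
use strict;
use warnings;
use Text::CSV;   # assumes Text::CSV is installed (cpan Text::CSV)

# sep_char => '|' makes it treat | as the field separator;
# binary => 1 lets fields contain embedded quotes/newlines.
my $csv = Text::CSV->new({ binary => 1, sep_char => '|' })
    or die "Cannot use Text::CSV: " . Text::CSV->error_diag();

my $line = 'val1|"quoted|value"|val3';   # hypothetical data line
if ($csv->parse($line)) {
    my @fields = $csv->fields();
    print "$_\n" for @fields;   # 3 fields; the | inside quotes stays in field 2
} else {
    warn "parse failed on: $line";
}
```

A naive split(/\|/) on that same line would hand back 4 pieces, with the quote marks still glued on.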

use strict;
use warnings;
use Fcntl qw(SEEK_SET SEEK_CUR SEEK_END);

my $data =<< "END";
head1|head2|head3
val1|val2|val3
val1|val4|val5
val6|val4|val5
val2|val7|val5
val3|val7|val3
END

open my $fh, "<", \$data or die "can't open input file $!";

my $header_line = <$fh>;   # get first line
chomp $header_line;
my @headers = split(/\|/,$header_line);
my $num_columns = @headers;       # scalar value of an array is number of elements
my $begin_of_data = tell($fh);    # save byte position of start of data rows

my %seen;
foreach my $column_nr (0..$num_columns-1)
{
    # read whole file looking only at one column
    while (<$fh>)
    {
        chomp;
        my $field = (split(/\|/,$_))[$column_nr];
        $seen{$field}++;   # counts num times "seen"
    }
    # unique values are those which were only "seen" once
    print "UNIQUE VALUES for ",shift @headers,":\n";
    foreach my $field_value (sort keys %seen)
    {
        print "$field_value\n" if $seen{$field_value} == 1;
    }
    # reset file pointer to beginning and do everything again for
    # the next column
    seek ($fh,$begin_of_data,SEEK_SET);
    %seen=();   # clear the hash for calculating next column's data
    print "\n";
}
__END__
UNIQUE VALUES for head1:
val2
val3
val6

UNIQUE VALUES for head2:
val2

UNIQUE VALUES for head3:
Update: I did re-consider how to do this in a single pass with minimal memory. A 1/2 GB file is "big" in terms of memory when you consider that the internal Perl representation, perhaps multiple hash tables (one per column), could be quite large and cause an "out of memory" error. In an attempt to reduce the hash table footprint, I came up with a hash-of-arrays implementation where the keys are a global collection of all values in all columns and each value is an array of the columns that value was seen in. Depending upon the data, this table could fit into RAM.

I can see an important specification issue here: what is meant by "unique"? I interpreted "unique" in the normal English sense, "one of a kind". Some folks have interpreted it as "non-repeating values", "don't repeat yourself". Those are very different interpretations. I am not sure which the OP wants.
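To make the two readings concrete, here is a small core-Perl sketch (made-up data for one column) showing how the same %seen count hash answers both questions differently:

```perl
use strict;
use warnings;

my @column = qw(val1 val1 val2 val3);   # hypothetical single column

my %seen;
$seen{$_}++ for @column;

# "one of a kind": values whose count is exactly 1
my @one_of_a_kind = grep { $seen{$_} == 1 } sort keys %seen;

# "non-repeating" / deduplicated: every distinct value, listed once
my @deduplicated = sort keys %seen;

print "one of a kind: @one_of_a_kind\n";   # val2 val3
print "deduplicated:  @deduplicated\n";    # val1 val2 val3
```

Note that val1 appears in the second list but not the first; that is the whole disagreement in one line.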

Here is what that code looks like:

use strict;
use warnings;
use Data::Dump qw (dump dd);

my $data =<< "END";
head1|head2|head3
val1|val2|val3
val1|val4|val5
val6|val4|val5
val2|val7|val5
val3|val7|val3
END

open my $fh, "<", \$data or die "can't open input file $!";

my $header_line = <$fh>;   # get first line
chomp $header_line;
my @headers = split(/\|/,$header_line);

my %seen;  # value => [array of the positive column numbers it's unique in]
           # like "somevalue" => [1,2]
           # a negative column number means it was seen at least twice
           # and therefore that value is not unique, "one of a kind",
           # in that column
           # like "somevalue" => [1,2,-3]

while (defined (my $line=<$fh>))
{
    chomp $line;
    next if $line =~ /^\s*$/;
    my @fields = split(/\|/,$line);
    my $col=1;   # cols are 1...n -- can't have a negative zero!
    foreach my $field (@fields)
    {
        if (my ($col_seen_before) = grep{$col == abs($_)}@{$seen{$field}})
        {
            if ($col_seen_before < 0)
            {
                # nothing to do... this is the nth time this value was
                # seen in this col
            }
            else
            {
                # flip the sign of the column number to negative to indicate
                # that this value has been seen in this column a 2nd time
                @{$seen{$field}} = map{$_ = -$_ if $_ == $col_seen_before; $_}@{$seen{$field}};
            }
        }
        else
        {
            push @{$seen{$field}}, $col;   # first time seen in this column
        }
        $col++;
        #dd \%seen;
    }
}
# unique values are those which have a positive value for that col number
foreach my $value_seen (sort keys %seen)
{
    if (my @unique_cols = sort{$a<=>$b} grep{$_>0}@{$seen{$value_seen}})
    {
        print "$value_seen is unique in cols: @unique_cols\n";
    }
}
__END__
Previous program output:

UNIQUE VALUES for head1:
val2
val3
val6

UNIQUE VALUES for head2:
val2

UNIQUE VALUES for head3:
(NONE)

This program's output (there could be many formats to present this information):

val2 is unique in cols: 1 2
val3 is unique in cols: 1
val6 is unique in cols: 1

This means nothing was unique in column 3. Column 1 has 3 unique values; column 2 has one unique value.
