I looked at some of the other solutions and some appear to be complex for a beginner. Added: I looked back at this and even this one seems pretty complicated! The size of the potential data involved is a complicating wrinkle... so I tried yet another approach. You have a potentially huge file, and you will very likely run out of memory if you try to do all of the processing in one pass.

So, this code focuses on one column at a time. It reads the whole file and counts the number of times each value in the "current column of interest" is "seen". When that pass is finished, the "unique" values are the ones which were seen only once.

Then the file pointer is moved back to the beginning of the data, and the next column is processed.

If you have a large number of columns, this will be slow, but speed may not matter much; you don't say whether it does.

You are probably fine splitting on /\|/ (a regex for a literal |). I work with database files all the time which are |-delimited and are not CSV files, meaning, amongst other things, that | is not a legal data value. If this file really is a CSV file, then you should use Text::CSV to parse it for you (the quoting rules are incredibly tricky). But if it's not a CSV file, then a simple split is fine. If you see any " marks in the file, it probably is a CSV file.
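For reference, if it does turn out to be CSV-style data with | as the separator, the parse could look roughly like this (a minimal sketch; the file name and attribute choices here are my assumptions, not something the OP specified):

use strict;
use warnings;
use Text::CSV;

# sep_char => '|' tells Text::CSV the delimiter; binary/auto_diag are safe defaults
my $csv = Text::CSV->new({ sep_char => '|', binary => 1, auto_diag => 1 });

open my $fh, "<", "data.txt" or die "can't open input file $!";   # hypothetical file name
my $headers = $csv->getline($fh);      # first row = column names
while (my $row = $csv->getline($fh)) {
    my @fields = @$row;                # fields with any quoting already handled
    # ... same per-column counting as in the program below ...
}
close $fh;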

use strict;
use warnings;
use Fcntl qw(SEEK_SET SEEK_CUR SEEK_END);

my $data =<< "END";
head1|head2|head3
val1|val2|val3
val1|val4|val5
val6|val4|val5
val2|val7|val5
val3|val7|val3
END

open my $fh, "<", \$data or die "can't open input file $!";

my $header_line = <$fh>;            # get first line
chomp $header_line;
my @headers = split(/\|/,$header_line);
my $num_columns = @headers;         # scalar value of an array is number of elements

my $begin_of_data = tell($fh);      # save byte position of start of data rows
my %seen;

foreach my $column_nr (0..$num_columns-1)
{
    # read whole file looking only at one column
    while (<$fh>)
    {
        chomp;
        my $field = (split(/\|/,$_))[$column_nr];
        $seen{$field}++;            # counts num times "seen"
    }
    # unique values are those which were only "seen" once
    print "UNIQUE VALUES for ",shift @headers,":\n";
    foreach my $field_value (sort keys %seen)
    {
        print "$field_value\n" if $seen{$field_value} == 1;
    }
    # reset file pointer to beginning and do everything again for
    # the next column
    seek ($fh,$begin_of_data,SEEK_SET);
    %seen=();                       # clear the hash for calculating next column's data
    print "\n";
}
__END__
UNIQUE VALUES for head1:
val2
val3
val6

UNIQUE VALUES for head2:
val2

UNIQUE VALUES for head3:
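A side note on the example: it opens a reference to the in-memory $data string only so the post is self-contained and runnable as-is. Against a real file on disk, only the open line changes, roughly like this (the file name is made up):

open my $fh, "<", "bigfile.txt" or die "can't open input file $!";

Everything after that (reading the header line, the tell/seek on $begin_of_data, and the per-column loop) stays exactly the same.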
Update: I did re-consider how to do this in a single pass with minimal memory. A 1/2 GB file is "big" in terms of memory when you consider that the internal Perl representation of that thing, perhaps as multiple hash tables (one per column), could be quite large and cause an "out of memory" error. In an attempt to reduce the hash table footprint, I came up with a hash-of-array implementation where the keys are a global collection of all values in all columns and the value is an array of the columns that value was seen in. Depending upon the data, this table could fit into RAM.
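Walking the sample rows from the test data through that scheme by hand, the finished %seen should come out roughly like this (a positive column number means "seen exactly once in that column"; negative means "seen more than once there"):

val1 => [-1]
val2 => [2, 1]
val3 => [-3, 1]
val4 => [-2]
val5 => [-3]
val6 => [1]
val7 => [-2]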

I can see an important specification issue here: what is meant by "unique"? I interpreted "unique" in the normal English sense, "one of a kind". Some folks have interpreted it as "non-repeating values", i.e. "don't repeat yourself". Those are very different interpretations, and I am not sure which one the OP wants.
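To make the difference concrete, here is a toy sketch on a single, invented column of values (the values themselves are just for illustration):

use strict;
use warnings;

my %count;
$count{$_}++ for qw(val1 val1 val6 val2);

# "one of a kind": values that occur exactly once in the column
print join(" ", grep { $count{$_} == 1 } sort keys %count), "\n";   # prints: val2 val6

# "don't repeat yourself": the distinct values, each listed once
print join(" ", sort keys %count), "\n";                            # prints: val1 val2 val6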

Here is what that code looks like:

use strict;
use warnings;
use Data::Dump qw(dump dd);

my $data =<< "END";
head1|head2|head3
val1|val2|val3
val1|val4|val5
val6|val4|val5
val2|val7|val5
val3|val7|val3
END

open my $fh, "<", \$data or die "can't open input file $!";

my $header_line = <$fh>;            # get first line
chomp $header_line;
my @headers = split(/\|/,$header_line);

my %seen;   # value => [array of the positive column numbers it's unique in]
            # like "somevalue" => [1,2]
            # a negative column number means it was seen at least twice
            # and therefore that value is not unique, "one of a kind",
            # in that column
            # like "somevalue" => [1,2,-3]

while (defined (my $line=<$fh>))
{
    chomp $line;
    next if $line =~ /^\s*$/;
    my @fields = split(/\|/,$line);
    my $col=1;                      # cols are 1..n - can't have a negative zero!
    foreach my $field (@fields)
    {
        if (my ($col_seen_before) = grep{$col == abs($_)}@{$seen{$field}})
        {
            if ($col_seen_before < 0)
            {
                # nothing to do... this is the nth time this value was seen in this col
            }
            else
            {
                # flip the sign of the column number to negative to indicate
                # that this value has now been seen in this column a 2nd time
                @{$seen{$field}} = map{$_ = -$_ if $_==$col_seen_before; $_}@{$seen{$field}};
            }
        }
        else
        {
            push @{$seen{$field}}, $col;    # first time seen in this column
        }
        $col++;
        #dd \%seen;
    }
}
# unique values are those which have a positive value for that col number
foreach my $value_seen (sort keys %seen)
{
    if (my @unique_cols = sort{$a<=>$b} grep{$_>0}@{$seen{$value_seen}})
    {
        print "$value_seen is unique in cols: @unique_cols\n";
    }
}
__END__
Previous program output:

UNIQUE VALUES for head1:
val2
val3
val6

UNIQUE VALUES for head2:
val2

UNIQUE VALUES for head3:
(NONE)

This program's output:
There could be many formats to present this information...

val2 is unique in cols: 1 2
val3 is unique in cols: 1
val6 is unique in cols: 1

This means nothing was unique in column 3. Column 1 has 3 unique values; column 2 has one unique value.
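Since there could be many formats to present this information, here is one possible way to reshape that %seen structure back into the per-column report the first program printed (a sketch that assumes %seen and @headers exactly as built above):

# invert %seen into:  column number => [values unique in that column]
my %unique_by_col;
for my $value (sort keys %seen) {
    push @{ $unique_by_col{$_} }, $value for grep { $_ > 0 } @{ $seen{$value} };
}

for my $col (1 .. @headers) {
    print "UNIQUE VALUES for $headers[$col-1]:\n";
    my @vals = @{ $unique_by_col{$col} // [] };
    if (@vals) { print "$_\n" for @vals }
    else       { print "(NONE)\n" }
    print "\n";
}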

In reply to Re: Get unique fields from file by Marshall
in thread Get unique fields from file by sroux
