Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris

Get unique fields from file

by sroux (Sexton)
on Jan 06, 2022 at 09:22 UTC ( [id://11140211] : perlquestion . print w/replies, xml ) Need Help??

sroux has asked for the wisdom of the Perl Monks concerning the following question:


As many I use Perl as a swiss army knife, I may understand the code, write some but also copy and glue it together to make scripts used on a desktop (as utility) or server (file parsing, mapping etc.).

I would like to write an utility that reads a file (can be max 500mb) and output a result with unique values >> for each delimited field << (what I mean is not an doing an unique on the whole file content but field by field).

Looks like that hashes are a nice solution for this task but I never used hashes.

I found some scripts around:

#Create hash based on header

my %combined; open(my $input, "<", "flatfile.dat") or die ("Unable to open file"); my $line = <$input>; chomp ( $line ); my @headers = split("|", $line); while (<$input>) { chomp; my @row = split("|"); for my $header (@headers) { push( @{ $combined{$header} }, shift(@row) ); } }

Now how to get the unique value for each field and produce an output file. I had this piece of code used somewhere but I hardly can understand it:

#Remove duplicates my %seen = (); my @uniqueOutput = grep { ! $seen{ $_ }++ } @output; print $fh3 @uniqueOutput;

Thank you for any guidance you may provide.

Replies are listed 'Best First'.
Re: Get unique fields from file
by choroba (Cardinal) on Jan 06, 2022 at 12:00 UTC
    The first argument of split is a regex, not a string. "|" is an alternative of two empty strings.
    my @fields = split "|", 'abc|def'; print join ', ', @fields; # a, b, c, |, d, e, f
    If you want to split on the literal vertical bar, you need to backslash it.
    my @fields = split /\|/, 'abc|def'; print join ', ', @fields; # abc, def
    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: Get unique fields from file
by johngg (Canon) on Jan 06, 2022 at 11:43 UTC
    I had this piece of code used somewhere but I hardly can understand it:

    Breaking that down into stages:-

    • Create an empty hash %seen that will track occurrences of elements in @output

    • Pass each element of @output one at a time into the grep as $_ for filtering

    • The first time a particular value occurs in $_ the hash value $seen{ $_ } will be empty, hence "false" and ! seen{ $_ } i.e. "not false" is "true" so that the $_ value passes out of the grep into @uniqueOutput

    • Note also the ++ post-increment operator in ! $seen{ $_ }++ that increments the value of $seen{ $_ } after the test ! seen{ $_ } has been done, which means that after the first occurrence of a particular value the hash for it will no longer be blank, i.e. "false", but 1, 2 ... etc. depending on how many times it occurs, which evaluates to "true" and therefore "not true" is "false" so that second and subsequent occurrences will not pass from the grep into @uniqueOutput

    My preference is to limit the scope of %seen to a do block so that it isn't left lying around.

    johngg@abouriou:~/perl/Monks$ perl -Mstrict -Mwarnings -E ' say q{}; my @arr = qw{ a b c d c e f a b b g d }; my @uniq = do { my %seen; grep { not $seen{ $_ } ++ } @arr; }; say qq{@uniq};' a b c d e f g

    I hope this makes the process clearer for you but ask further if something is still not clear.



Re: Get unique fields from file
by davido (Cardinal) on Jan 06, 2022 at 16:53 UTC

    I didn't see specifically what you are having trouble with. It didn't seem like there was a specific question.

    It is correct that using a hash is a good approach. I think that the uniq function is probably not where I would immediately go because of the fact that you want unique values per field, and in order to use uniq for that, you would have to hold the whole file in memory at once (even if there's a high rate of duplication within fields). Rather, I would do the unique filtering early on, line by line. That way, if there are a lot of collisions within any given field, you're only holding onto one instance, which could be a lot more memory friendly.

    You're dealing with a style of CSV. It's '|' separated csv, so |sv, but I prefer using a CSV parser for that so that I don't have to deal with the intricacies of embedded / escaped separators. The Text::CSV module can grab the headers for you and can already pre-organize the data into header => value pairs for you. Here's an example of how that could look:

    #!/usr/bin/env perl use strict; use warnings; use Text::CSV; my $unique_within_field = csv_to_unique_within_field(\*DATA, '|'); print "$_: ", join(', ', @{$unique_within_field->{$_}}), "\n" foreach sort keys %{$unique_within_field}; sub csv_to_unique_within_field { my($data_fh, $sep) = @_; my $csv = Text::CSV->new({}); $csv->header($data_fh, {sep_set => [$sep // ',']}); my %found; while (my $row = $csv->getline_hr($data_fh)) { $found{$_}{$row->{$_}} = undef for keys %$row; } return { map { $_ => [sort keys %{$found{$_}}] } keys %found }; } __DATA__ head1|head2|head3 val1|val2|val3 val1|val4|val5 val6|val4|val5 val2|val7|val5 val3|val7|val3

    The meat here is within the csv_to_unique_within_field function. Pass into the function a filehandle, and a separator. If no separator is provided, assume comma.

    The function does this:

    1. Grab the CSV header. In my example, the header identifies fields named head1, head2, and head3.
    2. For each remaining row of CSV data, populate our %found hash with hash keys for each field value within each header. I'm giving the keys 'undef' as the value to which they point. It's not important what they contain. We're using the keys as a uniqueness filter.
    3. Return a new, transformed hash where each header key points a reference to an array containing the unique values.

    After this I just print the headers and all the fields they contained. Since we filtered-in only unique per header, it's just a straighforward datastructure print.


Re: Get unique fields from file
by hippo (Bishop) on Jan 06, 2022 at 09:49 UTC
    Now how to get the unique value for each field and produce an output file. I had this piece of code used somewhere but I hardly can understand it

    It depends largely on what you mean by "get the unique value". If your set is (x, x, y) do you want the result to be (y) as that is the only value which is unique or do you want it to be (x, y) as that is all the values but just one occurrence of each? If the latter then then snippet which you have will suffice. Removing duplicates is explained in the FAQ How can I remove duplicate elements from a list or array?.

    BTW, since what you have is really an xSV file with pipe delimiters I would strongly suggest you have a look at Text::CSV_XS before rolling your own parser.


Re: Get unique fields from file
by kcott (Archbishop) on Jan 07, 2022 at 03:11 UTC

    G'day sroux,

    Given the size of your data, processing speed may be a factor. The following may be faster than other proposed solutions; but do Benchmark with realistic data.

    I've used the same input data as others have done.

    $ cat pm_11140211_flatfile.dat head1|head2|head3 val1|val2|val3 val1|val4|val5 val6|val4|val5 val2|val7|val5 val3|val7|val3

    My code takes advantage of the fact that when duplicate keys are used in a hash assignment, only the last duplicate takes effect. A short piece of code to demonstrate:

    $ perl -e ' use Data::Dump; my %x = (a => 1, b => 3, c => 4); my %y = (b => 2, c => 3, d => 4); my %z = (%x, %y); dd \%z; ' { a => 1, b => 2, c => 3, d => 4 }

    So there's no need for %seen, uniq(), or any similar mechanism, to handle duplicates.

    Also note that I've used bind_columns(). See the benchmark in "Text::CSV - getline_hr()".

    The code:

    #!/usr/bin/env perl use strict; use warnings; use autodie; use Text::CSV; my $infile = 'pm_11140211_flatfile.dat'; my $csv = Text::CSV::->new({sep_char => '|'}); open my $in_fh, '<', $infile; my $row = {}; my @cols = @{$csv->getline($in_fh)}; $csv->bind_columns(\@{$row}{@cols}); my %data = map +($_, {}), @cols; while ($csv->getline($in_fh)) { $data{$_} = { %{$data{$_}}, $row->{$_}, 1 } for @cols; } print "$_: ", join(', ', sort keys %{$data{$_}}), "\n" for sort @cols;

    The output:

    head1: val1, val2, val3, val6 head2: val2, val4, val7 head3: val3, val5

    — Ken

Re: Get unique fields from file
by tybalt89 (Monsignor) on Jan 07, 2022 at 02:29 UTC

    At least this does not fail any of the test cases you have provided :)

    #!/usr/bin/perl use strict; # use warnings; my $data =<< "END"; head1|head2|head3 val1|val2|val3 val1|val4|val5 val6|val4|val5 val2|val7|val5 val3|val7|val3 END open my $fh , "<", \$data or die "can't open input file $!"; my @headers = split /[|\n]/, <$fh>; my %seen; while( <$fh> ) { my @row = split /[|\n]/; $seen{$_}{shift @row}++ for @headers; } print join "\n", "UNIQUE VALUES for $_:", (sort keys %{ $seen{$_} }), +"\n" for @headers;


    UNIQUE VALUES for head1: val1 val2 val3 val6 UNIQUE VALUES for head2: val2 val4 val7 UNIQUE VALUES for head3: val3 val5
      I do like this general approach, however the OP is talking about a significant sized file of 500 MB. Depending upon the data of course, your HoH (hash of hash) structure could consume quite a bit more memory than the actual file size in MB.

      { head1 => { val1 => 2, val2 => 1, val3 => 1, val6 => 1 }, head2 => { val2 => 1, val4 => 2, val7 => 2 }, head3 => { val3 => 2, val5 => 3 }, }
      I came up with a representation (at this post) where the column values only occur once as hash keys and the value of each hash key is an array describing whether a value: appears or doesn't appear at all in column, whether a value only appears once in a column, whether a value occurs more than once in a column.

      We both interpreted "unique" to mean different things.
      I see you think that means: "don't repeat yourself after having said something once".
      I thought it meant: "don't say anything at all if you would repeat yourself".

      My data structure:

      { val1 => [-1], # val1 occurs more than once in col 1 val2 => [2, 1], # val2 occurs once in col 1 and once in col 2 val3 => [-3, 1], #val3 occurs more than once in col 3 #but only one time in col1 val4 => [-2], val5 => [-3], val6 => [1], val7 => [-2], #val7 is mentioned at least twice in col2 }
      Of course I could generate your same output from my data structure because I know the columns where the term appeared more than once.

        Yep. I couldn't tell if what was wanted was "unique in row", "unique in column" or "only singletons in column", so I just thought I'd toss something out there.

        I do have a solution that takes practically no memory (uses external sort -u), but I'll wait to see responses to the proposed answers first.

[OT] Re: Get unique fields from file
by erix (Prior) on Jan 08, 2022 at 11:42 UTC

    I should probably say: don't do this at home, kids.

    I'm not really sure what sroux is after. I have assumed the 'intention' from the solutions that others baked.

    For fun, I open up a postgres SQL interface to the data in this file. I'll admit this postgres spelunking is not for the faint-hearted or for DB beginners. The advantage is getting access to a textfile via the power of SQL.

    # bash file=/tmp/flatfile.dat t=flattable sep='|' echo "head1|head2|head3 val1|val2|val3 val1|val4|val5 val6|val4|val5 val2|val7|val5 val3|val7|val3" > $file # postgres extension 'file_fdw' must be installed (but haven't we all) +: psql -qXa -c "create server if not exists fs foreign data wrapper file +_fdw;" psql -qXa << SQL_PERL drop foreign table if exists $t; create foreign table $t ($( < $file perl -ne 'chomp; my $t = "'$t'"; print join("\n , ", map{$_ .= " text"} split(/['${sep}']/)); exit;' )) server fs options(delimiter'$sep',format'csv',header'TRUE',filename'$f +ile'); SQL_PERL

    running the above, on-screen the SQL that was run, echoed from psql:

    create server if not exists fs foreign data wrapper file_fdw; drop foreign table if exists flattable; create foreign table flattable (head1 text , head2 text , head3 text ) server fs options(delimiter'|',format'csv',header'TRUE',filename'/tmp/ +flatfile.dat');

    Now the file is accessible as a foreign table named 'flattable'.

    Once a postgres table, you can run SQL against it, for instance this SELECT:

    (select 'head1' "heads", array_agg(distinct head1 order by head1) "val +ues" from flattable) union all (select 'head2' "heads", array_agg(distinct head2 order by head2) "val +ues" from flattable) union all (select 'head3' "heads", array_agg(distinct head3 order by head3) "val +ues" from flattable) -- output: -- -- heads | values -- -------+----------------------- -- head1 | {val1,val2,val3,val6} -- head2 | {val2,val4,val7} -- head3 | {val3,val5} -- (3 rows)

    In case of many columns, say hundreds (it happens!), generate the needed SQL. (see psql's \gexec -- left as an exercise for the reader)

      Using an SQL DB for this sort of thing is actually a pretty good idea.
      I attach my SQLite multi-platform SQLite code below.

      There are utilities to import CSV files, but I have never bothered to install the SQLite3 command line on my machine. The Perl DBI comes with all the functions that Perl needs. I use a browser plug-in, SQLite Manager for admin functions. Firefox removed the I/F this plug-in needs, so I installed Waterfox for the single purpose of running this plug-in.

      The Perl syntax for some of the SQL command strings got a little hairy, so I included a debug statement for those. If you define a constant, like DEBUG_SQL, if you set that constant equal to 0, Perl will not even compile the "if" statement into the code. So there is no production runtime penalty for using this technique. I left those statements in the code as a demo of technique.

      I didn't bother with a CSV parser because there is no evidence at all that this is a CSV file. Include that if you like.

      SQL cannot make a table name a variable, but it can make a column name a variable. Unfortunately my SQL knowledge is inadequate at the moment. So I just made a "template" and ran a "prepare" on the adjusted template once per column. Impact on performance is zero.

      There has been some discussion in this thread about whether the OP actually intended unique values or distinct values. Most posts rejected the OP's plain English request for unique values in favor of an "distinct" interpretation. I suppose a hold over from UNIX uniq? I calculated both ways below.

      There is an easy way to vary SQLite's memory footprint. But it has been maybe 8 years since I needed to do that. I had a heavily indexed DB and found that creating the tables first and then adding the indices later was much faster than having the DB figure that stuff out "as it went" during DB creation. It was possible to vastly increase the footprint of SQLite for this indexing operation and then reduce it later. Something like that could have a huge performance impact on this SQL code.

      Update: Oh, darn I see that val3 should be unique in head1. Not sure where the bug is.
      OK, see **ADDED**, I wasn't ignoring blank lines.

      use strict; use warnings; use DBI; use constant {DEBUG_SQL => 1}; my $scratchDBname = "ScratchDB.sqlite"; my %attr = ( RaiseError => 1); # auto die with error printout my $dbh = DBI->connect("dbi:SQLite:dbname=$scratchDBname","","",\%attr +) or die "Couldn't connect to database $scratchDBname: " . DBI->errs +tr; my $headerLine = <DATA>; chomp $headerLine; my @headers = split (/\|/,$headerLine); # create main Table # $dbh->do("DROP TABLE IF EXISTS main"); my $createTableSQL = "CREATE TABLE main (". join (",",map{"\"$_\" TEXT NOT NULL"}@headers). ")"; print "SQL debug Statement: $createTableSQL\n" if DEBUG_SQL; $dbh->do($createTableSQL); # prepare insert row statement # my $valuesField = "?," x @headers; chop $valuesField; # get rid of last comma my $insertRowSQL = "INSERT INTO main (". join(",",@headers).")". "VALUES (".$valuesField.")"; print "SQL debug statement: $insertRowSQL\n" if DEBUG_SQL; + my $insert = $dbh->prepare($insertRowSQL); # populate main table by reading the 1/2 GB file # $dbh->begin_work; while (defined (my $line =<DATA>)) { chomp $line; next if $line =~ /^\s*$/; #ignore blank lines ***ADDED*** my @fields = split (/\|/,$line); $insert->execute(@fields); } $dbh->commit; # prepare SQL prototypes # # A Column can be a variable but requires more SQL-foo than # I have at the moment... # But making a "prototype" and adjusting that and preparing # on a per column basis appears to be no big deal my $distinct_proto = "SELECT __COLUMN from main GROUP BY __COLUMN ORDER BY __COLUMN"; my $unique_proto = "SELECT __COLUMN FROM ( SELECT __COLUMN, count(*) as histo FROM main GROUP BY __COLUMN ORDER BY __COLUMN) WHERE histo = 1"; foreach my $header (@headers) { my $distinct_select = $distinct_proto; $distinct_select =~ s/__COLUMN/$header/g; my $distinct = $dbh->prepare($distinct_select); my $unique_select = $unique_proto; $unique_select =~ s/__COLUMN/$header/g; my $unique = $dbh->prepare($unique_select); print "\n*****\n"; print "Distinct values for $header:\n"; $distinct->execute(); my $distinct_array_ref = $distinct -> fetchall_arrayref(); print "@$_\n" foreach @$distinct_array_ref; print "\nUnique values for $header:\n"; $unique->execute(); my $unique_array_ref = $unique -> fetchall_arrayref(); print "@$_\n" foreach @$unique_array_ref; } =Prints: SQL debug Statement: CREATE TABLE main ("head1" TEXT NOT NULL,"head2" +TEXT NOT NULL,"head3" TEXT NOT NULL) SQL debug statement: INSERT INTO main (head1,head2,head3)VALUES (?,?,? +) ***** Distinct values for head1: val1 val2 val3 val6 Unique values for head1: val2 val3 val6 ***** Distinct values for head2: val2 val4 val7 Unique values for head2: val2 ***** Distinct values for head3: val3 val5 Unique values for head3: =cut __DATA__ head1|head2|head3 val1|val2|val3 val1|val4|val5 val6|val4|val5 val2|val7|val5 val3|val7|val3

        That's nice, too, but it is essentially different from my example: your code actually /imports/ data, whereas mine only configures a table interface on top of an underlying text file, so that you can run any SQL against that underlying text table data (one could also read a zipped text file in this way).

        Let's just say SQL too has more than one way to do it ;)

Re: Get unique fields from file
by Anonymous Monk on Jan 06, 2022 at 15:50 UTC

    Assuming your Perl is at least semi-modern, you can do

    use List::Util 'uniq';
    my @uniqueOutput = uniq( @output );

    List::Util has been in Perl since Perl version 5.7.3, and uniq() has been in List::Util since List::Util 1.45. There are also variants (uniqnum(), etc) should they be more appropriate to your use.

Re: Get unique fields from file
by Marshall (Canon) on Jan 06, 2022 at 23:51 UTC
    I looked at some of the other solutions and some appear to be complex for a beginner. Added: I looked back at this and even it seems pretty complicated! The size of the potential data involved is a complicating wrinkle....So I tried yet another approach. You have a potentially huge file and you very likely will run out of memory if you try to do all the processing in one pass.

    So, this code focuses on one column at a time. It reads the whole file and counts the number of times things in the "current column of interest" are "seen". When that is finished, the "unique" values are the ones which were only seen once.

    Then the file pointer is moved back to the beginning of data, and the next column is processed.

    If you have a large number of columns, this will be slow. But speed may not matter much. You don't say.

    You are probably fine splitting on /\|/ (regex for |). I work with database files all the time which are | delimited and are not CSV files. Meaning amongst other things, that | is not a legal data value. If this file really is a CSV file, then you should use Text::CSV to parse it for you (the rules are incredibly tricky). But if its not a CSV file, then a simple split is fine. If you see an " marks in the file, it probably is a CSV file.

    use strict; use warnings; use Fcntl qw(SEEK_SET SEEK_CUR SEEK_END); my $data =<< "END"; head1|head2|head3 val1|val2|val3 val1|val4|val5 val6|val4|val5 val2|val7|val5 val3|val7|val3 END open my $fh , "<", \$data or die "can't open input file $!"; my $header_line = <$fh>; # get first line chomp $header_line; my @headers = split(/\|/,$header_line); my $num_columns = @headers; # scalar value of an array is number of el +ements my $begin_of_data = tell($fh); #save byte position of start of data ro +ws my %seen; foreach my $column_nr (0..$num_columns-1) { # read whole file looking only at one column while (<$fh>) { chomp; my $field = (split(/\|/,$_))[$column_nr]; $seen{$field}++; # counts num times "seen" } # unique values are those which were only "seen" once print "UNIQUE VALUES for ",shift @headers,":\n"; foreach my $field_value (sort keys %seen) { print "$field_value\n" if $seen{$field_value} == 1; } # reset file pointer to beginning and do everything again for # the next column seek ($fh,$begin_of_data,SEEK_SET); %seen=(); # clear the hash for calculating next column's data print "\n"; } __END__ UNIQUE VALUES for head1: val2 val3 val6 UNIQUE VALUES for head2: val2 UNIQUE VALUES for head3:
    Update: I did re-consider how to do this in a single pass with minimal memory. A 1/2 GB file is "big" in terms of memory when you consider that the internal Perl representation of that thing as perhaps multiple hash tables (one per column) could be quite large and cause an "out of memory" error. In an attempt to reduce hash table footprint, I came up with a hash of array implementation where the keys are a global collection of all values in all columns and the value is an array of the columns that value was seen in. Depending upon the data, this table could fit into RAM memory.

    I can see an important specification issue here is: "what is meant by unique". I interpreted "unique" in the normal English sense, "one of a kind". Some folks have interpreted that as "non repeating values", "don't repeat yourself". Those are very different interpretations. I am not sure what the OP wants

    Here is what that code looks like:

    use strict; use warnings; use Data::Dump qw (dump dd); my $data =<< "END"; head1|head2|head3 val1|val2|val3 val1|val4|val5 val6|val4|val5 val2|val7|val5 val3|val7|val3 END open my $fh , "<", \$data or die "can't open input file $!"; my $header_line = <$fh>; # get first line chomp $header_line; my @headers = split(/\|/,$header_line); my %seen; # value => [array of the positive column numbers it's unique + in] # like "somevalue" => [1,2] # a negative column number means it was seen at least twice # and therefore that value is not unique, "one of a kind" # in that column # like "somevalue" =>[1,2,-3] while (defined (my $line=<$fh>)) { chomp $line; next if $line =~ /^\s*$/; my @fields = split(/\|/,$line); my $col=1; # cols are 1...n # can't have a negative sero! foreach my $field (@fields) { if (my ($col_seen_before) = grep{$col == abs($_)}@{$seen{$field +}}) { if ($col_seen_before <0) { # nothing to do...#this is nth time this value seen in +this col } else { # flip the sign of column number to negative to indicat +e 2nd # this value has been seen in this column. @{$seen{$field}} = map{$_ = -$_ if $_==$col_seen_before +; $_}@{$seen{$field}}; } } else { push @{$seen{$field}}, $col; #first time seen in this colu +mn } $col++; #dd \%seen; } } # unique values are those which have a positive value for that col num +ber foreach my $value_seen (sort keys %seen) { if (my @unique_cols = sort{$a<=>$b} grep{$_>0}@{$seen{$value_seen} +}) { print "$value_seen is unique in cols: @unique_cols\n"; } } __END__ Previous program output: UNIQUE VALUES for head1: val2 val3 val6 UNIQUE VALUES for head2: val2 UNIQUE VALUES for head3: (NONE) This program's output: There could be many formats to present this information... val2 is unique in cols: 1 2 val3 is unique in cols: 1 val6 is unique in cols: 1 This means nothing was unique in column 3. column 1 has 3 unique values. column 2 has one unique value.