I didn't see exactly what you are having trouble with; there didn't seem to be a specific question.

You're right that a hash is a good approach. The uniq function is probably not where I would go first, though: you want unique values per field, and to use uniq for that you would have to hold the whole file in memory at once (even if there's a high rate of duplication within fields). Instead, I would do the uniqueness filtering early on, line by line. That way, if there are a lot of collisions within any given field, you're only holding onto one instance of each value, which can be a lot more memory friendly.
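Here is a minimal sketch of that line-by-line idea, before bringing in a proper CSV parser. It assumes a naive split on the separator (no quoting or embedded separators handled), and unique_per_field is just a name I made up for illustration:

use strict;
use warnings;

# Sketch only: split naively on the separator. The keys of the inner
# hashes act as the uniqueness filter, so each repeated value is
# stored only once per field.
sub unique_per_field {
    my ($fh, $sep) = @_;

    my $header_line = <$fh>;
    chomp $header_line;
    my @headers = split /\Q$sep\E/, $header_line;

    my %seen;    # field name => { value => undef, ... }
    while (my $line = <$fh>) {
        chomp $line;
        my @values = split /\Q$sep\E/, $line;
        $seen{ $headers[$_] }{ $values[$_] } = undef for 0 .. $#headers;
    }

    return \%seen;    # e.g. $seen->{head1} = { val1 => undef, ... }
}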

You're dealing with a style of CSV. It's '|'-separated CSV, so |sv, but I prefer using a CSV parser for that so I don't have to deal with the intricacies of embedded or escaped separators. The Text::CSV module can grab the headers and pre-organize the data into header => value pairs for you. Here's an example of how that could look:

#!/usr/bin/env perl

use strict;
use warnings;

use Text::CSV;

my $unique_within_field = csv_to_unique_within_field(\*DATA, '|');

print "$_: ", join(', ', @{$unique_within_field->{$_}}), "\n"
    foreach sort keys %{$unique_within_field};

sub csv_to_unique_within_field {
    my ($data_fh, $sep) = @_;

    my $csv = Text::CSV->new({});
    $csv->header($data_fh, {sep_set => [$sep // ',']});

    my %found;
    while (my $row = $csv->getline_hr($data_fh)) {
        $found{$_}{$row->{$_}} = undef for keys %$row;
    }

    return {map { $_ => [sort keys %{$found{$_}}] } keys %found};
}

__DATA__
head1|head2|head3
val1|val2|val3
val1|val4|val5
val6|val4|val5
val2|val7|val5
val3|val7|val3

The meat here is the csv_to_unique_within_field function. Pass it a filehandle and a separator; if no separator is provided, a comma is assumed.
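If you're reading from a real file rather than the __DATA__ handle, the call might look something like this ('report.psv' is just a placeholder filename):

open my $fh, '<', 'report.psv' or die "Can't open report.psv: $!";
my $unique_within_field = csv_to_unique_within_field($fh, '|');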

The function does this:

  1. Grab the CSV header. In my example, the header identifies fields named head1, head2, and head3.
  2. For each remaining row of CSV data, populate our %found hash with hash keys for each field value within each header. I'm giving the keys 'undef' as the value to which they point. It's not important what they contain. We're using the keys as a uniqueness filter.
  3. Return a new, transformed hash where each header key points to a reference to an array containing the unique values (see the sketch after this list).
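For the sample __DATA__ above, the intermediate %found hash and the returned structure would look roughly like this (written out by hand for illustration):

my %found = (
    head1 => { val1 => undef, val2 => undef, val3 => undef, val6 => undef },
    head2 => { val2 => undef, val4 => undef, val7 => undef },
    head3 => { val3 => undef, val5 => undef },
);

# ...which the final map transforms into the returned hashref:
my $unique_within_field = {
    head1 => ['val1', 'val2', 'val3', 'val6'],
    head2 => ['val2', 'val4', 'val7'],
    head3 => ['val3', 'val5'],
};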

After this I just print each header and the unique values it contained. Since we filtered in only the unique values per header, it's just a straightforward data-structure print.
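Given the __DATA__ section above, the output should look like this:

head1: val1, val2, val3, val6
head2: val2, val4, val7
head3: val3, val5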


Dave


In reply to Re: Get unique fields from file by davido
in thread Get unique fields from file by sroux
