sroux has asked for the wisdom of the Perl Monks concerning the following question:
Dears,
Like many, I use Perl as a Swiss Army knife. I may understand the code and write some, but I also copy and glue pieces together to make scripts used on a desktop (as utilities) or on a server (file parsing, mapping, etc.).
I would like to write a utility that reads a file (up to 500 MB) and outputs the unique values >> for each delimited field << (what I mean is not doing a unique on the whole file content, but field by field).
It looks like hashes are a nice solution for this task, but I have never used hashes.
I found some scripts around:
    #Create hash based on header
    my %combined;
    open(my $input, "<", "flatfile.dat") or die ("Unable to open file");
    my $line = <$input>;
    chomp( $line );
    my @headers = split("|", $line);
    while (<$input>) {
        chomp;
        my @row = split("|");
        for my $header (@headers) {
            push( @{ $combined{$header} }, shift(@row) );
        }
    }
Now, how do I get the unique values for each field and produce an output file? I have this piece of code I used somewhere, but I can hardly understand it:

    #Remove duplicates
    my %seen = ();
    my @uniqueOutput = grep { ! $seen{ $_ }++ } @output;
    print $fh3 @uniqueOutput;
Thank you for any guidance you may provide.
Replies are listed 'Best First'.
Re: Get unique fields from file
by choroba (Cardinal) on Jan 06, 2022 at 12:00 UTC
If you want to split on the literal vertical bar, you need to backslash it.
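Applied to the header-splitting line from the original code, that would be, for example:

    my @headers = split /\|/, $line;    # or: split /[|]/, $line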
map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: Get unique fields from file
by johngg (Canon) on Jan 06, 2022 at 11:43 UTC
"I had this piece of code used somewhere but I hardly can understand it:"

Breaking that down into stages: %seen starts out empty, and $seen{ $_ }++ returns the count seen so far before incrementing it, so it is false (0) the first time a value turns up and true after that; the grep therefore keeps only the first occurrence of each value. My preference is to limit the scope of %seen to a do block so that it isn't left lying around. I hope this makes the process clearer for you, but ask further if something is still not clear.

Cheers, JohnGG
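For example, the same filter with %seen confined to a do block might look like this (a sketch, not the code from the original reply):

    my @output = qw( a b a c b a );

    # %seen exists only inside the do block, so it isn't left lying
    # around afterwards; the grep keeps the first occurrence of each
    # value, in the original order.
    my @uniqueOutput = do {
        my %seen;
        grep { ! $seen{ $_ }++ } @output;
    };

    print "@uniqueOutput\n";    # a b c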
Re: Get unique fields from file
by davido (Cardinal) on Jan 06, 2022 at 16:53 UTC
I didn't see specifically what you are having trouble with; it didn't seem like there was a specific question. It is correct that using a hash is a good approach.

I think the uniq function is probably not where I would immediately go, because you want unique values per field, and to use uniq for that you would have to hold the whole file in memory at once (even if there's a high rate of duplication within fields). Rather, I would do the unique filtering early on, line by line. That way, if there are a lot of collisions within any given field, you're only holding onto one instance, which could be a lot more memory friendly.

You're dealing with a style of CSV. It's '|'-separated csv, so |sv, but I prefer using a CSV parser for that so that I don't have to deal with the intricacies of embedded / escaped separators. The Text::CSV module can grab the headers for you and can already pre-organize the data into header => value pairs for you. Here's an example of how that could look; see the sketch below.

The meat here is within the csv_to_unique_within_field function. Pass the function a filehandle and a separator; if no separator is provided, assume a comma. The function reads the header row, then filters each row line by line, keeping only the first occurrence of each value within its own field:
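A sketch, assuming Text::CSV (this reconstructs the approach described rather than reproducing the post's exact code):

    use strict;
    use warnings;
    use Text::CSV;

    sub csv_to_unique_within_field {
        my ( $fh, $sep ) = @_;
        $sep //= ',';    # no separator given: assume comma

        my $csv = Text::CSV->new(
            { sep_char => $sep, binary => 1, auto_diag => 1 } );

        my @headers = @{ $csv->getline($fh) };    # grab the header row

        my ( %seen, %unique );
        while ( my $row = $csv->getline($fh) ) {
            my %record;
            @record{@headers} = @$row;            # header => value pairs

            # Filter line by line: keep only the first occurrence of a
            # value within its own field.
            for my $h (@headers) {
                push @{ $unique{$h} }, $record{$h}
                    unless $seen{$h}{ $record{$h} }++;
            }
        }
        return ( \@headers, \%unique );
    }

    open my $fh, '<', 'flatfile.dat' or die "Unable to open file: $!";
    my ( $headers, $unique ) = csv_to_unique_within_field( $fh, '|' );

    for my $h (@$headers) {
        print "$h: @{ $unique->{$h} }\n";
    }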
After this I just print the headers and all the fields they contained. Since we filtered in only the unique values per header, it's just a straightforward data-structure print.

Dave
Re: Get unique fields from file
by hippo (Bishop) on Jan 06, 2022 at 09:49 UTC
"Now how to get the unique value for each field and produce an output file. I had this piece of code used somewhere but I hardly can understand it"

It depends largely on what you mean by "get the unique value". If your set is (x, x, y), do you want the result to be (y), as that is the only value which is unique, or do you want it to be (x, y), as that is all the values but just one occurrence of each? If the latter, then the snippet which you have will suffice. Removing duplicates is explained in the FAQ How can I remove duplicate elements from a list or array?.

BTW, since what you have is really an xSV file with pipe delimiters, I would strongly suggest you have a look at Text::CSV_XS before rolling your own parser.

🦛
Re: Get unique fields from file
by kcott (Archbishop) on Jan 07, 2022 at 03:11 UTC
G'day sroux, Given the size of your data, processing speed may be a factor. The following may be faster than other proposed solutions; but do Benchmark with realistic data. I've used the same input data as others have done.
My code takes advantage of the fact that when duplicate keys are used in a hash assignment, only the last duplicate takes effect. A short piece of code to demonstrate:
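For instance, a tiny demo of the last-duplicate-wins behaviour:

    my %h = ( a => 1, b => 2, a => 3 );
    print "$h{a}\n";    # prints 3 -- the later duplicate key wins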
So there's no need for %seen, uniq(), or any similar mechanism, to handle duplicates. Also note that I've used bind_columns(). See the benchmark in "Text::CSV - getline_hr()". The code:
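In outline it could look something like this (a sketch using Text::CSV_XS and bind_columns, not necessarily the post's exact code; the file name is made up):

    use strict;
    use warnings;
    use Text::CSV_XS;

    my $csv = Text::CSV_XS->new({ sep_char => '|', binary => 1, auto_diag => 1 });

    open my $fh, '<', 'flatfile.dat' or die "Unable to open file: $!";

    # Read the header row, then bind a hash slice so that each
    # getline() fills %row directly.
    my @headers = @{ $csv->getline($fh) };
    my %row;
    $csv->bind_columns( \@row{@headers} );

    # Assigning into a hash keyed by the field value means duplicate
    # keys simply overwrite themselves -- no %seen needed.
    my %uniq;
    while ( $csv->getline($fh) ) {
        $uniq{$_}{ $row{$_} } = 1 for @headers;
    }

    for my $header (@headers) {
        print "$header: ", join( '|', sort keys %{ $uniq{$header} } ), "\n";
    }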
The output:
— Ken
[OT] Re: Get unique fields from file
by erix (Prior) on Jan 08, 2022 at 11:42 UTC
I should probably say: don't do this at home, kids. I'm not really sure what sroux is after; I have assumed the 'intention' from the solutions that others baked. For fun, I open up a postgres SQL interface to the data in this file. I'll admit this postgres spelunking is not for the faint-hearted or for DB beginners. The advantage is getting access to a text file via the power of SQL.
Running the above, the SQL that was run is echoed on-screen by psql:
Now the file is accessible as a foreign table named 'flattable'. Once it is a postgres table, you can run SQL against it, for instance this SELECT:
In case of many columns, say hundreds (it happens!), generate the needed SQL. (see psql's \gexec -- left as an exercise for the reader)
by Marshall (Canon) on Jan 11, 2022 at 12:46 UTC
I attach my multi-platform SQLite code below. There are utilities to import CSV files, but I have never bothered to install the SQLite3 command line on my machine; the Perl DBI comes with all the functions that Perl needs. I use a browser plug-in, SQLite Manager, for admin functions. Firefox removed the interface this plug-in needs, so I installed Waterfox for the single purpose of running this plug-in.

The Perl syntax for some of the SQL command strings got a little hairy, so I included a debug statement for those. If you define a constant like DEBUG_SQL and set it equal to 0, Perl will not even compile the "if" statement into the code, so there is no production runtime penalty for using this technique. I left those statements in the code as a demo of the technique.

I didn't bother with a CSV parser because there is no evidence at all that this is a CSV file. Include that if you like. SQL cannot make a table name a variable, but it can make a column name a variable. Unfortunately my SQL knowledge is inadequate at the moment, so I just made a "template" and ran a "prepare" on the adjusted template once per column. Impact on performance is zero.

There has been some discussion in this thread about whether the OP actually intended unique values or distinct values. Most posts rejected the OP's plain-English request for unique values in favor of a "distinct" interpretation (a holdover from UNIX uniq, I suppose?). I calculated both ways below.

There is an easy way to vary SQLite's memory footprint, but it has been maybe 8 years since I needed to do that. I had a heavily indexed DB and found that creating the tables first and then adding the indices later was much faster than having the DB figure that stuff out "as it went" during DB creation. It was possible to vastly increase the footprint of SQLite for this indexing operation and then reduce it later. Something like that could have a huge performance impact on this SQL code.
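A stripped-down sketch of the idea (not the attached code itself; the table and file names are made up):

    use strict;
    use warnings;
    use DBI;

    use constant DEBUG_SQL => 0;    # set to 1 to see the generated SQL

    my $dbh = DBI->connect( "dbi:SQLite:dbname=:memory:", "", "",
                            { RaiseError => 1, AutoCommit => 1 } );

    open my $fh, '<', 'flatfile.dat' or die "Unable to open file: $!";
    chomp( my $header = <$fh> );
    my @cols = split /\|/, $header;

    # One TEXT column per header field.
    my $col_defs = join ', ', map { qq{"$_" TEXT} } @cols;
    $dbh->do(qq{CREATE TABLE flat ($col_defs)});

    my $insert = $dbh->prepare(
        'INSERT INTO flat VALUES (' . join( ',', ('?') x @cols ) . ')' );

    $dbh->begin_work;
    while ( my $line = <$fh> ) {
        chomp $line;
        $insert->execute( split /\|/, $line, -1 );
    }
    $dbh->commit;

    # One query template per interpretation; the column name is edited
    # into the template, which is then prepared once per column.
    my %template = (
        distinct => 'SELECT DISTINCT "COL" FROM flat ORDER BY "COL"',
        unique   => 'SELECT "COL" FROM flat GROUP BY "COL" '
                  . 'HAVING COUNT(*) = 1 ORDER BY "COL"',
    );

    for my $col (@cols) {
        for my $label ( sort keys %template ) {
            ( my $sql = $template{$label} ) =~ s/COL/$col/g;
            print "SQL: $sql\n" if DEBUG_SQL;
            my $vals = $dbh->selectcol_arrayref($sql);
            print "$col ($label): @$vals\n";
        }
    }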
Update: Oh, darn I see that val3 should be unique in head1. Not sure where the bug is.
by erix (Prior) on Jan 11, 2022 at 15:58 UTC
That's nice, too, but it is essentially different from my example: your code actually /imports/ data, whereas mine only configures a table interface on top of an underlying text file, so that you can run any SQL against that underlying text table data (one could also read a zipped text file in this way). Let's just say SQL too has more than one way to do it ;)
by Marshall (Canon) on Jan 11, 2022 at 17:07 UTC
by erix (Prior) on Jan 11, 2022 at 17:38 UTC
Re: Get unique fields from file
by tybalt89 (Monsignor) on Jan 07, 2022 at 02:29 UTC
At least this does not fail any of the test cases you have provided :)
Outputs:
by Marshall (Canon) on Jan 07, 2022 at 03:49 UTC
I came up with a representation (at this post) where each column value occurs only once as a hash key, and the value of each hash key is an array describing, for each column, whether the value does not appear at all, appears exactly once, or appears more than once.
We both interpreted "unique" to mean different things. My data structure:

Of course I could generate your same output from my data structure, because I know the columns where the term appeared more than once.
by LanX (Saint) on Jan 08, 2022 at 15:27 UTC
This shouldn't be a problem if you apply a sliding window technique° plus splitting the hashes into easily swappable chunks². The trick is to balance time, space and disk access by minimizing the number of swaps. This will scale well, up to the limit given by disk space.

Cheers Rolf

°) see
²) see
by tybalt89 (Monsignor) on Jan 07, 2022 at 04:49 UTC
Yep. I couldn't tell if what was wanted was "unique in row", "unique in column" or "only singletons in column",
so I just thought I'd toss something out there.
by Marshall (Canon) on Jan 07, 2022 at 10:33 UTC | |
Re: Get unique fields from file
by Anonymous Monk on Jan 06, 2022 at 15:50 UTC
Assuming your Perl is at least semi-modern, you can do

    use List::Util 'uniq';
    my @uniqueOutput = uniq( @output );

List::Util has been in core Perl since version 5.7.3, and uniq() has been in List::Util since List::Util 1.45. There are also variants (uniqnum(), etc.) should they be more appropriate to your use.
Re: Get unique fields from file
by Marshall (Canon) on Jan 06, 2022 at 23:51 UTC
So, this code focuses on one column at a time. It reads the whole file and counts the number of times things in the "current column of interest" are "seen". When that is finished, the "unique" values are the ones which were only seen once. Then the file pointer is moved back to the beginning of the data, and the next column is processed. If you have a large number of columns, this will be slow, but speed may not matter much; you don't say.

You are probably fine splitting on /\|/ (regex for |). I work with database files all the time which are |-delimited and are not CSV files, meaning, amongst other things, that | is not a legal data value. If this file really is a CSV file, then you should use Text::CSV to parse it for you (the rules are incredibly tricky). But if it's not a CSV file, then a simple split is fine. If you see " marks in the file, it probably is a CSV file.
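The multi-pass idea might look something like this (a sketch, not the code attached to the original post; the file name is assumed):

    use strict;
    use warnings;

    my $file = 'flatfile.dat';
    open my $fh, '<', $file or die "Unable to open $file: $!";

    chomp( my $header = <$fh> );
    my @cols     = split /\|/, $header;
    my $data_pos = tell $fh;          # remember where the data rows start

    for my $i ( 0 .. $#cols ) {
        seek $fh, $data_pos, 0;       # rewind to the first data row
        my %count;
        while ( my $line = <$fh> ) {
            chomp $line;
            my @fields = split /\|/, $line, -1;
            $count{ $fields[$i] // '' }++;
        }

        # "unique" in the plain English sense: seen exactly once
        my @unique = grep { $count{$_} == 1 } sort keys %count;
        print "$cols[$i]: @unique\n";
    }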
Update: I did reconsider how to do this in a single pass with minimal memory. A 1/2 GB file is "big" in terms of memory when you consider that the internal Perl representation of that thing, as perhaps multiple hash tables (one per column), could be quite large and cause an "out of memory" error. In an attempt to reduce the hash-table footprint, I came up with a hash-of-arrays implementation where the keys are a global collection of all values in all columns and the value is an array of the columns that value was seen in. Depending upon the data, this table could fit into RAM.

I can see an important specification issue here: what is meant by "unique"? I interpreted "unique" in the normal English sense, "one of a kind". Some folks have interpreted it as "non-repeating values", "don't repeat yourself". Those are very different interpretations. I am not sure what the OP wants.
Here is what that code looks like:
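In sketch form, the single-pass, hash-of-arrays idea described above could look like this (a reconstruction, not the post's own code; per-column counts are kept so that both the "unique" and the "distinct" interpretations can be produced):

    use strict;
    use warnings;

    open my $fh, '<', 'flatfile.dat' or die "Unable to open file: $!";

    chomp( my $header = <$fh> );
    my @cols = split /\|/, $header;

    # One global hash: key = field value, value = array of per-column counts.
    my %seen;
    while ( my $line = <$fh> ) {
        chomp $line;
        my @fields = split /\|/, $line, -1;
        for my $j ( 0 .. $#cols ) {
            $seen{ $fields[$j] // '' }[$j]++;
        }
    }

    for my $i ( 0 .. $#cols ) {
        my @unique   = grep { ( $seen{$_}[$i] // 0 ) == 1 } sort keys %seen;
        my @distinct = grep {   $seen{$_}[$i]             } sort keys %seen;
        print "$cols[$i] unique:   @unique\n";
        print "$cols[$i] distinct: @distinct\n";
    }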