Deicide has asked for the wisdom of the Perl Monks concerning the following question:

Ok, I’m a beginner and need help with code and direction on how to handle this problem. I have 2 CSV files. In file 1, the data I need is in column D, but it’s not the only data in column D; the data I need in each cell is in the format “first last(G123456)”. The data I need from file 2, in column A, is in the format “first.last(G786342)”. Ultimately, the data I need is just the cells from file 2 whose ID, e.g. (G783212), does not appear anywhere in file 1. I can’t depend on a whole-cell comparison because the text in front of the ID number might be different. Each of the columns might have lots of duplicates, so reducing each one before the comparison might save time; there are a few thousand records, 3 to 4 thousand maybe. I’ve tried reducing each column to unique entries, then combining them and reducing them again to remove all duplicates (including the originals), leaving only the entries without duplicates, but I was hoping there was an easier way.

Replies are listed 'Best First'.
Re: 2 files
by haukex (Archbishop) on Dec 05, 2018 at 22:42 UTC

    Welcome to Perl and the Monastery, Deicide!

    Please have a look at How do I post a question effectively? and Short, Self-Contained, Correct Example: at the very least, please provide some sample input that demonstrates what you are describing, the expected output for that input, and if you have it, any code you've tried, plus the output you're getting from that code, including exact error messages. Each of these should be placed in <code> tags for readability.

    In general, for handling CSV data, it's best to use a module like Text::CSV (also install Text::CSV_XS for speed). It sounds like you'll want to parse the first file, and as you're doing so, parse the cells in question using a regex - if the string you want to match on is always in parentheses, as in "first last(G123456)", then perhaps you could use Regexp::Common::balanced to help with that (but we'd need to see some more samples of input data to be sure). Then, assuming you want exact matches, perhaps you could build a hash of the values you've seen (with the values as the hash keys, for easy lookup), and as you're parsing the second file, look up matching strings in the hash. But this is just a rough idea based on the description you've provided so far; as you give more details, better methods might emerge.
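    The hash-lookup idea above can be sketched as follows. This is a minimal illustration only, using hypothetical in-memory cell values instead of real files (a real script would pull the cells out of the CSV files with Text::CSV, and the names shown are made up):

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical sample cell values, mimicking the two formats in the question.
my @file1_cells = ('john smith(G123456)', 'jane doe(G123457)');
my @file2_cells = ('john.smith(G123456)', 'ann.lee(G999999)');

# Pass 1: extract just the parenthesized ID from each file-1 cell, so the
# differing name formats ("first last" vs "first.last") don't matter.
my %seen;
for my $cell (@file1_cells) {
    $seen{$1} = 1 if $cell =~ /\((G[0-9]+)\)/;
}

# Pass 2: report file-2 IDs that never appeared in file 1.
for my $cell (@file2_cells) {
    next unless $cell =~ /\((G[0-9]+)\)/;
    say $1 unless $seen{$1};    # prints "G999999"
}
```

    The hash does double duty here: it de-duplicates file 1 automatically, and it makes each file-2 lookup a constant-time operation, so a few thousand records on each side is no problem.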

Re: 2 files
by 1nickt (Canon) on Dec 06, 2018 at 00:27 UTC

    Hi,

    • use a regular expression to extract the meaningful part of the value in the given cells
    • use the CSV-parsing library Text::CSV_XS to read in the first file and create a hash containing a key for each value that should not be duplicated
    • use the library again to read in the second file
    • use the filter option (as recently shown by choroba) to exclude rows with seen values
    • use map to reduce the wanted rows to just the values
    (Note I am using Inline::Files here to provide this demo script with CSV files. In real life, replace "\*FILE1" and "\*FILE2" with the real file names.)
use v5.014;
use Text::CSV_XS 'csv';

my $re = qr/ \A first (?:\s|\.) last \( (G[0-9]+) \) \Z /x;

my %hash = map { ($_->[3] =~ s/$re/$1/r) => 1 }
           @{ csv( in => \*FILE1, headers => 'skip' ) };

my @result = map { $_->[0] }
             @{ csv(
                 in      => \*FILE2,
                 headers => 'skip',
                 filter  => { 1 => sub { $_ =~ s/$re/$1/ and not exists $hash{$_} } },
             ) };

say "found $_" for @result;

use Inline::Files;
__FILE1__
"A","B","C","D","E"
"blah","blah","blah","first last(G123456)","blah"
"blah","blah","blah","first last(G123457)","blah"
"blah","blah","blah","first last(G123459)","blah"
__FILE2__
"A","B","C","D"
"first.last(G123456)","blah","blah","blah"
"first.last(G123457)","blah","blah","blah"
"first.last(G123458)","blah","blah","blah"
"first.last(G123459)","blah","blah","blah"
    Output:
    $ perl 1226797.pl
    found G123458

    Hope this helps!


    The way forward always starts with a minimal test.
Re: 2 files
by Laurent_R (Canon) on Dec 05, 2018 at 23:50 UTC
    Hi Deicide,

    it would be most useful to see your code, as well as some samples of your input and desired output.

    As a general rule, when you're looking for records (or parts of records) that are in file 2 but not in file 1, the best method is to read file 1 and store its data of interest in a hash, and then to scan file 2 and keep the records not matching the hash. As a bonus, the hash will also remove the duplicates.

    Not knowing anything about your input files (e.g., what is the record separator?), I can't suggest any real code. But the basic idea in pseudo-code is:

    open file 1 and read it record by record:
    for (each record) {
        get column D from the record;
        get the interesting part (the ID) from column D;
        store the interesting part in a hash, something like:
            $hash{interesting_part} = 1;
    }
    close file 1;

    open file 2 and read it record by record:
    for (each record) {
        get column A from the record;
        get the interesting part (the ID) from column A;
        if (not exists $hash{interesting_part}) {
            print the interesting part;
        }
    }
    close file 2;
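    The two-pass idea above can be turned into a runnable sketch. This assumes simple comma-separated records with no embedded commas (a real script should use Text::CSV_XS instead of a naive split), and the file contents here are hypothetical in-memory stand-ins for the two real files:

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical in-memory CSV data standing in for the two real files;
# a real script would open the actual file names instead.
my $file1 = <<'CSV';
a,b,c,"john smith(G123456)",e
a,b,c,"jane doe(G123457)",e
CSV
my $file2 = <<'CSV';
"john.smith(G123456)",b,c,d
"ann.lee(G999999)",b,c,d
CSV

my %seen;

# Pass 1: record every ID found in column D of file 1.
open my $fh1, '<', \$file1 or die $!;
while (my $record = <$fh1>) {
    my @cols = split /,/, $record;   # naive split; OK only without embedded commas
    $seen{$1} = 1 if $cols[3] =~ /\((G[0-9]+)\)/;
}
close $fh1;

# Pass 2: print IDs from column A of file 2 that file 1 never mentioned.
open my $fh2, '<', \$file2 or die $!;
while (my $record = <$fh2>) {
    my @cols = split /,/, $record;
    say $1 if $cols[0] =~ /\((G[0-9]+)\)/ and not exists $seen{$1};
}
close $fh2;
# prints "G999999"
```

    Note the use of in-memory filehandles (open on a scalar reference) just to keep the demo self-contained; with real files you would open the file names directly.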

    I (and other monks here) will be able to provide actual code if you provide the necessary information (especially sample input and desired output).