Deicide has asked for the wisdom of the Perl Monks concerning the following question:

Ok, I’m a beginner and need help with code and direction on how to handle this problem. I have 2 CSV files. In file 1, the data I need is in column D, but it’s not the only data in column D; the data I need in each cell is in the format “first last(G123456)”. The data I need from file 2, in column A, is in the format “first.last(G786342)”. Ultimately, the data I need is just the cells from file 2 whose ID, e.g. (G783212), does not appear anywhere in file 1. I can’t depend on a whole-cell comparison because the text in front of the ID number might be different. Each of the columns might have lots of duplicates, so reducing each one before the comparison might save time; there are a few thousand records, 3 to 4 thousand maybe. I’ve tried reducing each column to unique entries, then combining them and reducing them again to remove all duplicates (including the originals), leaving only the entries without duplicates, but I was hoping there was an easier way.

Replies are listed 'Best First'.
Re: 2 files
by haukex (Archbishop) on Dec 05, 2018 at 22:42 UTC

    Welcome to Perl and the Monastery, Deicide!

    Please have a look at How do I post a question effectively? and Short, Self-Contained, Correct Example: at the very least, please provide some sample input that demonstrates what you are describing, the expected output for that input, and if you have it, any code you've tried, plus the output you're getting from that code, including exact error messages. Each of these should be placed in <code> tags for readability.

    In general, for handling CSV data, it's best to use a module like Text::CSV (also install Text::CSV_XS for speed). It sounds like you'll want to parse the first file, and as you're doing so, parse the cells in question using a regex - if the string you want to match on is always in parentheses, as in "first last(G123456)", then perhaps you could use Regexp::Common::balanced to help with that (but we'd need to see some more samples of input data to be sure). Then, assuming you want exact matches, perhaps you could build a hash of the values you've seen (with the values as the hash keys, for easy lookup), and as you're parsing the second file, look up matching strings in the hash. But this is just a rough idea based on the description you've provided so far; as you give more details, better methods might emerge.
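    The hash-lookup idea above can be sketched as follows. This is a minimal illustration only, using hypothetical in-memory cell values instead of real files (a real script would pull the cells out of the CSV files with Text::CSV, and the names shown are made up):

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical sample cell values, mimicking the two formats in the question.
my @file1_cells = ('john smith(G123456)', 'jane doe(G123457)');
my @file2_cells = ('john.smith(G123456)', 'ann.lee(G999999)');

# Pass 1: extract just the parenthesized ID from each file-1 cell, so the
# differing name formats ("first last" vs "first.last") don't matter.
my %seen;
for my $cell (@file1_cells) {
    $seen{$1} = 1 if $cell =~ /\((G[0-9]+)\)/;
}

# Pass 2: report file-2 IDs that never appeared in file 1.
for my $cell (@file2_cells) {
    next unless $cell =~ /\((G[0-9]+)\)/;
    say $1 unless $seen{$1};    # prints "G999999"
}
```

    The hash does double duty here: it de-duplicates file 1 automatically, and it makes each file-2 lookup a constant-time operation, so a few thousand records on each side is no problem.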

Re: 2 files
by 1nickt (Canon) on Dec 06, 2018 at 00:27 UTC

    Hi,

    • use a regular expression to extract the meaningful part of the value in the given cells
    • use the CSV-parsing library Text::CSV_XS to read in the first file and create a hash containing a key for each value that should not be duplicated
    • use the library again to read in the second file
    • use the filter option (as recently shown by choroba) to exclude rows with seen values
    • use map to reduce the wanted rows to just the values
    (Note I am using Inline::Files here to provide this demo script with CSV files. In real life, replace "\*FILE1" and "\*FILE2" with the real file names.)
use v5.014;
use Text::CSV_XS 'csv';

my $re = qr/ \A first (?:\s|\.) last \( (G[0-9]+) \) \Z /x;

my %hash = map { ($_->[3] =~ s/$re/$1/r) => 1 }
           @{ csv( in => \*FILE1, headers => 'skip' ) };

my @result = map { $_->[0] }
             @{ csv(
                 in      => \*FILE2,
                 headers => 'skip',
                 filter  => { 1 => sub { $_ =~ s/$re/$1/ and not exists $hash{$_} } },
             ) };

say "found $_" for @result;

use Inline::Files;
__FILE1__
"A","B","C","D","E"
"blah","blah","blah","first last(G123456)","blah"
"blah","blah","blah","first last(G123457)","blah"
"blah","blah","blah","first last(G123459)","blah"
__FILE2__
"A","B","C","D"
"first.last(G123456)","blah","blah","blah"
"first.last(G123457)","blah","blah","blah"
"first.last(G123458)","blah","blah","blah"
"first.last(G123459)","blah","blah","blah"
    Output:
    $ perl 1226797.pl
    found G123458

    Hope this helps!


    The way forward always starts with a minimal test.
Re: 2 files
by Laurent_R (Canon) on Dec 05, 2018 at 23:50 UTC
    Hi Deicide,

    it would be most useful to see your code, as well as some samples of your input and desired output.

    As a general rule, when you're looking for records (or parts of records) that are in file 2 but not in file 1, the best method is to read file 1 and store its data of interest in a hash, and then to scan file 2 and keep the records not matching the hash. As a bonus, the hash will also remove the duplicates.

    Not knowing anything about your input files (e.g., what is the record separator?), I can't suggest any real code. But the basic idea in pseudo-code is:

    open file 1 and read it record by record:
    for (each record) {
        get column D from the record;
        get the interesting part (the ID) from column D;
        store the interesting part in a hash, something like:
            $hash{interesting_part} = 1;
    }
    close file 1;

    open file 2 and read it record by record:
    for (each record) {
        get column A from the record;
        get the interesting part (the ID) from column A;
        if (not exists $hash{interesting_part}) {
            print the interesting part;
        }
    }
    close file 2;
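    The two-pass idea above can be turned into a runnable sketch. This assumes simple comma-separated records with no embedded commas (a real script should use Text::CSV_XS instead of a naive split), and the file contents here are hypothetical in-memory stand-ins for the two real files:

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical in-memory CSV data standing in for the two real files;
# a real script would open the actual file names instead.
my $file1 = <<'CSV';
a,b,c,"john smith(G123456)",e
a,b,c,"jane doe(G123457)",e
CSV
my $file2 = <<'CSV';
"john.smith(G123456)",b,c,d
"ann.lee(G999999)",b,c,d
CSV

my %seen;

# Pass 1: record every ID found in column D of file 1.
open my $fh1, '<', \$file1 or die $!;
while (my $record = <$fh1>) {
    my @cols = split /,/, $record;   # naive split; OK only without embedded commas
    $seen{$1} = 1 if $cols[3] =~ /\((G[0-9]+)\)/;
}
close $fh1;

# Pass 2: print IDs from column A of file 2 that file 1 never mentioned.
open my $fh2, '<', \$file2 or die $!;
while (my $record = <$fh2>) {
    my @cols = split /,/, $record;
    say $1 if $cols[0] =~ /\((G[0-9]+)\)/ and not exists $seen{$1};
}
close $fh2;
# prints "G999999"
```

    Note the use of in-memory filehandles (open on a scalar reference) just to keep the demo self-contained; with real files you would open the file names directly.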

    I (and other monks here) will be able to provide actual code if you provide the necessary information (especially sample input and desired output).