Alessandro has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks, here is my question. I have 2 files, one containing a list of IDs that looks like this:

GSAD1234
GSAD2345
GSAD4567
And another one that looks like this. It is a CSV with tab as the field delimiter (which I symbolize here with " \t ", as somehow I can't add a real tab here). Please also note that I did not forget a tab in "no match": some fields really do contain white space:
GSAD1234 \t 123 \t 45 \t no match \t fungus \t protein_x
GSAD5678 \t 123 \t 51 \t plant \t fungus \t protein_y \t transporter
It is worth mentioning this second file contains more than 50 000 lines.

I would like to extract from the second file the lines corresponding to the IDs from the first file. So here the desired output would be:

GSAD1234 \t 123 \t 45 \t no match \t fungus \t protein_x
How do I do that? I had thought of reading the second file into a hash with the IDs as key and the rest of the fields as values but I can't find a way to do it due to the multiple fields per line. So far I have read the 2 files into arrays and tried to match the lines but it doesn't work and again, I am not sure it is the right strategy. Here is the code that seems to simply output the whole csv file:
#!/usr/bin/perl
use warnings;
use strict;
use Text::CSV;
use File::Slurp;

my $csv = Text::CSV->new({ sep_char => '\t' });
#end of preparation

#read data
my $file = $ARGV[0] or die "Need to get CSV file on the command line\n";
open(my $data,'<',$file) or die "Could not open file \n";
chomp (my @strings = <$data>);
close $data;

# read ID list
my $id = 'id.txt';
my @ids = read_file("$id", chomp =>1);

foreach(@ids)
{
    my @matches = grep(/^($_)/,@strings);
    print join ",",@matches;
}
I would be grateful for any help.
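
A note on why the script above prints the whole file: inside grep, $_ is aliased to the current element of @strings, not to the ID from the enclosing foreach. Each line is therefore matched against itself, which almost always succeeds. The sketch below reproduces this on two sample lines and shows one possible fix using a named loop variable. The sub names and the sample data are made up for illustration, and the fixed version assumes the ID is always followed by a tab.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data standing in for the two input files.
my @ids     = ('GSAD1234', 'GSAD2345');
my @strings = (
    "GSAD1234\t123\t45\tno match\tfungus\tprotein_x",
    "GSAD5678\t123\t51\tplant\tfungus\tprotein_y\ttransporter",
);

# Buggy form: inside grep, $_ is the current element of the list being
# filtered, so every line is matched against itself.
sub buggy_matches {
    my ($ids, $lines) = @_;
    my @found;
    foreach (@$ids) {
        push @found, grep { /^($_)/ } @$lines;
    }
    return @found;
}

# Fixed form: a named loop variable cannot be shadowed by grep's $_.
# \Q...\E quotes any regex metacharacters; the trailing \t stops
# GSAD1234 from also matching an ID such as GSAD12345.
sub fixed_matches {
    my ($ids, $lines) = @_;
    my @found;
    foreach my $id (@$ids) {
        push @found, grep { /^\Q$id\E\t/ } @$lines;
    }
    return @found;
}

print scalar(buggy_matches(\@ids, \@strings)), "\n";   # prints 4: every line, once per ID
print scalar(fixed_matches(\@ids, \@strings)), "\n";   # prints 1: only the GSAD1234 line
```

With more than 50 000 lines, running grep over the whole array once per ID is also slow, which is one reason the hash-based replies below scale better.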

Replies are listed 'Best First'.
Re: Extracting lines starting with a pattern from an array
by choroba (Cardinal) on Dec 16, 2015 at 17:35 UTC
    I had thought of reading the second file into a hash with the IDs as key and the rest of the fields as values but I can't find a way to do it due to the multiple fields per line.
    It might be possible, by creating a hash of arrays. But it seems easier to do it the other way round, to store the first file in a hash. Then iterate over the second file and check whether the given id exists in the hash. If you need the output sorted, you might store the line number ($.) from the first file as the value in the first hash, and sort by that at the end.

    Update: Solution #2:

    #!/usr/bin/perl
    use warnings;
    use strict;

    use Text::CSV;

    open my $LST, '<', 'ids.lst' or die $!;
    my %id;
    while (<$LST>) {
        chomp;
        $id{$_} = $.;
    }

    my @out;
    my $csv = 'Text::CSV'->new({ sep_char => "\t",
                                 eol      => "\n",
                               });
    open my $CSV, '<', 'file.csv' or die $!;
    while (my $row = $csv->getline($CSV)) {
        push @out, [ $id{ $row->[0] }, $row ] if exists $id{ $row->[0] };
    }

    $csv->print(*STDOUT, $_->[1]) for sort { $a->[0] <=> $b->[0] } @out;

    Update #2: Solution #1:

    #!/usr/bin/perl
    use warnings;
    use strict;

    use Text::CSV;

    my %record;
    my $csv = 'Text::CSV'->new({ sep_char => "\t",
                                 eol      => "\n",
                               });
    open my $CSV, '<', 'file.csv' or die $!;
    while (my $row = $csv->getline($CSV)) {
        $record{ $row->[0] } = $row;
    }

    open my $LST, '<', 'ids.lst' or die $!;
    while (<$LST>) {
        chomp;
        # $record{$_} is an array reference, so print it through Text::CSV
        $csv->print(*STDOUT, $record{$_}) if exists $record{$_};
    }
    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re: Extracting lines starting with a pattern from an array
by CountZero (Bishop) on Dec 16, 2015 at 21:27 UTC
    You could combine the IDs into one regex, then read the file line by line and match the combined regex against the start of each line.

    use Modern::Perl qw/2015/;

    my @regex = <DATA>;
    chomp @regex;
    my $regex = join '|', @regex;
    $regex = qr/$regex/;

    open( my $FH, '<', 'data.txt' ) or die "Could not open file: $!";
    while ( my $line = <$FH> ) {
        print "Matched $1 at $line" if $line =~ m/^($regex)/;
    }

    __DATA__
    GSAD1234
    GSAD2345
    GSAD4567
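
A possible variation on the combined-regex idea, offered as a sketch rather than as part of the reply above: quote each ID with quotemeta and require a tab right after it, so that GSAD1234 does not also match a line starting with GSAD12345. The helper name build_id_regex and the sample lines are hypothetical, and the sketch assumes the ID is always the first tab-delimited field.

```perl
use strict;
use warnings;

# Build one anchored pattern from a list of IDs (hypothetical helper).
sub build_id_regex {
    my @ids = @_;
    # quotemeta protects against IDs containing regex metacharacters
    my $alternation = join '|', map { quotemeta($_) } @ids;
    # require a tab after the ID so a shorter ID cannot match a longer one
    return qr/^($alternation)\t/;
}

my $regex = build_id_regex(qw(GSAD1234 GSAD2345 GSAD4567));

# Sample lines: only the first one starts with a listed ID.
for my $line ("GSAD1234\t123\t45\n", "GSAD12345\t9\t9\n") {
    print "Matched $1 at $line" if $line =~ $regex;
}
```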

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics

      Thanks for the code, thanks all. But something really weird is happening... The script returns only a single match (and I know for sure there should be more than one). I have tried a few of the other scripts and they give me the same single match as well.

      However, I have written a dummy data set and the code works on it. So, knowing that my data were given to me by someone else and are derived from an Excel file, am I right to suspect that some kind of invisible characters are causing the problem?

        If you suspect that, you might try looking at the raw input data. For example, on *nix start with this

         $ od -cx input-data.txt > raw-input-data.txt

        and look at the output file.
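
The same check can be sketched in Perl. A common cause with Excel-derived files is Windows line endings: chomp removes the "\n" but leaves a trailing "\r", so the ID no longer equals the hash key. A byte-order mark at the start of the file is another candidate. Whether either applies to this particular data is an assumption, and the sub names below are hypothetical.

```perl
use strict;
use warnings;

# Make non-printable and non-ASCII characters visible as \xNN escapes.
sub show_invisible {
    my ($text) = @_;
    (my $shown = $text) =~ s/([^\x20-\x7E])/sprintf('\\x%02X', ord $1)/ge;
    return $shown;
}

# Normalise an ID line: drop a leading BOM (as raw bytes or as a decoded
# character) and any trailing whitespace, including "\r" and "\n".
sub clean_id {
    my ($line) = @_;
    $line =~ s/^(?:\xEF\xBB\xBF|\x{FEFF})//;
    $line =~ s/\s+\z//;
    return $line;
}

my $raw = "GSAD1234\r\n";    # sample line with a Windows line ending
print show_invisible($raw), "\n";
print clean_id($raw), "\n";
```

Using clean_id in place of a bare chomp when loading the ID list would make the lookup tolerant of these characters. It should not be applied to whole CSV lines, since stripping trailing whitespace there could remove tabs that mark empty trailing fields.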

Re: Extracting lines starting with a pattern from an array
by GotToBTru (Prior) on Dec 16, 2015 at 17:44 UTC
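    use strict;
    use warnings;
    use Text::CSV;

    my $csv = Text::CSV->new({ sep_char => chr(9) });    # chr(9) is a tab
    my $file = shift or die "Need a CSV file on the command line\n";
    open my $fh, '<', $file or die "Could not open $file: $!";
    my %index;
    while ( my $row = $csv->getline( $fh ) ) {
        $index{$row->[0]} = $row;    # ID => array ref holding all fields of the line
    }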
    Dum Spiro Spero
Re: Extracting lines starting with a pattern from an array
by Laurent_R (Canon) on Dec 16, 2015 at 22:13 UTC
    Perhaps something as simple as this:
    my $id_file = 'id.txt';
    my %id;
    open my $ID, "<", $id_file or die "Error opening $id_file $!";
    while (<$ID>) {
        chomp;
        $id{$_} = 1;
    }
    close $ID;
    while (<>) {    # assumes file 2 is passed to the script - to be adjusted to real conditions
        my ($key) = /^(\w+)/ or next;    # skip lines that do not start with an ID
        print if exists $id{$key};
    }