Extracting lines starting with a pattern from an array

Alessandro has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks, here is my question. I have 2 files, one containing a list of IDs that looks like this:

GSAD1234
GSAD2345
GSAD4567

And another one that looks like this. It is a CSV file with tab as the field delimiter (shown here as " \t " because I can't type a real tab; also note that I did not forget a tab in "no match": some fields really do contain white space):

GSAD1234 \t 123 \t 45 \t no match \t fungus \t protein_x
GSAD5678 \t 123 \t 51 \t plant \t fungus \t protein_y \t transporter

It is worth mentioning this second file contains more than 50 000 lines.

I would like to extract from the second file the lines corresponding to the IDs from the first file. So here the desired output would be:

GSAD1234 \t 123 \t 45 \t no match \t fungus \t protein_x

How do I do that? I had thought of reading the second file into a hash with the IDs as key and the rest of the fields as values but I can't find a way to do it due to the multiple fields per line. So far I have read the 2 files into arrays and tried to match the lines but it doesn't work and again, I am not sure it is the right strategy. Here is the code that seems to simply output the whole csv file:

#!/usr/bin/perl

use warnings;
use strict;
use Text::CSV;
use File::Slurp;


my $csv = Text::CSV->new({ sep_char => '\t' });
#end of preparation


#read data
my $file = $ARGV[0] or die "Need to get CSV file on the command line\n";

open(my $data,'<',$file) or die "Could not open file \n";
chomp (my @strings = <$data>);
close $data;

# read ID list

my $id = 'id.txt';
my @ids = read_file("$id", chomp =>1);
foreach(@ids) {
    my @matches = grep(/^($_)/,@strings);
    print join ",",@matches;
    }

I would be grateful for any help.


Replies are listed 'Best First'.
Re: Extracting lines starting with a pattern from an array by choroba (Cardinal) on Dec 16, 2015 at 17:35 UTC
"I had thought of reading the second file into a hash with the IDs as key and the rest of the fields as values but I can't find a way to do it due to the multiple fields per line."

It might be possible, by creating a hash of arrays. But it seems easier to do it the other way round: store the first file in a hash, then iterate over the second file and check whether the given ID exists in the hash. If you need the output sorted, you might store the line number (`$.`) from the first file as the value in the first hash, and sort by that at the end.

Update: Solution #2:

```perl
#!/usr/bin/perl
use warnings;
use strict;

use Text::CSV;

# Remember each wanted ID together with its line number in the list.
open my $LST, '<', 'ids.lst' or die $!;
my %id;
while (<$LST>) {
    chomp;
    $id{$_} = $.;
}

my @out;
my $csv = 'Text::CSV'->new({ sep_char => "\t",
                             eol      => "\n",
                           });
open my $CSV, '<', 'file.csv' or die $!;
while (my $row = $csv->getline($CSV)) {
    push @out, [ $id{ $row->[0] }, $row ] if exists $id{ $row->[0] };
}

# Print the matching rows in the order of the ID list.
$csv->print(*STDOUT, $_->[1]) for sort { $a->[0] <=> $b->[0] } @out;
```

Update #2: Solution #1:

```perl
#!/usr/bin/perl
use warnings;
use strict;

use Text::CSV;

# Store every row of the CSV file, keyed by its first field.
my %record;
my $csv = 'Text::CSV'->new({ sep_char => "\t",
                             eol      => "\n",
                           });
open my $CSV, '<', 'file.csv' or die $!;
while (my $row = $csv->getline($CSV)) {
    $record{ $row->[0] } = $row;
}

open my $LST, '<', 'ids.lst' or die $!;
while (<$LST>) {
    chomp;
    # Print through $csv; a plain print would only show the array reference.
    $csv->print(*STDOUT, $record{$_}) if exists $record{$_};
}
```

($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re: Extracting lines starting with a pattern from an array by CountZero (Bishop) on Dec 16, 2015 at 21:27 UTC
You could combine the IDs into one regex, then read the file line by line and match the combined regex against the start of each line.

```perl
use Modern::Perl qw/2015/;

my @regex = <DATA>;
chomp @regex;
my $regex = join '|', @regex;    # alternation of all IDs
$regex = qr/$regex/;

open( my $FH, '<', 'data.txt' ) or die "Could not open file: $!";
while ( my $line = <$FH> ) {
    print "Matched $1 at $line" if $line =~ m/^($regex)/;
}

__DATA__
GSAD1234
GSAD2345
GSAD4567
```

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

My blog: Imperial Deltronics
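One caveat with a plain alternation: the IDs are matched as prefixes, so an ID such as GSAD1234 would also match a line starting with GSAD12345, and any regex metacharacters in an ID would be interpreted as such. A minimal sketch of a stricter variant, assuming the ID is always the complete first field and is followed by white space or the end of the line. The helper name `build_id_regex` and the sample lines are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: build one anchored regex from a list of IDs.
# quotemeta escapes regex metacharacters; the (?=\s|\z) lookahead
# requires the ID to be the whole first field, not just a prefix.
sub build_id_regex {
    my @ids = @_;
    my $alternation = join '|', map { quotemeta($_) } @ids;
    return qr/^($alternation)(?=\s|\z)/;
}

my $regex = build_id_regex(qw(GSAD1234 GSAD2345 GSAD4567));

# Made-up sample lines standing in for the tab-separated file.
my @lines = (
    "GSAD1234\t123\t45\tno match\tfungus\tprotein_x\n",
    "GSAD12345\t1\t2\tplant\tfungus\tprotein_z\n",
    "GSAD5678\t123\t51\tplant\tfungus\tprotein_y\ttransporter\n",
);

# Keep only the lines whose first field is one of the wanted IDs.
my @matched = grep { $_ =~ $regex } @lines;
print @matched;
```

With a very long ID list, a hash lookup on the first field is likely to scale better than one large alternation, but for a few IDs the regex is fine.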
Re^2: Extracting lines starting with a pattern from an array by Alessandro (Acolyte) on Dec 17, 2015 at 16:15 UTC
Thanks for the code, and thanks to all. But something really weird is happening: the script returns only a single match, and I know for sure there should be more than one. I have tried a few other scripts and they give me the same single match. However, when I wrote a dummy data set, the code worked on it. Since my data were given to me by someone else and are derived from an Excel file, am I right to suspect that some kind of invisible characters are causing the problem?
Re^3: Extracting lines starting with a pattern from an array by u65 (Chaplain) on Dec 18, 2015 at 11:37 UTC
If you suspect that, you might try looking at the raw input data. For example, on *nix, start with this: `$ od -cx input-data.txt > raw-input-data.txt` and look at the output file.
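The same check can be done from inside Perl. Below is a minimal sketch, assuming the usual suspects from an Excel export: Windows line endings (CR LF), a byte-order mark at the start of the file, and stray leading or trailing spaces. The helper names `clean_id` and `show_invisible` and the sample string are hypothetical, and the actual culprit in the data may be something else:

```perl
use strict;
use warnings;

# Hypothetical helper: strip characters that often survive an Excel export.
sub clean_id {
    my ($id) = @_;
    $id =~ s/^\xEF\xBB\xBF//;    # UTF-8 byte-order mark, when read as raw bytes
    $id =~ s/^\x{FEFF}//;        # the same mark, when read as decoded text
    $id =~ s/\s+\z//;            # trailing CR, LF, spaces, tabs
    $id =~ s/^\s+//;             # leading white space
    return $id;
}

# Hypothetical helper: show every character outside visible ASCII
# (including spaces) as an escape such as \x{0D}, for debugging.
sub show_invisible {
    my ($text) = @_;
    (my $visible = $text) =~ s/([^\x21-\x7E])/sprintf '\\x{%02X}', ord $1/ge;
    return $visible;
}

# Made-up example of an ID line as it might come out of an Excel export.
my $raw = "\xEF\xBB\xBFGSAD1234 \r\n";
print show_invisible($raw), "\n";
print clean_id($raw), "\n";
```

If the IDs were read with chomp on a *nix system, a trailing CR would stay attached to every ID and prevent an exact match, which would be consistent with the symptom described.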
Re^4: Extracting lines starting with a pattern from an array by hippo (Archbishop) on Dec 18, 2015 at 12:19 UTC
Re: Extracting lines starting with a pattern from an array by GotToBTru (Prior) on Dec 16, 2015 at 17:44 UTC
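```perl
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ sep_char => chr(9) });    # chr(9) is the tab character
my $file = shift or die "Usage: $0 file.csv\n";
open my $fh, '<', $file or die "Could not open $file: $!";

# Index each row (an array reference) by its first field, the ID.
my %index;
while ( my $row = $csv->getline( $fh ) ) {
    $index{$row->[0]} = $row;
}
```

Dum Spiro Spero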
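The snippet above only builds the index. A minimal sketch of the whole round trip as a hash of arrays, using core `split` instead of Text::CSV, which assumes that no field is quoted or contains an embedded tab. The helper names `index_rows` and `rows_for_ids` are made up, and the sample rows are taken from the question:

```perl
use strict;
use warnings;

# Build a hash of arrays: first field => reference to all fields of the row.
sub index_rows {
    my @lines = @_;
    my %index;
    for my $line (@lines) {
        $line =~ s/\r?\n\z//;                  # drop an LF or CR LF line ending
        my @fields = split /\t/, $line, -1;    # -1 keeps trailing empty fields
        next unless @fields;                   # skip empty lines
        $index{ $fields[0] } = \@fields;
    }
    return \%index;
}

# Return the rows for the wanted IDs, in the order of the ID list.
sub rows_for_ids {
    my ($index, @ids) = @_;
    return map { $index->{$_} } grep { exists $index->{$_} } @ids;
}

my $index = index_rows(
    "GSAD1234\t123\t45\tno match\tfungus\tprotein_x\n",
    "GSAD5678\t123\t51\tplant\tfungus\tprotein_y\ttransporter\n",
);

# Print each matching row again as a tab-separated line.
print join("\t", @$_), "\n" for rows_for_ids($index, qw(GSAD1234 GSAD2345 GSAD4567));
```

Rows with a different number of fields are not a problem here, because each hash value is an array reference of whatever length the line had.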
Re: Extracting lines starting with a pattern from an array by Laurent_R (Canon) on Dec 16, 2015 at 22:13 UTC
Perhaps something as simple as this:

```perl
my $id_file = 'id.txt';
my %id;
open my $ID, "<", $id_file or die "Error opening $id_file $!";
while (<$ID>) {
    chomp;
    $id{$_} = 1;
}
close $ID;

while (<>) {    # assumes file 2 is passed to the script - to be adjusted to real conditions
    my ($key) = /^(\w+)/;    # list assignment avoids the unreliable "my ... if" construct
    print if defined $key and exists $id{$key};
}
```