AWallBuilder has asked for the wisdom of the Perl Monks concerning the following question:

Hello all, Basically I have a list of gene_ids that I do not like (%empty_geneids). And I want to remove all lines in a second file ($ptt_file) that contain these gene_ids. I have tried many variations of this code, all in perl, then 'regressing' to trying to use sed. I know this last attempt using sed is VERY wrong. but I think it explains what I would like to do. I am relatively new so could use some suggestions. thank you

use strict; use warnings; my $list_file="/g/Viruses/prophage_data/emptySeqList_aa.txt"; my %empty_geneids; my @header_columns; my $gene_id; my @ptt_columns; open (IN,$list_file); ## list of 83 to remove from ptt file while(<IN>) { chomp; my ($header,$descript)=split(/\s+/,$_); $header=~ s/^>//g; @header_columns=split(/_/,$header); $gene_id=$header_columns[0]."_".$header_columns[1]; $empty_geneids{$gene_id}=1; print "empty gene_id from list file\t$gene_id\n"; } close (IN); my $ptt_file="/g/Viruses/prophage_data/prophage_region.ptt1"; my @ptt_lines; my @non_empty_lines; my $ptt_file2; open (IN,$ptt_file); @ptt_lines=<IN>; chomp(@ptt_lines); close (IN); open (OUT,">$ptt_file.noEmpties"); foreach my $k1 (keys %empty_geneids){ `sed '/$k1/d' $ptt_file > $ptt_file2`; `mv $ptt_file2 $ptt_file`; }

Replies are listed 'Best First'.
Re: compare lists and delete unwanted from file
by jwkrahn (Abbot) on Mar 13, 2012 at 15:45 UTC

    Perhaps you want something like this (UNTESTED):

    use strict; use warnings; my $list_file = '/g/Viruses/prophage_data/emptySeqList_aa.txt'; my $ptt_file = '/g/Viruses/prophage_data/prophage_region.ptt1'; ## list of 83 to remove from ptt file open my $IN, '<', $list_file or die "Cannot open '$list_file' because: + $!"; my %empty_geneids; while ( <$IN> ) { my ( $header ) = split; if ( /^>([^_]+_[^_]+)/ ) { $empty_geneids{ $1 } = 1; print "empty gene_id from list file\t$1\n"; } } close $IN; my $empty_geneids_pattern = join '|', keys %empty_geneids; open my $PTT, '<', $ptt_file or die "Cannot open '$ptt_file' because: +$!"; open my $OUT, '>', "$ptt_file.noEmpties" or die "Cannot open '$ptt_fil +e.noEmpties' because: $!"; while ( <$PTT> ) { next if /$empty_geneids_pattern/; print $OUT $_; } close $OUT; close $PTT;
Re: compare lists and delete unwanted from file
by kcott (Archbishop) on Mar 13, 2012 at 18:22 UTC

    You can use Tie::File for the deletion of the unwanted records.

    use Tie::File; ... tie my @ptt_records, q{Tie::File}, $ptt_file or die $!; foreach my $k1 (keys %empty_geneids){ @ptt_records = grep { ! /$k1/ } @ptt_records; } untie @ptt_records;

    Note that you don't need an intermediate file ($ptt_file2) with this technique.

    Also, Tie::File uses a very small memory footprint which may be a bonus given you're working with biological data.

    -- Ken

Re: compare lists and delete unwanted from file
by muppetjones (Novice) on Mar 13, 2012 at 18:17 UTC

    Without knowing how the data is formated, it's hard to give an exact solution, but here it goes.

    (Note: as I'm sure you know, everything in Perl can be done a million different ways. I prefer to use hash and array references and do everything in one or two regular expressions when possible.)

    First, read in your file and store the unwanted ids:
    ## open the file and read in data my $list_file = '/g/Viruses/prophage_data/emptySeqList_aa.txt'; ## try to use single quotes when ## you don't need string interpolation, ## e.g., no variables or "\n" open (my $fh, '<', $list_file); ## it is often preferable to use a ## variable to store a filehandle my @lines = <$fh>; # reads entire file in one go ## This is technically bad form, ## but assuming your file isn't too big, it's fine close ($fh); my $text = join ('', @lines); # combines all lines into one string ## Here's where your file format will change the code ## I'm assuming nothing is in the file but gene ids, ## and that each id consists of letters, numbers, and underscores. ## This regex will identify all geneids (using \w+) ## and store them as hash keys. my $geneids_to_remove = {}; # create a hash reference $text =~ s/(\w+) (?{ # in regex code $geneids_to_remove->{$1} = 1; # store geneids in a hash }) //gx;

    Now, we read in your other file -- there are two options here:
    1) do it per line or
    2) do it all at once.

    #### Per line #### my $ptt_file="/g/Viruses/prophage_data/prophage_region.ptt1"; open ($fh, '<', $ppt_file); ## precompile a regex to capture the geneid on each line ## I assume the gene id is the first thing on each line my $gene_id; my $rx_find_geneid = qr/^(\w+) (?{ $gene_id = $1; })/x; ## I prefer to avoid $_ for clarity my $saved_lines = ''; while (my $line = <$fh>) { ## run precompiled regex $line =~ /$rx_find_geneid/; ## check to see if it exists in the hash ## if not, save it if (! exists $geneids_to_remove->{$gene_id}) { $saved_lines .= $line; } } close ($fh);
    or (my preference)
    #### One big regex #### ## don't do this and the previous ## read in file my $ptt_file="/g/Viruses/prophage_data/prophage_region.ptt1"; open ($fh, '<', $ppt_file); @lines = <$fh>; close ($fh); $text = join ('', @lines); ## you don't need to precompile this -- it's for clarity ## and in case you ever want to remove these from multiple ## files, i.e., put it in a loop ## Again, I assume the geneid is at the front of the line. my $saved_lines = ''; my $rx_rm_lines = qr/ (^(\w+).+$ [\r\n]) (?{ if (! exists $geneids_to_remove->{$2}) { $saved_lines .= $1; } }) /xm; # the 'm' modifier enables multiline regex ## run the regex (you can use s/.../$1/g if you ## don't want to destroy the string as you search $text =~ s/$rx_rm_lines//g;
    Now write it out (regardless of which method you used)
    ## write out saved data open ($fh, '>', $outfile); print $fh $saved_lines; close ($fh);

      Thankyou for your help. I have been trying to get a handle on hash references so this will be good. However, I don't understand the code snippet below. Specifically, what is the "(?" and the //gw.

      my $geneids_to_remove = {}; # create a hash reference $text =~ s/(\w+) (?{ # in regex code $geneids_to_remove->{$1} = 1; # store geneids in a hash }) //gx;

      Can you also explain the part below in more detail. Specifically, what does the /x do?

      my $rx_find_geneid = qr/^(\w+) (?{ $gene_id = $1; })/x;

        Of course!

        The (?{ ... }) is how you insert code into a regex. Whenever you see (? ... ), it generally means you are doing something with whatever is found inside, but you don't want to capture it. For instance (?: ... ) works just like a capture group but without actually capturing.

        Whenever you insert code into a regex, it's always a good idea to add the /x modifier, which tells the regex compiler to ignore all whitespace. This means you'll either need to escape whitespace (i.e., '\ '), or (even better) just use \s.

        The /g modifier means to match globally. This does different things depending on whether you're using s/// or m//. The former does it all at once, but the latter simply continues matching wherever the previous match left off, meaning you need a loop to match everything.