Re: compare lists and delete unwanted from file

Without knowing how the data is formatted, it's hard to give an exact solution, but here goes.

(Note: as I'm sure you know, everything in Perl can be done a million different ways. I prefer to use hash and array references and do everything in one or two regular expressions when possible.)

First, read in your file and store the unwanted ids:

## open the file and read in data
my $list_file = '/g/Viruses/prophage_data/emptySeqList_aa.txt'; 
  ## try to use single quotes when 
  ## you don't need string interpolation,
  ## e.g., no variables or "\n"
open (my $fh, '<', $list_file) or die "Cannot open $list_file: $!";
  ## it is often preferable to use a
  ## variable to store a filehandle
my @lines = <$fh>; # reads entire file in one go
  ## This is technically bad form, 
  ## but assuming your file isn't too big, it's fine
close ($fh);
my $text = join ('', @lines); # combines all lines into one string

## Here's where your file format will change the code
## I'm assuming nothing is in the file but gene ids,
## and that each id consists of letters, numbers, and underscores.
## This regex will identify all geneids (using \w+)
## and store them as hash keys.
my $geneids_to_remove = {}; # create a hash reference
$text =~ s/(\w+)
  (?{ # in regex code
    $geneids_to_remove->{$1} = 1; # store geneids in a hash
  })
//gx;
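If the embedded code block feels opaque, the same hash can be built without (?{ ... }). This is only a minimal sketch, assuming (as above) that the list file holds nothing but ids made of word characters; the sample $text is made up for illustration.

```perl
use strict;
use warnings;

## hypothetical sample standing in for the contents of the list file
my $text = "gene_001\ngene_002\n  gene_003\n";

## in list context, m//g returns every captured match at once;
## map turns that list into (id => 1) pairs for the hash reference
my $geneids_to_remove = { map { $_ => 1 } $text =~ /(\w+)/g };

print join(',', sort keys %$geneids_to_remove), "\n";
```

Unlike the s///g version, this leaves $text untouched.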

Now, we read in your other file -- there are two options here:
1) do it per line or
2) do it all at once.

#### Per line ####
my $ptt_file = '/g/Viruses/prophage_data/prophage_region.ptt1';
open ($fh, '<', $ptt_file) or die "Cannot open $ptt_file: $!";

## precompile a regex to capture the geneid on each line
## I assume the gene id is the first thing on each line
my $gene_id;
my $rx_find_geneid = qr/^(\w+) (?{ $gene_id = $1; })/x;

## I prefer to avoid $_ for clarity
my $saved_lines = '';
while (my $line = <$fh>)
{
  ## run precompiled regex
  $line =~ /$rx_find_geneid/;
  
  ## check to see if it exists in the hash
  ## if not, save it
  if (! exists $geneids_to_remove->{$gene_id})
  {
    $saved_lines .= $line;
  }    
}
close ($fh);
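One caveat with the loop above: if a line does not start with an id, $gene_id still holds the value from the previous line. Here is a sketch that sidesteps this by capturing directly in the loop. The sample lines and ids are invented, and keeping lines that have no leading id (headers, for example) is my assumption about what you want.

```perl
use strict;
use warnings;

## hypothetical data standing in for the hash and the .ptt file
my $geneids_to_remove = { gene_002 => 1 };
my @lines = (
    "gene_001\t100..200\n",
    "gene_002\t300..400\n",
    "  # a line with no leading id\n",
    "gene_003\t500..600\n",
);

my $saved_lines = '';
for my $line (@lines)
{
  ## $1 is only trusted when the match on this line succeeds
  if ($line =~ /^(\w+)/ and exists $geneids_to_remove->{$1})
  {
    next; # unwanted gene id: skip this line
  }
  $saved_lines .= $line;
}

print $saved_lines;
```

With a real file, the for loop becomes while (my $line = <$fh>) as in the per-line version.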

or (my preference)

#### One big regex ####
## use either this or the per-line version, not both

## read in file
my $ptt_file = '/g/Viruses/prophage_data/prophage_region.ptt1';
open ($fh, '<', $ptt_file) or die "Cannot open $ptt_file: $!";
@lines = <$fh>;
close ($fh);
$text = join ('', @lines);

## you don't need to precompile this -- it's for clarity
## and in case you ever want to remove these from multiple
## files, i.e., put it in a loop
## Again, I assume the geneid is at the front of the line.
my $saved_lines = '';
my $rx_rm_lines = qr/
  (^(\w+).+$ [\r\n])
  (?{
    if (! exists $geneids_to_remove->{$2})
    {
      $saved_lines .= $1;
    }
  })
/xm; # the 'm' modifier makes ^ and $ match at every line boundary

## run the regex (use s/.../$1/g instead if you
## don't want to destroy the string as you search)
$text =~ s/$rx_rm_lines//g;

Now write it out (regardless of which method you used)

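## write out saved data
## ($outfile is not set above; point it at your output path first)
open ($fh, '>', $outfile) or die "Cannot open $outfile: $!";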
print $fh $saved_lines;
close ($fh);

Re^2: compare lists and delete unwanted from file by Anonymous Monk on Mar 13, 2012 at 21:30 UTC
Thank you for your help. I have been trying to get a handle on hash references, so this will be good. However, I don't understand the code snippet below. Specifically, what are the "(?" and the //gx?

my $geneids_to_remove = {}; # create a hash reference
$text =~ s/(\w+)
  (?{ # in regex code
    $geneids_to_remove->{$1} = 1; # store geneids in a hash
  })
//gx;

Can you also explain the part below in more detail? Specifically, what does the /x do?

my $rx_find_geneid = qr/^(\w+) (?{ $gene_id = $1; })/x;
Re^3: compare lists and delete unwanted from file by muppetjones (Novice) on Mar 14, 2012 at 17:28 UTC
Of course! The (?{ ... }) construct is how you insert code into a regex; the code runs whenever the regex engine reaches that point in the pattern. In general, a group that opens with (? is an extended construct: it does something special with whatever is inside, but it does not capture it. For instance, (?: ... ) groups just like ordinary parentheses but without actually capturing.

Whenever you insert code into a regex, it's a good idea to add the /x modifier, which tells the regex compiler to ignore unescaped whitespace in the pattern (outside of character classes) and lets you add # comments. This means that to match actual whitespace you'll either need to escape it (i.e., '\ ') or (even better) just use \s, which matches any whitespace character.

The /g modifier means to match globally. This does different things depending on whether you're using s/// or m//. The former replaces every match in one go. The latter returns all matches at once in list context, but in scalar context it simply continues matching wherever the previous match left off, meaning you need a loop to match everything.
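A small sketch of the points above, in case seeing them side by side helps. The sample strings and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = "id_1 id_2 id_3";

## /x: whitespace in the pattern is ignored,
## so \s is needed to match the real space between ids
my ($first, $second) = $text =~ / (\w+) \s (\w+) /x;

## (?: ... ) groups without capturing: only the digits come back
my @captured = "ab12" =~ /(?:ab)(\d+)/;

## m//g in list context returns all matches at once
my @all = $text =~ /(\w+)/g;

## m//g in scalar context matches once per call, so it needs a loop
my @looped;
while ($text =~ /(\w+)/g)
{
  push @looped, $1;
}

## (?{ ... }) runs code each time the regex engine reaches it
my $count = 0;
() = $text =~ / \w+ (?{ $count++ }) /gx;

print "first two: $first $second\n";
print "captured:  @captured\n";
print "all:       @all\n";
print "looped:    @looped\n";
print "count:     $count\n";
```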