in reply to Re: Extacting lines where one column matches a name from a list of names
in thread Extacting lines where one column matches a name from a list of names

I understand. I believe that using hash tables would speed things up significantly. I don't need any sort of partial matching (i.e. regex), because my data has been designed such that the names in my list exactly match the corresponding entries in the table.

The list of names looks like this:

1000567/1
1000567/2
1000574/1
1000574/2

And the full data (from which I want to extract the matching lines) looks like this:

I have removed the tabs and replaced them with commas, so that each entry is a single string:
1083978/2,284224,284292,chrX,255,+,284224,284292,255,0,0,1,68,0
122854/1,284224,284277,chrX,255,+,284224,284277,255,0,0,1,53,0
641613/1,284224,284290,chrX,255,+,284224,284290,255,0,0,1,66,0


Separately, I have the first column as a matching "hash file":
1083978/2
122854/1
641613/1

So now, I will use the hash file and the comma-separated data to create hash and key pairs. I will put the hashes in a table. And finally, I will compare my hashes to my list of names using regex matching. For the hashes that match, I will fetch the keys and replace the commas with tabs, thus giving me back my original data. Does this seem sensible?

Re^3: Extacting lines where one column matches a name from a list of names
by Marshall (Canon) on Sep 13, 2019 at 04:54 UTC
    Does this seem sensible? No, it does not.

    I think you want to select particular lines of interest from the input for further processing? This simple code does that.

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Inline::Files;  # Just for this Demo
                            # so that this is a single runnable file
        # Monk node_id=11106111

        my %nameValid;   # initialize Name Hash...
                         # Name the hash by what the Value Means,
                         # not by what the hash key means
        while (my $name = <LISTOFNAMES>)
        {
            chomp $name;            # shorter idioms exist
            $nameValid{$name}=1;    # but this is just fine
        }

        while (my $line = <DATA>)
        {
            chomp $line;
            my ($name,$data) = split ",",$line,2;
            if ( exists $nameValid{$name} )
            {
                # name is in "list" do something
                # I have no idea what to do?
                # so I just print that line
                print "$name => $data\n";
            }
        }
        # prints the one line of interest:
        # 122854/1 => 284224,284277,chrX,255,+,284224,284277,255,0,0,1,53,0
        __LISTOFNAMES__
        1000567/1
        1000567/2
        122854/1
        1000574/1
        1000574/2
        __DATA__
        1083978/2,284224,284292,chrX,255,+,284224,284292,255,0,0,1,68,0
        122854/1,284224,284277,chrX,255,+,284224,284277,255,0,0,1,53,0
        641613/1,284224,284290,chrX,255,+,284224,284290,255,0,0,1,66,0
    Please see Wiki Article on regex (Regular Expression).
    You may be able to use the -f option on grep to do what you want?
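The same hash lookup should also work directly on the original tab-separated lines, which would make the round trip through commas unnecessary. Here is a minimal sketch of that idea, not a tested solution for the real files: the helper name select_lines is made up for illustration, and the demo reads from in-memory strings with shortened records where the real program would open the name list and the data table.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns every line read
# from $fh whose first tab-separated field is a key of %$valid.
sub select_lines {
    my ($valid, $fh) = @_;
    my @kept;
    while (my $line = <$fh>) {
        chomp $line;
        my ($name) = split /\t/, $line, 2;   # first field only
        push @kept, $line if defined $name && exists $valid->{$name};
    }
    return @kept;
}

# Demo input held in strings; shortened records, assumed tab-separated.
my $names = "1000567/1\n122854/1\n";
my $table = join "\n",
    join("\t", '1083978/2', 284224, 284292, 'chrX'),
    join("\t", '122854/1',  284224, 284277, 'chrX'),
    '';

# One hash key per name of interest.
my %nameValid = map { $_ => 1 } split /\n/, $names;

open my $fh, '<', \$table or die "cannot open in-memory file: $!";
my @hits = select_lines(\%nameValid, $fh);
close $fh;

print "$_\n" for @hits;   # lines are printed unchanged, tabs intact
```

Because each selected line is printed exactly as it was read, there is nothing to convert back afterwards.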
Re^3: Extacting lines where one column matches a name from a list of names (updated)
by AnomalousMonk (Archbishop) on Sep 13, 2019 at 06:22 UTC

    One feature of your example data here is that the field of interest (the "name" field) is always the first field in the record, i.e., always at the start of a string read from a file. This anchor can be very useful. If you build a regex to match all the names "of interest" (see haukex's article Building Regex Alternations Dynamically), it's a one-pass process to read all records in a file and match and extract only those records of interest. (Update: This approach assumes that the $rx_sep field separator pattern cannot possibly appear in a "name" field!)

        c:\@Work\Perl\monks>perl
        use strict;
        use warnings;

        my @names = qw(1000567/1 1000567/2 122854/1 1000574/2);

        my $rx_sep = qr{ , }xms;  # adjust to match real field separator

        my ($rx_interesting) =
            map qr{ \A (?: $_) (?= $rx_sep) }xms,
            join q{ | },
            map quotemeta,  # proper!
            reverse sort    # order!
            @names
            ;
        print "$rx_interesting \n";  # for debug

        my @data = (
            '1083978/2,284224,284292,chrX,255,+,284224,284292,255,0,0,1,68,0',
            '122854/1,284224,284277,chrX,255,+,284224,284277,255,0,0,1,53,0',
            '641613/1,284224,284290,chrX,255,+,284224,284290,255,0,0,1,66,0',
            );

        while (my $datum = shift @data) {
            print "interesting: >$datum< \n" if $datum =~ $rx_interesting;
            }
        __END__
        (?msx-i: \A (?: 122854\/1 | 1000574\/2 | 1000567\/2 | 1000567\/1) (?= (?msx-i: , )) )
        interesting: >122854/1,284224,284277,chrX,255,+,284224,284277,255,0,0,1,53,0<
    Note that $rx_sep has to be adjusted to match whatever field separator your data records actually use.
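For the original tab-separated data, the adjustment might look like the sketch below. It assumes a single tab between fields, and the shortened records, including the extra 122854/10 line, are invented to show that the lookahead keeps a name from matching a longer name that merely starts with it.

```perl
use strict;
use warnings;

my @names  = qw(1000567/1 122854/1);
my $rx_sep = qr{ \t }xms;   # assumed: fields separated by one tab

# Same construction as in the post, split into two steps for readability.
my $alternation    = join q{ | }, map quotemeta, reverse sort @names;
my $rx_interesting = qr{ \A (?: $alternation ) (?= $rx_sep ) }xms;

# Invented, shortened records for illustration only.
my @data = (
    "1083978/2\t284224\t284292",
    "122854/1\t284224\t284277",
    "122854/10\t284224\t284277",   # longer name: must not match
);

my @hits = grep { $_ =~ $rx_interesting } @data;
print "interesting: >$_<\n" for @hits;
```

Only the 122854/1 record should be reported, since the name must be followed immediately by the separator.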

    Update 1: I didn't notice that haukex already suggested this approach here. Oh well... At least you have a worked example :)

    Update 2: I've noticed a stupid mistake in my code as originally posted. A part of the sequence of operations to build $rx_interesting was incorrectly given as
        reverse sort
        map quotemeta,
    The code has been corrected. The error (quotemeta-ing before sort-ing) should have made no difference in this particular application, but there are corner cases (update: in other potential applications) in which it would (although I'm unable to think of a good example of such a case ATM).
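As background on why the names are reverse sorted at all: Perl tries the alternatives of an alternation from left to right and takes the first one that lets the match succeed, so when one name is a prefix of another, the longer name generally needs to come first. The sketch below illustrates this with two invented names and a hypothetical helper, first_match, and it deliberately omits the trailing lookahead used in $rx_interesting. With that lookahead in place, the regex engine would backtrack to the longer alternative anyway, so the ordering mostly matters in applications that lack such a trailing anchor.

```perl
use strict;
use warnings;

# Invented names, one a prefix of the other.
my @names = qw(12/1 12/10);

# Hypothetical helper: builds an alternation from the names in the
# order given (no trailing lookahead) and returns what it matched at
# the start of $string, or undef if nothing matched.
sub first_match {
    my ($string, @ordered) = @_;
    my $alt = join '|', map quotemeta, @ordered;
    return $string =~ m{\A($alt)} ? $1 : undef;
}

my $record   = '12/10,284224,284292';
my $plain    = first_match($record, sort @names);          # shorter name tried first
my $reversed = first_match($record, reverse sort @names);  # longer name tried first

print "sorted:         $plain\n";
print "reverse sorted: $reversed\n";
```

With plain sort order the shorter name 12/1 wins even though the record's name is 12/10. With reverse sort order the full name is matched.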


    Give a man a fish:  <%-{-{-{-<