in reply to Extacting lines where one column matches a name from a list of names

The first issue is that Perl Monks is a forum for helping folks learn Perl so that they get better at doing it themselves; it is not a general code-writing service. However, if you make some demonstrated attempt at this yourself, you will probably get lots of help. So one issue is how to get started. Do you have any programming experience at all? I am unsure what the very best current books for beginners are, but I remember an O'Reilly book named Beginning Perl. I hope other Monks can make additional suggestions.

Your plan is a bit confusing to me, owing to your use of the term "hash keys", which I associate with something very specific and which might not be what you meant.

In general, a data search problem of this type is solved by keeping your "list of names" in memory in an efficiently searchable form. A hash table is often used for this. The question "Is name1 in my list of names?" can then be answered very quickly.

Things become more complex if some amount of "sort of matches" is allowed. Answering the question "Does name1 look like something in my list of names?" can be complex or computationally expensive. I'd have to have some example data to make a concrete recommendation.

So, if an exact match to one of the words in your list is required, then a simple hash table of your names would suffice. Read a line of data, decide if the name matches and, if so, do "something" with it; otherwise skip that line (do nothing). Read the next line, rinse, repeat.
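
The exact-match loop described above could be sketched roughly as follows. This is only an illustration, assuming comma-separated records with the name in the first field; the sub name select_wanted and the sample arrays are made up for this sketch and are not from your data.

```perl
use strict;
use warnings;

# Hypothetical helper: return the records whose first field
# is one of the wanted names (exact match only).
sub select_wanted {
    my ($names, $records) = @_;

    # Build the lookup table once; each later test is a single hash lookup.
    my %wanted = map { $_ => 1 } @$names;

    my @hits;
    for my $record (@$records) {
        my ($name) = split /,/, $record, 2;    # first field only
        push @hits, $record if defined $name && exists $wanted{$name};
    }
    return @hits;
}

# Made-up sample data, shortened for the example.
my @names   = ('1000567/1', '122854/1');
my @records = (
    '1083978/2,284224,284292,chrX',
    '122854/1,284224,284277,chrX',
);

print "$_\n" for select_wanted(\@names, \@records);
```

With the sample data this prints only the 122854/1 record. For large files you would read the records line by line from a filehandle instead of holding them in an array; only the list of names needs to stay in memory.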

Please give some more detail. Then we can discuss "what to do" (the processing algorithm) in more detail. Along the way, you will need to do quite a bit of learning on your own about "how to do it". A good, and perhaps stepwise, plan should be of interest to you, along with some books and other material to read in order to get started.

Update: I guess one starting point would be to try to translate your awk code into Perl. The enormous execution time suggests to me that you have a very inefficient algorithm for determining whether a name is relevant or not. How you are currently making that decision is one of my main points above.


Replies are listed 'Best First'.
Re^2: Extacting lines where one column matches a name from a list of names
by mr_clean (Initiate) on Sep 13, 2019 at 03:50 UTC
    I understand. I believe that using hash tables would speed things up significantly. I don't need "sort of" matches, because my data has been designed such that each entry in my list of names exactly matches the corresponding entry within the table (i.e. regex).

    The list of names looks like this:

    1000567/1
    1000567/2
    1000574/1
    1000574/2

    And the total data (from which I want to extract the matching columns) looks like this:

    I have removed the tabs and replaced them with commas, so that each entry is in one string
    1083978/2,284224,284292,chrX,255,+,284224,284292,255,0,0,1,68,0
    122854/1,284224,284277,chrX,255,+,284224,284277,255,0,0,1,53,0
    641613/1,284224,284290,chrX,255,+,284224,284290,255,0,0,1,66,0


    Separately, I have the first column as a matching "hash file"
    1083978/2
    122854/1
    641613/1

    So now, I will use the hash file and the comma-separated data to create hash and key pairs. I will put the hashes in a table. And finally, I will compare my hashes to my list of names using regex matching. For the hashes that match, I will fetch the keys and replace commas with tabs, thus giving me my original data. Does this seem sensible?
      "Does this seem sensible?" No, it does not.

      I think you want to select particular lines of interest from the input for further processing? This simple code does that.

      #!/usr/bin/perl
      use strict;
      use warnings;
      use Inline::Files;   # Just for this Demo
                           # so that this is a single runnable file
      # Monk node_id=11106111

      my %nameValid;   # initialize Name Hash...
                       # Name the hash by what the Value Means,
                       # not by what the hash key means

      while (my $name = <LISTOFNAMES>)
      {
         chomp $name;             # shorter idioms exist
         $nameValid{$name}=1;     # but this is just fine
      }

      while (my $line = <DATA>)
      {
         chomp $line;
         my ($name,$data) = split ",",$line,2;
         if ( exists $nameValid{$name} )
         {
            # name is in "list" do something
            # I have no idea what to do?
            # so I just print that line
            print "$name => $data\n";
         }
      }
      # prints the one line of interest:
      # 122854/1 => 284224,284277,chrX,255,+,284224,284277,255,0,0,1,53,0

      __LISTOFNAMES__
      1000567/1
      1000567/2
      122854/1
      1000574/1
      1000574/2
      __DATA__
      1083978/2,284224,284292,chrX,255,+,284224,284292,255,0,0,1,68,0
      122854/1,284224,284277,chrX,255,+,284224,284277,255,0,0,1,53,0
      641613/1,284224,284290,chrX,255,+,284224,284290,255,0,0,1,66,0

      Please see Wiki Article on regex (Regular Expression).
      You may be able to use the -f option of grep to do what you want: it reads the patterns to search for from a file, one per line.

      One feature of your example data here is that the field of interest (the "name" field) is always the first field in the record, i.e., always at the start of a string read from a file. This anchor can be very useful. If you build a regex to match all the names "of interest" (see haukex's article Building Regex Alternations Dynamically), it's a one-pass process to read all records in a file and match and extract only those records of interest. (Update: This approach assumes that the $rx_sep field separator pattern cannot possibly appear in a "name" field!)

      c:\@Work\Perl\monks>perl
      use strict;
      use warnings;

      my @names = qw(1000567/1 1000567/2 122854/1 1000574/2);

      my $rx_sep = qr{ , }xms;  # adjust to match real field separator

      my ($rx_interesting) =
          map qr{ \A (?: $_) (?= $rx_sep) }xms,
          join q{ | },
          map quotemeta,  # proper!
          reverse sort    # order!
          @names
          ;
      print "$rx_interesting \n";  # for debug

      my @data = (
          '1083978/2,284224,284292,chrX,255,+,284224,284292,255,0,0,1,68,0',
          '122854/1,284224,284277,chrX,255,+,284224,284277,255,0,0,1,53,0',
          '641613/1,284224,284290,chrX,255,+,284224,284290,255,0,0,1,66,0',
          );

      while (my $datum = shift @data) {
          print "interesting: >$datum< \n" if $datum =~ $rx_interesting;
          }
      __END__
      (?msx-i: \A (?: 122854\/1 | 1000574\/2 | 1000567\/2 | 1000567\/1) (?= (?msx-i: , )) )
      interesting: >122854/1,284224,284277,chrX,255,+,284224,284277,255,0,0,1,53,0<

      Note that  $rx_sep has to be adjusted to match whatever field separator your data records actually use.
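
      As an illustration of adjusting $rx_sep, here is a rough sketch of the same technique for tab-separated records. The tab separator is an assumption on my part (the original poster mentioned replacing tabs with commas), and the sub name build_name_regex and the two sample records are made up for this example.

```perl
use strict;
use warnings;

# Assumption: the real records are tab-separated.
# (\t is still a tab under /x; only literal whitespace is ignored.)
my $rx_sep = qr{ \t }xms;

# Hypothetical helper: build one anchored alternation from a list of names.
sub build_name_regex {
    my ($sep, @names) = @_;
    # sort first, then quotemeta, as in the corrected code above
    my $alternation = join q{ | }, map quotemeta, reverse sort @names;
    return qr{ \A (?: $alternation ) (?= $sep ) }xms;
}

my $rx_interesting = build_name_regex($rx_sep, qw(1000567/1 122854/1));

# Made-up sample records; only the first starts with a wanted name.
for my $record ("122854/1\t284224\t284277", "1228541/1\t284224\t284290") {
    print "interesting: >$record<\n" if $record =~ $rx_interesting;
}
```

      The lookahead for the separator is what keeps a name such as 122854/1 from matching a longer name that merely starts with the same characters.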

      Update 1: I didn't notice that haukex already suggested this approach here. Oh well... At least you have a worked example :)

      Update 2: I've noticed a stupid mistake in my code as originally posted. A part of the sequence of operations to build  $rx_interesting was incorrectly given as
          reverse sort
          map quotemeta,
      The code has been corrected. The error (quotemeta-ing before sort-ing) should have made no difference in this particular application, but there are corner cases (update: in other potential applications) in which it would (although I'm unable to think of a good example of such a case at the moment).


      Give a man a fish:  <%-{-{-{-<