Re: Extacting lines where one column matches a name from a list of names
by davido (Cardinal) on Sep 13, 2019 at 15:50 UTC
|
You seek a performance improvement over your awk solution. For there to be an improvement, there must be room to improve. I think you mentioned in a follow-up comment that your list of names is in the order in which they will appear in the files you are parsing. This is useful, as you can save the small amount of time it might have taken to import the list of names into a hash. So lets do a little profiling:
use strict;
use warnings;
use Time::HiRes qw(time);
open my $name_infh, '<', 'path/to/names/list' or die $!;
open my $haystack_infh, '<', 'path/to/tab/delimited/list' or die $!;
my $t0 = time();
while(<$name_infh>) {}
while(<$haystack_infh>) {}
printf "Elapsed time: %-.03f\n", time-$t0;
Now run that on your input file; the largest one you've got, and see how long it takes. If it takes too long, you can stop right there because there is no Perl (or any other language) solution that will meet your time requirements unless you change the requirements by processing streams more frequently, or overnight when it doesn't matter, etc.
If it is fast enough, then you could take the next step by implementing a solution in Perl that is similiarly linear in its computational complexity:
use strict;
use warnings;
open my $name_infh => '<', 'path/to/names/list'
or die "Unable to open names list: $!\n";
open my $haystack_infh => '<', 'path/to/tab/del/file'
or die "Unable to open haystack file: $!\n";
my $name = <$name_infh>;
chomp $name;
while (my $line = <$haystack_infh>) {
my ($test_name, $payload) = split /\t/, $line, 2;
if ($name eq $test_name) {
print "We have a winner: $test_name => $payload";
$name = <$name_infh>;
last if !defined $name;
chomp $name;
}
}
This operates under the assumption that there will be exactly one match for each name in your list, and that your names list is in the correct order. If those assumptions are incorrect, then read your names list into a hash to start with; this will incur only a slight penalty -- so slight it's probably not worth maintaining your names list in any particular order to begin with. If it's not in order, just do this:
my %want;
while(<$name_infh>) {
chomp;
$want{$_}++;
}
while (my $line = <$haystack_infh>) {
my ($test_name, $payload) = split /\t/, $line, 2;
if (exists $want{$test_name}) {
delete $want{$test_name};
print "We have a winner: $test_name => $payload";
last if ! keys %want;
}
}
This last solution is still a linear time solution, as was the previous one, but is more flexible on the order in which things happen. It still makes one assumption; you're only looking for each name one time. You can remove the delete and the last if lines if that assumption isn't correct.
At any rate, if the initial profiling check determined that the sheer act of reading the files takes longer than you have, you'll have to come up with a different strategy that doesn't involve sitting around waiting for large files to load.
| [reply] [d/l] [select] |
Re: Extacting lines where one column matches a name from a list of names
by Marshall (Canon) on Sep 13, 2019 at 02:25 UTC
|
The first issue is that Perl Monks is a forum to help folks learn
about Perl, thereby enabling those folks to get better at
doing it themselves. This is not a general code writing service.
However, if you were able to make some demonstrated attempt at
this yourself, you probably will get lots of help. So one issue
is how to get started? Do you have any programming experience at all?
I am unsure of the very best current books on Beginning Perl, but
I remember an O'Reilly book by that name. I hope other Monks can make
additional suggestions?
Your Plan is a bit confusing to me owing to your use of the term "hash keys"
which I associate with something very specific and which might not be
what you meant.
In general a data search problem of this type is done by keeping your
"list of names" in memory in an efficiently searchable form. A hash table
would often be used for this. Is name1 in my list of names? can be
answered very quickly.
Things become more complex if some amount of "sort of matches" is allowed.
The question: Does name1 "look like" something in my "list of names" can
be complex or computationally expensive.I'd have to have some example data
to make a concrete recommendations.
So, if an exact match to one of words in your list is required, then
a simple hash table of your names would suffice. Read a line of data,
decide if the name matches and if so, do "something" with it, otherwise
skip that line (do nothing). Read next line, rinse repeat.
Please give some more detail. Then we can discuss "What to do" in more
detail (the processing algorithm). Along the way, you will need to quite
a bit of learning on your own about "How to do it". A good and perhaps stepwise
plan should be of interest to you along with some books and other material
to read in order for you to get started.
Update: I guess one starting point would be to try to translate your awk code into Perl. The enormous
execution time suggests to me that you have a very inefficient algorithm for determining if a name is relevant
or not? How you are currently making that decision is one main point of mine above.
| [reply] |
|
|
I understand. I believe that using hash tables would speed things up significantly. I don't need sort of matches, because my data has been designed such that my list of names exactly match the corresponding entries within the table (i.e. regex).
The list of names looks like this:
1000567/1
1000567/2
1000574/1
1000574/2
And the total data (from which I want to extract the matching columns) looks like this:
I have removed the tabs and replaced them with commas, so that each entry is in one string
1083978/2,284224,284292,chrX,255,+,284224,284292,255,0,0,1,68,0
122854/1,284224,284277,chrX,255,+,284224,284277,255,0,0,1,53,0
641613/1,284224,284290,chrX,255,+,284224,284290,255,0,0,1,66,0
Separately, I have the first column as a matching "hash file"
1083978/2
122854/1
641613/1
So now, I will use the hash file and the comma-separated data to create hash and key pairs. I will put the hashes in a table. And finally, I will compared my hashes to my list of names using regex matching. For the hashes that match, I will fetch the keys and replace commas with tabs, thus giving me my original data. Does this seem sensible?
| [reply] |
|
|
#!/usr/bin/perl
use strict;
use warnings;
use Inline::Files; # Just for this Demo
# so that this is a single runnable file
# Monk node_id=11106111
my %nameValid;
# initialize Name Hash...
# Name the hash by what the Value Means,
# not by what the hash key means
while (my $name = <LISTOFNAMES>)
{
chomp $name; # shorter idioms exist
$nameValid{$name}=1; # but this is just fine
}
while (my $line = <DATA>)
{
chomp $line;
my ($name,$data) = split ",",$line,2;
if ( exists $nameValid{$name} )
{
# name is in "list" do something
# I have no idea what to do?
# so I just print that line
print "$name => $data\n";
}
}
# prints the one line of interest:
# 122854/1 => 284224,284277,chrX,255,+,284224,284277,255,0,0,1,53,0
__LISTOFNAMES__
1000567/1
1000567/2
122854/1
1000574/1
1000574/2
__DATA__
1083978/2,284224,284292,chrX,255,+,284224,284292,255,0,0,1,68,0
122854/1,284224,284277,chrX,255,+,284224,284277,255,0,0,1,53,0
641613/1,284224,284290,chrX,255,+,284224,284290,255,0,0,1,66,0
Please see Wiki Article on regex (Regular Expression).
You may be able to use the -f option on grep to do what you want? | [reply] [d/l] |
|
|
One feature of your example data here is that the field of interest (the "name" field) is always the first field in the record, i.e., always at the start of a string read from a file. (Update: This approach assumes that the $rx_sep field separator pattern cannot possibly appear in a "name" field!) This anchor can be very useful. If you build a regex to match all the names "of interest" (see haukex's article Building Regex Alternations Dynamically), it's a one-pass process to read all records in a file and match and extract only those records of interest.
c:\@Work\Perl\monks>perl
use strict;
use warnings;
my @names = qw(1000567/1 1000567/2 122854/1 1000574/2);
my $rx_sep = qr{ , }xms; # adjust to match real field separator
my ($rx_interesting) =
map qr{ \A (?: $_) (?= $rx_sep) }xms,
join q{ | },
map quotemeta, # proper!
reverse sort # order!
@names
;
print "$rx_interesting \n"; # for debug
my @data = (
'1083978/2,284224,284292,chrX,255,+,284224,284292,255,0,0,1,68,0',
'122854/1,284224,284277,chrX,255,+,284224,284277,255,0,0,1,53,0',
'641613/1,284224,284290,chrX,255,+,284224,284290,255,0,0,1,66,0',
);
while (my $datum = shift @data) {
print "interesting: >$datum< \n" if $datum =~ $rx_interesting;
}
__END__
(?msx-i: \A (?: 122854\/1 | 1000574\/2 | 1000567\/2 | 1000567\/1) (?=
+(?msx-i: , )) )
interesting: >122854/1,284224,284277,chrX,255,+,284224,284277,255,0,0,
+1,53,0<
Note that $rx_sep has to be adjusted to match whatever field separator your data records actually use.
Update 1: I didn't notice that haukex already suggested this approach here. Oh well... At least you have a worked example :)
Update 2: I've noticed a stupid mistake in my code as originally posted. A part of the sequence of operations to build $rx_interesting was incorrectly given as
reverse sort
map quotemeta,
The code has been corrected. The error (quotemeta-ing before sort-ing) should have made no difference in this particular application, but there are corner cases (update: in other potential applications) in which it would (although I'm unable to think of a good example of such a case ATM).
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] [select] |
Re: Extacting lines where one column matches a name from a list of names
by haukex (Archbishop) on Sep 13, 2019 at 05:25 UTC
|
Do a regex match between my list of names, and my hash keys.
This is only necessary if you need regex features, like partial matches - if that's necessary, see Building Regex Alternations Dynamically for one solution. If you want to match exact strings, then simply looking up the first column in a hash built from the list of names should be best, as Marshall showed.
| [reply] |
|
|
The OP wrote in a reply to me:
"I don't need sort of matches, because my data has been designed such that my list of names exactly match the corresponding entries within the table (i.e. regex)." I think one of the problems is that the OP is using terms like "regex" in ways that don't seem to make sense. I don't think that the OP really means "regex".
I am working on an approximate pattern matching problem currently. Right now I take an input, then generate a series of regex expressions that are or'd together. This indeed does work, but it is slow because that generated regex has to be compared against a large number of tokens. Right now I suspect that some sort of tree structure can be generated such that I can determine the approximate matches much, much faster. I absolutely know that this is true with a subset of my problem, something a bit more than N1. I am matching with a subset of N2. That subset of N2 being defined currently by my regex generator.
I have a colleague who is willing to look at the problem. I am working on a spec that is detailed enough so that he can understand what needs to be done without having to have any pre-knowledge of my application. That's more difficult than it sounds - the what's and why's are complex.
If anybody has experience with a similar sounding problem, send me a message.
| [reply] |
Re: Extacting lines where one column matches a name from a list of names
by Anonymous Monk on Sep 13, 2019 at 03:24 UTC
|
I have been using awk ... But I have no idea where to start.
If you have perl installed, save your awk commands to a file, then type: a2p file | [reply] |