Re: General program and related problems
by moritz (Cardinal) on Aug 03, 2009 at 13:00 UTC
You can certainly iterate over the other file, but that might be quite slow.
I haven't quite understood what your input format looks like, but it might be possible to throw your data into a database and let it do a JOIN operation.
Or, if you have enough memory, you could read the second file into a hash and then access it. But without knowing more about the input and the desired output it's hard to give good advice.
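A minimal sketch of the hash idea, with placeholder file names, and assuming the second file is keyed by an rs ID in its first column (as later posts in this thread suggest):

#!/usr/bin/perl
use strict;
use warnings;

# Read the second (lookup) file into a hash, keyed on its first column.
my %lookup;
open my $fh2, '<', 'file2.txt' or die "Cannot open file2.txt: $!";
while (my $line = <$fh2>) {
    chomp $line;
    my ($id, $rest) = split /\s+/, $line, 2;
    $lookup{$id} = $rest;
}
close $fh2;

# Then walk the first file and pull out the matching records directly.
open my $fh1, '<', 'file1.txt' or die "Cannot open file1.txt: $!";
while (my $line = <$fh1>) {
    while ($line =~ /(rs\d+)/g) {                  # assumed ID format
        print "$1 $lookup{$1}\n" if exists $lookup{$1};
    }
}
close $fh1;

With a 1 GB lookup file the hash may well not fit, which is why this only works "if you have enough memory".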
Thanks for the reply. Basically the problem is that I do not need most of the fields contained in file 1, nor most of the fields in file 2.
A few lines of file 1:
169: rs60465173 has merged into rs8057341 Homo sapiensCAGCTGACTGAGGCAGCGGGAGTTGAA/GAAGAAACGATATTAGTTCATGGTGA
ABI, AFFY, ILLUMINA-UK, ILLUMINA, ILLUMINA, ILLUMINA, ILLUMINA, ILLUMINA, ILLUMINA170: rs17312781 has merged into rs8057341 Homo sapiensCAGCTGACTGAGGCAGCGGGAGTTGAA/GAAGAAACGATATTAGTTCATGGTGA
ABI, AFFY, ILLUMINA-UK, ILLUMINA, ILLUMINA, ILLUMINA, ILLUMINA, ILLUMINA, ILLUMINA171: rs8057341 Homo sapiensCAGCTGACTGAGGCAGCGGGAGTTGAA/GAAGAAACGATATTAGTTCATGGTGA
ABI, AFFY, ILLUMINA-UK, ILLUMINA, ILLUMINA, ILLUMINA, ILLUMINA, ILLUMINA, ILLUMINA172: rs60162986 has merged into rs8046608 Homo sapiensCCCTACTTACTTGTGGCCTGTCCCCTC/TGTGAATGTGTCTCATGTCCCCAGTG
AFFY173: rs8046608 Homo sapiensCCCTACTTACTTGTGGCCTGTCCCCTC/TGTGAATGTGTCTCATGTCCCCAGTG
From there I need the rs value, and for that I wrote the code I posted above. Now I have the rs values in an array, and I need to grab only the lines that contain those rs numbers from a second huge txt file (1 GB).
The second file looks like this:
First row XXX XXX XXX XXX XXX XXX XXX XXX (1050 cells)
rsnumber AA AG AG AG AA AG AG AG (1050 times)
rsnumber TT AT AA AT AT .....
500 times more
I need to get from this file the rs numbers stored in the array from file 1, together with the 1050 values on the same line.
Maybe I was not clear enough.
From file 1 (you can see a few lines above) I need only the rs field with 5-8 digits. The same field is the first column of file 2.
Here we are.
Thanks everybody for the help.
Basically my output from file 1 at the moment is a file with one column of rs values, like
rs3547689
rs325678912
rs36789012
etc
I now need to find these values in file 2 and print out the lines, or write them to a separate file.
File 2 looks like
XXX XXX XXX XXX XXX XXX (1050 times)
rs3507865 AA AT AT AT TT AA (1050 values)
rs3456189 GG GC GG CC CC .....
more than 700 rows
Can you give me a suggestion for the hash keys? Could it be the row number, even though I cannot write to file 2? Cheers again
Re: General program and related problems
by jethro (Monsignor) on Aug 03, 2009 at 13:44 UTC
Which grep do you have in mind, the built-in Perl function grep or the command line utility grep? Generally there is (especially in Perl) more than one way to do things, though some are better than others in a given situation.
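Purely to illustrate the distinction (ids.txt here is a hypothetical file holding one search string per line):

use strict;
use warnings;

# Perl's built-in grep filters a list inside your program:
my @words = qw(rs123456 foo bar rs9876543);
my @hits  = grep { /^rs\d{5,}$/ } @words;    # ('rs123456', 'rs9876543')
print "$_\n" for @hits;

# The command-line utility grep works on files from the shell, for example
#   grep -F -f ids.txt chr22.txt > matches.txt
# prints every line of chr22.txt that contains any of the fixed strings
# listed one per line in ids.txt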
You also seemed to imply that both data files are huge. Does that mean you are searching for thousands or millions of values in file 2?
If that were the case, a lot would depend on the characteristics of the data. If it is only single words, you could create a hash (stored on disc) of the words in file 1 and check every word in file 2 for existence in the hash. If you are looking for whole lines instead, you could sort file 2 and your search list and work from the beginning through both lists.
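A minimal sketch of that sorted-lists idea, assuming @output holds the IDs extracted from file 1 (as in the snippet below), that file 2 has already been sorted on its first column (e.g. with an external sort), and reusing the FILE2 filehandle:

my @search = sort @output;                  # sorted search list from file 1
my $i = 0;
while (my $line = <FILE2>) {
    my ($id) = split /\s+/, $line;
    # skip search entries that sort before the current line's ID
    $i++ while $i < @search and $search[$i] lt $id;
    last if $i >= @search;                  # nothing left to look for
    print $line if $search[$i] eq $id;
}

This reads file 2 only once, in order, and never has to build a hash of it.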
If, on the other hand, the list you are looking for is small, you could concatenate all search strings with '|' and use that string as a search pattern, somewhat like this (untested):
my $searchstring = '\Q' . join('\E|\Q', @output) . '\E';
while (my $line = <FILE2>) {
    if ($line =~ /$searchstring/) {
        print $line;
    }
}
\Q and \E make sure that any regex special characters in your search strings, like '|', are escaped.
Other observations: ++ for your use of warnings and strict. But you should also indent consistently; it makes your code much more readable.
And concerning your second posting: please edit it and use code tags for your data examples too.
Hello hello,
thanks for the time you are spending on me.
I tried both pieces of code, but I do not get any output file at the end.
#!/usr/bin/perl -w
use strict;

my $line;
my @fields;
my @output;

open (FILE1, 'snp2.txt')  or die "can't open the file: $!";
open (FILE2, 'chr22.txt') or die "can't open the file: $!";
open (FD, '>test.txt')    or die "can't open the file: $!";

my $position = tell(FILE2);
my %rs;
while ($line = <FILE2>) {
    my ($key) = $line =~ /^(rs\d{5,})\b/;
    if (defined $key) {
        $rs{$key} = $position;
    }
    $position = tell(FILE2);
}

while (defined ($line = <FILE1>)) {
    my @fields = split (/\s+/, $line);
    my @output = grep /^rs\d{5,}\b/, @fields;
}

foreach (@output) {
    if (exists $rs{$_}) {
        seek(FILE2, $rs{$_}, 0);
        my $line = <FILE2>;
        print FD $line;
    }
}

Close FILE1;
Close FILE2;
Close FD;
This one is the most convincing to me, but I have no output file at the end... I do not know where the problem could be.
Did you see that your program is producing an error message? It should be 'close', not 'Close', at the end.
Some words on debugging: if you don't know what your program does, insert print statements to find out (or use Data::Dumper).
For example, a simple print @output; before the foreach loop would have told you that @output is empty.
Then you could have looked at the previous loop, where @output should have been filled. A print join('|', @fields), "\n"; or, even better, a print Dumper(\@fields); (you also need a use Data::Dumper; line for this) and a print Dumper(\@output); at the end of the loop would have given you surprising results. If you want to learn something, please do the above, look at the result and think about it. If you don't find the solution, read the spoiler below.
After you have solved the first problem you will see that there is a further problem: you are getting only the result of the last line of file 1. The output of the print or Dumper lines should give you a clue again; if not, read the next spoiler.
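A minimal sketch of such debugging prints, dropped into the second loop of the program posted above (the variable and filehandle names are taken from that program):

use Data::Dumper;

while (defined ($line = <FILE1>)) {
    my @fields = split (/\s+/, $line);
    my @output = grep /^rs\d{5,}\b/, @fields;
    print Dumper(\@fields);    # shows what split produced for this line
    print Dumper(\@output);    # shows what the grep kept for this line
}
print Dumper(\@output);        # shows what the foreach loop below will actually iterate over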
Re: General program and related problems
by tokpela (Chaplain) on Aug 03, 2009 at 13:55 UTC
Basically, it seems like you are trying to get some lookup data from your first file and then use it when scanning your second file.
I would not use arrays, since you mention that you have GB file sizes. I would instead use DBM::Deep to store your initial values.
A side effect is that your retrieval will be pretty quick as well.
One thing that you will need to come up with is a link between the data from the two files - some common key to use in the DBM::Deep database.
Something like this:
use strict;
use warnings;

use DBM::Deep;

my $db_filepath = 'lookup.db';
my $file1 = 'XXXX.txt';
my $file2 = 'YYYY.txt';

my $db = DBM::Deep->new($db_filepath);

open(my $fh, $file1) or die "[Error] COULD NOT OPEN FILE [$file1]-[$!]";
while (<$fh>) {
    my $line = $_;

    # get a common key from the data somewhere in here
    my @fields = split(/\s+/, $line);
    my @output = grep /rs\d{5,}\b/, @fields;
    my $rs = join(':', @output);
    $rs =~ s/:/\n/g;

    $db->{'some-common-key-between-files'} = $rs;
}
close($fh);

# now iterate through your other file and do the lookup using DBM::Deep
open(my $fh2, $file2) or die "[Error] COULD NOT OPEN FILE [$file2]-[$!]";
while (<$fh2>) {
    my $line = $_;

    if ($line =~ /some-common-key-between-files/) {
        my $db_record = $db->{'some-common-key-between-files'};

        # now you have linked data from both files
        # do your other coding here
    }
}
close($fh2);