input record separator and split

frednc_2014 has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,

I am trying to split a large file into an array of records using $/. My split.pl is the following:

#!/usr/bin/perl -w

$input_blast = $ARGV[0];

$/ = "\nQuery=\s+";

open (IN, "$input_blast"); 
while (<IN>) {
   
    chomp;
    @blastblock = split(/Query=/, $_);

}

 
$total_number = 0;
foreach $blastresult (@blastblock) {
    
  next if ($blastresult =~ /^BLASTN/);
  $total_number++;
  print "$total_number ----------------------$blastresult\n";
  
}

part of the input file is like:

Database: NLngoRT_WT_PCRDB
           1 sequences; 481 total letters


Query= M01133:26:000000000-A6UCG:1:1101:22656:1128
1:N:0:1+@M01133:26:000000000-A6UCG:1:1101:22656:1128 2:N:0:1

Length=501
... (more content here, I deleted them so that this message is small)


Query= M01133:26:000000000-A6UCG:1:1101:22656:1130
1:N:0:1+@M01133:26:000000000-A6UCG:1:1101:22656:1130 2:N:0:1

Length=501
... (more content here, I deleted them so that this message is small)
-------------

This is a huge file. It has 72320825 lines. I set the record separator to "Query= " and then split the file into an array. When I ran my split.pl on it, I got this error message: "Split loop at ../DIR_TEST/split.pl line 12, <IN> chunk 1." It did not produce an array. However, if I created a smaller file using "head -54725379 >small.txe", it worked fine. If I added one more line using "head -54725380 >small_plus_1.txe", I got the same error message as with the original file. I am not sure what causes this. Somehow it is related to the split function. Thanks.

Re: input record separator and split by toolic (Bishop) on May 28, 2014 at 19:44 UTC
When I run your code, I get a warning: `Unrecognized escape \s passed through at ...`

$/ does not accept a regular expression (`\s+`). Also, you overwrite the contents of your @blastblock array every time through the while loop. So, the array only has 1 element after the while loop.
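Both points can be checked on a tiny in-memory sample. The following is only a sketch: `read_records` is a hypothetical helper, not part of the original script, and the sample text is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# In a double-quoted string "\s" is not a recognized escape, so Perl keeps
# just the letter "s". Warnings are silenced only for this one string.
my $separator = do { no warnings; "\nQuery=\s+" };
print length($separator), "\n";                          # 9
print $separator eq "\nQuery=" . 's+' ? "literal\n" : "other\n";

# Hypothetical helper: read records using a plain-string separator and
# accumulate them with push instead of assigning to the array each time.
sub read_records {
    my ($text, $sep) = @_;
    local $/ = $sep;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    my @records;
    while (my $record = <$fh>) {
        chomp $record;    # removes a trailing copy of $/, if there is one
        push @records, $record;
    }
    close $fh;
    return @records;
}

my @records = read_records("header\nQuery= a\nQuery= b\n", "\nQuery=");
print scalar(@records), " records\n";                    # 3 records
```

With `push`, every record survives the loop; with plain assignment only the result of the last iteration would remain.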
Re: input record separator and split by Laurent_R (Canon) on May 28, 2014 at 21:21 UTC
Not only does $/ not accept a regex, but adding the "\s+" pattern also looks fairly useless in this context. At most, it would remove additional spaces from the chunks you get, but that can easily be done as a second step.

The second thing that I don't get is that you split your file on "Query" and then try to split your lines on almost the same pattern. Unless I missed something, it does not seem to me to make much sense with the data sample you provided.

Lastly, a 72320825-line file is pretty big, but I would not qualify it as huge (unless the lines are really very long). I am using much larger files on an almost daily basis and don't get any trouble so long as I am not doing something stupid such as trying to load everything into memory (it might just take some time, but it does not fail). Anyway, since this line:

`@blastblock = split(/Query=/, $_);`

is overwriting the @blastblock array each time through the loop, I don't really believe that you ran out of memory because of the size of the input data.

I would suggest that you try to look at line 54725380 to figure out if there is something wrong with it. One possible way to view it might be a one-liner such as this one:

`perl -e '$/ = "\nQuery="; while (<>) { print and last if $. == 54725380;}' file.txt`

It might have to be adapted depending on your data, but see if this works.

More generally, I suspect that your split fails because your data might have a very large section (possibly the whole file) without ever matching the record splitting pattern. So the first thing to be done is to remove the `\s+` from your input record delimiter and see whether that works.
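One adaptation worth keeping in mind: `$.` counts records as defined by `$/`, so once the separator is changed it no longer counts physical lines. A minimal sketch for fetching a physical line, assuming the default newline separator is what is wanted (`nth_line` is a hypothetical helper name):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return physical line number $wanted from $fh,
# or undef if the input has fewer lines. Only one line is held at a time.
sub nth_line {
    my ($fh, $wanted) = @_;
    local $/ = "\n";    # make sure $. counts lines, not custom records
    while (my $line = <$fh>) {
        return $line if $. == $wanted;
    }
    return undef;
}

# Small in-memory stand-in for the real file.
my $sample = "first\nsecond\nthird\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
print nth_line($fh, 2);    # prints "second"
close $fh;
```

For the real data the handle would come from `open my $fh, '<', $path` on the BLAST output, with 54725380 as the wanted line.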
Re: input record separator and split by taint (Chaplain) on May 29, 2014 at 02:07 UTC
Where is `use strict;`???

--Chris

Ąλɐp ʇɑəɹ⅁ ɐ əʌɐɥ puɐ ʻꜱdləɥ ꜱᴉɥʇ ədoH
Re^2: input record separator and split by frednc_2014 (Initiate) on May 29, 2014 at 13:28 UTC
Thanks for all the replies above. You pointed out correctly that the input record separator cannot take a regular expression. If "\s+" is taken out, indeed, the array has only one element (the last one) because all the previous elements are replaced by the last one. But, if "\s+" is kept, the array has all the elements. If you run my split.pl above on this simple input file:

Query= M01133:26:000000000-A6UCG:1:1101:22656:1128
1:N:0:1+@M01133:26:000000000-A6UCG:1:1101:22656:1128 2:N:0:1
Block1
Query= M01133:26:000000000-A6UCG:1:1101:22656:1130
1:N:0:1+@M01133:26:000000000-A6UCG:1:1101:22656:1130 2:N:0:1
Block2

You'll see the array has two elements and they are the correct ones. The result is here:

1 ---------------------- M01133:26:000000000-A6UCG:1:1101:22656:1128
1:N:0:1+@M01133:26:000000000-A6UCG:1:1101:22656:1128 2:N:0:1
block 1
2 ---------------------- M01133:26:000000000-A6UCG:1:1101:22656:1130
1:N:0:1+@M01133:26:000000000-A6UCG:1:1101:22656:1130 2:N:0:1
block 2

Actually, it produced 3 elements. The first element was an empty one. I added a line of script to skip it.

Chris, I left out use strict, etc. just for testing this error. This split function is a part of my large script that worked perfectly until I got that particular input file. I just wrote a simple testing script to find out what the problem is.

Laurent, that particular line is just like all other lines at the same position, nothing special about it. By the way, I have two input files that do not work individually. However, if I combine them together to produce a larger file, the combined input file works. That is strange.
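A likely reading of this result, which can be checked on a small sample: with "\s+" kept, the separator is a literal string that never occurs in the data, so the whole input arrives as a single record and `split` does all of the work. The sketch below uses a shortened, made-up sample and a hypothetical helper `records_for`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return every record the given separator produces.
sub records_for {
    my ($text, $sep) = @_;
    local $/ = $sep;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    my @records = <$fh>;    # list context reads all records at once
    close $fh;
    return @records;
}

# Shortened stand-in for the two-block input file.
my $sample = "Query= first\nBlock1\nQuery= second\nBlock2\n";

# What "\nQuery=\s+" turns into inside double quotes: a literal "s+".
my @records = records_for($sample, "\nQuery=" . 's+');
print scalar(@records), "\n";    # 1: the separator is never found

# The single record is then cut up by split, leading empty field included.
my @blocks = split /Query=/, $records[0];
print scalar(@blocks), "\n";     # 3: one empty element plus two blocks
```

This only shows why small inputs appear to work; it does not by itself explain the "Split loop" message on the large file.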
Re: input record separator and split by Lotus1 (Vicar) on May 31, 2014 at 15:34 UTC
When you set `$/` to something unrecognized or even just something that isn't found in the input file it works the same way as if you had set it to `undef`. It causes the whole file, or as much as will fit into memory, to be slurped on the first pass of the while loop. Then inside the while loop, since you split on something it does find, that part splits `$_` into your array. As long as the file is small enough to be slurped on the first pass it seems to work.

The problem with your approach is that you shouldn't read multi-line records and then split each record on the same separator again. If the read works, the data is already split by `$/`. To fix this, set `$/` to "Query=" and then trim any extra spaces at the beginning of `$_` within your while loop. Inside the while loop, split each record into whatever form you need.

#!/usr/bin/perl
use strict;
use warnings;
$/ = "Query=";
my $count = 0;
while (<DATA>) {
    chomp;
    s/^\s+//;
    my @record = split(/[:\n]/, $_);
    $count++;
    print join(' : ', @record), "<count=$count>\n";
}
__DATA__
Query= M01133:26:000000000-A6UCG:1:1101:22656:1128
1:N:0:1+@M01133:26:000000000-A6UCG:1:1101:22656:1128 2:N:0:1

Length=501
Query= M01133:26:000000000-A6UCG:1:1101:22656:1129
1:N:0:1+@M01133:26:000000000-A6UCG:1:1101:22656:1129 2:N:0:1

Length=501
Query= M01133:26:000000000-A6UCG:1:1101:22656:1130
1:N:0:1+@M01133:26:000000000-A6UCG:1:1101:22656:1130 2:N:0:1

Length=501
Query= M01133:26:000000000-A6UCG:1:1101:22656:1131
1:N:0:1+@M01133:26:000000000-A6UCG:1:1101:22656:1131 2:N:0:1

Length=501

Output:

<count=1>
M01133 : 26 : 000000000-A6UCG : 1 : 1101 : 22656 : 1128 : 1 : N : 0 : 1+@M01133 : 26 : 000000000-A6UCG : 1 : 1101 : 22656 : 1128 2 : N : 0 : 1 :  : Length=501<count=2>
M01133 : 26 : 000000000-A6UCG : 1 : 1101 : 22656 : 1129 : 1 : N : 0 : 1+@M01133 : 26 : 000000000-A6UCG : 1 : 1101 : 22656 : 1129 2 : N : 0 : 1 :  : Length=501<count=3>
M01133 : 26 : 000000000-A6UCG : 1 : 1101 : 22656 : 1130 : 1 : N : 0 : 1+@M01133 : 26 : 000000000-A6UCG : 1 : 1101 : 22656 : 1130 2 : N : 0 : 1 :  : Length=501<count=4>
M01133 : 26 : 000000000-A6UCG : 1 : 1101 : 22656 : 1131 : 1 : N : 0 : 1+@M01133 : 26 : 000000000-A6UCG : 1 : 1101 : 22656 : 1131 2 : N : 0 : 1 :  : Length=501<count=5>
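For a real file rather than `__DATA__`, the same idea could be sketched as below. This is an illustration under assumptions, not the poster's code: `each_query_block` is a hypothetical helper, the separator "\nQuery=" assumes every block header starts at the beginning of a line, and the sample text stands in for the BLAST output.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: call $callback->($number, $block) once per "Query="
# block, holding only one block in memory at a time. Returns the block count.
sub each_query_block {
    my ($fh, $callback) = @_;
    local $/ = "\nQuery=";    # plain string, no regex
    my $seen = 0;
    while (my $block = <$fh>) {
        chomp $block;         # drop the trailing separator, if present
        if ($. == 1) {
            # The first chunk is header text unless the file itself
            # begins with "Query=".
            next unless $block =~ s/\AQuery=//;
        }
        $block =~ s/^\s+//;   # trim the space that followed "Query="
        $callback->(++$seen, $block);
    }
    return $seen;
}

# In-memory stand-in; a real run would use: open my $fh, '<', $ARGV[0]
my $sample = "BLASTN (header)\n\nQuery= a\nLength=501\n\nQuery= b\nLength=501\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $total = each_query_block($fh, sub {
    my ($number, $block) = @_;
    my ($first_line) = split /\n/, $block;
    print "$number: $first_line\n";    # prints "1: a" then "2: b"
});
close $fh;
print "$total blocks\n";               # 2 blocks
```

Handing each block to a callback avoids building one array for the whole file, which matters more as the input grows.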