in reply to Searching large files a block at a time

G'day JediWombat,

Welcome to the Monastery.

"This uses the $/ input field separator, and then uses while (<>) to read a block at a time. I'd like to do this in pure perl, but I can't find a way."

Firstly, here's a simple example of how you might do this.

#!/usr/bin/env perl -l

use strict;
use warnings;

{
    local $/ = '';
    while (<DATA>) {
        chomp;
        print '--- One Block ---';
        print;
    }
}

__DATA__
Block1 Line1
Block1 Line2
Block1 Line3

Block2 Line1
Block2 Line2
Block2 Line3

Block3 Line1
Block3 Line2
Block3 Line3

Block4 Line1
Block4 Line2
Block4 Line3

The output looks like this:

--- One Block ---
Block1 Line1
Block1 Line2
Block1 Line3
--- One Block ---
Block2 Line1
Block2 Line2
Block2 Line3
--- One Block ---
Block3 Line1
Block3 Line2
Block3 Line3
--- One Block ---
Block4 Line1
Block4 Line2
Block4 Line3

I thought ++roboticus had generally covered issues relating to '$/' and IO::Uncompress::Bunzip2; however, your reply seems to suggest you were looking for something else.

I'm not entirely sure what you're looking for. Note in IO::Uncompress::Bunzip2's Constructor section:

... the object, $z, returned from IO::Uncompress::Bunzip2 can be used exactly like an IO::File filehandle. This means that all normal input file operations can be carried out with $z. For example, to read a line from a compressed file/buffer you can use either of these forms

$line = $z->getline();
$line = <$z>;

Try using '<$z>', in a way similar to my example with '<DATA>', and see if that does what you want. Something like this (untested):

my $z = IO::Uncompress::Bunzip2::->new($filename);

{
    local $/ = '';
    while (<$z>) {
        ...
    }
}

Note that the constructor code I've used differs from that shown in the IO::Uncompress::Bunzip2 documentation. This is deliberate, and I recommend you use the form shown here instead. The IO::Uncompress::Bunzip2 documentation uses "Indirect Object Syntax": if you follow that link, you'll see, in bold text,

"... this syntax is discouraged ..."

along with a discussion of why that syntax should be avoided.
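
To make the difference concrete, here's a minimal sketch of the two constructor styles (the filename is just a placeholder for illustration, not from the thread):

use strict;
use warnings;
use IO::Uncompress::Bunzip2;

my $filename = 'example.ldif.bz2';    # placeholder name for illustration

# Indirect object syntax (discouraged): 'new' is parsed as a bareword,
# which can be ambiguous and lead to hard-to-diagnose errors.
my $z1 = new IO::Uncompress::Bunzip2 $filename;

# Direct method call; the trailing '::' makes it unambiguous that
# IO::Uncompress::Bunzip2 is a package name rather than a subroutine.
my $z2 = IO::Uncompress::Bunzip2::->new($filename);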

— Ken

Re^2: Searching large files a block at a time
by JediWombat (Novice) on Aug 02, 2017 at 05:53 UTC
    Thank you, Ken! The bit that helped me understand what I was doing wrong was this:
    my $z = IO::Uncompress::Bunzip2::->new($filename);
    while (<$z>) {
    }

    What I needed was a way to use "while (<data>)" without using the "getline()" method, which seemed to be reading the data one line at a time. My LDIF is over 15 million lines, so that was quite slow. Using your code, I get results in ~10 seconds, which is acceptable (though still a lot slower than the shell script that pipes into Perl, and I'm not sure why that is). Thanks to you and Roboticus for steering me in the right direction. Cheers, JW.
      "... helped me understand what I was doing wrong ..."

      OK, that's a good start.

      "Using your code, I get results in ~10 seconds, which is acceptable (though still a lot slower than the shell script that pipes into Perl, and I'm not sure why that is). "

      I'm completely guessing, but the overhead may be due to the IO::Uncompress::Bunzip2 module. You could avoid that module by setting up the same pipe, but from within the Perl script (rather than piping into the script from the shell).

      I put exactly the same data I used previously into a text file (just a copy and paste):

      $ cat > pm_1196493_paragraph_mode_test_data.txt
      Block1 Line1
      ...
      Block4 Line3
      ^D

      I then modified the start of my previous example code, so it now looks like this:

      #!/usr/bin/env perl -l

      use strict;
      use warnings;
      use autodie;

      my $filename = 'pm_1196493_paragraph_mode_test_data.txt';

      open my $z, '-|', "cat $filename";

      {
          local $/ = '';
          while (<$z>) {
              chomp;
              print '--- One Block ---';
              print;
          }
      }

      This produces exactly the same output as before. Obviously, you'll want to change 'cat' to '/usr/bin/bzcat' (and, of course, use *.bz2 instead of *.txt files). This solution will not be platform-independent: that may not matter to you. See open for more on the '-|', and closely related '|-', modes.
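
      As a rough illustration of the difference between the two modes (the filenames and commands here are just placeholders, not from the thread):

      use strict;
      use warnings;

      # '-|': read from a command; its STDOUT becomes our input stream.
      open my $in, '-|', 'bzcat example.ldif.bz2'
          or die "Can't run bzcat: $!";
      print while <$in>;

      # '|-': write to a command; whatever we print goes to its STDIN.
      open my $out, '|-', 'bzip2 -c > copy.txt.bz2'
          or die "Can't run bzip2: $!";
      print {$out} "compress me\n";
      close $out;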

      Also, note that I used the autodie pragma. If you want more control over handling I/O problems, you can hand-craft messages (e.g. open ... or die "..."), or use something like Try::Tiny.
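
      For example, the hand-crafted and Try::Tiny approaches might look roughly like this (a sketch only, reusing the placeholder filename from above):

      use strict;
      use warnings;
      use Try::Tiny;

      my $filename = 'pm_1196493_paragraph_mode_test_data.txt';

      # Hand-crafted message instead of autodie:
      open my $fh, '-|', "cat $filename"
          or die "Can't start 'cat $filename': $!";

      # Or wrap the I/O in Try::Tiny and decide what to do on failure:
      try {
          open my $z, '-|', "cat $filename"
              or die "Can't start 'cat $filename': $!";
          # ... process <$z> here ...
      }
      catch {
          warn "Problem reading '$filename': $_";
      };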

      — Ken

        Thanks again, Ken. I've built this code using your and Mario's responses:
        $/ = "";
        open my $fh, "-|", "/usr/bin/bzcat $file";
        while (<$fh>) {
            if (/uid=$mbnum/m) {
                print $_;
                last;
            }
        }

        I've timed this version and all the others: this one completed in 3.2 seconds, the previous version I built with your help took 8 seconds, and my original took 25.4 seconds! As I need to scan through three different LDIFs, that's a total of under 10 seconds on average.