in reply to Searching large files a block at a time

G'day JediWombat,

Welcome to the Monastery.

"This uses the $/ input field separator, and then uses while (<>) to read a block at a time. I'd like to do this in pure perl, but I can't find a way."

Firstly, here's a simple example of how you might do this.

#!/usr/bin/env perl -l

use strict;
use warnings;

{
    local $/ = '';
    while (<DATA>) {
        chomp;
        print '--- One Block ---';
        print;
    }
}

__DATA__
Block1 Line1
Block1 Line2
Block1 Line3

Block2 Line1
Block2 Line2
Block2 Line3

Block3 Line1
Block3 Line2
Block3 Line3

Block4 Line1
Block4 Line2
Block4 Line3

The output looks like this:

--- One Block ---
Block1 Line1
Block1 Line2
Block1 Line3
--- One Block ---
Block2 Line1
Block2 Line2
Block2 Line3
--- One Block ---
Block3 Line1
Block3 Line2
Block3 Line3
--- One Block ---
Block4 Line1
Block4 Line2
Block4 Line3

I thought ++roboticus had generally covered issues relating to '$/' and IO::Uncompress::Bunzip2; however, your reply seems to suggest you were looking for something else.

I'm not entirely sure what you're looking for. Note in IO::Uncompress::Bunzip2's Constructor section:

... the object, $z, returned from IO::Uncompress::Bunzip2 can be used exactly like an IO::File filehandle. This means that all normal input file operations can be carried out with $z. For example, to read a line from a compressed file/buffer you can use either of these forms

$line = $z->getline();
$line = <$z>;

Try using '<$z>', in a way similar to my example with '<DATA>', and see if that does what you want. Something like this (untested):

my $z = IO::Uncompress::Bunzip2::->new($filename);

{
    local $/ = '';
    while (<$z>) {
        ...
    }
}

Note that the constructor code I've used differs from that shown in the IO::Uncompress::Bunzip2 documentation. This is deliberate, and I recommend you use the form shown here instead. The IO::Uncompress::Bunzip2 documentation uses "Indirect Object Syntax": if you follow that link, you'll see, in bold text,

"... this syntax is discouraged ..."

along with a discussion of why that syntax should be avoided.
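
To make the difference concrete, here's a minimal sketch of the two constructor styles (the filename is just a placeholder for illustration, not from the thread):

use strict;
use warnings;
use IO::Uncompress::Bunzip2;

my $filename = 'example.ldif.bz2';    # placeholder name for illustration

# Indirect object syntax (discouraged): 'new' is parsed as a bareword,
# which can be ambiguous and lead to hard-to-diagnose errors.
my $z1 = new IO::Uncompress::Bunzip2 $filename;

# Direct method call; the trailing '::' makes it unambiguous that
# IO::Uncompress::Bunzip2 is a package name rather than a subroutine.
my $z2 = IO::Uncompress::Bunzip2::->new($filename);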

— Ken

Re^2: Searching large files a block at a time
by JediWombat (Novice) on Aug 02, 2017 at 05:53 UTC
    Thank you, Ken! The bit that helped me understand what I was doing wrong was this:
    my $z = IO::Uncompress::Bunzip2::->new($filename);
    while (<$z>) {
    }

    What I needed was a way to use "while (<data>)" without using the "getline()" method, which seemed to be reading the data one line at a time. My LDIF is over 15 million lines, so that was quite slow. Using your code, I get results in ~10 seconds, which is acceptable (though still a lot slower than the shell script that pipes into Perl, and I'm not sure why that is). Thanks to you and Roboticus for steering me in the right direction. Cheers, JW.
      "... helped me understand what I was doing wrong ..."

      OK, that's a good start.

      "Using your code, I get results in ~10 seconds, which is acceptable (though still a lot slower than the shell script that pipes into Perl, and I'm not sure why that is). "

      I'm completely guessing, but the overhead may be due to the IO::Uncompress::Bunzip2 module. You could avoid that module by setting up the same pipe, but from within the Perl script (rather than piping into the script from the shell).

      I put exactly the same data I used previously into a text file (just a copy and paste):

      $ cat > pm_1196493_paragraph_mode_test_data.txt
      Block1 Line1
      ...
      Block4 Line3
      ^D

      I then modified the start of my previous example code, so it now looks like this:

      #!/usr/bin/env perl -l

      use strict;
      use warnings;
      use autodie;

      my $filename = 'pm_1196493_paragraph_mode_test_data.txt';

      open my $z, '-|', "cat $filename";

      {
          local $/ = '';
          while (<$z>) {
              chomp;
              print '--- One Block ---';
              print;
          }
      }

      This produces exactly the same output as before. Obviously, you'll want to change 'cat' to '/usr/bin/bzcat' (and, of course, use *.bz2 instead of *.txt files). This solution will not be platform-independent: that may not matter to you. See open for more on the '-|', and closely related '|-', modes.
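
      As a rough illustration of the difference between the two modes (the filenames and commands here are just placeholders, not from the thread):

      use strict;
      use warnings;

      # '-|': read from a command; its STDOUT becomes our input stream.
      open my $in, '-|', 'bzcat example.ldif.bz2'
          or die "Can't run bzcat: $!";
      print while <$in>;

      # '|-': write to a command; whatever we print goes to its STDIN.
      open my $out, '|-', 'bzip2 -c > copy.txt.bz2'
          or die "Can't run bzip2: $!";
      print {$out} "compress me\n";
      close $out;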

      Also, note that I used the autodie pragma. If you want more control over handling I/O problems, you can hand-craft messages (e.g. open ... or die "..."), or use something like Try::Tiny.
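
      For example, the hand-crafted and Try::Tiny approaches might look roughly like this (a sketch only, reusing the placeholder filename from above):

      use strict;
      use warnings;
      use Try::Tiny;

      my $filename = 'pm_1196493_paragraph_mode_test_data.txt';

      # Hand-crafted message instead of autodie:
      open my $fh, '-|', "cat $filename"
          or die "Can't start 'cat $filename': $!";

      # Or wrap the I/O in Try::Tiny and decide what to do on failure:
      try {
          open my $z, '-|', "cat $filename"
              or die "Can't start 'cat $filename': $!";
          # ... process <$z> here ...
      }
      catch {
          warn "Problem reading '$filename': $_";
      };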

      — Ken

        Thanks again, Ken. I've built this code using your and Mario's responses:
        $/ = "";
        open my $fh, "-|", "/usr/bin/bzcat $file";
        while (<$fh>) {
            if (/uid=$mbnum/m) {
                print $_;
                last;
            }
        }

        I've timed this version and all the others: this one completed in 3.2 seconds, the previous version I built with your help took 8 seconds, and my original took 25.4 seconds! As I need to scan through three different LDIFs, that's a total of under 10 seconds on average.