comment on

This is a continuation of Re: Problem in RAM usage while threading the program.

Update 1: Added parallel chunking demonstration.
Update 2: Inline::C demonstrations can be found here.

Part Three create the Kyoto Cabinet DB

This is where the fun begins. The # after the file name is how to pass options. See the perldoc for more info. Notice the split line. There is no reason to split the entire sequence header line.

# https://www.perlmonks.org/?node_id=11110379
# usage: perl create_db.pl

use strict;
use warnings;
use KyotoCabinet;

unlink 'db.kch';

# construct the database object
my $db = KyotoCabinet::DB->new();

# open the database
if (! $db->open('db.kch#msiz=128m', $db->OWRITER | $db->OCREATE)) {
    die "open error (db): ", $db->error;
}

open my $fh, '<', 'NR.fasta' or die "open error: $!\n";

my $nrid = '';
my $seq  = '';

while ( <$fh> ) {
    chomp; $_ =~ s/\r$//g;
    if ( /^>/ ) {
        $db->set($nrid, $seq) if $seq;
        $nrid = ( split /\s/, $_, 2 )[0];
        $seq  = '';
    }
    else {
        $seq .= $_;
    }
}

$db->set($nrid, $seq) if $seq;

close $fh;
$db->close;
[download]

Part Four search DB serial code

One might like to measure the difference between running serially and parallel. This makes use of cursor in Kyoto Cabinet. The substr is for stripping the leading comma when displaying the value to STDOUT.

# https://www.perlmonks.org/?node_id=11110379
# usage: perl search_db.pl > Outfile.txt

use strict;
use warnings;
use KyotoCabinet;

# construct the database object
my $db = KyotoCabinet::DB->new();

# open the database
if (! $db->open('db.kch#msiz=128m', $db->OREADER | $db->ONOLOCK)) {
    die "open error (db): ", $db->error;
}
my $cur = $db->cursor;

open my $fh, '<', 'peptides.txt' or die "open error: $!\n";

while ( my $pep = <$fh> ) {
    my $ids = ''; chomp $pep;

    $cur->jump; # first record
    while ( my ($key, $val) = $cur->get(1) ) {
        $ids .= ",$key" if index($val, $pep) >= 0;
    }

    print "$pep\t", substr($ids, 1), "\n" if $ids;
}

close $fh;

$cur->disable;
$db->close;
[download]

Part Five search DB parallel code

Here it is. This is the parallel code. It's practically the serial code above wrapped into MCE. Each worker opens a handle to the DB. This is done inside the user_begin block. Likewise, closing is done inside the user_end block. MCE::Relay is helpful for ensuring workers write serially to STDOUT. Thus, no reason to have workers send output to the manager process or store in a shared hash (i.e. MCE::Shared).

# https://www.perlmonks.org/?node_id=11110379
# usage: perl search_db_mce.pl > Outfile.txt

use strict;
use warnings;
use KyotoCabinet;
use MCE;

# construct the database object
my $db = KyotoCabinet::DB->new();
my $cur;

my $mce = MCE->new(
    max_workers => MCE::Util::get_ncpu(),
    chunk_size  => 1,
    init_relay  => 1,

    user_begin => sub {
        # open the database
        if (! $db->open('db.kch#msiz=128m', $db->OREADER | $db->ONOLOC
+K)) {
            die "open error (db): ", $db->error;
        }
        $cur = $db->cursor;
    },

    user_end => sub {
        # close the database
        $cur->disable;
        $db->close;
    },

    user_func => sub {
        my $pep = $_; chomp $pep;
        my $ids = '';

        $cur->jump; # first record
        while ( my ($key, $val) = $cur->get(1) ) {
            $ids .= ",$key" if index($val, $pep) >= 0;
        }
    
        # output serially, one worker at a time
        MCE::relay {
            print "$pep\t", substr($ids, 1), "\n" if $ids;
        };
    }
);

$mce->process('peptides.txt');
$mce->shutdown;
[download]

Part Six search DB parallel code - chunking

Chunking helps reduce time by another 25% ~ 30% for this use case. This is beneficial on machines with many cores.

# https://www.perlmonks.org/?node_id=11110379
# usage: perl search_db_chunk.pl > Outfile.txt

use strict;
use warnings;
use KyotoCabinet;
use MCE;

# construct the database object
my $db = KyotoCabinet::DB->new();
my $cur;

my $mce = MCE->new(
    max_workers => MCE::Util::get_ncpu(),
    chunk_size  => 7,
    init_relay  => 1,

    user_begin => sub {
        # open the database
        if (! $db->open('db.kch#msiz=128m', $db->OREADER | $db->ONOLOC
+K)) {
            die "open error (db): ", $db->error;
        }
        $cur = $db->cursor;
    },

    user_end => sub {
        # close the database
        $cur->disable;
        $db->close;
    },

    user_func => sub {
        my ( $mce, $chunk_ref, $chunk_id ) = @_;
        my %ret; chomp @$chunk_ref;

        $cur->jump; # first record
        while ( my ($key, $val) = $cur->get(1) ) {
            for ( @$chunk_ref ) {
                $ret{$_} .= ",$key" if index($val, $_) >= 0;
            }
        }

        # output serially, one worker at a time
        MCE::relay {
            for ( @$chunk_ref ) {
                print "$_\t", substr($ret{$_}, 1), "\n" if $ret{$_};
            }
        };
    }
);

$mce->process('peptides.txt');
$mce->shutdown;
[download]

Outfile.txt

GAAGGACTGGGACCA    >NR_000001,>NR_006611
AGGCTGCGGCAGGAC    >NR_062102
GTGAGCCGGGCAGAG    >NR_089584
AGGGGGGGTTGCTGA    >NR_036454,>NR_068535,>NR_097889
CTGACATGCGGCGCA    >NR_087289
GTGCATCGATGGCCG    >NR_005535
GGGGTCAAGCGAACC    >NR_076289,>NR_087856
TGAGACGGCGAACCT    >NR_064242
AGCGACAAAGGAAAC    >NR_045865
AGGTGCAACCATGGA    >NR_046602,>NR_056869
GAGTAAACCGCGCGA    >NR_093455
AACGACTGAACAGCG    >NR_070693
ACGCGTAATTCGATA    >NR_080086
GATGAGCGGAGCACT    >NR_070118
CGTAGCGAAACCGAG    >NR_092384,>NR_098291
GGGGGGGAGGTCCGA    >NR_021671,>NR_036907,>NR_080961
AGGGAGGGGGGTTGT    >NR_026207
ATGGGGCAGACGCGA    >NR_072314
[download]

Regards, Mario

In reply to Re^2: Problem in RAM usage while threading the program by marioroy
in thread Problem in RAM usage while threading the program by beherasan

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.