in reply to Re: Sharing Hash Question
in thread Sharing Hash Question

The only plausible reason to use threads is to achieve overlapping of I/O.

That statement is bo.. er .. has no basis in reality.

Likewise the rest of this misguided garbage.



Re^3: Sharing Hash Question
by jmmach80 (Initiate) on Jul 05, 2012 at 19:17 UTC
    In case anybody cares, I finally got it to do what I was trying to do. If anybody needs to parse large text files using multi-threading, here's a simple script that might help. I apologize about my original post's vagueness. I have never posted on here before and had forgotten about the tags that you can use. Anyway, if people can improve upon it, feel free. I'm always looking for better ways to do things.
    use strict;
    use warnings;
    use threads;
    use threads::shared;
    use Thread::Queue;

    # Constant that holds the maximum number of threads to start
    use constant MAX_THREADS => 10;

    # Main data structure that holds all the data
    my %hash : shared;

    # A new empty queue
    my $q = Thread::Queue->new();

    # Build list of files
    my @files = qw/<file1> <file2> <file3> <etc.>/;
    chomp(@files);

    # Enqueue the files
    $q->enqueue(map($_, @files));

    # Start the threads and wait for them to finish
    for(my $i=0; $i<MAX_THREADS; $i++) {
        threads->create( \&thread, $q )->join;
    }

    # Print out the data structure when we're finished
    foreach my $key1 (keys %hash) {
        print "$key1 =>\n";
        foreach my $key2 (keys %{$hash{$key1}}) {
            print "\t$key2 =>\n";
            print map("\t\t$_\n", @{$hash{$key1}{$key2}});
        }
    }

    #############################
    # This code runs inside of the thread
    #############################
    sub thread {
        my ($q) = @_;
        while (my $file = $q->dequeue_nb()) {
            my @array1 : shared;
            my @array2 : shared;
            my @array3 : shared;

            # Lock the main hash before writing
            lock(%hash);
            chomp($file);

            # Initialize hash with the file/key
            $hash{$file} = &share({});

            # Open the file and pattern match the lines
            open(FH, $file) or die "Can't open\n";
            while(my $line = <FH>) {
                chomp($line);
                # Build arrays of the things we're
                # looking for in the file(s)
                if($line =~ /^<regex1>/) {
                    push(@array1, $line);
                }
                elsif($line =~ /^<regex2>/) {
                    push(@array2, $line);
                }
                elsif($line =~ /^<regex3>/) {
                    push(@array3, $line);
                }
            }
            close(FH);

            share( $hash{$file}{<type1>} );
            share( $hash{$file}{<type2>} );
            share( $hash{$file}{<type3>} );

            # Can only assign arrays as a reference
            $hash{$file}{<type1>} = \@array1;
            $hash{$file}{<type2>} = \@array2;
            $hash{$file}{<type3>} = \@array3;
        }
    }

    exit;
      I apologize about my original post's vagueness. I have never posted on here before and had forgotten about the tags that you can use.

      Understood.

      Has your program sped up your processing even slightly?

      I'll assume the answer is no. There are several overlapping reasons why that must be the answer.

      The first is this:

      for(my $i=0; $i<MAX_THREADS; $i++) {
          threads->create( \&thread, $q )->join;
      }

      Creating a thread in a loop but then waiting inside that loop (join()) for it to finish before starting the next one has exactly the same effect as simply calling the subroutine many times, one after the other.

      Ie. The code above is exactly the same as doing:

      thread( $q ); thread( $q ); thread( $q ); thread( $q ); thread( $q ); thread( $q ); thread( $q ); thread( $q ); thread( $q ); thread( $q );

      Except that, in addition to not speeding things up, you made the processing take considerably longer, because you added the overhead of starting 10 threads and of locking and manipulating shared hashes.
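
      To make that concrete, here is a minimal, self-contained timing sketch (not from the original posts; it assumes a threads-enabled perl and uses sleeping workers in place of the real file parsing). Because each join() blocks until its thread has finished, the five one-second sleeps run back to back and the loop takes roughly five seconds:

      use strict;
      use warnings;
      use threads;
      use Time::HiRes qw( time );

      my $start = time;
      for ( 1 .. 5 ) {
          # join() blocks here until this worker is done, so the next
          # thread is not even created until the previous one has finished.
          threads->create( sub { sleep 1 } )->join;
      }
      printf "create-and-join inside the loop: %.1f seconds\n", time - $start;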

      You can correct that by starting all the threads in the loop, and then waiting for them all to finish after the loop, so that they can run concurrently:

      my @threads = map threads->create( \&thread, $q ), 1 .. MAX_THREADS;
      $_->join for @threads;

      This will run more quickly than your code above, but still not faster than a single-threaded process doing the same work.
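
      For contrast, here is the same sketch with the corrected pattern (again not from the original posts, and again using sleeping workers, which overlap trivially; real parsing work is another matter, as noted above). It finishes in roughly one second because the five sleeps now run concurrently:

      use strict;
      use warnings;
      use threads;
      use Time::HiRes qw( time );

      my $start = time;
      # Start every worker first ...
      my @workers = map { threads->create( sub { sleep 1 } ) } 1 .. 5;
      # ... then wait for them all, so they run concurrently.
      $_->join for @workers;
      printf "create first, join after: %.1f seconds\n", time - $start;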

      When you've convinced yourself that is true, come back and I'll explain why and what you can do about it.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

      The start of some sanity?

        Yeah I didn't mean to do the join() like that. The actual script does it like you described. I just had to throw this together primarily for the post.