in reply to Sharing Hash Question

Let’s all please forget the first misguided volley in this tennis match and see what can be done to address the problem. I am not sure that spinning it off into multiple threads will be helpful at all, particularly since the to-be-shared hash data structure would necessarily be common to all of them and, hence, their execution would wind up being serialized anyhow. I think that BrowserUk was (correctly) trying to focus your attention on that, even though his choice of wording was ... less than delicate.

So, given that we have a technical problem here, let’s stay focused on that, shall we? The only plausible reason to use threads is to achieve overlapping of I/O. If the root problem is, as I suspect, “paging churn,” having a bunch of threads or processes “churning” at once will merely make the completion time significantly worse than before.

You don’t (and of course, you can’t) explain what “among other things” might be, but my initial impression about almost any program that takes “a long time” to process “very large” files is that it is burning up too much memory and/or causing excessive paging ... easy to do with large random-access data structures. I would suggest measuring the program as it runs, even informally, to see what kind of memory footprint it has and what’s actually causing the (single...) process to wait. Then I would reconsider the possible solutions, but with threading set aside as one of the alternatives.
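One informal way to do that measuring is to compare wall-clock time against CPU time for the slow part of the program. This is only a sketch: the `measure` helper and the sleeping stand-in workload are hypothetical, not part of the original script. It uses Perl's built-in `times` and the core `Time::HiRes` module. A wall time far above the CPU time hints that the process is mostly waiting (on I/O or paging), and an OS tool such as top or vmstat can then show the memory side.

```perl
use strict;
use warnings;
use Time::HiRes qw(time sleep);

# Hypothetical helper: run a code ref and return (wall seconds, CPU seconds).
# CPU seconds are user + system time charged to this process.
sub measure {
    my ($work) = @_;
    my $wall_start = time;
    my ($user_start, $sys_start) = times;

    $work->();

    my ($user_end, $sys_end) = times;
    my $wall = time - $wall_start;
    my $cpu  = ($user_end - $user_start) + ($sys_end - $sys_start);
    return ($wall, $cpu);
}

# Stand-in workload: sleeping models a process that waits instead of computing.
# Replace the sub body with the real file-processing call.
my ($wall, $cpu) = measure(sub { sleep 0.5 });
printf "wall %.2fs, cpu %.2fs\n", $wall, $cpu;
```

If the two numbers are close, the program is CPU-bound; if wall time dominates, look at what it is waiting for before reaching for threads.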

Re^2: Sharing Hash Question
by BrowserUk (Patriarch) on Jul 05, 2012 at 15:32 UTC
    The only plausible reason to use threads is to achieve overlapping of I/O.

    That statement is bo.. er .. has no basis in reality.

    Likewise the rest of this misguided garbage.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      In case anybody cares, I finally got it to do what I was trying to do. If anybody needs to parse large text files using multi-threading, here's a simple script that might help. I apologize about my original post's vagueness. I have never posted on here before and had forgotten about the tags that you can use. Anyway, if people can improve upon it, feel free. I'm always looking for better ways to do things.

      use strict;
      use warnings;
      use threads;
      use threads::shared;
      use Thread::Queue;

      # Constant that holds the maximum number of threads to start
      use constant MAX_THREADS => 10;

      # Main data structure that holds all the data
      my %hash : shared;

      # A new empty queue
      my $q = Thread::Queue->new();

      # Build list of files
      my @files = qw/<file1> <file2> <file3> <etc.>/;
      chomp(@files);

      # Enqueue the files
      $q->enqueue(map($_, @files));

      # Start the threads and wait for them to finish
      for(my $i=0; $i<MAX_THREADS; $i++) {
          threads->create( \&thread, $q )->join;
      }

      # Print out the data structure when we're finished
      foreach my $key1 (keys %hash) {
          print "$key1 =>\n";
          foreach my $key2 (keys %{$hash{$key1}}) {
              print "\t$key2 =>\n";
              print map("\t\t$_\n", @{$hash{$key1}{$key2}});
          }
      }

      #############################
      # This code runs inside of the thread
      #############################
      sub thread {
          my ($q) = @_;
          while (my $file = $q->dequeue_nb()) {
              my @array1 : shared;
              my @array2 : shared;
              my @array3 : shared;

              # Lock the main hash before writing
              lock(%hash);
              chomp($file);

              # Initialize hash with the file/key
              $hash{$file} = &share({});

              # Open the file and pattern match the lines
              open(FH, $file) or die "Can't open\n";
              while(my $line = <FH>) {
                  chomp($line);

                  # Build arrays of the things we're
                  # looking for in the file(s)
                  if($line =~ /^<regex1>/) {
                      push(@array1, $line);
                  } elsif($line =~ /^<regex2>/) {
                      push(@array2, $line);
                  } elsif($line =~ /^<regex3>/) {
                      push(@array3, $line);
                  }
              }
              close(FH);

              share ( $hash{$file}{<type1>} );
              share ( $hash{$file}{<type2>} );
              share ( $hash{$file}{<type3>} );

              # Can only assign arrays as a reference
              $hash{$file}{<type1>} = \@array1;
              $hash{$file}{<type2>} = \@array2;
              $hash{$file}{<type3>} = \@array3;
          }
      }
      exit;

        I apologize about my original post's vagueness. I have never posted on here before and had forgotten about the tags that you can use.

        Understood.

        Has your program sped up your processing even slightly?

        I'll assume the answer is no. There are several overlapping reasons for why that must be the answer.

        The first is this:

        for(my $i=0; $i<MAX_THREADS; $i++) {
            threads->create( \&thread, $q )->join;
        }

        The effect of creating many threads in a loop, but also waiting inside that loop for each one to finish (join()), before starting the next, is exactly the same as if you just called the subroutine many times one after the other.

        I.e., the code above is exactly the same as doing:

        thread( $q );
        thread( $q );
        thread( $q );
        thread( $q );
        thread( $q );
        thread( $q );
        thread( $q );
        thread( $q );
        thread( $q );
        thread( $q );

        Except that in addition to not speeding things up, you made them take considerably longer because you added the additional overhead of starting 10 threads and of locking and manipulating shared hashes.

        You can correct that by starting all the threads in the loop, and then waiting for them all to finish after the loop, so they can run concurrently:

        my @threads = map threads->create( \&thread, $q ), 1 .. MAX_THREADS;
        $_->join for @threads;

        This will run more quickly than your code above, but still not faster than a single-threaded process doing the same work.

        When you've convinced yourself that is true, come back and I'll explain why and what you can do about it.
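
To see the join-placement point in isolation, here is a small timing sketch. It is not the script from this thread: the `worker` sub, the `WORKERS` constant, and the one-second sleep standing in for real work are all assumptions made for illustration, and it needs a perl built with threads support. It only shows that joining inside the loop serializes the threads; it says nothing about how the real file-parsing workload compares with a single-threaded run.

```perl
use strict;
use warnings;
use threads;
use Time::HiRes qw(time);

# Hypothetical worker count for the demonstration.
use constant WORKERS => 3;

# Stand-in for real work: each worker just waits one second.
sub worker { sleep 1; return threads->tid }

# Pattern 1: join inside the loop, so each thread must finish
# before the next one is even created.
my $start = time;
threads->create( \&worker )->join for 1 .. WORKERS;
my $serial = time - $start;

# Pattern 2: create every thread first, then join them all,
# so the workers overlap.
$start = time;
my @threads = map threads->create( \&worker ), 1 .. WORKERS;
$_->join for @threads;
my $concurrent = time - $start;

printf "join in loop: %.1fs, join after loop: %.1fs\n", $serial, $concurrent;
```

With sleeping workers the first pattern should take roughly WORKERS times as long as the second, because nothing overlaps.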

