Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I'm trying to "parallelize" my code using threading; I used the 'threads' library for this. I'm interested in changing variables (a hash, for example) without using the same variable in two different threads (i.e. without using the threads::shared option).

For example:

my %hash_1 = ();
$hash_1{"michael"} = "michael";
$hash_1{"sasi"}    = "other";
$hash_1{"wife"}    = 0;

my $thr  = threads->create(\&hello, "michael", \%hash_1)->join;
my $thr1 = threads->create(\&hello, "sasi",    \%hash_1)->join;

print " ".$hash_1{"wife"}."\n"; # will print '0' although it was manipulated in subroutine 'hello'

sub hello {
    my ($who, $hash) = @_;
    print "hello from thread: $who ".$hash->{$who}."\n";
    if (defined($hash->{"wife"}) && !$hash->{"wife"}) {
        $hash->{"wife"} = "Deena";
        print " ".$hash->{"wife"}."\n";
    }
}

I was hoping that since I'm passing a reference to the hash (i.e. an address), the 'hello' function would behave as usual and change the hash outside the function's scope. I also tried passing parameters like this:

my $thr1 = threads->create(\&hello, qw("sasi" \%hash_1))->join;

without much luck. Does anyone have an idea how to make it work (if possible)?

Thanks in advance.

Michael

Re: changing parameters in a thread
by BrowserUk (Patriarch) on Mar 29, 2009 at 12:47 UTC
    I'm interested in changing variables (a hash, for example) without using the same variable in two different threads (i.e. without using the threads::shared option).

    You can't!*

    (Why would you want to?)

    (*) The very design of threads is intended to prevent you from doing exactly that, either accidentally or explicitly.
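
    A minimal sketch of what that design means in practice (this example is mine, not the OP's): when a thread starts it receives its own copy of every existing variable, so writes through a plain hash reference only touch the thread's private copy; making a write visible to the parent requires threads::shared.

    ```perl
    # Sketch: plain data is cloned per thread, while a hash marked
    # ':shared' is truly shared across threads.
    use strict;
    use warnings;
    use threads;
    use threads::shared;

    my %plain = ( wife => 0 );
    threads->create(sub { $_[0]{wife} = "Deena" }, \%plain)->join;
    print "plain:  $plain{wife}\n";    # still 0 - the thread changed its own copy

    my %shared :shared = ( wife => 0 );
    threads->create(sub { $_[0]{wife} = "Deena" }, \%shared)->join;
    print "shared: $shared{wife}\n";   # Deena - threads::shared makes it visible
    ```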


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      hi,

      Thanks for the fast answer; unfortunately, it didn't help me.

      The scenario of my case is as such:

      I'm comparing two huge files with a unique structure. To do this, I'm reading each file into a hash structure and then comparing the two hashes.

      Reading the files in parallel (and also comparing them in parallel) should reduce the run time significantly, since I'm working on a multi-CPU machine.

      I thought that the best way to do that is with multithreading; I can control the flow of the program by catching the status of each thread using '->join'.

      what do you think?

      Michael

        You don't explain what you mean by "unique structure", or why you need to load the file data into hashes in order to compare them, so I'm going to take it as read that you do need to do that. But as pointed out above, putting huge files into hashes will require vastly more memory than the data occupies on disk. A factor of 5 won't be far wrong if it is a flat hash you are building; if you need a more complicated nested structure, you will probably need a higher multiplier.

        You are unlikely to see any performance benefit from reading two files in parallel, unless they exist on different drives. Performance will be limited by the seek performance and throughput of the drive, and accessing two huge files in parallel on the same drive will exacerbate the problems.

        Think of it like trying to read two different chapters in a book in parallel. The read head (your eyes) will be constantly flicking back and forth between the front and the back of the book.

        If you can arrange for them to be on separate drives, then there will probably be some gains to be had from reading in parallel (controllers and other factors can also be an influence). But building the hashes in different threads and then comparing them is a bad idea: the internal and user-level locking required to prevent corruption will severely impact performance.

        Assuming different drives and the necessity to build hashes, you would be far better off reading the files line by line on separate threads and conveying the lines to a third thread that performs the hash building and comparisons.
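
        That pattern can be sketched with Thread::Queue (my example, not the OP's code; the demo files and the key<TAB>value line format are assumptions): one reader thread per file enqueues lines, and a single builder thread owns both hashes, so no data is shared between threads and no user-level locking is needed.

        ```perl
        # Two reader threads feed lines to one builder thread via queues.
        use strict;
        use warnings;
        use threads;
        use Thread::Queue;
        use File::Temp qw(tempfile);

        # Create two small demo files standing in for the "huge" inputs.
        my @files;
        for my $content ("a\t1\nb\t2\n", "a\t1\nb\t3\n") {
            my ($fh, $name) = tempfile(UNLINK => 1);
            print $fh $content;
            close $fh;
            push @files, $name;
        }

        my @queues = map { Thread::Queue->new } 0 .. 1;

        # Reader threads: read lines, enqueue them, then signal end-of-file.
        my @readers = map {
            my ($file, $q) = ($files[$_], $queues[$_]);
            threads->create(sub {
                open my $in, '<', $file or die "open $file: $!";
                $q->enqueue($_) while <$in>;
                $q->enqueue(undef);    # end-of-file marker
            });
        } 0 .. 1;

        # Builder thread: the only thread that touches the hashes.
        my $builder = threads->create(sub {
            my @hashes = ({}, {});
            for my $i (0 .. 1) {
                while (defined(my $line = $queues[$i]->dequeue)) {
                    chomp $line;
                    my ($k, $v) = split /\t/, $line, 2;
                    $hashes[$i]{$k} = $v;
                }
            }
            my ($ha, $hb) = @hashes;
            my @diff = grep { !exists $hb->{$_} or $hb->{$_} ne $ha->{$_} }
                       keys %$ha;
            return scalar @diff;
        });

        $_->join for @readers;
        my $differences = $builder->join;
        print "differing keys: $differences\n";   # 1 for the demo data ('b')
        ```

        Note that the queues are the only shared structures; Thread::Queue handles the locking internally.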

        That said, unless you can perform partial comparisons on the fly and conserve memory by discarding parts of the structures built as you finish with them, then you are likely to be constrained by memory.

        And if you can discard chunks of memory before you have read the files completely, then one wonders why you need to build the structures in the first place. Wouldn't a line by line comparison be possible?

        The bottom line is that whether there is any benefit in parallelising your program depends entirely upon the nature of your data, and the circumstances of your hardware setup, and you have not described either in sufficient detail to allow anyone to give you good advice.


        If the comparison you are doing doesn't involve any complex transformations of the data structure, or really time-consuming math, then your CPUs will mostly sit around looking at the daisies while they are waiting for your hard disk to deliver the data. Disk I/O is slow, REALLY slow, compared to the speed of your memory or CPUs.

        So no matter how many CPUs you have to do the job, the only thing that probably matters in your case is how fast your disk (or disks) can read the data (and what algorithm you are using).

        And if the hashes are so big that they don't fit into RAM, your machine starts to swap, i.e. it pushes part of its memory contents back onto the hard disk, which makes you even more dependent on hard disk speed. This usually degrades into the program doing nothing but swapping, which is called 'thrashing'.

        So your solution might be, depending on your circumstances:
        1) Buy a faster hard disk or use a raid
        2) Do some preprocessing of your data so that it takes up less space
        3) Buy more RAM
        4) Use a database for one of the huge files and compare the second one by accessing the database.
        5) Depending on your data use some algorithm that avoids reading in the two files completely into memory, for example through a merge sort
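
        Option 4 can be sketched like this (my example; it assumes DBD::SQLite is installed and a key<TAB>value record format): load the first file into an indexed table once, then stream the second file and look each key up, so only one record is in memory at a time.

        ```perl
        # Compare a streamed file against one loaded into SQLite.
        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                               { RaiseError => 1, AutoCommit => 0 });
        $dbh->do('CREATE TABLE file_a (k TEXT PRIMARY KEY, v TEXT)');

        # In real use this loop would read the first huge file line by line;
        # the tiny in-line records here stand in for it.
        my $ins = $dbh->prepare('INSERT INTO file_a (k, v) VALUES (?, ?)');
        $ins->execute(@$_) for ( [a => 1], [b => 2] );
        $dbh->commit;

        # Stream the second file's records and compare against the database.
        my $get = $dbh->prepare('SELECT v FROM file_a WHERE k = ?');
        my @diff;
        for my $rec ( [a => 1], [b => 3] ) {       # stands in for file B
            my ($k, $v) = @$rec;
            $get->execute($k);
            my ($stored) = $get->fetchrow_array;
            push @diff, $k if !defined $stored or $stored ne $v;
        }
        print "differing keys: @diff\n";           # b
        $dbh->disconnect;
        ```

        A file-backed database (dbname=some_file.db rather than :memory:) keeps memory use flat no matter how large the first file is.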

Re: changing parameters in a thread
by sundialsvc4 (Abbot) on Mar 29, 2009 at 22:09 UTC
    I'm comparing two huge files with unique structure, for this matter I'm reading each file into a hash structure, and compering the two hash.

    If you're approaching the problem this way, “you're already dead, and using multiple threads won't really help you.” You're moving a huge file into another huge file (the virtual-memory store), and comparing them in a way that's certain to produce page faults. Your program is doomed to run like a constipated snail...

    Take the two files, sort them using a disk-based sort, then compare the two sorted streams. The sort operation will be very fast and efficient, and the file comparison will proceed like greased lightning. You don't have to “search for” anything.
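
    A minimal sketch of that comparison (mine, not from the post; it assumes both files have already been sorted, e.g. with the system sort(1), and compares whole lines): walk the two sorted streams in lock-step, advancing whichever side holds the smaller line, so memory use stays constant regardless of file size.

    ```perl
    # Merge-compare two pre-sorted streams in lock-step.
    use strict;
    use warnings;

    # Takes two filehandles over line-sorted files; returns the lines
    # unique to each side. A real version might split out a key field.
    sub compare_sorted {
        my ($fh_a, $fh_b) = @_;
        my (@only_a, @only_b);
        my $la = <$fh_a>;
        my $lb = <$fh_b>;
        while (defined $la and defined $lb) {
            chomp(my $ca = $la); chomp(my $cb = $lb);
            if    ($ca lt $cb) { push @only_a, $ca; $la = <$fh_a>; }
            elsif ($ca gt $cb) { push @only_b, $cb; $lb = <$fh_b>; }
            else               { $la = <$fh_a>; $lb = <$fh_b>; }
        }
        while (defined $la) { chomp $la; push @only_a, $la; $la = <$fh_a>; }
        while (defined $lb) { chomp $lb; push @only_b, $lb; $lb = <$fh_b>; }
        return (\@only_a, \@only_b);
    }

    # Demo on in-memory "files" (sorted input is the precondition).
    open my $fa, '<', \"apple\nbanana\ncherry\n" or die $!;
    open my $fb, '<', \"banana\ncherry\ndate\n"  or die $!;
    my ($only_a, $only_b) = compare_sorted($fa, $fb);
    print "only in A: @$only_a\n";   # apple
    print "only in B: @$only_b\n";   # date
    ```

    Each stream is read exactly once, front to back, which is precisely the access pattern hard disks are good at.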

    Welcome to COBOL ... to sort and to merge. The techniques born of necessity eighty years ago (before commercial computers were invented...) still work. Only better.

    Your program will work literally hundreds of times faster than before... even on a single CPU.

    “Don’t diddle your code to make it faster:   find a better algorithm.”
    The Elements of Programming Style (Kernighan)