in reply to changing parameters in a thread

I'm interested in changing variables (a hash, for example) without using the same variable in two different threads (i.e. without using threads::shared).

You can't!*

(Why would you want to?)

(*) The very design of threads is intended to prevent you from doing that, either accidentally or explicitly.
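
A minimal sketch of what that design means in practice, assuming perl's ithreads: every variable that is not explicitly shared is cloned into the new thread, so updates made inside the thread never leak back out:

    use strict;
    use warnings;
    use threads;

    my %h = ( count => 0 );

    # %h is not shared, so the new thread works on its own private clone;
    # the assignment below never affects the parent's copy.
    my $t = threads->create( sub { $h{count} = 42; return $h{count} } );

    print "inside the thread: ", $t->join, "\n";   # prints 42
    print "in the parent:     $h{count}\n";        # still 0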


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Re^2: changing parameters in a thread
by Anonymous Monk on Mar 29, 2009 at 12:58 UTC

    hi,

    thanks for the fast answer, but it didn't help me.

    The scenario of my case is as such:

    I'm comparing two huge files with a unique structure. For this, I'm reading each file into a hash structure and comparing the two hashes.

    Reading the files in parallel (and also comparing them in parallel) will reduce the run time significantly, since I'm working on a multi-CPU machine.

    I thought that the best way to do that is by multithreading; I can control the flow of the program by catching the status of each thread using '->join'.
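
    For illustration, a minimal sketch of that join-based flow, with hypothetical file names and an assumed tab-separated record format: each thread builds its own private hash, and '->join' hands a copy of the result back to the parent, so nothing needs to be shared.

        use strict;
        use warnings;
        use threads;

        # Hypothetical: load_hash() reads one file into a hash and
        # returns a reference; ->join copies the result to the parent.
        sub load_hash {
            my ( $file ) = @_;
            my %h;
            open my $fh, '<', $file or die "open $file: $!";
            while ( <$fh> ) {
                chomp;
                my ( $key, $val ) = split /\t/, $_, 2;   # assumed format
                $h{$key} = $val;
            }
            return \%h;
        }

        my $t1 = threads->create( \&load_hash, 'file_a.dat' );
        my $t2 = threads->create( \&load_hash, 'file_b.dat' );
        my ( $ha, $hb ) = ( $t1->join, $t2->join );
        # ... compare %$ha with %$hb here ...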

    what do you think?

    Michael

      You don't explain what you mean by "unique structure", or why you need to load the file data into hashes in order to compare them, so I'm going to take it as read that you do need to do that. But as pointed out above, putting huge files into hashes will require vastly more memory to hold the data than the files occupy on disk. A factor of 5 won't be far wrong if it is a flat hash you are building; if you need a more complicated nested structure, you will probably need a higher multiplier.
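
      For a rough sense of that overhead, the CPAN module Devel::Size (not core, so it may need installing) can report a structure's in-memory footprint. A sketch with made-up record sizes:

          use strict;
          use warnings;
          use Devel::Size qw( total_size );

          # 100_000 records of ~18 bytes each: under 2 MB as flat text,
          # but several times that once stored as hash keys and values.
          my %h;
          $h{ sprintf 'key%07d', $_ } = 'x' x 8 for 1 .. 100_000;
          printf "in-memory: %.1f MB\n", total_size( \%h ) / 2**20;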

      You are unlikely to see any performance benefit from reading two files in parallel, unless they exist on different drives. Performance will be limited by the seek performance and throughput of the drive, and accessing two huge files in parallel on the same drive will exacerbate the problems.

      Think of it like trying to read two different chapters in a book in parallel. The read head (your eyes) will be constantly flicking back and forth between the front and the back of the book.

      If you can arrange for them to be on separate drives, then there will probably be some gains to be had from reading in parallel (controllers and other factors can also be an influence). But building the hashes in different threads and then comparing them is a bad idea. The internal and user-level locking required to prevent corruption will severely impact performance.

      Assuming different drives and the necessity to build hashes, you would be far better off reading the files line by line on separate threads and conveying the lines to a third thread that performs the hash building and comparisons.
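
      A sketch of that arrangement using Thread::Queue, with the main thread playing the part of the third, consuming thread; the file names and the tab-separated record format are assumptions:

          use strict;
          use warnings;
          use threads;
          use Thread::Queue;

          my @files = ( 'left.dat', 'right.dat' );   # hypothetical names
          my $q     = Thread::Queue->new;

          # One reader thread per file; each enqueues [ file-index, line ]
          # pairs, ending with an undef line as an end-of-file marker.
          my @readers = map {
              my ( $i, $file ) = ( $_, $files[$_] );
              threads->create( sub {
                  open my $fh, '<', $file or die "open $file: $!";
                  $q->enqueue( [ $i, $_ ] ) while <$fh>;
                  $q->enqueue( [ $i, undef ] );
              } );
          } 0 .. $#files;

          # Consumer: build both hashes (and compare) as the lines arrive.
          my @hash = ( {}, {} );
          my $open = @readers;
          while ( $open ) {
              my ( $i, $line ) = @{ $q->dequeue };
              if ( !defined $line ) { $open--; next }
              chomp $line;
              my ( $key, $val ) = split /\t/, $line, 2;   # assumed format
              $hash[$i]{$key} = $val;
          }
          $_->join for @readers;
          # ... compare $hash[0] with $hash[1] ...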

      That said, unless you can perform partial comparisons on the fly and conserve memory by discarding parts of the structures as you finish with them, you are likely to be constrained by memory.

      And if you can discard chunks of memory before you have read the files completely, then one wonders why you need to build the structures in the first place. Wouldn't a line by line comparison be possible?

      The bottom line is that whether there is any benefit in parallelising your program depends entirely upon the nature of your data, and the circumstances of your hardware setup, and you have not described either in sufficient detail to allow anyone to give you good advice.



      If the comparison you are doing doesn't involve any complex transformations of the data structure or really time-consuming math, then your CPUs will mostly sit around looking at the daisies while they wait for your hard disk to deliver the data. Disk I/O is slow, REALLY slow, compared to the speed of your memory or CPUs.

      So no matter how many CPUs you have to do the job, the only thing that probably matters in your case is how fast your disk (or disks) can read the data (and what algorithm you are using).

      And if the hashes are so big that they don't fit into RAM, your machine starts to swap, i.e. it puts part of its memory contents back onto the hard disk, which makes you even more dependent on hard disk speed. This swapping usually leads to your program doing nothing but swapping; this is called 'thrashing'.

      So your solution might be, depending on your circumstances:
      1) Buy a faster hard disk or use a RAID array.
      2) Do some preprocessing of your data so that it takes up less space.
      3) Buy more RAM.
      4) Use a database for one of the huge files and compare the second one by accessing the database.
      5) Depending on your data, use some algorithm that avoids reading the two files completely into memory, for example a merge sort (see the sketch below).
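
      As a sketch of option 5, assuming both files can first be sorted on their keys (e.g. with the system sort utility) and using hypothetical file names, a merge-style walk compares them while holding only one line of each in memory:

          use strict;
          use warnings;

          my ( $fa, $fb ) = ( 'a.sorted', 'b.sorted' );   # pre-sorted inputs
          open my $A, '<', $fa or die "open $fa: $!";
          open my $B, '<', $fb or die "open $fb: $!";

          my $la = <$A>;
          my $lb = <$B>;
          while ( defined $la and defined $lb ) {
              my $cmp = $la cmp $lb;
              if    ( $cmp < 0 ) { print "only in $fa: $la"; $la = <$A> }
              elsif ( $cmp > 0 ) { print "only in $fb: $lb"; $lb = <$B> }
              else               { $la = <$A>; $lb = <$B> }   # identical lines
          }
          while ( defined $la ) { print "only in $fa: $la"; $la = <$A> }
          while ( defined $lb ) { print "only in $fb: $lb"; $lb = <$B> }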