in reply to Is Using Threads Slower Than Not Using Threads?

Absolutely, unless there is some advantage to be gained by running computationally intensive tasks in parallel. Running I/O bound tasks in parallel almost never offers any time advantage (although sometimes there can be an advantage in terms of code structure).

However, it is very often the case that a smarter algorithm can provide a substantial performance improvement. For example, in the case of the code you show, it may be that you could use a hash to test for an IP match. Consider:

use strict;
use warnings;

my @allIpsList = qw(10.1.1.10 10.1.1.17 10.1.1.125);
my %ipsMatch = map {$_ => 1} @allIpsList;
my $testFile = <<TESTDATA;
10.1.1.17
10.1.1.23
10.1.1.79
10.1.1.125
TESTDATA

open my $fh, '<', \$testFile;
my @found = check_fw_logs (\%ipsMatch, $fh);

print "@found\n";

sub check_fw_logs {
    my ($ipsMatch, $fh) = @_;
    my @matched;

    while (defined (my $line = <$fh>)) {
        next if $line !~ /(\d+\.\d+\.\d+\.\d+)/;
        push @matched, $1 if exists $ipsMatch->{$1};
    }

    return @matched;
}

Prints:

10.1.1.17 10.1.1.125

For real code you would probably want to normalise the IP numbers wherever they are used, and of course you'd use external files etc. rather than the tricks I've used to make this a self-contained example.

True laziness is hard work

Re^2: Is Using Threads Slower Than Not Using Threads?
by JavaFan (Canon) on Nov 01, 2010 at 11:28 UTC
    Running I/O bound tasks in parallel almost never offers any time advantage
    If it's a single file, on a single disk, with a single controller, yes. But if you have multiple controllers, or are reading from multiple files, I/O bound tasks can speed up when done in parallel. In fact, if you have just a single-core CPU, the only time threads will speed things up is if you're I/O bound (disk and/or network).

    And then there's the obvious two-way split: one thread doing I/O, while the other does the calculations. That means your program can still make 'progress' while it's waiting for data. This may give you a speedup even if you have a single core, single CPU, single controller, single disk setup.

    Now, if I were the OP, and if I were to go the divide-and-conquer route, I'd try various methods, and see which ones are faster on his particular setup:

    1. Two threads/processes: one reading the file, the other checking the IPs.
    2. Use threads/forks and have each thread/child test part of the file.
    3. Have an outer thread/process reading chunks of data, and have a bunch of other threads/processes each checking part of the IP addresses.
    Obviously, the latter two allow for lots of tweaking by varying the number of threads/processes.
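
A minimal sketch of option 2 using fork and pipes, for what it's worth. The data is made up, the name check_in_parallel is hypothetical, and each child simply takes every Nth line; a real version would more likely hand each child a byte range of the log file. It assumes a platform with a real fork().

```perl
use strict;
use warnings;

# Made-up data standing in for the real IP list and firewall log.
my %ipsMatch = map { $_ => 1 } qw(10.1.1.10 10.1.1.17 10.1.1.125);
my @lines    = map { "src=$_ action=drop\n" }
    qw(10.1.1.17 10.1.1.23 10.1.1.79 10.1.1.125 10.1.1.10 10.1.1.99);

my $workers = 2;    # the knob to tweak for a particular machine

# Hypothetical helper: fork $workers children, each checking its share
# of the lines and reporting matches back through its own pipe.
sub check_in_parallel {
    my ($ips, $lines, $workers) = @_;
    my @readers;

    for my $w (0 .. $workers - 1) {
        pipe(my $reader, my $writer) or die "pipe failed: $!";
        my $pid = fork();
        die "fork failed: $!" if !defined $pid;

        if ($pid == 0) {
            # Child: check every $workers-th line, starting at line $w.
            close $reader;
            for (my $i = $w; $i < @$lines; $i += $workers) {
                next if $lines->[$i] !~ /(\d+\.\d+\.\d+\.\d+)/;
                print {$writer} "$1\n" if exists $ips->{$1};
            }
            close $writer;
            exit 0;
        }

        # Parent: keep the read end, close the write end so EOF is seen.
        close $writer;
        push @readers, [$pid, $reader];
    }

    # Collect the matches from each child in turn, then reap it.
    my @matched;
    for my $r (@readers) {
        my ($pid, $reader) = @$r;
        chomp(my @got = <$reader>);
        push @matched, @got;
        close $reader;
        waitpid $pid, 0;
    }
    return @matched;
}

my @found  = check_in_parallel(\%ipsMatch, \@lines, $workers);
my @sorted = sort @found;
print "@sorted\n";
```

Whether this beats a single process depends entirely on the setup, which is the point of trying several variants and timing them.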

    But first, I'd try something totally different. Instead of doing 3500 matches for each line, use a single regexp to extract all the IP addresses from the line (it doesn't have to be perfect; it's likely even /([0-9][0-9.]+[0-9])/ will do). The 3500 IP addresses I would store in a hash. Then it's simply a matter of doing a hash lookup. This should dramatically reduce the number of matches performed. It may also fix a bug: if one of the 3500 IP addresses is "23.45.67.89", and the file contains "123.45.67.89", then it's reported as a match. This may be intended, but I would find that surprising.
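
To make the difference concrete, here is a small sketch. The two-entry IP list and the log line are invented, and the "naive" version is only my assumption of what per-IP unanchored matching looks like, not the OP's actual code.

```perl
use strict;
use warnings;

my @allIps  = qw(23.45.67.89 10.1.1.17);    # stand-in for the 3500 IPs
my %ipsSeen = map { $_ => 1 } @allIps;
my $line    = "deny 123.45.67.89 -> 10.1.1.17 port 22";    # made-up log line

# One regex match per listed IP: 23.45.67.89 also matches inside 123.45.67.89.
my @naive = grep { $line =~ /\Q$_\E/ } @allIps;

# One regex pass over the line, then exact hash lookups on what it extracted.
my @exact = grep { exists $ipsSeen{$_} } $line =~ /([0-9][0-9.]+[0-9])/g;

print "naive: @naive\n";
print "exact: @exact\n";
```

The naive version reports both listed addresses; the hash version reports only 10.1.1.17, because 123.45.67.89 is extracted whole and is not a key in the hash.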

Re^2: Is Using Threads Slower Than Not Using Threads?
by Dru (Hermit) on Nov 01, 2010 at 14:47 UTC
    Thank you GrandFather. What do you mean by normalizing the IP numbers?

    Thanks,
    Dru

    Perl, the Leatherman of Programming languages. - qazwart

      For the hash lookup to work the key string must match exactly. IP numbers may optionally have leading 0 digits to form a three-digit number, but the strings '010.001.001.001' and '10.1.1.1' are not eq, so you should normalise the IP number to one version or the other: always three-digit numbers, or always remove leading 0 digits.
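
One way to do the "remove leading 0 digits" variant, as a rough sketch. The helper name normalise_ip is made up, and it assumes the input is already a well-formed dotted quad with decimal octets.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite each octet as a plain decimal number,
# which drops any leading zeros ("010" becomes "10").
sub normalise_ip {
    my ($ip) = @_;
    return join '.', map { $_ + 0 } split /\./, $ip;
}

# Normalise on the way into the hash ...
my %ipsMatch = map { normalise_ip($_) => 1 } qw(010.001.001.001 10.1.1.17);

# ... and again on every lookup, so both spellings hit the same key.
for my $seen (qw(10.1.1.1 010.001.001.017 10.1.1.23)) {
    my $state = exists $ipsMatch{ normalise_ip($seen) } ? 'listed' : 'not listed';
    print "$seen: $state\n";
}
```

The important part is that the same normalisation is applied both when building the hash and when looking up.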

      True laziness is hard work

      The easiest way to normalise IPs is to pack them into four bytes:

      $hash{ pack 'C4', split /\./, $ip } = 1;
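
A quick check that the packed form really does make the two spellings equal. The helper name ip_key is made up for illustration. Note that split needs the regex /\./ here, since a bare '.' pattern would match any character.

```perl
use strict;
use warnings;

# Hypothetical helper: pack the four octets into a 4-byte string.
sub ip_key {
    my ($ip) = @_;
    return pack 'C4', split /\./, $ip;
}

my %hash;
$hash{ ip_key('10.1.1.1') } = 1;

# Both spellings produce the same 4-byte key.
print length(ip_key('10.1.1.1')), " bytes\n";
print "same key\n" if exists $hash{ ip_key('010.001.001.001') };

# unpack turns the key back into a printable dotted quad.
print join('.', unpack 'C4', ip_key('010.001.001.001')), "\n";
```

The keys are binary strings, so unpack them as above before printing them anywhere.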

        <beancounter>That works "only" for IPv4 addresses.</beancounter>.

        We are running out of IPv4 addresses, and the only alternative to massive (ab)use of NAT is to switch to IPv6 really soon now(TM). In that light, new applications should be able to handle IPv6 addresses now.

        Of course, for a logfile with 100% IPv4 addresses, adding IPv6 support is nonsense. But, if that logfile comes from a device that will soon have to work with IPv6 addresses, the script should be able to handle that.
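
For a script that has to cope with both families, one possible approach (a sketch, not the only way) is to let inet_pton from the core Socket module produce the binary key. The helper name addr_key and the "contains a colon means IPv6" test are my assumptions, and the system's inet_pton may not accept IPv4 octets written with leading zeros, so those may still need stripping first.

```perl
use strict;
use warnings;
use Socket qw(inet_pton AF_INET AF_INET6);

# Hypothetical helper: binary key for an IPv4 or IPv6 address in text form.
# Assumption: anything containing a colon is treated as IPv6.
sub addr_key {
    my ($text) = @_;
    my $family = $text =~ /:/ ? AF_INET6 : AF_INET;
    return inet_pton($family, $text);
}

my %seen;
$seen{ addr_key('2001:db8::1') } = 1;
$seen{ addr_key('10.1.1.17') }   = 1;

# The long and the compressed spelling of one IPv6 address share a key.
print "IPv6 match\n"
    if exists $seen{ addr_key('2001:0db8:0000:0000:0000:0000:0000:0001') };
print "IPv4 match\n" if exists $seen{ addr_key('10.1.1.17') };
```

IPv4 keys are 4 bytes and IPv6 keys are 16 bytes, so the two families cannot collide in the same hash.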

        Alexander

        --
        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)