Re: Pattern matching with a variable

There are more efficient ways to implement this algorithm. The implementation you've presented is n squared. That is, as input.txt or library.txt become longer, the program takes exponentially more time to complete. If the files are small, this won't be a big deal. If the files are large, you can make this run much more efficiently by using hashes.

Hashes are the fastest way to compare the contents of one list against another list. It is an n algorithm, which means you only take a linear hit in run time as input.txt or library.txt grow.

$library = [contents of library.txt]
$input = [contents of input.txt]

my %in_library = ();
foreach $entry (split(/\n/, $library)) {
  $in_library{"$entry"} = 1;
}

my @input = split(/\n/, $input);

foreach $entry (@input) {
  if ($in_library{"$entry"}) {
    print "found - $entry\n";
  } else {
    print "new - $entry\n";
  }
}
[download]

Regular expressions take time to compile, so you want to avoid putting dynamic regexps in a tight loop.

TROGDOR

Comment on Re: Pattern matching with a variable Download Code

Replies are listed 'Best First'.
Re^2: Pattern matching with a variable by Ytrew (Pilgrim) on Nov 04, 2004 at 17:08 UTC
The implementation you've presented is n squared. That is, as input.txt or library.txt become longer, the program takes exponentially more time to complete Actually, it takes quadradically longer to complete. :-) The point TROGDOR is making is correct: you need to use a faster algorithm. Currently, if you double the input size, it will take more than twice as long to execute the program. The adjective I'm quibbling about is the one that describes how much longer. An exponential time algorithm (one that runs in time proportional to kn, for some constant k), always takes longer than a quadradic time algorithm (one that runs in time proportional to n2), as n "gets big". Here, the number n is the size of our data, and the algorithms are categorized based on how many operations they make relative to n. Actually, this is true of any polynomial time algorithm (a1(k1n) eventually gets bigger than a2(n**k2), for any given set of contants (a1,k2) and (a2,k2), as n increases). In English, this just says that "exponential algorithms are really, really, slow". This is fairly obvious when you consider an algorithm with an exponent of 2: if you time the program,,then add one element, and time it again, the second run takes twice as long. If you add another element, it takes four times as long, and so on. The hardest problems in computer science require exponential space or time to solve. Quadradics are bad, but much more managable. Quicksort is, in the worst case, quadradic, and perl used it for ages. -- Ytrew Q. Uiop	[reply]

Replies are listed 'Best First'.

Re^2: Pattern matching with a variable
by Ytrew (Pilgrim) on Nov 04, 2004 at 17:08 UTC

The implementation you've presented is n squared. That is, as input.txt or library.txt become longer, the program takes exponentially more time to complete

Actually, it takes quadradically longer to complete. :-)

The point TROGDOR is making is correct: you need to use a faster algorithm. Currently, if you double the input size, it will take more than twice as long to execute the program. The adjective I'm quibbling about is the one that describes how much longer.

An exponential time algorithm (one that runs in time proportional to k**n, for some constant k), always takes longer than a quadradic time algorithm (one that runs in time proportional to n**2), as n "gets big". Here, the number n is the size of our data, and the algorithms are categorized based on how many operations they make relative to n.

Actually, this is true of any polynomial time algorithm (a1*(k1**n) eventually gets bigger than a2*(n**k2), for any given set of contants (a1,k2) and (a2,k2), as n increases).

In English, this just says that "exponential algorithms are really, really, slow". This is fairly obvious when you consider an algorithm with an exponent of 2: if you time the program,,then add one element, and time it again, the second run takes twice as long. If you add another element, it takes four times as long, and so on.

The hardest problems in computer science require exponential space or time to solve. Quadradics are bad, but much more managable. Quicksort is, in the worst case, quadradic, and perl used it for ages.

--
Ytrew Q. Uiop

[reply]