Re: Faster Method for Gathering Data
by Abigail-II (Bishop) on Jul 31, 2003 at 12:07 UTC
|
And how fast is the equivalent find command? Do something
like:
$ time find dir1 dir2 dir3 -name '*.sgml' > /dev/null
If that also takes hours, the problem isn't in your
Perl program.
Abigail | [reply] [d/l] |
|
If the requester isn't on Unix, would wrapping the appropriate system("") call with some code to store the start and finish time be useful?
Maybe the Benchmark module?
I don't know, just wanted to see if such a strategy might be worthwhile.
| [reply] |
Re: Faster Method for Gathering Data
by BrowserUk (Patriarch) on Jul 31, 2003 at 13:23 UTC
|
dir /s \\remotemachine\....\*.sgml
attrib /s \\remotemachine\....\*.sgml
If either of these is substantially faster than File::Find, then there may be ways of speeding things up. If not, then it would seem that you have a very slow link somewhere between you and the network drive.
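For the File::Find side of that comparison, here is a minimal sketch of how the traversal could be timed. The timed_find() helper and the default root of '.' are made up for illustration; pass the real share path as the first argument. It only walks the tree and matches names, so the number it reports is the cost of the directory walk itself.

```perl
use strict;
use warnings;
use File::Find;
use Time::HiRes qw(time);

# Hypothetical helper: walk $root, collect names ending in .sgml,
# and return the list plus the elapsed wall-clock seconds.
sub timed_find {
    my ($root) = @_;
    my @found;
    my $start = time;
    find(sub { push @found, $File::Find::name if /\.sgml$/i }, $root);
    return (\@found, time - $start);
}

# Root directory is an assumption; defaults to the current directory.
my $root = @ARGV ? $ARGV[0] : '.';
my ($files, $elapsed) = timed_find($root);
printf "%d files in %.2f seconds\n", scalar @$files, $elapsed;
```

If this takes about as long as dir /s over the same path, the time is going into the network walk rather than into Perl.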
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
| [reply] [d/l] |
Re: Faster Method for Gathering Data
by cianoz (Friar) on Jul 31, 2003 at 12:14 UTC
|
as far as i can tell there's nothing wrong in your code
(well except for a lacking "use strict"...)
(i tested it to search for /\.c$/ in /usr/src/linux and it took just under 1 second for more than 4000 files on an average machine..)
try to compare it with the unix find command (if you are on unix)
also if you are using a slow terminal it could help to eliminate the print statement or redirect its output to a file.
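if you still want some sign of life, one option is to print only every so often instead of once per file. a rough sketch, where note_file(), $every and the fake file names are all made up for illustration:

```perl
use strict;
use warnings;

$| = 1;              # unbuffer STDOUT so the dots appear as they are printed
my $every = 500;     # hypothetical interval between progress dots
my $seen  = 0;
my @files;

# Hypothetical helper: remember the path and print a dot every $every files.
sub note_file {
    my ($path) = @_;
    push @files, $path;
    print "." unless ++$seen % $every;
}

# Stand-in for the real directory walk.
note_file("file$_.sgml") for 1 .. 1200;
print "\n$seen files collected\n";
```

in the real script note_file() would be called from wherever the file name is currently printed.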
| [reply] |
|
Sorry, I should have been more specific. I am on a Windows system, checking the files across a Win2000 server network drive.
I guess that impacts it.
The print command is in there to show that it is actually working and not frozen. I need the array for later use to open the files and do some reporting based on the elements in the SGML.
Thanks TONS for verifying that at least it might not be me.
| [reply] |
|
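use IO::File;

# Use the low-precedence "or" so that die fires if the pipe open fails.
open FINDER, "find . -type f -print |"
    or die "Couldn't issue find command: $!\n";
my %SGML_Reporting_Stuff;
while (<FINDER>)
{
    chomp;    # strip the trailing newline, or the open below will fail
    my $fh = IO::File->new($_)
        or die "Cannot open '$_' for reading: $!\n";
    # Do stuff to populate %SGML_Reporting_Stuff
    $fh->close;
}
close FINDER;
# Use %SGML_Reporting_Stuff here.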
I used a Unix command, but you could replace the command with the appropriate Windows command and it should work. This isn't necessarily going to give you a huge boost in speed, but it will reduce your memory requirements, which often translates into a 5%-15% speed improvement. In your case, where you're taking 5+ hours, 15% can be as much as 45 minutes, or more.
Now, of course, if you need to read file A before reading files B and C, this won't work as well. You could still do something similar, by having a second hash which says "I can't process these filenames until I have processed that filename". Once you hit "that filename", you process the ones that you had to hold off on. If you were to go this route, I would create a process_file() subroutine to do your actual processing.
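A rough sketch of that hold-off idea, with everything in it assumed for illustration: the %needs map (file => the file that must come first), the %held hash of deferred names, and the sample file names are all hypothetical, and process_file() just records the order instead of doing real reporting.

```perl
use strict;
use warnings;

# Hypothetical dependency map: file => the file that must be processed first.
my %needs = ('b.sgml' => 'a.sgml', 'c.sgml' => 'a.sgml');

my (%done, %held, @order);

sub process_file {
    my ($name) = @_;
    push @order, $name;    # stand-in for the real reporting work
    $done{$name} = 1;
    # Release anything that was waiting on this file.
    my $released = delete $held{$name};
    for my $dependent (@{ $released || [] }) {
        process_file($dependent);
    }
}

# Stand-in for the names arriving from the find pipe, in arrival order.
for my $name ('b.sgml', 'c.sgml', 'a.sgml', 'd.sgml') {
    my $prereq = $needs{$name};
    if (defined $prereq && !$done{$prereq}) {
        push @{ $held{$prereq} }, $name;    # hold until the prerequisite shows up
    }
    else {
        process_file($name);
    }
}
print "$_\n" for @order;
```

Anything still left in %held at the end is waiting on a file that never turned up, which is worth reporting.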
------ We are the carpenters and bricklayers of the Information Age. The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6 Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified. | [reply] [d/l] |
|
I did a bit more digging, and thought this might help...
You could use the following code (straight from the Benchmark docs) to reassure yourself that the networked access is the bottleneck.
use Benchmark;
$t0 = new Benchmark;
# ... your code here ...
# system("dir", "/s", "path_to_root_sgml_dir\\*.sgml");
$t1 = new Benchmark;
$td = timediff($t1, $t0);
print "the code took:",timestr($td),"\n";
Oh, and welcome to the monastery! | [reply] [d/l] |
Re: Faster Method for Gathering Data
by crouchingpenguin (Priest) on Jul 31, 2003 at 13:18 UTC
|
Sure, you can use Inline::C. Here is an example :
| [reply] [d/l] |
Re: Faster Method for Gathering Data
by dga (Hermit) on Jul 31, 2003 at 17:27 UTC
|
Another possibility which may not apply in your situation is to run a straight recursive directory listing into a text file then write a perl script to parse that.
The fastest of course would be to run the listing on the remote machine and then transfer the listing file to the local machine.
Second fastest might be to do the listing over the network, save the output locally, and run a parsing script on that. Of course, if an over-the-network directory listing takes 5 hours to complete, you don't save a lot of time.
use strict;
my @files;
while(<>)
{
    chomp;
    # Keep only the lines that name an SGML file (any case).
    push(@files, $_) if /\.sgml$/i;
}
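If the listing is a plain "dir /s" capture rather than one path per line, the directory headers have to be stitched back onto the file names. A sketch under stated assumptions: the sample lines below are hand-written to resemble English-locale output with a single time token such as 12:07p (other Windows versions and locales format the date, time and headers differently), and the regexes are only tuned to that shape.

```perl
use strict;
use warnings;

# Hand-written sample in the assumed "dir /s" layout; not real output.
my @listing = (
    ' Directory of C:\docs',
    '',
    '07/31/2003  12:07p               1,024 intro.sgml',
    '07/31/2003  12:07p      <DIR>          sub',
    '07/31/2003  12:08p                 512 notes.txt',
    '               2 File(s)          1,536 bytes',
    '',
    ' Directory of C:\docs\sub',
    '',
    '07/31/2003  12:09p               2,048 chapter one.sgml',
);

my (@files, $dir);
for my $line (@listing) {
    if ($line =~ /^\s*Directory of (.+?)\s*$/) {
        $dir = $1;    # remember which directory the following entries belong to
    }
    elsif (defined $dir
        && $line =~ /^\S+\s+\S+\s+[\d,]+\s+(.+?\.sgml)\s*$/i) {
        # date, time, size, then the name (which may contain spaces)
        push @files, "$dir\\$1";
    }
}
print "$_\n" for @files;
```

In the real script the loop would read the saved listing with while(<>) instead of the sample array.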
| [reply] [d/l] |
Re: Faster Method for Gathering Data
by Cine (Friar) on Jul 31, 2003 at 12:52 UTC
|
Your problem is most likely not really related to perl; it is a filesystem thing, where lookups are made in linear time with regard to the number of files in a directory.
You should look into the htree option of ext{2,3}. Google will help you there ;)
T I M T O W T D I | [reply] [d/l] |
|
Howdy!
Your problem is most likely not really related to perl, it is a filesystem thing, where lookups are made in linear time with regards to the number of files in a directory
I've run into what I think is similar behavior. I have some
CDs that have something like 11,000 files on them, all in
a single directory. On a Windows or MacOS 9 box, I saw
excruciatingly slow access times for files down in the
list. The first few hundred were plenty zippy, but the
farther I got into the list, the slower the access.
Doing the same access on a Solaris box or MacOSX yielded
pleasantly surprising results. File lookups were more
like constant time instead of proportional to how far into
the list the name was.
I suspect that the problem is exacerbated by using a
"slow" medium, like CD-ROM or network volumes.
yours,
Michael
| [reply] |
|
Network, yes. CD-ROM, no. The difference is that the metadata on the CD-ROM can be cached, whereas the network drive has to recheck it.
T I M T O W T D I
| [reply] [d/l] |