Faster Method for Gathering Data

APA_Perl has asked for the wisdom of the Perl Monks concerning the following question:

Re: Faster Method for Gathering Data by Abigail-II (Bishop) on Jul 31, 2003 at 12:07 UTC
And how fast is the equivalent find command? Do something like: `$ time find dir1 dir2 dir3 -name '*.sgml' > /dev/null` If that also takes hours, the problem isn't in your Perl program. Abigail
Re: Re: Faster Method for Gathering Data by ChrisS (Monk) on Jul 31, 2003 at 12:49 UTC
If the requester isn't on Unix, would it be useful to wrap the appropriate system("") call in some code that stores the start and finish times? Maybe with the Benchmark module? I don't know, I just wanted to see if such a strategy might be worthwhile.
Re: Faster Method for Gathering Data by BrowserUk (Patriarch) on Jul 31, 2003 at 13:23 UTC
How long does the same search take from the command line? Time both `dir /s \\remotemachine\....\.sgml` and `attrib /s \\remotemachine\....\.sgml`. If either of these is substantially faster than File::Find, then there may be ways of speeding things up. If not, then it would seem that you have a very slow link somewhere between you and the network drive. Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
Re: Faster Method for Gathering Data by cianoz (Friar) on Jul 31, 2003 at 12:14 UTC
As far as I can tell there's nothing wrong with your code (well, except for a missing "use strict"...). I tested it by searching for /\.c$/ in /usr/src/linux and it took just under 1 second for more than 4000 files on an average machine. Try comparing it with the Unix find command (if you are on Unix). Also, if you are using a slow terminal, it could help to eliminate the print statement or redirect its output to a file.
Re: Re: Faster Method for Gathering Data by APA_Perl (Novice) on Jul 31, 2003 at 12:51 UTC
Sorry, I should have been more specific. I am on a Windows system, checking the files across a Win2000 server network drive. I guess that impacts it. The print command is in there to show that it is actually working and not frozen. I need the array for later use, to open the files and do some reporting based on the elements in the SGML. Thanks TONS for verifying that at least it might not be me.
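For what it's worth, here is a minimal sketch of how the two needs above (keep an array for later, and still see that the scan hasn't frozen) could be combined without printing every filename. The original code isn't shown in this thread, so everything here is an assumption: the sub name `collect_sgml` and the progress interval are made up for illustration, and the roots come from the command line.

```perl
use strict;
use warnings;
use File::Find;
use Time::HiRes qw(time);

# Hypothetical helper: collect files under the given roots whose names
# end in .sgml, reporting progress on STDERR only every $every matches.
sub collect_sgml {
    my ($roots, $every) = @_;
    $every ||= 1000;
    my @found;
    find(
        sub {
            return unless /\.sgml$/i;   # cheap name test first
            return unless -f $_;        # stat only the candidates
            push @found, $File::Find::name;
            print STDERR scalar(@found), " files so far\n"
                unless @found % $every;
        },
        @$roots,
    );
    return \@found;
}

# Runs only when root directories are passed on the command line.
if (@ARGV) {
    my $t0    = time;
    my $files = collect_sgml([@ARGV]);
    printf "%d files in %.2f s\n", scalar @$files, time - $t0;
}
```

Doing the name test before the `-f` test means only the .sgml candidates get stat'ed. Whether that makes a noticeable difference over a slow network link is something to measure, not assume.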
Re3: Faster Method for Gathering Data by dragonchild (Archbishop) on Jul 31, 2003 at 13:57 UTC
It might be useful to consider whether you can deal with the files as they are found in the filesystem. Often, programmers don't consider the option of handling things as they come through, instead feeling that they have to work through a sorted list. The way you can tell is if you don't care what order your datasources come in and if you don't need them again once you've gotten what you need. This sounds like a situation where a type of stream could definitely work. Why not do something like the following: `use IO::File; open(FINDER, "find . -type f -print |") or die "Couldn't issue find command: $!\n"; my %SGML_Reporting_Stuff; while (my $file = <FINDER>) { chomp $file; my $fh = IO::File->new($file) or die "Cannot open '$file' for reading: $!\n"; # Do stuff to populate %SGML_Reporting_Stuff $fh->close; } close FINDER; # Use %SGML_Reporting_Stuff here.` I used a Unix command, but you could replace the command with the appropriate Windows command and it should work. This isn't necessarily going to give you a huge boost in speed, but it will reduce your memory requirements, which often translates into a 5%-15% speed improvement. In your case, where you're taking 5+ hours, that can be as much as 45 minutes, or more. Now, of course, if you need to read file A before reading files B and C, this won't work as well. You could still do something similar, by having a second hash which says "I can't process these filenames until I have processed that filename". Once you hit "that filename", you process the ones that you had to hold off on. If you were to go this route, I would create a process_file() subroutine to do your actual processing. ------ We are the carpenters and bricklayers of the Information Age. The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6 Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.
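A rough sketch of the "second hash" idea described above, in case it helps to see it spelled out. Only process_file() is named in the post; the dependency map `%needs`, the helper `handle_found`, and the file names are hypothetical and exist just to illustrate the hold-and-release logic.

```perl
use strict;
use warnings;

# Hypothetical dependency map: file => the file that must be processed first.
my %needs = ('b.sgml' => 'a.sgml', 'c.sgml' => 'a.sgml');

my %done;      # files already processed
my %waiting;   # prerequisite => [files held back until it is done]
my @order;     # records the processing order, for illustration only

sub process_file {
    my ($name) = @_;
    # Real code would open $name here and populate the report data.
    push @order, $name;
    $done{$name} = 1;
    # Release anything that was waiting on this file.
    my $held = delete $waiting{$name};
    process_file($_) for @{ $held || [] };
}

sub handle_found {
    my ($name) = @_;
    my $prereq = $needs{$name};
    if (defined $prereq && !$done{$prereq}) {
        push @{ $waiting{$prereq} }, $name;   # hold off for now
    }
    else {
        process_file($name);
    }
}

# Simulate files arriving in an inconvenient order.
handle_found($_) for qw(b.sgml c.sgml a.sgml d.sgml);
print join(' ', @order), "\n";
```

Here b.sgml and c.sgml are held until a.sgml turns up, so the processing order is a.sgml, b.sgml, c.sgml, d.sgml. In a real run, handle_found() would be called from the loop that reads the file names, and anything still left in `%waiting` at the end would point to a prerequisite that was never found.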
Re: Re: Re: Faster Method for Gathering Data by ChrisS (Monk) on Jul 31, 2003 at 13:04 UTC
I did a bit more digging, and thought this might help... You could use the following code (straight from the Benchmark docs) to reassure yourself that the networked access is the bottleneck. `use Benchmark; $t0 = new Benchmark; # ... your code here ... # system("dir", "/s", "path_to_root_sgml_dir\\*.sgml"); $t1 = new Benchmark; $td = timediff($t1, $t0); print "the code took:",timestr($td),"\n";` Oh, and welcome to the monastery!
Re: Faster Method for Gathering Data by crouchingpenguin (Priest) on Jul 31, 2003 at 13:18 UTC
Sure, you can use Inline::C.
Re: Faster Method for Gathering Data by dga (Hermit) on Jul 31, 2003 at 17:27 UTC
Another possibility, which may not apply in your situation, is to run a straight recursive directory listing into a text file and then write a Perl script to parse that. The fastest, of course, would be to run the listing on the remote machine and then transfer the listing file to the local machine. Second fastest might be to do the listing over the network, save the output locally, and run a parsing script on that. Of course, if an over-the-network directory listing takes 5 hours to complete, you don't save a lot of time. `use strict; my @files; while (<>) { chomp; push @files, $_ if /\.sgml$/; }`
Re: Faster Method for Gathering Data by Cine (Friar) on Jul 31, 2003 at 12:52 UTC
Your problem is most likely not really related to Perl. It is a filesystem thing, where lookups take linear time with regard to the number of files in a directory. You should look into the htree option of ext{2,3}. Google will help you there ;) T I M T O W T D I
Re: Re: Faster Method for Gathering Data by herveus (Prior) on Jul 31, 2003 at 15:05 UTC
Howdy! "Your problem is most likely not really related to perl, it is a filesystem thing, where lookups are made in linear time with regards to the number of files in a directory" I've run into what I think is similar behavior. I have some CDs that have something like 11,000 files on them, all in a single directory. On a Windows or MacOS 9 box, I saw excruciatingly slow access times for files down in the list. The first few hundred were plenty zippy, but the farther I got into the list, the slower the access. Doing the same access on a Solaris box or MacOSX yielded pleasantly surprising results. File lookups were more like constant time instead of proportional to how far into the list the name was. I suspect that the problem is exacerbated by using a "slow" medium, like CD-ROM or network volumes. yours, Michael
Re: Re: Re: Faster Method for Gathering Data by Cine (Friar) on Jul 31, 2003 at 15:10 UTC
Network, yes. CDROM, no. The difference is that the metadata on the CDROM can be cached, whereas the network drive has to recheck it. T I M T O W T D I


	PerlMonks