zer has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I'm trying to read a large file. It gets cut off when I use the @array = <file> method. I also need to load the file into a variable before I parse it. What is a better method for this?
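For reference, a common way to load a whole file into a single scalar is to undefine the input record separator `$/` in a local scope; a minimal self-contained sketch (the file name is illustrative, and the sketch writes its own sample file so it runs on its own):

```perl
use strict;
use warnings;

# Write a small sample file so the sketch is self-contained
# ("sample.txt" is an illustrative name, not from the thread).
my $file = "sample.txt";
open my $out, '>', $file or die "Couldn't write $file: $!";
print $out "first line\nsecond line\n";
close $out;

# Undefining $/ in a local scope makes <$fh> return the whole
# file as a single string instead of one line at a time.
open my $fh, '<', $file or die "Couldn't open $file: $!";
my $contents = do { local $/; <$fh> };
close $fh;

print length($contents), " bytes\n";
```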

Updated: Code

open (F,$r) || die "Couldnt open $! $r";
$r =~ s/\./1\./;
#@Data = <F>;
$store = "<Head>";
foreach (@Data){
    if (/.*<B>/ && ! /Last modified/){ #If line is a title
        $temp = $_;
        $_ =~ s/(<[^<>]+>)+//g;
        chomp;
        if (/\(/){
            $store = $_;
            $DB{$store} = $temp;
        }else{
            $store = $_;
            $DB{$store} = $temp;
        }
    }else{ #Not a title
        ($store ne "<Head>")? $DB{$store}:$head .= $_;
    }
}

Re: Reading from large files
by GrandFather (Saint) on Mar 21, 2006 at 04:01 UTC

    Rewrite it to use while instead of foreach. I can't tell whether this sample is meaningful or not, but it may give you the idea (note that local data is a good technique for sample code like this):

    use strict;
    use warnings;

    my %DB;
    my $store = "<Head>";
    my $head;

    while (<DATA>) {
        if (/.*<B>/ && ! /Last modified/){ #If line is a title
            my $temp = $_;
            $_ =~ s/(<[^<>]+>)+//g;
            chomp;
            if (/\(/){
                $store = $_;
                $DB{$store} = $temp;
            }else{
                $store = $_;
                $DB{$store} = $temp;
            }
        }else{ #Not a title
            ($store ne "<Head>")? $DB{$store}:$head .= $_;
        }
    }

    print join "\n", map {"$_ -> $DB{$_}"} keys %DB;

    __DATA__
    Title line<B>
    a line
    another line
    <Head><B>
    this line
    that line

    Prints (hash key order may vary):

    Title line -> Title line<B>
    a line
    another line

     -> <Head><B>
    this line
    that line

    DWIM is Perl's answer to Gödel
      Thanks GrandFather, however this didn't work either. It produced the same results.

        Use a very small sample of your data that reproduces the problem in place of the data I gave. Show us what you see printed and what you expect to be printed. File size is not the issue.

        We need to see the code that you are using that fails and the data that it fails with. If you can't reduce the pertinent code to a size similar to my sample and the pertinent data to a similar size, you do not understand your own code and data and have not identified the actual problem.


        DWIM is Perl's answer to Gödel
Re: Reading from large files
by Samy_rio (Vicar) on Mar 21, 2006 at 04:15 UTC

    What is a better method for this.

    Hi zer, see the comparison below using the Array::FileReader, Slurp, and File::Content modules and the do function:

    use Array::FileReader;
    use Slurp;
    use Benchmark 'cmpthese';
    use File::Content;

    cmpthese(-1, {
        Array   => 'tie @foo, Array::FileReader, "test.txt"',
        Slurp   => 'my @array = Slurp::to_array("test.txt")',
        do      => 'do {local $/, "test.txt"}||die ($!)',
        Content => 'my $o_fil = File::Content->new("test.txt")',
    });

    __END__
                 Rate     Slurp  Content  Array     do
    Slurp      3.94/s        --    -100%  -100%  -100%
    Content     997/s    25229%       --   -82%  -100%
    Array      5467/s   138774%     448%     --   -99%
    do       466126/s 11839502%   46643%  8425%     --

    The size of the test.txt file is nearly 800KB.
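One caveat about those numbers: the do entry never actually opens test.txt — `do {local $/, "test.txt"}` only localizes $/ and evaluates the string "test.txt", which is why it appears impossibly fast. A do-block slurp that really reads the file would look more like this sketch (it creates its own small test.txt so it runs standalone):

```perl
use strict;
use warnings;

# Create a small test.txt so the sketch runs on its own.
my $file = "test.txt";
open my $out, '>', $file or die "write $file: $!";
print $out "some\ncontents\n";
close $out;

# The do block localizes $/ (slurp mode), opens the file, and
# returns everything <$fh> read as its last expression.
my $data = do {
    local $/;
    open my $fh, '<', $file or die "open $file: $!";
    <$fh>;
};

print length($data), " bytes read\n";
```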

    Regards,
    Velusamy R.


    eval"print uc\"\\c$_\""for split'','j)@,/6%@0%2,`e@3!-9v2)/@|6%,53!-9@2~j';

Re: Reading from large files
by graff (Chancellor) on Mar 21, 2006 at 04:29 UTC
    Well, 420KB is not really such a large file. Perl should have no trouble at all holding that much data in memory (unless you are working on a really old, dinky little machine). As for the code you posted in your update:
    • When I looked at it, it seemed not to be reading any input at all -- you had this line commented out:
      #@Data = <F>;
      and nothing else in the code was reading from the input file.

    • You seem to be reading tagged data of some sort (HTML? XML?), so you should be using an appropriate parsing module (HTML::TokeParser, XML::Parser, or some variant/derivative of one of those).

    • The reason for using a parsing module is to avoid the sorts of mistakes that you are probably making:
      $_=~ s/(<[^<>]+>)+//g; chomp;
      Since you are processing the data line by line, if any tags contain a newline between the angle brackets, they will not be deleted by that regex.

    • Some of your logic seems to make no sense; after attempting to delete everything that looks like a tag, you check for the presence of an open-paren, and then you perform the same two steps in both the "if" and the "else" blocks.

    It would also help to show a little bit of input data -- or at least the part that you say gets "cut off", and give a little more detail about what is meant, exactly, by "It gets cut off". What / how much is missing from the output?
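The multi-line-tag pitfall described above is easy to reproduce; in this sketch (the sample HTML is made up) the same substitution runs once per line and once over the whole string:

```perl
use strict;
use warnings;

# A tag split across a newline survives line-by-line substitution
# but is removed when the same regex runs on the whole document.
my $html = "before <A\nHREF=\"x.html\"> after\n";

# Line-by-line: neither fragment matches <[^<>]+> on its own line,
# so nothing is deleted.
my $line_by_line = join '',
    map { my $l = $_; $l =~ s/(<[^<>]+>)+//g; $l }
    split /^/, $html;

# Whole-string: the character class [^<>] happily matches the
# embedded newline, so the full tag is deleted.
my $whole = $html;
$whole =~ s/(<[^<>]+>)+//g;

print "line-by-line: $line_by_line";
print "whole-string: $whole";
```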
Re: Reading from large files
by swampyankee (Parson) on Mar 21, 2006 at 03:49 UTC

    I would suspect that if you can't load the entire file into an array, you won't be able to load it into a variable.

    If you would be a little bit more forthcoming with information, it would make any answers you receive more useful. Given the amount of information you've supplied, the answer to "What is a better method for this?" (I took the liberty of correcting the punctuation) is more than a bit open-ended.

    Two obvious alternatives are reading a record at a time (I shan't try to educate you on loop constructs) and using Tie::File.
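A minimal Tie::File sketch (Tie::File ships with Perl; the file name here is illustrative) — each array element is one line of the file, fetched from disk on demand rather than held in memory:

```perl
use strict;
use warnings;
use Tie::File;

# Create a small file to tie to.
my $file = "records.txt";
open my $out, '>', $file or die "write $file: $!";
print $out "alpha\nbeta\ngamma\n";
close $out;

# The tied array maps line N of the file to $lines[N]; records
# are read lazily, and the newline is stripped by default.
tie my @lines, 'Tie::File', $file or die "Can't tie $file: $!";

printf "%d lines; first is '%s'\n", scalar @lines, $lines[0];

untie @lines;
```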

    emc

    " The most likely way for the world to be destroyed, most experts agree, is by accident. That's where we come in; we're computer professionals. We cause accidents."
    —Nathaniel S. Borenstein
Re: Reading from large files
by GrandFather (Saint) on Mar 21, 2006 at 03:33 UTC

    Show us the snippet of code that you are using to read the file. If you can generate a short script (10 lines) that fails on large files, show us that. If you can't, then it is likely we can't help in any case.

    Make sure you have use strict; use warnings; at the top of your code.


    DWIM is Perl's answer to Gödel
      Updated... see above; not much to it.

        How big is the file? Do you really need to slurp it? What do you do with @D subsequently?


        DWIM is Perl's answer to Gödel
Re: Reading from large files
by zer (Deacon) on Mar 21, 2006 at 04:57 UTC
    Alright, here is where I spill the beans. Take note this was finished before graff's last message. It is also a great reason to listen to GrandFather's advice about regexes and HTML.

    However, it is a working hack job, which is perfect for what I needed to do to help a friend out. And he is currently very impressed with the community.

    Anyway, here's the scenario:

    modify the subpages of this link so that similarly paged links are shifted over by headings.

    Here is the final code.

    #!/usr/bin/perl -w
    use strict;

    sub go {
        my $temp;
        my %DB;
        my $hist;
        my $tab = "&nbsp;&nbsp;&nbsp;&nbsp";
        my $head;
        my $c = 0;
        my ($r) = shift @_;
        open (F,$r) || die "Couldnt open $! $r";
        $r =~ s/\./1\./;
        #@Data = <F>;
        my $store = "<Head>";
        while (<F>){
            if (/.*<B>/ && ! /Last modified/){ #If line is a title
                $temp = $_;
                $_ =~ s/(<[^<>]+>)+//g;
                chomp;
                if (/\(/){
                    $store = $_;
                    $store =~ s/[\(\)]//g;
                    $DB{$store} = $temp;
                }else{
                    $store = $_;
                    $store =~ s/[\(\)]//g;
                    $DB{$store} = $temp;
                }
            }else{ #Not a title
                ($store ne "<Head>")? $DB{$store}:$head .= $_;
            }
        }
        print "writing $r";
        open (N, ">$r") || die "Can't open to write";
        print N "$head";
        $hist = "---";
        foreach (sort keys %DB){
            if ((/^$hist[s:]* /)){
                $DB{$_} =~ s/<\/*B>/$tab/;
                $DB{$_} =~ s/<DD>/<DD>$tab/;
                print N "$DB{$_}";
            }else{
                $hist = $_;
                print N $DB{$_};
            }
        }
    }

    my @t;
    my @DIR;
    my $d;
    opendir (D,".") || die "Could not open directory! Check to make sure this file is in the right place. $!";
    @DIR = readdir (D);
    print "Please select a file to modify\n";
    $d = 0;
    foreach (@DIR){
        if (/keyword.\.html/){ push @t, $_; print ++$d.". $_\n"; }
    }
    print "Selection: ";
    $a = <>;
    go $t[$a-1] || die "Invalid Number! Try something within below $#t GOSH!!!";

    I used GrandFather's advice; I haven't tried many others. I realized a regex error late in the game because some hash values had '(' characters that were messing up the code... Feel free to throw modifications at this, my buddy just may use it! Thanks a lot.

Re: Reading from large files
by spiritway (Vicar) on Mar 21, 2006 at 04:34 UTC

    Your file is relatively small, so I doubt the problem is memory. It sounds like your data file may be corrupted, or contain unexpected characters that are somehow telling Perl you've reached the end of the file. You may want to try this with some other data, if that isn't too difficult.

Re: Reading from large files
by zer (Deacon) on Mar 21, 2006 at 04:27 UTC
    This looks like enough info for now; I'll let you know how it goes.

    For those interested in the file to be parsed, go to: here

      Well, why didn't you say that in the first place! Use HTML::Parser or HTML::TreeBuilder to pull apart stuff like that! Parsing HTML using regexen is a bad, bad, bad idea and you will have an unhappy life.


      DWIM is Perl's answer to Gödel
        lol, ya, but this is why I love Perl... there are a million ways to kill a cat and a million more ways to skin it...
        I think I've just about got it. I'll fill you all in with my stupidity when I'm done.