Taking your points in order:

file size is unknown

Not by us, but kevyt is probably aware and can make a value judgement, reconciling the size of his file with the memory resources available.

processing does not start until you are done reading

I can't think why that would be a problem here. Could you please expand on why this is bad?

using at the very least three or four times the size of the file in memory

Yes, but as in point one, kevyt can decide whether he has the resources to accommodate this. We don't know what resources are available.

most people will not realise that you are reading it into memory first. If you really want to read it all at once then I would suggest reading it into an array first

This is a difficult topic. To what extent do you balance using the features of Perl, or any language, against making your code accessible to beginners in the language? It has to depend on the type of workplace, the experience level of the workforce and the amount of staff churn. An experienced, stable programming team can perhaps make greater use of language features. However, if you never expose people to new techniques, they will never learn them. This exposure can be via training/mentoring or by encouraging and rewarding self-study. Personally, I am in favour of educating programmers so they can make more informed choices from a larger tool bag in order to solve problems.

the processing will be slower than reading it with a while and immediately creating a hash element for each record. Now you read it into a temporary list, then you loop over that list, while looping over it you create a new list, and then you finally assign that list to a hash. Not that you will notice the speed/memory difference, but that doesn't mean it's not there

Well, let's test it. Using a data file kludged up from /usr/dict/words so that we have unique keys as the first of four pipe-delimited fields per line (file size just under 1MByte), I ran some benchmarks. Here's the code:

use strict;
use warnings;
use Benchmark q{cmpthese};

my $inFile = q{/work/johngg/spw593475.dat};
open my $inFH, q{<}, $inFile or die qq{open: $inFile: $!\n};
my $startPos = tell $inFH;

my $rcArray = sub {
    seek $inFH, $startPos, 0;
    my @lines = <$inFH>;
    chomp @lines;
    my %dataHash = ();
    foreach (@lines) {
        my ($key, $value) = split m{\|}, $_, 2;
        $dataHash{$key} = $value;
    }
    return \%dataHash;
};

my $rcByLine = sub {
    seek $inFH, $startPos, 0;
    my %dataHash = ();
    while (<$inFH>) {
        chomp;
        my ($key, $value) = split m{\|}, $_, 2;
        $dataHash{$key} = $value;
    }
    return \%dataHash;
};

my $rcMap = sub {
    seek $inFH, $startPos, 0;
    my %dataHash = map {chomp; split m{\|}, $_, 2} <$inFH>;
    return \%dataHash;
};

cmpthese (10, {
    Array  => $rcArray,
    ByLine => $rcByLine,
    Map    => $rcMap,
});

close $inFH or die qq{close: $inFile: $!\n};
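For anyone wanting to knock up similar test data, something along these lines would do. This is only a hypothetical sketch: the actual file preparation isn't shown above, the output filename matches the one used in the benchmark, and the three extra fields are just made-up padding.

use strict;
use warnings;

# Hypothetical sketch: build a pipe-delimited test file from /usr/dict/words,
# using each word as a unique key followed by three made-up fields.
open my $wordsFH, q{<}, q{/usr/dict/words}
    or die qq{open: /usr/dict/words: $!\n};
open my $outFH, q{>}, q{spw593475.dat}
    or die qq{open: spw593475.dat: $!\n};

while ( my $word = <$wordsFH> ) {
    chomp $word;
    print $outFH join( q{|}, $word, q{fieldB}, q{fieldC}, q{fieldD} ), qq{\n};
}

close $wordsFH;
close $outFH or die qq{close: spw593475.dat: $!\n};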

I ran the benchmark five times and the map solution came out faster than the line-by-line approach in four of them, although the difference is probably not statistically significant. Reading into an array was consistently the slowest, by a larger margin. Here's the output:

$ spw593475
        s/iter  Array ByLine    Map
Array     1.30     --   -14%   -15%
ByLine    1.12    16%     --    -1%
Map       1.10    18%     1%     --
$ spw593475
        s/iter  Array    Map ByLine
Array     1.43     --   -14%   -18%
Map       1.22    17%     --    -5%
ByLine    1.16    23%     5%     --
$ spw593475
        s/iter  Array ByLine    Map
Array     1.31     --   -14%   -15%
ByLine    1.12    17%     --    -0%
Map       1.12    17%     0%     --
$ spw593475
        s/iter  Array ByLine    Map
Array     1.31     --   -13%   -15%
ByLine    1.13    16%     --    -1%
Map       1.11    17%     1%     --
$ spw593475
        s/iter  Array ByLine    Map
Array     1.30     --   -14%   -16%
ByLine    1.12    16%     --    -3%
Map       1.09    19%     3%     --
$

I also ran each method in a separate script to look at memory usage. As you would expect, line-by-line was the most frugal with an image of about 7MB, array came next at about 9MB and map was the most expensive at about 11MB, so your estimate of three to four times the data file size was spot on.
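For illustration, a standalone version of the map approach might look something like this. It is only a sketch, assuming the same data file path as above; the separate scripts actually used, and the way the image size was read (e.g. from ps), aren't reproduced here.

use strict;
use warnings;

# Hypothetical standalone version of the map approach, used only to let
# the process image size be inspected externally (e.g. with ps) while
# the hash is held in memory.
my $inFile = q{/work/johngg/spw593475.dat};
open my $inFH, q{<}, $inFile or die qq{open: $inFile: $!\n};

my %dataHash = map { chomp; split m{\|}, $_, 2 } <$inFH>;

close $inFH or die qq{close: $inFile: $!\n};

print scalar( keys %dataHash ), qq{ records loaded\n};
sleep 30;    # pause so memory usage can be checked from another terminal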

The platform is SPARC/Solaris, an Ultra 30 with a 300MHz processor and 384MB of memory running Solaris 9, and the data file was on a local disk; the Perl version was 5.8.4, compiled with gcc 3.4.2.

Regarding your final (added?) point, yes, I would have approached the problem in a different way had duplicate detection been a requirement.
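Just as a sketch, and not necessarily how I would have tackled it, one way to flag duplicate keys while loading line by line would be something like this (the sample records in the DATA section are made up for illustration):

use strict;
use warnings;

# Sketch: load pipe-delimited records line by line, warning on duplicate keys.
my %dataHash;
while ( my $line = <DATA> ) {
    chomp $line;
    my ( $key, $value ) = split m{\|}, $line, 2;
    if ( exists $dataHash{$key} ) {
        warn qq{duplicate key "$key" at line $.\n};
        next;    # keep the first value seen; adjust the policy to taste
    }
    $dataHash{$key} = $value;
}

print qq{$_ => $dataHash{$_}\n} for sort keys %dataHash;

__DATA__
apple|1|2|3
banana|4|5|6
apple|7|8|9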

Cheers,

JohnGG

Update: Fixed typo

