in reply to how to place rows information into column?

Dear Perl Monks,
My question continues my old post "how to place rows information into column", answered by Ted, Holli & jZed. Thanks a lot for your help.
The sample file I posted earlier is different from the actual one. I just wanted to get a hint from you guys and learn things to apply on my own, but later, when I tried to implement it on my real file, I couldn't find a way.
Actually, I am unable to create a matching pattern statement for my hash. Please help me here with the real file, which looks like:

0.000000e+000 1.502947e+000
1.162272e-012 1.508957e+001
2.324544e-012 1.508948e+000
.....
0.000000e+000 1.502947e+001
1.162272e-012 1.508941e+001
2.324544e-012 1.508940e+000
....
0.000000e+000 1.503947e+000
1.162272e-012 1.504947e+000
2.324544e-012 1.508900e+001
..... so on

It's a 300MB file; the two values are tab separated. The first-column values repeat after, let's say, 2000 lines. Their paired values vary (but could be the same).

I would appreciate precise comments, especially on the matching statement.

Thanks in advance.
Syed.

Replies are listed 'Best First'.
Re: how to place rows information into column (2)
by jZed (Prior) on Feb 03, 2005 at 03:21 UTC
    If your data is "tab delimited" and you have 300MB of it, you can use Text::CSV_XS, the fastest method of parsing "delimited" data (which is really "separated" data). Just create the parser object with sep_char => "\t" and it will handle your data fine.
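A minimal sketch of what that might look like, assuming Text::CSV_XS is installed from CPAN. The two sample lines and the in-memory filehandle are only stand-ins for the real file; `binary => 1` is an extra option not mentioned above, added here because the module's documentation generally recommends it.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Tab as the field separator; binary => 1 is an assumption, not part of the reply above.
my $csv = Text::CSV_XS->new({ sep_char => "\t", binary => 1 })
    or die "Cannot use Text::CSV_XS: " . Text::CSV_XS->error_diag;

# Illustrative data in an in-memory filehandle; open the real file here instead.
my $data = "0.000000e+000\t1.502947e+000\n"
         . "1.162272e-012\t1.508957e+001\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @rows;
# getline parses one record per call, so the whole file is never held at once.
while (my $row = $csv->getline($fh)) {
    my ($x, $y) = @$row;
    push @rows, [$x, $y];
    print "X:[$x] Y:[$y]\n";
}
close $fh;
```

Because the file is read one record at a time, the 300MB size should not be a problem as long as only what is needed from each row is kept.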
Re: how to place rows information into column (2)
by sh1tn (Priest) on Feb 02, 2005 at 21:00 UTC
    my $file = "0.000000e+000 1.502947e+000\n"
             . "1.162272e-012 1.508957e+001\n"
             . "2.324544e-012 1.508948e+000\n";
    my $regex = {
        first  => qr{^(\S+)},    # anything non-blank from the beginning '^'
        second => qr{\s+(\S*)},  # anything non-blank after the first blank '\s+'
    };
    for (split /\n/, $file) {
        # 'o' compiles the regex only once; upon success we take
        # the first column and/or the second one
        /$regex->{first}$regex->{second}/o
            and print "Key: $1 and Value: $2\n";
        # So the hash structure can have $hash{$1} = $2
    }
      Hi sh1tn,

      Are we assigning these lines
      '0.000000e+000 1.502947e+000
      1.162272e-012 1.508957e+001
      2.324544e-012 1.508948e+000';
      to the variable $file? Does the '...' in the middle stand for some 2000 more pairs in sequence? Kindly explain the very first line of your code.
      many thanks,
      riz.
        Hi riz,

        The very first line uses a scalar instead of a filehandle for simplicity.
        As far as I understand, the more important part is the way we match (or split) these two columns.
        Do you really think that another 2000 lines matter?
        As far as RAM or performance is concerned, it doesn't matter.
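To connect the two points above (reading from a filehandle rather than a scalar, and building a hash), here is a hedged sketch. It assumes the goal is one output row per distinct first-column value, with every paired value collected as a further column; that reading of the question, the variable names, and the sample lines are all assumptions. Note that a plain `$hash{$1} = $2` would keep only the last value for each repeated key, so the sketch pushes onto an array instead.

```perl
use strict;
use warnings;

# Illustrative data: the first-column values repeat after every block.
my $data = join '',
    "0.000000e+000\t1.502947e+000\n",
    "1.162272e-012\t1.508957e+001\n",
    "0.000000e+000\t1.502947e+001\n",
    "1.162272e-012\t1.508941e+001\n";

# In-memory filehandle; for the real file, open its path instead.
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my (%values_for, @key_order);
while (my $line = <$fh>) {
    chomp $line;
    my ($key, $value) = split /\t/, $line;
    next unless defined $value;          # skip blank or malformed lines
    # Remember the order in which keys first appear.
    push @key_order, $key unless exists $values_for{$key};
    # Append to the key's list instead of overwriting the earlier value.
    push @{ $values_for{$key} }, $value;
}
close $fh;

# One tab-separated row per key: the key, then all of its values.
my @output;
for my $key (@key_order) {
    push @output, join("\t", $key, @{ $values_for{$key} });
}
print "$_\n" for @output;
```

This keeps every value in memory, so for the full 300MB file the hash will need at least that much RAM.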
Re: how to place rows information into column (2)
by osunderdog (Deacon) on Feb 02, 2005 at 21:50 UTC

    I wouldn't try to pattern match the scientific float... you could, but I think it's overkill. If the data is tab delimited, then you can use that to distinguish the two values:

    use strict;
    while (<DATA>) {
        # Get rid of trailing \n on line.
        chomp;
        # divide the line into two items based on tab
        my ($x, $y) = split("\t");
        print "X:[$x] Y: [$y]\n";
    }
    ##OUTPUT:
    # X:[0.000000e+000] Y: [1.502947e+000]
    # X:[1.162272e-012] Y: [1.508957e+001]
    # X:[2.324544e-012] Y: [1.508948e+000]
    # X:[0.000000e+000] Y: [1.502947e+001]
    # X:[1.162272e-012] Y: [1.508941e+001]
    # X:[2.324544e-012] Y: [1.508940e+000]
    # X:[0.000000e+000] Y: [1.503947e+000]
    # X:[1.162272e-012] Y: [1.504947e+000]
    # X:[2.324544e-012] Y: [1.508900e+001]
    ##NOTE in data the two fields are tab delimited.
    __DATA__
    0.000000e+000 1.502947e+000
    1.162272e-012 1.508957e+001
    2.324544e-012 1.508948e+000
    0.000000e+000 1.502947e+001
    1.162272e-012 1.508941e+001
    2.324544e-012 1.508940e+000
    0.000000e+000 1.503947e+000
    1.162272e-012 1.504947e+000
    2.324544e-012 1.508900e+001

    "Look, Shiny Things!" is not a better business strategy than compatibility and reuse.

      "I wouldn't try to pattern match the scientific float... you could, but I think it's over-kill."

      Do you really have any idea what /^(\S+)\s+(\S*)/o does?

      /^(\S+)\s+(\S*)/o and print FH "K: ", $1, "V: ", $2, "\n" while <DATA>;


      where <DATA> contains over 22000 lines, this takes less than a second
      on my old (less than 1800 MHz CPU) home machine.
        /^(\S+)\s+(\S*)/o

        Starting at the beginning of the line (^), it matches one or more non-whitespace characters, followed by one or more whitespace characters, followed by zero or more non-whitespace characters. The o indicates 'compile pattern only once'.

        from perldoc perlop

        If you want such a pattern to be compiled only once, add a "/o" after the trailing delimiter. This avoids expensive run-time recompilations, and is useful when the value you are interpolating won’t change over the life of the script.

        The parens capture those non-whitespace matches into the variables $1 and $2 for use within the scope.


        "Look, Shiny Things!" is not a better business strategy than compatibility and reuse.
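As a small illustration of the capture behaviour described above, the sketch below applies the same pattern to one sample line. The variable names are made up for the example, and /o is left out because, per the perlop passage quoted above, it only matters when the pattern interpolates a variable.

```perl
use strict;
use warnings;

# Illustrative line taken from the sample data, with a tab between the fields.
my $line = "1.162272e-012\t1.508957e+001\n";

my ($first, $second);
if ($line =~ /^(\S+)\s+(\S*)/) {
    # $1 and $2 are only meaningful after a successful match,
    # so copy them while still inside the if block.
    ($first, $second) = ($1, $2);
    print "K: $first V: $second\n";
}
else {
    warn "Line did not match: $line";
}
```

Testing the match before using $1 and $2 avoids picking up stale captures from an earlier successful match.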