
Hello. It's my first time here, and I'm new to Perl (and to programming in general). I'm trying to extract coordinates from a lot (about 15 GB) of small (3-5 kB) text files, and then I'll need to insert them into an SQL database. So far I've reached the regex part, and I'm stuck.

The files I need to process are of this format:

DER.7-767/04.7 5194.5700 -6772.5200 0.0000 
DER.7-767/04.8 5194.7400 -6776.3200 0.0000 
DER.7-767/04.9 5192.1000 -6776.4300 0.0000 
Der.7-539/99.1 5337.9000 6997.1200 0.0000 
Der.7-539/99.10 5348.3300 -7020.0900 0.0000 
Der.7-539/99.11 5348.4400 -7021.1100 0.0000 

Kredyt3.27 5789322.3040 7500854.8800 0.0000 
Kredyt3.27a -124.9646 373.4666 0.0000 
Kredyt3.28 5789295.3170 7500857.7380 0.0000 
Kredyt3.28a -151.9768 376.3191 0.0000 
Kredyt3.29 5789298.8620 7500874.6180 0.0000 
Kredyt3.29a -148.4337 393.2154 0.0000 
Kredyt3.2a -63.0262 297.6930 0.0000 
Kredyt3.3 5789369.8750 7500785.7170 0.0000 
Kredyt3.30 5789303.2010 7500873.9300 0.0000 
Kredyt3.30a -144.0905 392.5281 0.0000 
Kredyt3.31 5789302.7240 7500869.9080 0.0000 
Kredyt3.31a -144.5668 388.5023 0.0000 
Kredyt3.32 5789307.5930 7500869.2210 0.0000 
Kredyt3.32a -139.6932 387.8161 0.0000 
Kredyt3.33 5789307.9110 7500871.6550 0.0000 
Kredyt3.33a -139.3756 390.2524 0.0000

And my code so far is this:

#!/usr/bin/perl
##parser3.plx
use strict;
use warnings;
use diagnostics;

#call up variables
my(@array);
my($file, $filename1, $filename2, $line, $i, $tmp);
$i = 1;
$filename2 = '';

#opens a list of the files that need processing
open LIST, 'listA' or die "(L1)We've got a problem: $!";

while (<LIST>)
{
    $filename1 = <LIST>;
    fileswitch();
    $file = $_;
    
    #opens the actual file that will be processed
    open FILE, "$file" or die "(L2)We've got a problem: $!";
    while (<FILE>)
        {
          $line = $_;
          #-----------------$1----$2-----$3------$4---$5------$6
          if($line =~ /\s+([-*])(\d+)\.(\d+)\s+([-*])(\d+)\.(\d+)\s*/g)
            {
              #Append the coordinates (with the '-' sign where appropriate)
              push(@array, "$1$2.$3 || $4$5.$6 \n");
            }
        }
    #close file
    close(FILE);
}
#close list
close(LIST);
#print to file
p2f();

#check if path stays the same, if not then append a note to the database
sub fileswitch  
  {
    if($filename1 ne $filename2)
      {
        push(@array, "Path: $filename1\n");
        #print "\n $filename1 \n" ;
        #print (".pkt file parsing completed. \n");
      }
    else
      {
        $filename2 = $filename1;
      }
  }
  
  
  #print to file
  sub p2f
    {     
      open (COORDLIST, '>>coordinates');
      print COORDLIST @array;
        close(COORDLIST);
    }

And when I check the file it is supposed to print to, I find this:

Path: #sorry I won't leave those :)

-52157127.9760 || -2989955.5568 
-52158244.6810 || -6741268.4549 
-52157681.8715 || -1698959.4033 
-50440239.8128 || -1701475.3622 
-50441191.7990 || -1705583.1112 
-57952305.4315 || -7490163.6682 
-52157134.5720 || -27730.8039 
Path: #sorry I won't leave those :)

Path: #sorry I won't leave those :)

Of course, the tests are on a smaller scale, just 40 kB of data, but that should still give me at least 20 000 lines.
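
One thing that might account for some of the missing lines: the outer loop reads from LIST twice per pass (once in the `while` condition and once for `$filename1`), so if `listA` holds one file name per line, every second name is never opened. A minimal sketch of reading the list one name at a time, assuming that one-name-per-line layout; `read_file_list` and the `.pkt` names in the demo are hypothetical and not part of the script above:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: returns the non-empty lines of a list file,
# one file name per line, with the trailing newline removed.
sub read_file_list {
    my ($list_path) = @_;
    open my $list, '<', $list_path or die "Cannot open $list_path: $!";
    my @names;
    while (my $name = <$list>) {    # exactly one read per iteration
        chomp $name;
        next if $name eq '';
        push @names, $name;
    }
    close $list;
    return @names;
}

# Demo with a throwaway list file; the names are made up.
my ($fh, $tmp) = tempfile(UNLINK => 1);
print {$fh} "a.pkt\nb.pkt\nc.pkt\n";
close $fh;
print "$_\n" for read_file_list($tmp);
```

If `listA` really does alternate two kinds of lines (say, a file name followed by its path), the double read would be intentional and this sketch would not apply.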

Oh, and when I checked by adding print statements everywhere, it actually did go into the 'if' block 7 or 8 times. So my guess is that my regular expression skills are at fault, but I just can't find what's wrong. The pattern seems to match the format of the coordinates.
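
A possible explanation, offered as a sketch rather than a definitive fix: `[-*]` is a character class that matches exactly one `-` or `*`, so the pattern can only succeed where both numbers carry a sign. That would fit the output, where every pair is negative. Writing the sign as `-?` makes it optional. The helper name `extract_xy` below is hypothetical, and the sample lines are copied from the data above:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first two numeric columns of a line
# as strings, or an empty list when the line does not match.
sub extract_xy {
    my ($line) = @_;
    # -? means "an optional minus sign"; each number is captured whole.
    if ($line =~ /\s(-?\d+\.\d+)\s+(-?\d+\.\d+)/) {
        return ($1, $2);
    }
    return;
}

for my $line (
    'DER.7-767/04.7 5194.5700 -6772.5200 0.0000 ',
    'Kredyt3.27a -124.9646 373.4666 0.0000 ',
) {
    my ($x, $y) = extract_xy($line);
    print "$x || $y\n" if defined $x;
}
# prints:
# 5194.5700 || -6772.5200
# -124.9646 || 373.4666
```

The `/g` modifier is left out here because each line is tested only once. The sketch also assumes the point label never contains whitespace, which holds for the sample data shown.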

Anyway, thanks in advance.

--Ignas

In reply to Extracting coordinates by Ignas
