Ignas has asked for the wisdom of the Perl Monks concerning the following question:

Hello. It's my first time here, and I'm new to Perl (and all programming, too). I'm trying to extract coordinates from lots (about 15 GB) of small (3-5 kB) text files. Then I'll need to append them into an SQL database. Until now I've got to the regex part, and I'm stuck.

The files I need to process are of this format:

DER.7-767/04.7 5194.5700 -6772.5200 0.0000
DER.7-767/04.8 5194.7400 -6776.3200 0.0000
DER.7-767/04.9 5192.1000 -6776.4300 0.0000
Der.7-539/99.1 5337.9000 6997.1200 0.0000
Der.7-539/99.10 5348.3300 -7020.0900 0.0000
Der.7-539/99.11 5348.4400 -7021.1100 0.0000
Kredyt3.27 5789322.3040 7500854.8800 0.0000
Kredyt3.27a -124.9646 373.4666 0.0000
Kredyt3.28 5789295.3170 7500857.7380 0.0000
Kredyt3.28a -151.9768 376.3191 0.0000
Kredyt3.29 5789298.8620 7500874.6180 0.0000
Kredyt3.29a -148.4337 393.2154 0.0000
Kredyt3.2a -63.0262 297.6930 0.0000
Kredyt3.3 5789369.8750 7500785.7170 0.0000
Kredyt3.30 5789303.2010 7500873.9300 0.0000
Kredyt3.30a -144.0905 392.5281 0.0000
Kredyt3.31 5789302.7240 7500869.9080 0.0000
Kredyt3.31a -144.5668 388.5023 0.0000
Kredyt3.32 5789307.5930 7500869.2210 0.0000
Kredyt3.32a -139.6932 387.8161 0.0000
Kredyt3.33 5789307.9110 7500871.6550 0.0000
Kredyt3.33a -139.3756 390.2524 0.0000

And my code so far is this:

#!/usr/bin/perl
##parser3.plx
use strict;
use warnings;
use diagnostics;

#call up variables
my(@array);
my($file, $filename1, $filename2, $line, $i, $tmp);
$i = 1;
$filename2 = '';

#opens a list of the files that need processing
open LIST, 'listA' or die "(L1)We've got a problem: $!";
while (<LIST>) {
    $filename1 = <LIST>;
    fileswitch();
    $file = $_;
    #opens the actual file that will be processed
    open FILE, "$file" or die "(L2)We've got a problem: $!";
    while (<FILE>) {
        $line = $_;
        #-----------------$1----$2-----$3------$4---$5------$6
        if($line =~ /\s+([-*])(\d+)\.(\d+)\s+([-*])(\d+)\.(\d+)\s*/g) {
            #Append the coordinates (with the '-' sign where appropriate)
            push(@array, "$1$2.$3 || $4$5.$6 \n");
        }
    }
    #close file
    close(FILE);
}
#close list
close(LIST);

#print to file
p2f();

#check if path stays the same, if not then append a note to the database
sub fileswitch {
    if($filename1 ne $filename2) {
        push(@array, "Path: $filename1\n");
        #print "\n $filename1 \n" ;
        #print (".pkt file parsing completed. \n");
    }
    else {
        $filename2 = $filename1;
    }
}

#print to file
sub p2f {
    open (COORDLIST, '>>coordinates');
    print COORDLIST @array;
    close(COORDLIST);
}

And when I check the file it is supposed to print to, I find this:

Path: #sorry I won't leave those :)
-52157127.9760 || -2989955.5568
-52158244.6810 || -6741268.4549
-52157681.8715 || -1698959.4033
-50440239.8128 || -1701475.3622
-50441191.7990 || -1705583.1112
-57952305.4315 || -7490163.6682
-52157134.5720 || -27730.8039
Path: #sorry I won't leave those :)
Path: #sorry I won't leave those :)

Of course, the tests are on a smaller scale, just 40kB of data, but still it should give me at least 20 000 lines.

Oh and when I checked by adding 'print' codes everywhere it actually did go into the 'if' statement 7 or 8 times. So my guess would be that it's my regular expression skills at fault, but I just can't find what's wrong. It seems to match the pattern the coordinates represent.

Anyway, thanks in advance.

--Ignas

Replies are listed 'Best First'.
Re: Extracting coordinates
by toolic (Bishop) on Mar 20, 2010 at 22:56 UTC
    If you suspect a problem with the regex, the first thing I would do is get rid of the g modifier. I'm not saying it will solve your problem, but it does seem out of place. Also, Tip #9 from the Basic debugging checklist recommends using YAPE::Regex::Explain to demystify regular expressions.

    If you could re-create your problem with some self-contained code and data, it would be something we could run. As it stands, you are reading in multiple files which we do not have access to.
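
    A minimal sketch of such a self-contained test case. It assumes three lines borrowed from the sample data in the question plus one made-up line in which both coordinates are negative. The sub name match_pairs is hypothetical, the first pattern is the one from the question with its six captures merged into two, and the second pattern (with an optional minus, -?) is an illustrative alternative:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Three lines copied from the sample data in the question, plus one
# invented line ("Hypo.1") where both coordinates are negative.
my @sample = (
    "DER.7-767/04.7 5194.5700 -6772.5200 0.0000\n",
    "Kredyt3.27 5789322.3040 7500854.8800 0.0000\n",
    "Kredyt3.27a -124.9646 373.4666 0.0000\n",
    "Hypo.1 -1.5000 -2.5000 0.0000\n",
);

# Hypothetical helper: return "x || y" for every line the pattern matches.
sub match_pairs {
    my ($re, @lines) = @_;
    my @pairs;
    for my $line (@lines) {
        push @pairs, "$1 || $2" if $line =~ $re;
    }
    return @pairs;
}

# [-*] is a character class: exactly one '-' or '*' is REQUIRED there.
my $original = qr/\s+([-*]\d+\.\d+)\s+([-*]\d+\.\d+)/;

# -? makes the minus sign optional instead.
my $optional = qr/\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)/;

print "original pattern:\n";
print "  $_\n" for match_pairs($original, @sample);
print "optional-minus pattern:\n";
print "  $_\n" for match_pairs($optional, @sample);
```

    With these four lines the first pattern accepts only the made-up line, because it demands a sign in front of both numbers, while the second accepts all four.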

    Here are some other things that look suspicious to me:

    • You use your input lines as filenames without chomping them first.
    • You are using $_ in different places inside nested while loops. It might be easier to use a differently named, lexically scoped variable for each loop.

    Also, I recommend commenting out use diagnostics; when you are done debugging.

Re: Extracting coordinates
by moritz (Cardinal) on Mar 20, 2010 at 23:01 UTC
    while (<LIST>) {
        $filename1 = <LIST>;

    This discards every other line in the file; is that really what you want? If not, you will have an easier time writing

    while ($filename1 = <LIST>) {
        # remove the trailing newline from $filename1:
        chomp $filename1;
        ...
    }
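
    To see the effect of reading from the handle twice per pass, here is a small sketch. It assumes a made-up list of four file names held in a string, so no real files are needed; the sub names read_twice and read_once are hypothetical:

```perl
use strict;
use warnings;

# Hypothetical list of file names, kept in a string instead of a file.
my $list = "a.pkt\nb.pkt\nc.pkt\nd.pkt\n";

# The loop shape from the question: two reads per iteration.
sub read_twice {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory list: $!";
    my @seen;
    local $_;
    while (<$fh>) {              # first read: the line lands in $_
        my $name = <$fh>;        # second read: this is the NEXT line
        last if !defined $name;  # odd number of lines: nothing left to read
        chomp $name;
        push @seen, $name;
    }
    close $fh;
    return @seen;
}

# One read per iteration.
sub read_once {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory list: $!";
    my @seen;
    while (my $name = <$fh>) {
        chomp $name;
        push @seen, $name;
    }
    close $fh;
    return @seen;
}

print "two reads per pass: @{[ read_twice($list) ]}\n";
print "one read per pass:  @{[ read_once($list) ]}\n";
```

    The first sub only ever stores the second, fourth, ... names, since each pass consumes two lines.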

    Also I don't understand why you need such a complicated regex - using split on whitespace seems much easier to me.

    All in all I have a hard time following your code - if you gave a verbal description of what it should actually do, it would be easier to help you.

    Perl 6 - links to (nearly) everything that is Perl 6.
      DER-71-14/05.16 4930.2800 -6590.5100 0.0000
      ----------------^^^^^^^^^-^^^^^^^^^^---

      I need to take these parts and put them into a database side by side (SQL). Sorry for the messy code, I've no idea how else I could do this (no previous programming experience, no knowledge).

      I'd also like to have it done as fast as it can be done, that's why I tried to do such a regex. It would be best to do everything in one load (that is, calculate the 15 GB just once).

      Thanks

        my ($identifier, $coord1, $coord2) = split /\s+/, $line;
        # then you have $coord1 = '4930.2800', $coord2 = '-6590.5100'
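
        A slightly fuller sketch of the split approach, assuming every useful line has at least three whitespace-separated fields. It uses split ' ' (a single-space string) rather than /\s+/, because that form also skips leading whitespace instead of producing an empty first field. The sub name parse_line is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: return (identifier, x, y), or an empty list
# when the line has fewer than three fields.
sub parse_line {
    my ($line) = @_;
    my ($identifier, $coord1, $coord2) = split ' ', $line;
    return if !defined $coord2;
    return ($identifier, $coord1, $coord2);
}

my @fields = parse_line("DER-71-14/05.16 4930.2800 -6590.5100 0.0000\n");
print "$fields[1] || $fields[2]\n" if @fields;
```

        Note that split does not check that the fields look like numbers; a line with three arbitrary words would be accepted too.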
Re: Extracting coordinates
by GrandFather (Saint) on Mar 21, 2010 at 00:46 UTC

    Using strictures (use strict; use warnings;) is very good. Checking the result of open is very good.

    Declaring all your variables up front in one hit is bad. Not using the three parameter version of open is bad. Not using lexical file handles is bad. Using global variables in subs is very bad.

    In fact the fileswitch() sub isn't needed. Unless you have multiple files with the same name (most OS's don't allow that) each time through the outer loop handles a new file.

    The regex can be simplified - no need to capture 6 different parts that you are just going to glue back together again.

    Cleaning the code up and changing the file handling to suit sample code we get:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use diagnostics;

    # Fake up a couple of data files
    my %dataFiles = (
        data1 => <<DATA1,
    DER.7-767/04.7 5194.5700 -6772.5200 0.0000
    DER.7-767/04.8 5194.7400 -6776.3200 0.0000
    DER.7-767/04.9 5192.1000 -6776.4300 0.0000
    Der.7-539/99.1 5337.9000 6997.1200 0.0000
    Der.7-539/99.10 5348.3300 -7020.0900 0.0000
    Der.7-539/99.11 5348.4400 -7021.1100 0.0000
    DATA1
        data2 => <<DATA2,
    Kredyt3.27 5789322.3040 7500854.8800 0.0000
    Kredyt3.27a -124.9646 373.4666 0.0000
    Kredyt3.28 5789295.3170 7500857.7380 0.0000
    Kredyt3.28a -151.9768 376.3191 0.0000
    Kredyt3.29 5789298.8620 7500874.6180 0.0000
    Kredyt3.29a -148.4337 393.2154 0.0000
    Kredyt3.2a -63.0262 297.6930 0.0000
    Kredyt3.3 5789369.8750 7500785.7170 0.0000
    Kredyt3.30 5789303.2010 7500873.9300 0.0000
    Kredyt3.30a -144.0905 392.5281 0.0000
    Kredyt3.31 5789302.7240 7500869.9080 0.0000
    Kredyt3.31a -144.5668 388.5023 0.0000
    Kredyt3.32 5789307.5930 7500869.2210 0.0000
    Kredyt3.32a -139.6932 387.8161 0.0000
    Kredyt3.33 5789307.9110 7500871.6550 0.0000
    Kredyt3.33a -139.3756 390.2524 0.0000
    DATA2
    );

    my $listA = "data1\ndata2\n";
    my @coords;

    #opens a list of the files that need processing
    open my $inFile, '<', \$listA or die "(L1)We've got a problem: $!";
    while (defined (my $filename = <$inFile>)) {
        chomp $filename;
        next if !length $filename;
        push @coords, "Path: $filename\n";

        #opens the actual file that will be processed
        open my $inData, '<', \$dataFiles{$filename}
            or die "(L2)We've got a problem: $!";
        while (defined (my $line = <$inData>)) {
            next if $line !~ /\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)/;

            #Append the coordinates (with the '-' sign where appropriate)
            push (@coords, "$1 || $2 \n");
        }

        close ($inData);
    }

    close ($inFile);
    print @coords;

    Prints:

    Path: data1
    5194.5700 || -6772.5200
    5194.7400 || -6776.3200
    5192.1000 || -6776.4300
    5337.9000 || 6997.1200
    5348.3300 || -7020.0900
    5348.4400 || -7021.1100
    Path: data2
    5789322.3040 || 7500854.8800
    -124.9646 || 373.4666
    5789295.3170 || 7500857.7380
    -151.9768 || 376.3191
    5789298.8620 || 7500874.6180
    -148.4337 || 393.2154
    -63.0262 || 297.6930
    5789369.8750 || 7500785.7170
    5789303.2010 || 7500873.9300
    -144.0905 || 392.5281
    5789302.7240 || 7500869.9080
    -144.5668 || 388.5023
    5789307.5930 || 7500869.2210
    -139.6932 || 387.8161
    5789307.9110 || 7500871.6550
    -139.3756 || 390.2524

    True laziness is hard work

      Woah mister, thank you greatly.

      I don't quite understand what some of this does, but I believe I'll figure it out with some docs and google.

      Again, thank you.

        Don't get too worried over the less obvious bits of the file handling stuff. Perl lets you use a variable as a file by using a reference to it in the open:

        my $stringBeingAFile = "This is the contents of the stringy file\n";
        open my $inFile, '<', \$stringBeingAFile;

        which I use in the sample script to avoid having to create temporary files for demonstration purposes. In the current case it's slightly more complicated because you want to deal with multiple files so I use a hash that is pretending to be a directory of files (really a hash of file name => file content pairs).

        But, as I suggested, you needn't worry about that. So long as you can see past the first 36 lines and ignore the "use a string as a file" syntax in the opens, the remainder of the code is the important bit.
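
        For comparison, a sketch of how the same loop might look against real files rather than strings. It assumes the list file holds one path per line, as in the question; the sub name extract_coords and the error messages are hypothetical, and the two file names in the usage line are the ones from the original post:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read file names from $listPath, pull the first two
# decimal numbers that follow the identifier out of every line of each
# file, and append the results to $outPath.
sub extract_coords {
    my ($listPath, $outPath) = @_;

    open my $inList, '<',  $listPath or die "Can't open $listPath: $!";
    open my $out,    '>>', $outPath  or die "Can't open $outPath: $!";

    while (defined (my $filename = <$inList>)) {
        chomp $filename;
        next if !length $filename;
        print $out "Path: $filename\n";

        open my $inData, '<', $filename or die "Can't open $filename: $!";
        while (defined (my $line = <$inData>)) {
            next if $line !~ /\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)/;
            print $out "$1 || $2\n";
        }
        close $inData;
    }

    close $inList;
    close $out or die "Can't close $outPath: $!";
    return;
}

# Usage: perl extract.pl listA coordinates
extract_coords(@ARGV) if @ARGV == 2;
```

        Each match is printed as soon as it is found instead of being collected in an array first, so memory use stays flat however many files are listed.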


        True laziness is hard work
      In cases like this, imo, it may be worth declaring/compiling the regex outside of any loops.
      my $re = qr{\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)};
      # and then later, inside the loops
      next if $line !~ $re;
      You would need to do benchmarks to know whether it was actually worth it. The docs are a tad circumspect. :-)
        You would need to do benchmarks to know whether it was actually worth it.

        No, you wouldn't: since there are no variables in the regex, it makes no difference.
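
        For anyone who wants to check this on their own machine, a rough sketch using the core Benchmark module. The sample line is taken from the question; the sub name time_regex_styles and the labels literal and precompiled are hypothetical, and the numbers will vary from run to run:

```perl
use strict;
use warnings;
use Benchmark qw(timethese cmpthese);

my $line = "DER.7-767/04.7 5194.5700 -6772.5200 0.0000\n";
my $re   = qr{\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)};

# Hypothetical helper: time both styles and return the raw results.
# A positive $count is a number of iterations; a negative one means
# "run each test for at least that many CPU seconds".
# Style 'none' suppresses the per-test report.
sub time_regex_styles {
    my ($count) = @_;
    return timethese($count, {
        literal     => sub { my $hit = $line =~ /\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)/ },
        precompiled => sub { my $hit = $line =~ $re },
    }, 'none');
}

# Print a comparison table after about one CPU second per test.
cmpthese(time_regex_styles(-1));
```

        A single line matched in a tight loop is not the same workload as 15 GB of small files, where disk reads are likely to dominate, so treat any difference shown here with caution.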

Re: Extracting coordinates
by se@n (Initiate) on Mar 20, 2010 at 23:17 UTC

    This is a general answer. I didn't review the code, but there's a simple and swift approach to parsing formatted data files. The data and delimiters are ALWAYS opposites. If you have a space-delimited file with 3 fields, use:


    if($line =~ /^(\S+)\s+(\S+)\s+(\S+)/) {
        @fields = ($1,$2,$3)
    }
    else {
        print "Regex Error on $line\n"
    }

    If you have a colon-delimited file with 3 fields, use:


    if($line =~ /([^:]+):([^:]+):([^:]+)/) {
        @fields = ($1,$2,$3)
    }
    else {
        print "Regex Error on $line\n"
    }

    \s is the opposite of \S
    : is the opposite of [^:]

    Then you can manipulate individual fields.

    Sean