PerlMonks  

Regex Not Grabbing Everything

by JonDepp (Novice)
on Sep 17, 2010 at 13:52 UTC (#860485=perlquestion)

JonDepp has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I have a file that is set up in the following way:

REND PROV SERV DATE POS NOS PROC MODS BILLED ALLOWED DEDUCT COINS GRP/RC AMT PROV PD
________________________________________________________________________________________________________________________________
NAME DOE, JOHN HIC 1111111111 ACNT 1111111 ICN 1111111111111 ASG Y MOA MA01 MA18
12351141821118 111809 23 001 71010 26 31.00 0.00 0.00 0.00 CO-18 31.00 0.00
N347
12351141821118 111809 23 001 70450 26 142.00 44.70 0.00 8.94 OA-45 97.30 35.76
N265
PR-2 8.94
12351141821118 111809 23 001 74150 26 199.00 0.00 0.00 0.00 CO-18 199.00 0.00
N347
12351141821118 111809 23 001 72192 26 182.00 0.00 0.00 0.00 CO-18 182.00 0.00
N347
12351141821118 111809 23 001 72131 26 195.00 60.61 0.00 12.12 OA-45 134.39 48.49
N265
PR-2 12.12
PT RESP 21.06 CLAIM TOTALS 749.00 105.31 0.00 21.06 643.69 84.25
ADJ TO TOTALS: PREV PD INTEREST 0.00 LATE FILING CHARGE 0.00 NET 84.25
CLAIM INFORMATION FORWARDED TO : XXXXXX XXXXXXXX INSURANCE CO

I'm using the following code to grab part of this file and test a substring to see whether the 0.00 at one particular position is present.

use strict;
use warnings;

print "What file do you want parsed? ";
my $file=<STDIN>;
my @data;
my $data;
my $lines;

open (TEST,"$file") or die$!;
open OUTPUT, "> peptest.txt" or die$!;

while (<TEST>) {
    if (/NAME /../ADJ TO TOTALS:/) {
        push @data, $_;
        foreach $data (@data) {
            if ($data =~ /1235114182/) {
                $lines.=$_;
                my $zero = substr $lines, 118, 5;
                if ($zero == "0.00") {print OUTPUT "@data \n";}
                $zero="";
                $lines="";
                $data="";
                @data=();
            }
        }
    }
}

close TEST;
close OUTPUT;

The output I'm getting is the following:

NAME DOE, JOHN HIC 1111111111 ACNT 111111 ICN 1111111111111 ASG Y MOA MA01 MA18
12351141821118 111809 23 001 71010 26 31.00 0.00 0.00 0.00 CO-18 31.00 0.00
N265
PR-2 8.94
12351141821118 111809 23 001 74150 26 199.00 0.00 0.00 0.00 CO-18 199.00 0.00
N347
12351141821118 111809 23 001 72192 26 182.00 0.00 0.00 0.00 CO-18 182.00 0.00

What is wrong with the code or regex that it's not grabbing everything up to "ADJ TO TOTALS"?

Thanks in advance!!

Replies are listed 'Best First'.
Re: Regex Not Grabbing Everything
by japhy (Canon) on Sep 17, 2010 at 14:05 UTC
    Update: I need to take a closer look at your code.

    Could you explain, a bit more abstractly, what it is you are attempting to do? You want the lines which have ' 0.00' at column 118, correct?
    First, your bracing and indenting style leaves something to be desired. Here's how I'd write your code:
    while (<TEST>) {
      if (/NAME /../ADJ TO TOTALS:/) {
        push @data, $_;
        foreach my $data (@data) {
          if ($data =~ /1235114182/) {
            $lines .= $_;
            my $zero = substr $lines, 118, 5;  # <-- 5? or 4?

            # you had '==', you want 'eq'
            if ($zero eq "0.00") {
              print OUTPUT "@data \n";
            }
            @data = ();
            $zero = $lines = "";
          }
        }
      }
    }
    You were using == which is for strictly numeric data, but you want to use eq because you're looking specifically for the characters '0.00'. You either want to change your substr() to 4 characters, or else look for ' 0.00', I think.
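To make the difference concrete, here is a minimal sketch of how `==` and `eq` treat a five-character slice. The value of `$slice` is an assumption standing in for whatever `substr($line, 118, 5)` returns on a real line; it is not taken from the actual file.

```perl
use strict;
use warnings;

# Hypothetical 5-character slice, as substr($line, 118, 5) might return it.
my $slice = ' 0.00';

# Numeric comparison: both sides are converted to numbers first,
# so the leading space is ignored and 0 == 0 is true.
my $numeric_match = ($slice == 0.00) ? 1 : 0;

# String comparison: five characters never equal the four in '0.00'.
my $string_match = ($slice eq '0.00') ? 1 : 0;

# Taking exactly length('0.00') characters from the right offset
# makes the string comparison behave as intended.
my $exact_match = (substr($slice, 1, 4) eq '0.00') ? 1 : 0;

print "numeric: $numeric_match, string: $string_match, exact: $exact_match\n";
```

With a five-character slice, `eq '0.00'` can never succeed, which would be consistent with an empty output file after switching operators without also changing the length.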

    Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
    Nos autem praedicamus Christum crucifixum (1 Cor. 1:23) - The Cross Reference (My Blog)

      I made the changes you suggested. Changing the == to eq gave me back an empty output file and changing the double quotes to single quotes gives me the same output as before. I am extracting what I want from the file, but it seems to stop before the "ADJ TO TOTALS" part of the regex.

      There are 0.00 all over the file, so looking for just 0.00 will return almost everything from the original file. The 0.00 in that particular string at column 118 is where it is meaningful for me and if it is there I want the entire array from NAME to ADJ TO TOTALS. I can't see any reason why it would cut off the regex before the end.

      Thanks for your help!

        I also told you that the string '0.00' is only FOUR characters, and you were taking a substring of FIVE characters.

        Your code is written in a rather confusing manner. You've got code in loops that shouldn't be there. What you want to do is keep all the lines until you find one that matches your criteria, and then print the lines you've kept and all the lines following it. Here is a sample solution:
        # print all lines from 'START' to 'STOP'
        # if a line in between them has 'foobar' at position 10
        my $target = 'foobar';
        my $pos = 10;
        my (@buffer, $found);

        while (<FILE>) {
          if (/START/ .. /STOP/) {
            # if we have already found our target string, print this line
            if ($found) { print }
            # otherwise...
            else {
              # store this line in our buffer
              push @buffer, $_;

              # and if we find the target string at the right location
              # set $found to 1, and print the buffer
              if (substr($_, $pos, length($target)) eq $target) {
                $found = 1;
                print @buffer;
              }
            }
          }
        }

        Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
        Nos autem praedicamus Christum crucifixum (1 Cor. 1:23) - The Cross Reference (My Blog)
Re: Regex Not Grabbing Everything
by dasgar (Priest) on Sep 17, 2010 at 15:24 UTC

    As for the regex part, I'm not sure since I'm not familiar with using the combo of regexes and the range operator. However, I'd advise going a different route than using the substr command to find the critical data that you're trying to key in on. It looks like your data is somewhat columnar, which means that split /\s+/ can be used to create an array with each column stored in a different element.
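As a rough illustration of the split idea, the sketch below breaks one detail line into fields. The line is copied from the sample data with its whitespace collapsed, and treating the last column (PROV PD) as the one of interest is an assumption, not something the original post confirms.

```perl
use strict;
use warnings;

# One detail line from the sample data, with whitespace collapsed.
my $line = '12351141821118 111809 23 001 71010 26 31.00 0.00 0.00 0.00 CO-18 31.00 0.00';

# split ' ' splits on runs of whitespace and skips leading whitespace.
my @fields = split ' ', $line;

# Assumption: the amount of interest is the last column (PROV PD).
my $prov_pd = $fields[-1];

printf "%d fields, last = %s\n", scalar(@fields), $prov_pd;
print "zero payment\n" if $prov_pd == 0;
```

One caveat: if a column can be blank on some lines, the field positions shift, which is a reason a fixed-offset approach may still be needed for strictly fixed-width data.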

    The code below is how I would approach the problem. It kind of looks like you're planning to add more code to do other stuff, which means that the code below would probably need to be modified to fit in with your bigger plan.

    Code:

    Output file:

    12351141821118 111809 23 001 71010 26 31.00 0.00 + 0.00 0.00 CO-18 31.00 0.00 + N347 12351141821118 111809 23 001 74150 26 199.00 0.00 + 0.00 0.00 CO-18 199.00 0.00 + N347 12351141821118 111809 23 001 72192 26 182.00 0.00 + 0.00 0.00 CO-18 182.00 0.00 + N347
      I tried your code, but the output file was empty. My original regex grabs what I need but doesn't print out all the contents of the array. I commented out different parts of the code and ran them separately to see if the regex was putting what I wanted into the array to begin with, and it is. Everything works up until I check for the 0.00 and print if that condition is true. It does print if that condition is true but does not print up to the ADJ TO TOTALS part of the regex. Is there a reason why this part of the code would truncate the regex?

        Ok, finally found some time to look back at this. After doing some testing and closer examination and thinking about what's really happening, here are my thoughts and your fixed code.

           I tried your code, but the output file was empty.

        Not sure what's happening there. I have now tested this on 2 systems and it seems to be working for me. Please note that I downloaded your sample data into a file named data.txt and hard-coded that into my code. If you didn't do that, then you might encounter some issues. Anyways, I think you can forget about that code. See below.

           ...but does not print up to the ADJ TO TOTALS part of the regex. Is there a reason why this part of the code would truncate the regex?

        That forced me to actually try running your code (after, of course, cleaning up the alignment). Then as I was trying to figure out why the "ADJ TO TOTALS" line was not printed, I realized that the "NAME" line should not have been printed either, but it was. Sooooo, I pulled out one of the best tools for debugging regexes: the print statement. I started tossing in print statements to see what the heck was in the variables, so I could tell whether the problem was with the regex or with what was going into the regex. And the answer is.. (insert drum roll)...neither. I know. You're thinking "Huh? What? What did he say?". Follow along.

        First, look at your code. Where is the only print statement printing to the output file? It's inside of the if ($zero == "0.00") statement. If you look at the "NAME" and "ADJ" lines, $zero is not getting "0.00" so neither line should be printing. So what is actually being sent to the output file? That would be @data. I first changed that to $data and presto! The "NAME" and "ADJ" lines were not printed. Then I realized that you had a line where you were trying to reinitialize @data. The problem was you didn't do it in the right spot. You were only reinitializing it when the line had "1235114182", which is why the "NAME" line was printed. Changing the print statement back to using @data and relocating @data=(); also worked.

        In the end, there are two things that I did to find the issue. First, clean up the indenting so that I can quickly and easily understand what's inside of what brackets and braces. Second, debug with print statements. So why the long, convoluted response? To help illustrate the thought process that goes on in debugging. Sometimes that's more helpful than saying "here's the problem and here's the solution". In other words, I thought that walking you through the debug process would be more useful to you than just handing you the "steps" to "fix" your code.

        Two more quick points before I share the modified version of your code that I ran. First, I would agree with japhy that your logic behind the looping and variable use is somewhat confusing, which made it difficult to understand what's going on and where the problem was. Second, if you really wanted the "NAME" and "ADJ" lines printed, then the real problem is that you've discovered that perl is extremely good at doing what you told it to do instead of what you wanted it to do, which I, speaking from personal experience, admit can be very frustrating. In other words, your corrected code told perl to not print those lines.

        Ok, the cleaned up and modified code below is what I ran to debug your code. Try running it and take a look at all of the stuff that gets printed to the screen. You'll see how that was useful in telling me what was going on.

Re: Regex Not Grabbing Everything
by Marshall (Canon) on Sep 19, 2010 at 19:47 UTC
    I find your code very hard to understand. I would spend more time on the indenting style. Many Monks love the old style indenting. I prefer the newer style for new code although I will "go with the flow for old code". You are writing new code, so I would go with the newer indenting style. Both are "correct", but whatever style you choose (old vs new), do it right according to that style.

    Also, when you present code that doesn't do what you want, the more clearly you explain what it should be doing the better!

    When using the 2 or 3 dot operator, keep things simple and finish capturing the complete record, then process it - don't try to "back up" in the middle of the if statement. Perhaps setting a flag "hey this record is of interest" would be fine. My point is: Do more complex things only if there is a performance reason. The first objective should be simplicity and clarity.
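A minimal sketch of the "capture the complete record, then process it" idea with a flag might look like the following. The input lines are hypothetical and heavily shortened, and treating "a detail line ending in 0.00" as the condition of interest is an assumption made for illustration.

```perl
use strict;
use warnings;

# Hypothetical, shortened input; real lines are much wider.
my @input = (
    "HEADER LINE\n",
    "NAME DOE, JOHN\n",
    "1235 71010 31.00 0.00\n",
    "   N347\n",
    "1235 70450 97.30 35.76\n",
    "ADJ TO TOTALS: NET 35.76\n",
    "TRAILING LINE\n",
);

my (@record, $wanted, @output);

for (@input) {
    # In scalar context the range operator returns a sequence number,
    # with "E0" appended on the line that ends the range.
    my $seq = /^NAME / ... /^ADJ TO TOTALS:/;
    next unless $seq;

    push @record, $_;

    # Flag the record; assumption: "interesting" means a detail line
    # that ends in 0.00.
    $wanted = 1 if /^\d.*\s0\.00\s*$/;

    # Keep collecting until the record is complete.
    next unless $seq =~ /E0$/;

    # The record is complete: keep all of it, or none of it.
    push @output, @record if $wanted;
    @record = ();
    $wanted = 0;
}

print @output;
```

Because the decision to print is deferred until the closing line has been seen, the whole record from NAME through ADJ TO TOTALS is emitted together, including any lines in between that do not themselves match.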

    Usually some combination of regex and split is going to work out to be more flexible, easier to write and easier to understand and maintain than using substr(). Substr will be the fastest, but that does not necessarily mean "best". I've got code with a solid 1-2 pages of substr but I needed it for the performance.

    Below, I used a regex that looks for lines with some number at the beginning and 0.00 at the end. The number at the beginning could be some huge number like what you have although I didn't see the need. Adjust to your requirements. Note that \s matches whitespace characters such as space, \t, \r and \n, so there is no need to "chomp" the line.

    As another piece of unsolicited advice: try to write code that is "flat", meaning that fewer levels of indentation == better. Think about how to reformulate things when you get to the 4th level of indentation.

    #!/usr/bin/perl -w
    use strict;

    my @data=();
    while (<DATA>) {
        if (my $flag_EOR = /NAME /.../ADJ TO TOTALS:/) {
            push (@data, $_);    #accumulates this record's data
            # add print "$flag_EOR\n"; to see what is happening...
            next unless $flag_EOR =~ /E0$/;
        }
        #print header/trailer and only the zero lines
        if (my @lines = grep{/^\d+.*\s*0\.00\s*$/}@data) {
            print $data[0];      # header of record
            print @lines;        # lines that start with numbers and
                                 # end with 0.00
            print $data[-1];     # trailer of record
        }
        @data=();
    }

    =prints
    I manually chopped lines down to prevent word wrap

    NAME DOE, JOHN HIC 1111111111 ...blah...
    12351141821118 111809 23 001 71010 ... 0.00 CO-18 31.00 0.00
    12351141821118 111809 23 001 74150 ... 0.00 CO-18 199.00 0.00
    12351141821118 111809 23 001 72192 ... 0.00 CO-18 182.00 0.00
    ADJ TO TOTALS: PREV PD INTEREST 0.00 LATE FILING CHARGE 0.00 NET 84.25
    =cut

    __DATA__
    REND PROV SERV DATE POS NOS PROC MODS BILLED ALLOWED DEDUCT COINS GRP/RC AMT PROV PD
    ________________________________________________________________________________________________________________________________
    NAME DOE, JOHN HIC 1111111111 ACNT 1111111 ICN 1111111111111 ASG Y MOA MA01 MA18
    12351141821118 111809 23 001 71010 26 31.00 0.00 0.00 0.00 CO-18 31.00 0.00
    N347
    12351141821118 111809 23 001 70450 26 142.00 44.70 0.00 8.94 OA-45 97.30 35.76
    N265
    PR-2 8.94
    12351141821118 111809 23 001 74150 26 199.00 0.00 0.00 0.00 CO-18 199.00 0.00
    N347
    12351141821118 111809 23 001 72192 26 182.00 0.00 0.00 0.00 CO-18 182.00 0.00
    N347
    12351141821118 111809 23 001 72131 26 195.00 60.61 0.00 12.12 OA-45 134.39 48.49
    N265
    PR-2 12.12
    PT RESP 21.06 CLAIM TOTALS 749.00 105.31 0.00 21.06 643.69 84.25
    ADJ TO TOTALS: PREV PD INTEREST 0.00 LATE FILING CHARGE 0.00 NET 84.25
    CLAIM INFORMATION FORWARDED TO : XXXXXX XXXXXXXX INSURANCE CO

      Thanks so much for your reply and explanation. My style and confusing code stems from the fact that I started teaching myself Perl a few months ago.

      Your code works great. The only problem I have is that it doesn't grab those N347 and N265 codes that are in between the lines. That is why I wanted to print the whole of NAME...ADJ TO TOTALS because it included those codes. Is there a way to include grabbing those N codes as part of the @lines array? I tried to grep those codes and add them to the print statements but they don't line up correctly (the lines they correspond to are directly above the code lines). That is why I thought of using the substring in order to test for the 0.00 condition and grab all the information. Do you think I need to add another if conditional or modify the @lines array to grab those N codes in the right spot?

        To get the next line after a line ending in 0.00,
        if (my @lines = grep{/^\d+.*\s*0\.00\s*$/}@data)
        change to:
        if (my @lines = grep{/^\d+.*\s*0\.00\s*$/..././}@data)
        Perl grep is a very powerful critter! What the above says is to filter lines from @data. If the condition in the grep is true, make a copy of the line in @lines. The expression in the grep means: true if we are in between a line that starts with a digit and ends with 0.00 and another line containing any character at all. The 3 dots mean that this "any character" has to be on a separate line. So this will print the line ending in 0.00 and the next line, whatever it has, which from your data format happens to be this N347 stuff! Pretty cool!
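To see the difference between two dots and three dots inside grep, here is a small sketch. The lines in `@data` are hypothetical, shortened stand-ins for one record, and the regex is a simplified form of the one above.

```perl
use strict;
use warnings;

# Hypothetical, shortened lines standing in for one record.
my @data = (
    "1235 71010 31.00 0.00\n",
    "   N347\n",
    "1235 70450 97.30 35.76\n",
    "   N265\n",
    "1235 74150 199.00 0.00\n",
    "   N347\n",
);

# Two dots: the right-hand test runs on the same line that opened the
# range, so /./ closes it immediately and only the 0.00 lines survive.
my @two_dots = grep { /^\d.*\s0\.00\s*$/ .. /./ } @data;

# Three dots: the right-hand test waits for the next line, so each
# 0.00 line brings along the line that follows it.
my @three_dots = grep { /^\d.*\s0\.00\s*$/ ... /./ } @data;

print "two dots:\n", @two_dots, "three dots:\n", @three_dots;
```

Here the two-dot form keeps two lines and the three-dot form keeps four. Note that each range operator keeps its own state between calls, so if a record ended on a matching line, the open range would carry over into the next grep.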

        As like before, I deleted some of the characters so that the output wouldn't word wrap.

        NAME DOE, JOHN ..... ASG Y MOA MA01 MA18
        12351141821118 ..... CO-18 31.00 0.00
        N347
        12351141821118 ..... CO-18 199.00 0.00
        N347
        12351141821118 111809 CO-18 182.00 0.00
        N347
        ADJ TO TOTALS: PREV PD INTEREST ... 84.25
        I wish you well on your Perl learning adventure! Perl is certainly not considered a beginning language. So you are starting in a hard place. You will need a number of books, perhaps start with "Learning Perl".
