PerlMonks  

Record Separator affecting Regex

by Kozz (Friar)
on Nov 07, 2002 at 18:38 UTC

Kozz has asked for the wisdom of the Perl Monks concerning the following question:

Most esteemed monks:

I'm wrestling with this bit of code... I have modified the record separator, but I think it's interfering with my replacement regex. This should be darn simple, but I can't make it work. It's quite a simple concept: make the record separator ";\n" and filter-out all lines that start with a # -- comments.

#!/usr/bin/perl
$/ = ";\n";
while (my $line = <DATA> ){
    $line =~ s/^#[^\r\n]*//g; # get rid of any comments
    print "Query: $line\n";
}
__DATA__
# one comment
# two comment
# another comment
insert into table_name values(1, 'testing 1 2 3');
# more comments
insert into table_name values (2, 'test &#149;');

Yes, the file I'm reading is SQL stuff. But I'm not directly importing it with mysql tools because I cannot - they're not available. So I'm doing it the "hard way". I think that perhaps I have to do a "local $/" inside the while loop to change the rec_sep which affects my regular expression, but I'm not sure. I want to make sure that neither a too-lenient record separator will mangle the second insert (which contains a semicolon in a value), nor will the comment-deleting line mangle the second insert which also contains a pound-symbol (octothorpe).

What the heck am I overlooking? It doesn't take care of all the comments, just the stuff anchored to the beginning of the entire string. I must be using anchors wrong, or using the pattern modifiers incorrectly. I've monkeyed with them but with no luck. I thought I had regex basics whipped, but clearly I don't. I feel so humbled.
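
To see why only the first comment disappears, it can help to look at what each read actually returns once `$/` is `";\n"`. The sketch below is an illustration, not part of the original script: it reads the same sample text through an in-memory filehandle instead of `__DATA__`, and `read_records` is a hypothetical helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample input mirroring the data in the question.
my $sql = join '',
    "# one comment\n",
    "# two comment\n",
    "# another comment\n",
    "insert into table_name values(1, 'testing 1 2 3');\n",
    "# more comments\n",
    "insert into table_name values (2, 'test &#149;');\n";

# read_records() is a hypothetical helper: it returns every record
# that readline yields while $/ is set to ";\n".
sub read_records {
    my ($text) = @_;
    local $/ = ";\n";    # restored automatically when the sub returns
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    my @records = <$fh>;
    close $fh;
    return @records;
}

my @records = read_records($sql);
printf "%d records\n", scalar @records;
for my $i (0 .. $#records) {
    (my $visible = $records[$i]) =~ s/\n/\\n/g;    # make newlines visible
    print "record $i: $visible\n";
}
```

With this sample the read yields two records, and each one starts with its comment lines followed by the statement. A pattern anchored with a plain `^` can therefore only see the first `#` of each record. The semicolon inside `'test &#149;'` does not end a record, because it is not followed by a newline.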

Replies are listed 'Best First'.
Re: Record Separator affecting Regex
by Mr. Muskrat (Canon) on Nov 07, 2002 at 18:56 UTC

    If you really need to redefine the record separator, you could do something like this:

    #!/usr/bin/perl
    use strict;
    use warnings;
    $/ = ";\n";
    {
        local $/ = "\n";
        while (my $line = <DATA> ){
            $line =~ s/^#[^\r\n]*//g; # get rid of any comments
            print "Query: $line\n" if ($line !~ /^\s+$/);
        }
    }
    __DATA__
    # one comment
    # two comment
    # another comment
    insert into table_name values(1, 'testing 1 2 3');
    # more comments
    insert into table_name values (2, 'test &#149;');

    Output:

    Query: insert into table_name values(1, 'testing 1 2 3');
    Query: insert into table_name values (2, 'test &#149;');

      As a note, there's a way to check for "lines which are composed only of spaces" that I find much more succinct:
      print "Query: $line" if ($line =~ /\S/);
      Or more generically:
      next unless ($line =~ /\S/);
      print "Query: $line";
      "Not composed entirely of spaces" is equivalent to "contains a non-space character", at least in this context.
Re: Record Separator affecting Regex
by Bird (Pilgrim) on Nov 07, 2002 at 19:09 UTC

    I think what you're looking for is the /m modifier. This allows ^ and $ to match at the start and end of each line inside a multiline string, not just at the start and end of the whole string. Since it appears you need to worry about multiline queries (otherwise, why are you modifying the record separator in the first place?), I changed your data to include one.

    $/ = ";\n";
    while (my $line = <DATA> ){
        $line =~ s/^#[^\r\n]*//mg;
        print "Query: $line\n";
    }
    __DATA__
    # one comment
    # two comment
    # another comment
    insert into table_name values(1, 'testing 1 2 3');
    # more comments
    insert into table_name values (2, 'test &#149;');
    ...gives...
    Query: insert into table_name values(1, 'testing 1 2 3');
    Query: insert into table_name values (2, 'test &#149;');

    You could also add $line =~ s/^\s*$//mg; if you want to get rid of some of those blank lines.
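
A side-by-side comparison may make the effect of /m easier to see. This is a minimal sketch rather than code from the thread: `strip_comments` is a hypothetical helper, and the record is made-up sample data with two comment lines in front of a two-line statement.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One record as it would arrive with $/ = ";\n" (sample data, assumed).
my $record = "# one comment\n"
           . "# two comment\n"
           . "insert into table_name\n"
           . "  values (2, 'test &#149;');\n";

# strip_comments() is a hypothetical helper; the second argument
# selects whether the substitution uses the /m modifier.
sub strip_comments {
    my ($text, $multiline) = @_;
    if ($multiline) {
        $text =~ s/^#[^\r\n]*//mg;    # ^ matches at the start of every line
    }
    else {
        $text =~ s/^#[^\r\n]*//g;     # ^ matches only at the start of the string
    }
    return $text;
}

print "without /m:\n", strip_comments($record, 0), "\n";
print "with /m:\n",    strip_comments($record, 1), "\n";
```

Without /m only the first comment is removed, which matches the symptom in the question. With /m both comment lines go, while the `#` inside `'test &#149;'` survives because it does not sit at the start of a line.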

    -Bird
    Update: I like insensate's solution for removing the blank lines better. Use that one. :)
      Yeah, I think you have the right idea here, Bird. Better than mine, under the circumstances.

      I'd probably rework it as follows, to get rid of all that extra leading whitespace:

      local $/ = ";\n";
      while ( <> ) {
          s/^#.*$//mg; # kill comments
          s/^\s+//; # kill all remaining leading whitespace
          print "Query: $_\n";
      }
      You're absolutely right -- the /m modifier was *exactly* what I needed, and yes, I do have multi-line commands, like create table statements and such. Thanks!
Re: Record Separator affecting Regex
by insensate (Hermit) on Nov 07, 2002 at 19:11 UTC
    Would something like this suffice? The /m modifier lets the ^ metacharacter match next to a newline in your multiline $line value.

    #!/usr/bin/perl
    $/ = ";\n";
    while (my $line = <DATA> ){
        $line =~ s/^#.*[\n\r]*//gm;
        print "Query: $line" unless $line=~/^\s+$/;
    }
    __DATA__
    # one comment
    # two comment
    # another comment
    insert into table_name values(1, 'testing 1 2 3');
    # more comments
    insert into table_name values (2, 'test &#149;')

    OUTPUT:

    Query: insert into table_name values(1, 'testing 1 2 3');
    Query: insert into table_name values (2, 'test &#149;');
Re: Record Separator affecting Regex
by jdporter (Paladin) on Nov 07, 2002 at 19:13 UTC
    The problem is that comments and statements are terminated by two different things, and you can expect to see the two types of elements intermingled.

    What I would do, if the file isn't grotesquely huge, is read it all into $_, remove the comments, and then split.

    $_ = do { local $/; <DATA> }; # slurp the whole file; with $/ left alone, <DATA> returns only the first line
    s/^#.*//gm; # updated
    my @lines = split /;\n+/;
      Yes, this would have been much easier, but unfortunately, the file is 23MB uncompressed. :(

      But thanks for the info - that's another way I might have tried had the file been smaller.

Re: Record Separator affecting Regex
by insensate (Hermit) on Nov 07, 2002 at 19:45 UTC
    I'm not sure what the output file will be used for... I'm an Oracle guy and routinely create files to be executed by a sqlloader/sqlplus application. In this context, simple statements sometimes need to be executed that don't require semicolons, just newlines, such as:

    set lines 32
    set pages 0
    set feedback off

    If this is the case for you as well, make sure that your script doesn't end up mangling these statements in an undesirable fashion.
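
The risk can be shown with a small sketch. The input is hypothetical (three settings followed by one statement), and `read_records` is an illustrative helper that reads through an in-memory filehandle with `$/` set to `";\n"`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input: settings that end in a newline only,
# followed by one semicolon-terminated statement.
my $script = "set lines 32\n"
           . "set pages 0\n"
           . "set feedback off\n"
           . "select 1 from dual;\n";

# read_records() is a hypothetical helper that returns the records
# readline produces while $/ is ";\n".
sub read_records {
    my ($text) = @_;
    local $/ = ";\n";
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    my @records = <$fh>;
    close $fh;
    return @records;
}

my @records = read_records($script);
printf "%d record(s)\n", scalar @records;
print "--- record ---\n$_" for @records;
```

All four lines come back as a single record, so the settings would be sent along as part of the following statement unless the script splits them off first.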
Re: Record Separator affecting Regex
by dingus (Friar) on Nov 08, 2002 at 07:42 UTC
    ... This should be darn simple, but I can't make it work. It's quite a simple concept: make the record separator ";\n" and filter-out all lines that start with a # -- comments.

    First off: are you sure the current record separator is '\n'? If this is running on a Windows machine it is '\r\n' (or do I mean '\n\r'? who cares). In any case, the safe way to get the new separator is to do (with local if required)

    $/ = ';' . $/;

    But actually I think you should not be monkeying with $/ at all, and should just skip over # lines as you read them in, i.e.

    while (my $line = <DATA> ){
        next if ($line =~ /^#/); # skip comment lines
        print "Query: $line\n";
    }
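
Reading line by line prints each physical line as its own query, which would break statements that span several lines. One possible extension, offered only as a sketch under assumptions, is to keep the default `$/`, skip comment lines, and join the remaining lines until one ends in a semicolon. `each_statement` and its callback interface are hypothetical names, and the sample SQL is made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# each_statement() is a hypothetical helper. It reads ordinary lines,
# skips those that start with '#', and hands each complete statement
# to $callback, so only one statement is held in memory at a time.
sub each_statement {
    my ($fh, $callback) = @_;
    my $current = '';
    while (my $line = <$fh>) {
        next if $line =~ /^#/;                      # skip comment lines
        next if $current eq '' && $line !~ /\S/;    # skip blank lines between statements
        $current .= $line;
        if ($line =~ /;\s*$/) {                     # a trailing semicolon ends the statement
            $callback->($current);
            $current = '';
        }
    }
    $callback->($current) if $current =~ /\S/;      # unterminated tail, if any
    return;
}

# Made-up sample with a multi-line statement and an entity in a value.
my $sql = "# one comment\n"
        . "create table t (\n"
        . "  id int\n"
        . ");\n"
        . "# more comments\n"
        . "insert into t values (2, 'test &#149;');\n";

open my $fh, '<', \$sql or die "Cannot open in-memory file: $!";
each_statement($fh, sub { print "Query: $_[0]" });
close $fh;
```

Because the terminator test allows trailing whitespace, a carriage return before the newline does not prevent a match. Like the other approaches in this thread, the sketch would misread a line that begins with `#` inside a multi-line string literal.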

    Dingus


    Enter any 47-digit prime number to continue.
