ericdp has asked for the wisdom of the Perl Monks concerning the following question:

I've created a large (~3 million rows) file. Well, it's off by 22 rows from what it should be. In its creation I exclude carriage returns and line feeds.
SELECT ... SUBSTR ( REPLACE ( REPLACE ( memo_system_txt, CHR(10), '' ), CHR(13), '' ), 1, 255 ) AS memo_sys_txt, ...
But something is getting through. I'd like to find out what. I think it might be a tab character, but I am not sure. So I thought I'd parse through my file and see if I can find it. .... ok, this is where I need help. :) Am I on the right track?
#!/usr/bin/perl
use strict;
use warnings;

open F, "memo.txt";
open X, ">memo.txt_error";

my $row;
my $line;
$row = 0;

while (<F>) {
    chomp;
    $line = $_;
    $row++;

    # Look for any lines of the wrong length
    # if ( length $line != 2340 )
    # {
    #     printf X "%u : ||%s||\n", $row, $line;
    # }

    # Find any control characters
    if ( $line =~ /[[:cntrl]]/ ) {
        my $c = $1;
        printf X "%u : ||%s||\n", $row, $line;
        printf X "\t : ||%s||\n", ORD($c);
        print X "---------------------------";
    }
}

close X;
close F;
Getting:
printf() on unopened filehandle X at ./a2.pl line 20, <F> line 150.
Can't call method "ORD" without a package or object reference at ./a2.pl line 21, <F> line 150
What did I get wrong? I'd like to find the numeric value (e.g. ORD(tab) is 9) and where in the line the character is located. Thanks for your help. Eric

Edit: g0n - replaced pre tags with code tags

Replies are listed 'Best First'.
Re: Searching for control characters
by holli (Abbot) on Feb 07, 2006 at 20:33 UTC
    replace open X, ">memo.txt_error"; with open X, ">memo.txt_error" or die "cannot open (>) memo.txt_error: $!"; and see what happens.


    holli, /regexed monk/
Re: Searching for control characters
by Fletch (Bishop) on Feb 07, 2006 at 20:34 UTC

    a) you don't check the return value of your open calls. 2) Perl is case sensitive; ORD is not ord. III) Your regex has no capturing parens so $1 won't contain anything.

      a) Yes, I did forget that. But it does work. For better style and conformity I should have it.
      b) Yes, I misspelled it here, but it is all caps in my script.
      c) Don't the parens around [:cntrl:] capture? Or did I read that wrong in the docs?
      if ( $line =~ /([[:cntrl:]])/ ) { my $c = $1;
      Or should I have gone about this differently?
      Thanks. Eric

        Your code runs, but you have no idea that the open failed and you blithely continue on until you get the error about trying to print to an unopened handle.

        Next, the translation: "Yes, I misspelled it here, but it is all caps (and still wrong) in my script." Again, ORD is not ord. CASE MATTERS.

        And that code there is different than your initial example (but yes, that's correct and should put something meaningful in $1).

        oops. ... dumb me ... "ord" not "ORD".
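The points raised in this sub-thread (capturing parentheses, lowercase ord, and the original wish to know where in the line the character sits) can be put together in a minimal sketch. The sub name find_ctrl and the sample string are made up for illustration and are not from the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: return a list of [offset, ord] pairs, one per
# control character found in the given string (offsets are 0-based).
sub find_ctrl {
    my ($line) = @_;
    my @hits;
    # The parentheses capture the matched character into $1. With /g in
    # scalar context, pos() is the offset just past the current match.
    while ( $line =~ /([[:cntrl:]])/g ) {
        push @hits, [ pos($line) - 1, ord($1) ];
    }
    return @hits;
}

# Illustrative input: a tab at offset 3 and Ctrl-C at offset 7.
for my $hit ( find_ctrl("abc\tdef\x03g") ) {
    printf "offset %d : ord %d\n", @$hit;
}
```

Note that a newline is itself a control character, so a line read from a file should be chomped before it is passed in, as the original script already does.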
Re: Searching for control characters
by graff (Chancellor) on Feb 07, 2006 at 23:29 UTC
    If I understand your task, you could just do it this way:
    #!/usr/bin/perl

    use strict;
    use warnings;

    while (<>) {
        my @ctrl = ( /([[:ctrl:]])/g );
        if ( @ctrl ) {
            my $cstr = join ",", map { sprintf "x%2.2x",ord() } @ctrl;
            s/([[:ctrl:]])/sprintf("||%x2.2x||",$1)/eg;
            printf( "line %d: ctrl-chars %s in <<%s>>\n", $., $cstr, $_ );
        }
    }
    Run it like this:
    script_name memo.txt > memo.errs
    It'll behave a little differently from your version: if a single line contains two or more control characters, this will show all of them (e.g. "x13,x01,x11"), instead of showing just the first one.

    That could also be done as a one-liner (using "perl -lne"), but then you might have shell quoting issues that I don't want to get into...

    (UPDATE: Added a line of code inside the "if(@ctrl)" block, to mark the locations in the data line where the control characters occurred, as per OP's stated intention.)

      hmmm... I try this and get:
      POSIX class :ctrl: unknown before HERE mark in regex m/([:ctrl: << HERE ])/
      
      Is my perl compiled differently to cause this error? I'm in a ksh88 environment on HP-UX.

      Thanks for help. This looks like a better idea.
      Eric
        Ah. supposed to be [[:cntrl:]] :)

        but now getting
        ...
        Argument "\n" isn't numeric in sprintf at ./a3.pl line 10, <> line 2326.
        ...
        
        So I changed the sprintf to use %s and ord $1; otherwise it can't seem to print ^I or ^F type characters.
        Thanks for the help

        Eric
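Pulling together the two fixes found in this sub-thread ([[:cntrl:]] instead of [[:ctrl:]], and ord($1) inside the substitution), a hedged sketch of the same approach might look like the following. It is wrapped in a sub named mark_ctrl, which is a hypothetical name chosen here, and it chomps the line first on the assumption that the trailing newline should not be reported.

```perl
use strict;
use warnings;

# Hypothetical helper: given one line, return (codes, marked), where codes
# is a comma-separated list such as "x09,x03" and marked is the line with
# each control character replaced by a visible ||xNN|| marker.
# Returns an empty list when the line has no control characters.
sub mark_ctrl {
    my ($line) = @_;
    chomp $line;    # "\n" is itself a control character, so drop it first
    my @ctrl = $line =~ /([[:cntrl:]])/g;
    return unless @ctrl;
    my $codes = join ",", map { sprintf "x%2.2x", ord($_) } @ctrl;
    ( my $marked = $line ) =~ s/([[:cntrl:]])/sprintf("||x%2.2x||", ord($1))/eg;
    return ( $codes, $marked );
}

# Illustrative input only.
my ( $codes, $marked ) = mark_ctrl("abc\tdef\x03g\n");
printf "ctrl-chars %s in <<%s>>\n", $codes, $marked;
```

In a real run the sub would be called once per input line inside a while (<>) loop, with $. supplying the line number as in the script above.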
Re: Searching for control characters
by GrandFather (Saint) on Feb 07, 2006 at 20:35 UTC

    Check that the open succeeded with open X, '>', 'memo.txt_error' or die "Open output file failed: $!"; (the three parameter version of open is preferred).

    ord is the function you want rather than ORD - case is important in Perl.
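As a small illustration of the checked, three-argument open described above, here is a sketch using a lexical filehandle. The sub name open_for_write and the demo path are assumptions made for this example only.

```perl
use strict;
use warnings;

# Hypothetical helper: open a file for writing, or die with the reason.
sub open_for_write {
    my ($path) = @_;
    # Three-argument open with a lexical handle. "or" has lower precedence
    # than the list operator, so die runs only when open itself fails.
    open my $fh, '>', $path
        or die "cannot open (>) $path: $!";
    return $fh;
}

# Illustrative use: the directory below is assumed not to exist, so the
# failure is reported at the open rather than at a later print.
my $fh = eval { open_for_write("no_such_dir_for_demo/memo.txt_error") };
print "open failed: $@" unless $fh;
```

With the check in place, the failure shows up with the system's reason in $! instead of surfacing later as "printf() on unopened filehandle".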


    DWIM is Perl's answer to Gödel
      Thanks much for the help. It does work.
      #!/usr/bin/perl
      use strict;
      use warnings;

      open F, '<', 'notes.txt_error' or die "Open output file failed: $!";

      my $row;
      my $line;
      $row = 0;

      while (<F>) {
          chomp;
          $line = $_;
          $row++;

          if ( $line =~ /([[:cntrl:]])/ ) {
              my $c = $1;
              printf "%u : ||%s||\n", $row, $line;
              printf "\t : ||%s||\n", ord ($c);
              print "---------------------------";
          }
      }

      close F;
      produces
      150 : || <long line> ||
               : ||3||
      ---------------------------
      206 : || <long line> ||
               : ||9||
      ---------------------------
      317 : || <long line> ||
               : ||28||
      ---------------------------
      12878 : || <long line> ||
               : ||24||
      ---------------------------
      
      Now to find out how these odd control characters got into my data. But that is a different story.
      Thanks much. Eric
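For reading the numbers in that output, one option is caret notation (the ^I style mentioned earlier in the thread). The following is only a sketch; the sub name caret is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: render a control-character code in caret notation,
# e.g. 9 becomes "^I". XOR with 64 maps codes 0..31 onto "@".."_"
# and 127 (DEL) onto "?".
sub caret {
    my ($code) = @_;
    return sprintf "^%s", chr( $code ^ 64 );
}

# The four codes reported in the output above.
printf "%3d => %s\n", $_, caret($_) for 3, 9, 28, 24;
```

In ASCII terms, 3 is ETX (^C), 9 is the horizontal tab (^I), 28 is FS (^\) and 24 is CAN (^X), so the tab suspected in the original question accounts for only one of the four rows shown.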