Bemer14 has asked for the wisdom of the Perl Monks concerning the following question:

What I am trying to do: I have a .txt document that has characters like the following staggered throughout the file:

ÉÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ»

º Some text in here º

º Some text in here º

º Some text in here º


ÇÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ

I am trying to take this .txt file, delete all of these characters, and reformat the text so that the text between the two º characters is on a separate line (I am further planning on putting each line into HTML, but that's not a problem).
I am pretty sure I can do this in Perl, but I keep having trouble getting past the first line of the file. I know I will have to do some sort of s/// or set up a m//, but I have tried and it does not notice these characters. I am opening the file with read & write! OK, my question: will Perl acknowledge these characters, or am I barking up the wrong tree, or further, am I just tripping over the tree?

Replies are listed 'Best First'.
Re: Text Manipulation
by Rich36 (Chaplain) on Jan 12, 2002 at 02:56 UTC
    Getting the data between the º characters isn't a problem in Perl. The thing I had to do to figure this out was to find the character code for º, which is 186 (strictly speaking it lies outside 7-bit ASCII). Using that code, you can match the character and use it to get your data. The following example captures everything between the º characters on each line and pushes it onto a new array.
    #!/usr/bin/perl -w
    use strict;

    my $file = "test.txt";    # Text file to open
    my @newtext;              # Array to hold target data
    my $char = chr(186);      # Character code 186 (º)

    # Open the file
    open(FILE, "<$file") || die "Couldn't open $file: $!\n";

    # If a line contains text between two º characters,
    # push that text onto an array.
    while (<FILE>) {
        if (/$char(.+)$char/) {
            push(@newtext, $1);
        }
    }
    close FILE;

    # Print out the results
    foreach (@newtext) { print "$_\n"; }
    See chr for how that function works.
    Rich36
    There's more than one way to screw it up...

(cLive ;-) Re: Text Manipulation
by cLive ;-) (Prior) on Jan 12, 2002 at 04:19 UTC
    Let's be a bit more general and a bit more specific. I looked at the data under DOS and saw it was a box :), so each line is either nearly all non-word chars (top/bottom), or has one on either side of the content you want. So:
    # clean lines with words in
    $text =~ s/.*?               # anything
               (?:               # don't remember me
                 (\w.*\w)        # everything from first to last word character
                                 # on the line - remember this
               |                 # or
                 [^\w\s]{6,}     # at least 6 characters that aren't words or
                                 # spaces (in a row) - forget me
               )
               .*                # the rest of the line
              /$1/gx;            # replace whole line with (\w.*\w) match,
                                 # if made.

    Now, if your text contains punctuation, you might want to amend the (\w.*\w) match to take account of commas, etc. that may appear at the end of a line:

    (\w.*\w) => ([\w\.,].*[\w\.,])
    if text contained only commas and periods.
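
To try the amended pattern on a small sample, here is a minimal sketch. The helper name `clean_box` and the sample data are made up for illustration. The `/a` modifier (Perl 5.14 or later) is an addition so that `\w` stays ASCII-only and the box bytes are never treated as word characters, and the `/e` replacement is an addition that avoids an "uninitialized" warning when a border line matches without capturing anything.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not from the thread: apply the substitution to a
# whole string and return the lines of text that survive.
sub clean_box {
    my ($text) = @_;
    $text =~ s{ .*?                       # anything
                (?: ([\w.,] .* [\w.,])    # text, punctuation allowed at the ends
                  | [^\w\s]{6,}           # or a run of border characters
                )
                .*                        # the rest of the line
              }{ defined $1 ? $1 : '' }gexa;
    # Border lines are now empty strings, so drop them.
    return grep { length } split /\n/, $text;
}

# Sample box built from raw byte values (assumed to match the poster's file).
my $sample = join "\n",
    "\xC9" . ("\xCD" x 10) . "\xBB",
    "\xBA Some text in here \xBA",
    "\xBA More text, here. \xBA",
    "\xC7" . ("\xC4" x 10);

print "$_\n" for clean_box($sample);
```

Building the sample with `\x` escapes keeps the script itself pure ASCII, so it does not depend on how the source file is saved.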

    cLive ;-)

    ps - IANAL (I am not a lecythis)

Re: Text Manipulation
by metadoktor (Hermit) on Jan 12, 2002 at 02:56 UTC
    What sort of substitution or matching command are you using? It should work if you use octal codes in your search pattern. Like
     s/\361//g 
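
Note that `\361` is only an example code (it is byte 241). The º character discussed in this thread is decimal 186, which is `\272` in octal. A minimal sketch of the octal approach, assuming the file holds single-byte characters; the helper name `strip_bars` is made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: remove every byte 186 (octal 272) from a line,
# then trim the leftover whitespace at both ends.
sub strip_bars {
    my ($line) = @_;
    $line =~ s/\272//g;          # octal escape, same byte as chr(186)
    $line =~ s/^\s+|\s+$//g;     # trim leading and trailing whitespace
    return $line;
}

# sprintf "%o" converts a decimal character code to its octal form.
printf "decimal 186 is octal %o\n", 186;

print strip_bars("\272 Some text in here \272"), "\n";
```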

    metadoktor

    "The doktor is in."

Re: Text Manipulation
by YuckFoo (Abbot) on Jan 12, 2002 at 03:48 UTC
    Your description indicates several possibilities. Posting some code would be useful.

    You open the file read and write. You might be truncating the file before you can read from it.
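
We don't know which open mode the original poster used, so the following is only a sketch of the difference. In Perl, `+<` opens an existing file for reading and writing and leaves its contents alone, while `+>` truncates the file first. The helper name and the temporary file are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: open $path with the given mode and report how
# many lines can be read from it.
sub count_lines_with_mode {
    my ($mode, $path) = @_;
    open(my $fh, $mode, $path) or die "Couldn't open $path: $!\n";
    my @lines = <$fh>;
    close $fh;
    return scalar @lines;
}

# Create a throwaway two-line file to experiment on.
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "line one\nline two\n";
close $tmp;

print "+< sees ", count_lines_with_mode('+<', $path), " line(s)\n";
print "+> sees ", count_lines_with_mode('+>', $path), " line(s)\n";
```

The second call finds nothing to read because the file was emptied at the moment it was opened.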

    Your having trouble getting past the first line might also indicate a premature end of file, which can happen when reading binary files on a Windoze OS. You'll want to use the binmode function to read binary files. See 'perldoc -f binmode'.
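
A minimal sketch of reading with binmode follows. As I understand it, text mode on DOS-like systems can treat a Ctrl-Z byte (0x1A) as end of file. Whether the poster's file contains such a byte is an assumption, and on Unix-like systems binmode makes no difference here. The helper name `slurp_raw` and the sample data are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a whole file as raw bytes.
sub slurp_raw {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Couldn't open $path: $!\n";
    binmode($fh);    # no newline translation, no special end-of-file byte
    local $/;        # slurp mode: read everything in one go
    my $data = <$fh>;
    close $fh;
    return $data;
}

# Write a throwaway file with a Ctrl-Z byte in the middle.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode($tmp);
print {$tmp} "first line\r\n\x1Asecond line\r\n";
close $tmp;

my $data = slurp_raw($path);
printf "read %d bytes\n", length $data;
```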

    Also, editing the file in place is prone to error. Some code demonstrating how you are doing this would be helpful.
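
One common way to avoid editing in place is to write the cleaned lines to a temporary file and rename it over the original only once everything has been written. This is a sketch of that general approach, not something proposed in the thread. The helper name `rewrite_file`, the callback, and the sample data are made up for illustration, and how rename treats an existing target can vary by platform.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use File::Basename qw(dirname);

# Hypothetical helper: pass each line of $path through $transform and
# replace the file with the result. Lines for which the callback
# returns undef are dropped.
sub rewrite_file {
    my ($path, $transform) = @_;
    open(my $in, '<', $path) or die "Couldn't open $path: $!\n";
    binmode($in);
    # Keep the temporary file in the same directory so rename stays on one volume.
    my ($out, $tmp_path) = tempfile(DIR => dirname($path));
    binmode($out);
    while (my $line = <$in>) {
        my $new = $transform->($line);
        print {$out} $new if defined $new;
    }
    close $in;
    close $out or die "Couldn't finish writing $tmp_path: $!\n";
    rename($tmp_path, $path) or die "Couldn't replace $path: $!\n";
    return;
}

# Keep only the text between two byte-186 characters on each line.
my $keep_boxed = sub {
    my ($line) = @_;
    return $line =~ /\xBA\s*(.*?)\s*\xBA/ ? "$1\n" : undef;
};

# Throwaway sample file standing in for the poster's .txt document.
my ($fh, $path) = tempfile(UNLINK => 1);
binmode($fh);
print {$fh} "\xC9\xCD\xCD\xBB\n\xBA Some text in here \xBA\n";
close $fh;

rewrite_file($path, $keep_boxed);
```

If anything fails before the rename, the original file is still intact.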

    YuckFoo

Re: Text Manipulation
by archen (Pilgrim) on Jan 12, 2002 at 07:42 UTC
    As YuckFoo said, give binmode a look if you're on Win32. I've been surprised quite a few times by my scripts mysteriously quitting in the middle of a file just because I didn't use binmode. It's not a problem with straight-and-narrow text files, but it seems like you're dealing with some weird characters, so that could be a problem.