PerlMonks  

Text File Parsing.

by Anonymous Monk
on Nov 24, 2003 at 20:50 UTC

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am sure this is a fairly simple question, but I am new to Perl, and could use some help.

I have a text file (test.txt) that needs to be parsed: leading whitespace removed, the text properly formatted, and the result written to an output file.

Here is a snippet of the original file:

Page 1 Analyze at 24-Nov-03 13:22:38 + 270 Mb Component + 625 lines + Video +Present + + 10 + Bit Video
The text file is "cascading". I need it to appear more like this:
Page 1 Analyze at 24-Nov-2003 13:22:38 270 Mb Component 625 lines Video Present 10 Bit Video ...
Any help would be greatly appreciated.

Thank you.

Replies are listed 'Best First'.
Re: Text File Parsing.
by Roy Johnson (Monsignor) on Nov 24, 2003 at 21:13 UTC
    That looks like a file that has just linefeeds instead of CR/LF (\r\n) pairs.

    perl -pe "s/^\s+//" file > newfile
    might fix your problem, if it's really leading spaces.

    The PerlMonk tr/// Advocate
      It looks to me, too, like the file contains line feeds but no carriage returns, like a UNIX file. UNIX files use line feeds as line endings; classic Mac systems use carriage returns without line feeds; and DOS systems use carriage return-line feed combinations. This is to help minimize the chance that any software or data will be transferred between systems<grin>. I should think that changing the '0a' characters (if this is indeed the trouble) to '0d0a' sequences would do the job. (Where 0d is 13 and 0a is 10; I'm just used to talking in hex.)
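A minimal sketch of the '0a' to '0d0a' conversion suggested above, assuming the whole file fits in memory as one string. The helper name lf_to_crlf and the sample text are made up for illustration; existing CR/LF pairs are left alone so the conversion can be run twice without doubling the CRs.

```perl
use strict;
use warnings;

# lf_to_crlf is a hypothetical helper name, not something from the thread.
# It turns every bare LF (0a) into CR LF (0d0a) and skips LFs that already
# follow a CR.
sub lf_to_crlf {
    my ($text) = @_;
    $text =~ s/(?<!\x0d)\x0a/\x0d\x0a/g;
    return $text;
}

# Made-up sample mixing bare LF and CR LF endings.
my $sample = "Page 1\x0aAnalyze\x0d\x0a270 Mb\x0a";
my $fixed  = lf_to_crlf($sample);

# Make the control characters visible before printing.
(my $visible = $fixed) =~ s/\x0d/<CR>/g;
$visible =~ s/\x0a/<LF>\n/g;
print $visible;
```

When applying this to a real file, it is probably safest to open both handles with the ':raw' layer (or call binmode), because Perl on Windows otherwise translates line endings itself during reads and writes.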
Re: Text File Parsing.
by duff (Parson) on Nov 24, 2003 at 21:05 UTC
    Leading whitespace removal is easy; that's just s/^\s+//;. As for "proper formatting", it's up to you how you want it formatted.
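As a small illustration of that substitution, here is a sketch applied to a few made-up lines that imitate the "cascading" layout from the question. The helper name strip_leading is hypothetical. Because the pattern is anchored with ^ and has no /m modifier, only whitespace at the very start of each string is removed; spaces between words stay put.

```perl
use strict;
use warnings;

# strip_leading is a hypothetical helper name used only for this illustration.
sub strip_leading {
    my ($line) = @_;
    $line =~ s/^\s+//;    # remove whitespace at the start of the string only
    return $line;
}

# Made-up sample lines; the real file's exact spacing is not known.
my @lines = (
    "Page 1\n",
    "          Analyze at 24-Nov-03 13:22:38\n",
    "                    270 Mb Component\n",
);

print strip_leading($_) for @lines;
```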

Re: Text File Parsing.
by Anonymous Monk on Nov 24, 2003 at 21:04 UTC
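    # Strip leading and trailing whitespace from every line of test.txt
    open(IN, '<', 'test.txt') or die "Cannot read test.txt: $!";
    open(OUT, '>', 'out.txt') or die "Cannot write out.txt: $!";
    while (<IN>) {
        s/^\s+//;            # leading whitespace
        s/\s+$//;            # trailing whitespace, including the newline
        print OUT "$_\n";    # write the line back with a single newline
    }
    close(IN);
    close(OUT) or die "Cannot close out.txt: $!";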
Re: Text File Parsing.
by davido (Cardinal) on Nov 25, 2003 at 01:57 UTC
    If it's not just a CR/LF issue carried along with transferring a file from a DOS/Windows machine to a Unix machine without proper line-end conversion, then the leading whitespace theory must be accurate. As others have mentioned, you can deal with that with an s/^\s+// substitution.

    However, it also looks like there may be extra newlines in there as well. You apparently want only single newline characters at the end of each line, so that the text appears "single spaced".

    Update: ^\s+ will also wipe out lines that contain only "newline" characters: a desirable side-effect. However, after that, it may be helpful to completely eliminate elements from @array that have now become empty.
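A rough sketch of that cleanup, assuming @array holds the lines of the file with their newlines still attached (the post does not show how @array was filled). The helper name clean_lines and the sample data are hypothetical.

```perl
use strict;
use warnings;

# clean_lines is a hypothetical helper. It works on a copy of the lines,
# strips leading whitespace, and drops elements that end up empty.
sub clean_lines {
    my @array = @_;
    s/^\s+// for @array;              # a whitespace-only line becomes ''
    return grep { length } @array;    # keep only the non-empty elements
}

# Made-up sample: two real lines, one blank line, one line of spaces.
my @array = ("   Page 1\n", "\n", "      270 Mb Component\n", "   \n");

print clean_lines(@array);
```

The grep step matters only when the lines are kept in an array; in a line-by-line loop that prints $_ as is, an emptied line simply prints nothing.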


    Dave


    "If I had my life to live over again, I'd be a plumber." -- Albert Einstein
      ... from a DOS/Windows machine to a Unix machine ...

      You mean, from a unix machine to a dos/windows machine -- that's the direction in which this sort of symptom appears, because the dos machine needs both the CR and the LF to display the text properly, and will display text in a manner just like the OP showed if the CR is missing. Meanwhile, unix uses only the LF in its text files, but will usually display a dos file (with the "extra" CR next to each LF) intelligibly.

      The example in the OP was caused either by this issue, or else it results from trying to do something like an X-windows "select/paste" operation from an html browser window to some plain-text window, where the selected lines in the browser happen to be part of a <table>. No way to be sure, given the information originally provided, but the html-paste seems more likely, and removing whitespace is the way to go (rather than adding CR's).
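One way to tell the two explanations apart is to look at the raw bytes of the file. The sketch below is only an assumption about how one might check: the helper name count_symptoms is made up, and it takes the file contents as a single string. Many indented lines point to the whitespace theory, while bare LFs with no indentation point to the line-ending theory.

```perl
use strict;
use warnings;

# count_symptoms is a hypothetical helper. It tallies CR LF pairs, bare LFs,
# and lines that start with spaces or tabs followed by visible text.
sub count_symptoms {
    my ($raw) = @_;
    my %n;
    $n{crlf}     = () = $raw =~ /\x0d\x0a/g;
    $n{bare_lf}  = () = $raw =~ /(?<!\x0d)\x0a/g;
    $n{indented} = () = $raw =~ /^[ \t]+\S/mg;
    return \%n;
}

# Made-up sample imitating a cascading file with unix line endings.
my $raw = "Page 1\x0a      Analyze\x0a            270 Mb\x0a";
my $n   = count_symptoms($raw);
printf "%-9s %d\n", $_, $n->{$_} for sort keys %$n;
```

For a real file, the string would come from a handle opened with the '<:raw' layer and read in one go, so that no line-ending translation happens before the count.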

Re: Text File Parsing.
by Anonymous Monk on Nov 27, 2003 at 16:55 UTC
    Thank you to all those who replied.

    Both proposed solutions worked well, with slightly different outputs.

    Thanks again!

Node Type: perlquestion [id://309679]
Approved by chromatic
Front-paged by chromatic