I've a file that I need to parse. What I am trying to do, essentially, is this:

#!/usr/bin/perl
$file = "spg.txt";
open(SPG, $file) or die "Couldn't open $file: $!\n";
while (defined($line = <SPG>)) {
    $line =~ s/\s+/ /g;
    $line =~ s/^\s//g;
    my ($title, $start_date, $start_time, $end_date, $end_time, $status, $prixit) = split(/\s/, $line);
    print "$title $status\n";
}

That'd be pretty easy, but some lines in the file are "broken", so to speak. Take a look at the "spg-risk-ln_cdo_leg_synthetic" line below, for instance:

[vxp@vxp ~]$ cat spg.txt
spg-risk-ln-box 06/24/2009 21:14 06/24/2009 22:01 IN 39693696/0
spg-risk-Fixed_Sterling 06/24/2009 21:14 06/24/2009 21:15 IN 39693696/1
spg-risk-aaeml 06/24/2009 21:14 06/24/2009 21:15 IN 39693696/1
spg-risk-ln_abs_credit 06/24/2009 21:14 06/24/2009 21:15 IN 39693696/1
spg-risk-ln_abs_fixed 06/24/2009 21:14 06/24/2009 21:15 IN 39693696/1
spg-risk-ln_abs_fixed2 06/24/2009 21:14 06/24/2009 21:15 IN 39693696/1
spg-risk-ln_abs_flow 06/24/2009 21:14 06/24/2009 21:14 IN 39693696/1
spg-risk-ln_aol_abs 06/24/2009 21:14 06/24/2009 21:16 IN 39693696/1
spg-risk-ln_apcms 06/24/2009 21:14 06/24/2009 21:14 IN 39693696/1
spg-risk-ln_bouwfonds 06/24/2009 21:14 06/24/2009 21:15 IN 39693696/1
spg-risk-ln_caprub 06/24/2009 21:43 06/24/2009 21:45 IN 39693696/2
spg-risk-ln_capusd 06/24/2009 21:14 06/24/2009 21:16 IN 39693696/1
spg-risk-ln_cdo 06/24/2009 21:14 06/24/2009 22:00 IN 39693696/0
spg-risk-ln_cdo_leg_synthetic
 06/24/2009 21:14 06/24/2009 21:18 IN 39693696/0
spg-risk-ln_cdo_legacy 06/24/2009 21:14 06/24/2009 21:15 IN 39693696/1
spg-risk-ln_cmbs 06/24/2009 21:15 06/24/2009 21:16 IN 39693696/1
spg-risk-ln_cmbx 06/24/2009 21:15 06/24/2009 21:15 IN 39693696/1
spg-risk-ln_credit_fixed 06/24/2009 21:15 06/24/2009 21:15 IN 39693696/1
spg-risk-ln_credit_frn 06/24/2009 21:15 06/24/2009 21:15 IN 39693696/1
spg-risk-ln_euresi 06/24/2009 21:15 06/24/2009 21:17 IN 39693696/1
spg-risk-ln_fonspa 06/24/2009 21:15 06/24/2009 21:21 IN 39693696/1
spg-risk-ln_hyloans 06/24/2009 21:15 06/24/2009 21:15 IN 39693696/1
spg-risk-ln_ni 06/24/2009 21:15 06/24/2009 21:16 IN 39693696/1
spg-risk-ln_resid_rmbs 06/24/2009 21:15 06/24/2009 21:17 IN 39693696/1
spg-risk-ln_rmbs 06/24/2009 21:15 06/24/2009 21:17 IN 39693696/1
spg-risk-ln_swaps 06/24/2009 21:15 06/24/2009 21:22 IN 39693696/1
spg-risk-ln_synresi 06/24/2009 21:15 06/24/2009 21:16 IN 39693696/1
spg-risk-ln_synthetics 06/24/2009 21:15 06/24/2009 21:17 IN 39693696/1
spg-risk-ln_trefs 06/24/2009 21:15 06/24/2009 21:20 IN 39693696/1
spg-risk-ln_ukpurch 06/24/2009 21:15 06/24/2009 21:19 IN 39693696/0
spg-risk-ln_warehouse 06/24/2009 21:15 06/24/2009 21:16 IN 39693696/1
spg-risk-ln_abs_frn 06/24/2009 21:15 06/24/2009 21:17 IN 39693696/1
spg-risk-lnliq 06/24/2009 21:14 06/24/2009 21:18 IN 39693696/1
[vxp@vxp ~]$

Any ideas on what's needed to "fix" the file?

I can't just write a regex that matches the line starting with "spg-risk-ln_cdo_leg_synthetic" and "stitches" it to the next line. That would involve something like checking whether there is any data after the first column; if there isn't, store the first column as a hash key, then check the next line, and if it starts with a space, assign its contents as that key's value. That can be done technically, but it won't work as a solution, because I have thousands and thousands of these little files to parse. I can't possibly find all of the broken lines and write thousands and thousands of regexes. That's why I'm asking people here for a possibly "universal", so to speak, solution to this problem.

Thanks in advance! :)
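
One possible approach, offered as a sketch rather than a definitive fix: instead of matching particular titles, pool the whitespace-separated fields across physical lines and cut a record every time seven fields have accumulated. This assumes every complete record really has exactly seven fields (the seven variables in the split above), and that the "+" in the listing is only the display's line-wrap marker rather than data. The helper name stitch_records and the sample lines are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: a complete record always has exactly 7 whitespace-separated
# fields, matching the seven variables in the question's split().
my $FIELDS_PER_RECORD = 7;

# stitch_records() is a hypothetical helper, not part of any module.
# It pools fields across physical lines and returns one array ref per
# complete record, so it does not matter where a record was broken.
sub stitch_records {
    my @lines = @_;
    my (@records, @pending);
    for my $line (@lines) {
        # split ' ' skips leading whitespace and drops the trailing newline
        push @pending, split ' ', $line;
        while (@pending >= $FIELDS_PER_RECORD) {
            push @records, [ splice @pending, 0, $FIELDS_PER_RECORD ];
        }
    }
    warn "Incomplete trailing record: @pending\n" if @pending;
    return @records;
}

# Made-up sample in the same shape as spg.txt, with one broken record.
my @sample = (
    "spg-risk-ln_cdo 06/24/2009 21:14 06/24/2009 22:00 IN 39693696/0\n",
    "spg-risk-ln_cdo_leg_synthetic\n",
    " 06/24/2009 21:14 06/24/2009 21:18 IN 39693696/0\n",
    "spg-risk-ln_cdo_legacy 06/24/2009 21:14 06/24/2009 21:15 IN 39693696/1\n",
);

for my $rec (stitch_records(@sample)) {
    my ($title, $start_date, $start_time, $end_date, $end_time, $status, $prixit) = @$rec;
    print "$title $status\n";
}
```

For a real file, the same helper could be fed the lines read from a filehandle, e.g. stitch_records(<$fh>) after a three-argument open. The caveat is that a record genuinely missing a field would shift everything after it, so a sanity check (say, that the first field of each record starts with "spg-") may be worth adding.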


In reply to Regexes, stitching broken lines, and other fun stuff. by vxp
