I've a file that I need to parse. What I am trying to do, essentially, is this:

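#!/usr/bin/perl
use strict;
use warnings;

my $file = "spg.txt";

open(my $spg, '<', $file) or die "Couldn't open $file: $!\n";

while (defined(my $line = <$spg>)) {
        $line =~ s/\s+/ /g;   # collapse runs of whitespace into a single space
        $line =~ s/^\s//;     # drop the leading space left by indented lines
        my ($title, $start_date, $start_time, $end_date, $end_time, $status, $prixit) = split(/\s/, $line);
        print "$title $status\n";
}

close($spg);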

That would be pretty easy, but some lines in the file are "broken", so to speak. Take a look at the "spg-risk-ln_cdo_leg_synthetic" line below, for instance, where the title sits on one line and the rest of the record on the next:

[vxp@vxp ~]$ cat spg.txt
spg-risk-ln-box              06/24/2009 21:14 06/24/2009 22:01 IN 39693696/0
  spg-risk-Fixed_Sterling    06/24/2009 21:14 06/24/2009 21:15 IN 39693696/1
  spg-risk-aaeml             06/24/2009 21:14 06/24/2009 21:15 IN 39693696/1
  spg-risk-ln_abs_credit     06/24/2009 21:14 06/24/2009 21:15 IN 39693696/1
  spg-risk-ln_abs_fixed      06/24/2009 21:14 06/24/2009 21:15 IN 39693696/1
  spg-risk-ln_abs_fixed2     06/24/2009 21:14 06/24/2009 21:15 IN 39693696/1
  spg-risk-ln_abs_flow       06/24/2009 21:14 06/24/2009 21:14 IN 39693696/1
  spg-risk-ln_aol_abs        06/24/2009 21:14 06/24/2009 21:16 IN 39693696/1
  spg-risk-ln_apcms          06/24/2009 21:14 06/24/2009 21:14 IN 39693696/1
  spg-risk-ln_bouwfonds      06/24/2009 21:14 06/24/2009 21:15 IN 39693696/1
  spg-risk-ln_caprub         06/24/2009 21:43 06/24/2009 21:45 IN 39693696/2
  spg-risk-ln_capusd         06/24/2009 21:14 06/24/2009 21:16 IN 39693696/1
  spg-risk-ln_cdo            06/24/2009 21:14 06/24/2009 22:00 IN 39693696/0
  spg-risk-ln_cdo_leg_synthetic
                             06/24/2009 21:14 06/24/2009 21:18 IN 39693696/0
  spg-risk-ln_cdo_legacy     06/24/2009 21:14 06/24/2009 21:15 IN 39693696/1
  spg-risk-ln_cmbs           06/24/2009 21:15 06/24/2009 21:16 IN 39693696/1
  spg-risk-ln_cmbx           06/24/2009 21:15 06/24/2009 21:15 IN 39693696/1
  spg-risk-ln_credit_fixed   06/24/2009 21:15 06/24/2009 21:15 IN 39693696/1
  spg-risk-ln_credit_frn     06/24/2009 21:15 06/24/2009 21:15 IN 39693696/1
  spg-risk-ln_euresi         06/24/2009 21:15 06/24/2009 21:17 IN 39693696/1
  spg-risk-ln_fonspa         06/24/2009 21:15 06/24/2009 21:21 IN 39693696/1
  spg-risk-ln_hyloans        06/24/2009 21:15 06/24/2009 21:15 IN 39693696/1
  spg-risk-ln_ni             06/24/2009 21:15 06/24/2009 21:16 IN 39693696/1
  spg-risk-ln_resid_rmbs     06/24/2009 21:15 06/24/2009 21:17 IN 39693696/1
  spg-risk-ln_rmbs           06/24/2009 21:15 06/24/2009 21:17 IN 39693696/1
  spg-risk-ln_swaps          06/24/2009 21:15 06/24/2009 21:22 IN 39693696/1
  spg-risk-ln_synresi        06/24/2009 21:15 06/24/2009 21:16 IN 39693696/1
  spg-risk-ln_synthetics     06/24/2009 21:15 06/24/2009 21:17 IN 39693696/1
  spg-risk-ln_trefs          06/24/2009 21:15 06/24/2009 21:20 IN 39693696/1
  spg-risk-ln_ukpurch        06/24/2009 21:15 06/24/2009 21:19 IN 39693696/0
  spg-risk-ln_warehouse      06/24/2009 21:15 06/24/2009 21:16 IN 39693696/1
  spg-risk-ln_abs_frn        06/24/2009 21:15 06/24/2009 21:17 IN 39693696/1
  spg-risk-lnliq             06/24/2009 21:14 06/24/2009 21:18 IN 39693696/1
[vxp@vxp ~]$

Any ideas on what's needed to "fix" the file?

I can't just write a regex that matches the line starting with "spg-risk-ln_cdo_leg_synthetic" and "stitches" it to the next line. Doing that would involve something like checking whether there is any data after the first column; if there isn't, placing the first column into a hash as the key; then checking the next line and, if it starts with a space, assigning its contents as that key's value. That can be done technically, but it won't work as a solution because I have thousands and thousands of these little files to parse. I can't possibly find all of the broken lines and write thousands and thousands of regexes. That's why I'm asking people here for a "universal" solution, so to speak, to this problem.
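To make the stitching idea concrete without naming any particular title, here is a minimal sketch. It assumes every complete record has exactly seven whitespace-separated fields (the seven variables in the script above) and that the last field is a single token. `stitch_records` is a hypothetical helper invented for this illustration, and the sample lines are modeled on the listing rather than read from a real file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not part of the original script):
# joins physical lines into logical records of $nfields fields each,
# so a title left alone on one line is completed by the line after it.
sub stitch_records {
    my ($lines, $nfields) = @_;
    my (@records, @pending);
    for my $line (@$lines) {
        # split ' ' skips leading whitespace and splits on runs of whitespace
        push @pending, split ' ', $line;
        # Emit a record each time enough fields have accumulated
        while (@pending >= $nfields) {
            push @records, [ splice(@pending, 0, $nfields) ];
        }
    }
    warn "Leftover fields at end of input: @pending\n" if @pending;
    return @records;
}

# Sample input modeled on the listing; the second record is split in two.
my @sample = (
    "  spg-risk-ln_cdo            06/24/2009 21:14 06/24/2009 22:00 IN 39693696/0\n",
    "  spg-risk-ln_cdo_leg_synthetic\n",
    "                             06/24/2009 21:14 06/24/2009 21:18 IN 39693696/0\n",
);

for my $rec (stitch_records(\@sample, 7)) {
    my ($title, $start_date, $start_time, $end_date, $end_time, $status, $prixit) = @$rec;
    print "$title $status\n";
}
```

Because this counts fields and ignores line breaks, it needs no per-title regex. The trade-off is that a record with a missing or extra field would shift every record after it, so a stricter variant might also check that the second field looks like a date.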

Thanks in advance! :)

In reply to Regexes, stitching broken lines, and other fun stuff. by vxp