Rajpreet has asked for the wisdom of the Perl Monks concerning the following question:

Greetings,

I am kind of stuck and would appreciate your help. Here is the problem description:

I have a text file which has some XML tags. Say the file is input.txt.

Its contents look like this:

<MyId><data>1332</data>................</MyId> <MyId><data>1332</data>................</MyId><MyId><data>1332 </data>................</MyId><MyId><data>1332</data>................</MyId> <MyId><data>1332</data>................</MyId>

Each string starting from <MyId> and ending at </MyId> (both inclusive) is a trade message. I need to extract each trade message and write them to a file such that each line starts with <MyId> and ends with </MyId>. In other words, one trade message per line, with no distortion.

Another potential problem is that each trade message is very large, and a file can hold any number of messages, say around 5,000 or 10,000 or more. Because the volume is so large, I cannot read the whole file into one variable and do the pattern search.

I would really appreciate a quick response.

Thanks

Replies are listed 'Best First'.
Re: Pattern Search Across Multiple Lines
by johngg (Canon) on Feb 11, 2010 at 10:54 UTC

    If I have understood correctly, you could localise the input record separator (see $/ or $INPUT_RECORD_SEPARATOR in perlvar) to your closing tag then use tr to remove embedded newlines.

    With this input data

        $ cat spw822595.inp
        <MyId><data>1332</data><blarg>burfle bof
        bip blech</blarg></MyId><MyId><data>1332</data><spu>hsh wugw </spu>
        <spog>uh wef wegf</spog></MyId>
        <MyId><data>1332</data><grop>hd wg fw we we we we
        eryer efy we dad</grop></MyId>
        $

    This code

        use strict;
        use warnings;

        my $inFile = q{spw822595.inp};
        open my $inFH, q{<}, $inFile
            or die qq{open: < $inFile: $!\n};

        my $outFile = q{spw822595.out};
        open my $outFH, q{>}, $outFile
            or die qq{open: > $outFile: $!\n};

        {
            # Each read now returns everything up to and including </MyId>.
            local $/ = q{</MyId>};
            while ( <$inFH> )
            {
                # Delete embedded newlines so the message fits on one line.
                tr{\n}{}d;
                print $outFH qq{$_\n} if $_;
            }
        }

        close $inFH
            or die qq{close: $inFile: $!\n};
        close $outFH
            or die qq{close: $outFile: $!\n};

    Produces this output

        $ cat spw822595.out
        <MyId><data>1332</data><blarg>burfle bofbip blech</blarg></MyId>
        <MyId><data>1332</data><spu>hsh wugw </spu><spog>uh wef wegf</spog></MyId>
        <MyId><data>1332</data><grop>hd wg fw we we we weeryer efy we dad</grop></MyId>
        $

    I hope I have understood correctly and this is of use.

    Cheers,

    JohnGG

    Update: Just spotted a horrible typo in first paragraph, s{\$\.}{$/} has put things right :-(

Re: Pattern Search Across Multiple Lines
by Ratazong (Monsignor) on Feb 11, 2010 at 08:22 UTC

    You haven't written it explicitly, but let's assume that the trade messages are not nested and that the file is not corrupted (no unmatched tags); otherwise things will get more complicated.

    You say your file is huge, so you should read it line by line.

    Based on the assumptions above, the algorithm below should do the trick:

    • if both tags (<MyId>, </MyId>) are in your line, move the text from the start tag through the end tag to your output and continue processing the rest of the line
    • if only the start tag (<MyId>) is in your line, copy all text from that tag to the end of the line into a temporary variable and set a marker; then process the next line
    • if neither tag is in your line (and the marker is set), append that line to your temporary variable and process the next line
    • if only the end tag (</MyId>) is in your line, append everything from the start of the line through the tag to your temporary variable, write the temporary variable to your output, then clear it and reset the marker; then process the next line

    note1: you will surely want to enhance the first check to cover the situation that the endTag occurs before the startTag
    note2: check substr ... no need for regular expressions here
    note3: if things get more complicated (e.g. you need to evaluate sub-tags, or the input is really XML and contains comments (<!--)), you will want to use one of the fine XML modules on CPAN instead (don't forget to check the Tutorials section on that topic ;-)
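
The steps above could be sketched roughly as follows, using index and substr as note2 suggests. This is only a minimal sketch under stated assumptions: the helper name split_messages and the demo data are made up for illustration, messages are assumed not to be nested, and a tag is assumed never to be split across two lines. Newlines inside a message are dropped so that each message ends up on a single output line.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original reply): reads lines from $in
# and prints one complete <MyId>...</MyId> message per line to $out.
sub split_messages {
    my ( $in, $out ) = @_;
    my ( $open, $close ) = ( '<MyId>', '</MyId>' );
    my $buf    = '';    # the temporary variable holding a partial message
    my $in_msg = 0;     # the marker: true while inside a message

    while ( my $line = <$in> ) {
        chomp $line;
        my $pos = 0;
        while ( $pos < length($line) ) {
            if ( !$in_msg ) {
                my $start = index( $line, $open, $pos );
                last if $start < 0;    # no start tag left on this line
                ( $in_msg, $pos ) = ( 1, $start );
            }
            my $end = index( $line, $close, $pos );
            if ( $end < 0 ) {
                # Message continues on the next line: keep the rest.
                $buf .= substr( $line, $pos );
                last;
            }
            # End tag found: complete the message and write it out.
            my $stop = $end + length($close);
            $buf .= substr( $line, $pos, $stop - $pos );
            print {$out} "$buf\n";
            ( $buf, $in_msg, $pos ) = ( '', 0, $stop );
        }
    }
    return;
}

# Demo on an in-memory filehandle (assumed sample data).
my $demo = "<MyId><data>1332</data>a\nb</MyId> <MyId>c</MyId>\n";
open my $demo_in, '<', \$demo or die "open: $!\n";
split_messages( $demo_in, \*STDOUT );
# prints:
# <MyId><data>1332</data>ab</MyId>
# <MyId>c</MyId>
```

Because the inner loop always searches from the current position, it also covers the case from note1 where an end tag appears before the next start tag on the same line. Only one message is held in memory at a time, but a single message still has to fit.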

    HTH, Rata