You seem determined to take this text dump from the DB and turn it into a CSV file to import again. I still recommend other approaches, but here are some more thoughts for you:

Your thinking appears to be way too complex for the job at hand! You are making a "one-off" thing. Usually the objective is just to get this one-off thing done and out of your hair. Think simple and take advantage of the details of this specific situation. Don't worry about "general purpose". I wouldn't worry about "elegant" or "fast" either, although simple approaches are often very fast. And to me, "straightforward" is its own kind of elegance!

As far as creating a complex structure in either C or Perl goes, this appears to be overkill. You are heading towards a "flat" one-line-per-record format. The variable names that you want are unique between "sections" (i.e. if you know the variable name, then you know what kind of sub-section it came from, and the vars look like they can only appear once per call record). Take advantage of that! Your code doesn't appear to have any need to understand the multi-level nature of the input data.
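If you want to confirm that "once per call record" assumption before relying on it, a quick check along these lines might help. This is only a sketch: the sub name `find_duplicate_tags` and the sample lines (including the deliberately duplicated `switchIdentity`) are made up for illustration and are not taken from your real dump.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sanity check: report any "name : value" tag that shows up
# more than once inside a single call record.
sub find_duplicate_tags {
    my @lines = @_;
    my (%seen, @dups);
    my $record = 0;
    for my $line (@lines) {
        if ($line =~ /CME20CP6\.CallDataRecord/) {   # a new record starts
            $record++;
            %seen = ();
            next;
        }
        next if $line =~ /^\s*\[/;                   # skip [n] : '...'H list items
        if ($line =~ /^\s*(\S+)\s+:\s+\S/) {
            push @dups, "record $record: $1" if $seen{$1}++;
        }
    }
    return @dups;
}

# Made-up sample: the second record repeats switchIdentity on purpose.
my @sample = (
    "CME20CP6.CallDataRecord.uMTSGSMPLMNCallDataRecord",
    "  callIdentificationNumber : '6CBFD7'H",
    "  [0] : '2000'H",
    "  [0] : '0003136985138324'H",
    "  switchIdentity : '0001'H",
    "CME20CP6.CallDataRecord.uMTSGSMPLMNCallDataRecord",
    "  switchIdentity : '0001'H",
    "  switchIdentity : '0002'H",
);
print "$_\n" for find_duplicate_tags(@sample);
```

If this prints nothing for your real file, a single flat hash per record is enough and the nesting can safely be ignored.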

Nothing says that you can't do this in multiple scripts or steps. That is often a good way to go, as it eases the debugging process. If the code isn't "optimally efficient", don't worry about it! The idea is to set up a series of "filters" that progressively work towards your goal.

So as a "first parsing step", I would do something like the code below. This makes an intermediate file that has all of the "var : value" things in each call record in a "flat" format. Fiddle with the regex until you have what you need at this step.

Then write code such that, for each call record, you initialize a hash table with the default values for each var that will go into the output line. Then for each var line in the file's CDR record, if that name tag exists in the hash, override the default with the value from the file. At the end of the record, print the CSV line. A record starts with something that matches CME20CP6.CallDataRecord and ends with a blank line. There is nothing wrong with manually adding a blank line to the end of the intermediate file to make the termination condition easy.

```perl
#!/usr/bin/perl -w
use strict;

while (<DATA>)
{
   print "\n$_" if m/CME20CP6.CallDataRecord/;
   next if /^\s*\[/;   #skip stuff like [1] : '011351'H
   print "$1 : $2\n" if m/^\s*(\S+)\s+:\s+(\S+)\s*$/;
}
#Prints (a blank line comes before each record header):
#
#CME20CP6.CallDataRecord.uMTSGSMPLMNCallDataRecord
#callIdentificationNumber : '6CBFD7'H
#exchangeIdentity : "DWLCCN6"
#gSMCallReferenceNumber : '9103770001'H
#switchIdentity : '0001'H
#recordSequenceNumber : '39D42E'H
#date : '1409071F'H
#serviceFeatureCode : '0002'H
#timeForEvent : '131A01'H
#
#CME20CP6.CallDataRecord.uMTSGSMPLMNCallDataRecord
#callIdentificationNumber : '6CC99C'H
#exchangeIdentity : "DWLCCN6"
#switchIdentity : '0001'H
#recordSequenceNumber : '39D42F'H
#date : '1409071F'H
#serviceFeatureCode : '0002'H
#timeForEvent : '131A20'H

#note: fiddle with the regex to suit your needs
#change to, say,
#   print "$1 : $2\n" if m/^\s*(\S+)\s+:\s+(.*)\s*$/;
#if you want, say,
#   chargePartySingle : 'aPartyToBeCharged (0)'
#to appear
__DATA__
CME20CP6.CallDataRecord.uMTSGSMPLMNCallDataRecord
{
  sCFChargingOutput
  {
    callIdentificationNumber : '6CBFD7'H
    exchangeIdentity : "DWLCCN6"
    gSMCallReferenceNumber : '9103770001'H
    switchIdentity : '0001'H
    recordSequenceNumber : '39D42E'H
    date : '1409071F'H
  }
  eventModule
  {
    iNServiceDataEventModule
    {
      chargePartySingle : 'aPartyToBeCharged (0)'
      genericChargingDigits
      {
        [0] : '2000'H
        [1] : '011351'H
        [2] : '223A941400'H
        [3] : '233A940209'H
        [4] : '043A2000'H
        [5] : '0542'H
        [6] : '2600'H
        [7] : '2700'H
        [8] : '080290701391620122'H
        [9] : '2A02'H
        [10] : '72000000000000000000000000'H
        [11] : '730000000000000000041F'H
        [12] : '7400000000'H
        [13] : '3502'H
      }
      genericChargingNumbers
      {
        [0] : '0003136985138324'H
        [1] : '010413198935930920'H
        [2] : '0203136985138324'H
        [3] : '038290905893701402'H
        [4] : '0B000002000000'H
      }
      serviceFeatureCode : '0002'H
      timeForEvent : '131A01'H
    }
  }
}
CME20CP6.CallDataRecord.uMTSGSMPLMNCallDataRecord
{
  sCFChargingOutput
  {
    callIdentificationNumber : '6CC99C'H
    exchangeIdentity : "DWLCCN6"
    switchIdentity : '0001'H
    recordSequenceNumber : '39D42F'H
    date : '1409071F'H
  }
  eventModule
  {
    iNServiceDataEventModule
    {
      chargePartySingle : 'bPartyToBeCharged (1)'
      genericChargingDigits
      {
        [0] : '2002'H
        [1] : '010359'H
        [2] : '023A8207'H
        [3] : '033A8207'H
        [4] : '043A0000'H
        [5] : '0506'H
        [6] : '2600'H
        [7] : '2704'H
        [8] : '080290701391622322'H
        [9] : '2A02'H
        [10] : '72000000000000000000000000'H
        [11] : '730000000000000000001F'H
        [12] : '3500'H
      }
      genericChargingNumbers
      {
        [0] : '0003138935167173'H
        [1] : '028210850000'H
        [2] : '0303138935167173'H
        [3] : '06041319'H
      }
      serviceFeatureCode : '0002'H
      timeForEvent : '131A20'H
    }
  }
}
```

Update: Just an example of how to implement the above strategy. @csv_order holds the var names in the order that they should appear in the CSV. Now if you need, say, these "ChargingNumbers", I would make up a new name for that list and "squish" it into one value in the intermediate file format, the way you want it to appear in the output CSV file. Anyway, these 2 scripts should run in just a few seconds, even for a million records.

```perl
#!/usr/bin/perl -w
use strict;

my @csv_order = qw ( exchangeIdentity callIdentificationNumber);
my %defaults = map {$_ => ""}@csv_order;
my %curr_record=%defaults;

while (<DATA>)
{
   #true from a record header line through the blank line that ends the record
   if (/CME20CP6.CallDataRecord/.../^\s*$/)
   {
      if ( my ($var,$val) = ($_ =~ m/^\s*(\S+)\s+:\s+(\S+)\s*$/) )
      {
         #only vars named in @csv_order are kept
         $curr_record{$var}=$val if exists ($curr_record{$var});
      }
      if (/^\s*$/)   #remember to add a blank line at end of file
      {
         dump_csv_line();
         %curr_record=%defaults;
      }
   }
}

sub dump_csv_line
{
   print join (",",map{$curr_record{$_}}@csv_order)."\n";
}
#Prints:
#"DWLCCN6",'6CBFD7'H
#"DWLCCN6",'6CC99C'H
#,'699999'H
__DATA__
CME20CP6.CallDataRecord.uMTSGSMPLMNCallDataRecord
callIdentificationNumber : '6CBFD7'H
exchangeIdentity : "DWLCCN6"
gSMCallReferenceNumber : '9103770001'H
switchIdentity : '0001'H
recordSequenceNumber : '39D42E'H
date : '1409071F'H
serviceFeatureCode : '0002'H
timeForEvent : '131A01'H

CME20CP6.CallDataRecord.uMTSGSMPLMNCallDataRecord
callIdentificationNumber : '6CC99C'H
exchangeIdentity : "DWLCCN6"
switchIdentity : '0001'H
recordSequenceNumber : '39D42F'H
date : '1409071F'H
serviceFeatureCode : '0002'H
timeForEvent : '131A20'H

CME20CP6.CallDataRecord.uMTSGSMPLMNCallDataRecord
callIdentificationNumber : '699999'H

```
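
The "squish it" idea for the bracketed lists could look roughly like the sketch below, run as an extra filter before the first parsing step. The sub name `squish_lists`, the choice of `|` as the separator, and the reuse of the section name as the new var name are all my assumptions; pick whatever suits your CSV. It expects lines that have already been chomped.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical extra filter: turn each
#     name { [0] : v0  [1] : v1 ... }
# block into one added "name : v0|v1|..." line, so the later steps can
# treat the whole list as a single var. Other lines pass through as-is.
sub squish_lists {
    my @lines = @_;
    my $name  = '';
    my (@items, @out);
    for my $line (@lines) {
        if ($line =~ /^\s*(\w+)\s*$/) {                  # bare section name
            $name = $1;
            push @out, $line;
        }
        elsif ($line =~ /^\s*\[\d+\]\s+:\s+(\S+)\s*$/) { # [n] : value item
            push @items, $1;
        }
        elsif ($line =~ /^\s*\}/ and @items) {           # end of a list block
            push @out, "$name : " . join('|', @items);
            @items = ();
            push @out, $line;
        }
        else {
            push @out, $line;
        }
    }
    return @out;
}

my @sample = (
    "genericChargingNumbers",
    "{",
    "  [0] : '0003138935167173'H",
    "  [1] : '028210850000'H",
    "}",
    "serviceFeatureCode : '0002'H",
);
print "$_\n" for squish_lists(@sample);
```

Because the joined value contains no whitespace, the `(\S+)` value pattern in the two scripts above picks it up without any change. You would then just add the list name to @csv_order.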

In reply to Re^9: String Search by Marshall
in thread String Search by kallol.chakra
