Re: Replacing Lines in 100 Gig file
by bgreenlee (Friar) on Dec 17, 2004 at 18:02 UTC
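#!/usr/bin/perl -w
while (<DATA>) {
    # Skip everything from each </root> through the next <root>, inclusive.
    # The last </root> has no <root> after it, so it is skipped as well...
    next if m|</root>|..m|<root>|;
    print;
}
# ...and is put back here.
print "</root>\n";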
__DATA__
<?xml version="1.0" encoding="UTF-8"?>
<root>
<some_xml></some_xml>
</root>
<?xml version="1.0" encoding="UTF-8"?>
<root>
<some_xml></some_xml>
</root>
<?xml version="1.0" encoding="UTF-8"?>
<root>
<some_xml></some_xml>
</root>
| [reply] [d/l] |
Neat. I think I know what it's doing but I'm not sure why. Could you point me in the direction of the appropriate perldoc that explains the .. foo?
| [reply] |
perldoc perlop. Search for "flip-flop".
In a nutshell, though, it evaluates to false until the first condition is satisfied, and then evaluates to true up to and including the line on which the second condition is satisfied, after which it goes back to false.
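A small illustration of that behaviour, in case it helps. The sub name `strip_between` and the sample lines are made up for this example; the operator itself is the scalar-context `..` described in perlop.

```perl
use strict;
use warnings;

# Hypothetical helper: drop every line from one matching $start through
# the next one matching $end (both inclusive), and keep everything else.
sub strip_between {
    my ($start, $end, @lines) = @_;
    my @kept;
    for (@lines) {
        # In scalar context ".." is the flip-flop: false until $start
        # matches, then true up to and including the line where $end matches.
        next if /$start/ .. /$end/;
        push @kept, $_;
    }
    return @kept;
}

my @out = strip_between(qr/^b$/, qr/^d$/, qw(a b c d e));
print "@out\n";    # prints "a e"
```

One thing to watch: the operator remembers its state between iterations, so a range that is still open when the input ends stays open.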
| [reply] |
Re: Replacing Lines in 100 Gig file
by dragonchild (Archbishop) on Dec 17, 2004 at 17:49 UTC
Why not make copies of the thousand XML files, remove the root nodes from the copies, then concatenate the files?
Plus, what use is a 100G file anyways??
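One way the copy-and-concatenate idea might look, skipping the copies and streaming each source file straight into the combined output. This is only a sketch: it assumes every file has its `<?xml ...?>` declaration, `<root>` and `</root>` each on a line of their own (as in the sample data elsewhere in this thread), that UTF-8 is the right encoding to declare, and the sub name `join_xml_files` is invented here.

```perl
use strict;
use warnings;

# Hypothetical helper: write the bodies of all @sources into $dest,
# wrapped in a single declaration and a single <root> element.
sub join_xml_files {
    my ($dest, @sources) = @_;
    open my $out, '>', $dest or die "Cannot write $dest: $!";
    print {$out} qq{<?xml version="1.0" encoding="UTF-8"?>\n<root>\n};
    for my $src (@sources) {
        open my $in, '<', $src or die "Cannot read $src: $!";
        while (my $line = <$in>) {
            # Assumption: the wrapper lines stand alone on their own lines.
            next if $line =~ m{^\s*(?:<\?xml\b.*\?>|</?root>)\s*$};
            print {$out} $line;
        }
        close $in;
    }
    print {$out} "</root>\n";
    close $out or die "Cannot close $dest: $!";
    return;
}

# Usage: perl join.pl combined.xml part1.xml part2.xml ...
join_xml_files(@ARGV) if @ARGV >= 2;
```

Nothing here holds more than one line in memory, so the size of the result should not matter much.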
Being right, does not endow the right to be rude; politeness costs nothing. Being unknowing, is not the same as being stupid. Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence. Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.
| [reply] |
100G files! Hmm, some of your batch processing guys don't like you ...
| [reply] |
Re: Replacing Lines in 100 Gig file
by BrowserUk (Patriarch) on Dec 17, 2004 at 18:08 UTC
Resisting the temptation to say "Silly boy!", change quotes to suit.
perl -ple"m[^</root>] and map{ scalar <> } 1..3" file.xml >new.xml
"But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen
"Think for yourself!" - Abigail
"Time is a poor substitute for thought"--theorbtwo
"Efficiency is intelligent laziness." -David Dunham
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
| [reply] [d/l] |
Hi Guys,
you've all been most helpful. The solution I went with was the flip-flop operator. It did exactly what I needed. It is now tucked away in the bag-o-tricks.
This board comes through every time I need something!
| [reply] |
Re: Replacing Lines in 100 Gig file
by amw1 (Friar) on Dec 17, 2004 at 17:44 UTC
You could try using Tie::File. It ties an array to a file without loading the whole thing into memory.
From there it should be pretty straightforward to loop through the array and strip out the stuff you want to strip. Changes to the array modify the file.
The perldoc for it should tell you what you need to know.
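A rough sketch of what that could look like. The sub name `strip_inner_wrappers` is invented, and the code assumes the first two lines of the file are the declaration and `<root>` that should survive. Be warned that each removal makes Tie::File move everything after it, so on a file this size it is likely to be far slower than a single streaming pass.

```perl
use strict;
use warnings;
use Tie::File;

# Hypothetical helper: edit $file in place, keeping only the first
# declaration, the first <root> and the last </root>.
sub strip_inner_wrappers {
    my ($file) = @_;
    tie my @lines, 'Tie::File', $file or die "Cannot tie $file: $!";

    # Walk backwards so that removing an element never shifts the
    # indexes of the lines still to be visited.
    my $seen_last_close = 0;
    for (my $i = $#lines; $i >= 0; $i--) {
        my $line = $lines[$i];
        if ($line =~ m{^\s*</root>\s*$}) {
            # Keep only the final closing tag.
            if ($seen_last_close) { splice @lines, $i, 1 }
            else                  { $seen_last_close = 1 }
        }
        elsif ($i > 1 && $line =~ m{^\s*(?:<\?xml\b.*\?>|<root>)\s*$}) {
            # Assumption: lines 0 and 1 are the wrapper lines to keep.
            splice @lines, $i, 1;
        }
    }
    untie @lines;
    return;
}
```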
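UPDATE: it's the looping line by line that wasn't working for you. Perhaps a loop something like this:
#!/usr/bin/perl
use strict;
use warnings;

# Input and output names come from the command line.
my ($in, $out) = @ARGV;
open my $FILE,    '<', $in  or die "Cannot read $in: $!";
open my $OUTFILE, '>', $out or die "Cannot write $out: $!";

my $cut = 0;
while (<$FILE>) {
    if (m{</root>}) {
        $cut = 1;               # start cutting at each closing tag
    }
    elsif ($cut && m{<root>}) {
        $cut = 0;               # stop cutting
        next;                   # but skip this line as well
    }
    print {$OUTFILE} $_ if !$cut;
}
print {$OUTFILE} "</root>\n";   # the last </root> was cut too, so restore it
close $OUTFILE or die "Cannot close $out: $!";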
| [reply] [d/l] |
Re: Replacing Lines in 100 Gig file
by osunderdog (Deacon) on Dec 17, 2004 at 19:45 UTC
I had a similar problem, but my approach was to just surround the catenated xml files with <root></root> tags.
I did the rest with XSLT.
"Look, Shiny Things!" is not a better business strategy than compatibility and reuse.
OSUnderdog
| [reply] [d/l] |
Re: Replacing Lines in 100 Gig file
by Anonymous Monk on Dec 17, 2004 at 23:30 UTC
just delete all <root>,</root> and <?xml...> after the second line and add a </root> at the end.
perl -pi.bak -e 'next unless $. > 2; s{</?root>\s*}{}; s{<\?xml.*\?>\s*}{};' foo.xml
echo '</root>' >> foo.xml
can't quite figure out how to leave that last </root> alone or how to add it at the end of processing...
| [reply] [d/l] |
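On adding the last `</root>` at the end of processing: one possibility is to test `eof` inside the loop, which is true while the last line of the current file is being handled. Below is a sketch of the same edit written as a sub rather than a one-liner; the name `merge_in_place` is invented, and setting `$^I` is just the in-script equivalent of `-i.bak`.

```perl
use strict;
use warnings;

# Hypothetical helper doing the same substitutions as the one-liner,
# in place, with the final </root> appended from inside the loop.
sub merge_in_place {
    my ($file) = @_;
    local ($^I, @ARGV) = ('.bak', $file);    # same effect as -i.bak
    while (<>) {
        if ($. > 2) {
            s{</?root>\s*}{};
            s{<\?xml.*\?>\s*}{};
        }
        # eof without parentheses is true on the last line of the
        # current file, so the closing tag is added exactly once.
        $_ .= "</root>\n" if eof;
        print;
        close ARGV if eof;    # resets $. in case the sub is called again
    }
    return;
}
```

In one-liner form the same idea would be to append `$_ .= "</root>\n" if eof;` to the `-e` code and drop the `echo`.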
Re: Replacing Lines in 100 Gig file
by iburrell (Chaplain) on Dec 20, 2004 at 21:05 UTC
Is '<root>' a valid element in the rest of the file? If not, then just remove all '<?xml', '<root>' and '</root>' lines, and write the initial and final ones manually.
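A minimal sketch of that, as a filter over two filehandles. The sub name `rewrap` is invented, and the declaration it writes is copied from the sample data elsewhere in this thread, so adjust it if your files declare something else.

```perl
use strict;
use warnings;

# Hypothetical helper: copy $in to $out, dropping every wrapper line and
# writing one declaration, one <root> and one </root> by hand.
sub rewrap {
    my ($in, $out) = @_;
    print {$out} qq{<?xml version="1.0" encoding="UTF-8"?>\n<root>\n};
    while (my $line = <$in>) {
        # Only safe if these lines never occur as real content.
        next if $line =~ m{^\s*(?:<\?xml|</?root>)};
        print {$out} $line;
    }
    print {$out} "</root>\n";
    return;
}

# To use it as a pipe filter: rewrap(\*STDIN, \*STDOUT);
```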
| [reply] |