PitifulProgrammer has asked for the wisdom of the Perl Monks concerning the following question:

Dear Community,

here's my second question as a complete noob to programming. Having explored command line editing, I am now seeking help with an intricate regex pattern.

Here is the issue. I have a large file (approx. 1 GB). In this file, the names in the attributes creationid and changeid have to be changed to a particular string.

Whenever I run the script and compare the files with the diff command, they turn out to be identical.

I checked my regex with the Tool Expresso, but I think Perl's regex works differently.

This is the string in question:

<tu creationdate="12345Z" changedate="6789Z" creationid="John Doe" changeid="" lang="de-DE">

changeid might also contain a name

This is the script I've been using to replace creationid:

perl -pi.bak -e 's{creationid=".+?"}{creationid="Simon Simonsen"}g;' file.xml

I also tried s{creationid="([^<+])"} as alternative pattern

I appreciate any help on how to improve this script to work through this monstrous file

Thanks a mil in advance

C.

Replies are listed 'Best First'.
Re: help with regular expression required
by AppleFritter (Vicar) on Aug 13, 2014 at 12:12 UTC

    This is the script I've been using to replace creationid:

    perl -pi.bak -e 's{creationid=".+?"}{creationid="Simon Simonsen"}g;' file.xml

    This works for me, if file.xml contains only the string you provided above. Can you try to whittle your ~1 GB file down to a simpler test case that fails, e.g. by bisecting it? Doing so may prove instructive in identifying the problem; alternatively, it'll make it easier for us monks to help you if we've got both an "it works" case and a "it stops working if I do this" case.

Re: help with regular expression required
by choroba (Cardinal) on Aug 13, 2014 at 12:12 UTC
    Works for me. In the alternative pattern, the character class should exclude the double quote, not < and +, and it needs a quantifier:
    s{creationid="[^"]*"}{creationid="Simon Simonsen"}g
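
    To make the difference between the patterns concrete, here is a minimal sketch that runs them against the sample line from the question (with the line-wrap "+" removed). The variable names are illustrative only, and the changeid substitutions are an assumption about how the same pattern would be reused for the second attribute.

    ```perl
    use strict;
    use warnings;

    # Sample line from the question; changeid is empty here.
    my $line = '<tu creationdate="12345Z" changedate="6789Z" creationid="John Doe" changeid="" lang="de-DE">';

    # Pattern from the one-liner: lazy match up to the next double quote.
    (my $lazy = $line) =~ s{creationid=".+?"}{creationid="Simon Simonsen"}g;

    # Alternative from the question: [^<+] has no quantifier, so it matches
    # exactly one character and cannot match a value like "John Doe".
    (my $single = $line) =~ s{creationid="([^<+])"}{creationid="Simon Simonsen"}g;

    # Negated character class: any run of non-quote characters, even an empty one.
    (my $negated = $line) =~ s{creationid="[^"]*"}{creationid="Simon Simonsen"}g;

    # The same two working patterns applied to the empty changeid attribute.
    # .+? needs at least one character, so it consumes the closing quote and
    # runs on into the next attribute.
    (my $lazy_change    = $line) =~ s{changeid=".+?"}{changeid="Simon Simonsen"}g;
    (my $negated_change = $line) =~ s{changeid="[^"]*"}{changeid="Simon Simonsen"}g;

    print "lazy:           $lazy\n";
    print "single:         $single\n";
    print "negated:        $negated\n";
    print "lazy_change:    $lazy_change\n";
    print "negated_change: $negated_change\n";
    ```

    On this line the lazy and negated patterns give the same result for creationid, and the single-character class leaves the line untouched. For the empty changeid, only the negated class keeps the following lang attribute intact, which is one more reason to prefer it.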

    If you need to change the structure of XML file in a more complex way, you might need to switch to XML::LibXML::Reader, an XML pull-parser.

    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

      Thank you choroba and also many thanks to AppleFritter

      I do not have to change the structure of the file, it is just about replacing some strings. Chances are this has to be put into a 'proper' script at some stage (I will surely try out the XPath variant)

      As for the file, I can post a small snippet, which I had to modify. I changed the content but the structure is the same. Sorry for not having initially posted the snippet.

      <tu creationdate="12345Z" changedate="12345" creationid="John Doe" changeid="John DOE" srclang="en-US">
        <prop type="user-defined">server:xyz</prop>
        <prop type="user-defined">quality:100</prop>
        <prop type="user-defined">storing mode:1</prop>
        <prop type="user-defined">uncontrolled:0</prop>
        <prop type="user-defined">lastuser:unknown</prop>
        <prop type="user-defined">context left:1;2;3;4;5;6;7;8;9;10;11;12;13;14;15;16;17;18;19;20</prop>
        <prop type="user-defined">context right:21;22;23;24;25;26;27;28;29;30;31;32;33;34;35;36;37;38;39;40</prop>
        <prop type="user-defined">a lot of junk with letters and &quot;90365AF9&quot</prop>
        <prop type="Guid::xxx">123456789</prop>
        <prop type="Att::xxx">name_of_someone</prop>^M
        <tuv xml:lang="en-US"><seg>a lot of junk with letters and &quot;90365AF9&quot<bpt i="1"></bpt>text<ept i="1">&lt;/cf&gt;</ept></seg></tuv>
        <tuv xml:lang="en-US"><seg>a lot of junk with letters and &quot;90365AF9&quot<bpt i="1"></ept></seg></tuv>
      </tu>

      Thanks a mil for helping me out. Really helps beginners to move on. Looking forward to your replies.

      Kind regards C.

        Thanks for sharing this piece of data -- however, your oneliner is still working just fine there for me. I understand that you may not be in a position to share confidential data, but on the other hand, I'm sure you'll understand it's a bit difficult to diagnose a problem without a test case showing the problem.

        I second what other monks have suggested: if regular expressions aren't working here for whatever reason, a proper XML wrangling module may be the way to go. And XML::Twig is specifically intended for working with very large XML files that don't fit into memory, so that's my recommendation as well.

Re: help with regular expression required
by Anonymous Monk on Aug 13, 2014 at 12:10 UTC
    Since it's an XML file, you'll probably be happier (and more efficient and robust) with an XML tool such as XML::Twig.
Re: help with regular expression required
by Anonymous Monk on Aug 13, 2014 at 15:21 UTC

    Works on your example XML (once the structure is corrected):

    #!/usr/bin/env perl
    use warnings;
    use strict;
    use XML::Twig;

    XML::Twig->new(
        keep_spaces              => 1,
        twig_print_outside_roots => 1,
        twig_roots               => {
            tu => sub {
                my ($twig, $elt) = @_;
                $elt->set_att('creationid', 'Simon Simonsen');
                $elt->print;
            },
        },
    )->parsefile('1097276.xml');

    Prints to STDOUT, so redirect the output wherever you like (perl 1097276.pl >output.xml)

      Thank you Anonymous Monk for the code snippet.

      Your help is much appreciated. The code works fine on the command line.

      However I've been getting this nasty error message.

      Wide character in print at /usr/lib/perl5/site_perl/5.14/XML/Twig.pm line 8403.
      Wide character in print at /usr/lib/perl5/site_perl/5.14/XML/Twig.pm line 8403.
      Wide character in print at /usr/lib/perl5/site_perl/5.14/XML/Twig.pm line 8403.

      I think this is due to the encoding.

      Is there any chance I could force utf8 output along the lines of

      use warnings;
      use strict;
      use XML::Twig;
      use File::Slurp;
      use utf8;
      use File::Slurp qw(read_file write_file);

      my $infile = shift;
      my $filename = $ARGV[2];

      XML::Twig->new(
          keep_spaces              => 1,
          twig_print_outside_roots => 1,
          twig_roots               => {
              tu => sub {
                  my ($twig, $elt) = @_;
                  $elt->set_att('creationid', 'Simon Simonsen');
                  $elt->print;
              },
          },
      )->parsefile( $infile );

      write_file $filename, {binmode => ':utf8'}, $infile;

      Thanks in advance for encouraging comments

      Kind regards and many thanks

      C

        Since XML::Twig is writing to STDOUT, I'm not sure what your write_file is supposed to be doing...

        Encodings are covered in the XML::Twig docs. Here's some code that works for me:

        use warnings;
        use strict;
        use XML::Twig;

        open my $ofh, '>:utf8', '1097276_out.xml' or die $!;

        XML::Twig->new(
            keep_spaces              => 1,
            twig_print_outside_roots => $ofh,
            twig_roots               => {
                tu => sub {
                    my ($twig, $elt) = @_;
                    $elt->set_att('creationid', "Simon Sim\xF6nsen");
                    $elt->print($ofh);
                },
            },
        )->parsefile('1097276.xml');

        close $ofh;

        Also have a look at the aptly-named output_encoding option in XML::Twig.
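
        As background on the warning itself: Perl emits "Wide character in print" when a string containing a character above U+00FF is printed to a filehandle that has no encoding layer. The following is a minimal sketch in core Perl only, independent of XML::Twig; the scratch file and the sample string are made up for illustration.

        ```perl
        use strict;
        use warnings;
        use File::Temp qw(tempfile);

        # Hypothetical scratch file, used only for this illustration.
        my ($tmp_fh, $path) = tempfile(UNLINK => 1);
        close $tmp_fh;

        # A string with one Latin-1 character (U+00F6) and one character
        # above U+00FF (U+2603), which is what Perl calls a "wide" character.
        my $text = "Simon Sim\x{F6}nsen \x{2603}\n";

        # The encoding layer converts characters to UTF-8 bytes on output,
        # so print has no reason to warn about wide characters.
        open my $out, '>:encoding(UTF-8)', $path or die "Cannot write $path: $!";
        print {$out} $text;
        close $out or die "Cannot close $path: $!";

        # Read the file back as raw bytes to see what was actually stored.
        open my $in, '<:raw', $path or die "Cannot read $path: $!";
        my $bytes = do { local $/; <$in> };
        close $in;

        printf "%d characters became %d bytes\n", length($text), length($bytes);
        ```

        The byte count is larger than the character count because the two non-ASCII characters take more than one byte each in UTF-8. Opening the output handle with an encoding layer, as in the code above that passes $ofh to XML::Twig, applies the same idea.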