robartes has asked for the wisdom of the Perl Monks concerning the following question:

Fellow monks,

I've run into something that might result in serious hair loss. Here's what:

The setting

I'm working with a 2 GB+ file in LDIF format (for an LDAP user database I'm populating) where I need to replace parts of the DNs of the users. E.g.:

cn=Perl Monks,ou=Dining Hall,ou=Monastery,c=Universe
needs to become:
cn=Perl Monks,ou=Bedchambers,ou=Monastery,c=Universe
Not a problem, I says, just search and replace the thing, and await happiness, joy and bliss ever after. Well, no. It turns out that the input file has a line length of 80 chars, and enforces it: there's a newline character at position 80 if the line is longer than 80 characters. Very inconvenient, of course, as for long DNs this might mean that it actually says:
cn=Perl Monks,ou=Dining H\nall,ou=Monastery,c=Universe
And those things are not matched by a simple regular expression, as the newline can appear just about anywhere in the DN (apparently, this is not a problem for the tools I will use later down the line to import the users).

Some solutions

Some things I can think of doing to remedy this:

The question

I have this nagging feeling that I am missing a deceptively simple, yet cunningly clever way of doing this with a regular expression. Can any monk bring some enlightenment in this?

CU
Robartes-

Replies are listed 'Best First'.
Re: Match a string that can contain a carriage return in a random position.
by castaway (Parson) on Dec 23, 2002 at 13:33 UTC
    Hi,

    according to my Camel (3rd edition), on page 150, you can use a /s modifier to allow '.' to also match newlines..

    So try something like this?
    /cn=(.+),ou=(.+),ou=(.+),c=(.+)/s

    C.

      Not a bad suggestion, BUT:

      1. As the file to be read in is so large, it will have to be dealt with line by line.
      2. The lines will break at the 80th position.
      3. Using the /s-switch in the regex will solve nothing as you will only find the newline-character at the end of the line just read in.

      I would rather suggest first normalizing the file, i.e. removing the newlines at the 80th position and concatenating the various parts of each DN so as to have one line per DN.

      Then you can use the regex /cn=(.+),ou=(.+),ou=(.+),c=(.+)/

      Finally, you de-normalize the DNs again by adding (if necessary) newlines at position 80 so as not to break the other tools.

      CountZero

      "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

      In that case the regex would have to look like this:
      /c\n?n\n?=\n?([^,]*)\n?,\n?…
      etc, etc. One suggestion is to take the basic regex above and auto-generate the necessary one by interpolating optional newlines after each character, but I think the poster is looking for a neater hack. I wish I could think of one!
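
A minimal sketch of that auto-generation idea, for what it's worth. The helper name newline_tolerant is made up for this example, and it assumes the fold is a bare "\n" with nothing else inserted, as in the original question. If the file marks continuation lines with a leading space, the joiner would have to allow for that too.

```perl
use strict;
use warnings;

# Build a pattern that matches $literal even when a newline has been
# inserted between any two of its characters.
# (newline_tolerant is a hypothetical helper name, not from the thread.)
sub newline_tolerant {
    my ($literal) = @_;
    # Escape each character, then allow an optional newline between neighbours.
    my $pattern = join '\n?', map { quotemeta } split //, $literal;
    return qr/$pattern/;
}

my $old = newline_tolerant('ou=Dining Hall');
my $dn  = "cn=Perl Monks,ou=Dining H\nall,ou=Monastery,c=Universe";

(my $fixed = $dn) =~ s/$old/ou=Bedchambers/;
print "$fixed\n";    # cn=Perl Monks,ou=Bedchambers,ou=Monastery,c=Universe
```

Note that the replacement swallows the embedded newline, so the line lengths change and the 80-character layout is no longer what the producing tool wrote. This also only helps if the whole DN is in the string being matched, which is not the case when reading the file one physical line at a time.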

Re: Match a string that can contain a carriage return in a random position.
by Aristotle (Chancellor) on Dec 23, 2002 at 20:26 UTC

    I always try to find some sort of reliable landmark in the format I'm reading and use it to disambiguate other parts of the "markup" on a syntactical level, rather than going into the semantics (like "does the sum of tags I have here constitute a complete DN yet?" as BrowserUk did).

    In this case the (almost) reliable landmark is the commas separating the tags. If we read the file from comma to comma, there is only one ambiguous case: we can read across the newline that separates the tag at the end of one DN from that which begins the other DN. This is the one case we need to care about. How can we unambiguously conclude that is what the newline at hand means? If and only if the newline is followed by \w+=.

    Update: This doesn't quite work. / \n \w+ = /x is not strong enough - what happens if it's the tag itself that's broken up? Then we get a dangling beginning of the tag as the end of the current DN, and the rest of the tag with equals sign and value, newline and the second tag as the beginning of the new DN. However, since we read from comma to comma, we can guarantee that we always get the beginning of a tag at the start of the string. Therefore, requiring that an equals sign precede the newline clarifies this corner case:

    local $/ = ",";
    my $dn = "";
    while(<>) {
        my $complete_dn;
        if(m/ \A (.+ = .+) \n ( \w+ = .+ ,) \z /sx) {
            $complete_dn = $dn . $1;
            $dn = $2;
        }
        else {
            $dn .= $_;
        }
        if($_ = $complete_dn) {
            s/\n+//g;
            print $_, "\n";
        }
    }

    Here's my "test suite":

    Makeshifts last the longest.

Reading LDIF (was: Re: Match a string that can contain a carriage return in a random position.)
by fuzzycow (Sexton) on Dec 23, 2002 at 21:36 UTC
    Keep in mind that you are reading an LDIF file. This means that the splitting of the line:

    1. Will not always happen at the 80th character
    2. The splitting can happen to any LDAP attribute, not just the DN

    The following two ideas may help you:
    1. Use a Perl API for reading LDIF (perldap-1.4.1 can do this). You will have to play around with the API calls to prevent the API functions from reading all of the 2 GB file at once.

    Pros: It should work, and you can even write back, to the server, the created LDAP Entry objects.

    Cons: The Mozilla LDAP API is based on compiled C libraries; even so, I have no idea how fast the resulting code will be.

    2. As far as I'm aware, when a line in LDIF is split, the next line will begin with a blank space. It should be very easy for you to write a filter program that will 'unsplit' a big LDIF file. (btw: personally, I think that changing the line break separator is totally the wrong way to go here)
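
Such an 'unsplit' filter could look roughly like the sketch below. It assumes the convention described above (a continuation line starts with exactly one space, which is dropped when the pieces are rejoined). The sub name unfold_ldif and the sample data are made up for illustration. It works one physical line at a time, so the 2 GB file never has to fit in memory.

```perl
use strict;
use warnings;

# Read folded LDIF from $in and write one logical line per physical line to $out.
# Assumption: a continuation line starts with a single space, which is removed.
# (unfold_ldif is a hypothetical name chosen for this example.)
sub unfold_ldif {
    my ($in, $out) = @_;
    my $pending;                          # logical line being assembled
    while (my $line = <$in>) {
        $line =~ s/\r?\n\z//;             # strip the line ending
        if (defined $pending && $line =~ s/\A //) {
            $pending .= $line;            # continuation: glue onto previous line
            next;
        }
        print {$out} "$pending\n" if defined $pending;
        $pending = $line;
    }
    print {$out} "$pending\n" if defined $pending;   # flush the last line
    return;
}

# Small demonstration on in-memory file handles.
my $folded = "dn: cn=Perl Monks,ou=Dining H\n all,ou=Monastery,c=Universe\n"
           . "cn: Perl Monks\n\n";
open my $in,  '<', \$folded      or die "Cannot open input: $!";
open my $out, '>', \my $unfolded or die "Cannot open output: $!";
unfold_ldif($in, $out);
close $out;
print $unfolded;
```

For the real file you would pass \*STDIN and \*STDOUT (or handles opened on the input and output files) instead of the in-memory handles.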


    P.S.: Oh.. and you can try looking for that command line switch, that will prevent your whatever2ldif utility from doing the splitting ;-)
Re: Match a string that can contain a carriage return in a random position.
by pg (Canon) on Dec 23, 2002 at 16:03 UTC
    I would prefer your second solution, to strip the \n's first. Just like you, I would like to have a fancy regexp to do it, but if you install regexps everywhere, regardless of whether they are the best solution for that particular case, the result will be awkward instead of smart. If you cannot come up with a lean and clean regexp, I would rather not use one; remember that regexps can be really difficult to understand. Especially in a corporate environment, you may well confuse whoever unfortunately has to maintain your code.

    In your case, the problem is not just that the \n's can turn up at random places within the strings to be replaced. There is another problem which is equally annoying, if not worse: where do you put the \n's in your replacement strings so that the result satisfies the 80-char restriction? Worse still, you may well have to shift all the \n's in the continuation lines within that logical record (as it is possible for a logical record to spread over more than two lines). All I am saying is that there is a good chance you will have to reformat your lines, lots of them, if not all.

    Having said all this, I think a clean solution would be:
    1. Strip the \n's caused by the 80-char restriction. (You may have two types of \n's: one is the logical record separator, the other is caused by the 80-char restriction. This is not clearly described in your post, so I am just covering both cases.)
    2. Do a s///. You don't need a fancy one, just a plain normal one, which anyone can come up with in one second.
    3. Reformat to satisfy the 80-char restriction.
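
The three steps above can be sketched on a single record as follows. This is only an illustration under assumptions: it takes the folds to be a newline followed by one space (LDIF style), which is how it tells them apart from logical line ends. If your file really has bare newlines at position 80, both the stripping regex and the fold marker need adjusting. The sub name fold_line is made up for the example.

```perl
use strict;
use warnings;

# Re-fold one logical line so that no physical line exceeds $width characters.
# Assumption: continuation lines are marked with one leading space.
# (fold_line is a hypothetical name chosen for this example.)
sub fold_line {
    my ($line, $width) = @_;
    $width ||= 80;
    return $line if length($line) <= $width;
    my $folded = substr($line, 0, $width, '');   # first physical line: full width
    while (length $line) {
        # each continuation: one marker space plus up to $width - 1 characters
        $folded .= "\n " . substr($line, 0, $width - 1, '');
    }
    return $folded;
}

my $record = "dn: cn=Perl Monks,ou=Dining H\n all,ou=Monastery,c=Universe";

(my $logical = $record) =~ s/\n //g;              # step 1: strip the folds
$logical =~ s/ou=Dining Hall,/ou=Bedchambers,/;   # step 2: plain substitution
print fold_line($logical, 30), "\n";              # step 3: re-fold (30 only for the demo)
```

In real use the width would be left at its default of 80. Since the replacement value can be longer or shorter than the original, re-folding from the logical line is simpler than trying to shift the existing breaks.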
Re: Match a string that can contain a carriage return in a random position.
by BrowserUk (Patriarch) on Dec 23, 2002 at 19:51 UTC

    This seems to deal with all the pathological break points I tried.

    #!perl -slw
    use strict;

    my @lines;
    my $buffer;
    while( <DATA> ){
        chomp;
        $buffer .= $_;                       #! Accumulate in buffer
        if ($buffer =~ /^(cn=.+?)cn=/) {     #! When we have a full line+
            push @lines, $1;                 #! save it
            substr($buffer, 0, $+[1], '');   #! and strip from the buffer
        }
    }
    push@lines,$buffer;
    print for @lines;

    Test data

Re: Match a string that can contain a carriage return in a random position.
by thezip (Vicar) on Dec 24, 2002 at 05:47 UTC
    After inspecting an example file (courtesy of LDAP Data Interchange Format), I think that you can safely set the input record separator to:

    $/ = 'dn:';

    With this, you can apply regexes as necessary to any data within each multiline "dn:" record without having to worry about the newlines.
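
A rough sketch of how reading by "dn:" record might look. The sample data is invented, and the unfolding step assumes continuation lines start with a single space. One caveat: a literal "dn:" inside an attribute value would also end a record, so "\ndn:" may be a safer separator in practice.

```perl
use strict;
use warnings;

# Invented sample data: two entries, the first with a folded dn line.
my $ldif = join '',
    "dn: cn=Perl Monks,ou=Dining H\n all,ou=Monastery,c=Universe\n",
    "cn: Perl Monks\n\n",
    "dn: cn=Abbot,ou=Dining Hall,ou=Monastery,c=Universe\n",
    "cn: Abbot\n\n";

open my $in, '<', \$ldif or die "Cannot open in-memory file: $!";

my @records;
{
    local $/ = 'dn:';                  # each read ends at the next "dn:"
    while (my $record = <$in>) {
        chomp $record;                 # chomp removes the trailing $/ ("dn:")
        next unless $record =~ /\S/;   # skip the empty chunk before the first "dn:"
        $record =~ s/\n //g;           # undo folding (leading-space assumption)
        $record =~ s/ou=Dining Hall,/ou=Bedchambers,/;
        push @records, "dn:$record";   # put the separator back in front
    }
}
print @records;
```

Only one entry is held in memory at a time, so this should cope with the 2 GB file as long as no single entry is enormous. The output here is left unfolded.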

    Where do you want *them* to go today?