Re: Merging multiple variations of a serial number (regex as "mini prolog")
by LanX (Saint) on Jul 28, 2022 at 21:10 UTC
|
You can actually encode your rules into a regex, which will show all possible interpretations.
(The trick is to force backtracking with a (*FAIL) )
And it highlights what hippo already told you: you get two possible results for the case of 13 digits!
IOW you need more filter rules, like ...
- a valid check digit
- impossible serial numbers
- chronological entries
- consistent format per file
- whatever other heuristic ...
And then you must hope there are no more ambiguities left ...
But I think this is a good start.
use v5.12;
use warnings;
use Data::Dump qw/pp dd/;

while (my $l = <DATA>) {
    chomp $l;
    my ($n,$desc) = split /\s*=\s*/,$l;
    say "--- $n";
    say " $desc";

    # alternations of the allowed values, interpolated into the regex
    my $year  = join "|", 18..22;
    my $month = join "|", 1..9, "01".."12";

    my @res;

    $n =~ /^
           (?:
               (?<year>$year)
           |
               (?<month>$month)      # no month without year
               (?<year>$year)
           )?                        # date is optional
           (?<serial>\d{10})
           (?<check>\d)?             # check is optional
           $
           (?{ push @res, { %+ } })  # record this interpretation
           (*FAIL)                   # then backtrack to look for the next one
          /x;

    say pp \@res;
    warn "*** TOO MANY MATCHES $n***\n",pp(\@res),"\n\n" if @res > 1;
    warn "*** NO MATCH $n ***" if @res < 1;
}
__DATA__
1231600014 = 10 digit version (just the core serial)
221231600014 = 12 digit version (minus both leading month and last check digit)
1221231600014 = 13 digit version 2 (minus the leading zero for the month and minus the check digit). These entries would need to be corrected to add the leading zero and therefore become the 14 digit version
2212316000140 = 13 digit version 1 (minus the leading 2 month digits but including the last check digit)
01221231600014 = 14 digit version (minus the last check digit)
012212316000140 = Full 15 digit serial. This comprises of month and year (0122), core serial (1231600014), check digit (0)
*** TOO MANY MATCHES 2212316000140***
[
{ check => 0, serial => 1231600014, year => 22 },
{ month => 2, serial => 2316000140, year => 21 },
]
--- 1231600014
10 digit version (just the core serial)
[{ serial => 1231600014 }]
--- 221231600014
12 digit version (minus both leading month and last check digit)
[{ serial => 1231600014, year => 22 }]
--- 1221231600014
13 digit version 2 (minus the leading zero for the month and minus the check digit). These entries would need to be corrected to add the leading zero and therefore become the 14 digit version
[{ month => 1, serial => 1231600014, year => 22 }]
--- 2212316000140
13 digit version 1 (minus the leading 2 month digits but including the last check digit)
[
{ check => 0, serial => 1231600014, year => 22 },
{ month => 2, serial => 2316000140, year => 21 },
]
--- 01221231600014
14 digit version (minus the last check digit)
[{ month => "01", serial => 1231600014, year => 22 }]
--- 012212316000140
Full 15 digit serial. This comprises of month and year (0122), core serial (1231600014), check digit (0)
[
{ check => 0, month => "01", serial => 1231600014, year => 22 },
]
update
refined the month rules by substituting \d|\d\d with $month = join "|", 1..9, "01".."12";
Re: Merging multiple variations of a serial number
by tangent (Parson) on Jul 28, 2022 at 16:34 UTC
|
I would use a table structure, say a SQLite database, though you could use a tied array or CSV file. I would create a table something like this:
| d15 | d14 | d13_v1 | d13_v2 | d12 | d10 | check | month | year |
| 012212316000140 | 01221231600014 | 2212316000140 | 1221231600014 | 221231600014 | 1231600014 | 0 | 01 | 22 |
The first step to create the table would be to take the 15 digit variations and generate all the other columns for each. Move on to the 14 digit variations, see if there is a 15 digit one (select * where d14 = ?) and if not create a new row and generate as many of the other columns as you can. Keep doing this all the way. For the 13 digit variations only add the ones that can be disambiguated (as hippo and others have pointed out).
At the end you will be left with a list of 13 digit ones that haven't been added. You can look these up in the table to see possible matches but will probably need to manually sort these.
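The first step (deriving the other columns from a 15-digit serial) might look roughly like the sketch below. It assumes the layout from the question: 2-digit month, 2-digit year, 10-digit core serial, 1 check digit. The sub name row_from_d15 is made up for illustration, and no database is involved; the returned hash stands in for one table row.

```perl
use strict;
use warnings;

# Hypothetical helper: split a full 15-digit serial into the columns of the
# table above. Assumes the layout MM YY <10-digit core> <check digit>.
sub row_from_d15 {
    my ($d15) = @_;
    my ($month, $year, $core, $check) =
        $d15 =~ /^(\d{2})(\d{2})(\d{10})(\d)$/
        or return;    # not a 15-digit serial

    my %row = (
        d15    => $d15,
        d14    => "$month$year$core",
        d13_v1 => "$year$core$check",
        d12    => "$year$core",
        d10    => $core,
        check  => $check,
        month  => $month,
        year   => $year,
    );

    # The 13-digit "version 2" can only arise for months 01..09, where
    # there is a leading zero to drop.
    $row{d13_v2} = substr($month, 1) . "$year$core" if $month =~ /^0/;

    return \%row;
}

my $row = row_from_d15("012212316000140");
print "$_ = $row->{$_}\n" for sort keys %$row;
```

For the sample serial this reproduces the row shown in the table. For months 10 to 12 the d13_v2 column is simply left out.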
Re: Merging multiple variations of a serial number
by hippo (Archbishop) on Jul 28, 2022 at 15:11 UTC
|
They are all unique lengths bar the 2 sorts of 13 digit number. Those might be distinguishable if you have a small number of possible years. What is the range of valid years here?
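A minimal sketch of that test, assuming the valid two-digit years are 18..22 (the question only says the data starts in 2018, so the upper bound is a guess) and that a month without its leading zero is 1..9. The sub name formats_of_13 is made up.

```perl
use strict;
use warnings;

# Assumption: valid years are 18..22. Adjust to the real data.
my $year = join "|", 18 .. 22;

# Hypothetical helper: return the list of formats a 13-digit number fits.
sub formats_of_13 {
    my ($n) = @_;
    my @fits;
    push @fits, "v1" if $n =~ /^(?:$year)\d{10}\d$/;       # YY core check
    push @fits, "v2" if $n =~ /^[1-9](?:$year)\d{10}$/;    # M YY core
    return @fits;
}

for my $n (qw(2212316000140 1221231600014)) {
    my @fits = formats_of_13($n);
    printf "%s: %s\n", $n, @fits ? join(", ", @fits) : "no format fits";
}
```

With these assumptions 1221231600014 fits only v2, while 2212316000140 fits both formats, so the year range alone does not settle every case.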
Re: Merging multiple variations of a serial number
by AnomalousMonk (Archbishop) on Jul 28, 2022 at 16:22 UTC
|
In addition to hippo's point about discriminating 13-digit serial numbers by month and year, assume that a check digit is present and check the check digit against the ever-present serial number. You then have a 9-in-10 chance (update: well, assuming the check digit is in the range 0..9; there are other possibilities) of detecting the absence of a check digit (i.e., the assumed serial number does not match the assumed check digit), and this will let you make a better guess (but still only a guess) about which format of 13-digit serial number is present.
What is the method for calculating the check digit? Also, are the month numbers 0 .. 11 or 1 .. 12?
Give a man a fish: <%-{-{-{-<
Re: Merging multiple variations of a serial number
by AnomalousMonk (Archbishop) on Jul 28, 2022 at 17:41 UTC
|
I don't quite understand the problem. Is it that:
- you are given a 10-digit core serial number in the range 0000000000 .. 9999999999 and a year number in the range 18 .. 22 and possibly a month number (in what range: 0 .. 11 or 1 .. 12?), and you must figure out all possible 15-, 14-, 13-, 12- and 10-digit full serial numbers that might result (with and without a leading zero for the month number); or
- you are given a 13-digit full serial number and you must determine which of two possible formats it is in (the 15-, 14-, 12- and 10-digit full serial number lengths all being unambiguously parsable).
My impression is that the latter problem is the one you face, but I have a nagging doubt...
Give a man a fish: <%-{-{-{-<
Re: Merging multiple variations of a serial number
by kcott (Archbishop) on Jul 29, 2022 at 01:10 UTC
|
G'day Doozer,
If you're thinking about using the base number as a hash key, I assume it's unique.
In which case, all of the numbers you've shown should match /${base}\d?$/;
which they do:
Just a standard alias of mine:
$ alias perle
alias perle='perl -Mstrict -Mwarnings -Mautodie=:all -MCarp::Always -E'
$ perle '
    my $base = "1231600014";
    my @possibles = qw{
        012212316000140
        01221231600014
        2212316000140
        1221231600014
        221231600014
        1231600014
    };
    my $re = qr{${base}\d?$};
    for (@possibles) {
        if (/$re/) {
            say "$_ MATCH";
        }
        else {
            say "$_ NO MATCH";
        }
    }
'
012212316000140 MATCH
01221231600014 MATCH
2212316000140 MATCH
1221231600014 MATCH
221231600014 MATCH
1231600014 MATCH
That should get you started.
I'm unsure how you want to proceed from here:
look up dates from somewhere?
determine a check digit somehow?
Your post seemed to suggest you knew the base code up-front.
If not:
$ perle '
    my @possibles = qw{
        012212316000140
        01221231600014
        2212316000140
        1221231600014
        221231600014
        1231600014
    };
    my %re_map = (
        10 => qr{^(\d{10})$},
        12 => qr{^\d{2}(\d{10})$},
        13 => qr{^(?:(?:18|19|20|21|22)(\d{10})\d|\d{3}(\d{10}))$},
        14 => qr{^\d{4}(\d{10})$},
        15 => qr{^\d{4}(\d{10})\d$},
    );
    for (@possibles) {
        my $re = $re_map{+length};
        /$re/ and say $1 // $2;
    }
'
1231600014
1231600014
1231600014
1231600014
1231600014
1231600014
Both of those pieces of code are just showing techniques:
adapt, as necessary, to your specific needs.
Re: Merging multiple variations of a serial number
by Anonymous Monk on Jul 28, 2022 at 15:33 UTC
|
I do not believe the use of JSON as a serialization format should influence your choice of internal data structures. Nor should you assume that the data structures you use to sort this mess out are the ones you want to serialize.
If you could fully parse all the forms this would be easy. The problem is the 13-digit one. This one can probably be disambiguated if you can compute the check digit from the core serial. The problem is that 1/10 chance that the serial could be read both ways.
What I think I would try first is building two hashes. One would be keyed by core serial and contain all the variants that were found of it (including fact that the core variant was found). The other would simply record 13-digit serials that can not be disambiguated. Once you have all the core serials you can make a pass through the 13-digit serials and try to match them up. Important: your code should check for the case that one of these 13-digit serials can not be disambiguated even after all core serials are known, and complain mightily about all such found.
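A rough sketch of that two-pass idea follows. It assumes years 18..22 and that only the six formats from the question occur; the variable names are made up, and the undecided 13-digit serials are held in a plain list rather than a second hash. The input is just the set of sample variants from the question.

```perl
use strict;
use warnings;

# Assumptions: years 18..22, digits-only input, six known formats.
my $year  = join "|", 18 .. 22;
my @input = qw(
    012212316000140 01221231600014 2212316000140
    1221231600014 221231600014 1231600014
);

my (%by_core, @pending);
for my $n (@input) {
    my $len = length $n;
    if ($len == 13) {
        push @pending, $n;    # decide later, once all cores are known
        next;
    }
    my $core = $len == 10 ? $n
             : $len == 12 ? substr($n, 2)
             : $len == 14 ? substr($n, 4)
             : $len == 15 ? substr($n, 4, 10)
             :              undef;
    if (defined $core) { $by_core{$core}{$n} = 1 }
    else               { warn "unexpected length: $n\n" }
}

# Second pass: a 13-digit serial is resolved only if exactly one of its
# possible readings yields a core serial that is already known.
my @unresolved;
for my $n (@pending) {
    my %cand;
    $cand{ substr($n, 2, 10) } = 1 if $n =~ /^(?:$year)/;         # YY core check
    $cand{ substr($n, 3) }     = 1 if $n =~ /^[1-9](?:$year)/;    # M YY core
    my @known = grep { exists $by_core{$_} } keys %cand;
    if (@known == 1) { $by_core{ $known[0] }{$n} = 1 }
    else             { push @unresolved, $n }
}

warn "*** CANNOT DISAMBIGUATE: $_ ***\n" for @unresolved;
print "$_: ", join(" ", sort keys %{ $by_core{$_} }), "\n"
    for sort keys %by_core;
```

With the sample input both 13-digit forms end up under core serial 1231600014, because only one reading of each points at a core that was seen in an unambiguous format.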
Re: Merging multiple variations of a serial number
by Doozer (Scribe) on Jul 29, 2022 at 09:24 UTC
|
Huge thanks to everyone who has commented! There are a lot of good suggestions and also some questions which I will do my best to work through.
The serial numbers are used on a piece of equipment that we source from an external company and deploy to engineers. We don't have any control over serial number generation and I don't know if there is any correlation between the core serial and the check digit.
My initial issue stems from the fact that since 2018, different people/departments have received, built, and added to, the various files using different formats of the serial numbers. This is likely what has caused issues such as duplicate entries and incorrect formats (Like the 13 digit version 2). It's a "Too many cooks spoil the broth" scenario which I have now been tasked with trying to fix. Going forward we want to implement unit tracking using any of the possible formats so a database looks like the right way to go.
Check digits are only ever in the range 0-9.
Month numbers are 01 - 12 but as is apparent in our data, some entries miss the leading zero off of single digit months.
I'm going to go ahead and try some of the suggested solutions as the logic appears to be quite straightforward.
|
|
> The serial numbers are used on a piece of equipment that we source from an external company and deploy to engineers. We don't have any control over serial number generation
It really is important to know, if the serial numbers stay stable like IDs and how many there are.
You can't seriously have 1e10 pieces of equipment°, so a lookup hash with correct numbers will help you filter out impossible matches.
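As a sketch of such a filter: the code below assumes a trusted list of core serials exists (the single entry is just the sample from the question) and that each candidate reading is a hash with a serial key. The two readings are the two interpretations of 2212316000140.

```perl
use strict;
use warnings;

# Assumption: a trusted list of valid core serials is available.
my %known_core = map { $_ => 1 } qw(1231600014);

# The two possible readings of 2212316000140, one hash per interpretation.
my @readings = (
    { year  => 22, serial => "1231600014", check => 0 },
    { month => 2,  year   => 21, serial => "2316000140" },
);

# Keep only readings whose core serial is known to exist.
my @plausible = grep { $known_core{ $_->{serial} } } @readings;

if    (@plausible == 1) { print "core serial: $plausible[0]{serial}\n" }
elsif (@plausible == 0) { warn "no reading matches a known serial\n" }
else                    { warn "still ambiguous\n" }
```

Here only the first reading survives, since 2316000140 is not in the list.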
> different people/departments have received, built, and added to, the various files using different formats of the serial numbers.
As I already said, it is very probable that the effects of those people can be localized to certain files and time periods.
Creating a histogram for each file will help you determine which 13-digit format was used, and for which timestamps.
°) 1e11 even with the check digit.
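A per-file histogram could start as simply as counting serial lengths, as in this sketch. The file names and their contents are made-up placeholders; in practice the serials would be read from the real files, and timestamps could be added as a further key.

```perl
use strict;
use warnings;

# Made-up sample data standing in for the real files.
my %serials_in = (
    "file_a.txt" => [qw(01221231600014 1221231600014)],
    "file_b.txt" => [qw(012212316000140 2212316000140 221231600014)],
);

my %hist;    # $hist{file}{length} = count
for my $file (keys %serials_in) {
    $hist{$file}{ length $_ }++ for @{ $serials_in{$file} };
}

for my $file (sort keys %hist) {
    my $h = $hist{$file};
    print "$file: ",
        join(", ", map { "$_ digits x $h->{$_}" } sort { $a <=> $b } keys %$h),
        "\n";
}
```

A file whose other entries never carry a check digit (like file_a.txt here) hints that its 13-digit entries are the kind with the missing leading zero, but that is only a heuristic.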
|
|
> It really is important to know, if the serial numbers stay stable like IDs and how many there are.
They should now stay stable in the current format. We currently have around 3,500 units/serial numbers
Re: Merging multiple variations of a serial number
by vertigo7 (Friar) on Jul 28, 2022 at 19:46 UTC
|
Could you iterate over each file containing serials and try a regular expression match (and may God help your soul) for the core serial number?
Edit: regex deleted, it didn't work like I thought it did.