Help with a Regex

planetscape has asked for the wisdom of the Perl Monks concerning the following question:

Please see

Script:

#! /usr/local/bin/perl -w

use strict;

my $final_string = '';

while (<>) {
    chomp;

    if (m/([mM]{0,1})
          ([dD]{0,1})
          ([cC]{0,1})
          ([lL]{0,1})
          ([xX]{0,1})
          ([vV]{0,1})
          ([iI]{0,1})\b/x) {

        if (length($1) == 0) {
            $final_string = "\.";
        } else {
            $final_string = $1;
        }
        if (length($2) == 0) {
            $final_string .= "\.";
        } else {
            $final_string .= $2;
        }
        if (length($3) == 0) {
            $final_string .= "\.";
        } else {
            $final_string .= $3;
        }
        if (length($4) == 0) {
            $final_string .= "\.";
        } else {
            $final_string .= $4;
        }
        if (length($5) == 0) {
            $final_string .= "\.";
        } else {
            $final_string .= $5;
        }
        if (length($6) == 0) {
            $final_string .= "\.";
        } else {
            $final_string .= $6;
        }
        if (length($7) == 0) {
            $final_string .= "\.";
        } else {
            $final_string .= $7;
        }

        print "$final_string\n";
        $final_string = '';

    }
}

Input:

I
IV
V
VI
IX
X
XI
XIV
XV
XVI
XIX
X
XL
LX
XC
CLXIX
CDXLVI
MCMXCVI
MDCLI

Actual Output:

......I
.......
.....V.
.....VI
.......
....X..
....X.I
.......
....XV.
....XVI
.......
....X..
.......
...LX..
.......
.......
.......
.......
MDCL..I

Desired Output:

......I
.....V.
.....V.
.....VI
....X..
....X..
....X.I
....XV.
....XV.
....XVI
....X.I
....X..
....X..
...LX..
....X..
..CLX.I
.D..XVI
M.C.XVI
MDCL..I

The regex and the sample data have been somewhat contrived to fit Roman Numerals. The real regex/data is similar, but character classes may contain a variable number of characters. Certain classes later in the regex may contain some or all characters contained in prior classes.

Characters must match both the class AND the position. I want to ignore characters that are "out of order" (the "I" in "IV") but allow subsequent matches of characters that are in order (the "V" in "IV"). I want to make sure a period (".") replaces any character that doesn't match both the character and its designated position.

I am not sure how to change my regex to accomplish my goals. Any help would be appreciated.

for a complete description of my quandary, including my code, sample input, actual output, and an example of desired output.

I need help modifying my regex to get my desired output, but am not sure how to fix it.

Your wisdom is greatly appreciated.

Updated to use <readmore> tags rather than my scratchpad, on the excellent advice of reasonablekeith.
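A note on the Actual Output above: every group in the posted regex is optional, so the pattern is allowed to match the empty string, and the trailing \b is satisfied at the very start of any line that begins with a letter. When the in-order letters are not followed by a word boundary (as in "IV", where the "I" is followed by "V"), the engine backtracks to that empty match at offset 0 and all seven captures come back empty, which gives a row of dots. The following is only a minimal sketch to make that visible; the helper name match_span is made up for illustration, and /i stands in for the [mM]-style classes.

```perl
use strict;
use warnings;

# Same seven optional groups as the posted regex, with /i instead of [mM] etc.
my $re = qr/(m?)(d?)(c?)(l?)(x?)(v?)(i?)\b/i;

# Hypothetical helper: returns the (start, end) offsets of the overall match,
# or an empty list if the pattern does not match at all.
sub match_span {
    my ($s) = @_;
    return unless $s =~ $re;
    my ( $start, $end ) = ( $-[0], $+[0] );    # copy before leaving the sub
    return ( $start, $end );
}

for my $in (qw( IV VI XL LX )) {
    my ( $start, $end ) = match_span($in);
    printf "%-3s matched %d character(s) at offset %d\n",
        $in, $end - $start, $start;
}
```

For "IV" and "XL" this reports a zero-length match at offset 0, and for "VI" and "LX" a two-character match, which lines up with the all-dots rows in the Actual Output.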


Replies are listed 'Best First'.
Re: Help with a Regex by tlm (Prior) on May 06, 2005 at 06:26 UTC
Update: When I wrote my original solution to this problem, I overlooked the requirement to ignore out-of-order numerals. The code below now checks for this (I also made some other minor changes). Thanks to Roy Johnson for the heads-up!

Update 2: Animator informs me that the new version of the code still fails (it does not give the desired output for "XIX", for example). (Thanks!) I have updated the code to show the failures. On further thought, I think that this part of the spec is ambiguous:

> I want to ignore characters that are "out of order" (the "I" in "IV") but allow subsequent matches of characters that are in order (the "V" in "IV").

...since it does not specify why it is the "I" and not the "V" that should be regarded as the out of order numeral. Why is "X" the desired output for "XL", but "V" the desired output for "IV"? The case of "XIX" is also problematic, because both "X" or "I" may be regarded as out of order (depending on how one chooses to interpret this specification), and yet the desired output for "XIX" is "....X.I". What should the output be for "VIX"? Or for "DLXVIM"?

I don't think that regexes are the tool for this job:

    use strict;
    use warnings;

    my %pos;
    @pos{ qw( M D C L X V I ) } = ( 0 .. 6 );
    my $n_keys = keys %pos;
    my $template = '.' x $n_keys;

    while ( <DATA> ) {
      my ( $in, $desired) = split;
      my $out = $template;
      my $ptr = $n_keys;
      for my $i ( reverse ( 0 .. length( $in ) - 1 ) ) {
        my $c = substr $in, $i, 1;
        if ( $pos{ $c } < $ptr ) {
          substr( $out, $pos{ $c }, 1 ) = $c;
          $ptr = $pos{ $c };
        }
      }
      print "$in\t=> $out\t", ( $out eq $desired ? '' : 'not ' ), 'ok', $/;
    }

    __DATA__
    I       ......I
    IV      .....V.
    V       .....V.
    VI      .....VI
    IX      ....X..
    X       ....X..
    XI      ....X.I
    XIV     ....XV.
    XV      ....XV.
    XVI     ....XVI
    XIX     ....X.I
    X       ....X..
    XL      ....X..
    LX      ...LX..
    XC      ....X..
    CLXIX   ..CLX.I
    CDXLVI  .D..XVI
    MCMXCVI M.C.XVI
    MDCLI   MDCL..I
    __END__
    I       => ......I  ok
    IV      => .....V.  ok
    V       => .....V.  ok
    VI      => .....VI  ok
    IX      => ....X..  ok
    X       => ....X..  ok
    XI      => ....X.I  ok
    XIV     => ....XV.  ok
    XV      => ....XV.  ok
    XVI     => ....XVI  ok
    XIX     => ....X..  not ok
    X       => ....X..  ok
    XL      => ...L...  not ok
    LX      => ...LX..  ok
    XC      => ..C....  not ok
    CLXIX   => ..CLX..  not ok
    CDXLVI  => .D.L.VI  not ok
    MCMXCVI => M.C..VI  not ok
    MDCLI   => MDCL..I  ok

the lowliest monk
Re: Help with a Regex by demerphq (Chancellor) on May 06, 2005 at 13:30 UTC
> Characters must match both the class AND the position. I want to ignore characters that are "out of order" (the "I" in "IV") but allow subsequent matches of characters that are in order (the "V" in "IV"). I want to make sure a period (".") replaces any character that doesn't match both the character and its designated position.

As far as I can tell your desired output does not match the above description. The expected output for "XIV" is "....XV." and the expected output for "XIX" is "....X.I". I don't see how these two interpretations are compatible with each other. A few other cases are also unclear, notably the one that says "CDXLVI" should result in ".D..XVI", which I can't understand at all.

Anyway, I wrote two solutions, one parsing from the front and one from the back. Neither gets all your output cases correct, which makes me suspect your output expectations are incorrect.

    Input   => Expect  ( LtoR    : ok? ) ( RtoL    : ok? )
    ---        ---       ---       ---     ---       ---
    I       => ......I ( ......I : Ok  ) ( ......I : Ok  )
    IV      => .....V. ( ......I : Not ) ( .....V. : Ok  )
    V       => .....V. ( .....V. : Ok  ) ( .....V. : Ok  )
    VI      => .....VI ( .....VI : Ok  ) ( .....VI : Ok  )
    IX      => ....X.. ( ......I : Not ) ( ....X.. : Ok  )
    X       => ....X.. ( ....X.. : Ok  ) ( ....X.. : Ok  )
    XI      => ....X.I ( ....X.I : Ok  ) ( ....X.I : Ok  )
    XIV     => ....XV. ( ....X.I : Not ) ( ....XV. : Ok  )
    XV      => ....XV. ( ....XV. : Ok  ) ( ....XV. : Ok  )
    XVI     => ....XVI ( ....XVI : Ok  ) ( ....XVI : Ok  )
    XIX     => ....X.I ( ....X.I : Ok  ) ( ....X.. : Not )
    X       => ....X.. ( ....X.. : Ok  ) ( ....X.. : Ok  )
    XL      => ....X.. ( ....X.. : Ok  ) ( ...L... : Not )
    LX      => ...LX.. ( ...LX.. : Ok  ) ( ...LX.. : Ok  )
    XC      => ....X.. ( ....X.. : Ok  ) ( ..C.... : Not )
    CLXIX   => ..CLX.I ( ..CLX.I : Ok  ) ( ..CLX.. : Not )
    CDXLVI  => .D..XVI ( ..C.XVI : Not ) ( .D.L.VI : Not )
    MCMXCVI => M.C.XVI ( M.C.XVI : Ok  ) ( M.C..VI : Not )
    MDCLI   => MDCL..I ( MDCL..I : Ok  ) ( MDCL..I : Ok  )

I'd be interested to see a better explanation of what you expect.

---
$world=~s/war/peace/g
Re: Help with a Regex by reasonablekeith (Deacon) on May 06, 2005 at 07:41 UTC
One of the great things about perlmonks is that you can search the archives to see if anyone has had your problem before, and what other people's answers were. By not including your question in your post you're breaking this, reducing the value of any answers you may have received. If you felt your question was too long, you could have used the readmore tags to hide it from monks without the inclination to read the entire post.
Re: Help with a Regex by Animator (Hermit) on May 06, 2005 at 12:46 UTC
I believe that this code does what you want. First it creates a string of 7 dots. Then it looks for each letter; if it finds one, it sets all characters from that letter's position onward (in the output string) to dots, puts the letter in its position, and removes all occurrences of that character from the input string.

```
my %Lookup = (M => 0, D => 1, C => 2, L => 3, X => 4, V => 5, I => 6);
while (<DATA>) {
  my $out = "." x 7;
  while (m/([MDCLXVI])/g) {
    substr ($out, $Lookup{$1}) = "." x (length($out) - $Lookup{$1});
    substr ($out, $Lookup{$1}, 1, $1);
    s/$1//g;
  }
  print $out;
}
```

Note, this code is similar to tlm's code, except that his code does not reset the dots, which is what the OP wants... (or at least I guess that's what he wants by looking at the desired output)

Update: typo + note about tlm's code

Update2: code has a bug... input: 'XL', 'XC' and 'CDXLVI' do not generate the correct output...

Update3: I fail to see the consistency in the OP's desired output. XL and XC should result in X, which would mean that any character following X (other than V and I) should be ignored, but CDXLVI should result in .D..XVI, meaning that the C is ignored... this makes no sense to me.
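The approach described above can also be written as a standalone function, which makes it easier to try individual inputs. This is only a sketch of one reading of it, not the posted code: the names %slot and align are made up here, and instead of editing $_ while matching, it remembers which letters have already been handled in a hash. Skipping characters outside MDCLXVI is an assumption.

```perl
use strict;
use warnings;

my @order = qw( M D C L X V I );
my %slot;
@slot{@order} = ( 0 .. $#order );    # letter => output position

# Hypothetical helper: returns the 7-character aligned string for one input.
sub align {
    my ($in) = @_;
    my @out = ('.') x @order;
    my %seen;
    for my $c ( split //, $in ) {
        next unless exists $slot{$c};    # assumption: ignore anything else
        next if $seen{$c}++;             # later repeats of a letter are dropped
        my $i = $slot{$c};
        # Reset this slot and everything after it, then place the letter.
        @out[ $i .. $#out ] = ('.') x ( @out - $i );
        $out[$i] = $c;
    }
    return join '', @out;
}

print "$_ => ", align($_), "\n" for qw( IV XIX XL CDXLVI );
```

With this reading, "IV" gives ".....V." and "XIX" gives "....X.I", while "XL" gives "...L..." and "CDXLVI" gives ".D.L.VI", which is where it departs from the desired output as originally posted.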
Re: Help with a Regex by planetscape (Chancellor) on May 07, 2005 at 04:44 UTC
Ok, my apologies - dyslexia has indeed caused me to make some mistakes in my desired output, as some have suspected.

Also, a few words of further explanation: I decided to use Roman Numerals because (a) it made for much shorter examples than my "real" data, and (b) I thought the "order" would be implicit (M = 1000 and comes "first", D = 500 and comes "second", etc.) I should definitely have explicitly stated why this is the "order" in which characters should be considered.

I am taking another look at my "desired" output to ensure the dyslexia demons have been banished. Then I shall take a look at the proposed solutions and report back.

Thanks again for your wisdom and, of course, your patience, kind Monks!
Re^2: Help with a Regex by demerphq (Chancellor) on May 07, 2005 at 11:14 UTC
> M = 1000 and comes "first", D = 500 and comes "second", etc.

Sure, M is leftmost, and D is second leftmost. But your examples have contrary input. "XIX" => "....X.I" contradicts "IX" => "....X.." as far as I can tell.

---
$world=~s/war/peace/g
Re^3: Help with a Regex by insaniac (Friar) on May 07, 2005 at 11:29 UTC
well... i got some suckie code (just woke up with a terrible hangover... too much free beer ;-) ) but i think i understand it a bit more now... take the input string, look for the highest value (first time it's M, second time it's D), and start looking from that position for the next highest value (skipping of course the previous highest one).

for the suckie code, here's the output:

    I       => ......I ( ......I : Ok )
    IV      => .....V. ( .....V. : Ok )
    V       => .....V. ( .....V. : Ok )
    VI      => .....VI ( .....VI : Ok )
    IX      => ....X.. ( ....X.. : Ok )
    X       => ....X.. ( ....X.. : Ok )
    XI      => ....X.I ( ....X.I : Ok )
    XIV     => ....XV. ( ....X.I : Not Ok )
    XV      => ....XV. ( ....XV. : Ok )
    XVI     => ....XVI ( ....XVI : Ok )
    XIX     => ....X.I ( ....X.I : Ok )
    X       => ....X.. ( ....X.. : Ok )
    XL      => ....X.. ( ...L... : Not Ok )
    LX      => ...LX.. ( ...LX.. : Ok )
    XC      => ....X.. ( ..C.... : Not Ok )
    CLXIX   => ..CLX.I ( ..CLX.I : Ok )
    CDXLVI  => .D..XVI ( .D..XVI : Ok )
    MCMXCVI => M.C.XVI ( M.C.XVI : Ok )
    MDCLI   => MDCL..I ( MDCL..I : Ok )

I think i know what to do, but my code is not really working as I expected.. or maybe I should wait till my hangover is ... over ;-)

to ask a question is a moment of shame
to remain ignorant is a lifelong shame
Re^3: Help with a Regex by planetscape (Chancellor) on May 12, 2005 at 10:18 UTC
I believe I have finally vanquished the Demons of Derval Byslexia...

```
What I posted:
_______________
I        ......I
IV       .....V.
V        .....V.
VI       .....VI
IX       ....X..
X        ....X..
XI       ....X.I
XIV      ....XV.
XV       ....XV.
XVI      ....XVI
XIX      ....X.I
X        ....X..
XL       ....X..   WRONG   Should be: ...L...
LX       ...LX..
XC       ....X..   WRONG   Should be: ..C....
CLXIX    ..CLX.I
CDXLVI   .D..XVI   WRONG   Should be: .D.L.VI
MCMXCVI  M.C.XVI
MDCLI    MDCL..I
_______________
```

Revised

```
Desired Output:
_______________
I        ......I
IV       .....V.
V        .....V.
VI       .....VI
IX       ....X..
X        ....X..
XI       ....X.I
XIV      ....XV.
XV       ....XV.
XVI      ....XVI
XIX      ....X.I
X        ....X..
XL       ...L...
LX       ...LX..
XC       ..C....
CLXIX    ..CLX.I
CDXLVI   .D.L.VI
MCMXCVI  M.C.XVI
MDCLI    MDCL..I
_______________
```

Now for me to figure out which monk-contributed code works with the revised output... I'm blind now, so...

Updated: Fixed two relatively minor typos.
Re: Help with a Regex by insaniac (Friar) on May 06, 2005 at 16:09 UTC
update: after trying some stuff I get the same results as posted above.. his expect data just doesn't have a pattern i guess. my comments seem to apply to demerphq's Left-to-Right solution.

I think that the OP is trying to say that he first wants to scan the string for irregularities. Meaning: take the string "MCMXCVI", if you scan it based on the model "MDCLXVI", when he finds the M the first time, he wants to neglect any character other than the D, if that character is found in the string. After that he wants to proceed the scanning starting from the position where he found the D, or at failure, start at the next character.

In our example, the D doesn't exist, so after finding the M, he'll start at position two in the string and he wants to start looking if there's a C. We find two of them, he wants to take the first one and neglect the second one. At the position of C, he wants to start looking if an L exists... and so on. When everything is scanned, he just wants to print it out in a nice way.

well, that's what I see by looking at the data he provided... BUT, i'm too lazy to write the code, it's weekend and I'm going to my horses. The weather is perfect now for a nice ride in the woods ;-) have fun!

to ask a question is a moment of shame
to remain ignorant is a lifelong shame
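One way to read the scan described above: walk the model "MDCLXVI" letter by letter, search for each letter in the input starting just past the last hit, and emit a dot when it is not found. The sketch below assumes that reading; the name scan_align is made up for illustration, and this is not code from any of the posts in this thread.

```perl
use strict;
use warnings;

my @model = qw( M D C L X V I );

# Hypothetical helper: one pass over the model, moving forward through the input.
sub scan_align {
    my ($in) = @_;
    my $from = 0;     # where the search for the next model letter starts
    my $out  = '';
    for my $letter (@model) {
        my $at = index $in, $letter, $from;
        if ( $at >= 0 ) {
            $out .= $letter;
            $from = $at + 1;    # continue just past the letter we found
        }
        else {
            $out .= '.';        # letter missing from here on: leave a dot
        }
    }
    return $out;
}

print "$_ => ", scan_align($_), "\n" for qw( MCMXCVI XIX CDXLVI );
```

For "MCMXCVI" this takes the first C and skips the second M and C, giving "M.C.XVI". "XIX" gives "....X.I" and "CDXLVI" gives ".D.L.VI".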
Re: Help with a Regex by kwaping (Priest) on May 06, 2005 at 17:59 UTC
This is close, but doesn't work for every case. Maybe someone else can build upon this to find the correct answer.

```
#!/usr/bin/perl -w
use strict;

my @a = ('M','D','C','L','X','V','I');
my $out = '';
my $n = 0;
my $b = 'XIV';
my @tmp = split(//,$b);

for (my $i = 0; $i < @a; $i++) {
  if ( $a[$i] eq uc($tmp[$n]) ) {
    $out .= $a[$i];
    $n++;
  } else {
    $out .= '.';
  }
}

print $out;
exit;
```
My Final Solution [Was: Re: Help with a Regex] by planetscape (Chancellor) on May 15, 2005 at 06:52 UTC
First, to report back on which Monks' proposed solutions worked best with my de-dyslexified output.

Animator's code worked flawlessly with my revised desired output. I used it as inspiration in my final solution, below. tlm, demerphq, and insaniac posted solutions that came very close indeed to what I needed, but were not quite right for the problem. kwaping's was right on the money for its sole test case... Nevertheless, I have learned and will continue to learn from what each posted. I thank all for their contributions, whether such contained code or not.

Now, to the problem and its final solution...

Please remember, the Roman Numerals were a "dumbed down" version of the real data and regex, to make for shorter code and test cases. As I mentioned, some characters that occur in an earlier part of my 'real' regex could also occur later. Note the comments in the code below to find out how I accomplished this.

I decided to split my problem into two steps, one step per script (and the posted regex is still a simplified version, though closer to actual). I wanted first to discard an earlier, one or two character match in favor of a longer, later match. I did this with LongestMatch.pl, below, which as noted in the script is pretty much a verbatim solution from Jeffrey Friedl's book, Mastering Regular Expressions.

The second script below, Align.pl, borrows heavily from Animator's offering. It takes the longest match and "pads" it for "missing" characters.

Also keep in mind, I'm still pretty new at this. My code may well still be ugly. Thanks to PM and the wisdom of its kind denizens, that's getting better. Thanks to the Monks listed above, the code now does what I want it to, too. That's more than I had when I started.

Thanks again!

(BTW, for anyone wondering, it's a Linguistics thing. Morphology.)
