Match all Non-0 and Letters

arblargan has asked for the wisdom of the Perl Monks concerning the following question:


Re: Match all Non-0 and Letters
by CountZero (Bishop) on Jun 24, 2017 at 08:35 UTC

In this case, you want to distinguish between "good" and "bad" words. Sometimes it is easy to define what is "good" and sometimes it is easier to define what is "bad".

In this particular case, the definition of a good word is easy: 7 zeroes followed by a digit. It then follows logically that all words that do not comply with this simple format must be "bad". Hence we extract all "good" words and simply drop all others, and we don't care in which way they may be bad.

The only regex you need is therefore qr/0{7}\d/ and depending on how the words are presented to you, you may wish to "anchor" the regex in the front or the back to avoid some false positives.
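
To illustrate the anchoring point, here is a minimal sketch comparing the unanchored and the fully anchored form. The sub names is_good_loose and is_good_strict and the sample words are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper names, for illustration only.
# Unanchored: matches if the pattern occurs anywhere in the string.
sub is_good_loose  { return $_[0] =~ /0{7}\d/ ? 1 : 0 }

# Anchored at both ends: the whole string must be 7 zeroes plus one digit.
sub is_good_strict { return $_[0] =~ /\A0{7}\d\z/ ? 1 : 0 }

for my $word (qw(00000005 x00000005 000000051 FFFFFFFF)) {
    printf "%-10s loose=%d strict=%d\n",
        $word, is_good_loose($word), is_good_strict($word);
}
```

Only the first sample word passes the strict check; the second and third pass the loose check as false positives.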

By concentrating upon the "bad" words you made it unnecessarily difficult for yourself.

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

My blog: Imperial Deltronics


Re^2: Match all Non-0 and Letters

by arblargan (Acolyte) on Jun 24, 2017 at 22:08 UTC

All, thank you very much for the help. My apologies for the confusing post; I typed it out before bed last night in desperation. The word extraction happens farther up in the subroutine than I've shown, but by the time it gets to this point, it will always be 8 continuous digits (or letters if there's corruption) not separated by whitespace.

I realize that using the $D1 and $D2 variables makes the regex much more difficult than it needed to be, but I created those to try and figure out where the regex was failing at. When I tried my initial regex it looked something like this

if ($Disc =~ /[1-9a-zA-Z]{7}\D/)

However, this still did not do what I wanted. I did try something similar to if ($Disc !~ /0{7}\d/) but I think I may have used a D by mistake. I just tried if ($Disc !~ /(0{7})(\d$)/) and the regex worked great!

Thank you all for the quick replies and showing the correct syntax for what I'm trying to do. As I mentioned before, I'm relatively new to Perl, so I still have quite a ways to go, especially with the regex syntax.


Re^3: Match all Non-0 and Letters

by AnomalousMonk (Archbishop) on Jun 25, 2017 at 00:16 UTC

The word ... will always be 8 continuous digits (or letters if there's corruption) not separated by whitespace.
...
I just tried if ($Disc !~ /(0{7})(\d$)/) and the regex worked great!

Note that if $Disc can ever possibly be longer than eight characters (update: with extra characters at the beginning), that regex will fail:

c:\@Work\Perl\monks>perl -wMstrict -le
"my $Disc = 'foo00000008';
 ;;
 if ($Disc !~ /(0{7})(\d$)/) {
   print qq{'$Disc' is bad};
   }
 else {
   print qq{'$Disc' is OK!};
   }
"
'foo00000008' is OK!

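The regex is anchored only at the end, with $. To require exactly eight characters, anchor it at both ends, with both ^ and $.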
The other thing I notice about the /(0{7})(\d$)/ regex is that (0{7}) captures a substring that can't possibly be anything other than '0000000', so why bother? (I assume you have some reason for capturing the trailing digit.)

So what I might end up with would be something like m{ \A 0{7} (\d) \z }xms (in a testing matrix):

c:\@Work\Perl\monks>perl -wMstrict -le
"for my $Disc (qw(
   00000000 00000001 00000002 00000003 00000004
   00000005 00000006 00000007 00000008 00000009
   0 00 000 0000 00000 000000 0000000 000000000
   FFFFFFFF ffffffff 6C163512
   x00000000  00000000x  x00000000x
   x0000000   0000000x   x0000000x
   x000000000 000000000x x000000000x
   ), '') {
   ;;
   my $proper_word =
   my ($rightmost_digit) = $Disc =~ m{ \A 0{7} (\d) \z }xms;
   ;;
   if ($proper_word) {
     print qq{'$Disc' ok, rightmost digit '$rightmost_digit'};
     }
   else {
     print qq{'$Disc' is bad};
     }
   }
"
'00000000' ok, rightmost digit '0'
'00000001' ok, rightmost digit '1'
'00000002' ok, rightmost digit '2'
'00000003' ok, rightmost digit '3'
'00000004' ok, rightmost digit '4'
'00000005' ok, rightmost digit '5'
'00000006' ok, rightmost digit '6'
'00000007' ok, rightmost digit '7'
'00000008' ok, rightmost digit '8'
'00000009' ok, rightmost digit '9'
'0' is bad
'00' is bad
'000' is bad
'0000' is bad
'00000' is bad
'000000' is bad
'0000000' is bad
'000000000' is bad
'FFFFFFFF' is bad
'ffffffff' is bad
'6C163512' is bad
'x00000000' is bad
'00000000x' is bad
'x00000000x' is bad
'x0000000' is bad
'0000000x' is bad
'x0000000x' is bad
'x000000000' is bad
'000000000x' is bad
'x000000000x' is bad
'' is bad

Test::More
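
The same matrix could also be written as a small Test::More script. This is only a sketch: the sub name rightmost_digit and the selection of cases are assumptions for illustration, not part of the original post.

```perl
use strict;
use warnings;
use Test::More;

# Hypothetical helper: returns the trailing digit of a proper word,
# or undef if the word is not exactly seven zeroes plus one digit.
sub rightmost_digit {
    my ($word) = @_;
    my ($digit) = $word =~ m{ \A 0{7} (\d) \z }xms;
    return $digit;
}

# Every proper word must yield its own last digit.
for my $d (0 .. 9) {
    is( rightmost_digit("0000000$d"), $d, "0000000$d is ok" );
}

# Everything else must be rejected.
for my $bad (qw(0 0000000 000000000 FFFFFFFF 6C163512 x00000000 00000000x), '') {
    ok( !defined( rightmost_digit($bad) ), "'$bad' is bad" );
}

done_testing;
```

Run with prove or plain perl, each case is then reported individually instead of being checked by eye.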

Give a man a fish: <%-{-{-{-<


Re: Match all Non-0 and Letters
by Athanasius (Archbishop) on Jun 24, 2017 at 07:23 UTC

Hello arblargan, and welcome to the Monastery!

Assuming your “words” are separated by whitespace within each line, the following should do what you want:

use strict;
use warnings;

OUTER: while (my $line = <DATA>)
{
    my   @words = split /\s+/, $line;

    for (@words)
    {
        next OUTER unless /^0{7}\d$/;
    }

    print $line;
}

__DATA__
00000000 00000001 00000009
00000006 FFFFFFFF 00000007
6C163512 00000000 00000008
00000003 00000004 01020102

Output:

17:21 >perl 1786_SoPW.pl
00000000 00000001 00000009

17:22 >
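
The loop above rejects a whole line as soon as one word fails. The same check can be phrased as a predicate, sketched here under some assumptions: the sub name line_is_clean is hypothetical, and unlike the loop it treats a line with no words at all as not clean.

```perl
use strict;
use warnings;

# Hypothetical predicate: true only if the line has at least one word and
# every whitespace-separated word is seven zeroes followed by one digit.
sub line_is_clean {
    my ($line) = @_;
    my @words = split ' ', $line;      # split ' ' also skips leading whitespace
    return 0 unless @words;            # assumption: an empty line is not clean
    my $bad = grep { !/\A0{7}\d\z/ } @words;   # count of non-conforming words
    return $bad ? 0 : 1;
}

for my $line ('00000000 00000001 00000009', '00000006 FFFFFFFF 00000007') {
    print "$line\n" if line_is_clean($line);
}
```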

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re: Match all Non-0 and Letters
by Laurent_R (Canon) on Jun 24, 2017 at 09:11 UTC

arblargan, as other monks have already mentioned, all you really need is a single regex such as /0{7}\d/ (or perhaps /^0{7}\d$/ if the word you get is just the number).

You could, however, split your word into two parts as you did, but you made a logical error: you should have an "or", not an "and", in your condition for detecting a corrupt word, because you want to detect if the first part is not made of 0 OR if the second part is not a digit. So, you might fix your code as follows:

my $Disc = get_word();
my $D1 = substr($Disc,0,7);
my $D2 = substr($Disc,7,1);
print "Word $Disc is corrupt!\n" if $D1 !~ /0+/ or $D2 !~ /[0-9]+/;

But this was just to explain the error in your code; /0{7}\d/ is much simpler and better.

Update: this was intended to show the logical error ("and" instead of "or"). As pointed out by AnomalousMonk just below, the regexes are also wrong in terms of the intended purpose described in the original post.


Re^2: Match all Non-0 and Letters

by AnomalousMonk (Archbishop) on Jun 24, 2017 at 17:34 UTC

... this was just to explain the error in your code ... /0{7}\d/ is much simpler and better.

I understand that the intended purpose of the code example is very limited, but I think it's very important to point out that the $D1 !~ /0+/ test ("if there is not at least one '0' in the first 7 digits") is also a fundamental error.
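
A short sketch of why that test is too weak: an unanchored /0+/ is satisfied by a single '0' anywhere, so a corrupted prefix such as '1230456' slips through. The sub names and sample values below are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers for comparison.
# Weak test: "corrupt" only if there is no '0' at all in the prefix.
sub prefix_bad_weak   { return $_[0] !~ /0+/ ? 1 : 0 }

# Stricter test: "corrupt" unless the prefix is exactly seven zeroes.
sub prefix_bad_strict { return $_[0] !~ /\A0{7}\z/ ? 1 : 0 }

for my $d1 (qw(0000000 1230456 FFFFFFF)) {
    printf "%s weak=%d strict=%d\n",
        $d1, prefix_bad_weak($d1), prefix_bad_strict($d1);
}
```

For '1230456' the weak test reports 0 (not corrupt) while the strict test reports 1.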

Give a man a fish: <%-{-{-{-<


Re^3: Match all Non-0 and Letters

by Laurent_R (Canon) on Jun 24, 2017 at 17:58 UTC

AnomalousMonk, perhaps something like:

print "Word $Disc is corrupt!\n" if $D1 !~ /^0{7}$/ or $D2 !~ /[0-9]/;

or:

print "Word $Disc is corrupt!\n" if ($D1 ne '0' x 7) or $D2 !~ /[0-9]/;

Update:

s/instead or "or"/instead of "or"/;

Discipulus


Re^4: Match all Non-0 and Letters

by AnomalousMonk (Archbishop) on Jun 24, 2017 at 18:44 UTC

Re: Match all Non-0 and Letters
by AnomalousMonk (Archbishop) on Jun 24, 2017 at 08:16 UTC

It's not clear to me just what you want. If you want to extract from a line all "normal" words skipping other words, try something like this:

c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le
"my $normal = qr{ 0{7} [0-9] }xms;
 ;;
 my $line = '00000000 FFFFFFFF 00000001 6C163512 00000002 '
          . 'ffffffff 00000003 0000009 00000004 000000009 '
          . '0 00 000 0000 00000 000000 0000000 000000000 '
          . '00000005'
          ;
 print qq{line: '$line'};
 ;;
 my @all_ok = $line =~ m{ \b $normal \b }xmsg;
 dd \@all_ok;
"
line: '00000000 FFFFFFFF 00000001 6C163512 00000002 ffffffff 00000003 0000009 00000004 000000009 0 00 000 0000 00000 000000 0000000 000000000 00000005'
[
  "00000000",
  "00000001",
  "00000002",
  "00000003",
  "00000004",
  "00000005",
]

Give a man a fish: <%-{-{-{-<


Re: Match all Non-0 and Letters
by anonymized user 468275 (Curate) on Jun 25, 2017 at 07:15 UTC

What about just converting it to decimal instead, e.g. see https://perldoc.perl.org/functions/hex.html

Update: If you want to limit the data to a range of values, you should STILL convert from hex to decimal first and then apply the test. In other words just forget the idea that fffffff is corrupt because e.g. 0000000A is only 10 in decimal - quite a low value and you might want to include the value 10!
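
A minimal sketch of that suggestion, using Perl's built-in hex function. The upper limit of 9 and the sub name in_range are assumptions for illustration. Note that under this approach a word such as 0000000A counts as the value 10 rather than as corruption, which is a different rule from the one stated in the original question.

```perl
use strict;
use warnings;

# Hypothetical range check: convert the hex word to a number, then compare.
# Strings that are not exactly 8 hex digits are rejected before calling hex().
sub in_range {
    my ($word, $max) = @_;
    return 0 unless $word =~ /\A[0-9A-Fa-f]{8}\z/;
    return hex($word) <= $max ? 1 : 0;
}

for my $word (qw(00000003 0000000A FFFFFFFF 6C163512)) {
    printf "%s = %u, in range 0..9: %s\n",
        $word, hex($word), in_range($word, 9) ? 'yes' : 'no';
}
```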

One world, one people


Re^2: Match all Non-0 and Letters

by haukex (Archbishop) on Jun 26, 2017 at 17:58 UTC

'corruption' - apparently just because the data is hexadecimal rather than decimal

I think the OP was quite specific in the definition of the input format - "a normal word will be 7 0's followed by a number between 0-9 (8-digits total)". To put some perspective on this from an ECE point of view, I find this kind of corruption is completely "normal", for example, in an RS-232 or wireless serial data stream corrupted by noise. Simply skipping the obviously corrupted values until a good value is seen is a valid approach to regaining synchronization with the stream. Of course there are ways to add error detection and/or correction encodings to the stream on the transmitting end so that corruption is less likely to go unnoticed in the first place, but a large number of "modern" devices I've worked with still don't do this.
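
The skip-until-good approach might be sketched as follows. The sub name good_words and the sample stream are hypothetical; a real serial reader would obtain its words differently.

```perl
use strict;
use warnings;

# Hypothetical filter: keep only words that are seven zeroes plus one digit,
# silently skipping anything else (treated here as noise).
sub good_words {
    my (@words) = @_;
    return grep { /\A0{7}\d\z/ } @words;
}

my @stream = qw(00000001 FFFFFFFF 6C163512 00000002 0000000x 00000003);
print join(' ', good_words(@stream)), "\n";
```

With this sample stream only 00000001, 00000002 and 00000003 survive.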

