http://qs321.pair.com?node_id=11146840

htmanning has asked for the wisdom of the Perl Monks concerning the following question:

Monks, I have a log with a field called $text. I recognize 4 digit numbers in $text which represent apartment numbers. I bold those numbers and link to another script like this:

my $digits_4 = qr{ (?<!-)\b[0-9]{4}\b(?!-) }xms; $text =~ s{ ($digits_4) } {<a href="apartments.pl?do_what=view&unit=$1"><b>$1</b></a>} +xmsg;

It works, but it sometimes tags other 4 digits numbers like a year or part of a phone number. I tried setting all unit numbers to a var then using an unless statement like this:

$unitslist_4 = "1001,1002,1003,1004,1101,1102,1103,1104,1201,1202,1203 +,1204,1301,1302,1303,1304,1401,1402,1403,1404, 1501,1502,1503,1504,1601,1602,1603,1604,1701,1702,1703,1704,1801,1802, +1803,1804,1901,1902,1903,1904,2001,2002,2003,2004, 2101,2102,2103,2104,2201,2202,2203,2204,2301,2302,2303,2304,2401,2402, +2403,2501,2502,2503,2504,2505"; unless ($digits_4 =~ /$unitslist/) { $text =~ s{ ($digits_4) } {<a href="apartments.pl?do_what=view&unit=$1"><b>$1</b></a>} +xmsg; }
It works for the first few $text fields, but then stops working altogether. This can't be the right way to do something like this. Any help would be appreciated.

Replies are listed 'Best First'.
Re: Text switching
by Fletch (Bishop) on Sep 13, 2022 at 01:42 UTC

    If you've got a list of valid apt numbers probably the most efficient thing to do would be create a hash and use that to validate (although declaring it with a comma separated list likewise is a minor kludge, but you get the idea):

    my %valid_apt_numbers; for my $unit ( split( /,/, q{1001,1002,...YADDAYADDA...,2505} ) ) { $valid_apt_numbers{ $unit } = 1; } if( exists $valid_apt_numbers{ $digits_4 } ) { $text =~ s{\b($digits_4)\b}{<a href=YADDAYADDA</a>}xmsg; }

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

      If you've got a list of valid apt numbers probably the most efficient thing to do would be create a hash and use that to validate

      The danger here is when a valid apartment number (either now or in the future) is also a valid year.

      The OP appears to have listed 2022 as being an apartment number - and that's certainly a valid year...

      if( exists $valid_apt_numbers{ $digits_4 } ) {
        $text =~ s{\b($digits_4)\b}{<a href=YADDAYADDA</a>}xmsg;
      }

      If $digits_4 is defined as in the OP, i.e., as a Regexp object, I don't see how its stringization will ever match any key in %valid_apt_numbers, all of which are "unit" numbers of the format '1001', '2202', etc.


      Give a man a fish:  <%-{-{-{-<

        DERP, good point; for some reason I read that as having the matched number in it. Handwaving attempt below using /e and adding a bit of logic to the RHS of the substitution would work. Still has the problem of the fuzziness that it's non-trivial to be sure you're only pulling apartment numbers with just the regex and not getting years or what not.

        A better solution would be to annotate explicitly in the source text as mentioned elsewhere. Better still use a real templating solution to explicitly mark up things. (e.g. TT like See more about [% apt_link( 1002 ) %] )

        #!/usr/bin/env perl use 5.034; my $text = <<'EOT'; This is an appt: 1203 So is 1001. But 8675 is not. EOT chomp($text); my $digits_4_re = qr{ (?<!-)\b[0-9]{4}\b(?!-) }xms; my %valid_appt_numbers = ( 1001 => 1, 1203 => 1, ); sub _anchor_if_apt { my $candidate = shift; if ( exists $valid_appt_numbers{$candidate} ) { return _anchor_for_num($candidate); } else { return $candidate; } } sub _anchor_for_num { my $unit = shift; return qq{<a href="apartment.pl?do_what=view&unit=$unit"><b>$unit< +/b></a>}; } $text =~ s{($digits_4_re)}{ _anchor_if_apt( $1 ) }xsmeg; say $text; exit 0; __END__ $ perl texty.plx This is an appt: <a href="apartment.pl?do_what=view&unit=1203"><b>1203 +</b></a> So is <a href="apartment.pl?do_what=view&unit=1001"><b>1001</b></a>. But 8675 is not.

        The cake is a lie.
        The cake is a lie.
        The cake is a lie.

      Wow, thanks. It works, but only on the first $text field. This whole section is in a while loop. So we search the database then print the latest 50 records after doing the text switching like this:
      while (($pointer = $sth->fetchrow_hashref) && ($current_count <= $stop +count)){ $current_count++; if ($current_count >= $startcount && $current_count <= $st +opcount) { my %valid_apt_numbers; for my $unit ( split( /,/, q{1001,1002,1003,1004,1101,1102,1103,1104,1 +201,1202,1203,1204,1301,1302,1303,1304,1401,1402,1403,1404, 1501,1502,1503,1504,1601,1602,1603,1604,1701,1702,1703,1704,1801,1802 +,1803,1804,1901,1902,1903,1904,2001,2002,2003,2004, 2101,2102,2103,2104,2201,2202,2203,2204,2301,2302,2303,2304,2401,2402 +,2403,2501,2502,2503,2504,2505} ) ) { $valid_apt_numbers{ $unit } = 1; } if( exists $valid_apt_numbers{ $digits_4 } ) { $text =~ s{\b($digits_4)\b}{<a href="apartments.pl?do_what=view&unit +=$1"><b>$1</b></a>}xmsg; } print qq~ $text ~; } }
Re: Text switching (updated)
by AnomalousMonk (Archbishop) on Sep 14, 2022 at 05:22 UTC
    my $digits_4 = qr{ (?<!-)\b[0-9]{4}\b(?!-) }xms;
    ...
    $unitslist_4 = "1001,1002,1003,1004,...,2504,2505";
    ...
    unless ($digits_4 =~ /$unitslist/) { ... }

    $digits_4 =~ /$unitslist/   seems to be matching the stringization of Regexp (regex or qr//) object $digits_4 to a string $unitslist of all possible unit numbers (assuming $unitslist is the same as $unitslist_4) that has been compiled as a match regex. I don't understand the purpose of this.

    unless ($digits_4 =~ /$unitslist/) { ... }   is doing the match described above and executing a conditional block (which contains the substitution) if there is no match. I don't understand this either.

    A simple approach to the general problem would be to uniquely mark all unit number sub-strings: %1234% or {{1234}}. This would also allow for easy support of unit numbers like 123A or 12-B or C-4 or even Penthouse! Substitution matching then becomes straightforward and unambiguous.

    Update: A spelling correction, and a few minor wording changes for clarity.


    Give a man a fish:  <%-{-{-{-<

Re: Text switching
by duelafn (Parson) on Sep 13, 2022 at 13:02 UTC

    You can replace all unit numbers at once if you build your regex pattern out of them. For instance:

    use strict; my $unitslist = join "|", qw( 1001 1002 1003 1004 1101 1102 1103 1104 1201 1202 1203 1204 1301 130 +2 1303 1304 1401 1402 1403 1404 1501 1502 1503 1504 1601 1602 1603 160 +4 1701 1702 1703 1704 1801 1802 1803 1804 1901 1902 1903 1904 2001 200 +2 2003 2004 2101 2102 2103 2104 2201 2202 2203 2204 2301 2302 2303 230 +4 2401 2402 2403 2501 2502 2503 2504 2505 ); my $text = "Visit units 1101 or 2202, call us at 555-555-5555\n"; $text =~ s{ \b($unitslist)\b } {<a href="apartments.pl?do_what=view&unit=$1"><b>$1</b></a>} +xg; print $text;

    Update: Added \b boundary matches to pattern matcher

    Good Day,
        Dean

      Unfortunately, that doesn't seem to me to eliminate the problem entirely.

      Win8 Strawberry 5.8.9.5 (32) Tue 09/13/2022 15:47:56 C:\@Work\Perl\monks >perl use strict; use warnings; # use Data::Dump qw(dd); # for debug my $unitslist = join "|", qw( 1001 1002 1003 1004 1101 1102 1103 1104 1201 1202 1203 1204 1301 130 +2 1303 1304 1401 1402 1403 1404 1501 1502 1503 1504 1601 1602 1603 160 +4 1701 1702 1703 1704 1801 1802 1803 1804 1901 1902 1903 1904 2001 200 +2 2003 2004 2101 2102 2103 2104 2201 2202 2203 2204 2301 2302 2303 230 +4 2401 2402 2403 2501 2502 2503 2504 2505 ); my $text = "Visit units 1101 or 2202, call us at 555-555-2202 and call before 13 Aug, 2202\n"; $text =~ s{ \b($unitslist)\b } {\n<a href="apartments.pl?do_what=view&unit=$1"><b>$1</b></a +>\n}xg; print $text; ^Z Visit units <a href="apartments.pl?do_what=view&unit=1101"><b>1101</b></a> or <a href="apartments.pl?do_what=view&unit=2202"><b>2202</b></a> , call us at 555-555- <a href="apartments.pl?do_what=view&unit=2202"><b>2202</b></a> and call before 13 Aug, <a href="apartments.pl?do_what=view&unit=2202"><b>2202</b></a>
      Indeed, it doesn't seem as if the problem can be entirely eliminated unless input text can be specified to be much more specialized. E.g., uniquely delimit all unit number sub-strings: %1234% or {{1234}}. This would also allow for easy support of unit numbers like 123A or 12-B.

      It's possible to somewhat mitigate the problems associated with completely free-form text by adding more boundary conditions.

      Win8 Strawberry 5.8.9.5 (32) Wed 09/14/2022 0:21:27 C:\@Work\Perl\monks >perl use strict; use warnings; # use Data::Dump qw(dd); # for debug my ($rx_all_units) = map qr{ (?<! [-.:]) \b (?: $_) \b (?! [-.:]) }xms, join '|', reverse sort qw( 1001 1002 1003 1004 1101 1102 1103 1104 1201 1202 1203 1204 1301 1 +302 1303 1304 1401 1402 1403 1404 1501 1502 1503 1504 1601 1602 1603 1 +604 1701 1702 1703 1704 1801 1802 1803 1804 1901 1902 1903 1904 2001 2 +002 2003 2004 2101 2102 2103 2104 2201 2202 2203 2204 2301 2302 2303 2 +304 2401 2402 2403 2501 2502 2503 2504 2505 ); my $text = "Visit units 1101 or 2202, call us at 555-555-2202 and call before 13 Aug, 2202\n"; $text =~ s{ ($rx_all_units) } {\n<a href="apartments.pl?do_what=view&unit=$1"><b>$1</b></a +>\n}xg; print $text; ^Z Visit units <a href="apartments.pl?do_what=view&unit=1101"><b>1101</b></a> or <a href="apartments.pl?do_what=view&unit=2202"><b>2202</b></a> , call us at 555-555-2202 and call before 13 Aug, <a href="apartments.pl?do_what=view&unit=2202"><b>2202</b></a>


      Give a man a fish:  <%-{-{-{-<