http://qs321.pair.com?node_id=11119327

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

FIRST: the example code (see Advanced Perl Programming, p. 266, in sub pick_word{}):

($page , $section) = ( $line =~ /^(\w+) (\(.*?\))?/)

This works fine for a manpage of the form 'ftpd(8)'. What about a page such as 'dhcp-config(5)'? The '-' is not in the \w class. What about 'Cache::Cache(3)' and 'Tk::widget::demo(3)'? Here '::' is not in the class either. In all cases, I need to capture the page name in () and assign it to $page, as was done in the above example.

Re: HELP! I am in regex-hell
by LanX (Saint) on Jul 14, 2020 at 23:53 UTC
    Use a character class, or alternation with |, to include the missing characters.

    But your regex seems wrong anyway, since there is a space in between the two capture groups.

    DB<1> $line = 'Cache::Cache(3)'
    DB<2> x ($page , $section) = ( $line =~ /^((?:\w|:|-)+)(\(.*?\))?/)
    0  'Cache::Cache'
    1  '(3)'
    DB<3> $line = 'dhcp-config(5)'
    DB<4> x ($page , $section) = ( $line =~ /^([\w:-]+)(\(.*?\))?/)
    0  'dhcp-config'
    1  '(5)'
    DB<5>
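
As a rough sketch, the character-class pattern from the debugger session above can be wrapped in a small helper and run over all four page names from the question, plus one name without a section. The helper name parse_manpage() is made up for this illustration and does not come from the book or the original code.

```perl
use strict;
use warnings;

# parse_manpage() is a hypothetical helper name used only for this sketch.
sub parse_manpage {
    my ($line) = @_;
    # Page: word characters, colons and hyphens.
    # Section: optional, captured together with its parentheses.
    my ($page, $section) = $line =~ /^([\w:-]+)(\(.*?\))?/;
    return ($page, $section);
}

for my $line (qw{ ftpd(8) dhcp-config(5) Cache::Cache(3) Tk::widget::demo(3) ftpd }) {
    my ($page, $section) = parse_manpage($line);
    # $section is undef when the line has no "(...)" part.
    printf "%-20s page=%s section=%s\n", $line, $page, $section // '(none)';
}
```

The hyphen sits at the end of the class so that it is taken literally rather than as a range operator.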

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

      Eh Rolf -- thanks! I messed with the following line:

      ($page , $section) = ( $line =~ /^((?:\w|:|-)+)(\(.*?\))?/)

      This did what I wanted. I needed to keep "\(.*?\)" because that block is used elsewhere in the code. I found it amazing that the first chunk was so simple. I was over-complicating the matter...

      DOH! ... The space was not intended. I can't keep all those () straight while reading my code.

Re: HELP! I am in regex-hell
by kcott (Archbishop) on Jul 15, 2020 at 07:42 UTC

    You could simply exclude the opening parenthesis from the page capture and the closing parenthesis from the section capture like so:

    /^([^(]+)\(([^)]*)/

    Here's a test with your four examples:

    $ perl -E '
        my @manpages = qw{
            ftpd(8) dhcp-config(5) Cache::Cache(3) Tk::widget::demo(3)
        };
        for my $line (@manpages) {
            my ($page , $section) = $line =~ /^([^(]+)\(([^)]*)/;
            say "page[$page] section[$section]";
        }
    '
    page[ftpd] section[8]
    page[dhcp-config] section[5]
    page[Cache::Cache] section[3]
    page[Tk::widget::demo] section[3]
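
One thing to keep in mind: the pattern above requires an opening parenthesis, so a bare page name such as 'ftpd' does not match at all, and the section is captured without its parentheses. If lines without a section can occur in your input (an assumption, since the question only shows names with sections), a possible variation makes the whole section group optional. The name split_page_section() is invented for this sketch.

```perl
use strict;
use warnings;

# split_page_section() is a hypothetical name used only for this sketch.
sub split_page_section {
    my ($line) = @_;
    # Page: everything up to the first "(" or whitespace.
    # Section: optional, captured without the surrounding parentheses.
    my ($page, $section) = $line =~ /^([^(\s]+)(?:\(([^)]*)\))?/;
    return ($page, $section);
}

for my $line ('ftpd(8)', 'Tk::widget::demo(3)', 'ftpd') {
    my ($page, $section) = split_page_section($line);
    # An absent section comes back as undef; print it as an empty string.
    print "page[$page] section[", $section // '', "]\n";
}
```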

    — Ken

Re: HELP! I am in regex-hell
by AnomalousMonk (Archbishop) on Jul 15, 2020 at 06:08 UTC

    Another way. This approach uses highly factored and specific regexes to achieve a high degree of discrimination, if that's what you want! It's easy to add further, highly specialized regexes. ($section is returned as '' (empty string) if no section is present, rather than as undef.) Optional whitespace may exist between the page and section sub-fields. Note that with the right pattern anchors, multiple page/section fields can be extracted from a single string/line.

    c:\@Work\Perl\monks>perl -wMstrict -le
    "use Data::Dump qw(dd);
     ;;
     my $rx_simple = qr{ [[:alpha:]] [[:alnum:]]* (?: - [[:alnum:]]+)* }xms;
     my $rx_module = qr{ [[:upper:]] [[:alpha:]]* (?: :: [[:upper:]] [[:alpha:]]*)* }xms;
     my $rx_page   = qr{ $rx_simple | $rx_module }xms;
     ;;
     my $rx_section = qr{ [(] \d* [)] }xms;
     ;;
     for my $line (qw(
         ftpd(8)  ftpd  dhcp-config(5)  dhcp-config
         foo2  foo2(2)  foo-2
         Cache::Cache(3)  Cache::Cache  Foo::Bar::Baz(42)  Foo::Bar::Baz
         ),
         'ftpd (8)', 'dhcp-config (5)', 'Cache::Cache (3)',
         qw(-foo  foo-  %^&*@!  123  1foo  foo--bar),
         )
       {
       my $got_page_section =
           my ($page, $section) =
           $line =~ m{ \A ($rx_page) \s* ($rx_section?) \z }xms;
     ;;
       $page = $section = '???' unless $got_page_section;
     ;;
       print qq{'$line' -> '$page' '$section'};
       }
     ;;
     my $line = 'ftpd(8) -no dhcp-config no- dhcp-config (5) -- Foo::Bar::Baz(42) (999)';
     my @pages;
     push @pages, [ $1, $2 ]
         while $line =~ m{ (?<! \S) ($rx_page) \s* ($rx_section?) (?! \S) }xmsg;
     dd \@pages;
    "
    'ftpd(8)' -> 'ftpd' '(8)'
    'ftpd' -> 'ftpd' ''
    'dhcp-config(5)' -> 'dhcp-config' '(5)'
    'dhcp-config' -> 'dhcp-config' ''
    'foo2' -> 'foo2' ''
    'foo2(2)' -> 'foo2' '(2)'
    'foo-2' -> 'foo-2' ''
    'Cache::Cache(3)' -> 'Cache::Cache' '(3)'
    'Cache::Cache' -> 'Cache::Cache' ''
    'Foo::Bar::Baz(42)' -> 'Foo::Bar::Baz' '(42)'
    'Foo::Bar::Baz' -> 'Foo::Bar::Baz' ''
    'ftpd (8)' -> 'ftpd' '(8)'
    'dhcp-config (5)' -> 'dhcp-config' '(5)'
    'Cache::Cache (3)' -> 'Cache::Cache' '(3)'
    '-foo' -> '???' '???'
    'foo-' -> '???' '???'
    '%^&*@!' -> '???' '???'
    '123' -> '???' '???'
    '1foo' -> '???' '???'
    'foo--bar' -> '???' '???'
    [
      ["ftpd", "(8)"],
      ["dhcp-config", ""],
      ["dhcp-config", "(5)"],
      ["Foo::Bar::Baz", "(42)"],
    ]
    (Update: A thorough test plan (see Test::More and friends) will give you confidence that whatever solution you choose actually will match what you want and reject what you don't want.)
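
A minimal sketch of what such a test plan might look like with Test::More. It assumes the character-class pattern from the first reply as the regex under test, and the case table is only a starting point: swap in whichever pattern you settle on and extend the cases to cover your real input.

```perl
use strict;
use warnings;
use Test::More;

# Pattern under test (assumed here: the character-class version from the first reply).
my $rx = qr/^([\w:-]+)(\(.*?\))?/;

# Each case: input line, expected page, expected section (undef = no section).
my @cases = (
    [ 'ftpd(8)',             'ftpd',             '(8)' ],
    [ 'dhcp-config(5)',      'dhcp-config',      '(5)' ],
    [ 'Cache::Cache(3)',     'Cache::Cache',     '(3)' ],
    [ 'Tk::widget::demo(3)', 'Tk::widget::demo', '(3)' ],
    [ 'ftpd',                'ftpd',             undef ],
);

for my $case (@cases) {
    my ($line, $want_page, $want_section) = @$case;
    my ($page, $section) = $line =~ $rx;
    is($page,    $want_page,    "page of '$line'");
    is($section, $want_section, "section of '$line'");
}

# Input that should be rejected outright.
ok('(8)' !~ $rx, 'a bare section is not a page');

done_testing();
```

Keeping the cases in a table makes it cheap to add each new page name that trips up the regex, including names that must be rejected.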


    Give a man a fish:  <%-{-{-{-<