http://qs321.pair.com?node_id=11137036

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Parsing an HTML page, I have the contents of a set of paragraphs in the general form:

Integer. Text - or – more text

e.g.

123. The quick brown fox - Jumps over the
123. The quick brown fox – Jumps over the
456. The lazy dog - Barked & wagged

results

$num = 123;
$text1 = 'The quick brown fox';
$text2 = 'Jumps over the';
Right now I'm capturing the three variables I need using index & substr a bunch but was hoping for a cleaner, more perlish solution. TIA

Re: Parsing/regex help required
by roboticus (Chancellor) on Sep 27, 2021 at 13:32 UTC

    You generally need to figure out how to describe the problem to yourself to guide yourself to a solution. You didn't present any requirements, but let's assume from your example that you want to recognize lines that are numbered (i.e., begin with a number followed by a period) and include a hyphen surrounded by whitespace.

    There are several ways you can accomplish it. You've already mentioned index and substr; another way could be to use split or, as you mention in the title, a regular expression.
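As a rough sketch of the split approach mentioned above (not from the original post): it assumes every line has exactly the "number, period, text, hyphen, text" shape and does no validation, so a line without a period or a spaced hyphen would leave some variables undefined.

```perl
use strict;
use warnings;

my $str = "123. The quick brown fox - Jumps over the";

# First split: the number vs. the rest, on the first period plus whitespace.
# The limit of 2 stops later periods from splitting the text again.
my ($num, $rest) = split /\.\s+/, $str, 2;

# Second split: the two text parts, on the first whitespace-hyphen-whitespace.
my ($text1, $text2) = split /\s+-\s+/, $rest, 2;

print "num=$num, text1=<$text1>, text2=<$text2>\n";
```

The regular expression has the advantage of telling you whether the line matched at all, which is why it is probably the better fit for real page content.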

    For a regular expression, you just build the expression bit by bit, like this:

    $ cat t.pl
    use strict;
    use warnings;

    my $str = "123. The quick brown fox - Jumps over the";

    if ($str =~ /^          # start of line\/string
                (\d+)       # capture one or more digits
                \.\s+       # a literal period followed by some space
                (.*)        # some characters
                \s+-\s+     # some space, a hyphen and more space
                (.*)        # more characters
                $           # end of the line or string
               /x) {        # x means allow whitespace and comments in regex
        my ($num, $text1, $text2) = ($1, $2, $3);
        print "num=$num, text1=<$text1>, text2=<$text2>\n";
    }
    else {
        print "No match!\n";
    }

    $ perl t.pl
    num=123, text1=<The quick brown fox>, text2=<Jumps over the>

    The parentheses tell perl to capture the part of the string you care about, so later if you find a match, you can use the matched parts. The first capture group will be in variable $1, the next in $2 and so on. A normal perl installation will have a good bit of documentation on regular expressions, so be sure to look it over.

    Don't forget that you can check the perl documentation index via perldoc perl to see which documents may be helpful at a given time.
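A variation on the numbered captures described above, offered only as a sketch: Perl 5.10 and later also accept named groups, whose values appear in the %+ hash after a successful match. The group names num, text1 and text2 are illustrative choices, not anything from the original post.

```perl
use strict;
use warnings;

my $str = "123. The quick brown fox - Jumps over the";

my ($num, $text1, $text2);

if ($str =~ /^ (?<num>\d+) \. \s+ (?<text1>.*) \s+-\s+ (?<text2>.*) $/x) {
    # Named groups are read from %+ instead of $1, $2, $3.
    ($num, $text1, $text2) = ($+{num}, $+{text1}, $+{text2});
    print "num=$num, text1=<$text1>, text2=<$text2>\n";
}
else {
    print "No match!\n";
}
```

Named groups cost a little typing but keep the assignments correct if the pattern later gains or loses a group.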

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      Thanks, how to make this conditional, either hyphen (-) or I want to say en dash(–)?
      \s+-\s+    # some space, a hyphen and more space
        "how to make this conditional, either hyphen (-) or I want to say en dash(–)?"

        Just replace the single hyphen in your regex with a character class containing all possible dashes, hyphens, etc. In the character class, always put an ASCII hyphen as the last character or you'll generate a range. See perlrecharclass and, in particular, the "Bracketed Character Classes" section for much more detailed information.

        An example script follows but, first, some notes:

        • The open pragma indicates that output to stdout should use UTF-8. This also avoids the "Wide character in print ..." warning.
        • I've used a mix of \x{...} and \N{...} to show some alternatives. Don't do this in your real code as it's likely to be confusing: pick one format and stick with that.
        #!/usr/bin/env perl

        use strict;
        use warnings;
        use open OUT => qw{:encoding(UTF-8) :std};

        my ($en_dash, $em_dash) = ("\x{2013}", "\N{EM DASH}");
        my $str = "a-b${en_dash}c${em_dash}d";
        my $re = qr{[\N{EN DASH}\x{2014}-]};

        print "Original string: $str\n";
        print "Parts separated by some dash:\n";
        print "$_\n" for split $re, $str;

        Output:

        Original string: a-b–c—d
        Parts separated by some dash:
        a
        b
        c
        d

        Because the hyphen and dashes are not easily distinguishable, here's the same output piped through cat -vet. Don't worry too much if you don't understand the codes; just notice that they are different.

        $ ./pm_11137036_re_alt_dashes.pl | cat -vet
        Original string: a-bM-bM-^@M-^ScM-bM-^@M-^Td$
        Parts separated by some dash:$
        a$
        b$
        c$
        d$

        See also these Unicode® resources: the PDF "Code Chart: General Punctuation -- Range: 2000–206F"; and, for characters referenced therein but not in that range, "Unicode 14.0 Character Code Charts" (note the "Find chart by hex code:" near the top of the page).
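To connect the character class back to the original question, here is a sketch that drops it into the full pattern and runs it over lines modelled on the question's samples. It is an illustration under assumptions: the strings are already decoded character strings, \N{EN DASH} stands in for the pasted dash, and a Perl recent enough to load \N{...} names automatically (5.16 or later) is in use.

```perl
use strict;
use warnings;
use open OUT => qw{:encoding(UTF-8) :std};

# Sample lines modelled on the question.
my @lines = (
    "123. The quick brown fox - Jumps over the",
    "123. The quick brown fox \N{EN DASH} Jumps over the",
    "456. The lazy dog - Barked & wagged",
);

# The ASCII hyphen goes last in the class so it is literal, not a range.
my $re = qr{^ (\d+) \. \s+ (.*?) \s+ [\N{EN DASH}\N{EM DASH}-] \s+ (.*) $}x;

my @parsed;
for my $line (@lines) {
    if (my ($num, $text1, $text2) = $line =~ $re) {
        push @parsed, [$num, $text1, $text2];
        print "num=$num, text1=<$text1>, text2=<$text2>\n";
    }
    else {
        print "No match: $line\n";
    }
}
```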

        — Ken

        Perhaps you mean "em dash" instead of "en dash"?
        This is called "em" because it is similar to the width of "M" in a variable width font.
        An en dash is shorter, like the width of the letter "n"

        In any event, you will have to be reading using UTF-8 encoding. My dev environment for Perl can only do ASCII. I cannot easily write code for this.

        As far as regex goes:
        You need to group an or'd expression something like this (-|em_dash)
        To make it "non capturing", (?:-|em_dash);

        The question is what "em_dash" should be and how that relates to the decoding that was used when the data was read.

        update: under some coding scenarios an em dash is \x{2014}.
        I think you need "use utf8;" for that to work, but I am not sure.

        Some Monks here are quite experienced with utf8 encoding.
        Bring it on!
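On the "use utf8;" question, a small sketch: \x{...} escapes name code points in plain ASCII source, so they work without that pragma. "use utf8;" only matters when the source file itself contains literal non-ASCII characters. The example assumes the input is already a decoded character string, and the variable name $sep is just an illustrative choice.

```perl
use strict;
use warnings;

# Non-capturing group: hyphen, en dash (U+2013) or em dash (U+2014).
# The source stays pure ASCII, so no "use utf8;" is required.
my $sep = qr{(?:-|\x{2013}|\x{2014})};

my $str = "123. The quick brown fox \x{2013} Jumps over the";

my ($num, $text1, $text2) = $str =~ m{^ (\d+) \. \s+ (.*?) \s+ $sep \s+ (.*) $}x;

print "num=$num, text1=<$text1>, text2=<$text2>\n";
```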

Re: Parsing/regex help required
by Fletch (Bishop) on Sep 27, 2021 at 13:21 UTC

    First, if what you have is HTML, be sure you're using an HTML parser, not a regex, to extract your lines.

    Presumably the numbering isn't generated by, say, an <ol>, and you've actually pulled the text out of whatever nodes (using, say, HTML::TreeBuilder or Mojo::DOM). In that case you could use something like:

    my( $num, $text1, $text2 ) = $line_from_html =~ m{^ (\d+) \. \s+ (.*?) \s+-\s+ (.*?) $}x;

    Edit: Tweaked.

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

      each paragraph text is captured using mojo->all_text so that's all good. Running that code:
      my $entry = "123. The Quick brown fox – jumped over";
      my( $num, $text1, $text2 )= $entry =~ m{^ (\d+) \. \s+ (.*?) \s+-\s+ (.*?) $}x;
      say "$num|$text1|$text2";
      gives
      Use of uninitialized value $num in concatenation (.) or string at ./test.pl line 10.
      Use of uninitialized value $text1 in concatenation (.) or string at ./test.pl line 10.
      Use of uninitialized value $text2 in concatenation (.) or string at ./test.pl line 10.
      ||

        Problem is your dash is a fancy unicode-y en dash, not just a simple "-" character so my naïve attempt's not matching. I had to do some monkeying with Encode cutting and pasting your sample (which I don't think you'd need for Mojo when you're actually fetching your real results) but then I was able to get this to match.

        ## I set $_ to your sample string cut-n-pasted, then ran it through decode
          DB<33> $_ = Encode::decode( q{UTF-8}, $_ )

        ## Afterwards this worked (U+2013 is EN DASH); if you're not interested in what
        ## the separator was you can of course change that bit to non-capturing
          DB<38> x m{ ^ (\d+) \. \s+ (.*?) \s+(-|\N{EN DASH}|\N{EM DASH})\s+ (.*?) $}x
        0  123
        1  'The Quick brown fox'
        2  '\x{2013}'
        3  'jumped over'
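The debugger session above can be sketched as a standalone script to show why the decode step matters. It assumes the sample arrives as raw UTF-8 bytes, as it would when pasted into a terminal or read without a decoding layer. In that form the en dash is three separate bytes (E2 80 93), which no dash alternative can match. The separator group here is the non-capturing variant.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The sample text as undecoded UTF-8 bytes: EN DASH is the bytes E2 80 93.
my $bytes = "123. The Quick brown fox \xE2\x80\x93 jumped over";

my $re = qr{^ (\d+) \. \s+ (.*?) \s+ (?:-|\N{EN DASH}|\N{EM DASH}) \s+ (.*?) $}x;

my @before = $bytes =~ $re;             # bytes, not characters: no match
my $chars  = decode('UTF-8', $bytes);   # now holds a single U+2013 character
my @after  = $chars =~ $re;

printf "before decode: %d captures\n", scalar @before;
printf "after decode:  %d captures\n", scalar @after;
print "$after[0]|$after[1]|$after[2]\n" if @after;
```

Text that already comes back decoded from the HTML parser should not need the decode call. Decoding an already decoded string would be a mistake.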

        The cake is a lie.
        The cake is a lie.
        The cake is a lie.

        This is what I get:

        Win8 Strawberry 5.30.3.1 (64) Mon 09/27/2021 15:56:45
        C:\@Work\Perl\monks
        >perl -Mstrict -Mwarnings -Mfeature=say
        my $entry = "123. The Quick brown fox - jumped over";
        my( $num, $text1, $text2 )= $entry =~ m{^ (\d+) \. \s+ (.*?) \s+-\s+ (.*?) $}x;
        say "$num|$text1|$text2";
        ^Z
        123|The Quick brown fox|jumped over
        Are you sure the code you posted is really the code you're running?


        Give a man a fish:  <%-{-{-{-<