Parsing/regex help required

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Parsing an HTML page, I have the contents of a set of paragraphs in the general form:

Integer. Text - or – more text
e.g.

123. The quick brown fox - Jumps over the
123. The quick brown fox – Jumps over the
456. The lazy dog - Barked & wagged

results

$num = 123;
$text1 = 'The quick brown fox';
$text2 = 'Jumps over the';

Right now I'm capturing the three variables I need with a bunch of index and substr calls, but was hoping for a cleaner, more perlish solution. TIA


Replies are listed 'Best First'.
Re: Parsing/regex help required by roboticus (Chancellor) on Sep 27, 2021 at 13:32 UTC
You generally need to work out how to describe the problem to yourself before you can work toward a solution. You didn't present any requirements, but let's assume from your example that you want to recognize lines that are numbered (i.e., begin with a number followed by a period) and include a hyphen surrounded by whitespace.

There are several ways to accomplish this. You've already mentioned `index` and `substr`; another way is `split`; or, as you mention in the title, a regular expression. For a regular expression, you just build the expression bit by bit, like this:

    $ cat t.pl
    use strict;
    use warnings;

    my $str = "123. The quick brown fox - Jumps over the";

    if ($str =~ /^          # start of line or string
                 (\d+)      # capture one or more digits
                 \.\s+      # a literal period followed by some space
                 (.*)       # some characters
                 \s+-\s+    # some space, a hyphen and more space
                 (.*)       # more characters
                 $          # end of the line or string
                /x) {       # x means allow whitespace and comments in regex
        my ($num, $text1, $text2) = ($1, $2, $3);
        print "num=$num, text1=<$text1>, text2=<$text2>\n";
    }
    else {
        print "No match!\n";
    }

    $ perl t.pl
    num=123, text1=<The quick brown fox>, text2=<Jumps over the>

The parentheses tell Perl to capture the part of the string you care about, so later, if you find a match, you can use the matched parts. The first capture group will be in variable $1, the next in $2 and so on.

A normal Perl installation comes with a good bit of documentation on regular expressions, so be sure to look over `perldoc perlreref` (a quick reference), `perldoc perlretut` (a tutorial), `perldoc perlrequick` (a quick start guide), and there are more, too! Don't forget that you can check the Perl documentation index via `perldoc perl` to see which documents may be helpful at a given time.

...roboticus
When your only tool is a hammer, all problems look like your thumb.
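The `split` route mentioned in this reply could look something like the following. This is only a minimal sketch: the `parse_entry` helper is a hypothetical name (it does not appear in the thread), and it assumes the same line shape as the example, i.e. digits, a period, text, a whitespace-surrounded hyphen, then more text.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): break a numbered line into its parts.
sub parse_entry {
    my ($line) = @_;

    # Peel off the leading integer and the period that follows it.
    my ($num, $rest) = split /\.\s+/, $line, 2;
    return unless defined $rest && $num =~ /^\d+$/;

    # Split the remainder once, on a hyphen surrounded by whitespace.
    my ($text1, $text2) = split /\s+-\s+/, $rest, 2;
    return unless defined $text2;

    return ($num, $text1, $text2);
}

my ($num, $text1, $text2) = parse_entry("123. The quick brown fox - Jumps over the");
print "num=$num, text1=<$text1>, text2=<$text2>\n";
```

The limit of 2 on each `split` keeps any later period or hyphen inside the text instead of producing extra fields. Whether that reads as cleaner than the single regex is a matter of taste.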
Re^2: Parsing/regex help required by Anonymous Monk on Sep 27, 2021 at 14:02 UTC
Thanks. How do I make this conditional, so that it matches either a hyphen (-) or, I want to say, an en dash (–)? `\s+-\s+ # some space, a hyphen and more space`
Re^3: Parsing/regex help required by kcott (Archbishop) on Sep 28, 2021 at 07:50 UTC
"how to make this conditional, either hyphen (-) or I want to say en dash(–)?"

Just replace the single hyphen in your regex with a character class containing all possible dashes, hyphens, etc. In the character class, always put an ASCII hyphen as the last character or you'll generate a range. See perlrecharclass and, in particular, the "Bracketed Character Classes" section for much more detailed information.

An example script follows but, first, some notes:

The open pragma indicates that output to stdout should use UTF-8. This also avoids the "Wide character in print ..." warning.

I've used a mix of `\x{...}` and `\N{...}` to show some alternatives. Don't do this in your real code as it's likely to be confusing: pick one format and stick with that.

    #!/usr/bin/env perl

    use strict;
    use warnings;
    use open OUT => qw{:encoding(UTF-8) :std};

    my ($en_dash, $em_dash) = ("\x{2013}", "\N{EM DASH}");
    my $str = "a-b${en_dash}c${em_dash}d";
    my $re = qr{[\N{EN DASH}\x{2014}-]};

    print "Original string: $str\n";
    print "Parts separated by some dash:\n";
    print "$_\n" for split $re, $str;

Output:

    Original string: a-b–c—d
    Parts separated by some dash:
    a
    b
    c
    d

Because the hyphen and dashes are not easily distinguishable, here's the same output piped through `cat -vet`. Don't worry too much if you don't understand the codes; just notice that they are different.

    $ ./pm_11137036_re_alt_dashes.pl | cat -vet
    Original string: a-bM-bM-^@M-^ScM-bM-^@M-^Td$
    Parts separated by some dash:$
    a$
    b$
    c$
    d$

See also these Unicode® resources: the PDF "Code Chart: General Punctuation -- Range: 2000–206F"; and, for characters referenced therein but not in that range, "Unicode 14.0 Character Code Charts" (note the "Find chart by hex code:" near the top of the page).

— Ken
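Dropping that character class into the numbered-line regex from the question might look like the sketch below. The choice of separators (ASCII hyphen, en dash, em dash) and the name `$entry_re` are assumptions made for illustration. The sample strings come from the original question, with the en dash written as an escape so the source file stays ASCII.

```perl
use strict;
use warnings;

binmode STDOUT, ':encoding(UTF-8)';   # avoid "Wide character in print" warnings

# Assumed separators: en dash (U+2013), em dash (U+2014), ASCII hyphen (kept last).
my $entry_re = qr{
    ^ (\d+)                          # the leading integer
    \. \s+                           # a literal period, then whitespace
    (.*?)                            # first text, as short as possible
    \s+ [\x{2013}\x{2014}-] \s+      # a hyphen or dash surrounded by whitespace
    (.*)                             # the rest of the line
    $
}x;

my @entries = (
    "123. The quick brown fox - Jumps over the",
    "123. The quick brown fox \x{2013} Jumps over the",
    "456. The lazy dog - Barked & wagged",
);

for my $entry (@entries) {
    if (my ($num, $text1, $text2) = $entry =~ $entry_re) {
        print "num=$num, text1=<$text1>, text2=<$text2>\n";
    }
    else {
        print "No match: $entry\n";
    }
}
```

Note that there is no whitespace inside the brackets: under `/x`, whitespace within a bracketed character class is still significant.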
Re^3: Parsing/regex help required by Marshall (Canon) on Sep 28, 2021 at 02:07 UTC
Perhaps you mean "em dash" instead of "en dash"? This is called "em" because it is similar to the width of "M" in a variable width font. An en dash is shorter, like the width of the letter "n". In any event, you will have to be reading using UTF-8 encoding. My dev environment for Perl only can do ASCII. I cannot easily write code for this.

As far as regex goes: You need to group an or'd expression, something like this: (-|em_dash). To make it "non capturing": (?:-|em_dash). The question is what "em_dash" should be and how that relates to the data decoding that was used during the read.

update: under some coding scenarios an em dash is \x{2014}. I think you need "use utf8;" for that to work, but I am not sure. Some Monks here are quite experienced with utf8 encoding. Bring it on!
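On the `use utf8;` question: that pragma only tells Perl that the source file itself contains literal UTF-8 characters. An escape such as `\x{2014}` is plain ASCII in the source, so it works without the pragma. Here is a minimal sketch of the non-capturing alternation, assuming the line being matched is already a character string (built here with an escape); the variable names are illustrative only.

```perl
use strict;
use warnings;

# No "use utf8" needed: the dashes are written as \x{...} escapes, not literal characters.
my $sep = qr{(?:-|\x{2013}|\x{2014})};   # hyphen, en dash (U+2013) or em dash (U+2014)

my $line = "123. The quick brown fox \x{2014} Jumps over the";
my ($num, $text1, $text2) = $line =~ /^(\d+)\.\s+(.*?)\s+$sep\s+(.*)$/;

binmode STDOUT, ':encoding(UTF-8)';
print "$num|$text1|$text2\n";
```

If the dashes were typed literally into the source instead, `use utf8;` would be needed so that Perl reads each one as a single character rather than as several bytes.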
Re: Parsing/regex help required by Fletch (Bishop) on Sep 27, 2021 at 13:21 UTC
First, be sure: if you have HTML, you should be using an HTML parser, not a regex, to extract your lines. Presuming this is a case where the numbering isn't generated by, say, an `<ol>`, and you've actually pulled the text out of whatever nodes (using, say, HTML::TreeBuilder or Mojo::DOM), then you could use something like:

    my( $num, $text1, $text2 ) = $line_from_html =~ m{^ (\d+) \. \s+ (.*?) \s+-\s+ (.*?) $}x;

Edit: Tweaked.

The cake is a lie. The cake is a lie. The cake is a lie.
Re^2: Parsing/regex help required by Anonymous Monk on Sep 27, 2021 at 13:49 UTC
Each paragraph's text is captured using mojo->all_text, so that's all good. Running that code:

    my $entry = "123. The Quick brown fox – jumped over";
    my( $num, $text1, $text2 )= $entry =~ m{^ (\d+) \. \s+ (.*?) \s+-\s+ (.*?) $}x;
    say "$num|$text1|$text2";

gives

    Use of uninitialized value $num in concatenation (.) or string at ./test.pl line 10.
    Use of uninitialized value $text1 in concatenation (.) or string at ./test.pl line 10.
    Use of uninitialized value $text2 in concatenation (.) or string at ./test.pl line 10.
    ||
Re^3: Parsing/regex help required by Fletch (Bishop) on Sep 27, 2021 at 19:50 UTC
Problem is your dash is a fancy unicode-y en dash, not just a simple "-" character, so my naïve attempt's not matching. I had to do some monkeying with Encode when cutting and pasting your sample (which I don't think you'd need with Mojo when you're actually fetching your real results), but then I was able to get this to match.

    ## I set $_ to your sample string cut-n-pasted, then ran it through decode
      DB<33> $_ = Encode::decode( q{UTF-8}, $_ )

    ## Afterwards this worked (U+2013 is EN DASH); if you're not interested in what
    ## the separator was you can of course change that bit to non-capturing
      DB<38> x m{ ^ (\d+) \. \s+ (.*?) \s+(-|\N{EN DASH}|\N{EM DASH})\s+ (.*?) $}x
    0  123
    1  'The Quick brown fox'
    2  '\x{2013}'
    3  'jumped over'
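The effect of that decode step can be reproduced in a small script. This sketch assumes the en dash arrives as its three raw UTF-8 bytes (E2 80 93), as it would from an undecoded read or paste; the variable names are illustrative. The same regex is run against the string before and after `Encode::decode`.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumed input: the en dash is present as three raw UTF-8 bytes, not one character.
my $bytes = "123. The Quick brown fox \xE2\x80\x93 jumped over";
my $re    = qr{^ (\d+) \. \s+ (.*?) \s+ [\x{2013}\x{2014}-] \s+ (.*?) $}x;

my @raw     = $bytes =~ $re;                    # bytes: there is no U+2013 character to find
my @decoded = decode('UTF-8', $bytes) =~ $re;   # characters: the dash is a single code point

binmode STDOUT, ':encoding(UTF-8)';
print "undecoded: ", (@raw ? "match" : "no match"), "\n";
print "decoded:   ", join('|', @decoded), "\n";
```

The undecoded string fails to match because the regex looks for the single character U+2013, while the byte string holds three separate characters in its place. Text returned by a parser that has already decoded the page should not need this step.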
Re^3: Parsing/regex help required by AnomalousMonk (Archbishop) on Sep 27, 2021 at 20:01 UTC
This is what I get:

    Win8 Strawberry 5.30.3.1 (64) Mon 09/27/2021 15:56:45
    C:\@Work\Perl\monks
    >perl -Mstrict -Mwarnings -Mfeature=say
    my $entry = "123. The Quick brown fox - jumped over";
    my( $num, $text1, $text2 )= $entry =~ m{^ (\d+) \. \s+ (.*?) \s+-\s+ (.*?) $}x;
    say "$num|$text1|$text2";
    ^Z
    123|The Quick brown fox|jumped over

Are you sure the code you posted is really the code you're running?

Give a man a fish: `<%-{-{-{-<`
