in reply to Parsing/regex help required

You generally need to figure out how to describe the problem to yourself to guide yourself to a solution. You didn't present any requirements, but let's assume from your example that you want to recognize lines that are numbered (i.e., begin with a number followed by a period) and include a hyphen surrounded by whitespace.

There are several ways to accomplish this. You've already mentioned index and substr; another option is split or, as you mention in the title, a regular expression.
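As a rough sketch of the non-regex routes, assuming a line of the form "NUM. text1 - text2" and assuming the first " - " is the separator. The sub names parse_with_index and parse_with_split are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: locate the pieces with index and cut them out with substr.
sub parse_with_index {
    my ($line) = @_;
    my $dot = index $line, '. ';            # end of the leading number
    return if $dot < 1;
    my $num = substr $line, 0, $dot;
    return if $num =~ /\D/;                 # small digit check on the number part
    my $sep = index $line, ' - ', $dot;     # first " - " after the number
    return if $sep < 0;
    my $text1 = substr $line, $dot + 2, $sep - ($dot + 2);
    my $text2 = substr $line, $sep + 3;
    return ($num, $text1, $text2);
}

# Hypothetical helper: the same idea with two calls to split.
# The limit of 2 means only the first separator is used.
sub parse_with_split {
    my ($line) = @_;
    my ($num, $rest) = split /\.\s+/, $line, 2;
    return unless defined $rest && $num =~ /^\d+$/;
    my ($text1, $text2) = split /\s+-\s+/, $rest, 2;
    return unless defined $text2;
    return ($num, $text1, $text2);
}

my $str = "123. The quick brown fox - Jumps over the";
for my $parser (\&parse_with_index, \&parse_with_split) {
    my ($num, $text1, $text2) = $parser->($str)
        or next;                            # empty list means "no match"
    print "num=$num, text1=<$text1>, text2=<$text2>\n";
}
```

Both helpers split at the first " - ", so lines containing more than one such separator would need a decision about which one counts.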

For a regular expression, you just build the expression bit by bit, like this:

$ cat 
use strict;
use warnings;

my $str = "123. The quick brown fox - Jumps over the";

if ($str =~ /^           # start of line\/string
             (\d+)       # capture one or more digits
             \.\s+       # a literal period followed by some space
             (.*)        # some characters
             \s+-\s+     # some space, a hyphen and more space
             (.*)        # more characters
             $           # end of the line or string
            /x) {        # x means allow whitespace and comments in regex
    my ($num, $text1, $text2) = ($1, $2, $3);
    print "num=$num, text1=<$text1>, text2=<$text2>\n";
}
else {
    print "No match!\n";
}

$ perl 
num=123, text1=<The quick brown fox>, text2=<Jumps over the>

The parentheses tell Perl to capture the parts of the string you care about, so that if you find a match you can use the matched parts later. The first capture group will be in variable $1, the next in $2, and so on. A normal Perl installation includes a good deal of documentation on regular expressions, so be sure to look it over.

Don't forget that you can check the Perl documentation index via perldoc perl to see which documents may be helpful at a given time.


When your only tool is a hammer, all problems look like your thumb.

Re^2: Parsing/regex help required
by Anonymous Monk on Sep 27, 2021 at 14:02 UTC
    Thanks, how to make this conditional, either hyphen (-) or I want to say en dash (–)?
    \s+-\s+    # some space, a hyphen and more space
      "how to make this conditional, either hyphen (-) or I want to say en dash (–)?"

      Just replace the single hyphen in your regex with a character class containing all possible dashes, hyphens, etc. In the character class, always put an ASCII hyphen as the last character or you'll generate a range. See perlrecharclass and, in particular, the "Bracketed Character Classes" section for much more detailed information.
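A tiny sketch of the range pitfall, using made-up ASCII classes so the difference is easy to see. The variable names are illustrative only:

```perl
use strict;
use warnings;

my $range   = qr/[a-c]/;   # hyphen between two characters: the range a, b, c
my $literal = qr/[ac-]/;   # hyphen last: the three characters a, c and -

# Show which characters each class accepts.
for my $char ('a', 'b', 'c', '-') {
    printf "%s  [a-c]: %-3s  [ac-]: %s\n",
        $char,
        ($char =~ $range   ? 'yes' : 'no'),
        ($char =~ $literal ? 'yes' : 'no');
}
```

Only the second class matches a literal hyphen, and it no longer matches "b".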

      An example script follows but, first, some notes:

      • The open pragma indicates that output to stdout should use UTF-8. This also avoids the "Wide character in print ..." warning.
      • I've used a mix of \x{...} and \N{...} to show some alternatives. Don't do this in your real code as it's likely to be confusing: pick one format and stick with that.
      #!/usr/bin/env perl

      use strict;
      use warnings;
      use open OUT => qw{:encoding(UTF-8) :std};

      my ($en_dash, $em_dash) = ("\x{2013}", "\N{EM DASH}");
      my $str = "a-b${en_dash}c${em_dash}d";
      my $re = qr{[\N{EN DASH}\x{2014}-]};

      print "Original string: $str\n";
      print "Parts separated by some dash:\n";
      print "$_\n" for split $re, $str;


      Original string: a-b–c—d
      Parts separated by some dash:
      a
      b
      c
      d

      Because the hyphen and dashes are not easily distinguishable, here's the same output piped through cat -vet. Don't worry too much if you don't understand the codes; just notice that they are different.

      $ ./ | cat -vet
      Original string: a-bM-bM-^@M-^ScM-bM-^@M-^Td$
      Parts separated by some dash:$
      a$
      b$
      c$
      d$
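A sketch of how such a character class might slot into the numbered-line pattern from the parent post. The three sample lines are invented for illustration, and the class here covers only the ASCII hyphen, en dash and em dash; extend it if your data contains other dashes:

```perl
use strict;
use warnings;
use open OUT => qw{:encoding(UTF-8) :std};

# Pattern from the parent post, with the lone hyphen replaced by a character class.
my $re = qr{
    ^ (\d+)                       # capture one or more digits
    \. \s+                        # a literal period followed by some space
    (.*)                          # some characters
    \s+ [\x{2013}\x{2014}-] \s+   # space, an en dash, em dash or hyphen, more space
    (.*) $                        # more characters, up to the end
}x;

# Made-up sample lines: one per kind of dash.
my @lines = (
    "123. The quick brown fox - Jumps over the",
    "124. The quick brown fox \x{2013} Jumps over the",
    "125. The quick brown fox \x{2014} Jumps over the",
);

for my $line (@lines) {
    if (my ($num, $text1, $text2) = $line =~ $re) {
        print "num=$num, text1=<$text1>, text2=<$text2>\n";
    }
    else {
        print "No match!\n";
    }
}
```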

      See also these Unicode® resources: the PDF "Code Chart: General Punctuation -- Range: 2000–206F"; and, for characters referenced therein but not in that range, "Unicode 14.0 Character Code Charts" (note the "Find chart by hex code:" near the top of the page).

      — Ken

      Perhaps you mean "em dash" instead of "en dash"?
      This is called "em" because it is similar to the width of "M" in a variable-width font.
      An en dash is shorter, like the width of the letter "n".

      In any event, you will have to read the data using UTF-8 decoding. My dev environment for Perl can only do ASCII, so I cannot easily write code for this.

      As far as regex goes:
      You need to group an or'd expression, something like this: (-|em_dash)
      To make it "non-capturing": (?:-|em_dash)

      The question is what "em_dash" should be and how that relates to the decoding that was applied to the data when it was read.

      Update: as a Unicode code point, an em dash is \x{2014}.
      I think you need "use utf8;" for that to work, but I am not sure.
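A small sketch that may help settle this. As far as I know, use utf8 only tells perl that the source file itself is written in UTF-8, so \x{...} escapes work without it; what matters is that the input has been decoded into characters. The sample bytes and variable names are made up for illustration, and Encode's decode stands in for whatever decoding was applied during the read:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumed input: raw UTF-8 bytes, as they would arrive from a filehandle
# opened without an encoding layer. "\xE2\x80\x94" is U+2014 encoded as UTF-8.
my $bytes = "1. before \xE2\x80\x94 after";

# Decode the bytes into characters before matching.
my $line = decode('UTF-8', $bytes);

# Non-capturing group: ASCII hyphen, en dash (U+2013) or em dash (U+2014).
my $dash = qr/(?:-|\x{2013}|\x{2014})/;

if ($line =~ /^(\d+)\.\s+(.*)\s+$dash\s+(.*)$/) {
    print "num=$1, text1=<$2>, text2=<$3>\n";
}

# The undecoded bytes contain no character U+2014, so the dash is not found.
print "raw bytes: ", ($bytes =~ /\s+$dash\s+/ ? "match" : "no match"), "\n";
```

Opening the file with an :encoding(UTF-8) layer would do the decoding at read time instead of calling decode by hand.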

      Some Monks here are quite experienced with utf8 encoding.
      Bring it on!