PerlMonks  

Regex help

by newbio (Beadle)
on Dec 14, 2006 at 23:28 UTC ( [id://589933]=perlquestion )

newbio has asked for the wisdom of the Perl Monks concerning the following question:

Hello dear Monks,

I am badly stuck on this. Please enlighten me as to how I should proceed on the following problem:

I have a file containing a set of lines, with each line containing multiple sentences. I need to check each line for certain types of words (patterns) and extract them and their positions.

Word properties:
  1. Variable in length.
  2. Must contain both letters and numbers (such as Hca12a).
  3. May or may not contain a hyphen or an underscore (such as PQRS12-a, Ma2_b).
  4. Letters in the word may be in both lower and upper case.
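
The four word properties above can be written as a single pattern. The following is only a sketch under stated assumptions: "numbers" is read as ASCII digits, a word is taken to be an unbroken run of letters, digits, hyphens and underscores, and the names `$WORD_RE` and `find_words` are made up for illustration.

```perl
use strict;
use warnings;

# Assumption: a "word" is an unbroken run of ASCII letters, digits, hyphens
# and underscores that contains at least one letter and at least one digit.
my $WORD_RE = qr/
    (?<![A-Za-z0-9_-])          # not preceded by another word character
    (?=[A-Za-z0-9_-]*[A-Za-z])  # the run contains at least one letter
    (?=[A-Za-z0-9_-]*[0-9])     # the run contains at least one digit
    [A-Za-z0-9_-]+              # the run itself
    (?![A-Za-z0-9_-])           # not followed by another word character
/x;

# Hypothetical helper: return every matching word in a piece of text.
sub find_words {
    my ($text) = @_;
    my @found;
    while ( $text =~ /($WORD_RE)/g ) {
        push @found, $1;
    }
    return @found;
}

print join( ", ", find_words("Genes Hca12a, PQRS12-a and Ma2_b but not abc or 123.") ), "\n";
```

The two look-ahead assertions enforce "letters and numbers" without fixing their order, so both `Hca12a` and something like `12abc` would qualify. Whether a leading hyphen should be allowed is not stated in the question; this sketch permits it.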

As output I need to get:
  1. the matched Word
  2. the sentence number (1st, 2nd and so on) in a line where the word occurs
  3. the position of the word from the start of the sentence.

Thanks a lot.
Raj

Edit: Fixed formatting by holli

Replies are listed 'Best First'.
Re: Regex help
by graff (Chancellor) on Dec 15, 2006 at 02:01 UTC
    What constitutes the end of a "sentence"? How will your script get the "certain type of words (patterns)" that it needs to extract?

    If the output needs to refer to sentence number, you must first split each line into an array of sentences. So that's where you would use a regex to match the sentence boundary (whatever that may be).
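
As a rough illustration of that first step, here is one way the split could look. It assumes a sentence ends at ".", "!" or "?" followed by whitespace, which is a guess since the question does not define a sentence boundary; `split_sentences` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Assumption: a sentence ends at ".", "!" or "?" followed by whitespace.
# Abbreviations such as "e.g. " would be split incorrectly by this rule.
sub split_sentences {
    my ($line) = @_;
    chomp $line;
    # The look-behind keeps the terminator attached to its sentence.
    return grep { length } split /(?<=[.!?])\s+/, $line;
}

my @sentences = split_sentences("Hca12a binds DNA. It is fast! Is Ma2_b similar?\n");
for my $i ( 0 .. $#sentences ) {
    printf "%d: %s\n", $i + 1, $sentences[$i];    # sentence numbers start at 1
}
```

Splitting on a look-behind rather than on the punctuation itself means the sentence text is left intact, so character offsets measured inside a sentence still line up with the original wording.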

    Then, looping over the sentences on a line, you check for matches to your target word (whatever that may be). If the search target is a regex (e.g. /f(?:oo|o?u)/ to match any of "foo", "fu" or "fou"), identifying the position of the match as a character offset within the sentence could be done as a two-step process: get the match, then find its offset:

    while (<>) {    # read a string from input
        my @sentences = split /\.\s+/;    # ". " might work for splitting into sentences(?)
        for my $i ( 0 .. $#sentences ) {
            if ( $sentences[$i] =~ /(?<!\S)(f(?:oo|o?u))(?!\S)/ ) {
                my $match    = $1;
                my $position = index( $sentences[$i], $match );
                printf( "%s found in line %d, sentence %d, offset %d\n",
                    $match, $., $i + 1, $position );
            }
        }
    }
    (not tested)

    The big ugly regex is using negative look-behind and negative look-ahead (see perlre) in order to make sure that "foo"/"fou"/"fu" is matched only when not part of a larger word (e.g. "food", "afoul" and "snafu" will not produce matches, because the target is preceded and/or followed by a non-whitespace character).
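
A possible variation on the same look-around pattern, offered as a sketch rather than a tested solution: with the `/g` modifier every hit in a sentence is found, and the built-in `@-` array gives the start offset of the capture directly. This avoids one pitfall of `index()`, which returns the first occurrence of the substring even when that occurrence is inside a larger word. The name `match_offsets` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return [match, offset] pairs for every
# whitespace-delimited hit in one sentence.
sub match_offsets {
    my ($sentence) = @_;
    my @hits;
    while ( $sentence =~ /(?<!\S)(f(?:oo|o?u))(?!\S)/g ) {
        push @hits, [ $1, $-[1] ];    # $-[1] is the start offset of capture group 1
    }
    return @hits;
}

for my $hit ( match_offsets("food is not foo and snafu is not fu") ) {
    printf "%s at offset %d\n", @$hit;
}
```

In this sample sentence "food" and "snafu" are skipped by the look-arounds, and the reported offsets are 12 for "foo" and 33 for "fu", whereas `index()` with "foo" would have returned 0 (the start of "food").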

    There's a good chance that this snippet won't do exactly what you want, but if you don't show us any code you've tried, or any sample input with desired output, you can't expect much from us.

Re: Regex help
by lin0 (Curate) on Dec 15, 2006 at 00:18 UTC
Re: Regex help
by rje (Deacon) on Dec 15, 2006 at 00:07 UTC
    In order to help you, we need to know what you've already tried.

    What have you tried so far? What sort of strategies have you thought up for doing this? What are your thoughts about the problem?
Re: Regex help
by astaines (Curate) on Dec 14, 2006 at 23:33 UTC
    Hi,
    Some specific examples would help, e.g. a set of possible inputs and the desired output.
    Do you need the position as the word number, or the character number?
    Anthony
    -- Anthony Staines
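
The two readings of "position" asked about here (word number versus character number) can be computed side by side. This is a minimal sketch assuming that words are separated by whitespace, that word numbers start at 1, and that character offsets start at 0; `positions_of` is a hypothetical helper, and punctuation attached to a word (such as a trailing comma) would prevent the exact comparison from matching.

```perl
use strict;
use warnings;

# Assumption: words are whitespace-separated; word numbers are 1-based,
# character offsets are 0-based (as with index() and @-).
sub positions_of {
    my ( $sentence, $target ) = @_;
    my @results;
    my $word_number = 0;
    while ( $sentence =~ /(\S+)/g ) {
        $word_number++;
        my ( $token, $offset ) = ( $1, $-[1] );
        push @results, { word => $word_number, char => $offset }
            if $token eq $target;
    }
    return @results;
}

my ($first) = positions_of( "The gene Hca12a is active", "Hca12a" );
printf "word %d, character offset %d\n", $first->{word}, $first->{char};
```

For the sample sentence this reports word 3 and character offset 9, which shows how far apart the two answers can be and why the question needs to say which one is wanted.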
Re: Regex help
by leocharre (Priest) on Dec 15, 2006 at 15:27 UTC

    Man, write the stuff! Even if it doesn't work. Write it even- as if you knew what you were doing. It will help us help you. Help us... help you...

    Write the script top to bottom how you think it should go, include sample data maybe.. like 20 lines in a heredoc within the script. Do all the things they told you that you should do, use warnings, use strict, -w.. go the distance. Come back, write an update .. and we go from there. :-)
