Regex help

newbio has asked for the wisdom of the Perl Monks concerning the following question:

Hello dear Monks,

I am badly stuck on this. Please enlighten me as to how I should proceed with the following problem:

I have a file containing a set of lines, with each line containing multiple sentences. I need to check each line for a certain type of word (pattern) and extract the matching words and their positions.

Word properties:

Variable in length.
Must contain letters and numbers (such as Hca12a).
May or may not contain a hyphen or underscore (such as PQRS12-a, Ma2_b).
Letters in the word may be in either lower or upper case.

As output I need to get:

the matched Word
the sentence number (1st, 2nd and so on) in a line where the word occurs
the position of the word from the start of the sentence.

Thanks a lot.
Raj

Edit: Fixed formatting by holli


Replies are listed 'Best First'.
Re: Regex help by graff (Chancellor) on Dec 15, 2006 at 02:01 UTC
What constitutes the end of a "sentence"? How will your script get the "certain type of words (patterns)" that it needs to extract?

If the output needs to refer to sentence number, you must first split each line into an array of sentences. So that's where you would use a regex to match the sentence boundary (whatever that may be). Then, looping over the sentences on a line, you check for matches to your target word (whatever that may be).

If the search target is a regex (e.g. `/f(?:oo|o?u)/` to match any of "foo", "fu" or "fou"), identifying the position of the match as a character offset within the sentence could be done as a two step process: get the match, then find its offset:

    while (<>) {                          # read a line from input
        chomp;
        my @sentences = split /\.\s+/;    # ". " might work for splitting into sentences(?)
        for my $i ( 0 .. $#sentences ) {
            if ( $sentences[$i] =~ /(?<!\S)(f(?:oo|o?u))(?!\S)/ ) {
                my $match    = $1;
                my $position = index( $sentences[$i], $match );
                printf( "%s found in line %d, sentence %d, offset %d\n",
                        $match, $., $i+1, $position );
            }
        }
    }

(not tested)

The big ugly regex is using negative look-behind and negative look-ahead (see perlre) in order to make sure that "foo"/"fou"/"fu" is matched only when not part of a larger word (e.g. "food", "afoul" and "snafu" will not produce matches, because the target is preceded and/or followed by a non-whitespace character).

There's a good chance that this snippet won't do exactly what you want, but if you don't show us any code you've tried, or any sample input with desired output, you can't expect much from us.
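For the word properties listed in the question, the same split-then-match idea might look roughly like the sketch below. It is only a guess at what is wanted and rests on several assumptions that are not in the original post: that a "word" must contain at least one letter and at least one digit, that sentences end with a period followed by whitespace, and that "position" means a 0-based character offset. The names `find_words` and `$word_re` are made up for illustration. Instead of calling index after the match, it reads the offset of the capture directly from `$-[1]` (see `@-` in perlvar), so a word that also appears earlier inside a longer token does not throw the offset off.

```perl
use strict;
use warnings;

# Hypothetical pattern: a run of letters, digits, hyphens and underscores
# that contains at least one letter and at least one digit.
my $word_re = qr/
    (?<![A-Za-z0-9_-])            # not in the middle of a longer token
    (?=[A-Za-z0-9_-]*[A-Za-z])    # the token has a letter somewhere
    (?=[A-Za-z0-9_-]*[0-9])       # ... and a digit somewhere
    [A-Za-z0-9_-]+                # the token itself
/x;

# Returns one hash ref per hit: the word, its sentence number (1-based)
# and its character offset within that sentence (0-based).
sub find_words {
    my ($line) = @_;
    my @hits;
    my @sentences = split /\.\s+/, $line;    # assumed sentence boundary
    for my $i ( 0 .. $#sentences ) {
        while ( $sentences[$i] =~ /($word_re)/g ) {
            push @hits, {
                word     => $1,
                sentence => $i + 1,
                offset   => $-[1],           # start of capture group 1
            };
        }
    }
    return @hits;
}

# Made-up sample line, using the example words from the question.
my $sample = "Gene Hca12a was tested. Both PQRS12-a and Ma2_b were found.";
for my $hit ( find_words($sample) ) {
    printf "%s found in sentence %d at offset %d\n",
        @{$hit}{qw(word sentence offset)};
}
```

If the position is wanted as a word count rather than a character offset, or if sentences can end in "?" or "!", both the split pattern and the offset calculation would need adjusting.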
Re: Regex help by lin0 (Curate) on Dec 15, 2006 at 00:18 UTC
Hi newbio,

It is going to be really hard to help you without knowing what you have tried already. So, please, give us more details. While you are at it, I encourage you to have a look at the Perl documentation. Finally, I suggest you have a look at the following nodes to help you next time you ask a question: "How do I post a question effectively?" and "How do I compose an effective node title?"

Cheers, lin0
Re: Regex help by rje (Deacon) on Dec 15, 2006 at 00:07 UTC
In order to help you, we need to know what you've already tried. What have you tried so far? What sort of strategies have you thought up for doing this? What are your thoughts about the problem?
Re: Regex help by astaines (Curate) on Dec 14, 2006 at 23:33 UTC
Hi, Some specific examples would help, e.g. a set of possible inputs and the desired output. Do you need the position as the word number, or the character number? Anthony -- Anthony Staines
Re: Regex help by leocharre (Priest) on Dec 15, 2006 at 15:27 UTC
Man, write the stuff! Even if it doesn't work. Write it even as if you knew what you were doing. It will help us help you. Help us... help you... Write the script top to bottom how you think it should go, and include sample data, maybe 20 lines or so in a heredoc within the script. Do all the things they told you that you should do: use warnings, use strict, -w. Go the distance. Come back, write an update, and we go from there. :-)

