PerlMonks  

Using punctuation to split a text

by UrbanHick (Sexton)
on Nov 04, 2006 at 08:28 UTC ( [id://582223]=perlquestion )

UrbanHick has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks-

I have a file that is just a huge block of text. I have found the documentation on Text::Autoformat and read up on using it. What I would like to do is slice out individual sentences and then feed them one at a time to Text::Autoformat. To do this I am looking for some way to split the text at every '.', '?' or '!'. In addition, there are a few places in the text where the punctuation is missing, so the only indication of a new sentence is that a lowercase letter is followed by a capital letter. (For example it looks likeThis.)

I am not a regular Perl user, but I think a regex including ‘/.?!/’ might be of some use. On the other hand, splitting where a lowercase letter meets a capital letter has me totally flummoxed.

Here is an example of what I am looking for:

__DATA__
The Tell-Tale HeartEdgar Allen PoeTRUE! nervous, very, very dreadfully nervous I had been and am; but why WILL you say that I am mad? The disease had sharpened my senses, not destroyed, not dulled them. Above all was the sense of hearing acute. I heard all things in the heaven and in the earth. I heard many things in hell. How then am I mad? Hearken! and observe how healthily, how calmly, I can tell you the whole story.
__END__

I would like to split it into something that looks like this:

__DATA__
The Tell-Tale Heart
Edgar Allen Poe
TRUE!
nervous, very, very dreadfully nervous I had been and am; but why WILL you say that I am mad?
The disease had sharpened my senses, not destroyed, not dulled them.
Above all was the sense of hearing acute.
I heard all things in the heaven and in the earth.
I heard many things in hell.
How then am I mad? Hearken! and observe how healthily, how calmly, I can tell you the whole story.
__END__

My thanks to anyone who can suggest a way to go forward with this.

-UH

Replies are listed 'Best First'.
Re: Using punctuation to split a text
by GrandFather (Saint) on Nov 04, 2006 at 09:26 UTC

    The following regex, using look-behind and look-ahead assertions, gets pretty close. However, I don't think there is a way to distinguish between the two places where "mad?" is used except by explicit special cases.

    use strict;
    use warnings;

    my $str = <<DATA;
    The Tell-Tale HeartEdgar Allen PoeTRUE! nervous, very, very dreadfully nervous I had been and am; but why WILL you say that I am mad? The disease had sharpened my senses, not destroyed, not dulled them. Above all was the sense of hearing acute. I heard all things in the heaven and in the earth. I heard many things in hell. How then am I mad? Hearken! and observe how healthily, how calmly, I can tell you the whole story.
    DATA

    $str =~ s/(?<=[a-z])(?=[A-Z])|(?<=[?!.] )(?![a-z])/\n/g;
    print $str;

    Prints:

    The Tell-Tale Heart
    Edgar Allen Poe
    TRUE! nervous, very, very dreadfully nervous I had been and am; but why WILL you say that I am mad?
    The disease had sharpened my senses, not destroyed, not dulled them.
    Above all was the sense of hearing acute.
    I heard all things in the heaven and in the earth.
    I heard many things in hell.
    How then am I mad?
    Hearken! and observe how healthily, how calmly, I can tell you the whole story.

    DWIM is Perl's answer to Gödel

      Thank you for your time and response.

      Actually single words are acceptable. I am actually cutting up larger files to be used by an ESL teaching program.

      Thanks,

      UH
Re: Using punctuation to split a text
by BrowserUk (Patriarch) on Nov 04, 2006 at 09:28 UTC

    This is as close to your desired output as I can achieve.

    #! perl -slw
    use strict;

    ## Slurp DATA and remove the leading '+' wrap markers from the pasted text
    ( my $data = do{ local $/; <DATA> } ) =~ s[\n\+][]g;

    print "'$1'" while $data =~ m[
        (
            .+?               ## Capture everything (minimally)
            (?:
                [.!?] \s+     ## A terminator followed by whitespace
              |
                [a-z]         ## or a lowercase letter
            )
        )
        (?= [A-Z] | \z )      ## followed by an uppercase letter or EOS
    ]xg;

    __DATA__
    The Tell-Tale HeartEdgar Allen PoeTRUE! nervous, very, very dreadfully nervous I
    + had been and am; but why WILL you say that I am mad? The disease had sharpened
    + my senses, not destroyed, not dulled them. Above all was the sense of hearing
    + acute. I heard all things in the heaven and in the earth. I heard many things i
    +n hell. How then am I mad? Hearken! and observe how healthily, how calmly, I ca
    +n tell you the whole story.

    c:\test>junk5
    'The Tell-Tale Heart'
    'Edgar Allen Poe'
    'TRUE! nervous, very, very dreadfully nervous I had been and am; but why WILL you say that I am mad? '
    'The disease had sharpened my senses, not destroyed, not dulled them. '
    'Above all was the sense of hearing acute. '
    'I heard all things in the heaven and in the earth. '
    'I heard many things in hell. '
    'How then am I mad? '
    'Hearken! and observe how healthily, how calmly, I can tell you the whole story. '

    Which doesn't quite match, but then I do not see how (or, more importantly, why) you differentiate between splitting after the question mark here: "I am mad? The disease" but deciding not to here: "How then am I mad? Hearken!"


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Thank you very much for your response.

      The explanations are very much appreciated. The resulting output is perfectly usable for my purposes.

      cheers!

      -UH
Re: Using punctuation to split a text
by johngg (Canon) on Nov 04, 2006 at 13:07 UTC
    This is a similar approach to GrandFather's and BrowserUk's but uses split and map to break the text up. The regular expression allows for a split between punctuation and a lower-case letter to cope with "TRUE! nervous", which you break over two lines. Here it is:

    use strict;
    use warnings;

    my $rxBoundary = qr
        {(?xms)
            (?:
                (?<=[a-z])(?=[A-Z])
                |
                (?<=[.!?])(?=\s?[A-Za-z])
            )
        };

    my $text = <DATA>;
    chomp $text;

    print
        map {qq{$_\n\n}}
        split m{$rxBoundary}, $text;

    __END__
    The Tell-Tale HeartEdgar Allen PoeTRUE! nervous, very, very dreadfully nervous I had been and am; but why WILL you say that I am mad? The disease had sharpened my senses, not destroyed, not dulled them. Above all was the sense of hearing acute. I heard all things in the heaven and in the earth. I heard many things in hell. How then am I mad? Hearken! and observe how healthily, how calmly, I can tell you the whole story.

    and the output is

    The Tell-Tale Heart

    Edgar Allen Poe

    TRUE!

    nervous, very, very dreadfully nervous I had been and am; but why WILL you say that I am mad?

    The disease had sharpened my senses, not destroyed, not dulled them.

    Above all was the sense of hearing acute.

    I heard all things in the heaven and in the earth.

    I heard many things in hell.

    How then am I mad?

    Hearken!

    and observe how healthily, how calmly, I can tell you the whole story.

    I hope this is of use.

    Cheers,

    JohnGG

Re: Using punctuation to split a text
by mickeyn (Priest) on Nov 04, 2006 at 08:36 UTC
    almost ... :-)
    split /[.!]/, $text;
    anyway, I suggest you read perlre.
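
    To make the "almost" concrete, here is a minimal sketch contrasting the pattern from the question with a character class. The sample sentence and the variable names are made up for illustration; the class below also includes '?', since the question asked for all three terminators.

```perl
use strict;
use warnings;

# Made-up sample text (an assumption for illustration only)
my $text = 'Is it mad? No! It is calm.';

# /.?!/ means "an optional any-character followed by a literal '!'",
# so it only ever matches around exclamation marks.
my @wrong = split /.?!/, $text;

# Inside [...] the characters '.', '?' and '!' are literal, so this
# splits at any one of the three terminators (and discards them).
my @right = split /[.?!]/, $text;

print scalar(@wrong), " pieces: ", join('|', @wrong), "\n";
print scalar(@right), " pieces: ", join('|', @right), "\n";
```

    The first split yields two pieces and eats the "o" before the "!"; the second yields three pieces, each with its terminator removed.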

    Enjoy,
    Mickey

      So this should do the trick:

      @sentences =split([.?!\L], $text);

      Thank you for your help.

      -UH
Re: Using punctuation to split a text
by fenLisesi (Priest) on Nov 04, 2006 at 10:09 UTC
    The following could get you started. Cheers.
    use warnings;
    use strict;

    my $SPACER = "\n\n";

    while (my $line = <DATA>) {
        $line =~ s|^\s+||;                          ## trim leading whitespace
        $line =~ s|\s+$||;                          ## trim trailing whitespace
        $line =~ s|([.?!]+)\s*|$1$SPACER|g;         ## break at [.?!]
        $line =~ s|([a-z])([A-Z])|$1$SPACER$2|g;    ## break at camelCase
        print $line . $SPACER;
    }
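
    For anyone who wants to try these substitutions without a data file, here is a minimal sketch that wraps the same four steps in a sub and runs them on a short made-up line. The name break_line and the sample text are assumptions for illustration, not part of the post above.

```perl
use strict;
use warnings;

my $SPACER = "\n\n";

# Hypothetical helper: applies the four substitutions to one line
# and returns the resulting chunks as a list.
sub break_line {
    my ($line) = @_;
    $line =~ s|^\s+||;                        # trim leading whitespace
    $line =~ s|\s+$||;                        # trim trailing whitespace
    $line =~ s|([.?!]+)\s*|$1$SPACER|g;       # break after [.?!]
    $line =~ s|([a-z])([A-Z])|$1$SPACER$2|g;  # break at camelCase
    return split /\n\n/, $line;               # one chunk per list element
}

my @chunks = break_line('HeartEdgar PoeTRUE! nervous I am. Mad?');
print "$_\n" for @chunks;
```

    On this sample the chunks come out as "Heart", "Edgar Poe", "TRUE!", "nervous I am." and "Mad?". Note that this approach breaks after every terminator, including the one in "TRUE! nervous".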
Re: Using punctuation to split a text
by OfficeLinebacker (Chaplain) on Nov 04, 2006 at 18:37 UTC

    To focus on the flummoxing part, you're talking about whenever a lowercase letter is immediately followed by an uppercase letter (with nothing in between)?

    Did

    @sentences=split([.!?\L], $text);
    work for you?

    First off, I think you'd need regexp slashes around the brackets so that Perl parses them as a character class inside a pattern; within the class, the . and ? are taken literally. Secondly, I don't understand how \L works within a character class, especially when not followed by \E.
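
    On the \L point, a small sketch may help. As far as I can tell, \L is an interpolation escape (lowercase everything up to \E) rather than a regex character class, and it is applied to a pattern before the regex is compiled. The sample strings and variable names below are made up for illustration.

```perl
use strict;
use warnings;

# \L lowercases what follows, up to \E, when a string is interpolated.
my $lowered = "\LMIXED Case\E Text";

# In a pattern the same escape is applied before compilation,
# so this pattern behaves like /true/.
my $matches = 'true!' =~ /\LTRUE\E/ ? 1 : 0;

# A bracketed class between regex delimiters is all split needs here.
my @parts = split /[.?!]/, 'One. Two? Three!';

print "$lowered\n";
print "matched: $matches\n";
print join('|', @parts), "\n";
```

    So \L does nothing useful for detecting a lowercase letter; a class such as [a-z] is the tool for that.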

    _________________________________________________________________________________

    I like computer programming because it's like Legos for the mind.

Re: Using punctuation to split a text
by OfficeLinebacker (Chaplain) on Nov 04, 2006 at 18:50 UTC
    In any case, would
    @sents=split(/([.?!]|[:lower:]\u)/,$text);
    work?

    I think the split function consumes the characters it splits on though, right? So you'd have to use $1 and append it to the previous record, and dealing with the lower/upper boundary would then be even more problematic.
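
    A rough sketch of the "append it to the previous record" idea, for what it's worth. The sub name sentences_with_punctuation and the sample text are hypothetical. It also shows the bracketed form that POSIX classes need: [[:lower:]] and [[:upper:]], rather than [:lower:] and \u.

```perl
use strict;
use warnings;

# Hypothetical helper: split on terminators, then glue each captured
# terminator back onto the text that preceded it.
sub sentences_with_punctuation {
    my ($text) = @_;
    my @pieces = split /([.?!]+)/, $text;   # the capture keeps the terminators
    my @sentences;
    while (@pieces) {
        my $body = shift @pieces;
        my $end  = @pieces ? shift @pieces : '';
        $body =~ s/^\s+//;                  # drop the space left after the previous terminator
        push @sentences, $body . $end if length($body . $end);
    }
    return @sentences;
}

# POSIX classes must sit inside a bracketed class: [[:lower:]], not [:lower:]
my $has_boundary = 'likeThis' =~ /[[:lower:]][[:upper:]]/ ? 1 : 0;

print "$_\n" for sentences_with_punctuation('How then am I mad? Hearken! Observe.');
print "lower/upper boundary found: $has_boundary\n";
```

    This keeps each terminator with its sentence, but it still does nothing about the lowercase/uppercase boundary, which needs a zero-width match.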

    _________________________________________________________________________________

    I like computer programming because it's like Legos for the mind.

      You are partially right: split can consume the characters that it splits on, but that is not always the case. Consider the following two code snippets:

      $ perl -e '
      > $str = q{abcXghiXstu};
      > @elems = split m{X}, $str;
      > print qq{$_\n} for @elems;'
      abc
      ghi
      stu
      $ perl -e '
      > $str = q{abcXghiXstu};
      > @elems = split m{(X)}, $str;
      > print qq{$_\n} for @elems;'
      abc
      X
      ghi
      X
      stu
      $

      As you can see, the capturing parentheses in the regular expression of the second snippet cause split to keep the separators and assign them to the output array. So split consumes the characters only if you let it.

      This behaviour does not help us with the lower/upper case split, but you can see from this post that you can split on a zero-width match or, in other words, a boundary condition. In the following snippet I want to split on the boundary between letters and digits, so I use zero-width look-behind and look-ahead assertions to match a point in the string preceded by a letter and followed by a digit, or vice versa.

      $ perl -e '
      > $str = q{abc123def456ghi};
      > @elems =
      >     split m
      >         {(?x)
      >             (?:
      >                 (?<=[a-z])    # look behind for letter
      >                 (?=[0-9])     # look ahead for digit
      >             )
      >             |                 # or
      >             (?:
      >                 (?<=[0-9])    # look behind for digit
      >                 (?=[a-z])     # look ahead for letter
      >             )
      >         }, $str;
      > print qq{$_\n} for @elems;'
      abc
      123
      def
      456
      ghi
      $

      I hope this throws more light on the various ways split can be used.

      Cheers,

      JohnGG

Node Type: perlquestion [id://582223]
Approved by Corion