Splitting a Sentence

ardibehest has asked for the wisdom of the Perl Monks concerning the following question:

Dear PerlMonks, I have a simple sentence splitter in Perl which does a pretty decent job of splitting a running text into sentences.

#!/usr/bin/perl

use strict;
use warnings;

my $s;
my @arr;

open(FILE, "<test");

while(<FILE>) {
  chomp $_;
  $s .= $_;
}
@arr = $s =~ m/[A-Z].+?[.;]/g;
foreach (@arr) {
    print $_, "\n";
}

However, as you can see, it stumbles on abbreviations and acronyms.

I have a list of abbreviations which I would like to integrate into the script, but since I am a newbie to Perl, I have not been able to do so. The list of such cases is given below. The syntax is the same in every case:

Abbr["followed by the abbreviation"];

Abbr["Co."];
Abbr["Corp."];
Abbr["vs."];
Abbr["e.g."];
Abbr["etc."];
Abbr["ex."];
Abbr["cf."];
Abbr["eg."];
Abbr["Jan."];
Abbr["Feb."];
Abbr["Mar."];
Abbr["Apr."];
Abbr["Jun."];
Abbr["Jul."];
Abbr["Aug."];
Abbr["Sep."];
Abbr["Sept."];
Abbr["Oct."];
Abbr["Nov."];
Abbr["Dec."];
Abbr["jan."];
Abbr["feb."];
Abbr["mar."];
Abbr["apr."];
Abbr["jun."];
Abbr["jul."];
Abbr["aug."];
Abbr["sep."];
Abbr["sept."];
Abbr["oct."];
Abbr["nov."];
Abbr["dec."];
Abbr["ed."];
Abbr["eds."];
Abbr["repr."];
Abbr["trans."];
Abbr["vol."];
Abbr["vols."];
Abbr["rev."];
Abbr["est."];
Abbr["b."];
Abbr["m."];
Abbr["bur."];
Abbr["d."];
Abbr["r."];
Abbr["M."];
Abbr["Dept."];
Abbr["MM."];
Abbr["U."];
Abbr["Mr."];
Abbr["Jr."];
Abbr["Ms."];
Abbr["Mme."];
Abbr["Mrs."];
Abbr["Dr."];

How do I integrate such cases into the script? A couple of examples would suffice. Many thanks for your help.


Replies are listed 'Best First'.
Re: Splitting a Sentence by AppleFritter (Vicar) on Jul 03, 2014 at 10:56 UTC
Quick note -- it's not actually working so well yet, even disregarding abbreviations. Consider the following input: `This is a sentence. This is another sentence. This is a third sentence, which also happens to be spanning a line. This is a sentence as well... I think. This is an abbreviation, cf. the list posted on Perlmonks. This is a sentence; this is the last sentence.`

This produces:

    This is a sentence.
    This is another sentence.
    This is a thirdsentence, which also happens to be spanning a line.
    This is asentence as well.
    I think.
    This is an abbreviation, cf.
    Perlmonks.
    This is a sentence;

As opposed to:

    This is a sentence.
    This is another sentence.
    This is a third sentence, which also happens to be spanning a line.
    This is a sentence as well... I think.
    This is an abbreviation, cf. the list posted on Perlmonks.
    This is a sentence; this is the last sentence.

Note how the third and fourth ones are missing a space and how lowercase characters following semicolons or periods aren't handled correctly. Try the following -- add a space in your loop, and use a lookahead assertion to take a peek at what's following a period or semicolon:

    #!/usr/bin/perl
    use feature qw/say/;
    use strict;
    use warnings;

    my $s;
    my @arr;

    while(<>) {
      chomp $_;
      $s .= $_ . " ";
    }
    @arr = $s =~ m/[A-Z].+?[.;](?=[^.;][A-Z]|\s*$)/g;
    foreach (@arr) {
        say;
    }

This produces:

    This is a sentence.
    This is another sentence.
    This is a third sentence, which also happens to be spanning a line.
    This is a sentence as well...
    I think.
    This is an abbreviation, cf. the list posted on Perlmonks.

As you can see, it's not perfect -- it still splits if you have e.g. a run-on sentence followed by an "I", or in fact any uppercase word, e.g. a proper noun -- but it mostly handles abbreviations (arbitrary ones, even).

What I'd do to fix the remaining edge cases is add another processing step after the regex where you loop over `@arr`, check if each element ends with a known abbreviation, and join that element with the next one if so.
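
A minimal sketch of that post-processing step, under some assumptions: the abbreviation hash holds only a small subset of the list from the question, `merge_abbreviations` is an illustrative name that does not appear anywhere in the thread, and the pieces are rejoined with a single space (the regex discards whatever separated them, so the original spacing is not recoverable at this point).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed subset of the abbreviation list from the question.
my %is_abbr = map { $_ => 1 } qw(Co. Corp. vs. e.g. etc. cf. Mr. Mrs. Dr.);

# Illustrative helper: glue an element onto the next one whenever it
# ends with a known abbreviation.
sub merge_abbreviations {
    my @fragments = @_;
    my @sentences;
    my $pending = '';
    for my $fragment (@fragments) {
        $pending = length $pending ? "$pending $fragment" : $fragment;
        # Last whitespace-delimited token of what has been collected so far.
        my ($last_word) = $pending =~ /(\S+)$/;
        next if defined $last_word && $is_abbr{$last_word};
        push @sentences, $pending;
        $pending = '';
    }
    # Keep a trailing piece even if the text ended on an abbreviation.
    push @sentences, $pending if length $pending;
    return @sentences;
}

my $text = 'He met Dr. Smith at noon. It went well. ';
my @arr  = $text =~ m/[A-Z].+?[.;](?=[^.;][A-Z]|\s*$)/g;
print "$_\n" for merge_abbreviations(@arr);
```

Here the regex alone yields three elements ("He met Dr.", "Smith at noon.", "It went well."), and the merge step turns them back into two sentences. The lookup is an exact, case-sensitive match on the last token, so something like "(cf." with a leading parenthesis would need extra handling.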
Re^2: Splitting a Sentence by SimonPratt (Friar) on Jul 03, 2014 at 16:00 UTC
Try this one on for size: `[A-Z].+?[.;](?<!\.{3})(?=[^.;][A-Z]|\s*$)`
Re^3: Splitting a Sentence by AppleFritter (Vicar) on Jul 03, 2014 at 16:34 UTC
Works like a charm. BTW, I was gonna say that `m/[A-Z].+?[.;](?<!\.{2,})(?=[^.;][A-Z]|\s$)/g` would perhaps be even better, so as to not hardcode any specific number of periods for run-on sentences, but it turns out that: `Variable length lookbehind not implemented in regex m/[A-Z].+?[.;](?<!\.{2,})(?=[^.;][A-Z]|\s$)/` Sigh. (Oh well, at least there's workarounds; see "Why is variable length lookahead implemented while lookbehind is not?".)
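
One possible workaround for this particular case, sketched here as an assumption rather than anything proposed in the thread: a run of two or more periods ends at a given position exactly when the two characters before that position are both periods, so the fixed-width `(?<!\.\.)` expresses the same condition as the rejected `(?<!\.{2,})`. The name `split_sentences` is illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# (?<!\.\.) is fixed-width: it rejects a terminator that completes a run
# of two or more periods, however long the run is.
my $sentence_re = qr/[A-Z].+?[.;](?<!\.\.)(?=[^.;][A-Z]|\s*$)/;

# Illustrative helper: return every match of the pattern in list context.
sub split_sentences {
    my ($text) = @_;
    return $text =~ /$sentence_re/g;
}

# Four periods on purpose, to show that the count is not hardcoded.
my $text = 'This is a sentence as well.... I think. This is the next one. ';
print "$_\n" for split_sentences($text);
```

With this input the ellipsis no longer ends a sentence, so the output is two lines: the first runs through "I think." and the second is "This is the next one."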
Re^2: Splitting a Sentence by ardibehest (Novice) on Jul 03, 2014 at 14:50 UTC
Many thanks. I tried it and it seems to work pretty well. The loop over `@arr` seemed to work for a couple of abbreviations. I will need to add all of them, and I am sure it will work pretty decently.
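
For adding all of them without retyping, the list could be read in its existing `Abbr["..."];` form. This is only a sketch: `parse_abbreviations` is an illustrative name, the here-document stands in for a separate file holding the full list, and the pattern assumes one entry per line with no embedded double quotes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative helper: expects lines shaped like  Abbr["Co."];
# and returns a hash keyed by the abbreviation itself.
sub parse_abbreviations {
    my @lines = @_;
    my %is_abbr;
    for my $line (@lines) {
        # Capture whatever sits between the double quotes.
        $is_abbr{$1} = 1 if $line =~ /^\s*Abbr\["([^"]+)"\];/;
    }
    return %is_abbr;
}

# A few entries copied from the question.
my @lines = split /\n/, <<'END_LIST';
Abbr["Co."];
Abbr["Corp."];
Abbr["vs."];
Abbr["e.g."];
Abbr["Dr."];
END_LIST

my %is_abbr = parse_abbreviations(@lines);
print join(', ', sort keys %is_abbr), "\n";
```

A hash makes the "does this element end with a known abbreviation" check a single lookup on the last word, and lines that do not match the expected shape are silently skipped.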
Re: Splitting a Sentence by Tux (Canon) on Jul 03, 2014 at 10:56 UTC
Substitute the newlines with spaces, otherwise

    this is line one
    rous is the first word of line two

will become `this is line onerous is the first word of line two`, which is not what you want.

    open my $fh, "<", "test" or die "test: $!";
    while (<$fh>) {
        chomp;
        $s .= " $_";
    }

or shorter

    my $s = join " " => split m/\n/ => do { local (@ARGV, $/) = "test"; <> };

Or even shorter

    (my $s = do { local (@ARGV, $/) = "test"; <> }) =~ s/\n+/ /g;

Enjoy, Have FUN! H.Merijn
Re^2: Splitting a Sentence by johngg (Canon) on Jul 03, 2014 at 11:29 UTC
TIMTOWTDI - chomp in a map and join seems fairly concise yet readable.

    $ perl -Mstrict -Mwarnings -E '
    open my $inFH, q{<}, \ <<EOD or die $!;
    Line 1
    Line 2
    Line 3
    EOD

    my $wholeText = join q{ }, map { chomp; $_ } <$inFH>;
    say $wholeText;'
    Line 1 Line 2 Line 3
    $

I hope this is of interest. Cheers, JohnGG
Re: Splitting a Sentence by tangent (Parson) on Jul 03, 2014 at 16:32 UTC
You may want to have a look at the Lingua::EN::Sentence module. It includes support for abbreviations and acronyms and has a built-in list that covers many of the abbreviations you have listed (see the source code of the module for the full list). You can also add your own items to the list like this:

    use Lingua::EN::Sentence qw( get_sentences add_acronyms );

    add_acronyms('ed','eds'); # adding support for 'Ed. Eds.'
    my $sentences = get_sentences($text);
    for my $sentence (@$sentences) {
        # do something with $sentence
    }
Re: Splitting a Sentence by InfiniteSilence (Curate) on Jul 03, 2014 at 14:14 UTC
Is it just me or doesn't anybody else read this like a homework assignment? I mean...how can you have a ready-made list of abbreviations...a fairly complete one mind you, but have no idea how to integrate into a script? Celebrate Intellectual Diversity
Re^2: Splitting a Sentence by ardibehest (Novice) on Jul 03, 2014 at 14:42 UTC
Sorry. I am a newbie, but rather too old to be doing homework assignments: I am 65 years old, work on language analysis, and am learning Perl since it helps me do a lot of string manipulation, which in C would be an expensive proposition. No homework assignment here, I am afraid. Sentence splitting is a major problem in NLP and creates issues, as you can see from the replies posted.
Re^3: Splitting a Sentence by RonW (Parson) on Jul 03, 2014 at 16:44 UTC
Never too old for school or homework. The whole world is a school.
Re: Splitting a Sentence by Anonymous Monk on Jul 03, 2014 at 10:13 UTC
sentence -> Text::Sentence - module for splitting text into sentences
Re^2: Splitting a Sentence by choroba (Cardinal) on Jul 03, 2014 at 10:51 UTC
The linked module doesn't handle abbreviations.
Re^3: Splitting a Sentence by Anonymous Monk on Jul 04, 2014 at 07:14 UTC
choroba: "The linked module doesn't handle abbreviations."
abbrev -> Text::UnAbbrev - Expand abbreviations and acronyms.
sentence -> Lingua::Sentence - Perl extension for breaking text paragraphs into sentences

