PerlMonks  

Splitting a Sentence

by ardibehest (Novice)
on Jul 03, 2014 at 10:03 UTC ( [id://1092133]=perlquestion )

ardibehest has asked for the wisdom of the Perl Monks concerning the following question:

Dear PerlMonks, I have a simple sentence splitter in Perl which does a pretty decent job of splitting a running text into sentences.
#!/usr/bin/perl
use strict;
use warnings;

my $s;
my @arr;

open(FILE, "<test");
while(<FILE>) {
    chomp $_;
    $s .= $_;
}

@arr = $s =~ m/[A-Z].+?[.;]/g;

foreach (@arr) {
    print $_, "\n";
}

However, as you can see, it stumbles on abbreviations and acronyms.

I have a list of abbreviations which I would like to integrate into the script, but since I am a newbie to Perl, I have not been able to do so. The list of such cases is given below. The syntax is clear from the given cases:

 Abbr["followed by the abbreviation];
Abbr["Co."]; Abbr["Corp."]; Abbr["vs."]; Abbr["e.g."]; Abbr["etc."]; Abbr["ex."]; Abbr["cf."]; Abbr["eg."]; Abbr["Jan."]; Abbr["Feb."]; Abbr["Mar."]; Abbr["Apr."]; Abbr["Jun."]; Abbr["Jul."]; Abbr["Aug."]; Abbr["Sep."]; Abbr["Sept."]; Abbr["Oct."]; Abbr["Nov."]; Abbr["Dec."]; Abbr["jan."]; Abbr["feb."]; Abbr["mar."]; Abbr["apr."]; Abbr["jun."]; Abbr["jul."]; Abbr["aug."]; Abbr["sep."]; Abbr["sept."]; Abbr["oct."]; Abbr["nov."]; Abbr["dec."]; Abbr["ed."]; Abbr["eds."]; Abbr["repr."]; Abbr["trans."]; Abbr["vol."]; Abbr["vols."]; Abbr["rev."]; Abbr["est."]; Abbr["b."]; Abbr["m."]; Abbr["bur."]; Abbr["d."]; Abbr["r."]; Abbr["M."]; Abbr["Dept."]; Abbr["MM."]; Abbr["U."]; Abbr["Mr."]; Abbr["Jr."]; Abbr["Ms."]; Abbr["Mme."]; Abbr["Mrs."]; Abbr["Dr."];
How do I integrate such cases into the script? A couple of examples would suffice. Many thanks for your help.

Replies are listed 'Best First'.
Re: Splitting a Sentence
by AppleFritter (Vicar) on Jul 03, 2014 at 10:56 UTC

    Quick note -- it's not actually working so well yet, even disregarding abbreviations. Consider the following input:

    This is a sentence. This is another sentence. This is a third sentence, which also happens to be spanning a line. This is a sentence as well... I think. This is an abbreviation, cf. the list posted on Perlmonks. This is a sentence; this is the last sentence.

    This produces:

    This is a sentence.
    This is another sentence.
    This is a thirdsentence, which also happens to be spanning a line.
    This is asentence as well.
    I think.
    This is an abbreviation, cf.
    Perlmonks.
    This is a sentence;

    As opposed to:

    This is a sentence.
    This is another sentence.
    This is a third sentence, which also happens to be spanning a line.
    This is a sentence as well... I think.
    This is an abbreviation, cf. the list posted on Perlmonks.
    This is a sentence; this is the last sentence.

    Note how the third and fourth one are missing a space and how lowercase characters following semicolons or periods aren't handled correctly.

    Try the following -- add a space in your loop, and use a lookahead assertion to take a peek at what's following a period or semicolon:

    #!/usr/bin/perl
    use feature qw/say/;
    use strict;
    use warnings;

    my $s;
    my @arr;

    while(<>) {
        chomp $_;
        $s .= $_ . " ";
    }

    @arr = $s =~ m/[A-Z].+?[.;](?=[^.;][A-Z]|\s*$)/g;

    foreach (@arr) {
        say;
    }

    This produces:

    This is a sentence.
    This is another sentence.
    This is a third sentence, which also happens to be spanning a line.
    This is a sentence as well...
    I think.
    This is an abbreviation, cf. the list posted on Perlmonks.

    As you can see, it's not perfect: it still splits when a run-on sentence is followed by an "I" or, in fact, by any uppercase word such as a proper noun. It does handle abbreviations (arbitrary ones, even) as long as the word after them starts with a lowercase letter. What I'd do to fix the remaining edge cases is add another processing step after the regex where you loop over @arr, check if each element ends with a known abbreviation, and join that element with the next one if so.
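
    A minimal sketch of that post-processing step, under a few assumptions: %is_abbr holds only a small, hypothetical subset of the list from the question, merge_abbreviations is a made-up helper name (not part of any module), and a piece counts as "ending with an abbreviation" when its last whitespace-separated word is in the hash.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical subset of the abbreviation list from the question.
my %is_abbr = map { $_ => 1 } qw(Co. Corp. vs. etc. cf. Mr. Mrs. Dr.);

# Made-up helper: glue a piece onto the following one whenever
# its last word is a known abbreviation.
sub merge_abbreviations {
    my @pieces = @_;
    my @merged;
    my $pending = '';
    for my $piece (@pieces) {
        # Append the current piece to whatever is still waiting to be completed.
        $pending = length($pending) ? "$pending $piece" : $piece;
        my ($last_word) = $pending =~ /(\S+)\z/;
        # Keep collecting if the text so far ends in an abbreviation.
        next if defined $last_word && $is_abbr{$last_word};
        push @merged, $pending;
        $pending = '';
    }
    # The input ended on an abbreviation: keep what was collected.
    push @merged, $pending if length $pending;
    return @merged;
}

# Pieces as the regex above would return them for "I met Dr. Smith today. This is fine."
my @arr = ('I met Dr.', 'Smith today.', 'This is fine.');
print "$_\n" for merge_abbreviations(@arr);
```

    For the three pieces above this prints "I met Dr. Smith today." and "This is fine." on separate lines. Note that a sentence which genuinely ends in an abbreviation (say, "... at Acme Corp.") would also be glued to its successor, so this remains a heuristic.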

      Try this one on for size:

      [A-Z].+?[.;](?<!\.{3})(?=[^.;][A-Z]|\s*$)

        Works like a charm. BTW, I was gonna say that

        m/[A-Z].+?[.;](?<!\.{2,})(?=[^.;][A-Z]|\s*$)/g

        would perhaps be even better, so as to not hardcode any specific number of periods for run-on sentences, but it turns out that:

        Variable length lookbehind not implemented in regex m/[A-Z].+?[.;](?<!\.{2,})(?=[^.;][A-Z]|\s*$)/

        Sigh. (Oh well, at least there are workarounds; see Why is variable length lookahead implemented while lookbehind is not?.)
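
        One fixed-width alternative, offered here as a sketch and not taken from the linked thread: because the lookbehind runs right after [.;] has matched, looking back at just two characters with (?<!\.\.) already rejects every period that closes a run of two or more, and the first period of a run is rejected by the lookahead since the next character is another period. The sample text below is made up for illustration.

```perl
use strict;
use warnings;

# Made-up sample text; the trailing space mimics the " " appended in the loop.
my $s = 'Two dots.. I guess. Three dots... I think. This is the end. ';

# Fixed-width lookbehind of two periods: Perl accepts it, and it
# covers runs of two, three, or more periods.
my @two = $s =~ m/[A-Z].+?[.;](?<!\.\.)(?=[^.;][A-Z]|\s*$)/g;

# The three-period version from the earlier reply, for comparison.
my @three = $s =~ m/[A-Z].+?[.;](?<!\.{3})(?=[^.;][A-Z]|\s*$)/g;

print "two-period lookbehind:\n";
print "  $_\n" for @two;
print "three-period lookbehind:\n";
print "  $_\n" for @three;
```

        With this input the two-period version yields three sentences, while the three-period version yields four because it still splits after "Two dots..". Either pattern fails to match a sentence that ends the text with an ellipsis, since no period in the final run satisfies both assertions.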

      Many thanks. I tried it and it seems to work pretty well. The loop over @arr worked for a couple of abbreviations; I will need to add all of them, and I am sure it will then work pretty decently.
Re: Splitting a Sentence
by Tux (Canon) on Jul 03, 2014 at 10:56 UTC

    Substitute the newlines with spaces, otherwise

    this is line one
    rous is the first word of line two

    will become

    this is line onerous is the first word of line two

    which is not what you want.

    open my $fh, "<", "test" or die "test: $!";
    while (<$fh>) {
        chomp;
        $s .= " $_";
    }

    or shorter

    my $s = join " " => split m/\n/ => do { local (@ARGV, $/) = "test"; <> };

    Or even shorter

    (my $s = do { local (@ARGV, $/) = "test"; <> }) =~ s/\n+/ /g;

    Enjoy, Have FUN! H.Merijn

      TIMTOWTDI - chomp in a map and join seems fairly concise yet readable.

      $ perl -Mstrict -Mwarnings -E '
      open my $inFH, q{<}, \ <<EOD or die $!;
      Line 1
      Line 2
      Line 3
      EOD

      my $wholeText = join q{ }, map { chomp; $_ } <$inFH>;
      say $wholeText;'
      Line 1 Line 2 Line 3
      $

      I hope this is of interest.

      Cheers,

      JohnGG

Re: Splitting a Sentence
by tangent (Parson) on Jul 03, 2014 at 16:32 UTC
    You may want to have a look at the Lingua::EN::Sentence module. It includes support for abbreviations and acronyms and has a built-in list of these which covers many of the abbreviations you have listed (see the source code of the module for the full list).

    You can also add your own items to the list like this:

    use Lingua::EN::Sentence qw( get_sentences add_acronyms );

    add_acronyms('ed','eds'); # adding support for 'Ed. Eds.'
    my $sentences = get_sentences($text);
    for my $sentence (@$sentences) {
        # do something with $sentence
    }
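
    To avoid retyping the list from the question, its entries could first be reduced to period-less stems like the 'ed','eds' used in the call above. This is a minimal sketch in core Perl only; $list is a shortened, hypothetical stand-in for the real list, and how the module treats entries with inner periods (such as "e.g.") is an open question to check against its documentation.

```perl
use strict;
use warnings;

# Shortened stand-in for the list in the question, in its Abbr["..."]; syntax.
my $list = 'Abbr["Co."]; Abbr["Corp."]; Abbr["ed."]; Abbr["eds."]; Abbr["Mr."];';

my @stems;
for my $abbr ($list =~ /Abbr\["([^"]+)"\]/g) {
    (my $stem = $abbr) =~ s/\.\z//;   # drop only the trailing period
    push @stems, $stem;
}

print join(',', @stems), "\n";
```

    This prints Co,Corp,ed,eds,Mr. The resulting @stems could then be handed over as add_acronyms(@stems) before calling get_sentences.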
Re: Splitting a Sentence
by InfiniteSilence (Curate) on Jul 03, 2014 at 14:14 UTC

    Is it just me, or does anybody else read this as a homework assignment? I mean...how can you have a ready-made list of abbreviations...a fairly complete one, mind you, but have no idea how to integrate it into a script?

    Celebrate Intellectual Diversity

      Sorry. I am a newbie, but rather old to be doing homework assignments: I am 65 years old and work on language analysis, and I am learning Perl since it helps me do a lot of string manipulation which in C would be an expensive proposition. No homework assignment here, I am afraid. Sentence splitting is a major problem in NLP and creates issues, as you can see from the replies posted.
        Never too old for school or homework. The whole world is a school.
Re: Splitting a Sentence
by Anonymous Monk on Jul 03, 2014 at 10:13 UTC
      The linked module doesn't handle abbreviations.
      لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

Node Type: perlquestion [id://1092133]
Approved by Corion