Statistical NLP

Hello Brothers,

Recently we acquired an interesting book, Foundations of Statistical Natural Language Processing. It is supposed to be the current state-of-the-art reference for statistical NLP. I'd like to quote from Chapter 4 (Corpus-Based Work), Section 4.1.3 (Software), specifically the paragraph on programming languages (page 121):

Most Statistical NLP work is currently done in C/C++. The need to deal
with large amounts of data collection and processing from large texts
means that the efficiency gains of coding in a language like C/C++ are
generally worth it. But for a lot of the ancillary processing of text, there
are many other languages which may be more economical with human
labor. Many people use Perl for general text preparation and reformatting.
Its integration of regular expressions into the language syntax is
particularly powerful. In general, interpreted languages are faster for
these kinds of tasks than writing everything in C. Old timers might still
use awk rather than Perl - even though what you can do with it is rather
more limited. Another choice, better liked by programming purists is
Python, but using regular expressions in Python just is not as easy as
Perl. One of the authors still makes considerable use of Prolog. The
built-in database facilities and easy handling of complicated data structures
makes Prolog excel for some tasks, but again, lacks the easy access to
regular expressions available in Perl. [...]

So it seems we're not doing too badly with Perl as our choice. :-)

Bye
PetaMem


Re: Statistical NLP by kvale (Monsignor) on Jul 31, 2002 at 15:57 UTC
I can't help but think that Perl also appeals to linguists because Larry Wall consciously incorporated elements of human languages into Perl itself: expressive keywords, non-orthogonal constructions, and context sensitivities that make expressions and statements more natural.

-Mark
Re: Re: Statistical NLP by Hanamaki (Chaplain) on Aug 01, 2002 at 11:32 UTC
"I can't help but think that Perl also appeals to linguists because Larry Wall consciously incorporated elements of human languages into Perl itself: ..."

While I cannot prove that your opinion is false, I do not think you are right in this case. If the incorporation of elements of human languages appeals to some group, it should appeal to humans in general rather than to the subset of linguists. Considering the extensive use of the zero pronoun (`$_`), Perl could be viewed as more appealing to native speakers of Japanese than to Anglo-Saxons. No, I don't believe this, but if someone has empirical data about it, I may change my mind.

A more functional approach would be to ask what Perl gives linguists to make their job easier. Since you can easily build regular grammars (Chomsky hierarchy Type 3 grammars) with the regular subset of Perl's (not strictly regular) expressions, Perl is a good tool for implementing Type 3 grammars and/or testing some theories. If you look further into natural language processing, you will see that finite-state technologies are widely applied and have gained some popularity in the field. Implementing finite-state automata (`m//`) and finite-state transducers (`s///`) in Perl is pretty easy. While you may port some really huge automata to C or whatever for efficiency, Perl should be good enough for experimenting and building smaller automata. So for some kinds of theoretical linguistics Perl is an easy-to-use tool, and therefore popular (at least among students).

While you can build context-free grammars (Chomsky Type 2) with Perl or Perl parser modules, a linguist needing this kind of grammar may leave Perl and look for some other programming language. Perl probably isn't that popular in the field of semantics either; here Prolog seems to rule.

As we know, Perl is a pretty good language for text processing and matching. In other words, it is a good helper language for corpus linguistics, statistical NLP, data preparation, etc.
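The finite-state point can be made concrete with a small sketch. It is only an illustration under stated assumptions: the tag names `DET`, `ADJ`, `NOUN`, and `VERB`, the toy noun-phrase grammar, and the helper names `is_noun_phrase` and `bracket_noun_phrases` are all made up for this example. The pattern uses only the regular subset of Perl's regex syntax, so `m//` acts as an acceptor and `s///` as a simple transducer over a space-separated tag sequence.

```perl
use strict;
use warnings;

# Hypothetical Type 3 grammar over part-of-speech tags:
#   NP -> (DET)? (ADJ)* NOUN (NOUN)*
# Only concatenation, optionality, and repetition are used,
# so the pattern stays within the regular languages.
my $NP = qr/(?:DET )?(?:ADJ )*NOUN(?: NOUN)*/;

# Acceptor (m//): does the whole tag sequence form one noun phrase?
sub is_noun_phrase {
    my ($tags) = @_;
    return $tags =~ m/\A$NP\z/ ? 1 : 0;
}

# Transducer (s///): rewrite the input, bracketing every noun phrase.
# \b keeps the pattern from matching inside longer tags such as PRONOUN.
sub bracket_noun_phrases {
    my ($tags) = @_;
    (my $out = $tags) =~ s/\b($NP)\b/[NP $1]/g;
    return $out;
}

print is_noun_phrase('DET ADJ NOUN') ? "accept\n" : "reject\n";   # accept
print is_noun_phrase('ADJ DET')      ? "accept\n" : "reject\n";   # reject
print bracket_noun_phrases('DET ADJ NOUN VERB DET NOUN'), "\n";
# [NP DET ADJ NOUN] VERB [NP DET NOUN]
```

A grammar of this size is exactly the "smaller automata" case: the compiled `qr//` pattern is shared between the acceptor and the transducer, so the grammar is written only once.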
Hanamaki
Re: Statistical NLP by vladb (Vicar) on Jul 31, 2002 at 17:47 UTC
Perl code appears only natural to me. The syntax is flexible enough to accommodate the tastes of a diverse range of developers. With the work being done on the Parrot engine, which will serve as a base for the new and improved version of Perl (6), the lexical element of the language is given an additional dimension: it will be possible to alter any portion of the language to fit one's cultural, aesthetic, and other needs.

Although regular expressions are the jewel of Perl, other things are no less important. First to mind comes its incredible flexibility and power to do complicated tasks in just a few statements. By contrast, the same task may require significantly more code and effort if tackled in C, for example. Also worth mentioning are its numerous modules, designed to solve any problem imaginable (at least in the Internet/IP realm).

_____________________
`# Under Construction`

