in reply to Re^2: LaTeX Abbreviations for Linguists
in thread LaTeX Abbreviations for Linguists

#Last Updated 05.09.09
Um... okay, given the date of your last reply, that comment line in your code can be interpreted correctly, but you should be aware that, taken by itself, that date string could mean three different things (May 9, 2009; September 5, 2009; or September 9, 2005).

This regex in your code looks wrong, and is very different from the one I recommended (and from the one I quoted out of the original version of your script):

if ( $file[$i] =~ m/(\\(gll|[abcdef(exg.)]g\.)|textsc\/)/ ) {
It creates a character class that includes "e" two times, and also includes period and open and close parens. It will match things like "\)g.", "\(g.", "\.g.", "\xg.", etc, and probably won't match things that you want it to match, like "exg". I realize now that in my earlier reply, I left out a colon; I've updated that accordingly, and I apologize for that mistake.
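
Here's a quick sketch that runs both patterns against a few strings, in case it helps to see the difference. The "corrected" pattern is the one from my rewrite further down; the sample strings are made up for illustration, so treat this as a demonstration rather than a test of your real data.

```perl
use strict;
use warnings;

# The pattern quoted above, and the corrected one from the rewrite below.
# (Written with qr{} so the "/" in "textsc/" needs no escaping.)
my $bad  = qr{(\\(gll|[abcdef(exg.)]g\.)|textsc/)};
my $good = qr{(\\(?:gll|[abcdef]g\.|exg\.?))};

# Made-up sample strings, for illustration only.
my @samples = ( '\gll', '\ag.', '\exg.', '\)g.', '\xg.' );

my ( %bad_hit, %good_hit );
for my $s (@samples) {
    $bad_hit{$s}  = ( $s =~ $bad )  ? 1 : 0;
    $good_hit{$s} = ( $s =~ $good ) ? 1 : 0;
    printf "%-6s  quoted: %-3s  corrected: %s\n",
        $s,
        $bad_hit{$s}  ? 'yes' : 'no',
        $good_hit{$s} ? 'yes' : 'no';
}
```

The quoted pattern accepts '\)g.' and '\xg.' but rejects '\exg.', while the corrected one does the opposite.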

Consistent indentation is a nice thing, and so is using @ARGV for things like asking for usage help and providing file names and options -- please get acquainted with using @ARGV (and Getopt::Std and/or Getopt::Long), because making the user manually type things in after the script is running is a Real Pain™.

A long list of "configuration" or "initialization" data (your "@lgr" list) would be handled more cleanly (and would be easier to maintain) as a __DATA__ segment that gets read into an array or hash on start-up.

The regex alternation character (vertical bar, |) does not work as such inside a regex character class (between square brackets); there it just matches a literal "|". You should study the perlre man page to understand how character classes work.
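
A minimal sketch of the difference, using made-up patterns that have nothing to do with your script:

```perl
use strict;
use warnings;

# Inside [...], "|" is just another literal character in the set.
my $class = qr/^[ab|cd]$/;      # exactly one character: a, b, |, c or d
my $alt   = qr/^(?:ab|cd)$/;    # the string "ab" or the string "cd"

my $class_pipe = ( '|'  =~ $class ) ? 1 : 0;    # matches: "|" is in the set
my $class_ab   = ( 'ab' =~ $class ) ? 1 : 0;    # fails: a class matches one character
my $alt_ab     = ( 'ab' =~ $alt )   ? 1 : 0;    # matches: first alternative
my $alt_pipe   = ( '|'  =~ $alt )   ? 1 : 0;    # fails: "|" is an operator here

print "class vs '|': $class_pipe, class vs 'ab': $class_ab\n";
print "alt vs 'ab': $alt_ab, alt vs '|': $alt_pipe\n";
```

If you need alternation between multi-character strings, use a group like (?:ab|cd); a character class only ever picks one character.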

Also, when you want to delete all occurrences of particular characters from a string, using tr/xyz//d is much more efficient than s/x//g; s/y//g; s/z//g; (using tr is even more efficient than s/[xyz]//g).
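
For example, the three approaches give the same result on a made-up string. This only shows that they are equivalent; I haven't included timings, so use the Benchmark module if you want to measure the difference on your own data.

```perl
use strict;
use warnings;

my $text = 'ex=am.ple: text';    # made-up sample string

# Transliteration with /d deletes every listed character in one pass,
# without the regex engine. It returns the number of characters it hit.
my $via_tr  = $text;
my $deleted = ( $via_tr =~ tr/=.://d );

# Same result with a single substitution over a character class.
( my $via_class = $text ) =~ s/[=.:]//g;

# Same result again with one substitution per character: three passes.
my $via_chain = $text;
$via_chain =~ s/=//g;
$via_chain =~ s/\.//g;
$via_chain =~ s/://g;

print "$via_tr ($deleted characters removed)\n";
```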

For a small script like this, modifying global-scope variables inside of subroutines isn't such a big deal, but it's usually not a good idea. As a general rule (and in the interest of creating subroutines that are modular and easy to maintain and adapt), it's better to pass data to subs as parameters, and have the sub either return its resulting data to the caller or modify its parameters in place (because they were passed as references).
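
Here's a small sketch of both styles. The sub names and the whitespace-trimming task are hypothetical, chosen only to keep the example short; neither sub touches any variable outside its own parameters.

```perl
use strict;
use warnings;

# Hypothetical helper: works on a copy and returns the result,
# leaving the caller's data untouched.
sub trim_lines {
    my @lines = @_;
    s/\s+\z// for @lines;
    return @lines;
}

# Hypothetical helper: takes an array reference and modifies the
# caller's array in place.
sub trim_lines_in_place {
    my ($lines_ref) = @_;
    s/\s+\z// for @$lines_ref;
    return;
}

my @input   = ( "alpha  ", "beta\t" );
my @trimmed = trim_lines(@input);     # @input is unchanged here
trim_lines_in_place( \@input );       # now @input itself is trimmed
```

Either way, a reader can tell from the call alone what data the sub works on, which is the point of avoiding globals.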

If you document your code with POD, it will be easier to read and maintain the documentation, which is important. If the documentation includes a brief description of what the code actually does, that will help you to organize your thoughts in a sensible way about the algorithm, and then organize your code according to what makes sense (and is documented). As it is, there's a lot of inefficiency in your code, because the algorithm hasn't been thought out. In particular, you are using arrays where you should be using hashes.
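
To illustrate the array-versus-hash point, here's a sketch with a handful of entries in the same "ABBR=term" form as your list. The sub name and variable names are mine, not from your script.

```perl
use strict;
use warnings;

# A few entries for the sketch; the real list would be much longer.
my @lgr_list = ( 'ACC=accusative', 'DAT=dative', 'PL=plural', 'SG=singular' );

# Array approach: every lookup scans the list from the start.
sub lookup_in_array {
    my ($abbr) = @_;
    for my $entry (@lgr_list) {
        my ( $key, $term ) = split /=/, $entry, 2;
        return $term if $key eq $abbr;
    }
    return;
}

# Hash approach: build the table once, then each lookup is one step.
my %lgr = map { my ( $k, $v ) = split /=/, $_, 2; ( $k => $v ) } @lgr_list;

# A hash also gives you de-duplication and counts for free.
my %seen;
$seen{$_}++ for qw(SG PL SG ACC SG);

print "DAT via array: ", lookup_in_array('DAT'), "\n";
print "DAT via hash:  $lgr{DAT}\n";
print "distinct abbreviations seen: ", scalar( keys %seen ), "\n";
```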

This note in your help text is not necessarily true:

... Note: The script and the TeX file have to be in the same directory. ...
The script could be in the shell's execution PATH, so it doesn't have to be where the data file is; also, a user can provide (relative or absolute) paths for both the script and the data file, so they don't both have to be in the same place. Also, it's always a good idea to include the filename string and $! in the error message when you "die" on a failed "open" call.

I happened to notice this one odd entry in your long list of LGR abbrevs:

N-=non- (e.g. NSG nonsingular, NPST nonpast)
There's nothing in your code that handles this "N" prefix on other abbrevs, so things like "NPST" and "NSG" will never be labeled as "nonpast" or "nonsingular" in your output. Also, that explanation will never appear in the output either, unless "N-" happens to occur in the tex file.
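
If you do want the prefix handled, one possible approach is a fallback lookup: try the abbreviation as-is, and only if that fails, treat a leading "N" as "non-" on a known abbreviation. This is just a sketch under that assumption; expand_abbr is a hypothetical name, and the table holds only a few entries from the list.

```perl
use strict;
use warnings;

# A few LGR entries for the sketch; the real table would hold the full list.
my %lgr = (
    N   => 'neuter',
    NEG => 'negation, negative',
    NOM => 'nominative',
    PST => 'past',
    SG  => 'singular',
);

# Hypothetical helper: exact match first, then the "N" = "non-" fallback.
sub expand_abbr {
    my ($abbr) = @_;
    return $lgr{$abbr} if exists $lgr{$abbr};
    if ( $abbr =~ /^N(.+)$/ and exists $lgr{$1} ) {
        return "non$lgr{$1}";
    }
    return;    # unknown abbreviation
}

print "$_ = ", ( expand_abbr($_) // '(unknown)' ), "\n" for qw(NSG NPST NEG N XYZ);
```

Checking for an exact match first matters, because NEG, NOM and N itself all start with "N" without being "non-" forms.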

One last point: do you have a suitable "test.tex" file that contains at least one example of every kind of abbreviation you intend to handle with this script, along with some variety of "normal" content? If not, make one. The point would be to make sure that all these abbreviations get listed as intended.

Of course, you can't anticipate all the ways that "normal" LaTeX content might cause your script to miss things that are real abbreviations (e.g. if two abbrevs occur next to each other separated by a single space, the second one will be missed), or to list things as abbrevs when they really aren't (e.g. FULL WORDS IN UPPERCASE, or any single-digit number). But even a little bit of testing is better than none.
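
The adjacent-abbreviation problem comes from patterns that match the delimiters on both sides: the delimiter after the first abbreviation gets used up, so the second one has nothing in front of it. A possible fix, which I haven't tried against real LaTeX, is to use lookarounds so the delimiters are checked but not consumed. The sample line here is made up.

```perl
use strict;
use warnings;

my $line = ' 3 SG PL come-PST ';    # made-up gloss line

# Delimiters are part of the match, so the space after "SG" is used up
# and "PL" no longer has a leading delimiter to match against.
my @consuming = $line =~ /[-=\s.:]([A-Z]+)[-=\s.:]/g;

# Lookarounds check for the delimiters without consuming them.
my @lookaround = $line =~ /(?<=[-=\s.:])([A-Z]+)(?=[-=\s.:])/g;

print "consuming:  @consuming\n";
print "lookaround: @lookaround\n";
```

On this line the first pattern finds SG and PST but skips PL, while the lookaround version finds all three.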

Here's how I would write your script (though this version won't behave exactly the same as yours, and might have some mistakes in it -- I didn't have any LaTeX files with abbreviations to test it on):

(note that you have to fill in the part about defining what abbreviations are, and that part should be updated as you refine your code)
#!/usr/bin/perl

=head1 NAME

name-of-script

=head1 SYNOPSIS

name-of-script [-l] filename.tex

=head1 DESCRIPTION

This script reads a given LaTeX file, finds everything in the text
that looks like an abbreviation, and then creates a new file in the
same directory (called "filename-abbrev.txt") that lists them all.

In this process, an abbreviation in LaTeX is defined as:

 - this...
 - that...
 - whatever else...

If your LaTeX file uses abbreviations that are specified in the
'Leipzig Glossing Rules' (LGR), you can use the '-l' option to have
these abbreviations listed with the full terms that they represent.
In this case, the output file will list the non-LGR abbreviations
first, and then the LGR ones are given with their meanings.

=cut

use strict;
use Getopt::Long;

# Load the LGR table (abbreviation => full term) from the __DATA__ section.
my %lgr;
while (<DATA>) {
    chomp;
    my ($abbr, $term) = split( /=/ );
    $lgr{$abbr} = $term;
}

my $Usage = "$0 [-l] filename.tex\n    (run 'perldoc $0' for help)\n";

my $opt_lgr;
my $opt_ok = GetOptions( 'l' => \$opt_lgr );
my $arg_ok = ( @ARGV == 1 and -f $ARGV[0] );

die $Usage unless ( $opt_ok and $arg_ok );

my $filename = shift;

open( TEX, "<:utf8", $filename ) or die "$0: $filename: $!\n";
my @texlines = <TEX>;
close TEX;
chomp @texlines;

# For each line that starts a gloss, collect abbreviations from the next line.
my %abbr_seen;
for my $ln ( 0 .. $#texlines - 1 ) {
    next unless ( $texlines[$ln] =~ /(\\(?:gll|[abcdef]g\.|exg\.?))/ );
    my $ln1 = $ln + 1;
    while ( $texlines[$ln1] =~ / [-=\s.:]([A-Z]+)[-=\s.:] |
                                 (SG|DU|PL) |
                                 ([123]) /gx ) {
        # $+ holds whichever group matched; $1 is undefined for the
        # second and third alternatives.
        $abbr_seen{$+}++;
    }
}

$filename =~ s/\.tex.*//;
$filename .= '-abbrev.txt';

open( ABBR, ">:utf8", $filename ) or die "$0: $filename: $!\n";

# Plain list first (non-LGR abbreviations only, when -l is given).
for my $abbr ( sort keys %abbr_seen ) {
    next if ( $opt_lgr and exists( $lgr{$abbr} ));
    print ABBR "$abbr\n";
}

# Then the LGR abbreviations with their meanings.
if ( $opt_lgr ) {
    print ABBR "\n";
    for my $abbr ( sort keys %abbr_seen ) {
        print ABBR "\\item[$abbr] '$lgr{$abbr}'\n" if ( exists( $lgr{$abbr} ));
    }
}
close ABBR;

__DATA__
1=first person
2=second person
3=third person
A=agent-like argument of canonical transitive verb
ABL=ablative
ABS=absolutive
ACC=accusative
ADJ=adjective
ADV=adverb(ial)
AGR=agreement
ALL=allative
ANTIP=antipassive
APPL=applicative
ART=article
AUX=auxiliary
BEN=benefactive
CAUS=causative
CLF=classifier
COM=comitative
COMP=complementizer
COMPL=completive
COND=conditional
COP=copula
CVB=converb
DAT=dative
DECL=declarative
DEF=definite
DEM=demonstrative
DET=determiner
DIST=distal
DISTR=distributive
DU=dual
DUR=durative
ERG=ergative
EXCL=exclusive
F=feminine
FOC=focus
FUT=future
GEN=genitive
IMP=imperative
INCL=inclusive
IND=indicative
INDF=indefinite
INF=infinitive
INS=instrumental
INTR=intransitive
IPFV=imperfective
IRR=irrealis
LOC=locative
M=masculine
N=neuter
N-=non- (e.g. NSG nonsingular, NPST nonpast)
NEG=negation, negative
NMLZ=nominalizer/nominalization
NOM=nominative
OBJ=object
OBL=oblique
P=patient-like argument of canonical transitive verb
PASS=passive
PFV=perfective
PL=plural
POSS=possessive
PRED=predicative
PRF=perfect
PRS=present
PROG=progressive
PROH=prohibitive
PROX=proximal/proximate
PST=past
PTCP=participle
PURP=purposive
Q=question particle/marker
QUOT=quotative
RECP=reciprocal
REFL=reflexive
REL=relative
RES=resultative
S=single argument of canonical intransitive verb
SBJ=subject
SBJV=subjunctive
SG=singular
TOP=topic
TR=transitive
VOC=vocative
(update: added missing ABBR file handle in last print statement)