Re^3: LaTeX Abbreviations for Linguists

#Last Updated 05.09.09

Um... okay, given the date of your last reply, that comment line in your code can be interpreted correctly, but you should be aware that taken by itself, that date string could mean three different things (May 9, 2009; 5 September 2009; or 9 September 2005).

This regex in your code looks wrong, and is very different from the one I recommended (and from the one I quoted out of the original version of your script):

    if ( $file[$i] =~ m/(\\(gll|[abcdef(exg.)]g\.)|textsc\/)/ ) {

It creates a character class that includes "e" two times, and also includes period and open and close parens. It will match things like "\)g.", "\(g.", "\.g.", "\xg.", etc, and probably won't match things that you want it to match, like "exg". I realize now that in my earlier reply, I left out a colon; I've updated that accordingly, and I apologize for that mistake.
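
If you want to see the difference for yourself, here is a small sketch. The sample strings are made up, and both patterns are trimmed down to the part that matters (the "gll" and "textsc" branches are left out):

```perl
use strict;
use warnings;

# Trimmed version of the pattern in the script: the square brackets form
# ONE character class containing a b c d e f ( x g . ) and nothing more.
my $class_re = qr/\\[abcdef(exg.)]g\./;

# An alternation that spells out the command names instead.
my $alt_re = qr/\\(?:[abcdef]g\.|exg\.)/;

# Made-up test strings (single quotes, so '\\' is one backslash).
my @samples = ( '\\ag.', '\\exg.', '\\xg.', '\\(g.' );
for my $s (@samples) {
    printf "%-6s class: %-3s alternation: %s\n", $s,
        ( $s =~ $class_re ? 'yes' : 'no' ),
        ( $s =~ $alt_re   ? 'yes' : 'no' );
}
```

The character-class version accepts "\xg." and "\(g." but rejects "\exg.", while the alternation does the opposite.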

Consistent indentation is a nice thing, and so is using @ARGV for things like asking for usage help and providing file names and options -- please get acquainted with using @ARGV (and Getopt::Std and/or Getopt::Long), because making the user manually type things in after the script is running is a Real Pain™.
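
For instance, here is a minimal sketch of the @ARGV approach using Getopt::Std. The option letter and the file name are just placeholders, and @ARGV is filled in by hand only so that the snippet runs on its own:

```perl
use strict;
use warnings;
use Getopt::Std;

# Simulated command line for this demonstration only; a real script
# would leave @ARGV exactly as the shell provided it.
@ARGV = ( '-l', 'paper.tex' );

my %opts;
# getopts() removes recognized options from @ARGV and records them
# in %opts; it returns false if it hits an unknown option.
getopts( 'l', \%opts ) or die "Usage: $0 [-l] filename.tex\n";
die "Usage: $0 [-l] filename.tex\n" unless @ARGV == 1;

my $texfile = shift @ARGV;
print "file: $texfile, -l given: ", ( $opts{l} ? 'yes' : 'no' ), "\n";
```

The user gives everything up front on the command line, and a wrong invocation produces the usage message right away instead of a prompt.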

A long list of "configuration" or "initialization" data (your "@lgr" list) would be handled more cleanly (and would be easier to maintain) as a __DATA__ segment that gets read into an array or hash on start-up.

The regex alternation character (vertical-bar, |) does not work as such inside a regex character class (between square brackets), it just matches a literal "|" -- so you should study the perlre man page to understand how character classes work.
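
A tiny sketch of that point, using made-up one-character strings:

```perl
use strict;
use warnings;

my $class = qr/^[a|b]$/;      # exactly one character: "a", "|" or "b"
my $alt   = qr/^(?:a|b)$/;    # either "a" or "b"

for my $s ( 'a', 'b', '|' ) {
    print "'$s'  class: ", ( $s =~ $class ? 'yes' : 'no' ),
          "  alternation: ", ( $s =~ $alt ? 'yes' : 'no' ), "\n";
}
```

Only the character class accepts the string "|", because inside the brackets the vertical bar is just another member of the set.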

Also, when you want to delete all occurrences of particular characters from a string, using tr/xyz//d is much more efficient than s/x//g; s/y//g; s/z//g; (using tr is even more efficient than s/[xyz]//g).
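
Here is a short sketch of the two approaches side by side (the sample string is made up):

```perl
use strict;
use warnings;

my $text = 'x-ray yak zebra';

# Three separate passes over the string.
my $by_subst = $text;
$by_subst =~ s/x//g;
$by_subst =~ s/y//g;
$by_subst =~ s/z//g;

# One pass: tr/// with the /d modifier deletes every listed character
# and returns the number of characters it matched.
my $by_tr   = $text;
my $removed = ( $by_tr =~ tr/xyz//d );

print "s///: '$by_subst'\n";
print "tr:   '$by_tr' ($removed characters removed)\n";
```

Both give the same string; tr just gets there without compiling or running any regex.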

For a small script like this, modifying global-scope variables inside of subroutines isn't such a big deal, but it's usually not a good idea. As a general rule (and in the interest of creating subroutines that are modular and easy to maintain and adapt), it's better to pass data to subs as parameters, and have the sub either return its resulting data to the caller, or modify its parameters in-place (because they were passed as references).
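
To make the two styles concrete, here is a rough sketch. The sub names uniq_sorted and count_into are invented for this illustration and don't come from your script:

```perl
use strict;
use warnings;

# Style 1: take data as parameters and return a new list;
# the caller's array is left untouched.
sub uniq_sorted {
    my @items = @_;
    my %seen;
    my @out = sort grep { !$seen{$_}++ } @items;
    return @out;
}

# Style 2: receive a hash reference and update the caller's hash in place.
sub count_into {
    my ( $counts, @items ) = @_;
    $counts->{$_}++ for @items;
    return;
}

my @found  = qw(PL SG PL ACC);
my @unique = uniq_sorted(@found);

my %tally;
count_into( \%tally, @found );

print "unique: @unique\n";
print "$_: $tally{$_}\n" for sort keys %tally;
```

Either way, everything the sub touches is visible in its call, so you can move it to another script without dragging a set of globals along.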

If you document your code with POD, it will be easier to read and maintain the documentation, which is important. If the documentation includes a brief description of what the code actually does, that will help you to organize your thoughts in a sensible way about the algorithm, and then organize your code according to what makes sense (and is documented). As it is, there's a lot of inefficiency in your code, because the algorithm hasn't been thought out. In particular, you are using arrays where you should be using hashes.

This note in your help text is not necessarily true:

... Note: The script and the TeX file have to be in the same directory. ...

The script could be in the shell's execution PATH, so it doesn't have to be where the data file is; also, a user can provide (relative or absolute) paths for both the script and the data file, so they don't both have to be in the same place. Also, it's always a good idea to include the filename string and $! in the error message when you "die" on a failed "open" call.

I happened to notice this one odd entry in your long list of LGR abbrevs:

N-=non- (e.g. NSG nonsingular, NPST nonpast)

There's nothing in your code that handles this "N" prefix on other abbrevs, so things like "NPST" and "NSG" will never be labeled as "nonpast" or "nonsingular" in your output. Also, that explanation will never appear in the output either, unless "N-" happens to occur in the tex file.
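
One possible way to handle it, as a sketch only: expand_abbr is a hypothetical helper (not in your script), the table is a four-entry stand-in for the full LGR list, and I'm assuming that "N" plus a known abbreviation should always read as "non" plus that term whenever there is no exact entry.

```perl
use strict;
use warnings;

# Small stand-in for the full LGR table.
my %lgr = (
    SG  => 'singular',
    PST => 'past',
    NEG => 'negation, negative',
    NOM => 'nominative',
);

# Hypothetical helper: exact entries win; otherwise a leading "N"
# followed by a known abbreviation is read as "non" + that term.
sub expand_abbr {
    my ( $abbr, $table ) = @_;
    return $table->{$abbr} if exists $table->{$abbr};
    if ( $abbr =~ /^N(.+)$/ and exists $table->{$1} ) {
        return "non$table->{$1}";
    }
    return;    # unknown abbreviation
}

for my $abbr (qw(SG NSG NPST NEG XYZ)) {
    my $term = expand_abbr( $abbr, \%lgr );
    print "$abbr = ", ( defined $term ? $term : '(not in LGR)' ), "\n";
}
```

Checking for an exact entry first matters, because NEG and NOM also start with "N" and must not be split.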

One last point: do you have a suitable "test.tex" file that contains at least one example of every kind of abbreviation you intend to handle with this script, along with some variety of "normal" content? If not, make one. The point would be to make sure that all these abbreviations get listed as intended.

Of course, you can't anticipate all the ways that "normal" LaTeX content might cause your script to miss things that are real abbreviations (e.g. if two abbrevs occur next to each other separated by a single space, the second one will be missed), or to list things as abbrevs when they really aren't (e.g. FULL WORDS IN UPPERCASE, or any single-digit number). But even a little bit of testing is better than none.
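
The "two abbrevs next to each other" case is easy to reproduce. The sketch below uses a made-up line, and the lookahead is just one possible fix that I have not tried on real glossed text:

```perl
use strict;
use warnings;

my $line = 'dog ACC DAT cat';    # made-up sample line

# Both delimiters are consumed by the match, so the space after ACC
# is already used up when the search resumes at DAT.
my @consuming = $line =~ /[-=\s.:]([A-Z]+)[-=\s.:]/g;

# Here the trailing delimiter is only checked with a lookahead, not
# consumed, so it can also serve as the leading delimiter of DAT.
my @lookahead = $line =~ /[-=\s.:]([A-Z]+)(?=[-=\s.:])/g;

print "consuming: @consuming\n";
print "lookahead: @lookahead\n";
```

The first pattern finds only ACC; the second finds both ACC and DAT.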

Here's how I would write your script (though this version won't behave exactly the same as yours, and might have some mistakes in it -- I didn't have any LaTeX files with abbreviations to test it on):

(note that you have to fill in the part about defining what abbreviations are, and that part should be updated as you refine your code)

#!/usr/bin/perl

=head1 NAME

name-of-script

=head1 SYNOPSIS

 name-of-script [-l] filename.tex

=head1 DESCRIPTION

This script reads a given LaTeX file, finds everything in the text
that looks like an abbreviation, and then creates a new file in the
same directory (called "filename-abbrev.txt") that lists them all.

In this process, an abbreviation in LaTeX is defined as:

  - this...

  - that...

  - whatever else...

If your LaTeX file uses abbreviations that are specified in the
'Leipzig Glossing Rules' (LGR), you can use the '-l' option to have
these abbreviations listed with the full terms that they represent.
In this case, the output file will list the non-LGR abbreviations
first, and then the LGR ones are given with their meanings.

=cut

use strict;
use Getopt::Long;

my %lgr;
while (<DATA>) {
    chomp;
    my ($abbr, $term) = split( /=/ );
    $lgr{$abbr} = $term;
}

my $Usage = "$0 [-l] filename.tex\n  (run 'perldoc $0' for help)\n";
my $opt_lgr;
my $opt_ok = GetOptions( 'l' => \$opt_lgr );
my $arg_ok = ( @ARGV == 1 and -f $ARGV[0] );
die $Usage unless ( $opt_ok and $arg_ok );

my $filename = shift;

open( TEX, "<:utf8", $filename ) or die "$0: $filename: $!\n";
my @texlines = <TEX>;
close TEX;
chomp @texlines;

my %abbr_seen;
for my $ln ( 0 .. $#texlines - 1 ) {
    next unless ( $texlines[$ln] =~ /(\\(?:gll|[abcdef]g\.|exg\.?))/ );
    my $ln1 = $ln + 1;
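    while ( $texlines[$ln1] =~
            / [-=\s.:]([A-Z]+)[-=\s.:] | (SG|DU|PL) | ([123]) /gx ) {
        # $+ holds whichever of the three capture groups matched;
        # $1 alone is undefined for the second and third alternatives
        $abbr_seen{$+}++;
    }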
}

$filename =~ s/\.tex.*//;
$filename .= '-abbrev.txt';
open( ABBR, ">:utf8", $filename ) or die "$0: $filename: $!\n";

for my $abbr ( sort keys %abbr_seen ) {
    next if ( $opt_lgr and exists( $lgr{$abbr} ));
    print ABBR "$abbr\n";
}
if ( $opt_lgr ) {
    print ABBR "\n";
    for my $abbr ( sort keys %abbr_seen ) {
        print ABBR "\\item[$abbr] '$lgr{$abbr}'\n" if ( exists( $lgr{$abbr} ));
    }
}
close ABBR;

__DATA__
1=first person
2=second person
3=third person
A=agent-like argument of canonical transitive verb
ABL=ablative
ABS=absolutive
ACC=accusative
ADJ=adjective
ADV=adverb(ial)
AGR=agreement
ALL=allative
ANTIP=antipassive
APPL=applicative
ART=article
AUX=auxiliary
BEN=benefactive
CAUS=causative
CLF=classifier
COM=comitative
COMP=complementizer
COMPL=completive
COND=conditional
COP=copula
CVB=converb
DAT=dative
DECL=declarative
DEF=definite
DEM=demonstrative
DET=determiner
DIST=distal
DISTR=distributive
DU=dual
DUR=durative
ERG=ergative
EXCL=exclusive
F=feminine
FOC=focus
FUT=future
GEN=genitive
IMP=imperative
INCL=inclusive
IND=indicative
INDF=indefinite
INF=infinitive
INS=instrumental
INTR=intransitive
IPFV=imperfective
IRR=irrealis
LOC=locative
M=masculine
N=neuter
N-=non- (e.g. NSG nonsingular, NPST nonpast)
NEG=negation, negative
NMLZ=nominalizer/nominalization
NOM=nominative
OBJ=object
OBL=oblique
P=patient-like argument of canonical transitive verb
PASS=passive
PFV=perfective
PL=plural
POSS=possessive
PRED=predicative
PRF=perfect
PRS=present
PROG=progressive
PROH=prohibitive
PROX=proximal/proximate
PST=past
PTCP=participle
PURP=purposive
Q=question particle/marker
QUOT=quotative
RECP=reciprocal
REFL=reflexive
REL=relative
RES=resultative
S=single argument of canonical intransitive verb
SBJ=subject
SBJV=subjunctive
SG=singular
TOP=topic
TR=transitive
VOC=vocative

(update: added missing ABBR file handle in last print statement)
