http://qs321.pair.com?node_id=777633
Category: Text Processing
Author/Contact Info
Description: This is a script which finds the abbreviations in the glossings and lists them alphabetically. The abbreviations defined by Leipzig Glossing Rules (http://www.eva.mpg.de/lingua/resources/glossing-rules.php) are left out per default. If you want to list the LGR abbreviations, too (and their definition), you should use the "-lgr" suffix.
#!/usr/bin/perl
#abbreviation script (ab)
#Last Updated 05.09.09

use strict;
use utf8;

my $option = "no";
my $length;
my @sorted_abbr;
my @lgr = ("1=first person",
"2=second person",
"3=third person",
"A=agent-like argument of canonical transitive verb",
"ABL=ablative",
"ABS=absolutive",
"ACC=accusative",
"ADJ=adjective",
"ADV=adverb(ial)",
"AGR=agreement",
"ALL=allative",
"ANTIP=antipassive",
"APPL=applicative",
"ART=article",
"AUX=auxiliary",
"BEN=benefactive",
"CAUS=causative",
"CLF=classifier",
"COM=comitative",
"COMP=complementizer",
"COMPL=completive",
"COND=conditional",
"COP=copula",
"CVB=converb",
"DAT=dative",
"DECL=declarative",
"DEF=definite",
"DEM=demonstrative",
"DET=determiner",
"DIST=distal",
"DISTR=distributive",
"DU=dual",
"DUR=durative",
"ERG=ergative",
"EXCL=exclusive",
"F=feminine",
"FOC=focus",
"FUT=future",
"GEN=genitive",
"IMP=imperative",
"INCL=inclusive",
"IND=indicative",
"INDF=indefinite",
"INF=infinitive",
"INS=instrumental",
"INTR=intransitive",
"IPFV=imperfective",
"IRR=irrealis",
"LOC=locative",
"M=masculine",
"N=neuter",
"N-=non- (e.g. NSG nonsingular, NPST nonpast)",
"NEG=negation, negative",
"NMLZ=nominalizer/nominalization",
"NOM=nominative",
"OBJ=object",
"OBL=oblique",
"P=patient-like argument of canonical transitive verb",
"PASS=passive",
"PFV=perfective",
"PL=plural",
"POSS=possessive",
"PRED=predicative",
"PRF=perfect",
"PRS=present",
"PROG=progressive",
"PROH=prohibitive",
"PROX=proximal/proximate",
"PST=past",
"PTCP=participle",
"PURP=purposive",
"Q=question particle/marker",
"QUOT=quotative",
"RECP=reciprocal",
"REFL=reflexive",
"REL=relative",
"RES=resultative",
"S=single argument of canonical intransitive verb",
"SBJ=subject",
"SBJV=subjunctive",
"SG=singular",
"TOP=topic",
"TR=transitive",
"VOC=vocative");

my $filename;
sub first{
print "Enter the name of your TeX file or type 'help' for help: ";
$filename = <>;
if ($filename =~ /help/) {
   help();
}else{
   parse();
};
}; #first ends

first();
#parse();
lgr();
prin();

sub parse{
#parsing tex
my $dummy = 0;
my @words;
my $check = "no";
my @spaces;
my @abbr;
chomp $filename;
if ( $filename =~ /\s/ ) {
   my @dumb = split(/\s/, $filename);
   $filename = $dumb[0];
   if ($dumb[1] =~ /-lgr/ ) {
      $option = "yes";
   };
};
open FILE, "<:utf8", $filename or die "No such file";
my @file;
while (<FILE>) {
   chomp;
   push(@file, $_);
};
close FILE;
$length = scalar(@file);
for (my $i = 0; $i < $length; $i++) {
   if ( $file[$i] =~ m/(\\(gll|[abcdef(exg.)]g\.)|textsc\/)/ ) {
      $dummy = $i + 1;
      while ( $file[$dummy] =~ m/[\-|\s]([A-Z]+)[\s|\-|\:|\.|\=]|[\=|\
+s]([A-Z]+)[\s|\-|\:|\.|\=]|[\.|\s]([A-Z]+)[\s|\-|\:|\.|\=]|[\:|\s]([A
+-Z]+)[\s|\-|\:|\.|\=]|(SG|DU|PL)|(1|2|3)|\s([A-Z]+)\s/g ) {
#    print $&."\n";
        my $hell = $&;
        $hell =~ s/-//g;$hell =~ s/\s//g;$hell =~ s/://g;$hell =~ s/\.
+//g;$hell =~ s/=//g;
        for my $line (@abbr){
              $check = "yes" if ($line eq $hell);
#              print $hell.$line.$check."\n";
        };
           push(@abbr, $hell) if ($check =~ "no");
        $check = "no";
      };
   };
};


@sorted_abbr = sort @abbr;
$length = scalar(@sorted_abbr);
foreach (@sorted_abbr) {
   chomp;
};
}; #subparse ends


sub lgr{
$length = scalar(@sorted_abbr);
if ( $option =~ /yes/ ) {
   for (my $i = 0; $i < $length;$i++) {
      foreach (@lgr) {
         my @dummy = split(/=/, $_);
         if ( $dummy[0] eq $sorted_abbr[$i] ) {
            $sorted_abbr[$i] = "\\item[$sorted_abbr[$i]] '$dummy[1]'";
         };
      };
   };
};
}; #lgr ends


sub prin{
foreach (@sorted_abbr) {
   if ( $_ !~ /^\\/ ) {
      $_ = "\\item[$_] ''";
   };
};
$filename =~ s/\.tex*//;
my $newfilename = $filename."-abb.txt";
open ABB, ">:utf8", $newfilename;
foreach (@sorted_abbr) {
    print ABB $_."\n";
};
close ABB;
}; #prin ends

sub help{
print "Instructions: 1. Enter the name of the TeX file you want to run
+ the script on.\n2. If you want to find the abbreviations already in 
+the list of 'Leipzig Glossing Rules' automatically, type '-lgr' after
+ the file name.\nExample: 'foo.tex -lgr'.\n3. The list of abbreviatio
+ns will be printed in a file '*-abb.txt'. If the name of the TeX file
+ was 'foo.tex' than the name of the list-file will be 'foo-abb-txt'.\
+n\nNote: The script and the TeX file have to be in the same directory
+.\n\nNote 2: Abbreviation is defined as:\n1. Anything which is writte
+n in uppercase and has more than one letter\n2. Anything written in \
+\textsc{}\n3. Only exceptions to the first two rules are the person/n
+umber abbreviations '1s, 1d, 2s, etc.'\n4. Only the line \\gll, \\ag.
+, \\bg. etc. \\exg. and \\exg are searched for abbreviations\n\nQuest
+ions: email to ozangulle\@gmail.com .\n";
first();
}; #help ends
Replies are listed 'Best First'.
Re: LaTeX Abbreviations for Linguists
by jwkrahn (Abbot) on Jul 06, 2009 at 20:13 UTC
    if ( $abbr =~ m/[A-Z]{2,}/ ) { my @inlist; push(@inlist, $abbr); foreach (@inlist) { if ( $_ =~ /\./ ) { my @dumm = split(/\./, $_); $inlist[0] =~ s/\.//g; foreach (@dumm) { push(@inlist, $_); }; }; };

    From perlsyn:

    If any part of LIST is an array, foreach will get very confused if you add or remove elements within the loop body, for example with splice. So don't do that.
    You are pushing items onto @inlist at the same time that you at iterating over @inlist.   You shouldn't do that!

Re: LaTeX Abbreviations for Linguists
by graff (Chancellor) on Jul 06, 2009 at 22:07 UTC
    Apart from the problem cited in the first reply, you could accomplish the same goal with a lot fewer lines of code -- e.g. this:
    if ( $abbr =~ s/([123][sdp])// ) { push(@abbr, $1); $abbr =~ s/\s//g; };
    will replace 45 lines of OP code (the 5-line block that is repeated 9 times, for each combination of 1/2/3 with s/d/p).

    A similar refactoring should be done on all those blocks that "do the nasty" with @inlist -- you should be able to reduce all of those blocks down to a single loop as well, and then you only have to fix the misuse of "push" on the one remaining instance of that problem.

    (updated to fix wording in 1st paragraph)

    Oh, and instead of this:

    if ( $file[$i] =~ m/(\\gll|\\ag\.|\\bg\.|\\cg\.|\\dg\.|\\eg\.|\\fg\.|\ +\exg\.|\\exg)/ )
    How about:
    if ( $file[$i] =~ /(\\(?:gll|[abcdef]g\.|exg\.?))/ )
    (updated last snippet to include a much-needed ":" after the first "?")
      @jwkrahn and graff: Thank you very much for your comments and corrections. I've revised and edited the script.
        #Last Updated 05.09.09
        Um... okay, given the date of your last reply, that comment line in your code can be interpreted correctly, but you should be aware that taken by itself, that date string could mean three different things.

        This regex in your code looks wrong, and is very different from the one I recommended (and from the one I quoted out of the original version of your script):

        if ( $file[$i] =~ m/(\\(gll|[abcdef(exg.)]g\.)|textsc\/)/ ) {
        It creates a character class that includes "e" two times, and also includes period and open and close parens. It will match things like "\)g.", "\(g.", "\.g.", "\xg.", etc, and probably won't match things that you want it to match, like "exg". I realize now that in my earlier reply, I left out a colon; I've updated that accordingly, and I apologize for that mistake.

        Consistent indentation is a nice thing, and so is using @ARGV for things like asking for usage help and providing file names and options -- please get acquainted with using @ARGV (and Getopt::Std and/or Getopt::Long), because making the user manually type things in after the script is running is a Real Pain™.

        A long list of "configuration" or "initialization" data (your "@lgr" list) would be handled more cleanly (and would be easier to maintain) as a __DATA__ segment that gets read into an array or hash on start-up.

        The regex alternation character (vertical-bar, |) does not work as such inside a regex character class (between square brackets), it just matches a literal "|" -- so you should study the perlre man page to understand how character classes work.

        Also, when you want to delete all occurrences of particular characters from a string, using tr/xyz//d is much more efficient than s/x//g; s/y//g; s/z//g; (using tr is even more efficient than s/[xyz]//g).

        For a small script like this, modifying global-scope variables inside of subroutines isn't such a big deal, because the script is small, but it's usually not a good idea. As a general rule (and in the interest of creating subroutines that are modular and easy to maintain and adapt), it's better to pass data to subs as parameters, and have the sub either return its resulting data to the caller, or modify its parameters in-place (because they were passed as references).

        If you document your code with POD, it will be easier to read and maintain the documentation, which is important. If the documentation includes a brief description of what the code actually does, that will help you to organize your thoughts in a sensible way about the algorithm, and then organize your code according to what makes sense (and is documented). As it is, there's a lot of inefficiency in your code, because the algorithm hasn't been thought out. In particular, you are using arrays where you should be using hashes.

        This note in your help text is not necessarily true:

        ... Note: The script and the TeX file have to be in the same directory +. ...
        The script could be in the shell's execution PATH, so it doesn't have to be where the data file is; also, a user can provide (relative or absolute) paths for both the script and the data file, so they don't both have to be in the same place. Also, it's always a good idea to include the filename string and $! in the error message when you "die" on a failed "open" call.

        I happened to notice this one odd entry in your long list of LGR abbrevs:

        N-=non- (e.g. NSG nonsingular, NPST nonpast)
        There's nothing in your code that handles this "N" prefix on other abbrevs, so things like "NPST" and "NSG" will never be labeled as "nonpast" or "nonsingular" in your output. Also, that explanation will never appear in the output either, unless "N-" happens to occur in the tex file.

        One last point: do you have a suitable "test.tex" file that contains at least one example of every kind of abbreviation you intend to handle with this script, along with some variety of "normal" content? If not, make one. The point would be to make sure that all these abbreviations get listed as intended.

        Of course, you can't anticipate all the ways that "normal" LaTeX content might cause your script to miss things that are real abbreviations (e.g. if two abbrevs occur next to each other separated by a single space, the second one will be missed), or to list things as abbrevs when they really aren't (e.g. FULL WORDS IN UPPERCASE, or any single-digit number). But even a little bit of testing is better than none.

        Here's how I would write your script (though this version won't behave exactly the same as yours, and might have some mistakes in it -- I didn't have any LaTeX files with abbreviations to test it on):