http://qs321.pair.com?node_id=11126850


in reply to Re: convert tags to punctuation
in thread convert tags to punctuation

As a variation on Polyglot's solution, you can define the tags in a hash. The advantage is that it is more easily expanded if more tags are needed. I have chosen to specify the characters by name (charnames) because I find single punctuation marks, embedded in quotes, hard to read.
use strict;
use warnings;

my %tags = (
    91 => "\N{FULL STOP}",         # '.'
    92 => "\N{APOSTROPHE}",        # '''
    93 => "\N{COMMA}",             # ','
    94 => "\N{EXCLAMATION MARK}",  # '!'
);

my $line = 'Text with unusual punctuation<91><91><91>'
         . 'I<92>m not going to lie<93> this is odd text<94>';

$line =~ s/<(\d\d)>/$tags{$1}/ge;
print $line, "\n";
Bill

Re^3: convert tags to punctuation
by Anonymous Monk on Jan 15, 2021 at 19:01 UTC

    Bill -- I think your code is more maintainable. The document I am working with is about 600,000 lines long. Is there a way to speed this up? Is there a way to get a complete list of the <ab> tags?

      You should ask the person who prepares your input file whether they can direct you to either a specification of the file format or the documentation of the program that created it. If this fails, I would write a Perl program to list all the tags. The only way I know to get the values is to use an editor to examine the tags in context and make your best guess. (It will usually be obvious.)
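      A program to list the tags could look roughly like the sketch below. It assumes every tag is exactly two decimal digits in angle brackets, as in the substitution above; if the real tags can contain hex digits, the pattern would need widening, for example to [0-9A-Fa-f]{2}. The sub name count_tags and the in-memory sample are made up for illustration.

```perl
use strict;
use warnings;

# Count every <dd> tag seen on the given filehandle.
# Returns a hash reference: tag number => number of occurrences.
sub count_tags {
    my ($fh) = @_;
    my %count;
    while ( my $line = <$fh> ) {
        # In scalar context, /g finds one match per iteration.
        $count{$1}++ while $line =~ /<(\d\d)>/g;
    }
    return \%count;
}

# Demo on an in-memory sample. For the real document, open the file instead:
#   open my $fh, '<', $path or die "Cannot open $path: $!";
my $sample = "Wait<91><91><91>\nI<92>m not going to lie<93> this is odd<94>\n";
open my $fh, '<', \$sample or die "Cannot open sample: $!";
my $count = count_tags($fh);
close $fh;

# One line per distinct tag, with how often it occurs.
printf "<%s> %d\n", $_, $count->{$_} for sort keys %$count;
```

      Because it reads one line at a time, memory use should stay small even on a 600,000-line file. The counts also hint at which tags are worth looking up in context first.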

      It is nearly impossible to guess what will or will not make a Perl program faster. The usual advice is to profile the program first and work only on the parts that use the most time, then use the Benchmark module to measure any possible improvement. In your case, I/O is probably taking much longer than processing. Slurping the entire file into memory is probably not an option. Reading the file in large blocks may help, but it is not easy to get right. I recommend against any optimization unless it is absolutely necessary.
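      As an illustration of measuring rather than guessing, here is a minimal sketch using the core Benchmark module. It compares the substitution as posted (with /e) against the same substitution without /e, which gives the same result here because a hash element interpolates in the replacement. The sub names with_e and without_e are made up for this example, and the table uses plain character literals for brevity. Timings will vary by machine, and if I/O dominates, neither variant may matter in practice.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my %tags = (
    91 => '.',
    92 => "'",
    93 => ',',
    94 => '!',
);

my $sample = 'Text with unusual punctuation<91><91><91>'
           . 'I<92>m not going to lie<93> this is odd text<94>';

# Variant 1: the substitution as posted, with the /e flag.
sub with_e {
    my $line = $sample;
    $line =~ s/<(\d\d)>/$tags{$1}/ge;
    return $line;
}

# Variant 2: plain interpolation in the replacement, no /e.
sub without_e {
    my $line = $sample;
    $line =~ s/<(\d\d)>/$tags{$1}/g;
    return $line;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese( -1, {
    with_e    => \&with_e,
    without_e => \&without_e,
} );
```

      A negative count tells cmpthese to run each variant for at least that many CPU seconds, so the whole script takes a few seconds to finish.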

      Bill

        I noticed something interesting about this document: if I view it with the 'more' filter, I see a bunch of black rectangles with the tags inside them. If I view it with gedi or ptked, I see \x{93}, \x{94}, \x{95}, etc. Does it matter which characters go in my s/ ... / line? What does Perl see?

      > Is there a way to speed this up?

      what makes you think it's not fast enough?

      Update

      quoting davido from the CB:

      who cares about how fast Perl runs; it's almost always the network or IO that are standing in the way.

      Cheers Rolf
      (addicted to the Perl Programming Language :)