
Re: help needed on encoding a text file

by dakkar (Hermit)
on Apr 02, 2003 at 18:42 UTC


in reply to help needed on encoding a text file

Before showing you the code, some notes.

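  • You are mixing text (the boundary delimiters) and binary data. This is bad: the binary data can happen to contain the same bytes as a delimiter, which makes decoding fragile.
  • You are writing out hexadecimal representations of numbers, but you want to save space. This makes little sense: a 32-bit number written as hex text takes up to 8 bytes, while the same number packed as a long takes 4.
  • You should really use pack.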
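To make the space argument concrete, here is a minimal sketch comparing a number written as hex text with the same number packed as a 32-bit big-endian long. The value and the variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Arbitrary example value (0x12345678), chosen only for illustration.
my $n = 305_419_896;

my $as_hex_text = sprintf("%x", $n);   # the digits "12345678" as text
my $as_packed   = pack("N", $n);       # 32-bit unsigned, big-endian

printf "hex text: %d bytes\n", length($as_hex_text);
printf "packed:   %d bytes\n", length($as_packed);

# unpack with the same template recovers the number.
my ($back) = unpack("N", $as_packed);
print "round trip ok\n" if $back == $n;
```

For this value the hex text is 8 bytes long and the packed form is 4.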

I changed the encoding. It is now:

  • one long (32-bit, big-endian, as packed by N): the length of the index in bytes
  • the index. Each entry in the index is made of:
    1. one long: the index number
    2. one long: the string length
    3. some bytes: the string itself
  • the content, as a series of longs, one index number per word

And now, the code:

    Encoder:

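    #!/usr/bin/perl
    @lines = <>;
    binmode STDOUT;   # the output is binary data
    $x = 1;
    # build the index: every word gets a number
    # (a repeated word is simply renumbered; the last number wins)
    foreach $ln (@lines) {
      @words = split /\s+/, $ln;
      foreach $word (@words) {
        $index{$word} = $x++;
        $count{$word}++;   # occurrence count, not used below
      }
    }
    # pack the index: number, then length-prefixed string
    $idx='';
    foreach $key (keys %index) {
      $idx.=pack("N(N/a*)",$index{$key},$key);
    }
    # index length, then the index itself
    print STDOUT pack("N",length($idx)),$idx;
    # the content: one long per word
    foreach $ln (@lines) {
      @words = split /\s+/, $ln;
      foreach $word (@words) {
        print STDOUT pack("N",$index{$word});
      }
    }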

    Decoder:

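    #!/usr/bin/perl
    binmode STDIN;
    undef $/;                                     # slurp the whole input
    $i=<>;
    $is=unpack "N",$i;                            # index length
    $ind=substr($i,4,$is);$con=substr($i,4+$is);  # index block, content block
    %index=unpack "(N(N/a))*",$ind;               # number => word
    @con=unpack "N*",$con;                        # the word numbers
    print join(' ',@index{@con});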

    Note the use of binmode and the unsetting of $/ (aka input record separator).
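As a small illustration of that note, the sketch below writes packed longs to a temporary file and reads them back, with binmode on both handles and $/ undefined for the read. The sample numbers and the use of File::Temp are assumptions made for the example, not part of the code above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Sample values: 10 and 13 pack to bytes that look like LF and CR,
# which text-mode I/O may translate on some platforms.
my @numbers = (1, 2, 10, 13, 26);

my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh;                          # write raw bytes, no newline translation
print {$fh} pack("N*", @numbers);
close $fh or die "close failed: $!";

open my $in, '<', $path or die "cannot open $path: $!";
binmode $in;                          # read raw bytes as well
my $data = do { local $/; <$in> };    # $/ undefined: read the whole file at once
close $in;

my @back = unpack("N*", $data);
print "@back\n";
```

Using local $/ inside a do block keeps the change to the input record separator limited to that one read.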

    -- 
            dakkar - Mobilis in mobile
    

Replies are listed 'Best First'.
Re: Re: help needed on encoding a text file
by bfdi533 (Friar) on Apr 02, 2003 at 19:55 UTC
    Point well taken on mixing text and binmode. Doing the decode was getting pretty hairy as I had it.

    I must say that I do not understand the pack("N(N/a*)" syntax. I know what N and a are but cannot find any docs on the parens; I did find docs on the slash and understand it now. (Though I have to admit that I have really TOTALLY avoided pack and unpack as they seem VERY confusing to me.)

    Also why 2 N's? Seems it could be written as N/a* per the perlfunc man page.

    Also, I get an error on the decoder as follows:
    / must follow a numeric type at ./dcode2.pl line 7, <> chunk 1.

    Extrapolating from this, it would seem that there is a syntax error in the decoder as written, unpack "(N(N/a))*",$ind; and it seems it should probably be unpack "N(N/a*)",$ind; but that does not produce any output, just blank lines.

    Ed

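      pack("N(N/a*)",$index{$key},$key) means that $index{$key} is written as a Number, and $key as a length-prefixed string. That is why there are two Ns: the first is the index number itself, the second belongs to N/a* and stores the length of the string.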

      The parentheses are for grouping. Their meaning is half-hidden in the documentation...
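To see what one index entry looks like on the wire, here is a minimal sketch that packs a single made-up entry (number 7, word "cat") with and without the parentheses and dumps the bytes as hex. It assumes a perl with group support (5.8 or later, going by this thread).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical entry: index number 7, word "cat".
my $grouped = pack("N(N/a*)", 7, "cat");   # template from the post, needs () support
my $flat    = pack("N N/a*",  7, "cat");   # same items without the parentheses

# Dump as hex: 4 bytes number, 4 bytes length, then the string bytes.
print unpack("H*", $flat), "\n";
print "grouped and flat templates give the same bytes\n" if $grouped eq $flat;
```

The hex dump is 0000000700000003636174: the number 7, the length 3, then "cat".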

      Regarding the error: I don't get it. I tested the code.

      The unpack is right as it is: it means "a series of pairs of Numbers and length-prefixed strings" (in unpack you must not put the * after the a, see the docs).
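A minimal sketch of that "series of pairs" reading, on a perl with group support. The two entries are invented, and the template here uses a single group, (N N/a)*, rather than the nested form in the decoder; that is an assumption of the sketch, not a claim about what fixes the reported error.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two hypothetical index entries, packed back to back.
my $packed = pack("N N/a*", 1, "foo") . pack("N N/a*", 2, "quux");

# Each group yields (number, string); the * repeats it until the data runs out.
my %index = unpack("(N N/a)*", $packed);

for my $n (sort { $a <=> $b } keys %index) {
    print "$n => $index{$n}\n";
}
```

Assigning the flat list of pairs to a hash is what turns it into a number-to-word lookup.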

      -- 
              dakkar - Mobilis in mobile
      
        Ok, so I finally understand the N/a notation but am having trouble with this code. It seems that the () notation is pretty new, as it is not in my Programming Perl, 2nd Edition and my perl 5.6.1 does not support it. But my perl 5.8.0 does support it, and it is documented in its perlfunc manpage.

        Here is the error that I get with the perl 5.6.1:
        Invalid type in unpack: '(' at C:\Data\test\dcode2.pl line 8, <> chunk 1.

        I am not too sure how to go about "converting" this code to work with perl 5.6.1.

        BTW, still cannot get the code to work on perl 5.8.0 without an error with the "/" as stated earlier.

        Ed
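One possible way to decode the index without () groups is to walk the index block by hand, using only N and substr. The sketch below assumes the layout described at the top of the thread (number, length, string); decode_index is a hypothetical helper name and the sample data is invented.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Decode the index block by hand: no () groups and no '/' in the template.
# decode_index is a hypothetical helper, not part of the original code.
sub decode_index {
    my ($ind) = @_;
    my %index;
    my $pos = 0;
    while ($pos < length($ind)) {
        # two longs: the index number and the string length
        my ($number, $len) = unpack("N N", substr($ind, $pos, 8));
        $pos += 8;
        # then $len bytes of string
        $index{$number} = substr($ind, $pos, $len);
        $pos += $len;
    }
    return %index;
}

# Sample index built without groups too: number, explicit length, string.
my $ind = pack("N N a*", 1, 5, "hello") . pack("N N a*", 2, 5, "world");

my %index = decode_index($ind);
print join(' ', map { $index{$_} } 1, 2, 1), "\n";
```

On the encoding side, pack("N N a*", $number, length($word), $word) writes the same three fields per entry with the length given explicitly, so neither side needs the grouped template.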
