PerlMonks  

help needed on encoding a text file

by bfdi533 (Friar)
on Apr 02, 2003 at 16:33 UTC ( [id://247520]=perlquestion )

bfdi533 has asked for the wisdom of the Perl Monks concerning the following question:

I have a bit of code that I am working on to encode a text file by replacing each word with a binary code representing it (after building an index of all words in the file) in an effort to save storage (and, of course, to display it later).

I am trying to get the binary code to the file and have tried many different things including printf and pack and have not been able to produce the results that I am after.

Here is my current encode.pl code:

    #!/usr/bin/perl
    @lines = <>;
    #print "Number of lines; $#lines\n";
    $x = 0;
    foreach $ln (@lines) {
        @words = split /\s+/, $ln;
        foreach $word (@words) {
            $index{$word} = $x++;
            $count{$word}++;
        }
    }
    print "-=INDEX=-\n";
    foreach $key (keys %index) {
        #print "[$key] $index{$key}\n";
        #$bkey = pack("C*",$index{$key});
        #print $key . chr(0) . $bkey . chr(0);
        printf "%s%c%x%c", $key, chr(0), $index{$key}, chr(0);
    }
    print "\n";
    print "-=CONTENTS=-\n";
    foreach $ln (@lines) {
        @words = split /\s+/, $ln;
        foreach $word (@words) {
            #print $index{$word} . " ";
            #$bword = pack("C*",$index{$word});
            #print $bword . chr(0);
            printf "%x%c", $bword, chr(0);
        }
        print "\n";
    }


I would be happy for any and all input on this problem.

Ed

Replies are listed 'Best First'.
Re: help needed on encoding a text file
by hardburn (Abbot) on Apr 02, 2003 at 16:39 UTC

    Unless you're really, really good with information theory, I doubt you could come up with something better than a general-purpose compression algorithm. Take a look at the Compress:: modules, particularly Compress::Zlib and Compress::Bzip2. These will probably reduce the space by a lot more than most customized solutions, and will probably be easier to code.

    ----
    I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
    -- Schemer

    Note: All code is untested, unless otherwise stated
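    As a rough illustration of the Compress::Zlib suggestion above, here is a minimal sketch of a compress/uncompress round trip. The sample text is made up for the example; in practice it would be the contents of the file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Compress::Zlib;   # provides compress() and uncompress()

# Hypothetical sample text standing in for the real file contents.
my $text = join ' ', ('the quick brown fox jumps over the lazy dog') x 50;

# compress() returns the deflated data, or undef on failure.
my $packed = compress($text);
die "compress failed\n" unless defined $packed;

# uncompress() restores the original bytes, or returns undef on failure.
my $restored = uncompress($packed);
die "uncompress failed\n" unless defined $restored;

printf "original: %d bytes, compressed: %d bytes\n",
    length($text), length($packed);
```

    Repetitive text like this sample compresses very well; real prose will do less well, so the numbers printed here say little about an actual book.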

Re: help needed on encoding a text file
by dakkar (Hermit) on Apr 02, 2003 at 18:42 UTC

    Before showing you the code, some notes.

    • You are mixing text (the boundary delimiters) and binary data. This is bad.
    • You are writing out hexadecimal representations of numbers, but you want to save space. This makes little sense.
    • You should really use pack.

    I changed the encoding. It is now:

    • one long, for the index length
    • the index. Each entry in the index is made of:
      1. one long: the index number
      2. one long: the string length
      3. some bytes: the string itself
    • the content, as a series of longs
    And now, the code:

      Encoder:

      #!/usr/bin/perl
      @lines = <>;
      binmode STDOUT;
      $x = 1;
      foreach $ln (@lines) {
          @words = split /\s+/, $ln;
          foreach $word (@words) {
              $index{$word} = $x++;
              $count{$word}++;
          }
      }
      $idx='';
      foreach $key (keys %index) {
          $idx.=pack("N(N/a*)",$index{$key},$key);
      }
      print STDOUT pack("N",length($idx)),$idx;
      foreach $ln (@lines) {
          @words = split /\s+/, $ln;
          foreach $word (@words) {
              print STDOUT pack("N",$index{$word});
          }
      }

      Decoder:

      #!/usr/bin/perl
      binmode STDIN;
      undef $/;
      $i=<>;
      $is=unpack "N",$i;
      $ind=substr($i,4,$is);
      $con=substr($i,4+$is);
      %index=unpack "(N(N/a))*",$ind;
      @con=unpack "N*",$con;
      print join(' ',@index{@con});

      Note the use of binmode and the unsetting of $/ (aka input record separator).

      -- 
              dakkar - Mobilis in mobile
      
      Point well taken on mixing text and binmode. Doing the decode was getting pretty hairy as I had it.

      I must say that I do not understand the pack("N(N/a*)" syntax. I know what N and a are but cannot find any docs on the parens; I did find docs on the slash and understand it now. (Though I have to admit that I have really TOTALLY avoided pack and unpack as they seem VERY confusing to me.)

      Also why 2 N's? Seems it could be written as N/a* per the perlfunc man page.

      Also, I get an error on the decoder as follows:
      / must follow a numeric type at ./dcode2.pl line 7, <> chunk 1.

      Extrapolating from this, it would seem that there is a syntax error in the decoder as it is written, unpack "(N(N/a))*",$ind; and it seems it should probably be unpack "N(N/a*)",$ind; but that does not produce any output, just blank lines.

      Ed

        pack("N(N/a*)",$index{$key},$key) means that $index{$key} is written as a number (N), and $key as a length-prefixed string (N/a*).

        The parentheses are for grouping. Their meaning is half-hidden in the documentation...

        Regarding the error: I don't get it. I tested the code.

        The unpack is right as it is: it means "a series of pairs of numbers and length-prefixed strings" (in unpack you must not put the *; see the docs).

        -- 
                dakkar - Mobilis in mobile
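        To make the template easier to follow, here is a small sketch of the same idea in isolation: each index entry is packed as a number plus a length-prefixed string, and the whole blob is unpacked with a repeated group. The %index contents are invented for the example, and the parenthesised group in the unpack template assumes Perl 5.8 or later.

```perl
use strict;
use warnings;

# Hypothetical index: word => number.
my %index = (hello => 1, world => 2);

# Build the index blob. For each entry:
#   N     a 32-bit big-endian number (the index)
#   N/a*  a 32-bit length followed by that many bytes (the word)
my $blob = '';
for my $word (sort keys %index) {
    $blob .= pack('N N/a*', $index{$word}, $word);
}

# Each entry is 4 bytes (number) + 4 bytes (length) + the word itself.
# "hello" and "world" are 5 bytes each, so 2 * (4 + 4 + 5) = 26 bytes.
print length($blob), "\n";

# Unpack as a repeated group of (number, length-prefixed string).
# The length is consumed by unpack and not returned, so the result
# is a flat list of number/word pairs that can fill a hash directly.
my %decoded = unpack('(N N/a)*', $blob);
print "$_ => $decoded{$_}\n" for sort { $a <=> $b } keys %decoded;
```

        Whitespace inside a template is ignored, so 'N N/a*' is the same template as "N(N/a*)" with the grouping left out; for a single pair the parentheses are not needed.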
        
Re: help needed on encoding a text file
by Pardus (Pilgrim) on Apr 02, 2003 at 16:43 UTC
    Maybe a bold question, but why not use one of the modules in the CPAN Compress:: namespace, like Compress::Zlib? These are specialised modules to do this job.

    Laziness is a virtue.

    update: hardburn apparently types faster :(
    --
    Jaap Karssenberg || Pardus (Larus)? <pardus@cpan.org>
    >>>> Zoidberg: So many memories, so many strange fluids gushing out of patients' bodies.... <<<<
Re: help needed on encoding a text file
by bfdi533 (Friar) on Apr 02, 2003 at 18:06 UTC
    To be honest, I was not trying to compress the file to save space so much as I was trying to compress it into a sort of "e-book reader" style format so that I could do the other half of this problem, to write the "reader" app. I am thinking of this for an iPAQ running Linux with small storage space.

    That said, I understand the use of Compress:: but want to solve the real problem here, how to write binary data to a file.

    Let's say I have an integer, $i, that is less than 256. The answer is simple:
    printf "%c", $i;
    But what about an integer with a value of, say, 16503? Obviously, the above code would not work.

    How would I get this to the file?

    Ed
      binmode FILE;
      print FILE pack("N",$i);

      pack takes a list of values, and returns a string obtained by packing them as indicated by the first argument. See perldoc -f pack.

      -- 
              dakkar - Mobilis in mobile
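      For completeness, a minimal round trip for the 16503 case might look like the sketch below. The file name number.bin is only a placeholder, and a lexical filehandle is used instead of the bareword FILE.

```perl
use strict;
use warnings;

my $i    = 16503;
my $file = 'number.bin';   # hypothetical file name

# Write the integer as 4 bytes, big-endian ("network" order).
open(my $out, '>', $file) or die "Cannot write $file: $!";
binmode $out;               # no newline or encoding translation
print {$out} pack('N', $i);
close $out or die "Cannot close $file: $!";

# Read the 4 bytes back and turn them into a number again.
open(my $in, '<', $file) or die "Cannot read $file: $!";
binmode $in;
my $got = read($in, my $buf, 4);
close $in;
die "Short read on $file\n" unless defined $got && $got == 4;

my ($j) = unpack('N', $buf);
print "$j\n";               # prints 16503

unlink $file;               # clean up the scratch file
```

      16503 is 0x4077, so the four bytes on disk are 00 00 40 77.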
      
Re: help needed on encoding a text file
by Thelonius (Priest) on Apr 02, 2003 at 18:18 UTC
    You can use pack("n") for 16-bit values and pack("N") for 32-bit values. You are evidently trying to use "\0" as a separator, but this won't work because the index numbers themselves can have zero bytes in them. If you want a variable-length encoding, use pack("w"). This would be especially good if you have your most common words with the lowest index numbers, since they will come out shorter.
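    A rough sketch of that last suggestion: number the words by descending frequency, then write the text with pack("w"). The sample sentence is invented, and ties in frequency are broken alphabetically just to keep the numbering stable.

```perl
use strict;
use warnings;

# Hypothetical input text.
my @words = split ' ', 'the cat and the dog and the bird';

# Count occurrences, then number the words so that the most frequent
# ones get the smallest index (ties broken alphabetically).
my %count;
$count{$_}++ for @words;
my @by_freq = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
my %index;
@index{@by_freq} = (0 .. $#by_freq);

# Encode the text as a run of BER compressed integers.
my $encoded = pack('w*', @index{@words});

# Decode. No separator is needed: each integer marks its own last byte.
my @decoded = map { $by_freq[$_] } unpack('w*', $encoded);
print "@decoded\n";

# Size of a single value under each template.
for my $n (5, 127, 128, 16503) {
    printf "%5d: n=%d N=%d w=%d bytes\n", $n,
        length(pack('n', $n)), length(pack('N', $n)), length(pack('w', $n));
}
```

    With "w", values below 128 take one byte, so a text whose common words have small index numbers shrinks the most.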
