PerlMonks  

Sort text by Chapter names

by Bman70 (Acolyte)
on May 24, 2018 at 05:24 UTC ( [id://1215128]=perlquestion )

Bman70 has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,

Is there any way to take a block of text that includes different Chapter names, and then sort all of the text so that the Chapter names are in alphabetic order, while keeping the correct text with each Chapter?

Example of the text I'm trying to sort:

Chapter One
There were lots of monkeys here and they ate all the bananas... lots more text up to hundreds of words.
Chapter Nine
This chapter has probably 1000 words.
Chapter Two
Here is the text in the second chapter...
Chapter Five
Here is the text in the fifth chapter... every chapter is of differing length, some long some short.

What I want is to sort these chapters into alphabetical order; they don't need to be in numerical order. I just want to know if there's a sort function that can arrange the titles in order while preserving the text associated with each title, resulting in something like:

Chapter Five
Here is the text in the fifth chapter... every chapter is of differing length, some long some short.
Chapter Nine
This chapter has probably 1000 words.
Chapter One
There were lots of monkeys here and they ate all the bananas... lots more text up to hundreds of words.
Chapter Two
Here is the text in the second chapter...

The text is all in a text file. There are some newlines and a mix of letters and numbers.

Thank you for any tips you can offer.

Replies are listed 'Best First'.
Re: Sort text by Chapter names
by tybalt89 (Monsignor) on May 24, 2018 at 08:10 UTC
    #!/usr/bin/perl
    # http://perlmonks.org/?node_id=1215128
    use strict;
    use warnings;

    print sort split /^(?=Chapter)/m, join '', <DATA>;

    __DATA__
    Chapter One
    There were lots of monkeys here and they ate all the bananas... lots more text up to hundreds of words.
    Chapter Nine
    This chapter has probably 1000 words.
    Chapter Two
    Here is the text in the second chapter...
    Chapter Five
    Here is the text in the fifth chapter... every chapter is of differing length, some long some short.
Re: Sort text by Chapter names
by jimpudar (Pilgrim) on May 24, 2018 at 06:53 UTC

    This won't make a good homework answer, but may help in a real world scenario where it is not viable to read the entire book into memory.

    I do realize that RAM is so cheap these days it would need to be one hell of a book to warrant this kind of treatment :D

    #! /usr/bin/env perl
    use strict;
    use warnings;
    use autodie;

    # This solution will avoid reading the entire
    # file into memory. First we will separate the
    # input into one file per chapter.

    # This will become a filehandle in a bit
    my $fh;

    # Create a directory for the chapter output
    mkdir 'chapters';

    # Keep a list of all chapter names
    my @chapter_names;

    # Diamond operator reads input line by line.
    # This can be a file or even STDIN
    while (<>) {
        if (/^Chapter/) {
            close $fh if $fh;

            # Open a new file.
            # The filename is the chapter name.
            chomp;
            open $fh, '>', "chapters/$_";
            push @chapter_names, $_;
            next;
        }

        # Skip any text that appears before the first chapter heading;
        # there is no open filehandle to write it to yet.
        next unless $fh;
        print $fh $_;
    }
    close $fh if $fh;

    # Now we can sort the chapter names and
    # read them line by line, printing on STDOUT
    foreach my $chapter_name (sort @chapter_names) {
        open $fh, '<', "chapters/$chapter_name";
        print "$chapter_name\n";
        while (<$fh>) {
            print $_;
        }
        close $fh;
        unlink "chapters/$chapter_name";
    }
    rmdir 'chapters';

    Best,

    Jim

      Another alternative that I often see used here is to calculate the position and length of each unit within the file as you are scanning through it looking for markers. Put the marker-name, position, and length into an array of hashes, then sort the array by name using a custom sort-function. Retrieve each chapter directly from the original file by seeking to the proper position and reading the calculated number of bytes. (If the length might also be huge, read and write it in chunks not-to-exceed a digestible buffer-size.)

        For the fun of it ( and also to show you can seek on DATA )

        #!/usr/bin/perl
        # http://perlmonks.org/?node_id=1215128
        use strict;
        use warnings;

        my %chapters;
        my $previous = undef;
        my $buffer;
        my $max = 4096;

        while(<DATA>) {
          if( /^Chapter/ ) {
            $chapters{$_} = $previous = [ tell(DATA) - length, length ];
          }
          elsif( defined $previous ) {
            $previous->[1] += length;
          }
        }
        use Data::Dump 'pp'; print pp \%chapters;
        print "\n\n";

        for ( sort keys %chapters ) {
          my ($start, $length) = $chapters{$_}->@*;
          seek DATA, $start, 0;
          while( $length > $max ) {
            read DATA, $buffer, $max;
            print $buffer;
            $length -= $max;
          }
          read DATA, $buffer, $length;
          print $buffer;
        }

        __DATA__
        Chapter One
        There were lots of monkeys here and they ate all the bananas... lots more text up to hundreds of words.
        Chapter Nine
        This chapter has probably 1000 words.
        Chapter Two
        Here is the text in the second chapter...
        Chapter Five
        Here is the text in the fifth chapter... every chapter is of differing length, some long some short.

        This is an elegant solution. I do like it better than mine as it uses only half the disk space!

        This thread is a fantastic example of TMTOWTDI.

        Best,

        Jim

Re: Sort text by Chapter names
by kcott (Archbishop) on May 24, 2018 at 10:25 UTC

    G'day Bman70,

    Whether this is homework, $work, or something else, you really should make an effort to write some code yourself. We're always happy to help; we're less happy to do your work for you. If it is homework, tell us what part of the language you're currently studying — is this intended to give you an exercise in I/O, sorting, regexes, etc. — we can tailor answers to help you with whatever part you're having difficulties with.

    That said, here's my TMTOWTDI:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use autodie;

    my $text = <<'EOT';
    Chapter One
    There were lots of monkeys here and they ate all the bananas... lots more text up to hundreds of words.
    Chapter Nine
    This chapter has probably 1000 words.
    Chapter Two
    Here is the text in the second chapter...
    Chapter Five
    Here is the text in the fifth chapter... every chapter is of differing length, some long some short.
    EOT

    {
        open my $memfh, '<', \$text;
        local $/ = 'Chapter ';
        for (sort <$memfh>) {
            chomp;
            next unless length;
            print $/, $_;
        }
    }

    Output:

    Chapter Five
    Here is the text in the fifth chapter... every chapter is of differing length, some long some short.
    Chapter Nine
    This chapter has probably 1000 words.
    Chapter One
    There were lots of monkeys here and they ate all the bananas... lots more text up to hundreds of words.
    Chapter Two
    Here is the text in the second chapter...

    — Ken

Re: Sort text by Chapter names
by Your Mother (Archbishop) on May 24, 2018 at 05:36 UTC

    Most definitely possible. It sounds like homework though. Have you tried writing any code?

    A successful approach is going to collect/match/parse the data into blocks of chapter + its text in an array, or an array of hashes, and then use the chapter in a sort block. You can search here or online for multi-line matching and sorting. If you show some code, you'll most definitely get help with improving it.
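
    The approach described above could be sketched roughly as follows. This is only an illustration, not code posted in the thread: the helper names `parse_chapters` and `sort_chapters` and the shortened sample text are assumptions, and it assumes every chapter heading sits on its own line starting with "Chapter".

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): turn the text into an
# array of hashes, one { title, body } record per chapter.
sub parse_chapters {
    my ($text) = @_;
    my @chapters;

    # Split in front of every line that starts with "Chapter"; the
    # lookahead keeps each heading attached to its own block.
    for my $block ( split /^(?=Chapter\b)/m, $text ) {

        # The first line of a block is the title, the rest is the body.
        my ( $title, $body ) = split /\n/, $block, 2;

        # Ignore anything that precedes the first heading.
        next unless defined $title && $title =~ /^Chapter\b/;
        push @chapters, { title => $title, body => $body // '' };
    }
    return @chapters;
}

# Hypothetical helper: sort the records by title and glue them back together.
sub sort_chapters {
    my ($text) = @_;
    return join '',
        map  { "$_->{title}\n$_->{body}" }
        sort { $a->{title} cmp $b->{title} } parse_chapters($text);
}

# Shortened sample text, assumed for illustration.
my $sample = <<'EOT';
Chapter One
Monkeys ate the bananas.
Chapter Nine
A long chapter.
Chapter Two
The second chapter.
Chapter Five
The fifth chapter.
EOT

print sort_chapters($sample);
```

    With the sample above the chapters come out in the order Five, Nine, One, Two, each followed by its own text.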

      Thanks all! Haha it's in no way homework.. just a problem I created for myself when I used a hash to get the chapter data, resulting in the chapters always returning in a different order (since hashes aren't ordered).

      So I want to get the chapters, then sort them. But I may have found a way to use an ARRAY OF HASHES, to avoid having to do this!

      I'll ponder over all the answers though so I can have this option if necessary.
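
      A minimal sketch of the array-of-hashes idea, assuming chapter headings are lines that start with "Chapter". The helper name `read_chapters_in_order` and the sample lines are made up for illustration. Because the records live in an array, they stay in the order they were read, and sorting becomes something you do only when you want it.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): collect chapters into an
# array of hashes so the original reading order is preserved.
sub read_chapters_in_order {
    my (@lines) = @_;
    my @chapters;
    for my $line (@lines) {
        if ( $line =~ /^Chapter\b/ ) {
            # Copy the line and strip its trailing line break to get the title.
            ( my $title = $line ) =~ s/\R\z//;
            push @chapters, { title => $title, text => '' };
        }
        elsif (@chapters) {
            # Append body lines to the most recently seen chapter.
            $chapters[-1]{text} .= $line;
        }
    }
    return @chapters;
}

# Sample lines, assumed for illustration.
my @chapters = read_chapters_in_order(
    "Chapter One\n",  "Monkeys.\n",
    "Chapter Nine\n", "Long.\n",
    "Chapter Two\n",  "Second.\n",
);

# Original order, exactly as read.
print "$_->{title}\n$_->{text}" for @chapters;

# Alphabetical order, produced only on demand.
print "$_->{title}\n$_->{text}"
    for sort { $a->{title} cmp $b->{title} } @chapters;
```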

Re: Sort text by Chapter names
by rnewsham (Curate) on May 24, 2018 at 05:36 UTC

    You could read the text into a hash keyed by the chapter title and then sort the keys:

    use strict;
    use warnings;

    my %chapters;
    my $chapter_title;

    while ( <DATA> ) {
        chomp;
        if ( m/^Chapter \w+/ ) {
            $chapter_title = $_;
            next;
        }
        next unless $chapter_title;

        # Re-append the newline so multi-line chapters keep their line breaks
        $chapters{$chapter_title} .= "$_\n";
    }

    for ( sort {$a cmp $b} keys %chapters ) {
        print "$_\n$chapters{$_}";
    }

    __DATA__
    Chapter One
    There were lots of monkeys here and they ate all the bananas... lots more text up to hundreds of words.
    Chapter Nine
    This chapter has probably 1000 words.
    Chapter Two
    Here is the text in the second chapter...
    Chapter Five
    Here is the text in the fifth chapter... every chapter is of differing length, some long some short.
Re: Sort text by Chapter names
by BillKSmith (Monsignor) on May 24, 2018 at 13:25 UTC

    In the real world, you must be prepared to allow chapter titles within the text.

    Chapter One
    There were lots of monkeys here and they ate all the bananas (see 'bananas' in Chapter Two)... lots more text up to hundreds of words.

    Probably the best you can do is to ignore titles that do not start at the beginning of a line.
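
    A small sketch of that idea, assuming headings always begin a line. The helper name `split_chapters` and the sample text are illustrative only. The `^` anchor with the `/m` modifier matches only at the start of a line, so a mention of "Chapter Two" in the middle of a sentence does not start a new chapter.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): split only where "Chapter"
# begins a line; mentions inside running text are left alone.
sub split_chapters {
    my ($text) = @_;
    return split /^(?=Chapter\b)/m, $text;
}

# Sample text, assumed for illustration; line 2 mentions a chapter title inline.
my $text = <<'EOT';
Chapter One
Monkeys ate the bananas (see 'bananas' in Chapter Two)...
Chapter Two
More about bananas.
EOT

my @chapters = split_chapters($text);
print scalar(@chapters), " chapters found\n";
print sort @chapters;
```

    A body line that itself happens to begin with the word "Chapter" would still be treated as a heading, so a stricter pattern may be needed for real text.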

    Bill
Re: Sort text by Chapter names
by AnomalousMonk (Archbishop) on May 24, 2018 at 23:25 UTC

    You seem to have solved your original problem by keeping the chapters of the book in order to begin with, but here's another approach to the original problem using "pure" numeric sorting. I think you can extract chapter headers and bodies to a hash already, so I'll start from that point.

    Update: Note that the existing code can handle chapter numbers like 'ThirtyOne' or 'Thirtyone', but not 'Thirty One'. What changes would be needed to handle such numbers?
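
    A minimal sketch of sorting spelled-out chapter numbers numerically, starting from a hash of titles and bodies. This is not the code originally posted in this reply: the lookup table `%value_of`, the helpers `chapter_number` and `titles_in_numeric_order`, and the sample hash are all assumptions for illustration, and the table only covers One through Ten.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Assumed lookup table: spelled-out number => numeric value (One..Ten only).
my %value_of = (
    one => 1, two   => 2, three => 3, four => 4, five => 5,
    six => 6, seven => 7, eight => 8, nine => 9, ten  => 10,
);

# Hypothetical helper: numeric value of a title such as "Chapter Five".
# Titles with an unknown or missing number word get 0.
sub chapter_number {
    my ($title) = @_;
    my ($word) = $title =~ /^Chapter\s+(\w+)/;
    return 0 unless defined $word;
    return $value_of{ lc $word } // 0;
}

# Hypothetical helper: titles of a chapter hash in numeric order,
# falling back to string order when two titles have the same number.
# Call it in list context.
sub titles_in_numeric_order {
    my ($chapters) = @_;
    return sort { chapter_number($a) <=> chapter_number($b) or $a cmp $b }
        keys %$chapters;
}

# Sample hash of title => body, assumed for illustration.
my %chapters = (
    'Chapter One'  => "Monkeys ate the bananas.\n",
    'Chapter Nine' => "A long chapter.\n",
    'Chapter Two'  => "The second chapter.\n",
    'Chapter Five' => "The fifth chapter.\n",
);

print "$_\n$chapters{$_}" for titles_in_numeric_order( \%chapters );
```

    With the sample hash the titles print in the order One, Two, Five, Nine.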


    Give a man a fish:  <%-{-{-{-<

Node Type: perlquestion [id://1215128]
Approved by marto
Front-paged by haukex