Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number
 
PerlMonks  

Algorithm to reduce the weight of a collection of bags

by ibm1620 (Friar)
on Jul 04, 2022 at 20:10 UTC ( #11145274=perlquestion: print w/replies, xml ) Need Help??

ibm1620 has asked for the wisdom of the Perl Monks concerning the following question:

I am writing a tool to expand a CSV file to a columnar format, with each column sized to accommodate the max width encountered in the file. That's easily done with two passes. But now I want to narrow the columns so that each record fits on one line in the terminal, at the expense of truncating some wide values. For example:
Input: Date|Amount|Category|Description 2022-06-23|123.45|Software & Tech|BACKBLAZE HTTPSWWW.BACKCA| 2022-06-24|63.45|Internet|RECURRING PAYMENT AUTHORIZED ON 06/11 SPECTR +UM TX| 2022-06-24|69.34|Phone|RECURRING PAYMENT AUTHORIZED ON 06/02 VZWRLSS*A +POCC VISE| (Max widths 10,6,15,55) Simple expansion Date |Amount|Category |Description + | 2022-06-23|123.45|Software & Tech|BACKBLAZE HTTPSWWW.BACKCA + | 2022-06-24|63.45 |Internet |RECURRING PAYMENT AUTHORIZED ON 06/1 +1 SPECTRUM TX | 2022-06-24|69.34 |Phone |RECURRING PAYMENT AUTHORIZED ON 06/0 +2 VZWRLSS*APOCC VISE| (Col widths 10,6,15,55) Shrunk to fit 52-char-wide window Date |Amount|Category |Description | 2022-06-23|123.45|Software & Tech|BACKBLAZE HTTPSWW| 2022-06-24|63.45 |Internet |RECURRING PAYMENT| 2022-06-24|69.34 |Phone |RECURRING PAYMENT| (Col widths 10,6,15,17) Shrunk to fit 46-char-wide window Date |Amount|Category |Description | 2022-06-23|123.45|Software & Te|BACKBLAZE HTT| 2022-06-24|63.45 |Internet |RECURRING PAY| 2022-06-24|69.34 |Phone |RECURRING PAY| (Col widths 10,6,13,13)
I recast the problem as an ordered set of bags whose contents vary in weight, and removing enough from the bags so they don't exceed some total weight. Furthermore, I want to penalize the heaviest bags first. I coded up a working solution (trying to use as many v5.36 features as I could). But I can't get over the feeling that there is a much simpler solution that's eluded me.

I'd appreciate any comments or suggestions for a simpler algorithm (for one thing, I don't think making it recursive helped any). I'd be particularly intrested in solutions that exeercise v5.36 features.

#!/usr/bin/env perl use v5.36; # implies use warnings my $target_weight = shift // die 'need target_weight'; # Starting weights my @weights = ( 20, 3, 25, 10, 3, 24, 25 ); say "Before:\n" . display( \@weights ); shrink( \@weights, $target_weight ); say "After:\n" . display( \@weights ); # shrink($bags, $target_weight) # # $bags = ref. to array of bag weights # $target_weight = maximum allowed weight of all bags # # If bags exceed target_weight, lighten the bags to achieve target by # lightening the heaviest bags first. no warnings q/experimental::for_list/; no warnings q/experimental::builtin/; use builtin qw/indexed/; use List::Util qw/sum/; sub shrink ( $bags, $target_weight, $curr_weight = undef ) { # Outer call only: if ( not defined $curr_weight ) { $curr_weight = sum @$bags; # quick exit if no shrink req'd return if ( $curr_weight <= $target_weight ); # copy input array and sort by weight, descending my @indexed_weights; for my ( $i, $wt ) ( indexed @$bags ){ push @indexed_weights, [ $i, $wt ]; } @indexed_weights = sort { $b->[1] <=> $a->[1] } @indexed_weights; # split indexes and weights into two arrays my @sorted_indexes = map { $_->[0] } @indexed_weights; my @sorted_weights = map { $_->[1] } @indexed_weights; say "Sorted:\n" . display( \@sorted_weights ); shrink( \@sorted_weights, $target_weight, $curr_weight ); # Deliver de-sorted result to caller for my ( $i, $wt ) ( indexed @sorted_weights ) { $bags->[ $sorted_indexes[$i] ] = $wt; } return; } # For inner call: return if ( $curr_weight <= $target_weight ); my $nbags = scalar @$bags; my $heaviest = $bags->[0]; # weight of heaviest bag # Count the heaviest bags and also find the next-heaviest my $n_of_heaviest; my $next_heaviest; COUNT: for ( 1 .. $nbags - 1 ) { if ( $bags->[$_] < $heaviest ) { $n_of_heaviest = $_; $next_heaviest = $bags->[$_]; last COUNT; } } $n_of_heaviest //= $nbags; $next_heaviest //= 0; my $loss = $heaviest - $next_heaviest; my $total_loss = $loss * $n_of_heaviest; if ( $curr_weight - $total_loss >= $target_weight ) { $curr_weight -= $total_loss; $bags->[$_] -= $loss for ( 0 .. $n_of_heaviest - 1 ); say "Reduce bags #0-#" . ( $n_of_heaviest - 1 ) . " by $loss to weight of next_heaviest, " . "$next_heaviest:\n" . display($bags); shrink( $bags, $target_weight, $curr_weight ); } else { # Need to do an equally-distributed shrink of the heaviest # bags to hit the target use integer; my $target_loss = $curr_weight - $target_weight; my $div = $target_loss / $n_of_heaviest; my $rem = $target_loss % $n_of_heaviest; for my $i ( -( $n_of_heaviest - 1 ) .. 0 ) { $loss = $div + ( $rem-- > 0 ? 1 : 0 ); $bags->[ -$i ] -= $loss; } say "Finally, reduce bags #0-#" . ( $n_of_heaviest - 1 ) . " to target weight of $target_weight:\n" . display($bags); } } sub display ($aref) { my $r = ''; for my ( $i, $wt ) ( indexed @$aref ) { $r .= sprintf "%2s: %s (%d)\n", "#$i", ( '-' x $wt ), $wt; } $r .= sprintf "Weight %d, target %d\n", sum(@$aref), $target_weight; return $r; }
Shrink to 100:
~/private/perl$ shrink 100 Before: #0: -------------------- (20) #1: --- (3) #2: ------------------------- (25) #3: ---------- (10) #4: --- (3) #5: ------------------------ (24) #6: ------------------------- (25) Weight 110, target 100 Sorted: #0: ------------------------- (25) #1: ------------------------- (25) #2: ------------------------ (24) #3: -------------------- (20) #4: ---------- (10) #5: --- (3) #6: --- (3) Weight 110, target 100 Reduce bags #0-#1 by 1 to weight of next_heaviest, 24: #0: ------------------------ (24) #1: ------------------------ (24) #2: ------------------------ (24) #3: -------------------- (20) #4: ---------- (10) #5: --- (3) #6: --- (3) Weight 108, target 100 Finally, reduce bags #0-#2 to target weight of 100: #0: ---------------------- (22) #1: --------------------- (21) #2: --------------------- (21) #3: -------------------- (20) #4: ---------- (10) #5: --- (3) #6: --- (3) Weight 100, target 100 After: #0: -------------------- (20) #1: --- (3) #2: ---------------------- (22) #3: ---------- (10) #4: --- (3) #5: --------------------- (21) #6: --------------------- (21) Weight 100, target 100

Replies are listed 'Best First'.
Re: Algorithm to reduce the weight of a collection of bags
by haukex (Archbishop) on Jul 04, 2022 at 20:34 UTC

    First of all, I personally would not rename "columns" to "bags" and "widths" to "weights". Anyway, could you tell us some more about the constrains, specifically on the number of columns and the possible window widths? Though coming up with a completely generic algorithm always seems more fun, I'm also for the pragmatic approach. For example, are the data types of the columns known? Because I doubt it would be a good idea to cut off the "date" and "amount" columns. There are also a bunch of table formatting modules on the CPAN, have you looked at those? Update: For example, Text::Table::Any lists 27 possible backends, and I'm sure there are plenty more. "Format an ASCII table" seems to be a favorite wheel to reinvent (I'm guilty of that too).

      I probably shouldn't have talked as much about table and data formatting. I'm writing a simple, dumb tool to browse/glance at CSV files from the terminal as quickly and effortlessly as possible. The reason I only need a generic solution (aside from its being more fun ;-) is this: My terminal is 158 characters wide; my CSV files have relatively few fields (under 15); the columns with the widest data tend to be unstructured text, where the widest cells are generally much wider than the average cell. Currently I'm browsing bank transaction files, where the Description column is the widest, and is the safest to truncate without chopping off "important" information. I indicate truncation by replacing the final character with a tilde, signaling data loss. The likelihood of my wanting to browse a CSV file having 30 columns, or on a terminal with 80 characters, where I'd have to be start narrowing the date and amount columns, is fairly remote.

      I looked at Text::ANSITable but it did so much more than I needed, and didn't appear to address fitting the table width to the terminal.

        My terminal is 158 characters wide; my CSV files have relatively few fields (under 15); the columns with the widest data tend to be unstructured text, where the widest cells are generally much wider than the average cell. Currently I'm browsing bank transaction files, where the Description column is the widest, and is the safest to truncate without chopping off "important" information.

        Thanks for the context. In that case I personally would probably take the pragmatic route and attempt to identify those text columns based on their width, and truncate those, while not truncating any columns under a certain length to make sure I don't truncate amounts or dates (perhaps even trying to identify such "important" column data types with regexes). But as you said, since this kind of thing is also fun, I understand wanting a more generic solution - at the moment I just don't have good tips for that. As to the existing modules, I didn't have the time to look through all of them to see if maybe there is one that already limits its output width to the terminal width "intelligently" - but perhaps another solution would be to implement the truncation yourself before passing the data off to a module for the output.

        Itís not Perl (unfortunately) but itís pretty spiffy and already works for what youíre probably wanting to do: visidata.

        Edit: iOS smart quotes strike again.

        The cake is a lie.
        The cake is a lie.
        The cake is a lie.

Re: Algorithm to reduce the weight of a collection of bags
by kcott (Archbishop) on Jul 05, 2022 at 05:27 UTC

    G'day ibm1620,

    "But I can't get over the feeling that there is a much simpler solution that's eluded me."

    I knocked this off in my lunch break: I expect there are many opportunities for improvement; however, it's certainly simpler than the code you posted.

    #!/usr/bin/env perl use v5.36; use autodie; use constant { DATE => 0, AMOUNT => 1, CATEGORY => 2, DESCRIPTION => 3, }; use Text::CSV; my $csv_file = 'pm_11145274_tabulate_csv.csv'; # Dummy figures from 1st pass which you say you know how to do my @max_widths = qw{10 6 15 55}; say 'Shrunk to fit 52-char-wide window'; tabulate($csv_file, \@max_widths, 52); say 'Shrunk to fit 46-char-wide window'; tabulate($csv_file, \@max_widths, 46); sub tabulate ($filename, $max_widths, $table_width) { state $des_weight = 0.7; my $table_available = $table_width - 4; # 4 pipe chars my $data_fmt = '%8d|'; # date: YYYYMMDD my $head_fmt = '%-8s|'; $table_available -= 8; $data_fmt .= '%6.2f|'; # amount: | 12.34| $head_fmt .= '%-6s|'; $table_available -= 6; my $max_cat_des = $max_widths->[CATEGORY] + $max_widths->[DESCRIPT +ION]; my $cat_width = int($table_available * ($max_widths->[CATEGORY] / +$max_cat_des)); my $des_width = int($table_available * ($max_widths->[DESCRIPTION] + / $max_cat_des)); $des_width = int($des_width * $des_weight); $cat_width += $table_available - ($cat_width + $des_width); $data_fmt .= "%-${cat_width}.${cat_width}s|%-${des_width}.${des_wi +dth}s|\n"; $head_fmt .= "%-${cat_width}s|%-${des_width}s|\n"; my $csv = Text::CSV::->new(); { open my $fh, '<', $filename; my $head_row = $csv->getline($fh); printf $head_fmt, @$head_row; while (my $data_row = $csv->getline($fh)) { printf $data_fmt, $data_row->[DATE] =~ y/-//dr, $data_row->@[AMOUNT, CATEGORY, DESCRIPTION]; } } return; }

    I used this for input. You may need some code adjustments depending on what your real input looks like.

    $ cat pm_11145274_tabulate_csv.csv Date,Amount,Category,Description 2022-06-23,123.45,Software & Tech,BACKBLAZE HTTPSWWW.BACKCA 2022-06-24,63.45,Internet,RECURRING PAYMENT AUTHORIZED ON 06/11 SPECTR +UM TX 2022-06-24,69.34,Phone,RECURRING PAYMENT AUTHORIZED ON 06/02 VZWRLSS*A +POCC VISE

    I reduced the YYYY-MM-DD date to YYYYMMDD to free up a couple of extra characters and aligned the amounts. Fiddle with the arithmetic to get exactly what you want: adjusting $des_weight may be all that you need. Here's the output:

    $ ./pm_11145274_tabulate_csv.pl Shrunk to fit 52-char-wide window Date |Amount|Category |Description | 20220623|123.45|Software & Tech |BACKBLAZE HTTPSWWW| 20220624| 63.45|Internet |RECURRING PAYMENT | 20220624| 69.34|Phone |RECURRING PAYMENT | Shrunk to fit 46-char-wide window Date |Amount|Category |Description | 20220623|123.45|Software & Te|BACKBLAZE HTTPS| 20220624| 63.45|Internet |RECURRING PAYME| 20220624| 69.34|Phone |RECURRING PAYME|

    — Ken

Re: Algorithm to reduce the weight of a collection of bags
by perlfan (Vicar) on Jul 04, 2022 at 21:13 UTC
    length, substr, split, sprintf, and x operator can make for some pretty terse table generation. I would want to preserve as much information as possible; which means before just chopping off useful data, condense the data; e.g.:
    • convert date to m/YY
    • have a "short" category code; e.g., Software & Tech would become somethbing like Sw/Tech
    • filter out "stop" words to description to filter out useless words, or maybe shorten them
    • You can probably eliminate $ from amount

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://11145274]
Approved by LanX
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (3)
As of 2022-11-30 08:02 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    Notices?