PerlMonks  

regexp word break help

by nop (Hermit)
on Apr 30, 2002 at 20:52 UTC ( [id://163152]=perlquestion )

nop has asked for the wisdom of the Perl Monks concerning the following question:

What's the best regexp way to say:
Make sure this string is 45 chars or less, and if you have to truncate it to make it shorter, only break it between words (at whitespace, \s)?
I am using split to get words, popping words until I am below 45 chars, then joining it back together -- this seems very UnPerlish.

Thanks
nop
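
For reference, a minimal sketch of the split/pop/join approach described in the question. The name truncate_by_words is hypothetical, and the sketch assumes it is acceptable to collapse runs of whitespace into single spaces, which split ' ' followed by join does as a side effect.

```perl
use strict;
use warnings;

# Hypothetical helper: drop trailing words until the string fits in $max chars.
sub truncate_by_words {
    my ($str, $max) = @_;
    my @words = split ' ', $str;    # split on runs of whitespace, ignoring leading ones
    # Pop words while the rejoined string is still too long.
    # A single word longer than $max is returned unchanged (assumption).
    pop @words while @words > 1 && length(join ' ', @words) > $max;
    return join ' ', @words;
}

print truncate_by_words('The quick brown fox jumped over the lazy doggy I own', 45), "\n";
```

It works, but it rebuilds the string on every pass, which is part of what the replies below try to avoid.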

Replies are listed 'Best First'.
Re: regexp word break help
by stephen (Priest) on Apr 30, 2002 at 21:04 UTC
    Does it have to be a regexp? Why?
    use Text::Wrap qw(wrap $columns);
    $columns = 45;
    my @lines = wrap('', '', 'Make sure this string is 45 chars or less, and if you have to truncate it to make it shorter, only break it between a word (\s)');
    print @lines;

    stephen
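
If the goal is truncation rather than wrapping, one option is to keep only the first wrapped line. The sketch below is an illustration, not stephen's code: first_wrapped_line is a hypothetical name, and it relies on the documented Text::Wrap behaviour that every line is at most $columns - 1 characters long, so $columns is set to one more than the wanted width.

```perl
use strict;
use warnings;
use Text::Wrap ();

# Hypothetical helper: wrap the text and keep only the first line.
sub first_wrapped_line {
    my ($text, $max) = @_;
    # Text::Wrap keeps each line to at most $columns - 1 characters.
    local $Text::Wrap::columns  = $max + 1;
    local $Text::Wrap::unexpand = 0;    # do not turn runs of spaces into tabs
    my ($first) = split /\n/, Text::Wrap::wrap('', '', $text);
    return defined $first ? $first : '';
}

print first_wrapped_line('The quick brown fox jumped over the lazy doggy I own', 45), "\n";
```

Note that Text::Wrap by default breaks a single word that is longer than the line, so a string with no whitespace still comes back cut to fit.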

Re: regexp word break help
by japhy (Canon) on Apr 30, 2002 at 21:32 UTC
    Here's one way to do it: ($str) = $str =~ /(.{1,45})\b/s;

    _____________________________________________________
    Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a (from-home) job
    s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

      I really like that, partly because I don't understand quite why it works. Without the brackets $str would be set to "1", which I take to be the return value of the successful pattern matching.

      But why do the brackets stop this happening? I should have thought that $str and ($str) wd behave the same way.

      And then why do they not perform this function in ($str) = $str =~ s/(.{1,45})\b/foo/s;, in which $str gets set to "1"?

      /me backs respectfully away from the shrine...

      § George Sherston
        I should have thought that $str and ($str) wd behave the same way.

        This node explains the difference between list context and scalar context very well. Many of the replies have great links to some interesting articles on the subject.

        And then why do they not perform this function in ($str) = $str =~ s/(.{1,45})\b/foo/s;, in which $str gets set to "1"?

        The perlop manpage explains the return values of the m/// and s/// operators in list and scalar context. The s/// operator returns the number of substitutions in either context.

        buckaduck
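
A small sketch of the context difference discussed in this sub-thread, using an arbitrary sample string. The variable names are made up for the illustration.

```perl
use strict;
use warnings;

my $text = 'The quick brown fox jumped over the lazy doggy I own';

# Scalar context: m// only reports whether it matched.
my $matched = $text =~ /(.{1,45})\b/s;

# List context (note the parentheses): m// returns the captured groups.
my ($captured) = $text =~ /(.{1,45})\b/s;

# s/// returns the number of substitutions made, even in list context.
my $copy = $text;
my ($count) = $copy =~ s/(.{1,45})\b/foo/s;

print "matched:  '$matched'\n";     # '1'
print "captured: '$captured'\n";    # 'The quick brown fox jumped over the lazy '
print "count:    '$count'\n";       # '1'
```

With this sample the capture ends with the space after "lazy", because \b also matches between a space and the start of the next word. A follow-up s/\s+\z// removes it if that matters.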

Re: regexp word break help
by dws (Chancellor) on Apr 30, 2002 at 21:12 UTC
    My first approach wouldn't be to do this with a single regexp. Instead, consider the following:
    • If the string length is <= 45 characters, it wins
    • Otherwise, extract the first 46 characters and apply $str =~ s/\s\S*?$//; which trims the string back to a word boundary (it removes the last whitespace character and anything after it). This doesn't guarantee that the resulting string will be 45 characters or less, though, since someone could provide a long string with no whitespace.

    You might also consider things like first deleting leading spaces, and collapsing multiple spaces into one.
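
The two steps above can be sketched as follows. truncate_at_word is a hypothetical name, and the final hard cut for input without whitespace is an assumption added here, not part of dws's reply.

```perl
use strict;
use warnings;

# Hypothetical helper following the two bullet points above.
sub truncate_at_word {
    my ($str, $max) = @_;
    return $str if length($str) <= $max;     # short enough: it wins

    my $head = substr($str, 0, $max + 1);    # first 46 chars when $max is 45
    $head =~ s/\s\S*?$//;                    # drop the last whitespace and what follows

    # Assumption, not in the reply: hard-cut when there was no whitespace at all.
    $head = substr($head, 0, $max) if length($head) > $max;
    return $head;
}

print truncate_at_word('The quick brown fox jumped over the lazy doggy I own', 45), "\n";
```

Taking 46 characters rather than 45 is what lets a word that ends exactly at position 45 survive: the 46th character is then the whitespace that gets trimmed.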

Re: regexp word break help
by graff (Chancellor) on Apr 30, 2002 at 21:01 UTC
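    Try this (assuming you want to see the entire original string, broken into lines of 45 characters or less each):
    while (length($longstring) > 45) {
        # last space at or before offset 45, so the printed line is at most 45 chars
        my $break = rindex( $longstring, " ", 45 );
        $break = 45 if $break < 1;      # no usable space: hard-break at 45
        print substr( $longstring, 0, $break ), $/;
        $longstring = substr( $longstring, $break );
        $longstring =~ s/^ +//;         # drop the space(s) we broke on
    }
    print $longstring;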
Re: regexp word break help
by graff (Chancellor) on Apr 30, 2002 at 21:07 UTC
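    Or this (since you said you wanted to do it with a regex) -- this time, we'll assume that you actually want to truncate the string, rather than break it into shorter lines. The length check matters: without it, a string that already fits would still lose everything after its last space.
    s/(.{1,45}) .*/$1\n/s if length($_) > 45;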
Re: regexp word break help
by coreolyn (Parson) on Apr 30, 2002 at 21:15 UTC

    Well off the top of my head...

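    while ( length($string) > 45 ) {
        # greedy .* backs up to the last whitespace, so each pass drops one trailing word
        last unless $string =~ /^(.*)\s/s;    # no whitespace left: stop
        $string = $1;
    }

    ... but I'm sure there are even cleaner ways to do it. (Yeah I know Death to dot star and all but I did say off the top of my head :)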

    coreolyn
Re: regexp word break help
by dsb (Chaplain) on Apr 30, 2002 at 21:21 UTC
    $str = "make sure this string is less than 45 chars and truncate on whitespace if need be";
    $fstr = substr($str,0,45);
    $sstr = substr($str,45);
    $fstr =~ s/\s+\w*$// if $sstr =~ /^\w/;
    Seems a little redundant I suppose, but I did it off the top of my head.

    UPDATE: I benchmarked it and it seems to compare favorably to graff's first option which I figured would be fastest since it doesn't use regexes at all.




    Amel
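
The benchmark itself was not posted, so the following is only a sketch of how such a comparison could be set up with the core Benchmark module. Both subs are adaptations made for this illustration (the rindex one is reduced to truncation only), and no timing result is implied.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $str = 'make sure this string is less than 45 chars and truncate on whitespace if need be';

# rindex/substr version, no regex (adapted from graff's first reply, truncating only)
sub with_rindex {
    my ($s) = @_;
    return $s if length($s) <= 45;
    my $break = rindex($s, ' ', 45);
    return substr($s, 0, $break > 0 ? $break : 45);
}

# substr + regex version (adapted from the snippet in this reply)
sub with_regex {
    my ($s) = @_;
    return $s if length($s) <= 45;
    my $fstr = substr($s, 0, 45);
    my $sstr = substr($s, 45);
    $fstr =~ s/\s+\w*$// if $sstr =~ /^\w/;
    return $fstr;
}

# Run each sub a fixed number of times and print a comparison table.
cmpthese(100_000, {
    rindex => sub { with_rindex($str) },
    regex  => sub { with_regex($str) },
});
```

Results will vary with the Perl version and with the input, so it is worth running this against strings that resemble the real data.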
Re: regexp word break help
by thelenm (Vicar) on Apr 30, 2002 at 21:24 UTC
    It's not particularly pretty, but here's a solution using substr and a regular expression:
    if (length $word > 45) {
        my $forty_sixth = substr($word, 45, 1);
        $word = substr($word, 0, 45);
        $word =~ s/\s*\w*\z// if $forty_sixth =~ /\w/;
    }

    Update: For some reason I was thinking about this while trying to get to sleep last night. My solution will fail to trim whitespace in the case where the 45th and 46th chars are both whitespace. Also, I should know better than to write a substitution expression that can match the empty string. Here's a revised version that should work correctly:

    if (length $word > 45) {
        my $forty_sixth = substr($word, 45, 1);
        $word = substr($word, 0, 45);
        $word =~ s/\w+\z// if $forty_sixth =~ /^\w/;
        $word =~ s/\s+\z//;
    }
      The other problem is your use of \w* in your substitution. By making that word character optional you aren't taking into consideration one letter words like 'a' or 'I'.

      UPDATE: thelenm is correct in saying that requiring a \w character will fail to trim whitespace where the 45th character is a space and the 46th a word char. Turns out I can fix my own solution above by not requiring the word char. Thanks thelenm ;0).




      Amel
        But requiring a word character (\w+) will fail to trim off spaces in the case where the 45th character is a space character and the 46th character is a word character. I think my solution works correctly... can you give an example where it doesn't? Here are some boundary cases that work as they should, using one-character words as you suggested:
        @words = (
            #         1         2         3         4         5
            #12345678901234567890123456789012345678901234567890
            "The quick brown fox jumped over the lazy d I own",
            "The quick brown fox jumped over the lazy do I own",
            "The quick brown fox jumped over the lazy dog I own",
            "The quick brown fox jumped over the lazy dogs I own",
            "The quick brown fox jumped over the lazy doggy I own",
        );
        for my $word (@words) {
            if (length $word > 45) {
                my $forty_sixth = substr($word, 45, 1);
                $word = substr($word, 0, 45);
                $word =~ s/\s*\w*\z// if $forty_sixth =~ /\w/;
            }
            print "Word: '$word', Length: ", length $word, "\n";
        }
        produces:
        Word: 'The quick brown fox jumped over the lazy d I', Length: 44
        Word: 'The quick brown fox jumped over the lazy do I', Length: 45
        Word: 'The quick brown fox jumped over the lazy dog', Length: 44
        Word: 'The quick brown fox jumped over the lazy dogs', Length: 45
        Word: 'The quick brown fox jumped over the lazy', Length: 40
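
The case named in thelenm's update (45th and 46th characters both whitespace) is not among the boundary cases listed, so here is a sketch that runs both versions on it. The sub names trim_v1 and trim_v2 and the sample string are made up for the illustration; the bodies are the two snippets from thelenm's post.

```perl
use strict;
use warnings;

# Original version.
sub trim_v1 {
    my ($word) = @_;
    if (length $word > 45) {
        my $forty_sixth = substr($word, 45, 1);
        $word = substr($word, 0, 45);
        $word =~ s/\s*\w*\z// if $forty_sixth =~ /\w/;
    }
    return $word;
}

# Revised version from the update.
sub trim_v2 {
    my ($word) = @_;
    if (length $word > 45) {
        my $forty_sixth = substr($word, 45, 1);
        $word = substr($word, 0, 45);
        $word =~ s/\w+\z// if $forty_sixth =~ /^\w/;
        $word =~ s/\s+\z//;
    }
    return $word;
}

# 44 characters of text, then spaces at positions 45 and 46.
my $input = 'The quick brown fox jumped over the lazy dog  I own';

printf "v1: '%s' (%d)\n", trim_v1($input), length trim_v1($input);
printf "v2: '%s' (%d)\n", trim_v2($input), length trim_v2($input);
```

The original keeps the trailing space (length 45), while the revised version trims it (length 44). Both stay within the 45-character limit.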
Re: regexp word break help
by clintp (Curate) on May 01, 2002 at 15:01 UTC
    Steal from the Perl Power Tools project fold(1). You're probably not interested in the bytewise processing (handled as a special case in a loop to process embedded \b \r and \t) but instead want the if() block that begins if ($Byte_Only) {.
Re: regexp word break help
by lshatzer (Friar) on May 01, 2002 at 17:06 UTC
    All of these above will help. Check out Truncate string; it will also allow you to put ... or whatever you want at the end of the string it chops, and if you ask for an array, it will give you two scalars: the first one chopped, and the second the remaining string.
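
The code of the Truncate string node is not reproduced here. As a rough illustration of the interface described (a suffix on the chopped string, and two values in list context), a sketch might look like this. truncate_string, its parameters, and the choice to count the suffix against the limit are all assumptions.

```perl
use strict;
use warnings;

# Hypothetical function, not the code from the "Truncate string" node.
# Assumes $max is larger than the length of the suffix.
sub truncate_string {
    my ($str, $max, $suffix) = @_;
    $suffix = '...' unless defined $suffix;
    return wantarray ? ($str, '') : $str if length($str) <= $max;

    my $room = $max - length($suffix);       # space left for text before the suffix
    my $head = substr($str, 0, $room + 1);
    unless ($head =~ s/\s+\S*\z//) {         # trim back to the last whitespace
        $head = substr($str, 0, $room);      # no whitespace: hard cut
    }
    my $rest = substr($str, length($head));
    $rest =~ s/^\s+//;

    my $chopped = $head . $suffix;
    # List context: chopped string and remainder. Scalar context: chopped string only.
    return wantarray ? ($chopped, $rest) : $chopped;
}

my ($chopped, $rest) = truncate_string('The quick brown fox jumped over the lazy doggy I own', 45);
print "$chopped\n$rest\n";
```

wantarray is what lets one function serve both call styles: it is true when the caller asked for a list.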

Node Type: perlquestion [id://163152]
Approved by brianarn
Front-paged by RhetTbull