PerlMonks  

How to substitute something from only between two specified characters

by ZWcarp (Beadle)
on Jun 28, 2011 at 16:57 UTC ( [id://911807]=perlquestion )

ZWcarp has asked for the wisdom of the Perl Monks concerning the following question:

Hello Perl Monks, I thank you for your time and input.

I am having trouble splitting a file properly because of some weird spacing. The structure of each "header" line is as follows :

>cds:ADD75048 A/Brussels/INS71/2009 2009/10/30 HA

>cds:ADF58353 A/Germany-MV/HGW4/2009 2009/12/ HA

>cds:ADF58351 A/Germany-MV/HGW6/2009 2009/12/ HA

>cds:ADU76781 A/England/94780010/2009 2009/10/22 HA

>cds:AEA30293 A/Netherlands/2223b/2009 2009/11/18 HA

>cds:ADD23250 A/District of Columbia/INS17/2009 2009/10/26 HA

>cds:ADX98640 A/San Diego/INS13/2009 2009/10/19 HA

>cds:ADD74978 A/San Diego/INS54/2009 2009/10/12 HA

>cds:ADF27925 A/Texas/JMS407/2010 2010/01/11 HA

>cds:ADM95824 A/Finland/661/2009 2009/10/26 HA

>cds:ADD97035 A/Wisconsin/629-D00036/2009 2009/09/15 HA

Normally you could just split on whitespace, but I realized that there is sometimes a space in the location ("San Diego", for example). I want to remove these spaces specifically. I think this can be done by telling Perl to substitute all spaces between the first and second forward slashes it encounters. Does anyone know how to do this, or even better, how to do it in bash?

This is the structure of the headers, and my goal is to remove the spaces ONLY from D:

A:B C/D/E/F G/H/I J

Any ideas? I hope this is clearer now. Thanks so much!

Replies are listed 'Best First'.
Re: Tricky reg ex
by BrowserUk (Patriarch) on Jun 28, 2011 at 17:22 UTC

    Sometimes split is just too much hassle:

    #! perl -slw
    use strict;
    use Data::Dump qw[ pp ];

    ## cds:Acc(space)strain/location/ID/year(space)date(space)segment.
    my @recs = map[
        m[
            > ( \S+ ) \s
            ( [^/]+ ) / ( [^/]+ ) / ( [^/]+ ) / ( \S+ ) \s
            ( \S+ ) \s
            ( \S+ )
        ]x
    ], <DATA>;

    pp \@recs;

    __DATA__
    >cds:ADD75048 A/Brussels/INS71/2009 2009/10/30 HA
    >cds:ADF58353 A/Germany-MV/HGW4/2009 2009/12/ HA
    >cds:ADF58351 A/Germany-MV/HGW6/2009 2009/12/ HA
    >cds:ADU76781 A/England/94780010/2009 2009/10/22 HA
    >cds:AEA30293 A/Netherlands/2223b/2009 2009/11/18 HA
    >cds:ADD23250 A/District of Columbia/INS17/2009 2009/10/26 HA
    >cds:ADX98640 A/San Diego/INS13/2009 2009/10/19 HA
    >cds:ADD74978 A/San Diego/INS54/2009 2009/10/12 HA
    >cds:ADF27925 A/Texas/JMS407/2010 2010/01/11 HA
    >cds:ADM95824 A/Finland/661/2009 2009/10/26 HA
    >cds:ADD97035 A/Wisconsin/629-D00036/2009 2009/09/15 HA

    Outputs

    c:\test>junk98
    [
      ["cds:ADD75048","A","Brussels","INS71",2009,"2009/10/30","HA"],
      ["cds:ADF58353","A","Germany-MV","HGW4",2009,"2009/12/","HA"],
      ["cds:ADF58351","A","Germany-MV","HGW6",2009,"2009/12/","HA"],
      ["cds:ADU76781","A","England",94780010,2009,"2009/10/22","HA"],
      ["cds:AEA30293","A","Netherlands","2223b",2009,"2009/11/18","HA",],
      ["cds:ADD23250","A","District of Columbia","INS17",2009,"2009/10/26","HA",],
      ["cds:ADX98640","A","San Diego","INS13",2009,"2009/10/19","HA",],
      ["cds:ADD74978","A","San Diego","INS54",2009,"2009/10/12","HA",],
      ["cds:ADF27925","A","Texas","JMS407",2010,"2010/01/11","HA"],
      ["cds:ADM95824","A","Finland",661,2009,"2009/10/26","HA"],
      ["cds:ADD97035","A","Wisconsin","629-D00036",2009,"2009/09/15","HA",],
    ]

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Tricky reg ex
by SuicideJunkie (Vicar) on Jun 28, 2011 at 17:15 UTC

    It looks to me like you've got:

    1. "cds:"
    2. not-slashes up to a slash (strain)
    3. not-slashes up to a slash (location)
    4. not-slashes, a slash then not-whitespace up to a whitespace (ID)
    5. 4 digits, slash, 2 digits, slash, 2 digits (date)
    6. the rest of the line (???)
    Since the field separations are sometimes slashes, and sometimes whitespace, a regex with 4 or 5 captures in it sounds like the way to go.
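
A minimal sketch of that capture-based approach, assuming the field layout listed above. The helper name parse_header and the choice to keep "ID/year" as one field are illustrative assumptions, not something from the post. It also strips the spaces from the location, which is what the original question asked for.

```perl
use strict;
use warnings;

# Hypothetical helper: parse one header line into six fields and
# remove the spaces from the location field only.
sub parse_header {
    my ($line) = @_;
    my @f = $line =~ m{
        ^ > cds: (\S+) \s+     # accession, after the literal "cds:"
        ([^/]+) /              # strain: not-slashes up to a slash
        ([^/]+) /              # location: may contain spaces
        (\S+) \s+              # ID and year, up to the next whitespace
        (\S+) \s+              # date, possibly partial such as 2009/12/
        (\S+) \s* $            # the rest of the line (segment)
    }x or return;
    $f[2] =~ s/\s+//g;         # squeeze the spaces out of the location
    return @f;
}

for my $line (
    '>cds:ADX98640 A/San Diego/INS13/2009 2009/10/19 HA',
    '>cds:ADF58353 A/Germany-MV/HGW4/2009 2009/12/ HA',
) {
    print join( '|', parse_header($line) ), "\n";
}
```

Lines that do not fit the pattern return an empty list, so the caller can decide whether to warn or skip.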

Re: How to substitute something from only between two specified characters
by wind (Priest) on Jun 28, 2011 at 18:15 UTC

    Just rely on the fact that the second item is the only one that allows spacing:

    use strict;
    use warnings;

    while (<DATA>) {
        if (/^(\S+)\s+(.*\S)\s+(\S+)\s+(\S+)$/) {
            print "A:B = $1\n";
            print "C/D/E/F = $2\n";
            print "G/H/I = $3\n";
            print "J = $4\n";
        } else {
            warn "Invalid record: $_";
        }
    }

    __DATA__
    >cds:ADD75048 A/Brussels/INS71/2009 2009/10/30 HA
    >cds:ADF58353 A/Germany-MV/HGW4/2009 2009/12/ HA
    >cds:ADF58351 A/Germany-MV/HGW6/2009 2009/12/ HA
    >cds:ADU76781 A/England/94780010/2009 2009/10/22 HA
    >cds:AEA30293 A/Netherlands/2223b/2009 2009/11/18 HA
    >cds:ADD23250 A/District of Columbia/INS17/2009 2009/10/26 HA
    >cds:ADX98640 A/San Diego/INS13/2009 2009/10/19 HA
    >cds:ADD74978 A/San Diego/INS54/2009 2009/10/12 HA
    >cds:ADF27925 A/Texas/JMS407/2010 2010/01/11 HA
    >cds:ADM95824 A/Finland/661/2009 2009/10/26 HA
    >cds:ADD97035 A/Wisconsin/629-D00036/2009 2009/09/15 HA

      OP can probably extrapolate (or learn from some of the other replies), and maybe that's why the parent stops just short of actually answering the original question: how to remove spaces, but only in the location field.

      But just in case the assumption above is wrong, assign $2 to a named var ($second maybe) and remove spaces:

      $second =~ s/\s*//g;
      ...
      say "C/D/E/F - $second";
      ...

      BUT that's not really the point of this post; rather (perhaps because /me is suffering brain-freeze), why the heck is the second capture, (.*\S), a greedy anything followed by anything-not-whitespace, working?

      YAPE::Regex::Explain (Y::R::E) isn't helping this morning; neither is a recheck of (some obvious parts of) Mastering Regular Expressions.

      And in case my brain-freeze isn't clear, that chill is telling me that \s+(.*\S)\s+(\S+) should capture the location field and everything else up to the last space, before "HA". That's obviously wrong, but why?

      Can someone, please, provide the meat for a slap-my-forehead, grunt-"Duh!" moment?

        I wrote the regex that way to have an explicit boundary between the second field and the spacing separating it from the third field. I didn't want to eat any extra spacing.

        I could have accomplished this in one of three ways:

        1. Explicitly specify that the field shouldn't contain a space at the end, like I did: (.*\S)\s+
        2. Use an explicit boundary like (.*)\b\s+
        3. Rely on non-greedy matching: (.*?)\s+, which would work because of the hard boundaries for the other fields

        In the end, the third method above would probably appear the cleanest, but they all accomplish the same thing in the context of the rest of the regex.
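
A small sketch to check that claim, assuming the sample header format from the question. The hash %variant and the helper fields_for are names made up for this illustration. It also hints at why the greedy version works: the two trailing (\S+) fields and the $ anchor must still match, so the regex engine backtracks the greedy (.*) until exactly two whitespace-separated tokens remain.

```perl
use strict;
use warnings;

# The three ways of ending the second field, each in the full regex.
my %variant = (
    'trailing non-space' => qr/^(\S+)\s+(.*\S)\s+(\S+)\s+(\S+)$/,
    'word boundary'      => qr/^(\S+)\s+(.*)\b\s+(\S+)\s+(\S+)$/,
    'non-greedy'         => qr/^(\S+)\s+(.*?)\s+(\S+)\s+(\S+)$/,
);

# Hypothetical helper: return the captures for one variant as an array ref.
sub fields_for {
    my ( $name, $line ) = @_;
    return [ $line =~ $variant{$name} ];
}

my $line = '>cds:ADD23250 A/District of Columbia/INS17/2009 2009/10/26 HA';
for my $name ( sort keys %variant ) {
    my $f = fields_for( $name, $line );
    printf "%-18s => %s\n", $name, join( ' | ', @$f );
}
```

All three variants should report the same four fields for this line; they could differ on input where the second field ends in punctuation or is followed by irregular spacing.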

Re: Tricky reg ex
by Anonymous Monk on Jun 28, 2011 at 17:13 UTC

    but I couldn't get this to work either.

    Show your code, and put both the sample data and the code inside code tags.

Re: How to substitute something from only between two specified characters
by ambrus (Abbot) on Jun 29, 2011 at 11:19 UTC

    Split on slashes, modify the second field, join with slashes. In code,

    @s = split m"/";
    1 < @s and $s[1] =~ s/ //g;
    print join "/", @s;
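
The same idea can also be written as a single substitution, which is closer to what the question literally asked for. This is only a sketch, not code from the thread: the helper name strip_location_spaces is an assumption, and it treats everything between the first and second slash on the line as the location.

```perl
use strict;
use warnings;

# Hypothetical helper: delete the spaces that sit between the first
# and second forward slash of a header line; leave the rest alone.
sub strip_location_spaces {
    my ($line) = @_;
    $line =~ s{^([^/]*/)([^/]*)}{
        my ( $head, $location ) = ( $1, $2 );  # copies: captures are read-only
        $location =~ tr/ //d;                  # delete literal spaces only
        $head . $location;
    }e;
    return $line;
}

print strip_location_spaces(
    '>cds:ADD23250 A/District of Columbia/INS17/2009 2009/10/26 HA'
), "\n";
```

From a shell, the same substitution should work as a one-liner, assuming the headers live in a file (headers.txt is a made-up name):

    perl -pe 's{^([^/]*/)([^/]*)}{ my ($h, $l) = ($1, $2); $l =~ tr/ //d; $h . $l }e' headers.txt
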
Re: How to substitute something from only between two specified characters
by Marshall (Canon) on Jun 28, 2011 at 19:26 UTC
    A couple of solutions for you. It is possible to put an extra qualifier on the split regex. In the first example below, I say split on white space but only if those spaces are preceded by a digit or the / character. This is done by a positive look behind assertion. So a name like "District of Columbia" has the spaces preserved and no split happens on those spaces.

    In the second example below, I used the same extra qualifier trick and said remove spaces but only if the spaces are preceded by a letter. Then I did a split on the result.

    Note that the chomp is not necessary in the second case. When splitting on the default of \s+, space characters are in the set of [space,\n\r\f\t]. Since \n is in that set, it is removed. In the first example a chomp() is needed because the condition of the split was modified.

    The seek statement just "rewinds" the DATA file handle. The DATA file handle starts out positioned at the first byte after the __DATA__ statement. $begin is used to remember what that byte is so that I can go back. If I had done a seek DATA,0,0; that would have moved the file pointer to right before the "hashbang" line. If for some reason you would like for a Perl program to read itself, that is one way!

    #!/usr/bin/perl -w
    use strict;

    my $begin = tell(DATA);   #to rewind DATA later on

    while (<DATA>) {
        chomp;
        # (?<=\d|\/) is a positive look behind assertion
        # a digit or / must precede the \s+ in order to split
        # upon it. Note chomp is necessary because the
        # trailing \n will not be removed because there is
        # no digit in HA.
        my @tokens = split(/(?<=\d|\/)\s+/, $_);
        print join("\n",@tokens),"\n";
    }
    =prints like:
    >cds:ADD23250
    A/District of Columbia/INS17/2009
    2009/10/26
    HA
    =cut

    seek DATA,$begin,0;   #rewinds DATA back to beginning

    while (<DATA>) {
        s/(?<=[a-zA-Z])\s+//g;   #remove spaces if preceded by letter
        my @tokens = split;
        print join("\n",@tokens),"\n";
    }
    =prints like:
    >cds:ADD23250
    A/DistrictofColumbia/INS17/2009
    2009/10/26
    HA
    =cut

    __DATA__
    >cds:ADD75048 A/Brussels/INS71/2009 2009/10/30 HA
    >cds:ADF58353 A/Germany-MV/HGW4/2009 2009/12/ HA
    >cds:ADF58351 A/Germany-MV/HGW6/2009 2009/12/ HA
    >cds:ADU76781 A/England/94780010/2009 2009/10/22 HA
    >cds:AEA30293 A/Netherlands/2223b/2009 2009/11/18 HA
    >cds:ADD23250 A/District of Columbia/INS17/2009 2009/10/26 HA
    >cds:ADX98640 A/San Diego/INS13/2009 2009/10/19 HA
    >cds:ADD74978 A/San Diego/INS54/2009 2009/10/12 HA
    >cds:ADF27925 A/Texas/JMS407/2010 2010/01/11 HA
    >cds:ADM95824 A/Finland/661/2009 2009/10/26 HA
    >cds:ADD97035 A/Wisconsin/629-D00036/2009 2009/09/15 HA
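
As a hedged aside, the look-behind split can be packaged as a small function, here with a character class instead of the alternation. The name split_header is made up for this sketch. It also makes the main limitation easy to see: the approach assumes that no word inside the location ends in a digit or a slash before a space.

```perl
use strict;
use warnings;

# Hypothetical helper: split a header line on whitespace, but only
# where the whitespace follows a digit or a forward slash.
sub split_header {
    my ($line) = @_;
    chomp $line;
    return split /(?<=[\d\/])\s+/, $line;
}

print join( "\n", split_header(">cds:ADX98640 A/San Diego/INS13/2009 2009/10/19 HA\n") ), "\n";
```

A made-up location such as "Area 51 North" would be split after "51", so this is only safe when the location names are known to be purely alphabetic words.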
Re: How to substitute something from only between two specified characters
by johngg (Canon) on Jun 28, 2011 at 21:16 UTC

    With just one problematic field that contains spaces you can still split the whole line on spaces without first modifying it by using the third argument to limit the number of fields. Work first from the left leaving the field with spaces along with the rest of the line in the "remainder" part of the string. Then, by using reverse, work in from the right, again with a limit, on the reversed "remainder" to get the rest of the fields; the field with the spaces is the last field and is not disturbed because of the limit to the split.

    use strict;
    use warnings;
    use 5.010;

    my @dataLines = (
        q{>cds:AEA30293 A/Netherlands/2223b/2009 2009/11/18 HA},
        q{>cds:ADD23250 A/District of Columbia/INS17/2009 2009/10/26 HA},
        q{>cds:ADX98640 A/San Diego/INS13/2009 2009/10/19 HA},
        q{>cds:ADD97035 A/Wisconsin/629-D00036/2009 2009/09/15 HA},
    );

    say q{=} x 60;
    foreach my $dataLine ( @dataLines ) {
        say $dataLine;
        my @elems;
        ( $elems[ 0 ], my $remainder ) = split m{\s+}, $dataLine, 2;
        @elems[ 3, 2, 1 ] =
            map { scalar reverse }
            split m{\s+}, reverse( $remainder ), 3;
        say for @elems;
        say q{=} x 60;
    }

    The output.

    ============================================================
    >cds:AEA30293 A/Netherlands/2223b/2009 2009/11/18 HA
    >cds:AEA30293
    A/Netherlands/2223b/2009
    2009/11/18
    HA
    ============================================================
    >cds:ADD23250 A/District of Columbia/INS17/2009 2009/10/26 HA
    >cds:ADD23250
    A/District of Columbia/INS17/2009
    2009/10/26
    HA
    ============================================================
    >cds:ADX98640 A/San Diego/INS13/2009 2009/10/19 HA
    >cds:ADX98640
    A/San Diego/INS13/2009
    2009/10/19
    HA
    ============================================================
    >cds:ADD97035 A/Wisconsin/629-D00036/2009 2009/09/15 HA
    >cds:ADD97035
    A/Wisconsin/629-D00036/2009
    2009/09/15
    HA
    ============================================================

    I hope this is of interest.

    Update: Modified code to change order of array slice and thereby eliminated the final reverse

    Cheers,

    JohnGG
