http://qs321.pair.com?node_id=1171469

jake7176 has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

This is my first post and I am sure it has been asked before, but I can't find it anywhere.

Say I have a string ksguhdipghisosipghthispartudirlhgdr

How would I go about extracting thispart?

I feel like I should make clear that I don't know whereabouts in the string it will be, I just know that it is in there somewhere. I have tried some basic regex but all I seem to be able to do is remove "thispart" and end up with the stuff I don't want.

Any help is much appreciated,

thanks

Re: Get a known substring from a string
by johngg (Canon) on Sep 09, 2016 at 21:29 UTC

    As BrowserUk has pointed out, it is a little puzzling why you need to search for the ID if you already know it. However, if you are looking for an exact substring within a longer string then index might be a better approach than a regex. If you also want to remove the substring from the string then the four argument form of substr is useful as it returns the removed text.

    johngg@shiraz:~ > perl -Mstrict -Mwarnings -E '
    my $find = q{thispart};
    say $find;
    my $str = q{ksguhdipghisosipghthispartudirlhgdr};
    say $str;
    my $posn = index $str, $find;
    die qq{Substring not found\n} if $posn == -1;
    my $idNo = substr $str, $posn, length $find, q{};
    say $idNo;
    say $str;'
    thispart
    ksguhdipghisosipghthispartudirlhgdr
    thispart
    ksguhdipghisosipghudirlhgdr

    index returns -1 if the substring is not found.

    johngg@shiraz:~ > perl -Mstrict -Mwarnings -E '
    my $find = q{thatpart};
    say $find;
    my $str = q{ksguhdipghisosipghthispartudirlhgdr};
    say $str;
    my $posn = index $str, $find;
    die qq{Substring not found\n} if $posn == -1;
    my $idNo = substr $str, $posn, length $find, q{};
    say $idNo;
    say $str;'
    thatpart
    ksguhdipghisosipghthispartudirlhgdr
    Substring not found

    I get a whole load of ID numbers come in from different sources, but for some reason, they aren't spaced apart

    If the IDs are all mashed together beware of finding false positives. Given 4-digit IDs of 3819, 8076 and 7204 in the string 381980767204, looking for ID 6720 would falsely report as being present. If you are lucky enough to have fixed length IDs, consider breaking the string down using unpack to place the IDs into a hash. Then searching for any ID becomes simple.

    johngg@shiraz:~ > perl -Mstrict -Mwarnings -MData::Dumper -E '
    my $idStr = q{381980767204};
    my %idLookup = map { $_ => 1 } unpack q{(a4)*}, $idStr;
    print Data::Dumper->Dumpxs( [ \ %idLookup ], [ qw{ *idLookup } ] );
    say qq{ID $_ }, exists $idLookup{ $_ } ? q{found} : q{not found}
        for qw{ 7204 6720 };'
    %idLookup = (
                  '8076' => 1,
                  '7204' => 1,
                  '3819' => 1
                );
    ID 7204 found
    ID 6720 not found

    I hope this is helpful.

    Cheers,

    JohnGG

Re: Get a known substring from a string
by Laurent_R (Canon) on Sep 09, 2016 at 20:52 UTC
    I feel like I should make clear that I don't know where abouts in the string it will be, I just know that it is in there somewhere.
    If you know the content of the sub-string literally in advance and know that it is there, then why would you want to look for it? Well, if you want to check that it is really there, or want to know where in the string it is, then maybe you want to try the index (or, possibly but less likely, the rindex) function. But I suspect that's not really what you want.

    The problem, though, is that you did not state what you want. I would suspect you probably want to use a regex to find a sub-string that you don't know exactly in advance, but that matches certain criteria or a pattern. If this is the case, please provide more information about what you know about that sub-string.
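
As a rough illustration of the index and rindex functions mentioned above, here is a minimal sketch. The sub name first_and_last_offset is hypothetical and only there for the example; both functions return -1 when the substring is absent.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first and last offsets of $find
# inside $str, or an empty list if $find does not occur at all.
sub first_and_last_offset {
    my ( $str, $find ) = @_;
    my $first = index( $str, $find );     # scans from the left
    return if $first == -1;               # -1 means "not found"
    my $last = rindex( $str, $find );     # scans from the right
    return ( $first, $last );
}

my $str = 'ksguhdipghisosipghthispartudirlhgdr';
if ( my ( $first, $last ) = first_and_last_offset( $str, 'thispart' ) ) {
    print "first at $first, last at $last\n";
}
else {
    print "not found\n";
}
```

When the substring occurs only once, both offsets are the same; they differ only if it is repeated.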

Re: Get a known substring from a string
by pryrt (Abbot) on Sep 09, 2016 at 18:37 UTC
    I have tried some basic regex but all I seem to be able to do is remove "thispart" and end up with the stuff I don't want.

    If you show us the regex that removes "thispart", there are probably lots of Monks here who would be able to teach you how to edit it to make it keep just "thispart". And, as BrowserUk said, supplying a small amount of actual data, along with what you expect the output to be given that data, will help the Monks customize their responses to your actual input rather than the alpha-only data you've shown.

    As a hint while you're reading about perlre: if you create a matching group in your regex, the magic $1 variable will hold the contents of the first matching group; you can use that with a m// to just extract your ID (assuming the "thispart" is really a partial regex that matches what you're looking for), or you could use it with s/// to remove either what matches or remove what doesn't match.
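
To make the hint about capture groups concrete, here is a small sketch. It assumes, purely for illustration, that the ID is the first run of four digits in the string; the thread never says what the real IDs look like, and the sub names are hypothetical.

```perl
use strict;
use warnings;

# ASSUMPTION (not from the thread): an ID is the first run of 4 digits.
my $id_pattern = qr/(\d{4})/;

# m// with a capture group: $1 holds what the group matched.
sub extract_id {
    my ($str) = @_;
    return $str =~ $id_pattern ? $1 : undef;
}

# s/// removing what matches, leaving the rest.
sub remove_id {
    my ($str) = @_;
    ( my $copy = $str ) =~ s/$id_pattern//;
    return $copy;
}

# s/// removing what does NOT match, leaving only the captured ID.
sub keep_only_id {
    my ($str) = @_;
    ( my $copy = $str ) =~ s/^.*?$id_pattern.*$/$1/s;
    return $copy;
}

my $str = 'abcd1234efgh';
print extract_id($str),   "\n";
print remove_id($str),    "\n";
print keep_only_id($str), "\n";
```

If the pattern does not match, extract_id returns undef and the two s/// variants return the string unchanged, so a caller would want to check for that case.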

Re: Get a known substring from a string
by shmem (Chancellor) on Sep 09, 2016 at 21:42 UTC

    This sounds like an XY Problem to me, and I suspect that you want to do Z, of which you haven't told us anything.

    perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
Re: Get a known substring from a string
by BrowserUk (Patriarch) on Sep 09, 2016 at 16:38 UTC

    If you know it's there and you know what it is, why look? You already have it.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.

      So at the moment, I get a whole load of ID numbers come in from different sources, but for some reason, they aren't spaced apart. So when I am trying to get that ID number, I can't use it because it has all of the other numbers attached. So I was going to strip everything else then run my script with the ID that I have left

        If you describe your actual problem, with examples of the real data, the real ID number you need to find and what you know about the number you need to extract, you might get some useful answers.

        Your current description implies that you already have the ID; otherwise how are you searching for it?


Re: Get a known substring from a string
by Anonymous Monk on Sep 09, 2016 at 23:05 UTC
    Just do this:
    if ($str =~ m/(thispart)/) { $found = $1; }
    $1 is set to the part that matches the regex in between the brackets.
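
A variation on the snippet above, as a sketch only: if the text to look for arrives in a variable (an assumption, the thread only shows a hard-coded 'thispart'), wrapping it in \Q...\E keeps regex metacharacters in it from being interpreted. The sub name find_literal is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the matched text if $find occurs
# literally in $str, otherwise undef.
sub find_literal {
    my ( $str, $find ) = @_;
    # \Q...\E quotes metacharacters such as . + * in $find
    return $str =~ /(\Q$find\E)/ ? $1 : undef;
}

my $found = find_literal( 'ksguhdipghisosipghthispartudirlhgdr', 'thispart' );
print defined $found ? "found: $found\n" : "not found\n";
```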

      That is equivalent to $found = 'thispart' if 1+index( $str, 'thispart' );, but index is 3 times faster:

      $s='the quick brown fox jumps over the lazy dog';
      cmpthese -1,{
          a => q[ if( $s =~ m[(lazy)] ){ $found=$1 } ],
          b => q[ $found = 'lazy' if 1+index( $s, 'lazy' ); ],
      };;
               Rate    a    b
      a  585631/s   -- -77%
      b 2535746/s 333%   --


        I know it looks really trivial once you see it, but I'm really astonished by your approach of using 1+index(...) - it had not occurred to me to use index that way in an expression to check for presence. I'll add that to my set of idiosyncratic phrases, just like if( system(...) == 0 ) { for successful execution of subprocesses.
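
For readers who have not met the idiom: index returns -1 when the substring is missing and 0 when it sits at the very start, so testing the bare return value would give the wrong answer in both cases. Adding 1 maps "missing" to 0 (false) and every real offset to a true value. A minimal sketch, with a hypothetical sub name:

```perl
use strict;
use warnings;

# Hypothetical helper: true (1) if $find occurs in $str, else false (0).
sub contains {
    my ( $str, $find ) = @_;
    # index gives -1 for "not found", so 1 + index(...) is 0 only then.
    return 1 + index( $str, $find ) ? 1 : 0;
}

for my $s ( 'the lazy dog', 'lazy dog', 'the dog' ) {
    printf "%-14s %s\n", "'$s'", contains( $s, 'lazy' ) ? 'yes' : 'no';
}
```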

        Update: I wondered about how much the capturing parentheses cost, and it seems they account for roughly half of the performance attainable when using the regex engine. Maybe the two additional steps executed in the regex engine (OPEN1 and CLOSE1) are to blame for that, as they effectively double the number of steps the regex engine has to execute for a successful match.

        Not invoking the regex engine still is much faster, even though I had thought there once was an optimization that turned constant regular expressions without anchors or quantifiers into an index lookup...

        # a:  if( $s =~ m[(lazy)] ){ $found=$1 }
        Compiling REx "(lazy)"
        Final program:
           1: OPEN1 (3)
           3:   EXACT <lazy> (5)
           5: CLOSE1 (7)
           7: END (0)
        anchored "lazy" at 0 (checking anchored) minlen 4
        Matching REx "(lazy)" against "the quick brown fox jumps over the lazy dog"
        Intuit: trying to determine minimum start position...
          Found anchored substr "lazy" at offset 35...
          (multiline anchor test skipped)
          try at offset...
        Intuit: Successfully guessed: match at offset 35
          35 < the > <lazy dog>      |  1:OPEN1(3)
          35 < the > <lazy dog>      |  3:EXACT <lazy>(5)
          39 <the lazy> < dog>       |  5:CLOSE1(7)
          39 <the lazy> < dog>       |  7:END(0)
        Match successful!
        Freeing REx: "(lazy)"
        # b:  $found = 'lazy' if 1+index( $s, 'lazy' );
        # c:  if( $s =~ m[lazy] ){ $found=$& }
        Compiling REx "lazy"
        Final program:
           1: EXACT <lazy> (3)
           3: END (0)
        anchored "lazy" at 0 (checking anchored isall) minlen 4
        Matching REx "lazy" against "the quick brown fox jumps over the lazy dog"
        Intuit: trying to determine minimum start position...
          Found anchored substr "lazy" at offset 35...
          (multiline anchor test skipped)
          try at offset...
        Intuit: Successfully guessed: match at offset 35
        Freeing REx: "lazy"
                Rate    a    c    b
        a 2038631/s   -- -50% -75%
        c 4089154/s 101%   -- -49%
        b 8013601/s 293%  96%   --

        The program I used:

        use strict;
        use Benchmark 'cmpthese';
        use vars '$s';
        $s='the quick brown fox jumps over the lazy dog';
        my $found;

        my %benchmarks = (
            a => q[ if( $s =~ m[(lazy)] ){ $found=$1 } ],
            b => q[ $found = 'lazy' if 1+index( $s, 'lazy' ); ],
            c => q[ if( $s =~ m[lazy] ){ $found=$& } ],
        );

        {
            use re 'debug';
            for (sort keys %benchmarks) {
                print "# $_: $benchmarks{$_}\n";
                undef $found;
                my $code = eval qq{sub { $benchmarks{$_} } }
                    or die "Couldn't compile benchmark $_: $@";
                $code->();
                $found eq 'lazy'
                    or die "Unexpected results: [$found] vs. 'lazy'";
            };
        };

        cmpthese( -1, \%benchmarks);
Re: Get a known substring from a string
by Krambambuli (Curate) on Sep 13, 2016 at 09:51 UTC
    Just wondering if the number of occurrences would be of any importance...
    And, if yes, if

    thisparthispart

    would be considered to be one or two occurrences... ;)
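
Both answers are defensible, and which one you get depends on where the search resumes after a hit. The following is only a sketch of the two readings, using the three-argument form of index; the sub name and its $overlap flag are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: count occurrences of $find in $str.
# With $overlap true, the search resumes one character after the start
# of each hit; otherwise it resumes just past the end of the hit.
sub count_occurrences {
    my ( $str, $find, $overlap ) = @_;
    return 0 if $find eq '';              # avoid an endless loop
    my $step  = $overlap ? 1 : length $find;
    my $count = 0;
    my $pos   = 0;
    while ( ( $pos = index( $str, $find, $pos ) ) != -1 ) {
        $count++;
        $pos += $step;
    }
    return $count;
}

my $str = 'thisparthispart';
print 'non-overlapping: ', count_occurrences( $str, 'thispart', 0 ), "\n";
print 'overlapping:     ', count_occurrences( $str, 'thispart', 1 ), "\n";
```

In 'thisparthispart' the two candidate hits share the middle 't', so the non-overlapping count is one and the overlapping count is two.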