Get a known substring from a string

jake7176 has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

This is my first post and I am sure it has been asked before, but I can't find it anywhere.

Say I have a string ksguhdipghisosipghthispartudirlhgdr

How would I go about extracting thispart?

I feel like I should make clear that I don't know whereabouts in the string it will be; I just know that it is in there somewhere. I have tried some basic regexes, but all I seem to be able to do is remove "thispart" and end up with the stuff I don't want.

Any help is much appreciated,

thanks

Replies are listed 'Best First'.
Re: Get a known substring from a string by johngg (Canon) on Sep 09, 2016 at 21:29 UTC
As BrowserUk has pointed out, it is a little puzzling why you need to search for the ID if you already know it. However, if you are looking for an exact substring within a longer string, then index might be a better approach than a regex. If you also want to remove the substring from the string, then the four-argument form of substr is useful as it returns the removed text.

```
johngg@shiraz:~ > perl -Mstrict -Mwarnings -E '
my $find = q{thispart};
say $find;
my $str = q{ksguhdipghisosipghthispartudirlhgdr};
say $str;
my $posn = index $str, $find;
die qq{Substring not found\n} if $posn == -1;
my $idNo = substr $str, $posn, length $find, q{};
say $idNo;
say $str;'
thispart
ksguhdipghisosipghthispartudirlhgdr
thispart
ksguhdipghisosipghudirlhgdr
```

index returns `-1` if the substring is not found.

```
johngg@shiraz:~ > perl -Mstrict -Mwarnings -E '
my $find = q{thatpart};
say $find;
my $str = q{ksguhdipghisosipghthispartudirlhgdr};
say $str;
my $posn = index $str, $find;
die qq{Substring not found\n} if $posn == -1;
my $idNo = substr $str, $posn, length $find, q{};
say $idNo;
say $str;'
thatpart
ksguhdipghisosipghthispartudirlhgdr
Substring not found
```

"I get a whole load of ID numbers come in from different sources, but for some reason, they aren't spaced apart"

If the IDs are all mashed together, beware of finding false positives. Given 4-digit IDs of 3819, 8076 and 7204 in the string 381980767204, looking for ID 6720 would falsely report it as being present. If you are lucky enough to have fixed-length IDs, consider breaking the string down using unpack to place the IDs into a hash. Then searching for any ID becomes simple.

```
johngg@shiraz:~ > perl -Mstrict -Mwarnings -MData::Dumper -E '
my $idStr = q{381980767204};
my %idLookup = map { $_ => 1 } unpack q{(a4)*}, $idStr;
print Data::Dumper->Dumpxs( [ \ %idLookup ], [ qw{ *idLookup } ] );
say qq{ID $_ }, exists $idLookup{ $_ } ? q{found} : q{not found}
    for qw{ 7204 6720 };'
%idLookup = (
              '8076' => 1,
              '7204' => 1,
              '3819' => 1
            );
ID 7204 found
ID 6720 not found
```

I hope this is helpful.

Cheers,

JohnGG
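The false-positive problem described above can also be avoided while still using index, by accepting only matches that start on an ID boundary. The following is a minimal sketch, not part of the original reply; `has_aligned_id` is a hypothetical helper, and it assumes every ID in the packed string has exactly the same width.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $id occurs in $packed at an offset that is
# a multiple of $width. Assumes all IDs are exactly $width characters long.
sub has_aligned_id {
    my ( $packed, $id, $width ) = @_;
    my $posn = index $packed, $id;
    while ( $posn != -1 ) {
        return 1 if $posn % $width == 0;          # starts on an ID boundary
        $posn = index $packed, $id, $posn + 1;    # keep looking further along
    }
    return 0;
}

my $idStr = q{381980767204};
for my $id ( qw{ 7204 6720 } ) {
    printf "ID %s %s\n", $id,
        has_aligned_id( $idStr, $id, 4 ) ? q{found} : q{not found};
}
```

With the sample string this reports 7204 as found and 6720 as not found, because 6720 only appears straddling two IDs.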
Re: Get a known substring from a string by Laurent_R (Canon) on Sep 09, 2016 at 20:52 UTC
"I feel like I should make clear that I don't know whereabouts in the string it will be, I just know that it is in there somewhere."

If you know the content of the sub-string literally in advance and know that it is there, then why would you want to look for it?

Well, if you want to check that it is really there, or want to know where in the string it is, then maybe you want to try the index (or, possibly but less likely, the rindex) function.

But I suspect that's not really what you want. The problem, though, is that you did not state what you want. I would suspect you probably want to use a regex to find a sub-string that you don't know exactly in advance, but that matches certain criteria or a pattern. If this is the case, please provide more information about what you know about that sub-string.
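For readers unfamiliar with index and rindex, here is a small sketch of how they report a position. `first_and_last` is a hypothetical helper introduced only for illustration; the sample string is the one from the original question.

```perl
use strict;
use warnings;

# Hypothetical helper: return the offsets of the first and last occurrence
# of $find in $str, or an empty list if $find does not occur at all.
sub first_and_last {
    my ( $str, $find ) = @_;
    my $first = index $str, $find;    # -1 when not found
    return if $first == -1;
    my $last = rindex $str, $find;    # same search, starting from the end
    return ( $first, $last );
}

my $str = q{ksguhdipghisosipghthispartudirlhgdr};
if ( my ( $first, $last ) = first_and_last( $str, q{thispart} ) ) {
    print "first at offset $first, last at offset $last\n";
}
else {
    print "not found\n";
}
```

Both offsets are the same here because the substring occurs only once.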
Re: Get a known substring from a string by pryrt (Abbot) on Sep 09, 2016 at 18:37 UTC
"I have tried some basic regex but all I seem to be able to do is remove "thispart" and end up with the stuff I don't want."

If you show us the regex that removes "thispart", there are probably lots of Monks here who would be able to teach you how to edit it to make it keep just "thispart".

And, as BrowserUk said, supplying a small amount of actual data, along with what you expect the output to be given that data, will help the Monks customize their responses to your actual input rather than the alpha-only data you've shown.

As a hint while you're reading about perlre: if you create a matching group in your regex, the magic `$1` variable will hold the contents of the first matching group; you can use that with a `m//` to just extract your ID (assuming the "thispart" is really a partial regex that matches what you're looking for), or you could use it with `s///` to remove either what matches or what doesn't match.
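The hint about `$1` can be sketched as follows. This is only an illustration under assumptions: the real ID pattern was never given, so the literal text "thispart" stands in for it (quoted with `\Q...\E`), and the helper names `extract_id`, `keep_only_id` and `strip_id` are hypothetical.

```perl
use strict;
use warnings;

# m// with a capture group: $1 holds what the parentheses matched.
sub extract_id {
    my ( $str, $want ) = @_;
    return $str =~ m/(\Q$want\E)/ ? $1 : undef;
}

# s/// that removes what does NOT match, by replacing the whole string
# with the captured part. If nothing matches, the copy is left unchanged.
sub keep_only_id {
    my ( $str, $want ) = @_;
    ( my $copy = $str ) =~ s/^.*?(\Q$want\E).*$/$1/s;
    return $copy;
}

# s/// that removes what DOES match (the first occurrence only).
sub strip_id {
    my ( $str, $want ) = @_;
    ( my $copy = $str ) =~ s/\Q$want\E//;
    return $copy;
}

my $str = q{ksguhdipghisosipghthispartudirlhgdr};
print extract_id( $str, q{thispart} ),   "\n";
print keep_only_id( $str, q{thispart} ), "\n";
print strip_id( $str, q{thispart} ),     "\n";
```

The last call reproduces what the original poster described: the wanted text is removed and the rest is left behind.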
Re: Get a known substring from a string by shmem (Chancellor) on Sep 09, 2016 at 21:42 UTC
This sounds like an XY Problem to me, and I suspect that you want to do Z, of which you haven't told us anything.

perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
Re: Get a known substring from a string by BrowserUk (Patriarch) on Sep 09, 2016 at 16:38 UTC
If you know it's there and you know what it is, why look? You already have it.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
In the absence of evidence, opinion is indistinguishable from prejudice.
Re^2: Get a known substring from a string by jake7176 (Novice) on Sep 09, 2016 at 16:44 UTC
So at the moment, I get a whole load of ID numbers come in from different sources, but for some reason, they aren't spaced apart. So when I am trying to get that ID number, I can't use it because it has all of the other numbers attached. So I was going to strip everything else, then run my script with the ID that I have left.
Re^3: Get a known substring from a string by BrowserUk (Patriarch) on Sep 09, 2016 at 17:10 UTC
If you describe your actual problem, with examples of the real data, the real ID number you need to find and what you know about the number you need to extract, you might get some useful answers.

Your current description implies that you already have the ID; otherwise how are you searching for it?
Re: Get a known substring from a string by Anonymous Monk on Sep 09, 2016 at 23:05 UTC
Just do this: `if ($str =~ m/(thispart)/) { $found = $1; }`

`$1` is set to the part of the string that matches the portion of the regex between the parentheses.
Re^2: Get a known substring from a string by BrowserUk (Patriarch) on Sep 09, 2016 at 23:18 UTC
That is equivalent to `$found = 'thispart' if 1+index( $str, 'thispart' );`, but index is 3 times faster:

```
$s='the quick brown fox jumps over the lazy dog';
cmpthese -1,{
    a => q[ if( $s =~ m[(lazy)] ){ $found=$1 } ],
    b => q[ $found = 'lazy' if 1+index( $s, 'lazy' ); ],
};;
       Rate    a    b
a  585631/s   -- -77%
b 2535746/s 333%   --
```
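The reason for the `1+` in that idiom is that index returns -1 (which is true in Perl) when nothing is found and 0 (which is false) when the substring sits at the very start. Adding 1 maps "not found" to 0 and every real offset to a true value. A short sketch, with `contains` as a hypothetical helper name:

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the 1+index idiom as a presence test.
sub contains {
    my ( $haystack, $needle ) = @_;
    return 1 + index( $haystack, $needle ) ? 1 : 0;
}

my $s = q{lazy dog};

# index returns 0 here, so testing it directly would wrongly say "absent".
print "bare index:  ", ( index( $s, q{lazy} ) ? "present" : "absent" ), "\n";

# 1+index is 1 here and 0 only when index returned -1.
print "1+index:     ", ( contains( $s, q{lazy} ) ? "present" : "absent" ), "\n";
print "1+index (x): ", ( contains( $s, q{cat} )  ? "present" : "absent" ), "\n";
```

Comparing against -1 explicitly (`index(...) != -1` or `index(...) >= 0`) is an equivalent and arguably more self-documenting test.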
Re^3: Get a known substring from a string by Corion (Patriarch) on Sep 10, 2016 at 09:42 UTC
I know it looks really trivial once you see it, but I'm really astonished by your approach of using `1+index(...)` - it had not occurred to me to use `index` that way in an expression to check for presence. I'll add that to my set of idiosyncratic phrases, just like `if( system(...) == 0 ) {` for successful execution of subprocesses.

Update: I wondered about how much the capturing parentheses cost, and it seems they account for roughly ~~a third~~ half of the performance attainable when using the regex engine. Maybe the two additional steps executed in the regex engine (`OPEN1` and `CLOSE1`) are to blame for that, as they effectively double the number of steps the regex engine has to execute for a successful match. Not invoking the regex engine still is much faster, even though I had thought there once was an optimization that turned constant regular expressions without anchors or quantifiers into an `index` lookup...

```
# a: if( $s =~ m[(lazy)] ){ $found=$1 }
Compiling REx "(lazy)"
Final program:
   1: OPEN1 (3)
   3:   EXACT <lazy> (5)
   5: CLOSE1 (7)
   7: END (0)
anchored "lazy" at 0 (checking anchored) minlen 4
Matching REx "(lazy)" against "the quick brown fox jumps over the lazy dog"
Intuit: trying to determine minimum start position...
  Found anchored substr "lazy" at offset 35...
  (multiline anchor test skipped)
  try at offset...
Intuit: Successfully guessed: match at offset 35
  35 < the > <lazy dog>     |  1:OPEN1(3)
  35 < the > <lazy dog>     |  3:EXACT <lazy>(5)
  39 <the lazy> < dog>      |  5:CLOSE1(7)
  39 <the lazy> < dog>      |  7:END(0)
Match successful!
Freeing REx: "(lazy)"
# b: $found = 'lazy' if 1+index( $s, 'lazy' );
# c: if( $s =~ m[lazy] ){ $found=$& }
Compiling REx "lazy"
Final program:
   1: EXACT <lazy> (3)
   3: END (0)
anchored "lazy" at 0 (checking anchored isall) minlen 4
Matching REx "lazy" against "the quick brown fox jumps over the lazy dog"
Intuit: trying to determine minimum start position...
  Found anchored substr "lazy" at offset 35...
  (multiline anchor test skipped)
  try at offset...
Intuit: Successfully guessed: match at offset 35
Freeing REx: "lazy"
        Rate    a    c    b
a 2038631/s   -- -50% -75%
c 4089154/s 101%   -- -49%
b 8013601/s 293%  96%   --
```

The program I used:

```
use strict;
use Benchmark 'cmpthese';
use vars '$s';

$s='the quick brown fox jumps over the lazy dog';
my $found;
my %benchmarks = (
    a => q[ if( $s =~ m[(lazy)] ){ $found=$1 } ],
    b => q[ $found = 'lazy' if 1+index( $s, 'lazy' ); ],
    c => q[ if( $s =~ m[lazy] ){ $found=$& } ],
);

{
    use re 'debug';
    for (sort keys %benchmarks) {
        print "# $_: $benchmarks{$_}\n";
        undef $found;
        my $code = eval qq{sub { $benchmarks{$_} } }
            or die "Couldn't compile benchmark $_: $@";
        $code->();
        $found eq 'lazy'
            or die "Unexpected results: [$found] vs. 'lazy'";
    };
};

cmpthese( -1, \%benchmarks);
```
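As a side note to variant c above, the matched text can also be recovered without capturing parentheses and without `$&`, by using the match offset arrays `@-` and `@+` from perlvar. This is a sketch only, it was not benchmarked in the thread, and `matched_text` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text matched by $regex in $str,
# using the start ($-[0]) and end ($+[0]) offsets of the whole match.
sub matched_text {
    my ( $str, $regex ) = @_;
    return unless $str =~ $regex;
    return substr( $str, $-[0], $+[0] - $-[0] );
}

my $s = q{the quick brown fox jumps over the lazy dog};
my $found = matched_text( $s, qr/lazy/ );
print defined $found ? "found '$found'\n" : "not found\n";
```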
Re^4: Get a known substring from a string by BrowserUk (Patriarch) on Sep 10, 2016 at 11:40 UTC
Re^5: Get a known substring from a string by flowdy (Scribe) on Sep 13, 2016 at 07:34 UTC
Some notes below your chosen depth have not been shown here
Re: Get a known substring from a string by Krambambuli (Curate) on Sep 13, 2016 at 09:51 UTC
Just wondering if the number of occurrences would be of any importance... And, if yes, if thisparthispart would be considered to be one or two occurrences... ;)
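The question of one or two occurrences comes down to whether overlapping matches count. A minimal sketch using index with a start position illustrates the difference; `count_occurrences` and its `$allow_overlap` flag are hypothetical names, and an empty search string is simply treated as zero occurrences.

```perl
use strict;
use warnings;

# Hypothetical helper: count occurrences of $needle in $haystack.
# With $allow_overlap true, the search resumes one character after each hit;
# otherwise it resumes after the end of the hit.
sub count_occurrences {
    my ( $haystack, $needle, $allow_overlap ) = @_;
    return 0 if $needle eq q{};
    my $step  = $allow_overlap ? 1 : length $needle;
    my $count = 0;
    my $posn  = index $haystack, $needle;
    while ( $posn != -1 ) {
        $count++;
        $posn = index $haystack, $needle, $posn + $step;
    }
    return $count;
}

my $str = q{thisparthispart};
printf "overlapping: %d, non-overlapping: %d\n",
    count_occurrences( $str, q{thispart}, 1 ),
    count_occurrences( $str, q{thispart}, 0 );
```

For thisparthispart the two copies share the middle "t", so the overlapping count is 2 and the non-overlapping count is 1.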
