Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation
 
PerlMonks  

Separating multiple keyword search input

by Dente (Acolyte)
on Apr 30, 2004 at 01:35 UTC ( [id://349311]=perlquestion: print w/replies, xml ) Need Help??

Dente has asked for the wisdom of the Perl Monks concerning the following question:

I need to know if there is a efficient way to differentiate keyword search results between 2 flatdbs. I have been blessed to have jarich help split filehandles and assign them to arrays to hold data until needed later. However, earlier on I find I need to differentiate search results.

These are my placeholders

$main_template =~ s/%%keywords%%/$fields{'keywords'}/g; $main_template =~ s/%%searcresults%%/$pitem/g; $main_template =~ s/%%keywords%%/$fields{'keywords'}/g; $main_template =~ s/%%premiumlistings%%/$premiumitem/g;
I need to differentiate between ~ s/%%keywords%%/$fields{'keywords'}/g; for the %%searcresult%% and %%premiumlistings%%
$fields{'keywords'} = &get_search_ready($fields{'keywords'}); if ($fields{'keywords'} eq ""){$fields{'keywords'} = "NOTHING";} if (length($fields{'keywords'}) < $minimum_keyword_length){$fields{'ke +ywords'} = "2_CHARS";} if (($use_mysql eq "Yes") and ($fields{'keywords'} ne "NOTHING") and ( +$fields{'keywords'} ne "2_CHARS")) { &mysql_search; } elsif (($fields{'keywords'} ne "NOTHING") and ($fields{'keywords'} + ne "2_CHARS")) { &normal_search; } if (($fields{'keywords'} eq "NOTHING") or ($fields{'keywords'} eq "2_C +HARS")) { $fields{'keywords'} = ""; } #### RELEVANCE $rspcntr = 0; ############ CAPTURES ALL KEYWORD INPUT ############### (@skeyw) = split(/ /,$fields{'keywords'}); ########### ADDED 04/29/04 ################### $nrkeywords = push(@skeyw); foreach $kwrel (@resultline) ########################################### { $keycount = 0; $kwcntr = 1000; while ($keycount < $nrkeywords) { #### TITLE (@litems) = split(/\t/,$kwrel); if ($litems[1] =~ /$skeyw[$keycount]/i) { $kwcntr = $kwcntr - 5; } #### DESCRIPTION if ($litems[2] =~ /$skeyw[$keycount]/i) { $kwcntr = $kwcntr - 1; } #### KEYWORDS if ($litems[5] =~ /$skeyw[$keycount]/i) { $kwcntr = $kwcntr - 4; } ##### GENERAL if (($kwrel =~ /$skeyw[$keycount]/i) and ($skeyw[$keycount] +ne "")) { $kwcntr = $kwcntr - 1; } ##### PHRASE if (($kwrel =~ /$fields{'keywords'}/i) and ($nrkeywords > 1) +) { $kwcntr = $kwcntr - 10; } ##### AND BOOLEAN MATCHING if ($nrkeywords == 2) { if (($kwrel =~ /$skeyw[0]/i) and ($kwrel =~ /$skeyw[1] +/i)) {$kwcntr = $kwcntr - 20;} } elsif ($nrkeywords == 3) { if (($kwrel =~ /$skeyw[0]/i) and ($kwrel =~ /$skeyw[1] +/i) and ($kwrel =~ /$skeyw[2]/i)) {$kwcntr = $kwcntr - 30;} } elsif ($nrkeywords == 4) { if (($kwrel =~ /$skeyw[0]/i) and ($kwrel =~ /$skeyw[1] +/i) and ($kwrel =~ /$skeyw[2]/i) and ($kwrel =~ /$skeyw[3]/i)) {$kwcntr = $kwcntr - 40;} } $keycount++; } if (length($kwcntr) == 3){$kwcntr = "0" . $kwcntr;} if (length($kwcntr) == 2){$kwcntr = "00" . $kwcntr;} if (length($kwcntr) == 1){$kwcntr = "000" . $kwcntr;} $resultline[$rspcntr] = $kwcntr . "-_:" . $kwrel;

If not, is there a way to add multiple names (inferences) to the search input text field so I may separate 'keywords' from 'keywords2' to differentiate the %%searchresults%% placeholder from the %%premiumlistings%% placeholder doring the same search process?

Example: http://www.urhosted.com/mall/searchres.html - template before populated

then search for "store"

Replies are listed 'Best First'.
Re: Separating multiple keyword search input
by tkil (Monk) on Apr 30, 2004 at 05:50 UTC

    Having seen a couple of your posts, I'd like to make a suggestion before I get to my main response: you will have a lot more luck getting help if you would take the time to reduce your problem to simplest possible terms. For most questions related to language usage, it is rare that you can't illustrate the problem in less than 20 lines.

    Think about it: the helpful monks have only so much time to offer. Given the choice between a small, succinct tidbit and a long, incoherent, out-of-context, "I'll just dump some code into a question and someone will be able to fully grasp it and tell me what is going wrong" question... which do you think they'll answer?

    There are resources that describe how to ask a question so that it will get answered; consider availing yourself of them.

    Anyway. Rant concluded.

    So far as I can infer from your other postings, you have two sets of data ("normal" and "premium"); you want to accept keywords off a web form, search through those sets for those keywords, then display the results (keeping the results from the two sets distinct).

    It further looks like that data might be large and stored in a MySQL database. The data is also apparently structured into tab-separated fields of: unknown, title, description, unknown, unknown, keywords.

    Without knowing your entire system (and I don't want to learn it -- I just want you to think about it), we can start biting off chunks. The two phases of processing you show above are interpreting form parameters, then searching through some result lines (which it's not clear where you got them from.)

    Finding matches from a file.

    # find_hits $file_name, @keywords; # # Returns a list of "score: line" strinsg. Example: # # If file "foo.txt" contains: # # Larry Wall, Programming Perl, Reference, x, y, perl # Peter Scott, Perl Medic, Legacy herding, z, a, perl # # A call to this function might look like this: # # my @hits = find_hits "foo.txt", "medic"; # # And it would return a one-element list: # # '32: Peter Scot...a, perl' sub find_hits( $ @ ) { my ( $file, @keywords ) = @_; # compose a regex for quick rejection of non-matching lines: my $any_keyword_re = join '|', map quotemeta $_, @keywords; $any_keyword_re = qr/$any_keyword_re/i; # and a keyword for the whole phrase my $phrase_re = join '\W+', @keywords; $phrase_re = qr/$phrase_re/i; # open input file for read open my $in, $file or croak "opening $file for read: $!"; my @rv; # for accumulating return values while ( <$in> ) { # reject lines with no matches out of hand next unless m/$any_keyword_re/; # any match at all is one point. my $score = 1; # split into fields for further scoring. my ( undef, $title, $desc, undef, undef, $keys ) = split /\t/; # title matches are worth 5 points each while ( $title =~ m/$any_keyword_re/g ) { $score += 5 } # description matches are only 1 point while ( $desc =~ m/$any_keyword_re/g ) { $score += 1 } # keyword matches are 4 points while ( $keys =~ m/$any_keyword_re/g ) { $score += 4 } # phrase matches (against entire line) are 10 points while ( m/$phrase_re/g ) { $score += 10 } # multiple matches are worth 10x the number # of keywords that matched. my $n_matches = () = m/$any_keyword_re/g; # see perlfaq4 if ( $n_matches > 1 ) { $score += 10*$n_matches } # finally, format $score and save for returning # to the caller push @rv, sprintf "%03d: %s", $score, $_; } return @rv; }

    Ok, so now we address your other main issue, that of doing a substitution in your template. Since we can have a variable number of responses to each, but you only have one template variable, I'll assume that we can wedge all our answers into that one spot. We'll do this by joining together all the hits we found.

    # validate our keywords. unless ( defined $fields{keywords} && fields{keywords} ne '' ) { # complain, bail out of this run. } my @keywords = split ' ', $fields{keywords}; # find actual matches in highest- to lowest-score order my @normal_hits = reverse sort find_hits "normal_list.txt", @keywords +; my @premium_hits = reverse sort find_hits "premium_list.txt", @keyword +s; # keep lists reasonable if ( @normal_hits > 100 ) { splice @normal_hits, 100 } if ( @premium_hits > 100 ) { splice @premium_hits, 100 } # now join them together for presentation: my $normal_hits = join '<br />', @normal_hits; my $premium_hits = join '<br />', @premium_hits; # finally, do the now-obvious substitution: $template =~ s/%%normalresults%%/$normal_hits/; $template =~ s/%%premiumresults%%/$premium_hits/;

    Hopefully this gets you closer. There are all sorts of places that this code might need tweaking: if there are possibly a huge number of hits, you can't store them all in memory, so you'll need to keep track of just the top 100 (or whatever) the whole way. Handling special characters in the return strings for printing to HTML. How to handle more advanced CSV files (e.g. double quotes protecting a comma that is in a field.)

    None of these are insurmountable, but until and unless you develop the skill to break down problems into simpler, more digestable chunks (both for asking questions and for writing the solution in the first place), no amount of cookbookery will help you. Good luck.

      This definetly gets me closer to the solution. I apologize for submitting code from mid point as I thought that is where my problem was initially, however, upon further disection, I feel the problem needs to be addressed early on. Here is code I thought would have worked for me, kepping in mind that I want a site search engine returning the result-like appearance of google.
      open (SIDX, "$data_dir/search.idx"); open (SIDX2, "$data_dir/search2.idx"); my @sidx = <SIDX>; my @sidx2 = <SIDX2>; @skeyw = grep {$_ ne ""} @skeyw; @premiumkeyw = grep {$_ ne ""} @premiumkeyw; my $regexp = join('|', @skeyw, @premiumkeyw); $regexp = qr/$regexp/; (@skeyw) = split(/ /,$fields{'keywords'}); # SPLITTING SIDX FILE (@premiumskeyw) = split(/ /,$fields{'keywords'}); ## SPLITTING SIDX2 F +ILE $nrkeywords = push(@skeyw); $premiumnrkeywords = push(@premiumskeyw);

        First, apologies for taking so long to get back to you on this. (And it took even longer, as my browser decided to nose dive after I wrote up this response the first time. Sigh.)

        Looking at this code, though, I again have to say that it is almost a complete non-sequitor to me. I think I understand your overall goals, but your actual program construction seems to be very haphazard. I can't figure out your larger issues without grasping your entire system (which I'm not equipped to do!), but here are a few comments on your "programming in the small":

        open (SIDX, "$data_dir/search.idx"); open (SIDX2, "$data_dir/search2.idx"); my @sidx = <SIDX>; my @sidx2 = <SIDX2>;

        Assuming that these are what you're actually searching through, you probably don't want to load them into memory up front like this. Build up your comparison / scoring function, then apply it to each line of input, keeping only those that match to a certain level. (This all depends on what portion of the index you expect to return; in a typical situation, I would assume that only a small part of the index is relevant to any given query, so I would want to have only those values in memory.)

        (@skeyw) = split(/ /,$fields{'keywords'}); # SPLITTING SIDX FILE (@premiumskeyw) = split(/ /,$fields{'keywords'}); ## SPLITTING SIDX2 F +ILE

        At the very least, the comments are grossly inaccurate. Also, you're splitting the same value into two different arrays. Finally, you are using a pattern which can result in the null strings that you later have to filter out; a better split pattern can fix that. The most trivial change is that simple hash keys don't need to be in quotes. All together, we can just say:

        my @keywords = split ' ', $fields{keywords};

        The use of ' ' has special meaning to split: break it up on one or more whitespace characters, and ignore leading whitespace as well. This should guarantee that you have no null keywords in the array, making your checks unnecessary.

        (Side note: don't skimp on variable names. Use an editor that does expansions if you have to, but I personally find @keywords more evocative and self-descriptive than @skeyw. Same comment applies to filehandle names, and pretty much every other name in your program...)

        $nrkeywords = push(@skeyw); $premiumnrkeywords = push(@premiumskeyw);

        Assuming you just want to get the number of elements in those arrays (which, as pointed out above, are the same array, so you don't need to do it twice!), you can simply evaluate the array in scalar context. The slightly long way of saying that is:

        my $n_keywords = scalar @keywords;

        I say this is the long way, because the scalar is unnecessary. You're asasigning something into a scalar, and that puts that thing into a scalar context. So, we could have written:

        my $n_keywords = @keywords;

        Either way, you certainly don't need push.

        Finally, now that you know there is only one set of keywords, you can construct the regex out of them. One point to consider is whether thes keywords are themselves regexes, or if they are just plain text. If the latter, we want to make sure we neutralize any characters that are special to the regex engine. The built-in function quotemeta is just the ticket. Putting it all together, we have:

        my $regex = join '|', map { quotemeta $_ } @keywords; $regex = qr/$regex/;

        I addressed this in more detail in my original response, but now you can basically do:

        my @std_hits = grep { m/$regex/ } <STD_IDX>; my @prem_hits = grep { m/$regex/ } <PREM_IDX>;

        Hopefully this has helped you further. I do worry that there are archetectural issues that I am unable to address, and untill you get the chance to rethink what you are doing (as opposed to just making random changes and hoping for the best), you're not going to have much luck.

        Regardless, I hope it works out for you.

Re: Separating multiple keyword search input
by davido (Cardinal) on Apr 30, 2004 at 03:00 UTC
    The following code isn't intended to be a cut-n-paste solution, but rather a demonstration of a technique you can use to search for multiple keywords and do replacements based on those keywords.

    my %replacements = qw/men women houses homes cats dogs venus mars carpet rug full empty/; my $string = "men cats houses venus carpet full"; for my $keyword ( keys %replacements ) { my $re = qr/\b$keyword\b/; $string =~ s/$re/$replacements{$keyword}/gi; } print "$string\n"; __OUTPUT__ women dogs homes mars rugs empty

    Hope this helps. There are other similar idioms, and other different idioms that accomplish the same thing, but this one seems to be more or less what you're asking for.


    Dave

Re: Separating multiple keyword search input
by bart (Canon) on Apr 30, 2004 at 19:33 UTC
    $main_template =~ s/%%keywords%%/$fields{'keywords'}/g; $main_template =~ s/%%searcresults%%/$pitem/g; $main_template =~ s/%%keywords%%/$fields{'keywords'}/g; $main_template =~ s/%%premiumlistings%%/$premiumitem/g;
    Don't do several substitutions one after another, you could end up replacing what you replaced, again. For example:
    $_ = 'taxial'; s/taxi/cab/g; s/cabal/gang/g; print;
    prints "gang".

    Better would be to replace both at once:

    $_ = 'taxial'; my %subst = ( 'taxi' => 'cab', 'cabal' => 'gang' ); local $" = "|"; my @keys = map quotemeta, sort { length $b <=> length $a } keys %sub +st; my $regexp = qr(@keys); s/($regexp)/$subst{$1}/g; print;
    which prints "cabal". Once the first substitution went past "taxi", replacing it with "cab", we've moved too far to replace it again in the second substitution. Usually, this is exactly what we want.

    I like Regex::PreSuf (abbreviated from prefix/suffix) for this kind of solution. It combines several "words" into one regexp, as alternatives, and takes care of quotemeta as well. The code is shorter, and may likely become faster, too. That's the idea, though it depends on the words in the list.

    $_ = 'taxial'; my %subst = ( 'taxi' => 'cab', 'cabal' => 'gang' ); use Regex::PreSuf; my $regexp = presuf(keys %subst); s/($regexp)/$subst{$1}/g; print;

    Applied to your code, it becomes:

    my %subst = ( %fields, searcresults => $pitem, premiumlistings => $premiumitem ); use Regex::PreSuf; my $regexp = presuf(keys %subst); $main_template =~ s/%%($regexp)%%/$subst{$1}/go;
    BTW Using the /o modifier, the regexp is compiled only once, the first time a match is tried. Even though you can later change the values in the substitution hash, adding more keys will not do any good.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://349311]
Approved by Gerard
Front-paged by Gerard
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others cooling their heels in the Monastery: (2)
As of 2024-04-26 06:06 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found