Dente has asked for the wisdom of the Perl Monks concerning the following question:
I need to know if there is a efficient way to differentiate keyword search results between 2 flatdbs. I have been blessed to have jarich help split filehandles and assign them to arrays to hold data until needed later. However, earlier on I find I need to differentiate search results.
These are my placeholders
$main_template =~ s/%%keywords%%/$fields{'keywords'}/g;
$main_template =~ s/%%searcresults%%/$pitem/g;
$main_template =~ s/%%keywords%%/$fields{'keywords'}/g;
$main_template =~ s/%%premiumlistings%%/$premiumitem/g;
I need to differentiate between ~ s/%%keywords%%/$fields{'keywords'}/g; for the %%searcresult%% and %%premiumlistings%%
$fields{'keywords'} = &get_search_ready($fields{'keywords'});
if ($fields{'keywords'} eq ""){$fields{'keywords'} = "NOTHING";}
if (length($fields{'keywords'}) < $minimum_keyword_length){$fields{'ke
+ywords'} = "2_CHARS";}
if (($use_mysql eq "Yes") and ($fields{'keywords'} ne "NOTHING") and (
+$fields{'keywords'} ne "2_CHARS"))
{
&mysql_search;
}
elsif (($fields{'keywords'} ne "NOTHING") and ($fields{'keywords'}
+ ne "2_CHARS"))
{
&normal_search;
}
if (($fields{'keywords'} eq "NOTHING") or ($fields{'keywords'} eq "2_C
+HARS"))
{
$fields{'keywords'} = "";
}
#### RELEVANCE
$rspcntr = 0;
############ CAPTURES ALL KEYWORD INPUT ###############
(@skeyw) = split(/ /,$fields{'keywords'});
########### ADDED 04/29/04 ###################
$nrkeywords = push(@skeyw);
foreach $kwrel (@resultline)
###########################################
{
$keycount = 0;
$kwcntr = 1000;
while ($keycount < $nrkeywords)
{
#### TITLE
(@litems) = split(/\t/,$kwrel);
if ($litems[1] =~ /$skeyw[$keycount]/i)
{
$kwcntr = $kwcntr - 5;
}
#### DESCRIPTION
if ($litems[2] =~ /$skeyw[$keycount]/i)
{
$kwcntr = $kwcntr - 1;
}
#### KEYWORDS
if ($litems[5] =~ /$skeyw[$keycount]/i)
{
$kwcntr = $kwcntr - 4;
}
##### GENERAL
if (($kwrel =~ /$skeyw[$keycount]/i) and ($skeyw[$keycount]
+ne ""))
{
$kwcntr = $kwcntr - 1;
}
##### PHRASE
if (($kwrel =~ /$fields{'keywords'}/i) and ($nrkeywords > 1)
+)
{
$kwcntr = $kwcntr - 10;
}
##### AND BOOLEAN MATCHING
if ($nrkeywords == 2)
{
if (($kwrel =~ /$skeyw[0]/i) and ($kwrel =~ /$skeyw[1]
+/i))
{$kwcntr = $kwcntr - 20;}
}
elsif ($nrkeywords == 3)
{
if (($kwrel =~ /$skeyw[0]/i) and ($kwrel =~ /$skeyw[1]
+/i) and ($kwrel =~ /$skeyw[2]/i))
{$kwcntr = $kwcntr - 30;}
}
elsif ($nrkeywords == 4)
{
if (($kwrel =~ /$skeyw[0]/i) and ($kwrel =~ /$skeyw[1]
+/i) and ($kwrel =~ /$skeyw[2]/i) and ($kwrel =~ /$skeyw[3]/i))
{$kwcntr = $kwcntr - 40;}
}
$keycount++;
}
if (length($kwcntr) == 3){$kwcntr = "0" . $kwcntr;}
if (length($kwcntr) == 2){$kwcntr = "00" . $kwcntr;}
if (length($kwcntr) == 1){$kwcntr = "000" . $kwcntr;}
$resultline[$rspcntr] = $kwcntr . "-_:" . $kwrel;
If not, is there a way to add multiple names (inferences) to the search input text field so I may separate 'keywords' from 'keywords2' to differentiate the %%searchresults%% placeholder from the %%premiumlistings%% placeholder doring the same search process?
Example: http://www.urhosted.com/mall/searchres.html - template before populated
then search for "store"
Re: Separating multiple keyword search input
by tkil (Monk) on Apr 30, 2004 at 05:50 UTC
|
Having seen a couple of your posts, I'd like to make
a suggestion before I get to my main response: you will
have a lot more luck getting help if you would
take the time to reduce your problem to simplest possible
terms. For most questions related to language usage,
it is rare that you can't illustrate the problem in less
than 20 lines.
Think about it: the helpful monks have only so much time
to offer. Given the choice between a small, succinct
tidbit and a long, incoherent,
out-of-context, "I'll just dump some code into a question
and someone will be able to fully grasp it and tell me
what is going wrong" question... which do you think
they'll answer?
There are
resources
that describe how to ask a question so that
it will get answered; consider availing yourself of
them.
Anyway. Rant concluded.
So far as I can infer from your other postings, you have
two sets of data ("normal" and "premium"); you want to
accept keywords off a web form, search through those
sets for those keywords, then display the results
(keeping the results from the two sets distinct).
It further looks like that data might be large and
stored in a MySQL database. The data is also apparently
structured into tab-separated fields of: unknown, title,
description, unknown, unknown, keywords.
Without knowing your entire system (and I don't want
to learn it -- I just want you to think about
it), we can start biting off chunks. The two phases
of processing you show above are interpreting form
parameters, then searching through some result lines
(which it's not clear where you got them from.)
Finding matches from a file.
# find_hits $file_name, @keywords;
#
# Returns a list of "score: line" strinsg. Example:
#
# If file "foo.txt" contains:
#
# Larry Wall, Programming Perl, Reference, x, y, perl
# Peter Scott, Perl Medic, Legacy herding, z, a, perl
#
# A call to this function might look like this:
#
# my @hits = find_hits "foo.txt", "medic";
#
# And it would return a one-element list:
#
# '32: Peter Scot...a, perl'
sub find_hits( $ @ )
{
my ( $file, @keywords ) = @_;
# compose a regex for quick rejection of non-matching lines:
my $any_keyword_re = join '|', map quotemeta $_, @keywords;
$any_keyword_re = qr/$any_keyword_re/i;
# and a keyword for the whole phrase
my $phrase_re = join '\W+', @keywords;
$phrase_re = qr/$phrase_re/i;
# open input file for read
open my $in, $file or croak "opening $file for read: $!";
my @rv; # for accumulating return values
while ( <$in> )
{
# reject lines with no matches out of hand
next unless m/$any_keyword_re/;
# any match at all is one point.
my $score = 1;
# split into fields for further scoring.
my ( undef, $title, $desc, undef, undef, $keys ) =
split /\t/;
# title matches are worth 5 points each
while ( $title =~ m/$any_keyword_re/g ) { $score += 5 }
# description matches are only 1 point
while ( $desc =~ m/$any_keyword_re/g ) { $score += 1 }
# keyword matches are 4 points
while ( $keys =~ m/$any_keyword_re/g ) { $score += 4 }
# phrase matches (against entire line) are 10 points
while ( m/$phrase_re/g ) { $score += 10 }
# multiple matches are worth 10x the number
# of keywords that matched.
my $n_matches = () = m/$any_keyword_re/g; # see perlfaq4
if ( $n_matches > 1 ) { $score += 10*$n_matches }
# finally, format $score and save for returning
# to the caller
push @rv, sprintf "%03d: %s", $score, $_;
}
return @rv;
}
Ok, so now we address your other main issue, that of
doing a substitution in your template. Since we can
have a variable number of responses to each, but you
only have one template variable, I'll assume that we
can wedge all our answers into that one spot. We'll
do this by joining together all the hits we found.
# validate our keywords.
unless ( defined $fields{keywords} &&
fields{keywords} ne '' )
{
# complain, bail out of this run.
}
my @keywords = split ' ', $fields{keywords};
# find actual matches in highest- to lowest-score order
my @normal_hits = reverse sort find_hits "normal_list.txt", @keywords
+;
my @premium_hits = reverse sort find_hits "premium_list.txt", @keyword
+s;
# keep lists reasonable
if ( @normal_hits > 100 ) { splice @normal_hits, 100 }
if ( @premium_hits > 100 ) { splice @premium_hits, 100 }
# now join them together for presentation:
my $normal_hits = join '<br />', @normal_hits;
my $premium_hits = join '<br />', @premium_hits;
# finally, do the now-obvious substitution:
$template =~ s/%%normalresults%%/$normal_hits/;
$template =~ s/%%premiumresults%%/$premium_hits/;
Hopefully this gets you closer. There are all sorts of
places that this code might need tweaking: if there are
possibly a huge number of hits, you can't store them all
in memory, so you'll need to keep track of just the top
100 (or whatever) the whole way. Handling special
characters in the return strings for printing to HTML.
How to handle more advanced CSV files (e.g. double quotes
protecting a comma that is in a field.)
None of these are insurmountable, but until and unless
you develop the skill to break down problems into simpler,
more digestable chunks (both for asking questions and
for writing the solution in the first place), no amount
of cookbookery will help you. Good luck.
| [reply] [d/l] [select] |
|
This definetly gets me closer to the solution. I apologize for submitting code from mid point as I thought that is where my problem was initially, however, upon further disection, I feel the problem needs to be addressed early on. Here is code I thought would have worked for me, kepping in mind that I want a site search engine returning the result-like appearance of google.
open (SIDX, "$data_dir/search.idx");
open (SIDX2, "$data_dir/search2.idx");
my @sidx = <SIDX>;
my @sidx2 = <SIDX2>;
@skeyw = grep {$_ ne ""} @skeyw;
@premiumkeyw = grep {$_ ne ""} @premiumkeyw;
my $regexp = join('|', @skeyw, @premiumkeyw);
$regexp = qr/$regexp/;
(@skeyw) = split(/ /,$fields{'keywords'}); # SPLITTING SIDX FILE
(@premiumskeyw) = split(/ /,$fields{'keywords'}); ## SPLITTING SIDX2 F
+ILE
$nrkeywords = push(@skeyw);
$premiumnrkeywords = push(@premiumskeyw);
| [reply] [d/l] |
|
First, apologies for taking so long to get back to you
on this. (And it took even longer, as my browser decided
to nose dive after I wrote up this response the first
time. Sigh.)
Looking at this code, though, I again have to say that
it is almost a complete non-sequitor to me. I think I
understand your overall goals, but your actual program
construction seems to be very haphazard. I can't figure
out your larger issues without grasping your entire
system (which I'm not equipped to do!), but here are a
few comments on your "programming in the small":
open (SIDX, "$data_dir/search.idx");
open (SIDX2, "$data_dir/search2.idx");
my @sidx = <SIDX>;
my @sidx2 = <SIDX2>;
Assuming that these are what you're actually searching
through, you probably don't want to load them into
memory up front like this. Build up your comparison /
scoring function, then apply it to each line of input,
keeping only those that match to a certain level. (This
all depends on what portion of the index you expect to
return; in a typical situation, I would assume that only
a small part of the index is relevant to any given query,
so I would want to have only those values in memory.)
(@skeyw) = split(/ /,$fields{'keywords'}); # SPLITTING SIDX FILE
(@premiumskeyw) = split(/ /,$fields{'keywords'}); ## SPLITTING SIDX2 F
+ILE
At the very least, the comments are grossly inaccurate.
Also, you're splitting the same value into two different
arrays. Finally, you are using a pattern which can
result in the null strings that you later have to
filter out; a better split pattern can fix that. The
most trivial change is that simple hash keys don't need
to be in quotes. All together, we can just say:
my @keywords = split ' ', $fields{keywords};
The use of ' ' has special meaning to
split: break it up on one or more whitespace
characters, and ignore leading whitespace as well. This
should guarantee that you have no null keywords in the
array, making your checks unnecessary.
(Side note: don't skimp on variable names. Use an editor
that does expansions if you have to, but I personally
find @keywords more evocative and
self-descriptive than @skeyw. Same
comment applies to filehandle names, and pretty much
every other name in your program...)
$nrkeywords = push(@skeyw);
$premiumnrkeywords = push(@premiumskeyw);
Assuming you just want to get the number of elements
in those arrays (which, as pointed out above, are the
same array, so you don't need to do it twice!), you
can simply evaluate the array in scalar context. The
slightly long way of saying that is:
my $n_keywords = scalar @keywords;
I say this is the long way, because the
scalar is unnecessary. You're asasigning
something into a scalar, and that puts that thing
into a scalar context. So, we could have written:
my $n_keywords = @keywords;
Either way, you certainly don't need push.
Finally, now that you know there is only one set of
keywords, you can construct the regex out of them. One
point to consider is whether thes keywords are themselves
regexes, or if they are just plain text. If the latter,
we want to make sure we neutralize any characters that
are special to the regex engine. The built-in function
quotemeta is just the ticket. Putting it
all together, we have:
my $regex = join '|', map { quotemeta $_ } @keywords;
$regex = qr/$regex/;
I addressed this in more detail in my original response,
but now you can basically do:
my @std_hits = grep { m/$regex/ } <STD_IDX>;
my @prem_hits = grep { m/$regex/ } <PREM_IDX>;
Hopefully this has helped you further. I do worry that
there are archetectural issues that I am unable to
address, and untill you get the chance to rethink what
you are doing (as opposed to just making random changes
and hoping for the best), you're not going to have much
luck.
Regardless, I hope it works out for you.
| [reply] [d/l] [select] |
Re: Separating multiple keyword search input
by davido (Cardinal) on Apr 30, 2004 at 03:00 UTC
|
The following code isn't intended to be a cut-n-paste solution, but rather a demonstration of a technique you can use to search for multiple keywords and do replacements based on those keywords.
my %replacements = qw/men women
houses homes
cats dogs
venus mars
carpet rug
full empty/;
my $string = "men cats houses venus carpet full";
for my $keyword ( keys %replacements ) {
my $re = qr/\b$keyword\b/;
$string =~ s/$re/$replacements{$keyword}/gi;
}
print "$string\n";
__OUTPUT__
women dogs homes mars rugs empty
Hope this helps. There are other similar idioms, and other different idioms that accomplish the same thing, but this one seems to be more or less what you're asking for.
| [reply] [d/l] |
Re: Separating multiple keyword search input
by bart (Canon) on Apr 30, 2004 at 19:33 UTC
|
$main_template =~ s/%%keywords%%/$fields{'keywords'}/g;
$main_template =~ s/%%searcresults%%/$pitem/g;
$main_template =~ s/%%keywords%%/$fields{'keywords'}/g;
$main_template =~ s/%%premiumlistings%%/$premiumitem/g;
Don't do several substitutions one after another, you could end up replacing what you replaced, again. For example:
$_ = 'taxial';
s/taxi/cab/g;
s/cabal/gang/g;
print;
prints "gang".
Better would be to replace both at once:
$_ = 'taxial';
my %subst = ( 'taxi' => 'cab', 'cabal' => 'gang' );
local $" = "|";
my @keys = map quotemeta, sort { length $b <=> length $a } keys %sub
+st;
my $regexp = qr(@keys);
s/($regexp)/$subst{$1}/g;
print;
which prints "cabal". Once the first substitution went past "taxi", replacing it with "cab", we've moved too far to replace it again in the second substitution. Usually, this is exactly what we want.
I like Regex::PreSuf (abbreviated from prefix/suffix) for this kind of solution. It combines several "words" into one regexp, as alternatives, and takes care of quotemeta as well. The code is shorter, and may likely become faster, too. That's the idea, though it depends on the words in the list.
$_ = 'taxial';
my %subst = ( 'taxi' => 'cab', 'cabal' => 'gang' );
use Regex::PreSuf;
my $regexp = presuf(keys %subst);
s/($regexp)/$subst{$1}/g;
print;
Applied to your code, it becomes:
my %subst = ( %fields, searcresults => $pitem,
premiumlistings => $premiumitem );
use Regex::PreSuf;
my $regexp = presuf(keys %subst);
$main_template =~ s/%%($regexp)%%/$subst{$1}/go;
BTW Using the /o modifier, the regexp is compiled only once, the first time a match is tried. Even though you can later change the values in the substitution hash, adding more keys will not do any good.
| [reply] [d/l] [select] |
|
|