PerlMonks  

Re: Trimming hash based on element uniqueness

by BrowserUk (Patriarch)
on Jun 17, 2008 at 22:16 UTC [id://692599]


in reply to Trimming hash based on element uniqueness

I think you are approaching the problem from the wrong end.

Trying to determine the similarity of queries by comparing their SQL text is problematic. For example, all of these queries are textually similar, yet they produce quite different result sets:

    select thing from tableA;
    select thing from tableA limit 1000 offset 1000;
    select count( thing ) from tableA;
    select thing from tableB;

Conversely, it is not hard to dream up two or more radically different queries that produce identical result sets. Throw in a few joins and sub-selects and you would likely never determine their similarity.

If your intention is to cache the results of long-running queries and re-use them, I think you are facing an inordinately big task.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^2: Trimming hash based on element uniqueness
by Corion (Patriarch) on Jun 18, 2008 at 06:02 UTC

    While this is true in the general case, in the concrete case, you usually have all queries generated by a fixed program, and most likely these queries differ only in their interpolated values (unless the application programmer had been smart and used placeholders, but then the whole point would be moot). In such a case, it can still make sense to check the queries for similarity by reversing the interpolation process and parsing the statement back into a placeholder query.
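The reversal Corion describes can be sketched in a few lines of Perl: strip the interpolated literals out of each statement so that queries differing only in their values collapse to one placeholder form. The regexes here are a rough heuristic, not a real SQL parser, and the sample queries are made up for illustration:

```perl
use strict;
use warnings;

# Replace quoted strings and bare numbers with '?' placeholders, so
# queries that differ only in interpolated values hash to the same key.
sub normalize_query {
    my ($sql) = @_;
    $sql =~ s/'(?:[^'\\]|\\.)*'/?/g;    # quoted string literals -> ?
    $sql =~ s/\b\d+(?:\.\d+)?\b/?/g;    # numeric literals       -> ?
    $sql =~ s/\s+/ /g;                  # collapse whitespace
    return lc $sql;
}

my %groups;
for my $q (
    "select thing from tableA where id = '42'",
    "select thing from tableA where id = '99'",
    "select thing from tableB where id = '42'",
) {
    push @{ $groups{ normalize_query($q) } }, $q;
}

# The first two queries share one placeholder form; the third differs.
print scalar keys %groups, " distinct query shapes\n";   # 2 distinct query shapes
```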

      Okay, I see what you mean. If there are a few dozen or a few hundred select * from tableX where id = '...'; with only the '...' varying, you might want to reduce them to one generic select and do a local hash lookup instead of issuing a query for each.
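The generic-select-plus-hash idea might look like the following sketch. Here fetch_all() stands in for the single generic query one would issue via DBI; the table contents and names are made up for illustration:

```perl
use strict;
use warnings;

my $queries_issued = 0;

# Stand-in for one generic  select id, name from tableX  via DBI.
sub fetch_all {
    $queries_issued++;
    return ( 1 => 'alpha', 2 => 'beta', 3 => 'gamma' );
}

my %cache;

# Answer individual id lookups from the local hash; the real query
# is issued once, on first use, instead of once per id.
sub name_for {
    my ($id) = @_;
    %cache = fetch_all() unless %cache;
    return $cache{$id};
}

print name_for(2), "\n";                     # beta
print name_for(3), "\n";                     # gamma
print "queries issued: $queries_issued\n";   # queries issued: 1
```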

      Quite frankly, this is one of those situations where I'd print all the selects to a file, one per line, load them into my editor, sort them, and then step through them manually, breaking them into groups. Even with a few thousand to process, doing this manually would probably be far quicker than deriving a heuristic, coding an algorithm and testing it. And it does not seem like a process one would have to repeat frequently enough (or at all for any given application, if you did it right the first time) to justify automating it.

      I suppose if this were an application where the queries issued are derived and constructed from user input, then you might need to repeat the process a few times. Even then, doing a sort & group process manually, and then inspecting each group looking for caching opportunities and a heuristic to recognise each one is probably far simpler than trying to come up with a totally generic mechanism.

      Were I tackling this, in addition to the one-per-line list of the queries, I'd also try to obtain or generate a "sample output" from each one. That is, a representation of the output each select produces in terms of the table/field pairings and the number of rows produced, rather than the actual values:

      (table.field1, table.field3, table.field7) x 60

      It might then be possible to prefix this information to the queries themselves prior to sorting, and use it to aid the grouping. I'd still derive the generic query manually the first few times; and if the caching is successful, the number of long-running queries should be reduced quickly enough that automating the process becomes redundant.
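Generating that signature line is straightforward once you have the column names and a row count. With DBI the names would come from $sth->{NAME} and the count from the fetch loop; in this sketch both are simply passed in:

```perl
use strict;
use warnings;

# Build the "(table.field1, ...) x N" sample-output signature
# described above from a table name, a list of selected columns,
# and the number of rows the select produced.
sub output_signature {
    my ($table, $columns, $rows) = @_;
    return sprintf '(%s) x %d',
        join(', ', map { "$table.$_" } @$columns), $rows;
}

print output_signature('table', [qw(field1 field3 field7)], 60), "\n";
# (table.field1, table.field3, table.field7) x 60
```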

      Probably the hardest part would be recognising, at runtime, which queries can be redirected to the cache; but manually deriving a regex or two to recognise the pattern of queries in each group replaced by a generic query would be fairly easy, I think.
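A runtime redirection along those lines might look like this sketch, with one hand-derived (and purely illustrative) pattern per group; anything the pattern does not recognise would fall through to the real database:

```perl
use strict;
use warnings;

# Illustrative cache for one group of queries of the form
#   select name from tableX where id = '...'
my %cache = ( 42 => 'alpha', 99 => 'beta' );

sub run_query {
    my ($sql) = @_;

    # Hand-derived pattern for this group; the capture is the
    # varying interpolated value, used as the cache key.
    if ($sql =~ /^\s* select \s+ name \s+ from \s+ tableX
                  \s+ where \s+ id \s* = \s* '(\d+)'/xi) {
        return $cache{$1};      # answered from the cache
    }
    return undef;               # unrecognised: go to the real DB
}

print run_query("select name from tableX where id = '42'"), "\n";   # alpha
```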


