PerlMonks  

Re: Trimming hash based on element uniqueness

by BrowserUk (Patriarch)
on Jun 17, 2008 at 22:16 UTC [id://692599]


in reply to Trimming hash based on element uniqueness

I think you are approaching the problem from the wrong end.

Trying to determine the similarity of queries by comparing their SQL text is problematic. For example, all of these queries are textually similar, yet they produce quite different result sets:

    select thing from tableA;
    select thing from tableA limit 1000 offset 1000;
    select count( thing ) from tableA;
    select thing from tableB;

Conversely, it is not hard to dream up two or more radically different queries that produce identical result sets. Throw in a few joins and sub-selects and you would likely never determine their similarity.

If your intention is to cache the results of long-running queries and re-use them, I think you are facing an inordinately big task.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^2: Trimming hash based on element uniqueness
by Corion (Patriarch) on Jun 18, 2008 at 06:02 UTC

    While this is true in the general case, in the concrete case, you usually have all queries generated by a fixed program, and most likely these queries differ only in their interpolated values (unless the application programmer had been smart and used placeholders, but then the whole point would be moot). In such a case, it can still make sense to check the queries for similarity by reversing the interpolation process and parsing the statement back into a placeholder query.
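The reversal Corion describes can be sketched in a few lines of Perl: strip the interpolated literals out of each statement so that queries differing only in their values collapse to one placeholder form. The regexes here are a rough heuristic, not a real SQL parser, and the sample queries are made up for illustration:

```perl
use strict;
use warnings;

# Replace quoted strings and bare numbers with '?' placeholders, so
# queries that differ only in interpolated values hash to the same key.
sub normalize_query {
    my ($sql) = @_;
    $sql =~ s/'(?:[^'\\]|\\.)*'/?/g;    # quoted string literals -> ?
    $sql =~ s/\b\d+(?:\.\d+)?\b/?/g;    # numeric literals       -> ?
    $sql =~ s/\s+/ /g;                  # collapse whitespace
    return lc $sql;
}

my %groups;
for my $q (
    "select thing from tableA where id = '42'",
    "select thing from tableA where id = '99'",
    "select thing from tableB where id = '42'",
) {
    push @{ $groups{ normalize_query($q) } }, $q;
}

# The first two queries share one placeholder form; the third differs.
print scalar keys %groups, " distinct query shapes\n";   # 2 distinct query shapes
```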

      Okay, I see what you mean. If there are a few dozen or a few hundred select * from tableX where id = '...'; with only the '...' varying, you might want to reduce them to one generic select and do a local hash lookup instead of issuing a query for each.
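The generic-select-plus-hash idea might look like the following sketch. Here fetch_all() stands in for the single generic query one would issue via DBI; the table contents and names are made up for illustration:

```perl
use strict;
use warnings;

my $queries_issued = 0;

# Stand-in for one generic  select id, name from tableX  via DBI.
sub fetch_all {
    $queries_issued++;
    return ( 1 => 'alpha', 2 => 'beta', 3 => 'gamma' );
}

my %cache;

# Answer individual id lookups from the local hash; the real query
# is issued once, on first use, instead of once per id.
sub name_for {
    my ($id) = @_;
    %cache = fetch_all() unless %cache;
    return $cache{$id};
}

print name_for(2), "\n";                     # beta
print name_for(3), "\n";                     # gamma
print "queries issued: $queries_issued\n";   # queries issued: 1
```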

      Quite frankly, this is one of those situations where I'd print all the selects to a file, one per line, load them into my editor, sort them, and then step through them manually, breaking them into groups. Even with a few thousand to process, doing this manually would probably be far quicker than deriving a heuristic, coding an algorithm and testing it. And it does not seem like a process one would have to repeat frequently enough (or at all for any given application, if you did it right the first time) to justify automating it.

      I suppose if this were an application where the queries issued are derived and constructed from user input, then you might need to repeat the process a few times. Even then, doing a sort & group process manually, and then inspecting each group looking for caching opportunities and a heuristic to recognise each one is probably far simpler than trying to come up with a totally generic mechanism.

      Were I tackling this, in addition to the one-per-line list of the queries, I'd also try to obtain or generate a "sample output" from each one. That is, a representation of the output each select produces in terms of the table/field pairings and the number of rows produced, rather than the actual values:

      (table.field1, table.field3, table.field7) x 60

      It might then be possible to prefix this information to the queries themselves prior to sorting, and use it to aid the grouping. I'd still derive the generic query manually the first few times; and if the caching is successful, the number of long-running queries should be reduced quickly enough that automating the process becomes redundant.
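Generating that signature line is straightforward once you have the column names and a row count. With DBI the names would come from $sth->{NAME} and the count from the fetch loop; in this sketch both are simply passed in:

```perl
use strict;
use warnings;

# Build the "(table.field1, ...) x N" sample-output signature
# described above from a table name, a list of selected columns,
# and the number of rows the select produced.
sub output_signature {
    my ($table, $columns, $rows) = @_;
    return sprintf '(%s) x %d',
        join(', ', map { "$table.$_" } @$columns), $rows;
}

print output_signature('table', [qw(field1 field3 field7)], 60), "\n";
# (table.field1, table.field3, table.field7) x 60
```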

      Probably the hardest part would be recognising, at runtime, which queries can be redirected to the cache; but manually deriving a regex or two to recognise the pattern of queries in each group replaced by a generic query would be fairly easy, I think.
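A runtime redirection along those lines might look like this sketch, with one hand-derived (and purely illustrative) pattern per group; anything the pattern does not recognise would fall through to the real database:

```perl
use strict;
use warnings;

# Illustrative cache for one group of queries of the form
#   select name from tableX where id = '...'
my %cache = ( 42 => 'alpha', 99 => 'beta' );

sub run_query {
    my ($sql) = @_;

    # Hand-derived pattern for this group; the capture is the
    # varying interpolated value, used as the cache key.
    if ($sql =~ /^\s* select \s+ name \s+ from \s+ tableX
                  \s+ where \s+ id \s* = \s* '(\d+)'/xi) {
        return $cache{$1};      # answered from the cache
    }
    return undef;               # unrecognised: go to the real DB
}

print run_query("select name from tableX where id = '42'"), "\n";   # alpha
```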


