Tauchain R&D Analysis and Discussion: Subsumptive tabling beats magic sets.


Subsumption optimization vs magic sets

In my last update post I referred to the latest optimization from Ohad Asor called "subsumptive tabling". Like many of you, I originally had no idea what the paper was about from its esoteric academic title, so to provide clarity I'm going to share my analysis of it for the benefit of the community. Please note that to fully understand what any R&D paper is actually doing, you also have to look at the GitHub code and verify that the implementation produces the results predicted by the calculations in the paper.

The title of the paper is "More efficient datalog queries: subsumptive tabling beats magic sets", written by the computer scientists Tekle and Liu. Annie Liu is a professor who specializes in modeling and specification, analysis and verification, design and optimization, code generation, and testing. She also researches concurrent and distributed algorithms, with a particular interest in BFT (Byzantine fault tolerance).

Looking at the abstract, the goal of the paper is an efficiency gain for Datalog queries. Those who have studied the technical aspects of TML will be aware that TML is very much based on (yet not equivalent to) Datalog. The magic set transformation is a mechanism by which irrelevant tuples are avoided: the rules are rewritten so that bottom-up evaluation only derives tuples relevant to the query.

Datalog relies on inference, which is typically either top-down or bottom-up. Top-down is also called "backward chaining", while bottom-up is called "forward chaining". These procedures are isomorphic when successful, meaning they produce the same result no matter which you choose; the difference between them is efficiency. The top-down procedure can be more efficient at times, and all of this concerns how TML efficiently proves theorems, a necessary feature for Tauchain, which relies on logic. When G is a logical consequence of the knowledge base, we say the KB entails G; this is called logical entailment. Logical entailment, according to Wikipedia:

describes the relationship between statements that hold true when one statement logically follows from one or more statements.

In essence, the bottom-up procedure is what is usually called forward chaining: it derives the sentences that follow from the facts in the database (knowledge base). Backward chaining instead starts from a goal and works back toward those facts.

  • Data leads to a decision (forward chaining).
  • Decision leads back to data (backward chaining).
  • KB/knowledge base contains facts.

A decision could be a question such as "Should I go to New York or Texas?".
Since the decision is between two choices, you start with the decision and then search for the data that relates to it.

With forward chaining, you instead start with the data, and it converges toward a decision. The more data you have, the more "common sense" forward chaining produces. In other words, forward chaining is data driven, and it works when you may not know what question to ask before having the data.

This is probably not the best possible explanation, and the concept runs deep, but moving on we can now look at the paper and see that it presents a new, more efficient way of answering queries (questions). Efficiency is gained when answers to sub-questions can be reused. Note: **bottom-up inference starts with the facts to answer questions, while top-down inference starts with the question; data driven vs. goal driven.**
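The data-driven vs. goal-driven distinction can be sketched on one tiny propositional knowledge base. This is my own toy example, not anything from the paper or TML; the rule and fact names are invented:

```python
# Rules as (premises, conclusion) pairs; facts as a set of atoms.
rules = [({"rain"}, "wet_ground"),
         ({"wet_ground"}, "slippery")]

def forward_chain(facts, rules):
    """Data driven: derive everything that follows from the facts."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived

def backward_chain(goal, facts, rules):
    """Goal driven: work back from the question toward the facts.
    (Assumes acyclic rules; real systems use tabling to handle cycles.)"""
    if goal in facts:
        return True
    return any(all(backward_chain(p, facts, rules) for p in premises)
               for premises, conclusion in rules if conclusion == goal)

# Both procedures agree on what is entailed, per the isomorphism above.
assert forward_chain({"rain"}, rules) == {"rain", "wet_ground", "slippery"}
assert backward_chain("slippery", {"rain"}, rules)
```

The difference is that backward chaining only visits atoms relevant to the goal, which is exactly the efficiency angle the paper is after.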

The subsumptive demand transform turns a set of rules and a question into a new set of rules, such that the facts inferable from the new rules contain only the facts that would be inferred during v-topdown evaluation of the original rules. It can add new rules that define the needed facts for each hypothesis in each rule, and add hypotheses to the original rules to restrict computation to infer only needed facts. I realize this explanation is very technical, and fully understanding it would likely require studying the code, but it appears to produce efficiency gains based on the experiments in the paper. The authors performed a complexity analysis and comparison in which they tested s-topdown, s-bottomup, v-topdown, and v-bottomup. The experiments showed that s-topdown beats v-topdown, and s-bottomup beats v-bottomup. Short and simple: the subsumptive demand transform is a new transformation method, which they call subsumption optimization.
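The core idea of subsumption can be shown on binding patterns: strings with 'b' for a bound argument and 'f' for a free one (this notation appears in the paper's quoted passage below; the helper function itself is my own sketch, not the paper's algorithm). A call pattern that binds fewer arguments is more general, so its answer table can serve more specific calls:

```python
def subsumes(general, specific):
    """A pattern subsumes another if every argument it binds is also
    bound there, i.e. answers tabled for the more general call can
    answer the more specific one."""
    return len(general) == len(specific) and all(
        g == "f" or g == s for g, s in zip(general, specific))

# Tabling the call pattern "bf" (first argument bound, second free)
# already covers the more specific call pattern "bb":
assert subsumes("bf", "bb")
assert not subsumes("bb", "bf")
```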

There appears to be a two-step process, where the first step is to identify the demand patterns to subsume. In essence, all sets of demand patterns are generated, and for each set the annotation rules are then generated. Then, for each set of resulting rules, the asymptotic time complexity of the annotated rules is compared against that of the original rules, all with the goal of answering whether the demand pattern in one set should be subsumed by the demand patterns in another. All of this is to be implemented in TML, but the algorithm is laid out in the paper with experimental evidence to support it, which could explain why it was chosen.
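A rough sketch of the enumeration half of that first step, under my own simplifying assumptions (a single predicate, patterns as 'b'/'f' strings; the grouping below is mine, not the paper's candidate-generation procedure):

```python
from itertools import product

def all_patterns(k):
    """Every binding pattern for a k-ary predicate."""
    return ["".join(p) for p in product("bf", repeat=k)]

def subsumes(general, specific):
    """'b'/'f' subsumption: the general pattern binds no extra arguments."""
    return all(g == "f" or g == s for g, s in zip(general, specific))

patterns = all_patterns(2)            # ['bb', 'bf', 'fb', 'ff']
covers = {g: [s for s in patterns if s != g and subsumes(g, s)]
          for g in patterns}
# 'ff' (all arguments free) subsumes every other pattern, so one table
# for the fully general call could answer all of them; which choice is
# actually cheaper is what the paper's complexity comparison decides.
```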

Conclusion

Subsumption optimization's path to optimization is to reuse as many facts as possible. The paper is backed by experimental evidence showing time efficiency, and the authors also showed with complexity analysis that their transformation method is more efficient than magic sets. For this reason the statement "subsumptive tabling beats magic sets" is true based on the experimental results. I will admit I struggle with some of the theoretical concepts, but in my opinion it is not necessary to understand the deep logic to take a test-it-and-see approach. Simply implementing it in code will either reproduce the efficiency gain promised by the paper or it will not, and from looking at the experiments, with my limited understanding, it seems to me that it will.

This is a paper I may have to return to at a later date to get a firmer grasp of it. Subsumptive optimization is something I had never heard of before, and I'm guessing most readers haven't either, so the key takeaway is: if you read the paper, focus on the experimental results, the implementation of the algorithm, and the conclusion. Anyone with more knowledge on this topic than me, feel free to correct my errors in the comments below.

References

Tekle, K. T., & Liu, Y. A. (2011). More efficient datalog queries: Subsumptive tabling beats magic sets. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data (pp. 661–672). ACM.





Really glad that they're hiring; it means Ohad Asor, the lead dev, is likely past the most difficult R&D stages of the project, and now top C++ devs have an opportunity to contribute.

I recall that Ohad, after only studying this for a few days, reached out to Annie Liu about a possible error in the paper. After a little back and forth, it was concluded that Ohad was indeed correct. To give an indication of how technical it is, this was the passage in question:

The s-demand pattern of the given query p(a1,...,ak) is ⟨p, 1, _, true, s⟩, where 1 and _ indicate that the given query can be regarded as the first (and only) hypothesis of a non-existing rule, and the ith character of s is 'b' if ai is a constant, and 'f' otherwise. For each computed s-demand pattern ⟨p, n, r, g, s⟩, for each rule r2 that defines p, and for each jth hypothesis h of r2 whose predicate is an IDB predicate, say, q, add an s-demand pattern ⟨q, j, r2, g2, s2⟩, if this s-demand pattern is not subsumed by an s-demand pattern already computed, where g2 is true iff (i) g is true, (ii) j is 1, and (iii) the ith character of s2 is 'b' if the ith argument of h is a constant, appears in a hypothesis to the left of h in r, or is an argument of the conclusion of r bound by s; and 'f' otherwise.

I'm very much looking forward to what Ohad can do with Tau.
