Datalog

Datalog is a declarative logic programming language that syntactically is a subset of Prolog. It is often used as a query language for deductive databases. In recent years, Datalog has found new application in data integration, information extraction, networking, program analysis, security, cloud computing and machine learning.[1][2]

Its origins date back to the beginning of logic programming, but it became prominent as a separate area around 1977 when Hervé Gallaire and Jack Minker organized a workshop on logic and databases.[3] David Maier is credited with coining the term Datalog.[4]

Features, limitations and extensions

Unlike in Prolog, statements of a Datalog program can be stated in any order. Furthermore, Datalog queries on finite sets are guaranteed to terminate, so Datalog does not have Prolog's cut operator. This makes Datalog a fully declarative language.

In contrast to Prolog, Datalog

  1. disallows complex terms as arguments of predicates, e.g., p (1, 2) is admissible but not p (f (1), 2),
  2. imposes certain stratification restrictions on the use of negation and recursion,
  3. requires that every variable that appears in the head of a clause also appears in a nonarithmetic positive (i.e. not negated) literal in the body of the clause,
  4. requires that every variable appearing in a negative literal in the body of a clause also appears in some positive literal in the body of the clause[5]

Query evaluation with Datalog is based on first-order logic, and is thus sound and complete. However, Datalog is not Turing complete, and is thus used as a domain-specific language that can take advantage of efficient algorithms developed for query resolution. Indeed, various methods have been proposed to efficiently perform queries, e.g., the Magic Sets algorithm,[6] tabled logic programming[7] or SLG resolution.[8]

Some widely used database systems include ideas and algorithms developed for Datalog. For example, the SQL:1999 standard includes recursive queries, and the Magic Sets algorithm (initially developed for the faster evaluation of Datalog queries) is implemented in IBM's DB2.[9] Moreover, Datalog engines are behind specialised database systems such as Intellidimension's database for the semantic web.

Several extensions have been made to Datalog, e.g., to support aggregate functions, to allow object-oriented programming, or to allow disjunctions as heads of clauses. These extensions have significant impacts on the definition of Datalog's semantics and on the implementation of a corresponding Datalog interpreter.

Fragments

Datalog programs may or may not use negation in rule bodies: Datalog programs with negation are often required to use it as stratified negation to ensure that the semantics are well-defined. Datalog programs may or may not use inequalities between variables in rule bodies.

Some syntactic fragments of Datalog have been defined, such as the following (from most restricted to least restricted):

  • linear Datalog, where rule bodies must consist of a single atom
  • guarded Datalog, where for every rule, all the variables that occur in the rule bodies must occur together in at least one atom, called a guard atom
  • frontier-guarded Datalog, where for every rule, all the variables that are shared between the rule body and the rule head (called the frontier variables) must all occur together in a guard atom

Another restriction concerns the use of recursion: nonrecursive Datalog is defined by disallowing recursion in the definition of Datalog programs. This fragment of Datalog is as expressive as unions of conjunctive queries, but writing queries as nonrecursive Datalog can be more concise.

Expressiveness

The boundedness problem for Datalog asks, given a Datalog program, whether it is bounded, i.e., the maximal recursion depth reached when evaluating the program on an input database can be bounded by some constant. In other words, this question asks whether the Datalog program could be rewritten as a nonrecursive Datalog program. Solving the boundedness problem on arbitrary Datalog programs is undecidable,[10] but it can be made decidable by restricting to some fragments of Datalog.

Example

These two lines define two facts, i.e. things that always hold:

parent(xerces, brooke).
parent(brooke, damocles).

This is what they mean: xerces is a parent of brooke and brooke is a parent of damocles. The names are written in lowercase because strings beginning with an uppercase letter stand for variables.

These two lines define rules, which define how new facts can be inferred from known facts.

ancestor(X, Y) :- parent(X, Y).
ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).

meaning:

  • X is an ancestor of Y if X is a parent of Y.
  • X is an ancestor of Y if X is a parent of some Z, and Z is an ancestor of Y.

This line is a query:

?- ancestor(xerces, X).

It asks the following: Who are all the X that xerces is an ancestor of? It would return brooke and damocles when posed against a Datalog system containing the facts and rules described above.

More about rules: a rule has a :- symbol in the middle: the part to the left of this symbol is the head of the rule, the part to the right is the body. A rule reads like this: <head> is known to be true if it is known that <body> is true. Uppercase letters in rules stand for variables: in the example, we don't know who X or Y are, but some X is the ancestor of some Y if that X is the parent of that Y. The ordering of the clauses is irrelevant in Datalog, in contrast to Prolog, which depends on the ordering of clauses for computing the result of the query call.

Datalog distinguishes between extensional predicate symbols (defined by facts) and intensional predicate symbols (defined by rules).[11] In the example above ancestor is an intensional predicate symbol, and parent is extensional. Predicates may also be defined by facts and rules and therefore neither be purely extensional nor intensional, but any Datalog program can be rewritten into an equivalent program without such predicate symbols with duplicate roles.

Syntax

A Datalog program consists of a list of facts and rules (Horn clauses).[12] If constant and variable are two countable sets of constants and variables respectively and relation is a countable set of predicate symbols, then the following grammar expresses the structure of a Datalog program:

<program> ::= <fact> <program> | <rule> <program> | ɛ
<fact> ::=  <relation> "(" <constant-list> ")." 
<rule> ::= <atom> ":-" <atom-list> "."
<atom> ::= <relation> "(" <term-list> ")"
<atom-list> ::= <atom> | <atom> "," <atom-list>
<term> ::= <constant> | <variable>
<term-list> ::= <term> | <term> "," <term-list>
<constant-list> ::= <constant> | <constant> "," <constant-list>

Atoms are also referred to as literals in the literature.

Semantics

There are three widely-used approaches to the semantics of Datalog programs: model-theoretic, fixed-point, and proof-theoretic.

Model Theory

A fact or rule is called ground if all of its subterms are constants. A ground rule R1 is a ground instance of another rule R2 if R1 is the result of a substitution of constants for all the variables in R2.

The Herbrand base (see Herbrand Universe) of a Datalog program is the set of all ground atoms that can be made with the constants appearing in the program. An interpretation (also known as a database instance) is a subset of the Herbrand base. A ground atom is true in an interpretation I if it is an element of I. A rule is true in an interpretation I if for each ground instance of that rule, if all the clauses in the body are true in I, then the head of the rule is also true in I.

A model of a Datalog program P is an interpretation I of P which contains all the ground facts of P, and makes all of the rules of P true in I. Model-theoretic semantics state that the meaning of a Datalog program is its minimal model (equivalently, the intersection of all its models).[13]

Fixpoint Semantics

Let I be the set of interpretations of a Datalog program P (i.e. I = P(H) where H is the Herbrand base of P and P is the powerset operator. Define a map from I to I as follows: For each ground instance of each rule in P, if every clause in the body is in the input interpretation, the add the head of the ground instance to the output interpretation. Then the fixed point of this map is the minimal model of the program. The fixpoint semantics suggest an algorithm for computing the minimal model: Start with the set of ground facts in the program, then repeatedly add consequences of the rules until a fixpoint is reached.[14]

Evaluation

There are many different ways to evaluate a Datalog program, with different performance characteristics.

Bottom-Up Evaluation Strategies

Bottom-up evaluation strategies start with the facts in the program and repeatedly apply the rules until the either some goal or query is established, or until the complete minimal model of the program is produced.

Naïve Evaluation

Naïve evaluation mirrors the fixpoint semantics for Datalog programs. Naïve evaluation uses a set of "known facts", which is initialized to the facts in the program. It proceeds by repeatedly enumerating all ground instances of each rule in the program. If each atom in the body of the ground instance is in the set of known facts, then the head atom is added to the set of known facts. This process is repeated until a fixed point is reached, and no more facts may be deduced. Naïve evaluation produces the entire minimal model of the program.[15]

Systems implementing Datalog

Here is a short list of systems that are either based on Datalog or provide a Datalog interpreter:

Free software/open source

Written in Name Try it online External Database Description Licence
C XSB A logic programming and deductive database system for Unix and MS Windows with tabling giving Datalog-like termination and efficiency, including incremental evaluation[16] GNU LGPL
C++ Coral[17] A deductive database system written in C++ with semi-naïve datalog evaluation. Developed 1988-1997. custom licence, free for non-commercial use
DLV[18] A Datalog extension that supports disjunctive head clauses. custom licence, free for academic and non-commercial educational use, as well as for use by non-profit organisations[19]
Inter4QL[20] an open-source command-line interpreter of Datalog-like 4QL query language implemented in C++ for Windows, Mac OS X and Linux. Negation is allowed in heads and bodies of rules as well as in recursion GNU GPL v3
RDFox[21] RDF triple store with Datalog reasoning. Implements the FBF algorithm for incremental evaluation. custom licence, free for non-commercial use[22]
Souffle[23] an open-source Datalog-to-C++ compiler converting Datalog into high-performance, parallel C++ code, specifically designed for complex Datalog queries over large data sets as e.g. encountered in the context of static program analysis UPL v1.0
Clojure Cascalog Hadoop a Clojure library for querying data stored on Hadoop clusters Apache
Clojure Datalog a contributed library implementing aspects of Datalog Eclipse Public License 1.0
Crux Yes Apache Kafka A general-purpose database with an "unbundled" architecture, using log-centric streaming of documents and transactions to achieve significant architectural flexibility and elegant horizontal scaling. Pluggable components include Kafka, RocksDB and LMDB. Indexes are bitemporal to support point-in-time Datalog queries by default. Java and HTTP APIs are provided. MIT License
Datascript in-memory Immutable database and Datalog query engine that runs in the browser Eclipse Public License 1.0
Datalevin LMDB A fork of Datascript that is optimized for the LMDB durable storage Eclipse Public License 1.0
Datahike file, in-memory A fork of Datascript with a durable backend using a hitchiker tree. Eclipse Public License 1.0
Erlang Datalog The library is designed to query and formalise relation of n-ary streams using datalog. It implements an ad-hoc query engine using simplified version of general logic programming paradigm. The library facilitates development of data integration, information exchange and semantic web applications. Apache v2
Haskell Dyna[24] Dyna is a declarative programming language for statistical AI programming. The language is based on Datalog, supports both forward and backward chaining, and incremental evaluation. GNU AGPL v3
DDlog[25] DDlog is an incremental, in-memory, typed Datalog engine. It is well suited for writing programs that incrementally update their output in response to input changes. The DDlog programmer specifies the desired input-output mapping in a declarative manner, using a dialect of Datalog. The DDlog compiler then synthesizes an efficient incremental implementation. DDlog is based on the differential dataflow[26] library. MIT License
Java AbcDatalog[27] AbcDatalog is an open-source implementation of the logic programming language Datalog written in Java. It provides ready-to-use implementations of common Datalog evaluation algorithms, as well as some experimental multi-threaded evaluation engines. It supports language features beyond core Datalog such as explicit (dis-)unification of terms and stratified negation. Additionally, AbcDatalog is designed to be easily extensible with new evaluation engines and new language features. BSD
IRIS[28] IRIS extends Datalog with function symbols, built-in predicates, locally stratified or un-stratified logic programs (using the well-founded semantics), unsafe rules and XML schema data types GNU LGPL v2.1
Jena a Semantic Web framework which includes a Datalog implementation as part of its general purpose rule engine, which provides OWL and RDFS support.[29] Apache v2
SociaLite[30] SociaLite is a datalog variant for large-scale graph analysis developed in Stanford Apache v2
Graal[31] Graal is a Java toolkit dedicated to querying knowledge bases within the framework of existential rules, aka Datalog+/-. CeCILL v2.1
Flix Yes A functional and logic programming language inspired by Datalog extended with user-defined lattices and monotone filter/transfer functions. Apache v2
Lua Datalog[32] Yes[33] a lightweight deductive database system. GNU LGPL
OCaml datalog[34] An in-memory datalog implementation for OCaml featuring bottom-up and top-down algorithms. BSD 2-clause
Prolog DES[35] an open-source implementation to be used for teaching Datalog in courses GNU LGPL
Python pyDatalog 11 dialects of SQL adds logic programming to Python's toolbox. It can run logic queries on databases or Python objects, and use logic clauses to define the behavior of Python classes. GNU LGPL
Racket Datalog for Racket[36] GNU LGPL
Datafun[37] Generalized Datalog on Semilattices GNU LGPL
Ruby bloom / bud A Ruby DSL for programming with data-centric constructs, based on the Dedalus extension of Datalog which adds a temporal dimension to the logic. BSD 3-Clause
Rust Crepe Crepe is a library that allows you to write declarative logic programs in Rust, with a Datalog-like syntax. It provides a procedural macro that generates efficient, safe code and interoperates seamlessly with Rust programs. It also supports extensions like stratified negation, semi-naive evaluation, and calling external functions within Datalog rules. MIT License / Apache 2.0
Datafrog Datafrog is a lightweight Datalog engine intended to be embedded in other Rust programs. MIT License / Apache 2.0
Tcl tclbdd[38] Implementation based on binary decision diagrams. Built to support development of an optimizing compiler for Tcl. BSD
Other or Unknown Languages bddbddb[39] an implementation of Datalog done at Stanford University. It is mainly used to query Java bytecode including points-to analysis on large Java programs GNU LGPL
ConceptBase[40] a deductive and object-oriented database system based on a Datalog query evaluator : Prolog for triggered procedures and rewrites, axiomatized Datalog called « Telos » for (meta)modeling. It is mainly used for conceptual modeling and metamodeling BSD 2-Clause

Non-free software

  • Datomic is a distributed database designed to enable scalable, flexible and intelligent applications, running on new cloud architectures. It uses Datalog as the query language.
  • FoundationDB provides a free-of-charge database binding for pyDatalog, with a tutorial on its use.[41]
  • Leapsight Semantic Dataspace (LSD) is a distributed deductive database that offers high availability, fault tolerance, operational simplicity, and scalability. LSD uses Leaplog (a Datalog implementation) for querying and reasoning and was create by Leapsight.[42]
  • LogicBlox, a commercial implementation of Datalog used for web-based retail planning and insurance applications.
  • Profium Sense is a native RDF compliant graph database written in Java. It provides Datalog evaluation support of user defined rules.
  • .QL, a commercial object-oriented variant of Datalog created by Semmle for analyzing source code to detect security vulnerabilities.[43]
  • SecPAL a security policy language developed by Microsoft Research.[44]
  • Stardog is a graph database, implemented in Java. It provides support for RDF and all OWL 2 profiles providing extensive reasoning capabilities, including datalog evaluation.
  • StrixDB: a commercial RDF graph store, SPARQL compliant with Lua API and Datalog inference capabilities. Could be used as httpd (Apache HTTP Server) module or standalone (although beta versions are under the Perl Artistic License 2.0).

See also

References

  1. Huang, Green, and Loo, "Datalog and Emerging applications", SIGMOD 2011 (PDF), UC DavisCS1 maint: multiple names: authors list (link).
  2. "Neural Datalog Through Time: Informed Temporal Modeling via Logical Specification". Proceedings of ICML 2020.
  3. Gallaire, Hervé; Minker, John ‘Jack’, eds. (1978), "Logic and Data Bases, Symposium on Logic and Data Bases, Centre d'études et de recherches de Toulouse, 1977", Advances in Data Base Theory, New York: Plenum Press, ISBN 978-0-306-40060-5.
  4. Abiteboul, Serge; Hull, Richard; Vianu, Victor (1995), Foundations of databases, p. 305, ISBN 9780201537710.
  5. Datalog
  6. Bancilhon. "Magic sets and other strange ways to implement logic programs" (PDF). PT: UNL. Archived from the original (PDF) on 2012-03-08. Cite journal requires |journal= (help)
  7. Pfenning, Frank; Schuermann, Carsten. "Twelf User's Guide". CMU. Cite journal requires |journal= (help)
  8. "Efficient top-down computation of queries under the well-founded semantics" (PDF). Cite journal requires |journal= (help)
  9. Gryz; Guo; Liu; Zuzarte (2004). "Query sampling in DB2 Universal Database" (PDF). Proceedings of the 2004 ACM SIGMOD international conference on Management of data - SIGMOD '04. p. 839. doi:10.1145/1007568.1007664. ISBN 978-1581138597. S2CID 7775190.
  10. Hillebrand, Gerd G; Kanellakis, Paris C; Mairson, Harry G; Vardi, Moshe Y (1995-11-01). "Undecidable boundedness problems for datalog programs". The Journal of Logic Programming. 25 (2): 163–190. doi:10.1016/0743-1066(95)00051-K. ISSN 0743-1066.
  11. Lifschitz (2011). "Datalog Programs and Their Stable Models". Datalog Reloaded. Lecture Notes in Computer Science. 6702. DE: Springer. pp. 78–87. CiteSeerX 10.1.1.225.4027. doi:10.1007/978-3-642-24206-9_5. ISBN 978-3-642-24205-2.
  12. Ceri, Gottlob & Tanca 1989, p. 146.
  13. Ceri, Gottlob & Tanca 1989, p. 149.
  14. Ceri, Gottlob & Tanca 1989, p. 150.
  15. Ceri, Gottlob & Tanca 1989, p. 154.
  16. The XSB System, Version 3.7.x, Volume 1: Programmer's Manual (PDF).
  17. Coral Database Project web page.
  18. "DLVSYSTEM S.r.l. | DLV". www.dlvsystem.com. Retrieved 2018-11-29..
  19. "DLVSYSTEM S.r.l. | DLV". www.dlvsystem.com. Retrieved 2018-11-29.
  20. 4QL.
  21. RDFox web page.
  22. RDFox licence, archived from the original on 2018-02-21, retrieved 2018-11-29.
  23. Souffle Compiler, 2018-12-12.
  24. "Dyna", Dyna web page, archived from the original on 2016-01-17, retrieved 2016-11-07.
  25. DDlog
  26. Differential Dataflow
  27. AbcDatalog.
  28. Iris reasoner.
  29. "Jena". Source forge.
  30. SociaLite homepage, archived from the original on 2017-09-11, retrieved 2015-10-12.
  31. Graal library.
  32. Ramsdell, "Datalog", Tools, NEU.
  33. Sangkok, Y, "Wrapper", Mitre Datalog, Git hub, (compiled to JavaScript).
  34. Cruanes, Simon, "datalog", datalog, GitHub.
  35. Saenz-Perez (2011), "DES: A Deductive Database System", Electronic Notes in Theoretical Computer Science, ES, 271: 63–78, doi:10.1016/j.entcs.2011.02.011.
  36. "Datalog", Racket (technical documentation).
  37. "Datafun", Datafun in Racket (Links to paper, talk and github site).
  38. Kenny, Kevin B (12–14 November 2014). Binary decision diagrams, relational algebra, and Datalog: deductive reasoning for Tcl (PDF). Twenty-first Annual Tcl/Tk Conference. Portland, Oregon. Retrieved 29 December 2015.
  39. "bddbddb", Source forge.
  40. ConceptBase.
  41. FoundationDB Datalog Tutorial, archived from the original on 2013-08-09.
  42. "Leapsight". Archived from the original on 2018-11-11.
  43. Semmle QL.
  44. "SecPAL". Microsoft Research. Archived from the original on 2007-02-23.

Bibliography

Further reading

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.