Symbolic regression
Symbolic regression (SR) is a type of regression analysis that searches the space of mathematical expressions for the model that best fits a given dataset in terms of both accuracy and simplicity. No particular model is provided to the algorithm as a starting point. Instead, initial expressions are formed by randomly combining mathematical building blocks such as mathematical operators, analytic functions, constants, and state variables. Usually, the person operating the algorithm specifies a subset of these primitives, but this is not a requirement of the technique. The symbolic regression problem for mathematical functions has been tackled with a variety of methods, most commonly by recombining equations using genetic programming,[1] and more recently with Bayesian methods[2] and physics-inspired AI.[3] Another, non-classical alternative to SR is the Universal Functions Originator (UFO), which has a different mechanism, search space, and building strategy.[4]
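The random construction of initial expressions described above can be sketched as follows. This is a minimal illustration, not the procedure of any particular SR system: the primitive set (`+`, `-`, `*`, `sin`, `cos`, a single variable `x`), the nested-tuple tree representation, the branching probabilities, and the helper names `random_expression`, `evaluate`, and `to_string` are all assumptions made for this example.

```python
import math
import operator
import random

# Building blocks (primitives). This particular selection is an assumption.
BINARY_OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}
UNARY_FUNCS = {"sin": math.sin, "cos": math.cos}
VARIABLES = ["x"]


def random_expression(rng, max_depth):
    """Build a random expression tree as nested tuples."""
    # Emit a terminal when the depth budget is spent, or at random before that.
    if max_depth == 0 or rng.random() < 0.3:
        if rng.random() < 0.5:
            return ("var", rng.choice(VARIABLES))
        return ("const", round(rng.uniform(-2.0, 2.0), 2))
    # Otherwise combine sub-expressions with a binary operator or a function.
    if rng.random() < 0.7:
        op = rng.choice(sorted(BINARY_OPS))
        left = random_expression(rng, max_depth - 1)
        right = random_expression(rng, max_depth - 1)
        return (op, left, right)
    func = rng.choice(sorted(UNARY_FUNCS))
    return (func, random_expression(rng, max_depth - 1))


def evaluate(expr, env):
    """Evaluate an expression tree for the variable values given in env."""
    tag = expr[0]
    if tag == "var":
        return env[expr[1]]
    if tag == "const":
        return expr[1]
    if tag in BINARY_OPS:
        return BINARY_OPS[tag](evaluate(expr[1], env), evaluate(expr[2], env))
    return UNARY_FUNCS[tag](evaluate(expr[1], env))


def to_string(expr):
    """Render an expression tree in infix notation."""
    tag = expr[0]
    if tag in ("var", "const"):
        return str(expr[1])
    if tag in BINARY_OPS:
        return f"({to_string(expr[1])} {tag} {to_string(expr[2])})"
    return f"{tag}({to_string(expr[1])})"


rng = random.Random(0)  # fixed seed so that runs are repeatable
for _ in range(3):
    candidate = random_expression(rng, max_depth=3)
    print(to_string(candidate), "=", evaluate(candidate, {"x": 1.5}))
```

Restricting `BINARY_OPS` and `UNARY_FUNCS` in this sketch corresponds to the operator specifying a subset of primitives.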
By not requiring a priori specification of a model, symbolic regression is less affected by human bias or by unknown gaps in domain knowledge. It attempts to uncover the intrinsic relationships of the dataset by letting the patterns in the data itself reveal the appropriate models, rather than imposing a model structure that is deemed mathematically tractable from a human perspective. The fitness function that drives the evolution of the models takes into account not only error metrics (to ensure the models accurately predict the data), but also special complexity measures,[5] so that the resulting models reveal the data's underlying structure in a way that is understandable from a human perspective. This facilitates reasoning and improves the odds of gaining insight into the data-generating system.
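One simple way to combine an error metric with a complexity measure is a penalized score, sketched below. The node count used as the complexity measure, the weight `alpha`, and the function names are assumptions chosen for illustration; they are a stand-in for, not a reproduction of, the complexity measures cited above.

```python
def mean_squared_error(model, xs, ys):
    """Average squared difference between predictions and observations."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def penalized_fitness(model, size, xs, ys, alpha=0.01):
    """Lower is better: prediction error plus a penalty on expression size.

    `size` (number of nodes in the expression tree) and `alpha` are
    illustrative assumptions, not a measure taken from the literature.
    """
    return mean_squared_error(model, xs, ys) + alpha * size


# Toy data generated by y = x*x + x.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x * x + x for x in xs]

# Candidates as (text, callable, hand-counted number of tree nodes).
candidates = [
    ("x*x", lambda x: x * x, 3),
    ("x*x + x", lambda x: x * x + x, 5),
    ("(x*x + x) + (x - x)", lambda x: (x * x + x) + (x - x), 9),
]

scores = {text: penalized_fitness(f, size, xs, ys) for text, f, size in candidates}
best = min(scores, key=scores.get)
print(best, scores[best])
```

In this toy setting the two exact candidates have the same error, so the penalty term alone decides in favor of the shorter expression.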
Difference from classical regression
While conventional regression techniques seek to optimize the parameters for a pre-specified model structure, symbolic regression avoids imposing prior assumptions, and instead infers the model from the data. In other words, it attempts to discover both model structures and model parameters.
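The contrast can be made concrete with a small sketch. The first function fits the parameters of one fixed structure, a*x + b, as classical regression does. The second tries several structures and fits a parameter for each, which is the idea behind symbolic regression reduced to exhaustive enumeration over a tiny, assumed set of four one-parameter structures. Real SR systems search far larger spaces with methods such as genetic programming; all names and data here are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Classical regression: least-squares a, b for the fixed structure a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def fit_scale(basis, xs, ys):
    """Least-squares coefficient a for the one-parameter model a * basis(x)."""
    num = sum(basis(x) * y for x, y in zip(xs, ys))
    den = sum(basis(x) ** 2 for x in xs)
    return num / den


def sum_squared_error(model, xs, ys):
    """Sum of squared residuals of model on the data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys))


# Candidate structures (an assumed, tiny search space).
STRUCTURES = {
    "a*x": lambda x: x,
    "a*x^2": lambda x: x * x,
    "a*sin(x)": math.sin,
    "a*exp(x)": math.exp,
}


def search_structures(xs, ys):
    """Fit every candidate structure; return (name, a, error) of the best one."""
    results = []
    for name, basis in STRUCTURES.items():
        a = fit_scale(basis, xs, ys)
        error = sum_squared_error(lambda x, a=a, basis=basis: a * basis(x), xs, ys)
        results.append((error, name, a))
    error, name, a = min(results)
    return name, a, error


# Toy data generated by y = 3*x^2; the structure is treated as unknown.
xs = [0.5, 1.0, 1.5, 2.0, 2.5]
ys = [3.0 * x * x for x in xs]

a_line, b_line = fit_line(xs, ys)
line_error = sum_squared_error(lambda x: a_line * x + b_line, xs, ys)
best_name, best_a, best_error = search_structures(xs, ys)

print("fixed structure a*x + b:", a_line, b_line, "error", line_error)
print("searched structure:", best_name, "a =", best_a, "error", best_error)
```

The fixed linear structure leaves a residual error no matter how its parameters are tuned, whereas the search recovers both the structure and its coefficient.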
This approach has the disadvantage of having a much larger space to search: not only is the search space in symbolic regression infinite, but there are also infinitely many models that perfectly fit a finite data set (provided that the model complexity isn't artificially limited). This means that a symbolic regression algorithm may take longer to find an appropriate model and parametrization than traditional regression techniques. This can be mitigated by limiting the set of building blocks provided to the algorithm, based on existing knowledge of the system that produced the data; ultimately, the decision to use symbolic regression has to be weighed against how much is known about the underlying system.
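The claim that infinitely many models fit a finite data set perfectly can be illustrated with one standard construction: if f fits the points x_1, ..., x_n exactly, then so does f(x) + c (x - x_1)...(x - x_n) for every constant c, because the added product vanishes at each data point. The sketch below checks this numerically; the base model, the data, and the function names are assumptions for this example.

```python
def vanishing_product(xs):
    """Return p(x) = (x - x_1)...(x - x_n), which is zero at every data point."""
    def p(x):
        result = 1.0
        for xi in xs:
            result *= x - xi
        return result
    return p


def perturbed_model(base, xs, c):
    """Return g(x) = base(x) + c * p(x); g agrees with base on all of xs."""
    p = vanishing_product(xs)
    return lambda x: base(x) + c * p(x)


def base(x):
    """One model that fits the data exactly."""
    return x * x + 1.0


xs = [0.0, 1.0, 2.0, 3.0]
ys = [base(x) for x in xs]

for c in (0.0, 1.0, -2.5):
    g = perturbed_model(base, xs, c)
    fits = all(g(x) == y for x, y in zip(xs, ys))
    print("c =", c, "fits data:", fits, "prediction at x = 4:", g(4.0))
```

Every choice of c reproduces the data exactly, yet the models disagree away from the data points, which is why a complexity limit or penalty is needed to choose among them.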
Nevertheless, this characteristic of symbolic regression also has advantages: because the evolutionary algorithm requires diversity in order to effectively explore the search space, the end result is likely to be a selection of high-scoring models (and their corresponding sets of parameters). Examining this collection can provide better insight into the underlying process, and allows the user to identify an approximation that better fits their needs in terms of accuracy and simplicity.
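One common way to present such a selection is as a Pareto front over error and complexity: the models for which no other model is at least as good on both criteria and strictly better on one. The sketch below assumes this presentation; the candidate models and their error and complexity values are made up for illustration and do not come from any real run.

```python
def dominates(a, b):
    """True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a["error"] <= b["error"] and a["complexity"] <= b["complexity"]
    better = a["error"] < b["error"] or a["complexity"] < b["complexity"]
    return no_worse and better


def pareto_front(models):
    """Return the non-dominated models, ordered from simplest to most complex."""
    front = [m for m in models if not any(dominates(other, m) for other in models)]
    return sorted(front, key=lambda m: m["complexity"])


# Hypothetical final population: the values are invented for this example.
models = [
    {"expr": "c", "error": 4.0, "complexity": 1},
    {"expr": "a*x", "error": 1.5, "complexity": 3},
    {"expr": "a*x + b", "error": 1.5, "complexity": 5},
    {"expr": "a*x^2 + b", "error": 0.2, "complexity": 7},
    {"expr": "a*sin(x) + b*x^2", "error": 0.05, "complexity": 9},
    {"expr": "a*x^2 + b*x + c", "error": 0.25, "complexity": 11},
]

front = pareto_front(models)
for m in front:
    print(m["expr"], m["error"], m["complexity"])
```

The user can then pick a point on the front according to how much accuracy they are willing to trade for simplicity.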
See also
- Eureqa, a symbolic regression engine
- HeuristicLab, a software environment for heuristic and evolutionary algorithms, including symbolic regression
- Closed-form expression § Conversion from numerical forms
- Genetic programming[3]
- Gene expression programming
- Kolmogorov complexity
- Mathematical optimization
- Regression analysis
- Reverse mathematics
- Universal Functions Originator
References
- Michael Schmidt; Hod Lipson (2009). "Distilling free-form natural laws from experimental data". Science. American Association for the Advancement of Science. 324 (5923): 81–85. Bibcode:2009Sci...324...81S. CiteSeerX 10.1.1.308.2245. doi:10.1126/science.1165893. PMID 19342586.
- Ying Jin; Weilin Fu; Jian Kang; Jiadong Guo; Jian Guo (2019). "Bayesian Symbolic Regression". arXiv:1910.08892 [stat.ME].
- Silviu-Marian Udrescu; Max Tegmark (2020). "AI Feynman: A physics-inspired method for symbolic regression". Science Advances. American Association for the Advancement of Science. 6 (16): eaay2631. doi:10.1126/sciadv.aay2631. PMC 7159912. PMID 32426452.
- Ali R. Al-Roomi; Mohamed E. El-Hawary (2020). "Universal Functions Originator". Applied Soft Computing. Elsevier B.V. 94: 106417. doi:10.1016/j.asoc.2020.106417. ISSN 1568-4946.
- Ekaterina J. Vladislavleva; Guido F. Smits; Dick Den Hertog (2009). "Order of nonlinearity as a complexity measure for models generated by symbolic regression via pareto genetic programming" (PDF). IEEE Transactions on Evolutionary Computation. 13 (2): 333–349. doi:10.1109/tevc.2008.926486.
Further reading
- Mark J. Willis; Hugo G. Hiden; Ben McKay; Gary A. Montague; Peter Marenbach (1997). "Genetic programming: An introduction and survey of applications" (PDF). IEE Conference Publications. IEE. pp. 314–319.
- Wouter Minnebo; Sean Stijven (2011). "Chapter 4: Symbolic Regression" (PDF). Empowering Knowledge Computing with Variable Selection (M.Sc. thesis). University of Antwerp.
- John R. Koza; Martin A. Keane; James P. Rice (1993). "Performance improvement of machine learning via automatic discovery of facilitating functions as applied to a problem of symbolic system identification" (PDF). IEEE International Conference on Neural Networks. San Francisco: IEEE. pp. 191–198.
External links
- Ivan Zelinka (2004). "Symbolic regression — an overview".
- Hansueli Gerber (1998). "Simple Symbolic Regression Using Genetic Programming". (Java applet) — approximates a function by evolving combinations of simple arithmetic operators, using algorithms developed by John Koza.
- Katya Vladislavleva. "Symbolic Regression: Function Discovery & More". Archived from the original on 2014-12-18.
- RGP, a Genetic Programming (GP) framework in R that supports symbolic regression
- GPTIPS, a Genetic Programming and Symbolic Data Mining Platform for MATLAB
- dcgp, an open source symbolic regression toolbox.
- Glyph, a Python 3 library based on DEAP providing abstraction layers for symbolic regression problems
- AI-Feynman, Python 3 and PyTorch code for "AI Feynman: A physics-inspired method for symbolic regression".
- TuringBot, a symbolic regression software based on simulated annealing.