spaCy
spaCy (/speɪˈsiː/ spay-SEE) is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython.[3][4] The library is published under the MIT license and its main developers are Matthew Honnibal and Ines Montani, the founders of the software company Explosion.
Original author(s) | Matthew Honnibal |
---|---|
Developer(s) | Explosion AI, various |
Initial release | February 2015[1] |
Stable release | 3.0.0 / 1 February 2021[2] |
Repository | |
Written in | Python, Cython |
Operating system | Linux, Windows, macOS |
Platform | Cross-platform |
Type | Natural language processing |
License | MIT License |
Website | spacy |
Unlike NLTK, which is widely used for teaching and research, spaCy focuses on providing software for production usage.[5][6] As of version 1.0, spaCy also supports deep learning workflows[7] that allow connecting statistical models trained by popular machine learning libraries like TensorFlow, PyTorch or MXNet through its own machine learning library, Thinc.[8][9] Using Thinc as its backend, spaCy features convolutional neural network models for part-of-speech tagging, dependency parsing, text categorization and named entity recognition (NER). Prebuilt statistical neural network models to perform these tasks are available for English, German, Greek, Spanish, Portuguese, French, Italian, Dutch, Lithuanian and Norwegian, and there is also a multi-language NER model. In addition, tokenization support for more than 50 languages allows users to train custom models on their own datasets.[10]
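As a rough illustration of the tokenization-only language support described above, the following sketch builds blank pipelines for two languages. It assumes spaCy 3.x is installed; the sample sentences are made up for the example, and the trained pipelines for tagging, parsing and NER (which must be downloaded separately) are not shown.

```python
import spacy

# spacy.blank() creates a pipeline that contains only the language's
# rule-based tokenizer, so no trained model has to be downloaded.
samples = {
    "en": "Let's tokenize this sentence.",
    "de": "Das ist ein einfacher Satz.",
}

tokenized = {}
for lang_code, text in samples.items():
    nlp = spacy.blank(lang_code)
    doc = nlp(text)
    tokenized[lang_code] = [token.text for token in doc]
    # pipe_names lists the trainable components; a blank pipeline has none yet.
    print(lang_code, nlp.pipe_names, tokenized[lang_code])
```

A blank pipeline like this is the usual starting point when components are to be trained on a custom dataset.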
Main features
- Non-destructive tokenization
- Named entity recognition
- "Alpha tokenization" support for over 50 languages[11]
- Statistical models for 11 languages[12]
- Pre-trained word vectors
- Part-of-speech tagging
- Labelled dependency parsing
- Syntax-driven sentence segmentation
- Text classification
- Built-in visualizers for syntax and named entities
- Deep learning integration
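The idea behind non-destructive tokenization in the list above can be sketched in plain Python: every token records its character offset and the whitespace that followed it, so the original text can be rebuilt exactly. This is a simplified illustration and not spaCy's actual implementation; the `Token` class, the `tokenize` and `detokenize` helpers and the regular expression are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Token:
    text: str        # the token's characters, unmodified
    idx: int         # character offset into the original string
    whitespace: str  # whitespace that followed the token in the original


def tokenize(text: str) -> List[Token]:
    """Split into words and punctuation without discarding any characters.

    Assumes the text does not start with whitespace (a simplification).
    """
    tokens = []
    # \w+ matches a word, [^\w\s] a single punctuation mark; \s* captures the
    # whitespace that follows so it is stored rather than dropped.
    for match in re.finditer(r"(\w+|[^\w\s])(\s*)", text):
        tokens.append(Token(match.group(1), match.start(1), match.group(2)))
    return tokens


def detokenize(tokens: List[Token]) -> str:
    """Rebuild the original string from tokens and their trailing whitespace."""
    return "".join(t.text + t.whitespace for t in tokens)


original = "Hello,  world!\nspaCy keeps   every character."
tokens = tokenize(original)

# The round trip reproduces the input exactly, including irregular spacing.
assert detokenize(tokens) == original
```

Because offsets are kept, annotations such as entity spans can always be mapped back to positions in the source text.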
Extensions and visualizers
spaCy comes with several extensions and visualizations that are available as free, open-source libraries:
- Thinc: A machine learning library optimized for CPU usage and deep learning with text input.
- sense2vec: A library for computing word similarities, based on Word2vec and sense2vec.[13]
- displaCy: An open-source dependency parse tree visualizer built with JavaScript, CSS and SVG.
- displaCyENT: An open-source named entity visualizer built with JavaScript and CSS.
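Word similarity of the kind sense2vec computes is commonly measured as the cosine similarity between embedding vectors. The sketch below shows that measure in plain Python. The three-dimensional vectors are invented for illustration, and the `word|TAG` key format is an assumption modelled on the sense2vec idea of distinguishing words by their part of speech; real vectors would come from a trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, made up for this example only.
vectors = {
    "duck|NOUN": [0.9, 0.1, 0.0],
    "goose|NOUN": [0.8, 0.2, 0.1],
    "duck|VERB": [0.1, 0.9, 0.2],
}

noun_sim = cosine_similarity(vectors["duck|NOUN"], vectors["goose|NOUN"])
verb_sim = cosine_similarity(vectors["duck|NOUN"], vectors["duck|VERB"])

# With sense-tagged keys, the noun "duck" sits closer to "goose" than to
# the verb "duck", even though the two "duck" entries share a spelling.
print(noun_sim > verb_sim)
```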
References
- "Introducing spaCy". explosion.ai. Retrieved 2016-12-18.
- "Release v3.0.0: Transformer-based pipelines, new training system, project templates, custom models, improved component API, type hints & lots more · explosion/spaCy". GitHub. Retrieved 2021-02-02.
- Choi et al. (2015). It Depends: Dependency Parser Comparison Using A Web-based Evaluation Tool.
- "Google's new artificial intelligence can't understand these sentences. Can you?". Washington Post. Retrieved 2016-12-18.
- "Facts & Figures - spaCy". spacy.io. Retrieved 2020-04-04.
- Bird, Steven; Klein, Ewan; Loper, Edward; Baldridge, Jason (2008). "Multidisciplinary instruction with the Natural Language Toolkit" (PDF). Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics, ACL: 62. doi:10.3115/1627306.1627317. ISBN 9781932432145. S2CID 16932735.
- "explosion/spaCy". GitHub. Retrieved 2016-12-18.
- "PyTorch, TensorFlow & MXNet". thinc.ai. Retrieved 2020-04-04.
- "explosion/thinc". GitHub. Retrieved 2016-12-30.
- "Models & Languages | spaCy Usage Documentation". spacy.io. Retrieved 2020-03-10.
- "Models & Languages - spaCy". spacy.io. Retrieved 2020-03-10.
- "Models & Languages | spaCy Usage Documentation". spacy.io. Retrieved 2020-03-10.
- Trask et al. (2015). sense2vec - A Fast and Accurate Method for Word Sense Disambiguation In Neural Word Embeddings.