Sentence boundary disambiguation

Sentence boundary disambiguation (SBD), also known as sentence breaking, sentence boundary detection, and sentence segmentation, is the problem in natural language processing of deciding where sentences begin and end. Natural language processing tools often require their input to be divided into sentences; however, identifying sentence boundaries can be challenging because punctuation marks are potentially ambiguous. In written English, a period may indicate the end of a sentence, or it may be part of an abbreviation, a decimal number, an ellipsis, or an email address, among other possibilities. About 47% of the periods in the Wall Street Journal corpus denote abbreviations.[1] Question marks and exclamation marks can be similarly ambiguous because of their use in emoticons, computer code, and slang.

Some languages, including Japanese and Chinese, have unambiguous sentence-ending markers.

Strategies

The standard 'vanilla' approach to locating the end of a sentence applies three rules:

(a) If it's a period, it ends a sentence.

(b) If the preceding token is in the hand-compiled list of abbreviations, then it doesn't end a sentence.

(c) If the next token is capitalized, then it ends a sentence.

This strategy gets about 95% of sentences correct.[2] The remaining 5% often involve abbreviated names such as "D. H. Lawrence" (with whitespace between the initials), idiosyncratic spellings used for stylistic purposes (often referring to a single concept, e.g. an entertainment product title like ".hack//SIGN"), and non-standard punctuation or non-standard usage of punctuation.
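The three rules above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the abbreviation list is a small hypothetical sample, tokens are split on whitespace only, and the rules are read as a cascade in which each rule overrides the one before it (an assumption, since the rules do not state their precedence).

```python
# Hypothetical, deliberately tiny abbreviation list; a real hand-compiled
# list would be far larger.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "a.m.", "p.m.", "e.g.", "i.e.", "etc."}


def is_boundary(token, next_token):
    """Apply rules (a)-(c) to one token; later rules override earlier ones."""
    if not token.endswith("."):
        return False                      # only periods are candidates
    boundary = True                       # (a) a period ends a sentence
    if token.lower() in ABBREVIATIONS:
        boundary = False                  # (b) unless the token is a known abbreviation
    if next_token is not None and next_token[0].isupper():
        boundary = True                   # (c) a capitalized next token ends it anyway
    return boundary


def split_sentences(text):
    """Group whitespace-separated tokens into sentences."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        if is_boundary(token, next_token):
            sentences.append(" ".join(current))
            current = []
    if current:                           # flush any trailing tokens
        sentences.append(" ".join(current))
    return sentences
```

On "He left at 5 p.m. yesterday. It was raining." the sketch returns the two expected sentences. On "D. H. Lawrence wrote novels." it returns "D.", "H." and "Lawrence wrote novels.", which is the kind of error that falls into the remaining 5%.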

Another approach is to automatically learn a set of rules from a set of documents in which the sentence breaks are pre-marked. Solutions of this kind have been based on a maximum entropy model.[3] The SATZ architecture uses a neural network to disambiguate sentence boundaries and achieves 98.5% accuracy.
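As a rough illustration of how pre-marked documents can become training data, the sketch below turns each candidate punctuation mark into a labelled example. Everything here is an assumption made for the example: the one-sentence-per-line input format, the function name `candidate_examples`, and the feature names are hypothetical and are not the features used by the maximum entropy model or by SATZ. The resulting examples could then be passed to any classifier.

```python
def candidate_examples(marked_text):
    """Return (features, is_boundary) pairs for tokens ending in '.', '?' or '!'.

    Assumes one sentence per line, so a candidate is a true boundary
    exactly when it is the last token on its line.
    """
    lines = [line.split() for line in marked_text.splitlines() if line.strip()]
    # Flatten to (token, is_last_on_line) pairs so the next token is visible
    # even across a line break.
    tokens = [(tok, j == len(line) - 1) for line in lines for j, tok in enumerate(line)]

    examples = []
    for i, (tok, is_last) in enumerate(tokens):
        if tok[-1] not in ".?!":
            continue                                  # not a candidate
        nxt = tokens[i + 1][0] if i + 1 < len(tokens) else ""
        features = {
            "prefix": tok[:-1].lower(),               # text before the mark
            "prefix_len": len(tok) - 1,
            "prefix_has_period": "." in tok[:-1],     # e.g. "p.m."
            "next_capitalized": nxt[:1].isupper(),
            "next_missing": nxt == "",                # end of document
        }
        examples.append((features, is_last))
    return examples
```

For the two-line input "Dr. Smith arrived." / "He was late." this yields three examples: the period in "Dr." labelled as a non-boundary, and the periods in "arrived." and "late." labelled as boundaries.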

Software

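Examples using Perl Compatible Regular Expressions (PCRE)

A pattern that matches the whitespace at a sentence boundary:

((?<=[a-z0-9][.?!])|(?<=[a-z0-9][.?!]\"))(\s|\r\n)(?=\"?[A-Z])

A split in PHP using preg_split, which also returns the captured punctuation:

$sentences = preg_split("/(?<!\..)([\?\!\.]+)\s(?!.\.)/", $text, -1, PREG_SPLIT_DELIM_CAPTURE);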
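For readers who want to try the first pattern above, here is a hedged adaptation to Python's `re` module. It is not an exact port: the groups are made non-capturing so that `re.split` returns only the sentences, and `(\s|\r\n)` is simplified to `\s+`. The names `BOUNDARY` and `split_on_boundaries` are made up for this example.

```python
import re

# Whitespace that follows a lowercase letter or digit plus '.', '?' or '!'
# (optionally followed by a closing double quote) and precedes an optional
# opening double quote plus a capital letter.
BOUNDARY = re.compile(
    r'(?:(?<=[a-z0-9][.?!])|(?<=[a-z0-9][.?!]"))\s+(?="?[A-Z])'
)


def split_on_boundaries(text):
    """Split text at every position matched by BOUNDARY."""
    return BOUNDARY.split(text)
```

Applied to `It rained all day. "We stayed in," she said. Nobody minded.` this gives three sentences. The pattern has no abbreviation handling, so "See Dr. Jones today." is split into "See Dr." and "Jones today."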

Online use, libraries, and APIs

sent_detector – Java
Lingua-EN-Sentence – Perl
Sentence.pm – Perl
SATZ – An Adaptive Sentence Segmentation System – by David D. Palmer – C

Toolkits that include sentence detection

Apache OpenNLP
FreeLing
Natural Language Toolkit
Stanford NLP
GExp
CogComp-NLP

References

[1] Stamatatos, E.; Fakotakis, N.; Kokkinakis, G. "Automatic Extraction of Rules for Sentence Boundary Disambiguation". University of Patras. Retrieved 2009-01-03.
[2] O'Neil, John. "Doing Things with Words, Part Two: Sentence Boundary Detection". Retrieved 2009-01-03.
[3] Reynar, J. C.; Ratnaparkhi, A. "A Maximum Entropy Approach to Identifying Sentence Boundaries" (PDF). Retrieved 2009-01-03.

External links

Search for 'sentence boundary disambiguation', Google Scholar.

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.

