Semantic similarity

Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content as opposed to lexicographical similarity. Semantic similarity measures are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts or instances, through a numerical description obtained by comparing the information supporting their meaning or describing their nature.[1][2] The term semantic similarity is often confused with semantic relatedness. Semantic relatedness includes any relation between two terms, while semantic similarity only includes "is a" relations.[3] For example, "car" is similar to "bus", but is also related to "road" and "driving".

Computationally, semantic similarity can be estimated by defining a topological similarity, using ontologies to define the distance between terms or concepts. For example, a naive metric for comparing concepts ordered in a partially ordered set and represented as nodes of a directed acyclic graph (e.g., a taxonomy) would be the length of the shortest path linking the two concept nodes. Based on text analyses, semantic relatedness between units of language (e.g., words, sentences) can also be estimated using statistical means such as a vector space model to correlate words and textual contexts from a suitable text corpus. Proposed semantic similarity and relatedness measures are evaluated in two main ways. The first is based on datasets designed by experts and composed of word pairs with estimated degrees of semantic similarity or relatedness. The second is based on integrating the measures into specific applications such as information retrieval, recommender systems, and natural language processing.
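The naive shortest-path metric mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: the toy taxonomy, the function names, and the conversion of a path length into a score with 1 / (1 + d) are all hypothetical choices, not part of any cited measure.

```python
from collections import deque

# Hypothetical toy taxonomy: each concept maps to its parents ("is a" edges).
TAXONOMY = {
    "cat": ["mammal"],
    "dog": ["mammal"],
    "mammal": ["animal"],
    "bird": ["animal"],
    "animal": [],
}


def shortest_path_length(taxonomy, a, b):
    """Return the number of edges on the shortest path between a and b.

    Edge direction is ignored, so the path may go up to a shared
    ancestor and back down. Returns None if the nodes are not connected.
    """
    # Build an undirected adjacency map from the child -> parents map.
    neighbors = {node: set() for node in taxonomy}
    for child, parents in taxonomy.items():
        for parent in parents:
            neighbors[child].add(parent)
            neighbors.setdefault(parent, set()).add(child)

    # Breadth-first search finds the shortest path in an unweighted graph.
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbors[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def path_similarity(taxonomy, a, b):
    """Map a path length d to a score in (0, 1]; 1 / (1 + d) is an assumed choice."""
    d = shortest_path_length(taxonomy, a, b)
    return None if d is None else 1.0 / (1.0 + d)
```

In this toy taxonomy, "cat" and "dog" are two edges apart (via "mammal"), while "cat" and "bird" are three edges apart (via "animal"), so the first pair receives the higher score.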

Terminology

The concept of semantic similarity is more specific than semantic relatedness, as the latter includes concepts such as antonymy and meronymy, while similarity does not.[4] However, much of the literature uses these terms interchangeably, along with terms like semantic distance. In essence, semantic similarity, semantic distance, and semantic relatedness all mean, "How much does term A have to do with term B?" The answer to this question is usually a number between -1 and 1, or between 0 and 1, where 1 signifies extremely high similarity.
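As an illustration of these score ranges, cosine similarity between two vectors lies between -1 and 1 in general, and between 0 and 1 when all vector components are non-negative (as with word counts). Cosine similarity is used here as an assumed example of such a score; the text above does not prescribe it, and the function name is hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)
```

Identical directions give 1, orthogonal vectors give 0, and opposite directions give -1.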

Visualization

An intuitive way of visualizing the semantic similarity of terms is by grouping together terms which are closely related and spacing wider apart the ones which are distantly related. This is also common in practice for mind maps and concept maps.

A more direct way of visualizing the semantic similarity of two linguistic items can be seen with the Semantic Folding approach. In this approach a linguistic item such as a term or a text can be represented by generating a pixel for each of its active semantic features in e.g. a 128 x 128 grid. This allows for a direct visual comparison of the semantics of two items by comparing image representations of their respective feature sets.
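The grid representation described above can be sketched as follows. This is only a toy illustration: the feature positions are invented, and scoring the overlap of two feature sets with the Jaccard index is an assumption made for the example, not a description of how Semantic Folding itself compares items.

```python
GRID_SIZE = 128  # grid side length, following the 128 x 128 example in the text


def to_grid(active_positions):
    """Render a set of (row, col) active features as a 0/1 grid (a binary image)."""
    grid = [[0] * GRID_SIZE for _ in range(GRID_SIZE)]
    for row, col in active_positions:
        if not (0 <= row < GRID_SIZE and 0 <= col < GRID_SIZE):
            raise ValueError(f"position {(row, col)} is outside the grid")
        grid[row][col] = 1
    return grid


def feature_overlap(a, b):
    """Jaccard index of two active-feature sets (assumed comparison score)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical active features for two items.
car = {(0, 0), (0, 1), (5, 5)}
bus = {(0, 0), (0, 1), (9, 9)}
score = feature_overlap(car, bus)  # 2 shared features out of 4 distinct ones
```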

Applications

In biomedical informatics

Semantic similarity measures have been applied and developed in biomedical ontologies.[5][6] They are mainly used to compare genes and proteins based on the similarity of their functions rather than on their sequence similarity, but they are also being extended to other bioentities, such as diseases.[7]

These comparisons can be done using tools freely available on the web:

  • ProteInOn can be used to find interacting proteins, find assigned GO terms and calculate the functional semantic similarity of UniProt proteins and to get the information content and calculate the functional semantic similarity of GO terms.[8]
  • CMPSim provides a functional similarity measure between chemical compounds and metabolic pathways using ChEBI based semantic similarity measures.[9]
  • CESSM provides a tool for the automated evaluation of GO-based semantic similarity measures.[10]

In geoinformatics

Similarity is also applied in geoinformatics to find similar geographic features or feature types:[11]

  • SIM-DL similarity server[12] can be used to compute similarities between concepts stored in geographic feature type ontologies.
  • Similarity Calculator can be used to compute how well related two geographic concepts are in the Geo-Net-PT ontology.[13][14]
  • The OSM semantic network can be used to compute the semantic similarity of tags in OpenStreetMap.[15]

In computational linguistics

Several metrics use WordNet, a manually constructed lexical database of English words. Despite the advantages of having human supervision in constructing the database, the words are not learned automatically, so the database cannot measure relatedness between multi-word terms and its vocabulary is non-incremental.[4][16]

In natural language processing

Natural language processing (NLP) is a field of computer science and linguistics. Sentiment analysis, natural language understanding and machine translation (automatically translating text from one human language to another) are a few of the major areas where semantic similarity is used. For example, given one information resource on the internet, it is often of immediate interest to find similar resources. The Semantic Web provides semantic extensions to find similar data by content and not just by arbitrary descriptors.[17][18][19][20][21][22][23][24][25] Deep learning methods have become an accurate way to gauge semantic similarity between two text passages, in which each passage is first embedded into a continuous vector representation.[26][27][28]

Measures

Topological similarity

There are essentially two types of approaches that calculate topological similarity between ontological concepts:

  • Edge-based: which use the edges and their types as the data source;
  • Node-based: in which the main data sources are the nodes and their properties.

Other measures calculate the similarity between ontological instances:

  • Pairwise: measure functional similarity between two instances by combining the semantic similarities of the concepts they represent
  • Groupwise: calculate the similarity directly, without combining the semantic similarities of the concepts they represent

Some examples:

Edge-based

  • Pekar et al.[29]
  • Cheng and Cline[30]
  • Wu et al.[31]
  • Del Pozo et al.[32]
  • IntelliGO: Benabderrahmane et al.[6]

Node-based

  • Resnik[33]
    • based on the notion of information content. The information content of a concept (term or word) is the negative logarithm of the probability of finding the concept in a given corpus.
    • only considers the information content of the lowest common subsumer (lcs). A lowest common subsumer is a concept in a lexical taxonomy (e.g. WordNet) that has the shortest distance from the two concepts compared. For example, animal and mammal are both subsumers of cat and dog, but mammal is a lower subsumer than animal for them.
  • Lin[34]
    • based on Resnik's similarity.
    • considers the information content of lowest common subsumer (lcs) and the two compared concepts.
  • Maguitman, Menczer, Roinestad and Vespignani[35]
    • Generalizes Lin's similarity to arbitrary ontologies (graphs).
  • Jiang and Conrath[36]
    • based on Resnik's similarity.
    • considers the information content of lowest common subsumer (lcs) and the two compared concepts to calculate the distance between the two concepts. The distance is later used in computing the similarity measure.
  • Align, Disambiguate, and Walk: Random walks on Semantic Networks[37]
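The information-content measures of Resnik, Lin, and Jiang and Conrath listed above can be sketched on a toy taxonomy. The taxonomy, the concept probabilities, the use of base-2 logarithms, and all function names below are assumptions made for illustration; real implementations estimate probabilities from a corpus and work on a full taxonomy such as WordNet.

```python
import math

# Hypothetical toy taxonomy (a tree): concept -> parent, root has no parent.
PARENT = {"cat": "mammal", "dog": "mammal", "mammal": "animal",
          "bird": "animal", "animal": None}

# Hypothetical corpus probabilities; a subsumer is at least as probable
# as the concepts it subsumes, and the root has probability 1.
PROB = {"animal": 1.0, "mammal": 0.5, "bird": 0.25, "cat": 0.125, "dog": 0.25}


def information_content(concept):
    """IC(c) = -log p(c); base 2 is an arbitrary choice here."""
    return -math.log2(PROB[concept])


def subsumers(concept):
    """The concept itself followed by its ancestors, from nearest to root."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def lowest_common_subsumer(a, b):
    """First concept on a's path to the root that also subsumes b."""
    subsumers_of_b = set(subsumers(b))
    for concept in subsumers(a):
        if concept in subsumers_of_b:
            return concept
    raise ValueError("concepts share no subsumer")


def resnik(a, b):
    """Resnik: information content of the lowest common subsumer only."""
    return information_content(lowest_common_subsumer(a, b))


def lin(a, b):
    """Lin: 2 * IC(lcs) / (IC(a) + IC(b))."""
    total = information_content(a) + information_content(b)
    if total == 0:
        return 1.0  # assumed convention when both concepts are the root
    return 2 * resnik(a, b) / total


def jiang_conrath_distance(a, b):
    """Jiang and Conrath distance: IC(a) + IC(b) - 2 * IC(lcs)."""
    return information_content(a) + information_content(b) - 2 * resnik(a, b)
```

With these toy numbers, "cat" and "dog" share the subsumer "mammal" (information content 1 bit), whereas "cat" and "bird" share only the root "animal", whose information content is 0. The sketch stops at the Jiang and Conrath distance and does not fix a particular conversion from distance to similarity.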

Node-and-Relation-Content-based

  • applicable to ontology
  • consider properties (content) of nodes
  • consider types (content) of relations
  • based on eTVSM[38]
  • based on Resnik's similarity[39]

Pairwise

  • maximum of the pairwise similarities
  • composite average in which only the best-matching pairs are considered (best-match average)
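The two pairwise combinations above can be sketched as follows, assuming each instance is described by a list of concepts and that some concept-level similarity function is available. The concept names, the similarity table, and the exact form of the best-match average (the mean of the two directional averages of best matches) are assumptions for illustration.

```python
# Hypothetical concept-level similarities for unordered pairs of concepts.
CONCEPT_SIM = {
    frozenset(("c1", "c2")): 0.5,
    frozenset(("c1", "c3")): 0.2,
    frozenset(("c2", "c3")): 0.4,
}


def concept_similarity(x, y):
    """Toy similarity: 1.0 for identical concepts, table lookup otherwise."""
    return 1.0 if x == y else CONCEPT_SIM[frozenset((x, y))]


def max_pairwise(xs, ys, sim):
    """Maximum of all pairwise concept similarities."""
    return max(sim(x, y) for x in xs for y in ys)


def best_match_average(xs, ys, sim):
    """Average over best-matching pairs only, taken in both directions."""
    best_for_xs = [max(sim(x, y) for y in ys) for x in xs]
    best_for_ys = [max(sim(x, y) for x in xs) for y in ys]
    mean_xs = sum(best_for_xs) / len(best_for_xs)
    mean_ys = sum(best_for_ys) / len(best_for_ys)
    return (mean_xs + mean_ys) / 2


# Two hypothetical instances, each annotated with two concepts.
instance_a = ["c1", "c2"]
instance_b = ["c2", "c3"]
```

Here the maximum is driven entirely by the shared concept "c2", while the best-match average also reflects that "c1" and "c3" have only weaker matches in the other instance.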

Groupwise

Statistical similarity

Statistical similarity approaches can be learned from data, or predefined. Similarity learning can often outperform predefined similarity measures. Broadly speaking, these approaches build a statistical model of documents, and use it to estimate similarity.

  • LSA (Latent semantic analysis)[40][41] (+) vector-based, adds vectors to measure multi-word terms; (−) non-incremental vocabulary, long pre-processing times
  • PMI (Pointwise mutual information) (+) large vocab, because it uses any search engine (like Google); (−) cannot measure relatedness between whole sentences or documents
  • SOC-PMI (Second-order co-occurrence pointwise mutual information) (+) sort lists of important neighbor words from a large corpus; (−) cannot measure relatedness between whole sentences or documents
  • GLSA (Generalized Latent Semantic Analysis) (+) vector-based, adds vectors to measure multi-word terms; (−) non-incremental vocabulary, long pre-processing times
  • ICAN (Incremental Construction of an Associative Network) (+) incremental, network-based measure, good for spreading activation, accounts for second-order relatedness; (−) cannot measure relatedness between multi-word terms, long pre-processing times
  • NGD (Normalized Google distance) (+) large vocab, because it uses any search engine (like Google); (−) can measure relatedness between whole sentences or documents, but the larger the sentence or document, the more ingenuity is required (Cilibrasi & Vitanyi, 2007).[42]
  • TSS (Twitter Semantic Similarity) (+) large vocab, because it uses online tweets from Twitter to compute the similarity; high temporal resolution, which allows it to capture high-frequency events; open source
  • NCD (Normalized Compression Distance)
  • ESA (Explicit Semantic Analysis) based on Wikipedia and the ODP
  • SSA (Salient Semantic Analysis) which indexes terms using salient concepts found in their immediate context.
  • n° of Wikipedia (noW), inspired by the game Six Degrees of Wikipedia, is a distance metric based on the hierarchical structure of Wikipedia. A directed-acyclic graph is first constructed and later, Dijkstra's shortest path algorithm is employed to determine the noW value between two terms as the geodesic distance between the corresponding topics (i.e. nodes) in the graph.
  • VGEM (Vector Generation of an Explicitly-defined Multidimensional Semantic Space) (+) incremental vocab, can compare multi-word terms (−) performance depends on choosing specific dimensions
  • SimRank
  • NASARI:[43] Sparse vector representations constructed by applying the hypergeometric distribution over the Wikipedia corpus in combination with BabelNet taxonomy. Cross-lingual similarity is currently also possible thanks to the multilingual and unified extension.[44]
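Three of the count-based and compression-based measures in the list above (PMI, NGD, and NCD) reduce to short formulas, sketched here. The function names and all counts are hypothetical; in practice the counts would come from a corpus or from search-engine hit counts, and zlib is used only as a stand-in compressor for NCD.

```python
import math
import zlib


def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information: log2( p(x, y) / (p(x) * p(y)) )."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))


def ngd(hits_x, hits_y, hits_xy, index_size):
    """Normalized Google distance from hit counts (all counts must be > 0).

    NGD = (max(log f(x), log f(y)) - log f(x, y))
          / (log N - min(log f(x), log f(y)))
    """
    log_x, log_y, log_xy = math.log(hits_x), math.log(hits_y), math.log(hits_xy)
    return (max(log_x, log_y) - log_xy) / (math.log(index_size) - min(log_x, log_y))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance, with zlib as an assumed compressor.

    NCD = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed length in bytes.
    """
    c_x = len(zlib.compress(x))
    c_y = len(zlib.compress(y))
    c_xy = len(zlib.compress(x + y))
    return (c_xy - min(c_x, c_y)) / max(c_x, c_y)
```

A PMI of 0 means the two terms co-occur exactly as often as expected under independence, and positive values indicate association. For NGD and NCD, smaller values indicate more closely related items.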

Semantics-based similarity

  • Marker Passing: combining lexical decomposition for automated ontology creation with marker passing, the approach of Fähndrich et al. introduces a new type of semantic similarity measure.[45] Here markers are passed from the two target concepts, each carrying an amount of activation. This activation might increase or decrease depending on the weight of the relations with which the concepts are connected. This combines edge-based and node-based approaches and includes connectionist reasoning with symbolic information.
  • Good Common Subsumer-(GCS)-based Semantic Similarity Measure[46]

Gold standards

Researchers have collected datasets with similarity judgements on pairs of words, which are used to evaluate the cognitive plausibility of computational measures. The gold standard to date is an old list of 65 word pairs for which humans have judged the word similarity.[47] For a list of datasets and an overview of the state of the art, see https://www.aclweb.org/.
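One way to compare a computational measure with such human judgements is a rank correlation between the two lists of scores. The sketch below uses Spearman's rank correlation as an assumed choice of statistic; the scores are invented, the function names are hypothetical, and the simple formula used here only applies when there are no tied values.

```python
def ranks(values):
    """Rank of each value (1 = smallest); assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)), no ties allowed."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    if len(set(xs)) != len(xs) or len(set(ys)) != len(ys):
        raise ValueError("tied values are not handled by this sketch")
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical scores for four word pairs.
human_ratings = [3.9, 3.1, 0.4, 1.2]   # invented human similarity judgements
model_scores = [0.9, 0.5, 0.1, 0.3]    # invented output of some measure
agreement = spearman(human_ratings, model_scores)
```

A value near 1 means the measure orders the word pairs the same way the human raters do, and a value near -1 means it orders them in reverse.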

References

  1. Harispe S.; Ranwez S. Janaqi S.; Montmain J. (2015). "Semantic Similarity from Natural Language and Ontology Analysis". Synthesis Lectures on Human Language Technologies. 8:1: 1–254. arXiv:1704.05295. doi:10.2200/S00639ED1V01Y201504HLT027. S2CID 17428739.
  2. Feng Y.; Bagheri E.; Ensan F.; Jovanovic J. (2017). "The state of the art in semantic relatedness: a framework for comparison". Knowledge Engineering Review. 32: 1–30. doi:10.1017/S0269888917000029.
  3. A. Ballatore; M. Bertolotto; D.C. Wilson (2014). "An evaluative baseline for geo-semantic relatedness and similarity". GeoInformatica. 18:4 (4): 747–767. arXiv:1402.3371. Bibcode:2014arXiv1402.3371B. doi:10.1007/s10707-013-0197-8. S2CID 17474023.
  4. Budanitsky, Alexander; Hirst, Graeme (2001). "Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures" (PDF). Workshop on WordNet and Other Lexical Resources, Second Meeting of the North American Chapter of the Association for Computational Linguistics. Pittsburgh.
  5. Guzzi, Pietro Hiram; Mina, Marco; Cannataro, Mario; Guerra, Concettina (2012). "Semantic similarity analysis of protein data: assessment with biological features and issues". Briefings in Bioinformatics. 13 (5): 569–585. doi:10.1093/bib/bbr066. PMID 22138322.
  6. Benabderrahmane, Sidahmed; Smail Tabbone, Malika; Poch, Olivier; Napoli, Amedeo; Devignes, Marie-Domonique. (2010). "IntelliGO: a new vector-based semantic similarity measure including annotation origin". BMC Bioinformatics. 11: 588. doi:10.1186/1471-2105-11-588. PMC 3098105. PMID 21122125.
  7. Köhler, S; Schulz, MH; Krawitz, P; Bauer, S; Dolken, S; Ott, CE; Mundlos, C; Horn, D; et al. (2009). "Clinical diagnostics in human genetics with semantic similarity searches in ontologies". American Journal of Human Genetics. 85 (4): 457–64. doi:10.1016/j.ajhg.2009.09.003. PMC 2756558. PMID 19800049.
  8. "ProteInOn".
  9. "CMPSim".
  10. "CESSM".
  11. Janowicz, K., Raubal, M. and Kuhn, W. (2011). "The semantics of similarity in geographic information retrieval". Journal of Spatial Information Science. 2 (2): 29–57. doi:10.5311/josis.2011.2.3.
  12. "SIM-DL similarity server". 2007: 128–145. CiteSeerX 10.1.1.172.5544.
  13. "Geo-Net-PT Similarity Calculator".
  14. "Geo-Net-PT".
  15. A. Ballatore; D.C. Wilson; M. Bertolotto. "Geographic Knowledge Extraction and Semantic Similarity in OpenStreetMap" (PDF). Knowledge and Information Systems: 61–81.
  16. Kaur, I. & Hornof, A.J. (2005). A Comparison of LSA, WordNet and PMI for Predicting User Click Behavior. Proceedings of the Conference on Human Factors in Computing, CHI 2005. pp. 51–60. doi:10.1145/1054972.1054980. ISBN 978-1-58113-998-3. S2CID 14347026.
  17. Similarity-based Learning Methods for the Semantic Web (C. d'Amato, PhD Thesis)
  18. Gracia, J. & Mena, E. (2008). "Web-Based Measure of Semantic Relatedness" (PDF). Proceedings of the 9th International Conference on Web Information Systems Engineering (WISE '08): 136–150.
  19. Raveendranathan, P. (2005). Identifying Sets of Related Words from the World Wide Web. Master of Science Thesis, University of Minnesota Duluth.
  20. Wubben, S. (2008). Using free link structure to calculate semantic relatedness. In ILK Research Group Technical Report Series, nr. 08-01, 2008.
  21. Juvina, I., van Oostendorp, H., Karbor, P., & Pauw, B. (2005). Towards modeling contextual information in web navigation. In B. G. Bara & L. Barsalou & M. Bucciarelli (Eds.), 27th Annual Meeting of the Cognitive Science Society, CogSci2005 (pp. 1078–1083). Austin, Tx: The Cognitive Science Society, Inc.
  22. Navigli, R., Lapata, M. (2007). Graph Connectivity Measures for Unsupervised Word Sense Disambiguation, Proc. of the 20th International Joint Conference on Artificial Intelligence (IJCAI 2007), Hyderabad, India, January 6-12th, 2007, pp. 1683–1688.
  23. Pirolli, P. (2005). "Rational analyses of information foraging on the Web". Cognitive Science. 29 (3): 343–373. doi:10.1207/s15516709cog0000_20. PMID 21702778.
  24. Pirolli, P., & Fu, W.-T. (2003). "SNIF-ACT: A model of information foraging on the World Wide Web". Lecture Notes in Computer Science. 2702. pp. 45–54. CiteSeerX 10.1.1.6.1506. doi:10.1007/3-540-44963-9_8. ISBN 978-3-540-40381-4.
  25. Turney, P. (2001). Mining the Web for Synonyms: PMI versus LSA on TOEFL. In L. De Raedt & P. Flach (Eds.), Proceedings of the Twelfth European Conference on Machine Learning (ECML-2001) (pp. 491–502). Freiburg, Germany.
  26. Reimers, Nils; Gurevych, Iryna (November 2019). "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks". Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics: 3982–3992. arXiv:1908.10084. doi:10.18653/v1/D19-1410.
  27. Mueller, Jonas; Thyagarajan, Aditya (2016-03-05). "Siamese Recurrent Architectures for Learning Sentence Similarity". Thirtieth AAAI Conference on Artificial Intelligence.
  28. Kiros, Ryan; Zhu, Yukun; Salakhutdinov, Russ R; Zemel, Richard; Urtasun, Raquel; Torralba, Antonio; Fidler, Sanja (2015), Cortes, C.; Lawrence, N. D.; Lee, D. D.; Sugiyama, M. (eds.), "Skip-Thought Vectors" (PDF), Advances in Neural Information Processing Systems 28, Curran Associates, Inc., pp. 3294–3302, retrieved 2020-03-13
  29. Pekar, Viktor; Staab, Steffen (2002). Taxonomy learning. Proceedings of the 19th International Conference on Computational Linguistics. 1. pp. 1–7. doi:10.3115/1072228.1072318.
  30. Cheng, J; Cline, M; Martin, J; Finkelstein, D; Awad, T; Kulp, D; Siani-Rose, MA (2004). "A knowledge-based clustering algorithm driven by Gene Ontology". Journal of Biopharmaceutical Statistics. 14 (3): 687–700. doi:10.1081/BIP-200025659. PMID 15468759. S2CID 25224811.
  31. Wu, H; Su, Z; Mao, F; Olman, V; Xu, Y (2005). "Prediction of functional modules based on comparative genome analysis and Gene Ontology application". Nucleic Acids Research. 33 (9): 2822–37. doi:10.1093/nar/gki573. PMC 1130488. PMID 15901854.
  32. Del Pozo, Angela; Pazos, Florencio; Valencia, Alfonso (2008). "Defining functional distances over Gene Ontology". BMC Bioinformatics. 9: 50. doi:10.1186/1471-2105-9-50. PMC 2375122. PMID 18221506.
  33. Philip Resnik (1995). Chris S. Mellish (ed.). "Using information content to evaluate semantic similarity in a taxonomy". Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI'95). 1: 448–453. arXiv:cmp-lg/9511007. Bibcode:1995cmp.lg...11007R. CiteSeerX 10.1.1.41.6956.
  34. Dekang Lin. 1998. An Information-Theoretic Definition of Similarity. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML '98), Jude W. Shavlik (Ed.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 296-304
  35. Ana Gabriela Maguitman, Filippo Menczer, Heather Roinestad, Alessandro Vespignani: Algorithmic detection of semantic similarity. WWW 2005: 107-116
  36. J. J. Jiang and D. W. Conrath. Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. In International Conference on Research on Computational Linguistics (ROCLING X), pages 9008+, September 1997
  37. M. T. Pilehvar, D. Jurgens and R. Navigli. Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity.. Proc. of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013), Sofia, Bulgaria, August 4–9, 2013, pp. 1341–1351.
  38. Dong, Hai (2009). "A Hybrid Concept Similarity Measure Model for Ontology Environment". On the Move to Meaningful Internet Systems: OTM 2009 Workshops. Lecture Notes in Computer Science. 5872. pp. 848–857. Bibcode:2009LNCS.5872..848D. doi:10.1007/978-3-642-05290-3_103. ISBN 978-3-642-05289-7.
  39. Dong, Hai (2011). "A context-aware semantic similarity model for ontology environments". Concurrency and Computation: Practice and Experience. 23 (2): 505–524. doi:10.1002/cpe.1652. S2CID 412845.
  40. Landauer, T. K.; Dumais, S. T. (1997). "A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge" (PDF). Psychological Review. 104 (2): 211–240. CiteSeerX 10.1.1.184.4759. doi:10.1037/0033-295x.104.2.211.
  41. Landauer, T. K., Foltz, P. W., & Laham, D. (1998). "Introduction to Latent Semantic Analysis" (PDF). Discourse Processes. 25 (2–3): 259–284. CiteSeerX 10.1.1.125.109. doi:10.1080/01638539809545028.
  42. "Google Similarity Distance".
  43. J. Camacho-Collados, M. T. Pilehvar, and R. Navigli. NASARI: a Novel Approach to a Semantically-Aware Representation of Items. In Proceedings of the North American Chapter of the Association of Computational Linguistics (NAACL 2015), Denver, USA, pp. 567-577, 2015
  44. J. Camacho-Collados, M. T. Pilehvar, and R. Navigli. A Unified Multilingual Semantic Representation of Concepts. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL 2015), Beijing, China, July 27–29, pp. 741-751, 2015
  45. Fähndrich J., Weber S., Ahrndt S. (2016) Design and Use of a Semantic Similarity Measure for Interoperability Among Agents. In: Klusch M., Unland R., Shehory O., Pokahr A., Ahrndt S. (eds) Multiagent System Technologies. MATES 2016. Lecture Notes in Computer Science, vol 9872. Springer.
  46. C. d'Amato, S. Staab, and N. Fanizzi. On the influence of description logics ontologies on conceptual similarity. Knowledge Engineering: Practice and Patterns, pages 48-63, 2008 doi:10.1007/978-3-540-87696-0_7
  47. Rubenstein, Herbert, and John B. Goodenough. Contextual correlates of synonymy. Communications of the ACM, 8(10):627–633, 1965.
  48. Rubenstein, Herbert; Goodenough, John B. (1965-10-01). "Contextual correlates of synonymy". Communications of the ACM. 8 (10): 627–633. doi:10.1145/365628.365657. S2CID 18309234.
  49. Miller, George A.; Charles, Walter G. (1991-01-01). "Contextual correlates of semantic similarity". Language and Cognitive Processes. 6 (1): 1–28. doi:10.1080/01690969108406936. ISSN 0169-0965.
  50. "Placing search in context". ACM Transactions on Information Systems (TOIS). 20: 116–131. 2002-01-01. doi:10.1145/503104.503110. S2CID 12956853.

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.