Language resource
In linguistics and language technology, a language resource is a "[composition] of linguistic material used in the construction, improvement and/or evaluation of language processing applications, (...) in language and language-mediated research studies and applications".[1]
According to Bird & Simons (2003),[2] this includes
- data, i.e., "any information that documents or describes a language, such as a published monograph, a computer data file, or even a shoebox full of handwritten index cards. The information could range in content from unanalyzed sound recordings to fully transcribed and annotated texts to a complete descriptive grammar",[2]
- tools, i.e., "computational resources that facilitate creating, viewing, querying, or otherwise using language data",[2] and
- advice, i.e., "any information about what data sources are reliable, what tools are appropriate in a given situation, what practices to follow when creating new data". The latter aspect is usually referred to as "best practices" or "(community) standards".[2]
In a narrower sense, the term language resource is applied specifically to resources that are available in digital form, "encompassing (a) data sets (textual, multimodal/multimedia and lexical data, grammars, language models, etc.) in machine readable form, and (b) tools/technologies/services used for their processing and management."[1]
Typology
As of May 2020, no widely used standard typology of language resources has been established (current proposals include the LREMap,[3] META-SHARE,[4] and, for data, the LLOD classification). Important classes of language resources include:
- data
  - lexical resources, e.g., machine-readable dictionaries,
  - linguistic corpora, i.e., digital collections of natural language data,
  - linguistic databases such as the Cross-Linguistic Linked Data collection,
- tools
  - linguistic annotations and tools for creating such annotations in a manual or semiautomated fashion (e.g., tools for annotating interlinear glossed text such as Toolbox and FLEx, or other language documentation tools),
  - applications for search and retrieval over such data (corpus management systems) and for automated annotation (part-of-speech tagging, syntactic parsing, semantic parsing, etc.),
- metadata and vocabularies
  - vocabularies, repositories of linguistic terminology and language metadata, e.g., META-SHARE (for language resource metadata),[4] the ISO 12620 data category registry (for linguistic features, data structures and annotations within a language resource),[5] or the Glottolog database (identifiers for language varieties and a bibliographical database).[6]
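The three top-level classes listed above can be sketched as a small data model. The following Python is only an illustration under stated assumptions: it does not implement the LREMap, META-SHARE, or LLOD classifications, the names `ResourceClass`, `LanguageResource`, and `group_by_class` are hypothetical, and the subtype labels are informal strings chosen for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class ResourceClass(Enum):
    """Top-level classes from the list above (hypothetical encoding)."""
    DATA = "data"
    TOOL = "tools"
    METADATA = "metadata and vocabularies"


@dataclass(frozen=True)
class LanguageResource:
    name: str
    resource_class: ResourceClass
    subtype: str  # informal label, not taken from any standard


def group_by_class(resources):
    """Return a dict mapping each ResourceClass to the names filed under it."""
    grouped = defaultdict(list)
    for res in resources:
        grouped[res.resource_class].append(res.name)
    return dict(grouped)


# Resources named in the text, filed under the class the text assigns them.
catalogue = [
    LanguageResource("Cross-Linguistic Linked Data", ResourceClass.DATA, "linguistic database"),
    LanguageResource("Toolbox", ResourceClass.TOOL, "annotation tool"),
    LanguageResource("FLEx", ResourceClass.TOOL, "annotation tool"),
    LanguageResource("Glottolog", ResourceClass.METADATA, "vocabulary"),
]

grouped = group_by_class(catalogue)
print(grouped[ResourceClass.TOOL])  # ['Toolbox', 'FLEx']
```

A flat enumeration is used here because the text gives only three top-level classes; a real typology would likely need a richer hierarchy.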
Language resource publication, dissemination and creation
A major concern of the language resource community has been to develop infrastructures and platforms to present, discuss and disseminate language resources. Selected contributions in this regard include:
- a series of International Conferences on Language Resources and Evaluation (LREC),
- the European Language Resources Association (ELRA, EU-based), and the Linguistic Data Consortium (LDC, US-based), which represent commercial hosting and dissemination platforms for language resources,
- the Open Language Archives Community (OLAC), which provides and aggregates language resource metadata,
- the Language Resources and Evaluation Journal (LREJ).[7]
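To illustrate what a language resource metadata record might look like, the sketch below builds a small XML description using plain Dublin Core elements, the element set that Bird & Simons (2003) propose extending. It is a simplification: the OLAC-specific extensions are omitted, the `record` wrapper element and the `build_record` helper are assumptions made for this example, and the field values are invented.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC_NS)


def build_record(fields):
    """Build a <record> element with one dc:* child per (name, value) pair."""
    record = ET.Element("record")  # wrapper name is an assumption
    for name, value in fields:
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record


# Hypothetical resource description; the values are invented for illustration.
record = build_record([
    ("title", "Example field recordings"),
    ("type", "Sound"),
    ("language", "en"),
    ("subject", "unanalyzed sound recordings"),
])

xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Keeping the record as a list of name and value pairs allows repeated elements, which Dublin Core permits.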
The development of standards and best practices for language resources is the subject of several community groups and standardization efforts, including:
- ISO Technical Committee 37: Terminology and other language and content resources (ISO/TC 37), developing standards for all aspects of language resources,
- W3C Community Group Best Practices for Multilingual Linked Open Data (BPMLOD),[8] working on best practice recommendations for publishing language resources as Linked Data or in RDF,
- W3C Community Group Linked Data for Language Technology (LD4LT),[9] working on linguistic annotations on the web and language resource metadata,
- W3C Community Group Ontology-Lexica (OntoLex),[10] working on lexical resources,
- the Open Linguistics working group of the Open Knowledge Foundation, working on conventions for publishing and linking open language resources, developing the Linguistic Linked Open Data cloud,[11]
- the Text Encoding Initiative (TEI),[12] working on XML-based specifications for language resources and digitally edited text.
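As a rough illustration of publishing a lexical resource in RDF, the sketch below writes a single lexical entry as N-Triples, using terms from the OntoLex vocabulary (`LexicalEntry`, `canonicalForm`, `Form`, `writtenRep`). It is a minimal example and not a recommended workflow of any of the groups above: the `http://example.org/lexicon/` namespace is a placeholder, the helper names are hypothetical, and the literal escaping covers only backslashes and double quotes.

```python
ONTOLEX = "http://www.w3.org/ns/lemon/ontolex#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/lexicon/"  # placeholder namespace, not a real dataset


def iri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(text, lang):
    """Format a language-tagged literal (escapes backslash and quote only)."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def lexical_entry_triples(entry_id, written_form, lang):
    """Return N-Triples lines for one entry with a single canonical form."""
    entry = iri(BASE + entry_id)
    form = iri(BASE + entry_id + "#form")
    return [
        f"{entry} {iri(RDF_TYPE)} {iri(ONTOLEX + 'LexicalEntry')} .",
        f"{entry} {iri(ONTOLEX + 'canonicalForm')} {form} .",
        f"{form} {iri(RDF_TYPE)} {iri(ONTOLEX + 'Form')} .",
        f"{form} {iri(ONTOLEX + 'writtenRep')} {literal(written_form, lang)} .",
    ]


triples = lexical_entry_triples("cat", "cat", "en")
print("\n".join(triples))
```

In practice an RDF library would normally handle serialization and escaping; plain strings are used here only to keep the example free of dependencies.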
References
1. LD4LT (2020), The Metashare Ontology as Created by the LD4LT Community Group, W3C Community Group Linked Data for Language Technology (LD4LT), Development branch, version of Mar 10, 2020
2. Bird, Steven; Simons, Gary (2003-11-01). "Extending Dublin Core Metadata to Support the Description and Discovery of Language Resources". Computers and the Humanities. 37 (4): 375–388. arXiv:cs/0308022. Bibcode:2003cs........8022B. doi:10.1023/A:1025720518994. ISSN 1572-8412. S2CID 5969663.
3. Calzolari, N., Del Gratta, R., Francopoulo, G., Mariani, J., Rubino, F., Russo, I., & Soria, C. (2012, May). The LRE Map. Harmonising Community Descriptions of Resources. In LREC (pp. 1084-1089).
4. McCrae, John P.; Labropoulou, Penny; Gracia, Jorge; Villegas, Marta; Rodríguez-Doncel, Víctor; Cimiano, Philipp (2015). Gandon, Fabien; Guéret, Christophe; Villata, Serena; Breslin, John; Faron-Zucker, Catherine; Zimmermann, Antoine (eds.). "One Ontology to Bind Them All: The META-SHARE OWL Ontology for the Interoperability of Linguistic Datasets on the Web". The Semantic Web: ESWC 2015 Satellite Events. Lecture Notes in Computer Science. Cham: Springer International Publishing. 9341: 271–282. doi:10.1007/978-3-319-25639-9_42. ISBN 978-3-319-25639-9.
5. Kemps-Snijders, M., Windhouwer, M., Wittenburg, P., & Wright, S. E. (2008). ISOcat: Corralling data categories in the wild. In 6th International Conference on Language Resources and Evaluation (LREC 2008).
6. Nordhoff, Sebastian (2012), Chiarcos, Christian; Nordhoff, Sebastian; Hellmann, Sebastian (eds.), "Linked Data for Linguistic Diversity Research: Glottolog/Langdoc and ASJP Online", Linked Data in Linguistics: Representing and Connecting Language Data and Language Metadata, Springer, pp. 191–200, doi:10.1007/978-3-642-28249-2_18, ISBN 978-3-642-28249-2
7. "Language Resources and Evaluation". Springer. Retrieved 2020-05-13.
8. "Best Practices for Multilingual Linked Open Data Community Group". www.w3.org. Retrieved 2020-05-13.
9. "Linked Data for Language Technology Community Group". www.w3.org. Retrieved 2020-05-13.
10. "Ontology-Lexica Community Group". www.w3.org. Retrieved 2020-05-13.
11. "Linguistic Linked Open Data".
12. "TEI: Text Encoding Initiative". tei-c.org. Retrieved 2020-05-13.