CorCenCC
CorCenCC (Welsh: Corpws Cenedlaethol Cymraeg Cyfoes), the National Corpus of Contemporary Welsh, is a language resource for Welsh speakers, Welsh learners, Welsh language researchers, and anyone else interested in the Welsh language. CorCenCC is a freely accessible collection of language samples gathered from real-life communication and presented as a searchable online text corpus. The corpus is accompanied by an online teaching and learning toolkit, Y Tiwtiadur[1], which draws directly on the corpus data to provide resources for Welsh language learning at all ages and levels.
Launched in September 2020, CorCenCC is the first corpus of the Welsh language that includes all three aspects of contemporary Welsh: spoken, written and electronically mediated (e-language).
Composition
CorCenCC comprises 11 million words of naturally occurring Welsh (note: the version of the corpus available on the CorCenCC website reports results in tokens rather than words). The creation of CorCenCC was a community-driven project that gave users of Welsh an opportunity to contribute to a resource reflecting how the language is currently used. The dataset therefore offers a snapshot of Welsh across a range of contexts of use, e.g. private conversations, group socialising, business and other work situations, education, the various published media, and public spaces. A full list of the contexts, genres and topics included is available on the project's website.
Conversations were recorded by the research team, and a crowdsourcing app enabled Welsh speakers in the community to record and upload samples of their own language use to the corpus. The published CorCenCC corpus was sampled from a range of different speakers and users of Welsh, from all regions of Wales, of all ages and genders, with a wide range of occupations, and with a variety of linguistic backgrounds (e.g. how they came to speak Welsh), to reflect the diversity of text types and of Welsh speakers found in contemporary Wales.[2]
Tools
- 11 million word Welsh language dataset
- The CorCenCC sampling frame
- Transcription protocols for spoken Welsh
- CyTag[3] (English: /ˈkətæɡ/): a Welsh part-of-speech (POS) tagger with a bespoke tagset, designed and built for the project. It is used in conjunction with the semantic tagger to tag all lexical items in the corpus.
- CySemTag (English: /ˈkəsɛmˌtæɡ/): the Welsh Semantic Tagger[4][5][6], which automatically applies semantic annotation to Welsh language data.
- A Welsh language pedagogic toolkit, Y Tiwtiadur[7] (Welsh pronunciation: [ə tiutˈjadɪr]), which includes:
  - a Gap Filling (Cloze) tool
  - a Word Profiler tool
  - a Word Identification tool
  - a Word Task Creator tool
- Crowdsourcing app[2] for data collection: designed to allow Welsh speakers to record conversations between themselves and others across a range of contexts and to upload them, complete with ethically compliant consent from participants, for inclusion in the final corpus. Crowdsourced corpus data is a relatively new direction that complements more traditional language data collection methods, and is suited to the community spirit that exists among speakers and learners of Welsh and other minoritised languages.
- CorCenCC’s new corpus infrastructure[8], whose query tools provide the following functions:
  - Simple query
  - Complex query
  - Frequency list generation
  - Collocation analysis
  - N-gram analysis
  - Concordancing
  - Keyword analysis
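Three of the query functions listed above (frequency list generation, n-gram analysis and concordancing) can be illustrated with a minimal Python sketch. This is not the CorCenCC infrastructure or its API: the function names, the simplified regular-expression tokenizer and the three made-up sample sentences are all assumptions chosen for illustration, and the real tools work on the tagged corpus rather than on raw strings.

```python
import re
from collections import Counter

# Made-up sample text (not taken from CorCenCC):
# "The cat is sleeping. The dog is running. The cat is eating."
SAMPLE = "Mae'r gath yn cysgu. Mae'r ci yn rhedeg. Mae'r gath yn bwyta."


def tokenize(text):
    """Lowercase the text and return word tokens.

    Simplified assumption: a token is a run of letters, optionally
    followed by an apostrophe and more letters (e.g. "mae'r").
    """
    return re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)?", text.lower())


def frequency_list(tokens):
    """Return (token, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


def ngrams(tokens, n):
    """Count every run of n adjacent tokens.

    Sentence boundaries are ignored here, so n-grams may span two sentences.
    """
    return Counter(zip(*(tokens[i:] for i in range(n))))


def concordance(tokens, node, width=2):
    """Return key-word-in-context rows as (left context, node, right context)."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows


tokens = tokenize(SAMPLE)

if __name__ == "__main__":
    print(frequency_list(tokens)[:3])
    print(ngrams(tokens, 2).most_common(2))
    for left, node, right in concordance(tokens, "gath"):
        print(f"{left:>15} | {node} | {right}")
```

Collocation and keyword analysis build on the same counts: collocation compares how often two words co-occur with how often each occurs alone, and keyword analysis compares a word's frequency in one corpus with its frequency in a reference corpus.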
Funding
The research on which the CorCenCC project was based was funded by the UK Economic and Social Research Council (ESRC) and the Arts and Humanities Research Council (AHRC) as the project "Corpws Cenedlaethol Cymraeg Cyfoes (The National Corpus of Contemporary Welsh): A community driven approach to linguistic corpus construction" (Grant Number ES/M011348/1).
External links
- CorCenCC National Corpus of Contemporary Welsh website
- CorCenCC GitHub
- Y Tiwtiadur, a Welsh language pedagogic toolkit
References
- "Y Tiwtiadur – CorCenCC – National Corpus of Contemporary Welsh". Retrieved 2020-09-18.
- Neale, S.; Spasić, I.; Needs, J.; Watkins, G.; Morris, S.; Fitzpatrick, T.; Marshall, L.; Knight, D. (2017), "The CorCenCC crowdsourcing app: A bespoke tool for the user-driven creation of the national corpus of contemporary Welsh", Corpus Linguistics Conference 2017, Newcastle University
- Neale, S.; Donnelly, K.; Watkins, G.; Knight, D. (May 2018). "Leveraging Lexical Resources and Constraint Grammar for Rule-Based Part-of-Speech Tagging in Welsh". Poster presented at the LREC (Language Resources Evaluation) 2018 Conference. Miyazaki, Japan.
- "UCREL Semantic Analysis System (USAS)". ucrel.lancs.ac.uk. Retrieved 2020-09-18.
- Piao, S.; Rayson, P.; Knight, D.; Watkins, G. (May 2018), "Towards a Welsh Semantic Annotation System", Proceedings of the LREC (Language Resources Evaluation) 2018 Conference, Miyazaki, Japan
- Piao, S.; Rayson, P.; Knight, D.; Watkins, G.; Donnelly, K. (July 2017), "Towards a Welsh Semantic Tagger: Creating Lexicons for A Resource Poor Language", Proceedings of The Corpus Linguistics 2017 Conference, University of Birmingham, Birmingham, UK
- Davies, J.; Thomas, E-M.; Fitzpatrick, T.; Needs, J.; Anthony, L.; Cobb, T.; Knight, D (2020). "Y Tiwtiadur. [Digital Resource]".
- Knight, D.; Loizides, F.; Neale, S.; Anthony, L.; Spasić, I. (2020). "Developing computational infrastructure for the CorCenCC corpus: The National Corpus of Contemporary Welsh". Language Resources and Evaluation: 1–28. doi:10.1007/s10579-020-09501-9.