This website holds bilingual conversational corpora aimed at furthering research on linguistics topics such as code-switching.
* 'Indeterminate' means that the relevant word appears in the dictionaries of both main languages.
These corpora were assembled by the former ESRC Centre for Research on Bilingualism in Theory and Practice at Bangor University by the following researchers: Prof Margaret Deuchar, Dr Diana Carter, Dr Peredur Davies, Dr Kevin Donnelly, Dr Jon Herring, Dr María del Carmen Parafita Couto, Dr Jonathan Stammers, Fraibet Aveledo, Marika Fusser, Lowri Jones, Siân Lloyd-Williams, Myfyr Prys, Elen Robert.
The researchers gratefully acknowledge the support of the Arts and Humanities Research Council (AHRC), the Economic and Social Research Council (ESRC), the Higher Education Funding Council for Wales (HEFCW) and the Welsh Government.
The audio files were transcribed using CHAT conventions. The Siarad corpus was initially glossed manually. Subsequently, all three corpora were glossed automatically.
The glossed text on this website shows the automatic gloss for all three corpora. For the Siarad corpus, the manual gloss is shown as well, followed by the automatic gloss.
For ease of reading, language tags in the CHAT files have been replaced here with superscripts: C = @s:cym (Welsh), E = @s:eng (English), S = @s:spa (Spanish), CE = @s:cym&eng (Welsh and English), CS = @s:cym&spa (Welsh and Spanish), SE = @s:spa&eng (Spanish and English).
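As an illustration of the tag-to-superscript correspondence listed above, the following is a minimal Python sketch, not the code used to build this website. It assumes that a language tag is attached directly to a word (for example `word@s:eng`); the names `TAG_TO_LABEL`, `TAG_PATTERN` and `label_tokens`, and the sample line, are made up for this example.

```python
import re

# Mapping taken from the list above: CHAT language tag -> superscript label.
TAG_TO_LABEL = {
    "cym": "C",       # Welsh
    "eng": "E",       # English
    "spa": "S",       # Spanish
    "cym&eng": "CE",  # Welsh and English
    "cym&spa": "CS",  # Welsh and Spanish
    "spa&eng": "SE",  # Spanish and English
}

# Assumption: the tag follows the word with no space, e.g. "word@s:cym&eng".
TAG_PATTERN = re.compile(r"(\S+?)@s:([a-z]{3}(?:&[a-z]{3})?)")


def label_tokens(line):
    """Return (word, label) pairs for every language-tagged word in a line.

    Words without an @s: tag are skipped; unknown tags are passed through
    unchanged rather than guessed at.
    """
    pairs = []
    for word, tag in TAG_PATTERN.findall(line):
        pairs.append((word, TAG_TO_LABEL.get(tag, tag)))
    return pairs


if __name__ == "__main__":
    # Hypothetical line, not taken from any of the corpora.
    sample = "mae@s:cym hi@s:cym yn nice@s:eng siop@s:cym&eng"
    for word, label in label_tokens(sample):
        print(word, label)
```

Keeping the mapping in a plain dictionary makes it straightforward to extend if a transcript uses a tag combination not listed here.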
The Siarad corpus (version 1) was originally distributed on CD in 2009, with manual glosses only, and is also available on TalkBank. The Siarad data on this website (version 1.5) includes both manual and automatically generated glosses. Version 2 of the Siarad texts, which includes some corrections and emendations and has manual glosses only, is available in a GitHub repository.
All the material on this website is under the Free Software Foundation's General Public License v3 (or later). This means it can be used freely, adapted and extended as required by the user, provided that the same GPLv3 (or later) licence is used for any derived version that is distributed.
In line with the GPLv3 licence, note that permission is NOT granted to use any of the material on this website to train an AI large language model UNLESS all the training data for that LLM is made publicly available.
For further information, please contact in the first instance:
More details about the corpora can be found in: