BangorTalk bilingual conversational corpora

This website holds bilingual conversational corpora intended to further research on linguistic topics such as code-switching.

The Siarad Welsh-English corpus (around 450,000 words, 84% Welsh, 4% English, 13% indeterminate*).
The Patagonia Welsh-Spanish corpus (around 195,000 words, 78% Welsh, 17% Spanish, 5% indeterminate).
The Miami English-Spanish corpus (around 240,000 words, 63% English, 34% Spanish, 3% indeterminate).

* 'Indeterminate' means that the relevant word appears in the dictionaries of both main languages.

These corpora were assembled by the former ESRC Centre for Research on Bilingualism in Theory and Practice at Bangor University by the following researchers: Prof Margaret Deuchar, Dr Diana Carter, Dr Peredur Davies, Dr Kevin Donnelly, Dr Jon Herring, Dr María del Carmen Parafita Couto, Dr Jonathan Stammers, Fraibet Aveledo, Marika Fusser, Lowri Jones, Siân Lloyd-Williams, Myfyr Prys, Elen Robert.

Please cite the corpora as: Deuchar, Margaret and Parafita Couto, Maria del Carmen and Davies, Peredur and Donnelly, Kevin and Aveledo, Fraibet and Carter, Diana and Fusser, Marika and Herring, Jon and Jones, Lowri and Lloyd-Williams, Siân and Prys, Myfyr and Robert, Elen and Stammers, Jonathan (2009-11), Bangortalk Corpora, https://bangortalk.org.uk.

The researchers gratefully acknowledge the support of the Arts and Humanities Research Council (AHRC), the Economic and Social Research Council (ESRC), the Higher Education Funding Council for Wales (HEFCW) and the Welsh Government.

The audio files were transcribed using CHAT conventions. The Siarad corpus was initially glossed manually. Subsequently, all three corpora were glossed automatically.

The glossed text on this website shows the automatic gloss for all three corpora. For the Siarad corpus, the manual gloss is shown as well, followed by the automatic gloss.

For ease of reading, language tags in the CHAT files have been replaced here with superscripts: C = @s:cym (Welsh), E = @s:eng (English), S = @s:spa (Spanish), CE = @s:cym&eng (Welsh and English), CS = @s:cym&spa (Welsh and Spanish), SE = @s:spa&eng (Spanish and English).
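The tag-to-superscript substitution described above can be sketched as follows. This is only an illustrative sketch, not the code used to build this website: the function name, the regular expression, and the choice of HTML `<sup>` markup for the superscript are all assumptions. Only the tag-to-label mapping is taken from the text.

```python
import re

# Mapping taken from the paragraph above: CHAT language tag -> superscript label.
TAG_LABELS = {
    "cym": "C",
    "eng": "E",
    "spa": "S",
    "cym&eng": "CE",
    "cym&spa": "CS",
    "spa&eng": "SE",
}

# Assumed pattern: "@s:" followed by a three-letter code, optionally "&" plus a
# second code. The optional group is tried first, so "@s:cym&eng" is matched as
# a whole rather than as "@s:cym" followed by stray text.
TAG_PATTERN = re.compile(r"@s:([a-z]{3}(?:&[a-z]{3})?)")


def tags_to_superscripts(text):
    """Replace known CHAT language tags with HTML superscript labels (hypothetical helper)."""

    def replace(match):
        label = TAG_LABELS.get(match.group(1))
        if label is None:
            # Unknown tags are left untouched rather than guessed at.
            return match.group(0)
        return f"<sup>{label}</sup>"

    return TAG_PATTERN.sub(replace, text)


# Placeholder tokens, not real corpus data.
print(tags_to_superscripts("word1@s:cym word2@s:cym&eng word3@s:eng"))
```

Matching the whole tag in a single pattern avoids the pitfall of a naive string replacement, where handling `@s:cym` before `@s:cym&eng` would leave a dangling `&eng` behind.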

The Siarad corpus (version 1) was originally distributed on CD in 2009, with manual glosses only, and is also available on TalkBank. The Siarad data on this website (version 1.5) includes both manual and automatically generated glosses. Version 2 of the Siarad texts, which includes some corrections and emendations and has manual glosses only, is available in a GitHub repository.

All the material on this website is licensed under the Free Software Foundation's GNU General Public License v3 (or later). This means it can be used freely, and adapted and extended as required by the user, provided that the same GPLv3 (or later) licence is used for any derived version that is distributed.

In line with the GPLv3 licence, note that permission is NOT granted to use any of the material on this website to train an AI large language model UNLESS all the training data for that LLM is made publicly available.

For further information, please contact in the first instance:

More details about the corpora can be found in:

The monograph Building and Using the Siarad Corpus: Bilingual conversations in Welsh and English (Margaret Deuchar, Peredur Webb-Davies, and Kevin Donnelly), published by John Benjamins (2018). The first part of the book describes the methods used to build the Siarad corpus, while the second part describes various linguistic analyses of the corpus data.

The book chapter "Building bilingual corpora" (Margaret Deuchar, Peredur Davies, Jon Russell Herring, M. Carmen Parafita Couto and Diana Carter). In: E Môn Thomas and I Mennen (Eds.), Advances in the Study of Bilingualism (2014) pp. 93–110. Multilingual Matters.
The paper "Using constraint grammar in the Bangor Autoglosser to disambiguate multilingual spoken text" (Kevin Donnelly and Margaret Deuchar). In: Constraint Grammar Applications: Proceedings of the NODALIDA 2011 Workshop, Riga, Latvia. NEALT Proceedings Series, Tartu.