sgs corpus

What is sgs?

The sgs corpus is a multilingual database comprising annotated spontaneous speech data, gradient acceptability judgments, and social metadata for every participant (Data types). It was collected by Aria Adli in 2004-2005 and 2008. A total of 98 Persian, 54 Spanish-Catalan and 102 French speakers from Tehran, Barcelona and Paris, respectively, participated in the study. The speaker sample is roughly balanced for gender and age.

What does sgs offer?

The combination of different approaches to data collection and annotation supports both variationist sociolinguistic and formal linguistic studies. In addition, because the corpus covers several languages, it lends itself to cross-linguistic study of specific linguistic phenomena.

Current work

All the data has been collected and archived. The spontaneous speech recordings have been transcribed, segmented and time-stamped, and the database currently contains manual annotations of various syntactic properties for Spanish, French and Persian. In addition, the Spanish part contains annotations of topicality. We are currently converting the data into a TEI-compliant format for online publication (Development). The following parts will be made accessible online, starting with Spanish:

  • The anonymized segmented transcriptions of the interviews
  • (Morpho-)syntactic and topic annotation
  • The results of the gradient acceptability judgment tasks
  • Parts of the social metadata (in compliance with ethical standards in research)