Spontaneous speech

Data collection

Spontaneous speech data was collected from every participant through an interview. The recording quality is high enough for acoustic-phonetic studies: the interviews were recorded digitally with high-end microphones at 48 kHz and 24 bit, stored in uncompressed WAV format, with separate channels for interviewer and interviewee and with ambient noise controlled where possible.
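For users who want to confirm that a given recording matches the format described above, a minimal sketch in Python follows. It assumes that both speakers are stored as the two channels of a single stereo file, which the description does not state explicitly; the function name check_recording and the constants are illustrative and not part of the corpus tooling. Depending on the Python version, the standard wave module may reject WAV files written with the extensible header variant.

```python
import wave

# Target values taken from the recording description; the channel count is an
# assumption (interviewer and interviewee as two channels of one file).
EXPECTED_RATE_HZ = 48_000
EXPECTED_SAMPLE_WIDTH_BYTES = 3  # 24 bit = 3 bytes per sample
EXPECTED_CHANNELS = 2


def check_recording(path):
    """Return a list of mismatches between a WAV file and the expected format."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != EXPECTED_RATE_HZ:
            problems.append(f"sample rate is {wav.getframerate()} Hz")
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH_BYTES:
            problems.append(f"sample width is {wav.getsampwidth() * 8} bit")
        if wav.getnchannels() != EXPECTED_CHANNELS:
            problems.append(f"file has {wav.getnchannels()} channel(s)")
    return problems
```

An empty list means the file header agrees with all three expected values; the check reads only the header and says nothing about ambient noise or microphone quality.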
The strategy used to collect speech data in Spanish, French and Persian was to engage the participants in a game task in which they took on the role of a police investigator in a fictional murder case and talked to the doorman of the building in which the body had been found. Although this procedure is not a classic sociolinguistic interview, it still allows the interviewees to choose which aspects of the case they want to talk about. It also has the advantage of requiring the interviewee to produce both declarative and interrogative utterances, the latter being rather rare in free interview situations. The participants were instructed to remain themselves and not to imitate, for example, well-known literary figures.
For Catalan, a map task was used to elicit production data (Material). In this task, one of the participants (the instructor) had to describe the path shown on his version of the map to the participant he was paired with (the instructee), who then had to mark the path on his own version of the map. To stimulate conversation, a few mismatches had been built in: for example, the instructor's map contained a few landmarks that the instructee's map lacked, and vice versa.

Data processing

The interviews were first manually transcribed by native speakers and then segmented and time-stamped sentence by sentence; i.e. each matrix clause and its subordinate clauses (if any) were treated as a single unit. Transcription and time-stamping were carried out with the CLAN editor (MacWhinney, 2000, 2011). The sentences were then annotated. In the first version of the corpus, only sentences uttered by the interviewees were syntactically annotated; the syntactic annotation of the interviewers' utterances is currently being added.
The syntactic annotation contains information on sentence mood, embedded clauses including their type, movement to the periphery and grammatical roles. For each grammatical role, we also annotated some lexical, syntactic and semantic properties such as lexical realization (full NP, quantified NP, NPI, etc.), animacy as well as verb class, aspect, modality, finiteness, person and number. Some language-specific features (differential object marking in Spanish, scrambling in Persian) have also been captured.
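To make the annotation layers easier to picture, the following Python sketch models one annotated sentence as a nested record. It is an illustration only: all class names, field names and example values are hypothetical and do not reproduce the corpus's actual annotation format, and attaching the verb-related properties to the clause rather than to individual roles is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GrammaticalRole:
    role: str                         # e.g. "subject", "direct_object"
    realization: str                  # e.g. "full NP", "quantified NP", "NPI"
    animate: Optional[bool] = None    # None when animacy is not annotated
    moved_to_periphery: bool = False  # constituent moved to the periphery


@dataclass
class Clause:
    clause_type: str                  # e.g. "matrix", "relative", "complement"
    finite: bool = True
    verb_class: Optional[str] = None
    aspect: Optional[str] = None
    modality: Optional[str] = None
    person: Optional[int] = None
    number: Optional[str] = None
    roles: List[GrammaticalRole] = field(default_factory=list)


@dataclass
class Sentence:
    speaker: str                      # "interviewer" or "interviewee"
    start_ms: int                     # time stamp of the segment start
    end_ms: int                       # time stamp of the segment end
    text: str
    mood: str                         # e.g. "declarative", "interrogative"
    matrix: Clause
    embedded: List[Clause] = field(default_factory=list)


def by_mood(sentences, mood, speaker="interviewee"):
    """Select the sentences of one speaker type that have the given mood."""
    return [s for s in sentences if s.mood == mood and s.speaker == speaker]
```

With records of this shape, a query such as by_mood(sentences, "interrogative") would return the interviewees' questions, the utterance type the game task was designed to elicit.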
Aside from the syntactic annotation, the corpus also contains some information-structural analysis. An algorithm for topic annotation has been developed for Spanish and will be adopted for the other languages as well.