
Spontaneous speech


Data collection

Spontaneous speech data was collected from every participant through an interview. For Spanish, French and Persian, the participants were engaged in a game task in which they took on the role of a police investigator in a fictitious murder case and questioned the doorman of the building in which the body had been found. This procedure allows participants to choose which aspects of the case they want to talk about. Moreover, it has the advantage of requiring the interviewee to produce both declarative and interrogative utterances, the latter being rather rare in free interview situations.

For Catalan, a map task was used to elicit production data. To stimulate conversation, the instruction giver's map differed slightly from that of the instruction follower.

Data processing

The interviews were first transcribed manually by native speakers and then segmented and time-stamped sentence by sentence, i.e. each matrix clause and its subordinate clauses (if any) were treated as a single unit. The sentences were then annotated syntactically; the annotation includes information on sentence mood, embedded clauses, type, movement to the periphery and grammatical roles. For each grammatical role, we also annotated a set of lexical, syntactic and semantic properties, such as lexical realization (full NP, quantified NP, NPI, etc.), verb class, aspect, modality, finiteness, person and number. Some language-specific features (differential object marking in Spanish, scrambling in Persian) were also captured.
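To make the shape of such an annotation concrete, the following is a minimal sketch of how one time-stamped sentence unit and its grammatical roles might be represented. It is not the corpus's actual storage format: the class names, field names and label values are all hypothetical, only a few of the annotated properties are modelled, and the example sentence is constructed for illustration rather than taken from the recordings.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class GrammaticalRole:
    """One annotated grammatical role (hypothetical field names)."""
    role: str                     # e.g. "subject", "direct_object"
    lexical_realization: str      # e.g. "full NP", "quantified NP", "NPI"
    person: Optional[int] = None
    number: Optional[str] = None  # e.g. "sg", "pl"


@dataclass
class SentenceUnit:
    """A matrix clause plus its subordinate clauses, treated as one unit."""
    language: str
    start: float                  # time stamp in seconds
    end: float
    text: str
    sentence_mood: str            # e.g. "declarative", "interrogative"
    roles: List[GrammaticalRole] = field(default_factory=list)
    # Language-specific features, e.g. differential object marking in Spanish.
    language_specific: Dict[str, str] = field(default_factory=dict)

    def duration(self) -> float:
        return self.end - self.start


def count_by_mood(units: List[SentenceUnit]) -> Dict[str, int]:
    """Count sentence units per sentence mood."""
    counts: Dict[str, int] = {}
    for unit in units:
        counts[unit.sentence_mood] = counts.get(unit.sentence_mood, 0) + 1
    return counts


# Constructed example, not taken from the corpus.
example = SentenceUnit(
    language="Spanish",
    start=12.5,
    end=14.0,
    text="¿Vio usted a la víctima?",
    sentence_mood="interrogative",
    roles=[
        GrammaticalRole("subject", "pronoun", person=3, number="sg"),
        GrammaticalRole("direct_object", "full NP", person=3, number="sg"),
    ],
    language_specific={"differential_object_marking": "yes"},
)

print(example.duration())        # 1.5
print(count_by_mood([example]))  # {'interrogative': 1}
```

A flat record per sentence unit keeps the time stamps, the sentence-level labels and the per-role properties together, so that counts such as the share of interrogatives per language reduce to simple filters over a list of units.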

Beyond the syntactic annotation, the corpus contains some information-structural analysis. An algorithm for topic annotation has been developed for Spanish and will be adopted for the other languages as well.