Overview
Data Types
Three types of data were collected from each participant, with a total of 98 Persian, 54 Spanish-Catalan and 102 French speakers from Tehran, Barcelona and Paris, respectively.
The speaker sample was roughly balanced in terms of gender and age. The Spanish and Catalan data was collected from fairly balanced bilinguals.
Spontaneous speech
Data collection
Spontaneous speech data was collected from every participant through an interview. For Spanish, French and Persian, participants were engaged in a game task in which they took on the role of a police investigator in a fictitious murder case and talked to the doorman of the building in which the body had been found. This procedure allows participants to choose which aspects of the case they want to talk about. Moreover, it has the advantage of requiring the interviewee to produce both declarative and interrogative utterances, the latter being rather rare in free interview situations.
For Catalan, a map task was used to elicit production data. To stimulate conversation, the instructor’s map differed slightly from the instructee’s.
Data processing
The interviews were first manually transcribed by native speakers and then segmented and time-stamped sentence by sentence, i.e. each matrix clause and its subordinate clauses (if any) were treated as a single unit. The sentences were then annotated syntactically; the annotation includes information on sentence mood, embedded clauses, type, movement to the periphery and grammatical roles. For each grammatical role, lexical, syntactic and semantic properties were also annotated, such as lexical realization (full NP, quantified NP, NPI, etc.), verb class, aspect, modality, finiteness, person and number. Some language-specific features (differential object marking in Spanish, scrambling in Persian) were captured as well.
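As an illustration of the unit of annotation described above, the following is a minimal sketch of how one time-stamped sentence and its annotations might be represented. The class names, field names and label values are hypothetical and chosen for readability; they are not the corpus's actual annotation scheme or file format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GrammaticalRole:
    """One annotated grammatical role (hypothetical field names)."""
    role: str                       # e.g. "subject", "direct_object"
    realization: str                # e.g. "full NP", "quantified NP", "NPI"
    person: Optional[int] = None    # 1, 2 or 3 where applicable
    number: Optional[str] = None    # e.g. "sg", "pl"


@dataclass
class SentenceUnit:
    """A matrix clause plus its subordinate clauses, treated as one unit."""
    speaker_id: str
    language: str
    start: float                    # start time in seconds
    end: float                      # end time in seconds
    text: str
    mood: str                       # e.g. "declarative", "interrogative"
    embedded_clauses: int = 0
    roles: List[GrammaticalRole] = field(default_factory=list)

    def duration(self) -> float:
        """Length of the time-stamped segment in seconds."""
        return self.end - self.start


def interrogatives(units: List[SentenceUnit]) -> List[SentenceUnit]:
    """Select the units annotated as interrogative."""
    return [u for u in units if u.mood == "interrogative"]
```

With records of this kind, a query such as "all interrogative utterances of a given speaker" reduces to a simple filter over the list of units.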
Aside from the syntactic annotation, the corpus also contains some information-structural analysis. An algorithm for topic annotation has been developed for Spanish and will be adopted for other languages as well.
Gradient acceptability judgments
The second data type comes from gradient acceptability judgment tests conducted in all four languages. The stimuli mainly consist of theoretically decisive wh-, cleft-, focus- and scrambling constructions.
Persian and French participants took part in a paper-and-pencil experiment in which they read stimuli on a test sheet and expressed their judgments by drawing a line.
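The text does not say how the drawn lines were turned into numbers. One common approach for line-drawing judgments, assumed here purely for illustration, is to measure each line's length and standardize the lengths within each participant so that ratings are comparable across people who draw longer or shorter lines overall. The function name and the millimetre unit are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def zscore_line_lengths(lengths_mm: List[float]) -> List[float]:
    """Standardize one participant's line lengths (assumed to be in mm).

    Returns z-scores based on the participant's own mean and
    population standard deviation. If all lines have the same
    length, every score is 0.0.
    """
    m = mean(lengths_mm)
    sd = pstdev(lengths_mm)
    if sd == 0:
        return [0.0 for _ in lengths_mm]
    return [(x - m) / sd for x in lengths_mm]
```

A participant who drew lines of 10, 20 and 30 mm would receive a negative score for the first item, zero for the second and a positive score for the third, regardless of the absolute lengths.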
In order to also account for phonological factors, a gradient acceptability judgment test with auditory stimuli was developed for Spanish and Catalan. The test was a randomized, computer-based adaptation of the paper-and-pencil test. Participants read and listened to each test sentence and rated it on a continuous scale.
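The randomization in a computer-based test of this kind can be sketched as follows. This is not the software used for the corpus; it only shows one way, under the assumption of a per-participant seed, to give each participant a different but reproducible stimulus order. The function name, the seed scheme and the stimulus identifiers are hypothetical.

```python
import random
from typing import List, Sequence


def presentation_order(stimulus_ids: Sequence[str],
                       participant_id: str,
                       base_seed: int = 0) -> List[str]:
    """Return a shuffled copy of the stimulus list for one participant.

    The random generator is seeded from the participant ID, so the
    same participant always gets the same order, which makes a
    session reproducible after the fact.
    """
    rng = random.Random(f"{base_seed}-{participant_id}")
    order = list(stimulus_ids)
    rng.shuffle(order)
    return order
```

Seeding per participant rather than globally keeps the orders independent of the sequence in which participants are tested.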
Social metadata
The third type of data presented in the corpus is based on a social metadata questionnaire and observation.
The questionnaire covers the following information:
- General demographic information (sex, age, etc.)
- Occupation of interviewees and their close relatives (father, mother and partner) (according to ISCO, ILO 2008 and PCS, INSEE 2003)
- Education level of interviewees and their close relatives (father, mother and partner)
- Socio-economic indicators such as housing situation, neighborhood, vacation type, interviewees’ income and possessions such as household, basic or luxury goods
- Interviewees’ dwelling occupancy index (INSEE, 2005)
- Interviewees’ lifestyle such as favorite types of leisure, media, clothing, values
Another piece of information concerning the social status of interviewees was collected through the observation of clothing, appearance, speech register, etc.