The online version of sgs is planned to include the transcribed and annotated recordings of spontaneous speech (anonymized in accordance with ethical norms in social research), the results of a gradient acceptability judgment task, and parts of the social metadata, along with the possibility of querying the data across languages and data types.

For instance, the online resource would make it possible to search for and contrast the syntactic, semantic and information-structural features of left-dislocated elements in French and Spanish. To give another example, one could ask whether speakers with different social backgrounds differ in their preference for wh-in-situ questions versus questions involving wh-movement, both in the acceptability rating task and in spontaneous speech.

Current state

Currently, we are working on (i) the conversion of the original, idiosyncratic annotation of the spontaneous speech to a TEI-compliant format, which allows for a standardized representation of linguistic information and is flexible enough to accommodate additional types of annotation in the future, and (ii) the further development of the current annotation levels.

From text to XML-TEI

In the first version of the annotation, linguistic information was added to the transcription in the sgsTree format, a system of in-line tags following each matrix clause that was developed specifically for this project. Processing was carried out by a Perl script that took the idiosyncratic annotation format as input and exported the search results to plain text or to SPSS for further statistical analysis.

As a first step towards an online-accessible, more user-friendly version of the corpus, we began converting the existing annotation to the XML-TEI format with the help of Dr. Laurent Romary (INRIA). The TEI format is useful both as a unified data representation format and for its interoperability with existing corpus annotation and processing software (EXMARaLDA, MMAX2, ANNIS, etc.).
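To illustrate the general idea of such a conversion, here is a minimal Python sketch. It is not the project's actual pipeline: the sgsTree tag syntax is not documented here, so the in-line format used below (`text [key=value; key=value]`), the feature names `clause` and `wh`, the speaker ID `S01` and the function names `parse_line` and `to_tei` are all assumptions made for illustration. The output uses the TEI elements `<u>` for the utterance and `<fs>`/`<f>`/`<symbol>` for the annotation, but has not been validated against a TEI schema.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"
ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace

# ASSUMPTION: a made-up in-line format "<text> [key=value; key=value]".
# The real sgsTree syntax may look entirely different.
LINE_RE = re.compile(r"^(?P<text>.*?)\s*\[(?P<tags>[^\]]*)\]\s*$")


def parse_line(line):
    """Split one annotated line into (utterance text, {feature: value})."""
    match = LINE_RE.match(line)
    if match is None:
        # No trailing tag block: treat the whole line as plain text.
        return line.strip(), {}
    features = {}
    for pair in match.group("tags").split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            features[key.strip()] = value.strip()
    return match.group("text"), features


def to_tei(line, utt_id, speaker):
    """Build a TEI-style <div> holding the utterance and its features."""
    text, features = parse_line(line)
    div = ET.Element(f"{{{TEI_NS}}}div", {"type": "utterance"})
    u = ET.SubElement(
        div,
        f"{{{TEI_NS}}}u",
        {f"{{{XML_NS}}}id": utt_id, "who": f"#{speaker}"},
    )
    u.text = text
    if features:
        # Stand-off feature structure pointing back at the utterance.
        fs = ET.SubElement(div, f"{{{TEI_NS}}}fs", {"corresp": f"#{utt_id}"})
        for name, value in features.items():
            f = ET.SubElement(fs, f"{{{TEI_NS}}}f", {"name": name})
            ET.SubElement(f, f"{{{TEI_NS}}}symbol", {"value": value})
    return div


if __name__ == "__main__":
    # Hypothetical example line, not taken from the corpus.
    example = "Tu vas où ? [clause=interrogative; wh=in-situ]"
    print(ET.tostring(to_tei(example, "u1", "S01"), encoding="unicode"))
```

Keeping the annotation in a separate feature structure rather than in the running text is one way to leave room for further annotation levels later on; other TEI encodings (for example span-based stand-off annotation) would serve the same purpose.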

State of the annotation

So far, annotation has progressed at different rates for the two languages, with Spanish taking the lead. In addition, in the first annotation run only the interviewees' utterances were syntactically coded.

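Language | Number of interviews | Transcription & time-alignment | Annotation of various syntactic properties (interviewee) | Annotation of various syntactic properties (interviewer) | Annotation of topic & theticity (interviewee) | Annotation of topic & theticity (interviewer)
Persian | 98 | yes | yes | no | pilot version | pilot version
French | 102 | yes | yes | no | pilot version | no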