Automatically extending linguistically enriched data

There are many situations in which speakers can choose between two or more structural variants that are equally grammatical but may differ in their acceptability in a given context. In the current project, we explore the use of Bayesian Network Modelling (BNM) for modelling such syntactic variability. At present, we investigate the dative alternation, where speakers and writers can choose between a double object structure (e.g. She gave him the book.) and a prepositional dative structure (e.g. She gave the book to him.). Employing the one-million-word syntactically annotated ICE-GB Corpus, we were able to extract 790 relevant instances. This data set proved too small to allow us to draw conclusions about the suitability of BNM for modelling syntactic variability.

To solve this (very common) data sparseness problem, we developed an approach to automatically extend our data set, employing large corpora without syntactic annotation (the BNC and COCA). First, we created a list of verbs occurring in both constructions and used it to find potentially relevant sentences in the corpora. The sentences found were then (partly) automatically filtered. Next, we wrote algorithms for automatic enrichment with the desired linguistic and discourse information: the animacy, concreteness, definiteness, discourse givenness, pronominality, person and number of the objects (the book and him in the examples), and the semantic class of the verb. We evaluated the automatic labelling against the existing data set of 790 manually annotated instances. The details of the method and the results are presented at the conference.
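The pipeline above (verb-list candidate search followed by automatic feature labelling) can be sketched roughly as below. This is a minimal illustration only, not the project's actual implementation: the verb list is a tiny stand-in for the real list extracted from ICE-GB, and the pronominality and definiteness labellers are crude surface heuristics standing in for the full enrichment algorithms.

```python
import re

# Illustrative stand-in for the list of verbs attested in both
# dative constructions (the project derived this list from ICE-GB).
DATIVE_VERBS = {"give", "gave", "send", "sent", "offer", "offered"}

PRONOUNS = {"me", "you", "him", "her", "it", "us", "them"}
# Pronouns and definite determiners both count as definite here.
DEFINITE_MARKERS = {"the", "this", "that", "these", "those"} | PRONOUNS

def candidate_sentences(sentences):
    """Keep only sentences containing a verb from the alternating-verb list."""
    kept = []
    for sentence in sentences:
        tokens = set(re.findall(r"[a-z']+", sentence.lower()))
        if DATIVE_VERBS & tokens:
            kept.append(sentence)
    return kept

def pronominality(np_head):
    """Label an object NP head as pronominal or nominal."""
    return "pronominal" if np_head.lower() in PRONOUNS else "nominal"

def definiteness(np_tokens):
    """Crude heuristic: an NP is definite if it starts with a definite
    determiner or is itself a pronoun."""
    return "definite" if np_tokens[0].lower() in DEFINITE_MARKERS else "indefinite"
```

For example, `candidate_sentences(["She gave the book to him.", "He slept."])` retains only the first sentence, and `definiteness(["the", "book"])` yields `"definite"`. The real enrichment step would additionally need animacy, concreteness, givenness, person, number, and verb-class labellers, which require lexical resources rather than surface patterns.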

Presented at: The 19th meeting of Computational Linguistics in the Netherlands (CLIN-19), 22 January 2009, University of Groningen, Groningen, the Netherlands.
Slides (pdf; 412kB)
