The more the merrier? How data set size and noisiness affect the accuracy of predicting the dative alternation

In the English dative alternation, speakers and writers choose between the prepositional dative construction (‘I gave the ball to him’) and the double object construction (‘I gave him the ball’). Logistic regression models have been shown to predict over 90% of these choices correctly (e.g. Bresnan et al. 2007).

Collecting dative instances from a corpus and encoding them with the required information is a costly procedure. We therefore developed a semi-automatic approach to do this, consisting of three steps: (1) automatically extracting dative candidates, (2) manually approving or rejecting these candidates, and (3) automatically annotating the approved candidates with the required information. The resulting data sets are noisier than data sets that have been checked completely manually, but the approach can yield much larger data sets.

We compare the effects of data set size and noisiness on the accuracy of predicting the dative alternation. We employ a ‘manual’ set of 2,877 instances in spoken English, taken from Switchboard (Godfrey et al. 1992) by Bresnan et al. (2007) and from ICE-GB (Greenbaum 1996) by Theijssen (2010). In addition, we use a ‘semi-automatic’ set of 7,755 instances from Switchboard, ICE-GB and the BNC (BNC Consortium 2007). We compare the learning curves of various machine learning algorithms by randomly selecting subsets of the data and extending them by 500 instances at each step. We do this for different levels of noisiness, i.e. varying the proportion of ‘semi-automatic’ instances (0%, 25%, 50%, 75%, 100%). The results are presented at the conference.
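The subset selection described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the procedure actually used in the study: the function name `mixed_learning_curve_subsets`, its parameters, the fixed random seed, and the choice to make the subsets nested are all hypothetical, and the instances are stand-in tuples rather than real annotated datives.

```python
import random


def mixed_learning_curve_subsets(manual, semi_auto, noise_fraction,
                                 step=500, seed=0):
    """Yield growing training subsets with a fixed share of semi-automatic data.

    Assumption (not from the abstract): subsets are nested, i.e. each
    larger subset contains the previous one plus newly drawn instances.
    """
    rng = random.Random(seed)
    manual = list(manual)
    semi_auto = list(semi_auto)
    rng.shuffle(manual)      # random selection = prefix of a shuffled list
    rng.shuffle(semi_auto)

    size = step
    while True:
        n_semi = round(size * noise_fraction)
        n_manual = size - n_semi
        # Stop as soon as either pool cannot supply its share.
        if n_semi > len(semi_auto) or n_manual > len(manual):
            break
        yield manual[:n_manual] + semi_auto[:n_semi]
        size += step


# Stand-in data with the set sizes reported in the abstract.
manual_set = [("manual", i) for i in range(2877)]
semi_auto_set = [("semi", i) for i in range(7755)]

for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
    sizes = [len(s) for s in
             mixed_learning_curve_subsets(manual_set, semi_auto_set, fraction)]
    print(f"{fraction:.0%} semi-automatic: {len(sizes)} subsets, "
          f"largest has {sizes[-1]} instances")
```

Each yielded subset would then be used to train a classifier and measure its accuracy on held-out data, giving one point on the learning curve for that noise level. Note that in this sketch the maximum subset size depends on the noise level, because the smaller ‘manual’ pool runs out first at low proportions of ‘semi-automatic’ instances.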

References
BNC Consortium (2007). The British National Corpus, version 3 (BNC XML Edition). Oxford University Computing Services.
Bresnan, Joan, Anna Cueni, Tatiana Nikitina and R. Harald Baayen (2007). Predicting the Dative Alternation. In Bouma, Gerlof, Irene Kraemer and Joost Zwarts (eds.), Cognitive Foundations of Interpretation, Royal Netherlands Academy of Science, Amsterdam, pp. 69-94.
Godfrey, John J., Edward C. Holliman and Jane McDaniel (1992). Switchboard: Telephone speech corpus for research and development. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP-92), pp. 517-20.
Greenbaum, Sidney (ed.) (1996). Comparing English Worldwide: The International Corpus of English. Oxford, U.K.
Theijssen, Daphne (2010). Variable selection in Logistic Regression: The British English dative alternation. In Icard, Thomas and Reinhard Muskens (eds.), Interfaces: Explorations in Logic, Language and Computation. Series: Lecture Notes in Computer Science (subseries: Lecture Notes in Artificial Intelligence), volume 6211, Springer.

Presented at: The 21st meeting of Computational Linguistics In the Netherlands (CLIN-21), 11 February 2011, University College Ghent, Ghent, Belgium.
Slides (pdf; 1,221kB)

