In this paper, we address the problem of selecting the 'optimal' variable subset in a logistic regression model for a medium-sized data set. As a case study, we take the British English dative alternation, where speakers and writers can choose between two (equally grammatical) syntactic constructions to express the same meaning. With the help of 29 explanatory variables taken from the literature, we build two types of models: (1) with the verb sense included as a random effect (verb senses often have a bias towards one of the two variants), and (2) without a random effect. For each type, we build three different models: by including all variables and keeping the significant ones, by sequentially adding the most predictive variable (forward regression), and by sequentially removing the least predictive variable (backward regression). Seeing that the six approaches lead to five different models, we advise researchers to be cautious about basing their conclusions solely on the one 'optimal' model they found.
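The two sequential procedures named in the abstract can be illustrated with a minimal sketch. It is not the implementation used in the paper: the `score` callable stands in for any model-fit measure, the thresholds `min_gain` and `max_loss` are hypothetical stopping rules, and the scores in `TOY_SCORES` are invented purely to show how the two procedures can disagree.

```python
from typing import Callable, FrozenSet, Iterable, List

# A score maps a set of variable names to a goodness-of-fit value
# (higher is better). In practice this would refit a model each time;
# here it is an assumption left to the caller.
Score = Callable[[FrozenSet[str]], float]


def forward_selection(candidates: Iterable[str], score: Score,
                      min_gain: float = 0.0) -> List[str]:
    """Greedily add the variable that improves the score the most."""
    selected: List[str] = []
    remaining = list(candidates)
    best = score(frozenset())
    while remaining:
        trials = [(score(frozenset(selected + [v])), v) for v in remaining]
        top_score, top_var = max(trials, key=lambda t: t[0])
        if top_score - best <= min_gain:
            break  # no candidate improves the fit enough
        selected.append(top_var)
        remaining.remove(top_var)
        best = top_score
    return selected


def backward_elimination(candidates: Iterable[str], score: Score,
                         max_loss: float = 0.0) -> List[str]:
    """Greedily drop the variable whose removal hurts the score the least."""
    selected = list(candidates)
    best = score(frozenset(selected))
    while selected:
        trials = [(score(frozenset(w for w in selected if w != v)), v)
                  for v in selected]
        top_score, drop_var = max(trials, key=lambda t: t[0])
        if best - top_score > max_loss:
            break  # every removal costs too much fit
        selected.remove(drop_var)
        best = top_score
    return selected


# Invented scores (not data from the paper): "a" is the best single
# variable, but "b" and "c" together do better than any set containing "a"
# alone with one of them.
TOY_SCORES = {
    frozenset(): 0.0,
    frozenset({"a"}): 5.0,
    frozenset({"b"}): 3.0,
    frozenset({"c"}): 3.0,
    frozenset({"a", "b"}): 5.5,
    frozenset({"a", "c"}): 5.5,
    frozenset({"b", "c"}): 8.0,
    frozenset({"a", "b", "c"}): 8.0,
}


def toy_score(subset: FrozenSet[str]) -> float:
    return TOY_SCORES[subset]


forward = forward_selection(["a", "b", "c"], toy_score, min_gain=1.0)
backward = backward_elimination(["a", "b", "c"], toy_score, max_loss=1.0)
print(forward, backward)
```

With these invented scores the forward pass stops after choosing `a`, while the backward pass ends with `b` and `c`, which mirrors the kind of disagreement between procedures that the abstract reports.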
Presented at: Student session of the 21st European Summer School in Logic, Language and Information (ESSLLI09 StuS), 20-31 July 2009, Bordeaux, France.
Slides (pdf; 418kB)