This paper addresses the problem of selecting the 'optimal' variable subset in a logistic regression model for a medium-sized data set. As a case study, we take the British English dative alternation, where speakers and writers can choose between two (equally grammatical) syntactic constructions to express the same meaning. With 29 explanatory variables taken from the literature, we build two types of models: one with the verb sense included as a random effect, and one without a random effect. For each type, we build three different models by including all variables and keeping the significant ones, by successively adding the most predictive variable (forward selection), and by successively removing the least predictive variable (backward elimination). Seeing that the six approaches lead to six different variable selections (and thus six different models), we conclude that the selection of the 'best' model requires a substantial amount of linguistic expertise.
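As a rough illustration of the forward selection procedure mentioned above, the following is a minimal sketch in plain Python. It rests on several assumptions that are not taken from the paper: AIC is used as the measure of "most predictive", the model is fitted by simple gradient ascent, there is no random effect, and the toy data and all function names (`fit`, `aic`, `forward_select`) are hypothetical.

```python
import math

def design(X, cols):
    # Build the design matrix: intercept column plus the chosen variables.
    return [[1.0] + [row[c] for c in cols] for row in X]

def fit(rows, y, lr=0.5, steps=2000):
    # Plain gradient ascent on the mean log-likelihood (an assumption of this
    # sketch; statistical packages normally use IRLS / Newton steps instead).
    w = [0.0] * len(rows[0])
    n = len(rows)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for r, t in zip(rows, y):
            z = sum(wi * xi for wi, xi in zip(w, r))
            p = 1.0 / (1.0 + math.exp(-z))
            for j, xj in enumerate(r):
                grad[j] += (t - p) * xj
        w = [wi + lr * g / n for wi, g in zip(w, grad)]
    return w

def log_likelihood(w, rows, y):
    ll = 0.0
    for r, t in zip(rows, y):
        z = sum(wi * xi for wi, xi in zip(w, r))
        # t*z - log(1 + exp(z)), written in a numerically stable form
        ll += t * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return ll

def aic(X, y, cols):
    rows = design(X, cols)
    w = fit(rows, y)
    return 2 * len(w) - 2 * log_likelihood(w, rows, y)

def forward_select(X, y):
    # Start from the intercept-only model and repeatedly add the variable
    # that lowers AIC the most; stop when no candidate improves it.
    selected, remaining = [], list(range(len(X[0])))
    best = aic(X, y, selected)
    while remaining:
        score, c = min((aic(X, y, selected + [c]), c) for c in remaining)
        if score >= best:
            break
        selected.append(c)
        remaining.remove(c)
        best = score
    return selected

# Hypothetical toy data: variable 0 drives the outcome (with two flipped
# labels so the classes are not perfectly separable); variable 1 simply
# alternates in sign and is nearly unrelated to the outcome.
X = [[(i - 19.5) / 10, 1.0 if i % 2 == 0 else -1.0] for i in range(40)]
y = [1 if row[0] > 0 else 0 for row in X]
y[15], y[24] = 1, 0

selected = forward_select(X, y)
print(selected)
```

Backward elimination would mirror this loop: start from all variables and repeatedly drop the one whose removal lowers the criterion the most. Because the two procedures explore different paths through the candidate models, they need not end at the same subset.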
Reference: Daphne Theijssen (2010). Variable selection in Logistic Regression: The British English dative alternation. In Thomas Icard and Reinhard Muskens (eds.), Interfaces: Explorations in Logic, Language and Computation. Lecture Notes in Computer Science (subseries: Lecture Notes in Artificial Intelligence), volume 6211, Springer, pp. 87-101.