Increasing (quantitative) precision: blend production, contexts of alternations, and corpus-linguistic association

DOI: 10.34847/nkl.1e6fa2v2 (Public)
Author: Stefan Th. Gries

In this talk I will discuss three different case studies, all of which are concerned with increasing the degree of precision — especially quantitative precision — of previous work in well-researched areas of linguistic study.

Part 1 is concerned with morphological blends (e.g. breakfast x lunch → brunch). While much traditional research has concluded that blends are formed largely arbitrarily, research over the last 20 years or so has discovered a variety of probabilistic patterns governing the selection of words to blend and the way they are merged into a blend. However, much of this research, including my own, has been based on what are ultimately convenience samples: collections of blends encountered 'in the wild', which may distort the frequencies with which certain patterns are observed. To test the validity of the observational data, Stefanie Wulff and I ran a series of blend production experiments under controlled conditions, and I will report a few very small case studies designed to determine whether certain observational results are confirmed.

Part 2 is a very exploratory and tentative suggestion for the corpus-based analysis of the lexical context of syntactic alternations. Studies of alternations/choices, in particular in corpus linguistics, have become increasingly sophisticated in terms of the statistical methods they employ and the ever larger number of predictors they involve from many different levels of linguistic analysis: phonology, morphosyntax, semantics, pragmatics/discourse, textual, psycholinguistic, sociolinguistic, and others. These predictors are usually contextual in nature, meaning they characterize the context of the choice the language user needs to make or has just made. However, one aspect of the context seems to be crucially underutilized when it comes to modeling speakers' choices: the actual lexical context. In this part, I use recent work in computational psycholinguistics to (i) define a lexical-distribution prototype of each of the (typically, but not necessarily, two) alternants of an alternation and (ii) compute the degree to which each instance of the alternation in question diverges from each of the prototypes. Then, (iii) the values that all choices score on the divergences from each of the prototypes are entered as predictors, in addition to all others, in statistical models to, minimally, serve as a variable that controls for whatever information is contained in the lexical context of an instance of a speaker's choice. I exemplify the approach and its sometimes amazing predictive power on the basis of a choice between near synonyms and two morphosyntactic alternations (preposition stranding vs. pied-piping and of- vs. s-genitives).
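
Steps (i) and (ii) can be illustrated with a minimal sketch. The abstract does not say which divergence measure or smoothing scheme the talk uses, so the choices here are assumptions made only for illustration: Kullback-Leibler divergence, add-one smoothing, and invented toy contexts. All function and variable names are hypothetical.

```python
import math
from collections import Counter


def distribution(contexts, vocab, alpha=1.0):
    """Smoothed relative frequencies of the vocab words in a list of contexts.

    Add-alpha smoothing is an assumption of this sketch; it keeps every
    probability above zero so that the divergence below is always defined.
    """
    counts = Counter(word for context in contexts for word in context)
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits, over a shared vocab."""
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p)


def divergence_features(context, prototypes, vocab):
    """Step (ii): divergence of one instance's context from each prototype."""
    instance = distribution([context], vocab)
    return {label: kl_divergence(instance, proto)
            for label, proto in prototypes.items()}


# Invented toy data: context words observed with each alternant.
training = {
    "s_genitive": [["the", "president", "speech"],
                   ["john", "book"],
                   ["the", "president", "book"]],
    "of_genitive": [["the", "roof", "of", "house"],
                    ["the", "end", "of", "road"],
                    ["top", "of", "hill"]],
}
vocab = sorted({w for ctxs in training.values() for ctx in ctxs for w in ctx})

# Step (i): one lexical-distribution prototype per alternant.
prototypes = {label: distribution(ctxs, vocab)
              for label, ctxs in training.items()}

# Step (ii): score a new instance against both prototypes.
features = divergence_features(["john", "speech"], prototypes, vocab)
print(features)
```

For step (iii), the two values in `features` would be added as numeric predictors alongside the usual contextual predictors in whatever regression or classification model is being fitted.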

Part 3 discusses a variety of potential shortcomings of most of the most widely used association measures in collocational/collostructional research. To address these shortcomings, I then discuss a research program called tupleization, an approach that does away with the usual kinds of information conflation by keeping relevant corpus-linguistic dimensions of information (e.g., frequency, association/contingency, dispersion, and entropy) separate and analyzing them in a multidimensional way; I conclude with pointers towards how these dimensions could, if deemed absolutely necessary, be conflated for the simplest kinds of rankings, as well as strategies for future research.
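
The idea of keeping dimensions separate can be sketched as follows. The abstract does not name the measures used in the talk, so the ones below are stand-ins chosen for simplicity: raw co-occurrence frequency, ΔP(construction | word) for association, and the proportion of corpus parts containing the pair for dispersion. The data and all names are hypothetical.

```python
from collections import Counter
from typing import NamedTuple


class Profile(NamedTuple):
    frequency: int       # co-occurrence count of word and construction
    association: float   # Delta P(construction | word), a stand-in measure
    dispersion: float    # share of corpus parts containing the pair


def tupleize(parts, construction):
    """Return one Profile per word attested in the given construction.

    `parts` is a list of corpus parts, each a list of (word, construction)
    tokens. Nothing is collapsed into a single score.
    """
    word_freq = Counter()
    pair_freq = Counter()
    parts_with_pair = Counter()
    cx_total = 0
    n_tokens = 0
    for part in parts:
        seen_here = set()
        for word, cx in part:
            n_tokens += 1
            word_freq[word] += 1
            if cx == construction:
                cx_total += 1
                pair_freq[word] += 1
                seen_here.add(word)
        for word in seen_here:
            parts_with_pair[word] += 1

    profiles = {}
    for word, a in pair_freq.items():
        rest = n_tokens - word_freq[word]
        p_cx_given_word = a / word_freq[word]
        p_cx_given_other = (cx_total - a) / rest if rest else 0.0
        profiles[word] = Profile(
            frequency=a,
            association=p_cx_given_word - p_cx_given_other,
            dispersion=parts_with_pair[word] / len(parts),
        )
    return profiles


# Invented toy corpus with three parts; "ditr" and "prep" are made-up labels.
corpus_parts = [
    [("give", "ditr"), ("give", "ditr"), ("send", "prep"), ("tell", "ditr")],
    [("give", "ditr"), ("send", "prep"), ("send", "ditr"), ("tell", "ditr")],
    [("give", "prep"), ("send", "prep"), ("give", "ditr")],
]
profiles = tupleize(corpus_parts, "ditr")
for word, profile in sorted(profiles.items()):
    print(word, profile)
```

In this toy corpus, "give" has the highest frequency and dispersion while "tell" has the higher association value, which is the kind of difference a single conflated score would hide.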

File

ID: 10.34847/nkl.1e6fa2v2/f1bc72afd646f38805455a87b4795d11715e3c24

Embed URL: https://api.nakala.fr/embed/10.34847/nkl.1e6fa2v2/f1bc72afd646f38805455a87b4795d11715e3c24

Download URL: https://api.nakala.fr/data/10.34847/nkl.1e6fa2v2/f1bc72afd646f38805455a87b4795d11715e3c24

License
Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)
Cite
Gries, Stefan Th. (2021) "Increasing (quantitative) precision: blend production, contexts of alternations, and corpus-linguistic association" [Audiovisual] NAKALA. https://doi.org/10.34847/nkl.1e6fa2v2
Deposited by Revue OpenEdition CogniTextes on 2 March 2021
nakala:title xsd:string English Increasing (quantitative) precision: blend production, contexts of alternations, and corpus-linguistic association
nakala:creator Stefan Th. Gries
nakala:created xsd:string 2020-12-15
nakala:type xsd:anyURI Video
nakala:license xsd:string Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)
dcterms:language xsd:string English
dcterms:subject xsd:string AFlico
xsd:string CogniTextes
xsd:string Cognitive linguistics
xsd:string Statistics
xsd:string Corpus linguistics
xsd:string Lecture Series