Increasing (quantitative) precision: blend production, contexts of alternations, and corpus-linguistic association

Stefan Th. Gries

doi:10.34847/nkl.1e6fa2v2

In this talk I will discuss three different case studies, all of which are concerned with increasing the degree of precision — especially quantitative precision — of previous work in well-researched areas of linguistic study.

Part 1 is concerned with morphological blends (e.g. breakfast x lunch → brunch). While much traditional research has concluded that blends are formed largely arbitrarily, research over the last 20 years or so has discovered a variety of probabilistic patterns governing the selection of words to blend and the way they are merged into a blend. However, much of this research, including my own, has been based on what are ultimately convenience samples: collections of blends encountered in 'the wild', which may distort the frequencies with which certain patterns are observed. To test the validity of the observational data, Stefanie Wulff and I ran a series of blend production experiments under controlled conditions, and I will report a few very small case studies designed to determine whether certain observational results are confirmed or not.

Part 2 is a very exploratory and tentative suggestion for the corpus-based analysis of the lexical context of syntactic alternations. Studies of alternations/choices, in particular in corpus linguistics, have become increasingly sophisticated in terms of the statistical methods they employ and the ever larger number of predictors they involve from many different levels of linguistic analysis: phonological, morphosyntactic, semantic, pragmatic/discoursal, textual, psycholinguistic, sociolinguistic, and others. These predictors are usually contextual in nature, meaning they characterize the context of the choice the language user needs to make or has just made. However, one aspect of the context seems to be crucially underutilized when it comes to modeling speakers' choices: the actual lexical context. In this part, I use recent work in computational psycholinguistics to (i) define a lexical-distribution prototype of each of the (typically, but not necessarily, two) alternants of an alternation and (ii) compute the degree to which each instance of the alternation in question diverges from each of the prototypes. Then, (iii) the divergence values of each choice from each of the prototypes are entered into statistical models as predictors alongside all others, to serve, minimally, as a variable that controls for whatever information is contained in the lexical context of an instance of a speaker's choice. I exemplify the approach and its sometimes amazing predictive power on the basis of a choice between near synonyms and two morphosyntactic alternations (preposition stranding vs. pied-piping and of- vs. s-genitives).
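
Steps (i) and (ii) can be pictured with the following minimal sketch. It is an illustration under loud assumptions, not the procedure used in the talk: the abstract does not name the divergence measure, so the sketch assumes Kullback-Leibler divergence with add-one smoothing, and the toy contexts as well as the function names `prototype`, `kl_divergence`, and `divergence_features` are hypothetical.

```python
from collections import Counter
from math import log2


def prototype(contexts, vocab):
    """Add-one-smoothed relative frequencies of the context words
    pooled over all instances of one alternant (assumed definition)."""
    counts = Counter(w for ctx in contexts for w in ctx)
    total = sum(counts.values()) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits.
    Every word with non-zero probability in p must be a key of q."""
    return sum(pv * log2(pv / q[w]) for w, pv in p.items())


def divergence_features(context, prototypes):
    """Divergence of one instance's context distribution from each prototype."""
    counts = Counter(context)
    n = sum(counts.values())
    p = {w: c / n for w, c in counts.items()}
    return {name: kl_divergence(p, proto) for name, proto in prototypes.items()}


# Invented toy contexts, for illustration only.
stranding = [["who", "did", "you", "talk", "to"],
             ["what", "are", "you", "looking", "at"]]
pied_piping = [["to", "whom", "did", "you", "speak"],
               ["at", "which", "he", "was", "looking"]]

# The vocabulary covers every word in the toy data, so smoothing leaves no zeros.
vocab = sorted({w for ctx in stranding + pied_piping for w in ctx})
prototypes = {
    "stranding": prototype(stranding, vocab),
    "pied_piping": prototype(pied_piping, vocab),
}

features = divergence_features(["who", "did", "you", "talk", "to"], prototypes)
print(features)
```

For step (iii), the two values in `features` would be added as columns next to the other predictors of a regression model. In a real analysis the instance being scored would presumably be held out when its own alternant's prototype is computed; the sketch skips that for brevity.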

Part 3 discusses a variety of potential shortcomings of most of the most widely used association measures in collocational/collostructional research. To address these shortcomings, I then discuss a research program called tupleization, an approach that does away with the usual kinds of information conflation by keeping relevant corpus-linguistic dimensions of information (e.g., frequency, association/contingency, dispersion, and entropy) separate and analyzing them in a multidimensional way. I conclude with pointers towards how these dimensions could, if deemed absolutely necessary, be conflated for the simplest kinds of rankings, as well as strategies for future research.
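
The idea of keeping dimensions separate can be pictured with a small sketch. Everything specific in it is an assumption made for illustration and is not taken from the talk: pointwise mutual information stands in for association, the share of corpus parts containing at least one co-occurrence stands in for dispersion, and the counts, the class `CollocateProfile`, and the function `profile` are hypothetical.

```python
from math import log2
from typing import NamedTuple


class CollocateProfile(NamedTuple):
    frequency: int      # co-occurrence count of word and construction
    association: float  # assumed measure: pointwise mutual information
    dispersion: float   # assumed measure: share of corpus parts with a co-occurrence


def profile(cooc_per_part, word_freq, cx_freq, corpus_size):
    """Build one tuple from per-part co-occurrence counts and marginal frequencies."""
    freq = sum(cooc_per_part)
    expected = word_freq * cx_freq / corpus_size
    pmi = log2(freq / expected) if freq > 0 else float("-inf")
    share = sum(1 for c in cooc_per_part if c > 0) / len(cooc_per_part)
    return CollocateProfile(freq, pmi, share)


# Invented counts for two hypothetical collocates across four corpus parts.
a = profile([5, 5, 5, 5], word_freq=100, cx_freq=200, corpus_size=10_000)
b = profile([20, 0, 0, 0], word_freq=100, cx_freq=200, corpus_size=10_000)
print(a)
print(b)
```

The two hypothetical collocates have the same frequency and the same association value, yet one is spread evenly over the corpus parts and the other occurs in a single part. A single conflated score could rank them identically, whereas the tuple keeps the difference visible.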

File

Stefan Th. Gries.mp4

Keywords

AFlico, CogniTextes, Cognitive linguistics, Statistics, Corpus linguistics, Lecture Series

License

Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)

Collection

AFLiCo Lecture Series 2020

Cite

Gries, Stefan Th. (2021) «Increasing (quantitative) precision: blend production, contexts of alternations, and corpus-linguistic association» [Audiovisual] NAKALA. https://doi.org/10.34847/nkl.1e6fa2v2

Deposited by Revue OpenEdition CogniTextes on 2 March 2021

nakala:title	xsd:string	English	Increasing (quantitative) precision: blend production, contexts of alternations, and corpus-linguistic association
nakala:creator			Stefan Th. Gries
nakala:created	xsd:string		2020-12-15
nakala:type	xsd:anyURI		Video
nakala:license	xsd:string		Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)
dcterms:description	xsd:string	English	In this talk I will discuss three different case studies, all of which are concerned with increasing the degree of precision, especially quantitative precision, of previous work in well-researched areas of linguistic study. Part 1 is concerned with morphological blends (e.g. breakfast x lunch → brunch). While much traditional research has concluded that blends are formed largely arbitrarily, research over the last 20 years or so has discovered a variety of probabilistic patterns governing the selection of words to blend and the way they are merged into a blend. However, much of this research, including my own, has been based on what are ultimately convenience samples: collections of blends encountered in 'the wild', which may distort the frequencies with which certain patterns are observed. To test the validity of the observational data, Stefanie Wulff and I ran a series of blend production experiments under controlled conditions, and I will report a few very small case studies designed to determine whether certain observational results are confirmed or not. Part 2 is a very exploratory and tentative suggestion for the corpus-based analysis of the lexical context of syntactic alternations. Studies of alternations/choices, in particular in corpus linguistics, have become increasingly sophisticated in terms of the statistical methods they employ and the ever larger number of predictors they involve from many different levels of linguistic analysis: phonological, morphosyntactic, semantic, pragmatic/discoursal, textual, psycholinguistic, sociolinguistic, and others. These predictors are usually contextual in nature, meaning they characterize the context of the choice the language user needs to make or has just made. However, one aspect of the context seems to be crucially underutilized when it comes to modeling speakers' choices: the actual lexical context. In this part, I use recent work in computational psycholinguistics to (i) define a lexical-distribution prototype of each of the (typically, but not necessarily, two) alternants of an alternation and (ii) compute the degree to which each instance of the alternation in question diverges from each of the prototypes. Then, (iii) the divergence values of each choice from each of the prototypes are entered into statistical models as predictors alongside all others, to serve, minimally, as a variable that controls for whatever information is contained in the lexical context of an instance of a speaker's choice. I exemplify the approach and its sometimes amazing predictive power on the basis of a choice between near synonyms and two morphosyntactic alternations (preposition stranding vs. pied-piping and of- vs. s-genitives). Part 3 discusses a variety of potential shortcomings of most of the most widely used association measures in collocational/collostructional research. To address these shortcomings, I then discuss a research program called tupleization, an approach that does away with the usual kinds of information conflation by keeping relevant corpus-linguistic dimensions of information (e.g., frequency, association/contingency, dispersion, and entropy) separate and analyzing them in a multidimensional way. I conclude with pointers towards how these dimensions could, if deemed absolutely necessary, be conflated for the simplest kinds of rankings, as well as strategies for future research.
dcterms:language	xsd:string		English
dcterms:subject	xsd:string		AFlico
	xsd:string		CogniTextes
	xsd:string		Cognitive linguistics
	xsd:string		Statistics
	xsd:string		Corpus linguistics
	xsd:string		Lecture Series