Abstract

Utterance fluency, as a dimension of L2 speaking performance, is assumed to correlate with L2 proficiency, and the ability to measure it objectively and precisely is key for both testing and research. Many utterance fluency metrics have been proposed, compared, and validated in terms of how well they discriminate or predict proficiency levels, capture short-term L2 development, or correlate with perceived fluency (e.g., Segalowitz et al., 2017; Tavakoli et al., 2020). However, the precise operationalization of these measurements is rarely discussed in detail and often diverges across studies (Dumont, 2018). While some issues, such as the silent pause threshold, have been examined closely (de Jong & Bosker, 2013), others, such as pruning, have rarely been discussed in depth.

The present study attempts to (semi-)automate the testing and computation of multiple variants of L2 fluency metrics, in order to compare how well they predict external proficiency estimates, including within a limited proficiency range, and how sensitive they are to very short-term developmental changes.

We used a computer-delivered oral interview to record 215 young low-intermediate learners of French in a pretest and a posttest separated by 1 to 3 weeks and, for the experimental group, by a short pedagogical intervention based on interactions in a dialogue-based computer-assisted language learning game. The resulting 12,000 audio files were transcribed by automatic speech recognition, manually corrected, and annotated for a series of “disfluencies”. We computed both signal-based (e.g., via de Jong et al., 2020) and transcription-based fluency metrics, in as many variants as possible in terms of pruning (e.g., do L1 words count? proper nouns? self-talk?) and normalization (words, syllables, silent pauses…).
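To make the operationalization concrete, the following is a minimal sketch, not the study’s actual pipeline, of how transcription-based metrics can be computed from a time-aligned, annotated transcript. The Token fields, tag names, and the 250 ms pause threshold are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Token:
    text: str
    start: float                     # onset in seconds
    end: float                       # offset in seconds
    n_syll: int                      # syllable count of this token
    tags: frozenset = frozenset()    # e.g. {"filled_pause"}, {"L1"}, {"proper_noun"}

PAUSE_THRESHOLD = 0.25               # assumed silent-pause threshold, in seconds

def fluency_metrics(tokens, total_time):
    """Speech rate, articulation rate, and mean length of runs for one response."""
    syllables = sum(t.n_syll for t in tokens)
    phonation_time = sum(t.end - t.start for t in tokens)

    # Split the token sequence into pause-to-pause runs at silent pauses >= threshold.
    runs = [[tokens[0]]]
    for prev, tok in zip(tokens, tokens[1:]):
        if tok.start - prev.end >= PAUSE_THRESHOLD:
            runs.append([])
        runs[-1].append(tok)

    return {
        "speech_rate": syllables / total_time,             # syll./s over total response time
        "articulation_rate": syllables / phonation_time,   # syll./s over phonation time only
        "mean_length_of_runs": syllables / len(runs),       # mean syllables per run
    }
```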

We evaluate how well each metric’s variants correlate with external proficiency estimates, including a vocabulary size test, how well they detect changes over such a short timeframe, and how reliable the fully automated metrics are.
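A minimal sketch of the statistics reported below, under assumed variable names: MAE, RMSE, and R² between an automated count and the manual “truth”, Cronbach’s α across interview items, and Pearson’s r against the vocabulary size test.

```python
import numpy as np
from scipy import stats

def agreement(auto, manual):
    """Accuracy and consistency of an automated count against the manual reference."""
    auto, manual = np.asarray(auto, float), np.asarray(manual, float)
    err = auto - manual
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    r2 = 1 - np.sum(err ** 2) / np.sum((manual - manual.mean()) ** 2)
    return mae, rmse, r2

def cronbach_alpha(item_scores):
    """item_scores: (n_participants, n_items) array of per-item metric values."""
    x = np.asarray(item_scores, float)
    k = x.shape[1]
    return k / (k - 1) * (1 - x.var(axis=0, ddof=1).sum() / x.sum(axis=1).var(ddof=1))

# Predictive power, e.g. correlation of a learner-level syllable count with vocabulary size:
# r_syll_vs, p = stats.pearsonr(syllable_count_per_learner, vocab_size_score)
```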

Methods

Results

Automated estimators vs. Manual annotation

Raw metrics | MAE (accur.) | RMSE (accur.) | R² (consist.) | Cron. α (intern. consist.) | r(#Syll., VS) (pred. power)
Nb of syllables (auto count, manual transcript), “truth” | – | – | – | .92 | .373
vs. Google ASR transcript (auto count) | 1.23 | 2.93 | .874 | .91 | .370
vs. Syllable Nuclei Praat script (de Jong et al.) | 4.25 | 7.60 | .585 | .88 | .154
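For context on the signal-based estimator, the sketch below illustrates the general idea of syllable-nucleus counting (voiced intensity peaks). It is a simplified illustration, not the actual Syllable Nuclei Praat script of de Jong et al.; the thresholds and file path are placeholders.

```python
import numpy as np
import parselmouth                       # Python interface to Praat
from scipy.signal import find_peaks

def count_syllable_nuclei(wav_path, dip_db=2.0, floor_below_max_db=25.0):
    snd = parselmouth.Sound(wav_path)
    intensity = snd.to_intensity(minimum_pitch=75)
    pitch = snd.to_pitch()

    db = intensity.values[0]             # intensity contour in dB
    times = intensity.xs()               # corresponding time points

    # Candidate nuclei: intensity peaks above a floor, with a minimal dip around them.
    floor = db.max() - floor_below_max_db
    peaks, _ = find_peaks(db, height=floor, prominence=dip_db)

    # Keep only voiced peaks (a pitch value is defined at the peak time).
    voiced = [i for i in peaks if not np.isnan(pitch.get_value_at_time(times[i]))]
    return len(voiced)
```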

Pruning

Number of syllables: variant / pruning | M | SD | Cron. α | r(#Syll., VS) | r(SpeechRate, VS)
Unpruned (manual transcript) | 13.4 | 5.44 | .92 | .373 | .579
‘Meant’ pruning: –disfluencies (filled pauses, repetitions, self-corrections, meta) | 12.2 | 5.10 | .92 | .443 | .597
‘Meant’, L2-only pruning: –L1/lingua franca words | 12.1 | 5.07 | .93 | .459 | .603
‘Meant’, L2-only, –proper nouns | 12.0 | 5.02 | .93 | .473 | .609
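The pruning tiers in the table above could be implemented as cumulative token filters, as in the hypothetical sketch below. It reuses the Token fields from the earlier sketch; the tag names are assumptions, not the study’s actual annotation scheme.

```python
# Hypothetical cumulative pruning tiers, mirroring the rows of the table above.
DISFLUENCY_TAGS = {"filled_pause", "repetition", "self_correction", "meta_comment"}

def pruned_syllable_count(tokens, meant=False, l2_only=False, no_proper_nouns=False):
    kept = list(tokens)
    if meant:                  # 'meant' pruning: drop disfluencies
        kept = [t for t in kept if not (t.tags & DISFLUENCY_TAGS)]
    if l2_only:                # drop L1 / lingua franca words
        kept = [t for t in kept if "L1" not in t.tags]
    if no_proper_nouns:        # drop proper nouns
        kept = [t for t in kept if "proper_noun" not in t.tags]
    return sum(t.n_syll for t in kept)
```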

Best predictors of L2 proficiency

Semi-auto vs. fully automated composite metrics
Metric | Semi-auto, pruned | Fully auto*, ASR-based count | Fully auto*, signal-based (de Jong) | Fully auto, signal alt.
Length of runs | .628 | .588 | .479 | –
Speech rate | .609 | .585 | .461 | –
Articulation rate | .524 | .496 | .392 | .172
Syllable duration⁻¹ | .473 | .283 | .473 | .106
Number of syllables | .473 | .370 | .154 | –
Number of words | .463 | .355 | – | –
Silent pausing rate⁻¹ | – | – | .409 | .428
Duration of runs | – | – | .338 | .352
Speech-time ratio | – | – | .269 | .305

Developmental sensitivity

References