All AI-powered test scores are grounded in extensive research, providing a reliable basis for evaluating the test-taker's language proficiency.
Our assessment evaluates a test-taker’s English language skills in real-life situations, especially in professional and academic contexts where English communication is essential.

To achieve this, the AI-powered test assesses the accuracy, variety, and clarity of spoken and written English, covering both linguistic and some pragmatic aspects.

Our adaptive listening and reading tests are based on authentic audio and text passages.

The test is structured to enable the use of scores for evaluating the test-taker's language proficiency. All score interpretations, as well as the tests themselves, are built on a strong theoretical foundation.
                        
 
Recently, Large Language Models (LLMs) have proliferated and become more powerful. Given these models’ ability to generate language and interact with people, a seemingly logical conclusion is that they should be able to evaluate language proficiency and could potentially replace more established assessment methods. This conclusion is flawed for several reasons. This paper explores why LLMs cannot and should not be used in place of effective language assessment, especially in high-stakes contexts. Although language assessments and LLMs both deal with language, they do so in vastly different ways, even in the case of language assessments that integrate Artificial Intelligence (AI). Specifically, this paper explains why LLMs are not suitable for measuring English language proficiency by themselves and why, because of fundamental aspects of their design, they are unlikely to become more suited to this task. In contrast, the paper examines why the Speaknow Assessment, with its vertical, integrated machine learning solution, is effective at assessing language proficiency.
This paper will address the following aspects of LLMs and contrast them with the Speaknow solution:
The first, and best, reason not to rely on LLMs to rate language proficiency is their inability to perform the task. Out of all of the LLMs tested, only three were even willing to attempt it. According to one LLM, "The final CEFR level for the speaker based on the transcripts provided is not possible for me to evaluate or give you a score on. For more information and key take aways of CEFR you should visit an educational website." Another stated, “As an AI language model, I cannot provide a definitive CEFR score. . . This typically involves trained human assessors who can evaluate the performance against specific CEFR descriptors.” A third stated, “I would not recommend using me to grade a CEFR-aligned language test instead of established methods.” In fact, only one engine recommended using itself for CEFR assessment. However, this LLM was overconfident in its abilities: in our tests, it provided the least accurate results across all measures.
LLMs are able only to use transcription to rate language, rather than audio or multimodal input. This poses difficulties in rating all aspects of language proficiency, but particularly in rating fluency and pronunciation, which require analysis of acoustic features rather than just transcription texts. This limitation is reflected in the even poorer performance in scoring these features. Speaknow is able to use audio and video to rate these features, resulting in richer and more accurate analysis.
Inconsistency is an inherent characteristic of LLMs. This is because of the fundamentally probabilistic nature of LLMs. In order to generate responses, the model calculates probabilities for all possible tokens. The actual selection of the next token involves an element of randomness, even if slight. In order to generate these probabilities, LLMs depend on sampling of available data. Responses are also influenced by the prompt and by previous responses. What this means for practical use is that LLMs are built to return different responses, even when given the same data and prompt. This feature allows LLMs to sound more realistic and appear more creative.
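The sampling step described above can be illustrated with a minimal Python sketch. It is not the decoding code of any particular LLM: the candidate labels, the raw scores (`logits`), and the function names are all hypothetical, chosen only to show how drawing the next token from a probability distribution can return different answers for identical input, while always picking the most probable token cannot.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities that sum to 1."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(tokens, logits, rng, temperature=1.0):
    """Draw one token at random, weighted by its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


def greedy_token(tokens, logits):
    """Always return the single most probable token."""
    best = max(range(len(tokens)), key=lambda i: logits[i])
    return tokens[best]


# Hypothetical scores for three candidate CEFR labels (assumed values).
tokens = ["B1", "B2", "C1"]
logits = [1.8, 2.0, 1.5]

rng = random.Random(0)  # seeded only so the demonstration is repeatable
samples = [sample_token(tokens, logits, rng) for _ in range(1000)]
counts = {t: samples.count(t) for t in tokens}

print("sampled label counts:", counts)
print("greedy label:", greedy_token(tokens, logits))
```

With identical input, the sampled labels are spread across several levels, whereas the greedy choice is the same on every call. Production systems add further sources of variation that this sketch does not model.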
One of the fundamental principles of effective testing is reliability. Reliability means that given the same response, the same score should be returned, regardless of factors inherent to the test candidate or to the rating process. Given that LLMs are built to vary their responses to the same prompt, their levels of reliability will always be low. This is not a bug that can be overcome, but an intrinsic feature of LLMs.
In contrast to LLMs, Speaknow is an assessment built on years of language learning research. The algorithms developed to rate tests are built on language learning research related to the Common European Framework of Reference for Languages (CEFR). In this scenario, there are aspects of language production that are definitional, rather than probabilistic. For example, someone who can freely produce and correctly use the words “fundamental”, “regardless”, “inherent”, and “intrinsic”, which appear in the second paragraph of this section, is not merely “likely” to have a C2-level vocabulary; the use of those words itself demonstrates a C2-level vocabulary. In this case, a probability-based model, such as an LLM, will be less successful at reliably identifying the CEFR vocabulary level of a candidate than other artificial intelligence models that consistently return a vocabulary score of C2 for someone producing those words. For this reason, Speaknow uses other types of AI models in rating vocabulary.
This example of vocabulary is one of the more straightforward examples of why LLMs are not the best option for rating language proficiency. However, the same logic applies to many other aspects of language proficiency.
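The contrast between definitional and probabilistic scoring can be sketched in a few lines of Python. This is an illustration only and does not represent Speaknow's algorithms: the word list is hypothetical, the four C2 entries simply mirror the example in the text, the remaining entries are placeholder values, and the sketch checks only whether a word is produced, not whether it is used correctly.

```python
import re

CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]

# Hypothetical lookup table. The four C2 entries follow the example in the
# text; the lower-level entries are illustrative placeholders, not taken
# from any official CEFR word list.
WORD_LEVELS = {
    "fundamental": "C2",
    "regardless": "C2",
    "inherent": "C2",
    "intrinsic": "C2",
    "problem": "A2",
    "house": "A1",
}


def vocabulary_level(transcript, word_levels=WORD_LEVELS):
    """Return the highest CEFR level among known words, or None."""
    words = re.findall(r"[a-z']+", transcript.lower())
    levels = [word_levels[w] for w in words if w in word_levels]
    if not levels:
        return None  # no listed word was produced
    return max(levels, key=CEFR_ORDER.index)


response = "Regardless of the cost, this is a fundamental problem."
print(vocabulary_level(response))
```

Because the function is a pure lookup, the same transcript always yields the same level, which is the reliability property discussed in this section.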
Another well-known issue with LLMs is their propensity for confidently stating information that is completely false. This issue is called hallucination. LLMs hallucinate for several reasons, including mistakes or bias in the training data, insufficient data, and overfitting to the training data. When there is a gap in the data, an LLM will confidently make up information to fill in the gap, even if that information is incorrect. An example of this occurred in our testing when an LLM stated, “The speaker demonstrates a basic range of vocabulary, using simple and familiar words to express ideas. There is some use of topic-specific vocabulary, such as "tragic situation" and "complicated situation,"” as a rationale for assigning A2 to the speaker’s vocabulary. However, those phrases are examples of B1 vocabulary use, and neither is “simple”.
As discussed above, one reason LLMs cannot produce accurate language assessment results is the nature of the data on which they are trained. LLMs are trained on a broad range of generally available data. Across all disciplines, vertical solutions, which are trained on deep industry-specific knowledge, are able to outperform LLMs. These solutions reduce hallucinations by providing the model with data relevant to the context. Providing a model with contextually appropriate data that is curated to minimize bias produces more accurate results. In the case of CEFR-aligned assessments, appropriate data includes samples of speech and writing at various levels of the CEFR, tagged and verified by experienced data taggers. These taggers have previous experience in rating CEFR-aligned exams and are further trained in rating data samples from the Speaknow assessment, including the features specific to this type of data. A large database of high-quality data is expensive and takes time to develop. Speaknow has a database of more than 850,000 tagged speech and writing samples and uses this data in training all of its language assessment algorithms. The Speaknow data was tagged by raters with hundreds of years of combined experience. The samples were rated following rigorous psychometric principles, with up to 6 independent ratings for each sample. There is no replacement for high-quality data in training any type of automated model.
In addition to speech transcripts and writing samples, many other types of data help to demonstrate language proficiency. While LLMs are able to make use of only language-based data, and that only in the form of text, the Speaknow solution includes many other types of information gathered throughout the test administration, including acoustic and video data. Because Speaknow’s data is gathered through a single sitting of a four-skills assessment, many other types of data are available for analysis, including test-taker behavior along with language samples.
LLMs base their scores on one model which assesses all aspects of the language it is presented with. This method leads to bias and inevitable flattening of score profiles. Because Speaknow develops individual models for each language skill, each of which incorporates different technologies and scores different parameters separately, a greater level of detail can be provided. This information is essential for developing individualized learning plans, tailored to the strengths and weaknesses of the user.
What goes into LLMs’ scoring is essentially a mystery. Even when an LLM explains what went into the scoring, the information is superficial and often inaccurate. Feature-based algorithms, by contrast, make it possible to report detailed information about the user’s skills. This information can help to build trust in the solution and gives the language learner useful feedback. When a variety of solutions are used in scoring, much more detailed information can be provided.
In order to compare the functioning of the Speaknow rating solution to LLMs, 200 speaking exams distributed among the six CEFR score levels were selected and run through two of the leading LLMs. Speaking analysis was chosen because it is more interesting and challenging to score automatically.
The scores of the Speaknow solution and the LLMs were compared to the average rating of 6 expert human raters, all of whom have specialized training and extensive experience rating CEFR-aligned exams. Accuracy is reported as the percentage of scores that differ from the average human rating by 0.5 or less.
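The accuracy measure described above can be written as a short Python sketch. It assumes that CEFR levels have been mapped to a numeric scale (for example A1 = 1 through C2 = 6, with half steps for "+" levels); that mapping, the function name, and the sample numbers are assumptions made for illustration and are not data from the study.

```python
from statistics import mean


def within_tolerance_accuracy(system_scores, human_ratings, tolerance=0.5):
    """Percentage of scores within `tolerance` of the mean human rating.

    system_scores: one numeric score per exam from the system under test.
    human_ratings: one list of expert ratings per exam, on the same scale.
    """
    if not system_scores or len(system_scores) != len(human_ratings):
        raise ValueError("need one non-empty list of ratings per score")
    hits = 0
    for score, ratings in zip(system_scores, human_ratings):
        reference = mean(ratings)  # average of the expert ratings
        if abs(score - reference) <= tolerance:
            hits += 1
    return 100.0 * hits / len(system_scores)


# Made-up scores for four exams, each rated by six experts.
system_scores = [4.0, 2.0, 5.5, 3.0]
human_ratings = [
    [4, 4, 5, 4, 4, 4],
    [3, 3, 3, 3, 3, 3],
    [5, 6, 5, 6, 6, 5],
    [3, 3, 3, 3, 2, 3],
]

accuracy = within_tolerance_accuracy(system_scores, human_ratings)
print(f"{accuracy:.0f}% of scores within 0.5 of the expert average")
```

In this made-up example, three of the four scores fall within 0.5 of the expert average, so the function reports 75%.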
The Speaknow rating solution significantly outperformed the leading LLMs in all categories. Interestingly, a single human rater also significantly outperformed all of the LLMs, although that rater did less well than the Speaknow solution. Particularly notable were the ends of the rating scale, where the Speaknow solution outperformed the LLMs by as much as 50%. While all LLMs struggled with the ends of the rating scale (A1 and C2), some struggled at both ends, while others succeeded more at one end and failed at the other. One LLM was able to rate A1-B1 with better accuracy than the others, but its accuracy then dropped to 68% at the B1+ level and did not exceed 78% at any higher level.
Within the individual parameters, the Speaknow solution also outperformed the LLMs, with fluency and pronunciation showing the largest differences.
| | Overall CEFR Score | Vocabulary | Vocabulary | Cohesion | Fluency | Pronunciation |
|---|---|---|---|---|---|---|
| Speaknow Solution | 96% | 92% | 92% | 91% | 97% | 93% |
| LLM1 | 52% | 52% | 56% | 65% | 44% | 48% |
| LLM2 | 69% | 68% | 68% | 75% | 69% | 61% |
*Accuracy is the percentage of scores that differ by 0.5 points or less from the average rating of 6 expert raters.
LLMs have much to offer in many language applications. They are able to generate language realistically and deal well with language probabilities. However, their strengths are more generative than evaluative. Because of the ways in which LLMs are designed, they are fundamentally unsuited for use as a standalone CEFR-aligned language assessment. They may be useful as one part of a more comprehensive language assessment solution, in conjunction with other forms of AI, but their use must be considered carefully. Because of its specialized data set and ability to incorporate multiple forms and formats of data, along with its ability to maintain control of the features analyzed, Speaknow is poised to set a standard for the future of AI-powered language assessment.