

PrepScholar


Getting College Essay Help: Important Do's and Don'ts


If you grow up to be a professional writer, everything you write will first go through an editor before being published. This is because the process of writing is really a process of rewriting: of rethinking and reexamining your work, usually with the help of someone else. So what does this mean for your student writing? And in particular, what does it mean for very important, but nonprofessional, writing like your college essay? Should you ask your parents to look at your essay? Pay for an essay service?

If you are wondering what kind of help you can, and should, get with your personal statement, you've come to the right place! In this article, I'll talk about what kind of writing help is useful, ethical, and even expected for your college admission essay . I'll also point out who would make a good editor, what the differences between editing and proofreading are, what to expect from a good editor, and how to spot and stay away from a bad one.

Table of Contents

  • What Kind of Help With Your Essay Can You Get?
  • What's Good Editing?
  • What Should an Editor Do for You?
  • What Kind of Editing Should You Avoid?
  • Proofreading
  • What's Good Proofreading?
  • What Kind of Proofreading Should You Avoid?
  • What Do Colleges Think of You Getting Help With Your Essay?
  • Who Can/Should Help You?
  • Advice for Editors
  • Should You Pay Money for Essay Editing?
  • The Bottom Line
  • What's Next?

What Kind of Help With Your Essay Can You Get?

Rather than talking in general terms about "help," let's first clarify the two different ways that someone else can improve your writing . There is editing, which is the more intensive kind of assistance that you can use throughout the whole process. And then there's proofreading, which is the last step of really polishing your final product.

Let me go into some more detail about editing and proofreading, and then explain how good editors and proofreaders can help you.

Editing is helping the author (in this case, you) go from a rough draft to a finished work . Editing is the process of asking questions about what you're saying, how you're saying it, and how you're organizing your ideas. But not all editing is good editing . In fact, it's very easy for an editor to cross the line from supportive to overbearing and over-involved.

Ability to clarify assignments. A good editor is usually a good writer, and certainly has to be a good reader. For example, in this case, a good editor should make sure you understand the actual essay prompt you're supposed to be answering.

Open-endedness. Good editing is all about asking questions about your ideas and work, but without providing answers. It's about letting you stick to your story and message, and doesn't alter your point of view.


Think of an editor as a great travel guide. A good guide can show you the many different places your trip could take you and explain anything along the way that might derail or confuse you. But a guide never dictates your path, never forces you to go somewhere you don't want to go, and never ignores your interests so that the trip no longer seems like your own. So what should good editors do?

Help Brainstorm Topics

Sometimes it's easier to bounce thoughts off of someone else. This doesn't mean that your editor gets to come up with ideas, but they can certainly respond to the various topic options you've come up with. This way, you're less likely to write about the most boring of your ideas, or to write about something that isn't actually important to you.

If you're wondering how to come up with options for your editor to consider, check out our guide to brainstorming topics for your college essay .

Help Revise Your Drafts

Here, your editor has to strike a delicate balance, intervening neither too much nor too little. It's tricky, but a great way to think about it is to remember: editing is about asking questions, not giving answers.

Revision questions should point out:

  • Places where more detail or more description would help the reader connect with your essay
  • Places where structure and logic don't flow, losing the reader's attention
  • Places where there aren't transitions between paragraphs, confusing the reader
  • Moments where your narrative or the arguments you're making are unclear

But pointing to potential problems is not the same as actually rewriting—editors let authors fix the problems themselves.

What Kind of Editing Should You Avoid?

Bad editing is usually very heavy-handed editing. Instead of helping you find your best voice and ideas, a bad editor changes your writing into their own vision.

You may be dealing with a bad editor if they:

  • Add material (examples, descriptions) that doesn't come from you
  • Use a thesaurus to make your college essay sound "more mature"
  • Add meaning or insight to the essay that doesn't come from you
  • Tell you what to say and how to say it
  • Write sentences, phrases, and paragraphs for you
  • Change your voice in the essay so it no longer sounds like it was written by a teenager

Colleges can tell the difference between a 17-year-old's writing and a 50-year-old's writing. Not only that, they have access to your SAT or ACT Writing section, so they can compare your essay to something else you wrote. Writing that's a little more polished is great and expected. But a totally different voice and style will raise questions.

Where's the Line Between Helpful Editing and Unethical Over-Editing?

Sometimes it's hard to tell whether your college essay editor is doing the right thing. Here are some guidelines for staying on the ethical side of the line.

  • An editor should say that the opening paragraph is kind of boring, and explain what exactly is making it drag. But it's overstepping for an editor to tell you exactly how to change it.
  • An editor should point out where your prose is unclear or vague. But it's completely inappropriate for the editor to rewrite that section of your essay.
  • An editor should let you know that a section is light on detail or description. But giving you similes and metaphors to beef up that description is a no-go.

Proofreading

Proofreading (also called copy-editing) is checking for errors in the last draft of a written work. It happens at the end of the process and is meant as the final polishing touch. Proofreading is meticulous and detail-oriented, focusing on small corrections. It sands off all the surface rough spots that could alienate the reader.

Because proofreading is usually concerned with making fixes on the word or sentence level, this is the only process where someone else can actually add to or take away things from your essay . This is because what they are adding or taking away tends to be one or two misplaced letters.

Laser focus. Proofreading is all about the tiny details, so the ability to really concentrate on finding small slip-ups is a must.

Excellent grammar and spelling skills. Proofreaders need to dot every "i" and cross every "t." Good proofreaders should correct spelling, punctuation, capitalization, and grammar. They should put foreign words in italics and surround quotations with quotation marks. They should check that you used the correct college's name, and that you adhered to any formatting requirements (name and date at the top of the page, uniform font and size, uniform spacing).

Limited interference. A proofreader needs to make sure that you followed any word limits. But if cuts need to be made to shorten the essay, that's your job and not the proofreader's.

What Kind of Proofreading Should You Avoid?

A bad proofreader either tries to turn into an editor, or just lacks the skills and knowledge necessary to do the job.

Some signs that you're working with a bad proofreader are:

  • If they suggest making major changes to the final draft of your essay. Proofreading happens when editing is already finished.
  • If they aren't particularly good at spelling, or don't know grammar, or aren't detail-oriented enough to find someone else's small mistakes.
  • If they start swapping out your words for fancier-sounding synonyms, or changing the voice and sound of your essay in other ways. A proofreader is there to check for errors, not to take the 17-year-old out of your writing.


What Do Colleges Think of Your Getting Help With Your Essay?

Admissions officers agree: light editing and proofreading are good—even required ! But they also want to make sure you're the one doing the work on your essay. They want essays with stories, voice, and themes that come from you. They want to see work that reflects your actual writing ability, and that focuses on what you find important.

On the Importance of Editing

Get feedback. Have a fresh pair of eyes give you some feedback. Don't allow someone else to rewrite your essay, but do take advantage of others' edits and opinions when they seem helpful. ( Bates College )

Read your essay aloud to someone. Reading the essay out loud offers a chance to hear how your essay sounds outside your head. This exercise reveals flaws in the essay's flow, highlights grammatical errors and helps you ensure that you are communicating the exact message you intended. ( Dickinson College )

On the Value of Proofreading

Share your essays with at least one or two people who know you well—such as a parent, teacher, counselor, or friend—and ask for feedback. Remember that you ultimately have control over your essays, and your essays should retain your own voice, but others may be able to catch mistakes that you missed and help suggest areas to cut if you are over the word limit. ( Yale University )

Proofread and then ask someone else to proofread for you. Although we want substance, we also want to be able to see that you can write a paper for our professors and avoid careless mistakes that would drive them crazy. ( Oberlin College )

On Watching Out for Too Much Outside Influence

Limit the number of people who review your essay. Too much input usually means your voice is lost in the writing style. ( Carleton College )

Ask for input (but not too much). Your parents, friends, guidance counselors, coaches, and teachers are great people to bounce ideas off of for your essay. They know how unique and spectacular you are, and they can help you decide how to articulate it. Keep in mind, however, that a 45-year-old lawyer writes quite differently from an 18-year-old student, so if your dad ends up writing the bulk of your essay, we're probably going to notice. ( Vanderbilt University )

Who Can/Should Help You?

Now let's talk about some potential people to approach for your college essay editing and proofreading needs. It's best to start close to home and slowly expand outward. Not only are your family and friends more invested in your success than strangers, but they also have a better handle on your interests and personality. This knowledge is key for judging whether your essay is expressing your true self.

Parents or Close Relatives

Your family may be full of potentially excellent editors! Parents are deeply committed to your well-being, and family members know you and your life well enough to offer details or incidents that can be included in your essay. On the other hand, the rewriting process necessarily involves criticism, which is sometimes hard to hear from someone very close to you.

A parent or close family member is a great choice for an editor if you can answer "yes" to the following questions. Is your parent or close relative a good writer or reader? Do you have a relationship where editing your essay won't create conflict? Are you able to constructively listen to criticism and suggestion from the parent?

One suggestion for defusing face-to-face discussions is to try working on the essay over email. Send your parent a draft, have them write you back some comments, and then you can pick which of their suggestions you want to use and which to discard.

Teachers or Tutors

A humanities teacher that you have a good relationship with is a great choice. I am purposefully saying humanities, and not just English, because teachers of Philosophy, History, Anthropology, and any other classes where you do a lot of writing, are all used to reviewing student work.

Moreover, any teacher or tutor who has been working with you for some time knows you very well and can vet the essay to make sure it "sounds like you."

If your teacher or tutor has some experience with what college essays are supposed to be like, ask them to be your editor. If not, then ask whether they have time to proofread your final draft.

Guidance or College Counselor at Your School

The best thing about asking your counselor to edit your work is that this is their job. This means that they have a very good sense of what colleges are looking for in an application essay.

At the same time, school counselors tend to have relationships with admissions officers in many colleges, which again gives them insight into what works and which college is focused on what aspect of the application.

Unfortunately, in many schools the guidance counselor tends to be way overextended. If your ratio is 300 students to 1 college counselor, you're unlikely to get that person's undivided attention and focus. It is still useful to ask them for general advice about your potential topics, but don't expect them to be able to stay with your essay from first draft to final version.

Friends, Siblings, or Classmates

Although they most likely don't have much experience with what colleges are hoping to see, your peers are excellent sources for checking that your essay is you .

Friends and siblings are perfect for the read-aloud edit. Read your essay to them so they can listen for words and phrases that are stilted, pompous, or phrases that just don't sound like you.

You can even trade essays and give helpful advice on each other's work.

Advice for Editors

If your editor hasn't worked with college admissions essays very much, no worries! Any astute and attentive reader can still greatly help with your process. But, as in all things, beginners do better with some preparation.

First, your editor should read our advice about how to write a college essay introduction , how to spot and fix a bad college essay , and get a sense of what other students have written by going through some admissions essays that worked .

Then, as they read your essay, they can work through the following series of questions that will help them to guide you.

Introduction Questions

  • Is the first sentence a killer opening line? Why or why not?
  • Does the introduction hook the reader? Does it have a colorful, detailed, and interesting narrative? Or does it propose a compelling or surprising idea?
  • Can you feel the author's voice in the introduction, or is the tone dry, dull, or overly formal? Show the places where the voice comes through.

Essay Body Questions

  • Does the essay have a through-line? Is it built around a central argument, thought, idea, or focus? Can you put this idea into your own words?
  • How is the essay organized? By logical progression? Chronologically? Do you feel order when you read it, or are there moments where you are confused or lose the thread of the essay?
  • Does the essay have both narratives about the author's life and explanations and insight into what these stories reveal about the author's character, personality, goals, or dreams? If not, which is missing?
  • Does the essay flow? Are there smooth transitions/clever links between paragraphs? Between the narrative and moments of insight?

Reader Response Questions

  • Does the writer's personality come through? Do we know what the speaker cares about? Do we get a sense of "who he or she is"?
  • Where did you feel most connected to the essay? Which parts of the essay gave you a "you are there" sensation by invoking your senses? What moments could you picture in your head well?
  • Where are the details and examples vague and not specific enough?
  • Did you get an "a-ha!" feeling anywhere in the essay? Is there a moment of insight that connected all the dots for you? Is there a good reveal or "twist" anywhere in the essay?
  • What are the strengths of this essay? What needs the most improvement?


Should You Pay Money for Essay Editing?

One alternative to asking someone you know to help you with your college essay is the paid editor route. There are two different ways to pay for essay help: a private essay coach or a less personal editing service , like the many proliferating on the internet.

My advice is to think of these options as a last resort rather than your go-to first choice. I'll first go through the reasons why. Then, if you do decide to go with a paid editor, I'll help you decide between a coach and a service.

When to Consider a Paid Editor

In general, I think hiring someone to work on your essay makes a lot of sense if none of the people I discussed above are a possibility for you.

If you can't ask your parents. For example, if your parents aren't good writers, or if English isn't their first language. Or if you think getting your parents to help is going to create unnecessary extra conflict in your relationship with them (applying to college is stressful as it is!).

If you can't ask your teacher or tutor. Maybe you don't have a trusted teacher or tutor that has time to look over your essay with focus. Or, for instance, your favorite humanities teacher has very limited experience with college essays and so won't know what admissions officers want to see.

If you can't ask your guidance counselor. This could be because your guidance counselor is way overwhelmed with other students.

If you can't share your essay with those who know you. It might be that your essay is on a very personal topic that you're unwilling to share with parents, teachers, or peers. Just make sure it doesn't fall into one of the bad-idea topics in our article on bad college essays .

If the cost isn't a consideration. Many of these services are quite expensive, and private coaches even more so. If you have finite resources, I'd say that hiring an SAT or ACT tutor (whether it's PrepScholar or someone else) is a better way to spend your money. This is because there's no guarantee that a slightly better essay will sufficiently elevate the rest of your application, but a significantly higher SAT score will definitely raise your applicant profile much more.

Should You Hire an Essay Coach?

On the plus side, essay coaches have read dozens or even hundreds of college essays, so they have experience with the format. Also, because you'll be working closely with a specific person, it's more personal than sending your essay to a service, which will know even less about you.

But, on the minus side, you'll still be bouncing ideas off of someone who doesn't know that much about you . In general, if you can adequately get the help from someone you know, there is no advantage to paying someone to help you.

If you do decide to hire a coach, ask your school counselor or older students who have used such a service for recommendations. If you can't afford the coach's fees, ask whether they can work on a sliding scale—many do. And finally, beware of those who guarantee admission to your school of choice—essay coaches don't have any special magic that can back up those promises.

Should You Send Your Essay to a Service?

On the plus side, essay editing services provide a similar product to essay coaches, and they cost significantly less . If you have some assurance that you'll be working with a good editor, the lack of face-to-face interaction won't prevent great results.

On the minus side, however, it can be difficult to gauge the quality of the service before working with them . If they are churning through many application essays without getting to know the students they are helping, you could end up with an over-edited essay that sounds just like everyone else's. In the worst case scenario, an unscrupulous service could send you back a plagiarized essay.

Getting recommendations from friends or a school counselor for reputable services is key to avoiding heavy-handed editing that writes essays for you or does too much to change your essay. Including a badly-edited essay like this in your application could cause problems if there are inconsistencies. For example, in interviews it might be clear you didn't write the essay, or the skill of the essay might not be reflected in your schoolwork and test scores.

Should You Buy an Essay Written by Someone Else?

Let me elaborate. There are super sketchy places on the internet where you can simply buy a pre-written essay. Don't do this!

For one thing, you'll be lying on an official, signed document. All college applications make you sign a statement saying something like this:

I certify that all information submitted in the admission process—including the application, the personal essay, any supplements, and any other supporting materials—is my own work, factually true, and honestly presented... I understand that I may be subject to a range of possible disciplinary actions, including admission revocation, expulsion, or revocation of course credit, grades, and degree, should the information I have certified be false. (From the Common Application )

For another thing, if your academic record doesn't match the essay's quality, the admissions officer will start thinking your whole application is riddled with lies.

Admission officers have full access to your writing portion of the SAT or ACT so that they can compare work that was done in proctored conditions with that done at home. They can tell if these were written by different people. Not only that, but there are now a number of search engines that faculty and admission officers can use to see if an essay contains strings of words that have appeared in other essays—you have no guarantee that the essay you bought wasn't also bought by 50 other students.

The Bottom Line

  • You should get college essay help with both editing and proofreading.
  • A good editor will ask questions about your ideas, logic, and structure, and will point out places where clarity is needed.
  • A good editor will absolutely not answer these questions for you, give you their own ideas, or write the essay or parts of the essay for you.
  • A good proofreader will find typos and check your formatting.
  • Colleges agree that getting light editing and proofreading is necessary.
  • Good people to ask are parents, teachers, your guidance or college counselor, and peers or siblings.
  • If you can't ask any of those, you can pay for college essay help, but watch out for services or coaches who over-edit your work.
  • Don't buy a pre-written essay! Colleges can tell, and it'll make your whole application sound false.

What's Next?

Ready to start working on your essay? Check out our explanation of the point of the personal essay and the role it plays in your applications, and then explore our step-by-step guide to writing a great college essay.

Using the Common Application for your college applications? We have an excellent guide to the Common App essay prompts and useful advice on how to pick the Common App prompt that's right for you . Wondering how other people tackled these prompts? Then work through our roundup of over 130 real college essay examples published by colleges .

Stressed about whether to take the SAT again before submitting your application? Let us help you decide how many times to take this test . If you choose to go for it, we have the ultimate guide to studying for the SAT to give you the ins and outs of the best ways to study.


Anna scored in the 99th percentile on her SATs in high school, and went on to major in English at Princeton and to get her doctorate in English Literature at Columbia. She is passionate about improving student access to higher education.


How to Write an Essay Introduction (with Examples)   


The introduction of an essay plays a critical role in engaging the reader and providing contextual information about the topic. It sets the stage for the rest of the essay, establishes the tone and style, and motivates the reader to continue reading. 

Table of Contents

  • What Is an Essay Introduction?
  • What to Include in an Essay Introduction
  • How to Create an Essay Structure
  • Step-by-Step Process for Writing an Essay Introduction
  • How to Write an Introduction Paragraph
  • How to Write a Hook for Your Essay
  • How to Include Background Information
  • How to Write a Thesis Statement
  • Argumentative Essay Introduction Example
  • Expository Essay Introduction Example
  • Literary Analysis Essay Introduction Example
  • Check and Revise – Checklist for Essay Introduction
  • Key Takeaways
  • Frequently Asked Questions

An introduction is the opening section of an essay, paper, or other written work. It introduces the topic and provides background information, context, and an overview of what the reader can expect from the rest of the work. 1 The key is to be concise and to the point, providing enough information to engage the reader without delving into excessive detail. 

The essay introduction is crucial as it sets the tone for the entire piece and provides the reader with a roadmap of what to expect. Here are key elements to include in your essay introduction: 

  • Hook : Start with an attention-grabbing statement or question to engage the reader. This could be a surprising fact, a relevant quote, or a compelling anecdote. 
  • Background information : Provide context and background information to help the reader understand the topic. This can include historical information, definitions of key terms, or an overview of the current state of affairs related to your topic. 
  • Thesis statement : Clearly state your main argument or position on the topic. Your thesis should be concise and specific, providing a clear direction for your essay. 

Before we get into how to write an essay introduction, we need to know how it is structured. The structure of an essay is crucial for organizing your thoughts and presenting them clearly and logically. It is divided as follows: 2  

  • Introduction:  The introduction should grab the reader’s attention with a hook, provide context, and include a thesis statement that presents the main argument or purpose of the essay.  
  • Body:  The body should consist of focused paragraphs that support your thesis statement using evidence and analysis. Each paragraph should concentrate on a single central idea or argument and provide evidence, examples, or analysis to back it up.  
  • Conclusion:  The conclusion should summarize the main points and restate the thesis differently. End with a final statement that leaves a lasting impression on the reader. Avoid new information or arguments. 


Here’s a step-by-step guide on how to write an essay introduction: 

  • Start with a Hook : Begin your introduction paragraph with an attention-grabbing statement, question, quote, or anecdote related to your topic. The hook should pique the reader’s interest and encourage them to continue reading. 
  • Provide Background Information : This helps the reader understand the relevance and importance of the topic. 
  • State Your Thesis Statement : The last sentence is the main argument or point of your essay. It should be clear, concise, and directly address the topic of your essay. 
  • Preview the Main Points : This gives the reader an idea of what to expect and how you will support your thesis. 
  • Keep it Concise and Clear : Avoid going into too much detail or including information not directly relevant to your topic. 
  • Revise : Revise your introduction after you’ve written the rest of your essay to ensure it aligns with your final argument. 

Here’s an example of an essay introduction paragraph about the importance of education: 

Education is often viewed as a fundamental human right and a key social and economic development driver. As Nelson Mandela once famously said, “Education is the most powerful weapon which you can use to change the world.” It is the key to unlocking a wide range of opportunities and benefits for individuals, societies, and nations. In today’s constantly evolving world, education has become even more critical. It has expanded beyond traditional classroom learning to include digital and remote learning, making education more accessible and convenient. This essay will delve into the importance of education in empowering individuals to achieve their dreams, improving societies by promoting social justice and equality, and driving economic growth by developing a skilled workforce and promoting innovation. 

This introduction paragraph example includes a hook (the quote by Nelson Mandela), provides some background information on education, and states the thesis statement (the importance of education). 

This is one of the key steps in how to write an essay introduction. Crafting a compelling hook is vital because it sets the tone for your entire essay and determines whether your readers will stay interested. A good hook draws the reader in and sets the stage for the rest of your essay.  

  • Avoid Dry Fact : Instead of simply stating a bland fact, try to make it engaging and relevant to your topic. For example, if you’re writing about the benefits of exercise, you could start with a startling statistic like, “Did you know that regular exercise can increase your lifespan by up to seven years?” 
  • Avoid Using a Dictionary Definition : While definitions can be informative, they’re not always the most captivating way to start an essay. Instead, try to use a quote, anecdote, or provocative question to pique the reader’s interest. For instance, if you’re writing about freedom, you could begin with a quote from a famous freedom fighter or philosopher. 
  • Do Not Just State a Fact That the Reader Already Knows: This ties back to the first point—your hook should surprise or intrigue the reader. For example, if you’re writing about climate change, you could start with a thought-provoking statement like, “Despite overwhelming evidence, many people still refuse to believe in the reality of climate change.” 

Including background information in the introduction section of your essay is important to provide context and establish the relevance of your topic. When writing the background information, you can follow these steps: 

  • Start with a General Statement:  Begin with a general statement about the topic and gradually narrow it down to your specific focus. For example, when discussing the impact of social media, you can begin by making a broad statement about social media and its widespread use in today’s society, as follows: “Social media has become an integral part of modern life, with billions of users worldwide.” 
  • Define Key Terms : Define any key terms or concepts that may be unfamiliar to your readers but are essential for understanding your argument. 
  • Provide Relevant Statistics:  Use statistics or facts to highlight the significance of the issue you’re discussing. For instance, “According to a report by Statista, the number of social media users is expected to reach 4.41 billion by 2025.” 
  • Discuss the Evolution:  Mention previous research or studies that have been conducted on the topic, especially those that are relevant to your argument. Mention key milestones or developments that have shaped its current impact. You can also outline some of the major effects of social media. For example, you can briefly describe how social media has evolved, including positives such as increased connectivity and issues like cyberbullying and privacy concerns. 
  • Transition to Your Thesis:  Use the background information to lead into your thesis statement, which should clearly state the main argument or purpose of your essay. For example, “Given its pervasive influence, it is crucial to examine the impact of social media on mental health.” 


A thesis statement is a concise summary of the main point or claim of an essay, research paper, or other type of academic writing. It appears near the end of the introduction. Here’s how to write a thesis statement: 

  • Identify the topic:  Start by identifying the topic of your essay. For example, if your essay is about the importance of exercise for overall health, your topic is “exercise.” 
  • State your position:  Next, state your position or claim about the topic. This is the main argument or point you want to make. For example, if you believe that regular exercise is crucial for maintaining good health, your position could be: “Regular exercise is essential for maintaining good health.” 
  • Support your position:  Provide a brief overview of the reasons or evidence that support your position. These will be the main points of your essay. For example, if you’re writing an essay about the importance of exercise, you could mention the physical health benefits, mental health benefits, and the role of exercise in disease prevention. 
  • Make it specific:  Ensure your thesis statement clearly states what you will discuss in your essay. For example, instead of saying, “Exercise is good for you,” you could say, “Regular exercise, including cardiovascular and strength training, can improve overall health and reduce the risk of chronic diseases.” 

Examples of essay introduction 

Here are examples of essay introductions for different types of essays: 

Argumentative Essay Introduction Example:  

Topic: Should the voting age be lowered to 16? 

“The question of whether the voting age should be lowered to 16 has sparked nationwide debate. While some argue that 16-year-olds lack the requisite maturity and knowledge to make informed decisions, others argue that doing so would imbue young people with agency and give them a voice in shaping their future.” 

Expository Essay Introduction Example  

Topic: The benefits of regular exercise 

“In today’s fast-paced world, the importance of regular exercise cannot be overstated. From improving physical health to boosting mental well-being, the benefits of exercise are numerous and far-reaching. This essay will examine the various advantages of regular exercise and provide tips on incorporating it into your daily routine.” 

Literary Analysis Essay Introduction Example

Text: “To Kill a Mockingbird” by Harper Lee 

“Harper Lee’s novel, ‘To Kill a Mockingbird,’ is a timeless classic that explores themes of racism, injustice, and morality in the American South. Through the eyes of young Scout Finch, the reader is taken on a journey that challenges societal norms and forces characters to confront their prejudices. This essay will analyze the novel’s use of symbolism, character development, and narrative structure to uncover its deeper meaning and relevance to contemporary society.” 

Check and Revise – Checklist for Essay Introduction

  • Engaging and Relevant First Sentence : The opening sentence captures the reader’s attention and relates directly to the topic. 
  • Background Information : Enough background information is introduced to provide context for the thesis statement. 
  • Definition of Important Terms : Key terms or concepts that might be unfamiliar to the audience or are central to the argument are defined. 
  • Clear Thesis Statement : The thesis statement presents the main point or argument of the essay. 
  • Relevance to Main Body : Everything in the introduction directly relates to and sets up the discussion in the main body of the essay. 


Key Takeaways

Writing a strong introduction is crucial for setting the tone and context of your essay. Here are the key takeaways for how to write an essay introduction: 3 

  • Hook the Reader : Start with an engaging hook to grab the reader’s attention. This could be a compelling question, a surprising fact, a relevant quote, or an anecdote. 
  • Provide Background : Give a brief overview of the topic, setting the context and stage for the discussion. 
  • Thesis Statement : State your thesis, which is the main argument or point of your essay. It should be concise, clear, and specific. 
  • Preview the Structure : Outline the main points or arguments to help the reader understand the organization of your essay. 
  • Keep it Concise : Avoid including unnecessary details or information not directly related to your thesis. 
  • Revise and Edit : Revise your introduction to ensure clarity, coherence, and relevance. Check for grammar and spelling errors. 
  • Seek Feedback : Get feedback from peers or instructors to improve your introduction further. 

Frequently Asked Questions

The purpose of an essay introduction is to give an overview of the topic, context, and main ideas of the essay. It is meant to engage the reader, establish the tone for the rest of the essay, and introduce the thesis statement or central argument.  

An essay introduction typically ranges from 5-10% of the total word count. For example, in a 1,000-word essay, the introduction would be roughly 50-100 words. However, the length can vary depending on the complexity of the topic and the overall length of the essay.

An essay introduction is critical in engaging the reader and providing contextual information about the topic. To ensure its effectiveness, consider incorporating these key elements: a compelling hook, background information, a clear thesis statement, an outline of the essay’s scope, a smooth transition to the body, and optional signposting sentences.  

The process of writing an essay introduction is not necessarily straightforward, but there are several strategies that can be employed to achieve this end. When experiencing difficulty initiating the process, consider the following techniques: begin with an anecdote, a quotation, an image, a question, or a startling fact to pique the reader’s interest. It may also be helpful to consider the five W’s (and one H) of journalism: who, what, when, where, why, and how. For instance, an anecdotal opening could be structured as follows: “As I ascended the stage, momentarily blinded by the intense lights, I could sense the weight of a hundred eyes upon me, anticipating my next move. The topic of discussion was climate change, a subject I was passionate about, and it was my first public speaking event. Little did I know that pivotal moment would not only alter my perspective but also chart my life’s course.” 

Crafting a compelling thesis statement for your introduction paragraph is crucial to grab your reader’s attention. To achieve this, avoid using overused phrases such as “In this paper, I will write about” or “I will focus on” as they lack originality. Instead, strive to engage your reader by substantiating your stance or proposition with a “so what” clause. While writing your thesis statement, aim to be precise, succinct, and clear in conveying your main argument.  

To create an effective essay introduction, ensure it is clear, engaging, relevant, and contains a concise thesis statement. It should transition smoothly into the essay and be long enough to cover necessary points but not become overwhelming. Seek feedback from peers or instructors to assess its effectiveness. 

References  

  • Cui, L. (2022). Unit 6 Essay Introduction. Building Academic Writing Skills. 
  • West, H., Malcolm, G., Keywood, S., & Hill, J. (2019). Writing a successful essay. Journal of Geography in Higher Education, 43(4), 609-617. 
  • Beavers, M. E., Thoune, D. L., & McBeth, M. (2023). Bibliographic Essay: Reading, Researching, Teaching, and Writing with Hooks: A Queer Literacy Sponsorship. College English, 85(3), 230-242. 




Welcome to the Purdue Online Writing Lab


The Online Writing Lab at Purdue University houses writing resources and instructional material, and we provide these as a free service of the Writing Lab at Purdue. Students, members of the community, and users worldwide will find information to assist with many writing projects. Teachers and trainers may use this material for in-class and out-of-class instruction.

The Purdue On-Campus Writing Lab and Purdue Online Writing Lab assist clients in their development as writers—no matter what their skill level—with on-campus consultations, online participation, and community engagement. The Purdue Writing Lab serves the Purdue, West Lafayette, campus and coordinates with local literacy initiatives. The Purdue OWL offers global support through online reference materials and services.



Write with AI in Google Docs (Workspace Labs)


On Google Docs, you can use the “Help me write” prompt to suggest text using artificial intelligence. You can use the prompt to:

  • Write new text. For example, you can ask Google Docs to draft a letter or a social media caption.
  • Rewrite existing text. For example, you can rephrase text, or you can make it more formal, more concise, or more detailed.

This feature is currently available on desktop.

Use AI to write something new

  • On your computer, open a document in Google Docs.
  • In the document, click where you want to write.
  • Enter a prompt for “Help me write.” For example:
    • “Write a poem about the life of a 6 year old boy”
    • “How-to guide for operating a lawn mower”
    • “Thank you letter after an interview”
  • Click Create.


  • Edit your prompt: At the top of the pop-up window, click the prompt. Edit your prompt and click Update.
  • Refine the generated text with one of these options:
    • Tone: Select Formal or Casual
    • Summarize: Gives the key points of the text
    • Bulletize: Formats the text into a bulleted list
    • Elaborate: Adds details to build upon the text
    • Shorten: Makes the text more concise
  • Important: After creating a new version, you can’t go back to the previous version.
  • When you’re finished, click Insert.

Use AI to rewrite existing text

  • Select the text you want to rewrite.
  • Choose how to refine the text:
    • Rephrase: Rewords the text
    • Custom: Write your own prompt to refine the text.
  • Continue refining the suggested text: Click Refine and repeat this step.
  • Important: After creating a new version, you can’t go back to the previous generated version.
  • Click Replace to accept the new text, or click Insert to add the new text under the existing text.

Give feedback on generated text

Gemini for Google Workspace is constantly learning and may not be able to support your request.

If you get a suggestion that’s inaccurate or that you feel is unsafe, you can let us know by submitting feedback. Your feedback can help improve AI-assisted Workspace features and broader Google efforts in AI. 

  • Optional: To review data that will be attached with your feedback, at the bottom, select What data will be attached?  If you don’t want to include the data with your feedback, uncheck Attach collected data to your feedback to help us improve the product experience .
  • Select Next .
  • Review additional context that you can share with your feedback. If you don’t want to include the additional context with your feedback, uncheck Additional context (content referenced to create outputs) .
  • Select Submit.

To report a legal issue, create a request .

Turn off the “Help me write” prompt

To turn off any of the features on Google Workspace Labs, you must exit Workspace Labs. If you exit, you will permanently lose access to all Workspace Labs features , and you won’t be able to rejoin Workspace Labs. Learn more about how to exit Workspace Labs .

Learn about Workspace Labs feature suggestions

  • Workspace Labs feature suggestions don’t represent Google’s views, and should not be attributed to Google.
  • Don’t rely on Workspace Labs features as medical, legal, financial or other professional advice.
  • Workspace Labs features may suggest inaccurate or inappropriate information. Your feedback makes Workspace Labs more helpful and safe.
  • Don’t include personal, confidential, or sensitive information in your prompts.
  • Google uses Workspace Labs data and metrics to provide, improve, and develop products, services, and machine learning technologies across Google.
  • Your Workspace Labs Data may also be read, rated, annotated, and reviewed by human reviewers. Importantly, where Google uses Google-selected input (as described in the Privacy Notice) to generate output, Google will aggregate and/or pseudonymize that content and resulting output before it is viewed by human reviewers, unless it is specifically provided as part of your feedback to Google.

You can review the Google Workspace Labs Privacy Notice and Terms for Personal Accounts .

How Workspace Labs data in Google Docs is collected

When you use the “Help me write (Labs)” prompt in Google Docs, Google uses and stores the following data:

  • Prompts you enter or select
  • Text you select to rewrite
  • Generated text
  • Document content that is referenced to generate text
  • Your feedback on generated text

Related resources

  • Get started with Google Workspace Labs
  • Collaborate with Gemini in Google Docs
  • Google Workspace Labs Privacy Notice and Terms for Personal Accounts


Informational Essay

And the Writing Process

What is it?

An informative essay educates your reader on a topic. It can have one of several functions: to define a term, compare and contrast something, analyze data, or provide a how-to.

It does not, however, present an opinion or try to persuade your reader.

The author of an informational essay becomes a teacher that informs the reader about a topic or process.

Informative Essay

This kind of writing explains something, tells something, or it gives directions.

For example, if you wrote about your favorite aunt, you would be writing an informative/expository essay telling us something about your aunt. Likewise, if you wrote an essay that gave directions for making a paper airplane, you would also be writing an informative/expository essay that gives directions.

How do I do it?

Because you are informing a reader about a subject at the most basic level, you must have structure to your writing to ensure it makes sense.

It is best to break apart the writing act into a process that makes the job easier for you and more logical for your audience.

The Writing Process

1. Prewrite and Explore
  • Brainstorm and collect thoughts.
  • Use a graphic organizer to arrange your ideas (web, chart, outline, list).
  • Analyze the prompt to understand audience, purpose, and type of writing required.

2. Rough Draft and Discover
  • Put ideas into sentences and paragraphs.
  • Get a rough draft onto paper.
  • Draft means “to write.”
  • Don’t worry about getting all of your ideas in the right order or using just the right words; this step will come later in the process.
  • Read your draft aloud.

3. Revise
  • After reading aloud, look for sentences that need improving.
  • Rearrange what you’ve written if you need to.
  • Add ideas and sentences if you need to.
  • Check for beginning, middle, and end.
  • Check for a topic sentence, supporting details, and concluding sentence in each paragraph.
  • Check for a claim in the introductory paragraph, support for the claim in body paragraphs, and the claim restated in the conclusion.

4. Proofread and Edit
  • Look for errors in spelling, grammar, and punctuation.

5. Publish/Final Draft
  • Type to display.
  • Submit to a contest or publication.
  • Add to your portfolio.

A Few More Tips!

  • Do not use first person! Remember, this is NOT an opinion piece.
  • Avoid informal, conversational expressions that are, u know, like, well, so, lol, idk, etc.
  • Use evidence!!!


  • Open access
  • Published: 03 June 2024

Applying large language models for automated essay scoring for non-native Japanese

  • Wenchao Li
  • Haitao Liu

Humanities and Social Sciences Communications, volume 11, Article number: 723 (2024)


Subjects: Language and linguistics

Recent advancements in artificial intelligence (AI) have led to an increased use of large language models (LLMs) for language assessment tasks such as automated essay scoring (AES), automated listening tests, and automated oral proficiency assessments. The application of LLMs for AES in the context of non-native Japanese, however, remains limited. This study explores the potential of LLM-based AES by comparing the efficiency of different models, i.e. two conventional machine learning technology-based methods (Jess and JWriter), two LLMs (GPT and BERT), and one Japanese local LLM (Open-Calm large model). To conduct the evaluation, a dataset consisting of 1400 story-writing scripts authored by learners with 12 different first languages was used. Statistical analysis revealed that GPT-4 outperforms Jess and JWriter, BERT, and the Japanese language-specific trained Open-Calm large model in terms of annotation accuracy and predicting learning levels. Furthermore, by comparing 18 different models that utilize various prompts, the study emphasized the significance of prompts in achieving accurate and reliable evaluations using LLMs.
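To make the LLM-based setup concrete, the sketch below shows one way a prompt-driven rater could be called from Python. It is a hypothetical illustration using the OpenAI chat-completions client; the rubric wording, the 1–6 scale, and the model name are placeholder assumptions, not the prompts or levels evaluated in the study.

```python
# Hypothetical sketch of prompt-based essay scoring with an LLM.
# The rubric text and the 1-6 scale are illustrative placeholders,
# not the prompts or proficiency levels used in the study.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = (
    "You are rating essays written by learners of Japanese as a second language. "
    "Score the essay from 1 (beginner) to 6 (advanced), considering lexical richness, "
    "syntactic complexity, and cohesion. Reply with the number only."
)

def score_essay(essay_text: str, model: str = "gpt-4") -> int:
    """Ask the model for a single holistic score and parse it as an integer."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep the rating as deterministic as possible
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": essay_text},
        ],
    )
    return int(response.choices[0].message.content.strip())
```

As the abstract notes, the reliability of this kind of pipeline depends heavily on how the rubric and instructions in the prompt are phrased.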


Conventional machine learning technology in AES

AES has experienced significant growth with the advancement of machine learning technologies in recent decades. In the earlier stages of AES development, conventional machine learning-based approaches were commonly used. These approaches involved the following procedures: (a) feeding the machine a dataset: a dataset of essays is provided to the machine learning system, serving as the basis for training the model and establishing patterns and correlations between linguistic features and human ratings; and (b) training the machine learning model on linguistic features that best represent human ratings and can effectively discriminate learners’ writing proficiency. These features include lexical richness (Lu, 2012; Kyle and Crossley, 2015; Kyle et al. 2021), syntactic complexity (Lu, 2010; Liu, 2008), and text cohesion (Crossley and McNamara, 2016), among others. Conventional machine learning approaches in AES require human intervention, such as manual correction and annotation of essays. This human involvement was necessary to create a labeled dataset for training the model. Several AES systems have been developed using conventional machine learning technologies. These include the Intelligent Essay Assessor (Landauer et al. 2003), the e-rater engine by Educational Testing Service (Attali and Burstein, 2006; Burstein, 2003), MyAccess with the IntelliMetric scoring engine by Vantage Learning (Elliot, 2003), and the Bayesian Essay Test Scoring system (Rudner and Liang, 2002). These systems have played a significant role in automating the essay scoring process and providing quick and consistent feedback to learners. However, as touched upon earlier, conventional machine learning approaches rely on predetermined linguistic features and often require manual intervention, making them less flexible and potentially limiting their generalizability to different contexts.
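The pipeline just described can be pictured with a minimal, purely illustrative sketch: hand-crafted linguistic features for each essay are fed to a regression model fit on human ratings. The feature names, numbers, and scores below are invented for illustration and are not from Jess, JWriter, or any other system cited here.

```python
# Illustrative sketch of a feature-based AES scorer (conventional machine learning).
import numpy as np
from sklearn.linear_model import LinearRegression

# Each row: [lexical diversity, mean sentence length, error rate, cohesion index]
X_train = np.array([[0.62, 18.4, 0.05, 0.41],
                    [0.48, 12.1, 0.12, 0.30],
                    [0.71, 22.0, 0.02, 0.55]])
y_train = np.array([4.0, 2.5, 5.0])  # human holistic scores for the training essays

model = LinearRegression().fit(X_train, y_train)   # learn feature weights from human ratings
new_essay = np.array([[0.66, 19.2, 0.04, 0.47]])   # features extracted from an unseen essay
print(model.predict(new_essay))                    # predicted holistic score
```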

In the context of the Japanese language, conventional machine learning-incorporated AES tools include Jess (Ishioka and Kameda, 2006) and JWriter (Lee and Hasebe, 2017). Jess assesses essays by deducting points from the perfect score, utilizing the Mainichi Daily News newspaper as a database. The evaluation criteria employed by Jess encompass various aspects, such as rhetorical elements (e.g., reading comprehension, vocabulary diversity, percentage of complex words, and percentage of passive sentences), organizational structures (e.g., forward and reverse connection structures), and content analysis (e.g., latent semantic indexing). JWriter employs linear regression analysis to assign weights to various measurement indices, such as average sentence length and total number of characters. These weights are then combined to derive the overall score. A pilot study involving the Jess model was conducted on 1320 essays at different proficiency levels, including primary, intermediate, and advanced. However, the results indicated that the Jess model failed to significantly distinguish between these essay levels. Out of the 16 measures used, four measures, namely median sentence length, median clause length, median number of phrases, and maximum number of phrases, did not show statistically significant differences between the levels. Additionally, two measures exhibited between-level differences but lacked linear progression: the number of attributive declined words and the Kanji/kana ratio. On the other hand, the remaining measures, including maximum sentence length, maximum clause length, number of attributive conjugated words, maximum number of consecutive infinitive forms, maximum number of conjunctive-particle clauses, k characteristic value, percentage of big words, and percentage of passive sentences, demonstrated statistically significant between-level differences and displayed linear progression.

Both Jess and JWriter exhibit notable limitations, including the manual selection of feature parameters and weights, which can introduce biases into the scoring process. The reliance on human annotators to label non-native language essays also introduces potential noise and variability in the scoring. Furthermore, an important concern is the possibility of system manipulation and cheating by learners who are aware of the regression equation utilized by the models (Hirao et al. 2020 ). These limitations emphasize the need for further advancements in AES systems to address these challenges.

Deep learning technology in AES

Deep learning has emerged as one of the approaches for improving the accuracy and effectiveness of AES. Deep learning-based AES methods utilize artificial neural networks that mimic the human brain’s functioning through layered algorithms and computational units. Unlike conventional machine learning, deep learning autonomously learns from the environment and past errors without human intervention. This enables deep learning models to establish nonlinear correlations, resulting in higher accuracy. Recent advancements in deep learning have led to the development of transformers, which are particularly effective in learning text representations. Noteworthy examples include bidirectional encoder representations from transformers (BERT) (Devlin et al. 2019 ) and the generative pretrained transformer (GPT) (OpenAI).

BERT is a linguistic representation model that utilizes a transformer architecture and is trained on two tasks: masked linguistic modeling and next-sentence prediction (Hirao et al. 2020 ; Vaswani et al. 2017 ). In the context of AES, BERT follows specific procedures, as illustrated in Fig. 1 : (a) the tokenized prompts and essays are taken as input; (b) special tokens, such as [CLS] and [SEP], are added to mark the beginning and separation of prompts and essays; (c) the transformer encoder processes the prompt and essay sequences, resulting in hidden layer sequences; (d) the hidden layers corresponding to the [CLS] tokens (T[CLS]) represent distributed representations of the prompts and essays; and (e) a multilayer perceptron uses these distributed representations as input to obtain the final score (Hirao et al. 2020 ).

Figure 1: AES system with BERT (Hirao et al. 2020).
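As a rough sketch of steps (a) through (e), the snippet below uses the Hugging Face transformers library to encode a prompt-essay pair with a BERT encoder and map the [CLS] representation to a score through a small multilayer perceptron. The checkpoint name and the untrained scoring head are assumptions for illustration; this is not the system of Hirao et al. (2020).

```python
# Minimal sketch of a BERT-based AES scorer following steps (a)-(e) above; illustrative only.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "cl-tohoku/bert-base-japanese"        # assumed Japanese BERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)
scorer = torch.nn.Sequential(                      # (e) untrained MLP head, for illustration
    torch.nn.Linear(768, 128), torch.nn.ReLU(), torch.nn.Linear(128, 1))

def score_essay(prompt: str, essay: str) -> float:
    # (a)-(b) tokenize the prompt and essay as a pair; [CLS]/[SEP] are added automatically
    inputs = tokenizer(prompt, essay, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state   # (c) hidden layer sequences
    cls_vec = hidden[:, 0, :]                          # (d) T[CLS] distributed representation
    return scorer(cls_vec).item()                      # (e) regression score from the MLP
```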

The training of BERT using a substantial amount of sentence data through the Masked Language Model (MLM) allows it to capture contextual information within the hidden layers. Consequently, BERT is expected to be capable of identifying artificial essays as invalid and assigning them lower scores (Mizumoto and Eguchi, 2023). In the context of AES for nonnative Japanese learners, Hirao et al. (2020) combined the long short-term memory (LSTM) model proposed by Hochreiter and Schmidhuber (1997) with BERT to develop a tailored automated essay scoring system. The findings of their study revealed that the BERT model outperformed both the conventional machine learning approach utilizing character-type features such as “kanji” and “hiragana” and the standalone LSTM model. Takeuchi et al. (2021) presented an approach to Japanese AES that eliminates the requirement for pre-scored essays by relying solely on reference texts or a model answer for the essay task. They investigated multiple similarity evaluation methods, including frequency of morphemes, idf values calculated on Wikipedia, LSI, LDA, word-embedding vectors, and document vectors produced by BERT. The experimental findings revealed that the method utilizing the frequency of morphemes with idf values exhibited the strongest correlation with human-annotated scores across different essay tasks. The utilization of BERT in AES encounters several limitations. First, essays often exceed the model’s maximum length limit. Second, only score labels are available for training, which restricts access to additional information.

Mizumoto and Eguchi (2023) were pioneers in employing the GPT model for AES in non-native English writing. Their study focused on evaluating the accuracy and reliability of AES using the GPT-3 text-davinci-003 model, analyzing a dataset of 12,100 essays from the corpus of nonnative written English (TOEFL11). The findings indicated that AES utilizing the GPT-3 model exhibited a certain degree of accuracy and reliability. They suggested that GPT-3-based AES systems hold the potential to provide support for human ratings. However, applying the GPT model to AES presents a unique natural language processing (NLP) task that involves considerations such as nonnative language proficiency, the influence of the learner’s first language on the output in the target language, and identifying linguistic features that best indicate writing quality in a specific language. These linguistic features may differ morphologically or syntactically from those present in the learners’ first language, as observed in (1)–(3).

(1) Isolating (Chinese)

我-送了-他-一本-书
Wǒ-sòngle-tā-yī běn-shū
1SG-give.PST-him-one.CL-book
“I gave him a book.”

(2) Agglutinative (Japanese)

彼-に-本-を-あげ-まし-た
Kare-ni-hon-o-age-mashi-ta
3SG-DAT-book-ACC-give.HON-PST
“(I) gave him a book.”

(3) Inflectional (English)

give, give-s, gave, given, giving

Additionally, the morphological agglutination and subject-object-verb (SOV) order in Japanese, along with its idiomatic expressions, pose additional challenges for applying language models in AES tasks (4).

(4) 足-が 棒-に なり-ました
Ashi-ga bō-ni nari-mashi-ta
leg-NOM stick-DAT become-HON-PST
“My leg became like a stick (I am extremely tired).”

The example sentence provided demonstrates the morpho-syntactic structure of Japanese and the presence of an idiomatic expression. In this sentence, the verb “なる” (naru), meaning “to become”, appears at the end of the sentence. The verb stem “なり” (nari) is followed by morphemes indicating honorification (“ます” - masu) and tense (“た” - ta), showcasing agglutination. While the sentence can be literally translated as “my leg became like a stick”, it carries an idiomatic interpretation that implies “I am extremely tired”.

To overcome this issue, CyberAgent Inc. ( 2023 ) has developed the Open-Calm series of language models specifically designed for Japanese. Open-Calm consists of pre-trained models available in various sizes, such as Small, Medium, Large, and 7b. Figure 2 depicts the fundamental structure of the Open-Calm model. A key feature of this architecture is the incorporation of the Lora Adapter and GPT-NeoX frameworks, which can enhance its language processing capabilities.

Figure 2: GPT-NeoX model architecture (Okgetheng and Takeuchi, 2024).

In a recent study conducted by Okgetheng and Takeuchi ( 2024 ), they assessed the efficacy of Open-Calm language models in grading Japanese essays. The research utilized a dataset of approximately 300 essays, which were annotated by native Japanese educators. The findings of the study demonstrate the considerable potential of Open-Calm language models in automated Japanese essay scoring. Specifically, among the Open-Calm family, the Open-Calm Large model (referred to as OCLL) exhibited the highest performance. However, it is important to note that, as of the current date, the Open-Calm Large model does not offer public access to its server. Consequently, users are required to independently deploy and operate the environment for OCLL. In order to utilize OCLL, users must have a PC equipped with an NVIDIA GeForce RTX 3060 (8 or 12 GB VRAM).

In summary, while the potential of LLMs in automated scoring of nonnative Japanese essays has been demonstrated in two studies—BERT-driven AES (Hirao et al. 2020 ) and OCLL-based AES (Okgetheng and Takeuchi, 2024 )—the number of research efforts in this area remains limited.

Another significant challenge in applying LLMs to AES lies in prompt engineering and ensuring its reliability and effectiveness (Brown et al. 2020 ; Rae et al. 2021 ; Zhang et al. 2021 ). Various prompting strategies have been proposed, such as the zero-shot chain of thought (CoT) approach (Kojima et al. 2022 ), which involves manually crafting diverse and effective examples. However, manual efforts can lead to mistakes. To address this, Zhang et al. ( 2021 ) introduced an automatic CoT prompting method called Auto-CoT, which demonstrates matching or superior performance compared to the CoT paradigm. Another prompt framework is trees of thoughts, enabling a model to self-evaluate its progress at intermediate stages of problem-solving through deliberate reasoning (Yao et al. 2023 ).

Beyond linguistic studies, there has been a noticeable increase in the number of foreign workers in Japan and Japanese learners worldwide (Ministry of Health, Labor, and Welfare of Japan, 2022 ; Japan Foundation, 2021 ). However, existing assessment methods, such as the Japanese Language Proficiency Test (JLPT), J-CAT, and TTBJ Footnote 1 , primarily focus on reading, listening, vocabulary, and grammar skills, neglecting the evaluation of writing proficiency. As the number of workers and language learners continues to grow, there is a rising demand for an efficient AES system that can reduce costs and time for raters and be utilized for employment, examinations, and self-study purposes.

This study aims to explore the potential of LLM-based AES by comparing the effectiveness of five models: two LLMs (GPT Footnote 2 and BERT), one Japanese local LLM (OCLL), and two conventional machine learning-based methods (linguistic feature-based scoring tools - Jess and JWriter).

The research questions addressed in this study are as follows:

To what extent do the LLM-driven AES and linguistic feature-based AES, when used as automated tools to support human rating, accurately reflect test takers’ actual performance?

What influence does the prompt have on the accuracy and performance of LLM-based AES methods?

The subsequent sections of the manuscript cover the methodology, including the assessment measures for nonnative Japanese writing proficiency, criteria for prompts, and the dataset. The evaluation section focuses on the analysis of annotations and rating scores generated by LLM-driven and linguistic feature-based AES methods.

Methodology

The dataset utilized in this study was obtained from the International Corpus of Japanese as a Second Language (I-JAS) Footnote 3 . This corpus consisted of 1000 participants who represented 12 different first languages. For the study, the participants were given a story-writing task on a personal computer. They were required to write two stories based on the 4-panel illustrations titled “Picnic” and “The key” (see Appendix A). Background information for the participants was provided by the corpus, including their Japanese language proficiency levels assessed through two online tests: J-CAT and SPOT. These tests evaluated their reading, listening, vocabulary, and grammar abilities. The learners’ proficiency levels were categorized into six levels aligned with the Common European Framework of Reference for Languages (CEFR) and the Reference Framework for Japanese Language Education (RFJLE): A1, A2, B1, B2, C1, and C2. According to Lee et al. ( 2015 ), there is a high level of agreement (r = 0.86) between the J-CAT and SPOT assessments, indicating that the proficiency certifications provided by J-CAT are consistent with those of SPOT. However, it is important to note that the scores of J-CAT and SPOT do not have a one-to-one correspondence. In this study, the J-CAT scores were used as a benchmark to differentiate learners of different proficiency levels. A total of 1400 essays were utilized, representing the beginner (aligned with A1), A2, B1, B2, C1, and C2 levels based on the J-CAT scores. Table 1 provides information about the learners’ proficiency levels and their corresponding J-CAT and SPOT scores.

A dataset comprising a total of 1400 essays from the story writing tasks was collected. Among these, 714 essays were utilized to evaluate the reliability of the LLM-based AES method, while the remaining 686 essays were designated as development data to assess the LLM-based AES’s capability to distinguish participants with varying proficiency levels. The GPT-4 API was used in this study. A detailed explanation of the prompt-assessment criteria is provided in Section Prompt. All essays were sent to the model for measurement and scoring.

Measures of writing proficiency for nonnative Japanese

Japanese exhibits a morphologically agglutinative structure where morphemes are attached to the word stem to convey grammatical functions such as tense, aspect, voice, and honorifics, e.g. (5).

(5) 食べ-させ-られ-まし-た-か

tabe-sase-rare-mashi-ta-ka

[eat (stem)-causative-passive-honorification-past tense-question marker]

Japanese employs nine case particles to indicate grammatical functions: the nominative case particle が (ga), the accusative case particle を (o), the genitive case particle の (no), the dative case particle に (ni), the locative/instrumental case particle で (de), the ablative case particle から (kara), the directional case particle へ (e), and the comitative case particle と (to). The agglutinative nature of the language, combined with the case particle system, provides an efficient means of distinguishing between active and passive voice, either through morphemes or case particles, e.g. 食べる taberu “eat (conclusive form)” (active voice); 食べられる taberareru “eat (conclusive form)” (passive voice). In the active voice, “パンを食べる” (pan o taberu) translates to “to eat bread”. On the other hand, in the passive voice, it becomes “パンが食べられた” (pan ga taberareta), which means “(the) bread was eaten”. Additionally, it is important to note that different conjugations of the same lemma are considered as one type in order to ensure a comprehensive assessment of the language features. For example, 食べる taberu “eat (conclusive form)”, 食べている tabeteiru “eat (progressive form)”, and 食べた tabeta “eat (past form)” are counted as one type.

To incorporate these features, previous research (Suzuki, 1999 ; Watanabe et al. 1988 ; Ishioka, 2001 ; Ishioka and Kameda, 2006 ; Hirao et al. 2020 ) has identified complexity, fluency, and accuracy as crucial factors for evaluating writing quality. These criteria are assessed through various aspects, including lexical richness (lexical density, diversity, and sophistication), syntactic complexity, and cohesion (Kyle et al. 2021 ; Mizumoto and Eguchi, 2023 ; Ure, 1971 ; Halliday, 1985 ; Barkaoui and Hadidi, 2020 ; Zenker and Kyle, 2021 ; Kim et al. 2018 ; Lu, 2017 ; Ortega, 2015 ). Therefore, this study proposes five scoring categories: lexical richness, syntactic complexity, cohesion, content elaboration, and grammatical accuracy. A total of 16 measures were employed to capture these categories. The calculation process and specific details of these measures can be found in Table 2 .

T-unit, first introduced by Hunt ( 1966 ), is a measure used for evaluating speech and composition. It serves as an indicator of syntactic development and represents the shortest units into which a piece of discourse can be divided without leaving any sentence fragments. In the context of Japanese language assessment, Sakoda and Hosoi ( 2020 ) utilized T-unit as the basic unit to assess the accuracy and complexity of Japanese learners’ speaking and storytelling. The calculation of T-units in Japanese follows the following principles:

A single main clause constitutes 1 T-unit, regardless of the presence or absence of dependent clauses, e.g. (6).

ケンとマリはピクニックに行きました (main clause): 1 T-unit.

If a sentence contains a main clause along with subclauses, each subclause is considered part of the same T-unit, e.g. (7).

天気が良かった の で (subclause)、ケンとマリはピクニックに行きました (main clause): 1 T-unit.

In the case of coordinate clauses, where multiple clauses are connected, each coordinated clause is counted separately. Thus, a sentence with coordinate clauses may have 2 T-units or more, e.g. (8).

ケンは地図で場所を探して (coordinate clause)、マリはサンドイッチを作りました (coordinate clause): 2 T-units.

Lexical diversity refers to the range of words used within a text (Engber, 1995 ; Kyle et al. 2021 ) and is considered a useful measure of the breadth of vocabulary in L n production (Jarvis, 2013a , 2013b ).

The type/token ratio (TTR) is widely recognized as a straightforward measure for calculating lexical diversity and has been employed in numerous studies. These studies have demonstrated a strong correlation between TTR and other methods of measuring lexical diversity (e.g., Bentz et al. 2016 ; Čech and Miroslav, 2018 ; Çöltekin and Taraka, 2018 ). TTR is computed by considering both the number of unique words (types) and the total number of words (tokens) in a given text. Given that the length of learners’ writing texts can vary, this study employs the moving average type-token ratio (MATTR) to mitigate the influence of text length. MATTR is calculated using a 50-word moving window. Initially, a TTR is determined for words 1–50 in an essay, followed by words 2–51, 3–52, and so on until the end of the essay is reached (Díez-Ortega and Kyle, 2023 ). The final MATTR scores were obtained by averaging the TTR scores for all 50-word windows. The following formula was employed to derive MATTR:

\({\rm{MATTR}}({\rm{W}})=\frac{{\sum }_{{\rm{i}}=1}^{{\rm{N}}-{\rm{W}}+1}{{\rm{F}}}_{{\rm{i}}}}{{\rm{W}}({\rm{N}}-{\rm{W}}+1)}\)

Here, N refers to the number of tokens in the corpus. W is the randomly selected token size (W < N). \({F}_{i}\) is the number of types in each window. The \({\rm{MATTR}}({\rm{W}})\) is the mean of a series of type-token ratios (TTRs) based on the word form for all windows. It is expected that individuals with higher language proficiency will produce texts with greater lexical diversity, as indicated by higher MATTR scores.
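As a concrete illustration of the windowed calculation above, the following short function computes MATTR over a fixed-size moving window; the toy sentence and the 5-word window are only for demonstration (the study uses a 50-word window).

```python
# Sketch of the moving-average type-token ratio (MATTR) described above.
def mattr(tokens, window=50):
    if len(tokens) < window:                 # fall back to plain TTR for very short texts
        return len(set(tokens)) / len(tokens)
    ttrs = [len(set(tokens[i:i + window])) / window
            for i in range(len(tokens) - window + 1)]
    return sum(ttrs) / len(ttrs)             # average TTR over all windows

print(mattr("the cat sat on the mat and the dog sat too".split(), window=5))
```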

Lexical density was captured by the ratio of the number of lexical words to the total number of words (Lu, 2012). Lexical sophistication refers to the utilization of advanced vocabulary, often evaluated through word frequency indices (Crossley et al. 2013; Haberman, 2008; Kyle and Crossley, 2015; Laufer and Nation, 1995; Lu, 2012; Read, 2000). In the context of writing, lexical sophistication can be interpreted as vocabulary breadth, which entails the appropriate usage of vocabulary items across various lexico-grammatical contexts and registers (Garner et al. 2019; Kim et al. 2018; Kyle et al. 2018). In Japanese specifically, words are considered lexically sophisticated if they are not included in the “Japanese Education Vocabulary List Ver 1.0”. Footnote 4 Consequently, lexical sophistication was calculated by determining the number of sophisticated word types relative to the total number of words per essay. Furthermore, it has been suggested that, in Japanese writing, sentences should ideally have a length of no more than 40 to 50 characters, as this promotes readability. Therefore, the median and maximum sentence length can be considered as useful indices for assessment (Ishioka and Kameda, 2006).

Syntactic complexity was assessed based on several measures, including the mean length of clauses, verb phrases per T-unit, clauses per T-unit, dependent clauses per T-unit, complex nominals per clause, adverbial clauses per clause, coordinate phrases per clause, and mean dependency distance (MDD). The MDD reflects the distance between the governor and dependent positions in a sentence. A larger dependency distance indicates a higher cognitive load and greater complexity in syntactic processing (Liu, 2008 ; Liu et al. 2017 ). The MDD has been established as an efficient metric for measuring syntactic complexity (Jiang, Quyang, and Liu, 2019 ; Li and Yan, 2021 ). To calculate the MDD, the position numbers of the governor and dependent are subtracted, assuming that words in a sentence are assigned in a linear order, such as W1 … Wi … Wn. In any dependency relationship between words Wa and Wb, Wa is the governor and Wb is the dependent. The MDD of the entire sentence was obtained by taking the absolute value of governor – dependent:

MDD = \(\frac{1}{n}{\sum }_{i=1}^{n}|{\rm{D}}{{\rm{D}}}_{i}|\)

In this formula, \(n\) represents the number of words in the sentence, and \(DD_i\) is the dependency distance of the \(i^{th}\) dependency relationship of the sentence. Building on this, consider the sentence ‘Mary-ga John-ni keshigomu-o watashi-ta’ [Mary-NOM John-DAT eraser-ACC give-PST], “Mary gave John an eraser”; its MDD would be 2. Table 3 shows the CSV file used as a prompt for GPT-4.
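To make the computation concrete, the sketch below takes a list of (governor position, dependent position) pairs from a parsed sentence and averages the absolute distances; the dependency pairs for the example sentence are assumed for illustration, with all three arguments attached to the final verb.

```python
# Sketch of mean dependency distance (MDD) from (governor, dependent) position pairs.
def mean_dependency_distance(dependencies):
    distances = [abs(gov - dep) for gov, dep in dependencies]
    return sum(distances) / len(distances)

# Hypothetical parse of "Mary-ga John-ni keshigomu-o watashita" (words numbered 1-4,
# with the three arguments depending on the verb in position 4).
deps = [(4, 1), (4, 2), (4, 3)]
print(mean_dependency_distance(deps))  # 2.0, matching the example above
```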

Cohesion (semantic similarity) and content elaboration aim to capture the ideas presented in test takers’ essays. Cohesion was assessed using three measures: synonym overlap/paragraph (topic), synonym overlap/paragraph (keywords), and word2vec cosine similarity. Content elaboration and development were measured as the number of metadiscourse markers (types) divided by the number of words. To capture content closely, this study proposed a novel distance-based representation that encodes the cosine distance between the learner’s essay and the essay task’s (topic and keyword) i-vectors. The learner’s essay is decoded into a word sequence and aligned to the essay task’s topic and keywords for log-likelihood measurement. The cosine distance yields the content elaboration score for the learner’s essay. The cosine similarity between target and reference vectors is given in (11), where \((L_1, \ldots, L_n)\) and \((N_1, \ldots, N_n)\) are the vectors representing the learner’s essay and the task’s topic and keywords, respectively. The content elaboration distance between \(L_i\) and \(N_i\) was calculated as follows:

\(\cos \left(\theta \right)=\frac{{\rm{L}}\,\cdot\, {\rm{N}}}{\left|{\rm{L}}\right|{\rm{|N|}}}=\frac{\mathop{\sum }\nolimits_{i=1}^{n}{L}_{i}{N}_{i}}{\sqrt{\mathop{\sum }\nolimits_{i=1}^{n}{L}_{i}^{2}}\sqrt{\mathop{\sum }\nolimits_{i=1}^{n}{N}_{i}^{2}}}\)

A high similarity value indicates a low difference between the two recognition outcomes, which in turn suggests a high level of proficiency in content elaboration.
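A minimal numeric illustration of equation (11), with invented four-dimensional vectors standing in for the averaged word2vec representations of an essay and its task topic/keywords:

```python
# Sketch of the cosine similarity in (11) between essay and task vectors.
import numpy as np

def cosine_similarity(L, N):
    return float(np.dot(L, N) / (np.linalg.norm(L) * np.linalg.norm(N)))

essay_vec = np.array([0.21, 0.83, 0.10, 0.44])  # hypothetical essay embedding
topic_vec = np.array([0.25, 0.79, 0.05, 0.51])  # hypothetical topic/keyword embedding
print(cosine_similarity(essay_vec, topic_vec))  # values near 1 indicate high content overlap
```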

To evaluate the effectiveness of the proposed measures in distinguishing different proficiency levels among nonnative Japanese speakers’ writing, we conducted a multi-faceted Rasch measurement analysis (Linacre, 1994 ). This approach applies measurement models to thoroughly analyze various factors that can influence test outcomes, including test takers’ proficiency, item difficulty, and rater severity, among others. The underlying principles and functionality of multi-faceted Rasch measurement are illustrated in (12).

\(\log \left(\frac{{P}_{{nijk}}}{{P}_{{nij}(k-1)}}\right)={B}_{n}-{D}_{i}-{C}_{j}-{F}_{k}\)

(12) defines the logarithmic transformation of the probability ratio (P_nijk / P_nij(k−1)) as a function of multiple parameters. Here, n represents the test taker, i denotes a writing proficiency measure, j corresponds to the human rater, and k represents the proficiency score. The parameter B_n signifies the proficiency level of test taker n (where n ranges from 1 to N). D_i represents the difficulty parameter of test item i (where i ranges from 1 to L), while C_j represents the severity of rater j (where j ranges from 1 to J). Additionally, F_k represents the step difficulty for a test taker to move from score k−1 to k. P_nijk refers to the probability of rater j assigning score k to test taker n for test item i. P_nij(k−1) represents the likelihood of test taker n being assigned score k−1 by rater j for test item i. Each facet within the test is treated as an independent parameter and estimated within the same reference framework. To evaluate the consistency of scores obtained through both human and computer analysis, we utilized the Infit mean-square statistic. This statistic is a chi-square measure divided by the degrees of freedom and is weighted with information. It demonstrates higher sensitivity to unexpected patterns in responses to items near a person’s proficiency level (Linacre, 2002). Fit statistics are assessed based on predefined thresholds for acceptable fit. For the Infit MNSQ, which has a mean of 1.00, different thresholds have been suggested. Some propose stricter thresholds ranging from 0.7 to 1.3 (Bond et al. 2021), while others suggest more lenient thresholds ranging from 0.5 to 1.5 (Eckes, 2009). In this study, we adopted the criterion of 0.70–1.30 for the Infit MNSQ.
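The category probabilities implied by (12) can be written out directly: the probability of each score is proportional to the exponential of the cumulative sum of (B_n − D_i − C_j − F_h) over the steps up to that score. The sketch below implements this formulation; the parameter values are invented, and this is not the estimation routine used in the study (parameter estimation requires dedicated Rasch software).

```python
# Sketch of category probabilities under the many-facet Rasch model in (12).
import numpy as np

def rasch_category_probs(B, D, C, F):
    """B: test-taker proficiency, D: measure difficulty, C: rater severity,
    F: step difficulties F_1..F_K. Returns probabilities for scores 0..K."""
    steps = B - D - C - np.asarray(F, dtype=float)    # B_n - D_i - C_j - F_k for each step
    cum = np.concatenate(([0.0], np.cumsum(steps)))   # score 0 corresponds to the empty sum
    probs = np.exp(cum - cum.max())                   # subtract max for numerical stability
    return probs / probs.sum()

# Hypothetical test taker (B = 1.2), measure (D = 0.0), slightly severe rater (C = 0.3)
print(rasch_category_probs(B=1.2, D=0.0, C=0.3, F=[-1.0, 0.0, 1.0]))
```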

Moving forward, we can now proceed to assess the effectiveness of the 16 proposed measures based on five criteria for accurately distinguishing various levels of writing proficiency among non-native Japanese speakers. To conduct this evaluation, we utilized the development dataset from the I-JAS corpus, as described in Section Dataset . Table 4 provides a measurement report that presents the performance details of the 14 metrics under consideration. The measure separation was found to be 4.02, indicating a clear differentiation among the measures. The reliability index for the measure separation was 0.891, suggesting consistency in the measurement. Similarly, the person separation reliability index was 0.802, indicating the accuracy of the assessment in distinguishing between individuals. All 16 measures demonstrated Infit mean squares within a reasonable range, ranging from 0.76 to 1.28. The Synonym overlap/paragraph (topic) measure exhibited a relatively high outfit mean square of 1.46, although the Infit mean square falls within an acceptable range. The standard error for the measures ranged from 0.13 to 0.28, indicating the precision of the estimates.

Table 5 further illustrated the weights assigned to different linguistic measures for score prediction, with higher weights indicating stronger correlations between those measures and higher scores. Specifically, the following measures exhibited higher weights compared to others: moving average type token ratio per essay has a weight of 0.0391. Mean dependency distance had a weight of 0.0388. Mean length of clause, calculated by dividing the number of words by the number of clauses, had a weight of 0.0374. Complex nominals per T-unit, calculated by dividing the number of complex nominals by the number of T-units, had a weight of 0.0379. Coordinate phrases rate, calculated by dividing the number of coordinate phrases by the number of clauses, had a weight of 0.0325. Grammatical error rate, representing the number of errors per essay, had a weight of 0.0322.

Criteria (output indicator)

The criteria used to evaluate the writing ability in this study were based on CEFR, which follows a six-point scale ranging from A1 to C2. To assess the quality of Japanese writing, the scoring criteria from Table 6 were utilized. These criteria were derived from the IELTS writing standards and served as assessment guidelines and prompts for the written output.

A prompt is a question or detailed instruction that is provided to the model to obtain a proper response. After several pilot experiments, we decided to provide the measures (Section Measures of writing proficiency for nonnative Japanese) as the input prompt and use the criteria (Section Criteria (output indicator)) as the output indicator. Regarding the prompt language, considering that the LLM was tasked with rating Japanese essays, would prompting in Japanese work better? Footnote 5 We conducted experiments comparing the performance of GPT-4 using both English and Japanese prompts. Additionally, we utilized the Japanese local model OCLL with Japanese prompts. Multiple trials were conducted using the same sample. Regardless of the prompt language used, we consistently obtained the same grading results with GPT-4, which assigned a grade of B1 to the writing sample. This suggested that GPT-4 is reliable and capable of producing consistent ratings regardless of the prompt language. On the other hand, when we used Japanese prompts with the Japanese local model “OCLL”, we encountered inconsistent grading results. Out of 10 attempts with OCLL, only 6 yielded consistent grading results (B1), while the remaining 4 showed different outcomes, including A1 and B2 grades. These findings indicated that the language of the prompt was not the determining factor for reliable AES. Instead, the size of the training data and the model parameters played crucial roles in achieving consistent and reliable AES results for the language model.

The following is the utilized prompt, which details all measures and requires the LLM to score the essays using holistic and trait scores.

Please evaluate Japanese essays written by Japanese learners and assign a score to each essay on a six-point scale, ranging from A1, A2, B1, B2, C1 to C2. Additionally, please provide trait scores and display the calculation process for each trait score. The scoring should be based on the following criteria:

Moving average type-token ratio.

Number of lexical words (token) divided by the total number of words per essay.

Number of sophisticated word types divided by the total number of words per essay.

Mean length of clause.

Verb phrases per T-unit.

Clauses per T-unit.

Dependent clauses per T-unit.

Complex nominals per clause.

Adverbial clauses per clause.

Coordinate phrases per clause.

Mean dependency distance.

Synonym overlap paragraph (topic and keywords).

Word2vec cosine similarity.

Connectives per essay.

Conjunctions per essay.

Number of metadiscourse markers (types) divided by the total number of words.

Number of errors per essay.

Japanese essay text

出かける前に二人が地図を見ている間に、サンドイッチを入れたバスケットに犬が入ってしまいました。それに気づかずに二人は楽しそうに出かけて行きました。やがて突然犬がバスケットから飛び出し、二人は驚きました。バスケット の 中を見ると、食べ物はすべて犬に食べられていて、二人は困ってしまいました。(ID_JJJ01_SW1)

The score of the example above was B1. Figure 3 provides an example of holistic and trait scores provided by GPT-4 (with a prompt indicating all measures) via Bing Footnote 6 .

Figure 3: Example of GPT-4 AES and feedback (with a prompt indicating all measures).
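For readers who want to reproduce the setup in outline, the snippet below shows one way to send such a rubric prompt and an essay to the GPT-4 API with the OpenAI Python client. The abbreviated rubric and essay strings are placeholders; the exact prompt wording used in the study is given above, so treat this purely as an illustrative sketch rather than the authors' code.

```python
# Hedged sketch of prompting GPT-4 for essay scoring via the OpenAI Python client.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

rubric = "Please evaluate Japanese essays written by Japanese learners ... (16 measures, CEFR A1-C2)"
essay = "出かける前に二人が地図を見ている間に、..."  # abbreviated learner essay

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": rubric},
        {"role": "user", "content": essay},
    ],
    temperature=0,  # deterministic output helps scoring consistency
)
print(response.choices[0].message.content)  # expected: holistic CEFR grade plus trait scores
```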

Statistical analysis

The aim of this study is to investigate the potential use of LLM for nonnative Japanese AES. It seeks to compare the scoring outcomes obtained from feature-based AES tools, which rely on conventional machine learning technology (i.e. Jess, JWriter), with those generated by AI-driven AES tools utilizing deep learning technology (BERT, GPT, OCLL). To assess the reliability of a computer-assisted annotation tool, the study initially established human-human agreement as the benchmark measure. Subsequently, the performance of the LLM-based method was evaluated by comparing it to human-human agreement.

To assess annotation agreement, the study employed standard measures such as precision, recall, and F-score (Brants 2000 ; Lu 2010 ), along with the quadratically weighted kappa (QWK) to evaluate the consistency and agreement in the annotation process. Assume A and B represent human annotators. When comparing the annotations of the two annotators, the following results are obtained. The evaluation of precision, recall, and F-score metrics was illustrated in equations (13) to (15).

\({\rm{Recall}}(A,B)=\frac{{\rm{Number}}\,{\rm{of}}\,{\rm{identical}}\,{\rm{nodes}}\,{\rm{in}}\,A\,{\rm{and}}\,B}{{\rm{Number}}\,{\rm{of}}\,{\rm{nodes}}\,{\rm{in}}\,A}\)

\({\rm{Precision}}(A,\,B)=\frac{{\rm{Number}}\,{\rm{of}}\,{\rm{identical}}\,{\rm{nodes}}\,{\rm{in}}\,A\,{\rm{and}}\,B}{{\rm{Number}}\,{\rm{of}}\,{\rm{nodes}}\,{\rm{in}}\,B}\)

The F-score is the harmonic mean of recall and precision:

\({\rm{F}}-{\rm{score}}=\frac{2* ({\rm{Precision}}* {\rm{Recall}})}{{\rm{Precision}}+{\rm{Recall}}}\)

The highest possible value of an F-score is 1.0, indicating perfect precision and recall, and the lowest possible value is 0, if either precision or recall are zero.
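Treating each annotator's output as a set of annotated nodes, equations (13)–(15) reduce to a few lines of code; the node labels below are invented purely to show the calculation.

```python
# Sketch of precision, recall, and F-score for annotation agreement, per (13)-(15).
def agreement(nodes_a, nodes_b):
    identical = len(set(nodes_a) & set(nodes_b))
    recall = identical / len(nodes_a)        # identical nodes / nodes in A
    precision = identical / len(nodes_b)     # identical nodes / nodes in B
    f_score = (2 * precision * recall / (precision + recall)
               if (precision + recall) else 0.0)
    return precision, recall, f_score

print(agreement(["T-unit_1", "T-unit_2", "clause_1"],
                ["T-unit_1", "clause_1", "clause_2"]))
```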

In accordance with Taghipour and Ng ( 2016 ), the calculation of QWK involves two steps:

Step 1: Construct a weight matrix W as follows:

\({W}_{{ij}}=\frac{{(i-j)}^{2}}{{(N-1)}^{2}}\)

i represents the annotation made by the tool, while j represents the annotation made by a human rater. N denotes the total number of possible annotations. Matrix O is subsequently computed, where \(O_{i,j}\) represents the count of data annotated by the tool (i) and the human annotator (j). On the other hand, E refers to the expected count matrix, which undergoes normalization to ensure that the sum of elements in E matches the sum of elements in O.

Step 2: With matrices O and E, the QWK is obtained as follows:

\(K=1-\frac{\sum_{i,j}W_{i,j}\,O_{i,j}}{\sum_{i,j}W_{i,j}\,E_{i,j}}\)
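In practice the same quantity can be obtained from scikit-learn's cohen_kappa_score with quadratic weights; the CEFR-to-integer encoding and the toy ratings below are assumptions for illustration.

```python
# Sketch of quadratically weighted kappa (QWK) between human and model ratings.
from sklearn.metrics import cohen_kappa_score

levels = {"A1": 0, "A2": 1, "B1": 2, "B2": 3, "C1": 4, "C2": 5}
human = [levels[g] for g in ["B1", "A2", "C1", "B2", "B1"]]
model = [levels[g] for g in ["B1", "B1", "C1", "B1", "B1"]]
print(cohen_kappa_score(human, model, weights="quadratic"))
```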

The value of the quadratic weighted kappa increases as the level of agreement improves. Further, to assess the accuracy of LLM scoring, the proportional reductive mean square error (PRMSE) was employed. The PRMSE approach takes into account the variability observed in human ratings to estimate the rater error, which is then subtracted from the variance of the human labels. This calculation provides an overall measure of agreement between the automated scores and true scores (Haberman et al. 2015 ; Loukina et al. 2020 ; Taghipour and Ng, 2016 ). The computation of PRMSE involves the following steps:

Step 1: Calculate the mean squared errors (MSEs) for the scoring outcomes of the computer-assisted tool (MSE tool) and the human scoring outcomes (MSE human).

Step 2: Determine the PRMSE by comparing the MSE of the computer-assisted tool (MSE tool) with the MSE from human raters (MSE human), using the following formula:

\(\mathrm{PRMSE}=1-\frac{\mathrm{MSE}_{\mathrm{tool}}}{\mathrm{MSE}_{\mathrm{human}}}=1-\frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{n}(y_i-\hat{y})^2}\)

In the numerator, ŷi represents the scoring outcome predicted by a specific LLM-driven AES system for a given sample. The term y i − ŷ i represents the difference between this predicted outcome and the mean value of all LLM-driven AES systems’ scoring outcomes. It quantifies the deviation of the specific LLM-driven AES system’s prediction from the average prediction of all LLM-driven AES systems. In the denominator, y i − ŷ represents the difference between the scoring outcome provided by a specific human rater for a given sample and the mean value of all human raters’ scoring outcomes. It measures the discrepancy between the specific human rater’s score and the average score given by all human raters. The PRMSE is then calculated by subtracting the ratio of the MSE tool to the MSE human from 1. PRMSE falls within the range of 0 to 1, with larger values indicating reduced errors in LLM’s scoring compared to those of human raters. In other words, a higher PRMSE implies that LLM’s scoring demonstrates greater accuracy in predicting the true scores (Loukina et al. 2020 ). The interpretation of kappa values, ranging from 0 to 1, is based on the work of Landis and Koch ( 1977 ). Specifically, the following categories are assigned to different ranges of kappa values: −1 indicates complete inconsistency, 0 indicates random agreement, 0.0 ~ 0.20 indicates extremely low level of agreement (slight), 0.21 ~ 0.40 indicates moderate level of agreement (fair), 0.41 ~ 0.60 indicates medium level of agreement (moderate), 0.61 ~ 0.80 indicates high level of agreement (substantial), 0.81 ~ 1 indicates almost perfect level of agreement. All statistical analyses were executed using Python script.
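Since the statistical analyses were run in Python, the two-step PRMSE computation described above can be sketched as follows; it follows the simplified formula given in the text (tool error versus human error about the mean human score), and the example scores are invented.

```python
# Sketch of the PRMSE calculation in the two steps above.
import numpy as np

def prmse(tool_scores, human_scores):
    tool = np.asarray(tool_scores, dtype=float)
    human = np.asarray(human_scores, dtype=float)
    mse_tool = np.mean((human - tool) ** 2)            # Step 1: MSE of the automated tool
    mse_human = np.mean((human - human.mean()) ** 2)   # Step 1: human error about the mean
    return 1 - mse_tool / mse_human                    # Step 2

print(prmse(tool_scores=[2, 3, 3, 4, 5], human_scores=[2, 3, 4, 4, 5]))  # ~0.81
```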

Results and discussion

Annotation reliability of the LLM

This section focuses on assessing the reliability of the LLM’s annotation and scoring capabilities. To evaluate the reliability, several tests were conducted simultaneously, aiming to achieve the following objectives:

Assess the LLM’s ability to differentiate between test takers with varying levels of writing proficiency.

Determine the level of agreement between the annotations and scoring performed by the LLM and those done by human raters.

The evaluation of the results encompassed several metrics, including: precision, recall, F-Score, quadratically-weighted kappa, proportional reduction of mean squared error, Pearson correlation, and multi-faceted Rasch measurement.

Inter-annotator agreement (human–human annotator agreement)

We started with an agreement test between the two human annotators. Two trained annotators were recruited to annotate the measures in the writing task data. A total of 714 scripts was utilized as the test data. Each analysis lasted 300–360 min. Inter-annotator agreement was evaluated using the standard measures of precision, recall, F-score, and QWK. Table 7 presents the inter-annotator agreement for the various indicators. As shown, the inter-annotator agreement was fairly high, with F-scores ranging from 1.0 for sentence and word number to 0.666 for grammatical errors.

The findings from the QWK analysis provided further confirmation of the inter-annotator agreement. The QWK values covered a range from 0.950 ( p  = 0.000) for sentence and word number to 0.695 for synonym overlap number (keyword) and grammatical errors ( p  = 0.001).

Agreement of annotation outcomes between human and LLM

To evaluate the consistency between human annotators and LLM annotators (BERT, GPT, OCLL) across the indices, the same test was conducted. The results of the inter-annotator agreement (F-score) between LLM and human annotation are provided in Appendix B-D. The F-scores ranged from 0.706 for grammatical error count (OCLL-human) to a perfect 1.000 for sentences, clauses, T-units, and words (GPT-human). These findings were further supported by the QWK analysis, which showed agreement levels ranging from 0.807 ( p  = 0.001) for metadiscourse markers (OCLL-human) to 0.962 ( p  = 0.000) for words (GPT-human). The findings demonstrated that the LLM annotation achieved a significant level of accuracy in identifying measurement units and counts.

Reliability of LLM-driven AES’s scoring and discriminating proficiency levels

This section examines the reliability of the LLM-driven AES scoring through a comparison of the scoring outcomes produced by human raters and the LLM ( Reliability of LLM-driven AES scoring ). It also assesses the effectiveness of the LLM-based AES system in differentiating participants with varying proficiency levels ( Reliability of LLM-driven AES discriminating proficiency levels ).

Reliability of LLM-driven AES scoring

Table 8 summarizes the QWK coefficient analysis between the scores computed by the human raters and the GPT-4 for the individual essays from I-JAS Footnote 7 . As shown, the QWK of all measures ranged from k  = 0.819 for lexical density (number of lexical words (tokens)/number of words per essay) to k  = 0.644 for word2vec cosine similarity. Table 9 further presents the Pearson correlations between the 16 writing proficiency measures scored by human raters and GPT 4 for the individual essays. The correlations ranged from 0.672 for syntactic complexity to 0.734 for grammatical accuracy. The correlations between the writing proficiency scores assigned by human raters and the BERT-based AES system were found to range from 0.661 for syntactic complexity to 0.713 for grammatical accuracy. The correlations between the writing proficiency scores given by human raters and the OCLL-based AES system ranged from 0.654 for cohesion to 0.721 for grammatical accuracy. These findings indicated an alignment between the assessments made by human raters and both the BERT-based and OCLL-based AES systems in terms of various aspects of writing proficiency.

Reliability of LLM-driven AES discriminating proficiency levels

After validating the reliability of the LLM’s annotation and scoring, the subsequent objective was to evaluate its ability to distinguish between various proficiency levels. For this analysis, a dataset of 686 individual essays was utilized. Table 10 presents a sample of the results, summarizing the means, standard deviations, and the outcomes of the one-way ANOVAs based on the measures assessed by the GPT-4 model. A post hoc multiple comparison test, specifically the Bonferroni test, was conducted to identify any potential differences between pairs of levels.

As the results reveal, seven measures presented linear upward or downward progress across the three proficiency levels. These were marked in bold in Table 10 and comprise one measure of lexical richness, i.e. MATTR (lexical diversity); four measures of syntactic complexity, i.e. MDD (mean dependency distance), MLC (mean length of clause), CNT (complex nominals per T-unit), CPC (coordinate phrases rate); one cohesion measure, i.e. word2vec cosine similarity and GER (grammatical error rate). Regarding the ability of the sixteen measures to distinguish adjacent proficiency levels, the Bonferroni tests indicated that statistically significant differences exist between the primary level and the intermediate level for MLC and GER. One measure of lexical richness, namely LD, along with three measures of syntactic complexity (VPT, CT, DCT, ACC), two measures of cohesion (SOPT, SOPK), and one measure of content elaboration (IMM), exhibited statistically significant differences between proficiency levels. However, these differences did not demonstrate a linear progression between adjacent proficiency levels. No significant difference was observed in lexical sophistication between proficiency levels.

To summarize, our study aimed to evaluate the reliability and differentiation capabilities of the LLM-driven AES method. For the first objective, we assessed the LLM’s ability to differentiate between test takers with varying levels of writing proficiency using precision, recall, F-score, and quadratically-weighted kappa. Regarding the second objective, we compared the scoring outcomes generated by human raters and the LLM to determine the level of agreement. We employed quadratically-weighted kappa and Pearson correlations to compare the 16 writing proficiency measures for the individual essays. The results confirmed the feasibility of using the LLM for annotation and scoring in AES for nonnative Japanese. As a result, Research Question 1 has been addressed.

Comparison of BERT-, GPT-, OCLL-based AES, and linguistic-feature-based computation methods

This section aims to compare the effectiveness of five AES methods for nonnative Japanese writing, i.e. LLM-driven approaches utilizing BERT, GPT, and OCLL, linguistic feature-based approaches using Jess and JWriter. The comparison was conducted by comparing the ratings obtained from each approach with human ratings. All ratings were derived from the dataset introduced in Dataset . To facilitate the comparison, the agreement between the automated methods and human ratings was assessed using QWK and PRMSE. The performance of each approach was summarized in Table 11 .

The QWK coefficient values indicate that LLMs (GPT, BERT, OCLL) and human rating outcomes demonstrated higher agreement compared to feature-based AES methods (Jess and JWriter) in assessing writing proficiency criteria, including lexical richness, syntactic complexity, content, and grammatical accuracy. Among the LLMs, the GPT-4 driven AES and human rating outcomes showed the highest agreement in all criteria, except for syntactic complexity. The PRMSE values suggest that the GPT-based method outperformed linguistic feature-based methods and other LLM-based approaches. Moreover, an interesting finding emerged during the study: the agreement coefficient between GPT-4 and human scoring was even higher than the agreement between different human raters themselves. This discovery highlights the advantage of GPT-based AES over human rating. Ratings involve a series of processes, including reading the learners’ writing, evaluating the content and language, and assigning scores. Within this chain of processes, various biases can be introduced, stemming from factors such as rater biases, test design, and rating scales. These biases can impact the consistency and objectivity of human ratings. GPT-based AES may benefit from its ability to apply consistent and objective evaluation criteria. By prompting the GPT model with detailed writing scoring rubrics and linguistic features, potential biases in human ratings can be mitigated. The model follows a predefined set of guidelines and does not possess the same subjective biases that human raters may exhibit. This standardization in the evaluation process contributes to the higher agreement observed between GPT-4 and human scoring. Section Prompt strategy of the study delves further into the role of prompts in the application of LLMs to AES. It explores how the choice and implementation of prompts can impact the performance and reliability of LLM-based AES methods. Furthermore, it is important to acknowledge the strengths of the local model, i.e. the Japanese local model OCLL, which excels in processing certain idiomatic expressions. Nevertheless, our analysis indicated that GPT-4 surpasses local models in AES. This superior performance can be attributed to the larger parameter size of GPT-4, estimated to be between 500 billion and 1 trillion, which exceeds the sizes of both BERT and the local model OCLL.

Prompt strategy

In the context of prompt strategy, Mizumoto and Eguchi ( 2023 ) conducted a study where they applied the GPT-3 model to automatically score English essays in the TOEFL test. They found that the accuracy of the GPT model alone was moderate to fair. However, when they incorporated linguistic measures such as cohesion, syntactic complexity, and lexical features alongside the GPT model, the accuracy significantly improved. This highlights the importance of prompt engineering and providing the model with specific instructions to enhance its performance. In this study, a similar approach was taken to optimize the performance of LLMs. GPT-4, which outperformed BERT and OCLL, was selected as the candidate model. Model 1 was used as the baseline, representing GPT-4 without any additional prompting. Model 2, on the other hand, involved GPT-4 prompted with 16 measures that included scoring criteria, efficient linguistic features for writing assessment, and detailed measurement units and calculation formulas. The remaining models (Models 3 to 18) utilized GPT-4 prompted with individual measures. The performance of these 18 different models was assessed using the output indicators described in Section Criteria (output indicator) . By comparing the performances of these models, the study aimed to understand the impact of prompt engineering on the accuracy and effectiveness of GPT-4 in AES tasks.

Based on the PRMSE scores presented in Fig. 4 , it was observed that Model 1, representing GPT-4 without any additional prompting, achieved a fair level of performance. However, Model 2, which utilized GPT-4 prompted with all measures, outperformed all other models in terms of PRMSE score, achieving a score of 0.681. These results indicate that the inclusion of specific measures and prompts significantly enhanced the performance of GPT-4 in AES. Among the measures, syntactic complexity was found to play a particularly significant role in improving the accuracy of GPT-4 in assessing writing quality. Following that, lexical diversity emerged as another important factor contributing to the model’s effectiveness. The study suggests that a well-prompted GPT-4 can serve as a valuable tool to support human assessors in evaluating writing quality. By utilizing GPT-4 as an automated scoring tool, the evaluation biases associated with human raters can be minimized. This has the potential to empower teachers by allowing them to focus on designing writing tasks and guiding writing strategies, while leveraging the capabilities of GPT-4 for efficient and reliable scoring.

Figure 4: PRMSE scores of the 18 AES models.

This study aimed to investigate two main research questions: the feasibility of utilizing LLMs for AES and the impact of prompt engineering on the application of LLMs in AES.

To address the first objective, the study compared the effectiveness of five different models: GPT, BERT, the Japanese local LLM (OCLL), and two conventional machine learning-based AES tools (Jess and JWriter). The PRMSE values indicated that the GPT-4-based method outperformed other LLMs (BERT, OCLL) and linguistic feature-based computational methods (Jess and JWriter) across various writing proficiency criteria. Furthermore, the agreement coefficient between GPT-4 and human scoring surpassed the agreement among human raters themselves, highlighting the potential of using the GPT-4 tool to enhance AES by reducing biases and subjectivity, saving time, labor, and cost, and providing valuable feedback for self-study. Regarding the second goal, the role of prompt design was investigated by comparing 18 models, including a baseline model, a model prompted with all measures, and 16 models prompted with one measure at a time. GPT-4, which outperformed BERT and OCLL, was selected as the candidate model. The PRMSE scores of the models showed that GPT-4 prompted with all measures achieved the best performance, surpassing the baseline and other models.

In conclusion, this study has demonstrated the potential of LLMs in supporting human rating in assessments. By incorporating automation, we can save time and resources while reducing biases and subjectivity inherent in human rating processes. Automated language assessments offer the advantage of accessibility, providing equal opportunities and economic feasibility for individuals who lack access to traditional assessment centers or necessary resources. LLM-based language assessments provide valuable feedback and support to learners, aiding in the enhancement of their language proficiency and the achievement of their goals. This personalized feedback can cater to individual learner needs, facilitating a more tailored and effective language-learning experience.

There are three important areas that merit further exploration. First, prompt engineering requires attention to ensure optimal performance of LLM-based AES across different language types. This study revealed that GPT-4, when prompted with all measures, outperformed models prompted with fewer measures. Therefore, investigating and refining prompt strategies can enhance the effectiveness of LLMs in automated language assessments. Second, it is crucial to explore the application of LLMs in second-language assessment and learning for oral proficiency, as well as their potential in under-resourced languages. Recent advancements in self-supervised machine learning techniques have significantly improved automatic speech recognition (ASR) systems, opening up new possibilities for creating reliable ASR systems, particularly for under-resourced languages with limited data. However, challenges persist in the field of ASR. First, ASR assumes correct word pronunciation for automatic pronunciation evaluation, which proves challenging for learners in the early stages of language acquisition due to diverse accents influenced by their native languages. Accurately segmenting short words becomes problematic in such cases. Second, developing precise audio-text transcriptions for languages with non-native accented speech poses a formidable task. Last, assessing oral proficiency levels involves capturing various linguistic features, including fluency, pronunciation, accuracy, and complexity, which are not easily captured by current NLP technology.

Data availability

The dataset utilized was obtained from the International Corpus of Japanese as a Second Language (I-JAS). The data URLs: [ https://www2.ninjal.ac.jp/jll/lsaj/ihome2.html ].

J-CAT and TTBJ are two computerized adaptive tests used to assess Japanese language proficiency.

SPOT is a specific component of the TTBJ test.

J-CAT: https://www.j-cat2.org/html/ja/pages/interpret.html

SPOT: https://ttbj.cegloc.tsukuba.ac.jp/p1.html#SPOT .

The study utilized a prompt-based GPT-4 model, developed by OpenAI, which has an impressive architecture with 1.8 trillion parameters across 120 layers. GPT-4 was trained on a vast dataset of 13 trillion tokens, using two stages: initial training on internet text datasets to predict the next token, and subsequent fine-tuning through reinforcement learning from human feedback.

https://www2.ninjal.ac.jp/jll/lsaj/ihome2-en.html .

http://jhlee.sakura.ne.jp/JEV/ by Japanese Learning Dictionary Support Group 2015.

We express our sincere gratitude to the reviewer for bringing this matter to our attention.

On February 7, 2023, Microsoft began rolling out a major overhaul to Bing that included a new chatbot feature based on OpenAI’s GPT-4 (Bing.com).

Appendix E-F present the analysis results of the QWK coefficient between the scores computed by the human raters and the BERT, OCLL models.



Funding

This research was funded by the National Foundation of Social Sciences (22BYY186), awarded to Wenchao Li.

Author information

Authors and Affiliations

Department of Japanese Studies, Zhejiang University, Hangzhou, China

Department of Linguistics and Applied Linguistics, Zhejiang University, Hangzhou, China


Contributions

Wenchao Li was responsible for conceptualization, validation, formal analysis, investigation, data curation, visualization, and drafting the manuscript. Haitao Liu was responsible for supervision.

Corresponding author

Correspondence to Wenchao Li.

Ethics declarations

Competing interests

The authors declare no competing interests.

Ethical approval

Ethical approval was not required as the study did not involve human participants.

Informed consent

This article does not contain any studies with human participants performed by any of the authors.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary material file #1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article

Li, W., Liu, H. Applying large language models for automated essay scoring for non-native Japanese. Humanit Soc Sci Commun 11, 723 (2024). https://doi.org/10.1057/s41599-024-03209-9


Received: 02 February 2024

Accepted: 16 May 2024

Published: 03 June 2024

DOI: https://doi.org/10.1057/s41599-024-03209-9


Google’s A.I. Search Leaves Publishers Scrambling

Since Google overhauled its search engine, publishers have tried to assess the danger to their brittle business models while calling for government intervention.


[Image: Sundar Pichai stands on a stage with the word “Gemini” displayed behind him.]

By Nico Grant and Katie Robertson

Nico Grant reports on Google from San Francisco and Katie Robertson reports on media from New York.

When Frank Pine searched Google for a link to a news article two months ago, he encountered paragraphs generated by artificial intelligence about the topic at the top of his results. To see what he wanted, he had to scroll past them.

That experience annoyed Mr. Pine, the executive editor of Media News Group and Tribune Publishing, which own 68 daily newspapers across the country. Now, those paragraphs scare him.

In May, Google announced that the A.I.-generated summaries, which compile content from news sites and blogs on the topic being searched, would be made available to everyone in the United States. And that change has Mr. Pine and many other publishing executives worried that the paragraphs pose a big danger to their brittle business model, by sharply reducing the amount of traffic to their sites from Google.

“It potentially chokes off the original creators of the content,” Mr. Pine said. The feature, AI Overviews, felt like another step toward generative A.I. replacing “the publications that they have cannibalized,” he added.

Media executives said in interviews that Google had left them in a vexing position. They want their sites listed in Google’s search results, which for some outlets can generate more than half of their traffic. But doing that means Google can use their content in AI Overviews summaries.

Publishers could also try to protect their content from Google by forbidding its web crawler from sharing any content snippets from their sites. But then their links would show up without any description, making people less likely to click.

Another alternative — refusing to be indexed by Google, and not appearing on its search engine at all — could be fatal to their business, they said.

“We can’t do that, at least for now,” said Renn Turiano, the head of product at Gannett, the country’s largest newspaper publisher.

Yet AI Overviews, he said, “is greatly detrimental to everyone apart from Google, but especially to consumers, smaller publishers and businesses large and small that use search results.”
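The options described above — blocking snippets or staying out of the index entirely — come down to ordinary crawler directives. As a rough sketch only (the robots.txt content and URLs below are hypothetical; the user-agent tokens and the nosnippet rule are documented directives, though their exact effect on features like AI Overviews should be checked against Google's current documentation), Python's standard-library robots parser can show how such rules behave:

# Minimal sketch of the crawler-directive options described above.
from urllib import robotparser

robots_txt = """\
User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# The AI-training crawler is refused, while ordinary search indexing stays on.
print(rp.can_fetch("Google-Extended", "https://example-news-site.com/story"))  # False
print(rp.can_fetch("Googlebot", "https://example-news-site.com/story"))        # True

# To stay indexed but withhold snippets (the "links without any description"
# trade-off mentioned above), a page can instead serve:
#   <meta name="robots" content="nosnippet">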

Google said its search engine continued to send billions of visits to websites, providing value to publishers. The company has also said it has not showcased its A.I. summaries when it was clear that users were looking for news on current events.

Liz Reid, Google’s vice president of search, said in an interview before the introduction of AI Overviews that there were hopeful signs for publishers during testing.

“We do continue to see that people often do click on the links in AI Overviews and explore,” she said. “A website that appears in the AI Overview actually gets more traffic” than one with just a traditional blue link.

On Thursday afternoon, Ms. Reid wrote in a blog post that Google would limit AI Overviews to a smaller set of search results after it produced some high-profile errors, but added that the company was still committed to improving the system.

The A.I.-generated summaries are the latest area of tension between tech companies and publishers. The use of articles from news sites has also set off a legal fight over whether companies like OpenAI and Google violated copyright law by taking the content without permission to build their A.I. models.

The New York Times sued OpenAI and its partner, Microsoft, in December, claiming copyright infringement of news content related to the training and servicing of A.I. systems. Seven newspapers owned by Media News Group and Tribune Publishing, including The Chicago Tribune, brought a similar suit against the same tech companies. OpenAI and Microsoft have denied any wrongdoing.

AI Overviews is Google’s latest attempt to catch up to rivals Microsoft and OpenAI, the maker of ChatGPT, in the A.I. race.

More than a year ago, Microsoft put generative A.I. at the heart of its search engine, Bing. Google, afraid to mess with its cash cow, initially took a more cautious approach. But the company announced an aggressive rollout for the A.I. feature at its annual developer conference in mid-May: By the end of the year, more than a billion people would have access to the technology.

AI Overviews combine statements generated from A.I. models with snippets of content from live links across the web. The summaries often contain excerpts from multiple websites while citing sources, giving comprehensive answers without the user ever having to click to another page.

Since its debut, the tool has not always been able to differentiate between accurate articles and satirical posts. When it recommended that users put glue on pizza or eat rocks for a balanced diet, it caused a furor online.

Publishers said in interviews that it was too early to see a difference in traffic from Google since AI Overviews arrived. But the News/Media Alliance, a trade group of 2,000 newspapers, has sent a letter to the Justice Department and the Federal Trade Commission urging the agencies to investigate Google’s “misappropriation” of news content and stop the company from rolling out AI Overviews.

Many publishers said the rollout underscored the need to develop direct relationships with readers, including getting more people to sign up for digital subscriptions and visit their sites and apps directly, and be less reliant on search engines.

Nicholas Thompson, the chief executive of The Atlantic, said his magazine was investing more in all the areas where it had a direct relationship to readers, such as email newsletters.

Newspapers such as The Washington Post and The Texas Tribune have turned to a marketing start-up, Subtext, that helps companies connect with subscribers and audiences through text messaging.

Mike Donoghue, Subtext’s chief executive, said media companies were no longer chasing the largest audiences, but were trying to keep their biggest fans engaged. The New York Post, one of his customers, lets readers exchange text messages with sports reporters on staff as an exclusive subscriber benefit.

Then there’s the dispute over copyright. It took an unexpected turn when OpenAI, which scraped news sites to build ChatGPT, started cutting deals with publishers. It said it would pay companies, including The Associated Press, The Atlantic and News Corp., which owns The Wall Street Journal, to access their content. But Google, whose ad technology helps publishers make money, has not yet signed similar deals. The internet giant has long resisted calls to compensate media companies for their content, arguing that such payments would undermine the nature of the open web.

“You can’t opt out of the future, and this is the future,” said Roger Lynch, the chief executive of Condé Nast, whose magazines include The New Yorker and Vogue. “I’m not disputing whether it will happen or whether it should happen, only that it should happen on terms that will protect creators.”

He said search remained “the lifeblood and majority of traffic” for publishers and suggested that the solution to their woes could come from Congress. He has asked lawmakers in Washington to clarify that the use of content for training A.I. is not “fair use” under existing copyright law and requires a licensing fee.

Mr. Thompson of The Atlantic, whose publication announced a deal with OpenAI on Wednesday, still wishes Google would pay publishers as well. While waiting, he said before the rollout of AI Overviews that despite industry concerns, The Atlantic wanted to be part of Google’s summaries “as much as possible.”

“We know traffic will go down as Google makes this transition,” he said, “but I think that being part of the new product will help us minimize how much it goes down.”

David McCabe contributed reporting.

Nico Grant is a technology reporter covering Google from San Francisco. Previously, he spent five years at Bloomberg News, where he focused on Google and cloud computing.

Katie Robertson covers the media industry for The Times.


