Why candidate A and candidate B get different TOEFL Writing Task 2 scores on the same prompt

The TOEFL iBT Writing Task 2, often called "Writing for an Academic Discussion", asks a candidate to contribute one written post to a simulated online discussion alongside two classmates' posts. The candidate has ten minutes to read the prompt, take a position, and write a single response of roughly 120 to 180 words. As its number suggests, the task comes after the integrated Writing Task 1 rather than before it. The task looks small, but the rubric is dense: a single 1-to-6 score is produced by a human rater plus an e-rater engine, and a single point on the writing scale can move the overall iBT score. Most candidates lose marks not because their grammar is poor, but because they misread what the rater is actually scoring. The rest of this article walks through the rubric levers, the planning budget, the scaffolding patterns, and the tactical errors that determine whether a response lands at 4 or climbs to 5.

The shape of the prompt: what the candidate actually sees on screen

Every TOEFL Writing Task 2 prompt presents the same skeleton. A short professor prompt frames a question on an academic topic, usually inside higher-education territory such as pedagogy, campus policy, or research methods. Two short posts from named classmates follow; each states a position with one or two reasons. A text box waits for the candidate's own contribution. Candidates are told to read the question and the two posts, take a clear position, and support it with reasons and examples. No minimum word count is enforced, but the e-rater has internal thresholds tied to length, and the response is expected to be substantive enough to demonstrate stance, support, and synthesis. The official materials describe the whole exchange as an online discussion in an academic class.

Three features of this prompt format are worth memorising before a practice session. First, the question is always framed in a way that has at least two defensible sides, so a candidate cannot fake expertise by restating a fact — they must choose, defend, and add value. Second, the two classmates' posts are written in deliberately distinct voices, so a response that simply agrees with both, or contradicts both without reasoning, loses the synthesis element the rubric rewards. Third, the prompt appears on the screen with a built-in timer that begins the moment the task is revealed; the candidate cannot pause it. That last point is the one that turns a 4 into a 5 in practice: the ten minutes have to be spent on planning before they are spent on sentences.

Many candidates treat the prompt as a mini-essay. It is closer to a focused argument: one position, two or three supporting reasons, and a visible engagement with at least one of the classmates' posts. The rater's eye is trained to look for the spine of the argument, not for decorative vocabulary. Trying to write a long response usually produces a string of loosely related sentences that drift away from the classmate posts; trying to write a very short one usually starves the rubric of evidence that the candidate can develop ideas at all. A target in the 120-to-180-word window is what most strong candidates converge on after a few practice rounds.

How the 1-to-6 score is actually built

The TOEFL iBT Writing Task 2 response is scored on a 1-to-6 scale rather than the older 0-to-30 scale. Contrary to a common assumption, the response does not pass through three human raters: a single human rater assigns the overall score, while the e-rater engine produces a separate automated score. The two are combined through a rule specified by ETS, and the higher of the combined score and the human score is reported. Knowing this changes the way a candidate should write, because the e-rater is not impressed by clever idioms; it scans for sentence-level features, vocabulary range, and the structural fingerprints of an organised response. The human rater reads for content, development, and the visible quality of the thinking.

Three rubric dimensions drive both scores, even though the official materials describe them as a single holistic judgement. The first is task fulfilment, meaning the response takes a clear position, addresses the question, and engages with the classmates' posts. The second is development, meaning the position is supported with reasons and examples rather than asserted as a bare claim. The third is language use, covering grammar, vocabulary, and the range of sentence structures a candidate can deploy without losing accuracy. Candidates who score 5 typically show all three at a working academic level, while 4-level responses tend to satisfy one or two strongly and the third only weakly.

The rater's reading order

In my experience marking or simulating marking with students, the rater's eyes do something predictable. The first five seconds go to the opening sentence: does the candidate state a position, or do they hedge into nothing? The next ten seconds look for the body: are there visible reasons, or is the candidate padding with restatements? The final pass checks language: is the range wide enough that the response would not be flagged as formulaic? A response that fails the first check rarely recovers, even if its grammar is clean. This is why a forceful opening sentence is the single highest-leverage move in the ten-minute budget.

The 10-minute planning budget: where most candidates lose their score

Ten minutes is not a long time. Reading the prompt, planning, drafting, and checking inside that window forces a candidate to make deliberate trade-offs. The most common trade-off I see — and the one that costs a band point more often than any grammar problem — is the choice to start writing before planning. Candidates who jump straight into the text box produce responses that look like a list of opinions, with a stance buried in the third sentence and the engagement with classmates reduced to a token mention. The same ten minutes, spent roughly 3 minutes on reading and outlining and 7 minutes on drafting, produces a noticeably tighter response. Three to four minutes of outline time is the practical sweet spot for a 120-to-180-word response.
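The split described above is simple arithmetic, and it can help to see it laid out. The sketch below is only an illustration: the function name `split_budget` and its defaults are assumptions made for this example, with the 3-minute planning figure taken from the paragraph above and the 60-second proofing reserve taken from the pitfalls section later in this article.

```python
from dataclasses import dataclass

TASK_SECONDS = 10 * 60  # the task's fixed ten-minute timer


@dataclass(frozen=True)
class Budget:
    planning: int  # seconds spent reading the prompt and outlining
    drafting: int  # seconds spent writing sentences
    proofing: int  # seconds reserved for the final read-through


def split_budget(planning_minutes: float = 3.0,
                 proofing_seconds: int = 60,
                 total_seconds: int = TASK_SECONDS) -> Budget:
    """Split the timer into planning, drafting, and proofing (hypothetical helper)."""
    planning = round(planning_minutes * 60)
    # Proofing time is carved out of drafting, not out of planning.
    drafting = total_seconds - planning - proofing_seconds
    if drafting <= 0:
        raise ValueError("planning and proofing leave no time to draft")
    return Budget(planning, drafting, proofing_seconds)


if __name__ == "__main__":
    for minutes in (3, 3.5, 4):
        b = split_budget(minutes)
        print(f"plan {b.planning}s, draft {b.drafting}s, proof {b.proofing}s")
```

With a 4-minute outline and the proofing reserve kept, drafting shrinks to five minutes, which is a useful reminder that every extra minute of planning has to be repaid by a tighter draft.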

The outline itself does not need to be elaborate. A workable plan has four ingredients: a clear stance in one sentence, two or three supporting reasons, one piece of evidence or example for the strongest reason, and a short note on which classmate's post the candidate will engage with and how. Candidates who skip the example step almost always end up repeating the same reason twice in different wording, which the rubric reads as low development. Candidates who skip the engagement note end up ignoring the classmates entirely, which the rubric reads as low task fulfilment. The four-line outline is what separates these two failure modes from a 5-level response.

A worked outline for a representative prompt

Consider a prompt asking whether professors should record and post their lectures. Two classmates post: one argues yes, for accessibility reasons; the other argues no, because students stop attending in person. A strong four-line outline would read:

Stance. Recordings should be posted, with a clear caveat.
Reasons. Recordings support students with health or work conflicts; they also reduce note-taking pressure and let students focus on comprehension.
Example. A specific scenario such as a part-time worker who cannot attend every session.
Engagement. Agree with the accessibility classmate, partially concede the attendance point, and add a fix.

A draft built from this outline usually lands between 140 and 170 words and contains every rubric ingredient the rater looks for.
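One way to make the four-ingredient check mechanical during outline practice is sketched below. The `Outline` class and its `missing` method are hypothetical names invented for this illustration, and the rule that two or three reasons count as complete is taken from the planning section above. The duplicate check only catches reasons that are repeated word for word; spotting the same idea in different wording still needs a human reader.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Outline:
    stance: str = ""
    reasons: List[str] = field(default_factory=list)
    example: str = ""
    engagement: str = ""

    def missing(self) -> List[str]:
        """Return the names of the ingredients this outline still lacks."""
        gaps = []
        if not self.stance.strip():
            gaps.append("stance")
        # Two or three reasons are expected; verbatim repeats count once.
        distinct = {r.strip().lower() for r in self.reasons if r.strip()}
        if not 2 <= len(distinct) <= 3:
            gaps.append("reasons")
        if not self.example.strip():
            gaps.append("example")
        if not self.engagement.strip():
            gaps.append("engagement")
        return gaps


if __name__ == "__main__":
    lectures = Outline(
        stance="Recordings should be posted, with a clear caveat",
        reasons=[
            "Supports students with health or work conflicts",
            "Reduces note-taking pressure",
        ],
        example="A part-time worker who cannot attend every session",
        engagement="Agree on accessibility, concede attendance, add a fix",
    )
    print(lectures.missing())
```

An empty list means the outline is complete; any name in the list is the ingredient to add before drafting starts.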

Scaffolding templates that hold the response together

Strong responses tend to share three scaffolding patterns, and weaker responses tend to share three failure patterns. Recognising both is faster than memorising sample essays. Below are the patterns ranked by how often they appear in scored responses; each pattern is something a candidate can practise into muscle memory without sounding formulaic, provided the inside of the template is filled with topic-specific content.

Stance-first opening. Sentence 1 names the position. Sentence 2 previews the strongest reason. The reader knows within ten words where the response is going.
Reason-and-example body. Each supporting reason is followed by an example, an analogy, or a concrete scenario. Reasons never repeat each other in different words.
Synthesis closer. The final sentence refers back to at least one classmate by name or by content, and either concedes a point, extends the classmate's reasoning, or proposes a compromise.

Weak responses tend to do the opposite on all three. They open with a long setup sentence that delays the stance to sentence three. They list reasons without examples. They close with a generic summary sentence such as "in conclusion, I agree" that ignores the classmates. None of these failure patterns destroys the response on its own, but the three together reliably cap a response at 4 even when the language is excellent.

Three failure patterns to avoid

The first failure pattern is stance drift, where the candidate starts by agreeing with one classmate, drifts into the language of the other, and finishes with a sentence that contradicts the opening. The second is reason repetition, where two reasons are stated as separate sentences but are actually the same idea in different words; the rubric reads this as a single reason padded out. The third is classmate erasure, where the response is well-organised and well-written but the classmates' posts are never referred to; the rubric treats this as failure to engage with the discussion format, which is the specific task being tested. Avoiding all three is a matter of planning, not of language.

Language use: the rubric's most forgiving dimension

Language use is the dimension candidates worry about most and the one the rubric is most forgiving on. The 1-to-6 scale does not require native-like fluency. It requires that the candidate's grammar and vocabulary are accurate enough to communicate ideas without distracting the rater, and that the range of sentence structures is wide enough to show control. A response written entirely in short, simple sentences can score 4 on language if the ideas are strong; the same response written in long, complex sentences with frequent errors can score 3 because the errors interrupt the rater's reading. Range matters, but accuracy matters more.

Practically, this means candidates should aim for a mix of sentence lengths rather than a single register. Two short sentences, a medium sentence, and a longer complex sentence, repeated across the response, produce a readable rhythm that signals control. Candidates who try to sound academic by using long noun phrases and rare vocabulary often introduce agreement errors and article errors that pull the language score down. In my experience the cleanest path to a 5 is to write in the register the candidate is most comfortable with, then add one or two slightly more complex structures where the candidate is sure the grammar is correct.

Vocabulary that helps without showing off

Some vocabulary moves pay off cheaply. Replacing "I think" with a stronger stance verb such as "I would argue," "the evidence suggests," or "in my view" raises the perceived confidence of the response without raising the difficulty of the grammar. Replacing "very important" with a more precise word such as "essential," "central," or "decisive" raises the perceived range of the vocabulary. Replacing "a lot of students" with a more academic phrase such as "a significant number of students" raises the perceived register. None of these moves is risky in terms of grammar, and together they lift a response from sounding conversational to sounding academic. They are the cheapest upgrades available inside a ten-minute budget.

Engaging with the classmates' posts: the most under-rehearsed skill

The single most under-rehearsed skill in TOEFL Writing Task 2 preparation is engagement with the classmates' posts. Most practice sessions focus on stance and language; almost none focus on the move that actually differentiates a 4 from a 5, which is the synthesis move. Synthesis has three forms. The candidate can agree with a classmate and extend the reasoning. The candidate can agree with part of a classmate's post and add a new dimension. The candidate can disagree, but must add a reason rather than just asserting the opposite. Any of the three is acceptable; ignoring the classmates is not.

The synthesis move usually lives in the second or third sentence, or else in the closing sentence. Placing it in both positions is overkill and reads as formulaic; placing it once is enough. A common mistake is to begin with a sentence that addresses the classmate, then drift into a stand-alone mini-essay. The rater is trained to look for synthesis as evidence that the candidate understood the discussion format; a single sentence of synthesis, well placed, does this work without taking up word count that could be used for development.

A worked synthesis move

Returning to the recorded-lectures example, call the classmate who argues for accessibility Maria and the one who worries about attendance Daniel. A synthesis sentence in the second position could read, "I share Maria's concern about accessibility, but I think the attendance problem Daniel raises is solvable." The sentence names the classmates' content rather than using a generic phrase, signals the candidate's own position, and bridges into the candidate's own reasons. The same move, in the closing sentence, could read, "Used as a complement to live sessions, recorded lectures address Maria's accessibility concern without reinforcing the attendance decline Daniel predicts." Both moves do synthesis work; either one alone is sufficient.

Common pitfalls and how to avoid them

Five pitfalls account for most of the lost points I see in candidate responses. Each has a recognisable signature in the writing and a tactical fix that can be practised. The pitfalls are not about grammar; they are about how the candidate uses the ten minutes.

Stance drift. The opening sentence is hedged, the body contradicts it, and the closer adds a third view. Fix: write the stance as a single declarative sentence and refuse to soften it.
Reason repetition. Two reasons are stated as separate sentences but are the same idea. Fix: in the outline, write each reason as a noun phrase; if two phrases collapse into one, merge them.
Classmate erasure. The response is well-organised but the classmates' posts are never referred to. Fix: include one synthesis sentence in the body, using the classmate's name or content.
Word-count starvation. The response is shorter than 110 words, so the e-rater reads low development. Fix: aim for the 120-to-180-word window; the third or fourth body sentence is usually the one that pushes length up.
Proofreading blind spot. The candidate spends the last 30 seconds adding words instead of checking what is already there. Fix: reserve the last 60 seconds for a single read-through, looking only for subject-verb agreement and article errors.

The proofreading blind spot is the most expensive of the five because it costs the language score in addition to the development score. A 60-second read-through catches the errors a candidate cannot see while writing, and the 60 seconds come out of the drafting time, not the planning time. This is another reason the planning budget matters: a planned response needs less editing, so the 60 seconds is genuinely available for proofing.
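The length thresholds in the list above are easy to check after each practice response. The sketch below is a rough illustration only: it counts whitespace-separated tokens, which is an assumption of this example and not a description of how the e-rater measures length, and the function names and band labels are hypothetical.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated tokens (a simplification, not the e-rater's method)."""
    return len(text.split())


def length_band(text: str) -> str:
    """Place a response against the article's 110-word floor and 120-to-180 window."""
    n = word_count(text)
    if n < 110:
        return "starved"        # the word-count starvation pitfall
    if n < 120:
        return "below target"   # above the floor, short of the window
    if n <= 180:
        return "in window"
    return "over window"        # risks drifting away from the classmates' posts


if __name__ == "__main__":
    draft = "Recorded lectures should be posted. " * 30  # placeholder text, 150 tokens
    print(word_count(draft), length_band(draft))
```

Running the check after the proofing pass, rather than during it, keeps the final 60 seconds free for agreement and article errors.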

A practice schedule that builds the response under time pressure

Skill on Writing Task 2 is built by timed practice, but the practice has to be sequenced. The order below is the order I have seen work across many preparation plans, and each stage has a concrete deliverable rather than a vague instruction. Candidates who skip stages or run stages in parallel usually end up with a fluent but unstructured response, which is the failure mode that caps scores at 4.

Stage 1: outline drills (untimed)

Take ten prompts, ignore the timer, and write only the four-line outline for each. The goal at this stage is not the response but the plan. Candidates should compare outlines with a tutor or with the official sample responses, and should track whether the stance sentence, the reasons, the example, and the engagement note are all present in every outline. Most candidates notice within five outlines that they tend to drop one ingredient — usually the example. Stage 1 is over when the candidate can produce a usable four-line outline inside two minutes without prompting.

Stage 2: timed drafts (15 minutes)

Move to a 15-minute budget, which gives the candidate a safety margin above the real 10-minute task. The goal here is to internalise the rhythm: 3 to 4 minutes of planning, 8 to 10 minutes of drafting, 1 to 2 minutes of proofing. The candidate should write full responses and then score them against the rubric. Stage 2 ends when the candidate can produce a 120-to-180-word response inside 12 minutes that contains all four outline ingredients (stance, reasons, example, engagement) and has no more than two minor language errors.
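The Stage 2 exit criteria can be written down as a single check, which makes it harder to move on too early. This is a sketch under assumptions: `Attempt` and `stage2_complete` are hypothetical names, and counting the ingredients and the minor errors is still left to the candidate or a tutor.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    minutes_used: float
    words: int
    ingredients_present: int  # out of stance, reasons, example, engagement
    minor_errors: int


def stage2_complete(attempt: Attempt) -> bool:
    """Apply the Stage 2 exit criteria described above to one timed draft."""
    return (
        attempt.minutes_used <= 12
        and 120 <= attempt.words <= 180
        and attempt.ingredients_present == 4
        and attempt.minor_errors <= 2
    )


if __name__ == "__main__":
    log = [Attempt(14.0, 165, 3, 4), Attempt(12.5, 150, 4, 2), Attempt(11.5, 150, 4, 1)]
    for number, attempt in enumerate(log, start=1):
        print(number, stage2_complete(attempt))
```

Keeping a short log of attempts like this also shows which criterion fails most often, which is usually more informative than the pass or fail result itself.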

Stage 3: real-time simulation (10 minutes)

The final stage is the real thing. The candidate takes a fresh prompt and writes inside the actual 10-minute budget, ideally with the official testing interface if available. The score from this stage is the most predictive of the actual test performance, because it is the only stage where the time pressure is genuine. Candidates should run this stage at least five times before test day, ideally spaced across a week rather than crammed into a single sitting.

How Writing Task 2 sits inside the wider TOEFL iBT writing score

Writing Task 2 is one of two writing tasks on the TOEFL iBT. Writing Task 1, the integrated writing task, asks the candidate to summarise a reading passage and a lecture into a single response of about 150 to 225 words. The two task scores are averaged to produce a single writing section score on the older 0-to-30 scale, which is then folded into the overall iBT score. For candidates whose applications weight writing heavily — common in graduate programmes and in English-medium undergraduate admissions — a single point on the writing scale can move the overall score by a similar amount. This is why the difference between a 4 and a 5 on the 1-to-6 scale is worth taking seriously.

Reading across tasks

The two writing tasks test different sub-skills. Task 1 measures the ability to extract and compare information from two sources; Task 2 measures the ability to construct an argument and engage with a discussion. The language use dimension is shared, which is why strong preparation on one task tends to lift the other. Candidates who score 5 on Task 2 and 4 on Task 1 usually have a vocabulary and grammar range that is consistent; the gap is in the integrated summarising skill, not in the writing. A preparation plan that treats the two tasks as separate skill tracks, with shared language practice, is the most efficient use of preparation time.

Putting it together: a single response, end to end

The table below compares two responses to the same recorded-lectures prompt. Response A scores 4; response B scores 5. The differences map directly to the rubric dimensions, and the table is a useful self-check after a practice session. Read it slowly, then write your own response to a different prompt and score it against the same grid.

Rubric dimension	Response A (4-level)	Response B (5-level)
Stance clarity	Stance appears in sentence 2, hedged	Stance appears in sentence 1, declarative
Engagement with classmates	One classmate mentioned in passing	One classmate engaged with a synthesis move
Reason development	Two reasons, no example	Two reasons, one example
Language range	Mostly simple sentences, some complex	Mixed lengths, controlled complex structures
Word count	About 105 words	About 155 words

The table is the smallest possible summary of what the rubric rewards. Candidates who internalise the five rows before test day rarely lose points on Task 2 for structural reasons. The errors that remain are language errors, and the 60-second proofing window exists to catch those.

Conclusion and next steps

TOEFL Writing Task 2 is a small task with a dense rubric, and the difference between a 4 and a 5 is rarely a vocabulary problem. It is a planning problem, a synthesis problem, and a proofing problem, in that order. A candidate who can produce a four-line outline inside two minutes, write a 120-to-180-word response inside seven, and run a 60-second proofing pass will outscore a candidate who writes more fluently but plans less, almost every time. The next step is to run the three-stage practice schedule above on fresh prompts, score each response against the rubric grid, and track the planning time as carefully as the word count.

TestPrep Europe's diagnostic assessment on the academic discussion task is a natural starting point for candidates who want a scored baseline before they begin a structured preparation plan.

Frequently asked questions

How long should a TOEFL Writing Task 2 response be?

Most strong responses fall between 120 and 180 words. Shorter responses tend to starve the e-rater of development evidence; longer responses tend to drift away from the classmates' posts. The official materials do not enforce a minimum or maximum, but the 120-to-180-word window is where 5-level responses usually land.

Does the rater prefer agreement or disagreement with the classmates?

The rater has no preference on direction. What the rubric rewards is a clear stance and visible reasoning, regardless of which classmate the candidate sides with. A well-supported disagreement scores as high as a well-supported agreement; a hedged middle position usually scores lower than either.

How is the 1-to-6 score combined with the other writing task?

Writing Task 2 produces a single 1-to-6 score. Writing Task 1 produces a separate score on the same scale. The two scores are averaged to produce the writing section score on the older 0-to-30 scale, which is then folded into the overall iBT score. A single point on the writing scale can therefore move the overall iBT score.

Can grammar errors cost a full band point?

Yes. A response that contains repeated subject-verb agreement errors, article errors, or tense shifts can be marked down on language use even when the ideas are strong. The 60-second proofing window at the end of the task is the cheapest available defence against this. Candidates should run the proofing pass on every practice response so the habit is automatic on test day.

What is the single highest-leverage move in the ten minutes?

Writing the stance as the first sentence of the response. A clear, declarative opening sentence signals to the rater that the candidate understood the task and gives the rest of the response a spine to build on. Candidates who delay the stance to sentence two or three make a score above 4 much harder to reach, however well the rest of the response is written.
