What are the key classic studies, and how are studies evaluated and reviewed?

Key studies and classic research: the named classic studies across topics, how to evaluate studies methodologically and ethically, and reviewing and synthesising research evidence.

An Edexcel A-Level Psychology answer on key studies and classic research, covering the named classic studies (Milgram, Sherif, Baddeley, Watson and Rayner, Raine, Rosenhan), how to evaluate methodology and ethics with GRAVE, and how to review and synthesise evidence for Paper 3.

Generated by Claude Opus 4.8. 14 min answer. Updated 2026-06-02.

Reviewed by: AI editorial process; not yet individually human-reviewed

What this dot point is asking

Edexcel sets a named classic study for each foundation topic, and Paper 3 tests reviewing, analysing and evaluating studies. You must know each classic study and be able to assess its methodology and ethics and synthesise evidence across several studies.

The answer

The named classic studies

Evaluating methodology and ethics

Reviewing and synthesising evidence

Paper 3 asks you to review research: to compare studies, weigh evidence for and against an explanation, and judge how method affects the conclusions that can be drawn. Strong answers synthesise several studies rather than describing one, and they connect findings to issues and debates such as ethics, reductionism, determinism and generalisability. Reviewing is itself a method with its own reliability (do reviewers agree on quality ratings?) and bias (publication bias toward significant results).

Evaluation (GRAVE)

Generalisability. Many classic studies used biased samples (Milgram's 40 American men, Sherif's 22 American boys), limiting how far findings apply across gender and culture.
Reliability. Standardised classic studies (Milgram, Baddeley) replicate well, giving them strong reliability and making synthesis across replications possible.
Application. Classic studies underpin real applications: Loftus informs the cognitive interview, Rosenhan informs cautious diagnosis, and Bandura informs media-effects policy.
Validity. Lab studies (Milgram, Baddeley) can lack ecological validity; field studies (Sherif) gain it but lose control, so validity must be judged per study.
Ethics. Several classic studies breach modern principles (Little Albert's distress, Milgram's deception and harm), which must be weighed against their scientific value.

Reading inter-rater reliability for a research review

Two reviewers rate the methodological quality of a set of studies and you must judge how reliable the review is.

Step 1. Define the measure

Inter-rater reliability is the extent to which two independent raters reach the same judgement. It is often reported as percentage agreement or as Cohen's kappa.

Step 2. Calculate percentage agreement

If the reviewers agree on $24$ of $30$ studies, agreement is $\frac{24}{30} \times 100 = 80\%$.
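The arithmetic in this step can be sketched in Python. This is only a minimal illustration: the function name `percentage_agreement` is a hypothetical helper invented here, not something the specification defines.

```python
def percentage_agreement(agreed: int, total: int) -> float:
    """Return the percentage of items on which two raters gave the same rating."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= agreed <= total:
        raise ValueError("agreed must be between 0 and total")
    # Multiply before dividing so whole-number percentages stay exact.
    return 100 * agreed / total


print(percentage_agreement(24, 30))  # 80.0
```

The same helper works for any count of agreements, provided both reviewers rated every study in the set.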

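Step 3. Judge the figure

$80\%$ is fairly high but leaves $20\%$ disagreement. Percentage agreement and kappa are on different scales: researchers often look for a kappa of around $0.8$ or above for acceptable reliability, and because kappa is corrected for chance, $80\%$ raw agreement does not imply a kappa of $0.8$.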

Step 4. Note the limitation

Percentage agreement does not correct for chance agreement, so it can overstate reliability. Cohen's kappa, which adjusts for chance, is the preferred statistic.
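To see how the chance correction works, here is a hedged Python sketch of Cohen's kappa, $\kappa = \frac{p_o - p_e}{1 - p_e}$, where $p_o$ is the observed agreement and $p_e$ is the agreement expected by chance from each rater's own proportions. The function name `cohen_kappa` and the two rating lists are assumptions invented for illustration; the text gives only the total of $24$ agreements out of $30$, not the individual ratings.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who each rated the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: proportion of items given the same rating.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: for each category, multiply the two raters'
    # marginal proportions, then sum over categories.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_chance == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical ratings for 30 studies (invented for illustration):
# both "high" on 20, both "low" on 4, and 6 disagreements split evenly.
rater_a = ["high"] * 20 + ["low"] * 4 + ["high"] * 3 + ["low"] * 3
rater_b = ["high"] * 20 + ["low"] * 4 + ["low"] * 3 + ["high"] * 3

print(round(cohen_kappa(rater_a, rater_b), 2))  # 0.44
```

With these invented ratings, $80\%$ raw agreement corresponds to a kappa of about $0.44$, because both raters chose "high" so often that much of their agreement could have arisen by chance. A different split of ratings would give a different kappa for the same percentage agreement.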

Examples in context

Example 1. Synthesising obedience evidence across studies. Rather than describing Milgram alone, a strong review compares Milgram (65 per cent gave the maximum shock), his telephone variation (obedience fell to about 21 per cent) and cross-cultural replications (broadly similar high rates). Synthesising these shows the effect is reliable and that situational factors (the proximity and legitimacy of authority) systematically change obedience, supporting agency theory. This synthesis, with a judgement about the weight of evidence, is what Paper 3 rewards, and it also lets you fold in the ethics debate (deception and harm) and the generalisability debate (androcentric samples).

Example 2. Evaluating Little Albert methodologically and ethically. Watson and Rayner's study is a single-participant case study, which gives rich detail but very poor generalisability (one infant). The lack of a control condition and of standardised testing weakens internal validity. Ethically, the study caused distress to an infant who could not consent, and the conditioned fear was never removed before Albert left the study, breaching protection from harm. Yet it provided early evidence that emotional responses can be classically conditioned, influencing later treatments. This balanced methodological and ethical evaluation, ending in a judgement, models the Paper 3 skill.

Try this

Q1. Evaluate the ethics of Milgram's obedience study. [4 marks]

Cue. It used deception and caused psychological distress, breaching protection from harm, and the prods undermined the right to withdraw, but Milgram debriefed participants, most said they were glad to have taken part, and the findings were valuable.

Q2. Explain why a biased sample limits a study's conclusions. [2 marks]

Cue. A biased sample is not representative of the wider population, so the findings cannot be confidently generalised beyond the participants studied.

Q3. Assess how reviewing and synthesising studies improves the evaluation of psychological evidence. [8 marks]

Cue. Argue that synthesis weighs converging and conflicting evidence, links method to conclusions and reveals reliability across replications, but note review reliability (inter-rater agreement) and bias (publication bias) as limits.

Exam-style practice questions

Practice questions written in the style of Pearson Edexcel exam questions on this dot point, with worked answer explainers. The year tag is the paper they imitate, not the source.

Edexcel 2019, 8 marks. Evaluate Milgram's obedience study in terms of its methodology and ethics. [8 marks]

An evaluate question: marks for methodological and ethical assessment, not retelling the procedure (AO3 dominant).

Methodology. Strengths: a highly standardised, controlled lab procedure (same prods, same set-up) gives high reliability and replicability, and Milgram's later variations isolated situational variables. Weaknesses: low ecological validity (an artificial task), possible demand characteristics, and an androcentric, culturally narrow sample of 40 American men, limiting generalisability.

Ethics. The study used deception (participants thought the shocks were real), risked psychological harm (visible distress), and the prods pressured participants in a way that undermined the right to withdraw. In defence, participants were debriefed, most said they were glad to have taken part, and the findings had great value for understanding obedience.

Markers reward balanced methodological points (reliability, validity and generalisability) and ethical points (deception, harm, withdrawal versus debriefing and value), with a judgement.

Edexcel 2021, 6 marks. Two researchers reviewing the same studies agreed on $27$ of $30$ quality ratings. Calculate the percentage agreement and explain what this tells you about the reliability of the review and one limit of using percentage agreement. [6 marks]

A quantitative item: show the calculation (AO2) then interpret (AO3).

Percentage agreement (a measure of inter-rater reliability): $\frac{27}{30} \times 100 = 90\%$.

Interpretation: inter-rater reliability is the extent to which two independent reviewers reach the same judgement. A $90\%$ agreement is high, suggesting the review applied its quality criteria consistently and is reliable, so the synthesis is trustworthy.

One limit: percentage agreement does not correct for agreement that happens by chance, so it can overstate reliability. A statistic such as Cohen's kappa, which adjusts for chance agreement, gives a more accurate figure and is preferred when judging the reliability of a review.

Markers reward the correct percentage ($90\%$), a definition of inter-rater reliability, and the point that percentage agreement ignores chance agreement (kappa is better).

Sources & how we know this

Pearson Edexcel A-Level Psychology (9PS0) specification — Pearson Edexcel (2015)