How do you fit a line to data and use it to make and interpret predictions?

Line of best fit by eye through the double mean point; the regression line y = a + bx; interpreting gradient and intercept; using the line for prediction with awareness of interpolation and extrapolation.

A focused answer to Edexcel GCSE Statistics on lines of best fit and regression, covering drawing a line of best fit through the double mean point, the regression line y = a + bx, interpreting the gradient and intercept, and using the line to make predictions with awareness of extrapolation.

Generated by Claude Opus 4.8, 9 min answer

Reviewed by: AI editorial process; not yet individually human-reviewed

Jump to a section
  1. What this dot point is asking
  2. Drawing the line of best fit
  3. The regression line y = a + bx
  4. Interpreting gradient and intercept
  5. Using the line for prediction
  6. When a line of best fit is appropriate

What this dot point is asking

Edexcel code 2e.04 requires you to determine a line of best fit by eye, drawn through the calculated double mean point $(\bar{x}, \bar{y})$, and (Higher tier) to use the regression line of the form $y = a + bx$. You must interpret the gradient and intercept in context and use the line to make predictions, while being aware that extrapolation is less reliable than interpolation. Non-linear models are not tested.

Drawing the line of best fit

A line of best fit is a single straight line that passes as close as possible to all the points on a scatter diagram, showing the overall trend. The key technique Edexcel requires is to use the double mean point.

Plotting $(\bar{x}, \bar{y})$ first gives you a fixed, accurate anchor, so your line is no longer drawn purely by eye. You then rotate the line about this point until it balances the points above and below.
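
As a rough illustration of the calculation behind the double mean point, the sketch below averages the x-values and the y-values separately. The data set is hypothetical (invented for this example, not taken from an exam paper), and the function name `double_mean_point` is just a label chosen here.

```python
from statistics import mean

# Hypothetical data: car ages (years) and values (thousand pounds).
ages = [1, 3, 4, 5, 6, 7, 9]
values = [11.5, 9.5, 9.0, 8.0, 7.0, 6.5, 4.5]


def double_mean_point(xs, ys):
    """Return (x-bar, y-bar): the mean of the x-values and the mean of the y-values."""
    return mean(xs), mean(ys)


x_bar, y_bar = double_mean_point(ages, values)
print(f"Plot the double mean point at ({x_bar}, {y_bar}) before drawing the line.")
```

In an exam you would do the same thing by hand: add up each list, divide by the number of points, and plot the result as a distinct mark on the scatter diagram.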

The regression line y = a + bx

At Higher tier the line of best fit is treated as a regression line written in the form $y = a + bx$, where $a$ is the intercept and $b$ is the gradient.

This is the same straight-line idea as $y = mx + c$ from mathematics, just with the constant written first. You can find the equation from two points on the line, or from the gradient and the double mean point, by substituting into $y = a + bx$ and solving for $a$.
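
The two routes to the equation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are made up for this example, the gradient $-0.8$ and the point $(5, 8)$ are borrowed from the practice question on this page, and the pair of points $(0, 12)$ and $(10, 4)$ is a hypothetical choice that lies on the same line.

```python
def intercept_from_point(b, x0, y0):
    """Solve y0 = a + b*x0 for a, given the gradient b and one point on the line."""
    return y0 - b * x0


def line_from_two_points(p, q):
    """Return (a, b) for the line y = a + bx passing through points p and q."""
    (x1, y1), (x2, y2) = p, q
    b = (y2 - y1) / (x2 - x1)  # gradient = change in y / change in x
    a = y1 - b * x1            # substitute one point to find the intercept
    return a, b


# Route 1: gradient plus the double mean point (5, 8).
a1 = intercept_from_point(-0.8, 5, 8)

# Route 2: two points read off the line (hypothetical readings).
a2, b2 = line_from_two_points((0, 12), (10, 4))

print(f"Route 1: y = {a1:.1f} + ({-0.8})x")
print(f"Route 2: y = {a2:.1f} + ({b2:.1f})x")
```

Both routes should describe the same line; when reading two points off a hand-drawn line, choosing points far apart reduces the effect of small reading errors on the gradient.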

Interpreting gradient and intercept

Marks are won by interpreting the line in context, not just stating numbers:

  • The gradient $b$ is the rate of change: "for each extra year of age, the value falls by GBP 800" for a gradient of $-0.8$ (thousand pounds per year).
  • The intercept $a$ is the predicted $y$ when $x = 0$: "a brand new car (age 0) is predicted to be worth GBP 12,000". Be careful, because the intercept is only meaningful if $x = 0$ is sensible for the context.
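
To see where these two interpretations come from, the sketch below uses the car line $y = 12 - 0.8x$ (with $y$ in thousand pounds) and checks numerically that increasing $x$ by one year changes the prediction by the gradient, and that the prediction at $x = 0$ is the intercept. The helper name `predict` is an assumption made for this example.

```python
a, b = 12.0, -0.8  # y = 12 - 0.8x, with y in thousand pounds and x in years


def predict(x):
    """Predicted value (thousand pounds) of a car aged x years."""
    return a + b * x


# Gradient: the change in predicted value for one extra year of age.
change_per_year_pounds = round((predict(6) - predict(5)) * 1000)

# Intercept: the predicted value when the age is 0.
value_when_new_pounds = round(predict(0) * 1000)

print(f"Change per extra year: GBP {change_per_year_pounds}")
print(f"Predicted value when new: GBP {value_when_new_pounds}")
```

The conversion by 1000 matters: because $y$ is measured in thousand pounds, a gradient of $-0.8$ has to be stated as a fall of GBP 800 per year, not GBP 0.80.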

Using the line for prediction

To predict a value, substitute the known $x$ into the equation (or read off the line). This is reliable for interpolation (an $x$ inside the data range) but unreliable for extrapolation (an $x$ beyond the data), because the linear trend may not continue. Always check whether the prediction point lies within the range of the original data before trusting it.
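
The range check described above can be sketched as a small helper that returns the prediction together with a label saying whether it is an interpolation or an extrapolation. The function name and its parameters are assumptions for illustration; the numbers come from the mock test practice question on this page ($y = 15 + 0.7x$, mock marks from 20 to 80).

```python
def predict_with_check(a, b, x, x_min, x_max):
    """Predict y = a + b*x and label the prediction by where x sits in the data range."""
    y = a + b * x
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return y, kind


# Regression line y = 15 + 0.7x, fitted to mock marks between 20 and 80.
for mock_mark in (40, 5):
    final_mark, kind = predict_with_check(15, 0.7, mock_mark, 20, 80)
    print(f"Mock {mock_mark}: predicted final {final_mark:.1f} ({kind})")
```

The helper still produces a number for the extrapolated case; the label is a reminder that the number should not be trusted, which is the point an exam answer needs to make in words.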

The strength of the correlation also affects how much you should trust a prediction. If the points lie close to the line (strong correlation), predictions made by interpolation are reasonably reliable; if the points are widely scattered (weak correlation), even an interpolated prediction carries a large uncertainty. So when judging a prediction, consider both whether it is an interpolation or extrapolation and how strong the underlying correlation is.

When a line of best fit is appropriate

Edexcel only tests linear models, so a line of best fit should be used when the scatter diagram shows a roughly straight-line trend. If the points clearly curve, a straight line is a poor model and any prediction from it is unreliable. You should also ignore a single outlier when positioning the line by eye, since one stray point can distort it; mention the outlier rather than letting it drag the line. Recognising that a straight line does not suit curved data is part of choosing an appropriate model.

Exam-style practice questions

Practice questions written in the style of Pearson Edexcel exam questions on this dot point, with worked answer explainers. The year tag is the paper they imitate, not the source.

Edexcel 1ST0 2020, 4 marks
A scatter diagram plots the age, $x$ years, and value, $y$ thousand pounds, of 10 cars. The mean age is 5 years and the mean value is GBP 8000. (a) Explain how the double mean point helps you draw the line of best fit. (b) The line passes through $(5, 8)$ and has gradient $-0.8$. Write its equation and interpret the gradient.

(a) The double mean point $(\bar{x}, \bar{y}) = (5, 8)$ always lies on the line of best fit, so plotting it gives a fixed point to draw the line through, making the line more accurate than drawing by eye alone.

(b) Using $y = a + bx$ with gradient $b = -0.8$ through $(5, 8)$: $8 = a + (-0.8)(5)$, so $a = 8 + 4 = 12$. Equation: $y = 12 - 0.8x$.

The gradient $-0.8$ means the value falls by about GBP 800 for each extra year of age.

Markers reward explaining that the double mean point lies on the line, forming the equation, and interpreting the gradient in context (GBP 800 per year).

Edexcel 1ST0 2022, 4 marks
The regression line for the marks in a mock test ($x$) and a final test ($y$) is $y = 15 + 0.7x$. (a) Predict the final mark of a student who scored 40 in the mock. (b) The mock marks ranged from 20 to 80. Explain why using the line to predict the final mark for a student who scored 5 in the mock is unreliable.

(a) Substitute $x = 40$: $y = 15 + 0.7 \times 40 = 15 + 28 = 43$. The predicted final mark is 43.

(b) A mock mark of 5 is well outside the range of the data (20 to 80), so using the line there is extrapolation. The linear relationship is only known to hold within the data range and may not continue, so the prediction is unreliable.

Markers reward the substitution and prediction 43, and identifying extrapolation beyond the data range as the reason for unreliability.
