Quite often I see papers that report how much of the variance in an outcome is explained by the risk factor(s) of interest. The thinking seems to be that the higher the percentage explained (the higher the R squared), the better, and that a high value means the important variables have been identified.

But consider this famous example. Imagine that everyone in a rich country smokes 20 cigarettes a day, and you study the reasons for lung cancer in this population. Because smoking doesn't vary between individuals, it wouldn't explain any of the variance in lung cancer, so it wouldn't be identified as a cause of lung cancer. But it is the reason this country has a much higher rate of lung cancer than a rich country where nobody smokes. This is summarised as the causes of cases not necessarily being the same as the causes of incidence (the rate). In population health we mostly want to change the causes of incidence. Of course, even if you're dealing in prediction rather than cause, it is still the case that the predictors of cases are not necessarily the predictors of incidence.
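To make the example concrete, here is a minimal Python sketch. The numbers are made up purely for illustration (1,000 people per country, 20 cases where everyone smokes, 2 cases where nobody does), and `r_squared` is just a small helper written for this sketch: the squared correlation between one predictor and the outcome, treated as zero when the predictor doesn't vary.

```python
import math


def r_squared(x, y):
    """Squared Pearson correlation; 0.0 if x or y has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant predictor cannot explain any variance
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy ** 2 / (sxx * syy)


# Hypothetical country A: everyone smokes 20 a day, 20 cases per 1,000 people.
smoking_a = [20] * 1000
cancer_a = [1] * 20 + [0] * 980

# Hypothetical country B: nobody smokes, 2 cases per 1,000 people.
smoking_b = [0] * 1000
cancer_b = [1] * 2 + [0] * 998

# Causes of cases: individuals within country A.
r2_within = r_squared(smoking_a, cancer_a)

# Causes of incidence: one row per country, outcome is the rate.
r2_between = r_squared([20, 0], [sum(cancer_a) / 1000, sum(cancer_b) / 1000])

# Individuals from both countries pooled together.
r2_pooled = r_squared(smoking_a + smoking_b, cancer_a + cancer_b)

print(f"within country A:  {r2_within:.3f}")
print(f"between countries: {r2_between:.3f}")
print(f"pooled individuals: {r2_pooled:.3f}")
```

With only two countries the between-country fit is perfect by construction, so treat that figure as an illustration of the idea rather than as evidence of anything.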

So while smoking in a particular cohort of individuals might explain only 10% of the variation in lung cancer, smoking explains around 90% of the differences in rates between areas. Something I have seen mentioned less often is that the same analysis on the same data can give a different value of the same R squared.

Sounds like magic! Hey presto, this Stata code illustrates it in detail. Using data on mortality, smoking and age, and a Poisson model of individual data (with dead or not as the outcome), I get an R squared of 9%. But I can rearrange the data, run the same model and get the same results (effect sizes and CIs), yet get a completely different R squared (93%). The difference is that I changed the number of observations from 181,467 individuals to 10 groups. In the latter I controlled for the size of the groups using an offset. So at the group level the explained variance is pretty high. Given that they are essentially the same analysis, their predictive ability is actually the same. Of course, R squared in Poisson and logistic models is a pseudo R squared, calculated differently to the vanilla R squared. So don't take this as a technically accurate description, but I think the spirit of what I say is right.
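For readers without Stata, the same effect can be sketched in plain Python. This is not the analysis above: the counts are hypothetical, the model has smoking as its only predictor (so the Poisson maximum likelihood estimates are simply deaths divided by persons and need no iterative fitting), and I am assuming McFadden's pseudo R squared, 1 - ll_model / ll_null.

```python
import math

# Hypothetical counts (NOT the data used in the post).
# Each row: (smoker, age_band, persons followed for a year, deaths).
GROUPS = [
    (0, "younger", 40_000, 360),
    (0, "older", 10_000, 140),
    (1, "younger", 30_000, 720),
    (1, "older", 20_000, 580),
]


def fitted_rates(groups):
    """Poisson MLE with one binary predictor: deaths / persons per smoking level."""
    rates = {}
    for level in (0, 1):
        persons = sum(n for s, _, n, _ in groups if s == level)
        deaths = sum(d for s, _, _, d in groups if s == level)
        rates[level] = deaths / persons
    return rates


def loglik_individual(groups, rate_for):
    """One row per person, outcome 0 or 1, so the log(y!) term is always 0."""
    return sum(d * math.log(rate_for(s)) - n * rate_for(s) for s, _, n, d in groups)


def loglik_grouped(groups, rate_for):
    """One row per group, deaths as a count, log(persons) as the offset."""
    total = 0.0
    for s, _, n, d in groups:
        mu = n * rate_for(s)  # expected deaths in the group
        total += d * math.log(mu) - mu - math.lgamma(d + 1)
    return total


def mcfadden(ll_model, ll_null):
    return 1 - ll_model / ll_null


rates = fitted_rates(GROUPS)
overall = sum(d for _, _, _, d in GROUPS) / sum(n for _, _, n, _ in GROUPS)

# The fitted rates (and so the rate ratio) are the same in both layouts.
rate_ratio = rates[1] / rates[0]

ll_ind = loglik_individual(GROUPS, lambda s: rates[s])
ll0_ind = loglik_individual(GROUPS, lambda s: overall)
ll_grp = loglik_grouped(GROUPS, lambda s: rates[s])
ll0_grp = loglik_grouped(GROUPS, lambda s: overall)

r2_individual = mcfadden(ll_ind, ll0_ind)
r2_grouped = mcfadden(ll_grp, ll0_grp)

print(f"rate ratio (both layouts): {rate_ratio:.2f}")
print(f"pseudo R squared, individual rows: {r2_individual:.3f}")
print(f"pseudo R squared, grouped rows:    {r2_grouped:.3f}")
```

The two log-likelihoods differ only by a constant that doesn't involve the rates, so the estimates and the improvement over the null model are identical. What changes is the null log-likelihood in the denominator, which is much closer to zero for the grouped rows, and that is what pushes the pseudo R squared up.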

*Note: the dataset is actually in person-years, but I have pretended it is persons followed up for a year to avoid the complication of writing about person-years.

For attribution, please cite this work as

Popham (2017, Dec. 17). Frank Popham: Same model, different R squared. Retrieved from https://www.frankpopham.com/posts/2017-12-17-same-model-different-r-squared/

BibTeX citation

@misc{popham2017same, author = {Popham, Frank}, title = {Frank Popham: Same model, different R squared.}, url = {https://www.frankpopham.com/posts/2017-12-17-same-model-different-r-squared/}, year = {2017} }