Data extraction
We collected data on author, year, study design, sample size, method of assessing LGA, intervention (induction, spontaneous, caesarean delivery or not specified), age at follow-up, cognitive or academic outcome, and confounders. The primary data extraction was performed by one reviewer (X.Z.) and checked for accuracy by a second reviewer (M.S.). Any disagreements were fully discussed until a consensus was reached. Corresponding authors of included papers were contacted by email to provide further details if data were insufficient or missing.
For the meta-analyses of continuous outcomes, we extracted the mean, standard deviation (SD), and total sample size (N) of the cognitive assessment scores for the exposed and control groups, or alternatively the mean difference with its lower and upper confidence limits and total N. For dichotomous outcomes, we extracted the 2×2 table or the odds ratio with its lower and upper confidence limits.
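As an illustration of how the two kinds of dichotomous input can be placed on a common scale before pooling, the following minimal sketch computes a log odds ratio and its standard error either from a 2×2 table or from a reported odds ratio with its limits, and recovers the standard error of a mean difference from its limits. It assumes that the reported limits are two-sided 95% confidence limits and that a normal approximation holds; neither assumption is stated in the text above, and the function names are hypothetical.

```python
import math

# Assumption: reported limits are two-sided 95% confidence limits.
Z_95 = 1.959964


def se_from_limits(lower, upper, z=Z_95):
    """Standard error of a mean difference from symmetric confidence limits."""
    return (upper - lower) / (2.0 * z)


def log_or_from_table(a, b, c, d):
    """Log odds ratio and its standard error from a 2x2 table.

    a: exposed with outcome,  b: exposed without outcome,
    c: control with outcome,  d: control without outcome.
    """
    if min(a, b, c, d) <= 0:
        # Zero cells need a continuity correction, which is not covered here.
        raise ValueError("all four cells must be positive")
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d)
    return log_or, se


def log_or_from_limits(odds_ratio, lower, upper, z=Z_95):
    """Log odds ratio and its standard error from a reported OR and its limits."""
    se = (math.log(upper) - math.log(lower)) / (2.0 * z)
    return math.log(odds_ratio), se
```

Working on the log scale keeps the interval symmetric around the estimate, which is why the limits are log-transformed before the standard error is derived.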
Studies of early-term infants used two types of reference group. One type of study compared early-term infants (37-38 weeks) with full-term infants (39-41 weeks), in which case we used the full-term infants as the reference group. The second type of study reported results separately for 37, 38, 39, 40, and 41 weeks GA, in which case we used 40 weeks as the reference group to examine 37 vs 40 weeks and 38 vs 40 weeks GA.
Any measure of cognitive function was considered for inclusion. When results were reported as both an overall test score (e.g. Intelligence Quotient; IQ) and a domain-specific score (e.g. receptive vocabulary delay), we used the overall score in the data synthesis. When results were reported only as domain-specific scores within the same study population, we calculated the mean score across the domain-specific tests. Where multiple cognitive or academic outcomes were reported, we selected the one that provided the most reliable information for analysis (e.g. IQ test vs. school grade). Studies with a follow-up of at least 6 months were eligible. When outcomes were measured more than once at different ages in the same study population, we selected the oldest age group with the most reliable cognitive assessment. If multiple multivariable models were reported, we extracted data from the most fully adjusted model (e.g. adjusted for education and sex vs. adjusted for sex only).
We extracted data for three primary outcomes: cognitive scores, cognitive impairment, and low academic performance. Cognitive outcomes were based on cognitive scores (e.g., Bayley Scale of Infant and Toddler Development Mental Developmental Index,23-27 and Wechsler Abbreviated Scale of Intelligence28) or cognitive impairment (e.g., Wechsler Intelligence Scale for Children full-scale IQ below average, defined as scores below 85 or one standard deviation below the mean29). Academic outcomes were based on low academic performance (e.g. special education needs, defined as children in the Scottish schools 2005 census who required special education provision, comprising both children with learning disabilities, such as dyslexia and dyspraxia, and children with physical disabilities that affect learning30). See Appendix S2 for full details of outcome definitions.
To make the primary outcomes comparable, we harmonized the extracted data as follows: (a) if a study reported a cognitive test T score, percentile or Z score, we converted it to the intelligence quotient scale (IQ; mean: 100; SD: 15); (b) if the direction of a study’s outcome was opposite to that of the others (e.g., receiving a longer education rather than a shorter one), we converted it to the same direction; (c) if an LGA-related study defined LGA not in terms of percentiles but in terms of SD or absolute values, we converted it to percentiles using the World Health Organization foetal growth calculator (unknown foetal sex).31
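The score conversion in (a) can be sketched as a linear rescaling through the Z score. The sketch below assumes the conventional T score scale (mean 50, SD 10) and, for percentiles, that the underlying scores are normally distributed; these assumptions and the function names are illustrative and are not taken from the text above.

```python
from statistics import NormalDist

IQ_MEAN = 100.0
IQ_SD = 15.0


def z_to_iq(z):
    """Rescale a Z score (mean 0, SD 1) to the IQ scale (mean 100, SD 15)."""
    return IQ_MEAN + IQ_SD * z


def t_to_iq(t):
    """Rescale a T score to the IQ scale.

    Assumption: the T score uses the conventional mean of 50 and SD of 10.
    """
    return z_to_iq((t - 50.0) / 10.0)


def percentile_to_iq(percentile):
    """Convert a percentile rank (strictly between 0 and 100) to the IQ scale.

    Assumption: scores are normally distributed, so the percentile is mapped
    to a Z score with the inverse of the standard normal CDF.
    """
    if not 0.0 < percentile < 100.0:
        raise ValueError("percentile must lie strictly between 0 and 100")
    return z_to_iq(NormalDist().inv_cdf(percentile / 100.0))


def sd_to_iq_scale(sd, source_sd):
    """Rescale a standard deviation from a scale with SD source_sd to the IQ scale."""
    return sd * IQ_SD / source_sd
```

Under this rescaling, a Z score of -1 corresponds to an IQ of 85, which matches the "one standard deviation below the mean" threshold used to define cognitive impairment.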