The reliability and validity of the Timed Up and Go test in patients ongoing or following lumbar spine surgery: a systematic review and meta-analysis

Background No other systematic review has examined the measurement properties of the Timed Up and Go (TUG) test in patients undergoing lumbar spine surgery (LSS). The present systematic review and meta-analysis aimed to investigate these properties. Methods A literature search yielded 906 studies [PubMed:71, Web of Science (WoS):80, Scopus:214, ScienceDirect:471 and Cochrane Library:70]. The 10 included studies were assessed for risk of bias and quality using the “four-point COSMIN tool” and “COSMIN quality criteria tool”. Criterion validity and responsiveness results were pooled with “correlation coefficient” and “Hedges’ g” based effect size, respectively. Results The pooled correlation coefficients between TUG and VAS back and leg pain were 0.26 (moderate) (95% CI 0.19–0.34) and 0.28 (moderate) (95% CI 0.20–0.36), respectively. The pooled coefficients of TUG with ODI and RMDI were 0.33 (moderate) (95% CI 0.27–0.39) and 0.33 (moderate) (95% CI 0.24–0.42), respectively. In addition, TUG correlated with the quality-of-life PROMs with coefficients of −0.22 to −0.26 (moderate) (EQ5D Index 95% CI −0.35 to −0.16), (SF12-PCS 95% CI −0.33 to −0.15) and (SF12-MCS 95% CI −0.32 to −0.13). The pooled coefficients of TUG with COMI, ZCQ-PF and ZCQ-SS were 0.46 (moderate) (95% CI 0.30–0.59), 0.43 (moderate) (95% CI 0.26–0.56), and 0.38 (moderate) (95% CI 0.21–0.52), respectively. TUG’s 3-day and 6-week responsiveness results were 0.14 (low) (95% CI −0.02 to 0.29) and 0.74 (moderate to strong) (95% CI 0.60–0.89), respectively. TUG was responsive at the mid-term (6 weeks) follow-up. Conclusion In clinical practice, the TUG can be used as a reliable, valid and responsive tool to assess LSS patients’ general status, especially in the mid-term.


Introduction
Assessment of pain, range of motion, function, quality of life, and psychosocial status before and after lumbar spine surgery (LSS) is essential to monitor the success of surgery and rehabilitation [1,2]. Function is mainly evaluated with physical performance tests or patient-reported outcome measures (PROMs) [3]. PROMs are valuable for capturing patients' subjective opinions [4]. In particular, questionnaires offer a practical and cost-effective way to evaluate patients' functional status before and after surgery and their perceived difficulty or ease in activities of daily living [5]. However, physical performance tests are used as a gold standard measurement method to observe the objective performance-based functions of individuals [6,7].
Various physical performance tests containing daily life tasks (gait, sit to stand, turns, steps, stair ascent and descent, straight leg raising, squat) have been developed within standardized protocols, and their measurement properties have been demonstrated in clinical studies [3,8]. Because changes in pain and function before and after LSS are clinically important, functional improvements of individuals are objectively evaluated with performance tests [9]. One of the most preferred tests in individuals with LSS is the Timed Up and Go (TUG). The TUG is a practical assessment tool that includes sit-to-stand, gait, and 180-degree turn tasks and requires no expensive equipment [10].
LSS patients are rehabilitated to become independent in activities of daily living in the post-operative period [11,12]. Holistic exercise programs, including strengthening, endurance, balance, core stabilization, proprioception and aerobic exercises, provide essential recovery during the post-operative period [13,14]. Studies have demonstrated improvements in sit-to-stand and gait speed in individuals with LSS as lower extremity strength and endurance progress [15,16]. Patients' somatosensory parameters, including balance and proprioception, which are required during the turning tasks of walking, also improve. Therefore, the TUG test is an important indicator of physical status in patients before and after LSS [10,17].
Measurement properties are essential to reveal whether physical performance tests provide accurate measurement responses in the relevant case group [26]. In addition, considering the different types of surgery (fusion, decompression, instrumentation), intervention methods (minimally invasive, conventional methods), patient follow-up duration (immediate, acute, mid-term, chronic) and differences in statistical methods (reliability, validity, responsiveness), it is essential to review whether TUG provides consistent results in individuals with LSS [13,14,26]. No other systematic review has examined the measurement properties of the TUG in LSS. The present systematic review and meta-analysis aimed to investigate TUG's measurement properties (including criterion validity, responsiveness, measurement error and reliability) in patients with LSS.

Search strategy and selection criteria
The recommendations and guidelines of the "Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA)" [27], the "COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN)" [26], and the "Cochrane recommendations for systematic reviews" [28] were followed in conducting this systematic review and meta-analysis. The literature was searched between October 2022 and December 2022 with the relevant keywords, combined using the Boolean operators "AND" and "OR" ["Lumbar Surgery" AND "Timed Up and Go Test"; "Lumbar Degenerative Disease" AND "Timed Up and Go Test"; "Lumbar Fusion" AND "Timed Up and Go Test"; "Lumbar Decompression" AND "Timed Up and Go Test"; "Lumbar AND Timed Up and Go Test"]. A total of 906 studies [PubMed:71, Web of Science (WoS):80, Scopus:214, ScienceDirect:471 and Cochrane Library:70] were obtained. Details of the search are presented in Additional file 1: Appendix S1.

Eligibility criteria
The inclusion criteria of the review were: (1) studies including patients before or after LSS, (2) studies in which the intervention was decompression surgery with or without fusion, and (3) cohort or cross-sectional studies providing an analysis of measurement properties (validity, reliability, measurement error, responsiveness). The exclusion criteria of the review were: (1) studies with an aim other than the clinimetrics of the TUG, (2) studies without primary details of the measurement properties of the TUG, (3) non-English studies, and (4) studies without full text available.

Study selection and data extraction
The data files of the obtained studies (906) were transferred to Rayyan (Rayyan Systems Inc., USA) software via EndNote (Clarivate Analytics, USA) outputs. Rayyan is a systematic review screening software used to detect irrelevant or duplicate studies [29]. During the screening process, two expert academicians independently screened each study's title, abstract and keywords and marked it as "include", "exclude" or "maybe". Where the two academicians could not reach consensus, the decisive opinion of a third colleague was obtained. This initial screening yielded a total of 18 studies. Eight studies were then excluded for the following reasons: 5 studies did not provide measurement properties, 2 studies had no full text available, and 1 study did not provide specific values of measurement properties. A total of 10 studies were included in the systematic review and meta-analysis (Fig. 1). Descriptive information about the studies (year, study type, study population, follow-up period, number of cases, age, gender, surgery, diagnosis, and outcome measures) is presented in Table 1.
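As a quick consistency check on the selection flow described above, the reported counts can be tallied in a few lines. This is only an illustrative sketch: the variable names are hypothetical, and all numbers are taken directly from the text.

```python
# Records retrieved per database, as reported in the search strategy.
records_per_database = {
    "PubMed": 71,
    "Web of Science": 80,
    "Scopus": 214,
    "ScienceDirect": 471,
    "Cochrane Library": 70,
}
total_records = sum(records_per_database.values())

# Studies remaining after title/abstract/keyword screening.
after_screening = 18

# Reasons for exclusion after the initial screening.
excluded = {
    "no measurement properties": 5,
    "no full text available": 2,
    "no specific values of measurement properties": 1,
}
included = after_screening - sum(excluded.values())

print(f"Records identified: {total_records}")
print(f"Studies included:   {included}")
```

The tallies agree with the totals stated in the text (906 records identified, 10 studies included).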

Risk of bias and quality assessment
The "COSMIN" tools were used for risk of bias and quality analysis. The 10 included studies were assessed for risk of bias and quality using the "four-point COSMIN tool" [26]. This tool classifies studies as "poor, fair, good and excellent" by considering the sample size for each measurement property, the statistical method, and methodological deficiencies that may introduce bias. In addition, the methodological design was qualitatively classified with the "COSMIN quality criteria tool" [30]. This instrument classifies studies according to their primary methodological features, resulting in positive (+), indeterminate (?), negative (−), and no information (0) categories. Both instruments were used to score the criterion validity, responsiveness and other measurement properties (if any) of the studies. Two independent expert academicians rated the risk of bias and quality of the included studies.

Evidence synthesis
Measurement properties of the studies with heterogeneous data were presented by narrative/qualitative synthesis. These studies' results are also presented in Table 2 with the outcomes of the numerical data. Qualitative synthesis was performed through three steps: "pre-synthesis, exploring the relationships within and between the experiments, and evaluating the synthesis's robustness" [31]. The results of the synthesis are also detailed in the "Results" section.

Fig. 1 PRISMA flow diagram of the study

Meta-analysis (quantitative analysis of studies)
Meta-Mar software (Philipps-Universität Marburg, Germany) was used to meta-analyze the included studies [32]. The results of criterion validity and responsiveness of homogeneous data were pooled in the meta-analysis with "correlation coefficient" and "Hedges' g" based effect size, respectively. In correlation pooling, correlation coefficients of TUG with Visual Analog Scale (VAS) based back pain and leg pain, Oswestry Disability Index (ODI), Roland Morris Disability Questionnaire (RMDQ), EuroQoL 5 Dimension (EQ5D) index score, Short Form-12 (SF-12), Core Outcome Measures Index (COMI), and Zurich Claudication Questionnaire (ZCQ) were used. In responsiveness pooling, the mean change, the standard deviation (SD) of the change score, and the Standardized Mean Difference (SMD), accounting for sample sizes, were calculated for two separate follow-up periods: pre-op to 3 days and pre-op to 6 weeks. The Cochrane handbook guidelines were used to derive SDs that studies did not report. The "SMD, confidence interval (CI), weighted mean effect size and p-value of each pooled score" are given. "I², Tau² and Chi²" values described the heterogeneity of the calculations. Forest plots of the results were also provided. Effect sizes were interpreted as stated by Cohen: for the correlation coefficient (r), 0.10: small, 0.30: medium and 0.50: large; for the coefficient in the responsiveness analysis (d), 0.20: small, 0.50: medium and 0.80: large [33].
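The internal computations of Meta-Mar are not described here, so the following is a minimal sketch of one common way to obtain such pooled estimates, not necessarily the exact procedure used in this review. It assumes that correlations are pooled through Fisher's z-transform with inverse-variance weights, that heterogeneity is summarized with Cochran's Q (a Chi² statistic), I² and a DerSimonian-Laird Tau², and that Hedges' g is computed as a small-sample corrected standardized mean difference treating the pre- and post-operative scores as two groups. All function names and input values are hypothetical.

```python
import math


def pool_correlations(rs, ns):
    """Random-effects pooling of correlation coefficients via Fisher's z."""
    zs = [math.atanh(r) for r in rs]   # Fisher z-transform of each r
    ws = [n - 3 for n in ns]           # inverse variance of z is n - 3
    z_fixed = sum(w * z for w, z in zip(ws, zs)) / sum(ws)

    # Cochran's Q (Chi^2 statistic) and I^2 describe heterogeneity.
    q = sum(w * (z - z_fixed) ** 2 for w, z in zip(ws, zs))
    df = len(rs) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

    # DerSimonian-Laird estimate of the between-study variance Tau^2.
    c = sum(ws) - sum(w * w for w in ws) / sum(ws)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Random-effects weights, pooled z, and 95% CI back-transformed to r.
    ws_re = [1 / (1 / w + tau2) for w in ws]
    z_re = sum(w * z for w, z in zip(ws_re, zs)) / sum(ws_re)
    se = math.sqrt(1 / sum(ws_re))
    ci = (math.tanh(z_re - 1.96 * se), math.tanh(z_re + 1.96 * se))
    return {"r": math.tanh(z_re), "ci": ci, "q": q, "i2": i2, "tau2": tau2}


def hedges_g(mean_pre, mean_post, sd_pre, sd_post, n_pre, n_post):
    """Hedges' g for a pre/post change (positive when TUG time decreases)."""
    df = n_pre + n_post - 2
    sd_pooled = math.sqrt(
        ((n_pre - 1) * sd_pre ** 2 + (n_post - 1) * sd_post ** 2) / df
    )
    d = (mean_pre - mean_post) / sd_pooled
    correction = 1 - 3 / (4 * df - 1)  # small-sample bias correction
    return correction * d


# Hypothetical inputs for illustration only (not data from the review).
pooled = pool_correlations(rs=[0.25, 0.35, 0.30], ns=[60, 120, 90])
g = hedges_g(mean_pre=12.0, mean_post=10.0, sd_pre=4.0, sd_post=3.5,
             n_pre=50, n_post=50)
print(pooled, g)
```

Working on the z scale keeps the sampling variance approximately independent of the true correlation, which is why the confidence limits are computed there and only then transformed back to r.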

Other psychometric properties
Reliability was analyzed in only one study, with excellent results: the ICC was 0.97 for intra-rater and 0.99 for inter-rater reliability. Gautschi et al. [10] also provided the SEM value of TUG. The intra-rater and inter-rater SEM values were 0.21 s and 0.23 s, respectively. In three studies, the MCID was between 0.9 and 3.4 s [3,24,25]. Only one study calculated the MIC, which was −17.6% (95% CI −20.7 to −10.2%) [20] (Table 2).
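For readers less familiar with how these quantities relate, the SEM is commonly derived from the ICC and the between-subject SD, and a minimal detectable change at the 95% level (MDC95) can in turn be derived from the SEM. The sketch below uses these standard formulas; the included studies may have computed their SEM differently, the function names are hypothetical, and the SD and ICC in the example are illustrative values rather than results from the review.

```python
import math


def sem_from_icc(sd, icc):
    """Standard error of measurement: SD * sqrt(1 - ICC)."""
    return sd * math.sqrt(1.0 - icc)


def mdc95(sem):
    """Minimal detectable change at the 95% level: 1.96 * sqrt(2) * SEM."""
    return 1.96 * math.sqrt(2.0) * sem


# Hypothetical example: between-subject SD of 2.0 s and ICC of 0.75.
sem = sem_from_icc(sd=2.0, icc=0.75)
print(f"SEM = {sem:.2f} s, MDC95 = {mdc95(sem):.2f} s")
```

A change smaller than the MDC95 cannot be distinguished from measurement error, so it is a useful reference when interpreting MCID values.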

Discussion
The TUG test is one of the most commonly used physical performance assessment tools in patients undergoing or following LSS [10,22]. The present systematic review and meta-analysis aimed to investigate the measurement properties of the TUG in patients with LSS. According to the results, TUG was responsive (moderate to strong) at the mid-term (6 weeks) follow-up. TUG was primarily associated with COMI (moderate), which evaluates pain, function, symptom-specific well-being, quality of life, and disability. TUG was also moderately related to physical function, pain and quality of life, in that order. In clinical practice, the TUG can be used as a reliable, valid and responsive tool to assess LSS patients' general status, especially in the mid-term.
Lumbar decompression surgery (with or without fusion) is a safe surgical procedure that has been performed for years to reduce pain and loss of function and to improve patients' independence in daily living [13,14]. It is crucial to evaluate the physical performance of individuals before these surgeries with measurement tests that include standardized protocols in order to evaluate the patient's actual clinical condition objectively and quantitatively [3,8]. To our knowledge, no other study has examined the measurement properties of TUG, perhaps the most important of the tests used in clinical practice, in individuals before and after LSS.

Fig. 2 Pooling results of the correlation coefficient between TUG and VAS
The mean age of the samples in the included studies ranged between 46 and 66 years [3,10,18–25]. The vast majority of the studies included middle-aged individuals, although some enrolled older adults. Because most of the studies included middle-aged individuals (median 56.25 years), the decline in physical function attributable to the physiology of aging can be disregarded. The patients were followed during immediate, acute and chronic periods. The responsiveness of TUG across these follow-up periods provided essential data for clinical practice [18,20]. In addition, although there were more male subjects in most studies, the proportion of female subjects, approximately 40%, indicated a reasonably balanced gender distribution.
The most notable result of the quality analysis was a negative (−) and "fair to good" score in most studies for criterion validity. The main reasons were sample sizes below 100 and correlation coefficients below 0.70 in COSMIN scoring [26,30]. In the responsiveness analysis, studies received "fair to good", "(0) no information", and "(?) indeterminate" scores because of insufficient sample sizes and statistical analyses. In addition, only one of the studies provided measurement and statistical data on reliability. Likewise, owing to the lack of statistical analysis and small sample sizes for "measurement error", the results of the studies were of lower quality. In this context, future studies can address TUG's test-retest or inter-rater reliability more comprehensively with specific Shrout and Fleiss ICC models [34]. In addition, responsiveness analyses should also report ROC curves and AUC values with longer-term follow-up to characterize the measurement properties of TUG in individuals with LSS more clearly [35]. Within the scope of criterion validity, TUG still needs to be adequately compared with gold-standard performance tests such as the Five Times Sit to Stand Test, Stair Test, 6MWT, and 30 s Chair Sit to Stand Test. Correlations among these tests may yield coefficients above 0.70, which might raise the quality of validity inferences to a higher evidence level [26,30].
"Validity" indicates the degree of accuracy of a test for an intended parameter [36]. Validity results showed that TUG was primarily related to COMI. Because COMI, owing to its holistic structure, represents the general condition, including function, pain, symptoms, and quality of life, it can be argued that TUG provides a comprehensive evaluation in cases with LSS [37]. TUG was secondarily associated with ZCQ-PF, ZCQ-SS, ODI and RMDI. This concordance suggests that TUG secondarily indicates the function of the patients, as expected. It should be noted that TUG represents general condition rather than function. Thirdly, the relationship between pain and TUG was noteworthy. Since an increase in pain level is known to increase the loss of function, the moderate pooled correlation with low back and leg pain was not surprising [9]. Among the pooled correlation coefficients, TUG was least associated with quality-of-life scores. Because correlations are usually reported for the pre-operative period, the correlation of TUG with SF-12 and EQ5D after surgical and rehabilitation interventions may yield higher validity coefficients. Also, since quality of life is more perceptible in the chronic period after the health service is provided, it would be vital to examine the criterion validity after long-term follow-up in future studies [13,14,38].
Responsiveness analysis investigated whether the TUG reflects clinical improvement following treatment at different follow-up times. While the TUG showed low responsiveness at the 3-day follow-up, it was more responsive to clinical improvement at the 6-week mid-term follow-up. This outcome suggests that post-operative functional gains usually occur in the mid-term, as rehabilitation usually becomes effective after 1 month in LSS. It would be essential to establish the responsiveness of TUG further through long-term monitoring of individuals. Indeed, Jakobsson and colleagues and Master and colleagues, whose studies we could not include in the meta-analysis, confirmed that TUG was responsive in individuals after LSS at 6 and 12 months, respectively [3,20]. Pooling effect sizes from additional studies may provide results at a higher level of evidence. Only one study demonstrated test-retest and inter-rater reliability. Reliability indicates whether the instrument can consistently capture the clinical condition of the same individual under identical clinical conditions [26,39]. The TUG provided highly reliable results in individuals with LSS. In future studies, presenting reliability with Bland-Altman agreement analysis could reveal the reliability of TUG in individuals with LSS more comprehensively. MCID revealed the smallest clinically significant change in seconds. Among these studies, MCID was found to be 3.4 s in the study with a mean age of 46 years and 1.3 s in the study with a mean age of 62 years. In another study with a mean age of 49 years, results ranging between 0.9 and 3 s were noteworthy. Smaller changes thus appeared to be clinically significant in older individuals. These data may provide reference outcomes on treatment improvements in clinical practice.
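The Bland-Altman agreement analysis suggested above for future studies can be sketched as follows. This is a minimal illustration assuming two repeated TUG measurements per patient; the function name and the example times are hypothetical and are not data from any included study.

```python
import statistics


def bland_altman(first, second):
    """Return the bias and 95% limits of agreement for paired measurements."""
    if len(first) != len(second):
        raise ValueError("Both trials must contain the same number of patients.")
    diffs = [a - b for a, b in zip(first, second)]
    bias = statistics.mean(diffs)        # mean difference between trials
    sd_diff = statistics.stdev(diffs)    # sample SD of the differences
    lower = bias - 1.96 * sd_diff        # lower 95% limit of agreement
    upper = bias + 1.96 * sd_diff        # upper 95% limit of agreement
    return bias, lower, upper


# Hypothetical TUG times in seconds for four patients, two trials each.
trial_1 = [10.0, 12.0, 9.0, 11.0]
trial_2 = [9.0, 12.0, 10.0, 10.0]
bias, lower, upper = bland_altman(trial_1, trial_2)
print(f"bias = {bias:.2f} s, limits of agreement = {lower:.2f} to {upper:.2f} s")
```

Unlike the ICC, the limits of agreement are expressed in seconds, so they can be compared directly with MCID values when judging whether an observed change exceeds measurement variability.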

Limitations
Not all databases were searched in the present systematic review. Some databases (CINAHL) were not accessible through public sources. Secondly, the surgical procedures in the studies were not homogeneous. Since the outcomes and rehabilitation responses of individuals differ between "minimally invasive or conventional surgical" methods and between "decompression or fusion" techniques [13,14], a more homogeneous pooling should be considered for future studies. Finally, the study was not registered in a "systematic review database" (International Prospective Register of Systematic Reviews, PROSPERO). Protocol registration of reviews is essential for the integrity of the methodology.

Conclusions
In conclusion, TUG was responsive (moderate to strong) at the mid-term (6 weeks) follow-up. TUG was primarily associated with COMI (moderate), which evaluates pain, function, symptom-specific well-being, quality of life, and disability. TUG was also moderately related to physical function, pain and quality of life, in that order. In clinical practice, the TUG can be used as a reliable, valid and responsive tool to assess LSS patients' general status, especially in the mid-term.

Fig. 3 Pooling results of the correlation coefficient between TUG with ODI and RMDI

Table 3 Evidence level of the studies