A review of the quality of current diabetes clinical practice guidelines

ABSTRACT Objective To obtain an evaluation of current type 2 diabetes mellitus (T2DM) clinical practice guidelines. Methods Relevant guidelines were identified through a systematic search of MEDLINE/PubMed. Pan American Health Organization (PAHO) country offices were also contacted to obtain national diabetes guidelines in use but not published/available online. Overall, 770 records were identified on MEDLINE/PubMed for citations published from 2008 to 2013. After an initial screening of these records, 146 were found to be guidelines related to diabetes. Inclusion and exclusion criteria were used to further refine the search and obtain a feasible number of guidelines for appraisal. Guideline evaluation was conducted by health professionals using the Appraisal of Guidelines for Research and Evaluation (AGREE) II instrument, which was developed to address the issue of variability in guideline quality and assesses the methodological rigor and transparency in which a guideline is developed. A total of 17 guidelines were selected and evaluated. Results Ten guidelines scored ≥ 70% and seven guidelines scored ≥ 80%. The range was 21%–100%. The mean scores for Latin America and the Caribbean (LAC) country guidelines (n = 6) were compared to the mean scores for non-LAC country guidelines (n = 11). International guidelines consistently scored notably higher in all domains and overall quality than LAC guidelines. Conclusions Based on this study’s findings, it is clear that T2DM clinical practice guideline development requires further improvements, particularly with regard to the involvement of stakeholders and editorial independence. This issue is most apparent for LAC country guidelines, as their quality requires major improvement in almost all aspects of the AGREE II criteria. Continued efforts should be made to generate and update high-quality guidelines to improve the management of increasingly prevalent noncommunicable diseases, such as T2DM.

of diabetes, with worldwide prevalence among adults (20-79 years old) estimated at 415 million in 2015 (1). There are three main forms of diabetes mellitus: type 1, type 2, and gestational diabetes. Among them, type 2 diabetes mellitus (T2DM) is the most common form, accounting for more than 90% of all cases (2). T2DM, characterized by insulin resistance and relative insulin deficiency, was once primarily observed in economically affluent countries. Recent epidemiological trends, however, suggest an increasing prevalence of T2DM in low-and middle-income countries (1). Consequently, the health systems of countries such as those in Latin America and the Caribbean (LAC) face major challenges in coping with the rising burden of T2DM (3,4). With the prevalence expected to continue to rise in LAC countries, there is an urgent need to improve the management and care of T2DM based on the best available clinical evidence (1,4).

CLINICAL PRACTICE GUIDELINES
Evidence-based clinical practice guidelines (CPGs) are documents designed to guide decision-making with regard to the diagnosis, management, and treatment of various disorders. CPGs are based on a systematic review of evidence and an assessment of the benefits and harms of various care options. Thus, these guidelines bridge the gap between research findings and practice. The quality of CPGs is therefore essential as they can influence health care delivery methods and, ultimately, affect patient outcomes. High-quality guidelines promote the use of effective clinical services, reduce the amount of practice variation, minimize the use of ineffective or questionable procedures and services, and increase the use of underused effective services (5). Nevertheless, a number of studies have reported that the quality of CPGs in a variety of clinical areas is modest or variable (6,7). Consequently, the evaluation of CPGs is of great importance. The objective of this study was to obtain an evaluation of current T2DM clinical practice guidelines.

Identification of guidelines
Relevant guidelines were identified through a systematic search of MEDLINE/PubMed. Pan American Health Organization (PAHO) country offices were also contacted to obtain national diabetes guidelines in use but not published/available online. The online search used the following strategy: diabetes mellitus as a MeSH term (mh) or all variations of Diabet after the "t" in the article title (ti) and guideline, practice guideline, or consensus development conference as a publication type (pt) or health planning or clinical protocols as a MeSH term (mh) or all variations of guideline after the "n" in the article title (ti). Through the specified search strategy, 770 records were identified on MEDLINE/PubMed for citations published from 2008 to 2013. After an initial screening of these records, 146 were found to be guidelines related to diabetes. Inclusion and exclusion criteria were used to further refine the search and obtain a feasible number of guidelines for appraisal. Inclusion criteria for selected guidelines included 1) focus on adult populations only; 2) published between 2008 and 2013; 3) written in English, French, Portuguese, or Spanish; and 4) focus on the screening, diagnosis and/or management of T2DM only. Exclusion criteria for this search included guidelines that focused on specific diabetic/nondiabetic complications (e.g., diabetic foot, hypertension, and stroke). Guidelines that focused on recommendations for specific therapies or interventions were also excluded from the search. A total of 17 guidelines (8)(9)(10)(11)(12)(13)(14)(15)(16)(17)(18)(19)(20)(21)(22)(23)(24) were selected and evaluated (Table 1).

Appraisal instrument
Guideline evaluation was conducted using the Appraisal of Guidelines for Research and Evaluation (AGREE) II instrument, which was developed to address the issue of variability in guideline quality and assesses the methodological rigor and transparency in which a guideline is developed. The AGREE II instrument is rapidly becoming the gold standard for the evaluation of CPG quality and is endorsed by the World Health Organization (WHO); the Council of Europe (Strasbourg, France); and the Guidelines International Network (Pitlochry, Perthshire, Scotland) (25). The AGREE instrument consists of 23 key items divided into six domains: 1) scope and purpose (Items 1-3); 2) stakeholder involvement (Items 4-6); 3) rigor of development (Items 7-14); 4) clarity of presentation (Items 15-17); 5) applicability (Items 18-21); and 6) editorial independence (Items [22][23]. The AGREE II instrument uses a seven-point rating scale for each of the 23 items. A score of 1 is given to an item when there is no/poor information provided, and a score of 7 is given when there is exceptional quality of reporting for the item (26). The final component of the AGREE II evaluation consists of a rating for the overall quality of the guideline based on a scale of 1 (lowest quality) to 7 (highest quality), and a recommendation of "Yes," "Yes, with modifications," or "No" in regard to the use of the evaluated guideline in practice. The score for each domain is then standardized and expressed on a scale of 0-100. For the purposes of this report, a minimum standardized score of 70% (roughly equating to a given score of 4 on a scale of 1 to 7) is considered acceptable quality. A standardized score of 80% or higher is considered excellent (roughly equating to a given score of 5-6 on a scale of 7). The Appendix provides a complete breakdown of the AGREE II domains and the 23 corresponding items, and an example of how domain scores were calculated and standardized.

Evaluator selection
Guideline evaluators were selected on a volunteer basis. An official request was sent to selected individuals at PAHO regional offices and professionals in the field to secure volunteers willing to evaluate the quality of selected guidelines using the AGREE II instrument. PAHO country offices were also asked to provide staff members for evaluation, and individual experts in diabetes and/or guideline evaluations were contacted to provide support for the project. Before the assessment, the evaluators were required to review the provided training material for the AGREE II instrument, which included the full user guide/manual for the AGREE II instrument and a short tutorial video on how to use and apply the AGREE II instrument for assessing practice guidelines. The number of evaluators for each guideline varied, but all guidelines were required to be evaluated by the AGREE II-stipulated minimum of two individuals. Once the selected diabetes CPGs were evaluated, reviewers measured their quality, according to scores from AGREE II, and provided any suggestions for improving them.

RESULTS
A summary of the six assessment domains and corresponding criteria for the AGREE II quality assessment instrument is provided below. A summary of the scores for each guideline, by domain, can be found in Table 2.

AGREE II evaluation instrument domains/criteria
Domain 1: Scope and purpose. Domain 1 involves the degree to which the overall objectives, health questions covered, and guideline target population are specifically described. The majority of the guidelines scored very high for this domain, with only two guidelines scoring < 50%. In total, 15 (88%) of the guidelines scored ≥ 70% and eight (47%) guidelines scored ≥ 90%. Overall, the mean score for the guidelines was relatively high at 84% (range: 31%-100%). The median score for this domain was 89%.
Domain 2: Stakeholder involvement. This domain focuses on the participation of relevant groups in the guideline's development, the extent to which the views of the target population were considered, and how clearly the target group was defined. This was the most poorly scored domain. Only five guidelines scored ≥ 70%, and only two scored ≥ 80%. No guideline received a score of 100%. Five guidelines (29%) scored < 50%. The mean score for all guidelines was 61%-the lowest average score across all six domains. The median score was 58%, the lowest median score for all guidelines. The range was 33%-94%.
Domain 3: Rigor of development. This domain involves the process used to gather and synthesize evidence and the methods used to formulate guideline recommendations. For rigor of development, nine guidelines scored ≥ 70%, and six scored ≥ 80%. Two guidelines scored < 50%. Overall, the mean score was 70% (range: 23%-100%), with a median score of 71%.
Domain 4: Clarity of presentation. This domain deals with the level of clarity and ease of identifying recommendations in the guideline, along with whether alternative options for management of the health condition are present. This was the highest-scored domain of all six domains in terms of both the mean and median scores. A total of 15 guidelines scored ≥ 80% and 11 (79%) guidelines scored ≥ 90%. Only one guideline scored < 70%. Overall, the mean score was 88% and the median score was 92%. The range was 28%-100%.
Domain 5: Applicability. This domain addresses the feasibility of guideline implementation, as well as strategies for and resource costs associated with guideline implementation. A total of 10 guidelines scored ≥ 70% and five scored ≥ 80%. No guideline received a score of 100%, and three guidelines scored < 30%. Overall, the mean score for this domain was 67% (range: 17%-94%), with a median score of 77%. Domain 6: Editorial independence. This domain concerns potential conflicts of interest associated with the development and content of the guideline and the extent to which this has been addressed. Along with stakeholder involvement, editorial independence scored very poorly relative to the other four domains. Five guidelines scored ≤ 50% and five guidelines scored ≥ 80%. Overall, the mean and median scores for the guidelines were 62% and 63%, respectively, with a range of 19%-100%.
Four guidelines ("ALAD," "CHIL," "CTFPHC," and "EUR" 6 ) received a unanimous recommendation of "Yes" for use in practice. Eight guidelines received recommendations of both "Yes" and "Yes, with modifications" for use in practice, and three guidelines received rec- 5 National Institute for Health and Care Excellence (London) (21). 6 Latin American Diabetes Association (16), Chilean Ministry of Health (13), Canadian Task Force on Preventive Health Care (11), and European IMAGE project (14) respectively. ommendations of only "Yes, with modifications." Only two guidelines ("BRA" 7 and "NICA") received evaluator recommendations of both "Yes, with modifications" and "No" for their use in practice.

LAC versus international guidelines scores
The mean scores (by domain) for LAC country guidelines (n = 6) were compared to those for the non-LAC country guidelines (n = 11) (referred to here as "international guidelines") and are shown in Figure 1.
International guidelines consistently scored notably higher in all domains and overall quality than LAC guidelines. The smallest difference in scores was for "Scope and purpose," where mean scores for LAC and international guidelines 7 Brazilian Ministry of Health (10).
were 76% and 88% respectively. The biggest difference in scores was for "Editorial independence," where mean scores were 76% for international and 36% for LAC. Interestingly, although international guidelines scored poorly for "Stakeholder involvement" (66%), the mean score for "Editorial independence" was a much more respectable 76%. Overall, excluding "Stakeholder involvement," the mean domain scores for international guidelines all were ≥ 70%. LAC guidelines scores were much lower, with four mean domain scores ≥ 60%. The highest mean domain score for a LAC guideline was for "Clarity of presentation" (79%) and the lowest was for "Editorial independence" (36%).

DISCUSSION
Domain scores were generally highly variable within guidelines. For example,  the U.S. Preventive Services Task Force guidelines ("USPSTF") scored ≥ 90% in three domains and < 70% in three other domains. Scores were also variable between guidelines. For example, the narrowest score range was for "Scope and purpose" (31-100). Domains corresponding to "Stakeholder involvement" and "Editorial independence" had the lowest median and mean scores for all guidelines. These results were consistent with similar studies evaluating CPG quality (5,27). The poor scores for "Editorial independence" may stem from a lack of appreciation by guideline developers for the importance of conflict of interest disclosures, or could be indicative of actual conflicts of interests between sources of funding and guideline developers. Indeed, there is considerable evidence that financial conflicts of interest are highly prevalent among CPGs in a variety of clinical areas (28,29). Furthermore, there is emerging evidence that conflicts of interest may affect guideline recommendations (30). Conversely, "Clarity and presentation" and "Scope and purpose" were the highest scored domains, consistent with other guideline evaluation studies (27,31). This might reflect greater priority placed on these domains by guideline developers, or greater ease in achieving high scores for these domains. Increased attention was given to scores for Domain 3 ("Rigor of development"), which may be considered a strong indicator of guideline quality (5,32). Overall, the mean score for this domain for all guidelines was 70%, with a median score of 71%. However, like all of the other domains, there was a large discrepancy between Domain 3 scores for LAC and those for international guidelines. LAC guidelines performed much poorer (mean score of 50%) than international guidelines (mean score of 81%).

standardized mean domain and overall scores from an evaluation of Latin America and the Caribbean (LAC) and international type 2 diabetes mellitus clinical practice guidelines (n = 17), 2008-2013
Overall, the discrepancy between LAC and international guideline scores was striking. Of the 17 guidelines evaluated, however, only two (from the latter group-"EUR" and "ICSI" 8 ) scored ≥ 70% in all six domains. The LAC guidelines displayed some of the lowest sets of domain scores. The "BRA" and "NICA" guidelines had only one domain each-"-Clarity of presentation" and "Scope and purpose" respectively-that scored ≥ 50%. Among all mean domain scores for LAC guidelines, the highest was 79% (for "Clarity of presentation"). Thus, it appears LAC guideline development is severely lacking in virtually all respects. This is especially concerning given the staggering prevalence and burden of T2DM in LAC countries (1,3).

Limitations
There were a number of limitations to this study. First, reviewers did not compare their individual scores with one another for each item, as suggested in the literature to reduce bias (5). The lack of comparison may be problematic, especially given the wide variation in this study in reviewers' scores for a number of domains/items for the same guideline. Cross-checking scores between 8 The U.S. Institute for Clinical Systems Improvement guideline (23). reviewers to identify large discrepancies and achieving consensus on those differences could have helped reduce the variation in scoring. This was impossible, however, as the reviewers were located in different countries with major connectivity limitations. However, all participating evaluators were given the same training and terms of reference in an effort to decrease individual bias. Another limitation of this study was that the search did not include databases other than MEDLINE/PubMed, and this may have limited the number of identified guidelines. However, the authors believe that 1) the number of identified international guidelines was sufficient for a good-quality evaluation, and 2) the lack of searches in Latin American databases was likely mitigated by the request to PAHO offices in the region for available guidelines that fulfilled the selection criteria, and that all available, qualifying guidelines from PAHO member states were obtained. Other limitations stemmed from issues related to the AGREE II tool itself. For example, no threshold is provided in the AGREE II instrument to distinguish high-quality guidelines from low-quality ones based on the scores. In this report, minimum values of 70% and 80% were arbitrarily chosen as cutoff points for "acceptable" and "excellent" quality respectively. Therefore, the quality ratings provided here are subjective and must be considered within that context. The statistical reliability and significance of the data is also questionable due to the limited number of appraisers scoring the guidelines. Thus, conclusions based on these data must be made with caution. Owing to the small sample size of evaluated guidelines, conclusions about the comparison between LAC and international guidelines must also be made with caution. Finally, the international guidelines were largely from more highly developed countries, such as Canada, the United States, and Australia, which may have skewed the data showing higher international guidelines scores.

Conclusions
In recent years, there has been a trend of an increasing number of CPGs being developed by various organizations and other entities worldwide. These guidelines are designed to aid decision-making and reduce clinical practice variability and thus play a pivotal role in health care. It is therefore important that guideline development adheres to strict standards to ensure both consistency and high quality. Based on this study's findings, it is clear that guideline development requires further improvements, particularly with regard to the involvement of stakeholders and editorial independence. This issue is most apparent for LAC country guidelines, as their quality requires major improvement in almost all aspects of the AGREE II criteria. Therefore, continued efforts should be made to generate and update high-quality guidelines in order to improve the quality of management for T2DM and other increasingly prevalent noncommunicable diseases.

Conflicts of interest. None.
Disclaimer. Authors hold sole responsibility for the views expressed in the manuscript, which may not necessarily reflect the opinion or policy of the RPSP/PAJPH or the Pan American Health Organization (PAHO).