Low inter-rater reliability in traditional Chinese medicine for female infertility
  1. Oddveig Birkeflet (1),
  2. Petter Laake (2),
  3. Nina Vøllestad (3)
  1. (1) Institute of Health and Society, University of Oslo, Oslo, Norway
  2. (2) Department of Biostatistics, Institute of Basic Medical Sciences, University of Oslo, Oslo, Norway
  3. (3) Institute of Health and Society, University of Oslo, Oslo, Norway
  Correspondence to Mrs Oddveig Birkeflet, Institute of Health and Society, University of Oslo, N-0318 Oslo, Norway.


Background Treatment of patients according to individual pattern diagnoses is an important feature of acupuncture rooted in traditional Chinese medicine (TCM). Little is known about the reliability of TCM pattern diagnoses.

Objective To examine in a cross-sectional study the inter-rater reliability of TCM diagnoses and acupuncture point selection.

Methods 30 infertile and 24 previously pregnant women were examined for TCM patterns by two acupuncturists. An operational interview guide related to gynaecology was used. The acupuncturists independently decided on the TCM patterns (categorised as excess, deficiency and merged patterns) and the prescription of acupuncture points. Kappa statistics were used in the analyses.

Results 39 different TCM patterns and 36 different acupuncture points were used. For the choice of acupuncture points, poor to no agreement was found. Moderate to fair agreement was seen in excess/deficiency and merged patterns. Perfect match to moderate agreement on treatment was obtained when choosing meridians given certain TCM patterns.

Conclusions The low agreement on diagnoses indicates that acupuncturists follow individual pattern differentiation processes. The selection of acupuncture points seems to be closely related to the choice of TCM pattern diagnoses. The results indicate that, because of the poor reliability of the diagnoses, the treatment received by a patient will vary between acupuncturists, which in turn is a challenge for clinical trials of acupuncture.

Acupuncture is commonly used as an adjunct to in vitro fertilisation (IVF).1 Although IVF and acupuncture is an active area of research,1 2 the results of studies are difficult to interpret and compare because of a large variation in diagnostic criteria for patient inclusion and in the choice of acupuncture points.1 This has been emphasised as a major obstacle for systematic reviews in this area.3 4

Individualised pattern diagnoses (based on signs and symptoms) and treatment according to these individual patterns are important features of acupuncture rooted in traditional Chinese medicine (TCM).5–8 To ensure consistent and optimal treatment the pattern diagnoses must be reliable, but only a few studies have assessed the reliability of diagnostic data collected during a TCM examination.9

O'Brien and Birch reviewed studies of the reliability of traditional East Asian medicine diagnoses and concluded that reliability of pattern diagnoses and treatment was low.10 Reliability has been studied for different patient groups using a wide range of research designs. Some studies have examined how several clinicians diagnose and suggest treatment for one or only a few patients,11 12 whereas others have compared pairs of clinicians examining a number of patients.13 The results of these studies show no consistent pattern. A relatively good agreement has been reported for TCM diagnoses while suggestions for acupuncture points have shown greater variation.12 Most of the studies report percentage agreement and sometimes the correlation between raters. Relevant statistical analyses such as Kappa (κ) statistics are rarely reported.

Decisions about treatments should be based on the TCM pattern diagnoses and hence it is possible that low agreement between clinicians for suggested acupuncture points might be caused by differences in diagnoses. Agreement on the relationship between diagnoses and acupuncture treatment has to our knowledge never been investigated. In one study on low back pain, Hogeboom et al11 reported non-significant correlations between diagnoses and acupuncture point selection by six acupuncturists. κ statistics were not reported.

Several factors may contribute to the poor reliability of TCM pattern diagnoses and suggestions for treatment. The practitioners may vary owing to differences in clinical education and experience. Furthermore, studies have often used a design in which patients are examined in sequence by the acupuncturists. If more than one acupuncturist interviews the same patient at different times, this may induce variations in the way in which the patient presents their symptoms and signs. A low agreement may thus not reflect differences between clinicians, but variations that can be attributed to the patients. There is a need for studies that reduce this effect, in order to examine the true inter-rater differences.

The studies so far have examined the inter-rater reliability of TCM patterns in specific patient groups where the variation in signs and symptoms might be small. Hence, the expectations of the practitioner may influence the examination and the conclusions about TCM patterns. The variability may then be underestimated and reliability higher than in a normal clinical setting. In our study, we wanted to increase the variability and thus strengthen the test by examining the reliability in a mixed group of fertile and infertile women. Furthermore, different biomedical causes of infertility were included for the same reason.

The objective of this study was to determine the inter-rater reliability in TCM patterns and prescriptions of acupuncture points.


Methods

Study design

This study was designed as a cross-sectional study of two groups of women: infertile and fertile. For this analysis, the participants were combined into one group. For two acupuncturists we examined three aspects of the inter-rater reliability: diagnoses of TCM patterns, single acupuncture point selection and point prescription according to TCM patterns.


Participants were 54 Norwegian-speaking women with an average age of 33.3 years (range 24–42). In the period from September 2007 to October 2008, 30 infertile women and 24 women who had previously achieved spontaneous pregnancy were consecutively interviewed. The infertile women were recruited among those included in an IVF programme and they met the medical requirement for an infertility diagnosis: a failure to conceive after 12 months of unprotected intercourse.14 Twenty-four of them had primary infertility (had never been pregnant) and six had secondary infertility (had children in the past). Eleven were still under medical examination and were regarded as having unexplained infertility. The self-reported biomedical diagnoses were endometriosis (n=5), polycystic ovarian syndrome (n=6) and poor egg quality (n=3). Fertile women were recruited from women who had previously been spontaneously pregnant. They had delivered within the last 3–12 months before participating. Participants were recruited via advertisements on websites for maternity care and posters displayed at doctors' offices. All participants volunteered for the study and gave written informed consent.


The interview took place in an acupuncture clinic in Asker, Norway. For each participant the two acupuncturists attended the consultation together, to ensure simultaneous access to the information. This minimised the potential for observational changes/bias. The consultation was based on the four diagnostic methods of inquiry as described by Maciocia: case history taking, palpation, observation and auscultation.15 An operational structured interview guide according to Maciocia's symptoms and signs in gynaecology15 ensured that all the participants were asked identical questions. Supplementary information was collected according to individual symptoms. While one acupuncturist guided the interview, the other listened and had an opportunity to make notes and ask additional questions. Both acupuncturists concurrently examined the tongue and the radial pulse bilaterally on each participant. They did not discuss their findings with each other. After completing the interview, the acupuncturists individually diagnosed the TCM patterns based on their own judgement and criteria. Finally, they recommended acupuncture points for the patterns they had diagnosed. The acupuncturists were aware of whether the woman examined was fertile or infertile. With the structured interview guide, each interview lasted about 60 min.

Both acupuncturists were educated at a Norwegian acupuncture college offering a bachelor's degree in TCM acupuncture. One acupuncturist had 6 years and the other 20 years of clinical experience plus an advanced course from Nanjing College of TCM in Nanjing, China. One of the acupuncturists participated in the research team.

Data analysis

As several TCM patterns were used only once, we analysed the patterns in merged groups. The single TCM patterns from each acupuncturist were merged into excess and deficiency patterns as illustrated in figure 1. To examine if the agreement improved when merged further on a higher level, we united the excess and deficiency patterns from the same organ system to the merged patterns. The pattern categories consist of dichotomous variables (figure 1).

Figure 1

Definition of the variables when merging single patterns into excess and deficiency patterns and further to organ/category level.

One pattern identifies imbalances in two organs: Heart and Spleen blood deficiency. This pattern was categorised under Heart deficiency.

To examine if the agreement for recommended treatment improved on a higher level, the single acupuncture points on the same meridian were merged. The frequency of agreement on diagnosis or acupuncture points (or their merged groups) is termed ‘mutual positive score’.
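The merging of points into meridians and the counting of the mutual positive score can be sketched as follows. This is only an illustration: the function names and the per-participant selections are hypothetical, and the prefix table covers just the standard codes for the Liver (LR), Kidney (KI), Spleen (SP) and Stomach (ST) meridians.

```python
# Standard meridian prefixes for a few point codes; any other prefix is
# returned unchanged. The table is an illustrative subset, not the study's list.
MERIDIAN_BY_PREFIX = {"LR": "Liver", "KI": "Kidney", "SP": "Spleen", "ST": "Stomach"}


def meridian_of(point):
    """Return the meridian name for a point code such as 'LR3'."""
    prefix = point.rstrip("0123456789")
    return MERIDIAN_BY_PREFIX.get(prefix, prefix)


def merge_to_meridians(selections):
    """Replace each participant's set of points with the set of meridians."""
    return [{meridian_of(p) for p in points} for points in selections]


def mutual_positive_score(rater1, rater2, item):
    """Count participants for whom both raters selected `item`."""
    return sum(1 for a, b in zip(rater1, rater2) if item in a and item in b)


# Hypothetical selections for three participants (not study data).
acu1 = [{"LR3", "SP6"}, {"LR3", "KI3"}, {"LR2", "ST36"}]
acu2 = [{"LR3"}, {"SP6", "KI3"}, {"LR3", "ST36"}]

point_score = mutual_positive_score(acu1, acu2, "LR3")
meridian_score = mutual_positive_score(
    merge_to_meridians(acu1), merge_to_meridians(acu2), "Liver"
)
print(point_score, meridian_score)
```

In this made-up example the raters agree on LR3 for one participant but on the Liver meridian for two, which is the kind of change the merged analysis is meant to detect.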

Statistical methods

The κ statistic was used to assess the level of agreement between the two acupuncturists beyond that expected by chance. κ values ≤0.20 were considered as poor agreement, 0.21–0.40 as fair, 0.41–0.60 as moderate and 0.61–0.80 as good. Values between 0.81 and 1.00 were regarded as very good agreement.16 When the marginal totals of the 2×2 table are not balanced, the observed proportion of agreement can be quite high while the value of κ still indicates a low level of reliability. This is a known paradox of the κ statistic, and κ alone is therefore insufficient. For this reason we also report the maximum κ value. The maximum value of κ is 1.00, indicating perfect agreement, and 0 indicates agreement no better than chance. Negative values show agreement worse than chance.16 The calculations were done in SPSS 16.0 for Windows. The 95% CI and maximum κ were calculated with DAG_Stat.17
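For readers who want to see the arithmetic, the following is a minimal sketch of Cohen's κ for a 2×2 table. It assumes that the maximum κ is the largest κ attainable with the observed marginal totals held fixed; the text does not state how DAG_Stat computes it, so that definition, the function names and the counts are assumptions made for illustration.

```python
def kappa_stats(a, b, c, d):
    """Return (kappa, kappa_max) for a 2x2 table of counts.

    a: both raters positive, b: only rater 1 positive,
    c: only rater 2 positive, d: both raters negative.
    """
    n = a + b + c + d
    p1, p2 = (a + b) / n, (a + c) / n           # each rater's share of positives
    po = (a + d) / n                            # observed agreement
    pe = p1 * p2 + (1 - p1) * (1 - p2)          # agreement expected by chance
    if pe == 1:                                 # both raters constant: kappa undefined
        return None, None
    po_max = min(p1, p2) + min(1 - p1, 1 - p2)  # best agreement the marginals allow
    return (po - pe) / (1 - pe), (po_max - pe) / (1 - pe)


def interpret(kappa):
    """Map a kappa value to the agreement bands described in the text."""
    bands = [(0.20, "poor"), (0.40, "fair"), (0.60, "moderate"), (0.80, "good")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "very good"


# Hypothetical counts, not taken from the study's tables.
kappa, kappa_max = kappa_stats(a=18, b=7, c=12, d=13)
print(round(kappa, 2), round(kappa_max, 2), interpret(kappa))
```

Reporting both values shows how much of a low κ is forced by unequal marginal totals rather than by disagreement on individual participants.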


Results

Altogether 39 different TCM patterns were used; acupuncturist 1 (acu1) used 32 and acupuncturist 2 (acu2) used 29 different patterns. Most often, several patterns were diagnosed for each participant; on average, acu1 assigned six and acu2 five patterns. A total of 36 different acupuncture points were used. Acu1 used 34 and acu2 used 22 different points. On average, they selected eight and six points, respectively, for individual subjects.

The reliability of TCM patterns and acupuncture points was analysed separately and thereafter the reliability of acupuncture points for given merged TCM patterns was analysed.

TCM patterns

The Damp excess pattern was diagnosed for 21 and 27 women by the two acupuncturists, respectively. There was a fair agreement for this category (table 1). The Liver excess pattern was diagnosed for most of the participants and the maximum κ indicates a moderate agreement (table 1). Spleen and Kidney deficiency were the most commonly used deficiency patterns. Maximum κ showed a fair agreement for both patterns. Among the merged patterns, Liver was the most used pattern, in 94% of the respondents. It was the pattern with the highest inter-rater agreement on diagnoses; maximum κ indicated a moderate agreement. The cases for which κ showed no agreement and maximum κ showed a fair agreement were interpreted as agreement on not using these patterns, since the patterns were used infrequently.

Table 1

Frequencies of the merged traditional Chinese medicine patterns diagnosed by each acupuncturist, their mutual positive score, κ, CI and the maximum κ value

The acupuncture points

For the least used points, the maximum κ values indicate a fair to moderate agreement (table 2). Since these points were seldom used, this probably reflects agreement on not using these points.

Table 2

Distribution of the acupuncture points and reliability measures. The frequency of each acupuncturist's use of points, the mutual positive score, κ, CI and the maximum κ value

The most commonly used point was LR3, which was used on almost all the women. The negative κ value is due to imbalance in the distribution of marginal totals in the 2×2 matrix used to calculate the κ value. However, maximum κ shows a fair agreement. For KI3 and SP6, which were used on a majority of the women, the agreement was much lower. For the other points the two acupuncturists differed to a large extent in the frequency of their use, with poor to fair agreement based on maximum κ.
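The effect of unbalanced marginal totals on κ can be illustrated with a small calculation. The counts below are hypothetical (they are not the LR3 counts from table 2) and simply describe a point that both raters choose for nearly every participant.

```python
def agreement_and_kappa(a, b, c, d):
    """Return observed agreement and Cohen's kappa for a 2x2 table.

    a: both raters used the point, b: only rater 1, c: only rater 2,
    d: neither rater used it.
    """
    n = a + b + c + d
    po = (a + d) / n                      # observed proportion of agreement
    p1, p2 = (a + b) / n, (a + c) / n     # each rater's share of positives
    pe = p1 * p2 + (1 - p1) * (1 - p2)    # agreement expected by chance
    return po, (po - pe) / (1 - pe)


# Hypothetical: 54 participants, both raters use the point in 50 of them,
# each rater omits it twice, and they never omit it for the same participant.
po, kappa = agreement_and_kappa(a=50, b=2, c=2, d=0)
print(f"observed agreement = {po:.3f}, kappa = {kappa:.3f}")
```

With these made-up counts the raters agree on more than 90% of participants, yet κ is slightly negative, because chance agreement is already very high when both raters select the point almost every time.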

The meridians

Agreement was also examined after merging the points for each meridian—for example, all the Liver points were collected and named as the Liver meridian (table 3). In general, the maximum κ values were about the same as for single points.

Table 3

Frequencies of the acupuncturists' use of the most used meridians, their mutual positive score, κ, CI and the maximum κ value

For the most often used meridians (Liver, Kidney, Stomach and Spleen) the acupuncturists showed a moderate to fair agreement (maximum κ). Also for the least used meridians, a fair agreement was seen (maximum κ). This again may reflect agreement in not choosing the points on the meridian. For the other meridians the agreement was poor.

Meridians for given merged patterns

In those cases where the two acupuncturists agreed on the merged patterns, almost complete agreement was seen for choosing the Liver meridian (table 4). Furthermore, the Liver meridian was selected for almost all women and consequently, the Liver meridian was found in combination with nearly all other patterns. Hence, a high agreement was seen for the Liver meridian across all patterns. For the other meridians, fair agreement was seen in their use for the merged patterns. κ could not be calculated in some cases because acu2 had used the meridian in all cases.

Table 4

Frequencies of the meridians of the Liver, Spleen, Kidney and Stomach used on the patterns of the Liver, Spleen, Kidney, Damp and Blood, the mutual positive score and the κ, CI and the maximum κ value


Discussion

Our findings showed moderate agreement for the Liver patterns and fair to poor agreement for the other patterns. For the most used acupuncture points the agreement was fair, and at the meridian level it was fair to moderate. For the selection of meridians to use in the treatment of given merged TCM patterns, we found 100% agreement in using the Liver meridian on 50% of the patterns and the data show moderate to fair agreement for the other patterns.

The chosen design, with simultaneous participation in the consultation, did not allow more than two acupuncturists. However, this ensured that the two acupuncturists simultaneously accessed the same clinical information. Hence, the differences in diagnostics must arise in the interpretation process from symptoms and signs to conclusion about the TCM pattern diagnoses.

Previous studies have used a different design, where the patients are examined repeatedly.11 12 18–21 Hence, one cannot distinguish between differences owing to presentations of symptoms and differences that can be ascribed to the clinician's interpretation.

One possible explanation for the different interpretations may be differences in the acupuncturists' background and clinical practice. However, the acupuncturists had their education from the same school and followed the same curriculum. Their clinical practice was similar, although one of them had greater experience as an acupuncturist (20 vs 6 years). Previous studies show that acupuncturists possessing at least a bachelor's degree and between 5 and 20 years of experience show poor consistency in agreement on TCM pattern diagnoses.9 11 13 18–24 Hence, it seems that low agreement is typical even among those with a long education and clinical experience and not caused by the design of this study.

This study showed that even when TCM patterns were collapsed into broader categories, such as the merged patterns, higher inter-rater reliability was not achieved. This finding is consistent with the results of Mist et al, who found no improvement in agreement when they united TCM patterns into broader categories.13 Our study partly followed the same procedure in grouping patterns.

One complication in the process of differentiating symptoms and signs into TCM patterns is that some symptoms and signs observed together may provide conflicting information. In gynaecology, Maciocia states that conflicting and contradictory gynaecological manifestations occur commonly.15 Yet TCM textbooks lack clear guidelines for interpreting contradictory information.6 15 25 One possible cause for the differences in diagnoses may thus be the different importance and weight put on the different observations. Mist et al found that the inter-rater reliability of TCM diagnoses was improved through a calibration training process among practitioners when they used a questionnaire-based diagnostic process designed to cover the major factors for pattern differentiation.13 Their results indicate that clear guidelines and definitions lead to greater consensus among the practitioners. Our study used a common interview guide only to standardise the collection of data. No attempt was made to standardise the process of interpreting signs and symptoms into TCM diagnoses.

The reliability was also poor for the acupuncture points, even when we merged them according to the meridians. Similarly, Hogeboom et al found very low agreement among practitioners about which patients should receive a given acupuncture point.11 However, they also examined the acupuncture points for relationship with specific TCM diagnoses. By grouping some of the points into 12 clusters, they found that only two clusters were strongly associated with a specific diagnosis.11 We found improved reliability when the acupuncture points were grouped into meridians and examined for a given merged pattern diagnosis. Thus, it seems that the two acupuncturists achieved greater consensus about the treatment for a given diagnosis than in making the diagnoses. Correspondingly, Zhang et al found that practitioners had better agreement about a diagnosis and the herbal prescription required, than agreement between the practitioners' prescriptions.21 This is in keeping with the textbooks and education that provide recommendations about which treatment/acupuncture points to use on specific patterns.25


Conclusions

We found unsatisfactorily low inter-rater reliability in the individualised TCM pattern diagnoses and also in the selection of acupuncture points. The low agreement about diagnoses and acupuncture point selection indicates that acupuncturists follow individual pattern differentiation processes. This leads to differences in treatment of similar conditions, which in turn is a challenge for clinical trials of acupuncture. The key seems to lie in the diagnostic process, since there is a strong relationship between TCM pattern diagnosis and selection of acupuncture points.

Summary points

  • Previous studies have mostly found that TCM diagnosis has poor reliability.

  • We compared diagnosis and point selection made by two acupuncturists at the same consultation.

  • Agreement was generally unsatisfactory, though moderate for selecting points for a given diagnosis.


The authors acknowledge Christina Weseth for providing study data and they also thank the participating women.


  • Funding Norwegian Acupuncturist Association, The National Research Center in Complementary and Alternative Medicine through The Norway–China Cooperation funds and Pharma West AS.

  • Competing interests None.

  • Patient consent Obtained.

  • Ethics approval This study was conducted with the approval of the Regional Committee for Medical and Health Research Ethics.

  • Provenance and peer review Not commissioned; externally peer reviewed.

