Poor multi-rater reliability in TCM pattern diagnoses and variation in the use of symptoms to obtain a diagnosis
Oddveig Birkeflet (1), Petter Laake (2), Nina K Vøllestad (1)
(1) Institute of Health and Society, University of Oslo, Oslo, Norway
(2) Department of Biostatistics, Institute of Basic Medical Sciences, University of Oslo, Oslo, Norway
Correspondence to Oddveig Birkeflet, Institute of Health and Society, University of Oslo, N-0318 Oslo, Norway


Background Pattern differentiation and diagnosis are fundamental principles of Traditional Chinese Medicine (TCM). Studies have shown low inter-rater reliability in TCM pattern diagnoses. This variability may originate from both the identification and the interpretation of symptoms and signs.

Objective To examine the inter-rater reliability in TCM pattern diagnoses made in the style of Maciocia for 25 case histories by eight acupuncturists and to explore the impact of demographic factors on the diagnostic conclusion. Further, the association between the diagnosis and the presence of symptoms was examined for a single TCM diagnosis.

Methods Eight acupuncturists independently diagnosed 25 women (15 fertile, 10 infertile) based on written case histories. Descriptive statistics, logistic regression and inter-rater reliability (κ) were used.

Results Poor inter-rater reliability on TCM patterns (κ<0.20) and large variation in the number of TCM pattern diagnoses were found. Sex, duration of practice and education had a highly significant effect (p<0.001) on the use of TCM patterns and working hours had a significant effect (p=0.029). There was considerable intra- and inter-rater variation in the use of symptoms to make a diagnosis. Symptoms occurring frequently as well as infrequently were inconsistently used to diagnose Liver Qi Stagnation. The study was limited by a small sample size.

Conclusions The results showed extensive variation and poor inter-rater reliability in TCM diagnoses. Demographic variables influenced the frequency of diagnoses and symptoms were used inconsistently to set a diagnosis. The variability shown could impede individually tailored treatment.


Pattern differentiation is essential in Traditional Chinese Medicine (TCM) in the process of making a diagnosis from symptoms and signs.1 According to TCM theory presented by Maciocia, diseases are reflected in symptoms and signs that may be interpreted as imbalanced internal organs.2 The signs and symptoms are used to establish TCM pattern diagnoses and to form the basis of individual treatment, which is the selection of acupuncture points tailored to each patient's diagnosis.2–4 Based on these principles, one might expect that a given set of symptoms and signs results in the same TCM diagnosis across acupuncturists, that is, that the diagnosis has high reliability.

However, previous studies have shown low inter-rater reliability and large variability in TCM pattern diagnoses.5–10 It is unknown whether this low reliability is due to variability in identification of symptoms and signs or variability in the interpretation of them with regard to diagnosis.

Two previous studies have examined the agreement between acupuncturists in the recognition of symptoms and signs in the examination of patients. Hua et al found low agreement between two TCM practitioners in their identification of symptoms by visual inspection and palpation.11 O'Brien et al compared three TCM practitioners and the level of agreement on the inspection, auscultation and palpation variables ranged from slight agreement to almost perfect agreement.12 Hence, the low reliability in making a TCM diagnosis might be due to a low agreement in identifying symptoms and signs, but this does not preclude variability in the interpretation of symptoms and signs. It is important also to examine the variability in the interpretation of the diagnostic process to provide a comprehensive foundation for securing precision and reliability. By presenting a set of symptoms and signs to a group of TCM practitioners, one may be able to examine how they use and interpret the information to set TCM diagnoses.

The diagnostic process might also be influenced by personal and demographic factors. It has been proposed that the individual patients’ symptoms and signs play a lesser role in TCM diagnosis,6 and that the final diagnosis depends on the acupuncturists’ subjective interpretation, their education, clinical experience and which TCM books are consulted as the foundation for pattern differentiation.13

Unreliable TCM pattern diagnosis does not ensure optimal treatment tailored to the individual patient and provides a weak basis for TCM research. Evaluation of the TCM pattern identification process is therefore essential.10

The objective of this study was to examine the inter-rater reliability in TCM pattern diagnoses made by eight acupuncturists from 25 case histories. We also explored the impact of the demographic background of the acupuncturists on the diagnostic results. Finally, we examined how the acupuncturists used symptoms and signs to diagnose one common TCM pattern—namely, Liver Qi Stagnation.

Material and methods

Study design

In this cross-sectional study, 25 case histories were selected from 54 Norwegian women (24 fertile and 30 infertile) included in a previous study on TCM acupuncture diagnostics.5 None had sought acupuncture for health problems apart from infertility. The women had given written informed consent to distribute an anonymous description of their symptoms and signs to other acupuncturists for the purpose of the present study.

The 54 women had all been through a TCM consultation with a senior acupuncturist (the first author of the paper), who was educated at the Norwegian Acupuncture College, which offers a Bachelor's degree in TCM acupuncture, and who had completed the Advanced Course at Nanjing College of TCM. Data recording was based on The Four Diagnostic Methods of inquiry as described in Maciocia, namely, case history taking, palpation, observation and auscultation.14 An operationalised structured interview guide based on Maciocia's symptoms and signs in gynaecology14 ensured that all the participants were asked identical questions. Supplementary information was collected according to individual symptoms. We randomly sampled 25 women, 15 fertile and 10 infertile, to construct the case histories.

All the symptoms and signs described by the women during the consultation were presented in the case histories (an example is shown in figure 1). The quality of the pulse and the tongue was described as perceived by the senior acupuncturist and reported without using any TCM organ names. The qualities of the pulses were specified according to Maciocia's statements. The pulse qualities were described as felt in each of the three pulse depths (superficial, middle and deep) and in the three positions (front, middle and rear).15 Photographs of the tongues were not used because the pictures appeared with different colours on different computers. The tongues were therefore described according to Maciocia's aspects of tongue diagnosis: vitality of colour/tongue spirit, body colour, body shape, tongue coating and tongue moisture.16 These data were included in the case histories distributed to the acupuncturists to be diagnosed according to their knowledge of TCM criteria.

Figure 1

Example of a case history showing the symptoms and signs as presented to the acupuncturists. Bold type marks the headings for the symptoms. The pulse qualities are described as they were felt in the three depths (superficial, middle and deep) and in the three positions (front, middle and rear).

Participating acupuncturists

Members of the Norwegian Acupuncture Association (NAA) were invited to participate. Membership of NAA requires an equivalent to a Bachelor's degree in TCM acupuncture, including supervised practice. All eight acupuncturists were educated at the same school, which operates in collaboration with the Nanjing College of TCM. Invitations including information about the study were sent to 16 acupuncturists selected to represent a variation in geographical location, additional education and age. The invitation was followed up by telephone. Eight acupuncturists (labelled Acu1 to Acu8 for this report) volunteered to participate and received the case histories and instructions electronically.

The acupuncturists consisted of three men and five women of mean age 50 years (range 33–59 years). Two were physiotherapists, one a registered nurse and five had 1 year of basic medical courses (BM). The average experience was 12 years (range 4–20 years). The average working hours was 24 h per week (range 15–40 h). The acupuncturists who declined to participate were seven women and one man. Two were physiotherapists, one nurse, one ergonomist, one homeopath and four were BM.

The acupuncturists were informed that exactly the same questions were given to all the women and that only the symptoms actually experienced were reported. If a problem with, for instance, the stomach was not mentioned, it meant that the woman did not have such a problem. Some women were unable to specify symptoms and signs, such as the character of pain. In these cases their non-specific descriptions were given. If the acupuncturists found the case history description obscure, they were free to ask the researchers.

Each of the 25 case histories was distributed as a separate Word document, with space for the acupuncturists to type the TCM diagnoses. There was no instruction regarding the number of TCM patterns to be diagnosed. The acupuncturists were asked to identify all symptoms and signs that were used to substantiate each TCM pattern. The acupuncturists were blinded to any clinical or personal information about the cases apart from that specified in the case histories.

Analysis of symptoms and signs used to diagnose a single TCM pattern

The most common TCM pattern, Liver Qi Stagnation, was chosen as an analytical example to explore how the acupuncturists used symptoms and signs to make a diagnosis. For each case, the acupuncturists were asked to identify all symptoms and signs that substantiated their diagnosis. We counted the use of each symptom and sign and present the data as frequencies and percentage of their occurrence in the cases.
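The counting step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `symptom_usage`, the data layout and the toy data are hypothetical and are not taken from the study material.

```python
from collections import Counter


def symptom_usage(cases, citations):
    """Tally how often one acupuncturist used each symptom.

    cases:     {case_id: set of symptoms present in that case history}
    citations: iterable of (case_id, symptom) pairs that the acupuncturist
               cited in support of the diagnosis
    Returns {symptom: (times_used, times_present, percent_used)}.
    """
    # Number of case histories in which each symptom occurs.
    present = Counter(s for symptoms in cases.values() for s in symptoms)
    # Count a citation only once per case, and only if the symptom is
    # actually present in that case history.
    used = Counter(
        s for case_id, s in set(citations) if s in cases.get(case_id, set())
    )
    return {s: (used[s], n, 100.0 * used[s] / n) for s, n in present.items()}


# Hypothetical toy data (not from the study).
cases = {
    1: {"wiry pulse", "irritability"},
    2: {"wiry pulse"},
    3: {"wiry pulse", "sighing"},
    4: {"wiry pulse"},
}
citations = [(1, "wiry pulse"), (3, "wiry pulse"), (3, "sighing")]
usage = symptom_usage(cases, citations)
```

In this toy example the acupuncturist cites 'wiry pulse' in two of the four cases in which it is present, which corresponds to 50%.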

The presence of symptoms and signs varied from occurring in all 25 cases to occurring in only one case. It might be hypothesised that the differences in frequency influenced how the practitioners use them as the foundation for a diagnosis. Common symptoms might be too non-specific and might thus be ignored, whereas less common symptoms might provide more specific information. Hence, the use of symptoms was examined according to how frequently they occurred. Three groups of symptoms were analysed: frequent (in 13–25 cases), less frequent (in 6–12 cases) and rare symptoms (in 1–5 cases). The four most common symptoms in each group were used in the analyses (table 4).
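The grouping rule can be expressed as a small Python sketch. The cut-offs are those given above; the function names, the tie-breaking rule (alphabetical order) and the example symptom counts are assumptions made for illustration only.

```python
def frequency_group(n_cases):
    """Assign a symptom to a group from the number of cases (1-25) it occurs in."""
    if not 1 <= n_cases <= 25:
        raise ValueError("a symptom must occur in 1 to 25 cases")
    if n_cases >= 13:
        return "frequent"
    if n_cases >= 6:
        return "less frequent"
    return "rare"


def top_per_group(occurrence, k=4):
    """Return the k most common symptoms in each frequency group.

    occurrence: {symptom: number of cases in which it occurs}
    Ties are broken alphabetically, which is an assumption of this sketch.
    """
    groups = {"frequent": [], "less frequent": [], "rare": []}
    ordered = sorted(occurrence.items(), key=lambda kv: (-kv[1], kv[0]))
    for symptom, n in ordered:
        group = groups[frequency_group(n)]
        if len(group) < k:
            group.append(symptom)
    return groups


# Hypothetical counts (only 'wiry pulse' in all 25 cases is reported in the text).
occurrence = {"wiry pulse": 25, "irritability": 14, "sighing": 8, "hiccup": 2}
groups = top_per_group(occurrence)
```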


To examine the variation in frequencies of TCM diagnoses among acupuncturists and with respect to demographic covariates, binary logistic regression was used. The number of analytical categories was reduced by merging the single patterns into the corresponding excess and deficiency patterns as shown in figure 2. The frequencies of the merged patterns were used for reliability analysis. Only the seven merged TCM patterns that the acupuncturist used on average more than five times were used for the analysis.
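The merging and selection steps might look as follows in Python. This is a sketch under assumptions: the `MERGE_MAP` entries are illustrative placeholders and do not reproduce figure 2, and counting a merged pattern at most once per case is a choice made here, not a rule stated in the paper.

```python
from collections import Counter

# Illustrative mapping only; the actual mapping is the one shown in figure 2.
MERGE_MAP = {
    "Liver Qi Stagnation": "Liver Excess",
    "Liver Blood Deficiency": "Liver Deficiency",
    "Spleen Qi Deficiency": "Spleen Deficiency",
    "Kidney Yang Deficiency": "Kidney Deficiency",
}


def merged_counts(diagnoses):
    """Count merged patterns per acupuncturist.

    diagnoses: {rater: [(case_id, single_pattern), ...]}
    Returns {rater: Counter of merged patterns}.
    """
    counts = {}
    for rater, items in diagnoses.items():
        # Assumption: a merged pattern counts once per case, even if two of
        # its single patterns were diagnosed on the same case.
        merged = {(case_id, MERGE_MAP[pattern]) for case_id, pattern in items}
        counts[rater] = Counter(m for _, m in merged)
    return counts


def frequently_used(counts, threshold=5):
    """Merged patterns used on average more than `threshold` times per rater."""
    raters = list(counts)
    patterns = {p for c in counts.values() for p in c}
    return sorted(
        p for p in patterns
        if sum(counts[r][p] for r in raters) / len(raters) > threshold
    )


# Hypothetical toy data with two raters.
diagnoses = {
    "Acu1": [(c, "Liver Qi Stagnation") for c in range(1, 9)]
            + [(c, "Spleen Qi Deficiency") for c in range(1, 4)],
    "Acu2": [(c, "Liver Qi Stagnation") for c in range(1, 5)]
            + [(c, "Spleen Qi Deficiency") for c in range(1, 3)],
}
counts = merged_counts(diagnoses)
selected = frequently_used(counts)
```

In the toy data the merged Liver pattern is used eight and four times (mean 6) and is kept, whereas the merged Spleen pattern (mean 2.5) falls below the threshold.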

Figure 2

The single patterns merged into the respective Excess and Deficiency patterns. The seven merged Traditional Chinese Medicine patterns included in the analyses are shown in the grey boxes.

To examine the inter-rater reliability (the level of agreement between the acupuncturists beyond that expected by chance), the multi-rater κ statistic was used. κ values <0.20 were considered as poor agreement, 0.21–0.40 as fair, 0.41–0.60 as moderate and 0.61–0.80 as good. Values of 0.81–1.00 were regarded as very good agreement.17 The calculations were performed in SPSS V.18.0 for Windows.
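For readers who want to see how a chance-corrected agreement measure for several raters can be computed, the following Python sketch implements Fleiss' κ together with the interpretation bands listed above. It is an assumption that Fleiss' formulation corresponds to the multiple κ reported here; the SPSS procedure actually used is not described, so this is an illustration rather than a reproduction of the analysis. Treating a value of exactly 0.20 as poor is also an assumption, because the bands above leave that value unassigned.

```python
def fleiss_kappa(table):
    """Fleiss' kappa for a subjects-by-categories count table.

    table[i][j] = number of raters who assigned subject i to category j.
    Every row must sum to the same number of raters n (n >= 2).
    Undefined (division by zero) if all ratings fall in a single category.
    """
    n_subjects = len(table)
    n_raters = sum(table[0])
    if any(sum(row) != n_raters for row in table):
        raise ValueError("every subject must be rated by the same number of raters")
    n_categories = len(table[0])
    # Overall proportion of ratings falling in each category.
    p_j = [
        sum(row[j] for row in table) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement for each subject.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ]
    p_bar = sum(p_i) / n_subjects      # mean observed agreement
    p_e = sum(p * p for p in p_j)      # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)


def agreement_label(kappa):
    """Map a kappa value onto the agreement bands used in this paper."""
    if kappa <= 0.20:
        return "poor"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "good"
    return "very good"


# Hypothetical example: eight raters, columns = (pattern given, pattern not given).
perfect = [[8, 0], [0, 8], [8, 0], [0, 8]]
split = [[4, 4], [4, 4], [4, 4], [4, 4]]
kappa_perfect = fleiss_kappa(perfect)
kappa_split = fleiss_kappa(split)
```

For one merged pattern, each of the 25 case histories would contribute one row giving the number of acupuncturists who did and did not diagnose that pattern.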

Variation in the use of symptoms and signs to diagnose a TCM pattern was examined by descriptive analysis.


All acupuncturists diagnosed all case histories. Acu2 reported that the lack of opportunity to interview the patient was a limitation. Lack of the biomedical diagnosis to facilitate the TCM diagnosis was reported as a limitation by Acu7 and Acu4. All reported Maciocia as a main source.

The acupuncturists diagnosed a total of 114 different single TCM patterns. Some of the patterns were only used once while others were used on almost all cases by all acupuncturists. After grouping the single patterns into the merged patterns, there was still a wide variation between the acupuncturists (table 1). The total number of TCM pattern diagnoses used by each of the acupuncturists varied from 63 to 203. Acu7 diagnosed 2–3 times as many patterns as the other acupuncturists.

Table 1

The 17 merged Traditional Chinese Medicine (TCM) pattern diagnoses set on 25 case histories by eight acupuncturists (Acu1 to Acu8), the frequencies, mean and SD

There was also a wide variation between the acupuncturists with respect to the use of each individual merged pattern. Five patterns varied in use from 0 to >10: Heart Deficiency, Blood Excess, Blood Deficiency, Lung Deficiency and Stomach Deficiency. One acupuncturist diagnosed Blood Deficiency and Stomach Deficiency on 11 cases whereas three acupuncturists did not use these patterns at all. Logistic regression analysis (table 2) showed that sex, duration of practice and education had a highly significant effect (p<0.001) on the use of merged patterns (ie, the total numbers of patterns summarised in table 1). The number of working hours was also significant (p=0.029).

Table 2

Logistic regression analysis of the total number of merged Traditional Chinese Medicine pattern diagnoses versus demographic variables

Age had no significant impact on the frequency of diagnoses, but the acupuncturists diagnosed fewer TCM patterns if they were women, had long practical experience or had longer working hours. An additional 10 years of practice implies a 65% reduction in the odds of diagnosing TCM patterns, whereas 10 more working hours per week implies a 16% reduction in the odds. Acupuncturists with a background in nursing or physiotherapy were more likely to diagnose TCM patterns than acupuncturists with BM education.
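The percentage reductions in odds quoted above relate to logistic regression coefficients through the odds ratio exp(β × units). The Python sketch below shows the conversion in both directions. The per-unit coefficients it produces are derived from the percentages in the text and are not values reported in table 2; the function names are hypothetical, and the calculation assumes the covariate enters the model linearly on the log-odds scale.

```python
import math


def coef_from_reduction(pct_reduction, units):
    """Per-unit log-odds coefficient implied by a % reduction in odds over `units`."""
    odds_ratio = 1 - pct_reduction / 100
    return math.log(odds_ratio) / units


def pct_change_in_odds(coef, units):
    """Percentage change in odds for an increase of `units` in the covariate."""
    return 100 * (math.exp(coef * units) - 1)


# 65% lower odds per 10 years of practice -> implied per-year coefficient.
beta_years = coef_from_reduction(65, 10)
# 16% lower odds per 10 working hours per week -> implied per-hour coefficient.
beta_hours = coef_from_reduction(16, 10)

per_year_change = pct_change_in_odds(beta_years, 1)
```

Under these assumptions, a 65% reduction over 10 years corresponds to an odds ratio of 0.35 per decade, or roughly a 10% reduction in odds for each additional year of practice.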

Agreement on diagnosis

Six merged TCM patterns in table 1 were used by all the acupuncturists: Liver Excess, Liver Deficiency, Spleen Deficiency, Kidney Deficiency, Damp Excess and Heat Excess. Except for Liver Deficiency and Heat Excess, these patterns appeared at least twice as often as the other patterns. They were on average used in more than half of the cases.

The inter-rater reliability test was performed on the seven merged TCM patterns that were diagnosed on average more than five times (table 3).

Table 3

Multiple κ and CI for the most used merged Traditional Chinese Medicine (TCM) patterns

There was poor agreement between the acupuncturists with regard to the use of all these merged patterns (κ<0.20). To explore factors that could contribute to the poor agreement, we selected a frequently used TCM pattern diagnosis and examined all the symptoms used to diagnose Liver Qi Stagnation.

Symptoms and signs used to diagnose Liver Qi Stagnation

Liver Qi Stagnation was frequently used by five acupuncturists on 21–24 cases (table 4). The remaining three acupuncturists used the pattern on 11–16 cases each.

Table 4

Symptoms used to diagnose Liver Qi Stagnation distributed into three groups: frequent, less frequent and rare symptoms (four symptoms in each group)

All cases were diagnosed as Liver Qi Stagnation by at least one acupuncturist. Four of the 25 cases were diagnosed by all eight acupuncturists while one case was diagnosed by only one acupuncturist. For the remaining cases there was a variation in the number of acupuncturists who gave the diagnosis (range 4–7).

Altogether, 179 different symptoms and signs, including description of the pulse and tongue, were used to describe the 25 case histories. Overall, the acupuncturists used 147 of these symptoms to diagnose Liver Qi Stagnation. For individual cases the number of symptoms ranged from 32 to 106.

There was a large variation in how the acupuncturists used symptoms and signs to diagnose Liver Qi Stagnation (table 4). In general, the acupuncturists used a symptom or sign to make the diagnosis only in some of the cases, even though it was present in other cases. Some acupuncturists never used a symptom to diagnose whereas others used it on all cases presenting with the symptom. For instance, Acu1 never used ‘red edge on tongue’ whereas Acu7 used it on all cases.

‘Wiry pulse’ was reported in all cases and was used to diagnose Liver Qi Stagnation in 32–84% of the cases (table 4). Even larger variability among the acupuncturists was seen for the other symptoms.

Acu7 used the most symptoms, ranging from 3 to 37 symptoms on a single case with an average of 15 symptoms per case. Acu4 used the fewest symptoms, ranging from 1 to 10 symptoms with an average of three symptoms per case. There seemed to be a rather consistent variation between the acupuncturists with regard to the number of symptoms used to make a diagnosis. Some acupuncturists used few symptoms whereas others seemed to use a larger number of symptoms.

The symptoms occurring less frequently were used to diagnose Liver Qi Stagnation from none to all cases with the symptom. A similar variability was observed for the rare symptoms. Hence, there seemed to be a wide variation among the acupuncturists in the use of symptoms for setting the diagnosis, irrespective of how common the symptom is.


Although acupuncturists with the same TCM background evaluated exactly the same information, there was extensive variation and poor inter-rater agreement on the merged TCM pattern diagnoses as well as for single patterns. Disagreement can occur in several stages of the diagnostic process. Variability in the data collection phase was eliminated, and the poor inter-rater reliability occurred in the understanding and interpretation of the symptoms and signs. The acupuncturists interpreted the symptoms and signs differently and turned the information into different diagnoses.

There was a large variation in how symptoms and signs were used to set a diagnosis. ‘Wiry pulse’ existed in all cases and is, according to Maciocia, a clinical manifestation of Liver Qi Stagnation,2 yet it was used to diagnose only 69% of the cases, indicating that symptoms are used inconsistently in setting a diagnosis. The study examined only how single symptoms were used to set a diagnosis and not how diagnoses were reinforced or rejected in the presence of other symptoms. Nevertheless, it is expected that acupuncturists should agree on the principles, consider the same symptoms and from that conclude with the same TCM patterns.

The variation in the use of symptoms and signs to make a diagnosis reflects an unpredictable diagnostic process, both between and within individual acupuncturists. Scheid found that personal interpretations of the textual sources were of importance. Practitioners used the same terms from the canonical literature but applied them differently.1 With a large influence of personal interpretation, the diagnostic process may be individual for each acupuncturist. In addition, there was considerable variability in how each acupuncturist used the symptoms and signs, suggesting that additional factors other than inherent personal differences also play a role. Hence, there seems to be a need for the development of guidelines to achieve a more reliable diagnostic process across acupuncturists.

The results showed that the diagnostic process was influenced by demographic factors. Acupuncturists with long clinical experience diagnosed fewer TCM patterns. Liu claims that experienced acupuncturists have a more in-depth understanding of TCM theory and better communication skills, which will ensure a correct diagnosis.18 Although the influence of communication skills was eliminated in the present study, we still found variation in diagnoses. The acupuncturists did not agree as to how symptoms should be interpreted into TCM patterns. Our sample size was too small to examine whether different educational backgrounds affected the interpretation of symptoms and signs.

The use of case histories without the possibility for the acupuncturist to gain additional information restricts the interpretation of the present study. However, it provides data on a wide range of interpretations of given sets of symptoms, and supports our previous finding that two acupuncturists examining the same patient at the same time ended up with a substantial variation in TCM diagnoses.

The variability reported in the present study is a challenge for the application of the TCM principles.5–10 Some attempts to improve the agreement on TCM patterns, such as guided training, have been tested.19 However, the effects have been meagre, with slight to substantial improvement on common diagnoses and a lack of improvement on less common diagnoses.19 This indicates that there is still a need for further development of the TCM diagnostic process.


There is extensive variation and poor multi-rater reliability in TCM diagnosis among acupuncturists. Demographic variables of acupuncturists influenced the frequency of diagnosis. Symptoms were used differently and inconsistently to set diagnoses. This variability in the diagnostic process is a threat to the aim of reliable diagnosis as a basis for individually tailored treatment.

Summary points

  • Traditional Chinese Medicine diagnosis involves identifying symptoms and signs, then interpreting them.

  • We asked eight acupuncturists to interpret 25 case histories.

  • Their diagnoses were varied and inter-rater reliability was poor.

  • There was wide variability in how clinical information was used to arrive at the diagnosis.


The authors acknowledge the participating acupuncturists from the Norwegian Acupuncture Association and the women for providing study data.


