In 1995 at a workshop held at Scripps (Workshop for Quality Control of WOCE Upper Ocean Thermal Data, WOCE Report No. 133/95), the participants agreed to carry out an intercomparison of quality control procedures. The goal of the intercomparison was to examine the value added to a data set by scientific quality control. CSIRO volunteered to assemble some data and to distribute them to others. In total, four cruises of XBT data were made available. MEDS, NODC and CSIRO took part in the intercomparison. The data were from waters near Australia where CSIRO was most knowledgeable and MEDS and NODC were less so. The cruises were made up of 88 stations. All profiles contained temperatures measured to 1250m. In all there were over 88,000 temperature observations. A picture of the cruise tracks is shown in figure 1.
Figure 1: Cruise tracks (as black dots) of data used in the intercomparison. Note that two cruises (SR9545S and SR9547S) are off the west coast of Australia. Cruise IP97S is the one coming south from Japan, and MA22N is from India to the Persian Gulf.
As part of the quality control, the position and time of every station was checked to see that they were consistent in the context of other stations in the cruise. All of the stations passed these tests at all three centres.
The quality control procedures at CSIRO use software written in-house. The operator of the software works with scientists who are very knowledgeable about the regional oceanography to validate the data. The data may be examined in a number of different ways and in the end the operator sets the quality control flags manually. Reasons for assigning flags and general data processing information are documented in CSIRO Marine Laboratories Report 221, Quality Control Cookbook for XBT data, 1994. One capability of the software is the display of profiles all along the cruise track in a waterfall plot. This is used to look for common features in adjacent profiles to help assess whether a feature is unusual. Other capabilities allow overlaying of profiles from the same cruise, from surrounding cruises, and from historical archives. The operator and scientists assess all of this information to decide what quality flags are appropriate. Figures 2 to 4 show waterfall plots in which the individual profiles shown later are to be found.
The quality control software used by MEDS and NODC is nearly identical but differs from that used at CSIRO. The software was built in support of the Global Temperature Salinity Profile Project, GTSPP. The version used at NODC is described in IOC Manuals and Guides #22, GTSPP Real-time Quality Control Manual, 1990. The MEDS version of this software adds two further tests: one compares the depths of observations to a bathymetry file, and one detects certain forms of temperature inversions. At present there is no facility to display waterfall plots. Operators view profiles one at a time, although they may move forward and backward along a cruise track to look at adjacent profiles.
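The two extra MEDS tests can be sketched in minimal form. The function names, the 0.5 degree threshold, and the idea of passing a single water depth (rather than reading a bathymetry file) are illustrative assumptions, not the actual GTSPP/MEDS implementation:

```python
def below_bottom(obs_depths, water_depth):
    """Flag observation depths deeper than the local water depth.

    Illustrative only: the real MEDS test looks up the water depth in a
    bathymetry file; here it is supplied directly.
    """
    return [d > water_depth for d in obs_depths]


def inversion_suspect(temps, max_inversion=0.5):
    """Crude stand-in for the temperature-inversion test: true when any
    step down the profile warms by more than max_inversion degrees C
    (the real test's form and threshold are not given in this report)."""
    return any(t2 - t1 > max_inversion for t1, t2 in zip(temps, temps[1:]))
```

For example, a 1250m profile whose deepest points lie below a 1000m bathymetry value would have those points flagged as suspect.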
The GTSPP software is designed to be used by an operator inspecting the profiles. This is how MEDS used the software in its evaluation. However, it is possible to pass the data through the software in automatic mode such that no visual inspection is done. This is how NODC used the software. In this mode, most data which fail the suite of tests are assigned a flag of 'probably bad'. The only data flagged automatically as 'bad' are those which fail the impossible-value range checks. An example of this is shown later. The automatic tests are not capable of detecting small spikes in a profile, such as the insulation penetration problems described in the CSIRO QC Cookbook. Likewise, they cannot detect a spike at all if it spans more than one point. The automatic procedures also cannot detect wire break or wire stretch signatures in the profile.
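The multi-point spike limitation can be made concrete with a sketch. A hypothetical single-point spike test (the name and the 2.0 degree threshold are illustrative, not the actual GTSPP test parameters) flags a point that departs sharply from both neighbours, and is therefore blind to a spike that spans two points:

```python
def single_point_spike(t_prev, t, t_next, threshold=2.0):
    """Hypothetical one-point spike test: the centre value departs from
    both neighbours while the neighbours agree with each other."""
    return (abs(t - t_prev) > threshold
            and abs(t - t_next) > threshold
            and abs(t_next - t_prev) < threshold)


def any_spike(profile):
    """Apply the one-point test at every interior point of a profile."""
    return any(single_point_spike(profile[i - 1], profile[i], profile[i + 1])
               for i in range(1, len(profile) - 1))


# A one-point spike is caught, but a two-point spike slips through,
# because each spike point has one neighbour that agrees with it.
one_point = [10.0, 10.0, 20.0, 10.0, 10.0]
two_point = [10.0, 10.0, 20.0, 20.0, 10.0, 10.0]
```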
The quality flagging scheme used by the three centres was essentially the same. All depth and temperature observations receive a flag. The flagging convention assigns a one-character code to indicate the quality of the value. The character '0' means no QC was done, '1' means the value appears good, '3' means the value is probably bad, '4' means the value is bad and '5' means the value has been changed. The centres differ only in their use of '2': CSIRO uses it to mean probably good, NODC uses it to mark values which lie outside the 1982 Levitus climatology, and MEDS does not use it at all.
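As a quick reference, the flag codes can be written as a lookup table. This is a sketch for the reader; the dictionary and function names are ours, not part of any centre's software:

```python
# One-character QC codes as used in the intercomparison.
QC_FLAG_MEANING = {
    "0": "no QC done",
    "1": "appears good",
    "2": "probably good (CSIRO) / outside 1982 Levitus climatology (NODC)",
    "3": "probably bad",
    "4": "bad",
    "5": "value has been changed",
}


def describe_flag(code):
    """Return the meaning of a one-character QC flag code."""
    return QC_FLAG_MEANING.get(code, "unknown code")
```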
CSIRO set no flags on the depths. They also made extensive use of the 'probably good' flag; the majority of observations were assigned this flag. The CSIRO Cookbook defines 'probably good' to be good data in which some features (probably real) are present but are unconfirmed. 'Probably good' data are also data in which minor malfunctions may be present but these errors are small and/or can be successfully corrected without seriously affecting the overall quality of the data. Due to the nature of the XBT, signal leakage can manifest itself as apparent fine structure. If such structure is apparent in a profile, there is no certainty that it is real unless confirmed by another drop. In the waters in which CSIRO collects data (i.e. many archipelagoes, etc.) fine structure probably does exist in a large percentage of the data. CSIRO uses the 'probably good' flag as a way to distinguish real from erroneous fine structure.
All observations within 3.7m of the surface were considered wrong by CSIRO. All of these values were changed to 99.99 and a flag of 'changed' was set. The original values at these depths are stored elsewhere in the data record. Neither MEDS nor NODC did this for the intercomparison.

General Description of the Results
The first step in the evaluation was to count the number of occurrences where flags set by one centre agreed or disagreed with those set by the others. This exercise was carried out for flags set against depth as well as temperature values. No results will be discussed for flags set against depth since CSIRO set no flags on depth.
As mentioned above, CSIRO uses the flag 'probably good' extensively. A strict comparison of flags set by the three centres showed substantial differences, but overwhelmingly these were the result of MEDS/NODC assigning a flag of 'good' when CSIRO assigned 'probably good'. There were more than 64,000 observations flagged as 'probably good' by CSIRO that MEDS flagged as 'good'. Examples of this are shown later. In addition, the values changed by CSIRO in the top 3.7m caused differences at every station compared to MEDS/NODC. This affected the first 5 observations in every profile.
Because the difference between CSIRO and MEDS/NODC flagging is dominated by CSIRO's use of 'probably good', flags of 'good' and 'probably good' were treated as equivalent. This comparison will be somewhat biased when comparing flags set by NODC to those set by CSIRO, but because the number of climatology failures (for which NODC sets a flag of 'probably good') is small, the comparison is not seriously affected.
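This comparison rule amounts to collapsing the two codes into one class before counting differences, which might be sketched as follows (names are illustrative):

```python
# Collapse '1' (good) and '2' (probably good) into one class before comparing.
GOOD_CLASS = {"1", "2"}


def flags_differ(flag_a, flag_b):
    """True when two flags disagree after equating good and probably good."""
    if flag_a in GOOD_CLASS and flag_b in GOOD_CLASS:
        return False
    return flag_a != flag_b
```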
Differences between CSIRO, MEDS and NODC were examined in two ways, using the flags set by CSIRO as the basis of comparison. The first comparison counts the number of stations at which there were differences from CSIRO. This count excludes the consistent differences in the top 3.7m and equates flags of 'good' and 'probably good'. MEDS showed at least one flag different from CSIRO at 32 of the 88 stations, while NODC showed differences at 17 stations. Counting only stations at which more than 100 depths were flagged differently, MEDS showed differences at 22 stations and NODC at 11. Counting the number of individual observations flagged differently, with the same exclusions as before, gives similar results: MEDS tended to flag about 12% and NODC about 5% of the observations differently from CSIRO.
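The two counts described above (stations with at least one difference, and individual observations flagged differently) could be computed along these lines. The data layout, the handling of the 3.7m cutoff, and all names are illustrative assumptions, not the procedure actually used by the centres:

```python
def count_differences(cruise_a, cruise_b, surface_cutoff=3.7):
    """Count stations and observations where two centres' flags disagree,
    skipping the top 3.7m and equating '1' (good) with '2' (probably good).

    Each cruise is a list of profiles; each profile a list of
    (depth, flag) pairs. This layout is assumed for illustration.
    """
    def same(a, b):
        return a == b or {a, b} <= {"1", "2"}

    diff_stations = diff_obs = 0
    for prof_a, prof_b in zip(cruise_a, cruise_b):
        n = sum(1 for (d, fa), (_, fb) in zip(prof_a, prof_b)
                if d > surface_cutoff and not same(fa, fb))
        diff_obs += n
        diff_stations += n > 0
    return diff_stations, diff_obs
```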
The difference in the number of stations flagged differently is significant, while the difference is smaller when individual observations are counted. Some explanations for these results are examined later.
It was clear that automatic tests are not capable of detecting the more subtle problems found by visual inspection. This is not a surprise. Some of the failings were discussed above. As a result of this intercomparison, more work can be carried out to improve the tests, always keeping in mind that they are intended to be used in conjunction with visual inspection.

Cruise by Cruise Differences
MA22N: There are 13 stations. CSIRO mostly set flags of 'probably good'. They changed 58 observations, all of them in the top 3.7m. NODC set no flags other than 'good'. MEDS flagged 1086 observations as 'probably bad' where CSIRO set them to 'probably good'; this occurred at one station only. MEDS also set flags of 'probably bad' and 'bad' on 7 of the surface values that CSIRO changed. So the overall difference between MEDS and CSIRO for this cruise was at one station at depth and at all surface values.
There were no differences between NODC and CSIRO except for the surface changes.
SR9547S: There are 23 stations. CSIRO set a flag of 'probably bad' on 3444 observations and 122 values at the surface were changed. Of the observations flagged as 'probably bad' by CSIRO, MEDS flagged 2259 as 'bad' and the rest as 'good'. MEDS set no flags of 'probably bad'. MEDS also flagged 556 observations as 'bad' that CSIRO flagged as 'good' or 'probably good'. Six of the values changed by CSIRO were flagged as 'bad' by MEDS.
NODC flagged 111 observations as 'probably bad'. Of the 3444 observations flagged as 'probably bad' by CSIRO, 531 were flagged as 'probably good' (failed climatology) by NODC.
IP97S: There are 31 stations. CSIRO set a flag of 'probably bad' on 3326 observations and 150 surface values were changed. MEDS flagged 7134 observations as 'probably bad' or 'bad'; of these, 3154 were flagged as 'bad' by MEDS and 'probably bad' by CSIRO. MEDS also flagged 2156 observations as 'probably bad' or 'bad' which CSIRO flagged as 'good' or 'probably good'. MEDS also flagged 9 of the changed surface values as 'bad'.
NODC flagged 777 observations as 'probably good', 'probably bad' or 'bad' (of the 3326 flagged by CSIRO). There were differences at only 3 stations between NODC and CSIRO (apart from the changed values at the surface).
SR9545S: There are 21 stations. CSIRO set a flag of 'probably bad' on 1643 observations and 1118 were flagged as 'bad'. MEDS flagged 2750 observations as 'bad' (all of the same observations that CSIRO flagged as 'probably bad' or 'bad'). In addition MEDS flagged 1776 as 'bad' that CSIRO indicated as 'probably good'. CSIRO changed 105 surface values and MEDS flagged 6 of these as 'bad'.
NODC flagged 1111 observations as 'bad', nearly the same number that CSIRO flagged as 'bad'. However, no observations received a flag of 'probably bad' from NODC. There were differences at only 3 stations between NODC and CSIRO (apart from the changed values at the surface).

Detailed Comparisons
Before examining individual profiles, it is useful to look at waterfall plots for the cruises separately. A selection of these is shown in figures 2-4 where individual profiles shown later are identified. When looking at individual profiles and differences between flags assigned by CSIRO, MEDS and NODC, these figures should be kept in mind.
Figure 2: A waterfall plot for the entire cruise MA22N from north to south. Profiles marked F5 and F6 are shown later in figures 5 and 6.
Figure 3: A waterfall plot for the northern portion of cruise IP97S from north to south. The profile marked F7 is shown later in figure 7.
Figure 4: A waterfall plot for the southern portion of cruise IP97S from north to south. The profile marked F8 is shown later in figure 8.
It is straightforward to document the differences between flags set as a result of automatic testing, visual inspection by MEDS and scientific QC done by CSIRO. The attached figures show examples of the major differences to illustrate the points. No attempt is made to pass judgment on the suitability of the flag assignment. This topic has been debated at the WOCE UOT DAC meetings for some time. The conclusion reached there is that agreement should be possible on what is flagged as 'good' or 'probably good' in contrast to what is 'probably bad' or 'bad'.
It is difficult to make generalizations about the differences between the flags set by a data centre such as MEDS and those set by a science centre such as CSIRO. However, it appears that MEDS is more likely to label observations as 'bad' or 'probably bad' than is CSIRO. In some profiles, observations labeled as 'bad' by MEDS occur at shallower depths in the profile than observations labeled as 'good'. CSIRO considers it unlikely that observations collected by an XBT can recover from a failure and so deeper observations always acquire a quality flag at least as bad as the observations above.
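The CSIRO principle that an XBT probe cannot recover after a failure implies that flags should never improve with depth. A minimal sketch of that constraint follows; the severity ordering and all names are our assumptions for illustration, not CSIRO's software:

```python
# Assumed severity order: good < probably good < probably bad < bad.
SEVERITY = {"1": 0, "2": 1, "3": 2, "4": 3}


def propagate_down(flags):
    """Make each flag at least as bad as every flag above it in the profile."""
    worst = "1"
    out = []
    for f in flags:
        if SEVERITY[f] > SEVERITY[worst]:
            worst = f
        out.append(worst)
    return out
```

Under this rule, a 'good' value lying below a 'bad' one would itself be downgraded to 'bad'.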
The accompanying figures (5 to 9) illustrate some of the differences in flags set by a data centre and a science centre. A description of the features of note is included with the figures. The titles contain the cruise identifier prefixed by a two-character field to identify the originator of the flags. An identifier of CS is used for CSIRO, ME for MEDS and NO for NODC. This is followed by the time and date, and the position of the station.
Every figure showing flags set by CSIRO shows a solid black line at the top of the profile. These are values set by CSIRO to 99.99 for the top 3.7m of every profile. The original values are stored elsewhere in the data record. These values are all marked as 'changed'. In the figures from other centres, the temperature scale is altered since it does not have to deal with as large a range of values.
In every profile flagged by CSIRO there are 2 lines that bracket the values in the profile. These are the plus and minus 3 standard deviation limits defined in the 1982 Levitus climatology. They appear whenever a value in the profile exceeds these limits. Since CSIRO sets surface values to 99.99, which always exceeds the limits, these limits are plotted on every CSIRO figure.

Conclusions
While automatic tests are able to pick up some obvious failings in profiles, the suite of tests employed was not able to detect subtle errors and even some obvious ones. The most obvious failing occurred when a spike in a profile was composed of more than a single point. The automatic tests do not detect a problem, while visual inspection does. The conclusion is that the current suite of tests needs to be improved and that visual inspection is still important in detecting problems in data.
Deriving conclusions from the flagging differences is difficult because the results are so mixed. NODC used only automatic tests, and yet whether counts are made of stations with differences or of individual observations with differences, the counts are lower for NODC than for MEDS when compared to CSIRO. This could be because the data were generally good and the automatic tests tend to fail only on obvious problems. MEDS' visual inspection more often flagged fine-scale features as suspect that CSIRO judged to be real. This argues that knowledge of the regions where the data are collected is valuable in deciding on the validity of data.
There was a difference in when data should be considered 'good' or 'probably good'. CSIRO employs the definition described earlier. NODC used the 'probably good' flag only to indicate when observations fell outside of climatology (this is how the original GTSPP software was configured; MEDS made changes to its version while NODC had not done so at the time of the intercomparison). MEDS does not use the 'probably good' flag at all. There was also a difference in what was considered 'bad' and 'probably bad'. A resolution of these differences is needed to help users interpret the flags attached to the data.
CSIRO consistently flags data from the top 3.7m as bad/changed. Neither MEDS nor NODC did this. This matter needs to be resolved once and for all.
Figure 5: This is a typical example (applicable to about 51% of all stations) of the most common difference between the quality control applied by CSIRO and that applied by MEDS and NODC. Part a shows the flags assigned by CSIRO. The observations in the top 3.7m are changed and below this all other temperatures have been flagged as 'probably good'. Part b shows the same profile with the flags assigned by MEDS (and NODC). In this case the entire profile was marked as 'good'.
Figure 6: These plots show a profile which CSIRO flagged as 'probably good' (part a), that MEDS flagged as 'probably bad' (part b) and that the automatic tests considered 'good' (part c). Examples such as this occurred in roughly 19% of stations.
Figure 7: Part a shows a profile, a portion of which CSIRO flagged as 'probably good'. In contrast, MEDS (part b) flagged the same portion as 'bad'. Examples such as this occurred in roughly 12% of stations.
Figure 8: These plots show the differences between CSIRO, MEDS and NODC handling of a profile which has serious problems. Examples such as this occurred in only 3% of stations. Part a shows the profile with flags set by CSIRO; all points are flagged as 'probably bad'. Part b shows the same profile with all flags set to 'bad' by MEDS except for the top 80m. Part c shows the flags assigned by the automatic QC done at NODC. The upper portion of the profile is generally flagged as 'good' even though there are obvious spikes. Values which exceed climatology received flags of 'probably good'. In the lower part, observations are often assigned a flag of 'probably bad' or 'bad'. Flags of 'bad' were assigned when the profile exceeded a global range check.
Figure 9: The profile shown in this figure was flagged as 'probably good' over its length by CSIRO, except at the surface where values were changed. MEDS assigned 'good' to the entire profile. Examples such as this occurred in roughly 15% of stations.