Methodology for the Identification of Duplicate Reports in the Real-time Data Flow

This document describes the method used in the GTSPP Real Time Assembly and QC Centre in the Marine Environmental Data Service (MEDS) in Canada to identify duplicate reports in the BATHY and TESAC data streams coming from the GTS. The data are received from multiple sources in order to obtain the most complete data set possible. As a result, the same report is often received more than once, and the duplicates must be identified and removed. This process must be automated because of the volume of data.

A potential duplication between two observations is determined by applying two tests. The first test is a comparison of the platform identification (in this case, the ship call sign), date, and time. If these fields are the same, the two reports are considered to be potential duplicates. The second test is a comparison of the time of the observation and the geographical position. If the times are within 15 minutes and the positions are within 5 km, the two reports are considered to be potential duplicates. A pair that satisfies either test is treated as a potential duplicate.
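The two tests can be sketched in a few lines of Python. This is an illustration only, not the MEDS FORTRAN code: the Station record, the function names, and the great-circle (haversine) distance on a sphere of radius 6371 km are all assumptions, since the text does not say how the 5 km separation is computed.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta

TIME_WINDOW = timedelta(minutes=15)   # from the text
DISTANCE_WINDOW_KM = 5.0              # from the text
EARTH_RADIUS_KM = 6371.0              # assumed spherical Earth


@dataclass
class Station:
    """Hypothetical header record; field names are illustrative."""
    msg_no: int
    call_sign: str
    when: datetime
    lat: float
    lon: float


def distance_km(a: Station, b: Station) -> float:
    """Great-circle (haversine) distance between two stations, in km."""
    phi1, phi2 = math.radians(a.lat), math.radians(b.lat)
    dphi = phi2 - phi1
    dlam = math.radians(b.lon - a.lon)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def same_platform_and_time(a: Station, b: Station) -> bool:
    """First test: identical call sign, date, and time."""
    return a.call_sign == b.call_sign and a.when == b.when


def close_in_time_and_space(a: Station, b: Station) -> bool:
    """Second test: within 15 minutes and within 5 km."""
    return (abs(a.when - b.when) <= TIME_WINDOW
            and distance_km(a, b) <= DISTANCE_WINDOW_KM)


def is_potential_duplicate(a: Station, b: Station) -> bool:
    """A pair is a potential duplicate if either test is satisfied."""
    return same_platform_and_time(a, b) or close_in_time_and_space(a, b)


# Two reports with different call signs, the same time, and positions
# 0.01 degree of longitude apart.
t = datetime(1990, 11, 15, 2, 28)
first = Station(1, "JCC", t, 25.50, -123.27)
fifth = Station(5, "JCCX", t, 25.50, -123.26)
flagged = is_potential_duplicate(first, fifth)
```

Here the call signs differ, so the first test fails, but the pair is still flagged because the second test is satisfied.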

In order to perform these tests for duplicate observations on a file of ocean station data, it is necessary to bring platform identification, date-time, and position information on potential duplicates together in computer memory at the same time. To accomplish this the input data must be arranged in some convenient order so that potential duplicates will not be separated by several thousand observations in the input file. In the system implemented in the GTSPP Real Time Assembly and QC Centre, the data are first ordered by a date-time sort.

Processing then begins by reading (with a FORTRAN program) the first station in the input file and subsequent stations forward until all stations which have an observation time within 15 minutes of the first station have been read. As each station is read, several fields are extracted and stored in a series of "ring buffer arrays" in named common storage. The fields that are extracted are declared as follows.

C     Common storage area and declarations for ring buffer used in the
C     duplicates identification
C
      COMMON/RING/KMSG_NO(500),KCALL(500),KDATE(500),
     + KTIME(500),KLAT(500),KLONG(500)
C
       CHARACTER*10 KCALL,KDATE*8,KTIME*4
C
       REAL*4 KLAT,KLONG
C

KMSG_NO is a message sequence number that counts up from 1, starting with the first message in the input file. Its use will be discussed below.

KCALL is the platform identification.
KDATE is the date of the observation.
KTIME is the time of the observation.
KLAT is the latitude.
KLONG is the longitude.

These arrays have been termed "ring buffer arrays" because of the manner in which they are used in the computer program. When the first station is read, its information is stored in array location 1. As subsequent stations are read up to the end of the 15 minute window, the data are stored in subsequent locations in the arrays. Once the end of the 15 minute window is reached, the first location in the array is declared to be the "target" observation and processing begins. Processing consists of comparing the information of each of the subsequent observations with that of the target, as in the two tests described above. If either of the tests indicates a potential duplication between the two observations, then the value of KMSG_NO for that message is added to a list of observations that are potential duplicates of the target observation. This process results in a list of potential duplicates within a 15 minute window forward of the target observation.
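The forward-looking window can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the FORTRAN program: a deque stands in for the 500-element ring buffer arrays, stations are hypothetical dictionaries with "msg_no" and "when" keys, the input is assumed to be already sorted by date-time, and the pair test is passed in as a function.

```python
from collections import deque
from datetime import datetime, timedelta

TIME_WINDOW = timedelta(minutes=15)


def forward_window_scan(stations, is_potential_duplicate):
    """Yield (target msg_no, [msg_no of later potential duplicates]).

    `stations` must be sorted by observation time.
    """
    window = deque()                 # stand-in for the ring buffer arrays
    source = iter(stations)
    pending = next(source, None)     # next station not yet in the buffer
    while window or pending is not None:
        if not window:
            window.append(pending)
            pending = next(source, None)
        target = window[0]
        # Read forward until the next station is beyond the 15 minute window.
        while (pending is not None
               and pending["when"] - target["when"] <= TIME_WINDOW):
            window.append(pending)
            pending = next(source, None)
        window.popleft()             # the target leaves the buffer
        matches = [s["msg_no"] for s in window
                   if is_potential_duplicate(target, s)]
        yield target["msg_no"], matches


# Four stations in time order; the predicate is a stand-in that only
# compares call signs.
t0 = datetime(1990, 11, 15, 2, 28)
example = [
    {"msg_no": 1, "call": "A", "when": t0},
    {"msg_no": 2, "call": "A", "when": t0 + timedelta(minutes=2)},
    {"msg_no": 3, "call": "B", "when": t0 + timedelta(minutes=12)},
    {"msg_no": 4, "call": "B", "when": t0 + timedelta(minutes=22)},
]
result = list(forward_window_scan(example, lambda a, b: a["call"] == b["call"]))
```

With this input, message 2 is listed against target 1 and message 4 against target 3. Message 4 is never compared with target 1 because it lies more than 15 minutes after it.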

Once the list of potential duplicates has been established for the 15 minute window, processing continues with a second look at each observation with respect to the target observation. The purpose of this second look is to decide whether the "potential" duplicates identified in the first pass are in fact real duplicates. It can be considered an analysis for the purpose of removing entries from the duplicates list. If, for example, the target and the potential duplicate are within the fuzzy area-fuzzy time window but the subsurface information is completely different for the two stations, then the observations are not duplicates and the subsequent observation should be removed from the list. The algorithm always identifies duplicates of the target message that are at the same time as, or later than, the target message. Any duplications between a message and one within the previous 15 minute window will already have been found when the previous message was the target.

Two tests are used to remove potential duplicates from the list. One test, as already stated, is based on an examination of the subsurface data. The other test is a second look at the position data. If the potential duplicates have been identified through a coincidence of the platform-date-time information and the positions are outside the 5 km window, then the subsequent observation is removed from the duplicates list. This case occurs, for example, when the call sign SHIP is assigned to data from different ships and there is a coincidence in time between the observations.
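The second look at position amounts to a simple filter on the duplicates list. The sketch below is an assumption-laden illustration in Python: the record layout and function names are hypothetical, and the haversine distance on a sphere of radius 6371 km is an assumed way of measuring the 5 km separation. Candidates that entered the list through the fuzzy time-fuzzy area test are already within 5 km, so filtering the whole list has the same effect as filtering only the platform-date-time matches.

```python
import math

DISTANCE_WINDOW_KM = 5.0   # from the text
EARTH_RADIUS_KM = 6371.0   # assumed spherical Earth


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in km between two positions."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def drop_distant_candidates(target, candidates):
    """Keep only the candidates that lie within 5 km of the target."""
    return [c for c in candidates
            if distance_km(target["lat"], target["lon"],
                           c["lat"], c["lon"]) <= DISTANCE_WINDOW_KM]


# Three reports sharing the generic call sign SHIP and the same time.
target = {"msg_no": 1, "call": "SHIP", "lat": 25.50, "lon": -123.27}
candidates = [
    {"msg_no": 2, "call": "SHIP", "lat": 25.50, "lon": -123.26},
    {"msg_no": 3, "call": "SHIP", "lat": 40.00, "lon": -30.00},
]
kept = drop_distant_candidates(target, candidates)
```

Only message 2 survives the filter. Message 3 shares the call sign and time but comes from a different part of the ocean, so it is removed from the list.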

The subsurface test is conducted between pairs of observations. This test is difficult for two reasons. First, the profiles for the two observations must be brought together in computer memory for the comparison. Second, a subsurface parameter test that determines whether one profile is a duplicate of another is not simple to define.

Bringing the two profiles together in computer memory for the subsurface test, however, is only difficult when the input file is being processed as a sequential file and it is necessary to backspace and reread information. If the file is organized as a random access file the process is relatively simple. The input file to the duplicates identification system has therefore been organized as a random access file using the Indexed Sequential Access Method (ISAM) supported by the DEC Alpha OpenVMS operating system. The ISAM file organization is based on relatively simple access keys. A limitation on the method is that the file can only be open for use of one key at a time. This, however, has not proved to be a factor in the use of ISAMs in managing ocean data in MEDS and the various processing systems that have been built here use them extensively.

In creating the input file for the duplicates identification procedure, the variable KMSG_NO has been assigned as the ISAM key. This variable is included in the "ring buffer" for the list of duplicates and is used as the key to retrieve a pair of observations with their subsurface profiles for comparison.
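The role of KMSG_NO as a retrieval key can be illustrated with a keyed lookup. In this Python sketch a dictionary stands in for the ISAM file, and the record contents and function names are hypothetical. It shows only how the duplicates list drives the retrieval of pairs, not how the ISAM file itself is read.

```python
def load_pair(store, target_no, candidate_no):
    """Fetch the target and one candidate by message number."""
    return store[target_no], store[candidate_no]


def pairs_to_compare(store, target_no, duplicate_list):
    """Yield (target, candidate) records for every entry in the list."""
    for candidate_no in duplicate_list:
        yield load_pair(store, target_no, candidate_no)


# Stand-in for the keyed input file: message number -> full record.
store = {
    1: {"call": "JCC", "levels": [(0.0, 25.70), (42.0, 25.70)]},
    2: {"call": "JCCX", "levels": [(0.0, 25.70), (42.0, 25.80)]},
    3: {"call": "JCCX", "levels": [(0.0, 25.70), (42.0, 25.80)]},
}
pairs = list(pairs_to_compare(store, 1, [2, 3]))
```

Because each record is fetched directly by key, there is no need to backspace and reread a sequential file to bring two profiles together.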

Once the pair of observations has been loaded along with their subsurface profiles, several possibilities must be examined. These possibilities are based on the following considerations concerning use of the algorithm. First of all, the algorithm has been designed to be used in identifying duplications between the real time version of an observation and the delayed mode, fully processed version of an observation. In this case it is likely that all depths of observation for the two profiles will be different, and that all observations of temperature and salinity will be different, at least in the first or second decimal place. Thus a straight comparison of depths and values of the variables at the depths cannot be used to say that the observations are not duplicates. On the other hand, if all the depths are the same and all the variables at each depth are the same for two profiles already determined to be potential duplicates, it can be assumed that they are duplicates.

The second consideration is that the computer knows whether an observation came from the real time stream or the delayed mode stream, since each observation carries with it a variable that identifies the stream from which it came. The computer can use this stream identification to determine whether the subsurface information from two observations should be the same or not. If the two messages came from different types of streams, then one would not expect the subsurface information to be the same. If they came from the same type of stream, the subsurface information should be the same.

Operation of the duplicates identification algorithm at this point reduces to one of three possibilities.

a) The two observations are from the same stream type and the subsurface values for depth, temperature, and salinity are sufficiently the same that the observations can be assumed to be duplicates.

b) The two observations are from the same stream type and the subsurface values for depth, temperature, and salinity are sufficiently different that the observations can be assumed not to be duplicates.

c) Neither of the above is true and the decision must be referred to an operator.

In case a), the observations are considered to be sufficiently the same if:

- both have the same type and number of variables observed as a function of depth (for example both have temperature only, or both have temperature and salinity),

- both have identical values for depths and the variables observed at each depth.

In case b), the observations are considered to be sufficiently different if:

- the depth ranges of the two observations do not overlap by at least 99%,

- the number of depths at which temperature and/or salinity are observed is different,

- more than 80% of the subsurface levels for which data are reported have a different depth, a different temperature, or a different salinity.

In case c), which occurs when the two observations come from different stream types or when neither case a) nor case b) is satisfied, the decision is referred to an operator by printing information on the complete group of duplicates contained in the potential duplicates list. A technician then reviews the listing and makes the appropriate decision. The duplicates identification system runs as a batch job. Once the technician has reviewed the list, he or she runs an interactive job on a workstation that applies the final decisions to the ISAM file used as input to the original batch run.
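The three-way decision can be sketched in Python as below. This is a hedged illustration, not the MEDS code. Several points are assumptions because the text does not fix them: the Profile record and stream type labels are hypothetical, any one of the case b) conditions is taken to be sufficient, depth overlap is measured as the shared depth range divided by the longer of the two ranges, and levels are compared position by position.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# One level: (depth, temperature, salinity); None marks a variable not observed.
Level = Tuple[float, Optional[float], Optional[float]]


@dataclass
class Profile:
    stream_type: str      # assumed labels, e.g. "real-time" or "delayed"
    levels: List[Level]


def observed_counts(p: Profile) -> Tuple[int, int]:
    """Number of levels carrying a temperature and a salinity value."""
    return (sum(lv[1] is not None for lv in p.levels),
            sum(lv[2] is not None for lv in p.levels))


def depth_overlap_fraction(a: Profile, b: Profile) -> float:
    """Shared depth range divided by the longer range (assumed measure)."""
    a_top, a_bot = min(lv[0] for lv in a.levels), max(lv[0] for lv in a.levels)
    b_top, b_bot = min(lv[0] for lv in b.levels), max(lv[0] for lv in b.levels)
    longer = max(a_bot - a_top, b_bot - b_top)
    if longer == 0:
        return 1.0 if a_top == b_top else 0.0
    shared = max(0.0, min(a_bot, b_bot) - max(a_top, b_top))
    return shared / longer


def classify(a: Profile, b: Profile) -> str:
    """Return "duplicate", "not duplicate", or "operator"."""
    if a.stream_type != b.stream_type or not a.levels or not b.levels:
        return "operator"                      # case c)
    counts_a, counts_b = observed_counts(a), observed_counts(b)
    same_variables = [n > 0 for n in counts_a] == [n > 0 for n in counts_b]
    if same_variables and a.levels == b.levels:
        return "duplicate"                     # case a)
    # Case b): any one condition is treated as sufficient (assumption).
    if depth_overlap_fraction(a, b) < 0.99:
        return "not duplicate"
    if counts_a != counts_b or len(a.levels) != len(b.levels):
        return "not duplicate"
    differing = sum(1 for x, y in zip(a.levels, b.levels) if x != y)
    if differing > 0.80 * len(a.levels):
        return "not duplicate"
    return "operator"                          # case c)
```

Under these assumptions, two same-stream profiles that share all depths but differ in temperature at only one level out of ten fall through to "operator", since they are neither identical nor more than 80% different.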

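The following is an example of the listing for a group of potential duplicates. Each level of a profile is listed as a depth followed by a temperature, each with a single-digit flag.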

MESSAGE GROUP NUMBER 4**************************************

********** NUMBER 1, ********** UNIQUE IDENT IS 0
Ident: JCC     90 Date/Time: 19901115/0228 QDT/QP/QR: 111
Latitude: 25.50 Longitude: -123.27 Header:
Profile: TEMP Segment: 01 No. Depths: 10 Deepest Depth: 450.0 Dup: D
    0.0 0    25.70 0
   42.0 0    25.70 0
   60.0 0    23.70 0
   79.0 0    23.20 0
  100.0 0    21.70 0
  128.0 0    19.20 0
  178.0 0    16.80 0
  200.0 0    16.50 0
  400.0 0    10.00 0
  450.0 0     9.60 0
Stream: FNBA Source: I Data type: BA Hist flag:  Update flag: S


********** NUMBER 2, ********** UNIQUE IDENT IS 0
Ident: JCCX    90 Date/Time: 19901115/0228 QDT/QP/QR: 111
Latitude: 25.50 Longitude: -123.27 Header: SOVX01 RJTD
Profile: TEMP Segment: 0 No. Depths: 10 Deepest Depth: 450.0 Dup: N
    0.0 0    25.70 0
   42.0 0    25.80 0
   60.0 0    23.70 0
   79.0 0    23.30 0
  100.0 0    21.70 0
  128.0 0    19.20 0
  178.0 0    16.90 0
  200.0 0    16.50 0
  400.0 0    10.40 0
  450.0 0     9.60 0
Stream: MEBA Source: I Data type: BA Hist flag:  Update flag: U


********** NUMBER 3, ********** UNIQUE IDENT IS 0
Ident: JCCX    90 Date/Time: 19901115/0228 QDT/QP/QR: 111
Latitude: 25.50 Longitude: -123.27 Header: SOVX01 RJTD
Profile: TEMP Segment: 01 No. Depths: 10 Deepest Depth: 450.0 Dup: D
    0.0 0    25.70 0
   42.0 0    25.80 0
   60.0 0    23.70 0
   79.0 0    23.30 0
  100.0 0    21.70 0
  128.0 0    19.20 0
  178.0 0    16.90 0
  200.0 0    16.50 0
  400.0 0    10.40 0
  450.0 0     9.60 0
Stream: MEBA Source: I Data type: BA Hist flag:  Update flag: S


********** NUMBER 4, ********** UNIQUE IDENT IS 0
Ident: JCCX    90 Date/Time: 19901115/0228 QDT/QP/QR: 111
Latitude: 25.50 Longitude: -123.27 Header:
Profile: TEMP Segment: 01 No. Depths: 10 Deepest Depth: 450.0 Dup: D
    0.0 0    25.70 0
   42.0 0    25.70 0
   60.0 0    23.70 0
   79.0 0    23.20 0
  100.0 0    21.70 0
  128.0 0    19.20 0
  178.0 0    16.80 0
  200.0 0    16.50 0
  400.0 0    10.30 0
  450.0 0     9.60 0
Stream: FNBA Source: I Data type: BA Hist flag:  Update flag: S


********** NUMBER 5, ********** UNIQUE IDENT IS 0
Ident: JCCX    90 Date/Time: 19901115/0228 QDT/QP/QR: 111
Latitude: 25.50 Longitude: -123.26 Header: SOVX01 RJTD
Profile: TEMP Segment: 01 No. Depths: 10 Deepest Depth: 450.0 Dup: D
    0.0 1    25.70 1
   42.0 1    25.80 1
   60.0 1    23.70 1
   79.0 1    23.30 1
  100.0 1    21.70 1
  128.0 1    19.20 1
  178.0 1    16.90 1
  200.0 1    16.50 1
  400.0 1    10.00 1
  450.0 1     9.60 1
Stream: NWBA Source: D Data type: BA Hist flag:  Update flag: D


In the above example four potential duplicates were identified for the first BATHY, which was the target observation. As can be seen, each observation was a BATHY, the depth ranges were the same, the date-times were the same, and the positions agreed to within 0.01 degree of longitude. This group of observations was referred to the technician for a decision because there are differences between the temperatures at some depths for some of them. For example, at the 79 meter depth, the temperature is 23.20 in the fourth version of the BATHY and 23.30 in the fifth version. The technician is required to make the best decision he or she can for each such group and then apply the results to the appropriate data file. The best decision is relatively subjective at this point, but would include such guidelines as picking the version that covers the greatest range of depths, other things being equal, or picking the version that has a correct call sign. Note that the first message in the group has an incomplete call sign and was included in the group through the fuzzy time-fuzzy area test.

As can be seen, the above procedure is conservative in that all decisions that are not almost completely certain are referred to the operator for review. This works well when all of the data are from the same stream type and the number of referrals to the human operator is not too large. In a typical case of real time data for one month, with about 8000 observations of which 5000 are duplicates, about 60 will be referred to an operator for decision. It takes about one hour to review these and apply the decisions to the database.