IDENTIFICATION OF DUPLICATE RECORDS IN DELAYED MODE DATA

EXACT DUPLICATES

The U.S. National Oceanographic Data Center (NODC) adds high resolution, delayed mode data to the GTSPP data base. Each new file of delayed mode data is checked internally for records with exact duplication in

In addition, each record of the new file is compared to data in the GTSPP data base to identify exact duplicate records. A data base update file is then created from the input file, excluding all duplicates, whether they occur within the file itself or between the file and the data base. This prevents duplicate records from being inserted into the data base.
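The exact-duplicate screening described above can be sketched in Python as follows. This is a minimal illustration and not the NODC software: the function name, the representation of records as dictionaries, and the key_fields argument are assumptions, and the fields actually compared are whatever the procedure specifies. The sketch also assumes that the first copy of a record duplicated within the file is kept and later copies are dropped.

```python
def build_update_records(new_records, database_keys, key_fields):
    """Return the records of a new file that are not exact duplicates.

    new_records   : list of dicts, one per record in the incoming file
    database_keys : iterable of key tuples already present in the data base
    key_fields    : names of the fields compared (hypothetical placeholder)
    """
    seen = set(database_keys)  # keys already held in the data base
    update = []
    for rec in new_records:
        key = tuple(rec[field] for field in key_fields)
        if key in seen:
            # Duplicate of a data base record or of an earlier record
            # in the same file: leave it out of the update file.
            continue
        seen.add(key)
        update.append(rec)
    return update


# Usage with made-up records and placeholder key fields.
KEY_FIELDS = ("platform", "time", "lat", "lon")
incoming = [
    {"platform": "A", "time": "1990-01-01T00:00", "lat": 10.0, "lon": 20.0},
    {"platform": "A", "time": "1990-01-01T00:00", "lat": 10.0, "lon": 20.0},
    {"platform": "B", "time": "1990-01-02T06:00", "lat": 30.0, "lon": 40.0},
]
in_database = {("B", "1990-01-02T06:00", 30.0, 40.0)}
update_file = build_update_records(incoming, in_database, KEY_FIELDS)
```

Holding the keys in a set means each record is checked once against both the data base and the records already accepted from the same file.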

INEXACT DUPLICATES

Periodically, the GTSPP data base is checked for inexact, or near-duplicate, records in which two or more observations

The following information from "near-duplicate" records is displayed on the screen for review:

During this interactive session, the operator decides which, if any, records are to be deleted from the data base.

It is not always possible to determine from the above information whether two data records are actually duplicates (i.e., from the same observation). For example, the geographic positions may be the same, but the times differ by a few minutes and the number of depths differs between the two records. In such situations neither record is deleted: when duplicates cannot be identified with confidence, we err on the side of keeping duplicate records rather than eliminating good data.

When deciding which record (if any) to delete, the operator takes into account the data source, number of observed depths, and state of data quality checking of each record.
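As a rough illustration of how candidate near-duplicate pairs might be collected for the operator's review, the following Python sketch pairs records whose positions and times agree within small tolerances. The function name, the field names, and both tolerance values are assumptions made for this example; they are not the GTSPP matching criteria. The sketch only lists candidates and deletes nothing, leaving the decision to the operator.

```python
from datetime import datetime, timedelta
from itertools import combinations


def near_duplicate_pairs(records,
                         max_time_diff=timedelta(minutes=10),
                         max_pos_diff=0.01):
    """Return pairs of records that are close in position and time.

    Both tolerances are hypothetical values chosen for illustration.
    Positions are in decimal degrees; times are datetime objects.
    """
    pairs = []
    for a, b in combinations(records, 2):
        same_place = (abs(a["lat"] - b["lat"]) <= max_pos_diff
                      and abs(a["lon"] - b["lon"]) <= max_pos_diff)
        close_in_time = abs(a["time"] - b["time"]) <= max_time_diff
        if same_place and close_in_time:
            pairs.append((a, b))  # candidate pair for operator review
    return pairs


# Usage with made-up records: the first two share a position and differ
# by five minutes; the third is far away in both space and time.
records = [
    {"id": 1, "time": datetime(1990, 1, 1, 0, 0), "lat": 10.0, "lon": 20.0},
    {"id": 2, "time": datetime(1990, 1, 1, 0, 5), "lat": 10.0, "lon": 20.0},
    {"id": 3, "time": datetime(1990, 3, 1, 0, 0), "lat": 45.0, "lon": -30.0},
]
candidates = near_duplicate_pairs(records)
```

Comparing every pair is adequate for a small batch; a large data base would need the records sorted or binned by time and position first.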