How is ROOK dealing with duplication of data? How is ROOK cleaning the data?

Ensuring Data Accuracy: How ROOK cleans and manages duplicate data

Written by Sebastian Eugenio
Updated over 3 weeks ago

Managing data duplication is a common challenge when integrating health data from multiple sources. ROOK Connect is designed to address these issues effectively, ensuring data accuracy and consistency. This article outlines how ROOK Connect handles data duplication and employs robust cleaning mechanisms to maintain data integrity.


Handling data duplication

Causes of data duplication

Data duplication occurs when the same information is collected from multiple sources or when inconsistencies arise due to differences in data formatting across devices and platforms.

ROOK’s deduplication strategy

ROOK Connect employs a structured approach to deduplication, ensuring only the most accurate and complete data is retained. The deduplication process follows these key steps:

  1. Duplicate identification: Entries matching key parameters (e.g., timestamps, unique identifiers) are flagged as potential duplicates.

  2. Prioritization rules: If multiple sources provide overlapping data, ROOK keeps the data from the source that ranks highest in its source hierarchy (described below).

  3. Duplicate removal: Redundant entries are deleted, leaving only the most relevant records.

ROOK uses a clear hierarchy to prioritize data sources and prevent duplication of summaries and event data. Biometric devices are given the highest priority, followed by health kits. Within each category, specific devices are also prioritized. For example, Garmin data is prioritized for Physical Health, while Oura data is prioritized for Sleep.
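
The three steps and the source hierarchy above can be sketched in a few lines of Python. This is a minimal illustration, not ROOK's implementation: the Record fields, the priority numbers, and the choice of (user, category, start time) as the duplicate key are all assumptions made for the example. Only the ordering (biometric devices before health kits, Garmin preferred for Physical Health, Oura preferred for Sleep) comes from the article.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical ranking: a lower number means a higher priority.
SOURCE_TYPE_PRIORITY = {"biometric_device": 0, "health_kit": 1}

# Hypothetical per-category device preference, based on the article's examples.
PREFERRED_DEVICE = {"physical": "garmin", "sleep": "oura"}


@dataclass(frozen=True)
class Record:
    user_id: str
    category: str      # e.g. "physical" or "sleep"
    start: datetime
    source: str        # e.g. "garmin", "oura", "apple_health"
    source_type: str   # "biometric_device" or "health_kit"


def priority(record):
    """Return a sortable rank; smaller tuples win."""
    type_rank = SOURCE_TYPE_PRIORITY.get(record.source_type, 99)
    device_rank = 0 if record.source == PREFERRED_DEVICE.get(record.category) else 1
    return (type_rank, device_rank)


def deduplicate(records):
    """Keep one record per (user, category, start), chosen by priority."""
    best = {}
    for record in records:
        # Step 1: records sharing this key are treated as potential duplicates.
        key = (record.user_id, record.category, record.start)
        current = best.get(key)
        # Steps 2 and 3: keep the higher-priority record, drop the other.
        if current is None or priority(record) < priority(current):
            best[key] = record
    return list(best.values())
```

With three sleep records for the same user and start time (one from a health kit, one from Garmin, one from Oura), deduplicate returns only the Oura record.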


Data cleaning mechanisms

Ensuring high-quality data requires more than just removing duplicates. ROOK Connect implements various data cleaning processes to maintain data integrity.

1. Harmonization

ROOK standardizes data formats and units across all integrated sources, ensuring consistency in metrics such as weight, heart rate, and timestamps. For example, distances from different devices are converted to a uniform unit (e.g., kilometers).
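
As a small illustration of unit harmonization, the sketch below converts distances reported in different units to kilometers. The function name and the set of supported units are assumptions for the example; the article only states that distances are converted to a uniform unit such as kilometers.

```python
# Conversion factors to meters. The list of units is illustrative only.
METERS_PER_UNIT = {"m": 1.0, "km": 1000.0, "mi": 1609.344, "ft": 0.3048}


def to_kilometers(value, unit):
    """Convert a distance in the given unit to kilometers."""
    try:
        factor = METERS_PER_UNIT[unit]
    except KeyError:
        raise ValueError(f"Unsupported distance unit: {unit!r}") from None
    return value * factor / 1000.0
```

For example, to_kilometers(5000, "m") gives 5.0, so a run reported in meters by one device and in kilometers by another ends up directly comparable.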

2. Standardization

All data, including dates, weight units, and activity levels, is converted into a standardized format. This ensures seamless interpretation and analysis across platforms.
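
For dates, standardization could look like the sketch below, which turns timestamps with different UTC offsets into a single representation. Choosing UTC in ISO 8601 as the target format is an assumption for this example; the article does not say which format ROOK uses.

```python
from datetime import datetime, timezone


def to_utc_iso(timestamp):
    """Parse an ISO 8601 timestamp with an offset and return it in UTC."""
    parsed = datetime.fromisoformat(timestamp)
    if parsed.tzinfo is None:
        # Without an offset the instant is ambiguous, so refuse to guess.
        raise ValueError("Timestamp must include a UTC offset")
    return parsed.astimezone(timezone.utc).isoformat()
```

Two devices that report the same moment as 08:30 at +02:00 and 01:30 at -05:00 both come out as 06:30 UTC.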

3. Validation

ROOK validates data by detecting inconsistencies, missing values, and outliers. This helps to identify potential data quality issues before further processing.
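
A validation pass of this kind might flag missing values and out-of-range readings before anything else runs. The sketch below does this for heart rate samples. The 30 to 220 bpm bounds and the issue labels are assumptions chosen for illustration, not ROOK's actual thresholds.

```python
def validate_heart_rate(samples, low=30, high=220):
    """Return (index, issue) pairs for missing or implausible samples."""
    issues = []
    for index, value in enumerate(samples):
        if value is None:
            issues.append((index, "missing"))
        elif not (low <= value <= high):
            issues.append((index, "out_of_range"))
    return issues
```

Calling validate_heart_rate([60, None, 300, 72]) reports the second sample as missing and the third as out of range, leaving the decision about what to do with them to a later step.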

4. Normalization

Data is normalized to a common scale, making it easier to compare values from different sources. This process enhances accuracy in metrics such as sleep quality and physical activity levels.
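
One common way to bring values onto a shared scale is min-max normalization, sketched below. The article does not name the method ROOK uses, so treat this as one possible approach; the example scales (0 to 100 and 1 to 5) are hypothetical.

```python
def normalize(value, source_min, source_max):
    """Rescale a value from [source_min, source_max] to the range 0 to 1."""
    if source_max <= source_min:
        raise ValueError("source_max must be greater than source_min")
    return (value - source_min) / (source_max - source_min)
```

A sleep score of 85 on a 0 to 100 scale becomes 0.85, and a rating of 4 on a 1 to 5 scale becomes 0.75, so the two can be compared directly.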

5. Complementation & Prioritization

When multiple sources provide complementary data, ROOK intelligently combines the most complete and accurate information to generate a holistic view of the user’s health data.
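
Complementation can be pictured as filling the gaps in the highest-priority record with values from lower-priority ones. The sketch below assumes each record is a plain dictionary and that the caller has already ordered the records from highest to lowest priority; both are assumptions for the example.

```python
def complement(records_by_priority):
    """Merge records, letting higher-priority sources win field by field.

    records_by_priority: list of dicts, highest-priority source first.
    A field is taken from a lower-priority record only when no earlier
    record supplied a non-None value for it.
    """
    merged = {}
    for record in records_by_priority:
        for field, value in record.items():
            if value is not None and merged.get(field) is None:
                merged[field] = value
    return merged
```

If a wearable reports steps but no calories, and a health kit reports both plus a distance, the merged result keeps the wearable's step count and takes calories and distance from the health kit.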


Event and Summary Generation

Event Generation

ROOK delivers refined and processed data from our health categories as events and summaries. Duplicate data points for a single event are managed through ROOK's Data Duplicity feature, which prioritizes the first recorded instance. Subsequent events from different sources within a close timeframe (+/- 10 minutes) are discarded, allowing a maximum of two events within this window. Data sources directly connected to wearable devices are prioritized over health kits and SDK extractions.
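
The time-window rule can be sketched as follows. This is one simplified reading of the behavior described above: events are processed in the order they were received, and an event is dropped when an already kept event from a different source starts within 10 minutes of it. The event fields are hypothetical, and the sketch does not model the two-event limit or the source-type priority.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=10)  # the +/- 10 minute window from the article


def drop_near_duplicates(events, window=WINDOW):
    """Keep the first event; drop later ones from other sources that
    start within the window of an event already kept.

    events: list of dicts with "source" and "start" keys, assumed to be
    in the order they were received.
    """
    kept = []
    for event in events:
        clash = any(
            event["source"] != earlier["source"]
            and abs(event["start"] - earlier["start"]) <= window
            for earlier in kept
        )
        if not clash:
            kept.append(event)
    return kept
```

Given a Garmin run starting at 08:00, a health kit run starting at 08:05, and another health kit run at 09:00, the 08:05 entry is discarded and the other two are kept.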

Summary Generation

The creation of summaries relies on the data available at the time of calculation. Initially, a summary is formed using the first data source received. This summary is then enhanced with extra data from other sources. Once the initial summary is dispatched, there is a 15-minute interval before an updated summary, incorporating any new data, is sent. These subsequent summaries are marked as updated versions.
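
The summary lifecycle can be modeled with a small tracker, shown below. The class, its method names, and the payload fields are hypothetical; only the 15-minute interval and the idea of marking later summaries as updates come from the article. How a client actually receives these updates is not covered here.

```python
from datetime import datetime, timedelta

UPDATE_INTERVAL = timedelta(minutes=15)  # interval stated in the article


class SummaryTracker:
    """Toy model of a summary that is sent once, then sent again as updates."""

    def __init__(self):
        self.data = {}
        self.version = 0
        self.last_sent = None
        self.pending = False

    def ingest(self, fields):
        """Add fields that the summary does not have yet."""
        for name, value in fields.items():
            if value is not None and name not in self.data:
                self.data[name] = value
                self.pending = True

    def maybe_dispatch(self, now):
        """Return a payload if there is new data and the interval has passed."""
        if not self.pending:
            return None
        if self.last_sent is not None and now - self.last_sent < UPDATE_INTERVAL:
            return None
        self.version += 1
        self.last_sent = now
        self.pending = False
        return {
            "version": self.version,
            "is_update": self.version > 1,
            "data": dict(self.data),
        }
```

In this model the first dispatch goes out as soon as data exists. Data that arrives five minutes later is held back, then sent as an updated summary once 15 minutes have passed since the first dispatch.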


Conclusion

ROOK Connect ensures that users receive accurate, consistent, and reliable health data by implementing a comprehensive approach to data deduplication and cleaning. By harmonizing, validating, deduplicating, and normalizing data, ROOK provides a streamlined and efficient solution for integrating health data from various sources. This improves the quality of health data analysis, enabling better decision-making for users and clients alike.

For more details on ROOK Connect’s data management capabilities, refer to our API documentation.
