Healthcare Data QA
This website provides an overview of the software processing of medical data, with an emphasis on the traps that are often present.  


© 2022 Kevin Pardo
    

Introduction

An overview of some healthcare data processing topics follows. This generally assumes data is being collected and organized to improve patient care.

Core Work: Most of the day-to-day technical work involves searching data sources, harvesting data, processing data, and preparing results for specific uses. The input and output are often low quality, for varied reasons. To ensure that the results are of acceptable quality, a lot of time and effort is needed to check source data and the processed data at all stages of processing. The following outlines basic project input and output.

Data Sources: Healthcare data may come from many sources. These include:

  • An EHR (Electronic Health Record) Database
  • CSV files exported from an EHR.
  • XML files exported from an EHR.
  • Documents from specialist providers outside a medical group and lab companies:
    • PDF Documents.
    • Other text and graphics based documents.
    • Scanned documents (typically images with no text content).
  • Electronic data feeds.

Types of Data: There are many types of data which exist within a large healthcare organization. For population management, the following are common types and example content.

  • Demographics: Name, date of birth, sex, and address.
  • Visits: Dates and an ID for at least one care provider.
  • Vitals: Systolic and diastolic blood pressure, weight, height, and possibly BMI (calculated/derived).
  • Labs: Blood, urine, and fecal tests.
  • Diagnoses (Dx): Conditions and reasons for care. ICD-10 is the current encoding system in the US.
  • Medications (Rx): Substances may include over-the-counter (OTC), prescription, and inpatient medications. Medication names, dosages, and refill information may be included, with part of this in the "sig/signetur."
  • Procedures: Imaging and treatment procedures.
  • Patient History: Smoking status and history, depression, fall risk. These are highly variable and often embedded deep within a data source.

There can be unexpected usages, especially for diagnoses. For example, "A well-child office visit" is a valid ICD-10 "diagnosis" in the US. Smoking may be captured as a diagnosis as well as in a smoking status field.

One of the most common difficulties is that providers capture diagnoses both in "problem lists" and in (bulk) diagnoses areas of an EHR. A problem list is intended to capture conditions on which a provider will focus. Most of these diagnoses are for long-term (chronic) conditions, such as diabetes. Because providers can record diagnoses in either "problem list" tables or diagnoses tables, both data sources should be harvested.
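A minimal sketch of harvesting both sources follows. The row layout (`Diagnosis`), the field names, and the rule of keeping the earliest record per patient and code are assumptions made for illustration; a real EHR schema will differ.

```python
from datetime import date
from typing import NamedTuple


class Diagnosis(NamedTuple):
    """Hypothetical row shape shared by both sources."""
    patient_id: str
    icd10_code: str
    recorded_on: date
    source: str  # "problem_list" or "diagnoses"


def merge_diagnoses(problem_list, diagnoses):
    """Combine both sources, keeping the earliest row per (patient, code)."""
    merged = {}
    for row in list(problem_list) + list(diagnoses):
        key = (row.patient_id, row.icd10_code)
        kept = merged.get(key)
        if kept is None or row.recorded_on < kept.recorded_on:
            merged[key] = row
    # Sort so the output is stable and easy to compare during QA.
    return sorted(merged.values(), key=lambda r: (r.patient_id, r.icd10_code))
```

Keeping the `source` field on each row makes it possible to check later how many diagnoses came from each area of the EHR.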

Normalized Data: Core data should be organized so that it is lean and easy to access. (Lean data may not always be easy to access, but the intent is to have data which is as easy to use as possible.) Data prepared for reports and direct use by medical professionals uses the normalized data as a foundation.

A few derived values, such as BMI, may be important enough to mix with the normalized data. Adding extra patient foreign key columns may help to make searches of normalized data simple. Storing diagnosis and procedure names with their codes can help to make tables easy to use. In general, though, normalized data should be as lean as possible.
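As an illustration of a derived value such as BMI, the sketch below computes it from weight and height using the standard formula (kilograms divided by meters squared). The function name, the metric units, and the choice to return None for missing or non-positive inputs are assumptions for this example.

```python
from typing import Optional


def bmi(weight_kg: Optional[float], height_m: Optional[float]) -> Optional[float]:
    """Return BMI (kg / m^2) rounded to one decimal, or None if inputs are unusable."""
    if weight_kg is None or height_m is None:
        return None
    if weight_kg <= 0 or height_m <= 0:
        # Zero or negative vitals are treated as bad source data, not as a BMI of 0.
        return None
    return round(weight_kg / height_m ** 2, 1)
```

Returning None instead of raising an error lets bad source rows be counted and reported during QA rather than halting a batch.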

Adding tables dedicated to log the history of inputs, updates, and deletes is sometimes required in a project. The use of such tables should be limited to debugging and legal record keeping. They should not be used as general data sources.
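One possible way to keep such a history is with database triggers. The sketch below uses an in-memory SQLite database from the Python standard library; the table names (`vitals`, `vitals_log`), the columns, and the trigger names are all hypothetical, and a production EHR database would likely use a different engine and schema.

```python
import sqlite3

# Hypothetical schema: one data table plus one log table fed by triggers.
SCHEMA = """
CREATE TABLE vitals (
    id INTEGER PRIMARY KEY,
    patient_id TEXT NOT NULL,
    tag TEXT NOT NULL,
    value REAL NOT NULL
);
CREATE TABLE vitals_log (
    log_id INTEGER PRIMARY KEY,
    vital_id INTEGER NOT NULL,
    action TEXT NOT NULL,
    old_value REAL,
    new_value REAL,
    logged_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TRIGGER vitals_log_insert AFTER INSERT ON vitals
BEGIN
    INSERT INTO vitals_log (vital_id, action, old_value, new_value)
    VALUES (NEW.id, 'INSERT', NULL, NEW.value);
END;
CREATE TRIGGER vitals_log_update AFTER UPDATE ON vitals
BEGIN
    INSERT INTO vitals_log (vital_id, action, old_value, new_value)
    VALUES (OLD.id, 'UPDATE', OLD.value, NEW.value);
END;
CREATE TRIGGER vitals_log_delete AFTER DELETE ON vitals
BEGIN
    INSERT INTO vitals_log (vital_id, action, old_value, new_value)
    VALUES (OLD.id, 'DELETE', OLD.value, NULL);
END;
"""


def open_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with the data table, log table, and triggers."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn
```

The log table is written only by the triggers, which fits the intent that it serve debugging and record keeping rather than act as a general data source.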

Note that data can be "normalized" and still be unusable because of poor identifier selection. It is important that people straight out of academia understand the goal is usable data, and that "normalizing" the database schema is only one aspect of this.

Translating Tags: Type tags, such as for vitals and labs, often need to be translated before they can drive reports and care decisions. For example, "wt-pnds" in the source data might need to be translated to a standard name for your project, such as "WEIGHT_POUNDS."
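A tag translation step can be as simple as a lookup table. In the sketch below, only the "wt-pnds" to "WEIGHT_POUNDS" pair comes from the example above; the other entries and the function names are hypothetical. Unmapped tags are collected rather than dropped so they can be reviewed.

```python
# Source tag -> project standard name. Only "wt-pnds" is from the text above;
# the remaining entries are made-up examples.
TAG_MAP = {
    "wt-pnds": "WEIGHT_POUNDS",
    "ht-in": "HEIGHT_INCHES",
    "bp-sys": "BP_SYSTOLIC",
}


def translate_tag(source_tag):
    """Return the standard name for a source tag, or None if it is not mapped."""
    return TAG_MAP.get(source_tag.strip().lower())


def translate_tags(source_tags):
    """Translate many tags; return (translated names, sorted list of unmapped tags)."""
    translated = []
    unmapped = set()
    for tag in source_tags:
        name = translate_tag(tag)
        if name is None:
            unmapped.add(tag)
        else:
            translated.append(name)
    return translated, sorted(unmapped)
```

The list of unmapped tags is useful QA output in its own right, since new or misspelled tags tend to appear in source data without notice.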

Derived Data: Data keys for diagnoses and procedures often need to be grouped to be useful. For example, a project might require a list of all patients with a diagnosis which counts as diabetes.
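The grouping idea can be sketched as a prefix match on diagnosis codes. The prefixes below (E10 for type 1 and E11 for type 2 diabetes mellitus) are an illustrative subset only, not a validated value set; a real project would take its code list from the clinical or reporting specification it must satisfy. The function names and the (patient_id, code) row shape are assumptions.

```python
# Illustrative subset only; not a complete or validated diabetes value set.
DIABETES_PREFIXES = ("E10", "E11")


def is_diabetes_code(icd10_code):
    """True if the code falls under one of the illustrative diabetes prefixes."""
    return icd10_code.strip().upper().startswith(DIABETES_PREFIXES)


def patients_with_diabetes(dx_rows):
    """dx_rows: iterable of (patient_id, icd10_code). Returns sorted unique patient IDs."""
    return sorted({pid for pid, code in dx_rows if is_diabetes_code(code)})
```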

In addition, "concepts" will mostly contain derived data, such as:

  • age_years: int
  • a1c_value_last: float
  • a1c_value_max: float
  • a1c_date_last: date
  • dx_dm_present: char values of 'Y'/'N'

"Concepts" will also contain copies of some normalized data, such as date of birth.
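A rough sketch of building one patient's "concepts" record follows. The field names match the list above; the class name, the function names, the (date, value) shape of the A1c results, and the explicit `as_of` date are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class PatientConcepts:
    date_of_birth: date            # copied from normalized data
    age_years: int
    a1c_value_last: Optional[float]
    a1c_value_max: Optional[float]
    a1c_date_last: Optional[date]
    dx_dm_present: str             # 'Y' or 'N'


def age_in_years(dob: date, as_of: date) -> int:
    """Whole years between dob and as_of, counting only completed birthdays."""
    years = as_of.year - dob.year
    if (as_of.month, as_of.day) < (dob.month, dob.day):
        years -= 1
    return years


def build_concepts(dob: date, a1c_results: List[Tuple[date, float]],
                   has_diabetes_dx: bool, as_of: date) -> PatientConcepts:
    """Derive the concept fields from normalized inputs for one patient."""
    last = max(a1c_results, key=lambda r: r[0]) if a1c_results else None
    return PatientConcepts(
        date_of_birth=dob,
        age_years=age_in_years(dob, as_of),
        a1c_value_last=last[1] if last else None,
        a1c_value_max=max(v for _, v in a1c_results) if a1c_results else None,
        a1c_date_last=last[0] if last else None,
        dx_dm_present="Y" if has_diabetes_dx else "N",
    )
```

Passing `as_of` explicitly, rather than reading the system clock, keeps the derived age reproducible when results are rechecked later.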

Forms of Results: There are many forms that the final output may take. Reports and interactive applications may help organizations improve population management. Reports may go to the government. Some of the most important output is for the development team and project stakeholders to perform QA.

Attitude: People join tech projects with a wide variety of expectations and beliefs. In healthcare data projects, it is important to understand that source data will be low quality and that storage and processing will often become complex. People must focus on the basics. As in most tech projects, becoming sidetracked by complex libraries and "creative ideas" prevents team members from realizing that a project is rapidly becoming a mess.

To provide high-quality results, a lot of time is necessary to study the source data and choose the best sources available. The client may need to be prompted to clean up inconsistencies and junk.

The work is primarily organizing and validating information. Having access to large data sets makes new project members dream of complex statistics and slick processing code. In reality, the data structures in a project can quickly grow until they are out of control. Limiting and managing complexity should be the focus, not fancy functionality.

While young software developers are often cowboys, many experienced people in technology work quite hard to ensure project output is of high quality. Older developers may still enjoy the technology and experiment a bit, but they really want to make valuable information available to providers and administrators. They frequently will feel strong responsibility to keep systems operational and data secure.

In software organizations and some medical environments, however, many people openly declare software developers to be "typists with a bit of technical training." They go on missions to make the technical staff members "know their place." They will attempt to retrain technical people with the belief that top-down management makes projects succeed, even when non-technical managers get involved with the details. Not in their worst nightmares can they imagine the poor quality of data and the effort needed to process it well. Such domineering people also have a habit of adding non-technical people to "assist" technical people in writing requirements. This only helps when the new project members are dedicated to creating a practical set of requirements. They must:

  • Listen to the technical staff regarding what is and what is not practical.
  • Understand that a pile of complex requirements increases confusion and the risk of project failure.
  • Have the skill and energy to organize requirements and supporting data.