Healthcare Data QA

Healthcare Data QA
This website provides an overview of the software processing of medical data, with an emphasis on the traps that are often present.

Home
Introduction
Software Design
Basic Obstacles
Data Input Problems
Human Obstacles
EHR DataBases
CSV Files
XML Files
Reports
Statistics
Legal
Other

© 2022 Kevin Pardo

EHR Databases

We initially assume that big-name product databases are organized and easy to use. In reality, they are often mixtures of data graveyards and quick-and-dirty additions. Here are notes about what data should be like, and what it is in the real world. The emphasis is on identifiers.

Column Names: The following are reasonable column names for a patient's first name:

first_name
name_first
firstname
fname
given_name
demographics_name_first (table_)
demo_name_firs (abbreviated table_)

The above all reflect different naming conventions. If the naming conventions are mixed, the result is data which is terribly error prone to use. For example, name_first, mname, and namelast in the same table or schema invites typos and causes delays and frustration. Database work becomes excruciating.

Eventually you will encounter data which should never have made it to production:

first_nme Misspelled
"FirstName" AKA [FirstName], 'FirstName'; pure evil
"first name" AKA [first name], 'first name'; as above
demo__name_first (Two underscores)
first_name, first_name_sortable, and first_name_important Duplicate columns
primary_intra_family_appellation When an academic names the columns

The above are pretty much poison for data usage, but they do exist in operational systems. (Some people appear to intentionally use double underscores, but these double underscores cause usage hassles.) Consistency can mitigate the effects of unusual naming conventions, but expect work delays if you see the above.

Duplicate columns have a tendency to propagate in derived data. People are rushed, and they begin to duplicate columns for each usage. This is horrifying, but real.

All of the earlier examples of bad column names assume that the column actually contains what its name implies. In reality, especially early in a project, you may find that a "name_first" column actually is filled with values such as: Smith, Cooper, Brown, Gonzales, and Johnson
Such problems are usually fixed early in a project's life. Later in a project, however, you may find that "phone_work" has been appropriated for emails.

My own preference for name columns is:

name_first
name_middle (in practice, middle name or middle initial)
name_last
name_full (typically a derived roster concept)

In a complex roster, the following would be reasonable:

demo_name_first
demo_name_middle
demo_name_last
demo_name_full

Prefixes can be helpful if you have reporting functionality to select columns by regular expressions: ^demo_

Ambiguous Usage: Sometimes a column has a clear name, but its usage by the EHR software and users is ambiguous. For example: diagnoses_start, diagnosis_made, diagnosis_entered, and diagnosis_updated
These may all live in the same diagnosis table, but not all columns may be populated. The harvesting and processing of ambiguous columns ideally should be discussed with the client. Non-technical people on the client side may not understand what is in the database tables, however.

An end result, for a downstream table, might be:

diagnoses_date=LEAST(diagnosis_start,diagnosis_made,diagnosis_entered,diagnosis_updated)

The LEAST() function exists in PostGres/PostgreSQL and is helpful.

Data doesn't Match the Column: As mentioned earlier, you may find that phone_work contains email values.

Multiple Key Columns for Essentially the Same Data: MRN (Medical Record Number, external ID) and patient_id (internal ID) may both identify patients. While a particular schema design may be justified in having both internal and external ID values, individual tables may almost randomly use one or the other, or both. The names can also vary. For example, you might eventually find that patient_id is pid, pat_id, or even id. (I may even have seen an EHR in which pid sometimes represented patient ID while at other times it represented provider id.)

Table Naming Problems: There are many ways in which tables can be difficult to navigate. Examples include:

Tables may contain overlapping or stale data: demographics, demographics2, demographics 2019, and demographics_new
Tables are empty: Upon first viewing table names, it is easy to make assumptions as to where the cleanest data lives. In reality, tables with clear names are often unused or have been abandoned.

Some tables will contain temporary backup copies of data. These should be avoided.