Healthcare Data QA
This website provides an overview of the software processing of medical data, with an emphasis on the traps that are often present.
© 2022 Kevin Pardo

Software Design

Design Basics: To manage the care of a population, the following are examples of what you will want at the beginning of a project:
The less defined the goals, attributes, and functionality, the more developers will be pushed to work 24/7. The client must take responsibility for providing, or at least reviewing, the data which drives the processing of patient data. The names of labs, diagnoses, and medications to be processed must be reviewed by the client whenever possible.

Reasonable Requirements: It is important that software requirements be realistic. Managers on both the client side and the development side often underestimate how messy the work will be, and accept intricate, impractical requirements. It is also common for the client to be unable to articulate complex processing steps. Even if you, as a developer, draft a list of diagnoses for the software to process, the client may be too busy to approve it.
Reasonable Workloads: Medical professionals doing data entry should not be expected to add hours of new tasks to support the project. Likewise, software developers should not have to edit large files manually with each data harvest. It is easy for project managers to drift into fantasy and overload users and developers alike. Medical professionals already consider EHRs to be burdens which interfere with patient care.

Processing Model: Approaches to processing large amounts of data vary, but a simple model is:
Patient IDs: One normalization task is to assign a globally unique ID to each patient. For example, an EHR might have a public MRN (medical record number) as well as an internal ID. Some systems have three or more patient ID values. For patient Jane Smith:
Related to the above is the task of merging patient data which originates from different environments. An MRN from one source will not match an MRN from another source. (Typically, a single MRN value will identify a different patient in each EHR harvested.)
Additionally, a single source, such as one hospital's EHR, will often assign one individual multiple MRN values over time. A patient who changes his or her family name and does not use a medical facility regularly will be at risk of receiving two MRN values.
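As a sketch of the idea, and nothing more, a global ID can be assigned by keying on the source system plus the source MRN. All of the names and structures below are hypothetical; real matching usually also compares demographics (name, date of birth) to catch one patient holding multiple MRN values.

```python
# Sketch: assign a globally unique patient ID across source systems.
# Hypothetical structures; real pipelines also match on demographics.

def build_global_ids(records):
    """records: iterable of (source_system, mrn) tuples.
    Returns a dict mapping (source_system, normalized_mrn) -> global ID."""
    mapping = {}
    next_id = 1
    for source, mrn in records:
        key = (source, mrn.strip().upper())  # normalize the MRN text upstream
        if key not in mapping:
            mapping[key] = next_id
            next_id += 1
    return mapping

ids = build_global_ids([
    ("hospital_a", "12345"),    # same MRN text...
    ("hospital_b", "12345"),    # ...but a different patient in another EHR
    ("hospital_a", " 12345 "),  # same patient, messy whitespace
])
```

Here the two `hospital_a` rows collapse to one global ID, while the identical MRN text from `hospital_b` correctly receives its own.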
Clean Data Values Upstream: To ensure clean data, consider upstream updates to string values such as:
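For illustration, a minimal upstream cleaning pass might look like the following. The specific rules (trimming, collapsing whitespace, case-folding) are assumptions for the sketch; the right rules depend on the client's data.

```python
import re

def clean_value(text):
    """Normalize a raw string field: trim, collapse internal
    whitespace, and unify case so joins and QA comparisons match."""
    if text is None:
        return ""
    text = text.strip()                # drop leading/trailing whitespace
    text = re.sub(r"\s+", " ", text)   # collapse runs of spaces and tabs
    return text.upper()                # case-fold for comparison keys

# Messy inputs that should compare equal after cleaning:
assert clean_value("  Jane\tSmith ") == clean_value("JANE SMITH")
```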
When accessing data prepared by a client or co-worker, we often assume that the data will be "clean." This assumption can cause a variety of processing errors.

Determining Active Patients: Identifying and processing only "active" patients is a common requirement. Deceased patients aside, many patients exit and return to a given healthcare environment. Care must be taken not to delete data permanently for "inactive" patients. (Ignoring all 2021 data for an inactive patient, in some processing schemes, may cause the 2021 data to be lost if the patient returns in 2022.)

Be Alert for Obscured Data: Ideally, problem lists are a subset of all patient diagnoses, but in reality providers do not enter the important diagnoses in both places. This is a result of people having both "summary" and "everything" buckets; discrepancies are inevitable. Also, diagnoses used to justify procedures may be placeholders, though medical organizations seem to have tried to reduce these "fake" diagnoses in recent years. Do not be surprised if important data is "hidden" by the way people enter data into an EHR.

Data Often Trickles into an EHR: Procedure data often enters a system late because billing processes are typically slow. Lab results and scanned documents may also take time to make it into the EHR. If a client provides data in monthly intervals, containing one month's data at a time, there should be an agreement on how much data it is acceptable to drop. It may be best for clients to provide several years' worth of procedure data at harvest time, not just values for the last month.

Data Should be Approved: Providers may be slow to sign off on encounter findings, meaning unapproved data lingers for days or weeks. Providers often need to be encouraged to approve their encounter records within a few days, and you may receive unapproved data during harvests.

One Person Should Approve the Data Schemas: Tables and columns designed by groups are often a mess, and usage of the data suffers horribly.
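The warnings about inactive patients and late-arriving data can be sketched as a merge step that only ever adds rows, never discards history for patients absent from the current harvest. The structures below are invented for illustration; a real pipeline would work against database tables rather than dictionaries.

```python
# Sketch: fold one month's harvest into accumulated data without
# losing history for patients who do not appear in this month's file.

def merge_harvest(master, monthly):
    """master, monthly: dict of patient_id -> list of (date, procedure).
    Returns a new dict with monthly rows added; prior history is kept."""
    merged = {pid: list(rows) for pid, rows in master.items()}
    for pid, rows in monthly.items():
        existing = set(merged.get(pid, []))
        merged.setdefault(pid, [])
        # Append only unseen rows; late-arriving duplicates are common.
        merged[pid].extend(r for r in rows if r not in existing)
    return merged

master = {"P1": [("2021-05-01", "colonoscopy")]}   # inactive in 2022
monthly = {"P2": [("2022-01-10", "ekg")]}
merged = merge_harvest(master, monthly)
# P1's 2021 history survives even though P1 is absent this month.
```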
One person should write, or at least approve, all normalized data tables. It is wise to have one person approve naming conventions for derived data as well. With an unsupervised group, eventually you will find that you have created a junkyard. (Note that "supervision" should be done by someone with hands-on experience, not just academic degrees or management titles. An inexperienced manager who simply shouts "QUALITY," "DESIGN," and "DEADLINE" day and night will wreak havoc on the database schema.)

Version Control and Backups: Source code should be under version control or backed up carefully. Both the main and helper databases should be backed up as well. Note that the people actually performing the backups may not back up as much data as developers expect. The backup staff may also scale back, or even stop providing, backup and restore services without notifying anyone. It sounds unbelievable, but IT departments can sometimes be dysfunctional.

Designs Should be Simple: Young developers often grab the latest libraries from the Internet, and even the developers who select the libraries often have no experience with them. This is fine for a college student on summer break, but a nightmare for projects which are supposed to be robust. Security may be an issue as well: even commonly used software libraries have so many layers that major security holes pop up without warning.

Separate Harvesting from Processing: Loading and processing are both error prone and require a lot of work and debugging. It is tempting to apply some processing rules while loading data, but this means that changes to the business rules require large reprocessing operations. Often EHR database harvests are only allowed late at night or on weekends, so fixing even a minor processing bug may cause project delays. Legitimate data transformations during harvesting may include:
The above transformations are basic and can often be implemented in simple harvesting tools. Consistent MRN values in upstream data will greatly increase the ability to run QA comparisons on tables.

Performance: Ensure the database server application has enough RAM and that the basic database server configuration has been updated from the defaults. Focus on indexes to boost performance. Indexes may need to be dropped before some bulk operations, such as inserts. Some SQL library parameters, such as those for Java's JDBC, may need to be changed to minimize program read times and prevent aborts on slow queries. Parameters include:
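The index advice above can be sketched concretely. The example below uses SQLite for portability (table and index names are invented); the pattern of dropping an index before a bulk insert and rebuilding it once afterward applies to most database servers.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE labs (patient_id TEXT, test_name TEXT, value REAL)")
con.execute("CREATE INDEX idx_labs_patient ON labs (patient_id)")

rows = [("P%d" % i, "A1C", 5.0 + i % 3) for i in range(10000)]

# Drop the index before the bulk insert, then rebuild it once,
# instead of updating the index row by row during the load:
con.execute("DROP INDEX idx_labs_patient")
con.executemany("INSERT INTO labs VALUES (?, ?, ?)", rows)
con.execute("CREATE INDEX idx_labs_patient ON labs (patient_id)")
con.commit()

count = con.execute("SELECT COUNT(*) FROM labs").fetchone()[0]
```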
Custom programming to increase the performance of a database server is usually a mistake. Most of us experiment with it at some point, but it means that key processing is broken out from the rest of the data work. QA will be difficult, and other developers will not be able to maintain the code easily. In many cases, it means that data in the database cannot easily be compared against data in custom caches. Custom utilities to read data files, as mentioned earlier, may be justified. Keep life simple, and use SQL for most data transformations. (Executing SQL in a simple framework with macros and reporting may be helpful, but most data operations should be performed by the database using SQL.)
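The "use SQL for most data transformations" advice might look like the following sketch, again using SQLite with invented table and column names: the cleanup runs inside the database, where it stays visible and easy to QA, rather than in custom application code.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE patients (mrn TEXT, family_name TEXT)")
con.executemany("INSERT INTO patients VALUES (?, ?)",
                [(" 001 ", "smith"), ("002", " Jones")])

# Perform the transformation in SQL rather than in a custom cache:
con.execute(
    "UPDATE patients SET mrn = TRIM(mrn), family_name = UPPER(TRIM(family_name))"
)

names = [r[0] for r in
         con.execute("SELECT family_name FROM patients ORDER BY family_name")]
# names == ['JONES', 'SMITH']
```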