Healthcare Data QA
This website provides an overview of the software processing of medical data, with an emphasis on the traps that are often present.  


© 2022 Kevin Pardo
    

CSV Files

I prefer CSV files to XML files because flat files usually correspond to individual database tables. Many problems can still occur, however.

Inconsistent Column Formats: Even within the same file, you may find that date_of_birth has a format of YYYY/MM/DD while deceased has a format of DD-MM-YYYY. Often we are not on the lookout for this level of depravity, and we misconfigure our file harvesting utilities.
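
One way to guard against this is to declare the expected format for each date column explicitly, rather than relying on a single file-wide setting. The following is a minimal sketch, not a definitive implementation. The DATE_FORMATS mapping and the parse_date helper are hypothetical names, and the formats shown are simply the two from the example above; confirm them against the real file.

```python
from datetime import date, datetime
from typing import Optional

# Hypothetical per-column formats, taken from the example in the text.
# Confirm each one against the actual source file.
DATE_FORMATS = {
    "date_of_birth": "%Y/%m/%d",  # YYYY/MM/DD
    "deceased": "%d-%m-%Y",       # DD-MM-YYYY
}


def parse_date(column: str, raw: str) -> Optional[date]:
    """Parse raw using the format registered for column.

    Empty or whitespace-only values become None. A value that does not
    match the registered format raises ValueError, so a change in the
    source format is noticed instead of being silently misread.
    """
    raw = raw.strip()
    if not raw:
        return None
    return datetime.strptime(raw, DATE_FORMATS[column]).date()
```

Failing loudly on a mismatch is a deliberate choice here: a date read with the wrong format can look perfectly valid, such as 03-07-2020 read as March 7 instead of July 3.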

Inconsistent Primary Keys: One file may have an MRN of 0012345 for Robert Brown while another has 12345 for the same person. Because the two keys do not match, the records fail to join, and large amounts of patient data can be dropped.
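
If the source system confirms that leading zeros carry no meaning, the keys can be normalized before matching. This sketch assumes exactly that, which will not be true for every site; normalize_mrn is a hypothetical helper name.

```python
def normalize_mrn(raw: str) -> str:
    """Return the MRN with surrounding whitespace and leading zeros removed.

    ASSUMPTION: leading zeros are padding only. Verify with the data
    owner before using this, since some systems treat 0012345 and
    12345 as different identifiers.
    """
    mrn = raw.strip()
    if not mrn:
        raise ValueError("empty MRN")
    # Keep a single zero if the value consisted only of zeros.
    return mrn.lstrip("0") or "0"
```

Apply the same function to the key in every file before joining, so that 0012345 and 12345 land on the same patient.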

Variation in the use of double quotes: There is an unfortunate variety of ways in which double quotes can be applied. If the pipe character is the field separator, double quotes are ideally not needed. In reality, the people who generate files for export are often inconsistent in the settings they choose. Be careful that your code handles all required uses of double quotes. Expect the source files to change the use of double quotes over time.
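
Splitting each line on the separator by hand tends to break as soon as quoting appears. A sketch using Python's standard csv module, which accepts quoted and unquoted fields in the same file, is shown below. The read_pipe_rows name is hypothetical, and the settings assume that quotes inside a quoted field are escaped by doubling them, which may not match every export.

```python
import csv
import io
from typing import List


def read_pipe_rows(text: str) -> List[List[str]]:
    """Parse pipe-separated text, tolerating optional double quotes.

    ASSUMPTION: embedded quotes are written as two double quotes,
    which is the csv module's default (doublequote=True).
    """
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter="|",
        quotechar='"',
    )
    return list(reader)
```

With these settings, a line such as 12345|"Brown, Robert"|stable yields the same three fields whether or not the name is quoted, and a pipe character inside a quoted field is kept as data rather than treated as a separator.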

The more variation in the data files, the more likely you are to make mistakes.

"Numeric" fields should only have Numbers: Ideally numeric fields would only contain numeric values, but this is not the case in the real world. In practice, numeric fields may have % and ! characters, among others. An A1C value might be: 8.4!
One person I met describes the struggle as, "Convincing people not to add emojis to their data."
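
One approach is to pull the number out of the field and keep whatever is left over for review, rather than discarding it. This is only a sketch: parse_numeric is a hypothetical helper, and it assumes plain decimal numbers without thousands separators or scientific notation. Whether a stray character such as ! carries meaning is a question for the people who produced the data.

```python
import re
from typing import Optional, Tuple

# Matches an optionally signed integer or decimal such as 8.4 or -12.
_NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")


def parse_numeric(raw: str) -> Tuple[Optional[float], str]:
    """Return (value, leftover) for a supposedly numeric field.

    value is the first number found, or None if there is none.
    leftover is the remaining text, so unexpected characters can be
    logged and reviewed instead of being silently thrown away.
    """
    text = raw.strip()
    match = _NUMBER.search(text)
    if match is None:
        return None, text
    leftover = (text[:match.start()] + text[match.end():]).strip()
    return float(match.group()), leftover
```

For the A1C example above, parse_numeric("8.4!") returns the value 8.4 together with the leftover "!".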

Blank Lines: Sometimes rows are essentially blank aside from the field separators: |||||||||||||||
Ideally these should be ignored and not stored in the database.
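
A small check before loading can filter these rows out. The sketch below assumes the row has already been split into fields; is_blank_row is a hypothetical helper name.

```python
from typing import Iterable


def is_blank_row(fields: Iterable[str]) -> bool:
    """Return True if every field is empty or whitespace only.

    A completely empty line, which has no fields at all, also counts
    as blank.
    """
    return all(not field.strip() for field in fields)
```

Rows for which this returns True can be skipped, while a row with even one populated field is kept.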

CSV Column Names Not Usable: A CSV file may have spaces and other characters in its column names that make them poor candidates for column names in a relational database. It may be best to define your own column names from the start.
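
When entering names by hand is not practical, the headers can be converted mechanically as a starting point. The following is a rough sketch under my own naming rules (lowercase, underscores, a c_ prefix for names starting with a digit); to_db_column_names is a hypothetical helper, and the result should still be reviewed by a person.

```python
import re
from typing import Dict, List


def to_db_column_names(headers: List[str]) -> List[str]:
    """Convert raw CSV headers into conservative database column names."""
    names: List[str] = []
    seen: Dict[str, int] = {}
    for position, header in enumerate(headers, start=1):
        # Collapse every run of other characters into one underscore.
        name = re.sub(r"[^0-9a-z]+", "_", header.strip().lower()).strip("_")
        if not name:
            # Nothing usable in the header: fall back to its position.
            name = f"column_{position}"
        elif name[0].isdigit():
            # Many databases reject unquoted names that start with a digit.
            name = f"c_{name}"
        # Number repeated names so each column stays distinct.
        seen[name] = seen.get(name, 0) + 1
        names.append(name if seen[name] == 1 else f"{name}_{seen[name]}")
    return names
```

For example, a header of "Date of Birth" becomes date_of_birth, and a second column named "MRN #" becomes mrn_2.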

Fixing File Problems

When possible, the people processing the data should be able to push back against variation in source data files. If junior employees have been assigned to handle data, they may feel that they are expected to struggle through the problems. By the time a senior project member sees the junk input, it is too late to approach the client for fixes. Junior employees should report bad data, and management needs to recognize that bad data slows work.

It is also a good idea to advise the client from the start that consistent data is the most robust data.