Healthcare Data QA
This website provides an overview of the software processing of medical data, with an emphasis on the traps that are often present.
© 2022 Kevin Pardo

CSV Files

I prefer CSV files to XML files because flat files usually correspond to individual database tables. Many problems can occur, however.

Inconsistent Column Formats: Even within the same file, you may find that date_of_birth has a format of YYYY/MM/DD while deceased has a format of DD-MM-YYYY. Often we are not on the lookout for this level of depravity, and we misconfigure our file harvesting utilities.

Inconsistent Primary Keys: One file may have an MRN of 0012345 for Robert Brown while another has 12345 for the same person. If the files are matched on MRN, this can cause large amounts of patient data to be dropped.

Variation in the use of double quotes: There is an unfortunate variety of ways in which double quotes can be applied. If the pipe character is the field separator, double quotes are ideally not needed. In reality, the people who generate files for export are often inconsistent in the settings they choose. Be careful that your code handles all required uses of double quotes. Expect the source files to change their use of double quotes over time. The more variation in the data files, the more likely you are to make mistakes.

"Numeric" fields should only have Numbers: Ideally numeric fields would only contain numeric values, but this is not the case in the real world. In practice, numeric fields may have % and ! characters, among others. An A1C value might be: 8.4!
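The column-level problems above (per-column date formats, zero-padded MRNs, stray characters in numeric fields) can be handled with small normalizing functions. The following is a minimal sketch, not a definitive implementation. The function names and the `DATE_FORMATS` mapping are hypothetical, and it assumes that leading zeros in an MRN carry no meaning in the source systems, which should be confirmed with the client.

```python
import re
from datetime import date, datetime

# Hypothetical per-column formats, taken from the examples in the text.
DATE_FORMATS = {
    "date_of_birth": "%Y/%m/%d",  # YYYY/MM/DD
    "deceased": "%d-%m-%Y",       # DD-MM-YYYY
}

def parse_date(column, raw):
    """Parse a date using the format configured for its column; blank gives None."""
    raw = raw.strip()
    if not raw:
        return None
    return datetime.strptime(raw, DATE_FORMATS[column]).date()

def normalize_mrn(raw):
    """Strip whitespace and leading zeros so '0012345' and '12345' compare equal.

    Assumption: leading zeros are padding, not part of the identifier.
    """
    stripped = raw.strip().lstrip("0")
    return stripped or "0"

_NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")

def parse_numeric(raw):
    """Return (value, leftover); leftover holds any non-numeric characters found."""
    raw = raw.strip()
    match = _NUMBER.search(raw)
    if match is None:
        return None, raw
    leftover = (raw[:match.start()] + raw[match.end():]).strip()
    return float(match.group()), leftover

dob = parse_date("date_of_birth", "1950/03/07")
died = parse_date("deceased", "07-03-2020")
same_person = normalize_mrn("0012345") == normalize_mrn("12345")
a1c, flag = parse_numeric("8.4!")
```

Returning the leftover characters, rather than silently discarding them, keeps a record of values such as `8.4!` so that they can be reviewed or reported back to the data source.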
Blank Lines: Sometimes rows are essentially blank aside from the field separators: |||||||||||||||
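Both the double-quote variation and the blank rows can be absorbed at read time. The sketch below uses Python's standard `csv` module with the pipe character as the delimiter. The function name `read_pipe_rows` and the sample rows are made up for illustration, and it assumes that any quoting in the file follows the usual convention of wrapping a whole field in double quotes.

```python
import csv
import io

def read_pipe_rows(text_stream):
    """Yield non-blank rows from a pipe-delimited stream.

    csv.reader accepts quoted and unquoted fields alike, so a file that
    mixes the two styles is parsed the same way.
    """
    reader = csv.reader(text_stream, delimiter="|", quotechar='"')
    for row in reader:
        # A row such as '|||' parses to a list of empty strings, and an
        # empty line parses to an empty list; skip both.
        if not any(field.strip() for field in row):
            continue
        yield row

# Made-up sample mixing quoted fields, unquoted fields, and blank rows.
sample = io.StringIO(
    'mrn|name|a1c\n'
    '12345|"Brown, Robert"|8.4\n'
    '||\n'
    '\n'
    '"67890"|Jane Doe|"7.1"\n'
)
rows = list(read_pipe_rows(sample))
```

A file with unbalanced or improperly escaped quotes can still defeat this approach, so counting the fields in each row against the expected number remains a useful check.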
CSV Column Names Not Usable: It may be that a CSV file has spaces and other characters in the column names which make them poor candidates for the column names in a relational database. It may be best to enter your own column names from the start.

Fixing File Problems

When possible, the people processing the data should be able to push back against variation in source data files. If junior employees have been assigned to handle data, they may feel that they are expected to struggle through the problems. By the time a senior project member sees the junk input, it is too late to approach the client for fixes. Junior employees should report bad data, and management needs to recognize that bad data slows work.

It is also not a bad idea to advise the client from the start that consistent data is the most robust data.
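Returning to the column-name problem described earlier, one way to enter your own column names from the start is to discard the file's header row and attach names that you control. This is only a sketch: `COLUMN_NAMES`, `read_with_own_names`, and the sample header are hypothetical, and it assumes the file's columns always arrive in the same order.

```python
import csv
import io

# Hypothetical database-friendly names, in the same order as the file's columns.
COLUMN_NAMES = ["mrn", "date_of_birth", "a1c_percent"]

def read_with_own_names(text_stream, column_names):
    """Replace the file's header with our own names and return a list of dicts."""
    reader = csv.reader(text_stream, delimiter="|")
    header = next(reader)  # the file's own header row, which is discarded
    if len(header) != len(column_names):
        # A changed column count suggests the export layout has changed.
        raise ValueError(
            f"expected {len(column_names)} columns, found {len(header)}"
        )
    return [dict(zip(column_names, row)) for row in reader]

# Made-up sample with header names unsuitable for a relational database.
sample = io.StringIO(
    "Patient MRN #|Date of Birth (YYYY/MM/DD)|A1C %\n"
    "12345|1950/03/07|8.4\n"
)
records = read_with_own_names(sample, COLUMN_NAMES)
```

Raising an error when the column count changes gives an early, reportable signal that the source file layout has shifted.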