There has been an ongoing email thread concerning the need for fake or de-identified data sets. Dr. Hoyt of UWF Health Informatics had spoken with @tony about this need. While the email thread continued, I contacted Dr. Hoyt and sent him a tool for generating random patient data, which he has begun evaluating for use and for expansion of its complexity.
There are several things to consider with de-identified data:
- De-identified data is risky: it is easy to miss an identifiable detail in a narrative note, a linked document, or the like.
- De-identified data can offer a high level of complexity, but it suffers from a number of issues:
  - The data is locked in time. If you want recent dates, adaptation to new data schemes, etc., you must modify the data while preserving its relational structure.
  - Adding more data to accommodate new features requires building it from scratch anyway.
  - Data entry in samples drawn from real-world data can be incomplete. This is useful in some ways, but it does not fairly represent how a system performs on good data.
Randomly generated data takes logic to build and a comprehensive tool to produce, but it has clear advantages:
- The tool can be continuously modified and equipped with algorithms that simulate both normal statistical variation (standard deviation) and aberrations.
- There are no worries about PHI. (It sometimes yields humorous results.)
- It must be kept in sync with the current database schema, or made to produce data for a target schema version.
- It can create very large data sets for stress-testing systems to the breaking point.
- It can be seeded for a specific scenario, such as simulating a cancer cluster, a flu epidemic, or other syndromic surveillance reporting.
- It can create content in any language.
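To make the seeding idea concrete, here is a minimal sketch of a seeded random-patient generator that injects a simulated flu cluster into one week of encounters. Every field name, name list, and diagnosis code below is illustrative only and is not taken from the actual tool or any real schema.

```python
import random
from datetime import date, timedelta

# Illustrative value pools -- not a real schema or code set.
FIRST_NAMES = ["Alice", "Bob", "Carol", "David", "Eve"]
LAST_NAMES = ["Smith", "Jones", "Garcia", "Chen", "Okafor"]
DIAGNOSES = ["J10.1", "I10", "E11.9", "M54.5", "Z00.00"]  # flu, hypertension, diabetes, back pain, checkup

def random_encounter(rng, start, span_days, flu_week=None):
    """Build one fake encounter; bias toward a flu diagnosis during flu_week."""
    day = start + timedelta(days=rng.randrange(span_days))
    if flu_week and flu_week[0] <= day <= flu_week[1] and rng.random() < 0.6:
        dx = "J10.1"  # the seeded "epidemic" signal
    else:
        dx = rng.choice(DIAGNOSES)
    return {
        "name": f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}",
        "dob": date(1940, 1, 1) + timedelta(days=rng.randrange(30000)),
        "encounter_date": day,
        "diagnosis": dx,
    }

def generate(n, seed=42):
    """Generate n encounters over 90 days with a one-week flu cluster."""
    rng = random.Random(seed)  # fixed seed makes the data set reproducible
    start = date(2024, 1, 1)
    flu_week = (date(2024, 2, 5), date(2024, 2, 11))
    return [random_encounter(rng, start, 90, flu_week) for _ in range(n)]

if __name__ == "__main__":
    records = generate(1000)
    flu = [r for r in records if r["diagnosis"] == "J10.1"]
    print(len(records), "records,", len(flu), "flu diagnoses")
```

The same seed always reproduces the same data set, which matters for regression testing; swapping in a different scenario (a cancer cluster, say) is just a different biasing rule inside `random_encounter`.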
Dr. Hoyt’s uses are many, and developers find such a tool useful as well. Let’s discuss the needs and find out what can be done. It seems reasonable to alter the output of the existing program into something the toolkit can consume, provided that we are given a relational schema based on a completed system configuration and a single patient record (or a few).