Demo data generation

There has been an ongoing email thread concerning the need for fake data sets, de-identified data, and so on. Dr. Hoyt of UWF Health Informatics had spoken to @tony about this need. While the email thread continued, I contacted Dr. Hoyt and sent him a tool for generating random patient data, which he has begun evaluating for use and for expansion of its complexity.

So, there are things to consider with de-identified data:

  • De-identified data is scary due to the potential for missing some identifiable bit in a narrative, linked document or the like.
  • De-identified data can give a high level of complexity, but suffers from a number of issues.
  • The data is locked in time. If you want recent dates, adaptation to new data schemes etc. you must modify that data while preserving relational structure.
  • Adding more data to accommodate new features requires building it from scratch anyway.
  • Data entry in samples from real-world data can be incomplete. This is useful in some ways, but doesn’t reflect evaluations of good data fairly.

Randomly generated data:

  • Takes logic to build, and a comprehensive tool to produce.
  • The tool can be continuously modified and equipped with algorithms that simulate both standard deviation as well as aberrations.
  • There are no worries about PHI.
  • Sometimes yields humorous results.
  • Must be kept in sync with the current database schema, or produce target version data.
  • Can be used to create very large data sets for stress-testing systems to a breaking point.
  • Can be seeded for a specific application, such as simulating a cancer cluster, flu epidemic, or other syndromic surveillance reporting.
  • Can create content in any language.
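To make the "seeded" and "standard deviation plus aberrations" points concrete, here is a minimal sketch in Python. It is not taken from the tool discussed in this thread; the function name `generate_heart_rates`, the mean, the spread, and the aberration rate are all assumptions chosen for illustration, not clinical reference values.

```python
import random

def generate_heart_rates(n, seed, mean=72.0, sd=8.0, aberration_rate=0.02):
    """Return n resting heart rates; a small fraction are deliberate outliers.

    All numeric defaults are illustrative placeholders, not clinical values.
    """
    rng = random.Random(seed)  # the same seed reproduces the same data set
    values = []
    for _ in range(n):
        if rng.random() < aberration_rate:
            # Aberration: push the value 4 to 6 standard deviations from the mean.
            value = mean + rng.choice([-1, 1]) * rng.uniform(4, 6) * sd
        else:
            # Ordinary case: normally distributed around the mean.
            value = rng.gauss(mean, sd)
        values.append(round(value))
    return values

sample = generate_heart_rates(1000, seed=42)
```

Because the generator owns its own `random.Random(seed)` instance, rerunning with the same seed gives the same patients, which helps when a test or a classroom exercise needs a repeatable data set.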

Dr. Hoyt’s uses are many. Developers find such a thing useful. Let’s discuss the needs and find out what can be done. It seems to me that altering the output of the existing program to something that the toolkit can consume would be reasonable, provided that a relational schema based on a completed system configuration and a single patient record (well, a few) were provided.


At Indiana University (IU), Dept of BioHealth Informatics, we have a LibreHealth Toolkit instance with de-identified primary care data from community health centers (CHCs). These CHCs have allowed us to use the data for training students. IU wants to ensure that the data is properly de-identified and that there is a secure way to share it with other institutions. That work should be supported through grants in a collaboration for it to be shared with others, who will again use it for training purposes.

A good fake dataset, which I’ve also previously imported into OpenMRS, is MIMIC. But since I teach classes in EHR systems and data analysis, I can clearly see that fake data doesn’t give meaningful explanations in many clinical cases… and real data is often useful to show clinical errors. That might be the biggest reason why people worry about sharing real data. It’s not as much about patient privacy as it is about errors. So we also want to avoid revealing the source of shared data.

But for a variety of software testing purposes, it would indeed be useful to generate fake data. In fact, we can do that if we use the OpenMRS reference application and its referencedemodata module.

So that piece of code can generate patients, obs, visit notes.

I have data resources you can plug directly into that. Nothing fancy, just larger lists of names and things. I am more used to thinking about things in terms of database inserts myself. I find the best first step to widespread forgery is having a very complete single record set database dump, or anything else that could be directly imported by a tool that has a high level of data coverage. I know that OpenMRS doesn’t have a billing component, but translating out the diagnosis portion is obviously possible. I am just as interested in this as a familiarization project as much as anything else. Not so much the abstracted Java, but the data schemes. With that, I may be of significant use to someone doing practice management type feature development.


This is a great discussion. I was under the impression that MIMIC data is real but de-identified. From their web site: "MIMIC-III integrates deidentified, comprehensive clinical data of patients admitted to the Beth Israel Deaconess Medical Center in Boston, Massachusetts." The issue there is that it is intensive care information, and OpenEMR/MRS deals with outpatients. Nevertheless, a limited number of data fields, such as demographics, labs, diagnoses and medications, could be imported.

I could not find much background information on the OpenMRS reference application. How difficult would it be to have it work with LibreHealth?

We have created grant opportunities from our textbook proceeds, so over the past two years we have given out grants of $1K to $10K. I don’t know if this amount is too small to be useful to kickstart a project. Should we offer both synthetic and real data?

About 400 instructors have requested a download of my textbook, so I have their contact information and occasionally send them a MailChimp newsletter. Should we attempt a needs assessment survey to get input regarding (1) an educational open source EHR and (2) sample data for teaching?

I look forward to hearing everyone’s thoughts…Bob

I guess defining the initial scope of a base data set to be generated on demand would be a good thing to have, followed by an overview of the scope and scale of the special-case needs that a data generator would need to accommodate. My idea is this: take real data and use it to create a deviation model matching the data schema of the target application. Generate that data as part of the basic set, but allow for intentional or random extreme deviation to simulate syndromes. An example is the statistical distribution of biometrics (vitals). This can be based on gender, age, whatever. My preferred method is to use a random base, then use modifiers to affect the final values for a fake patient based on gender, age, BMI, and randomly assigned diagnosis, procedure/intervention and outcome codes. We have plenty enough resources and statisticians around here to generate some data that looks perfectly believable.
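As a rough illustration of the "random base, then modifiers" method described above, the following Python sketch builds one systolic reading. The function `systolic_bp`, the offset tables, and every coefficient are hypothetical placeholders; a real deviation model would be fitted from real data as the post suggests.

```python
import random

# Hypothetical modifier tables; the numbers are placeholders, not clinical values.
SEX_OFFSET = {"F": -3.0, "M": 2.0}
DIAGNOSIS_OFFSET = {"none": 0.0, "hypertension": 25.0}

def systolic_bp(rng, age, sex, bmi, diagnosis="none"):
    """Random base value, then additive modifiers for age, sex, BMI, diagnosis."""
    value = rng.gauss(115.0, 8.0)            # random base
    value += 0.4 * max(age - 30, 0)          # age modifier (assumed slope)
    value += SEX_OFFSET[sex]                 # sex modifier
    value += 0.8 * max(bmi - 25.0, 0.0)      # BMI modifier above an assumed threshold
    value += DIAGNOSIS_OFFSET[diagnosis]     # shift for an assigned diagnosis
    return round(value)

rng = random.Random(7)
patient = {"age": 62, "sex": "M", "bmi": 31.5, "diagnosis": "hypertension"}
reading = systolic_bp(rng, **patient)
```

Keeping the modifiers in plain tables means a statistician could swap in fitted values without touching the generator logic.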

Hi folks. I am ready to try out my demo data generator based on specialty. Clinical note references are drawn from ICD-10 and CPT text descriptions, and the correlation of diagnoses to interventions is taken from the PQRS and CQM standards. Now that the base data generator engine can consume this data, I have to do a test run with some sort of clinical area in mind. Any particular specialty you folks want to see?

Hello, Art. Elliot Sloane from Sarasota, FL here.

Thanks for tackling this.

I’d like to suggest single diseases like diabetes, congestive heart failure, chronic obstructive pulmonary disease, and/or hypertension. Later, demo data for patients with pairs of diseases would be helpful, like diabetes with hypertension, or CHF with COPD.

You’ll probably need normal clinical data value targets and variance ranges for each disease in order to ensure the data seems “realistic,” too. Who is supplying those for you?

Maybe some of us could form “panels” to help review or score the data sets in some way? Culling of unreasonable data may be needed.

If you can produce concentrated sets of “disease” patients, and those can be mixed into generic/blind/random data sets later, it may make analytics learning, training, and research faster and more efficient for teachers and students.

Thanks, Elliot

Well, first, Elliot, think you might head up towards Pinellas County anytime soon? Bet we could sit and whip up some real good data sources! I could hit “infectious diseases” as a broader spectrum first try, and the typical “train wreck” combos from the smaller sub-groups, but I think that just hitting a single disease is the right way to go.
You are right that we need the deviation, variance, and typical aging response for vital statistics…and I don’t have anyone supplying that or anything else yet. Algorithms that are related to multiple patient systems would be great, even just to give realistic-seeming (if not statistically perfect) results.

Hi, Art.

I will be driving to Tampa through Pinellas County every Monday to teach later this month, so I can probably stop by to work with you, Art. Infectious diseases is indeed a broad category, but even the basic flu kills 35,000 Americans annually. Flu tends to afflict those with other diseases early, too, and many of those who are hospitalized or die have underlying chronic diseases and deficits. The flu is interesting, too, because CDC posts weekly updates about nationwide statistics. It is therefore an interesting disease to use as a teaching model because it is very tangible.

More complex infectious diseases can be challenging, though, because they are often very localized, and the symptoms can be hard to tease out. Any single disease is a fine start, because each disease initially lends itself to profiling strategies that a student can understand.

Most diseases, like diabetes or CHF, have fairly natural progressions over time. Perhaps you can build in a “nominal” disease road map of sorts (train wreck?) for each disease? That can give students a chance to develop a more mature, time-variant concept about diseases.

Could your data generation tools lend themselves to Markovian or Monte Carlo simulation of symptoms and patient data? Even if only limited, it could give learners more interesting and complicated disease patterns to analyze.
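For anyone unfamiliar with the idea, a Markov-style progression with Monte Carlo runs could look something like the Python sketch below. The stage names and yearly transition probabilities in `TRANSITIONS` are invented for illustration and do not describe any real disease; the function names are likewise assumptions.

```python
import random

# Hypothetical yearly transition probabilities between disease stages.
TRANSITIONS = {
    "healthy":  [("healthy", 0.90), ("early", 0.10)],
    "early":    [("early", 0.75), ("advanced", 0.20), ("healthy", 0.05)],
    "advanced": [("advanced", 0.85), ("deceased", 0.15)],
    "deceased": [("deceased", 1.0)],  # absorbing state
}

def simulate_patient(rng, years, start="healthy"):
    """Walk one patient through the chain, one step per year."""
    state = start
    history = [state]
    for _ in range(years):
        states, weights = zip(*TRANSITIONS[state])
        state = rng.choices(states, weights=weights)[0]
        history.append(state)
    return history

def monte_carlo(n_patients, years, seed):
    """Repeat the random walk for a whole cohort (the Monte Carlo part)."""
    rng = random.Random(seed)
    return [simulate_patient(rng, years) for _ in range(n_patients)]

cohort = monte_carlo(500, 20, seed=2024)
deceased_count = sum(1 for history in cohort if history[-1] == "deceased")
```

Each patient history is a time-ordered list of stages, so a "nominal road map" for a disease becomes a matter of choosing the transition table.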

Starting simple is the way to go. If reasonable sets of representative patient data can be generated, that’s a good start. Maybe your data generation tools can “grow diseases” from/into de-identified “healthy patient” data sets, too (i.e., if a multi-year de-identified patient data set is available, maybe some of the patients could randomly “acquire” flu, CHF, diabetes, or other disease?).

Ah, the mind runs wild with the possibilities!


Too cool. Being a fan of the boardgame Pandemic, a 2D game designer, and a medical IT guy, I think there could be a really cool way to do the “take a regional map background image and simulate an epidemic” but that sounds like too much morbid fun. A great design philosophy states that after you have decided that something should be built (people in this case), you first determine what the end-of-service conditions are, and how to refit/recycle/dispose/recoup from the ownership. With that in mind, as we build fake people, we should start with what they look like at the end. Determine mortality conditions, then use statistical data to work from birth to that condition for each patient. Once that is done, you break up that record into something like 50 different databases that represent every clinic or hospital they have ever been to. Would that not give the statisticians great fun trying to put that back together? It could really show the need for interoperability.

I say we work from mortality statistics backwards.

In any case, I am working from the premise that all the data is more fun to generate out of whole cloth. The issues with de-identified data make it very difficult to make a more whole data set, or even a limited single-source data set of quality. Easier to do a full health history, then break it up, hide parts, mis-diagnose/document and other fun stuff to make it “realistic”. Compared to typical notes out there, building docblocks out of things like “wiped incision area with moonshine”, “Inserted 28mm french catheter cryoablation sheath up left nostril” and all that stuff could create more complete documentation than would normally be available. Take a “real” note, then write the script that creates it. Then, if you like, leave some parts out. It is important to note that we are not trying to create individualistic writing styles and the like. It doesn’t matter that we use boilerplate pieces, especially when we have a LOT of boilerplate pieces available.
I really like the idea of being able to create “evolving situations” with the patient data.

OK, jotting down bits on a notepad while staring over the Panamanian countryside and slurping high-quality coffee brewed in the most hideous way ever invented (percolators make bad coffee!), I spent a good amount of time figuring out how I was going to pull together database generation for a fairly simple proof of concept that might pass clinical muster. I settled on hypertension. I figured I would take the following elements into consideration:

  • Height
  • Weight
  • Calculated BMI (possibly tested body fat)
  • Age
  • Gender
  • Exercise and activity levels/types
  • Diet factors/eating habits
  • The usual lifestyle factors of alcohol/caffeine/nicotine
  • Medications
  • Systolic
  • Diastolic
  • Heart Rate
  • Respiration Rate
  • O2 saturation
  • Genetic factors
  • Geopolitical factors

Naturally the above list needs to be compared against a real risk-factor screening list, as I had no access to such at the time.

So, the idea is this:

  • Generate baseline vital statistics for a hypertension risk patient.
  • Randomly determine a compliance factor for that patient, meaning how well the person is going to stick to any given regimen.
  • Factor the above into how well the patient is going to be monitored, which will determine how frequent/regular the generated patient encounters will be.
  • Use flow-chart logic to determine interventions and outcomes as each “visit” is generated with random statistically relevant variations.

The “constants” that affect the outcomes are in a configurable list, so that if you decide that eating beef, smoking cigars and drinking sherry are a sure cure for hypertension, you could skew the data to that effect.
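A stripped-down sketch of the compliance-driven visit generation and the configurable constants, written in Python, might look like this. It is not the actual generator: `CONFIG`, `generate_visits`, and all of the numbers are assumptions, and the flow-chart intervention logic is reduced to a single per-visit effect to keep the example short.

```python
import random
from datetime import date, timedelta

# Configurable "constants"; values are illustrative placeholders only.
CONFIG = {
    "base_interval_days": 90,   # scheduled follow-up interval
    "visit_effect_mmhg": -2.0,  # systolic change per attended visit
    "missed_effect_mmhg": 1.5,  # systolic drift per missed visit
    "noise_sd": 4.0,            # random variation on each recorded reading
}

def generate_visits(rng, start, n_scheduled, baseline_systolic, config=CONFIG):
    """Return (compliance, visits); missed appointments produce no encounter."""
    compliance = rng.uniform(0.3, 1.0)  # how well the patient sticks to a regimen
    systolic = baseline_systolic
    day = start
    visits = []
    for _ in range(n_scheduled):
        day += timedelta(days=config["base_interval_days"])
        if rng.random() < compliance:
            # Attended visit: apply the intervention effect and record a reading.
            systolic += config["visit_effect_mmhg"]
            reading = round(systolic + rng.gauss(0, config["noise_sd"]))
            visits.append({"date": day, "systolic": reading})
        else:
            # Missed visit: no record, and the underlying value drifts upward.
            systolic += config["missed_effect_mmhg"]
    return compliance, visits

compliance, visits = generate_visits(random.Random(11), date(2020, 1, 1), 12, 150.0)
```

Passing a different `config` dictionary is all it takes to skew the outcomes, which matches the intent of keeping the constants in one editable list.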

I have everything in place to do the above, meaning creating the notes with appropriate text and empirical data points in the vitals, medications, etc., with the exception of the range of factors that determine an HBP patient, the logic that says what interventions/regimens are prescribed, and the co-factor and recombinant effects.

I would like assistance with the logic flow and deterministic factors of this, as well as determining the relevance of interventions, contraindicated events etc…

Print-a-Patient Version 0.7.1.beta

Get it while it is hot…even though the bandwidth is a soda straw… It is tested and works for the current dev tip of LibreHealthEHR. It is plenty good if you just want to fill up a database with standard stuff for development testing. Once I am finished with all this quality measure stuff, I will turn on the bits that put out medical coding for diagnosis, procedures and outcomes as per the measure expectations. Using those 243 quality measure sets as a basis for a “who gets what” scheme, I will start saddling our fake patients with all kinds of misery, just like you folks enjoy.

Just so I understand what you are doing. The PrintaPatient tool will create SQL commands to populate the LibreHealth EHR demo with synthetic patient data (demographics, X12, insurance and appointments?). Is this primarily intended for developers to test the system?

Tests are important, yes. The ability to load half a million records to stress test things is important to devs. The ability to generate a demo database without any worry at all is important for vendors and other demos. The ability to create safe databases for technical application training, clinical data entry, and medical billing entry is also very important. As the system grows and more diagnoses, medications, and procedures are added to encounters, the logic behind it will be cued by both fairly random data and diagnostic indicators…with the later “encounters” getting the relevant diagnosis and treatment protocols added (with random variation for realistic effect). The latest version of the PrintaPatient program creates database inserts for the base install.
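To show the general shape of "creates database inserts", here is a small Python sketch that emits SQL `INSERT` statements for synthetic patients. It is not PrintaPatient and does not use the LibreHealth EHR schema: the table name `demo_patient`, its columns, and the name lists are all hypothetical.

```python
import random

# Tiny illustrative name pools; a real tool would load much larger lists.
FIRST_NAMES = ["Ana", "Ben", "Chloe", "Dev"]
LAST_NAMES = ["O'Neil", "Smith", "Garcia", "Lee"]

def sql_quote(value):
    """Quote a string literal by doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"

def patient_inserts(n, seed, table="demo_patient"):
    """Build n INSERT statements for a hypothetical patient table."""
    rng = random.Random(seed)
    statements = []
    for pid in range(1, n + 1):
        first = rng.choice(FIRST_NAMES)
        last = rng.choice(LAST_NAMES)
        # Day is capped at 28 so every generated date is valid in any month.
        dob = (f"{rng.randint(1930, 2015):04d}-"
               f"{rng.randint(1, 12):02d}-{rng.randint(1, 28):02d}")
        statements.append(
            f"INSERT INTO {table} (id, fname, lname, dob) "
            f"VALUES ({pid}, {sql_quote(first)}, {sql_quote(last)}, {sql_quote(dob)});"
        )
    return statements

script = "\n".join(patient_inserts(5, seed=1))
```

Raising `n` is how the same script scales from a five-patient demo to a stress-test load; for very large runs, writing statements to a file as they are produced would avoid holding them all in memory.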