Disappointing email today from Harrisburg College, which was supposed to be one of the four beta testers. For whatever reason, the program head thought he had the luxury of testing it over the next 3 semesters, so his students have not started to use it yet and hence will have no input for the survey. We are down to 3 beta testing universities. Hard to understand!
Yeah, a little disappointing. We are planning to use it in two courses next semester at IUPUI, but we would like to include more continuous data, i.e. multiple sequential encounters for a single patient. Dr. Josette Jones has created synthetic data using Synthea (https://github.com/synthetichealth/synthea/wiki) and we are looking to import this using the script that was created by @yehster.
What is the best way to do this import? @yehster do you have some documentation on your import script?
It would be very interesting to have an instance of LibreHealth EHR using Synthea data. I did look at Synthea about six months ago and didn’t see much granularity but perhaps I didn’t understand everything. Please do share your progress. There might be interest in having LibreHealth EHR populated with either NHANES or Synthea based on student needs.
Why not just import the various data sets into one instance?
The plan is to import it into the same database which has NHANES. The problem we see with NHANES is that it’s a one-off and survey data. Hopefully, Synthea will be more realistic of patient visits.
Please do keep us posted by continuing to update what you find on this Forum. I suppose synthetic patients might be mixed with NHANES patients. I suspect neither dataset is ideal, but we could not afford to purchase “real” de-identified data. I tried to explain that to the National Library of Medicine but I don’t think they understood the need for real outpatient data.
Can you share the data you have generated?
I have managed to get Synthea running on my system, but don’t really understand the output yet.
I can share the code for my scripts, but there isn’t documentation.
There are three components to my scripts for NHANES.
- Retrieving the data from the CDC and loading them into a database
- Mapping the NHANES database records into LibreEHR concepts
- Loading that mapped data into the EHR
While there is some potential for code reuse of the third component, loading Synthea data into the EHR will be a large undertaking.
For example, because NHANES is only a single time slice, my scripts aren’t capable of generating multiple sequential encounters.
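To make the "multiple sequential encounters" gap concrete, here is a rough sketch of the grouping step an importer would need before loading anything into the EHR. It is not part of the NHANES scripts. The column names (`Id`, `START`, `PATIENT`, `DESCRIPTION`) and the timestamp format are assumptions about what a Synthea CSV export looks like and should be checked against real output; `group_encounters` and the sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Hypothetical sample rows; column names are an assumption about the export format.
SAMPLE = """Id,START,PATIENT,DESCRIPTION
e2,2019-03-01T09:00:00Z,p1,Follow-up visit
e1,2018-11-15T10:30:00Z,p1,Initial visit
e3,2019-01-20T14:00:00Z,p2,Annual physical
"""


def group_encounters(csv_text):
    """Return {patient_id: [(start, encounter_id, description), ...]} sorted by start time."""
    by_patient = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumed timestamp layout; adjust if the real export differs.
        start = datetime.strptime(row["START"], "%Y-%m-%dT%H:%M:%SZ")
        by_patient[row["PATIENT"]].append((start, row["Id"], row["DESCRIPTION"]))
    # Sorting the tuples orders each patient's visits chronologically.
    return {pid: sorted(visits) for pid, visits in by_patient.items()}


timeline = group_encounters(SAMPLE)
for patient_id, visits in timeline.items():
    for start, encounter_id, description in visits:
        print(patient_id, start.date(), encounter_id, description)
```

Each patient's ordered list would then have to be mapped to EHR encounters one by one, which is the part the existing loader does not do.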
I assume you could use NLP to analyze the synthetic patient notes?
We discussed the idea of uploading de-identified patient encounters to LibreHealth EHR, as NHANES data lacks patient encounters. With 9600 patients uploaded we could probably match demographics and disease entities, e.g. 62 yo female with type 2 diabetes, HTN, etc. This also might give us longitudinal data by having more than one encounter per patient uploaded. I think there would be great teaching potential in doing this for both clinical and non-clinical students. Tony had mentioned getting de-identified patient encounters from you earlier, but I thought that since you have now joined the Forum I would ask you directly. Thanks.
It would be fine with me. I thought Tony had already asked and I said it would be fine. We have about 30,000 registered patients dating back to 2004.
Sam Bowen, MD
I can be a source for that data (redacted and converted to LHEHR).
@aethelwulffe Please clarify, as the last time you were talking about synthetic data from a patient generator. Are you saying you can do something with Dr. Bowen’s real data?
Art is helping us with the conversion from OpenEMR to LibreEHR. The data can be de-identified and then used for the educational version. He already has the converted data, so it is just a matter of de-identifying it. He already has my permission to do this.
Sam Bowen, MD
Art, you will need to look at de-identifying the speech dictation data as well. I normally used the full name of the patient in these encounters.
Sam Bowen, MD
Great. I don’t think we are talking about that many patients and I would be willing to match these patients to the demographics in the NHANES dataset. In other words, match the 30 yo AA male with HTN in your patient population with the 30 yo AA male with HTN in the NHANES dataset.
If the NHANES data has what we need… Dr. Bowen’s database alone is larger than the NHANES data, and I would intend to combine it with others. Aside from that, we can generate meaningful patient names quite easily. Otherwise, what is the use of combining it with your NHANES database? That matching would be much harder, and would have a large number of edge cases.
Sam, we have to remove not only the patient names, but also run a de-identification round where we first replace any instances of the patient’s known original name with [FirstName] [LastName], then run the whole content through a personal pronoun limiting check and a manual edit pass. This can either be done carefully, or we can just stamp any word not in the medical dictionary with [REDACTED]. This collides with the typical medical documentation practice of misspelling every other word though…
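As a rough illustration of those two options (not the actual conversion tooling), a minimal sketch might look like the following. The function names, the sample note, and the tiny vocabulary are all hypothetical; a real pass would use a proper medical dictionary and would still need the pronoun check and manual review.

```python
import re


def replace_names(text, first, last):
    """Swap the patient's known first and last name for placeholders (case-insensitive)."""
    text = re.sub(rf"\b{re.escape(first)}\b", "[FirstName]", text, flags=re.IGNORECASE)
    text = re.sub(rf"\b{re.escape(last)}\b", "[LastName]", text, flags=re.IGNORECASE)
    return text


# Match either an existing placeholder (kept as-is) or a run of letters (checked).
PLACEHOLDER_OR_WORD = re.compile(r"\[(?:FirstName|LastName)\]|[A-Za-z]+")


def stamp_unknown(text, vocabulary):
    """Blunt option: replace every word not in the vocabulary with [REDACTED]."""
    def check(match):
        token = match.group(0)
        if token.startswith("[") or token.lower() in vocabulary:
            return token
        return "[REDACTED]"
    return PLACEHOLDER_OR_WORD.sub(check, text)


# Hypothetical note and vocabulary, for illustration only.
note = "Jane Doe seen today. Mrs. Doe reports hedache."
vocabulary = {"seen", "today", "mrs", "reports", "headache"}

careful = replace_names(note, "Jane", "Doe")
blunt = stamp_unknown(careful, vocabulary)
print(careful)
print(blunt)
```

The misspelled "hedache" gets stamped out in the blunt pass, which is exactly the collision with real-world dictation mentioned above.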
Art, after further thought you are correct. There is no reason I would have to merge the patients. I would be very happy if we just started with a small experimental cohort (e.g. 10 patients) we could look at. They need to be identified somehow as a special cohort (what is the best way to do this? SSNs?). We might tell students: if you are looking for longitudinal data, use the following patients… What would these patients come with? Encounter notes plus labs? X-ray reports? ICD-10 codes? Can these data be uploaded into existing EHR fields? Thanks.
The redacted data is essentially an entire clinic dump with a 15 year history. Images and similar data become very difficult to redact. It may be wise to set up very restricted access to a pre-redacted copy of this data, then have a review process that allows us to completely review patient charts for identifiable data, clean them up and mark each for export manually. It’s a lot of work, but for really complete real-world data, that might be the final solution. @sam-bowen can give a more detailed description of what data is in existence.
And we don’t need SSNs. We have about a million ways to organize different groups of patients, and to control who can access them, by just about any type of flag you want to use. If you want to categorize them in some way other than by a diagnosis, then we can add a TAG or a data field to designate the training program they are to be included in.
Have you already made the first redaction pass?
Sam Bowen, MD