From VistA EMR Community: Would like to collaborate on synthetic patient data

shabiel · June 29, 2017, 1:09pm

Hello there,

I am Sam Habiel, Pharm.D., Technical Fellow at the Open Source Electronic Health Record Alliance (OSEHRA). I met Tony McCormick at the OSEHRA Summit this year.

Tony told me about the project to create synthetic patients using NHANES data. I am interested, as that will help us produce educational and demo instances of VistA. Lack of good demo data is a big issue for our community. I saw that there is a big thread on the topic here. If somebody could summarize where you are right now and how outsiders can contribute, I would appreciate it.

I have another project that I would like collaboration on: I and others in an organization called WorldVistA are working on creating a public-domain, high-quality drug interaction set. We have made several releases. The latest can be found here: https://github.com/glilly/osdi. You may find this data useful for your EMR. If you would like to use it and support us in maintaining the data, we will be happy to collaborate.

–Sam Habiel, Pharm.D. Technical Fellow OSEHRA

yehster · June 29, 2017, 3:46pm

Hi Sam, we are using the NHANES data set (https://www.cdc.gov/nchs/nhanes/index.htm) to create simulated patients. There are 9000+ records created on nhanes.librehealth.io. The username/password is admin/password if you want to check it out.

In addition, Dr. Hoyt has been manually creating some more detailed patient histories (10 or so at the moment).

I have scripts which download and parse the 2011-2012 NHANES datasets, then create patient records with demographics, problem lists, medications, a few labs, and other things like smoking, alcohol, race, and income.

It should be possible to use the same data to create records in VistA; the key issue to solve would be the interface to create the records.

Because LibreEHR is web based, my scripts simulate browser activity, creating records by mimicking the actions a human would take to enter the data manually.
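
To make the "simulate the browser" idea concrete, here is a minimal sketch of building the body of a form post the way a browser would encode it. The field names (fname, lname, DOB, sex) and the helper buildPatientForm are hypothetical placeholders for illustration, not the actual LibreEHR form fields.

```javascript
// Sketch only: encode a patient record as an
// application/x-www-form-urlencoded body, as a browser form submit would.
// The field names below are hypothetical, not the real EHR form fields.
function buildPatientForm(patient) {
  const form = new URLSearchParams();
  form.append("fname", patient.firstName);
  form.append("lname", patient.lastName);
  form.append("DOB", patient.dob);
  form.append("sex", patient.sex);
  return form.toString();
}

const body = buildPatientForm({
  firstName: "Test",
  lastName: "Patient",
  dob: "1970-01-01",
  sex: "Female",
});
console.log(body);
```

A string like this would presumably be sent as the body of an HTTP POST, together with whatever session cookie the login step returned.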

One big question is what we need to accomplish to secure additional funding/resources to sustain this effort.

shabiel · June 29, 2017, 5:04pm

Thank you Kevin.

I actually will be writing M code to add the data directly into VistA. Can you share the scripts with me? We tried using runnable scripts for other projects and they just take too long for a Continuous Integration pipeline, which is what we are aiming for.

–Sam

yehster · June 29, 2017, 5:08pm

It’s Node.js code, not well documented yet, but shared on GitHub.

I can also share database dumps/csv files of the NHANES data.

shabiel · June 29, 2017, 5:32pm

Thank you.

Where’s the code that reads the NHANES files?

–Sam

yehster · June 29, 2017, 6:26pm

I am using R (https://www.r-project.org/) to load the NHANES data and convert it to .csv.

https://github.com/yehster/NHANESImport/blob/master/main.js#L205


        }
        if(parseDocuments)
        {
            parseCodeBook(year,yearPath,curRow.document);                
        }
    }
    
});
}


function createConvertScript(year)
{
var yearPath=process.cwd() + path.sep + "datafiles"+ path.sep + year+ path.sep;


dbConn.query("SELECT * FROM filelist WHERE year=?",[year],function(err,rows)
{
    var script="library(Nmisc)\n";
    for(var fileIdx=0;fileIdx<rows.length;fileIdx++)
    {
        var curRow=rows[fileIdx];
        var baseFile=curRow.sasfile.substring(0,curRow.sasfile.lastIndexOf("."));

This part of my scripts uses the command line to load the files and make .csv files.
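
Since the excerpt above is cut off, here is a rough, self-contained sketch of the same idea: Node.js code that writes out the text of an R script converting SAS transport (.XPT) files to .csv. It assumes read.xport from R's foreign package, which may not be the package the original script loads, and buildConvertScript is a hypothetical name.

```javascript
// Sketch only: build the text of an R script that converts each
// SAS transport (.XPT) file into a .csv file with the same base name.
// Assumption: read.xport from R's "foreign" package; the original
// script may rely on a different package.
function buildConvertScript(xptFiles) {
  const lines = ["library(foreign)"];
  for (const file of xptFiles) {
    const dot = file.lastIndexOf(".");
    const base = dot < 0 ? file : file.substring(0, dot);
    // JSON.stringify quotes the path and escapes backslashes and quotes.
    lines.push(`df <- read.xport(${JSON.stringify(file)})`);
    lines.push(`write.csv(df, ${JSON.stringify(base + ".csv")}, row.names = FALSE)`);
  }
  return lines.join("\n") + "\n";
}

console.log(buildConvertScript(["DEMO_G.XPT"]));
```

Saved to a file, the generated script could then be run from the command line with Rscript.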

I can send you the .csv files as a .zip if you want. Probably simpler.

shabiel · June 29, 2017, 7:45pm

Let’s do that. .zip will be good.

yehster · June 30, 2017, 12:14pm

The “pieces” of this process are as follows.

1. A web crawler which downloads the data files from the CDC, and a conversion step from SAS to .csv (resulting in the .zip file contents).
2. Scripts which load the .csv files into MySQL/MariaDB (1-to-1 mapping of the .csv files to database tables/columns).
3. Code which translates the database representations into JavaScript objects. This maps the “numeric values” from the CDC data into things like problem lists. For example, based on yes/no (1/2) answers to parts of the questionnaire for each respondent, we build problem lists: https://github.com/yehster/NHANESImport/blob/master/NHANESData.js#L180
4. Code which then uses these JavaScript objects and “applies them” to the EHR. An example is the form post that creates the patients: https://github.com/yehster/NHANESImport/blob/master/EHRConnection.js#L128
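
The third piece above (turning coded questionnaire answers into a problem list) can be sketched roughly as follows. This is an illustration, not the code in NHANESData.js: the column-to-problem table and the helper buildProblemList are assumptions, and the column names should be checked against the NHANES codebook before use.

```javascript
// Sketch only: map yes/no (1/2) questionnaire answers to a problem list.
// The column names and problem labels are illustrative assumptions;
// verify each against the NHANES codebook.
const QUESTION_TO_PROBLEM = {
  MCQ010: "Asthma",
  DIQ010: "Diabetes mellitus",
  BPQ020: "Hypertension",
};
const YES = 1; // 2 means "no"; other codes are ignored here

function buildProblemList(respondent) {
  const problems = [];
  for (const [column, problem] of Object.entries(QUESTION_TO_PROBLEM)) {
    // Number() also accepts string values such as "1" read from a .csv file.
    if (Number(respondent[column]) === YES) {
      problems.push(problem);
    }
  }
  return problems;
}

// Made-up respondent row, shaped like one database record.
const row = { SEQN: 1, MCQ010: 1, DIQ010: 2, BPQ020: 1 };
console.log(buildProblemList(row)); // [ 'Asthma', 'Hypertension' ]
```

Keeping the mapping in a plain lookup table means new questionnaire items can be added without touching the loop.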

shabiel · June 30, 2017, 12:28pm

Kevin,

You are a good elucidator! Thank you.

–Sam