While researching NLP and cTAKES in more detail, I ran across the i2b2 project, which releases de-identified research patient datasets for NLP training. It looks like the released datasets cover about 1500 patients. Has anyone explored this before? The requesting university needs to register and sign a data use agreement. Would this be simpler than de-identifying Sam’s progress notes? Just asking.
I have looked at the i2b2 dataset before, but it didn’t seem useful in an EHR unless we make it fit the SOAP note structure or another specific type of clinical note and build those forms in the EHR. The NLP in cTAKES works well, but deep learning methods for NLP have been on the rise over the last 2-3 years.
I was hoping for more discussion regarding the issue of NLP. It is my understanding that cTAKES was designed specifically for free text in EHRs, which is a plus. On the other hand, there might be better NLP software that is less labor-intensive. I think it would be appealing to have NLP integrated with LibreHealth, but I frankly don’t know how many universities would take advantage of it. Building and testing the pipelines is complex.
I’m going to start working on more details on this next week… Thanks for the extras!
Have you seen this: https://github.com/GoTeamEpsilon/ctakes-rest-service ? Know anything about this group? @rhoyt, I found that they are building a RESTful service and a web client UI for cTAKES, and they also use OpenEMR.
I was unaware of the cTAKES RESTful API project. Sounds like Tony should review this to be sure he is not trying to reinvent the wheel.
I will definitely take a look.
If the format of the i2b2 dataset means more work, I would be happy to take some (10-20) of Sam’s encounter notes and de-identify them for student exercises and possibly NLP down the road. Can we make that happen?
It has been extremely difficult to find a grant for LibreHealth EHR enhancement, but it is easier to find one for a biomedical data science platform when coupled with the sister project on Data World. In other words, if our initiative can promote biomedical data science education and training by providing NLP training, a FHIR sandbox, descriptive analytics, and possibly predictive analytics, this might be attractive for future grants. Thoughts?
Give the cTAKES NLP guys (@hzbarcea) a bit to see what they can do first, but after that we can provide the notes privately for editing if needed (or, better, find a chunk that doesn’t need editing).
I am in agreement. I think we can create a very cool, closed-loop teaching tool using NLP, NHANES, FHIR, and other resources.
Another thought would be to somehow connect EHR data output (CSV) to WEKA, the open-source Java-based machine learning program, which accepts CSV input. Truthfully, most programs are not ready for this, but it would be an interesting proposal to create the EHR-machine learning loop for predictive analytics.
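To make the EHR-to-WEKA idea a bit more concrete, here is a minimal sketch (not anything that exists in LibreHealth EHR today) of turning a CSV export into WEKA's ARFF format, which states each attribute's type explicitly. The function name `csv_to_arff`, the relation name, and the sample columns are all made up for illustration, and the type guessing is deliberately simple: a column is numeric if every non-empty value parses as a number, otherwise nominal. Column names or labels containing spaces would need quoting, which this sketch does not handle.

```python
import csv
import io


def csv_to_arff(csv_text, relation="ehr_export"):
    """Convert CSV text (header row first) into ARFF text for WEKA."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]

    def is_number(value):
        try:
            float(value)
            return True
        except ValueError:
            return False

    lines = ["@relation " + relation, ""]
    for i, name in enumerate(header):
        values = [row[i] for row in data if row[i] != ""]
        if values and all(is_number(v) for v in values):
            lines.append("@attribute " + name + " numeric")
        else:
            # Anything non-numeric is treated as a nominal attribute.
            labels = sorted(set(values))
            lines.append("@attribute " + name + " {" + ",".join(labels) + "}")
    lines += ["", "@data"]
    for row in data:
        # ARFF marks missing values with "?".
        lines.append(",".join(v if v != "" else "?" for v in row))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up sample rows, only to show the shape of the output.
    print(csv_to_arff("age,sex,a1c\n54,F,6.1\n61,M,\n"))
```

WEKA can also open CSV files directly, so a step like this is optional; it mainly helps when students should see the attribute types before loading the data.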
Over the course of the past year, 17 universities have contacted me with interest in the EHR and/or data science platform. Many of the recent ones followed the two talks I gave at AMIA. Clearly, most of these schools are academic non-medical centers and tend to be small to medium programs. They are thirsty for tools to help teach EHR competency and basic data science.
I have opted to treat the LibreHealth EHR and Data World projects as sister projects as both would benefit informatics programs. It is my hope and vision that we could get an educational grant for this new platform to promote biomedical data science. We would hope that the API sandbox, FHIR, cTAKES, etc. can be part of that platform.
The fall semester is right around the corner, so perhaps we can get 3-5 universities to use the platform, but gear up for more by the next semester. So that I don’t have to type an email epiphany each time to explain our status, I put together a little “white paper” on where we are that I can send to universities. Feel free to comment and correct. Happy Fourth.
Thanks for sharing the doc. It’s an important project and WEKA integration will be very useful. We also need a way to import data from CSV format without needing custom tools each time. A configurable CSV importer will be very useful. I do think that it’s not a small project. It will take nearly half a year of development, and more for testing, implementation, feedback and rework. I think this is very suitable for the SCH grants: https://www.nsf.gov/pubs/2018/nsf18541/nsf18541.htm#pgm_desc_txt. In particular, the health information infrastructure or connecting data areas seem appropriate.
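As a rough illustration of what "configurable" could mean here, the sketch below drives the import from a mapping table instead of custom code per file. Everything in it is hypothetical: the `MAPPING` column names, the target field names, and the `import_csv` function are invented for the example and do not reflect any existing LibreHealth EHR schema or tool.

```python
import csv
import io

# Hypothetical mapping: CSV column name -> (target field, converter).
MAPPING = {
    "Patient ID": ("patient_id", str),
    "DOB": ("birth_date", str),
    "Weight (kg)": ("weight_kg", float),
}


def import_csv(csv_text, mapping):
    """Yield one dict per CSV row, renamed and converted per the mapping."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [col for col in mapping if col not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("CSV is missing columns: " + ", ".join(missing))
    for row in reader:
        record = {}
        for column, (field, convert) in mapping.items():
            raw = (row[column] or "").strip()
            # Empty cells become None instead of failing conversion.
            record[field] = convert(raw) if raw else None
        yield record


if __name__ == "__main__":
    # Made-up sample input; columns not in the mapping are ignored.
    sample = "Patient ID,DOB,Weight (kg),Extra\np1,1970-01-01,80.5,x\n"
    for record in import_csv(sample, MAPPING):
        print(record)
```

Supporting a new file layout would then mean editing the mapping rather than writing a new tool, though a real importer would also need validation, error reporting, and a way to store the mappings.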
I’m glad you see the merit in this proposed platform. Thanks for sharing the link to the NSF grant. I was not familiar with it. As a non-computer scientist, I see the merit in using WEKA to teach an introduction to machine learning with no coding or math. I also put together a concept/mind map that ties all of this together. Connecting WEKA directly with LibreHealth EHR would be very significant.
Do we need other academic partners, particularly programs who have successfully been funded by the NIH or NSF?
There is already a Java web connector to connect Data World to WEKA. The issue is that it opens up a SQL query window rather than loading the CSV file, which would be better in my opinion.
I believe it would be unique to integrate a machine learning program like WEKA with an electronic health record. The same could be said for integrating NLP software and a few other ideas we have come up with.
Would be nice to have other interested academic partners, if they want to be part of the work that needs to be done.
Several universities are trying to get me to join their faculty, and one of my criteria will be their willingness to support this platform and the depth of that support.
Sorry for the late reply, 4th of July and all. But I do get the notifications.
I had a few conversations with @tony and I think the idea is great and we support it. In more concrete terms, cTAKES works on individual text snippets; it doesn’t matter whether they come from a SQL database, individual files, etc. We packaged it as a service and it works with both synchronous (REST) and asynchronous (JMS) calls. Working on the anonymizer now.
It’ll take a bit to productize it. Because of the constant changes in the cTAKES processing pipelines, I think it’s more suitable to run it as an external service, as part of the mentioned platform. Also, because the pipelines differ for different needs/users, it looks like a multi-tenant model is needed, which cTAKES does not support directly.
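For anyone who wants to picture the synchronous (REST) case, here is a minimal sketch of sending one text snippet to an external cTAKES service. The URL in `CTAKES_URL` is a placeholder, not the actual endpoint of our service or of the ctakes-rest-service project, and the plain-text request body and the response format are assumptions as well.

```python
import urllib.request

# Hypothetical endpoint; the real path depends on how the service is deployed.
CTAKES_URL = "http://localhost:8080/ctakes/analyze"


def build_request(snippet, url=CTAKES_URL):
    """Build a POST request carrying one plain-text note snippet."""
    return urllib.request.Request(
        url,
        data=snippet.encode("utf-8"),
        headers={"Content-Type": "text/plain; charset=utf-8"},
        method="POST",
    )


def analyze(snippet, url=CTAKES_URL, timeout=30):
    """Send the snippet and return the raw response body as text."""
    with urllib.request.urlopen(build_request(snippet, url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Because the caller only ever hands over a snippet of text, the same call works whether the note came from a SQL database or from a file. The asynchronous (JMS) path would look different and is not sketched here.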
I ran across information about a WEKA server option: https://wiki.pentaho.com/display/DATAMINING/Weka+Server
As a non-computer scientist, I’m not sure whether this helps in any way with our desire to integrate machine learning software with LibreHealth EHR.