Project: FHIR Analytics Using Apache Spark and Cassandra

(Prashadi) #13

Hi all,

@sunbiz @namratanehete @judywawira

I have done some research comparing CQL and Spark SQL.

Cassandra is a write-optimized database. The Cassandra user guides advise duplicating data when that optimizes read operations. Neither CQL nor the Cassandra APIs support JOIN operations across tables, which means complex queries spanning multiple tables can't be supported. This follows directly from NoSQL design principles. As I showed earlier, Spark SQL lets users write complex SQL for drill-down analysis. If we use CQL, we can only run queries against a single resource at a time.

Example of a complex Spark SQL query that selects patients based on valueQuantity:

SELECT patient.id, observation.id, observation.subject, observation.valueQuantity
FROM patient
INNER JOIN observation ON observation.subject.reference = patient.id
WHERE observation.valueQuantity.value > 15
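
To make the JOIN limitation concrete: with CQL alone, the application would have to run one query per table and combine the rows in its own code. Below is a minimal sketch of that in plain Java. The in-memory rows stand in for the results of two single-table queries, and the class and field names are hypothetical (not taken from our project); it also assumes the observation row carries the bare patient id.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ClientSideJoin {

    // Hypothetical, simplified stand-in for one row of an observation table.
    static final class Observation {
        final String id;
        final String patientId; // assumed to hold the bare patient id
        final double value;     // stands in for valueQuantity.value

        Observation(String id, String patientId, double value) {
            this.id = id;
            this.patientId = patientId;
            this.value = value;
        }
    }

    // Emulates the inner join plus the value filter in application code.
    // Returns "patientId:observationId" pairs for observations above the
    // threshold whose patient is present in the patient result set.
    static List<String> join(Set<String> patientIds,
                             List<Observation> observations,
                             double threshold) {
        List<String> result = new ArrayList<>();
        for (Observation o : observations) {
            if (o.value > threshold && patientIds.contains(o.patientId)) {
                result.add(o.patientId + ":" + o.id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample rows, as if fetched by two separate queries.
        Set<String> patients = new HashSet<>();
        patients.add("p1");
        patients.add("p2");

        List<Observation> observations = new ArrayList<>();
        observations.add(new Observation("o1", "p1", 20.0));
        observations.add(new Observation("o2", "p2", 10.0));
        observations.add(new Observation("o3", "p9", 30.0));

        System.out.println(join(patients, observations, 15.0)); // prints [p1:o1]
    }
}
```

This works for small result sets, but every row has to be pulled into the application first, which is the kind of work Spark SQL can distribute for us.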

Another limitation of Cassandra is that the WHERE clause can only be used with columns that are part of the primary key or have an index. For example, if a user wants to search patients by first name, the firstname column must be indexed. Likewise, the more column-based query options we need to provide, the more secondary indexes we will need to create. Cassandra discourages creating a large number of indexes because they degrade write performance.

CQL provides an option to filter on non-indexed columns via ALLOW FILTERING [2], but its use is discouraged.

The online resources [3] suggest Spark as a viable option for executing complex queries.

@sunbiz @yashdsaraf I think we need to review our data model for storing patients carefully. As I understand FHIR, we need to support basic search operations based on resource attributes. If our data model doesn’t fit, it will cause performance issues as the data grows. I’m going to write a blog post about my findings.

Given these limitations, I think Spark is the viable option. Are we going to have a modular approach in LibreHealth? If so, we could fit FHIR Analytics in as a separate module and use it where appropriate.

References

[1] https://docs.datastax.com/en/cql/3.1/cql/cql_reference/select_r.html

[2] https://www.datastax.com/dev/blog/allow-filtering-explained-2

[3] https://www.datastax.com/2015/03/how-to-do-joins-in-apache-cassandra-and-datastax-enterprise


(Prashadi) #14

@sunbiz @yashdsaraf is it a good idea to store the entire resource as JSON in a column for each resource?


(Prashadi) #15

@sunbiz @namratanehete @judywawira I have completed the blog post with my findings: https://medium.com/@prkpbandara/gsoc-librehealth-working-with-cassandra-for-fhir-analytics-9e66eecec6a7

Let me know your thoughts and suggestions.


(Prashadi) #16

@sunbiz @namratanehete @judywawira since we have several limitations with CQL, I’ll be focusing on the Spark-based query builder approach. If you have any suggestions, please do let me know. My plan is to get the initial version done and merge it into @yashdsaraf’s repository.


(Namratanehete) #17

@prashadi I asked you yesterday in the LibreHealth chat: are you missing a dependency for jackson-databind? I am getting “java.lang.ClassNotFoundException: com.fasterxml.jackson.databind.exc.InvalidDefinitionException” while trying to execute your code. I can see you have specified jackson.version in pom.xml, but it is not used anywhere.


(Prashadi) #18

@namratanehete Let me quickly check on this and get back to you soon. Sorry I missed the notifications.


(Prashadi) #19

@namratanehete I have converted the project to a WAR. After several hours of solving dependency issues, it now deploys successfully as a web application.

Please follow https://gitlab.com/kavindya89/librehealth-fhir-analytics/blob/master/README.md to deploy the project and try out the functionality.


(Prashadi) #20

Hi All,

To give an update, I’m working on a patient attribute-based search view with a query builder for Apache Spark. I’ll push the changes as soon as the initial version is done.

I came across an article that discusses FHIR analytics (https://www.linkedin.com/pulse/analytics-fhir-chris-grenz). The author uses Apache Drill, a SQL query engine for big data. Apache Drill lets users load FHIR data in files from MongoDB, HBase and S3. It is not a very good fit for us, but it is interesting: it converts FHIR resources to a SQL table-based format, like we do with Apache Spark using Bunsen.


(Prashadi) #21

@sunbiz @namratanehete @judywawira I have added a basic version of the query builder along with support for searching by patient attributes. I have written a complete blog post on the current progress, and the code is committed.

https://medium.com/@prkpbandara/gsoc-librehealth-fhir-analytics-using-spark-sql-9019dcb41593

Please do let me know of any suggestions and improvements.


(Prashadi) #22

@yashdsaraf is it OK if I start integrating this component with your repository now? I saw that the CRUD operations are now working, with minor issues.


(Yash D. Saraf) #23

@prashadi For executing CRUD operations, will you be using Spring Data repositories or making REST API calls?
If it’s the former, you can use the PatientRepository directly; if it’s the latter, only the READ and DELETE operations are functional right now.


(Prashadi) #24

For analytics, I need to read the data directly from Cassandra, because I’ll be using the Spark Cassandra Connector to access it. I’m currently working on the UI and resource analytics. You can work on adding a few more resources.


(Prashadi) #25

@sunbiz @namratanehete @judywawira Please find my current progress at https://medium.com/@prkpbandara/gsoc-librehealth-librehealth-fhir-analytic-capabilities-c35f57e36a29. Since the UI functionality is complete, my next target is to integrate this module with @yashdsaraf’s project. Since he has almost completed the patient resource, I believe I can test the module using it. Let me know if you have any suggestions or improvements after going through the blog posts.


(Prashadi) #26

An interesting article that uses Amazon Alexa with FHIR: https://medium.com/@alastairallen/evolve-and-alexa-on-fhir-30041e49bc51


(Namratanehete) #27

Hey @prashadi, I am getting the following error after the latest update of the code. Please let me know how to fix it.

https://pastebin.com/800YQhE5


(Prashadi) #28

@namratanehete it’s strange, as it’s working fine on my machine. Are you trying this WAR file with apache-tomcat-8.5.31 and jdk1.8.0_111?


(Prashadi) #29

@namratanehete I added another update as well, with Spring auto-configuration. Please pull the latest code and try it.


(Namratanehete) #30

OK, let me pull the update and get back to you. Thank you.


(Prashadi) #31

Hi All,

This is an update on my current task, which is integrating the analytics component with the Spring Data module. I spent the past week fully on integrating the two modules, and I managed to sort out several complex dependency issues one by one. Since applying all of my changes to the Spring Data module would add too many changes, I have integrated Spring Data into the analytics module instead. We can rename it and push it to the Spring Data module later, once the functionality is working fine.

I have hit an issue with using RouterFunction and RestController together (https://jira.spring.io/browse/SPR-15405). The issue states that both can be used together. I can see that all the request mappers are getting registered, but I can’t access the Spring Data REST APIs at the given URL. @yashdsaraf would you be able to have a quick look at why the Spring Data REST APIs aren’t working? The integration can be found at https://gitlab.com/kavindya89/librehealth-fhir-analytics/tree/integration.

@sunbiz @namratanehete @judywawira I’ll continue to look into this issue on my side as well. This is the only blocker I have encountered during the integration of the two modules.


(Yash D. Saraf) #32

@prashadi I have checked out the integration branch. I can see the RouterFunctions are loaded but are not accessible. However, when I switched to using the RestController instead, it seems to be working. I can’t test it, though, since I’m getting this error while the sample data is being loaded:

Exception in thread "restartedMain" java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.springframework.boot.devtools.restart.RestartLauncher.run(RestartLauncher.java:49)
Caused by: java.lang.NullPointerException
	at org.librehealth.fhir.analytics.cassandra.CassandraDataServiceImpl.insertDemoData(CassandraDataServiceImpl.java:108)
	at org.librehealth.fhir.LibreHealthFHIRAnalyticsApplication.init(LibreHealthFHIRAnalyticsApplication.java:78)
	at org.librehealth.fhir.LibreHealthFHIRAnalyticsApplication.main(LibreHealthFHIRAnalyticsApplication.java:64)
	... 5 more

It would be great if you know how to fix it. As soon as I’m able to fix it myself, I’ll make a pull request with all RouterFunctions switched to RestControllers.

Update

@prashadi The error I mentioned was just a file separator not being parsed correctly on Windows for the DATA_PATH constant. The data is being loaded correctly now, and I’m able to access both our namespaces’ data using cqlsh, but the REST query always returns empty. It doesn’t return a 404, though, just an empty JSON object with status 200.
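
For anyone who hits the same separator problem: a path assembled with a hard-coded "/" or "\" can break when the code runs on another platform. Here is a small sketch of building the path with java.nio.file.Paths instead. The directory segments are made up for illustration and are not the project's actual DATA_PATH value.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class DataPathExample {

    // Hypothetical directory layout; the real DATA_PATH may differ.
    static Path dataPath(String baseDir) {
        // Paths.get joins the segments with the platform's own separator,
        // so no "/" or "\\" literal appears in the code.
        return Paths.get(baseDir, "data", "patients");
    }

    public static void main(String[] args) {
        // The printed separator depends on the platform the code runs on.
        System.out.println(dataPath("fhir-demo"));
    }
}
```

Keeping the constant as a Path (or building it from segments at startup) means the same code can run on Windows and Linux without any string replacement.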