Project: FHIR Analytics Using Apache Spark and Cassandra

(Saptarshi Purkayastha) #8

Why do you think it will not scale?


(Prashadi) #9

@sunbiz Sorry for the delay. I had to hand over my laptop for a RAM upgrade, because my machine kept getting stuck when running Cassandra and Spark together with the IDE.

The reason I believe it will not scale is that, as the database grows, Cassandra distributes the data across the nodes of the cluster. So if we execute cluster-wide joins over the FHIR data, it will put a very high load on the Cassandra nodes. Using Spark will reduce that load, as it executes the queries across the cluster in a distributed manner. What do you think?


(Namratanehete) #10

@prashadi You have a good point. But that means Spark will be a required dependency. Spark might be a requirement for the analytics engine, but I suggest we avoid a Spark dependency for the platform as a whole. Cassandra alone should work.


(Prashadi) #11

Thank you for the reply, @namratanehete. We can make analytics work with Cassandra only; it would then be based solely on CQL. With Bunsen, on the other hand, FHIR structures are exposed in a unified format, which allows users to perform more advanced queries in Spark than they could with Cassandra CQL. The capabilities of CQL will depend on how @yashdsaraf structures the data stored in Cassandra. One more thing we should consider is that a single FHIR resource itself contains complex elements. So we will need to store the data in appropriate columns that can be queried via CQL, because the query builder will ultimately build either CQL or Spark SQL, depending on the approach we take. I will do more research on this and come up with a small comparison.


(Yash D. Saraf) #12

Currently I’m using converters to store complex (non-primitive) data types. This means a lot of them are stored in JSON format. To store the data in separate columns, these complex data types will need to be declared as User Defined Types in Cassandra.
Spring Data Cassandra provides a way to do this by simply annotating the complex data type classes with @UserDefinedType, which again poses the same problem as before, i.e. we can’t annotate the HAPI FHIR classes without extending them (in this case extending won’t work either) or duplicating them.
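
To illustrate the trade-off described above, here is a minimal sketch in plain Python (not the Spring Data converter code used in the project). It assumes a simplified, made-up Patient resource and a hypothetical split into "primitive" columns and JSON-encoded columns. The helper names to_row and from_row are invented for this example.

```python
import json

# Hypothetical, simplified Patient resource; field names follow FHIR,
# but the values are made up for illustration.
patient = {
    "id": "p1",
    "gender": "female",
    "name": [{"family": "Perera", "given": ["Nimali"]}],
}

# Assumption: only these fields are stored as native column values.
PRIMITIVE_FIELDS = ("id", "gender")


def to_row(resource):
    """Split a resource into primitive columns and JSON-encoded complex columns."""
    row = {}
    for key, value in resource.items():
        if key in PRIMITIVE_FIELDS:
            row[key] = value              # stored as a native column value
        else:
            row[key] = json.dumps(value)  # complex element stored as JSON text
    return row


def from_row(row):
    """Rebuild the resource by decoding the JSON-encoded columns."""
    return {
        key: value if key in PRIMITIVE_FIELDS else json.loads(value)
        for key, value in row.items()
    }


row = to_row(patient)
restored = from_row(row)
```

In this sketch the name column is opaque text, so a column-level predicate could not select on name.family without decoding the JSON first. That is the reason separate columns or User Defined Types come up in the discussion.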


(Prashadi) #13

Hi all,

@sunbiz @namratanehete @judywawira

I have done some comprehensive research on using CQL and Spark SQL.

Cassandra is a write-optimized database. The Cassandra user guides suggest duplicating data if that optimizes read operations. Neither CQL nor the Cassandra APIs support JOIN operations across multiple tables, which means complex queries spanning multiple tables can’t be supported. This follows directly from its NoSQL design. As I showed earlier, Spark SQL allows users to write complex SQL for more drill-down analysis. If we use CQL, we can only run queries against a single resource type at a time.

An example of a complex Spark SQL query that selects patients based on valueQuantity:

SELECT patient.id, observation.id, observation.subject, observation.valueQuantity
FROM patient
INNER JOIN observation ON observation.subject.reference = patient.id
WHERE observation.valueQuantity.value > 15
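
To make concrete what this query computes, and what an application would have to do by hand if it could only read one table at a time, here is a minimal sketch in plain Python. The sample rows and the function name join_patients_observations are hypothetical. Like the query, it assumes that observation.subject.reference holds the bare patient id (real FHIR references often look like "Patient/p1").

```python
# Hypothetical sample rows, shaped like the flattened FHIR tables in the query.
patients = [{"id": "p1"}, {"id": "p2"}]
observations = [
    {"id": "o1", "subject": {"reference": "p1"}, "valueQuantity": {"value": 20.0}},
    {"id": "o2", "subject": {"reference": "p2"}, "valueQuantity": {"value": 10.0}},
    {"id": "o3", "subject": {"reference": "p1"}, "valueQuantity": {"value": 16.5}},
]


def join_patients_observations(patients, observations, threshold):
    """Client-side hash join: patient.id = observation.subject.reference."""
    patient_ids = {p["id"] for p in patients}   # build side of the join
    result = []
    for obs in observations:                    # probe side of the join
        ref = obs["subject"]["reference"]
        value = obs["valueQuantity"]["value"]
        if ref in patient_ids and value > threshold:
            result.append((ref, obs["id"], value))
    return result


matches = join_patients_observations(patients, observations, 15)
```

Both tables have to be pulled into one process for this to work, which is the load concern raised earlier in the thread. A distributed engine such as Spark spreads the same build and probe work across the cluster instead.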

Another limitation of Cassandra is that the WHERE clause can only be used with primary key columns or indexed columns. For example, if a user wants to search patients by first name, then the firstname column should be indexed. Likewise, if we need to provide more query options based on other columns, we will need to create secondary indexes. Cassandra discourages creating a large number of indexes, as they affect the performance of write queries.

Cassandra CQL does provide an option to filter on non-indexed columns [2] via ALLOW FILTERING, but its use is discouraged.
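
The trade-off between an indexed lookup and a filtering scan can be sketched with a toy model in plain Python. This is not how Cassandra is implemented internally; the rows, the in-memory index, and the function names are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical patient rows; in Cassandra these would be spread over many nodes.
rows = [
    {"id": "p1", "firstname": "Nimali", "city": "Colombo"},
    {"id": "p2", "firstname": "Kasun", "city": "Kandy"},
    {"id": "p3", "firstname": "Nimali", "city": "Galle"},
]


def build_index(rows, column):
    """Toy secondary index: column value -> list of row ids."""
    index = defaultdict(list)
    for row in rows:
        index[row[column]].append(row["id"])
    return index


# The index has to be maintained on every write, which is the write-side cost.
firstname_index = build_index(rows, "firstname")


def find_indexed(value):
    """Lookup through the index: touches only the matching entries."""
    return firstname_index.get(value, [])


def find_by_scan(column, value):
    """Filtering without an index: every row is examined."""
    examined = 0
    hits = []
    for row in rows:
        examined += 1
        if row[column] == value:
            hits.append(row["id"])
    return hits, examined


indexed_hits = find_indexed("Nimali")
scan_hits, rows_examined = find_by_scan("city", "Kandy")
```

In this model the scan examines all three rows to return one match. On a large, distributed table that read amplification is why filtering on non-indexed columns is treated as a last resort.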

The online resources [3] suggest Spark as a viable option for executing complex queries.

@sunbiz @yashdsaraf I think we will need to go through our data model for storing patients carefully. As I understand FHIR, we will need to support basic search operations based on resource attributes. If our data model doesn’t fit, it will create performance issues as the data grows. I’m going to write a blog post about my findings.

Given the limitations above, I think Spark is the viable option. Are we going to have a modular approach in LibreHealth? If so, we could fit FHIR Analytics in as a separate module and use it where appropriate.

References

[1] https://docs.datastax.com/en/cql/3.1/cql/cql_reference/select_r.html

[2] https://www.datastax.com/dev/blog/allow-filtering-explained-2

[3] https://www.datastax.com/2015/03/how-to-do-joins-in-apache-cassandra-and-datastax-enterprise


(Prashadi) #14

@sunbiz @yashdsaraf Is it a good idea to store the entire resource as JSON in a single column for each resource?


(Prashadi) #15

@sunbiz @namratanehete @judywawira I have completed the blog post with my findings at https://medium.com/@prkpbandara/gsoc-librehealth-working-with-cassandra-for-fhir-analytics-9e66eecec6a7.

Let me know your thoughts and suggestions.


(Prashadi) #16

@sunbiz @namratanehete @judywawira Since we have several limitations with CQL, I’ll be focusing on the Spark-based query builder approach. If you have any suggestions, please do let me know. My plan is to get the initial version done and merge it with @yashdsaraf’s repository.


(Namratanehete) #17

@prashadi I asked you yesterday in the LibreHealth chat: are you missing a dependency on jackson-databind? I am getting “java.lang.ClassNotFoundException: com.fasterxml.jackson.databind.exc.InvalidDefinitionException” while trying to execute your code. I can see you have specified jackson.version in pom.xml, but it is not used anywhere.


(Prashadi) #18

@namratanehete Let me quickly check on this and get back to you soon. Sorry I missed the notifications.


(Prashadi) #19

@namratanehete I have converted the project to a WAR. After several hours of solving dependency issues, it now deploys successfully as a web application.

Please follow https://gitlab.com/kavindya89/librehealth-fhir-analytics/blob/master/README.md to deploy the project and try out the functionality.


(Prashadi) #20

Hi All,

To give an update, I’m working on a patient-attribute-based search view with a query builder for Apache Spark. I’ll push the changes as soon as the initial version is done.

I came across an article that discusses FHIR analytics (https://www.linkedin.com/pulse/analytics-fhir-chris-grenz). The author uses Apache Drill, a SQL query engine for big data. Apache Drill allows users to load FHIR data stored in files, MongoDB, HBase and S3. It is not a very good fit for us, but it is interesting: it converts FHIR resources to a SQL table-based format, like we do with Apache Spark using Bunsen.


(Prashadi) #21

@sunbiz @namratanehete @judywawira I have added a basic version of the query builder, along with support for searching by patient attributes. I have written a complete blog post on the current progress, and the code is committed.

https://medium.com/@prkpbandara/gsoc-librehealth-fhir-analytics-using-spark-sql-9019dcb41593

Please do let me know if you have any suggestions or improvements.
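
For readers following along, the idea of a query builder for patient attribute search can be sketched roughly as follows. This is a hypothetical illustration in plain Python, not the code from the blog post or the repository. The function name build_patient_query, the column names, and the (field, operator, value) filter format are all assumptions.

```python
# Assumption: only simple comparison operators are supported.
ALLOWED_OPERATORS = {"=", "!=", ">", "<", ">=", "<="}


def render_value(value):
    """Render a literal; reject characters that would need escaping."""
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return str(value)
    text = str(value)
    if "'" in text or "\\" in text:
        raise ValueError("unsupported character in value: %r" % text)
    return "'" + text + "'"


def build_patient_query(filters, columns=("id", "gender", "birthDate")):
    """Build a SQL string from (field, operator, value) filters."""
    sql = "SELECT " + ", ".join(columns) + " FROM patient"
    clauses = []
    for field, op, value in filters:
        if op not in ALLOWED_OPERATORS:
            raise ValueError("unsupported operator: %r" % op)
        # Allow only plain or dotted identifiers such as name.family.
        if not field.replace(".", "").isalnum():
            raise ValueError("unsupported field name: %r" % field)
        clauses.append("%s %s %s" % (field, op, render_value(value)))
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql


query = build_patient_query(
    [("gender", "=", "female"), ("birthDate", ">=", "1990-01-01")]
)
```

The resulting string could then be handed to a Spark session, assuming a patient table or view has been registered. The sketch rejects quotes and backslashes instead of escaping them, to keep it short; a real implementation would need proper escaping or parameter binding.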


(Prashadi) #22

@yashdsaraf Is it OK if I start integrating this component with your repository now? I saw that the CRUD operations are now working, with minor issues.


(Yash D. Saraf) #23

@prashadi For executing CRUD operations, will you be using Spring Data repositories or making REST API calls?
If it’s the former, you can use the PatientRepository directly; if it’s the latter, then only the READ and DELETE operations are functional right now.


(Prashadi) #24

For analytics, I need to read the data directly from Cassandra, because I’ll be using the Spark Cassandra Connector to access the data. I’m currently working on the UI and resource analytics. You can work on adding a few more resources.


(Prashadi) #25

@sunbiz @namratanehete @judywawira Please find my current progress at https://medium.com/@prkpbandara/gsoc-librehealth-librehealth-fhir-analytic-capabilities-c35f57e36a29. Since the UI functionality is complete, my next target is to integrate this module with @yashdsaraf’s project. Since he has almost completed the patient resource, I believe I can test the module using it. Let me know if you have any suggestions or improvements after going through the blog posts.


(Prashadi) #26

An interesting article that uses Amazon Alexa with FHIR: https://medium.com/@alastairallen/evolve-and-alexa-on-fhir-30041e49bc51


(Namratanehete) #27

Hey @prashadi, I am getting the following error after the latest update of the code. Please let me know how to fix this error.

https://pastebin.com/800YQhE5