Project: FHIR Analytics Using Apache Spark and Cassandra


(Yash D. Saraf) #44

By a single column I assume you mean the entire resource would be saved as a JSON object?
If that’s the case, instead of using the existing FHIR structures as models, we’ll need to create new models with just the id and value properties (much like your project) and use them to save the real structures.
CRUD operations could very well be adjusted for this format, but a search using CQL wouldn’t work, as it would compare the given value against the entire object’s JSON string.
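A rough sketch of that search problem, in plain Java rather than CQL. The `StoredResource` class, its field names, and the sample JSON are all hypothetical stand-ins for the "id + value" model; the loop only mimics an equality filter on the value column.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical "id + value" model: the whole resource lives in one JSON string.
final class StoredResource {
    final String id;
    final String value; // complete resource serialized as JSON

    StoredResource(String id, String value) {
        this.id = id;
        this.value = value;
    }
}

final class SingleColumnSearchSketch {
    // Mimics an equality filter on the value column: only an exact
    // match of the whole JSON string selects a row.
    static List<String> findIdsByValueEquals(List<StoredResource> rows, String needle) {
        List<String> ids = new ArrayList<>();
        for (StoredResource row : rows) {
            if (row.value.equals(needle)) {
                ids.add(row.id);
            }
        }
        return ids;
    }

    // Two made-up rows used for illustration only.
    static List<StoredResource> sampleRows() {
        List<StoredResource> rows = new ArrayList<>();
        rows.add(new StoredResource("p1",
                "{\"resourceType\":\"Patient\",\"id\":\"p1\",\"gender\":\"female\"}"));
        rows.add(new StoredResource("p2",
                "{\"resourceType\":\"Patient\",\"id\":\"p2\",\"gender\":\"male\"}"));
        return rows;
    }
}
```

Searching the sample rows for `female` finds nothing, because no row's value equals that string; only the full JSON text of a row would match.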

That’s nice, I’ll make the necessary changes in my project as well, so you can directly merge the resources other than Patient.

(Prashadi) #45

@yashdsaraf I meant we can keep the existing columns. Is there any possibility of adding an extra column, such as resource, that stores the complete resource?

(Yash D. Saraf) #46

Yes, we can do that. Only the CREATE and UPDATE operations will have to be modified to handle the extra column. What do you think about this, @sunbiz?
I’ll start working on a merge request for this.

(Saptarshi Purkayastha) #47

I don’t suggest you do that. Instead, write a helper method in the service that will give you a resource. I would have suggested putting this in the analytics module, but it is a generic feature that needs to be in the platform.

(Prashadi) #48

@sunbiz The reason for asking is that the analytics module reads data through the Spark Cassandra Connector. It won’t use the Spring Data APIs to query data, because the Spark Cassandra Connector handles data partitioning when processing data across the Spark cluster. If there is no single column holding the whole resource, the analytics module needs to implement a mapping for each resource type when fetching resource data into Spark. For Patient, for example, we would need a mapper that creates a FHIR Patient resource by going through each of the values in the Cassandra columns. A single column with the entire resource would be useful for converting it to a Java Patient resource. @sunbiz do you think we should handle this at the analytics module level?

(Prashadi) #49

In addition, since we created CPatient by extending the Patient resource in HAPI FHIR, things get a bit complex, as the FHIR encoders only work with base HAPI FHIR resources.

(Prashadi) #50

@yashdsaraf I have merged your changes into the master branch. Let me know when you complete the other resources as well, so we can combine them and merge into a single repository.

(Saptarshi Purkayastha) #51

Yes @prashadi, the analytics module should have a way to convert the resource to a Bundle or whatever other form it needs, instead of the platform keeping a combined copy of the entire resource.

(Prashadi) #52

OK @sunbiz. So when the analytics module fetches data, it will need to convert the Cassandra table data to the relevant FHIR resource by taking the column data and creating a new HAPI FHIR resource object from it. I’ll look into writing a mapper, or check whether the Spark Cassandra Connector has functionality to map a database row into a FHIR resource.

(Prashadi) #53

@sunbiz @namratanehete @judywawira @yashdsaraf I have gone through the possible approaches for mapping a Cassandra table to an object via the Spark Cassandra Connector. Since @yashdsaraf’s patient representation contains complex attributes, the only way we can map patient table data to a FHIR Patient object is by going through the data of each column and mapping it to the relevant attribute. It will be a time-consuming task, but that is our only option. Any other thoughts on this matter?
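To make the column-by-column idea concrete, here is a minimal sketch in plain Java. Everything in it is an assumption for illustration: `SimplePatient` is a simplified stand-in for the HAPI FHIR Patient class, the `Map` stands in for a row returned by the connector, and the column names are made up. A real mapper would also have to deal with the complex attributes mentioned above.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for a FHIR Patient. The real target would be the
// HAPI FHIR Patient class, left out here to keep the sketch self-contained.
final class SimplePatient {
    String id;
    String familyName;
    String gender;
    String birthDate;
}

final class PatientRowMapperSketch {
    // Copies each column value into the matching attribute, one by one.
    // The column names ("id", "family_name", ...) are hypothetical.
    static SimplePatient fromRow(Map<String, Object> row) {
        SimplePatient patient = new SimplePatient();
        patient.id = asString(row.get("id"));
        patient.familyName = asString(row.get("family_name"));
        patient.gender = asString(row.get("gender"));
        patient.birthDate = asString(row.get("birth_date"));
        return patient;
    }

    // Missing or null columns map to null attributes.
    private static String asString(Object value) {
        return value == null ? null : value.toString();
    }

    // A plain map standing in for one row returned by the connector.
    static Map<String, Object> sampleRow() {
        Map<String, Object> row = new HashMap<>();
        row.put("id", "p1");
        row.put("family_name", "Doe");
        row.put("gender", "female");
        return row;
    }
}
```

Each resource type would need its own `fromRow`-style method, which is where the time cost comes from.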

(Yash D. Saraf) #54

@prashadi When you say mapping from table to object, is the object in this case a HAPI FHIR structure, or some other Spark-specific object?

(Prashadi) #55

It’s the HAPI FHIR Patient resource. But the data is fetched via the Spark Cassandra Connector, which returns a row containing the patient data.

(Yash D. Saraf) #56

If you ultimately need the HAPI FHIR Patient resource, why don’t you try using my project to retrieve the objects? For example, say you need to retrieve a Patient structure from the database: you can rely on an implicit cast to use CPatient as Patient, something like this:

Patient patient = patientRepository.find(<identifier>);

Although the find function will return a CPatient object, it’ll be implicitly cast to Patient, since CPatient is a direct subclass of Patient.
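A self-contained sketch of that implicit cast, for anyone who wants to try it outside the project. The `Patient` and `CPatient` classes below are local stand-ins (not the HAPI FHIR or project classes), and `InMemoryPatientRepository` is a hypothetical replacement for the real repository.

```java
import java.util.HashMap;
import java.util.Map;

// Local stand-in for the base resource class (not the HAPI FHIR Patient).
class Patient {
    private final String id;

    Patient(String id) {
        this.id = id;
    }

    String getId() {
        return id;
    }
}

// Local stand-in for the Cassandra-mapped subclass.
class CPatient extends Patient {
    CPatient(String id) {
        super(id);
    }
}

// Hypothetical in-memory repository whose find method returns the subclass.
final class InMemoryPatientRepository {
    private final Map<String, CPatient> store = new HashMap<>();

    void save(CPatient patient) {
        store.put(patient.getId(), patient);
    }

    CPatient find(String id) {
        return store.get(id);
    }
}

final class UpcastSketch {
    static Patient load(InMemoryPatientRepository repository, String id) {
        // No explicit cast is needed: a CPatient reference widens to Patient.
        Patient patient = repository.find(id);
        return patient;
    }
}
```

The object returned by `load` is still a CPatient at runtime; only the declared type of the reference changes.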


I just realized that I’m assuming Spark performs the analytical processes after all the data is retrieved from the database. If that assumption is wrong the solution above won’t work.

(Prashadi) #57

@yashdsaraf The Spark Cassandra Connector internally handles data distribution across the entire Spark cluster, so it won’t load all the data onto a single node. That’s why we need to go through the Spark Cassandra API to retrieve the data.

(Namratanehete) #58

I think we should go with the column-to-column approach until we find an alternative. What do you all think? @sunbiz @yashdsaraf @prashadi

(Prashadi) #59

Thank you for the response. I’ll look into the mapper implementation.

(Prashadi) #60

@sunbiz @namratanehete @judywawira I have added my blog post on the work accomplished during GSoC. Since I was away for a few days as my university has started, I’ll continue to work on integrating the newest changes from Yash’s module and combining the data sources. I have tried to regenerate the JSON by iterating over the columns, and it gives me a parsing error. I will be looking into that. Once it’s sorted, the integration will be complete. @yashdsaraf let’s test your resources with the Google data set, which contains different kinds of resources with attributes.
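The cause of the parsing error is not known at this point. One thing that may be worth ruling out when building JSON by hand from column values is unescaped quotes or control characters. The sketch below is a hypothetical helper, not code from the project: it assumes a flat row of string columns and emits a flat JSON object, which is much simpler than a real FHIR resource.

```java
import java.util.Map;

final class RowToJsonSketch {
    // Builds a flat JSON object from column name/value pairs, skipping nulls.
    static String toJson(Map<String, String> row) {
        StringBuilder json = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> column : row.entrySet()) {
            if (column.getValue() == null) {
                continue; // empty column: leave the attribute out
            }
            if (!first) {
                json.append(',');
            }
            json.append('"').append(escape(column.getKey())).append("\":\"")
                .append(escape(column.getValue())).append('"');
            first = false;
        }
        return json.append('}').toString();
    }

    // Escapes the characters that would otherwise break a JSON string.
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters use the 4-digit hex form.
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }
}
```

In the project itself, serializing through the HAPI FHIR parser once the resource object has been rebuilt is probably the safer route than concatenating strings.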

(Prashadi) #61

I have merged all the recent changes from the Spring Data repository and am now working on fixing minor issues. I’ll keep you updated on the progress.

(Prashadi) #62

The project has moved to the LibreHealth organization here.

(Robby O'Connor) #63

We also mirror all repos to GitHub. I went ahead and created that and set up mirroring to GitLab using our service account. All work should be done on GitLab, however.