Okay I agree with you @muarachmann, I will be there
@muarachmann I have joined meeting
Hi @Darshpreet2000, sorry I had another appointment and couldnât be available. I am about to test the pipeline on a demo branch (demo-data) with just one hospital in the url and lets see if we get the expected results. We still have plenty (not much) of things to tackle, the flask integration is just a way of testing the api with the client maybe locally or hosted. Do you think we can use it for generating the process files over the gui to keep our tech stack not to far rather than bringing in PyQT
@Darshpreet2000, @r0bby, @sunbiz mind explaining me something. I think I am confused maybe how git works these days . I am cloning the repo using git clone https://gitlab.com/muarachmann/lh-toolkit-cost-of-care-app-data-scraper.git
and the default branch is develop, just has the source codes yet on cloning, it looks massive see about 600MB
And when i finally cd in to the directory, and run a ls -lh
I see the files donât even sum up to give a 1MB see
I wish to understand whatâs going on and also when I look at my branches, it is only develop which is there and thatâs the expected result.
I really canât help as I didnât really work with it closely.
However when you work with git, remember that it is a decentralized version control system, that means when you clone, you get a copy of every change made to the repo.
Every change you make results in a new git object since git objects are immutable.
@r0bby please can you create back the token ($CI_TOKEN) for the bot, it seem to be expired, and we need it for pushing to the various branches . We keep getting this
remote: HTTP Basic: Access denied
fatal: Authentication failed for 'https://librebot:@gitlab.com/librehealth/toolkit/cost-of-care/lh-toolkit-cost-of-care-app-data-scraper.git/'
ERROR: Job failed: exit code 1
@Darshpreet2000 FYI the pipeline looks great see - https://gitlab.com/librehealth/toolkit/cost-of-care/lh-toolkit-cost-of-care-app-data-scraper/-/jobs/675197476 . I think the minor auth issue should make it pass
No problem, we can have it whenever you feel convenient.
Yes, we can generate process files with GUI. I can do that
I will check this by cloning in my system.
Thankyou @muarachmann
Because I created CDM & Data directory earlier which were large in size & I later removed them by commit but they were not removed from Git history of commits. So, that is why repo is same large as earlier.
Pushing a commit which removes files doesnât make the git repo smaller.
We need to remove those files across the history of the git repo in order for its size to shrink.
Git keeps track of each and every line change made, it makes its history huge
One nitpick here: This is one of those rare circumstances where private messages are not only allowed but encouraged. That said, I will handle it.
Okay @r0bby, from next time I will discuss this in private
@muarachmann @sunbiz @judywawira please check the content of this about screen
@Darshpreet2000 see here, I am using CI_SSH_KEY
the but pipeline is the same - https://gitlab.com/librehealth/toolkit/cost-of-care/lh-toolkit-cost-of-care-app-data-scraper/-/jobs/677019928 - Here is my branch - https://gitlab.com/librehealth/toolkit/cost-of-care/lh-toolkit-cost-of-care-app-data-scraper/-/tree/demo-data.
What could be wrong
Also see -
col_indices.append(usecols_key.index(col))
ValueError: 'generic name' is not in list
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "Process CDM/Alaska/Providence Seward Medical Center/process.py", line 71, in <module>
a.ProcessXLSX(n, filename, hospitalId, charge,description, category, destination, sheet)
File "/builds/librehealth/toolkit/cost-of-care/lh-toolkit-cost-of-care-app-data-scraper/ParseData.py", line 55, in ProcessXLSX
content = pandas.read_excel(filename, sheet, skiprows=n, usecols=[charge, description], encoding='unicode_escape')
File "/usr/local/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 311, in read_excel
return io.parse(
File "/usr/local/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 868, in parse
return self._reader.parse(
File "/usr/local/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 492, in parse
parser = TextParser(
File "/usr/local/lib/python3.8/site-packages/pandas/io/parsers.py", line 2201, in TextParser
return TextFileReader(*args, **kwds)
File "/usr/local/lib/python3.8/site-packages/pandas/io/parsers.py", line 880, in __init__
self._make_engine(self.engine)
File "/usr/local/lib/python3.8/site-packages/pandas/io/parsers.py", line 1126, in _make_engine
self._engine = klass(self.f, **self.options)
File "/usr/local/lib/python3.8/site-packages/pandas/io/parsers.py", line 2286, in __init__
) = self._infer_columns()
File "/usr/local/lib/python3.8/site-packages/pandas/io/parsers.py", line 2656, in _infer_columns
columns = self._handle_usecols(columns, columns[0])
File "/usr/local/lib/python3.8/site-packages/pandas/io/parsers.py", line 2714, in _handle_usecols
_validate_usecols_names(self.usecols, usecols_key)
File "/usr/local/lib/python3.8/site-packages/pandas/io/parsers.py", line 1232, in _validate_usecols_names
raise ValueError(
ValueError: Usecols do not match columns, columns expected but not found: ['generic name', 'unit price']
/builds/librehealth/toolkit/cost-of-care/lh-toolkit-cost-of-care-app-data-scraper/CDM
Parsing /builds/librehealth/toolkit/cost-of-care/lh-toolkit-cost-of-care-app-data-scraper/Data/Alaska/Samuel Simmonds Memorial Hospital/SamuelSimmondsCDM.csv
(2583, 3)
No Data Found For /builds/librehealth/toolkit/cost-of-care/lh-toolkit-cost-of-care-app-data-scraper/Data/Alaska/Providence Valdez Medical Center
/builds/librehealth/toolkit/cost-of-care/lh-toolkit-cost-of-care-app-data-scraper/CDM/Alaska/PeaceHealth Medical Center Ketchikan Alaska.csv
Parsing 2020_peacehealth_medical_center_ketchikan_alaska.xlsx
(615, 4)
Parsing 2020_peacehealth_medical_center_ketchikan_alaska.xlsx
(235, 4)
/builds/librehealth/toolkit/cost-of-care/lh-toolkit-cost-of-care-app-data-scraper/CDM/Alaska/PeaceHealth Medical Center Cottage Grove Oregon.csv
Parsing 2020_peacehealth_medical_center_cottage_grove_oregon.xlsx
(329, 4)
Parsing 2020_peacehealth_medical_center_cottage_grove_oregon.xlsx
(255, 4)
Parsing 2020_peacehealth_medical_center_cottage_grove_oregon.xlsx
(77, 4)
/builds/librehealth/toolkit/cost-of-care/lh-toolkit-cost-of-care-app-data-scraper/CDM/Alaska/PeaceHealth Medical Center Eugene Oregon.csv
Parsing 2020_peacehealth_medical_center_eugene_oregon.xlsx
(475, 4)
Parsing 2020_peacehealth_medical_center_eugene_oregon.xlsx
(179, 4)
/builds/librehealth/toolkit/cost-of-care/lh-too
I will fix this, probably CDMs are updated & they have changed column names
Because pushing with an access token is not same as pushing with SSH key. I am also searching on how to push using SSH Key.
What I found is
We have to use GIT_SSH_COMMAND
as a variable
I found it in Git Docs Git - git Documentation
GIT_SSH_COMMAND=CI_SSH_KEY git clone example
& then use simple git commands to clone or push
I will test this on my repo & I will create a MR for this
Great, I look forward to it. Also what about the history of the commits? Anything that can be done?
Yes, I have reduced size of my forked repo by 250 MB using BFG repo cleaner or we can use git filter branch but BFG repo is faster than filter branch
After downloading BFG repo cleaner , I used this command which removed all files in History of commits which were greater than 1 MB
java -jar bfg.jar --strip-blobs-bigger-than 1M some-big-repo.git
I also have a better idea than this
We can create an orphan branch in repo which will start from scratch & we can push to it, so repo size will be 1 MB
So , I think we should create orphan branch instead of using BFG repo cleaner
I fixed CI to use a token. That token cannot be disclosed and would be very, very bad for us if ever leaked.
I hope you ticked the box which allows masking the variables in the CI?