Conversation

@domoritz
Contributor

I just made a PR for a branch from @karol-szuster because these changes seem to be a valuable improvement. I haven't tested them, but I wanted to document them.

When a CSV file is longer than 1000 lines, only the first attempt to fetch its content
will succeed. Any subsequent attempt will return only the first 1000 lines. This
patch should fix the problem, but it may affect the memory footprint.
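The failure described above is typical of reading from a non-seekable stream: the first consumer drains it, and later reads see only whatever small window was buffered. A minimal sketch of one way to fix this, spooling the stream so it can be re-read without keeping huge files entirely in RAM (the function name `make_rereadable` is illustrative, not the actual patch's API):

```python
import io
import tempfile

def make_rereadable(stream, spool_size=1024 * 1024):
    """Copy a possibly non-seekable stream into a spooled temp file.

    Small inputs stay in memory; anything larger than spool_size is
    spilled to disk, so repeated full reads work without holding the
    whole file in RAM.
    """
    buf = tempfile.SpooledTemporaryFile(max_size=spool_size)
    while True:
        chunk = stream.read(64 * 1024)
        if not chunk:
            break
        buf.write(chunk)
    buf.seek(0)
    return buf

# Demo: after wrapping, every fetch returns the full content,
# provided the caller rewinds with seek(0) between reads.
source = io.BytesIO(b"a,b\n" + b"1,2\n" * 5000)
rereadable = make_rereadable(source)
first = rereadable.read()
rereadable.seek(0)
second = rereadable.read()
assert first == second
```

The spooling threshold is the memory/disk trade-off the commit message alludes to: re-readability costs either RAM or a temp file, which is why the patch "may affect memory footprint".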
@pudo
Contributor

pudo commented Aug 6, 2015

Wow, that should almost be a bug. +1

@karol-szuster

I created these patches because in my project I very often process very large Excel and CSV files (larger than 500K lines). Without these patches I was not able to parse some of them, because the OOM killer was killing the Python interpreter on a machine with 4GB of memory. With these patches the Python process consumes about 400MB of RAM.
I didn't want to create a PR without unit tests, but if you like these patches then you can take them, and they should work fine. At least I haven't experienced any issues with them.

Karol Szuster added 2 commits August 25, 2015 13:01
The headers_guess function reads the entire content of the file into memory. This leads
to a huge memory footprint when working on large Excel files. This patch fixes the
problem by reading only the first 1000 lines of the file into memory. This should
be sufficient to find the header and saves a lot of memory.
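The bounded-read idea in this commit can be sketched with `itertools.islice`, which caps how many rows are ever materialized. This is not the patched `headers_guess` itself; the function name `guess_header_row` and the all-cells-non-numeric heuristic are stand-ins for illustration only:

```python
import csv
import io
from itertools import islice

def guess_header_row(fileobj, sample_size=1000):
    """Guess which row is the header by scanning a bounded sample.

    Reads at most sample_size rows instead of the whole file, keeping
    memory usage flat for very large inputs. The heuristic (first row
    whose cells are all non-empty and non-numeric) is a simplification
    of what a real header-guessing routine would score.
    """
    sample = list(islice(csv.reader(fileobj), sample_size))
    for offset, row in enumerate(sample):
        if row and all(cell and not cell.replace(".", "", 1).isdigit()
                       for cell in row):
            return offset, row
    return 0, sample[0] if sample else []

data = "1,2\nname,age\nalice,30\nbob,25\n"
offset, header = guess_header_row(io.StringIO(data))
# offset == 1, header == ["name", "age"]
```

The key point is that `islice` never pulls more than `sample_size` rows off the reader, so a 500K-line file costs no more memory to scan than a 1000-line one.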
@pwalsh
Contributor

pwalsh commented Jul 20, 2016

@pudo it would be nice if you have time to check this, as you expressed interest to me. If it looks good, I can rebase and merge.

@rufuspollock
Member

These seem like super simple improvements - looking at the patches, LGTM. I guess we need to rebase against master to get these in ...
