Skip to content

Fix in strdb.cc file for handling large file reads#18

Open
ajalagam wants to merge 1 commit intopercyliang:masterfrom
ajalagam:pull-request-strdb-fix
Open

Fix in strdb.cc file for handling large file reads#18
ajalagam wants to merge 1 commit intopercyliang:masterfrom
ajalagam:pull-request-strdb-fix

Conversation

@ajalagam
Copy link
Copy Markdown

strdb.cc terminates with segmentation fault when run on large data files of say 5 GB in size. This commit has a fix for this issue.

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff71293ab in __memcpy_ssse3_back () from /lib64/libc.so.6

0x0000000000422558 in read_text (file=<optimized out>, func=0x409ed0 <read_text_process_word(int)>, db=..., read_cached=<optimized out>,
    write_cached=<optimized out>, incorp_new=true) at basic/strdb.cc:189

Copy link
Copy Markdown
Owner

@percyliang percyliang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. This change doesn't preserve the previous functionality, which ignores all whitespace (multiple, spaces, tabs, newlines), whereas the proposed change only breaks on single spaces. Would be good not to change this.

  2. Can you explain why this code works when the previous one doesn't? I don't know how ifstream is implemented, but the memory usage for reading in a string should be constant...

char s[16384];
char buf[16384]; int buf_i = 0; // Output buffer
while(in >> s) { // Read a string
for(std::string line; getline( in, line, ' ');) {
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove extra space

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. To preserve 'breaking on whitespace functionality' the code can be modified as below.
    for(std::string line; getline(std::ws(in), line, ' ');)

  2. I think the reason why the previous code didn't work has got something to do with the internal memory allocation, buffer handling of the inputstream and copying to a char array of [16384] size. The recommended way to read is into a std::string rather than a char array. The stacktrace also seems to indicate that the internal memcpy operation didn't go fine.

I think an understanding beyond this may require me to dig deeper into the workings of the ifstream, limitations of reading via operator>> versus getline, and also into possibly libc library oeprations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants