Skip to content

Conversation

@zhammer
Copy link

@zhammer zhammer commented Nov 6, 2019

context

possible fix for #14 (i'm new to cython and haven't coded in C for a long time)

currently fuzzy.Soundex has occasional undefined behavior on py3, as discussed in #14 and can be reproduced by running this script locally:

import itertools
import fuzzy

soundex = fuzzy.Soundex(6)

def main():
    # keep running the soundex function on the same input until encountering a UnicodeDecodeError
    for attempt in itertools.count():
        try:
            soundex('input')
        except UnicodeDecodeError as e:
            print(f'error on iteration {attempt}, err: {e}')
            break

if __name__ == "__main__":
    main()
# script output
(venv) Zach-Hammer:fuzzy zachhammer$ python3 soundextest.py 
error on iteration 2227560, err: 'ascii' codec can't decode byte 0xc1 in position 4
: ordinal not in range(128)

the UnicodeDecodeError is easiest to replicate as it breaks the code flow, but as noted in #14 the function can return valid (but incorrect) python strings

early deallocation of output string

i have a feeling this is occurring because of a change in python2/python3 cython behavior in this block:

fuzzy/src/fuzzy.pyx

Lines 227 to 230 in e15b195

pout = out
free(out)
return pout

  1. we assign pout = out
    • pout has no declared type (from looking at compiled c code it seems to be a char *)
    • the cython string docs aren't super clear here, but i'd imagine this is just a char * start pointer copy
  2. we free out
    • those bytes can be overwritten at any time, possibly even on free iirc
  3. we return pout
    • from the c compiled code, it seems like a few things happen here, most notably: __pyx_t_5 = __Pyx_PyUnicode_FromString(__pyx_v_pout); if (unlikely(!__pyx_t_5)) __PYX_ERR(0, 230, __pyx_L1_error) which seems to be doing some char * -> py string coercion
    • at some point here the free'd memory can be overwritten
    • that'd either lead to a UnicodeDecodeError if attempting to decode garbage bytes, or some random string output if the bytes are valid as a python string

i'm not sure what would cause this to be an issue only on py3, though, unless soundex has been occasionally returning garbage chars without notice on py2 as no UnicodeDecodeError was being thrown?

proposed fix

use try...finally deallocation syntax to free the output string on python garbage collection

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant