Closed as not planned
MegaIng/interegular
#9Closed as not planned
Describe the issue as clearly as possible:
Certain characters trigger an AssertionError in make_byte_level_fsm when they appear in a case-insensitive regex group (e.g. (?i:ß)).
So far, I have found that each of the following characters triggers the error: ¤ ß İ ʼn ǰ ΐ ΰ
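One possible contributing factor, not confirmed by the traceback in this report: several of these characters have upper- or lower-case mappings that expand to more than one code point, which would conflict with a `len(symbol) == 1` check. The sketch below uses only Python's built-in `str` methods to list such expansions. The helper name `multi_char_case_variants` is hypothetical, the character written as "ʼn" is assumed to be U+0149, and nothing here reflects how interegular or outlines actually compute case variants. Note that ¤ has no case mapping in Python, so this check would not account for it.

```python
# Sketch: list the case mappings of each reported character that expand to
# more than one code point under Python's built-in str methods.
# Assumption: the reported characters correspond to these code points.
REPORTED = [
    "\u00a4",  # ¤ CURRENCY SIGN
    "\u00df",  # ß LATIN SMALL LETTER SHARP S
    "\u0130",  # İ LATIN CAPITAL LETTER I WITH DOT ABOVE
    "\u0149",  # ŉ LATIN SMALL LETTER N PRECEDED BY APOSTROPHE (assumed)
    "\u01f0",  # ǰ LATIN SMALL LETTER J WITH CARON
    "\u0390",  # ΐ GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
    "\u03b0",  # ΰ GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
]


def multi_char_case_variants(ch: str) -> dict:
    """Return the case mappings of `ch` that are longer than one code point."""
    variants = {
        "upper": ch.upper(),
        "lower": ch.lower(),
        "casefold": ch.casefold(),
    }
    return {name: value for name, value in variants.items() if len(value) > 1}


for ch in REPORTED:
    expansions = multi_char_case_variants(ch)
    print(f"U+{ord(ch):04X} {ch!r}: {expansions or 'no multi-character mapping'}")
```

If a character shows a multi-character mapping here, a case-insensitive expansion built from `upper()`/`lower()` could plausibly yield a symbol longer than one character, but that remains an assumption about the library internals.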
Steps/code to reproduce the bug:
import outlines
model = outlines.models.transformers("distilgpt2")
outlines.generate.regex(model, r"(?i:ß)")
Expected result:
<outlines.generate.api.SequenceGenerator at 0x...>
Error message:
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
Cell In[100], line 1
----> 1 outlines.generate.regex(model, r"(?i:ß)")
File ~/mambaforge/envs/test/lib/python3.10/functools.py:889, in singledispatch.<locals>.wrapper(*args, **kw)
885 if not args:
886 raise TypeError(f'{funcname} requires at least '
887 '1 positional argument')
--> 889 return dispatch(args[0].__class__)(*args, **kw)
File ~/mambaforge/envs/test/lib/python3.10/site-packages/outlines/generate/regex.py:32, in regex(model, regex_str, sampler)
11 @singledispatch
12 def regex(model, regex_str: str, sampler: Sampler = multinomial()):
13 """Generate structured text in the language of a regular expression.
14
15 Parameters
(...)
30
31 """
---> 32 fsm = RegexGuide(regex_str, model.tokenizer)
34 device = model.device
35 generator = SequenceGenerator(fsm, model, sampler, device)
File ~/mambaforge/envs/test/lib/python3.10/site-packages/outlines/fsm/guide.py:146, in RegexGuide.__init__(self, regex_string, tokenizer)
136 raise ValueError(
137 "The vocabulary does not allow us to build a sequence that matches the input regex"
138 )
140 return states_to_token_maps, empty_token_ids, regex_fsm.finals
142 (
143 self.states_to_token_maps,
144 self.empty_token_ids,
145 fsm_finals,
--> 146 ) = create_states_mapping(
147 regex_string, tuple(sorted(tokenizer.vocabulary.items()))
148 )
149 self.vocabulary = list(tokenizer.vocabulary.values())
150 self.eos_token_id = tokenizer.eos_token_id
File ~/mambaforge/envs/test/lib/python3.10/site-packages/outlines/caching.py:74, in cache.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
72 if cache_key in memory:
73 return memory[cache_key]
---> 74 result = cached_function(*args, **kwargs)
75 memory[cache_key] = result
76 return result
File ~/mambaforge/envs/test/lib/python3.10/site-packages/outlines/fsm/guide.py:121, in RegexGuide.__init__.<locals>.create_states_mapping(regex_string, cacheable_vocabulary)
117 """Create the variables related to the mapping between states and tokens
118 The parameters of the function are used for caching purpose
119 """
120 regex_pattern = interegular.parse_pattern(regex_string)
--> 121 byte_fsm = make_byte_level_fsm(
122 regex_pattern.to_fsm().reduce(), keep_utf8=True
123 )
124 regex_fsm, _ = make_deterministic_fsm(byte_fsm)
125 states_to_token_maps, empty_token_ids = create_fsm_index_tokenizer(
126 regex_fsm, tokenizer
127 )
File ~/mambaforge/envs/test/lib/python3.10/site-packages/outlines/fsm/regex.py:223, in make_byte_level_fsm(fsm, keep_utf8)
221 max_key = max(fsm.alphabet.values())
222 for symbol, transition_key in fsm.alphabet.items():
--> 223 assert symbol == anything_else or len(symbol) == 1
224 if symbol == anything_else or ord(symbol) < 0x80:
225 symbol_mapping[symbol] = transition_key
AssertionError:
Outlines/Python version information:
Version information
```
0.0.37
Python 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
```
Context for the issue:
No response