-
-
Notifications
You must be signed in to change notification settings - Fork 398
Save case sensitive species names #689
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
5d59c27 to
a849bad
Compare
Codecov Report
@@ Coverage Diff @@
## master #689 +/- ##
==========================================
- Coverage 70.63% 70.61% -0.02%
==========================================
Files 372 372
Lines 43567 43602 +35
==========================================
+ Hits 30773 30790 +17
- Misses 12794 12812 +18
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #689 +/- ##
==========================================
+ Coverage 70.63% 70.65% +0.02%
==========================================
Files 372 372
Lines 43567 43602 +35
==========================================
+ Hits 30773 30809 +36
+ Misses 12794 12793 -1
Continue to review full report at Codecov.
|
c669bc2 to
40cf2ed
Compare
speth
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm 👎 on requiring case sensitivity, at least universally. There is no consistency between reaction mechanisms from different sources, so it's quite convenient to be able to leave compositions such as 'N2: 3.76, O2: 1.0' alone and have code just work when switching between different mechanisms.
The case from the original issue report is fairly rare -- the only conflicts in nasa.cti are CS/Cs and CS2/Cs2. The one other scenario where it might be nice to have case sensitivity is if you wanted to use SMILES as identifiers, where lower case is used as an indicator of aromaticity, and so would be needed to differentiate between cyclohexane (C1CCCCC1) and benzene (c1ccccc1). I am not opposed to a flag to enable case sensitivity, with the default behavior remaining as-is.
Replacing doublereal with double throughout Phase.h and Phase.cpp makes it difficult to read the actual changes in this PR, and is the reason I suggest making that change only very locally (i.e. at most, within methods that are already being modified).
f1c459a to
1fe3aa6
Compare
|
Thanks @speth for the feedback.
Fair enough. I changed the logic to retain this ability and removed warnings, while still storing case sensitive species names in the C++ layer.
I believe both are reasons for enabling case sensitive species? Imho it should be possible to pull rudimentary data from nasa for any of the stored species (and it is conceivable that the data base will contain more cases in the future). And, yes, the ability to use SMILES would be neat. (So would be aliases for species, but that should be kept for another PR).
I kept this in a single commit, so it was quite easy to undo. It probably should nevertheless be done if the proposed PR moves forward? The one use case this PR does not pre-empt is if one were to cut&paste bits and pieces from mechanism that use different uppercase/lowercase conventions. Since there will always be differences in nomenclature, this is imho not a major issue, as one must tread carefully in those attempts no matter what. |
1fe3aa6 to
52e4a9b
Compare
* store species information with case sensitive names * retain lookup for non-case sensitive species names, e.g. Phase::speciesIndex * implement flag that enforces case sensitive species names as a member variable of Phase * add exception handling for species that are not uniquely defined unless case sensitive (e.g. Cs and CS in nasa.cti if cs is specified and case sensitivity is not enforced) * deprecate Phase::species(std::string&)
52e4a9b to
cacffb7
Compare
speth
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates. I think this change will provide a concrete benefit, by allowing either the existing case-insensitive behavior in most cases but permitting case sensitivity to be activated in the instances where it is necessary.
I kept this in a single commit, so it was quite easy to undo. It probably should nevertheless be done if the proposed PR moves forward?
As much as I wish the doublereal typedef had never been introduced, I think the costs of rooting it out broadly (beyond the scope of lines of code that are changing for other reasons) exceed the benefits. In addition to making diffs hard to read, it is very easy for these changes to make it hard to rebase/merge other branches that may have meaningful changes to these lines.
The one use case this PR does not pre-empt is if one were to cut&paste bits and pieces from mechanism that use different uppercase/lowercase conventions. Since there will always be differences in nomenclature, this is imho not a major issue, as one must tread carefully in those attempts no matter what.
I'm not too worried about that, and I think at least some of those cases will work with the YAML format (less so in the CTI case).
src/thermo/Phase.cpp
Outdated
| if (toLowerCopy(k.first) == nLower) { | ||
| if (loc == npos) { | ||
| loc = k.second; | ||
| } else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than falling back to an O(N) search anytime a non-canonical species name is provided, I wonder if the following might be a cleaner solution:
- Add a map of lower-case species names to species indices (say,
m_speciesIndicesLower) - When adding species, if there is already an entry in this map with the same name, set its value to
npos, and possibly set them_caseSensitiveSpeciesflag. - The implementation of
speciesIndexcan then use eitherm_speciesIndicesLowerorm_speciesIndicesdepending on the value ofm_caseSensitiveSpecies. Ifm_speciesIndicesLoweris checked and containsnpos, raise the exception about non-unique species identifiers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure about this suggestion. The second (lower-case) species map would contain nearly identical information; in cases like this, I usually keep the information in a single representation, as it is easier to maintain.
While the suggested implementation would be faster if a user opts to use species names with names other than those specified in the input file, I doubt that it has significant real-world benefits. Assuming that those searches are typically done during problem setup, there is little computational overhead (in repeated operations, it is always faster to select species by index). I am also unsure about auto-magically flipping flags, as those decisions may be best left to the user? Weighing cost and benefit, I am suggesting to leave the implementation as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, there is a benefit to not introducing a new data structure. However, avoiding that requires several instances of this kind of convoluted try/catch construct (with I think another one needed in Phase::species(string).
I think using the extra data structure would allow us to avoid adding findSpeciesLower as a new method -- its logic could just be in speciesIndex.
There's also some value in not generating exceptions for circumstances that aren't really exceptional (at least in C++). I'm less concerned about the performance implication than with the ability to run in gdb and break at an error by simply doing catch throw. If we use exceptions in places like this, then you end up spending a lot more time continuing past these unexceptional exceptions.
I would also note that speciesIndex is actually called with a name not in m_speciesIndices is more often than you might think -- it's not just instances of case-mismatched species names. It is used extensively in multiphase systems as a way of finding which phase a species belongs to. See for example Kinetics::kineticsSpeciesIndex and Kinetics::speciesPhase.
My thought on automatically enabling case sensitivity when there are species that can only be resolved this way was that that might be less surprising to the user than having to enable it explicitly. There may also be issues with initializing phases with species that are only distinguishable by case. For instance, what if a reaction is specified (in the input file) that contains one of these species? Adding the reaction requires looking up the indices of the involved species, and this needs to happen before the user has the chance to set the case sensitive species flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@speth ... thanks for your comments (plus catching the other access-species-by-name case for Phase::species). I.e. this involves another (almost) duplicate map.
Your comment regarding C++ exceptions makes a lot of sense. In the current revision, I opted to replace std::map::at by std::map::find, which eliminates the try - catch blocks.
I do realize that the kinetics managers make extensive use of speciesIndex (they have to), but in most cases the logic will employ case-sensitive species without additional overhead.
Regarding the automatic flags: my thought was that in most cases, case sensitive species are used by default internally and the fallback solution is rarely used. I updated the error message to explicitly direct users towards the caseSensitiveSpecies flag, which can be employed after all species and reactions are set up: internally, everything is case sensitive, i.e. lowercase names are treated as fall-back solutions.
Finally, at this point the case-sensitive/lowercase logic is exclusively handled within Phase::speciesIndex and Phase::species, i.e. there is no longer any duplication of logic blocks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the "automatic flags", I had to play around with this a bit to realize how this was now working - an exception is only thrown when case sensitivity is disabled and you specify a name which is not a case-sensitive match for any species. So my concern about setting up reactions is moot, and I agree that there is no need to automatically toggle this flag.
I still think the speciesIndex calls within the Kinetics managers for interface phases will frequently incur the O(N) overhead. For example, if you have a kinetics manager containing a surface, a gas, and a solid phase in that order, looking up the index of a species in the solid phase, even with the correct case, will cause a search through all the surface phase species, then all the gas phase species, before finally taking the easy path for the index into the solid phase.
0c18698 to
da35bdf
Compare
src/thermo/Phase.cpp
Outdated
| if (toLowerCopy(k.first) == nLower) { | ||
| if (loc == npos) { | ||
| loc = k.second; | ||
| } else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, there is a benefit to not introducing a new data structure. However, avoiding that requires several instances of this kind of convoluted try/catch construct (with I think another one needed in Phase::species(string).
I think using the extra data structure would allow us to avoid adding findSpeciesLower as a new method -- its logic could just be in speciesIndex.
There's also some value in not generating exceptions for circumstances that aren't really exceptional (at least in C++). I'm less concerned about the performance implication than with the ability to run in gdb and break at an error by simply doing catch throw. If we use exceptions in places like this, then you end up spending a lot more time continuing past these unexceptional exceptions.
I would also note that speciesIndex is actually called with a name not in m_speciesIndices is more often than you might think -- it's not just instances of case-mismatched species names. It is used extensively in multiphase systems as a way of finding which phase a species belongs to. See for example Kinetics::kineticsSpeciesIndex and Kinetics::speciesPhase.
My thought on automatically enabling case sensitivity when there are species that can only be resolved this way was that that might be less surprising to the user than having to enable it explicitly. There may also be issues with initializing phases with species that are only distinguishable by case. For instance, what if a reaction is specified (in the input file) that contains one of these species? Adding the reaction requires looking up the indices of the involved species, and this needs to happen before the user has the chance to set the case sensitive species flag.
Further: * revert unit tests to previous species definitions (some case mis-matches) * remove non-essential comments * opt to maintain case-sensitive species maps with lowercase as fallback
da35bdf to
919cadb
Compare
speth
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the additional updates. I like having the calls to findSpeciesLower isolated within speciesIndex, and would like to see that through to the end.
I'm still concerned about the effect on setting up interface kinetics systems -- without the auxiliary data structure, I think this turns into a O(n_species * n_reactions) computation, which can actually slow things down.
src/thermo/Phase.cpp
Outdated
| if (toLowerCopy(k.first) == nLower) { | ||
| if (loc == npos) { | ||
| loc = k.second; | ||
| } else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the "automatic flags", I had to play around with this a bit to realize how this was now working - an exception is only thrown when case sensitivity is disabled and you specify a name which is not a case-sensitive match for any species. So my concern about setting up reactions is moot, and I agree that there is no need to automatically toggle this flag.
I still think the speciesIndex calls within the Kinetics managers for interface phases will frequently incur the O(N) overhead. For example, if you have a kinetics manager containing a surface, a gas, and a solid phase in that order, looking up the index of a species in the solid phase, even with the correct case, will cause a search through all the surface phase species, then all the gas phase species, before finally taking the easy path for the index into the solid phase.
| std::map<std::string, size_t>::const_iterator it; | ||
|
|
||
| it = m_speciesIndices.find(nameStr); | ||
| if (it != m_speciesIndices.end()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good place to use auto instead of writing out the complex iterator type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✨ ... didn't know that one.
|
@speth ... thanks for the additional comments. I feel this is converging (also caught another case -
👍 That was exactly my intent (case sensitivity is disabled by default)
This is a legitimate point. I likewise do not see a way around the auxiliary map. I'll add a O(log(N)) fallback option in the next commit. |
|
Done. There now is a parallel |
speth
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Thanks for convincing me that this was a worthwhile change, and the willingness to iterate on the implementation details.
|
NP. Your reviews are well worth the effort. Also good to know about #709 ... |
Please fill in the issue number this pull request is fixing:
Fixes #621
The following fails (example courtesy @Saintis).
as cantera cannot differentiate between existing species
CsandCS.Changes proposed in this pull request:
Phase::speciesIndexPhaseCsandCSinnasa.ctiifcsis specified and case sensitivity is not enforced)add deprecation warning to switch default behavior from non-enforcing to enforcing after cantera 2.5