Skip to content

Conversation

@ischoegl
Copy link
Member

@ischoegl ischoegl commented Aug 12, 2019

Please fill in the issue number this pull request is fixing:

Fixes #621

The following fails (example courtesy @Saintis).

gas_cti = """ideal_gas(
    name="gas",
    elements=" S C ",
    species=" nasa: all ",
    options=["skip_undeclared_elements"],
    initial_state=state(temperature=300, pressure=(1, "bar"))
)"""
gas = ct.Solution(source=gas_cti)

as cantera cannot differentiate between existing species Cs and CS.

Changes proposed in this pull request:

  • store species names as case sensitive strings
  • retain ability to look up non-case sensitive species names, e.g. Phase::speciesIndex
  • implement flag that enforces case sensitive species names as a member variable of Phase
  • add exception handling for species that are not uniquely defined unless case sensitive (e.g. Cs and CS in nasa.cti if cs is specified and case sensitivity is not enforced)
  • add deprecation warning to switch default behavior from non-enforcing to enforcing after cantera 2.5
  • potentially add switch to save all lower case species names if explicitly stated as an option for CTI/YAML import

@ischoegl ischoegl force-pushed the case-sensitive-species-names branch 2 times, most recently from 5d59c27 to a849bad Compare August 12, 2019 10:34
@codecov
Copy link

codecov bot commented Aug 12, 2019

Codecov Report

Merging #689 into master will decrease coverage by 0.01%.
The diff coverage is 58%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #689      +/-   ##
==========================================
- Coverage   70.63%   70.61%   -0.02%     
==========================================
  Files         372      372              
  Lines       43567    43602      +35     
==========================================
+ Hits        30773    30790      +17     
- Misses      12794    12812      +18
Impacted Files Coverage Δ
include/cantera/thermo/Phase.h 100% <ø> (ø) ⬆️
src/thermo/RedlichKwongMFTP.cpp 76.39% <50%> (-0.14%) ⬇️
src/thermo/Phase.cpp 79.95% <58.69%> (-2.32%) ⬇️
src/transport/GasTransport.cpp 90.58% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5184ebc...a849bad. Read the comment docs.

@codecov
Copy link

codecov bot commented Aug 12, 2019

Codecov Report

Merging #689 into master will increase coverage by 0.02%.
The diff coverage is 82.69%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #689      +/-   ##
==========================================
+ Coverage   70.63%   70.65%   +0.02%     
==========================================
  Files         372      372              
  Lines       43567    43602      +35     
==========================================
+ Hits        30773    30809      +36     
+ Misses      12794    12793       -1
Impacted Files Coverage Δ
include/cantera/thermo/Phase.h 100% <100%> (ø) ⬆️
src/transport/GasTransport.cpp 90.58% <100%> (ø) ⬆️
src/thermo/RedlichKwongMFTP.cpp 76.39% <50%> (-0.14%) ⬇️
src/thermo/Phase.cpp 82.88% <83.72%> (+0.61%) ⬆️
src/base/global.cpp 76.92% <0%> (+1.92%) ⬆️
src/thermo/NasaPoly2.cpp 87.09% <0%> (+3.22%) ⬆️
src/base/application.h 75% <0%> (+8.33%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 265a186...fb9c749. Read the comment docs.

@ischoegl ischoegl force-pushed the case-sensitive-species-names branch 2 times, most recently from c669bc2 to 40cf2ed Compare August 12, 2019 11:35
Copy link
Member

@speth speth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm 👎 on requiring case sensitivity, at least universally. There is no consistency between reaction mechanisms from different sources, so it's quite convenient to be able to leave compositions such as 'N2: 3.76, O2: 1.0' alone and have code just work when switching between different mechanisms.

The case from the original issue report is fairly rare -- the only conflicts in nasa.cti are CS/Cs and CS2/Cs2. The one other scenario where it might be nice to have case sensitivity is if you wanted to use SMILES as identifiers, where lower case is used as an indicator of aromaticity, and so would be needed to differentiate between cyclohexane (C1CCCCC1) and benzene (c1ccccc1). I am not opposed to a flag to enable case sensitivity, with the default behavior remaining as-is.

Replacing doublereal with double throughout Phase.h and Phase.cpp makes it difficult to read the actual changes in this PR, and is the reason I suggest making that change only very locally (i.e. at most, within methods that are already being modified).

@ischoegl ischoegl force-pushed the case-sensitive-species-names branch from f1c459a to 1fe3aa6 Compare August 12, 2019 20:58
@ischoegl
Copy link
Member Author

Thanks @speth for the feedback.

There is no consistency between reaction mechanisms from different sources, so it's quite convenient to be able to leave compositions such as 'N2: 3.76, O2: 1.0' alone and have code just work when switching between different mechanisms.

Fair enough. I changed the logic to retain this ability and removed warnings, while still storing case sensitive species names in the C++ layer.

The case from the original issue report is fairly rare -- the only conflicts in nasa.cti are CS/Cs and CS2/Cs2. The one other scenario where it might be nice to have case sensitivity is if you wanted to use SMILES as identifiers, where lower case is used as an indicator of aromaticity, and so would be needed to differentiate between cyclohexane (C1CCCCC1) and benzene (c1ccccc1). I am not opposed to a flag to enable case sensitivity, with the default behavior remaining as-is.

I believe both are reasons for enabling case sensitive species? Imho it should be possible to pull rudimentary data from nasa for any of the stored species (and it is conceivable that the data base will contain more cases in the future). And, yes, the ability to use SMILES would be neat. (So would be aliases for species, but that should be kept for another PR).

Replacing doublereal with double throughout Phase.h and Phase.cpp makes it difficult to read the actual changes in this PR, and is the reason I suggest making that change only very locally (i.e. at most, within methods that are already being modified).

I kept this in a single commit, so it was quite easy to undo. It probably should nevertheless be done if the proposed PR moves forward?

The one use case this PR does not pre-empt is if one were to cut&paste bits and pieces from mechanism that use different uppercase/lowercase conventions. Since there will always be differences in nomenclature, this is imho not a major issue, as one must tread carefully in those attempts no matter what.

@ischoegl ischoegl changed the title Make species names case sensitive Save species names case sensitive Aug 12, 2019
@ischoegl ischoegl force-pushed the case-sensitive-species-names branch from 1fe3aa6 to 52e4a9b Compare August 12, 2019 21:35
Ingmar Schoegl and others added 2 commits August 12, 2019 17:31
* store species information with case sensitive names
* retain lookup for non-case sensitive species names, e.g. Phase::speciesIndex
* implement flag that enforces case sensitive species names as a member
variable of Phase
* add exception handling for species that are not uniquely defined unless case
sensitive (e.g. Cs and CS in nasa.cti if cs is specified and case sensitivity
is not enforced)
* deprecate Phase::species(std::string&)
@ischoegl ischoegl force-pushed the case-sensitive-species-names branch from 52e4a9b to cacffb7 Compare August 12, 2019 22:31
@ischoegl ischoegl changed the title Save species names case sensitive Save case sensitive species names Aug 13, 2019
Copy link
Member

@speth speth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates. I think this change will provide a concrete benefit, by allowing either the existing case-insensitive behavior in most cases but permitting case sensitivity to be activated in the instances where it is necessary.

I kept this in a single commit, so it was quite easy to undo. It probably should nevertheless be done if the proposed PR moves forward?

As much as I wish the doublereal typedef had never been introduced, I think the costs of rooting it out broadly (beyond the scope of lines of code that are changing for other reasons) exceed the benefits. In addition to making diffs hard to read, it is very easy for these changes to make it hard to rebase/merge other branches that may have meaningful changes to these lines.

The one use case this PR does not pre-empt is if one were to cut&paste bits and pieces from mechanism that use different uppercase/lowercase conventions. Since there will always be differences in nomenclature, this is imho not a major issue, as one must tread carefully in those attempts no matter what.
I'm not too worried about that, and I think at least some of those cases will work with the YAML format (less so in the CTI case).

if (toLowerCopy(k.first) == nLower) {
if (loc == npos) {
loc = k.second;
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than falling back to an O(N) search anytime a non-canonical species name is provided, I wonder if the following might be a cleaner solution:

  • Add a map of lower-case species names to species indices (say, m_speciesIndicesLower)
  • When adding species, if there is already an entry in this map with the same name, set its value to npos, and possibly set the m_caseSensitiveSpecies flag.
  • The implementation of speciesIndex can then use either m_speciesIndicesLower or m_speciesIndices depending on the value of m_caseSensitiveSpecies. If m_speciesIndicesLower is checked and contains npos, raise the exception about non-unique species identifiers.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about this suggestion. The second (lower-case) species map would contain nearly identical information; in cases like this, I usually keep the information in a single representation, as it is easier to maintain.

While the suggested implementation would be faster if a user opts to use species names with names other than those specified in the input file, I doubt that it has significant real-world benefits. Assuming that those searches are typically done during problem setup, there is little computational overhead (in repeated operations, it is always faster to select species by index). I am also unsure about auto-magically flipping flags, as those decisions may be best left to the user? Weighing cost and benefit, I am suggesting to leave the implementation as is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, there is a benefit to not introducing a new data structure. However, avoiding that requires several instances of this kind of convoluted try/catch construct (with I think another one needed in Phase::species(string).

I think using the extra data structure would allow us to avoid adding findSpeciesLower as a new method -- its logic could just be in speciesIndex.

There's also some value in not generating exceptions for circumstances that aren't really exceptional (at least in C++). I'm less concerned about the performance implication than with the ability to run in gdb and break at an error by simply doing catch throw. If we use exceptions in places like this, then you end up spending a lot more time continuing past these unexceptional exceptions.

I would also note that speciesIndex is actually called with a name not in m_speciesIndices is more often than you might think -- it's not just instances of case-mismatched species names. It is used extensively in multiphase systems as a way of finding which phase a species belongs to. See for example Kinetics::kineticsSpeciesIndex and Kinetics::speciesPhase.

My thought on automatically enabling case sensitivity when there are species that can only be resolved this way was that that might be less surprising to the user than having to enable it explicitly. There may also be issues with initializing phases with species that are only distinguishable by case. For instance, what if a reaction is specified (in the input file) that contains one of these species? Adding the reaction requires looking up the indices of the involved species, and this needs to happen before the user has the chance to set the case sensitive species flag.

Copy link
Member Author

@ischoegl ischoegl Sep 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@speth ... thanks for your comments (plus catching the other access-species-by-name case for Phase::species). I.e. this involves another (almost) duplicate map.

Your comment regarding C++ exceptions makes a lot of sense. In the current revision, I opted to replace std::map::at by std::map::find, which eliminates the try - catch blocks.

I do realize that the kinetics managers make extensive use of speciesIndex (they have to), but in most cases the logic will employ case-sensitive species without additional overhead.

Regarding the automatic flags: my thought was that in most cases, case sensitive species are used by default internally and the fallback solution is rarely used. I updated the error message to explicitly direct users towards the caseSensitiveSpecies flag, which can be employed after all species and reactions are set up: internally, everything is case sensitive, i.e. lowercase names are treated as fall-back solutions.

Finally, at this point the case-sensitive/lowercase logic is exclusively handled within Phase::speciesIndex and Phase::species, i.e. there is no longer any duplication of logic blocks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the "automatic flags", I had to play around with this a bit to realize how this was now working - an exception is only thrown when case sensitivity is disabled and you specify a name which is not a case-sensitive match for any species. So my concern about setting up reactions is moot, and I agree that there is no need to automatically toggle this flag.

I still think the speciesIndex calls within the Kinetics managers for interface phases will frequently incur the O(N) overhead. For example, if you have a kinetics manager containing a surface, a gas, and a solid phase in that order, looking up the index of a species in the solid phase, even with the correct case, will cause a search through all the surface phase species, then all the gas phase species, before finally taking the easy path for the index into the solid phase.

@ischoegl ischoegl force-pushed the case-sensitive-species-names branch from 0c18698 to da35bdf Compare September 19, 2019 17:28
if (toLowerCopy(k.first) == nLower) {
if (loc == npos) {
loc = k.second;
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, there is a benefit to not introducing a new data structure. However, avoiding that requires several instances of this kind of convoluted try/catch construct (with I think another one needed in Phase::species(string).

I think using the extra data structure would allow us to avoid adding findSpeciesLower as a new method -- its logic could just be in speciesIndex.

There's also some value in not generating exceptions for circumstances that aren't really exceptional (at least in C++). I'm less concerned about the performance implication than with the ability to run in gdb and break at an error by simply doing catch throw. If we use exceptions in places like this, then you end up spending a lot more time continuing past these unexceptional exceptions.

I would also note that speciesIndex is actually called with a name not in m_speciesIndices is more often than you might think -- it's not just instances of case-mismatched species names. It is used extensively in multiphase systems as a way of finding which phase a species belongs to. See for example Kinetics::kineticsSpeciesIndex and Kinetics::speciesPhase.

My thought on automatically enabling case sensitivity when there are species that can only be resolved this way was that that might be less surprising to the user than having to enable it explicitly. There may also be issues with initializing phases with species that are only distinguishable by case. For instance, what if a reaction is specified (in the input file) that contains one of these species? Adding the reaction requires looking up the indices of the involved species, and this needs to happen before the user has the chance to set the case sensitive species flag.

Further:
* revert unit tests to previous species definitions (some case mis-matches)
* remove non-essential comments
* opt to maintain case-sensitive species maps with lowercase as fallback
@ischoegl ischoegl force-pushed the case-sensitive-species-names branch from da35bdf to 919cadb Compare September 20, 2019 15:24
Copy link
Member

@speth speth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the additional updates. I like having the calls to findSpeciesLower isolated within speciesIndex, and would like to see that through to the end.

I'm still concerned about the effect on setting up interface kinetics systems -- without the auxiliary data structure, I think this turns into a O(n_species * n_reactions) computation, which can actually slow things down.

if (toLowerCopy(k.first) == nLower) {
if (loc == npos) {
loc = k.second;
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the "automatic flags", I had to play around with this a bit to realize how this was now working - an exception is only thrown when case sensitivity is disabled and you specify a name which is not a case-sensitive match for any species. So my concern about setting up reactions is moot, and I agree that there is no need to automatically toggle this flag.

I still think the speciesIndex calls within the Kinetics managers for interface phases will frequently incur the O(N) overhead. For example, if you have a kinetics manager containing a surface, a gas, and a solid phase in that order, looking up the index of a species in the solid phase, even with the correct case, will cause a search through all the surface phase species, then all the gas phase species, before finally taking the easy path for the index into the solid phase.

std::map<std::string, size_t>::const_iterator it;

it = m_speciesIndices.find(nameStr);
if (it != m_speciesIndices.end()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good place to use auto instead of writing out the complex iterator type.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✨ ... didn't know that one.

@ischoegl
Copy link
Member Author

ischoegl commented Sep 23, 2019

@speth ... thanks for the additional comments. I feel this is converging (also caught another case - setMassFractionByName that a unit test for some reason hadn't flagged before) and applied your recommendations.

an exception is only thrown when case sensitivity is dis en abled and you specify a name which is not a case-sensitive match for any species. So my concern about setting up reactions is moot, and I agree that there is no need to automatically toggle this flag.

👍 That was exactly my intent (case sensitivity is disabled by default)

if you have a kinetics manager containing a surface, a gas, and a solid phase in that order, looking up the index of a species in the solid phase, even with the correct case, will cause a search through all the surface phase species, then all the gas phase species, before finally taking the easy path for the index into the solid phase.

This is a legitimate point. I likewise do not see a way around the auxiliary map. I'll add a O(log(N)) fallback option in the next commit.

@ischoegl
Copy link
Member Author

Done. There now is a parallel m_SpeciesLower map, which allows for a O(log(N)) search in all cases.

Copy link
Member

@speth speth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thanks for convincing me that this was a worthwhile change, and the willingness to iterate on the implementation details.

@speth speth merged commit 9ce2553 into Cantera:master Sep 24, 2019
@ischoegl
Copy link
Member Author

NP. Your reviews are well worth the effort. Also good to know about #709 ...

@ischoegl ischoegl deleted the case-sensitive-species-names branch October 5, 2019 00:11
@ischoegl ischoegl restored the case-sensitive-species-names branch October 5, 2019 00:11
@ischoegl ischoegl deleted the case-sensitive-species-names branch October 5, 2019 00:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Unable to import all species containing C and S

2 participants