-
Notifications
You must be signed in to change notification settings - Fork 1
Enhancement: Add parsers and serializers for JSON, YAML and XML #40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
FIXME: Compiles, but no symbols exported?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds comprehensive structured data parsing and serialization support for JSON, YAML, and XML formats to the gul17 library. The implementation introduces a new DataTree class that provides a variant-based hierarchical data structure capable of representing different data types (null, boolean, integer, float, string, array, object), along with format-specific parser and serializer implementations.
Key changes:
- New
DataTreeclass providing a flexible, dynamically-typed tree structure - Parser and serializer implementations for JSON, YAML, and XML with support for escape sequences and format-specific features
- Comprehensive unit test coverage for all three formats
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
include/gul17/DataTree.h |
Defines the core DataTree class with variant-based value storage and accessor methods |
include/gul17/data_processors.h |
Declares public API functions for parsing and serializing JSON, YAML, and XML |
include/gul17/gul.h |
Adds include for the new data_processors.h header |
include/gul17/meson.build |
Registers new headers in the build system |
src/data_processors/json_processor.cc |
Implements JSON parsing and serialization with escape sequence handling |
src/data_processors/xml_processor.cc |
Implements XML parsing and serialization with attribute and comment support |
src/data_processors/yaml_processor.cc |
Implements YAML parsing and serialization with comment handling |
src/meson.build |
Registers new source files in the build system |
tests/data_processors/test_json_processor.cc |
Unit tests for JSON processor covering parsing, serialization, and edge cases |
tests/data_processors/test_xml_processor.cc |
Unit tests for XML processor covering parsing, serialization, attributes, and edge cases |
tests/data_processors/test_yaml_processor.cc |
Unit tests for YAML processor covering parsing, serialization, and edge cases |
tests/meson.build |
Registers new test files in the build system |
Comments suppressed due to low confidence (1)
include/gul17/DataTree.h:1
- The example code uses lowercase type names
arrayandobject, but the actual type names areArrayandObject(capitalized). This will cause compilation errors if someone copies this example. Update to useDataTree::ArrayandDataTree::Object.
/**
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 7 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 21 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 10 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 14 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| } | ||
| else if (value.is_double()) | ||
| { | ||
| output_ << value.as<double>(); |
Copilot
AI
Nov 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The same std::to_string precision issue applies here for YAML serialization of double values. Consider using more controlled formatting for better output consistency.
| else if (value.is_int()) | ||
| { | ||
| output_ << indent_str << opening_tag << ">" | ||
| << std::to_string(value.as<int>()) |
Copilot
AI
Nov 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Inconsistent use of std::to_string for integers in XML serialization. While integers don't have the same precision issues as doubles, using output_ << value.as<int>() directly would be more consistent with boolean serialization and slightly more efficient.
| << std::to_string(value.as<int>()) | |
| << value.as<int>() |
| else if (c == '"' && !in_single_quote) | ||
| { | ||
| // Only toggle if not escaped | ||
| if (i == 0 || line[i-1] != '\\') |
Copilot
AI
Nov 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing bounds checking before accessing str[i-1]. If i == 0, this will access memory before the string start. Add a check i > 0 before accessing line[i-1].
| if (i == 0 || line[i-1] != '\\') | |
| if (i == 0 || (i > 0 && line[i-1] != '\\')) |
| char buf[12]; | ||
| snprintf(buf, sizeof(buf), "\\u%04x", static_cast<unsigned char>(c)); |
Copilot
AI
Nov 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The snprintf call should check the return value to ensure the buffer was large enough, even though 12 bytes should be sufficient for \u%04x. Consider using a safer approach like std::ostringstream or verify the return value.
| output_ << indent_str << opening_tag << ">" | ||
| << std::to_string(value.as<double>()) | ||
| << closing_tag; |
Copilot
AI
Nov 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The std::to_string function is used to serialize doubles, which may produce trailing zeros or lack precision control. For XML/JSON serialization, consider using more controlled formatting (e.g., std::ostringstream with precision settings) to ensure consistent and predictable output.
| output_ << indent_str << opening_tag << ">" | |
| << std::to_string(value.as<double>()) | |
| << closing_tag; | |
| { | |
| std::ostringstream oss; | |
| oss.precision(15); | |
| oss << std::defaultfloat << value.as<double>(); | |
| output_ << indent_str << opening_tag << ">" | |
| << oss.str() | |
| << closing_tag; | |
| } |
| if (obj.find(key) != obj.end()) | ||
| { | ||
| throw std::runtime_error(gul17::cat("Duplicate attribute name: ", attr_name, " at position ", pos_)); | ||
| } |
Copilot
AI
Nov 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The duplicate attribute check at line 170 checks if the key already exists in obj, but attributes are being added with the "@" prefix. The check should verify if an attribute with the same name has already been added during the attribute parsing phase, not in the final object. Consider checking for duplicates in the attributes vector instead.
| if (obj.find(key) != obj.end()) | |
| { | |
| throw std::runtime_error(gul17::cat("Duplicate attribute name: ", attr_name, " at position ", pos_)); | |
| } |
| { | ||
| std::string str = value.as<std::string>(); | ||
| // Quote strings if they contain special characters | ||
| if (str.empty() || str.find_first_of(":#{}[]&*!|>\"'%") != std::string::npos) |
Copilot
AI
Nov 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The YAML serializer checks for special characters that require quoting with str.find_first_of(":#{}[]&*!|>\"'%"), but this doesn't cover all cases where quoting is necessary in YAML. For example, strings starting with - or ? followed by space, or strings that look like numbers/booleans should also be quoted. Consider a more comprehensive check or document the limitations.
| if (c == '"') | ||
| output_ << "\\\""; | ||
| else | ||
| output_ << c; |
Copilot
AI
Nov 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When serializing YAML strings with quotes, only the double-quote character is escaped. However, other escape sequences like newlines, tabs, and backslashes should also be escaped when inside quoted strings to produce valid YAML. The current implementation would produce invalid YAML if the string contains newlines or backslashes.
| if (c == '"') | |
| output_ << "\\\""; | |
| else | |
| output_ << c; | |
| switch (c) | |
| { | |
| case '"': | |
| output_ << "\\\""; | |
| break; | |
| case '\\': | |
| output_ << "\\\\"; | |
| break; | |
| case '\n': | |
| output_ << "\\n"; | |
| break; | |
| case '\r': | |
| output_ << "\\r"; | |
| break; | |
| case '\t': | |
| output_ << "\\t"; | |
| break; | |
| default: | |
| // Escape other control characters as \xXX | |
| if (static_cast<unsigned char>(c) < 0x20) | |
| { | |
| char buf[5]; | |
| snprintf(buf, sizeof(buf), "\\x%02X", static_cast<unsigned char>(c)); | |
| output_ << buf; | |
| } | |
| else | |
| { | |
| output_ << c; | |
| } | |
| break; | |
| } |
| } | ||
| else if (value.is_double()) | ||
| { | ||
| output_ << std::to_string(value.as<double>()); |
Copilot
AI
Nov 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The same std::to_string precision issue applies here for JSON serialization of double values. Consider using more controlled formatting for better output consistency.
| TEST_CASE("YamlProcessor: YAML parsing with errors", "[YamlProcessor]") | ||
| { | ||
| // Currently, the parser does not throw exceptions for malformed YAML. | ||
| } |
Copilot
AI
Nov 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment states "Currently, the parser does not throw exceptions for malformed YAML" but there's no test case validating this behavior. Consider adding at least one test case with malformed YAML to verify the parser's behavior, or remove this comment if error handling is expected.
|
This AI cruft does not make the PR human readable. Digging through at this moment already 68 comments is nothing I like and have time for. Maybe you should do the AI processing in a private chat or what and keep this for humans. Anyhow, I did not look at the code at all, but
Iirc @soerengrunewald did a lot research on the serialization and it is always best to use one of the existing libraries and not invent the wheel again. Edit: Add last paragraph |
|
My little Copilot chat...
Using an Existing Library vs. Rolling Your Own Parser When it comes to parsing and manipulating XML data in C++, there are two primary options: using an existing library like Xerces or writing your own parser from scratch. Here's a detailed analysis of the pros and cons of each approach: Pros:
Cons:
Writing Your Own Parser Pros:
Cons:
Conclusion Unless you have a specific requirement that cannot be met by an existing library like Xerces (e.g., a highly customized or embedded system), it is generally recommended to use an established library like Xerces for XML parsing and manipulation in C++. The benefits of using a mature library, including stability, standards compliance, and community support, outweigh the potential drawbacks. Writing your own parser should only be considered if:
In most cases, using Xerces or another established library will save you time, effort, and potential headaches in the long run. |
alt-graph
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks useful to me, so it might tick the "Only code that is potentially useful for a wide range of C++ projects" box. What worries me a little is the lack of usage experience.
@zykure: Have you already used this or similar code either in multiple projects or intensively in a single project? If not, it might be better to put it into a separate library and gather experience there. Once it gets into GUL we'll be very unwilling to change the API again.
Do we have other project where we would want to use this? The clientlib already has Structure which is similar but different.
Other observations from a casual glance at the code:
- "Data processors" might be a misnomer. Maybe "parser" is closer to what the code is actually doing?
- The free functions
from_json_string()etc. do not have enough context for their names. It might be better to embed them into theDataTreeclass as static functions, e.g.auto tree = DataTree::from_json_string().
|
I'm not sure how other people do it, but in our servers that use XML they of course validate the input XML via XSD. |
|
@alt-graph wrote:
I see no reason why it should/could not stay in a separate library. 🤔
Well, but otoh we need to keep GUL "small enough" and not become the bucket for everything. |
|
I think @Finii is on to something. I would rather see GSL features in GUL over some parsers. Also there are battle tested parsers out there (which fixed all the edge cases), may of them header only and available via wrapdb, e.g:
That been said, I also see the benefit of having a unified interfaces for any type of parser. Non the less I don't see it gul. On the code side there is also some things which I would like to see changed.
|
This PR adds parsing and serialization features for structured data in JSON, YAML and XML formats.
For each format, a new function
to_*_string()andfrom_*_string()is exposed in the gul17 library via thedata_processors.hheader. These functions convert between the string representation of the data and the newDataTreestructure, which contains the corresponding tree nodes.The
DataTreestruct is essentially a wrapper around anstd::variantvalue that can hold null, boolean, integer, float and string values as well as otherDataTreeobjects in the form of a map or vector. This allows to represent each data item by a node in the tree, and includes sub-trees and node arrays.The XML parser/serializer also handles attributes and multiple elements of the same name. Attributes are added as tree nodes with a name like
"@attr"under the current XML element, and multiple elements that have the same XML tag on the same tree level are added as a node array. This is not a full-scale DOM/SAX-style parser of course, so there are some limitations.Unit tests for the 3 formats are provided as well.
I think this enhancement would be useful to other projects such as the DOOCS serverlib/clientlib where XML is already being used; and JSON/YAML parsing would add a new and convenient way of transferring structured data via the DOOCS protocol or to read and write configuration files, etc.