Skip to content

Conversation

@zykure
Copy link

@zykure zykure commented Nov 27, 2025

This PR adds parsing and serialization features for structured data in JSON, YAML and XML formats.

For each format, a new function to_*_string() and from_*_string() is exposed in the gul17 library via the data_processors.h header. These functions convert between the string representation of the data and the new DataTree structure, which contains the corresponding tree nodes.

The DataTree struct is essentially a wrapper around an std::variant value that can hold null, boolean, integer, float and string values as well as other DataTree objects in the form of a map or vector. This allows to represent each data item by a node in the tree, and includes sub-trees and node arrays.

The XML parser/serializer also handles attributes and multiple elements of the same name. Attributes are added as tree nodes with a name like "@attr" under the current XML element, and multiple elements that have the same XML tag on the same tree level are added as a node array. This is not a full-scale DOM/SAX-style parser of course, so there are some limitations.

Unit tests for the 3 formats are provided as well.

I think this enhancement would be useful to other projects such as the DOOCS serverlib/clientlib where XML is already being used; and JSON/YAML parsing would add a new and convenient way of transferring structured data via the DOOCS protocol or to read and write configuration files, etc.

@zykure zykure requested a review from Copilot November 27, 2025 14:15
@zykure zykure added the enhancement New feature or request label Nov 27, 2025
@zykure zykure requested a review from alt-graph November 27, 2025 14:16
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds comprehensive structured data parsing and serialization support for JSON, YAML, and XML formats to the gul17 library. The implementation introduces a new DataTree class that provides a variant-based hierarchical data structure capable of representing different data types (null, boolean, integer, float, string, array, object), along with format-specific parser and serializer implementations.

Key changes:

  • New DataTree class providing a flexible, dynamically-typed tree structure
  • Parser and serializer implementations for JSON, YAML, and XML with support for escape sequences and format-specific features
  • Comprehensive unit test coverage for all three formats

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
include/gul17/DataTree.h Defines the core DataTree class with variant-based value storage and accessor methods
include/gul17/data_processors.h Declares public API functions for parsing and serializing JSON, YAML, and XML
include/gul17/gul.h Adds include for the new data_processors.h header
include/gul17/meson.build Registers new headers in the build system
src/data_processors/json_processor.cc Implements JSON parsing and serialization with escape sequence handling
src/data_processors/xml_processor.cc Implements XML parsing and serialization with attribute and comment support
src/data_processors/yaml_processor.cc Implements YAML parsing and serialization with comment handling
src/meson.build Registers new source files in the build system
tests/data_processors/test_json_processor.cc Unit tests for JSON processor covering parsing, serialization, and edge cases
tests/data_processors/test_xml_processor.cc Unit tests for XML processor covering parsing, serialization, attributes, and edge cases
tests/data_processors/test_yaml_processor.cc Unit tests for YAML processor covering parsing, serialization, and edge cases
tests/meson.build Registers new test files in the build system
Comments suppressed due to low confidence (1)

include/gul17/DataTree.h:1

  • The example code uses lowercase type names array and object, but the actual type names are Array and Object (capitalized). This will cause compilation errors if someone copies this example. Update to use DataTree::Array and DataTree::Object.
/**

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

zykure and others added 4 commits November 27, 2025 15:30
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 7 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@zykure zykure marked this pull request as ready for review November 27, 2025 15:28
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 21 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@zykure zykure marked this pull request as draft November 27, 2025 15:44
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 10 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 14 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

}
else if (value.is_double())
{
output_ << value.as<double>();
Copy link

Copilot AI Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The same std::to_string precision issue applies here for YAML serialization of double values. Consider using more controlled formatting for better output consistency.

Copilot uses AI. Check for mistakes.
else if (value.is_int())
{
output_ << indent_str << opening_tag << ">"
<< std::to_string(value.as<int>())
Copy link

Copilot AI Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Inconsistent use of std::to_string for integers in XML serialization. While integers don't have the same precision issues as doubles, using output_ << value.as<int>() directly would be more consistent with boolean serialization and slightly more efficient.

Suggested change
<< std::to_string(value.as<int>())
<< value.as<int>()

Copilot uses AI. Check for mistakes.
else if (c == '"' && !in_single_quote)
{
// Only toggle if not escaped
if (i == 0 || line[i-1] != '\\')
Copy link

Copilot AI Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing bounds checking before accessing str[i-1]. If i == 0, this will access memory before the string start. Add a check i > 0 before accessing line[i-1].

Suggested change
if (i == 0 || line[i-1] != '\\')
if (i == 0 || (i > 0 && line[i-1] != '\\'))

Copilot uses AI. Check for mistakes.
Comment on lines +479 to +480
char buf[12];
snprintf(buf, sizeof(buf), "\\u%04x", static_cast<unsigned char>(c));
Copy link

Copilot AI Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The snprintf call should check the return value to ensure the buffer was large enough, even though 12 bytes should be sufficient for \u%04x. Consider using a safer approach like std::ostringstream or verify the return value.

Copilot uses AI. Check for mistakes.
Comment on lines +460 to +462
output_ << indent_str << opening_tag << ">"
<< std::to_string(value.as<double>())
<< closing_tag;
Copy link

Copilot AI Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The std::to_string function is used to serialize doubles, which may produce trailing zeros or lack precision control. For XML/JSON serialization, consider using more controlled formatting (e.g., std::ostringstream with precision settings) to ensure consistent and predictable output.

Suggested change
output_ << indent_str << opening_tag << ">"
<< std::to_string(value.as<double>())
<< closing_tag;
{
std::ostringstream oss;
oss.precision(15);
oss << std::defaultfloat << value.as<double>();
output_ << indent_str << opening_tag << ">"
<< oss.str()
<< closing_tag;
}

Copilot uses AI. Check for mistakes.
Comment on lines +168 to +171
if (obj.find(key) != obj.end())
{
throw std::runtime_error(gul17::cat("Duplicate attribute name: ", attr_name, " at position ", pos_));
}
Copy link

Copilot AI Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The duplicate attribute check at line 170 checks if the key already exists in obj, but attributes are being added with the "@" prefix. The check should verify if an attribute with the same name has already been added during the attribute parsing phase, not in the final object. Consider checking for duplicates in the attributes vector instead.

Suggested change
if (obj.find(key) != obj.end())
{
throw std::runtime_error(gul17::cat("Duplicate attribute name: ", attr_name, " at position ", pos_));
}

Copilot uses AI. Check for mistakes.
{
std::string str = value.as<std::string>();
// Quote strings if they contain special characters
if (str.empty() || str.find_first_of(":#{}[]&*!|>\"'%") != std::string::npos)
Copy link

Copilot AI Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The YAML serializer checks for special characters that require quoting with str.find_first_of(":#{}[]&*!|>\"'%"), but this doesn't cover all cases where quoting is necessary in YAML. For example, strings starting with - or ? followed by space, or strings that look like numbers/booleans should also be quoted. Consider a more comprehensive check or document the limitations.

Copilot uses AI. Check for mistakes.
Comment on lines +520 to +523
if (c == '"')
output_ << "\\\"";
else
output_ << c;
Copy link

Copilot AI Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When serializing YAML strings with quotes, only the double-quote character is escaped. However, other escape sequences like newlines, tabs, and backslashes should also be escaped when inside quoted strings to produce valid YAML. The current implementation would produce invalid YAML if the string contains newlines or backslashes.

Suggested change
if (c == '"')
output_ << "\\\"";
else
output_ << c;
switch (c)
{
case '"':
output_ << "\\\"";
break;
case '\\':
output_ << "\\\\";
break;
case '\n':
output_ << "\\n";
break;
case '\r':
output_ << "\\r";
break;
case '\t':
output_ << "\\t";
break;
default:
// Escape other control characters as \xXX
if (static_cast<unsigned char>(c) < 0x20)
{
char buf[5];
snprintf(buf, sizeof(buf), "\\x%02X", static_cast<unsigned char>(c));
output_ << buf;
}
else
{
output_ << c;
}
break;
}

Copilot uses AI. Check for mistakes.
}
else if (value.is_double())
{
output_ << std::to_string(value.as<double>());
Copy link

Copilot AI Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The same std::to_string precision issue applies here for JSON serialization of double values. Consider using more controlled formatting for better output consistency.

Copilot uses AI. Check for mistakes.
Comment on lines +105 to +108
TEST_CASE("YamlProcessor: YAML parsing with errors", "[YamlProcessor]")
{
// Currently, the parser does not throw exceptions for malformed YAML.
}
Copy link

Copilot AI Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment states "Currently, the parser does not throw exceptions for malformed YAML" but there's no test case validating this behavior. Consider adding at least one test case with malformed YAML to verify the parser's behavior, or remove this comment if error handling is expected.

Copilot uses AI. Check for mistakes.
@Finii
Copy link
Member

Finii commented Nov 28, 2025

This AI cruft does not make the PR human readable. Digging through at this moment already 68 comments is nothing I like and have time for. Maybe you should do the AI processing in a private chat or what and keep this for humans.

Anyhow, I did not look at the code at all, but

  • I believe we should use one of the several available and thoroughly tested libraries for parsing and not invent parsers on our own.
  • And with use I mean in the other libraries / servers; not pulling these libs in into GUL
  • We use Xerces for example in a lot libraries and servers, why switch away from a well maintained library?

and JSON/YAML parsing would add a new and convenient way of transferring structured data via the DOOCS protocol

Iirc @soerengrunewald did a lot research on the serialization and it is always best to use one of the existing libraries and not invent the wheel again.
Yaml for example has some severe quoting challenges, and I'm not sure we are able to catch them all - and also see no need to add a maintenance burden on us instead of relying on widespread used libraries.

Edit: Add last paragraph

@Finii
Copy link
Member

Finii commented Nov 28, 2025

My little Copilot chat...

Is it better to use the xerces library to manipulate XML data in a c++ program, or shall we write our own parser? Please give extensive reasoning for you answer.

Using an Existing Library vs. Rolling Your Own Parser

When it comes to parsing and manipulating XML data in C++, there are two primary options: using an existing library like Xerces or writing your own parser from scratch. Here's a detailed analysis of the pros and cons of each approach:
Using the Xerces Library

Pros:

  • Maturity and Stability: Xerces is a well-established, widely-used, and mature library with a large user base. This translates to a more stable and reliable parsing experience.
  • Standards Compliance: Xerces is designed to be compliant with various XML standards, including XML 1.0, XML 1.1, and XML Schema.
  • Feature-Rich: Xerces provides a comprehensive set of features, including support for DOM, SAX, and StAX parsing, as well as XML validation, XPath, and XSLT.
  • Community Support: As a popular library, Xerces has an active community, which means there are many resources available for troubleshooting and learning.
  • Less Code to Maintain: By using an existing library, you can focus on your application's logic without worrying about the intricacies of XML parsing.

Cons:

  • Overhead: Using a library like Xerces can introduce additional overhead, including increased binary size and potential performance impacts.
  • Learning Curve: While Xerces is well-documented, it still requires time and effort to learn how to use it effectively.

Writing Your Own Parser

Pros:

  • Customizability: By writing your own parser, you can tailor it to your specific needs and optimize it for your use case.
  • Lightweight: A custom parser can be designed to be lightweight, reducing binary size and overhead.
  • Fine-Grained Control: With a custom parser, you have complete control over the parsing process, allowing for precise error handling and optimization.

Cons:

  • Complexity: Writing a robust and standards-compliant XML parser is a complex task, requiring significant expertise and time.
  • Bugs and Edge Cases: A custom parser is more likely to contain bugs and may not handle edge cases as well as a mature library like Xerces.
  • Maintenance Burden: Your team will need to invest time and resources in maintaining and updating the parser, which can be a significant burden.

Conclusion

Unless you have a specific requirement that cannot be met by an existing library like Xerces (e.g., a highly customized or embedded system), it is generally recommended to use an established library like Xerces for XML parsing and manipulation in C++. The benefits of using a mature library, including stability, standards compliance, and community support, outweigh the potential drawbacks.

Writing your own parser should only be considered if:

  • You have a very specific, lightweight, or embedded system requirement.
  • You have extensive experience with XML parsing and are confident in your ability to create a robust and standards-compliant parser.

In most cases, using Xerces or another established library will save you time, effort, and potential headaches in the long run.

Copy link
Member

@alt-graph alt-graph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks useful to me, so it might tick the "Only code that is potentially useful for a wide range of C++ projects" box. What worries me a little is the lack of usage experience.

@zykure: Have you already used this or similar code either in multiple projects or intensively in a single project? If not, it might be better to put it into a separate library and gather experience there. Once it gets into GUL we'll be very unwilling to change the API again.

Do we have other project where we would want to use this? The clientlib already has Structure which is similar but different.

Other observations from a casual glance at the code:

  • "Data processors" might be a misnomer. Maybe "parser" is closer to what the code is actually doing?
  • The free functions from_json_string() etc. do not have enough context for their names. It might be better to embed them into the DataTree class as static functions, e.g. auto tree = DataTree::from_json_string().

@Finii
Copy link
Member

Finii commented Nov 28, 2025

I'm not sure how other people do it, but in our servers that use XML they of course validate the input XML via XSD.
That should of course be part of "our parser" if we have one, as good practice would be to validate the data before using it for better error handling and reporting.
Is that also included here? Without validation we would encourage people to use XML un-validated. The XSD data that belongs to a XML input also help documenting what we really expect and allow, giving a (automatically validated) protocol description.

@Finii
Copy link
Member

Finii commented Nov 28, 2025

@alt-graph wrote:

it might be better to put it into a separate library and gather experience there

I see no reason why it should/could not stay in a separate library. 🤔
Do not really see this fitting into GUL.

tick the "Only code that is potentially useful for a wide range of C++ projects" box

Well, but otoh we need to keep GUL "small enough" and not become the bucket for everything.
We've had other functionality not added to GUL, like Ryu, like fmt, like gsl, ...
I vote for a separate library (or using external libraries anyhow).

@soerengrunewald
Copy link
Contributor

I think @Finii is on to something. I would rather see GSL features in GUL over some parsers.

Also there are battle tested parsers out there (which fixed all the edge cases), may of them header only and available via wrapdb, e.g:

That been said, I also see the benefit of having a unified interfaces for any type of parser. Non the less I don't see it gul.

On the code side there is also some things which I would like to see changed.

  • Generally the API seems inconsistend.
  • DataTree::at DataTree::operator[] key access should be string_views
  • DataTree::as DataTree::is seem to differ in behavior, might be wanted, but seems strange.
  • const_cast not a fan.
  • General style issues (e.g large if-else blocks and loops, deep nesting, hard coded strings/numerical values)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants