Enhancement: Add parsers and serializers for JSON, YAML and XML #40

zykure · 2025-11-27T14:15:59Z

This PR adds parsing and serialization features for structured data in JSON, YAML and XML formats.

For each format, new to_*_string() and from_*_string() functions are exposed in the gul17 library via the data_processors.h header. These functions convert between the string representation of the data and the new DataTree structure, which contains the corresponding tree nodes.

The DataTree struct is essentially a wrapper around an std::variant value that can hold null, boolean, integer, float and string values as well as other DataTree objects in the form of a map or vector. This allows each data item to be represented by a node in the tree, including sub-trees and node arrays.
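To make the description above concrete, here is a minimal sketch of a variant-based tree node. It is not the PR's DataTree: the type name `Node`, the member names, the choice of `long long` for integers and the helper `make_example()` are all assumptions made for illustration.

```cpp
#include <map>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical stand-in for the PR's DataTree; names and layout are assumptions.
struct Node
{
    using Array = std::vector<Node>;            // node array
    using Object = std::map<std::string, Node>; // sub-tree, ordered by key

    // std::monostate plays the role of "null".
    std::variant<std::monostate, bool, long long, double, std::string, Array, Object> value;

    bool is_null() const { return std::holds_alternative<std::monostate>(value); }

    template <typename T>
    bool is() const { return std::holds_alternative<T>(value); }

    // Throws std::bad_variant_access if the node holds a different type.
    template <typename T>
    const T& as() const { return std::get<T>(value); }
};

// Builds the tree for {"name": "gul17", "tags": ["json", "yaml"]}.
inline Node make_example()
{
    Node::Array tags;
    tags.push_back(Node{std::string{"json"}});
    tags.push_back(Node{std::string{"yaml"}});

    Node::Object obj;
    obj.emplace("name", Node{std::string{"gul17"}});
    obj.emplace("tags", Node{std::move(tags)});
    return Node{std::move(obj)};
}
```

Declaring the map and vector of `Node` inside `Node` itself relies on the standard containers accepting the still-incomplete type at that point, which the common standard library implementations do.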

The XML parser/serializer also handles attributes and multiple elements of the same name. Attributes are added as tree nodes with a name like "@attr" under the current XML element, and multiple elements that have the same XML tag on the same tree level are added as a node array. This is not a full-scale DOM/SAX-style parser of course, so there are some limitations.
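The mapping described above can be sketched as follows. This is only an illustration of the idea ("@attr" keys for attributes, promotion to an array when a tag repeats); `XmlNode`, `add_attribute()` and `add_child()` are hypothetical names and do not come from the PR.

```cpp
#include <map>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical minimal tree type; the PR's DataTree API is not reproduced here.
struct XmlNode
{
    using Array = std::vector<XmlNode>;
    using Object = std::map<std::string, XmlNode>;
    std::variant<std::monostate, std::string, Array, Object> value;
};

// Store an attribute under the key "@<name>" of the element's object.
inline void add_attribute(XmlNode::Object& element, const std::string& name, std::string text)
{
    element["@" + name] = XmlNode{std::move(text)};
}

// Add a child element. A second child with the same tag turns the entry
// into an array; further children are appended to that array.
inline void add_child(XmlNode::Object& element, const std::string& tag, XmlNode child)
{
    auto it = element.find(tag);
    if (it == element.end())
    {
        element.emplace(tag, std::move(child));
        return;
    }
    if (auto* existing = std::get_if<XmlNode::Array>(&it->second.value))
    {
        existing->push_back(std::move(child));
        return;
    }
    XmlNode::Array arr;
    arr.push_back(std::move(it->second));
    arr.push_back(std::move(child));
    it->second.value = std::move(arr);
}
```

One consequence of this scheme is that a single `<tag>` and a list containing one `<tag>` cannot be told apart after parsing, which is presumably among the limitations mentioned above.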

Unit tests for the 3 formats are provided as well.

I think this enhancement would be useful to other projects such as the DOOCS serverlib/clientlib where XML is already being used; and JSON/YAML parsing would add a new and convenient way of transferring structured data via the DOOCS protocol or to read and write configuration files, etc.

FIXME: Compiles, but no symbols exported?

Copilot

Pull request overview

This PR adds comprehensive structured data parsing and serialization support for JSON, YAML, and XML formats to the gul17 library. The implementation introduces a new DataTree class that provides a variant-based hierarchical data structure capable of representing different data types (null, boolean, integer, float, string, array, object), along with format-specific parser and serializer implementations.

Key changes:

- New DataTree class providing a flexible, dynamically-typed tree structure
- Parser and serializer implementations for JSON, YAML, and XML with support for escape sequences and format-specific features
- Comprehensive unit test coverage for all three formats

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| `include/gul17/DataTree.h` | Defines the core `DataTree` class with variant-based value storage and accessor methods |
| `include/gul17/data_processors.h` | Declares public API functions for parsing and serializing JSON, YAML, and XML |
| `include/gul17/gul.h` | Adds include for the new data_processors.h header |
| `include/gul17/meson.build` | Registers new headers in the build system |
| `src/data_processors/json_processor.cc` | Implements JSON parsing and serialization with escape sequence handling |
| `src/data_processors/xml_processor.cc` | Implements XML parsing and serialization with attribute and comment support |
| `src/data_processors/yaml_processor.cc` | Implements YAML parsing and serialization with comment handling |
| `src/meson.build` | Registers new source files in the build system |
| `tests/data_processors/test_json_processor.cc` | Unit tests for JSON processor covering parsing, serialization, and edge cases |
| `tests/data_processors/test_xml_processor.cc` | Unit tests for XML processor covering parsing, serialization, attributes, and edge cases |
| `tests/data_processors/test_yaml_processor.cc` | Unit tests for YAML processor covering parsing, serialization, and edge cases |
| `tests/meson.build` | Registers new test files in the build system |

Comments suppressed due to low confidence (1)

include/gul17/DataTree.h:1

The example code uses lowercase type names array and object, but the actual type names are Array and Object (capitalized). This will cause compilation errors if someone copies this example. Update to use DataTree::Array and DataTree::Object.

/**

include/gul17/data_processors.h

src/data_processors/yaml_processor.cc

src/data_processors/json_processor.cc

src/data_processors/yaml_processor.cc

src/data_processors/xml_processor.cc

tests/data_processors/test_yaml_processor.cc

tests/data_processors/test_json_processor.cc

src/data_processors/json_processor.cc

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 7 comments.

include/gul17/data_processors.h

src/data_processors/json_processor.cc

src/data_processors/yaml_processor.cc

include/gul17/DataTree.h

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 21 comments.

src/data_processors/json_processor.cc

src/data_processors/yaml_processor.cc

src/data_processors/json_processor.cc

src/data_processors/yaml_processor.cc

include/gul17/DataTree.h

src/data_processors/xml_processor.cc

src/data_processors/yaml_processor.cc

include/gul17/data_processors.h

Copilot

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 10 comments.

src/data_processors/yaml_processor.cc

src/data_processors/xml_processor.cc

tests/data_processors/test_yaml_processor.cc

tests/data_processors/test_json_processor.cc

src/data_processors/json_processor.cc

tests/data_processors/test_xml_processor.cc

src/data_processors/yaml_processor.cc

Copilot

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 14 comments.

Copilot · 2025-11-27T17:08:21Z

src/data_processors/yaml_processor.cc

+        }
+        else if (value.is_double())
+        {
+            output_ << value.as<double>();


[nitpick] The same std::to_string precision issue applies here for YAML serialization of double values. Consider using more controlled formatting for better output consistency.

Copilot · 2025-11-27T17:08:21Z

src/data_processors/xml_processor.cc

+        else if (value.is_int())
+        {
+            output_ << indent_str << opening_tag << ">"
+                << std::to_string(value.as<int>())


[nitpick] Inconsistent use of std::to_string for integers in XML serialization. While integers don't have the same precision issues as doubles, using output_ << value.as<int>() directly would be more consistent with boolean serialization and slightly more efficient.

Suggested change

-                << std::to_string(value.as<int>())
+                << value.as<int>()

Copilot · 2025-11-27T17:08:21Z

src/data_processors/yaml_processor.cc

+            else if (c == '"' && !in_single_quote)
+            {
+                // Only toggle if not escaped
+                if (i == 0 || line[i-1] != '\\')


Missing bounds checking before accessing str[i-1]. If i == 0, this will access memory before the string start. Add a check i > 0 before accessing line[i-1].

Suggested change

-                if (i == 0 || line[i-1] != '\\')
+                if (i == 0 || (i > 0 && line[i-1] != '\\'))
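For context on the check discussed above: C++ evaluates `||` left to right and stops at the first true operand, so in `i == 0 || line[i-1] != '\\'` the element `line[i-1]` is only read when `i > 0`. A separate concern with looking back a single character is that a quote preceded by an escaped backslash (`\\"`) is treated as escaped. The sketch below counts the run of preceding backslashes instead; `is_escaped()` is a hypothetical helper, not code from the PR, and whether the parser should follow this rule depends on the quoting style being handled.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: returns true if the character at index i is preceded
// by an odd number of consecutive backslashes, i.e. it is itself escaped.
inline bool is_escaped(std::string_view line, std::size_t i)
{
    std::size_t backslashes = 0;
    // The i > 0 test guards the line[i - 1] access; && short-circuits.
    while (i > 0 && line[i - 1] == '\\')
    {
        ++backslashes;
        --i;
    }
    return backslashes % 2 == 1;
}
```

With this helper, the toggle condition would read `if (!is_escaped(line, i))`.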

Copilot · 2025-11-27T17:08:22Z

src/data_processors/json_processor.cc

+                        char buf[12];
+                        snprintf(buf, sizeof(buf), "\\u%04x", static_cast<unsigned char>(c));


[nitpick] The snprintf call should check the return value to ensure the buffer was large enough, even though 12 bytes should be sufficient for \u%04x. Consider using a safer approach like std::ostringstream or verify the return value.
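A minimal sketch of the return-value check mentioned above, assuming a free-standing helper named `json_unicode_escape()` (a hypothetical name, not part of the PR). `std::snprintf` returns the number of characters it would have written excluding the terminating NUL, or a negative value on error, so truncation shows up as a result greater than or equal to the buffer size.

```cpp
#include <cstddef>
#include <cstdio>
#include <stdexcept>
#include <string>

// Hypothetical helper: format one character as a JSON \u00XX escape.
inline std::string json_unicode_escape(char c)
{
    char buf[8]; // "\u" + 4 hex digits + terminating NUL = 7 bytes
    const int n = std::snprintf(buf, sizeof(buf), "\\u%04x",
                                static_cast<unsigned>(static_cast<unsigned char>(c)));
    if (n < 0 || static_cast<std::size_t>(n) >= sizeof(buf))
        throw std::runtime_error("snprintf failed or output was truncated");
    return std::string(buf, static_cast<std::size_t>(n));
}
```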

Copilot · 2025-11-27T17:08:22Z

src/data_processors/xml_processor.cc

+            output_ << indent_str << opening_tag << ">"
+                << std::to_string(value.as<double>())
+                << closing_tag;


[nitpick] The std::to_string function is used to serialize doubles, which may produce trailing zeros or lack precision control. For XML/JSON serialization, consider using more controlled formatting (e.g., std::ostringstream with precision settings) to ensure consistent and predictable output.

Suggested change

-            output_ << indent_str << opening_tag << ">"
-                << std::to_string(value.as<double>())
-                << closing_tag;
+            {
+                std::ostringstream oss;
+                oss.precision(15);
+                oss << std::defaultfloat << value.as<double>();
+                output_ << indent_str << opening_tag << ">"
+                    << oss.str()
+                    << closing_tag;
+            }
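To illustrate the precision concern: `std::to_string(double)` formats like `%f` with six digits after the decimal point, so small values lose all their digits and short values gain trailing zeros. The sketch below uses `std::numeric_limits<double>::max_digits10` (17 for IEEE 754 doubles) rather than the 15 from the suggestion, since 17 significant digits are what is needed for a guaranteed text round trip. `format_double()` is a hypothetical helper name, and the trade-off is longer output for values such as 0.1.

```cpp
#include <limits>
#include <sstream>
#include <string>

// Hypothetical helper: format a double with enough significant digits
// that parsing the text again yields the same value.
inline std::string format_double(double value)
{
    std::ostringstream oss;
    oss.precision(std::numeric_limits<double>::max_digits10);
    oss << value; // default float format, similar to printf's %g
    return oss.str();
}
```

Note that stream output follows the stream's locale, so a serializer would probably also want to imbue the classic "C" locale to keep the decimal point stable.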

Copilot · 2025-11-27T17:08:24Z

src/data_processors/xml_processor.cc

+                if (obj.find(key) != obj.end())
+                {
+                    throw std::runtime_error(gul17::cat("Duplicate attribute name: ", attr_name, " at position ", pos_));
+                }


The duplicate attribute check at line 170 checks if the key already exists in obj, but attributes are being added with the "@" prefix. The check should verify if an attribute with the same name has already been added during the attribute parsing phase, not in the final object. Consider checking for duplicates in the attributes vector instead.

Suggested change

                if (obj.find(key) != obj.end())
                {
                    throw std::runtime_error(gul17::cat("Duplicate attribute name: ", attr_name, " at position ", pos_));
                }

Copilot · 2025-11-27T17:08:25Z

src/data_processors/yaml_processor.cc

+        {
+            std::string str = value.as<std::string>();
+            // Quote strings if they contain special characters
+            if (str.empty() || str.find_first_of(":#{}[]&*!|>\"'%") != std::string::npos)


The YAML serializer checks for special characters that require quoting with str.find_first_of(":#{}[]&*!|>\"'%"), but this doesn't cover all cases where quoting is necessary in YAML. For example, strings starting with - or ? followed by space, or strings that look like numbers/booleans should also be quoted. Consider a more comprehensive check or document the limitations.
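A more conservative quoting test along the lines of this comment could look like the sketch below. It is deliberately over-cautious (quoting a string that did not strictly need it is harmless) and still not a complete implementation of the YAML plain-scalar rules. `needs_yaml_quotes()` is a hypothetical helper, and the word list is an assumption based on values that YAML parsers commonly read as null or boolean.

```cpp
#include <cctype>
#include <initializer_list>
#include <string_view>

// Hypothetical helper, not the PR's code: decide whether a string should be
// emitted as a quoted scalar. Errs on the side of quoting.
inline bool needs_yaml_quotes(std::string_view s)
{
    if (s.empty())
        return true;
    // Leading or trailing spaces are not preserved in a plain scalar.
    if (s.front() == ' ' || s.back() == ' ')
        return true;
    // Characters from the PR's check plus a few other indicators and controls.
    if (s.find_first_of(":#{}[]&*!|>\"'%@`,\\\n\r\t") != std::string_view::npos)
        return true;
    // Sequence entry and mapping key indicators at the start.
    if (s.front() == '-' || s.front() == '?')
        return true;
    // Words that would otherwise be read back as null or boolean.
    for (std::string_view word : {"null", "Null", "NULL", "~", "true", "True", "TRUE",
                                  "false", "False", "FALSE", "yes", "no", "on", "off"})
    {
        if (s == word)
            return true;
    }
    // Anything that starts like a number.
    const unsigned char first = static_cast<unsigned char>(s.front());
    return std::isdigit(first) != 0 || first == '+' || first == '.';
}
```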

Copilot · 2025-11-27T17:08:26Z

src/data_processors/yaml_processor.cc

+                    if (c == '"')
+                        output_ << "\\\"";
+                    else
+                        output_ << c;


When serializing YAML strings with quotes, only the double-quote character is escaped. However, other escape sequences like newlines, tabs, and backslashes should also be escaped when inside quoted strings to produce valid YAML. The current implementation would produce invalid YAML if the string contains newlines or backslashes.

Suggested change

-                    if (c == '"')
-                        output_ << "\\\"";
-                    else
-                        output_ << c;
+                    switch (c)
+                    {
+                        case '"':
+                            output_ << "\\\"";
+                            break;
+                        case '\\':
+                            output_ << "\\\\";
+                            break;
+                        case '\n':
+                            output_ << "\\n";
+                            break;
+                        case '\r':
+                            output_ << "\\r";
+                            break;
+                        case '\t':
+                            output_ << "\\t";
+                            break;
+                        default:
+                            // Escape other control characters as \xXX
+                            if (static_cast<unsigned char>(c) < 0x20)
+                            {
+                                char buf[5];
+                                snprintf(buf, sizeof(buf), "\\x%02X", static_cast<unsigned char>(c));
+                                output_ << buf;
+                            }
+                            else
+                            {
+                                output_ << c;
+                            }
+                            break;
+                    }

Copilot · 2025-11-27T17:08:26Z

src/data_processors/json_processor.cc

+        }
+        else if (value.is_double())
+        {
+            output_ << std::to_string(value.as<double>());


[nitpick] The same std::to_string precision issue applies here for JSON serialization of double values. Consider using more controlled formatting for better output consistency.

Copilot · 2025-11-27T17:08:26Z

tests/data_processors/test_yaml_processor.cc

+TEST_CASE("YamlProcessor: YAML parsing with errors", "[YamlProcessor]")
+{
+    // Currently, the parser does not throw exceptions for malformed YAML.
+}


The comment states "Currently, the parser does not throw exceptions for malformed YAML" but there's no test case validating this behavior. Consider adding at least one test case with malformed YAML to verify the parser's behavior, or remove this comment if error handling is expected.

Finii · 2025-11-28T10:08:28Z

This AI cruft does not make the PR human readable. Digging through what is already 68 comments at this moment is nothing I like or have time for. Maybe you should do the AI processing in a private chat or something and keep this for humans.

Anyhow, I did not look at the code at all, but

I believe we should use one of the several available and thoroughly tested libraries for parsing and not invent parsers on our own.
And with use I mean in the other libraries / servers; not pulling these libs into GUL.
We use Xerces for example in a lot of libraries and servers, why switch away from a well-maintained library?

> and JSON/YAML parsing would add a new and convenient way of transferring structured data via the DOOCS protocol

Iirc @soerengrunewald did a lot of research on serialization, and it is always best to use one of the existing libraries and not reinvent the wheel.
YAML for example has some severe quoting challenges, and I'm not sure we are able to catch them all. I also see no need to add a maintenance burden on us instead of relying on widely used libraries.

Edit: Add last paragraph

Finii · 2025-11-28T10:18:48Z

My little Copilot chat...

Is it better to use the xerces library to manipulate XML data in a c++ program, or shall we write our own parser? Please give extensive reasoning for your answer.

Using an Existing Library vs. Rolling Your Own Parser

When it comes to parsing and manipulating XML data in C++, there are two primary options: using an existing library like Xerces or writing your own parser from scratch. Here's a detailed analysis of the pros and cons of each approach:

Using the Xerces Library

Pros:

Maturity and Stability: Xerces is a well-established, widely-used, and mature library with a large user base. This translates to a more stable and reliable parsing experience.
Standards Compliance: Xerces is designed to be compliant with various XML standards, including XML 1.0, XML 1.1, and XML Schema.
Feature-Rich: Xerces-C++ provides DOM, SAX, and SAX2 parsing APIs as well as DTD and XML Schema validation. Its XPath support covers only a subset, and XSLT is handled by the separate Xalan-C++ project.
Community Support: As a popular library, Xerces has an active community, which means there are many resources available for troubleshooting and learning.
Less Code to Maintain: By using an existing library, you can focus on your application's logic without worrying about the intricacies of XML parsing.

Cons:

Overhead: Using a library like Xerces can introduce additional overhead, including increased binary size and potential performance impacts.
Learning Curve: While Xerces is well-documented, it still requires time and effort to learn how to use it effectively.

Writing Your Own Parser

Pros:

Customizability: By writing your own parser, you can tailor it to your specific needs and optimize it for your use case.
Lightweight: A custom parser can be designed to be lightweight, reducing binary size and overhead.
Fine-Grained Control: With a custom parser, you have complete control over the parsing process, allowing for precise error handling and optimization.

Cons:

Complexity: Writing a robust and standards-compliant XML parser is a complex task, requiring significant expertise and time.
Bugs and Edge Cases: A custom parser is more likely to contain bugs and may not handle edge cases as well as a mature library like Xerces.
Maintenance Burden: Your team will need to invest time and resources in maintaining and updating the parser, which can be a significant burden.

Conclusion

Unless you have a specific requirement that cannot be met by an existing library like Xerces (e.g., a highly customized or embedded system), it is generally recommended to use an established library like Xerces for XML parsing and manipulation in C++. The benefits of using a mature library, including stability, standards compliance, and community support, outweigh the potential drawbacks.

Writing your own parser should only be considered if:

You have a very specific, lightweight, or embedded system requirement.
You have extensive experience with XML parsing and are confident in your ability to create a robust and standards-compliant parser.

In most cases, using Xerces or another established library will save you time, effort, and potential headaches in the long run.

alt-graph

This looks useful to me, so it might tick the "Only code that is potentially useful for a wide range of C++ projects" box. What worries me a little is the lack of usage experience.

@zykure: Have you already used this or similar code either in multiple projects or intensively in a single project? If not, it might be better to put it into a separate library and gather experience there. Once it gets into GUL we'll be very unwilling to change the API again.

Do we have other projects where we would want to use this? The clientlib already has Structure, which is similar but different.

Other observations from a casual glance at the code:

"Data processors" might be a misnomer. Maybe "parser" is closer to what the code is actually doing?
The free functions from_json_string() etc. do not have enough context for their names. It might be better to embed them into the DataTree class as static functions, e.g. auto tree = DataTree::from_json_string().

Finii · 2025-11-28T10:27:13Z

I'm not sure how other people do it, but our servers that use XML of course validate the input XML via XSD.
That should of course be part of "our parser" if we have one, as good practice is to validate the data before using it, for better error handling and reporting.
Is that also included here? Without validation we would encourage people to use XML unvalidated. The XSD that belongs to an XML input also helps to document what we really expect and allow, giving an (automatically validated) protocol description.

Finii · 2025-11-28T10:38:08Z

@alt-graph wrote:

> it might be better to put it into a separate library and gather experience there

I see no reason why it should/could not stay in a separate library. 🤔
Do not really see this fitting into GUL.

> tick the "Only code that is potentially useful for a wide range of C++ projects" box

Well, but otoh we need to keep GUL "small enough" and not become the bucket for everything.
We've had other functionality not added to GUL, like Ryu, like fmt, like gsl, ...
I vote for a separate library (or using external libraries anyhow).

soerengrunewald · 2025-11-28T11:26:13Z

I think @Finii is on to something. I would rather see GSL features in GUL over some parsers.

Also there are battle-tested parsers out there (which have fixed all the edge cases), many of them header-only and available via wrapdb, e.g:

That being said, I also see the benefit of having a unified interface for any type of parser. Nonetheless, I don't see it in GUL.

On the code side there are also some things which I would like to see changed.

Generally the API seems inconsistent.
DataTree::at, DataTree::operator[]: key access should take string_views
DataTree::as, DataTree::is: seem to differ in behavior, might be wanted, but seems strange.
const_cast: not a fan.
General style issues (e.g. large if-else blocks and loops, deep nesting, hard-coded strings/numerical values)
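On the string_view point, one way key lookup by `std::string_view` can work with an ordered map is heterogeneous lookup via a transparent comparator. The sketch below is not taken from the PR; `StringMap` and `at()` are hypothetical names, and it assumes the tree's object type is (or could be) a `std::map` keyed by `std::string`.

```cpp
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical container: std::less<> (the transparent comparator) lets
// find() accept any key type that is comparable with std::string.
using StringMap = std::map<std::string, int, std::less<>>;

inline int at(const StringMap& m, std::string_view key)
{
    const auto it = m.find(key); // no temporary std::string is constructed
    if (it == m.end())
        throw std::out_of_range("key not found: " + std::string(key));
    return it->second;
}
```

With the default `std::less<std::string>` comparator, the same call would not compile without first converting the key to a `std::string`.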

zykure added 4 commits November 25, 2025 16:17

Add JSON/XML/YAML parsers and serializers [WIP]

405744e

FIXME: Compiles, but no symbols exported?

Add documentation

4275280

Some cleanup

0a2c03c

Properly parse XML text content

fa31369

zykure requested a review from Copilot November 27, 2025 14:15

zykure added the enhancement New feature or request label Nov 27, 2025

zykure requested a review from alt-graph November 27, 2025 14:16

Copilot started reviewing on behalf of zykure November 27, 2025 14:16

Copilot finished reviewing on behalf of zykure November 27, 2025 14:17

Copilot AI reviewed Nov 27, 2025

zykure and others added 4 commits November 27, 2025 15:30

Fix copyright notice in data_processors.h

13f0369

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

[Copilot] Fix JSON parser issues

e472091

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

[Copilot] Fix YAML parser issues

56a7fe2

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Fix more issues from Copilot review

b59ef66

zykure requested a review from Copilot November 27, 2025 14:55

Copilot started reviewing on behalf of zykure November 27, 2025 14:56

Copilot finished reviewing on behalf of zykure November 27, 2025 14:56

Copilot AI reviewed Nov 27, 2025

zykure marked this pull request as ready for review November 27, 2025 15:28

[Copilot] Minor changes from code review

27fde8c

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

zykure requested a review from Copilot November 27, 2025 15:33

Copilot started reviewing on behalf of zykure November 27, 2025 15:33

Copilot finished reviewing on behalf of zykure November 27, 2025 15:37

Copilot AI reviewed Nov 27, 2025

zykure marked this pull request as draft November 27, 2025 15:44

zykure added 2 commits November 27, 2025 17:39

Fix more issues from Copilot review

a9477a2

Use ordered map in DataTree to avoid extra sorting

7f29d06

zykure requested a review from Copilot November 27, 2025 16:45

Copilot started reviewing on behalf of zykure November 27, 2025 16:45

Copilot finished reviewing on behalf of zykure November 27, 2025 16:47

Copilot AI reviewed Nov 27, 2025

Fix more issues from Copilot review

7583414

zykure requested a review from Copilot November 27, 2025 17:02

Copilot started reviewing on behalf of zykure November 27, 2025 17:02

Copilot finished reviewing on behalf of zykure November 27, 2025 17:05

Copilot AI reviewed Nov 27, 2025

alt-graph reviewed Nov 28, 2025
