Add dataset parse/load library #13

Open
wzhao18 wants to merge 23 commits into ampersand-projects:master from wzhao18:stream-dataset

Conversation

wzhao18 (Collaborator) commented Feb 20, 2022

No description provided.

anandj91 (Contributor) left a comment

struct Row {
    string dataset;
};

struct CSVRow : public Row {
    vector<string> cols;
};

template<typename R>
class DataParser {
protected:
    virtual void run() = 0;
    virtual void parse(fstream&) = 0;
    virtual void decode(R&, stream::stream_event&) = 0;

    bool write_serialized_to_ostream(stream::stream_event &t)
    {
        if (!google::protobuf::util::SerializeDelimitedToOstream(t, &cout)) {
            cerr << "Fail to serialize data into output stream" << endl;
            return false;
        }
        return true;
    }

};

class CSVParser : public DataParser<CSVRow> {
protected:
    void parse(fstream &file) override
    {
        // Skip the first line, assumed to be the CSV header.
        string line;
        getline(file, line);

        while (getline(file, line)) {
            CSVRow row;
            string word;
            stringstream ss(line);

            while (getline(ss, word, ',')) {
                row.cols.push_back(word);
            }

            stream::stream_event data;
            decode(row, data);
            if (!write_serialized_to_ostream(data)) {
                break;
            }
        }
    }
};

Consider something like the above organization of the base classes.

Additionally, try to follow a standard style when you write code, for example consistent brace placement for classes and functions.
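
To illustrate how a dataset-specific parser could plug into the layout suggested above, here is a minimal self-contained sketch. It makes several assumptions that are not part of this PR: `Event` is a hypothetical stand-in for `stream::stream_event`, `write_event` writes plain text instead of calling `SerializeDelimitedToOstream`, the streams are passed in as `std::istream`/`std::ostream` so the code can run without files or protobuf, and `FareParser` with its two-column layout is invented for the example.

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for stream::stream_event so the sketch compiles
// without protobuf; the real message type is not shown in this thread.
struct Event {
    std::string payload;
};

struct CSVRow {
    std::vector<std::string> cols;
};

template<typename R>
class DataParser {
public:
    virtual ~DataParser() = default;

    // Reads rows from `in` and writes one event per row to `out`.
    virtual void parse(std::istream &in, std::ostream &out) = 0;

protected:
    virtual void decode(const R &row, Event &event) = 0;

    // Plain-text stand-in for the protobuf delimited serialization.
    bool write_event(const Event &event, std::ostream &out)
    {
        out << event.payload << '\n';
        return static_cast<bool>(out);
    }
};

class CSVParser : public DataParser<CSVRow> {
public:
    void parse(std::istream &in, std::ostream &out) override
    {
        std::string line;
        std::getline(in, line);  // skip the header line

        while (std::getline(in, line)) {
            CSVRow row;
            std::string word;
            std::stringstream ss(line);
            while (std::getline(ss, word, ',')) {
                row.cols.push_back(word);
            }

            Event event;
            decode(row, event);
            if (!write_event(event, out)) {
                break;
            }
        }
    }
};

// Hypothetical dataset-specific parser: only decode() differs per dataset.
class FareParser : public CSVParser {
protected:
    void decode(const CSVRow &row, Event &event) override
    {
        // Assumed layout: column 0 is an id, column 1 is the fare amount.
        event.payload = "fare id=" + row.cols.at(0) + " amount=" + row.cols.at(1);
    }
};

// Convenience wrapper: parses CSV text and returns everything written.
std::string parse_fares(const std::string &csv)
{
    std::istringstream in(csv);
    std::ostringstream out;
    FareParser parser;
    parser.parse(in, out);
    return out.str();
}
```

With this split, a trip parser would be another small subclass of `CSVParser` that overrides only `decode`, and calling `parser.parse(file, std::cout)` would stream to standard output as in the original sketch.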

Contributor

Consider using two different parsers for fare and trip datasets. You don't need to combine them in one.

Collaborator, Author

The parsers for the fare and trip datasets are combined so that both can be streamed to stdout at the same time. The taxi benchmark requires two streams.
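
One possible way to reconcile the two positions is to keep a separate reader per dataset and let a small driver interleave their output onto a single stream. The sketch below is only an illustration under assumptions that are not in this PR: `LineSource`, `interleave`, and `merge_fare_and_trip` are hypothetical names, records are tagged text lines rather than protobuf messages, and the round-robin order is a placeholder (a real benchmark might need to order records by timestamp instead).

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical per-dataset reader: yields one tagged record per CSV line.
class LineSource {
public:
    LineSource(std::string tag, std::istream &in) : tag_(std::move(tag)), in_(in)
    {
        std::string header;
        std::getline(in_, header);  // skip the header line
    }

    // Writes the next record to `out`; returns false once the input is exhausted.
    bool emit_next(std::ostream &out)
    {
        std::string line;
        if (!std::getline(in_, line)) {
            return false;
        }
        out << tag_ << ':' << line << '\n';
        return true;
    }

private:
    std::string tag_;
    std::istream &in_;
};

// Round-robin over the sources until every one of them is drained.
void interleave(std::vector<LineSource> &sources, std::ostream &out)
{
    bool progressed = true;
    while (progressed) {
        progressed = false;
        for (auto &source : sources) {
            if (source.emit_next(out)) {
                progressed = true;
            }
        }
    }
}

// Convenience wrapper: merges two CSV texts and returns the combined output.
std::string merge_fare_and_trip(const std::string &fare_csv, const std::string &trip_csv)
{
    std::istringstream fare_in(fare_csv);
    std::istringstream trip_in(trip_csv);
    std::vector<LineSource> sources;
    sources.emplace_back("fare", fare_in);
    sources.emplace_back("trip", trip_in);

    std::ostringstream out;
    interleave(sources, out);
    return out.str();
}
```

Passing `std::cout` to `interleave` would send both datasets to standard output together, while each dataset keeps its own reader.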

@wzhao18 wzhao18 requested a review from anandj91 May 4, 2022 03:33