
Add job to get columns from csv file #643

Merged
Icemole merged 7 commits into main from get-columns-from-csv-file-job on Feb 5, 2026

Conversation

@Icemole (Collaborator) commented Jan 30, 2026

If a user dumps a csv file with e.g. i6_core.text.processing.WriteToCsvFileJob, there's a chance that the delimiter appears in one of the fields. The csv module escapes these by quoting the offending field. As a consequence, parsing the resulting csv file with basic parsers such as awk -F ... yields unpredictable results.
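To illustrate the problem (a minimal self-contained sketch, not part of the job itself): a field containing the delimiter gets quoted by the csv module, so naive splitting on the delimiter (which is effectively what awk -F does) yields the wrong fields, while csv.reader recovers them correctly.

```python
import csv
import io

# Write a row where the first field contains the delimiter.
buf = io.StringIO()
csv.writer(buf).writerow(["Doe, John", "42"])
line = buf.getvalue().rstrip("\r\n")
print(line)                      # "Doe, John",42  (field got quoted)

# Naive splitting on the delimiter breaks the quoted field apart:
print(line.split(","))           # ['"Doe', ' John"', '42']

# csv.reader properly unescapes the quoting:
print(next(csv.reader([line])))  # ['Doe, John', '42']
```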

I propose a job to reliably obtain a given (set of) column(s) from a csv file by using csv.reader, which properly unescapes the columns.
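The core of such a job could look roughly like this (a hypothetical sketch; the function name and signature are illustrative, not the actual i6_core API):

```python
import csv

def extract_columns(csv_path, out_path, column_indices, out_delimiter="\t"):
    # Read the CSV with csv.reader, which unquotes the fields, and write
    # the requested columns out as plain text, one row per line.
    with open(csv_path, newline="") as fin, open(out_path, "w") as fout:
        for row in csv.reader(fin):
            fout.write(out_delimiter.join(row[i] for i in column_indices) + "\n")
```

For example, extract_columns("data.csv", "col0.txt", [0]) would write the unescaped first column, one row per line (modulo the embedded-newline issue raised later in this thread).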

@NeoLegends (Member) left a comment

I was wondering if it's wise to stop outputting the single columns as CSV, because that has implications on escaping (you could write it as single-column-but-escaped-CSV), but I could not come up with a scenario where that would ever become a practical problem.

@Icemole (Collaborator, Author) commented Feb 4, 2026

> I was wondering if it's wise to stop outputting the single columns as CSV, because that has implications on escaping (you could write it as single-column-but-escaped-CSV), but I could not come up with a scenario where that would ever become a practical problem.

With this job I want to dump the columns as a non-CSV file on purpose: I want to interpret the columns without the csv module, so I don't want any escape tokens in the values inside my columns.

I'd say this is not an issue, because if we had a setup that relied on csv.reader (for which the contents must be escaped), we'd be using the CSV file itself instead of relying on this job.

@NeoLegends (Member) commented Feb 4, 2026

Actually, I just found a case: if the per-cell CSV values contain newlines, they will lose their escaping and break the scheme. So perhaps escape those?

@Icemole (Collaborator, Author) commented Feb 4, 2026

Excellent point! I initially thought that the escaping would only apply to the delimiter, but it seems that other escaped characters, such as embedded newlines, mess up our final columns:

>>> import csv
... with open("kek.txt", "w") as f:
...     wr = csv.writer(f)
...     wr.writerow(["this is\na test", "hello"])
...     
$ cat kek.txt
"this is
a test",hello
>>> import csv
... with open("kek.txt", "r") as f:
...     rd = csv.reader(f)
...     with open("col1.txt", "w") as c1:
...         for row in rd:
...             print(row[0])
...             c1.write(row[0])
...             
this is
a test
$ cat col1.txt
this is
a test

A quick search on the Internets tells me that our fellow coders seem to agree on str.encode("unicode_escape").decode("utf-8"), which works wonders:

>>> import csv
... with open("kek.txt", "r") as f:
...     rd = csv.reader(f)
...     with open("col1.txt", "w") as c1:
...         for row in rd:
...             print(row[0].encode("unicode_escape").decode("utf-8"))
...             c1.write(row[0].encode("unicode_escape").decode("utf-8"))
...             
this is\na test
$ cat col1.txt
this is\na test
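For reference, a quick sanity check of that round trip (a sketch; one caveat worth noting is that unicode_escape also escapes non-ASCII characters, which may matter for non-English column contents):

```python
# The unicode_escape trick turns an embedded newline into a literal
# backslash-n, so the column value stays on a single output line.
s = "this is\na test"
escaped = s.encode("unicode_escape").decode("utf-8")
print(escaped)              # this is\na test  (one line)
assert "\n" not in escaped

# Caveat: non-ASCII characters get escaped too.
print("é".encode("unicode_escape").decode("utf-8"))  # \xe9
```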

Icemole merged commit a520cc2 into main on Feb 5, 2026
5 checks passed
Icemole deleted the get-columns-from-csv-file-job branch on February 5, 2026 at 15:21