
Conversation

@mbtaylor (Member) commented Aug 7, 2025

  • char primitives are now UTF-8-encoded bytes,
  • unicodeChar primitives are now UTF-16-encoded byte pairs for BMP characters (non-BMP is excluded)
  • unicodeChar is deprecated
  • a little bit of surrounding text is rephrased to match Unicode concepts
  • there is some non-normative text explaining the implications of UTF-8 being a variable-width encoding
  • references to UCS-2 are removed

@mbtaylor (Member Author) commented Aug 7, 2025

This is the PR that I threatened on the Apps mailing list on 17 July, and follows on from the discussion on that thread and from @msdemlei's presentation in College Park.

It tackles issues #55 and #69, covering both getting rid of references to UCS-2 and redefining the char datatype so that it can be used for UTF-8. It is incompatible with PR #68 (if this one is accepted, that one should be retired).

@mbtaylor requested review from msdemlei and removed the request for msdemlei on August 7, 2025 08:00
@rra left a comment

In the world of Unicode RFCs, the standards are fairly careful to always use the term "octet" for the individual 8-bit storage unit of an encoding. This text uses "byte" throughout. I think that's a reasonable choice these days, given that all the computing architectures that used non-8-bit bytes are very obsolete, but it might be worth aligning the terminology on octet just to avoid any confusion for readers who are moving back and forth between the RFC world and the IVOA standard world and might wonder if there's some difference between byte and octet.

I'm not sure where to put this in the document, but it feels worthwhile to add a fairly explicit warning for char that if a value is truncated to fit a length restriction on the column, it may be shorter than the number of octets given in arraysize, and therefore implementations cannot use length == arraysize as a flag to detect possibly truncated values.

Comment on lines +414 to +418
For this type the primitive size of two bytes corresponds to a 2-byte
UTF-16 {\em code unit}.
Only characters in the Unicode Basic Multilingual Plane,
which all have 2-byte representations, are permitted for this datatype,
so that the primitive count matches the character count.

I know that you were trying to drop all the UCS-2 references, but I think you've essentially defined UCS-2 here without using that term explicitly. Maybe it would be a bit easier for implementers to understand if the text says that this is UCS-2? As far as I understand it, UCS-2 is exactly UTF-16 with all planes other than the Unicode Basic Multilingual Plane banned so that every character is exactly two octets.

Should there be a recommendation here about what implementations should do if given unicodeChar that is actually in UTF-16 and therefore contains surrogate pairs? I know that we would like to leave unicodeChar behind us, but we discovered that current PyVO generates unicodeChar fields when given a CSV table to upload, so we may be living with it for a while and I bet implementations will encounter people uploading higher plane characters in the wild.

@mbtaylor (Member Author)

I'm happy to revert the language to UCS-2 (which is indeed identical to BMP-only UTF-16) if people think that's more comprehensible.

It may be that there are unicodeChar fields containing higher-plane characters out there somewhere, but that would have been illegal for earlier versions of VOTable and would be illegal with this version too. Given that, I feel like software is within its rights to do whatever it likes... but in practice it's likely to be UTF-16 so using a UTF-16 decoder would probably do the right thing even in the presence of illegal data, so maybe it's worth recommending that.


I'm happy to revert the language to UCS-2 (which is indeed identical to BMP-only UTF-16) if people think that's more comprehensible.

I didn't express that well -- I like the change and I think it's great to connect this to UTF-16 because UTF-16 encoders are readily available but UCS-2 encoders may be a bit rarer. I was just thinking that it might be good to also note that UTF-16 with this restriction is just UCS-2.

It may be that there are unicodeChar fields containing higher-plane characters out there somewhere, but that would have been illegal for earlier versions of VOTable and would be illegal with this version too. Given that, I feel like software is within its rights to do whatever it likes... but in practice it's likely to be UTF-16 so using a UTF-16 decoder would probably do the right thing even in the presence of illegal data, so maybe it's worth recommending that.

I like the idea of saying explicitly that you can use a UTF-16 decoder and accept technically invalid VOTables that contain surrogate pairs if you want to be generous in what you accept and can handle UTF-16, but when creating a VOTable, you must not include surrogate pairs.
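
To make that concrete, here is a minimal Python sketch of the be-generous-on-read, strict-on-write policy for unicodeChar BINARY payloads (the function names are illustrative, not from the spec or any library):

```python
def read_unicodechar(raw: bytes) -> str:
    # Generous read: a plain UTF-16BE decoder also copes with technically
    # invalid VOTables whose unicodeChar data contains surrogate pairs.
    return raw.decode("utf-16-be")

def write_unicodechar(text: str) -> bytes:
    # Strict write: only Basic Multilingual Plane characters are allowed,
    # so no character may need a surrogate pair (code point above U+FFFF).
    if any(ord(c) > 0xFFFF for c in text):
        raise ValueError("non-BMP character not permitted in unicodeChar")
    return text.encode("utf-16-be")
```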

VOTable.tex Outdated
Comment on lines 1990 to 1991
the 2-byte big-endian UTF-16 encoding
of a Unicode character from the Basic Multilingual Plane.

Here too, I think this is just another way of saying UCS-2.

@gpdf commented Aug 7, 2025

I've been trying to read a variety of sources on UCS-2, all non-authoritative, and have not been able to get a completely clear answer to the question of whether there are any valid UCS-2 code points that would be interpreted differently in UTF-16.

This is roughly, but not precisely, equivalent to asking whether U+D800 - U+DFFF had always been reserved historically, even though UTF-16 didn't come along until later. Virtually all sources I can find are backward-looking, writing from the post-UTF-16 perspective, and just don't address this.

Comment on lines +401 to +410
Note that the primitive size of one byte refers to a single
UTF-8-encoded byte, not to a single character.
Since UTF-8 is a variable-width encoding,
a character may require multiple bytes, and for arrays the
string length (length in characters) and primitive count (length in bytes)
will in general differ.
7-bit ASCII characters are however all encoded as a single byte in UTF-8,
so in the case of ASCII characters, which were required for this
datatype in earlier VOTable versions, the primitive and character count
are equal.
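
As a quick illustration of the byte-versus-character distinction described in that passage (standard-library Python only; the example string is arbitrary):

```python
s = "Göttingen"                      # 9 characters
b = s.encode("utf-8")                # 'ö' needs 2 bytes in UTF-8
assert len(s) == 9 and len(b) == 10

# For pure 7-bit ASCII the two counts coincide, which is why pre-1.6 char
# content (required to be ASCII) had primitive count == character count.
assert len("Goettingen".encode("utf-8")) == len("Goettingen")
```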

Perhaps I'm overlooking it, and it's already there, but I think it might be worth an explicit statement in the text that clarifies that a bare char without a length (a one-octet string) is limited to storing a single ASCII character.

@mbtaylor (Member Author)

I've added a sentence at 73bfd13 clarifying this. @fxpineau made a similar suggestion.

@rra commented Aug 7, 2025

I've been trying to read a variety of sources on UCS-2, all non-authoritative, and have not been able to get a completely clear answer to the question of whether there are any valid UCS-2 code points that would be interpreted differently in UTF-16.

This is roughly, but not precisely, equivalent to asking whether U+D800 - U+DFFF had always been reserved historically, even though UTF-16 didn't come along until later. Virtually all sources I can find are backward-looking, writing from the post-UTF-16 perspective, and just don't address this.

It's hard to find a formal definition of UCS-2 now because the UCS stuff is basically deprecated, but my understanding is that it is a direct mapping of the Unicode code points to a two-octet number per code point (and comes in either a big-endian or a little-endian variant). UTF-16 is defined to be exactly that mapping (see https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G31699) except for surrogate pairs, and the code space for surrogate codes used in surrogate pairs is reserved for that purpose exclusively in https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-23/#G24089.

Following comments from Russ Allberry and Gregory D-F, make a couple
of adjustments to the description of the (now deprecated) unicodeChar
type:

  - note that the required encoding is just a rebadged UCS-2
  - explicitly allow readers to treat it as standard UTF-16
@mbtaylor (Member Author) commented Aug 8, 2025

I've added a commit 7a17c37 that I think incorporates @rra's suggestions about unicodeChar description and usage recommendations.

@mbtaylor (Member Author) commented Aug 8, 2025

In the world of Unicode RFCs, the standards are fairly careful to always use the term "octet" for the individual 8-bit storage unit of an encoding. This text uses "byte" throughout. I think that's a reasonable choice these days, given that all the computing architectures that used non-8-bit bytes are very obsolete, but it might be worth aligning the terminology on octet just to avoid any confusion for readers who are moving back and forth between the RFC world and the IVOA standard world and might wonder if there's some difference between byte and octet.

I have to confess to working from wikipedia rather than the RFCs for my Unicode information; the wikipedia pages on e.g. UTF-8 and UTF-16 use the term "byte" throughout with no mention of octets. I feel like if it's good enough for wikipedia it's probably good enough here, and given that the rest of the VOTable document uses the term byte throughout as well, I think confusion will be minimised by leaving the terminology as is. But if majority opinion is against me I'll change it.

@mbtaylor (Member Author) commented Aug 8, 2025

I'm not sure where to put this in the document, but it feels worthwhile to add a fairly explicit warning for char that if a value is truncated to fit a length restriction on the column, it may be shorter than the number of octets given in arraysize, and therefore implementations cannot use length == arraysize as a flag to detect possibly truncated values.

I've tried to address this in c9859ed.

@rra left a comment

This version looks great to me.

@tomdonaldson (Contributor)

With apologies for being late to the party, I have a few thoughts on this.

I do think that this specification is clear and unambiguous. Usually that would be enough for me, but...
in this case I'm concerned about some practical aspects of the arraysize being (possibly) different from the number of characters in a value. I think I could be convinced that these concerns are manageable, but I did want to bring them up.

Clearly for deserializing BINARY and BINARY2 we need to know the number of bytes per value. This is also probably true for 2D char arrays in the TABLEDATA serialization, but there it is less clear to me since XML parsing and string storage techniques vary widely among libraries and languages. That said, the mismatch in sizes can be confusing and therefore error-prone for both providers and consumers.

My main concern is for providers, for whom one of the most common VOTable errors we see is arraysize values that are too small for the payload, leading some clients to truncate values in TABLEDATA or possibly fail to properly read BINARY data. With this new UTF-8 handling, determining an appropriate arraysize is more difficult. Databases represent UTF-8 in a variety of ways, and not all data comes from a database, so the value can't necessarily come from a schema. In general it will be difficult to know how many 2, 3 or 4 byte UTF-8 characters will be present without scanning all the values.

A safe arraysize would be 4 times the maximum number of characters. For large tables with character data, that seems pretty wasteful, but numpy tables for example store all unicode data (Python strings) in 4 bytes per character so that they can have the same number of bytes per value.
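
For reference, the scan a provider would need to get a tight fixed arraysize is small but unavoidable; a sketch (the column values are illustrative):

```python
def max_utf8_bytes(values):
    # A tight fixed arraysize for a char column is the maximum UTF-8 byte
    # length over all values, which means encoding (or at least scanning)
    # every value once.
    return max(len(v.encode("utf-8")) for v in values)

# e.g. max_utf8_bytes(["alpha Cen", "Barnard's Star"]) == 14
```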

For consumers, the practical issues are smaller and more about possible confusion.

  • For deserialization,
    • consumer code will have to carefully check the VOTable version to know how to handle arraysize.
    • counting bytes of text in TABLEDATA (for 2D arrays) may be easier in some libraries/languages than others. (To be fair counting UTF-8 characters in TABLEDATA may be easier in some languages than others. I think that Java would report "🍕".length() to be 2, so it would be easy to mistakenly get the wrong character count.)
  • After deserialization, the arraysize meaning "number of bytes" may or may not be useful.
    • Currently when astropy reads a VOTable into an astropy Table (columns are essentially numpy arrays), it uses arraysize to determine the numpy type for the column. An arraysize of 5 for a FIELD results in dtype='<U5' for the astropy table column. As mentioned above, this would result in 20 bytes of storage per row. An arraysize of "*" results in a dtype of object losing any benefits of fixed storage size but probably saving some space.
    • So astropy would need to change its handling of arraysize for VOTable 1.6. To maintain the current behavior of creating a fixed-width column when arraysize is not "*", astropy needs the max number of characters, not bytes. Whether those fixed-width columns are useful is a separate question, but if there are other apps that currently utilize arraysize after serialization, they may need to adjust their behavior as well.
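
The 4-bytes-per-character point about numpy in the list above is easy to check; a minimal sketch using numpy (which astropy table columns wrap):

```python
import numpy as np

# numpy stores fixed-width unicode as UCS-4, i.e. 4 bytes per character,
# so a '<U5' column costs 20 bytes per value regardless of content.
col = np.array(["hello", "héllo"], dtype="<U5")
assert col.itemsize == 20
```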

@msdemlei (Contributor) commented Oct 28, 2025 via email

@mbtaylor (Member Author)

Uh. Right. nD char arrays in Tabledata could be a killer. The only way we can define that is to say "You need to UTF-8 encode the payload before splitting it". Yikes.

That's right. I've included the following text in Section 2.2:

"To decode a multi-dimensional char array it is necessary first to represent it as a sequence of UTF-8 bytes in order to determine the byte sequence representing each fixed-storage-length string."

Do you think this needs spelling out more explicitly? I've presented it here from a read rather than a write point of view; it should maybe be written "To encode or decode a multi-dimensional char array...". Is more rewording required?
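
For what that sentence implies in practice, here is a hedged sketch of the decode side, assuming a 2-d char FIELD whose fixed string-storage length is n_bytes and whose TABLEDATA cell has been read as a single Python string (the function name and the padding handling are illustrative only):

```python
def split_char_array(cell_text: str, n_bytes: int) -> list[str]:
    # Re-encode the cell to UTF-8 first, then cut it into fixed-storage-length
    # chunks of n_bytes bytes each.  Assuming writers pad each string to
    # n_bytes rather than splitting a character across string boundaries,
    # each chunk is a complete UTF-8 sequence and decodes cleanly.
    raw = cell_text.encode("utf-8")
    chunks = [raw[i:i + n_bytes] for i in range(0, len(raw), n_bytes)]
    # Stripping trailing NUL padding here is an illustrative choice, not
    # something prescribed by the text quoted above.
    return [c.decode("utf-8").rstrip("\x00") for c in chunks]
```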

@mbtaylor (Member Author)

@tomdonaldson, thanks for your thoughts.

Clearly for deserializing BINARY and BINARY2 we need to know the number of bytes per value. This is also probably true for 2D char arrays in the TABLEDATA serialization, but there it is less clear to me since xml parsing and string storage techniques vary widely among libraries and languages.

If it was just TABLEDATA then you could (and probably should) count number of characters rather than number of bytes when looking at 2-d char arrays. But since for BINARY/2 it really has to be by (UTF-8) byte, I'm proposing it works the same way for TABLEDATA. Otherwise, the meaning of arraysize would be different for different serializations, which I'd say is too nasty to stand.

  • consumer code will have to carefully check the VOTable version to know how to handle arraysize.

Fortunately, this isn't true: if you write code that's correct for VOTable 1.6 (the system proposed by this PR) it will work correctly for all legal instances of earlier VOTable versions. The reason is that in VOTable versions to date it is only allowed to have 7-bit ASCII characters in char arrays, and in that case byte count == character count.

  • counting bytes of text in TABLEDATA (for 2D arrays) may be easier in some libraries/languages than others.

That is true; if you're in a language where there is no facility for converting between a character string and a UTF-8 byte array then implementation is going to be tricky. Is this likely to be the case for any language we're interested in?

The other points you make generally relate to difficulties in working out how many characters are required for a given arraysize declaration or vice versa. It's true that's hard and/or inefficient in the proposed scheme. My feeling is that at least for scalar strings (1-d char arrays) the best thing in most cases for providers will now be to use variable-length arrays rather than a fixed arraysize, which is, as you say, of questionable utility. When parsing these things I'd consider just reading them in as variable-length strings rather than fixed-length ones even in the case of fixed arraysize, though there may be efficiency considerations for that depending on the platform. Admittedly you can't dodge a fixed arraysize dimension when working with 2-d char arrays, but that is a fairly unusual data type (and might possibly be addressed one day by xtype="strings").

One exception to the always-use-variable-arraysize rule of thumb could be (when authoring a VOTable) to use a fixed arraysize for strings known to be ASCII-like, for instance ISO-8601 dates. I should note that @fxpineau made a suggestion relevant to this, to do with using the width attribute of FIELD to flag to readers that the content is indeed ASCII-like.

VOTable.tex Outdated
and a character string may therefore be terminated by an ASCII
NULL [0x00]
indicates a Unicode string composed of UTF-8 encoded text.
A string may be terminated by a NULL code point

With the previous wording, it was explicit that the NULL termination applied to BINARY/BINARY2 (not TABLEDATA). Is it worth clarifying that here?

VOTable.tex Outdated
by this standard, but readers MAY treat such characters normally
if encountered, for instance by using a UTF-16 decoder on BINARY data,
though note in this case the arraysize may no longer match the character count.
Note this datatype is {\bf deprecated} from VOTable 1.6.
@tomdonaldson (Contributor) commented Nov 5, 2025

For emphasis, would it be useful to move this deprecation note to the beginning of the section?

@tomdonaldson (Contributor) left a comment

Thanks @mbtaylor for the responses and for this PR in general. I am now comfortable that this approach is a very reasonable way to handle the subtle complexities of UTF-8 values. In any case, I can't think of another approach that would be cleaner for readers or writers. I'm particularly compelled by the point that for 1D char arrays, writers are not required or even encouraged to use a fixed arraysize, and that readers other than validators are similarly not required to care about or use the arraysize.

I'm comfortable with what's written as is, but left a couple minor suggestions for your consideration.

Another question occurred to me but also seems minor. That is, PARAM values use the same datatypes as FIELDs, but we don't explicitly mention that their value is essentially a quoted string matching a TABLEDATA serialization. In the unlikely event that one used multidimensional strings in a PARAM, is that worth clarifying in the value description?

There are a couple of adjustments to the text clarifying the intention
of the Unicode-related changes.
@mbtaylor (Member Author) commented Nov 6, 2025

Thanks @tomdonaldson, I've made those adjustments as suggested. Incidentally, a multi-dimensional char array in a PARAM is not all that outlandish; I came across one this week: astropy/astropy#18789 (comment)

@mbtaylor (Member Author)

In discussion of this proposal during the Apps session in Görlitz (see my presentation) @stvoutsin suggested the possibility of restricting the content of FIELDs with datatype="char" and a fixed arraysize to ASCII characters. This would allow VOTable authors to specify string length in characters in the case that they happened to know the content was ASCII so that for instance document ingestors could store them in fixed-length character storage (like a CHAR(x) column). At the same time authors who had (maybe) non-ASCII data could still write it, but would be obliged to declare it arraysize="*".

On reflection, I don't support this idea for a few reasons:

  • it means that you can't write non-ASCII into string arrays (which have to declare string length with a fixed size)
  • it's kind of messy to restrict content based on arraysize
  • even without this arrangement, if a char column is declared with a fixed arraysize N then consumers can still declare fixed-length character storage like CHAR(N), it will just be a bit wasteful if there's much non-ASCII content in there

If I've misunderstood/misrepresented the proposal, or if Stelios or others want to argue for it, please do so in the comments here. If nobody does that in a week or so, I will shelve this suggestion, and move to merge the PR as is.

@stvoutsin

In discussion of this proposal during the Apps session in Görlitz (see my presentation) @stvoutsin suggested the possibility of restricting the content of FIELDs with datatype="char" and a fixed arraysize to ASCII characters. This would allow VOTable authors to specify string length in characters in the case that they happened to know the content was ASCII so that for instance document ingestors could store them in fixed-length character storage (like a CHAR(x) column). At the same time authors who had (maybe) non-ASCII data could still write it, but would be obliged to declare it arraysize="*".

On reflection, I don't support this idea for a few reasons:

  • it means that you can't write non-ASCII into string arrays (which have to declare string length with a fixed size)
  • it's kind of messy to restrict content based on arraysize
  • even without this arrangement, if a char column is declared with a fixed arraysize N then consumers can still declare fixed-length character storage like CHAR(N), it will just be a bit wasteful if there's much non-ASCII content in there

If I've misunderstood/misrepresented the proposal, or if Stelios or others want to argue for it, please do so in the comments here. If nobody does that in a week or so, I will shelve this suggestion, and move to merge the PR as is.

I'm fine with the current approach and I do see how trying to limit fixed-size to ASCII would get messy.
That said I do share the practical concerns that Tom raised. Specifically the encode-to-UTF-8-then-truncate step for TABLEDATA feels error-prone and not something most tool developers are probably used to thinking about.

We're also now basically saying that you need to scan the data once to get the character count for schema allocation, numpy arrays, etc. With astropy in mind in particular, this will probably have a performance impact since they'd have to either scan the data once or use object arrays, which would lose some of the advantages of fixed-length arrays (memory overhead, loss of vectorization)

That seems like we're sacrificing practicality for consistency in the spec.

The alternative of letting arraysize mean characters in TABLEDATA and bytes in BINARY/BINARY2 would avoid the truncation complexity as well as the need for tools to either scan all rows or guess conservatively about max length. While this would potentially add ambiguity, I don't think having serialization specific rules is the complete mess that others seem to think, or at least I see it as the lesser evil.

That said I get the consistency argument. And you're right that pushing people toward variable-length strings probably helps anyway.

Those are just some thoughts I wanted to raise, but happy with this if there's consensus.
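
On the encode-then-truncate step raised above, the fiddly part is not cutting a value in the middle of a multi-byte character; a minimal sketch of one way a writer could do it (standard-library Python, purely illustrative):

```python
def truncate_utf8(text: str, max_bytes: int) -> bytes:
    # Encode, cut at the byte budget, then drop any incomplete trailing
    # multi-byte sequence by round-tripping through a lenient decode.
    cut = text.encode("utf-8")[:max_bytes]
    return cut.decode("utf-8", errors="ignore").encode("utf-8")

# e.g. truncate_utf8("naïve", 3) == b"na"   (the 'ï' is dropped, not broken)
```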

@pdowler (Contributor) commented Dec 1, 2025

@stvoutsin wrote:

The alternative of letting arraysize mean characters in TABLEDATA and bytes in BINARY/BINARY2

I am thinking about implementation (TAP mainly) and there arraysize is in the tap_schema.columns table and ends up in VOTable output, so this kind of distinction seems like a source of pain/confusion.

Also, changing meaning of arraysize has a direct impact on the TAP specification since TAP uses the VOTable type system.

I envision doing the following:

  • in the database: only use char(N) where I know the data is ascii and fixed size
  • in the database: normally use varchar(N) and assume UTF-8
  • we already initialise the database (postgres) with --encoding=UTF8 --lc-collate=C --lc-ctype=C so nothing different to do there
  • the value of arraysize in tap_schema.columns would remain "N" for char columns
  • values of arraysize in tap_schema.columns would have to be "M*" (M = 2N) for varchar columns; this is a change from current practice but easy to do and no impact on clients
  • for columns like timestamps, the serialised value is ascii so the arraysize would be unchanged (see below)
  • there are other columns where I could finesse arraysize (e.g. URIs are stored in varchar(N) columns and serialised using URI.toASCIIString()) so they don't need the factor of 2 difference between db and tap_schema, but I might not bother with the complication

Does that sound sane and sufficient for correctness?

Aside: in OpenCADC TAP services the VOSI-tables output uses VOTableType exclusively, e.g.:

<dataType xsi:type="vs:VOTableType" arraysize="23*" extendedType="timestamp">char</dataType>

Our code also supports the type system from VODataService and I don't know the current state of that... but technically TAP may still depend on the VODataService type system which, iirc, also includes an equivalent to arraysize.

@pdowler (Contributor) commented Dec 1, 2025

Something related that came to mind...

One thing we'll need to make clear to users is in the table creation in WD-TAP-1.2 where a user can use a VOTable to create a database table with (in principle) fields using datatype="char":

  • arraysize="A" would map to a UTF8 column char(A/2)... well, A/2 + 1 if A is odd
  • arraysize="B*" would map to a UTF8 column varchar(B/2)...

So in principle users creating tables are impacted by the change in meaning of arraysize in a subtle way because they have to think "max UTF8 byte length" rather than "max string length". I don't think I can hide this.

@rra commented Dec 1, 2025

values of arraysize in tap_schema.columns would have to be "M*" (M = 2N) for varchar columns; this is a change from current practice but easy to do and no impact on clients

Wouldn't it have to be 4N to handle the worst-case scenario (emojis, etc.)?

arraysize="A" would map to a UTF8 column char(A/2)... well, A/2 + 1 if A is odd

I think this has to be A, because if the string is ASCII, it may contain up to A full characters.
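
A hedged sketch of the worst-case bookkeeping being settled here (4 UTF-8 bytes per character; the function names are illustrative, not from TAP or any library):

```python
def arraysize_for_varchar(n_chars: int) -> str:
    # Exporting a character-counted varchar(n_chars) column: each character
    # may need up to 4 UTF-8 bytes, so a safe declaration is 4*n_chars bytes.
    return f"{4 * n_chars}*"

def db_char_length_for_char(arraysize_bytes: int) -> int:
    # Ingesting a fixed-arraysize char FIELD: an all-ASCII value may hold
    # arraysize_bytes full characters, so a character-counted column needs
    # at least that many characters to avoid truncating legal input.
    return arraysize_bytes
```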

@pdowler (Contributor) commented Dec 1, 2025

I was thinking UTF8 is 1-2 bytes per character (because I knew UTF16 exists), but you are right: 1-4 bytes/char.


yeah, arraysize="A*" can have A bytes of ascii which is A characters in the database column 🤦 if you think about input. But a varchar(A) database column would let you put up to 4A bytes in there (silently) and you might truncate output to A bytes on output. Ugh. So I think now I'd have to enforce the arraysize (byte) limit in code to avoid storing input that could not be faithfully output later... it's a complication (in principle) because we allow input in a variety of formats (not just VOTable, where we'd interpret/enforce arraysize and detect arraysize mismatch vs table), but maybe those other forms are all pure ascii so not an actual problem.

Anyway, it seems like creating a database table from VOTable will be more subtle and complicated to implement correctly.

@fxpineau commented Dec 2, 2025

So far, I have not received any answer, so I am trying again:
does anyone know what the width attribute is used for in the case of fixed length character arrays?

If nobody has an answer, we may state in the VOTable document that, if present, width is to be interpreted as the maximum number of characters contained in the fixed length character column, and that it is recommended to include this attribute when possible.

A reader will then understand that:

datatype="char" arraysize="N" width="M"

means:

  • ASCII string of M characters (and bytes) if M = N → corresponds to CHAR(M) in PostgreSQL and SQL Server
  • UTF-8 string of up to M characters if M < N → corresponds to VARCHAR(M) in PostgreSQL and CHAR(N) in SQL Server
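
A hedged sketch of how a reader could act on that convention if it were adopted (purely illustrative; this describes the proposal, not current VOTable semantics):

```python
def interpret_char_field(arraysize: int, width: int | None) -> str:
    # Proposed reading of width on a fixed-size char FIELD:
    #   width == arraysize -> ASCII, width characters (and bytes)
    #   width <  arraysize -> UTF-8, at most width characters in arraysize bytes
    if width is None:
        return f"UTF-8, at most {arraysize} bytes (character count unknown)"
    if width == arraysize:
        return f"ASCII, at most {width} characters"
    return f"UTF-8, at most {width} characters in at most {arraysize} bytes"
```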

@msdemlei (Contributor) commented Dec 3, 2025 via email

@fxpineau commented Dec 4, 2025

Is there a strong use case for fixed-display-length strings in the 2020ies?

My point here is not about display; it is about having a way to tag ASCII strings (i.e., pre-v1.6 VOTable Strings).

  • Why:

    • because ASCII strings are simpler, safer, and allow for more efficient operations
      • see @pdowler’s latest message
      • truncating fixed-byte-length UTF-8 strings may lead to variable-byte-length strings
      • truncating fixed-byte-length UTF-8 strings is either slower or unsafe (e.g., it may panic)
    • because FITS does not support UTF-8, it may be important to know in advance whether a VOTable-to-FITS conversion will be lossy or not (and avoid entering the slow UTF-8 --> ASCII code).
    • thus, a library developer may decide to internally support both ASCII and UTF-8 string types
      • allowing for "business as usual" with ASCII
      • rejecting (at first) UTF-8 while waiting to implement and use the "more subtle and complicated" (@pdowler) code
      • testing for non-ASCII characters (and removing them) when converting to FITS
      • ...
  • How: Two possible solutions are: using xtype="ascii"; using arraysize == width.
    I think I finally prefer the xtype solution, which also makes it possible to tag variable-length strings and arrays of strings.

Postgres is rather clear in the docs

Yes: "character(n) has performance advantages in some other database systems".
I would add: and also outside databases.

fixed-storage-length [...] is cool

Yes, this is why I prefer CHAR(N) in SQL Server over VARCHAR(M) in PostgreSQL:
in the first case, you know you can use out-of-database fixed-length storage directly from the DB type.

@msdemlei (Contributor) commented Dec 4, 2025 via email

@mbtaylor (Member Author)

@fxpineau's suggestion about co-opting the width attribute for this purpose is indeed a little bit cryptic, since the meaning here would be "maximum length in characters" (where characters means Unicode code points), which isn't exactly the same as its currently documented meaning "number of characters to be used for input or output" (VOTable 1.5 sec 4.2), though they are clearly related.

But on second thoughts (thanks for your persistence FX) I'm coming to like it more. I think it's at least harmless, and if documented sufficiently prominently it would be able to address some of the concerns raised here. Not only can it serve as a flag for ASCII content, it can also be used to signal length in characters (code points) to consumers that want that information for resource allocation, as raised by @stvoutsin:

We're also now basically saying that you need to scan the data once to get the character count for schema allocation, numpy arrays, etc. With astropy in mind in particular, this will probably have a performance impact since they'd have to either scan the data once or use object arrays, which would lose some of the advantages of fixed-length arrays (memory overhead, loss of vectorization)

The width idea is therefore more capable (as well as being a bit less ugly IMHO) than co-opting either VALUES or xtype to signal ASCII content.

