Problem
Reading from /dev/urandom in the VFS returns UTF-8 encoded data instead of raw bytes. Bytes > 0x7F are expanded to 2-byte UTF-8 sequences (c2 xx or c3 xx), inflating the output size and breaking byte-level text processing.
Note: #828 partially fixed this (no more U+FFFD replacement characters), but bytes are still UTF-8 encoded rather than raw.
Reproduction
Test 1: Byte count inflation
head -c 8 /dev/urandom | wc -c
# Expected: 8
# Actual: 11-13 (varies, always > 8)
Requested vs actual byte counts:
| Requested | Actual |
|---|---|
| 1 | 1 |
| 4 | 5 |
| 8 | 11 |
| 16 | 25 |
| 32 | 51 |
Roughly 50-60% inflation at the larger sizes, consistent with about half of random bytes being > 0x7F and getting 2-byte UTF-8 encoding (an expected average of 1.5 output bytes per requested byte).
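The arithmetic behind that estimate can be sketched as follows. This is only an illustration, assuming each requested byte is widened to one code point in U+0000..U+00FF before being UTF-8 encoded. The function names `expected_len` and `inflation_percent` are hypothetical and do not come from the VFS code.

```rust
/// Expected output size in bytes for `requested` uniformly random bytes,
/// ASSUMING values 0x00..=0x7F stay 1 byte and 0x80..=0xFF become 2 bytes.
/// Half of all byte values fall in each range, so the mean is 1.5 bytes.
fn expected_len(requested: u32) -> f64 {
    let mean_bytes_per_input = (128.0 * 1.0 + 128.0 * 2.0) / 256.0;
    f64::from(requested) * mean_bytes_per_input
}

/// Inflation of an observed size over the requested size, in percent.
fn inflation_percent(requested: u32, actual: u32) -> f64 {
    (f64::from(actual) / f64::from(requested) - 1.0) * 100.0
}
```

Applied to the table above, `inflation_percent(8, 11)` is 37.5, `inflation_percent(16, 25)` is 56.25, and `inflation_percent(32, 51)` is 59.375, against an expected 50%. Small samples scatter around that value, which matches the "varies" note in Test 1.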
Test 2: Raw bytes show UTF-8 multibyte sequences
head -c 16 /dev/urandom | od -A x -t x1z | head -2
# 0000000 c3 aa 5b 4d c3 82 c3 bd 72 25 c3 8c c3 84 68 09
The c3 and c2 prefixes are UTF-8 lead bytes: a raw byte such as 0xEA becomes c3 aa (2 bytes), and 0xAA would become c2 aa.
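To check a dump like the one above, the encoding can be reversed. The sketch below assumes the VFS maps each raw byte to the code point with the same value (a Latin-1 style widening), which matches the c2/c3 lead bytes but is not confirmed from the source code. `recover_raw` is a hypothetical helper name.

```rust
/// Undo the assumed byte -> code point -> UTF-8 expansion.
/// Returns None if the input is not valid UTF-8 or contains a code point
/// above U+00FF, which could not have come from a single raw byte.
fn recover_raw(encoded: &[u8]) -> Option<Vec<u8>> {
    let text = std::str::from_utf8(encoded).ok()?;
    text.chars()
        .map(|c| {
            let code_point = c as u32;
            if code_point <= 0xFF {
                Some(code_point as u8)
            } else {
                None
            }
        })
        .collect()
}
```

Under that assumption, the 16 bytes printed by od in Test 2 decode to only 11 raw bytes (ea 5b 4d c2 fd 72 25 cc c4 68 09), so the dump carries fewer random bytes than it appears to.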
Test 3: tr -dc filtering broken
LC_ALL=C tr -dc 'a-z0-9' < /dev/urandom | head -c 8
# Expected: 8 alphanumeric chars like "a7xk2m9p"
# Actual: garbled non-ASCII like "ÅʤÄ" (4-5 chars, many non-ASCII)
Test 4: Repeated runs show inconsistent lengths
result=$(LC_ALL=C tr -dc 'a-z0-9' < /dev/urandom | head -c 8)
echo "len=${#result} val=$result"
# len=5 val=Ýs±Ï�
# len=5 val=¹aÑ�
# len=4 val=²¨á
Root cause
The VFS read path converts random bytes through a Rust String (which must be valid UTF-8). Bytes > 0x7F are encoded as multi-byte UTF-8 sequences instead of being passed through as raw single bytes.
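The difference between the two paths can be illustrated with a minimal sketch. It is not the actual VFS code: `read_via_string` and `read_raw` are hypothetical names, and the byte-to-char widening is an assumption inferred from the observed c2/c3 output.

```rust
/// Lossy path (assumed): each byte is widened to a char and collected into a
/// String, so every byte above 0x7F comes back as two UTF-8 bytes.
fn read_via_string(raw: &[u8]) -> Vec<u8> {
    let text: String = raw.iter().map(|&b| char::from(b)).collect();
    text.into_bytes()
}

/// Byte-preserving path: the data stays a byte buffer from start to finish
/// and never passes through a String.
fn read_raw(raw: &[u8]) -> Vec<u8> {
    raw.to_vec()
}
```

For the input `[0x41, 0xAA, 0xEA]`, the first function yields `41 c2 aa c3 aa` (5 bytes) while the second returns the 3 original bytes. A fix along these lines would keep device reads as `Vec<u8>` or `&[u8]` throughout the read path.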
Impact
The common pattern tr -dc 'a-z0-9' < /dev/urandom | head -c N for generating random strings is broken. This pattern is used in wedow/ticket's generate_id(). Workaround: use $RANDOM instead.