RS-19602: Remove characters from variable name if not supported by either Displayr or haven #93

chschan · 2025-11-26T03:55:56Z

Are the allowed characters for haven or Displayr documented? Maybe you could add a link for them? For this bug, what was causing the error?

Not documented explicitly as far as I know. The official rules from SPSS are vague:

Subsequent characters can be any combination of letters, numbers, nonpunctuation characters, and a period (.)

https://www.ibm.com/docs/en/spss-statistics/cd?topic=view-variable-names

haven has the following regex: https://github.com/tidyverse/haven/blob/190f7a9ababae53ea4bfb85e4183c21881644c29/R/haven-spss.R#L140

Whereas the readstat library (which haven uses) has this:
https://github.com/WizardMac/ReadStat/blob/dev/src/spss/readstat_sav_parse.rl#L20

Displayr has this: https://github.com/Displayr/q/blob/master/QLib/DataModel/Variables/RawVariable.cs#L25-L58

All three are different.

The regexes are hard to read, so I also manually tested both haven and Displayr to derive the conclusions in the tests.

I guess if it were clearly documented, we wouldn't be running into these bugs. But the regex doesn't look too bad now.

Yeah, it's good that it ended up being quite simple. Better to have a bias towards removing characters, which should mean bugs like this won't happen or will be rare.

When I fixed the previous bug, I must have assumed that the input data files would never have invalid characters. This bug wouldn't happen if there were explicit rules and everyone followed them. The em dash in the bug was allowed by Displayr, but not by haven.
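A minimal sketch of the case described above, for illustration only. The function body is copied from the diff in this PR; the variable name `bad.name` and its value are made up, and the em dash is written as the escape `\u2014` so the example does not depend on file encoding.

```r
# Copied from the diff in this PR: drop every character that is not a
# letter, an ASCII digit, a currency symbol, or one of \ . _ $ # @
removeInvalidCharacters <- function(name)
{
    gsub("[^\\pL0-9\\p{Sc}\\\\._$#@]", "", name, perl = TRUE)
}

# Hypothetical variable name containing an em dash (U+2014),
# which is punctuation and so falls outside the allowed set.
bad.name <- "Q1\u2014score"

clean.name <- removeInvalidCharacters(bad.name)
print(clean.name)  # expected: "Q1score"
```

Removing the character rather than replacing it keeps the result inside the intersection of what haven and Displayr accept, at the cost of possibly producing duplicate names that need to be made unique elsewhere.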

@@ Expand Up @@
     {
         input.name |>
             removeWhitespace() |>
+            removeInvalidCharacters() |>
             removeInvalidStartingCharacters() |>
             truncateNameToByteLimit() |>
             trimTrailingPeriods() |>
@@ -534,6 +535,19 @@ removeWhitespace <- function(name)
         gsub("\\s+", "", name)
     }
+    removeInvalidCharacters <- function(name)
+    {
+        # The regex matches all characters except:
+        #   \\pL = any kind of letter from any language
+        #   Numeric characters 0-9
+        #   \\p{Sc} = any kind of currency symbol
+        #   The special characters \ . _ $ # @
+        #
+        # This is stricter than either haven or Displayr, because their set of allowed characters are different.
+        # See unit tests for removeInvalidCharacters
+        gsub("[^\\pL0-9\\p{Sc}\\\\._$#@]", "", name, perl = TRUE)
+    }
     removeInvalidStartingCharacters <- function(name)
     {
         gsub("^[^a-zA-Z@]+", "", name)
@@ Expand Down @@

@@ Expand Up @@
                           "VAR",
                           "VAR_1"))
     })
+    test_that("removeInvalidCharacters", {
+        expect_equal(removeInvalidCharacters(intToUtf8(0:127)), "#$.0123456789@ABCDEFGHIJKLMNOPQRSTUVWXYZ\\_abcdefghijklmnopqrstuvwxyz") # check characters in basic ASCII
+        expect_equal(removeInvalidCharacters("ç"), "ç") # letter characters in extended ASCII are allowed
+        expect_equal(removeInvalidCharacters("½"), "") # number characters in extended ASCII are removed (allowed by haven but not Displayr)
+        expect_equal(removeInvalidCharacters("¥"), "¥") # currency characters in extended ASCII are allowed
+        expect_equal(removeInvalidCharacters("…"), "") # punctuation characters in extended ASCII are removed (allowed by Displayr but not haven)
+        expect_equal(removeInvalidCharacters("©"), "") # other characters in extended ASCII are removed (allowed by haven but not Displayr)
+        expect_equal(removeInvalidCharacters("名称"), "名称") # "letter" unicode characters are allowed
+        expect_equal(removeInvalidCharacters("∞"), "") # number characters in unicode are removed (allowed by haven but not Displayr)
+        expect_equal(removeInvalidCharacters("€"), "€") # currency unicode characters are allowed
+        expect_equal(removeInvalidCharacters("¿"), "") # punctuation unicode characters are removed (not allowed by either)
+    })

chschan Nov 26, 2025

JustinCCYap Nov 26, 2025

JustinCCYap Nov 26, 2025

chschan Nov 26, 2025

JustinCCYap Nov 26, 2025
