
Script Language Definition

David Callahan edited this page Dec 30, 2015 · 10 revisions

Script State

An rsed script defines a sequence of statements that are executed in the context of an input file, which is consumed line by line. The current input line, possibly modified by script actions, is available in a predefined variable, as is its line number in the input file. Script statements can change the current line, alter the values associated with variables, advance to the next input line, output a line, or change the input or output streams. Input lines may come from standard input, files, the output of shell commands, or fragments of some other line.

Lexical Elements

rsed scripts are ASCII text files and generally are structured as a sequence of commands, one per line.

A # not inside a string denotes the start of a comment, which continues to the end of the line. These characters are ignored. A line that is empty or contains only white-space characters is also ignored.

An & not inside a string suppresses the following newline to allow a single statement to span multiple lines.

A ; is treated as a newline and allows multiple statements on a single line.

An identifier is a sequence of upper and lower case letters, digits, or underscores (_) which does not begin with a digit. An identifier preceded by a dollar sign ($) is a variable. The following identifiers are reserved and have syntactic meaning:

all       and       close     columns   copy
else      end       error     for       foreach
if        in        input     match     not
or        output    past      print     replace
required  skip      split     style     then
to        with

A number is a sequence of digits. A number preceded by a dollar sign is a dynamic variable.

The following sequences are interpreted as operators:

+    -   *   /  =~  =   ,
==   !=  <> <   <=  >=  >
(    )   $(

There are two forms of string constant. The inline string constant begins with a single (') or double (") quote and continues until the matching ending quote character. The string is the sequence of characters between these quotes. An instance of the starting character preceded by \ is interpreted as just an instance of that character and not as the matching ending character. The ending character may be followed by one or more string modifier letters: g, i, and r. The presence of g or i affects the interpretation of the string as a regular expression. The presence of r indicates that this is a raw string.

The second form of string constant is the multi-line string. It is introduced by the character sequence <<, followed by an identifier which may optionally be enclosed in single or double quotes. If the identifier is enclosed in quotes then it may be followed by string modifiers. The string begins after the next newline and continues until a newline is immediately followed by the same identifier. For example:

foo = << "END"r
line 1
line 2
END

This creates a raw string consisting of "line 1\nline 2", where \n indicates the newline character. Note that all characters after the terminator are processed as part of the input, so the syntactic context in which the "<<" appeared effectively continues immediately after the terminator.
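The terminator rule can be modelled in a few lines of Python. This is only a sketch of the behaviour described above, not rsed's actual lexer; the function name read_heredoc and the list-of-lines input are assumptions made for illustration.

```python
def read_heredoc(lines, start, terminator):
    """Collect the lines after index `start` until one begins with `terminator`.

    Returns (body, rest_of_terminator_line, index_of_terminator_line).
    """
    body = []
    for i in range(start + 1, len(lines)):
        if lines[i].startswith(terminator):
            # Text after the terminator is handed back to the caller, which
            # would resume parsing the statement that contained "<<".
            return "\n".join(body), lines[i][len(terminator):], i
        body.append(lines[i])
    raise SyntaxError("unterminated multi-line string: " + terminator)


script = ['foo = << "END"r', "line 1", "line 2", "END"]
body, rest, end = read_heredoc(script, 0, "END")
```

Here body holds the two content lines joined by a newline, and rest holds whatever followed the terminator on its line (nothing in this case).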

For strings that are not marked as raw, the following escape sequences are interpreted as specific characters, in the same manner as in C++:

\a    alert bell
\b    backspace
\f    form feed
\n    newline
\r    carriage return
\t    tab 
\v    vertical tab
\\    \
\?    ?
\'    '
\"    "
\0    null character (ASCII 0)
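As a rough illustration, the table can be applied with a single left-to-right scan. The Python below is a sketch, not rsed's implementation; the name decode_escapes is hypothetical, and leaving an unlisted sequence such as \q untouched is an assumption, since the text does not say how those are handled.

```python
# Escape letters from the table above, mapped to the characters they denote.
ESCAPES = {
    "a": "\a", "b": "\b", "f": "\f", "n": "\n", "r": "\r", "t": "\t",
    "v": "\v", "\\": "\\", "?": "?", "'": "'", '"': '"', "0": "\0",
}


def decode_escapes(text, raw=False):
    """Return `text` with the listed escapes decoded, unless it is raw."""
    if raw:
        return text
    out = []
    i = 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text) and text[i + 1] in ESCAPES:
            out.append(ESCAPES[text[i + 1]])
            i += 2
        else:
            # Assumption: unlisted sequences are copied through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)
```

Calling decode_escapes with raw=True returns the input unchanged, which models the r modifier.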

Values

rsed manipulates three kinds of values: strings (including modifiers), numbers, and logicals. Numbers are represented as 64-bit IEEE floating-point values, while logicals are represented as true and false. Values may be implicitly converted as determined by context. An empty string and the string "false" are interpreted as false; all other strings are true. A numeric value of 0 or NaN is treated as false; all others are true. A logical value of false is converted to the string "false" and true to "true". Should a conversion fail, a diagnostic is generated and script interpretation halts immediately.
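The conversions to and from logical values can be sketched in Python as follows. This models only the rules stated above; the function names are hypothetical, and Python's bool, str and float stand in for rsed's logical, string and number kinds.

```python
import math


def to_logical(value):
    """Convert a string, number or logical to a logical per the rules above."""
    if isinstance(value, bool):      # test bool first: bool is a subclass of int
        return value
    if isinstance(value, str):
        return value not in ("", "false")
    return not (value == 0 or math.isnan(value))


def logical_to_string(value):
    """Convert a logical to its string form."""
    return "true" if value else "false"
```

Note that under these rules the string "0" is true, because only the empty string and "false" are false.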

Expressions

Below is the syntax for an expression. Generally, expressions are similar to those of most languages, with binary infix operators, unary minus, comparison operations, and logical operators, all with their usual meanings. This subset has common precedence rules. Below, where e1, e2, ... denote expressions, each line illustrates an expression. Parentheses may be used to group terms as needed for clarity or to alter precedence.

( e1 ) 

where the evaluation results in the same value as e1.

Arithmetic operations

e1 + e2
e1 - e2
- e1 
e1 * e2
e1 / e2

When evaluated, all operand values are converted to numbers and the result is a number with the usual interpretation.

Comparison operations

e1 < e2
e1 <= e2
e1 != e2
e1 <> e2
e1 >= e2
e1 > e2 

When evaluated, comparison operations convert their operands to a common form based on this table:

e1\e2   | string   logical  number
--------+--------------------------
string  | string   string   string
logical | string   logical  logical
number  | string   logical  number

The result is a logical value where standard ordering rules apply to numbers and strings and true is strictly greater than false.
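The table amounts to a simple ranking: string dominates logical, which dominates number. The Python sketch below computes only the common kind; the names kind, RANK and common_kind are hypothetical, and the actual value conversions (for example, how a number is formatted as a string) are not modelled because the text does not specify them.

```python
RANK = {"number": 0, "logical": 1, "string": 2}


def kind(value):
    """Classify a Python value as one of the three rsed value kinds."""
    if isinstance(value, bool):      # test bool first: bool is a subclass of int
        return "logical"
    if isinstance(value, str):
        return "string"
    return "number"


def common_kind(e1, e2):
    """Return the kind both operands are converted to before comparison."""
    return max(kind(e1), kind(e2), key=RANK.__getitem__)
```

For example, comparing a logical with a number yields a logical comparison, in which true is greater than false.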

Logical operations

 e1 and e2
 e1 or e2
 not e1

When the operators and and or are evaluated, their operands are converted to logical values and the usual interpretations of conjunction and disjunction are used, respectively. The result is a logical value. For not, if the operand is a logical value or a number, it is converted to a logical value and inverted in the usual way.

For operator not, if the value of the operand is a string, the expression is interpreted as if it were:

 not (match (e1))

and the result is logical.

String operations

  e1 e2 

Juxtaposition of two expressions is interpreted as string concatenation. When evaluated, both operands are converted to strings and the result is the concatenation of those values. If either operand has a string modifier, the result has this modifier. Note that concatenation has higher precedence than arithmetic, so e1 e2 + e3 is evaluated as (e1 e2) + e3.

  e1 =~ e2
  match e2
  replace [all] e2 with e3 [in e1]

In the above, [...] denotes optional elements. The expression match e2 is equivalent to $CURRENT =~ (e2), and the expression replace [all] e2 with e3 is equivalent to (replace [all] e2 with e3 in $CURRENT), where $CURRENT is a special variable which holds the current line value. In these expressions, all operand values are converted to strings as needed.

In e1 =~ e2, the value from the evaluation of e2 is interpreted as a regular expression. If that regular expression matches the value from the evaluation of e1, then the result is a logical true, and false otherwise. If the expression matches, its sub-matches are available in the dynamic variables $0, $1, ... Expression evaluation is performed left to right, with operands evaluated before the operators that contain them. Dynamic variables always refer to the most recently evaluated match operation.
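The interaction between =~ and the dynamic variables can be modelled in Python. This is a sketch under several assumptions: Python's re syntax stands in for rsed's regular expression dialect, the match is treated as a search anywhere in the subject, and a failed match is assumed to leave the previous sub-matches in place. The class name MatchState is hypothetical.

```python
import re


class MatchState:
    """Holds the sub-matches of the most recent successful match."""

    def __init__(self):
        self.groups = []             # models $0, $1, $2, ...

    def match(self, subject, pattern):
        """Model `subject =~ pattern`, returning a logical result."""
        m = re.search(pattern, subject)
        if m is None:
            return False             # assumption: earlier sub-matches are kept
        self.groups = [m.group(0), *m.groups()]
        return True


state = MatchState()
matched = state.match("key=value", r"(\w+)=(\w+)")
```

After this call, state.groups[0] models $0 (the whole match), while state.groups[1] and state.groups[2] model $1 and $2.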

The replace operation replaces the first (or optionally every) match of the regular expression from the evaluation of e2 with the value from the evaluation of e3 in the string from the evaluation of e1. The resulting value is returned as the result of the replace operator. Every match is replaced when the all keyword is present or when the g modifier is present on the regular expression string e2. The replace operation does not modify dynamic variables.
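The difference between replacing the first match and every match can be illustrated with Python's re.sub. This is a model only: the function name and its replace_all and g_modifier parameters are hypothetical, Python's regular expression syntax is assumed, and the replacement is inserted literally because the text does not say whether it may refer to sub-matches.

```python
import re


def replace(subject, pattern, replacement, replace_all=False, g_modifier=False):
    """Model `replace [all] pattern with replacement in subject`."""
    # re.sub treats count=0 as "replace every match".
    count = 0 if (replace_all or g_modifier) else 1
    # A function replacement keeps the replacement text literal.
    return re.sub(pattern, lambda m: replacement, subject, count=count)
```

Either the all keyword or the g modifier is enough to switch from first-match to every-match behaviour.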

Builtin Functions

There are a number of builtin functions, for which the syntax is

  name ( arglist )

where name is an identifier and arglist is a possibly empty, comma-separated list of expressions.

Variables

A variable reference is a $ preceding an identifier, a number, or a parenthesized expression:

$( e )

When the reference is an identifier, the expression returns the value currently associated with that identifier. That value may be explicitly specified in the program using a set statement, in which case the most recently set value is returned. Alternatively, the identifier may be one of the following predefined variables:

 $CURRENT     current input line being processed after any replacements
 $SOURCE      original value of input line being processed
 $LINE        number of current input line where the first line is 1
 $ARG1        first parameter to the script excluding flags and the script name
              additional arguments are available as $ARG2, $ARG3, ...

If the variable is not otherwise defined but its identifier is defined in the ambient environment in which the script is executing, then that value is returned.

When a number, such as $2, this refers to the corresponding sub-match of the most recent regular expression match evaluation.

The expression $(e) evaluates the operand expression as a string and returns the value associated with that string, following the rules above.
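The lookup order described in this section can be summarised with a small Python model. It is a sketch, not rsed's implementation: the function name lookup and its parameters are hypothetical, script_vars stands for both predefined and script-assigned variables, and raising an error for an undefined name is an assumption, since the text does not say what happens in that case.

```python
import os


def lookup(name, script_vars, submatches, environ=os.environ):
    """Resolve `$name`, where name is an identifier or a sequence of digits."""
    if name.isdigit():
        # $0, $1, ...: sub-match of the most recent match evaluation.
        return submatches[int(name)]
    if name in script_vars:
        return script_vars[name]
    if name in environ:
        # Fall back to the ambient environment.
        return environ[name]
    raise KeyError(name)             # assumption: undefined names are an error
```

Under this model, $(e) corresponds to calling lookup with the string value of e as the name.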

Numbers

A numeric literal is also an expression which evaluates to its numeric value.

Statements
