repattern Specification 1.0.0-beta.0

Specification of an embeddable DSL for building JavaScript RegExp from declarative Scheme objects.

Languages: Russian

Participate: GitHub teplostanski/repattern-spec (new issue, open issues)

Abstract

The repattern specification defines an embeddable DSL for building JavaScript regular expressions from declarative Scheme objects, which are converted into RegExp instances.

Schemes are a human-readable format for describing patterns, as an alternative to writing regular expressions by hand.

Note

To use this specification effectively, you should understand regular expression terminology and how regular expressions work, since schemes describe the same concepts in a declarative format.

Table of Contents

  1. Terms
    1. Scheme
    2. Atom
    3. Sequencer
    4. Atom-sequencer
    5. Params object params
  2. Types
    1. Scheme
    2. Atom
    3. AtomSequencer
    4. Params
  3. Atom-sequencers
    1. Quantifiers
      1. repeat
      2. zeroOrMore
      3. maybe
    2. Group
      1. grouped
    3. Alternation
      1. anyOf
  4. Atoms
    1. Anchors
      1. lineStart
      2. lineEnd
    2. Literals and special characters
      1. exactly
      2. anyChar
      3. tab
      4. lineFeed
      5. carriageReturn
    3. Character sets
      1. charIn
      2. charNotIn
    4. Group references
      1. referenceTo
    5. Unicode properties
      1. unicodeProps
    6. Character classes
      1. digit
      2. word
      3. whitespace
      4. boundary

1. Terms

1.1. Scheme

A scheme is an array (sequencer) of objects (atoms) that describes the structure of a regular expression in a declarative format. Each element of a scheme (atom) represents one logical step of the pattern.

Simple example:

const scheme = [
  { lineStart: true },
  { zeroOrMore: [{ charIn: 'a-z' }] },
  { lineEnd: true },
];
// Result: /^(?:[a-z])*$/

Example: email address validation

Regular expression:

/^(?:[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-])+@[a-zA-Z0-9](?:(?:[a-zA-Z0-9-]){0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:(?:[a-zA-Z0-9-]){0,61}[a-zA-Z0-9])?)+$/

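This regular expression performs a simplified check of email address syntax. It closely follows the pattern that the HTML Standard defines for email inputs and does not implement the full RFC 5322 grammar.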

The scheme for this regular expression:

const scheme = [
  { lineStart: true }, // ^
  {
    repeat: [
      // (?: ... )+
      { charIn: "a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-" }, // [a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]
    ],
  },
  { exactly: '@' }, // @
  { charIn: 'a-zA-Z0-9' }, // [a-zA-Z0-9]
  {
    maybe: [
      // (?: ... )?
      {
        repeat: [
          // (?: ... ){0,61}
          { charIn: 'a-zA-Z0-9-' }, // [a-zA-Z0-9-]
          { params: { times: [0, 61] } },
        ],
      },
      { charIn: 'a-zA-Z0-9' }, // [a-zA-Z0-9]
    ],
  },
  {
    repeat: [
      // (?: ... )+
      { exactly: '.' }, // .
      { charIn: 'a-zA-Z0-9' }, // [a-zA-Z0-9]
      {
        maybe: [
          // (?: ... )?
          {
            repeat: [
              // (?: ... ){0,61}
              { charIn: 'a-zA-Z0-9-' }, // [a-zA-Z0-9-]
              { params: { times: [0, 61] } },
            ],
          },
          { charIn: 'a-zA-Z0-9' }, // [a-zA-Z0-9]
        ],
      },
    ],
  },
  { lineEnd: true }, // $
];

1.2. Atom

An atom is the smallest unit of a scheme, describing one logical step of the pattern (anchor, character class, quantified sequence, alternation, and so on).

An atom is an object with exactly one key that defines the type of step. The value of that key may only be:

  • a string;
  • a number (only for referenceTo; the numeric values of times belong to the params object, see Params object params);
  • a boolean;
  • an array of atoms (sequencer), or for anyOf, an array of anonymous sequencers.

The value of an atom cannot be another atom or an arbitrary object.

Any object encountered in a scheme is treated as an atom, except the special params object params, which is not considered an atom and does not describe a separate pattern step (see Params object params).
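The shape rules above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical helper named isAtomShape that is not part of the specification. It inspects only the shape of an object (one key, allowed value kinds, params excluded) and does not check whether the key is a known atom name.

```javascript
// Hypothetical helper (not defined by the specification):
// checks only the shape rules from this section.
function isAtomShape(value) {
  if (value === null || typeof value !== 'object' || Array.isArray(value)) {
    return false;
  }
  const keys = Object.keys(value);
  if (keys.length !== 1) return false;
  // The params object is not an atom, even though it has a single key.
  if (keys[0] === 'params') return false;
  const inner = value[keys[0]];
  // Allowed values: string, number, boolean, or an array (sequencer).
  return (
    typeof inner === 'string' ||
    typeof inner === 'number' ||
    typeof inner === 'boolean' ||
    Array.isArray(inner)
  );
}

console.log(isAtomShape({ digit: true })); // true
console.log(isAtomShape({ exactly: 'a', digit: true })); // false (two keys)
console.log(isAtomShape({ params: { lazy: true } })); // false (params object)
console.log(isAtomShape({ exactly: { nested: 'x' } })); // false (object value)
```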

The available atoms are listed in the Atoms section.

1.3. Sequencer

A sequencer is an array of atoms that combines several sub-patterns into a sequence. The order of atoms in a sequencer corresponds to their order in the resulting regular expression.

Types of sequencers

A scheme has three types of sequencers, differing by context and allowed contents:

  • Root sequencer — the top-level array of the scheme. Contains only atoms.

  • Atom-sequencer — an atom whose value is a sequencer (see Atom-sequencer).

  • Anonymous sequencer — a sequencer without a name, found only inside the atom-sequencer anyOf. Used to describe alternative branches and may contain only atoms.
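To illustrate how a root sequencer and a nested atom-sequencer map to concatenation, here is a minimal sketch of a compiler for a small subset of atoms. The function names (compileSequence, compileAtom, toRegExp) are hypothetical and not defined by the specification. The sketch ignores the params object and every atom not listed in the switch.

```javascript
// Hypothetical sketch: compiles a small subset of atoms to RegExp source.
const escapeLiteral = (text) => text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

function compileSequence(atoms) {
  // Atoms are concatenated in array order.
  return atoms.map(compileAtom).join('');
}

function compileAtom(atom) {
  const [key] = Object.keys(atom);
  const value = atom[key];
  switch (key) {
    case 'lineStart':
      return value ? '^' : '';
    case 'lineEnd':
      return value ? '$' : '';
    case 'exactly':
      return escapeLiteral(value);
    case 'charIn':
      return `[${value}]`;
    case 'zeroOrMore':
      // Default wrapper: non-capturing group, then the * quantifier.
      return `(?:${compileSequence(value)})*`;
    default:
      throw new Error(`Unsupported atom in this sketch: ${key}`);
  }
}

const toRegExp = (scheme) => new RegExp(compileSequence(scheme));

const re = toRegExp([
  { lineStart: true },
  { zeroOrMore: [{ charIn: 'a-z' }] },
  { lineEnd: true },
]);

console.log(re.source); // ^(?:[a-z])*$
console.log(re.test('abc')); // true
console.log(re.test('abc1')); // false
```

A recursive descent over the arrays is enough here because every sequencer, whatever its type, is an ordered list whose compiled parts are simply joined.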

1.4. Atom-sequencer

An atom-sequencer is an atom whose value is a sequencer (an array of atoms) or, for anyOf, an array of sequencers. It describes a sequence of sub-patterns: nested atoms are interpreted in array order and correspond to concatenation of sub-patterns inside one grouping construct.

An atom-sequencer always acts as a wrapper over sub-patterns, equivalent to one of the regular expression groups:

  • non-capturing (?: … )
  • capturing ( … )
  • named (?<name> … )

By default, an atom-sequencer creates a non-capturing group (?: … ) as the wrapper (except grouped). The wrapper type can be changed with the group parameter in the params object.
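The three wrapper kinds can be summarized in a few lines. This is a sketch under the assumption that a compiler already has the source of the nested sub-patterns as a string; the helper name wrapGroup is hypothetical.

```javascript
// Hypothetical helper: wraps compiled source according to the group parameter.
function wrapGroup(source, group = false) {
  if (typeof group === 'string') return `(?<${group}>${source})`; // named
  return group ? `(${source})` : `(?:${source})`; // capturing or non-capturing
}

console.log(wrapGroup('foo')); // (?:foo)
console.log(wrapGroup('foo', true)); // (foo)
console.log(wrapGroup('foo', 'word')); // (?<word>foo)
```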

The available atom-sequencers are listed in the Atom-sequencers section.

1.5. Params object params

The params object is an auxiliary object for configuring an atom-sequencer. It is not considered a scheme atom, does not participate in the sub-pattern sequence, and does not affect the order in which nested atoms are matched. The params object must be placed at the end of the atom-sequencer array and is interpreted as metadata for the parent atom-sequencer.
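Because the params object always sits at the end of the array, a compiler can separate it from the atoms before processing them. A minimal sketch, assuming a hypothetical helper named splitParams:

```javascript
// Hypothetical helper: separates a trailing params object from the atoms.
function splitParams(sequence) {
  const last = sequence[sequence.length - 1];
  const hasParams =
    last !== undefined &&
    last !== null &&
    typeof last === 'object' &&
    !Array.isArray(last) &&
    'params' in last;
  return hasParams
    ? { atoms: sequence.slice(0, -1), params: last.params }
    : { atoms: sequence, params: {} };
}

const { atoms, params } = splitParams([
  { charIn: 'a-z' },
  { params: { times: [0, 3], lazy: true } },
]);

console.log(atoms.length); // 1
console.log(params.lazy); // true
console.log(splitParams([{ digit: true }]).params); // {}
```

The same helper works for anyOf, where the leading elements are arrays (branches) rather than atoms.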

Parameters

times

Sets an exact repetition count or a range.

Applies only to the repeat quantifier.

times?: number | [number] | [number, number];

Value        Equivalent   Description
n            {n}          exactly n repetitions
[min]        {min,}       from min to infinity
[min, max]   {min,max}    repetition range

If the times parameter is omitted, the + quantifier is used (repeat 1 or more times).
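The mapping from times to a quantifier can be written as a small function. This is a sketch; the name timesToQuantifier is hypothetical, and no validation of the numbers is attempted.

```javascript
// Hypothetical helper: converts the times parameter to a quantifier string.
function timesToQuantifier(times) {
  if (times === undefined) return '+'; // default: one or more
  if (typeof times === 'number') return `{${times}}`; // exact count
  const [min, max] = times;
  return max === undefined ? `{${min},}` : `{${min},${max}}`;
}

console.log(timesToQuantifier()); // +
console.log(timesToQuantifier(3)); // {3}
console.log(timesToQuantifier([2])); // {2,}
console.log(timesToQuantifier([0, 61])); // {0,61}
```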

lazy

Lazy quantifier flag.

true makes the quantifier lazy (for example, *? or {n,}?); false leaves it greedy. Applies to all quantifiers: repeat, zeroOrMore, maybe.

lazy?: boolean;
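The difference between a greedy and a lazy quantifier is easiest to see on concrete input. The example below uses plain JavaScript regular expressions written in the grouped form that a repeat atom-sequencer would be expected to produce; the input string is made up for illustration.

```javascript
const text = '<a><b>';

// Greedy: the group repeats as many times as possible.
const greedy = /<(?:.)+>/;
// Lazy: the group repeats as few times as possible.
const lazy = /<(?:.)+?>/;

console.log(text.match(greedy)[0]); // <a><b>
console.log(text.match(lazy)[0]); // <a>
```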

group

Defines the type of grouping construct:

  • false: non-capturing group (?: … )
  • true: capturing group ( … )
  • "<name>": named group (?<name> … )

group?: boolean | string;

optionally

Flag that makes the whole alternation group optional (zero or one occurrence), as an alternative to wrapping anyOf in maybe.

Applies only to anyOf alternation.

optionally?: boolean;

Usage context

The params object may appear only inside atom-sequencers.

Parameter applicability by atom-sequencer:

Parameter    Atom-sequencer
times        repeat
lazy         repeat, zeroOrMore, maybe
group        repeat, zeroOrMore, maybe, grouped, anyOf
optionally   anyOf
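The applicability table lends itself to a data-driven check. The sketch below assumes a hypothetical helper named unsupportedParams; how an implementation reacts to unsupported parameters (error, warning, ignore) is not defined here and is left open.

```javascript
// Applicability table from this section, expressed as data.
const ALLOWED_PARAMS = {
  repeat: ['times', 'lazy', 'group'],
  zeroOrMore: ['lazy', 'group'],
  maybe: ['lazy', 'group'],
  grouped: ['group'],
  anyOf: ['group', 'optionally'],
};

// Hypothetical helper: returns the parameter names that are not allowed
// for the given atom-sequencer.
function unsupportedParams(sequencerName, params) {
  const allowed = ALLOWED_PARAMS[sequencerName] ?? [];
  return Object.keys(params).filter((name) => !allowed.includes(name));
}

console.log(unsupportedParams('repeat', { times: 2, lazy: true })); // []
console.log(unsupportedParams('maybe', { times: 2 })); // [ 'times' ]
console.log(unsupportedParams('grouped', { lazy: true, group: 'x' })); // [ 'lazy' ]
```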

2. Types

This section describes TypeScript types for working with schemes.

2.1. Scheme

A scheme is a root sequencer containing only atoms.

type Scheme = Atom[];

2.2. Atom

An atom is the smallest unit of a scheme—a union of all atom types.

type Atom = BooleanAtom | StringAtom | ReferenceToAtom | AtomSequencer;

type BooleanAtom =
  | { lineStart: boolean }
  | { lineEnd: boolean }
  | { digit: boolean }
  | { word: boolean }
  | { whitespace: boolean }
  | { boundary: boolean }
  | { anyChar: boolean }
  | { tab: boolean }
  | { lineFeed: boolean }
  | { carriageReturn: boolean };

type StringAtom =
  | { exactly: string }
  | { charIn: string }
  | { charNotIn: string }
  | { unicodeProps: string };

type ReferenceToAtom = { referenceTo: number | string };

2.3. AtomSequencer

Atom-sequencers that accept a sequencer (or, for anyOf, an array of sequencers) with the corresponding parameter types.

type AtomSequencer =
  | { repeat: RepeatSequence }
  | { zeroOrMore: ZeroOrMoreSequence }
  | { maybe: MaybeSequence }
  | { grouped: GroupedSequence }
  | { anyOf: AnyOfSequence };

// An optional tuple element cannot follow a rest element in TypeScript,
// so the variant without params is covered by the `| Atom[]` branch.
type RepeatSequence = [...Atom[], { params: RepeatParams }] | Atom[];

type ZeroOrMoreSequence = [...Atom[], { params: ZeroOrMoreParams }] | Atom[];

type MaybeSequence = [...Atom[], { params: MaybeParams }] | Atom[];

type GroupedSequence = [...Atom[], { params: GroupedParams }] | Atom[];

type AnyOfSequence = [...Atom[][], { params: AnyOfParams }] | Atom[][];

2.4. Params

Parameter types for atom-sequencers. Each atom-sequencer has its own set of allowed parameters.

type Params =
  | RepeatParams
  | ZeroOrMoreParams
  | MaybeParams
  | GroupedParams
  | AnyOfParams;

type RepeatParams = {
  times?: number | [number] | [number, number];
  lazy?: boolean;
  group?: boolean | string;
};

type ZeroOrMoreParams = {
  lazy?: boolean;
  group?: boolean | string;
};

type MaybeParams = {
  lazy?: boolean;
  group?: boolean | string;
};

type GroupedParams = {
  group?: boolean | string;
};

type AnyOfParams = {
  group?: boolean | string;
  optionally?: boolean;
};

3. Atom-sequencers

3.1. Quantifiers

3.1.1. repeat

repeat — Creates a group with a repetition quantifier. Groups sub-patterns and repeats them a specified number of times.

Supported parameters

The params object may contain (see Params object params):

  • times — without this parameter, equivalent to + (1 or more times);
  • lazy — default false;
  • group — default false (non-capturing group (?: … )).

Examples

const scheme = [
  { repeat: [{ charIn: 'a-z' }, { params: { times: [0, 3], lazy: true } }] },
];
// Result: /(?:[a-z]){0,3}?/

const scheme = [
  { repeat: [{ exactly: 'foo' }, { params: { group: 'word' } }] },
];
// Result: /(?<word>foo)+/
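The behavior of the two resulting expressions can be verified directly in JavaScript. The input strings below are made up for illustration. Note that a lazy range starting at zero is satisfied by an empty match, and that a quantified capturing group keeps only its last repetition.

```javascript
const lazyRange = /(?:[a-z]){0,3}?/;
const greedyRange = /(?:[a-z]){0,3}/;

// Lazy {0,3}? prefers zero repetitions, so the match is empty.
console.log(JSON.stringify('abcdef'.match(lazyRange)[0])); // ""
// Greedy {0,3} takes as many repetitions as allowed.
console.log('abcdef'.match(greedyRange)[0]); // abc

const named = /(?<word>foo)+/;
const m = 'foofoofoo'.match(named);
console.log(m[0]); // foofoofoo
// The named group holds the last repetition only.
console.log(m.groups.word); // foo
```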

3.1.2. zeroOrMore

zeroOrMore — Creates a group with the * quantifier. Repeats the sub-pattern zero or more times.

Supported parameters

The params object may contain (see Params object params):

  • lazy — default false;
  • group — default false (non-capturing group (?: … )).

Examples

const scheme = [{ zeroOrMore: [{ exactly: 'foo' }] }];
// Result: /(?:foo)*/

const scheme = [
  {
    zeroOrMore: [
      { charIn: 'a-z' },
      { params: { group: 'letters', lazy: true } },
    ],
  },
];
// Result: /(?<letters>[a-z])*?/

3.1.3. maybe

maybe — Creates a group with the ? quantifier (repeat the sub-pattern zero or one time). Makes the sub-pattern optional.

Supported parameters

The params object may contain (see Params object params):

  • lazy — default false;
  • group — default false (non-capturing group (?: … )).

Examples

const scheme = [{ maybe: [{ exactly: 'foo' }] }];
// Result: /(?:foo)?/

const scheme = [
  { maybe: [{ charIn: 'A-Z' }, { params: { group: 'opt', lazy: true } }] },
];
// Result: /(?<opt>[A-Z])??/

Note

For anyOf with deeply nested alternation branches, prefer the optionally parameter over wrapping in maybe to avoid excessive nesting.


3.2. Group

3.2.1. grouped

grouped — Creates a grouping construct. Combines several sub-patterns into one logical group.

Supported parameters

The params object may contain (see Params object params):

  • group — default true (capturing group ( … )).

Examples

const scheme = [{ grouped: [{ exactly: 'foo' }, { charIn: 'A-Z' }] }];
// Result: /(foo[A-Z])/

const scheme = [
  { grouped: [{ exactly: 'bar' }, { params: { group: false } }] },
];
// Result: /(?:bar)/

const scheme = [
  { grouped: [{ exactly: 'buzz' }, { params: { group: 'word' } }] },
];
// Result: /(?<word>buzz)/

3.3. Alternation

3.3.1. anyOf

anyOf — Creates alternation (the | operator). Combines several branches; at least one must match.

Structure

The value of the anyOf key is an array of anonymous sequencers (alternation branches):

{
  anyOf: [
    [ /* branch 1 */ ],
    [ /* branch 2 */ ],
    ...
  ]
}

Supported parameters

The params object may contain (see Params object params):

  • group — default false (non-capturing group (?: … ));
  • optionally — default false.

Note

To create optional alternation, use the optionally parameter instead of wrapping anyOf in maybe to avoid excessive nesting.

Examples

const scheme = [
  {
    anyOf: [
      [{ charIn: '01' }, { digit: true }], // branch 1
      [{ exactly: '2' }, { charIn: '0-3' }], // branch 2
      { params: { group: 'hours' } },
    ],
  },
  { exactly: ':' },
  {
    grouped: [
      { charIn: '0-5' },
      { digit: true },
      { params: { group: 'minutes' } },
    ],
  },
];
// Result: /(?<hours>[01]\d|2[0-3]):(?<minutes>[0-5]\d)/

const str = '23:59 25:99 1:2';
const re = /(?<hours>[01]\d|2[0-3]):(?<minutes>[0-5]\d)/;
const result = str.match(re);

console.log(result);

/* Output:
[
  '23:59',
  '23',
  '59',
  index: 0,
  input: '23:59 25:99 1:2',
  groups: [Object: null prototype] { hours: '23', minutes: '59' }
]
*/
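The way branches and parameters combine can be sketched as follows. The helper name compileAnyOf is hypothetical, and it takes branch sources that are already compiled to strings. The rendering of optionally as a trailing ? on the group is an assumption based on the note above about replacing a maybe wrapper; the specification text does not show the exact output.

```javascript
// Hypothetical helper: joins precompiled branch sources with |.
function compileAnyOf(branchSources, { group = false, optionally = false } = {}) {
  const body = branchSources.join('|');
  const open =
    typeof group === 'string' ? `(?<${group}>` : group ? '(' : '(?:';
  // Assumption: optionally appends ? to the whole group.
  return `${open}${body})${optionally ? '?' : ''}`;
}

console.log(compileAnyOf(['[01]\\d', '2[0-3]'], { group: 'hours' }));
// (?<hours>[01]\d|2[0-3])
console.log(compileAnyOf(['a', 'b'], { optionally: true }));
// (?:a|b)?
```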

4. Atoms

4.1. Anchors

4.1.1. lineStart

  • Type: Atom
  • Equivalent: ^

lineStart — Start-of-line anchor.

true adds the anchor; false omits this atom.

{ lineStart: true } // ^

4.1.2. lineEnd

  • Type: Atom
  • Equivalent: $

lineEnd — End-of-line anchor.

true adds the anchor; false omits this atom.

{ lineEnd: true } // $

4.2. Literals and special characters

4.2.1. exactly

exactly — Matches an exact sequence of characters (literal).

Accepts a string. Regular expression metacharacters in the string are escaped, so the string is matched literally.

{ exactly: 'foo' } // foo
{ exactly: '.' } // \.
{ exactly: '(' } // \(
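Escaping can be done with a single replace call. This is a common JavaScript idiom rather than something prescribed by the specification, and the helper name escapeLiteral is hypothetical.

```javascript
// Hypothetical helper: escapes regular expression metacharacters
// so that the text is matched literally.
function escapeLiteral(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

console.log(escapeLiteral('foo')); // foo
console.log(escapeLiteral('.')); // \.
console.log(escapeLiteral('(')); // \(

// Without escaping, the dot in 'a.b' would also match 'axb'.
const literal = new RegExp(escapeLiteral('a.b'));
console.log(literal.test('a.b')); // true
console.log(literal.test('axb')); // false
```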

4.2.2. anyChar

  • Type: Atom
  • Equivalent: .

anyChar — Matches any character except newline characters.

{ anyChar: true } // .

4.2.3. tab

  • Type: Atom
  • Equivalent: \t

tab — Tab character.

{ tab: true } // \t

4.2.4. lineFeed

  • Type: Atom
  • Equivalent: \n

lineFeed — Line feed character.

{ lineFeed: true } // \n

4.2.5. carriageReturn

  • Type: Atom
  • Equivalent: \r

carriageReturn — Carriage return character.

{ carriageReturn: true } // \r

4.3. Character sets

4.3.1. charIn

  • Type: Atom
  • Equivalent: [ ... ]

charIn — Character class matching any character from the specified set.

Accepts a string describing the character set using the same syntax as regex character classes (ranges, escapes, and so on).

{ charIn: 'a-z' } // [a-z]
{ charIn: 'a-zA-Z0-9' } // [a-zA-Z0-9]
{ charIn: 'abc' } // [abc]

4.3.2. charNotIn

  • Type: Atom
  • Equivalent: [^ ...]

charNotIn — Negated character class matching any character not in the specified set.

Accepts a string describing the character set using the same syntax as regex character classes.

{ charNotIn: 'a-z' } // [^a-z]
{ charNotIn: '0-9' } // [^0-9]

4.4. Group references

4.4.1. referenceTo

  • Type: Atom
  • Equivalent: \k<name>, \N

referenceTo — Reference to a captured group.

Accepts:

  • a number — backreference by group index (\N);
  • a string — backreference to a named group (\k<name>).

{ referenceTo: 1 } // \1
{ referenceTo: 'name' } // \k<name>
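The two forms of reference can be produced by one small function, and their effect checked on a concrete pattern. The helper name compileReference and the quoted-string pattern are made up for illustration.

```javascript
// Hypothetical helper: number gives \N, string gives \k<name>.
function compileReference(ref) {
  return typeof ref === 'number' ? `\\${ref}` : `\\k<${ref}>`;
}

console.log(compileReference(1)); // \1
console.log(compileReference('name')); // \k<name>

// The closing quote must be the same character as the opening quote.
const quoted = new RegExp(`(?<q>['"])[a-z]*${compileReference('q')}`);
console.log(quoted.source); // (?<q>['"])[a-z]*\k<q>
console.log(quoted.test(`'abc'`)); // true
console.log(quoted.test(`'abc"`)); // false
```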

4.5. Unicode properties

4.5.1. unicodeProps

  • Type: Atom
  • Equivalent: \p{...}

unicodeProps — Matches characters based on Unicode properties.

Accepts any property expression valid inside \p{...} (for example, Letter, Number, Script=Latin, and so on).

{ unicodeProps: 'Letter' } // \p{Letter}
{ unicodeProps: 'Script=Latin' } // \p{Script=Latin}
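In JavaScript, \p{...} is recognized as a Unicode property escape only when the regular expression has the u (or v) flag. This document does not describe how flags are chosen, so passing the flag below is an assumption about how a compiled scheme would be instantiated.

```javascript
// With the u flag, \p{...} is a Unicode property escape.
const letter = new RegExp('\\p{Letter}', 'u');
console.log(letter.test('ж')); // true
console.log(letter.test('7')); // false

const latin = /\p{Script=Latin}/u;
console.log(latin.test('a')); // true
console.log(latin.test('ж')); // false

// Without the u flag, the same source is not a property escape.
const noFlag = new RegExp('\\p{Letter}');
console.log(noFlag.test('ж')); // false
```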

4.6. Character classes

Character classes are atoms that take a boolean value. Unlike anchors, where false omits the atom, a false value on a character class produces the negated class (for example, \D instead of \d, \W instead of \w).
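The two meanings of false can be put side by side in code. This is a sketch with a hypothetical helper named compileBooleanAtom; it covers only the anchors and the character classes of this section, since the effect of false on the remaining boolean atoms is not stated here.

```javascript
// Anchors: false omits the atom.
const ANCHORS = { lineStart: '^', lineEnd: '$' };
// Character classes: [source for true, source for false].
const CLASSES = {
  digit: ['\\d', '\\D'],
  word: ['\\w', '\\W'],
  whitespace: ['\\s', '\\S'],
  boundary: ['\\b', '\\B'],
};

// Hypothetical helper: compiles one boolean atom to RegExp source.
function compileBooleanAtom(atom) {
  const [key] = Object.keys(atom);
  const value = atom[key];
  if (key in ANCHORS) return value ? ANCHORS[key] : '';
  if (key in CLASSES) return CLASSES[key][value ? 0 : 1];
  throw new Error(`Not covered by this sketch: ${key}`);
}

console.log(JSON.stringify(compileBooleanAtom({ lineStart: false }))); // ""
console.log(compileBooleanAtom({ digit: true })); // \d
console.log(compileBooleanAtom({ digit: false })); // \D
console.log(compileBooleanAtom({ boundary: false })); // \B
```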

4.6.1. digit

  • Type: Atom
  • Equivalent: \d, \D

digit — Character class matching decimal digits.

true matches any decimal digit (0–9); false matches any non-digit.

{ digit: true } // \d
{ digit: false } // \D

4.6.2. word

  • Type: Atom
  • Equivalent: \w, \W

word — Character class matching word characters.

true matches Latin letters (A–Z, a–z), decimal digits (0–9), and underscore (_); false matches any non-word character.

{ word: true } // \w
{ word: false } // \W

4.6.3. whitespace

  • Type: Atom
  • Equivalent: \s, \S

whitespace — Character class matching whitespace characters.

true matches whitespace; false matches any non-whitespace character.

{ whitespace: true } // \s
{ whitespace: false } // \S

4.6.4. boundary

  • Type: Atom
  • Equivalent: \b, \B

boundary — Word boundary anchor.

true matches a word boundary; false matches a non-boundary position.

{ boundary: true } // \b
{ boundary: false } // \B