
6.11 Regular expression

Gauche has a built-in regular expression engine whose syntax is mostly a superset of POSIX extended regular expressions. Gauche's regexp also includes some extensions from Perl 5 regexps.

Builtin Class: <regexp>

Regular expression object. You can construct a regexp object from a string with string->regexp at run time. Gauche also has a special syntax to denote regexp literals, which constructs the regexp object at load time.

Gauche's regexp engine is fully aware of multibyte characters.

Builtin Class: <regmatch>

Regexp match object. The regexp matcher rxmatch returns this object when the match succeeds. The object contains all the information about the match, including submatches.

The advantage of using a match object, rather than substrings or a list of indices, is efficiency. The regmatch object keeps the internal state of the match, and computes indices and/or substrings only when requested. This is particularly effective for multibyte strings, since index access is slow on them.

Reader Syntax: #/regexp-spec/
Reader Syntax: #/regexp-spec/i

Denotes a literal regular expression object. When read, it becomes an instance of <regexp>.

If the letter 'i' is given at the end, the created regexp becomes a case-folding regexp, i.e. it matches case-insensitively. (The current version folds case only for ASCII characters; characters beyond ASCII are matched in the same way as in a normal match.)

The advantage of this syntax over string->regexp is that the regexp is compiled only once. You can use a literal regexp inside a loop without worrying about regexp compilation overhead. If you want to construct a regexp on the fly, however, use string->regexp.
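As a rough illustration of the two construction styles, the sketch below uses a literal for a fixed pattern and string->regexp for a pattern that is only known at run time. The helper names has-digits? and find-keyword are hypothetical, introduced only for this example; regexp-quote and the case-fold keyword are described later in this section.

```scheme
;; Fixed pattern: the literal is compiled once, when the code is read.
(define (has-digits? s)
  (if (rxmatch #/\d+/ s) #t #f))

;; Pattern known only at run time: build it with string->regexp.
;; find-keyword is a hypothetical helper, not part of Gauche.
(define (find-keyword keyword text)
  (rxmatch-substring
   (rxmatch (string->regexp (regexp-quote keyword) :case-fold #t)
            text)))

(has-digits? "abc123")          ; ⇒ #t
(find-keyword "a.b" "xxA.Byy")  ; ⇒ "A.B"
(find-keyword "a.b" "xxAxByy")  ; ⇒ #f
```

If find-keyword were called in a tight loop with the same keyword, it would probably be worth building the regexp once outside the loop.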

Gauche's built-in regexp syntax follows POSIX extended regular expressions, with a few extensions taken from Perl.

Note that the syntax described here is just a surface syntax. Gauche's regexp compiler works on an abstract syntax tree, and alternative syntaxes such as SRE will be supported in future versions.

re*

Matches zero or more repetitions of re.

re+

Matches one or more repetitions of re.

re?

Matches zero or one occurrence of re.

re{n}
re{n,}
re{n,m}

Bounded repetition. re{n} matches exactly n occurrences of re. re{n,} matches n or more occurrences of re. re{n,m} matches at least n and at most m occurrences of re, where n <= m.
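A few small cases may help to see how the bounds behave. The input strings here are made up for illustration.

```scheme
;; Exactly four digits: the lone "7" is skipped.
(rxmatch-substring (rxmatch #/\d{4}/ "October 7, 2008"))  ; ⇒ "2008"

;; At least 2 and at most 3: greedy, so it takes 3 of the 4.
(rxmatch-substring (rxmatch #/a{2,3}/ "aaaa"))            ; ⇒ "aaa"

;; 2 or more: takes all four.
(rxmatch-substring (rxmatch #/a{2,}/ "caaaat"))           ; ⇒ "aaaa"

;; No two consecutive a's in the input, so the match fails.
(rxmatch #/a{2}/ "banana")                                ; ⇒ #f
```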

re*?
re+?
re??
re{n,}?
re{n,m}?

Same as the repetition constructs above, but these use a "non-greedy" or "lazy" match strategy. That is, they try to match the minimum number of occurrences of re first, and try longer matches only if the rest of the pattern fails to match. Compare the following examples:

 
(rxmatch-substring (#/<.*>/ "<tag1><tag2><tag3>") 0)
  ⇒ "<tag1><tag2><tag3>"

(rxmatch-substring (#/<.*?>/ "<tag1><tag2><tag3>") 0)
  ⇒ "<tag1>"

(re…)

Clustering with capturing. The regular expression enclosed in parentheses works as a single re. In addition, the string that matches re … is saved as a submatch.

(?:re…)

Clustering without capturing. re … works as a single re, but the matched string isn't saved.

(?<name>re…)

Named capture and clustering. Like (re…), but gives the name name to the matched substring. You can refer to the matched substring either by its index number or by its name.

When the same name appears more than once in a regular expression, it is undefined which matched substring is returned as the submatch of the named capture.

(?i:re…)
(?-i:re…)

Lexical case sensitivity control. (?i:re…) makes re… match case-insensitively, while (?-i:re…) makes re… match case-sensitively.

Perl's regexp allows several more flags to appear between '?' and ':'. Gauche supports only the two above, for now.

pattern1|pattern2|…

Alternation. Matches any one of the patterns, where each pattern is re ….

\n

Backreference. n is an integer. Matches the substring captured by the n-th capturing group (counting from 1). When capturing groups are nested, groups are counted in the order of their opening parentheses. If the n-th capturing group is in a repetition and has matched more than once, the last matched substring is used.

\k<name>

Named backreference. Matches the substring captured by the capturing group with the name name. If the named capturing group is in a repetition and has matched more than once, the last matched substring is used. If more than one capturing group has the name name, matching succeeds if the input matches any one of the substrings captured by those groups.
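The following sketch shows numbered and named backreferences, including the "last matched substring" rule for a group inside a repetition. The inputs and the group name ch are illustrative only.

```scheme
;; \1 must repeat whatever the first group captured: a doubled letter.
(rxmatch-substring (rxmatch #/(\w)\1/ "hello"))                 ; ⇒ "ll"

;; The same idea with a named group and \k<name>.
(rxmatch-substring (rxmatch #/(?<ch>\w)\k<ch>/ "bookkeeper"))   ; ⇒ "oo"

;; The group is repeated three times ("1", "2", "3");
;; \1 refers to the last capture, "3".
(rxmatch-substring (rxmatch #/(?:(\d),)+\1/ "1,2,3,3"))         ; ⇒ "1,2,3,3"
```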

.

Matches any character (including newline).

[char-set-spec]

Matches any of the character set specified by char-set-spec. See section Character Set, for the details of char-set-spec.

\s, \d, \w

Matches a whitespace character (#[[:space:]]), a digit character (#[[:digit:]]), or a word-constituent character (#[[:alpha:][:digit:]_]), respectively.

Can be used both inside and outside of character set.

\S, \D, \W

Matches the complement character set of \s, \d and \w, respectively.

^, $

Beginning-of-string and end-of-string assertions, when they appear at the beginning or the end of the pattern, respectively.

\b, \B

Word boundary and non-word-boundary assertions, respectively. That is, \b matches an empty string between a word-constituent character and a non-word-constituent character, and \B matches an empty string elsewhere.
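To make the difference concrete, the sketch below looks for "cat" in a made-up string, once as a whole word and once anywhere. The indices are character positions counted from 0.

```scheme
(define text "concat cat now")

;; Without \b, the first hit is inside "concat".
(rxmatch-start (rxmatch #/cat/ text))       ; ⇒ 3

;; With \b on both sides, only the standalone word matches.
(rxmatch-start (rxmatch #/\bcat\b/ text))   ; ⇒ 7

;; \B requires that there is no boundary before "cat".
(rxmatch-start (rxmatch #/\Bcat/ text))     ; ⇒ 3
```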

\;
\"
\#

These are the same as ;, ", and #, respectively, and can be used to avoid confusing Emacs or other syntax-aware editors that are not familiar with Gauche's extension.

(?=pattern)
(?!pattern)

Positive/negative lookahead assertion. Match succeeds if pattern matches (or does not match) the input string from the current position, but this doesn't move the current position itself, so that the following regular expression is applied again from the current position.

For example, the following expression matches strings that might be a phone number, except the numbers in Japan (i.e. ones whose country code, right after the "+", is "81").

 
\+(?!81)\d{9,}

(?<=pattern)
(?<!pattern)

Positive/negative lookbehind assertion.
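As a small sketch of both kinds of assertion, the first two expressions apply the phone-number pattern above, and the last one uses a lookbehind to pick up digits only when a dollar sign precedes them. The phone numbers and the price string are invented for illustration.

```scheme
;; Negative lookahead: "+81..." is rejected, other prefixes pass.
(rxmatch-substring (rxmatch #/\+(?!81)\d{9,}/ "+811234567890"))  ; ⇒ #f
(rxmatch-substring (rxmatch #/\+(?!81)\d{9,}/ "+14155550100"))   ; ⇒ "+14155550100"

;; Positive lookbehind: the "$" must be there, but it is not
;; part of the matched substring.
(rxmatch-substring (rxmatch #/(?<=\$)\d+/ "total: $42"))         ; ⇒ "42"
```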

re*+
re++
re?+

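They are the same as (?>re*), (?>re+), (?>re?), respectively. That is, they are possessive (atomic) repetitions: once the repetition has matched, the matcher does not backtrack into it.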
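The effect is easiest to see on a pattern that needs backtracking to succeed. This is a minimal sketch assuming the usual atomic-group semantics of (?>…).

```scheme
;; Ordinary greedy a* first takes "aaa", then gives one "a" back
;; so that the trailing a can match.
(rxmatch-substring (rxmatch #/a*a/ "aaa"))   ; ⇒ "aaa"

;; Possessive a*+ takes every "a" and never gives any back,
;; so the trailing a has nothing left to match.
(rxmatch #/a*+a/ "aaa")                      ; ⇒ #f
```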

Function: string->regexp string &keyword case-fold

Takes string as a regexp specification, and constructs an instance of <regexp> object.

If a true value is given to the keyword argument case-fold, the created regexp object becomes a case-folding regexp. (See the explanation of case-folding regexps above.)

Function: regexp? obj

Returns true iff obj is a regexp object.

Function: regexp->string regexp

Returns a source string describing the regexp regexp. The returned string is immutable.

Function: rxmatch regexp string

Regexp is a regular expression object. Matches the string string against regexp. If it matches, the function returns a <regmatch> object. Otherwise it returns #f.

This is called match, regexp-search or string-match in some other Scheme implementations.

Generic application: regexp string

A regular expression object can be applied directly to a string. This works the same as (rxmatch regexp string), but allows a shorter notation. See section Applicable objects, for the generic mechanism used to implement this.
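For instance, a regexp literal can sit in the operator position of a call. The predicate name numeric-string? below is hypothetical, chosen only for this sketch.

```scheme
;; Equivalent to (rxmatch #/^\d+$/ s), then normalized to a boolean.
(define (numeric-string? s)
  (if (#/^\d+$/ s) #t #f))

(map numeric-string? '("12" "ab" "345"))   ; ⇒ (#t #f #t)

;; The result is an ordinary match object (or #f).
(rxmatch-substring (#/\d+/ "foo314bar"))   ; ⇒ "314"
```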

Function: rxmatch-start match &optional (i 0)
Function: rxmatch-end match &optional (i 0)
Function: rxmatch-substring match &optional (i 0)

Match is a match object returned by rxmatch. If i is zero, the functions return the start index, the end index, or the substring of the entire match, respectively. With a positive integer i, they return those of the i-th submatch. It is an error to pass other values to i.

For convenience, it is allowed to pass #f to match. The functions return #f in that case.

These functions correspond to scsh's match:start, match:end and match:substring.
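A short sketch of the three accessors on one match object follows. The input string is made up, and the indices are character positions counted from 0.

```scheme
(define m (rxmatch #/(\d+)-(\d+)/ "tel 555-1234"))

(rxmatch-start m)         ; ⇒ 4     (entire match starts at "5")
(rxmatch-end m)           ; ⇒ 12    (one past the last character)
(rxmatch-substring m)     ; ⇒ "555-1234"

(rxmatch-start m 2)       ; ⇒ 8
(rxmatch-end m 1)         ; ⇒ 7
(rxmatch-substring m 2)   ; ⇒ "1234"

;; #f is accepted in place of a match object.
(rxmatch-substring (rxmatch #/xyz/ "tel 555-1234"))   ; ⇒ #f
```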

Function: rxmatch-num-matches match

Returns the number of matches in match. The number includes the "whole match", so it is always a positive integer for a <regmatch> object. The number also includes the submatches that don't have a value (see the examples below).

For convenience, rxmatch-num-matches returns 0 if match is #f.

 
(rxmatch-num-matches (rxmatch #/abc/ "abc"))
  ⇒ 1

(rxmatch-num-matches (rxmatch #/(a(.))|(b(.))/ "ba"))
  ⇒ 5

(rxmatch-num-matches #f)
  ⇒ 0

Function: rxmatch-after match &optional (i 0)
Function: rxmatch-before match &optional (i 0)

Returns the substring of the input string after or before the match. If the optional argument i is given, the i-th submatch is used (the 0-th submatch is the entire match).

 
(define match (rxmatch #/(\d+)\.(\d+)/ "pi=3.14..."))

(rxmatch-after match) ⇒ "..."
(rxmatch-after match 1) ⇒ ".14..."

(rxmatch-before match) ⇒ "pi="
(rxmatch-before match 2) ⇒ "pi=3."

Function: rxmatch->string regexp string &optional selector …

A convenience procedure that matches string against the given regexp, then returns the matched substring, or #f if it doesn't match.

If no selector is given, it is the same as this:

 
(rxmatch-substring (rxmatch regexp string))

If an integer is given as a selector, it returns the substring of the numbered submatch.

If a symbol after or before is given, it returns the substring after or before the match. You can give these symbols and an integer to extract a substring before or after the numbered submatch.

 
gosh> (rxmatch->string #/\d+/ "foo314bar")
"314"
gosh> (rxmatch->string #/(\w+)@([\w.]+)/ "foo@example.com" 2)
"example.com"
gosh> (rxmatch->string #/(\w+)@([\w.]+)/ "foo@example.com" 'before 2)
"foo@"

Generic application: regmatch &optional index
Generic application: regmatch 'before &optional index
Generic application: regmatch 'after &optional index

A regmatch object can be applied directly to an integer index, or to a symbol before or after, optionally followed by an index. These work the same as (rxmatch-substring regmatch index), (rxmatch-before regmatch index), and (rxmatch-after regmatch index), respectively. This allows a shorter notation. See section Applicable objects, for the generic mechanism used to implement this.

 
(define match (#/(\d+)\.(\d+)/ "pi=3.14..."))

  (match)           ⇒ "3.14"
  (match 1)         ⇒ "3"
  (match 2)         ⇒ "14"

  (match 'after)    ⇒ "..."
  (match 'after 1)  ⇒ ".14..."

  (match 'before)   ⇒ "pi="
  (match 'before 2) ⇒ "pi=3."

(define match (#/(?<integer>\d+)\.(?<fraction>\d+)/ "pi=3.14..."))

  (match 1)         ⇒ "3"
  (match 2)         ⇒ "14"

  (match 'integer)  ⇒ "3"
  (match 'fraction) ⇒ "14"

  (match 'after 'integer)   ⇒ ".14..."
  (match 'before 'fraction) ⇒ "pi=3."

Function: regexp-replace regexp string substitution
Function: regexp-replace-all regexp string substitution

Replaces the part of string that matches regexp with substitution. regexp-replace replaces only the first match of regexp, while regexp-replace-all repeats the replacement throughout the entire string.

substitution may be a string or a procedure. If it is a string, it can contain references to the submatches by digits preceded by a backslash (e.g. \2) or by named submatch references (e.g. \k<name>). \0 refers to the entire match. Note that you need to write two backslashes in a string literal to produce one backslash character; to include a literal backslash character in the substituted text, you need four backslashes.

 
(regexp-replace #/def|DEF/ "abcdefghi" "...")
  ⇒ "abc...ghi"
(regexp-replace #/def|DEF/ "abcdefghi" "|\\0|")
  ⇒ "abc|def|ghi"
(regexp-replace #/def|DEF/ "abcdefghi" "|\\\\0|")
  ⇒ "abc|\\0|ghi"
(regexp-replace #/c(.*)g/ "abcdefghi" "|\\1|")
  ⇒ "ab|def|hi"
(regexp-replace #/c(?<match>.*)g/ "abcdefghi" "|\\k<match>|")
  ⇒ "ab|def|hi"

If substitution is a procedure, it is called for every match in string with one argument, a regexp match object. The value returned from the procedure is inserted into the output string using display.

 
(regexp-replace #/c(.*)g/ "abcdefghi" 
                (lambda (m)
                  (list->string
                   (reverse
                    (string->list (rxmatch-substring m 1))))))
 ⇒ "abfedhi"

Note: regexp-replace-all applies itself recursively to the remainder of the string after each match. So a beginning-of-string assertion in regexp can match not only at the beginning of the input string, but also at the beginning of each remainder.
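The sketch below shows regexp-replace-all with a string substitution and with a procedure substitution. Since the procedure's return value is written with display, returning a number works as well as returning a string. The inputs are invented for illustration.

```scheme
;; String substitution: collapse every run of whitespace.
(regexp-replace-all #/\s+/ "a  b   c" " ")
;; ⇒ "a b c"

;; Procedure substitution: double every number in the string.
;; The procedure returns a number, which is inserted via display.
(regexp-replace-all #/\d+/ "a1b22"
                    (lambda (m)
                      (* 2 (string->number (rxmatch-substring m)))))
;; ⇒ "a2b44"
```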

Function: regexp-replace* string rx1 sub1 rx2 sub2 …
Function: regexp-replace-all* string rx1 sub1 rx2 sub2 …

First applies regexp-replace or regexp-replace-all to string with a regular expression rx1 substituting for sub1, then applies the function on the result string with a regular expression rx2 substituting for sub2, and so on. These functions are handy when you want to apply multiple substitutions sequentially on a string.
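A minimal sketch of the chained forms follows; the input strings are made up. Note that the string comes first, unlike in regexp-replace.

```scheme
;; Each regexp/substitution pair is applied to the result of the
;; previous one; regexp-replace-all* replaces every occurrence.
(regexp-replace-all* "2008-10-07 12:30" #/-/ "/" #/:/ "h")
;; ⇒ "2008/10/07 12h30"

;; regexp-replace* replaces only the first match of each regexp.
(regexp-replace* "a-b-c:d:e" #/-/ "+" #/:/ "=")
;; ⇒ "a+b-c=d:e"
```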

Function: regexp-quote string

Returns a string with the characters that are special to regexp escaped.

 
(regexp-quote "[2002/10/12] touched foo.h and *.c")
 ⇒ "\\[2002/10/12\\] touched foo\\.h and \\*\\.c"

In the following macros, match-expr is an expression which produces a match object or #f. Typically it is a call of rxmatch, but it can be any expression.

Macro: rxmatch-let match-expr (var …) form …

Evaluates match-expr, and if matched, binds var … to the matched strings, then evaluates forms. The first var receives the entire match, and subsequent variables receive submatches. If the number of submatches is smaller than the number of variables to receive them, the rest of the variables get #f.

It is possible to put #f in a variable position, which indicates that you don't need that match.

 
(rxmatch-let (rxmatch #/(\d+):(\d+):(\d+)/
                      "Jan  1 23:59:58, 2001")
   (time hh mm ss)
  (list time hh mm ss))
 ⇒ ("23:59:58" "23" "59" "58")

(rxmatch-let (rxmatch #/(\d+):(\d+):(\d+)/
                      "Jan  1 23:59:58, 2001")
   (#f hh mm)
  (list hh mm))
 ⇒ ("23" "59")

This macro corresponds to scsh's let-match.

Macro: rxmatch-if match-expr (var …) then-form else-form

Evaluates match-expr, and if matched, binds var … to the matched strings and evaluates then-form. Otherwise evaluates else-form. The rule for binding vars is the same as in rxmatch-let.

 
(rxmatch-if (rxmatch #/(\d+:\d+)/ "Jan 1 11:22:33")
    (time)
  (format #f "time is ~a" time)
  "unknown time")
 ⇒ "time is 11:22"

(rxmatch-if (rxmatch #/(\d+:\d+)/ "Jan 1 11-22-33")
    (time)
  (format #f "time is ~a" time)
  "unknown time")
 ⇒ "unknown time"

This macro corresponds to scsh's if-match.

Macro: rxmatch-cond clause …

Evaluates the condition of each clause one by one. If the condition of a clause is satisfied, the rest of the clause is evaluated and becomes the result of rxmatch-cond. A clause may be one of the following patterns.

(match-expr (var …) form …)

Evaluates match-expr, which may return a regexp match object or #f. If it returns a match object, the matches are bound to vars, as in rxmatch-let, and forms are evaluated.

(test expr form …)

Evaluates expr. If it yields true, evaluates forms.

(test expr => proc)

Evaluates expr and if it is true, calls proc with the result of expr as the only argument.

(else form …)

If this clause exists, it must be the last clause. If other clauses fail, forms are evaluated.

If no else clause exists, and all the other clauses fail, an undefined value is returned.

 
;; parses several possible date formats
(define (parse-date str)
  (rxmatch-cond
    ((rxmatch #/^(\d\d?)\/(\d\d?)\/(\d\d\d\d)$/ str)
        (#f mm dd yyyy)
      (map string->number (list yyyy mm dd)))
    ((rxmatch #/^(\d\d\d\d)\/(\d\d?)\/(\d\d?)$/ str)
        (#f yyyy mm dd)
      (map string->number (list yyyy mm dd)))
    ((rxmatch #/^\d+\/\d+\/\d+$/ str)
        (#f)
     (errorf "ambiguous: ~s" str))
    (else (errorf "bogus: ~s" str))))

(parse-date "2001/2/3") ⇒ (2001 2 3)
(parse-date "12/25/1999") ⇒ (1999 12 25)

This macro corresponds to scsh's match-cond.
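The parse-date example uses only match clauses and else. The sketch below adds the two test forms. The names *known-counts* and parse-count are hypothetical, introduced only for this illustration.

```scheme
;; Hypothetical lookup table for spelled-out numbers.
(define *known-counts* '(("one" . 1) ("two" . 2)))

(define (parse-count str)
  (rxmatch-cond
    ;; Match clause: bind the digits and convert them.
    ((rxmatch #/^(\d+)$/ str)
        (#f digits)
      (string->number digits))
    ;; test with =>: the pair found by assoc is passed to cdr.
    (test (assoc str *known-counts*) => cdr)
    ;; Plain test clause.
    (test (string=? str "") 0)
    (else #f)))

(parse-count "42")    ; ⇒ 42
(parse-count "two")   ; ⇒ 2
(parse-count "")      ; ⇒ 0
(parse-count "many")  ; ⇒ #f
```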

Macro: rxmatch-case string-expr clause …

String-expr is evaluated, and clauses are interpreted one by one. A clause may be one of the following patterns.

(re (var …) form …)

Re must be a literal regexp object (see the reader syntax described earlier in this section). If the result of string-expr matches re, the match result is bound to vars and forms are evaluated, and rxmatch-case returns the result of the last form.

If re doesn't match the result of string-expr, or string-expr yields a non-string value, the interpretation proceeds to the next clause.

(test proc form …)

A procedure proc is applied to the result of string-expr. If it yields a true value, forms are evaluated, and rxmatch-case returns the result of the last form.

If proc yields #f, the interpretation proceeds to the next clause.

(test proc => proc2)

A procedure proc is applied to the result of string-expr. If it yields a true value, proc2 is applied to that value, and its result is returned as the result of rxmatch-case.

If proc yields #f, the interpretation proceeds to the next clause.

(else form …)

If this clause exists, it must be the last clause. If the other clauses fail, forms are evaluated, and the result of the last form becomes the result of rxmatch-case.

If no else clause exists, and all the other clauses fail, an undefined value is returned.

The parse-date example above becomes simpler if you use rxmatch-case:

 
(define (parse-date2 str)
  (rxmatch-case str
    (test (lambda (s) (not (string? s))) #f)
    (#/^(\d\d?)\/(\d\d?)\/(\d\d\d\d)$/ (#f mm dd yyyy)
     (map string->number (list yyyy mm dd)))
    (#/^(\d\d\d\d)\/(\d\d?)\/(\d\d?)$/ (#f yyyy mm dd)
     (map string->number (list yyyy mm dd)))
    (#/^\d+\/\d+\/\d+$/                (#f)
     (errorf "ambiguous: ~s" str))
    (else (errorf "bogus: ~s" str))))


This document was generated by Shiro Kawai on October, 7 2008 using texi2html 1.78.