Regular Expressions - User guide
Regular expressions are a codified method of searching, invented
(or defined) by the American mathematician Stephen Kleene.
The syntax (language format) described here is compliant with the
extended regular expressions (EREs) defined in IEEE POSIX
1003.2 (Section 2.8). EREs are now commonly supported by
Apache, PHP4, Javascript 1.3+, MS Visual Studio, MS Frontpage, most
visual editors, vi, emacs, the GNU family of tools (including grep,
awk and sed) as well as many others. Basic Regular Expressions
(BREs) offer essentially a subset of ERE functionality, although
some metacharacters are written differently in BRE syntax. Most
applications, utilities and languages that implement regular
expressions extend the capabilities defined in the standard, so the
appropriate documentation should always be consulted.
Contents
A Gentle Introduction - the Basics
Apache browser recognition - a worked example
POSIX Standard Character Classes
Commonly Available extensions
Submatches, Groups and Backreferences
Regular Expression Tester - Experiment with your
own strings and expressions in your browser
Notes - general notes when using utilities and
languages
Utility notes - using Visual Studio regular
expressions
Utility notes - using sed for file
manipulation
A Gentle Introduction - the Basics
The title is a bit of a misnomer - there is no gentle beginning
for regular expressions. You are either into hieroglyphics big time
- in which case you will love this stuff - or you need to use them,
in which case a headache may be your only reward.
Some Definitions before we start
We are going to be using the terms literal,
metacharacter, escape sequence, target string
and search expression in this overview. Here is a definition of
our terms:
literal: A literal is any character we use in a search or
matching expression, for example, to find ind in
windows the ind is a literal string - each
character plays a part in the search, it is literally the
string we want to find.
metacharacter: A metacharacter is one or more special characters that
have a unique meaning and are NOT used as literals in the
search expression, for example, the character ^ (circumflex or
caret) is a metacharacter.
escape sequence: An escape sequence is a way of indicating that we want
to use one of our metacharacters as a literal. In a
regular expression an escape sequence involves placing the
metacharacter \ (backslash) in front of the
metacharacter that we want to use as a literal, for
example, if we want to find ^ind in w^indow then we
use the search expression \^ind and if we want to find
\\file in the string c:\\file then we would need to
use the search expression \\\\file (each \ we want to search for
(a literal) is preceded by an escape sequence \).
target string: We have chosen to use this term to describe the string that we
will be searching, that is, the string in which we want to find our
match or search pattern.
search expression: We have chosen to use this term to describe the expression that
we will be using to search our target string, that is, the pattern
we use to find what we want.
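The escape sequence definition above can be tried out with a rough sketch like the following. It assumes a POSIX shell and a grep that supports the -E (extended regular expression) option; the helper name try is our own invention for illustration and is not part of any tool.

```shell
# try PATTERN TARGET: report whether the ERE PATTERN matches TARGET
# (try is a hypothetical helper; LC_ALL=C keeps matching byte-based)
try() {
  if printf '%s\n' "$2" | LC_ALL=C grep -E -q -e "$1"; then
    echo "match"
  else
    echo "no match"
  fi
}

try '^ind' 'w^indow'        # unescaped ^ is a metacharacter: no match
try '\^ind' 'w^indow'       # escaped ^ is a literal: match
try '\\\\file' 'c:\\file'   # each literal \ needs its own escape: match
```

The single quotes stop the shell from interpreting the backslashes before grep sees them.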
Our Example Target Strings
Throughout this guide we will use the following as our target
strings:
STRING1 Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)
STRING2 Mozilla/4.75 [en](X11;U;Linux2.2.16-22 i586)
These are Browser ID Strings and appear as the Apache
environment variable HTTP_USER_AGENT (full list
of Apache environment variables).
Simple Matching
We are going to try some simple matching against our example
target strings:
Note: You can also experiment as you go through the examples.
Search for: m
- STRING1: match. Finds the m in compatible.
- STRING2: no match. There is no lower case m in this string. Searches are case sensitive unless you take special action.
Search for: a/4
- STRING1: match. Found in Mozilla/4.0 - any combination of characters can be used for the match.
- STRING2: match. Found in the same place as in STRING1.
Search for: 5 [
- STRING1: no match. The search is looking for a pattern of '5 [' and this does NOT exist in STRING1. Spaces are valid in searches.
- STRING2: match. Found in Mozilla/4.75 [en]. (Strictly, [ is a metacharacter, introduced in the next section, so most tools need this search written as 5 \[ to treat the bracket as a literal.)
Search for: in
- STRING1: match. Found in Windows.
- STRING2: match. Found in Linux.
Search for: le
- STRING1: match. Found in compatible.
- STRING2: no match. There is an l and an e in this string but they are not adjacent (or contiguous).
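If you prefer a command line to the browser, the simple matches above can be reproduced with a sketch along these lines. It assumes a POSIX shell and grep -E; the helper try is a made-up name for illustration, and the [ is escaped because grep treats an unescaped [ as the start of a bracket expression.

```shell
STRING1='Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)'
STRING2='Mozilla/4.75 [en](X11;U;Linux2.2.16-22 i586)'

# try PATTERN TARGET: report whether the ERE PATTERN matches TARGET
# (hypothetical helper, not part of grep)
try() {
  if printf '%s\n' "$2" | LC_ALL=C grep -E -q -e "$1"; then
    echo "match"
  else
    echo "no match"
  fi
}

try 'm' "$STRING1"      # match (compatible)
try 'm' "$STRING2"      # no match (searches are case sensitive)
try 'a/4' "$STRING2"    # match
try '5 \[' "$STRING1"   # no match
try '5 \[' "$STRING2"   # match (4.75 [en])
try 'le' "$STRING2"     # no match (l and e are not adjacent)
```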
Brackets, Ranges and Negation
Bracket expressions introduce our first metacharacters,
in this case the square brackets, which allow us to define a list of
things to test for rather than the single characters we have been
checking up until now. These lists can be grouped into what are
known as Character Classes, typically comprising well-known groups
such as all digits.
Metacharacter: [ ]
Meaning: Match anything inside the square brackets for one character
position once and only once, for example, [12] means match the
target to either 1 or 2 while [0123456789] means match to any
character in the range 0 to 9.
Metacharacter: -
Meaning: The - (dash) inside square brackets is the 'range
separator' and allows us to define a range, in our example above of
[0123456789] we could rewrite it as [0-9].
You can define more than one range inside a list e.g. [0-9A-C]
means check for 0 to 9 and A to C (but not a to c).
NOTE: To test for - inside brackets (as a literal)
it must come first or last, that is, [-0-9] will test for - and 0
to 9.
Metacharacter: ^
Meaning: The ^ (circumflex or caret) inside square brackets
negates the expression (we will see an alternate use for the
circumflex/caret outside square brackets later), for
example, [^Ff] means anything except upper or lower case F and
[^a-z] means everything except lower case a to z.
NOTE: Spaces, or in this case the lack of them, between
ranges are very important.
NOTE: There are some special range values (Character Classes) that are built in to
most regular expression software and have to be if it claims POSIX
1003.2 compliance for either BRE or ERE.
So let's try this new stuff with our target strings.
Search for: in[du]
- STRING1: match. Finds ind in Windows.
- STRING2: match. Finds inu in Linux.
Search for: x[0-9A-Z]
- STRING1: no match. Again the tests are case sensitive: to find the xt in DigExt we would need to use x[0-9a-z] or x[0-9A-Zt]. We can also use this format for testing upper and lower case e.g. [Ff] will check for lower and upper case F.
- STRING2: match. Finds x2 in Linux2.
Search for: [^A-M]in
- STRING1: match. Finds Win in Windows.
- STRING2: no match. We have excluded the range A to M in our search so Linux is not found but linux (if it were present) would be found.
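The bracket searches above can be checked with a sketch like this one, again assuming a POSIX shell and grep -E. The helper try is a hypothetical name. LC_ALL=C is set so that ranges such as A-Z follow plain ASCII order, since some locales sort upper and lower case letters together.

```shell
STRING1='Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)'
STRING2='Mozilla/4.75 [en](X11;U;Linux2.2.16-22 i586)'

# try PATTERN TARGET: report whether the ERE PATTERN matches TARGET
# (hypothetical helper; LC_ALL=C makes ranges follow ASCII order)
try() {
  if printf '%s\n' "$2" | LC_ALL=C grep -E -q -e "$1"; then
    echo "match"
  else
    echo "no match"
  fi
}

try 'in[du]' "$STRING1"      # match (ind in Windows)
try 'in[du]' "$STRING2"      # match (inu in Linux)
try 'x[0-9A-Z]' "$STRING1"   # no match (xt has a lower case t)
try 'x[0-9a-z]' "$STRING1"   # match
try 'x[0-9A-Z]' "$STRING2"   # match (x2 in Linux2)
try '[^A-M]in' "$STRING1"    # match (Win)
try '[^A-M]in' "$STRING2"    # no match (L is inside A to M)
```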
Positioning (or Anchors)
We can control where in our target strings the matches are
valid. The following is a list of metacharacters that affect
the position of the search:
Metacharacter: ^
Meaning: The ^ (circumflex or caret) outside square brackets
means look only at the beginning of the target string, for example,
^Win will not find Windows in STRING1 but ^Moz
will find Mozilla.
Metacharacter: $
Meaning: The $ (dollar) means look only at the end of the target string,
for example, fox$ will find a match in 'silver fox' but not in
'the fox jumped over the moon'. The $ must follow the characters it
anchors: $fox is NOT equivalent.
Metacharacter: .
Meaning: The . (period) means any single character in this position, for
example, ton. will find tons and tonneau but
not wanton because it has no following character.
NOTE: Many systems and utilities, but not all, support
special positioning macros, for example \< match at beginning of
word, \> match at end of word, \b match at the beginning OR end
of word, \B match except at the beginning or end of a word. List
of the common values.
So let's try this lot out with our example target strings.
Search for: [a-z]\)$
- STRING1: match. Finds t) in DigExt). The \ escapes the ) so that it is treated as a literal.
- STRING2: no match. We have a numeric value at the end of this string but we would need [0-9a-z]\)$ to find it.
Search for: .in
- STRING1: match. Finds Win in Windows.
- STRING2: match. Finds Lin in Linux.
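The anchors can be exercised from a shell with a sketch such as the following, which assumes a POSIX shell and grep -E. The helper try is a hypothetical name of our own. The last line illustrates that a $ placed before the text it is meant to anchor never matches.

```shell
STRING1='Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)'
STRING2='Mozilla/4.75 [en](X11;U;Linux2.2.16-22 i586)'

# try PATTERN TARGET: report whether the ERE PATTERN matches TARGET
# (hypothetical helper, not part of grep)
try() {
  if printf '%s\n' "$2" | LC_ALL=C grep -E -q -e "$1"; then
    echo "match"
  else
    echo "no match"
  fi
}

try '^Win' "$STRING1"          # no match (Win does not start the string)
try '^Moz' "$STRING1"          # match
try '[a-z]\)$' "$STRING1"      # match (t) at the end)
try '[a-z]\)$' "$STRING2"      # no match (string ends 6) )
try '[0-9a-z]\)$' "$STRING2"   # match
try 'fox$' 'silver fox'        # match
try 'fox$' 'the fox jumped over the moon'   # no match
try '$fox' 'silver fox'        # no match
```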
Iteration 'metacharacters'
The following is a set of iteration metacharacters
(a.k.a. quantifiers) that can control the number of times a
character or string is found in our searches.
Metacharacter: ?
Meaning: The ? (question mark) matches the preceding character 0 or 1
times only, for example, colou?r will find both color and
colour.
Metacharacter: *
Meaning: The * (asterisk or star) matches the preceding character 0 or
more times, for example, tre* will find tree and tread and
trough.
Metacharacter: +
Meaning: The + (plus) matches the previous character 1 or more times, for
example, tre+ will find tree and tread but not trough.
Metacharacter: {n}
Meaning: Matches the preceding character n times exactly, for example, to
find a local phone number we could use [0-9]{3}-[0-9]{4} which
would find any number of the form 123-4567.
Note: The - (dash) in this case, because it is outside
the square brackets, is a literal. The value is enclosed in
braces (curly brackets).
Metacharacter: {n,m}
Meaning: Matches the preceding character at least n times but not more
than m times, for example, 'ba{2,3}b' will find 'baab' and 'baaab'
but NOT 'bab' or 'baaaab'. Values are enclosed in braces (curly
brackets).
So let's try them out with our example target strings.
Search for: \(.*l
- STRING1: match. Finds l in compatible (Note: The opening \ is an escape sequence used to indicate the ( it precedes is a literal not a metacharacter.)
- STRING2: no match. Mozilla contains ll but it is not preceded by an open parenthesis (no match) and Linux has an upper case L (no match).
Note: We had previously defined the above test using the search value
l? (thanks to David Werner Wiebe for pointing out our
error). The search expression l? actually means find anything, even
if it has no l (l 0 or 1 times), so would match on both strings. We
had been looking for a method to find a single l and exclude ll
which, without lookahead (a relatively new extension to regular
expressions pioneered by PERL) is pretty difficult. Well that is
our excuse.
Search for: W*in
- STRING1: match. Finds the Win in Windows.
- STRING2: match. Finds in in Linux preceded by W zero times - so a match.
Search for: [xX][0-9a-z]{2}
- STRING1: no match. Finds x in DigExt but it is followed by only one t.
- STRING2: match. Finds X and 11 in X11.
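The iteration metacharacters can be tried with a sketch like the one below. It assumes a POSIX shell and grep -E (which understands ?, + and braces without extra escaping); the helper try is a hypothetical name for illustration.

```shell
STRING1='Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)'
STRING2='Mozilla/4.75 [en](X11;U;Linux2.2.16-22 i586)'

# try PATTERN TARGET: report whether the ERE PATTERN matches TARGET
# (hypothetical helper, not part of grep)
try() {
  if printf '%s\n' "$2" | LC_ALL=C grep -E -q -e "$1"; then
    echo "match"
  else
    echo "no match"
  fi
}

try 'colou?r' 'color'                 # match
try 'colou?r' 'colour'                # match
try 'tre*' 'trough'                   # match (tr followed by zero e)
try 'tre+' 'trough'                   # no match (needs at least one e)
try '[0-9]{3}-[0-9]{4}' '123-4567'    # match
try 'ba{2,3}b' 'bab'                  # no match
try 'ba{2,3}b' 'baaab'                # match
try 'ba{2,3}b' 'baaaab'               # no match
try '\(.*l' "$STRING1"                # match
try '\(.*l' "$STRING2"                # no match
try 'W*in' "$STRING2"                 # match (W zero times)
try '[xX][0-9a-z]{2}' "$STRING1"      # no match
try '[xX][0-9a-z]{2}' "$STRING2"      # match (X11)
```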
More 'metacharacters'
The following is a set of additional metacharacters that
provide added power to our searches:
Metacharacter: ( )
Meaning: The ( (open parenthesis) and ) (close parenthesis) may be used
to group (or bind) parts of our search expression together -
see this example.
Metacharacter: |
Meaning: The | (vertical bar or pipe) is called alternation in
techspeak and means find the left hand OR right hand values, for
example, gr(a|e)y will find 'gray' or 'grey'.
<humblepie> In our examples, we blew this
expression ^([M-Z]in), we incorrectly stated that this would negate
the tests [M-Z], the '^' only performs this function inside
square brackets, here it is outside the square brackets and
is an anchor indicating 'start from first character'. The
corrected section is here. Many thanks to Mirko Stojanovic for pointing it
out and apologies to one and all.</humblepie>
So let's try these out with our example strings.
Search for: ^([M-Z]in)
- STRING1: no match. The '^' is an anchor indicating first position. Win does not start the string so no match.
- STRING2: no match. The '^' is an anchor indicating first position. Linux does not start the string so no match.
Search for: ((4\.[0-3])|(2\.[0-3]))
- STRING1: match. Finds the 4.0 in Mozilla/4.0.
- STRING2: match. Finds the 2.2 in Linux2.2.16-22.
Search for: (W|L)in
- STRING1: match. Finds Win in Windows.
- STRING2: match. Finds Lin in Linux.
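Grouping and alternation can be checked with the following sketch, which assumes a POSIX shell and grep -E. The helper try is a hypothetical name and the word gruy is just a made-up non-matching sample.

```shell
STRING1='Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)'
STRING2='Mozilla/4.75 [en](X11;U;Linux2.2.16-22 i586)'

# try PATTERN TARGET: report whether the ERE PATTERN matches TARGET
# (hypothetical helper, not part of grep)
try() {
  if printf '%s\n' "$2" | LC_ALL=C grep -E -q -e "$1"; then
    echo "match"
  else
    echo "no match"
  fi
}

try 'gr(a|e)y' 'gray'                     # match
try 'gr(a|e)y' 'grey'                     # match
try 'gr(a|e)y' 'gruy'                     # no match
try '^([M-Z]in)' "$STRING1"               # no match (^ is an anchor here)
try '((4\.[0-3])|(2\.[0-3]))' "$STRING1"  # match (4.0)
try '((4\.[0-3])|(2\.[0-3]))' "$STRING2"  # match (2.2)
try '(W|L)in' "$STRING1"                  # match (Win)
try '(W|L)in' "$STRING2"                  # match (Lin)
```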
More Stuff
For more information on regular expressions go to our links pages under
Languages/regex. There are lots of folks who get a real buzz out of
making any search a 'one liner' and they are incredibly helpful at
telling you how they did it. Welcome to the wonderful, if arcane,
world of Regular Expressions. You may want to play around with your
new found knowledge using this tool.
POSIX Character Class Definitions
POSIX 1003.2 section 2.8.3.2 (6) defines a set of character
classes that denote certain common ranges. They tend to look very
ugly but have the advantage that they also take into account the
'locale', that is, any variant of the local
language/coding system. Many utilities/languages provide
short-hand ways of invoking these classes. Strictly the
names used and hence their contents reference the LC_CTYPE POSIX
definition (1003.2 section 2.5.2.1).
Value: [:digit:]
Meaning: Only the digits 0 to 9.
Value: [:alnum:]
Meaning: Any alphanumeric character 0 to 9 OR A to Z or a to z.
Value: [:alpha:]
Meaning: Any alpha character A to Z or a to z.
Value: [:blank:]
Meaning: Space and TAB characters only.
Value: [:xdigit:]
Meaning: Hexadecimal notation 0-9, A-F, a-f.
Value: [:punct:]
Meaning: Punctuation symbols . , " ' ? ! ; : # $ % & ( ) * + - /
< > = @ [ ] \ ^ _ { } | ~
Value: [:print:]
Meaning: Any printable character (including space).
Value: [:space:]
Meaning: Any whitespace characters (space, tab, NL, FF, VT, CR). Many
systems abbreviate as \s.
Value: [:graph:]
Meaning: Any printable character excluding space, that is, letters,
digits and punctuation. There is no exact short form; \S (any
non-whitespace character) is the nearest.
Value: [:upper:]
Meaning: Any alpha character A to Z.
Value: [:lower:]
Meaning: Any alpha character a to z.
Value: [:cntrl:]
Meaning: Control Characters NL (LF) CR TAB VT FF NUL SOH STX ETX EOT ENQ
ACK BEL BS SO SI DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC IS1 IS2
IS3 IS4 DEL.
These are always used inside square brackets in the form
[[:alnum:]] or combined as [[:digit:]a-d]
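As a small sketch of the double-bracket form, the following assumes a POSIX shell and grep -E and uses the two example target strings from earlier in this guide. The helper try is a hypothetical name of our own.

```shell
STRING1='Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)'
STRING2='Mozilla/4.75 [en](X11;U;Linux2.2.16-22 i586)'

# try PATTERN TARGET: report whether the ERE PATTERN matches TARGET
# (hypothetical helper, not part of grep)
try() {
  if printf '%s\n' "$2" | LC_ALL=C grep -E -q -e "$1"; then
    echo "match"
  else
    echo "no match"
  fi
}

try '[[:upper:]][[:digit:]]{2}' "$STRING2"      # match (X11)
try '[[:upper:]][[:digit:]]{2}' "$STRING1"      # no match
try '[[:space:]][[:punct:]]' "$STRING1"         # match (space then open parenthesis)
try '^[[:alpha:]]+/[[:digit:]a-d]' "$STRING1"   # match (Mozilla/4)
```

The outer brackets form the normal bracket expression and the inner [: :] names the class, which is why a class can sit alongside ordinary ranges as in the last line.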
Common Extensions and Abbreviations
Some utilities and most languages provide extensions or
abbreviations to simplify(!) regular expressions. These tend to
fall into Character Classes or position extensions and the most
common are listed below. In general these extensions are defined by
PERL and implemented in what is called PCRE (Perl Compatible
Regular Expressions) which has been implemented in the form of a
library that has been ported to many systems. Full details
of PCRE. PERL 5.8.8 regular expression
documentation.
While the \x type syntax can look initially confusing, the
backslash precedes a character that does not normally need escaping
and hence can be interpreted correctly by the utility or language -
whereas we simple humans tend to become confused more easily. The
following are supported by: .NET, PHP, PERL, RUBY, PYTHON,
Javascript as well as many others.
Character Class Abbreviations
Abbreviation: \d
Meaning: Match any character in the range 0 - 9 (equivalent of POSIX
[[:digit:]]).
Abbreviation: \D
Meaning: Match any character NOT in the range 0 - 9 (equivalent of POSIX
[^[:digit:]]).
Abbreviation: \s
Meaning: Match any whitespace character (space, tab etc.) (equivalent
of POSIX [[:space:]] EXCEPT that some implementations do not recognize VT).
Abbreviation: \S
Meaning: Match any character NOT whitespace (space, tab etc.) (equivalent of
POSIX [^[:space:]]).
Abbreviation: \w
Meaning: Match any character in the range 0 - 9, A - Z and a - z plus
the underscore _ (equivalent of POSIX [[:alnum:]_]).
Abbreviation: \W
Meaning: Match any character NOT in the range 0 - 9, A - Z and a - z and
not the underscore _ (equivalent of POSIX [^[:alnum:]_]).
Positional Abbreviations
Abbreviation: \b
Meaning: Match at the beginning or end of a word (a position, not a
character), thus \bton\b will find ton but not tons but \bton will
also find tons.
Abbreviation: \B
Meaning: Match at any position that is NOT the beginning or end of a
word, thus \Bton\B will find the ton in wantonly but not ton or
wanton (where the ton ends the word).
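A quick way to see the positional abbreviations at work is sketched below. It assumes GNU grep, which accepts \b, \B and \w as extensions even with -E; other greps may not support them, so treat this as an illustration rather than portable usage. The helper try is a hypothetical name.

```shell
# try PATTERN TARGET: report whether PATTERN matches TARGET
# (hypothetical helper; \b, \B and \w here rely on GNU grep extensions)
try() {
  if printf '%s\n' "$2" | LC_ALL=C grep -E -q -e "$1"; then
    echo "match"
  else
    echo "no match"
  fi
}

try '\bton\b' 'a ton of bricks'      # match
try '\bton\b' 'tons of bricks'       # no match
try '\bton' 'tons of bricks'         # match
try '\Bton\B' 'wantonly'             # match (ton is inside the word)
try '\Bton\B' 'wanton'               # no match (ton ends the word)
try '^\w+$' 'snake_case'             # match (\w includes underscore)
try '^[[:alnum:]]+$' 'snake_case'    # no match ([:alnum:] does not)
```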
Submatches, Groups and Backreferences
Some language regular expression implementations provide the result
of each separate part of the match enclosed in parentheses (called
a submatch, group or backreference) in variables that may
subsequently be used or substituted in an expression. There may be
more than one such group in an expression. These variables are
usually numbered $1 to $9, where $1 will contain the first
submatch, $2 will contain the second submatch and so on. Example:
# assume target string = "cat"
search expression = (c|a)(t|z)
$1 will contain "a"
# the c at the start of "cat" is not followed by t or z
# so the overall match is "at": (c|a) matched a
# and (t|z) matched t
# if the target string was "act"
# $1 would contain "c"
$2 will contain "t"
PERL, Ruby and the LDAP access directive support
submatches.
When used in regular expression utilities, such as grep, these
submatches are typically called groups or backreferences and are
placed in numeric variables (typically addressed as \1 to \9).
Again these groups or backreferences (variables) may be used in the
regular expression. The following demonstrates usage:
# the following expression finds double characters
(.)\1
# the parentheses create the grouping
# (or submatch or backreference) in this case the first and only (\1)
# the . (dot) finds any character and the \1 substitutes whatever
# character was found by the dot
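To see backreferences in action, here is a sketch using sed and grep in their default BRE mode, where grouping parentheses are written \( and \). The function names mark_doubles and swap_pair are made up for this illustration.

```shell
# mark_doubles WORD: wrap every doubled character in WORD in [ ]
# (\1 in the search refers back to whatever \(.\) matched;
#  & in the replacement is the whole matched text)
mark_doubles() {
  printf '%s\n' "$1" | sed 's/\(.\)\1/[&]/g'
}

# swap_pair TEXT: turn key=value into value=key using \1 and \2
# in the replacement part of the substitution
swap_pair() {
  printf '%s\n' "$1" | sed 's/\(.*\)=\(.*\)/\2=\1/'
}

mark_doubles 'bookkeeper'   # prints b[oo][kk][ee]per
mark_doubles 'Mozilla'      # prints Mozi[ll]a
swap_pair 'key=value'       # prints value=key

# grep can test for a doubled character in the same way
printf '%s\n' 'Mozilla' | grep -q '\(.\)\1' && echo "has a double"
```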
Apache Browser Identification - an Example
All we ever wanted to do was find out enough about our browsers in
Apache to decide what code to supply or not for our pop-out menus.
The Apache BrowserMatch directives will set a variable if
the expression matches the HTTP_USER_AGENT string.
We want to know:
- If we have any browser that supports Javascript (isJS).
- If we have any browser that supports the MSIE DHTML Object
Model (isIE).
- If we have any browser that supports the W3C DOM (isW3C).
Here in their glory are the Apache regular expression statements
we used (maybe you can understand them now)
BrowserMatchNoCase [Mm]ozilla/[4-6] isJS
BrowserMatchNoCase MSIE isIE
BrowserMatchNoCase [Gg]ecko isW3C
BrowserMatchNoCase MSIE.((5\.[5-9])|([6-9])) isW3C
BrowserMatchNoCase W3C_ isW3C
Notes:
- Line 1 checks for any upper or lower case variant of
Mozilla/4-6 (MSIE also sets this value). This test sets the
variable isJS for all version 4-6 browsers (we assume that version
3 and lower do not support Javascript or at least not a sensible
Javascript).
- Line 2 checks for MSIE only (line 1 will take out any MSIE 1-3
browsers even if this variable is set).
- Line 3 checks for any upper or lower case variant of the Gecko
browser which includes Firefox, Netscape 6, 7 and now 8 and the Moz
clones (all of which are Mozilla/5).
- Line 4 checks for MSIE 5.5 (or greater) OR MSIE 6+.
NOTE about binding: This expression does not work:
BrowserMatchNoCase MSIE.(5\.[5-9])|([6-9]) isW3C
It incorrectly sets variable isW3C if a digit 6 - 9 appears
anywhere in the string. Alternation (|) has the lowest precedence
of all the operators, so the first parenthesis binds
directly to the MSIE expression and the OR and second parenthesis
are treated as a separate expression. Adding the enclosing pair of
parentheses around the whole alternation fixed the problem.
- Line 5 checks for W3C_ in any part of the line. This allows us
to identify the W3C validation services (either CSS or HTML/XHTML
page validation).
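The binding problem described for line 4 can be reproduced outside Apache with a sketch like this. It assumes a POSIX shell and grep -E (which, unlike BrowserMatchNoCase, is case sensitive); the helper try and both user-agent strings are hypothetical samples made up for the illustration.

```shell
# try PATTERN TARGET: report whether the ERE PATTERN matches TARGET
# (hypothetical helper, not part of grep or Apache)
try() {
  if printf '%s\n' "$2" | LC_ALL=C grep -E -q -e "$1"; then
    echo "match"
  else
    echo "no match"
  fi
}

# hypothetical sample strings: one without MSIE, one with MSIE 6.0
NOT_IE='Mozilla/4.76 [en] (X11; U; Linux 2.4.2 i386)'
IE6='Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'

# without the enclosing parentheses the | splits the whole expression,
# so any digit 6 - 9 on its own is enough
try 'MSIE.(5\.[5-9])|([6-9])' "$NOT_IE"     # match (wrongly)

# with the enclosing parentheses MSIE must precede either alternative
try 'MSIE.((5\.[5-9])|([6-9]))' "$NOT_IE"   # no match
try 'MSIE.((5\.[5-9])|([6-9]))' "$IE6"      # match
```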
Some of the above checks may be a bit excessive, for example, is
Mozilla ever spelled mozilla, but it is also pretty silly to have
code fail just because of this 'easy to prevent' condition. There
is apparently no final consensus that all Gecko browsers will have
to use Gecko in their 'user-agent' string but it would be extremely
foolish not to since this would force guys like us to make huge
numbers of tests for branded products and the more likely outcome
would be that we would not.
Regular Expression - Experiments
This simple little tester lets you experiment with regular
expressions using your browser's Javascript regular expression
function.
Enter or copy/paste the string you want to experiment with in
the box labeled String: and the regular expression in the box
labeled RE:, click the Search button and results will
appear in the box labeled Results:. If you are very lucky
the results may even be what you expect. The first matched
expression is enclosed in < > which may not be terribly
helpful if you are dealing with HTML. But our heart is in the right
place. Clear will zap all the fields. See the notes below
for limitations, support and capabilities.
Notes:
- Javascript implementations may vary from browser to browser.
This experiment was tested with MSIE 6.x, Gecko and Opera 9. If it
does not work for you - well that is very sad but yell at your
browser supplier not us.
- The ECMA-262 (Javascript 1.2'ish) spec defines the regex
implementation to be based on Perl 5 which means that
submatches/backreferences, short forms such as \d etc should be
supported as well as standard BRE and ERE functionality.
- In Opera and MSIE the following backreference/submatch worked:
(.)\1
Which finds all occurrences of double characters. Using Gecko
(Firefox) it did not work.
- For those of you familiar with Javascript's regular expressions
the script helpfully adds the start and end /, just enter the raw
regular expression. For those of you not familiar with Javascript
regular expressions ignore the preceding sentence.
Utility and Language Notes - General
- Certain utilities, notably grep, suggest that it is a good idea
to enclose any complex search expression inside single quotes. In
fact it is not a good idea - it is absolutely essential!
Example:
grep 'string\\' *.txt # this works correctly
grep string\\ *.txt # this does not work (the shell removes one \)
- Some utilities and most languages use / (forward slash) to start
and end (delimit or contain) the search expression, others may use
single quotes. This is especially true when there may be optional
following arguments (see the grep example above). These characters
do not play any role in the search itself.
Utility Notes - Using Visual Studio
For reasons best known to itself VS (VS.NET) uses a bizarre set
of extensions to regular expressions, described in the MS
documentation, but there is a free regular expression add-in if
you want to return to sanity.
Utility Notes - Using sed
Stream editor (sed) is one of those amazingly powerful tools for
manipulating files that are simply horrible when you try to use
them - unless you get a buzz out of ancient Egyptian hieroglyphics.
But well worth the effort. So if you are
hieroglyphically-challenged, like us, these notes may help. There
again they may not. There is also a useful series of tutorials on sed and this list of sed one liners.
- not all seds are equal: Linux uses GNU sed, the BSDs use
their own, slightly different, version.
- sed on windows: GNU sed has been ported to windows.
- sed is line oriented: sed operates on lines of text within
the file or input stream.
- expression quoting: To avoid shell expansion (in BASH
especially) quote all expressions in single quotes as in a 'search
expression'.
- sed defaults to BRE: The default behaviour of sed is to
support Basic Regular Expressions (BRE). To use all the features
described on this page set the -r (GNU sed on Linux) or -E (BSD) flag to use
Extended Regular Expressions (ERE) as shown (recent versions of GNU
sed also accept -E):
# use ERE on Linux (GNU sed)
sed -r 'search expression' file
# use ERE on BSD
sed -E 'search expression' file
- in-situ editing: By default sed outputs to 'Standard Out'
(normally the console/shell). There are two mutually exclusive
options to create modified files. Send the text to a file or use
in-situ editing with the -i option. The following two lines
illustrate the options:
# saves the unmodified file to file.bak BEFORE
# modifying file in place (GNU sed needs the suffix attached
# to -i with no space, BSD sed accepts either form)
sed -i.bak 'search expression' file
# file is UNCHANGED the modified file is file.bak
sed 'search expression' file > file.bak
- sed source: Sed will read from a file or 'Standard In'
and therefore may be used in piped sequences. The following two
lines are functionally equivalent:
cat file |sed 'search expression' > file.mod
sed 'search expression' file > file.mod
- sed with substitution: sed's major use for most of us is
in changing the contents of files using the substitution feature.
Substitution uses the following expression:
# substitution syntax
sed '[position]s/find/change/flag' file > file.mod
# where
# [position] - optional - normally called address in most documentation
# s - indicates substitution command
# find - the expression to be changed
# change - the expression to be substituted
# flag - controls the actions and may be
# g = repeat on same line
# N = Nth occurrence only on line
# p = output line only if find was found!
# (needs -n option to suppress other lines)
# w ofile = append line to ofile only if find
# was found
# if no flag is given only the first occurrence of
# find on every line is substituted
# examples
# change every occurrence of abc on every line to def
sed 's/abc/def/g' file > file.mod
# change only 2nd occurrence of abc on every line to def
sed 's/abc/def/2' file > file.mod
# creates file changed consisting of only lines in which
# abc was changed to def
sed 's/abc/def/w changed' file
# functionally identical to above
sed -n 's/abc/def/p' file > changed
- Line deletion: sed provides for simple line deletion. The
following examples illustrate the syntax and a trivial example:
# line delete syntax
sed '/find/d' file > file.mod
# where
# find - find regular expression
# d - delete command
# delete every comment line (starting with #) in file
sed '/^#/d' file > file.mod
- Delete vs Replace with null: If you use the delete
feature of sed it deletes the entire line on which 'search
expression' appears, which may not be the desired outcome. If all
you want to do is delete the 'search expression' from the line then
use replace with null. The following examples illustrate the
difference:
# delete (substitute with null) every occurrence of abc in file
sed 's/abc//g' file > file.mod
# delete every line with abc in file
sed '/abc/d' file > file.mod
- Escaping: You need to escape certain characters when
using them as literals using the standard \ technique. The following
example removes the width attribute from html pages that many web
editors, such as frontpage, annoyingly place on every line. The " are
used as literals in the expression and are escaped by using \
(inside a single-quoted sed expression the " has no special meaning,
so the escape is harmless but not strictly required):
# delete every occurrence of width="x" in file
# where x may be pure numeric or a percentage
sed 's/width=\"[0-9.%]*\"//g' file.html > file.mod
- Delimiters: If you use sed when working with, say, paths
which contain / it can be a royal pain to escape them all so you
can use any sensible delimiter for the expressions. The following
example illustrates the principle:
# use of / delimiter with a path containing /
# replaces all occurrences of /var/www/ with /var/local/www/
sed 's/\/var\/www\//\/var\/local\/www\//g' file > file.mod
# functionally identical using : as delimiter
sed 's:/var/www/:/var/local/www/:g' file > file.mod
- Positioning with sed: sed documentation uses, IOHO, the
confusing term address for what we call [position]. Positional
expressions can optionally be placed before sed commands to
position the execution of subsequent expressions/commands. Commands
may take 1 or 2 positional expressions which may be line or text
based. The following are simple examples:
# delete (substitute with null) every occurrence of abc
# in file only on lines starting with xyz (1 positional expression)
sed '/^xyz/s/abc//g' file > file.mod
# delete (substitute with null) every occurrence of abc
# only in lines 1 to 50
# 2 positional expressions separated by comma
sed '1,50s/abc//g' file > file.mod
# delete (substitute with null) every occurrence of abc
# except lines 1 - 50
# 2 positional expressions separated by comma
sed '1,50!s/abc//g' file > file.mod
# delete (substitute with null) every occurrence of abc
# between lines containing aaa and xxx
# 2 positional expressions separated by comma
sed '/aaa/,/xxx/s/abc//g' file > file.mod
# delete first 50 lines of file
# 2 positional expressions separated by comma
sed '1,50d' file > file.mod
# leave first 50 lines of file - delete all others
# 2 positional expressions separated by comma
sed '1,50!d' file > file.mod
- when to use -e: you can use -e (indicating sed commands)
with any search expression but when you supply multiple command
sequences as separate arguments you must prefix each with -e. The
following are functionally identical:
# delete every occurrence of width="x" in file
sed 's/width=\"[0-9.%]*\"//g' file.html > file.mod
sed -e 's/width=\"[0-9.%]*\"//g' file.html > file.mod
- Strip HTML tags: Regular expressions take the longest
(greedy) match and therefore when stripping HTML tags may not yield
the desired result:
# target line
<b>I</b> want you to <i>get</i> lost.
# this command finds the first < and last > on line
sed 's/<.*>//g' file.html
# and yields
lost.
# instead delimit each < with >
sed 's/<[^>]*>//g' file.html
# yields
I want you to get lost.
# finally to allow for multi-line tags you must use
# following (attributed to S.G Ravenhall)
sed -e :loop -e 's/<[^>]*>//g;/</N;//bloop' file.html
[see explanation below]
- labels, branching and multiple commands: sed allows
multiple commands on a single line separated by semi-colons (;) and
the definition of labels to allow branching (looping) within
commands. The following example illustrates these features:
# this sequence strips html tags including multi-line ones
sed -e :loop -e 's/<[^>]*>//g;/</N;//bloop' file.html
# Explanation:
# -e :loop consists of : which creates a label followed by its name,
# in this case 'loop', that can be branched to by a later command
# next -e s/<[^>]*>//g; removes every complete tag in the current
# text and the ; terminates this command.
# At this point the line buffer (called the pattern space in sed
# documentation) holds the current line text with any
# transformations applied, so complete <...> sequences have been
# removed from the pattern space.
# However we may have either an < or no < left in the current
# pattern space which is then processed by the next command which is:
# /</N; where /</ is a positioning expression looking for <
# in any remaining part of the pattern space. If < is found, the N
# adds a NL char to the pattern space (a delimiter) and tells sed
# to ADD the next line in the file to the pattern space, control
# then passes to the next command.
# If < was NOT found the N is skipped and control passes
# directly to the next command, which is:
# //bloop which comprises // = an empty expression that re-uses the
# last expression tested (here <), b = branch to and loop
# which is the label to which we will branch and was created
# with -e :loop. If a < is still present the sequence restarts
# with the next line ADDED to the pattern space (there was a <
# left in the pattern space but no corresponding >).
# If no < is present there is no branch, the end of the commands
# is reached, the pattern space is output and the sequence starts
# again with just the next line of input
# all pretty obvious really!
- adding line numbers to files: Sometimes it's incredibly
useful to be able to get a file with line numbers to match up with
error messages for example. The following adds a line number
followed by a single space to each line in file:
# add the line number followed by space to every line in file
sed = file|sed 'N;s/\n/ /' > file.lineno
# the first pass (sed = file) outputs each line number on a
# line of its own (terminated with \n) before the line itself
# the second piped pass (sed 'N;s/\n/ /') uses N to join each
# number line to the line that follows it and then substitutes
# a space for the \n between them making a single line
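The tag stripping loop and the line numbering trick from the notes above can be wrapped up and tried on small inputs with a sketch like the following. It assumes a POSIX shell and a sed that accepts a label given in its own -e argument (GNU and BSD sed both do); the function names strip_tags and number_lines and the sample text are our own inventions.

```shell
# strip_tags: remove HTML tags from standard input,
# including tags that are split across lines
strip_tags() {
  sed -e :loop -e 's/<[^>]*>//g;/</N;//bloop'
}

# number_lines: prefix each line of standard input with
# its line number and a single space
number_lines() {
  sed = | sed 'N;s/\n/ /'
}

# a tag (<a ... >) broken across two input lines is still removed
printf '<b>I</b> want you to <a\nhref="x">get</a> lost.\n' | strip_tags

# each line gains its number: "1 alpha" then "2 beta"
printf 'alpha\nbeta\n' | number_lines
```

Both functions read standard input, so they can be dropped into a pipe in the same way as the cat file | sed example earlier.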