-
The search patterns used by the Sequence Manipulation
Suite are not case sensitive. The following is a simple
search pattern that will find all occurrences of the
sequence fragment GGAT (and ggat):
ggat
The above will match GGAT but not GGAA.
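As an illustration outside the Suite itself, the same case-insensitive search can be sketched in Python. This assumes Python's re module with the IGNORECASE flag behaves like the Suite's matcher for this pattern; the helper name find_fragment and the test sequence are made up for the example.

```python
import re

def find_fragment(pattern, sequence):
    """Return (start, matched_text) for each non-overlapping match, ignoring case."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]

# Hypothetical test sequence containing GGAT, ggat, and GGAA.
hits = find_fragment("ggat", "ccGGATtaggatGGAA")
print(hits)  # [(2, 'GGAT'), (8, 'ggat')]
```

Positions are zero-based, and the trailing GGAA is not reported.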
-
Sequences containing residues that vary at a particular
position can be matched using square brackets. The
following pattern will find all occurrences of
GGAT, GGAC, and GGAA:
gga[tca]
The above will match GGAT but not GGAG.
-
To represent a completely variable residue in a pattern,
use the . character. The following pattern will
find all occurrences of GCA followed by any
single residue, followed by TTT:
gca.ttt
The above will match GCAATTT but not
GCAAATTT.
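The bracket and . patterns above can be checked against the example sequences with a short Python sketch. It assumes Python's re module treats [tca] and . the same way as the Suite; the variable names are illustrative only.

```python
import re

# Example sequences taken from the text above.
candidates = ["GGAT", "GGAC", "GGAA", "GGAG", "GCAATTT", "GCAAATTT"]

matches = {}
for pattern in ["gga[tca]", "gca.ttt"]:
    regex = re.compile(pattern, re.IGNORECASE)
    # Keep every candidate in which the pattern occurs somewhere.
    matches[pattern] = [seq for seq in candidates if regex.search(seq)]

print(matches["gga[tca]"])  # ['GGAT', 'GGAC', 'GGAA']
print(matches["gca.ttt"])   # ['GCAATTT']
```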
-
To indicate that a residue can be repeated one or more
times in a sequence, use the + character. The
following pattern will find all occurrences of
MVV followed by one or more R residues:
MVVR+
The above will match MVVRR but not
MVVDR.
-
To indicate that a residue can be repeated zero or more
times in a sequence, use the * character. The
following pattern will find all occurrences of
MD followed by zero or more K residues,
followed by an L:
MDK*L
The above will match MDL but not MDVL.
-
To indicate that a residue can be repeated a specific
number of times, or a number of times within a range,
use curly braces. The following
pattern will find all occurrences of an
M residue, followed by between one and four
L residues, followed by a G residue:
ML{1,4}G
The above will match MLLG but not
MLLLLLG.
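The three repeat characters (+, *, and {}) can be compared side by side. The following Python sketch assumes the re module handles these quantifiers as the Suite does; MDKKL is an extra test sequence added for illustration.

```python
import re

# Pattern -> sequences to test (mostly the examples from the text above).
cases = {
    "MVVR+": ["MVVRR", "MVVDR"],
    "MDK*L": ["MDL", "MDKKL", "MDVL"],
    "ML{1,4}G": ["MLLG", "MLLLLLG"],
}

# For each pattern, record whether it occurs in each test sequence.
results = {
    pattern: {seq: re.search(pattern, seq, re.IGNORECASE) is not None for seq in seqs}
    for pattern, seqs in cases.items()
}

for pattern, outcome in results.items():
    print(pattern, outcome)
```

MLLLLLG fails because five L residues separate M from G, one more than {1,4} allows.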
-
The special characters, brackets, and curly braces
in the above examples allow repeated residues to be
found. You can find repeated sub-sequences using regular
parentheses in combination with the +, *,
and {} characters. The following pattern will
find all occurrences of two to five TNT sequences in
a row, followed by one or more KM repeats:
(TNT){2,5}(KM)+
The above will match TNTTNTTNTKM but not
TNTTNKM.
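To show what the parentheses change, the Python sketch below compares the grouped pattern with an ungrouped variant in which {2,5} and + apply only to the single residue before them. It assumes Python's re module groups and repeats as the Suite does; the test sequences are made up for the example.

```python
import re

grouped = re.compile(r"(TNT){2,5}(KM)+", re.IGNORECASE)
# Without parentheses, {2,5} repeats only the final T and + only the M.
ungrouped = re.compile(r"TNT{2,5}KM+", re.IGNORECASE)

m = grouped.search("AATNTTNTTNTKMKMQ")
print(m.group(), m.span())  # TNTTNTTNTKMKM (2, 15)

print(grouped.search("TNTTNTKM") is not None)    # True
print(ungrouped.search("TNTTNTKM") is not None)  # False
print(ungrouped.search("TNTTKMM") is not None)   # True
```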
-
To restrict matches to the beginning of a sequence, use
the ^ character. For example, the following
pattern will find GACCCT only if no more than
three residues precede it in the sequence:
^.{0,3}GACCCT
The above will find GACCCT in the sequence
ATCGACCCT but not in the sequence
AATCGACCCT.
-
To restrict matches to the end of a sequence, use the
$ character. For example, the following pattern
will find LVL only if it is located at the end of
a sequence:
LVL$
The above will find LVL in the sequence
KMHLVL but not in the sequence LVLD.
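The ^ and $ anchors can be tried out with the Python sketch below, which assumes the re module anchors as the Suite does. The parentheses around GACCCT are an addition for illustration: the whole match includes the leading residues, so a group is used to recover the position of GACCCT itself.

```python
import re

# ^ anchors at the sequence start; the group isolates GACCCT from the leading residues.
near_start = re.compile(r"^.{0,3}(GACCCT)", re.IGNORECASE)
# $ anchors at the sequence end.
at_end = re.compile(r"LVL$", re.IGNORECASE)

m = near_start.search("ATCGACCCT")
print(m.group(0), m.start(1))                  # ATCGACCCT 3
print(near_start.search("AATCGACCCT"))         # None

print(at_end.search("KMHLVL") is not None)     # True
print(at_end.search("LVLD") is not None)       # False
```

In Python, $ also matches just before a trailing newline, so line endings should be stripped from sequences read from a file.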
-
To find variable sequences, you can also use the
| character to separate patterns for the
different versions of the sequence segment you want to
find. For example, to find all occurrences of
MML, MAL, and MAD you could use the
following:
MML|MAL|MAD
The above will match MML but not MMK.
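The | pattern can be run over a longer sequence to list every variant it finds. This Python sketch assumes the re module treats | as the Suite does; the protein fragment is hypothetical.

```python
import re

variants = re.compile(r"MML|MAL|MAD", re.IGNORECASE)

# Hypothetical protein fragment containing MML, MAD, MMK, and MAL.
protein = "KMMLQMADMMKMAL"
found = [(m.start(), m.group()) for m in variants.finditer(protein)]
print(found)  # [(1, 'MML'), (5, 'MAD'), (11, 'MAL')]
```

MMK at position 8 is skipped because it is not one of the listed alternatives.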
-
Other examples:
atg(...)+(tag|taa|tga)
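The above will match open reading frames that start
with atg and end with tag, taa,
or tga. Because + matches as many codons as possible,
a match runs to the last in-frame stop codon it can
reach and may contain earlier in-frame stop codons.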
[VILMFWC]{10,}
The above will match stretches of proteins containing
ten or more hydrophobic residues.
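The two patterns above can be explored with the Python sketch below, which assumes the re module matches them as the Suite does. The lazy variant (...)+? is not part of the text above; it is shown only to contrast with the default behavior, in which + takes as many codons as it can. Both test sequences are made up for the example.

```python
import re

# Hypothetical DNA: atg, aaa, tag (in-frame stop), ccc, taa (in-frame stop).
dna = "ccatgaaatagccctaagg"

greedy = re.search(r"atg(...)+(tag|taa|tga)", dna, re.IGNORECASE)
lazy = re.search(r"atg(...)+?(tag|taa|tga)", dna, re.IGNORECASE)

print(greedy.group())  # atgaaatagccctaa  (runs to the last in-frame stop)
print(lazy.group())    # atgaaatag        (stops at the first in-frame stop)

# Ten or more consecutive hydrophobic residues in a hypothetical protein.
stretch = re.search(r"[VILMFWC]{10,}", "KDVILMFWCVILQ", re.IGNORECASE)
print(stretch.group(), stretch.span())  # VILMFWCVIL (2, 12)
```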