PHP  
Last updated: Tue, 28 May 2002

LXXXVIII. Regular Expression Functions (POSIX Extended)

Note: PHP also supports regular expressions with a Perl-compatible syntax through the PCRE functions. Those functions support non-greedy matching, assertions, conditional subpatterns, and a number of other features not supported by the POSIX-extended regular expression syntax.

Warning

These regular expression functions are not binary-safe. The PCRE functions are.

Regular expressions are used for complex string manipulation in PHP. The functions that support regular expressions are ereg(), ereg_replace(), eregi(), eregi_replace(), split(), and spliti().

These functions all take a regular expression string as their first argument. PHP uses the POSIX extended regular expressions as defined by POSIX 1003.2. For a full description of POSIX regular expressions see the regex man pages included in the regex directory in the PHP distribution. It's in manpage format, so you'll want to do something along the lines of man /usr/local/src/regex/regex.7 in order to read it.

Example 1. Regular Expression Examples

ereg ("abc", $string);
/* Returns true if "abc"
   is found anywhere in $string. */

ereg ("^abc", $string);
/* Returns true if "abc"
   is found at the beginning of $string. */

ereg ("abc$", $string);
/* Returns true if "abc"
   is found at the end of $string. */

eregi ("(ozilla.[23]|MSIE.3)", $HTTP_USER_AGENT);  
/* Returns true if client browser
   is Netscape 2, 3 or MSIE 3. */

ereg ("([[:alnum:]]+) ([[:alnum:]]+) ([[:alnum:]]+)", $string,$regs); 
/* Places three space separated words
   into $regs[1], $regs[2] and $regs[3]. */

$string = ereg_replace ("^", "<br />", $string); 
/* Put a <br /> tag at the beginning of $string. */
 
$string = ereg_replace ("$", "<br />", $string);
/* Put a <br /> tag at the end of $string. */

$string = ereg_replace ("\n", "", $string);
/* Get rid of any newline
   characters in $string. */

Table of Contents
ereg -- Regular expression match
ereg_replace -- Replace regular expression
eregi -- Case-insensitive regular expression match
eregi_replace -- Replace regular expression, case-insensitive
split -- Split string into array by regular expression
spliti -- Split string into array by regular expression, case-insensitive
sql_regcase -- Make regular expression for case-insensitive match
User Contributed Notes
07-Mar-2001 05:38
If you don't have command-line access to the manpage cited above, note that
the "POSIX 1003.2 Regular Expressions" manpage is also widely
re-published on the web.  See, for instance:



The "POSIX 1003.2 Regular Expressions" manpage provides a good
basic reference for the syntax used by ereg_* functions.  Most tutorials
on "extended regular expressions" are also applicable.


07-Mar-2001 12:53

Dario seems to have made a nice tutorial about regular expressions:



Thanks Dario! ...


18-Dec-2001 11:39

I noticed Cyro's link had gone stale, so I made a copy of the regex manpage and
placed it on my site. You can get it from the following address:



This is primarily for Windows users, who have no access to the man pages
in Linux distributions.

03-Feb-2002 01:02
If you are looking for abbreviations such as tab and carriage return, or
for regex class definitions, you should look here:


some excerpts:

control characters
	\a	bell
	\b	backspace
	\f	form feed
	\n	line feed
	\r	carriage return
	\t	horizontal tab
	\v	vertical tab

class example
	\cLu	all uppercase letters


21-Feb-2002 03:12

It's easy to exclude characters but excluding words with a regular
expression is a bit more tricky. For parentheses there is no equivalent to
the ^ for brackets. The only way I've found to exclude a string is to
proceed by inverse logic: accept all the words that do NOT correspond to
the string. So if you want to accept all strings except those _beginning_
with "abc", you'd have to accept any string that matches one of
the following:
  ^(ab[^c])
  ^(a[^b]c)
  ^(a[^b][^c])
  ^([^a]bc)
  ^([^a]b[^c])
  ^([^a][^b]c)
  ^([^a][^b][^c])

which, put together, gives the regex
  ^(ab[^c]|a[^b]c|a[^b][^c]|[^a]bc|[^a]b[^c]|[^a][^b]c|[^a][^b][^c])

Note that this won't work to detect the word "abc" anywhere in a
string. You need to have some way of anchoring the inverse word match
like: ^(a[^b]|[^a]b|[^a][^b])   ;"ab" not at beginning of line
  or: (a[^b]|[^a]b|[^a][^b])$   ;"ab" not at end of line
  or: 123(a[^b]|[^a]b|[^a][^b]) ;"ab" not after "123"

I don't know why "(abc){0,0}" is invalid syntax. It would've
made all this much simpler.
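
As a rough sketch of how the combined pattern above behaves, the snippet below runs it against a few sample strings. It uses preg_match() from the PCRE functions mentioned at the top of this page rather than ereg(), so that it also runs on PHP builds where ereg() is not available; the pattern itself uses only POSIX extended syntax. The helper name does_not_start_with_abc() is made up for this illustration.

```php
<?php
// Hypothetical helper: true when $s matches the combined "not abc" pattern.
function does_not_start_with_abc($s)
{
    $pattern = '/^(ab[^c]|a[^b]c|a[^b][^c]|[^a]bc|[^a]b[^c]|[^a][^b]c|[^a][^b][^c])/';
    return preg_match($pattern, $s) === 1;
}

var_dump(does_not_start_with_abc('abcdef')); // bool(false): begins with "abc"
var_dump(does_not_start_with_abc('abdxyz')); // bool(true)
var_dump(does_not_start_with_abc('xyz'));    // bool(true)
var_dump(does_not_start_with_abc('ab'));     // bool(false): every branch needs three characters
```

Note the last case: because each alternative consumes exactly three characters, strings shorter than three characters are rejected even though they do not begin with "abc".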
 
 
Slightly off-topic, here's a regex date validator (format yyyy-mm-dd,
remove all spaces and linefeeds):
  ^(19|20)([0-9]{2}-((0[13-9]|1[0-2])-(0[1-9]|[12][0-9]|30)|
  (0[13578]|1[02])-31|02-(0[1-9]|1[0-9]|2[0-8]))|([2468]0|
  [02468][48]|[13579][26])-02-29)$
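
A minimal sketch of the date pattern in use, with the three lines joined as instructed. It is written with preg_match() (the pattern only uses POSIX extended syntax, so the same string should be usable with ereg() where that function exists); the helper name looks_like_valid_date() is an assumption made for the example.

```php
<?php
// Hypothetical helper: true when $s is accepted by the yyyy-mm-dd pattern.
function looks_like_valid_date($s)
{
    // The pattern from the note, joined without spaces or linefeeds.
    $pattern = '/^(19|20)([0-9]{2}-((0[13-9]|1[0-2])-(0[1-9]|[12][0-9]|30)|'
             . '(0[13578]|1[02])-31|02-(0[1-9]|1[0-9]|2[0-8]))|([2468]0|'
             . '[02468][48]|[13579][26])-02-29)$/';
    return preg_match($pattern, $s) === 1;
}

var_dump(looks_like_valid_date('2002-05-28')); // bool(true)
var_dump(looks_like_valid_date('1999-12-31')); // bool(true)
var_dump(looks_like_valid_date('2001-04-31')); // bool(false): April has 30 days
var_dump(looks_like_valid_date('2002-02-29')); // bool(false): not a leap year
var_dump(looks_like_valid_date('2004-02-29')); // bool(true)
var_dump(looks_like_valid_date('2000-02-29')); // bool(false)
```

The last case is worth knowing about: the leap-year branch has no alternative for years ending in 00, so 2000-02-29 is rejected even though 2000 was a leap year.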

luciano_at_braziliantranslation.net
03-Mar-2002 06:15

mholdgate wrote a very nice quick reference guide on the next page,
but I felt it could be improved a little:
________________

^		Start of line
$		End of line
n?		Zero or one occurrence of character 'n'
n*		Zero or more occurrences of character 'n'
n+		One or more occurrences of character 'n'
n{2}		Exactly two occurrences of 'n'
n{2,}		Two or more occurrences of 'n'
n{2,4}		From 2 to 4 occurrences of 'n'
.		Any single character
()		Parentheses to group expressions
(.*)		Zero or more occurrences of any single character, i.e., anything!
(n|a)		Either 'n' or 'a'
[1-6]		Any single digit in the range between 1 and 6
[c-h]		Any single lower case letter in the range between c and h
[D-M]		Any single upper case letter in the range between D and M
[^a-z]		Any single character EXCEPT any lower case letter between a and z.

		Pitfall: the ^ symbol only acts as an EXCEPT rule if it is the 
		very first character inside a range; if it appears again later 
		in the range, it is simply one more character being denied. 
		Outside brackets, ^ is an anchor meaning "start of line"; it 
		never negates what follows.
		In other words, you cannot deny a word with ^undesired_word 
		or a group with ^(undesired_phrase).
		Read more detailed regex documentation to find out what is 
		necessary to achieve this.

[_4^a-zA-Z]	Any single character which can be the underscore or the 
		number 4 or the ^ symbol or any letter, lower or upper case
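
The bracket rules above can be checked with a small sketch. It uses preg_match() rather than ereg(), on the assumption that both treat these simple bracket expressions the same way; char_matches() is a hypothetical helper that tests one character against one bracket expression.

```php
<?php
// Hypothetical helper: true when the single character $ch matches $class.
function char_matches($class, $ch)
{
    return preg_match('/^' . $class . '$/', $ch) === 1;
}

var_dump(char_matches('[^a-z]', 'q'));      // bool(false): lower case letters are denied
var_dump(char_matches('[^a-z]', 'Q'));      // bool(true)
var_dump(char_matches('[^a-z^]', '^'));     // bool(false): the second ^ is denied too
var_dump(char_matches('[_4^a-zA-Z]', '^')); // bool(true): ^ is not first, so it is literal
var_dump(char_matches('[_4^a-zA-Z]', '5')); // bool(false)
```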

?, +, * and the {} count parameters can be appended not only to a single
character, but also to a group() or a range[].

therefore,
^.{2}[a-z]{1,2}_?[0-9]*([1-6]|[a-f])[^1-9]{2}a+$
would mean:

^.{2} 		= A line beginning with any two characters, 
[a-z]{1,2} 	= followed by either 1 or 2 lower case letters, 
_? 		= followed by an optional underscore, 
[0-9]* 		= followed by zero or more digits, 
([1-6]|[a-f]) 	= followed by either a digit between 1 and 6 OR a 
		lower case letter between a and f, 
[^1-9]{2} 	= followed by any two characters except digits 
		between 1 and 9 (0 is possible), 
a+$ 		= followed by one or more 
		occurrences of 'a' at the end of a line.
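
To see the breakdown above in action, here is a small sketch that tries the full expression on a few strings. It uses preg_match() with the same pattern text (only POSIX extended syntax is involved); the helper name matches_sample_pattern() and the sample strings are invented for the example.

```php
<?php
// Hypothetical helper: true when $s matches the sample expression.
function matches_sample_pattern($s)
{
    $pattern = '/^.{2}[a-z]{1,2}_?[0-9]*([1-6]|[a-f])[^1-9]{2}a+$/';
    return preg_match($pattern, $s) === 1;
}

var_dump(matches_sample_pattern('XYab_1234xxa')); // bool(true)
var_dump(matches_sample_pattern('##q5 0aaa'));    // bool(true): no underscore, "5" fills the group
var_dump(matches_sample_pattern('XYab_1234x5a')); // bool(false): "5" is not allowed by [^1-9]
var_dump(matches_sample_pattern('XYAB_1xxa'));    // bool(false): upper case where [a-z] is required
```

In the first case the digits "123" are taken by [0-9]* and the final "4" is left for ([1-6]|[a-f]); the engine works this out by itself.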


07-Mar-2002 04:26

sorry to be picky here but saying ^ is beginning of a line or $ is end of
line is rather misleading, if you're working on a daily basis with
regexes.

it might be that it is correct most of the time BUT on some occasions
you'd be better off thinking of ^ as "start of string" and $ as
"end of string".

there are ways to make your regex engine forget about your system's notion
of a newline; that's what is commonly referred to as multiline regexes...
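
A short sketch of the "start of string" reading, using a string that contains a newline. It is written with preg_match() and no multiline option; the assumption (not verified here against ereg() itself) is that ereg() anchors the same way for these inputs. The helper name anchored_match() is hypothetical.

```php
<?php
// Hypothetical helper: true when $expr matches somewhere in $text,
// with no multiline option set.
function anchored_match($expr, $text)
{
    return preg_match('/' . $expr . '/', $text) === 1;
}

$text = "first\nsecond";

var_dump(anchored_match('^first', $text));  // bool(true)
var_dump(anchored_match('^second', $text)); // bool(false): starts a line, not the string
var_dump(anchored_match('first$', $text));  // bool(false): ends a line, not the string
var_dump(anchored_match('second$', $text)); // bool(true)
```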


08-Mar-2002 04:33

Follow-up to my previous post:
Some simple optimization allowed me to realize that excluding a word at
the beginning of a string has a degree of complexity O(n) rather than
O(n^2). I only had to follow the logic:

if str[0] != badword[0] then OK
else
  if str[1] != badword[1] then OK
  else
    if str[2] != badword[2] then OK
    else ...

So excluding the word 'abc' at the beginning of a string is much
simpler than I had made it out to be:
  ^([^a]|a[^b]|ab[^c])
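
A quick sketch of the shorter pattern, again using preg_match() with POSIX-compatible syntax. The second function is a variation that is NOT part of the note: it is an assumption about how one might also accept strings that end before "abc" could be completed. Both function names are made up for the example.

```php
<?php
// Pattern from the note: accept strings that do not begin with "abc".
function passes_short_pattern($s)
{
    return preg_match('/^([^a]|a[^b]|ab[^c])/', $s) === 1;
}

// Variation (not from the note): also accept "", "a" and "ab" by
// allowing the string to end at each step.
function passes_extended_pattern($s)
{
    return preg_match('/^($|[^a]|a($|[^b])|ab($|[^c]))/', $s) === 1;
}

var_dump(passes_short_pattern('abcdef'));    // bool(false)
var_dump(passes_short_pattern('abd'));       // bool(true)
var_dump(passes_short_pattern('ab'));        // bool(false): no character after "ab" to test
var_dump(passes_extended_pattern('ab'));     // bool(true)
var_dump(passes_extended_pattern('abcdef')); // bool(false)
```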


09-Mar-2002 04:40

Sadly, the Posix regexp evaluator (PHP 4.1.2) does not seem to support
multi-character collating sequences, even though such sequences are
included in the man-page documentation.

Specifically, the man-page discusses the expression "[[.ch.]]*c"
which matches the first five characters of "chchcc".  Running
this expression in ereg_replace generates the error "Warning:
REG_ECOLLATE".  (Running an equivalent expression with only one
character between the periods does work, however.)

Multi-character collating sequences are not supported!

This is really, really too bad, because it would have provided a simple
way to exclude words from the target.

I'm going to go learn PCRE, now.  :-(

Copyright © 2001, 2002 The PHP Group
All rights reserved.
Last updated: Sat Jul 6 00:05:55 2002 CEST