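CIII. Regular Expression Functions (POSIX Extended)

Introduction

Note: PHP also supports regular expressions that use Perl-compatible syntax through the PCRE functions. Those functions support non-greedy ("lazy") matching, assertions, conditional subpatterns, and a number of other features that the POSIX extended regular expression syntax does not support.

Warning

These regular expression functions are not binary-safe. The PCRE functions are.

Regular expressions are used for complex string manipulation in PHP. The functions that support them are listed in the table of contents at the end of this section.

All of these functions take a regular expression as their first argument. PHP uses the POSIX extended regular expressions defined by POSIX 1003.2. For a full description of POSIX regular expressions, see the regex man pages in the regex directory of the PHP distribution. The page is in manpage format, so read it with a command such as man /usr/local/src/regex/regex.7.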

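Requirements

These functions are part of the standard module, which is always available.

Installation

To enable regular expression support, configure PHP with --with-regex=TYPE. TYPE can be one of system, apache, or php. The default is php.

Note: Do not change TYPE unless you know what you are doing.

The Windows version of PHP has built-in support for this extension. You do not need to load any additional extension in order to use these functions.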

Runtime Configuration

This extension defines no configuration directives.

Resource Types

This extension defines no resource types.

Predefined Constants

This extension defines no constants.

Examples

Example 1. Regular expression examples

/* Returns true if "abc" is found anywhere in $string. */
ereg ("abc", $string);

/* Returns true if "abc" is found at the beginning of $string. */
ereg ("^abc", $string);

/* Returns true if "abc" is found at the end of $string. */
ereg ("abc$", $string);

/* Returns true if the client browser is Netscape 2, 3 or MSIE 3. */
eregi ("(ozilla.[23]|MSIE.3)", $HTTP_USER_AGENT);

/* Places three space-separated words into $regs[1], $regs[2] and $regs[3]. */
ereg ("([[:alnum:]]+) ([[:alnum:]]+) ([[:alnum:]]+)", $string, $regs);

/* Puts a <br /> tag at the beginning of $string. */
$string = ereg_replace ("^", "<br />", $string);

/* Puts a <br /> tag at the end of $string. */
$string = ereg_replace ("$", "<br />", $string);

/* Removes every newline character from $string. */
$string = ereg_replace ("\n", "", $string);
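The ereg family was deprecated in PHP 5.3 and removed in PHP 7.0, so the calls above only run on older installations. As a minimal sketch, assuming a current PHP with the PCRE extension, the same checks can be written with preg_match() and str_replace(). The sample value of $string is made up for illustration.

```php
<?php
// Sample input, chosen only for illustration.
$string = "abc def ghi";

// PCRE patterns need delimiters (here "/"); preg_match() returns 1, 0, or false.
$anywhere = preg_match('/abc/', $string) === 1;   // "abc" anywhere in $string
$atStart  = preg_match('/^abc/', $string) === 1;  // "abc" at the start
$atEnd    = preg_match('/abc$/', $string) === 1;  // "abc" at the end

// Capture three space-separated words into $regs[1], $regs[2], $regs[3].
preg_match('/([[:alnum:]]+) ([[:alnum:]]+) ([[:alnum:]]+)/', $string, $regs);

// Remove every newline; a plain str_replace() is enough for a literal string.
$noNewlines = str_replace("\n", "", "line one\nline two\n");

var_dump($anywhere, $atStart, $atEnd, $regs[1], $regs[2], $regs[3], $noNewlines);
```

POSIX character classes such as [[:alnum:]] are also accepted by PCRE, so most simple ereg patterns carry over once delimiters are added.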

See Also

For regular expressions in Perl-compatible syntax, see the PCRE functions. Simple shell-style wildcard pattern matching is provided by fnmatch().

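Table of Contents
ereg_replace -- Replace regular expression
ereg -- Regular expression match
eregi_replace -- Replace regular expression, case-insensitive
eregi -- Case-insensitive regular expression match
split -- Split a string into an array by regular expression
spliti -- Split a string into an array by case-insensitive regular expression
sql_regcase -- Make a regular expression for case-insensitive matching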

User Contributed Notes

php at erikjan dot net
19-Feb-2005 12:28


To add to tgt's tip for metacharacters.

To test for a whole word, use [[:<:]]yourword[[:>:]]
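[[:<:]] and [[:>:]] are the word-boundary markers of the regex library described in regex(7). With the PCRE functions the usual equivalent is \b. A minimal sketch, assuming a current PHP; the word and the sample sentences are made up for illustration:

```php
<?php
// Hypothetical word to look for; preg_quote() escapes any regex metacharacters in it.
$word = 'cat';
$pattern = '/\b' . preg_quote($word, '/') . '\b/';

// Matches: "cat" stands alone as a word.
$wholeWord = preg_match($pattern, 'the cat sat');

// Does not match: "cat" only appears inside a longer word.
$insideWord = preg_match($pattern, 'concatenate');

var_dump($wholeWord, $insideWord);
```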

franck569 at free dot fr
31-Jan-2005 03:14


if you want to exclude a WORD, use this :



[^[WORD]]{0}



@++, Franck569.

09-Nov-2004 05:49


To exclude a string, you can use the PCRE extension:



'(?(?=^(the_only_string_you_dont_want_to_match)$)^$|.*)'



It is a conditional subpattern that works like this:

if (the string is exactly the unwanted one)
  test the string against ^$
else
  test the string against .*
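A minimal sketch of that conditional with preg_match(), assuming a current PHP. One detail worth noting: without outer anchors the engine can retry at a later offset, where the condition is false and .* succeeds, so the sketch wraps the whole pattern in ^ and $. A negative lookahead is shown alongside as a shorter way to express the same rule. The forbidden value 'admin' is hypothetical.

```php
<?php
// Hypothetical forbidden value, chosen only for illustration.
$forbidden = preg_quote('admin', '/');

// The conditional form from the note, anchored with ^ and $ on the outside.
$conditional = '/^(?(?=^(' . $forbidden . ')$)^$|.*)$/';

// A negative lookahead that expresses the same rule more directly.
$lookahead = '/^(?!' . $forbidden . '$).*$/';

$results = [];
foreach (['admin', 'administrator', 'guest'] as $subject) {
    $results[$subject] = [
        preg_match($conditional, $subject),
        preg_match($lookahead, $subject),
    ];
    printf("%-14s conditional=%d lookahead=%d\n", $subject, ...$results[$subject]);
}
```

Only the exact string is rejected; longer strings that merely start with it still match.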

annie
09-Sep-2004 12:17


Another nice tutorial about regular expressions:

tgt at tip dot nl
08-Sep-2004 09:17


Tip!

Metacharacters in regular expressions are useful and easy to use.

The following is a set of special values that denote certain common ranges. They have the advantage that they also take the locale into account, i.e. any variant of the local language/coding system. Each one must be used inside a bracket expression, e.g. [[:digit:]].

[:digit:]    Only the digits 0 to 9
[:alnum:]    Any alphanumeric character: 0 to 9, A to Z or a to z
[:alpha:]    Any alpha character A to Z or a to z
[:blank:]    Space and TAB characters only
[:xdigit:]   Hexadecimal digits: 0 to 9, A to F or a to f
[:punct:]    Punctuation symbols such as . , " ' ? ! ; :
[:print:]    Any printable character
[:space:]    Any whitespace character
[:graph:]    Any printable character except the space
[:upper:]    Any alpha character A to Z
[:lower:]    Any alpha character a to z
[:cntrl:]    Control characters
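The classes above are also understood by PCRE, so they can be tried on a current PHP with preg_match(). A minimal sketch; the subjects are made-up sample values, and each class name sits inside an outer bracket expression:

```php
<?php
// [pattern, subject] pairs; the subjects are made-up sample values.
$checks = [
    ['^[[:digit:]]+$',  '2005'],      // digits only
    ['^[[:xdigit:]]+$', 'DEADbeef'],  // hexadecimal digits only
    ['^[[:upper:]]+$',  'Php'],       // fails: contains lower case letters
    ['^[[:alpha:]]+$',  'abc123'],    // fails: contains digits
];

$results = [];
foreach ($checks as [$pattern, $subject]) {
    // Note the doubled brackets: [[:digit:]], not [:digit:].
    $results[] = preg_match('/' . $pattern . '/', $subject);
    printf("%-18s %-10s %s\n", $pattern, $subject, end($results) ? 'match' : 'no match');
}
```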

mina86 at tlen dot pl
19-Oct-2003 05:14


I tested how fast POSIX and Perl regular expressions are, and here are the results:

           | POSIX Extended  | Perl-Compatible |   POSIX - Perl
-----------+-----------------+-----------------+-----------------
     match |    0.1296420097 |    0.1006720066 |    0.0289700031
   match i |    0.1204010248 |    0.1101620197 |    0.0102390051
   replace |    0.1896649599 |    0.1298999786 |    0.0597649813
 replace i |   10.6998120546 |    0.1453789473 |   10.5544331074

So, as you can see, preg_* functions are faster than ereg* functions. You can find source code of my test script here:

russlndr at online dot no
02-Jul-2003 01:55


The Regex Coach - interactive regular expressions:

tino at infeon dot com
11-Jun-2003 09:49


The book "Mastering Regular Expressions" is an invaluable resource.

Anand Thakur
25-Mar-2003 07:43


I saw a link to this page somewhere.  It is a library of user-submitted regular expressions for various things.  Some good stuff there.

Robin
15-Jan-2003 06:53


Ever wondered how to exclude "[" and "]"?

Here it goes: "[^][]". Extra characters to exclude can be added right in the middle like this: "[^]fobar[]".
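The same bracket placement works in PCRE: a ] that comes first in a character class (after an optional ^) is taken literally. A minimal sketch, assuming a current PHP; the sample strings are made up:

```php
<?php
// True only when the subject contains neither "[" nor "]".
$noBrackets = '/^[^][]*$/';
$plain   = preg_match($noBrackets, 'plain text');
$bracket = preg_match($noBrackets, 'a[b]');

// Extra excluded characters go between the leading "]" and the trailing "[".
$noFobar = '/^[^]fobar[]*$/';
$allowed  = preg_match($noFobar, 'xyz');
$rejected = preg_match($noFobar, 'foo');

// Without the ^, the class matches the brackets themselves, e.g. to strip them.
$stripped = preg_replace('/[][]/', '', 'a[b]c');

var_dump($plain, $bracket, $allowed, $rejected, $stripped);
```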

moc DOT liamtoh AT ssengnorw
18-Oct-2002 04:28


In PCRE, \s matches whitespace, both on its own and inside a character class (in a POSIX bracket expression, by contrast, the backslash is an ordinary character):

preg_match('/\s/', ' ');   // match

preg_match('/[\s]/', ' '); // match



Within a character class [:space:] is treated as a single character that matches any single whitespace character:



$pattern = '/[[:space:]]/';

$subject = "space tab\tnewline\n";

preg_match_all($pattern, $subject, $out) // == 3



To match a hyphen from within a character class, it must either be first or last; otherwise, it will act as a range operator.



Example: To match a blank string or a string containing only uppercase letters, underscores, spaces, and hyphens:



preg_match('/^[A-Z_ -]*$/', $subject)



To match any whitespace, not just spaces:



preg_match('/^[A-Z_[:space:]-]*$/', $subject)
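The patterns in this note can be checked directly. A minimal sketch, assuming a current PHP with PCRE; when run there, \s matches a space both on its own and inside a character class. The sample subjects are made up.

```php
<?php
// \s on its own and inside a character class.
$bare    = preg_match('/\s/', ' ');
$inClass = preg_match('/[\s]/', ' ');

// [[:space:]] counts the space, the tab and the newline.
$count = preg_match_all('/[[:space:]]/', "space tab\tnewline\n", $out);

// Hyphen placed last in the class, so it is literal rather than a range operator.
$pattern = '/^[A-Z_[:space:]-]*$/';
$upper = preg_match($pattern, "FOO_BAR\tBAZ-QUX");
$lower = preg_match($pattern, 'foo');

var_dump($bare, $inClass, $count, $upper, $lower);
```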

paper
09-Sep-2002 06:57


I have also experienced the same problem that an earlier poster had been experiencing, except I did not recognize the problem until after many hours of debugging.



"\s" does not seem to represent spaces, however "[[:space:]]" does.



Another problem I was having was matching dashes/hyphens '-'. You must escape them "\-" and place them at the end of a bracket expression.



Example: To match a blank string or a string containing only uppercase letters, underscores, spaces, and hyphens:



^([A-Z_\-]|[[:space:]])*$



Hope this saves someone some time from debugging like I was. :)

bps7j at yahoo dot com
22-Aug-2002 02:40


Something that really got me: I'm used to using Perl's regexps, and so I used \s to check for a whitespace character in a password on a website. My PHP book (Wrox Press, Professional PHP Programming) agreed with me that this is exactly the same as [ \r\n\t\f\v], but it's NOT. In fact, what it did was keep anyone from joining the site if they put an 's' in their password! So beware, check for subtle differences between what you're used to and PHP.



[[:space:]] works fine, by the way.



I'm going to use the pcre functions from now on... I like Perl :o)

david at NOgreenhammerSPAM dot com
09-Mar-2002 06:40


Sadly, the POSIX regexp evaluator (PHP 4.1.2) does not seem to support multi-character collating sequences, even though such sequences are included in the man-page documentation.



Specifically, the man-page discusses the expression "[[.ch.]]*c" which matches the first five characters of "chchcc".  Running this expression in ereg_replace generates the error "Warning: REG_ECOLLATE".  (Running an equivalent expression with only one character between the periods does work, however.)



Multi-character collating sequences are not supported!



This is really, really too bad, because it would have provided a simple way to exclude words from the target.



I'm going to go learn PCRE, now.  :-(

regex at dan42 dot cjb dot net
08-Mar-2002 06:33


Follow-up to my previous post:

Some simple optimization allowed me to realize that excluding a word at the beginning of a string has a degree of complexity O(n) rather than O(n^2). I only had to follow the logic:



if str[0] != badword[0] then OK

else

  if str[1] != badword[1] then OK

  else

    if str[2] != badword[2] then OK

    else ...



So excluding the word 'abc' at the beginning of a string is much simpler than I had made it out to be:

  ^([^a]|a[^b]|ab[^c])
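A minimal sketch that runs this pattern through preg_match() on a current PHP, next to the PCRE negative lookahead ^(?!abc). The two agree except on subjects shorter than the excluded word, because each alternative of the POSIX-style pattern has to consume a character that differs. The sample subjects are made up.

```php
<?php
// POSIX-style exclusion of a leading "abc", as given in the note.
$posixStyle = '/^([^a]|a[^b]|ab[^c])/';

// PCRE alternative: a negative lookahead consumes no characters.
$lookahead = '/^(?!abc)/';

$results = [];
foreach (['abcdef', 'abd', 'xbc', 'ab'] as $subject) {
    $results[$subject] = [
        preg_match($posixStyle, $subject),
        preg_match($lookahead, $subject),
    ];
    printf("%-7s posix-style=%d lookahead=%d\n", $subject, ...$results[$subject]);
}
```

The last subject, "ab", is the short-string case where the two patterns disagree.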

spiceee at potentialvalleys dot com
07-Mar-2002 06:26


sorry to be picky here but saying ^ is beginning of a line or $ is end of line is rather misleading, if you're working on a daily basis with regexes.



it might be that it is most of the time correct BUT in some occasions you'd be better off to think of ^ as "start of string" and $ as "end of string".



there are ways to make your regex engine forget about your system's notion of a newline, it's what is commonly referred to as multiline regexes...
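The difference is easy to see with the PCRE functions, where the m modifier switches ^ and $ from string anchors to line anchors. A minimal sketch, assuming a current PHP; the two-line subject is made up:

```php
<?php
// A subject with two lines.
$subject = "first\nsecond";

// Default mode: ^ only matches at the very start of the subject.
$default = preg_match('/^second$/', $subject);

// Multiline mode (m): ^ and $ also match next to each newline.
$multiline = preg_match('/^second$/m', $subject);

// \A and \z always mean start and end of the subject, whatever the mode.
$absolute = preg_match('/\Asecond\z/m', $subject);

var_dump($default, $multiline, $absolute);
```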

luciano_at_braziliantranslation.net
03-Mar-2002 08:15


mholdgate wrote a very nice quick reference guide in the next page, but I felt it could be improved a little:

________________



^          Start of line
$          End of line
n?         Zero or one occurrence of character 'n'
n*         Zero or more occurrences of character 'n'
n+         One or more occurrences of character 'n'
n{2}       Exactly two occurrences of 'n'
n{2,}      Two or more occurrences of 'n'
n{2,4}     From 2 to 4 occurrences of 'n'
.          Any single character
()         Parentheses to group expressions
(.*)       Zero or more occurrences of any single character, i.e. anything!
(n|a)      Either 'n' or 'a'
[1-6]      Any single digit in the range between 1 and 6
[c-h]      Any single lower case letter in the range between c and h
[D-M]      Any single upper case letter in the range between D and M
[^a-z]     Any single character EXCEPT any lower case letter between a and z.



        Pitfall: the ^ symbol only acts as an EXCEPT rule if it is the
        very first character inside a range, and it then denies the
        entire range, including the ^ symbol itself if it appears again
        later in the range. Outside a range, ^ means "start of line";
        it never negates what follows it.
        In other words, you cannot deny a word with ^undesired_word
        or a group with ^(undesired_phrase).
        Read more detailed regex documentation to find out what is
        necessary to achieve this.



[_4^a-zA-Z]    Any single character which can be the underscore or the
        number 4 or the ^ symbol or any letter, lower or upper case



?, +, * and the {} count parameters can be appended not only to a single character, but also to a group() or a range[].



therefore,

^.{2}[a-z]{1,2}_?[0-9]*([1-6]|[a-f])[^1-9]{2}a+$

would mean:



^.{2}            = A line beginning with any two characters,
[a-z]{1,2}       = followed by either 1 or 2 lower case letters,
_?               = followed by an optional underscore,
[0-9]*           = followed by zero or more digits,
([1-6]|[a-f])    = followed by either a digit between 1 and 6 OR a
                   lower case letter between a and f,
[^1-9]{2}        = followed by any two characters except digits
                   between 1 and 9 (0 is possible),
a+$              = followed by one or more occurrences of 'a'
                   at the end of a line.
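To see the breakdown in action, here is a minimal sketch that feeds the pattern to preg_match() on a current PHP. The two subjects are hypothetical, built piece by piece so that each fragment lines up with one part of the expression.

```php
<?php
// The composite pattern from the note, wrapped in PCRE delimiters.
$pattern = '/^.{2}[a-z]{1,2}_?[0-9]*([1-6]|[a-f])[^1-9]{2}a+$/';

// Two chars, two letters, underscore, digits, a letter a-f, two non-1-9 chars, then a's.
$matching = 'XY' . 'ab' . '_' . '42' . 'c' . '00' . 'aa';

// Same layout, but "11" breaks the [^1-9]{2} part.
$nonMatching = 'XY' . 'ab' . '_' . '42' . 'c' . '11' . 'aa';

$first  = preg_match($pattern, $matching);
$second = preg_match($pattern, $nonMatching);

var_dump($first, $second);
```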

regex at dan42 dot cjb dot net
21-Feb-2002 05:12


It's easy to exclude characters, but excluding words with a regular expression is a bit more tricky. For parentheses there is no equivalent to the ^ for brackets. The only way I've found to exclude a string is to proceed by inverse logic: accept all the words that do NOT correspond to the string. So if you want to accept all strings except those _beginning_ with "abc", you'd have to accept any string that matches one of the following:

  ^(ab[^c])
  ^(a[^b]c)
  ^(a[^b][^c])
  ^([^a]bc)
  ^([^a]b[^c])
  ^([^a][^b]c)
  ^([^a][^b][^c])



which, put together, gives the regex

  ^(ab[^c]|a[^b]c|a[^b][^c]|[^a]bc|[^a]b[^c]|[^a][^b]c|[^a][^b][^c])



Note that this won't work to detect the word "abc" anywhere in a string. You need to have some way of anchoring the inverse word match
like: ^(a[^b]|[^a]b|[^a][^b])   ;"ab" not at beginning of line
  or: (a[^b]|[^a]b|[^a][^b])$   ;"ab" not at end of line
  or: 123(a[^b]|[^a]b|[^a][^b]) ;"ab" not after "123"

I don't know why "(abc){0,0}" is invalid syntax. It would've made all this much simpler.
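With the PCRE functions, the anchored exclusions can be written with lookaround assertions instead of enumerating alternatives. A minimal sketch, assuming a current PHP, that compares the inverse-logic form with its lookaround counterpart for the "not at end" and "not after 123" cases; the sample subjects are made up.

```php
<?php
// "ab" not at the end of the string: inverse logic versus a negative lookbehind.
$endInverse    = '/(a[^b]|[^a]b|[^a][^b])$/';
$endLookbehind = '/(?<!ab)$/';

// "ab" not directly after "123": inverse logic versus a negative lookahead.
$afterInverse   = '/123(a[^b]|[^a]b|[^a][^b])/';
$afterLookahead = '/123(?!ab)/';

$end = [
    'xxab' => [preg_match($endInverse, 'xxab'), preg_match($endLookbehind, 'xxab')],
    'xxac' => [preg_match($endInverse, 'xxac'), preg_match($endLookbehind, 'xxac')],
];
$after = [
    '123ab' => [preg_match($afterInverse, '123ab'), preg_match($afterLookahead, '123ab')],
    '123ac' => [preg_match($afterInverse, '123ac'), preg_match($afterLookahead, '123ac')],
];

var_dump($end, $after);
```

Lookarounds consume no characters, so unlike the inverse-logic forms they do not require two characters to be present at the tested position.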

 

 

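Slightly off-topic, here's a regex date validator (format yyyy-mm-dd; remove all spaces and linefeeds before use). Note that it rejects 2000-02-29, because the leap-year alternative does not cover a two-digit year of 00: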

  ^(19|20)([0-9]{2}-((0[13-9]|1[0-2])-(0[1-9]|[12][0-9]|30)|

  (0[13578]|1[02])-31|02-(0[1-9]|1[0-9]|2[0-8]))|([2468]0|

  [02468][48]|[13579][26])-02-29)$
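A minimal sketch that joins the three lines into one pattern and compares the result with PHP's checkdate(), assuming a current PHP. The helper names and the sample dates are hypothetical. The comparison shows one disagreement: 2000-02-29 is a real date, but the pattern's leap-year branch has no case for a two-digit year of 00.

```php
<?php
// The validator from the note, with spaces and linefeeds removed.
$pattern = '/^(19|20)([0-9]{2}-((0[13-9]|1[0-2])-(0[1-9]|[12][0-9]|30)|'
         . '(0[13578]|1[02])-31|02-(0[1-9]|1[0-9]|2[0-8]))|([2468]0|'
         . '[02468][48]|[13579][26])-02-29)$/';

// Hypothetical helper: does the regex accept the date?
function regexAcceptsDate(string $date, string $pattern): bool
{
    return preg_match($pattern, $date) === 1;
}

// Hypothetical helper: does the calendar accept it? Uses checkdate(month, day, year).
function calendarAcceptsDate(string $date): bool
{
    if (!preg_match('/^(\d{4})-(\d{2})-(\d{2})$/', $date, $m)) {
        return false;
    }
    return checkdate((int) $m[2], (int) $m[3], (int) $m[1]);
}

$results = [];
foreach (['1999-11-30', '2004-02-29', '2001-02-29', '2005-04-31', '2000-02-29'] as $date) {
    $results[$date] = [regexAcceptsDate($date, $pattern), calendarAcceptsDate($date)];
    printf("%s regex=%d calendar=%d\n", $date, ...$results[$date]);
}
```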

03-Feb-2002 03:02


if you are looking for the abbreviations like tab, carriage return, regex-class definitions  



you should look here: 





some excerpts:



    \a    bell (control character)
    \b    backspace
    \f    form feed
    \n    line feed
    \r    carriage return
    \t    horizontal tab
    \v    vertical tab



class example

    \cLu    all uppercase letters

bart at framers dot nl
07-Mar-2001 02:53


Dario seems to have made a nice tutorial about regular expressions:











Thanks Dario! ...
