Reusable Code: regular expressions

Sunday, September 5, 2010

Roman Numerals, Part 4

Keith Alexander of Albuquerque, New Mexico writes:

“I like your Roman Numeral library. I needed a function to test for Roman numerals, so I wrote this one.
// Check to see if the string is a Roman Numeral
// NOTE: this doesn't check for fractions, overbars, the Bede "N" (zero) etc.
// NOTE: It also doesn't check for a well-formed Roman Numeral.
function is_roman_numeral( $roman )
{
    // Strip every non-word character
    // - A-Z, 0-9, apostrophe and underscore are what's left
    $roman = preg_replace( "/[^A-Z0-9_']/iu", "", $roman );
    // if it contains anything other than MDCLXVI, then it's not a Roman Numeral
    $result = preg_match( "/[^MDCLXVI]/u", $roman );
    if( $result )
    {
        return FALSE;
    }
    return TRUE;
}

Who knows if blogger is going to show it properly. If not, just contact me and I'll send it you in email or something. Anyway, it's something I wrote in 5 minutes. If you want to add it to your library, modified or otherwise, please feel free.”

Thanks for writing in, Keith, and sorry for the late response. There are two ways to validate a Roman numeral: using a regular expression, as you did, or converting it back to an Arabic number (if the conversion fails, it's not a Roman numeral).

I'm not sure about using a regular expression to remove non-word characters. My gut tells me that anything containing such characters should fail validation as a Roman number. Also, I would reverse the match and eliminate the if statement by directly returning the result of the match.

PHP

function isRoman($roman)
{
    // Anchor the pattern so that every character must be a Roman numeral symbol.
    return preg_match("/^[MDCLXVI]+$/iu", $roman) === 1;
}

ASP

function isRoman(roman)
    dim regEx
    set regEx = new RegExp
    with regEx
        .IgnoreCase = true
        .Global = true
        ' Anchor the pattern so that every character must be a Roman numeral symbol.
        .Pattern = "^[MDCLXVI]+$"
    end with
    isRoman = regEx.Test(roman)
    set regEx = nothing
end function
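
Keith's notes point out that a character check like this does not test for a well-formed numeral. As a rough sketch of a stricter test, the pattern below accepts only standard subtractive notation for the values 1 to 3999. The function name isWellFormedRoman and the supported range are assumptions for illustration, not part of the library.

```php
<?php
// Hypothetical helper, not part of the original library.
// Assumes standard subtractive notation and values from 1 to 3999.
function isWellFormedRoman($roman)
{
    // Thousands, hundreds, tens, and units, each limited to its allowed forms.
    $pattern = '/^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$/i';
    // Every group can match nothing, so the empty string is rejected explicitly.
    return $roman !== '' && preg_match($pattern, $roman) === 1;
}
```

With this sketch, MCMXCIV passes, while IIII and VX fail even though they contain only Roman numeral characters.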

Saturday, May 2, 2009

Regular Expressions

Over the past year, I've posted a lot of code that made use of regular expressions. When used appropriately, they can be very powerful. But regular expressions in ASP are a little more cumbersome than in PHP. Wouldn't it be great if ASP had the simplicity of PHP's regular expression functions?

Once again, the source code is too long to post here, so it will only be available on Snipplr. This library of functions includes the following:

ereg() - case-sensitive regular expression match
eregi() - case-insensitive regular expression match
ereg_replace() - case-sensitive regular expression replacement
eregi_replace() - case-insensitive regular expression replacement
sql_regcase() - make regular expression for case insensitive match

View ASP implementation on Snipplr

Saturday, August 23, 2008

isValidPostCode

In this final installment of the postal code trilogy, we turn our attention to the United Kingdom. Postal codes in the UK are called postcodes. They are similar to postal codes in Canada in that they contain both letters and numbers, but unlike Canadian postal codes, they are variable in length.

A postcode can have any of the following formats:

A9 9AA
A99 9AA
A9A 9AA
AA9 9AA
AA99 9AA
AA9A 9AA

To match all of these formats, we'll use the following regular expression: [A-PR-UWYZ]([0-9]{1,2}|([A-HK-Y][0-9]|[A-HK-Y][0-9]([0-9]|[ABEHMNPRV-Y]))|[0-9][A-HJKS-UW])\ [0-9][ABD-HJLNP-UW-Z]{2}. You may notice that, like Canadian postal codes, certain letters are only allowed in certain positions, or not at all.

There are also a few special cases that are valid postcodes but deviate from the regular format:

Girobank - (GIR\ 0AA)
Father Christmas - (SAN\ TA1)
British Forces Post Office - (BFPO\ (C\/O\ )?[0-9]{1,4})
Overseas territories - ((ASCN|BBND|[BFS]IQQ|PCRN|STHL|TDCU|TKCA)\ 1ZZ)
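
One way to keep a pattern this long readable is to assemble it from named pieces. The sketch below does that for the standard format and the four special cases. The function name buildPostCodePattern is hypothetical, and the pieces are copied from the expressions listed above.

```php
<?php
// Hypothetical helper: assembles the postcode pattern from named parts.
function buildPostCodePattern()
{
    // Standard format: outward code, a space, then the inward code.
    $standard = '[A-PR-UWYZ]([0-9]{1,2}|([A-HK-Y][0-9]|[A-HK-Y][0-9]([0-9]|[ABEHMNPRV-Y]))|[0-9][A-HJKS-UW])\ [0-9][ABD-HJLNP-UW-Z]{2}';
    $special = array(
        'GIR\ 0AA',                                      // Girobank
        'SAN\ TA1',                                      // Father Christmas
        'BFPO\ (C\/O\ )?[0-9]{1,4}',                     // British Forces Post Office
        '(ASCN|BBND|[BFS]IQQ|PCRN|STHL|TDCU|TKCA)\ 1ZZ', // Overseas territories
    );
    // Anchor the whole alternation and ignore case.
    return '/^(' . $standard . '|' . implode('|', $special) . ')$/i';
}
```

The result can be passed straight to preg_match(), for example preg_match(buildPostCodePattern(), 'SW1A 1AA').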

ASP

function isValidPostCode(postCode)
dim regEx
set regEx = new RegExp
with regEx
.IgnoreCase = true
.Global = true
.Pattern = "^([A-PR-UWYZ]([0-9]{1,2}|([A-HK-Y][0-9]|[A-HK-Y][0-9]([0-9]|[ABEHMNPRV-Y]))|[0-9][A-HJKS-UW])\ [0-9][ABD-HJLNP-UW-Z]{2}|(GIR\ 0AA)|(SAN\ TA1)|(BFPO\ (C\/O\ )?[0-9]{1,4})|((ASCN|BBND|[BFS]IQQ|PCRN|STHL|TDCU|TKCA)\ 1ZZ))$"
end with
if regEx.Test(trim(CStr(postCode))) then
isValidPostCode = true
else
isValidPostCode = false
end if
set regEx = nothing
end function

PHP

function isValidPostCode($postCode)
{
$pattern = "/^([A-PR-UWYZ]([0-9]{1,2}|([A-HK-Y][0-9]|[A-HK-Y][0-9]([0-9]|[ABEHMNPRV-Y]))|[0-9][A-HJKS-UW])\ [0-9][ABD-HJLNP-UW-Z]{2}|(GIR\ 0AA)|(SAN\ TA1)|(BFPO\ (C\/O\ )?[0-9]{1,4})|((ASCN|BBND|[BFS]IQQ|PCRN|STHL|TDCU|TKCA)\ 1ZZ))$/i";
return (preg_match($pattern, trim($postCode)) > 0) ? true : false;
}

Saturday, August 16, 2008

isValidPostalCode

This week we're turning our attention to my own country, Canada, and writing a function to validate postal codes. Unlike a ZIP code, a postal code contains letters too; the format is A1A 1A1. The simplest regular expression to validate this would be:

^[A-Z]{1}[\d]{1}[A-Z]{1}[ ]?[\d]{1}[A-Z]{1}[\d]{1}$

But not all letters are used, and some letters can only appear in certain positions. Taking this into account gives us a slightly more complex pattern of:

^[ABCEGHJ-NPRSTVXY]{1}[0-9]{1}[ABCEGHJ-NPRSTV-Z]{1}[ ]?[0-9]{1}[ABCEGHJ-NPRSTV-Z]{1}[0-9]{1}$
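
As a minimal sketch, the pattern above could be wrapped in a function like the one below. The name isValidPostalCodeFormat is hypothetical, the redundant {1} quantifiers are dropped, and trimming and upper-casing the input first is an assumption about how callers want to be treated.

```php
<?php
// Hypothetical helper: checks the format only, not whether the postal area exists.
function isValidPostalCodeFormat($postalCode)
{
    $pattern = '/^[ABCEGHJ-NPRSTVXY][0-9][ABCEGHJ-NPRSTV-Z][ ]?[0-9][ABCEGHJ-NPRSTV-Z][0-9]$/';
    // Assumption: surrounding whitespace and letter case should not matter.
    return preg_match($pattern, strtoupper(trim($postalCode))) === 1;
}
```

Here K1A 0B1 and k1a0b1 both pass, while D1A 1A1 fails because D is never used.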

But I want to take this a step further and check the validity of the combination of the first three characters. The postal code C2B 4S3 has a valid format, but the code itself is not valid because there is no C2B postal area. I also want to check the postal code against the province to ensure that they match.
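
The province check can be roughed out from the first letter alone, since each first letter is assigned to a province or territory. The sketch below is only an illustration of that idea: the function name postalCodeMatchesProvince is hypothetical, and the lookup table reflects the commonly published first-letter assignments, so treat it as an assumption rather than the library's own data.

```php
<?php
// Hypothetical helper: maps the first letter of a postal code to province codes.
function postalCodeMatchesProvince($postalCode, $province)
{
    $map = array(
        'A' => array('NL'), 'B' => array('NS'), 'C' => array('PE'), 'E' => array('NB'),
        'G' => array('QC'), 'H' => array('QC'), 'J' => array('QC'),
        'K' => array('ON'), 'L' => array('ON'), 'M' => array('ON'),
        'N' => array('ON'), 'P' => array('ON'),
        'R' => array('MB'), 'S' => array('SK'), 'T' => array('AB'), 'V' => array('BC'),
        'X' => array('NT', 'NU'), // shared by two territories
        'Y' => array('YT'),
    );
    $first = strtoupper(substr(trim($postalCode), 0, 1));
    // Unknown first letters never match any province.
    return isset($map[$first]) && in_array(strtoupper($province), $map[$first], true);
}
```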

The full source code is too long to post here, but is available in its entirety from Snipplr in the language of your choice.

Alternatively, if you wanted the extended validation, but didn't care about the province matching, you could combine everything into one gigantic regular expression:

^(A(0[ABCEGHJ-NPR]|1[ABCEGHK-NSV-Y]|2[ABHNV]|5[A]|8[A])|B(0[CEHJ-NPRSTVW]|1[ABCEGHJ-NPRSTV-Y]|2[ABCEGHJNRSTV-Z]|3[ABEGHJ-NPRSTVZ]|4[ABCEGHNPRV]|5[A]|6[L]|9[A])|C(0[AB]|1[ABCEN])|E(1[ABCEGHJNVWX]|2[AEGHJ-NPRSV]|3[ABCELNVYZ]|4[ABCEGHJ-NPRSTV-Z]|5[ABCEGHJ-NPRSTV]|6[ABCEGHJKL]|7[ABCEGHJ-NP]|8[ABCEGJ-NPRST]|9[ABCEGH])|G(0[ACEGHJ-NPRSTV-Z]|1[ABCEGHJ-NPRSTV-Y]|2[ABCEGJ-N]|3[ABCEGHJ-NZ]|4[ARSTVWXZ]|5[ABCHJLMNRTVXYZ]|6[ABCEGHJKLPRSTVWXZ]|7[ABGHJKNPSTXYZ]|8[ABCEGHJ-NPTVWYZ]|9[ABCHNPRTX])|H(0[HM]|1[ABCEGHJ-NPRSTV-Z]|2[ABCEGHJ-NPRSTV-Z]|3[ABCEGHJ-NPRSTV-Z]|4[ABCEGHJ-NPRSTV-Z]|5[AB]|7[ABCEGHJ-NPRSTV-Y]|8[NPRSTYZ]|9[ABCEGHJKPRSWX])|J(0[ABCEGHJ-NPRSTV-Z]|1[ACEGHJ-NRSTXZ]|2[ABCEGHJ-NRSTWXY]|3[ABEGHLMNPRTVXYZ]|4[BGHJ-NPRSTV-Z]|5[ABCJ-MRTV-Z]|6[AEJKNRSTVWYXZ]|7[ABCEGHJ-NPRTV-Z]|8[ABCEGHLMNPRTVXYZ]|9[ABEHJLNTVXYZ])|K(0[ABCEGHJ-M]|1[ABCEGHJ-NPRSTV-Z]|2[ABCEGHJ-MPRSTVW]|4[ABCKMPR]|6[AHJKTV]|7[ACGHK-NPRSV]|8[ABHNPRV]|9[AHJKLV])|L(0[ABCEGHJ-NPRS]|1[ABCEGHJ-NPRSTV-Z]|2[AEGHJMNPRSTVW]|3[BCKMPRSTVXYZ]|4[ABCEGHJ-NPRSTV-Z]|5[ABCEGHJ-NPRSTVW]|6[ABCEGHJ-MPRSTV-Z]|7[ABCEGJ-NPRST]|8[EGHJ-NPRSTVW]|9[ABCGHK-NPRSTVWYZ])|M(1[BCEGHJ-NPRSTVWX]|2[HJ-NPR]|3[ABCHJ-N]|4[ABCEGHJ-NPRSTV-Y]|5[ABCEGHJ-NPRSTVWX]|6[ABCEGHJ-NPRS]|7[AY]|8[V-Z]|9[ABCLMNPRVW])|N(0[ABCEGHJ-NPR]|1[ACEGHKLMPRST]|2[ABCEGHJ-NPRTVZ]|3[ABCEHLPRSTVWY]|4[BGKLNSTVWXZ]|5[ACHLPRV-Z]|6[ABCEGHJ-NP]|7[AGLMSTVWX]|8[AHMNPRSTV-Y]|9[ABCEGHJKVY])|P(0[ABCEGHJ-NPRSTV-Y]|1[ABCHLP]|2[ABN]|3[ABCEGLNPY]|4[NPR]|5[AEN]|6[ABC]|7[ABCEGJKL]|8[NT]|9[AN])|R(0[ABCEGHJ-M]|1[ABN]|2[CEGHJ-NPRV-Y]|3[ABCEGHJ-NPRSTV-Y]|4[AHJKL]|5[AGH]|6[MW]|7[ABCN]|8[AN]|9[A])|S(0[ACEGHJ-NP]|2[V]|3[N]|4[AHLNPRSTV-Z]|6[HJKVWX]|7[HJ-NPRSTVW]|9[AHVX])|T(0[ABCEGHJ-MPV]|1[ABCGHJ-MPRSV-Y]|2[ABCEGHJ-NPRSTV-Z]|3[ABCEGHJ-NPRZ]|4[ABCEGHJLNPRSTVX]|5[ABCEGHJ-NPRSTV-Z]|6[ABCEGHJ-NPRSTVWX]|7[AENPSVXYZ]|8[ABCEGHLNRSVWX]|9[ACEGHJKMNSVWX])|V(0[ABCEGHJ-NPRSTVWX]|1[ABCEGHJ-NPRSTV-Z]|2[ABCEGHJ-NPRSTV-Z]|3[ABCEGHJ-NRSTV-Y]|4[ABCEGK-NPRSTVWXZ]|5[ABCEGHJ-NPRSTV-Z]|6[ABCEGHJ-NPRSTV-Z]|7[ABCEGHJ-NPRSTV-Y]|8[ABCGJ-NPRSTV-Z]|9[ABCEGHJ-NPRSTV-Z])|X(0[ABCGX]|1[A])|Y(0[AB]|1[A]))[ ]?[0-9]{1}[ABCEGHJ-NPRSTV-Z]{1}[0-9]{1}$

Saturday, August 9, 2008

isValidZipCode

Another week, another validation function. Last week's post was a little long, so this week we'll do a shorter one: validating a United States ZIP code. The pattern is simple enough that I don't think an explanation is warranted.

ASP

function isValidZIPCode(zipCode)
dim regEx
set regEx = new RegExp
with regEx
.IgnoreCase = True
.Global = True
.Pattern = "^[0-9]{5}(-[0-9]{4})?$"
end with
if regEx.Test(trim(CStr(zipCode))) then
isValidZipCode = True
else
isValidZipCode = False
end if
set regEx = nothing
end function

PHP

function isValidZIPCode($zipCode)
{
return (preg_match("/^[0-9]{5}(-[0-9]{4})?$/i", trim($zipCode)) > 0) ? true : false;
}

Saturday, August 2, 2008

isValidEmail

This week we're going to build on last week's IP address regular expression to validate e-mail addresses. What do IP addresses have to do with e-mail addresses? Just as domain names map to IP addresses, the domain part of an e-mail address can be replaced with an IP address, so instead of person@example.com you could have person@192.168.1.1.

ASP

function isValidEmail(email)
dim regEx
dim result
set regEx = new RegExp
with regEx
.IgnoreCase = True
.Global = True
.Pattern = "^[^@]{1,64}@[^@]{1,255}$"
end with
result = false
' Test length.
if regEx.Test(email) then
' VBScript escapes a quotation mark inside a string by doubling it, so \" is written as \"".
regEx.Pattern = "^((([\w\+\-]+)(\.[\w\+\-]+)*)|(\""[^(\\|\"")]{0,62}\""))@(([a-zA-Z0-9\-]+\.)+([a-zA-Z0-9]{2,})|\[?([1]?\d{1,2}|2[0-4]{1}\d{1}|25[0-5]{1})(\.([1]?\d{1,2}|2[0-4]{1}\d{1}|25[0-5]{1})){3}\]?)$"
' Test syntax.
if regEx.Test(email) then
result = true
end if
end if
isValidEmail = result
set regEx = nothing
end function

PHP

function isValidEmail($email)
{
$lengthPattern = "/^[^@]{1,64}@[^@]{1,255}$/";
$syntaxPattern = "/^((([\w\+\-]+)(\.[\w\+\-]+)*)|(\"[^(\\|\")]{0,62}\"))@(([a-zA-Z0-9\-]+\.)+([a-zA-Z0-9]{2,})|\[?([1]?\d{1,2}|2[0-4]{1}\d{1}|25[0-5]{1})(\.([1]?\d{1,2}|2[0-4]{1}\d{1}|25[0-5]{1})){3}\]?)$/";
return ((preg_match($lengthPattern, $email) > 0) && (preg_match($syntaxPattern, $email) > 0)) ? true : false;
}

The validation is broken down into two steps: checking the length of each part, and checking the syntax of each part.

^[^@]{1,64}@[^@]{1,255}$

The part before the @ symbol is called the local part, and cannot exceed 64 characters. The part after the @ symbol is called the domain part, and cannot exceed 255 characters.

^((([\w\+\-]+)(\.[\w\+\-]+)*)|(\"[^(\\|\")]{0,62}\"))@(([a-zA-Z0-9\-]+\.)+([a-zA-Z0-9]{2,})|\[?([1]?\d{1,2}|2[0-4]{1}\d{1}|25[0-5]{1})(\.([1]?\d{1,2}|2[0-4]{1}\d{1}|25[0-5]{1})){3}\]?)$

In the check for syntax, the local part is validated by ((([\w\+\-]+)(\.[\w\+\-]+)*)|(\"[^(\\|\")]{0,62}\")). This is actually two patterns separated by the pipe character. The first, (([\w\+\-]+)(\.[\w\+\-]+)*), allows letters, numbers, the underscore, the plus sign, and the hyphen (or minus sign if you prefer). It also allows periods, but not as the first or last character and not twice in a row. The second, (\"[^(\\|\")]{0,62}\"), allows just about anything, provided the local part is enclosed in quotation marks (which is valid, but you'll probably never encounter it).

The domain part is validated by (([a-zA-Z0-9\-]+\.)+([a-zA-Z0-9]{2,})|\[?([1]?\d{1,2}|2[0-4]{1}\d{1}|25[0-5]{1})(\.([1]?\d{1,2}|2[0-4]{1}\d{1}|25[0-5]{1})){3}\]?). Once again, this is two different patterns separated by the pipe character. The second pattern is our IP address checker from last week with optional enclosure in square brackets. The first pattern, ([a-zA-Z0-9\-]+\.)+([a-zA-Z0-9]{2,}), allows a slightly smaller range of characters (no plus signs or underscores), any number of subdomains, and a top-level domain of at least 2 characters (the minimum). Some regular expressions will impose a maximum of six characters on the top-level domain (the longest at the moment is .museum), but that wouldn't allow for longer top-level domains that could be created in the future.
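
To make the structure easier to follow, here is a rough sketch that rebuilds the same two-step check from named pieces. The function name isValidEmailSketch and the variable names are hypothetical. The sub-patterns are the ones described above, written in single-quoted strings so that the backslashes reach the regular expression engine unchanged.

```php
<?php
// Hypothetical helper: the two-step check, assembled from named sub-patterns.
function isValidEmailSketch($email)
{
    $octet = '([1]?\d{1,2}|2[0-4]{1}\d{1}|25[0-5]{1})';
    $atom  = '[\w\+\-]+';
    // Local part: dot-separated atoms, or a quoted string.
    $local = '((' . $atom . ')(\.' . $atom . ')*|\"[^(\\\\|\")]{0,62}\")';
    // Domain part: a host name, or an IP address with optional square brackets.
    $host  = '([a-zA-Z0-9\-]+\.)+([a-zA-Z0-9]{2,})';
    $ip    = '\[?' . $octet . '(\.' . $octet . '){3}\]?';

    $lengthPattern = '/^[^@]{1,64}@[^@]{1,255}$/';
    $syntaxPattern = '/^' . $local . '@(' . $host . '|' . $ip . ')$/';

    return preg_match($lengthPattern, $email) === 1
        && preg_match($syntaxPattern, $email) === 1;
}
```

With this sketch, person@example.com and person@[192.168.1.1] pass, while .person@example.com and person.@example.com fail because of the misplaced period.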

Saturday, July 26, 2008

isValidIP

As promised, more regular expression fun. This week we're going to validate IP addresses. An IP address consists of four octets separated by periods. A lazy person might be inclined to use a regular expression of \d{1,3} for each octet, but that would allow numbers larger than 255. A more complex expression is needed: ([1]?\d{1,2}|2[0-4]{1}\d{1}|25[0-5]{1}).

This expression consists of three parts separated by the pipe character "|". The first part, [1]?\d{1,2}, matches numbers between 0 and 199. The second part, 2[0-4]{1}\d{1}, matches numbers between 200 and 249. The third part, 25[0-5]{1}, matches numbers between 250 and 255. We will repeat this pattern four times, once for each octet, and separate with periods.
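
The repetition described above can be sketched by defining the octet once and concatenating it. The function name buildIpPattern is hypothetical, and the octet expression is the one from the text.

```php
<?php
// Hypothetical helper: builds the full pattern from a single octet sub-pattern.
function buildIpPattern()
{
    // 0-199, 200-249, or 250-255.
    $octet = '([1]?\d{1,2}|2[0-4]{1}\d{1}|25[0-5]{1})';
    // One octet, then three more that are each preceded by a period.
    return '/^' . $octet . '(\.' . $octet . '){3}$/';
}
```

For example, preg_match(buildIpPattern(), '192.168.1.1') returns 1, while '256.1.1.1' and '1.2.3' both return 0.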

ASP

function isValidIP(ip)
dim regEx
set regEx = new RegExp
with regEx
.IgnoreCase = True
.Global = True
.Pattern = "^([1]?\d{1,2}|2[0-4]{1}\d{1}|25[0-5]{1})(\.([1]?\d{1,2}|2[0-4]{1}\d{1}|25[0-5]{1})){3}$"
end with
if regEx.Test(trim(CStr(ip))) then
isValidIP = true
else
isValidIP = false
end if
set regEx = nothing
end function

PHP

function isValidIP($ip)
{
$pattern = "/^([1]?\d{1,2}|2[0-4]{1}\d{1}|25[0-5]{1})(\.([1]?\d{1,2}|2[0-4]{1}\d{1}|25[0-5]{1})){3}$/";
return (preg_match($pattern, $ip) > 0) ? true : false;
}

Next week we're going to build on this pattern to validate something else. I wonder what that could be?

Saturday, July 19, 2008

isAlpha(Numeric)

We're going to take a break from math-related functions for a few weeks (yay!) and play with regular expressions. Regular expressions are usually more concise and more flexible than old-fashioned string parsing.

Both ASP and PHP have a function for checking if something is numeric. How about a function for checking if something is alphabetical?

ASP

function isAlpha(someString)
    dim regEx
    set regEx = new RegExp
    with regEx
        .Global = true
        .IgnoreCase = true
        ' Anchored so that every character must be a letter, underscore, or whitespace.
        .Pattern = "^[A-Z\s_]+$"
    end with
    isAlpha = regEx.test(someString)
    set regEx = nothing
end function

PHP

function is_alpha($someString)
{
    // Anchored so that every character must be a letter, underscore, or whitespace.
    return preg_match("/^[A-Z\s_]+$/i", $someString) === 1;
}

The test pattern we are using above will allow letters of the alphabet, the underscore character, and whitespace characters. With a small tweak to the test pattern, we can also write a function to check if a string is alphanumeric.

ASP

function isAlphaNumeric(someString)
    dim regEx
    set regEx = new RegExp
    with regEx
        .Global = true
        .IgnoreCase = true
        ' Anchored so that every character must be a word character, whitespace, or a period.
        .Pattern = "^[\w\s.]+$"
    end with
    isAlphaNumeric = regEx.test(someString)
    set regEx = nothing
end function

PHP

function is_alphanumeric($someString)
{
    // Anchored so that every character must be a word character, whitespace, or a period.
    return preg_match("/^[\w\s.]+$/i", $someString) === 1;
}

The \w shorthand in the pattern matches the 26 letters of the alphabet, the digits zero through nine, and the underscore. The pattern also allows whitespace characters and the period.
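
Whether a pattern is anchored makes a big difference to what a test like this means. The sketch below contrasts the two behaviours; both function names are hypothetical and exist only for this comparison.

```php
<?php
// Unanchored: true as soon as one allowed character appears anywhere.
function containsAlpha($someString)
{
    return preg_match('/[A-Z\s_]/i', $someString) === 1;
}

// Anchored with ^ and $: true only when every character is allowed.
function isEntirelyAlpha($someString)
{
    return preg_match('/^[A-Z\s_]+$/i', $someString) === 1;
}
```

For the input abc123, containsAlpha() returns true because the string contains letters, while isEntirelyAlpha() returns false because of the digits.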

More regular expression fun next week!
