Showing posts with label regular expressions. Show all posts

Sunday, September 5, 2010

Roman Numerals, Part 4

Keith Alexander of Albuquerque, New Mexico writes:
“I like your Roman Numeral library. I needed a function to test for Roman numerals, so I wrote this one.
    // Check to see if the string is a Roman Numeral
    // NOTE: this doesn't check for fractions, overbars, the Bede "N" (zero) etc.
    // NOTE: It also doesn't check for a well-formed Roman Numeral.
    function is_roman_numeral( $roman )
    {
        // Strip every non-word character
        // - A-Z, 0-9, apostrophe and underscore are what's left
        $roman = preg_replace( "/[^A-Z0-9_']/iu", "", $roman );
        // if it contains anything other than MDCLXVI, then it's not a Roman Numeral
        $result = preg_match( "/[^MDCLXVI]/u", $roman );
        if( $result )
        {
            return FALSE;
        }
        return TRUE;
    }

Who knows if blogger is going to show it properly. If not, just contact me and I'll send it you in email or something. Anyway, it's something I wrote in 5 minutes. If you want to add it to your library, modified or otherwise, please feel free.”

Thanks for writing in, Keith, and sorry for the late response. There are two ways to validate a Roman number: using regular expressions like you did, or converting it back to an Arabic number (if the conversion fails, it's not a Roman number).

I'm not sure about using a regular expression to remove non-word characters. My gut tells me that anything containing such characters should fail validation as a Roman number. Also, I would reverse the match and eliminate the if statement by directly returning the result of the match.

PHP

    function isRoman($roman)
    {
        // Anchored so that the whole string must consist of Roman numeral letters
        return preg_match('/^[MDCLXVI]+$/u', $roman) === 1;
    }

ASP

    function isRoman(roman)
        dim regEx
        set regEx = new RegExp
        with regEx
            .IgnoreCase = true
            .Global = true
            ' Anchored so that the whole string must consist of Roman numeral letters
            .Pattern = "^[MDCLXVI]+$"
        end with
        ' Return the result of the match directly
        isRoman = regEx.Test(roman)
        set regEx = nothing
    end function
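
Keith's note points out that a character check like this does not test for a well-formed Roman numeral. As a rough sketch of what a stricter check could look like, the pattern below accepts only standard subtractive notation for 1 to 3999. The function name isWellFormedRoman is made up for this example and is not part of the Roman Numeral library.

```php
<?php
// Hypothetical helper (not part of the original library): checks that a
// string is a well-formed Roman numeral between 1 and 3999.
function isWellFormedRoman(string $roman): bool
{
    // Thousands, hundreds, tens and units, each in standard subtractive form.
    $pattern = '/^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$/';

    // Every group is optional, so the empty string must be rejected explicitly.
    return $roman !== '' && preg_match($pattern, $roman) === 1;
}

var_dump(isWellFormedRoman('MCMXCIV')); // bool(true)
var_dump(isWellFormedRoman('IIII'));    // bool(false)
```

This sketch is case-sensitive and ignores overbars, fractions and the Bede "N", just like the functions above.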


Saturday, May 2, 2009

Regular Expressions

Over the past year, I've posted a lot of code that made use of regular expressions. When used appropriately, they can be very powerful. But regular expressions in ASP are a little more cumbersome than in PHP. Wouldn't it be great if ASP had the simplicity of PHP's regular expression functions?


Once again, the source code is too long to post here, so it will only be available on Snipplr. This library of functions includes the following:

  • ereg() - case-sensitive regular expression match
  • eregi() - case-insensitive regular expression match
  • ereg_replace() - case-sensitive regular expression replacement
  • eregi_replace() - case-insensitive regular expression replacement
  • sql_regcase() - make regular expression for case insensitive match
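
Of the functions listed above, sql_regcase() is probably the least familiar, so here is a minimal PHP sketch of the behaviour it describes. PHP itself removed the ereg family in version 7.0, so regcaseSketch is a hypothetical stand-in written for this example, and it only handles ASCII letters.

```php
<?php
// Hypothetical stand-in for the old sql_regcase(): wraps each ASCII letter
// in a bracket expression holding its upper- and lower-case forms.
function regcaseSketch(string $text): string
{
    $out = '';
    foreach (str_split($text) as $char) {
        $out .= ctype_alpha($char)
            ? '[' . strtoupper($char) . strtolower($char) . ']'
            : $char; // non-letters pass through unchanged
    }
    return $out;
}

echo regcaseSketch('Foo - bar'), "\n"; // [Ff][Oo][Oo] - [Bb][Aa][Rr]
```

The result can be used where a case-insensitive match is wanted but the regular expression engine offers no case-insensitive switch.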

View ASP implementation on Snipplr

Saturday, August 23, 2008

isValidPostCode

In this final installment of the postal code trilogy, we turn our attention to the United Kingdom. Postal codes in the UK are called postcodes. They are similar to postal codes in Canada in that they contain both letters and numbers, but unlike Canadian postal codes, they are variable in length.


A postcode can have any of the following formats:

  • A9 9AA
  • A99 9AA
  • A9A 9AA
  • AA9 9AA
  • AA99 9AA
  • AA9A 9AA

To match all of these formats, we'll use the following regular expression: [A-PR-UWYZ]([0-9]{1,2}|([A-HK-Y][0-9]|[A-HK-Y][0-9]([0-9]|[ABEHMNPRV-Y]))|[0-9][A-HJKS-UW])\ [0-9][ABD-HJLNP-UW-Z]{2}. You may notice that, like Canadian postal codes, certain letters are only allowed in certain positions, or not at all.


There are also a few special cases that are valid postcodes but deviate from the regular format:

  • Girobank - (GIR\ 0AA)
  • Father Christmas - (SAN\ TA1)
  • British Forces Post Office - (BFPO\ (C\/O\ )?[0-9]{1,4})
  • Overseas territories - ((ASCN|BBND|[BFS]IQQ|PCRN|STHL|TDCU|TKCA)\ 1ZZ)
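
The standard pattern and the special cases above can be joined into a single anchored alternation. The sketch below shows one way to assemble it in PHP; the function name buildPostcodePattern is an assumption made for this example.

```php
<?php
// Sketch: assemble the full pattern from the pieces listed above.
// The function name is illustrative only.
function buildPostcodePattern(): string
{
    $parts = [
        // Standard formats (A9 9AA through AA9A 9AA)
        '[A-PR-UWYZ]([0-9]{1,2}|([A-HK-Y][0-9]|[A-HK-Y][0-9]([0-9]|[ABEHMNPRV-Y]))|[0-9][A-HJKS-UW])\ [0-9][ABD-HJLNP-UW-Z]{2}',
        // Girobank
        'GIR\ 0AA',
        // Father Christmas
        'SAN\ TA1',
        // British Forces Post Office
        'BFPO\ (C\/O\ )?[0-9]{1,4}',
        // Overseas territories
        '(ASCN|BBND|[BFS]IQQ|PCRN|STHL|TDCU|TKCA)\ 1ZZ',
    ];

    // Anchor the alternation so the whole string must match one of the forms.
    return '/^(' . implode('|', $parts) . ')$/i';
}

$pattern = buildPostcodePattern();
var_dump(preg_match($pattern, 'SW1A 1AA') === 1); // bool(true)
var_dump(preg_match($pattern, 'Q1 1AA') === 1);   // bool(false)
```

Keeping the pieces in an array makes it easier to see which alternative is responsible for accepting a given postcode.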

ASP

    function isValidPostCode(postCode)
        dim regEx
        set regEx = new RegExp
        with regEx
            .IgnoreCase = true
            .Global = true
            .Pattern = "^([A-PR-UWYZ]([0-9]{1,2}|([A-HK-Y][0-9]|[A-HK-Y][0-9]([0-9]|[ABEHMNPRV-Y]))|[0-9][A-HJKS-UW])\ [0-9][ABD-HJLNP-UW-Z]{2}|(GIR\ 0AA)|(SAN\ TA1)|(BFPO\ (C\/O\ )?[0-9]{1,4})|((ASCN|BBND|[BFS]IQQ|PCRN|STHL|TDCU|TKCA)\ 1ZZ))$"
        end with
        if regEx.Test(trim(CStr(postCode))) then
            isValidPostCode = true
        else
            isValidPostCode = false
        end if
        set regEx = nothing
    end function

PHP

    function isValidPostCode($postCode)
    {
        $pattern = "/^([A-PR-UWYZ]([0-9]{1,2}|([A-HK-Y][0-9]|[A-HK-Y][0-9]([0-9]|[ABEHMNPRV-Y]))|[0-9][A-HJKS-UW])\ [0-9][ABD-HJLNP-UW-Z]{2}|(GIR\ 0AA)|(SAN\ TA1)|(BFPO\ (C\/O\ )?[0-9]{1,4})|((ASCN|BBND|[BFS]IQQ|PCRN|STHL|TDCU|TKCA)\ 1ZZ))$/i";
        return (preg_match($pattern, trim($postCode)) > 0) ? true : false;
    }

Saturday, August 16, 2008

isValidPostalCode

This week we're turning our attention to my own country, Canada, and writing a function to validate postal codes. Unlike a ZIP code, a postal code contains letters too; the format is A1A 1A1. The simplest regular expression to validate this would be:

^[A-Z]{1}[\d]{1}[A-Z]{1}[ ]?[\d]{1}[A-Z]{1}[\d]{1}$

But not all letters are used, and some letters can only appear in certain positions. Taking this into account gives us a slightly more complex pattern of:

^[ABCEGHJ-NPRSTVXY]{1}[0-9]{1}[ABCEGHJ-NPRSTV-Z]{1}[ ]?[0-9]{1}[ABCEGHJ-NPRSTV-Z]{1}[0-9]{1}$


But I want to take this a step further and check the validity of the combination of the first three characters. The postal code C2B 4S3 has a valid format, but the code itself is not valid because there is no C2B postal area. I also want to check the postal code against the province to ensure that they match.


The full source code is too long to post here, but is available in its entirety from Snipplr in the language of your choice:


Alternatively, if you wanted the extended validation, but didn't care about the province matching, you could combine everything into one gigantic regular expression:


^(A(0[ABCEGHJ-NPR]|1[ABCEGHK-NSV-Y]|2[ABHNV]|5[A]|8[A])|B(0[CEHJ-NPRSTVW]|1[ABCEGHJ-NPRSTV-Y]|2[ABCEGHJNRSTV-Z]|3[ABEGHJ-NPRSTVZ]|4[ABCEGHNPRV]|5[A]|6[L]|9[A])|C(0[AB]|1[ABCEN])|E(1[ABCEGHJNVWX]|2[AEGHJ-NPRSV]|3[ABCELNVYZ]|4[ABCEGHJ-NPRSTV-Z]|5[ABCEGHJ-NPRSTV]|6[ABCEGHJKL]|7[ABCEGHJ-NP]|8[ABCEGJ-NPRST]|9[ABCEGH])|G(0[ACEGHJ-NPRSTV-Z]|1[ABCEGHJ-NPRSTV-Y]|2[ABCEGJ-N]|3[ABCEGHJ-NZ]|4[ARSTVWXZ]|5[ABCHJLMNRTVXYZ]|6[ABCEGHJKLPRSTVWXZ]|7[ABGHJKNPSTXYZ]|8[ABCEGHJ-NPTVWYZ]|9[ABCHNPRTX])|H(0[HM]|1[ABCEGHJ-NPRSTV-Z]|2[ABCEGHJ-NPRSTV-Z]|3[ABCEGHJ-NPRSTV-Z]|4[ABCEGHJ-NPRSTV-Z]|5[AB]|7[ABCEGHJ-NPRSTV-Y]|8[NPRSTYZ]|9[ABCEGHJKPRSWX])|J(0[ABCEGHJ-NPRSTV-Z]|1[ACEGHJ-NRSTXZ]|2[ABCEGHJ-NRSTWXY]|3[ABEGHLMNPRTVXYZ]|4[BGHJ-NPRSTV-Z]|5[ABCJ-MRTV-Z]|6[AEJKNRSTVWYXZ]|7[ABCEGHJ-NPRTV-Z]|8[ABCEGHLMNPRTVXYZ]|9[ABEHJLNTVXYZ])|K(0[ABCEGHJ-M]|1[ABCEGHJ-NPRSTV-Z]|2[ABCEGHJ-MPRSTVW]|4[ABCKMPR]|6[AHJKTV]|7[ACGHK-NPRSV]|8[ABHNPRV]|9[AHJKLV])|L(0[ABCEGHJ-NPRS]|1[ABCEGHJ-NPRSTV-Z]|2[AEGHJMNPRSTVW]|3[BCKMPRSTVXYZ]|4[ABCEGHJ-NPRSTV-Z]|5[ABCEGHJ-NPRSTVW]|6[ABCEGHJ-MPRSTV-Z]|7[ABCEGJ-NPRST]|8[EGHJ-NPRSTVW]|9[ABCGHK-NPRSTVWYZ])|M(1[BCEGHJ-NPRSTVWX]|2[HJ-NPR]|3[ABCHJ-N]|4[ABCEGHJ-NPRSTV-Y]|5[ABCEGHJ-NPRSTVWX]|6[ABCEGHJ-NPRS]|7[AY]|8[V-Z]|9[ABCLMNPRVW])|N(0[ABCEGHJ-NPR]|1[ACEGHKLMPRST]|2[ABCEGHJ-NPRTVZ]|3[ABCEHLPRSTVWY]|4[BGKLNSTVWXZ]|5[ACHLPRV-Z]|6[ABCEGHJ-NP]|7[AGLMSTVWX]|8[AHMNPRSTV-Y]|9[ABCEGHJKVY])|P(0[ABCEGHJ-NPRSTV-Y]|1[ABCHLP]|2[ABN]|3[ABCEGLNPY]|4[NPR]|5[AEN]|6[ABC]|7[ABCEGJKL]|8[NT]|9[AN])|R(0[ABCEGHJ-M]|1[ABN]|2[CEGHJ-NPRV-Y]|3[ABCEGHJ-NPRSTV-Y]|4[AHJKL]|5[AGH]|6[MW]|7[ABCN]|8[AN]|9[A])|S(0[ACEGHJ-NP]|2[V]|3[N]|4[AHLNPRSTV-Z]|6[HJKVWX]|7[HJ-NPRSTVW]|9[AHVX])|T(0[ABCEGHJ-MPV]|1[ABCGHJ-MPRSV-Y]|2[ABCEGHJ-NPRSTV-Z]|3[ABCEGHJ-NPRZ]|4[ABCEGHJLNPRSTVX]|5[ABCEGHJ-NPRSTV-Z]|6[ABCEGHJ-NPRSTVWX]|7[AENPSVXYZ]|8[ABCEGHLNRSVWX]|9[ACEGHJKMNSVWX])|V(0[ABCEGHJ-NPRSTVWX]|1[ABCEGHJ-NPRSTV-Z]|2[ABCEGHJ-NPRSTV-Z]|3[ABCEGHJ-NRSTV-Y]|4[ABCEGK-NPRSTVWXZ]|5[ABCEGHJ-NPRSTV-Z]|6[ABCEGHJ-NPRSTV-Z]|7[ABCEGHJ-NPRSTV-Y]|8[ABCGJ-NPRSTV-Z]|9[ABCEGHJ-NPRSTV-Z])|X(0[ABCGX]|1[A])|Y(0[AB]|1[A]))[ ]?[0-9]{1}[ABCEGHJ-NPRSTV-Z]{1}[0-9]{1}$
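
The province check mentioned earlier in this post can be approximated from the first letter of the postal code alone, since each first letter belongs to one province (or, for X, to two territories). The sketch below is a coarse stand-in for the full Snipplr code, not a copy of it; the function names and the two-letter province codes are assumptions made for this example.

```php
<?php
// Sketch only: map the first letter of a postal code to the province or
// territory codes it can belong to. Names here are illustrative.
function provincesForPostalCode(string $postalCode): array
{
    $map = [
        'A' => ['NL'], 'B' => ['NS'], 'C' => ['PE'], 'E' => ['NB'],
        'G' => ['QC'], 'H' => ['QC'], 'J' => ['QC'],
        'K' => ['ON'], 'L' => ['ON'], 'M' => ['ON'], 'N' => ['ON'], 'P' => ['ON'],
        'R' => ['MB'], 'S' => ['SK'], 'T' => ['AB'], 'V' => ['BC'],
        'X' => ['NT', 'NU'], 'Y' => ['YT'],
    ];

    // Look only at the first character, ignoring case and surrounding spaces.
    $first = strtoupper(substr(trim($postalCode), 0, 1));

    return $map[$first] ?? [];
}

function postalCodeMatchesProvince(string $postalCode, string $province): bool
{
    return in_array(strtoupper($province), provincesForPostalCode($postalCode), true);
}

var_dump(postalCodeMatchesProvince('M5V 2T6', 'ON')); // bool(true)
var_dump(postalCodeMatchesProvince('M5V 2T6', 'BC')); // bool(false)
```

A check like this says nothing about whether the postal area exists, so it would normally run after the format has been validated.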

Saturday, August 9, 2008

isValidZipCode

Another week, another validation function. Last week was a little long, so this week we'll do a shorter one: validating a United States ZIP code. The pattern is simple enough that I don't think an explanation is warranted.


ASP

    function isValidZIPCode(zipCode)
        dim regEx
        set regEx = new RegExp
        with regEx
            .IgnoreCase = True
            .Global = True
            .Pattern = "^[0-9]{5}(-[0-9]{4})?$"
        end with
        if regEx.Test(trim(CStr(zipCode))) then
            isValidZIPCode = True
        else
            isValidZIPCode = False
        end if
        set regEx = nothing
    end function

PHP

    function isValidZIPCode($zipCode)
    {
        return (preg_match("/^[0-9]{5}(-[0-9]{4})?$/i", trim($zipCode)) > 0) ? true : false;
    }

Saturday, August 2, 2008

isValidEmail

This week we're going to build on last week's regular expression to validate e-mail addresses. What do IP addresses have to do with e-mail addresses? Just as domain names map to IP addresses, the domain part of an e-mail address can be replaced with an IP address, so instead of person@example.com you could have person@192.168.1.1.


ASP

    function isValidEmail(email)
        dim regEx
        dim result
        set regEx = new RegExp
        with regEx
            .IgnoreCase = True
            .Global = True
            .Pattern = "^[^@]{1,64}@[^@]{1,255}$"
        end with
        result = false
        ' Test length.
        if regEx.Test(email) then
            ' A double quote inside a VBScript string literal is written as "".
            regEx.Pattern = "^((([\w\+\-]+)(\.[\w\+\-]+)*)|(\""[^(\\|\"")]{0,62}\""))@(([a-zA-Z0-9\-]+\.)+([a-zA-Z0-9]{2,})|\[?([1]?\d{1,2}|2[0-4]{1}\d{1}|25[0-5]{1})(\.([1]?\d{1,2}|2[0-4]{1}\d{1}|25[0-5]{1})){3}\]?)$"
            ' Test syntax.
            if regEx.Test(email) then
                result = true
            end if
        end if
        isValidEmail = result
        set regEx = nothing
    end function

PHP

    function isValidEmail($email)
    {
        $lengthPattern = "/^[^@]{1,64}@[^@]{1,255}$/";
        $syntaxPattern = "/^((([\w\+\-]+)(\.[\w\+\-]+)*)|(\"[^(\\|\")]{0,62}\"))@(([a-zA-Z0-9\-]+\.)+([a-zA-Z0-9]{2,})|\[?([1]?\d{1,2}|2[0-4]{1}\d{1}|25[0-5]{1})(\.([1]?\d{1,2}|2[0-4]{1}\d{1}|25[0-5]{1})){3}\]?)$/";
        return ((preg_match($lengthPattern, $email) > 0) && (preg_match($syntaxPattern, $email) > 0)) ? true : false;
    }

The validation is broken down into two steps: checking the length of each part, and checking the syntax of each part.


^[^@]{1,64}@[^@]{1,255}$


The part before the @ symbol is called the local part, and cannot exceed 64 characters. The part after the @ symbol is called the domain part, and cannot exceed 255 characters.


^((([\w\+\-]+)(\.[\w\+\-]+)*)|(\"[^(\\|\")]{0,62}\"))@(([a-zA-Z0-9\-]+\.)+([a-zA-Z0-9]{2,})|\[?([1]?\d{1,2}|2[0-4]{1}\d{1}|25[0-5]{1})(\.([1]?\d{1,2}|2[0-4]{1}\d{1}|25[0-5]{1})){3}\]?)$


In the check for syntax, the local part is validated by ((([\w\+\-]+)(\.[\w\+\-]+)*)|(\"[^(\\|\")]{0,62}\")). This is actually two patterns separated by the pipe character. The first, (([\w\+\-]+)(\.[\w\+\-]+)*), allows letters, numbers, underscores, the plus sign, and the hyphen (or minus sign if you prefer). It also allows periods, but not as the first or last character. The second, (\"[^(\\|\")]{0,62}\"), allows just about anything, provided the local part is enclosed in quotation marks (which is valid, but you'll probably never encounter it).


The domain part is validated by (([a-zA-Z0-9\-]+\.)+([a-zA-Z0-9]{2,})|\[?([1]?\d{1,2}|2[0-4]{1}\d{1}|25[0-5]{1})(\.([1]?\d{1,2}|2[0-4]{1}\d{1}|25[0-5]{1})){3}\]?). Once again, this is two different patterns separated by the pipe character. The second pattern is our IP address checker from last week with optional enclosure in square brackets. The first pattern, ([a-zA-Z0-9\-]+\.)+([a-zA-Z0-9]{2,}), allows a slightly smaller range of characters (no plus signs or underscores), any number of subdomains, and a top-level domain of at least 2 characters (the minimum). Some regular expressions will impose a maximum of six characters on the top-level domain (the longest at the moment is .museum), but that wouldn't allow for longer top-level domains that could be created in the future.


Saturday, July 26, 2008

isValidIP

As promised, more regular expression fun. This week we're going to validate IP addresses. An IP address consists of four octets separated by periods. A lazy person might be inclined to use a regular expression of \d{1,3} for each octet, but that would allow numbers larger than 255. A more complex expression is needed: ([1]?\d{1,2}|2[0-4]{1}\d{1}|25[0-5]{1}).


This expression consists of three parts separated by the pipe character "|". The first part, [1]?\d{1,2}, matches numbers between 0 and 199. The second part, 2[0-4]{1}\d{1}, matches numbers between 200 and 249. The third part, 25[0-5]{1}, matches numbers between 250 and 255. We will repeat this pattern four times, once for each octet, and separate with periods.
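
As a quick sanity check of the three ranges described above, the sketch below runs the octet pattern, anchored on its own, against every integer from 0 to 300. The variable names are made up for this example.

```php
<?php
// Sketch: confirm that the octet pattern accepts exactly the integers 0-255.
$octet = '/^([1]?\d{1,2}|2[0-4]{1}\d{1}|25[0-5]{1})$/';

$accepted = [];
for ($n = 0; $n <= 300; $n++) {
    if (preg_match($octet, (string) $n) === 1) {
        $accepted[] = $n;
    }
}

var_dump($accepted === range(0, 255)); // bool(true)
```

Note that the first alternative also accepts a leading zero in one- and two-digit forms such as 07, which may or may not be what you want.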


ASP

    function isValidIP(ip)
        dim regEx
        set regEx = new RegExp
        with regEx
            .IgnoreCase = True
            .Global = True
            .Pattern = "^([1]?\d{1,2}|2[0-4]{1}\d{1}|25[0-5]{1})(\.([1]?\d{1,2}|2[0-4]{1}\d{1}|25[0-5]{1})){3}$"
        end with
        if regEx.Test(trim(CStr(ip))) then
            isValidIP = true
        else
            isValidIP = false
        end if
        set regEx = nothing
    end function

PHP

    function isValidIP($ip)
    {
        $pattern = "/^([1]?\d{1,2}|2[0-4]{1}\d{1}|25[0-5]{1})(\.([1]?\d{1,2}|2[0-4]{1}\d{1}|25[0-5]{1})){3}$/";
        return (preg_match($pattern, $ip) > 0) ? true : false;
    }

Next week we're going to build on this pattern to validate something else. I wonder what that could be?


Saturday, July 19, 2008

isAlpha(Numeric)

We're going to take a break from math-related functions for a few weeks (yay!) and play with regular expressions. Regular expressions are usually more concise, and often more powerful, than old-fashioned string parsing.


Both ASP and PHP have a function for checking if something is numeric. How about a function for checking if something is alphabetical?


ASP

    function isAlpha(someString)
        dim regEx
        set regEx = new RegExp
        with regEx
            .Global = true
            .IgnoreCase = true
            ' Anchored so that every character must be a letter, whitespace or underscore
            .Pattern = "^[A-Z\s_]+$"
        end with
        if regEx.test(someString) then
            isAlpha = true
        else
            isAlpha = false
        end if
        set regEx = nothing
    end function

PHP

    function is_alpha($someString)
    {
        // Anchored so that every character must be a letter, whitespace or underscore
        return (preg_match('/^[A-Z\s_]+$/i', $someString) > 0) ? true : false;
    }

The test pattern we are using above will allow letters of the alphabet, the underscore character, and whitespace characters. With a small tweak to the test pattern, we can also write a function to check if a string is alphanumeric.


ASP

    function isAlphaNumeric(someString)
        dim regEx
        set regEx = new RegExp
        with regEx
            .Global = true
            .IgnoreCase = true
            ' Anchored so that every character must be a word character, whitespace or period
            .Pattern = "^[\w\s.]+$"
        end with
        if regEx.test(someString) then
            isAlphaNumeric = true
        else
            isAlphaNumeric = false
        end if
        set regEx = nothing
    end function

PHP

    function is_alphanumeric($someString)
    {
        // Anchored so that every character must be a word character, whitespace or period
        return (preg_match('/^[\w\s.]+$/i', $someString) > 0) ? true : false;
    }

The \w shorthand in the pattern matches the 26 letters of the alphabet, the numbers zero through nine, and the underscore. The pattern also allows whitespace characters and the period.
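
To see which characters \w covers, a small sketch like the one below can help. It tests a handful of single characters against \w on its own; the sample list is arbitrary and chosen only for this example.

```php
<?php
// Sketch: report whether each sample character counts as a "word" character.
$samples = ['a', 'Z', '7', '_', '-', ' ', '.'];

foreach ($samples as $char) {
    $isWord = preg_match('/^\w$/', $char) === 1;
    printf("%-4s %s\n", var_export($char, true), $isWord ? 'word' : 'not word');
}
```

The underscore is reported as a word character, while the hyphen, the space and the period are not, which is why whitespace and the period have to be listed separately in the alphanumeric pattern.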


More regular expression fun next week!