How to validate UTF-8 in regex

 

I have basic validation rules set up for name fields:

NOT(REGEX(FirstName, "^[A-Za-z\\. '-]+$"))

The goal is to allow only letters, periods, spaces, hyphens and apostrophes in the name field. The problem is that this rejects accented characters (graphemes). I've tried some simplified ideas based on a regex tutorial and the Java docs that Salesforce links to, but they do not work:

  1. NOT( REGEX( FirstName , "\\P{M}\\p{M}") )
  2. NOT( REGEX( FirstName , "\\p{Alpha}") )
  3. NOT( REGEX( FirstName , "\\X") )

Has anybody else run into this problem? How do you validate names with accent marks?

Update: After further testing, I'm making some progress. The validation rule REGEX(LastName, "(?>\\P{M}\\p{M}*)") successfully flags "é" as a match. Unfortunately, that pattern matches pretty much any character, and I want to exclude numerals and most punctuation.
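
To illustrate why the pattern in the update accepts nearly anything, here is a minimal Java sketch. It is Java rather than Apex, on the assumption that Salesforce's regex engine behaves like java.util.regex, which the linked Java docs suggest. \P{M} means "any code point that is not a combining mark", so digits and punctuation qualify as well. The class and method names are hypothetical, chosen only for this example.

```java
import java.util.regex.Pattern;

public class GraphemeCheck {
    // Same pattern as in the update: one non-mark code point,
    // followed by zero or more combining marks.
    static final Pattern GRAPHEME = Pattern.compile("(?>\\P{M}\\p{M}*)");

    // Hypothetical helper: true if the whole string is one such unit.
    static boolean isSingleGrapheme(String s) {
        return GRAPHEME.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSingleGrapheme("\u00E9"));  // precomposed e-acute: true
        System.out.println(isSingleGrapheme("e\u0301")); // e + combining acute: true
        System.out.println(isSingleGrapheme("7"));       // a digit is not a mark: true
        System.out.println(isSingleGrapheme("!"));       // neither is punctuation: true
    }
}
```

So the pattern describes "one character plus its accents", not "one letter". Restricting it to letters needs a narrower class than \P{M}.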


Attribution to: Mike Chale

Possible Suggestion/Solution #1

This might need some refinement, but my understanding is \p{L} will match "a single code point in the category 'letter'".

I tested the following as Anonymous Apex and got the Matches debug message.

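// Sample name containing an accented letter.
String FirstName = 'Fredé';

// \p{L} matches any Unicode letter; periods, spaces, apostrophes and hyphens are also allowed.
Pattern regexPattern = Pattern.compile('^[\\p{L}\\. \'-]+$');
Matcher regexMatcher = regexPattern.matcher(FirstName);

// matches() succeeds only if the entire string fits the pattern.
if (!regexMatcher.matches()) {
    System.debug(LoggingLevel.Warn, 'No Matches');
} else {
    System.debug(LoggingLevel.Debug, 'Matches');
}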

According to the Regex Tutorial: Unicode Character Properties, you will probably need to add \p{M}* to optionally match any diacritics that are encoded as separate combining marks:

To match a letter including any diacritics, use \p{L}\p{M}*. This last regex will always match à, regardless of how it is encoded.
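
Putting the two ideas together, a name pattern could accept either a letter with its optional combining marks or one of the permitted punctuation characters. The following is a minimal Java sketch of that combination, not a tested Salesforce rule. It assumes the Apex and formula regex engines behave like java.util.regex, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class NameValidator {
    // Each unit is either a letter followed by any combining marks,
    // or one of: period, space, apostrophe, hyphen.
    static final Pattern NAME =
        Pattern.compile("^(?:\\p{L}\\p{M}*|[. '-])+$");

    // Hypothetical helper: true if the whole name fits the pattern.
    static boolean isValidName(String name) {
        return NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidName("Fred\u00E9"));         // precomposed accent: true
        System.out.println(isValidName("Frede\u0301"));        // combining accent: true
        System.out.println(isValidName("O'Brien-Smith Jr."));  // allowed punctuation: true
        System.out.println(isValidName("Fred3"));              // digit rejected: false
    }
}
```

Note that this sketch still accepts a value made up only of punctuation, such as a single hyphen, so it may need further tightening. An untested guess at the equivalent validation rule would be NOT(REGEX(FirstName, "^(\\p{L}\\p{M}*|[. '-])+$")).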


Attribution to: Daniel Ballinger
This content is remixed from stackoverflow or stackexchange. Please visit https://salesforce.stackexchange.com/questions/893
