Wednesday, May 18, 2016

NSRegularExpression... String? NSRange? Wha...? (Swift 2.2)

Trying to work with NSRegularExpression in Swift, I came across this common issue: to get matches, you must provide a Swift String to match on. However, NSRegularExpression returns NSRange objects representing any resulting matches. Problem is, String and NSRange don't play well with each other!

To work around this, as usual, I'll give you the short of the story, followed by the details.

The Short Story:
Convert each resulting NSRange into a Range by way of the original string's String.UTF16View representation. The UTF16View's Index type has a samePositionIn(String) function ready-made for this purpose!

Example code:

// Note: all types are specified for clarity //
// set up the regular match String AND a UTF-16 version
let matchString: String = "This is the text to search with the Regular Expression pattern."
let matchStringUTF16: String.UTF16View = matchString.utf16
// create your NSRegularExpression object
// get your match(es) - NSTextCheckingResult objects, each holding one or more NSRange values -
// and while looping through each:
// use the NSRange's location and length to create start and end values for the String Range
// (note: 'match' is the current NSTextCheckingResult; 'idx' is the index of the range within it)
let theNSRange: NSRange = match.rangeAtIndex(idx)
let nsRangeStart: Int = theNSRange.location
let nsRangeEnd: Int = theNSRange.location + theNSRange.length
// create UTF-16 start and end indexes using the UTF16View version of the original String
let utf16StartIndex: String.UTF16View.Index = matchStringUTF16.startIndex.advancedBy(nsRangeStart)
let utf16EndIndex: String.UTF16View.Index = matchStringUTF16.startIndex.advancedBy(nsRangeEnd)
// use samePositionIn(String) to convert the UTF16View indexes to regular CharacterView indexes
// note: these are optionals as the conversion may not work; handle appropriately in your own code
let stringStartIndex: String.CharacterView.Index? = utf16StartIndex.samePositionIn(matchString)
let stringEndIndex: String.CharacterView.Index? = utf16EndIndex.samePositionIn(matchString)
// now, create a Range object with those CharacterView indexes
// FYI: Range<Index>, as seen in error messages, is short-hand for Range<String.CharacterView.Index>
let stringRange: Range<String.CharacterView.Index> = stringStartIndex!..<stringEndIndex!
// and prove it works by using the new Range on the original String to get the match's substring
let substringMatch: String = matchString.substringWithRange(stringRange)

Here are the details:
One example NSRegularExpression object method call looks like this:

func matchesInString(_ string: String,
options options: NSMatchingOptions,
range range: NSRange) -> [NSTextCheckingResult]

It requires a String for matching, but an NSRange for the part of the string to search. To convert TO an NSRange, you need to convert your regular String into an NSString to get the length - this part's easy:

// we're assuming we want to search the entire string
let nsMatchString: NSString = matchString as NSString
let nsRange: NSRange = NSMakeRange(0, nsMatchString.length)

Ah, but WHY? Why didn't we just use matchString.characters.count?

A Story of Character Encoding
Well, it turns out that OS X and iOS represent text using Unicode. Unicode strings are made up of "code units," and the number of code units needed for a given character depends on which encoding form (UTF-8, UTF-16, or UTF-32) is used.

NSString uses a UTF-16 representation for working with Unicode characters. While many characters consist of only one code unit in UTF-16 form, some characters consist of more than one. The letter "A", for example, takes only one code unit. But the emoji {Smiling face}, "☺️", consists of 2 code units in UTF-16 (and therefore also in NSString).

Swift's String, on the other hand, uses a visible character count rather than calculating the underlying code units. So, the String literals "ABC" and "A☺️C" both seem to be 3 characters long: "ABC".characters.count and "A☺️C".characters.count both return 3.

But when you convert those same strings to NSString, you get a little surprise: ("ABC" as NSString).length returns the 3 you'd expect. ("A☺️C" as NSString).length, though, returns 4! Yikes!

Of course, the difference is because NSString is taking into account that extra UTF-16 code unit that String ignores.
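The counts above can be checked with a short sketch. It assumes the smiling face is U+263A followed by the variation selector U+FE0F (written with escapes so the code units are unambiguous), and it builds the NSString with NSString(string:) rather than a cast. The variable names are only illustrative.

```swift
import Foundation

let plain = "ABC"
// Assumption: "☺️" is U+263A followed by the variation selector U+FE0F.
let smiley = "A\u{263A}\u{FE0F}C"

// utf16.count and NSString's length both count UTF-16 code units.
let plainUnits = plain.utf16.count                    // 3
let smileyUnits = smiley.utf16.count                  // 4
let smileyNSLength = NSString(string: smiley).length  // 4

// The individual code units: A, U+263A, U+FE0F, C.
let smileyCodeUnits = Array(smiley.utf16)             // [0x41, 0x263A, 0xFE0F, 0x43]

print(plainUnits, smileyUnits, smileyNSLength, smileyCodeUnits)
```

The visible character count of both strings is still 3, which is why the two lengths disagree only for the second one.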

Okay, now what about those ranges?
Swift bridges the String you provide to the function into an NSString object. So, the NSRange results you get back from the call are based on that UTF-16 length, not the actual String character count. If you've got any multi-code-unit characters in your String, the ranges you get back won't match up.
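To see the mismatch concretely, here is a rough sketch that locates the first "C" in "A😏C" by UTF-16 offset, which is the kind of number an NSRange location holds. The hand-written loop and the names are illustrative only and are not part of NSRegularExpression.

```swift
let sampleText = "A\u{1F60F}C"            // "A😏C"
let sampleUnits = Array(sampleText.utf16) // 😏 is a surrogate pair: 0xD83D, 0xDE0F

// Find the UTF-16 offset of the first "C" (code unit 0x43).
var utf16Offset = -1
for i in 0..<sampleUnits.count {
    if sampleUnits[i] == 0x43 {
        utf16Offset = i
        break
    }
}

// Counting visible characters puts "C" at position 2,
// but its UTF-16 offset is 3 because 😏 takes two code units.
print(utf16Offset)
```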

Additionally, NSRange objects provide a location and a length. These are both Int values, which won't work in a Swift Range. You need Index objects for the start and end of a Range object.
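As a small illustration of those two Int values, the sketch below builds an NSRange by hand (the numbers are made up, not taken from a real match) and derives the half-open start and end offsets that the conversion needs.

```swift
import Foundation

// Hypothetical range: starts at UTF-16 offset 1 and covers 3 code units.
let sampleRange = NSMakeRange(1, 3)

let startOffset = sampleRange.location                     // 1
let endOffset = sampleRange.location + sampleRange.length  // 4, one past the last unit

// Foundation's NSMaxRange computes the same end offset.
let maxOffset = NSMaxRange(sampleRange)

print(startOffset, endOffset, maxOffset)
```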

Speaking of indexes, Swift Range objects expect a particular type of Index. You've seen the error messages about Range<Index> or Range<String.Index>. These actually represent Range<String.CharacterView.Index>. So, you need to make sure you're using the right type of Index, or your conversion won't work.

The workaround!
Note: in the example code below, I haven't provided any error handling. Make sure to add code to handle any potential optional results, such as in creating your NSRegularExpression object. Guard statements come in handy here.

Now, here we go...
1) Swift's String offers a UTF-16 view of its contents. So, first, we'll use the utf16 property of String to get our match string's equivalent String.UTF16View:

// set up the regular match String AND a UTF-16 version
//let matchString: String = "ABCABC" // switch the remarked string to experiment
let matchString: String = "A😏CA😏C"
let matchStringUTF16: String.UTF16View = matchString.utf16

2) We then set up our NSRegularExpression object.

//let regexPattern: String = "A(BC)+" // again, switch remarked strings to experiment
let regexPattern = "A(😏C)+"
let options = NSRegularExpressionOptions.CaseInsensitive
guard let regex = try? NSRegularExpression(pattern: regexPattern, options: options) else {
    return
}

3) Then, we make our call to the regex object's appropriate function using the String value and the NSRange based on the NSString representation.

Note: the NSRange length must be a UTF-16 length, not a character count. Casting to NSString and reading its length, as below, gives exactly that; matchStringUTF16.count yields the same number.

// creating the NSRange object to send into the function call
let nsMatchString: NSString = matchString as NSString
let nsRange: NSRange = NSMakeRange(0, nsMatchString.length)
// function call with original String and our new NSRange
// (matchesInString returns an array; 'regex' is non-optional after the guard; [] means no matching options)
let matches: [NSTextCheckingResult] = regex.matchesInString(matchString, options: [], range: nsRange)

4) You'll need to loop through the results, converting as you go:

for matchItem in matches {
    let numRanges = matchItem.numberOfRanges
    // rangeAtIndex() gives us the NSRange for the main match PLUS each of the group captures
    // rangeAtIndex(0) = main match (same as matchItem.range)
    // rangeAtIndex(>0) = group capture match
    for rangeIdx in 0..<numRanges {
        // use the NSRange's location and length to create start and end values for the String Range
        let theNSRange: NSRange = matchItem.rangeAtIndex(rangeIdx)
        // a capture group that took no part in the match reports NSNotFound; skip it
        if theNSRange.location == NSNotFound { continue }
        let nsRangeStart: Int = theNSRange.location
        let nsRangeEnd: Int = theNSRange.location + theNSRange.length
        // then, using the UTF16View string, get the start and end UTF16 index types
        let utf16StartIndex: String.UTF16View.Index = matchStringUTF16.startIndex.advancedBy(nsRangeStart)
        let utf16EndIndex: String.UTF16View.Index = matchStringUTF16.startIndex.advancedBy(nsRangeEnd)
        // using the String.UTF16View.Index type's samePositionIn(String) function,
        // you can finally get the String.CharacterView.Index types needed for the Range<String.Index>
        let stringStartIndex: String.CharacterView.Index? = utf16StartIndex.samePositionIn(matchString)
        let stringEndIndex: String.CharacterView.Index? = utf16EndIndex.samePositionIn(matchString)
        // create your Range
        let stringRange: Range<String.CharacterView.Index> = stringStartIndex!..<stringEndIndex!
        // and use it against the original String to get your matching String - PHEW! All done!
        let substringMatch: String = matchString.substringWithRange(stringRange)
    }
}
