While working with NSRegularExpression in Swift, I came across this common issue: to get matches, you must provide a Swift String to match against. However, NSRegularExpression returns NSRange values representing any resulting matches. Problem is, String and NSRange don't play well with each other!
To work around this, as usual, I'll give you the short version of the story, followed by the details.
The Short Story:
Convert your resulting NSRange match items into Range items by way of the original string's String.UTF16View representation. The UTF16View's indexes have a samePositionIn(_:) method ready-made for this purpose!
Example code:
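(A minimal sketch in Swift 2-era syntax, to match this article; matchString and the pattern "C" are placeholders.)

```swift
import Foundation

let matchString = "A☺️C"
let regex = try! NSRegularExpression(pattern: "C", options: [])

// Build the search range from the NSString (UTF-16) length.
let nsString = matchString as NSString
let results = regex.matchesInString(matchString,
                                    options: [],
                                    range: NSRange(location: 0, length: nsString.length))

let utf16View = matchString.utf16
for result in results {
    // result.range is in UTF-16 code units, so walk the UTF16View...
    let from16 = utf16View.startIndex.advancedBy(result.range.location)
    let to16 = from16.advancedBy(result.range.length)

    // ...then convert each UTF-16 index into a String.Index.
    if let from = from16.samePositionIn(matchString),
        let to = to16.samePositionIn(matchString) {
        print(matchString[from..<to]) // "C"
    }
}
```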
Here are the details:
One example NSRegularExpression object method call looks like this:
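For instance, matchesInString(_:options:range:); the names regex, matchString, and searchRange here are placeholders we'll build up below:

```swift
let results = regex.matchesInString(matchString,
                                    options: [],
                                    range: searchRange)
// results is an array of NSTextCheckingResult, each carrying NSRanges
```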
It requires a String for matching, but an NSRange for the part of the string to search. To build that NSRange, you need the string's length in NSString terms, so cast your String to an NSString first - this part's easy:
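(A sketch, assuming matchString holds the text you're searching.)

```swift
import Foundation

let matchString = "A☺️C"   // placeholder text
let nsString = matchString as NSString
let searchRange = NSRange(location: 0, length: nsString.length)
```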
Ah, but WHY? Why didn't we just use matchString.characters.count?
A Story of Character Encoding
Well, it turns out that OS X and iOS text handling is built on the Unicode standard. Unicode strings are made up of "code units", and the number of code units representing a given character differs depending on which UTF encoding form (UTF-8, UTF-16, or UTF-32) is in use.
NSString uses a UTF-16 representation for working with Unicode characters. While many characters consist of only one code unit in UTF-16 form, some characters consist of more than one. The letter "A", for example, takes only one code unit. But the smiling face emoji, "☺️", consists of 2 code units in UTF-16 (and therefore also in NSString).
Swift's String, on the other hand, counts visible characters rather than the underlying code units. So, the String literals "ABC" and "A☺️C" both appear to be 3 characters long: "ABC".characters.count and "A☺️C".characters.count both return 3.
But, when you convert those same strings to NSString, you get a little surprise: ("ABC" as NSString).length returns the 3 you'd expect. ("A☺️C" as NSString).length, though, returns 4! Yikes!
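You can check this in a playground:

```swift
import Foundation

"ABC".characters.count        // 3
"A☺️C".characters.count       // 3
("ABC" as NSString).length    // 3
("A☺️C" as NSString).length   // 4 -- the extra UTF-16 code unit shows up
```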
Of course, the difference is because NSString is taking into account that extra UTF-16 code unit that String ignores.
Okay, now what about those ranges?
Swift bridges the String you provide to the function into an NSString object. So, the NSRange results you get back from the call are based on that UTF-16 length, not the String's character count. If you've got any multi-code-unit characters in your String, the ranges you get back won't line up with your String's character indices.
Additionally, NSRange provides a location and a length. These are both Int values, which won't work in a Swift Range over a string. You need Index values for the start and end of a Range object.
Speaking of indexes, Swift Range objects expect a particular type of Index. You've probably seen the error messages about Range<Index> or Range<String.Index>; these actually represent Range<String.CharacterView.Index>. So, you need to make sure you're using the right type of Index, or your conversion won't compile.
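For instance (Swift 2 syntax), a range over a string's characters is built from two String.Index values, never from Ints:

```swift
let str = "A☺️C"
let start = str.startIndex                       // a String.Index, not an Int
let end = start.advancedBy(2)                    // still a String.Index
let swiftRange: Range<String.Index> = start..<end
print(str[swiftRange])                           // "A☺️"
```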
The workaround!
Note: in the example code below, I haven't provided any error handling. Make sure to add code to handle any potential optionals and thrown errors, such as when creating your NSRegularExpression object. Guard statements come in handy here.
Now, here we go...
1) Swift provides UTF-16 representations for a String object. So, first, we'll ensure we've got a UTF-16 version of our match string. We'll use the utf16 property of String to convert our match string into its equivalent String.UTF16View type (here, matchString is a placeholder for your own text):
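```swift
let matchString = "A☺️C"                     // placeholder text to search
let matchStringUTF16 = matchString.utf16     // its String.UTF16View
```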
2) We then set up our NSRegularExpression object.
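For example (the pattern here is just a placeholder; see the earlier note about handling the thrown error properly):

```swift
// try! keeps the sketch short -- real code should catch the error.
let regex = try! NSRegularExpression(pattern: "☺️", options: [])
```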
3) Then, we make our call to the regex object's matchesInString(_:options:range:) function, passing the String value and the NSRange based on the NSString representation.
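Something like this:

```swift
let nsString = matchString as NSString       // NSString, for its UTF-16 length
let results = regex.matchesInString(matchString,
                                    options: [],
                                    range: NSRange(location: 0, length: nsString.length))
```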
Note: strangely, the UTF16View version of the match string won't work for creating the NSRange. You MUST use an NSString cast to create the NSRange.
4) You'll need to loop through the results, converting as you go:
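(Again, a sketch; matchString and matchStringUTF16 come from step 1.)

```swift
for result in results {
    // result.range is in UTF-16 code units, so walk the UTF16View first...
    let from16 = matchStringUTF16.startIndex.advancedBy(result.range.location)
    let to16 = from16.advancedBy(result.range.length)

    // ...then convert each UTF-16 index into a String.Index.
    guard let from = from16.samePositionIn(matchString),
        let to = to16.samePositionIn(matchString) else { continue }

    let swiftRange = from..<to               // a genuine Range<String.Index>
    print(matchString[swiftRange])           // the matched substring
}
```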