Trying to work with NSRegularExpression in Swift, I came across this common issue: to get matches, you must provide a Swift String to match on. However, NSRegularExpression returns NSRange objects representing any resulting matches. Problem is, String and NSRange don't play well with each other!
To work around this, as usual, I'll give you the short of the story, followed by the details.
The Short Story:
Convert your resulting NSRange match items into Range items with the original string's String.UTF16View representation. UTF16View indices have a samePositionIn(_:) function ready-made for this purpose!
Example code:
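Here's a minimal sketch of the whole flow, in the Swift 2 syntax current as of this post. The string and pattern are just placeholders:

```swift
import Foundation

let matchString = "A☺️C has emoji"
let nsString = matchString as NSString

// try! for brevity here — handle the error properly in real code
let regex = try! NSRegularExpression(pattern: "emoji", options: [])

// NSRegularExpression wants an NSRange based on the NSString (UTF-16) length
let matches = regex.matchesInString(matchString,
                                    options: [],
                                    range: NSMakeRange(0, nsString.length))

let utf16 = matchString.utf16
var found = [String]()
for match in matches {
    // Walk the UTF-16 view out to the NSRange boundaries...
    let start = utf16.startIndex.advancedBy(match.range.location)
    let end = start.advancedBy(match.range.length)
    // ...then convert those UTF-16 indices into String.Index values
    if let startIndex = start.samePositionIn(matchString),
           endIndex = end.samePositionIn(matchString) {
        found.append(matchString[startIndex..<endIndex])
    }
}
print(found) // ["emoji"]
```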
Here are the details:
One example NSRegularExpression object method call looks like this:
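One such method is matchesInString(_:options:range:) — note the mix of String and NSRange parameters (the string and pattern below are just examples):

```swift
import Foundation

let matchString = "find the needle in here"
let regex = try! NSRegularExpression(pattern: "needle", options: [])

// A Swift String to search, but an NSRange to search within
let matches = regex.matchesInString(matchString,
                                    options: [],
                                    range: NSMakeRange(0, (matchString as NSString).length))
```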
It requires a String for matching, but an NSRange for the part of the string to search. To convert TO an NSRange, you need to convert your regular String into an NSString to get the length - this part's easy:
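Something like this (the string is illustrative):

```swift
import Foundation

let matchString = "A☺️C"
let nsString = matchString as NSString

// An NSRange covering the whole string, based on the UTF-16 length
let searchRange = NSMakeRange(0, nsString.length)
```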
Ah, but WHY? Why didn't we just use matchString.characters.count?
A Story of Character Encoding
Well, it turns out that OS X and iOS use Unicode to represent text under the hood. Unicode strings are stored as sequences of "code units", and the number of code units representing a given character will differ depending on which UTF encoding form is used.
NSString uses a UTF-16 representation for working with Unicode characters. While many characters consist of only one code unit in UTF-16 form, some characters consist of more than one. The letter "A", for example, takes only one code unit. But the smiling face emoji, "☺️", consists of 2 code units in UTF-16 (and therefore also in NSString).
Swift's String, on the other hand, uses a visible character count rather than counting the underlying code units. So, the String literals "ABC" and "A☺️C" both seem to be 3 characters long: "ABC".characters.count and "A☺️C".characters.count both return 3.
But, when you convert those same strings to NSString, you get a little surprise: ("ABC" as NSString).length returns the 3 you'd expect. ("A☺️C" as NSString).length, though, returns 4! Yikes!
Of course, the difference is because NSString is taking into account that extra UTF-16 code unit that String ignores.
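You can see the mismatch directly (Swift 2 syntax):

```swift
import Foundation

// Swift counts visible characters (grapheme clusters)...
let plainCount = "ABC".characters.count    // 3
let emojiCount = "A☺️C".characters.count   // 3

// ...but NSString counts UTF-16 code units
let plainLength = ("ABC" as NSString).length   // 3
let emojiLength = ("A☺️C" as NSString).length  // 4
```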
Okay, now what about those ranges?
Swift bridges the String you provide to the function into an NSString object. So, the NSRange results you get back from the call are based on that UTF-16 length, not the actual String character count. If you've got any multi-code-unit characters in your String, the ranges you get back won't match up.
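A quick demonstration of the mismatch: in "A☺️C", the letter "C" is the 3rd visible character (character index 2), but the NSRange you get back points at UTF-16 offset 3, because the emoji takes 2 code units:

```swift
import Foundation

let matchString = "A☺️C"
let regex = try! NSRegularExpression(pattern: "C", options: [])
let match = regex.matchesInString(matchString,
                                  options: [],
                                  range: NSMakeRange(0, (matchString as NSString).length)).first!

// UTF-16 offset, not the visible-character index you might expect
print(match.range.location) // 3
```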
Additionally, NSRange objects provide a location and a length. These are both Int values, which won't work in a Swift Range. You need Index objects for the start and end of a Range object.
Speaking of indexes, Swift Range objects expect a particular type of Index. You've seen the error messages about Range&lt;Index&gt; or Range&lt;String.Index&gt;. These actually represent Range&lt;String.CharacterView.Index&gt;. So, you need to make sure you're using the right type of Index, or your conversion won't work.
The workaround!
Note: in the example code below, I haven't provided any error handling. Make sure to add code to handle any potential optional results, such as when creating your NSRegularExpression object. guard statements come in handy here.
Now, here we go...
1) Swift has a UTF-16 representation for a String object. So, first, we'll ensure we've got a UTF-16 version of our match string. We'll use the utf16 property of String to convert our match string into its equivalent String.UTF16View type:
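Along these lines (the string itself is just an example):

```swift
import Foundation

let matchString = "A☺️C has emoji"

// UTF-16 view of the same string — its indices line up with NSRange offsets
let utf16MatchString = matchString.utf16
```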
2) We then set up our NSRegularExpression object.
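For example (using try! for brevity — see the error-handling note above; the pattern is a placeholder):

```swift
import Foundation

// The throwing initializer fails on an invalid pattern — guard/do-catch in real code
let regex = try! NSRegularExpression(pattern: "emoji", options: [])
```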
3) Then, we make our call to the regex object's appropriate function, using the String value and the NSRange based on the NSString representation.
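Continuing the example with matchesInString(_:options:range:):

```swift
import Foundation

let matchString = "A☺️C has emoji"
let regex = try! NSRegularExpression(pattern: "emoji", options: [])

// Swift String for the text, but an NSString-based NSRange for the search range
let matches = regex.matchesInString(matchString,
                                    options: [],
                                    range: NSMakeRange(0, (matchString as NSString).length))
```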
Note: strangely, the UTF16View version of the match string won't work for creating the NSRange. You MUST use an NSString cast to create the NSRange.
4) You'll need to loop through the results, converting as you go:
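Putting the pieces together (Swift 2 syntax; advancedBy steps through the UTF-16 view, and samePositionIn does the index conversion):

```swift
import Foundation

let matchString = "A☺️C has emoji"
let utf16MatchString = matchString.utf16
let regex = try! NSRegularExpression(pattern: "emoji", options: [])
let matches = regex.matchesInString(matchString,
                                    options: [],
                                    range: NSMakeRange(0, (matchString as NSString).length))

var results = [String]()
for match in matches {
    // Step through the UTF-16 view by the NSRange's location and length...
    let utf16Start = utf16MatchString.startIndex.advancedBy(match.range.location)
    let utf16End = utf16Start.advancedBy(match.range.length)

    // ...then convert each UTF-16 index into a proper String.Index
    if let start = utf16Start.samePositionIn(matchString),
           end = utf16End.samePositionIn(matchString) {
        results.append(matchString[start..<end])
    }
}
print(results) // ["emoji"]
```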