Correct UTF-16

Whether the UTF16 approach is appropriate depends on whether you're going to be using text that has multi-width characters, and on how you're getting the actual start and end values to make the substring. Here's a post that I found useful for getting my head around these options: https://oleb.net/blog/2016/08/swift-3-strings/

Results... 🥁

It also looks like returning nil instead of an empty string gives a bit of a boost, but you might lose that small advantage elsewhere in your code depending on how you handle the nil return. Also, my benchmark ran 1 million iterations and showed only a tiny difference, so I can't imagine an application where this would make a practical improvement anyway.

deleted 5 characters in body

edited Jan 3, 2017 at 16:04

MathewS

1. Baseline approach (in question)
2. Alternative (use isEmpty and calculate endIndex from startIndex)
3. UTF16 (use isEmpty and create UTF16 index directly from Int)
4. UTF16 nil (use isEmpty, create UTF16 index from Int and return String?)

Benchmark substring using "hello tests".substring(1,10)

1. Baseline -> 1.151s (2% STDEV)
2. Alternative -> 0.633s (1% STDEV)
3. UTF16 -> 0.408s (2% STDEV)
4. UTF16 nil -> 0.404s (1% STDEV)

Benchmark early exit using "".substring(1,10)

1. Baseline -> 0.074s (4% STDEV)
2. Alternative -> 0.024s (12% STDEV)
3. UTF16 -> 0.024s (11% STDEV)
4. UTF16 nil -> 0.019s (12% STDEV)

Corrected UTF18 typo and added test for UTF16 option that returns optional String.

edited Jan 3, 2017 at 15:58

MathewS

I see some things that might help:

index(_:offsetBy:) is O(n), where n is the distance you're offsetting, so you can squeeze a bit out by calculating the endIndex as an offset from the substring's start index instead of from the beginning of the string, especially if you're getting substrings from near the end of the string.
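As a rough sketch of that idea, written against current Swift rather than the Swift 3 this answer was tested with. The name substringByOffset and the empty-string fallback are my own assumptions for illustration, not the code from the question:

```swift
extension String {
    /// Hypothetical helper: characters in [start, end), or "" if the range is unusable.
    func substringByOffset(_ start: Int, _ end: Int) -> String {
        guard !isEmpty, start >= 0, end > start else { return "" }
        // Walk `start` characters in from the beginning of the string.
        guard let lower = index(startIndex, offsetBy: start, limitedBy: endIndex) else {
            return ""
        }
        // Walk only `end - start` more characters from `lower`,
        // instead of `end` characters from the beginning again.
        let upper = index(lower, offsetBy: end - start, limitedBy: endIndex) ?? endIndex
        return String(self[lower..<upper])
    }
}

print("hello tests".substringByOffset(1, 10)) // prints "ello test"
```

The second offset only covers the length of the substring, so the total walk is `end` steps rather than `start + end`.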

Whether the UTF16 approach is appropriate depends on whether you're going to be using text that has multi-width characters, and on how you're getting the actual start and end values to make the substring. Here's a post that I found useful for getting my head around these options: https://oleb.net/blog/2016/08/swift-3-strings/
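To make the multi-width point concrete, here is a small illustration (the sample strings are mine): a character outside the Basic Multilingual Plane, such as 🥁, is one Character but two UTF-16 code units, so Int offsets into the two views stop lining up.

```swift
let plain = "hello"
let drum = "hi 🥁"

// For plain ASCII text the two views agree.
print(plain.count, plain.utf16.count) // prints "5 5"

// 🥁 (U+1F941) needs a surrogate pair, so the UTF-16 view is one unit longer.
print(drum.count, drum.utf16.count)   // prints "4 5"
```

If your start and end values already come from a UTF-16 based source, the UTF16 view matches them directly; if they are character counts, it may not.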

When you check to see if the string is empty, you're checking whether the character count is zero. self.characters.count == 0 is O(n), where n is the number of characters; you can get some performance increase here by using self.isEmpty, which is O(1).
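A minimal side-by-side of the two checks, assuming current Swift, where `count` on the string itself plays the role of Swift 3's `characters.count`:

```swift
let text = "hello tests"

// O(n): counting Characters has to walk the whole string.
let emptyByCount = text.count == 0

// O(1): isEmpty does not need to count anything.
let emptyByCheck = text.isEmpty

print(emptyByCount, emptyByCheck) // prints "false false"
```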

edit: added 4th option that returns String?

Finally, with the UTF16 option you need to convert the result to a String and force unwrap it if you want to return the type String. An alternative could be to return String? instead, with nil as your early exit.
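Roughly what the optional-returning variant could look like in current Swift. This is a sketch under assumptions: substringUTF16 is a hypothetical name, and the Swift 3 code behind the timings built its UTF16 index directly from an Int, which this version replaces with bounds-checked offsetting:

```swift
extension String {
    /// Hypothetical helper: UTF-16 code units in [start, end) as a String,
    /// or nil for an empty string or an unusable range.
    func substringUTF16(_ start: Int, _ end: Int) -> String? {
        guard !isEmpty, start >= 0, end > start else { return nil }
        let units = utf16
        guard let lower = units.index(units.startIndex, offsetBy: start,
                                      limitedBy: units.endIndex) else {
            return nil
        }
        let upper = units.index(lower, offsetBy: end - start,
                                limitedBy: units.endIndex) ?? units.endIndex
        // String.init?(_: Substring.UTF16View) is failable, which is why the
        // non-optional variant needs a force unwrap.
        return String(units[lower..<upper])
    }
}

print("hello tests".substringUTF16(1, 10) ?? "nil") // prints "ello test"
```

Returning String? lets the failable conversion pass straight through, at the cost of every caller having to handle nil.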

I ran a quick profile in Xcode comparing those four options:

1. Baseline approach (in question)
2. Use isEmpty and calculate endIndex from startIndex
3. Use isEmpty and create UTF16 index directly from Int
4. Use isEmpty, create UTF16 index from Int and return String?

Benchmark substring using "hello tests".substring(1,10)

- baseline -> 1.151s (2% STDEV)
- alternative -> 0.633s (1% STDEV)
- UTF16 -> 0.408s (2% STDEV)
- UTF16 nil -> 0.404s (1% STDEV)

Benchmark early exit using "".substring(1,10)

- baseline -> 0.074s (4% STDEV)
- alternative -> 0.024s (12% STDEV)
- UTF16 -> 0.024s (11% STDEV)
- UTF16 nil -> 0.019s (12% STDEV)

Here's a gist of the test I used for full transparency: https://gist.github.com/mathewsanders/c4c43915c5e1c13e8fe3b912bf4c27d1
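The gist has the actual test. If you just want to try a comparison yourself, a hand-rolled timer along these lines is one way to do it; elapsedSeconds is a hypothetical helper of mine, not what the gist uses, and numbers from a loop like this depend heavily on build settings and Swift version:

```swift
import Foundation

/// Hypothetical timing helper: runs `body` `iterations` times and returns elapsed seconds.
func elapsedSeconds(iterations: Int, _ body: () -> Void) -> Double {
    let start = DispatchTime.now().uptimeNanoseconds
    for _ in 0..<iterations { body() }
    let end = DispatchTime.now().uptimeNanoseconds
    return Double(end - start) / 1_000_000_000
}

let sample = "hello tests"

// Same iteration count as the benchmark described in this answer.
let byCount = elapsedSeconds(iterations: 1_000_000) { _ = sample.count == 0 }
let byIsEmpty = elapsedSeconds(iterations: 1_000_000) { _ = sample.isEmpty }

print("count == 0: \(byCount)s, isEmpty: \(byIsEmpty)s")
```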

So absolutely use isEmpty instead of counting characters, and maybe consider using the UTF16 view if it's appropriate for the text you'll be making substrings from.

I see two things that might help:

index(_:offsetBy:) is O(n), where n is the distance you're offsetting, so you can squeeze a bit out by calculating the endIndex as an offset from the substring's start index instead of from the beginning of the string.

Whether the UTF16 approach is appropriate depends on whether you're going to be using text that has multi-width characters (emoji, non-Latin character sets), because then you're not going to have a 1:1 relationship between characters and UTF16 indexes.

Lastly, when you check to see if the string is empty, you're checking whether the character count is zero. self.characters.count == 0 is O(n), where n is the number of characters; you can get some performance increase here by using self.isEmpty, which is O(1).

I ran a quick profile in Xcode comparing three options:

Baseline approach (in question)
Using isEmpty check and calculating endIndex from startIndex
Using isEmpty check and UTF16 substring

Option 1: 1.302s
Option 2: 0.729s
Option 3: 0.462s

So absolutely use isEmpty instead of counting characters, and maybe consider using the UTF16 view if it's appropriate for the text you'll be making substrings from!

answered Jan 3, 2017 at 5:28

MathewS
