I am pulling a JSON file from a site and one of the strings received is:
The Weeknd ‘King Of The Fall’ [Video Premiere] | @TheWeeknd | #SoPhi
How can I convert things like ‘ into the correct characters?
I've made a Xcode Playground to demonstrate it:
import UIKit
var error: NSError?
let blogUrl: NSURL = NSURL.URLWithString("http://sophisticatedignorance.net/api/get_recent_summary/")
let jsonData = NSData(contentsOfURL: blogUrl)
let dataDictionary = NSJSONSerialization.JSONObjectWithData(jsonData, options: nil, error: &error) as NSDictionary
var a = dataDictionary["posts"] as NSArray
println(a[0]["title"])
23 Answers 23
This answer was last revised for Swift 5.2 and iOS 13.4 SDK.
There's no straightforward way to do that, but you can use NSAttributedString magic to make this process as painless as possible (be warned that this method will strip all HTML tags as well).
Remember to initialize NSAttributedString from main thread only. It uses WebKit to parse HTML underneath, thus the requirement.
// This is a[0]["title"] in your case
let htmlEncodedString = "The Weeknd <em>‘King Of The Fall’</em>"
guard let data = htmlEncodedString.data(using: .utf8) else {
return
}
let options: [NSAttributedString.DocumentReadingOptionKey: Any] = [
.documentType: NSAttributedString.DocumentType.html,
.characterEncoding: String.Encoding.utf8.rawValue
]
guard let attributedString = try? NSAttributedString(data: data, options: options, documentAttributes: nil) else {
return
}
// The Weeknd ‘King Of The Fall’
let decodedString = attributedString.string
extension String {
init?(htmlEncodedString: String) {
guard let data = htmlEncodedString.data(using: .utf8) else {
return nil
}
let options: [NSAttributedString.DocumentReadingOptionKey: Any] = [
.documentType: NSAttributedString.DocumentType.html,
.characterEncoding: String.Encoding.utf8.rawValue
]
guard let attributedString = try? NSAttributedString(data: data, options: options, documentAttributes: nil) else {
return nil
}
self.init(attributedString.string)
}
}
let encodedString = "The Weeknd <em>‘King Of The Fall’</em>"
let decodedString = String(htmlEncodedString: encodedString)
18 Comments
stringByConvertingFromHTML is an extension, clarity is the single most important attribute a program can have.@akashivskyy's answer is great and demonstrates how to utilize NSAttributedString to decode HTML entities. One possible disadvantage
(as he stated) is that all HTML markup is removed as well, so
<strong> 4 < 5 & 3 > 2</strong>
becomes
4 < 5 & 3 > 2
On OS X there is CFXMLCreateStringByUnescapingEntities() which does the job:
let encoded = "<strong> 4 < 5 & 3 > 2 .</strong> Price: 12 €. @ "
let decoded = CFXMLCreateStringByUnescapingEntities(nil, encoded, nil) as String
println(decoded)
// <strong> 4 < 5 & 3 > 2 .</strong> Price: 12 €. @
but this is not available on iOS.
Here is a pure Swift implementation, for Swift 4 and later. It decodes character entities
references like < using a dictionary, and all numeric character
entities like @ or €. (Note that I did not list all
252 HTML entities explicitly.)
// Mapping from XML/HTML character entity reference to character
// From http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
private let characterEntities : [ Substring : Character ] = [
// XML predefined entities:
""" : "\"",
"&" : "&",
"'" : "'",
"<" : "<",
">" : ">",
// HTML character entity references:
" " : "\u{00a0}",
// ...
"♦" : "♦",
]
extension String {
/// Returns a new string made by replacing in the `String`
/// all HTML character entity references with the corresponding
/// character.
var stringByDecodingHTMLEntities : String {
// ===== Utility functions =====
// Convert the number in the string to the corresponding
// Unicode character, e.g.
// decodeNumeric("64", 10) --> "@"
// decodeNumeric("20ac", 16) --> "€"
func decodeNumeric(_ string : Substring, base : Int) -> Character? {
guard let code = UInt32(string, radix: base),
let uniScalar = UnicodeScalar(code) else { return nil }
return Character(uniScalar)
}
// Decode the HTML character entity to the corresponding
// Unicode character, return `nil` for invalid input.
// decode("@") --> "@"
// decode("€") --> "€"
// decode("<") --> "<"
// decode("&foo;") --> nil
func decode(_ entity : Substring) -> Character? {
if entity.hasPrefix("&#x") || entity.hasPrefix("&#X") {
return decodeNumeric(entity.dropFirst(3).dropLast(), base: 16)
} else if entity.hasPrefix("&#") {
return decodeNumeric(entity.dropFirst(2).dropLast(), base: 10)
} else {
return characterEntities[entity]
}
}
// ===== Method starts here =====
var result = ""
var position = startIndex
// Find the next '&' and copy the characters preceding it to `result`:
while let ampRange = self[position...].range(of: "&") {
result.append(contentsOf: self[position ..< ampRange.lowerBound])
position = ampRange.lowerBound
// Find the next ';' and copy everything from '&' to ';' into `entity`
guard let semiRange = self[position...].range(of: ";") else {
// No matching ';'.
break
}
let entity = self[position ..< semiRange.upperBound]
if let decoded = decode(entity) {
// Replace by decoded character:
result.append(decoded)
position = semiRange.upperBound
} else {
// Invalid entity, copy verbatim:
result.append(contentsOf: self[ampRange])
position = ampRange.upperBound
}
}
// Copy remaining characters to `result`:
result.append(contentsOf: self[position...])
return result
}
}
Example:
let encoded = "&<strong> 4 < 5 & 3 > 2 .</strong> Price: 12 €. @ &"
let decoded = encoded.stringByDecodingHTMLEntities
print(decoded)
// &<strong> 4 < 5 & 3 > 2 .</strong> Price: 12 €. @ &
16 Comments
strtooul(string, nil, base) entirely will cause the code not to work with numeric character entities and crash when it comes to an entity it doesn't recognize (instead of failing gracefully).Swift 4
- String extension computed variable
- Without extra guard, do, catch, etc...
- Returns the original strings if decoding fails
extension String {
var htmlDecoded: String {
let decoded = try? NSAttributedString(data: Data(utf8), options: [
.documentType: NSAttributedString.DocumentType.html,
.characterEncoding: String.Encoding.utf8.rawValue
], documentAttributes: nil).string
return decoded ?? self
}
}
3 Comments
Swift 3 version of @akashivskyy's extension,
extension String {
init(htmlEncodedString: String) {
self.init()
guard let encodedData = htmlEncodedString.data(using: .utf8) else {
self = htmlEncodedString
return
}
let attributedOptions: [String : Any] = [
NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType,
NSCharacterEncodingDocumentAttribute: String.Encoding.utf8.rawValue
]
do {
let attributedString = try NSAttributedString(data: encodedData, options: attributedOptions, documentAttributes: nil)
self = attributedString.string
} catch {
print("Error: \(error)")
self = htmlEncodedString
}
}
}
2 Comments
Swift 2 version of @akashivskyy's extension,
extension String {
init(htmlEncodedString: String) {
if let encodedData = htmlEncodedString.dataUsingEncoding(NSUTF8StringEncoding){
let attributedOptions : [String: AnyObject] = [
NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType,
NSCharacterEncodingDocumentAttribute: NSUTF8StringEncoding
]
do{
if let attributedString:NSAttributedString = try NSAttributedString(data: encodedData, options: attributedOptions, documentAttributes: nil){
self.init(attributedString.string)
}else{
print("error")
self.init(htmlEncodedString) //Returning actual string if there is an error
}
}catch{
print("error: \(error)")
self.init(htmlEncodedString) //Returning actual string if there is an error
}
}else{
self.init(htmlEncodedString) //Returning actual string if there is an error
}
}
}
1 Comment
I was looking for a pure Swift 3.0 utility to escape to/unescape from HTML character references (i.e. for server-side Swift apps on both macOS and Linux) but didn't find any comprehensive solutions, so I wrote my own implementation: https://github.com/IBM-Swift/swift-html-entities
The package, HTMLEntities, works with HTML4 named character references as well as hex/dec numeric character references, and it will recognize special numeric character references per the W3 HTML5 spec (i.e. € should be unescaped as the Euro sign (unicode U+20AC) and NOT as the unicode character for U+0080, and certain ranges of numeric character references should be replaced with the replacement character U+FFFD when unescaping).
Usage example:
import HTMLEntities
// encode example
let html = "<script>alert(\"abc\")</script>"
print(html.htmlEscape())
// Prints "<script>alert("abc")</script>"
// decode example
let htmlencoded = "<script>alert("abc")</script>"
print(htmlencoded.htmlUnescape())
// Prints "<script>alert(\"abc\")</script>"
And for OP's example:
print("The Weeknd ‘King Of The Fall’ [Video Premiere] | @TheWeeknd | #SoPhi ".htmlUnescape())
// prints "The Weeknd ‘King Of The Fall’ [Video Premiere] | @TheWeeknd | #SoPhi "
Edit: HTMLEntities now supports HTML5 named character references as of version 2.0.0. Spec-compliant parsing is also implemented.
3 Comments
( ͡° ͜ʖ ͡° )), whereas none of the other answers manage that.Swift 4 Version
extension String {
init(htmlEncodedString: String) {
self.init()
guard let encodedData = htmlEncodedString.data(using: .utf8) else {
self = htmlEncodedString
return
}
let attributedOptions: [NSAttributedString.DocumentReadingOptionKey : Any] = [
.documentType: NSAttributedString.DocumentType.html,
.characterEncoding: String.Encoding.utf8.rawValue
]
do {
let attributedString = try NSAttributedString(data: encodedData, options: attributedOptions, documentAttributes: nil)
self = attributedString.string
}
catch {
print("Error: \(error)")
self = htmlEncodedString
}
}
}
4 Comments
rawValue syntax NSAttributedString.DocumentReadingOptionKey(rawValue: NSAttributedString.DocumentAttributeKey.documentType.rawValue) and NSAttributedString.DocumentReadingOptionKey(rawValue: NSAttributedString.DocumentAttributeKey.characterEncoding.rawValue) is horrible. Replace it with .documentType and .characterEncodingextension String{
func decodeEnt() -> String{
let encodedData = self.dataUsingEncoding(NSUTF8StringEncoding)!
let attributedOptions : [String: AnyObject] = [
NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType,
NSCharacterEncodingDocumentAttribute: NSUTF8StringEncoding
]
let attributedString = NSAttributedString(data: encodedData, options: attributedOptions, documentAttributes: nil, error: nil)!
return attributedString.string
}
}
let encodedString = "The Weeknd ‘King Of The Fall’"
let foo = encodedString.decodeEnt() /* The Weeknd ‘King Of The Fall’ */
3 Comments
Swift 4:
The total solution that finally worked for me with HTML code and newline characters and single quotes
extension String {
var htmlDecoded: String {
let decoded = try? NSAttributedString(data: Data(utf8), options: [
.documentType: NSAttributedString.DocumentType.html,
.characterEncoding: String.Encoding.utf8.rawValue
], documentAttributes: nil).string
return decoded ?? self
}
}
Usage:
let yourStringEncoded = yourStringWithHtmlcode.htmlDecoded
I then had to apply some more filters to get rid of single quotes (for example, don't, hasn't, It's, etc.), and new line characters like \n:
var yourNewString = String(yourStringEncoded.filter { !"\n\t\r".contains(0ドル) })
yourNewString = yourNewString.replacingOccurrences(of: "\'", with: "", options: NSString.CompareOptions.literal, range: nil)
3 Comments
This would be my approach. You could add the entities dictionary from https://gist.github.com/mwaterfall/25b4a6a06dc3309d9555 Michael Waterfall mentions.
extension String {
func htmlDecoded()->String {
guard (self != "") else { return self }
var newStr = self
let entities = [
""" : "\"",
"&" : "&",
"'" : "'",
"<" : "<",
">" : ">",
]
for (name,value) in entities {
newStr = newStr.stringByReplacingOccurrencesOfString(name, withString: value)
}
return newStr
}
}
Examples used:
let encoded = "this is so "good""
let decoded = encoded.htmlDecoded() // "this is so "good""
OR
let encoded = "this is so "good"".htmlDecoded() // "this is so "good""
1 Comment
Elegant Swift 4 Solution
If you want a string,
myString = String(htmlString: encodedString)
add this extension to your project:
extension String {
init(htmlString: String) {
self.init()
guard let encodedData = htmlString.data(using: .utf8) else {
self = htmlString
return
}
let attributedOptions: [NSAttributedString.DocumentReadingOptionKey : Any] = [
.documentType: NSAttributedString.DocumentType.html,
.characterEncoding: String.Encoding.utf8.rawValue
]
do {
let attributedString = try NSAttributedString(data: encodedData,
options: attributedOptions,
documentAttributes: nil)
self = attributedString.string
} catch {
print("Error: \(error.localizedDescription)")
self = htmlString
}
}
}
If you want an NSAttributedString with bold, italic, links, etc.,
textField.attributedText = try? NSAttributedString(htmlString: encodedString)
add this extension to your project:
extension NSAttributedString {
convenience init(htmlString html: String) throws {
try self.init(data: Data(html.utf8), options: [
.documentType: NSAttributedString.DocumentType.html,
.characterEncoding: String.Encoding.utf8.rawValue
], documentAttributes: nil)
}
}
Comments
Swift 5.1 Version
import UIKit
extension String {
init(htmlEncodedString: String) {
self.init()
guard let encodedData = htmlEncodedString.data(using: .utf8) else {
self = htmlEncodedString
return
}
let attributedOptions: [NSAttributedString.DocumentReadingOptionKey : Any] = [
.documentType: NSAttributedString.DocumentType.html,
.characterEncoding: String.Encoding.utf8.rawValue
]
do {
let attributedString = try NSAttributedString(data: encodedData, options: attributedOptions, documentAttributes: nil)
self = attributedString.string
}
catch {
print("Error: \(error)")
self = htmlEncodedString
}
}
}
Also, if you want to extract date, images, metadata, title and description, you can use my pod named:
][1].
2 Comments
Swift 4
I really like the solution using documentAttributes. However, it is may too slow for parsing files and/or usage in table view cells. I can't believe that Apple does not provide a decent solution for this.
As a workaround, I found this String Extension on GitHub which works perfectly and is fast for decoding.
So for situations in which the given answer is to slow, see the solution suggest in this link: https://gist.github.com/mwaterfall/25b4a6a06dc3309d9555
Note: it does not parse HTML tags.
Comments
Computed var version of @yishus' answer
public extension String {
/// Decodes string with HTML encoding.
var htmlDecoded: String {
guard let encodedData = self.data(using: .utf8) else { return self }
let attributedOptions: [String : Any] = [
NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType,
NSCharacterEncodingDocumentAttribute: String.Encoding.utf8.rawValue]
do {
let attributedString = try NSAttributedString(data: encodedData,
options: attributedOptions,
documentAttributes: nil)
return attributedString.string
} catch {
print("Error: \(error)")
return self
}
}
}
Comments
Have a look at HTMLString - a library written in Swift that allows your program to add and remove HTML entities in Strings
For completeness, I copied the main features from the site:
- Adds entities for ASCII and UTF-8/UTF-16 encodings
- Removes more than 2100 named entities (like &)
- Supports removing decimal and hexadecimal entities
- Designed to support Swift Extended Grapheme Clusters (→ 100% emoji-proof)
- Fully unit tested
- Fast
- Documented
- Compatible with Objective-C
1 Comment
Swift 4
func decodeHTML(string: String) -> String? {
var decodedString: String?
if let encodedData = string.data(using: .utf8) {
let attributedOptions: [NSAttributedString.DocumentReadingOptionKey : Any] = [
.documentType: NSAttributedString.DocumentType.html,
.characterEncoding: String.Encoding.utf8.rawValue
]
do {
decodedString = try NSAttributedString(data: encodedData, options: attributedOptions, documentAttributes: nil).string
} catch {
print("\(error.localizedDescription)")
}
}
return decodedString
}
1 Comment
Swift 4.1 +
var htmlDecoded: String {
let attributedOptions: [NSAttributedString.DocumentReadingOptionKey : Any] = [
NSAttributedString.DocumentReadingOptionKey.documentType : NSAttributedString.DocumentType.html,
NSAttributedString.DocumentReadingOptionKey.characterEncoding : String.Encoding.utf8.rawValue
]
let decoded = try? NSAttributedString(data: Data(utf8), options: attributedOptions
, documentAttributes: nil).string
return decoded ?? self
}
1 Comment
Swift 4
extension String {
var replacingHTMLEntities: String? {
do {
return try NSAttributedString(data: Data(utf8), options: [
.documentType: NSAttributedString.DocumentType.html,
.characterEncoding: String.Encoding.utf8.rawValue
], documentAttributes: nil).string
} catch {
return nil
}
}
}
Simple Usage
let clean = "Weeknd ‘King Of The Fall’".replacingHTMLEntities ?? "default value"
2 Comments
Updated answer working on Swift 3
extension String {
init?(htmlEncodedString: String) {
let encodedData = htmlEncodedString.data(using: String.Encoding.utf8)!
let attributedOptions = [ NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType]
guard let attributedString = try? NSAttributedString(data: encodedData, options: attributedOptions, documentAttributes: nil) else {
return nil
}
self.init(attributedString.string)
}
Comments
Objective-C
+(NSString *) decodeHTMLEnocdedString:(NSString *)htmlEncodedString {
if (!htmlEncodedString) {
return nil;
}
NSData *data = [htmlEncodedString dataUsingEncoding:NSUTF8StringEncoding];
NSDictionary *attributes = @{NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType,
NSCharacterEncodingDocumentAttribute: @(NSUTF8StringEncoding)};
NSAttributedString *attributedString = [[NSAttributedString alloc] initWithData:data options:attributes documentAttributes:nil error:nil];
return [attributedString string];
}
Comments
Swift 3.0 version with actual font size conversion
Normally, if you directly convert HTML content to an attributed string, the font size is increased. You can try to convert an HTML string to an attributed string and back again to see the difference.
Instead, here is the actual size conversion that makes sure the font size does not change, by applying the 0.75 ratio on all fonts:
extension String {
func htmlAttributedString() -> NSAttributedString? {
guard let data = self.data(using: String.Encoding.utf16, allowLossyConversion: false) else { return nil }
guard let attriStr = try? NSMutableAttributedString(
data: data,
options: [NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType],
documentAttributes: nil) else { return nil }
attriStr.beginEditing()
attriStr.enumerateAttribute(NSFontAttributeName, in: NSMakeRange(0, attriStr.length), options: .init(rawValue: 0)) {
(value, range, stop) in
if let font = value as? UIFont {
let resizedFont = font.withSize(font.pointSize * 0.75)
attriStr.addAttribute(NSFontAttributeName,
value: resizedFont,
range: range)
}
}
attriStr.endEditing()
return attriStr
}
}
Comments
Swift 4
extension String {
mutating func toHtmlEncodedString() {
guard let encodedData = self.data(using: .utf8) else {
return
}
let attributedOptions: [NSAttributedString.DocumentReadingOptionKey : Any] = [
NSAttributedString.DocumentReadingOptionKey(rawValue: NSAttributedString.DocumentAttributeKey.documentType.rawValue): NSAttributedString.DocumentType.html,
NSAttributedString.DocumentReadingOptionKey(rawValue: NSAttributedString.DocumentAttributeKey.characterEncoding.rawValue): String.Encoding.utf8.rawValue
]
do {
let attributedString = try NSAttributedString(data: encodedData, options: attributedOptions, documentAttributes: nil)
self = attributedString.string
}
catch {
print("Error: \(error)")
}
}
2 Comments
rawValue syntax NSAttributedString.DocumentReadingOptionKey(rawValue: NSAttributedString.DocumentAttributeKey.documentType.rawValue) and NSAttributedString.DocumentReadingOptionKey(rawValue: NSAttributedString.DocumentAttributeKey.characterEncoding.rawValue) is horrible. Replace it with .documentType and .characterEncodingUse:
NSData dataRes = (nsdata value )
var resString = NSString(data: dataRes, encoding: NSUTF8StringEncoding)