This question is kinda similar to mine. However, I am using C++ with Qt instead of C#.
How would I efficiently and easily remove all accents and special characters like !"§$%&/()=? etc. from a QString?
So "áche" should turn into "ache", and "über dir" into "ueber dir" (in German, ü, ä and ö can be written as the base letter with an e appended) or at least "uber dir".
Note: Some people use a $ instead of s in some words so I want to make sure if a file is called "Ke$ha" that it will come out as "Kesha" or at least "KeSha".
The way I do it so far, incomplete, is like this:
    void Utils::replaceInvalidChars(QString &str)
    {
        if( str.size() == 0 )
            return;
        while( str.at(0) == '.' ) {
            str.remove(0,1);
        }
        str.replace( "/", "-" );
        str.replace( "|", "" );
        str.replace( ":", "-" );
        str.replace("\"", "" );
        str.replace( "?", "" );
        str.replace( "$", "s" );
        str.replace( "*", "" );
        str.replace( ",", "" );
        str.replace( "¿", "" );
        str.replace( "¡", "" );
        str.replace( "!", "" );
        str.replace( "'", "" );
        str.replace( "ë", "e" );
        str.replace( "ê", "e" );
        str.replace( "é", "e" );
        str.replace( "è", "e" );
        str.replace( "ç", "c" );
        str.replace( "ó", "o" );
        str.replace( "ö", "oe" );
        //U's...
        str.replace( "ü", "ue" );
        str.replace( "Ü", "U" );
        str.replace( "ù", "u" );
        str.replace( "Ù", "U" );
        str.replace( "û", "u" );
        str.replace( "Û", "u" );
        //ns
        str.replace( "ñ", "n" );
        //as
        str.replace( "ä", "ae" );
        str.replace( "Ä", "ae" );
        str.replace( "á", "a" );
        str.replace( "Á", "A" );
        str.replace( "à", "a" );
        str.replace( "À", "A" );
        str.replace( "ï", "i" );
    }
So first I remove all leading dots, no matter how many there are. Then I either delete certain characters or replace them with a character like "s" or "a", depending on what they are.
My way is very long, tedious and chaotic. I am about to organize it a little with comments like "N's", "U's" etc. but still, if I make a mistake somewhere it will take way too long until I (eventually) find it.
2 Answers
I would start by separating the data from the logic:
    std::vector<std::pair<QString, QString>> replacements {
        { "/", "-" },
        { "|", "" },
        // ...
        { "ï", "i" }
    };
    for ( auto const &r : replacements) {
        str.replace(r.first, r.second);
    }
I'm not sure the comments about the groups of letters being replaced really add a lot though.
Then I'd at least consider moving the data out of the program itself, and into a data file the program uses, so the replacements you do can be adjusted without re-compiling the code (this is the sort of thing that frequently seems to need a fair amount of "tweaking", since there's no one way of doing it that's obviously correct and the other ways are wrong).
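As a rough sketch of the data-file idea, the rules could be read from a text stream with one rule per line. To keep it runnable without Qt, this version uses std::string on UTF-8 text rather than QString. The file format (`from<TAB>to`, with a missing right-hand side meaning "delete") and the names `Replacements`, `loadReplacements` and `applyReplacements` are assumptions made up for illustration, not anything from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>   // std::istringstream, handy for feeding the loader in tests
#include <string>
#include <utility>
#include <vector>

using Replacements = std::vector<std::pair<std::string, std::string>>;

// Parse one rule per line in the (assumed) format "<from><TAB><to>".
// A line without a TAB, or with nothing after it, means "delete <from>".
// Blank lines and lines with an empty <from> are skipped.
Replacements loadReplacements(std::istream &in)
{
    Replacements rules;
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty())
            continue;
        const std::size_t tab = line.find('\t');
        std::string from = line.substr(0, tab);  // whole line if there is no TAB
        std::string to = (tab == std::string::npos) ? "" : line.substr(tab + 1);
        if (!from.empty())
            rules.emplace_back(std::move(from), std::move(to));
    }
    return rules;
}

// Replace every occurrence of each rule's "from" text, in rule order.
std::string applyReplacements(std::string text, const Replacements &rules)
{
    for (const auto &rule : rules) {
        assert(!rule.first.empty());  // an empty pattern would never advance
        std::size_t pos = 0;
        while ((pos = text.find(rule.first, pos)) != std::string::npos) {
            text.replace(pos, rule.first.size(), rule.second);
            pos += rule.second.size();  // continue after the inserted text
        }
    }
    return text;
}
```

Passing a std::ifstream opened on the rules file to `loadReplacements` would allow the table to be edited without recompiling. In the Qt program, the same loop could call `QString::replace` as in the snippet above.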
Loading the data from a file is a good idea. About the map, since "á" and "à" would both turn into "a", do you think I should use a QMap<QString, QStringList> instead, or would that be a rather bad idea? – Davlog, Jan 14, 2016 at 17:36
@Davlog: I doubt it makes a whole lot of difference in either direction. – Jerry Coffin, Jan 14, 2016 at 18:31
I would eliminate lines and clean up the code by relying on regular expressions.
    QString s = "áche über dir Ke$ha is worth $100";
    // Performance: Eliminate characters you do not wish to have.
    s.remove(QRegularExpression("[" + QRegularExpression::escape("'!*,?|¡¿") + "]"));
    qDebug().noquote() << "Before:\t" << s;
    // Performance: Check for characters
    if (s.contains(QRegularExpression("[" + QRegularExpression::escape("$/:ÀÁÄÙÛÜàáäçèéêëïñóöùûü") + "]")))
    {
        // Special Characters
        // Escape function is a safety measure in case you accidentally insert "^" in the square brackets.
        s.replace(QRegularExpression("[" + QRegularExpression::escape(":/") + "]"), "-");
        s.replace(QRegularExpression("[$]"), "s");
        // Upper Case
        s.replace(QRegularExpression("[ÁÀ]"), "A");
        s.replace(QRegularExpression("[Ä]"), "Ae");
        s.replace(QRegularExpression("[ÜÛÙ]"), "U");
        // Lower Case
        s.replace(QRegularExpression("[áà]"), "a");
        s.replace(QRegularExpression("[ä]"), "ae");
        s.replace(QRegularExpression("[ç]"), "c");
        s.replace(QRegularExpression("[ëêéè]"), "e");
        s.replace(QRegularExpression("[ï]"), "i");
        s.replace(QRegularExpression("[ñ]"), "n");
        s.replace(QRegularExpression("[óö]"), "o");
        s.replace(QRegularExpression("[ûù]"), "u");
        s.replace(QRegularExpression("[ü]"), "ue");
    }
    qDebug().noquote() << " After:\t" << s;
    Before: áche über dir Ke$ha is worth $100
     After: ache ueber dir Kesha is worth s100
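Oops, that exposes an error in your code: the "$" in "$100" is a currency sign, not a stand-in for "s". Let's adjust that line so a "$" is only replaced when the next character is not a digit:
    s.replace(QRegularExpression("[$]([^0-9])"), "s\\1");
    Before: áche über dir Ke$ha is worth $100
     After: ache ueber dir Kesha is worth $100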
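One gap in the `[$]([^0-9])` pattern is that it needs some character after the "$", so a "$" at the very end of the string is left untouched. A negative lookahead avoids that and also removes the need for a backreference. The sketch below uses std::regex so it can be tried without Qt; the function name `dollarToS` is made up for illustration. QRegularExpression is PCRE-based and should accept the same `(?![0-9])` syntax, but that variant is untested here.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Turn "$" into "s" unless a digit follows, so prices such as "$100" survive.
// The negative lookahead (?![0-9]) inspects the next character without
// consuming it, which also covers a "$" at the end of the string.
std::string dollarToS(const std::string &text)
{
    static const std::regex dollarNotBeforeDigit(R"(\$(?![0-9]))");
    return std::regex_replace(text, dollarNotBeforeDigit, "s");
}
```

With this, "Ke$ha is worth $100" becomes "Kesha is worth $100", and a trailing "$" as in "mone$" is also turned into "s".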