Advertisement
Advertisement


How to convert std::string to lower case?


Question

I want to convert a std::string to lowercase. I am aware of the function tolower(), however in the past I have had issues with this function and it is hardly ideal anyway as use with a std::string would require iterating over each character.

Is there an alternative which works 100% of the time?

2018/06/14
1
798
6/14/2018 8:15:17 AM

Accepted Answer

Adapted from Not So Frequently Asked Questions:

#include <algorithm>
#include <cctype>
#include <string>

std::string data = "Abc";
std::transform(data.begin(), data.end(), data.begin(),
    [](unsigned char c){ return std::tolower(c); });

You're really not going to get away without iterating through each character. There's no way to know whether the character is lowercase or uppercase otherwise.

If you really hate tolower(), here's a specialized ASCII-only alternative that I don't recommend you use:

char asciitolower(char in) {
    if (in <= 'Z' && in >= 'A')
        return in - ('Z' - 'z');
    return in;
}

std::transform(data.begin(), data.end(), data.begin(), asciitolower);

Be aware that tolower() can only do a per-single-byte-character substitution, which is ill-fitting for many scripts, especially if using a multi-byte-encoding like UTF-8.

2019/07/06
931
7/6/2019 12:24:38 AM

Boost provides a string algorithm for this:

#include <boost/algorithm/string.hpp>

std::string str = "HELLO, WORLD!";
boost::algorithm::to_lower(str); // modifies str

Or, for non-in-place:

#include <boost/algorithm/string.hpp>

const std::string str = "HELLO, WORLD!";
const std::string lower_str = boost::algorithm::to_lower_copy(str);
2019/07/06

tl;dr

Use the ICU library. If you don't, your conversion routine will break silently on cases you are probably not even aware of existing.


First you have to answer a question: What is the encoding of your std::string? Is it ISO-8859-1? Or perhaps ISO-8859-8? Or Windows Codepage 1252? Does whatever you're using to convert upper-to-lowercase know that? (Or does it fail miserably for characters over 0x7f?)

If you are using UTF-8 (the only sane choice among the 8-bit encodings) with std::string as container, you are already deceiving yourself if you believe you are still in control of things. You are storing a multibyte character sequence in a container that is not aware of the multibyte concept, and neither are most of the operations you can perform on it! Even something as simple as .substr() could result in invalid (sub-) strings because you split in the middle of a multibyte sequence.

As soon as you try something like std::toupper( 'ß' ), or std::tolower( 'Σ' ) in any encoding, you are in trouble. Because 1), the standard only ever operates on one character at a time, so it simply cannot turn ß into SS as would be correct. And 2), the standard only ever operates on one character at a time, so it cannot decide whether Σ is in the middle of a word (where σ would be correct), or at the end (ς). Another example would be std::tolower( 'I' ), which should yield different results depending on the locale -- virtually everywhere you would expect i, but in Turkey ı (LATIN SMALL LETTER DOTLESS I) is the correct answer (which, again, is more than one byte in UTF-8 encoding).

So, any case conversion that works on a character at a time, or worse, a byte at a time, is broken by design. This includes all the std:: variants in existence at this time.

Then there is the point that the standard library, for what it is capable of doing, is depending on which locales are supported on the machine your software is running on... and what do you do if your target locale is among the not supported on your client's machine?

So what you are really looking for is a string class that is capable of dealing with all this correctly, and that is not any of the std::basic_string<> variants.

(C++11 note: std::u16string and std::u32string are better, but still not perfect. C++20 brought std::u8string, but all these do is specify the encoding. In many other respects they still remain ignorant of Unicode mechanics, like normalization, collation, ...)

While Boost looks nice, API wise, Boost.Locale is basically a wrapper around ICU. If Boost is compiled with ICU support... if it isn't, Boost.Locale is limited to the locale support compiled for the standard library.

And believe me, getting Boost to compile with ICU can be a real pain sometimes. (There are no pre-compiled binaries for Windows that include ICU, so you'd have to supply them together with your application, and that opens a whole new can of worms...)

So personally I would recommend getting full Unicode support straight from the horse's mouth and using the ICU library directly:

#include <unicode/unistr.h>
#include <unicode/ustream.h>
#include <unicode/locid.h>

#include <iostream>

int main()
{
    /*                          "Odysseus" */
    char const * someString = u8"ΟΔΥΣΣΕΥΣ";
    icu::UnicodeString someUString( someString, "UTF-8" );
    // Setting the locale explicitly here for completeness.
    // Usually you would use the user-specified system locale,
    // which *does* make a difference (see ı vs. i above).
    std::cout << someUString.toLower( "el_GR" ) << "\n";
    std::cout << someUString.toUpper( "el_GR" ) << "\n";
    return 0;
}

Compile (with G++ in this example):

g++ -Wall example.cpp -licuuc -licuio

This gives:

ὀδυσσεύς

Note that the Σ<->σ conversion in the middle of the word, and the Σ<->ς conversion at the end of the word. No <algorithm>-based solution can give you that.

2020/07/13

Using range-based for loop of C++11 a simpler code would be :

#include <iostream>       // std::cout
#include <string>         // std::string
#include <locale>         // std::locale, std::tolower

int main ()
{
  std::locale loc;
  std::string str="Test String.\n";

 for(auto elem : str)
    std::cout << std::tolower(elem,loc);
}
2013/10/09

If the string contains UTF-8 characters outside of the ASCII range, then boost::algorithm::to_lower will not convert those. Better use boost::locale::to_lower when UTF-8 is involved. See http://www.boost.org/doc/libs/1_51_0/libs/locale/doc/html/conversions.html

2012/10/10

This is a follow-up to Stefan Mai's response: if you'd like to place the result of the conversion in another string, you need to pre-allocate its storage space prior to calling std::transform. Since STL stores transformed characters at the destination iterator (incrementing it at each iteration of the loop), the destination string will not be automatically resized, and you risk memory stomping.

#include <string>
#include <algorithm>
#include <iostream>

int main (int argc, char* argv[])
{
  std::string sourceString = "Abc";
  std::string destinationString;

  // Allocate the destination space
  destinationString.resize(sourceString.size());

  // Convert the source string to lower case
  // storing the result in destination string
  std::transform(sourceString.begin(),
                 sourceString.end(),
                 destinationString.begin(),
                 ::tolower);

  // Output the result of the conversion
  std::cout << sourceString
            << " -> "
            << destinationString
            << std::endl;
}
2013/03/28

Source: https://stackoverflow.com/questions/313970
Licensed under: CC-BY-SA with attribution
Not affiliated with: Stack Overflow
Email: [email protected]