Compile-Time Strings

By Wu Yongwei

Overload, 30(172):4-7, December 2022


Compile-time strings have been used in many projects over the years. Wu Yongwei summarises his experience.

std::string is mostly unsuitable for compile-time string manipulations.

There are several reasons:

  • Before C++20, one could not use strings at all at compile time. In addition, the major compilers didn’t start to support compile-time strings until quite late. MSVC [MSVC] was the front runner in this regard, GCC [GCC] came second with GCC 12, and Clang [Clang] came last with Clang 15 (released a short while ago).
  • With C++20 one can use strings at compile time, but there are still a lot of inconveniences, the most obvious being that strings generated at compile time cannot be used at run time. Besides, a string cannot be declared constexpr.
  • A string cannot be used as a template argument.

So we have to give up this apparent choice and explore other possibilities. The candidates are:

  • const char pointer, which is what a string literal naturally decays to
  • string_view, a powerful tool added by C++17: it has similar member functions to those of string, but they are mostly marked as constexpr!
  • array, with which we can generate brand-new strings

We will try these types in the following discussion.

Functions commonly needed

Getting the string length

One of the most basic functions on a string is getting its length. Here we cannot use the C function strlen, as it is not constexpr.

We will try several different ways to implement it.

First, we can implement strlen manually, and mark the function constexpr (see Listing 1).

namespace strtools {
  constexpr size_t length(const char* str)
  {
    size_t count = 0;
    while (*str != '\0') {
      ++str;
      ++count;
    }
    return count;
  }
} // namespace strtools
			
Listing 1

However, is there an existing mechanism to retrieve the length of a string in the standard library? The answer is a definite Yes. The standard library does support getting the length of a string of any of the standard character types, like char, wchar_t, etc. With the most common character type char, we can write:

  constexpr size_t length(const char* str)
  {
    return char_traits<char>::length(str);
  }

It’s been possible to use char_traits methods at compile time since C++17. (However, you may encounter problems with older compiler versions, like GCC 8.)
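
Since char_traits is a template over the character type, the same one-liner extends to the other standard character types. The following is a minimal sketch, assuming C++17; the namespace strtools_sketch and the template parameter CharT are names invented here for illustration and are not part of the strtools code above.

```cpp
#include <cstddef>
#include <string>

namespace strtools_sketch {

// Length of a null-terminated string of any standard character type.
// CharT is deduced from the argument (char, wchar_t, char16_t, ...).
template <typename CharT>
constexpr std::size_t length(const CharT* str)
{
    return std::char_traits<CharT>::length(str);
}

} // namespace strtools_sketch

// Compile-time checks for narrow, wide and UTF-16 literals.
static_assert(strtools_sketch::length("Hi") == 2);
static_assert(strtools_sketch::length(L"Hello") == 5);
static_assert(strtools_sketch::length(u"") == 0);
```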

Assuming you can use C++17, string_view is definitely worth a try:

  constexpr size_t length(string_view sv)
  {
    return sv.size();
  }

Regardless of the approach used, now we can use the following code to verify that we can indeed check the length of a string at compile time:

  static_assert(strtools::length("Hi") == 2);

At present, the string_view implementation seems the most convenient.

Finding a character

Finding a specific character is also quite often needed. We can’t use strchr, but again, we can choose from a few different implementations. The code is pretty simple, whether implemented with char_traits or with string_view.

Here is the version with char_traits:

  constexpr const char*
  find(const char* str, char ch)
  {
    return char_traits<char>::find(
      str, length(str), ch);
  }

Here is the version with string_view:

  constexpr string_view::size_type
  find(string_view sv, char ch)
  {
    return sv.find(ch);
  }

I am not going to show the manual lookup code this time. (Unless you have to use an old compiler, simpler is better.)
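
One difference between the two versions is worth seeing side by side: the char_traits version reports 'not found' with a null pointer, while the string_view version reports it with npos. The sketch below assumes C++17; the names find_sketch, find_ptr, find_pos and sample are invented for this illustration so that the two versions can coexist.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

namespace find_sketch {

// Pointer-based lookup: returns nullptr when ch is absent.
constexpr const char* find_ptr(const char* str, char ch)
{
    return std::char_traits<char>::find(
        str, std::char_traits<char>::length(str), ch);
}

// Index-based lookup: returns string_view::npos when ch is absent.
constexpr std::string_view::size_type
find_pos(std::string_view sv, char ch)
{
    return sv.find(ch);
}

constexpr const char* sample = "usr/local";

// Both versions locate the slash at index 3.
static_assert(find_ptr(sample, '/') - sample == 3);
static_assert(find_pos(sample, '/') == 3);

// The 'not found' results differ in kind.
static_assert(find_ptr(sample, 'x') == nullptr);
static_assert(find_pos(sample, 'x') == std::string_view::npos);

} // namespace find_sketch
```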

Comparing strings

The next functions are string comparisons. Here string_view wins hands down: string_view supports the standard comparisons directly, and you do not need to write any code.
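
As a quick illustration, assuming C++17, the comparisons can be used directly inside static_assert. The namespace compare_sketch and the helper same are hypothetical names used only in this sketch.

```cpp
#include <string_view>

namespace compare_sketch {

using namespace std::literals; // enables the "..."sv suffix

// Equality and ordering are constexpr members of the string_view API.
static_assert("local"sv == "local"sv);
static_assert("usr"sv != "local"sv);
static_assert("local"sv < "usr"sv);           // 'l' sorts before 'u'
static_assert("usr"sv.compare("usr") == 0);

// A string_view also compares directly against a string literal.
static_assert("local"sv == "local");

// Hypothetical helper: wraps the comparison in a constexpr function.
constexpr bool same(std::string_view a, std::string_view b)
{
    return a == b;
}

} // namespace compare_sketch
```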

Getting substrings

It seems that string_views are very convenient, and we should use string_views wherever possible. However, is string_view::substr suitable for getting substrings? This is difficult to answer without an actual usage scenario. One real scenario I encountered in projects was that the __FILE__ macro may contain the full path at compile time, resulting in different binaries when compiling under different paths. We wanted to truncate the path completely so that the absolute paths would not show up in binaries.

My tests showed that string_view::substr could not handle this job. With the following code:

  puts("/usr/local"sv.substr(5).data());

we will see assembly output like the following from the compiler on [Godbolt] (at https://godbolt.org/z/1dssd96vz):

  .LC0:
        .string "/usr/local"
        …
        mov     edi, OFFSET FLAT:.LC0+5
        call    puts
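
The reason can also be checked in code rather than in assembly: a string_view returned by substr is only a pointer and a length into the original characters, so the full literal has to stay in the binary. Here is a small sketch, assuming C++17; the names view_sketch, full and tail are invented for the example.

```cpp
#include <string_view>

namespace view_sketch {

constexpr std::string_view full = "/usr/local";
constexpr std::string_view tail = full.substr(5);

// The view has the expected content and length...
static_assert(tail == "local");
static_assert(tail.size() == 5);

// ...but it still points into the original string, 5 characters in.
static_assert(tail.data() == full.data() + 5);

} // namespace view_sketch
```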

We have to find another way.

Let’s try array. It’s easy to think of code like the following:

  constexpr auto substr(string_view sv, 
    size_t offset, size_t count)
  {
    array<char, count + 1> result{};
    copy_n(&sv[offset], count, result.data());
    return result;
  }

The intention of the code should be very clear: generate a brand-new character array of the requested size and zero it out (before C++20, variables in a constexpr function had to be initialized on declaration); copy what we need; and then return the result. Unfortunately, the code won’t compile.

There are two problems in the code:

  • Function parameters are not constexpr, and cannot be used as template arguments.
  • copy_n was not constexpr before C++20, and cannot be used in compile-time programming.

The second problem is easy to fix: a manual loop will do. We shall focus on the first problem.

A constexpr function can be evaluated at compile time or at run time, so its function arguments are not treated as compile-time constants, and cannot be used in places where compile-time constants are required, such as template arguments.

Furthermore, this problem still exists with the C++20 consteval function, where the function is only invoked at compile time. The main issue is that if function parameters could be used as compile-time constants, we could write a function whose arguments, of the same type but with different values, produce return values of different types. For example (currently illegal):

  consteval auto make_constant(int n)
  {
    return integral_constant<int, n>{};
  }

This is unacceptable in the current type system: we still require that the return values of a function have a unique type. If we want a value to be used as a template argument inside a function, it must be passed to the function template as a template argument (rather than as a function argument to a non-template function). In this case, each distinct template argument implies a different template specialization, so the issue of a multiple-return-type function does not occur.
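
For comparison, here is a sketch of the legal form, in which the value is passed as a template argument. It assumes C++17; the namespace template_arg_sketch is a name chosen for this illustration.

```cpp
#include <type_traits>

namespace template_arg_sketch {

// N is a template argument, so each value of N selects a different
// specialization, and each specialization has one fixed return type.
template <int N>
constexpr auto make_constant()
{
    return std::integral_constant<int, N>{};
}

static_assert(std::is_same_v<decltype(make_constant<2>()),
                             std::integral_constant<int, 2>>);

// Different arguments give different return types, without any
// single function having more than one return type.
static_assert(!std::is_same_v<decltype(make_constant<2>()),
                              decltype(make_constant<3>())>);

} // namespace template_arg_sketch
```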

By the way, a standard proposal P1045 [Stone19] tried to solve this problem, but its progress seems stalled. As there are workarounds (to be discussed below), we are still able to achieve the desired effect.

Let’s now return to the substr function and convert the count parameter into a template parameter. Listing 2 is the result.

template <size_t Count>
constexpr auto substr(string_view sv,
                      size_t offset = 0)
{
  array<char, Count + 1> result{};
  for (size_t i = 0; i < Count; ++i)
  {
    result[i] = sv[offset + i];
  }
  return result;
}
Listing 2

This time the code really works. With:

  puts(substr<5>("/usr/local", 5).data());

we no longer see "/usr/" in the compiler output.

Regretfully, this shows how abstractions can challenge compilers: with the versions of GCC (12.2) and MSVC (19.33) that were the latest on Godbolt at the time of writing, this version of substr does not generate optimal output. There are also some compatibility issues with older compiler versions. So, purely from a practical point of view, I recommend the implementation in Listing 3, which does not use string_view.

template <size_t Count>
constexpr auto substr(const char* str,
                      size_t offset = 0)
{
  array<char, Count + 1> result{};
  for (size_t i = 0; i < Count; ++i) {
    result[i] = str[offset + i];
  }
  return result;
}
			
Listing 3

If you are interested, you can compare the assembly outputs of these two different versions of the code.

Only Clang is able to generate the same efficient assembly code with both versions:

  mov     word ptr [rsp + 4], 108
  mov     dword ptr [rsp], 1633906540
  mov     rdi, rsp
  call    puts

If you don’t understand why the numbers 108 and 1633906540 are there, let me remind you that the hexadecimal representations of these two numbers are 0x6C and 0x61636F6C, respectively. Check the ASCII table and you should be able to understand.
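
The decoding can be spelled out as a few compile-time checks. This sketch assumes an ASCII execution character set and the little-endian byte order of x86-64; the names decode_sketch and byte_at are invented for the illustration.

```cpp
#include <cstdint>

namespace decode_sketch {

// Returns byte number `index` of `value`, counting from the least
// significant byte (the one stored first on a little-endian machine).
constexpr char byte_at(std::uint32_t value, unsigned index)
{
    return static_cast<char>((value >> (8 * index)) & 0xFFu);
}

// The 32-bit store writes the first four characters, "loca".
static_assert(1633906540u == 0x61636F6Cu);
static_assert(byte_at(1633906540u, 0) == 'l'); // 0x6C
static_assert(byte_at(1633906540u, 1) == 'o'); // 0x6F
static_assert(byte_at(1633906540u, 2) == 'c'); // 0x63
static_assert(byte_at(1633906540u, 3) == 'a'); // 0x61

// The 16-bit store writes the final 'l' followed by the terminating
// null character.
static_assert(108 == 0x6C);
static_assert(108 == 'l');

} // namespace decode_sketch
```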

Since we have stopped using string_view in the function parameters, the parameter offset has become much less useful. Hence, I will get rid of this parameter, and rename the function to copy_str (Listing 4).

template <size_t Count>
constexpr auto copy_str(const char* str)
{
  array<char, Count + 1> result{};
  for (size_t i = 0; i < Count; ++i)
  {
    result[i] = str[i];
  }
  return result;
}
			
Listing 4
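
A short usage sketch may help here. It assumes C++17, repeats copy_str with std:: qualification so that the block is self-contained, and uses the invented names copy_str_sketch, path and local. The caller now does the offsetting itself with ordinary pointer arithmetic.

```cpp
#include <array>
#include <cstddef>
#include <string_view>

namespace copy_str_sketch {

// Same logic as Listing 4, with explicit std:: qualification.
template <std::size_t Count>
constexpr auto copy_str(const char* str)
{
    std::array<char, Count + 1> result{};
    for (std::size_t i = 0; i < Count; ++i) {
        result[i] = str[i];
    }
    return result;
}

constexpr const char* path = "/usr/local";

// Skip the first five characters, then copy the next five.
constexpr auto local = copy_str<5>(path + 5);

// Five characters plus the terminating null character.
static_assert(local.size() == 6);
static_assert(local[5] == '\0');
static_assert(std::string_view(local.data(), 5) == "local");

} // namespace copy_str_sketch
```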

Passing arguments at compile time

When you try composing the compile-time functions together, you will find something lacking. For example, if you wanted to remove the first segment of a path automatically (like from "/usr/local" to "local"), you might try some code like Listing 5.

constexpr auto remove_head(const char* path)
{
  if (*path == '/') {
    ++path;
  }
  auto start = find(path, '/');
  if (start == nullptr) {
    return copy_str<length(path)>(path);
  } else {
    return copy_str<length(start + 1)>(
      start + 1);
  }
}
			
Listing 5

The problem is still that it won’t compile. And did you notice that this code violates exactly the constraint I mentioned above that the return type of a function must be consistent and unique?

I have adopted a solution described by Michael Park [Park17]: using lambda expressions to encapsulate ‘compile-time arguments’. I have defined three macros for convenience and readability:

  #define CARG typename
  #define CARG_WRAP(x) [] { return (x); }
  #define CARG_UNWRAP(x) (x)()

CARG means ‘constexpr argument’, a compile-time constant argument. We can now make make_constant really work:

  template <CARG Int>
  constexpr auto make_constant(Int cn)
  {
    constexpr int n = CARG_UNWRAP(cn);
    return integral_constant<int, n>{};
  }

And it is easy to verify that it works:

  auto result = make_constant(CARG_WRAP(2));
  static_assert(
   std::is_same_v<integral_constant<int, 2>,
   decltype(result)>);

A few explanations follow. In the template parameter, I use CARG (instead of typename) for code readability: it indicates the intention that the template parameter is essentially a type wrapper for compile-time constants. Int is the name of this special type. We will not provide this type when instantiating the function template, but instead let the compiler deduce it.

When calling the ‘function’ (make_constant(CARG_WRAP(2))), we provide a lambda expression ([] { return (2); }), which encapsulates the constant we need. When we need to use this parameter, we use CARG_UNWRAP (evaluate [] { return (2); }()) to get the constant back.
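
The same mechanism works for strings, and it may be easier to follow with the macros expanded by hand. The sketch below assumes C++17; length_constant, lambda_arg_sketch and len are hypothetical names used only here.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

namespace lambda_arg_sketch {

// Str is the (deduced) type of a captureless lambda that returns the
// string. Calling it is a constant expression, so the result can
// initialize constexpr variables and feed a template argument.
template <typename Str>
constexpr auto length_constant(Str cstr)
{
    constexpr const char* str = cstr();      // CARG_UNWRAP(cstr)
    constexpr std::size_t n =
        std::char_traits<char>::length(str);
    return std::integral_constant<std::size_t, n>{};
}

// Equivalent to length_constant(CARG_WRAP("Hi")).
constexpr auto len = length_constant([] { return "Hi"; });

static_assert(decltype(len)::value == 2);

} // namespace lambda_arg_sketch
```

The string never appears as a template argument itself; only the lambda's type does, and the value is recovered by calling the lambda inside the function.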

Now we can rewrite the remove_head function (Listing 6).

template <CARG Str>
constexpr auto remove_head(Str cpath)
{
  constexpr auto path = CARG_UNWRAP(cpath);
  constexpr int skip = (*path == '/') ? 1 : 0;
  constexpr auto pos = path + skip;
  constexpr auto start = find(pos, '/');
  if constexpr (start == nullptr) {
    return copy_str<length(pos)>(pos);
  } else {
    return copy_str<length(start + 1)>(
      start + 1);
  }
}
			
Listing 6

This function is similar in structure to the previous version, but many details have changed. In order to pass the result to copy_str as a template argument, we have to use constexpr all the way along. So we have to give up mutability, and write code in a quite functional style.

Does it really work? Let’s put the following statement into the main function:

  puts(strtools::remove_head(
    CARG_WRAP("/usr/local")).data());

And here is the optimized assembly output from GCC on x86-64 (see https://godbolt.org/z/Mv5YanPvq):

  main:
          sub     rsp, 24
          mov     eax, DWORD PTR .LC0[rip]
          lea     rdi, [rsp+8]
          mov     DWORD PTR [rsp+8], eax
          mov     eax, 108
          mov     WORD PTR [rsp+12], ax
          call    puts
          xor     eax, eax
          add     rsp, 24
          ret
  .LC0:
          .byte   108
          .byte   111
          .byte   99
          .byte   97

As you can see clearly, the compiler will put the ASCII codes for "local" on the stack, assign its starting address to the rdi register, and then call the puts function. There is absolutely no trace of "/usr/" in the output. In fact, there is no difference between the output of the puts statement above and that of puts(substr<5>("/usr/local", 5).data()).
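
The result can also be verified without reading assembly. The following self-contained sketch, assuming C++17, collects the char_traits versions of length and find, copy_str and remove_head from above (with std:: qualification added) and checks the outcome with static_assert. The names remove_head_sketch, local_part and single are invented for this example.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <string_view>

#define CARG typename
#define CARG_WRAP(x) [] { return (x); }
#define CARG_UNWRAP(x) (x)()

namespace remove_head_sketch {

constexpr std::size_t length(const char* str)
{
    return std::char_traits<char>::length(str);
}

constexpr const char* find(const char* str, char ch)
{
    return std::char_traits<char>::find(str, length(str), ch);
}

template <std::size_t Count>
constexpr auto copy_str(const char* str)
{
    std::array<char, Count + 1> result{};
    for (std::size_t i = 0; i < Count; ++i) {
        result[i] = str[i];
    }
    return result;
}

template <CARG Str>
constexpr auto remove_head(Str cpath)
{
    constexpr auto path = CARG_UNWRAP(cpath);
    constexpr int skip = (*path == '/') ? 1 : 0;
    constexpr auto pos = path + skip;
    constexpr auto start = find(pos, '/');
    if constexpr (start == nullptr) {
        return copy_str<length(pos)>(pos);
    } else {
        return copy_str<length(start + 1)>(start + 1);
    }
}

// A path with two segments: the first one is removed.
constexpr auto local_part = remove_head(CARG_WRAP("/usr/local"));
static_assert(local_part.size() == 6); // "local" plus '\0'
static_assert(std::string_view(local_part.data(), 5) == "local");

// A path without any further '/': everything is kept.
constexpr auto single = remove_head(CARG_WRAP("local"));
static_assert(std::string_view(single.data(), 5) == "local");

} // namespace remove_head_sketch
```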

I would like to remind you that it is safe to pass and store the character array, but it is not safe to store the pointer obtained from its data() method. It is possible to use such a pointer immediately in calling other functions (like puts, above), as the temporary array lives until the end of the current statement. However, if you saved this pointer, it would become dangling after the current statement, and dereferencing it would then be undefined behaviour.
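
The difference between the safe and the unsafe pattern can be sketched as follows. This assumes C++17; make_local is a hypothetical stand-in for any function that returns a compile-time string as an array, and lifetime_sketch and first_char_safe are likewise invented names. The unsafe variant is shown only in comments, since executing it would be undefined behaviour.

```cpp
#include <array>

namespace lifetime_sketch {

// Hypothetical stand-in for remove_head(...) or copy_str<N>(...).
constexpr std::array<char, 6> make_local()
{
    return {{'l', 'o', 'c', 'a', 'l', '\0'}};
}

// Safe: the array itself is stored, so the pointer obtained from
// data() stays valid for as long as `name` is alive.
constexpr char first_char_safe()
{
    auto name = make_local();
    const char* p = name.data();
    return *p;
}

// Unsafe (do not do this):
//
//   const char* p = make_local().data(); // temporary array destroyed
//                                        // at the end of this statement
//   char c = *p;                         // dangling pointer: undefined
//                                        // behaviour

static_assert(first_char_safe() == 'l');

} // namespace lifetime_sketch
```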

String template parameters

We have tried turning strings into types (via lambda expressions) for compile-time argument passing, but unlike integers and integral_constants, there is no one-to-one correspondence between the two. This is often inconvenient: for two integral_constants, we can directly use is_same to determine whether they are the same; for strings represented as lambda expressions, we cannot do the same – two lambda expressions always have different types.

Direct use of string literals as non-type template arguments is not allowed in C++, because strings may appear repeatedly in different translation units, and they do not have proper comparison semantics – comparing two strings is just a comparison of two pointers, which cannot achieve what users generally expect. To use string literals as template arguments, we need to find a way to pass the string as a sequence of characters to the template. We have two methods available:

  • The non-standard GNU extension used by GCC and Clang (which can be used prior to C++20)
  • The C++20 approach suitable for any conformant compilers (including GCC and Clang)

Let’s have a look one by one.

The GNU extension

GCC and Clang have implemented the standard proposal N3599 [Smith13], which allows us to use strings as template arguments. The compiler will expand the string into characters, and the rest is standard C++. Listing 7 is an example.

template <char... Cs>
struct compile_time_string {
  static constexpr char value[]{Cs..., '\0'};
};
template <typename T, T... Cs>
constexpr compile_time_string<Cs...> 
  operator""_cts()
{
  return {};
}
			
Listing 7

The definition of the class template is standard C++, so that:

  compile_time_string<'H', 'i'>

is a valid type and, at the same time, by taking the value member of this type, we can get "Hi". The GNU extension is the string literal operator template – we can now write "Hi"_cts to get an object of type compile_time_string<'H', 'i'>. The following code will compile with the above definitions:

  constexpr auto a = "Hi"_cts;
  constexpr auto b = "Hi"_cts;
  static_assert(
    is_same_v<decltype(a), decltype(b)>);
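
The standard half of Listing 7 can be checked on its own, without the literal operator and therefore without the extension. This sketch assumes C++17; the namespace cts_pack_sketch and the aliases hi and ha are invented for the example.

```cpp
#include <string_view>
#include <type_traits>

namespace cts_pack_sketch {

// The class template from Listing 7; this part is standard C++.
template <char... Cs>
struct compile_time_string {
    static constexpr char value[]{Cs..., '\0'};
};

using hi = compile_time_string<'H', 'i'>;
using ha = compile_time_string<'H', 'a'>;

// The characters are part of the type, so equal strings give the
// same type and different strings give different types.
static_assert(std::is_same_v<hi, compile_time_string<'H', 'i'>>);
static_assert(!std::is_same_v<hi, ha>);

// The value member recovers the null-terminated string.
static_assert(sizeof(hi::value) == 3);
static_assert(std::string_view(hi::value) == "Hi");

} // namespace cts_pack_sketch
```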

The C++20 approach

Though the above method is simple and effective, it failed to reach consensus in the C++ standards committee and did not become part of the standard. However, with C++20, we can use more types in non-type template parameters. In particular, user-defined literal class types are amongst them, provided they are structural types (roughly, all their base classes and non-static data members are public and non-mutable). Listing 8 is an example.

template <size_t N>
struct compile_time_string {
  constexpr compile_time_string(
    const char (&str)[N])
  {
    copy_n(str, N, value);
  }
  char value[N]{};
};
template <compile_time_string cts>
constexpr auto operator""_cts()
{
  return cts;
}
			
Listing 8

Again, the first class template is not special, but allowing this compile_time_string to be used as the type of a non-type template parameter (quite a mouthful ☺), as well as the string literal operator template, is a C++20 improvement. We can now write "Hi"_cts to generate a compile_time_string object. Note, however, that this object is of type compile_time_string<3>, so "Hi"_cts and "Ha"_cts are of the same type – which is very different from the results of the GNU extension. However, the important thing is that compile_time_string can now be used as the type of a template parameter, so we can just add another layer:

  template <compile_time_string cts>
  struct cts_wrapper {
    static constexpr compile_time_string str{cts};
  };
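
The observation that the class type only records the length can be checked even without C++20 features. The sketch below is a C++17 variant of the class template from Listing 8: it replaces copy_n with a manual loop (copy_n is not constexpr before C++20) and relies on class template argument deduction instead of the literal operator. The names cts_len_sketch, hi and ha are invented here.

```cpp
#include <cstddef>
#include <type_traits>

namespace cts_len_sketch {

template <std::size_t N>
struct compile_time_string {
    constexpr compile_time_string(const char (&str)[N])
    {
        // Manual loop instead of copy_n, to stay within C++17.
        for (std::size_t i = 0; i < N; ++i) {
            value[i] = str[i];
        }
    }
    char value[N]{};
};

// N is deduced as 3 in both cases ('H', one more letter, '\0').
constexpr compile_time_string hi{"Hi"};
constexpr compile_time_string ha{"Ha"};

// Same type, because only the length is part of the type...
static_assert(std::is_same_v<decltype(hi), decltype(ha)>);

// ...even though the stored characters differ.
static_assert(hi.value[1] == 'i');
static_assert(ha.value[1] == 'a');

} // namespace cts_len_sketch
```

This is why the extra wrapper layer is needed: only when the object itself becomes a template argument do the characters take part in the type.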

Corresponding to the previous compile-time string type comparison, we now need to write:

  auto a = cts_wrapper<"Hi"_cts>{};
  auto b = cts_wrapper<"Hi"_cts>{};
  static_assert(
    is_same_v<decltype(a), decltype(b)>);

Or we can further simplify it to (as compile_time_string has a non-explicit constructor):

  auto a = cts_wrapper<"Hi">{};
  auto b = cts_wrapper<"Hi">{};
  static_assert(
    is_same_v<decltype(a), decltype(b)>);

These techniques have proved to be useful in my real projects, and I hope they will help you too.

References

[Clang] https://clang.llvm.org/

[GCC] https://gcc.gnu.org/

[Godbolt] Matt Godbolt, Compiler Explorer, https://godbolt.org/

[MSVC] https://visualstudio.microsoft.com/

[Park17] Michael Park, ‘constexpr function parameters’, May 2017, https://mpark.github.io/programming/2017/05/26/constexpr-function-parameters/

[Smith13] Richard Smith, ‘N3599: Literal operator templates for strings’, March 2013, http://wg21.link/n3599

[Stone19] David Stone, ‘P1045R1: constexpr Function Parameters’, September 2019, https://wg21.link/p1045r1

Wu Yongwei Having been a programmer and software architect, Yongwei is currently a consultant and trainer on modern C++. He has nearly 30 years’ experience in systems programming and architecture in C and C++. His focus is on the C++ language, software architecture, performance tuning, design patterns, and code reuse. He has a programming page at http://wyw.dcweb.cn/





