Titlecase, underscore and laser guns

Posted on June 10, 2021

Not so long ago, I had an opportunity to peek under the hood of titlecase and underscore methods, the tiny cogs of the “Rails magic” machine. The latter turned out to be a very interesting function—a lot of hard-to-follow transformations, secret injections and the like. All of these bits significantly contributed to an odd-looking bug I’ve been working on. 

Today, we’ll unravel the implementation of both methods and look regular expressions in the eye. But first of all, let’s see what this bug was about.

Note: This post describes the behaviour of these functions in Rails 6.1.3, which may change in future versions.

The Bug

The FreeAgent sign-up process can take a while when you have to fill in all business details by hand. Thankfully, we have an option of automating this process by using the “Find company details” feature. The search is simple—just provide the company name and all relevant fields will be filled in with data retrieved from the Companies House API.

Sometimes, the search results didn’t look as expected:

A screenshot of a dropdown menu that shows a list of company names with extra spaces

Hmm, “RTI”? In our domain it’s a pretty special acronym but it’s not something that our user expects to see in the name of their company. It made me wonder how it is possible that RTI is found and formatted this way here. It doesn’t seem to be on our side. We just call this one method:

result["title"].titlecase

So, we take a string and make a pretty title out of it, right? Well, it’s a bit more complicated. Rails’s String#titlecase internally calls an identically named method defined in the ActiveSupport::Inflector module, a part of the machine I mentioned before. ActiveSupport::Inflector provides string transformations such as pluralisation, converting namespaces to paths and many more. As a bonus, it can recognise acronyms in strings, including those defined by a user. RTI happened to be defined in our config/inflections.yml file. Let’s take a look:

# good
"partition arts".titlecase # => "Partition Arts"

# whoops
"PARTITION ARTS".titlecase # => "Pa RTI Tion Arts"

As you can see, there’s a link between uppercase characters and finding acronyms. We deal with company names so we don’t want to infer any acronyms from their names. It suffices to add downcase to fix the bug:

result["title"].downcase.titlecase

But why is that? Where precisely do we find RTI in a string? Let’s find out.

The Big Picture

titleize, the true form of titlecase, is powered by a composition of two methods: humanize and underscore:

humanize(underscore(word), keep_id_suffix: ...)
.gsub(...) do |match|
  match.capitalize
end

The underscore method does the heavy lifting of inferring potential words and acronyms from CamelCase and hyphenated strings. The humanize method, on the other hand, applies a couple of rules to transform a string so it looks acceptable to the end user. This includes replacing underscores with spaces, capitalising the first letter of the string, converting all letters but acronyms to lowercase and the like. For now, we’ll ignore the gsub definition to keep things simple. The general flow of this function is:

  1. Get an underscored version of word
  2. Feed it to the humanize method
  3. Apply the regex defined in gsub to the humanized string
  4. Iterate over the matches and capitalise each of them

underscore Deep Dive

You might’ve been thinking, “It’s not a big deal! underscore just places underscores between words!” Well, sort of. At its core, underscore is intended to transform CamelCase, namespaced and hyphenated words into underscore-separated strings:

"testingUnderscore".underscore  # => "testing_underscore"

# Oh!
"testing underscore".underscore # => "testing underscore"

Now, let’s take a deep breath and look at the definition:

def underscore(camel_cased_word)
  return camel_cased_word unless /[A-Z-]|::/.match?(camel_cased_word) # ①
  word = camel_cased_word.to_s.gsub("::", "/")
  word.gsub!(inflections.acronyms_underscore_regex) { "#{$1 && '_' }#{$2.downcase}" } # ②
  word.gsub!(/([A-Z\d]+)([A-Z][a-z])/, '\1_\2') # ③
  word.gsub!(/([a-z\d])([A-Z])/, '\1_\2') # ④
  word.tr!("-", "_")
  word.downcase!
  word
end

There’s a lot to unpack. For simplicity, I numbered some of the regular expressions so you can go back and see the actual code when I explain them later on. In our “debugging session”, we’ll pass the “PARTITION ARTS” string in, so we can see each transformation that contributed to our bug.

The function flow

1. Do an early check of whether the /[A-Z-]|::/ regex matches the parameter. If it doesn’t, return the word with no substitutions. This is a pretty simple regular expression: it’ll match only if there’s at least one uppercase letter, hyphen or namespace separator.

There’s a match for our argument so we continue our journey.
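The guard is easy to try in isolation. Here’s a minimal plain-Ruby sketch that uses the same regular expression as the definition above; the GUARD constant name and the sample strings are made up for illustration:

```ruby
# The early-return check from underscore, pulled out into a constant.
# GUARD is a hypothetical name; ActiveSupport inlines the regex.
GUARD = /[A-Z-]|::/

["PARTITION ARTS", "partition arts", "my-company", "Admin::User"].each do |string|
  # Prints true when underscore would carry on with its substitutions.
  puts "#{string.inspect} passes the guard: #{GUARD.match?(string)}"
end
```

Only the all-lowercase “partition arts” fails the check, so it’s the only one that would be returned untouched.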

2. If the string contains a namespace separator (::), replace it with a slash. This line helps with inferring paths from namespaced module/class strings. As we don’t have such characters in our string, word is equal to the parameter.

3. Try to match the string against inflections.acronyms_underscore_regex. It’s a regular expression that tests whether any user-defined acronyms are “hidden” in the string[1]. If so, we catch the acronym and the letter that precedes it. We have a match on our string: the acronym “RTI”, preceded by “A”.

The gsub that uses this acronyms regex defines the substitution as follows:

A diagram showing a string interpolation between two groups of regex matches

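The block emits an underscore because the first group (“A”) matched, and concatenates it with the lowercase version of the acronym. The preceding letter itself is only looked at, not consumed, so it stays in place. Now, word is equal to “PA_rtiTION ARTS”.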
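To see this substitution on its own, here’s a sketch in plain Ruby. The regex is modelled on the pattern Rails 6.1 builds for the acronyms check, hard-coded for a single acronym, so treat it as an approximation rather than the library’s code. ACRONYM_UNDERSCORE_REGEX and underscore_acronyms are hypothetical names:

```ruby
# Hypothetical stand-in for inflections.acronyms_underscore_regex with one
# user-defined acronym, "RTI". Group 1 is the letter or digit right before
# the acronym (if there is one), group 2 is the acronym itself.
ACRONYM_UNDERSCORE_REGEX = /(?:(?<=([A-Za-z\d]))|\b)(RTI)(?=\b|[^a-z])/

def underscore_acronyms(string)
  # Emit "_" only when something precedes the acronym, then the acronym in lowercase.
  string.gsub(ACRONYM_UNDERSCORE_REGEX) { "#{$1 && '_'}#{$2.downcase}" }
end

puts underscore_acronyms("PARTITION ARTS") # => PA_rtiTION ARTS
puts underscore_acronyms("RTI report")     # => rti report
puts underscore_acronyms("Partition")      # => Partition
```

The last call shows why lowercase input is safe: the match is case-sensitive, so “rti” inside “Partition” is never treated as an acronym.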

4. Match a run of uppercase letters/digits in one group and an uppercase-lowercase pair of letters in another.

For example, a match on the “CamLLcase” string would be split into two groups: 
Group 1: “L”
Group 2: “Lc”

Later on, gsub would transform the string and place an underscore between these groups. So, “CamLLcase” → “CamL_Lcase”. As we have no match to this regular expression, this transformation doesn’t apply here—there’s no change to our word.
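A quick way to check this step is to run the substitution by itself. The sketch below is plain Ruby with the regex copied from the definition; split_acronym_run is a hypothetical helper name, and “HTMLParser” is an extra sample I added:

```ruby
# The ([A-Z\d]+)([A-Z][a-z]) substitution from underscore, applied on its own.
def split_acronym_run(string)
  string.gsub(/([A-Z\d]+)([A-Z][a-z])/, '\1_\2')
end

puts split_acronym_run("CamLLcase")       # => CamL_Lcase
puts split_acronym_run("HTMLParser")      # => HTML_Parser
puts split_acronym_run("PA_rtiTION ARTS") # => PA_rtiTION ARTS
```

Our word comes out unchanged because it never has an uppercase letter directly followed by a lowercase one.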

5. Match all occurrences of lowercase/digit-uppercase letter pairs and insert “_” between them:

This is the last step of extracting the earlier identified acronym. In our case it’s just one pair, “iT”.

So ④, the /([a-z\d])([A-Z])/ substitution, will transform word to “PA_rti_TION ARTS”.
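The same transformation can be reproduced outside Rails. This is a small plain-Ruby sketch with the regex copied from the definition; split_lower_upper is a hypothetical helper name:

```ruby
# The ([a-z\d])([A-Z]) substitution from underscore, applied on its own.
def split_lower_upper(string)
  string.gsub(/([a-z\d])([A-Z])/, '\1_\2')
end

puts split_lower_upper("PA_rtiTION ARTS")   # => PA_rti_TION ARTS
puts split_lower_upper("testingUnderscore") # => testing_Underscore
```

The second call is the everyday camelCase scenario this regex was presumably written for; in our string it fires only because the previous step lowercased the acronym.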

6. Replace every `-` with `_` to finalise the underscore substitution.

7. Return a lowercase version of the processed string

The final form of word is “pa_rti_tion arts”.
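To replay the whole walkthrough, the steps above can be wired into a small tracer. This is a plain-Ruby sketch, not ActiveSupport’s code: it assumes a single acronym, “RTI”, hard-coded into a hypothetical RTI_REGEX constant, and trace_underscore is a made-up helper that records the string after every step:

```ruby
# Hypothetical stand-in for inflections.acronyms_underscore_regex,
# hard-coded for a single acronym, "RTI".
RTI_REGEX = /(?:(?<=([A-Za-z\d]))|\b)(RTI)(?=\b|[^a-z])/

# Returns [label, value] pairs, one per transformation.
def trace_underscore(input)
  return [["early return", input]] unless /[A-Z-]|::/.match?(input)

  word = input.gsub("::", "/")
  trace = [["namespace separators", word.dup]]
  word.gsub!(RTI_REGEX) { "#{$1 && '_'}#{$2.downcase}" }
  trace << ["acronyms", word.dup]
  word.gsub!(/([A-Z\d]+)([A-Z][a-z])/, '\1_\2')
  trace << ["uppercase runs", word.dup]
  word.gsub!(/([a-z\d])([A-Z])/, '\1_\2')
  trace << ["lower-upper pairs", word.dup]
  word.tr!("-", "_")
  trace << ["hyphens", word.dup]
  word.downcase!
  trace << ["downcase", word.dup]
  trace
end

["PARTITION ARTS", "partition arts"].each do |input|
  puts "Tracing #{input.inspect}"
  trace_underscore(input).each { |label, value| puts "  #{label}: #{value}" }
end
```

For “PARTITION ARTS” the last entry is “pa_rti_tion arts”, while “partition arts” produces a single “early return” entry because it never gets past the first check.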

Going back to titleize

Once the underscore method has returned the transformed string, it’s passed to humanize:

                             # default value ↴
humanize("pa_rti_tion arts", keep_id_suffix: false)

This method replaces all underscores with spaces and capitalises the first letter of the string and all acronyms. Here, this changes our string to “Pa RTI tion arts”, which is pretty close to what was rendered in the autocomplete list.
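Here’s a rough plain-Ruby sketch of just those rules. It deliberately leaves out the rest of what humanize does (such as the keep_id_suffix handling), and both ACRONYM_TABLE and humanize_sketch are hypothetical names standing in for the user-defined inflections and the real method:

```ruby
# Hypothetical acronym table: lowercase key => display form.
ACRONYM_TABLE = { "rti" => "RTI" }.freeze

def humanize_sketch(underscored)
  result = underscored.tr("_", " ")
  # Swap every run of letters/digits for its acronym, or for its lowercase form.
  result = result.gsub(/[a-z\d]+/i) { |word| ACRONYM_TABLE[word.downcase] || word.downcase }
  # Capitalise the first character of the whole string only.
  result.sub(/\A\w/) { |first| first.upcase }
end

puts humanize_sketch("pa_rti_tion arts") # => Pa RTI tion arts
puts humanize_sketch("partition arts")   # => Partition arts
```

Notice that the acronym lookup works on whole words, so “rti” only becomes “RTI” once underscore has fenced it off with underscores.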

But wait, we still have a long-forgotten gsub to analyse:

humanize(....).gsub(/\b(?<!\w['’`()])[a-z]/)

Before we start panicking, we’ll break it down into more digestible chunks:

A diagram explaining regular expression introduced in the mentioned gsub

The word boundary (\b) anchors the match at a position where a word starts or ends (e.g. right after or before a space character). Once we’re in the right position, we do a negative lookbehind (indicated by ?<!). It checks that the text immediately before the current position does not match the pattern inside it, here a word character followed by an apostrophe, backtick or parenthesis. If the negative lookbehind is happy with what it got, the expression matches the next lowercase letter. In our case that would be the “t” in “tion” and the “a” in “arts”.

So, as a result, we get a list of lowercase first letters in the string. Each of them is passed to this block and capitalised:

humanize(...).gsub(...) do |match|
  match.capitalize
end

We are finally done with the transformations. We end up with “Pa RTI Tion Arts”.
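This last step also runs fine in plain Ruby. The sketch below copies the regex from the gsub call; TITLE_REGEX and capitalize_words are hypothetical names, and the second sample string is mine, added to show the lookbehind at work:

```ruby
# The regex from titleize's gsub: a lowercase letter at a word boundary,
# unless it directly follows a word character plus one of ' ’ ` ( )
TITLE_REGEX = /\b(?<!\w['’`()])[a-z]/

def capitalize_words(humanized)
  humanized.gsub(TITLE_REGEX) { |match| match.capitalize }
end

puts capitalize_words("Pa RTI tion arts") # => Pa RTI Tion Arts
puts capitalize_words("it's a bug")       # => It's A Bug
```

In the second call the “s” after the apostrophe stays lowercase, which is exactly what the negative lookbehind is there for.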

Summary

The fix of calling downcase in result["title"].downcase.titlecase was found by trial and error. Now we know why it works: the underscore method is very eager to find acronyms, but only if it finds any uppercase letters first.

It makes sense given how we define acronyms but it has unintended consequences like finding acronyms in all-capitalised strings. Unfortunately, this is not obvious without understanding the inner workings of the titlecase/underscore method.

Also, it looks like titlecase was innocent all the way. I’m sorry, buddy.


[1] – To be more specific, it checks if an acronym is preceded by a letter or digit (or starts at a word boundary) and is followed by a word boundary or a non-lowercase character. Full definition.
