Not so long ago, I had an opportunity to peek under the hood of the titlecase and underscore methods, the tiny cogs of the “Rails magic” machine. The latter turned out to be a very interesting function: a lot of hard-to-follow transformations, secret injections and the like. All of these bits significantly contributed to an odd-looking bug I’d been working on.
Today, we’ll unravel the implementation of both methods and look regular expressions in the eye. But first of all, let’s see what this bug was about.
Note: This post describes the behaviour present in Rails 6.1.3, which may change in future versions.
The Bug
The FreeAgent sign-up process can take a while when you have to fill in all business details by hand. Thankfully, we have an option of automating this process by using the “Find company details” feature. The search is simple—just provide the company name and all relevant fields will be filled in with data retrieved from the Companies House API.
Sometimes, the search results didn’t look as expected: a company name would come back with “RTI” capitalised in the middle of a word.
Hmm, “RTI”? In our domain it’s a pretty special acronym but it’s not something that our user expects to see in the name of their company. It made me wonder how it is possible that RTI is found and formatted this way here. It doesn’t seem to be on our side. We just call this one method:
result["title"].titlecase
So, we take a string and make a pretty title out of it, right? Well, it’s a bit more complicated. Rails’s String#titlecase internally calls a method defined in the ActiveSupport::Inflector module, a part of the machine I mentioned before. ActiveSupport::Inflector operations allow transformations of strings such as pluralisation, converting namespaces to paths and many more. As a bonus, it can recognise acronyms in strings, including those defined by a user. RTI happened to be defined in our config/inflections.yml file. Let’s take a look:
# good
"partition arts".titlecase # => "Partition Arts"

# whoops
"PARTITION ARTS".titlecase # => "Pa RTI Tion Arts"
As you can see, there’s a link between uppercase characters and finding acronyms. We deal with company names so we don’t want to infer any acronyms from their names. It suffices to add downcase to fix the bug:
result["title"].downcase.titlecase
But why is that? Where precisely do we find RTI in a string? Let’s find out.
The Big Picture
titleize, the true form of titlecase, is powered by a composition of two methods: humanize and underscore:
humanize(underscore(word), keep_id_suffix: ...).gsub(...) do |match|
  match.capitalize
end
The underscore method does the heavy lifting of inferring potential words and acronyms from CamelCase and hyphenated strings. The humanize method, on the other hand, applies a couple of rules to transform a string so it looks acceptable to the end user. This includes replacing underscores with spaces, capitalising the first letter of the string, converting all letters but acronyms to lowercase and the like. For now, we’ll ignore the gsub definition to keep things simple. The general flow of this function is:
- Get an underscored version of word
- Feed it to the humanize method
- Apply the regex defined in gsub to the result
- Iterate over the matches and capitalise each of them
underscore Deep Dive
You might’ve been thinking, “It’s not a big deal! underscore just places underscores between words!”. Well, sort of. At its core, underscore is intended to transform CamelCase, namespaced and hyphenated words into underscore-separated strings:
"testingUnderscore".underscore # => "testing_underscore"
# Oh!
"testing underscore".underscore # => "testing underscore"
Now, let’s take a deep breath and look at the definition:
def underscore(camel_cased_word)
  return camel_cased_word unless /[A-Z-]|::/.match?(camel_cased_word) # ①
  word = camel_cased_word.to_s.gsub("::", "/") # ②
  word.gsub!(inflections.acronyms_underscore_regex) { "#{$1 && '_' }#{$2.downcase}" } # ③
  word.gsub!(/([A-Z\d]+)([A-Z][a-z])/, '\1_\2')
  word.gsub!(/([a-z\d])([A-Z])/, '\1_\2') # ④
  word.tr!("-", "_")
  word.downcase!
  word
end
There’s a lot to unpack. For simplicity, I numbered some of the regular expressions so you can go back and see the actual code when I explain them later on. In our “debugging session”, we’ll pass the “PARTITION ARTS” string in, so we can see each transformation that contributed to our bug.
The function flow
1. Do an early check if ① matches the parameter. If it doesn’t, return the word with no substitutions. This is a pretty simple regular expression, /[A-Z-]|::/, which matches only if there’s at least one uppercase letter, hyphen or namespace separator.
There’s a match for our argument so we continue our journey.
2. If ② finds a namespace separator, it is replaced with a slash. This line helps with inferring paths from namespaced module/class strings. As we don’t have such characters in our string, word is equal to the parameter.
3. Try to match the string against inflections.acronyms_underscore_regex. It’s a regular expression that tests whether any user-defined acronyms are “hidden” in the string[1]. If so, we catch the acronym and the character that precedes it. We have a match on our string: the acronym “RTI”, preceded by “A”.
The gsub that uses this acronyms regex defines the substitution as the block { "#{$1 && '_' }#{$2.downcase}" }.
③ inserts an underscore after the preceding character (the first group, “A”) and follows it with the lowercase version of the acronym. Now, word is equal to “PA_rtiTION ARTS”.
4. Match a run of uppercase letters/digits in one group and, in another, a pair of letters: an uppercase one followed by a lowercase one. The regular expression is /([A-Z\d]+)([A-Z][a-z])/.
For example, a match on the “CamLLcase” string would be split into two groups:
• Group 1: “L”
• Group 2: “Lc”
gsub then places an underscore between these groups. So, “CamLLcase” → “CamL_Lcase”. As our string has no match for this regular expression, the transformation doesn’t apply here and there’s no change to our word.
5. Match all occurrences of a lowercase letter or digit followed by an uppercase letter, using /([a-z\d])([A-Z])/, and insert “_” between them.
This is the last step of extracting the earlier identified acronym. In our case it’s just one pair: “iT”.
So ④ will transform word to “PA_rti_TION ARTS”.
6. Replace every `-` with `_` to finalise the underscore substitution.
7. Return a lowercase version of the processed string.
The final form of word is “pa_rti_tion arts”.
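To replay the whole walk-through outside Rails, here is a minimal standalone sketch. It copies the substitutions from the definition above, but ACRONYM_REGEX is a hand-written stand-in for inflections.acronyms_underscore_regex with “RTI” as the only acronym (an assumption based on the description in the footnote, not a value read from Rails), and trace_underscore is a name made up for this example.

```ruby
# Hand-written stand-in for inflections.acronyms_underscore_regex,
# assuming "RTI" is the only configured acronym (hypothetical).
ACRONYM_REGEX = /(?:(?<=([A-Za-z\d]))|\b)(RTI)(?=\b|[^a-z])/

# Runs the same substitutions as underscore and records the string
# after each one, so every intermediate form can be inspected.
def trace_underscore(input)
  return { result: input } unless /[A-Z-]|::/.match?(input) # ①

  steps = {}
  word = input.to_s.gsub("::", "/") # ②
  steps[:namespaces] = word.dup

  word.gsub!(ACRONYM_REGEX) { "#{$1 && '_'}#{$2.downcase}" } # ③
  steps[:acronyms] = word.dup

  word.gsub!(/([A-Z\d]+)([A-Z][a-z])/, '\1_\2')
  steps[:upper_run] = word.dup

  word.gsub!(/([a-z\d])([A-Z])/, '\1_\2') # ④
  steps[:lower_upper] = word.dup

  word.tr!("-", "_")
  word.downcase!
  steps[:result] = word
  steps
end

trace_underscore("PARTITION ARTS").each { |step, value| puts "#{step}: #{value}" }
# namespaces: PARTITION ARTS
# acronyms: PA_rtiTION ARTS
# upper_run: PA_rtiTION ARTS
# lower_upper: PA_rti_TION ARTS
# result: pa_rti_tion arts
```

Swapping in another acronym or another input makes it easy to see which of the substitutions is responsible for a given underscore.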
Going back to titleize
Once the underscore method has returned a transformed string, it’s passed to humanize:
humanize("pa_rti_tion arts", keep_id_suffix: false) # false is the default value
This method replaces all underscores with spaces and capitalises the first letter of the string and all acronyms. Here, this changes our string to “Pa RTI tion arts”, which is pretty close to what was rendered in the autocomplete list.
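The part of humanize that matters for our string can be approximated in plain Ruby. This is a simplified stand-in rather than the Rails implementation: it only covers the three transformations described above and leaves out everything else humanize does (the keep_id_suffix handling, for instance). ACRONYMS and humanize_sketch are names made up for the example.

```ruby
# Hypothetical acronym table: lowercase key => display form.
ACRONYMS = { "rti" => "RTI" }.freeze

# Simplified humanize: underscores become spaces, known acronyms are
# restored, everything else is lowercased, the first letter is upcased.
def humanize_sketch(underscored)
  result = underscored.tr("_", " ")
  result = result.gsub(/[a-z\d]+/i) { |token| ACRONYMS[token.downcase] || token.downcase }
  result.sub(/\A\w/) { |first| first.upcase }
end

puts humanize_sketch("pa_rti_tion arts")
# Pa RTI tion arts
```

Because “rti” now stands alone as a token between two spaces, the acronym lookup finds it and restores the uppercase form.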
But wait, we still have a long-forgotten gsub to analyse:
humanize(....).gsub(/\b(?<!\w['’`()])[a-z]/)
Before we start panicking, we’ll break it down into more digestible chunks: a word boundary (\b), a negative lookbehind ((?<!\w['’`()])) and a lowercase letter ([a-z]).
Word boundary is a matcher that anchors the regular expression at the beginning or end of a word (e.g. after or before a space character). Once we’re in the right position, we do a negative lookbehind match (indicated by ?<!). This check ensures that the text just before the current position does not match the pattern inside the lookbehind: here, a word character followed by an apostrophe, backtick or parenthesis. If the negative lookbehind is happy with what it got, the expression matches the next lowercase letter. In our case that would be the “t” of “tion” and the “a” of “arts”.
So, as a result, we get a list of lowercase first letters in the string. Each of them is passed to this block and capitalised:
humanize(...).gsub(...) do |match|
  match.capitalize
end
We are finally done with the transformations. We end up with “Pa RTI Tion Arts”.
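The last regular expression needs nothing from Rails, so it can be tried directly in plain Ruby. In this small sketch, TITLE_REGEX and capitalize_words are names chosen for the example; the second call shows what the negative lookbehind is for.

```ruby
# The regex from titleize: a lowercase letter at a word boundary that is
# not preceded by a word character plus an apostrophe, backtick or parenthesis.
TITLE_REGEX = /\b(?<!\w['’`()])[a-z]/

def capitalize_words(text)
  text.gsub(TITLE_REGEX) { |match| match.capitalize }
end

puts capitalize_words("Pa RTI tion arts")
# Pa RTI Tion Arts
puts capitalize_words("it's here")
# It's Here
```

Without the lookbehind, the “s” in “it's” would sit right after a word boundary and end up capitalised too.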
Summary
The fix of calling downcase in result["title"].downcase.titlecase was a result of trial and error. Now we know why it works. The underscore method is very eager to find acronyms, but only if it finds any uppercase letters first.
It makes sense given how we define acronyms, but it has unintended consequences like finding acronyms in all-capitalised strings. Unfortunately, this is not obvious without understanding the inner workings of the titlecase/underscore methods.
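The “uppercase first” condition is easy to check in isolation by running the guard from the top of underscore against both versions of the string. GUARD is just a local name for this sketch:

```ruby
# The early-return check from the first line of underscore.
GUARD = /[A-Z-]|::/

puts GUARD.match?("PARTITION ARTS")          # true: the substitutions run
puts GUARD.match?("PARTITION ARTS".downcase) # false: the string is returned untouched
```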
Also, it looks like titlecase was innocent all the way. I’m sorry, buddy.
[1] – To be more specific, it checks if an acronym is preceded by a letter, a digit or a word boundary and followed by a word boundary or a non-lowercase character.