Diacritics: The Bane of SMTP Email Sending in Sitecore
In our Sitecore 10 project, we still send emails the old-school Sitecore way:
MainUtil.SendMail(emailMessage);
In the implementation, the backend reads a comma-delimited string from a Single-Line Text Sitecore field named “To”, parses it, and sends an email to each address individually.
It had been working fine, but months later the client reported that one of the recipients was not receiving any email at all (in this case, the address bien@gmail.com). Digging into the logs revealed a clue about what was failing in the SMTP sending process:
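The parsing step described above might look roughly like the sketch below. The class and method names (RecipientParser, Parse) are hypothetical and are not taken from the project's actual code; each returned address would then be placed on its own MailMessage and handed to MainUtil.SendMail.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, for illustration only.
public static class RecipientParser
{
    // Splits the raw "To" field value on commas and drops empty entries.
    // Trim() removes leading/trailing whitespace only; it does not touch
    // other characters that happen to be invisible on screen.
    public static List<string> Parse(string toFieldValue)
    {
        return (toFieldValue ?? string.Empty)
            .Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(address => address.Trim())
            .Where(address => address.Length > 0)
            .ToList();
    }
}
```

For example, RecipientParser.Parse("a@example.com, b@example.com") yields two entries, one per address.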
Exception: System.Net.Mail.SmtpException
Message: The client or server is only configured for E-mail addresses with ASCII local-parts: bien@gmail.com.
Source: System
   at System.Net.Mail.MailAddress.GetUser(Boolean allowUnicode)
   at System.Net.Mail.MailAddress.GetAddress(Boolean allowUnicode)
   at System.Net.Mail.MailAddress.GetSmtpAddress(Boolean allowUnicode)
   at System.Net.Mail.SmtpClient.ValidateUnicodeRequirement(MailMessage message, MailAddressCollection recipients, Boolean allowUnicode)
   at System.Net.Mail.SmtpClient.Send(MailMessage message)
   at Sitecore.MainUtil.SendMail(MailMessage message)
   at FormsSendMail.Forms.Actions.SendMailAction.Execute(SendMailActionData data, FormSubmitContext formSubmitContext) in C:\Development\Sitecore\src\Feature\FormsSendMail\code\Forms\Actions\SendMailAction.cs:line 197
According to this SmtpException, the email address validation error has something to do with ASCII characters. My initial assessment was that both the sending and the receiving mail servers (SMTP gateways) would need to support the SMTPUTF8 extension, and that the client's DeliveryFormat would need to be set to SmtpDeliveryFormat.International. However, that would mean digging outside of Sitecore and into SMTP gateways over which I have little or no control. Instead of going that route, I focused on what was wrong with the email-sending process itself.
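For reference, this is roughly what the DeliveryFormat route would involve on the .NET side. It is only a sketch: it assumes you construct the SmtpClient yourself instead of going through MainUtil.SendMail, the class name InternationalSmtp is hypothetical, and the host passed in is a placeholder. The mail servers involved would still have to support SMTPUTF8 for this to help.

```csharp
using System.Net.Mail;

// Hypothetical helper, for illustration only.
public static class InternationalSmtp
{
    // Creates a client that is allowed to send non-ASCII (Unicode) addresses.
    // The default DeliveryFormat is SmtpDeliveryFormat.SevenBit.
    public static SmtpClient Create(string host)
    {
        var client = new SmtpClient(host);
        client.DeliveryFormat = SmtpDeliveryFormat.International;
        return client;
    }
}
```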
The Invisible Culprit
One strange thing was consistent across all the failing recipient strings: only one of the delimited addresses caused trouble. Looking at the email address in the logs did not reveal anything suspicious about it, and the address could indeed receive emails. But deleting the entire value in the field and typing it again from scratch seemed to work!
So I looked for an online tool that could help verify whether a string contains an invalid character and found this: https://pages.cs.wisc.edu/~markm/ascii.html. Pasting the value copied from the Sitecore field directly into this tool’s textarea reveals what’s wrong…and even weirder.
I kept looking for online tools that could catch invalid characters and found this: https://onlinestringtools.com/convert-string-to-ascii. I pasted the same copied value again and saw this:
The character count and the byte count of the ASCII conversion should be equal, but they aren’t: there are three extra bytes! How come? Because those bytes are non-ASCII.
Based on this definition by WhatIs.com,
ASCII (American Standard Code for Information Interchange) is the most common character encoding format for text data in computers and on the internet. In standard ASCII-encoded data, there are unique values for 128 alphabetic, numeric or special additional characters and control codes.
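Basically, most if not all of the characters you can see on your QWERTY keyboard are ASCII. But what are these three extra bytes? By that definition, any byte value above 127 decimal is not ASCII. The three leading bytes 226, 128 and 140 (hex E2 80 8C) are the UTF-8 encoding of a single character, U+200C ZERO WIDTH NON-JOINER. Strictly speaking, that is an invisible formatting character rather than an accent mark, but in this post I loosely call such stray non-ASCII characters diacritics, and some of them are freaking invisible to the naked eye!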
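The same check the online tools perform can be sketched in a few lines of C#. The bytes 226, 128 and 140 decode as UTF-8 to the single code point U+200C (ZERO WIDTH NON-JOINER), and the sketch below reports every character above 127 with its position, code point and UTF-8 bytes. The class and method names (AsciiInspector, DescribeNonAscii) are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical helper, for illustration only.
public static class AsciiInspector
{
    // Returns one line per non-ASCII character: index, code point, UTF-8 bytes (decimal).
    public static List<string> DescribeNonAscii(string value)
    {
        var findings = new List<string>();
        for (int i = 0; i < value.Length; i++)
        {
            if (value[i] <= 127) continue; // plain ASCII, nothing to report

            // Characters outside the Basic Multilingual Plane occupy two chars.
            bool isPair = char.IsSurrogatePair(value, i);
            string text = value.Substring(i, isPair ? 2 : 1);
            int codePoint = isPair ? char.ConvertToUtf32(value[i], value[i + 1]) : value[i];
            byte[] bytes = Encoding.UTF8.GetBytes(text);

            findings.Add($"index {i}: U+{codePoint:X4} bytes " +
                         string.Join(" ", bytes.Select(b => b.ToString())));
            if (isPair) i++; // skip the low surrogate
        }
        return findings;
    }
}
```

Calling AsciiInspector.DescribeNonAscii("\u200Cbien@gmail.com") returns a single entry, "index 0: U+200C bytes 226 128 140", while a clean address returns an empty list.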
According to this University of Sussex link:
“Diacritics, often loosely called `accents’, are the various little dots and squiggles which, in many languages, are written above, below or on top of certain letters of the alphabet to indicate something about their pronunciation”
Where did these hidden characters come from? My hypothesis is that they were copied along with the text from its original source (for example Microsoft Excel or OneNote) and were not stripped out when the text was pasted into the plain-text Sitecore field.
Diacritics Remover: Content Authoring Workaround
One solution is for content authors to remove the diacritics before entering values in Sitecore fields. This online tool is helpful for detecting and removing them: https://pteo.paranoiaworks.mobi/diacriticsremover/. Just click the Remove Diacritics button and matching characters will be replaced by an underscore by default. Delete the replacement characters and paste the cleaned value back into the Sitecore field.
Note: Based on my observation, if the diacritic sits at the beginning of the field value, you cannot fully remove it by backspacing, or by cutting the whole value, cleaning it with this tool and pasting it back. You need to clear the field value completely and then paste the clean text. It seems rather odd for the diacritic to stay hidden in the field.
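The clean-up itself is simple enough to sketch in code. The snippet below only approximates what the online tool does (its exact matching rules are not known to me): it replaces every character above 127 with a replacement string, an underscore by default. NonAsciiCleaner and Replace are hypothetical names.

```csharp
using System.Text;

// Hypothetical helper, for illustration only.
public static class NonAsciiCleaner
{
    // Replaces each non-ASCII char with the given replacement.
    // Note: a character stored as a surrogate pair is replaced twice,
    // once per char.
    public static string Replace(string value, string replacement = "_")
    {
        var result = new StringBuilder(value.Length);
        foreach (char c in value)
        {
            if (c <= 127) result.Append(c);
            else result.Append(replacement);
        }
        return result.ToString();
    }
}
```

NonAsciiCleaner.Replace("\u200Cbien@gmail.com") gives "_bien@gmail.com", and passing an empty replacement string gives "bien@gmail.com".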
Sitecore Validation Rule: A More Proactive Approach
As a developer, if you know that a Sitecore field will hold important information (for instance, a value consumed by a server) and you want to keep non-ASCII values from meddling with your transactions, you will want a more proactive approach. This is where Sitecore field validation comes in.
Under /sitecore/system/Settings/Validation Rules/Field Rules/Common/, create a new Validation Rule and name it Is ASCII.
Use the Data type Sitecore.Data.Validators.FieldValidators.RegexValidator,Sitecore.Kernel.
For the Parameters, enter the following value, which defines the pattern that accepts only ASCII characters and the error message to display:
Pattern=^[\x20-\x7F]+$&Text=Field "{0}" contains non-ASCII characters
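Before wiring the rule into Sitecore, the pattern can be tried out on its own with the .NET regex engine. This is a sketch that assumes the validator evaluates the pattern the same way Regex.IsMatch does; the class name IsAsciiRule is hypothetical.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class IsAsciiRule
{
    // Same pattern as in the Parameters value above.
    public const string Pattern = @"^[\x20-\x7F]+$";

    // True when the value consists only of characters in the range 0x20-0x7F.
    public static bool IsValid(string fieldValue)
    {
        return Regex.IsMatch(fieldValue, Pattern);
    }
}
```

IsAsciiRule.IsValid("bien@gmail.com") returns true, while IsAsciiRule.IsValid("\u200Cbien@gmail.com") returns false. Two details worth knowing: the range ends at \x7F, which is the DEL control character, so \x7E would be the tighter upper bound for printable ASCII; and because the range starts at \x20, a tab character is rejected as well.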
Fill in the other item fields. It should look something like this:
Now apply it to the field. Go to the Template field item and, under the Validation Rules section, double-click Is ASCII in the left selection box. Note: If you don’t see the Validation Rules section, you might need to tick the Standard fields checkbox on the View ribbon tab.
Finally, to test the rule, open a content item that inherits this template and try entering a value with diacritics. A red error bar appears beside the field and shows the error message we specified.
I have recommended to Sitecore (through its Support Partner portal) that Is ASCII be included as a common Validation Rule for catching scenarios like this. Hopefully it makes it into the minor enhancements of a future release.