Problem solve Get help with specific problems with your technologies, process and projects.

Performance impacts of joining Unicode and non-Unicode data

Unicode may help you address internationalization issues -- but join it with conventional-text data and you'll have performance problems instead. Get troubleshooting help here.

SQL Server has explicit datatypes for storing textual information as 16-bit Unicode and conventional 8-bit text. Many database programmers who are worried about internationalization issues are opting to use Unicode-format data when gathering information from Unicode-presented Web pages, which may also receive form submissions in double-byte format. But, if you're one of the many people who uses both formats together, you may have a new issue to address -- performance.

The 8-bit formats are char, varchar and text. The 16-bit Unicode formats are nchar, nvarchar and ntext. Because the two varieties of variables are designed to interoperate, many SQL programmers mix the two freely, often in the same table design, or use combinations of 8 and 16-bit text columns as parameters in a cross-table join.

Here's an example: If a table named LocalUsers has a UserName column in 8-bit format, and another table named UserInput has an InputName column in 16-bit format, you could make a join statement where UserName and InputName are the matched columns. Joining between 8 and 16-bit data types isn't illegal, but the processing cost involved in the data conversion -- either from 8 to 16-bit, or vice versa, depending on the direction and variety of the join -- could cost the procedure a fair amount of overhead. The overhead would only get worse if multiple such joins were running at once.

It isn't just on-the-fly data type conversion that causes performance to degrade. Another downside to such a join is that any indexes on joined columns where there's on-the-fly data conversion will not be used. If you are creating indexes for tables that are matched in this fashion, this may be a big reason why you're not seeing a performance boost using the index.

One of the following two best practices should come out of this:
1. You make sure 8-bit and 16-bit data types are not matched against each other in joins, or
2. You pick either 8-bit or 16-bit for the textual data types in your project and stick with that all the way across the board.

I personally feel the best choice for the long run is 16-bit format, but depending on the type of work you are doing, you may think 8-bit format is fine.

People coding Web sites that accept input in multiple languages (especially Far Eastern languages such as Korean and Japanese) need to keep this in mind when building the underlying data structures for their site. (It is possible to store all text in the database in ISO-8859-1 format with 8-bit storage, but ISO-8859-1 requires up to eight bytes for each character instead of Unicode's two bytes.)

Serdar Yegulalp is the editor of the Windows 2000 Power Users Newsletter. Check out his Windows 2000 blog for his latest advice and musings on the world of Windows network administrators -- please share your thoughts as well!

Dig Deeper on Windows client management

Start the conversation

Send me notifications when other members comment.

Please create a username to comment.