Skip to content

August 22, 2013

Selecting Rows With Chinese Characters in MySQL

…or Russian/Cyrillic characters, or Arabic, or any characters of non-latin origin.

On this blog, I’ve let tons of spam comments pile up in the pending queue. Their number grew to the point of being completely unmanageable. Finally I decided to clean it up, and I had to find a way to bulk delete rows from the wp_comments table in the WordPress database. One thing I noticed was that hundreds, probably thousands, of spam comments contained Chinese characters. Other comments contained Russian or other non-latin characters. But how to select comments in which non-latin characters appear without selecting for specific strings (for example by using LIKE or REGEXP)?

It turns out the way to do it is like this:

SELECT  * 
FROM    wp_comments 
WHERE   comment_content != CONVERT(comment_content USING latin1)

CONVERT … USING will convert the column value to the specified character set, replacing any characters that do not map into the set with a question mark. If there are no Chinese/Russian/etc. characters, the output of CONVERT … USING will be the same as the unconverted column value.

This was the SQL statement I used to delete the unwanted posts:

DELETE FROM wp_comments 
WHERE   comment_content != CONVERT(comment_content USING latin1) AND 
        comment_approved = '0'

If there is any interest, I can post some other SQL statements I used to clean up the spam.

Read more from Miscellaneous

Share your thoughts, post a comment.

(required)
(required)

Note: HTML is allowed. Your email address will never be published.

Subscribe to comments