Detecting Character Encoding

Post date: 31 October 2019
Reading time: 4 minutes
Categories: Computing

I occasionally find myself having to determine the character encoding of a file, either for foreign-language content or other data files that don’t identify their encoding.

It isn’t always possible to determine the encoding with certainty, but natural-language text is far from random. It’s often quite easy to identify the language from its character sequences, and from there it’s relatively easy to work out the character encoding by knowing which languages tend to use which encodings. Most character encoding detectors (such as chardet) work broadly in this way.
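As a rough illustration of that idea, the sketch below ranks a few candidate encodings by how many of the decoded non-ASCII characters are Polish letters. It is a toy heuristic, not how chardet is implemented; the function name `rank_encodings`, the candidate list and the sample sentence are all assumptions made for this example.

```python
# Toy heuristic (hypothetical helper, not chardet's algorithm):
# decode with each candidate and see how "Polish" the result looks.
POLISH_LETTERS = set("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ")


def rank_encodings(data, candidates=("utf-8", "iso-8859-2", "cp1250", "cp1252")):
    """Return (encoding, score) pairs, best first.

    The score is the fraction of non-ASCII characters that are Polish
    letters once the bytes are decoded with that encoding.
    """
    scored = []
    for enc in candidates:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding at all
        non_ascii = [ch for ch in text if ord(ch) > 127]
        if non_ascii:
            score = sum(ch in POLISH_LETTERS for ch in non_ascii) / len(non_ascii)
        else:
            score = 0.0  # pure ASCII tells us nothing
        scored.append((enc, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Assumed sample: a Polish sentence stored as ISO-8859-2.
sample = "Napisy zostały dopasowane. Miłego oglądania, ślub się odbył.".encode("iso-8859-2")
for enc, score in rank_encodings(sample):
    print(f"{enc}: {score:.2f}")
```

With this sample, UTF-8 is rejected outright because the bytes are not valid UTF-8, and ISO-8859-2 scores highest because every non-ASCII character decodes to a Polish letter.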

As an example, I often download Polish subtitles for films, and I might see a line like this:

Napisy zosta³y specjalnie dopasowane do Twojej wersji filmu.

Here, I can see that the ł has been displayed as ³, which is what its byte looks like when read as ISO-8859-1 or Windows-1252 rather than in the file’s real encoding. If I didn’t know it was an ł, Google Translate (or Search) can often give a suggestion. For Polish-language text, I’m looking for words I recognise containing ł, ś or ć, as those will help me quickly identify the encoding.
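A minimal sketch of how this kind of mojibake arises, assuming the file is stored as ISO-8859-2 and the viewer falls back to a Latin-1-style decoding (both are assumptions for the sake of the example):

```python
original = "zostały"

# Bytes as they would be stored in an ISO-8859-2 file.
raw = original.encode("iso-8859-2")
print(raw)  # b'zosta\xb3y'

# The same bytes read as ISO-8859-1: byte 0xb3 becomes a superscript three.
misread = raw.decode("iso-8859-1")
print(misread)  # zosta³y

# Read as UTF-8, the lone 0xb3 byte is invalid and gets replaced instead.
as_utf8 = raw.decode("utf-8", errors="replace")
print(as_utf8)  # zosta�y
```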

Now that I have identified a non-ASCII character, I can look at the byte representation:

Napisy zosta<b3>y specjalnie dopasowane do Twojej wersji filmu.
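A view like the one above can be produced with a few lines of Python. This is only a sketch: `show_non_ascii` is a hypothetical helper, and the sample line is assumed to be stored as ISO-8859-2.

```python
def show_non_ascii(data: bytes) -> str:
    """Render ASCII bytes as characters and every other byte as <xx> hex."""
    return "".join(chr(b) if b < 0x80 else f"<{b:02x}>" for b in data)


# Assumed input: the subtitle line as raw bytes from an ISO-8859-2 file.
line = "Napisy zostały specjalnie dopasowane do Twojej wersji filmu.".encode("iso-8859-2")
print(show_non_ascii(line))
# Napisy zosta<b3>y specjalnie dopasowane do Twojej wersji filmu.
```

For a real file, the bytes would come from `open(path, "rb").read()` rather than from an encoded string.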

Now I know that ł is encoded as b3 in this file. Looking at the encodings of ł in different character sets, I can see that ISO-8859-2 encodes ł in this way (Windows-1250 does too, so a second character such as ś is needed to tell the two apart). So running the file through iconv -f ISO-8859-2 -t UTF-8 should give me what I want:

Napisy zostały specjalnie dopasowane do Twojej wersji filmu.
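The lookup and the conversion can also be sketched in Python using its built-in codecs. The candidate list and the helper names `encodings_matching` and `convert_to_utf8` are assumptions for this example; the second lookup uses a hypothetical observation of ś stored as b6, which is not taken from the file above.

```python
# Assumed shortlist of encodings worth checking for Polish text.
CANDIDATES = ["iso-8859-1", "iso-8859-2", "cp1250", "cp1252", "cp852", "utf-8"]


def encodings_matching(char, observed, candidates=CANDIDATES):
    """Return the candidate encodings in which `char` encodes to `observed`."""
    matches = []
    for enc in candidates:
        try:
            if char.encode(enc) == observed:
                matches.append(enc)
        except UnicodeEncodeError:
            pass  # this encoding cannot represent the character at all
    return matches


def convert_to_utf8(data, source_encoding="iso-8859-2"):
    """Rough equivalent of: iconv -f ISO-8859-2 -t UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


print(encodings_matching("ł", b"\xb3"))  # ['iso-8859-2', 'cp1250']
print(encodings_matching("ś", b"\xb6"))  # ['iso-8859-2']
print(convert_to_utf8(b"Napisy zosta\xb3y").decode("utf-8"))  # Napisy zostały
```

Checking a second character is what separates ISO-8859-2 from Windows-1250 here, since both store ł as b3.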

I have to say I don’t really follow this manual process that often any more, since Polish text is almost always in ISO-8859-2, which covers the Latin-script Central European languages. I’ll often just try this conversion first as a default (for data files, Windows-1252 comes up quite a lot), or I’ll run the file through a very basic detection and conversion script that uses Google’s Compact Encoding Detection. However, it’s quite rewarding to watch the mojibake transform into the original text after a bit of digging!
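The "try a default first" approach might look something like the sketch below. It does not use Compact Encoding Detection; it is a simpler stand-in that checks for valid UTF-8 and otherwise assumes a default. The name `decode_with_fallback` and the choice of default are assumptions.

```python
def decode_with_fallback(data: bytes, fallback: str = "iso-8859-2") -> str:
    """Decode as UTF-8 if the bytes are valid UTF-8, else use the fallback.

    Legacy 8-bit text containing accented letters is very rarely valid
    UTF-8 by accident, so a strict UTF-8 decode makes a useful first test.
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode(fallback)


print(decode_with_fallback("zostały".encode("utf-8")))  # zostały
print(decode_with_fallback(b"zosta\xb3y"))              # zostały
```

Python’s ISO-8859-2 codec accepts every byte value, so the fallback never raises an error; it can still produce the wrong text if the guess is wrong, which is where the manual check earns its keep.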

Further essential reading for programmers: What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text

