This website is now static html (since the end of 2010). The pages you see here are a simple wget spider mode crawl of the original wordpress, thus dynamic features like commenting are not working anymore.

svn – Valid UTF-8 data followed by invalid UTF-8 sequence


This happened to me some time ago while working with a Subversion repository that contained some files created on Windows with a strange encoding. Issuing a simple svn status gave me this error message:

svn status
svn: Valid UTF-8 data
(hex: 4b)
followed by invalid UTF-8 sequence
(hex: fc 63 68 65)

I searched for a solution, but the best tip I could find was to remove the files causing the error. That was not an option in my case: I needed those files.

Discover the encoding of the filenames

So if you want to keep your files and use them in Subversion, you first need to find the directory in which they are stored. I simply cd'd into each directory in turn, checking whether the error message still appeared (not stylish, but it worked for me). If svn reports a longer valid UTF-8 sequence than 4b, you can try with:

echo "\x4b\x6f\x72\x5f" | xargs -0 printf

and it will give you the starting part of the corrupted filename (Kor_ in this example).
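The same decoding can be sketched as a small shell function that takes the hex bytes exactly as svn prints them. hex_to_text is a hypothetical name, not something svn provides, and the sketch assumes a POSIX printf that accepts 0x-prefixed numbers and octal escapes:

```shell
#!/bin/sh
# hex_to_text: hypothetical helper, not part of svn.
# Prints the characters for the hex bytes given as arguments.
hex_to_text() {
    for byte in "$@"; do
        # The inner printf converts the hex value to three octal digits,
        # because \ooo is the escape form every POSIX printf understands.
        printf "\\$(printf '%03o' "0x$byte")"
    done
    printf '\n'
}

hex_to_text 4b 6f 72 5f    # prints Kor_
```

Only the valid part of the message is worth decoding this way. The bytes from the invalid part are not UTF-8, so a UTF-8 terminal will not display them properly.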

Once you have found the directory containing the files that cause the error, you need to discover the encoding of their filenames. A plain ls should be enough to spot the strange names. To identify their encoding you can pipe the listing to the file command, as shown here. If only one file is affected you can try replacing the * with its name, but I don't know whether that works, since in my case every file in that directory had the wrong encoding:

ls * | file -
/dev/stdin: ISO-8859 text
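The check above looks at the whole listing at once. To test names one at a time, and to avoid walking the directories by hand, something like the following sketch may help. It is not what I used: is_valid_utf8 and list_bad_names are hypothetical helper names, and the sketch assumes an iconv that exits with an error on malformed input (as the glibc one does) and paths without newlines in them.

```shell
#!/bin/sh
# is_valid_utf8: hypothetical helper. Succeeds when its argument is a
# well-formed UTF-8 byte sequence, using iconv as the validator.
is_valid_utf8() {
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# list_bad_names: hypothetical helper. Prints every path under the given
# directory (default: current directory) that is not valid UTF-8.
# .svn directories are skipped; paths containing newlines are not handled.
list_bad_names() {
    find "${1:-.}" -name .svn -prune -o -print | while IFS= read -r path; do
        is_valid_utf8 "$path" || printf '%s\n' "$path"
    done
}
```

Running list_bad_names from the root of the working copy should then list the offending paths, which also tells you the directory they live in.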

Under normal circumstances (Debian/Ubuntu) filenames should be UTF-8, as this simple test shows:

$ echo "test" > /tmp/cioé.txt
$ ls /tmp/cioé.txt | file -
/dev/stdin: UTF-8 Unicode text

In my case the filenames were latin1 but the content of the files was UTF-8. To be sure of this I used the isutf8 command (found in the moreutils package):

isutf8 *

isutf8 gives no output if everything is fine.
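As a cross-check of the latin1 guess, the bytes from the svn message can be decoded as latin1 to see whether they form a readable word. This is only a sketch, assuming iconv is installed. The octal escapes are the svn bytes 4b fc 63 68 65 rewritten for printf.

```shell
#!/bin/sh
# Bytes from the svn message: 4b (valid part) followed by fc 63 68 65
# (invalid part), written as octal escapes for portability.
# Decoding them as ISO-8859-1 (latin1) and re-encoding as UTF-8:
printf '\113\374\143\150\145\n' | iconv -f ISO-8859-1 -t UTF-8
# prints Küche
```

A readable result makes latin1 a plausible guess but does not prove it, because ISO-8859-1 assigns a character to every byte and so never fails to decode.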

Convert the filenames to UTF-8 encoding

At this point you need to convert the filenames to UTF-8. This is easily done with convmv (its man page says it “converts filenames from one encoding to another”, which is exactly what we need). The usage is simple:

convmv -f latin1 -t utf-8 *

and this is an example output:

Your Perl version has fleas #37757 #49830
Starting a dry run without changes...
mv "./Cefal�txt"        "./Cefalù.txt"
mv "./Badem�el-Badschr�ke.txt"  "./Bademöbel-Badschränke.txt"
mv "./Blutzuckermessger�e.txt"  "./Blutzuckermessgeräte.txt"

Beware that by default this is only a dry run: you have to add the --notest flag to actually rename the files.
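If convmv is not available, roughly the same rename can be sketched with iconv and mv. This is a substitute technique, not what convmv does internally and not what I ran. latin1_to_utf8_name and rename_latin1_names are hypothetical names. The sketch handles a single directory level, assumes the bad names really are latin1, and borrows convmv's --notest convention so that it only prints the planned renames by default.

```shell
#!/bin/sh
# latin1_to_utf8_name: hypothetical helper that re-encodes one name.
latin1_to_utf8_name() {
    printf '%s' "$1" | iconv -f ISO-8859-1 -t UTF-8
}

# rename_latin1_names DIR [--notest]
# Prints an mv line for each entry in DIR whose name is not valid UTF-8.
# The rename is performed only when --notest is given.
rename_latin1_names() {
    dir=${1:-.}
    mode=${2:-}
    for path in "$dir"/*; do
        [ -e "$path" ] || continue
        name=${path##*/}
        # Names that are already valid UTF-8 are left alone.
        if printf '%s' "$name" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
            continue
        fi
        new=$(latin1_to_utf8_name "$name")
        # Never overwrite an existing file.
        if [ -e "$dir/$new" ]; then
            printf 'skip "%s": target exists\n' "$path"
            continue
        fi
        printf 'mv "%s" "%s"\n' "$path" "$dir/$new"
        if [ "$mode" = "--notest" ]; then
            mv -- "$path" "$dir/$new"
        fi
    done
}
```

Calling rename_latin1_names . shows the planned renames, and rename_latin1_names . --notest applies them. For anything beyond a quick fix, convmv is the safer choice.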

Hope this helps.

References
http://joey.kitenet.net/code/moreutils/
http://www.j3e.de/linux/convmv/



2 Responses to “svn – Valid UTF-8 data followed by invalid UTF-8 sequence”

  1. persuader

    great article – saved me hours of searching

  2. NateV

    I got a similar error and this helped figure out what I was looking for. The _key_ is that the sequence that SVN reports is the letters of the file name. So using echo to print the hex digits helps find the file. Thanks.


