svn – Valid UTF-8 data followed by invalid UTF-8 sequence
This happened to me some time ago while working with a Subversion repository that contained some files created on Windows with a strange encoding. A simple svn status gave me this error message:
$ svn status
svn: Valid UTF-8 data (hex: 4b) followed by invalid UTF-8 sequence (hex: fc 63 68 65)
I searched for a solution, but the best tip I could find was to remove the files causing the error (see this for example), and that wasn’t an option in my case: I needed those files.
Discover the encoding of the filenames
So if you want to keep your files and use them in Subversion, you first need to find the directory in which the files are stored. I just cd into each directory and checked whether the error message still showed up (not stylish, but it worked for me). The hex bytes in the error message are the bytes of the filename, so if svn reports a valid UTF-8 sequence longer than 4b you can decode it with:
echo "\x4b\x6f\x72\x5f" | xargs -0 printf
and it will give you the starting part of the corrupted filename (Kor_ in this example).
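Walking through every directory by hand can be automated. The following is only a sketch, assuming a POSIX shell and find; hex_to_bytes and find_by_hex_prefix are hypothetical helper names, not part of svn. It decodes the hex bytes from the error message and searches for filenames that start with them.

```shell
# Turn a space-separated list of hex bytes (as printed by svn) into raw bytes.
# hex_to_bytes is a hypothetical helper name, not an svn feature.
hex_to_bytes() {
    for h in $1; do
        # Convert each hex byte to octal first, because POSIX printf only
        # guarantees octal escapes such as \113 in the format string.
        printf "\\$(printf '%03o' "0x$h")"
    done
}

# List paths under directory $1 whose name starts with the bytes in $2.
# LC_ALL=C makes find compare names byte by byte, so names that are not
# valid UTF-8 are still matched. Glob characters in the prefix are not
# escaped in this sketch.
find_by_hex_prefix() {
    LC_ALL=C find "$1" -name "$(hex_to_bytes "$2")*"
}
```

For the example above, find_by_hex_prefix . "4b 6f 72 5f" would list everything below the current directory whose name begins with Kor_.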
Once you have found the directory containing the files causing the error, you need to discover the encoding of the filenames. A simple ls should be enough to spot the strange files. To discover their encoding you can use the file command like this (if you have only one such file you can try replacing the * with the filename, but I don’t know whether that works, since in my case every file in that directory had the wrong encoding):
$ ls * | file -
/dev/stdin: ISO-8859 text
Under normal circumstances (Debian/Ubuntu) the filenames should be UTF-8, as this simple test shows:
$ echo "test" > /tmp/cioé.txt
$ ls /tmp/cioé.txt | file -
/dev/stdin: UTF-8 Unicode text
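If file only reports a generic ISO-8859 guess, one way to confirm a candidate encoding is to decode a filename from it and see whether the result reads correctly. This is a minimal sketch, assuming iconv is installed; preview_name and name_is_utf8 are hypothetical helper names.

```shell
# Show how a raw filename reads when its bytes are decoded from a
# candidate encoding. $1 = raw filename, $2 = candidate source encoding.
preview_name() {
    printf '%s\n' "$1" | iconv -f "$2" -t UTF-8
}

# Succeed if the raw filename is already valid UTF-8. iconv exits with a
# non-zero status when it meets an invalid input sequence.
name_is_utf8() {
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}
```

Running something like for f in *; do name_is_utf8 "$f" || preview_name "$f" ISO-8859-1; done prints the latin1 reading of every name that is not valid UTF-8. If the accented letters come out right, the guess was probably correct.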
In my situation the filenames were latin1 but the content of the files was UTF-8. To be sure of this I used the isutf8 command (found in the moreutils package):
isutf8 *
which gives no output if everything is fine.
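If moreutils is not available, a similar check on the file contents can be approximated with iconv. This is only a sketch of the same idea, not a replacement for isutf8; list_non_utf8_files is a hypothetical helper name.

```shell
# Print the names of files whose contents are not valid UTF-8 and stay
# silent when everything is fine. Assumes iconv is installed.
list_non_utf8_files() {
    for f in "$@"; do
        # Only regular files are checked; directories are skipped.
        [ -f "$f" ] || continue
        # A UTF-8 to UTF-8 conversion fails on any invalid byte sequence.
        iconv -f UTF-8 -t UTF-8 "$f" >/dev/null 2>&1 || printf '%s\n' "$f"
    done
}
```

Calling list_non_utf8_files * in the suspect directory plays the same role as isutf8 * above.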
Convert the filenames to UTF-8 encoding
At this point you need to convert the filenames to UTF-8. This is easily done with convmv (its man page says it “converts filenames from one encoding to another”, which is exactly what we need). The usage is simple:
convmv -f latin1 -t utf-8 *
and this is an example output:
Your Perl version has fleas #37757 #49830
Starting a dry run without changes...
mv "./Cefal�txt" "./Cefalù.txt"
mv "./Badem�el-Badschr�ke.txt" "./Bademöbel-Badschränke.txt"
mv "./Blutzuckermessger�e.txt" "./Blutzuckermessgeräte.txt"
Beware that by default this is a dry run: you have to add the --notest flag to actually change the filenames.
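Where convmv cannot be installed, the same dry-run-first idea can be sketched with iconv and mv. This is not how convmv works internally, only a rough stand-in under stated assumptions: the names are latin1, iconv is available, and rename_latin1_to_utf8 and its --apply switch are hypothetical names invented for this example.

```shell
# Rename the files in directory $1 from latin1 names to UTF-8 names.
# Without a second argument only the planned mv commands are printed;
# pass --apply (a switch of this sketch only) to really rename.
rename_latin1_to_utf8() {
    dir=$1
    mode=${2:-}
    for path in "$dir"/*; do
        [ -e "$path" ] || continue
        name=${path##*/}
        # Names that are already valid UTF-8 are left alone.
        if printf '%s' "$name" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
            continue
        fi
        new=$(printf '%s' "$name" | iconv -f ISO-8859-1 -t UTF-8) || continue
        # Never overwrite an existing file.
        [ -e "$dir/$new" ] && continue
        if [ "$mode" = "--apply" ]; then
            mv -- "$path" "$dir/$new"
        else
            printf 'mv "%s" "%s"\n' "$path" "$dir/$new"
        fi
    done
}
```

As with convmv, it is worth reading the dry-run output of rename_latin1_to_utf8 . before running rename_latin1_to_utf8 . --apply.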
Hope this helps.
References
http://joey.kitenet.net/code/moreutils/
http://www.j3e.de/linux/convmv/
great article – saved me hours of searching
I got a similar error and this helped figure out what I was looking for. The _key_ is that the sequence that SVN reports is the letters of the file name. So using echo to print the hex digits helps find the file. Thanks.