setting a UTF-8 in java and csv file [duplicate]
It took some time, but I found a solution to your problem.
First I opened Notepad and wrote the following line: שלום, hello, привет. Then I saved it as he-en-ru.csv using UTF-8. Then I opened it with MS Excel and everything worked well.
Next, I wrote a simple Java program that writes this line to a file as follows:
    PrintWriter w = new PrintWriter(new OutputStreamWriter(os, "UTF-8"));
    w.print(line);
    w.flush();
    w.close();
When I opened this file in Excel I saw gibberish.
Then I compared the raw bytes of the two files and (as expected) saw that the file generated by Notepad starts with a 3-byte prefix (decimal and hex values):
239 (0xEF), 187 (0xBB), 191 (0xBF)
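If you want to repeat this check on your own files, something like the following should do. It is only a minimal sketch: the class and method names (BomCheck, hasUtf8Bom, fileHasUtf8Bom) are made up for illustration, and it looks only for the UTF-8 prefix shown above, not for other byte order marks.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of any library.
class BomCheck {
    // The three prefix bytes: EF BB BF (239 187 191 in decimal).
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Returns true if the given bytes start with the UTF-8 BOM. */
    static boolean hasUtf8Bom(byte[] head) {
        if (head.length < UTF8_BOM.length) {
            return false;
        }
        for (int i = 0; i < UTF8_BOM.length; i++) {
            if (head[i] != UTF8_BOM[i]) {
                return false;
            }
        }
        return true;
    }

    /** Reads up to the first three bytes of a file and checks them for the BOM. */
    static boolean fileHasUtf8Bom(Path file) throws IOException {
        byte[] head = new byte[UTF8_BOM.length];
        int n = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) {
                    break;
                }
                n += r;
            }
        }
        return n == head.length && hasUtf8Bom(head);
    }

    public static void main(String[] args) throws IOException {
        // Pass one or more file paths on the command line.
        for (String arg : args) {
            System.out.println(arg + ": BOM present = " + fileHasUtf8Bom(Paths.get(arg)));
        }
    }
}
```

Running it against the Notepad file and the Java-generated file should report the prefix for the first one only.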
So, I modified my code to print this prefix first and the text after that:
    String line = "שלום, hello, привет";
    OutputStream os = new FileOutputStream("c:/temp/j.csv");
    os.write(239);
    os.write(187);
    os.write(191);
    PrintWriter w = new PrintWriter(new OutputStreamWriter(os, "UTF-8"));
    w.print(line);
    w.flush();
    w.close();
And it worked! I opened the file in Excel and saw the text as I expected.
Bottom line: write these 3 bytes before writing the content. This prefix is the UTF-8 byte order mark (BOM), and it marks the content as 'UTF-8 with BOM'. Without it the file is just 'UTF-8 without BOM', which is what Excel showed as gibberish in my first attempt.
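The same idea can be written a bit more compactly. The three bytes are simply the UTF-8 encoding of the character U+FEFF, so writing that character through the UTF-8 writer should produce the same prefix. This is a sketch, not the only way to do it: the names BomCsvWriter, writeWithBom and encodeWithBom and the default file name are invented for the example, and the sample line is spelled with Unicode escapes only so that the source compiles regardless of the source file encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of any library.
class BomCsvWriter {
    /** Writes U+FEFF (encoded as EF BB BF) followed by the content, then closes the stream. */
    static void writeWithBom(OutputStream os, String content) throws IOException {
        try (Writer w = new OutputStreamWriter(os, StandardCharsets.UTF_8)) {
            w.write('\uFEFF'); // the BOM character; the encoder turns it into 3 bytes
            w.write(content);
        }
    }

    /** Convenience wrapper that returns the encoded bytes instead of writing a file. */
    static byte[] encodeWithBom(String content) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try {
            writeWithBom(buf, content);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // The Hebrew / English / Russian sample line, written with Unicode escapes.
        String line = "\u05e9\u05dc\u05d5\u05dd, hello, \u043f\u0440\u0438\u0432\u0435\u0442";
        // Assumed default output name; pass a path as the first argument to override.
        String target = args.length > 0 ? args[0] : "j-with-bom.csv";
        try (OutputStream os = Files.newOutputStream(Paths.get(target))) {
            writeWithBom(os, line);
        }
    }
}
```

Using try-with-resources also means the writer is flushed and closed even if writing fails halfway.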
Unfortunately, CSV is a very ad hoc format with no metadata and no real standard that would mandate a flexible encoding. As long as you use CSV, you can't reliably use any characters outside of ASCII.
Your alternatives:
- Write to XML (which does have encoding metadata if you do it right) and have the users import the XML into Excel.
- Use Apache POI to create actual Excel documents.
Excel doesn't use UTF-8 to open CSV files. That's a known problem. The actual encoding used depends on the locale settings of Microsoft Windows. With a German locale, for example, Excel would open a CSV file with CP1252.
You could create an Excel file containing some Persian characters and save it as a CSV file. Then write a small Java program to read this file and test some common encodings. That's how I figured out the correct encoding for German umlauts in CSV files.
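Such a test program could look roughly like this. Treat it as a sketch: EncodingProbe and decodeWithCandidates are names I made up, and the list of candidate charsets in main is only a guess at what Excel might use (windows-1256 is the Windows Arabic code page; it may not be available on every JVM, which is why unsupported names are skipped).

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of any library.
class EncodingProbe {
    /** Decodes the same bytes with each candidate charset that this JVM supports. */
    static Map<String, String> decodeWithCandidates(byte[] data, String... charsetNames) {
        Map<String, String> results = new LinkedHashMap<>();
        for (String name : charsetNames) {
            if (!Charset.isSupported(name)) {
                continue; // skip charsets this JVM does not provide
            }
            // Malformed input is replaced with U+FFFD rather than throwing.
            results.put(name, new String(data, Charset.forName(name)));
        }
        return results;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java EncodingProbe <file.csv>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        // Candidate list is an assumption; extend it with whatever your locale suggests.
        Map<String, String> decoded =
                decodeWithCandidates(data, "UTF-8", "windows-1252", "windows-1256", "UTF-16LE");
        decoded.forEach((name, text) -> System.out.println(name + ": " + text));
    }
}
```

The charset whose output reads correctly is the one Excel used when it saved the file.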