Asked by: 小点点

Java - Reading UTF-8 bytes from a File into a String in a system-independent way


How do I accurately read a UTF-8 encoded file into a String in Java?

When I change the encoding of this .java file to UTF-8 (Eclipse

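Why should the encoding of the source file have any effect on creating a String from bytes? What is the foolproof way to create a String from bytes when the encoding is known? I may have files in different encodings. Once a file's encoding is known, I must be able to read it into a String regardless of the value of file.encoding.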

The contents of the UTF-8 file are as follows:

English Hello World.
Korean 안녕하세요.
Japanese 世界こんにちは。
Russian Привет мир.
German Hallo Welt.
Spanish Hola mundo.
Hindi हैलो वर्ल्ड।
Gujarati હેલો વર્લ્ડ.
Thai สวัสดีชาวโลก.

-End of file-

The code is below. My observations are in the comments inside it.

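import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class App {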
public static void main(String[] args) {
    String slash = System.getProperty("file.separator");
    File inputUtfFile = new File("C:" + slash + "sources" + slash + "TestUtfRead" + slash + "utf8text.txt");
    File outputUtfFile = new File("C:" + slash + "sources" + slash + "TestUtfRead" + slash + "utf8text_out.txt");
    File outputUtfByteWrittenFile = new File(
            "C:" + slash + "sources" + slash + "TestUtfRead" + slash + "utf8text_byteout.txt");
    outputUtfFile.delete();
    outputUtfByteWrittenFile.delete();

    try {

        /*
         * read a utf8 text file with internationalized strings into bytes.
         * there should be no information loss here, when read into raw bytes.
         * We are sure that this file is UTF-8 encoded. 
         * Input file created using Notepad++. Text copied from Google translate.
         */
        byte[] fileBytes = readBytes(inputUtfFile);

        /*
         * Create a string from these bytes. Specify that the bytes are UTF-8 bytes.
         */
        String str = new String(fileBytes, StandardCharsets.UTF_8);

        /*
         * The console is incapable of displaying this string.
         * So we write into another file. Open in notepad++ to check.
         */
        ArrayList<String> list = new ArrayList<>();
        list.add(str);
        writeLines(list, outputUtfFile);

        /*
         * Works fine when I read bytes and write bytes. 
         * Open the other output file in notepad++ and check. 
         */
        writeBytes(fileBytes, outputUtfByteWrittenFile);

        /*
         * I am using JDK 8u60.
         * I tried running this on command line instead of eclipse. Does not work.
         * I tried using apache commons io library. Does not work. 
         *  
         * This means that new String(bytes, charset); does not work correctly. 
         * There is no real effect of specifying charset to string.
         */
    } catch (IOException e) {
        e.printStackTrace();
    }

}

public static void writeLines(List<String> lines, File file) throws IOException {
    BufferedWriter writer = null;
    OutputStreamWriter osw = null;
    OutputStream fos = null;
    try {
        fos = new FileOutputStream(file);
        osw = new OutputStreamWriter(fos);
        writer = new BufferedWriter(osw);
        String lineSeparator = System.getProperty("line.separator");
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            writer.write(line);
            if (i < lines.size() - 1) {
                writer.write(lineSeparator);
            }
        }
    } catch (IOException e) {
        throw e;
    } finally {
        close(writer);
        close(osw);
        close(fos);
    }
}

public static byte[] readBytes(File file) {
    FileInputStream fis = null;
    byte[] b = null;
    try {
        fis = new FileInputStream(file);
        b = readBytesFromStream(fis);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        close(fis);
    }
    return b;
}

public static void writeBytes(byte[] inBytes, File file) {
    FileOutputStream fos = null;
    try {
        fos = new FileOutputStream(file);
        writeBytesToStream(inBytes, fos);
        fos.flush();
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        close(fos);
    }
}

public static void close(InputStream inStream) {
    // Guard against null: the stream may never have been opened.
    if (inStream != null) {
        try {
            inStream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

public static void close(OutputStream outStream) {
    // Guard against null: the stream may never have been opened.
    if (outStream != null) {
        try {
            outStream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

public static void close(Writer writer) {
    if (writer != null) {
        try {
            writer.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        writer = null;
    }
}

public static long copy(InputStream readStream, OutputStream writeStream) throws IOException {
    int bytesread = -1;
    byte[] b = new byte[4096]; //4096 is default cluster size in Windows for < 2TB NTFS partitions
    long count = 0;
    bytesread = readStream.read(b);
    while (bytesread != -1) {
        writeStream.write(b, 0, bytesread);
        count += bytesread;
        bytesread = readStream.read(b);
    }
    return count;
}
public static byte[] readBytesFromStream(InputStream readStream) throws IOException {
    ByteArrayOutputStream writeStream = null;
    byte[] byteArr = null;
    writeStream = new ByteArrayOutputStream();
    try {
        copy(readStream, writeStream);
        writeStream.flush();
        byteArr = writeStream.toByteArray();
    } finally {
        close(writeStream);
    }
    return byteArr;
}
public static void writeBytesToStream(byte[] inBytes, OutputStream writeStream) throws IOException {
    ByteArrayInputStream bis = null;
    bis = new ByteArrayInputStream(inBytes);
    try {
        copy(bis, writeStream);
    } finally {
        close(bis);
    }
}
};

Edit: for @JB Nizet and everyone :)

//writeLines(list, outputUtfFile, StandardCharsets.UTF_16BE); //does not work
//writeLines(list, outputUtfFile, Charset.defaultCharset()); //does not work. 
writeLines(list, outputUtfFile, StandardCharsets.UTF_16LE); //works

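When reading bytes into a String, I need to specify the byte encoding. When writing the bytes of a String to a file, I need to specify the byte encoding.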

Once I have a String in the JVM, I don't need to remember the source byte encoding, right?

When I write to a file, it should convert the String to my machine's default charset (be it UTF-8, ASCII, or cp1252). That fails. UTF-16BE fails too. Why do some charsets fail?


1 answer

Anonymous user

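The encoding of the Java source file is indeed irrelevant. And the reading part of your code is correct (although inefficient). What is incorrect is the writing part: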

osw = new OutputStreamWriter(fos);

should be replaced by

osw = new OutputStreamWriter(fos, StandardCharsets.UTF_8);

Otherwise, you use the default encoding (which does not seem to be UTF-8 on your system) instead of UTF-8.
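A minimal sketch of the corrected round trip, assuming Java 7 or later for `java.nio.file.Files`. The class name `Utf8RoundTrip`, the helper names, and the temporary file are illustrative and not part of the original code. Both directions name the charset explicitly, so the result should not depend on `file.encoding`.

```java
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8RoundTrip {

    // Decode the whole file as UTF-8, independent of the platform default.
    public static String readUtf8(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }

    // Encode the text as UTF-8 by passing the charset to OutputStreamWriter.
    public static void writeUtf8(Path path, String text) throws IOException {
        try (Writer writer = new BufferedWriter(new OutputStreamWriter(
                new FileOutputStream(path.toFile()), StandardCharsets.UTF_8))) {
            writer.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical scratch file, used only for this demonstration.
        Path tmp = Files.createTempFile("utf8text", ".txt");
        try {
            // Unicode escapes (Russian and Korean samples) keep this source
            // file pure ASCII, so the compiler's source encoding cannot matter.
            String text = "\u041F\u0440\u0438\u0432\u0435\u0442 \uC548\uB155";
            writeUtf8(tmp, text);
            System.out.println(text.equals(readUtf8(tmp)));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

`Files.readAllBytes` also replaces the hand-written copy loop from the question, which is one way to address the "inefficient" remark.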

Note that Java allows forward slashes in file paths, even on Windows:

File inputUtfFile = new File("C:/sources/TestUtfRead/utf8text.txt");

Edit:

Once I have a String in the JVM, I don't need to remember the source byte encoding, right?

Yes, you are right.
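To illustrate this point, here is a small sketch (the class and method names are made up for the example). The same text decoded from UTF-8 bytes and from UTF-16LE bytes yields equal String objects, because a String holds Unicode characters rather than the bytes it was decoded from.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodedStringsAreEqual {

    // Encode the text with the given charset, then decode it with the same charset.
    public static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Russian sample text, written with Unicode escapes to keep the source ASCII.
        String original = "\u041F\u0440\u0438\u0432\u0435\u0442 \u043C\u0438\u0440.";
        String viaUtf8 = roundTrip(original, StandardCharsets.UTF_8);
        String viaUtf16 = roundTrip(original, StandardCharsets.UTF_16LE);
        // The byte arrays differ in length, but the decoded strings compare equal.
        System.out.println(original.getBytes(StandardCharsets.UTF_8).length);
        System.out.println(original.getBytes(StandardCharsets.UTF_16LE).length);
        System.out.println(viaUtf8.equals(viaUtf16));
    }
}
```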

When I write to a file, it should convert the String to my machine's default charset (be it UTF-8, ASCII, or cp1252). That fails.

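If you don't specify any encoding, Java will indeed use the platform default encoding to convert characters to bytes. If you specify an encoding (as suggested at the beginning of this answer), it uses the encoding you tell it to use.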
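The difference can be observed in memory with a sketch like the following. The class `DefaultVersusExplicit` and its `encode` helper are hypothetical names introduced for this example. On the JDK 8 from the question the default follows the operating system locale, while JDK 18 and later default to UTF-8 unless overridden, so the second line printed may vary between machines.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DefaultVersusExplicit {

    // Encode through an OutputStreamWriter; a null charset means "use the platform default".
    public static byte[] encode(String text, Charset charsetOrNull) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = charsetOrNull == null
                ? new OutputStreamWriter(out)
                : new OutputStreamWriter(out, charsetOrNull)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The writer has been closed (and therefore flushed) at this point.
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Thai sample text, written with Unicode escapes to keep the source ASCII.
        String text = "\u0E2A\u0E27\u0E31\u0E2A\u0E14\u0E35";
        System.out.println("Default charset: " + Charset.defaultCharset());
        byte[] viaDefault = encode(text, null);
        byte[] viaUtf8 = encode(text, StandardCharsets.UTF_8);
        System.out.println("Same bytes as UTF-8: " + Arrays.equals(viaDefault, viaUtf8));
    }
}
```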

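But not all encodings can represent all Unicode characters the way UTF-8 can. ASCII, for example, supports only 128 different characters. Cp1252, AFAIK, supports only 256 characters. So the encoding succeeds, but it replaces each unencodable character with a special character (I can't remember which one), which means: I can't encode this Thai or Russian character because it is not part of the character set I support.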
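A short sketch of that substitution, assuming the text is encoded with `String.getBytes(Charset)`, which silently replaces unmappable characters with the charset's replacement bytes (a `?` for the JDK's US-ASCII encoder). The class and helper names are invented for the example. `CharsetEncoder.canEncode` offers a way to detect the problem before writing.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class UnmappableCharacters {

    // Encode, then decode with the same charset, to make the substituted characters visible.
    public static String encodeAndDecode(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    // True if every character of the text can be represented in the charset.
    public static boolean isEncodable(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // Six Cyrillic letters, written with Unicode escapes to keep the source ASCII.
        String russian = "\u041F\u0440\u0438\u0432\u0435\u0442";
        System.out.println(encodeAndDecode(russian, StandardCharsets.US_ASCII));    // ??????
        System.out.println(isEncodable(russian, StandardCharsets.US_ASCII));        // false
        System.out.println(isEncodable(russian, StandardCharsets.UTF_8));           // true
        System.out.println(isEncodable("Hello World.", StandardCharsets.US_ASCII)); // true
    }
}
```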

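UTF-16 encoding should be fine. But make sure to configure your text editor to use UTF-16 when reading and displaying the file contents. If it is configured to use another encoding, the displayed content will be incorrect.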
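One possible reason an editor misreads a UTF-16 file is the byte order mark. In the JDK, `StandardCharsets.UTF_16` writes a big-endian BOM when encoding, while `UTF_16BE` and `UTF_16LE` write none, so the editor has to guess or be told which variant it is looking at. The sketch below (the class and the `hex` helper are made up for this example) prints the bytes produced for the letter A.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf16Variants {

    // Render the encoded bytes of the text as space-separated hex pairs.
    public static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex("A", StandardCharsets.UTF_16));   // FE FF 00 41
        System.out.println(hex("A", StandardCharsets.UTF_16BE)); // 00 41
        System.out.println(hex("A", StandardCharsets.UTF_16LE)); // 41 00
    }
}
```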