Advertisement
Advertisement


Converting UTF-8 to ISO-8859-1 in Java - how to keep it as single byte


Question

I am trying to convert a string encoded in java in UTF-8 to ISO-8859-1. Say for example, in the string 'âabcd' 'â' is represented in ISO-8859-1 as E2. In UTF-8 it is represented as two bytes. C3 A2 I believe. When I do a getbytes(encoding) and then create a new string with the bytes in ISO-8859-1 encoding, I get a two different chars. â. Is there any other way to do this so as to keep the character the same i.e. âabcd?

2009/03/17
1
64
3/17/2009 8:42:29 PM


byte[] iso88591Data = theString.getBytes("ISO-8859-1");

Will do the trick. From your description it seems as if you're trying to "store an ISO-8859-1 String". String objects in Java are always implicitly encoded in UTF-16. There's no way to change that encoding.

What you can do, 'though is to get the bytes that constitute some other encoding of it (using the .getBytes() method as shown above).

2019/08/08

Starting with a set of bytes which encode a string using UTF-8, creates a string from that data, then get some bytes encoding the string in a different encoding:

    byte[] utf8bytes = { (byte)0xc3, (byte)0xa2, 0x61, 0x62, 0x63, 0x64 };
    Charset utf8charset = Charset.forName("UTF-8");
    Charset iso88591charset = Charset.forName("ISO-8859-1");

    String string = new String ( utf8bytes, utf8charset );

    System.out.println(string);

    // "When I do a getbytes(encoding) and "
    byte[] iso88591bytes = string.getBytes(iso88591charset);

    for ( byte b : iso88591bytes )
        System.out.printf("%02x ", b);

    System.out.println();

    // "then create a new string with the bytes in ISO-8859-1 encoding"
    String string2 = new String ( iso88591bytes, iso88591charset );

    // "I get a two different chars"
    System.out.println(string2);

this outputs strings and the iso88591 bytes correctly:

âabcd 
e2 61 62 63 64 
âabcd

So your byte array wasn't paired with the correct encoding:

    String failString = new String ( utf8bytes, iso88591charset );

    System.out.println(failString);

Outputs

âabcd

(either that, or you just wrote the utf8 bytes to a file and read them elsewhere as iso88591)

2009/03/17

This is what I needed:

public static byte[] encode(byte[] arr, String fromCharsetName) {
    return encode(arr, Charset.forName(fromCharsetName), Charset.forName("UTF-8"));
}

public static byte[] encode(byte[] arr, String fromCharsetName, String targetCharsetName) {
    return encode(arr, Charset.forName(fromCharsetName), Charset.forName(targetCharsetName));
}

public static byte[] encode(byte[] arr, Charset sourceCharset, Charset targetCharset) {

    ByteBuffer inputBuffer = ByteBuffer.wrap( arr );

    CharBuffer data = sourceCharset.decode(inputBuffer);

    ByteBuffer outputBuffer = targetCharset.encode(data);
    byte[] outputData = outputBuffer.array();

    return outputData;
}
2016/02/17

If you have the correct encoding in the string, you need not do more to get the bytes for another encoding.

public static void main(String[] args) throws Exception {
    printBytes("â");
    System.out.println(
            new String(new byte[] { (byte) 0xE2 }, "ISO-8859-1"));
    System.out.println(
            new String(new byte[] { (byte) 0xC3, (byte) 0xA2 }, "UTF-8"));
}

private static void printBytes(String str) {
    System.out.println("Bytes in " + str + " with ISO-8859-1");
    for (byte b : str.getBytes(StandardCharsets.ISO_8859_1)) {
        System.out.printf("%3X", b);
    }
    System.out.println();
    System.out.println("Bytes in " + str + " with UTF-8");
    for (byte b : str.getBytes(StandardCharsets.UTF_8)) {
        System.out.printf("%3X", b);
    }
    System.out.println();
}

Output:

Bytes in â with ISO-8859-1
 E2
Bytes in â with UTF-8
 C3 A2
â
â
2014/03/12

For files encoding...

public class FRomUtf8ToIso {
        static File input = new File("C:/Users/admin/Desktop/pippo.txt");
        static File output = new File("C:/Users/admin/Desktop/ciccio.txt");


    public static void main(String[] args) throws IOException {

        BufferedReader br = null;

        FileWriter fileWriter = new FileWriter(output);
        try {

            String sCurrentLine;

            br = new BufferedReader(new FileReader( input ));

            int i= 0;
            while ((sCurrentLine = br.readLine()) != null) {
                byte[] isoB =  encode( sCurrentLine.getBytes() );
                fileWriter.write(new String(isoB, Charset.forName("ISO-8859-15") ) );
                fileWriter.write("\n");
                System.out.println( i++ );
            }

        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                fileWriter.flush();
                fileWriter.close();
                if (br != null)br.close();
            } catch (IOException ex) {
                ex.printStackTrace();
            }
        }

    }


    static byte[] encode(byte[] arr){
        Charset utf8charset = Charset.forName("UTF-8");
        Charset iso88591charset = Charset.forName("ISO-8859-15");

        ByteBuffer inputBuffer = ByteBuffer.wrap( arr );

        // decode UTF-8
        CharBuffer data = utf8charset.decode(inputBuffer);

        // encode ISO-8559-1
        ByteBuffer outputBuffer = iso88591charset.encode(data);
        byte[] outputData = outputBuffer.array();

        return outputData;
    }

}
2014/05/20

Source: https://stackoverflow.com/questions/655891
Licensed under: CC-BY-SA with attribution
Not affiliated with: Stack Overflow
Email: [email protected]