Using a Pinyin Java library in Talend to transliterate Chinese to English

If you want to transliterate Chinese characters into the Roman (Latin) alphabet using Talend, you may find this post helpful.

I will show you how to build a simple Talend job that converts Chinese characters into a romanized form that English speakers can read, using a third-party library (pinyin4j) that follows the Pinyin romanization standard.

You will need to download the jar pinyin4j-2.5.0.jar from: https://mvnrepository.com/artifact/ruiyun/pinyin4j/2.5.0

Create a new Talend DI job and begin by adding a tLibraryLoad component.

Configure the Basic settings (specify the path of pinyin4j-2.5.0.jar).

In the Advanced settings, specify the classes to import. I have imported all of the following, even though some are not required for this example.

import net.sourceforge.pinyin4j.PinyinHelper;
import net.sourceforge.pinyin4j.format.HanyuPinyinCaseType;
import net.sourceforge.pinyin4j.format.HanyuPinyinOutputFormat;
import net.sourceforge.pinyin4j.format.HanyuPinyinToneType;
import net.sourceforge.pinyin4j.format.HanyuPinyinVCharType;
import net.sourceforge.pinyin4j.format.exception.BadHanyuPinyinOutputFormatCombination;

Join the tLibraryLoad to a tFixedFlowInput (tLibraryLoad is normally linked with a trigger such as OnSubjobOk rather than a row connection).

Create a new column and call it ‘Name’.

Enter some Chinese characters as test data, e.g. “你好,世界”.
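
Before wiring up the conversion, it can help to know which characters in the test data are actually Chinese, since only those have a Pinyin reading and punctuation such as the comma is not expected to change. The following is a minimal sketch using only the standard Java library. The class name HanCheck and its methods are hypothetical helpers for illustration and are not part of pinyin4j or Talend. The sample string is written with Unicode escapes so that it does not depend on the file encoding.

```java
public class HanCheck {

    // Returns true if the code point belongs to the Han (Chinese ideograph) script.
    public static boolean isHan(int codePoint) {
        return Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.HAN;
    }

    // Counts the Han characters in a string, iterating by code point.
    public static long countHan(String text) {
        return text.codePoints().filter(HanCheck::isHan).count();
    }

    public static void main(String[] args) {
        // Same text as the test data above, written as Unicode escapes.
        String sample = "\u4F60\u597D,\u4E16\u754C";
        int total = sample.codePointCount(0, sample.length());
        // Prints "4 of 5 characters are Han" (the comma is not Han).
        System.out.println(countHan(sample) + " of " + total + " characters are Han");
    }
}
```

In a Talend job, a helper like this could live in a custom routine and be called from a tJavaRow or tMap expression, for example to skip rows that contain no Chinese characters at all.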

Join the tFixedFlowInput to a tJavaRow, sync the columns and then configure as follows:

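// Output format used later by the tMap expression
HanyuPinyinOutputFormat defaultPinyinFormat = new HanyuPinyinOutputFormat();
// Produce lowercase letters
defaultPinyinFormat.setCaseType(HanyuPinyinCaseType.LOWERCASE);
// Omit tone numbers and tone marks
defaultPinyinFormat.setToneType(HanyuPinyinToneType.WITHOUT_TONE);

// Pass the input value through unchanged; the conversion happens in the tMap
output_row.Name = input_row.Name;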

Now join the tJavaRow to a tMap.

Create a new output with a column ‘Name’ and map the input to the output.

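In the expression editor for the output column, call the pinyin4j method PinyinHelper.toHanyuPinyinString to convert the string. The third argument is the separator placed between the converted syllables (an empty string here), and row2 is the name of the tMap input row in this job, so yours may differ.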

PinyinHelper.toHanyuPinyinString(row2.Name,defaultPinyinFormat,"")

Join the output from the tMap to a tLogRow and run the job.

You should now see the transliterated string in the log window.