Mahout version: 0.7 (for installing Mahout, see https://cwiki.apache.org/MAHOUT/buildingmahout.html)
Hadoop version: 1.0.3
Lucene version: 3.6.0 (plus paoding 1.0, which cannot be pulled in as a Maven dependency and has to be added to the classpath separately)
Mahout's standard classification demo shows how to classify English text, and it does so through Mahout's command-line scripts. This post walks through doing the same thing with Mahout's Java API, on Chinese text.
The standard Mahout text classification pipeline consists of the following steps:
- sequencing: convert the training samples from plain text into Hadoop's standard sequence file format
- vectorize: turn the sequence-format samples into vectors. For example, "钓鱼岛是中国的" is first tokenized into the term sequence "钓鱼,钓鱼岛,岛,中国"; each term is then assigned an index, giving a vector like "1212:1, 232:1, 16:1, 789:1" (a small sketch of this representation follows the list)
- split: divide the samples into a training set and a test set
- train: train a model on the training set
- test: run the model against the test set to get its accuracy and confusion matrix
- classify: use the model to classify new samples
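To make the vector form in the vectorize step concrete, here is a tiny hand-built sketch using Mahout's Vector API. The term-to-index mapping is the made-up one from the example above; in the real pipeline it comes from the dictionary.file-0 produced during vectorization (see the classify step below).
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
// "钓鱼岛是中国的" tokenized into terms, each mapped to a dictionary index
// (the indices are the illustrative ones from the example above)
Vector v = new RandomAccessSparseVector(100000); // cardinality = dictionary size
v.setQuick(1212, 1); // 钓鱼
v.setQuick(232, 1);  // 钓鱼岛
v.setQuick(16, 1);   // 岛
v.setQuick(789, 1);  // 中国
System.out.println(v.asFormatString()); // prints the sparse index:value form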
Each step in detail:
sequencing
Java code:
args = new String[] { "-i", sampleDir, "-o", sequenceDir };
SequenceFilesFromDirectory sequenceJob = new SequenceFilesFromDirectory();
sequenceJob.setConf(getConf());
sequenceJob.run(args);
sampleDir is the directory of sample files, organized as one subdirectory per class, with each subdirectory holding that class's sample files.
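For example, a layout matching the dump shown below would look roughly like this (class folders good and bad, one file per sample):
sampleDir/
  good/
    51057
    55107
  bad/
    85364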
After it runs you will find a chunk-0 file under sequenceDir. It can be converted back to text with Mahout's SequenceFileDumper, which shows content like this:
Input Path: file:/Users/derekzhangv/Develop/temptest/testTopic/testTopic-seq/chunk-0
Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.hadoop.io.Text
Key: /good/51057: Value: 浪潮之巅
Key: /good/55107: Value: 电动车
Key: /bad/85364: Value: 婴儿用品
Count: 3
(Tip: for convenient viewing from the command line you can set up an alias: alias seqdump='/Users/derekzhangv/Develop/mahout-0.7/bin/mahout seqdumper -i `pwd`/$1 | more')
vectorize
Java code:
args = new String[] { "-i", sequenceDir, "-o", vectorDir, "-lnorm",
        "-nv", "-wt", "tfidf", "-s", "2", // minSupport, default 2
        "-a", "net.paoding.analysis.analyzer.PaodingAnalyzer" };
SparseVectorsFromSequenceFiles vectorizeJob = new SparseVectorsFromSequenceFiles();
vectorizeJob.setConf(getConf());
vectorizeJob.run(args);
Parameters: -i/-o set the input/output directories; -lnorm applies log normalization to the vectors; -nv gives the vectors names; -wt selects the weighting scheme, tf or tfidf; -s sets the minimum support, i.e. only terms appearing at least that many times are kept; -a sets the analyzer class used for tokenization. All that is needed here is a Chinese-capable analyzer, in this case PaodingAnalyzer.
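As a quick sanity check that the analyzer is on the classpath and tokenizes Chinese as expected, here is a minimal sketch using the Lucene 3.6 TokenStream API (exception handling omitted; the sample sentence is the one from the overview and is purely illustrative):
import java.io.StringReader;
import net.paoding.analysis.analyzer.PaodingAnalyzer;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
Analyzer analyzer = new PaodingAnalyzer();
// tokenize a sample sentence and print each term the analyzer produces
TokenStream ts = analyzer.tokenStream("text", new StringReader("钓鱼岛是中国的"));
CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
ts.reset();
while (ts.incrementToken()) {
    System.out.println(term.toString());
}
ts.end();
ts.close();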
After it runs, the following appear under [vectorDir]:
df-count
dictionary.file-0
frequency.file-0
tf-vectors
tfidf-vectors
tokenized-documents
wordcount
You can inspect the contents with VectorDumper; I won't go into that here.
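For the record, the rough command-line equivalent is something like the following; the exact flag names vary across Mahout versions, so check mahout vectordump --help before relying on them:
mahout vectordump -i [vectorDir]/tfidf-vectors -d [vectorDir]/dictionary.file-0 -dt sequencefile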
split
Java code:
args = new String[] { "-i", tfidfVectorDir, "--trainingOutput", trainVectorDir,
        "--testOutput", testVectorDir, "--randomSelectionPct", "40",
        "--overwrite", "--sequenceFiles", "-xm", "sequential" };
SplitInput splitJob = new SplitInput();
splitJob.setConf(getConf());
splitJob.run(args);
Parameters: --trainingOutput and --testOutput set the output paths for the training and test samples, --randomSelectionPct sets the split percentage, --sequenceFiles indicates the input is in sequence file format, and -xm sets the execution mode (map-reduce by default; sequential here).
After it runs, the two output folders each contain a part-r-00000 file, which can be inspected with seqdumper.
train
Java code:
args = new String[] { "-i", trainVectorDir, "-o", modelDir, "-li", labelIndexDir, "-el", "-ow" };
TrainNaiveBayesJob trainJob = new TrainNaiveBayesJob();
trainJob.setConf(getConf());
trainJob.run(args);
Parameters: -o sets the model output path; -li sets the path of the labelindex file, which is needed again by the test and classify steps below; -el tells the job to extract the labels from the input samples; -ow allows overwriting existing output.
After it runs you will find naiveBayesModel.bin under [modelDir]; that is the trained model.
test
Java code:
args = new String[] { "-i", testVectorDir, "-m", modelDir, "-l", labelIndexDir, "-ow", "-o", testingDir };
TestNaiveBayesDriver testJob = new TestNaiveBayesDriver();
testJob.setConf(getConf());
testJob.run(args);
Parameters: -i is the path of the test sample vectors, -m the model path, -l the label index path, -ow overwrites any previous output, and -o the output path.
The output files can be inspected with seqdumper.
classify
Java code (a bit dirty):
Configuration conf = new Configuration();
AbstractNaiveBayesClassifier classifier = loadClassifier(topicId, conf);
Vector instance = buildInstance(topicId, text);
// classifyFull returns one score per label
Vector r = classifier.classifyFull(instance);
Path labelIndexPath = new Path(this.tempDir + "/" + topicId + "/" + topicId
        + "-labelIndex");
Map<Integer, String> labelMap = BayesUtils.readLabelIndex(conf, labelIndexPath);
int bestIdx = Integer.MIN_VALUE;
double bestScore = Double.NEGATIVE_INFINITY;
HashMap<String, Double> resultMap = new HashMap<String, Double>();
// pick the label with the highest score
for (int i = 0; i < labelMap.size(); i++) {
    Vector.Element element = r.getElement(i);
    resultMap.put(labelMap.get(element.index()), element.get());
    if (element.get() > bestScore) {
        bestScore = element.get();
        bestIdx = element.index();
    }
}
ClassifyResult result = new ClassifyResult();
if (bestIdx != Integer.MIN_VALUE) {
    String label = labelMap.get(bestIdx);
    double score = bestScore;
    result.setLabel(label);
    result.setScore(score);
}
return result;
Here are buildInstance and loadClassifier:
private AbstractNaiveBayesClassifier loadClassifier(String topicId, Configuration conf) throws IOException {
    Path modelPath = new Path(this.modelDir + "/" + topicId + "-model");
    NaiveBayesModel model = NaiveBayesModel.materialize(modelPath, conf);
    AbstractNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(model);
    return classifier;
}
private Vector buildInstance(String topicId, String text) {
    try {
        reBuildDictionary(topicId);
    } catch (IOException e) {
        e.printStackTrace();
    }
    Vector vector = new RandomAccessSparseVector(FEATURES);
    FeatureExtractor fe = new FeatureExtractor();
    HashSet<String> fs = fe.extract(text);
    // look up each extracted term in the dictionary and set its weight
    for (String s : fs) {
        Integer index = dictionary.get(s);
        if (index == null) {
            continue; // term not seen during training
        }
        vector.setQuick(index, frequency.get(index));
    }
    return vector;
}
(This is the dirty part: the dictionary and frequency maps have to be reconstructed from dictionary.file-0 and frequency.file-0 so that the new sample, the one to be classified, can be vectorized the same way as the training data.)
private boolean dictRebuilt = false;

private void reBuildDictionary(String topicId) throws IOException {
    if (dictRebuilt) return;
    Configuration conf = getConf();
    Path dictionaryFile = new Path(tempDir + "/" + topicId + "/" + topicId + "-vectors/dictionary.file-0");
    // key is the term, value is its index in the dictionary
    for (Pair<Text, IntWritable> record
            : new SequenceFileIterable<Text, IntWritable>(dictionaryFile, true, conf)) {
        dictionary.put(record.getFirst().toString(), record.getSecond().get());
    }
    Path freqFile = new Path(tempDir + "/" + topicId + "/" + topicId + "-vectors/frequency.file-0");
    // key is the term index, value is its document frequency
    for (Pair<IntWritable, LongWritable> record
            : new SequenceFileIterable<IntWritable, LongWritable>(freqFile, true, conf)) {
        frequency.put(record.getFirst().get(), record.getSecond().get());
    }
    dictRebuilt = true;
}
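For completeness, the snippets above also rely on two member maps and a small result holder that aren't shown (FEATURES and FeatureExtractor are the author's own helpers and are not reproduced here); a minimal sketch of what they assume:
// maps rebuilt from dictionary.file-0 and frequency.file-0 above
private final Map<String, Integer> dictionary = new HashMap<String, Integer>();
private final Map<Integer, Long> frequency = new HashMap<Integer, Long>();
// minimal sketch of the ClassifyResult holder used in the classify snippet
public static class ClassifyResult {
    private String label;
    private double score;
    public String getLabel() { return label; }
    public void setLabel(String label) { this.label = label; }
    public double getScore() { return score; }
    public void setScore(double score) { this.score = score; }
}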
That's it.