当前位置：K88软件开发 → 文章中心 → 大数据 → Hadoop → 文章内容

Hadoop 读取数据

减小字体

增大字体作者：佚名来源：网上搜集发布时间：2019-1-26 10:06:51

bug("Total # of splits: " + splits.size()); return splits;}分片间的数据如何处理split是根据文件大小分割的，而一般处理是根据分隔符进行分割的，这样势必存在一条记录横跨两个split解决办法是只要不是第一个split，都会远程读取一条记录。不是第一个split的都忽略到第一条记录public class LineRecordReader extends RecordReader<LongWritable, Text> { private CompressionCodecFactory compressionCodecs = null; private long start; private long pos; private long end; private LineReader in; private int maxLineLength; private LongWritable key = null; private Text value = null; // initialize函数即对LineRecordReader的一个初始化 // 主要是计算分片的始末位置，打开输入流以供读取K-V对，处理分片经过压缩的情况等 public void initialize(InputSplit genericSplit, TaskAttemptContext context) throws IOException { FileSplit split = (FileSplit) genericSplit; Configuration job = context.getConfiguration(); this.maxLineLength = job.getInt("mapred.linerecordreader.maxlength", Integer.MAX_VALUE); start = split.getStart(); end = start + split.getLength(); final Path file = split.getPath(); compressionCodecs = new CompressionCodecFactory(job); final CompressionCodec codec = compressionCodecs.getCodec(file); // 打开文件，并定位到分片读取的起始位置 FileSystem fs = file.getFileSystem(job); FSDataInputStream fileIn = fs.open(split.getPath()); boolean skipFirstLine = false; if (codec != null) { // 文件是压缩文件的话，直接打开文件 in = new LineReader(codec.createInputStream(fileIn), job); end = Long.MAX_VALUE; } else { // 只要不是第一个split，则忽略本split的第一行数据 if (start != 0) { skipFirstLine = true; --start; // 定位到偏移位置，下&#x#x6B21;读取就会从偏移位置开始 fileIn.seek(start); } in = new LineReader(fileIn, job); } if (skipFirstLine) { // 忽略第一行数据，重新定位start start += in.readLine(new Text(), 0, (int) Math.min((long) Integer.MAX_VALUE, end - start)); } this.pos = start; } public boolean nextKeyValue() throws IOException { if (key == null) { key = new LongWritable(); } key.set(pos);// key即为偏移量 if (value == null) { value = new Text(); } int newSize = 0; while (pos < end) { newSize = in.readLine(value, maxLineLength, Math.max((int) Math.min(Integer.MAX_VALUE, end - pos), maxLineLength)); // 读取的数据长度为0，则说明已读完 if (newSize == 0) { break; } pos += newSize; // 读取的数据长度小于最大行长度，也说明已读取完毕 if (newSize < maxLineLength) { break; } // 执行到此处，说明该行数据没读完，继续读入 } if (newSize == 0) { key = null; value = null; return false; } else { return true; } }}

上一页 [1] [2]

Hadoop 读取数据

[] [返回上一页] [打印]

·上一篇文章：Hadoop MapReduce
·下一篇文章：Hadoop Mapper

Hadoop 读取数据

文章评论评论内容只代表网友观点，与本站立场无关！

频道栏目导航

本类热门阅览

相关文章