We're in the process of adding Hive support to Timberwolf, which involves writing files into HDFS so that they can be loaded into Hive tables. Writing to HDFS means working with FSDataOutputStreams and FSDataInputStreams, which is all fine and good until you want to start writing tests. My normal approach when testing something that writes to a stream is to hand it a stream that's ultimately backed by a byte array (generally a ByteArrayOutputStream), then pull those bytes out and verify that they're what I expect them to be. In this case, I was writing a sequence file, so I figured I could use SequenceFile.Reader to pull out my key/value pairs and check that they're correct. That is, until I tried constructing an FSDataInputStream with a ByteArrayInputStream.

Turns out, FSDataInputStream imposes requirements on its backing stream that aren't reflected in the constructor's type signature: the constructor accepts any InputStream, but throws an IllegalArgumentException at runtime unless that stream also implements both Seekable and PositionedReadable. So I needed a stream that I could construct from a byte array and that also implemented those two interfaces. As it turns out, there isn't one of those in the org.apache.hadoop.fs package, so I went ahead and rolled my own: SeekablePositionedReadableByteArrayInputStream. It's not complete, since I wasn't sure what exactly seekToNewSource should do and I didn't need it for my tests, but it gets enough of the job done. Maybe it'll help you, too?
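
For a rough idea of what such a class can look like, here's a minimal sketch. It isn't the class mentioned above, just one way to get the same effect under a few assumptions. So that it compiles without Hadoop on the classpath, the two interfaces are declared locally as stand-ins with the same method signatures as org.apache.hadoop.fs.Seekable and org.apache.hadoop.fs.PositionedReadable; in real test code you'd delete them and import Hadoop's. The class name SeekableByteArrayInputStream is made up for illustration, and returning false from seekToNewSource is my guess at a sensible answer for a single in-memory copy.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;

// Stand-in for org.apache.hadoop.fs.Seekable (same method signatures).
interface Seekable {
    void seek(long pos) throws IOException;
    long getPos() throws IOException;
    boolean seekToNewSource(long targetPos) throws IOException;
}

// Stand-in for org.apache.hadoop.fs.PositionedReadable (same method signatures).
interface PositionedReadable {
    int read(long position, byte[] buffer, int offset, int length) throws IOException;
    void readFully(long position, byte[] buffer, int offset, int length) throws IOException;
    void readFully(long position, byte[] buffer) throws IOException;
}

// Hypothetical name; reuses ByteArrayInputStream's protected buf/pos/count fields.
class SeekableByteArrayInputStream extends ByteArrayInputStream
        implements Seekable, PositionedReadable {

    SeekableByteArrayInputStream(byte[] data) {
        super(data);
    }

    @Override
    public synchronized void seek(long newPos) throws IOException {
        if (newPos < 0 || newPos > count) {
            throw new IOException("Cannot seek to " + newPos);
        }
        pos = (int) newPos;
    }

    @Override
    public synchronized long getPos() {
        return pos;
    }

    @Override
    public boolean seekToNewSource(long targetPos) {
        // Assumption: there is only one copy of the data, so no other source exists.
        return false;
    }

    // Positioned read: copies from an absolute offset without moving the stream position.
    @Override
    public synchronized int read(long position, byte[] buffer, int offset, int length)
            throws IOException {
        if (position < 0) {
            throw new IOException("Negative position: " + position);
        }
        if (length == 0) {
            return 0;
        }
        if (position >= count) {
            return -1;
        }
        int n = (int) Math.min(length, count - position);
        System.arraycopy(buf, (int) position, buffer, offset, n);
        return n;
    }

    @Override
    public void readFully(long position, byte[] buffer, int offset, int length)
            throws IOException {
        // Everything is in memory, so a single positioned read either fills the request or hits the end.
        if (read(position, buffer, offset, length) < length) {
            throw new EOFException("Reached end of data before reading " + length + " bytes");
        }
    }

    @Override
    public void readFully(long position, byte[] buffer) throws IOException {
        readFully(position, buffer, 0, buffer.length);
    }
}
```

With Hadoop's real interfaces swapped in, an instance built from the bytes your writer produced should be acceptable to the FSDataInputStream constructor, since it satisfies both of the runtime checks.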
