HDFS and Unix Integration
There are two straightforward ways and one more complex way of achieving HDFS-VDM integration:
- Use HDFS and stream data into the VDMETL process, which in turn handles splitting, parsing and loading
- Use Hadoop's Java streaming API classes, with the VDM parsers as mappers, storing the load-ready file in HDFS and subsequently loading from it
- Enable VDMGEN to create Hadoop-enabled Java mappers that can transform data in parallel from the split blocks in HDFS
The first approach is simple to implement and provides the most obvious first step. The stream pulled out of HDFS is funneled through a single pipe into the VDM process. VDM in split mode can then split the ingested data on local disk and process it in parallel. Using HDFS to store source data has advantages such as cost-effective scaling, redundancy, immediate accessibility and higher throughput than conventional file systems or SANs. The process that reads the pipe and ingests the data into VDMETL is a very thin, efficient task that can feed upwards of 50 parser streams without becoming a bottleneck.
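A minimal sketch of that funneling task is shown below, assuming the Hadoop client libraries are on the classpath; the `vdmetl` command name, its `--split` flag and the source directory argument are placeholders, not the actual product interface.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.OutputStream;

public class HdfsToVdmPipe {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path sourceDir = new Path(args[0]);   // HDFS directory holding the source files

        // "vdmetl" and its flag are placeholders for the actual ETL command line.
        Process vdm = new ProcessBuilder("vdmetl", "--split")
                .redirectErrorStream(true)
                .start();

        try (OutputStream toVdm = vdm.getOutputStream()) {
            for (FileStatus status : fs.listStatus(sourceDir)) {
                if (!status.isFile()) {
                    continue;
                }
                // Funnel each HDFS file through the single pipe feeding VDMETL.
                try (FSDataInputStream in = fs.open(status.getPath())) {
                    in.transferTo(toVdm);   // Java 9+: buffered copy into the pipe
                }
            }
        }
        System.exit(vdm.waitFor());
    }
}
```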
The second approach achieves the same goals using Hadoop's Java streaming API. It invokes the parsers in parallel, one per block, and stores the resulting load files in HDFS. The final step is to pipe the generated load file into the loader, finishing the process with intra-database activities. This approach avoids funneling the data prior to load time, exploits the partitioning implemented in HDFS, and processes each block locally. The loader should ideally be implemented as the reducer to avoid staging the load file in HDFS, but an apparent issue with the current version of the Hadoop framework prevents the reducer from writing to Unix pipes.
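A hedged sketch of such a map-only streaming job follows; the input and output paths and the `vdm_parse` mapper command are assumptions used for illustration only.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.streaming.StreamJob;
import org.apache.hadoop.util.ToolRunner;

public class VdmStreamingDriver {
    public static void main(String[] args) throws Exception {
        String[] streamArgs = {
            "-input",  "/data/source",        // raw source blocks in HDFS (assumption)
            "-output", "/data/load_ready",    // load-ready files staged back in HDFS
            "-mapper", "vdm_parse",           // VDM parser run once per input split
            "-numReduceTasks", "0"            // map-only: no reducer, per the note above
        };
        int rc = ToolRunner.run(new Configuration(), new StreamJob(), streamArgs);
        System.exit(rc);
    }
}
```

The staged part files would then be read back out of HDFS and piped into the loader, since, as noted, the reducer cannot currently write to a Unix pipe.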
The third approach also avoids funneling and results in a similar run-time topology with comparable performance. The advantages of this level of integration lie in the richer (albeit slower) language available for the parser and in the Hadoop environment itself (a minimal mapper sketch follows the list below):
- Standardization around a common schema and the potential for efficient, standard serialization using AVRO
- UTF-8 support across storage and processing components
- Opens the gate to Hadoop as an ETL target
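The sketch below shows only the general shape a VDMGEN-generated, Hadoop-enabled Java mapper might take; the pipe-delimited record layout, field count and tab-delimited output are assumptions, since a generated mapper would embed the actual parsing rules.

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class GeneratedVdmMapper
        extends Mapper<LongWritable, Text, NullWritable, Text> {

    private final Text loadRecord = new Text();

    @Override
    protected void map(LongWritable offset, Text rawLine, Context context)
            throws IOException, InterruptedException {
        // Parse one source record from the locally stored HDFS block (UTF-8 text)
        // and emit it in load-ready, delimiter-normalized form.
        String[] fields = rawLine.toString().split("\\|", -1);
        if (fields.length < 3) {
            context.getCounter("vdm", "rejected").increment(1);
            return;
        }
        loadRecord.set(String.join("\t",
                fields[0].trim(), fields[1].trim(), fields[2].trim()));
        context.write(NullWritable.get(), loadRecord);
    }
}
```

Because each mapper runs against the block local to its node, the transformation parallelism follows the HDFS partitioning rather than a downstream split step.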