HDFS Design and Synergies with VDM
The Hadoop Distributed File System (HDFS) uses divide-and-conquer techniques behind the covers to distribute data and processing. According to Tom White's "Hadoop: The Definitive Guide" (O'Reilly), the design of HDFS is driven by three primary objectives:
- Accommodate very large files
- Optimize for throughput, streaming data efficiently
  - Read whole files as opposed to specific records
  - Optimize total access time as opposed to getting to the first record fast
- Use commodity hardware
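The streaming objective above can be illustrated with a small local sketch. The function below reads a file front to back in large chunks, the whole-file access pattern HDFS favors; the chunk size and the scratch file are illustrative stand-ins, not HDFS internals.

```python
import os
import tempfile

def stream_read(path, chunk_size=4 * 1024 * 1024):
    """Read a file front-to-back in large sequential chunks, the
    access pattern HDFS is optimized for (throughput over seek latency).
    Returns the total number of bytes visited."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total

# Demo on a local scratch file standing in for one HDFS block.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * (10 * 1024 * 1024))  # 10 MiB of dummy data
    path = tmp.name

print(stream_read(path))  # 10485760: every byte read exactly once
os.remove(path)
```

The point is that cost is dominated by sequential transfer, not by positioning: there is no seeking back into the file for individual records.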
What HDFS is not currently designed or optimized for:
- Low-latency direct access to specific records or byte ranges within a file
- Lots of small files - the fixed overhead associated with each file open and read is relatively high. The same amount of data stored in 1,000 files rather than 10 files incurs 100 times the per-file overhead, which can be quite significant.
- File maintenance limitations
  - No concurrent writes - only one writer at a time may write to a file
  - No updates in the middle of a file - all writing is limited to appending at the end of a file
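The small-files overhead claim above is simple arithmetic, sketched below. The 10 ms per-file figure is an illustrative assumption, not a measured HDFS value; the fixed cost really comes from the NameNode lookup and block-location fetch done on every open.

```python
def total_overhead(num_files, per_file_overhead_ms=10):
    """Model the fixed per-file cost of reading a data set split
    across num_files files. The 10 ms default is an illustrative
    assumption standing in for open/NameNode-lookup overhead."""
    return num_files * per_file_overhead_ms

# Same total data, stored in 10 large files vs 1,000 small files:
few = total_overhead(10)     # 100 ms of per-file overhead
many = total_overhead(1000)  # 10000 ms of per-file overhead
print(many // few)           # 100: a hundredfold more overhead
```

The data volume is identical in both cases; only the number of file opens changes, which is why consolidating small files matters so much on HDFS.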
VDMETL is designed with the following assumptions and goals:
- The target is typically a DBMS or another environment that requires data to be cleansed and transformed. XML and conventional table structures are supported.
- Large amounts of data stored in large fixed-format files that don't change, but can be replaced to address upstream corrections
- Efficient history loads that require massive amounts of data to be processed in short time windows
- Efficient reloads and corrections with minimal disruption of production capacity
- A fast and easy mechanism for implementing small incremental changes to the transformation/cleansing rules in the provisioning processes
- Openness and simplicity - use standard open Unix tools and platforms - very much in line with Hadoop and HDFS
- Convention over configuration - Avoid complexity by standardization and enforcement of naming and other conventions
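Convention over configuration can be sketched as deriving locations and names from a fixed naming scheme instead of per-feed configuration. The layout below (/data/<source>/<feed>/dt=YYYY-MM-DD/) is a hypothetical example, not VDMETL's actual convention.

```python
from datetime import date

def landing_path(source, feed, run_date):
    """Derive a landing path purely from naming conventions, so no
    per-feed configuration file is needed. The directory layout is a
    hypothetical illustration of convention over configuration."""
    return f"/data/{source}/{feed}/dt={run_date.isoformat()}/"

print(landing_path("crm", "accounts", date(2024, 1, 15)))
# /data/crm/accounts/dt=2024-01-15/
```

Because every process computes the same path from the same inputs, adding a new feed requires no new configuration, only adherence to the convention.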