I’m back writing on my personal blog, because i’ve felt the “force”.
One of the biggest accomplishments of Hadoop community is the open file formats that were developed to “fix” some things that were not enough on this Big Data world. As example, we have ORC file format that was designed under Stinger initiative to help Hive to answer to human-time requests in human-time expectations.
ORC is basically a columnar format (Optimized Row Columnar) with some smart tricks to allow huge performance improvements on Hive processing.
I will explore the “under the wood” ORC files, so we can understand it better. ON this post we will explore one of the major features of ORC file format: Built-In indexes
ORC contains 3 levels of indexes:
- file level (across the entire file)
- stripe level (across the entire stripe)
- row level
Before we continue with each of the levels of the built-in indexes, let’s talk a little about Stripes. Stripes are they way ORC Files are splitted. Most of the times, the size of the Stripe is 64MB (67108864 bytes). You can change the parameter and do some tuning based on the average lenght of your rows and how many unique values you have. For now let’s consider 64MB.
hive> set hive.exec.orc.default.stripe.size; hive.exec.orc.default.stripe.size=67108864
Back to the indexes, file and stripe level columnar statistics are under the footer of the file, so it would be easier for the engine to read it, without reading the whole ORC file.
These columnar statistics include:
- COUNT of the values
- If there is NULL values or not
- MIN/MAX for each column
- SUM when it is an integer.
- Bloom filters (> Hive1.2)
To verify this, let’s look at a dumpfile of ORC file that corresponds to a table called product.
File and stripe Level
On our particular case, we only have one stripe (as the table is very small), so the columnar statistics present in File Statistics will be the same the ones present in Stripe Statistics.
[hive@sandbox root]$ hive --orcfiledump /apps/hive/warehouse/foodmart.db/product Processing data file hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/foodmart.db/product/product [length: 57544]
The results are what you expected. 1 stripe, 15 columns, hasNull appears, some of the column are integers and therefore the “sum” appears. No bloom filters.
Now, if you are using TEZ as Hive execution engine, it will use file/stripe footer statistics to determine how many TEZ containers it will fire-up (not only) according the information that is present.
A very common sign of unsorted data is the number of containers that are necessary to fullfil your querys. As example, the image bellow shows how unsorted data can actually make your querys much slower.
On this sorted case example, TEZ engine will only need one container to scan because it reads ORC file and stripe columnar statistics and decided that for value 13 only one container is needed.
So please, if possible, sort your data when you are inserting.
Things will get more fun on row level. This statistics show the values in each column for each set of 10,000 rows (a stride or a row group) within a stripe. These include both the column statistics for each row group and the position for seeking to the start of the row group.
The stride size means the block of data that can be skipped by the ORC reader during a read operation based on these indexes.
hive> set hive.exec.orc.default.row.index.stride; hive.exec.orc.default.row.index.stride=10000
Dump will reveal a little more. Data is divided as streams, for each stream and each column is possible to know where row group starts (start offset) and where it ends (start + length).
Please remember that each stride contains 10k rows and our table only contains 1560.
This is an introduction of how ORC Built-in Index work and maybe this can reminder you that for Exadata Storage Indexes are not unique 🙂