Hive ORC file format – Built-in Indexes

I’m back writing on my personal blog, because i’ve felt the “force”.

One of the biggest accomplishments of Hadoop community is the open file formats that were developed to “fix” some things that were not enough on this Big Data world. As example, we have ORC file format  that was designed under Stinger initiative to help Hive to answer to human-time requests in human-time expectations.

ORC is basically a columnar format (Optimized Row Columnar) with some smart tricks to allow huge performance improvements on Hive processing.

I will explore the “under the wood” ORC files, so we can understand it better. ON this post we will explore one of the major features of ORC file format: Built-In indexes

ORC contains 3 levels of indexes:

  • file level (across the entire file)
  • stripe level (across the entire stripe)
  • row level

Before we continue with each of the levels of the built-in indexes, let’s talk a little about Stripes. Stripes are they way ORC Files are splitted. Most of the times, the size of the Stripe is 64MB (67108864 bytes). You can change the parameter and do some tuning based on the average lenght of your rows and how many unique values you have. For now let’s consider 64MB.

hive> set hive.exec.orc.default.stripe.size;

Back to the indexes, file and stripe level columnar statistics are under the footer of the file, so it would be easier for the engine to read it, without reading the whole ORC file.

These columnar statistics include:

  • COUNT of the values
  • If there is NULL values or not
  • MIN/MAX for each column
  • SUM when it is an integer.
  • Bloom filters (> Hive1.2)

To verify this, let’s look at a dumpfile of ORC file that corresponds to a table called product.

File and stripe Level

On our particular case, we only have one stripe (as the table is very small), so the columnar statistics present in File Statistics will be the same the ones present in Stripe Statistics.

[hive@sandbox root]$ hive --orcfiledump /apps/hive/warehouse/foodmart.db/product
 Processing data file hdfs:// [length: 57544]

The results are what you expected. 1 stripe, 15 columns, hasNull appears, some of the column are integers and therefore the “sum” appears. No bloom filters.












Now, if you are using TEZ as Hive execution engine, it will use file/stripe footer statistics to determine how many TEZ containers it will fire-up (not only) according the information that is present.

A very common sign of unsorted data is the number of containers that are necessary to fullfil your querys. As example, the image bellow shows how unsorted data can actually make your querys much slower.

On this sorted case example, TEZ engine will only need one container to scan because it reads ORC file and stripe columnar statistics and decided that for value 13 only one container is needed.

So please, if possible, sort your data when you are inserting.











Row Level

Things will get more fun on row level. This statistics show the values in each column for each set of 10,000 rows (a stride or a row group) within a stripe. These include both the column statistics for each row group and the position for seeking to the start of the row group.

The stride size means the block of data that can be skipped by the ORC reader during a read operation based on these indexes.

hive> set hive.exec.orc.default.row.index.stride;

Dump will reveal a little more. Data is divided as streams, for each stream and each column is possible to know where row group starts (start offset) and where it ends (start + length).
Please remember that each stride contains 10k rows and our table only contains 1560.









This is an introduction of how ORC Built-in Index work and maybe this can reminder you that for Exadata Storage Indexes are not unique 🙂


Cloud is the trending topic in Data Management area, it was “Big Data” before, but the word “Big Data” became infamous mostly because inside companies (traditional or startups) decision makers didn’t know what to do with so many buzz tools, success articles, google-sized use cases, Gartner magic quadrants, vendors pressure and business non-existent use cases.
Things are more clear in “Big Data” space, there is the clear understanding that it will be the answer to some business questions and future needs and it needs to be fully integrated with the “old” technologies. Integration and “play together”, big and small data.

Cloud is another beast and it is mainly about agility, easy scaling and cost and decision makers know exactly this means for their companies. It is clear as water for them.

Some traditional companies are moving parts of their workloads (the less critical parts) to the cloud, while still maintain a big portion of the workloads on their on-premises data centers. Startups, they were born in the cloud so there is no other clear choice for them.
Oracle Database still remain very present on traditional companies, assuring critical OTLP and OLAP workloads and consequently present in all stages of development/test and production. Some of this stages and workloads will be forced to move to the cloud. From now on, things will become a complex world if you don’t pick Oracle Cloud and decide (for whatever reason) for the IaaS/PaaS leader: Amazon AWS.

First, it will be helpful to check Oracle position on this and that will be “Oracle Database Support for Amazon AWS EC2 (Doc ID 2174134.1)” on MOS as it states:

  • Singles instances (no RAC) is supported in Amazon AMI  on top of OEL 6.4 or latter (EC2 only)
  • No support for Oracle RAC at all
  • No support for Oracle Multitenant in Oracle 12C
  • No support for Oracle RDS, even it is single instance only.

As Single Instance is supported inside EC2, RAC is not and Oracle has a detailed document about “Third Party Clouds and Oracle RAC”:

On this document Oracle states that RAC is not support for 2 reasons: the lack of shared storage and missing required network capabilities and it justify both reason.

  • Lack of shared storage: Amazon AMI images allow bypass EBS limitation on shared storage (concurrent access) using iSCSI and building a NAS to “emulate” shared storage. As Oracle states, there is of course an performance impact on I/O as another layer is built to “emulate” shared storage, so Amazon recomends large AMI instances stating the following here:

“In order to provide high I/O performance, we chose to build the NAS instances on EC2’s i2.8xlarge instance size, which has 6.25 terabytes of local ephemeral solid state disk (SSD). With the i2.8xlarge instances, we can present those SSD volumes via iSCSI to the RAC nodes”

Another side effect of Amazon AMI images is they rely on “ephemeral” storage as it literally means: “It is persistant for the life of the instance.”  You will have data loss if your NAS instances fail and Amazon is aware of this and states the following in the same document:

“The SSD volumes that come with the i2.8xlarge instance are ephemeral disk. That means that upon stop, termination or failure of the instance, the data on those volumes will be lost. This is obviously unacceptable to most customers, so we recommend deploying two or three of these NAS instances, and mirroring the storage between them using ASM Normal or High redundancy”Oracle, of course, doesn’t find this a very good solution (let’s face it, it is not) as the data loss danger is real.

  • Required network capabilities: The network on EC2 doesn’t not support multicast IP, however this is much needed for RAC to broadcast packets during cluster configuration and reconfiguration.
    Amazon provided here an workaround for this issue: point-to-point VPN among the RAC nodes using NTop N2N (a discontinued product).
    Of course network performance will suffer on top of this solution and we know how Oracle RAC deals with bad network performance on interconnect with the popular “gc wait events”.
    That said, Amazon is well aware of this stating:”

    Performance of network communications via N2N is significantly lower than non-VPN traffic over the same EC2 network. In addition, Ntop, the makers of N2N, are no longer developing the N2N product. These factors may preclude running production-class RAC workloads over N2N.Currently, we are developing new approaches using clusterware-managed dynamic creation of GRE tunnels to serve the cluster’s multicast needs. We will provide details on this approach and revise this tutorial accordingly before August 2016.”Still no news at date of this post (as far we have investigated).

Conclusion is that there two major workarounds provided by Amazon AWS should be carefully evaluated if you decide to deploy a RAC cluster inside AWS. Things will eventually improve on Amazon if the demand for deploying Orace RAC is high, but for now please be very careful on this topic.


Article also published here:

Durability – Oracle vs Cassandra

In this post, i will try the analogy between how Cassandra provides durability to data and how it compares to well-know (at least for me) Oracle “method” of data persistence.

So Cassandra uses something called commitlog, where data that is inserted, updated or deleted is appended.
In that way Cassandra provide us the “D”, of Durability. As the operation of a write is always an “append”, there is no seek time (or is minimal) and the sequential write is much more efficient.

Also, there is a parameter called commitlog_sync that defaults to “periodic“, that will sync commitlog with disk every commitlog_sync_period_in_ms (1000ms of default in 2.x).There is also other modes of commitlog_sync, mainly “batch” that will tell Cassandra to group the “mutations” into batches and then sync it.

Does this commitlog job sounds familiar to you? Oracle redo log data is first written in a buffer memory (like commitlog) right before it is flushed to the disk. LGWR is also responsible for writting the commit records to the log file, also it also groups the “commits” (when needed) and flush it to the redo log files in “batch”.
LGWR doesn’t have a single parameter like “commitlog_sync_period_in_ms” or “commitlog_batch_window_in_ms” to control when flushing from redo buffer to disk happens, it has a bunch of conditions when to flush the contents to redo files (every 3 seconds, redo buffer is 1/3, before DBWR writes out a dirty block,etc).

As Cassandra on commitlog and Oracle on redo log files, both are used in case of disaster or node restart need to be replayed to unsure that recovery is done.

The little difference here is subtle: If you are really unlucky and all your Cassandra replicas crash during the “period_in_ms” or “windows_in_ms” (depending of your commitlog_sync parameter), your data is lost, unless you set is to 0.

That doesn’t happen (by design) in Oracle and LGWR. No matter what you do in the database, when it crashes you recovery everything that showed to you to “committed”.

But for Cassandra this is only part of the game, the other part of the game is something called memtable that is basically a structure in memory with the content stored in key-value pair and sorted by key. memtable also stores every single write until it reaches a specific threshold and then it is flushed to immutable and ordered disk structure called SSTables.
So the question is, how Cassandra knows what data is actually persistent as SSTable and in case of node disaster what to “replay”? The answer is simple: As soon the memtable is flushed to a SSTable the data in commitlog is purged!

Here, there is no archivelogs – data is purged as soon as it gets persistent in the nodes.

Update: As a few readers pointed out, after Cassandra 2.0 (my fault to not verify newer versions), an archive of commit log is possible when the node starts,when the commit log is written to disk or even at a specific point in time.
Comparing to Oracle, the concept is almost the same (except that Oracle and LGWR don’t necessarily archive at startup of the instance), but it is also used for “backup” and “recovery” supporting PIT restores.

Hope Cassandra is now more clear to you, as a Oracle DBA.

PS: Thanks isinghora and Ishinder Singh for point me out my error.

OLT – An initial overview of Oracle Linux Test – Part 1

The OLT Toolkit (a.k.a Oracle Linux Test)  is an opensource tool provided by Oracle here. It contains a full set of tools ranging from OLTP/DSS workload tools to kernel test modules like mempressure as well as virtualization tests. It is a fully loaded “on-steroids” set of tests to make sure that your Oracle infrastructure and configurations are perfect.

Please remind while you are reading this, that i’m searching for whatever it looks useful for DBAs, not a complete QA and verification framework to run on every machine.

I’m not going to describe installation process, you will figure out how to install it (README). Just make sure that your system is set with a FQDN (full qualified domain name), otherwise configuration will fail.
Please also review olt-setup.env file before start the installation as it will describe what kind of setup you want to submit to the tests (Single Instance, RAC or XEN).
Installation process will install a few RPMs for you. Installation went fine on Oracle Linux 7.1, as the two questions were the missing package dependencies (easily spot on log files) and not using FQDN.

There is a almost-complete user guide here that mostly describe all you need to run in different configurations. For now with single instance all you need is this:

[oracle@phoenix OLT]$ wget
oracle@phoenix OLT-2.3.0-7]$ ls -ltr
total 36
-rwxr-xr-x. 1 oracle oinstall 12010 May  5  2015 oltInstaller --> Installer itself
-rw-r--r--. 1 oracle oinstall  7108 May  5  2015 olt-setup.env --> Review before install
-rw-r--r--. 1 oracle oinstall  1842 May  5  2015 README
drwxr-xr-x. 3 oracle oinstall  4096 May  5  2015 RPMS
-rw-r--r--. 1 oracle oinstall  5106 Nov 13 22:07 olt-install-log --> Installation log

[root@phoenix OLT-2.3.0-7]# ./oltInstaller

Initial test

Searching for something useful, i’ve decided to modify the file provided ( validate.cmd present in /opt/oracle/oltest/olt-schedule) and enable only one particular test called sv-asm.

[oracle@phoenix olt_tests]$ pwd

# Default command file containing all the testcases

##### Silent Install tests #####
##### End of Silent Install tests #####

##### Housekeeping tests #####


As you can see, only sv-asm (that verify if the oracleasm kernel module is present) runs.

Run/schedule it now by using olt-schedule command

[oracle@phoenix olt_tests]$ olt-schedule -f ./validate_noinstall.cmd -r

Validation is the next step

[oracle@phoenix olt_tests]$ olt-schedule -f ./validate_noinstall.cmd -s
>>> Node : phoenix.localdomain
    -    testsuite "sv-asm" has PASSED

It looks good, but how we really know what happened behind the scenes? Lets check the logs and the result in work directory.

[oracle@phoenix work]$ ls -ltr
total 8
drwxr-xr-x. 4 oracle oinstall   75 Nov 14 00:46 sv-asm_Nov14_15-00-46-24

[oracle@phoenix sv-asm_Nov-14-00-46-24-504885828-10576]$ cd results/
[oracle@phoenix results]$ tail -100f testrun.log
+ '[' x == x1 ']'
++ uname -r
+ KVER=3.8.13-55.1.6.el7uek.x86_64
++ find /lib/modules/3.8.13-55.1.6.el7uek.x86_64/ -name oracleasm.ko
+ asm_module1=/lib/modules/3.8.13-55.1.6.el7uek.x86_64/kernel/drivers/block/oracleasm/oracleasm.ko
++ find /lib/modules/3.8.13-55.1.6.el7uek.x86_64/ -name oracleasm.o
+ asm_module2=
+ '[' x/lib/modules/3.8.13-55.1.6.el7uek.x86_64/kernel/drivers/block/oracleasm/oracleasm.ko '!=' x -o x '!=' x ']'
+ echo 'oracleasm module is present in the running kernel'
oracleasm module is present in the running kernel
+ exit 0

[oracle@phoenix work]$

Now, i’ve picked two tests: ft-aio and ft-dio.
* ft-aio – Verify that Oracle IO layer uses aio system calls – io_submit, io_getevents ­ via strace
* ft-dio – Verify that Oracle opens the database files with the O_DIRECT flag via strace

[oracle@phoenix olt_tests]$ olt-schedule -f validate_noinstall.cmd -s
>>> Node : phoenix.localdomain
    -    testsuite/testcase "ft-aio" is IN PROGRESS
>>> Node : phoenix.localdomain
    -    testsuite "ft-dio" is NOT YET SCHEDULED

As i’ve discovered quickly  that most of the tests (Oracle says all) require that the test of “Silent-installation” – the real installation of an Oracle Home in the server is needed to perform other kind of tests (?!) and the tests on ft-aio and ft-dio failed (at least both stayed in progress for almost an hour). FAQ says it is possible to run the tests from an existing Oracle Home, but the docs are really outdated on how to do it and not applicable on this particular version.

I’m particular interested on this as OLT allows to do a lot of test (list of tests is available here) including stress tests.
I’ve emailed someone behind the project so he can help me run the tests on an already installed Oracle Home.

To be continued…

GPO – Back to writing in Portuguese

Hi folks,

My friend Alex Zaballa (Oracle ACE Director, Oracle OCM) asked if i could contribute to a great Brazilian Oracle community (GPO – Grupo Profissionais Oracle) writing articles again in Portuguese. Since i’ve started all my blog career writing in Portuguese, i’ve decided to take on the challenge.
Portuguese language has more than 260 million speakers, so i hope i can contribute to inject some more know-how on Oracle technologies, mainly on the Oracle Database and Oracle Private Cloud (DBaaS).

Here is my profile on the GPO in the link bellow. Check the bottom of the page for the article written by me:

First article about an introduction to B-Tree internals here.


SLOB LIO against Oracle on AIX different page sizes

1 – Introduction

We all know the possible benefits of using Huge Pages (or Large Pages, depending of the platform). Is is simply a method to have larger page size and that is useful for working with very large memory allocations. As you already know how things flow in Linux Kernel i will just give you a brief how things go in another operating system: AIX
The ideia behind is exactly the same as any other operating system that supports different pages sizes, but AIX (AIX 6.1 and Power6+) added one feature called: Dynamic Variable Page Size Support (VPSS). As you know pages are simply fixed length data blocks in virtual memory, so AIX VMM can dynamically use larger page sizes based on the “application memory workload”. The ideia behind is very good as the feature is transparent to applications, reduces the number hardware address translations done and the effort to implement it is close to nothing.
However, POWER6+ processors only supports mixing 4 KB and 64 KB page sizes, not 16M or other sizes (16G is also available).
To know the different page sizes available for your AIX system:

$ pagesize -a
4096 (4K)
65536 (64K)
16777216 (16M)
17179869184 (16G)

2 – The basics

To exemplify this let’s pick a POWER6 CPU running AIX 6.1 and Oracle (RAC):

$ lsattr -El proc0
frequency   4400000000     Processor Speed       False
smt_enabled true           Processor SMT enabled False
smt_threads 2              Processor SMT threads False
state       enable         Processor state       False
type        PowerPC_POWER6 Processor type        False

This first thing to make sure is what page sizes are currently in use. vmstat will output 2 different page sizes in use: 4K and 64K .

$ vmstat -P all

System configuration: mem=49152MB
pgsz            memory                           page
----- -------------------------- ------------------------------------
           siz      avm      fre    re    pi    po    fr     sr    cy
   4K  3203552  2286288   324060     0     0     0    55    227     0
  64K   586210   605917     1180     0     0     0     0      0     0

Let’s now see what pagesize is used by Oracle. We will use svmon that captures and analysis the virtual memory allocation. The option “-P” will allow us to ask only for PID associated with SMON background process.

$ svmon -P $(ps -elf | egrep ora_smon_${ORACLE_SID} | grep -v egrep | awk '{print $4}') | grep shmat
 14c8d31  7000000d work default shmat/mmap           m   4096     0    4    4096
 147cf12  70000011 work default shmat/mmap           m   4096     0  655    4096
 15a1164  70000042 work default shmat/mmap           m   4096     0 1732    4096
 13648df  70000051 work default shmat/mmap           m   4096     0 1452    4096
 145cd1b  70000018 work default shmat/mmap           m   4096     0  880    4096
 1468717  70000006 work default shmat/mmap           m   4096     0  739    4096
 168ffae  70000038 work default shmat/mmap           m   4096     0 1131    4096
 17323cf  70000048 work default shmat/mmap           m   4096     0  601    4096
 15a1b64  70000039 work default shmat/mmap           m   4096     0 1015    4096
 13862ec  7000004f work default shmat/mmap           m   4096     0  836    4096
 1664798  7000001d work default shmat/mmap           m   4096     0 1713    4096
 115ec5a  7000002d work default shmat/mmap           m   4096     0 1474    4096
 13dbafb  70000058 work default shmat/mmap           m   4096     0 1271    4096
 1221a85  70000052 work default shmat/mmap           m   4096     0 1341    4096
 1753fd9  7000003f work default shmat/mmap           m   4096     0  728    4096

The column “m” shows that Oracle is asking for “medium size” pages (64K) only.
A better way of checking this is using this following options in svmon.
(1306770 is the PID of smon)

$ svmon -P 1306770 -O sortentity=pgsp,unit=auto,pgsz=on
Unit: auto
     Pid Command          Inuse      Pin     Pgsp  Virtual
 1306770 oracle           27,2G    99,4M    7,81G    28,1G

     PageSize                Inuse        Pin       Pgsp    Virtual
     s    4 KB             275,38M         0K      48,7M      60,2M
     m   64 KB               26,9G      99,4M      7,76G      28,0G

SQL> select sum(BYTES)/1024/1024/1024 as SGA_SIZE from V$SGASTAT;


As you can see the value allocated to “Virtual Memory” (column Virtual) for page size of 64K is 28,0G and should be very similar to the SGA allocated to Oracle.
64K page size also reduce TLB miss rate and it is clear that will benefit the performance when using it instead of the typical 4KB pages. Besides this, no configuration is required at all as AIX kernel automatically allocates 64KB pages for shared memory (SGA) regions (and process data and instruction text if requested).
Another thing that you should know and you see in it column “Pin” in svmon command is that pages of 64K (or 4K) are not pinned. This is due to:
– Your Oracle LOCK_SGA parameter is set to FALSE, so that your SGA is not locked into physical memory.
– Is is recommended to not pin pages of regular size (4K or 64K) due complex nature of pinning that can cause serious problems. Pinning SGA to physical memory only with 16M page size (or 16GB?)

3 – The 16M pages question:

Every time you use large pages to map virtual memory, TLB is able to map more virtual memory entries with a much lower TLB miss rate. This is why Large Pages (or Huge Pages on Linux) are a common topic and a best practice on both OLTP or DW environments.
There are several documents that say to forget 16M and just use the regular 64K and let VPSS take care of promoting the page size that is asked for the application (Oracle in this case).So as you image the question is: Is 16M page size a benefit on Logical I/O performance? Let’s SLOB it!!

4 – SLOB for 64K vs 16M page size

To test, i’ve decided to use SLOB workload and measure the impact on CPU and LIO of 64k page size first. SLOB is an very useful utility written by Kevin Closson ( that can really be used in a various number of scenarios including PIO and LIO testing as well as CPU analysis (in the end everything is a CPU problem ;).
The hardware lab includes POWER7 processors and AIX 6.1 (6100-08-02-1316) with a CPU_COUNT=24.

4.1 – The setup

1 – Create a new database called SLOB (or use SLOB create_database_kit).

This step includes a few interesting details about the database, mainly the SGA size of 30G and a db_cache_size set by default to be managed by Oracle. That will be very handy to test the impact of allocating different page sizes.


2 – Setup SQL*Net connectivity – Remember that we are running SLOB inside Linux that connects to a AIX database system.

# Settings for SQL*Net connectivity:

3 – Confirming that Oracle is “asking” AIX kernel 64k pages. Check column “m” that stands for medium pages size (64k).

$ svmon -P $(ps -elf | egrep ora_smon_SLOB | grep -v egrep | awk '{print $4}') | grep shmat
  ac40ac  7000007c work default shmat/mmap           m   4096     0    0    4096
  8adf8a  7000006a work default shmat/mmap           m   4096     0    0    4096
  8fe14f  70000063 work default shmat/mmap           m   4096     0    0    4096
  b152b1  70000047 work default shmat/mmap           m   4096     0    0    4096

Testcases rules:

– Oracle was used (The version that i really need to test)
– Buffer Hit should be always 100% on Instance Efficiency AWR section
– The tests on each page size includes waiting 10 minutes between each page size run
– The following order was used for each testcase:

For 64k page size:

0 – Reboot server
1 – Run SLOB: ./ 20 to populate Oracle buffer cache
2 – Run SLOB: ./ 20 and save AWR data.
3 – Run SLOB: ./ 20 and save AWR data.
4 – Run SLOB: ./ 20 and save AWR data.

For 16M page size:

5 – Reboot server to free continuous 16M memory chuncks
6 – Setup Large Pages (16M) for Oracle in AIX
7 – Run SLOB: ./ 20 to populate Oracle buffer cache
8 – Run SLOB: ./ 20 and save AWR data.
9 – Run SLOB: ./ 20 and save AWR data.
10 – Run SLOB: ./ 20 and save AWR data.

Testcase #1 – LIO in 64k vs 16M page size with small dataset

The first test pretends to compare the use of medium pages of 64K and large pages (16M) and the impact on Logical I/O using a small dataset size (Aprox. 7GB). Please note that the test doesn’t pretend to know the “maximum value” for your Logical I/O, but instead compare 3 “equal” runs of SLOB results in different page size. As this test will run only with 20 active sessions, the CPUs are not totally busy as idle time is also present (CPU_COUNT=24).

Testcase #1 – slob.conf


Testcase #1 – SLOB Dataset (20 schemas)

$ ./ IOPS 20
NOTIFY  : 2015.03.26-16:04:43 :
NOTIFY  : 2015.03.26-16:04:43 : Begin SLOB setup. Checking configuration.
NOTIFY  : 2015.03.26-16:07:32 : SLOB setup complete (169 seconds).

Testcase #1 – Run it (64k vs 16M)

As rule for the runs, i’ve decided to populate the buffer cache to get Buffer Hit of 100%, avoiding interference from Physical I/O. The will happen only after the 2nd run of SLOB as the first run will partly serve as buffer cache warm-up.

$ ./ 20
NOTIFY : 2015.04.01-14:04:39 :
NOTIFY : 2015.04.01-14:04:39 : Conducting SLOB pre-test checks.
NOTIFY : 2015.04.01-14:04:39 : All SLOB sessions will connect to SLOB via SQL*Net
NOTIFY : 2015.04.01-14:10:02 : Terminating background data collectors.
./ line 589: 24771 Killed                  ( iostat -xm 3 > iostat.out 2>&1 )
./ line 590: 24772 Killed                  ( vmstat 3 > vmstat.out 2>&1 )
./ line 590: 24773 Killed                  ( mpstat -P ALL 3 > mpstat.out 2>&1 )
NOTIFY : 2015.04.01-14:10:12 : SLOB test is complete.
# vmo -p -o lgpg_regions=1921 -o lgpg_size=16777216
Setting lgpg_size to 16777216
Setting lgpg_regions to 1921

$ export ORACLE_SGA_PGSZ=16m

$ svmon -P $(ps -elf | egrep "ora_smon_SLOB" | grep -v egrep | awk '{print $4}') | grep shmat
  8f0a4f  7000005c work default shmat/mmap           L     16    16    0      16
  8e0a4e  70000061 work default shmat/mmap           L     16    16    0      16
  bb077b  7000002e work default shmat/mmap           L     16    16    0      16
  ad072d  7000002b work default shmat/mmap           L     16    16    0      16
  b60836  7000002d work default shmat/mmap           L     16    16    0      16
  8d0a4d  7000005e work default shmat/mmap           L     16    16    0      16
  bf097f  7000000a work default shmat/mmap           L     16    16    0      16
  9d065d  70000029 work default shmat/mmap           L     16    16    0      16
  be097e  70000002 work default shmat/mmap           L     16    16    0      16
  bd097d  70000010 work default shmat/mmap           L     16    16    0      16

The column with value L, shows that Large Pages of 16M are actually being used by Oracle.

Testcase #1 – The Results

Instance Efficiency Percentages (Target 100%)
            Buffer Nowait %:  100.00       Redo NoWait %:  100.00
            Buffer  Hit   %:  100.00    In-memory Sort %:  100.00
            Library Hit   %:  101.20        Soft Parse %:   94.83
         Execute to Parse %:   99.98         Latch Hit %:  100.00
Parse CPU to Parse Elapsd %:    0.00     % Non-Parse CPU:  100.00

Testcase #1 - LIO

In average of 3 different runs with different page sizes, 16M page size shows a very small improvement on less than 2% (1,9%).

Testcase #2 – LIO in 64k vs 16M page size with larger SLOB dataset

The same rules applies to this test case, the only difference is the SCALE used in SLOB as it is bigger but still be able to fit in Oracle buffer cache. As every block still comes from Oracle buffer cache the OS needs to know the status of every page allocated by Oracle SGA and make the so called address translation. On this test case SLOB ran with 20 sessions just like on the first test case.

Testcase #2 – The bigger SLOB SCALE run

This will result in about 23G of data with 20 different schemas

$ ./ 20
NOTIFY : 2015.04.01-14:04:39 :
NOTIFY : 2015.04.01-14:04:39 : Conducting SLOB pre-test checks.
NOTIFY : 2015.04.01-14:04:39 : All SLOB sessions will connect to SLOB via SQL*Net
NOTIFY : 2015.04.01-14:10:02 : Terminating background data collectors.
./ line 589: 24771 Killed                  ( iostat -xm 3 > iostat.out 2>&1 )
./ line 590: 24772 Killed                  ( vmstat 3 > vmstat.out 2>&1 )
./ line 590: 24773 Killed                  ( mpstat -P ALL 3 > mpstat.out 2>&1 )
NOTIFY : 2015.04.01-14:10:12 : SLOB test is complete.

Testcase #2 – The Results

Testcase #2 - LIO

In average of 3 different runs, it represents no improvement using 16M page size with a difference of less than 1% between the 2 different page sizes. For this SLOB workload, we can conclude that 64k and 16M page size showed the same results.

Testcase #3 – LIO in 64k vs 16M page size with CPU pressure
The same rules applies to this test case as other two, as the Buffer Hit should be 100%. But this time we run under CPU starvation with 40 concurrent sessions on a CPU_COUNT=24. To make sure that all blocks come from Oracle buffer cache the SLOB SCALE was reduced to 80000.

Also, the overhead to managing large pages tables is indicated by CPU usage (mostly). This increase on kernel mode CPU usage will eventually make suffer your Logical I/O numbers. So CPU is wasted in managing big page tables translations instead of giving the CPU to Oracle to process your workload. The overhead of CPU usage and the “sometimes happen” page faults will lead to a less-than-good Oracle performance. The bottom line here is simple: 16M page size should provide better results by theory under CPU pressure.

Testcase #3 – CPU pressure and starvation

$ ./ 40
NOTIFY : 2015.04.02-16:44:34 :
NOTIFY : 2015.04.02-16:44:34 : Conducting SLOB pre-test checks.
NOTIFY : 2015.04.02-16:44:34 : All SLOB sessions will connect to SLOB via SQL*Net
RUN_TIME == 300
SCALE == 80000
WORK_UNIT == 256

Testcase #3 – The Results


Looks good! A difference on average of 3 runs of more than 11% in favor of 16M page size. This shows that under CPU pressure and possible starvation you will end up with more benefits than the work that is required to setup 16M page size on AIX 6/7.

To make sure that these results were ok to publish, i’ve done numerous SLOB runs with 64k and 16M page size and the results were the same. Benefits between 9% and 12% with CPU under a lot of pressure.


– 16M page size on AIX (and on other OS, probably) will provide you better Logical I/O performance when CPUs are under pressure. The benefits range between 9% and 12% when using 16M page size.
– These results may differ from your conclusions or tests, because as you understand your workload is different and the results will be inevitably be different.

presman – Oracle Resource Manager Monitor

I’ve been working a lot with DBRM lately and the worst thing is feeling that we can’t properly measure our plan directives and how effective our DBRM plans are in real world.
I’ve decided to learn more about Python so i did a try in writing a tool to monitor my DBRM complex implementations and plans.

presman is a small tool, that works with Python >2.6 and cx_Oracle. It was tested on Linux and MacOSX Yosemite (10.10.x). It works on your favorite terminal and no more is needed.

As soon as you install Python (if needed) and cx_Oracle you are ready to go. Please remember that for MacOSx you have to build cx_Oracle. I will provide detailed instructions on how to build it in near future.

Some of the features are:

1) Measure your CPU usage and Session I/O by Consumer Group;
2) Show real time table and bar chart with a configurable refresh time;
3) Export measured data to a CSV to easy manipulate;
4) Configuration is easy and straightforward.

Before run it, check the configuration file config.ini that contains a few options:


– Available options are cpu and session_io. Default is cpu;
– Refresh rate is the rate for screen refresh. Minimum allowed is 3 seconds;
– The connection should be made using an Oracle user that is allowed to query DBA_ and V$ views from DBRM. SYSTEM is recommended.

To run it you have two different options. The second option is to output the measured on chart to a CSV:
– For cpu option the output is the % of CPU Usage by Consumer Group;
– For session_io option the output is the % of Consumed I/O (Small and Large Read and Write I/O) by Consumer Group.

lcarapinha@Luiss-MacBook-Pro:~/$./ output_plan.csv

Here is the first screenshot of presman in action using cpu option. Please note that in the top of the image you have several information regarding your database.
The table shows several information about CPU Usage by Consumer Group, for example the number of yields, active session or requests.
As you can see, it allows you to measure with your defined refresh rate how DBRM is allocating the CPU resource and check if that really matches with your implementation.

presman shot

The next screenshot will show presman in action using session_io option. This option allows us to check I/O consumed by Consumer Group. DBRM is not able to “limit” your I/O but you can measure the distribution of the I/O by Consumer Group and how your DBRM plan can influence that.


If you opt to output the measured values (see what is saved in top of this post) to a file, you can use the CSV to create fancy graphics like this one (with refresh rate 10 seconds):


The same applies to session_io option:


Please make note of the following restrictions on 1.0 version:
– No RAC support (yet)
– Not tested in Windows platform

I will release it soon in a new “page” in the blog as i have one “performance” bug fixed when presman runs for several hours and need to output to file.
For now code is available in github here:

Profiling DB Resource Manager – VKRM process

It seems that VKRM is a deeply unknow background process. I did a little investigation that will help to understand better all mechanism of profiling Oracle (Thank you Frits Hoogland) and a little more of one of the most underestimated feature of Oracle: Resource Manager.

VKRM manages the CPU scheduling for all Oracle processes and includes the CPU scheduling for the Database Resource Manager. Your DBRM active plan (parameter resource_manager_plan) will be subject to VKRM job to ensure that all your plan directives are fulfilled.
VKRM is a special background process, because it just go away when is not needed (at least in 11gR2) and every time your Resource Manager CPU scheduling kicks in, DBRM process will spawn VKRM again. Please note that DBRM is the “main” process for all Resource Manager tasks, VKRM is only for CPU scheduling.

There is no documentation explaining how VKRM works in detail, so what is left for us is to try some profiling and reach some (basic?) conclusions.

The first thing about VKRM is that, you simply can’t control its behavior…except there is an hidden parameter called _vkrm_schedule_interval exists which is basically VKRM schedule interval (surprise, surprise) that is by default set to 10 milliseconds:

SQL> @phidden _vkrm

-------------------------------------------------- --------------------------------------------------
_vkrm_schedule_interval 			   10

This is easily confirmed by strace on the PID corresponding to VKRM background process:

[oracle@baco scripts]$ ps -ef | grep ora_vkrm
oracle    2566     1  0 Nov01 ?        00:00:25 ora_vkrm_bacodb1
oracle    8965  7296  0 01:02 pts/3    00:00:00 grep ora_vkrm

[root@baco scripts]# strace -p 2566 -o ora_vkrm_strace.out
Process 2566 attached - interrupt to quit
^CProcess 2566 detached

The result is a bunch of nanosleep() Linux kernel functions, that suspends the execution of a calling thread until either at least the time specified (10000000 nanoseconds) has elapsed. On easy words, it is holding a sleep for every 10 milliseconds. On a successful sleep, nanosleep() returns 0.

nanosleep({0, 10000000}, 0x7fff271b1160) = 0
nanosleep({0, 10000000}, 0x7fff271b1160) = 0
nanosleep({0, 10000000}, 0x7fff271b1160) = 0

A small change in the _vkrm_schedule_interval to 5000 milliseconds will result in a different argument call for nanosleep() function and on a different period (every 5 seconds).
This will probably change the behavior of VKRM and CPU scheduling, the greater the value, the less precise will be your scheduling. As you can see in strace output it is possible to change _vkrm_schedule_interval while database is running (scope=memory) and it will take immediate effect on your scheduling behavior:

SQL> alter system set "_vkrm_schedule_interval"=5000 scope=memory;
SQL> alter system set "_vkrm_schedule_interval"=6000 scope=memory;
nanosleep({0, 10000000}, 0x7fff271b1160) = 0
nanosleep({0, 10000000}, 0x7fff271b1160) = 0
nanosleep({5, 0}, 0x7fff271b1160)       = 0
nanosleep({5, 0}, 0x7fff271b1160)       = 0
nanosleep({5, 0}, 0x7fff271b1160)       = 0
nanosleep({6, 0}, 0x7fff271b1160)       = 0

Trace files will also reveal your change:

*** 2014-11-02 04:02:43.992
kskvkrmschedintmod: setting VKRM scheduling interval from (6000)ms to [(10)ms (10000)us]
*** 2014-11-02 04:11:53.078
kskvkrmschedintmod: setting VKRM scheduling interval from (10)ms to [(5000)ms (5000000)us]
kskvkrmschedintmod: setting VKRM scheduling interval from (5000)ms to [(10)ms (10000)us]

Another chapter in profiling VKRM process is to use perf on Linux to see if we can see more interesting stuff. Bellow is the result of a perf report against VKRM process. Top 3 are three different kernel mode executed functions: __do_softirq, finish_task_switch and _raw_spin_unlock_irqrestore.
Most of the work is done in kernel mode, with Linux kernel software interrupts (softirq) and scheduler functions (finish_task_switch) allowing the high-precision CPU scheduling made by VKRM.
Another thing worth mention is usermode Oracle function kskvkrmmain representing only 3.03% of all work done by VKRM.

root@baco outputs]# perf record -g -p 2542 -e cpu-clock
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.451 MB (~19697 samples) ]

[oracle@baco outputs]$ perf report
[vdso] with build id 553f611ad979d16f78a66945dca52ba113827329 not found, continuing without symbols
 39.24%  ora_vkrm_bacodb  [kernel.kallsyms]   [k] __do_softirq
                 -- 99.05%-- do_nanosleep
34.31%  ora_vkrm_bacodb  [kernel.kallsyms]   [k] finish_task_switch
14.22%  ora_vkrm_bacodb  [kernel.kallsyms]   [k] _raw_spin_unlock_irqrestore
3.03%  ora_vkrm_bacodb  oracle              [.] kskvkrmmain
            --- kskvkrmmain

1.25%  ora_vkrm_bacodb  oracle              [.] sltrusleep
            --- sltrusleep

Another shot is oradebug to understand what kind of events happen related with VKRM:

SQL> oradebug setospid 2542
Oracle pid: 10, Unix process pid: 2542, image: oracle@baco (VKRM)
SQL>  oradebug unlimit
Statement processed.
SQL> oradebug event 10046 trace name context forever, level 8;
Statement processed.

*** 2014-11-09 14:06:38.559
WAIT #0: nam='latch free' ela= 21980 address=2722482696 <b>number=467</b> tries=0 obj#=-1 tim=6866775549

*** 2014-11-09 14:09:41.598
WAIT #0: nam='latch free' ela= 31774 address=2722482696 <b>number=467</b> tries=0 obj#=-1 tim=7049814301

The only event that is happening on this trace is latch free wait event.It is possible to identify what latch is related with latch free wait event with a simple query (see bellow). The latch is obviously related to Resource Manager CPU scheduling.

SQL> select latch#, name from v$latchname where latch# = 467;

---------- ----------------------------------------------------------------
       467 resmgr:resource group CPU method

This post has no great conclusions, it is just a pure exercise to understand a little more about a deeply unknow Oracle background process.

Resource Manager – CPU allocation math – Part 3

This is the last post of this mini-series regarding CPU allocation in Resource Manager. The idea behind this last post is very simple: Tracing the same test case we’ve used before and analyze trace files. This will let us understand how Oracle instrumentation works when DBRM is active and managing the CPU.
Please note that we are going to trace for only one service, that is perfectly enough for our testing.

Changing our cpu_alloc_burn.sql for tracing using 10046 event with the prefix for our traces ‘DBRM_TRACE’:

alter session set tracefile_identifier='DBRM_TRACE';
alter session set events '10046 trace name context forever, level 12';
select distinct t1.N2 from t1, t2
where t1.N1t2.N2
and t1.N3t2.N1
and t1.N2  t2.N1
and t2.N2 is not null;
[oracle@phoenix resource_manager]$  ./
Starting 20 new executions for S_ADHOC service with tracing...

Now we have 20 new sessions connected to the service name S_ADHOC and consumer group ADHOC_QUERYS. The first thing that we will notice before digging into trace files is the wait event resmgr:cpu quantum:

---------- -------- -------------------------------- ---------- ------------------------------
	22 ACTIVE   ADHOC_QUERYS		     S_ADHOC	resmgr:cpu quantum
	24 ACTIVE   ADHOC_QUERYS		     S_ADHOC	resmgr:cpu quantum
	26 ACTIVE   ADHOC_QUERYS		     S_ADHOC	resmgr:cpu quantum
	28 ACTIVE   ADHOC_QUERYS		     S_ADHOC	resmgr:cpu quantum
	29 ACTIVE   ADHOC_QUERYS		     S_ADHOC	resmgr:cpu quantum
	32 ACTIVE   ADHOC_QUERYS		     S_ADHOC	resmgr:cpu quantum
	34 ACTIVE   ADHOC_QUERYS		     S_ADHOC	resmgr:cpu quantum
	35 ACTIVE   ADHOC_QUERYS		     S_ADHOC	resmgr:cpu quantum
	38 ACTIVE   ADHOC_QUERYS		     S_ADHOC	resmgr:cpu quantum
       134 ACTIVE   ADHOC_QUERYS		     S_ADHOC	resmgr:cpu quantum
       136 ACTIVE   ADHOC_QUERYS		     S_ADHOC	resmgr:cpu quantum

---------- -------- -------------------------------- ---------- ------------------------------
       143 ACTIVE   ADHOC_QUERYS		     S_ADHOC	resmgr:cpu quantum
       148 ACTIVE   ADHOC_QUERYS		     S_ADHOC	resmgr:cpu quantum
       150 ACTIVE   ADHOC_QUERYS		     S_ADHOC	resmgr:cpu quantum
       151 ACTIVE   ADHOC_QUERYS		     S_ADHOC	resmgr:cpu quantum
       152 ACTIVE   ADHOC_QUERYS		     S_ADHOC	resmgr:cpu quantum
       156 ACTIVE   ADHOC_QUERYS		     S_ADHOC	resmgr:cpu quantum
       157 ACTIVE   ADHOC_QUERYS		     S_ADHOC	resmgr:cpu quantum
       159 ACTIVE   ADHOC_QUERYS		     S_ADHOC	resmgr:cpu quantum
       162 ACTIVE   ADHOC_QUERYS		     S_ADHOC	resmgr:cpu quantum

This wait event basically states that a session exists and is waiting for the allocation of a quantum of CPU. It is basically DBRM doing his job, throttling CPU allocation until it is according the plan directives that we have defined. It is then obvious if you want to reduce the persistence of this wait event (AWR will help you checking that), you have to increase your CPU allocation (your plan directives) to avoid waiting so much on it.
Another (and the best way to do it, since it gives you a lot of information) is to check the trace file that we’ve generated before:

*** 2014-06-13 17:06:39.844
WAIT #140096016814088: nam='resmgr:cpu quantum' ela= 807849 location=2 consumer group id=88620  =0 obj#=88623 tim=1402675599844408
WAIT #140096016814088: nam='Disk file operations I/O' ela= 5589 FileOperation=2 fileno=0 filetype=15 obj#=88623 tim=1402675599854817

*** 2014-06-13 17:06:40.778
WAIT #140096016814088: nam='resmgr:cpu quantum' ela= 821271 location=3 consumer group id=88620  =0 obj#=88623 tim=1402675600778500

*** 2014-06-13 17:06:41.736
WAIT #140096016814088: nam='resmgr:cpu quantum' ela= 917063 location=3 consumer group id=88620  =0 obj#=88623 tim=1402675601736754

*** 2014-06-13 17:06:42.605
WAIT #140096016814088: nam='resmgr:cpu quantum' ela= 859088 location=3 consumer group id=88620  =0 obj#=88623 tim=1402675602605611

*** 2014-06-13 17:06:43.612
WAIT #140096016814088: nam='resmgr:cpu quantum' ela= 905964 location=3 consumer group id=88620  =0 obj#=88623 tim=1402675603612339
WAIT #140096016814088: nam='direct path read' ela= 1332 file number=4 first dba=16130 block cnt=62 obj#=88623 tim=1402675603682243

Some interesting info here:

ela – Amount time in microseconds that the session spent waiting for a CPU quantum allocation. If we sum everything (all the microseconds) we will have the total time of the session that is “out of CPU”;
consumer group id– The consumer group id, maps with DBA_RSRC_CONSUMER_GROUPS view;
obj# – The object that is part of the wait itself. On our case, it is a table. Maps directly with view DBA_OBJECTS.

Of course if we use tkprof to help us, we can have a more broader picture showing that one of our 20 sessions waited 391,34 seconds during his lifetime and waited for a maximum of 1,10 seconds for a CPU quantum allocation.

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                       2        0.00          0.00
  SQL*Net message from client                     1        0.00          0.00
  cursor: pin S wait on X                         1        0.14          0.14
  resmgr:cpu quantum                            511        1.10        391.34
  Disk file operations I/O                        4        0.00          0.01
  direct path read                              105        0.30          0.96


– Use math to define correctly your CPU allocation in DBRM plans and be careful with over and under allocations as they impact your database performance.
– Always try to test your DBRM implementation before go live. Sometimes complex plans can be tricky to test and if you can’t measure the impact you can be in trouble. Trial and error is not a problem, when you are not live.
– Understand how DBRM works! DBRM is a complex beast and i hope that this mini-series can help on that.