RPD Deployment in OBIEE 12c Server

RPD deployment on an OBIEE 12c server is quite different from 11g. In 12c, Enterprise Manager (EM) no longer has any option for deploying the RPD. Instead, we use the “uploadrpd” command of the data-model-cmd scripting utility to upload the repository to the Oracle BI Server.

Steps to follow:

  1. Open the command prompt, type “cd \” to change to the root directory, and press Enter.
  2. Type “cd /user_projects/domains/bi/bitools/bin” and press Enter.
  3. Run “ls” to find the utility: data-model-cmd.sh on UNIX or data-model-cmd.cmd on Windows.
  4. Run the data-model-cmd utility with the uploadrpd parameters below:

Syntax:
uploadrpd -I <RPDname>.rpd -W <RPDpassword> -U <username> -P <password> -SI <service_instance>

Example:
uploadrpd -I BI1_SAMPLE.rpd -W Ora234 -U weblogic -P weblogic17 -SI ssi

If the operation completes successfully, you will see the following message:

“Operation Successful. RPD upload completed successfully.”

These parameters are described below. Along with the parameters used above, you can also use S, N, SSL, and H.

  • I specifies the name of the repository that you want to upload.
  • W is the repository’s password. If you do not supply the password, then you will be prompted for the password when the command is run. For security purposes, Oracle recommends that you include a password in the command only if you are using automated scripting to run the command.
  • SI specifies the name of the service instance.
  • U specifies a valid user’s name to be used for Oracle BI EE authentication.
  • P specifies the password corresponding to the user’s name that you specified for U. If you do not supply the password, then you will be prompted for the password when the command is run. For security purposes, Oracle recommends that you include a password in the command only if you are using automated scripting to run the command.
  • S specifies the Oracle BI EE host name. Only include this option when you are running the command from a client installation.
  • N specifies the Oracle BI EE port number. Only include this option when you are running the command from a client installation.
  • SSL specifies to use SSL to connect to the WebLogic Server to run the command. Only include this option when you are running the command from a client installation.
  • H displays the usage information and exits the command.

Example: data-model-cmd.sh uploadrpd -I <RepositoryName.rpd> -SI ssi -U weblogic -S server.example.com -N 8003 -SSL
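For automated deployments (the scenario mentioned above for the W and P options), the command can be wrapped in a small script. Below is a minimal Python sketch; the bitools path, repository name, service instance, and user are placeholders that you must replace with your own values.

import subprocess

# Minimal sketch of scripting the uploadrpd command.
# All paths and values below are placeholders -- adjust for your environment.
bitools_bin = "/path/to/user_projects/domains/bi/bitools/bin"   # assumed location

cmd = [
    f"{bitools_bin}/data-model-cmd.sh",
    "uploadrpd",
    "-I", "BI1_SAMPLE.rpd",    # repository file to upload
    "-SI", "ssi",              # service instance name
    "-U", "weblogic",          # BI EE user name
]

# -W and -P are omitted so the utility prompts for the passwords itself,
# which Oracle recommends unless the script must run fully unattended.
subprocess.run(cmd, check=True)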

Types of Databases in Today’s World

The capture and analysis of data is typically performed by database management systems, otherwise known as DBMSs. These systems are usually queried using SQL. SQL (pronounced “ess-que-el”) stands for Structured Query Language and is used to communicate with a database. According to ANSI (the American National Standards Institute), it is the standard language for relational database management systems. The most common of all the different types of databases is the relational database.

Let’s now look at the different types of databases that exist in today’s world and how to use them in our work.

Types of Databases

  • Relational Databases: A relational database is a collection of data items organized as a set of formally described tables from which data can be accessed or reassembled in many different ways without having to reorganize the database tables. The relational database was invented by E. F. Codd at IBM in 1970. Examples: PostgreSQL, SQLite, MySQL, Oracle, Sybase.
  • NoSQL / Non-relational Databases: A NoSQL (originally referring to “non-SQL”, “non-relational” or “not only SQL”) database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. NoSQL databases are increasingly used in big data and real-time web applications. NoSQL systems are also sometimes called “Not only SQL” to emphasize that they may support SQL-like query languages. Motivations for this approach include simplicity of design, simpler “horizontal” scaling to clusters of machines (which is a problem for relational databases), and finer control over availability. The data structures used by NoSQL databases (e.g., key-value, columnar, graph, or document) are different from those used by default in relational databases, making some operations faster in NoSQL.

Many NoSQL stores compromise consistency (in the sense of the CAP theorem) in favor of availability and partition tolerance. Some reasons that block adoption of NoSQL stores include the use of low-level query languages, the lack of standardized interfaces, and the huge investments already made in existing SQL systems. Also, most NoSQL stores lack true ACID transactions, or only support transactions in certain circumstances and at certain levels (e.g., the document level). Finally, RDBMSs are usually much simpler to use because they have GUIs, whereas many NoSQL solutions use a command-line interface.

 

  • NewSQL Databases: NewSQL describes a newer group of databases that share much of the functionality of traditional SQL relational databases while offering some of the benefits of NoSQL technologies: the ACID transactional consistency of traditional operational databases, the familiarity and interactivity of SQL, and the scalability and speed of NoSQL.

 

An example to understand the databases described above:

 

If we use a bank example, each aspect of a customer’s relationship with a bank is stored as separate row items in separate tables.  So the customer’s master details are in one table, the account details are in another table, the loan details in yet another, investments in a different table, and so on.  All these tables are linked to each other through the use of relations such as primary keys and foreign keys.

Non-relational databases, specifically a database’s key-value stores or key-value pairs, are radically different from this model.  Key-value pairs allow you to store several related items in one “row” of data in the same table.  We place the word “row” in quotes because a row here is not really the same thing as the row of a relational table.  For instance, in a non-relational table for the same bank, each row would contain the customer’s details as well as their account, loan and investment details.  All data relating to one customer would be conveniently stored together as one record.
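To make this concrete, here is a minimal Python sketch of the same bank example. Plain dictionaries and lists stand in for real database tables and a key-value store, and all field names are made up for illustration.

# Relational style: the customer's data is split across tables
# linked by a key (customer_id acts as the primary/foreign key).
customers = [{"customer_id": 1, "name": "Asha"}]
accounts  = [{"account_id": 101, "customer_id": 1, "balance": 5000}]
loans     = [{"loan_id": 201, "customer_id": 1, "amount": 25000}]

def customer_profile(cid):
    # Reassemble the profile by "joining" the tables on customer_id.
    return {
        "customer": next(c for c in customers if c["customer_id"] == cid),
        "accounts": [a for a in accounts if a["customer_id"] == cid],
        "loans":    [l for l in loans if l["customer_id"] == cid],
    }

# Key-value / document style: everything about one customer lives in a
# single record stored under one key -- no join is needed to read it.
kv_store = {
    "customer:1": {
        "name": "Asha",
        "accounts": [{"account_id": 101, "balance": 5000}],
        "loans":    [{"loan_id": 201, "amount": 25000}],
    }
}

print(customer_profile(1))
print(kv_store["customer:1"])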

In the relational model, there is a built-in and foolproof method of ensuring and enforcing business logic and rules at the database layer, for instance that a withdrawal is charged to the correct bank account, through primary keys and foreign keys. In key-value stores, this responsibility falls squarely on the application logic, and many people are very uncomfortable leaving this crucial responsibility to the application alone. This is one reason why relational databases will continue to be used.

However, when it comes to web-based applications that use databases, rigorously enforcing business logic is often not a top priority. The highest priority is the ability to service large numbers of user requests, which are typically read-only queries. For example, on a site like eBay, the majority of users simply browse and look through posted items (read-only operations). Only a fraction of these users actually place bids or reserve the items (read-write operations). And remember, we are talking about millions, sometimes billions, of page views per day. The eBay site administrators are more interested in quick response time to ensure faster page loading for the site’s users than in the traditional priorities of enforcing business rules or ensuring a balance between reads and writes.

 

Types and examples of NoSQL databases

There have been various approaches to classify NoSQL databases, each with different categories and subcategories, some of which overlap. What follows is a basic classification by data model, with examples:

 

  1. Key-Value Pair (KVP) Databases: Key-value (KV) stores use the associative array (also known as a map or dictionary) as their fundamental data model. In this model, data is represented as a collection of key-value pairs, such that each possible key appears at most once in the collection. The key-value model can be extended to a discretely ordered model that maintains keys in lexicographic order; this extension is computationally powerful, in that ranges of keys can be retrieved efficiently. Examples: InfinityDB, Oracle NoSQL Database, and dbm.
  2. Document Databases: Although each document-oriented database implementation differs in its details, in general they all assume that documents encapsulate and encode data (or information) in some standard format or encoding. Encodings in use include XML and JSON. Documents are addressed in the database via a unique key that represents that document. Another defining characteristic of a document-oriented database is that, in addition to the key lookup performed by a key-value store, the database offers an API or query language that retrieves documents based on their contents (a small sketch contrasting the two lookup styles follows this list).

Different implementations offer different ways of organizing and/or grouping documents:

  • Collections
  • Tags
  • Non-visible metadata
  • Directory hierarchies

In short, document databases store documents or web pages, e.g., MongoDB and Apache CouchDB.

  3. Columnar Databases: Store data in columns, e.g., HBase, SAP HANA.
  4. Graph Databases: This kind of database is designed for data whose relations are well represented as a graph consisting of elements interconnected by a finite number of relations. The data could be social relations, public transport links, road maps, or network topologies. It stores nodes and relationships, e.g., Neo4j, FlockDB.
  5. Spatial Databases: For map and navigational data, e.g., OpenGeo, PostGIS, ArcSDE.
  6. In-Memory Databases (IMDB): All data is kept in memory, for real-time applications.
  7. Cloud Databases: Databases that run in a cloud, delivered via IaaS, a VM image, or DaaS.
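The difference between a pure key lookup and a content-based query can be sketched in a few lines of Python; plain dictionaries stand in for a real store, and the keys and fields are illustrative only.

# A key-value store only supports lookup by key; the value is opaque to the store.
kv_store = {"doc:1": '{"type": "invoice", "total": 250}'}
value = kv_store["doc:1"]

# A document store also understands the documents' contents,
# so it can answer queries such as "all invoice documents".
doc_store = {
    "doc:1": {"type": "invoice", "total": 250},
    "doc:2": {"type": "invoice", "total": 80},
    "doc:3": {"type": "receipt", "total": 40},
}

def find(store, **criteria):
    # Naive content-based query: return documents matching all criteria.
    return [d for d in store.values()
            if all(d.get(k) == v for k, v in criteria.items())]

print(find(doc_store, type="invoice"))   # both invoice documents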
[Image: types of databases. Image courtesy: TheWindowsClub.com]
Advantages of NoSQL databases:
  • Process data faster
  • Have simple data models to understand and execute
  • Manage unstructured text

 

 

 

Hadoop Ecosystem & Architecture (Tutorial Day 4)

As we discussed in the last blog, Big Data is not just Hadoop. Similarly, Hadoop is not a single monolithic thing, but an ecosystem consisting of various Hadoop components and an amalgamation of different technologies, such as HDFS (Hadoop Distributed File System), MapReduce, Pig, Hive, HBase, Flume, and so on.

Hadoop Ecosystem

The Hadoop platform consists of many tools but two key services are: Hadoop Distributed File System (HDFS) and the high-performance parallel data processing engine called Hadoop MapReduce.

Vendors that provide Hadoop-based platforms include Cloudera, Hortonworks, MapR, Greenplum, IBM, and Amazon.

The combination of HDFS and MapReduce provides a software framework for processing vast amounts of data in parallel on large clusters of commodity hardware (potentially scaling to thousands of nodes) in a reliable, fault-tolerant manner. We can combine various Hadoop ecosystem tools to serve business requirements in a cost-effective fashion.
The image below describes the Hadoop ecosystem.

[Image: Hadoop ecosystem diagram]

Within the Hadoop ecosystem, prominence is given to the Hadoop core components (Hadoop Common, YARN, HDFS, and MapReduce), which we will discuss first.

1) Hadoop Common refers to the collection of common utilities, libraries, necessary Java files, and scripts that support the other Hadoop modules. It is an essential module of the Apache Hadoop framework.

2) Hadoop YARN is described as a cluster platform or framework that helps to manage resources and schedule tasks. It is a great enabler for dynamic resource utilization on the Hadoop framework, as users can run various Hadoop applications without having to worry about increasing workloads.

3) HDFS is a distributed file system that runs on standard or low-end hardware. Developed as part of Apache Hadoop, HDFS works like a standard distributed file system but provides better data throughput and access through the MapReduce model, high fault tolerance, and native support for large data sets.

HDFS comprises three important components: the NameNode, DataNodes, and the Secondary NameNode. HDFS operates on a master-slave architecture model in which the NameNode acts as the master node, keeping track of the storage cluster, and the DataNodes act as slave nodes spread across the various systems of a Hadoop cluster.

It provides data reliability by replicating each data instance as three copies: two in one rack and one in another. These copies can replace a failed copy in the event of failure.

The default replication count is 3:
• 1st replica on the local rack
• 2nd replica on the local rack but on a different machine
• 3rd replica on a different rack

The HDFS architecture consists of clusters, each of which is accessed through a single NameNode software tool installed on a separate machine that monitors and manages that cluster’s file system and user access mechanism. The other machines run one instance of the DataNode to manage cluster storage.
Because HDFS is written in Java, it has native support for Java application programming interfaces (APIs) for application integration and accessibility. It can also be accessed through standard web browsers.
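As a quick illustration of that web access, here is a minimal Python sketch that lists an HDFS directory over the WebHDFS REST interface using the requests library. The host name, port (9870 is the usual NameNode HTTP port on Hadoop 3.x; older releases use 50070), and path are placeholders for your own cluster.

import requests

# WebHDFS REST call: list the contents of an HDFS directory.
# Host, port, and path below are placeholders -- adjust for your cluster.
namenode = "http://namenode.example.com:9870"
path = "/user/demo"

resp = requests.get(f"{namenode}/webhdfs/v1{path}", params={"op": "LISTSTATUS"})
resp.raise_for_status()

for entry in resp.json()["FileStatuses"]["FileStatus"]:
    # Each entry describes one file or directory (name, type, size, ...).
    print(entry["pathSuffix"], entry["type"], entry["length"])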

[Image: HDFS architecture]

Namenode

The NameNode is the machine that stores metadata such as the names of the DataNodes, block locations or paths, replica block paths, and so on. The system running the NameNode acts as the master server, and there can be only one active NameNode. If you want, you can also configure a Secondary NameNode; despite the name, it is not a second active NameNode but a helper that periodically checkpoints the NameNode’s metadata. The NameNode does the following tasks:

  • Manages the file system namespace.
  • Regulates client’s access to files.
  • It also executes file system operations such as renaming, closing, and opening files and directories.

Datanode

A DataNode stores and manages the data stored in HDFS. In a functional filesystem we can have more than one DataNode, with data blocks replicated across them. A DataNode connects to the NameNode, retrying until that service comes up, and then responds to requests from the NameNode for filesystem operations.

Client applications can talk directly to a DataNode once the NameNode has provided the location of the data. Similarly, MapReduce operations delegated to a TaskTracker instance near a DataNode can talk directly to the DataNode to access the files. The following tasks are performed here:

  • Datanodes perform read-write operations on the file systems, as per client request.
  • They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.
  • DataNode instances can talk to each other, which is what they do when they are replicating data.

Block

A Hadoop block is a file on the underlying filesystem. Since the underlying filesystem stores files as blocks, one Hadoop block may consist of many blocks in the underlying file system. Blocks are large. In other words, the minimum amount of data that HDFS can read or write is called a block. The default block size is 64 MB in older Hadoop releases (128 MB in Hadoop 2.x and later), and it can be increased as needed by changing the HDFS configuration. Most systems run with block sizes of 128 MB or larger.
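A quick sketch of what the block size means in practice (pure Python, with illustrative numbers only):

import math

# How many HDFS blocks does a file occupy?
block_size = 128 * 1024 * 1024          # 128 MB block size (assumed configuration)
file_size  = 1 * 1024 * 1024 * 1024     # a 1 GB file

blocks = math.ceil(file_size / block_size)
print(blocks)  # 8 blocks here; in general, the last block may be only partially filled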

4) MapReduce is a programming model introduced by Google. It breaks a big data-processing job down into smaller tasks and is responsible for analyzing large data sets in parallel before reducing them to find the results. It is highly scalable and has several implementations in multiple programming languages, such as Java, C#, and C++.

MapReduce executes in two stages (a small word-count sketch follows the list):

  1. Map: The map or mapper’s job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
  2. Reduce: This stage is a combination of Shuffle and Reduce. The reducer’s job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.
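A minimal word-count sketch in plain Python illustrates the same map -> shuffle -> reduce flow. This simulates all stages in one process; a real job would run the mapper and reducer on the cluster, for example via Hadoop Streaming.

from collections import defaultdict

lines = ["big data is not just hadoop", "hadoop is an ecosystem"]

# Map: emit (key, value) pairs -- here, (word, 1) for every word in every line.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group all values by key (the framework does this between the stages).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: combine the grouped values for each key into the final result.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)   # e.g. {'hadoop': 2, 'is': 2, ...}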

The main advantage of the MapReduce framework is its fault tolerance: periodic reports of status and completed work are expected from each node in the cluster. If the master node notices that a node has been silent for a longer interval than expected, it reassigns that node’s frozen or delayed tasks to another node.

As discussed above, there are several other Hadoop components that form an integral part of the Hadoop ecosystem, making Hadoop faster or adding novel features and functionality. To learn more about some of the prominent Hadoop components, please read my next blog.

To learn more on MapReduce Algorithm and how it works click here.

Picture/content credit: Google and specific sources where mentioned.

Big Data Architecture & its Challenges (Tutorial Day 2)

Big Data, as the term describes, is data that is too large to process using traditional methods. The problem originated with companies that had to query very large, distributed, semi-structured or structured data. Google developed MapReduce to support distributed computing on large data sets across computer clusters. As discussed in the earlier post, a few examples of Big Data are:

  • Petabytes of data
  • Billions of records
  • Distributed data
  • Flat files (which cannot be kept in a relational DB)
  • Semi-structured data such as log files
  • Video messages

Applications that produce or generate Big-data can be:

  • Transactional/operational (CRM, ERP, Sales, HR)
  • Analytics (IT logs, call centre)

A Big Data architecture is a set of components joined to each other as shown in the image below. Hadoop sits in the middle tier of this structure, but it is not a mandatory requirement.

[Image: Big Data architecture components]

We will discuss the components further in the next tutorial blogs.

Bottlenecks with Big Data are:

  • Storage
  • Transfer
  • Sharing
  • Analysis
  • Processing
  • Visualization
  • Security

Big Data is not just about size:
– It finds insights in complex, noisy, heterogeneous, longitudinal, and voluminous data
– It aims to answer questions that were previously unanswered

In the existing traditional approach, we use a data warehouse to store data (OLTP/OLAP) in a structured format, process it, do data mining, and build reports for further high-level analysis. This approach works fine for applications that process smaller volumes of data, volumes that can be accommodated by standard database servers or handled up to the limit of the processor that is processing the data.

But when it comes to dealing with huge amounts of scalable data, processing it with this traditional approach becomes a problem. Transactional Big Data projects cannot use Hadoop, as it is not real-time.

For transactional systems that do not need a database transaction to have ACID properties (Atomicity, Consistency, Isolation,Durability), NoSQL databases can be used, though there are constraints such as restricting transactions to a single data item.

Big Data transactional systems that need the ACID properties of an SQL database have fewer options.

This is where distributed systems come into the Big Data picture, most of all through MapReduce technology. For example, one machine with 4 I/O channels, each running at 100 MB/s, can read 1 terabyte of data in roughly 42 minutes (1,000,000 MB / 400 MB/s ≈ 2,500 seconds).
But with a distributed system of 100 such machines, each with 4 I/O channels at 100 MB/s, the same data can be read in roughly 25 seconds.
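The arithmetic behind those numbers, as a quick Python sketch:

# Time to scan 1 TB at an aggregate speed of 4 x 100 MB/s per machine.
data_mb = 1_000_000                  # 1 TB expressed in MB
per_machine_mb_s = 4 * 100           # 4 I/O channels at 100 MB/s each

one_machine = data_mb / per_machine_mb_s          # ~2500 s (~42 minutes)
hundred_machines = one_machine / 100              # ~25 s with 100 machines in parallel

print(one_machine / 60, hundred_machines)         # 41.7 minutes, 25.0 seconds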

To adopt a distributed system, the MapReduce (MR) algorithm is used. This algorithm divides the task into small parts, assigns them to many computers (a cluster), and collects the results from them, which, when integrated, form the output data set.

Image courtesy: Google

Using the above approach, Doug Cutting and his team developed an open-source project called Hadoop.

To proceed further and understand Hadoop, its components, and its architecture, please read my next blog.

What is Big Data? Is it Only Hadoop? (Tutorial Day 1)

Big Data, the new buzzword in today’s technology, is gaining importance due to its high rewards. A systematic and focused approach toward the adoption of Big Data allows one to derive maximum value and utilize the power of Big Data.

It is essentially a new framework or system for gaining insight into the different existing forms of data, increasing the power of researchers and analysts to get more out of existing systems.

As BG Univ says, “Big data is about the application of new tools to do MORE analytic on MORE data for More people.”

The lifecycle of data can be defined as:

 

People confuse Big Data and Hadoop as two similar things. But no, Big Data is not only Hadoop.

Big Data is not a tool or a single technique. It is actually a platform or framework with various components, like data warehouses (providing OLAP/historical data), real-time data systems, and Hadoop (providing insight into structured, semi-structured, or unstructured data).

Examples of Big Data include traffic data, flight data, search engine data, etc.

Thus Big Data involves huge volume, high velocity, and an extensible variety of data. The data in it will be of three types:

a) Structured data: Relational data.
b) Semi Structured data: XML data.
c) Unstructured data: Word, PDF, Text, Media Logs.

Big Data can be characterized by the 3 V’s:

1) Velocity -> batch-processed data, real-time data
2) Variety -> structured, semi-structured, unstructured, and polymorphic data
3) Volume -> terabytes to petabytes

Big Data puts existing traditional systems in trouble for many reasons: as data grows, its complexity, security requirements, maintenance, and processing time also increase. This is where distributed processing systems come into the picture, using multiple systems and disks for parallel processing.

There are various tools and technologies in the market from different vendors, including IBM, Microsoft, and others, to handle Big Data. A few of them are:

1) NoSQL Big Data systems are designed to provide operational capabilities for real-time, interactive workloads where data is primarily captured and stored. They allow massive computations to be run inexpensively and efficiently, which makes operational Big Data workloads much easier to manage, cheaper, and faster to implement. MongoDB is one example (see the sketch after this list).
2) MPP & MapReduce provide analytical capabilities for complex analysis involving large amounts of data. Based on them we have Hadoop, Hive, Pig, and Impala.
3) Storage (HDFS, i.e., the Hadoop Distributed File System)
4) Servers (Google App Engine)
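As an illustration of the operational, document-oriented style, here is a minimal MongoDB sketch using the pymongo driver. The host, database, collection, and field names are placeholders, and it assumes a MongoDB server is reachable at the given address.

from pymongo import MongoClient

# Connect to a MongoDB instance (placeholder host/port -- adjust for your setup).
client = MongoClient("mongodb://localhost:27017/")
db = client["demo_db"]

# Store one operational event as a single document -- no schema to define up front.
db.events.insert_one({"user": "u42", "action": "page_view", "item": "laptop"})

# Query documents by their contents.
for doc in db.events.find({"action": "page_view"}):
    print(doc)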
There are major challenges with Big Data.

Read the Day 2 tutorial to understand further, and bookmark this page for future reference.

What are the differences between Hive, Impala & Pig?

Comparing Impala to Hive & Pig

Similarities:
  • Queries expressed in high-level languages
  • Alternatives to writing map-reduce code
  • Used to analyze data stored on Hadoop cluster

Differences:

Impala  

It was created based on Google’s Dremel paper.
1) It is an interactive, SQL-like query engine that runs on top of the Hadoop Distributed File System (HDFS) (a small connection sketch follows this list).
2) It is an open-source massively parallel processing (MPP) query engine that runs on top of clustered systems like Apache Hadoop.
3) MPP-style parallel databases have a relational model and are more suitable for processing structured and semi-structured data. Due to its architectural advantages, Impala does not involve the overheads of MapReduce jobs (job setup and creation, slot assignment, split creation, map generation, etc.), and hence enables low latency.
4) It offers lower latency / processing time for queries at the cost of less scalability and less stability.
5) Impala supports high-performance UDF (User Defined Function) written in C++, as well as reusing some Java-based Hive UDFs.
6) Impala does not return column overflows as NULL, so that customers can distinguish between NULL data and overflow conditions similar to how they do so with traditional database systems.
7) Impala does not store or interpret timestamps using the local timezone, to avoid undesired results from unexpected time zone issues. Timestamps are stored and interpreted relative to UTC.
8) Impala utilizes the Apache Sentry authorization framework for Security, which provides fine-grained role-based access control to protect data against unauthorized access or tampering.
9) It can query data stored in HDFS or HBase tables
10) It uses a subset of SQL-92 and does not support stored procedures.
11) The Impala TIMESTAMP type can represent dates ranging from 1400-01-01 to 9999-12-31.  
12) With Impala, you can query the following file formats: Parquet, Avro, RCFile, SequenceFile, and unstructured text.
13) Impala shares the metastore with Hive.
14) Impala can answer queries in milliseconds when running under low load, and it is one of the valid choices when no parallel batch SQL processing is being executed.
15) Impala is an MPP-like engine, so each query you execute on it will start an executor on each and every node of your cluster. This delivers the best performance for a single query running on the cluster, but total throughput degrades heavily under high concurrency. In such systems you should limit the number of parallel queries to a fairly low value of around 10.
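For completeness, here is a minimal, hedged sketch of querying Impala from Python using the impyla client. The host and table name are placeholders for your environment; 21050 is the usual impalad HiveServer2 port.

from impala.dbapi import connect

# Connect to an impalad daemon (placeholder host; 21050 is the default port).
conn = connect(host="impala-host.example.com", port=21050)
cur = conn.cursor()

# Run an interactive, low-latency query against data stored in HDFS/HBase.
cur.execute("SELECT item, COUNT(*) FROM web_events GROUP BY item LIMIT 10")
for row in cur.fetchall():
    print(row)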

Despite being highly used, it still has cons:

1) Impala can’t handle complex data types (Array, Map, or Struct).

2) Impala is not fault tolerant. For example, if you run a query in Impala and the query fails, you will have to start the query all over again.
3) It does not support parameters in scripts.

4) Impala does not currently support many HiveQL statements, such as ANALYZE TABLE (the Impala equivalent is COMPUTE STATS), DESCRIBE COLUMN, DESCRIBE DATABASE, EXPORT TABLE, IMPORT TABLE, and many more.
5) Impala does not implicitly cast between string and numeric or Boolean types. Always use CAST() for these conversions.
6) Impala does perform implicit casts among the numeric types, when going from a smaller or less precise type to a larger or more precise one. For example, Impala will implicitly convert a SMALLINT to a BIGINT or FLOAT, but to convert from DOUBLE to FLOAT or INT to TINYINT requires a call to CAST() in the query.
7) Impala does perform implicit casts from string to timestamp. Impala has a restricted set of literal formats for the TIMESTAMP data type and the from_unixtime() format string.
8) Impala is not currently supported by YARN.
9) Impala is not the best choice if there is batch execution and parallel SQL execution.

Hive 

It is a component of the Hortonworks Data Platform (HDP).
1) Hive provides a SQL-like interface to data stored in Hadoop clusters.
2) It translates SQL queries into MapReduce/Tez/Spark jobs and executes them on the cluster to implement batch-based processing. Hence it is best suited for ETL and long-running queries (a small connection sketch follows this list).
3) It is used by data analysts for completely structured data.
4) It supports complex data types such as arrays and structs, custom file formats, the DATE data type, and XML and JSON functions.
5) It is fault tolerant. For example, if you run a query in Hive on MapReduce and one of your DataNodes goes down while the query is running, you still get the output, because the MapReduce jobs are restarted on other nodes.
6) It supports parameters, which come in handy while writing Hive scripts.
7) It is supported by YARN, so you can manage your resources for MapReduce or any other applications supported by YARN.
8) Hive runs on top of the MapReduce/Tez framework, which requests resources based on the amount of data to process. For large clusters this gives much better concurrency for “small” queries, as each of them requests only a small amount of execution resources, which results in more queries running in parallel.
9) The Hive component included in CDH 5.1 and higher includes Sentry-enabled security with GRANT, REVOKE, and CREATE/DROP ROLE statements. Earlier Hive releases had a privilege system with GRANT and REVOKE statements that was primarily intended to prevent accidental deletion of data, rather than being a security mechanism to protect against malicious users.
10) It uses a subset of SQL-92 and does not support stored procedures.
11) The Hive TIMESTAMP type can represent dates ranging from 0000-01-01 to 9999-12-31.
12) Hive supports several file formats: Text File, SequenceFile, RCFile, Avro, ORC, Parquet, and custom INPUTFORMAT/OUTPUTFORMAT.
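A minimal sketch of submitting a batch-style query to HiveServer2 from Python via the PyHive library. The host, user, and table are placeholders; 10000 is the usual HiveServer2 port, and the query is compiled into MapReduce/Tez jobs on the cluster.

from pyhive import hive

# Connect to HiveServer2 (placeholder host; 10000 is the default port).
conn = hive.connect(host="hive-host.example.com", port=10000, username="analyst")
cur = conn.cursor()

# A long-running, ETL-style aggregation; Hive turns it into cluster jobs.
cur.execute("SELECT item, COUNT(*) AS views FROM web_events GROUP BY item")
for row in cur.fetchall():
    print(row)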

But the cons are significant as well:

1) Since Hive uses MapReduce to access Hadoop clusters, query overhead results in high latency.
2) Lower performance, especially for table joins.
3) No query optimizer.

 Pig

Pig is a scripting language with a focus on data flows. It has two parts:
a) A language for processing data, called Pig Latin.

b) A set of evaluation mechanisms for evaluating a Pig Latin program. Current evaluation mechanisms include (a) local evaluation in a single JVM and (b) evaluation by translation into one or more MapReduce jobs, executed using Hadoop.

1) Pig can process data of any format; common formats such as tab-delimited text files are supported via built-in capabilities. A user can add support for a file format by writing a function that parses the bytes of a file into objects in Pig’s data model, and vice versa.
2) Pig’s data model is similar to the relational data model.
3) In Pig, tables are called bags. Pig also has a “map” data type, which is useful in representing semi-structured data, e.g., JSON or XML.
4) It can combine multiple data sets via operations such as join, union, or co-group, or it can split a single data set into multiple ones using an operation called split.
5) It is a procedural data-flow language and is mostly used by researchers or programmers.
6) Pig is Fault Tolerant
7) Pig supports “maps” of (key, value) pairs, where retrieving the value associated with a given key is an efficient operation. Maps provide a convenient way to represent semi-structured data, where the set of non-null fields varies from record to record. Maps are helpful when processing JSON, XML, and sparse relational data (i.e., tables with a lot of null values).