Category Archives: analysis

How to Google any column value from OBIEE

Now a days end users demand a lot of GUI interactivity in Obiee reports. They are really happy if obiee reports are full of helpful functionalities. One of them is to Google out the value they click on. So this was actually my business requirement few days back. Its not much tough to implement and requires no config or RPD changes.

Steps to do it :

  1. Create a new Action. Go to New -> Action -> Navigate to a Web PAge
  2. Type http://www.google.com/search?q=:   in url section on window
  3. then click on Define Parameters in right side of window.Remove the “optional” mark for the parameter like below:
  4. img1

    Save this action with any name you remember.

  5. Now go to your Analysis, where you wanna call this action through any column Interaction.
  6. On column properties -> Interaction -> Primary Interaction, select Action Links.
  7. Now click on + sign to add Action link
  8. Now in new pop up window, clink on Select existing action
  9. Identify your Saved action and click ok.
  10. Now edit this Action link and select the “column value” as the parameter value, as shown below
  11. then Select the relevant column and tick the boxes for “hidden” (and “fixed”) options. click ok to save.
  12. click ok and check your results. When you click on any mapped column value, Google window will open up searching that value.

 

Follow For more !!

 

What is Spark..is it replacing Hadoop ? (Day1)

Spark Framework is a simple Java web framework built for fast computation. It is a free and open-source software  & an alternative to other Java web application frameworks such as JAX-RS and Spring MVC. It was started in 2009 at Berkeley.

Overview:

To define, Spark is a cluster-computing framework, which means that it competes more with MapReduce than with the entire Hadoop ecosystem. It actually extends MR model to support more computation ways like interactive/iterative algos, queries, stream processing, graph processing etc. It is designed to be highly accessible, offering simple API in languages like Python, Java, Scala & SQL.One of the main features Spark offers for speed is the ability to run computations in memory, but the system is also more efficient than MapReduce for complex applications running on disk.

Is Spark a Hadoop module?

We see Spark is listed as a module on Hadoop’s project page, but Spark also has its own page because, while it can run in Hadoop clusters through YARN, it also has a standalone mode. So Spark is independent. By default there is no storage mechanism in Spark, so to store data, need fast and scalable file system. Hence uses S3 or HDFS or any other file system, but if you use Hadoop it’s very low cost.

Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3. You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos. However, as time goes on, some big data scientists expect Spark to diverge and perhaps replace Hadoop, especially in instances where faster access to processed data is critical.

(Source: Internet)

Hadoop vs Spark

A direct comparison of Hadoop and Spark is difficult because they do many of the same things, but are also non-overlapping in some areas.The most important thing to remember about Hadoop and Spark is that their use is not an either-or scenario because they are not mutually exclusive. Nor is one necessarily a drop-in replacement for the other. The two are compatible with each other and that makes their pairing an extremely powerful solution for a variety of big data applications.So we can compare them on some below points:

  1. Data Processing Engine/Operators: Hadoop originally was designed to handle crawling and searching billions of web pages and collecting their information into a database. For this it uses Map reduce,which is a batch-processing engine. MapReduce operates in sequential steps by reading data from the cluster, performing its operation on the data, writing the results back to the cluster, reading updated data from the cluster, performing the next data operation, writing those results back to the cluster and so on. But Spark is a cluster-computing framework,Which performs similar operations, but it does so in a single step and in memory. It reads data from the cluster, performs its operation (Filter/map/join/groupby) on the data, and then writes it back to the cluster.
  2. File System: Spark has no file management and therefor must rely on Hadoop’s Distributed File System (HDFS) or some other solution like S3, Tachyon.
  3. Speed/Performance: Spark’s in-memory processing admit that Spark is very fast (Up to 100 times faster than Hadoop MapReduce), Spark can also perform batch processing, however, it really excels at streaming workloads, interactive queries, and machine-based learning.The reason that Spark is so fast is that it processes everything in memory. Yes, it can also use disk for data that doesn’t all fit into memory.Spark uses memory and can use disk for processing, whereas MapReduce is strictly disk-based. Example: Internet of Things sensors, log monitoring, security analytics all require Spark for faster computation.
  4. Storage: MapReduce uses persistent storage and Spark uses Resilient Distributed Datasets (RDDs)
  5. Ease of Use: Spark is well known for its performance, but it’s also somewhat well known for its ease of use in that it comes with user-friendly APIs for Scala (its native language), Java, Python, and Spark SQL.Spark has an interactive mode so that developers and users can run queries.MapReduce has no interactive mode, but add-ons such as Hive and Pig  to make working with MapReduce a little easier for developers.
  6. Costs :Both MapReduce and Spark are Apache projects, which means that they’re open source and free software products. While there’s no cost for the software, there are costs associated with running either platform in personnel and in hardware. Both products are designed to run on commodity hardware, such as low cost, so-called white box server systems. However Spark systems cost more because of the large amounts of RAM required to run everything in memory. But what’s also true is that Spark’s technology reduces the number of required systems. So, you have significantly fewer systems that cost more. There’s probably a point at which Spark actually reduces costs per unit of computation even with the additional RAM requirement.
  7. API’s: Spark also includes its own graph computation library, GraphX. GraphX allows users to view the same data as graphs and as collections. Users can also transform and join graphs with Resilient Distributed Datasets (RDDs).
  8. Fault Tolerance: Hadoop uses Replicated blocks of data to maintain this feature. There is a link between TaskTrackers & JobTracker, so if its missed then the JobTracker reschedules all pending and in-progress operations to another TaskTracker. This effectively provides fault tolerance.Spark uses Resilient Distributed Datasets (RDDs), which are fault-tolerant collections of elements that can be operated on in parallel. RDDs can reference a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat. Spark can create RDDs from any storage source supported by Hadoop, including local filesystems or one of those listed previously.
  9. Scalability: both MapReduce and Spark are scalable using the HDFS.
  10. Security: Hadoop supports Kerberos authentication. HDFS supports access control lists (ACLs) and a traditional file permissions model. For user control in job submission, Hadoop provides Service Level Authorization, which ensures that clients have the right permissions.Spark’s security is a bit sparse by currently only supporting authentication via shared secret (password authentication). The security bonus that Spark can enjoy is that if you run Spark on HDFS, it can use HDFS ACLs and file-level permissions. Additionally, Spark can run on YARN giving it the capability of using Kerberos authentication.

Learn Spark Architecture by clicking here.

Big Data Architecture & its Challenges (Tutorial Day2)

Big Data as word describes is data that is too large to process using traditional methods.It originated with companies who had the problem of querying very large distributed semi or structured data. Google developed MapReduce to support distributed computing on large data sets on computer clusters. As discussed in earlier post, few examples of Big Data are:

  • Petabytes of data
  • Billions of records
  • distributed data
  • Flat files (cannot be seen in relation DB)
  • Semi structure data like log files
  • Video messages

Applications that produce or generate Big-data can be:

  • Transactional/operational  (CRM,ERP,Sales,HR),
  • Analytics (IT logs,Call Centre)

Big Data Architecture is set of few components joined to each other as shown in below image. Hadoop is present in middle tier of this structure, but not mandatory requirement.

big-data-components_operational-data-graph

Will discuss the components further in next tutorial blogs.

Bottlenecks with Big Data are :

  • Storage
  • Transfer
  • Sharing
  • Analysis
  • Processing
  • Visualization
  • Security

Big data is not just about size
–Finds insights from complex, noisy, heterogeneous, longitudinal, and voluminous data
–It aims to answer questions that were previously unanswered

In our existing traditional approach, we use a Data-warehouse to store data (OLTP-OLAP) in structured format. Process it , do data mining and build reports for further high level analysis.This approach works fine with those applications that process less volume of data which can be accommodated by standard db servers, or up to the limit of the processor that is processing the data.

But when it comes to dealing with huge amounts of scale-able data, it becomes a problem to process it using this tradition approach.Transactional Big-data projects cannot use Hadoop, as it is not real-time.

For transactional systems that do not need a database transaction to have ACID properties (Atomicity, Consistency, Isolation,Durability), NoSQL databases can be used, though there are constraints such as restricting transactions to a single data item.

For big-data transactional SQL databases that need the ACID properties have less options.

This is when Big Data got Distributed System into picture. It most of all related to Map-Reduce technology.For example, 1 machine with 4 I/O channels can process 1 terabyte of data in approx 42 mins if the channel speed is 100 mb/s.
But if we have a distributed system of 100 machines, each with 4 I/O channels, and each channel speed is 100 mb/s, then it will take few sec to process the data.

To adopt distributed System, Map reduce algorithm(MR) was used.This algorithm divides the task into small parts and assigns them to many computers (cluster), and collects the results from them which when integrated, form the output data-set.

Image Courtesy: google

Using the above solution, Doug Cutting and his team developed an Open Source Project called HADOOP.

To proceed further and understand Hadoop ,its component & Architecture, please read my  Next Blog

%d bloggers like this: