Five often-overlooked Hadoop, Big Data analytics project killers
When you’re getting ready to perform analytics on a data set, attention often goes to the software you will use to analyze the data and create your reports. Companies often think about how they are going to store data and build visualizations for one project and one instance. However, to truly achieve maturity in your big data analytics projects, you have to think about the big picture: all the criteria that lead to success, both today and in the future.

Overlooked One: How will I get and manage the data?

In organizations where data management is immature, users and business units tend to hoard data. Business users often mistakenly believe that if you own the data, you own the power. As IT professionals, we should move the organization toward data sharing: the enemy is not within, it is your data-savvy competitors. IT can help by introducing technologies that make it easy to democratize the data. By supporting technologies like Kafka, IT can set up a publish-and-subscribe infrastructure for the data to help break up the data fiefdoms.

At HPE, of course, we support Kafka in our HPE Vertica platform. In addition, we’re working on the data democratization problem by supporting file formats common in Hadoop environments, such as ORC, Parquet, JSON and others, so that data can be loaded into the analytics platform and anyone can be a data consumer. The high performance of our analytics database is not only about speeds and feeds; it’s also about giving more end users the capability to leverage the data. We rely on a strong partner network for ETL and data curation, including Informatica, Talend, Syncsort, Tamr and Pentaho, to name just a few.
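
To make the publish-and-subscribe idea concrete, here is a minimal in-memory sketch in Python. It is not Kafka and does not use any Kafka client API; the `Broker` class, the `orders` topic and the two "views" are hypothetical names invented for illustration. It only shows the core property that helps break up data fiefdoms: the producer publishes once, and any number of consumers, including ones that join later, can read the same records.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class Broker:
    """Toy in-memory stand-in for a publish/subscribe system (not Kafka)."""

    def __init__(self) -> None:
        # Each topic keeps an append-only list of records, loosely like a log.
        self._log: Dict[str, List[dict]] = defaultdict(list)
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        # Replay records published before this subscriber joined,
        # then register it for future records.
        for record in self._log[topic]:
            handler(record)
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, record: dict) -> None:
        # Store the record, then hand it to every current subscriber.
        self._log[topic].append(record)
        for handler in self._subscribers[topic]:
            handler(record)


broker = Broker()
finance_view: List[dict] = []    # hypothetical consumer owned by finance
marketing_view: List[dict] = []  # hypothetical consumer owned by marketing

broker.subscribe("orders", finance_view.append)
broker.publish("orders", {"order_id": 1, "amount": 120.0})

# Marketing joins late but still receives the earlier record.
broker.subscribe("orders", marketing_view.append)
broker.publish("orders", {"order_id": 2, "amount": 75.5})
```

Neither consumer has to ask the other for access: both end up with the same two records, which is the organizational point of putting a shared log between data producers and data consumers.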

Overlooked Two: Am I running the right hardware for the task?

Corporations often have banks of IT infrastructure that they can draw upon. HPE has sold a ton of ProLiant DL380p servers over the years, offering a solid foundation for most IT tasks and a very predictable plan for power usage, management and operations. However, you should consider that different workloads in your project may have different requirements for compute, storage and latency. For example, in the Hadoop world, ETL jobs may require lots of storage and the fastest network connection to deliver performance, while BI dashboards rely on fast CPUs and lots of memory to perform well. By thinking through how the hardware is going to be used, you can optimize and save.

This is really what HPE is accomplishing with our recent announcements of big data reference architectures for Vertica SQL on Hadoop. We have been working with the open source community on reference architectures on both the ProLiant and Apollo platforms that can be optimized for the task. For example, if you need to turn up Hadoop compute resources, you can adjust some settings in YARN and get them. If you need to turn up storage performance, adjust the YARN labels and go.
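
The workload-to-hardware reasoning in this section can be sketched as a small decision rule. The Python below is only an illustration under stated assumptions: the `WorkloadProfile` fields, the scoring rule and the label names (`storage-optimized`, `compute-optimized`, `general-purpose`) are hypothetical and are not taken from YARN or from any HPE reference architecture. In a real cluster the resulting label would have to correspond to node labels an administrator has actually defined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadProfile:
    """Coarse description of what a workload stresses (illustrative only)."""
    name: str
    storage_heavy: bool
    network_heavy: bool
    cpu_heavy: bool
    memory_heavy: bool


def suggest_node_label(profile: WorkloadProfile) -> str:
    """Return a hypothetical node label for the given workload."""
    # Booleans add up as 0/1, so each score ranges from 0 to 2.
    io_score = profile.storage_heavy + profile.network_heavy
    compute_score = profile.cpu_heavy + profile.memory_heavy
    if io_score > compute_score:
        return "storage-optimized"
    if compute_score > io_score:
        return "compute-optimized"
    return "general-purpose"


# The two examples from the text: ETL leans on storage and network,
# BI dashboards lean on CPU and memory.
etl = WorkloadProfile("nightly ETL", storage_heavy=True, network_heavy=True,
                      cpu_heavy=False, memory_heavy=False)
bi = WorkloadProfile("BI dashboard", storage_heavy=False, network_heavy=False,
                     cpu_heavy=True, memory_heavy=True)

placements = {p.name: suggest_node_label(p) for p in (etl, bi)}
```

Writing the rule down, even this crudely, forces the project team to state what each workload actually needs before buying or assigning hardware.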

Overlooked Three: Is it scalable and elastic?

The big data reference architectures also help when you start to get killed by your own success. Project managers should consider what to do if the project is a wild success and you get more data, more users and more queries.
