“For all those folks running around saying Hadoop is dead – they’re dead wrong,” asserts George Corugedo, CTO at RedPoint Global.
Hadoop has evolved through two substantial phases of development, defined first by how companies used it and later by the tools that emerged to form an ecosystem. In its early days, Hadoop was a novel tool explored by a handful of researchers and experimenters scattered around the world. An independent survey suggests that early users mainly ran HBase and MapReduce, with tools like Pig and Hive making MapReduce simple enough to use.
Once Hadoop’s ability to deliver real business value was recognized, the IT industry suddenly became interested in things like predictable run times, efficiency, ROI and similar concerns. This next phase was reflected in the rapid expansion of the Hadoop ecosystem, when innovations like YARN, Kudu and Spark entered the scene.
Lately, another set of changes has emerged, signaling a new level of maturity for Hadoop: a platform that is more robust and characterized by new forms of functionality and accessibility.
Here is how experts see Hadoop transforming in the near future:
Hadoop to be Widely Adopted
Hadoop is forecast to be adopted by many more organizations, and vendors will respond with new and innovative solutions. This will allow companies to crunch large amounts of data with advanced analytics and unearth the nuggets of information that drive profitable decisions. “If the Hadoop distributors can address the data governance issues and the complexity of Hadoop sub-project integration, it could be a golden year for Hadoop,” says Kunal Agarwal, CEO, Unravel Data.
Faster Performance with Hadoop
SQL is the conduit for business users who want to use Hadoop data for faster, more frequent KPI dashboards. This is driving adoption of faster databases such as Exasol and MemSQL, as well as the Hadoop-based storage engine Kudu. SQL-on-Hadoop engines and OLAP-on-Hadoop technologies act as query accelerators, delivering faster access to data.
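The appeal of these engines is that a dashboard KPI is just a familiar SQL aggregate. The sketch below uses Python’s built-in sqlite3 purely as a stand-in for a SQL-on-Hadoop engine such as Hive or Impala; the `orders` table and the revenue-per-region KPI are hypothetical examples, not from the article.

```python
import sqlite3

# In-memory database stands in for a SQL-on-Hadoop engine;
# the 'orders' table and KPI below are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("east", 120.0), ("east", 80.0), ("west", 200.0)],
)

# A typical dashboard KPI: revenue per region -- the kind of
# aggregate that OLAP-on-Hadoop layers pre-compute for speed.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 200.0), ('west', 200.0)]
```

The same statement would run unmodified against Hive or Impala; the accelerator’s job is to make that aggregate return in dashboard time rather than batch time.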
Increasing Cloud Gravity
Amazon Web Services boasts a Hadoop installed base bigger than that of all the other providers combined. It offers its own version of Apache Hadoop called EMR (Elastic MapReduce), which runs on its Elastic Compute Cloud (EC2) infrastructure. Customers are also free to run stock Apache Hadoop or the HDP, CDH, and MapR distributions on their own. The inherent auto-scaling of cloud-based Hadoop offers an enormous advantage over traditional Hadoop software, which lacks this capability. “The existing on-premise Hadoop distros (Cloudera, Hortonworks, MapR) will be at a disadvantage compared to the cloud-based ‘Hadoop as a service’ providers like Amazon EMR, Google Dataproc, and Microsoft Azure HD Insight,” says Dave Mariani, the CEO of AtScale.
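That auto-scaling is typically expressed declaratively rather than coded by hand. A sketch of an EMR instance-group auto-scaling policy is shown below; the rule name and all numeric values are illustrative assumptions, not prescriptions. It adds a node when available YARN memory drops below a threshold:

```json
{
  "Constraints": {"MinCapacity": 2, "MaxCapacity": 10},
  "Rules": [
    {
      "Name": "ScaleOutOnLowMemory",
      "Action": {
        "SimpleScalingPolicyConfiguration": {
          "AdjustmentType": "CHANGE_IN_CAPACITY",
          "ScalingAdjustment": 1,
          "CoolDown": 300
        }
      },
      "Trigger": {
        "CloudWatchAlarmDefinition": {
          "MetricName": "YARNMemoryAvailablePercentage",
          "Namespace": "AWS/ElasticMapReduce",
          "ComparisonOperator": "LESS_THAN",
          "Threshold": 15,
          "Period": 300,
          "EvaluationPeriods": 1,
          "Statistic": "AVERAGE",
          "Unit": "PERCENT"
        }
      }
    }
  ]
}
```

An on-premise cluster has no equivalent lever: capacity changes mean procuring and racking hardware, which is the disadvantage Mariani is pointing at.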
Alignment with Enterprise Standards
Increased investment is predicted in the security and governance components that surround enterprise systems. Apache Sentry provides role-based authorization to data and metadata stored on a Hadoop cluster. Apache Ranger delivers centralized security administration for Hadoop. Apache Atlas helps organizations apply consistent data classification across the ecosystem. Users have come to expect these kinds of capabilities from their RDBMS platforms, and these tools are now at the forefront of emerging big-data technologies.
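Sentry’s role-based model will feel familiar to RDBMS administrators because it is driven by SQL-style statements issued through Hive or Impala. A minimal sketch follows; the role, database, and group names are hypothetical:

```sql
-- Create a role and grant it read access to one database (Sentry syntax)
CREATE ROLE analyst_role;
GRANT SELECT ON DATABASE sales_db TO ROLE analyst_role;

-- Map an OS/LDAP group to the role; its members inherit the privileges
GRANT ROLE analyst_role TO GROUP analysts;
```

This is the same grant/role vocabulary enterprises already use for relational databases, which is precisely the expectation Hadoop is now being held to.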
Machine Learning Automation
Machine learning (ML) with big data is a notoriously complex discipline, but numerous new systems are making it more user-friendly. Many companies use Hadoop’s scalability to build super-sized data repositories and run familiar SQL statements for routine ad-hoc and BI reports through engines like Hive and Spark SQL. For the most demanding ML workloads, most organizations are turning to Apache Spark, which is already being used to augment mainframes and detect fraud. “Machine learning capabilities will start infiltrating enterprise applications, and advanced applications will provide suggestions — if not answers — and provide intelligent workflows based on data and real-time user feedback. This will allow business experts to benefit from customized machine learning without having to be machine learning experts,” explains Toufic Boubez, VP, Splunk.
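To make the fraud-detection use case concrete, the toy sketch below flags statistical outliers among transaction amounts with a simple z-score test. This is plain Python with hypothetical data, standing in for the far richer anomaly-detection models an organization would actually train and run at scale on Apache Spark:

```python
from statistics import mean, stdev

# Hypothetical transaction amounts; in production this scoring would
# run distributed on Apache Spark, not in a single Python process.
amounts = [52.0, 48.5, 50.0, 47.0, 51.5, 49.0, 950.0]

mu = mean(amounts)
sigma = stdev(amounts)

# Flag transactions more than 2 standard deviations from the mean --
# a crude stand-in for a trained fraud-detection model.
flagged = [a for a in amounts if abs(a - mu) / sigma > 2]
print(flagged)  # [950.0]
```

The point of Boubez’s prediction is that this kind of scoring increasingly happens inside enterprise applications, surfaced as suggestions, without the business user ever writing model code.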
Data Governance and Security
Companies are starting to realize that poor data-control strategies carry security risks that can lead to disastrous consequences. Enterprises need to take a hard look at the security of their Hadoop-based data lakes. “Current practices that dump raw log files with unknown and potentially sensitive information into Hadoop will be replaced by systematic data classification, encryption and obfuscation of all long-term data storage,” says Steve Wilkes, CEO of Striim.
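The classify-then-obfuscate practice Wilkes describes can be sketched in a few lines. In this illustration, fields of a hypothetical log record that are classified as sensitive are replaced with a one-way hash before the record reaches long-term storage (field names and the truncation length are assumptions):

```python
import hashlib

# Fields classified as sensitive in this hypothetical log schema.
SENSITIVE_FIELDS = {"email", "ip_address"}

def obfuscate(record: dict) -> dict:
    """Replace sensitive values with a truncated SHA-256 digest before storage."""
    masked = {}
    for key, value in record.items():
        if key in SENSITIVE_FIELDS:
            masked[key] = hashlib.sha256(value.encode()).hexdigest()[:12]
        else:
            masked[key] = value
    return masked

raw = {"user_id": "u42", "email": "jane@example.com", "ip_address": "10.0.0.7"}
safe = obfuscate(raw)
print(safe["user_id"])  # u42 -- non-sensitive fields pass through unchanged
```

Because the digest is deterministic, analysts can still join and count on the masked columns, while the raw values never land in the data lake.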
Influence of Self-Service Data Prep
Making Hadoop data accessible has been one of the biggest challenges for business users. The emergence of self-service analytics platforms has eased this issue. Business users also want to reduce the time and complexity of preparing data for analysis, which is especially important when dealing with a variety of data formats. Self-service data prep tools allow data to be prepped at the source and make it available as snapshots for quick and easy exploration. Tools like Alteryx, Trifacta and Paxata are lowering the barrier to entry for late Hadoop adopters.
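“Prepping at the source” amounts to cleaning and normalizing records once, then publishing the result as a snapshot for exploration. A minimal sketch in plain Python follows; the field names and normalization rules are hypothetical, and real self-service tools apply such rules through a visual interface rather than code:

```python
import csv
import io

# Raw source data with inconsistent formats -- a hypothetical example.
raw_csv = """name,signup_date,country
 Alice ,2017/03/05,us
Bob,2017-03-06,US
"""

def prep(row: dict) -> dict:
    """Normalize whitespace, date separators, and country casing."""
    return {
        "name": row["name"].strip(),
        "signup_date": row["signup_date"].replace("/", "-"),
        "country": row["country"].upper(),
    }

# The cleaned rows form a snapshot that analysts can explore directly.
snapshot = [prep(r) for r in csv.DictReader(io.StringIO(raw_csv))]
print(snapshot)
```

Once every record follows one format, downstream dashboards and ad-hoc queries no longer need their own cleanup logic.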
Emergence of Data Fabrics
Big data is fluid by nature: products like Flume and Cascading help data streams flow into data reservoirs. But in the coming years, big data may be thought of in other terms, such as fabrics. The data-fabric notion aims to unify key aspects of data management, security and self-service across Hadoop and other big data platforms. Enterprises of all types and sizes are embracing big data, but the gap between business expectations and the challenge of supporting big data technology such as Hadoop has become the primary motivation to innovate with the big data fabric. “The collection of technologies enables enterprise architects to integrate, secure, and govern various data sources through automation, simplification, and self-service capabilities,” says Noel Yuhanna, analyst, Forrester.