Foundational Hadoop vendor brings Hadoop 2.0 and YARN to the Windows platform

That release of Hadoop, along with its “YARN” component, allows the Big Data technology to be used on petabyte-scale datasets without having to rely on the batch-oriented and laborious MapReduce algorithm.  Moving beyond MapReduce is a big breakthrough, so the question of when Hadoop 2.0 would be available for Windows has been lingering.  Specifically, what we’ve needed is a release date for a Hadoop 2.0-based version of the Hortonworks Data Platform (HDP) for Windows distribution, since Hortonworks is the only vendor with a full Windows-based Hadoop distro.

Three months after the Hadoop 2.0-based release of HDP for Linux, the Windows release is now ready as well.  In a post to the Hortonworks blog today, Hortonworks’ VP of Strategic Marketing, John Kreisa, announced the immediate availability for download of HDP 2.0 for Windows.

What’s inside
In the post, Kreisa explains that HDP for Windows is the only Hadoop 2.0-based distro certified to run on Windows Server 2008 R2, 2012 and 2012 R2.  Kreisa also highlights other hallmarks of Hadoop 2.0: high availability for the Hadoop Distributed File System (HDFS) NameNode; Phase II of the Stinger initiative, which aims to increase the performance of Apache Hive by a factor of 100; and the inclusion of release 0.96 of the Apache HBase NoSQL database, which adds new features such as snapshots and improved mean time to recovery (MTTR).

The post also points out that there is a single-node version of HDP 2.0 for Windows available that includes an MSI (Microsoft Installer)-based setup program.  This makes it very easy for developers to start working with HDP 2.0 for Windows, whether for evaluation or for debugging and testing code written against HDP.

For a more built-out single-node installation, Hortonworks makes available its HDP “Sandbox,” a virtual machine image that includes HDP itself along with a number of tutorials and other learning resources.  Although the Sandbox ships only as a Linux guest image, a Hyper-V version of that image is available, making it accessible to Windows shops.

What about HDInsight?
Microsoft works closely with Hortonworks in the latter’s development of HDP for Windows.  In fact, the Redmond software giant’s Windows Azure cloud-based implementation of Hadoop, HDInsight, is built on the HDP codebase.

When will HDInsight be updated to Hadoop 2.0?  Neither Microsoft nor Hortonworks has yet announced a release date, but with Hortonworks putting HDP 2.0 for Windows into general availability, HDInsight will likely catch up before too long.

How Scientists Tackle NASA’s Big Data Deluge


Every hour, NASA’s missions collectively compile hundreds of terabytes of information which, if printed out in hard copy, would take up the equivalent of tens of millions of trees’ worth of paper.

This deluge of material poses some big data challenges for the space agency. But a team at NASA’s Jet Propulsion Laboratory (JPL) in Pasadena, Calif., is coming up with new strategies to tackle problems of information storage, processing and access so that researchers can harness gargantuan amounts of data that would be impossible for humans to parse through by hand.

“Scientists use big data for everything from predicting weather on Earth to monitoring ice caps on Mars to searching for distant galaxies,” JPL’s Eric De Jong said in a statement. De Jong is the principal investigator for one of NASA’s big data programs, the Solar System Visualization project, which aims to convert the scientific information gathered in missions into graphics that researchers can use.

“We are the keepers of the data, and the users are the astronomers and scientists who need images, mosaics, maps and movies to find patterns and verify theories,” De Jong explained. For example, his team makes movies from data sets like the 120-megapixel photos taken by NASA’s Mars Reconnaissance Orbiter during its surveys of the Red Planet.

But even just archiving big data for some of NASA’s missions and other international projects can be daunting. The Square Kilometer Array, or SKA, for example, is a planned array of thousands of telescopes in South Africa and Australia, slated to begin construction in 2016. When it goes online, the SKA is expected to produce 700 terabytes of data each day, which is equivalent to all the data racing through the Internet every two days.

JPL researchers will help archive this flood of information. And big data specialists at the center say they are using existing hardware, developing cloud computing techniques and adapting open source programs to suit their needs for projects like the SKA instead of inventing new products.

“We don’t need to reinvent the wheel,” Chris Mattmann, a principal investigator for JPL’s big data initiative, said in a statement. “We can modify open-source computer codes to create faster, cheaper solutions.”

NASA’s big data team is also devising new ways to make this archival info more accessible and versatile for public use.

“If you have a giant bookcase of books, you still have to know how to find the book you’re looking for,” Steve Groom, of NASA’s Infrared Processing and Analysis Center at the California Institute of Technology, explained in a statement.

Groom’s center manages data from several NASA astronomy missions, including the Spitzer Space Telescope and the Wide-field Infrared Survey Explorer (WISE).

“Astronomers can also browse all the ‘books’ in our library simultaneously, something that can’t be done on their own computers,” Groom added.

More about BIG Data

Most of us are familiar with archaeological and historical periods such as the Stone Age, Bronze Age and Iron Age; many scientists now say we live in the “Data Age.” The digital universe is not easy to measure, but IDC estimated it at 1.8 zettabytes (a zettabyte is 10^21 bytes, or 1 billion terabytes) in 2011. Source: http://www.emc.com/leadership/programs/digital-universe.htm

This flood of data comes from many sources, for example:

  • Facebook hosts approximately 10 billion photos, taking up about 1 PB of storage.
  • The BSE Sensex generates around 1 TB of data per day, and a flight simulator generates a comparable amount for every trip it flies; on top of that, our digital footprints are tracked by machine logs, RFID readers, GPS receivers and so on.
  • Everyday interactions such as phone calls, emails, documents and transactions all need to be stored somewhere reliably.

How do we organize this huge volume of data without any loss? How can a company use the data for analysis such as transaction analytics, claims-fraud analytics or surveillance analytics? How safely can we keep all the data on a single drive or cluster, and what happens when it crashes? And how do we process such a huge volume of data from one drive with acceptable performance?

All of these challenges can be overcome fairly easily. How?

BIG DATA:

Specifically, big data relates to data creation, storage, retrieval and analysis that is remarkable in terms of volume, velocity and variety (the three V’s).

What are the three V’s?

  1. Volume (huge volumes of data)
  2. Variety (different kinds of data: structured, semi-structured and unstructured)
  3. Velocity (the speed at which data arrives and must be processed, from real-time streaming to batch processing)

How do we work with big data?

Here comes Hadoop, a software framework originally written in Java by Doug Cutting.

Hadoop is a software framework for distributed processing of large datasets across large clusters of computers.

Origin of the name Hadoop: the name is not an acronym; it’s a made-up name. Doug Cutting named the project after his son’s toy elephant, which the child called “Hadoop.”

Major benefit: the processing runs where the data is stored, and only the results are sent back over the network, which keeps processing smooth and fast. Hadoop also works in large units: HDFS reads data in blocks of 64 MB (128 MB by default in Hadoop 2.x), whereas a typical Windows or Linux file system reads only about 4 KB at a time.
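As a minimal sketch (my own illustration rather than part of the original write-up), the effective block size can be checked through the Hadoop 2.x Java API; the root path here is just a placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeCheck {
    public static void main(String[] args) throws Exception {
        // Loads core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // "dfs.blocksize" is the Hadoop 2.x property; 134217728 bytes = 128 MB.
        long blockSize = fs.getDefaultBlockSize(new Path("/"));
        System.out.println("HDFS default block size: " + blockSize + " bytes");
    }
}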

Digging deeper into Hadoop:

In a Hadoop distributed architecture, both the data and the processing are distributed across multiple servers. The following are some of the key points to remember about Hadoop:

  • Each server offers local computation and storage: when you run a query against a large dataset, every server in the distributed architecture executes the query against its own local slice of the data, and the result sets from all of these local servers are then consolidated (a toy illustration follows this list).
  • In simple terms, instead of running a query on a single server, the query is split across multiple servers and the results are consolidated. This means that queries over large datasets return results faster.
  • You don’t need a powerful server; several less expensive commodity servers serve as the individual Hadoop nodes.
  • High fault tolerance: if any node in the Hadoop environment fails, the dataset is still returned correctly, because Hadoop takes care of replicating and distributing the data efficiently across multiple nodes.
  • A simple Hadoop deployment can use just two servers, but you can scale up to several thousand servers with little additional effort.
  • Hadoop is written in Java, so it can run on any platform with a JVM.
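To make the scatter/gather idea concrete, here is a toy Java illustration (plain Java, not the Hadoop API): each slice stands in for the data held on one node, each node computes a local answer, and the partial answers are consolidated at the end.

import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ScatterGatherToy {
    public static void main(String[] args) {
        // Pretend each inner list is the slice of data stored on one node.
        List<List<Integer>> slices = Arrays.asList(
                Arrays.asList(3, 9, 4),   // "node 1"
                Arrays.asList(7, 1, 8),   // "node 2"
                Arrays.asList(2, 6, 5));  // "node 3"

        // Each "node" answers the query (here: find the maximum) on its own slice,
        // and the coordinator consolidates the partial answers.
        int globalMax = Integer.MIN_VALUE;
        for (List<Integer> slice : slices) {
            int localMax = Collections.max(slice);      // local computation
            globalMax = Math.max(globalMax, localMax);  // consolidation
        }
        System.out.println("Consolidated maximum: " + globalMax); // prints 9
    }
}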

Please keep in mind that Hadoop is not a replacement for your RDBMS. You’ll typically use Hadoop for large volumes of unstructured or semi-structured data.

HDFS (Storage)

HDFS stands for Hadoop Distributed File System, which is the storage system used by Hadoop. The following is a high-level architecture that explains how HDFS works.

[Diagram: HDFS architecture, with one NameNode and multiple DataNodes holding replicated data blocks]

The following are some of the key points to remember about HDFS:

  • In the above diagram, there is one NameNode and multiple DataNodes (servers); b1, b2, … indicate data blocks.
  • When you load a file (or data) into HDFS, it is stored as blocks on the various nodes in the Hadoop cluster. HDFS creates several replicas of each data block and distributes them across the cluster so that the data is reliable and can be retrieved quickly. The default HDFS block size in Hadoop 2.x is 128 MB, and each data block is replicated to multiple nodes across the cluster.
  • Hadoop internally makes sure that a node failure never results in data loss.
  • There is one NameNode, which manages the file-system metadata.
  • There are multiple DataNodes (typically cheap commodity servers), which store the actual data blocks.
  • When a client reads a file, it first contacts the NameNode for the file metadata (which blocks live on which DataNodes) and then fetches the actual data blocks from the DataNodes, as in the sketch below.
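As a rough sketch of what that looks like from a client program (the NameNode address and file path below are hypothetical placeholders), the HDFS Java API hides the NameNode lookup and the block transfers behind ordinary stream calls:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt");

        // Write: the client asks the NameNode where to place the blocks,
        // then streams the bytes to the chosen DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the client gets the block locations from the NameNode,
        // then reads the blocks directly from the DataNodes.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}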

MapReduce (Processing)

[Diagram: MapReduce processing across the Hadoop cluster]

MapReduce is a parallel programming model that is used to process and retrieve data from the Hadoop cluster.

  • In this model, the library takes care of parallelization, fault tolerance, data distribution, load balancing, etc.
  • The framework splits a job into tasks and executes them on the various nodes in parallel, speeding up the computation and retrieving the required data from a huge dataset quickly.
  • Programmers only have to implement (or reuse) two functions: map and reduce.

[Diagram: the map and reduce phases, with intermediate key/value pairs flowing between them]

  • Data is fed into the map function as key/value pairs, and the map function produces intermediate key/value pairs.
  • Once the mapping is done, the intermediate results from the various nodes are reduced to create the final output.
  • The JobTracker keeps track of all the MapReduce jobs running on the various nodes; if a task fails, it reschedules that work on another node (a minimal driver sketch follows this list).
  • The TaskTracker on each node performs the map and reduce tasks assigned to it by the JobTracker.
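Here is a minimal driver sketch of what the programmer submits, using the Hadoop 2.x org.apache.hadoop.mapreduce API; the mapper and reducer class names anticipate the worked max-temperature example below, and the input/output paths are taken from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureDriver {
    public static void main(String[] args) throws Exception {
        // args[0] = input directory, args[1] = output directory (placeholders).
        Job job = Job.getInstance(new Configuration(), "max temperature");
        job.setJarByClass(MaxTemperatureDriver.class);

        // The two functions the programmer supplies; splitting, scheduling,
        // shuffling and fault tolerance are handled by the framework.
        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}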

 

Map and Reduce Example

 

MapReduce works by breaking the processing into two phases: the map phase and the reduce phase. Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer. The programmer also specifies two functions: the map function and the reduce function.

The input to our map phase is the raw NCDC (National Climatic Data Center) weather data. Our map function is simple: we pull out the year and the air temperature, since these are the only fields we are interested in.

To visualize the way the map works, consider the following sample lines of input data. These lines are presented to the map function as key-value pairs, where the key is the offset of each line within the file:

(0, 0067011990999991950051507004…9999999N9+00001+99999999999…)
(106, 0043011990999991950051512004…9999999N9+00221+99999999999…)
(212, 0043011990999991950051518004…9999999N9-00111+99999999999…)
(318, 0043012650999991949032412004…0500001N9+01111+99999999999…)
(424, 0043012650999991949032418004…0500001N9+00781+99999999999…)

The map function merely extracts the year and the air temperature from each record and emits them as its output:

(1950, 0)
(1950, 22)
(1950, −11)
(1949, 111)
(1949, 78)

Before the map output reaches the reduce function, the MapReduce framework processes it, sorting and grouping the key-value pairs by key:

(1949, [111, 78])
(1950, [0, 22, −11])

All the reduce function has to do now is iterate through each list and pick out the maximum reading:

(1949, 111)
(1950, 22)

This is the final output: the maximum global temperature recorded in each year.
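A sketch of the corresponding map and reduce functions in Java follows (the fixed-width character offsets for the year and temperature fields are assumptions based on the NCDC record layout used in this example; quality-code filtering is left out for brevity, and each class would normally live in its own file):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: pull the year and the air temperature out of each fixed-width NCDC record.
class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int MISSING = 9999;

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String year = line.substring(15, 19);         // assumed year field
        int airTemperature;
        if (line.charAt(87) == '+') {                 // assumed signed temperature field
            airTemperature = Integer.parseInt(line.substring(88, 92));
        } else {
            airTemperature = Integer.parseInt(line.substring(87, 92));
        }
        if (airTemperature != MISSING) {
            context.write(new Text(year), new IntWritable(airTemperature));
        }
    }
}

// Reduce: for each year, iterate over the grouped readings and keep the maximum.
class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int maxValue = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            maxValue = Math.max(maxValue, value.get());
        }
        context.write(key, new IntWritable(maxValue));
    }
}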

Hadoop ecosystem tools:

  1. Pig (a high-level data-flow language and execution framework for creating MapReduce programs used with Hadoop)
  2. Hive (a data warehouse infrastructure built on top of Hadoop that provides data summarization, querying and analysis)
  3. HBase (an open-source, non-relational, distributed database that runs on top of HDFS)
  4. ZooKeeper (an open-source distributed configuration, synchronization and naming-registry service for large distributed systems; its architecture supports high availability through redundant services)
  5. Avro (a data serialization and remote procedure call framework used for communication between Hadoop nodes and between client programs and the Hadoop services)
  6. Sqoop (a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases)