BIG DATA ANALYSIS
AUTHORS:-
S.AMIRRTHA KESAVAN, C.SUGEETH
CSE DEPT, MOUNT ZION COLLEGE OF ENGG AND TECH, PUDUKKOTTAI
amirlatha@gmail.com
ABSTRACT- Big data is the frontier of a firm's ability to store, process, and access all the data it needs to operate effectively, make decisions, reduce risks, and serve customers.
I-INTRODUCTION: Big data is the practice of storing information from multiple devices such as laptops, tablets, and mobiles on a common server. Lately, the term big data tends to refer to the use of predictive analytics, user-behavior analytics, or other advanced data analytics methods that extract value from data, and seldom to a particular size of data set. There is little doubt that the quantities of data now available are indeed large, but that is not the most relevant characteristic of this new data ecosystem. Analysis of data sets can find new correlations to spot business trends, prevent diseases, combat crime, and so on. Scientists, business executives, practitioners of medicine, advertisers, and governments alike regularly meet difficulties with large data sets in areas including fintech and urban and business informatics. Scientists encounter limitations in e-Science work, including meteorology, connectomics, complex physics simulations, biology, and environmental research.
II-CHARACTERIZATION:
i)Volume
ii)Velocity
iii)Variety
Big Data is high-volume, high-velocity, and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation. Gartner's definition of the 3Vs is still widely used and agrees with a consensual definition stating that Big Data represents information assets characterized by such high volume, velocity, and variety that they require specific technology and analytical methods for their transformation into value. Some organizations add a fourth V, "Veracity", a revision challenged by some industry authorities. The 3Vs have also been expanded with other complementary characteristics of big data:
VOLUME: Volume is the size of the data stored on a server. Smartphones generate data every second, producing billions of new feeds, locations, and other records; volume is simply the quantity of generated and stored data. The size of the data determines its value and potential insight, and whether it can actually be considered big data at all. Big data implies enormous volumes of data. It used to be that employees created data; now that data is generated by machines, networks, and human interaction on systems like social media, the volume to be analyzed is massive. Yet Inderpal states that the volume of data is not as much of a problem as other V's like veracity.
VELOCITY: Machine-to-machine processes exchange data between billions of devices, and infrastructure and sensors generate massive log data in real time. Big Data velocity deals with the pace at which data flows in from sources like business processes, machines, networks, and human interaction with social media sites, mobile devices, and so on. The flow of data is massive and continuous. This real-time data can help researchers and businesses make valuable decisions that provide strategic competitive advantage and ROI, if they are able to handle the velocity. Inderpal suggests that sampling data can help deal with issues like volume and velocity, as the sketch below illustrates. Velocity also covers the speed at which data is transferred from server to server.
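For instance, here is a minimal Python sketch of reservoir sampling, one standard way to keep a fixed-size uniform sample of an unbounded stream; the simulated event stream and sample size are illustrative, not from any particular system:

import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            # Replace an existing element with decreasing probability k/(i+1).
            j = random.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

# Example: sample 5 events from a simulated high-velocity stream of one million.
events = (f"event-{n}" for n in range(1_000_000))
print(reservoir_sample(events, 5))

Because the sample stays a constant size no matter how fast or long the stream runs, analytics over the sample keep up even when the full feed cannot be stored.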
VARIETY: Photos, videos, audio recordings, email messages, documents, books, presentations, tweets, and ECG strips are all data, but they are generally unstructured and incredibly varied. Big data also includes geospatial data alongside audio, video, text, and log files, so big data analysis spans many types of data. More sources of data mean more varieties of data in different formats: from traditional documents and databases to semi-structured and unstructured data from click streams, GPS location data, social media apps, and IoT (to name a few). Different data formats make it tougher to derive value (meaning) from the data, because each must be extracted and processed in its own way, as the sketch below shows. Traditional computing methods do not work on all these different varieties of data.
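A small Python sketch of why variety is costly: each format needs its own extraction path before records become comparable. The formats and field names here are hypothetical:

import csv, io, json

def to_record(raw, fmt):
    """Normalize one datum from a few common formats into a plain dict.
    The field names are hypothetical; real pipelines handle many more formats."""
    if fmt == "json":                      # semi-structured
        return json.loads(raw)
    if fmt == "csv":                       # structured rows
        reader = csv.DictReader(io.StringIO(raw), fieldnames=["user", "action"])
        return next(reader)
    if fmt == "log":                       # unstructured text: split on whitespace
        ts, level, msg = raw.split(" ", 2)
        return {"time": ts, "level": level, "message": msg}
    raise ValueError(f"unsupported format: {fmt}")

print(to_record('{"user": "a", "action": "click"}', "json"))
print(to_record("b,scroll", "csv"))
print(to_record("2017-01-01T00:00:00 INFO user logged in", "log"))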
III-OVERVIEW OF BIG DATA STORES: The Cortana Intelligence Suite offers the following two services as part of its Big Data Stores pillar, specifically designed and optimized for big data and analytics workloads, with the ability to store and manage modern data volumes and formats at scale and with a high degree of elasticity. We will cover Azure Data Lake Store and Azure SQL Data Warehouse.
IV-PROCESSING OF BIG DATA: Big data processing techniques analyze data sets at terabyte or even petabyte scale. Offline batch processing is typically full power and full scale, tackling arbitrary BI use cases, while real-time stream processing is performed on the most current slice of data for data profiling, picking outliers, fraudulent-transaction detection, security monitoring, and so on. The toughest task, however, is fast (low-latency) or real-time ad-hoc analytics on a complete big data set; it practically means scanning terabytes (or more) of data within seconds. This is only possible when data is processed with high parallelism, as the sketch below illustrates.
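Here is a minimal single-machine Python sketch of that idea, scanning hypothetical partition files in parallel with a process pool; a real cluster spreads the partitions across many machines rather than one pool:

from multiprocessing import Pool

def count_matches(path):
    """Scan one partition file and count lines containing a target substring."""
    hits = 0
    with open(path) as f:
        for line in f:
            if "ERROR" in line:   # hypothetical predicate
                hits += 1
    return hits

if __name__ == "__main__":
    # Hypothetical partition files; a real big data set has thousands of these.
    partitions = [f"data/part-{i:05d}.log" for i in range(8)]
    with Pool() as pool:
        total = sum(pool.map(count_matches, partitions))
    print("matching lines:", total)

The scan time then shrinks roughly in proportion to the number of workers, which is exactly the property that makes second-scale scans of terabytes feasible on a large cluster.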
V-BATCH PROCESSING OF BIG DATA: Apache Hadoop is a distributed computing framework modeled after Google MapReduce that processes large amounts of data in parallel. The first thing that comes to mind when speaking about distributed computing is often EJB, but EJB is really a component model with remoting capability; it lacks the critical features of a distributed computing framework, namely computational parallelization, work distribution, and tolerance of unreliable hardware and software. Hadoop, on the other hand, has these merits built in. ZooKeeper, modeled on Google Chubby, is a centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services for the Hadoop cluster. The Hadoop Distributed File System (HDFS), modeled on Google GFS, is the underlying file system of a Hadoop cluster. The sketch after this paragraph shows the canonical MapReduce example, a word count, written for Hadoop Streaming.
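As an illustration of the MapReduce model Hadoop implements, here is the canonical word count as a pair of Hadoop Streaming scripts in Python. This is a minimal sketch: Streaming pipes text through stdin/stdout, and the input/output paths in the comment are illustrative (the streaming jar location varies by installation):

#!/usr/bin/env python3
# mapper.py -- emit (word, 1) for every word on stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py -- sum the counts for each word.
# Hadoop Streaming sorts mapper output by key before the reducer sees it.
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = word, 0
    total += int(count)
if current is not None:
    print(f"{current}\t{total}")

# Run (paths illustrative):
#   hadoop jar hadoop-streaming.jar -input /data/in -output /data/out \
#       -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py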
VI-WHY BIG DATA IS IMPORTANT: Storage capacities and processing power keep increasing, yet every day countless devices produce billions of data records, more than traditional systems have room to store.
Big data analytics helps organizations harness their data and use it to identify new opportunities. That, in turn, leads to smarter business moves, more efficient operations, higher profits and happier customers.
VII-STATISTICS: A single social network commonly creates nearly 500 TB of data per day; if one network alone can create that much, consider how much all social networks produce together. Moreover, 90% of the world's data was created in the last 2 years. YouTube users upload 48 hours of new video every minute of the day. (Source)
571 new websites are created every minute of the day. (Source)
Brands and organizations on Facebook receive 34,722 Likes every minute of the day. (Source)
100 terabytes of data are uploaded daily to Facebook. (Source)
According to Twitter’s own research in early 2012, it sees roughly 175 million tweets every day, and has more than 465 million accounts. (Source)
30 billion pieces of content are shared on Facebook every month. (Source)
Data production will be 44 times greater in 2020 than it was in 2009. (Source)
In late 2011, IDC Digital Universe published a report indicating that some 1.8 zettabytes of data will be created that year. (Source)
In other words, the amount of data in the world today is equal to:
Every person in the US tweeting three tweets per minute for 26,976 years.
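A back-of-envelope check of that comparison, assuming roughly 140 bytes per tweet and a US population of about 315 million (both assumptions, not figures from the report):

# Sanity-check the tweet comparison above (all inputs are assumptions).
BYTES_PER_TWEET = 140          # assume ~1 byte per character, 140-char tweets
US_POPULATION = 315e6          # rough 2011 figure
TWEETS_PER_MIN = 3
MINUTES_PER_YEAR = 60 * 24 * 365
YEARS = 26_976

total = BYTES_PER_TWEET * US_POPULATION * TWEETS_PER_MIN * MINUTES_PER_YEAR * YEARS
print(total / 1e21, "zettabytes")   # ~1.9 ZB, close to IDC's 1.8 ZB estimate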
VIII-HOW BIG DATA IS DIFFERENT: Big Data technology is about solving this problem once and for all, with performance that scales gracefully to datasets of any size. Systems like Hadoop are nothing special for small datasets (in fact they’re worse, due to parallelism overhead), but as the gigabytes pile up, the time to process all your data increases only gradually – say, an extra several minutes every time your dataset doubles in size. For the massive datasets being mined today, this becomes the only practical option.
The key to Big Data technologies is "horizontal scaling". This means that, instead of upgrading to fancier computers when the current system becomes insufficient, you just add more computers to a cluster. The focus is on parallelism rather than the performance of any one node. Typically the cost of a few commodity machines pales next to that of a better processor, and the best processor on the market represents an absolute bound on vertical scaling's power. Big Data systems have no such limit.
Much of this data is auto-generated by machines, typically from entirely new sources, and it was never designed to be user-friendly.
IX-TYPES OF TOOLS USED: The cloud is the most common hosting platform, distributed storage holds the data, MapReduce serves as the programming model, and semantic operations are performed on the data. HDFS (Hadoop Distributed File System) is the part of Hadoop known as a special file system that handles the distribution and storage of large data sets. HDFS stores a file as a sequence of equally sized blocks, except for the last block, as the sketch below illustrates. It also copes with hardware failure and smooths data handling.
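A toy Python sketch of that block model, splitting a file into fixed-size blocks where only the last may be shorter; real HDFS blocks default to 64-128 MB depending on the version, so the 4-byte block size here is purely for demonstration:

def split_into_blocks(path, block_size=4):
    """Illustrate HDFS-style splitting: fixed-size blocks, last one may be short.
    Real HDFS blocks default to 64-128 MB; 4 bytes is just for demonstration."""
    blocks = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            blocks.append(chunk)
    return blocks

# Example with a tiny throwaway file.
with open("demo.txt", "wb") as f:
    f.write(b"abcdefghij")            # 10 bytes -> blocks of 4, 4, and 2
print(split_into_blocks("demo.txt"))  # [b'abcd', b'efgh', b'ij']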
Hive was initiated by Facebook. Hive is a data warehouse tool built on Hadoop that converts queries into MapReduce jobs. It deals with the storage, analysis, and querying of large data sets. Queries in Hive are written as HQL statements; Hive Query Language is similar to standard SQL. A short example follows.
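A minimal sketch of an HQL query issued from Python through PyHive, one common HiveServer2 client; the host, table, and column names are hypothetical:

# Requires a running HiveServer2 and the third-party PyHive package.
from pyhive import hive

conn = hive.connect(host="localhost", port=10000)
cursor = conn.cursor()
# Hive compiles this aggregate query into one or more MapReduce jobs.
cursor.execute(
    "SELECT action, COUNT(*) AS events "
    "FROM clickstream GROUP BY action"
)
for action, events in cursor.fetchall():
    print(action, events)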
X-APPLICATIONS OF BIG DATA: Big data tools and technologies help companies interpret huge amounts of data much faster, which boosts production efficiency and supports the development of new data-driven products and services. Big data applications are creating a new era in every industry.
It is used in processes such as traffic control, telecom services, manufacturing, trading analytics, and health care.
XI-IMPACTS: By 2015, big data supported 4.4 million IT jobs worldwide, 1.9 million of them in the USA alone. India needs a minimum of one lakh (100,000) data scientists to work on big data, from data analysis to management.
XII-BENEFITS: Business organizations are using big data to target customers for their business. Big data already represents a market worth nearly $64 billion. On cost savings, big data tools like Hadoop and cloud-based analytics bring cost advantages when large amounts of data must be stored, and they also help identify more efficient ways of doing business. The high speed of tools like Hadoop and in-memory analytics makes it easy to identify new sources of data, which helps businesses analyze data immediately and make quick decisions based on the learnings.
XIII-CONCLUSION: Big data helps organizations create new growth opportunities and entirely new categories of companies that combine and analyze industry data. These companies have ample information about products and services, buyers and suppliers, and consumer preferences that can be captured and analyzed.