In the era of information explosion, enormous amounts of data have become available to decision makers. Big data refers to datasets that grow so huge that they become difficult to handle using traditional tools and techniques. Due to the rapid growth of such data, solutions need to be studied and provided in order to handle and extract value and knowledge from these datasets. Such value can only be provided by using big data analytics, which is the application of advanced analytics techniques on big data. This paper aims to analyze the different methods and tools which can be applied to big data, as well as the opportunities provided and the challenges which must be faced.
Imagine a world without data storage; a place where every detail about a person or organization, every transaction performed, or every aspect which can be documented is lost directly after use. Organizations would thus lose the ability to extract valuable information and knowledge, perform detailed analyses, and provide new opportunities and advantages. Data is an essential part of our lives, and the ability to store and access such data has become a crucial task which we cannot live without. Anything ranging from customer names and addresses, to products available, to purchases made, to employees hired has become essential for day-to-day continuity. Data is the building block upon which any organization thrives.
Now imagine the extent of detail and the surge of data and information provided nowadays through advancements in technology and the internet. With the increase in storage capabilities and methods of data collection, huge amounts of data have become easily available. Every second, more and more data is being created and needs to be stored and analyzed in order to extract value. Furthermore, data has become cheaper to store, so organizations need to get as much value as possible from the huge amounts of stored data. According to Gruenspecht, there has been a tremendous surge in the use of digital storage, as well as a drop in its price, within the last twenty years. This has eliminated the need to clear out previous data, increased the storage of metadata (data about the data), and made backup storage a common practice against data loss. Additionally, companies and individuals possess more technologies and devices which create and capture more data in different categories. A single user nowadays can own a desktop, laptop, smartphone, tablet, and more, where each device carries very large amounts of valuable data.
Such sheer amounts of data need to be properly analyzed, and pertinent information should be extracted. Big data analytics is the application of advanced analytic techniques to big data. This paper will further examine the concept of big data analytics and how it can be applied to big data sets. The purpose of the paper is to discover some of the opportunities and challenges related to the field of big data, as well as the application of data analytics on big data.
The paper is organized as follows. The first section explains big data in detail, the characteristics which define big data, and the importance of storing and analyzing such voluminous data. The second section discusses big data analytics, covering possible advanced data analytics methods as well as the technologies and tools needed for big data analytics. Finally, we conclude the paper by analyzing the challenges related to big data, and the need for future research in the field.
2. Big data
The term "Big Data" has recently been applied to datasets that grow so large that they become awkward to work with using traditional on-hand database management tools. They are data sets whose size is beyond the ability of commonly used software tools and storage systems to capture, store, manage, and process within a tolerable elapsed time. Big data also refers to databases which are measured in terabytes and above, and are too complex and large to be effectively used on conventional systems.
Big data sizes are a constantly moving target, currently ranging from a few dozen terabytes to many petabytes of data in a single data set. Consequently, some of the difficulties related to big data include capture, storage, search, sharing, analytics, and visualization. Today, enterprises are exploring large volumes of highly detailed data so as to discover facts they didn't know before. Business benefit can commonly be derived from analyzing larger and more complex data sets that require real-time or near-real-time capabilities; however, this leads to a need for new data architectures, analytical methods, and tools. In this section, we will discuss the characteristics of big data, as well as the issues surrounding the storage and analysis of such data.
2.1. Big data characteristics
Big data is data whose scale, distribution, diversity, and/or timeliness require the use of new technical architectures, analytics, and tools in order to enable insights that unlock new sources of business value. Big data is characterized by three main features: volume, variety, and velocity. The volume of the data is simply its size. Velocity refers to the rate at which data is changing, or how often it is created. Finally, variety includes the different formats and types of data, as well as the different kinds of uses and ways of analyzing the data.
Figure 1: The three V's of big data
The three V's of Big Data are shown in Figure 1. Data volume is the primary attribute of big data. Big data can be quantified by size in terabytes or petabytes, as well as by the number of records, transactions, tables, or files. Additionally, one of the things that makes big data really big is that it comes from a greater variety of sources than ever before, including logs, clickstreams, and social media. Using these sources for analytics means that common structured data is now joined by unstructured data, such as text and human language, and semistructured data, such as XML or RSS feeds. There is also data which is hard to categorize, since it comes from audio, video, and other devices. Furthermore, multidimensional data can be drawn from a data warehouse to add historic context to big data. Thus, with big data, variety is just as big as volume.
Furthermore, big data can be described by its velocity or speed. This is basically the frequency of data generation or the frequency of data delivery. The leading edge of big data is streaming data, which is collected in real time from websites. So now that we know what big data is and what characterizes it, we start asking ourselves why we need to consider such data with all its volume, variety, and velocity. In the following section, we will take a look at the importance of managing such data, and why it has become a recent trend.
2.2. Importance of managing big data
According to Manyika et al., there are five broad ways in which using big data can create value. First, big data can unlock significant value by making information transparent and usable at a much higher frequency. Second, as organizations create and store more and more transactional data in digital form, they can collect more accurate and detailed performance information on everything from product inventories to sick days. This can expose variability in the data and boost performance.
Third, big data allows a narrower segmentation of customers and therefore much more precisely tailored products or services to meet their needs and requirements. Fourth, sophisticated analytics performed on big data can substantially improve decision making. Finally, big data can also be used to improve the development of the next generation of products and services. For example, manufacturers are currently using data obtained from sensors embedded in products to create innovative after-sales service offerings such as proactive maintenance, which consists of preventive measures that take place before a failure occurs or is even noticed by the customer.
Nowadays, along with the increasing ubiquity of technology comes an increase in the amount of electronic data. Only a few years ago, corporate databases tended to be measured in the range of tens to hundreds of gigabytes. Now, however, multi-terabyte (TB) or even petabyte (PB) databases have become normal. According to Longbottom, the World Data Center for Climate (WDCC) stores over 6 PB of data overall, and the National Energy Research Scientific Computing Center (NERSC) has over 2.8 PB of available data on atomic energy research, physics projects, and so on. These are only a couple of examples of the enormous amounts of data which must be dealt with nowadays.
Furthermore, even companies such as Amazon are running databases in the tens of terabytes, and companies which wouldn't be expected to worry about such massive systems are dealing with databases of hundreds of terabytes. Additionally, other companies with large databases in place include telecom companies and service providers, as well as social media sites. For telecom companies, just dealing with log files of all the events happening and call logs can easily build up database sizes. Moreover, social media sites, even those that are primarily text, such as Twitter or Facebook, face substantial data problems, and sites such as YouTube have to deal with massively expanding datasets. With such increasing amounts of big data, there arises an essential need to be able to analyze the datasets. Thus, big data analytics will be discussed in the subsequent section.
3. Big data analytics
Big data analytics is where advanced analytic techniques operate on big data sets. Analytics based on large data samples reveal and leverage business change. However, the larger the set of data, the more difficult it becomes to manage. Sophisticated analytics can substantially improve decision making, minimize risks, and unearth valuable insights from the data that would otherwise remain hidden. Sometimes decisions do not necessarily need to be automated, but rather augmented by analyzing huge, entire datasets using big data techniques and technologies, instead of just the smaller samples that individuals with spreadsheets can handle and understand. Therefore, decision making may never be the same. Some organizations are already making better decisions by analyzing entire datasets from customers, employees, or even sensors embedded in products. In this section, we will discuss some advanced data analytics methods, as well as some possible tools and methods for big data analytics in particular.
3.1. Advanced data analytics methods
With the evolution of technology and the increased multitudes of data flowing in and out of organizations daily, there is a need for faster and more efficient ways of analyzing such data. Having piles of data on hand is no longer enough to make efficient decisions at the right time. As Oueslati and Akaichi acknowledged, the acquired data must not only be accurate, consistent, and sufficient to base decisions upon, but it must also be integrated and subject-oriented, as well as non-volatile and time-variant. New tools and algorithms have been designed to aid decision makers in automatically filtering and analyzing these diverse pools of data.
Data analytics is the process of applying algorithms in order to analyze sets of data and extract useful and unknown patterns, relationships, and information. Furthermore, data analytics is used to extract previously unknown, useful, valid, and hidden patterns and information from large data sets, as well as to detect important relationships among the stored variables. Thus, analytics has had a significant impact on research and technologies, since decision makers have become more and more interested in learning from previous data, thus gaining competitive advantage.
Nowadays, people don't just want to collect data, they want to understand the meaning and importance of the data, and use it to aid them in making decisions. Data analytics has gained a great amount of interest from organizations throughout the years, and has been used for many diverse applications. Some applications of data analytics involve science, such as particle physics, remote sensing, and bioinformatics, while others focus on commerce, such as customer relationship management, consumer finance, and fraud detection.
In this section, we will take a look at some of the most common data analytics approaches, how they can be applied, and which algorithms are frequently used. Three different data analytics approaches will be discussed: association rules, clustering, and decision trees.
3.2. Association rules
Association rules are one of the most popular data analytics tasks for discovering interesting relations between variables in large databases. They are an approach for pattern detection which finds the most common combinations of categorical variables. Using association rules shows relationships between data items by identifying patterns of their co-occurrence. Since so many different association rules can be derived from even a tiny dataset, interest is restricted to those rules that apply to a reasonably large number of instances and have a reasonably high accuracy on the instances to which they apply.
Association rule analytics discovers interesting correlations between attributes of a database by using two measures, support and confidence. Support is the probability that two different attributes occur together in a single event, or the frequency of co-occurrence, while confidence is the probability that when one attribute occurs, the other will also occur in the same event. Association rules are normally used in business applications to determine the items which are usually purchased together. An example of an association rule would be the statement that people who buy cars also buy CDs 80% of the time, written as Car → CD. In this case the two attributes being associated are the car and the CD; the confidence value is 80%, and the support value is how many times in the database both a car and a CD were bought together. If a rule passes the minimum support threshold it is considered a frequent rule, while rules which pass both the support and confidence thresholds are considered strong rules.
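The two measures can be sketched in a few lines of Python; the transactions below are hypothetical, invented only to mirror the car/CD example:

```python
# Hypothetical transaction database, mirroring the Car -> CD example.
transactions = [
    {"car", "cd"},
    {"car", "cd", "insurance"},
    {"car"},
    {"cd"},
    {"car", "cd"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction
    which also contain the consequent."""
    both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    return both / ante

print(support({"car", "cd"}, transactions))       # 0.6
print(confidence({"car"}, {"cd"}, transactions))  # 0.75
```

Here {car, cd} appears in 3 of the 5 transactions (support 0.6), and 3 of the 4 transactions containing a car also contain a CD (confidence 0.75).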
One of the most common algorithms for association rule analytics is the Apriori algorithm. Like most association rule algorithms, it splits the problem into two major tasks. The first task is frequent itemset generation, in which the objective is to find all the itemsets which satisfy the minimum support threshold and are thus frequent itemsets. The formula for calculating support is:

support(A → B) = σ(A ∪ B) / N

where σ(A ∪ B) is the number of transactions containing both A and B, and N is the total number of transactions.
The second task is rule generation, in which the objective is to extract the high-confidence, or strong, rules from the previously found frequent itemsets. The formula for calculating confidence is:

confidence(A → B) = σ(A ∪ B) / σ(A)

that is, the fraction of the transactions containing A which also contain B.
Since the first step is computationally expensive and requires the generation of all combinations of itemsets, the Apriori algorithm provides a principle for guiding itemset generation and reducing computational requirements.
The Apriori principle states that any subset of a frequent itemset must also be frequent. Conversely, if an itemset is not frequent, then it is discarded and not used as a subset for the generation of another itemset. The algorithm uses a breadth-first search strategy and a tree structure, as shown in Figure 2, to count candidate itemsets efficiently.
Figure 2: An itemset lattice 
Each level in the tree contains all the k-itemsets, where k is the number of items in the itemset. For example, level 1 contains all 1-itemsets, level 2 all 2-itemsets, and so forth. Instead of ending up with every possible combination of items, the Apriori algorithm only considers the frequent itemsets. In the first level, the algorithm calculates the support of each itemset. Frequent itemsets which pass the minimum support are taken to the next level, and all possible 2-itemset combinations are made only out of these frequent sets, while all others are discarded. Finally, rules are extracted from the frequent itemsets in the form A → B (if A then B). The confidence for each rule is calculated, and rules which pass the minimum confidence are taken as strong rules.
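The level-wise generation described above can be sketched as follows. This is a minimal illustration of the Apriori principle, not an optimized implementation, and the sample transactions are invented:

```python
def apriori_frequent_itemsets(transactions, min_support):
    """Level-wise (breadth-first) frequent itemset generation.

    Candidates at level k are built only from frequent (k-1)-itemsets,
    following the Apriori principle."""
    n = len(transactions)
    support = lambda itemset: sum(1 for t in transactions if itemset <= t) / n
    items = {i for t in transactions for i in t}
    # Level 1: keep the single items which pass the minimum support.
    levels = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]
    k = 2
    while levels[-1]:
        prev = levels[-1]
        # Join frequent (k-1)-itemsets; discard unions that are not size k.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        levels.append({c for c in candidates if support(c) >= min_support})
        k += 1
    return [itemset for level in levels for itemset in level]

transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk"}]
frequent = apriori_frequent_itemsets(transactions, min_support=0.5)
print(sorted(sorted(s) for s in frequent))
# [['bread'], ['bread', 'butter'], ['bread', 'milk'], ['butter'], ['milk']]
```

Note that {milk, butter} is never counted: it co-occurs in only 1 of the 4 transactions, so it fails the support threshold and, by the Apriori principle, no superset of it is generated either.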
3.3. Clustering
Data clustering is a technique which uses unsupervised learning, or in other words discovers unknown structures. Clustering is the process of grouping sets of objects together into classes based on similarity measures and the behavior of the group. Instances within the same group are similar to each other, and are dissimilar to instances in other groups. Clustering is similar to classification in that it groups data into classes; however, the main difference is that clustering is unsupervised, and the classes are defined by the data alone, hence they are not predefined. Therefore, data to be analyzed is not compared to a model built from training data, but is rather compared to other data and clustered according to the level of similarity between them. Several representations of clusters are depicted in Figure 3.
Figure 3: Different Ways of Representing Clusters 
Figure 3(a) portrays how instances fall into different clusters by partitioning the space to show each cluster. Some algorithms allow one instance to belong to more than one cluster, so the diagram can lay out all the instances and draw overlapping subsets in a Venn diagram which represent each cluster, as shown in Figure 3(b). Additionally, some clustering algorithms associate the instances with clusters probabilistically rather than categorically. As depicted in Figure 3(c), for every instance there is a probability, or degree of membership, with which it belongs to each cluster. Furthermore, other clustering algorithms produce a hierarchical structure of clusters, such that at the top level the instance space is divided into a few clusters, each of which keeps dividing into its own sub-clusters at the next level down. Elements which are joined together in clusters at lower levels are more tightly clustered than those joined together at higher levels. Such diagrams, as shown in Figure 3(d), are called dendrograms.
The k-means algorithm is one of the most popular clustering techniques. The principle of k-means is to assign members to the same group according to their score of similarity. It relies on a similarity or difference measurement to group relevant data values into the same cluster. K-means uses an iterative looping method to group data into a predetermined number of clusters, k. If, for instance, k = 3, then we want the algorithm to return 3 different clusters of similar instances. The k-means algorithm starts the iterative process by randomly selecting k points from the raw data to represent the initial centroids, or center points, of the k clusters. So in our example, 3 random points would be selected as the initial representative centroids of the 3 clusters.
Next, the similarity or distance measure chosen is used to assign each of the remaining data points to its most similar cluster. The Euclidean distance is the most commonly used proximity measure to calculate the distance between each data point and each centroid, and to assign the point to the nearest, or most similar, group, which is the one closest in distance. The Euclidean distance between two n-dimensional points p and q, where p = (p1, p2,…,pn) and q = (q1, q2,…,qn), is calculated as follows:

d(p, q) = √((p1 − q1)² + (p2 − q2)² + … + (pn − qn)²)
Subsequently, after assigning all the points to clusters, the algorithm calculates the average of each cluster and assigns that value to the new representative centroid. The previous steps are repeated, and each of the data points is reassigned to the cluster of the nearest centroid. The new centroids are again calculated, and the process continues iteratively until each cluster has a stable centroid and cluster members no longer change their groups. Other stopping criteria, such as a maximum number of iterations or a percentage of movement of members, can also be set. The result of the k-means algorithm is a set of k clusters, where each data point is grouped with only one cluster based on its similarity to the other points in the cluster.
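The iterative loop described above can be sketched as follows; the sample points, the fixed seed, and the stopping rule (stable membership) are illustrative assumptions:

```python
import math
import random

def euclidean(p, q):
    """Euclidean distance between two n-dimensional points."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def kmeans(points, k, max_iters=100, seed=0):
    """Basic k-means sketch: random initial centroids from the data,
    then alternating assignment and update steps until membership stabilizes."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        new_assignment = [min(range(k), key=lambda c: euclidean(p, centroids[c]))
                          for p in points]
        if new_assignment == assignment:  # membership stable: stop iterating
            break
        assignment = new_assignment
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centroids, labels = kmeans(points, k=2)
print(labels)  # the two small points share one label, the two large ones the other
```

Because the two groups of points are well separated, the algorithm converges to the same partition regardless of which points are drawn as initial centroids.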
3.4. Decision trees
Another type of data analytics technique is the decision tree. Decision trees are used as predictive models to map observations about an attribute to conclusions about the attribute's target value. A decision tree is a hierarchical structure of nodes and directed edges which consists of three types of nodes. The root node is a node with no incoming edges and zero or more outgoing edges to other nodes. An internal node is a node in the middle levels of the tree, and has exactly one incoming edge and two or more outgoing edges. Finally, a leaf node has exactly one incoming edge and no outgoing edges, and is assigned a class label which provides the decision of the tree.
Each of the tree's nodes specifies a test of a certain attribute of the instance, and each descending branch from the node corresponds to one of the attribute's possible values. An instance is classified by moving down the tree: starting at the root node, testing the attribute specified by that node, and moving down the branch which corresponds to the value of the given attribute to a new node. The same process is repeated at that node, until a leaf node providing a decision is finally reached.
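This top-down traversal can be sketched with a small dictionary-based tree; the attributes, values, and labels below are hypothetical, invented only for illustration:

```python
# Hypothetical tree: internal nodes hold an attribute to test and one
# branch per possible value; leaf nodes hold a class label.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {"attribute": "humidity",
                  "branches": {"high": {"label": "no"},
                               "normal": {"label": "yes"}}},
        "overcast": {"label": "yes"},
        "rainy": {"label": "no"},
    },
}

def classify(node, instance):
    """Walk from the root, following the branch matching the instance's
    value for each tested attribute, until a leaf's label is reached."""
    while "label" not in node:
        value = instance[node["attribute"]]
        node = node["branches"][value]
    return node["label"]

print(classify(tree, {"outlook": "sunny", "humidity": "normal"}))  # yes
```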
The C4.5 algorithm is a commonly used extension of the earlier ID3 algorithm. Like ID3, it is built upon Hunt's algorithm for decision tree induction. In Hunt's algorithm, the decision tree is grown recursively by partitioning the training records into successively purer subsets. If all records at a certain node belong to the same class, then that node is declared a leaf node labeled with the name of the class. However, if the node contains records that belong to more than one class, an attribute test condition is selected to partition the records into smaller subsets. A child node is then created for each outcome of the test condition, and the records at the parent node are distributed to the child nodes based on their outcomes. The algorithm is then recursively applied to each child node until all records in the training set have been classified, all attributes have been split upon, or a specified criterion has been met.
C4.5 builds decision trees from data using the concept of information entropy. At each node in the tree, C4.5 chooses the attribute which most effectively splits the set of sample data at the node into subsets enriched in a particular class; in other words, it chooses the attribute which provides the split with the highest information gain. Therefore, at each node it calculates the information gain using the following formula, and makes its decision for the attribute split based on the highest result:
GAIN_split = Entropy(p) − Σ (i = 1 to k) (n_i / n) × Entropy(i)
P is the parent node which is split into k partitions, and n_i is the number of records in partition i. The entropy measures the homogeneity of a node. An entropy of 0 implies that all records belong to one class and provides the most information. The higher the entropy, the more evenly the records are distributed among classes, implying the least information. The entropy is calculated as follows, where p(j|t) is the frequency of class j at node t:
Entropy(t) = − Σ_j p(j|t) log₂ p(j|t)
The C4.5 algorithm differs from the ID3 algorithm in that it can handle both continuous and discrete attributes, as opposed to only discrete values, by creating a threshold and splitting the list of values into those that are greater than the threshold and those that are less than or equal to it. Furthermore, C4.5 can handle data with missing attribute values by allowing them to be marked with a question mark and excluding them from entropy and gain calculations. Finally, C4.5 also prunes the resulting tree by going back through it after its creation and replacing unhelpful branches with leaf nodes. This simplifies the tree, and removes unneeded checks and space.
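The entropy and gain formulas above translate directly into code. The parent counts and partitions below are a hypothetical, perfectly separating split, chosen so the numbers are easy to verify by hand:

```python
import math

def entropy(class_counts):
    """Entropy of a node, computed from its class frequency counts."""
    total = sum(class_counts)
    probs = [c / total for c in class_counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(parent_counts, partitions):
    """GAIN_split = Entropy(parent) - sum_i (n_i / n) * Entropy(partition_i)."""
    n = sum(parent_counts)
    weighted = sum(sum(part) / n * entropy(part) for part in partitions)
    return entropy(parent_counts) - weighted

# Hypothetical split: a parent with 10 'yes' / 10 'no' records divided
# into two pure partitions by some attribute test.
print(entropy([10, 10]))                               # 1.0 (maximally mixed)
print(information_gain([10, 10], [[10, 0], [0, 10]]))  # 1.0 (perfect split)
```

A pure partition has entropy 0, so a split that separates the classes completely recovers the full entropy of the parent as gain.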
4. Big data analytics tools and methods
Big data is too large to be handled by conventional means, and as the data grows larger, organizations purchase more powerful hardware and computational resources. However, the data keeps growing and performance needs increase, while the available resources have a maximum capacity and capability. According to EMC, the MapReduce paradigm is based on adding more computers or resources, rather than increasing the power or storage capacity of a single computer; in other words, scaling out rather than scaling up. The fundamental idea of MapReduce is breaking a task down into stages and executing the stages in parallel in order to reduce the time needed to complete the task.
MapReduce is a parallel programming model which is suitable for big data processing; Hadoop is a concrete platform which implements MapReduce. In MapReduce, data is split into distributable chunks, called shards. The steps to process those chunks are defined, and the big data processing is run in parallel on the chunks. This model is scalable: the bigger the data processing becomes, or the more computational resources are required, the more machines can be added to process the chunks.
The first phase of a MapReduce job is to map input values to a set of key/value pairs as output. Thus, unstructured data such as text can be mapped to a structured key/value pair, where, in this case, the key could be a word in the text and the value the number of occurrences of that word. This output is then the input to the "Reduce" function, which performs the collection and combination of the mapped output. Assume, for example, that we have millions of text documents and would like to count the occurrences of a certain word. The text documents would be divided among several workers, or machines, which perform parallel processing. These workers act as mappers and map the desired word to its number of occurrences in the text documents given to them. The reducers then aggregate these counts, giving the total count across the millions of text documents.
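The word-count example can be sketched sequentially in Python; in Hadoop the mapper calls would run in parallel across workers, and the shard contents here are invented:

```python
from collections import defaultdict

def mapper(document, target_word):
    """Map phase: emit a (word, 1) pair for each occurrence of the target word."""
    return [(target_word, 1) for word in document.split() if word == target_word]

def reducer(key, values):
    """Reduce phase: aggregate all counts emitted for one key."""
    return key, sum(values)

# Hypothetical shards: each worker would receive its own chunk of documents.
shards = ["big data is big", "data about data", "big big big"]
intermediate = defaultdict(list)
for shard in shards:                  # in Hadoop, these map calls run in parallel
    for key, value in mapper(shard, "big"):
        intermediate[key].append(value)
results = [reducer(k, v) for k, v in intermediate.items()]
print(results)  # [('big', 5)]
```

The shuffle step that groups intermediate values by key, done here with a dictionary, is what the framework performs between the map and reduce phases.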
Hadoop is a framework for performing big data analytics which provides reliability, scalability, and manageability by providing an implementation of the MapReduce paradigm as well as gluing the storage and analytics together. Hadoop consists of two main components: the Hadoop Distributed File System (HDFS) for big data storage, and MapReduce for big data analytics. The HDFS storage function provides a redundant and reliable distributed file system which is optimized for large files. Data is stored in replicated file blocks across multiple Data Nodes, and the Name Node acts as a regulator between the client and the Data Nodes, directing the client to the particular Data Node which contains the requested data. Additionally, the data processing and analytics functions are performed by MapReduce, which consists of a Java API as well as software to implement the services which Hadoop needs to function.
The MapReduce function within Hadoop depends on two different types of nodes: the Job Tracker and the Task Tracker nodes. The Job Tracker nodes are responsible for distributing the Mapper and Reducer functions to the available Task Trackers, as well as monitoring the results. The Task Tracker nodes, on the other hand, actually run the jobs and communicate results back to the Job Tracker. This communication between nodes is often through files and directories in HDFS, so that direct inter-node communication is minimized.
Figure 4: MapReduce and HDFS 
Figure 4 shows how the MapReduce nodes and the HDFS work together. At step 1, there is a very large dataset including log files, sensor data, or anything of the sort. The HDFS stores replicas of the data, represented by the blue, yellow, beige, and pink icons, across the Data Nodes. In step 2, the client defines and executes a map job and a reduce job on a particular data set, and sends them both to the Job Tracker. The Job Tracker then distributes the jobs across the Task Trackers in step 3. A Task Tracker runs the mapper, and the mapper produces output which is stored in the HDFS file system. Finally, in step 4, the reduce job runs across the mapped data in order to produce the result.
5. Big Data Challenges
Several issues will have to be addressed in order to capture the full potential of big data. Policies related to privacy, security, intellectual property, and even liability all need to be addressed in a big data world. Organizations need to put the right talent and technology in place, and additionally structure workflows and incentives to optimize the use of big data. Access to data is critical, and companies will increasingly need to integrate information from multiple data sources, often from third parties or different locations. Furthermore, questions on how to store and analyze data with high volume, variety, and velocity have arisen, and current research has yet to provide complete answers.
Consequently, the biggest problem has become not only the sheer volume of data, but the fact that the type of data companies must deal with is changing. In order to accommodate for the change in data, the approaches for storing data have changed throughout the years. Data storage started with data warehouses, data marts, data cubes, and then moved on to master data management, data federation and other techniques such as in-memory databases. However, database suppliers are still struggling to cope with enormous amounts of data, and the emergence of interest in big data has led to a need for storing and managing such large amounts of data.
Several consultants and organizations have tried to come up with solutions for storing and managing big data. Thus, Longbottom recommends that organizations carefully research the following aspects of a suggested big data solution before adopting one:
• Can this solution deal with different data types, including text, image, video and sound?
• Can this solution deal with disparate data sources, both within and outside of the organization's environment?
• Will the solution create a new, massive data warehouse that will only make existing problems worse, or will it use metadata and pointers to minimize data replication and redundancy?
• How can, and will, the solution present findings back to the organization, and will this only be based on what has already happened, or can it predict with some degree of certainty what may happen in the future?
• How will the solution deal with back-up and restore of data? Is it inherently fault tolerant and can more resource easily be applied to the system as required?
Thus, one of the challenges of big data is finding or creating a solution which meets the above criteria with respect to the organization's needs.
6. Conclusion
In this paper we examined the concept of big data, as well as some of its opportunities and challenges. In the first section of the paper, we discussed big data in general, as well as some of its common characteristics. After looking at the importance of big data and its management within organizations, and the value it can add, we discussed big data analytics as an option for data management and the extraction of essential information from such large amounts of data. Association rules, clustering, and decision trees were covered.
However, with enormous amounts of data, performing typical analytics is not enough. Thus, in the following section, we discussed Hadoop, which consists of HDFS and MapReduce and facilitates both the storage of big data and its parallel processing. Finally, we covered the challenges which arise when dealing with big data and which still need further research. Future research can include applying the big data analytics methods discussed to real business cases within organizations facing big data problems. Furthermore, the challenges related to big data previously discussed can be tackled or studied in more detail.
Thus, we have seen that big data is a very important concept nowadays which comes with many opportunities and challenges. Organizations need to seize these opportunities and face the challenges in order to get the most value and knowledge out of their massive data piles.