Data exploration is an important part of data science. It is the process of analyzing data in order to understand its characteristics, which helps in deciding how to pre-process the data and which analysis techniques to apply.
A database is a repository where we store data, processed or unprocessed. Database systems are commonly grouped into two types, according to the workload they serve:
- OLTP (Online Transaction Processing): a database technique that records day-to-day activities. It holds current data: as new transactions arrive, values are updated, and old data that is no longer frequently used is removed.
- OLAP (Online Analytical Processing): this stores historical data, usually in summarized form. It does not delete old data, but aggregates and stores it.
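The difference between the two workloads can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the `accounts` and `sales_history` tables and their values are hypothetical, and real OLTP and OLAP systems usually run on separate, specialized servers rather than in one in-memory database.

```python
import sqlite3

# In-memory database with two hypothetical tables (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("CREATE TABLE sales_history (year INTEGER, amount INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100)")
conn.executemany(
    "INSERT INTO sales_history VALUES (?, ?)",
    [(2022, 300), (2022, 200), (2023, 400)],
)

# OLTP-style work: a small transaction that updates the current state.
with conn:  # commits on success, rolls back on error
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
balance = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]

# OLAP-style work: a read-only query that summarizes historical rows.
yearly_totals = conn.execute(
    "SELECT year, SUM(amount) FROM sales_history GROUP BY year ORDER BY year"
).fetchall()

print(balance)
print(yearly_totals)
```

The first query changes one row and keeps only the latest value, while the second leaves the history untouched and reports one summarized figure per year.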
These are very basic concepts when it comes to handling data. Now consider what OLAP, the database that stores historical data, has to do when we feed it large amounts of data in the range of terabytes to petabytes: it needs to access, summarize and store that data for later retrieval. The data should be stored in such a way that it is easy to retrieve again. We also need to make sure that the data cannot be altered through unauthorized access, so permissions should be given only to specific authorized users.
Summary statistics-
These are quantities like the mean, median and standard deviation that are used to describe the characteristics of a large set of data with a small set of numbers. Examples of data summarized this way include annual income year after year, the number of students graduating from a university every year, and many more.
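As a minimal sketch, these three quantities can be computed with Python's standard `statistics` module. The income figures below are made up for illustration and are not real data.

```python
import statistics

# Hypothetical annual incomes (illustrative values only).
annual_incomes = [42000, 45000, 47000, 52000, 150000]

mean_income = statistics.mean(annual_incomes)      # arithmetic average
median_income = statistics.median(annual_incomes)  # middle value of the sorted data
income_spread = statistics.stdev(annual_incomes)   # sample standard deviation

print(f"mean={mean_income}, median={median_income}, stdev={income_spread:.1f}")
```

In this sample the single large income pulls the mean well above the median, which is one reason both are usually reported together.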
When it comes to summarizing data, there are several measures to consider. They are:
- Percentiles: for ordered data, it is often useful to look at percentiles. A percentile is the value below which a given percentage of observations fall. For example, the 90th percentile is the value below which 90% of the observations lie.
- Mean and median: when we work with a large amount of data, we may need the average of the values or the middle value of the ordered data, for which we use the mean and the median respectively.
- Frequencies and mode: the frequency of a value is the number of times it occurs in the data. The mode is the value with the highest frequency.
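The measures in this list can be sketched with the standard library as well. The scores are hypothetical, and the `"inclusive"` quantile method is just one of several common conventions for interpolating percentiles, chosen here as an assumption.

```python
import statistics
from collections import Counter

# Hypothetical exam scores (illustrative values only).
scores = [55, 60, 60, 65, 70, 70, 70, 80, 90, 95]

# Frequencies: how many times each distinct value occurs.
frequencies = Counter(scores)

# Mode: the value with the highest frequency.
mode_score = statistics.mode(scores)

# Quartiles: the 25th, 50th and 75th percentiles of the data.
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")

print(frequencies.most_common(3))
print(mode_score, q1, q2, q3)
```

The 50th percentile returned here is the same value as the median, which ties the percentile and median entries of the list together.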
A single viewpoint is not enough to consider all these aspects. We need to look at the data along several dimensions, in other words, multidimensionally. This lets us aggregate the data in many ways. Computing those aggregate values involves fixing specific values for some dimensions, summarizing over all possible values of the others, and then feeding the result into storage. The architecture used for this kind of process is called the multi-tier architecture. It first fetches data from external sources, then extracts, cleans, transforms, loads and refreshes that data, and stores it in data marts, which are part of the data warehouse. That data is then represented in multidimensional OLAP servers, where it is monitored statistically and users are allowed to query it.
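The idea of fixing some dimensions and summarizing over the rest can be sketched in a few lines of Python. The sales records, the dimension names and the helper functions `roll_up` and `slice_facts` are all hypothetical. An actual OLAP server would precompute and index such aggregates rather than scan a list.

```python
from collections import defaultdict

# Hypothetical sales facts: (year, region, product, amount).
sales = [
    (2022, "North", "laptop", 1200),
    (2022, "South", "laptop", 800),
    (2022, "North", "phone", 500),
    (2023, "North", "laptop", 1500),
    (2023, "South", "phone", 700),
]

DIMENSIONS = ("year", "region", "product")


def roll_up(facts, keep):
    """Sum amounts grouped by the dimensions in `keep`, summarizing over the rest."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for *dims, amount in facts:
        key = tuple(dims[p] for p in positions)
        totals[key] += amount
    return dict(totals)


def slice_facts(facts, dimension, value):
    """Fix one dimension to a single value and keep only the matching facts."""
    position = DIMENSIONS.index(dimension)
    return [fact for fact in facts if fact[position] == value]


# Summarize over region and product, keeping only the year dimension.
by_year = roll_up(sales, keep=("year",))

# Fix region to "North", then summarize over year, keeping product.
north_by_product = roll_up(slice_facts(sales, "region", "North"), keep=("product",))

print(by_year)
print(north_by_product)
```

Each call answers a different question from the same facts, which is what makes the multidimensional view useful for aggregation.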
All of those were just the basics. Data science is a vast collection of many such concepts.
Resource box-
As we can see, data science is a very large and complex field, and almost everything that we deal with today is related to data. This is the reason why data science is considered one of the fastest-emerging career fields today. To make this career accessible to a wide range of people, Excelr offers a data science course that will help you build your career and develop future technology.