Data cleaning methods include: 1. Binning, which places the data to be processed into bins according to certain rules and then examines and handles each bin; 2. Regression, which fits a function to the data and uses the fitted curve to smooth it; 3. Clustering, which groups objects into sets and treats the isolated points that fall outside those sets as noise.
The operating environment of this article: Windows 7 system, Dell G3 computer.
What does data cleaning include?
There are three methods for cleaning data: the binning method, the regression method, and the clustering method.
1. Binning method
Binning is a frequently used method. The data to be processed is placed into bins (boxes) according to certain rules, the data in each bin is then examined, and a treatment is chosen for each bin based on its actual contents.
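The text does not say which binning rule or treatment to use, so the following is only a minimal sketch under two assumptions: the bins are equal-frequency bins over the sorted values, and each value is replaced by the mean of its bin. The function name `smooth_by_bin_means` and the sample `prices` are hypothetical.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size items,
    and replace every value with the mean of its bin."""
    if bin_size < 1:
        raise ValueError("bin_size must be at least 1")
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]  # last bin may be shorter
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed


# Hypothetical sample data, used only for illustration.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed_prices = smooth_by_bin_means(prices, bin_size=3)
print(smoothed_prices)
```

With three values per bin, the three bins have means 9, 22 and 29, so each original value is replaced by one of those numbers. Other treatments, such as using the bin median or the nearest bin boundary, would fit the same structure.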
2. Regression method
The regression method fits a function to the data and uses the fitted curve to smooth the data. There are two types: simple linear regression and multiple linear regression. Simple linear regression finds the best-fitting straight line between two attributes, so that one attribute can be predicted from the other. Multiple linear regression involves more than two attributes and fits the data to a multidimensional surface, which smooths out the noise.
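As an illustration of the simple (two-attribute) case, here is a small sketch that fits a straight line by ordinary least squares and replaces each observed value with the value on the line. Least squares is assumed here as the meaning of "best" line; the names `fit_line`, `smooth_with_line` and the sample data are hypothetical.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the line is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth_with_line(xs, ys):
    """Replace each y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Hypothetical noisy measurements that roughly follow y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]
slope, intercept = fit_line(xs, ys)
smoothed_ys = smooth_with_line(xs, ys)
print(slope, intercept)
print(smoothed_ys)
```

The smoothed values all lie on one line, so the small deviations in the original `ys` are removed. The multiple-attribute case follows the same idea but usually relies on a linear algebra library.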
3. Clustering method
The idea behind the clustering method is relatively simple, but carrying it out is more complicated. Clustering groups objects into sets of similar objects and then looks for isolated points that fall outside those sets; these isolated points are treated as noise. In this way the noise can be found directly and then removed.
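The article does not name a specific clustering algorithm. The sketch below assumes a very simple one for one-dimensional data: sorted values belong to the same cluster while the gap to the previous value stays within `max_gap`, and any cluster smaller than `min_size` is treated as isolated noise. All names and thresholds are hypothetical.

```python
def cluster_by_gap(values, max_gap):
    """Group sorted values; a new cluster starts when the gap exceeds max_gap."""
    ordered = sorted(values)
    if not ordered:
        return []
    clusters = [[ordered[0]]]
    for value in ordered[1:]:
        if value - clusters[-1][-1] <= max_gap:
            clusters[-1].append(value)
        else:
            clusters.append([value])
    return clusters


def split_noise(values, max_gap, min_size=2):
    """Return (kept, noise): points in clusters of at least min_size, and the rest."""
    clusters = cluster_by_gap(values, max_gap)
    kept = [v for cluster in clusters if len(cluster) >= min_size for v in cluster]
    noise = [v for cluster in clusters if len(cluster) < min_size for v in cluster]
    return kept, noise


# Hypothetical readings: two dense groups and one isolated value.
readings = [10, 11, 12, 13, 50, 30, 31, 32]
kept, noise = split_noise(readings, max_gap=2)
print(kept)
print(noise)
```

Here the value 50 ends up alone in its own cluster and is reported as noise, while the two dense groups are kept. Real data with several attributes would typically call for a library clustering algorithm, but the cleaning step (drop points that belong to no sizeable cluster) is the same.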
Extended information:
As the name suggests, data cleaning means "washing away" the "dirt". It is the last procedure for discovering and correcting identifiable errors in data files, and includes checking data consistency and handling invalid and missing values.
The data in a data warehouse is a collection of data oriented to a certain topic. It is extracted from multiple business systems and contains historical data, so it is unavoidable that some of the data is incorrect and some of it conflicts with other data. Such erroneous or conflicting data is obviously unwanted and is called "dirty data".
We need to "wash out" the dirty data according to certain rules; this is data cleaning. Its task is to filter out the data that does not meet the requirements and hand the filtered results to the responsible business department, which confirms whether the data should be discarded or corrected by the business unit before extraction.
The data that does not meet the requirements falls mainly into three categories: incomplete data, erroneous data, and duplicate data. Data cleaning differs from questionnaire review: cleaning after data entry is generally done by computer rather than manually.
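The three categories can be illustrated with a small filter that sorts records into clean, incomplete, erroneous and duplicate groups, so that the rejected groups can be handed over for confirmation. This is only a sketch: the record layout, the field names, the validity rule and the function `classify_records` are assumptions made for the example, not part of any particular system.

```python
def classify_records(records, required_fields, is_valid):
    """Sort records into clean / incomplete / erroneous / duplicate groups."""
    seen = set()
    report = {"clean": [], "incomplete": [], "erroneous": [], "duplicate": []}
    for record in records:
        # Incomplete: a required field is missing or empty.
        if any(record.get(field) in (None, "") for field in required_fields):
            report["incomplete"].append(record)
        # Erroneous: all fields are present but the business rule fails.
        elif not is_valid(record):
            report["erroneous"].append(record)
        else:
            # Duplicate: the same required-field values were already accepted.
            key = tuple(record[field] for field in required_fields)
            if key in seen:
                report["duplicate"].append(record)
            else:
                seen.add(key)
                report["clean"].append(record)
    return report


# Hypothetical order records and a hypothetical rule (amount must not be negative).
orders = [
    {"id": 1, "customer": "Ann", "amount": 120.0},
    {"id": 2, "customer": "", "amount": 80.0},
    {"id": 3, "customer": "Bob", "amount": -5.0},
    {"id": 1, "customer": "Ann", "amount": 120.0},
]
report = classify_records(
    orders,
    required_fields=["id", "customer", "amount"],
    is_valid=lambda record: record["amount"] >= 0,
)
for category, rows in report.items():
    print(category, rows)
```

Keeping the rejected records in separate groups, instead of silently dropping them, matches the workflow described above in which the business department decides whether each one is discarded or corrected.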