
Data Visualisation Basics

Sep 07, 2024, 02:32 PM

Why use data visualisation

When you start working with a new data source, especially one with a huge amount of data, visualization helps you understand the data much faster.
The data analysis process is usually done in 5 steps:

  1. Extract - Obtain the data from a spreadsheet, SQL, the web, etc.

  2. Clean - Here we could use exploratory visuals.

  3. Explore - Here we use exploratory visuals.

  4. Analyze - Here we might use either exploratory or explanatory visuals.

  5. Share - Here is where explanatory visuals live.

Types of data

To be able to choose an appropriate plot for a given measure, it is important to know what data you are dealing with.

Qualitative aka categorical types

Nominal qualitative data

Labels with no order or rank associated with the items themselves.
Examples: Gender, marital status, menu items

Ordinal qualitative data

Labels that have an order or ranking.
Examples: letter grades, ratings

Quantitative aka numeric types

Discrete quantitative values

Numbers that cannot be split into smaller units.
Examples: pages in a book, number of trees in a park

Continuous quantitative values

Numbers that can be split into smaller units.
Examples: height, age, income, work hours

Summary Statistics

Numerical Data

Mean: The average value.
Median: The middle value when the data is sorted.
Mode: The most frequently occurring value.
Variance/Standard Deviation: Measures of spread or dispersion.
Range: Difference between the maximum and minimum values.
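
The numeric summary statistics above can be sketched with Python's built-in `statistics` module (the sample values here are made up for illustration):

```python
import statistics

data = [4, 8, 6, 5, 3, 2, 8, 9, 2, 5, 8]

mean = statistics.mean(data)          # average value
median = statistics.median(data)      # middle value of the sorted data, here 5
mode = statistics.mode(data)          # most frequent value, here 8
variance = statistics.variance(data)  # sample variance (spread)
stdev = statistics.stdev(data)        # sample standard deviation
value_range = max(data) - min(data)   # max minus min, here 9 - 2 = 7

print(mean, median, mode, variance, stdev, value_range)
```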

Categorical Data

Frequency: The count of occurrences of each category.
Mode: The most frequent category.
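
For categorical data, frequency and mode boil down to counting, e.g. with `collections.Counter` (the category values are invented for this sketch):

```python
from collections import Counter

colors = ["red", "blue", "red", "green", "red", "blue"]

freq = Counter(colors)                      # count of each category
mode_cat, mode_count = freq.most_common(1)[0]  # most frequent category

print(freq)      # Counter({'red': 3, 'blue': 2, 'green': 1})
print(mode_cat)  # red
```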

Visualizations

Visualizations let you gain insights into a new data source very quickly and make connections between different data types easier to see.
If you only use the standard statistics to summarize your data, you get the min, max, mean, median and mode, and these can be misleading. Anscombe's Quartet shows this: all four datasets share the same mean and deviation, yet their distributions are completely different.
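
The Anscombe effect can be reproduced with a toy example: two made-up datasets with identical mean and variance but clearly different shapes (a minimal stand-in for the real quartet):

```python
import statistics

a = [1, 2, 8, 9]   # values clustered at the extremes
b = [0, 5, 5, 10]  # values clustered around the center

# Identical summary statistics...
print(statistics.mean(a), statistics.mean(b))          # both 5
print(statistics.variance(a), statistics.variance(b))  # both 50/3

# ...but plotting them would reveal very different distributions.
```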

In data visualization, we have two types:

  1. Exploratory data visualization - used to get insights about the data. It does not need to be visually appealing.
  2. Explanatory data visualization - these visualizations need to be accurate, insightful and visually appealing, as they are presented to an audience.

Chart Junk, Data Ink Ratio and Design Integrity

Chart Junk

To read the information provided by a plot without distraction, it is important to avoid chart junk, such as:

  • Heavy grid lines
  • Pictures in the visuals
  • Shades
  • 3d components
  • Ornaments
  • Superfluous text

Data Ink Ratio

The less chart junk a visual contains, the higher its data-ink ratio. In other words: the more of the "ink" in the visual is used to convey the message of the data, the better.

Design Integrity

The Lie Factor is calculated as:

$$
\text{Lie Factor} = \frac{\text{Size of effect shown in graphic}}{\text{Size of effect in data}}
$$

The size of an effect is the relative change (the delta, i.e. the difference, divided by the starting value). So the lie factor is the relative change shown in the graphic divided by the actual relative change in the data. Ideally it should be 1; if it is not, there is a mismatch between the way the data is presented and the actual change.
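
The calculation is a simple ratio; the numbers below are hypothetical, just to make the arithmetic concrete:

```python
def lie_factor(graphic_effect: float, data_effect: float) -> float:
    """Size of effect shown in the graphic divided by size of effect in the data."""
    return graphic_effect / data_effect

# Hypothetical example: the data grows by 50%, but the graphic's bar
# grows by 150% -- the chart exaggerates the change by a factor of 3.
print(lie_factor(1.50, 0.50))  # 3.0

# An honest chart has a lie factor of 1.
print(lie_factor(0.50, 0.50))  # 1.0
```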

In the example above, taken from the wiki, the lie factor is 3 when comparing the pixel sizes of the doctor icons, which represent the number of doctors in California.


Tidy data

Make sure your data is cleaned properly and ready to use:

  • each variable is a column
  • each observation is a row
  • each type of observational unit is a table
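
Tidying often means reshaping a "wide" table into "long" form, e.g. with pandas' `melt` (the column and value names here are invented for illustration):

```python
import pandas as pd

# A "wide" table: one row per student, one column per exam -- not tidy,
# because the variable "exam" is hidden in the column headers.
wide = pd.DataFrame({
    "student": ["Ana", "Ben"],
    "exam1": [85, 78],
    "exam2": [90, 82],
})

# Tidy form: each variable is a column, each observation is a row.
tidy = wide.melt(id_vars="student", var_name="exam", value_name="score")
print(tidy)
```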

Univariate Exploration of Data

This refers to the analysis of a single variable (or feature) in a dataset.

Bar Chart

  • always start the value axis at 0 to present values in a truly comparable way.
  • sort nominal data by frequency.
  • don't sort ordinal data by frequency - the built-in order of the categories matters more than which category appears most often.
  • if you have many categories, use a horizontal bar chart with the categories on the y-axis to keep the labels readable.
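
A minimal matplotlib sketch of these rules, using made-up menu-item counts (nominal data sorted by frequency, horizontal bars, value axis starting at 0):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
from collections import Counter

orders = ["pasta", "pizza", "pizza", "salad", "pizza", "pasta"]
counts = Counter(orders)

# Sort the nominal categories by frequency, plot them horizontally,
# and keep the value axis anchored at 0.
labels, values = zip(*counts.most_common())
fig, ax = plt.subplots()
ax.barh(labels, values)
ax.set_xlim(left=0)
ax.set_xlabel("count")
fig.savefig("menu_counts.png")
```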


Histogram

  • the quantitative version of a bar chart, used to plot numeric values.
  • values are grouped into continuous bins, and one bar is plotted per bin.
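
The binning step can be sketched in plain Python (the values and bin width are made up; a plotting library would then draw one bar per bin):

```python
# Group continuous values into equal-width bins and count per bin --
# exactly what a histogram does before drawing the bars.
values = [1.2, 1.9, 2.5, 3.1, 3.3, 3.8, 4.6, 7.0]
bin_width = 2.0
lo = min(values)

counts = {}
for v in values:
    b = int((v - lo) // bin_width)  # index of the bin this value falls into
    counts[b] = counts.get(b, 0) + 1

print(counts)  # {0: 4, 1: 3, 2: 1}
```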

KDE - Kernel Density Estimation

  • a kernel, often a Gaussian (normal) distribution, is placed on each data point to estimate the density at each position.
  • KDE plots can reveal trends and the shape of the distribution more clearly, especially for data that is not uniformly distributed.
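
A from-scratch sketch of the idea (not a library call): the estimate at a point is the average of Gaussian bumps centered on the data points. The data and bandwidth are invented for illustration:

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function estimating density as an average of Gaussian bumps."""
    n = len(data)
    def density(x):
        return sum(
            math.exp(-((x - xi) / bandwidth) ** 2 / 2)
            / (bandwidth * math.sqrt(2 * math.pi))
            for xi in data
        ) / n
    return density

data = [1.0, 1.2, 2.8, 3.0, 3.1]
kde = gaussian_kde(data, bandwidth=0.5)

# The density peaks near the cluster around 3, not in the empty gap near 2.
print(kde(3.0), kde(2.0))
```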

Pie Chart and Donut Plot

  • data needs to be in relative frequencies
  • pie charts work best with at most 3 slices. With more wedges they become unreadable and the amounts are hard to compare; in that case, prefer a bar chart.
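
Converting counts into the relative frequencies a pie chart needs is a one-liner (the survey answers are made up for this sketch):

```python
from collections import Counter

answers = ["yes", "yes", "no", "yes", "maybe", "no"]
counts = Counter(answers)
total = sum(counts.values())

# Each wedge is its category's share of the whole; shares sum to 1.
shares = {k: v / total for k, v in counts.items()}
print(shares)
```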

Bivariate Exploration of Data

Analyzes the relationship between two variables in a dataset.

Clustered Bar Charts

  • displays the relationship between two categorical variables. The bars are organized in clusters based on the levels of the first variable.

Scatterplots

  • each data point is plotted individually as a point, with its x-position corresponding to one feature value and its y-position to the second.
  • if the plot suffers from overplotting (too many data points overlap), you can use transparency and jitter (each point is moved slightly away from its true value).
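
Jitter is just a small random offset added before plotting. A minimal sketch with made-up repeated ratings (the `amount` is an arbitrary choice):

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

def jitter(value, amount=0.1):
    """Shift a point slightly off its true value to reduce overplotting."""
    return value + random.uniform(-amount, amount)

# Many identical ratings would stack on one spot in a scatterplot;
# jittering spreads them out while staying close to the true values.
ratings = [3, 3, 3, 4, 4, 5]
jittered = [jitter(r) for r in ratings]
print(jittered)
```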

Heatmaps

  • the 2D version of a histogram.
  • data points are placed with their x-position corresponding to one feature value and their y-position to the second.
  • the plotting area is divided into a grid, the points falling into each cell are counted, and the counts are indicated by color.
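
The grid-counting step can be sketched in plain Python (points and cell size are invented; a heatmap would then map each count to a color):

```python
# 2D histogram: divide the plane into grid cells and count points per cell.
points = [(0.2, 0.3), (0.4, 0.1), (1.5, 0.5), (1.7, 1.8), (0.3, 1.2)]
cell = 1.0  # grid cell size in both directions

grid = {}
for x, y in points:
    key = (int(x // cell), int(y // cell))  # which cell this point lands in
    grid[key] = grid.get(key, 0) + 1

print(grid)  # {(0, 0): 2, (1, 0): 1, (1, 1): 1, (0, 1): 1}
```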

Violin plots

  • show the relationship between a quantitative (numerical) and a qualitative (categorical) variable at a lower level of abstraction.
  • the distribution on each categorical level is plotted like a kernel density estimate, so the shape of the distribution stays visible.
  • to display the key statistics at the same time, you can embed a box plot inside a violin plot.

Box plots

  • also plots the relationship between a quantitative (numerical) and a qualitative (categorical) variable, but at a higher level of abstraction.
  • compared to the violin plot, the box plot leans more on the summarization of the data, primarily just reporting a set of descriptive statistics for the numeric values on each categorical level.
  • it visualizes the five-number summary of the data: minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum.

Key elements of a boxplot:
Box: The central part of the plot represents the interquartile range (IQR), which is the range between the first quartile (Q1, 25th percentile) and the third quartile (Q3, 75th percentile). This contains the middle 50% of the data.

Median Line: Inside the box, a line represents the median (Q2, 50th percentile) of the dataset.

Whiskers: Lines extending from the box, known as "whiskers," show the range of the data that lies within 1.5 times the IQR from Q1 and Q3. They typically extend to the smallest and largest values within this range.

Outliers: Any data points that fall outside 1.5 times the IQR are considered outliers and are often represented by individual dots or marks beyond the whiskers.
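
The five-number summary and the 1.5 × IQR outlier rule can be computed with the `statistics` module (the data, including the deliberately extreme 30, is made up for illustration):

```python
import statistics

data = [2, 3, 4, 5, 5, 6, 7, 8, 9, 30]  # 30 is a suspiciously large value

# Quartiles via the default "exclusive" method: Q1, median (Q2), Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Whiskers reach at most 1.5 * IQR beyond the box; points outside are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q2, q3, outliers)  # 3.75 5.5 8.25 [30]
```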

Combined Violin and Box Plot

The violin plot shows the density across the different categories, and the box plot provides the summary statistics.

Faceting

  • the data is divided into disjoint subsets, most often by the levels of a categorical variable. For each subset, the same plot type is rendered on the other variables, e.g. several histograms side by side, one per categorical level.
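
A minimal matplotlib sketch of faceting, using invented data: one histogram per categorical level, drawn side by side with a shared y-axis:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# One subset of values per categorical level (made-up data).
data = {
    "A": [1, 2, 2, 3, 3, 3],
    "B": [4, 5, 5, 6],
    "C": [7, 7, 8],
}

# Same plot type (a histogram) rendered once per subset.
fig, axes = plt.subplots(1, len(data), sharey=True)
for ax, (level, values) in zip(axes, data.items()):
    ax.hist(values, bins=3)
    ax.set_title(level)
fig.savefig("facets.png")
```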

Line plot

  • used to plot the trend of one numeric variable against a second variable.

Quantile-Quantile (Q-Q) plot

  • a plot used to compare the distribution of a dataset with a theoretical distribution (like a normal distribution), or to compare two datasets to check whether they follow the same distribution.
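
The core of a two-sample Q-Q plot is just pairing the sorted values of both samples; if the pairs fall on a straight line, the distributions have the same shape. A sketch with invented samples (the second is the first scaled by 10):

```python
# Pair the empirical quantiles (sorted values) of two samples.
sample_a = [1, 3, 2, 5, 4]
sample_b = [10, 30, 20, 50, 40]

pairs = list(zip(sorted(sample_a), sorted(sample_b)))
print(pairs)  # [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)]
# The pairs lie on the line y = 10x: same distribution shape, different scale.
```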

Swarm plot

  • similar to a scatterplot: each data point is plotted with its position according to its value on the two variables being plotted. But instead of randomly jittering points as in a normal scatterplot, points are placed as close to their actual value as possible without allowing any overlap.

Spider plot

  • compares multiple variables across different categories on a radial grid. Also known as a radar chart.

Useful links

My sample notebook

Sample Code

Libs used for the sample plots:

  • Matplotlib: a versatile library for visualizations, but it can take some code effort to put together common visualizations.
  • Seaborn: built on top of matplotlib, adds a number of functions to make common statistical visualizations easier to generate.
  • pandas: while this library includes some convenient methods for visualizing data that hook into matplotlib, we'll mainly be using it for its main purpose as a general tool for working with data (https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf).

Further reading:

  • Anscombe's Quartet: same stats, but different distributions: https://en.wikipedia.org/wiki/Anscombe%27s_quartet
  • Chartjunk: https://en.wikipedia.org/wiki/Chartjunk
  • Data-Ink Ratio: https://infovis-wiki.net/wiki/Data-Ink_Ratio
  • Lie Factor: https://infovis-wiki.net/wiki/Lie_Factor
  • Tidy data: https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html
  • Colorblind-friendly visualizations: https://www.tableau.com/blog/examining-data-viz-rules-dont-use-red-green-together

