R data analysis

The objectives of the data analysis project are:

  1. To compare air pollution data from the XVPCA server and the meteocat XEMA server (e.g. January 2019 vs January 2020) in Barcelona to discover whether there is any statistically significant difference. Air pollution is expected to have improved recently because of the restrictions on old cars introduced by Barcelona's local regulations.
  2. To show data from 1999 to 2020 with powerful graphs using the openair R library, which has an excellent manual, and general-purpose exploratory data analysis libraries (e.g. ExPanDaR, ggplot2).
  3. To combine and organize air pollution data from meteocat and XVPCA using different R libraries (e.g. dplyr).
  4. To analyse normality using R normality tests and, depending on the results, choose the appropriate test to show possible statistical differences between cities, seasons, weekdays, weekends, years, etc.
  5. To analyse all the values above the limits approved by the European Union, the limit values set by the World Health Organization, etc. As an example, download the pdf report of a PM10 episode in January 2020 at the very end of this webpage.
  6. To forecast air pollution using deep learning and artificial intelligence algorithms in R in order to find out which one has the best predictive performance.
  7. To create a real-time app using the Socrata Open Data API to access air pollution data from the Government of Catalonia. Examples: one by student Zakaria Sadiki, one using weather.io by student David Contreras, and one using OpenWeatherMap data by student David Díaz.

Tools to be used:

To analyse data you need a pendrive (USB drive) to hold the software: download the R installer from CRAN and the RStudio zip file onto it. First run the R installer and install R on the pendrive, then unzip RStudio onto the same pendrive. To finish the installation, start RStudio by double-clicking rstudio.exe in the bin folder (create a shortcut on your pendrive for convenience). The first time, you need to link RStudio to the R installation, as shown in the next images:

AIR POLLUTION ANALYSIS USING R SOFTWARE
It is very important to use correct comma-separated values (csv) files. Save the data as csv in LibreOffice Calc or Microsoft Excel, then, in this order, replace all decimal commas with points, all semicolons with commas, and finally every "Sense dades" (Catalan for "no data") with NA. After that you can load your data in RStudio using the instructions in the table below.
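As an alternative to the manual search-and-replace steps, the raw export can probably be read directly by telling R about the separator, the decimal mark and the missing-value label. The sketch below assumes a semicolon-separated file with decimal commas. The sample rows and column names are invented for illustration and are not real station data.

```r
# Hypothetical sample mimicking a raw export: semicolon separators,
# decimal commas and "Sense dades" for missing values.
raw_lines <- c(
  "date;NO;NO2",
  "2015-01-01;12,5;30,1",
  "2015-01-02;Sense dades;28,4",
  "2015-01-03;9,8;Sense dades"
)
csv_path <- tempfile(fileext = ".csv")
writeLines(raw_lines, csv_path)

# sep, dec and na.strings replace the three manual find-and-replace steps
station <- read.csv(csv_path, sep = ";", dec = ",",
                    na.strings = "Sense dades", stringsAsFactors = FALSE)

str(station)             # NO and NO2 should be numeric columns
colSums(is.na(station))  # number of missing values per column
```

Reading the file this way leaves the original export untouched, which makes the import step easy to repeat for other stations or years.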

To tidy messy data, use the tidyr library to reshape the table contents, e.g. with the gather and unite functions (see the tidyr cheatsheet), and use the lubridate library (see the lubridate cheatsheet) to work with dates. The goal of these manipulations is a table similar to the csv shown next (save a copy with the write.csv function).

[Figures: the resulting csv file opened in Excel, Sublime Text and RStudio]
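The reshaping described above might look like the following sketch. It assumes a wide table with one row per day and one column per hour. The column names (h01, h02, h03) and the values are hypothetical, so adapt them to the real layout of the downloaded files.

```r
library(tidyr)
library(lubridate)

# Hypothetical wide table: one row per day, one column per hour
wide <- data.frame(
  year = c(2015, 2015), month = c(1, 1), day = c(1, 2),
  h01 = c(12.5, 10.1), h02 = c(11.0, NA), h03 = c(9.8, 8.7)
)

# gather() stacks the hour columns into key/value pairs (long format)
long <- gather(wide, key = "hour", value = "NO", h01:h03)

# unite() pastes year, month and day into a single text column
long <- unite(long, "date", year, month, day, sep = "-")

# ymd() from lubridate turns the text into a proper Date
long$date <- ymd(long$date)
long <- long[order(long$date, long$hour), ]

# Save a copy of the tidy table
out_path <- tempfile(fileext = ".csv")
write.csv(long, out_path, row.names = FALSE)
print(long)
```

In recent tidyr versions gather is superseded by pivot_longer, which does the same job with a slightly different interface. Either should work for this step.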

Task: R code

Read dataframe: martorell <- read.csv("E://PollutionData/martorell2015.csv")
Calculate quartiles, median, min and max (to spot outliers): summary(martorell)
Create boxplot: boxplot(martorell$NO)
Create histogram: hist(martorell$NO, breaks = 20, xlab = "NO (micrograms/m3)", main = "Martorell air pollution", col = "pink")
Install and load normality tests: install.packages("nortest"), then library(nortest)
Shapiro-Wilk normality test: shapiro.test(martorell$NO)
Anderson-Darling normality test: ad.test(martorell$NO)
Cramer-von Mises normality test: cvm.test(martorell$NO)
Kolmogorov-Smirnov (Lilliefors) normality test: lillie.test(martorell$NO)
Pearson chi-square normality test: pearson.test(martorell$NO)
Shapiro-Francia normality test: sf.test(martorell$NO)
Load skewness(x) and kurtosis(x): library(e1071)
Skewness (negative: left tail, positive: right tail, normal: 0): skewness(martorell$NO)
Kurtosis (negative: platykurtic, positive: leptokurtic, normal: 0): kurtosis(martorell$NO)
Student's t-test (compare 2 normal samples): t.test(martorell$NO, santandreu$NO)
Mann-Whitney U test (compare 2 non-normal samples): wilcox.test(martorell$NO, santandreu$NO)
Compare more than 2 normal groups: ANOVA test
Compare more than 2 non-normal groups: Kruskal-Wallis test
Homoscedasticity (equal variability): leveneTest(y, group) from the car library
View possible correlations: pairs(martorell)
Object type (numeric, Date, time series, data frame, etc.): class(x)
Convert to date: martorell$date <- as.Date(martorell$date)
Create time series (daily data): martorellNO.ts <- ts(martorell$NO, start = c(2015, 1), end = c(2015, 365), frequency = 365)
Bind column data: NOcompare <- cbind(martorellNO.ts, santandreuNO.ts)
Plot multiple time series: plot(NOcompare, plot.type = "m", col = c("blue", "red"))
Subset time series (days 172 to 265 are 21 June to 22 September): mytssummer2015 <- window(myts, start = c(2015, 172), end = c(2015, 265))
Correlation plot: library(corrplot), then corrplot.mixed(cor(martorell[sapply(martorell, is.numeric)], use = "complete.obs"))
Linear model: fit <- lm(NO ~ NO2, data = martorell), then summary(fit)
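The table names the ANOVA and Kruskal-Wallis tests without code. A minimal sketch with base R functions could look like this, assuming a long-format data frame with one row per measurement. The data are simulated and the station names are placeholders. Bartlett's test from base R is swapped in here for leveneTest so that the example runs without the car library.

```r
set.seed(1)

# Simulated NO readings for three hypothetical stations (not real data)
pollution <- data.frame(
  station = factor(rep(c("station_a", "station_b", "station_c"), each = 50)),
  NO = c(rnorm(50, mean = 20, sd = 5),
         rnorm(50, mean = 25, sd = 5),
         rnorm(50, mean = 20, sd = 5))
)

# Equal-variance check (Bartlett's test, used instead of car::leveneTest)
bartlett_result <- bartlett.test(NO ~ station, data = pollution)
print(bartlett_result)

# More than 2 groups, normal data: one-way ANOVA
anova_fit <- aov(NO ~ station, data = pollution)
print(summary(anova_fit))

# More than 2 groups, non-normal data: Kruskal-Wallis rank sum test
kruskal_result <- kruskal.test(NO ~ station, data = pollution)
print(kruskal_result)
```

Both tests only say that at least one group differs. Finding out which one needs a follow-up comparison, for example TukeyHSD(anova_fit) after the ANOVA.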

[Figure: hourly levels in Martorell]

Regarding the NO2 levels shown in the previous image:

Is Martorell following the NO2 air pollution standards of the EU?

What is the EU decision 3 about NO2 pollution? In Spanish (authentic and valid)

Analyse all EU air standards and compare them to the WHO air standards.
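One way to approach the first question is to count how many hours exceed the limit. The sketch below uses simulated hourly NO2 values, not real Martorell data. It assumes the NO2 limit values of Directive 2008/50/EC (an hourly mean of 200 micrograms/m3 not to be exceeded more than 18 times per calendar year, and an annual mean of 40 micrograms/m3), so check them against the current legislation before drawing conclusions.

```r
set.seed(42)

# Simulated hourly NO2 series for one year (8760 hours), not real data
hours <- seq(as.POSIXct("2015-01-01 00:00", tz = "UTC"),
             by = "hour", length.out = 8760)
no2 <- data.frame(date = hours,
                  NO2 = rlnorm(8760, meanlog = 3.3, sdlog = 0.6))

# Assumed EU limit values for NO2 (Directive 2008/50/EC), in micrograms/m3
eu_hourly_limit <- 200
eu_allowed_exceedances <- 18
eu_annual_limit <- 40

# Hours above the hourly limit, ignoring missing values
exceed <- no2[!is.na(no2$NO2) & no2$NO2 > eu_hourly_limit, ]
n_exceed <- nrow(exceed)
annual_mean <- mean(no2$NO2, na.rm = TRUE)

cat("Hours above the hourly limit:", n_exceed,
    "(allowed:", eu_allowed_exceedances, ")\n")
cat("Annual mean:", round(annual_mean, 1),
    "(limit:", eu_annual_limit, ")\n")
cat("Hourly limit respected:", n_exceed <= eu_allowed_exceedances, "\n")
cat("Annual limit respected:", annual_mean <= eu_annual_limit, "\n")
```

The same pattern can be reused for the WHO guideline values by changing the three limit variables.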

> plot(martorell$date, martorell$NO, type = "l", xlab = "year 2015",ylab = "Nitric oxide (microg/m3)")
[Figure: hourly NO plot]

Normality tests

install.packages("nortest")

[Figures: normality test output]
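One practical detail when testing a full year of hourly data: shapiro.test only accepts between 3 and 5000 values, so it stops with an error on 8760 observations. The sketch below shows this on simulated values (not real data) and falls back to the nortest functions when that library is installed.

```r
set.seed(3)

# Simulated hourly NO values for one year (8760 observations), not real data
hourly_no <- rlnorm(8760, meanlog = 2.5, sdlog = 0.6)

# shapiro.test() refuses samples larger than 5000; capture the error message
shapiro_result <- tryCatch(shapiro.test(hourly_no),
                           error = function(e) conditionMessage(e))
print(shapiro_result)

# The nortest tests have no such upper limit (run only if installed)
if (requireNamespace("nortest", quietly = TRUE)) {
  print(nortest::ad.test(hourly_no))
  print(nortest::lillie.test(hourly_no))
}

# A normal Q-Q plot is a useful visual check alongside the tests
qqnorm(hourly_no, main = "Normal Q-Q plot of simulated hourly NO")
qqline(hourly_no, col = "red")
```

With thousands of observations almost any small departure from normality gives a tiny p-value, so it helps to read the test results together with the Q-Q plot and the histogram.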

> plot(martorell$date, martorell$NO, type = "l", xlab = "year 2015", ylab = "Nitric oxide (microg/m3)", main = "Air pollution in Martorell (8760 observations)")

[Figure: plot of all 8760 hourly observations]

> plot(martorell$date[1:168], martorell$NO[1:168], type = "l", xlab = "year 2015", ylab = "Nitric oxide (microg/m3)", main = "Air pollution in Martorell")

[Figure: plot of the first week (168 hours)]

> plot(martorell$date[144:168], martorell$NO[144:168], type = "l", xlab = "year 2015", ylab = "Nitric oxide (microg/m3)", main = "Air pollution in Martorell (7th January)")

[Figure: plot for 7th January]

Time series air pollution comparison Martorell vs Sant Andreu de la Barca (2015)

[Figure: time series of both stations]
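A comparison like this can be built from two ts objects. The sketch below uses simulated daily values for both stations (not real data). Note that with frequency = 365, start and end take the form c(year, day of year), so calendar dates are converted to a day number first. The helper day_of_year is a hypothetical name introduced for this example.

```r
set.seed(11)

# Simulated daily NO means for 2015 (365 values each), not real data
martorell_daily  <- rlnorm(365, meanlog = 2.5, sdlog = 0.5)
santandreu_daily <- rlnorm(365, meanlog = 2.7, sdlog = 0.5)

# Daily series: the second element of start is the day of the year
martorellNO.ts  <- ts(martorell_daily,  start = c(2015, 1), frequency = 365)
santandreuNO.ts <- ts(santandreu_daily, start = c(2015, 1), frequency = 365)

# Bind both series and draw them in one panel with different colours
NOcompare <- cbind(martorellNO.ts, santandreuNO.ts)
plot(NOcompare, plot.type = "single", col = c("blue", "red"),
     xlab = "year 2015", ylab = "Nitric oxide (microg/m3)")
legend("topright", legend = c("Martorell", "Sant Andreu de la Barca"),
       col = c("blue", "red"), lty = 1)

# Hypothetical helper: calendar date to day of the year
day_of_year <- function(d) as.integer(format(as.Date(d), "%j"))

# Summer subset, from 21 June to 22 September
summer <- window(NOcompare,
                 start = c(2015, day_of_year("2015-06-21")),
                 end   = c(2015, day_of_year("2015-09-22")))
print(dim(summer))
```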

T-test (normal data) or Mann-Whitney U test (non-normal data)

[Figure: test output]
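The choice between the two tests can be automated along these lines. This is only a sketch: the data are simulated, the helper is_normal is a hypothetical name, and the 0.05 threshold is a common convention rather than a rule from this page.

```r
set.seed(7)

# Simulated daily NO means for two stations (right-skewed, not real data)
martorell_no  <- rlnorm(365, meanlog = 2.5, sdlog = 0.6)
santandreu_no <- rlnorm(365, meanlog = 2.7, sdlog = 0.6)

# Hypothetical helper: TRUE when Shapiro-Wilk does not reject normality
is_normal <- function(x, alpha = 0.05) {
  x <- x[!is.na(x)]
  shapiro.test(x)$p.value > alpha
}

if (is_normal(martorell_no) && is_normal(santandreu_no)) {
  # Both samples look normal: Student's t-test (Welch version by default)
  result <- t.test(martorell_no, santandreu_no)
} else {
  # At least one sample is not normal: Mann-Whitney U test
  result <- wilcox.test(martorell_no, santandreu_no)
}

print(result$method)
print(result$p.value)
```

The helper relies on shapiro.test, so for samples above 5000 values it would need one of the nortest functions instead.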

Air pollutant in Martorell (2015)

[Figure: Martorell 2015 time series]

One week forecast of air pollution in Martorell

[Figure: one week forecast]
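A one-week forecast can be sketched with tools that ship with R. The example below uses Holt-Winters exponential smoothing from the stats package on simulated hourly data with a daily cycle. This is a swapped-in technique chosen because it needs no extra library, and it is not necessarily the method behind the figure above.

```r
set.seed(5)

# Four weeks of simulated hourly NO with a daily cycle (not real data)
n_days <- 28
hour_of_day <- rep(0:23, times = n_days)
simulated_no <- 20 + 10 * sin(2 * pi * (hour_of_day - 6) / 24) +
  rnorm(24 * n_days, sd = 3)

# frequency = 24 tells R that the seasonal period is one day
no_ts <- ts(simulated_no, frequency = 24)

# Additive Holt-Winters model; smoothing parameters are estimated from the data
hw_fit <- HoltWinters(no_ts)

# Forecast the next 7 days (168 hours)
week_forecast <- predict(hw_fit, n.ahead = 24 * 7)

# Observed series, fitted values and forecast in a single plot
plot(hw_fit, week_forecast,
     main = "One week forecast (simulated data)")
```

To compare several forecasting methods, as in objective 6, a common approach is to hold out the last week of real data and measure each method's error on it.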

Download the manual of the openair library for R in order to analyse air pollution data.

"The best way to predict the future is to invent it" (Alan Kay)