Description: The downloaded dataset from before will be combined into one file. In total, the period from 1963 to 2017 was chosen. The dataset contains all imports and exports of all commodities of Austria to the world and it was reported from Statistic Austria to the UN Comtrade database.
library(readr)
library(dplyr)
library(WriteXLS)
library(rjson)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
# setting directories like before (in the download file)
setwd("../")
loc_fol <- paste0(getwd(),"/0_data/comtrade")
down_files <- list.files(path=loc_fol,pattern="*.csv",full.names=TRUE) # get all downloaded csv-files
combination <- lapply(down_files,read.csv)
df <- do.call("rbind", combination)
dim(df)
head(df, 2)
- 426813
- 35
pfCode | yr | period | periodDesc | aggrLevel | IsLeaf | rgCode | rgDesc | rtCode | rtTitle | ⋯ | qtAltCode | qtAltDesc | TradeQuantity | AltQuantity | NetWeight | GrossWeight | TradeValue | CIFValue | FOBValue | estCode |
S1 | 1963 | 1963 | 1963 | 1 | 0 | 1 | Import | 40 | Austria | ⋯ | NA | NA | NA | NA | NA | NA | 194876768 | NA | NA | 0 |
S1 | 1963 | 1963 | 1963 | 1 | 0 | 2 | Export | 40 | Austria | ⋯ | NA | NA | NA | NA | NA | NA | 71546640 | NA | NA | 0 |
# The data still contains the names from comtrade. Here the names get changed:
names <- c("Classification","Year","Period","Period Desc.","Aggregate Level","Is Leaf Code","Trade Flow Code","Trade Flow","Reporter Code","Reporter","Reporter ISO","Partner Code","Partner","Partner ISO","2nd Partner Code","2nd Partner","2nd Partner ISO","Customs Proc. Code","Customs","Mode of Transport Code","Mode of Transport","Commodity Code","Commodity","Qty Unit Code","Qty Unit","Qty","Alt Qty Unit Code","Alt Qty Unit","Alt Qty","Netweight (kg)","Gross weight (kg)","Trade Value (USD)","CIF Trade Value (USD)","FOB Trade Value (USD)","Flag")
colnames(df) <- names
head(df,5)
Classification | Year | Period | Period Desc. | Aggregate Level | Is Leaf Code | Trade Flow Code | Trade Flow | Reporter Code | Reporter | ⋯ | Qty | Alt Qty Unit Code | Alt Qty Unit | Alt Qty | Netweight (kg) | Gross weight (kg) | Trade Value (USD) | CIF Trade Value (USD) | FOB Trade Value (USD) | Flag |
S1 | 1963 | 1963 | 1963 | 1 | 0 | 1 | Import | 40 | Austria | ⋯ | NA | NA | NA | NA | NA | NA | 194876768 | NA | NA | 0 |
S1 | 1963 | 1963 | 1963 | 1 | 0 | 2 | Export | 40 | Austria | ⋯ | NA | NA | NA | NA | NA | NA | 71546640 | NA | NA | 0 |
S1 | 1963 | 1963 | 1963 | 2 | 0 | 1 | Import | 40 | Austria | ⋯ | NA | NA | NA | NA | NA | NA | 8827853 | NA | NA | 0 |
S1 | 1963 | 1963 | 1963 | 2 | 0 | 2 | Export | 40 | Austria | ⋯ | NA | NA | NA | NA | NA | NA | 40104828 | NA | NA | 0 |
S1 | 1963 | 1963 | 1963 | 3 | 0 | 1 | Import | 40 | Austria | ⋯ | NA | NA | 21781116 | NA | 21781116 | NA | 8827853 | NA | NA | 0 |
# After having a little look on the data, some columns are not necessary for our analysis.
# Since it is yearly data period is not relevant. The trade flow code is redundand to Trade Flow.
# Partner and Reporter codes are not important, as we predefined them: Austria and World.
# So we define a name vector with redundand information:
col_removal <- c("Period","Period Desc.","Trade Flow Code","Reporter Code", "Reporter ISO", "Partner Code", "Partner ISO")
# Then remove that columns with the following line:
cleandat_red <- df[ , !(colnames(df) %in% col_removal)] # Removal Step one - redundand information
# The data type factor often shows some unwanted side-effects, so we change factors into characters.
cleandat_fac <- cleandat_red %>% mutate_if(is.factor, as.character) # Removal Step 2: factor to character
# Now in addition we remove all columns that only contain NA-values (-> empty columns).
cleandat_na <- Filter(function(x) !all(is.na(x)), cleandat_fac) # Removal Step 3: remove empty colums
head(cleandat_na,2) # Now we have a more compact data set.
Classification | Year | Aggregate Level | Is Leaf Code | Trade Flow | Reporter | Partner | Commodity Code | Commodity | Qty Unit Code | Qty Unit | Alt Qty Unit | Netweight (kg) | Trade Value (USD) | Flag |
S1 | 1963 | 1 | 0 | Import | Austria | World | 0 | Food and live animals | 1 | No Quantity | NA | NA | 194876768 | 0 |
S1 | 1963 | 1 | 0 | Export | Austria | World | 0 | Food and live animals | 1 | No Quantity | NA | NA | 71546640 | 0 |
# Export Data: To interact with other people and distribute spreadsheets the formats .csv and .xlsx are both good.
write_fol <- paste0(getwd(),"/2_pipeline/Combine_Downloads")
WriteXLS(cleandat_na,paste0(write_fol,"/combo_ct_dat.xlsx"),row.names=FALSE)
write.csv(cleandat_na,paste0(write_fol,"/combo_ct_dat.csv"),row.names=FALSE)
Now we combined all the downloaded datasets and removed some bloat. Next we have to make sure, that the product categories (commodities) match over the whole period. For this we have to map all of them to the same reference system. The people of Comtrade thankfully provide lists for that.