Coursera Data Science Specialization: Reproducible Research Course Project 2
My Course Project 2 of the Reproducible Research course, part of the Data Science Specialization by JHU on Coursera.
You can find the html results here, the R Markdown format here and the final RPubs publication here.
Impact of weather events on public health and economy in the United States
Massimiliano Figini
July 6, 2016
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The events in the database start in the year 1950 and end in November 2011.
The analysis in this document try to respond with tables and graphs at two questions:
- Across the United States, which types of events are most harmful with respect to population health?
- Across the United States, which types of events have the greatest economic consequences?
1. Data Processing
The data for the assignment can be downloaded here.
1.1 Settings
Load data and required libraries.
storm <- read.csv("repdata%2Fdata%2FStormData.csv")
library(dplyr)
stormDP <- tbl_df(storm)
library(ggplot2)
library(xtable)
1.2 Table summary for find the most harmful events with respect to population health.
Injuries and fatalities are the variables considerated for this part of the analysis. Tables with summaries are created with this new variables for each event: Num (total number), Fatalities, Injuries, FatalitiesAVG (average number of fatalities), InjuriesAVG, PercWithFatalities (percentage of events with at least one dead) PercWithInjuries (percentage of events with at least one injury).
# table with total injuries and fatalities for event
StormSummary <- stormDP %>% group_by(EVTYPE) %>% summarize(Num=n(),Fatalities=sum(FATALITIES),Fatalities_AVG=round(mean(FATALITIES),2),Injuries=sum(INJURIES),Injuries_AVG=round(mean(INJURIES),2))
# tables with events with at least one injury / death
WithInjuriesTB <- stormDP %>% filter(INJURIES>0) %>% group_by(EVTYPE) %>% summarize(WithInjuries=n())
WithDeadsTB <- stormDP %>% filter(FATALITIES>0) %>% group_by(EVTYPE) %>% summarize(WithDeads=n())
# join with summary table
StormSummary <- left_join(StormSummary,WithInjuriesTB, by="EVTYPE")
StormSummary <- left_join(StormSummary,WithDeadsTB, by="EVTYPE")
# percentage with at least one injury / fatality
StormSummary <- mutate(StormSummary, Perc_with_Injuries=round(WithInjuries/Num*100,2))
StormSummary <- mutate(StormSummary, Perc_with_Fatalities=round(WithDeads/Num*100,2))
# final summary table for the analysis
StormSummary2 <- StormSummary %>% select(EVTYPE,Num,Fatalities,Fatalities_AVG,Perc_with_Fatalities,Injuries,Injuries_AVG,Perc_with_Injuries) %>% arrange(desc(Num))
1.3 Table summary for find the events that have the greatest economic consequences.
Property and crop damage exponents for each level is listed out and assigned those values for the property exponent data. Invalid data was excluded. Property damage value was calculated by multiplying the property damage and property exponent value. Total damages are the final variable that sum property and crop damages.
# values of PROPDMGEXP
unique(stormDP$PROPDMGEXP)
## [1] K M B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
# traduction of PROPDMGEXP
stormDP$PropExpN <- 0
stormDP$PropExpN[stormDP$PROPDMGEXP == ""] <- 1
stormDP$PropExpN[stormDP$PROPDMGEXP == "-"] <- 0
stormDP$PropExpN[stormDP$PROPDMGEXP == "?"] <- 0
stormDP$PropExpN[stormDP$PROPDMGEXP == "+"] <- 0
stormDP$PropExpN[stormDP$PROPDMGEXP == "0"] <- 1
stormDP$PropExpN[stormDP$PROPDMGEXP == "1"] <- 10
stormDP$PropExpN[stormDP$PROPDMGEXP == "2"] <- 100
stormDP$PropExpN[stormDP$PROPDMGEXP == "3"] <- 1000
stormDP$PropExpN[stormDP$PROPDMGEXP == "4"] <- 10000
stormDP$PropExpN[stormDP$PROPDMGEXP == "5"] <- 100000
stormDP$PropExpN[stormDP$PROPDMGEXP == "6"] <- 1000000
stormDP$PropExpN[stormDP$PROPDMGEXP == "7"] <- 10000000
stormDP$PropExpN[stormDP$PROPDMGEXP == "8"] <- 100000000
stormDP$PropExpN[stormDP$PROPDMGEXP == "B"] <- 1000000000
stormDP$PropExpN[stormDP$PROPDMGEXP == "h"] <- 100
stormDP$PropExpN[stormDP$PROPDMGEXP == "H"] <- 100
stormDP$PropExpN[stormDP$PROPDMGEXP == "K"] <- 1000
stormDP$PropExpN[stormDP$PROPDMGEXP == "m"] <- 1000000
stormDP$PropExpN[stormDP$PROPDMGEXP == "M"] <- 1000000
# Final value for property damages
stormDP$PropDMGN <- stormDP$PROPDMG*stormDP$PropExpN
# values of CROPDMGEXP
unique(stormDP$CROPDMGEXP)
## [1] M K m B ? 0 k 2
## Levels: ? 0 2 B k K m M
# traduction of CROPDMGEXP
stormDP$CropExpN <- 0
stormDP$CropExpN[stormDP$CROPDMGEXP == ""] <- 1
stormDP$CropExpN[stormDP$CROPDMGEXP == "?"] <- 0
stormDP$CropExpN[stormDP$CROPDMGEXP == "0"] <- 1
stormDP$CropExpN[stormDP$CROPDMGEXP == "2"] <- 100
stormDP$CropExpN[stormDP$CROPDMGEXP == "B"] <- 1000000000
stormDP$CropExpN[stormDP$CROPDMGEXP == "k"] <- 1000
stormDP$CropExpN[stormDP$CROPDMGEXP == "K"] <- 1000
stormDP$CropExpN[stormDP$CROPDMGEXP == "m"] <- 1000000
stormDP$CropExpN[stormDP$CROPDMGEXP == "M"] <- 1000000
# Final value for crop damages
stormDP$CropDMGN <- stormDP$CROPDMG*stormDP$CropExpN
# summary table for this analysis
StormSummary3 <- stormDP %>% group_by(EVTYPE) %>% summarize(PropDam=round(sum(PropDMGN),2),PropDam_AVG=round(mean(PropDMGN),2),CropDam=sum(CropDMGN),CropDam_AVG=round(mean(CropDMGN),2), TotalDamages = round(sum(PropDMGN)+sum(CropDMGN),2), TotalDamages_AVG = round(mean(sum(PropDMGN)+sum(CropDMGN)),2)) %>% arrange(desc(TotalDamages))
2. Results
2.1 The most harmful events with respect to population health.
The table and the graph below show the events with the large number of fatalities.
# top 20 for fatalities
print(xtable(as.data.frame(StormSummary2 %>% arrange(desc(Fatalities)))[1:20, ], auto = TRUE, caption='Top 20 events for number of fatalities'),type='html')
EVTYPE | Num | Fatalities | Fatalities_AVG | Perc_with_Fatalities | Injuries | Injuries_AVG | Perc_with_Injuries | |
---|---|---|---|---|---|---|---|---|
1 | TORNADO | 60652 | 5633 | 0.09 | 2.64 | 91346 | 1.51 | 12.70 |
2 | EXCESSIVE HEAT | 1678 | 1903 | 1.13 | 34.39 | 6525 | 3.89 | 9.83 |
3 | FLASH FLOOD | 54277 | 978 | 0.02 | 1.17 | 1777 | 0.03 | 0.70 |
4 | HEAT | 767 | 937 | 1.22 | 23.21 | 2100 | 2.74 | 6.00 |
5 | LIGHTNING | 15754 | 816 | 0.05 | 4.82 | 5230 | 0.33 | 17.83 |
6 | TSTM WIND | 219940 | 504 | 0.00 | 0.18 | 6957 | 0.03 | 1.21 |
7 | FLOOD | 25326 | 470 | 0.02 | 1.18 | 6789 | 0.27 | 0.61 |
8 | RIP CURRENT | 470 | 368 | 0.78 | 69.57 | 232 | 0.49 | 25.32 |
9 | HIGH WIND | 20212 | 248 | 0.01 | 0.91 | 1137 | 0.06 | 2.03 |
10 | AVALANCHE | 386 | 224 | 0.58 | 45.08 | 170 | 0.44 | 27.72 |
11 | WINTER STORM | 11433 | 206 | 0.02 | 1.11 | 1321 | 0.12 | 1.33 |
12 | RIP CURRENTS | 304 | 204 | 0.67 | 59.54 | 297 | 0.98 | 28.29 |
13 | HEAT WAVE | 74 | 172 | 2.32 | 35.14 | 309 | 4.18 | 12.16 |
14 | EXTREME COLD | 655 | 160 | 0.24 | 16.49 | 231 | 0.35 | 2.29 |
15 | THUNDERSTORM WIND | 82563 | 133 | 0.00 | 0.13 | 1488 | 0.02 | 0.74 |
16 | HEAVY SNOW | 15708 | 127 | 0.01 | 0.58 | 1021 | 0.06 | 0.83 |
17 | EXTREME COLD/WIND CHILL | 1002 | 125 | 0.12 | 8.68 | 24 | 0.02 | 0.90 |
18 | STRONG WIND | 3566 | 103 | 0.03 | 2.52 | 280 | 0.08 | 4.18 |
19 | BLIZZARD | 2719 | 101 | 0.04 | 2.21 | 805 | 0.30 | 1.77 |
20 | HIGH SURF | 725 | 101 | 0.14 | 9.24 | 152 | 0.21 | 5.10 |
# modification for the graph label
levels(StormSummary2$EVTYPE) <- gsub(" ", "\n",levels(StormSummary2$EVTYPE))
# desc order for the first graph
StormSummary2$EVTYPE <- factor(StormSummary2$EVTYPE, levels = StormSummary2$EVTYPE[order(StormSummary2$Fatalities, decreasing=TRUE)])
# graph with fatalities per event
g <- ggplot(head(as.data.frame(StormSummary2),n=8), aes(EVTYPE, Fatalities))
g+geom_bar(stat='identity')+labs(title="Top weather events for number of fatalities", x="Event",y="Fatalities")
The table and the graph below show the events with the large number of injuries.
# top 20 for injuries
print(xtable(as.data.frame(StormSummary2 %>% arrange(desc(Injuries)))[1:20, ], auto = TRUE, caption='Top 20 events for number of injuries'),type='html')
EVTYPE | Num | Fatalities | Fatalities_AVG | Perc_with_Fatalities | Injuries | Injuries_AVG | Perc_with_Injuries | |
---|---|---|---|---|---|---|---|---|
1 | TORNADO | 60652 | 5633 | 0.09 | 2.64 | 91346 | 1.51 | 12.70 |
2 | TSTM WIND | 219940 | 504 | 0.00 | 0.18 | 6957 | 0.03 | 1.21 |
3 | FLOOD | 25326 | 470 | 0.02 | 1.18 | 6789 | 0.27 | 0.61 |
4 | EXCESSIVE HEAT | 1678 | 1903 | 1.13 | 34.39 | 6525 | 3.89 | 9.83 |
5 | LIGHTNING | 15754 | 816 | 0.05 | 4.82 | 5230 | 0.33 | 17.83 |
6 | HEAT | 767 | 937 | 1.22 | 23.21 | 2100 | 2.74 | 6.00 |
7 | ICE STORM | 2006 | 89 | 0.04 | 2.84 | 1975 | 0.98 | 3.09 |
8 | FLASH FLOOD | 54277 | 978 | 0.02 | 1.17 | 1777 | 0.03 | 0.70 |
9 | THUNDERSTORM WIND | 82563 | 133 | 0.00 | 0.13 | 1488 | 0.02 | 0.74 |
10 | HAIL | 288661 | 15 | 0.00 | 0.00 | 1361 | 0.00 | 0.10 |
11 | WINTER STORM | 11433 | 206 | 0.02 | 1.11 | 1321 | 0.12 | 1.33 |
12 | HURRICANE/TYPHOON | 88 | 64 | 0.73 | 21.59 | 1275 | 14.49 | 13.64 |
13 | HIGH WIND | 20212 | 248 | 0.01 | 0.91 | 1137 | 0.06 | 2.03 |
14 | HEAVY SNOW | 15708 | 127 | 0.01 | 0.58 | 1021 | 0.06 | 0.83 |
15 | WILDFIRE | 2761 | 75 | 0.03 | 1.01 | 911 | 0.33 | 6.66 |
16 | THUNDERSTORM WINDS | 20843 | 64 | 0.00 | 0.24 | 908 | 0.04 | 1.55 |
17 | BLIZZARD | 2719 | 101 | 0.04 | 2.21 | 805 | 0.30 | 1.77 |
18 | FOG | 538 | 62 | 0.12 | 6.88 | 734 | 1.36 | 14.50 |
19 | WILD/FOREST FIRE | 1457 | 12 | 0.01 | 0.62 | 545 | 0.37 | 8.92 |
20 | DUST STORM | 427 | 22 | 0.05 | 2.11 | 440 | 1.03 | 10.30 |
# desc order for the second graph
StormSummary2$EVTYPE <- factor(StormSummary2$EVTYPE, levels = StormSummary2$EVTYPE[order(StormSummary2$Injuries, decreasing=TRUE)])
# graph with injuries per event
g2 <- ggplot(head(as.data.frame(StormSummary2),n=8), aes(EVTYPE, Injuries))
g2+geom_bar(stat='identity')+labs(title="Top weather events for number of injuries", x="Event",y="Injuries")
Based on the data, TORNADO caused the maximum number of fatalities and injuries, and for this reason it's the most harmful with respect to population health.
2.2 The events that have the greatest economic consequences.
# top 20 for damages
print(xtable(as.data.frame(StormSummary3)[1:20, ], digits=0, auto = TRUE, caption='Top 20 events for economic damages'),type='html')
## Warning in formatC(x = c(5661968450, 2607872800, 414953270, 5000,
## 3025954473, : NAs introduced by coercion to integer range
EVTYPE | PropDam | PropDam_AVG | CropDam | CropDam_AVG | TotalDamages | TotalDamages_AVG | |
---|---|---|---|---|---|---|---|
1 | FLOOD | 144657709807 | 5711826 | NA | 223563 | 150319678257 | 150319678257 |
2 | HURRICANE/TYPHOON | 69305840000 | 787566364 | NA | 29634918 | 71913712800 | 71913712800 |
3 | TORNADO | 56947380617 | 938920 | 414953270 | 6842 | 57362333887 | 57362333887 |
4 | STORM SURGE | 43323536000 | 165990559 | 5000 | 19 | 43323541000 | 43323541000 |
5 | HAIL | 15735267513 | 54511 | NA | 10483 | 18761221986 | 18761221986 |
6 | FLASH FLOOD | 16822673979 | 309941 | 1421317100 | 26186 | 18243991079 | 18243991079 |
7 | DROUGHT | 1046106000 | 420461 | NA | 5615983 | 15018672000 | 15018672000 |
8 | HURRICANE | 11868319010 | 68208730 | NA | 15758103 | 14610229010 | 14610229010 |
9 | RIVER FLOOD | 5118945500 | 29589280 | NA | 29072017 | 10148404500 | 10148404500 |
10 | ICE STORM | 3944927860 | 1966564 | NA | 2503546 | 8967041360 | 8967041360 |
11 | TROPICAL STORM | 7703890550 | 11165059 | 678346000 | 983110 | 8382236550 | 8382236550 |
12 | WINTER STORM | 6688497251 | 585017 | 26944000 | 2357 | 6715441251 | 6715441251 |
13 | HIGH WIND | 5270046260 | 260738 | 638571300 | 31594 | 5908617560 | 5908617560 |
14 | WILDFIRE | 4765114000 | 1725865 | 295472800 | 107017 | 5060586800 | 5060586800 |
15 | TSTM WIND | 4484928495 | 20392 | 554007350 | 2519 | 5038935845 | 5038935845 |
16 | STORM SURGE/TIDE | 4641188000 | 31359378 | 850000 | 5743 | 4642038000 | 4642038000 |
17 | THUNDERSTORM WIND | 3483122472 | 42187 | 414843050 | 5025 | 3897965522 | 3897965522 |
18 | HURRICANE OPAL | 3172846000 | 352538444 | 19000000 | 2111111 | 3191846000 | 3191846000 |
19 | WILD/FOREST FIRE | 3001829500 | 2060281 | 106796830 | 73299 | 3108626330 | 3108626330 |
20 | HEAVY RAIN/SEVERE WEATHER | 2500000000 | 1250000000 | 0 | 0 | 2500000000 | 2500000000 |
# modification for the graph label
levels(StormSummary3$EVTYPE) <- gsub(" ", "\n",levels(StormSummary3$EVTYPE))
# desc order for the graph
StormSummary3$EVTYPE <- factor(StormSummary3$EVTYPE, levels = StormSummary3$EVTYPE[order(StormSummary3$TotalDamages, decreasing=TRUE)])
# graph with damages per event
h <- ggplot(head(as.data.frame(StormSummary3),n=8), aes(EVTYPE, TotalDamages/1000000000))
h+geom_bar(stat='identity')+labs(title="Top weather events for damages (billions of dollars)", x="Event",y="Total Damages (billions of dollars)")
Based on the data, FLOOD have the greatest economic consequences.