Data Processing in Shell

Data Processing in Shell

Download Data using curl

https://curl.haxx.se/download.html

curl -O https://websitename.com/file001.txt     
#-O -> download file with it's name

curl -o newname.txt https://websitename.com/file001.txt     
#-o -> download file with new name

curl -O https://websitename.com/*.txt     

curl -O https://websitename.com/file[001-100].txt

curl -O https://websitename.com/file[001-100:10].txt
#download only file010.txt, file020.txt,....,file100.txt

curl -L -C -O https://websitename.com/file[001-100:10].txt
#-L -> redirects the http url if a 300 error code occurs
#-C -> resumes a previous file transfer if it times out

with curl you can also download data from 20+ protocols

Downloading Data using Wget

https://www.gnu.org/software/wget/

same to curl with this unique option flag:
-b -> background download
-q -> turn off output
-c -> resume broken download

#you can link the different option flags
wget -bqc https://websitename.com/file001.txt

#take all the file links in the txt and dowload them
wget -i list.txt

#other option flags must go before -i
#set download limit rate to 200k per second
wget --limit-rate=200k -i list.txt

#put a 1.5 second pause between each file download
wget --wait=1.5 -i list.txt

with wget you can also download data from (s)ftp

Data Cleaning and Munging with csvkit

pip install csvkit
pip install --upgrade csvkit

#convert first sheet to csv
in2csv file.xlsx > file.csv     

#convert NameSheet2 sheet to csv
in2csv file.xlsx --sheet "NameSheet2" > file.csv    

#see csv file content
csvlook file.csv     

#print summary statistics of all columns
csvstat file.csv

csvcut to filter colums

csvcut -n     #see column names and positions
csvcut -c 1 file.csv     #return first column
csvcut -c "colname" file.csv     #return colname column 
csvcut -c 2,3 file.csv     #return second and third columns
csvcut -c "colname","colname2" file.csv

csvgrep to filter rows

must be paired with one of these options:
-m -> exact row value to filter
-r -> regex pattern
-f -> path to a file

#search in file.csv where col column equal to value
csvgrep -c "col" -m value file.csv

#search in file.csv where third column equal to value
csvgrep -c 3 -m value file.csv

csvstack for stacking multiple csv files, the files must have same columns, same order, same data types

csvstack file1.csv file2.csv > fileall.csv

#make a group column with g1 if the record is from file1
csvstack -g "g1","g2" file1.csv file2.csv > fileall.csv

#same as before but choose a name for the group column
csvstack -g "g1","g2" -n "source" file1.csv file2.csv > fileall.csv

Pulling data from database with csvkit

sql2csv work with SQL Server, MySQL, Oracle, Postgresql, and others
sql2csv exec a query and put the results in a csv

#extract data from sqlite db to csv
sql2csv --db "sqlite:///dbname.db" --query "SELECT * FROM table" > file.csv

csvsql uses SQL for query csv files

csvsql --query "SELECT * FROM file" file.csv
csvsql --query "SELECT * FROM file" file.csv | csvlook     #better printout
csvsql --query "SELECT * FROM file" file.csv > result.csv

#join two files
csvsql --query "SELECT * FROM file JOIN file2 ON ..." file.csv file2.csv

#use shell variable for store the query
sqlquery = "SELECT * FROM file JOIN file2 ON file.col = file2.col"
csvsql --query "$sqlquery" file.csv file2.csv

csvsql used also for querying a db, manipulate objects on it and push data

#insert data in a new table called file in the mysql db dbs
csvsql --db "mysql:///dbs" -- insert file.csv

#use query
csvsql --db "mysql:///dbs" --query "SELECT * FROM file" -- insert file.csv

other useful attributes:
--no-inference -> create all table columns in text format
--no-constraints -> no length limits or null checks

Python on the command line

python     #start Python
echo "print('hello world!')"  > helloworld.py

python helloworld.py     #execute the script

pip install --upgrade pip

pip list     #see installed packages

pip install scikit-learn

pip install -r requirements.txt     #install packages listed in file

cron: time-based job scheduler for scripts

#echo scheduler command into cromtab
echo "* * * * * python create_model.py" | crontab

* * * * * run job every minutes forever
* meaning and order
first * -> minute (0-59)
second * -> hour (0-23)
third * -> day of month (1-31)
fourth * -> month (1-12)
fifth * -> day of week (0-6, Sunday = 0)

# see job scheduled with cron
crontab -l