Scott's Blog

Learning keeps one from rigidity; knowledge keeps one from confusion.


Bash for Data Manipulation

Do complex things with just a few keystrokes!

Bash is a Unix shell and command language. It has survived and thrived for almost 50 years because it lets people do complex things with just a few keystrokes. Sometimes called "the universal glue of programming," it helps users combine existing programs in new ways, automate repetitive tasks, and run programs on clusters and clouds that may be halfway around the world.

Basic

How to move around in the shell, and how to create, modify, and delete files and folders.

  • pwd: print working directory
  • ls: list files and directories
  • cd: change directory
  • cp: copy
  • mv: move or rename
  • rm: remove
  • $1, $2, etc. receive ARGV in bash scripts; $@ and $* expand to the full ARGV list, and $# gives the ARGV length.
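A minimal sketch of those positional parameters in action (the script name args_demo.sh and its arguments are invented for the demo):

```shell
# Write a tiny script that echoes its positional parameters
cat > args_demo.sh <<'EOF'
#!/bin/bash
echo "first argument: $1"
echo "second argument: $2"
echo "all arguments: $@"
echo "argument count: $#"
EOF

bash args_demo.sh apple banana
```

Running it prints each parameter on its own line; $# here reports 2.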

Find files

by name

  • find ., list all files and folders below the current directory
  • find folder
  • find . -type d, directories only, no files
  • find . -type f, files only, no directories
  • find . -type f -name "test.txt", match the name test.txt exactly
  • find . -type f -name "text*", all files whose names start with "text"
  • find . -type f -iname "text*", the same, but case-insensitive
  • find . -type f -name "*.py"

by time

  • find . -type f -mmin -10, files modified within the last 10 minutes
  • find . -type f -mmin +10, files modified more than 10 minutes ago
  • find . -type f -mmin +1 -mmin -5, files modified between 1 and 5 minutes ago
  • find . -type f -mtime -20, files modified within the last 20 days

amin, atime: access time in minutes and days; cmin, ctime: change time; mmin, mtime: modification time.
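A self-contained check of -mmin, assuming GNU find (the demo folder and file name are invented):

```shell
# A file created just now is matched by -mmin -10
# (modified less than ten minutes ago)
mkdir -p demo && touch demo/fresh.txt
find demo -type f -mmin -10
```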

by size

  • find . -size +5M, files larger than 5 MB; k and G work too
  • ls -lah ./folders, info about subfolders and files, including size
  • find . -empty, empty files and folders

by permission

  • find . -perm 777, files readable, writable, and executable by everyone
  • find folder -exec chown coreschafer:www-data {} +
  • find folder returns all results; -exec runs the given command on those results, {} is the placeholder for each result, and + ends the command.
  • find folder -type f -exec chmod 664 {} +
  • find folder -perm 664
  • find . -type f -name "*.jpg"
  • find . -maxdepth 1 -type f -name "*.jpg", search only 1 level down
  • find . -maxdepth 1 -type f -name "*.jpg" -exec rm {} +, delete the matched files

Grep

Grep single file

Search for text

  • grep "text_you_want_search" filename.txt
  • grep -w "text_you_want_search" filename.txt, match whole words only
  • grep -wi "text_you_want_search" filename.txt, ignore upper/lower case
  • grep -win "text_you_want_search" filename.txt, also show line numbers
  • grep -win -B 4 "text_you_want_search" filename.txt, show 4 lines of context before each match
  • grep -win -A 4 "text_you_want_search" filename.txt, show 4 lines of context after each match
  • grep -win -C 4 "text_you_want_search" filename.txt, show 4 lines of context both before and after each match
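A minimal demonstration of the context flags (the ctx.txt file is invented for the demo):

```shell
# Build a five-line file, then grep with one line of context
printf 'one\ntwo\nTARGET\nfour\nfive\n' > ctx.txt
grep -B 1 "TARGET" ctx.txt   # prints: two, TARGET
grep -A 1 "TARGET" ctx.txt   # prints: TARGET, four
grep -C 1 "TARGET" ctx.txt   # prints: two, TARGET, four
```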

Grep multi file

  • grep -win "text_" ./*, all files
  • grep -win "text_" ./*.txt, only .txt files
  • grep -winr "text" ./ , search all subdirectories recursively
  • grep -wirl "text" ./ , list only the files that match, without the match details
  • grep -wirc "text" ./ , show the number of matches in each file

Grep command history

  • history | grep "git commit"
  • history | grep "git commit" | grep "dotfile"

Grep rgx

  • grep -P "--" file.txt, Perl-compatible regex; works out of the box on Linux, but macOS needs extra configuration (I have set mine up)

cURL

Requests

  • curl url
  • curl http://localhost:5000
  • curl http://www.wittyfans.com/json_file
  • curl -i http://www.wittyfans.com/json_file, include the response headers for detailed info
  • curl http://www.wittyfans.com/method
  • curl -d "first=name&last=lastname" http://www.wittyfans.com/method, -d for data; a POST request
  • curl -X PUT -d "first=name&last=lastname" http://www.wittyfans.com/method, a PUT request
  • curl -X DELETE http://www.wittyfans.com/method, a DELETE request

Verify

Could not verify your access? Authenticate with -u:

curl -u username:password http://wittyfans.com

Download

  • curl http://wittyfans.com/folder, dumps binary to the terminal; an error
  • curl -o filename.jpg http://wittyfans.com/folder, succeeds
  • curl -o file_name.json http://api.wittyfans.com, saving a large JSON file

rsync

Install

Available by default on macOS; on Linux it may need installing:

  • apt-get install rsync
  • yum install rsync

Use

  • rsync folder1/* backup/ , sync files into the backup folder; subdirectories and the files inside them are skipped
  • rsync -r folder1/* backup/ , recursive: include subdirectories and their files
  • rsync -r folder1 backup/ , sync the folder itself (it appears inside backup/), not just its contents

Check changes before running

  • rsync -a --dry-run folder1/* backup/ , preview the command without running it; no output is shown
  • rsync -av --dry-run folder1/* backup/ , -v adds verbose output so you can see what would change

Mirroring deletions to the backup folder

  • rsync -av --delete --dry-run original/ backup/ , --delete removes files from backup/ that no longer exist in original/; preview first and be careful!

Sync between local machine and remote host

  • rsync -zaP local_folder username@ip:~/public/ , -z compresses, -a is archive mode (preserves almost everything), -P shows progress and keeps partial transfers over the network
  • rsync -zaP username@ip:~/public/file ~/Downloads/ , the reverse direction: remote to local

Manipulating data

How to work with the data in those files

cat: view a file's contents; the name means concatenate

less & more: view contents piece by piece; more is superseded by less now. In less:

  • :n, Move to next file
  • :p, Go back to previous file
  • :q, quit

head: look at the start of a text file; head -3 displays only the first three lines

tail: look at the end of a text file; tail -n +7 displays content from line 7 to the end

ls: list everything below a directory; with ls -R -F, -R is recursive and -F prints a / after the name of every directory and a * after the name of every runnable program.

man: manual, automatically invokes less

cut -f 2-5,8 -d , values.csv: select columns 2 through 5 and column 8, using comma as the separator. -d sets the delimiter, -f the fields (columns) to select

!command: re-run the most recent use of that command matched

grep bicuspid seasonal/winter.csv: prints lines from winter.csv that contain "bicuspid"

cat two_cities.txt | egrep 'Sydney Carton|Charles Darnay' | wc -l : Count the number of lines in the book that contain either the character 'Sydney Carton' or 'Charles Darnay'.

-c: print a count of matching lines rather than the lines themselves
-h: do not print the names of files when searching multiple files
-i: ignore case (e.g., treat "Regression" and "regression" as matches)
-l: print the names of files that contain matches, not the matches
-n: print line numbers for matching lines
-v: invert the match, i.e., only show lines that don't match
egrep: grep -E
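A quick demonstration of -i, -c, and -v together (the models.txt file is invented for the demo):

```shell
printf 'Regression\nregression\nclassification\n' > models.txt
grep -ic "regression" models.txt   # case-insensitive count: 2
grep -vi "regression" models.txt   # invert match, ignoring case: classification
```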

Combining tools

How to use this power to select the data you want, and introduce commands for sorting values and removing duplicates.

head -n 5 seasonal/summer.csv > top.csv: take the first 5 rows of summer.csv and write them to top.csv

cut -d , -f 2 seasonal/summer.csv | grep -v Tooth: select all of the tooth names from column 2 of the comma-delimited file, dropping the header line "Tooth".

wc, word count; wc -l counts lines instead, e.g. grep 2017-07 seasonal/spring.csv | wc -l counts the records from July 2017.

head -n 3 seasonal/s*: show the first 3 rows of every file matching seasonal/s*.

  • *, matches any characters
  • ?, matches a single character
  • [...] matches any one of the characters inside the square brackets; 201[78].txt matches 2017.txt or 2018.txt, but not 2016.txt
  • {...} matches any of the comma-separated patterns inside the curly brackets, so {*.txt,*.csv} matches any file whose name ends with .txt or .csv, but not files whose names end with .pdf.
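The bracket pattern can be tried directly (the globdemo folder and file names are invented; brace expansion is a bash feature):

```shell
mkdir -p globdemo
touch globdemo/2016.txt globdemo/2017.txt globdemo/2018.txt
ls globdemo/201[78].txt   # matches 2017.txt and 2018.txt, not 2016.txt
# In bash, braces also work: ls globdemo/*.{txt,csv}
```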

sort, -n: sort numerically, -r: reverse, -b: ignore leading blanks, -f: be case-insensitive

uniq, remove adjacent duplicate lines.

wc -l seasonal/*.csv, print the line count of each file in the seasonal folder

wc -l seasonal/*.csv | grep -v 'total' | sort -n | head -n 1, drop the 'total' row, sort numerically, and keep the first row (the shortest file).
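The same pipeline can be reproduced on two toy files (the file names are invented for the demo):

```shell
# Two toy CSVs: a header row plus data rows
printf 'header\na\nb\n' > demo_a.csv
printf 'header\na\n'    > demo_b.csv
# Count lines, drop the total row, sort numerically, keep the shortest
wc -l demo_a.csv demo_b.csv | grep -v total | sort -n | head -n 1
```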

Batch processing

How to make your own pipelines do that. Along the way, you will see how the shell uses variables to store information.

set: show environment and shell variables

echo: print

  • $USER, user name
  • $OSTYPE, name of the kind of operating system you are using

training=seasonal/summer.csv then echo $training, define a variable and print it (no spaces around =).

For loop:

# Example 1
for filetype in gif jpg png; do echo $filetype; done
# Example 2
for filename in seasonal/*.csv; do echo $filename; done
# Example 3
datasets=seasonal/*.csv
for filename in $datasets; do echo $filename; done
# Example 4
for file in seasonal/*.csv; do head -n 2 $file | tail -n 1; done

Do not use spaces in file names; they cause issues in bash.
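If a space in a file name is unavoidable, quoting the loop variable keeps the name intact (a sketch; the spaced folder is invented):

```shell
mkdir -p spaced && touch "spaced/my file.txt"
for file in spaced/*.txt; do
    echo "found: $file"   # unquoted $file would split on the space
done
```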

Creating new tools

How to go one step further and create new commands of your own.

Edit file using nano:

nano filename.txt, edit a file

  • Ctrl + K: delete a line.
  • Ctrl + U: un-delete a line.
  • Ctrl + O: save the file ('O' stands for 'output'). You will also need to press Enter to confirm the filename!
  • Ctrl + X: exit the editor.

grep -h -v Tooth spring.csv summer.csv > temp.csv, -h stops grep printing filenames, -v keeps every row except those containing Tooth

history | tail -n 3, show the 3 most recent commands

$@, pass all command-line filenames into a script, e.g. tail -q -n +2 $@ | wc -l counts all data rows, skipping each file's header
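A sketch of that pattern as a script (the script and CSV names are invented for the demo):

```shell
# count_records.sh: count data rows across all files given,
# skipping each file's header line
cat > count_records.sh <<'EOF'
#!/bin/bash
tail -q -n +2 "$@" | wc -l
EOF

printf 'header\nrow1\nrow2\n' > rec_a.csv
printf 'header\nrow1\n'       > rec_b.csv
bash count_records.sh rec_a.csv rec_b.csv   # 3 data rows in total
```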

Downloading data

How to download data files from web servers via the command line.

curl, short for Client for URLs. man curl checks whether curl is installed.

  • curl -O url, save the file under its original name
  • curl -o newname.txt url, save under a new file name
  • Download all 100 data files: curl -O https://s3.amazonaws.com/datafile[001-100].txt

Wget, from World Wide Web and "get"; better than curl when downloading multiple files recursively. which wget checks whether wget is installed.

wget -c -b https://wittyfans.com/201812SpotifyData.zip

  • -c, resume broken download
  • -b, go to background
  • wget --wait=1 -i url_list.txt, enforce a 1-second pause between downloading each file in url_list.txt
# Use curl to download and rename a single file from a URL
curl -o Spotify201812.zip -L https://assets.datacamp.com/production/repositories/4180/datasets/eb1d6a36fa3039e4e00064797e1a1600d267b135/201812SpotifyData.zip

# Unzip, delete, then re-name to Spotify201812.csv
unzip Spotify201812.zip && rm Spotify201812.zip
mv 201812SpotifyData.csv Spotify201812.csv

# View url_list.txt to verify content
cat url_list.txt

# Use Wget, limit the download rate to 2500 KB/s, download all files in url_list.txt
wget --limit-rate=2500k -i url_list.txt

# Take a look at all files downloaded
ls

CSV Kit

Using csvkit to convert, preview, filter and manipulate files to prepare our data for further analyses.

pip install csvkit to install; see the docs

in2csv

  • in2csv -h, help for converting files to CSV.
  • in2csv SpotifyData.xlsx > SpotifyData.csv
  • in2csv SpotifyData.xlsx --sheet "Worksheet1_Popularity" > Spotify_Popularity.csv, convert only one sheet

csvlook

  • csvlook -h, help for previewing data on the command line.
  • csvlook SpotifyData.csv

csvsort

  • csvsort -c 2 Spotify_Popularity.csv | csvlook

csvstat

  • csvstat Spotify_Popularity.csv, summary statistics

csvcut

  • csvcut -n Spotify_MusicAttributes.csv, print a list of column headers in the data file
  • csvcut -c 1,3,5 Spotify_MusicAttributes.csv, print columns 1, 3, and 5, by position
  • csvcut -c "track_id","duration_ms","loudness" Spotify_MusicAttributes.csv, print the track id, song duration, and loudness columns, by name

csvgrep

  • csvgrep -c "danceability" -m 0.812 Spotify_MusicAttributes.csv, filter rows where danceability equals 0.812; the column name must be quoted.

csvstack

  • csvstack, merge files.
  • csvstack Spotify_Rank6.csv Spotify_Rank7.csv > Spotify.csv, merge two files into one
  • csvstack -g "Rank6","Rank7" Spotify_Rank6.csv Spotify_Rank7.csv > Spotify_Al, merge two files into one and add a source column.

chain commands

  • ;, links commands together and runs them sequentially
  • &&, links commands together, but only runs the 2nd command if the 1st succeeds
  • >, redirects the 1st command's output to a file
  • |, uses the output of the 1st command as input to the 2nd

sql2csv

  • sql2csv -v or sql2csv --verbose, print more tracebacks and logs
# Pull the entire Spotify_Popularity table and print in log
sql2csv --db "sqlite:///SpotifyDatabase.db" \
--query "SELECT * FROM Spotify_Popularity"

csvsql

Manipulating data using SQL syntax (Small to medium files only) :

# Reformat the output using csvlook
csvsql --query "SELECT * FROM Spotify_MusicAttributes ORDER BY duration_ms LIMIT 1" \
Spotify_MusicAttributes.csv | csvlook

using bash variable:

# Store SQL query as shell variable
sqlquery="SELECT * FROM Spotify_MusicAttributes ORDER BY duration_ms LIMIT 1"

# Apply SQL query to Spotify_MusicAttributes.csv
csvsql --query "$sqlquery" Spotify_MusicAttributes.csv

join two file:

# Store SQL query as shell variable
sql_query="SELECT ma.*, p.popularity FROM Spotify_MusicAttributes ma INNER JOIN Spotify_Popularity p ON ma.track_id = p.track_id"

# Join 2 local csvs into a new csv using the saved SQL
csvsql --query "$sql_query" Spotify_MusicAttributes.csv Spotify_Popularity.csv > Spotify_FullData.csv

# Preview newly created file
csvstat Spotify_FullData.csv

Pushing data back to database:

# Store SQL for querying from SQLite database
sqlquery_pull="SELECT * FROM SpotifyMostRecentData"

# Apply SQL to save table as local file
sql2csv --db "sqlite:///SpotifyDatabase.db" --query "$sqlquery_pull" > SpotifyMostRecentData.csv

# Store SQL for UNION of the two local CSV files
sqlquery_union="SELECT * FROM SpotifyMostRecentData UNION ALL SELECT * FROM Spotify201812"

# Apply SQL to union the two local CSV files and save as local file
csvsql --query "$sqlquery_union" SpotifyMostRecentData.csv Spotify201812.csv > UnionedSpotifyData.csv

# Push UnionedSpotifyData.csv to database as a new table
csvsql --db "sqlite:///SpotifyDatabase.db" --insert UnionedSpotifyData.csv

Bash Script

Stream editor

sed: stream editor.

cat soccer_scores.csv | sed 's/Cherno/Cherno City/g' > soccer_scores_edited.csv : replace the word Cherno with Cherno City, then save to a new file; for more, check this.

Argument

$1, $2, etc. receive ARGV in bash scripts; $@ and $* expand to the ARGV list, and $# gives its length. cat hire_data/*.csv | grep "$1" > "$1".csv: take a city as an argument, filter all the files by that city, and output to a new CSV named after the city.

Quotes

Single, double, and backticks.

  • Single quotes ('sometext') = the shell interprets what is between them literally
  • Double quotes ("sometext") = interpreted literally except for $ expansions and backticks
  • Backticks (`sometext`) = the shell runs the command and captures its STDOUT into a variable
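The three quoting styles side by side:

```shell
name="world"
echo 'Hello $name'   # single quotes: prints Hello $name literally
echo "Hello $name"   # double quotes: prints Hello world
year=`date +%Y`      # backticks run date and capture its output
echo "This year is $year"
```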

Numeric variables

In bash, typing 1 + 5 gives an error; instead, you type expr 1 + 5. expr is a utility program just like cat and grep, but expr cannot natively handle decimal places: expr 1 + 2.5 gives a "not a decimal number" error.

Enter bc (basic calculator), a useful command-line program. Using bc without opening the interactive calculator:

echo "5+7.5" | bc . bc has a scale variable for how many decimal places to keep: echo "scale=3; 10/3" | bc ; the ; separates statements within the bc expression.
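Putting expr and bc side by side (assumes bc is installed):

```shell
echo "5 + 7.5" | bc          # 12.5
echo "scale=3; 10/3" | bc    # 3.333
expr 1 + 5                   # 6 -- integers only
```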

Array

Normal array

Create array: capital_cities=("Sydney" "New York" "Paris")

Add element:

# Create a normal array with the mentioned elements using the declare method
declare -a capital_cities

# Add (append) the elements
capital_cities+=("Sydney")
capital_cities+=("New York")
capital_cities+=("Paris")

Get all element and length of array:

# The array has been created for you
capital_cities=("Sydney" "New York" "Paris")

# Print out the entire array
echo ${capital_cities[@]}

# Print out the array length
echo ${#capital_cities[@]}

Associative arrays

Like dictionary in python.

Create:

# Create empty associative array
declare -A model_metrics

# Add the key-value pairs
model_metrics[model_accuracy]=98
model_metrics[model_name]="knn"
model_metrics[model_f1]=0.82

Create in one line:

# An associative array has been created for you
declare -A model_metrics=([model_accuracy]=98 [model_name]="knn" [model_f1]=0.82)

# Print out just the keys
echo ${!model_metrics[@]}

Example:

# Create variables from the temperature data files
temp_b="$(cat temps/region_B)"
temp_c="$(cat temps/region_C)"

# Create an array with these variables as elements
region_temps=($temp_b $temp_c)

# Call an external program to get average temperature
average_temp=$(echo "scale=2; (${region_temps[0]} + ${region_temps[1]}) / 2" | bc)

# Append to array
region_temps+=($average_temp)

# Print out the whole array
echo ${region_temps[@]}

Control Statements

IF statements

if [ condition1 ] && [ condition2 ]; then
    # some code
else
    # some code
fi

Move files based on content:

# Extract Accuracy from first ARGV element
accuracy=$(grep Accuracy $1 | sed 's/.* //')

# Conditionally move into good_models folder
if [ $accuracy -ge 90 ]; then
    mv $1 ./good_models
fi

# Conditionally move into bad_models folder
if [ $accuracy -lt 90 ]; then
    mv $1 ./bad_models
fi
# Normal flags
-eq =
-ne !=
-lt <
-le <=
-gt >
-ge >=

# File related flags
-e if the file exists
-s if the file exists and has size greater than zero
-r if the file exists and is readable
-w if the file exists and is writable

## And and OR
&& for and
|| for or
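A quick sketch combining the numeric and file flags above (the file and variable names are invented):

```shell
touch results.txt
if [ -e results.txt ] && [ -r results.txt ]; then
    echo "results.txt exists and is readable"
fi

score=91
if [ "$score" -ge 90 ] || [ "$score" -eq 0 ]; then
    echo "high score (or unscored)"
fi
```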

For loops

# Use a FOR loop on files in directory
for file in inherited_folder/*.R
do
    # Echo out each file
    echo $file
done

# Move python files to to_keep/ if they use RandomForestClassifier
for file in robs_files/*.py
do
    # Create IF statement using grep
    if grep -q 'RandomForestClassifier' $file ; then
        # Move wanted files to to_keep/ folder
        mv $file to_keep/
    fi
done

Case

# Create a CASE statement matching the first ARGV element
case $1 in
    # Match on all weekdays
    Monday|Tuesday|Wednesday|Thursday|Friday)
        echo "It is a Weekday!";;
    # Match on all weekend days
    Saturday|Sunday)
        echo "It is a Weekend!";;
    # Create a default
    *)
        echo "Not a day!";;
esac

Function

function function_name {
    # function_code
    return # something
}

function print_hello () {
    echo "Hello world!"
}

print_hello

Scope: all variables in Bash are global by default! Use local var to restrict a variable's scope.

To get data out from function:

  1. Assign to a global variable
  2. echo what we want back in the last line and capture it using a shell-within-a-shell
function convert {
    echo $(echo $1)
}
converted=$(convert 30)
# Create a function with a local base variable
function sum_array () {
    local sum=0
    # Loop through, adding to base variable
    for number in "$@"
    do
        sum=$(echo "$sum + $number" | bc)
    done
    # Echo back the result
    echo $sum
}

# Call function with array
test_array=(14 12 23.5 16 19.34)
total=$(sum_array "${test_array[@]}")
echo "The total sum of the test array is $total"

Python script on bash

Scheduling a job with crontab

# Preview both Python script and requirements text file
cat create_model.py
cat requirements.txt

# Pip install Python dependencies in requirements file
pip install -r requirements.txt

# Run Python script on command line
python create_model.py

# Add CRON job that runs create_model.py every minute
echo "* * * * * python create_model.py" | crontab

# Verify that the CRON job has been scheduled via CRONTAB
crontab -l

Cron

  • crontab -l, list the scheduled cron jobs

Set your editor to nano (the default is vim):

  • export EDITOR=/usr/bin/nano
  • crontab -e, open the crontab in your editor
  • in vim, press i to start typing

The syntax for scheduling your task by time:

# ┌───────────── minute (0 - 59)
# │ ┌───────────── hour (0 - 23)
# │ │ ┌───────────── day of month (1 - 31)
# │ │ │ ┌───────────── month (1 - 12)
# │ │ │ │ ┌───────────── day of week (0 - 6) (Sunday to Saturday;
# │ │ │ │ │ 7 is also Sunday on some systems)
# │ │ │ │ │
# │ │ │ │ │
# * * * * * command_to_execute
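A few illustrative entries under that syntax (the scripts named here are hypothetical; step values like */6 are supported by most cron implementations):

```
30 2 * * *    python backup.py    # every day at 02:30
0 */6 * * *   bash sync.sh        # every six hours, on the hour
15 9 * * 1-5  python report.py    # weekdays at 09:15
```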