Scripting CSV Merge & Deduplication in Linux



A script to download a set of CSVs from an S3 bucket folder named for today’s date, merge them into one CSV, remove the header row from all but the first CSV, deduplicate lines whose 1st and 12th columns are identical, and sync the result back to the same S3 bucket.


Standard bash script: 


Specify today’s date then make and enter a directory with that name:

date=$(date +%Y-%m-%d)
echo "$date"
mkdir ~/"$date"
cd ~/"$date"
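The same step can be tried out side-effect free under a scratch directory (a sketch: mktemp stands in for the home directory here only so the demo leaves no files behind):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Today's date in YYYY-MM-DD form, as used throughout the script
date=$(date +%Y-%m-%d)
echo "$date"

# Scratch root stands in for $HOME so the demo leaves nothing behind
root=$(mktemp -d)

# -p avoids an error if the directory already exists on a re-run
mkdir -p "$root/$date"
cd "$root/$date"
```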

Then sync the contents of the matching date directory in the S3 bucket called testbucket into the current directory:

aws s3 sync s3://testbucket/$date/ .

Move the CSVs you want to keep to a temporary directory:

mkdir -p ../Working
mv *testphrase* ../Working/

Remove the remaining files you no longer want:

rm -f ~/"$date"/*

Move all the CSVs you want to merge back to the today’s date directory:

mv ../Working/* .
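The keep/clean/restore shuffle above can be exercised locally with throwaway files (a sketch in a temp directory; the filenames and the testphrase pattern are placeholders):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Simulate the date directory and the temporary Working directory
root=$(mktemp -d)
mkdir -p "$root/data" "$root/Working"
cd "$root/data"

# Two files we want to keep, one we do not
touch report_testphrase_a.csv report_testphrase_b.csv other.log

# Keep: move matching files aside
mv *testphrase* ../Working/

# Discard everything left behind
rm -f ./*

# Restore the keepers
mv ../Working/* .

ls
```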

Merge all CSVs into one with today’s date:

cat * > merged_testphrase_$date.csv

Remove the header from all but the first CSV. In this example the source files have no trailing newline, so after merging, the last line of each file (which always ends in Lastline) runs straight into the next file’s header (Headercolumns); replacing the joined text removes the stray headers:

sed -i -e 's/LastlineHeadercolumns/Lastline/g' merged_testphrase_$date.csv
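The merge-then-strip behaviour can be demonstrated with two tiny sample files (a sketch; Lastline and Headercolumns are the placeholder values from above):

```shell
#!/usr/bin/env bash
set -euo pipefail
cd "$(mktemp -d)"

# Two CSVs whose last data line ends in "Lastline" with no trailing
# newline, so concatenation glues it onto the next file's header
printf 'Headercolumns\nrow1|x\nrow2 Lastline' > a.csv
printf 'Headercolumns\nrow3|y\nrow4 Lastline' > b.csv

cat a.csv b.csv > merged.csv

# Before the fix, one line reads "row2 LastlineHeadercolumns"
grep -c 'LastlineHeadercolumns' merged.csv

# Strip the glued-on header, leaving the real last line intact
sed -i -e 's/LastlineHeadercolumns/Lastline/g' merged.csv
```

After the substitution only the first file’s header remains.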

In this case the field separator is not a , but a |. Using | as the column separator, keep only the first line seen for each combination of first and twelfth columns, writing the result to a new file called merged_testphrase_dedupe_$date.csv:

awk -F '|' '!seen[$1,$12]++' merged_testphrase_$date.csv > merged_testphrase_dedupe_$date.csv
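The awk idiom keeps the first occurrence of each (column 1, column 12) pair and drops later duplicates. A quick illustration with made-up pipe-delimited rows:

```shell
#!/usr/bin/env bash
set -euo pipefail
cd "$(mktemp -d)"

# Sample data: lines 1 and 3 share the same 1st and 12th columns
cat > sample.csv <<'EOF'
id1|b|c|d|e|f|g|h|i|j|k|dup|extra
id2|b|c|d|e|f|g|h|i|j|k|unique|extra
id1|B|C|D|E|F|G|H|I|J|K|dup|other
EOF

# seen[$1,$12]++ is 0 (false) the first time a pair appears, so the
# negation prints the line; on repeats it is non-zero and the line is skipped
awk -F '|' '!seen[$1,$12]++' sample.csv > deduped.csv

cat deduped.csv
```

Here the third line is dropped because (id1, dup) was already seen on the first line.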

Remove the non-deduplicated CSV:

rm -f merged_testphrase_$date.csv

Rename deduplicated CSV back to original name:

mv merged_testphrase_dedupe_$date.csv merged_testphrase_$date.csv

Copy the merged CSV back to the S3 bucket (aws s3 sync operates on directories, so use aws s3 cp for a single file):

aws s3 cp merged_testphrase_$date.csv s3://testbucket/$date/


Optionally, change the CSV delimiters from | to , so the file opens with proper columns in a spreadsheet (note that this simple substitution will corrupt any field that already contains a comma):

sed -i -e 's/|/,/g' merged_testphrase_$date.csv
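The comma caveat is easy to see with made-up data: a field that already contains a comma ends up looking like two columns after the substitution.

```shell
#!/usr/bin/env bash
set -euo pipefail
cd "$(mktemp -d)"

# Pipe-delimited sample where one field already contains a comma
printf 'name|note\nwidget|cheap, cheerful\n' > out.csv

# Naive substitution: every pipe becomes a comma
sed -e 's/|/,/g' out.csv > converted.csv

# The second row now appears to have three columns, not two
cat converted.csv
```

If fields may contain commas, a CSV-aware tool that quotes fields is safer than plain sed.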


Further Reading