Unix has so many great ways to perform text manipulation but one niche which hasn’t been filled is splitting a tabular file into pieces based on the contents of certain columns. There are two commands, split
and csplit
, that do a similar role. split
can split a file into a certain number of bytes or lines; csplit
uses a regular expression to determine where to split the file. Often for my purposes neither of these tools is a good fit, and what I really want is an equivalent to the “group by” clause in SQL databases. Group by sorts rows based on certain grouping columns so that you can then perform summaries on that group. split_by
splits out a file based on the grouping columns for further processing on each chunk. For example, say I’ve got a delimited text file containing a mix of categorical and numerical data like the following:
A | B | C | D | E | F |
---|---|---|---|---|---|
AU | BR | 447 | 223.2 | 55958 | US |
AU | FR | 348 | 16.8 | 58484 | AU |
US | UK | 12 | 53.30 | 129 | PG |
I want to split this file into smaller files based on what’s in column A, which in this example would make two files for the 2 lines containing “AU” and for the 1 line containing “US”. Neither, split
or csplit
really handle this case and the only option is to use an awk
, perl
or other script to handle it. (It’s also possible to do this in multiple passes with grep
) After writing similar awk one-liners to do this sort of thing I decided to make it a bit more reusable:
#!/usr/bin/awk -f
BEGIN{
split(cols, group_by, ",")
col_sep = "_"
}
{
if(header == 1 && NR == 1) {
next
}
out_file = ""
for(col in group_by) {
if(col == 1) {
out_file = group_by[col]
} else {
out_file = out_file col_sep group_by[col]
}
}
out_file = (prefix out_file suffix)
print $0 > out_file
}
That’s the whole program! You can run it like the following, taking advantage of the variables defined in the script:
./split_by -vFS="\t" -vprefix=test_split_ -vsuffix=".tsv" -vcols=1 -vheader=1 file.tsv
This will produce two files on the example above called “test_split_AU.tsv” and “test_split_US.tsv”. Only the cols argument is needed to say which of the columns is used to split on. If you want to split on multiple columns pass a comma separated list -vcols=1,2
. The output file names are generated based on the contents of those columns so if there are any special characters in there that might mess up a file name, you’re in trouble. The prefix and suffix variables default to nothing but are useful for being able to find the files for later by giving them all a common prefix. The nice thing about this is that all of the awk
variables are still available so changing between a csv and tsv file can be achieved using the FS parameter on the command line.