Archive for the ‘unix’ tag
Learning Unix find: Searching in/Excluding certain folders
I love playing around with commands on the Unix shell but one of the ones that I’ve found the most difficult to learn beyond the very basics is find.
I think this is partially because I find the find man page quite difficult to read and partially because it’s usually quicker to work out how to solve my problem with a command I already know than to learn another one.
However, I recently came across Greg’s wiki which seems to do a pretty good job of explaining it.
Reasonably frequently I want to get a list of files to scan but want to exclude files in the .git directory since the results tend to become overwhelming with those included:
$ find . ! -path "*.git*" -type f -print
Here we’re saying find items which don’t have .git in their path and which are of type ‘f’ (file) and then print them out.
If we don’t include the -type flag then the results will also include directories which isn’t what we want in this case. The -print is optional in this case since by default what we select will be printed.
Sometimes we want to exclude more than one directory which can be done with the following command:
$ find . \( ! -path "*target*" -a ! -path "*tools*" -a ! -path "*.git*" -print \)
Here we’re excluding the ‘target’, ‘tools’ and ‘.git’ directories from the listing of files that we return.
The -a flag stands for ‘and’ so the above command reads ‘find all files/directories which do not have target in their path and do not have tools in their path and do not have .git in their path’.
We can always make that command a bit more specific if any of those words legitimately appear in a path.
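As a rough sketch of what ‘more specific’ could look like: the pattern "*.git*" also throws away files such as .gitignore, whereas "*/.git/*" only matches things that sit inside a directory called .git. The scratch directory and the file names below are made up purely for illustration.

```shell
# Build a throwaway tree (every name here is invented for the example).
scratch=$(mktemp -d)
mkdir -p "$scratch/.git/objects" "$scratch/src"
touch "$scratch/.git/config" "$scratch/.git/objects/abc" \
      "$scratch/.gitignore" "$scratch/src/Main.scala"

# Loose pattern: anything with ".git" anywhere in its path is dropped,
# which includes .gitignore.
loose=$(cd "$scratch" && find . ! -path "*.git*" -type f | LC_ALL=C sort)

# Tighter pattern: only paths inside a directory named .git are dropped.
strict=$(cd "$scratch" && find . ! -path "*/.git/*" -type f | LC_ALL=C sort)

echo "loose:";  echo "$loose"
echo "strict:"; echo "$strict"
```

With the tighter pattern .gitignore stays in the listing, which is probably what we want when scanning a repository.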
As well as the -print flag there is also a -prune flag which we can use to stop find from descending into a folder.
The first command could therefore be written like this:
$ find . -path "*.git*" -prune -o -type f -print
This reads ‘don’t go any further into a folder which has .git in its path but print any other files which don’t have .git in their path’.
I’m still finding -prune a bit confusing to understand and as the wiki points out:
The most confusing property of -prune is that it is an ACTION, and thus no further filters are processed after it.
To use it, you have to combine it with -o to actually process the non-skipped files, like so:
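As a sketch of that -o combination (a made-up example on a scratch directory, not the one from the wiki), this compares what find prints with and without the ‘-o ... -print’ part. The directory and file names are assumptions for illustration only.

```shell
# Throwaway tree; all names are invented for the example.
scratch=$(mktemp -d)
mkdir -p "$scratch/.git" "$scratch/target/classes" "$scratch/src"
touch "$scratch/.git/config" "$scratch/target/classes/App.class" \
      "$scratch/src/App.scala"

# Without -o: the expression is only true for the pruned directories
# themselves, so they are the only things the implicit -print shows.
pruned_only=$(cd "$scratch" &&
  find . \( -name .git -o -name target \) -prune | LC_ALL=C sort)

# With -o: pruned directories are skipped, and everything else falls
# through to the right-hand side, where we keep regular files.
kept=$(cd "$scratch" &&
  find . \( -name .git -o -name target \) -prune -o -type f -print | LC_ALL=C sort)

echo "without -o:"; echo "$pruned_only"
echo "with -o:";    echo "$kept"
```

The explicit -print on the right-hand side matters: without it the default print would apply to the whole expression and the pruned directory names would show up in the output as well.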
A couple of months ago I was playing around with our git repository trying to get a list of all the scala files in the ‘src/main’ directory and I went with this command:
$ find . -type f -regex ".*src/main.*\.scala$"
Using the above flags it could instead be written like this:
$ find . -path "*src/main*" -type f -name "*.scala"
or
$ find . -type f -path "*src/main/*\.scala"
Interestingly those latter two versions seem to be a bit slower than the one that uses the -regex flag.
I’m not entirely sure why that is. Presumably by supplying two tests in the latter two solutions find has to do more work per file than it does with the single -regex test, or something like that?
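Before comparing timings it seems worth checking that the variants really select the same files. This is a minimal sketch on a made-up directory tree (all the names are assumptions), using -name "*.scala" for the second variant.

```shell
# Throwaway tree; names are invented for the example.
scratch=$(mktemp -d)
mkdir -p "$scratch/app/src/main/scala" "$scratch/app/src/test/scala"
touch "$scratch/app/src/main/scala/Foo.scala" \
      "$scratch/app/src/main/scala/notes.txt" \
      "$scratch/app/src/test/scala/FooTest.scala"

# The three ways of asking for scala files under src/main.
by_regex=$(cd "$scratch" &&
  find . -type f -regex '.*src/main.*\.scala$' | LC_ALL=C sort)
by_name=$(cd "$scratch" &&
  find . -path "*src/main*" -type f -name "*.scala" | LC_ALL=C sort)
by_path=$(cd "$scratch" &&
  find . -type f -path "*src/main/*.scala" | LC_ALL=C sort)

echo "$by_regex"
if [ "$by_regex" = "$by_name" ] && [ "$by_name" = "$by_path" ]; then
  echo "all three variants agree"
fi

# To compare speed, prefix each find with `time` on a real repository
# and run each one a few times so the file system cache is warm.
```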
Bash: Reusing previous commands
A lot of the time when I’m using the bash shell I want to re-use commands that I’ve previously entered and I’ve recently learnt some neat ways to do this from my colleagues Tom and Kief.
If we want to list the history of all the commands we’ve entered in a shell session then the following command does the trick:
> history
...
761 sudo port search pdfinfo
762 to_ipad andersen-phd-thesis.pdf
763 vi ~/.bash_profile
764 source ~/.bash_profile
765 to_ipad andersen-phd-thesis.pdf
766 to_ipad spotify-p2p10.pdf
767 mkdir LinearAlgebra
If we want to execute any of those commands again then we can do that by entering ![numberOfCommand]. For example, to execute the last command on that list we’d do this:
> !767
mkdir LinearAlgebra
mkdir: LinearAlgebra: File exists
We can also search the history and execute the last command that matches the search by doing the following:
> !mk
mkdir LinearAlgebra
mkdir: LinearAlgebra: File exists
A safer way to do this would be to suffix that with :p so the command gets printed to stdout rather than executed:
> !mk:p
mkdir LinearAlgebra
A fairly common use case that I’ve come across is to search for a file and then once you’ve found it open it in a text editor.
We can do this by using the !! command which repeats the previously executed command:
> find . -iname "someFile.txt"
> vi `!!`
We can achieve the same thing by wrapping ‘!!’ inside ‘$()’ as well:
> find . -iname "someFile.txt"
> vi $(!!)
Sam Rowe has a cool post where he goes into this stuff in even more detail.
I’m sure there are more tricks that I haven’t learnt yet so please let me know if you know some!
Unix: Getting the page count of a linearized PDF
We were doing some work last week to rasterize a PDF document into a sequence of images and wanted to get a rough idea of how many pages we’d be dealing with if we created an image per page.
The PDFs we’re dealing with are linearized since they’re available for viewing on the web:
A LINEARIZED PDF FILE is one that has been organized in a special way to enable efficient incremental access in a network environment.
The file is valid PDF in all respects, and is compatible with all existing viewers and other PDF applications. Enhanced viewer applications can recognize that a PDF file has been linearized and can take advantage of that organization (as well as added “hint” information) to enhance viewing performance.
The neat thing about this is it means that the document has meta data detailing the number of pages it contains:
Part 2: Linearization parameter dictionary
43 0 obj
<< /Linearized 1.0 % Version
/L 54567 % File length
/H [475 598] % Primary hint stream offset and length (part 5)
/O 45 % Object number of first page’s page object (part 6)
/E 5437 % Offset of end of first page
/N 11 % Number of pages in document
/T 52786 % Offset of first entry in main cross-reference table (part 11)
>>
endobj
By making use of the strings command Duncan and I hacked together a little script that lets us grab the number of pages in The Games of Strategy PDF or any other linearized PDF:
strings RAND_CB149-1.pdf |
awk '/Linearized/ { inmeta = 1 } inmeta && match($0, /\/N [0-9]+/) { print substr($0, RSTART, RLENGTH); exit }' |
cut -d" " -f2
It seems much more difficult to find the count if the document hasn’t been linearized but we didn’t need to solve that problem for the moment!
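The pipeline above could be wrapped in a small function so it can be pointed at any file. This is only a sketch: the function names are hypothetical, the tr fallback is a rough stand-in for when strings isn’t installed, and the sample file is just the linearization dictionary from the excerpt above saved as text rather than a real PDF.

```shell
# Hypothetical helper: print the printable strings in a file. Uses strings(1)
# when it is installed; otherwise a rough stand-in built from tr.
printable_strings() {
  if command -v strings >/dev/null 2>&1; then
    strings "$1"
  else
    LC_ALL=C tr -c '[:print:]\n' '\n' < "$1"
  fi
}

# Hypothetical helper: print the /N value that follows the /Linearized marker.
linearized_page_count() {
  printable_strings "$1" |
    awk '/Linearized/ { inmeta = 1 }
         inmeta && match($0, /\/N [0-9]+/) { print substr($0, RSTART, RLENGTH); exit }' |
    cut -d" " -f2
}

# Stand-in for a real PDF: the dictionary from the excerpt, as plain text.
sample=$(mktemp)
cat > "$sample" <<'EOF'
%PDF-1.4
43 0 obj
<< /Linearized 1.0
/L 54567
/H [475 598]
/O 45
/E 5437
/N 11
/T 52786
>>
endobj
EOF

pages=$(linearized_page_count "$sample")
echo "$pages"
```

For a file with no /Linearized marker the function prints nothing, so an empty result is a hint that the PDF needs a different approach.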
Unix: Summing the total time from a log file
As I mentioned in my last post we’ve been doing some profiling of a data ingestion job and as a result have been putting some logging into our code to try and work out where we need to focus.
We end up with a log file peppered with different statements which looks a bit like the following:
18:50:08.086 [akka:event-driven:dispatcher:global-5] DEBUG - Imported document. /Users/mneedham/foo.xml in: 1298
18:50:09.064 [akka:event-driven:dispatcher:global-1] DEBUG - Imported document. /Users/mneedham/foo2.xml in: 798
18:50:09.712 [akka:event-driven:dispatcher:global-4] DEBUG - Imported document. /Users/mneedham/foo3.xml in: 298
18:50:10.336 [akka:event-driven:dispatcher:global-3] DEBUG - Imported document. /Users/mneedham/foo4.xml in: 898
18:50:10.982 [akka:event-driven:dispatcher:global-1] DEBUG - Imported document. /Users/mneedham/foo5.xml in: 12298
I can never quite tell which column I need to get so end up doing some exploration with awk like this to find out:
$ cat foo.log | awk ' { print $9 }'
1298
798
298
898
12298
Once we’ve worked out the column then we can add them together like this:
$ cat foo.log | awk ' { total+=$9 } END { print total }'
15590
I think that’s much better than trying to determine the total run time in the application and printing it out to the log file.
We can also calculate other stats if we record a log entry for each record:
$ cat foo.log | awk ' { total+=$9; number+=1 } END { print total/number }'
3118
$ cat foo.log | awk 'min=="" || $9 < min {min=$9; minline=$0}; END{ print min}'
298
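The separate commands above could be folded into a single awk pass. This is a sketch rather than what we actually ran: it assumes the time is always the ninth field, it only looks at lines containing ‘Imported document’ so that other statements in the log are ignored, and the INFO line in the sample is invented to show that filtering.

```shell
# Sample log: the five lines from above plus one made-up unrelated statement.
log=$(mktemp)
cat > "$log" <<'EOF'
18:50:08.086 [akka:event-driven:dispatcher:global-5] DEBUG - Imported document. /Users/mneedham/foo.xml in: 1298
18:50:09.064 [akka:event-driven:dispatcher:global-1] DEBUG - Imported document. /Users/mneedham/foo2.xml in: 798
18:50:09.500 [main] INFO - Some unrelated statement
18:50:09.712 [akka:event-driven:dispatcher:global-4] DEBUG - Imported document. /Users/mneedham/foo3.xml in: 298
18:50:10.336 [akka:event-driven:dispatcher:global-3] DEBUG - Imported document. /Users/mneedham/foo4.xml in: 898
18:50:10.982 [akka:event-driven:dispatcher:global-1] DEBUG - Imported document. /Users/mneedham/foo5.xml in: 12298
EOF

stats=$(awk '
  /Imported document/ {
    t = $9 + 0                          # force a numeric value
    if (n == 0 || t < min) min = t      # smallest time seen so far
    if (n == 0 || t > max) max = t      # largest time seen so far
    total += t
    n++
  }
  END {
    if (n > 0)
      printf "count=%d total=%d mean=%.1f min=%d max=%d\n", n, total, total / n, min, max
  }
' "$log")

echo "$stats"
```

Keeping a separate counter instead of relying on NR means the mean isn’t skewed by lines that don’t carry a time.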