gentoo - A Click-Ass Filemanager

File Types

Introduction

Almost all of the files we use every day can be said to have a specific "type", something that categorizes the file, be it by its name, its contents, or some other property. This chapter is about how you can teach gentoo about the file types you work with most, so it can (for example) tell a Perl source code file from an HTML text file.

By itself, this typing doesn't achieve much. Sure, you can turn on the "Type" pane content column so you can learn about file types in a directory at a glance. You can even sort on that column, thus grouping files of equal types together in the listing. While all of this is useful, the actual purpose for the file typing mechanism is more subtle: each type is associated with exactly one style, and styles are real neat things, as we will see later on.

File Types in gentoo

A file type in gentoo has a pretty simple structure. It is basically a set of rules of various types which are applied to files in order to determine if they "belong" to the type in question. The type also has a name, to make it easier to work with, and a link to something called a style. That's really all there is to it.

Type Rules

When you define a new type, you must specify the type's rule set. The rules are applied to each row to be displayed by gentoo, and as soon as all rules of some type match, the file is said to have (be of, belong to) that type. You must try to be as exclusive as possible when you design type rules, so that the type doesn't "eat up" all files, thus causing incorrect typing.

There are five different kinds of rule you can use. Of these five, one is obligatory and must always be used. The other four are optional; you choose freely among them, using none, a few, or all. The rules are:

  1. Intrinsic Type (obligatory)
  2. Protection
  3. File Name Suffix
  4. File Name Regular Expression
  5. 'file' Command Regular Expression

Let's investigate each of these in turn:

Intrinsic Type

All objects in the file system have an intrinsic type. For example, a directory just isn't a regular file; its intrinsic type is directory and that cannot be changed. If you create a type and specify e.g. "character device" as the type's intrinsic type requirement, only character device files will ever be considered as belonging to your new type.

There are seven intrinsic types: file, directory, soft link, block device, character device, FIFO and socket. You must specify exactly one.
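
To make the seven intrinsic types concrete, here is a minimal Python sketch that classifies a path using the standard os and stat modules. It is only an illustration of the idea, not gentoo's own code; the function name intrinsic_type and the table _CHECKS are made up for this example.

```python
import os
import stat

# Hypothetical lookup table: one stat predicate per intrinsic type named above.
_CHECKS = [
    (stat.S_ISLNK, "soft link"),
    (stat.S_ISDIR, "directory"),
    (stat.S_ISBLK, "block device"),
    (stat.S_ISCHR, "character device"),
    (stat.S_ISFIFO, "FIFO"),
    (stat.S_ISSOCK, "socket"),
    (stat.S_ISREG, "file"),
]

def intrinsic_type(path):
    """Return the intrinsic type name of path, or None if nothing matches."""
    # lstat() inspects the link itself instead of following it,
    # so a soft link is reported as a soft link.
    mode = os.lstat(path).st_mode
    for check, name in _CHECKS:
        if check(mode):
            return name
    return None
```

Exactly one of these predicates holds for any given object, which mirrors the requirement that a type names exactly one intrinsic type.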

Protection

A file's protection (or mode) is a file system level intrinsic property. All files always have protection information available. The protection information can be seen in gentoo by using the various "mode" column content types. You can change a file's protection with the built-in ChMod command (named after a standard shell command which does the same thing). Checking a file's protection is a fast operation. A protection rule is specified as a set of six flags; each flag requires something from the file's protection. The rule matches when all flags succeed. These are the flags:

SetUID
Set this to require files to have the SetUID protection bit set.
SetGID
When set, this requires files to have the SetGID bit set.
Sticky
This requires files to be "sticky". Not often used.
Readable, Writeable, Executable
These three flags allow you to require that a file shall be readable, writable, or executable, respectively. They are interesting because they are not just direct flag comparisons against files. Rather, these three are a little intelligent. They each require that you, i.e. the user currently running gentoo, have the permission in question. When evaluating these rules, gentoo compares your (UID,GID) values against those of the file, and applies logic to determine which of the three sets of RWX flags available in the file applies.
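
The logic behind the six flags can be sketched roughly as follows. This is an assumption-laden illustration in Python, not gentoo's implementation: the function names are hypothetical, the user's group memberships are passed in as a plain set, and special cases such as the superuser are deliberately ignored.

```python
import stat

def effective_rwx(mode, file_uid, file_gid, uid, gids):
    """Pick the owner, group, or other RWX triplet that applies to the user."""
    if uid == file_uid:
        shift = 6   # owner bits
    elif file_gid in gids:
        shift = 3   # group bits
    else:
        shift = 0   # other bits
    bits = (mode >> shift) & 0o7
    return bool(bits & 4), bool(bits & 2), bool(bits & 1)

def protection_matches(mode, file_uid, file_gid, uid, gids, required):
    """required is a set of flag names; the rule matches when all of them hold."""
    r, w, x = effective_rwx(mode, file_uid, file_gid, uid, gids)
    have = {
        "SetUID": bool(mode & stat.S_ISUID),
        "SetGID": bool(mode & stat.S_ISGID),
        "Sticky": bool(mode & stat.S_ISVTX),
        "Readable": r,
        "Writeable": w,
        "Executable": x,
    }
    return all(have[flag] for flag in required)
```

For a file with mode 4750 owned by user 1000 and group 100, a user who is only a member of group 100 gets the group triplet (r-x), so a rule requiring "Readable" and "Executable" matches while one requiring "Writeable" does not.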

File Name Suffix

This is a simple file name rule. It allows you to specify a string, and then requires candidate files to have names ending in that very string for a match to be considered. If you always use the same suffix (sometimes called extension) for your file names, this rule may be all you need. Typically, a file type suffix is separated from the actual name of the file by a dot; this rule pretends it doesn't know that, so you must always include the dot as the first character in the suffix. The suffix comparison is case-insensitive, so .jpg, .JPG and .JPg all mean the same thing, and all will match each other.

A "problem" with this rule is that it only allows you to specify one suffix. Many file types have several popular suffixes, one of which is generally a dot followed by three letters for compatibility with the broken nightmare known as FAT. For example, HTML hypertext files are often given a suffix of ".html" or just ".htm". You cannot specify such alternatives with this rule; it has been optimized to check for just one suffix.
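
The behaviour of the suffix rule described above can be approximated in a couple of lines of Python. The function name suffix_matches is hypothetical, and the sketch only mirrors the documented behaviour (one suffix, case-insensitive, no special treatment of the dot).

```python
def suffix_matches(name, suffix):
    """Case-insensitive check that name ends with the single given suffix."""
    return name.lower().endswith(suffix.lower())

# The dot gets no special treatment, so it has to be part of the suffix:
#   suffix_matches("holiday.JPg", ".jpg")  is True
#   suffix_matches("notajpg", ".jpg")      is False
#   suffix_matches("notajpg", "jpg")       is True, probably not what you meant
#   suffix_matches("index.htm", ".html")   is False, one suffix cannot cover both
```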

File Name Regular Expression

For those cases when a simple suffix isn't enough, but the type can still be deduced from just the name of a file, you can use the regular expression matching rule. This rule lets you enter a full regular expression against which the names of files are checked. A match is required for the rule to succeed.

When entering regular expressions for file name matching, remember that the dot (.) is in fact a RE meta-character and needs to be escaped (by a backslash; \.) if you really want to match against a dot. Also note that your regular expression is used as a "search RE"; if a match is produced between your RE and any part of a file name, that is enough. So try to be restrictive when you write regular expressions; for example by using the ^ and $ metasymbols appropriately.

As an example of when RE matching comes in handy, consider attempting to define a file type for JPEG image files. Such files are generally given the extension ".jpeg" on real filesystems, or just ".jpg" on ugly FAT ones. This rules out using the simple suffix matcher, but lends itself perfectly to REs. One naive RE could be ".+\.(jpeg|jpg)". This works fine, but is long and unnecessarily complex to write. A neater way, IMHO, is ".+\.jpe?g". As always, remember to quote that dot!
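
The "search RE" behaviour and the effect of the unescaped dot are easy to try out. The sketch below uses Python's re module as a stand-in; the RE dialect gentoo uses may differ in details, and the function name name_re_matches is made up for this example.

```python
import re

def name_re_matches(name, pattern):
    """Search semantics: a match anywhere in the name is enough."""
    return re.search(pattern, name) is not None

# The neat JPEG expression from the text accepts both spellings:
#   name_re_matches("photo.jpg", r".+\.jpe?g")        is True
#   name_re_matches("photo.jpeg", r".+\.jpe?g")       is True
# Because it is a search, trailing junk is accepted as well:
#   name_re_matches("photo.jpg.bak", r".+\.jpe?g")    is True
# Anchoring with ^ and $ makes the rule restrictive:
#   name_re_matches("photo.jpg.bak", r"^.+\.jpe?g$")  is False
# Forgetting to escape the dot lets it match any character:
#   name_re_matches("myjpg", r".+.jpe?g")             is True
#   name_re_matches("myjpg", r".+\.jpe?g")            is False
```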

'file' Command Regular Expression

For some files, it is not possible to deduce their type from just file names. Consider ordinary executables, such as shell commands and applications. If you were to enter a RE to match the possible names of those, you would be working for a while... There must be a better way! In fact, there is, and it's called the 'file' command.

'file' is a standard Un*x command-line tool which is used to (take a guess) identify file types! How incredibly handy! As usual among standard Un*x tools, 'file' is incredibly powerful. It uses a text file (/etc/magic on most systems, I believe) containing advanced file identification rules. These rules allow looking inside the files for various values, thus making identifying e.g. executables easy: just look for the same things the OS does!

When run, 'file' outputs one line of text for each filename it is given to inspect. For example, on my system 'file' has the following to say about the gentoo executable itself:

~/data/projects/gentoo> file ./gentoo
./gentoo: ELF 32-bit LSB executable, Intel 80386, version 1, dynamically linked, stripped

That's quite a load, but don't worry; you don't have to care about all of it. With the 'file' RE rule, you specify a regular expression which is then matched against the output of 'file' when run on each of the files in a directory. The file name, colon, and space output by 'file' are removed before the RE is applied. So, to find executables, an acceptable expression is just ".+executable.+". A better one might be "ELF.+executable.+".
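
As a rough illustration of this rule, the Python sketch below takes one line of 'file' output, removes the leading file name, colon, and space, and then searches the rest with a regular expression. The function names are hypothetical, the separator is assumed to be exactly ": ", and Python's re module stands in for whatever RE engine gentoo really uses.

```python
import re

def file_description(output_line, filename):
    """Strip the 'name: ' prefix that 'file' puts in front of its verdict."""
    prefix = filename + ": "
    if output_line.startswith(prefix):
        return output_line[len(prefix):]
    return output_line

def file_re_matches(output_line, filename, pattern):
    """Search the remaining description with the rule's regular expression."""
    description = file_description(output_line, filename)
    return re.search(pattern, description) is not None

# The output line quoted above, used as sample input.
LINE = ("./gentoo: ELF 32-bit LSB executable, Intel 80386, "
        "version 1, dynamically linked, stripped")
```

With this sample, file_re_matches(LINE, "./gentoo", "ELF.+executable.+") succeeds, while a pattern such as "^gentoo" cannot match, because the file name is gone before the RE is applied.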

Note!

Using 'file' carries a pretty heavy performance penalty! Although some considerable attempts have been made in gentoo to lessen the impact, it is still there. For maximum performance, don't use types with 'file' RE rules. If you don't define any type using 'file' RE matching, gentoo will detect this and optimize the entire file typing process somewhat.

For the special case of just recognizing any executable file (i.e. not just binary ELF files), simply use the protection flag checks described above. It'll be a lot quicker.

Rule Combinations

As has been hinted above, you can use any number of rules from one (intrinsic only) to five (all of 'em!) to identify your types. The rules are applied in the order they were mentioned here, starting with the intrinsic and ending with the 'file' RE match if used. The type doesn't match unless all of its rules do.

Combining rules can be useful. For example, imagine a category of files identified by their names beginning with either "cfg_" or "cmd_", and all ending in ".c". You can set up a type using first the simple suffix matcher to lock onto the ".c" suffix, and then the name RE matcher to check for the correct prefix. Doing it this way, rather than just including the suffix in the RE, saves involving the RE routines (which are orders of magnitude more complex than the simple suffix check) until we know it's necessary.
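
The ordering argument can be shown with a small Python sketch of the "cfg_"/"cmd_" example. It is an illustration only: the intrinsic type rule is left out for brevity, the names make_type and type_matches are invented, and the optional trace list exists purely to show which rules were actually evaluated.

```python
import re

def make_type(suffix, name_re):
    """Return an ordered rule list: cheap suffix check first, RE second."""
    compiled = re.compile(name_re)
    return [
        ("suffix", lambda name: name.lower().endswith(suffix.lower())),
        ("name RE", lambda name: compiled.search(name) is not None),
    ]

def type_matches(rules, name, trace=None):
    """Apply rules in order; the type matches only if every rule does."""
    for label, rule in rules:
        if trace is not None:
            trace.append(label)   # record that this rule was evaluated
        if not rule(name):
            return False          # stop at the first failing rule
    return True

# Hypothetical type for the example in the text.
CFG_CMD_RULES = make_type(".c", r"^(cfg|cmd)_")
```

For a name like "cfg_notes.txt" the suffix check fails first, so the regular expression is never consulted; for "cfg_main.c" both rules run and both succeed.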

Built-In Types

There is one type that is always available. It is called "Unknown", and uses a magic rule system: any file is considered to be of type "Unknown"! Therefore, to prevent it from "eating up" all files, it is tested for after all your user-defined rules have failed. Basically, the existence of the "Unknown" type with these semantics guarantees that all files always have exactly one type, which is a very nice property.

Also, the "Unknown" type links to the (equally magic) "Root" style. This causes display of all untyped files to use the "Root" style, which is just as things should be. For more information on styles, check the relevant chapter.
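
The fallback behaviour of "Unknown" amounts to a loop with a default. The Python sketch below assumes that user-defined types are tried in list order and that the first match wins; the function name classify and the sample predicates are hypothetical.

```python
def classify(name, user_types):
    """user_types is an ordered list of (type_name, predicate) pairs."""
    for type_name, predicate in user_types:
        if predicate(name):
            return type_name      # the first matching type wins
    return "Unknown"              # tested last, and it accepts anything

# Two made-up user types, for illustration only.
SAMPLE_TYPES = [
    ("Image, JPEG", lambda n: n.lower().endswith(".jpg")),
    ("Source, C", lambda n: n.lower().endswith(".c")),
]
```

Whatever name you pass in, classify returns exactly one type name, which is the property the "Unknown" type is there to guarantee.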

Tips on Naming Types

Always try to use two (or more) words in your type names, going from the general to the specific. For example, a good name for the JPEG type mentioned above might be "Image, JPEG", or something similar. Likewise, you could call the executable type "Executable, ELF". Since the types are listed alphabetically in the configuration page, naming them like this helps keep related types together and makes things easier to survey and manage.