Intelligent File System

By Pype on 26.11.01
Request For Comment - comments by Nowan (28.11.01)

File Classes

Most existing file systems have a very weak notion of "file types", that is, of what kind of data lies in the raw 1s and 0s of a file. Is it music, PostScript, plain text, a picture ...?
In DOS-based systems, the file extension is used to type the file, so if your file name ends in .jpg, the system will try to open it as an image, regardless of the true content of the file. And if the extension is wrong, you have virtually no help in finding out what the true content type is.
Under Linux, you usually have some magic numbers (byte values that reflect a file type when they are found at strategic locations; for instance, the magic sequence GIF at the start of each .gif image) that help identify the file content. But they are not stored with the metadata, so the classification by magic numbers has to be done every time you want to know the file type.
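
To make the idea concrete, here is a minimal sketch of magic-number matching in Python. The table layout, the class names and the function name guess_class are assumptions made for illustration, not an existing tool; only the byte signatures themselves (GIF, PNG, JPEG, PostScript) are the well-known ones.

```python
# Hypothetical signature table: (offset, magic bytes, class name).
MAGIC_TABLE = [
    (0, b"GIF8", "gif"),                  # GIF87a / GIF89a
    (0, b"\x89PNG\r\n\x1a\n", "png"),
    (0, b"\xff\xd8\xff", "jpeg"),
    (0, b"%!PS", "postscript"),
]


def guess_class(head: bytes):
    """Return the first class whose magic bytes match, or None."""
    for offset, magic, name in MAGIC_TABLE:
        if head[offset:offset + len(magic)] == magic:
            return name
    return None


print(guess_class(b"GIF89a\x01\x00\x01\x00"))  # matches the GIF signature
print(guess_class(b"just some text"))          # no signature matches
```

Note that this scan has to be repeated on every lookup, which is exactly the cost that a stored file class would avoid.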

We want to do better under Linux and have true file classes, that is, refuse to play a picture as a sound simply because it's named my_face.wav. Therefore, a place is reserved to store each file's class, which is guessed by magic-number matching and then checked by trying to open the file with a recognizer application. This classification happens only once, when the file arrives on your system. Its result is stored in the file's metadata (along with its length, access date, etc.).

Note that we should let the user define which volumes (partitions) will be classified and which won't. A trivial example is that we won't classify swap or /tmp ... Neither should application caches be classified, nor floppies, etc ...

To make the classification process faster, we could use some external information as a hint (such as MIME types or file extensions), trying the hinted format first, then trying the rest if that didn't work.
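
The classification order described above could be sketched as follows in Python. Everything here is hypothetical: the recognizers are stand-in functions (real ones would be external applications), and the names RECOGNIZERS, EXTENSION_HINTS, classify and file_class_of are invented for the example.

```python
import os


# Hypothetical recognizers: each returns True if the bytes really are of
# that class. Real recognizers would be external applications.
def is_gif(data: bytes) -> bool:
    return data[:6] in (b"GIF87a", b"GIF89a")


def is_wav(data: bytes) -> bool:
    return data[:4] == b"RIFF" and data[8:12] == b"WAVE"


RECOGNIZERS = {"gif": is_gif, "wav": is_wav}
EXTENSION_HINTS = {".gif": "gif", ".wav": "wav"}   # assumed hint table


def classify(name: str, data: bytes):
    """Try the class hinted by the extension first, then all others."""
    hint = EXTENSION_HINTS.get(os.path.splitext(name)[1].lower())
    candidates = [hint] if hint in RECOGNIZERS else []
    candidates += [c for c in RECOGNIZERS if c != hint]
    for file_class in candidates:
        if RECOGNIZERS[file_class](data):
            return file_class
    return None          # unknown class


def file_class_of(metadata: dict, name: str, data: bytes):
    """Classify only once; later calls reuse the stored result."""
    if "class" not in metadata:
        metadata["class"] = classify(name, data)
    return metadata["class"]


meta = {}
# A GIF picture wrongly named .wav is still classified as a picture.
print(file_class_of(meta, "my_face.wav", b"GIF89a" + b"\x00" * 10))
```

The extension only changes the order in which recognizers are tried; it never decides the class by itself.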

Information relating to a file class

In addition to the recognizer, we could register other programs related to a file class, such as players, editors, etc. Each will have a short name for its operation (print, edit, view ...), letting a file browser display smart popup menus when you right-click a file.

Another possible smart use of file classes is to have a program that completes the popup menu according to the file content itself - for instance displaying each possible target of a makefile, so that you can directly select between make all, make clean or make install from that popup :)
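
A rough sketch of such a menu builder in Python is given below. The registry contents and the command names (player, tag-editor, editor) are placeholders, and the target parser only handles simple explicit targets; a real makefile can define targets in ways this regular expression will miss.

```python
import re

# Hypothetical registry: file class -> {operation name: command template}.
OPERATIONS = {
    "mp3": {"play": "player {file}", "edit": "tag-editor {file}"},
    "makefile": {"edit": "editor {file}"},
}

# Matches simple explicit targets such as "all:" or "clean: deps",
# but not variable assignments such as "CC := gcc".
TARGET_RE = re.compile(r"^([A-Za-z0-9_][A-Za-z0-9_.-]*)\s*:(?!=)")


def makefile_targets(text: str):
    """Return the explicit target names found in a makefile, in order."""
    targets = []
    for line in text.splitlines():
        match = TARGET_RE.match(line)
        if match and match.group(1) not in targets:
            targets.append(match.group(1))
    return targets


def popup_menu(file_class: str, content: str = ""):
    """Static operations of the class, plus content-derived entries."""
    menu = list(OPERATIONS.get(file_class, {}))
    if file_class == "makefile":
        menu += ["make " + t for t in makefile_targets(content)]
    return menu


sample = "all: prog\n\tcc -o prog prog.c\nclean:\n\trm -f prog\ninstall: all\n"
print(popup_menu("makefile", sample))
```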

Abstract file classes

Sometimes it's useful to group files that hold the same kind of information. For instance, MIDI, MP3 and MOD are all music files: all of them can be played or converted to a sound through a custom program. Further, a sound and a piece of music are both documents, which means they have an author and a title. Rather than redefining everything, you can reuse the definition of an existing class.

In this scheme, sound, music and document are abstract types: you can't recognize or create them! There is no file format for "music", only MP3, MIDI and FastTracker modules! Still, all music has an author and all sounds have a sound quality and a play duration, all of which can be used in search/sort requests.
[Figure: an example file class hierarchy]
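
One way to model abstract classes and field inheritance is sketched below in Python. The FileClass type and its method names are invented for the example, and the tree itself is an assumption: the text only says that sounds and music are documents, so the exact shape of the original hierarchy is not known. The MP3 recognizer is a toy that only spots files starting with an ID3v2 tag.

```python
class FileClass:
    """A file class; abstract classes have no recognizer."""

    def __init__(self, name, parent=None, fields=(), recognizer=None):
        self.name = name
        self.parent = parent
        self.own_fields = list(fields)
        self.recognizer = recognizer      # None means abstract

    @property
    def abstract(self):
        return self.recognizer is None

    def all_fields(self):
        """Fields inherited from ancestors, followed by our own."""
        inherited = self.parent.all_fields() if self.parent else []
        return inherited + self.own_fields

    def is_a(self, other):
        """True if this class is `other` or one of its descendants."""
        node = self
        while node is not None:
            if node is other:
                return True
            node = node.parent
        return False


# Assumed hierarchy, for illustration only.
document = FileClass("document", fields=["author", "title"])
sound = FileClass("sound", document, ["quality", "duration"])
music = FileClass("music", document)
mp3 = FileClass("mp3", music, recognizer=lambda d: d[:3] == b"ID3")

print(mp3.all_fields())
print(music.abstract, mp3.is_a(document))
```

Since every MP3 is_a document, a search on author can cover MP3, MIDI and MOD files without knowing any of those formats.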

Metadata

There is often information you'd like to remember about a file and make available to programs that do not explicitly know that file type. For instance, it's interesting that your file browser can display the author and the play time of an MP3 even if it's not able to play it ... Similarly, you could expect your file searcher to use the keywords of your text documents even if it's not able to display the full text itself and does not know the internals of that file format.

All this secondary information, like author, keywords, title, album, contact ..., is useful for file management and searching. You might, for instance, want to create a "virtual directory" collecting all music files whose author is Cyborg Jeff, or make a quick search on your file system for files having Clicker as a keyword. Such operations require that you can access the metadata (the secondary info) without calling a specialized program or even opening the file itself, just by looking at some directories.

First try: put metadata with data

We could decide to store all the metadata as if it were regular data, but at the start of the file and in a typed form so that it can be quickly recovered. This is an "IFF way of doing things" that was already in use with the Amiga's Interchange File Format. The true data is encapsulated within a file that carries some metadata. It would look like

Cyborg Jeff
Crossing Over The Scene
First Step On Stage
all the true data comes right here until end of file

This is easy to implement. All you need is to apply an offset when doing "regular" accesses (the offset is computed when the file is opened). Accessing the data can be made quick if you have a direct pointer to the first data block and if the metadata doesn't exceed one block.
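
The offset trick can be sketched in a few lines of Python. The on-disk layout used here (a 4-byte big-endian header length followed by key=value lines) is an assumption chosen for brevity; it is loosely inspired by IFF but is not the real IFF chunk format, and pack, TypedFile and read_at are made-up names.

```python
import io
import struct


def pack(metadata: dict, data: bytes) -> bytes:
    """Prefix the data with a length-tagged metadata header."""
    header = "\n".join(f"{k}={v}" for k, v in metadata.items()).encode("utf-8")
    return struct.pack(">I", len(header)) + header + data


class TypedFile:
    """Hides the metadata header behind a fixed offset."""

    def __init__(self, raw: io.BytesIO):
        self.raw = raw
        raw.seek(0)
        (size,) = struct.unpack(">I", raw.read(4))
        header = raw.read(size).decode("utf-8")
        self.metadata = dict(line.split("=", 1) for line in header.splitlines())
        self.offset = 4 + size            # computed once, at open time

    def read_at(self, position: int, count: int) -> bytes:
        """'Regular' access: positions are relative to the true data."""
        self.raw.seek(self.offset + position)
        return self.raw.read(count)


blob = pack({"author": "Cyborg Jeff"}, b"all the true data")
f = TypedFile(io.BytesIO(blob))
print(f.metadata["author"], f.read_at(0, 3))
```

Because the header travels inside the file, copying blob to a file system without metadata support loses nothing.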

Another great aspect of this version is that it's easy to export the typed file with its metadata to a file system that does not support such a feature, like floppies, networks, or other partitions...

The trouble is that you cannot use it to quickly retrieve all files with a given field value, nor can you have big metadata like notes or a changelog ...

Second attempt: put metadata within the directory entry

So, we could put all this information together with the file name, no? Well ... not really! First, because there is very little room in a directory entry, and second (most important), because it means that you'll have to copy the metadata into each directory that refers to the file, and walk a list of referrers each time you update the metadata if you want all directories to hold synchronized values...

So the only kinds of metadata we can put in the directory entry are the filename (because links may have different names for the same content) and the file class (because it's unlikely to change). Even access and modification times should not reside there if we want coherent info. That's why Unix introduced inodes ages ago ;-)
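
The split between directory entries and inodes can be illustrated with a toy model in Python. The dictionaries and function names below are purely illustrative and do not mirror any real file system layout: each directory entry holds only a name, an inode number and a file class, while the modification time lives once in the inode.

```python
import time

inodes = {}          # inode number -> shared per-file metadata
directories = {}     # directory path -> {file name: (inode number, class)}


def link(directory, name, file_class, inode_no):
    """Add another name for an existing inode."""
    directories.setdefault(directory, {})[name] = (inode_no, file_class)
    inodes[inode_no]["links"] += 1


def create(directory, name, file_class, inode_no):
    """Create the inode, then give it its first name."""
    inodes[inode_no] = {"mtime": time.time(), "links": 0}
    link(directory, name, file_class, inode_no)


def touch(directory, name, when):
    """Update the modification time through any one of the names."""
    inode_no, _ = directories[directory][name]
    inodes[inode_no]["mtime"] = when


def mtime(directory, name):
    inode_no, _ = directories[directory][name]
    return inodes[inode_no]["mtime"]


create("/music", "song.mp3", "mp3", 7)
link("/favourites", "best.mp3", "mp3", 7)
touch("/music", "song.mp3", 1000.0)
print(mtime("/favourites", "best.mp3"))   # same inode, so same value
```

No list of referrers has to be walked on update, since both names resolve to the same inode.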

Third fashion: put metadata within the inode

That's elegant because that's what the inode is made for. Unfortunately, if we want an inode to fit in only one block, we should make the metadata as short as possible. Letting notes lie there is not a good idea at all, for instance. Instead, we will use the inode to store keys for the values, and identifiers of what kind of metadata we are talking about.

[Figure: UML view of C32 IFS]
Now let's try to see what that UML diagram means ...
  1. The system has a database of all registered file classes and gives each one an identifier that is unique on this system (the File Class Unique IDentifier, FCID). MP3 could be one of these classes; let's say it has FCID #1234. Each inode of an MP3 file will then have 1234 as its FCID value.
  2. With the file class, we store the description of all possible metadata fields that could be remembered about a file of that class. Each field gets an identifier that is unique within the file class (the Class Metadata IDentifier, CMID). A class could also inherit fields from an abstract file class. For instance, author is a string and has a CMID of 1, while length is a number and has a CMID of 2. Album is a text with a CMID of 3.
  3. The inode of an MP3 file might then carry metadata fields looking like <CMID=1:key=AuthorKeyForCyborgJeff>, <CMID=2: 3'50s> and <CMID=3:see external metadata block, offset 16>. This reflects the three ways you can store a metadata value:
    • directly within the inode, as for the duration - the size is then limited,
    • in an external data block - a kind of parallel file - so that you can have as many bytes as you want,
    • or as a key to be looked up in a specific database (the values database in our case).
As stated in the post-it, each entry of a values database might have a database of all inodes that have that value, so it becomes trivial to quickly find all MP3 files whose author is Cyborg Jeff ...
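
Putting the three steps together, a sketch in Python could look like this. The FCID 1234, the CMIDs 1 to 3 and the offset 16 come from the example above; the Inode and ValueDatabase types, their method names and the in-memory dictionaries are assumptions standing in for on-disk structures.

```python
from dataclasses import dataclass, field

INLINE, EXTERNAL, KEYED = "inline", "external", "keyed"


@dataclass
class Inode:
    number: int
    fcid: int                                    # file class identifier
    fields: dict = field(default_factory=dict)   # CMID -> (kind, payload)


class ValueDatabase:
    """Per-field database: value -> key, plus key -> inode numbers."""

    def __init__(self):
        self.keys = {}        # value -> key
        self.values = []      # key -> value
        self.inodes = {}      # key -> set of inode numbers

    def key_for(self, value):
        if value not in self.keys:
            self.keys[value] = len(self.values)
            self.values.append(value)
        return self.keys[value]

    def attach(self, inode: Inode, cmid: int, value):
        """Store the value as a key in the inode and index the inode."""
        key = self.key_for(value)
        inode.fields[cmid] = (KEYED, key)
        self.inodes.setdefault(key, set()).add(inode.number)

    def find(self, value):
        """Inode numbers whose field holds `value`; no file is opened."""
        key = self.keys.get(value)
        return set() if key is None else set(self.inodes.get(key, ()))


MP3_FCID, AUTHOR, LENGTH, ALBUM = 1234, 1, 2, 3
authors = ValueDatabase()

a = Inode(10, MP3_FCID)
authors.attach(a, AUTHOR, "Cyborg Jeff")
a.fields[LENGTH] = (INLINE, 230)        # 3'50s, stored in the inode itself
a.fields[ALBUM] = (EXTERNAL, 16)        # offset in external metadata block

b = Inode(11, MP3_FCID)
authors.attach(b, AUTHOR, "Cyborg Jeff")

print(sorted(authors.find("Cyborg Jeff")))
```

The inode only ever stores a small key for the author, so the string itself is kept once no matter how many files share it.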

Selecting which fields have a values database and which don't, or whether the per-value inode database is activated, is up to the user. It might be set up by some programs or through system configuration.
