Content-aware recovery of email messages and databases

Today’s data recovery techniques rely heavily on content-aware algorithms instead of using the possibly corrupted file system as the only source of information about disk location of files being recovered. Often referred to as signature-search algorithms, these technologies read the entire disk surface sector after sector in order to discover the missing files.

This article reveals the internals of one of such algorithms in application to recovering email databases and individual email messages in RFC -822 format, discussing quirks and issues the developers faced when implementing content-aware recovery of users’ emails. The article comes from the developers of numerous data recovery tools employing signature-search algorithms in their products. Expertise shared by the developers will help computer users better understand strengths and weaknesses of much-touted content-aware algorithms. The article will also benefit software developers, giving them valuable hints and steering them in the right direction when implementing signature search techniques for various purposes.

What Can Be Recovered

Our content-aware algorithms can discover email messages in databases in many popular formats including MS Outlook (PST), Mozilla Thunderbird, and RFC-822 EML format. PST files are de-facto standard in most offices using MS Outlook, while EML format is heavily utilized in many free email clients such as those used in Unix and Linux systems, Microsoft Vista Mail and Live Mail. Interestingly, Mozilla Thunderbird makes use of the EML format regardless of storing mail in a solid database file. Its database is, in fact, a linear storage of individual EML files separated with program-specific tags.

Many Email Applications but Only Two Formats

Fortunately, recognizing just two email storage formats is well enough for successful recovery of emails and databases produced by most popular email clients. In fact, we’re using two branches of the signature-search algorithm. One branch handles PST databases in binary format, while the second one takes care of text-only EML and Thunderbird files.

To illustrate how the algorithm works, we’ll be using Delphi code using WinAPI. No cross-platform compatibility was planned or intended at the time of initial development, but cross-platform developers will certainly get the main idea.

Text Files and RFC-822 EML: Simple Format, Complex Recovery

Text is an incredibly simple format. At the same time, missing text files are tough ones to recover. The thing is, unlike many binary formats such as PDF, DOC or ZIP, text files do not contain information about their length or location on the hard drive. We can’t just read sectors on the disk looking for ASCII text. If we do, we’ll inevitably end up with tons of garbage.

We faced this exact problem when developing SoftAmbulance software family. The tools are designed to recover missing and deleted files of various types from badly damaged, corrupted, formatted and repartitioned disks. That’s to say, the type of disks that may no longer hold their file systems intact, so we can’t reliably rely on FAT or NTFS records to discover the location of the file on the hard disk.

The main idea behind our approach is that EML files are not exactly text files. In addition to message body, which might be HTML or text in whatever encoding, emails contain structured headers in a strictly defined RFC-822 format.

Strictly speaking, RFC-822 is a text format. However, it has enough of a structure for our algorithm to detect the beginning of an EML file. The assumption we made is everything that goes after the RFC-822 header that’s not text is not part of an EML file.

In order to detect what’s text and what is not, we’re creating an array of characters that are considered valid text symbols corresponding to ANSI character set of user’s currently selected locale. Listing 1 shows how this works in AnsiInitialize_LOCALE_USER_DEFAULT and WideInitialize_LOCALE_USER_DEFAULT functions. In Windows world, we’re utilizing the LOCALE_USER_DEFAULT constant. The functions return an array of characters that are considered valid text symbols for the user’s currently selected locale.

LISTING 1
type
  TBoolAnsi = array[AnsiChar]of Boolean;
  TBooleArr = packed array[WideChar]of boolean;

  // Create a list of valid ANSI characters in user’s
  // currently selected locale
function AnsiInitialize_LOCALE_USER_DEFAULT(var boolAnsi: TBoolAnsi): Integer;
var
  c: AnsiChar;
  CharType: word;
begin
  for c := Low(AnsiChar) to High(AnsiChar) do
    boolAnsi[c]:= GetStringTypeExA(LOCALE_USER_DEFAULT, CT_CTYPE1, @c, 1, CharType)
                  and (CharType<>0)
                  and (CharType and C1_CNTRL = 0)
                  and (CharType and C1_PRINTABLE <> 0);
end;

The rest is quite simple. The algorithm is reading consecutive sectors on the disk looking for signs of a typical RFC-822 header. We’re using a weighed score method that triggers when a few typical strings such as "x-mailer:", "mime-version:", "from:" "date:", "content-type:", "subject:" etc. are discovered in close proximity. When the algorithm believes a valid RFC-822 header is discovered, it starts verifying consecutive sectors for text data by checking sectors it reads against the boolWide array containing what we consider to be valid text characters.

Of course, our text detection algorithm is also weighed score based. We’ll consider data a text if less than 2 per cent of characters fall outside of the defined list of text symbols. When the 2 per cent threshold is reached, the data is considered binary; the algorithm stops there and returns the location of yet another EML file.

Knowing the beginning (RFC-822 header) and end of an email message, the data can be saved as a regular EML file. When looking for Mozilla Thunderbird files, the list of RFC-822 headers is extended with two extra fields specific to that email client: "x-mozilla-status:" and "x-mozilla-status2:".

Recovering Outlook PST Files

The simplest and most widely utilized way to locate Outlook Personal Folder (.pst) files on the disk is looking for PST binary header structures starting with "!BDN" string.

Located at the very beginning of the file, the header structure contains essential information about the PST file. For data recovery purposes, PST file size is the most important piece of information available in the header. Although the layout of the header structure differs slightly between Unicode and ANSI versions, the differences are minor enough to consider it being a single format instead of employing two different branches of the data recovery algorithm.

[Table 1: Unicode header]

0	1	2	3	4	5	6	7	8	9	1 0	1	2	3	4	5	6	7	8	9	2 0	1	2	3	4	5	6	7	8	9	3 0	1
dwMagic
CRCPartial
wMagicClient																wVer
wVerClient																bPlatformCreate								bPlatformAccess
dwReserved1
dwReserved2
bidUnused
...
bidNextP
...
bidNextB
...
dwUnique
rgnid[] (128 bytes)
...
qwUnused
...
root (72 bytes)
...
dwAlign
rgbFM (128 bytes)
...
rgbFP (128 bytes)
...
bSentinel								bCryptMethod								rgbReserved
bidNextB
...
dwCRCFull
ullReserved
...
dwReserved
...
rgbReserved2																								bReserved
rgbReserved3 (32 bytes)
...

[Table 2: ANSI header]

0	1	2	3	4	5	6	7	8	9	1 0	1	2	3	4	5	6	7	8	9	2 0	1	2	3	4	5	6	7	8	9	3 0	1
dwMagic
CRCPartial
wMagicClient																wVer
wVerClient																bPlatformCreate								bPlatformAccess
dwReserved1
dwReserved2
bidNextB
bidNextP
dwUnique
rgnid[] (128 bytes)
...
root (40 bytes)
...
rgbFM (128 bytes)
...
rgbFP (128 bytes)
...
bSentinel								bCryptMethod								rgbReserved
ullReserved
...
dwReserved
rgbReserved2																								bReserved
rgbReserved3 (32 bytes)
...

Tables 1 and 2 demonstrate the differences between Unicode and ANSI headers. In order to determine the size of the PST file, we’ll need to discover which format it’s in. The value of the wVer parameter defines PST format as ANSI (values 13 and 14) or Unicode (23).

The following listing illustrates how our algorithm searches for PST file headers and determines the size of the PST file.


type
  TPWS_Data = record
    pBuff: Pointer;
    buffSize: Integer;
    Stream: TStream;

    Output_Ext : TPWS_Extension;
    Output_size: int64;
  end;

function CheckHeader(var Data: TPWS_Data): boolean;
const
  Pst_File_Header   = $4E444221; // !BDN

  ver1 = $0E;
  ver2 = $17;

  REST = 253952;
type
  PPST_Header = ^TPST_Header;
  TPST_Header = packed record
    MagicHeader: Cardinal;
    unk1: Cardinal;
    unk2: word;
    Version: byte;
    res1: packed array[0..156] of byte;
    File_Size1  : integer;
    lastSegment1: integer;
    res2: packed array[0..7] of byte;
    File_Size2  : Int64;
    lastSegment2: int64;
  end;

var
  Header: ^TPST_Header;
  fileSize: int64;
  bFSize32, bFSize64: Boolean;
begin
// Algorithm:
// If the two bytes representing version number are valid, then
//     If “size” value is obviously invalid (is less than
//     or equal to zero or equals MaxInt), calculate size based
//     on contextual data;
//     Else use value from the FileSizeX field;
// Else assume calculated and stored sizes are the same
// in order to be valid for one version or another
  Result:=False;
  if PPST_Header(Data.pBuff).MagicHeader <> Pst_File_Header then
    Exit;

    Header:=Pointer(Data.pBuff);

    Result:=True;
    Data.Output_Ext:=FilePst.Extension;
    Data.Output_size:=-1;

    // Determining length
    if (Header.Version=ver1)or(Header.Version=13) then begin
      fileSize:= 0;
      if (Header.File_Size1 > 0) and (Header.File_Size1 < MaxInt) then
        fileSize:= Header.File_Size1;

      if (Header.lastSegment1 > 0) and (fileSize = 0)
        and (Header.lastSegment1 < MaxInt - REST )
      then
        fileSize:= Header.lastSegment1 + REST;

      // Setting length value
      if fileSize > 0 then
        Data.Output_size:= fileSize;

    end else

      if Header.Version=ver2 then begin
        fileSize:= 0;
        if (Header.File_Size2 > 0) and (Header.File_Size2 < FilePst.MaxFileSize) then
          fileSize:= Header.File_Size2;

        if (Header.lastSegment2 > 0) and (fileSize = 0)
          and (Header.lastSegment2 < FilePst.MaxFileSize-REST)
        then
          fileSize:= Header.lastSegment2 + REST;

      // Setting length value
        if fileSize > 0 then
          Data.Output_size:=fileSize;

      end else begin
        // If Header.Version is wrong
        bFSize32:= (Header.File_Size1 > 0) and (Header.lastSegment1 > 0)
              and (Header.lastSegment1 < MaxInt - REST )
              and (Header.File_Size1 - Header.lastSegment1 <= REST );

        bFSize64:= (Header.File_Size2 > 0) and (Header.lastSegment2 > 0)
              and (Header.File_Size2 < FilePst.MaxFileSize)
              and (Header.lastSegment2 < FilePst.MaxFileSize - REST )
              and (Header.File_Size2 - Header.lastSegment2 <= REST );


        if (bFSize32 <> bFSize64) then begin
          // Setting length value
          if bFSize32 then
            Data.Output_size:=Header.File_Size1;

          if bFSize64 then
            Data.Output_size:=Header.File_Size2;
        end;
      end;
end;

By using the code shown in Listing 2, one can reliably detect the type and size of a PST file. Knowing the exact position of a PST file on the disk, one can easily extract and save the file.

It’s important to note that all information being recovered must be written onto a different disk. Otherwise, one faces the risk of overwriting information instead of recovering it. Of course, this rule equally applies to all other types of data being recovered.

What about PST Encryption?

PST data blocks are encoded. However, they are not technically encrypted in a truly forensic sense. According to Microsoft, "These algorithms only provide data obfuscation and can be conveniently decoded once the exact encoding algorithm is understood. Moreover, only end-user data blocks are encoded in the PST. All the other infrastructure information, including the header, allocation metadata pages and BTree pages are stored without obfuscation. In summary, the strength of the encoded PST data blocks provides no additional security beyond data obfuscation."

As such, PST encryption does not present a particular challenge. We don’t even need to decode information as only user data (actual email messages, appointments, organizer information etc.) is being encrypted, while all headers and technical information are left in their plain form.

To quote Microsoft again, "“The PST Password, which is stored as a property value in the Message store, is a superficial mechanism that requires the client implementation to enforce the stored password. Because the password itself is not used as a key to the encoding and decoding cipher algorithms, it does not provide any security benefit to preventing the PST data to be read by unauthorized parties. Moreover, the password is stored as a CRC-32 hash of the original password string, which is prone to collisions and is relatively weak against a brute-force approach”."

From what you see, using a password to protect Microsoft Outlook (PST) files is not a good idea. Not only does it fail to provide any sort of protection against unauthorized access to user’s personal information, but the cryptographically insecure CRC-32 hash makes it a perfect target for an accelerated brute-force attack.

For these reasons, our data recovery algorithms will not use passwords when recovering PST files (or, rather, when creating a new PST file on another disk).

Conclusion

With multiple email clients available on the market, the majority of formats can be actually recovered with just two algorithms. After reading this article, you have learned how to detect the beginning and end of an EML file, distinguish between text and binary data, and discover the location of PST files. The issue of PST encoding was covered to demonstrate the encryption is not of an issue from consumer data recovery standpoint (and is of negative value from forensic standpoint, presenting a security issue regarding the insecure password hashing prone to fast brute-force attacks).

Author’s Bio

Dmitry Solop is a leading developer managing the entire range of email recovery products offered by SoftAmbulance. With more than five years of experience, Dmitry knows everything about email, disk and data recovery techniques. He developed key algorithms currently employed in SoftAmbulance products. Dmitry has a B.Sc. in Applied Mathematics and Social Informatics. He is currently busy developing a database recovery product to fix MS SQL, MySQL, MS Exchange, Active directory, MS Sharepoint, and MS Project Server files. He is also involved in the maintenance of existing email recovery products.