• PDF X-RAY Without Storage

    November 10, 2011 • Uncategorized

    If you have uploaded a very large document to PDF X-RAY (hosted or local), you may have noticed that processing fails. When the generated output exceeds a certain size, MongoDB refuses to store it. As demonstrated before, you can pack almost any document, so there are times when you need to analyze something that can’t be stored.

    I ran into this issue and decided to address it. I just committed memory_runner to the PDF X-RAY builder folder. This Python script uses the same functions PDF X-RAY does, but instead of storing the output, it passes it directly to the malobjclass helper class for analysis. There is a function defined for user code where you can add your post-processing instructions.
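
    The general pattern of analyzing in memory and handing the result to a user hook, rather than writing it to a database, can be sketched roughly as follows. This is not the memory_runner source: build_report, user_code, and run_in_memory are hypothetical names made up for illustration, and the "report" here is only an MD5 and a size, not the real PDF X-RAY output.

```python
import hashlib
from pathlib import Path


def build_report(data: bytes) -> dict:
    """Hypothetical stand-in for the real builder: summarize one file in memory."""
    return {"md5": hashlib.md5(data).hexdigest(), "size": len(data)}


def user_code(report: dict) -> None:
    """Post-processing hook (hypothetical): replace the body with your own analysis."""
    print("[+] MD5", report["md5"])


def run_in_memory(directory: str, hook=user_code) -> list:
    """Build a report for every .pdf in `directory` and pass it to `hook`.

    Nothing is written to a database, so the size of a report is limited
    only by available memory.
    """
    reports = []
    for path in sorted(Path(directory).glob("*.pdf")):
        report = build_report(path.read_bytes())
        hook(report)
        reports.append(report)
    return reports
```

    Keeping the hook as a plain function argument means the same loop can print, collect, or compare reports without touching the analysis code.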

    To give you an idea of how you could use this tool, I wrote a quick hack script to find shared object hashes between three PDF files in a directory. These files are large, but that is not an issue for memory_runner, unless of course you exhaust your memory, in which case I leave the solution to you.


    Below is the output from my run:

    root@pdfxray:~# python cli/memory_runner.py -d /home/xray/testing/

    [+] Analyzing file randon1.pdf

    [+] Analyzing file randon2.pdf

    [+] Analyzing file randon3.pdf

    [+] File Data

    [+] MD5 beab29554c70d08e4c38af519cc6be28

    [+] Object Count 49

    [+] File Data

    [+] MD5 44be579854c8218c1947d36562df8998

    [+] Object Count 49

    [+] File Data

    [+] MD5 4e4984f0ae76b7bea686700bfebec8d5

    [+] Object Count 49

    [+] The following 0 encoded stream hashes are shared

    [+] The following 26 raw hashes are shared

    e7e368af9ea29ad4c700b0ff6049634a ID: 49

    1dd559ddbcf17072b9677e42f36ba4c6 ID: 31

    d2ab692775a25405828526d74f40581d ID: 45

    859cb5af5b5bce8fd456a6f8eeacab8d ID: 33

    d5a3c494726131f7d2f9ccb4d1318b46 ID: 5

    dffb7be9b9bec70ca1b5a7fbdf9fca8b ID: 3

    9654c69cf4e6fa0d66f7e39d73b0dceb ID: 29

    69979df21eaec6267c1f4f961822185f ID: 46

    6cc0c8370b3cd4903b2ea1ad0bb6efdf ID: 39

    84ed8e164750230a6827554c5252c380 ID: 43

    37d66d9085b3417a2a24af45cb5e8a14 ID: 11

    19b1854d6b9d55d120d424be77aeffa5 ID: 35

    f0d7b433cc452199b7a90d946c157a3f ID: 23

    85422a420cd796a130841b272bacebaf ID: 1

    a56baa863c81465218798c3b6f41210f ID: 41

    7a079febb0af4e66e73c47df22805cdd ID: 2

    1872e12ff7a5d02584b22d9a1b8147af ID: 7

    35ad9c858343494f912be279fda9e0cd ID: 19

    2d107f8be3c4ff780c8ee6b8d7b1d0e0 ID: 17

    ff7050be3dd5c87754cbe56a3938cc71 ID: 13

    303c33871e82a9211bd71b0250af9fa3 ID: 25

    2c10661d8de9f6c4e5981f56c2efacbf ID: 21

    f1069d78988708f5faf650d117b5149f ID: 15

    dd953edc7367297830d7a7e52cc6282b ID: 9

    b4fc0eea24d9852fe13140a3b7bd4397 ID: 37

    685909605bc719f54fedb03fa5c68ca9 ID: 27

    As you can see from the output, the three files share 26 raw object hashes and no encoded stream hashes. When comparing files used in specific attacks, this sort of information can be useful. This extension of memory_runner certainly has some limitations, but it demonstrates how easy it is to add your own code to get the output you want.
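
    For readers who want to try a similar comparison, the core idea of finding hashes shared by every file comes down to a set intersection. The sketch below is an assumption-laden illustration, not the script used for the run above: it takes each file as a plain dict of object ID to raw object bytes (a made-up structure, not the malobjclass format), and object_hashes and shared_hashes are hypothetical helper names.

```python
import hashlib


def object_hashes(objects: dict) -> dict:
    """Map the MD5 of each raw object body to its object ID."""
    return {hashlib.md5(body).hexdigest(): obj_id for obj_id, body in objects.items()}


def shared_hashes(files: list) -> dict:
    """Return {hash: object ID} for hashes present in every file.

    The reported ID is the one from the first file; the same content may
    carry a different ID in the other files.
    """
    if not files:
        return {}
    per_file = [object_hashes(objects) for objects in files]
    common = set(per_file[0])
    for hashes in per_file[1:]:
        common &= set(hashes)
    return {h: per_file[0][h] for h in common}


# Toy data standing in for three parsed PDFs (assumed structure).
sample_files = [
    {1: b"catalog", 2: b"pages", 3: b"unique-a"},
    {1: b"catalog", 2: b"pages", 3: b"unique-b"},
    {1: b"catalog", 2: b"pages", 5: b"unique-c"},
]

shared = shared_hashes(sample_files)
print("[+] The following %d raw hashes are shared" % len(shared))
for digest, obj_id in sorted(shared.items(), key=lambda item: item[1]):
    print(digest, "ID:", obj_id)
```

    Because the comparison only keeps one hash per object, memory use stays small even when the objects themselves are large.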