
Motivation

  • The threat of unpatched software.
  • The use of open-source software/libraries in development increases the risk of security bugs.
  • The massive WannaCry attack in May 2017, caused by an unpatched bug in the Windows OS.
  • The limitation of package manager tools in keeping track of manually installed software and other types of software (e.g., browser extensions).

=> The need for a new way to keep track of, and maintain the update status of, all types of software in a system.

Goal

  • Project description
The goal of this project is to automatically identify the provenance of each program installed in a running system. For instance, executables and libraries can be part of programs downloaded from the Web, cloned from GitHub, or installed as commercial software.
The student needs to identify techniques and heuristics to determine how, from which source, and which precise version of a piece of software was installed, on both Linux and Windows systems.
  • Automatically: the whole process should run with no (or minimal) human interaction.
  • Provenance:
    • How? Downloaded or Cloned
    • From which source? From the Web, from GitHub, or somewhere else.
    • Which precise version?
  • Executables and libraries: the types of programs considered. Other types, such as shell and Python scripts, can also be covered.
  • Operating system: both Windows and Linux.
  • Additional:
    • Which is the latest version of a software available on the Internet?
    • Is the software still Active (is development/maintenance still going on)? If the software is NOT Active (the developer has not updated it for a long time), there is a significant chance that it is vulnerable to unpatched security bugs.

The two big challenges

Challenge-1:

Given an arbitrary file in the system, how do we determine the exact version (and related information, e.g., release date) of the program/package it belongs to?

Challenge-2:

Given a program name (or its file names), how do we find information about its latest version on the Internet?

Software classification

From this project's perspective, we classify the software in a system into the following categories. For each category, we will use different approaches (techniques and heuristics) to tackle the big challenges mentioned above.

Category  OS       Type of program  Open/Closed  How         From       Approach [Chall-1 / Chall-2]
1-1       Linux    Any              Open         Downloaded  GitHub     [S-A]
1-2       Linux    Any              Open         Downloaded  Others     [S-B]
1-3       Linux    Any              Open         Cloned      Any        [S-C]
1-4       Linux    Any              Closed       Downloaded  Others     [S-B]
1-5       Linux    ELF              Open         Compiled    Local dir  -
2-1       Windows  Any              Open         Downloaded  GitHub     -
2-2       Windows  Any              Open         Downloaded  Others     -
2-3       Windows  Any              Open         Cloned      Any        -
2-4       Windows  Any              Closed       Downloaded  Others     -
2-5       Windows  PE               Open         Compiled    Local dir  -
  • Notes:
    • Others: any website other than GitHub.
    • Local dir: could be Downloaded or Cloned from GitHub or any other website.
    • Example: Browser extensions can fall into category 1-4 (or 2-4).
    • S: Simple approach for Challenge-1.
    • A, B, C: Approaches for Challenge-2.

Scope of this semester project

In the duration of this semester project (2017-SPRING-41), we focus on approaches [A], [B], and [C] to tackle Challenge-2 for open-source software, with Linux as the operating system.

A simple approach ([S]) for Challenge-1 is also implemented. Its limitations are discussed and some suggestions for improvement are given.

Methodology

Challenge 1

Algorithm [S]: A simple way to detect the creation date of a program.

  • Hypothesis:
    • H1: All files inside the program directory have the same creation date.
    • H2: All files have been untouched since the program was installed on the system (i.e., each file's Modification Date equals its Creation Date).
  • Input: A directory path
  • Output: A date when the package was created.
  1. Choose any file path in the directory.
  2. Use the stat command to get its time of last data modification (with the %Y or %y format specifier); see the sketch below.
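
A minimal shell sketch of Algorithm [S], assuming GNU stat and find are available (PKG_DIR is a hypothetical placeholder for the input directory):

# Pick an arbitrary file inside the package directory.
f=$(find "$PKG_DIR" -type f | head -n 1)
# %y prints the last data modification time in human-readable form;
# %Y would print it as seconds since the epoch.
stat -c '%y' "$f"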

Algorithm [Sim]: An improved version of the simple approach for detecting the creation date of a program.

  • Hypothesis:
    • H1: The Creation Date of a program equals the Modification Date of the vast majority of the files in its directory.
  • Input: A directory path
  • Output: A date when the package was created.
  1. For every file in the directory, use the stat command to get its time of last data modification (with the %Y or %y format specifier).
  2. Sort the files based on their Modification Date.
  3. The Modification Date shared by the largest number of files is the returned value (see the sketch below).
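
A minimal shell sketch of Algorithm [Sim] under the same assumptions; it prints the modification date shared by the largest number of files:

# Collect the modification date (date part only) of every file,
# count how many files share each date, and keep the most common one.
find "$PKG_DIR" -type f -exec stat -c '%y' {} + \
  | cut -d' ' -f1 | sort | uniq -c | sort -rn \
  | head -n 1 | awk '{print $2}'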

Challenge 2:

Algorithm [A]: Search for latest information of a package on GitHub.

  • Hypothesis:
    • H1: A directory basename is probably its repo name on GitHub.
    • H2: The longer a file path, the more unique it is to the package.
  • Input: A directory path and its files.
  • Output: Information from GitHub.
    • Generic info: User, Repo (package) Name
    • Updated info: Latest Release, Released Date, Latest Commit, Committed Date of the package on GitHub.
  1. Set the number of files to search: n. A higher n is relatively more accurate, but slower to process and so strict that it can sometimes miss the correct result. We choose n = 3.
  2. Identify n files with the longest file paths in the directory.
  3. For each file i above (i from 0 to n-1):
    • Use the GitHub API to search for the file path on GitHub.
    • This returns a set of results: R1[i].
  4. If an item r appears in every R1[i] (i from 0 to n-1), it is a candidate => put r into the candidate list R1.
  5. Use the GitHub API to search for repositories matching the directory basename d, yielding a set of results R2.
  6. The first item in R2 that also appears in R1 is the returned repository R.
  7. If no item in R2 matches R1, all candidates in R1 are considered returned repositories: R = R1.
  8. Use the GitHub API to get the latest information for the repository (or all repositories) in R; a sketch of these API calls is given below.
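
A hedged sketch of steps 5-8, assuming curl and jq are installed. The endpoints used here (search/repositories, releases/latest and the commits listing) are standard GitHub REST API v3 calls; the code-search calls of steps 1-4 are omitted since they require an authenticated token (see references 12-13). The shell variable d is assumed to hold the directory basename:

# Step 5: search repositories by the directory basename; take the top hit.
repo=$(curl -s "https://api.github.com/search/repositories?q=${d}" \
       | jq -r '.items[0].full_name')
# Step 8: latest release (tag name and publication date).
curl -s "https://api.github.com/repos/${repo}/releases/latest" \
  | jq -r '"\(.tag_name) \(.published_at)"'
# Latest commit (hash and commit date) on the default branch.
curl -s "https://api.github.com/repos/${repo}/commits?per_page=1" \
  | jq -r '"\(.[0].sha) \(.[0].commit.committer.date)"'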

Algorithm [B]: Search for latest information of a package on Google.

Some techniques for advanced Google search have been investigated; however, no actual algorithm has been implemented in this phase. The idea: with a specific enough query, the desired information (latest version, release date, ...) appears directly on the Google results page.

  1. All in text or In text
allintext:skype latest version 2017
skype intext:latest version 2017
  2. All in title or In title
allintitle:skype latest version 2017
skype intitle:latest version 2017
  3. File type, and combining multiple options
firefox download filetype:zip OR filetype:tar OR filetype:bz2
  4. Specific website
freeradius server latest version download site:sourceforge.net
  5. Some combined examples:
burp suite intext:latest version release intitle:download
ccleaner windows 32 intext:"latest version" intitle:version

Results:

Download Burp Suite Free Edition. Burp Suite Free Edition v1.7.23. Latest Stable. Released 22 May 2017 | v1.7.23 Release notes. Download ...
Get new version of CCleaner. Cleans ... CCleaner Latest version 5.30.6065 ... CCleaner is the most well known PC cleaning, Windows based software program .

Algorithm [C]: Get the latest information by git command.

  • Hypothesis: none; this is the most accurate method (but it is limited to cloned directories only).
  • Input: A directory path to scan.
  • Output: All cloned packages and their latest information: Latest Release, Release Date, Latest Commit, Commit Date.
  1. Scan for all .git directories inside the provided path.
  2. For each result, update the local git database from the remote: git fetch -tq > /dev/null
  3. Extract all information we need from .git/refs/.
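
A minimal shell sketch of Algorithm [C] (SCAN_DIR is a hypothetical placeholder). As a simplification, git describe and git log are used to read the latest tag and commit instead of parsing .git/refs/ by hand:

find "$SCAN_DIR" -type d -name .git | while read -r gitdir; do
  repo=$(dirname "$gitdir")
  # Step 2: refresh tags and refs from the remote, quietly.
  git -C "$repo" fetch -tq > /dev/null 2>&1
  # Latest tag reachable from HEAD, if any.
  tag=$(git -C "$repo" describe --tags --abbrev=0 2>/dev/null)
  # Most recent local commit (HEAD) with its date.
  commit=$(git -C "$repo" log -1 --format='%h %ci')
  echo "$repo | release: ${tag:-none} | commit: $commit"
done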

Other challenges:

Challenge 3: Extract package directory

Given a list of many files and directories, how can we identify which directories and sub-directories belong to the same package, and where the root directory of each package is?

For example, with the list of directories as follows:

/usr/share/program1
/usr/share/program2
/opt/program3/sub-dir31
/opt/program3/sub-dir32
/program4/sub-dir41
/program4/sub-dir42

The expected output should be the list of the root directory of each package.

/usr/share/program1
/usr/share/program2
/opt/program3/
/program4

Algorithm [D]: Extract package directory

  • Hypothesis:
    • H1: The vast majority of files of the same package have the same Creation Date.
    • H2: The directory is not a project under development (i.e., the user does not frequently modify its files).
    • H3: The number of files in the directory is large enough (more than 7).
  • Input: A list of directories, sub-directories and files.
  • Output: List of package's root directories.
  1. Choose a threshold (magic number): n=7
  2. For each of the directory Di in the list:

2.1. Check whether it belongs to one package or more than one:

  • Use the stat command (with the %Y or %y format specifier) to get the Modification Date of all its files and sub-directories.
  • Sort and uniq them based on their Modification Date.
  • If the number of unique Modification Dates is:
    • Less than n => they belong to the same package.
    • Greater than or equal to n => they belong to different packages.

2.2. If Di belongs to ONE package, check its parent Pi.

  • If Pi belongs to MORE THAN ONE package, Di is a package directory. Put Di into the result R.
  • If Pi also belongs to ONE package, both Di and Pi are sub-directories of the same package => check their parent PPi (go back to step 2.1).
  3. Sort and uniq the returned result R. A sketch of the test in step 2.1 is given below.
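
A minimal shell sketch of the test in step 2.1 (DIR is a hypothetical placeholder; n is the threshold from step 1):

n=7
# Count the distinct modification dates among all files in DIR.
distinct=$(find "$DIR" -type f -exec stat -c '%y' {} + \
           | cut -d' ' -f1 | sort -u | wc -l)
if [ "$distinct" -lt "$n" ]; then
  echo "$DIR looks like a single package"
else
  echo "$DIR spans more than one package"
fi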

Implementation

Input:

  • Scan directory: D
  • File types of interest: T

Output:

  • List of interesting directories, their package names and their statuses (Updated or not, Active or not, Dates,...)

Steps

  • List all packages managed by APT => [apt-pkgnames.list]
    • ==> List all files managed by APT => [apt.list]
      • ==> Filter only the specific types T => [apt_sorted.list]
  • List all packages managed by PIP => [pip-pkgnames.list]
    • ==> List all files managed by PIP => [pip.list]
      • ==> Filter only the specific types T => [pip_sorted.list]
  • List all files of the specific types T in D => [all_files.list]
  • Filter only interesting files: we do not care about files already managed by APT or PIP. [all_files.list] - ([apt_sorted.list] + [pip_sorted.list]) => [interesting.list] (see the sketch after this list)
  • Extract interesting directories by applying [Algorithm-D] => [interesting_dirs.list]
  • For each interesting directory:
    • Get its local information (by applying [Algorithm-S]) => [program_info.dat]
    • Get its latest information (by applying [Algorithm-A]) => [internet_info.dat]
  • Generate final result:
    • Compare local information vs. internet information to determine status of programs. => [report]
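
A hedged shell sketch of the APT listing and filtering steps, assuming a Debian-style system with dpkg; the extensions standing in for the types T are hypothetical placeholders:

# List all packages managed by APT, then every file they own.
dpkg-query -W -f '${Package}\n' > apt-pkgnames.list
while read -r pkg; do dpkg -L "$pkg"; done < apt-pkgnames.list \
  | sort -u > apt.list
# Keep only files of the types of interest (placeholders for T).
grep -E '\.(so|py|sh)$' apt.list > apt_sorted.list
# All files of the same types under the scan directory D.
find "$D" -type f | grep -E '\.(so|py|sh)$' | sort > all_files.list
# interesting.list = files in D that APT does not account for.
comm -23 all_files.list apt_sorted.list > interesting.list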

Expected output result:

Manual

Path               Name      Source  Updated  Active  Local version  Latest version
/path/to/package1  Package1  GitHub  Y        Y       1.0.2          1.0.2
/path/to/package2  Package2  Others  N        Y       0.0.2          1.0.5

Evaluation

The program has been tested on several computers in Eurecom's computer labs as well as on private machines. The collected results were then checked manually for mistakes (mis-recognized directory names, mis-recognized package information, ...). The following aspects have been evaluated.

Quality of Algorithm-A: Search for latest information of a package on GitHub.

Mark A-1 = Number of successful GitHub acquisitions / Total number of GitHub package directories

One GitHub acquisition can be considered successful if:

  • It detects the correct user and repo names (or the correct one appears in its returned results).
  • It gets the latest information accurately.
Mark A-2 = 1 - (Number of mis-recognized GitHub acquisitions / Total number of CORRECTLY RECOGNIZED package directories)

Result

#  PC         Total  Total GitHub only  Accurate  Wrong  Mark A-1  Mark A-2
1  apila         16                  6         4     11       67%       31%
2  diableret    109                 86        26     83       30%       24%
3  glacier        9                  1         1      0      100%      100%
#  Total        134                 93        31     94       33%       30%

Quality of Algorithm-D: Extraction of package directory

Mark D-1 = Number of accurately recognized package directories / Total number of ACTUAL package directories
Mark D-2 = 1 - (Number of incorrectly recognized package directories / Total number of RECOGNIZED package directories)

Result

#  PC         Total ACTUAL  Total RECOGNIZED  Accurate  Wrong  Mark D-1  Mark D-2
1  apila                 -                38        16     22         -       42%
2  diableret             -               185       129     56         -       70%
3  glacier               -                11        11      0         -      100%
#  Total                 -               234       156     78         -       67%

Limitations

  • Algorithm [A]: GitHub search

    • The algorithm could attach a confidence score to each returned result. A result should be accepted only if it is above a certain confidence level; if the confidence is very low, the result should be ignored. This helps filter out packages that are not actually open source. One good candidate signal is the star count of the repository on GitHub.

    • For PIP (Python modules), it may be better to get the information from pypi.org.

References

  1. 2017 Internet Threat Security Report | Symantec
  2. 2016 Internet Threat Security Report | Symantec
  3. Cyber Risk Report 2016 - Executive summary | Hewlett Packard
  4. US-CERT, Top 30 Targeted High Risk Vulnerabilities
  5. National Vulnerability Database
  6. https://www.whitesourcesoftware.com/whitesource-blog/open-source-security-vulnerability/
  7. Quick tips to search Google like an expert
  8. Google Inside search
  9. https://pypi.org
  10. https://rpmfind.net
  11. https://www.archlinux.org/
  12. https://github.com/settings/tokens
  13. https://developer.github.com/apps/building-integrations/setting-up-and-registering-oauth-apps/