- The threat of unpatched software.
- The use of open-source software/library in development increases the risk of security bugs.
- The massive `WannaCry` attack in May 2017, caused by an unpatched bug in the Windows OS.
- The limitation of package manager tools in keeping track of manually installed software and other types of software (e.g., browser extensions).
=> The need for a new way to keep track of, and maintain the latest status of, all types of software on our systems.
- Project description
The goal of this project is to automatically identify the provenance of each program installed in a running system. For instance, executables and libraries can be part of programs downloaded from the Web, cloned from GitHub, or installed as commercial software.
The student needs to devise techniques and heuristics to determine how, from which source, and in which precise version a piece of software was installed, on both Linux and Windows systems.
- Automatically: all of the processes should run with no, or minimal, human interaction.
- `Provenance`:
  - How? `Downloaded` or `Cloned`.
  - From which source? From the `Web`, from `GitHub`, or somewhere else.
  - Which precise version?
- Type of program: executables and libraries. Other types, such as shell and Python scripts, can also be covered.
- Operating system: both `Windows` and `Linux`.
- `Additional`:
  - Which is the latest version of a piece of software available on the Internet?
  - Is the software still `Active` (i.e., is development and maintenance still going on)? If the software is `NOT Active` (the developer has not updated it for a long time), there is a high chance that it is vulnerable to some security bugs.
Challenge-1: Given an arbitrary file in the system, how can we know the exact version (and related information, e.g., release date) of the program/package it belongs to?
Challenge-2: Given a program name (or its file names), how can we find information about its latest version on the Internet?
From this project's perspective, we classify the software in a system into the following categories. For each category, we will have different approaches (techniques and heuristics) to tackle the big challenges mentioned above.
| Category | OS | Type of program | Open/Closed | How | From | Approach [Chall1-Chall2] |
|---|---|---|---|---|---|---|
| 1-1 | Linux | Any | Open | Downloaded | GitHub | [S-A] |
| 1-2 | Linux | Any | Open | Downloaded | Others | [S-B] |
| 1-3 | Linux | Any | Open | Cloned | Any | [S-C] |
| 1-4 | Linux | Any | Closed | Downloaded | Others | [S-B] |
| 1-5 | Linux | ELF | Open | Compiled | Local dir | - |
| 2-1 | Windows | Any | Open | Downloaded | GitHub | - |
| 2-2 | Windows | Any | Open | Downloaded | Others | - |
| 2-3 | Windows | Any | Open | Cloned | Any | - |
| 2-4 | Windows | Any | Closed | Downloaded | Others | - |
| 2-5 | Windows | PE | Open | Compiled | Local dir | - |
- Notes:
  - `Others`: any website other than GitHub.
  - `Local dir`: could be `Downloaded` or `Cloned` from `GitHub` or any `Other` website.
  - Example: browser extensions can fall into category 1-4 (or 2-4).
  - `S`: simple approach for Challenge-1.
  - `A`, `B`, `C`: approaches for Challenge-2.
- During this semester project (2017-SPRING-41), we focus on approaches [A], [B], and [C] to tackle Challenge-2 for open-source software on Linux. A simple approach ([S]) for Challenge-1 is also implemented. Limitations will be discussed and some suggestions for improvement will be given.
- Hypothesis:
  - H1: All the files inside a program directory have the same creation date.
  - H2: All the files have been untouched since the program was installed on the system (i.e., a file's `Modified Date` equals its `Creation Date`).
- Input: a directory path.
- Output: the date when the package was created.
- Choose any file path in the directory.
- Use the `stat` command to get its time of last data modification (with the `%Y` or `%y` format sequence); a minimal shell sketch follows.
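A minimal shell sketch of this simple approach, assuming GNU coreutils (`stat`, `find`); the directory path is just an example argument:

```bash
#!/bin/bash
# Simple approach [S]: take any file in the package directory and use its
# last data-modification time as the package creation date (hypotheses H1/H2).
dir="$1"

# Pick an arbitrary regular file inside the directory.
sample=$(find "$dir" -type f | head -n 1)

# %y: human-readable modification time; %Y: seconds since the epoch.
stat -c '%y' "$sample"
```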
- Hypothesis:
  - H1: The `Creation Date` of a program equals the `Modification Date` of the vast majority of the files in its directory.
- Input: a directory path.
- Output: the date when the package was created.
- For every file in the directory, use the `stat` command to get its time of last data modification (with the `%Y` or `%y` format sequence).
- Sort the files based on their `Modification Date`.
- The `Modification Date` shared by the highest number of files is the returned value (see the sketch after this list).
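A minimal shell sketch of this majority-date variant, again assuming GNU coreutils; grouping dates at day granularity is an illustrative assumption:

```bash
#!/bin/bash
# Majority-date variant: the modification date shared by the largest number
# of files in the directory is taken as the package creation date (H1).
dir="$1"

# Collect per-file modification dates (day granularity), count how often each
# date occurs, and print the most frequent one together with its file count.
find "$dir" -type f -exec stat -c '%y' {} + \
    | cut -d' ' -f1 \
    | sort | uniq -c | sort -rn | head -n 1
```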
- Hypothesis:
  - H1: A directory basename is probably its repository name on GitHub.
  - H2: The longer a file path is, the more unique it is to the package.
- Input: a directory path and its files.
- Output: information from GitHub.
  - Generic info: user and repository (package) name.
  - Updated info: latest release, release date, latest commit, and commit date of the package on GitHub.
- Set the number of files to search: `n`. A higher number tends to be more accurate, but processing is slower and the matching is so strict that it can sometimes miss the correct result. We choose `n = 3`.
- Identify the `n` files with the longest file paths in the directory.
- For each file `i` above (`i` from 0 to n-1):
  - Use the GitHub API to search for the file path on GitHub.
  - Return a set of results: `R1[i]`.
- If any item `r` appears in all `R1[i]` (with `i` from 0 to n-1), it is a candidate => put `r` into our candidate list `R1`.
- Use the GitHub API to search for a repository with the directory basename `d`; this gives a set of results `R2`.
- The first item in `R2` that also appears in `R1` is the returned repository `R`.
- If no item in `R2` matches `R1`, all candidate items in `R1` are considered returned repositories: `R = R1`.
- Use the GitHub API to get the latest information of the repository (or of all repositories in the set) `R` (a minimal sketch of the repository-search and latest-information steps follows this list).
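A minimal shell sketch of the repository-search and latest-information steps, assuming `curl` and `jq` are available and using the public GitHub REST API; the optional `GITHUB_TOKEN` variable and the error handling are illustrative assumptions, not part of the original tool:

```bash
#!/bin/bash
# Sketch of the repository-search and latest-information steps of [Algorithm-A].
dir="$1"
name=$(basename "$dir")
auth=()
[ -n "$GITHUB_TOKEN" ] && auth=(-H "Authorization: token $GITHUB_TOKEN")

# Search repositories by name (the candidate set R2 in the steps above).
repo=$(curl -s "${auth[@]}" \
    "https://api.github.com/search/repositories?q=${name}+in:name" \
    | jq -r '.items[0].full_name')
echo "Best candidate repository: $repo"

# Latest release (the endpoint returns 404 if the repository has no releases).
curl -s "${auth[@]}" "https://api.github.com/repos/${repo}/releases/latest" \
    | jq -r '"Latest release: \(.tag_name // "n/a") (\(.published_at // "-"))"'

# Latest commit on the default branch.
curl -s "${auth[@]}" "https://api.github.com/repos/${repo}/commits?per_page=1" \
    | jq -r '.[0] | "Latest commit: \(.sha[0:7]) (\(.commit.committer.date))"'
```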
Some techniques for advanced search with Google have been investigated; however, no actual algorithm has been implemented in this phase. The idea is that, by crafting specific queries, the desired information (such as the latest version or release date) is printed directly on the Google results page.
- All in text or In text
allintext:skype latest version 2017
skype intext:latest version 2017
- All in title or In title
allintitle:skype latest version 2017
skype intitle:latest version 2017
- File type and combine multiple options
firefox download filetype:zip OR filetype:tar OR filetype:bz2
- Specific website
freeradius server latest version download site:sourceforge.net
- Some combination examples:
burp suite intext:latest version release intitle:download
ccleaner windows 32 intext:"latest version" intitle:version
Results:
Download Burp Suite Free Edition_. Burp Suite Free Edition v1.7.23. Latest Stable. Released 22 May 2017 | v1.7.23 Release notes. Download ...
Get new version of CCleaner. Cleans ... CCleaner Latest version 5.30.6065 ... CCleaner is the most well known PC cleaning, Windows based software program .
- Hypothesis: none; this is the most accurate method (but it is limited to cloned directories only).
- Input: a directory path to scan.
- Output: all cloned packages and their latest information: latest release, release date, latest commit, and commit date.
- Scan for all `.git` directories inside the provided path.
- For each returned result, update the local git database to the latest state from the Internet: `git fetch -tq > /dev/null`.
- Extract all the information we need from `.git/refs/` (a minimal shell sketch follows).
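A minimal shell sketch of this scan, assuming `git` and GNU `find` are available; the `git describe` and `git log` queries are illustrative stand-ins for reading `.git/refs/` directly:

```bash
#!/bin/bash
# Scan a path for cloned repositories, refresh their refs, and report the
# newest tag and the latest commit known to the remote.
scan_root="${1:-$HOME}"

find "$scan_root" -type d -name .git 2>/dev/null | while read -r gitdir; do
    repo=$(dirname "$gitdir")
    echo "== $repo =="
    # Update the local git database (tags and remote refs) quietly.
    git -C "$repo" fetch -tq > /dev/null 2>&1
    # Newest tag (a rough stand-in for "latest release").
    git -C "$repo" describe --tags --abbrev=0 2>/dev/null || echo "(no tags found)"
    # Latest commit on the remote-tracking branches.
    git -C "$repo" log -1 --format='%h %ci' --remotes 2>/dev/null
done
```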
Given a list of many files and directories, how can we identify which directories and sub-directories belong to the same package, and where the root directory of each package is?
For example, given the following list of directories:
/usr/share/program1
/usr/share/program2
/opt/program3/sub-dir31
/opt/program3/sub-dir32
/program4/sub-dir41
/program4/sub-dir42
The expected output is the list of the root directory of each package:
/usr/share/program1
/usr/share/program2
/opt/program3/
/program4
- Hypothesis:
  - H1: The vast majority of the files of the same package have the same Creation Date.
  - H2: The directory is not a project under development (i.e., the user does not frequently modify its files).
  - H3: The number of files in the directory is large enough (more than 7).
- Input: a list of directories, sub-directories and files.
- Output: a list of package root directories.
1. Choose a threshold (magic number): `n = 7`.
2. For each directory `Di` in the list:
   - 2.1. Check whether it belongs to one package only or to more than one package (a shell sketch of this check follows the list):
     - Use the `stat` command (with the `%Y` or `%y` format sequence) to get the `Modification Date` of all its files and sub-directories.
     - `sort` and `uniq` them based on their `Modification Date`.
     - If the number of distinct `Modification Date` values is:
       - less than `n` => they belong to the same package;
       - greater than or equal to `n` => they belong to different packages.
   - 2.2. If `Di` belongs to ONE package, check its parent `Pi`:
     - If `Pi` belongs to MORE THAN ONE package, `Di` is the package directory. Put `Di` into the returned result `R`.
     - If `Pi` also belongs to ONE package, both `Di` and `Pi` are sub-directories of the same package => check their parent `PPi` (go back to step 2.1).
3. `sort` and `uniq` the returned result `R`.
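A minimal shell sketch of step 2.1, assuming GNU coreutils (`stat`, `sort`, `uniq`); grouping modification times at day granularity is an assumption made for illustration:

```bash
#!/bin/bash
# Step 2.1 of [Algorithm-D]: decide whether a directory looks like a single
# package by counting the distinct modification dates of the files it contains.
n=7                       # threshold (magic number) from the algorithm
dir="$1"

# %y prints the human-readable modification time; keep only the date part.
distinct=$(find "$dir" -type f -exec stat -c '%y' {} + 2>/dev/null \
    | cut -d' ' -f1 | sort | uniq | wc -l)

if [ "$distinct" -lt "$n" ]; then
    echo "$dir: likely a single package ($distinct distinct dates)"
else
    echo "$dir: likely contains several packages ($distinct distinct dates)"
fi
```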
- Input: a scan directory `D` and the file types of interest `T`.
- Output: a list of interesting directories, their package names and their statuses (updated or not, active or not, dates, ...).
- List all packages managed by APT => [apt-pkgnames.list]
  - ==> List all files managed by APT => [apt.list]
  - ==> Filter only the specific types `T` => [apt_sorted.list]
- List all packages managed by PIP => [pip-pkgnames.list]
  - ==> List all files managed by PIP => [pip.list]
  - ==> Filter only the specific types `T` => [pip_sorted.list]
- List all files of the specific types `T` in `D` => [all_files.list]
- Filter only the `interesting` files: we do not care about files already managed by APT or PIP. [all_files.list] - ([apt_sorted.list] + [pip_sorted.list]) => [interesting.list] (a shell sketch of these listing and filtering steps appears after the example report below)
- Extract the `interesting` directories by applying [Algorithm-D] => [interesting_dirs.list]
- For each interesting directory:
  - Get its local information (by applying [Algorithm-S]) => [program_info.dat]
  - Get its latest information (by applying [Algorithm-A]) => [internet_info.dat]
- Generate the final result:
  - Compare the local information with the Internet information to determine the status of each program => [report]
| Path | Name | Source | Updated | Active | Local version | Latest version |
|---|---|---|---|---|---|---|
| /path/to/package1 | Package1 | Github | Y | Y | 1.0.2 | 1.0.2 |
| /path/to/package2 | Package2 | Others | N | Y | 0.0.2 | 1.0.5 |
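A minimal shell sketch of the APT/PIP listing and filtering steps, assuming a Debian-based system with `dpkg` and `pip` available; the scan directory `D`, the choice of file types `T`, and the `pip show` parsing are illustrative assumptions:

```bash
#!/bin/bash
# Build the intermediate lists named in the workflow above and compute the
# set of "interesting" files not managed by APT or PIP.
D="${1:-/usr/local}"

# Files managed by APT: list every installed package, then its files.
dpkg-query -W -f='${Package}\n' > apt-pkgnames.list
xargs -a apt-pkgnames.list -I{} dpkg -L {} 2>/dev/null | sort -u > apt_sorted.list

# Files managed by PIP: list every installed module, then its files.
pip list --format=freeze 2>/dev/null | cut -d= -f1 > pip-pkgnames.list
while read -r pkg; do
    pip show -f "$pkg" 2>/dev/null \
        | awk '/^Location:/{loc=$2} /^  /{print loc"/"$1}'
done < pip-pkgnames.list | sort -u > pip_sorted.list

# All files of the types of interest T inside D (example: libraries, Python
# sources, and user-executable files).
find "$D" -type f \( -name '*.so' -o -name '*.py' -o -perm -u+x \) \
    | sort -u > all_files.list

# Interesting files = all_files - (apt_sorted + pip_sorted).
comm -23 all_files.list <(sort -u apt_sorted.list pip_sorted.list) > interesting.list
wc -l interesting.list
```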
The program has been tested on several computers in Eurecom's computer labs as well as on private machines. The collected results were then checked manually to see whether there were any mistakes (mis-recognized directory names, mis-recognized package information, ...). The following aspects have been evaluated.
Mark A-1 = Number of successful GitHub acquisitions / Total number of GitHub package directories
A GitHub acquisition is considered successful if:
- it detects the correct `user` and `repo` names (or the correct one appears in its returned results);
- it gets the latest information accurately.
Mark A-2 = 1 - (Number of mis-recognized GitHub acquisitions / Total number of CORRECTLY RECOGNIZED package directories)
| # | PC | Total | Total GitHub only | Accurate | Wrong | Mark A-1 | Mark A-2 |
|---|---|---|---|---|---|---|---|
| 1 | apila | 16 | 6 | 4 | 11 | 67% | 31% |
| 2 | diableret | 109 | 86 | 26 | 83 | 30% | 24% |
| 3 | glacier | 9 | 1 | 1 | 0 | 100% | 100% |
| # | Total | 134 | 93 | 31 | 94 | 33% | 30% |
Mark D-1 = Number of accurately recognized package directories / Total number of ACTUAL package directories
Mark D-2 = 1 - (Number of incorrectly recognized package directories / Total number of RECOGNIZED package directories)
| # | PC | Total ACTUAL | Total RECOGNIZED | Accurate | Wrong | Mark D-1 | Mark D-2 |
|---|---|---|---|---|---|---|---|
| 1 | apila | - | 38 | 16 | 22 | - | 42% |
| 2 | diableret | - | 185 | 129 | 56 | - | 70% |
| 3 | glacier | - | 11 | 11 | 0 | - | 100% |
| # | Total | - | 234 | 156 | 78 | - | 67% |
- Algorithm [A]: GitHub search
  - Attach a confidence score to the returned result. The result should be accepted only if it is above a certain level of confidence; if the confidence is very low, the result should be ignored. This helps remove packages that are not actually open source. One good candidate signal is the `star` property of the repository on GitHub.
  - For PIP (Python modules), it may be better to get information from <https://pypi.org>.
- 2017 Internet Threat Security Report | Symantec
- 2016 Internet Threat Security Report | Symantec
- Cyber Risk Report 2016 - Executive summary | Hewlett Packard
- US-CERT, Top 30 Targeted High Risk Vulnerabilities
- National Vulnerability Database
- https://www.whitesourcesoftware.com/whitesource-blog/open-source-security-vulnerability/
- Quick tips to search Google like an expert
- Google Inside search
- https://pypi.org
- https://rpmfind.net
- https://www.archlinux.org/
- https://github.com/settings/tokens
- https://developer.github.com/apps/building-integrations/setting-up-and-registering-oauth-apps/
Copyright (c) 2017, Duy KHUONG at EURECOM.fr