
Motivation

  • The threat of unpatched software.
  • The use of open-source software/libraries in development increases the risk of security bugs.
  • The massive WannaCry attack in May 2017, caused by an unpatched bug in the Windows OS.
  • The limitation of package manager tools in keeping track of manually installed software and other types of software (e.g., browser extensions).

=> The need for a new way to keep track of, and maintain the update status of, all types of software in a system.

Goal

  • Project description
The goal of this project is to automatically identify the provenance of each program installed in a running system. For instance, executables and libraries can be part of programs downloaded from the Web, cloned from GitHub, or installed as commercial software.
The student needs to identify techniques and heuristics to determine how, from which source, and which precise version of a piece of software was installed, on both Linux and Windows systems.
  • Automatically: the whole process should run with no (or minimal) human interaction.
  • Provenance:
    • How? Downloaded or Cloned
    • From which source? From the Web, from GitHub, or somewhere else.
    • Which precise version?
  • Executables and libraries: the types of programs considered. Other types, such as shell and Python scripts, can also be covered.
  • Operating system: both Windows and Linux.
  • Additional:
    • Which is the latest version of a software available on the Internet?
    • Is the software still Active (is development/maintenance still going on)? If the software is NOT Active (the developer has not updated it for a long time), there is a significant chance that it is vulnerable to unpatched security bugs.

The two big challenges

Challenge-1:

Given an arbitrary file in the system, how do we determine the exact version (and related information, e.g., release date) of the program/package it belongs to?

Challenge-2:

Given a program name (or its file names), how do we find information about its latest version on the Internet?

Software classification

From this project's perspective, we classify the software in a system into the following categories. For each category, we will use different approaches (techniques and heuristics) to tackle the big challenges mentioned above.

Category  OS       Type of program  Open/Closed  How         From       Approach [Chall-1 / Chall-2]
1-1       Linux    Any              Open         Downloaded  GitHub     [S-A]
1-2       Linux    Any              Open         Downloaded  Others     [S-B]
1-3       Linux    Any              Open         Cloned      Any        [S-C]
1-4       Linux    Any              Closed       Downloaded  Others     [S-B]
1-5       Linux    ELF              Open         Compiled    Local dir  -
2-1       Windows  Any              Open         Downloaded  GitHub     -
2-2       Windows  Any              Open         Downloaded  Others     -
2-3       Windows  Any              Open         Cloned      Any        -
2-4       Windows  Any              Closed       Downloaded  Others     -
2-5       Windows  PE               Open         Compiled    Local dir  -
  • Notes:
    • Others: any website other than GitHub.
    • Local dir: could be Downloaded or Cloned from GitHub or any other website.
    • Example: Browser extensions can fall into category 1-4 (or 2-4).
    • S: Simple approach for Challenge-1.
    • A, B, C: Approaches for Challenge-2.

Scope of this semester project

In the duration of this semester project (2017-SPRING-41), we focus on approaches [A], [B], and [C] to tackle Challenge-2 for open-source software, with Linux as the operating system.

A simple approach ([S]) for Challenge-1 is also implemented. Its limitations are discussed and some suggestions for improvement are given.

Methodology

Challenge 1

Algorithm [S]: A simple way to detect the creation date of a program.

  • Hypothesis:
    • H1: All files inside the program directory have the same creation date.
    • H2: All files have been untouched since the program was installed on the system (i.e., each file's Modification Date equals its Creation Date).
  • Input: A directory path
  • Output: A date when the package was created.
  1. Choose any file path in the directory.
  2. Use the stat command to get its time of last data modification (with the %Y or %y format specifier); see the sketch below.
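
A minimal shell sketch of Algorithm [S], assuming GNU stat and find are available (PKG_DIR is a hypothetical placeholder for the input directory):

# Pick an arbitrary file inside the package directory.
f=$(find "$PKG_DIR" -type f | head -n 1)
# %y prints the last data modification time in human-readable form;
# %Y would print it as seconds since the epoch.
stat -c '%y' "$f"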

Algorithm [Sim]: An improved version of the simple approach for detecting the creation date of a program.

  • Hypothesis:
    • H1: The Creation Date of a program equals the Modification Date of the vast majority of the files in its directory.
  • Input: A directory path
  • Output: A date when the package was created.
  1. For every file in the directory, use the stat command to get its time of last data modification (with the %Y or %y format specifier).
  2. Sort the files based on their Modification Date.
  3. The Modification Date shared by the largest number of files is the returned value (see the sketch below).
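
A minimal shell sketch of Algorithm [Sim] under the same assumptions; it prints the modification date shared by the largest number of files:

# Collect the modification date (date part only) of every file,
# count how many files share each date, and keep the most common one.
find "$PKG_DIR" -type f -exec stat -c '%y' {} + \
  | cut -d' ' -f1 | sort | uniq -c | sort -rn \
  | head -n 1 | awk '{print $2}'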

Challenge 2:

Algorithm [A]: Search for latest information of a package on GitHub.

  • Hypothesis:
    • H1: A directory basename is probably its repo name on GitHub.
    • H2: The longer a file path, the more unique it is to the package.
  • Input: A directory path and its files.
  • Output: Information from GitHub.
    • Generic info: User, Repo (package) Name
    • Updated info: Latest Release, Released Date, Latest Commit, Committed Date of the package on GitHub.
  1. Set the number of files to search: n. A higher n is relatively more accurate, but slower to process and so strict that it can sometimes miss the correct result. We choose n = 3.
  2. Identify n files with the longest file paths in the directory.
  3. For each file i above (i from 0 to n-1):
    • Use the GitHub API to search for the file path on GitHub.
    • This returns a set of results: R1[i].
  4. If an item r appears in every R1[i] (i from 0 to n-1), it is a candidate => put r into the candidate list R1.
  5. Use the GitHub API to search for repositories matching the directory basename d, yielding a set of results R2.
  6. The first item in R2 that also appears in R1 is the returned repository R.
  7. If no item in R2 matches R1, all candidates in R1 are considered returned repositories: R = R1.
  8. Use the GitHub API to get the latest information for the repository (or all repositories) in R; a sketch of these API calls is given below.
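
A hedged sketch of steps 5-8, assuming curl and jq are installed. The endpoints used here (search/repositories, releases/latest and the commits listing) are standard GitHub REST API v3 calls; the code-search calls of steps 1-4 are omitted since they require an authenticated token (see references 12-13). The shell variable d is assumed to hold the directory basename:

# Step 5: search repositories by the directory basename; take the top hit.
repo=$(curl -s "https://api.github.com/search/repositories?q=${d}" \
       | jq -r '.items[0].full_name')
# Step 8: latest release (tag name and publication date).
curl -s "https://api.github.com/repos/${repo}/releases/latest" \
  | jq -r '"\(.tag_name) \(.published_at)"'
# Latest commit (hash and commit date) on the default branch.
curl -s "https://api.github.com/repos/${repo}/commits?per_page=1" \
  | jq -r '"\(.[0].sha) \(.[0].commit.committer.date)"'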

Algorithm [B]: Search for latest information of a package on Google.

Some techniques for advanced Google search have been investigated; however, no actual algorithm has been implemented in this phase. The idea: with a specific enough query, the desired information (latest version, release date, ...) appears directly on the Google results page.

  1. All in text or In text
allintext:skype latest version 2017
skype intext:latest version 2017
  2. All in title or In title
allintitle:skype latest version 2017
skype intitle:latest version 2017
  3. File type, and combining multiple options
firefox download filetype:zip OR filetype:tar OR filetype:bz2
  4. Specific website
freeradius server latest version download site:sourceforge.net
  5. Some combined examples:
burp suite intext:latest version release intitle:download
ccleaner windows 32 intext:"latest version" intitle:version

Results:

Download Burp Suite Free Edition. Burp Suite Free Edition v1.7.23. Latest Stable. Released 22 May 2017 | v1.7.23 Release notes. Download ...
Get new version of CCleaner. Cleans ... CCleaner Latest version 5.30.6065 ... CCleaner is the most well known PC cleaning, Windows based software program .

Algorithm [C]: Get the latest information by git command.

  • Hypothesis: none; this is the most accurate method (but it is limited to cloned directories only).
  • Input: A directory path to scan.
  • Output: All cloned packages and their latest information: Latest Release, Release Date, Latest Commit, Commit Date.
  1. Scan for all .git directories inside the provided path.
  2. For each result, update the local git database from the remote: git fetch -tq > /dev/null
  3. Extract all information we need from .git/refs/.
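
A minimal shell sketch of Algorithm [C] (SCAN_DIR is a hypothetical placeholder). As a simplification, git describe and git log are used to read the latest tag and commit instead of parsing .git/refs/ by hand:

find "$SCAN_DIR" -type d -name .git | while read -r gitdir; do
  repo=$(dirname "$gitdir")
  # Step 2: refresh tags and refs from the remote, quietly.
  git -C "$repo" fetch -tq > /dev/null 2>&1
  # Latest tag reachable from HEAD, if any.
  tag=$(git -C "$repo" describe --tags --abbrev=0 2>/dev/null)
  # Most recent local commit (HEAD) with its date.
  commit=$(git -C "$repo" log -1 --format='%h %ci')
  echo "$repo | release: ${tag:-none} | commit: $commit"
done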

Other challenges:

Challenge 3: Extract package directory

Given a list of many files and directories, how can we identify which directories and sub-directories belong to the same package, and where the root directory of each package is?

For example, with the list of directories as follows:

/usr/share/program1
/usr/share/program2
/opt/program3/sub-dir31
/opt/program3/sub-dir32
/program4/sub-dir41
/program4/sub-dir42

The expected output should be the list of the root directory of each package.

/usr/share/program1
/usr/share/program2
/opt/program3/
/program4

Algorithm [D]: Extract package directory

  • Hypothesis:
    • H1: The vast majority of files of the same package have the same Creation Date.
    • H2: The directory is not a project under development (i.e., the user does not frequently modify its files).
    • H3: The number of files in the directory is large enough (more than 7).
  • Input: A list of directories, sub-directories and files.
  • Output: List of package's root directories.
  1. Choose a threshold (magic number): n=7
  2. For each of the directory Di in the list:

2.1. Check whether it belongs to one package or more than one:

  • Use the stat command (with the %Y or %y format specifier) to get the Modification Date of all its files and sub-directories.
  • Sort and uniq them based on their Modification Date.
  • If the number of unique Modification Dates is:
    • Less than n => they belong to the same package.
    • Greater than or equal to n => they belong to different packages.

2.2. If Di belongs to ONE package, check its parent Pi.

  • If Pi belongs to MORE THAN ONE package, Di is a package directory. Put Di into the result R.
  • If Pi also belongs to ONE package, both Di and Pi are sub-directories of the same package => check their parent PPi (go back to step 2.1).
  3. Sort and uniq the returned result R. A sketch of the test in step 2.1 is given below.
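
A minimal shell sketch of the test in step 2.1 (DIR is a hypothetical placeholder; n is the threshold from step 1):

n=7
# Count the distinct modification dates among all files in DIR.
distinct=$(find "$DIR" -type f -exec stat -c '%y' {} + \
           | cut -d' ' -f1 | sort -u | wc -l)
if [ "$distinct" -lt "$n" ]; then
  echo "$DIR looks like a single package"
else
  echo "$DIR spans more than one package"
fi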

Implementation

Input:

  • Scan directory: D
  • File types of interest: T

Output:

  • List of interesting directories, their package names and their statuses (Updated or not, Active or not, Dates,...)

Steps

  • List all packages managed by APT => [apt-pkgnames.list]
    • ==> List all files managed by APT => [apt.list]
      • ==> Filter only the specific types T => [apt_sorted.list]
  • List all packages managed by PIP => [pip-pkgnames.list]
    • ==> List all files managed by PIP => [pip.list]
      • ==> Filter only the specific types T => [pip_sorted.list]
  • List all files of the specific types T in D => [all_files.list]
  • Filter only interesting files: we do not care about files already managed by APT or PIP. [all_files.list] - ([apt_sorted.list] + [pip_sorted.list]) => [interesting.list] (see the sketch after this list)
  • Extract interesting directories by applying [Algorithm-D] => [interesting_dirs.list]
  • For each interesting directory:
    • Get its local information (by applying [Algorithm-S]) => [program_info.dat]
    • Get its latest information (by applying [Algorithm-A]) => [internet_info.dat]
  • Generate final result:
    • Compare local information vs. internet information to determine status of programs. => [report]
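
A hedged shell sketch of the APT listing and filtering steps, assuming a Debian-style system with dpkg; the extensions standing in for the types T are hypothetical placeholders:

# List all packages managed by APT, then every file they own.
dpkg-query -W -f '${Package}\n' > apt-pkgnames.list
while read -r pkg; do dpkg -L "$pkg"; done < apt-pkgnames.list \
  | sort -u > apt.list
# Keep only files of the types of interest (placeholders for T).
grep -E '\.(so|py|sh)$' apt.list > apt_sorted.list
# All files of the same types under the scan directory D.
find "$D" -type f | grep -E '\.(so|py|sh)$' | sort > all_files.list
# interesting.list = files in D that APT does not account for.
comm -23 all_files.list apt_sorted.list > interesting.list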

Expected output result:

Manual

Path               Name      Source  Updated  Active  Local version  Latest version
/path/to/package1  Package1  GitHub  Y        Y       1.0.2          1.0.2
/path/to/package2  Package2  Others  N        Y       0.0.2          1.0.5

Evaluation

The program has been tested on several computers in Eurecom's computer labs as well as on private machines. The collected results were then checked manually for mistakes (mis-recognized directory names, mis-recognized package information, ...). The following aspects have been evaluated.

Quality of Algorithm-A: Search for latest information of a package on GitHub.

Mark A-1 = Number of successful GitHub acquisitions / Total number of GitHub package directories

One GitHub acquisition can be considered successful if:

  • It detects the correct user and repo names (or the correct one appears in its returned results).
  • It gets the latest information accurately.
Mark A-2 = 1 - (Number of mis-recognized GitHub acquisitions / Total number of CORRECTLY RECOGNIZED package directories)

Result

#  PC         Total  Total GitHub only  Accurate  Wrong  Mark A-1  Mark A-2
1  apila         16                  6         4     11       67%       31%
2  diableret    109                 86        26     83       30%       24%
3  glacier        9                  1         1      0      100%      100%
#  Total        134                 93        31     94       33%       30%

Quality of Algorithm-D: Extraction of package directory

Mark D-1 = Number of accurately recognized package directories / Total number of ACTUAL package directories
Mark D-2 = 1 - (Number of incorrectly recognized package directories / Total number of RECOGNIZED package directories)

Result

#  PC         Total ACTUAL  Total RECOGNIZED  Accurate  Wrong  Mark D-1  Mark D-2
1  apila                 -                38        16     22         -       42%
2  diableret             -               185       129     56         -       70%
3  glacier               -                11        11      0         -      100%
#  Total                 -               234       156     78         -       67%

Limitations

  • Algorithm [A]: GitHub search

    • The algorithm could attach a confidence score to each returned result. A result should be accepted only if it is above a certain confidence level; if the confidence is very low, the result should be ignored. This helps filter out packages that are not actually open source. One good candidate signal is the star count of the repository on GitHub.

    • For PIP (Python modules), it may be better to get the information from pypi.org.

References

  1. 2017 Internet Threat Security Report | Symantec
  2. 2016 Internet Threat Security Report | Symantec
  3. Cyber Risk Report 2016 - Executive summary | Hewlett Packard
  4. US-CERT, Top 30 Targeted High Risk Vulnerabilities
  5. National Vulnerability Database
  6. https://www.whitesourcesoftware.com/whitesource-blog/open-source-security-vulnerability/
  7. Quick tips to search Google like an expert
  8. Google Inside search
  9. https://pypi.org
  10. https://rpmfind.net
  11. https://www.archlinux.org/
  12. https://github.com/settings/tokens
  13. https://developer.github.com/apps/building-integrations/setting-up-and-registering-oauth-apps/