US20130227604A1 - Automated forensic document signatures - Google Patents

Automated forensic document signatures

Info

Publication number
US20130227604A1
Authority
US
United States
Prior art keywords
signature
file
digital
signatures
generating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US13/858,536
Inventor
Thomas Clay Shields
Ophir Frieder
Marcus A. Maloof
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Georgetown University
Original Assignee
Georgetown University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US11/963,186 (US8280905B2)
Application filed by Georgetown University
Priority to US13/858,536
Publication of US20130227604A1
Assigned to GEORGETOWN UNIVERSITY. Assignment of assignors' interest (see document for details). Assignors: FRIEDER, OPHIR; MALOOF, MARCUS A.; SHIELDS, THOMAS CLAY
Current legal status: Abandoned

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/12 Applying verification of the received information
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/9035 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425 Traffic logging, e.g. anomaly detection

Definitions

  • This invention relates generally to methods and systems for computer data management and tracking. Specifically, it relates to methods and systems of digital data identification and the creation, storage, management, processing and comparison of content sensitive digital signatures.
  • Digital evidence varies widely in format and can include computer files, digital images, sound and video, e-mail, instant messages, phone records, and so on. It is routinely gathered from seized hard drives, file servers, Internet data, mobile digital devices, digital cameras and numerous other digital sources that are growing steadily in sophistication and capacity.
  • Computer forensics is the practice of acquiring, preserving, analyzing, and reporting on data collected from a computer system, which can include personal computers, server computers, and portable electronic devices such as cellular phones, PDAs and other storage devices. Collecting and analyzing these types of data is usually called digital data identification.
  • the goal of the process is to find evidence that supports or refutes some hypothesis regarding user activity on the system.
  • digital evidence can provide the invaluable proof that helps convict a criminal or prevent a looming terrorist attack. A delay in identifying suspect data occasionally results in the dismissal of criminal cases in which the evidence is not produced in time for prosecution.
  • a typical computer forensic process involves first the determination that the evidence requirements merit a forensic examination. Individuals who are expected to have access to that evidence are then identified. Further, all computer systems used by these individuals which might contain relevant data are located. Forensic images of those systems are taken, and analyzed for relevant evidence. Traditionally, a forensic investigator seizes all storage media, creates a drive image or duplicates it, and then conducts their examination of the data on the drive image or duplicate copy to preserve the original evidence.
  • a “drive image” is an exact replica of the contents of a storage device, such as a hard disk, stored on a second storage device, such as a network server or another hard disk.
  • One of the first steps in the examination process is to recover latent data such as deleted files, hidden data and fragments from unallocated file space.
  • Digital forensic analysis tools used today are stand alone systems that are not coordinated with systems used by the forensic investigators and Information Technology (IT) staff.
  • Current computer forensics analysis is largely a manual labor intensive process. It requires computer forensic investigators that have specialized training. The cost of the analysis is high. The rate for some computer forensic investigators can be more than $250/hour. It usually requires a long analysis time taking from days to weeks. Because it is a manual process, there is potential for human error resulting in missed data and missed discovery.
  • it is difficult to determine what systems to analyze. This may have two undesirable results: expending limited time and resources on useless systems, or missing systems that contain vital information.
  • Information can be stored as files that exist on a computer file system, and can exist in many heterogeneous forms such as plain text documents, formatted documents (e.g. Microsoft Word® documents, Open Document Format documents), spread sheets, presentations, Portable Document Format documents, images of paper documents, graphics, sound recordings, videos, faxes, email messages, voice messages, web pages, and other stored digital media.
  • Information can also be stored as entries in databases such as a relational database or a document management system. This information is subject to a wide range of user manipulations, such as create, edit, copy, rename, move, delete and backup.
  • Information can also move among the entity computer systems through various communication means, such as emails, attachments, file sharing, shared file systems and push technology. Information can also leave the entity computer systems, either when someone within the entity sends it to an outsider or when an outsider retrieves it by obtaining removable storage media containing the information or through network access protocols such as HTTP, FTP, and peer-to-peer file sharing. All of this creation, manipulation, transfer, and communication of digital information can be part of the legitimate business process. However, abuse of the computer system involves the same processes of creation, manipulation, transfer, and communication of information, albeit in an unauthorized or illegitimate manner. The Computer Security Institute 2007 survey also revealed that insider abuse of network access or email edged out virus incidents as the most prevalent security problem. While a majority of all computer attacks enter via the Internet, the most significant dollar losses stem from internal intrusions.
  • IP: Intellectual Property
  • Corporations may also incur liability or exposure to risks when unauthorized contents are stored in the computer systems, such as child pornographic material, or pirated copies of media or software.
  • An organization must know which of its assets require protection and the real and perceived threats against them.
  • IT: Information Technology
  • Content-security tools based on HTTP/SMTP proxies are used against viruses and spam. However, these tools were not designed for intrusion prevention. They do not inspect internal traffic; they scan only authorized e-mail channels. They rely on file-specific content recognition and have scalability and maintenance issues. When content security tools don't fit, they are ineffective. Relying on permissions and identity management is like running a retail store that screens you coming in but doesn't put magnetic tags on the clothes to prevent you from wearing that expensive hat going out.
  • a hash analysis is a method that can be used for comparing the content of digital evidence.
  • a cryptographic one-way hash (or “hash” for short) can be a way to calculate a digital fingerprint: a very large number that often uniquely identifies a digital file.
  • a hash is a calculated function on the bits that make up a file. Therefore, two files with different names but the exact same contents will produce the same hash.
  • using hash systems to identify conclusive or known suspect files faces several challenges. By design of the hash function, a small difference, even a single bit, in the input file will generate a significantly different output hash. The difference between two hash numbers does not reflect the level of similarity of the input files.
  • the hash method cannot be used to identify files that have been altered, whether minimally or substantially. It is therefore not able to identify derivative files, that is, files that contain common content but are arranged or formatted differently, or that contain more or less other content. For the same reason, hash analysis is not effective against multimedia files (image, video, and sound). As a consequence, an individual using these files to commit crimes may escape hash-based detection and prosecution.
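  • The sensitivity of hash analysis described above can be illustrated with a short sketch. It uses SHA-256 from Python's standard library as a stand-in, since the text does not name a specific hash function; the sample byte strings and the helper names `file_hash` and `differing_bits` are made up for illustration.

```python
import hashlib


def file_hash(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def differing_bits(hex_a: str, hex_b: str) -> int:
    """Count the bit positions at which two equal-length hex digests differ."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


# Hypothetical file contents: the second differs from the first by one bit
# (ASCII 't' is 0x74, 'u' is 0x75).
original = b"quarterly report draft"
altered = b"quarterly report drafu"
same_content_other_name = bytes(original)  # identical bytes, "different file"

h_original = file_hash(original)
h_altered = file_hash(altered)
h_copy = file_hash(same_content_other_name)

# Identical content always yields the identical hash, whatever the file name.
print(h_original == h_copy)
# A one-bit change yields an unrelated digest; the number of differing digest
# bits says nothing useful about how similar the two inputs were.
print(differing_bits(h_original, h_altered))
```

  • The first comparison prints True. The second prints a count that bears no relation to the one-bit edit, which is why a plain hash cannot rank near-duplicates.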
  • the present invention is a method, system, and computer readable media for proactively generating, preserving and comparing computer forensic evidence for a computer system.
  • the method involves generating at least one signature for at least one target based on the content of the target.
  • the at least one signature can be generated at any time, or when a predetermined operation is commenced.
  • the at least one generated signature can be stored, or not, prior to or after forensic use.
  • the generated signature(s) are compared with one or more previously generated signature(s) to determine whether any compared signatures have similarities above a predetermined threshold.
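  • A minimal sketch of the threshold comparison follows. It assumes each signature can be modeled as a set of fingerprint values and uses Jaccard similarity as the measure; the text does not fix a particular similarity measure, so both choices, along with the names `jaccard` and `find_similar` and the sample data, are illustrative assumptions.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: shared elements divided by all distinct elements."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_similar(new_sig: Set[str],
                 stored: Dict[str, Set[str]],
                 threshold: float) -> List[Tuple[str, float]]:
    """Return (name, score) for stored signatures scoring above the threshold."""
    hits = [(name, jaccard(new_sig, sig)) for name, sig in stored.items()]
    return sorted([h for h in hits if h[1] > threshold],
                  key=lambda h: h[1], reverse=True)


# Hypothetical previously generated signatures.
stored_signatures = {
    "contract_v1": {"f1", "f2", "f3", "f4"},
    "holiday_memo": {"f9", "f10"},
}
# A newly generated signature sharing three of four fingerprints with contract_v1.
new_signature = {"f1", "f2", "f3", "f5"}

matches = find_similar(new_signature, stored_signatures, threshold=0.5)
print(matches)  # [('contract_v1', 0.6)]
```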
  • the present invention could, at any time, simply compare previously existing signatures generated from a target.
  • the target can be any file, any file that is owned by a user, any operating system file, any file that is part of a proprietary information system, or any file that is related to a network intrusion attack.
  • the predetermined operation can be any one or more of creating, deleting, renaming, editing, moving, updating, linking, merging, modifying and copying the file.
  • the target could also be a database entry; and when a database entry, the predetermined operation can be any one or more of selecting, inserting, updating, deleting, merging, beginning work, committing, rollback, creating, dropping, truncating, and altering of the database entry.
  • the target can further be a database definition. When the target is a database definition, the predetermined operation can be any one or more of creating, dropping and altering the database definition.
  • the target can also be network traffic; and when network traffic, the predetermined operation can be the occurrence of network traffic entering or leaving a network, network traffic being initiated from a computer system, or a computer system receiving network traffic.
  • the network traffic may be any one or more of a signal protocol, an email, an attachment of an email, an instant message conversation, a text message, a remote login, a virtual private network, a viewed webpage, a file transfer and file sharing.
  • Generating the at least one signature can involve extracting a set of tokens from the at least one target, processing the set of tokens, generating a fingerprint from the set of tokens, and generating the signature for the target by combining the fingerprints with other related information of the target.
  • Processing the set of tokens can include sorting the set of tokens, and may further include filtering the set of tokens.
  • the method for generating the fingerprints may involve a hash method, or an implementation of a bit vector method.
  • Other related information of the target can be accessible by an operating system, and can be any one or more of file name, date of record, time of record, user or owner information, network address, network protocol, access history and fingerprint history. Other related information of the target could also be information accessible by an application.
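  • The generation steps above (extract tokens, filter and sort them, fingerprint the result, combine with other related information) can be sketched as follows. This is an illustration under stated assumptions, not the claimed method: word-level tokens, the stop-word list, de-duplication of tokens, SHA-256 as the hash method, and the `Signature` class are all choices made here for the example.

```python
import hashlib
import re
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical stop list used for the filtering step.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}


@dataclass
class Signature:
    fingerprint: str
    metadata: Dict[str, str] = field(default_factory=dict)


def extract_tokens(text: str) -> List[str]:
    """Split text into lowercase alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def process_tokens(tokens: List[str]) -> List[str]:
    """Filter out stop words, drop duplicates (an assumption), and sort."""
    return sorted({t for t in tokens if t not in STOP_WORDS})


def fingerprint(tokens: List[str]) -> str:
    """Hash the processed token list into a fixed-length fingerprint."""
    return hashlib.sha256(" ".join(tokens).encode("utf-8")).hexdigest()


def generate_signature(text: str, metadata: Dict[str, str]) -> Signature:
    """Combine the content fingerprint with other related information."""
    tokens = process_tokens(extract_tokens(text))
    return Signature(fingerprint(tokens), dict(metadata))


sig_a = generate_signature("The merger of Acme and Globex",
                           {"file_name": "memo.txt", "owner": "alice"})
sig_b = generate_signature("Globex and Acme: the merger!",
                           {"file_name": "copy.doc", "owner": "bob"})
# Reordered and reformatted content still yields the same fingerprint.
print(sig_a.fingerprint == sig_b.fingerprint)  # True
```

  • Because filtering and sorting happen before hashing, the fingerprint depends on which terms appear rather than on layout or word order, which is the property a plain file hash lacks.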
  • the generated signature(s) could be stored in a manner preventing deletion or modification by a user, other than a user with special access rights, such as authorized personnel or a forensic investigator.
  • the signature(s) could further be made available only to authorized personnel or a forensic investigator with access rights.
  • the signature(s) and respective targets can be stored on the same computer system, different computer systems, and/or on a shared file system.
  • the signature(s) can be stored on write-once, read-many media.
  • a computer readable medium that configures a computer system to perform the methods described above of proactively generating, preserving and comparing computer forensic evidence for a computer system.
  • computer readable medium facilitates the method of generating at least one signature for at least one target based on the content of the target; and comparing the at least one generated signature with at least one previously generated signature to determine whether the signatures have similarities above a predetermined threshold.
  • the present invention also provides an apparatus for the generation, preservation and comparison of computer forensic evidence.
  • the apparatus/system can include a processor arranged to generate at least one signature for at least one target based on the content of the target, and a comparator configured to compare the at least one generated signature with at least one previously generated signature to determine whether the signatures have similarities above a predetermined threshold.
  • the system can additionally include an extension module configured to trigger signature generation upon occurrence of a certain action, and a mechanism for storing the generated signatures.
  • the implemented system may have an operating system service (e.g., a Windows service or Unix/Linux daemon) running in the background to generate a signature for a given file and to store it, and then to query the stored signatures to determine similarity with other signatures.
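  • The trigger-store-query loop of such a background service can be sketched in memory as below. No real operating system hooks are used: `on_file_event` is a hypothetical entry point that some file-system monitor would call, the trigger set is a subset of the operations listed earlier, and a plain SHA-256 digest stands in for the content-based signature.

```python
import hashlib
from typing import Dict, List, Tuple

# Operations that trigger signature generation (an illustrative subset).
TRIGGER_OPERATIONS = {"create", "edit", "rename", "move", "copy", "delete"}


class SignatureService:
    """In-memory stand-in for a background service that records signatures."""

    def __init__(self) -> None:
        # Append-only history: path -> list of (operation, fingerprint).
        self.history: Dict[str, List[Tuple[str, str]]] = {}

    def on_file_event(self, operation: str, path: str, content: bytes) -> bool:
        """Record a signature if the operation is a trigger; report whether it was."""
        if operation not in TRIGGER_OPERATIONS:
            return False
        digest = hashlib.sha256(content).hexdigest()  # placeholder fingerprint
        self.history.setdefault(path, []).append((operation, digest))
        return True

    def paths_with_fingerprint(self, digest: str) -> List[str]:
        """Query the store for every path that ever carried this fingerprint."""
        return sorted(p for p, events in self.history.items()
                      if any(d == digest for _, d in events))


service = SignatureService()
service.on_file_event("create", "/home/alice/plan.txt", b"secret plan")
service.on_file_event("read", "/home/alice/plan.txt", b"secret plan")  # not a trigger
service.on_file_event("copy", "/tmp/x.txt", b"secret plan")
service.on_file_event("edit", "/home/alice/plan.txt", b"harmless notes")

target = hashlib.sha256(b"secret plan").hexdigest()
print(service.paths_with_fingerprint(target))
# ['/home/alice/plan.txt', '/tmp/x.txt']
```

  • The original file still matches after it was overwritten, because earlier entries in the history are never replaced.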
  • a computerized method of proactively generating and querying computer forensic evidence for a computer system comprises the steps of generating a representation of content of at least one target within a set of targets, and generating an inverted index of the set of targets, wherein the inverted index is associated with representations of the content of each target of the set of targets.
  • the set of targets comprises one or more files.
  • the inverted index is updated upon occurrence of a predetermined operation, and the predetermined operation is one or more of creating, deleting, renaming, editing, moving, updating, linking, merging, modifying and copying a file.
  • the set of targets comprises one or more database entries.
  • the inverted index is updated upon occurrence of a predetermined operation, and the predetermined operation is one or more of select, insert, update, delete, merge, begin work, commit, rollback, create, drop, truncate, and alter of a database entry.
  • generating the representation of the content of at least one target comprises the steps of extracting a set of terms from the target, and processing the set of terms.
  • generating the representation of the content of at least one target further comprises the steps of extracting other related information of the target, and incorporating the other related information with the extracted and processed terms.
  • the other related information of the target is accessible by an operating system, and is at least one of file name, date of record, time of record, user or owner information, network address, network protocol, and access history of the target.
  • the other related information of the target is accessible by an application.
  • generating the inverted index of the set of targets comprises the steps of extracting a set of terms from the at least one target, processing the set of terms, indexing the set of terms to create the inverted index and associating the set of terms with representations of the content of each of the one or more targets.
  • the representation of the content of the at least one target is stored permanently and is not removed when the target is modified or removed.
  • the inverted index retains association with the representation of the content of the at least one target when the target is modified or removed.
  • the method further comprises the step of storing the generated inverted index in a manner preventing deletion or modification of the inverted index by a user other than authorized personnel or a forensic investigator.
  • the generated inverted index is available only to authorized personnel or a forensic investigator with access rights.
  • the generated inverted index and the set of targets are stored on the same computer system.
  • the generated inverted index is stored on a first computer system and the set of targets is stored on a second computer system accessible to the first computer system through a computer network.
  • in yet another aspect of the present invention, the method further comprises the step of querying the inverted index.
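  • A minimal sketch of an inverted index whose entries outlive the targets they describe is given below. The class name `ForensicIndex`, the word-level terms, and the dictionary layout of each representation are assumptions made for illustration; access control and persistent storage are left out.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


class ForensicIndex:
    """Inverted index whose entries survive modification or removal of targets."""

    def __init__(self) -> None:
        self.postings: Dict[str, Set[int]] = defaultdict(set)  # term -> rep ids
        self.representations: List[dict] = []                  # never deleted

    def add_version(self, path: str, text: str, operation: str) -> int:
        """Index the current content of a target as a new representation."""
        terms = sorted(set(re.findall(r"[a-z0-9]+", text.lower())))
        rep_id = len(self.representations)
        self.representations.append(
            {"path": path, "operation": operation, "terms": terms})
        for term in terms:
            self.postings[term].add(rep_id)
        return rep_id

    def record_removal(self, path: str) -> None:
        """Note a deletion; earlier representations and postings are kept."""
        self.representations.append(
            {"path": path, "operation": "delete", "terms": []})

    def query(self, *terms: str) -> List[dict]:
        """Return the representations that contain all of the given terms."""
        if not terms:
            return []
        ids = set.intersection(*(self.postings.get(t.lower(), set())
                                 for t in terms))
        return [self.representations[i] for i in sorted(ids)]


index = ForensicIndex()
index.add_version("/docs/bid.txt", "Confidential bid for Project Falcon", "create")
index.add_version("/docs/bid.txt", "Lunch menu", "edit")   # content overwritten
index.record_removal("/docs/bid.txt")                      # file deleted

hits = index.query("falcon", "bid")
print([h["operation"] for h in hits])  # ['create']
```

  • The query still finds the original content after the file was edited and then deleted, which corresponds to the latent search the description refers to.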
  • a computer-readable medium that configures a computer system to perform a method of proactively generating and comparing computer forensic evidence for a computer system.
  • the method comprises the steps of generating a representation of content of at least one target within a set of targets, and generating an inverted index of the set of targets, wherein the inverted index is associated with representations of the content of each target of the set of targets.
  • an apparatus for proactively generating and comparing computer forensic evidence comprises a processor arranged to generate a representation of content of at least one target within a set of targets, and a processor arranged to generate an inverted index of the set of targets, wherein the inverted index is associated with representations of the content of each target of the set of targets.
  • a method of generating a signature of a video comprises the steps of detecting scene changes of the video, segmenting the video into a plurality of segments corresponding to each scene change, extracting a representation from each segment, forming a digital fingerprint based on the representation, and creating a signature by combining the digital fingerprint with predetermined metadata of the video.
  • the representation includes one or more frames of each segment.
  • the representation includes length information of each segment.
  • the representation includes a subtitle and caption text of each segment.
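  • The video steps above can be sketched on toy data. Frames are modeled here as short lists of pixel intensities, a scene change is assumed to be a jump in mean absolute pixel difference above a threshold, and each segment is represented by its first frame and its length; none of these choices come from the text, and a practical system would need a detector and representation that tolerate re-encoding.

```python
import hashlib
from typing import Dict, List

Frame = List[int]  # toy frame: a flat list of pixel intensities (0-255)


def frame_distance(a: Frame, b: Frame) -> float:
    """Mean absolute pixel difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def segment_by_scene(frames: List[Frame], threshold: float) -> List[List[Frame]]:
    """Start a new segment wherever consecutive frames differ by more than threshold."""
    segments: List[List[Frame]] = []
    for i, frame in enumerate(frames):
        if i == 0 or frame_distance(frames[i - 1], frame) > threshold:
            segments.append([])
        segments[-1].append(frame)
    return segments


def segment_fingerprint(segment: List[Frame]) -> str:
    """Fingerprint a segment from its first frame and its length in frames."""
    representation = bytes(segment[0]) + len(segment).to_bytes(4, "big")
    return hashlib.sha256(representation).hexdigest()[:16]


def video_signature(frames: List[Frame], metadata: Dict[str, str],
                    threshold: float = 50.0) -> dict:
    """Combine per-segment fingerprints with metadata of the video."""
    segments = segment_by_scene(frames, threshold)
    return {"fingerprints": [segment_fingerprint(s) for s in segments],
            "metadata": dict(metadata)}


dark, bright = [10, 12, 11, 9], [200, 210, 205, 198]
clip = [dark, dark, dark, bright, bright]          # two scenes: 3 + 2 frames
sig = video_signature(clip, {"title": "sample.mp4"})
print(len(sig["fingerprints"]))  # 2
```

  • A clip that reuses only the first scene produces the same first fingerprint, so partial copies of a video can still be related to the original.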
  • a method of generating a signature for an audio file comprises the steps of determining whether the audio includes music or speech, generating, if the audio includes speech, a transcript of the speech, extracting a representation from the transcript, forming a digital fingerprint based on the representation, and creating a signature by combining the digital fingerprint with predetermined metadata of the audio.
  • a method of generating a signature of a digital file comprises the steps of selecting a token indicating an informational object, generating a digital fingerprint of the digital file using the selected token, and creating the signature that includes the digital fingerprint and meta data of the digital file.
  • the method further comprises the steps of extracting a plurality of tokens from the digital file according to the selected token, and inserting the extracted plurality of tokens into a data structure that probabilistically determines whether a token has been inserted before.
  • the selected token includes a token selected from the group consisting essentially of: an email address, a name, an account number, and a social security number.
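  • A Bloom filter is one well-known data structure that answers "was this token inserted before?" probabilistically, and it is used below purely as an assumed example; the text does not name the structure. The sketch takes email addresses as the selected token type, with a simplified regular expression and arbitrary filter parameters.

```python
import hashlib
import re
from typing import Iterable

# Simplified pattern for email addresses; real-world matching is more involved.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class BloomFilter:
    """Bit-vector set with no false negatives and occasional false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as the bit vector

    def _positions(self, token: str) -> Iterable[int]:
        """Derive num_hashes bit positions from seeded SHA-256 digests."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{token}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, token: str) -> None:
        for pos in self._positions(token):
            self.bits |= 1 << pos

    def __contains__(self, token: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(token))


def email_fingerprint(text: str) -> BloomFilter:
    """Extract email tokens from the text and insert them into a Bloom filter."""
    bloom = BloomFilter()
    for address in EMAIL_PATTERN.findall(text):
        bloom.add(address.lower())
    return bloom


doc = "Contact alice@example.com or Bob@Example.org for the wire details."
fp = email_fingerprint(doc)
print("alice@example.com" in fp)  # True: inserted tokens are always found
print("bob@example.org" in fp)    # True
```

  • A token that was never inserted is usually reported absent, but a false positive is possible by design; the bit-vector size and number of hashes trade memory for that error rate.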
  • a method of matching a forensic signature of a digital file comprising the steps of receiving, by a master apparatus, a list of query signatures, generating, by the master apparatus, a plurality of sub-lists of query signatures, transmitting, by the master apparatus, a sub-list to a slave apparatus, matching, by the slave apparatus, the query signatures with signatures in a database according to digital fingerprints included in the query signatures and generating a plurality of matching lists, one for each digital fingerprint, merging, by the slave apparatus, the plurality of matching lists and generating a final list, calculating, by the slave apparatus, a score for each matched signature in the final list according to a closeness between the matched signature and the query signature, and transmitting, by the slave apparatus, the final list and the score to the master apparatus.
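  • The division of work in this matching method can be sketched in a single process, with function calls standing in for network transmission. The signature model (a set of fingerprints), the sample database, and the closeness score (shared fingerprints divided by query size) are assumptions for illustration; the function names mirror the master and slave roles in the text.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Signature = Set[str]  # toy signature: a set of digital fingerprints

# Hypothetical signature database held by each slave apparatus.
DATABASE: Dict[str, Signature] = {
    "doc_a": {"f1", "f2", "f3"},
    "doc_b": {"f3", "f4"},
    "doc_c": {"f7"},
}


def slave_match(queries: List[Signature]) -> List[List[Tuple[str, float]]]:
    """For each query: match per fingerprint, merge the lists, score the result."""
    results = []
    for query in queries:
        # One matching list per digital fingerprint in the query signature.
        per_fingerprint = [[name for name, sig in DATABASE.items() if fp in sig]
                           for fp in sorted(query)]
        # Merge: count how many fingerprints each stored signature shares.
        merged = Counter(name for lst in per_fingerprint for name in lst)
        # Score: shared fingerprints over query size (an assumed closeness measure).
        scored = [(name, hits / len(query)) for name, hits in merged.items()]
        results.append(sorted(scored, key=lambda s: (-s[1], s[0])))
    return results


def master_match(queries: List[Signature],
                 num_slaves: int) -> List[List[Tuple[str, float]]]:
    """Split the query list into sub-lists, dispatch each, and collect the answers."""
    chunk = max(1, -(-len(queries) // num_slaves))  # ceiling division
    sub_lists = [queries[i:i + chunk] for i in range(0, len(queries), chunk)]
    final: List[List[Tuple[str, float]]] = []
    for sub_list in sub_lists:  # each call stands in for one slave apparatus
        final.extend(slave_match(sub_list))
    return final


results = master_match([{"f1", "f3"}, {"f7"}, {"f9"}], num_slaves=2)
print(results)
# [[('doc_a', 1.0), ('doc_b', 0.5)], [('doc_c', 1.0)], []]
```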
  • FIG. 1 is a schematic diagram of an exemplary computing environment.
  • FIG. 2 is a schematic diagram of an exemplary network environment.
  • FIG. 3 is a flow chart illustrating an exemplary method for generating a signature for a document.
  • FIG. 4 is a flow chart illustrating document modification and new fingerprint generation pursuant to one embodiment of the present invention.
  • FIG. 5 is a flow chart illustrating an exemplary method to perform a latent signature search.
  • FIG. 6 is a flow chart illustrating an exemplary method for user misuse detection.
  • FIG. 7 is a flow chart illustrating another exemplary method for user misuse detection through the use of user signature profiles.
  • FIG. 8 is a flow chart illustrating an exemplary method for the detection of an unauthorized network communication of sensitive information.
  • FIG. 9 is a schematic block diagram illustrating an exemplary embodiment of a system of the present invention, showing event trigger, fingerprint/signature generation, signature query and comparison, and signature storage.
  • FIG. 10 is a flow chart illustrating an exemplary method of generating an inverted index for a set of documents according to the present invention.
  • FIG. 11 is a flow chart illustrating an exemplary method for updating an inverted index in response to document addition and modification according to one aspect of the present invention.
  • FIG. 12 is a flow chart illustrating an exemplary method for performing a latent search using an inverted index.
  • FIG. 1 and the following discussion are intended to provide a brief general description of a suitable computing environment in which an example embodiment of the invention may be implemented. It should be understood, however, that handheld, portable, and other computing devices of all kinds are contemplated for use in connection with the present invention. While a general purpose computer is described below, this is but one example.
  • the present invention also may be operable on a thin client having network server interoperability and interaction.
  • an example embodiment of the invention may be implemented in an environment of networked hosted services in which very little or minimal client resources are implicated, e.g., a networked environment in which the client device serves merely as a browser or interface to the World Wide Web.
  • the invention can be implemented via an application programming interface (API), for use by a developer or tester, and/or included within the network browsing software which will be described in the general context of computer-executable instructions, such as program modules, being executed by one or more computers (e.g., client workstations, servers, or other devices).
  • program modules include routines, programs, objects, components, data structures and the like that perform particular tasks or implement particular abstract data types.
  • the functionality of the program modules may be combined or distributed as desired in various embodiments.
  • those skilled in the art will appreciate that the invention may be practiced with other computer system configurations, including personal computers (PCs), server computers, hand-held or laptop devices, multi-processor systems, microprocessor-based systems, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like.
  • program modules may be located in both local and remote computer storage media including memory storage devices.
  • FIG. 1 thus illustrates an example of a suitable computing system environment 100 in which the invention may be implemented, although as made clear above, the computing system environment 100 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Neither should the computing environment 100 be interpreted as having any dependency or requirement relating to any one or a combination of components illustrated in the exemplary operating environment 100.
  • an example system for implementing the invention includes a general purpose computing device in the form of a computer 110.
  • Components of the computer 110 may include, but are not limited to, a processing unit 120, a system memory 130, and a system bus 121 that couples various system components including the system memory to the processing unit 120.
  • the system bus 121 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures.
  • such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, Peripheral Component Interconnect (PCI) bus (also known as Mezzanine bus), and PCI-Express bus.
  • the computer 110 typically includes a variety of computer readable media.
  • Computer readable media can be any available media that can be accessed by the computer 110 and includes both volatile and nonvolatile, removable and non-removable media.
  • Computer readable media may comprise computer storage media and communication media.
  • Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
  • Computer storage media includes, but is not limited to, random access memory (RAM), read-only memory (ROM), Electrically-Erasable Programmable Read-Only Memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer 110.
  • Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
  • modulated data signal means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.
  • communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency (RF), infrared, and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
  • the system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as ROM 131 and RAM 132 .
  • BIOS basic input/output system
  • RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by the processing unit 120 .
  • FIG. 1 illustrates operating system 134 , application programs 135 , other program modules 136 , and program data 137 .
  • RAM 132 may contain other data and/or program modules.
  • the computer 110 may also include other removable/non-removable, volatile/nonvolatile computer storage media.
  • FIG. 1 illustrates a hard disk drive 141 that reads from or writes to non-removable, nonvolatile magnetic media, a magnetic disk drive 151 that reads from or writes to a removable, nonvolatile magnetic disk 152 , and an optical disk drive 155 that reads from or writes to a removable, nonvolatile optical disk 156 , such as a CD ROM or other optical media.
  • removable/non-removable, volatile/nonvolatile computer storage media that can be used in the example operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM, and the like.
  • the hard disk drive 141 is typically connected to the system bus 121 through a non-removable memory interface such as interface 140
  • magnetic disk drive 151 and optical disk drive 155 are typically connected to the system bus 121 by a removable memory interface, such as interface 150 .
  • the drives and their associated computer storage media discussed above and illustrated in FIG. 1 provide storage of computer readable instructions, data structures, program modules and other data for the computer 110 .
  • the hard disk drive 141 is illustrated as storing operating system 144 , application programs 145 , other program modules 146 , and program data 147 . Note that these components can either be the same as or different from operating system 134 , application programs 135 , other program modules 136 , and program data 137 .
  • Operating system 144 , application programs 145 , other program modules 146 , and program data 147 are given different numbers here to illustrate that, at a minimum, they are different copies.
  • a user may enter commands and information into the computer 110 through input devices such as a keyboard 162 and pointing device 161 , commonly referred to as a mouse, trackball or touch pad.
  • Other input devices may include a microphone, joystick, game pad, satellite dish, scanner, or the like.
  • These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus 121 , but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB).
  • USB universal serial bus
  • a monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190 .
  • computers may also include other peripheral output devices such as speakers and a printer (not shown), which may be connected through an output peripheral interface 195 .
  • the computer 110 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 180 .
  • the remote computer 180 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110 , although only a memory storage device 181 has been illustrated in FIG. 1 .
  • the logical connections depicted in FIG. 1 include a local area network (LAN) 171 and a wide area network (WAN) 173 , but may also include other networks.
  • LAN local area network
  • WAN wide area network
  • Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet.
  • When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170 . When used in a WAN networking environment, the computer 110 typically includes means for establishing communications over the WAN 173 , such as the Internet. In a networked environment, program modules depicted relative to the computer 110 , or portions thereof, may be stored in the remote memory storage device.
  • FIG. 1 illustrates remote application programs 185 as residing on a memory device 181 .
  • Remote application programs 185 include, but are not limited to, web server applications such as Microsoft Internet Information Services (IIS)® and Apache HTTP Server, which provide content residing on the remote storage device 181 or other accessible storage device to the World Wide Web. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.
  • IIS Microsoft Internet Information Services
  • a computer 110 or other client devices can be deployed as part of a computer network.
  • the present invention pertains to any computer system having any number of memory or storage units, and any number of applications and processes occurring across any number of storage units or volumes.
  • An embodiment of the present invention may apply to an environment with server computers and client computers deployed in a network environment, having remote or local storage.
  • the present invention may also apply to a standalone computing device, having programming language functionality, interpretation and execution capabilities.
  • FIG. 2 illustrates an embodiment of a network environment in which an embodiment of the present invention can be implemented.
  • the network environment 200 contains a number of local server systems 210 , which may include a number of file servers 211 , web servers 212 , and application servers 213 that are owned and managed by the owner of the local network. These servers are in communication with local user systems 220 which may include a large variety of systems such as workstations 221 , desktop computers 222 , laptop computers 223 , and thin clients or terminals 224 .
  • the local user systems 220 may contain their own persistent storage devices such as in the case of workstations 221 , desktop computers 222 , and laptop computers 223 . They can also have access to the persistent storage provided by the local servers 210 .
  • network storage may be the only available persistent storage.
  • the local user systems are usually connected to a variety of peripherals 260 that handle data input and output, such as scanners, printers and optical drives.
  • removable media 250 can be based on magnetic recording, such as floppy disks and portable hard drives, or be based on optical recording, such as compact disks or digital video disks.
  • removable media can also be based on non-volatile memory such as flash memory which can be a USB flash drive, and all forms of flash memory cards.
  • the users within the local network usually get access to the wider area network such as the Internet 280 through the local server systems 210 and typically some network security measures such as a firewall 270 .
  • the remote computer systems can be a variety of remote terminals 291 , remote laptops 292 , remote desktops 293 , and remote web servers 294 .
  • FIG. 2 illustrates an exemplary network environment.
  • teachings of the present invention can be used with any number of network environments and network configurations.
  • the present invention teaches methods and systems to improve computer forensics with search and machine learning.
  • This invention allows organizations that anticipate the need for forensic analysis to prepare in advance by keeping small amounts of information about any content on computer systems, such as files, database entries or schema, or network traffic, as the content is created, deleted, modified, copied, transmitted or received. Computational and storage costs are expended in advance, which allows faster, better and less expensive computer forensics investigations.
  • the present invention provides a novel proactive approach for computer forensic investigations.
  • a signature contains one or more fingerprints and other information associated with the target.
  • a fingerprint is a relatively small number of bits, as compared to the size of the file, that is computed based on the content of a target.
  • the target can be any file, any file that is owned by a user, any operating system file, any file that is part of a proprietary information system, any file that is related to a network intrusion attack, any database entry or definition, or network traffic.
  • a signature contains one or more fingerprints computed based on the content of the file along with other information associated with the file, such as the file name, date and time of record, user/owner information, and fingerprint history.
  • the signature contains one or more fingerprints that are calculated based on the content of the database entry or definition along with other information associated with the database entry or definition.
  • the signature contains one or more fingerprints that are calculated based on the content of the network traffic, along with other information associated with the network traffic, such as time and date information, sender and recipient network addresses, and network protocol.
  • the fingerprints of the present invention are digital digests of the content of a target.
  • all bits that make up a file are considered as the content of a file.
  • the content of a target is defined and represented by selections of tokens that are logically selected from the target.
  • the content of a target that contains textual information can be defined by a selection of words and phrases within the target.
  • idiosyncratic characteristics of the target can be identified and used to represent the contents. Fingerprints are small, taking up a small amount of storage space, when compared to the original content of the target.
  • Fingerprints are also easy to compute, and can identify a file, a database entry or definition, or network traffic by its content as defined by the list of selected tokens. Fingerprints can accommodate small modifications of the file (e.g., small edits or reformatting of a file may not alter its fingerprint). The fingerprints of a minimally edited version of a file mostly or fully match the fingerprints of the original file.
  • the creation of a signature usually comprises four steps. First, a set of tokens of interest is extracted from a target, such as a file, database entry or definition, or network traffic. Second, the token set undergoes a predetermined sequence of processing, such as sorting and filtering. Third, a fingerprint is generated for each retained token set. Lastly, the fingerprint is combined with other information associated with the target file, database entry or definition, or network traffic to generate a signature.
  • the first step involves parsing the document, extracting text information and retaining tokens of interest.
  • Tokens of interest may include, but are not limited to, all words, phrases, selective parts of speech, e.g., nouns (names, places, etc.), words longer than a fixed number of characters, words not found in a dictionary, words found within a certain set of predefined lists of words, words of a “foreign nature”, words based on inverse document frequencies (histograms), in other words, words based on collection statistics, and acronyms.
  • Processing the token set may involve sorting the token set, and may further include filtering the token set. Sorting the token set can be based on, but is not limited to, Unicode (alphabetical) ordering, biased weighting on inverse document frequency, and phrase or word length.
  • the retained tokens may be sorted again as previously described. However, sorting is unnecessary if one wishes to retain the same sorting conditions as used previously.
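  • The extraction, filtering, and sorting steps can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `extract_tokens`, the regular expression used to split words, and the `length_threshold` parameter are assumptions chosen for the example. It keeps only "words longer than a fixed number of characters" and sorts them in Unicode (code point) order.

```python
import re

def extract_tokens(text, length_threshold=5):
    """Return a sorted, de-duplicated list of tokens of interest.

    Hypothetical helper: keeps words longer than `length_threshold`
    characters, one of the token criteria listed in the text.
    """
    # Parse the text into lowercase alphabetic words (an assumed tokenizer).
    words = re.findall(r"[A-Za-z]+", text.lower())
    # Filter: retain only words longer than the threshold, dropping repeats.
    retained = {w for w in words if len(w) > length_threshold}
    # Sort: Python orders strings by Unicode code point.
    return sorted(retained)

tokens = extract_tokens(
    "The forensic signature of a document; the document signature."
)
print(tokens)
```

  • Because the retained tokens are de-duplicated and sorted, reordering sentences or changing capitalization in the input leaves the token list unchanged, which is the property the later fingerprinting steps rely on.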
  • Creating one or more fingerprints of the retained token list can follow several computational methods. One example is a hash based method: using a hash function, one can encode the sorted list of retained tokens and generate a unique hash for the retained token list. Many popular hash functions can be used for the calculation of the hash, such as MD5, SHA-1, RIPEMD, WHIRLPOOL, and the variations of these hash functions. Using a hash method for fingerprint creation is advantageous because the hash is quick to calculate and saves space. However, hash methods are not reversible (i.e., given a hash code, it is computationally impractical to retrieve the original token list).
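  • A hash based fingerprint might look like the sketch below. It uses SHA-256 from Python's standard `hashlib` module, which is a substitution made for this example (the text names MD5, SHA-1, RIPEMD, and WHIRLPOOL); the function name `hash_fingerprint` and the newline separator are likewise assumptions.

```python
import hashlib

def hash_fingerprint(tokens):
    """Hash the sorted, de-duplicated token list into a fixed-size digest."""
    # Sorting and de-duplicating makes the digest independent of token order.
    joined = "\n".join(sorted(set(tokens)))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

original = hash_fingerprint(["signature", "forensic", "document"])
reordered = hash_fingerprint(["document", "signature", "forensic", "forensic"])
different = hash_fingerprint(["signature", "forensic", "network"])

print(original == reordered)   # same retained tokens, same fingerprint
print(original == different)   # a changed token yields a different fingerprint
```

  • The digest is 64 hexadecimal characters regardless of how many tokens were retained, which illustrates why the hash method saves space but cannot be reversed to recover the token list.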
  • Another method for fingerprint creation is a bit vector method, which uses a bit vector to encode the presence or absence of retained tokens.
  • the bit vector could be a binary vector using a sequence of Boolean values, each stored as a single bit, or a non-binary numeric vector.
  • the advantage of the bit vector method is that it is a reversible process, but bit vectors are often more costly in terms of storage space.
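  • The bit vector method can be sketched as follows, assuming a fixed, ordered vocabulary shared by everyone who creates or reads fingerprints. The vocabulary `VOCAB` and both function names are hypothetical. Tokens outside the vocabulary are not encoded, so the recovery shown here is exact only for tokens the vocabulary contains.

```python
def bit_vector_fingerprint(tokens, vocabulary):
    """Encode presence (1) or absence (0) of each vocabulary term."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]

def recover_tokens(bits, vocabulary):
    """Reverse the encoding: list the vocabulary terms whose bit is set."""
    return [term for term, bit in zip(vocabulary, bits) if bit]

# Assumed vocabulary for illustration only.
VOCAB = ["attack", "document", "forensic", "network", "signature"]

bits = bit_vector_fingerprint(["forensic", "signature", "document"], VOCAB)
print(bits)
print(recover_tokens(bits, VOCAB))
```

  • The storage cost grows with the vocabulary size rather than with the document, which is one way to read the remark that bit vectors are often more costly in storage than a fixed-size hash.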
  • fingerprints are generally some form of lossy compression based on a predefined operation. However, it is within the scope of the invention to use a lossless compression method. For multimedia content, such as an image, sound and/or video file, mathematical transformations can be used to create fingerprints. It is apparent to those skilled in the art that fingerprint creation can be achieved through a variety of methods, and are not limited to the above mentioned approaches.
  • Once the fingerprints are created, other information associated with the document is extracted and combined with the fingerprints to create a signature.
  • the other associated information may be information about the document that is accessible through the operating system, which may include, but is not limited to, file name, date and time of record, user/owner information, access history, and fingerprint history.
  • Other information may also include information about the document accessible through an application, which may include, but is not limited to, author, time of editing, number of words, title, subject, comments, and any other customizable fields or application specific information.
  • FIG. 3 shows an exemplary diagram of the process of generating a signature for a document.
  • the document is first parsed and non-textual information is removed.
  • a set of tokens 311 are extracted 310 from the document.
  • the token set is then processed to yield a unique token list.
  • the processing of the token set involves sorting the token set 320 , which produces a sorted list of tokens 323 , and filtering the token set 324 , which generates one or more filtered lists of tokens 325 .
  • the retained tokens are then used to generate one or more fingerprints of the document 330 .
  • a hash or bit vector can be calculated for the entire list of retained tokens and used as a fingerprint.
  • the processed token list can be presented in the form of several subsets of tokens.
  • a hash or bit vector can be calculated for each subset of tokens, and the document is represented with a list of fingerprints corresponding to each retained subset of tokens.
  • a hash or bit vector is calculated for each retained token, and the document is represented with a list of fingerprints corresponding to each retained token.
  • a signature is created by combining other information associated with the document 331 with one or more fingerprints. The resulting signature is then stored.
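  • The final step of FIG. 3, combining fingerprints with other information, can be sketched as below. The `Signature` record, its field names, and the example values are all assumptions made for illustration; the text only requires that one or more fingerprints be stored together with associated information such as file name, owner, and time of record. One fingerprint is computed per retained token subset.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class Signature:
    """Hypothetical signature record: fingerprints plus other information."""
    file_name: str
    owner: str
    recorded_at: str
    fingerprints: list = field(default_factory=list)

def fingerprint(tokens):
    """Hash one sorted token subset (SHA-256 chosen for this sketch)."""
    return hashlib.sha256("\n".join(tokens).encode("utf-8")).hexdigest()

def make_signature(file_name, owner, recorded_at, token_subsets):
    """Create one fingerprint per token subset and attach the metadata."""
    prints = [fingerprint(sorted(subset)) for subset in token_subsets]
    return Signature(file_name, owner, recorded_at, prints)

sig = make_signature(
    "report.txt", "alice", "2013-04-08T12:00:00",
    [["forensic", "signature"], ["document"]],
)
print(sig.file_name, len(sig.fingerprints))
```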
  • If modifications to a document are small, the fingerprint of the file might not change, and the signature is updated with relevant other information. If modifications to a document are not small, then the modified document's fingerprint may not be sufficiently close to the original fingerprint. After such modification, a new candidate fingerprint is created and compared to the original fingerprint. If sufficient change has occurred in the document, and the candidate fingerprint does not match the original fingerprint, the new candidate fingerprint is added to the document's signature.
  • the signature may encode other information, including but not limited to information related to derivation. In other embodiments of the invention, similarity may be measured by comparing fingerprints, signatures or both.
  • FIG. 4 illustrates document modification and further fingerprint generation.
  • a new candidate fingerprint is generated 420 based on the content of the modified document using the method exemplified in FIG. 3 .
  • the new candidate fingerprint is then compared with the fingerprint representing the original version of the document 430 .
  • the actual original document does not need to be retrieved for comparison. If the candidate fingerprint does not differ from the original fingerprint, the modification of the document is minor.
  • the original fingerprint is then combined with updated other information associated with the document 450 and the updated signature is stored. If the candidate fingerprint differs from the original fingerprint, a major modification has occurred.
  • the candidate fingerprint is then added to the original fingerprint 440 .
  • a new signature of the modified document is then created, incorporating the updated other information of the document and stored. If a fingerprint history is implemented in the signature, it is also updated.
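  • The update flow of FIG. 4 can be sketched as follows. The dictionary layout and the function name `update_signature` are hypothetical, and "match" is simplified here to exact equality of fingerprint values; an implementation could instead use a closeness measure.

```python
def update_signature(signature, candidate, metadata):
    """Return (updated_signature, major_change) without mutating the input.

    `signature` is assumed to be a dict with a 'fingerprints' list and a
    'metadata' dict; `candidate` is the fingerprint of the modified document.
    """
    updated = {
        "fingerprints": list(signature["fingerprints"]),
        "metadata": {**signature["metadata"], **metadata},
    }
    if candidate in updated["fingerprints"]:
        # Minor modification: keep the fingerprints, refresh other information.
        return updated, False
    # Major modification: add the candidate to the stored fingerprints.
    updated["fingerprints"].append(candidate)
    return updated, True

stored = {"fingerprints": ["aaa"], "metadata": {"modified": "t0"}}
minor, minor_flag = update_signature(stored, "aaa", {"modified": "t1"})
major, major_flag = update_signature(stored, "bbb", {"modified": "t2"})
print(minor_flag, minor["fingerprints"])
print(major_flag, major["fingerprints"])
```

  • Only fingerprints are compared, so the original document never has to be retrieved, and the growing fingerprint list doubles as a simple fingerprint history.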
  • the present invention can easily be adapted to other types of files. It is also possible to construct similar fingerprints for multimedia files such as image, video, and sound files.
  • a variety of mathematical transformations can be used to create fingerprints for these file types, such as Laplace transform, Karhunen-Loève transform and Fourier transform.
  • Metadata text of sound, image, and video can be used to generate fingerprints. Closed captioning within a video file is text which can be used to generate fingerprints, as previously described. Speech can be converted to text using existing software tools. Text thus derived can be used to generate fingerprints.
  • the digital content of these files can be encoded as a sequence of tokens, like text documents.
  • Executables and dynamically linked libraries can be represented as a sequence of tokens, which can be used to produce fingerprints. Text embedded in these files can also be used to create fingerprints.
  • Reverse engineered programs e.g., Java
  • Byte-code languages and scripting languages e.g., Perl, Python
  • the fingerprint creation process produces a relatively small number of bits, as compared to the original file, and serves as a digest of the content of the original file.
  • a person skilled in the art will appreciate that numerous methods can be used for achieving fingerprint creation.
  • the fingerprint creation process in general is a lossy compression process. However, lossless compression schemes can also be adopted for the fingerprint creation process.
  • the signatures are stored in a manner preventing a regular user from modifying or deleting the signatures. Because the signatures are used for forensic purposes, their generation and storage is preferably transparent to the regular user. Only authorized personnel and forensic investigators can have access to the stored signatures. In a network environment, signatures can be created on a user system and offloaded to a network server for storage. Signatures can also be stored on a local file system, while denying user access through use of hidden files or hidden partitions. The signatures can also be embedded in encrypted files. One can also use write-once, read-many media for storing signatures. Only authorized personnel or forensic investigators can recover the storage media and be responsible for safe keeping. Off site storage of the signatures may also be desirable. Cryptographic logging mechanisms can be implemented to control and monitor the access of the signatures.
  • the present invention can be implemented in a variety of ways.
  • a stand alone system such as an individual PC, laptop, mobile device (e.g., cell phone, PDA, etc.)
  • signature information is stored locally.
  • shared file systems such as file servers, database servers, and network attached storage (NAS)
  • signature information is stored locally or on the shared file systems.
  • any system with a network connection can have signature information stored on remote servers.
  • signatures can be stored in a variety of ways depending on the system or the network configurations of a particular environment.
  • Fingerprints can be created for information that is stored in any database and also database definitions. Signatures for each database entry are based on content and can be created for the entire database. As an example, signatures can be created for emails stored within a server database, allowing the tracing of email senders and receivers. Database definitions, such as schema, relations, tables, keys, and data domains can also have signatures created. When a data manipulation or definition event occurs, such as create table, drop table, or alter table, a new signature is created and stored.
  • signatures can be created for other applications. Changes to virtual machine file systems could be indexed as changes occur. Contents of removable media could have signatures created during mounting or un-mounting (during connection and disconnection) to a computer system. Compressed or archived files could be parsed and have signatures created.
  • signatures can be created for emails entering and exiting a network.
  • Email attachments can have separate signatures created.
  • Network traffic can thus be linked to particular emails and files when stored.
  • Contents of instant message conversations and contents of file transfers can also be used to create signatures for the particular network activity.
  • Signatures can also be created for text messages such as the ones based on Short Message Service (SMS) protocol.
  • SMS Short Message Service
  • Web pages can also have signatures generated. When integrated over time, a digest or profile of one or more user's Internet browsing history can be generated.
  • a person skilled in the art will appreciate that any information or signal transmitting protocol can be used as a target for signature creation.
  • a proxy firewall is used, and signatures are created of network traffic passing through.
  • Network policies can be configured so that the network traffic passing through the proxy firewall is not encrypted.
  • secure connections are established between an inside user computer to the proxy firewall, and the proxy firewall to an outside server using an encryption protocol such as Transport Layer Security (TLS) or Secure Sockets Layer (SSL).
  • TLS Transport Layer Security
  • SSL Secure Sockets Layer
  • Network traffic encryption only occurs between the inside user computer and the proxy firewall, and between the proxy firewall and the outside server.
  • Contents passing through the proxy firewall are not encrypted and can, therefore, have signatures created. Signatures are stored along with other associated information regarding the network traffic, such as the IP addresses used in communication, thereby facilitating the identification of the origin and destination of the traffic.
  • Once signatures are stored, there are a variety of methods to analyze them. Similarity between signatures can be ascertained by comparing the signatures or the fingerprints for exact matching, percentage of matching, probability of matching, or other mathematical calculation revealing the divergence of the signatures or fingerprints.
  • a latent analysis can be performed. Particular signatures and/or fingerprints on individual machines locally or remotely can be searched and compared. Signatures or fingerprints that are stored in a database can be similarly searched.
  • an active analysis is performed. Instead of simply searching with signatures and fingerprints, advance or retrospective analysis of the signatures and fingerprints can be performed for the purpose of data mining, user profiling, trend analysis, and anomaly detection.
  • FIG. 5 presents an exemplary method for performing a latent search.
  • the signature can then be used directly as a query signature.
  • a query signature can be created 510 using the method exemplified in FIG. 3 .
  • Stored signatures are then retrieved from storage 520 and compared to the query signature 530 . The comparison can be performed on signatures, the fingerprints within the signatures, or both. Similarity of the query signature to any stored signature is then determined. If the fingerprints are calculated using a hash method, the similarity is estimated based on hash matches. If the fingerprints are calculated using a bit vector method, the similarity is estimated based on bit vector correlation.
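  • The comparison step of FIG. 5 can be sketched as below. Both measures are stand-ins chosen for this example rather than the measures the invention prescribes: hash matches are scored as the fraction of fingerprints shared (Jaccard similarity), and bit vector correlation is approximated with cosine similarity. The function names, the stored data, and the 0.5 threshold are assumptions.

```python
import math

def hash_similarity(query_prints, stored_prints):
    """Fraction of distinct hash fingerprints shared by the two signatures."""
    q, s = set(query_prints), set(stored_prints)
    union = q | s
    return len(q & s) / len(union) if union else 0.0

def bit_vector_similarity(a, b):
    """Cosine similarity between two equal-length bit vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm else 0.0

def latent_search(query_prints, stored, threshold=0.5):
    """Return (score, name) pairs at or above the threshold, best first."""
    scored = [(hash_similarity(query_prints, p), name) for name, p in stored.items()]
    return sorted((hit for hit in scored if hit[0] >= threshold), reverse=True)

stored = {
    "memo.doc": ["h1", "h2", "h3"],
    "draft.doc": ["h2", "h3", "h4", "h5"],
    "notes.txt": ["h9"],
}
hits = latent_search(["h1", "h2", "h3"], stored)
print(hits)
```

  • Once similar signatures are found, the other information stored with them (file names, owners, systems) can be read out to identify related documents, machines, and users, as the following steps describe.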
  • the similar signatures are output for further processing 540 .
  • Other information within the stored signatures similar to the query signature is extracted 550 .
  • Other documents containing content similar to the document of interest, computer systems housing the document of interest or any similar documents, and users that had possession of the document of interest or any similar documents, can all be identified 560 .
  • FIG. 6 presents an exemplary method for user misuse detection.
  • When a user performs an operation on a document that is within a list of predetermined operations, such as creating, modifying, copying, moving, or deleting a document, the system captures this user operation 610 , and a new signature is created 620 and stored 630 .
  • This new signature is then used as a query signature, and compared with stored signatures 640 .
  • a subset of all stored signatures such as signatures of known documents containing classified or sensitive information, or illegal content can be used. If the comparison does not identify any stored signature within this subset having similarity to the query signature above a certain threshold, the user is presumably not manipulating classified, sensitive, or illegal content.
  • the operation proceeds as normal. If the comparison identifies any stored signature within this subset that has similarity to the query signature above a certain threshold, the user is presumed to be manipulating classified, sensitive, or illegal content. A further inquiry whether the user is expected to manipulate such content is performed 650 based on criteria such as security clearance, job assignment, or special permission. If the user is determined to have proper access permission, and is expected to manipulate such content, the operation proceeds as normal. However, if the user does not have proper permission, or is not expected to manipulate such content, then the suspect content is identified based on the query and the stored similar fingerprint or signature 660 , and a misuse alert is sent to authorized personnel or a forensic investigator 670 .
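  • The decision logic of FIG. 6 can be sketched as follows. The function name `check_operation`, the index of sensitive fingerprints, the similarity measure (shared fraction of hash fingerprints), and the threshold value are all assumptions for illustration; the authorization decision is reduced to a single boolean that a real system would derive from clearance, job assignment, or special permission.

```python
def check_operation(query_prints, sensitive_index, user_is_authorized, threshold=0.5):
    """Return (decision, matched_document) for a captured user operation."""
    q = set(query_prints)
    for doc_id, prints in sensitive_index.items():
        s = set(prints)
        union = q | s
        similarity = len(q & s) / len(union) if union else 0.0
        if similarity > threshold:
            # Sensitive content detected: proceed only for expected users.
            decision = "proceed" if user_is_authorized else "alert"
            return decision, doc_id
    # No stored sensitive signature is similar enough.
    return "proceed", None

sensitive = {"secret-plan": ["h1", "h2", "h3"]}
print(check_operation(["h1", "h2", "h3"], sensitive, user_is_authorized=False))
print(check_operation(["h1", "h2", "h3"], sensitive, user_is_authorized=True))
print(check_operation(["x1"], sensitive, user_is_authorized=False))
```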
  • FIG. 7 presents another exemplary method for user misuse detection. All the files that belong to or are accessed by a user are identified based on ownership information and access information 710 . Signatures of the entire collection of these files can be used to generate a user profile for the user 720 and are stored 730 . An updated user profile is then generated at a later time, either by request or based on a periodic schedule. The newly generated user profile is then compared to any or all of the stored user profiles of the same user at earlier times 740 . If no difference above a certain threshold is detected among the user profiles, there is no deviation in user behavior.
  • a further inquiry is performed to determine whether there is a legitimate reason for such deviation of user behavior 750 . If a legitimate reason is found, such as change in job assignment or upgrade of security clearance, the operation proceeds as normal. If no legitimate reason is found for the deviation of user behavior, the content of the mismatched signatures is identified 760 , and an alert of possible user misuse is sent to authorized personnel or to a forensic investigator 770 .
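  • The profile comparison of FIG. 7 can be sketched as below, treating a user profile as the set of fingerprints of all files the user owns or accesses. The deviation measure (one minus the shared fraction of fingerprints), the function names, and the threshold are assumptions, not the measure the invention prescribes.

```python
def profile_deviation(old_profile, new_profile):
    """Return 0.0 for identical profiles and 1.0 for disjoint ones."""
    old, new = set(old_profile), set(new_profile)
    union = old | new
    if not union:
        return 0.0
    return 1.0 - len(old & new) / len(union)

def review_profile(old_profile, new_profile, threshold, legitimate_reason):
    """Return (decision, mismatched_fingerprints) for a profile update."""
    if profile_deviation(old_profile, new_profile) <= threshold:
        return "no deviation", []
    if legitimate_reason:
        # For example, a change in job assignment or security clearance.
        return "proceed", []
    # Fingerprints present in only one of the two profiles.
    mismatched = sorted(set(old_profile) ^ set(new_profile))
    return "alert", mismatched

old = ["a", "b", "c", "d"]
new = ["a", "b", "c", "e"]
print(profile_deviation(old, new))
print(review_profile(old, new, threshold=0.25, legitimate_reason=False))
```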
  • FIG. 8 presents an exemplary method for detection of unauthorized network communication of sensitive information.
  • a network server receives inbound or outbound network traffic 810
  • a signature is then calculated based on the content of the network traffic 820 and stored 830 .
  • the signature is then used as a query signature and is compared to any previously stored signatures 840 .
  • If the query signature has similarity to any stored signature above a certain threshold, it is then compared to a subset of all stored signatures, such as signatures of known documents containing classified or sensitive information, or illegal content 850 . If the query signature does not have similarity above a certain threshold to any of the subsets of stored signatures, no classified, sensitive, or illegal content is detected.
  • Network traffic is allowed to proceed as normal 860 . However, if classified, sensitive, or illegal content is detected, suspect content and user information is identified 870 , the network traffic is then quarantined 880 , and an alert is sent to authorized personnel or to a forensic investigator 890 .
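  • The traffic screening flow of FIG. 8 can be sketched as follows. As a simplification made for this example, the sensitive subset is represented as a set of identifiers into the same stored index, so the two comparisons collapse into a match followed by a membership check. The function name `handle_traffic`, the result dictionary, and the threshold are hypothetical.

```python
def handle_traffic(payload_prints, stored_index, sensitive_ids, threshold=0.5):
    """Decide whether traffic is allowed or quarantined with an alert."""
    def similarity(a, b):
        a, b = set(a), set(b)
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    # First comparison: any stored signature similar to the traffic.
    matches = [
        doc_id for doc_id, prints in stored_index.items()
        if similarity(payload_prints, prints) > threshold
    ]
    # Second comparison: restrict the matches to the sensitive subset.
    suspect = [doc_id for doc_id in matches if doc_id in sensitive_ids]
    if suspect:
        return {"action": "quarantine", "alert": True, "suspect": suspect}
    return {"action": "allow", "alert": False, "suspect": []}

stored = {"public-faq": ["h1", "h2"], "merger-plan": ["h7", "h8", "h9"]}
sensitive_ids = {"merger-plan"}
print(handle_traffic(["h7", "h8", "h9"], stored, sensitive_ids))
print(handle_traffic(["h1", "h2"], stored, sensitive_ids))
```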
  • This invention can also be used for evidence discovery. Given one user or a set of users, a forensic analysis could determine documents of interest. Those identified documents could be used to seed a fingerprint search across all systems. That would rapidly identify which other systems needed further consideration for analysis. The present invention can determine the source of files that were not permanently stored, such as temporary files deleted without a user's knowledge.
  • This invention can be further used for misuse detection.
  • Many systems log accesses to restricted material.
  • restricted material is usually defined by its location within the file system, or by other attributes of the file. Once the restricted material leaves the protected file systems location, or loses its original attributes, access logging will no longer be able to detect misuse of the restricted material.
  • the present invention can detect when the access logging fails by verifying that documents that should have been logged were logged. Collection statistics and fingerprints can determine when a document is atypical for a user, which may be a sign of document misuse.
  • the present invention can also help to determine the source of leaks by identifying the systems within which a leaked document was present, and a time line that tracks movement of the leaked document through a network.
  • This invention can also be used for intrusion response.
  • the signatures of files associated with the intrusion can be recovered. Even if the original files are deleted, the signatures can still be recovered based on time stamps. These recovered signatures can be used to examine across systems for similar intrusions, and also provide early detection to prevent intrusion from similar attacks.
  • FIG. 9 illustrates an exemplary system of the present invention.
  • the system of FIG. 9 comprises four components: 1) a processor for creating/generating fingerprints and signatures for a target, such as a document 910 ; 2) an extension module to the operating system (OS) configured to trigger signature generation upon occurrence of a certain action 920 ; 3) a mechanism for storing the generated signatures 930 ; and 4) a comparator for querying the system for stored signatures and comparing those retrieved for similarity 940 .
  • the implemented system may have either a Windows service or Linux daemon running in the background to generate a signature for a given file and to store it, and then to query the stored signatures to determine similarity with other signatures.
  • the system runs with administrator or root privileges.
  • the extension module of the operating system has several components. First the configuration information must be stored on the system. In Windows, this would be registry entries or configuration files. In Linux, a configuration file is used, which is stored in /etc or another location.
  • the configuration information includes mechanisms for signature creation, other information to store with signatures, mechanism and location for signature storage, events that trigger signature creation and mechanisms for extracting text based on file type. Separate programs or modules can be called to perform text extraction. In Windows, the COM model can be used to extract text from Office documents. In Linux, various utilities can be used to extract text from different file types.
  • the signature creation is linked into the OS so that signatures are created when desired system events occur, such as file deletion, file copy between file systems, and file modification.
  • certain system events are remapped to invoke the signature creation process, and the system waits for the occurrence of these events.
  • the OS invokes calls to the signature creation process.
  • In Linux, this can be achieved by a loadable kernel module.
  • In Windows, this can be done in a variety of ways.
  • the system identifies the digital object (file) that triggered the operation, and passes a copy or pointer to the file for processing to the fingerprints creation process.
  • Tokens are extracted from the file and processed, fingerprints are generated for the retained token list, other information associated with the file (metadata) is incorporated with the fingerprints, and a signature is generated, all based on the criteria specified in the configuration information.
  • a basic system can incorporate the entire index of retained tokens (i.e., without filtration).
  • a simple tokenization of a document may include converting the entire document to lower-case (remove case sensitive information) and obtaining individual tokens.
  • a token for this basic system is any string of length 4 or more separated by either white space or any form of punctuation.
  • the individual tokens are then sorted according to Unicode ordering to obtain unique tokens.
  • a hash code or bit vector is then generated for each token in the sorted unique token list.
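  • The basic tokenization described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the regular expression used to split on white space and punctuation, and the choice of SHA-1 as the hash are all assumptions.

```python
import hashlib
import re


def tokenize(text, min_len=4):
    """Lower-case the text and split it on white space or punctuation."""
    # [\W_]+ matches any run of non-alphanumeric characters (assumed delimiter set).
    pieces = re.split(r"[\W_]+", text.lower())
    return [p for p in pieces if len(p) >= min_len]


def unique_sorted_tokens(text):
    """Return the unique tokens sorted by Unicode code point."""
    # Python compares str values code point by code point.
    return sorted(set(tokenize(text)))


def token_hashes(text):
    """Generate one hash code per token in the sorted unique token list."""
    # SHA-1 is an arbitrary choice for this sketch.
    return [hashlib.sha1(t.encode("utf-8")).hexdigest()
            for t in unique_sorted_tokens(text)]
```

For example, `unique_sorted_tokens("The quick, brown fox; the QUICK dog.")` keeps only `brown` and `quick`, because the remaining strings are shorter than four characters.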
  • the same process is used for tokenization of a document and sorting of the unique token list. The process also includes the filtering of the unique token list.
  • Subsets of the unique token list are created based on a list of criteria including, but not limited to, keeping tokens of only 6 characters or longer in length, keeping tokens numbered (in order) 25-50, keeping every 7th token, keeping every 25th token, or other similar rules.
  • a hash code or bit vector is then generated for each subset of tokens.
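  • A hedged sketch of the filtered variant follows. The filter names, the newline-joined encoding of each subset, and the use of SHA-1 are assumptions made for illustration; the four rules are the ones listed in the text.

```python
import hashlib


def subset_fingerprints(unique_tokens):
    """Return one hash per filtered subset of a sorted unique token list."""
    filters = {
        "len>=6": [t for t in unique_tokens if len(t) >= 6],
        "tokens 25-50": unique_tokens[24:50],  # 1-based positions 25..50
        "every 7th": unique_tokens[6::7],      # 7th, 14th, 21st, ...
        "every 25th": unique_tokens[24::25],   # 25th, 50th, 75th, ...
    }
    fingerprints = {}
    for name, subset in filters.items():
        joined = "\n".join(subset).encode("utf-8")
        fingerprints[name] = hashlib.sha1(joined).hexdigest()
    return fingerprints
```

Because each fingerprint covers only part of the token list, a change to one token alters only the fingerprints whose subsets include that token; the others remain equal and can still be matched.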
  • Fingerprints may vary in complexity.
  • a signature created based on a complete index of retained tokens, such as a list sorted according to Unicode, can be highly precise but support only minimal variance.
  • the precision and tolerance to variance of a signature created based on a filtered index of retained tokens depends on the degree of filtration.
  • a signature based on a highly filtered index provides high recall but low precision.
  • the number of filters employed to generate signatures also affects the complexity. Multiple filters increase precision but also increase the time required for signature calculation and the storage space needed for signature safe-keeping.
  • a mechanism for storing signatures should be resilient against modification by users. Once the signature is created, it is stored securely. A user other than authorized personnel or a forensic investigator should have no means to modify or delete any signature entry.
  • the signatures can be inserted into a database, allowing for easy queries and off-system storage. Alternatively, signatures can be stored in flat files having only root or administrator permissions.
  • When given a signature, one can check to see if the signature is in the store. If given a file or document, text is extracted from the file, fingerprints are created, then a signature, and the created query signature is checked against the store. If multiple fingerprints are used to represent a file, any or all of the fingerprints can be used to determine similarity above a predetermined threshold.
  • a proper or predetermined threshold can be the matching of all or some of the fingerprints, a probabilistic analysis of the similarity of the fingerprints, or any other mathematical analysis directed to signature divergence. The higher the threshold, the lower the rate of false positives but the higher the rate of false negatives.
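  • One simple threshold of the "all or some of the fingerprints" kind can be sketched as the fraction of fingerprint types on which two signatures agree. The dictionary layout (fingerprint type mapped to hash value), the function names, and the default threshold of 0.5 are assumptions for illustration only.

```python
def similarity(query_fps, stored_fps):
    """Fraction of shared fingerprint types on which two signatures agree."""
    shared = set(query_fps) & set(stored_fps)
    if not shared:
        return 0.0
    matches = sum(1 for k in shared if query_fps[k] == stored_fps[k])
    return matches / len(shared)


def is_match(query_fps, stored_fps, threshold=0.5):
    """Declare a match when the similarity reaches the chosen threshold."""
    return similarity(query_fps, stored_fps) >= threshold
```

Raising `threshold` toward 1.0 demands agreement on more fingerprints, which trades false positives for false negatives as described above.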
  • the content of a target can be alternatively represented by selections of terms that are logically selected from the target.
  • the selections of terms are conceptually and functionally similar to the above described fingerprints, and together with other information associated with the target, form a representation of the content of the target similar to the above-described signature.
  • the collection of terms of a plurality of targets can further be organized and indexed in a data structure which links to or is associated with the representations of the targets.
  • Terms can be extracted from a target to represent its content.
  • the content of a target that contains textual information can be defined by a selection of words and phrases within the target.
  • the present invention extracts terms appropriate for one or more targets, which include, but are not limited to, bits, bytes, characters, digits, numbers, words, word sequences, phrases, sentences, meta-data, and information derived from these targets.
  • terms may include stemmed and stopped words. Stemming and stop words are recognized by one skilled in the art as standard practice for information extraction in information-retrieval systems.
  • stemming can be generally described as a process for reducing inflected (or sometimes derived) words to their stem, base, or root form.
  • the stem, base, or root of a word in the context of stemming is not necessarily identical to the linguistic root of the word. It is sufficient that related words are mapped to the same stem, base, or root.
  • stemming algorithms are currently available.
  • stop words generally refers to words which are filtered out prior to, or after, the processing of natural language data (e.g., text). In the above-mentioned embodiment, stop words can be removed to yield a list of terms that represents the content of one or more targets.
  • stop words may be intentionally preserved to include phrases in the list of terms.
  • their idiosyncratic characteristics can be identified and used to represent a target's contents.
  • Feature points of images, video clips, audio waves, etc. can likewise be extracted and treated as terms.
  • the present invention method extracts these terms in a variety of ways.
  • Term selection can be achieved by providing a predetermined list of terms of interest, by using a predetermined algorithm that automatically identifies terms of interest, or by a combination of these two methods.
  • the present invention method records in the representation of the target the presence and/or absence of each term by methods that include, but are not limited to, recording its presence or absence, its frequency, and/or its relative importance.
  • the representation of the target also incorporates other information associated with the target that is extracted and combined with the terms.
  • the other associated information may be data relevant to the target, and can be accessed through the operating system, which may include, but is not limited to, file name, date and time of record, user/owner information, and access history.
  • Other associated information may also include data relevant to the target that is accessible through an application, which may include, but is not limited to, author, time of editing, number of words, title, subject, comments, and any other customizable fields or application-specific information.
  • the representation of the target records the inverse document frequency for each term.
  • Other associated information includes, but is not limited to, the source document's name, owner, location, date, and host name. Such other associated information can also include, but is not limited to, term frequency within the document, positioning information within the document, weighting means, etc.
  • sets of targets may be processed simultaneously. It is also within the scope of the invention to process individual targets sequentially, or in parallel with the results then merged.
  • the collection of terms contained within each individual target can be pooled and indexed in a data structure, such as an inverted index.
  • An inverted index is an index data structure storing a mapping from content, such as words or numbers, to its locations in a target such as a database file, a document, or a set of documents.
  • the inverted index is a data structure that is keyed to a list of terms such that each term references a posting list that refers to the targets containing that term.
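  • A minimal sketch of such an inverted index is given here, assuming white-space tokenization and a posting list that records, for each document, the positions at which a term occurs. The function name and data layout are illustrative assumptions.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to a posting list of the form {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index
```

The length of each position list gives the term frequency within that document, so frequency and positional information are both recoverable from the posting list.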
  • FIG. 10 shows an example of the present invention utilizing the inverted index of a document set.
  • individual documents are processed in parallel or in serial and then combined.
  • Terms are extracted from each of the documents within the document set 1010 .
  • the processing and extraction of terms may include the steps of identifying the format of a document, discarding formatting information, stemming, and removing stop words. All terms extracted from the document set are pooled and indexed.
  • One exemplary method of indexing is alphabetical indexing.
  • An inverted index of the document set is generated from the indexed terms of the document set 1020 . For each document, its collection of terms together with frequency and positional information of the terms form a representation of the content of the document 1030 .
  • the posting list of each of the terms within the inverted index of the document set may include representations of each of the individual documents of the document set, or it may contain references to the representations of the documents stored elsewhere 1050 .
  • an inverted index is not suitable for computer forensics.
  • the inverted index is modified as well.
  • the inverted index is constantly updated in response to the addition and removal of documents.
  • When documents are deleted, they are removed from the inverted index.
  • the inverted index is updated to include reference to the modified document, but loses the references to the original document.
  • the traditional manipulation of the inverted index is not appropriate when using it as a forensic examination application, in which case all versions of the targets must be maintained, or at the very least, an accurate representation of the targets must be maintained.
  • representations of the contents of the targets along with other associated information including, but not limited to, file existence or date of deletion or of modification, type of modification, etc. are added or linked to the inverted index.
  • the inverted index does not remove the posting of a removed target from the index.
  • the inverted index maintains a reference to a representation of a target that is stored permanently, and which is not removed when the original target is deleted.
  • the associated other information in the representation, such as the indication that a file was deleted and the identification of the user who performed the action, may or may not be updated.
  • the representation of the content of a deleted target is updated with information related to its deletion, such as time/date, user, and host/client computer where the deletion is performed.
  • New representations of new targets are generated as these new targets are added to the system, and the inverted index is updated to include the new targets.
  • the representations of the targets are either updated to account for these modifications or, if the modifications are sufficiently large, the modified targets are treated as though they are new targets.
  • the original representations are retained with all associated information pertaining to such targets, which may also include information pertaining to the circumstances of the modification.
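  • The retention behavior described above can be sketched as an index whose postings are never removed, with deletion recorded only as metadata on the stored representation. The class name, method names, and metadata fields are hypothetical and chosen for illustration.

```python
from collections import defaultdict


class ForensicIndex:
    """Inverted index that keeps postings for deleted targets (illustrative)."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of representations
        self.representations = {}         # id -> stored representation

    def add(self, rep_id, terms, **metadata):
        self.representations[rep_id] = {
            "terms": set(terms), "deleted": None, **metadata}
        for term in terms:
            self.postings[term].add(rep_id)

    def mark_deleted(self, rep_id, when, user, host):
        # Record the deletion; postings and representation are kept on purpose.
        self.representations[rep_id]["deleted"] = {
            "when": when, "user": user, "host": host}

    def lookup(self, term):
        return {rid: self.representations[rid]
                for rid in self.postings.get(term, ())}
```

A lookup for a term therefore still returns the representation of a deleted target, together with who deleted it, when, and on which host.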
  • FIG. 11 is a flow chart illustrating an exemplary method for document addition and modification according to one aspect of the present invention.
  • a representation of the content of the new document is generated 1110 .
  • the inverted index for the document set is then updated 1120 to include the new document and the representation of its content, as well as other information associated with the new document 1111 .
  • a candidate representation of the content of the modified document is generated 1140 .
  • This candidate representation is compared with the representation of the original document 1150 .
  • the modification is minor, (i.e., the similarity of the representations of the modified and original document is within a pre-determined threshold)
  • the original representation of the document is retained 1160 .
  • modified document is updated 1170 .
  • the modified document is treated as a new document 1180 .
  • the inverted index is updated to incorporate the modified document 1120 , and a new representation of the modified document is generated and other associated information stored. The listing of the original document and its representation are not modified.
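  • The decision between a minor and a major modification can be sketched as follows. This is one possible reading of the flow of FIG. 11, not a definitive implementation: Jaccard similarity over term sets, the threshold of 0.8, the store layout, and the way a new identifier is formed are all assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two term collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def record_modification(store, doc_id, new_terms, when, threshold=0.8):
    """Return the id under which the modified document is now tracked."""
    entry = store[doc_id]
    if jaccard(entry["terms"], new_terms) >= threshold:
        # Minor change: keep the original representation, note the modification.
        entry["modifications"].append(when)
        return doc_id
    # Major change: register a new entry; the original entry is left untouched.
    new_id = f"{doc_id}@{when}"
    store[new_id] = {"terms": set(new_terms), "modifications": []}
    return new_id
```

In both branches the original representation remains in the store, consistent with the requirement that the listing of the original document is not modified.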
  • the present invention permits individuals with the proper privilege to query an embodiment of the invention and the representation to determine the presence or absence of information that resides or resided on such computer systems.
  • these queries can be performed directly on one or more individual machines.
  • these queries can be performed remotely.
  • one embodiment of the present invention transmits the query to one or more computer systems, which execute the query on the representation and return the answer to that query.
  • another embodiment of the present invention includes a computer system that stores all of the representations from one or more other computer systems, queries all of these representations, and returns answers to those queries.
  • one or more computer systems periodically transmit all or portions of their representations to the computer system responsible for storing all representations.
  • This method also allows for these representations to be updated at any time by an individual with proper privilege.
  • the method also allows for these representations to be generalized or compiled into a single representation for one or more computer systems.
  • the method of the present invention allows for manual and automated query formation.
  • An individual with proper privileges may provide queries directly to a computer system employing the present invention method.
  • a query can be a term or a collection of terms.
  • an individual with proper privileges may provide information of interest in the form of a file, document, or any other format that is readable by a computer system (i.e., a query target), whereupon the present invention processes the file or document in the manner described above, and queries the inverted index of representations.
  • the form of the query includes, but is not limited to, one or more terms, terms connected with logical operators, and queries in SQL.
  • the present invention method returns the representation of the target.
  • Other information associated with the target can be extracted from the representation, which other information may include document name, host name, and any other requested meta-data that relate to the query.
  • FIG. 12 is a flow chart that illustrates an exemplary method for performing a search with the inverted index according to one aspect of the invention.
  • Query terms 1201 can be formed from a document of interest 1210 or from selected terms of interest 1202 .
  • a search is performed with the query terms 1201 against the inverted index 1220 .
  • Documents are identified 1230 based on the existence and frequency of the query terms 1201 .
  • Representations of the identified documents are retrieved 1240 , and other information associated with the document representations is extracted 1250 .
  • Information such as time/date stamp, user name, and computer name can be used to identify a user or computer of interest 1260 .
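  • The search flow of FIG. 12 can be sketched as follows, assuming an index that maps each term to per-document frequencies and a separate table of representations holding the associated information. Scoring by summed term frequency is an assumption; the text only states that documents are identified by the existence and frequency of the query terms.

```python
from collections import Counter


def search(index, representations, query_terms):
    """Rank documents by the summed frequency of the query terms they contain.

    index: term -> {doc_id: frequency}
    representations: doc_id -> dict of associated information
    """
    scores = Counter()
    for term in query_terms:
        for doc_id, freq in index.get(term, {}).items():
            scores[doc_id] += freq
    return [(doc_id, score, representations[doc_id])
            for doc_id, score in scores.most_common()]
```

Each result carries the stored representation, so fields such as user name or computer name can be read directly from the hits.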
  • the forensic signature created according to an embodiment includes a digital fingerprint and other information associated with a file.
  • the digital fingerprint may represent contents of the file and is resistant to minor modification of the file.
  • a dictionary is created based on the files included in the digital device.
  • the dictionary may include words that are routinely used by a person or words that are commonly used in the files included in the digital device.
  • a digital fingerprint is created for each file of the seized digital devices according to the dictionary.
  • One or more digital fingerprints may be created for one file using one or more dictionaries.
  • the generated digital fingerprints combine with other information of the file to form a signature.
  • An investigator may use the signature to identify unseen files. These signatures are not sensitive to minor changes of the file and are portable as long as the dictionary is transmitted along with these files.
  • the investigator may be required to investigate audio and video files recovered from seized digital devices. Many times an investigator has to review the entirety of the audio and video files so that an inserted segment is not overlooked. There is also the need to identify the same or similar audio and video files. Again, the traditional hash or fuzzy hash functions cannot fulfill such a need.
  • a forensic signature is created for an audio file and video file.
  • the forensic signature includes digital fingerprints and meta information of the audio file or the video file, and has a structure similar to those described before.
  • the forensic signature effectively identifies the same audio files and the same video files.
  • the digital fingerprint of a video file may be created based on segments of the video file.
  • the video file is parsed to locate changes in scene.
  • the scene change may be detected based on changes in video shot.
  • the video file is segmented according to the scene change.
  • various representative information associated with each segment may be extracted. For example, the length of each scene segment is determined and a fingerprint of a video file may be an encoded series of lengths of all the scene segments.
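  • An encoded series of scene-segment lengths can be sketched as follows, assuming that scene changes have already been detected and are given as frame numbers. The function name and the hyphen-separated encoding are assumptions.

```python
def scene_length_fingerprint(scene_change_frames, total_frames):
    """Encode a video as the series of its scene-segment lengths in frames."""
    boundaries = [0] + sorted(scene_change_frames) + [total_frames]
    lengths = [end - start for start, end in zip(boundaries, boundaries[1:])]
    return "-".join(str(n) for n in lengths)
```

For a 600-frame video with scene changes at frames 120, 300, and 450, the segments are 120, 180, 150, and 150 frames long, giving the fingerprint `120-180-150-150`.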
  • each segment is analyzed using a sliding window.
  • One or more key frames from each scene segment are extracted.
  • a KL transform (or any other component reduction computation) is performed on each segment.
  • subtitles or caption text are extracted to be used in creating the fingerprints.
  • the metadata of a video file includes information about the file such as filename, creator, length of the video, encoding, resolution, etc. Other information that may be added to the metadata, if the video file is seized as forensic evidence, includes date and location of evidence collection, collecting agent, etc.
  • To create a signature for an audio file, whether the audio file includes speech or music is determined. If the audio file is determined to include music, a forensically accurate acoustic fingerprint can be identified and used in accord with techniques known to ordinarily skilled artisans, including using perceptual characteristics including average zero crossing rate, estimated tempo, average spectrum, spectral flatness, prominent tones across a set of bands, and bandwidth. If the audio file includes speech, a conversion from speech to text is performed. The text converted from the audio file is processed to generate a digital fingerprint for the audio file.
  • multimedia files that include both audio files and video files need to be investigated.
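  • Of the perceptual characteristics listed above, the average zero crossing rate is the simplest to illustrate. The sketch below assumes the audio has already been decoded to a sequence of signed sample values; treating zero as non-negative is a convention chosen for this example.

```python
def average_zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(samples) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(samples, samples[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(samples) - 1)
```

A signal that alternates sign on every sample yields 1.0, while a signal that never changes sign yields 0.0.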
  • a DVD file includes multiple audio and video files, which may not have one-to-one correspondence.
  • the DVD file needs to be treated as one file.
  • fingerprints for audio and video are created separately and stored.
  • the signature of a file can be informational rather than a collection of a plurality of computer-generated symbols.
  • informational signatures are created, which may assist in determining what information is included in a file.
  • a user may tokenize an informational object of interest such as an email address, name, phone number, account number, etc. These tokens are then entered into a data structure, such as, for example, a Bloom filter or a similar data structure, including data structures that can probabilistically or statistically determine if a tokenized term has been previously inserted. The result is then stored as an informational fingerprint. The informational fingerprint can then be used in targeted forensic investigations related to the specific information.
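  • A small Bloom filter of the kind mentioned above can be sketched as follows. The bit-array size, the number of hash functions, and the derivation of bit positions from salted SHA-256 digests are assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, token):
        # Derive num_hashes bit positions by salting the token with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{token}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, token):
        for pos in self._positions(token):
            self.bits |= 1 << pos

    def __contains__(self, token):
        return all((self.bits >> pos) & 1 for pos in self._positions(token))
```

A membership test that returns False is definitive, whereas a True result means the token was probably inserted; this is the probabilistic determination the text refers to.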
  • a file may be tokenized to find all instances of Social Security Number (SSN).
  • a fingerprint is created indicating the count of the number of SSNs identified in the file.
  • signatures of all the files can be retrieved and examined to check whether any file contains SSNs and the counts thereof.
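  • Counting SSN-like strings in extracted text can be sketched with a regular expression. The pattern (three digits, two digits, four digits, separated by hyphens) and the shape of the returned fingerprint are assumptions; a production tokenizer would likely accept more formats.

```python
import re

# Assumed format: NNN-NN-NNNN, not embedded in a longer run of digits.
SSN_PATTERN = re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)")


def ssn_count_fingerprint(text):
    """Return an informational fingerprint holding the number of SSNs found."""
    return {"type": "ssn_count", "count": len(SSN_PATTERN.findall(text))}
```

Only the count is stored in the fingerprint, so an investigator can find files that contain SSNs without the fingerprint itself exposing the numbers.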
  • a large number of signatures may be created for a computer system.
  • When a query signature is to be compared with or matched with a large number of signatures stored in the database, a considerable amount of time can be required for such a process, which adds overhead to the computer system.
  • the database first determines the number of fingerprints included in the query signature. For each fingerprint, the database identifies the type of that fingerprint and determines all the matching fingerprints having the same type in the database. The same process is conducted for every fingerprint included in the signature. As a result, a plurality of lists of matched fingerprints is obtained from the database. The plurality of lists is merged to create a merged list. A score is calculated for each matched fingerprint in the merged list based on closeness of the match. A final list is selected to include the signatures that have the highest scores or scores higher than a predetermined threshold. If the matching process is implemented by a single computer system, a substantial amount of computer time is required to complete this process.
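  • The per-type matching, merging, and scoring steps can be sketched as follows. The in-memory dictionary standing in for the database, exact equality as the measure of closeness, and scoring by the fraction of query fingerprints matched are simplifying assumptions.

```python
from collections import defaultdict


def match_signature(query, store, threshold=0.5):
    """Return (signature id, score) pairs at or above the threshold.

    query: {fingerprint_type: value}
    store: {signature_id: {fingerprint_type: value}}
    """
    # One list of matching signature ids per fingerprint type in the query.
    per_type_lists = []
    for fp_type, value in query.items():
        matched = [sid for sid, fps in store.items()
                   if fps.get(fp_type) == value]
        per_type_lists.append(matched)
    # Merge the lists, counting how many fingerprint types each signature hit.
    hits = defaultdict(int)
    for matched in per_type_lists:
        for sid in matched:
            hits[sid] += 1
    ranked = sorted(((n / len(query), sid) for sid, n in hits.items()),
                    reverse=True)
    return [(sid, score) for score, sid in ranked if score >= threshold]
```

Every stored signature is scanned once per fingerprint type, which illustrates why the text notes that a single computer system needs substantial time when the store is large.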
  • solutions to reduce the overhead of the computer system caused by the matching process include using a commercial service for matching or using idle computers within the same organizational network. For example, in an embodiment, fingerprints of each type are compared in parallel on different computers. During the parallel matching process, a computer system is determined to be a master. It is within the scope of this invention, however, to likewise support multiple simultaneous master computing systems. A list of target fingerprints is partitioned by the master into a plurality of sub-lists to be distributed across resources, which may be referred to as slaves. The sub-lists are distributed to the slaves, which conduct the matching process for each sub-list. The slaves also score the closeness of each signature using their own computer system.
  • the slaves return to the master a list of target fingerprints and associated scores.
  • the master merges the returned lists and creates a master list.
  • the master may traverse the returned lists and generate an overall score over all the fingerprint types.
  • the master may start another parallel computation to locate the highest matching signature.
  • the master may create a list that is composed of all the target signatures and the matching scores for each of the fingerprints within and distribute a subset of the list to a slave system for scoring.
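  • The partition, distribute, and merge pattern described above can be sketched on a single machine with a thread pool standing in for the separate computers. The function names, the round-robin partitioning, and the equality-based score are assumptions; the query is assumed to be non-empty.

```python
from concurrent.futures import ThreadPoolExecutor


def score_sublist(query, sublist):
    """Worker role: score each (signature id, fingerprints) pair in a sub-list."""
    return [(sid,
             sum(1 for k, v in query.items() if fps.get(k) == v) / len(query))
            for sid, fps in sublist]


def parallel_match(query, targets, workers=4):
    """Master role: partition the targets, dispatch sub-lists, merge the results."""
    items = list(targets.items())
    sublists = [items[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(lambda sub: score_sublist(query, sub), sublists))
    merged = [pair for part in partial for pair in part]
    return sorted(merged, key=lambda pair: pair[1], reverse=True)
```

In a distributed deployment, the call to `pool.map` would be replaced by whatever mechanism ships sub-lists to remote machines; the partition and merge logic on the master stays the same.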
  • Exemplary execution environments for the parallel matching process may include cloud computing, Hadoop, Message Passing Interface, Parallel Virtual Machine, Linda, and communicating sequential processes.
  • the present invention method utilizing term representation of the content of a target and an inverted index of the representation is suitable for carrying out the other applications of computer forensics and security measures described above.

Abstract

Methods and systems are provided for a proactive approach for computer forensic investigations. The invention allows organizations anticipating the need for forensic analysis to prepare in advance. Forensic signatures are created including a digital fingerprint and other information associated with a file. In one aspect, informational signatures are created, which may assist in determining what information is included in a file. In another aspect, the digital fingerprint may represent contents of the file and is resistant to minor modification of the file. In another aspect, fingerprints can be compared in parallel on different computers.

Description

    RELATED APPLICATION
  • This application is a continuation of application Ser. No. 12/822,722, filed Jun. 24, 2010, which is a continuation-in-part of application Ser. No. 12/118,942, filed on May 12, 2008, which is a continuation-in-part of Ser. No. 11/963,186, filed Dec. 21, 2007, the entirety of each of which is incorporated herein by reference.
  • FIELD OF THE INVENTION
  • This invention relates generally to methods and systems for computer data management and tracking. Specifically, it relates to methods and systems of digital data identification and the creation, storage, management, processing and comparison of content sensitive digital signatures.
  • BACKGROUND OF THE INVENTION
  • Over the last decade, the use of computers and the Internet has grown exponentially. Indeed, for many individuals, government agencies and private corporations it is an integral part of their daily lives and business practices. People can communicate, transfer information, engage in commerce and expand their educational opportunities with little more than a few key strokes and the click of a mouse. Like revolutionary technologies before it, the great advancement of computer systems, information technology and the Internet carries enormous potential both for advancement and for abuse. Unfortunately, criminals exploit these same technologies to commit crimes and harm the safety, security, and privacy of society.
  • Although there are no exact figures on the cost of computer crimes in America, estimates run into the billions of dollars each year. The United States Federal Bureau of Investigation (FBI) has indicated that digital evidence has spread from a few types of investigations, such as hacking and child pornography, to virtually every investigative classification, including fraud, extortion, homicide, identity theft, and so on. Although there are as yet no definitive statistics on the scope of the problem, there is no doubt that the number of crimes involving computers and the Internet is rising dramatically. A survey conducted by the Computer Security Institute in 2007 revealed substantial increases in computer crime. About half (46%) of the companies and government agencies surveyed reported a security incident within the preceding twelve months. The reported total loss of the participants is $66,930,950. The average annual loss for each participant is $350,424 compared to $168,000 for the previous year. And unlike more traditional crimes, computer crime is especially difficult to investigate. Other criminal and terrorist acts and preparations leading to such acts, increasingly involve the use of computer systems and information technologies as well. These criminal and terrorist activities leave behind a trail of digital evidence. Digital evidence varies widely in formats and can include computer files, digital images, sound and videos, e-mail, instant messages, phone records, and so on. They are routinely gathered from seized hard drives, file servers, Internet data, mobile digital devices, digital cameras and numerous other digital sources that are growing steadily in sophistication and capacity.
  • Computer forensics is the practice of acquiring, preserving, analyzing, and reporting on data collected from a computer system, which can include personal computers, server computers, and portable electronic devices such as cellular phones, PDAs and other storage devices. Collecting and analyzing these types of data is usually called digital data identification. The goal of the process is to find evidence that supports or refutes some hypothesis regarding user activity on the system. When accurately and timely identified by a forensic investigator, digital evidence can provide the invaluable proof that helps convict a criminal, or prevents a looming terrorist attack. A delay in identifying suspect data occasionally results in the dismissal of some criminal cases, where the evidence is not produced in time for prosecution.
  • The amount of digital evidence is growing rapidly. Not only has the number of crimes involving digital evidence increased dramatically over time, but the total volume of data that is involved has increased at an even faster pace. This is the result of the increased presence of digital devices at crime scenes combined with a heightened awareness of digital evidence by investigators. Given the declining prices of digital storage media and the corresponding increases in sales of storage devices, the volume of digital information that investigators must deal with is likely to continue its meteoric increase.
  • A typical computer forensic process involves first the determination that the evidence requirements merit a forensic examination. Individuals who are expected to have access to that evidence are then identified. Further, all computer systems used by these individuals which might contain relevant data are located. Forensic images of those systems are taken, and analyzed for relevant evidence. Traditionally, a forensic investigator seizes all storage media, creates a drive image or duplicates it, and then examines the data on the drive image or duplicate copy to preserve the original evidence. A “drive image” is an exact replica of the contents of a storage device, such as a hard disk, stored on a second storage device, such as a network server or another hard disk. One of the first steps in the examination process is to recover latent data such as deleted files, hidden data and fragments from unallocated file space. Digital forensic analysis tools used today are stand-alone systems that are not coordinated with systems used by the forensic investigators and Information Technology (IT) staff. Current computer forensics analysis is largely a manual, labor-intensive process. It requires computer forensic investigators who have specialized training. The cost of the analysis is high. The rate for some computer forensic investigators can be more than $250/hour. It usually requires a long analysis time, taking from days to weeks. Because it is a manual process, there is potential for human error resulting in missed data and missed discovery. In addition, when facing a complex investigation that involves a large number of computer systems, it is difficult to determine what systems to analyze. This may have two undesirable results: expending limited time and resources on useless systems, or missing systems that contain vital information.
  • The tremendous increase in data exacerbates these problems for forensic investigators. The number of pieces of digital media and their increasing size will push budgets, processing capability and physical storage space available to the forensic investigators to their limits. In an effort to reduce the volume of digital files for review, seized digital evidence is processed to reduce the amount of this data. Presently, there is no effective means to quickly sort through the amount of data based on the content of the data, and identify documents and files of interest for further detailed examination. Present solutions still require manual review from forensic investigators to identify specific data needed to prove guilt or innocence.
  • Government and business entities use sophisticated computer systems to store, track and disseminate information within the entity and communicate with outside individuals and entities. Information can be stored as files that exist on a computer file system, and can exist in many heterogeneous forms such as plain text documents, formatted documents (e.g. Microsoft Word® documents, Open Document Format documents), spreadsheets, presentations, Portable Document Format documents, images of paper documents, graphics, sound recordings, videos, faxes, email messages, voice messages, web pages, and other stored digital media. Information can also be stored as entries in databases such as a relational database or a document management system. This information is subject to a wide range of user manipulations, such as create, edit, copy, rename, move, delete and backup. Information can also move among the entity computer systems through various communication means, such as emails, attachments, file sharing, shared file systems and push technology. Information can also leave the entity computer systems either by someone within the entity sending it to an outsider, or by an outsider retrieving it from the entity computer systems by obtaining removable storage media containing the information or through network access protocols such as HTTP, FTP, and peer-to-peer file sharing. All of this creation, manipulation, transfer, and communication of digital information can be part of the legitimate business process. However, abuse of the computer system also involves the same processes of creation, manipulation, transfer, and communication of information, albeit in an unauthorized or illegitimate manner. The Computer Security Institute 2007 survey revealed that insider abuse of network access or email edged out virus incidents as the most prevalent security problem. While a majority of all computer attacks enter via the Internet, the most significant dollar losses stem from internal intrusions.
  • The most important asset of many companies is their intellectual property (IP). Customer lists, customer credit card lists, copyrights including computer code, confidential product designs, proprietary information such as new products in development, and trade secrets are all forms of IP that can be used against the company by its competitors. Common risks for a corporation may be theft of trade secrets and other privileged information, theft of customer or partner information, disclosure of confidential information, and disclosure of trade secrets and other valuable information (designs, formulas etc.).
  • Corporations may also incur liability or exposure to risks when unauthorized contents are stored in the computer systems, such as child pornographic material, or pirated copies of media or software. An organization must know which of its assets require protection and the real and perceived threats against them.
  • Current information security builds layers of firewalls and content security at the network perimeter, and utilizes permissions and identity management to control access by trusted insiders to digital assets, such as business transactions, data warehouses and files. This structure lulls the business managers into a false sense of security. Many employees are restricted in their access to sensitive data, but access control is usually not easily fine tuned to accommodate the ever changing assignments and business needs of all the employees. Moreover, as is necessary to perform their function, Information Technology (IT) employees have access to sensitive data and processes. Indeed, IT employees are the custodians and authors of those objects. This may place them in positions to reveal information to others that will damage the company or directly sabotage a company's operations in various ways. IT employees who are disgruntled, angry, or seeking to steal information for profitable gain, may attempt to steal sensitive digital information which could lead to substantial losses for the organization. A laid-off employee is a prime source of potential leakage of such information.
  • Content-security tools based on HTTP/SMTP proxies are used against viruses and spam. However, these tools weren't designed for intrusion prevention. They don't inspect internal traffic; they scan only authorized e-mail channels. They rely on file-specific content recognition and have scalability and maintenance issues. When content security tools don't fit, they are ineffective. Relying on permissions and identity management is like running a retail store that screens you coming in but doesn't put magnetic tags on the clothes to prevent you from wearing that expensive hat going out.
  • A hash analysis is a method that can be used for comparing the content of digital evidence. A cryptographic one-way hash (or “hash” for short) can be a way to calculate a digital fingerprint: a very large number that often uniquely identifies a digital file. A hash is a calculated function on the bits that make up a file. Therefore, two files with different names but the exact same contents will produce the same hash. However, using hash systems to identify conclusive or known suspect files faces several challenges. By design of the hash function, a small difference, even a single bit, in the input file will generate a significantly different output hash. The difference between two hash numbers does not reflect the level of similarity of the input files. The hash method cannot be used to identify files that have been altered, whether minimally or substantially. They are therefore not able to identify derivative files, files that contain common contents but are arranged or formatted differently or contain more or less other content. For the same reason, hash analysis is not effective against multimedia files (image, video, and sound). As a consequence, an individual using these files to commit crimes may escape hash based detection and prosecution.
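  • By way of illustration only, the sensitivity of a cryptographic hash to small changes can be shown with a short Python sketch. SHA-256 from the standard `hashlib` module is used here as an assumed example of a hash function; the sample contents and the helper names `sha256_hex` and `differing_bits` are hypothetical and are not part of the described method.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def differing_bits(a_hex: str, b_hex: str) -> int:
    """Count the bit positions at which two equal-length hex digests differ."""
    return bin(int(a_hex, 16) ^ int(b_hex, 16)).count("1")


# Hypothetical file contents: an original, a byte-identical copy stored
# under another name, and a version with two characters changed.
original = b"Quarterly customer list: Alice, Bob, Carol"
renamed_copy = bytes(original)
edited = b"Quarterly customer list: Alice, Bob, Carla"

h_original = sha256_hex(original)
h_copy = sha256_hex(renamed_copy)
h_edited = sha256_hex(edited)

# Identical contents hash identically regardless of file name.
print("copy matches:", h_original == h_copy)
# A small edit yields an unrelated digest, so an exact-match lookup misses it.
print("edit matches:", h_original == h_edited)
print("bits that differ:", differing_bits(h_original, h_edited), "of 256")
```

  • The number of differing bits says nothing about how similar the two inputs are, which is why an exact-match hash lookup cannot flag the edited file as a near duplicate of the original.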
  • It would be beneficial and desirable to integrate newer, advanced technologies to automate the detection and classification process for suspect files and identify related altered or derivative files. This would allow forensic investigators to focus on identifying relevant data during the forensic process and address many of the problems of efficiency, cost and delay facing digital forensic examinations today. There is also a need for a technology to scan and manage digital data on a computer system based on the content of the data. There is a further need for a solution to allow government agencies and corporations to automatically monitor and prevent unauthorized use or exchange of classified or proprietary data.
  • SUMMARY OF THE INVENTION
  • The present invention is a method, system, and computer readable media for proactively generating, preserving and comparing computer forensic evidence for a computer system. The method involves generating at least one signature for at least one target based on the content of the target. The at least one signature can be generated at any time, or when a predetermined operation is commenced. The at least one generated signature can be stored, or not, prior to or after forensic use. The generated signature(s) are compared with one or more previously generated signature(s) to determine whether any compared signatures have similarities above a predetermined threshold. Alternatively, the present invention could, at any time, simply compare previously existing signatures generated from a target.
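  • The comparison of a generated signature against previously generated signatures can be sketched as follows. The text above does not name a similarity measure, so this sketch assumes each signature carries a set of fingerprint values and uses Jaccard similarity; the function names, the sample data and the threshold of 0.5 are all hypothetical.

```python
def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity: shared elements divided by all distinct elements."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similar_signatures(new_sig: frozenset, stored: dict, threshold: float = 0.5):
    """Return (name, score) pairs for stored signatures at or above threshold."""
    matches = []
    for name, fingerprints in stored.items():
        score = jaccard(new_sig, fingerprints)
        if score >= threshold:
            matches.append((name, score))
    # Highest similarity first; ties broken by name for a stable order.
    return sorted(matches, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical previously generated signatures, keyed by target name.
stored_signatures = {
    "design_v1.doc": frozenset({"a", "b", "c", "e"}),
    "lunch_menu.txt": frozenset({"z"}),
}
new_signature = frozenset({"a", "b", "c", "d"})

matches = similar_signatures(new_signature, stored_signatures)
print(matches)
```

  • In this example the new signature shares three of five distinct fingerprint values with `design_v1.doc`, giving a score of 0.6, which exceeds the assumed threshold, while `lunch_menu.txt` shares none and is not reported.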
  • The target can be any file, any file that is owned by a user, any operating system file, any file that is part of a proprietary information system, or any file that is related to a network intrusion attack. When the target is any type of file, the predetermined operation can be any one or more of creating, deleting, renaming, editing, moving, updating, linking, merging, modifying and copying the file. The target could also be a database entry; and when a database entry, the predetermined operation can be any one or more of selecting, inserting, updating, deleting, merging, beginning work, committing, rollback, creating, dropping, truncating, and altering of the database entry. The target can further be a database definition. When the target is a database definition, the predetermined operation can be any one or more of creating, dropping and altering the database definition.
  • The target can also be network traffic; and when network traffic, the predetermined operation can be the occurrence of network traffic entering or leaving a network, the initiation of network traffic from a computer system, or the receipt of network traffic by a computer system. The network traffic may be any one or more of a signal protocol, an email, an attachment of an email, an instant message conversation, a text message, a remote login, a virtual private network, a viewed webpage, a file transfer and file sharing.
  • Generating the at least one signature can involve extracting a set of tokens from the at least one target, processing the set of tokens, generating a fingerprint from the set of tokens, and generating the signature for the target by combining the fingerprints with other related information of the target. Processing the set of tokens can include sorting the set of tokens, and may further include filtering the set of tokens. The method for generating the fingerprints may involve a hash method, or an implementation of a bit vector method.
  • Other related information of the target can be accessible by an operating system, and can be any one or more of file name, date of record, time of record, user or owner information, network address, network protocol, access history and fingerprint history. Other related information of the target could also be information accessible by an application.
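  • The signature generation steps described above (extracting tokens, filtering and sorting them, generating a fingerprint, and combining the fingerprint with other related information) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: word-level tokens, a small hypothetical stop-word list as the filter, removal of duplicate tokens, and SHA-256 as the hash method. None of these choices, nor the names used, is taken from the description.

```python
import hashlib
import re
from dataclasses import dataclass

# Hypothetical filter list; the description does not specify what is filtered.
STOP_WORDS = frozenset({"a", "an", "and", "of", "the", "to"})


@dataclass
class Signature:
    fingerprint: str   # digest derived from the processed tokens
    metadata: dict     # other related information, e.g. file name or owner


def extract_tokens(text: str) -> list:
    """Split text into lowercase alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def process_tokens(tokens: list) -> list:
    """Filter out stop words, drop duplicates (an assumption), and sort."""
    return sorted({t for t in tokens if t not in STOP_WORDS})


def make_fingerprint(tokens: list) -> str:
    """Hash the processed token sequence into a fixed-size fingerprint."""
    return hashlib.sha256(" ".join(tokens).encode("utf-8")).hexdigest()


def generate_signature(text: str, **metadata) -> Signature:
    """Combine the content fingerprint with caller-supplied metadata."""
    tokens = process_tokens(extract_tokens(text))
    return Signature(make_fingerprint(tokens), dict(metadata))


sig_a = generate_signature("The report of the audit", file_name="a.txt")
sig_b = generate_signature("Audit report", file_name="b.txt")
sig_c = generate_signature("Audit report draft", file_name="c.txt")

# Reordered or stop-word-padded content yields the same fingerprint.
print(sig_a.fingerprint == sig_b.fingerprint)
# Additional content changes the fingerprint.
print(sig_a.fingerprint == sig_c.fingerprint)
```

  • Because the tokens are filtered and sorted before hashing, the fingerprint in this sketch is insensitive to word order and to the filtered words, unlike a hash taken over the raw bytes of the file.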
  • The generated signature(s) could be stored in a manner preventing deletion or modification by a user, other than a user with special access rights, such as authorized personnel or a forensic investigator. The signature(s) could further be made available only to authorized personnel or a forensic investigator with access rights. The signature(s) and respective targets can be stored on the same computer system, different computer systems, and/or on a shared file system. Finally, the signature(s) can be stored on write-once, read-many media.
  • In another aspect of the present invention, a computer readable medium is provided that configures a computer system to perform the methods described above of proactively generating, preserving and comparing computer forensic evidence for a computer system. In summary, the computer readable medium facilitates the method of generating at least one signature for at least one target based on the content of the target; and comparing the at least one generated signature with at least one previously generated signature to determine whether the signatures have similarities above a predetermined threshold.
  • In a further aspect, the present invention also provides an apparatus for the generation, preservation and comparison of computer forensic evidence. The apparatus/system can include a processor arranged to generate at least one signature for at least one target based on the content of the target, and a comparator configured to compare the at least one generated signature with at least one previously generated signature to determine whether the signatures have similarities above a predetermined threshold. The system can additionally include an extension module configured to trigger signature generation upon occurrence of a certain action, and a mechanism for storing the generated signatures. The implemented system may have an operating system service (e.g., a Windows service or Unix/Linux daemon) running in the background to generate a signature for a given file and to store it, and then to query the stored signatures to determine similarity with other signatures.
  • In another aspect of the present invention, a computerized method of proactively generating and querying computer forensic evidence for a computer system is provided. The method comprises the steps of generating a representation of content of at least one target within a set of targets, and generating an inverted index of the set of targets, wherein the inverted index is associated with representations of the content of each target of the set of targets.
  • In one aspect of the present invention, the set of targets comprises one or more files. According to some embodiments of the present invention, the inverted index is updated upon occurrence of a predetermined operation, and the predetermined operation is one or more of creating, deleting, renaming, editing, moving, updating, linking, merging, modifying and copying a file.
  • In another aspect of the present invention, the set of targets comprises one or more database entries. According to some embodiments of the present invention, the inverted index is updated upon occurrence of a predetermined operation, and the predetermined operation is one or more of select, insert, update, delete, merge, begin work, commit, rollback, create, drop, truncate, and alter of a database entry.
  • According to one aspect of the present invention, generating the representation of the content of at least one target comprises the steps of extracting a set of terms from the target, and processing the set of terms.
  • According to another aspect of the present invention, generating the representation of the content of at least one target further comprises the steps of extracting other related information of the target, and incorporating the other related information with the extracted and processed terms.
  • In some embodiments of the present invention, the other related information of the target is accessible by an operating system, and is at least one of file name, date of record, time of record, user or owner information, network address, network protocol, and access history of the target.
  • In some other embodiments of the present invention, the other related information of the target is accessible by an application.
  • According to one aspect of the present invention, generating the inverted index of the set of targets comprises the steps of extracting a set of terms from the at least one target, processing the set of terms, indexing the set of terms to create the inverted index and associating the set of terms with representations of the content of each of the one or more targets.
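  • The generation of an inverted index from a set of targets can be sketched as follows. This is an illustrative sketch only: terms are assumed to be lowercase alphanumeric words, each target is identified by a hypothetical string identifier, and the function names are not taken from the description.

```python
import re
from collections import defaultdict


def extract_terms(text: str) -> list:
    """Extract and process terms: lowercase, de-duplicate and sort."""
    return sorted(set(re.findall(r"[a-z0-9]+", text.lower())))


def build_inverted_index(targets: dict) -> dict:
    """Map each term to the set of target identifiers whose content has it."""
    index = defaultdict(set)
    for target_id, text in targets.items():
        for term in extract_terms(text):
            index[term].add(target_id)
    return dict(index)


# Hypothetical set of targets keyed by identifier.
targets = {
    "doc1": "Trade secret formula",
    "doc2": "Public formula sheet",
}
index = build_inverted_index(targets)

for term in sorted(index):
    print(term, sorted(index[term]))
```

  • Each entry associates a term with every target in which it occurs, so a later search for a term does not need to rescan the targets themselves.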
  • In some embodiments of the present invention, the representation of the content of the at least one target is stored permanently and is not removed when the target is modified or removed.
  • In some other embodiments of the present invention, the inverted index retains association with the representation of the content of the at least one target when the target is modified or removed.
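  • One way to picture an index that retains its associations after a target is modified or removed is sketched below. The versioned postings, the `RetainingIndex` class and its method names are assumptions made for illustration; the description does not prescribe this structure.

```python
from collections import defaultdict


class RetainingIndex:
    """Inverted index that never discards postings for earlier content."""

    def __init__(self):
        self.postings = defaultdict(set)   # term -> {(target_id, version)}
        self.versions = defaultdict(int)   # target_id -> versions recorded
        self.removed = set()               # targets no longer on the system

    def record(self, target_id: str, terms: set) -> int:
        """Index a new version of a target; older versions stay searchable."""
        self.versions[target_id] += 1
        version = self.versions[target_id]
        for term in terms:
            self.postings[term].add((target_id, version))
        self.removed.discard(target_id)
        return version

    def mark_removed(self, target_id: str) -> None:
        """Note that the target was deleted without touching its postings."""
        self.removed.add(target_id)

    def lookup(self, term: str) -> list:
        """Return every (target_id, version) that ever contained the term."""
        return sorted(self.postings.get(term, ()))


idx = RetainingIndex()
idx.record("memo.txt", {"merger", "draft"})   # original content
idx.record("memo.txt", {"draft", "final"})    # content after an edit
idx.mark_removed("memo.txt")                  # file later deleted

print(idx.lookup("merger"))
print(idx.lookup("draft"))
```

  • In this sketch the term "merger" remains associated with the first version of `memo.txt` even after the file is edited and then deleted, which is the property a latent search relies on.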
  • In yet another aspect of the present invention, the method further comprises the step of storing the generated inverted index in a manner preventing deletion or modification of the inverted index by a user other than authorized personnel or a forensic investigator.
  • According to one aspect of the present invention, the generated inverted index is available only to authorized personnel or a forensic investigator with access rights.
  • According to another aspect of the present invention, the generated inverted index and the set of targets are stored on the same computer system.
  • According to yet another aspect of the present invention, the generated inverted index is stored on a first computer system and the set of targets is stored on a second computer system accessible to the first computer system through a computer network.
  • In yet another aspect of the present invention, the method further comprises the step of querying the inverted index.
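  • Querying an inverted index can be sketched as follows. The index is assumed to be a mapping from term to a set of target identifiers, and results are ranked by the number of query terms each target contains; this ranking rule and the names used are hypothetical.

```python
from collections import Counter


def query_index(index: dict, terms: list) -> list:
    """Return (target_id, hit_count) pairs, best match first."""
    hits = Counter()
    for term in terms:
        for target_id in index.get(term.lower(), ()):
            hits[target_id] += 1
    # Sort by descending hit count, then by identifier for a stable order.
    return sorted(hits.items(), key=lambda pair: (-pair[1], pair[0]))


# Hypothetical index of the kind produced when targets are indexed.
sample_index = {
    "formula": {"doc1", "doc2"},
    "secret": {"doc1"},
    "public": {"doc2"},
}

results = query_index(sample_index, ["Secret", "formula"])
print(results)
```

  • Here `doc1` contains both query terms and `doc2` contains one, so `doc1` is listed first.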
  • In one aspect of the present invention, a computer-readable medium that configures a computer system to perform a method of proactively generating and comparing computer forensic evidence for a computer system is provided. The method comprises the steps of generating a representation of content of at least one target within a set of targets, and generating an inverted index of the set of targets, wherein the inverted index is associated with representations of the content of each target of the set of targets.
  • In another aspect of the present invention, an apparatus for proactively generating and comparing computer forensic evidence is provided. The apparatus comprises a processor arranged to generate a representation of content of at least one target within a set of targets, and a processor arranged to generate an inverted index of the set of targets, wherein the inverted index is associated with representations of the content of each target of the set of targets.
  • In another aspect of the present invention, a method of generating a signature of a video is provided. The method comprises the steps of detecting scene changes of the video, segmenting the video into a plurality of segments corresponding to each scene change, extracting a representation from each segment, forming a digital fingerprint based on the representation, and creating a signature by combining the digital fingerprint with predetermined metadata of the video.
  • According to another aspect of the present invention, the representation includes one or more frames of each segment.
  • According to another aspect of the invention, the representation includes length information of each segment.
  • According to another aspect of the invention, the representation includes a subtitle and caption text of each segment.
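  • The video signature steps (detecting scene changes, segmenting, extracting a representation from each segment, forming a fingerprint and adding metadata) can be sketched on toy data as follows. Frames are assumed to be tuples of grayscale pixel values from 0 to 255, a scene change is assumed to be a mean absolute pixel difference above a hypothetical threshold, and the representation is assumed to be the first frame and the length of each segment. A practical system would use a robust frame descriptor rather than an exact hash of the key frame.

```python
import hashlib


def mean_abs_diff(frame_a: tuple, frame_b: tuple) -> float:
    """Average absolute per-pixel difference between two frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def segment_by_scene(frames: list, threshold: float = 50.0) -> list:
    """Start a new segment whenever consecutive frames differ sharply."""
    if not frames:
        return []
    segments = [[frames[0]]]
    for previous, current in zip(frames, frames[1:]):
        if mean_abs_diff(previous, current) > threshold:
            segments.append([])
        segments[-1].append(current)
    return segments


def video_signature(frames: list, metadata: dict, threshold: float = 50.0) -> dict:
    """Fingerprint each segment by its first frame and length, add metadata."""
    fingerprint = [
        (hashlib.sha256(bytes(segment[0])).hexdigest(), len(segment))
        for segment in segment_by_scene(frames, threshold)
    ]
    return {"fingerprint": fingerprint, "metadata": dict(metadata)}


# Four tiny 4-pixel frames: two dark frames followed by two bright frames.
frames = [
    (10, 10, 10, 10),
    (12, 11, 10, 10),
    (200, 210, 205, 199),
    (201, 210, 205, 200),
]
signature = video_signature(frames, {"title": "example clip"})

print([length for _, length in signature["fingerprint"]])
```

  • The jump from dark to bright frames is treated as one scene change, so the toy clip is split into two segments of two frames each.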
  • In another aspect of the present invention, a method of generating a signature for an audio file is provided. The method comprises the steps of determining whether the audio includes music or speech, generating, if the audio includes speech, a transcript of the speech, extracting a representation from the transcript, forming a digital fingerprint based on the representation, and creating a signature by combining the digital fingerprint with predetermined metadata of the audio.
  • In another aspect of the present invention, a method of generating a signature of a digital file is provided. The method comprises the steps of selecting a token indicating an informational object, generating a digital fingerprint of the digital file using the selected token, and creating the signature that includes the digital fingerprint and metadata of the digital file.
  • According to another aspect of the present invention, the method further comprises the steps of extracting a plurality of tokens from the digital file according to the selected token, and inserting the extracted plurality of tokens into a data structure that probabilistically determines whether a token has been inserted before.
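  • A Bloom filter is one well-known data structure that probabilistically determines whether an item has been inserted before; the description does not name the structure, so its use here is an assumption. The sketch below stores the bit vector in a Python integer, and the sizes, hash construction and class name are hypothetical.

```python
import hashlib


class BloomFilter:
    """Set-membership sketch: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # Python int used as a bit vector

    def _positions(self, token: str):
        """Derive num_hashes bit positions by salting SHA-256 with an index."""
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{token}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, token: str) -> None:
        """Set every bit position derived from the token."""
        for position in self._positions(token):
            self.bits |= 1 << position

    def __contains__(self, token: str) -> bool:
        """True if all derived bits are set (token was probably inserted)."""
        return all((self.bits >> position) & 1 for position in self._positions(token))


tokens_seen = BloomFilter()
tokens_seen.add("alice@example.com")

# An inserted token is always reported as present.
print("alice@example.com" in tokens_seen)
# An empty filter reports nothing as present.
print("alice@example.com" in BloomFilter())
```

  • A token that was never inserted is usually reported as absent, but may occasionally be reported as present; the rate of such false positives depends on the number of bits, the number of hash functions and the number of inserted tokens.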
  • According to another aspect of the present invention, the selected token includes a token selected from the group consisting essentially of: an email address, a name, an account number, and a social security number.
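  • Extracting tokens of a selected type from the text of a digital file can be sketched with regular expressions. The two patterns below (a simplified email address pattern and a pattern for numbers written in the 3-2-4 digit layout commonly used for U.S. social security numbers) are illustrative assumptions and will not match every valid form; the names `TOKEN_PATTERNS` and `extract_selected_tokens` are hypothetical.

```python
import re

# Hypothetical, deliberately simple patterns keyed by token type.
TOKEN_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def extract_selected_tokens(text: str, token_type: str) -> list:
    """Return the distinct tokens of the selected type, sorted."""
    return sorted(set(TOKEN_PATTERNS[token_type].findall(text)))


sample_text = (
    "Contact bob@example.com or alice@example.org; "
    "the number on file is 123-45-6789."
)

emails = extract_selected_tokens(sample_text, "email")
numbers = extract_selected_tokens(sample_text, "ssn")
print(emails)
print(numbers)
```

  • The extracted tokens, rather than the full text, would then feed the fingerprint, so that two files sharing the same addresses or account numbers can be related even when the surrounding text differs.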
  • In another aspect of the present invention, a method of matching a forensic signature of a digital file is provided, comprising the steps of receiving, by a master apparatus, a list of query signatures, generating, by the master apparatus, a plurality of sub-lists of query signatures, transmitting, by the master apparatus, a sub-list to a slave apparatus, matching, by the slave apparatus, the query signatures with signatures in a database according to digital fingerprints included in the query signatures and generating a plurality of matching lists, one for each digital fingerprint, merging, by the slave apparatus, the plurality of matching lists and generating a final list, calculating, by the slave apparatus, a score for each matched signature in the final list according to a closeness between the matched signature and the query signature, and transmitting, by the slave apparatus, the final list and the score to the master apparatus.
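  • The division of matching work between a master apparatus and slave apparatuses can be sketched in a single process as follows. Each signature is assumed to hold a set of digital fingerprints, the database is assumed to be an in-memory mapping, and closeness is assumed to be Jaccard similarity; in a deployed system the slave step would run on separate machines. All names and data are hypothetical.

```python
def split_into_sublists(queries: list, num_slaves: int) -> list:
    """Master step: deal the query signatures out round-robin."""
    return [queries[i::num_slaves] for i in range(num_slaves)]


def slave_match(sublist: list, database: dict) -> dict:
    """Slave step: match, merge and score each query signature."""
    results = {}
    for query_id, query_fps in sublist:
        # One matching list per digital fingerprint in the query signature.
        matching_lists = [
            [sig_id for sig_id, fps in database.items() if fp in fps]
            for fp in sorted(query_fps)
        ]
        # Merge the matching lists into a final list without duplicates.
        final_list = sorted({sig_id for lst in matching_lists for sig_id in lst})
        # Score each matched signature by its closeness to the query.
        results[query_id] = [
            (sig_id,
             len(query_fps & database[sig_id]) / len(query_fps | database[sig_id]))
            for sig_id in final_list
        ]
    return results


def master_match(queries: list, database: dict, num_slaves: int = 2) -> dict:
    """Master step: distribute sub-lists and collect the scored final lists."""
    collected = {}
    for sublist in split_into_sublists(queries, num_slaves):
        collected.update(slave_match(sublist, database))
    return collected


database = {
    "s1": {"f1", "f2", "f3"},
    "s2": {"f3", "f4"},
    "s3": {"f9"},
}
queries = [("q1", {"f1", "f2"}), ("q2", {"f4"}), ("q3", {"f0"})]

matched = master_match(queries, database)
for query_id in sorted(matched):
    print(query_id, matched[query_id])
```

  • Query `q1` shares two of three distinct fingerprints with `s1`, `q2` shares one of two with `s2`, and `q3` matches nothing, so its final list is empty.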
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is a schematic diagram of an exemplary computing environment;
  • FIG. 2 is a schematic diagram of an exemplary network environment;
  • FIG. 3 is a flow chart illustrating an exemplary method for generating a signature for a document;
  • FIG. 4 is a flow chart illustrating document modification and new fingerprint generation pursuant to one embodiment of the present invention;
  • FIG. 5 is a flow chart illustrating an exemplary method to perform a latent signature search;
  • FIG. 6 is a flow chart illustrating an exemplary method for user misuse detection;
  • FIG. 7 is a flow chart illustrating another exemplary method for user misuse detection through the use of user signature profiles;
  • FIG. 8 is a flow chart illustrating an exemplary method for the detection of an unauthorized network communication of sensitive information;
  • FIG. 9 is a schematic block diagram illustrating an exemplary embodiment of a system of the present invention, showing event trigger, fingerprint/signature generation, signature query and comparison, and signature storage;
  • FIG. 10 is a flow chart illustrating an exemplary method of generating an inverted index for a set of documents according to the present invention;
  • FIG. 11 is a flow chart illustrating an exemplary method for updating an inverted index in response to document addition and modification according to one aspect of the present invention; and
  • FIG. 12 is a flow chart illustrating an exemplary method for performing a latent search using an inverted index.
  • DETAILED DESCRIPTION
  • Example Computing Environment
  • FIG. 1 and the following discussion are intended to provide a brief general description of a suitable computing environment in which an example embodiment of the invention may be implemented. It should be understood, however, that handheld, portable, and other computing devices of all kinds are contemplated for use in connection with the present invention. While a general purpose computer is described below, this is but one example. The present invention also may be operable on a thin client having network server interoperability and interaction. Thus, an example embodiment of the invention may be implemented in an environment of networked hosted services in which very little or minimal client resources are implicated, e.g., a networked environment in which the client device serves merely as a browser or interface to the World Wide Web.
  • Although not required, the invention can be implemented via an application programming interface (API), for use by a developer or tester, and/or included within the network browsing software which will be described in the general context of computer-executable instructions, such as program modules, being executed by one or more computers (e.g., client workstations, servers, or other devices). Generally, program modules include routines, programs, objects, components, data structures and the like that perform particular tasks or implement particular abstract data types. Typically, the functionality of the program modules may be combined or distributed as desired in various embodiments. Moreover, those skilled in the art will appreciate that the invention may be practiced with other computer system configurations. Other well known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers (PCs), server computers, hand-held or laptop devices, multi-processor systems, microprocessor-based systems, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like. An embodiment of the invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network or other data transmission medium. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
  • FIG. 1 thus illustrates an example of a suitable computing system environment 100 in which the invention may be implemented, although as made clear above, the computing system environment 100 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Neither should the computing environment 100 be interpreted as having any dependency or requirement relating to any one or a combination of components illustrated in the exemplary operating environment 100.
  • With reference to FIG. 1, an example system for implementing the invention includes a general purpose computing device in the form of a computer 110. Components of the computer 110 may include, but are not limited to, a processing unit 120, a system memory 130, and a system bus 121 that couples various system components including the system memory to the processing unit 120. The system bus 121 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, Peripheral Component Interconnect (PCI) bus (also known as Mezzanine bus), and PCI-Express bus.
  • The computer 110 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by the computer 110 and includes both volatile and nonvolatile, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, random access memory (RAM), read-only memory (ROM), Electrically-Erasable Programmable Read-Only Memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CDROM), digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer 110. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency (RF), infrared, and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
  • The system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as ROM 131 and RAM 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by the processing unit 120. By way of example, and not limitation, FIG. 1 illustrates operating system 134, application programs 135, other program modules 136, and program data 137. RAM 132 may contain other data and/or program modules.
  • The computer 110 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only, FIG. 1 illustrates a hard disk drive 141 that reads from or writes to non-removable, nonvolatile magnetic media, a magnetic disk drive 151 that reads from or writes to a removable, nonvolatile magnetic disk 152, and an optical disk drive 155 that reads from or writes to a removable, nonvolatile optical disk 156, such as a CD ROM or other optical media. Other removable/non-removable, volatile/nonvolatile computer storage media that can be used in the example operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM, and the like. The hard disk drive 141 is typically connected to the system bus 121 through a non-removable memory interface such as interface 140, and magnetic disk drive 151 and optical disk drive 155 are typically connected to the system bus 121 by a removable memory interface, such as interface 150.
  • The drives and their associated computer storage media discussed above and illustrated in FIG. 1 provide storage of computer readable instructions, data structures, program modules and other data for the computer 110. In FIG. 1, for example, the hard disk drive 141 is illustrated as storing operating system 144, application programs 145, other program modules 146, and program data 147. Note that these components can either be the same as or different from operating system 134, application programs 135, other program modules 136, and program data 137. Operating system 144, application programs 145, other program modules 146, and program data 147 are given different numbers here to illustrate that, at a minimum, they are different copies. A user may enter commands and information into the computer 110 through input devices such as a keyboard 162 and pointing device 161, commonly referred to as a mouse, trackball or touch pad. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus 121, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB).
  • A monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190. In addition to monitor 191, computers may also include other peripheral output devices such as speakers and a printer (not shown), which may be connected through an output peripheral interface 195.
  • The computer 110 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110, although only a memory storage device 181 has been illustrated in FIG. 1. The logical connections depicted in FIG. 1 include a local area network (LAN) 171 and a wide area network (WAN) 173, but may also include other networks. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet.
  • When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes means for establishing communications over the WAN 173, such as the Internet. In a networked environment, program modules depicted relative to the computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation, FIG. 1 illustrates remote application programs 185 as residing on a memory device 181. Remote application programs 185 include, but are not limited to, web server applications such as Microsoft Internet Information Services (IIS)® and Apache HTTP Server, which provide content that resides on the remote storage device 181 or other accessible storage device to the World Wide Web. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.
  • One of ordinary skill in the art can appreciate that a computer 110 or other client devices can be deployed as part of a computer network. In this regard, the present invention pertains to any computer system having any number of memory or storage units, and any number of applications and processes occurring across any number of storage units or volumes. An embodiment of the present invention may apply to an environment with server computers and client computers deployed in a network environment, having remote or local storage. The present invention may also apply to a standalone computing device, having programming language functionality, interpretation and execution capabilities.
  • Example Network Environment
  • FIG. 2 illustrates an embodiment of a network environment in which an embodiment of the present invention can be implemented. The network environment 200 contains a number of local server systems 210, which may include a number of file servers 211, web servers 212, and application servers 213 that are owned and managed by the owner of the local network. These servers are in communication with local user systems 220 which may include a large variety of systems such as workstations 221, desktop computers 222, laptop computers 223, and thin clients or terminals 224. The local user systems 220 may contain their own persistent storage devices such as in the case of workstations 221, desktop computers 222, and laptop computers 223. They can also have access to the persistent storage provided by the local servers 210. In the case of thin clients and terminals 224, network storage may be the only available persistent storage. The local user systems are usually connected to a variety of peripherals 260 that handle data input and output, such as scanners, printers and optical drives. There may also be a number of different kinds of removable media 250 that attach to the user systems 220 at times. These removable media 250 can be based on magnetic recording, such as floppy disks and portable hard drives, or be based on optical recording, such as compact disks or digital video disks. Further, removable media can also be based on non-volatile memory such as flash memory, which can be a USB flash drive or any form of flash memory card. The users within the local network usually get access to the wider area network such as the Internet 280 through the local server systems 210 and typically some network security measures such as a firewall 270. There might also be a number of remote systems 290 that can be in communication with the local server systems 210 and also the local user systems 220. The remote computer systems can be a variety of remote terminals 291, remote laptops 292, remote desktops 293, and remote web servers 294.
  • FIG. 2 illustrates an exemplary network environment. Those of ordinary skill in the art will appreciate that the teachings of the present invention can be used with any number of network environments and network configurations.
  • The Present Invention
  • The present invention teaches methods and systems to improve computer forensics with search and machine learning. This invention allows organizations that anticipate the need for forensic analysis to prepare in advance by keeping small amounts of information about any content on computer systems, such as files, database entries or schema, or network traffic, as the content is created, deleted, modified, copied or transmitted or received. Computational and storage costs are expended in advance, which allows faster, better and less expensive computer forensics investigations.
  • The present invention provides a novel proactive approach for computer forensic investigations. For any type of content that is created, deleted, modified, copied, transmitted or received, a small amount of information about the content, called a signature, is created and stored away. A signature contains one or more fingerprints and other information associated with the target. A fingerprint is a relatively small number of bits, as compared to the size of the file, that is computed based on the content of a target. The target can be any file, any file that is owned by a user, any operating system file, any file that is part of a proprietary information system, any file that is related to a network intrusion attack, any database entry or definition, or network traffic. For a text file, for example, a signature contains one or more fingerprints computed based on the content of the file along with other information associated with the file, such as the file name, date and time of record, user/owner information, and fingerprint history. For a database entry or definition, the signature contains one or more fingerprints that are calculated based on the content of the database entry or definition along with other information associated with the database entry or definition. For network traffic, the signature contains one or more fingerprints that are calculated based on the content of the network traffic along with other information associated with the network traffic, such as time and date information, sender and recipient network addresses, and network protocol.
  • The fingerprints of the present invention are digital digests of the content of a target. In the hash method, all bits that make up a file are considered as the content of a file. In the present invention, however, the content of a target is defined and represented by selections of tokens that are logically selected from the target. As an example, the content of a target that contains textual information can be defined by a selection of words and phrases within the target. For targets that lack a semantic meaning, idiosyncratic characteristics of the target can be identified and used to represent the contents. Fingerprints are small, taking up a small amount of storage space, when compared to the original content of the target. Fingerprints are also easy to compute, and can identify a file, a database entry or definition, or network traffic by its content as defined by the list of selected tokens. Fingerprints can accommodate small modifications of the file (e.g., small edits or reformatting of a file may not alter its fingerprint). The fingerprints of a minimally edited version of a file mostly or fully match the fingerprints of the original file.
  • The creation of a signature usually comprises four steps. First, a set of tokens of interest are extracted from a target, such as a file, database entry or definition, or network traffic. Second, the token set undergoes a predetermined sequence of processing, such as sorting and filtering. Third, a fingerprint is then generated for each retained token set. Lastly, the fingerprint is combined with other information associated with the target file, database entry or definition, or network traffic to generate a signature.
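  • By way of illustration only, the four steps above can be sketched in Python as follows. The function names, the regular expression used for token extraction, the choice of SHA-1, and the fields of the resulting record are all assumptions made for this sketch and are not prescribed by the invention.

```python
import hashlib
import re
from datetime import datetime, timezone

def extract_tokens(text):
    # Step 1: extract tokens of interest (here: lower-cased alphanumeric words).
    return re.findall(r"[a-z0-9]+", text.lower())

def process_tokens(tokens):
    # Step 2: sort and de-duplicate; no filtering is applied in this sketch.
    return sorted(set(tokens))

def make_fingerprint(tokens):
    # Step 3: hash the retained token list (SHA-1 is used only as an example).
    return hashlib.sha1("\n".join(tokens).encode("utf-8")).hexdigest()

def make_signature(text, file_name, owner):
    # Step 4: combine the fingerprint with other information about the target.
    tokens = process_tokens(extract_tokens(text))
    return {
        "fingerprints": [make_fingerprint(tokens)],
        "file_name": file_name,
        "owner": owner,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
```

  • Because the fingerprint in this sketch depends only on the sorted set of tokens, two files that differ in case, spacing, or punctuation yield the same fingerprint while their other information differs.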
  • Using a document that contains text information as an example, the first step involves parsing the document, extracting text information and retaining tokens of interest. Tokens of interest may include, but are not limited to, all words, phrases, selective parts of speech, e.g., nouns (names, places, etc.), words longer than a fixed number of characters, words not found in a dictionary, words found within a certain set of predefined lists of words, words of a “foreign nature”, words based on inverse document frequencies (histograms), in other words, words based on collection statistics, and acronyms.
  • Processing the token set may involve sorting the token set, and may further include filtering the token set. Sorting the token set can be based on, but not limited to, Unicode (alphabetical) ordering, biased weighting on inverse document frequency, and phrase or word length. Filtering the token set and retaining a subset of the tokens can be based on, but not limited to, rules such as selecting the top X % of the tokens (i.e., X>=T1); or middle tokens (i.e., T2>=X>=T1); or bottom tokens (i.e., X<=T2); or selective sets of tokens (i.e., every t tokens, e.g., third, seventh, etc.); or no filtration at all, namely retaining all tokens. The retained tokens may be sorted again as previously described. However, sorting is unnecessary if one wishes to retain the same sorting conditions as used previously.
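  • A minimal Python sketch of such sorting and filtering rules is given below. The helper names are hypothetical, and the sketch assumes that "top" refers to the leading positions of the sorted list; other rankings, such as inverse document frequency, would require collection statistics not shown here.

```python
def sort_tokens(tokens):
    # Python compares str values by Unicode code point, giving a Unicode ordering.
    return sorted(set(tokens))

def top_fraction(tokens, fraction):
    # Keep the leading `fraction` of the sorted list (at least one token if any exist).
    keep = max(1, int(len(tokens) * fraction))
    return tokens[:keep]

def every_nth(tokens, n):
    # Keep every n-th token, counting from 1 (the n-th, 2n-th, and so on).
    return tokens[n - 1::n]

def min_length(tokens, length):
    # Keep only tokens with at least `length` characters.
    return [t for t in tokens if len(t) >= length]
```

  • Each filter preserves the order of its input, so the retained tokens need not be sorted again before a fingerprint is computed.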
  • Creating one or more fingerprints of the retained token list can follow several computational methods. One example is a hash-based method, in which a hash function encodes the sorted list of retained tokens and generates a unique hash for the retained token list. Many popular hash functions can be used for the calculation of the hash, such as MD5, SHA-1, RIPEMD, WHIRLPOOL, and the variations of these hash functions. Using a hash method for fingerprint creation is advantageous as it calculates quickly and saves space. However, hash methods are not reversible (i.e., given a hash code, it is computationally impractical to retrieve the original token list).
  • Another method for fingerprint creation is a bit vector method, which uses a bit vector to encode the presence or absence of retained tokens. The bit vector could be a binary vector using a sequence of Boolean values, each stored as a single bit, or a non-binary numeric vector. The advantage of the bit vector method is that it is a reversible process, but bit vectors are often more costly in terms of storage space.
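  • The reversibility of the bit vector method can be illustrated with a short Python sketch. It assumes a fixed, shared vocabulary that assigns each token a bit position; the invention does not specify how positions are assigned, so the vocabulary and the function names here are hypothetical.

```python
def bit_vector(tokens, vocabulary):
    # One bit per vocabulary entry: 1 if the token is present, otherwise 0.
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

def decode(vector, vocabulary):
    # Reversibility: recover the retained tokens that appear in the vocabulary.
    return [word for word, bit in zip(vocabulary, vector) if bit]
```

  • The storage cost is visible in the sketch: the vector has one position per vocabulary entry regardless of how few tokens the target contains, whereas a hash has a fixed, small size.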
  • The creation of fingerprints is generally some form of lossy compression based on a predefined operation. However, it is within the scope of the invention to use a lossless compression method. For multimedia content, such as an image, sound and/or video file, mathematical transformations can be used to create fingerprints. It is apparent to those skilled in the art that fingerprint creation can be achieved through a variety of methods, and is not limited to the above mentioned approaches. Once the fingerprints are created, other information associated with the document is extracted and combined with the fingerprints to create a signature. The other associated information may be information about the document that is accessible through the operating system, which may include, but is not limited to, file name, date and time of record, user/owner information, access history, and fingerprint history. Other information may also include information about the document accessible through an application, which may include, but is not limited to, author, time of editing, number of words, title, subject, comments, and any other customizable fields or application specific information. There are numerous possibilities regarding the information that can be incorporated into a signature. A person skilled in the art could choose to incorporate any number of desired attributes of the target into a signature, depending on the specific implementation.
  • FIG. 3 shows an exemplary diagram of the process of generating a signature for a document. The document is first parsed and non-textual information is removed. A set of tokens 311 is extracted 310 from the document. One ordinarily skilled in the art would appreciate that there are a number of other acceptable ways to perform the extraction of the token list. The token set is then processed to yield a unique token list. In the FIG. 3 embodiment, the processing of the token set involves sorting the token set 320, which produces a sorted list of tokens 323, and filtering the token set 324, which generates one or more filtered lists of tokens 325. One ordinarily skilled in the art would appreciate that there are a number of other acceptable ways to perform the processing of the token set. The retained tokens are then used to generate one or more fingerprints of the document 330. In one embodiment of the invention, a hash or bit vector can be calculated for the entire list of retained tokens and used as a fingerprint. In another embodiment of the invention, the processed token list can be presented in the form of several subsets of tokens. A hash or bit vector can be calculated for each of the subsets of tokens, and the document is represented with a list of fingerprints corresponding to each retained subset of tokens. In yet another embodiment of the invention, a hash or bit vector is calculated for each retained token, and the document is represented with a list of fingerprints corresponding to each retained token. A signature is created by combining other information associated with the document 331 with one or more fingerprints. The resulting signature is then stored.
  • When a document is modified, if the modification is small, the fingerprint of the file might not change, and the signature is updated with relevant other information. If modifications to a document are not small, then the modified document's fingerprint may not be sufficiently close to the original fingerprint. After such modification, a new candidate fingerprint is created and compared to the original fingerprint. If sufficient change has occurred in the document, and the candidate fingerprint does not match the original fingerprint, the new candidate fingerprint is added to the document's signature. The signature may encode other information, including but not limited to information related to derivation. In other embodiments of the invention, similarity may be measured by comparing fingerprints, signatures or both.
  • FIG. 4 illustrates document modification and further fingerprint generation. When a document is modified 410, a new candidate fingerprint is generated 420 based on the content of the modified document using the method exemplified in FIG. 3. The new candidate fingerprint is then compared with the fingerprint representing the original version of the document 430. The actual original document does not need to be retrieved for comparison. If the candidate fingerprint does not differ from the original fingerprint, the modification of the document is minor. The original fingerprint is then combined with updated other information associated with the document 450 and the updated signature is stored. If the candidate fingerprint differs from the original fingerprint, a major modification has occurred. The candidate fingerprint is then added to the original fingerprint 440. A new signature of the modified document, incorporating the updated other information of the document, is then created and stored. If a fingerprint history is implemented in the signature, it is also updated.
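  • The update logic of FIG. 4 might be sketched in Python as follows. The sketch assumes that each fingerprint is a list of per-token hashes, that closeness is measured by Jaccard overlap, and that a candidate is compared against every fingerprint already in the signature; the measure, the threshold value of 0.8, and the record layout are illustrative assumptions only.

```python
def similarity(fp_a, fp_b):
    # Jaccard overlap between two lists of per-token hashes (an assumed measure).
    a, b = set(fp_a), set(fp_b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

def update_signature(signature, candidate, metadata, threshold=0.8):
    # Compare the candidate with the stored fingerprints; the document itself is not needed.
    best = max(similarity(fp, candidate) for fp in signature["fingerprints"])
    if best < threshold:
        # Major modification: add the candidate to the fingerprint history.
        signature["fingerprints"].append(candidate)
    # In either case, the other information is refreshed.
    signature["metadata"] = metadata
    return signature
```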
  • The present invention can easily be adapted to other types of files. It is also possible to construct similar fingerprints for multimedia files such as image, video, and sound files. A variety of mathematical transformations can be used to create fingerprints for these file types, such as Laplace transform, Karhunen-Loève transform and Fourier transform. Metadata text of sound, image, and video can be used to generate fingerprints. Closed captioning within a video file is text which can be used to generate fingerprints, as previously described. Speech can be converted to text using existing software tools. Text thus derived can be used to generate fingerprints. Moreover, the digital content of these files can be encoded as a sequence of tokens, like text documents. Executables and dynamically linked libraries (DLL) can be represented as a sequence of tokens, which can be used to produce fingerprints. Text embedded in these files can also be used to create fingerprints. Reverse engineered programs (e.g., Java) can be treated as text. Byte-code languages and scripting languages (e.g., Perl, Python) can also be treated as text. The fingerprint creation process produces a relatively small number of bits, as compared to the original file, and serves as a digest of the content of the original file. A person skilled in the art will appreciate that numerous methods can be used for achieving fingerprint creation. The fingerprint creation process in general is a lossy compression process. However, lossless compression schemes can also be adopted for the fingerprint creation process.
  • The signatures are stored in a manner preventing a regular user from modifying or deleting the signatures. Because the signatures are used for forensic purposes, their generation and storage is preferably transparent to the regular user. Only authorized personnel and forensic investigators can have access to the stored signatures. In a network environment, signatures can be created on a user system and offloaded to a network server for storage. Signatures can also be stored on a local file system, while denying user access through use of hidden files or hidden partitions. The signatures can also be embedded in encrypted files. One can also use write-once, read-many media for storing signatures. Only authorized personnel or forensic investigators can recover the storage media and be responsible for safe keeping. Off site storage of the signatures may also be desirable. Cryptographic logging mechanisms can be implemented to control and monitor the access of the signatures.
  • The present invention can be implemented in a variety of ways. In a stand alone system, such as an individual PC, laptop, mobile device (e.g., cell phone, PDA, etc.), signature information is stored locally. In a system that has access to shared file systems, such as file servers, database servers, and network attached storage (NAS), signature information is stored locally or on the shared file systems. In a network based implementation, any system with a network connection can have signature information stored on remote servers. One skilled in the art will appreciate that signatures can be stored in a variety of ways depending on the system or the network configurations of a particular environment.
  • Fingerprints can be created for information that is stored in any database and also database definitions. Signatures for each database entry are based on content and can be created for the entire database. As an example, signatures can be created for emails stored within a server database, allowing the tracing of email senders and receivers. Database definitions, such as schema, relations, tables, keys, and data domains can also have signatures created. When a data manipulation or definition event occurs, such as create table, drop table, or alter table, a new signature is created and stored.
  • In addition to files, signatures can be created for other applications. Changes to virtual machine file systems could be indexed as changes occur. Contents of removable media could have signatures created during mounting or un-mounting (during connection and disconnection) to a computer system. Compressed or archived files could be parsed and have signatures created.
  • It can also be useful to create and store signatures for network traffic. For example, signatures can be created for emails entering and exiting a network. Email attachments can have separate signatures created. Network traffic can thus be linked to particular emails and files when stored. Contents of instant message conversations and contents of file transfers can also be used to create signatures for the particular network activity. Signatures can also be created for text messages such as the ones based on Short Message Service (SMS) protocol. Web pages can also have signatures generated. When integrated over time, a digest or profile of one or more users' Internet browsing history can be generated. A person skilled in the art will appreciate that any information or signal transmitting protocol can be used as a target for signature creation. In one embodiment of the invention, a proxy firewall is used, and signatures are created of network traffic passing through. Network policies can be configured so that the network traffic passing through the proxy firewall is not encrypted. When so configured, secure connections are established between an inside user computer and the proxy firewall, and between the proxy firewall and an outside server, using an encryption protocol such as Transport Layer Security (TLS) or Secure Sockets Layer (SSL). Network traffic encryption only occurs between the inside user computer and the proxy firewall, and between the proxy firewall and the outside server. Contents passing through the proxy firewall are not encrypted and can, therefore, have signatures created. Signatures are stored along with other associated information regarding the network traffic, such as the IP addresses used in communication, thereby facilitating the identification of the origin and destination of the traffic.
  • Once signatures are stored, there are a variety of methods to analyze them. Similarity between signatures can be ascertained by comparing the signature or the fingerprints for exact matching, percentage of matching, probability of matching, or other mathematical calculation revealing the divergence of the signatures or fingerprints. In one embodiment of the invention, a latent analysis can be performed. Particular signatures and/or fingerprints on individual machines locally or remotely can be searched and compared. Signatures or fingerprints that are stored in a database can be similarly searched. In another embodiment of the invention, an active analysis is performed. Instead of simply searching with signatures and fingerprints, advance or retrospective analysis of the signatures and fingerprints can be performed for the purpose of data mining, user profiling, trend analysis, and anomaly detection.
  • FIG. 5 presents an exemplary method for performing a latent search. When provided with a signature of interest, the signature can then be used directly as a query signature. Where a document of interest is provided, a query signature can be created 510 using the method exemplified in FIG. 3. Stored signatures are then retrieved from storage 520 and compared to the query signature 530. The comparison can be performed on signatures, the fingerprints within the signatures, or both. Similarity of the query signature to any stored signature is then determined. If the fingerprints are calculated using a hash method, the similarity is estimated based on hash matches. If the fingerprints are calculated using a bit vector method, the similarity is estimated based on bit vector correlation. If the comparison identifies any stored signatures having similarity above a predetermined threshold, the similar signatures are output for further processing 540. Other information within the stored signatures similar to the query signature is extracted 550. Other documents containing content similar to the document of interest, computer systems housing the document of interest or any similar documents, and users that had possession of the document of interest or any similar documents, can all be identified 560.
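  • As a hedged illustration of the latent search of FIG. 5, the following Python sketch compares a query signature against stored signatures using hash matches. It assumes each signature holds a list of per-token hashes plus hypothetical "file_name", "host", and "owner" fields, and it uses Jaccard overlap as one possible similarity estimate.

```python
def jaccard(fp_a, fp_b):
    # Fraction of hashes shared by the two fingerprint lists.
    a, b = set(fp_a), set(fp_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def latent_search(query, stored, threshold=0.5):
    # Return the other information of every stored signature that is
    # sufficiently similar to the query, most similar first.
    hits = []
    for sig in stored:
        score = jaccard(query["fingerprints"], sig["fingerprints"])
        if score >= threshold:
            hits.append({
                "score": score,
                "file_name": sig["file_name"],
                "host": sig["host"],
                "owner": sig["owner"],
            })
    return sorted(hits, key=lambda hit: hit["score"], reverse=True)
```

  • The returned records correspond to the similar documents, the systems housing them, and the users who possessed them, which is the information identified in step 560.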
  • FIG. 6 presents an exemplary method for user misuse detection. When a user performs an operation to a document that is within a list of predetermined operations, such as create, modify, copy, move, or delete a document, the system captures this user operation 610, and a new signature is created 620 and stored 630. This new signature is then used as a query signature, and compared with stored signatures 640. In one embodiment of the invention, a subset of all stored signatures, such as signatures of known documents containing classified or sensitive information, or illegal content can be used. If the comparison does not identify any stored signature within this subset having similarity to the query signature above a certain threshold, the user is presumably not manipulating classified, sensitive, or illegal content. No action needs to be taken, the operation proceeds as normal. If the comparison identifies any stored signature within this subset that has similarity to the query signature above a certain threshold, the user is presumed to be manipulating classified, sensitive, or illegal content. A further inquiry whether the user is expected to manipulate such content is performed 650 based on criteria such as security clearance, job assignment, or special permission. If the user is determined to have proper access permission, and is expected to manipulate such content, the operation proceeds as normal. However, if the user does not have proper permission, or is not expected to manipulate such content, then the suspect content is identified based on the query and the stored similar fingerprint or signature 660, and a misuse alert is sent to authorized personnel or a forensic investigator 670.
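  • The decision flow of FIG. 6 can be sketched in Python under simplifying assumptions: fingerprints are lists of per-token hashes, similarity is Jaccard overlap, and the inquiry of step 650 is reduced to membership in a hypothetical set of authorized users. A real inquiry could weigh security clearance, job assignment, or special permission.

```python
def jaccard(fp_a, fp_b):
    # Fraction of hashes shared by the two fingerprint lists.
    a, b = set(fp_a), set(fp_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def check_operation(user, query_fingerprints, sensitive, authorized_users, threshold=0.5):
    # Compare the new signature only against the subset of sensitive signatures.
    matches = [s for s in sensitive
               if jaccard(query_fingerprints, s["fingerprints"]) >= threshold]
    if not matches:
        # No sensitive content detected: the operation proceeds as normal.
        return {"action": "proceed", "matches": []}
    if user in authorized_users:
        # Sensitive content, but the user is expected to handle it.
        return {"action": "proceed", "matches": matches}
    # Sensitive content and no permission: identify the content and raise an alert.
    return {"action": "alert", "matches": matches}
```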
  • FIG. 7 presents another exemplary method for user misuse detection. All the files that belong to or are accessed by a user are identified based on ownership information and access information 710. Signatures of the entire collection of these files can be used to generate a user profile for the user 720 and are stored 730. An updated user profile is then generated at a later time, either by request or based on a periodic schedule. The newly generated user profile is then compared to any or all of the stored user profiles of the same user at earlier times 740. If no difference above a certain threshold is detected among the user profiles, there is no deviation in user behavior. However, if the newly generated user profile differs from the stored user profile above a certain threshold, a further inquiry is performed to determine whether there is a legitimate reason for such deviation of user behavior 750. If a legitimate reason is found, such as change in job assignment or upgrade of security clearance, the operation proceeds as normal. If no legitimate reason is found for the deviation of user behavior, the content of the mismatched signatures is identified 760, and an alert of possible user misuse is sent to authorized personnel or to a forensic investigator 770.
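  • One simple way to realize the profile comparison of FIG. 7 is sketched below in Python. The sketch assumes that a user profile is the set of all fingerprints found in the signatures of the user's files and that deviation is the fraction of fingerprints not common to both profiles; the invention leaves both choices open, so these definitions and names are hypothetical.

```python
def build_profile(signatures):
    # A profile here is the union of all fingerprints in the user's signatures.
    profile = set()
    for sig in signatures:
        profile.update(sig["fingerprints"])
    return profile

def deviation(old_profile, new_profile):
    # Fraction of fingerprints not shared by both profiles (0.0 means identical).
    union = old_profile | new_profile
    if not union:
        return 0.0
    return 1.0 - len(old_profile & new_profile) / len(union)

def new_content(old_profile, new_profile):
    # Fingerprints that appear only in the newer profile (the mismatched content).
    return new_profile - old_profile
```

  • A deviation above a chosen threshold would trigger the further inquiry of step 750, and the result of new_content would indicate which signatures to examine in step 760.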
  • FIG. 8 presents an exemplary method for detection of unauthorized network communication of sensitive information. When a network server receives inbound or outbound network traffic 810, a signature is then calculated based on the content of the network traffic 820 and stored 830. The signature is then used as a query signature and is compared to any previously stored signatures 840. In one embodiment of the invention, if the query signature has similarity to any stored signature above a certain threshold, it is then compared to a subset of all stored signatures, such as signatures of known documents containing classified or sensitive information, or illegal content 850. If the query signature does not have similarity above a certain threshold to any of the subsets of stored signatures, no classified, sensitive, or illegal content is detected. Network traffic is allowed to proceed as normal 860. However, if classified, sensitive, or illegal content is detected, suspect content and user information are identified 870, the network traffic is then quarantined 880, and an alert is sent to authorized personnel or to a forensic investigator 890.
  • This proactive approach makes investigations faster, easier, and less expensive. Given one document, all systems containing that or similar documents can be found quickly and easily. This is true even if the given document is a hard copy. Text information can be extracted from the hard copy either automatically (e.g., scanned, segmented, and converted to text using optical character recognition) or manually (e.g., transcribed by hand into a computer readable format) and used to create a query signature. The present invention can identify systems where a document once existed, even if it has since been deleted. In classified computer networks not connected to the Internet, such as those employed by government intelligence agencies and defense contractors, strict control of content entering and leaving the classified network is necessary. However, there has traditionally been no effective mechanism to track the flow of information within the classified network. The present invention can locate any content within the classified network, and provide a system-wide tracking of any content of interest. In one embodiment of the invention, a real time, system-wide map of the distribution of any particular content can be generated and monitored.
  • This invention can also be used for evidence discovery. Given one user or a set of users, a forensic analysis could determine documents of interest. Those identified documents could be used to seed a fingerprint search across all systems. That would rapidly identify which other systems needed further consideration for analysis. The present invention can determine the source of files that were not permanently stored, such as temporary files deleted without a user's knowledge.
  • This invention can be further used for misuse detection. Many systems log accesses to restricted material. However, restricted material is usually defined by its location within the file system, or by other attributes of the file. Once the restricted material leaves the protected file system location, or loses its original attributes, access logging will no longer be able to detect misuse of the restricted material. The present invention, however, can detect when the access logging fails by verifying that documents that should have been logged were logged. Collection statistics and fingerprints can determine when a document is atypical for a user, which may be a sign of document misuse. The present invention can also help to determine the source of leaks by identifying the systems within which a leaked document was present, and a time line that tracks movement of the leaked document through a network.
  • This invention can also be used for intrusion response. When an intrusion is discovered, the signatures of files associated with the intrusion can be recovered. Even if the original files are deleted, the signatures can still be recovered based on time stamps. These recovered signatures can be used to examine across systems for similar intrusions, and also provide early detection to prevent intrusion from similar attacks.
  • FIG. 9 illustrates an exemplary system of the present invention. The system of FIG. 9 comprises four components: 1) a processor for creating/generating fingerprints and signatures for a target, such as a document 910; 2) an extension module to the operating system (OS) configured to trigger signature generation upon occurrence of a certain action 920; 3) a mechanism for storing the generated signatures 930; and 4) a comparator for querying the system for stored signatures and comparing those retrieved for similarity 940. The implemented system may have either a Windows service or Linux daemon running in the background to generate a signature for a given file and to store it, and then to query the stored signatures to determine similarity with other signatures. The system runs with administrator or root privileges.
  • The extension module of the operating system has several components. First the configuration information must be stored on the system. In Windows, this would be registry entries or configuration files. In Linux, a configuration file is used, which is stored in /etc or another location. The configuration information includes mechanisms for signature creation, other information to store with signatures, mechanism and location for signature storage, events that trigger signature creation and mechanisms for extracting text based on file type. Separate programs or modules can be called to perform text extraction. In Windows, the COM model can be used to extract text from Office documents. In Linux, various utilities can be used to extract text from different file types.
  • The signature creation is linked into the OS so that signatures are created when desired system events occur, such as file deletion, file copy between file systems, and file modification. As soon as the computer system starts, certain system events are remapped to invoke the signature creation process, and the system waits for the occurrence of these events. When any one of these events is captured, the OS invokes calls to the signature creation process. In Linux, this can be achieved by a loadable kernel module. In Windows, this can be done through a variety of ways. When called, the system identifies the digital object (file) that triggered the operation, and passes a copy or pointer to the file for processing to the fingerprints creation process. Tokens are extracted from the file and processed, fingerprints are generated for the retained token list, other information associated with the file (metadata) is incorporated with the fingerprints, and a signature is generated, all based on the criteria specified in the configuration information.
  • A basic system can incorporate the entire index of retained tokens (i.e., without filtration). In this embodiment of the invention, a simple tokenization of a document may include converting the entire document to lower-case (removing case-sensitive information) and obtaining individual tokens. A token for this basic system is any string of length 4 or more separated by either white space or any form of punctuation. The individual tokens are then sorted according to Unicode ordering to obtain unique tokens. A hash code or bit vector is then generated for each token in the sorted unique token list. In another embodiment of the invention, the same process is used for tokenization of a document and sorting of the unique token list. The process also includes the filtering of the unique token list. Subsets of the unique token list are created based on a list of criteria including, but not limited to, keeping tokens of only 6 characters or longer in length, keeping tokens numbered (in order) 25-50, keeping every 7th token, keeping every 25th token, or other similar rules. A hash code or bit vector is then generated for each subset of tokens.
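  • For purposes of illustration only, and without limiting the invention, the tokenization and filtering described above might be sketched as follows. The function names, the use of SHA-256 as the hash code, and the splitting pattern (which treats the underscore as punctuation) are assumptions of this sketch and are not specified by the disclosure.

```python
import hashlib
import re


def tokenize(text, min_len=4):
    """Lower-case the text, split on white space or punctuation, keep strings
    of min_len or more characters, and return the unique tokens sorted by
    Unicode code point."""
    tokens = re.split(r"[\W_]+", text.lower())
    return sorted({t for t in tokens if len(t) >= min_len})


def hash_tokens(tokens):
    """Hash a token list into one hex digest (SHA-256 is an assumption)."""
    h = hashlib.sha256()
    for t in tokens:
        h.update(t.encode("utf-8"))
        h.update(b"\x00")  # separator so ["ab", "c"] differs from ["a", "bc"]
    return h.hexdigest()


def basic_fingerprints(text):
    """Basic system: one hash code per token of the sorted unique list."""
    return [hash_tokens([t]) for t in tokenize(text)]


# Filter rules taken from the examples in the text; the names are hypothetical.
FILTERS = {
    "len>=6": lambda toks: [t for t in toks if len(t) >= 6],
    "ranks 25-50": lambda toks: toks[24:50],
    "every 7th": lambda toks: toks[6::7],
    "every 25th": lambda toks: toks[24::25],
}


def filtered_fingerprints(text):
    """Filtered system: one hash code per subset of the unique token list."""
    toks = tokenize(text)
    return {name: hash_tokens(rule(toks)) for name, rule in FILTERS.items()}
```

Because case, punctuation, and token order are discarded before hashing, two files that differ only in formatting yield the same fingerprints under this sketch.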
  • Fingerprints may vary in complexity. A signature created based on a complete index of retained tokens, such as a list sorted according to Unicode, can be highly precise but support only minimal variance. The precision and tolerance to variance of a signature created based on a filtered index of retained tokens depends on the degree of filtration. A signature based on a highly filtered index provides high recall but low precision. The number of filters employed to generate signatures also affects the complexity. Multiple filters increase precision but also increase the time required for signature calculation and the storage space needed for signature safe-keeping.
  • A mechanism for storing signatures should be resilient against modification by users. Once the signature is created, it is stored securely. A user other than authorized personnel or a forensic investigator should have no means to modify or delete any signature entry. The signatures can be inserted into a database, allowing for easy queries and off-system storage. Alternatively, signatures can be stored in flat files having only root or administrator permissions.
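  • As one hedged illustration of database storage, the sketch below uses SQLite triggers to reject updates and deletions of stored signatures. The table layout and function names are hypothetical. Triggers alone do not protect against a user who can alter the database file itself, so the restricted permissions or off-system storage described above would still be needed.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open a signature store whose rows can be inserted but not changed."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS signatures (
            id          INTEGER PRIMARY KEY,
            fingerprint TEXT NOT NULL,
            file_name   TEXT,
            recorded_at TEXT
        );
        CREATE TRIGGER IF NOT EXISTS signatures_no_update
        BEFORE UPDATE ON signatures
        BEGIN SELECT RAISE(ABORT, 'signatures are append-only'); END;
        CREATE TRIGGER IF NOT EXISTS signatures_no_delete
        BEFORE DELETE ON signatures
        BEGIN SELECT RAISE(ABORT, 'signatures are append-only'); END;
        """
    )
    return conn


def add_signature(conn, fingerprint, file_name, recorded_at):
    """Append one signature entry."""
    with conn:
        conn.execute(
            "INSERT INTO signatures (fingerprint, file_name, recorded_at) "
            "VALUES (?, ?, ?)",
            (fingerprint, file_name, recorded_at),
        )


def try_delete(conn, file_name):
    """Attempt a deletion; returns False when the trigger rejects it."""
    try:
        with conn:
            conn.execute("DELETE FROM signatures WHERE file_name = ?", (file_name,))
        return True
    except sqlite3.DatabaseError:
        return False
```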
  • When given a signature, one can check to see if the signature is in the store. If given a file or document, text is extracted from the file, fingerprints are created, then a signature, and the created query signature is checked against the store. If multiple fingerprints are used to represent a file, any or all of the fingerprints can be used to determine similarity above a predetermined threshold. A proper or predetermined threshold can be the matching of all or some of the fingerprints, a probabilistic analysis of the similarity of the fingerprints, or any other mathematical analysis directed to signature divergence. The higher the threshold, the lower the rate of false positives; however, the higher the rate of false negatives.
  • In another aspect of the present invention, the content of a target can be alternatively represented by selections of terms that are logically selected from the target. The selections of terms are conceptually and functionally similar to the above-described fingerprints, and together with other information associated with the target, form a representation of the content of the target similar to the above-described signature. The collection of terms of a plurality of targets can further be organized and indexed in a data structure which links to or is associated with the representations of the targets.
  • Terms can be extracted from a target to represent its content. As an example, the content of a target that contains textual information can be defined by a selection of words and phrases within the target. The present invention extracts terms appropriate for one or more targets, which include, but are not limited to, bits, bytes, characters, digits, numbers, words, word sequences, phrases, sentences, meta-data, and information derived from these targets. In one embodiment of the present invention, terms may include stemmed and stopped words. Stemming and stop words are recognized by one skilled in the art as standard practice for information extraction in information-retrieval systems. For purpose of illustration, but without limiting the scope of the invention, stemming can be generally described as a process for reducing inflected (or sometimes derived) words to their stem, base, or root form. The stem, base, or root of a word in the context of stemming is not necessarily identical to the linguistic root of the word. It is sufficient that related words are mapped to the same stem, base, or root. Several varieties of stemming algorithms are currently available. The term “stop words” generally refers to words which are filtered out prior to, or after, the processing of natural language data (e.g., text). In the above-mentioned embodiment, stop words can be removed to yield a list of terms that represents the content of one or more targets. In other embodiments of the present invention, stop words may be intentionally preserved to include phrases in the list of terms. For targets that lack textual terms, their idiosyncratic characteristics can be identified and used to represent a target's contents. Feature points of images, video clips, audio waves, etc. can likewise be extracted and treated as terms. The present invention method extracts these terms in a variety of ways. 
Term selection can be achieved by providing a predetermined list of terms of interest, by using a predetermined algorithm that automatically identifies terms of interest, or by a combination of these two methods.
  • Once identified and extracted, the present invention method records in the representation of the target the presence and/or absence of each term by methods that include, but are not limited to, recording its presence or absence, its frequency, and/or its relative importance. The representation of the target also incorporates other information associated with the target that is extracted and combined with the terms. The other associated information may be data relevant to the target, and can be accessed through the operating system, which may include, but is not limited to, file name, date and time of record, user/owner information, and access history. Other associated information may also include data relevant to the target that is accessible through an application, which may include, but is not limited to, author, time of editing, number of words, title, subject, comments, and any other customizable fields or application-specific information. There are numerous possibilities regarding the information that can be incorporated into a representation. A person skilled in the art could choose to incorporate any number of desired attributes of the target into a representation, depending on the specific implementation. In one embodiment of the present invention, the representation of the target records the inverse document frequency for each term. Other associated information includes, but is not limited to, the source document's name, owner, location, date, and host name. Such other associated information can also include, but is not limited to, term frequency within the document, positioning information within the document, weighting means, etc.
  • According to one aspect of the present invention, sets of targets may be processed simultaneously. It is also within the scope of the invention to process individual targets sequentially, or in parallel and then merged. When sets of targets are processed, the collection of terms contained within each individual target can be pooled and indexed in a data structure, such as an inverted index. An inverted index is an index data structure storing a mapping from content, such as words or numbers, to its locations in a target such as a database file, a document, or a set of documents. In other words, the inverted index is a data structure that is keyed to a list of terms such that each term references to a posting list that refers to targets that contain each of the terms.
  • FIG. 10 shows an example of the present invention utilizing the inverted index of a document set. For a given document set, individual documents are processed in parallel or in serial and then combined. Terms are extracted from each of the documents within the document set 1010. For purposes of illustration, but without limiting the present invention, the processing and extraction of terms may include the steps of identifying the format of a document, discarding formatting information, stemming, and removing stop words. All terms extracted from the document set are pooled and indexed. One exemplary method of indexing is alphabetical indexing. An inverted index of the document set is generated from the indexed terms of the document set 1020. For each document, its collection of terms together with frequency and positional information of the terms form a representation of the content of the document 1030. Other information 1031 associated with the original document, such as time/date, location, and ownership information, is incorporated with the selection of terms to form the representation of the document 1040. The posting list of each of the terms within the inverted index of the document set may include representations of each of the individual documents of the document set, or it may contain references to the representations of the documents stored elsewhere 1050.
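  • By way of a non-limiting sketch, the construction of an inverted index and of per-document representations along the lines of FIG. 10 might look as follows. The stop-word list, the data layout, and the function names are assumptions of this sketch. Stemming and format detection are omitted for brevity.

```python
import re
from collections import Counter, defaultdict

# Illustrative stop-word list only; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}


def extract_terms(text):
    """Lower-case the text, split it into terms, and remove stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def build_index(documents, metadata):
    """documents: {doc_id: text}; metadata: {doc_id: other associated info}.

    Returns (inverted_index, representations). The index maps each term,
    in alphabetical order, to a posting list of document identifiers.
    """
    representations = {}
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        terms = extract_terms(text)
        positions = defaultdict(list)
        for pos, term in enumerate(terms):
            positions[term].append(pos)
        representations[doc_id] = {
            "frequencies": Counter(terms),           # term frequency
            "positions": dict(positions),            # positional information
            "other_info": metadata.get(doc_id, {}),  # time/date, owner, etc.
        }
        for term in positions:
            postings[term].add(doc_id)
    index = {term: sorted(ids) for term, ids in sorted(postings.items())}
    return index, representations
```

Here the posting lists hold references (document identifiers) to representations stored elsewhere, which is one of the two options the text describes.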
  • Traditionally, an inverted index is not suitable for computer forensics. When used in an information retrieval system, as the collection is modified, the inverted index is modified as well. The inverted index is constantly updated in response to the addition and removal of documents. Thus, in practice, as documents are deleted, they are removed from the inverted index. When a document is modified, the inverted index is updated to include reference to the modified document, but loses the references to the original document. The traditional manipulation of the inverted index is not appropriate when using it as a forensic examination application, in which case all versions of the targets must be maintained, or at the very least, an accurate representation of the targets must be maintained. According to one aspect of the present invention, representations of the contents of the targets along with other associated information including, but not limited to, file existence or date of deletion or of modification, type of modification, etc., are added or linked to the inverted index. The inverted index, according to the present invention, does not remove the posting of a removed target from the index. The inverted index maintains a reference to a representation of a target that is stored permanently, and which is not removed when the original target is deleted. Depending on the application, the associated other information in the representation, such as the indication that a file was deleted and the identification of the user who performed the action, may or may not be updated. In one embodiment of the present invention, the representation of the content of a deleted target is updated with information related to its deletion, such as time/date, user, and host/client computer where the deletion is performed. New representations of new targets are generated as these new targets are added to the system, and the inverted index is updated to include the new targets. 
When existing targets are modified, the representations of the targets are either updated to account for these modifications or, if the modifications are sufficiently large, the modified targets are treated as though they are new targets. For targets that are sufficiently modified or deleted and otherwise removed from the computer system, the original representations are retained with all associated information pertaining to such targets, which may also include information pertaining to the circumstances of the modification.
  • FIG. 11 is a flow chart illustrating an exemplary method for document addition and modification according to one aspect of the present invention. For each new document that is added to the document set, a representation of the content of the new document is generated 1110. The inverted index for the document set is then updated 1120 to include the new document and the representation of its content, as well as other information associated with the new document 1111. When a document within the document set is modified 1130, a candidate representation of the content of the modified document is generated 1140. This candidate representation is compared with the representation of the original document 1150. When the modification is minor (i.e., the similarity of the representations of the modified and original documents is within a predetermined threshold), the original representation of the document is retained 1160. Other information associated with the modified document is updated 1170. When the modification is significant (i.e., the similarity of the representations of the modified and original documents falls below a predetermined threshold), the modified document is treated as a new document 1180. The inverted index is updated to incorporate the modified document 1120, and a new representation of the modified document is generated and other associated information stored. The listing of the original document and its representation are not modified.
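  • The addition, modification, and deletion handling of FIG. 11 can be sketched, under stated assumptions, as an index that never removes a posting. The class name, the use of Jaccard similarity over term sets as the comparison, and the default threshold are hypothetical choices; the disclosure does not fix a particular similarity measure.

```python
def jaccard(a, b):
    """Similarity of two term sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class ForensicIndex:
    """Inverted index that retains representations of old and deleted targets."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.postings = {}         # term -> set of representation ids
        self.representations = {}  # rep id -> {"terms": ..., "info": ...}
        self.current = {}          # document name -> rep id of the live version
        self._next_id = 0

    def add(self, name, terms, info):
        """Generate a representation for a new document and index it."""
        rep_id = self._next_id
        self._next_id += 1
        self.representations[rep_id] = {
            "terms": frozenset(terms),
            "info": dict(info, name=name),
        }
        for term in terms:
            self.postings.setdefault(term, set()).add(rep_id)
        self.current[name] = rep_id
        return rep_id

    def modify(self, name, terms, info):
        """Minor change: keep the representation and update other information.
        Significant change: treat the modified document as a new document."""
        old_id = self.current[name]
        old = self.representations[old_id]
        if jaccard(old["terms"], frozenset(terms)) >= self.threshold:
            old["info"].update(info)
            return old_id
        return self.add(name, terms, info)

    def delete(self, name, info):
        """Record the deletion; postings and representation are retained."""
        rep_id = self.current.pop(name)
        self.representations[rep_id]["info"].update(info, deleted=True)
        return rep_id
```

After a deletion, a query for a term of the deleted document still reaches its representation, which now carries the deletion details.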
  • For one or more computer systems, the present invention permits individuals with the proper privilege to query an embodiment of the invention and the representation to determine the presence or absence of information that resides or resided on such computer systems. In one embodiment of the present invention, these queries can be performed directly on one or more individual machines. In another embodiment of the present invention, these queries can be performed remotely. For remote queries in a networked environment, one embodiment of the present invention transmits the query to one or more computer systems, which execute the query on the representation and return the answer to that query. For remote queries, another embodiment of the present invention includes a computer system that stores all of the representations from one or more other computer systems, queries all of these representations, and returns answers to those queries. For this embodiment of the present invention, one or more computer systems periodically transmit all or portions of their representations to the computer system responsible for storing all representations. This method also allows for these representations to be updated at any time by an individual with proper privilege. The method also allows for these representations to be generalized or compiled into a single representation for one or more computer systems.
  • The method of the present invention allows for manual and automated query formation. An individual with proper privileges may provide queries directly to a computer system employing the present invention method. A query can be a term or a collection of terms. Alternatively, an individual with proper privileges may provide information of interest in the form of a file, document, or any other format that is readable by a computer system (i.e., a query target), whereupon the present invention processes the file or document in the manner described above, and queries the inverted index of representations. The form of the query includes, but is not limited to, one or more terms, terms connected with logical operators, and queries in SQL. For a given query, the present invention method returns the representation of the target. Other information associated with the target can be extracted from the representation, which other information may include document name, host name, and any other requested meta-data that relate to the query.
  • FIG. 12 is a flow chart that illustrates an exemplary method for performing a search with the inverted index according to one aspect of the invention. Query terms 1201 can be formed from a document of interest 1210 or from selected terms of interest 1202. A search is performed with the query terms 1201 against the inverted index 1220. Documents are identified 1230 based on the existence and frequency of the query terms 1201. Representations of the identified documents are retrieved 1240, and other information associated with the document representations is extracted 1250. Information such as time/date stamp, user name, and computer name can be used to identify a user or computer of interest 1260.
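  • A minimal sketch of the search of FIG. 12 follows. It assumes a hypothetical layout in which the inverted index maps each term to a list of document identifiers and each representation holds a "frequencies" mapping and an "other_info" mapping. Ranking by summed term frequency is an assumption of the sketch.

```python
from collections import Counter


def search(query_terms, index, representations):
    """Identify documents by the existence and frequency of the query terms,
    then return (doc_id, score, other_info) tuples, best match first."""
    scores = Counter()
    for term in set(query_terms):
        for doc_id in index.get(term, ()):
            scores[doc_id] += representations[doc_id]["frequencies"].get(term, 0)
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [
        (doc_id, score, representations[doc_id]["other_info"])
        for doc_id, score in ranked
    ]
```

The returned other information (for example user name and host name) is what an investigator would use to identify a user or computer of interest.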
  • Embodiment for Forensic Signatures for Forensic Processing
  • In a classic forensic environment, computers or other digital devices used in examination may not be seen by the investigator until they are seized at a crime scene or in an intelligence gathering operation. When multiple digital devices are seized and examined by several investigators, a standard need in forensics is to recognize previously seen files to avoid repeated examinations and to identify unseen files to conduct a thorough investigation. Traditional technologies using a hash function or fuzzy hash function for this purpose can lack robustness. For example, the hash values of two files are the same only when the two files are exactly identical. Even a slight change to a file, such as adding a word, results in a different hash value. A fuzzy hash function that calculates a hash value for multiple parts of a file cannot identify a file when the file is merely reformatted. If an investigator relies on a hash function or a fuzzy hash function during forensic processing, the investigator risks wasting precious time and resources on the same file.
  • The forensic signature created according to an embodiment includes a digital fingerprint and other information associated with a file. The digital fingerprint may represent contents of the file and is resistant to minor modification of the file.
  • After a digital device is seized, a dictionary is created based on the files included in the digital device. The dictionary may include words that are routinely used by a person or words that are commonly used in the files included in the digital device. According to an embodiment, a digital fingerprint is created for each file of the seized digital devices according to the dictionary. One or more digital fingerprints may be created for one file using one or more dictionaries. The generated digital fingerprints combine with other information of the file to form a signature. An investigator may use the signature to identify unseen files. These signatures are not sensitive to minor changes of the file and are portable as long as the dictionary is transmitted along with these files.
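  • The disclosure does not specify how the dictionary is built or how a fingerprint is derived from it. Purely as an assumed illustration, the sketch below takes the dictionary to be the words appearing in the most files of the seized device, and the fingerprint to be a bit string marking which dictionary words a file contains. All names are hypothetical.

```python
import re
from collections import Counter


def words(text):
    """Lower-case alphabetic words of a text."""
    return re.findall(r"[a-z]+", text.lower())


def build_dictionary(texts, size=8):
    """Pick the `size` words that occur in the most files (ties broken
    alphabetically). This selection rule is an assumption."""
    counts = Counter()
    for text in texts:
        counts.update(set(words(text)))  # count each word once per file
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:size]]


def fingerprint(text, dictionary):
    """One bit per dictionary word: '1' if the file contains it, else '0'."""
    present = set(words(text))
    return "".join("1" if word in present else "0" for word in dictionary)
```

Since the bits are meaningful only relative to a given dictionary, the dictionary has to travel with the signatures, consistent with the portability remark above.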
  • In the forensic environment, the investigator may be required to investigate audio and video files recovered from seized digital devices. Many times an investigator has to review the entirety of the audio and video files so that an inserted segment is not overlooked. There is also the need to identify the same or similar audio and video files. Again, the traditional hash or fuzzy hash functions cannot fulfill such a need.
  • According to an embodiment of this invention, a forensic signature is created for an audio file or a video file. The forensic signature includes digital fingerprints and meta information of the audio file or the video file, and has a structure similar to that of the signatures described above. The forensic signature effectively identifies the same audio files and the same video files.
  • The digital fingerprint of a video file may be created based on segments of the video file. To create a digital fingerprint for a video file, the video file is parsed to locate changes in scene. The scene change may be detected based on changes in video shot. The video file is segmented according to the scene change. Several pieces of representative information associated with each segment may be extracted. For example, the length of each scene segment is determined and a fingerprint of a video file may be an encoded series of lengths of all the scene segments. In another embodiment, each segment is analyzed using a sliding window. One or more key frames from each scene segment are extracted. In another embodiment, a KL transform (or any other component reduction computation) is performed on each segment. In another embodiment, subtitles or caption text are extracted to be used in creating the fingerprints.
  • The metadata of a video file includes information about the file such as filename, creator, length of the video, encoding, resolution, etc. Other information that may be added to the metadata, if the video file is seized as forensic evidence, includes date and location of evidence collection, collecting agent, etc.
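  • As a hedged sketch of the scene-length embodiment, the code below detects scene changes from a per-frame summary value, derives the segment lengths, and encodes them. The per-frame summary (for example mean intensity), the difference threshold, and the quantization of lengths into half-second buckets are assumptions of this sketch, not details taken from the disclosure.

```python
def detect_scene_changes(frame_values, threshold):
    """Return indices of frames whose summary value jumps by more than
    `threshold` relative to the previous frame."""
    return [
        i
        for i in range(1, len(frame_values))
        if abs(frame_values[i] - frame_values[i - 1]) > threshold
    ]


def segment_lengths(scene_change_frames, total_frames):
    """Length, in frames, of each scene segment."""
    boundaries = [0] + sorted(scene_change_frames) + [total_frames]
    return [b - a for a, b in zip(boundaries, boundaries[1:]) if b > a]


def length_fingerprint(lengths, fps=25.0, bucket_seconds=0.5):
    """Encode the series of segment lengths as dash-separated bucket counts."""
    return "-".join(str(round(n / fps / bucket_seconds)) for n in lengths)
```

Quantizing the lengths is meant to let small timing differences between two encodings of the same video map to the same fingerprint.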
  • To create a signature for an audio file, it is first determined whether the audio file includes speech or music. If the audio file is determined to include music, a forensically accurate acoustic fingerprint can be identified and used in accord with techniques known to ordinarily skilled artisans, including using perceptual characteristics including average zero crossing rate, estimated tempo, average spectrum, spectral flatness, prominent tones across a set of bands, and bandwidth. If the audio file includes speech, a conversion from speech to text is performed. The text converted from the audio file is processed to generate a digital fingerprint for the audio file.
  • In some cases, multimedia files that include both audio files and video files need to be investigated. For example, a DVD file includes multiple audio and video files, which may not have one-to-one correspondence. The DVD file needs to be treated as one file. To create a signature of such a multimedia file that includes both audio and video, fingerprints for audio and video are created separately and stored.
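  • Of the perceptual characteristics listed above, the average zero crossing rate is the simplest to illustrate. The sketch below assumes the audio has already been decoded into a sequence of signed sample values. The function name is hypothetical, and treating a zero sample as non-negative is a convention chosen for this sketch.

```python
def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(samples) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(samples) - 1)
```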
  • Informational Signatures
  • Another need during forensic processing is to help an investigator search and determine the content of a file. To fulfill such a need, the signature of a file can be informational rather than a collection of a plurality of computer-generated symbols.
  • In an embodiment, informational signatures are created, which may assist in determining what information is included in a file. A user may tokenize an informational object of interest such as an email address, name, phone number, account number, etc. These tokens are then entered into a data structure, such as, for example, a Bloom filter or a similar data structure, including data structures that can probabilistically or statistically determine if a tokenized term has been previously inserted. The result is then stored as an informational fingerprint. The informational fingerprint is then able to be used in targeted forensic investigations related to the specific information.
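  • A Bloom filter of the kind mentioned above can be sketched as follows. The bit-array size, the number of hash functions, and the derivation of positions from salted SHA-256 digests are assumptions of this sketch. A membership test may return a false positive but never a false negative.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: answers "possibly inserted" or "definitely not"."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # Python integer used as a bit array

    def _positions(self, token):
        """Derive num_hashes bit positions from salted SHA-256 digests."""
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{token}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, token):
        for pos in self._positions(token):
            self.bits |= 1 << pos

    def __contains__(self, token):
        return all((self.bits >> pos) & 1 for pos in self._positions(token))
```

For instance, after `bf.add("alice@example.com")`, the test `"alice@example.com" in bf` is true, and the integer `bf.bits` could be stored as the informational fingerprint.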
  • For example, a file may be tokenized to find all instances of a Social Security Number (SSN). A fingerprint is created indicating the count of SSNs identified in the file. When a laptop is lost, the signatures of all the files can be retrieved and examined to check whether any file contains SSNs, and the counts thereof.
  • Massively Parallel Signature Matching
  • In some instances, a large number of signatures may be created for a computer system. When a query signature is to be compared with or matched against a large number of signatures stored in the database, a considerable amount of time can be required for such a process, which adds overhead to the computer system.
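  • The SSN example might be sketched as below. The pattern matches only the hyphenated NNN-NN-NNNN form, which is an assumption; it would miss unformatted numbers and does not validate that a match is an issued SSN. The field and function names are hypothetical.

```python
import re

# Three digits, two digits, four digits, hyphen-separated, and not
# embedded in a longer run of digits.
SSN_PATTERN = re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)")


def ssn_fingerprint(text):
    """Informational fingerprint: the count of SSN-like tokens in the text."""
    return {"ssn_count": len(SSN_PATTERN.findall(text))}


def files_with_ssns(signatures):
    """signatures: {file_name: fingerprint}. Return files with a non-zero count."""
    return {
        name: sig["ssn_count"]
        for name, sig in signatures.items()
        if sig["ssn_count"] > 0
    }
```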
  • The following explains the matching process carried out by another embodiment. When a query signature is to be matched, the database first determines the number of fingerprints included in the query signature. For each fingerprint, the database identifies the type of that fingerprint and determines all the matching fingerprints having the same type in the database. The same process is conducted for every fingerprint included in the signature. As a result, a plurality of lists of matched fingerprints is obtained from the database. The plurality of lists is merged to create a merged list. A score is calculated for each matched fingerprint in the merged list based on closeness of the match. A final list is selected to include the signatures that have the highest scores or scores higher than a predetermined threshold. If the matching process is implemented by a single computer system, a substantial amount of computer time is required to complete this process.
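  • The single-system matching process described above can be sketched as follows, assuming a hypothetical layout in which each signature is a mapping from fingerprint type to fingerprint value. Scoring closeness as the fraction of query fingerprints matched exactly is an assumption; the disclosure leaves the scoring function open.

```python
from collections import defaultdict


def match_signature(query, database, threshold=0.5):
    """query: {fp_type: value}; database: {sig_id: {fp_type: value}}.

    Returns (sig_id, score) pairs at or above the threshold, best first.
    """
    # One list of matching signatures per fingerprint type in the query.
    per_type_lists = []
    for fp_type, value in query.items():
        matches = [
            sig_id for sig_id, fps in database.items() if fps.get(fp_type) == value
        ]
        per_type_lists.append(matches)

    # Merge the lists, counting how many fingerprint types each signature hit.
    merged = defaultdict(int)
    for matches in per_type_lists:
        for sig_id in matches:
            merged[sig_id] += 1

    # Score each merged entry and keep those at or above the threshold.
    scored = [(sig_id, hits / len(query)) for sig_id, hits in merged.items()]
    final = [entry for entry in scored if entry[1] >= threshold]
    return sorted(final, key=lambda entry: (-entry[1], entry[0]))
```

Every fingerprint type requires a pass over the database, which is the cost that the parallel embodiment aims to spread across machines.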
  • According to another embodiment of this invention, solutions to reduce the overhead of the computer system caused by the matching process include using a commercial service for matching or using idle computers within the same organizational network. For example, in an embodiment fingerprints of each type are compared in parallel on different computers. During the parallel matching process, a computer system is determined to be a master. It is within the scope of this invention, however, to likewise support multiple simultaneous master computing systems. A list of target fingerprints is partitioned by the master into a plurality of sub-lists to be distributed across resources, which may be referred to as slaves. The sub-lists are distributed to the slaves, which conduct the matching process for each sub-list. The slaves also score the closeness of each signature using their own computer system. The slaves return to the master a list of target fingerprints and associated scores. The master merges the returned lists and creates a master list. The master may traverse the returned lists and generate an overall score over all the fingerprint types. The master may start another parallel computation to locate the highest matching signature. The master may create a list that is composed of all the target signatures and the matching scores for each of the fingerprints within, and distribute a subset of the list to a slave system for scoring. Exemplary execution environments for the parallel matching process may include cloud computing, Hadoop, Message Passing Interface, Parallel Virtual Machine, Linda, and Communicating Sequential Processes.
  • The present invention method utilizing term representation of content of a target and an inverted index of the representation is suitable for carrying out the other computer forensic and security applications described above.
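  • The partition, distribute, score, and merge steps can be sketched on a single machine, with a thread pool standing in for the separate slave computers. This substitution, the function names, and the exact-match scoring are assumptions of the sketch. A deployment on one of the execution environments named above would replace the pool with that environment's own distribution mechanism.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, parts):
    """Master step: split the target list into up to `parts` sub-lists."""
    sublists = [items[i::parts] for i in range(parts)]
    return [sub for sub in sublists if sub]


def score_sublist(query, sublist):
    """Worker step: score each (sig_id, fingerprints) target against the query."""
    results = []
    for sig_id, fps in sublist:
        hits = sum(1 for fp_type, value in query.items() if fps.get(fp_type) == value)
        results.append((sig_id, hits / len(query) if query else 0.0))
    return results


def parallel_match(query, targets, workers=4):
    """Master: partition, distribute, then merge the returned scored lists."""
    sublists = partition(sorted(targets.items()), workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        returned = list(pool.map(lambda sub: score_sublist(query, sub), sublists))
    master_list = [pair for part in returned for pair in part]
    return sorted(master_list, key=lambda pair: (-pair[1], pair[0]))
```

The first entry of the returned master list is the highest matching signature.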
  • These and other advantages of the present invention will be apparent to those skilled in the art from the foregoing specification. Accordingly, it will be recognized by those skilled in the art that changes or modifications may be made to the above-described embodiments without departing from the broad inventive concepts of the invention. It should therefore be understood that this invention is not limited to the particular embodiments described herein, but is rather intended to include all changes and modifications that are within the scope and spirit of the invention.

Claims (13)

What is claimed:
1. A method of generating a signature of a video, comprising the steps of:
detecting scene changes of the video;
segmenting the video into a plurality of segments corresponding to each scene change;
extracting a representation from each segment;
forming a digital fingerprint based on the representation; and
creating a signature by combining the digital fingerprint with predetermined metadata of the video.
2. A method of claim 1, wherein the representation includes one or more frames of each segment.
3. A method of claim 1, wherein the representation includes length information of each segment.
4. A method of claim 1, wherein the representation includes a subtitle and caption text of each segment.
5. A method of generating a signature for an audio file, comprising the steps of:
determining whether the audio includes music or speech;
generating, if the audio includes speech, a transcript of the speech;
extracting a representation from the transcript;
forming a digital fingerprint based on the representation; and
creating a signature by combining the digital fingerprint with predetermined metadata of the audio.
6. A method of generating a signature of a digital file, comprising the steps of:
selecting a token indicating an informational object;
generating a digital fingerprint of the digital file using the selected token; and
creating the signature that includes the digital fingerprint and meta data of the digital file.
7. A method of claim 6, further comprising the steps of:
extracting a plurality of tokens from the digital file according to the selected token; and
inserting the extracted plurality of tokens into a data structure that probabilistically determines whether a token is inserted before.
8. The method of claim 6, wherein the selected token includes a token selected from the group consisting essentially of: an email address, a name, an account number, and a social security number.
9. A method of matching a forensic signature of a digital file, comprising the steps of:
receiving, by a master apparatus, a list of query signatures;
generating, by the master apparatus, a plurality of sub-lists of query signatures;
transmitting, by the master apparatus, a sub-list to a slave apparatus;
matching, by a slave apparatus, the query signatures with signatures in database according to digital fingerprints included in the query signatures and generating a plurality of matching lists for each digital fingerprints;
merging, by the slave apparatus, the plurality of matching lists and generating a final list;
calculating, by the slave apparatus, a score for each matched signature in the final list according to a closeness between the matched signature and the query signature; and
transmitting, by the slave apparatus, the final list and the score to the master apparatus.
10. A computer-readable transitory storage medium storing an executable program, when executed, causing a computer system to execute a method of generating a signature of a video, comprising the steps of:
detecting scene changes of the video;
segmenting the video into a plurality of segments corresponding to each scene change;
extracting a representation from each segment;
forming a digital fingerprint based on the representation; and
creating a signature by combining the digital fingerprint with predetermined metadata of the video.
11. A computer-readable transitory storage medium storing an executable program, when executed, causing a computer system to execute a method of generating a signature for an audio file, comprising the steps of:
determining whether the audio includes music or speech;
generating, if the audio includes speech, a transcript of the speech;
extracting a representation from the transcript;
forming a digital fingerprint based on the representation; and
creating a signature by combining the digital fingerprint with predetermined metadata of the audio.
12. A computer-readable transitory storage medium storing an executable program, when executed, causing a computer system to execute a method of generating a signature of a digital file, comprising the steps of:
selecting a token indicating an informational object;
generating a digital fingerprint of the digital file using the selected token; and
creating the signature that includes the digital fingerprint and metadata of the digital file.
13. A computer-readable transitory storage medium storing an executable program, when executed, causing a computer system to execute a method of matching a forensic signature of a digital file, comprising the steps of:
receiving, by a master apparatus, a list of query signatures;
generating, by the master apparatus, a plurality of sub-lists of query signatures;
transmitting, by the master apparatus, a sub-list to a slave apparatus;
matching, by the slave apparatus, the query signatures with signatures in a database according to digital fingerprints included in the query signatures and generating a plurality of matching lists, one for each digital fingerprint;
merging, by the slave apparatus, the plurality of matching lists and generating a final list;
calculating, by the slave apparatus, a score for each matched signature in the final list according to a closeness between the matched signature and the query signature; and
transmitting, by the slave apparatus, the final list and the score to the master apparatus.
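
The token-based signature of claims 6 to 8 and the master/slave matching of claim 9 can be illustrated with a minimal Python sketch. It is not the implementation disclosed in the application. It assumes e-mail addresses as the selected token type, a Bloom filter as the data structure that probabilistically determines whether a token has been inserted before, and Jaccard similarity as the closeness score. The names `make_signature`, `maybe_contains`, `closeness`, `worker_match`, and `master_match`, the filter parameters, and the 0.5 threshold are all hypothetical. The worker function stands in for the slave apparatus and runs in-process rather than on a separate machine, and the per-fingerprint matching lists of claim 9 are collapsed into a single scored list.

```python
import hashlib
import re

# Assumed token type: e-mail addresses (one of the types named in claim 8).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
NUM_BITS, NUM_HASHES = 1024, 3  # assumed Bloom filter parameters


def bit_positions(token):
    """Map a token to at most NUM_HASHES bit positions using salted SHA-256."""
    positions = set()
    for i in range(NUM_HASHES):
        digest = hashlib.sha256(f"{i}:{token}".encode("utf-8")).digest()
        positions.add(int.from_bytes(digest[:8], "big") % NUM_BITS)
    return positions


def make_signature(text, metadata):
    """Build a signature: Bloom-filter fingerprint of the tokens plus metadata."""
    bits = set()  # indices of the bits that are set in the filter
    for token in EMAIL_RE.findall(text):
        bits |= bit_positions(token.lower())
    return {"fingerprint": frozenset(bits), "metadata": dict(metadata)}


def maybe_contains(signature, token):
    """True if the token was possibly inserted; False if it definitely was not."""
    return bit_positions(token.lower()) <= signature["fingerprint"]


def closeness(a, b):
    """Jaccard similarity of two fingerprints (an assumed scoring function)."""
    union = a["fingerprint"] | b["fingerprint"]
    return len(a["fingerprint"] & b["fingerprint"]) / len(union) if union else 0.0


def worker_match(queries, database, threshold=0.5):
    """Worker step: score every query against the database, keep close matches."""
    results = []
    for query in queries:
        for name, stored in database.items():
            score = closeness(query, stored)
            if score >= threshold:
                results.append((query["metadata"]["name"], name, score))
    return results


def master_match(queries, database, num_workers=2):
    """Master step: split the query list into sub-lists and collect the results."""
    sub_lists = [queries[i::num_workers] for i in range(num_workers)]
    merged = []
    for sub_list in sub_lists:  # a real system would send these to remote workers
        merged.extend(worker_match(sub_list, database))
    return sorted(merged, key=lambda result: result[2], reverse=True)


db = {"memo.txt": make_signature("alice@example.com, bob@example.org", {"name": "memo.txt"})}
query = make_signature("Fwd: bob@example.org and alice@example.com", {"name": "copy.eml"})
print(maybe_contains(query, "alice@example.com"))  # True
print(master_match([query], db))  # [('copy.eml', 'memo.txt', 1.0)]
```

Because the fingerprint depends only on the set of extracted tokens, the two example texts produce identical fingerprints despite their different wording and token order. A Bloom filter never reports a false negative, but it can report a false positive, so a `True` from `maybe_contains` means only that the token was possibly present.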
US13/858,536 2007-12-21 2013-04-08 Automated forensic document signatures Abandoned US20130227604A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US13/858,536 US20130227604A1 (en) 2007-12-21 2013-04-08 Automated forensic document signatures

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US11/963,186 US8280905B2 (en) 2007-12-21 2007-12-21 Automated forensic document signatures
US12/118,942 US8312023B2 (en) 2007-12-21 2008-05-12 Automated forensic document signatures
US12/822,722 US8438174B2 (en) 2007-12-21 2010-06-24 Automated forensic document signatures
US13/858,536 US20130227604A1 (en) 2007-12-21 2013-04-08 Automated forensic document signatures

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US12/822,722 Continuation US8438174B2 (en) 2007-12-21 2010-06-24 Automated forensic document signatures

Publications (1)

Publication Number Publication Date
US20130227604A1 (en) 2013-08-29

Family

ID=40789809

Family Applications (3)

Application Number Title Priority Date Filing Date
US12/118,942 Active 2029-04-04 US8312023B2 (en) 2007-12-21 2008-05-12 Automated forensic document signatures
US12/822,722 Active US8438174B2 (en) 2007-12-21 2010-06-24 Automated forensic document signatures
US13/858,536 Abandoned US20130227604A1 (en) 2007-12-21 2013-04-08 Automated forensic document signatures

Country Status (6)

Country Link
US (3) US8312023B2 (en)
EP (1) EP2248062B1 (en)
AU (1) AU2010202627B2 (en)
CA (2) CA2710392C (en)
IL (1) IL206493A (en)
WO (1) WO2009085845A2 (en)

Families Citing this family (89)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11234029B2 (en) * 2017-08-17 2022-01-25 The Nielsen Company (Us), Llc Methods and apparatus to generate reference signatures from streaming media
US8738919B2 (en) * 2007-04-20 2014-05-27 Stmicroelectronics S.A. Control of the integrity of a memory external to a microprocessor
WO2009029589A1 (en) * 2007-08-25 2009-03-05 Vere Software Online evidence collection
US8250475B2 (en) * 2007-12-14 2012-08-21 International Business Machines Corporation Managing icon integrity
US20090193210A1 (en) * 2008-01-29 2009-07-30 Hewett Jeffrey R System for Automatic Legal Discovery Management and Data Collection
US8265428B2 (en) * 2008-05-13 2012-09-11 Forsigs Limited Method and apparatus for detection of data in a data store
US20090327405A1 (en) * 2008-06-27 2009-12-31 Microsoft Corporation Enhanced Client And Server Systems for Operating Collaboratively Within Shared Workspaces
US8286171B2 (en) 2008-07-21 2012-10-09 Workshare Technology, Inc. Methods and systems to fingerprint textual information using word runs
WO2010059747A2 (en) 2008-11-18 2010-05-27 Workshare Technology, Inc. Methods and systems for exact data match filtering
KR101174057B1 (en) * 2008-12-19 2012-08-16 한국전자통신연구원 Method and apparatus for analyzing and searching index
US9721227B2 (en) 2009-03-27 2017-08-01 Bank Of America Corporation Custodian management system
US9330374B2 (en) 2009-03-27 2016-05-03 Bank Of America Corporation Source-to-processing file conversion in an electronic discovery enterprise system
US8417716B2 (en) * 2009-03-27 2013-04-09 Bank Of America Corporation Profile scanner
JP5299912B2 (en) * 2009-04-20 2013-09-25 株式会社ザクティ Imaging device and data structure of image file
US8365247B1 (en) * 2009-06-30 2013-01-29 Emc Corporation Identifying whether electronic data under test includes particular information from a database
US9195808B1 (en) * 2009-07-27 2015-11-24 Exelis Inc. Systems and methods for proactive document scanning
US8538188B2 (en) * 2009-08-04 2013-09-17 Mitre Corporation Method and apparatus for transferring and reconstructing an image of a computer readable medium
US8719353B2 (en) * 2009-09-01 2014-05-06 Seaseer Research And Development Llc Systems and methods for visual messaging
US9098730B2 (en) * 2010-01-28 2015-08-04 Bdo Usa, Llp System and method for preserving electronically stored information
US20110239039A1 (en) * 2010-03-26 2011-09-29 Dieffenbach Devon C Cloud computing enabled robust initialization and recovery of it services
US8489894B2 (en) * 2010-05-26 2013-07-16 Paymetric, Inc. Reference token service
US8898177B2 (en) * 2010-09-10 2014-11-25 International Business Machines Corporation E-mail thread hierarchy detection
US8788500B2 (en) 2010-09-10 2014-07-22 International Business Machines Corporation Electronic mail duplicate detection
US8928809B2 (en) * 2010-09-15 2015-01-06 Verizon Patent And Licensing Inc. Synchronizing videos
US9626456B2 (en) * 2010-10-08 2017-04-18 Warner Bros. Entertainment Inc. Crowd sourcing for file recognition
US11030163B2 (en) 2011-11-29 2021-06-08 Workshare, Ltd. System for tracking and displaying changes in a set of related electronic documents
US10025759B2 (en) 2010-11-29 2018-07-17 Workshare Technology, Inc. Methods and systems for monitoring documents exchanged over email applications
US10783326B2 (en) 2013-03-14 2020-09-22 Workshare, Ltd. System for tracking changes in a collaborative document editing environment
KR20120065819A (en) * 2010-12-13 2012-06-21 한국전자통신연구원 Digital forensic apparatus for analyzing the user activities and method thereof
US10574729B2 (en) 2011-06-08 2020-02-25 Workshare Ltd. System and method for cross platform document sharing
US9613340B2 (en) 2011-06-14 2017-04-04 Workshare Ltd. Method and system for shared document approval
US9170990B2 (en) 2013-03-14 2015-10-27 Workshare Limited Method and system for document retrieval with selective document comparison
US9948676B2 (en) 2013-07-25 2018-04-17 Workshare, Ltd. System and method for securing documents prior to transmission
US10880359B2 (en) 2011-12-21 2020-12-29 Workshare, Ltd. System and method for cross platform document sharing
US10963584B2 (en) 2011-06-08 2021-03-30 Workshare Ltd. Method and system for collaborative editing of a remotely stored document
US8949371B1 (en) * 2011-09-29 2015-02-03 Symantec Corporation Time and space efficient method and system for detecting structured data in free text
KR20130049111A (en) * 2011-11-03 2013-05-13 한국전자통신연구원 Forensic index method and apparatus by distributed processing
US8959425B2 (en) 2011-12-09 2015-02-17 Microsoft Corporation Inference-based extension activation
CN102542405A (en) * 2011-12-14 2012-07-04 金峰顺泰知识产权有限公司 Digital archive storage and identification method and system
US9679163B2 (en) 2012-01-17 2017-06-13 Microsoft Technology Licensing, Llc Installation and management of client extensions
US9449112B2 (en) * 2012-01-30 2016-09-20 Microsoft Technology Licensing, Llc Extension activation for related documents
US9256445B2 (en) 2012-01-30 2016-02-09 Microsoft Technology Licensing, Llc Dynamic extension view with multiple levels of expansion
US8843822B2 (en) 2012-01-30 2014-09-23 Microsoft Corporation Intelligent prioritization of activated extensions
US8738569B1 (en) * 2012-02-10 2014-05-27 Emc Corporation Systematic verification of database metadata upgrade
BR112015000142A2 (en) * 2012-07-12 2017-06-27 Sony Corp transmitting device, method for processing information, program, receiving device, and application cooperation system
US8875303B2 (en) * 2012-08-02 2014-10-28 Google Inc. Detecting pirated applications
JP5526209B2 (en) * 2012-10-09 2014-06-18 株式会社Ubic Forensic system, forensic method, and forensic program
US11567907B2 (en) 2013-03-14 2023-01-31 Workshare, Ltd. Method and system for comparing document versions encoded in a hierarchical representation
US9529799B2 (en) * 2013-03-14 2016-12-27 Open Text Sa Ulc System and method for document driven actions
US9607038B2 (en) 2013-03-15 2017-03-28 International Business Machines Corporation Determining linkage metadata of content of a target document to source documents
US9460201B2 (en) 2013-05-06 2016-10-04 Iheartmedia Management Services, Inc. Unordered matching of audio fingerprints
RU2580036C2 (en) 2013-06-28 2016-04-10 Закрытое акционерное общество "Лаборатория Касперского" System and method of making flexible convolution for malware detection
US9794275B1 (en) 2013-06-28 2017-10-17 Symantec Corporation Lightweight replicas for securing cloud-based services
US10911492B2 (en) 2013-07-25 2021-02-02 Workshare Ltd. System and method for securing documents prior to transmission
US9794269B2 (en) * 2013-08-29 2017-10-17 Nbcuniversal Media, Llc Method and system for validating rights to digital content using a digital token
EP2876890A1 (en) * 2013-11-21 2015-05-27 Thomson Licensing Method and apparatus for frame accurate synchronization of video streams
JP5723067B1 (en) * 2014-02-04 2015-05-27 株式会社Ubic Data analysis system, data analysis method, and data analysis program
US9953171B2 (en) * 2014-09-22 2018-04-24 Infosys Limited System and method for tokenization of data for privacy
US9756058B1 (en) 2014-09-29 2017-09-05 Amazon Technologies, Inc. Detecting network attacks based on network requests
US9426171B1 (en) * 2014-09-29 2016-08-23 Amazon Technologies, Inc. Detecting network attacks based on network records
US9805099B2 (en) * 2014-10-30 2017-10-31 The Johns Hopkins University Apparatus and method for efficient identification of code similarity
CN104376098B (en) * 2014-11-25 2017-06-30 浪潮电子信息产业股份有限公司 A kind of files in batch method of calibration based on python
WO2016092836A1 (en) * 2014-12-10 2016-06-16 日本電気株式会社 Communication monitoring system, presentation device and presentation method thereof, analysis device, and recording medium in which computer program is stored
US9600524B2 (en) * 2014-12-22 2017-03-21 Blackberry Limited Method and system for efficient feature matching
US11182551B2 (en) 2014-12-29 2021-11-23 Workshare Ltd. System and method for determining document version geneology
US10133723B2 (en) 2014-12-29 2018-11-20 Workshare Ltd. System and method for determining document version geneology
US10277402B2 (en) * 2015-03-09 2019-04-30 Lenovo (Singapore) Pte. Ltd. Digitally signing a document
US10114900B2 (en) * 2015-03-23 2018-10-30 Virtru Corporation Methods and systems for generating probabilistically searchable messages
JP6561529B2 (en) * 2015-03-26 2019-08-21 富士通株式会社 Document inspection apparatus, method, and program
US9438613B1 (en) * 2015-03-30 2016-09-06 Fireeye, Inc. Dynamic content activation for automated analysis of embedded objects
EP3089051B1 (en) * 2015-04-28 2018-04-11 Micro Systemation AB Database rollback using wal
JP6753398B2 (en) * 2015-06-26 2020-09-09 日本電気株式会社 Information processing equipment, information processing system, information processing method, and program
US9680844B2 (en) 2015-07-06 2017-06-13 Bank Of America Corporation Automation of collection of forensic evidence
US11763013B2 (en) 2015-08-07 2023-09-19 Workshare, Ltd. Transaction document management system and method
US9836535B2 (en) * 2015-08-25 2017-12-05 TCL Research America Inc. Method and system for content retrieval based on rate-coverage optimization
US20170091311A1 (en) * 2015-09-30 2017-03-30 International Business Machines Corporation Generation and use of delta index
US10394803B2 (en) * 2015-11-13 2019-08-27 International Business Machines Corporation Method and system for semantic-based queries using word vector representation
US10373131B2 (en) 2016-01-04 2019-08-06 Bank Of America Corporation Recurring event analyses and data push
US9679426B1 (en) * 2016-01-04 2017-06-13 Bank Of America Corporation Malfeasance detection based on identification of device signature
US10977284B2 (en) 2016-01-29 2021-04-13 Micro Focus Llc Text search of database with one-pass indexing including filtering
CA3014072A1 (en) 2016-02-08 2017-08-17 Acxiom Corporation Change fingerprinting for database tables, text files, and data feeds
CA3043863A1 (en) * 2016-03-21 2017-09-28 Liveramp, Inc. Data watermarking and fingerprinting system and method
US10909173B2 (en) * 2016-12-09 2021-02-02 The Nielsen Company (Us), Llc Scalable architectures for reference signature matching and updating
US10735457B2 (en) 2017-10-03 2020-08-04 Microsoft Technology Licensing, Llc Intrusion investigation
KR101897987B1 (en) * 2017-11-24 2018-09-12 주식회사 포드림 Method, apparatus and system for managing electronic fingerprint of electronic file
US11032251B2 (en) * 2018-06-29 2021-06-08 International Business Machines Corporation AI-powered cyber data concealment and targeted mission execution
EP3948597A4 (en) * 2019-03-29 2022-12-14 Drexel University Learned forensic source system for identification of image capture device models and forensic similarity of digital images
US20210352341A1 (en) * 2020-05-06 2021-11-11 At&T Intellectual Property I, L.P. Scene cut-based time alignment of video streams
US20220414377A1 (en) * 2021-06-23 2022-12-29 Motorola Solutions, Inc. System and method for presenting statements captured at an incident scene

Family Cites Families (64)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4697209A (en) 1984-04-26 1987-09-29 A. C. Nielsen Company Methods and apparatus for automatically identifying programs viewed or recorded
AU662805B2 (en) * 1992-04-06 1995-09-14 Addison M. Fischer A method for processing information among computers which may exchange messages
US7224819B2 (en) 1995-05-08 2007-05-29 Digimarc Corporation Integrating digital watermarks in multimedia content
US5960081A (en) 1997-06-05 1999-09-28 Cray Research, Inc. Embedding a digital signature in a video sequence
US6014183A (en) 1997-08-06 2000-01-11 Imagine Products, Inc. Method and apparatus for detecting scene changes in a digital video stream
US6263319B1 (en) * 1997-09-26 2001-07-17 Masconi Commerce Systems Inc. Fuel dispensing and retail system for providing a shadow ledger
US6078917A (en) 1997-12-18 2000-06-20 International Business Machines Corporation System for searching internet using automatic relevance feedback
US6345283B1 (en) 1998-07-20 2002-02-05 New Technologies Armor, Inc. Method and apparatus for forensic analysis of information stored in computer-readable media
US6279010B1 (en) 1998-07-20 2001-08-21 New Technologies Armor, Inc. Method and apparatus for forensic analysis of information stored in computer-readable media
US6263349B1 (en) 1998-07-20 2001-07-17 New Technologies Armor, Inc. Method and apparatus for identifying names in ambient computer data
US7739114B1 (en) 1999-06-30 2010-06-15 International Business Machines Corporation Methods and apparatus for tracking speakers in an audio stream
US6754364B1 (en) 1999-10-28 2004-06-22 Microsoft Corporation Methods and systems for fingerprinting digital data
US6772196B1 (en) 2000-07-27 2004-08-03 Propel Software Corp. Electronic mail filtering system and methods
US6714683B1 (en) 2000-08-24 2004-03-30 Digimarc Corporation Wavelet based feature modulation watermarks and related applications
US9027121B2 (en) 2000-10-10 2015-05-05 International Business Machines Corporation Method and system for creating a record for one or more computer security incidents
GB0029893D0 (en) 2000-12-07 2001-01-24 Sony Uk Ltd Video information retrieval
AU2002232817A1 (en) 2000-12-21 2002-07-01 Digimarc Corporation Methods, apparatus and programs for generating and utilizing content signatures
CN1295904C (en) 2001-01-10 2007-01-17 思科技术公司 Computer security and management system
US7603709B2 (en) 2001-05-03 2009-10-13 Computer Associates Think, Inc. Method and apparatus for predicting and preventing attacks in communications networks
US7529659B2 (en) 2005-09-28 2009-05-05 Audible Magic Corporation Method and apparatus for identifying an unknown work
US20030105739A1 (en) 2001-10-12 2003-06-05 Hassane Essafi Method and a system for identifying and verifying the content of multimedia documents
GB2381688B (en) * 2001-11-03 2004-09-22 Dremedia Ltd Time ordered indexing of audio-visual data
AU2002364961A1 (en) * 2001-11-20 2003-06-30 Pierre Bierre Secure identification system combining forensic/biometric population database and issuance of relationship-specific identifiers toward enhanced privacy
US7260722B2 (en) 2001-12-28 2007-08-21 Itt Manufacturing Enterprises, Inc. Digital multimedia watermarking for source identification
US7080091B2 (en) * 2002-05-09 2006-07-18 Oracle International Corporation Inverted index system and method for numeric attributes
US6792545B2 (en) 2002-06-20 2004-09-14 Guidance Software, Inc. Enterprise computer investigation system
US7110338B2 (en) 2002-08-06 2006-09-19 Matsushita Electric Industrial Co., Ltd. Apparatus and method for fingerprinting digital media
EP1563393A4 (en) 2002-10-22 2010-12-22 Unho Choi Integrated emergency response system in information infrastructure and operating method therefor
US7738704B2 (en) 2003-03-07 2010-06-15 Technology, Patents And Licensing, Inc. Detecting known video entities utilizing fingerprints
US6839724B2 (en) * 2003-04-17 2005-01-04 Oracle International Corporation Metamodel-based metadata change management
US7359006B1 (en) 2003-05-20 2008-04-15 Micronas Usa, Inc. Audio module supporting audio signature
US9678967B2 (en) 2003-05-22 2017-06-13 Callahan Cellular L.L.C. Information source agent systems and methods for distributed data storage and management using content signatures
US20070276823A1 (en) 2003-05-22 2007-11-29 Bruce Borden Data management systems and methods for distributed data storage and management using content signatures
US7496959B2 (en) 2003-06-23 2009-02-24 Architecture Technology Corporation Remote collection of computer forensic evidence
GB2404296A (en) 2003-07-23 2005-01-26 Sony Uk Ltd Data content identification using watermarks as distinct codes
GB2405227A (en) * 2003-08-16 2005-02-23 Ibm Authenticating publication date of a document
CA2540575C (en) 2003-09-12 2013-12-17 Kevin Deng Digital video signature apparatus and methods for use with video program identification systems
US7516492B1 (en) 2003-10-28 2009-04-07 Rsa Security Inc. Inferring document and content sensitivity from public account accessibility
US20040133548A1 (en) 2003-12-15 2004-07-08 Alex Fielding Electronic Files Digital Rights Management.
US8612479B2 (en) * 2004-02-13 2013-12-17 Fis Financial Compliance Solutions, Llc Systems and methods for monitoring and detecting fraudulent uses of business applications
US7984089B2 (en) 2004-02-13 2011-07-19 Microsoft Corporation User-defined indexing of multimedia content
US7742617B2 (en) 2004-05-19 2010-06-22 Bentley Systems, Inc. Document genealogy
EP1756693A1 (en) 2004-05-28 2007-02-28 Koninklijke Philips Electronics N.V. Method and apparatus for content item signature matching
US7434058B2 (en) 2004-06-07 2008-10-07 Reconnex Corporation Generating signatures over a document
US7594277B2 (en) * 2004-06-30 2009-09-22 Microsoft Corporation Method and system for detecting when an outgoing communication contains certain content
US7506379B2 (en) 2004-11-04 2009-03-17 International Business Machines Corporation Method and system for storage-based intrusion detection and recovery
GB0424479D0 (en) 2004-11-05 2004-12-08 Ibm Generating a fingerprint for a document
US20060256739A1 (en) 2005-02-19 2006-11-16 Kenneth Seier Flexible multi-media data management
US20060253423A1 (en) 2005-05-07 2006-11-09 Mclane Mark Information retrieval system and method
US7516130B2 (en) 2005-05-09 2009-04-07 Trend Micro, Inc. Matching engine with signature generation
ATE498358T1 (en) 2005-06-29 2011-03-15 Compumedics Ltd SENSOR ARRANGEMENT WITH CONDUCTIVE BRIDGE
US20070016648A1 (en) * 2005-07-12 2007-01-18 Higgins Ronald C Enterprise Message Mangement
US7941386B2 (en) * 2005-10-19 2011-05-10 Adf Solutions, Inc. Forensic systems and methods using search packs that can be edited for enterprise-wide data identification, data sharing, and management
US7603344B2 (en) 2005-10-19 2009-10-13 Advanced Digital Forensic Solutions, Inc. Methods for searching forensic data
US8326775B2 (en) 2005-10-26 2012-12-04 Cortica Ltd. Signature generation for multimedia deep-content-classification by a large-scale matching system and method thereof
WO2007075813A2 (en) 2005-12-23 2007-07-05 Advanced Digital Forensic Solutions, Inc. Enterprise-wide data identification, sharing and management, and searching forensic data
US8009861B2 (en) 2006-04-28 2011-08-30 Vobile, Inc. Method and system for fingerprinting digital video object based on multiresolution, multirate spatial and temporal signatures
US20070283158A1 (en) * 2006-06-02 2007-12-06 Microsoft Corporation Microsoft Patent Group System and method for generating a forensic file
US8577889B2 (en) * 2006-07-18 2013-11-05 Aol Inc. Searching for transient streaming multimedia resources
US7765215B2 (en) * 2006-08-22 2010-07-27 International Business Machines Corporation System and method for providing a trustworthy inverted index to enable searching of records
US7752193B2 (en) * 2006-09-08 2010-07-06 Guidance Software, Inc. System and method for building and retrieving a full text index
US8312558B2 (en) 2007-01-03 2012-11-13 At&T Intellectual Property I, L.P. System and method of managing protected video content
US20080065811A1 (en) 2007-11-12 2008-03-13 Ali Jahangiri Tool and method for forensic examination of a computer
KR101078288B1 (en) 2009-08-21 2011-10-31 한국전자통신연구원 Method and apparatus for collecting evidence

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9535994B1 (en) * 2010-03-26 2017-01-03 Jonathan Grier Method and system for forensic investigation of data access
US9824091B2 (en) 2010-12-03 2017-11-21 Microsoft Technology Licensing, Llc File system backup using change journal
US10558617B2 (en) 2010-12-03 2020-02-11 Microsoft Technology Licensing, Llc File system backup using change journal
US11100063B2 (en) 2010-12-21 2021-08-24 Microsoft Technology Licensing, Llc Searching files
US20140081948A1 (en) * 2010-12-21 2014-03-20 Microsoft Corporation Searching files
US9870379B2 (en) * 2010-12-21 2018-01-16 Microsoft Technology Licensing, Llc Searching files
US20140081917A1 (en) * 2010-12-21 2014-03-20 Microsoft Corporation Searching files
US9229818B2 (en) 2011-07-20 2016-01-05 Microsoft Technology Licensing, Llc Adaptive retention for backup data
US10439994B2 (en) 2014-07-15 2019-10-08 Samsung Electronics Co., Ltd. Method and device for encrypting and decrypting multimedia content
WO2018026802A1 (en) * 2016-08-02 2018-02-08 Child Rescue Coalition, Inc. Identification of portions of data
US11263177B2 (en) 2016-08-02 2022-03-01 Child Rescue Coalition, Inc. Identification of portions of data
US11017221B2 (en) 2018-07-01 2021-05-25 International Business Machines Corporation Classifying digital documents in multi-document transactions based on embedded dates
US11810070B2 (en) 2018-07-01 2023-11-07 International Business Machines Corporation Classifying digital documents in multi-document transactions based on embedded dates
US11003889B2 (en) 2018-10-22 2021-05-11 International Business Machines Corporation Classifying digital documents in multi-document transactions based on signatory role analysis
US11769014B2 (en) 2018-10-22 2023-09-26 International Business Machines Corporation Classifying digital documents in multi-document transactions based on signatory role analysis

Also Published As

Publication number Publication date
EP2248062A2 (en) 2010-11-10
EP2248062A4 (en) 2012-11-14
US8312023B2 (en) 2012-11-13
US20090164427A1 (en) 2009-06-25
US8438174B2 (en) 2013-05-07
WO2009085845A2 (en) 2009-07-09
WO2009085845A3 (en) 2009-10-22
CA2992001A1 (en) 2009-07-09
IL206493A (en) 2015-09-24
US20100287196A1 (en) 2010-11-11
CA2710392C (en) 2018-03-13
CA2992001C (en) 2020-01-21
IL206493A0 (en) 2010-12-30
EP2248062B1 (en) 2018-09-26
AU2010202627B2 (en) 2014-06-19
AU2010202627A1 (en) 2010-07-15

Similar Documents

Publication Publication Date Title
US8438174B2 (en) Automated forensic document signatures
US8280905B2 (en) Automated forensic document signatures
US8219588B2 (en) Methods for searching forensic data
US7941386B2 (en) Forensic systems and methods using search packs that can be edited for enterprise-wide data identification, data sharing, and management
CA2791794C (en) A method and system for managing confidential information
US8041719B2 (en) Personal computing device-based mechanism to detect preselected data
JP5165126B2 (en) Method and apparatus for handling messages containing preselected data
Shields et al. A system for the proactive, continuous, and efficient collection of digital forensic evidence
US20070139231A1 (en) Systems and methods for enterprise-wide data identification, sharing and management in a commercial context
US9152706B1 (en) Anonymous identification tokens
Damshenas et al. A survey on digital forensics trends
JP4903386B2 (en) Searchable information content for pre-selected data
WO2007075813A2 (en) Enterprise-wide data identification, sharing and management, and searching forensic data
Kayarkar et al. Mining frequent sequences for emails in cyber forensics investigation
AU2014202526A1 (en) Automated forensic document signatures
Wang et al. Research on some relevant problems in computer forensics
Walls Inference-based forensics for extracting information from diverse sources
Lin et al. Introduction to computer forensics
Flaglien Cross-computer malware detection in digital forensics
WO2007081960A2 (en) Enterprise-wide data identification, sharing and management
US20230205896A1 (en) Methods for securing data
Wang et al. Computer forensics in communication networks
Nalini et al. Implementation of Indexing Techniques to Prevent Data Leakage and Duplication in Internet
Malwadkar et al. Data mining techniques for digital forensic analysis
Shankar Lingam et al. Network Information Security Model Based on Web Data Mining

Legal Events

Date Code Title Description
AS Assignment
Owner name: GEORGETOWN UNIVERSITY, DISTRICT OF COLUMBIA
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: SHIELDS, THOMAS CLAY; FRIEDER, OPHIR; MALOOF, MARCUS A.; REEL/FRAME: 035974/0125
Effective date: 20100713

STCB Information on status: application discontinuation
Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION