Audio Processing and Music Recognition

(Autonomous Institution, approved by UGC and Accredited by NAAC with A Grade)

TECHNICAL SEMINAR
Presented by Mrinmoy Dalal
CSE A (13311A0506)
16 February 2016


AUDIO PROCESSING AND MUSIC RECOGNITION


SOUND


WHAT IS SOUND

Definition
Physical: sound as a disturbance in the air.
Psychophysical: sound as perceived by the ear.
Sound is both a stimulus (a physical event) and a sensation.
Physically, sound consists of pressure changes in the band from 20 Hz to 20 kHz.
Acoustics is the study of sound.

Physical terms
Amplitude
Frequency
Spectrum


HOW DO WE HEAR

The ear is connected to the brain
    left brain: speech
    right brain: music
The ear's sensitivity to frequency is logarithmic
Varying frequency response
Dynamic range is about 120 dB (at 3-4 kHz)
Frequency discrimination is about 2 Hz (at 1 kHz)
An intensity change of 1 dB can be detected


DIGITAL AUDIO


FUNDAMENTALS

Digital audio is sound reproduction using pulse-code modulation and digital signals.
Digital audio systems include analog-to-digital conversion (ADC), digital-to-analog conversion (DAC), digital storage, processing and transmission components.
A primary benefit of digital audio is its convenience of storage, transmission and retrieval.
Digital audio is useful in the recording, manipulation, mass-production, and distribution of sound.
Modern distribution of music across the Internet via online stores depends on digital recording and digital compression algorithms.


SOUND: PHYSICAL TO DIGITAL

PULSE CODE MODULATION

PCM consists of three steps to digitize an analog signal:
Sampling
Quantization
Binary encoding
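The three steps can be sketched in a few lines of Python. This is a minimal illustration, not a production encoder: the function name pcm_encode, the uniform quantizer, and the tiny parameters (8 samples per second, 3 bits per sample) are all assumptions chosen so that the output is easy to read.

```python
import math

def pcm_encode(signal, duration_s, sample_rate, bits):
    """Sample a continuous signal, quantize each sample, and encode it in binary.

    `signal` is a function of time (seconds) returning a value in [-1.0, 1.0].
    """
    levels = 2 ** bits
    codes = []
    for n in range(int(duration_s * sample_rate)):
        # 1. Sampling: read the signal at evenly spaced instants.
        x = signal(n / sample_rate)
        # 2. Quantization: map the amplitude to the nearest of `levels` steps.
        q = round((x + 1.0) / 2.0 * (levels - 1))
        q = max(0, min(levels - 1, q))
        # 3. Binary encoding: write the step index as a fixed-width bit string.
        codes.append(format(q, f"0{bits}b"))
    return codes

def tone(t):
    """A 1 Hz sine wave standing in for the analog input (hypothetical)."""
    return math.sin(2 * math.pi * 1.0 * t)

codes = pcm_encode(tone, duration_s=1.0, sample_rate=8, bits=3)
print(codes)
```

Raising sample_rate captures higher frequencies, and raising bits reduces the rounding error introduced in the quantization step.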



SAMPLING

1 song = 27.2 MB

A 1 GB hard drive ($899 in 1995) would hold 35 songs.
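The size of uncompressed PCM audio follows directly from the sampling parameters. The sketch below assumes CD-quality settings (44,100 samples per second, 16 bits per sample, two channels) and 1 MB = 1,000,000 bytes; the slides do not state which settings or song length were used, so these are assumptions. Under them, 27.2 MB corresponds to roughly two and a half minutes of audio.

```python
def pcm_size_bytes(duration_s, sample_rate=44_100, bits_per_sample=16, channels=2):
    """Uncompressed size: seconds x samples/second x bytes/sample x channels."""
    return duration_s * sample_rate * (bits_per_sample // 8) * channels

def pcm_duration_s(size_bytes, sample_rate=44_100, bits_per_sample=16, channels=2):
    """Inverse calculation: how many seconds of audio fit in `size_bytes`."""
    return size_bytes / pcm_size_bytes(1, sample_rate, bits_per_sample, channels)

print(pcm_size_bytes(1))              # bytes needed for one second of audio
print(round(pcm_duration_s(27.2e6)))  # seconds of audio in a 27.2 MB file
```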

AUDIO COMPRESSION

Audio data compression, as distinguished from dynamic range compression, has the potential to reduce the transmission bandwidth and storage requirements of audio data.
Audio compression algorithms are implemented in software as audio codecs.
Lossy audio compression algorithms provide higher compression at the cost of fidelity and are used in numerous audio applications.
Lossless audio compression produces a representation of digital data that decompresses to an exact digital duplicate of the original audio stream.
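The defining property of lossless compression, an exact round trip, can be checked in a few lines. This sketch uses Python's general-purpose zlib module as a stand-in; it is not an audio codec, and dedicated lossless audio codecs use different techniques. The 440 Hz test tone and the 8000 Hz sample rate are arbitrary assumptions.

```python
import math
import struct
import zlib

# One second of a 440 Hz tone as 16-bit little-endian PCM (hypothetical test signal).
RATE = 8000
samples = [int(12000 * math.sin(2 * math.pi * 440 * n / RATE)) for n in range(RATE)]
raw = struct.pack(f"<{len(samples)}h", *samples)

packed = zlib.compress(raw, 9)       # compress the PCM bytes
restored = zlib.decompress(packed)   # decompress them again

# Lossless: the restored stream is bit-for-bit identical to the original.
print(restored == raw)
print(len(raw), "bytes before,", len(packed), "bytes after")
```

A lossy codec would fail the equality check by design: it discards information in exchange for a smaller output.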

AUDIO FILE FORMATS

RIFF (Resource Interchange File Format): MS WAV and .AVI
MPEG Audio Layer (MPEG) [.mpa, .mp3]
AIFC (Apple, SGI) [.aiff, .aif]
HCOM (Mac) [.hcom]
SND (Sun, NeXT) [.snd]
VOC (SoundBlaster card proprietary standard) [.voc]
And many others!


WHAT'S IN A SOUND FILE

Header information
    Magic cookie
    Sampling rate
    Bits per sample
    Channels
    Byte order (endian)
    Compression type
Data
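For a WAV file, most of these header fields can be read with Python's standard wave module. The sketch below writes a small silent WAV file into memory and reads its header back; the helper names write_test_wav and describe_wav, and the chosen parameters, are assumptions for illustration. In a WAV file the magic cookie is the four bytes "RIFF" at the start of the file.

```python
import io
import wave

def write_test_wav(sample_rate=8000, channels=1, sample_width=2, n_frames=100):
    """Create a small silent WAV file in memory and return its bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(sample_width)   # bytes per sample
        w.setframerate(sample_rate)
        w.writeframes(b"\x00" * (n_frames * channels * sample_width))
    return buf.getvalue()

def describe_wav(data):
    """Return the header fields of a WAV file given as bytes."""
    with wave.open(io.BytesIO(data), "rb") as r:
        return {
            "magic": data[:4],                       # magic cookie
            "sampling_rate": r.getframerate(),
            "bits_per_sample": r.getsampwidth() * 8,
            "channels": r.getnchannels(),
            "compression": r.getcomptype(),          # 'NONE' for plain PCM
            "frames": r.getnframes(),                # length of the data section
        }

info = describe_wav(write_test_wav())
for field, value in info.items():
    print(field, "=", value)
```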


AUDIO PROCESSING


AUDIO PROCESSING

Audio signal processing, sometimes referred to as audio processing, is the intentional alteration of auditory signals, or sound, often through an audio effect or effects unit. As audio signals may be electronically represented in either digital or analog format, signal processing may occur in either domain. Analog processors operate directly on the electrical signal, while digital processors operate mathematically on the digital representation of that signal.

Processing methods and application areas include storage, level compression, data compression, transmission, etc.


AUDIO PROCESSING TECHNIQUES

Equalization
Modulation
Delay
Chorus
Flanger
Phaser
Pitch Shifting
Time Stretching
Active Noise Control
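As an illustration of how a digital processor operates mathematically on samples, here is a minimal sketch of one technique from the list, delay: the output is the input mixed with a copy of itself shifted later in time. The function name delay_effect and the mix parameter are assumptions; real effects units typically add feedback and filtering.

```python
def delay_effect(samples, sample_rate, delay_s, mix=0.5):
    """Mix each sample with the sample that occurred `delay_s` seconds earlier."""
    d = int(round(delay_s * sample_rate))   # delay expressed in samples
    out = []
    for n, x in enumerate(samples):
        delayed = samples[n - d] if n >= d else 0.0   # silence before the signal starts
        out.append(x + mix * delayed)
    return out

# A single click followed by silence, sampled 4 times per second.
click = [1.0, 0.0, 0.0, 0.0, 0.0]
print(delay_effect(click, sample_rate=4, delay_s=0.5))
```

The click reappears half a second later at half its original level, which is heard as a single echo.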

MUSIC RECOGNITION

AUDIO FINGERPRINTING


AUDIO FINGERPRINT DEFINITION

An audio fingerprint function F is essentially a hash function that maps an audio object consisting of a large number of bits (for example, 5 MB) to a fingerprint of only a limited number of bits (for example, 100 KB). The audio object can then be identified from this bit string.

AUDIO FINGERPRINTING ARCHITECTURE

CODEC LAYER

i) Samples (unsigned char* samples): A buffer of the actual data samples (2 bytes or 16 bits per sample).

ii) Byte order (int byteOrder): The byte order of the samples. This can be CONST_LITTLE_ENDIAN or CONST_BIG_ENDIAN.

iii) Number of samples (long size): The number of samples read.

iv) Sample rate (int sRate): The number of samples per second of audio (samples/sec).

v) Stereo (bool stereo): Boolean value indicating whether the audio is stereo.

vi) Duration: Duration of the original audio, regardless of the number of samples.

vii) Format: Format of the original audio, expressed as a file extension (.mp3, .wav, etc.).
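The codec layer's output can be modeled as a single record. The sketch below mirrors the seven fields in Python; the class name DecodedAudio, the numeric values of the two byte-order constants, and the choice of signed 16-bit samples are assumptions, since the slides give only the field names and C types.

```python
import struct
from dataclasses import dataclass

# The constant names come from the slides; the values are assumed.
CONST_LITTLE_ENDIAN = 0
CONST_BIG_ENDIAN = 1

@dataclass
class DecodedAudio:
    samples: bytes      # raw buffer, 2 bytes per sample
    byte_order: int     # CONST_LITTLE_ENDIAN or CONST_BIG_ENDIAN
    size: int           # number of samples read
    sample_rate: int    # samples per second
    stereo: bool        # True if the audio has two channels
    duration: float     # duration of the original audio in seconds
    format: str         # file extension of the original, e.g. ".wav"

    def as_integers(self):
        """Decode the buffer into signed 16-bit integers using the stated byte order."""
        prefix = "<" if self.byte_order == CONST_LITTLE_ENDIAN else ">"
        return list(struct.unpack(f"{prefix}{self.size}h", self.samples))

audio = DecodedAudio(samples=b"\x01\x00\xff\xff", byte_order=CONST_LITTLE_ENDIAN,
                     size=2, sample_rate=8000, stereo=False, duration=0.00025,
                     format=".wav")
print(audio.as_integers())
```

Carrying the byte order with the buffer matters because the same two bytes decode to different numbers on little-endian and big-endian readings.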

FINGERPRINTING LAYER

The fingerprinting layer carries out the core mathematical analysis of the audio, thereby converting a 5 MB audio file into a 100 KB fingerprint (a bit string).

WAV (5 MB) -> fea690b1-b11dce98-a (100 KB)

PROTOCOL LAYER

The client sends the fingerprint to the server in an HTTP POST request:

    POST /path/script.cgi HTTP/1.0
    From: [email protected]
    User-Agent: HTTPTool/1.0
    Content-Type: application/x-www-form-urlencoded
    Content-Length: 42

    client_id=42&fingerprint=fea690b1b11dce98a

The server looks up the fingerprint in its database and returns the result as XML. An XML parser on the client extracts the song information:

    Album: The Wall
    Song: Comfortably Numb
    Artist: Pink Floyd
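Both halves of the protocol layer can be sketched with Python's standard library: building the form-encoded POST request and parsing an XML reply. The function names, the XML element names (track, album, song, artist), and the placeholder reply are assumptions; the slides do not show the actual XML schema. Computing Content-Length from the body avoids a mismatch between the header and the data sent.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

def build_post_request(client_id, fingerprint, path="/path/script.cgi"):
    """Return the text of an HTTP/1.0 POST request carrying the fingerprint."""
    body = urlencode({"client_id": client_id, "fingerprint": fingerprint})
    headers = [
        f"POST {path} HTTP/1.0",
        "User-Agent: HTTPTool/1.0",
        "Content-Type: application/x-www-form-urlencoded",
        f"Content-Length: {len(body.encode('ascii'))}",   # length of the body in bytes
    ]
    return "\r\n".join(headers) + "\r\n\r\n" + body

def parse_response(xml_text):
    """Turn a flat XML reply into a dict of tag name -> text."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}

request = build_post_request(42, "fea690b1b11dce98a")
print(request)

# Hypothetical server reply with placeholder values.
reply = "<track><album>Album X</album><song>Song A</song><artist>Artist 1</artist></track>"
print(parse_response(reply))
```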


HOW SHAZAM WORKS

i) Beforehand, Shazam fingerprints a comprehensive catalog of music and stores the fingerprints in a database.

ii) A user tags a song they hear; the app fingerprints a 10-second sample of the audio.

iii) The Shazam app uploads the fingerprint to Shazam's service, which searches its database for a matching fingerprint.

iv) If a match is found, the song information is returned to the user; otherwise an error is returned.

SPECTROGRAM FINGERPRINTING

You can think of any piece of music as a time-frequency graph called a spectrogram. On one axis is time, on another is frequency, and on the third is intensity. Each point on the graph represents the intensity of a given frequency at a specific point in time. Assuming time is on the x-axis and frequency is on the y-axis, a horizontal line would represent a continuous pure tone and a vertical line would represent an instantaneous burst of white noise.
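A spectrogram can be computed by cutting the signal into short frames and measuring the intensity of each frequency in every frame. The sketch below uses a naive discrete Fourier transform so that it needs only the standard library; the frame size, sample rate, and test tone are assumptions, and practical systems use a fast Fourier transform with overlapping, windowed frames. For a continuous pure tone, the loudest frequency is the same in every frame, which is the horizontal line described above.

```python
import cmath
import math

def spectrogram(samples, sample_rate, frame_size):
    """Return (freqs_hz, frames); frames[t][k] is the intensity of freqs_hz[k] in frame t."""
    n_bins = frame_size // 2 + 1
    freqs_hz = [k * sample_rate / frame_size for k in range(n_bins)]
    frames = []
    # Consecutive, non-overlapping frames: each one is a column of the graph.
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        # Naive DFT: slow, but enough for a short example.
        frames.append([
            abs(sum(x * cmath.exp(-2j * math.pi * k * n / frame_size)
                    for n, x in enumerate(frame)))
            for k in range(n_bins)
        ])
    return freqs_hz, frames

# A continuous pure tone: 1000 Hz sampled at 8000 Hz for 256 samples.
RATE, FRAME = 8000, 64
tone = [math.sin(2 * math.pi * 1000 * n / RATE) for n in range(256)]
freqs_hz, frames = spectrogram(tone, RATE, FRAME)

# The loudest frequency in each time frame.
loudest = [freqs_hz[max(range(len(m)), key=m.__getitem__)] for m in frames]
print(loudest)
```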

SPECTROGRAM FINGERPRINTING

The Shazam algorithm fingerprints a song by generating this 3D graph and identifying frequencies of peak intensity. For each of these peak points it keeps track of the frequency and the amount of time from the beginning of the track.

Frequency in Hz    Time in seconds
823.44             1.054
1892.31            1.321
712.84             1.703
. . .              . . .
819.71             9.943
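Peak picking can be sketched as follows. This simplified version keeps only the single loudest frequency in each time frame; that rule, the function name peak_points, and the tiny hand-written spectrogram are assumptions for illustration, not Shazam's actual peak selection.

```python
def peak_points(frames, sample_rate, frame_size):
    """Return one (frequency_hz, time_s) pair per frame: the bin of peak intensity.

    frames[t][k] is the intensity of frequency k * sample_rate / frame_size
    in the frame that starts t * frame_size / sample_rate seconds into the track.
    """
    points = []
    for t, intensities in enumerate(frames):
        k = max(range(len(intensities)), key=intensities.__getitem__)
        points.append((k * sample_rate / frame_size, t * frame_size / sample_rate))
    return points

# A hypothetical spectrogram: 3 time frames, 4 frequency bins (0, 100, 200, 300 Hz).
frames = [
    [0.0, 1.0, 5.0, 1.0],
    [0.0, 9.0, 1.0, 1.0],
    [0.0, 1.0, 1.0, 7.0],
]
for freq, time_s in peak_points(frames, sample_rate=600, frame_size=6):
    print(freq, time_s)
```

The resulting list of (frequency, time) pairs has the same shape as the table above.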

SPECTROGRAM FINGERPRINTING

Shazam builds its fingerprint catalog as a hash table in which the key is the frequency. When Shazam receives a fingerprint like the one above, it takes the first key (in this case 823.44) and searches for all matching songs.

Frequency in Hz    Time in seconds, song information
823.43             53.352, Song A by Artist 1
823.44             34.678, Song B by Artist 2
823.45             108.65, Song C by Artist 3
. . .              . . .
1892.31            34.945, Song B by Artist 2
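A frequency-keyed catalog can be sketched with a Python dictionary. Because the table treats 823.43, 823.44 and 823.45 as hits for the same query, the sketch groups frequencies into 1 Hz buckets before using them as keys; that bucket width and the function names are assumptions, not Shazam's actual hashing scheme.

```python
from collections import defaultdict

def freq_key(freq_hz):
    """Quantize a frequency to a hash key (assumed: 1 Hz buckets)."""
    return int(freq_hz)

def build_index(catalog):
    """Map each frequency key to the (time_s, song) entries where it occurs."""
    index = defaultdict(list)
    for song, points in catalog.items():
        for freq_hz, time_s in points:
            index[freq_key(freq_hz)].append((time_s, song))
    return index

# The catalog entries from the table above.
catalog = {
    "Song A by Artist 1": [(823.43, 53.352)],
    "Song B by Artist 2": [(823.44, 34.678), (1892.31, 34.945)],
    "Song C by Artist 3": [(823.45, 108.65)],
}
index = build_index(catalog)

# Look up the first key of the sample fingerprint.
hits = index.get(freq_key(823.44), [])
for time_s, song in hits:
    print(time_s, song)
```

A single frequency matches several songs, which is why the time relationship between hits has to be checked before a match is declared.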

SPECTROGRAM FINGERPRINTING

If a specific song is hit multiple times, Shazam then checks whether these frequencies correspond in time. It creates a 2D plot of the frequency hits: one axis is the time from the beginning of the track at which those frequencies appear in the song, and the other axis is the time at which they appear in the sample. If there is a temporal relation between the two sets of points, the points align along a diagonal. Another signal processing method is used to find this line, and if it exists with some certainty, the song is labeled a match.
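Points on such a diagonal share the same difference between song time and sample time, so one simple way to detect the line is to histogram those differences and look for a dominant value. The slides do not say which method Shazam uses; the histogram approach, the function name best_alignment, and the 0.1 second resolution are assumptions. The first two hits use the Song B values from the example tables, and the third is a hypothetical stray hit.

```python
from collections import Counter

def best_alignment(hits, resolution_s=0.1):
    """Find the most common time offset among (song_time_s, sample_time_s) hits.

    Returns (offset_s, count): the dominant offset and how many hits share it.
    """
    if not hits:
        return None, 0
    offsets = Counter(
        round((song_t - sample_t) / resolution_s) for song_t, sample_t in hits
    )
    bucket, count = offsets.most_common(1)[0]
    return bucket * resolution_s, count

hits_song_b = [
    (34.678, 1.054),   # 823.44 Hz: time in song, time in sample
    (34.945, 1.321),   # 1892.31 Hz
    (80.000, 1.703),   # hypothetical stray hit that does not line up
]
offset_s, count = best_alignment(hits_song_b)
print(offset_s, count)
```

A song would be declared a match when the count for its dominant offset is high enough; the threshold is a tuning decision not covered here.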
