Is there an inverse of grep: finding short lines in long patterns?



You can use grep's -f switch to read the search patterns from a file:

egrep -i -f lookup_file pattern_file >> result_file

This will be faster because grep compiles lookup_file into a single state machine that checks all patterns at the same time, rather than checking each pattern against each line separately.

If your lookup_file consists of plain text and not regular expressions, you can use fgrep (or grep -F) and it will be even faster.

To get your ideal output you can add the -n and -o switches, which give you the line number and the matching part for each match.
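A minimal sketch of that pipeline, using the file names from the question (the demo contents are made up):

```shell
# demo inputs (names follow the question; contents are assumed)
printf '%s\n' 'Sun' 'Beautiful' > lookup_file
printf '%s\n' 'The sun is shining' 'It is a beautiful day' > pattern_file

# -F treats lookup entries as fixed strings (same as fgrep),
# -o prints only the matching part, -n prefixes the line number
grep -i -o -n -F -f lookup_file pattern_file
# prints:
#   1:sun
#   2:beautiful
```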


Since you indicated any language is acceptable I will post a completely different approach: with shell scripting you will never beat the performance of in-memory tools or databases. If you have a lot of data you should use a database, which is meant for this kind of operation and scales much better.

So here is a simple example using SQLite (www.sqlite.org).

You need to import your patterns and data into tables, like this for example (you can script this if you want):

CREATE TABLE patterns (pattern TEXT);
CREATE TABLE data (sentence TEXT);
BEGIN;
INSERT INTO patterns VALUES ('Sun');
INSERT INTO patterns VALUES ('Rain');
INSERT INTO patterns VALUES ('Cloud');
INSERT INTO patterns VALUES ('Beautiful');
INSERT INTO data VALUES ('The sun is shining');
INSERT INTO data VALUES ('It is a beautiful day');
INSERT INTO data VALUES ('It is cloudy and the sun shines');
COMMIT;

Then run a select query to get your desired output:

select pattern, group_concat(sentence) as doesmatch from (
    select pattern, sentence, lower(pattern) as lpattern, lower(sentence) as lsentence
    from patterns left outer join data
    where like('%' || lpattern || '%', lsentence)
) group by pattern;

If you save the first snippet as data.sql and the second one as query.sql you use this on the command line:

sqlite3 sentences.db < data.sql    # this imports your data, run once
sqlite3 sentences.db < query.sql

This gives you:

Beautiful|It is a beautiful day
Cloud|It is cloudy and the sun shines
Sun|The sun is shining,It is cloudy and the sun shines

which I believe is what you want. To make it fancier, use your favourite higher-level language with a database library; I would choose Python for this.

Suggestions for further improvement:

  • use regex instead of like to filter whole words (i.e. pattern "sun" matches "sun" but not "sunny"),

  • import utility,

  • output formatting,

  • query optimization.
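On the first suggestion: note that stock SQLite ships without a REGEXP implementation, so a crude whole-word approximation (spaces only, no punctuation handling) is to pad both sides with spaces. A sketch against an in-memory database, with made-up demo rows:

```shell
# whole-word matching without REGEXP: pad sentence and pattern with spaces
# so 'sun' matches the word "sun" but not "sunny"
sqlite3 :memory: <<'EOF'
CREATE TABLE patterns (pattern TEXT);
CREATE TABLE data (sentence TEXT);
INSERT INTO patterns VALUES ('Sun');
INSERT INTO data VALUES ('The sun is shining');
INSERT INTO data VALUES ('It is sunny');
SELECT pattern, sentence
FROM patterns JOIN data
WHERE ' ' || lower(sentence) || ' ' LIKE '% ' || lower(pattern) || ' %';
EOF
# prints:
#   Sun|The sun is shining
```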


Your solution may actually be slow because it creates 50,000 processes that all read the 500-line pattern_file.

Another "pure bash & unix utils" solution could be to let grep do what it does best and just match the output against your pattern_file.

So use grep to find matching lines and the parts that actually do match.

I use word matching here; remove the -w switch from the grep line to get the original behavior described in your example.

The output is not yet redirected to result_file.csv, which is easy to add later 8)

#!/bin/bash
# open pattern_file on file descriptor 3
exec 3< pattern_file
# declare and initialize integer variables
declare -i linenr
declare -i pnr=0
# loop for reading from the grep process
#
# grep process creates following output:
#   <linenumber>:<match>
# where linenumber is the number of the matching line in pattern_file
# and   match is the actual matching word (grep -w) as found in lookup_file
# grep output is piped through sed to actually get
#   <linenumber> <match>
while read linenr match ; do
   # skip lines from pattern_file until we reach the line
   # that contained the match (arithmetic, not string, comparison)
   while (( linenr > pnr )) ; do
       read -u 3 pline
       pnr+=1
   done
   # echo match and line from pattern_file
   echo "$match, $pline"
done < <( grep -i -w -o -n -f lookup_file pattern_file | sed -e 's,:, ,' )
# close pattern_file
exec 3<&-

result is

sun, The sun is shining
shining, The sun is shining
beautiful, It is a beautiful day!

for the example given. Attention: the match is now the exact match with its case preserved, so this does not result in Sun, ... but in sun, ....

The result is a script that reads pattern_file once, using a grep that in the best case reads pattern_file and lookup_file once each, depending on the actual implementation. It only starts two additional processes: grep and sed. (If needed, sed can be replaced by some bash substitution within the outer loop.)

I did not try it with a 50,000-line lookup_file and a 500-line pattern_file though. But I think it may be as fast as grep can be.

As long as grep can keep the lookup_file in memory it may be reasonably fast. (Who knows.)

Whether or not it solves your problem, I would be interested in how it performs compared to your initial script, since I lack nice test files.

If grep -f lookup_file uses too much memory (as you mentioned in a comment before), a solution may be to split it into portions that actually do fit into memory and run the script more than once, or to use more than one machine: run all parts on those machines and just collect and concatenate the results. As long as the lookup_files do not contain dupes, you can concatenate the results without checking for dupes. If sorting matters, you can sort all the single results and then merge them quite fast using sort -m.
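The split-and-merge idea can be sketched like this (chunk size and file names are made up; the demo inputs are tiny stand-ins for the real 50,000-line file):

```shell
# demo inputs (the real lookup_file would be 50,000 lines)
printf '%s\n' 'Sun' 'Beautiful' > lookup_file
printf '%s\n' 'The sun is shining' 'It is a beautiful day' > pattern_file

# split the lookup_file into chunks that fit into memory
split -l 1 lookup_file lookup_part_

# run grep once per chunk, sorting each partial result
for part in lookup_part_* ; do
    grep -i -o -n -F -f "$part" pattern_file | sort > "result_$part"
done

# merge the already-sorted partial results in one pass
sort -m result_lookup_part_* > result_all
```

With a real workload you would pick a chunk size (split -l) that keeps each grep run inside available memory.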

Splitting up the lookup_file should not affect runtimes dramatically as long as you split it only once and rerun the script, since your pattern_file, at 500 lines, may be small enough to stay in the memory cache anyway. The same may be true for the lookup_file if you use more than one machine: its parts may just stay in memory on every machine.

EDIT:

As pointed out in my comment this will not work for overlapping matches out of the box, since grep -f seems to return only the longest match and will not rematch. So if lookup_file contains

Sun
Shining
is
S

the result will be

sun, The sun is shining
is, The sun is shining
shining, The sun is shining

and not

sun, The sun is shining
is, The sun is shining
shining, The sun is shining
s, The sun is shining
s, The sun is shining
s, The sun is shining

So all the matches for s (it occurs three times) are missing.

In fact this is another issue with this solution: if a string is found twice, it will be matched twice and identical lines will be returned, which can be removed with uniq.

Possible workaround: split the lookup_file by the string length of the search strings. This will decrease the maximum memory needed for a run of grep, but also slow down the whole thing a little bit. But: you can then search in parallel (and may want to check grep's --mmap option if doing that on the same server).
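Splitting by pattern length is a one-liner with awk; a sketch with made-up file names:

```shell
printf '%s\n' 'Sun' 'Shining' 'is' 'S' > lookup_file   # demo input

# one output file per pattern length:
# lookup_len_1, lookup_len_2, lookup_len_3, lookup_len_7
awk '{ print > ("lookup_len_" length($0)) }' lookup_file
```

Each lookup_len_* file can then be fed to its own grep -f run, sequentially or in parallel.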